Scientific Report No. ISR-11, Information Storage and Retrieval
Information Analysis and Dictionary Construction (chapter)
G. Salton, M. E. Lesk
Harvard University, Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

On the negative side, dictionaries are often difficult to construct, particularly if the environment within which they are expected to operate is subject to change; furthermore, most dictionaries are useless unless their mode of usage is consistent for all operations. Obviously, if a dictionary is used in one way for information classification and in another for information searching, an effective result cannot be guaranteed. Various thesaurus types are examined in more detail in the next few paragraphs.

3. Dictionary Construction

A) The Synonym Dictionary (Thesaurus)

As previously explained, a thesaurus is a grouping of words, or word stems, into certain subject categories, hereafter called concept classes. A typical example is shown in Fig. 1, where the concept classes are represented by three-digit numbers, and the individual entries are shown under each concept number. In Fig. 2, a similar thesaurus arrangement is shown in alphabetical order of the words included. The concept numbers appear in the middle column of Fig. 2 (concept numbers over 32,000 are attached to "common" words which are not accepted as information identifiers); the last column consists of one or more three-digit syntax codes attached to the words, to be used for purposes of syntactic analysis.

When constructing a thesaurus to be used for vocabulary normalization, one immediately faces three types of problems: first, what words should one include in the thesaurus; secondly, what type of synonym categories should one use (that is, should one aim for broad, inclusive concept classes, or should the classes be narrow and specific); finally, where