Scientific Report No. ISR-11
Information Storage and Retrieval
Information Analysis and Dictionary Construction (chapter)
G. Salton, M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

system, any content analysis system will have to include methods for consistent language normalization. One of the most effective ways of providing such a normalization is by means of suitably constructed dictionaries. The following types of dictionaries appear to be of interest in this connection:

1) a negative dictionary containing terms whose use is proscribed for content analysis purposes;

2) a thesaurus, or synonym dictionary, which specifies for each dictionary entry one or more synonym categories, or concept classes; ambiguous entries are then replaced by many concepts, and many different words (synonyms) may map into the same concept category; a thesaurus is thus used to perform a many-to-many mapping from word entries to concept classes;

3) a phrase dictionary, which may be used to specify the most frequently used word or concept combinations (called phrases); such a phrase dictionary can often increase the effectiveness of a content analysis by assigning for content identification a relatively unambiguous phrase, instead of two or more ambiguous components (for example, the terms "program" and "language" are more ambiguous, standing alone, than the phrase "programming language");

4) a hierarchical (tree-like) arrangement of terms or concepts, similar to a standard library classification schedule, which makes it possible, given a certain dictionary entry, to find more general concepts by going up in the hierarchy, or more specific ones by going down (for example, from a concept such as "syntax", one can obtain the more general "language", or the more specific "punctuation").
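The four dictionary types listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the report's implementation: the dictionary contents, the concept names (such as C_SYNTAX), and the function names analyze, broader, and narrower are all hypothetical, chosen to mirror the "programming language" and "syntax" examples in the text.

```python
import re

# All entries below are hypothetical illustrations, not the report's actual dictionaries.

# 1) Negative dictionary: words excluded from content analysis.
NEGATIVE_DICTIONARY = {"a", "an", "the", "of", "is", "in", "and"}

# 2) Thesaurus: many-to-many mapping from words to concept classes.
#    An ambiguous word maps to several concepts; synonyms share a concept.
THESAURUS = {
    "program": {"C_SOFTWARE", "C_SCHEDULE"},
    "language": {"C_NATURAL_LANGUAGE", "C_FORMAL_LANGUAGE"},
    "syntax": {"C_SYNTAX"},
    "grammar": {"C_SYNTAX"},
}

# 3) Phrase dictionary: a word pair mapped to one relatively unambiguous concept.
PHRASE_DICTIONARY = {("programming", "language"): "C_PROGRAMMING_LANGUAGE"}

# 4) Hierarchy: each term points to its more general parent term.
PARENT = {"punctuation": "syntax", "syntax": "language"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def analyze(text):
    """Return the set of concepts assigned to a text."""
    # Drop proscribed words first, then scan left to right.
    words = [w for w in tokenize(text) if w not in NEGATIVE_DICTIONARY]
    concepts = set()
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASE_DICTIONARY:
            # A phrase match takes precedence over its ambiguous components.
            concepts.add(PHRASE_DICTIONARY[pair])
            i += 2
        else:
            # Words missing from the thesaurus contribute no concepts.
            concepts |= THESAURUS.get(words[i], set())
            i += 1
    return concepts


def broader(term):
    """Go up one level in the hierarchy (None at the root or if unknown)."""
    return PARENT.get(term)


def narrower(term):
    """Go down one level: all terms whose parent is the given term."""
    return sorted(child for child, parent in PARENT.items() if parent == term)
```

With these toy entries, analyzing "The syntax of a programming language" yields one concept for "syntax" and one for the phrase, rather than the two ambiguous concepts that "language" alone would produce. Note that, in this sketch, phrases are matched after negative-dictionary words are removed, so words separated only by a proscribed word are treated as adjacent; that is a simplifying assumption.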
Dictionaries do not, of course, completely eliminate language ambiguities, but they can serve to reduce the effects of many irregularities by using appropriate dictionary mapping algorithms. For example, a correspondence between a word and a single concept may receive a higher weight than one between