ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
system, any content analysis system will have to include methods for
consistent language normalization. One of the most effective ways of
providing such a normalization is by means of suitably constructed
dictionaries. The following types of dictionaries appear to be of interest
in this connection:
1) a negative dictionary containing terms whose use is proscribed
for content analysis purposes;
2) a thesaurus, or synonym dictionary, which specifies for each
dictionary entry, one or more synonym categories, or concept
classes; ambiguous entries are then replaced by many concepts and
many different words (synonyms) may map into the same concept
category; a thesaurus is then used to perform a many-to-many
mapping from word entries to concept classes;
3) a phrase dictionary may be used to specify the most frequently
used word or concept combinations (called phrases); such a phrase
dictionary can often increase the effectiveness of a content analysis
by assigning for content identification a relatively unambiguous
phrase, instead of two or more ambiguous components (for example,
the terms "program" and "language" are more ambiguous, standing
alone, than the phrase "programming language");
4) a hierarchical (tree-like) arrangement of terms or concepts,
similar to a standard library classification schedule, which makes
it possible, given a certain dictionary entry, to find more general
concepts by going up in the hierarchy, or more specific ones by
going down (for example, from a concept such as "syntax", one can
obtain the more general "language", or the more specific "punctuation").
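
The four dictionary types can be illustrated with a minimal sketch. It is not the procedure used in the report; every name, sample entry, and concept label below is hypothetical and chosen only to mirror the examples in the list above. Words absent from the thesaurus are simply dropped, which is an assumption of this sketch.

```python
# Illustrative sketch only: all names and sample entries are hypothetical.

# 1) Negative dictionary: terms proscribed for content analysis.
NEGATIVE = {"the", "of", "a", "is", "in"}

# 2) Thesaurus: each word maps to one or more concept classes, and
#    several words may share a class (a many-to-many mapping).
THESAURUS = {
    "program": {"SOFTWARE", "SCHEDULE"},  # ambiguous entry: two concepts
    "routine": {"SOFTWARE"},              # synonym of one sense of "program"
    "language": {"LANGUAGE"},
    "syntax": {"SYNTAX"},
}

# 3) Phrase dictionary: a word combination maps to one unambiguous concept.
PHRASES = {("programming", "language"): "PROGRAMMING_LANGUAGE"}

# 4) Hierarchy: each concept points to its more general parent.
PARENT = {"SYNTAX": "LANGUAGE", "PUNCTUATION": "SYNTAX"}


def analyze(text):
    """Map a text to a list of concept identifiers."""
    # Drop proscribed terms first.
    words = [w for w in text.lower().split() if w not in NEGATIVE]
    concepts = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            # Prefer the phrase over its more ambiguous components.
            concepts.append(PHRASES[pair])
            i += 2
            continue
        # Otherwise emit every concept class listed for the word.
        concepts.extend(sorted(THESAURUS.get(words[i], ())))
        i += 1
    return concepts


def broader(concept):
    """Go up one level in the hierarchy (None at the root)."""
    return PARENT.get(concept)


def narrower(concept):
    """Go down one level in the hierarchy."""
    return sorted(c for c, p in PARENT.items() if p == concept)


if __name__ == "__main__":
    print(analyze("the syntax of a programming language"))
    print(analyze("program"))
    print(broader("SYNTAX"), narrower("SYNTAX"))
```

In this sketch the phrase lookup runs before the thesaurus lookup, so that "programming language" yields a single concept instead of the concepts of its components, while the isolated word "program" still yields both of its concept classes.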
Dictionaries do not, of course, completely eliminate language ambiguities,
but they can serve to reduce the effects of many irregularities by using
appropriate dictionary mapping algorithms. For example, a correspondence
between a word and a single concept may receive a higher weight than one between