ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
should each word appear in the thesaurus structure (that is, given a word,
what are to be its assigned concept classes).
Consider first the words to be included. There is usually not much
question about the fact that common function words (such as "and", "or",
"but") should not appear in the synonym dictionary, since these words out
of context provide no indication of subject matter. A significant problem
does, however, arise in connection with very frequent words. These may be
non-technical words in the general vocabulary such as "discuss" and "make";
or they may be technical words which, in their particular environment, are
in effect reasonably common. For example, in a collection dealing with
computer science, such words as "machine", "computer", or "automatic" are
in effect common words with reasonably high frequency. If such frequent
words are included in a synonym dictionary, most documents will exhibit
occurrences of these words, and therefore significant matching coefficients
may be obtained between documents and requests, even though the technical
texts may be really quite dissimilar (except for the fact that they may deal
with computers); if on the other hand these words are excluded, it then
becomes possible that one or another document cannot be retrieved when in
fact it is pertinent. Obviously some compromise must be made, as usual,
between one's interest in retrieving everything even remotely useful (that
is, the necessity of obtaining high "recall"), and the need not to
obtain too much extraneous material (the need for high "precision").
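
The effect of frequent words on the matching coefficient, and the two
measures just named, can be illustrated with a small sketch. It is not
taken from the report: the cosine coefficient is assumed here as one
possible matching coefficient, the list FREQUENT_WORDS and the sample
request and document are hypothetical, and recall and precision are
computed under their usual definitions (relevant items retrieved divided
by all relevant items, and by all retrieved items, respectively).

```python
from collections import Counter
from math import sqrt

# Hypothetical set of words treated as "very frequent" in a computer
# science collection (an assumption made for this illustration only).
FREQUENT_WORDS = {"machine", "computer", "automatic"}


def concept_vector(text, excluded=frozenset()):
    """Count word occurrences in text, skipping any word in `excluded`."""
    words = text.lower().split()
    return Counter(w for w in words if w not in excluded)


def cosine(a, b):
    """Cosine coefficient between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved and a set of relevant items."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# A request about translation and a document about hardware design share
# only the frequent words.
request = "automatic computer translation"
document = "computer machine design automatic computer"

with_frequent = cosine(concept_vector(request), concept_vector(document))
without_frequent = cosine(
    concept_vector(request, FREQUENT_WORDS),
    concept_vector(document, FREQUENT_WORDS),
)

print(f"coefficient with frequent words:    {with_frequent:.3f}")
print(f"coefficient without frequent words: {without_frequent:.3f}")
```

With the frequent words retained, the two dissimilar texts obtain a
substantial coefficient; with those words excluded, the coefficient drops
to zero. The same exclusion would, however, also remove the only matching
terms for a pertinent document described solely by such words, which is
the loss in recall noted above.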
A similar problem arises in connection with very low frequency words.
If, for example, a term such as "Morse Code" is excluded from the dictionary,
then the very few documents dealing with this type of code may not be
retrievable. On the other hand, if "Morse Code" appears in a thesaurus
category together with many other types of coding systems, then a request