Scientific Report No. ISR-11
Information Storage and Retrieval
Information Analysis and Dictionary Construction
Gerard Salton and M. E. Lesk
Harvard University

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

... where should each word appear in the thesaurus structure (that is, given a word, what are to be its assigned concept classes).

Consider first the words to be included. There is usually not much question about the fact that common function words (such as "and", "or", "but") should not appear in the synonym dictionary, since these words out of context provide no indication of subject matter. A significant problem does, however, arise in connection with very frequent words. These may be non-technical words in the general vocabulary such as "discuss" and "make"; or they may be technical words which, in their particular environment, are in effect reasonably common. For example, in a collection dealing with computer science, such words as "machine", "computer", or "automatic" are in effect common words with reasonably high frequency. If such frequent words are included in a synonym dictionary, most documents will exhibit occurrences of these words, and therefore significant matching coefficients may be obtained between documents and requests, even though the technical texts may be really quite dissimilar (except for the fact that they may deal with computers); if on the other hand these words are excluded, it then becomes possible that one or another document cannot be retrieved when in fact it is pertinent. Obviously some compromise must be made, as usual, between one's interest in retrieving everything even remotely useful (that is, between the necessity of obtaining high "recall"), and the need not to obtain too much extraneous material (the need for high "precision").

A similar problem arises in connection with very low frequency words.
If, for example, a term such as "Morse Code" is excluded from the dictionary, then the very few documents dealing with this type of code may not be retrievable. On the other hand, if "Morse Code" appears in a thesaurus category together with many other types of coding systems, then a request