Scientific Report No. ISR-11
Information Storage and Retrieval
Design Criteria for Automatic Information Systems
M. E. Lesk, G. Salton
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

a) very rare terms, which occur with insufficient frequency in a representative sample document collection, should not be included in the synonym dictionary, since such terms will not provide many matches between the stored items and the search requests;

b) very common high-frequency terms should either be eliminated, since they provide little discrimination, or be placed into synonym classes of their own, so that they cannot submerge other terms which would be grouped with them;

c) terms which have no special significance in a given technical subject area (such as "begin", "indicate", "system", "automatic", etc.) should not be included;

d) ambiguous terms, such as "base", should be coded only for those senses which are likely to occur in the subject area being considered;

e) each group of synonymous terms should account for approximately the same total frequency of occurrence of the corresponding words in the document collection; this ensures that each identifier has an approximately equal chance of being assigned to a given item.

These principles can be embodied in automatic programs for the construction of synonym dictionaries, using word frequency lists and concordances derived from a representative sample document collection. [13]

The experience gained with the various thesauruses constructed for the SMART system leads to Rule 3:

Rule 3: Dictionaries providing synonym recognition are of considerable help in improving retrieval performance, particularly when they reflect the properties of the vocabulary under consideration.

C) Phrase Processing

The SMART system makes provision for the recognition of "phrases"