ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Design Criteria for Automatic Information Systems
M. E. Lesk
G. Salton
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
a) very rare terms which occur in a representative sample document
collection with insufficient frequency should not be included in
the synonym dictionary, since such terms will not provide many
matches between the stored items and the search requests;
b) very common high-frequency terms should either be eliminated,
since they provide little discrimination, or should be placed
into synonym classes of their own, so that they cannot submerge
other terms which would be grouped with them;
c) terms which have no special significance in a given technical
subject area (such as "begin", "indicate", "system", "automatic",
etc.) should not be included;
d) ambiguous terms, such as for example "base", should be coded only
for those senses which are likely to occur in the subject area
being considered;
e) each group of synonymous terms should account for approximately
the same total frequency of occurrence of the corresponding
words in the document collection; this ensures that each identifier
has an approximately equal chance of being assigned to a given item.
These principles can be embodied in automatic programs for the construction
of synonym dictionaries, using word frequency lists and concordances derived
from a representative sample document collection. [13]
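
As an illustration only, the frequency-based principles above might be sketched as follows. This is a minimal sketch under stated assumptions, not the programs referred to in [13]: the function names, the thresholds `min_freq` and `max_freq`, the list of nonsignificant terms, and the toy word list are all hypothetical, and the synonym classes themselves are assumed to be supplied by hand. Principle (d), the coding of ambiguous terms, is not covered.

```python
from collections import Counter


def screen_terms(tokens, min_freq, max_freq, nonsignificant=frozenset()):
    """Screen a vocabulary using a word frequency list.

    Returns (keep, common): terms eligible for ordinary synonym classes,
    and very common terms set aside for classes of their own.
    The thresholds are hypothetical tuning parameters.
    """
    freq = Counter(tokens)
    keep, common = {}, {}
    for term, count in freq.items():
        if term in nonsignificant:
            continue  # principle (c): no special significance in the subject area
        if count < min_freq:
            continue  # principle (a): too rare to produce many matches
        if count > max_freq:
            common[term] = count  # principle (b): isolate high-frequency terms
        else:
            keep[term] = count
    return keep, common


def class_frequencies(classes, freq):
    """Total frequency of occurrence per synonym class, for checking principle (e)."""
    return {name: sum(freq.get(term, 0) for term in terms)
            for name, terms in classes.items()}


# Toy word list standing in for a sample document collection (invented data).
tokens = (["system"] * 9 + ["flow"] * 12 + ["wing"] * 4 + ["airfoil"] * 3
          + ["drag"] * 4 + ["resistance"] * 3 + ["flutter"] * 1)

keep, common = screen_terms(tokens, min_freq=2, max_freq=10,
                            nonsignificant={"system"})
print(sorted(keep))   # ['airfoil', 'drag', 'resistance', 'wing']
print(common)         # {'flow': 12}

# Hand-assigned synonym classes (hypothetical); the totals should be roughly equal.
classes = {"WING": ["wing", "airfoil"], "DRAG": ["drag", "resistance"]}
print(class_frequencies(classes, keep))   # {'WING': 7, 'DRAG': 7}
```

In this toy case the two classes have equal total frequency, as principle (e) asks; in practice a class whose total is much larger than the others would be a candidate for splitting.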
The experience gained with the various thesauruses constructed for
the SMART system leads to Rule 3:
Rule 3 : Dictionaries providing synonym recognition
are of considerable help in improving retrieval
performance, particularly when they reflect the
properties of the vocabulary under consideration.
C) Phrase Processing
The SMART system makes provision for the recognition of "phrases"