ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
well be eliminated in favor of a term such as "computer-control",
since the former are clearly ambiguous in many contexts whereas
the latter is much more specific);
3) non-significant words should be studied carefully before any
are included in the list of words to be eliminated (for example,
a term such as "hand" should be included in a thesaurus dealing
with biology, but it should not be included if its high frequency
count is due to expressions such as "on the other hand");
4) ambiguous terms should be coded only for those senses which are
likely to be present in the document collections to be treated
(for example, at least two category numbers must be shown for
the term "field", corresponding on the one hand to the notion
of subject area, and on the other hand to its technical sense
in algebra; however, no category number need be shown to cover
the notion of "a patch of land" if the dictionary deals with the
mathematical sciences or related technical fields);
5) each concept class should only include terms of roughly equal
frequency so that the matching characteristics are approximately
the same for each term within a category.
Consider as an example some of the synonym dictionaries constructed
for use with the SMART retrieval system. In that system it was found
useful to operate with a reasonably large number of concept classes (of
the order of 700 for a given restricted subject field), and to use also
a large list of non-significant words to be excluded from the content
indications. This list includes in particular verbs such as "begin",
"contain", "indicate", "call", "designate", etc., which could not be
depended upon to provide safe content indication. It was also found
useful to isolate high frequency terms into separate categories so
that these terms would not impair the retrieval effectiveness of other
more specific terms.
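
The two practices just described can be sketched as follows. This is an illustration under stated assumptions, not the SMART implementation. The stop list holds only the five verbs quoted above, the function names and the frequency threshold are hypothetical, and "separate categories" is read here as giving each high frequency term a new class of its own.

```python
# The five verbs quoted in the text; the actual list is much larger.
STOP_WORDS = {"begin", "contain", "indicate", "call", "designate"}


def content_terms(tokens, stop_words=STOP_WORDS):
    """Drop non-significant words. Matching is on exact lowercase tokens,
    so inflected forms such as "contains" are not removed by this sketch."""
    return [t.lower() for t in tokens if t.lower() not in stop_words]


def isolate_high_frequency(classes, frequency, threshold):
    """Move every term whose count is at least threshold into a new class
    of its own; the remaining terms keep their original class numbers."""
    result = {}
    next_id = max(classes, default=0) + 1  # new numbers start past existing ones
    for class_number in sorted(classes):
        kept = []
        for term in classes[class_number]:
            if frequency.get(term, 0) >= threshold:
                result[next_id] = [term]
                next_id += 1
            else:
                kept.append(term)
        if kept:
            result[class_number] = kept
    return result
```

For instance, with hypothetical classes `{1: ["computer", "compiler"], 2: ["loader"]}`, counts of 500, 40 and 30 respectively, and a threshold of 200, "computer" is moved to a new class 3 while "compiler" and "loader" stay where they were.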