ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Design Criteria for Automatic Information Systems
M. E. Lesk
G. Salton
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Rule 2: The use of information identifiers which are
weighted in accordance with their presumed
importance leads to large-scale improvements
in retrieval effectiveness, compared with the
use of unweighted terms.
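The distinction drawn in Rule 2 can be illustrated with a minimal sketch. It assumes that queries and documents are represented as term vectors and compared by a cosine correlation; the function names, the vocabulary, and the weights below are hypothetical and chosen only for illustration. The sketch does not reproduce the reported improvements. It shows only that weights allow the matching function to separate two documents which contain the same terms with different emphasis.

```python
import math

def cosine(a, b):
    """Cosine correlation between two sparse term vectors (dicts term -> weight)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def binary(vec):
    """Drop the weights: every term that occurs gets weight 1."""
    return {t: 1 for t, w in vec.items() if w > 0}

# Hypothetical weights (for example, term frequencies); not data from the report.
query = {"retrieval": 3, "system": 1}
doc_a = {"retrieval": 4, "system": 1, "index": 1}   # emphasizes "retrieval"
doc_b = {"retrieval": 1, "system": 4, "index": 1}   # emphasizes "system"

# Unweighted terms: both documents contain the same terms, so they tie.
unweighted = (cosine(binary(query), binary(doc_a)),
              cosine(binary(query), binary(doc_b)))

# Weighted terms: the document sharing the query's emphasis ranks higher.
weighted = (cosine(query, doc_a), cosine(query, doc_b))
```

With unweighted terms the two documents receive identical correlations with the query; with weighted terms `doc_a` receives the higher correlation.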
B) Synonym Recognition
One of the perennial problems in automatic language analysis is the
question of language variability among authors, and the linguistic
ambiguities which result. A large number of experiments have therefore
been performed using a variety of synonym dictionaries for each of the
three subject fields under study ("Harris 2" and "Harris 3" dictionaries
for the computer literature, [OCRerr] or [OCRerr] lists for aeronautical
engineering, and regular thesaurus for documentation). An excerpt of such
a synonym dictionary for the computer literature is shown in Fig. 7 for
the concept class numbers [OCRerr]8 to [OCRerr]16. Use of such a synonym dictionary
permits the replacement of a variety of related terms by the corresponding
concept classes, thus ensuring the retrieval of documents dealing with the
"manufacture of transistor diodes" when the query deals with the "production
of solid state rectifiers."
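The replacement of terms by concept classes can be sketched as follows. The dictionary entries and concept class numbers are hypothetical and are not taken from Fig. 7, and the two-word lookup is an assumption made only so that a phrase such as "solid state" can be treated as a single entry. This is a minimal sketch and not the procedure actually used in the experiments.

```python
# Hypothetical synonym dictionary: each entry maps to a concept class number.
# Entries and numbers are invented for illustration.
THESAURUS = {
    "manufacture": 101,
    "production": 101,
    "transistor": 102,
    "solid state": 102,
    "diodes": 103,
    "rectifiers": 103,
}

def to_concepts(text):
    """Replace the words of a text by the set of concept classes they fall into.

    Two-word entries are tried before single words; words that are not in
    the dictionary (such as "of") are ignored.
    """
    words = text.lower().split()
    concepts = set()
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in THESAURUS:
            concepts.add(THESAURUS[pair])
            i += 2
        elif words[i] in THESAURUS:
            concepts.add(THESAURUS[words[i]])
            i += 1
        else:
            i += 1
    return concepts

document = "manufacture of transistor diodes"
query = "production of solid state rectifiers"

# Word-level overlap between query and document (only the function word "of").
shared_words = set(document.split()) & set(query.split())

# Concept-level overlap after dictionary lookup.
shared_concepts = to_concepts(document) & to_concepts(query)
```

At the word level the query and the document share only the function word "of"; after lookup in the dictionary both reduce to the same three concept classes, so the document is retrieved.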
The output of Fig. 8 shows that considerable improvements in performance
are obtainable by means of suitably constructed synonym dictionaries.
The improvement is smallest for the Cranfield collection because the
dictionary available for this collection was not originally constructed
for retrieval purposes. This observation suggests that not all dictionaries
are equally useful. Experiments conducted with the SMART system lead to
the following principles of dictionary construction [13]: