Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
significance of these pairs; the thesaurus does not connect "Navier-
Stokes" with any of these terms. As a result of these associations, this
relevant document is promoted from rank position 143 to rank position 4.
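
The promotion in rank described above can be illustrated with a minimal sketch. It assumes that associated terms are added to the query vector at a reduced weight and that documents are ranked by cosine correlation. The terms, the association list, the weight of 0.5, and the two toy documents are all hypothetical and are not taken from the experiments reported here.

```python
from math import sqrt


def expand(vector, assoc, weight=0.5):
    """Add associated terms to a term-weight vector at reduced weight.

    `assoc` maps a term to the terms associated with it; `weight` is an
    assumed down-weighting factor, not a value from the report.
    """
    expanded = dict(vector)
    for term, w in vector.items():
        for other in assoc.get(term, ()):
            expanded[other] = expanded.get(other, 0.0) + weight * w
    return expanded


def cosine(u, v):
    """Cosine correlation between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query, docs):
    """Return document ids in order of decreasing correlation with the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)


# Hypothetical data: document "A" is relevant but shares no word with the query.
query = {"viscous": 1.0, "flow": 1.0}
docs = {
    "A": {"navier-stokes": 1.0, "equation": 1.0},
    "B": {"flow": 1.0, "chart": 1.0, "program": 1.0, "diagram": 1.0},
}
assoc = {"viscous": ["navier-stokes"], "flow": ["navier-stokes"]}

before = rank(query, docs)
after = rank(expand(query, assoc), docs)
```

In this toy case document A has zero correlation with the unexpanded query and ranks below B; once the associated term is added to the query, A ranks first.
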
The results of retrieval experiments can be used to determine
the best set of parameters for the association process. The conclusions
agree well with those deduced from the examination of the pairs in part 3.
It is noted there, for example, that words that are either very frequent
or very rare tend to have non-significant associations. The fraction of
meaningful correlations can also be increased by raising the cutoff. The
effect of this on retrieval is shown in Fig. 5, which compares recall-precision
curves for the stem dictionary used directly, without any associations added,
and for two different association strategies. When all
words, of whatever frequency, are used in the association process, the
resulting curve is usually inferior to the normal word matching run. But
when the frequencies of words employed in the association process are
restricted to the range 6-50, and the cutoff is raised, the resulting
recall-precision curve is everywhere superior to the stem curve.
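
The two parameters discussed above, the frequency range of the words admitted to the association process and the correlation cutoff, can be sketched as follows. The sketch assumes that word frequency means total occurrences in the collection and that the association strength is the cosine correlation between word occurrence vectors. The default cutoff of 0.6 and the function name are hypothetical; only the range 6-50 comes from the text.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def associations(docs, min_freq=6, max_freq=50, cutoff=0.6):
    """Return {(word_a, word_b): correlation} for the word pairs that survive.

    docs     -- list of documents, each a list of word stems
    min_freq -- lowest total collection frequency admitted (6 in the text)
    max_freq -- highest total collection frequency admitted (50 in the text)
    cutoff   -- minimum correlation kept (assumed value, not from the report)
    """
    # Build word -> {document index: occurrences in that document}.
    postings = {}
    for i, doc in enumerate(docs):
        for word, n in Counter(doc).items():
            postings.setdefault(word, {})[i] = n

    # Keep only words whose total frequency lies inside the permitted range.
    eligible = sorted(
        w for w, p in postings.items()
        if min_freq <= sum(p.values()) <= max_freq
    )
    norms = {w: sqrt(sum(n * n for n in postings[w].values())) for w in eligible}

    pairs = {}
    for a, b in combinations(eligible, 2):
        pa, pb = postings[a], postings[b]
        dot = sum(n * pb[d] for d, n in pa.items() if d in pb)
        if dot == 0:
            continue  # the two words never occur in the same document
        sim = dot / (norms[a] * norms[b])
        if sim >= cutoff:
            pairs[(a, b)] = sim
    return pairs


# Hypothetical collection: "the" is too frequent, "rare" too infrequent.
docs = (
    [["flow", "viscous", "the"]] * 6
    + [["the"]] * 54
    + [["rare", "the"]] * 3
)
restricted = associations(docs)                      # range 6-50, raised cutoff
unrestricted = associations(docs, 1, 10**9, 0.0)     # all words, no cutoff
```

With the restricted settings only the pair ("flow", "viscous") remains; with all frequencies admitted and no cutoff, pairs involving the very frequent and very rare words are also generated.
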
It is also noted in part 3 that words occurring only three or
four times have fewer significant occurrences than words of six or more
occurrences. The effect on retrieval of variations in the frequencies of
words used in the association process is shown in greater detail in
Table 9. For both recall and precision purposes, the optimum frequency
range appears to be 6-50, although the differences in performance are
small. Examination of the recall-precision curves of Figs. 6 and 7
shows the frequently crossing curves, and thus the insensitivity to