IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
chapter
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
these parameters. Most of the improvement obtained by restricting the
frequency of the words processed comes from removing the associations
involving words of frequency 1 and 2.
A comparison of the two correlation algorithms (cosine and overlap)
is shown in Fig. 8. These curves also cross several times, and neither
correlation coefficient can be called superior. The cutoffs used in the
two methods are chosen to roughly equalize the number of associated pairs.
The cosine algorithm was designed primarily to handle the request-
document correlation problem, in which the vectors are of widely different
length. This is less often the case in the present problem, since the
extremely rare and the extremely frequent concepts are omitted, so it is not
surprising that the two algorithms perform similarly. Since neither correlation
coefficient shows a distinct advantage, the cosine correlation is used in
all other retrieval runs described in this section.
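The two coefficients can be illustrated with a minimal sketch. The formulas are not given in this section, so the definitions below are assumptions: the cosine is the usual normalized inner product, and the overlap is taken as the shared weight divided by the smaller of the two total weights. The vectors `word_a` and `word_b` are hypothetical occurrence counts, not data from the experiments.

```python
import math

def cosine(v, w):
    """Cosine correlation: inner product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

def overlap(v, w):
    """Overlap correlation (assumed form): shared weight over the smaller total weight."""
    shared = sum(min(a, b) for a, b in zip(v, w))
    smaller = min(sum(v), sum(w))
    return shared / smaller if smaller else 0.0

# Hypothetical occurrence vectors of two words over six documents.
word_a = [1, 0, 2, 0, 1, 0]
word_b = [1, 0, 1, 0, 0, 0]

print(round(cosine(word_a, word_b), 3))
print(round(overlap(word_a, word_b), 3))
```

Under these assumed definitions the overlap reaches its maximum whenever the shorter vector is wholly contained in the longer one, while the cosine is reduced by the difference in length. When the vectors are of comparable length the two measures differ less.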
The effect of varying the cutoff used in the association process
is shown in Table 10 and Fig. 8. Again, the curves cross, with the lowest
cutoff being superior at high recall and the highest cutoff being superior
at high precision. As a high cutoff produces the fewest but most reliable
associated pairs, it is expected to be preferable for precision purposes,
whereas a low cutoff produces the largest number of significant pairs and
therefore has an advantage if maximum recall is demanded. The cutoff of 0.9,
however, is so high that such an association process is almost indistinguishable
from the word stem run, and the cutoff of 0.3 is so low as to introduce large
numbers of non-significant pairs. The useful range of cutoffs therefore seems
to be roughly 0.45-0.75 for the cosine correlation.
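The effect of the cutoff on the number of associated pairs can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the runs: the word vectors are hypothetical, the helper `associated_pairs` is an invented name, and a pair is assumed to be accepted when its cosine correlation is greater than or equal to the cutoff.

```python
import math
from itertools import combinations

def cosine(v, w):
    """Cosine correlation between two equal-length occurrence vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

def associated_pairs(vectors, cutoff):
    """Return the word pairs whose cosine correlation is at least `cutoff` (assumed rule)."""
    return [
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= cutoff
    ]

# Hypothetical word-by-document occurrence vectors (five documents).
vectors = {
    "boundary": [1, 1, 0, 1, 0],
    "layer":    [1, 1, 0, 1, 0],
    "flow":     [1, 1, 1, 0, 0],
    "wing":     [0, 1, 1, 0, 1],
}

for cutoff in (0.3, 0.6, 0.9):
    print(cutoff, len(associated_pairs(vectors, cutoff)))
```

In this toy collection the lowest cutoff accepts every pair, including weakly related ones, while the highest cutoff keeps only the pair of words with identical occurrence patterns.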
Table 11 shows the effects of varying the relative weight of the
associations (a weighting of 1 renders a word introduced into a document