Scientific Report No. IRS-13: Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems (chapter), M. E. Lesk
Harvard University, Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

these parameters. Most of the improvement obtained by restricting the frequency of the words processed comes from removing the associations involving words of frequency 1 and 2.

A comparison of the two correlation algorithms (cosine and overlap) is shown in Fig. 8. These curves also cross several times, and neither correlation coefficient can be called superior. The cutoffs used in the two methods are chosen to roughly equalize the number of associated pairs. The cosine algorithm was designed primarily to handle the request-document correlation problem, in which the vectors are of widely different length. That is less often the case in the present problem, since the extremely rare and the extremely frequent concepts are omitted, so it is not surprising that the algorithms perform similarly. Since neither correlation coefficient shows a distinct advantage, the cosine correlation is used in all other retrieval runs described in this section.

The effect of varying the cutoff used in the association process is shown in Table 10 and Fig. 8. Again, the curves cross, with the lowest cutoff being superior at high recall and the highest cutoff being superior at high precision. As a high cutoff produces the fewest but most reliable associated pairs, it is expected to be preferable for precision purposes, whereas a low cutoff produces the largest number of significant pairs and therefore has an advantage if maximum recall is demanded. The cutoff of 0.9, however, is so high that such an association process is almost indistinguishable from the word stem run; and the cutoff of 0.3 is so low as to introduce large numbers of non-significant pairs.
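The role of the two coefficients and of the cutoff can be illustrated with a minimal sketch. It rests on assumptions not stated in this section: each term is represented by a vector of its occurrence counts across documents, and the overlap coefficient is taken to be the sum of element-wise minima divided by the smaller of the two vector sums. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def cosine(v, w):
    """Cosine correlation between two term vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0


def overlap(v, w):
    """Overlap correlation (assumed form): shared weight over the smaller total."""
    shared = sum(min(a, b) for a, b in zip(v, w))
    smaller = min(sum(v), sum(w))
    return shared / smaller if smaller else 0.0


def associated_pairs(term_vectors, cutoff, corr=cosine):
    """Return (term, term, score) for every pair whose correlation meets the cutoff."""
    pairs = []
    for (t1, v1), (t2, v2) in combinations(sorted(term_vectors.items()), 2):
        score = corr(v1, v2)
        if score >= cutoff:
            pairs.append((t1, t2, score))
    return pairs


# Hypothetical term-by-document counts (four documents), for illustration only.
toy = {
    "retrieval": [2, 1, 0, 1],
    "search": [2, 1, 0, 0],
    "storage": [0, 0, 3, 1],
}

for cutoff in (0.1, 0.45, 0.9):
    print(cutoff, [(a, b) for a, b, _ in associated_pairs(toy, cutoff)])
```

On this toy data, lowering the cutoff admits more pairs, some of them weakly related, while raising it keeps only the strongest association, which mirrors the recall and precision trade-off described in the text.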
The useful range of cutoffs for the cosine correlation therefore seems to be roughly 0.45 to 0.75. Table 11 shows the effects of varying the relative weight of the associations (a weighting of 1 renders a word introduced into a document