Scientific Report No. IRS-13
Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

The mechanism of precision improvement, as previously stated, is to reinforce the apparent weight of significant terms by adding their associated terms. This process works because the significant terms generally have more associated pairs than the non-significant terms. It may be asked why this should be so.

This feature of the associations derives from the greater concentration of the significant terms in the abstracts. The non-significant terms are generally widely spread among the abstracts, so that it is difficult for any other term to match their occurrences. The significant terms are clustered in a few abstracts, and another term can match them easily: when terms occur several times in each document in which they appear, only one or two co-occurrences are needed to produce a correlation above cutoff. That is, if two terms each occur once in each of ten documents, they must occur in six common documents to correlate at a level of 0.6; but if they each occur three times in each of three documents, they need only co-occur in two documents to correlate at a level of 0.67.

The tendency of significant terms to bunch up is shown in Table 8, which gives the distribution of occurrences of the ten words that each occur fifteen times. No non-significant term occurs in fewer than twelve documents; none occurs more than three times in a document; and their average number of occurrences per document is only 1.1. The significant terms never occur in more than ten documents; every one appears at least three times in some document; and they average 1.8 occurrences per document.

An example of the effect of this is shown by query Q116.
This query contains twelve words in the word stem matching system, of which the key word is "dissociated". "Dissociated" was outweighed in the search by such high-frequency words as "wind", "high", "pressure", etc.
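
The correlation figures quoted earlier for spread-out and bunched terms (0.6 and .67) can be checked with a short sketch. It assumes that the correlation between two terms is the cosine of their document-occurrence vectors. That formula is not stated in this passage, but it reproduces both figures. The function name and the example vectors are illustrative only and are not taken from the report.

```python
import math


def cosine(u, v):
    """Cosine correlation between two term-occurrence vectors.

    Each vector holds one count per document: how many times the term
    occurs in that document. (Assumed measure; see the note above.)
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Spread-out terms: each occurs once in each of ten documents,
# and the two terms share six of those documents.
spread_a = [1] * 10 + [0] * 4
spread_b = [1] * 6 + [0] * 4 + [1] * 4

# Bunched terms: each occurs three times in each of three documents,
# and the two terms share only two of those documents.
bunched_a = [3, 3, 3, 0]
bunched_b = [3, 3, 0, 3]

print(f"{cosine(spread_a, spread_b):.2f}")    # 0.60
print(f"{cosine(bunched_a, bunched_b):.2f}")  # 0.67
```

Under this assumption, two shared documents are enough to carry the bunched terms past a 0.6 cutoff, while the spread-out terms need six.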