IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
chapter
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The mechanism of precision improvement, as previously stated, is
to reinforce the apparent weight of significant terms by adding their
associated terms. This process works because the significant terms generally
have more associated pairs than the non-significant terms. It may be
asked why this should be so. This feature of the associations derives from
the greater concentration of the significant terms in the abstracts.
The non-significant terms are generally widely spread among the abstracts,
so that it is difficult for any term to match their occurrences. The
significant terms are clustered in a few abstracts, and another term can
match them easily: when two terms each occur several times in every
document in which they appear, only one or two co-occurrences are needed
to produce a correlation above cutoff. That is, if two terms each occur once
in each of ten documents, they must occur in six common documents to
correlate at a 0.6 level; but if they each occur three times in each of three
documents, they need only co-occur in two documents to correlate at a level
of 0.67. The
tendency of significant terms to bunch up is shown in Table 8, which shows
the distribution of occurrences of the ten words occurring fifteen times.
It is seen that no non-significant term occurs in fewer than twelve
documents; none occurs more than three times in a document; and their
average number of occurrences per document is only 1.1. The significant
terms never occur in more than ten documents; every one appears at least
three times in some document; and they average 1.8 occurrences per document.
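The arithmetic in the example above can be checked with a short sketch. It assumes that the correlation is the cosine coefficient between term-occurrence vectors; the passage does not name the measure, but the cosine reproduces both quoted figures. The function name and the document layouts below are illustrative only and are not taken from the report.

```python
from math import sqrt


def cosine_correlation(a, b):
    """Cosine correlation between two term-occurrence vectors.

    a[i] and b[i] are the number of times each term occurs in document i.
    (Assumed measure; the passage itself does not name the formula.)
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Widely spread terms: each occurs once in each of ten documents,
# and the two terms share six documents (hypothetical layout, 14 documents).
spread_a = [1] * 10 + [0] * 4
spread_b = [0] * 4 + [1] * 10

# Clustered terms: each occurs three times in each of three documents,
# and the two terms share two documents (hypothetical layout, 4 documents).
clustered_a = [3, 3, 3, 0]
clustered_b = [0, 3, 3, 3]

print(round(cosine_correlation(spread_a, spread_b), 2))        # 0.6
print(round(cosine_correlation(clustered_a, clustered_b), 2))  # 0.67
```

Under this assumption the clustered pair exceeds a 0.6 cutoff with only two shared documents, whereas the spread pair needs six shared documents merely to reach it.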
An example of the effect of this is shown by query Q116. This
query contains twelve words in the word stem matching system, of which
the key word is "dissociated". "Dissociated" was outweighed in the
search by such high-frequency words as "wind", "high", "pressure", etc.