IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
chapter
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
"algorithm" and "computer" is a significant pair.
To determine the fraction of "significant" pairs on a local
basis, the list of pairs was rechecked for significance and each word
looked up in a concordance of the text to determine its local meaning.
The results are shown as a function of frequency in Table 4, and as a
function of cutoff in Table 5. Nearly three-quarters of the pairs are
now meaningful. The remaining pairs, which are not composed of related
words, are generally the result of stylistic quirks. For example, the word
"addition" was used only as part of the phrase "in addition", which appeared
in only a few abstracts. The word "addition" was thus associated with the other words
in these abstracts even though it had no significant meaning in this
collection.
More often, however, non-significant pairs arise simply from the
accidental preference of the author of one or more abstracts for certain
words. If one abstract contains many instances of one word, a few
instances of another word in that same abstract may appear to the
association routine as substantial overlap. Non-significant pairs,
however, represent only a small fraction of the total number of pairs of
high-frequency words when local meanings are taken into account.
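
The effect described above can be illustrated with a minimal sketch. It
assumes a cosine correlation between word-frequency vectors as a stand-in
for the association routine; the measure actually used may differ. The
words and counts are hypothetical and are not taken from the collection.

```python
from math import sqrt

def cosine(u, v):
    """Cosine correlation between two word-frequency vectors.

    Assumption: a stand-in for the association measure, which the
    text does not specify here.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical occurrence counts of three words in five abstracts.
counts = {
    "word_a": [0, 9, 0, 0, 0],  # many instances, all in one abstract
    "word_b": [0, 2, 0, 0, 0],  # a few instances in that same abstract
    "word_c": [1, 1, 1, 0, 1],  # spread evenly over several abstracts
}

for x, y in [("word_a", "word_b"), ("word_a", "word_c")]:
    print(x, y, round(cosine(counts[x], counts[y]), 2))
```

Under these assumptions the pair (word_a, word_b) receives the maximum
score of 1.0 on the strength of a single abstract, while (word_a, word_c)
scores 0.5 even though word_c occurs in four abstracts.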
The majority of associations represent such "locally" related
words. Overall, about three-quarters of the associations consist of
related words, and 80% of these (roughly 60% of all associations) are
related only because one of the words
has a peculiar meaning in this collection. Fig. 3 shows additional
examples of these local meanings. As a result of this peculiarity, the
association process is not directly useful for determining word pairs