Scientific Report No. IRS-13, Information Storage and Retrieval. Word-Word Associations in Document Retrieval Systems. M. E. Lesk. Harvard University. Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

included as phrases in the phrase dictionary; and only two pairs represent words which are directly related through the hierarchy.

An attempt to see what would happen with larger collections was also fruitless. A collection of 110,000 words (the ADI collection) yielded 19.7% significant correlations, ranging from about 10% in the lower frequency ranges to 50% in the higher frequency ranges. Because of the extreme length of the documents in this collection, these results are not properly comparable with those from collections of abstracts, and further work on much larger collections is needed to determine whether reliable word relationships can be obtained from longer collections.

Since few of the word associations represent obvious semantic relationships, it may well be asked what causes the associations. The answer seems to be that they represent relations of "local" semantic meanings peculiar to this collection of documents. That is, the meaning of a word in one particular document collection may differ widely from the normal meaning of the word. When the meanings of the words within the collection are considered, about 73.1% of the pairs are found to be significant in this "local" sense.

For example, consider the associated pair "scheme" and "machine". This pair was rated non-significant, since the words in their normal technical meanings are not related. However, examining the ten occurrences of "machine" shows that all ten imply "digital computing machine"; although the collection discusses compressors, engines, etc., none of these are referred to as "machines". The "local meaning" of "machine" is therefore "computer".
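The kind of check described for "machine", reading every occurrence of a word in its surrounding context, can be sketched as a simple keyword-in-context listing. This is only an illustrative sketch, not the procedure used in the report: the function name `keyword_in_context`, the `window` parameter, and the toy documents are all hypothetical.

```python
import re


def keyword_in_context(documents, keyword, window=3):
    """Return (doc_index, context) pairs for every exact occurrence of keyword.

    Hypothetical helper: tokens are lowercased alphabetic runs, and only exact
    matches count (so "machines" does not match "machine").
    """
    hits = []
    for doc_index, text in enumerate(documents):
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, token in enumerate(tokens):
            if token == keyword:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                hits.append((doc_index, " ".join(left + [token] + right)))
    return hits


# Toy documents invented for illustration; they are not from the collection.
docs = [
    "The scheme was coded for a digital computing machine.",
    "Compressors and engines are discussed in this report.",
    "Each machine executes the scheme as a stored program.",
]

for doc_index, context in keyword_in_context(docs, "machine"):
    print(doc_index, context)
```

Reading the listed contexts by hand is what allows a judgment such as "every occurrence means a digital computer"; the code only gathers the occurrences and makes no semantic decision itself.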
Similarly, the major local meaning of "scheme" turns out to be "algorithm" (i.e., not just any kind of plan, but a plan for a digital computer program). Clearly,