Scientific Report No. IRS-13
Information Storage and Retrieval

Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

represent significant data but are largely chance occurrences. Since these correlations make up the major part of the expanded document vectors, they will perturb the run. We have, therefore, adopted the expedient of removing from the runs all correlations involving a word that occurs fewer than three times, thus eliminating most of the chance correlations. The expected number of chance correlations between words occurring three times is only one or two.

Complete elimination of chance correlations requires the removal not only of the words occurring very few times, but also of the words occurring many times. Any two words which occur in more than half the documents in the collection, for example, are clearly likely to have a high correlation; the expected chance correlation between two words, each occurring in 100 documents, is about 0.5, which is quite close to the cutoff. The correlation between two words occurring in every document is almost certain to be over the cutoff. We therefore find it necessary also to remove correlations between words occurring over 100 times (half our document collection size in this particular test).

The correlations remaining after these cutoffs are applied represent non-random word co-occurrences. This does not necessarily imply that the words are related semantically. Co-occurrences may result from quirks of an author's style, or from peculiarities of word usage within document collections, as well as from actual semantic similarity. Since it has been suggested that word associations can be used to construct thesauruses, it is important to know whether word-word pairs produced by an association process reflect semantic meanings.
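
The two frequency cutoffs described above can be illustrated with a minimal sketch. It rests on several assumptions not stated in this section: that the correlation is a cosine measure over binary document-occurrence vectors, that "occurring N times" means occurring in N documents, that the collection holds 200 documents (twice the upper cutoff of 100), and that a pair is dropped if either word falls outside the frequency band. The function names and the sample data are hypothetical.

```python
import math


def expected_chance_cosine(n1, n2, n_docs):
    """Expected cosine between two words that occur independently.

    Assumes binary occurrence vectors: a word in n1 of n_docs documents
    and a word in n2 of n_docs documents are expected to share
    n1 * n2 / n_docs documents by chance alone.
    """
    expected_overlap = n1 * n2 / n_docs
    return expected_overlap / math.sqrt(n1 * n2)


def filter_correlations(correlations, doc_freq, min_freq=3, max_freq=100):
    """Keep only pairs whose words both lie inside the frequency band.

    correlations maps (word_a, word_b) to a correlation value;
    doc_freq maps each word to the number of documents containing it.
    Dropping a pair when EITHER word is out of band is an assumption.
    """
    def usable(word):
        return min_freq <= doc_freq[word] <= max_freq

    return {
        pair: value
        for pair, value in correlations.items()
        if usable(pair[0]) and usable(pair[1])
    }


# Hypothetical data, for illustration only.
doc_freq = {"rare": 2, "common": 150, "index": 12, "retrieval": 40}
correlations = {
    ("rare", "index"): 0.9,
    ("common", "retrieval"): 0.7,
    ("index", "retrieval"): 0.6,
}
kept = filter_correlations(correlations, doc_freq)
chance = expected_chance_cosine(100, 100, 200)
```

Under these assumptions the chance correlation for two words each occurring in 100 of 200 documents comes out to 0.5, which agrees with the figure quoted in the text, and only the pair whose words both lie between the two cutoffs survives the filter.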