Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
represent significant data but are largely chance occurrences. Since
these correlations represent the major part of expanded document vectors,
they will perturb the run. We have, therefore, adopted the expedient of
removing all correlations involving a word occurring fewer than three times
from the runs, thus eliminating most of the chance correlations. The
expected number of chance correlations between words occurring three times is
only one or two.
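The low-frequency cutoff described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the runs: the function name, the dictionary layout, and the toy words and correlation values are all assumptions made for the example. Only the threshold of three occurrences comes from the text.

```python
from collections import Counter

def drop_rare_word_pairs(correlations, word_counts, min_count=3):
    """Keep only the pairs in which both words occur at least min_count times.

    correlations: dict mapping (word_a, word_b) -> correlation value
    word_counts:  mapping word -> number of occurrences in the collection
    """
    return {
        (a, b): value
        for (a, b), value in correlations.items()
        if word_counts[a] >= min_count and word_counts[b] >= min_count
    }

# Toy data, invented purely for illustration.
word_counts = Counter({"retrieval": 12, "index": 7, "zymurgy": 1, "quark": 2})
correlations = {
    ("retrieval", "index"): 0.71,
    ("retrieval", "zymurgy"): 0.29,
    ("zymurgy", "quark"): 0.71,
}

kept = drop_rare_word_pairs(correlations, word_counts)
print(kept)
```

In this toy case only the pair in which both words occur three or more times survives. The pair of two rare words is dropped even though its correlation value is high.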
Complete elimination of chance correlations requires the removal
not only of the words occurring very few times, but also of the words occurring
many times. Any two words which occur in more than half the documents in
the collection, for example, are clearly likely to have a high correlation;
the expected chance correlation between two words, each occurring in 100
documents, is about 0.5, which is quite close to the cutoff. The
correlation between two words occurring in every document is almost certain to
exceed the cutoff. We therefore find it necessary also to remove correlations
between words occurring over 100 times (half our document collection size
in this particular test).
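The figure of about 0.5 can be reproduced with a short calculation. This is a sketch under stated assumptions rather than the computation used in the report: it assumes the correlation is a cosine between binary word-occurrence vectors, that the two words are distributed independently over the documents, and that the collection holds 200 documents (twice the 100-occurrence cutoff). The function name is hypothetical.

```python
import math

def expected_chance_cosine(n_a, n_b, n_docs):
    """Expected cosine correlation between two independent words.

    Assumes binary occurrence vectors: word A appears in n_a documents,
    word B in n_b documents, out of n_docs documents in the collection.
    """
    # Under independence, the expected number of shared documents.
    expected_joint = n_a * n_b / n_docs
    # Cosine of two binary vectors: shared documents / sqrt(n_a * n_b).
    return expected_joint / math.sqrt(n_a * n_b)

# Two words, each occurring in 100 of 200 documents.
print(expected_chance_cosine(100, 100, 200))  # 0.5

# Two words occurring in every document.
print(expected_chance_cosine(200, 200, 200))  # 1.0
```

Under these assumptions the expression simplifies to sqrt(n_a * n_b) / n_docs, so the chance correlation grows in proportion to word frequency, which is why the most frequent words have to be cut off as well as the rarest.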
The correlations remaining after these cutoffs are applied represent
non-random word co-occurrences. This does not necessarily imply that the
words are related semantically. Co-occurrences may result from quirks of an
author's style, or from peculiarities of word usage within document
collections, as well as from actual semantic similarity. Since it has been
suggested that word associations can be used to construct thesauruses, it
is important to know whether word-word pairs produced by an association
process reflect semantic meanings.