IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
only 20% of the correlations were judged significant. High cutoffs
should not be used in the word-word association process if the aim is to
recover a sizable number of significant pairs.
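The effect of a correlation cutoff on the number of pairs recovered can be illustrated with a minimal sketch. It assumes, for illustration only, that the association between two words is measured as the cosine between their document-occurrence vectors; the function names, the whitespace tokenization, and the toy documents are hypothetical and are not taken from the report.

```python
from itertools import combinations
from math import sqrt


def word_vectors(docs):
    """Map each word to a sparse vector {document index: occurrence count}."""
    vectors = {}
    for i, text in enumerate(docs):
        for word in text.lower().split():
            counts = vectors.setdefault(word, {})
            counts[i] = counts.get(i, 0) + 1
    return vectors


def cosine(u, v):
    """Cosine correlation between two sparse occurrence vectors (assumed measure)."""
    dot = sum(c * v[d] for d, c in u.items() if d in v)
    norm = sqrt(sum(c * c for c in u.values()) * sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def associated_pairs(docs, cutoff):
    """Return (word_a, word_b, correlation) for every pair at or above the cutoff."""
    vectors = word_vectors(docs)
    pairs = []
    for a, b in combinations(sorted(vectors), 2):
        r = cosine(vectors[a], vectors[b])
        if r >= cutoff:
            pairs.append((a, b, r))
    return pairs


# Hypothetical toy documents, used only to show how the cutoff thins the pair list.
docs = ["tube amplifier gain", "tube amplifier noise", "antenna gain"]
for cutoff in (0.4, 0.6, 0.9):
    print(cutoff, len(associated_pairs(docs, cutoff)))
```

Raising the cutoff in this sketch shrinks the list of associated pairs quickly, which is the trade-off the text describes: a high cutoff leaves few pairs to work with, whether or not those that remain are significant.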
The classification by word frequency also yields no particularly
superior choice of options. Table 3 shows the variation of significance
with the frequencies of the words in the associated pair. High frequency
words, as might be expected, show somewhat more reliable relationships;
but the amount of statistical scatter in this corner of the table (since
the number of high-frequency words is so small) renders the numbers
doubtful. In any event, even the best numbers (e.g. correlations of words
above 20 occurrences, based on 8 pairs) are relatively poor; only 37%
significant relations. There is, in short, no choice of frequency or
correlation cutoff which yields reliably significant pairs. Examination
of more complete tables showing both frequency and correlation dependence
of significant pairs also discloses no particularly good combination. It
is believed, then, that for a collection of the size used (40,000 words)
the statistical association process cannot be used to yield reliable
indications of generalized word meanings.
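The kind of tabulation discussed above, the fraction of significant pairs under each combination of frequency and correlation cutoff, can be sketched as follows. This is an illustrative reconstruction, not the procedure used in the report: the record layout, the function names, and the rule that both words of a pair must meet the frequency cutoff are assumptions, and the sample records are hypothetical values chosen only to exercise the code.

```python
from collections import namedtuple

# One manually judged association pair (hypothetical record layout).
JudgedPair = namedtuple("JudgedPair", "freq_a freq_b correlation significant")


def significant_fraction(pairs, min_freq=0, min_corr=0.0):
    """Return (fraction judged significant, pair count) for pairs passing both cutoffs.

    Assumption: a pair passes the frequency cutoff only if BOTH words occur
    at least min_freq times. Returns (None, 0) when no pair qualifies.
    """
    selected = [
        p for p in pairs
        if min(p.freq_a, p.freq_b) >= min_freq and p.correlation >= min_corr
    ]
    if not selected:
        return None, 0
    hits = sum(1 for p in selected if p.significant)
    return hits / len(selected), len(selected)


def cutoff_table(pairs, freq_cutoffs, corr_cutoffs):
    """Tabulate significance for every (frequency cutoff, correlation cutoff) cell."""
    return {
        (f, c): significant_fraction(pairs, f, c)
        for f in freq_cutoffs
        for c in corr_cutoffs
    }


# Hypothetical judged pairs; these are not data from the report.
toy = [
    JudgedPair(25, 30, 0.7, True),
    JudgedPair(22, 40, 0.5, False),
    JudgedPair(3, 4, 0.9, False),
    JudgedPair(5, 21, 0.6, True),
]
for cell, (fraction, count) in sorted(cutoff_table(toy, [0, 20], [0.0, 0.6]).items()):
    print(cell, fraction, count)
```

Reporting the pair count beside each fraction makes the scatter problem visible: in a cell holding only 8 pairs, a single judgment shifts the fraction by 12.5 percentage points.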
Confirmation of this comes by comparison of word association
pairs with dictionaries, phrase lists, and hierarchies. The IRE collection,
for which a thesaurus of 700 concepts and about 3000 stems, a phrase list
of 400 entries, and a complete hierarchy exist, is used for this test.
A word-word association run was performed and the pairs were checked
against all of these dictionaries. The association process identifies only
one pair which is considered synonymous by the thesaurus; no pairs are