Scientific Report No. IRS-13
Information Storage and Retrieval

Word-Word Associations in Document Retrieval Systems

M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

only 20% of the correlations were judged significant. High cutoffs should not be used in the word-word association process if the aim is to recover a sizable number of significant pairs.

The classification by word frequency also yields no particularly superior choice of options. Table 3 shows the variation of significance with the frequencies of the words in the associated pair. High-frequency words, as might be expected, show somewhat more reliable relationships; but the amount of statistical scatter in this corner of the table (since the number of high-frequency words is so small) renders the numbers doubtful. In any event, even the best numbers (e.g., correlations of words with more than 20 occurrences, based on 8 pairs) are relatively poor: only 37% significant relations. There is, in short, no choice of frequency or correlation cutoff which yields reliably significant pairs. Examination of more complete tables showing both frequency and correlation dependence of significant pairs also discloses no particularly good combination.

It is believed, then, that for a collection of the size used (40,000 words) the statistical association process cannot be used to yield reliable indications of generalized word meanings. Confirmation of this comes by comparison of word association pairs with dictionaries, phrase lists, and hierarchies. The IRE collection, for which a thesaurus of 700 concepts and about 3000 stems, a phrase list of 400 entries, and a complete hierarchy exist, is used for this test. A word-word association run was performed and the pairs were checked against all of these dictionaries.
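The kind of check described above can be sketched roughly as follows. This is a minimal illustration and not the procedure used in the report: it assumes a cosine correlation over term-document occurrence vectors, and the names `associated_pairs`, `thesaurus_matches`, `term_doc`, and `concept_of`, together with the toy data, are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine correlation between two equal-length occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def associated_pairs(term_doc, cutoff):
    """Return {(w1, w2): correlation} for word pairs at or above the cutoff."""
    pairs = {}
    for w1, w2 in combinations(sorted(term_doc), 2):
        c = cosine(term_doc[w1], term_doc[w2])
        if c >= cutoff:
            pairs[(w1, w2)] = c
    return pairs


def thesaurus_matches(pairs, concept_of):
    """Return the pairs whose two words share a thesaurus concept class."""
    return [
        (w1, w2)
        for (w1, w2) in pairs
        if w1 in concept_of and concept_of.get(w1) == concept_of.get(w2)
    ]


# Toy data (assumed): each word maps to its occurrence counts in four documents.
term_doc = {
    "amplifier": [1, 1, 0, 0],
    "gain": [1, 1, 0, 0],
    "memory": [0, 0, 1, 1],
    "storage": [0, 0, 1, 0],
}
# Toy thesaurus (assumed): each word maps to a concept class number.
concept_of = {"memory": 12, "storage": 12, "amplifier": 40, "gain": 41}

pairs = associated_pairs(term_doc, cutoff=0.6)
matches = thesaurus_matches(pairs, concept_of)
print(len(pairs), "associated pairs;", len(matches), "agree with the thesaurus")
```

Raising `cutoff` in this sketch shrinks the set of associated pairs, which mirrors the trade-off between cutoff level and the number of recoverable pairs noted in the text.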
The association process identifies only one pair which is considered synonymous by the thesaurus; no pairs are