IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
included as phrases in the phrase dictionary; and only two pairs represent
words which are directly related through the hierarchy.
An attempt to determine the effect of using larger collections was also
fruitless. A collection of 110,000 words (the ADI collection)
yielded 19.7% significant correlations, ranging from about 10% in the
lower frequency ranges to 50% in the higher frequency ranges. Because of
the extreme length of documents in this collection, these results are not
properly comparable with those from collections of abstracts, and further
work on much larger collections is needed to determine whether reliable
word relationships can be obtained from collections of longer documents.
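The breakdown by frequency range quoted above can be reproduced with a simple tally. The following is a minimal sketch only: the pair data, the band boundaries, and the names `judged_pairs`, `BANDS`, and `significance_by_band` are hypothetical and are not taken from the report, which does not list its frequency ranges on this page.

```python
from collections import defaultdict

# Hypothetical toy data: (frequency of the word pair's rarer member, judged significant?).
# Illustrative values only; they are NOT the ADI results.
judged_pairs = [
    (3, False), (4, False), (5, True), (6, False),
    (12, False), (15, True), (18, False),
    (40, True), (55, True), (60, False), (75, True),
]

# Assumed frequency bands as inclusive (low, high) bounds.
BANDS = [(1, 9), (10, 29), (30, 10**9)]


def significance_by_band(pairs, bands):
    """Return ({band: percent judged significant}, overall percent)."""
    counts = defaultdict(lambda: [0, 0])  # band -> [significant, total]
    for freq, significant in pairs:
        for band in bands:
            low, high = band
            if low <= freq <= high:
                counts[band][0] += int(significant)
                counts[band][1] += 1
                break  # each pair is counted in exactly one band
    rates = {band: 100.0 * s / t for band, (s, t) in counts.items()}
    total_sig = sum(s for s, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return rates, 100.0 * total_sig / total


rates, overall = significance_by_band(judged_pairs, BANDS)
for band in BANDS:
    print(band, round(rates[band], 1))
print("overall", round(overall, 1))
```

Reporting the rate per band alongside the overall figure makes it visible when a single percentage hides a strong dependence on word frequency, as the ADI figures suggest.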
Since few of the word associations represent obvious semantic
relationships, it may well be asked what causes the associations. The
answer seems to be that they represent relations of "local" semantic
meanings peculiar to this collection of documents. That is, the meaning
of a word in one particular document collection may differ widely from
the normal meaning of the word. When the meanings of the words within the
collection are considered, it is found that 73.1% of the pairs
are significant in this "local" sense. For example, consider the
associated pair "scheme" and "machine". This was rated non-significant,
since the words in their normal technical meanings are not related.
However, an examination of the ten occurrences of "machine" shows that
all ten imply "digital computing machine"; although the collection
discusses compressors, engines, etc., none of these is referred to as a "machine".
The "local meaning" of "machine" is therefore "computer". Similarly, the
major local meaning of "scheme" turns out to be "algorithm" (i.e., not just
any kind of plan, but a plan for a digital computer program). Clearly,