Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
which are too frequent in this collection to be of much use as search
terms. However, none of those words had any related pairs, while
"dissociated" introduces eight new words. As a result, while "dissociated"
represented only 8% of the original query, it and its associations
represented 28% of the new query. The additional weight given to this
important term (since all its associations are also introduced into any
document which contains the word) causes three documents in rank positions
21, 23, and 27 to be promoted to positions 1, 6, and 7. Note that
"dissociation" already appears in these documents before expansion, but it is
not emphasized enough.
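
The shift in emphasis described above can be checked with a little arithmetic. The following sketch is illustrative only: the twelve equally weighted query terms, the placeholder term names, and the association weight of 0.4 are assumptions chosen so that the shares come out near the reported 8% and 28%; the weighting scheme actually used is not stated in this passage.

```python
def expand(vector, associations, assoc_weight):
    """Add every associated term to a term-weight vector.

    Each associated term receives assoc_weight times the weight of the
    term that introduced it (an assumed rule, not taken from the report).
    """
    expanded = dict(vector)
    for term, weight in vector.items():
        for related in associations.get(term, ()):
            expanded[related] = expanded.get(related, 0.0) + assoc_weight * weight
    return expanded


def share(vector, terms):
    """Fraction of the total vector weight carried by the given terms."""
    total = sum(vector.values())
    return sum(vector.get(t, 0.0) for t in terms) / total


# Hypothetical query: "dissociated" plus eleven placeholder terms, all weight 1.
query = {"dissociated": 1.0}
query.update({f"t{i}": 1.0 for i in range(1, 12)})

# Only "dissociated" has related pairs; a1..a8 stand in for its eight associations.
assoc_terms = [f"a{i}" for i in range(1, 9)]
associations = {"dissociated": assoc_terms}

expanded_query = expand(query, associations, assoc_weight=0.4)

share_before = share(query, ["dissociated"])
share_after = share(expanded_query, ["dissociated"] + assoc_terms)
```

Under these assumed weights, share_before is roughly 8% and share_after roughly 28%: the term gains emphasis only because it is the one query term that brings associations with it.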
Recall-effect improvement (introducing new terms missed in the
original search) is illustrated by a question in the ADI collection, QB2,
on the "testing of automatic information systems." This fails to match
one relevant document which deals with the "evaluation of documentation
techniques". The association procedure connects "automated" in the query
with "experiment" and "reduce"; "reduce" in turn is related to
"documentation". This provides enough overlap to raise the document from
77th place in the rank list of retrieved documents to 9th. It should be
noted that the useful relations are locally significant pairs (e.g.
"automated" and "experiment"; "experiment" and "test" are not associated).
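
The recall effect in this example can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the word lists standing in for the query and the document, the association weight of 0.5, the symmetric treatment of pairs, and the cosine matching function are all assumptions. Only the three associated pairs are taken from the text.

```python
import math

# Associated pairs named in the text.
PAIRS = [
    ("automated", "experiment"),
    ("automated", "reduce"),
    ("reduce", "documentation"),
]


def build_associations(pairs):
    """Turn a list of pairs into a symmetric term -> related-terms map."""
    assoc = {}
    for a, b in pairs:
        assoc.setdefault(a, set()).add(b)
        assoc.setdefault(b, set()).add(a)
    return assoc


def expand(vector, assoc, assoc_weight=0.5):
    """One-step expansion: add each term's associations at reduced weight."""
    expanded = dict(vector)
    for term, weight in vector.items():
        for related in assoc.get(term, ()):
            expanded[related] = expanded.get(related, 0.0) + assoc_weight * weight
    return expanded


def cosine(u, v):
    """Cosine correlation between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


assoc = build_associations(PAIRS)

# Hypothetical reduced forms of the query and the missed relevant document.
query = {"test": 1.0, "automated": 1.0, "information": 1.0, "system": 1.0}
document = {"evaluation": 1.0, "documentation": 1.0, "technique": 1.0}

score_before = cosine(query, document)
score_after = cosine(expand(query, assoc), expand(document, assoc))
```

Before expansion the two vectors share no term, so score_before is zero. After expansion "reduce" enters the query through "automated" and enters the document through "documentation", which gives the nonzero overlap that lets the document rise in the ranking.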
An example from the Cranfield collection is query 226, whose
key term is "Navier-Stokes" (equation). Document 08C does not contain this
word, but it was introduced by the association procedure from the word
"steady". The word "numerical" was introduced into both query and document
from "Navier-Stokes" and "steady", respectively. Again, note the local