Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
where $w_{ik}$ is the weight of word $i$ in document $k$, and $c_{ij}$ is the
correlation between concept $i$ and concept $j$. Alternatively, the "overlap
correlation" may be used; it is defined as
$$c_{ij} = \frac{\sum_k \min(w_{ik}, w_{jk})}{\min\left(\sum_k w_{ik}, \sum_k w_{jk}\right)}$$
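The overlap correlation can be illustrated with a minimal Python sketch. It assumes each word is represented by a list of its weights over the same ordered set of documents; the function name and the sample weights are hypothetical and are not taken from the SMART programs. Returning zero for a word that never occurs is also an assumption made here to avoid division by zero.

```python
def overlap_correlation(w_i, w_j):
    """Overlap correlation of two word-weight vectors over the same documents.

    Numerator: sum over documents k of min(w_ik, w_jk).
    Denominator: the smaller of the two total weights.
    """
    if len(w_i) != len(w_j):
        raise ValueError("both words must be weighted over the same documents")
    numerator = sum(min(a, b) for a, b in zip(w_i, w_j))
    denominator = min(sum(w_i), sum(w_j))
    # Assumption: a word with no occurrences correlates with nothing.
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical weights of two words in four documents.
word_a = [2, 0, 1, 3]
word_b = [1, 1, 0, 3]
print(overlap_correlation(word_a, word_b))  # (1 + 0 + 0 + 3) / min(6, 5) = 0.8
```

Because the denominator is the smaller of the two totals, a rare word that occurs only where a common word also occurs receives a correlation of 1 with that common word.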
An example of these correlation procedures is given in Fig. 1. All pairs
of words whose correlation exceeds a previously set cutoff are used as
associated pairs. These pairs are then employed in the document vector
expansion procedure.
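The selection of associated pairs and their use in expansion might be sketched as follows. This passage does not spell out the expansion rule, so the rule used here is an assumption: each associate of a word already present in the document vector is added with a fixed fraction, `added_weight`, of that word's weight. All names and sample data are hypothetical.

```python
from itertools import combinations


def overlap(w_i, w_j):
    """Overlap correlation of two weight vectors (zero if either is empty)."""
    denominator = min(sum(w_i), sum(w_j))
    if denominator <= 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(w_i, w_j)) / denominator


def associated_pairs(word_vectors, cutoff):
    """Map each word pair whose correlation exceeds the cutoff to that correlation."""
    pairs = {}
    for a, b in combinations(sorted(word_vectors), 2):
        c = overlap(word_vectors[a], word_vectors[b])
        if c > cutoff:
            pairs[(a, b)] = c
    return pairs


def expand_vector(doc_vector, pairs, added_weight=0.5):
    """Add associates of the words in doc_vector (assumed weighting rule)."""
    expanded = dict(doc_vector)
    for a, b in pairs:
        for present, associate in ((a, b), (b, a)):
            if present in doc_vector:
                increment = added_weight * doc_vector[present]
                expanded[associate] = expanded.get(associate, 0.0) + increment
    return expanded


# Hypothetical word weights over four documents.
word_vectors = {
    "wing": [2, 0, 1, 3],
    "airfoil": [1, 1, 0, 3],
    "boundary": [0, 2, 0, 0],
}
pairs = associated_pairs(word_vectors, cutoff=0.6)
print(pairs)                                # only ("airfoil", "wing") passes
print(expand_vector({"wing": 2.0}, pairs))  # "airfoil" is added to the vector
```

Raising the cutoff shrinks the set of associated pairs, so fewer words are added to each vector.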
Many options are available in this procedure. Either correlation
method may be used; the cutoff may be adjusted arbitrarily; the procedure
may be iterated with the word similarities measured by the correlation
of their lists of related words as determined by the previous iteration;
the weights with which the new words are added to the concept vector may
be changed; and words occurring outside specified frequency ranges may be
omitted from the procedure.
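The iteration option can be read as follows: after one pass, each word has a list of related words, and two words are then compared by correlating those lists. The sketch below is one possible reading, not the procedure as implemented. It assumes that a word's list is its row of the previous correlation matrix with entries at or below the cutoff set to zero, that a word is kept in its own list, and that the overlap correlation is used again for the second pass. All names and data are hypothetical.

```python
def overlap(u, v):
    """Overlap correlation of two non-negative vectors (zero if either is empty)."""
    denominator = min(sum(u), sum(v))
    if denominator <= 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(u, v)) / denominator


def related_lists(corr, cutoff):
    """For each word, keep only the correlations that exceed the cutoff."""
    return [[c if c > cutoff else 0.0 for c in row] for row in corr]


def iterate_correlations(corr, cutoff):
    """One further iteration: correlate the words' lists of related words."""
    rows = related_lists(corr, cutoff)
    n = len(rows)
    return [[overlap(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Hypothetical first-pass correlations among three words.
first_pass = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
second_pass = iterate_correlations(first_pass, cutoff=0.6)
for row in second_pass:
    print([round(c, 3) for c in row])
```

In this example the first two words share the same related words, so their similarity rises on the second pass, while the third word, whose only link fell below the cutoff, becomes unrelated to both.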
Experiments were performed on three document collections, all of which have
been used in many SMART experiments. The Cranfield collection, consisting of 200
abstracts in aeronautics collected by the Aslib-Cranfield project in
England, is used for most of the investigation. Evaluation is based on a
set of 42 actual research questions, with relevance judgments obtained from
the researcher himself. The other collections used are the IRE collection,
consisting of about 780 abstracts in computer science, and the ADI collection
of 82 short papers in documentation. Prepared questions are available for
these collections with relevance judgments made by the authors.