Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
and the importance attached to it. The concepts may represent words,
groups of synonymous words, phrases, or any other indications reflecting
the content of documents. In a word matching system, for example, each
English stem is a concept, and the number of occurrences of a stem is its
weight. The concept vector then represents a frequency list of the words
in the text or query (with suffixes removed). Retrieval tests are performed
by matching queries against documents to find the documents with
the most similar concept vectors.
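The word matching scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual implementation: the suffix list and the function names `stem`, `concept_vector`, and `rank` are hypothetical, and the use of the cosine measure for query-document matching is an assumption made for the example.

```python
from collections import Counter
import math

# Hypothetical, deliberately crude suffix list; a real system would use
# a proper suffix-removal procedure.
SUFFIXES = ("ing", "es", "s", "ed")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def concept_vector(text):
    """Frequency list of stems: each stem is a concept, its count the weight."""
    return Counter(stem(w) for w in text.lower().split())


def cosine(u, v):
    """Cosine similarity between two concept vectors (assumed matching measure)."""
    dot = sum(weight * v[c] for c, weight in u.items() if c in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return (score, name) pairs, most similar document first."""
    q = concept_vector(query)
    scored = [(cosine(q, concept_vector(text)), name)
              for name, text in documents.items()]
    return sorted(scored, reverse=True)
```

For instance, `rank("document retrieval", {"d1": "word associations in document retrieval", "d2": "cooking recipes"})` places `d1` ahead of `d2`, since only `d1` shares stems with the query.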
To simulate a word-word association process, the concept vectors
associated with the requests and documents are augmented by concepts found
to be related to the original concepts. The association procedures construct
a list of word pairs which are strongly associated, and for each
word in the concept vector of a document, all words paired with it are
added to the concept vector of the document. The expanded concept vector
is used for retrieval in exactly the same fashion as the original concept
vector, and the results are compared.
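The expansion step may be illustrated as follows. The sketch assumes that an added concept receives the weight of the word that introduced it; the text does not specify the weighting of added concepts, so this choice, like the function name `expand`, is hypothetical.

```python
from collections import Counter


def expand(vector, pairs):
    """Add to a concept vector every word paired with one of its words.

    `pairs` is a list of (word_a, word_b) tuples judged strongly associated.
    Assumption: an added word inherits the weight of the word it is paired with.
    """
    # Build a symmetric lookup: word -> set of associated words.
    related = {}
    for a, b in pairs:
        related.setdefault(a, set()).add(b)
        related.setdefault(b, set()).add(a)

    expanded = Counter(vector)  # copy, so the original vector is untouched
    for word, weight in vector.items():
        for partner in related.get(word, ()):
            expanded[partner] += weight
    return expanded
```

With the vector `{"wing": 2, "lift": 1}` and the single pair `("wing", "airfoil")`, the expanded vector gains the concept `airfoil` with weight 2 while the original weights are unchanged. The expanded vector can then be matched in the same way as the original one.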
Related word pairs are determined by the following algorithm.
For each word in the document collection, a list of the documents in
which the word has occurred is compiled and the frequency of the word is
noted. For each pair of words, these lists are compared and a measure of
similarity between the two concepts is then evaluated. The normal measure
of similarity is the "cosine" correlation, defined by
$$\cos(i, j) = \frac{\sum_k w_{ik}\, w_{jk}}{\sqrt{\sum_k w_{ik}^2 \; \sum_k w_{jk}^2}}$$
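As a sketch of the pair-finding procedure, the following Python code compiles, for each word, the list of documents in which it occurs together with its frequency, and then evaluates the cosine correlation for each pair of words. It takes the weight of word i in document k to be the frequency of the word in that document. The cutoff `threshold=0.6` and the function names are hypothetical; the text says only that strongly associated pairs are retained.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def word_document_lists(documents):
    """Map each word to a Counter of {document id: frequency in that document}."""
    lists = defaultdict(Counter)
    for doc_id, words in documents.items():
        for word in words:
            lists[word][doc_id] += 1
    return lists


def cosine(wi, wj):
    """Cosine correlation between two word-document frequency lists."""
    dot = sum(f * wj[k] for k, f in wi.items() if k in wj)
    norm = math.sqrt(sum(f * f for f in wi.values())
                     * sum(f * f for f in wj.values()))
    return dot / norm if norm else 0.0


def associated_pairs(documents, threshold=0.6):
    """Return (word_a, word_b, score) for pairs at or above the assumed cutoff."""
    lists = word_document_lists(documents)
    pairs = []
    for a, b in combinations(sorted(lists), 2):
        score = cosine(lists[a], lists[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs
```

For the collection `{1: ["wing", "airfoil", "wing"], 2: ["wing", "airfoil"], 3: ["boundary"]}`, only the pair `airfoil`, `wing` is retained, because the two words occur in the same documents, whereas `boundary` shares no document with either.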