Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
IX. Word-Word Associations in Document Retrieval Systems
1. Introduction
Word normalization procedures in document retrieval systems are
traditionally based on manually constructed thesauruses and term lists.
Recently, automatic methods dependent on statistical co-occurrence of words
have been proposed for the determination of word meanings and the selection
of synonymous words, and it has been asserted that the use of such
word-occurrence statistics can substitute for thesauruses in retrieval
systems. [1,2]
Word-association procedures can be investigated through the SMART
automatic document retrieval system, which is capable of simulating a wide
variety of proposed computerized text analysis systems in an experimental
retrieval environment. [3,4] The SMART system includes methods for automatic
processing of text and questions, and for the evaluation of the test results
using a variety of performance measures. Existing test collections and
dictionaries are used to analyze and evaluate the performance of association
procedures for document retrieval.
2. Method
In the SMART retrieval programs, documents are translated into "concept
vectors", consisting of a list of concepts with attached weights. Each concept
represents a piece of information found in the text by the analysis
routines, and the weight reflects the number of times the concept was found