Scientific Report No. IRS-13
Information Storage and Retrieval

Word-Word Associations in Document Retrieval Systems

M. E. Lesk

Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

IX. Word-Word Associations in Document Retrieval Systems

M. E. Lesk

1. Introduction

Word normalization procedures in document retrieval systems are traditionally based on manually constructed thesauruses and term lists. Recently, automatic methods dependent on statistical co-occurrence of words have been proposed for the determination of word meanings and the selection of synonymous words, and it has been asserted that the use of such word-occurrence statistics can substitute for thesauruses in retrieval systems.

Word-association procedures can be investigated through the SMART automatic document retrieval system, which is capable of simulating a wide variety of proposed computerized text analysis systems in an experimental retrieval environment. [3,4] The SMART system includes methods for automatic processing of text and questions, and for the evaluation of the test results using a variety of performance measures. Existing test collections and dictionaries are used to analyze and evaluate the performance of association procedures for document retrieval. [1,2]

2. Method

In the SMART retrieval programs, documents are translated into "concept vectors", consisting of a list of concepts with attached weights. Each concept represents a piece of information found in the text by the analysis routines, and the weight reflects the number of times the concept was found