Scientific Report No. IRS-13
Information Storage and Retrieval

Word-Word Associations in Document Retrieval Systems

M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

... and the importance attached to it. The concepts may represent words, groups of synonymous words, phrases, or any other indications reflecting the content of documents. In a word matching system, for example, each English stem is a concept, and the number of occurrences of a stem is its weight. The concept vector then represents a frequency list of the words in the text or query (with suffixes removed). Retrieval tests are performed by matching queries against documents to find the documents with the most similar concept vectors.

To simulate a word-word association process, the concept vectors associated with the requests and documents are augmented by concepts found to be related to the original concepts. The association procedures construct a list of word pairs which are strongly associated, and for each word in the concept vector of a document, all words paired with it are added to the concept vector of the document. The expanded concept vector is used for retrieval in exactly the same fashion as the original concept vector, and the results are compared.

Related word pairs are determined by the following algorithm. For each word in the document collection, a list of the documents in which the word has occurred is compiled and the frequency of the word is noted. For each pair of words, these lists are compared and a measure of similarity between the two concepts is then evaluated. The normal measure of similarity is the "cosine" correlation, defined by

$$\cos(i, j) = \frac{\sum_k w_{ik}\, w_{jk}}{\sqrt{\sum_k w_{ik}^2 \, \sum_k w_{jk}^2}}$$
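
The association procedure described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the report's own program. It reads the weight of word i in document k as the frequency of the word in that document. The function names, the cutoff `threshold` that decides when a pair counts as "strongly associated", and the `added_weight` given to words added during expansion are all hypothetical choices that the text does not specify.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def word_document_weights(documents):
    """Map each word to {document index: frequency of the word in that document}."""
    weights = defaultdict(dict)
    for k, words in enumerate(documents):
        for word, count in Counter(words).items():
            weights[word][k] = count
    return weights


def cosine(wi, wj):
    """Cosine correlation between two sparse word-by-document weight lists."""
    shared = wi.keys() & wj.keys()
    numerator = sum(wi[k] * wj[k] for k in shared)
    denominator = sqrt(
        sum(v * v for v in wi.values()) * sum(v * v for v in wj.values())
    )
    return numerator / denominator if denominator else 0.0


def associated_pairs(documents, threshold=0.6):
    """Return {word: set of related words} for pairs at or above the cutoff.

    The threshold value is an assumption made for this sketch.
    """
    weights = word_document_weights(documents)
    related = defaultdict(set)
    for a, b in combinations(sorted(weights), 2):
        if cosine(weights[a], weights[b]) >= threshold:
            related[a].add(b)
            related[b].add(a)
    return related


def expand(concept_vector, related, added_weight=1):
    """Add every word paired with a word already in the concept vector.

    Words already present keep their weight; added words receive
    `added_weight`, which is an assumption made for this sketch.
    """
    expanded = dict(concept_vector)
    for word in concept_vector:
        for partner in related.get(word, ()):
            expanded.setdefault(partner, added_weight)
    return expanded


if __name__ == "__main__":
    # Toy collection: each document is a list of word stems.
    docs = [
        ["library", "catalog", "catalog"],
        ["library", "catalog"],
        ["engine", "fuel"],
    ]
    related = associated_pairs(docs)
    print(expand({"library": 2}, related))  # {'library': 2, 'catalog': 1}
```

Comparing every pair of words is quadratic in the vocabulary size, so a sketch of this kind is only practical for small collections. The expanded vector can then be matched against documents in the same way as the original vector.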