NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

LSI meets TREC: A Status Report

S. Dumais

similarities between any combination of terms and documents can be easily obtained. Retrieval proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to the query.

The LSI method has been applied to many of the standard IR collections with favorable results. Using the same tokenization and term weightings, the LSI method has equaled or outperformed standard vector methods and other variants in almost every case, and was as much as 30% better in some cases (Deerwester et al., 1990). As with the standard vector method, differential term weighting and relevance feedback both improve LSI performance substantially (Dumais, 1991). LSI has also been applied in experiments on relevance feedback (Dumais and Schmitt, 1991) and in filtering applications (Foltz and Dumais, 1992).

The TREC conference was an opportunity for us to "scale up" our tools and to explore the LSI dimension-reduction ideas using a very rich corpus of word usage. This large collection of standard documents and relevance judgements should be a valuable IR resource and an important step in the systematic development of more effective retrieval systems.

2. Application of LSI to the TREC collection

2.1 Overview

We used existing LSI/SVD software for analyzing the training and test collections, and for query processing and retrieval. For pragmatic reasons, we divided the TREC collection into nine subcollections - AP1, DOE1, FR1, WSJ1, ZIFF1, AP2, FR2, WSJ2, ZIFF2. Queries were run against the appropriate subcollections, and the results were recombined to arrive at a single ranked output. There were three main stages involved in processing documents and constructing the relevant data structures.
1. Pre-processing and indexing (extracting terms, calculating term weights, etc.)
2. Computing the SVD (the number of dimensions ranged from 235 to 310)
3. Adding new documents and/or terms

All steps were completely automatic and involved no human intervention. The resulting reduced-dimension representations were used for matching and retrieval.

2.1.1 Pre-processing and indexing

We did minimal pre-processing on the raw text of the TREC documents. Some markups (any text within <> delimiters) were removed, and all hand-indexed entries were removed from the WSJ and ZIFF collections. Upper-case characters were translated into lower case, punctuation was removed, and white space was used to delimit terms. A minimum term length of 2 characters was used. All terms occurring in more than one document and not on a stop list of 439 words were used to generate a term-document matrix. We did not use stemming, phrases, syntactic or semantic parsing, word-sense disambiguation, heuristic association, or spelling checking or correction,
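The pre-processing rules above (markup removal, lower-casing, punctuation stripping, whitespace tokenization, a minimum term length of 2, a stop list, and keeping only terms that occur in more than one document) can be sketched roughly as follows. This is an illustrative reconstruction, not the actual TREC indexing code; the tiny stop list here is a stand-in for the real 439-word list, which is not reproduced in this chapter.

```python
import re
from collections import Counter

# Stand-in for the 439-word stop list used in the actual runs (assumption)
STOP = {"the", "of", "and", "a", "in"}

def tokenize(text):
    """Minimal pre-processing: drop <...> markup, lower-case, strip
    punctuation, split on white space, enforce minimum term length 2,
    and remove stop words."""
    text = re.sub(r"<[^>]*>", " ", text)      # remove markup within <> delimiters
    text = text.lower()                        # upper case -> lower case
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation
    return [t for t in text.split()
            if len(t) >= 2 and t not in STOP]

def vocabulary(docs):
    """Index only terms occurring in more than one document."""
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t for t, n in df.items() if n > 1}
```

The surviving vocabulary would then index the rows of the term-document matrix from which the SVD is computed.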
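Stage 2 and the retrieval step described earlier (computing a truncated SVD, placing a query at a point in the reduced space, and ranking all documents by similarity to that point) can be sketched as below. This is a minimal dense-matrix illustration using NumPy, which is my choice for the sketch; the actual TREC runs used dedicated sparse SVD software and 235-310 dimensions, far beyond what a dense SVD could handle at that scale.

```python
import numpy as np

def lsi_rank(term_doc, query, k):
    """Rank documents by cosine similarity to a query in a k-dimensional
    LSI space (a sketch, not the production LSI/SVD software)."""
    # Truncated SVD: term_doc ~= U_k diag(s_k) Vt_k
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    # Document coordinates in the reduced space: rows of (diag(s_k) Vt_k)^T
    docs = (sk[:, None] * Vtk).T
    # Fold the query into the same space: q_k = q^T U_k diag(s_k)^{-1}
    q_k = (query @ Uk) / sk
    # Cosine similarity between the query point and every document
    sims = docs @ q_k / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims), sims
```

The same folding-in formula is what makes stage 3 (adding new documents and/or terms) cheap: a new document vector can be projected into the existing space without recomputing the SVD.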