NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
similarities between any combination of terms and documents can be easily obtained. Retrieval
proceeds by using the terms in a query to identify a point in the space, and all documents are
then ranked by their similarity to the query.
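The retrieval step described above can be sketched as a cosine ranking in the reduced space. The vectors below are made up for illustration; in the actual system the document and query vectors come from the SVD analysis.

```python
import numpy as np

# Hypothetical reduced-dimension (LSI) document vectors: 4 docs, k = 3 dims.
doc_vecs = np.array([
    [0.8, 0.1, 0.2],
    [0.1, 0.9, 0.3],
    [0.5, 0.5, 0.1],
    [0.0, 0.3, 0.9],
])

def rank_by_similarity(query_vec, doc_vecs):
    """Rank all documents by cosine similarity to the query's point in space."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return order, sims[order]

# A query projected into the same space identifies a point; all documents
# are then ranked by similarity to that point.
query = np.array([0.75, 0.15, 0.15])
order, sims = rank_by_similarity(query, doc_vecs)
```

Unlike a Boolean match, every document receives a score, so the entire collection is returned in ranked order.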
The LSI method has been applied to many of the standard IR collections with favorable results.
Using the same tokenization and term weightings, the LSI method has equaled or outperformed
standard vector methods and other variants in almost every case, and was as much as 30% better
in some cases (Deerwester et al., 1990). As with the standard vector method, differential term
weighting and relevance feedback both improve LSI performance substantially (Dumais, 1991).
LSI has also been applied in experiments on relevance feedback (Dumais and Schmitt, 1991),
and in filtering applications (Foltz and Dumais, 1992).
The TREC conference was an opportunity for us to "scale up" our tools, and to explore the LSI
dimension-reduction ideas using a very rich corpus of word usage. This large collection of
standard documents and relevance judgements should be a valuable IR resource and an
important step in the systematic development of more effective retrieval systems.
2. Application of LSI to the TREC collection
2.1 Overview
We used existing LSI/SVD software for analyzing the training and test collections, and for query
processing and retrieval. For pragmatic reasons, we divided the TREC collection into 9
subcollections - AP1, DOE1, FR1, WSJ1, ZIFF1, AP2, FR2, WSJ2, ZIFF2. Queries were run
against the appropriate subcollections, and the results were recombined to arrive at a single
ranked output.
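Recombining the per-subcollection returns amounts to merging several ranked lists into one. The sketch below assumes scores are directly comparable across subcollections (the paper does not detail the recombination rule); the run names and scores are invented for illustration.

```python
import heapq

# Hypothetical ranked outputs from three subcollections: (score, doc_id).
ap1  = [(0.91, "AP1-17"), (0.55, "AP1-3")]
wsj1 = [(0.87, "WSJ1-9"), (0.42, "WSJ1-1")]
doe1 = [(0.60, "DOE1-5")]

def merge_runs(*runs, top_k=5):
    """Merge per-subcollection rankings into one list sorted by score,
    assuming the scores are on a comparable scale across runs."""
    ordered = heapq.merge(*(sorted(r, reverse=True) for r in runs), reverse=True)
    return [doc for _, doc in ordered][:top_k]

merged = merge_runs(ap1, wsj1, doe1)
# merged == ["AP1-17", "WSJ1-9", "DOE1-5", "AP1-3", "WSJ1-1"]
```

Because each subcollection is analyzed separately, score comparability across the resulting spaces is itself an assumption that a real system would need to verify or normalize for.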
There were three main stages involved in processing documents and constructing the relevant
data structures:
1. Pre-processing and indexing (extracting terms, calculating term weights, etc.)
2. Computing the SVD (the number of dimensions ranged from 235 to 310)
3. Adding new documents and/or terms
All steps were completely automatic and involved no human intervention. The resulting
reduced-dimension representations were used for matching and retrieval.
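The three stages can be sketched on a toy term-document matrix. This uses numpy's dense SVD as a stand-in for the sparse iterative SVD software the authors used (an assumption), and the "fold-in" projection for stage 3 is the standard LSI formula for adding a new document without recomputing the decomposition.

```python
import numpy as np

# Stage 1 (result): a toy term-document matrix (terms x documents).
# Real TREC matrices are vastly larger and sparse.
X = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 1., 1.],
    [0., 0., 1., 2.],
])

k = 2  # reduced dimensionality (the paper used 235-310 for TREC)

# Stage 2: truncated SVD, X ~= U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents live in the k-dim space as the columns of S_k V_k^T.
doc_vecs = (np.diag(s_k) @ Vt_k).T

# Stage 3: "folding in" a new document d (a weighted term vector) via
# d_hat = d^T U_k S_k^{-1}, the standard LSI fold-in projection.
new_doc = np.array([1., 0., 1., 0.])
folded = new_doc @ U_k @ np.diag(1.0 / s_k)
```

Folding in is cheap but approximate: heavily folded-in collections gradually drift from what a full recomputation of the SVD would give.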
2.1.1 Pre-processing and indexing
We did minimal pre-processing on the raw text of the TREC documents. Some markups (any
text within <> delimiters) were removed, and all hand-indexed entries were removed from the
WSJ and ZIFF collections. Upper case characters were translated into lower case, punctuation
was removed, and white spaces were used to delimit terms. A minimum term length of 2 was
used.
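The pre-processing steps just listed can be sketched as a small tokenizer. The regular expressions here are a plausible reading of "markup removal" and "punctuation removal," not the authors' actual code.

```python
import re

def tokenize(text, min_len=2):
    """Minimal TREC-style pre-processing: strip <...> markup, lowercase,
    turn punctuation into whitespace, split on whitespace, and enforce
    a minimum term length (2 in the paper)."""
    text = re.sub(r"<[^>]*>", " ", text)      # remove SGML-style markup
    text = text.lower()                       # upper case -> lower case
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation -> whitespace
    return [t for t in text.split() if len(t) >= min_len]

tokens = tokenize("<DOC>The LSI method, applied to TREC-1.</DOC>")
# tokens == ["the", "lsi", "method", "applied", "to", "trec"]
```

Note that stop-list filtering and the requirement that a term occur in more than one document happen later, when the term-document matrix is built, so common words like "the" survive this stage.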
All terms occurring in more than one document, and not on a stop list of 439 words, were used to
generate a term-document matrix. We did not use: stemming, phrases, syntactic or semantic
parsing, word sense disambiguation, heuristic association, spelling checking or correction,