NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Latent Semantic Indexing (LSI) and TREC-2

S. Dumais

National Institute of Standards and Technology
D. K. Harman

representation, but differ in how the term, document and query vectors are formed.

2. LSI and TREC-1

We used the TREC-1 conference as an opportunity to "scale up" our tools, and to explore the LSI dimension-reduction ideas using a very rich corpus of word usage. We were pleased that we were able to use many of the existing LSI/SVD tools on the large collection. (See Dumais, 1993 for details.) We were able to compute the SVDs of 50k docs x 75k words matrices without numerical or convergence problems on a standard DEC 5000 or Sparc10 workstation. Because of these limits on the size of the matrices we could handle, we divided the TREC-1 documents into 9 separate subcollections (AP1, DOE1, FR1, WSJ1, ZIFF1, AP2, FR2, WSJ2, ZIFF2) and worked with these.

There are also some theoretical reasons why working with subcollections makes sense. By using topically coherent subcollections one can get better discrimination among documents. When the ZIFF subcollection is analyzed separately, for example, 200 or so dimensions can be devoted to differences among computer-related topics. When these same documents are part of a large corpus, many fewer indexing dimensions will be devoted to discriminating among them.

In terms of accuracy, LSI performance in TREC-1 was about average. There were some obvious problems with our initial pre-processing of documents (e.g., "U.S." and "A.T.T." were omitted), and there were some unanticipated problems in combining across subcollections. Since many of the top-performing automatic systems used SMART's preprocessing, we chose to do so as well for TREC-2. In addition, using the SMART software for this purpose allows us to compare LSI with the comparable vector method, so that we can examine the contribution of LSI per se.
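The dimension-reduction idea described above can be illustrated with a small sketch. This is not the authors' code: the matrix contents, the choice of k=2, and the cosine-ranking helper are made up for the example; only the general LSI recipe (truncated SVD of a term-document matrix, queries folded into the reduced space) follows the text.

```python
# Illustrative sketch of LSI dimension reduction on a toy term-document
# matrix. All data and the value of k are invented for this example; the
# paper's analyses use hundreds of dimensions on far larger matrices.
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

k = 2  # number of retained LSI dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Documents as k-dimensional vectors: columns of S_k V_k^T, transposed.
doc_vectors = (np.diag(sk) @ Vtk).T

# A query vector is "folded in" to the reduced space: q_hat = q U_k S_k^{-1}.
q = np.array([1.0, 1.0, 0.0, 0.0])  # query mentioning terms 0 and 1
q_hat = q @ Uk @ np.diag(1.0 / sk)

# Rank documents by cosine similarity in the reduced space.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

scores = [cosine(q_hat, d) for d in doc_vectors]
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

With subcollections, each such factorization is computed on one topically coherent matrix, so the k dimensions specialize to distinctions within that subcollection.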
We also planned on comparing an LSI analysis using subcollections (from TREC-1) with an LSI analysis of the entire collection for TREC-2. We had high hopes of being able to build on our TREC-1 work for TREC-2. In practice, however, the changes in the pre-processing algorithm and the decision to use a single combined LSI analysis resulted in our "starting from scratch" in many respects. We have completed some experiments for TREC-2, but we did not get as far as we would have liked, especially for the adhoc topics.

3. LSI and TREC-2

3.1 Pre-processing

We used the SMART system [1] for pre-processing the documents and queries. Some markups (e.g., <> delimiters) were removed, and all hand-indexed entries were removed from the WSJ and ZIFF collections. Upper-case characters were translated into lower case, punctuation was removed, and white space was used to delimit terms. The SMART stop list of 571 words was used as is. The SMART stemmer (a modified Lovins algorithm) was used without modification to strip word endings. We did not use: phrases, proper noun identification, word sense disambiguation, a thesaurus, syntactic or semantic parsing, spelling checking or correction, complex tokenizers, a controlled vocabulary, or any manual indexing.

The result of this pre-processing can be thought of as a term-document matrix, in which each cell entry indicates the frequency with which a term appears in a document. The entries in the term-document matrix were then transformed using an "ltc" weighting. The "ltc" weighting takes the log of individual cell entries, multiplies each entry for a term (row) by the IDF weight of the term, and then normalizes the document (column) length.

We began by processing the 742358 documents from CD-1 and CD-2. Using the minimal pre-processing described above, there were 960765 unique tokens, 512251 unique stems, and 81901331 non-zero entries in the term-document matrix. 742331 documents contained at least one term.
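The "ltc" transformation just described (log of cell entries, times a per-term IDF, then unit-length document columns) can be sketched as follows. This is a minimal sketch following the common SMART convention for "ltc" (1 + log tf, idf = log(N/df), cosine normalization); the exact variant used in the paper may differ in details.

```python
# Sketch of an "ltc"-style weighting, assuming the common SMART recipe:
# l = 1 + log(tf) for non-zero counts, t = idf = log(N / df) per term,
# c = cosine (unit Euclidean length) normalization of each document column.
import math

def ltc_weight(tf_matrix):
    """tf_matrix: list of term rows, each a list of raw counts per document."""
    n_terms = len(tf_matrix)
    n_docs = len(tf_matrix[0])
    # l: log-scaled term frequency
    l = [[(1.0 + math.log(tf)) if tf > 0 else 0.0 for tf in row]
         for row in tf_matrix]
    # t: inverse document frequency per term (row)
    idf = []
    for row in tf_matrix:
        df = sum(1 for tf in row if tf > 0)
        idf.append(math.log(n_docs / df) if df else 0.0)
    lt = [[l[i][j] * idf[i] for j in range(n_docs)] for i in range(n_terms)]
    # c: normalize each document (column) to unit length
    for j in range(n_docs):
        norm = math.sqrt(sum(lt[i][j] ** 2 for i in range(n_terms))) or 1.0
        for i in range(n_terms):
            lt[i][j] /= norm
    return lt
```

Note that a term appearing in every document gets idf = 0 and drops out, which is one reason the weighting favors discriminating terms.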
To decrease the matrix to a size we thought we could handle, we removed tokens occurring in fewer than 5 documents. This resulted in 781421 unique tokens, 104533 unique stemmed words, and 81252681 non-zero entries. We used the resulting 742331 document x 104533 term matrix as the starting point for results reported in this paper. The "ltc" weights were computed on this matrix.

3.2 SVD analysis

The "ltc" matrix described above was used as input to the SVD algorithm. The SVD program takes the ite

1. The SMART system (version 11) was made available through the SMART group at Cornell University. Chris Buckley was especially generous in consultations about how to get the software to do somewhat non-standard things.
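The vocabulary-pruning step in Section 3.1 (dropping tokens that occur in fewer than 5 documents before building the final matrix) can be sketched as below. The function name and the postings-dictionary representation are illustrative choices, not the authors' implementation.

```python
# Sketch of document-frequency pruning: drop terms whose document
# frequency is below min_df (the paper uses 5), then build sparse
# term -> {doc_id: tf} postings over the surviving vocabulary.
from collections import defaultdict

def prune_vocabulary(docs, min_df=5):
    """docs: list of token lists, one per document."""
    df = defaultdict(int)
    for tokens in docs:
        for t in set(tokens):          # count each doc once per term
            df[t] += 1
    vocab = {t for t, n in df.items() if n >= min_df}
    postings = defaultdict(dict)       # term -> {doc_id: term frequency}
    for doc_id, tokens in enumerate(docs):
        for t in tokens:
            if t in vocab:
                postings[t][doc_id] = postings[t].get(doc_id, 0) + 1
    return vocab, postings
```

On the TREC-2 scale this kind of thresholding is what shrinks the raw token set to the 104533-term vocabulary the SVD actually sees, while keeping the matrix sparse.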