NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
S. Dumais
National Institute of Standards and Technology
D. K. Harman
transformed term-document matrix as input, and
calculates the best "reduced-dimension91
approximation to this matrix. The result of the SVD
analysis is a reduced-dimension vector for each term
and each document, and a vector of the singular
values. The number of dimensions, k, was between
200 and 300 in our experiments. This reduced-
dimensional representation is used for retrieval. The
cosine between term-term, document-document, or
term-document vectors is used as the measure of
similarity between them.
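The reduced-dimension representation described above can be sketched with a small example. The matrix values and k below are illustrative only (the paper's matrices are far larger, with k between 200 and 300):

```python
import numpy as np

# Toy term-document matrix A (rows = terms, cols = documents);
# in the paper this would be the weight-transformed matrix.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

k = 2  # number of retained dimensions (200-300 in the paper)

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation: a k-dim vector per term, per document, plus singular values
term_vecs = U[:, :k]
doc_vecs = Vt[:k, :].T
sing_vals = s[:k]

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Similarity between documents 0 and 2 in the reduced space
# (doc vectors scaled by the singular values before comparison)
d0 = doc_vecs[0] * sing_vals
d2 = doc_vecs[2] * sing_vals
sim = cosine(d0, d2)
print(round(sim, 3))
```

The same cosine can be taken between two term vectors or between a term vector and a document vector, as the text notes.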
We were recently able to compute the SVD analysis
of the full 742k x 104k matrix described above, but
not in time to include the results in this paper. For the
runs submitted, we used a sample of documents from
the above matrix. When appropriate, the documents
that were not sampled were "folded in" to the resulting
reduced-dimension LSI space. In all cases, we used
the weights from the 742k x 104k matrix (and did not
recompute them for our samples).
For the routing experiments, we used the subset of
documents for which we had relevance judgements.
There were 68385 unique documents with relevance
judgements. The SVD analysis was computed on the
relevant 68385 document x 88112 term subset of the
above matrix, containing 14461782 non-zero cells. A
204 reduced-dimensional approximation took 18 hrs
of CPU time to compute on a Sparc10 workstation.
This 204-dimensional representation was used for
matching and retrieval.
For the adhoc experiments, we took a random sample
of 70000 documents. A reduced-dimensional SVD
approximation was computed on a 69997 document x
82968 term matrix (7666044 non-zeros). The
resulting 199 reduced-dimensional representation was
used for retrieval. The 672331 documents not
included in this sample were "folded in" to the 199-
dimension LSI space, and the adhoc queries were
compared against all 742k documents.
These "folded in" documents were located at the
weighted vector sum of their constituent terms. That
is, the vector for a new document was computed using
the term vectors for all terms in the document. For
documents that are actually present in the term-
document matrix, this derived vector corresponds
exactly to the document vector given by the SVD.
New terms can be added in an analogous fashion. The
vector for new terms is computed using the document
vectors of all documents in which the term appears.
When adding documents and terms in this manner, we
assume that the derived "semantic space" is fixed and
that new items can be fit into it. In general, this is not
the same space that one would obtain if a new SVD
were calculated using both the original and new
documents. In previous experiments, we found that
sampling and scaling 50% of the documents, and
"folding in" the remaing documents resulted in
peiformance that was indistinguishable from that
observed when all documents were scaled. Here,
however, the scaling is based on less than 10% of the
total corpus.
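The "folding in" operation can be sketched concretely. Placing a new document at the weighted vector sum of its term vectors amounts to projecting its term-weight vector d through the truncated SVD factors, d_hat = d^T U_k diag(1/s_k); for a document already in the matrix this reproduces its SVD document vector exactly, as the text states. The matrix below is a toy stand-in:

```python
import numpy as np

# Toy setup: truncated SVD of an existing term-document matrix
# (illustrative values, not the paper's data)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

def fold_in(doc_term_weights):
    """Place a new document at the scaled weighted sum of its
    constituent term vectors: d_hat = d^T U_k diag(1/s_k)."""
    return doc_term_weights @ Uk / sk

# Folding in a column already present in A reproduces its SVD doc vector
d_hat = fold_in(A[:, 2])
assert np.allclose(d_hat, Vk[2])
```

New terms are handled symmetrically, using the document vectors of the documents in which the term appears. Note that folding in leaves U_k and s_k untouched, which is exactly why the resulting space differs from one recomputed over old and new documents together.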
We also had (from TREC-1) LSI analyses of the 9
subcollections in CD-1 and CD-2.
3.3 Queries and retrieval
Queries were automatically processed in the same way
as documents. For queries derived from the topic
statement, we began with the full text of each topic
(all topic fields), and stripped out the SGML field
identifiers. For feedback queries, we used the full text
of relevant documents. A query vector (or new
document in the case of routing) indicating the
frequency with which each term appears in the query
was automatically generated for each topic. The
query was transformed using SMART's "ltc"
weighting.
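One common reading of SMART's "ltc" scheme is: logarithmic term frequency (l), idf (t), and cosine normalization (c). The sketch below follows that reading; exact SMART variants differ in details, and the term counts and collection statistics are hypothetical:

```python
import math

def ltc_weights(tf_counts, doc_freq, n_docs):
    """SMART 'ltc' weighting as commonly defined:
    l: 1 + ln(tf);  t: ln(N/df) idf;  c: cosine-normalize the vector.
    (A sketch; exact SMART variants differ slightly.)"""
    raw = {t: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[t])
           for t, tf in tf_counts.items() if tf > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# Hypothetical query term counts and collection statistics
q = ltc_weights({"latent": 1, "semantic": 2, "indexing": 1},
                doc_freq={"latent": 50, "semantic": 200, "indexing": 100},
                n_docs=742_000)
print({t: round(w, 3) for t, w in q.items()})
```

After this transformation the query vector has unit length, so the dot product with a unit-length document vector is directly the cosine.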
Note that we did not use any Boolean connectors or
proximity operators in query formulation. (The
implicit connectives, as in ordinary vector methods,
fall somewhere between ORs and ANDs, but with an
additional kind of "fuzziness" introduced by the
dimension-reduced association matrix representation
of terms and documents.)
The terms in the query are used to identify a vector in
the LSI space; recall that each term has a vector
representation in the space. A query is simply located
at the weighted vector sum of its constituent term
vectors. The cosine between the query vector and
every document vector is computed, and documents
are ranked in decreasing order of similarity to the
query. (Although there are many fewer dimensions
than in standard vector retrieval, the entries are almost
all non-zero so inverted indices are not useful. This
means that each query must be compared to every
document.)
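The retrieval step above reduces to one dense matrix-vector product followed by a sort. The sketch below uses random unit vectors as stand-ins for real LSI vectors (k = 200 as in the paper; 1000 documents for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_docs = 200, 1000   # 200 dims as in the paper; small doc count for the sketch

# Pre-normalized dense document vectors in the reduced space
docs = rng.standard_normal((n_docs, k))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Query = weighted sum of its terms' vectors (a random stand-in here)
query = rng.standard_normal(k)
query /= np.linalg.norm(query)

# Dense vectors rule out inverted indices: score every document
scores = docs @ query           # cosines, since everything is unit-length
ranking = np.argsort(-scores)   # document indices in decreasing similarity
print(ranking[:5])
```

Because the reduced-dimension vectors are almost entirely non-zero, this exhaustive comparison is unavoidable, which is what makes the per-query timings reported next relevant.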
For 200-dimensional vectors, about 60,000 cosines
can be computed per minute on a Sparc2. This time
includes both comparison time and ranking time, and
assumes that all document vectors are pre-loaded into
memory. For the adhoc queries, the time to compare a
query to the 742k documents was about 10 minutes if