NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
chapter
S. Dumais
National Institute of Standards and Technology
D. K. Harman
all comparisons were sequential. It is, however,
straightforward to split this matching across several
machines or to use parallel hardware since all
documents are independent. Preliminary experiments
using a 16,000 PE MasPar showed that 60,000 cosines
could be computed and sorted in less than 1 second.
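As a rough illustration (not the system's actual code), the sketch below scores a batch of placeholder document vectors against a single query vector by cosine and sorts the results; because each document's score depends only on its own vector, the chunks could be handled by separate machines or processing elements and the partial results merged. Dimensions and data are assumptions.

    import numpy as np

    def cosine_scores(query_vec, doc_matrix):
        """Cosine similarity of one query against every row of doc_matrix."""
        q = query_vec / np.linalg.norm(query_vec)
        norms = np.linalg.norm(doc_matrix, axis=1)
        return (doc_matrix @ q) / norms

    rng = np.random.default_rng(0)
    docs = rng.standard_normal((60_000, 204))   # placeholder: 60k docs in a 204-dim space
    query = rng.standard_normal(204)

    # Chunks are independent, so they can be scored in parallel and merged.
    chunk_scores = [cosine_scores(query, chunk) for chunk in np.array_split(docs, 4)]
    scores = np.concatenate(chunk_scores)
    ranking = np.argsort(-scores)               # best-matching documents first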
It is important to note that all steps in the LSI analysis are completely automatic and involve no human intervention. Documents are automatically processed to derive a term-document matrix. This matrix is decomposed by the SVD software, and the resulting reduced-dimension representation is used for retrieval. While the SVD analysis is somewhat costly in terms of time for large collections, it needs to be computed only once at the beginning to create the reduced-dimension database. (The SVD takes only about 2 minutes on a Sparc 10 for a 2k x 5k matrix, but this time increases to about 18 hours for a 60k x 80k matrix.)
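The indexing step described above can be sketched roughly as follows, using a sparse truncated SVD on a placeholder term-document matrix; the matrix contents, term weighting, and choice of k = 204 dimensions are assumptions for illustration, not the actual TREC-2 setup.

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    terms, docs, k = 2000, 5000, 204
    # Placeholder term-document matrix (entries would normally be weighted term frequencies).
    X = sparse_random(terms, docs, density=0.01, format="csc", random_state=0)

    # Truncated SVD: X is approximated by U_k * diag(s_k) * Vt_k.
    U_k, s_k, Vt_k = svds(X, k=k)

    term_vectors = U_k * s_k                  # each row: a term in the k-dim LSI space
    doc_vectors = (Vt_k * s_k[:, None]).T     # each row: a document in that space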
3.4 TREC-2: Routing experiments
For the routing queries, we created a filter or profile
for each of the 50 training topics. We submitted
results from two sets of routing queries. In one case,
the filter was based on just the topic statements - i.e.,
we treated the routing queries as if they were adhoc
queries. The filter was located at the vector sum of the
terms in the topic. We call these the routing_topic (lsir1) results. This method makes no use of the training data, representing the topic as if it were an adhoc query. In the other case, we used information
about relevant documents from the training set. The
filter in this case was derived by taking the vector sum
or centroid of all relevant documents. We call these
the routing_reldocs (lsir2) results. There were an
average of 328 relevant documents per topic, with a
range of 40 to 896. This is a somewhat unusual
variant of relevance feedback; we replace (rather than
combine) the original topic with relevant documents,
and we do not downweight terms that appear in non-
relevant documents. These two extremes provide
baselines against which to compare other methods for
combining information from the original query and
feedback about relevant documents. In both cases, the
filter was a single vector. New documents were
matched against the filter vector and ranked in
decreasing order of similarity.
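A minimal sketch of the two kinds of filters follows, assuming reduced-dimension term_vectors and doc_vectors (NumPy arrays) from the LSI analysis; the topic terms, weights, and relevance judgments shown are placeholders, not the TREC-2 data.

    import numpy as np

    rng = np.random.default_rng(0)
    term_vectors = rng.standard_normal((2000, 204))   # placeholder LSI term vectors
    doc_vectors = rng.standard_normal((5000, 204))    # placeholder LSI document vectors

    def topic_filter(topic_term_ids, topic_term_weights, term_vectors):
        """lsir1-style filter: weighted vector sum of the topic's term vectors."""
        return (topic_term_weights[:, None] * term_vectors[topic_term_ids]).sum(axis=0)

    def reldoc_filter(relevant_doc_ids, doc_vectors):
        """lsir2-style filter: centroid of the known relevant documents."""
        return doc_vectors[relevant_doc_ids].mean(axis=0)

    # Placeholder topic: a handful of weighted terms and some judged-relevant documents.
    r1 = topic_filter(np.array([3, 17, 250]), np.array([1.0, 0.8, 0.5]), term_vectors)
    r2 = reldoc_filter(np.array([10, 42, 99, 1234]), doc_vectors)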
The new documents (336,306 documents from CD-3)
were automatically processed as described in section
3.2 above. It is important to note that only terms from
the CD-1 and CD-2 training collection were used in
indexing these documents. Each new document is
located at the weighted vector sum of its constituent
term vectors in the 204-dimension LSI space (in just
the same way as queries are handled). New
documents were compared to each of the 50 routing
filter vectors using a cosine similarity measure in
204-dimensions. The 1000 best matching documents
for each filter were submitted to NIST for evaluation.
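The routing step might look roughly like the sketch below: each new document is folded in as the weighted vector sum of its (training-vocabulary) term vectors, then ranked against each filter by cosine, keeping the top 1000 per filter. All inputs are placeholders; this is not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(1)
    term_vectors = rng.standard_normal((2000, 204))   # placeholder training-vocabulary term vectors
    filters = rng.standard_normal((50, 204))          # 50 placeholder routing filter vectors

    def fold_in(term_ids, term_weights, term_vectors):
        """Place a new document at the weighted vector sum of its term vectors."""
        return (term_weights[:, None] * term_vectors[term_ids]).sum(axis=0)

    def top_matches(filter_vector, new_doc_vectors, n=1000):
        """Indices of the n new documents most similar (by cosine) to a filter."""
        d = new_doc_vectors / np.linalg.norm(new_doc_vectors, axis=1, keepdims=True)
        f = filter_vector / np.linalg.norm(filter_vector)
        return np.argsort(-(d @ f))[:n]

    # Fold in a batch of placeholder new documents, then route against each filter.
    new_docs = np.stack([fold_in(rng.integers(0, 2000, size=30),
                                 rng.random(30), term_vectors) for _ in range(5000)])
    submissions = [top_matches(f, new_docs, n=1000) for f in filters]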
3.4.1 Results
The main results of the lsir1 and lsir2 runs are shown in Table 1. The two runs differ only in how the profile vectors were created - using the weighted average of the words in the topic statement for lsir1 (routing_topic), and using the weighted average of all relevant documents from the training collection (CD-1 and CD-2) for lsir2 (routing_reldocs). Not
surprisingly, the lsir2 profile vectors which take
advantage of the known relevant documents do better
than the lsir1 profile vectors that simply use the topic
statement on all measures of performance. The
improvement in average precision is 31% (.2622 vs.
.3442). Users would get an average of 1 additional
relevant document in the top 10 returned using the
lsir2 method for filtering.
Table 1

                 lsir1          lsir2         r1+r2
                 (topic wds)    (rel docs)    (sum r1 r2)
  Rel_ret        6522           7155          7367
  Avg prec       .2622          .3442         .3457
  Pr at 100      .3799          .4524         .4394
  Pr at 10       .5480          .6660         .6620
  R-prec         .3050          .3804         .3786
  Q >= Median    27 (4)         40 (9)        42 (6)
  Q <  Median    23 (0)         10 (0)         8 (0)

Table 1: LSI Routing Results. Comparison of topic words vs. relevant documents as routing filters.
Compared to other TREC-2 systems, LSI does
reasonably well, especially for the routing_reldocs (lsir2) run (and the r1+r2 run to be discussed below). In the case of lsir2, LSI is at or above the median
performance for 40 of the 50 topics, and has the best
score for 9 topics. LSI performs about average for the
routing_topic (lsir1) run even though no information
from the training set was used in forming the routing
vectors in this case (except, of course, for the global
term weights).
We have also performed similar comparisons between