NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Latent Semantic Indexing (LSI) and TREC-2 chapter S. Dumais National Institute of Standards and Technology D. K. Harman small difference between 151 and a comparable vector method in the ThEC environment, given that we have consistenfly observed larger advantages previously. The most likely reason for this is that the `IREC topics are much rich& and more detailed descriptions of searchers' needs than are found in typical IR requests. The average `IREC query has 51 unique words in it, and many of these are very specific names. Since 151 is primarily a recall enhancing method it has littie effect on these already very rich queries. This is much the same conclusion that groups who tried various kinds of query expansion (another recall enhancing method) reached - e.g., see Voorhees 1993. We tried one analysis using the new "suuunary" field for each topic. This field alone is used as a much shorter topic statement (often quite similar to the description field) that covers the relevance assessments. These results are summ&i[OCRerr][OCRerr]d in Table 5. As expected, overall performace is lower than with the ccxnplete topics. More interestingly, the difference between 151 and the standard vector method is now larger - 16% in average precision. This is still a somewhat smaller advantage than we have seen in previous experiments with smaller test collections, but even the summary queries have an average of 11 unique words in them. Table 5 isiasm lsial* 38175 38175 4. Improving performance 4.1 Improving performance - Speed The 15llSVD system was built as a research prototype to investigate many different information retrieval and interface issues. Retrieval efficiency was not a central concern because we first wanted to assess whether the method worked before worrying about efficiency, and because the initial applications of 151 involved much smaller databases of a few thousand documents. Almost no effort went into re-designing the tools to work efficiendy for the large TREC databases. 4.1.1 SVD SVD algorithms get faster all the time. The sparse, iterative algorithm we now use is about 100 times faster than the method we used inltially (Berry, 1992). There are the usual speed-memory tradeoffs in the SVD algorithms, so time can probably be decreased some by using a different algorithm and more memory. Parallel algorithms will help a littie, but probably only by a factor of 2. Finally, all calculations are now done in double precision, soboth time and memory could be decreased by using single precision computations. Preliminary experiments with smaller IR test collections suggest that this decrease in precision will not lead to numerical problems for the SVD algorithm. It is important to note that the pre- processing and SVD analyses are one-time-only costs for relatively stable doinains. Rel_ret 8043 Avg prec .2589 Pr at 100 .3344 Prat 10 .3420 R-prec .3114 Table 5: 151 Adhoc Results - Comparison of standard vector using only documents for judgements were available, Summary field. 4.1.2 Retrieval Query vectors are conipared to eveiy document. This process is linear in the number of documents in the database, and can be quite slow for large databases. Although there are no practical and efficient algorithms for finding nearest neighbors in 200- or 300-dimensional spaces, there are several methods which could be used to speed retrieval. a) Document clustering could be used to reduce the number of comparisons, but accuracy would probably suffer some. We have explored several heuristics for clustering, but none are particularly effective when high levels of accuracy are maintained. b) HNC's Matchplus system (Gallant et al., 1993) uses another approach to reduce the number of alternative documents that must be matched. They use an inltial keyword match to elilninate many documents and calculate reduced-dimension vector scores for only the subset of documents meeting the inltial keyword restriction. This may be a reasonable alternative for long queries (like `IREC), but we believe that recall would be reduced for short queries. In addition two 8676 .3008 .3710 .3680 .3386 Summary Topics. method with 151 which relevance using only the The reduced dimension 151 vector retrieval method can offer performance advantages compared to a standard vector method for the large ThEC collection. The advantages are larger with shorter queries, as expected. The exact natwe of this advantage (e.g., which documents are retrieved by 151 but not the standard vector method) needs to be examined in more detall. 111