NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

LSI meets TREC: A Status Report

S. Dumais

National Institute of Standards and Technology
Donna K. Harman

computed and sorted in 1 second. For the routing queries, the time to compare a new document to the filters (50 filters in each of 4 databases) was about 0.2 sec on a Sparc2, even when the term and filter vectors were read from disk.

3. Improving performance

3.1 Time

The LSI/SVD system was built as a research prototype to investigate many different information retrieval and interface issues. Retrieval efficiency was not a central concern, both because we first wanted to assess whether the method worked before worrying about efficiency, and because the initial applications of LSI involved much smaller databases of a few thousand documents. Almost no effort went into redesigning the tools to work efficiently for the large TREC databases. There are some obvious ways to decrease indexing, SVD, and retrieval time, and we discuss some of them below.

3.1.1 Pre-processing, SVD, and database creation

About 10% of the time is spent in unnecessary I/O translation (e.g., transposing large matrices), and this will be eliminated soon. About 60%-70% of the time is spent in the SVD. SVD algorithms get faster all the time. The sparse, iterative algorithm we now use is about 100 times faster than the method we used initially. There are the usual speed-memory tradeoffs in SVD algorithms, so time can probably be decreased somewhat by using a different algorithm and more memory. Parallel algorithms will help a little, but probably only by a factor of 2. Finally, all calculations are now done in double precision, so both time and memory could be decreased by using single precision. Preliminary experiments with smaller IR test collections suggest that this decrease in precision will not lead to numerical problems for the SVD algorithm.
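As an illustration of the kind of sparse, iterative truncated SVD discussed above, the sketch below uses SciPy's `svds` (an iterative Lanczos-style solver) on a toy term-document matrix and compares a double-precision run against a single-precision one. The matrix, its size, and the tolerance are invented for the example; they are not taken from the TREC system.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Toy term-document matrix (terms x documents); real TREC
# matrices are vastly larger and much sparser.
A = csc_matrix(np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
]))

k = 2  # number of latent dimensions to retain

# Sparse, iterative truncated SVD in double precision.
U, s, Vt = svds(A, k=k)

# Same computation in single precision: roughly half the memory.
U32, s32, Vt32 = svds(A.astype(np.float32), k=k)

# The retained singular values should agree closely across precisions,
# consistent with the observation that reduced precision is benign here.
print(np.allclose(np.sort(s), np.sort(s32), atol=1e-3))
```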
It is important to note that the pre-processing and SVD analyses are one-time-only costs for relatively stable domains. We have also found that at least as many new items as were in the original analysis can be added to an existing SVD database without redoing the SVD scaling or seriously diminishing retrieval effectiveness.

3.1.2 Query construction and retrieval

Some calculations (e.g., scalings of various sorts) were done on the fly but could easily be precomputed. In addition, all calculations were done in floating point and could be sped up using integer arithmetic. By dropping all very low vector values, sparse vectors could be constructed to take advantage of the efficient sparse-vector methods currently used by vector retrieval systems. Query vectors were compared to every document; document clustering could be used to reduce the number of comparisons. Query matching can also be improved tremendously by simply using more than one machine or parallel hardware. Parallel query matching produces much larger gains than any of the modifications discussed above. Using a 16,000 PE MasPar, with no
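Two of the ideas in this section, folding new documents into an existing SVD database and ranking every document against a query vector with precomputed scalings, can be sketched as follows. This is a minimal numpy illustration using the standard LSI folding-in projection (a new item's term vector projected through the existing U and singular values); all sizes, densities, and the random data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-document matrix and its truncated SVD
# (tiny sizes for illustration; TREC-scale matrices are far larger).
n_terms, n_docs, k = 200, 50, 10
A = rng.random((n_terms, n_docs)) * (rng.random((n_terms, n_docs)) < 0.05)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, s, Vt = U[:, :k], s[:k], Vt[:k]
doc_vecs = Vt.T  # each row: one document in the k-dim latent space

# Fold in a new document without redoing the SVD: project its raw
# term vector through the existing factors.
new_doc = rng.random(n_terms) * (rng.random(n_terms) < 0.05)
folded = (new_doc @ U) / s
doc_vecs = np.vstack([doc_vecs, folded])

# Query matching: precompute the per-document norms once instead of
# rescaling on the fly, then rank all documents by cosine similarity.
query_terms = rng.random(n_terms) * (rng.random(n_terms) < 0.05)
q = (query_terms @ U) / s
doc_norms = np.linalg.norm(doc_vecs, axis=1)
scores = doc_vecs @ q / (doc_norms * np.linalg.norm(q))
ranking = np.argsort(-scores)  # best-matching documents first
```

Clustering the document vectors, or splitting `doc_vecs` across machines, would reduce or parallelize the one-against-all comparison loop that `doc_vecs @ q` performs here.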