NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

LSI meets TREC: A Status Report
S. Dumais
computed and sorted in 1 second. For the routing queries, the time to compare a new document
to the filters (50 filters in each of 4 databases) was about 0.2 seconds on a Sparc2, even when the
term and filter vectors were read from disk.
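The routing comparison described above amounts to a bank of dot products between a new document's reduced vector and the stored filter vectors. The following is a minimal sketch of that step; the dimensionality, filter count, threshold, and all variable names are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

# Illustrative sketch of routing: score one new document's LSI vector
# against a bank of 50 stored filter vectors via cosine similarity.
K = 200          # assumed number of LSI dimensions
N_FILTERS = 50   # filters per database, as in the text

rng = np.random.default_rng(0)
filters = rng.standard_normal((N_FILTERS, K))
doc = rng.standard_normal(K)

# Normalize once; each routing decision is then one matrix-vector product.
filters /= np.linalg.norm(filters, axis=1, keepdims=True)
doc /= np.linalg.norm(doc)

scores = filters @ doc                   # cosine similarity to each filter
matched = np.flatnonzero(scores > 0.3)   # route to filters above a threshold
```

With unit-length vectors the whole comparison is a single 50-by-K matrix-vector product, which is consistent with the sub-second timings reported above.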
3. Improving performance
3.1 Time
The LSI/SVD system was built as a research prototype to investigate many different information
retrieval and interface issues. Retrieval efficiency was not a central concern because we first
wanted to assess whether the method worked before worrying about efficiency, and because the
initial applications of LSI involved much smaller databases of a few thousand documents.
Almost no effort went into re-designing the tools to work efficiently for the large TREC
databases. There are some obvious ways to decrease indexing, SVD, and retrieval time, and we
discuss some of them below.
3.1.1 Pre-processing, SVD, and database creation
About 10% of the time is spent in unnecessary I/O translation (e.g., transposing large matrices),
and this will be eliminated soon.
About 60%-70% of the time is spent in the SVD. SVD algorithms get faster all the time. The
sparse, iterative algorithm we now use is about 100 times faster than the method we used
initially. There are the usual speed-memory tradeoffs in the SVD algorithms, so time can
probably be decreased some by using a different algorithm and more memory. Parallel
algorithms will help a little, but probably only by a factor of 2. Finally, all calculations are now
done in double precision, so both time and memory could be decreased by using single
precision. Preliminary experiments with smaller IR test collections suggest that this decrease in
precision will not lead to numerical problems for the SVD algorithm.
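A sparse, iterative, truncated SVD in single precision can be sketched as follows. This is not the authors' code; the matrix size, density, and number of retained dimensions are illustrative assumptions, and `scipy.sparse.linalg.svds` stands in for whatever iterative solver was actually used.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Assumed toy term-document matrix: 1000 terms x 500 documents, ~1% nonzero,
# stored in single precision to halve memory as the text suggests.
A = sparse_random(1000, 500, density=0.01, format="csr",
                  random_state=0, dtype=np.float32)

k = 100  # assumed number of LSI dimensions to keep
# Iterative solver computes only the k largest singular triplets,
# never forming the full dense decomposition.
U, s, Vt = svds(A, k=k)
```

Because the solver touches only the nonzero entries and computes only k factors, its cost scales with the number of nonzeros rather than with the full matrix size.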
It is important to note that the pre-processing and SVD analyses are one-time-only costs for
relatively stable domains. We have also found that at least as many new items as were in the
original analysis can be added to an existing SVD database without redoing the SVD scaling
and without seriously diminishing retrieval effectiveness.
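Adding new items without redoing the SVD is commonly done in the LSI literature by "folding in": given a rank-k decomposition A ≈ U_k S_k V_kᵀ, a new document's term vector d is placed in the reduced space as dᵀ U_k S_k⁻¹. The sketch below illustrates that projection; the dimensions and the synthetic U and S are assumptions for demonstration only.

```python
import numpy as np

# Sketch of LSI folding-in: project a new document into an existing
# k-dimensional space using the stored term matrix U and singular values s,
# with no re-decomposition. All sizes here are illustrative.
rng = np.random.default_rng(1)
t, k = 2000, 100
U = np.linalg.qr(rng.standard_normal((t, k)))[0]   # orthonormal term vectors
s = np.sort(rng.uniform(1.0, 10.0, k))[::-1]       # singular values, descending

d = rng.poisson(0.01, t).astype(float)  # new document's raw term counts
d_hat = (d @ U) / s                     # k-dimensional document vector
```

The cost per new document is one t-by-k matrix-vector product, so folding in even a large batch of documents is cheap compared with recomputing the SVD.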
3.1.2 Query construction and retrieval
Some calculations (e.g., scalings of various sorts) were done on the fly but could easily be
precomputed. In addition, all calculations were done in floating point, and could be sped up
using integer arithmetic. By dropping all very low vector values, sparse vectors could be
constructed to take advantage of the efficient sparse-vector techniques already used by
vector-space retrieval systems.
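Dropping near-zero components to obtain sparse vectors can be sketched as below; the threshold value and the helper name `sparsify` are our illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsify(v, threshold=0.01):
    """Zero out components below an (assumed) magnitude threshold and
    store the result in a sparse row format for fast inner products."""
    w = np.where(np.abs(v) >= threshold, v, 0.0)
    return csr_matrix(w)  # 1-D input becomes a 1 x len(v) sparse row

rng = np.random.default_rng(2)
v = rng.standard_normal(300) * (rng.random(300) < 0.1)  # mostly near-zero
sv = sparsify(v)
```

Once vectors are sparse, query-document scores cost time proportional to the number of retained components rather than the full dimensionality.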
Query vectors were compared to every document. Document clustering could be used to reduce
the number of comparisons. Query matching can also be improved tremendously by simply
using more than one machine or parallel hardware. Parallel query matching produces much
larger gains than any of the modifications discussed above. Using a 16,000 PE MasPar, with no