NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
because many terms are now represented in more than one database. There were only
about 200,000 unique terms, so a combined database would have been about 930 meg. The
fact that the number of terms does not grow linearly with the number of documents will
help make large SVD calculations possible.
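The sublinear growth of vocabulary with collection size noted above is commonly modeled by Heaps' law, V(n) ≈ K·n^β with β < 1. The sketch below is purely illustrative; the constants K and β are placeholders, not values fitted to the TREC collections:

```python
# Illustrative Heaps' law sketch: vocabulary grows sublinearly with tokens.
# K and beta below are ASSUMED example constants, not fitted to TREC data.
K, beta = 44.0, 0.49

def vocab_size(n_tokens):
    """Estimated number of unique terms after n_tokens running words."""
    return K * n_tokens ** beta

# Doubling the collection grows the vocabulary by only 2**beta (about 1.4x),
# which is why a merged database stays far smaller than the sum of its parts.
v1 = vocab_size(100_000_000)
v2 = vocab_size(200_000_000)
ratio = v2 / v1
```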
2.1.2 SVD
The SVD program takes the log-entropy transformed term-document matrix as input and
calculates the best "reduced-dimension" approximation to this matrix. The result of the SVD
analysis is a reduced-dimension vector for each term and each document, and a vector of the
singular values. For TREC, we computed a separate SVD for each of the 9 subcollections. The
number of dimensions, k, ranged from 235 to 310.
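The two steps above, log-entropy weighting followed by a truncated SVD, can be sketched in NumPy. This is a minimal illustration on a toy matrix, assuming the standard log-entropy weighting used in LSI work; it is not the paper's actual indexing code, and the function and variable names are mine:

```python
import numpy as np

def log_entropy_weight(tf):
    """Log-entropy transform of a raw term-by-document frequency matrix.

    Local weight: log(1 + tf_ij).  Global weight for term i:
    1 + sum_j (p_ij * log p_ij) / log n, where p_ij = tf_ij / gf_i.
    """
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)            # global frequency per term
    p = np.divide(tf, gf, out=np.zeros_like(tf, dtype=float), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_docs)
    return np.log1p(tf) * entropy

# Toy 4-term x 4-document frequency matrix (illustrative data only).
tf = np.array([[2, 0, 1, 0],
               [1, 3, 0, 0],
               [0, 1, 2, 1],
               [0, 0, 1, 2]], dtype=float)
X = log_entropy_weight(tf)

# Truncated SVD: keep only the k largest singular values and vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]          # reduced-dim term/doc vectors
X_k = Uk @ np.diag(sk) @ Vtk                      # best rank-k approximation
```

By the Eckart-Young theorem, X_k is the closest rank-k matrix to X in the Frobenius norm, which is the sense in which the SVD gives the "best" reduced-dimension approximation.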
2.1.3 Adding new documents and terms
As noted above, the initial indexing and SVD were typically performed on a random sample of
documents from each subcollection. The documents not included in the sample were "folded in"
to the database. These documents were located at the weighted vector sum of their constituent
terms. That is, the vector for a new document was computed using the term vectors for all terms
in the document. These term vectors were combined using the appropriate term weights, and the
singular values to differentially weight each dimension. (Details are given in Deerwester et al.,
1990, p. 399.) For documents that are actually present in the term-document matrix, the derived
vector corresponds exactly to the document vector given by the SVD. New terms can be added
in an analogous fashion. The vector for new terms is computed using the document vectors of
all documents in which the term appears. For the TREC experiments, only new documents, not
terms, were added. The sizes of the complete databases (including all documents which were
added) are summarized in Table 1.
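The fold-in step described above has a simple closed form: a new document's weighted term vector d is projected into the existing space as d̂ = Σ_k⁻¹ U_kᵀ d, i.e. the weighted sum of the constituent term vectors, rescaled per dimension by the singular values. A minimal NumPy sketch (variable names are illustrative, not from the paper's code):

```python
import numpy as np

# Assume we already have the k-dimensional SVD factors of the weighted
# term-document matrix X (random data here, purely for illustration).
rng = np.random.default_rng(0)
X = rng.random((6, 5))                       # 6 terms x 5 documents
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
Uk, sk = U[:, :k], s[:k]

def fold_in(d, Uk, sk):
    """Place a weighted term vector d into the existing k-dim semantic space.

    Combines the term vectors for the terms in d (via Uk.T @ d) and uses the
    singular values to differentially weight each dimension (division by sk).
    """
    return (Uk.T @ d) / sk

# Folding in a document already present in X reproduces the document vector
# the SVD itself assigned to it (its row of Vt, truncated to k dimensions).
d_hat = fold_in(X[:, 2], Uk, sk)
```

This also shows why new items can be added cheaply: fold-in is a single matrix-vector product per document, with no recomputation of the SVD.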
When adding documents and terms in this manner, we assume that the derived "semantic space"
is fixed and that new items can be fit into it. In general, this is not the same space that one would
obtain if a new SVD were calculated using both the original and new documents. In previous
experiments, we found that sampling and scaling 50% of the documents, and "folding in" the
remaining documents resulted in performance that was indistinguishable from that observed
when all documents were scaled.
2.2 Timing data
For the TREC experiments, all the pre-processing and retrieval was done on a Sparc2 with 384
meg of RAM. The SVD analyses were run on a DEC 5000 with approximately 380 meg of RAM.
Table 2 provides a summary of times (in minutes) to process documents and create the necessary
data structures. It is important to note that these costs are incurred only once at the beginning.
Subsequent query processing does not require any new SVD calculations or database updates.