NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman

... because many terms are now represented in more than one database. There were only about 200,000 unique terms, so a combined database would have been about 930 meg. The fact that the number of terms does not grow linearly with the number of documents will help make large SVD calculations possible.

2.1.2 SVD

The SVD program takes the log-entropy transformed term-document matrix as input, and calculates the best "reduced-dimension" approximation to this matrix. The result of the SVD analysis is a reduced-dimension vector for each term and each document, and a vector of the singular values. For TREC, we computed a separate SVD for each of the 9 subcollections. The number of dimensions, k, ranged from 235 to 310.

2.1.3 Adding new documents and terms

As noted above, the initial indexing and SVD were typically performed on a random sample of documents from each subcollection. The documents not included in the sample were "folded in" to the database. These documents were located at the weighted vector sum of their constituent terms. That is, the vector for a new document was computed using the term vectors for all terms in the document. These term vectors were combined using the appropriate term weights, with the singular values used to differentially weight each dimension. (Details are given in Deerwester et al., 1990, p. 399.) For documents that are actually present in the term-document matrix, the derived vector corresponds exactly to the document vector given by the SVD. New terms can be added in an analogous fashion: the vector for a new term is computed using the document vectors of all documents in which the term appears. For the TREC experiments, only new documents, not terms, were added.
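As an illustration of the fold-in step (this is a sketch with a hypothetical toy matrix, not the authors' code): the new document's term-weight vector is projected onto the reduced term space, with each dimension rescaled by its singular value.

```python
import numpy as np

# Hypothetical toy term-document matrix (terms x documents); in the actual
# system the entries would be log-entropy weighted.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [3.0, 1.0, 0.0],
              [0.0, 1.0, 2.0]])

# Best rank-k approximation via the SVD (k = 2 for this sketch).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T  # term vectors, singular values, doc vectors

# "Fold in" a document: weighted sum of its term vectors, with the
# singular values used to differentially weight each dimension.
def fold_in(term_weights):
    return term_weights @ Uk / sk

# For a document already in A, fold-in reproduces its SVD document vector
# exactly (since A^T U_k / s_k equals the first k columns of V).
assert np.allclose(fold_in(A[:, 0]), Vk[0])
```

The identity checked at the end is the property stated in the text: folding in a document that was already part of the term-document matrix recovers its SVD-derived document vector.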
The sizes of the complete databases (including all documents which were added) are summarized in Table 1. When adding documents and terms in this manner, we assume that the derived "semantic space" is fixed and that new items can be fit into it. In general, this is not the same space that one would obtain if a new SVD were calculated using both the original and new documents. In previous experiments, we found that sampling and scaling 50% of the documents, and "folding in" the remaining documents, resulted in performance that was indistinguishable from that observed when all documents were scaled.

2.2 Timing data

For the TREC experiments, all the pre-processing and retrieval was done on a Sparc2 with 384 meg of RAM. The SVD analyses were run on a DEC 5000 with approximately 380 meg of RAM. Table 2 provides a summary of the times (in minutes) to process documents and create the necessary data structures. It is important to note that these costs are incurred only once, at the beginning. Subsequent query processing does not require any new SVD calculations or database updates.
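The fixed-space assumption discussed above can be made concrete with a small sketch (hypothetical random data, not from the TREC experiments): a new document is folded into an existing k-dimensional space, and separately the SVD is recomputed over the enlarged matrix. Because the recomputed space has its own basis (up to sign flips), the two are compared via within-space document similarities rather than raw coordinates; in general they differ slightly.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.random((8, 6))   # hypothetical 8-term x 6-document matrix
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

q = rng.random(8)        # a new document's term-weight vector
d_fold = q @ Uk / sk     # folded into the fixed semantic space

# Alternative: recompute the SVD with the new document included.
A2 = np.column_stack([A, q])
U2, s2, Vt2 = np.linalg.svd(A2, full_matrices=False)
d_new = Vt2[:k, -1]      # the new document's vector in the recomputed space

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Similarity of the new document to document 0, in each space.
sim_fold = cos(d_fold, Vt[:k, 0])
sim_new = cos(d_new, Vt2[:k, 0])
```

The gap between `sim_fold` and `sim_new` is what the fold-in approximation trades away; the 50%-sample experiment reported above found this cost negligible in practice.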