SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
chapter
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
2.3.3 Routing queries
For the routing queries, we created a filter or profile for each of the 50 training topics. We
submitted results from two sets of routing queries. In one case, the filter was based on just the
topic statements - i.e., we treated the routing queries as if they were adhoc queries. The filter
was located at the vector sum of the terms in the topic. We call these the routing_topic cosine
results. In the other case, we used feedback about relevant documents from the training set. The
filter in this case was derived by taking the vector sum of all relevant documents. We call these
the routing_reldocs cosine results. This was an atypical variant of relevance feedback in that
we replaced (rather than combined) the original topic with relevant documents. These two
extremes provide baselines against which to compare methods for combining information from
the original query and feedback about relevant documents.
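The two baseline filters can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the function names, array shapes, and the optional weighting are illustrative assumptions.

```python
import numpy as np

def topic_filter(term_vectors, topic_term_ids, weights=None):
    # Filter from the topic statement alone: the (optionally weighted)
    # vector sum of the topic's term vectors in the reduced LSI space.
    vecs = term_vectors[topic_term_ids]
    if weights is not None:
        vecs = vecs * weights[:, None]
    return vecs.sum(axis=0)

def reldocs_filter(doc_vectors, relevant_doc_ids):
    # Filter from feedback alone: the vector sum of all relevant
    # training documents, replacing (not combining with) the topic.
    return doc_vectors[relevant_doc_ids].sum(axis=0)
```

Methods that combine the topic vector with the relevant-document vectors would fall between these two extremes.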
In both cases, the filter was a single vector. New documents were matched against the filter
vector and ranked in decreasing order of similarity. We have previously conducted experiments
which suggest that performance can be improved if the filter is represented as several separate
vectors. We did not use this method for the TREC results we submitted, but would like to do so
in subsequent experiments. (See also Kane-Esrig et al., 1991 or Foltz and Dumais, 1992, for a
discussion of multi-point interest profiles in LSI.)
For each topic, a separate filter vector was created in 4 training databases (WSJ1, AP1, FR1,
ZIFF1). Thus each of the 50 filters was represented in 4 different databases. New documents
(those from WSJ2, AP2, FR2, ZIFF2) were automatically pre-processed and indexed as
described above. New documents were "folded in" to the comparable training database and
compared to the filter vector in that database. For example, a new document from WSJ2 was
added to the WSJ1 database and compared to the 50 filters in that database. It is important to
note that only term vectors and term weights from the training subcollections were used in
indexing and comparing these new documents to the routing filters.
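The "folding in" step can be sketched as below. This is a hedged sketch of the standard LSI fold-in procedure, not necessarily the exact routine used here; the scaling by inverse singular values and all names are assumptions.

```python
import numpy as np

def fold_in(term_vectors, singular_values, term_ids, term_weights):
    # Place the new document at the weighted sum of its terms' vectors,
    # using only the training subcollection's term vectors and weights,
    # then rescale by the inverse singular values so the result is
    # directly comparable to the training document vectors.
    raw = (term_vectors[term_ids] * term_weights[:, None]).sum(axis=0)
    return raw / singular_values
```

The folded-in document vector is then compared against each of the 50 filter vectors in that database.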
We used 250 dimensions and a cosine similarity measure for all routing comparisons. Results
from the different training subcollections were combined using raw cosine values.
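Combining by raw cosine amounts to pooling the per-subcollection result lists and re-sorting on the unadjusted similarity score, roughly as in this sketch (an assumption about the merge, since the paper does not spell it out):

```python
def merge_by_raw_cosine(runs):
    # Pool (doc_id, cosine) pairs from each subcollection's ranked run
    # and re-rank the union on the raw cosine value, highest first.
    pooled = [pair for run in runs for pair in run]
    return sorted(pooled, key=lambda p: p[1], reverse=True)
```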
2.3.4 Matching/retrieval times
Both queries (or filters) and documents were represented as reduced-dimension vectors. The
cosine between each query and every document was computed, and documents were ranked in
decreasing order of similarity to the query. The cosines were computed using 235 dimensions
for the adhoc queries and 250 dimensions for the routing queries. Although there are many
fewer dimensions than in standard vector retrieval, the entries are almost all non-zero so inverted
indices are not useful. This means that each query must be compared to every document.
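The exhaustive match-and-rank step is straightforward; a minimal NumPy sketch (illustrative, not the implementation used for the runs):

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vectors):
    # Normalize query and documents, take dot products, and sort
    # descending. Every document is scored, since the dense
    # reduced-dimension vectors defeat inverted-index shortcuts.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)
    return order, sims[order]
```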
For 250-dimensional vectors, about 50,000 cosines can be computed per minute on a Sparc2.
This time includes both comparison time and ranking time, and assumes that all document
vectors are pre-loaded into memory. For the adhoc queries, the time to compare a query to the
742,986 documents was about 12 minutes if all comparisons were sequential. It is
straightforward to split this matching across several machines or to use parallel hardware.
Preliminary experiments using a 16,000 PE MasPar showed that 50,000 cosines could be