SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
chapter
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
2.3.3 Routing queries
For the routing queries, we created a filter or profile for each of the 50 training topics. We
submitted results from two sets of routing queries. In one case, the filter was based on just the
topic statements - i.e., we treated the routing queries as if they were adhoc queries. The filter
was located at the vector sum of the terms in the topic. We call these the routing_topic cosine
results. In the other case, we used feedback about relevant documents from the training set. The
filter in this case was derived by taking the vector sum of all relevant documents. We call these
the routing_reldocs cosine results. This was an atypical variant of relevance feedback in that
we replaced (rather than combined) the original topic with relevant documents. These two
extremes provide baselines against which to compare methods for combining information from
the original query and feedback about relevant documents.
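The two baseline filters can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the function names, array shapes, and the optional weighting are illustrative assumptions.

```python
import numpy as np

def topic_filter(term_vectors, topic_term_ids, weights=None):
    # Filter from the topic statement alone: the (optionally weighted)
    # vector sum of the topic's term vectors in the reduced LSI space.
    vecs = term_vectors[topic_term_ids]
    if weights is not None:
        vecs = vecs * weights[:, None]
    return vecs.sum(axis=0)

def reldocs_filter(doc_vectors, relevant_doc_ids):
    # Filter from feedback alone: the vector sum of all relevant
    # training documents, replacing (not combining with) the topic.
    return doc_vectors[relevant_doc_ids].sum(axis=0)
```

Methods that combine the topic vector with the relevant-document vectors would fall between these two extremes.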
In both cases, the filter was a single vector. New documents were matched against the filter
vector and ranked in decreasing order of similarity. We have previously conducted experiments
which suggest that performance can be improved if the filter is represented as several separate
vectors. We did not use this method for the TREC results we submitted, but would like to do so
in subsequent experiments. (See also Kane-Esrig et al., 1991 or Foltz and Dumais, 1992, for a
discussion of multi-point interest profiles in LSI.)
For each topic, a separate filter vector was created in 4 training databases (WSJ1, AP1, FR1,
ZIFF1). Thus each of the 50 filters was represented in 4 different databases. New documents
(those from WSJ2, AP2, FR2, ZIFF2) were automatically pre-processed and indexed as
described above. New documents were "folded in" to the comparable training database and
compared to the filter vector in that database. For example, a new document from WSJ2 was
added to the WSJ1 database and compared to the 50 filters in that database. It is important to
note that only term vectors and term weights from the training subcollections were used in
indexing and comparing these new documents to the routing filters.
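The "folding in" step can be sketched as below. This is a hedged sketch of the standard LSI fold-in procedure, not necessarily the exact routine used here; the scaling by inverse singular values and all names are assumptions.

```python
import numpy as np

def fold_in(term_vectors, singular_values, term_ids, term_weights):
    # Place the new document at the weighted sum of its terms' vectors,
    # using only the training subcollection's term vectors and weights,
    # then rescale by the inverse singular values so the result is
    # directly comparable to the training document vectors.
    raw = (term_vectors[term_ids] * term_weights[:, None]).sum(axis=0)
    return raw / singular_values
```

The folded-in document vector is then compared against each of the 50 filter vectors in that database.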
We used 250 dimensions and a cosine similarity measure for all routing comparisons. Results
from the different training subcollections were combined using raw cosine values.
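Combining by raw cosine amounts to pooling the per-subcollection result lists and re-sorting on the unadjusted similarity score, roughly as in this sketch (an assumption about the merge, since the paper does not spell it out):

```python
def merge_by_raw_cosine(runs):
    # Pool (doc_id, cosine) pairs from each subcollection's ranked run
    # and re-rank the union on the raw cosine value, highest first.
    pooled = [pair for run in runs for pair in run]
    return sorted(pooled, key=lambda p: p[1], reverse=True)
```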
2.3.4 Matching/retrieval times
Both queries (or filters) and documents were represented as reduced-dimension vectors. The
cosine between each query and every document was computed, and documents were ranked in
decreasing order of similarity to the query. The cosines were computed using 235 dimensions
for the adhoc queries and 250 dimensions for the routing queries. Although there are many
fewer dimensions than in standard vector retrieval, the entries are almost all non-zero so inverted
indices are not useful. This means that each query must be compared to every document.
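The exhaustive match-and-rank step is straightforward; a minimal NumPy sketch (illustrative, not the implementation used for the runs):

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vectors):
    # Normalize query and documents, take dot products, and sort
    # descending. Every document is scored, since the dense
    # reduced-dimension vectors defeat inverted-index shortcuts.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)
    return order, sims[order]
```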
For 250-dimensional vectors, about 50,000 cosines can be computed per minute on a Sparc2.
This time includes both comparison time and ranking time, and assumes that all document
vectors are pre-loaded into memory. For the adhoc queries, the time to compare a query to the
742,986 documents was about 12 minutes if all comparisons were sequential. It is
straightforward to split this matching across several machines or to use parallel hardware.
Preliminary experiments using a 16,000 PE MasPar showed that 50,000 cosines could be