NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
chapter
S. Dumais
National Institute of Standards and Technology
D. K. Harman
all comparisons were sequential. It is, however,
straightforward to split this matching across several
machines or to use parallel hardware since all
documents are independent. Preliminary experiments
using a 16,000 PE MasPar showed that 60,000 cosines
could be computed and sorted in less than 1 second.
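As a rough illustration (not the system's actual code), the sketch below scores a batch of placeholder document vectors against a single query vector by cosine and sorts the results; because each document's score depends only on its own vector, the chunks could be handled by separate machines or processing elements and the partial results merged. Dimensions and data are assumptions.

    import numpy as np

    def cosine_scores(query_vec, doc_matrix):
        """Cosine similarity of one query against every row of doc_matrix."""
        q = query_vec / np.linalg.norm(query_vec)
        norms = np.linalg.norm(doc_matrix, axis=1)
        return (doc_matrix @ q) / norms

    rng = np.random.default_rng(0)
    docs = rng.standard_normal((60_000, 204))   # placeholder: 60k docs in a 204-dim space
    query = rng.standard_normal(204)

    # Chunks are independent, so they can be scored in parallel and merged.
    chunk_scores = [cosine_scores(query, chunk) for chunk in np.array_split(docs, 4)]
    scores = np.concatenate(chunk_scores)
    ranking = np.argsort(-scores)               # best-matching documents first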
It is important to note that all steps in the LSI analysis are completely automatic and involve no human intervention. Documents are automatically processed to derive a term-document matrix. This matrix is decomposed by the SVD software, and the resulting reduced-dimension representation is used for retrieval. While the SVD analysis is somewhat costly in terms of time for large collections, it needs to be computed only once at the beginning to create the reduced-dimension database. (The SVD takes only about 2 minutes on a Sparc 10 for a 2k x 5k matrix, but this time increases to about 18 hours for a 60k x 80k matrix.)
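The indexing step described above can be sketched roughly as follows, using a sparse truncated SVD on a placeholder term-document matrix; the matrix contents, term weighting, and choice of k = 204 dimensions are assumptions for illustration, not the actual TREC-2 setup.

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    terms, docs, k = 2000, 5000, 204
    # Placeholder term-document matrix (entries would normally be weighted term frequencies).
    X = sparse_random(terms, docs, density=0.01, format="csc", random_state=0)

    # Truncated SVD: X is approximated by U_k * diag(s_k) * Vt_k.
    U_k, s_k, Vt_k = svds(X, k=k)

    term_vectors = U_k * s_k                  # each row: a term in the k-dim LSI space
    doc_vectors = (Vt_k * s_k[:, None]).T     # each row: a document in that space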
3.4 TREC-2: Routing experiments
For the routing queries, we created a filter or profile
for each of the 50 training topics. We submitted
results from two sets of routing queries. In one case,
the filter was based on just the topic statements - i.e.,
we treated the routing queries as if they were adhoc
queries. The filter was located at the vector sum of the
terms in the topic. We call these the routing_topic (lsir1) results. This method makes no use of the training data, representing the topic as if it were an adhoc query. In the other case, we used information
about relevant documents from the training set. The
filter in this case was derived by taking the vector sum
or centroid of all relevant documents. We call these
the routing_reldocs (lsir2) results. There were an
average of 328 relevant documents per topic, with a
range of 40 to 896. This is a somewhat unusual
variant of relevance feedback; we replace (rather than
combine) the original topic with relevant documents,
and we do not downweight terms that appear in non-
relevant documents. These two extremes provide
baselines against which to compare other methods for
combining information from the original query and
feedback about relevant documents. In both cases, the
filter was a single vector. New documents were
matched against the filter vector and ranked in
decreasing order of similarity.
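A minimal sketch of the two kinds of filters follows, assuming reduced-dimension term_vectors and doc_vectors (NumPy arrays) from the LSI analysis; the topic terms, weights, and relevance judgments shown are placeholders, not the TREC-2 data.

    import numpy as np

    rng = np.random.default_rng(0)
    term_vectors = rng.standard_normal((2000, 204))   # placeholder LSI term vectors
    doc_vectors = rng.standard_normal((5000, 204))    # placeholder LSI document vectors

    def topic_filter(topic_term_ids, topic_term_weights, term_vectors):
        """lsir1-style filter: weighted vector sum of the topic's term vectors."""
        return (topic_term_weights[:, None] * term_vectors[topic_term_ids]).sum(axis=0)

    def reldoc_filter(relevant_doc_ids, doc_vectors):
        """lsir2-style filter: centroid of the known relevant documents."""
        return doc_vectors[relevant_doc_ids].mean(axis=0)

    # Placeholder topic: a handful of weighted terms and some judged-relevant documents.
    r1 = topic_filter(np.array([3, 17, 250]), np.array([1.0, 0.8, 0.5]), term_vectors)
    r2 = reldoc_filter(np.array([10, 42, 99, 1234]), doc_vectors)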
The new documents (336,306 documents from CD-3)
were automatically processed as described in section
3.2 above. It is important to note that only terms from
the CD-1 and CD-2 training collection were used in
indexing these documents. Each new document is
located at the weighted vector sum of its constituent
term vectors in the 204-dimension LSI space (in just
the same way as queries are handled). New
documents were compared to each of the 50 routing
filter vectors using a cosine similarity measure in
204-dimensions. The 1000 best matching documents
for each filter were submitted to NIST for evaluation.
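The routing step might look roughly like the sketch below: each new document is folded in as the weighted vector sum of its (training-vocabulary) term vectors, then ranked against each filter by cosine, keeping the top 1000 per filter. All inputs are placeholders; this is not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(1)
    term_vectors = rng.standard_normal((2000, 204))   # placeholder training-vocabulary term vectors
    filters = rng.standard_normal((50, 204))          # 50 placeholder routing filter vectors

    def fold_in(term_ids, term_weights, term_vectors):
        """Place a new document at the weighted vector sum of its term vectors."""
        return (term_weights[:, None] * term_vectors[term_ids]).sum(axis=0)

    def top_matches(filter_vector, new_doc_vectors, n=1000):
        """Indices of the n new documents most similar (by cosine) to a filter."""
        d = new_doc_vectors / np.linalg.norm(new_doc_vectors, axis=1, keepdims=True)
        f = filter_vector / np.linalg.norm(filter_vector)
        return np.argsort(-(d @ f))[:n]

    # Fold in a batch of placeholder new documents, then route against each filter.
    new_docs = np.stack([fold_in(rng.integers(0, 2000, size=30),
                                 rng.random(30), term_vectors) for _ in range(5000)])
    submissions = [top_matches(f, new_docs, n=1000) for f in filters]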
3.4.1 Results
The main results of the lsir1 and lsir2 runs are shown in Table 1. The two runs differ only in how the profile vectors were created - using the weighted average of the words in the topic statement for lsir1 (routing_topic), and using the weighted average of all relevant documents from the training collection (CD-1 and CD-2) for lsir2 (routing_reldocs). Not
surprisingly, the lsir2 profile vectors which take
advantage of the known relevant documents do better
than the lsir1 profile vectors that simply use the topic
statement on all measures of performance. The
improvement in average precision is 31% (.2622 vs.
.3442). Users would get an average of 1 additional
relevant document in the top 10 returned using the
lsir2 method for filtering.
Table 1

                 lsir1          lsir2         r1+r2
                 (topic wds)    (rel docs)    (sum r1 r2)
  Rel_ret        6522           7155          7367
  Avg prec       .2622          .3442         .3457
  Pr at 100      .3799          .4524         .4394
  Pr at 10       .5480          .6660         .6620
  R-prec         .3050          .3804         .3786
  Q >= Median    27 (4)         40 (9)        42 (6)
  Q <  Median    23 (0)         10 (0)         8 (0)

Table 1: LSI Routing Results. Comparison of topic words vs. relevant documents as routing filters.
Compared to other TREC-2 systems, LSI does
reasonably well, especially for the routing_reldocs (lsir2) run (and the r1+r2 run to be discussed below). In the case of lsir2, LSI is at or above the median
performance for 40 of the 50 topics, and has the best
score for 9 topics. LSI performs about average for the
routing_topic (lsir1) run even though no information
from the training set was used in forming the routing
vectors in this case (except, of course, for the global
term weights).
We have also performed similar comparisons between