SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
chapter
S. Dumais
National Institute of Standards and Technology
D. K. Harman
small difference between 151 and a comparable vector
method in the ThEC environment, given that we have
consistenfly observed larger advantages previously.
The most likely reason for this is that the `IREC topics
are much rich& and more detailed descriptions of
searchers' needs than are found in typical IR requests.
The average `IREC query has 51 unique words in it,
and many of these are very specific names. Since 151
is primarily a recall enhancing method it has littie
effect on these already very rich queries. This is much
the same conclusion that groups who tried various
kinds of query expansion (another recall enhancing
method) reached - e.g., see Voorhees 1993.
We tried one analysis using the new "suuunary" field
for each topic. This field alone is used as a much
shorter topic statement (often quite similar to the
description field) that covers the relevance
assessments. These results are summ&i[OCRerr][OCRerr]d in Table
5. As expected, overall performace is lower than
with the ccxnplete topics. More interestingly, the
difference between 151 and the standard vector
method is now larger - 16% in average precision. This
is still a somewhat smaller advantage than we have
seen in previous experiments with smaller test
collections, but even the summary queries have an
average of 11 unique words in them.
Table 5
isiasm lsial*
38175 38175
4. Improving performance
4.1 Improving performance - Speed
The 15llSVD system was built as a research prototype
to investigate many different information retrieval and
interface issues. Retrieval efficiency was not a central
concern because we first wanted to assess whether the
method worked before worrying about efficiency, and
because the initial applications of 151 involved much
smaller databases of a few thousand documents.
Almost no effort went into re-designing the tools to
work efficiendy for the large TREC databases.
4.1.1 SVD
SVD algorithms get faster all the time. The sparse,
iterative algorithm we now use is about 100 times
faster than the method we used inltially (Berry, 1992).
There are the usual speed-memory tradeoffs in the
SVD algorithms, so time can probably be decreased
some by using a different algorithm and more
memory. Parallel algorithms will help a littie, but
probably only by a factor of 2. Finally, all
calculations are now done in double precision, soboth
time and memory could be decreased by using single
precision computations. Preliminary experiments with
smaller IR test collections suggest that this decrease in
precision will not lead to numerical problems for the
SVD algorithm. It is important to note that the pre-
processing and SVD analyses are one-time-only costs
for relatively stable doinains.
Rel_ret 8043
Avg prec .2589
Pr at 100 .3344
Prat 10 .3420
R-prec .3114
Table 5: 151 Adhoc Results -
Comparison of standard vector
using only documents for
judgements were available,
Summary field.
4.1.2 Retrieval
Query vectors are conipared to eveiy document. This
process is linear in the number of documents in the
database, and can be quite slow for large databases.
Although there are no practical and efficient
algorithms for finding nearest neighbors in 200- or
300-dimensional spaces, there are several methods
which could be used to speed retrieval. a) Document
clustering could be used to reduce the number of
comparisons, but accuracy would probably suffer
some. We have explored several heuristics for
clustering, but none are particularly effective when
high levels of accuracy are maintained. b) HNC's
Matchplus system (Gallant et al., 1993) uses another
approach to reduce the number of alternative
documents that must be matched. They use an inltial
keyword match to elilninate many documents and
calculate reduced-dimension vector scores for only the
subset of documents meeting the inltial keyword
restriction. This may be a reasonable alternative for
long queries (like `IREC), but we believe that recall
would be reduced for short queries. In addition two
8676
.3008
.3710
.3680
.3386
Summary Topics.
method with 151
which relevance
using only the
The reduced dimension 151 vector retrieval method
can offer performance advantages compared to a
standard vector method for the large ThEC collection.
The advantages are larger with shorter queries, as
expected. The exact natwe of this advantage (e.g.,
which documents are retrieved by 151 but not the
standard vector method) needs to be examined in more
detall.
111