data structures need to be maintained. c) Query
matching can also be improved tremendously by
simply using more than one machine or parallel
hardware. Using a 16,000 PE MasPar, with no
attempt to optimize the data storage or sorting, we
decreased the time required to match a 200-dimensional query vector
against all document vectors and sort the results by a factor of 60
to 100.
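As a concrete illustration of this match-and-sort step (a serial
sketch, not the MasPar implementation; the corpus size and variable
names are invented for the example):

    import numpy as np

    k = 200                                   # reduced LSI dimensionality
    rng = np.random.default_rng(0)
    docs = rng.standard_normal((100_000, k))  # one row per document vector
    query = rng.standard_normal(k)            # query vector in the same space

    # Cosine match: normalize once, then a single matrix-vector product.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs_unit @ (query / np.linalg.norm(query))

    ranking = np.argsort(-scores)             # best-matching documents first

The parallel version distributes the same inner products and sort
across processing elements; the computation itself is unchanged.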
4.2 Improving Performance - Accuracy
We have only begun to look at a large number of
parametric variations that might improve LSI
performance. One important variable for LSI retrieval
is the number of dimensions in the reduced dimension
space. In previous experiments we have found that
performance improves as the number of dimensions is
increased up to 200 or 300 dimensions, and decreases
slowly after that to the level observed for the standard
vector method (Dumais, 1991). We have examined
TREC-2 performance using fewer dimensions than
reported above (204 for the routing queries and 199
for the adhoc queries) and consistently found worse
performance. Thus, it appears we could improve
performance simply by increasing the number of
dimensions somewhat. Unfortunately, this requires
rerunning the SVD.
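The dimension-reduction step itself can be sketched as follows (a
toy-sized, dense-matrix illustration; the actual TREC runs would rely
on a sparse, iterative SVD):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((2000, 1000))    # term-by-document matrix (toy size)

    k = 200                         # target dimensionality (e.g., 200-300)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

    # Document j is represented by column j of diag(sk) @ Vtk,
    # i.e., a k-dimensional vector.
    doc_vectors = (sk[:, None] * Vtk).T

Raising k means recomputing this factorization over the full
collection, which is the expense noted above.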
We also noticed that many of the adhoc queries
contained "NOTS". Since LSI does not use any
Boolean logic and represents a query as the vector
sum of its constituent terms, we thought that removing
this information might help. We modified the topic
statements by hand to remove negated phrases.
Performance improved by less than 2%.
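Since the query representation is just a sum of term vectors, the
hand edit amounts to deleting the negated tokens before summing. A
minimal sketch (term_vectors is a hypothetical mapping from each
indexed word to its k-dimensional LSI term vector):

    import numpy as np

    def query_vector(tokens, term_vectors, dims=200):
        vec = np.zeros(dims)
        for t in tokens:
            if t in term_vectors:   # out-of-vocabulary terms contribute nothing
                vec += term_vectors[t]
        return vec

    # Removing a negated phrase simply means passing the token list
    # with that phrase deleted.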
We still need to experiment with different term
weighting methods. For the routing and adhoc
experiments we used SMART's "ltc" weighting for
both the corpus of documents and the queries.
Buckley and Salton's TREC-1 paper suggests that
alternative weightings may be more effective for the
large TREC document collection. Reweighting the
query vectors is easy. Reweighting the document
collection is more difficult, because this changes the
term-document matrix and a new SVD is required.
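For reference, "ltc" is conventionally read as logarithmic term
frequency, idf, and cosine normalization; a sketch under that reading
(the exact SMART variant is not restated here):

    import math

    def ltc_weights(tf_counts, df, n_docs):
        """tf_counts: raw tf per term in one document;
        df: document frequency per term; n_docs: collection size."""
        w = {}
        for term, tf in tf_counts.items():
            if tf > 0 and df.get(term, 0) > 0:
                l = 1.0 + math.log(tf)            # "l": log tf
                t = math.log(n_docs / df[term])   # "t": idf
                w[term] = l * t
        norm = math.sqrt(sum(v * v for v in w.values()))  # "c": cosine
        return {term: v / norm for term, v in w.items()} if norm else w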
For the routing queries we would like to try several
alternative methods of combining information from
the original query and the relevant documents to take
better advantage of the good training data that is
available. We expect term re-weighting and the use of
negative information (e.g., down-weighting terms
from non-relevant documents) to improve
performance somewhat.
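One standard way to combine these sources is Rocchio-style feedback,
moving the query toward relevant documents and away from non-relevant
ones. This is a hypothetical sketch of that direction, not the routing
formulation actually used; the mixing weights are assumptions:

    import numpy as np

    def rocchio(query, rel_docs, nonrel_docs,
                alpha=1.0, beta=0.75, gamma=0.15):
        """query: (k,) vector; rel_docs, nonrel_docs: (n, k) arrays
        of document vectors in the same reduced space."""
        new_q = alpha * query
        if len(rel_docs):
            new_q += beta * rel_docs.mean(axis=0)
        if len(nonrel_docs):
            new_q -= gamma * nonrel_docs.mean(axis=0)  # negative evidence
        return new_q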
In order to better understand retrieval performance we
have begun to examine two kinds of retrieval failures:
false alarms and misses. False alarms are documents
that LSI ranks highly but that are judged to be
irrelevant. Misses are relevant documents that are not
in the top 1000 returned by LSI.
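Operationally, both counts fall out of a ranked list and the
relevance judgments (the cutoff for "ranks highly" below is
illustrative):

    def failure_analysis(ranking, relevant, top=1000, alarm_cutoff=100):
        """ranking: document ids, best first;
        relevant: set of judged-relevant ids."""
        false_alarms = [d for d in ranking[:alarm_cutoff]
                        if d not in relevant]
        returned = set(ranking[:top])
        misses = [d for d in relevant if d not in returned]
        return false_alarms, misses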
4.2.1 False Alarms.
The most common reason for false alarms was lack of
specificity. These highly ranked but irrelevant articles
were generally about the topic of interest but did not
meet some of the restrictions described in the topic
statement. Many topics required this kind of detailed
processing or fact-finding that the LSI system was not
designed to address. Precision of LSI matching can be
increased by many of the standard techniques - proper
noun identification, use of syntactic or statistically-
derived phrases, or a two-pass approach involving a
standard initial global matching followed by a more
detailed analysis of the top few thousand documents.
Buckley and Salton (1992, SMART's global and local
matching), Evans et al. (1992, CLARIT's evoke and
discriminate strategy), Nelson (1992, ConQuest's
global match followed by the use of locality of
information), and Jacobs, Krupka and Rau (1992,
GE's pre-filter followed by a variety of more stringent
tests) all used two-pass approaches to good advantage
in TREC-1 or TREC-2. We would like to try some of
these methods, and will focus on general-purpose,
completely automatic methods that do not have to be
modified for each new domain or query restriction.
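The pattern shared by those systems can be sketched generically; both
scorers here are placeholders, since each cited system used its own
second pass:

    import numpy as np

    def two_pass(query, docs, second_pass_score, pool_size=2000):
        """docs: (n, k) document vectors; second_pass_score(query, i)
        returns a refined score for candidate document i."""
        global_scores = docs @ query              # pass 1: cheap global match
        pool = np.argsort(-global_scores)[:pool_size]
        refined = sorted(((second_pass_score(query, i), i) for i in pool),
                         reverse=True)            # pass 2: detailed rerank
        return [i for _, i in refined]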
Another possible reason for false alarms appears to be
the result of inappropriate query pre-processing. The
use of negation is the best example of this problem.
Thirty-two of the 50 adhoc queries contain some
negation in the topic statement. Some preliminary
experiments (described briefly above) found only a
small improvement in performance when negated
information was manually removed from the topics.
Another example of inappropriate query processing
involved the use of logical connectives. LSI does not
handle Boolean combinations of words, and often
returned articles covering only a subset of ANDed
topics. Often one aspect of the query appears to
dominate (typically the one described by the terms
with high weights). Limiting the contribution of any
one term to the overall similarity score might help this
problem.
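Because the query vector is a sum of term vectors, the
query-document inner product decomposes into per-term
contributions, so the cap can be applied term by term (the cap value
is an assumption):

    import numpy as np

    def capped_similarity(term_vecs, doc_vec, cap=1.0):
        """term_vecs: k-dim vectors for the query's terms;
        doc_vec: k-dim document vector in the same space."""
        return sum(min(float(v @ doc_vec), cap) for v in term_vecs)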
Finally, it is not at all clear why about 20% of the false
alarms were returned by LSI. Since LSI uses a
statistically-derived "semantic" space and not
surface-level word overlap for matching queries to