NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

LSI meets TREC: A Status Report
S. Dumais
Inappropriate subcollection selection accounts for many of LSI's failures on computer-related
topics. Consider, for example, topic 037 about IBM's SAA standards. LSI performs very poorly
compared to other systems on this topic, returning the fewest relevant articles (19) and having
the lowest 11-pt precision (.0088). Summing over all systems for this topic, 68% of the total
number of returned articles (554/812) and 99% of the relevant articles (444/449) were from
ZIFF2. For LSI, however, only 17 of the top 200 articles (8%) were from ZIFF2; all 17 of these
articles were relevant. When a comparable proportion of LSI documents were selected from
ZIFF2, the number relevant increased to 175 and the 11-pt precision increased to .3532. This
performance places LSI slightly above the median for topic 037. We performed the same
analysis for all topics in which more than 33% of the total returned articles were from ZIFF.
There were 26 such topics (19 Adhoc Topics from 026-050, and 7 Routing Topics). For these
topics, the mean percent of ZIFF articles chosen by all systems was 59%, compared with 9% for
LSI. When comparable proportions of ZIFF articles were selected for LSI, the average number
of relevant documents increased from 26 to 58, and the 11-pt precision increased from .0554 to
.1111. A total of 700 new relevant documents were found for the 19 Adhoc Topics, and 125
new relevant documents were found for the 7 Routing Topics. Performance improvements were
observed for 23 of the 26 topics; two of the three topics that did not improve were about
AT&T products, where poor pre-processing omitted AT&T from the LSI query (see below).
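The analysis above amounts to counting, per topic, what fraction of a run's top-ranked documents come from a given subcollection and flagging topics where that fraction is disproportionately low for LSI. A minimal sketch of that computation (function names and the doc-id-to-subcollection mapping are illustrative assumptions, not from the paper):

```python
def subcollection_fraction(ranked_doc_ids, subcollection_of, name, k=200):
    """Fraction of the top-k retrieved documents for one topic that
    come from the named subcollection (e.g. "ZIFF2").

    ranked_doc_ids:   document ids in rank order for one topic
    subcollection_of: dict mapping doc id -> subcollection name
    """
    top = ranked_doc_ids[:k]
    hits = sum(1 for d in top if subcollection_of[d] == name)
    return hits / len(top)
```

Comparing this fraction for one system against the same fraction summed over all submitted runs is what identifies topics, like 037, where a system under-selects a subcollection.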
While these results are encouraging, the problem of how to select appropriate subcollections is
not solved. For the routing topics, we could use training data to set some a priori mixture of
articles from the various subcollections. This strategy is not, however, generally applicable to
adhoc queries. We will examine more appropriate ways of combining across subcollections
to take distributional effects like this into account. Alternatively, we could use randomly
selected documents (rather than topically organized ones) to create the subcollections. Finally,
we could use a single large combined scaling in which there would be no need to combine across
subcollections.
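The first option, an a priori mixture learned from routing training data, could be sketched as filling a per-subcollection quota from each ranked list and then ordering the selected documents by score. All names, and the simple rounding-based quota scheme, are assumptions for illustration only:

```python
def merge_with_mixture(ranked_lists, mixture, k=200):
    """Merge per-subcollection ranked lists under an a priori mixture
    (e.g. estimated from routing training data).  Each subcollection
    contributes roughly mixture[name] * k documents.

    ranked_lists: name -> [(doc_id, score), ...] sorted by score desc.
    mixture:      name -> target proportion (proportions sum to 1).
    """
    picked = []
    for name, docs in ranked_lists.items():
        quota = round(mixture[name] * k)          # this subcollection's share
        picked.extend((score, doc_id, name) for doc_id, score in docs[:quota])
    picked.sort(reverse=True)                     # highest score first
    return [(doc_id, name) for score, doc_id, name in picked[:k]]
```

A single combined scaling, the last option above, would make this post-hoc merging unnecessary, since all documents would already be scored in one common space.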
Finally, some misses were attributable to poor text (and query) pre-processing and tokenization.
Since we do not keep single letters or use a database of company and location names, several
important acronyms like U.S. and AT&T disappeared completely from both articles and queries!
Not surprisingly, this resulted in many missed documents. We noticed that many of the top
performing automatic systems used SMART's pre-processing, and we hope to do so as well for
TREC-2. This will allow us to better understand the usefulness of LSI per se, without many of
the additional confounds introduced by different indexing.
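The acronym problem above arises when tokenization splits on punctuation and then discards single letters, so "U.S." and "AT&T" vanish. One way such forms can be preserved is to try acronym-aware patterns before the generic word pattern; the sketch below is a hypothetical illustration, not the pre-processing actually used by LSI or SMART:

```python
import re

# Acronym patterns are tried before the plain-word pattern, so
# "U.S." and "AT&T" survive as single tokens instead of being
# split into single letters and dropped.
TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}          # dotted acronyms: U.S., I.B.M.
  | [A-Za-z]+(?:&[A-Za-z]+)+    # ampersand names: AT&T
  | [A-Za-z]+                   # ordinary words
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)
```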
3.3 Open experimental issues
The results of the failure analyses suggest several directions to pursue for TREC-2, including:
improving pre-processing and tokenization; exploring some precision-enhancing methods; and
developing methods for more effectively combining across subcollections. We also hope to
explore three additional issues.