NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

LSI meets TREC: A Status Report
S. Dumais

Somewhat surprisingly, the query using just the topic terms was about 25% more accurate than the feedback query (2234 vs. 1837 relevant articles; .1235 vs. .0972 11-pt precision). We suspect that part of the problem was attributable to the small number and inaccuracy of relevance judgements in the initial training set. This had a substantial impact on performance for some topics because our feedback queries were based only on the relevant articles (and ignored the original topic description). For Topic 050, for example, there was only one relevant article, and it did not appear to us to be relevant to the topic: the topic asked about "potential military interest in virtual reality", while the so-called relevant article was about "Denmark's crisis on nuclear power threatening its membership in NATO". Not surprisingly, when only this article was used as a query, no relevant articles about virtual reality were returned. Now that we have a larger number of (we hope) more accurate relevance judgements, we will repeat this basic comparison. We will then use these two baseline runs to explore: a) combining the relevant documents and the original topic; b) selecting only some relevant documents and/or discriminating terms; and c) representing the query vector as several points of interest rather than a single average.

3.2.2 Failure analyses

In order to better understand retrieval performance we examined two kinds of retrieval failures: false alarms and misses. False alarms are documents that LSI ranks highly but that are judged irrelevant. Misses are relevant documents that do not appear in the top 200 returned by LSI. The observations presented below are based on preliminary analyses of some topics on which LSI performed poorly.
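The query-construction variants discussed above (topic-terms only, feedback from judged-relevant documents only, and the planned combination of the two) can be sketched as centroid operations in a reduced vector space. This is an illustrative sketch, not the authors' code: the vectors, the mixing weight `alpha`, and the helper names are our assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors in the reduced LSI space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_query(topic_term_vectors):
    # Topic-terms query: centroid (average) of the topic's term vectors.
    return np.mean(topic_term_vectors, axis=0)

def feedback_query(relevant_doc_vectors):
    # Feedback query as in the run described above: centroid of the
    # judged-relevant documents only, ignoring the original topic text.
    return np.mean(relevant_doc_vectors, axis=0)

def combined_query(topic_vec, relevant_doc_vectors, alpha=0.5):
    # Exploration (a): mix the topic vector with the relevant-document
    # centroid; alpha is an assumed mixing weight.
    return alpha * topic_vec + (1 - alpha) * feedback_query(relevant_doc_vectors)

def rank(query_vec, doc_vectors):
    # Return document indices ordered by decreasing cosine similarity.
    scores = [cosine(query_vec, d) for d in doc_vectors]
    return sorted(range(len(doc_vectors)), key=lambda i: -scores[i])
```

The Topic 050 failure is visible in this framing: with a single (mislabeled) relevant document, `feedback_query` degenerates to that one document's vector, whereas `combined_query` would still retain the topic description's signal.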
Although we suggest methods for improving performance, most have not been tested systematically on the entire TREC collection, although we plan to do so.

3.2.2.1 False Alarms. The most common reason for false alarms (accounting for approximately 50% of those we examined) was lack of specificity. These highly ranked but irrelevant articles were generally about the topic of interest but did not meet some of the restrictions described in the topic. Many topics required a kind of detailed processing or fact-finding that the LSI system was not designed to address. The precision of LSI matching can be increased by many of the standard techniques: proper noun identification, use of syntactic or statistically derived phrases, or a two-pass approach in which a standard initial global matching is followed by a more detailed analysis of the top few thousand documents. Salton and Buckley (SMART's global and local matching), Evans (CLARIT's evoke and discriminate strategy), Nelson (ConQuest's global match followed by the use of locality of information), and Jakobs, Krupka and Rau (GE's pre-filter followed by a variety of more stringent tests) all used two-pass approaches to good advantage in the TREC tests. We intend to try some of these methods for TREC-2, and will focus on general-purpose, completely automatic methods that do not have to be modified for each new domain or query restriction.

Another common cause of false alarms appears to be inappropriate query pre-processing. The use of negation is the best example of this problem. About 20% of the TREC topics contained explicit negations, but LSI included negated words in the query along with all the other words. Topic 094, about computer-aided crime, also stated that articles that simply mentioned the spread of a computer virus or worm were NOT relevant. The first 20 documents that LSI returned were all about computer viruses! Another example of inappropriate query
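The negation problem described above could be addressed with a simple pre-processing step that drops terms falling under an explicit negation before the query vector is built. The following is an untested sketch under an assumed (and deliberately crude) heuristic: any clause containing a negation marker is excluded wholesale.

```python
import re

# Assumed set of negation markers; a real system would need a richer list.
NEGATION_MARKERS = {"not", "no", "never", "nor"}

def split_clauses(text):
    # Crude clause segmentation on sentence and clause punctuation.
    return re.split(r"[.;,!?]", text.lower())

def query_terms(topic_text):
    """Collect query terms, excluding negated material.

    Heuristic (our assumption): if a clause contains any negation
    marker, drop every term in that clause rather than trying to
    resolve the exact scope of the negation.
    """
    keep = []
    for clause in split_clauses(topic_text):
        words = re.findall(r"[a-z]+", clause)
        if NEGATION_MARKERS & set(words):
            continue  # drop the whole negated clause
        keep.extend(words)
    return keep
```

For a Topic 094-style statement such as "articles that simply mention the spread of a computer virus or worm are NOT relevant", this heuristic would keep the topic's main terms while excluding "virus" and "worm" from the query, at the obvious cost of occasionally discarding useful words from the same clause.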