data structures need to be maintained. c) Query
matching can also be improved tremendously by
simply using more than one machine or parallel
hardware. Using a 16,000 PE MasPar, with no
attempt to optimize the data storage or sorting, we
decreased the time required to match a 200-dimensional query vector
against all document vectors and sort the results by a factor of 60
to 100.
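As a concrete illustration of this match-and-sort step (a serial
sketch, not the MasPar implementation; the corpus size and variable
names are invented for the example):

    import numpy as np

    k = 200                                   # reduced LSI dimensionality
    rng = np.random.default_rng(0)
    docs = rng.standard_normal((100_000, k))  # one row per document vector
    query = rng.standard_normal(k)            # query vector in the same space

    # Cosine match: normalize once, then a single matrix-vector product.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs_unit @ (query / np.linalg.norm(query))

    ranking = np.argsort(-scores)             # best-matching documents first

The parallel version distributes the same inner products and sort
across processing elements; the computation itself is unchanged.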
4.2 Improving Performance - Accuracy
We have only begun to look at a large number of
parametric variations that might improve LSI
performance. One important variable for LSI retrieval
is the number of dimensions in the reduced dimension
space. In previous experiments we have found that
performance improves as the number of dimensions is
increased up to 200 or 300 dimensions, and decreases
slowly after that to the level observed for the standard
vector method (Dumais, 1991). We have examined
TREC-2 performance using fewer dimensions than
reported above (204 for the routing queries and 199
for the adhoc queries) and consistently found worse
performance. Thus, it appears we could improve
performance simply by increasing the number of
dimensions somewhat. Unfortunately, this requires
rerunning the SVD.
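The dimension-reduction step itself can be sketched as follows (a
toy-sized, dense-matrix illustration; the actual TREC runs would rely
on a sparse, iterative SVD):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((2000, 1000))    # term-by-document matrix (toy size)

    k = 200                         # target dimensionality (e.g., 200-300)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

    # Document j is represented by column j of diag(sk) @ Vtk,
    # i.e., a k-dimensional vector.
    doc_vectors = (sk[:, None] * Vtk).T

Raising k means recomputing this factorization over the full
collection, which is the expense noted above.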
We also noticed that many of the adhoc queries
contained "NOTS". Since LSI does not use any
Boolean logic and represents a query as the vector
sum of its constituent terms, we thought that removing
this information might help. We modified the topic
statements by hand to remove negated phrases.
Performance improved by less than 2%.
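Since the query representation is just a sum of term vectors, the
hand edit amounts to deleting the negated tokens before summing. A
minimal sketch (term_vectors is a hypothetical mapping from each
indexed word to its k-dimensional LSI term vector):

    import numpy as np

    def query_vector(tokens, term_vectors, dims=200):
        vec = np.zeros(dims)
        for t in tokens:
            if t in term_vectors:   # out-of-vocabulary terms contribute nothing
                vec += term_vectors[t]
        return vec

    # Removing a negated phrase simply means passing the token list
    # with that phrase deleted.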
We still need to experiment with different term
weighting methods. For the routing and adhoc
experiments we used SMART's "ltc" weighting for
both the corpus of documents and the queries.
Buckley and Salton's TREC-1 paper suggests that
alternative weightings may be more effective for the
large TREC document collection. Reweighting the
query vectors is easy. Reweighting the document
collection is more difficult, because this changes the
term-document matrix and a new SVD is required.
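For reference, "ltc" is conventionally read as logarithmic term
frequency, idf, and cosine normalization; a sketch under that reading
(the exact SMART variant is not restated here):

    import math

    def ltc_weights(tf_counts, df, n_docs):
        """tf_counts: raw tf per term in one document;
        df: document frequency per term; n_docs: collection size."""
        w = {}
        for term, tf in tf_counts.items():
            if tf > 0 and df.get(term, 0) > 0:
                l = 1.0 + math.log(tf)            # "l": log tf
                t = math.log(n_docs / df[term])   # "t": idf
                w[term] = l * t
        norm = math.sqrt(sum(v * v for v in w.values()))  # "c": cosine
        return {term: v / norm for term, v in w.items()} if norm else w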
For the routing queries we would like to try several
alternative methods of combining information from
the original query and the relevant documents to take
better advantage of the good training data that is
available. We expect term re-weighting and the use of
negative information (e.g., down-weighting terms
from non-relevant documents) to improve
performance somewhat.
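One standard way to combine these sources is Rocchio-style feedback,
moving the query toward relevant documents and away from non-relevant
ones. This is a hypothetical sketch of that direction, not the routing
formulation actually used; the mixing weights are assumptions:

    import numpy as np

    def rocchio(query, rel_docs, nonrel_docs,
                alpha=1.0, beta=0.75, gamma=0.15):
        """query: (k,) vector; rel_docs, nonrel_docs: (n, k) arrays
        of document vectors in the same reduced space."""
        new_q = alpha * query
        if len(rel_docs):
            new_q += beta * rel_docs.mean(axis=0)
        if len(nonrel_docs):
            new_q -= gamma * nonrel_docs.mean(axis=0)  # negative evidence
        return new_q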
In order to better understand retrieval performance we
have begun to examine two kinds of retrieval failures:
false alarms and misses. False alarms are documents
that LSI ranks highly but that are judged to be
irrelevant. Misses are relevant documents that are not
in the top 1000 returned by LSI.
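Operationally, both counts fall out of a ranked list and the
relevance judgments (the cutoff for "ranks highly" below is
illustrative):

    def failure_analysis(ranking, relevant, top=1000, alarm_cutoff=100):
        """ranking: document ids, best first;
        relevant: set of judged-relevant ids."""
        false_alarms = [d for d in ranking[:alarm_cutoff]
                        if d not in relevant]
        returned = set(ranking[:top])
        misses = [d for d in relevant if d not in returned]
        return false_alarms, misses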
4.2.1 False Alarms.
The most common reason for false alarms was lack of
specificity. These highly ranked but irrelevant articles
were generally about the topic of interest but did not
meet some of the restrictions described in the topic
statement. Many topics required this kind of detailed
processing or fact-finding that the LSI system was not
designed to address. Precision of LSI matching can be
increased by many of the standard techniques - proper
noun identification, use of syntactic or statistically-
derived phrases, or a two-pass approach involving a
standard initial global matching followed by a more
detailed analysis of the top few thousand documents.
Buckley and Salton (1992, SMART's global and local
matching), Evans et al. (1992, CLARIT's evoke and
discriminate strategy), Nelson (1992, ConQuest's
global match followed by the use of locality of
information), and Jacobs, Krupka and Rau (1992,
GE's pre-filter followed by a variety of more stringent
tests) all used two-pass approaches to good advantage
in TREC-1 or TREC-2. We would like to try some of
these methods, and will focus on general-purpose,
completely automatic methods that do not have to be
modified for each new domain or query restriction.
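The pattern shared by those systems can be sketched generically; both
scorers here are placeholders, since each cited system used its own
second pass:

    import numpy as np

    def two_pass(query, docs, second_pass_score, pool_size=2000):
        """docs: (n, k) document vectors; second_pass_score(query, i)
        returns a refined score for candidate document i."""
        global_scores = docs @ query              # pass 1: cheap global match
        pool = np.argsort(-global_scores)[:pool_size]
        refined = sorted(((second_pass_score(query, i), i) for i in pool),
                         reverse=True)            # pass 2: detailed rerank
        return [i for _, i in refined]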
Another possible reason for false alarms appears to be
the result of inappropriate query pre-processing. The
use of negation is the best example of this problem.
Thirty-two of the 50 adhoc queries contain some
negation in the topic statement. Some preliminary
experiments (described briefly above) found only a
small improvement in performance when negated
information was manually removed from the topics.
Another example of inappropriate query processing
involved the use of logical connectives. LSI does not
handle Boolean combinations of words, and often
returned articles covering only a subset of ANDed
topics. Often one aspect of the query appears to
dominate (typically the one described by the terms
with high weights). Limiting the contribution of any
one term to the overall similarity score might help this
problem.
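Because the query vector is a sum of term vectors, the
query-document inner product decomposes into per-term
contributions, so the cap can be applied term by term (the cap value
is an assumption):

    import numpy as np

    def capped_similarity(term_vecs, doc_vec, cap=1.0):
        """term_vecs: k-dim vectors for the query's terms;
        doc_vec: k-dim document vector in the same space."""
        return sum(min(float(v @ doc_vec), cap) for v in term_vecs)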
Finally, it is not at all clear why about 20% of the false
alarms were returned by LSI. Since LSI uses a
statistically-derived "semantic" space and not
surface-level word overlap for matching queries to