NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2), D. K. Harman, editor, National Institute of Standards and Technology

GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval

P. Jacobs

... retrieval method with alternative ranking methods to see, for example, whether more terms are really necessary in order to get better ranking results. The separation of retrieval and ranking seems to be a valuable tool both for experimental research and for identifying different techniques for applications.

It is clearly a problem with both TREC-1 and TREC-2 that the routing task requires a comparison of documents across a large collection, when most routing applications deal with a stream of documents individually or in small groups.

4 Analysis of Results

The results raise a number of important issues, especially: why Boolean approximation works as well as it does, particularly why it works for routing; where statistical weighting could help more; what sort of topics this approach does well on (and which topics it does badly on); and other obvious areas for improvement.

One of the most important sources of information about the advantages and disadvantages of each approach comes from comparing the performance of different systems on different topics. Unfortunately, this is also a very difficult task, because, while it is easy to tell which systems did well on which topic, it is often hard to generalize from that evidence why the approach worked or why it didn't.

As we have mentioned, the Boolean approach is very erratic with respect to performance by topic, as compared with other systems, particularly the statistical methods that emphasize weighting. For example, our manual routing system, which was clearly one of the best systems, had the top performance (in precision at 100 documents) on 8 topics, but was below median on 17 topics (out of 50). In the 11-point averages, that system was below median on 22 topics, more than 40% of the time, although it outperformed most of the systems on average. By contrast, one of the Cornell systems [1] was above median on every topic! This suggests that our approach degrades less gracefully than other approaches, and that it is important to explore Boolean methods as an adjunct to other methods that work in the cases where the Boolean approach seems to fail. Conversely, our system had top or near-top scores on a significant number of topics; it is important to know how to take advantage of this within the context of weighting systems.
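These comparisons rely on two standard TREC measures: precision at 100 documents and the 11-point interpolated average precision. As a point of reference, a minimal sketch of both follows (in Python; the function names are ours, and this is an illustration, not the official TREC evaluation code). Here `ranking` is a list of document identifiers in ranked order and `relevant` is the set of identifiers judged relevant for the topic.

    def precision_at(ranking, relevant, cutoff=100):
        # Fraction of the top `cutoff` retrieved documents that are relevant.
        hits = sum(1 for doc in ranking[:cutoff] if doc in relevant)
        return hits / cutoff

    def eleven_point_average(ranking, relevant):
        # Interpolated precision at recall 0.0, 0.1, ..., 1.0, averaged.
        if not relevant:
            return 0.0
        recall_precision = []
        hits = 0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                recall_precision.append((hits / len(relevant), hits / rank))
        points = []
        for level in (i / 10 for i in range(11)):
            # Interpolation: the best precision at any recall >= this level.
            attainable = [p for r, p in recall_precision if r >= level]
            points.append(max(attainable) if attainable else 0.0)
        return sum(points) / 11

The measures reward different behavior: precision at 100 documents only counts hits near the top of the ranking, while the 11-point average also penalizes a system that fails to retrieve relevant documents at all, which matters for the undergeneration problem discussed below.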
There seem to be several different explanations for variation on the topics. First, there are topics, as we have discussed, that are particularly well suited to Boolean methods (and others that are not well suited at all). Second, there are cases where the training method seems to work particularly well. Third, there are cases where the manual approach might work well because there are terms in the topic description that are particularly misleading. Finally, there are many reasons why our approach can fail, particularly on topics with very small numbers of relevant documents and in cases where the topics are very vaguely specified.

One of the topics where GE had the best results was Topic 53, "leveraged buy-outs". The topic description specified that relevant documents had to describe an LBO above $100 million in value, and give the terms of the buy-out. Apparently, the $100 million figure is not important, because most of the LBO's that are reported are major buy-outs. However, the terms (the specification of the dollar amount) are required, and many articles about LBO's do not report dollar amounts. This is similar to the "Airbus subsidies" topic, where many articles that talk about Airbus do not mention subsidies, and they are not relevant. The advantage here seems to be that the hard Boolean outperforms the weighting approaches because weighting, without Booleans, is likely to give a high weight to an article with many LBO words but no dollar figures, just as it could give a high weight to an article about Airbus that doesn't mention subsidies (this contrast is sketched at the end of this section).

The effect of training seems to help in the "leveraged buy-out" case as well. The training picked up many names of companies involved in buy-outs, like "Safeway" and "Dart", and these were included in the queries. This perhaps helped to separate articles about specific buy-outs from discussions of buy-outs in general. A similar effect came about on Topic 92, "international military equipment sales", where the training pulled in names of many of the weapons typically sold on the international market.

Topic 86, "bank failures", was another topic where the GE system outperformed all others on both the 11-point average and precision at 100 documents. This result is hard to explain, but the one conspicuous fact about the topic is that our query does not include the word "bank". It does include the names of many prominent banks, so it may be, as in the LBO case, that good performance on this topic depends mainly on distinguishing specific references to failures from general discussions about bank failures, for example, the S&L crisis.

On the topics that have very few relevant documents, our approach often failed because, in the absence of training data, it tended to undergenerate; thus very few texts (sometimes none) would match the Boolean query, and other systems with good weighting would pull in more relevant documents. In these cases, a system that finds one relevant document scores much better than a system that finds zero, so the penalty for undergeneration is very high.

The second class of topics where we seem to go wrong is those that are vaguely specified. For example, Topic 74, "policy conflict", is a very hard topic, where the description does not include very much information. Texts rarely mention policy conflict, and, when they do, they are rarely relevant. On the other hand, texts about tobacco policies and health are likely to be relevant.
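The contrast between a hard Boolean filter and pure term weighting that runs through the Topic 53 discussion above can be sketched as follows. The term lists, patterns, and scoring rule are invented for illustration only and are not GE's actual queries; the company names stand in for those the training procedure pulled in.

    import re

    # Hypothetical term lists for Topic 53, "leveraged buy-outs".
    LBO_TERMS = {"leveraged", "buy-out", "buyout", "lbo"}
    COMPANY_TERMS = {"safeway", "dart"}  # stand-ins for trained names
    DOLLAR_PATTERN = re.compile(r"\$\s*\d")  # crude stand-in for "terms of the buy-out"

    def tokens(text):
        return re.findall(r"[\w$-]+", text.lower())

    def boolean_match(text):
        # Hard Boolean: an LBO term (or a trained company name)
        # AND a dollar figure must both be present.
        words = set(tokens(text))
        topical = bool(words & (LBO_TERMS | COMPANY_TERMS))
        return topical and bool(DOLLAR_PATTERN.search(text))

    def weighted_score(text):
        # Pure term weighting: every topical term adds to the score, so an
        # article dense in LBO vocabulary scores high with no dollar figure.
        return sum(1 for w in tokens(text) if w in LBO_TERMS | COMPANY_TERMS)

    doc = "The Safeway leveraged buy-out was among the largest LBO deals..."
    print(boolean_match(doc))   # False: topical, but no dollar amount given
    print(weighted_score(doc))  # high score despite the missing terms

In these terms, the failure mode on sparse or vague topics is the opposite one: the Boolean match returns false for nearly everything, while a weighted system still produces a usable ranking.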