NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, National Institute of Standards and Technology

GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval*

Paul S. Jacobs
GE Research and Development Center
Schenectady, NY 12301
psjacobs@crd.ge.com

Abstract

This report describes a few experiments aimed at producing high accuracy routing and retrieval with a simple Boolean engine. There are several motivations for this work, including: (1) using Boolean term combinations as a filter for advanced data extraction systems, (2) improving "legacy" Boolean retrieval systems by helping to automate the generation of Boolean queries, and (3) focusing on query content, rather than retrieval or ranking, as the key to system performance. The results show very high accuracy, and significant progress, using a Boolean engine for routing based on queries that are manually generated with the help of corpus data. In addition, the results of a straightforward implementation of a fully automatic ad hoc method show some promise of being able to do good automatic query construction within the context of a Boolean system.

1 Introduction

Full-text search is currently the simplest and most commonly used method for locating information in large volumes of free text. Because users are accustomed to describing what they are looking for with specific words, and those words are often found in the texts, searching the text for selected words or word combinations is a natural and easy-to-implement method for information retrieval. However, it can be very inaccurate. It can be especially difficult for searchers to compose "queries" that combine the words that are effective in locating relevant material without finding large quantities of irrelevant information as well.
One way to cope with this difficulty, while still preserving the advantages of the full-text search engine, is to help automate the process of generating Boolean queries. This was the focus of GE's TREC-2 effort.

*This research was sponsored in part by the Advanced Research Projects Agency. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the US Government.

GE's involvement in TREC represents a relatively low level of effort aimed at bringing together natural language text processing, data extraction, and statistical corpus analysis methods. Our project uses innovative approaches for extracting information from text, best exemplified in our results in the MUC and TIPSTER extraction evaluations [7, 3] and in operational text management systems in GE.

In TREC-1, we attempted to show the benefit of natural language interpretation by using Boolean approximation to select portions of text that could be further interpreted. The main result of this was that natural language seems to have very little to offer as a precision filtering method, because routing and retrieval problems stem largely from having the wrong terms in the queries [6]. Thus, in TREC-2, we have stuck with the Boolean engine, concentrating on the use of corpus analysis to improve the queries.

Figure 1 summarizes our TREC results. Our results in TREC-2, as in TREC-1, were quite good relative to other systems. The manual routing system, which comprised over 99% of our effort, produced an 11-point average of .3308, with an average of 45 relevant documents in the top 100. This put GE's system at the very top of the manual routing category (the system with the best 11-point average in this category was slightly higher on the 11-point average and had slightly fewer relevant documents, on average, in the top 100).
The residual effort went into a fully automatic ad hoc method, which produced an 11-point average of .2183 and an average of 37 relevant documents in the top 100.

As in TREC-1, performance varied dramatically by topic. The routing system showed the best results (in terms of precision at 100 documents) on 8 of 50 topics. Yet it was below median on 17 topics. This not only suggests areas for further improvement, but also shows an important difference between the Boolean approach and some of the statistical retrieval systems. The Boolean approach does much better on certain topics, but the statistical approaches have more consistent performance.
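
The two measures quoted above, the 11-point average and the number of relevant documents in the top 100, can be illustrated with a minimal Python sketch. It assumes the commonly used definition of 11-point interpolated average precision (the maximum precision at any recall at or above each of the levels 0.0, 0.1, ..., 1.0, averaged over the 11 levels); the exact procedure used by the TREC-2 evaluation is not described in this section. The function names and the toy document identifiers are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant.

    Divides by k even when fewer than k documents were returned
    (an assumption of this sketch).
    """
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def eleven_point_average(ranked: Sequence[str], relevant: Set[str]) -> float:
    """11-point interpolated average precision for one topic (assumed definition)."""
    if not relevant:
        return 0.0
    # Record (recall, precision) each time a relevant document is encountered.
    points: List[Tuple[float, float]] = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # Interpolated precision at a recall level is the best precision
    # achieved at that recall or any higher recall (0.0 if never reached).
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)


if __name__ == "__main__":
    # Toy ranking for a single topic; identifiers are made up.
    ranking = ["d1", "d2", "d3", "d4", "d5"]
    judged_relevant = {"d1", "d3"}
    print(eleven_point_average(ranking, judged_relevant))
    print(precision_at_k(ranking, judged_relevant, 5))
```

A per-system figure such as the .3308 reported above would then be the mean of the per-topic values over all topics, again under the assumed definition.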