SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval chapter P. Jacobs National Institute of Standards and Technology D. K. Harman AD HOC TEST ROUTING TEST 11-pt. ReI.@ 11-pt. ReI.@ avg. 100 docs. avg. 100 docs. Boolean `92 .2029 47.2 .2078 35.6 Pattern matcher `92 .1961 46.2 .1851 34.6 Avg. median run `92 .1585 39.7 .1246 28.6 Boolean `93 .2183* 37* .3308 45 Avg. median run `93 .2620 41 .2910 41 [OCRerr]= fully automatic Figure 1: Summary GE Results on TREC-1 and TREC-2 While it is very hard to measure progress between TREC-1 and TREC-2 because none of the numbers are directly comparable, the difference between our .2078 routing average in TREC-1 and .3308 in TREC-2 is large enough that we are quite pleased with the rate of progress and confident that we are nowhere near the peak that can be achieved with manual, Boolean routing. In addition, the automatic ad hoc method that we tried, while show- ing terrible performance on some of the more convoluted topic descriptions, had a better average than our manual ad hoc system last year. Thus, there is a great deal to be gained using cor- pus analysis to automate or assist in query generation, even within the context of straightforward retrieval meth- ods. In addition to having promise for "legacy" oper- ational systems, these results suggest that natural lan- guage methods, focused on corpus analysis and query gen- eration, probably can help in improving the performance of many information retrieval systems. 2 Boolean Approximation The basic approach in our system is to compile queries into Boolean tables that can be matched at high speed against a stream of input text. This approach is meant for routing, and also to be compatible with "downstream" analysis such as what we do in TIPSTER data extraction. In fact, the Boolean compiler we use is designed for han- dling the much more complex expressions that our system uses in data extraction. Figure 2 illustrates the approach. We call this Boolean approximation because the Boolean expressions used in the basic matching engine are an approximation to more detailed processing of texts, in the sense that they are guaranteed to admit all text that would be admitted by more detailed processing, but will usually also admit many texts that would be rejected by more detailed con- straints. This is a very general method, in that the sys- tem can be configured to apply many different stages of analysis, from "shallower" processing to "deeper" inter- pretation, with each stage applying stricter constraints- for example, word order, proximity, semantic constraints, and so forth. Furthermore, at each stage, the effects of filtering can be measured, generally showing a loss of re- call and gain in precision. In TREC-1 [6], we measured this tradeoff and found that the highest 11-point averages came from the first stage of filtering; in other words, the gains in precision in later stages were not enough to make up for the loss of recall on these measures. The figure also illustrates the flow of information at development time and the sort of knowledge that applies at each stage. For example, at development time, knowl- edge can be mapped from the deeper levels into shallower levels. At run time, the subsequent stages of language analysis apply this knowledge in stages. For example, our system can analyze joint venture texts (in English and Japanese), looking for, among other things, infor- mation about the joint venture company by recognizing that the company is often the object of the verb estab- lish. In the Boolean stage, this can be approximated by looking for the combination of words like establish with words like venture. In the finite state pattern matching stage, the system might look for any word with the root establish followed by the venture term (and perhaps the reverse order with the verb in a passive form). In deeper interpretation, the system applies syntactic and semantic constraints to recognize the different ways that the con- 192