NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval chapter P. Jacobs National Institute of Standards and Technology D. K. Harman Development time knowledge base compilation Run-time control flow Boolean approximation Routingi Boolean [OCRerr] (establish OR establlshed) Retrieval tables AND venture Lexico -semantic regular expressions { Fin fte state pattern matching Extraction (root establish) * venture Statistical and manual .4 Lexicon, grammar, knolwedge bases K i Natural language semantic interpretation (c-establishing (r-effected (c-venture)) Interpretation Figure 2: Approximation and Natural Language Processing cept of establishing can be expressed, and insure that the words appear in a grammatically acceptable way in the input. This more detailed analysis can be crucial in data extraction tasks. A critical point about this model is that even detailed knowledge about language often ends up contributing to the Boolean tables, so the Boolean expressions are con- siderable more complex than those that could easily be created by hand. Furthermore, the Boolean expressions are a relaxed form of more complex knowledge-based con- structs. It is very difficult to predict the impact of sim- ply relaxing constraints. For example, TREC topic 53 includes the phrase "leveraged buy-out". In a strict syn- tactic analysis, this phrase would have to appear exactly in that form, enforcing both proximity ("leveraged" ad- jacent to "buy-out") and order ("leveraged" before "buy- out"). But relaxing such constraints seldom has any se- vere effect on results: the Boolean expression (leveraged AND buy-out) in our system will recognize the two terms occurring anywhere in the same paragraph, which is actu- ally likely to admit more texts about leveraged buy-outs than admit texts in which the words coincidentally ap- pear together. Even the phrase "prime rate", which matches some ir- relevant texts in its Boolean form (including, for example, one or two texts about Japan that mention "prime min- ister" in the context of "growth rate"), also admits some additional relevant texts in which "prime" appears in the same paragraph as "rate". The well-known effects of word order and proximity on meaning, exemplified by the dis- tinction between "blind Venetian" and "Venetian blind" do not seem to appear very frequently in real examples. At least, these effects may be less frequent In TREC-2, we did not apply any constraints stricter than Booleans at run-time; in other words, we used only a Boolean retrieval engine (because in TREC-1 we proved that the stricter constraints didn't help). We also had to implement a module to relax some of the "hard" Boolean constraints for topics with a very small number of relevant documents, because of the nature of the performance met- rics. However, we still used the knowledge that was devel- oped for more detailed processing, including the phrases, semantic groupings and so forth. In addition, we added a more sophisticated ranking mechanism than we used in TREC-1, because ranking is very important in the evalua- tion. But the only retrieval engine in our TREC-2 system is a Boolean matcher. 2.1 The Boolean Matcher The Boolean tables are efficiently organized so that a C program (which we now know as NLGREP) can match them against incoming texts at a rate of about 1 million words per minute. This program spends little time on anything other than marking where words in each doc- ument match terms in the table. In both routing and retrieval, the total number of terms used for a set of 50 topics is about 2000, or about 40 unique terms per topic. All other words are ignored entirely. For example, the following is a query for the topic "South African sanctions" (using the enhanced regular expression language of GE's pattern matcher [5]): 193