SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval
chapter
P. Jacobs
National Institute of Standards and Technology
D. K. Harman
Development time
knowledge base compilation
Run-time control flow
Boolean approximation Routingi
Boolean
[OCRerr] (establish OR establlshed) Retrieval
tables
AND venture
Lexico -semantic
regular expressions
{
Fin fte state pattern
matching Extraction
(root establish) * venture
Statistical and manual
.4
Lexicon,
grammar,
knolwedge bases K
i
Natural language
semantic interpretation
(c-establishing
(r-effected (c-venture))
Interpretation
Figure 2: Approximation and Natural Language Processing
cept of establishing can be expressed, and insure that the
words appear in a grammatically acceptable way in the
input. This more detailed analysis can be crucial in data
extraction tasks.
A critical point about this model is that even detailed
knowledge about language often ends up contributing to
the Boolean tables, so the Boolean expressions are con-
siderable more complex than those that could easily be
created by hand. Furthermore, the Boolean expressions
are a relaxed form of more complex knowledge-based con-
structs. It is very difficult to predict the impact of sim-
ply relaxing constraints. For example, TREC topic 53
includes the phrase "leveraged buy-out". In a strict syn-
tactic analysis, this phrase would have to appear exactly
in that form, enforcing both proximity ("leveraged" ad-
jacent to "buy-out") and order ("leveraged" before "buy-
out"). But relaxing such constraints seldom has any se-
vere effect on results: the Boolean expression (leveraged
AND buy-out) in our system will recognize the two terms
occurring anywhere in the same paragraph, which is actu-
ally likely to admit more texts about leveraged buy-outs
than admit texts in which the words coincidentally ap-
pear together.
Even the phrase "prime rate", which matches some ir-
relevant texts in its Boolean form (including, for example,
one or two texts about Japan that mention "prime min-
ister" in the context of "growth rate"), also admits some
additional relevant texts in which "prime" appears in the
same paragraph as "rate". The well-known effects of word
order and proximity on meaning, exemplified by the dis-
tinction between "blind Venetian" and "Venetian blind"
do not seem to appear very frequently in real examples.
At least, these effects may be less frequent
In TREC-2, we did not apply any constraints stricter
than Booleans at run-time; in other words, we used only
a Boolean retrieval engine (because in TREC-1 we proved
that the stricter constraints didn't help). We also had to
implement a module to relax some of the "hard" Boolean
constraints for topics with a very small number of relevant
documents, because of the nature of the performance met-
rics. However, we still used the knowledge that was devel-
oped for more detailed processing, including the phrases,
semantic groupings and so forth. In addition, we added
a more sophisticated ranking mechanism than we used in
TREC-1, because ranking is very important in the evalua-
tion. But the only retrieval engine in our TREC-2 system
is a Boolean matcher.
2.1 The Boolean Matcher
The Boolean tables are efficiently organized so that a C
program (which we now know as NLGREP) can match
them against incoming texts at a rate of about 1 million
words per minute. This program spends little time on
anything other than marking where words in each doc-
ument match terms in the table. In both routing and
retrieval, the total number of terms used for a set of 50
topics is about 2000, or about 40 unique terms per topic.
All other words are ignored entirely.
For example, the following is a query for the topic
"South African sanctions" (using the enhanced regular
expression language of GE's pattern matcher [5]):
193