SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval
chapter
P. Jacobs
National Institute of Standards and Technology
D. K. Harman
AD HOC TEST ROUTING TEST
11-pt. ReI.@ 11-pt. ReI.@
avg. 100 docs. avg. 100 docs.
Boolean `92 .2029 47.2 .2078 35.6
Pattern matcher `92 .1961 46.2 .1851 34.6
Avg. median run `92 .1585 39.7 .1246 28.6
Boolean `93 .2183* 37* .3308 45
Avg. median run `93 .2620 41 .2910 41
[OCRerr]= fully automatic
Figure 1: Summary GE Results on TREC-1 and TREC-2
While it is very hard to measure progress between
TREC-1 and TREC-2 because none of the numbers are
directly comparable, the difference between our .2078
routing average in TREC-1 and .3308 in TREC-2 is large
enough that we are quite pleased with the rate of progress
and confident that we are nowhere near the peak that can
be achieved with manual, Boolean routing. In addition,
the automatic ad hoc method that we tried, while show-
ing terrible performance on some of the more convoluted
topic descriptions, had a better average than our manual
ad hoc system last year.
Thus, there is a great deal to be gained using cor-
pus analysis to automate or assist in query generation,
even within the context of straightforward retrieval meth-
ods. In addition to having promise for "legacy" oper-
ational systems, these results suggest that natural lan-
guage methods, focused on corpus analysis and query gen-
eration, probably can help in improving the performance
of many information retrieval systems.
2 Boolean Approximation
The basic approach in our system is to compile queries
into Boolean tables that can be matched at high speed
against a stream of input text. This approach is meant
for routing, and also to be compatible with "downstream"
analysis such as what we do in TIPSTER data extraction.
In fact, the Boolean compiler we use is designed for han-
dling the much more complex expressions that our system
uses in data extraction.
Figure 2 illustrates the approach. We call this Boolean
approximation because the Boolean expressions used in
the basic matching engine are an approximation to more
detailed processing of texts, in the sense that they are
guaranteed to admit all text that would be admitted
by more detailed processing, but will usually also admit
many texts that would be rejected by more detailed con-
straints. This is a very general method, in that the sys-
tem can be configured to apply many different stages of
analysis, from "shallower" processing to "deeper" inter-
pretation, with each stage applying stricter constraints-
for example, word order, proximity, semantic constraints,
and so forth. Furthermore, at each stage, the effects of
filtering can be measured, generally showing a loss of re-
call and gain in precision. In TREC-1 [6], we measured
this tradeoff and found that the highest 11-point averages
came from the first stage of filtering; in other words, the
gains in precision in later stages were not enough to make
up for the loss of recall on these measures.
The figure also illustrates the flow of information at
development time and the sort of knowledge that applies
at each stage. For example, at development time, knowl-
edge can be mapped from the deeper levels into shallower
levels. At run time, the subsequent stages of language
analysis apply this knowledge in stages. For example,
our system can analyze joint venture texts (in English
and Japanese), looking for, among other things, infor-
mation about the joint venture company by recognizing
that the company is often the object of the verb estab-
lish. In the Boolean stage, this can be approximated by
looking for the combination of words like establish with
words like venture. In the finite state pattern matching
stage, the system might look for any word with the root
establish followed by the venture term (and perhaps the
reverse order with the verb in a passive form). In deeper
interpretation, the system applies syntactic and semantic
constraints to recognize the different ways that the con-
192