NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval
P. Jacobs
National Institute of Standards and Technology
D. K. Harman
GE in TREC-2: Results of a Boolean Approximation Method
for Routing and Retrieval*
Paul S. Jacobs
GE Research and Development Center
Schenectady, NY 12301
psjacobs@crd.ge.com
Abstract
This report describes a few experiments
aimed at producing high accuracy routing and retrieval
with a simple Boolean engine. There are
several motivations for this work, including: (1)
using Boolean term combinations as a filter for
advanced data extraction systems, (2) improving
"legacy" Boolean retrieval systems by helping to
automate the generation of Boolean queries, and
(3) focusing on query content, rather than re-
trieval or ranking, as the key to system perfor-
mance. The results show very high accuracy,
and significant progress, using a Boolean engine
for routing based on queries that are manually
generated with the help of corpus data. In addition,
the results of a straightforward implementation
of a fully automatic ad hoc method
show some promise of being able to do good au-
tomatic query construction within the context of
a Boolean system.
1 Introduction
Full-text search is currently the simplest and most
commonly-used method for locating information in large
volumes of free text. Because users are accustomed to
describing what they are looking for with specific words,
and those words are often found in the texts, searching the
text for selected words or word combinations is a natural
and easy-to-implement method for information retrieval.
However, it can be very inaccurate. It can be especially
difficult for searchers to compose "queries" that combine
the words that are effective in locating relevant material
without finding large quantities of irrelevant information
as well. One way to cope with this difficulty, while still
preserving the advantages of the full-text search engine,
is to help to automate the process of generating Boolean
queries. This was the focus of GE's TREC-2 effort.

*This research was sponsored in part by the Advanced Research
Projects Agency. The views and conclusions contained in this document
are those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of the
Advanced Research Projects Agency or the US Government.

GE's involvement in TREC represents a relatively low
level of effort aimed at bringing together natural language
text processing, data extraction, and statistical corpus
analysis methods. Our project uses innovative approaches
for extracting information from text, best exemplified in
our results in the MUC and TIPSTER extraction evaluations
[7, 3] and in operational text management systems
in GE. In TREC-1, we attempted to show the benefit
of natural language interpretation by using Boolean ap-
proximation to select portions of text that could be fur-
ther interpreted. The main result of this was that natural
language seems to have very little to offer as a precision
filtering method, because routing and retrieval problems
stem largely from having the wrong terms in the queries
[6]. Thus, in TREC-2, we have stuck with the Boolean
engine, concentrating on the use of corpus analysis to im-
prove the queries.
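To make the notion of a Boolean query over free text concrete, the following is a minimal sketch in Python. It is an illustration only and is not the GE engine: the nested-tuple query representation, the function names `tokenize` and `matches`, and the three sample documents are all assumptions introduced here.

```python
import re

def tokenize(text):
    """Lowercase a text and return the set of its word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(query, tokens):
    """Evaluate a nested Boolean query against a set of tokens.

    A query is either a single term (str) or a tuple whose first
    element is "AND", "OR", or "NOT", followed by sub-queries.
    This representation is hypothetical, chosen for brevity.
    """
    if isinstance(query, str):
        return query in tokens
    op, *args = query
    if op == "AND":
        return all(matches(a, tokens) for a in args)
    if op == "OR":
        return any(matches(a, tokens) for a in args)
    if op == "NOT":
        return not matches(args[0], tokens)
    raise ValueError(f"unknown operator: {op}")

# Made-up documents, for illustration only.
docs = {
    "d1": "The company announced a merger with a rival bank.",
    "d2": "The river bank flooded after heavy rain.",
    "d3": "Analysts expect the acquisition to close this year.",
}

# (merger OR acquisition) AND NOT river
query = ("AND", ("OR", "merger", "acquisition"), ("NOT", "river"))
hits = [name for name, text in docs.items() if matches(query, tokenize(text))]
print(hits)
```

A query of this kind either accepts or rejects each document, so its accuracy depends entirely on which terms appear in it and how they are combined, which is why the choice of query terms matters so much.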
Figure 1 summarizes our TREC results. Our results in
TREC-2, as in TREC-1, were quite good relative to other
systems. The manual routing system, which comprised
over 99% of our effort, produced an 11-point average of
.3308, with an average of 45 relevant documents in the top
100. This put GE's system at the very top of the man-
ual routing category (the system with the best 11-point
average in this category was slightly higher on the 11-
point average and had slightly fewer relevant documents,
on average, in the top 100).
The residual effort went into a fully automatic ad hoc
method, which produced an 11-point average of .2183 and
an average of 37 relevant documents in the top 100. As in
TREC-1, performance varied dramatically by topic. The
routing system showed the best results (in terms of preci-
sion at 100 documents) on 8 of 50 topics. Yet it was below
median on 17 topics. This not only suggests areas for fur-
ther improvement, but also shows an important difference
between the Boolean approach and some of the statistical
retrieval systems. The Boolean approach does much bet-
ter on certain topics, but the statistical approaches have
more consistent performance.
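The two measures quoted above, the 11-point average and the number of relevant documents in the top 100, can be sketched for a single topic as follows. This is a minimal sketch of the usual interpolated definition, not the official TREC evaluation software, which may differ in detail; the function names and the toy ranking are assumptions introduced for illustration.

```python
def eleven_point_average(ranking, relevant):
    """11-point interpolated average precision for one topic.

    ranking  -- document ids in the order the system returned them
    relevant -- ids judged relevant for the topic
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    # Record (recall, precision) after each relevant document retrieved.
    points = []
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    total = 0.0
    for i in range(11):
        level = i / 10  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: the best precision observed at any
        # recall at or above this level (0 if that recall is never reached).
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11

def relevant_in_top(ranking, relevant, k=100):
    """Count the relevant documents among the first k retrieved."""
    relevant = set(relevant)
    return sum(1 for doc in ranking[:k] if doc in relevant)

# Toy example: four retrieved documents, two of them relevant.
ranking = ["a", "b", "c", "d"]
relevant = {"a", "c"}
avg = eleven_point_average(ranking, relevant)
top2 = relevant_in_top(ranking, relevant, k=2)
```

Figures such as those reported here would then be the mean of these per-topic values over all topics, which is why a system can rank well on average while still falling below the median on individual topics.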