NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval
P. Jacobs
retrieval method with alternative ranking methods to see,
for example, whether more terms are really necessary in
order to get better ranking results.
The separation of retrieval and ranking seems to be a
valuable tool both for experimental research and for iden-
tifying different techniques for applications. It is clearly a
problem with both TREC-1 and TREC-2 that the routing
task requires a comparison of documents across a large
collection, when most routing applications deal with a
stream of documents individually or in small groups.
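To make this contrast concrete, the sketch below shows the two settings under assumed interfaces: a routing filter that accepts or rejects each incoming document on its own, versus the TREC-style task of ranking an entire collection, which forces score comparisons across documents. The scoring function, query terms, and threshold are illustrative only, not taken from any TREC system.

    # Hypothetical per-document scorer; a real system would apply its
    # query-matching logic here. This one just counts assumed query terms.
    QUERY_TERMS = {"buy-out", "merger", "acquisition"}

    def score(text):
        return sum(1 for word in text.lower().split() if word in QUERY_TERMS)

    # Typical routing application: an accept/reject decision per document,
    # made as each one arrives, with no reference to other documents.
    def route_stream(stream, threshold=1):
        for doc in stream:
            if score(doc) >= threshold:
                yield doc

    # TREC routing task: rank the whole collection at once, which requires
    # comparing scores across all documents.
    def rank_collection(collection, top_n=1000):
        return sorted(collection, key=score, reverse=True)[:top_n]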
4 Analysis of Results
The results raise a number of important issues, espe-
cially: why Boolean approximation works as well as it
does, particularly why it works for routing; where statis-
tical weighting could help more; what sort of topics this
approach does well on (and which topics it does badly
on); and other obvious areas for improvement.
One of the most important sources of information
about the advantages and disadvantages of each approach
comes from comparing the performance of different sys-
tems on different topics. Unfortunately, this is also a
very difficult task, because, while it is easy to tell which
systems did well on which topic, it is often hard to gener-
alize from that evidence why the approach worked or why
it didn't.
As we have mentioned, the Boolean approach is very
erratic with respect to performance by topic, as compared
with other systems, particularly the statistical methods
that emphasize weighting. For example, our manual rout-
ing system, which was clearly one of the best systems, had
the top performance (in precision at 100 documents) on 8
topics, but was below median on 17 topics (out of 50). In
the 11-point averages, that system was below median on
22 topics (more than 40% of the time), although it out-
performed most of the systems on average. By contrast,
one of the Cornell systems [1] was above median on ev-
ery topic! This suggests that our approach degrades less
gracefully than other approaches, and that it is important
to explore Boolean methods as an adjunct to other meth-
ods that work in the cases where the Boolean approach
seems to fail. Conversely, our system had top or near-top
scores on a significant number of topics; it is important
to know how to take advantage of this within the context
of weighting systems.
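For readers unfamiliar with the two measures used above, the following sketch computes precision at a document cutoff and the 11-point interpolated average from a ranked list and a set of known relevant documents. The function names and the toy example are ours, not from the TREC evaluation software, though the definitions follow standard TREC practice.

    def precision_at(ranking, relevant, cutoff=100):
        # Fraction of the top `cutoff` ranked documents that are relevant.
        return sum(1 for doc in ranking[:cutoff] if doc in relevant) / cutoff

    def eleven_point_average(ranking, relevant):
        # Record (recall, precision) after each relevant document retrieved.
        hits, points = 0, []
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                points.append((hits / len(relevant), hits / rank))
        # Interpolated precision at recall level r is the maximum precision
        # at any recall >= r; average over r = 0.0, 0.1, ..., 1.0.
        total = 0.0
        for level in (i / 10 for i in range(11)):
            total += max((p for r, p in points if r >= level), default=0.0)
        return total / 11

    # Toy example: three relevant documents, two retrieved at ranks 1 and 3.
    ranking = ["d1", "d7", "d3", "d9"]
    relevant = {"d1", "d3", "d5"}
    print(precision_at(ranking, relevant, cutoff=4))  # 0.5
    print(eleven_point_average(ranking, relevant))    # about 0.55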
There seem to be several different explanations for vari-
ation on the topics. First, there are topics, as we have dis-
cussed, that are particularly well suited to Boolean meth-
ods (and others that are not well suited at all). Second,
there are cases where the training method seems to work
particularly well. Third, there are cases where the manual
approach might work well because the topic description
contains particularly misleading terms, which a human
query builder can simply leave out. Fi-
nally, there are many reasons why our approach can fail,
particularly on topics with very small numbers of relevant
documents and in cases where the topics are very vaguely
specified.
One of the topics where GE had the best results was
Topic 53, "leveraged buy-outs". The topic description
specified that relevant documents had to describe an LBO
above $100 million in value, and give the terms of the
buy-out. Apparently, the $100 million figure is not im-
portant, because most of the LBO's that are reported are
major buy-outs. However, the terms (the specification
of the dollar amount) are required. Many articles about
LBO's do not report dollar amounts. This is similar to
the "Airbus subsidies" topic, where many articles that
talk about Airbus do not mention subsidies, and they are
not relevant. The advantage here seems to be that the
hard Boolean outperforms the weighting approaches be-
cause weighting, without Booleans, is likely to give an
article with many LBO words, but no dollar figures, a
high weight, just as it could give a high weight to an article
about Airbus that doesn't mention subsidies.
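The following sketch illustrates that contrast on toy data. The hard Boolean query requires both an LBO term and a dollar figure; the weighted score merely sums term weights, so an article rich in LBO vocabulary but with no dollar amount can still score near the top. The term lists, weights, and sample texts are invented for illustration and are not the actual GE or TREC queries.

    import re

    LBO_TERMS = {"buy-out", "buyout", "lbo", "leveraged"}
    DOLLAR = re.compile(r"\$\s*\d")  # crude test for a dollar figure

    def hard_boolean_match(text):
        # Require at least one LBO term AND a dollar amount.
        words = set(text.lower().split())
        return bool(words & LBO_TERMS) and bool(DOLLAR.search(text))

    def weighted_score(text):
        # No required terms: just sum assumed per-term weights.
        words = text.lower().split()
        score = sum(1.0 for w in words if w in LBO_TERMS)
        return score + (2.0 if DOLLAR.search(text) else 0.0)

    with_terms = "The leveraged buy-out was valued at $4.2 billion"
    no_terms = "The buy-out follows an earlier buyout as the LBO wave grows"

    print(hard_boolean_match(with_terms), weighted_score(with_terms))  # True 4.0
    print(hard_boolean_match(no_terms), weighted_score(no_terms))      # False 3.0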
The effect of training seems to help in the "leveraged
buy-out" case as well. The training picked up many
names of companies involved in buy-outs, like "Safeway"
and "Dart", and these were included in the queries. This
perhaps helped to separate articles about specific buy-
outs from buy-outs in general. A similar effect came
about on Topic 92, "international military equipment
sales", where the training pulled in names of many of
the weapons typically sold on the international market.
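A generic version of this training effect can be sketched as relevance feedback: rank terms by how concentrated they are in known-relevant documents and add the top ones to the query. The scoring formula and example below are assumptions for illustration, not the actual GE training procedure.

    from collections import Counter

    def expansion_terms(relevant_docs, all_docs, top_n=10):
        # Count term occurrences in the relevant set and in the whole corpus.
        rel = Counter(w for doc in relevant_docs for w in doc.lower().split())
        tot = Counter(w for doc in all_docs for w in doc.lower().split())
        # Favor terms frequent among relevant documents but rare overall.
        scored = {w: c / (1 + tot[w]) for w, c in rel.items()}
        return sorted(scored, key=scored.get, reverse=True)[:top_n]

    relevant = ["Safeway buy-out terms announced",
                "Dart group considers buy-out terms"]
    corpus = relevant + ["weather report for Tuesday",
                         "stock market closes higher"]
    print(expansion_terms(relevant, corpus, top_n=3))
    # e.g. ['buy-out', 'terms', 'safeway'] -- names of specific companies
    # surface alongside the topic terms.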
Topic 86, "bank failures", was another topic where the
GE system outperformed all others on both the 11-point
average and precision at 100 documents. This result is
hard to explain, but the one conspicuous fact about the
topic is that our query does not include the word "bank".
It does include the names of many prominent banks, so it
may be, like the LBO case, that good performance on this
topic depends mainly on distinguishing specific references
to failures from general discussions about bank failures,
for example, the S&L crisis.
On the topics that have very few relevant documents,
our approach often failed because, in the absence of train-
ing data, it tended to undergenerate; thus very few texts
(sometimes none) would match the Boolean query and
other systems with good weighting would pull in more
relevant documents. In these cases, a system that finds
one relevant document scores much better than a system
that finds zero, so the penalty for undergeneration is very
high.
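Under the 11-point measure, a single relevant document retrieved at rank r yields interpolated precision 1/r at every recall level, while retrieving nothing yields zero everywhere. The sketch below makes that concrete; the scenario is assumed, not a specific TREC topic.

    def eleven_point_one_relevant(rank_of_hit):
        # 11-point average when exactly one relevant document exists.
        # `rank_of_hit` is its retrieval rank, or None if never retrieved.
        if rank_of_hit is None:
            return 0.0
        # Recall jumps from 0 to 1 at the hit, so precision 1/rank is the
        # interpolated value at every level 0.0, 0.1, ..., 1.0.
        return 1.0 / rank_of_hit

    print(eleven_point_one_relevant(None))  # 0.00 -- Boolean query matched nothing
    print(eleven_point_one_relevant(20))    # 0.05 -- one hit, even at rank 20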
The second class of topics where we seem to go wrong
consists of those that are vaguely specified. For example,
Topic 74, "policy conflict", is a very hard topic, where the de-
scription does not include very much information. Texts
rarely mention policy conflict, and, when they do, they
are rarely relevant. On the other hand, texts about to-
bacco policies and health are likely to be relevant. This