2.2.1 "Manual" queries for routing 2.2.2 "Hard" vs. "Soft" Booleans
In manual routing, our approach uses a statistical corpus analysis, developed originally for text categorization [4], to pull out terms based on their relative frequency in relevant documents for each topic. The statistic used combines the entropy-based mutual information statistic (testing the independence of each term with each topic) with a correction for low-frequency terms and for ambiguous words. Words with high weights have a high degree of association with a topic. This statistical analysis is also used in ranking. The base weighting formula is the following:
$C(\log_2 b)(\log_2 r)$

where $C$ is a constant, $b$ is the number of times a term appears in a story assigned to a particular category, for example, and $\log_2 r$ is the log of the ratio of the combined probability (i.e., of a particular word or phrase occurring in a text about a particular category) to the product of the independent probabilities, that is, the mutual information statistic. This statistic tests the assumption that the use of the word and the category of the text are independent. When that assumption is false, the word gets a high positive or negative weight.
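As a rough illustration of this weighting, the following Python sketch computes the base weight from raw counts. The function and variable names, the maximum-likelihood probability estimates, and the omission of the paper's low-frequency and ambiguity corrections are our own simplifying assumptions, not GE's implementation:

    import math

    def term_weight(term_in_cat, term_count, cat_count, total, C=1.0):
        # Base weight C * log2(b) * log2(r) for a (term, category) pair.
        # b: times the term appears in stories assigned to the category.
        # r: joint probability P(term, category) over the product
        #    P(term) * P(category) -- the mutual information ratio.
        # The paper's corrections for low-frequency and ambiguous words
        # are omitted here.
        b = term_in_cat
        if b == 0:
            return 0.0
        r = (b / total) / ((term_count / total) * (cat_count / total))
        return C * math.log2(b) * math.log2(r)

Under this scheme, a term like AIRBUS, which is frequent overall but heavily concentrated in documents about Topic 51, gets both a large $\log_2 b$ and a large $\log_2 r$, consistent with the high weights in the list below.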
For example, the following are the top words for Topic
51, "Airbus subsidies":
A-330 TOPIC51 1263.3
AIRBUS TOPIC51 1183.1
A-340 TOPIC51 1178.2
INDUSTRIE TOPIC51 1071.9
A-320 TOPIC51 1067.5
MESSERSCHMITT-BOELKOW-BLOHM TOPIC51 851.3
AERONAUTICAS TOPIC51 843.3
CONSTRUCCIONES TOPIC51 807.9
AEROSPATIALE TOPIC51 762.5
MBB TOPIC51 722.6
WIDE-BODY TOPIC51 617.7
MD-11 TOPIC51 613.8
TOULOUSE TOPIC51 228.5
JETLINERS TOPIC51 217.8
LUFTHANSA TOPIC51 196.6
MD-80 TOPIC51 196.6
Clearly, these words all have some reason to be associated with this topic, but adding them to the appropriate group in each query (or ignoring them entirely) is a "manual" process. Our manual routing queries, therefore, are a combination of the regular expressions that were developed from the topic descriptions with terms added that were selected from the automatic training. This is, we believe, a very practical manual approach with very good performance.
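To make this construction concrete, here is a hypothetical Python sketch of such a combined query for Topic 51. The two-group AND with paragraph scope follows the description in the next subsection; the group contents and regular expressions are illustrative, and the actual GE query syntax is not shown in this excerpt:

    import re

    # Hypothetical routing query for Topic 51 ("Airbus subsidies"):
    # each group is seeded with expressions from the topic description
    # and padded by hand with high-weight terms from the training list.
    AIRBUS_GROUP = re.compile(
        r"\b(airbus|a-3[234]0|aerospatiale|mbb|"
        r"messerschmitt-boelkow-blohm)\b", re.IGNORECASE)
    SUBSIDY_GROUP = re.compile(
        r"\b(subsid(?:y|ies|ize[ds]?)|government (?:aid|support))\b",
        re.IGNORECASE)

    def paragraph_matches(paragraph):
        # Hard Boolean: both groups must hit the same paragraph.
        return bool(AIRBUS_GROUP.search(paragraph)
                    and SUBSIDY_GROUP.search(paragraph))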
2.2.2 "Hard" vs. "Soft" Booleans

The Boolean matcher uses a "hard" Boolean approach, in that, for each query, it will admit only texts that satisfy the conditions of that query. For example, in Topic 51 above, "Airbus subsidies", the matcher will allow only texts that have both an Airbus term and a subsidy term in the same paragraph. However, this is a narrow topic, and TREC-2 allows each system to produce 1000 texts for each topic. The evaluation metrics impose no penalty for filling up the list of 1000 with texts that are likely to be irrelevant. So, in order to provide increased flexibility and consider larger numbers of texts for each topic, we used an additional engine solely for the purpose of pulling in texts for very specific queries like this one.
The system is still a hard Boolean system in that texts that satisfy the Boolean conditions will always be ranked higher than texts that do not; however, texts that do not satisfy the conditions can appear on the final ranked lists. The "soft" engine considers such texts by relaxing some of the Boolean conditions, effectively pulling in texts that match a large number of query terms but do not necessarily meet all the conditions. This component of the system is more like a statistical retrieval engine; however, it does not have a large impact on the overall scores, because it affects the results only at the low-precision extreme (the lowest rankings), for queries that match very few documents. In fact, for Topic 51, the hard Boolean query matches 11 texts, and the soft method pulls in an additional 989 texts. But only 11 texts are judged relevant, of which 10 satisfy the hard Boolean. So enforcing the hard Boolean condition seems to work well for this topic, and the soft Boolean doesn't contribute much.
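The combined ordering can be summarized as sorting on a two-part key, hard-Boolean satisfaction first and soft evidence second. This is a minimal sketch assuming documents are represented as term sets and the soft score is simply the count of matched query terms; the actual relaxation heuristics are not specified in this excerpt:

    def rank(docs, query_terms, satisfies_hard):
        # Any document satisfying the hard Boolean query outranks every
        # document that does not; within each group, documents with more
        # matching query terms rank higher. Both the satisfies_hard
        # predicate and the overlap tiebreak are assumptions.
        def key(doc):
            return (satisfies_hard(doc), len(query_terms & doc))
        return sorted(docs, key=key, reverse=True)

For Topic 51 this reproduces the behavior described above: the 11 hard matches occupy the top of the list, and the soft engine fills the remaining 989 slots.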
For some topics, the balance between the hard Boolean condition and the soft Boolean isn't so clear; i.e., not enforcing the hard condition would lead to better ranking. This seems to be a function of how well the topics fit Boolean expressions in general. "Airbus subsidies" is really a Boolean topic, in that a relevant text must say something about Airbus and something about subsidies. Other topics, like "automation", are much harder to express in Boolean form. We will cover these issues in the topic-by-topic performance discussion later in this paper.
2.2.3 "Automatic" queries for ad hoc retrieval
The "m?[OCRerr]ual" query method is partly automated in the
sense that the corpus-based statistical training suggests
many of the terms that are used in the queries. But it is
"manual" in that the initial formulation of the queries
is done manually from the TREC topic descriptions.
We have tried, as a simple experiment, to generate the
Boolean term groupings and expand each term automat-
ically from the topic descriptions. In the days before the
TREC-2 ad hoc test, we tried several different ways of do-