…ing this automatic Boolean query generation, and chose the one that worked best on the sample data.
Our first attempt was to use the common methods for finding collocations and word associations in sentences, and these worked horribly for term expansion. The problem is that this approach finds more associations like "funeral" and "home" than it does "hostage" and "captive", and the latter, text-level associations are what's required to generate good queries.
The "solution" we tried was, for a sample of about 10
million words in the corpus, to choose the top 20 words
based on TF.IDF weights for each document, store the
frequency of association among these terms, and then
weight each pair using the weighted mutual information
statistic of the previous section. This was much better
than using sentence-level information, although it is still
a very straightforward approach. For example, the fol-
lowing are the top 10 terms associated with the word
"hostage" (in order):
hostages
Lebanon
Beirut
Iran
release
Terry
kidnappers
kidnapped
Jihad
Anderson
While this is certainly not the optimal set of terms to use in place of "hostage", it is a good start.
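As a rough illustration, here is a minimal sketch of this association-building step. The top-20 TF.IDF selection comes from the description above, but the exact form of the weighted mutual information statistic (defined in an earlier section of the paper) is not repeated here, so the sketch assumes the common form P(x,y) log(P(x,y)/(P(x)P(y))); all function names are ours.

```python
import math
from collections import Counter
from itertools import combinations

def top_terms(doc_tokens, idf, k=20):
    """Top-k terms of one document by TF.IDF weight."""
    tf = Counter(doc_tokens)
    return sorted(tf, key=lambda t: tf[t] * idf.get(t, 0.0), reverse=True)[:k]

def association_table(docs, idf, k=20):
    """Count how often each high-TF.IDF term, and each pair of such
    terms, appears in a document's top-k list."""
    term_freq, pair_freq = Counter(), Counter()
    for doc in docs:
        terms = sorted(set(top_terms(doc, idf, k)))
        term_freq.update(terms)
        pair_freq.update(frozenset(p) for p in combinations(terms, 2))
    return term_freq, pair_freq

def weighted_mi(pair, term_freq, pair_freq, n_docs):
    """Assumed form of weighted mutual information:
    P(x, y) * log(P(x, y) / (P(x) * P(y)))."""
    x, y = tuple(pair)
    p_xy = pair_freq[pair] / n_docs
    if p_xy == 0.0:
        return 0.0
    ratio = p_xy / ((term_freq[x] / n_docs) * (term_freq[y] / n_docs))
    return p_xy * math.log(ratio)

def expansions(word, term_freq, pair_freq, n_docs, top=10):
    """Rank candidate expansion terms for `word` by weighted MI."""
    scores = {}
    for pair in pair_freq:
        if word in pair:
            (other,) = pair - {word}
            scores[other] = weighted_mi(pair, term_freq, pair_freq, n_docs)
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

Run over the training sample, a call like `expansions("hostage", term_freq, pair_freq, n_docs)` would produce a ranked list of the kind shown above.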
The next problem in automatic query construction is deciding when to use a combination of terms and when to use a single term. For example, "weather-related fatalities" is a combination of two word groups (weather and fatalities), while "Iran-contra affair" is really only one group (Iran-contra), even though it might appear that "affair" is a significant term.
Again we took the direct approach, choosing to combine terms whenever there was a reasonable percentage of overlap between their associated terms (the sketch below illustrates the test). This worked surprisingly well in cases where the topic title was a good description (e.g. "welfare reform") and very badly for those with vague titles (e.g. "find innovative companies"). We tried to recover from these by including more words from the description and narrative, but then we had to start recognizing the language of these descriptions, filtering out words like "relevant", "mention" and so forth. At this crude stage, the main problem with the query generation method is making good use of the structure of the topic descriptions.
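Building on the expansion sketch above, the combination decision might look like the following; the overlap measure and the 0.3 threshold are illustrative guesses, since the text says only "a reasonable percentage of overlap".

```python
def should_combine(t1, t2, term_freq, pair_freq, n_docs,
                   top=10, threshold=0.3):
    """Treat two topic terms as a single group when enough of their
    associated terms coincide (threshold chosen arbitrarily here)."""
    a1 = set(expansions(t1, term_freq, pair_freq, n_docs, top))
    a2 = set(expansions(t2, term_freq, pair_freq, n_docs, top))
    if not a1 or not a2:
        return False  # no association evidence; keep the terms separate
    return len(a1 & a2) / min(len(a1), len(a2)) >= threshold
```

Under a test like this, "weather" and "fatalities" would presumably share few associates and remain two groups, while "Iran" and "contra" would overlap heavily and fuse into one, matching the examples above.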
The second major issue with automatic query generation is that it isn't nearly as good at finding good terms as the process of training from data and relevance judgements, as used in the routing experiments. The relevance judgements used for routing contain large volumes of relatively high-accuracy data, while the training used for term expansion in query generation relied on relatively small volumes of relatively noisy data. For example, the word "welfare" used in one of the ad hoc topics occurred with a high enough TF.IDF weight only 29 times in the training sample, and the most frequently associated term, "children", occurred only 6 times. In order to establish good associations between "welfare" and less frequent terms, we would need much more data. The data from TREC-2 seem to suggest that low-frequency terms contribute more in term expansion than high-frequency terms, so using a "small" training sample (10 million words is only about 3% of the corpus) was a major error. We made many other mistakes in the training method, including mixing samples from the Federal Register and DOE sources with other texts that are much more likely to be relevant. This leaves a lot of room for future experiments and improvement.
The fully automatic ad hoc system certainly didn't do as well as the manual routing system, but it was still at or above the median for more than half of the ad hoc topics. Considering that this method could be used within the context of almost any legacy retrieval system, the result is worth noting. Furthermore, the generation of Boolean queries from natural language descriptions is an interesting, as well as practical, research problem, because many different retrieval systems can make some use of Boolean queries.
3 Ranking
In both routing and ad hoc, we used a set of word weights for ranking, acquired from the relevance judgements in the routing case and from the corpus data in the ad hoc case. In routing, the weights reflect the statistical measure of association between the term and each topic (using the weighted mutual information score given earlier). In the ad hoc case, the weight is a function of the frequency of the term in the topic description, the inverse collection frequency, and an additional factor that weights certain components of the topic descriptions (such as the title and description) more heavily than others. We combined the weighted frequency of these terms with an overall count of the number of topic hits per document, normalizing for document length, to produce a score for each document (a sketch follows below). This was the result of trying many different approaches on the test data, so it was definitely a good method for our system.
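The following sketch shows the general shape of that ad hoc scoring. The section weighting factors, the inverse-collection-frequency form, and the way hit counts and weighted frequency are combined and length-normalized are all assumptions; the paper names the ingredients but not the exact formula.

```python
import math
from collections import Counter

# Illustrative section factors: title and description terms count more.
SECTION_WEIGHT = {"title": 3.0, "desc": 2.0, "narr": 1.0}

def topic_term_weights(topic_sections, doc_freq, n_docs):
    """Weight each topic term by its section-weighted frequency in the
    topic description times an inverse collection frequency factor."""
    weights = Counter()
    for section, tokens in topic_sections.items():
        factor = SECTION_WEIGHT.get(section, 1.0)
        for t in tokens:
            weights[t] += factor
    for t in weights:
        weights[t] *= math.log(n_docs / (1 + doc_freq.get(t, 0)))
    return weights

def score_document(doc_tokens, term_weights):
    """Combine the weighted frequency of topic terms with a count of
    distinct topic hits, normalized for document length."""
    tf = Counter(doc_tokens)
    length = max(len(doc_tokens), 1)
    weighted = sum(w * tf[t] for t, w in term_weights.items())
    hits = sum(1 for t in term_weights if tf[t] > 0)
    return (weighted + hits) / length
```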
However, in comparing our results with those of other systems, our precision curve across various recall points is not nearly as good as that of a system that does really good ranking. In routing, we are not sure that ranking is important, but it is certainly important in getting good results in TREC. So, we are inclined to try to combine our