SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC-II Routing Experiments with the TRW/Paracel Fast Data Finder
chapter
M. Mettler
National Institute of Standards and Technology
D. K. Harman
reports, etc.) are not relevant to this topic; nor are articles which describe subsidies not
directed toward Airbus.
Traditional IR term weighting techniques do not give any explicit benefit to articles which
conjoin ideas. Articles which include terms relevant to each of the component sub-topics
will receive high scores; but so will articles which include many terms relevant to only one
sub-topic. Recent efforts implicitly include conjunctions through the use of phrases as
terms in otherwise traditional statistical methods.
An alternative is the use of boolean operators. This has the desired effect -- an AND of
terms forces a conjunction -- but the use of booleans in IR has been viewed with some
skepticism and disfavor. Boolean operators often find a conjunction of terms where none
truly exists (for example, Airbus and subsidies might be mentioned in two separate and
unrelated portions of an article); or, if made sufficiently restrictive to eliminate spurious
matches, boolean-based searches often miss relevant articles.
We have followed an approach which incorporates both ideas. Rather than focus on
specific phrases, we search for terms in proximity to one another. The terms in the query
are chosen to represent each of the constituent sub-topics, just as in a boolean search. The
specificity of the query is adjusted by varying the required proximity of the terms. Thus,
for Airbus subsidies we might search for terms representing "Airbus" in a range of
proximities to terms representing "subsidies".
This approach allows conjunctions to be graded. A small proximity restriction (say, 3
words) yields results similar to a keyphrase search, indicating that the two concepts are
indeed associated in the article and that the article is relevant to the topic. A large proximity
restriction (1 article) is analogous to a simple boolean keyword search and retrieves articles
in which the concept terms may be only loosely associated. Intermediate proximities (1
sentence, 1 paragraph, etc.) indicate intermediate degrees of association and intermediate
recall/precision trade-offs.
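The graded conjunction described above can be illustrated with a minimal Python sketch. It is not the Fast Data Finder query language; the function names, the whitespace tokenization, and the two synonym sets are assumptions made only for illustration.

```python
def min_term_distance(tokens, terms_a, terms_b):
    """Smallest distance in words between any term of terms_a and any term
    of terms_b, or None if either set is absent from the text."""
    pos_a = [i for i, t in enumerate(tokens) if t in terms_a]
    pos_b = [i for i, t in enumerate(tokens) if t in terms_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def matches(tokens, terms_a, terms_b, window):
    """True if the two concepts occur within `window` words of each other."""
    distance = min_term_distance(tokens, terms_a, terms_b)
    return distance is not None and distance <= window


# Hypothetical synonym sets for the two sub-topics.
AIRBUS = {"airbus"}
SUBSIDY = {"subsidies", "subsidy"}

close = "airbus received new subsidies from european governments".split()
loose = "airbus delivered ten jets . in unrelated news farm subsidies rose".split()

print(matches(close, AIRBUS, SUBSIDY, window=3))           # True
print(matches(loose, AIRBUS, SUBSIDY, window=3))           # False
print(matches(loose, AIRBUS, SUBSIDY, window=len(loose)))  # True
```

With a 3-word window only the first text matches, as in a keyphrase search; widening the window to the whole article also admits the second text, where the two concepts are unrelated, as in a plain boolean AND.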
It is also possible to use multiple proximities in a single query with this method, or to use
proximities and occurrence frequencies together, to form multi-dimensional arrays of query
parameters. For example, for Topic 62, Military Coups D'etat, the number of conjunctions
was traded off against the proximity of the conjunction to form a two-dimensional query
set.
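A two-dimensional query set of this kind can be sketched as a grid in which each cell pairs a proximity window with a minimum number of conjunctions. The sketch below is an assumption about how such a grid might be evaluated, not the procedure actually used; in particular, counting a conjunction as one occurrence of a first-concept term with a second-concept term inside the window is our own simplification, and the term lists are hypothetical.

```python
def count_conjunctions(tokens, terms_a, terms_b, window):
    """Number of terms_a occurrences that have some terms_b occurrence
    within `window` words (an assumed definition of a conjunction)."""
    pos_b = [i for i, t in enumerate(tokens) if t in terms_b]
    return sum(
        1
        for i, t in enumerate(tokens)
        if t in terms_a and any(abs(i - j) <= window for j in pos_b)
    )


def query_grid(tokens, terms_a, terms_b, windows, min_counts):
    """Evaluate every (window, minimum count) cell of the query set."""
    return {
        (w, c): count_conjunctions(tokens, terms_a, terms_b, w) >= c
        for w in windows
        for c in min_counts
    }


# Hypothetical term lists for the two ideas in the topic.
MILITARY = {"military", "army"}
COUP = {"coup", "overthrow"}

text = ("the army staged a coup and military leaders "
        "defended the overthrow of the president").split()

grid = query_grid(text, MILITARY, COUP, windows=[2, 3], min_counts=[1, 2])
for cell in sorted(grid):
    print(cell, grid[cell])
```

Cells with a small window and a high minimum count are the most restrictive; relaxing either dimension trades precision for recall.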
For the initial experiment, lists of synonyms representative of each idea in a topic were
manually built, and one- or two-dimensional query sets were built from these lists. These
queries were then run against the training database, and after some feedback, the query sets
were finalized. Each finalized query set was run against the training database to determine
a ranking of the queries based solely on selectivity.
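The ranking step can be sketched as follows. The text does not define selectivity precisely, so this sketch assumes that a query matching fewer training documents is more selective and is ranked first, and that each document takes the position of the most selective query that retrieves it. The function names and the sample data are hypothetical.

```python
def rank_queries_by_selectivity(hits):
    """Order query ids from most to least selective.

    `hits` maps a query id to the set of training documents it matched.
    Assumption: fewer matches means higher selectivity; ties are broken
    by query id so the order is deterministic."""
    return sorted(hits, key=lambda q: (len(hits[q]), q))


def rank_documents(hits):
    """List documents in the order of the most selective query that
    retrieved them (assumed scheme, not confirmed by the text)."""
    ranked, seen = [], set()
    for query in rank_queries_by_selectivity(hits):
        for doc in sorted(hits[query]):
            if doc not in seen:
                seen.add(doc)
                ranked.append(doc)
    return ranked


# Hypothetical match sets for three proximity settings of one topic.
hits = {
    "3 words": {"d1"},
    "1 sentence": {"d1", "d2"},
    "1 article": {"d1", "d2", "d3", "d4"},
}

print(rank_queries_by_selectivity(hits))  # ['3 words', '1 sentence', '1 article']
print(rank_documents(hits))               # ['d1', 'd2', 'd3', 'd4']
```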
Table I shows a sample proximity query and Table III shows our TREC-II results. The
numbers of relevant documents retrieved by the proximity-method queries are labeled
TRW1.