SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
chapter
P. Jacobs
G. Krupka
L. Rau
National Institute of Standards and Technology
Donna K. Harman
There are many alternatives to full-text search that can produce much
higher accuracy, including statistical methods that weight matches based on
relative word frequencies, automatic indexing strategies, and knowledge-based
approaches that give very high accuracy for repetitive searches at the cost of a
large amount of work in constructing a knowledge base. The ideal search
strategy would be one with the accuracy of knowledge-based approaches, but with
the simple efficiency of word searches. This is the motivation for the method
described here.
The selection of this method for the TREC evaluation combined a sense of
the practice of information retrieval with a particular interpretation of what
the evaluation is about. For example, the strategy makes several assumptions
about the task that are apparently different from those made by other sites, and
which perhaps take a looser and less academic view of the experiment. These
key assumptions are:
* Relevance over ranking. The focus of text interpretation is to assign to
each text, with the highest possible accuracy, the set of topics or content
indicators that apply to that text (i.e., particularly for routing, to treat
the topic or relevance of each text individually rather than as a relative
measure against other texts in the corpus).
* Technology over engineering. The goal of the experiment is to show high-
accuracy, practical results, avoiding the threatening limitations of disk
size, memory management, and tractability. (The system did no pre-
indexing or analysis of any portion of the corpus.)
* Text interpretation over query interpretation. As there may not be any
principled structure or methodology in the sample queries, the emphasis on
matching texts to an internal representation of each query, rather than on
automatic query processing, pushes the limits of the routing and retrieval
engine instead of the interface to the user.
These choices were made for various reasons, and are presented here not as
the right way to view the task but as one software system's implicit focus. A
project with a different choice of focus could, for example, produce lower
accuracy but have the benefit of fully automatic query processing; another project
might have higher accuracy on the top few texts in each category but lower recall
on a general routing task. We view our (preliminary) results as very satisfying
as a test in a high-throughput, high-accuracy, routing-style interpretation.
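As a rough illustration of the first two assumptions above (each text judged on its own, with no pre-indexing of the corpus), the following minimal sketch assigns a set of topics to each text in a single streaming pass. It is not the system described in this paper: the TopicQuery structure, the tokenizer, and the two sample queries are hypothetical and chosen only to make the idea concrete.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicQuery:
    """Hypothetical Boolean query: a text matches when it contains every
    `all_of` term, at least one `any_of` term (if any are given), and no
    `none_of` term."""
    name: str
    all_of: frozenset = frozenset()
    any_of: frozenset = frozenset()
    none_of: frozenset = frozenset()

    def matches(self, words):
        if not self.all_of <= words:
            return False
        if self.any_of and not (self.any_of & words):
            return False
        return not (self.none_of & words)


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def assign_topics(text, queries):
    """Decide each topic for this text alone, with no reference to other texts."""
    words = tokenize(text)
    return {q.name for q in queries if q.matches(words)}


def route(stream, queries):
    """Process (doc_id, text) pairs one at a time; no index is built."""
    for doc_id, text in stream:
        yield doc_id, assign_topics(text, queries)


# Illustrative queries only; they are not taken from the TREC topics.
QUERIES = [
    TopicQuery("mergers",
               all_of=frozenset({"merger"}),
               any_of=frozenset({"company", "corporation"})),
    TopicQuery("oil-prices",
               all_of=frozenset({"oil"}),
               any_of=frozenset({"price", "prices"}),
               none_of=frozenset({"olive"})),
]
```

Because the decision for each text depends only on that text and the fixed queries, the output is a yes/no assignment per topic rather than a score relative to the rest of the corpus, which is the sense of "relevance over ranking" used above.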
The sections that follow cover the motivation for the design and
implementation of this method, some specific details of the experiment, and an analysis
of the results.