SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
chapter
P. Jacobs
G. Krupka
L. Rau
National Institute of Standards and Technology
Donna K. Harman
quite simple. Second, it is easier for a system to recognize most or all company
names than it is to list all the possible words that could appear in the name
of a company. Third, the amount of information captured in the query is still
limited; for example, it ignores the order in which words appear as well as their
adjacency or proximity. Many text search systems allow augmentations to queries
that express these constraints, but this makes the queries still more difficult
to construct and makes searches less efficient. Hence Boolean retrieval systems
remain, in practice, awkward and inaccurate.
The experiment that this team performed in TREC was hatched under the
pressure of a very difficult task with very limited resources. While a number of
individuals participated (including the three authors and five others), the program
described here resulted from less than a week of programming over and
above the existing software tools we had in hand to apply to the task. The
constraints thus forced upon the effort included the realization that not much
could be salvaged from the queries as distributed, that the sample relevance
judgements were incomplete and the training samples too small for most statis-
tical tests, and that indexing the corpus by any practical method could delay the
project by weeks, overflow the disks, and prevent any corrections to the method.
This led to a "bare bones" strategy that takes advantage of two ideas: (1)
treating Boolean queries as an approximation to more detailed, structured
representations, and (2) using the corpus, rather than the queries, as the main
source of information for formulating the topics.
3 Our Approach
The fundamental idea behind the method is to take a knowledge-based descrip-
tion of a query or topic, and convert it to a Boolean form that can be efficiently
applied by a text-search engine. This Boolean form, furthermore, must be an
approximation of the knowledge-based query, in the formal sense that the Boolean
expression should match all texts that the knowledge-based query would, but
may also admit additional texts.
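
The approximation property can be illustrated with a minimal Python sketch. This is not the authors' system: the word lists, the `window` parameter, and both matcher functions are hypothetical stand-ins chosen only to show a Boolean query that matches a superset of what a more structured query matches.

```python
import re

# Hypothetical vocabularies, for illustration only (not from the paper).
COMPANY_WORDS = {"corp", "inc", "co"}
ACQUIRE_WORDS = {"acquire", "acquired", "acquires", "bought"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def knowledge_based_match(text, window=5):
    """Stand-in for a structured query: a company word must be followed
    by an acquisition word within `window` tokens."""
    tokens = tokenize(text)
    for i, tok in enumerate(tokens):
        if tok in COMPANY_WORDS:
            following = tokens[i + 1 : i + 1 + window]
            if any(t in ACQUIRE_WORDS for t in following):
                return True
    return False


def boolean_match(text):
    """Boolean approximation: (any company word) AND (any acquisition word),
    ignoring word order and proximity."""
    tokens = set(tokenize(text))
    return bool(tokens & COMPANY_WORDS) and bool(tokens & ACQUIRE_WORDS)


DOCS = [
    "Acme Corp acquired Widget Inc last week.",
    "Shares bought by analysts rose; Acme Corp declined comment.",
    "The weather was mild.",
]

# Every text matched by the structured query is also matched by the
# Boolean form; the Boolean form may match more.
superset_holds = all(boolean_match(d) for d in DOCS if knowledge_based_match(d))
```

In this toy corpus the second text satisfies the Boolean form but not the structured one, which is exactly the kind of extra text the approximation is allowed to admit.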
There are several key advantages to this approach of generating a Boolean
query from a knowledge-based description. The simplest benefit is that it makes
building queries easier, because much of the work in forming complex Boolean
expressions is done automatically. A second major advantage is that the Boolean
queries, in approximating a knowledge-based approach, are more likely to give
accurate results. Finally, because retrieval using the automatically-generated
Boolean query approximates the knowledge-based query, the knowledge-based
system can run on the results of Boolean retrieval, thus enhancing precision
without having to apply the more computationally-intensive knowledge-based
processing to very much text.
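
The two-stage arrangement described in this paragraph can be sketched as follows. The function `two_stage_retrieve` and both predicates are hypothetical, written in Python purely for illustration; the sketch assumes the Boolean filter admits every text the more expensive matcher would accept.

```python
def two_stage_retrieve(docs, boolean_filter, knowledge_match):
    """Run the cheap Boolean filter over all texts, then apply the costly
    matcher only to the survivors. Returns the final hits and the number
    of texts the costly matcher had to examine."""
    candidates = [d for d in docs if boolean_filter(d)]
    hits = [d for d in candidates if knowledge_match(d)]
    return hits, len(candidates)


def boolean_filter(text):
    """Cheap test (assumed): both words occur anywhere in the text."""
    words = set(text.lower().replace(".", " ").split())
    return "bank" in words and "merger" in words


def knowledge_match(text):
    """Stand-in for costly analysis (assumed): the exact phrase must occur."""
    return "bank merger" in text.lower()


docs = [
    "The bank merger was approved.",
    "A merger was discussed near the river bank.",
    "Rain is expected.",
    "Stocks fell.",
]

hits, examined = two_stage_retrieve(docs, boolean_filter, knowledge_match)
```

Because the filter is a superset of the costly matcher, the hits are the same as if the costly matcher had been run over the whole collection, while it is actually applied to only the texts that pass the filter.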
One of the obvious problems to overcome in TREC was, with a limited
amount of time to formulate 100 queries, with a small amount of training data