SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
chapter
P. Jacobs
G. Krupka
L. Rau
National Institute of Standards and Technology
Donna K. Harman
as .03. To us, narrowing the query seemed a good idea because the precision
on this topic otherwise would have been low, but we did not realize that the
documents that the system didn't retrieve were still treated as incorrect in this
calculation.
On Topic 43 (1991 AI conferences), our system produced 3 documents, all
of which were irrelevant. This "routing" topic was later discarded because no
relevant documents were found in the corpus, but there is nothing inherently
wrong with testing topics for which there is no data. In fact, the ideal routing
system should produce 0 hits for such a topic, not 200 hits as dictated in TREC.
Certainly ranking and routing don't go together in any real task on a gigabyte
sample. One way that future evaluations can test routing is to use a random (or
otherwise fair) sample of the collections as a test, judge every document in that
sample with respect to every query, and then measure each system's recall and
precision on the basis of the sample. This would probably require less handwork
in judging relevance, but would require that each system produce topic
assignments for every document in the collection (from which the assignments
for the test sample would be extracted post hoc). This could be impossible for
some systems. On the other hand, the strategy would give real numbers for
both recall and precision, and would be much truer to the routing task.
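The sample-based evaluation proposed above can be sketched in a few lines. This is only an illustration under stated assumptions, not a procedure used in TREC: the function names (draw_sample, sample_evaluation) and the dictionary-of-sets data layout are hypothetical. Each system assigns topics to every document in the collection, a fixed sample is judged exhaustively, and recall and precision are computed only over that sample.

```python
import random


def draw_sample(doc_ids, size, seed=0):
    """Draw a reproducible random sample of document ids (hypothetical helper)."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(doc_ids), size))


def sample_evaluation(assignments, judgments, sample_ids):
    """Compute (recall, precision) per topic, restricted to a judged sample.

    assignments: topic -> set of doc ids the system assigned to that topic
                 (over the whole collection)
    judgments:   topic -> set of sample doc ids judged relevant to that topic
    sample_ids:  set of doc ids in the exhaustively judged sample
    A measure is None when its denominator is zero (undefined).
    """
    results = {}
    for topic, relevant in judgments.items():
        # Keep only the assignments that fall inside the judged sample.
        assigned = assignments.get(topic, set()) & sample_ids
        hits = len(assigned & relevant)
        precision = hits / len(assigned) if assigned else None
        recall = hits / len(relevant) if relevant else None
        results[topic] = (recall, precision)
    return results
```

Under this scheme a topic with no relevant documents in the sample is no longer a problem: a system that assigns nothing to it simply yields undefined measures rather than being forced to return a fixed number of hits.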
7 Utility
The main purpose of this method is as a front-end for computation-intensive
natural language processing of large bodies of text. Because the pre-filter closely
approximates more in-depth processing with a very fast, efficient process, it
permits detailed processing of large volumes of text by discarding most of the
irrelevant material and by producing a rough approximation of the more detailed
processing.
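The front-end role described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a query is a conjunction of groups of alternative terms and that the expensive stage is an arbitrary function passed in by the caller; the names tokenize, matches, prefilter, and deep_analysis are hypothetical and do not come from the system described here.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query, text):
    """Cheap Boolean test: every group must contribute at least one word.

    query is a list of sets of terms, read as an AND of ORs (an assumed
    query form, chosen only to keep the sketch small).
    """
    words = tokenize(text)
    return all(group & words for group in query)


def prefilter(query, documents, deep_analysis):
    """Run the expensive analysis only on documents that pass the filter.

    documents: doc id -> text
    deep_analysis: callable standing in for the costly language processing
    """
    return {
        doc_id: deep_analysis(text)
        for doc_id, text in documents.items()
        if matches(query, text)
    }
```

The design point is that the Boolean test costs one pass over the words of each document, so the costly stage sees only the small fraction of text that survives it.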
The method is more broadly applicable to problems in information dissemination
and retrieval. Accuracy is only one appealing characteristic of the technique,
since the main innovation is that it allows for improved accuracy within
the context of traditional word-based full-text search.
In addition to the programs described here, the method was tested with a
statistical corpus analyzer that helps to identify candidate words and phrases
to include in queries. This method helps to overcome some of the limitations
of word-based methods in cases where statistical approaches clearly seem to do
better. As an additional experiment, this automated corpus analysis can be
used to reduce further the amount of labor involved in building queries.
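As a rough illustration of how a corpus analyzer might propose candidate query words, the sketch below ranks words by how much more frequent they are in known relevant texts than in background text. The statistic (a smoothed frequency ratio), the function name candidate_terms, and its parameters are assumptions made for this example; the analyzer actually tested is not specified here.

```python
import re
from collections import Counter


def word_counts(texts):
    """Count lower-cased alphabetic word tokens across a list of texts."""
    return Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))


def candidate_terms(relevant_texts, background_texts, top_n=10, min_count=2):
    """Rank words by relative frequency in relevant vs. background text."""
    rel = word_counts(relevant_texts)
    bg = word_counts(background_texts)
    rel_total = sum(rel.values()) or 1
    bg_total = sum(bg.values())
    scored = []
    for word, count in rel.items():
        if count < min_count:
            continue  # ignore words too rare to be reliable candidates
        rel_rate = count / rel_total
        # Add-one smoothing so unseen background words do not divide by zero.
        bg_rate = (bg[word] + 1) / (bg_total + 1)
        scored.append((rel_rate / bg_rate, word))
    # Highest ratio first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:top_n]]
```

A person building a query would still review the ranked list, but starting from such candidates is one way the manual labor could be reduced.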