NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
P. Jacobs
G. Krupka
L. Rau
National Institute of Standards and Technology
Donna K. Harman
conditions are satisfied. A topic test produces a match if it has become true at
the end of this process, meaning that the paragraph or document has passed
the pre-filter for that query. A single paragraph, of course, can satisfy multiple
queries.
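The topic test described above can be sketched in C (the implementation language of the system), under assumed representations: a query as an array of AND-groups, each group a NULL-terminated list of OR-alternatives, with raw substring search standing in for the hashed lexical tests of the real pre-filter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `term` occurs as a substring of `text`; the actual
   pre-filter tests hashed lexical tokens rather than raw substrings. */
static bool contains(const char *text, const char *term) {
    return strstr(text, term) != NULL;
}

/* A query is an array of AND-groups; each group is a NULL-terminated
   list of OR-alternatives. The topic test becomes true only when every
   group has at least one alternative present in the paragraph. */
static bool topic_matches(const char *text, const char **groups[],
                          size_t ngroups) {
    for (size_t g = 0; g < ngroups; g++) {
        bool any = false;
        for (const char **t = groups[g]; *t != NULL; t++) {
            if (contains(text, *t)) { any = true; break; }
        }
        if (!any)
            return false;  /* one unsatisfied conjunct fails the topic */
    }
    return true;
}
```

For a hypothetical South Africa topic with groups {"south africa"} and {"sanctions", "embargo"}, a paragraph mentioning an embargo on South Africa passes the topic test, while one mentioning only an embargo on Cuba does not.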
The pattern matcher then loads in only those texts that have passed the
pre-filter, and uses the original queries to apply more stringent tests to those
texts, such as ordering, proximity, and more complex lexical tests. In typical
cases, the amount of text the pattern matcher must scan is only 2-3% of the
words in the original, and the pattern matcher discards 20-30% of the matches
of the pre-filter. Hence, well over 95% of the text filtering is done by the pre-
filter, making the whole process efficient as well as showing that the pre-filter
closely approximates the knowledge-based pattern matcher.
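One of the "more stringent" tests mentioned above, a proximity check, can be sketched as follows; the whitespace tokenizer and the helper name are assumptions, and the actual pattern matcher is considerably more elaborate.

```c
#include <stdbool.h>
#include <string.h>

/* Proximity test: do words `a` and `b` occur within `window` words of
   each other anywhere in `text`? Copies the text locally because
   strtok modifies its argument. */
static bool within_window(const char *text, const char *a, const char *b,
                          int window) {
    char buf[1024];
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    int pos = 0, last_a = -1, last_b = -1;
    for (char *tok = strtok(buf, " "); tok != NULL;
         tok = strtok(NULL, " "), pos++) {
        if (strcmp(tok, a) == 0) last_a = pos;
        if (strcmp(tok, b) == 0) last_b = pos;
        /* report a match as soon as both words have been seen close enough */
        if (last_a >= 0 && last_b >= 0 &&
            last_a - last_b <= window && last_b - last_a <= window)
            return true;
    }
    return false;
}
```

Because such tests run only on the 2-3% of text that survives the pre-filter, their higher per-word cost matters little to overall throughput.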
4 Implementation and Results
The lexical analyzer and the matching method described above were programmed
in the C programming language, compiled, and tested on a corpus of about 1.2
gigabytes of text (about 200 million words) with a set of 50 topics, as part
of the government-sponsored Text REtrieval Conference (TREC). The program
was then tested on an additional gigabyte of text (the "second" disk) with an
additional 50 topics, and the results were submitted to the government for com-
parative evaluation purposes. The results, in this case, were a ranked list of up
to 200 documents that matched each query.
The final test, run on two SUN workstations, took about 10 hours of elapsed
time for the entire corpus. The pre-filter ran at a speed of several hundred
thousand words per minute, thus covering the entire contents of each disk in
a few hours. This reduced the text to about 3% of the original data, with
each paragraph marked according to the queries that matched in the pre-filter.
During query refinement, a statistical filter ran through the entire contents of
this marked portion of the corpus, pulling out words and phrases with a high
correlation (a variation of the metric for categorization reported in [1]) to a
particular topic. For every topic, then, the statistical filter listed terms not in
the (Boolean) query that occurred with a disproportionate frequency in text
that matched the query. This, for example, produced the terms Buthelezi and
punitive measures in the South Africa query above.
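The core of the statistical filter, the disproportionate-frequency test, can be sketched as below. The thresholded frequency ratio shown here is an assumption for illustration; the system uses a variation of the categorization metric reported in [1].

```c
#include <stdbool.h>

/* Sketch (assumed metric): report a term for a topic when its relative
   frequency in topic-matching text exceeds its corpus-wide relative
   frequency by at least `factor`. */
static bool disproportionate(long term_in_match, long words_in_match,
                             long term_in_corpus, long words_in_corpus,
                             double factor) {
    double p_match  = (double)term_in_match  / (double)words_in_match;
    double p_corpus = (double)term_in_corpus / (double)words_in_corpus;
    return p_match > factor * p_corpus;
}
```

A term like Buthelezi, rare in the corpus at large but frequent in paragraphs matching the South Africa query, would pass this test and be offered during query refinement.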
Figure 1 summarizes some of the official results from TREC, including the
number of topics on which the systems achieved the best score, and the numbers
above, below, and equal to the median score. The system was very close to the top in both
the ad hoc and routing tasks, and was well above median on most topics. The
pre-filter, surprisingly, was better than the pattern matcher almost across the
board, indicating that with this style of evaluation there seemed to be no real
benefit to deeper analysis than a simple Boolean match. The logical conclusion
is that a Boolean routing or retrieval engine can perform as well as much more