NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
If the conjunct had fewer words than were required to meet this termination condition, other
words were added to the concept by finding words in the remaining text that were adjacent to
the first word in the concept. Finally, words that occurred rarely in the database were added
as simple disjuncts if the query was still estimated to return fewer than the required number
of documents. For example, to return 500 documents for query 82 on genetic engineering, we
generated the Boolean query:
(genetic AND engineering) OR (manipulation AND molecular)
OR (biotechnology AND genetic AND manipulation)
OR (animal AND drug AND plant)
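The rare-word disjunct step of this first approach can be sketched as follows. The document-frequency table, the crude match estimator, and all names here are illustrative assumptions, not the paper's implementation.

```python
def estimated_matches(term, doc_freq):
    """Estimated number of documents containing the term (toy model:
    just the term's document frequency)."""
    return doc_freq.get(term, 0)

def build_query(concepts, query_words, doc_freq, wanted=500):
    """Return a disjunction of conjunctions, as a list of conjuncts
    (each conjunct a list of words)."""
    disjuncts = [list(c) for c in concepts]

    def estimate():
        # Crude estimate of documents returned: for each conjunct, the
        # rarest term bounds how many documents can match it.
        return sum(min(estimated_matches(w, doc_freq) for w in d)
                   for d in disjuncts)

    # Add rare query words as simple single-word disjuncts while the
    # query is still estimated to return too few documents.
    rare = sorted((w for w in query_words if w in doc_freq),
                  key=lambda w: doc_freq[w])
    for word in rare:
        if estimate() >= wanted:
            break
        if [word] not in disjuncts:
            disjuncts.append([word])
    return disjuncts
```

With a toy frequency table, the builder keeps appending the rarest remaining words until the estimate reaches the target.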
The second approach started by forming a simple disjunction of rare words, up to 10% of
the required number of documents. Next, two key descriptors were identified, using the text
of the description primarily, but also using the narrative. The key descriptors were created as
a set of three words each, where each key descriptor is intended to capture a principal char-
acteristic of the query. The frequency of the terms in the query and the adjacency of words
were used to generate the key descriptors. Sometimes the algorithm would form two variations
of the one concept, sometimes two distinct concepts. For instance, query 82 formed the key
descriptors:
(genetic, engineering, developed)
(genetic, engineering, manipulation)
to generate the query:
(genetic AND engineering) OR (genetic AND manipulation)
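Rendering a list of conjuncts as a query string of this shape is straightforward; a minimal sketch (the function name is ours, not from the paper):

```python
def to_boolean(disjuncts):
    """Render a disjunction of conjunctions as a Boolean query string,
    e.g. [["genetic", "engineering"], ["genetic", "manipulation"]]
    becomes "(genetic AND engineering) OR (genetic AND manipulation)"."""
    return " OR ".join("(" + " AND ".join(terms) + ")" for terms in disjuncts)
```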
The third approach was quite similar to the second, except that more emphasis was placed
on the title in generating the concepts, and, by using pairs of words in the query, finer gradation
of the Boolean queries was possible. The concepts for queries 82 and 87 were the same. The
Boolean query for query 82 was the same as for the second algorithm; however, the query
generated for query 87 was:
(institutions AND failed) OR (actions AND criminal AND officers)
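The pair-based gradation of this third approach can be illustrated as follows: expanding a three-word descriptor into its word pairs yields more, smaller conjuncts to select among. This is a sketch of the idea, not the paper's code.

```python
from itertools import combinations

def pair_conjuncts(descriptor):
    """Expand one key descriptor into all of its word pairs. Matching
    on pairs rather than the full three-word conjunct gives finer
    control over how many documents a query returns."""
    return [list(pair) for pair in combinations(descriptor, 2)]
```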
When evaluating these approaches, the key consideration is whether or not relevant doc-
uments are identified. For a fair comparison, each algorithm should return the same average
number of documents. In Table 4, each algorithm has returned an average of about 400 docu-
ments. The recall and precision results are all based on the same ranking formula. Note that
since these algorithms identified documents that had not been assessed as relevant or irrelevant,
we made the (pessimistic) assumption that all such documents were irrelevant.
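The evaluation just described amounts to the following per-query computation, with unjudged retrieved documents counted as irrelevant; the function and argument names are ours.

```python
def recall_precision(returned, judged_relevant):
    """Recall and precision for one query, treating any returned
    document without a relevance judgement as irrelevant (the
    pessimistic assumption)."""
    returned = set(returned)
    relevant = set(judged_relevant)
    hits = len(returned & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(returned) if returned else 0.0
    return recall, precision
```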
Algorithm          Concepts   Description   Description with Pairs
Docs. Requested    200        700           600
Docs. Returned     453        354           360
Disjuncts          3.5        1.8           4.4
Recall             0.155      0.170         0.146
Precision          0.150      0.144         0.121

Table 4: Comparison of Boolean query generation algorithms
To show how increasing the number of documents requested affects these algorithms, in
Table 5 we show how many documents were retrieved and how many disjuncts were used
when each of the algorithms was requested to return 1,000 documents.