NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
If the conjunct had fewer words than were required to meet this termination condition, other
words were added to the concept by finding words in the remaining text that were adjacent to
the first word in the concept. Finally, words that occurred rarely in the database were added
as simple disjuncts if the query was still estimated to return fewer than the required number
of documents. For example, to return 500 documents for query 82 on genetic engineering, we
generated the Boolean query:
(genetic AND engineering) OR (manipulation AND molecular)
OR (biotechnology AND genetic AND manipulation)
OR (animal AND drug AND plant)
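The rare-word disjunct step of this first approach can be sketched as follows. The document-frequency table, the crude match estimator, and all names here are illustrative assumptions, not the paper's implementation.

```python
def estimated_matches(term, doc_freq):
    """Estimated number of documents containing the term (toy model:
    just the term's document frequency)."""
    return doc_freq.get(term, 0)

def build_query(concepts, query_words, doc_freq, wanted=500):
    """Return a disjunction of conjunctions, as a list of conjuncts
    (each conjunct a list of words)."""
    disjuncts = [list(c) for c in concepts]

    def estimate():
        # Crude estimate of documents returned: for each conjunct, the
        # rarest term bounds how many documents can match it.
        return sum(min(estimated_matches(w, doc_freq) for w in d)
                   for d in disjuncts)

    # Add rare query words as simple single-word disjuncts while the
    # query is still estimated to return too few documents.
    rare = sorted((w for w in query_words if w in doc_freq),
                  key=lambda w: doc_freq[w])
    for word in rare:
        if estimate() >= wanted:
            break
        if [word] not in disjuncts:
            disjuncts.append([word])
    return disjuncts
```

With a toy frequency table, the builder keeps appending the rarest remaining words until the estimate reaches the target.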
The second approach started by forming a simple disjunction of rare words, up to 10% of
the required number of documents. Next, two key descriptors were identified, using the text
of the description primarily, but also using the narrative. The key descriptors were created as
a set of three words each, where each key descriptor is intended to capture a principal char-
acteristic of the query. The frequency of the terms in the query and the adjacency of words
were used to generate the key descriptors. Sometimes the algorithm would form two variations
of the one concept, sometimes two distinct concepts. For instance, query 82 formed the key
descriptors:
(genetic, engineering, developed)
(genetic, engineering, manipulation)
to generate the query:
(genetic AND engineering) OR (genetic AND manipulation)
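Rendering a list of conjuncts as a query string of this shape is straightforward; a minimal sketch (the function name is ours, not from the paper):

```python
def to_boolean(disjuncts):
    """Render a disjunction of conjunctions as a Boolean query string,
    e.g. [["genetic", "engineering"], ["genetic", "manipulation"]]
    becomes "(genetic AND engineering) OR (genetic AND manipulation)"."""
    return " OR ".join("(" + " AND ".join(terms) + ")" for terms in disjuncts)
```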
The third approach was quite similar to the second, except that more emphasis was placed
on the title in generating the concepts, and, by using pairs of words in the query, finer gradation
of the Boolean queries was possible. The concepts for queries 82 and 87 were the same. The
Boolean query for query 82 was the same as for the second algorithm; however, the query
generated for query 87 was:
(institutions AND failed) OR (actions AND criminal AND officers)
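The pair-based gradation of this third approach can be illustrated as follows: expanding a three-word descriptor into its word pairs yields more, smaller conjuncts to select among. This is a sketch of the idea, not the paper's code.

```python
from itertools import combinations

def pair_conjuncts(descriptor):
    """Expand one key descriptor into all of its word pairs. Matching
    on pairs rather than the full three-word conjunct gives finer
    control over how many documents a query returns."""
    return [list(pair) for pair in combinations(descriptor, 2)]
```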
When evaluating these approaches, the key consideration is whether or not relevant doc-
uments are identified. For a fair comparison, each algorithm should return the same average
number of documents. In Table 4, each algorithm has returned an average of about 400 docu-
ments. The recall and precision results are all based on the same ranking formula. Note that
since these algorithms identified documents that had not been assessed as relevant or irrelevant,
we made the (pessimistic) assumption that all such documents were irrelevant.
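The evaluation just described amounts to the following per-query computation, with unjudged retrieved documents counted as irrelevant; the function and argument names are ours.

```python
def recall_precision(returned, judged_relevant):
    """Recall and precision for one query, treating any returned
    document without a relevance judgement as irrelevant (the
    pessimistic assumption)."""
    returned = set(returned)
    relevant = set(judged_relevant)
    hits = len(returned & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(returned) if returned else 0.0
    return recall, precision
```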
Algorithm          Concepts   Description   Description with Pairs
Docs. Requested    200        700           600
Docs. Returned     453        354           360
Disjuncts          3.5        1.8           4.4
Recall             0.155      0.170         0.146
Precision          0.150      0.144         0.121

Table 4: Comparison of Boolean query generation algorithms
To show how increasing the number of documents requested affects these algorithms, in
Table 5 we show how many documents were retrieved and how many disjuncts were used
when each of the algorithms was requested to return 1,000 documents.