NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

A Boolean Approximation Method for Query Construction and Topic Assignment in TREC

P. Jacobs, G. Krupka, L. Rau
National Institute of Standards and Technology, Donna K. Harman

conditions are satisfied. A topic test produces a match if it has become true at the end of this process, meaning that the paragraph or document has passed the pre-filter for that query. A single paragraph can, of course, satisfy multiple queries. The pattern matcher then loads only those texts that have passed the pre-filter, and uses the original queries to apply more stringent tests to those texts, such as ordering, proximity, and more complex lexical tests. In typical cases, the amount of text the pattern matcher must scan is only 2-3% of the words in the original, and the pattern matcher discards 20-30% of the matches of the pre-filter. Hence, well over 95% of the text filtering is done by the pre-filter, making the whole process efficient as well as showing that the pre-filter closely approximates the knowledge-based pattern matcher.

4 Implementation and Results

The lexical analyzer and the matching method described above were programmed in the C programming language, compiled, and tested on a corpus of about 1.2 gigabytes of text (about 200 million words) with a set of 50 topics, as part of the government-sponsored Text REtrieval Conference (TREC). The program was then tested on an additional gigabyte of text (the "second" disk) with an additional 50 topics, and the results were submitted to the government for comparative evaluation. The results, in this case, were a ranked list of up to 200 documents that matched each query. The final test, run on two SUN workstations, took about 10 hours of elapsed time for the entire corpus. The pre-filter ran at a speed of several hundred thousand words per minute, thus covering the entire contents of each disk in a few hours.
This would reduce the text to 3% of the original data, with each paragraph marked according to the queries that matched in the pre-filter. During query refinement, a statistical filter ran through the entire contents of this marked portion of the corpus, pulling out words and phrases with a high correlation (a variation of the metric for categorization reported in [1]) to a particular topic. For every topic, then, the statistical filter listed terms not in the (Boolean) query that occurred with a disproportionate frequency in text that matched the query. This, for example, produced the terms Buthelezi and punitive measures in the South Africa query above.

Figure 1 summarizes some of the official results from TREC, including the number of times the system achieved the best score and the number of times it was above, below, or equal to the median score. The system was very close to the top in both the ad hoc and routing tasks, and was well above the median on most topics. The pre-filter, surprisingly, was better than the pattern matcher almost across the board, indicating that with this style of evaluation there seemed to be no real benefit to deeper analysis than a simple Boolean match. The logical conclusion is that a Boolean routing or retrieval engine can perform as well as much more