SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
P. Jacobs, G. Krupka, L. Rau
National Institute of Standards and Technology, Donna K. Harman

as .03. To us, narrowing the query seemed a good idea because the precision on this topic otherwise would have been low, but we did not realize that the documents that the system didn't retrieve were still treated as incorrect in this calculation. On Topic 43 (1991 AI conferences), our system produced 3 documents, all of which were irrelevant. This "routing" topic was later discarded because no relevant documents were found in the corpus, but there is nothing inherently wrong with testing topics for which there is no data. In fact, the ideal routing system should produce 0 hits for such a topic, not 200 hits as dictated in TREC. Certainly ranking and routing don't go together in any real task on a gigabyte sample.

One way that future evaluations can test routing is to use a random (or otherwise fair) sample of the collections as a test, judge every document in that sample with respect to every query, and then measure each system's recall and precision on the basis of the sample. This would probably require less handwork in judging relevance, but would require that each system produce topic assignments for every document in the collection (from which the assignments for the test sample would be extracted post hoc). This could be impossible for some systems. On the other hand, the strategy would give real numbers for both recall and precision, and would be much truer to the routing task.

7 Utility

The main purpose of this method is as a front-end for computation-intensive natural language processing of large bodies of text.
Because the pre-filter closely approximates more in-depth processing with a very fast, efficient process, it permits detailed processing of large volumes of text by discarding most of the irrelevant material and by producing a rough approximation of the more detailed processing. The method is more broadly applicable to problems in information dissemination and retrieval. Accuracy is only one appealing characteristic of the technique, since the main innovation is that it allows for improved accuracy within the context of traditional word-based full-text search.

In addition to the programs described here, the method was tested with a statistical corpus analyzer that helps to identify candidate words and phrases to include in queries. This method helps to overcome some of the limitations of word-based methods in cases where statistical approaches clearly seem to do better. As an additional experiment, this automated corpus analysis can be used to reduce further the amount of labor involved in building queries.
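
To make the front-end role described in this section concrete, the following is a minimal sketch of a word-based Boolean pre-filter placed ahead of a more expensive analysis stage. It is an illustration only, not the system described in this paper: the query shape (a conjunction of OR-groups plus an exclusion list), the tokenizer, the sample documents, and the names `tokenize`, `matches`, `prefilter`, and `route` are all assumptions introduced here.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased word set; a stand-in for whatever indexing a real system uses."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(words: Set[str], all_of: List[Set[str]], none_of: Set[str]) -> bool:
    """True when every OR-group has at least one word present
    and no excluded word appears (hypothetical query shape)."""
    return all(words & group for group in all_of) and not (words & none_of)


def prefilter(docs: List[str], all_of: List[Set[str]], none_of: Set[str]) -> List[str]:
    """Cheap pass: keep only documents that satisfy the Boolean query."""
    return [d for d in docs if matches(tokenize(d), all_of, none_of)]


def route(docs: List[str], all_of: List[Set[str]], none_of: Set[str],
          deep_analysis: Callable[[str], bool]) -> List[str]:
    """Run the expensive stage only on documents that survive the pre-filter."""
    return [d for d in prefilter(docs, all_of, none_of) if deep_analysis(d)]


# Hypothetical sample data, invented for this sketch.
docs = [
    "The merger between the two airlines was approved.",
    "Airlines report record profits this quarter.",
    "A rumored merger of two banks was denied by regulators.",
]
all_of = [{"merger", "acquisition"}, {"airline", "airlines"}]
none_of = {"rumored"}

candidates = prefilter(docs, all_of, none_of)
```

In this sketch only the first sample document reaches the expensive stage, and a topic with no matching documents yields an empty list rather than a fixed number of ranked hits.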