NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology
Donna K. Harman

A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
P. Jacobs, G. Krupka, L. Rau

There are many alternatives to full-text search that can produce much higher accuracy, including statistical methods that weight matches based on relative word frequencies, automatic indexing strategies, and knowledge-based approaches that give very high accuracy for repetitive searches at the cost of a large amount of work in constructing a knowledge base. The ideal search strategy would be one with the accuracy of knowledge-based approaches, but with the simple efficiency of word searches. This is the motivation for the method described here.

The selection of this method for the TREC evaluation combined a sense of the practice of information retrieval with a particular interpretation of what the evaluation is about. For example, the strategy makes several assumptions about the task that are apparently different from those made by other sites, and which perhaps take a looser and less academic view of the experiment. These key assumptions are:

* Relevance over ranking. The focus of text interpretation is to assign to each text, with the highest possible accuracy, the set of topics or content indicators that apply to that text (i.e., particularly for routing, to treat the topic or relevance of each text individually rather than as a relative measure against other texts in the corpus).

* Technology over engineering. The goal of the experiment is to show high-accuracy, practical results, avoiding the threatening limitations of disk size, memory management, and tractability. (The system did no pre-indexing or analysis of any portion of the corpus.)

* Text interpretation over query interpretation.
  As there may not be any principled structure or methodology in the sample queries, the emphasis on matching texts to an internal representation of each query, rather than on automatic query processing, pushes the limits of the routing and retrieval engine instead of the interface to the user.

These choices were made for various reasons, and are presented here not as the right way to view the task but as one software system's implicit focus. A project with a different choice of focus could, for example, produce lower accuracy but have the benefit of fully automatic query processing; another project might have higher accuracy on the top few texts in each category but lower recall on a general routing task. We view our (preliminary) results as very satisfying as a test in a high-throughput, high-accuracy, routing-style interpretation.

The sections that follow cover the motivation for the design and implementation of this method, some specific details of the experiment, and an analysis of the results.