SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) A Boolean Approximation Method for Query Construction and Topic Assignment in TREC chapter P. Jacobs G. Krupka L. Rau National Institute of Standards and Technology Donna K. Harman

quite simple. Second, it is easier for a system to recognize most or all company names than it is to list all the possible words that could appear in the name of a company. Third, the amount of information captured in the query is still limited; for example, it ignores the order in which words appear as well as their adjacency or proximity. Many text search systems allow augmentations to queries that express these constraints, but this makes the queries still more difficult to construct and makes searches less efficient. Hence Boolean retrieval systems remain, in practice, awkward and inaccurate.

The experiment that this team performed in TREC was hatched under the pressure of a very difficult task with very limited resources. While a number of individuals participated (including the three authors and five others), the program described here resulted from less than a week of programming over and above the existing software tools we had in hand to apply to the task. The constraints thus forced upon the effort included the realization that not much could be salvaged from the queries as distributed, that the sample relevance judgements were incomplete and the training samples too small for most statistical tests, and that indexing the corpus by any practical method could delay the project by weeks, overflow the disks, and prevent any corrections to the method. This led to a "bare bones" strategy that combines two ideas: (1) treating Boolean queries as an approximation to more detailed, structured representations, and (2) using the corpus, rather than the queries, as the main source of information for formulating the topics.
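As a rough illustration of idea (1), the following Python sketch derives a Boolean query from a small structured query by discarding its order and proximity constraints. It is not the program used in the experiment: the query format, the word lists, and every function name here are hypothetical, and the sketch assumes that the Boolean form is meant to match at least every text the structured form matches.

```python
import re

# Hypothetical structured query: a word from "first" must be followed,
# within "window" tokens, by a word from "second".
STRUCTURED_QUERY = {
    "first": {"acquire", "acquired", "buy", "bought"},
    "second": {"company", "corp", "inc"},
    "window": 5,
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def structured_match(query, tokens):
    """Check the ordered, proximity-constrained form of the query."""
    for i, tok in enumerate(tokens):
        if tok in query["first"]:
            following = tokens[i + 1 : i + 1 + query["window"]]
            if any(t in query["second"] for t in following):
                return True
    return False


def boolean_approximation(query):
    """Drop order and proximity: (a OR b ...) AND (c OR d ...)."""
    return [set(query["first"]), set(query["second"])]


def boolean_match(clauses, tokens):
    """A text matches if every OR-clause shares a word with it."""
    vocabulary = set(tokens)
    return all(vocabulary & clause for clause in clauses)


def retrieve(query, docs):
    """Run the cheap Boolean filter, then the structured check on survivors."""
    clauses = boolean_approximation(query)
    candidates = [d for d in docs if boolean_match(clauses, tokenize(d))]
    confirmed = [d for d in candidates if structured_match(query, tokenize(d))]
    return candidates, confirmed


if __name__ == "__main__":
    docs = [
        "Acme bought Widget Corp last year.",
        "The company said it bought new trucks.",
        "Rain fell on Tuesday.",
    ]
    candidates, confirmed = retrieve(STRUCTURED_QUERY, docs)
    print(len(candidates), len(confirmed))
```

In this toy example the second document passes the Boolean filter because it contains both "company" and "bought", but it fails the structured check because the words appear in the wrong order. The Boolean form therefore admits a superset of the structured matches, and the more expensive check only has to run on that smaller candidate set.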
3 Our Approach

The fundamental idea behind the method is to take a knowledge-based description of a query or topic, and convert it to a Boolean form that can be efficiently applied by a text-search engine. This Boolean form, furthermore, must be an approximation of the knowledge-based query, in the formal sense that the Boolean expression should match all texts that the knowledge-based query would, but may admit additional texts.

There are several key advantages to this approach of generating a Boolean query from a knowledge-based description. The simplest benefit is that it makes building queries easier, because much of the work in forming complex Boolean expressions is done automatically. A second major advantage is that the Boolean queries, in approximating a knowledge-based approach, are more likely to give accurate results. Finally, because retrieval using the automatically-generated Boolean query approximates the knowledge-based query, the knowledge-based system can run on the results of Boolean retrieval, thus enhancing precision without having to apply the more computationally-intensive knowledge-based processing to very much text.

One of the obvious problems to overcome in TREC was, with a limited amount of time to formulate 100 queries, with a small amount of training data