NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
P. Jacobs, G. Krupka, L. Rau
National Institute of Standards and Technology, Donna K. Harman

The problems with undergeneration (and the related problem of not doing a very good job of ranking the documents) were due to the fact that our system was designed for routing, while TREC used traditional retrieval evaluation methods, along with a 200-document cutoff, effectively counting recall on the harder topics much more heavily than overall recall. Our approach can correct for this by using a more flexible statistical method to expand the queries and by performing more sophisticated ranking (the document ranking as reported was implemented post hoc in one line of Unix code).

More important than the problems to correct, there is an important result here to build on. Our experience has been that pattern matching can, for this sort of task, closely approximate natural language processing, so it seems that advanced methods are much more critical for deciding what to put in the queries than they are for the detailed analysis of the texts. The general framework of this approach means that, with the continued development of advanced methods for natural-language-based corpus analysis, substantial performance improvements can come within the context of almost any current text retrieval system.

6 Evaluation Methods

One unusual characteristic of our method is that it assumes that each relevance judgement the system makes is made independently of all other texts, as in a routing task where the system processes each incoming message in turn and assigns topics or actions for filing or routing that message. Certainly, this style has certain advantages: it is simple, clear, and makes parallel processing easy, and it reflects some real assumptions about the nature of the task.
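The routing style described above can be sketched as follows. This is a minimal, hypothetical illustration (the queries, documents, and function names are not from the paper): each incoming document is judged against each topic independently of all other documents, with no cross-document ranking, and the Boolean query is approximated as a conjunction of OR-groups matched by crude substring pattern matching.

```python
# Hypothetical sketch of per-document routing: each document is judged
# independently against each topic's Boolean query. A query is approximated
# in conjunctive normal form: a list of OR-groups, all of which must match.

def matches(text, query):
    """True if every OR-group in the query has at least one term in the text."""
    t = text.lower()
    return all(any(term in t for term in group) for group in query)

def route(documents, topic_queries):
    """Assign topics to each incoming document independently (routing style)."""
    return {doc_id: [topic for topic, q in topic_queries.items() if matches(text, q)]
            for doc_id, text in documents.items()}

# Illustrative query and documents, not taken from the actual TREC topics.
topic_queries = {
    "poaching": [{"poach"}, {"wildlife", "elephant", "rhino"}],
}
docs = {
    "d1": "Poachers killed the rhino for its horn.",
    "d2": "The stock market fell sharply today.",
}
print(route(docs, topic_queries))  # {'d1': ['poaching'], 'd2': []}
```

Because each document is processed in isolation, the documents can be handled in any order or in parallel, which is the advantage the text notes; the cost is that there is no natural way to rank one matched document above another.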
However, although it seems to have done very well relative to other systems, this style is not consistent with the instructions for submitting results in TREC, and it certainly does not lead to the best possible showing on some of the results.

Topic 77, about poaching techniques, is one example of the different (naive, perhaps) perspective toward evaluation that our system adopts. The topic specifies:

    A relevant document will identify the type of wildlife being poached, the poaching technique or method of killing the wildlife which is used by the poacher, and the reason for the poaching (e.g. for a trophy, meat, or money).

This is a very specific query. Our test (bootstrapping) sample produced a good number of hits, but most of them failed to include one of the required pieces of information, usually the technique or method of killing. So we narrowed the query. The result is that, for this topic, the system returned 9 documents in total, 6 of which were judged relevant. This is high precision (.67), but it does not help the overall results, since for this topic the precision at 200 documents is treated
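To make the arithmetic above concrete: with the reported 9 returned and 6 relevant documents, set precision is high, but under a 200-document cutoff (assuming the standard TREC-style treatment, where the unreturned slots count as non-relevant) the same result scores far lower.

```python
# Topic 77 numbers from the text: 9 documents returned, 6 judged relevant.
retrieved, relevant = 9, 6

set_precision = relevant / retrieved   # precision over what was actually returned
precision_at_200 = relevant / 200      # the cutoff treats the 191 unreturned slots as misses

print(round(set_precision, 2), precision_at_200)  # 0.67 0.03
```

This is why narrowing the query helped precision on the returned set while hurting the topic's contribution to the overall evaluation.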