SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) A Boolean Approximation Method for Query Construction and Topic Assignment in TREC chapter P. Jacobs G. Krupka L. Rau National Institute of Standards and Technology Donna K. Harman 2 Background There are several different methods for general-purpose information retrieval from text, as mentioned, including full-text search, statistical and probabilistic techniques, and those using automatic or manual indexing. Of these, word- based full-text search is generally believed to be the least accurate, but by far the most widely practiced. The method is used in spite of its inaccuracy because it is so easy to implement, and because it gives a direct and natural way for users to describe what they are looking for. Knowledge-based approaches-those that rely on formal descriptions of con- cepts rather than combinations of words-can be the most accurate for certain limited tasks, but are the most difficult to implement and apply to general information retrieval problems. While there are a variety of knowledg[OCRerr]based methods, they share the common framework of looking for structured informa- tion in a text, while other methods look for simple combinations of words. For example, a simple query looking for stories about takeovers by GE in a typical word-based system might be (GE AND ACQUISITION), while in a knowledge- based system it would look more like ($GE * $acquire * *compauy) where the terms marked by $ indicate classes of words or phrases that the system can recognize; for example, GE might include GE and General Electric but not General Electric company of Britain, and $comp[OCRerr]y would match any company named or described. The knowledge-based approach is more accurate, in general, for two rea- sons. First, it allows the program scanning the texts to look for many ways of expressing the same concept, enhancing recal[OCRerr]the percentage of relevant docu- ments that the program retrieves. Second, it focuses search on specific pieces of information, thus enhancing p[OCRerr]cision-the percentage of retrieved texts that are relevant. In this case, improved recall would come from spotting the variety of ways of expressing the concept of acquiring, and improved precision comes from eliminating texts that are about the acquisition of real estate, products, and so forth, as well as acquisitions by other companies of GE's products and services. Most full-text search systems are driven by Boolean queries-terms joined together by AND, OR, and NOT, such as (GE AND ACQUISITION) above. These can be arbitrarily complex, and one can start to approximate the opera- tion of a knowledg[OCRerr]based approach by combining large numbers of terms. For example, the GE query above could be expressed as: (AND (OR GE (AND GENERAL ELECThIC (NOT (AND OF BRITAIN)))) (OR ACQUIRED ACQUISITION BOUGHT (AND (OR TAKE TOOK) OVER) (OR COMPANY CORP INC CO LTD SA AG ...)) There are several problems with this approach. First, it is very difficult for users to formulate queries of this sort, so, in practice, Boolean queries are 299