SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
chapter
P. Jacobs
G. Krupka
L. Rau
National Institute of Standards and Technology
Donna K. Harman
2 Background
There are several different methods for general-purpose information retrieval
from text, as mentioned, including full-text search, statistical and probabilistic
techniques, and those using automatic or manual indexing. Of these, word-
based full-text search is generally believed to be the least accurate, but by far
the most widely practiced. The method is used in spite of its inaccuracy because
it is so easy to implement, and because it gives a direct and natural way for
users to describe what they are looking for.
Knowledge-based approaches-those that rely on formal descriptions of con-
cepts rather than combinations of words-can be the most accurate for certain
limited tasks, but are the most difficult to implement and apply to general
information retrieval problems. While there are a variety of knowledg[OCRerr]based
methods, they share the common framework of looking for structured informa-
tion in a text, while other methods look for simple combinations of words. For
example, a simple query looking for stories about takeovers by GE in a typical
word-based system might be (GE AND ACQUISITION), while in a knowledge-
based system it would look more like ($GE * $acquire * *compauy) where
the terms marked by $ indicate classes of words or phrases that the system
can recognize; for example, GE might include GE and General Electric but not
General Electric company of Britain, and $comp[OCRerr]y would match any company
named or described.
The knowledge-based approach is more accurate, in general, for two rea-
sons. First, it allows the program scanning the texts to look for many ways of
expressing the same concept, enhancing recal[OCRerr]the percentage of relevant docu-
ments that the program retrieves. Second, it focuses search on specific pieces
of information, thus enhancing p[OCRerr]cision-the percentage of retrieved texts that
are relevant. In this case, improved recall would come from spotting the variety
of ways of expressing the concept of acquiring, and improved precision comes
from eliminating texts that are about the acquisition of real estate, products,
and so forth, as well as acquisitions by other companies of GE's products and
services.
Most full-text search systems are driven by Boolean queries-terms joined
together by AND, OR, and NOT, such as (GE AND ACQUISITION) above.
These can be arbitrarily complex, and one can start to approximate the opera-
tion of a knowledg[OCRerr]based approach by combining large numbers of terms. For
example, the GE query above could be expressed as:
(AND (OR GE (AND GENERAL ELECThIC (NOT (AND OF BRITAIN))))
(OR ACQUIRED ACQUISITION BOUGHT (AND (OR TAKE TOOK) OVER)
(OR COMPANY CORP INC CO LTD SA AG ...))
There are several problems with this approach. First, it is very difficult
for users to formulate queries of this sort, so, in practice, Boolean queries are
299