SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
chapter
P. Jacobs
G. Krupka
L. Rau
National Institute of Standards and Technology
Donna K. Harman
* Repetition
    * - 0 or more
    + - 1 or more
* Range
    *N - 0 to N
    +N - 1 to N
In practice, some of these features are used more than others, and most
queries rely chiefly on the different lexical categories, grouping, and wildcards.
For example, a simple description of a query looking for texts about
sanctions against South Africa is the following:
sanction == [(member sanction sanctions disinvestment)
             <Sullivan Principles>
             <punitive *2 measures>]
safrica == [(member Buthelezi Pretoria anti-apartheid apartheid)
            <De Klerk> <South (member Africa African)>]
;;; rule 1
$sanction *50 $safrica => (mark-topic 52)
;;; rule 2
$safrica *20 $sanction => (mark-topic 52)
This description says that any matching text must have both an indicator of
South Africa ($safrica) and one of sanctions ($sanction). If the sanction phrase
comes first, the South Africa phrase must follow within 50 words; if the sanction
phrase comes only afterwards, it must follow within 20 words.
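The two proximity rules can be sketched as follows. This is only an illustration, not the system's matcher: the function names are hypothetical, each indicator is reduced to a single word position, and `*N` is assumed to mean "at most N intervening words", by analogy with `<punitive *2 measures>`.

```python
def follows_within(first_positions, second_positions, max_gap):
    """True if some second position comes after some first position
    with at most max_gap words in between (assumed reading of *N)."""
    return any(
        0 <= later - earlier - 1 <= max_gap
        for earlier in first_positions
        for later in second_positions
    )


def matches_topic_52(sanction_positions, safrica_positions):
    """Apply the two rules of the example to lists of word positions."""
    # rule 1: $sanction *50 $safrica
    rule_1 = follows_within(sanction_positions, safrica_positions, 50)
    # rule 2: $safrica *20 $sanction
    rule_2 = follows_within(safrica_positions, sanction_positions, 20)
    return rule_1 or rule_2


# Sanction indicator at word 3, South Africa indicator at word 12.
print(matches_topic_52([3], [12]))  # True
```

The asymmetry of the example falls out of the two calls: the same helper is applied in both orders, with a different gap limit for each.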
A sanction phrase can be any of the simple words sanction, sanctions, or
disinvestment, the phrase Sullivan Principles, or the words punitive and measures
with no more than two intervening words (like punitive economic measures). A South
Africa phrase can also be either one of a group of simple words, or a phrase, like
De Klerk, South Africa, or South African.
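A minimal sketch of how such phrase definitions might be matched against a tokenized text is given below. The data layout is an assumption made for illustration: a string stands for a literal word, a set for a `member` group, and an integer N for a `*N` gap of 0 to N words. Case-insensitive matching is likewise assumed; the source does not say how case is handled.

```python
def match_phrase(pattern, words, start):
    """Return the index just past the match if pattern matches at start, else None."""
    def step(p, t):
        if p == len(pattern):
            return t
        element = pattern[p]
        if isinstance(element, int):  # *N: skip between 0 and N words
            for skipped in range(element + 1):
                if t + skipped > len(words):
                    break
                end = step(p + 1, t + skipped)
                if end is not None:
                    return end
            return None
        if t >= len(words):
            return None
        allowed = element if isinstance(element, (set, frozenset)) else {element}
        return step(p + 1, t + 1) if words[t] in allowed else None

    return step(0, start)


# Hypothetical encoding of the "sanction" definition, lowercased.
SANCTION = [
    [{"sanction", "sanctions", "disinvestment"}],
    ["sullivan", "principles"],
    ["punitive", 2, "measures"],
]


def find_indicator(alternatives, tokens):
    """Return the start positions at which any alternative matches."""
    words = [w.lower() for w in tokens]
    return [
        i for i in range(len(words))
        if any(match_phrase(alt, words, i) is not None for alt in alternatives)
    ]


print(find_indicator(SANCTION, "The punitive economic measures hurt".split()))  # [1]
```

The positions returned here are the kind of input a proximity rule such as `$sanction *50 $safrica` would then compare.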
These queries or topic descriptions can be quite complex, and the method
has been designed to handle many queries simultaneously, so the rule compiler
produces expressions that can be applied efficiently within a large
set of queries. This is important because many queries can share the same
simple terms or combinations of terms, and because the pre-filter must match
the simplest expressions first.
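The benefit of sharing simple terms across queries can be illustrated with a small sketch. It is not the rule compiler described here: the function names, the test numbering, and the second topic (99) are hypothetical, and the pre-filter is simplified to "at least one term test passes", leaving the full rules to a later stage.

```python
def compile_term_tests(queries):
    """queries: dict mapping topic id -> iterable of simple terms.
    Assigns one test id per distinct term, so shared terms are tested once."""
    term_to_test = {}
    topic_to_tests = {}
    for topic, terms in queries.items():
        ids = set()
        for term in terms:
            key = term.upper()
            if key not in term_to_test:
                term_to_test[key] = len(term_to_test)
            ids.add(term_to_test[key])
        topic_to_tests[topic] = ids
    return term_to_test, topic_to_tests


def prefilter(tokens, term_to_test, topic_to_tests):
    """Return the topics for which at least one term test passes."""
    present = {w.upper() for w in tokens}
    # Each distinct term is checked once, however many topics use it.
    passed = {tid for term, tid in term_to_test.items() if term in present}
    return {topic for topic, ids in topic_to_tests.items() if ids & passed}


queries = {52: ["sanctions", "African", "measures"], 99: ["African", "trade"]}
term_to_test, topic_to_tests = compile_term_tests(queries)
print(len(term_to_test))  # 4 distinct tests for 5 term uses
```

Only the topics that survive this cheap pass would need the more expensive phrase and proximity checks.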
For the topic description given above, the output of the rule compiler will
include the following tests:
52 TERM AFRICAN
2029 TERM MEASURES