NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
P. Jacobs, G. Krupka, L. Rau
National Institute of Standards and Technology
Donna K. Harman

* Repetition
    * - 0 or more
    + - 1 or more
* Range
    *N - 0 to N
    +N - 1 to N

In practice, some of these features are used more than others, and most queries rely chiefly on lexical categories, grouping, and wildcards. For example, a simple description of the query looking for texts describing sanctions against South Africa is the following:

    sanction == [(member sanction sanctions disinvestment)
                 <Sullivan Principles>
                 <punitive *2 measures>]

    safrica == [(member Buthelezi Pretoria anti-apartheid apartheid)
                <De Klerk>
                <South (member Africa African)>]

    ;; rule 1
    $sanction *50 $safrica => (mark-topic 52)

    ;; rule 2
    $safrica *20 $sanction => (mark-topic 52)

This description says that any matching text must have both an indicator of South Africa ($safrica) and one of sanctions ($sanction), and that the sanction phrase must occur within 50 words of the South Africa phrase, except if it only comes afterwards, in which case it must come within 20 words. A sanction phrase can be any of the simple words sanction, sanctions, or disinvestment, the phrase Sullivan Principles, or any phrase including punitive measures with no more than two intervening words (like punitive economic measures). A South Africa phrase can likewise be either one of a group of simple words or a phrase, like De Klerk, South Africa, or South African.

These queries, or topic descriptions, can be quite complex, and the method is meant to handle many queries simultaneously, so the rule compiler is designed to produce expressions that can be applied efficiently within a large set of queries.
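The semantics of the two rules for topic 52 can be illustrated with a small Python sketch. This is not the system's rule compiler or matcher; it is a simplified reading of the example, and the names `tokenize`, `match_at`, `find_spans`, `within`, and `mark_topics` are hypothetical. It assumes that text is split into lowercase words, that `*N` allows 0 to N intervening words, and that the term lists are those of the example with spellings normalized.

```python
import re

# Hypothetical encoding of the two term classes from the topic 52 example.
# A pattern is a list of elements: a set of alternative words (matches one
# word), or an int N (matches 0 to N arbitrary intervening words).
SANCTION = [
    [{"sanction", "sanctions", "disinvestment"}],
    [{"sullivan"}, {"principles"}],
    [{"punitive"}, 2, {"measures"}],
]
SAFRICA = [
    [{"buthelezi", "pretoria", "anti-apartheid", "apartheid"}],
    [{"de"}, {"klerk"}],
    [{"south"}, {"africa", "african"}],
]

def tokenize(text):
    """Lowercase the text and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def match_at(tokens, start, pattern):
    """Return the exclusive end index of the shortest match at `start`, or None."""
    ends = {start}
    for elem in pattern:
        nxt = set()
        for e in ends:
            if isinstance(elem, int):
                # Range element: skip anywhere from 0 to N words.
                nxt.update(range(e, min(e + elem, len(tokens)) + 1))
            elif e < len(tokens) and tokens[e] in elem:
                nxt.add(e + 1)
        if not nxt:
            return None
        ends = nxt
    return min(ends)

def find_spans(tokens, patterns):
    """Collect (start, end) spans for every pattern that matches at any position."""
    spans = []
    for i in range(len(tokens)):
        for pattern in patterns:
            end = match_at(tokens, i, pattern)
            if end is not None:
                spans.append((i, end))
    return spans

def within(first, second, max_gap):
    """True if a `second` span starts at most `max_gap` words after a `first` span ends."""
    return any(0 <= s2 - e1 <= max_gap for (_, e1) in first for (s2, _) in second)

def mark_topics(text):
    """Apply rule 1 and rule 2 of the example and return the set of marked topics."""
    tokens = tokenize(text)
    sanction = find_spans(tokens, SANCTION)
    safrica = find_spans(tokens, SAFRICA)
    topics = set()
    if within(sanction, safrica, 50):  # rule 1: sanction phrase, then South Africa
        topics.add(52)
    if within(safrica, sanction, 20):  # rule 2: South Africa, then sanction phrase
        topics.add(52)
    return topics
```

Under these assumptions, `mark_topics("The government imposed punitive economic measures against Pretoria.")` marks topic 52 through rule 1, while a text in which a sanction word follows Pretoria by more than 20 words marks nothing. The sketch scans every position for every pattern, which is exactly the cost the compiled, shared tests are meant to avoid.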
This efficiency is important because many queries can share the same simple terms or combinations of terms, and because the pre-filter must match the simplest expressions first. For the topic description given above, the output of the rule compiler will include the following tests:

    52 TERM AFRICAN 2029 TERM MEASURES 302