SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Boolean Approximation Method for Query Construction and Topic Assignment in TREC
chapter
P. Jacobs
G. Krupka
L. Rau
National Institute of Standards and Technology
Donna K. Harman
(none for the ad hoc test), how to make the queries practical and accurate.
Our choice here was to keep the initial queries relatively simple, and to run the
results of a "first pass" retrieval against the entire corpus through a statistical
filter to pull out terms that would help to augment or refine the query. In
addition, the matching engine would display the exact portion of each text that
(correctly or incorrectly) matched the query, making it easy to correct glaring
errors and refine ambiguous terms. This amounts to a peculiar sort of feedback
mechanism that relies on detailed analysis of portions of the corpus instead of
user input.
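As a rough illustration of the statistical filter described above, the sketch
below ranks terms that occur in a larger share of the first-pass results than
of the corpus as a whole. The smoothed log-ratio score, the function name
candidate_terms, and the parameters top_n and min_df are assumptions made for
this example; the text does not say which statistic was actually used.

```python
import math
from collections import Counter


def candidate_terms(retrieved, corpus, top_n=5, min_df=2):
    """Rank terms over-represented in first-pass results relative to the corpus.

    The smoothed log-ratio score is an assumption for illustration only.
    """
    def doc_freq(docs):
        # Count the number of documents in which each term appears.
        df = Counter()
        for text in docs:
            df.update(set(text.lower().split()))
        return df

    df_hit = doc_freq(retrieved)
    df_all = doc_freq(corpus)
    scores = {}
    for term, hits in df_hit.items():
        if hits < min_df:
            continue  # ignore terms seen in too few retrieved texts
        p_hit = hits / len(retrieved)
        p_all = (df_all[term] + 1) / (len(corpus) + 1)  # add-one smoothing
        scores[term] = math.log(p_hit / p_all)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```

Terms returned by such a filter would still need manual review before being
added to a query, which is consistent with the inspection of matched text
portions described above.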
3.1 Detailed Method
The method brings together four key elements: (1) a language for expressing
knowledge-based topics or queries, developed at GE and described in the open
literature, (2) a new program to generate Boolean expressions that approximate
these queries (called the riLic compiler), (3) a program to match the
automatically generated expressions against text to be retrieved (called the
pre-filter), and (4) a knowledge-based pattern matcher, described in the open
literature [2], that takes the results of the first match and rejects texts that
do not satisfy the more constrained, knowledge-based query.
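The two-stage arrangement of elements (3) and (4) can be sketched as follows:
a cheap Boolean test over the words of each text selects candidates, and a
stricter matcher then rejects candidates that do not satisfy the full query.
The nested-tuple representation of Boolean expressions and the names
eval_bool, pre_filter, and retrieve are hypothetical, and the regular
expression used in the usage example merely stands in for the knowledge-based
pattern matcher.

```python
import re


def eval_bool(expr, words):
    """Evaluate a Boolean expression against a set of words.

    An expression is a word string or a tuple ("AND" | "OR" | "NOT", ...).
    This representation is an assumption made for the sketch.
    """
    if isinstance(expr, str):
        return expr in words
    op, *args = expr
    if op == "AND":
        return all(eval_bool(a, words) for a in args)
    if op == "OR":
        return any(eval_bool(a, words) for a in args)
    if op == "NOT":
        return not eval_bool(args[0], words)
    raise ValueError(f"unknown operator: {op}")


def pre_filter(expr, texts):
    """First pass: keep texts whose word set satisfies the Boolean expression."""
    return [
        t for t in texts
        if eval_bool(expr, set(re.findall(r"[\w-]+", t.lower())))
    ]


def retrieve(expr, strict_match, texts):
    """Second pass: apply the stricter matcher only to pre-filtered texts."""
    return [t for t in pre_filter(expr, texts) if strict_match(t)]
```

For example, with the expression ("AND", "shot", "rifle") and a stand-in
matcher that requires "shot" to precede "rifle", the text "The rifle was never
shot." passes the pre-filter but is rejected in the second pass. The Boolean
expression only has to be a loose approximation, since errors of over-inclusion
are corrected by the more constrained match.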
Because the pattern matcher is designed as an efficient "trigger" mechanism
and an aid in parsing, the knowledge-based queries are mostly simple
combinations of lexical categories. The patterns largely adopt the language of
regular expressions, including the following terms and operators:
* Lexical features that can be tested in a pattern:
  - token "name" (e.g. "AK-47")
  - lexical category (e.g. "adj")
  - root (e.g. "shoot")
  - conceptual category (e.g. "human")
* Logical combination of lexical feature tests:
  - OR, AND, and NOT
* Wild cards:
  - $ (0 or 1 tokens)
  - * (0 or more tokens)
  - + (1 or more tokens)
* Variable assignment from pattern components:
  - =
* Grouping operators:
  - <> for grouping
  - [] for disjunctive grouping
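
To make the operators above concrete, the following is a minimal sketch of a
matcher over tokens that carry lexical features. It covers feature tests,
their logical combination, and the three wild cards; variable assignment and
the grouping operators are omitted. The token dictionaries, the feature keys
(name, cat, root, concept), and all function names are assumptions for the
example and are not the interface of the pattern matcher described in [2].

```python
def feat(**wanted):
    """Test that a token has all given feature values, e.g. feat(root="shoot")."""
    return lambda tok: all(tok.get(k) == v for k, v in wanted.items())


def any_of(*tests):
    """OR over feature tests on a single token."""
    return lambda tok: any(t(tok) for t in tests)


def all_of(*tests):
    """AND over feature tests on a single token."""
    return lambda tok: all(t(tok) for t in tests)


def none_of(test):
    """NOT of a feature test on a single token."""
    return lambda tok: not test(tok)


# Wild cards as (minimum, maximum) tokens skipped; None means unbounded.
WILD = {"$": (0, 1), "*": (0, None), "+": (1, None)}


def match_at(pattern, tokens, i=0):
    """Return True if the pattern matches a run of tokens starting at index i."""
    if not pattern:
        return True
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):  # wild card
        lo, hi = WILD[head]
        remaining = len(tokens) - i
        max_skip = remaining if hi is None else min(hi, remaining)
        return any(match_at(rest, tokens, i + n) for n in range(lo, max_skip + 1))
    # Otherwise head is a feature test applied to exactly one token.
    return i < len(tokens) and head(tokens[i]) and match_at(rest, tokens, i + 1)


def search(pattern, tokens):
    """Return True if the pattern matches starting at any token position."""
    return any(match_at(pattern, tokens, i) for i in range(len(tokens) + 1))
```

Under these assumptions, a pattern such as
[feat(concept="human"), feat(root="shoot"), "*", feat(concept="human")]
would trigger on a token sequence for "gunman shot two guards", provided that
"gunman" and "guards" carry the conceptual category "human" and "shot" has the
root "shoot".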