NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval
P. Jacobs
National Institute of Standards and Technology
D. K. Harman
sanction == E[(member sanction sanctions
               disinvestment)
              <Sullivan Principles>
              <punitive *2 measures>]

safrica == E[(member Buthelezi Pretoria
              anti-apartheid apartheid)
             <De Klerk>
             <South (member Africa African)>]

;; rule 1
$sanction * $safrica => (mark-topic 52)
This description says that any matching text must have
both an indicator of South Africa ($safrica) and one of
sanctions ($sanction), and that the sanction phrase and
South Africa phrase must appear in the same paragraph
in the document.
A sanction phrase can be any of the simple words sanc-
tion, sanctions, or disinvestment, or any phrase includ-
ing punitive measures with no more than two intervening
words (like punitive economic measures). A South Africa
phrase can also be either one of a group of simple words,
or a phrase, like De Klerk, South Africa, or South African.
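The proximity pattern described here (a phrase head followed by a phrase tail with a bounded number of intervening words) can be sketched as follows. This is our own minimal illustration, not the authors' code; the function name and interface are assumptions:

```python
def phrase_match(words, first, second, max_gap):
    """True if `first` precedes `second` with at most `max_gap`
    intervening words, e.g. <punitive *2 measures>."""
    for i, w in enumerate(words):
        if w != first:
            continue
        # scan the window of max_gap intervening positions plus one
        for j in range(i + 1, min(i + 2 + max_gap, len(words))):
            if words[j] == second:
                return True
    return False

text = "the bill imposes punitive economic measures on pretoria".split()
phrase_match(text, "punitive", "measures", 2)  # True: one intervening word
```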
These queries or topic descriptions can be quite com-
plex, and the method has been designed to handle many
queries simultaneously, so the rule compiler is designed to
produce expressions that can be efficiently applied within
a large set of queries. This is important because many
queries can share the same simple terms or combinations
of terms, and because the Boolean matcher must match
the simplest expressions first.
For the topic description given above, the output of the
rule compiler will include the following tests:
52 TERM AFRICAN
2029 TERM MEASURES
2134 TERM SANCTION
2135 TERM SANCTIONS
2136 TERM DISINVESTMENT
2138 TERM SULLIVAN
2139 TERM PRINCIPLES
2141 TERM PUNITIVE
2144 TERM BUTHELEZI
2145 TERM PRETORIA
2146 TERM ANTI-APARTHEID
2147 TERM APARTHEID
2149 TERM DE
2150 TERM KLERK
2152 TERM SOUTH
2153 TERM AFRICA
2137 OR 2134 2135 2136
2140 AND 2138 2139
2142 AND 2141 2029
2143 OR 2137 2140 2142
2148 OR 2144 2145 2146 2147
2151 AND 2149 2150
2154 OR 2153 52
2155 AND 2152 2154
2156 OR 2148 2151 2155
2157 AND 2143 2156
2158 AND 2156 2143
TOPIC052 OR 2157 2158
Each line in the above data gives a unique number
(or topic designator) to the test, a test identifier (either
TERM for a simple word test, OR, or AND), and a list
of simple terms or previous tests. For example, test 2137
depends on tests 2134, 2135, and 2136, and is true if any
of those tests is true, namely, if the text includes any
of the words sanction, sanctions, or disinvestment. The
tests are automatically ordered so that all tests that are
dependent on other tests will have higher numbers than
the tests they depend on; thus all TERM tests appear
first. In this case, the TERM test AFRICAN appears
with a much lower number simply because it is used in
many different queries.
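The test encoding just described can be sketched as plain tuples; the representation below is our own assumed format for illustration, not GE's internal one. It makes the ordering invariant explicit: every compound test carries a higher number than the tests it depends on.

```python
# (id, operator, arguments): arguments are words for TERM tests,
# or the ids of earlier tests for AND/OR tests.
tests = [
    (2134, "TERM", ["SANCTION"]),
    (2135, "TERM", ["SANCTIONS"]),
    (2136, "TERM", ["DISINVESTMENT"]),
    (2137, "OR",   [2134, 2135, 2136]),
]

# Ordering invariant: dependencies always have lower numbers,
# so a single low-to-high evaluation pass suffices.
for tid, op, args in tests:
    if op != "TERM":
        assert all(a < tid for a in args)
```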
The matcher, which can work either on complete docu-
ments or on paragraphs (we used only paragraph match-
ing in TREC-2), goes through every word in its input
and, using a fast table look-up, sets the TERM tests to
true for every word it encounters. At the end of input, ei-
ther the end of the paragraph or end of each document, it
runs through the table of possible tests from low numbers
to high numbers and sets tests to true if their conditions
are satisfied. A topic test produces a match if it has be-
come true at the end of this process, meaning that the
paragraph or document has passed the pre-filter for that
query. A single paragraph, of course, can satisfy multiple
queries.
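The two-phase pass described above can be sketched in a few lines. This is our own simplified illustration, not GE's implementation, and the small test table is abridged from the compiled output shown earlier (test 2157 here stands in for a topic test):

```python
tests = {
    2134: ("TERM", "SANCTION"),
    2135: ("TERM", "SANCTIONS"),
    2136: ("TERM", "DISINVESTMENT"),
    2152: ("TERM", "SOUTH"),
    2153: ("TERM", "AFRICA"),
    2137: ("OR",  [2134, 2135, 2136]),
    2155: ("AND", [2152, 2153]),
    2157: ("AND", [2137, 2155]),   # stand-in topic test
}

def match_paragraph(words, tests):
    # fast table look-up: word -> ids of TERM tests it satisfies
    term_index = {}
    for tid, (op, arg) in tests.items():
        if op == "TERM":
            term_index.setdefault(arg, []).append(tid)
    value = {tid: False for tid in tests}
    # phase 1: set TERM tests to true as words stream in
    for w in words:
        for tid in term_index.get(w.upper(), []):
            value[tid] = True
    # phase 2: evaluate compound tests from low to high numbers,
    # so dependencies are always resolved first
    for tid in sorted(tests):
        op, arg = tests[tid]
        if op == "OR":
            value[tid] = any(value[a] for a in arg)
        elif op == "AND":
            value[tid] = all(value[a] for a in arg)
    return value

v = match_paragraph("sanctions against south africa".split(), tests)
# v[2157] is True: this paragraph passes the pre-filter for the topic
```

Because all test values are computed in one table, a single paragraph can satisfy multiple topic tests in the same pass, as the text notes.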
This portion of the system was implemented in the
space of a few days, and is almost entirely the same as in
TREC-1. Our focus since last year has been on query con-
struction and ranking rather than matching or retrieval.
2.2 Query construction
Our approach assumes, in general, that manual query
construction is acceptable for routing. In ad hoc retrieval,
query time can be of the essence, but in many routing
applications, queries are developed and refined over time.
The amount of time spent on query construction using
a manual method in our system is comparable to the
amount of time spent on the topic descriptions used for
automatic query generation.