SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Okapi at TREC
chapter
S. Robertson
S. Walker
M. Hancock-Beaulieu
A. Gull
M. Lau
National Institute of Standards and Technology
Donna K. Harman
considerable amount of use under live conditions.
It is a set of functions from which experienced
designers and programmers can construct retrieval
systems, rather than a finished "product".
3. Concurrent developments
3.1 Towards a distributed system
This development reflects a long-standing plan
for the Okapi project, but was brought forward to
facilitate work on the TREC database.
Okapi has been split into a Basic Search System
(BSS) and a number of front-end systems. The
BSS is essentially a database engine offering
basic text retrieval functionality, extended in
various ways to allow weighting, ranking and
relevance feedback etc. Although the front-end
systems at present reside on the same machine,
the dialogue between the front-end and the BSS is
roughly comparable to that which might take
place using the Z39.50 or Search & Retrieve
protocols. It concerns mainly specifications for
and descriptions of search sets, and involves
actual records only at the time of display.
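The shape of such a dialogue can be illustrated with a minimal sketch. Everything in it is hypothetical (the class `ToyBSS`, the methods `find` and `show`, and the sample records); it is not the Okapi BSS interface, only an illustration of a front-end that receives descriptions of search sets and sees record text solely when it asks for a display.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSet:
    """Server-side result set; the front-end only sees its id and size."""
    set_id: int
    doc_ids: list


@dataclass
class ToyBSS:
    """Hypothetical stand-in for a search engine holding records and search sets."""
    records: dict                                  # doc_id -> record text
    sets: dict = field(default_factory=dict)       # set_id -> SearchSet

    def find(self, term):
        """Create a search set for one term and return only its description."""
        doc_ids = [doc_id for doc_id, text in sorted(self.records.items())
                   if term in text.lower().split()]
        set_id = len(self.sets) + 1
        self.sets[set_id] = SearchSet(set_id, doc_ids)
        return {"set": set_id, "size": len(doc_ids)}

    def show(self, set_id, start=0, count=1):
        """Return actual records; the only call that transfers record text."""
        ids = self.sets[set_id].doc_ids[start:start + count]
        return [(doc_id, self.records[doc_id]) for doc_id in ids]


# Front-end side: specify a search, receive a description, then display.
bss = ToyBSS({1: "text retrieval conference",
              2: "okapi text engine",
              3: "boolean host"})
desc = bss.find("text")                  # description of the set, no records
first = bss.show(desc["set"], count=1)   # records travel only at display time
```

Keeping the search sets on the engine side means that the same exchange would work unchanged if the front-end were later moved to a different machine.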
All automatic searching for the TREC project
involved purpose-written front-ends to the BSS.
A further front-end was developed for manual
searching. This was designed to include most of
the functionality of the old interactive version of
Okapi, but not to emulate its user interface; it is
command-driven.
3.2 Mixing Boolean and weighted searching
One characteristic of the BSS needs explaining.
The BSS is capable of conducting Boolean
searches as well as weighted (best match)
searches. Furthermore, any Boolean expression
(resulting in an undifferentiated search set) can be
treated as if it were a single term in the weighted
searching model. This is compatible with the
approach taken in the Cirt system (which acted as
a front-end to a Boolean host) (Robertson et al.,
1986); particular examples of uses in Cirt include
ORed synonyms and phrases constructed with the
ADJ operator. The Okapi BSS does not at present
allow proximity operators such as ADJ, but the
principle is the same.
To a very limited extent, this facility was used by
the manual searchers (see 5.3).
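The idea of treating a Boolean result as a single weighted term can be sketched as follows. The function names, postings and weights are invented for illustration and are not taken from the BSS; the sketch assumes a simple sum-of-weights best-match score.

```python
def boolean_or(*posting_sets):
    """OR: union of the document sets of several terms (e.g. synonyms)."""
    result = set()
    for docs in posting_sets:
        result |= docs
    return result


def best_match(weighted_sets):
    """Rank documents by the summed weights of the 'terms' they match.

    Each entry is (document_set, weight). A document_set may come from a
    single term or from any Boolean expression; the scoring does not care.
    """
    scores = {}
    for docs, weight in weighted_sets:
        for doc_id in docs:
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical postings: term -> set of document ids.
postings = {"car": {1, 2}, "automobile": {2, 3}, "engine": {1, 3, 4}}

# The ORed synonyms form one undifferentiated set, weighted as one term.
synonyms = boolean_or(postings["car"], postings["automobile"])
ranking = best_match([(synonyms, 2.0), (postings["engine"], 1.5)])
```

Document 2 contains both synonyms but receives the synonym weight only once, which is the effect of treating the ORed set as a single term.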
3.3 Term selection for query expansion
Interactive Okapi automatically selected terms
from relevant documents for query expansion by
taking the top x (=20) terms according to their
relevance weights. The BSS version uses the
Robertson selection value (Robertson, 1990),
approximately r*w (where w is the usual F4
weight). (See also discussion in section 6.3,
which shows that there was an error in taking this
approximation.) Also, the interface used in the
manual TREC experiments allows semi-automatic
query expansion: the list of candidate terms can
be displayed so that the searcher can select terms
and enter them manually, or the top 20 terms can
be used automatically.
Terms once selected are weighted using F4 in the
usual way, except with the modification indicated
below.
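As an illustration of ranking by r*w, the following sketch uses the usual form of the F4 weight with 0.5 corrections. The function names and the sample statistics are hypothetical, and the sketch shows only the approximation as stated here, not the corrected form discussed in section 6.3.

```python
import math


def f4_weight(r, R, n, N):
    """F4 relevance weight with the usual 0.5 corrections.

    r: known relevant documents containing the term
    R: known relevant documents
    n: documents containing the term
    N: documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def select_expansion_terms(term_stats, R, N, top=20):
    """Rank candidate terms by r * w and keep the top few.

    term_stats maps each term to (r, n). Returns (term, w) pairs.
    """
    scored = []
    for term, (r, n) in term_stats.items():
        w = f4_weight(r, R, n, N)
        scored.append((r * w, term, w))
    # Highest selection value first; ties broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(term, w) for _, term, w in scored[:top]]


# Hypothetical statistics: 10 known relevant documents in a collection of 1000.
stats = {"okapi": (8, 20), "system": (9, 600), "rare": (1, 2)}
chosen = select_expansion_terms(stats, R=10, N=1000, top=2)
```

In this example "rare" has a higher F4 weight than "system" but a lower selection value, because it occurs in only one relevant document; ranking by r*w rather than by w alone therefore favours terms that are well represented in the relevant set.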
3.4 Bias towards query terms
In interactive Okapi, the terms in the original
query held no special position in the query
expansion process, except in the sense that a
"semi-stopword" in the original query would be a
candidate for the feedback query, whereas the
same term occurring in a relevant document but
not in the query would not be considered.
For the TREC experiments, some bias in favour
of query terms was built in, in the form of some
hypothetical relevant documents assumed to
contain the query terms (Harman, 1992;
Bookstein, 1983). These hypothetical relevant
documents then contributed to the calculation of
F4. Different quantitative assumptions were
made in different TREC experiments (see section
5), but once again an error crept into the
implementation of this facility (see section 6.3).
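One way such a bias could enter the F4 calculation is sketched below. The scheme is an assumption made for illustration only: k hypothetical relevant documents are added, each taken to contain every query term and no other term. The name `biased_f4`, the value of k and the sample counts are all hypothetical; the quantities actually used in the experiments are those given in section 5.

```python
import math


def f4_weight(r, R, n, N):
    """F4 relevance weight with the usual 0.5 corrections."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def biased_f4(r, R, n, N, is_query_term, k=3):
    """F4 after adding k hypothetical relevant documents (k is illustrative).

    Assumption: each hypothetical document contains every query term and
    no other term. Query terms therefore gain k in both r and n, while
    R and N grow by k for all terms.
    """
    extra = k if is_query_term else 0
    return f4_weight(r + extra, R + k, n + extra, N + k)


# Hypothetical counts: a term seen in 2 of 10 known relevant documents
# and in 50 of 1000 documents overall.
plain = f4_weight(2, 10, 50, 1000)
as_query_term = biased_f4(2, 10, 50, 1000, is_query_term=True)
as_other_term = biased_f4(2, 10, 50, 1000, is_query_term=False)
```

Under these assumptions the same observed counts give a higher weight when the term comes from the original query and a lower one when it does not.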
4. Input processing
4.1 Converting the raw files
The Okapi system needs databases to be in its
own format, in which each record consists of an
identical sequence of fields in the form of
terminated text strings. Fields are identified by
sequence number only. Using the given