SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Okapi at TREC
chapter
S. Robertson
S. Walker
M. Hancock-Beaulieu
A. Gull
M. Lau
National Institute of Standards and Technology
Donna K. Harman
changes which happened concurrently with, and
were necessary for, the TREC work.
Okapi is a family of bibliographic retrieval
systems, developed under a series of grants from
the British Library. It is suitable for searching
files of records whose fields contain textual data
of variable length up to a few tens of thousands of
characters. It allows the implementation of a
variety of search techniques based on the
probabilistic retrieval model, with easy-to-use
interfaces, on databases of operational size and
under operational conditions (Walker, 1989;
Walker & De Vere, 1990; Walker &
Hancock-Beaulieu, 1991; Hancock-Beaulieu & Walker,
1992).
The main purpose of the Okapi installation at City
is to allow the use of a variety of evaluation
methods, including live-user evaluation in the
context of user information-seeking behaviour.
2.1 Search techniques
The interactive Okapi system uses probabilistic
"best match" searching, and can handle queries of
up to 32 terms. (There is no Boolean search
facility in interactive Okapi, but see Section 3.2 below
concerning the development system.) Search
terms may be keywords or phrases, or any other
record component which has been indexed, and
are extracted automatically by very simple
"parsing" of an initial natural language query.
Search terms are assigned weights, based on
inverse document frequency in the absence of
relevance information and on the F4 formula
given in Robertson & Sparck Jones (1976) when
relevance information is available. The match
function is a simple sum-of-weights. There are
facilities for "adjusting" the weighting to favour
(for example) terms occurring in specified fields.
There is also a limited alphabetical browsing
facility (of records in index term order).
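The sum-of-weights match function described above can be sketched as follows. This is a minimal illustration, not Okapi's code: the name `best_match`, the in-memory `index` (term to set of record identifiers) and the `weights` dictionary are all assumptions made for the example.

```python
def best_match(query_terms, index, weights, top_k=10):
    """Rank records by the sum of the weights of the query terms they contain.

    index   : dict mapping term -> set of record ids (hypothetical structure)
    weights : dict mapping term -> weight (e.g. IDF or F4 values)
    """
    scores = {}
    for term in set(query_terms):  # each query term counts once
        for record_id in index.get(term, ()):
            scores[record_id] = scores.get(record_id, 0.0) + weights.get(term, 0.0)
    # Highest score first; ties broken by record id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


index = {"probabilistic": {1, 2}, "retrieval": {2, 3}}
weights = {"probabilistic": 2.0, "retrieval": 1.0}
ranking = best_match(["probabilistic", "retrieval"], index, weights)
```

Here record 2 contains both terms and so ranks first; a record matching only some of the query terms is still retrieved, which is what distinguishes "best match" from Boolean searching.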
The F4 formula, point-5 version, is:
$$w = \log \frac{(r+0.5)\,(N-R-n+r+0.5)}{(R-r+0.5)\,(n-r+0.5)}$$
where N = collection size
n = number of postings of term
R = total known relevant documents
r = number of these posted to the term
The inverse document frequency (IDF) weight is
F4 with R=r=0, i.e.
$$w = \log \frac{N-n+0.5}{n+0.5}$$
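The two weights defined above can be computed directly from N, n, R and r. The sketch below is illustrative only: the function names are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def f4_weight(N, n, R=0, r=0):
    """Point-5 F4 weight (Robertson & Sparck Jones, 1976).

    N: collection size; n: postings of the term;
    R: known relevant documents; r: relevant documents posted to the term.
    Natural log is assumed here; the source does not give the base.
    """
    numerator = (r + 0.5) * (N - R - n + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)


def idf_weight(N, n):
    """IDF weight: F4 with no relevance information (R = r = 0)."""
    return f4_weight(N, n, R=0, r=0)


# A term posted to 10 of 1000 records, before and after feedback in which
# 3 of 5 known relevant records contain the term.
before = idf_weight(1000, 10)
after = f4_weight(1000, 10, R=5, r=3)
```

With these figures the weight rises after feedback, because the term is far more frequent among the known relevant records than in the collection as a whole.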
2.2 Relevance feedback and query expansion
The system can invite relevance judgments from
the user, and following one or more positive
relevance assessments it can perform an
"expanded" search, using the original query terms
together with additional terms extracted
automatically from the relevant records. This
procedure can be iterated.
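One round of query expansion might be sketched as below. The text does not say how the additional terms are chosen, so the selection rule used here (rank candidate terms from the relevant records by their F4 weight and keep the best few) is an assumption, as are the function names and the `max_new_terms` parameter.

```python
import math
from collections import Counter


def f4_weight(N, n, R, r):
    """Point-5 F4 weight; natural log assumed."""
    return math.log(((r + 0.5) * (N - R - n + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))


def expand_query(query_terms, relevant_records, doc_freq, N, max_new_terms=5):
    """Return the original terms plus the best-weighted new terms.

    relevant_records : list of term lists, one per record judged relevant
    doc_freq         : dict term -> number of postings in the collection
    The ranking-by-F4 selection rule is an assumption for illustration.
    """
    R = len(relevant_records)
    # r for each term: how many relevant records contain it.
    r_counts = Counter(t for record in relevant_records for t in set(record))
    candidates = [
        (f4_weight(N, doc_freq[term], R, r), term)
        for term, r in r_counts.items()
        if term not in query_terms
    ]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return list(query_terms) + [term for _, term in candidates[:max_new_terms]]


doc_freq = {"okapi": 5, "search": 60, "weighting": 4, "the": 95}
relevant = [["okapi", "weighting", "the"], ["okapi", "the", "search"]]
expanded = expand_query(["okapi"], relevant, doc_freq, N=100, max_new_terms=1)
```

To iterate, the expanded query would be searched again, further relevance judgments collected, and `expand_query` called with the enlarged set of relevant records.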
2.3 Language processing
Very simple text and linguistic processing is
applied during indexing and searching.
There are two levels of automatic stemming, and
a mainly rule-based procedure for conflating
British and American spellings.
There are facilities for constructing and using a
simple linguistic knowledge base containing "go"
phrases, classes of terms to be treated as
synonymous, prefixes, stopwords and phrases,
and "semi-stopwords": words and phrases to be
treated as relatively unimportant in processing a
query.
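A rough sketch of how such a knowledge base could be applied when parsing a query is given below. Everything in it is hypothetical: the entries, the reading of "go" phrases as phrases kept together as a single term, the down-weighting factor for semi-stopwords, and the function name are assumptions for illustration, and stemming, prefixes and spelling conflation are omitted.

```python
import re

# Hypothetical knowledge-base entries, for illustration only.
STOPWORDS = {"a", "an", "the", "to", "for", "of"}
SEMI_STOPWORDS = {"introduction"}
SYNONYM_CLASSES = {"auto": "automobile", "car": "automobile"}
GO_PHRASES = {("information", "retrieval")}
SEMI_STOPWORD_FACTOR = 0.5  # assumed down-weighting factor


def parse_query(text):
    """Return (term, weight factor) pairs for a natural language query."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in GO_PHRASES:  # keep the phrase together as one term
            terms.append((" ".join(pair), 1.0))
            i += 2
            continue
        token = tokens[i]
        i += 1
        if token in STOPWORDS:  # dropped entirely
            continue
        token = SYNONYM_CLASSES.get(token, token)  # map to class representative
        factor = SEMI_STOPWORD_FACTOR if token in SEMI_STOPWORDS else 1.0
        terms.append((token, factor))
    return terms


parsed = parse_query("An introduction to information retrieval for the car")
```

The factor returned with each term could then scale that term's search weight, so that semi-stopwords contribute to the match without dominating it.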
2.4 Usage
The interactive system is intended for highly
interactive use by untrained users.
2.5 Logging
The system can produce detailed logs of both user
and system activity, down to keystroke level and
sub-second granularity.
2.6 Present use and status
The present use of Okapi is primarily as a tool for
the evaluation of highly interactive bibliographic
search systems with untrained users. It is also to
be used in an investigation of the use of linguistic
knowledge structures (e.g. thesauri) in text
retrieval systems.
The system is not commercially available. It is not
finished, maintained or documented to
commercial standards. It is, however, designed
for live use, and there has, over the years, been a