SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Okapi at TREC-2
chapter
S. Robertson
S. Walker
S. Jones
M. Hancock-Beaulieu
M. Gatford
National Institute of Standards and Technology
D. K. Harman
Okapi at TREC-2
S E Robertson*
S Walker* S Jones* M M Hancock-Beaulieu*
M Gatford*
Advisers: E Michael Keen (University of Wales, Aberyst-
wyth), Karen Sparck Jones (Cambridge University), Peter
Willett (University of Sheffield)
1 Introduction
This paper reports on City University's work on the
TREC-2 project from its commencement up to Novem-
ber 1993. It includes many results which were obtained
after the August 1993 deadline for submission of official
results.
For TREC-2, as for TREC-1, City University used
versions of the Okapi text retrieval system much as de-
scribed in [2] (see also [3, 4]). Okapi is a simple and
robust set-oriented system based on a generalised prob-
abilistic model with facilities for relevance feedback, but
also supporting a full range of deterministic Boolean and
quasi-Boolean operations.
For TREC-1 [1] the "standard" Robertson-Sparck
Jones weighting function was used for all runs (equa-
tion 1, see also [5]). City's performance was not out-
standingly good among comparable systems, and the
intention for TREC-2 was to develop and investigate a
number of alternative probabilistic term-weighting func-
tions. Other possibilities included varieties of query ex-
pansion, database models enabling paragraph retrieval
and the use of phrases obtained by query parsing.
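
As a rough illustration of the "standard" Robertson-Sparck Jones weight mentioned above, the following is a minimal sketch of its commonly cited form with 0.5 corrections. The function name `rsj_weight` and its argument names are assumptions made for this example; the paper's own equation 1 remains the authoritative statement, and this is not Okapi's code.

```python
import math


def rsj_weight(N, n, R=0, r=0):
    """Robertson-Sparck Jones relevance weight with 0.5 corrections.

    N: number of documents in the collection
    n: number of documents containing the term
    R: number of known relevant documents (0 if no relevance information)
    r: number of known relevant documents containing the term
    (Argument names are illustrative, not taken from the Okapi source.)
    """
    if not (0 <= r <= R <= N and r <= n <= N and n - r <= N - R):
        raise ValueError("inconsistent counts")
    # Odds of the term occurring in relevant documents, divided by the
    # odds of it occurring in non-relevant documents.
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)


if __name__ == "__main__":
    # Without relevance information the weight behaves like an
    # inverse-document-frequency measure: rarer terms score higher.
    print(rsj_weight(N=1000, n=10))
    print(rsj_weight(N=1000, n=500))
    # With feedback: 4 of 5 known relevant documents contain the term.
    print(rsj_weight(N=1000, n=10, R=5, r=4))
```

With R = r = 0 the expression reduces to log((N - n + 0.5) / (n + 0.5)), which is why the same function can serve both before and after relevance feedback.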
Unfortunately, a prolonged disk failure prevented re-
alistic test runs until almost the deadline for submission
of results. A full inversion of the disks 1 and 2 database
was only achieved a few hours before the final auto-
matic runs. None of the new weighting functions (Sec-
tion 2) was properly evaluated until after the results
had been submitted to NIST; we have since discovered
that several of these models perform much better than
the weighting functions used for the official runs, and
most of the results reported herein are from these later
runs.
1.1 The system
The Okapi system comprises a search engine or basic
search system (BSS), a low-level interface used mainly
for batch runs and a user interface for the manual search
* Centre for Interactive Systems Research, Department of In-
formation Science, City University, Northampton Square, London
EC1V 0HB, UK
experiments (Section 5), together with data conver-
sion and inversion utilities. The hardware consisted of
Sun SPARC machines with up to 40 MB of memory,
and, occasionally, about 8 GB of disk storage. Several
databases were used from time to time: full disks 1 and
2, AP (disk 1) and WSJ (disk 1), full disk 3. All in-
verted indexes included complete within-document po-
sitional information, enabling term frequency and term
proximity to be used. Typical index size overhead was
around 80% of the textfile size. Elapsed time for in-
version of disks 1 and 2 was about two days. Running
and evaluating a single topic took from about one
minute to ten minutes, depending strongly on the num-
ber of query terms. All preliminary evaluation used the
"old" SMART evaluation program. Runs tabulated in
this paper used an early version of the new evaluation
program, for which we are grateful to Chris Buckley of
Cornell University.
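
To illustrate how within-document positional information supports both term frequency and term proximity, here is a minimal sketch of a positional inverted index. The whitespace tokenisation, the in-memory dictionary layout and all function names are assumptions made for this example; they do not describe Okapi's actual index format.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [token positions]}.

    docs: dict of doc_id -> text. Tokenisation is a plain lowercase
    whitespace split (an assumption made for this sketch).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def term_frequency(index, term, doc_id):
    """Within-document frequency is the length of the position list."""
    return len(index.get(term, {}).get(doc_id, []))


def min_distance(index, term_a, term_b, doc_id):
    """Smallest gap in token positions between two terms, or None."""
    pa = index.get(term_a, {}).get(doc_id, [])
    pb = index.get(term_b, {}).get(doc_id, [])
    if not pa or not pb:
        return None
    i = j = 0
    best = abs(pa[0] - pb[0])
    # Both lists are sorted, so one merge-style pass is enough.
    while i < len(pa) and j < len(pb):
        best = min(best, abs(pa[i] - pb[j]))
        if pa[i] < pb[j]:
            i += 1
        else:
            j += 1
    return best


if __name__ == "__main__":
    docs = {1: "text retrieval system for text", 2: "retrieval of documents"}
    idx = build_positional_index(docs)
    print(term_frequency(idx, "text", 1))
    print(min_distance(idx, "text", "system", 1))
```

Storing positions rather than bare counts is what makes the index larger, but it lets a single structure answer both frequency and adjacency questions.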
2 Some new probabilistic models
Statistical approaches to information retrieval have tra-
ditionally (to over-simplify grossly) taken two forms:
(a) approaches based on formal models, where the
model specifies an exact formula;
(b) ad-hoc approaches, where formulae are tried be-
cause they seem to be plausible.
Both categories have had some notable successes. A
more recent variant is the regression approach of Fuhr
and Cooper (see, for example, [6]), which combines an
ad-hoc choice of independent variables, and of functions
of them, with a formal model for assessing their value
in retrieval, selecting from among them and assigning
weights to them.
One problem with the formal model approach is that
it is often very difficult to take into account the wide
variety of variables that are thought or known to influ-
ence retrieval. The difficulty arises either because there
is no known basis for a model containing such variables,
or because any such model may simply be too complex
to give a usable exact formula.
One problem with the ad-hoc approach is that there is
little guidance as to how to deal with specific variables:
one has to guess at a formula and try it out. This