NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Okapi at TREC
S. Robertson
S. Walker
M. Hancock-Beaulieu
A. Gull
M. Lau
National Institute of Standards and Technology
Donna K. Harman
7. Results and discussion
Full results can be seen in the official tables. The
evaluation of the feedback run was treated in a
somewhat special way, by agreement with the
organizers. The original plan had been to do
"residual ranking" evaluation, i.e. to remove from
the collection those items which were assessed for
relevance for feedback purposes, and to evaluate
two runs (with or without feedback) on the
reduced collection. This would have allowed a
comparison between these two runs, but not
between the feedback run and any of the other
results presented.
Instead, a "frozen rank" evaluation was used, in
which the documents examined for relevance
before feedback were retained as the top-ranking
documents in the feedback run. This simulates a
real search, in that those documents would have
been seen (in some form) by the user and would
therefore have to be regarded as part of the output
of the system. Therefore it may be seen as a
fairer evaluation of feedback than residual
ranking, although it is likely to reduce the
apparent effect of feedback.
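The contrast between the two evaluation schemes can be sketched as follows (a minimal illustration, not the original Okapi code; the function names and the list-of-document-ids representation are ours):

```python
# Sketch of the two feedback-evaluation schemes discussed above.
# "judged" is the set of document ids the user examined for relevance
# before feedback; "pre_feedback" and "post_feedback" are the full
# rankings produced before and after feedback.

def residual_ranking(post_feedback, judged):
    """Residual ranking: remove the documents already assessed for
    feedback and evaluate only on the reduced collection."""
    return [d for d in post_feedback if d not in judged]

def frozen_ranks(pre_feedback, post_feedback, judged):
    """Frozen ranks: retain the documents examined before feedback as
    the top-ranking documents (in their original order), then append
    the feedback re-ranking of the remainder."""
    frozen_top = [d for d in pre_feedback if d in judged]
    remainder = [d for d in post_feedback if d not in judged]
    return frozen_top + remainder
```

Under frozen ranks the simulated output is directly comparable with the non-feedback runs, since the user is charged for every document seen.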
A very brief summary of the results, taking just
two measures from the tables, is as follows:
             11-point    Precision
   Run       average     at 5 docs
   citya1     12.1%       49.6%     Ad-hoc auto
   citym1     15.6%       57.6%     Manual
   citym2*    18.2%       58.8%     Feedback
   cityr1     17.7%       54.8%     Routing
   (*Frozen ranks evaluation)
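For reference, the two measures in the table can be computed as follows (a standard sketch of these TREC measures, not code from the Okapi system; names are illustrative):

```python
def precision_at_k(ranking, relevant, k=5):
    """Proportion of relevant documents among the top k retrieved."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def eleven_point_average(ranking, relevant):
    """Interpolated precision averaged over the eleven recall levels
    0.0, 0.1, ..., 1.0."""
    total_relevant = len(relevant)
    hits = 0
    recall_precision = []  # (recall, precision) after each relevant hit
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            recall_precision.append((hits / total_relevant, hits / rank))
    points = []
    for level in (r / 10 for r in range(11)):
        # Interpolated precision: best precision at recall >= level.
        candidates = [p for rec, p in recall_precision if rec >= level]
        points.append(max(candidates) if candidates else 0.0)
    return sum(points) / 11
```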
The performance of the automatic ad-hoc run is
really rather poor. The manual run without
feedback is better. Feedback does clearly produce
an improvement (though not, of course, given the
frozen ranks evaluation, at the high-precision
end). It seems that both the choice of terms and
the liberal relevance judgements by non-expert
students are effective at least to some degree.
(We have yet to compare the individual
judgements by the students with the "correct"
ones provided by the TREC organizers, or to
establish whether the "correct" judgements would
have given us greater performance benefits.) The
routing results seem reasonable.
In general, we believe that the simple, robust and
minimum-effort methods we have adopted in
Okapi have been shown to work, even with very
different material (both documents and queries)
from that for which Okapi was originally
designed. Performance, both in absolute terms
and relative to the other TREC entries, is
respectable but by no means wonderful. We also
believe that there is much scope for improvement;
there are other simple and robust methods (such
as other weighting formulae or different
treatments of compound terms) to which Okapi
would be hospitable, and which may bring
performance up to a more acceptable level. We
look forward to TREC 2.