NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

Okapi at TREC
S. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, M. Lau

7. Results and discussion

Full results can be seen in the official tables. The evaluation of the feedback run was treated in a somewhat special way, by agreement with the organizers. The original plan had been to do "residual ranking" evaluation, i.e. to remove from the collection those items which were assessed for relevance for feedback purposes, and to evaluate two runs (with and without feedback) on the reduced collection. This would have allowed a comparison between these two runs, but not between the feedback run and any of the other results presented. Instead, a "frozen ranks" evaluation was used, in which the documents examined for relevance before feedback were retained as the top-ranking documents in the feedback run. This simulates a real search, in that those documents would have been seen (in some form) by the user and would therefore have to be regarded as part of the output of the system. It may therefore be seen as a fairer evaluation of feedback than residual ranking, although it is likely to reduce the apparent effect of feedback.

A very brief summary of the results, taking just two measures from the tables, is as follows:

   Run        Type          11-point average   Precision at 5 docs
   cityal     Ad-hoc auto   12.1%              49.6%
   citymi     Manual        15.6%              57.6%
   citym2*    Feedback      18.2%              58.8%
   cityri     Routing       17.7%              54.8%

   (* Frozen ranks evaluation)

The performance of the automatic ad-hoc run is really rather poor. The manual run without feedback is better. Feedback does clearly produce an improvement (though not, of course, given the frozen ranks evaluation, at the high-precision end). It seems that both the choice of terms and the liberal relevance judgements by non-expert students are effective at least to some degree.
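The difference between the two evaluation schemes can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used for the official runs; the document ids and relevance judgements are invented:

```python
# Sketch of the two feedback-evaluation schemes discussed above.
# A ranking is a list of document ids; `relevant` is the set of
# relevant ids; `seen` holds the documents judged before feedback.

def residual_ranking(feedback_run, seen):
    """Residual ranking: remove the documents already judged for
    feedback, then evaluate on the reduced collection."""
    return [d for d in feedback_run if d not in seen]

def frozen_ranks(feedback_run, seen_in_order):
    """Frozen ranks: keep the judged documents at the top of the
    ranking, in the order originally shown, then the re-ranked rest."""
    rest = [d for d in feedback_run if d not in set(seen_in_order)]
    return list(seen_in_order) + rest

def precision_at(ranking, relevant, k):
    """Fraction of the top k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

# Toy data: docs 1 and 2 were judged before feedback.
seen = [1, 2]
relevant = {1, 3, 5}
feedback_run = [3, 5, 1, 4, 2, 6]   # ranking produced after feedback

print(precision_at(frozen_ranks(feedback_run, seen), relevant, 5))   # 0.6
print(precision_at(residual_ranking(feedback_run, seen), relevant, 4))  # 0.5
```

Because the frozen top ranks include any non-relevant documents the user has already seen, high-precision measures cannot improve under this scheme, which is the effect noted above.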
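The "11-point average" in the table is interpolated precision averaged over the recall levels 0.0, 0.1, ..., 1.0. A minimal sketch of the computation (the function name and toy data are ours, not from the official evaluation software):

```python
def eleven_point_average(ranking, relevant):
    """Interpolated precision averaged at recall 0.0, 0.1, ..., 1.0.

    At each recall level, interpolated precision is the maximum
    precision achieved at that recall level or beyond; levels past
    the highest recall reached score zero.
    """
    R = len(relevant)
    hits = 0
    points = []  # (recall, precision) after each relevant doc retrieved
    for i, d in enumerate(ranking, 1):
        if d in relevant:
            hits += 1
            points.append((hits / R, hits / i))

    def interpolated(level):
        ps = [p for r, p in points if r >= level]
        return max(ps) if ps else 0.0

    return sum(interpolated(l / 10) for l in range(11)) / 11
```

For example, the ranking [1, 2, 3] against relevant set {1, 3} gives precision 1.0 at recall 0.5 and 2/3 at recall 1.0, so the 11-point average is (6 x 1.0 + 5 x 2/3) / 11, roughly 0.848.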
(We have yet to compare the individual judgements by the students with the "correct" ones provided by the TREC organizers, or to establish whether the "correct" judgements would have given us greater performance benefits.) The routing results seem reasonable.

In general, we believe that the simple, robust and minimum-effort methods we have adopted in Okapi have been shown to work, even with very different material (both documents and queries) from that for which Okapi was originally designed. Performance, both in absolute terms and relative to the other TREC entries, is respectable but by no means wonderful. We also believe that there is much scope for improvement; there are other simple and robust methods (such as other weighting formulae or different treatments of compound terms) to which Okapi would be hospitable, and which may bring performance up to a more acceptable level. We look forward to TREC 2.
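The Robertson/Sparck Jones relevance weight, one of the "simple and robust" weighting schemes of the kind referred to above, can be sketched as follows. This is the standard 0.5-corrected point estimate; the function and variable names are ours:

```python
import math

def rsj_weight(N, n, R, r):
    """Robertson/Sparck Jones relevance weight (0.5-corrected form).

    N: number of documents in the collection
    n: number of documents containing the term
    R: number of known relevant documents
    r: number of relevant documents containing the term
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5))
                    / ((n - r + 0.5) * (R - r + 0.5)))

# With no relevance information (R = r = 0) the weight reduces to an
# IDF-like form, log((N - n + 0.5) / (n + 0.5)), so the same formula
# serves before and after relevance feedback.
```

As relevance information accumulates, terms that occur in a larger share of the known relevant documents receive higher weights, which is what drives query expansion and reweighting after feedback.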