NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, National Institute of Standards and Technology

Okapi at TREC-2

S E Robertson*, S Walker*, S Jones*, M M Hancock-Beaulieu*, M Gatford*

Advisers: E Michael Keen (University of Wales, Aberystwyth), Karen Sparck Jones (Cambridge University), Peter Willett (University of Sheffield)

1 Introduction

This paper reports on City University's work on the TREC-2 project from its commencement up to November 1993. It includes many results which were obtained after the August 1993 deadline for submission of official results.

For TREC-2, as for TREC-1, City University used versions of the Okapi text retrieval system much as described in [2] (see also [3, 4]). Okapi is a simple and robust set-oriented system based on a generalised probabilistic model with facilities for relevance feedback, but also supporting a full range of deterministic Boolean and quasi-Boolean operations.

For TREC-1 [1] the "standard" Robertson-Sparck Jones weighting function was used for all runs (equation 1, see also [5]). City's performance was not outstandingly good among comparable systems, and the intention for TREC-2 was to develop and investigate a number of alternative probabilistic term-weighting functions. Other possibilities included varieties of query expansion, database models enabling paragraph retrieval and the use of phrases obtained by query parsing.

Unfortunately, a prolonged disk failure prevented realistic test runs until almost the deadline for submission of results. A full inversion of the disks 1 and 2 database was only achieved a few hours before the final automatic runs.
None of the new weighting functions (Section 1.1) was properly evaluated until after the results had been submitted to NIST; we have since discovered that several of these models perform much better than the weighting functions used for the official runs, and most of the results reported herein are from these later runs.

1.1 The system

The Okapi system comprises a search engine or basic search system (BSS), a low level interface used mainly for batch runs and a user interface for the manual search experiments (Section 5), together with data conversion and inversion utilities. The hardware consisted of Sun SPARC machines with up to 40 MB of memory, and, occasionally, about 8 GB of disk storage. Several databases were used from time to time: full disks 1 and 2; AP (disk 1) and WSJ (disk 1); and full disk 3. All inverted indexes included complete within-document positional information, enabling term frequency and term proximity to be used. Typical index size overhead was around 80% of the textfile size. Elapsed time for inversion of disks 1 and 2 was about two days. Running a single topic with evaluation took from about one minute to ten minutes, depending strongly on the number of query terms.

All preliminary evaluation used the "old" SMART evaluation program. Runs tabulated in this paper used an early version of the new evaluation program, for which we are grateful to Chris Buckley of Cornell University.

* Centre for Interactive Systems Research, Department of Information Science, City University, Northampton Square, London EC1V 0HB, UK

2 Some new probabilistic models

Statistical approaches to information retrieval have traditionally (to over-simplify grossly) taken two forms:

(a) approaches based on formal models, where the model specifies an exact formula;
(b) ad-hoc approaches, where formulae are tried because they seem to be plausible.

Both categories have had some notable successes.
A more recent variant is the regression approach of Fuhr and Cooper (see, for example, [6]), which incorporates ad-hoc choice of independent variables and functions of them with a formal model for assessing their value in retrieval, selecting from among them and assigning weights to them.

One problem with the formal model approach is that it is often very difficult to take into account the wide variety of variables that are thought or known to influence retrieval. The difficulty arises either because there is no known basis for a model containing such variables, or because any such model may simply be too complex to give a usable exact formula.

One problem with the ad-hoc approach is that there is little guidance as to how to deal with specific variables: one has to guess at a formula and try it out. This