SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Okapi at TREC chapter S. Robertson S. Walker M. Hancock-Beaulieu A. Gull M. Lau National Institute of Standards and Technology Donna K. Harman

considerable amount of use under live conditions. It is a set of functions from which experienced designers and programmers can construct retrieval systems, rather than a finished "product".

3. Concurrent developments

3.1 Towards a distributed system

This development reflects a long-standing plan for the Okapi project, but was brought forward to facilitate work on the TREC database. Okapi has been split into a Basic Search System (BSS) and a number of front-end systems. The BSS is essentially a database engine offering basic text retrieval functionality, extended in various ways to allow weighting, ranking and relevance feedback etc. Although the front-end systems at present reside on the same machine, the dialogue between the front-end and the BSS is roughly comparable to that which might take place using the Z39.50 or Search & Retrieve protocols. It concerns mainly specifications for and descriptions of search sets, and involves actual records only at the time of display.

All automatic searching for the TREC project involved purpose-written front-ends to the BSS. A further front-end was developed for manual searching. This was designed to include most of the functionality of the old interactive version of Okapi, but not to emulate its user interface; it is command-driven.

3.2 Mixing Boolean and weighted searching

One characteristic of the BSS needs explaining. The BSS is capable of conducting Boolean searches as well as weighted (best match) searches. Furthermore, any Boolean expression (resulting in an undifferentiated search set) can be treated as if it were a single term in the weighted searching model.
This is compatible with the approach taken in the Cirt system (which acted as a front-end to a Boolean host) (Robertson et al., 1986); particular examples of uses in Cirt include ORed synonyms and phrases constructed with the ADJ operator. The Okapi BSS does not at present allow proximity operators such as ADJ, but the principle is the same. To a very limited extent, this facility was used by the manual searchers (see 5.3).

3.3 Term selection for query expansion

Interactive Okapi automatically selected terms from relevant documents for query expansion by taking the top x (=20) terms according to their relevance weights. The BSS version uses the Robertson selection value (Robertson, 1990), approximately r*w (where w is the usual F4 weight). (See also discussion in section 6.3, which shows that there was an error in taking this approximation.) Also, the interface used in the manual TREC experiments allows semi-automatic query expansion, in that the list of candidate terms can be displayed for the searcher to make selections from (and then entered manually), or the top 20 terms can be used automatically. Terms once selected are weighted using F4 in the usual way, except with the modification indicated below.

3.4 Bias towards query terms

In interactive Okapi, the terms in the original query held no special position in the query expansion process, except in the sense that a "semi-stopword" in the original query would be a candidate for the feedback query, whereas the same term occurring in a relevant document but not in the query would not be considered. For the TREC experiments, some bias in favour of query terms was built in, in the form of some hypothetical relevant documents assumed to contain the query terms (Harman, 1992; Bookstein, 1983). These hypothetical relevant documents then contributed to the calculation of F4.
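As an illustration of the term selection and query-term bias described in 3.3 and 3.4, the following is a minimal sketch, not the BSS code. It assumes that F4 is the Robertson/Sparck Jones relevance weight with the usual 0.5 correction, and that each of k hypothetical relevant documents contains every query term and no other term. The function names, the parameter k and the treatment of non-query terms are assumptions made for the example; the quantities actually used in the experiments are those reported in section 5, and section 6.3 reports errors both in the r*w approximation and in the implementation of the bias.

```python
import math


def f4_weight(N, n, R, r):
    """Relevance weight with the 0.5 correction (assumed form of F4).

    N: documents in the collection    n: documents containing the term
    R: known relevant documents       r: relevant documents containing the term
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def biased_f4_weight(N, n, R, r, k, is_query_term):
    """F4 after adding k hypothetical relevant documents (assumption).

    The hypothetical documents are counted as relevant and as containing
    every query term, so a query term gains k in both n and r, while any
    other term only sees N and R grow.
    """
    if is_query_term:
        return f4_weight(N + k, n + k, R + k, r + k)
    return f4_weight(N + k, n, R + k, r)


def select_expansion_terms(stats, N, R, top=20):
    """Rank candidate terms by the approximate selection value r*w.

    stats maps each candidate term to a pair (n, r).
    """
    scored = []
    for term, (n, r) in stats.items():
        scored.append((r * f4_weight(N, n, R, r), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:top]]
```

Under these assumptions, increasing k raises the weight of a query term and lowers the weight of a term that appears only in the real relevant documents, which is the direction of bias the text describes. With k = 0 the biased weight reduces to the plain F4 weight.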
Different quantitative assumptions were made in different TREC experiments (see section 5), but once again an error crept into the implementation of this facility (see section 6.3).

4. Input processing

4.1 Converting the raw files

The Okapi system needs databases to be in its own format, in which each record consists of an identical sequence of fields in the form of terminated text strings. Fields are identified by sequence number only. Using the given