SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Okapi at TREC chapter S. Robertson S. Walker M. Hancock-Beaulieu A. Gull M. Lau National Institute of Standards and Technology Donna K. Harman

changes which happened concurrently with, and were necessary for, the TREC work.

Okapi is a family of bibliographic retrieval systems, developed under a series of grants from the British Library. It is suitable for searching files of records whose fields contain textual data of variable length up to a few tens of thousands of characters. It allows the implementation of a variety of search techniques based on the probabilistic retrieval model, with easy-to-use interfaces, on databases of operational size and under operational conditions (Walker, 1989; Walker & De Vere, 1990; Walker & Hancock-Beaulieu, 1991; Hancock-Beaulieu & Walker, 1992). The main purpose of the Okapi installation at City is to allow the use of a variety of evaluation methods, including live-user evaluation in the context of user information-seeking behaviour.

2.1 Search techniques

The interactive Okapi system uses probabilistic "best match" searching, and can handle queries of up to 32 terms. (There is no Boolean search facility in interactive Okapi, but see 3.2 below concerning the development system.) Search terms may be keywords or phrases, or any other record component which has been indexed, and are extracted automatically by very simple "parsing" of an initial natural language query.

Search terms are assigned weights. In the absence of relevance information the weights are based on inverse document frequency; when relevance information is available they are based on the F4 formula given in Robertson & Sparck Jones (1976). The match function is a simple sum-of-weights. There are facilities for "adjusting" the weighting to favour (for example) terms occurring in specified fields. There is also a limited alphabetical browsing facility (of records in index term order).
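As an illustration of the weighting and matching described in this section, the following Python sketch ranks records by a simple sum of term weights, using the point-5 F4 formula stated immediately below. It is not Okapi's code: the function names `f4_weight` and `best_match`, the representation of records as sets of index terms, and the use of the natural logarithm are all assumptions made for the example.

```python
import math


def f4_weight(N, n, R=0, r=0):
    """Point-5 F4 term weight (Robertson & Sparck Jones, 1976).

    N = collection size, n = number of postings of the term,
    R = total known relevant documents, r = number of those posted
    to the term. With R = r = 0 this reduces to the IDF weight
    log((N - n + 0.5) / (n + 0.5)).
    Assumes consistent counts (r <= n, r <= R, n - r <= N - R);
    the natural log is an assumption, the text only says "log".
    """
    numerator = (r + 0.5) * (N - R - n + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)


def best_match(query_terms, records, relevant=frozenset()):
    """Rank records by the sum of the weights of the query terms they contain.

    records  : dict mapping a record id to its set of index terms
               (hypothetical layout chosen for this sketch)
    relevant : ids of records already judged relevant, if any
    Returns (record_id, score) pairs, best first; records matching
    no query term are left out.
    """
    N = len(records)
    R = len(relevant)

    # One weight per distinct query term, from collection statistics.
    weights = {}
    for term in set(query_terms):
        n = sum(1 for terms in records.values() if term in terms)
        r = sum(1 for rec_id in relevant if term in records[rec_id])
        weights[term] = f4_weight(N, n, R, r)

    # Sum-of-weights match function.
    scores = {}
    for rec_id, terms in records.items():
        matched = [t for t in weights if t in terms]
        if matched:
            scores[rec_id] = sum(weights[t] for t in matched)

    # Highest score first; ties broken by record id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    records = {
        "d1": {"probabilistic", "retrieval"},
        "d2": {"retrieval"},
        "d3": {"boolean", "retrieval"},
        "d4": {"library"},
    }
    for rec_id, score in best_match(["probabilistic", "retrieval"], records):
        print(rec_id, round(score, 3))
```

Note that a term posted to more than about half the collection receives a negative weight under this formula, so in the toy collection above the common term "retrieval" lowers a record's score rather than raising it. How Okapi itself treats such terms is not stated in the text.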
The F4 formula, point-5 version, is:

$$ w = \log \frac{(r + 0.5)\,(N - R - n + r + 0.5)}{(R - r + 0.5)\,(n - r + 0.5)} $$

where
N = collection size
n = number of postings of term
R = total known relevant documents
r = number of these posted to the term

The inverse document frequency (IDF) weight is F4 with R = r = 0, i.e.

$$ w = \log \frac{N - n + 0.5}{n + 0.5} $$

2.2 Relevance feedback and query expansion

The system can invite relevance judgments from the user, and following one or more positive relevance assessments it can perform an "expanded" search, using the original query terms together with additional terms extracted automatically from the relevant records. This procedure can be iterated.

2.3 Language processing

Very simple text and linguistic processing is applied during indexing and searching. There are two levels of automatic stemming, and a mainly rule-based procedure for conflating British and American spellings. There are facilities for constructing and using a simple linguistic knowledge base containing "go" phrases, classes of terms to be treated as synonymous, prefixes, stopwords and phrases, and "semi-stopwords" (words and phrases to be treated as relatively unimportant in processing a query).

2.4 Usage

The interactive system is intended for highly interactive use by untrained users.

2.5 Logging

The system can produce detailed logs of both user and system activity, down to keystroke level and sub-second granularity.

2.6 Present use and status

The present use of Okapi is primarily as a tool for the evaluation of highly interactive bibliographic search systems with untrained users. It is also to be used in an investigation of the use of linguistic knowledge structures (e.g. thesauri) in text retrieval systems.

The system is not commercially available. It is not finished, maintained or documented to commercial standards. It is, however, designed for live use, and there has, over the years, been a