NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Okapi at TREC
S. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, M. Lau
National Institute of Standards and Technology
Donna K. Harman

Walker, S. (1989). The Okapi online catalogue research projects. In: The online catalogue: developments and directions, edited by Charles R. Hildreth. Library Association. 84-106.

Walker, S. & De Vere, R. (1990). Improving subject retrieval in online catalogues: 2. Relevance feedback and query expansion. British Library. (British Library Research Paper 72.) ISBN 0-7123-3219-7.

Walker, S. & Hancock-Beaulieu, M. (1991). Okapi at City: an evaluation facility for interactive IR. British Library Research Report 6056.

Appendix: System architecture

Platform

The system runs on Sun hardware. It should port fairly easily to other UNIX platforms, at least of the BSD type. All the search and indexing code is in C. Source file conversion programs and log analysis programs may be written in awk.

Database structure

A database consists of:

- a text file (bibliographic file): this is the dataset from which searches retrieve records

- up to three indexes

  Each index consists of primary and secondary dictionaries and a posting file. There are several types of index. One contains no positional information below the level of records, and is suitable for "phrases" like personal names and titles. Others contain positional information in the form field, sentence, word number for every occurrence of every indexed term. An index can contain terms for up to 16 different types of search.

- a set of parameter files:

  - database description parameters

  - indexing parameters (one set for each index). These define how indexing is to be performed in terms of linguistic knowledge base, stemming function, procedure for extracting index terms, and the fields and subfields from which they are to be extracted.

  - search type (or group) parameters. These are closely related to indexing parameters. They are used to determine the linguistic processing to be applied to queries and the parameters to be used for index lookup and for the extraction of terms for automatic query expansion.

  - search mnemonics (e.g. ABS, AUTH) used in query parsing

  - display parameters, defining two levels of display

- language knowledge bases. Up to three of these may be associated with a database to allow linguistic processing to depend on the type of data being searched or extracted. Typically, these are common to a number of databases of similar type and usage.

Input

Source files are stored in a simple format where each record starts with a field directory giving the length of each field, followed by the text of the fields. Fields may contain a limited range of subfield or role markers, indicating the nature of the following data. There are facilities for importing a few types of bibliographic files, including UKMARC and ISO 2709. An "Okapi exchange" format also exists.

Character coding is ASCII with a "shift" character (`\`) to allow the encoding of characters above hex 7F. No data compression is used.

Output (interactive Okapi only)

Output is to character-based terminals or windows, or hard copy. There are two levels of record display and printout: brief (one line) and full. Record layout is determined by parameters and is fairly flexible.

Database maintenance

There are no record editing and no index updating facilities. Source file and indexes must be completely regenerated when necessary.

Indexing storage overheads

Several types of index are available. Depending on the nature of the database and the extent of indexing required, overheads range from about 10% to 120% of the bibliographic file size.

Performance

Index lookup is fast because each lookup only requires one disk access. A multi-term search runs in time approximately proportional to the total number of postings for all the terms in the query.
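
The relationship between search time and posting counts can be illustrated with a minimal sketch. It is not Okapi's code (which is written in C), and it makes several assumptions: the posting file is modelled as an in-memory dictionary named POSTINGS, the sample terms and record numbers are invented, and records are scored simply by the number of query terms they contain rather than by any weighting scheme Okapi may apply. The point is only that the scoring loop touches each posting of each query term exactly once.

```python
from collections import defaultdict

# Hypothetical toy posting file: term -> list of record numbers.
# The terms and record numbers are invented for illustration.
POSTINGS = {
    "online": [1, 2, 5, 8],
    "catalogue": [2, 5],
    "retrieval": [2, 3, 5, 7, 8, 9],
}


def search(query_terms, postings):
    """Score records by how many query terms they contain.

    Returns (ranked record numbers, number of postings read).
    The simple term count stands in for a real weighting function.
    """
    scores = defaultdict(int)
    postings_read = 0
    for term in query_terms:
        # One pass over the posting list of each query term.
        for record in postings.get(term, []):
            scores[record] += 1
            postings_read += 1
    # Highest score first; ties broken by record number.
    ranked = sorted(scores, key=lambda r: (-scores[r], r))
    return ranked, postings_read


ranked, work = search(["online", "catalogue", "retrieval"], POSTINGS)
print(ranked)  # records ordered by number of matching terms
print(work)    # 4 + 2 + 6 = 12 postings read
```

In this sketch the accumulation loop does work equal to the sum of the posting-list lengths of the query terms, which is the proportionality described above. The final sort is an extra step over the matched records only and is not part of that count.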