SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Okapi at TREC
chapter
S. Robertson
S. Walker
M. Hancock-Beaulieu
A. Gull
M. Lau
National Institute of Standards and Technology
Donna K. Harman
Walker, S. (1989). The Okapi online catalogue research
projects. In: The online catalogue: developments and
directions, edited by Charles R. Hildreth. Library
Association. 84-106.
Walker S. & De Vere R. (1990). Improving subject
retrieval in online catalogues: 2. Relevance feedback
and query expansion. British Library (British Library
Research Paper 72.) ISBN 0-7123-3219-7
Walker S. & Hancock-Beaulieu M. (1991). Okapi at
City: an evaluation facility for interactive IR. British
Library Research Report 6056.
Appendix: System architecture
Platform
The system runs on Sun hardware. It should port fairly
easily to other UNIX platforms, at least of the BSD type.
All the search and indexing code is in C. Source file
conversion programs and log analysis programs may be
written in awk.
Database structure
A database consists of
text file (bibliographic file)
This is the dataset from which searches retrieve
records.
up to three indexes
Each index consists of primary and secondary
dictionaries and a posting file. There are several
types of index. One contains no positional information
below the level of records, and is suitable for
"phrases" like personal names and titles. Others
contain positional information in the form (field,
sentence, word number) for every occurrence of
every indexed term. An index can contain terms for
up to 16 different types of search.
a set of parameter files:
database description parameters
indexing parameters (one set for each index)
These define how indexing is to be performed in
terms of linguistic knowledge base, stemming
function, procedure for extracting index terms
and the fields and subfields from which they are
to be extracted.
search type (or group) parameters
These are closely related to indexing parameters.
They are used to determine the linguistic
processing to be applied to queries and the
parameters to be used for index lookup and for
the extraction of terms for automatic query
expansion.
search mnemonics (e.g. ABS, AUTH) used in
query parsing
display parameters, defining two levels of display
language knowledge bases
Up to three of these may be associated with a
database to allow linguistic processing to depend on
the type of data being searched or extracted.
Typically, these are common to a number of
databases of similar type and usage.
Input
Source files are stored in a simple format where each record
starts with a field directory giving the length of each field,
followed by the text of the fields. Fields may contain a
limited range of subfield or role markers, indicating the
nature of the following data. There are facilities for
importing a few types of bibliographic files, including
UKMARC and ISO 2709. An "Okapi exchange" format
also exists. Character coding is ASCII with a "shift"
character (`\`) to allow the encoding of characters above hex
7F. No data compression is used.
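The record layout described above (a field directory of lengths followed by the field texts) can be illustrated with a minimal sketch. The byte-level layout here is an assumption made for illustration only, since the paper does not specify it: a 1-byte field count, then one 2-byte big-endian length per field, then the field texts. The function names pack_record and unpack_record are hypothetical, and handling of the "shift" character and of subfield markers is left out.

```python
import struct


def pack_record(fields):
    """Serialize a list of ASCII field strings into one record.

    Hypothetical layout (not taken from the paper): 1-byte field count,
    then one 2-byte big-endian length per field (the field directory),
    then the field texts back to back.
    """
    encoded = [f.encode("ascii") for f in fields]
    lengths = [len(e) for e in encoded]
    directory = struct.pack(f">B{len(encoded)}H", len(encoded), *lengths)
    return directory + b"".join(encoded)


def unpack_record(buf, offset=0):
    """Read one record starting at offset; return (fields, next_offset)."""
    (count,) = struct.unpack_from(">B", buf, offset)
    lengths = struct.unpack_from(f">{count}H", buf, offset + 1)
    pos = offset + 1 + 2 * count  # first byte after the directory
    fields = []
    for n in lengths:
        fields.append(buf[pos:pos + n].decode("ascii"))
        pos += n
    return fields, pos
```

Because the directory gives every field length up front, a reader can skip straight to any field, or to the next record, without scanning the text for delimiters.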
Output (interactive Okapi only)
Output is to character-based terminals or windows, or hard
copy. There are two levels of record display and printout --
brief (one line) and full. Record layout is determined by
parameters and is fairly flexible.
Database maintenance
There are no facilities for editing records or updating
indexes. The source file and indexes must be completely
regenerated when necessary.
Indexing storage overheads
Several types of index are available. Depending on the
nature of the database and the extent of indexing required,
overheads range from about 10% to 120% of the
bibliographic file size.
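One reason overheads vary so widely is that a record-level index stores one entry per distinct (term, record) pair, while a positional index stores one entry per occurrence. The toy sketch below contrasts the two. It is not Okapi's indexing code: the whitespace tokenization, the splitting of sentences on full stops, and the name build_indexes are all assumptions standing in for the parameter-driven term extraction the system actually uses.

```python
from collections import defaultdict


def build_indexes(records):
    """Build a record-level and a positional index over toy records.

    records: list of records, each a list of field strings.
    Returns (record_level, positional) where
      record_level maps term -> set of record numbers, and
      positional maps term -> list of (record, field, sentence, word).
    """
    record_level = defaultdict(set)
    positional = defaultdict(list)
    for rec_no, fields in enumerate(records):
        for field_no, text in enumerate(fields):
            # Naive sentence split on full stops (an assumption).
            sentences = [s for s in text.split(".") if s.strip()]
            for sent_no, sentence in enumerate(sentences):
                for word_no, word in enumerate(sentence.lower().split()):
                    record_level[word].add(rec_no)
                    positional[word].append(
                        (rec_no, field_no, sent_no, word_no)
                    )
    return record_level, positional


def entry_count(index):
    """Total number of posting entries held by an index."""
    return sum(len(entries) for entries in index.values())
```

On text with many repeated terms, entry_count for the positional index grows with every occurrence, whereas the record-level count grows only when a term appears in a new record.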
Performance
Index lookup is fast because each lookup only requires one
disk access. A multi-term search runs in time approximately
proportional to the total number of postings for all the terms
in the query.
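The two performance claims can be illustrated with a small simulation. It is a sketch under stated assumptions, not the system's implementation: the class TwoLevelIndex, the block size, and the match-count scoring are hypothetical. An in-memory list of block-leading terms plays the part of the primary dictionary, and each block stands in for one disk page of the secondary dictionary. Postings are stored inline for brevity, whereas the system described here keeps a separate posting file.

```python
import bisect
from collections import Counter


class TwoLevelIndex:
    """Toy two-level dictionary; each block simulates one disk page."""

    def __init__(self, postings, block_size=2):
        terms = sorted(postings)
        # Simulated secondary dictionary: sorted terms grouped into blocks.
        self.blocks = [
            {t: postings[t] for t in terms[i:i + block_size]}
            for i in range(0, len(terms), block_size)
        ]
        # Simulated primary dictionary: first term of each block, in memory.
        self.primary = [terms[i] for i in range(0, len(terms), block_size)]
        self.block_reads = 0

    def lookup(self, term):
        """Return the posting list for term, reading at most one block."""
        i = bisect.bisect_right(self.primary, term) - 1
        if i < 0:
            return []  # term sorts before every indexed term
        self.block_reads += 1  # one simulated disk access
        return self.blocks[i].get(term, [])


def search(index, query_terms):
    """Score records by number of matching query terms.

    Returns (ranked, touched), where touched is the number of postings
    scanned: the quantity the search time is roughly proportional to.
    """
    scores = Counter()
    touched = 0
    for term in query_terms:
        for rec_no in index.lookup(term):
            scores[rec_no] += 1
            touched += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, touched
```

Each query term costs one block read regardless of dictionary size, and the scoring loop does one unit of work per posting, so the scan cost tracks the total length of the posting lists involved.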