NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Okapi at TREC
S. Robertson
S. Walker
M. Hancock-Beaulieu
A. Gull
M. Lau
National Institute of Standards and Technology
Donna K. Harman
information about the makeup and structure of the
source material, together with quite a lot of trial
and error, a program (lex + C) was written to
convert all the raw datasets into a unified 25-field
structure. The only fields common to all input
were "text", document-ID and a "source" field
containing "fr", "doe" etc., so all records consisted
mainly of empty fields. The "source" fields were
intended solely as "limit" criteria, but were
unused except perhaps by one or two of the
human searchers. Fields other than the three
mentioned were used solely for display. Any
records longer than 64K were truncated at 64K.
This truncation only affected the text field (field
25).
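To make the truncation rule concrete, the following minimal C sketch models the unified record and its 64K cut-off. Only the 25-field structure, the three fields common to all input and the 64K limit come from the text above; the struct layout, slot numbers and function name are illustrative assumptions.

```c
#include <string.h>

#define NFIELDS  25
#define MAX_TEXT 65536   /* 64K limit; only the text field is ever cut */

/* Hypothetical sketch of the unified 25-field record.  Per the text
   above, only document-ID, "source" and text are common to all input
   datasets; the slot numbers chosen here are assumptions, except that
   text is field 25. */
struct record {
    char *field[NFIELDS];              /* mostly NULL (empty) fields */
};

enum { F_DOCID = 0, F_SOURCE = 1, F_TEXT = NFIELDS - 1 };

void make_record(struct record *r, char *docid, char *source, char *text)
{
    memset(r->field, 0, sizeof r->field);  /* records are mainly empty */
    r->field[F_DOCID]  = docid;
    r->field[F_SOURCE] = source;           /* "fr", "doe", etc. */
    r->field[F_TEXT]   = text;
    if (strlen(text) > MAX_TEXT)           /* truncate over-long records */
        text[MAX_TEXT] = '\0';
}
```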
Conversion, which included decompression,
conversion to Okapi format using the lex + C
program, and a second-stage conversion to
runtime format, ran at about 10 records/sec on a
SPARC machine.
4.2 Inversion

The text field was reduced to "words", stemmed
using the moderate-strength Porter algorithm
(Porter, 1980) with modifications aimed at
conflating British and American spellings, filtered
through a local database (GSL, see below)
containing stopwords, semi-stopwords, prefixes, a
few "go" phrases (phrases to be treated as words),
and a list of classes of words and phrases to be
treated as synonymous. The document-ID field
was extracted unchanged. Inversion took about 33
hours CPU on a SPARC machine (about 6
documents per second, but this increases more
than linearly with number of documents). The
result was a simple inverted file structure with no
within-field positional information (insufficient
disk space). There were facilities for limiting
searches by source dataset, by various document
length ranges and by odd/even half-collection (for
comparison experiments).

4.3 The local GSL database

The Go-See-List (GSL) for the TREC
experiments was based on existing databases, but
was somewhat extended for TREC. Both original
and extensions were derived in a fairly ad-hoc
fashion (some entries were identified by
examining a list of the most frequent terms in the
first part of the TREC collection). This is not a
sophisticated facility, and can only be said to
scratch the surface of the problem.

Stopwords: 120
Semi-stopwords: 256
These were humanly selected following trial
indexing runs. The criteria were (1) small
retrieval value and (2) high posting count.
Examples: 100, begin, carry, date, december,
enough, include, meanwhile, run, take, why,
without, yesterday.

Prefixes: 18
The purpose of this list is to cause <prefix>-
<word> and <prefix><word> to be treated
identically for any value of <word>.

Go phrases: 27
Examples: cold war, middle class, saudi arabia

Synonym classes: 300, containing about 700 words
Examples:
australia, australian, australasia, australasian
buyout, buy out
mit, massachusetts institute of technology
porn, porno, pornography, pornographic

4.4 Some statistics

                                   First part  Second part      Both
Total documents                        511514       230936    742450
Truncated (over 64K)                      603          531      1134
Size (MB)
  (bibfile only, runtime format)         1107          759      1866
Inversion overheads (%)                    44          N/K        44
Unique index terms
  (excluding document numbers)                                1040415
Mean unique index terms/document                                  103
Postings
  (excluding document numbers)                               95898880
Mean postings/document
  (document "length")                                             132
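The GSL filtering described above (stopwords dropped from the index, synonym-class members conflated to one indexing form) can be sketched as a simple term lookup. The table contents below are the examples quoted in section 4.3; the lookup interface and the choice of a class head are illustrative assumptions, not the actual GSL implementation.

```c
#include <string.h>

/* Minimal sketch of GSL lookup: a stemmed word maps either to NULL
   (stopword, dropped from the index) or to the canonical member of
   its synonym class.  Entries are examples from section 4.3. */
static const char *stopwords[] = { "100", "begin", "carry", "why", NULL };

static const char *synonyms[][2] = {
    { "porno",        "porn" },
    { "pornography",  "porn" },
    { "pornographic", "porn" },
    { "australian",   "australia" },
    { NULL, NULL }
};

const char *gsl_lookup(const char *word)
{
    for (int i = 0; stopwords[i]; i++)
        if (strcmp(word, stopwords[i]) == 0)
            return NULL;                   /* not indexed */
    for (int i = 0; synonyms[i][0]; i++)
        if (strcmp(word, synonyms[i][0]) == 0)
            return synonyms[i][1];         /* conflated to class head */
    return word;                           /* indexed as itself */
}
```

In the real system the word would already have been stemmed before this lookup, so the stored forms would be stems rather than full words.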
5. Experiments
Following the TREC design, results were
submitted for routing queries on the second set of
records, and for ad-hoc queries on the combined
set. Routing queries were processed
automatically only (section 5.2; results table
cityri); ad-hoc queries were processed
automatically (5.1; cityal) and manually, with
feedback on the manual searches (5.3; results
without feedback in table citymi, and with