SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Okapi at TREC (chapter)
S. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, M. Lau
National Institute of Standards and Technology; Donna K. Harman (editor)

information about the makeup and structure of the source material, together with quite a lot of trial and error, a program (lex + C) was written to convert all the raw datasets into a unified 25-field structure. The only fields common to all input were "text", document-ID and a "source" field containing "fr", "doe" etc., so all records consisted mainly of empty fields. The "source" fields were intended solely as "limit" criteria, but were unused except perhaps by one or two of the human searchers. Fields other than the three mentioned were used solely for display. Any records longer than 64K were truncated at 64K; this truncation affected only the text field (field 25). Conversion, which included decompression, conversion to Okapi format using the lex + C program, and a second-stage conversion to runtime format, ran at about 10 records/sec on a SPARC machine.

4.2 Inversion

The text field was reduced to "words", stemmed using the moderate-strength Porter algorithm (Porter, 1980) with modifications aimed at conflating British and American spellings, and filtered through a local database (GSL, see below) containing stopwords, semi-stopwords, prefixes, a few "go" phrases (phrases to be treated as words), and a list of classes of words and phrases to be treated as synonymous. The document-ID field was extracted unchanged. Inversion took about 33 hours CPU on a SPARC machine (about 6 documents per second, but this increases more than linearly with the number of documents). The result was a simple inverted file structure with no within-field positional information (insufficient disk space). There were facilities for limiting searches by source dataset, by various document length ranges and by odd/even half-collection (for comparison experiments).

4.4 Some statistics

                                                  First part  Second part      Both
Total documents                                       511514       230936    742450
Truncated (over 64K)                                     603          531      1134
Size (MB) (bibfile only, runtime format)                1107          759      1866
Inversion overheads (%)                                   44          N/K        44
Unique index terms (excluding document numbers)                              1040415
Mean unique index terms/document                                                 1A3
Postings (excluding document numbers)                                       95898880
Mean postings/document (document "length")                                       132

4.3 The local GSL database

The Go-See-List (GSL) for the TREC experiments was based on existing databases, but was somewhat extended for TREC.

Stopwords: 120
Semi-stopwords: 256
These were humanly selected following trial indexing runs. The criteria were (1) small retrieval value and (2) high posting count. Examples: 100, begin, carry, date, december, enough, include, meanwhile, run, take, why, without, yesterday.

Prefixes: 18
The purpose of this list is to cause <prefix>-<word> and <prefix><word> to be treated identically for any value of <word>.

Go phrases: 27
Examples: cold war, middle class, saudi arabia

Synonym classes: 300, containing about 700 words
Examples:
  australia, australian, australasia, australasian
  buyout, buy out
  mit, massachusetts institute of technology
  porn, porno, pornography, pornographic
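As an illustrative aside, the GSL lookups applied during indexing (stopwords, prefixes, go phrases, and synonym classes) might be sketched as follows. This is not the authors' lex + C code: the stemmer is a stub standing in for the modified Porter stemmer, the prefix entries are hypothetical (the paper gives no examples from its list of 18), and only the example GSL entries quoted in the paper are used.

```python
# Illustrative sketch of GSL-style filtering during inversion.
# List contents beyond the paper's quoted examples are hypothetical.
STOPWORDS = {"100", "begin", "carry", "date", "december", "enough",
             "include", "meanwhile", "run", "take", "why", "without",
             "yesterday"}
PREFIXES = {"non", "anti"}  # hypothetical entries; the actual list had 18
GO_PHRASES = {("cold", "war"), ("middle", "class"), ("saudi", "arabia")}
# Each synonym class is conflated to a single representative index term.
SYNONYMS = {"australian": "australia", "australasia": "australia",
            "australasian": "australia", "porno": "porn",
            "pornography": "porn", "pornographic": "porn"}

def stem(word):
    # Stand-in for the modified (British/American-conflating) Porter stemmer.
    return word.lower().rstrip("s")

def index_terms(tokens):
    """Yield index terms for a token stream, applying GSL rules."""
    i = 0
    while i < len(tokens):
        # Go phrases: multi-word phrases indexed as a single term.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in GO_PHRASES:
            yield tokens[i] + " " + tokens[i + 1]
            i += 2
            continue
        word = tokens[i]
        # <prefix>-<word> is treated identically to <prefix><word>.
        head, sep, tail = word.partition("-")
        if sep and head in PREFIXES:
            word = head + tail
        word = SYNONYMS.get(word, word)
        if word not in STOPWORDS:
            yield stem(word)
        i += 1

print(list(index_terms(["december", "cold", "war",
                        "pornographic", "non-standard"])))
# → ['cold war', 'porn', 'nonstandard']
```

Note that the paper applies stemming before the GSL filter; the sketch folds the two steps together for brevity, and it omits semi-stopwords, whose exact treatment (distinct from full stopwords) is not spelled out in this section.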
Both original and extensions were derived in a fairly ad-hoc fashion (some entries were identified by examining a list of the most frequent terms in the first part of the TREC collection). This is not a sophisticated facility, and can only be said to scratch the surface of the problem.

5. Experiments

Following the TREC design, results were submitted for routing queries on the second set of records, and for ad-hoc queries on the combined set. Routing queries were processed automatically only (section 5.2; results in table cityri); ad-hoc queries were processed both automatically (5.1; cityal) and manually, with feedback on the manual searches (5.3; results without feedback in table citymi, and with