NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, National Institute of Standards and Technology

UCLA-Okapi at TREC-2: Query Expansion Experiments

Efthimis N. Efthimiadis* and Paul V. Biron
Graduate School of Library and Information Science
University of California at Los Angeles

1 Introduction

This is the first participation of the Graduate School of Library and Information Science, University of California at Los Angeles, in the TREC Conference. For TREC-2, Category B, UCLA used a version of the Okapi text retrieval system that was made available to UCLA by City University, London, UK. Okapi has been described in TREC-1 (Robertson, Walker, Hancock-Beaulieu, Gull & Lau, 1993a) as well as in this conference (Robertson, Walker, Jones, Hancock-Beaulieu, & Gatford, 1994). Okapi is a simple set-oriented system based on a generalized probabilistic model with facilities for relevance feedback. In addition, Okapi supports a full range of deterministic Boolean and quasi-Boolean operations.

1.1 Objectives

The main research objective of the UCLA participation in TREC-2 was to investigate query expansion within the framework provided by Okapi. More specifically, the objectives were to:

* use an enhanced version of the Go-See-List (GSL) and evaluate its effect on retrieval performance.
* investigate the performance of query expansion with and without relevance information by varying the number of documents that are treated as relevant and the number of terms that are included in the expansion.
* compare the performance of different ranking algorithms for the ranking of terms for term selection during query expansion.
* compare the effectiveness in retrieval of user-assigned relevance judgements against hypothetically assumed relevance judgements based on the top X documents.
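The two experimental variables named in the objectives, the number of top-ranked documents treated as relevant and the number of terms added to the query, can be illustrated with a minimal sketch. This is not Okapi code: the function `expand_query`, its parameters, and the term-ranking score `r * log(N / n)` are all assumptions made for illustration, standing in for whichever ranking algorithm is being compared.

```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, doc_freq, num_docs,
                 top_x=5, num_terms=3):
    """Append up to num_terms new terms taken from the top_x documents.

    ranked_docs: list of token lists, best-ranked document first.
    doc_freq:    term -> number of collection documents containing it.
    num_docs:    total number of documents in the collection.
    """
    # Treat the top X documents as if they had been judged relevant.
    assumed_relevant = ranked_docs[:top_x]

    # r[t] = number of assumed-relevant documents that contain term t.
    r = Counter()
    for doc in assumed_relevant:
        r.update(set(doc))

    # Hypothetical placeholder score (an assumption, not the paper's
    # formula): favour terms frequent in the assumed-relevant set and
    # rare in the collection as a whole.
    def score(term):
        return r[term] * math.log(num_docs / doc_freq[term])

    candidates = [t for t in r if t not in query_terms]
    # Sort by descending score; break ties alphabetically.
    candidates.sort(key=lambda t: (-score(t), t))
    return list(query_terms) + candidates[:num_terms]


# Toy data, invented for this example only.
ranked_docs = [
    ["oil", "price", "opec"],
    ["oil", "opec", "cartel"],
    ["oil", "tax"],
]
doc_freq = {"oil": 50, "price": 40, "opec": 5, "cartel": 2, "tax": 30}

expanded = expand_query(["oil"], ranked_docs, doc_freq, num_docs=100,
                        top_x=2, num_terms=2)
print(expanded)
```

Varying `top_x` and `num_terms` corresponds to the second objective, and swapping the body of `score` for another weighting corresponds to the third. Replacing `ranked_docs[:top_x]` with documents a user has actually judged relevant gives the other side of the comparison in the fourth objective.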
*To whom all correspondence should be addressed. Graduate School of Library and Information Science, University of California at Los Angeles, 405 Hilgard Avenue, Los Angeles, CA 90024-1520, e-mail: iacxene@mvs.oac.ucla.edu

1.2 The Okapi version at UCLA and the WSJ database

The Okapi system consists of a low-level search engine or basic search system (BSS), a user interface for the manual search experiments, and data conversion and inversion utilities. The UCLA hardware consisted of a Sun SPARC-2 machine with 32 MB of memory and 1 GB of disk storage. The Wall Street Journal (WSJ) database was used for both the routing and ad-hoc searches. Because of the lack of adequate disk space on the UCLA machine, the database was indexed at City University by Stephen Walker and then transferred (FTP-ed) to UCLA.

For TREC-2 the Okapi databases were built by indexing mainly the DOCNO and TEXT fields of the records. Inverted indexes included complete within-document positional information, enabling term frequency and term proximity to be used. Okapi's typical index size overhead is around 80% of the textfile size. The elapsed time for inversion of the WSJ database was about 12 hours.

At this point it is worth noting (a) the nature of the WSJ records, and (b) a limitation of Okapi due to indexing.

(a) The WSJ records consist of documents that do not have the same kind of structure found in bibliographic databases, such as INSPEC or ERIC. The records contain the full text of stories and vary in length, most being longer than an average abstract in a bibliographic database. In addition, the language and the style are mostly `journalistic' as opposed to `scientific', i.e. less structured. One important issue is that some WSJ records contain short multi-story articles which are completely unrelated to one another. This type of record is usually a compilation of a number of one- or two-paragraph long news stories.
The stories have no content relation to one another; their only common feature is their co-existence in the same record. This has implications for retrieval effectiveness, especially when such records are included in the pool