SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
UCLA-Okapi at TREC-2: Query Expansion Experiments
chapter
E. Efthimiadis
P. Biron
National Institute of Standards and Technology
D. K. Harman
UCLA-Okapi at TREC-2: Query Expansion Experiments
Efthimis N. Efthimiadis* and Paul V. Biron
Graduate School of Library and Information Science
University of California at Los Angeles
1 Introduction
This is the first participation of the Graduate School of
Library and Information Science, University of California at
Los Angeles, in the TREC Conference. For TREC-2, Category B,
UCLA used a version of the Okapi text retrieval system that
was made available to UCLA by City University, London, UK.
Okapi has been described in TREC-1 (Robertson, Walker,
Hancock-Beaulieu, Gull & Lau, 1993a) as well as in this
conference (Robertson, Walker, Jones, Hancock-Beaulieu, &
Gatford, 1994). Okapi is a simple set-oriented system based
on a generalized probabilistic model with facilities for
relevance feedback. In addition, Okapi supports a full range
of deterministic Boolean and quasi-Boolean operations.
1.1 Objectives
The main research objective of the UCLA participation
in TREC-2 was to investigate query expansion within the
framework provided by Okapi. More specifically, the
objectives were to:
* use an enhanced version of the Go-See-List (GSL) and
evaluate its effect on retrieval performance.
* investigate the performance of query expansion with
and without relevance information by varying the
number of documents that are treated as relevant and
the number of terms that are included in the expan-
sion.
* compare the performance of different ranking algorithms
for the ranking of terms for term selection during
query expansion.
* compare the retrieval effectiveness of user-assigned
relevance judgements against hypothetically assumed
relevance judgements based on the top X documents.
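As a rough illustration of the second and third objectives, the sketch below treats the top X documents of an initial ranking as if they were relevant, scores the terms they contain with the Robertson/Sparck Jones relevance weight, and keeps the highest-scoring terms as expansion terms. The function names, the toy collection, and the choice of this particular weight are assumptions made for illustration only; this is not the Okapi implementation, and the weight is not necessarily one of the ranking algorithms compared in the experiments.

```python
import math
from collections import Counter

def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight with the usual 0.5 correction.

    r: documents treated as relevant that contain the term
    n: documents in the collection that contain the term
    R: documents treated as relevant
    N: documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def expansion_terms(ranked_doc_ids, docs, top_x, num_terms):
    """Assume the top_x ranked documents are relevant; return the best terms."""
    N = len(docs)
    assumed_relevant = ranked_doc_ids[:top_x]
    R = len(assumed_relevant)
    # Document frequency of each term in the whole collection.
    n = Counter(t for terms in docs.values() for t in set(terms))
    # Document frequency of each term in the assumed-relevant set.
    r = Counter(t for d in assumed_relevant for t in set(docs[d]))
    scored = [(rsj_weight(r[t], n[t], R, N), t) for t in r]
    # Highest weight first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:num_terms]]

# Hypothetical toy collection: document id -> list of index terms.
docs = {
    "d1": ["oil", "price", "opec"],
    "d2": ["oil", "opec", "quota"],
    "d3": ["stock", "price", "market"],
    "d4": ["stock", "market", "bond"],
    "d5": ["bond", "yield", "market"],
}
ranking = ["d1", "d2", "d3", "d4", "d5"]
print(expansion_terms(ranking, docs, top_x=2, num_terms=2))
```

Varying `top_x` and `num_terms` in such a sketch corresponds to varying the number of documents treated as relevant and the number of terms included in the expansion.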
*To whom all correspondence should be addressed. Graduate
School of Library and Information Science, University of
California at Los Angeles, 405 Hilgard Avenue, Los Angeles,
CA 90024-1520, e-mail: iacxene@mvs.oac.ucla.edu
1.2 The Okapi version at UCLA and the
WSJ database
The Okapi system consists of a low-level search engine or
basic search system (BSS), a user interface for the manual
search experiments, and data conversion and inversion
utilities.
The UCLA hardware consisted of a Sun SPARC-2 machine
with 32 MB of memory and 1 GB of disk storage.
The Wall Street Journal (WSJ) database was used for
both the routing and ad-hoc searches. Because of the lack
of adequate disk space on the UCLA machine, the database
was indexed at City University by Stephen Walker and
then transferred (FTP-ed) to UCLA.
For TREC-2 the Okapi databases were built by indexing
mainly the DOCNO and TEXT fields of the records.
Inverted indexes included complete within-document
positional information, enabling term frequency and term
proximity to be used. Okapi's typical index size overhead is
around 80% of the textfile size. The elapsed time for
inversion of the WSJ database was about 12 hours.
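To illustrate why complete within-document positional information makes both term frequency and term proximity available, the following minimal sketch builds a positional inverted index over whitespace-separated tokens. It is not the Okapi BSS index format; the function names, the tokenization, and the example document are assumptions made for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def term_frequency(index, term, doc_id):
    """Within-document frequency is the number of stored positions."""
    return len(index.get(term, {}).get(doc_id, []))

def within_distance(index, term_a, term_b, doc_id, max_gap):
    """True if term_a and term_b occur within max_gap positions of each other."""
    pos_a = index.get(term_a, {}).get(doc_id, [])
    pos_b = index.get(term_b, {}).get(doc_id, [])
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

# Hypothetical one-document collection.
docs = {"doc-1": "the oil price rose as the oil cartel cut output"}
index = build_positional_index(docs)
print(term_frequency(index, "oil", "doc-1"))
print(within_distance(index, "oil", "cartel", "doc-1", 1))
```

Storing every position costs more space than storing only document identifiers, which is in keeping with a sizeable index overhead relative to the text file.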
At this point it is worth noting (a) the nature of the
WSJ records, and (b) a limitation of Okapi due to
indexing.
(a) The WSJ records consist of documents that do not
have the same kind of structure found in bibliographic
databases, such as INSPEC or ERIC. The records contain
the full text of stories and vary in length; most are longer
than the average abstract in a bibliographic database. In
addition, the language and the style are mostly
`journalistic' as opposed to `scientific', i.e. less structured.
One important issue is that some WSJ records contain several
short stories that are completely unrelated to one another.
This type of record is usually a compilation of a number of
one- or two-paragraph news stories. The stories share no
content relation; their only common feature is their
co-existence in the same record. This has implications for
retrieval effectiveness, especially when such records are
included in the pool