SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
UCLA-Okapi at TREC-2: Query Expansion Experiments
chapter
E. Efthimiadis
P. Biron
National Institute of Standards and Technology
D. K. Harman
The wpq algorithm combines the effects of the relevance
weighting theory, as expressed by the Wf4 component,
which assign greater importance to the infrequent terms
with the frequency of occurrence of a term in the relevant
document set.
2.2.2 The emim algorithm
The expected mutual information measure (emim) is a
term weighting model incorporating relevance information
in which it is assumed that index terms may not be dis-
tributed independently of each other. (van Rijsbergen,
1977; Harper and van Rusbergen, 1978; van Rijsbergen,
Harper & Porter, 1981)
The emim weight reduces to the f[OCRerr] weight when the "de-
gree of involvement", i.e. the joint probabilities, are all
unity. Assuming the same definitions for n, N, r, R, as
those already used earlier, the emim weight of a term is
calculated as follows:
Eiq = pilill - P12i12 - p2ii21 + p22i22
= 1og;[OCRerr].r
(n-[OCRerr]N
-log .(n-r)
(N-R)n
(R-r)N
-log .(R-r)
(N-n)R
(N-n-R+r)N
+log (`v-n-R+r)
(N-n)(N-R)
2.2.3 The porie[OCRerr] algorithm
Porter and Galpin (1988) describe a ranking formula used
in the MUSCAT online catalogue:
r n
porler = - N
where r, R, n, N are defined as in the f[OCRerr] weight (eq. 1).
2.2.4 The r[OCRerr]lohi algorithm
The r[OCRerr]lohi algorithm has been proposed by Efthimiadis
(1993a) as the result of the observation of the ranking be-
havior of six algorithms used for ranking terms for query
expansion.
The r[OCRerr]lohi ranking algorithm:
* ranks terms according to r, i.e. their frequency of oc-
currence in the relevant document set, in descending
order and
* resolves ties according to their term frequency, n, from
low-to-high frequency.
It was hypothesized that the r[OCRerr]lohi algorithm would have
an almost identical ranking to porier and a performance ap-
proaching that of wpq and emim. More differences between
the algorithms may occur if the size of the set of relevant
documents (R) gets larger. Conclusions about the algo-
rithm however could not be drawn before it was evaluated
against the other algorithms. The results of that evaluation
are reported in Efthimiadis (in press) where the r[OCRerr]lohi al-
gorithm demonstrated better performance when compared
to the other algorithms.
2.2.5 The r[OCRerr]hilo sort
A variant of the r[OCRerr]lohi algorithm is to rank candidate terms
for query expansion using the r[OCRerr]hilo rank which:
* ranks terms according to r, i.e. their frequency of oc-
currence in the relevant document set, and
* resolves ties according to their term frequency, n, from
high-t[OCRerr]low frequency.
Since the r[OCRerr]hilo algorithm will result in sorting terms in
exactly the opposite way of the r[OCRerr]loh i algorithm it was in-
cluded as a control for the study.
3 Methodology
3.1 Runs
(5) Initial tests were performed in topics 1-50 where the depen-
dent variables were the weighting function and the query
processing of terms. [OCRerr]From the results obtained it was es-
tablished that the function to use will be BM15 and that
the parsing of the Topics would include both single terms
and "phrases" as defined by comma delimited text in the
Topics.
The table below (Table 3.1 gives all the variables used
in constructing the runs. The options available for each
variable are also provided.
Weighting Function: Best match function BM15 (see
equation 2).
Phrases: Choice of YES, NO, or BOTH. This determines
the type of parsing of the "Concepts" and "Title" fields
281