SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
UCLA-Okapi at TREC-2: Query Expansion Experiments
chapter
E. Efthimiadis
P. Biron
National Institute of Standards and Technology
D. K. Harman
of documents that provide terms for query expansion, be-
cause of the noise introduced by the terms taken from the
irrelevant stories.
(b) This last issue relates to a limitation of Okapi. The
version of Okapi used at UCLA retrieves documents at the
record level only. Retrieval at the paragraph level, which
would have facilitated a better handling of some issues like
the above, is not currently available.
2 The weighting functions
The weighting of search terms can be said to involve two
levels:
level 1: A weighting function is used to weight the terms
for the initial query as well as the terms for subsequent
search iterations of the same query or some modified
version of the query.
level 2: A weighting function is used for the weighting of
candidate terms for query expansion.
Sections 2.1 and 2.2 discuss functions used in level 1 and
level 2 respectively.
2.1 Search term weighting
The theory of relevance weights (Robertson & Sparck
Jones, 1976) provides the basic probabilistic model. The
binary independence or relevance weight model assigns a
weight to each term and the matching function for each
document is given by the `simple sum-of-weights' over all
of the terms in the query.
The weight of a term is calculated by the following function,
which is also known as the F4 point-5 formula:
\[ w_{F4} = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)} \]
where,
N is the total number of documents in the collec-
tion;
R is the number of documents in the sample of relevant
documents as defined by the user's feedback;
n is the number of documents indexed by term t;
r is the number of relevant documents (from the
sample R) assigned to term t.
When relevance information is not available the above
weight reduces to approximately the inverse document fre-
quency (IDF).
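
The reduction to an IDF-like weight can be checked numerically. The
following is a minimal sketch, not the Okapi implementation: the
function names are hypothetical, and the natural logarithm is assumed
because the text does not state the base of the log.

```python
import math


def f4_weight(N, n, R=0, r=0):
    """F4 point-5 relevance weight for a single term.

    N: documents in the collection; n: documents indexed by the term;
    R: relevant documents in the feedback sample; r: relevant documents
    in that sample that contain the term.
    Assumption: natural log (the base is not given in the text).
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


def idf_like(N, n):
    """What the F4 weight becomes when R = r = 0 (no relevance information)."""
    return math.log((N - n + 0.5) / (n + 0.5))


if __name__ == "__main__":
    # Without feedback the two functions agree.
    print(f4_weight(1000, 10), idf_like(1000, 10))
    # With feedback (4 of 5 relevant documents contain the term) the weight rises.
    print(f4_weight(1000, 10, R=5, r=4))
```

With R = r = 0 the two 0.5 factors cancel, leaving
log((N - n + 0.5) / (n + 0.5)), which is the IDF-like form referred to
above.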
The total weight of a document was calculated with the
following function, which is based on the binary independence
model and takes into account the 2-Poisson model for
within-document frequency (tf) and the document length. These
are described in detail in Robertson et al (1993b). The purpose
of the UCLA Okapi system was to evaluate the existing Okapi
models, and it therefore did not allow for modifications of the
existing functions. For compatibility and for comparison
purposes it was decided to use the BM15 (best match) function
for the runs. The BM15 best match weighting function is:
\[ \mathrm{docweight}_{bm15} = \sum \left( \frac{tf}{k_1 + tf} \times w_{F4} \right) + k_2 \times n_q \times \frac{avedl - dl}{avedl + dl} \]
where k1 and k2 are unknown constants. In the UCLA-
Okapi implementation the values for these constants are:
k1 = 1 and k2 = 1.
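
As an illustration only, the BM15 document weight might be computed as
sketched below. The sketch assumes the function has the form
sum over query terms of (tf / (k1 + tf)) x w_F4, plus
k2 x nq x (avedl - dl) / (avedl + dl), where nq is taken to be the
number of query terms, dl the document length and avedl the average
document length. These readings, and the function name, are
assumptions and are not confirmed by the text.

```python
def bm15_doc_weight(query_terms, dl, avedl, k1=1.0, k2=1.0):
    """Sketch of a BM15-style document weight.

    query_terms: one (tf, w) pair per query term, where tf is the
        within-document frequency of the term in this document (0 if the
        term is absent) and w is the term's relevance weight.
    dl, avedl: document length and average document length (assumed
        meanings of the symbols in the formula).
    k1, k2: constants; both default to 1, the values reported for the
        UCLA-Okapi implementation.
    """
    # Term-frequency component: each weight is scaled by tf / (k1 + tf).
    term_part = sum(tf / (k1 + tf) * w for tf, w in query_terms)
    # Length correction: positive for short documents, negative for long ones.
    nq = len(query_terms)
    length_part = k2 * nq * (avedl - dl) / (avedl + dl)
    return term_part + length_part


if __name__ == "__main__":
    terms = [(2, 3.0), (0, 5.0)]  # second query term does not occur in the document
    print(bm15_doc_weight(terms, dl=100, avedl=100))
    print(bm15_doc_weight(terms, dl=50, avedl=150))
```

A document of exactly average length receives no length correction, so
its weight is the scaled sum of term weights alone.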
2.2 Query expansion term weighting
The ranking algorithms that were considered for the ranking
of terms for query expansion were: wpq, emim, porter,
r-lohi and r-hilo. These algorithms are described briefly
below.
2.2.1 The wpq algorithm
This algorithm is based on an independence assumption
that holds between a query expansion term and the terms in
the entire previous search formulation (Robertson, 1990).
According to the relevance weighting theory, the inclusion
of term t in the search formulation with weight w_t will
increase the effectiveness of retrieval by
\[ wpq = w_t (p_t - q_t) \qquad (3) \]
where w_t is a weighting function, which in this case is
w_{F4}; p_t is the probability of term t occurring in a relevant
document; and q_t is the probability of term t occurring
in a non-relevant document.
This means that, irrespective of the weighting function
(w_t) used, the rule for deciding the inclusion of a term in a
query expansion search should be based on the ranking of
wpq instead of w_t alone. Substituting the F4 weight for w_t
and estimating the probabilities from r, R, n and N as
p_t = r/R and q_t = (n - r)/(N - R), we get:
\[ wpq = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)} \cdot \left( \frac{r}{R} - \frac{n - r}{N - R} \right) \]
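
A minimal sketch of ranking candidate expansion terms by wpq follows.
It assumes the F4 point-5 weight with a natural logarithm and the
estimates p_t = r/R and q_t = (n - r)/(N - R). The function names and
the example counts are hypothetical and serve only to illustrate the
calculation.

```python
import math


def wpq(N, n, R, r):
    """wpq score of one candidate term: F4 weight times (p_t - q_t).

    Requires R > 0 and N > R so that both probability estimates are defined.
    """
    w = math.log(((r + 0.5) * (N - n - R + r + 0.5))
                 / ((n - r + 0.5) * (R - r + 0.5)))
    p = r / R              # estimated P(term occurs | relevant)
    q = (n - r) / (N - R)  # estimated P(term occurs | non-relevant)
    return w * (p - q)


def rank_expansion_terms(candidates, N, R):
    """Return candidate terms sorted by decreasing wpq.

    candidates maps each term to a pair (n, r).
    """
    return sorted(candidates,
                  key=lambda term: wpq(N, candidates[term][0], R, candidates[term][1]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical counts: collection of 1000 documents, 5 judged relevant.
    candidates = {"a": (10, 4), "b": (500, 4), "c": (10, 1)}
    print(rank_expansion_terms(candidates, N=1000, R=5))
```

In this example term "b" occurs in as many relevant documents as term
"a" but is far more common in the collection, so both its weight and
its (p_t - q_t) factor are smaller and it falls to the bottom of the
ranking.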