SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Okapi at TREC-2
chapter
S. Robertson
S. Walker
S. Jones
M. Hancock-Beaulieu
M. Gatford
National Institute of Standards and Technology
D. K. Harman
[6] Cooper, W. et al. Probabilistic retrieval in the TIPSTER collection: an application of staged logistic regression. In: [1] (pp. 73-88).
[7] Harter, S.P. A probabilistic approach to automatic keyword indexing. Journal of the American Society for Information Science, 26, 197-206 and 280-289.
[8] Robertson, S.E., Van Rijsbergen, C.J. & Porter, M.F. Probabilistic models of indexing and searching. In Oddy, R.N. et al. (Eds.), Information Retrieval Research (pp. 35-56). London: Butterworths, 1981.
[9] Keen, E.M. The use of term position devices in ranked output experiments. Journal of Documentation, 47, 1991, 1-22.
[10] Harman, D. Relevance feedback revisited. In: SIGIR 92. Proceedings of the 15th International Conference on Research and Development in Information Retrieval (pp. 280-289). ACM Press, 1992.
[11] Robertson, S.E. On Term Selection for Query Expansion. Journal of Documentation, 46, 1990, 359-364.
A 2-Poisson model with document length component
Basic ideas
The basic weighting function used is that developed in [8], and may be expressed as follows:

$$w(\mathbf{x}) = \log \frac{P(\mathbf{x} \mid R)\, P(\mathbf{0} \mid \bar{R})}{P(\mathbf{x} \mid \bar{R})\, P(\mathbf{0} \mid R)} \tag{8}$$

where

$\mathbf{x}$ is a vector of information about the document;
$\mathbf{0}$ is a reference vector representing a zero-weighted document;
$R$ and $\bar{R}$ are relevance and non-relevance respectively.
For example, each component of $\mathbf{x}$ may represent the presence/absence of a query term in the document (or, as in the case of formula 2 in the main text, its document frequency); $\mathbf{0}$ would then be the "natural" zero vector representing all query terms absent. In this formulation, independence assumptions lead to the decomposition of $w$ into additive components such as individual term weights.
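As a worked illustration of this decomposition, assume that the components of the document vector are conditionally independent given relevance and given non-relevance. The subscripted symbols $x_i$ and $0_i$ (the $i$-th components of the document vector and of the reference vector) are notation introduced here for the sketch and do not appear in the source.

```latex
% Assumption: the components x_i are conditionally independent
% given R and given \bar{R}.
\begin{align*}
P(\mathbf{x} \mid R) &= \prod_i P(x_i \mid R), \qquad
P(\mathbf{x} \mid \bar{R}) = \prod_i P(x_i \mid \bar{R}) \\[4pt]
w(\mathbf{x})
  &= \log \frac{\prod_i P(x_i \mid R)\, P(0_i \mid \bar{R})}
               {\prod_i P(x_i \mid \bar{R})\, P(0_i \mid R)}
   = \sum_i \log \frac{P(x_i \mid R)\, P(0_i \mid \bar{R})}
                      {P(x_i \mid \bar{R})\, P(0_i \mid R)}
\end{align*}
```

Each summand is the weight of one component, and any component equal to its reference value ($x_i = 0_i$) contributes zero, which is why absent query terms carry no weight in the presence/absence case.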
Document length may be added as a component of $\mathbf{x}$; however, document length does not so obviously have a "natural" zero (an actual document of zero length is a pathological case). Instead, we may use the average length of a document for reference; thus we would expect to get a formula in which the document length component disappears for a document of average length, but not for other lengths.
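Suppose, then, that the average length of a document is $\Delta$. The weighting formula becomes:

$$w(\mathbf{x}, d) = \log \frac{P((\mathbf{x}, d) \mid R)\, P((\mathbf{0}, \Delta) \mid \bar{R})}{P((\mathbf{x}, d) \mid \bar{R})\, P((\mathbf{0}, \Delta) \mid R)}$$

where $d$ is document length, and $\mathbf{x}$ represents all other information about the document. This may be decomposed as follows:

$$w(\mathbf{x}, d) = w(\mathbf{x}, d)_1 + w(\mathbf{x}, d)_2 \tag{9}$$

where

$$w(\mathbf{x}, d)_1 = \log \frac{P(\mathbf{x} \mid (R, d))\, P(\mathbf{0} \mid (\bar{R}, d))}{P(\mathbf{x} \mid (\bar{R}, d))\, P(\mathbf{0} \mid (R, d))}$$

and

$$w(\mathbf{x}, d)_2 = \log \frac{P((\mathbf{0}, d) \mid R)\, P((\mathbf{0}, \Delta) \mid \bar{R})}{P((\mathbf{0}, d) \mid \bar{R})\, P((\mathbf{0}, \Delta) \mid R)}$$

These two components are discussed further below.
Hypotheses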
As indicated in the main text, one may imagine different reasons why documents should vary in length. The two hypotheses given there (the "Scope" and "Verbosity" hypotheses) may be regarded as opposite poles of explanation. The arguments below are based on the Verbosity hypothesis only.
The Verbosity hypothesis would imply that document properties such as relevance and eliteness can be regarded as independent of document length; given eliteness for a term, however, the number of occurrences of that term would depend on document length. In particular, if we assume that the two Poisson parameters for a given term, $\lambda$ and $\mu$, are appropriate for documents of average length $\Delta$, then the number of occurrences of the term in documents of length $d$ will be 2-Poisson with means $\lambda d/\Delta$ and $\mu d/\Delta$.
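The length-scaled 2-Poisson distribution described above can be sketched in Python as follows. This is a minimal illustration, not the Okapi implementation: the function names, the mixing probability `p_elite` (the probability that a document is elite for the term), and all parameter values are assumptions made for the example. The two Poisson means for an average-length document are simply multiplied by the ratio of document length to average length.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k occurrences under a Poisson with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def two_poisson_pmf(tf, p_elite, lam, mu, doc_len, avg_len):
    """Probability of tf occurrences of a term in a document of length doc_len.

    lam and mu are the Poisson means for elite and non-elite documents of
    average length; under the Verbosity hypothesis both are scaled by
    doc_len / avg_len.
    """
    scale = doc_len / avg_len
    return (p_elite * poisson_pmf(tf, lam * scale)
            + (1.0 - p_elite) * poisson_pmf(tf, mu * scale))


# Hypothetical values, chosen only for illustration.
P_ELITE, LAM, MU, AVG_LEN = 0.3, 4.0, 0.5, 100.0

for doc_len in (50.0, 100.0, 200.0):
    # Truncating the sum at 100 occurrences is ample for these means.
    expected_tf = sum(k * two_poisson_pmf(k, P_ELITE, LAM, MU, doc_len, AVG_LEN)
                      for k in range(100))
    print(f"length {doc_len:>5.0f}: expected term frequency {expected_tf:.3f}")
```

Under this scaling the expected number of occurrences grows in proportion to document length, while the probability of eliteness itself is left unchanged, which is the content of the Verbosity hypothesis.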
Second component
The second component of equation 9 is

$$w(\mathbf{x}, d)_2 = \log \frac{P(\mathbf{0} \mid (R, d))\, P(\mathbf{0} \mid (\bar{R}, \Delta))}{P(\mathbf{0} \mid (\bar{R}, d))\, P(\mathbf{0} \mid (R, \Delta))} + \log \frac{P(d \mid R)\, P(\Delta \mid \bar{R})}{P(d \mid \bar{R})\, P(\Delta \mid R)}$$

Under the Verbosity hypothesis, the second part of this formula is zero. Making the usual term-independence assumptions, the first part may be decomposed into a sum of components for each query term, thus:

$$w(t, d)_2 = \log \frac{\left(p' e^{-\lambda d/\Delta} + (1 - p')\, e^{-\mu d/\Delta}\right)\left(q' e^{-\lambda} + (1 - q')\, e^{-\mu}\right)}{\left(q' e^{-\lambda d/\Delta} + (1 - q')\, e^{-\mu d/\Delta}\right)\left(p' e^{-\lambda} + (1 - p')\, e^{-\mu}\right)} \tag{10}$$

where $t$ is a query term, $\Delta$ is the average document length, and $p'$, $q'$, $\lambda$ and $\mu$ are as in formula 2. Note that there is a component for each query term, whether or not the term is in the document.
For almost all normal query terms (i.e. for any terms that are not actually detrimental to the query), we can assume that $p' > q'$ and $\lambda > \mu$. In this case, formula 10 can be shown to be monotonic decreasing with $d$, from a maximum as $d \to 0$, through zero when $d = \Delta$, and to a minimum as $d \to \infty$. As indicated, there is one such factor for each of the $n_q$ query terms.
Once again, we can devise a very much simpler function which approximates to this behaviour; this is the justification for formula 5 in the main text.
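The behaviour of formula 10 can be checked numerically with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the function names and the values chosen for the two eliteness probabilities, the two Poisson means and the average document length are all hypothetical.

```python
import math


def prob_absent(p_elite, lam, mu, scale):
    """Probability that a term does not occur, under a 2-Poisson model whose
    means lam and mu are multiplied by scale (document length / average length)."""
    return p_elite * math.exp(-lam * scale) + (1.0 - p_elite) * math.exp(-mu * scale)


def length_component(doc_len, avg_len, p, q, lam, mu):
    """Per-term document length component (formula 10).

    p and q are the probabilities of eliteness given relevance and
    non-relevance; lam and mu are the elite and non-elite Poisson means
    for a document of average length.
    """
    scale = doc_len / avg_len
    numerator = prob_absent(p, lam, mu, scale) * prob_absent(q, lam, mu, 1.0)
    denominator = prob_absent(q, lam, mu, scale) * prob_absent(p, lam, mu, 1.0)
    return math.log(numerator / denominator)


# Hypothetical values with p > q and lam > mu, as for a normal query term.
P, Q, LAM, MU, AVG_LEN = 0.6, 0.1, 4.0, 0.5, 100.0

for doc_len in (10.0, 50.0, 100.0, 200.0, 400.0):
    value = length_component(doc_len, AVG_LEN, P, Q, LAM, MU)
    print(f"d = {doc_len:>5.0f}: length component = {value:+.4f}")
```

For these values the component is positive for documents shorter than average, zero at the average length, and negative for longer documents, decreasing throughout. Its two limits are the logarithm of the ratio of the two average-length absence probabilities (as the length tends to zero) and that quantity plus log((1 - p')/(1 - q')) (as the length grows without bound).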