SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Okapi at TREC-2
chapter
S. Robertson
S. Walker
S. Jones
M. Hancock-Beaulieu
M. Gatford
National Institute of Standards and Technology
D. K. Harman
problem is also apparent in the regression approach,
although "trying it out" has a somewhat different sense
here (the formula is tried in a regression model, rather
than in a retrieval test).
The discussions of Sections 2.1 and 2.3 exemplify an
approach which may offer some reconciliation of these
ideas. Essentially it is to take a formal model which
provides an exact but intractable formula, and use it to
suggest a much simpler formula. The simpler formula
can then be tried in an ad-hoc fashion, or used in turn in
a regression approach. Although we have not yet taken
this latter step of using regression, we believe that the
present suggestion lends itself to such methods.
2.1 The basic model
The basic probabilistic model is the traditional relevance
weight model [5], under which each term is given a
weight as defined below, and the score (matching value)
for each document is the sum of the weights of the
matching terms:
$$ w = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)} \tag{1} $$
where
N is the number of indexed documents;
n is the number of documents containing the term;
R is the number of known relevant documents;
r is the number of relevant documents containing the term.
This approximates to inverse collection frequency
(ICF) when there is no relevance information. It will
be referred to below (with or without relevance
information) as $w^{(1)}$.
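As a rough illustration of equation (1), the following Python sketch computes the relevance weight and compares the no-relevance-information case (R = r = 0) with a plain log(N/n) form of ICF. The function name, the sample counts, and the use of the natural logarithm are assumptions made for this example; the text does not fix a log base or give these figures.

```python
import math


def relevance_weight(N, n, R=0, r=0):
    """Relevance weight of equation (1), with the 0.5 corrections.

    N: indexed documents; n: documents containing the term;
    R: known relevant documents; r: relevant documents containing the term.
    The natural log is an assumption; the source does not state a base.
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    non_relevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / non_relevant_odds)


# Hypothetical collection statistics, chosen only for illustration.
N, n = 100_000, 100

# With no relevance information the weight is close to log(N / n).
w_no_relevance = relevance_weight(N, n)
icf = math.log(N / n)

# With (made-up) relevance information: 8 of 10 known relevant
# documents contain the term, so the weight rises.
w_with_relevance = relevance_weight(N, n, R=10, r=8)

print(w_no_relevance, icf, w_with_relevance)
```

With R = r = 0 the formula reduces to log((N - n + 0.5)/(n + 0.5)), which is why it tracks log(N/n) closely whenever n is small relative to N.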
2.2 The 2-Poisson model and term frequency
One example of these problems concerns within-document
term frequency (tf). This variable figures in
a number of ad-hoc formulae, and it seems clear that
it can contribute to better retrieval performance. However,
there is no obvious reason why any particular function
of tf should be used in retrieval. There is not much
in the way of formal models which include a tf component;
one which does is the 2-Poisson model [7, 8].
The 2-Poisson model postulates that the distribution
of within-document frequencies of a content-bearing
term is a mixture of two Poisson distributions: one set
of documents (the "elite" set for the particular term,
which may be interpreted to mean those documents
which can be said to be "about" the concept represented
by the term) will exhibit a Poisson distribution of a cer-
tain mean, while the remainder may also contain the
term but much less frequently (a smaller Poisson mean).
Some earlier work in this area [8] attempted to use an
exact formula derived from the model, but had limited
success, probably partly because of the problem of esti-
mating the required quantities. The approach here is to
use the behaviour of the exact formula to suggest a very
much simpler function of tf which behaves in a similar
way.
The exact formula, for an additive weight in the style
of $w^{(1)}$, of a term t which occurs tf times, is
$$ w = \log \frac{\left(p' \lambda^{\mathit{tf}} e^{-\lambda} + (1 - p') \mu^{\mathit{tf}} e^{-\mu}\right)\left(q' e^{-\lambda} + (1 - q') e^{-\mu}\right)}{\left(q' \lambda^{\mathit{tf}} e^{-\lambda} + (1 - q') \mu^{\mathit{tf}} e^{-\mu}\right)\left(p' e^{-\lambda} + (1 - p') e^{-\mu}\right)} \tag{2} $$
where
$\lambda$ is the Poisson mean for tf in the elite set for t;
$\mu$ is the Poisson mean for tf in the non-elite set;
$p'$ is the probability of a document being elite
for t given that it is relevant;
$q'$ is the probability of a document being elite
given that it is non-relevant.
As a function of tf, this can be shown to behave as
follows: it is zero for tf = 0; it increases monotonically
with tf, but at an ever-decreasing rate; it approaches an
asymptotic maximum as tf gets large. The maximum
is approximately the binary independence weight that
would be assigned to an infallible indicator of eliteness.
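The behaviour described above can be checked numerically. The sketch below is one reading of the exact 2-Poisson weight, with the elite and non-elite Poisson means written as `lam` and `mu` and the two eliteness probabilities as `p` and `q`. The parameter values are invented purely for illustration and are not estimates from any collection; the natural log is likewise an assumption.

```python
import math


def two_poisson_weight(tf, lam, mu, p, q):
    """Exact 2-Poisson term weight as a function of tf.

    lam, mu: Poisson means in the elite and non-elite sets (lam > mu);
    p, q: probability of eliteness given relevance / non-relevance.
    """
    # Mixture terms for a document in which the term occurs tf times
    # (the 1/tf! factor cancels between numerator and denominator).
    rel_tf = p * lam**tf * math.exp(-lam) + (1 - p) * mu**tf * math.exp(-mu)
    non_tf = q * lam**tf * math.exp(-lam) + (1 - q) * mu**tf * math.exp(-mu)
    # The same mixture terms for a document in which the term is absent.
    rel_0 = p * math.exp(-lam) + (1 - p) * math.exp(-mu)
    non_0 = q * math.exp(-lam) + (1 - q) * math.exp(-mu)
    return math.log((rel_tf * non_0) / (non_tf * rel_0))


# Hypothetical parameter values, chosen only to make the shape visible.
lam, mu, p, q = 5.0, 0.2, 0.8, 0.1

weights = [two_poisson_weight(tf, lam, mu, p, q) for tf in range(31)]

# Limit as tf grows: the lam**tf terms dominate both mixtures.
rel_0 = p * math.exp(-lam) + (1 - p) * math.exp(-mu)
non_0 = q * math.exp(-lam) + (1 - q) * math.exp(-mu)
asymptote = math.log((p / q) * (non_0 / rel_0))

# Binary independence weight for a perfect indicator of eliteness.
elite_weight = math.log(p * (1 - q) / (q * (1 - p)))

for tf in (0, 1, 2, 5, 10, 30):
    print(tf, round(weights[tf], 4))
print("asymptote", round(asymptote, 4), "elite weight", round(elite_weight, 4))
```

Under these assumed values the weight starts at zero, rises with tf, and levels off slightly below the eliteness weight; the gap shrinks as the elite mean grows relative to the non-elite mean.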
A very simple formula which exhibits similar behaviour
is tf/(tf + constant). This has an asymptotic
limit of unity, so must be multiplied by an appropriate
binary independence weight. The regular binary independence
weight for the presence/absence of the term
may be used for this purpose. Thus the weight becomes
$$ w = \frac{\mathit{tf}}{k_1 + \mathit{tf}} \, w^{(1)} \tag{3} $$
where k1 is an unknown constant.
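A minimal sketch of this simplified weight, assuming the binary independence weight is supplied as a precomputed number `w1`. The function name and the values of `k1` and `w1` below are illustrative only; the argument itself gives no guidance on the constant.

```python
def tf_weight(tf, k1, w1):
    """Simplified term weight: tf / (k1 + tf), scaled by the weight w1."""
    return tf / (k1 + tf) * w1


w1 = 2.0  # hypothetical binary independence weight for some term

# Compare a small and a large constant (both values are arbitrary).
for k1 in (0.1, 1.0, 1000.0):
    row = [round(tf_weight(tf, k1, w1), 4) for tf in (0, 1, 2, 5, 20)]
    print(f"k1={k1}: {row}")
```

A small `k1` pushes the weight to its ceiling `w1` after only one or two occurrences, whereas a large `k1` keeps the weight roughly proportional to tf over the range of values that occur in practice.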
Several points may be made concerning this argu-
ment. It is not by any stretch of the imagination
a strong quantitative argument; one may have many
reservations about the 2-Poisson model itself, and the
transformations sketched above are hardly justifiable in
any formal way. However, it results in a modification of
the binary independence weight which is at least plau-
sible, and has just slightly more justification than plau-
sibility alone.
The constant k1 in the formula is not in any way
determined by the argument. The effect of choice of
constant is to determine the strength of the relationship
between weight and tf: a large constant will make for a
relation close to proportionality (where tf is relatively