NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Okapi at TREC-2
S. Robertson
S. Walker
S. Jones
M. Hancock-Beaulieu
M. Gatford
National Institute of Standards and Technology
D. K. Harman
small); a small k1 will mean that tf has relatively little
effect on the weight (at least when tf > 0, i.e. when the
term is present).
Our approach has been to try out various values of
k1 (around 1 may be about right for the full disks 1 and
2 database). However, in the longer term we hope to
use regression methods to determine the constant. It
is not, unfortunately, in a form directly susceptible to
the methods of Fuhr or Cooper, but we hope to develop
suitable methods.
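To make the role of k1 concrete, here is a small illustrative sketch (ours, not part of the Okapi system; the function and variable names are invented) of the saturating multiplier tf/(k1 + tf) from the term-weighting formula under discussion (equation 3):

    # Sketch (not Okapi code): the saturating within-document
    # term-frequency multiplier tf / (k1 + tf).
    def tf_multiplier(tf: int, k1: float) -> float:
        return tf / (k1 + tf)

    for k1 in (0.5, 1.0, 2.0):
        print(f"k1={k1}:", [round(tf_multiplier(tf, k1), 3) for tf in range(6)])
    # With a small k1 the multiplier is close to 1 for any tf > 0, so tf
    # has little effect; with a large k1 it grows almost linearly in tf.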
2.3 Document length
The 2-Poisson model in effect assumes that documents
(i.e. records) are all of equal length. Document length
is a variable which figures in a number of weighting formulae.
We may postulate at least two reasons why documents
might vary in length. Some documents may simply cover
more material than others; an extreme version of this
hypothesis would have a long document consisting of a
number of unrelated short documents concatenated together
(the "scope hypothesis"). An opposite view would have long
documents like short documents, but longer: in other words,
a long document covers a similar scope to a short document,
but simply uses more words (the "verbosity hypothesis").
It seems likely that real document collections contain
a mixture of these effects; individual long documents
may be at either extreme or of some hybrid type. All
the discussion below assumes the verbosity hypothesis;
no progress has yet been made with models based on
the scope hypothesis.
The simplest way to deal with this model is to take
the formula above, but normalise tf for document length
(dl). If we assume that the value of k1 is appropriate
to documents of average length (avdl), then this model
can be expressed as

w = \frac{tf}{k_1 \cdot dl/avdl + tf} \, w^{(1)}    (4)
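As a quick numerical illustration (ours, with invented names and an assumed avdl), the normalisation in equation 4 in effect scales k1 by relative document length, so a long document needs proportionally more occurrences of a term to reach the same weight:

    # Sketch (not Okapi code) of the length-normalised weight of equation 4.
    def norm_tf_weight(tf: int, dl: float, avdl: float, w1: float,
                       k1: float = 1.0) -> float:
        return tf / (k1 * dl / avdl + tf) * w1

    w1 = 5.0  # an assumed, precomputed w(1) weight for the term
    for dl in (100.0, 300.0, 900.0):  # avdl assumed to be 300
        print(dl, round(norm_tf_weight(2, dl, 300.0, w1), 3))
    # Under the verbosity hypothesis, tf = 2 in a 900-word document is
    # weaker evidence of eliteness than tf = 2 in a 100-word document.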
A more detailed analysis of the effect on the Poisson
model of the verbosity hypothesis is given in Appendix
7.4. This shows that the appropriate matching value for
a document contains two components. The first component
is a conventional sum of term weights, each term weight
dependent on both tf and dl; the second is a correction
factor dependent on the document length and the number
of terms in the query (nq), though not on which terms
match. A similar argument to the above for tf suggests
the following simple formulation:
\text{correction factor} = k_2 \cdot nq \cdot \frac{avdl - dl}{avdl + dl}    (5)
where k2 is another unknown constant.
Again, k2 is not specified by the model, and must
(at present, at least) be discovered by trial and error.
Values in the range 0.0-0.3 appear about right for the
TREC databases (if natural logarithms are used in the
term-weighting functions¹), with the lower values being
better for equation 4 term weights and the higher values
for equation 3.
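The shape of this correction is easy to see in a small sketch (ours; the names are invented and k2 = 0.2 is simply a value from the range quoted above): it is positive for documents shorter than average, negative for longer ones, and independent of which terms match:

    # Sketch (not Okapi code) of the equation 5 length correction, added
    # to a document's score once, regardless of which query terms match.
    def length_correction(dl: float, avdl: float, nq: int,
                          k2: float = 0.2) -> float:
        return k2 * nq * (avdl - dl) / (avdl + dl)

    for dl in (100.0, 300.0, 900.0):  # avdl assumed to be 300; nq = 10
        print(dl, round(length_correction(dl, 300.0, nq=10), 3))
    # Documents shorter than average get a boost, longer ones a penalty.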
2.4 Query term frequency and query
length
A similar approach may be taken to within-query term
frequency (qtf). In this case we postulate an "elite" set of
queries for a given term: the occurrence of a term in the
query is taken as evidence for the eliteness of the query
for that term. This would suggest a similar multiplier
for the weight:

w = \frac{qtf}{k_3 + qtf} \, w^{(1)}    (6)
In this case, experiments suggest a large value of k3
to be effective: indeed the limiting case, which is equivalent to

w = qtf \cdot w^{(1)}    (7)

appears to be the most effective.
We may combine a formula such as 6 or 7 with a
document term frequency formula such as 3. In practice
this seems to be a useful device, although the theory
requires more work to validate it.
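A minimal sketch of such a combination, on our reading of the text (not the actual Okapi implementation; all names and the example weights are invented): each matching term contributes the equation 4 document component multiplied by the equation 7 query component qtf, and the equation 5 correction is added once per document:

    # Sketch (not Okapi code): equation 4 document component, equation 7
    # query multiplier, and the equation 5 correction added once.
    def score(doc_tf: dict, query_tf: dict, w1: dict,
              dl: float, avdl: float, k1: float = 1.0, k2: float = 0.2) -> float:
        s = 0.0
        for term, qtf in query_tf.items():
            tf = doc_tf.get(term, 0)
            if tf > 0:
                s += tf / (k1 * dl / avdl + tf) * qtf * w1[term]
        nq = len(query_tf)  # number of terms in the query
        return s + k2 * nq * (avdl - dl) / (avdl + dl)

    doc_tf = {"grain": 3, "export": 1}
    query_tf = {"grain": 2, "export": 1, "embargo": 1}
    w1 = {"grain": 4.1, "export": 2.7, "embargo": 6.3}  # assumed w(1) weights
    print(round(score(doc_tf, query_tf, w1, dl=250.0, avdl=300.0), 3))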
2.5 Adjacency
The recent success of weighting schemes involving a
term-proximity component [9] has prompted consider-
ation of including some such component in the Okapi
weighting. Although this does not yet extend to a full
Keen-type weighting, a method allowing for adjacency
of some terms has been developed.
Weighting formulae such as w^{(1)} can in principle be
applied to any identifiable and searchable entity (such
as, for example, a Boolean search expression). An obvious
candidate for such a weight is any identifiable phrase.
However, the problem lies in identifying suitable phrases.
Generally such schemes have been applied only to
predetermined phrases (e.g. those given in a dictionary
and identified in the documents in the course of indexing).
Keen's methods would suggest constructing phrases from all
possible pairs (or perhaps larger sets) of query terms at
search time; however, for queries of the sort of size found
in TREC, that would probably generate far too many phrases.
The approach here has been to take pairs of terms
which are adjacent in the query as candidate phrases.
¹ To obtain weights within a range suitable for storage as 16-bit
integers, the Okapi system uses logarithms to base 2^{0.1}.