Okapi at TREC-2
S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford
In: D. K. Harman (ed.), The Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, National Institute of Standards and Technology.
Terms   Wts    AveP   P5     P30    P100   RP     Rcl
All     Final  0.407  1.000  0.867  0.640  0.457  0.739
Topic   Final  0.362  1.000  0.733  0.620  0.440  0.698
Topic   Orig   0.373  0.600  0.800  0.700  0.433  0.680
Discussion
The main motive for experimenting with this type of query expansion is that it is one way of finding terms which are in some sense closely associated with the query as a whole. It does not fit particularly well with the Robertson/Sparck Jones type of probabilistic theory [5], the validity of which depends on pairwise independence of terms in both relevant and nonrelevant documents. However, it is clear, if only from the results in this paper, that mutual dependence does not necessarily lead to poor results.
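For reference, the Robertson/Sparck Jones relevance weight invoked here is usually written as follows (a standard statement of the model rather than a quotation of this paper's own equations):

    w^{(1)} = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)}

where N is the number of documents in the collection, n the number containing the term, R the number of known relevant documents and r the number of relevant documents containing the term. The independence assumption enters because a document's score is simply the sum of these weights over the terms it shares with the query.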
There are many variables involved. In our rather limited experiments most of the initial feedback searches were done under the conditions of the first row of Table 2, that is with terms from title, concepts, narrative and description (there were a few runs using title and concepts only, but the results for most topics were not good); and with weighting function BM11 and term weights given by equation 6 with large k3 (1000). This gave nearly the best precision at 5 and 30 documents of any of our results. The number of feedback documents was held constant across topics within a run and was varied between 10 and 50. For the final search, terms were always weighted with BM11, but several values of k3 were tried (including zero). Some runs used topic terms only and some used expansion terms as well. There was one run omitting narrative and description terms from the final search, but it was not among the very best and is not reported in the table. The number of terms in the final search was varied from 10 upwards, terms being selected as usual in descending order of term weight × r. Some evaluations were done using frozen ranks, in case the initial searches tended to give better precision at low cutoffs, but this turned out not to be the case.
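As a rough sketch of the selection step just described (the garbled criterion above is read here as relevance weight × r, i.e. the usual Robertson selection value; all names below are ours, not Okapi's):

    import math

    def relevance_weight(n, r, N, R):
        """Robertson/Sparck Jones relevance weight with the usual 0.5 corrections."""
        return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                        ((n - r + 0.5) * (R - r + 0.5)))

    def select_expansion_terms(candidates, N, R, top_n=30):
        """Rank candidate terms drawn from the top feedback documents.

        candidates maps term -> (n, r), where n is the number of documents
        in the collection containing the term and r the number of feedback
        documents containing it.  Terms are ordered by selection value
        (weight * r, our reading of the criterion in the text) and the
        top_n are kept for the expanded query.
        """
        scored = [(relevance_weight(n, r, N, R) * r, term)
                  for term, (n, r) in candidates.items()]
        scored.sort(reverse=True)
        return [term for _, term in scored[:top_n]]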
A few of the results are summarised in Table 6. They include results which appear better than the best otherwise obtained, but the difference is small, and these runs have not yet been repeated on the other topic sets. A qtf weight component is still needed (compare rows 2 and 14 of the table). The number of feedback documents is not critical. Speeding up searching by using only the top 10 or 20 terms is detrimental.
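The role of the k3 / query-term-frequency component can be sketched with the factor used in later BM25-family formulations; the exact form of this paper's equation 6 is not reproduced in this section, so the version below is an assumption:

    def qtf_factor(qtf, k3):
        """Query-term-frequency multiplier of the BM25 family (assumed form).

        k3 = 0 gives a factor of 1 for every query term, i.e. qtf is
        ignored; a large k3 such as the 1000 used for the feedback
        searches makes the factor approach qtf itself, so terms repeated
        in the topic are weighted up roughly in proportion.
        """
        return (k3 + 1) * qtf / (k3 + qtf)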
It is interesting that results do not seem to be greatly affected by the precision of the feedback set. Looking at the individual topics in the run represented by the top row of Table 6, 25 did better than in the feedback run, 18 did worse and the remainder did about the same. Restricting to the 20 topics where the precision at 30 in the feedback set was below 0.5, the corresponding figures are 7, 10 and 3.
6.2 Stemming
A comparison was made on the AP database between the normal Okapi stemming, which removes many suffixes, and a "weak" stemming procedure which only conflates singular and plural forms and removes "ing" endings. For some weighting functions weak stemming increased precision by about 2% and decreased recall by about 1%, but the observed difference is unlikely to be significant.
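A minimal sketch of what such a weak stemmer might look like (illustrative only; the actual Okapi conflation rules are not given here):

    def weak_stem(word):
        """Rough weak stemmer: conflate common plurals and strip "ing".

        Illustrative only -- real singular/plural conflation needs
        exception lists for irregular forms.
        """
        w = word.lower()
        if w.endswith("ing") and len(w) > 5:
            return w[:-3]
        if w.endswith("ies") and len(w) > 4:
            return w[:-3] + "y"
        if w.endswith("es") and len(w) > 3:
            return w[:-2]
        if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
            return w[:-1]
        return w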
6.3 Stoplists
Some runs were done on the AP database to investigate the effect of stoplist size. A small stoplist consisted of the 17 words

a, the, an, at, by, into, on, for, from, to, with, of, and, or, in, not, et

and a large one contained 209 words: articles, conjunctions, prepositions, pronouns and verbs. There was no significant difference in the results of the runs, but the index size was about 25% greater with the small stoplist.
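A sketch of the comparison described above (the 17-word list is quoted from the text; the filtering function is ours):

    SMALL_STOPLIST = {
        "a", "the", "an", "at", "by", "into", "on", "for", "from", "to",
        "with", "of", "and", "or", "in", "not", "et",
    }

    def indexable_terms(tokens, stoplist=SMALL_STOPLIST):
        """Return the tokens that would be posted to the index.

        A larger stoplist removes more postings, which is why the index
        built with only this 17-word list came out roughly 25% bigger
        than the one built with the 209-word list.
        """
        return [t for t in tokens if t.lower() not in stoplist]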
7 Conclusions and prospects
7.1 The new probabilistic models
The most significant result is perhaps the great improvement in the automatic results brought about by the new term weighting models. In the ad hoc runs, with no qtf component, BM15 is 14% better than BM1 on average precision and about 9% better on high precision and recall. The corresponding figures for BM11 are 51% and 34% (Table 3). For the routing runs, where a considerable amount of relevance information had contributed to the term weights, the improvement is less, but still very significant (Table 4). For the manual feedback searches (Table 5) there was a small improvement when they were re-run with BM11 replacing BM15 in the final iteration.
The drawback of these two models is that the theory
says nothing about the estimation of the constants, or
rather parameters, k1 and k2. It may be assumed that
these depend on the database, and probably also on the
nature of the queries and on the amount of relevance
information available. We do not know how sensitive
they are to any of these factors. Estimation cannot be
done without sets of queries and relevance judgments,
and even then, since the models are not linear, they do
not lend themselves to estimation by logistic regression.
The values we used were arrived at by long sequences
of trials mainly using topics 51-100 on the disks 1 and
2 database, with the TREC-1 relevance sets.
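Since the theory gives no estimate for k1 and k2, tuning of the kind described here amounts to a grid search over topics with relevance judgments. A minimal sketch, assuming hypothetical helpers run_search and average_precision (the grid values are illustrative, not the values actually used):

    def tune_constants(topics, qrels, run_search, average_precision,
                       k1_values=(0.5, 1.0, 1.5, 2.0), k2_values=(0.0, 0.5, 1.0)):
        """Sweep (k1, k2) pairs and score each by mean average precision.

        run_search(topic, k1, k2) -> ranked list of document ids   (hypothetical)
        average_precision(ranking, relevant_ids) -> float          (hypothetical)
        qrels maps each topic to its set of relevant document ids.
        """
        best_pair, best_map = None, -1.0
        for k1 in k1_values:
            for k2 in k2_values:
                aps = [average_precision(run_search(t, k1, k2), qrels[t])
                       for t in topics]
                mean_ap = sum(aps) / len(aps)
                if mean_ap > best_map:
                    best_pair, best_map = (k1, k2), mean_ap
        return best_pair, best_map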