NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Okapi at TREC
chapter
S. Robertson
S. Walker
M. Hancock-Beaulieu
A. Gull
M. Lau
National Institute of Standards and Technology
Donna K. Harman
The results for the manual ad-hoc queries are
given in the official tables as citym1 (without
feedback) and citym2 (with feedback). For a
discussion of the results, and of the evaluation
method for citym2, see section 7.
6. Some observations on the experiments
6.1 Local relevance judgements
We experimented with making our own relevance
judgements, based on the topics as provided.
Although these experiments were on a very small
scale and not very systematic, our impression was
that it was usually possible to reproduce the
judgements provided centrally, with a high chance
of agreement. If this is so, it presumably reflects
(a) the relatively highly specified nature of the
topics (as compared to most IR queries!), and (b)
the fact that the centrally-provided judgements are
being made by experts other than the original
requester. Thus we felt justified in attempting to
improve our routing queries by providing some
more relevance judgements of our own,
particularly in cases where there were few
centrally-provided ones. Note that the relevance
weighting method used (F4 formula in section
2.2) takes account only of positive relevance
judgements; items judged non-relevant are
combined with items not judged (the complement
method: Harper and van Rijsbergen, 1978).
However, as indicated in section 5.3, there were
topics for which (under strict relevance criteria)
the relevant documents were very sparse, and
relevance feedback would not have had much
effect. In these cases, for the manual searches
only, searchers were encouraged to make more
generous relevance judgements (i.e. to accept as
relevant some documents that did not meet all the
criteria precisely). The argument behind this
guideline was that relevance feedback should
work better given some partially-relevant items
than with few or no relevant items. This
argument obviously requires testing.
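As a sketch of the weighting described above (assuming the standard Robertson/Sparck Jones point-5 form of the F4 formula; the paper's exact variant is given in section 2.2 and is not reproduced here), the complement method can be written as:

```python
import math

def f4_weight(r, R, n, N):
    """Sketch of a point-5 relevance weight (assumed standard
    Robertson/Sparck Jones form).

    r: known relevant documents containing the term
    R: known relevant documents in total
    n: collection documents containing the term
    N: collection size

    The complement method estimates the non-relevance
    probability q from all documents not judged relevant,
    i.e. from (n - r) and (N - R); items judged non-relevant
    are not treated separately from items not judged.
    """
    p_odds = (r + 0.5) / (R - r + 0.5)
    q_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(p_odds / q_odds)

# Hypothetical term: in 8 of 10 known relevant, 100 of 10000 docs.
w = f4_weight(r=8, R=10, n=100, N=10000)
```

Only positive relevance judgements (r and R) enter the p estimate, which is why extra judgements of the searchers' own, as described above, can help.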
6.2 Bias to query terms
The bias in favour of original query terms
discussed in section 3.4 was an attempt to
represent the prior knowledge that a term chosen
by the original requester or a searcher is likely to
be good in terms of the probabilistic model. This
argument relates to, but is not limited to,
Harman's argument about negative weights
(Harman, 1992). The point-5 formula used in the
relevance weighting model actually has a built-in
bias which might be described as "0.5 out of 1".
The biases used in different TREC experiments
(10 out of 10 and 2 out of 3) were chosen
arbitrarily; unfortunately there was no time to do
any extensive testing to enable a better-informed
decision.
A bias such as 2 out of 3 has the curious effect of
downgrading some very good query terms (any
term that occurs in all the known relevant documents). This
was part of the reason for trying the 10 out of 10
bias. However, there may be good reason for this
effect: even very good results on the known
relevant should not persuade us that p is actually
unity.
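As an illustration (hypothetical numbers, using the r -> r+2, R -> R+3 pattern described in section 6.3), the bias modifies the p estimate as follows:

```python
def biased_p(r, R, bias_r, bias_R):
    """Estimate p = P(term present | relevant) with a 'bias_r out
    of bias_R' prior: pseudo-observations in which the term
    occurred bias_r times in bias_R extra relevant documents."""
    return (r + bias_r) / (R + bias_R)

# A term present in all 5 known relevant documents:
p_plain  = biased_p(5, 5, 0, 0)    # unbiased estimate: 1.0
p_2of3   = biased_p(5, 5, 2, 3)    # 7/8: downgraded below unity
p_10of10 = biased_p(5, 5, 10, 10)  # remains at unity
```

The 2-out-of-3 bias pulls even a perfect term below p = 1, while the 10-out-of-10 bias leaves such terms at unity, which is the effect discussed above.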
6.3 Two implementation errors
There were also two errors in the implementation
of this bias. In the relevance weighting formula,
the probability p (that the term occurs in a
relevant document) is estimated directly from the
known relevant documents; the bias is correctly
used to modify this estimate (e.g. r->r+2 and
R->R+3 in the formula p=r/R). But the
corresponding non-relevance probability q is
normally estimated by the complement method
(i.e. all documents in the collection not known to
be relevant are assumed to be non-relevant,
q = (n-r)/(N-R)). In the implementation used for
TREC, the modifications to r and R were
incorrectly carried over to the q estimate.
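A hypothetical reconstruction of this first error (the actual Okapi implementation is not shown in the paper) makes its effect easy to see:

```python
def q_complement(r, R, n, N):
    """Correct complement-method estimate of q: every document
    not judged relevant is assumed non-relevant."""
    return (n - r) / (N - R)

def q_buggy(r, R, n, N, bias_r=2, bias_R=3):
    """The TREC implementation incorrectly carried the p-bias
    (r -> r+2, R -> R+3) over into the q estimate."""
    return (n - (r + bias_r)) / (N - (R + bias_R))

# For an infrequent query term the buggy q is too small, which
# inflates the weight (roughly a function of the ratio p/q):
q_ok  = q_complement(r=3, R=10, n=5, N=10000)
q_bad = q_buggy(r=3, R=10, n=5, N=10000)
```

For rare terms the buggy estimate understates q, which is consistent with the over-emphasis of infrequent query terms noted at the end of this section.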
The second error occurred in the term selection
value for query expansion. The full selection
value should be w (p-q). Since q is normally very
small compared to p, this can be approximated by
wp. Since in a simple relevance feedback version,
p = r/R and R is the same for all terms (i.e. the
number of known relevant), ranking in wp order is
the same as ranking in wr order. So in the TREC
implementation, wr was used. However, the
modification to R for query terms invalidates the
second assumption (that R is the same for all
terms), so wp should have been used.
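The difference between the two rankings can be sketched with hypothetical numbers (the term data below are invented for illustration):

```python
def rank_wr(terms):
    """Rank candidate terms by w*r; this matches the wp order
    only when R is identical for every term."""
    return sorted(terms, key=lambda t: t["w"] * t["r"], reverse=True)

def rank_wp(terms):
    """Rank by w*p = w*r/R; required once query terms get a
    biased R (e.g. R -> R+3), so R differs between terms."""
    return sorted(terms, key=lambda t: t["w"] * t["r"] / t["R"],
                  reverse=True)

terms = [
    {"name": "A", "w": 2.0, "r": 5, "R": 10},  # expansion term
    {"name": "B", "w": 2.0, "r": 6, "R": 13},  # query term, biased R
]
# wr order puts B first (12 > 10), but the correct wp order puts
# A first (2*5/10 = 1.0 > 2*6/13).
```

With equal R for all terms the two orderings coincide, which is why the shortcut was harmless in the simple feedback case.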
These errors will have had the effect of over-
emphasizing some infrequent query-terms, but
will probably not have affected the overall results
greatly.