NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Okapi at TREC

S. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, M. Lau
National Institute of Standards and Technology: Donna K. Harman

The results for the manual ad-hoc queries are given in the official tables as citym1 (without feedback) and citym2 (with feedback). For a discussion of the results, and of the evaluation method for citym2, see section 7.

6. Some observations on the experiments

6.1 Local relevance judgements

We experimented with making our own relevance judgements, based on the topics as provided. Although these experiments were on a very small scale and not very systematic, our impression was that it was usually possible to reproduce the judgements provided centrally, with a high chance of agreement. If this is so, it presumably reflects (a) the relatively highly specified nature of the topics (as compared to most IR queries!), and (b) the fact that the centrally-provided judgements are being made by experts other than the original requester.

Thus we felt justified in attempting to improve our routing queries by providing some more relevance judgements of our own, particularly in cases where there were few centrally-provided ones. Note that the relevance weighting method used (the F4 formula in section 2.2) takes account only of positive relevance judgements; items judged non-relevant are combined with items not judged (the complement method: Harper and van Rijsbergen, 1978).

However, as indicated in section 5.3, there were topics for which (under strict relevance criteria) the relevant documents were very sparse, and relevance feedback would not have had much effect. In these cases, for the manual searches only, searchers were encouraged to make more generous relevance judgements (i.e. to accept as relevant some documents that did not meet all the criteria precisely).
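The F4 formula itself is defined in section 2.2 and is not reproduced here. As a rough illustration of how the complement method enters the weight, the following is a minimal sketch assuming the standard Robertson/Sparck Jones point-5 form of F4; the function name and the example counts are illustrative only, not the actual Okapi code:

```python
import math

def f4_weight(r, n, R, N):
    """Sketch of the F4 relevance weight with 0.5 point estimates.

    r = number of known relevant documents containing the term
    n = number of documents in the collection containing the term
    R = number of known relevant documents
    N = number of documents in the collection

    The complement method supplies the non-relevant counts: every
    document not judged relevant is treated as non-relevant, so the
    term occurs in (n - r) of the (N - R) "non-relevant" documents.
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term concentrated in the known relevant documents gets a high
# positive weight; a term spread across the collection does not.
good_term = f4_weight(r=8, n=20, R=10, N=1000)
common_term = f4_weight(r=1, n=500, R=10, N=1000)
```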
The argument behind this guideline was that relevance feedback should work better given some partially-relevant items than with few or no relevant items. This argument obviously requires testing.

6.2 Bias to query terms

The bias in favour of original query terms discussed in section 3.4 was an attempt to represent the prior knowledge that a term chosen by the original requester or a searcher is likely to be good in terms of the probabilistic model. This argument relates to, but is not limited to, Harman's argument about negative weights (Harman, 1992).

The point-5 formula used in the relevance weighting model actually has a built-in bias which might be described as "0.5 out of 1". The biases used in different TREC experiments (10 out of 10 and 2 out of 3) were chosen arbitrarily; unfortunately there was no time to do any extensive testing to enable a better-informed decision. A bias such as 2 out of 3 has the curious effect of downgrading some very good query terms (any term that occurs in all the known relevant documents). This was part of the reason for trying the 10 out of 10 bias. However, there may be good reason for this effect: even very good results on the known relevant documents should not persuade us that p is actually unity.

6.3 Two implementation errors

There were also two errors in the implementation of this bias. In the relevance weighting formula, the probability p (that the term occurs in a relevant document) is estimated directly from the known relevant documents; the bias is correctly used to modify this estimate (e.g. r -> r+2 and R -> R+3 in the formula p = r/R). But the corresponding non-relevance probability q is normally estimated by the complement method (i.e. all documents in the collection not known to be relevant are assumed to be non-relevant, so q = (n-r)/(N-R)). In the implementation used for TREC, the modifications to r and R were incorrectly carried over to the q estimate.

The second error occurred in the term selection value for query expansion.
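To make the first error concrete, here is a minimal sketch of the correct and the buggy q estimates, assuming a "2 out of 3" bias; the function names are illustrative, not the actual Okapi code:

```python
def p_estimate(r, R, bias_r=2, bias_R=3):
    # "2 out of 3" bias for query terms: pretend the term occurred in
    # 2 extra relevant documents out of 3 extra known relevant ones.
    return (r + bias_r) / (R + bias_R)

def q_correct(n, r, N, R):
    # Complement method: all documents not known to be relevant are
    # assumed non-relevant, so q = (n - r) / (N - R).
    return (n - r) / (N - R)

def q_buggy(n, r, N, R, bias_r=2, bias_R=3):
    # The TREC implementation incorrectly carried the bias on r and R
    # over into the q estimate as well.
    return (n - (r + bias_r)) / (N - (R + bias_R))
```

For an infrequent query term the buggy estimate understates q, which inflates the term's weight; this is the over-emphasis of infrequent query terms noted at the end of this section.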
The full selection value should be w(p-q). Since q is normally very small compared to p, this can be approximated by wp. Since in a simple relevance feedback version p = r/R, and R is the same for all terms (i.e. the number of known relevant documents), ranking in wp order is the same as ranking in wr order. So in the TREC implementation, wr was used. However, the modification to R for query terms invalidates the second assumption (that R is the same for all terms), so wp should have been used. These errors will have had the effect of over-emphasizing some infrequent query terms, but will probably not have affected the overall results greatly.
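The effect of the second error can be sketched as follows. The term names, weights, and counts below are invented purely for illustration: once the bias gives query terms a larger R, the wr and wp orderings diverge, and wr no longer stands in for wp:

```python
def rank_by(terms, key):
    # Return term names ranked by a selection-value function, best first.
    return [t["name"] for t in sorted(terms, key=key, reverse=True)]

terms = [
    # Expansion term: unmodified R (number of known relevant documents).
    {"name": "expansion_term", "w": 3.0, "r": 6, "R": 10},
    # Query term: R boosted by the "2 out of 3" bias (10 -> 13).
    {"name": "query_term",     "w": 3.0, "r": 7, "R": 13},
]

# wr ordering (used in the TREC implementation): 21.0 vs 18.0.
wr_order = rank_by(terms, lambda t: t["w"] * t["r"])

# wp ordering (what should have been used): 3.0*7/13 ≈ 1.62 vs 3.0*6/10 = 1.8.
wp_order = rank_by(terms, lambda t: t["w"] * t["r"] / t["R"])
```

Here wr prefers the query term while wp prefers the expansion term: with per-term values of R the two rankings disagree, which is exactly why wp should have been used.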