NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Chapter: Okapi at TREC-2
Authors: S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford
Publisher: National Institute of Standards and Technology
Editor: D. K. Harman

problem is also apparent in the regression approach, although "trying it out" has a somewhat different sense here (the formula is tried in a regression model, rather than in a retrieval test).

The discussions of Sections 2.1 and 2.3 exemplify an approach which may offer some reconciliation of these ideas. Essentially it is to take a formal model which provides an exact but intractable formula, and use it to suggest a much simpler formula. The simpler formula can then be tried in an ad-hoc fashion, or used in turn in a regression approach. Although we have not yet taken this latter step of using regression, we believe that the present suggestion lends itself to such methods.

2.1 The basic model

The basic probabilistic model is the traditional relevance weight model [5], under which each term is given a weight as defined below, and the score (matching value) for each document is the sum of the weights of the matching terms:

$$
w = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)} \qquad (1)
$$

where $N$ is the number of indexed documents; $n$ the number of documents containing the term; $R$ the number of known relevant documents; $r$ the number of relevant documents containing the term.

This approximates to inverse collection frequency (ICF) when there is no relevance information. It will be referred to below (with or without relevance information) as $w^{(1)}$.

2.2 The 2-Poisson model and term frequency

One example of these problems concerns within-document term frequency ($tf$). This variable figures in a number of ad-hoc formulae, and it seems clear that it can contribute to better retrieval performance. However, there is no obvious reason why any particular function of $tf$ should be used in retrieval.
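As a concrete reference point for what follows, the relevance weight of equation (1) in Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `relevance_weight` and the example collection sizes are assumptions made here, not part of the Okapi system.

```python
import math


def relevance_weight(N, n, R=0, r=0):
    """Relevance weight of equation (1).

    N: number of indexed documents
    n: number of documents containing the term
    R: number of known relevant documents (0 if none are known)
    r: number of relevant documents containing the term
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    non_relevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / non_relevant_odds)


if __name__ == "__main__":
    # Hypothetical collection of 1000 documents, no relevance information:
    # the weight reduces to an ICF-like quantity, log((N - n + 0.5)/(n + 0.5)).
    print(relevance_weight(N=1000, n=10))
    print(relevance_weight(N=1000, n=500))
    # The same rare term with relevance information (8 of 10 known relevant
    # documents contain it).
    print(relevance_weight(N=1000, n=10, R=10, r=8))
```

With `R = r = 0` the expression collapses to `log((N - n + 0.5)/(n + 0.5))`, which is the sense in which the weight approximates inverse collection frequency when no relevance information is available.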
There is not much in the way of formal models which include a $tf$ component; one which does is the 2-Poisson model [7, 8].

The 2-Poisson model postulates that the distribution of within-document frequencies of a content-bearing term is a mixture of two Poisson distributions: one set of documents (the "elite" set for the particular term, which may be interpreted to mean those documents which can be said to be "about" the concept represented by the term) will exhibit a Poisson distribution of a certain mean, while the remainder may also contain the term but much less frequently (a smaller Poisson mean).

Some earlier work in this area [8] attempted to use an exact formula derived from the model, but had limited success, probably partly because of the problem of estimating the required quantities. The approach here is to use the behaviour of the exact formula to suggest a very much simpler function of $tf$ which behaves in a similar way.

The exact formula, for an additive weight in the style of $w^{(1)}$, of a term $t$ which occurs $tf$ times, is

$$
w = \log \frac{\left(p' \lambda^{tf} e^{-\lambda} + (1 - p') \mu^{tf} e^{-\mu}\right)\left(q' e^{-\lambda} + (1 - q') e^{-\mu}\right)}{\left(q' \lambda^{tf} e^{-\lambda} + (1 - q') \mu^{tf} e^{-\mu}\right)\left(p' e^{-\lambda} + (1 - p') e^{-\mu}\right)} \qquad (2)
$$

where $\lambda$ is the Poisson mean for $tf$ in the elite set for $t$; $\mu$ is the Poisson mean for $tf$ in the non-elite set; $p'$ is the probability of a document being elite for $t$ given that it is relevant; $q'$ is the probability of a document being elite given that it is non-relevant.

As a function of $tf$, this can be shown to behave as follows: it is zero for $tf = 0$; it increases monotonically with $tf$, but at an ever-decreasing rate; it approaches an asymptotic maximum as $tf$ gets large. The maximum is approximately the binary independence weight that would be assigned to an infallible indicator of eliteness.

A very simple formula which exhibits similar behaviour is $tf/(tf + \mathit{constant})$.
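The qualitative behaviour described above can be checked numerically. The following Python sketch evaluates the exact 2-Poisson weight of equation (2) alongside the simple saturating ratio. The function names and all parameter values (elite mean 5, non-elite mean 0.5, eliteness probabilities 0.8 and 0.1, constant 2) are arbitrary assumptions chosen for illustration; none of them comes from the paper.

```python
import math


def two_poisson_weight(tf, lam, mu, p, q):
    """Exact 2-Poisson weight, equation (2).

    lam: Poisson mean of tf in the elite set
    mu:  Poisson mean of tf in the non-elite set
    p:   P(elite | relevant), written p' in the text
    q:   P(elite | non-relevant), written q' in the text
    """
    e_lam, e_mu = math.exp(-lam), math.exp(-mu)
    numerator = (p * lam**tf * e_lam + (1 - p) * mu**tf * e_mu) * (
        q * e_lam + (1 - q) * e_mu
    )
    denominator = (q * lam**tf * e_lam + (1 - q) * mu**tf * e_mu) * (
        p * e_lam + (1 - p) * e_mu
    )
    return math.log(numerator / denominator)


def saturating_ratio(tf, constant):
    """The simple function tf / (tf + constant), which tends to 1."""
    return tf / (tf + constant)


if __name__ == "__main__":
    # Illustrative parameter values only (assumptions, not estimates).
    lam, mu, p, q = 5.0, 0.5, 0.8, 0.1
    # Binary independence weight of a perfect eliteness indicator.
    elite_weight = math.log(p * (1 - q) / (q * (1 - p)))
    print("tf  exact   ratio")
    for tf in range(0, 11):
        exact = two_poisson_weight(tf, lam, mu, p, q)
        print(f"{tf:2d}  {exact:6.3f}  {saturating_ratio(tf, 2.0):5.3f}")
    print("binary independence weight for eliteness:", round(elite_weight, 3))
```

Under these assumed values the exact weight starts at zero, rises with every additional occurrence, and levels off slightly below the binary independence weight for eliteness. The same start-at-zero, rise-and-flatten shape is what is wanted from the much simpler ratio $tf/(tf + \mathit{constant})$.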
This ratio has an asymptotic limit of unity, so it must be multiplied by an appropriate binary independence weight. The regular binary independence weight for the presence/absence of the term may be used for this purpose. Thus the weight becomes

$$
w = \frac{tf}{k_1 + tf} \, w^{(1)} \qquad (3)
$$

where $w^{(1)}$ is the binary independence weight of equation (1) and $k_1$ is an unknown constant.

Several points may be made concerning this argument. It is not by any stretch of the imagination a strong quantitative argument; one may have many reservations about the 2-Poisson model itself, and the transformations sketched above are hardly justifiable in any formal way. However, it results in a modification of the binary independence weight which is at least plausible, and has just slightly more justification than plausibility alone.

The constant $k_1$ in the formula is not in any way determined by the argument. The effect of choice of constant is to determine the strength of the relationship between weight and $tf$: a large constant will make for a relation close to proportionality (where $tf$ is relatively