small); a small k1 will mean that tf has relatively little effect on the weight (at least when tf > 0, i.e. when the term is present). Our approach has been to try out various values of k1 (around 1 may be about right for the full disks 1 and 2 database). However, in the longer term we hope to use regression methods to determine the constant. It is not, unfortunately, in a form directly susceptible to the methods of Fuhr or Cooper, but we hope to develop suitable methods.

2.3 Document length

The 2-Poisson model in effect assumes that documents (i.e. records) are all of equal length. Document length is a variable which figures in a number of weighting formulae.

We may postulate at least two reasons why documents might vary in length. Some documents may simply cover more material than others; an extreme version of this hypothesis would have a long document consisting of a number of unrelated short documents concatenated together (the "scope hypothesis"). An opposite view would have long documents like short documents, but longer: in other words, a long document covers a similar scope to a short document, but simply uses more words (the "verbosity hypothesis").

It seems likely that real document collections contain a mixture of these effects; individual long documents may be at either extreme or of some hybrid type. All the discussion below assumes the verbosity hypothesis; no progress has yet been made with models based on the scope hypothesis.

The simplest way to deal with this model is to take the formula above, but normalise tf for document length (dl). If we assume that the value of k1 is appropriate to documents of average length (avdl), then this model can be expressed as

    w = \frac{tf}{(k_1 \times dl / avdl) + tf} \, w^{(1)}    (4)

A more detailed analysis of the effect on the Poisson model of the verbosity hypothesis is given in Appendix 7.4. This shows that the appropriate matching value for a document contains two components. The first component is a conventional sum of term weights, each term weight dependent on both tf and dl; the second is a correction factor dependent on the document length and the number of terms in the query (nq), though not on which terms match. A similar argument to the above for tf suggests the following simple formulation:

    \text{correction factor} = k_2 \times nq \times \frac{avdl - dl}{avdl + dl}    (5)

where k2 is another unknown constant. Again, k2 is not specified by the model, and must (at present, at least) be discovered by trial and error. Values in the range 0.0-0.3 appear about right for the TREC databases (if natural logarithms are used in the term-weighting functions¹), with the lower values being better for equation 4 term weights and the higher values for equation 3.

¹To obtain weights within a range suitable for storage as 16-bit integers, the Okapi system uses logarithms to base 2^0.1.

2.4 Query term frequency and query length

A similar approach may be taken to within-query term frequency. In this case we postulate an "elite" set of queries for a given term: the occurrence of a term in the query is taken as evidence for the eliteness of the query for that term. This would suggest a similar multiplier for the weight:

    w = \frac{qtf}{k_3 + qtf} \, w^{(1)}    (6)

In this case, experiments suggest a large value of k3 to be effective; indeed the limiting case, which is equivalent to

    w = qtf \times w^{(1)}    (7)

appears to be the most effective.
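To make these formulae concrete, here is a minimal sketch in Python of how a matching value built from equations 4, 5 and 7 might be computed. The function and parameter names are illustrative rather than taken from the Okapi implementation; w^{(1)} is computed with the usual point-5 form of the relevance weight, natural logarithms are used (as assumed for the k2 range above), and the sketch multiplies in qtf as in equation 7, a combination taken up in the next paragraph.

    import math

    def rsj_weight(N, n, R=0, r=0):
        # w(1): the Robertson/Sparck Jones relevance weight for a term
        # occurring in n of N documents (r of R known relevant ones).
        # With no relevance information (R = r = 0) it reduces to an
        # inverse-document-frequency-like form.
        return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                        ((n - r + 0.5) * (R - r + 0.5)))

    def term_weight(tf, dl, avdl, w1, k1=1.0):
        # Equation 4: tf component, normalised for document length
        # under the verbosity hypothesis.
        return tf / (k1 * dl / avdl + tf) * w1

    def correction_factor(nq, dl, avdl, k2=0.2):
        # Equation 5: document-length correction, dependent on the
        # number of query terms nq but not on which terms match.
        return k2 * nq * (avdl - dl) / (avdl + dl)

    def match_value(query, doc_tf, dl, avdl, w1, k1=1.0, k2=0.2):
        # query maps term -> qtf; doc_tf maps term -> tf; w1 maps
        # term -> w(1). Sum of equation-4 term weights, each
        # multiplied by qtf (equation 7), plus the equation-5
        # correction factor.
        s = correction_factor(len(query), dl, avdl, k2)
        for term, qtf in query.items():
            tf = doc_tf.get(term, 0)
            if tf > 0:
                s += qtf * term_weight(tf, dl, avdl, w1[term], k1)
        return s

Note that the correction factor is positive for documents shorter than average and negative for longer ones, so for k2 > 0 its effect is to favour short documents.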
We may combine a formula such as 6 or 7 with a document term frequency formula such as 3. In practice this seems to be a useful device, although the theory requires more work to validate it.

2.5 Adjacency

The recent success of weighting schemes involving a term-proximity component [9] has prompted consideration of including some such component in the Okapi weighting. Although this does not yet extend to a full Keen-type weighting, a method allowing for adjacency of some terms has been developed.

Weighting formulae such as w^{(1)} can in principle be applied to any identifiable and searchable entity (such as, for example, a Boolean search expression). An obvious candidate for such a weight is any identifiable phrase. However, the problem lies in identifying suitable phrases. Generally such schemes have been applied only to predetermined phrases (e.g. those given in a dictionary and identified in the documents in the course of indexing). Keen's methods would suggest constructing phrases from all possible pairs (or perhaps larger sets) of query terms at search time; however, for queries of the sort of size found in TREC, that would probably generate far too many phrases. The approach here has been to take pairs of terms which are adjacent in the query as candidate phrases.
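As a sketch of the candidate-phrase selection just described (the function name and the example query are illustrative only):

    def candidate_phrases(query_terms):
        # Pairs of terms adjacent in the query, taken as candidate
        # phrases; each pair is then a searchable entity to which a
        # weight such as w(1) can be assigned.
        return list(zip(query_terms, query_terms[1:]))

    # e.g. candidate_phrases(["machine", "translation", "systems"])
    # yields [("machine", "translation"), ("translation", "systems")]

Because only adjacent pairs are taken, the number of candidate phrases grows linearly with query length rather than quadratically, which keeps the approach practical for queries of TREC size.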