NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Okapi at TREC-2 (chapter)
S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford
National Institute of Standards and Technology; D. K. Harman

[6] Cooper, W. et al. Probabilistic retrieval in the TIPSTER collection: an application of staged logistic regression. In: [1] (pp. 73-88).

[7] Harter, S.P. A probabilistic approach to automatic keyword indexing. Journal of the American Society for Information Science, 26, 197-206 and 280-289.

[8] Robertson, S.E., Van Rijsbergen, C.J. & Porter, M.F. Probabilistic models of indexing and searching. In Oddy, R.N. et al. (Eds.), Information Retrieval Research (pp. 35-56). London: Butterworths, 1981.

[9] Keen, E.M. The use of term position devices in ranked output experiments. Journal of Documentation, 47, 1991, 1-22.

[10] Harman, D. Relevance feedback revisited. In: SIGIR 92. Proceedings of the 15th International Conference on Research and Development in Information Retrieval (pp. 280-289). ACM Press, 1992.

[11] Robertson, S.E. On Term Selection for Query Expansion. Journal of Documentation, 46, 1990, 359-364.

A 2-Poisson model with document length component

Basic ideas

The basic weighting function used is that developed in [8], and may be expressed as follows:

$$
w(\mathbf{x}) = \log \frac{P(\mathbf{x} \mid R)\, P(\mathbf{0} \mid \bar{R})}{P(\mathbf{x} \mid \bar{R})\, P(\mathbf{0} \mid R)} \tag{8}
$$

where

$\mathbf{x}$ is a vector of information about the document;

$\mathbf{0}$ is a reference vector representing a zero-weighted document;

$R$ and $\bar{R}$ are relevance and non-relevance respectively.

For example, each component of $\mathbf{x}$ may represent the presence/absence of a query term in the document (or, as in the case of formula 2 in the main text, its document frequency); $\mathbf{0}$ would then be the "natural" zero vector representing all query terms absent. In this formulation, independence assumptions lead to the decomposition of $w$ into additive components such as individual term weights.
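As an illustration of that additive decomposition, the sketch below evaluates the weight of equation 8 for a binary presence/absence vector in two ways: directly from the four probabilities, and as a sum of per-term weights. The function names and the per-term probabilities `p` and `q` are hypothetical values chosen for the example, not figures from the paper, and the sketch assumes the simplest case of independent binary term indicators.

```python
import math

def log_odds_weight(x, p, q):
    """Equation 8 for a binary term vector under term independence.

    x: list of 0/1 flags, one per query term (1 = term present).
    p: P(term present | relevant) per term      (hypothetical values).
    q: P(term present | non-relevant) per term  (hypothetical values).
    """
    def prob(vec, probs):
        # Independence: the joint probability is a product over terms.
        result = 1.0
        for present, pr in zip(vec, probs):
            result *= pr if present else (1.0 - pr)
        return result

    zero = [0] * len(x)  # the "natural" zero vector: all query terms absent
    return math.log((prob(x, p) * prob(zero, q)) / (prob(x, q) * prob(zero, p)))

def term_weight(p_i, q_i):
    """Additive component contributed by a single term that is present."""
    return math.log((p_i * (1.0 - q_i)) / (q_i * (1.0 - p_i)))

# Hypothetical probabilities for three query terms.
p = [0.6, 0.3, 0.5]
q = [0.1, 0.2, 0.05]
x = [1, 0, 1]  # first and third terms present, second absent

direct = log_odds_weight(x, p, q)
additive = sum(term_weight(p_i, q_i) for x_i, p_i, q_i in zip(x, p, q) if x_i)
print(direct, additive)  # the two values agree up to rounding
```

Absent terms contribute nothing because their factors cancel against the reference vector, which is why the zero vector receives a weight of zero.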
A document length may be added as a component of $\mathbf{x}$; however, document length does not so obviously have a "natural" zero (an actual document of zero length is a pathological case). Instead, we may use the average length of a document for reference; thus we would expect to get a formula in which the document length component disappears for a document of average length, but not for other lengths.

Suppose, then, that the average length of a document is $\Delta$. The weighting formula becomes:

$$
w(\mathbf{x}, d) = \log \frac{P((\mathbf{x}, d) \mid R)\, P((\mathbf{0}, \Delta) \mid \bar{R})}{P((\mathbf{x}, d) \mid \bar{R})\, P((\mathbf{0}, \Delta) \mid R)}
$$

where $d$ is document length, and $\mathbf{x}$ represents all other information about the document. This may be decomposed as follows:

$$
w(\mathbf{x}, d) = w(\mathbf{x}, d)_1 + w(\mathbf{x}, d)_2 \tag{9}
$$

where

$$
w(\mathbf{x}, d)_1 = \log \frac{P(\mathbf{x} \mid (R, d))\, P(\mathbf{0} \mid (\bar{R}, d))}{P(\mathbf{x} \mid (\bar{R}, d))\, P(\mathbf{0} \mid (R, d))}
$$

and

$$
w(\mathbf{x}, d)_2 = \log \frac{P((\mathbf{0}, d) \mid R)\, P((\mathbf{0}, \Delta) \mid \bar{R})}{P((\mathbf{0}, d) \mid \bar{R})\, P((\mathbf{0}, \Delta) \mid R)}
$$

These two components are discussed further below.

Hypotheses

As indicated in the main text, one may imagine different reasons why documents should vary in length. The two hypotheses given there ("scope" and "verbosity" hypotheses) may be regarded as opposite poles of explanation. The arguments below are based on the Verbosity hypothesis only.

The Verbosity hypothesis would imply that document properties such as relevance and eliteness can be regarded as independent of document length; given eliteness for a term, however, the number of occurrences of that term would depend on document length. In particular, if we assume that the two Poisson parameters for a given term, $\lambda$ and $\mu$, are appropriate for documents of average length, then the number of occurrences of the term in documents of length $d$ will be 2-Poisson with means $\lambda d/\Delta$ and $\mu d/\Delta$.

Second component

The second component of equation 9 is

$$
w(\mathbf{x}, d)_2 = \log \frac{P(\mathbf{0} \mid (R, d))\, P(\mathbf{0} \mid (\bar{R}, \Delta))}{P(\mathbf{0} \mid (\bar{R}, d))\, P(\mathbf{0} \mid (R, \Delta))} + \log \frac{P(d \mid R)\, P(\Delta \mid \bar{R})}{P(d \mid \bar{R})\, P(\Delta \mid R)}
$$

Under the Verbosity hypothesis, the second part of this formula is zero.
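The two-part form of the second component, and the vanishing of its second part, can be checked with the chain rule. The derivation below is a reader's sketch rather than part of the original appendix; it writes $\Delta$ for the average document length and $\bar{R}$ for non-relevance, and it assumes that "independent of document length" means $P(d \mid R) = P(d \mid \bar{R})$ for every length $d$.

```latex
% Chain rule applied to each joint probability in w(x, d)_2:
\begin{align*}
P((\mathbf{0}, d) \mid R) &= P(\mathbf{0} \mid (R, d))\, P(d \mid R), &
P((\mathbf{0}, d) \mid \bar{R}) &= P(\mathbf{0} \mid (\bar{R}, d))\, P(d \mid \bar{R}), \\
P((\mathbf{0}, \Delta) \mid R) &= P(\mathbf{0} \mid (R, \Delta))\, P(\Delta \mid R), &
P((\mathbf{0}, \Delta) \mid \bar{R}) &= P(\mathbf{0} \mid (\bar{R}, \Delta))\, P(\Delta \mid \bar{R}).
\end{align*}

% Substituting and separating the length-only factors:
\begin{align*}
w(\mathbf{x}, d)_2
  &= \log \frac{P(\mathbf{0} \mid (R, d))\, P(\mathbf{0} \mid (\bar{R}, \Delta))}
               {P(\mathbf{0} \mid (\bar{R}, d))\, P(\mathbf{0} \mid (R, \Delta))}
   + \log \frac{P(d \mid R)\, P(\Delta \mid \bar{R})}
               {P(d \mid \bar{R})\, P(\Delta \mid R)}.
\end{align*}

% Assumed reading of the Verbosity hypothesis: length is independent of
% relevance, so P(d | R) = P(d | \bar{R}) and P(\Delta | R) = P(\Delta | \bar{R}).
\begin{align*}
\log \frac{P(d \mid R)\, P(\Delta \mid \bar{R})}{P(d \mid \bar{R})\, P(\Delta \mid R)}
  = \log 1 = 0.
\end{align*}
```

Only the first logarithm survives, and it compares the probability of the all-terms-absent vector at length $d$ with the same probability at the average length.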
Making the usual term-independence assumptions, the first part may be decomposed into a sum of components for each query term, thus:

$$
w(t, d)_2 = \log \frac{\left(p' e^{-\lambda d/\Delta} + (1 - p')\, e^{-\mu d/\Delta}\right)\left(q' e^{-\lambda} + (1 - q')\, e^{-\mu}\right)}{\left(q' e^{-\lambda d/\Delta} + (1 - q')\, e^{-\mu d/\Delta}\right)\left(p' e^{-\lambda} + (1 - p')\, e^{-\mu}\right)} \tag{10}
$$

where $t$ is a query term, $p'$, $q'$, $\lambda$ and $\mu$ are as in formula 2, and $\Delta$ is the average document length. Note that there is a component for each query term, whether or not the term is in the document.

For almost all normal query terms (i.e. for any terms that are not actually detrimental to the query), we can assume that $p' > q'$ and $\lambda > \mu$. In this case, formula 10 can be shown to be monotonic decreasing with $d$, from a maximum as $d \to 0$, through zero when $d = \Delta$, and to a minimum as $d \to \infty$. As indicated, there is one such factor for each of the $n_q$ query terms.

Once again, we can devise a very much simpler function which approximates to this behaviour; this is the justification for formula 5 in the main text.
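The behaviour claimed for formula 10 can be explored numerically. The sketch below is illustrative only: the function name, the parameter values and the list of document lengths are hypothetical, and it assumes that $p'$ and $q'$ are the probabilities of eliteness given relevance and non-relevance, and that $\lambda$ and $\mu$ are the Poisson means for elite and non-elite documents of average length.

```python
import math

def length_component(d, avg_len, p_rel, q_nonrel, lam, mu):
    """Per-term length component w(t, d)_2 of formula 10.

    d         document length
    avg_len   average document length (Delta in the text)
    p_rel     p', assumed to be P(elite | relevant)
    q_nonrel  q', assumed to be P(elite | non-relevant)
    lam, mu   Poisson means at average length (elite, non-elite)
    """
    def p_zero(elite_prob, scale):
        # 2-Poisson probability of zero occurrences of the term, with
        # both means scaled by document length relative to the average.
        return (elite_prob * math.exp(-lam * scale)
                + (1.0 - elite_prob) * math.exp(-mu * scale))

    s = d / avg_len
    numerator = p_zero(p_rel, s) * p_zero(q_nonrel, 1.0)
    denominator = p_zero(q_nonrel, s) * p_zero(p_rel, 1.0)
    return math.log(numerator / denominator)

# Hypothetical parameters satisfying p' > q' and lambda > mu.
params = dict(avg_len=100.0, p_rel=0.7, q_nonrel=0.1, lam=3.0, mu=0.2)
lengths = [1, 25, 50, 100, 200, 400, 1600]
values = [length_component(d, **params) for d in lengths]

for d, w in zip(lengths, values):
    print(f"d = {d:5d}   w(t, d)_2 = {w:+.4f}")
```

With these values the component is positive for documents shorter than average, zero at the average length, and negative for longer documents, levelling off at both extremes. Very large ratios of `d` to `avg_len` are avoided in the example because both exponentials would underflow to zero in floating point.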