SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression chapter W. Cooper A. Chen F. Gey National Institute of Standards and Technology D. K. Harman TABLE I: Equations to Approximate the Components of Eq. (1) log 0(R) [b0+b1M] log 0(R 1A1) - log 0(R) [a0 + a1X1,1 + . + aKX1,K + aK+lM J - [b0+b1M] log 0(R lAM) - log 0(R) [a0 + alXM,l + . +aKXM,K + aK+lM] - [b0+b1MJ M log0(RIA1,...,A[OCRerr]) - [a0M + a1 [OCRerr] Xm,i m=1 What must be done, then, is to find approximating expressions for the various components in the right side of Eq. (1). Table I (above) shows the interrelationships among the quantities involved. In this array, the material to the left of the column of approximation signs just restates Eq. (1), for Eq. (1) asserts that the expressions above the horizontal line sum to the expression below the line. On the right of each approximation sign is a simple formula that might be used to approximate the expression on the left. (More complex formulae could of course also be considered.) Each right-side formula involves a linear combination of K variables Xm.i,... , Xm.K corresponding to the K retrieval properties (frequency values, etc.) used to characterize the term in question. The idea is to apply logistic regression to find the values for the coefficients (the a's and b's) in the right side that produce the best regression fit to the learning sample. Once these constants have been determined, retrieval can be carried out by sum- ming the right sides to obtain the desired estimate of log 0(R I A1.....AN) for any given query-document palr. It may not be immediately apparent why terms in M have been included in the approximating formulas on the right. In Eq. (1), the number N of avallable retrieval clues is actually part of the conditionalizing information, a fact that could have been stressed by writing 0(R I N) in place of 0(R) and 0(R I A[OCRerr], N) in place of 0(R I A[OCRerr]). So the approximating formula for log 0(R) is really an approxi- mation for log 0(RI M) and must involve a term in M. (Simple functions of M such as or log (M) might also be considered.) Similar remarks apply to the formulas for M + + aK X Xm,K + aK+1M2] - (M - 1) [ b0 + b1M] m=1 approximating log (R I Am). There are two approaches to the problem of finding coefficients to use in the approximating expressions. The first is what might be called the `triples-then-palrs' approach. It starts out with a logistic regression analysis performed directly on a sample of query-document-term triples constructed from the learning set. In the sample, each triple is accompanied by the clue-values associated with the match term in the query and document, and by the human relevance judgement for the query and docu- menL The clue-values are used as the values of the inde- pendent variables in the regression, and the relevance judgements as the values of the the dependent variable. After the desired coefficients have been obtained, the resulting regression equation can be applied to evaluate the left-side expressions, which can in turn be summed down the M terms to obtain a preliminary estimate of log0(RIA1,...,A[OCRerr]). A second regression analysis is then performed on a sample of query-document pairs, using the preliminary logodds prediction from the first (triples) analysis as an independent variable. This second analysis serves the pur- pose of correcting for distortions introduced by the Assumption of Linked Dependence. It also allows retrieval properties not associated with particular match terms, such as document length, age, etc., to be intro- duced. The triples-then-pairs approach is the one used by the Berkeley group in its Trec-l research (Cooper, Gey & Chen 1992). The theory behind it is presented in more detail in (Cooper, Dabney & Gey 1992). 59