SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
chapter
W. Cooper
A. Chen
F. Gey
National Institute of Standards and Technology
D. K. Harman
TABLE I: Equations to Approximate the Components of Eq. (1)
log 0(R) [b0+b1M]
log 0(R 1A1) - log 0(R) [a0 + a1X1,1 + . + aKX1,K + aK+lM J - [b0+b1M]
log 0(R lAM) - log 0(R) [a0 + alXM,l + . +aKXM,K + aK+lM]
- [b0+b1MJ
M
log0(RIA1,...,A[OCRerr]) - [a0M + a1 [OCRerr] Xm,i
m=1
What must be done, then, is to find approximating
expressions for the various components in the right side of
Eq. (1). Table I (above) shows the interrelationships
among the quantities involved. In this array, the material
to the left of the column of approximation signs just
restates Eq. (1), for Eq. (1) asserts that the expressions
above the horizontal line sum to the expression below the
line. On the right of each approximation sign is a simple
formula that might be used to approximate the expression
on the left. (More complex formulae could of course also
be considered.) Each right-side formula involves a linear
combination of K variables Xm.i,... , Xm.K corresponding
to the K retrieval properties (frequency values, etc.) used
to characterize the term in question. The idea is to apply
logistic regression to find the values for the coefficients
(the a's and b's) in the right side that produce the best
regression fit to the learning sample. Once these constants
have been determined, retrieval can be carried out by sum-
ming the right sides to obtain the desired estimate of
log 0(R I A1.....AN) for any given query-document
palr.
It may not be immediately apparent why terms in M
have been included in the approximating formulas on the
right. In Eq. (1), the number N of avallable retrieval clues
is actually part of the conditionalizing information, a fact
that could have been stressed by writing 0(R I N) in place
of 0(R) and 0(R I A[OCRerr], N) in place of 0(R I A[OCRerr]). So the
approximating formula for log 0(R) is really an approxi-
mation for log 0(RI M) and must involve a term in M.
(Simple functions of M such as or log (M) might also
be considered.) Similar remarks apply to the formulas for
M
+ + aK X Xm,K + aK+1M2] - (M - 1) [ b0 + b1M]
m=1
approximating log (R I Am).
There are two approaches to the problem of finding
coefficients to use in the approximating expressions. The
first is what might be called the `triples-then-palrs'
approach. It starts out with a logistic regression analysis
performed directly on a sample of query-document-term
triples constructed from the learning set. In the sample,
each triple is accompanied by the clue-values associated
with the match term in the query and document, and by
the human relevance judgement for the query and docu-
menL The clue-values are used as the values of the inde-
pendent variables in the regression, and the relevance
judgements as the values of the the dependent variable.
After the desired coefficients have been obtained, the
resulting regression equation can be applied to evaluate
the left-side expressions, which can in turn be summed
down the M terms to obtain a preliminary estimate of
log0(RIA1,...,A[OCRerr]).
A second regression analysis is then performed on a
sample of query-document pairs, using the preliminary
logodds prediction from the first (triples) analysis as an
independent variable. This second analysis serves the pur-
pose of correcting for distortions introduced by the
Assumption of Linked Dependence. It also allows
retrieval properties not associated with particular match
terms, such as document length, age, etc., to be intro-
duced. The triples-then-pairs approach is the one used by
the Berkeley group in its Trec-l research (Cooper, Gey &
Chen 1992). The theory behind it is presented in more
detail in (Cooper, Dabney & Gey 1992).
59