SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
chapter
W. Cooper
F. Gey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
Eq. (2) follows from the `Assumption of Linked Dependence.' Considering for
simplicity only the special case of two properties, this assumption can be expressed as the
equality
$$\frac{P(A_1, A_2 \mid R)}{P(A_1, A_2 \mid \bar{R})} = \frac{P(A_1 \mid R)}{P(A_1 \mid \bar{R})} \times \frac{P(A_2 \mid R)}{P(A_2 \mid \bar{R})}$$
Intuitively this asserts that the degree of dependency that exists between properties A1
and A2 in the set of relevance-related query-document pairs is linked in a certain way
with the degree of dependency that exists among the nonrelevance-related pairs. It is a
weaker assumption than the independence postulates commonly encountered in the literature
on probabilistic IR. For discussion and a derivation of Eq. (2) from the Linked
Dependence Assumption see Cooper (1991) and Cooper, Dabney & Gey (1992 op. cit.).
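The claim that linked dependence is weaker than independence can be illustrated with a small numeric check. The sketch below is not taken from the experiment: the function name and the toy probabilities are hypothetical, chosen so that A1 and A2 are dependent within both the relevant and the nonrelevant pairs, yet the equality above still holds because the dependence is of the same degree in both sets.

```python
import math


def linked_dependence_holds(joint_rel, p1_rel, p2_rel,
                            joint_non, p1_non, p2_non, rel_tol=1e-9):
    """Check the two-property linked dependence equality.

    joint_rel = P(A1,A2|R),      p1_rel = P(A1|R),      p2_rel = P(A2|R)
    joint_non = P(A1,A2|not R),  p1_non = P(A1|not R),  p2_non = P(A2|not R)
    """
    lhs = joint_rel / joint_non
    rhs = (p1_rel / p1_non) * (p2_rel / p2_non)
    return math.isclose(lhs, rhs, rel_tol=rel_tol)


# Hypothetical toy values: in both sets the joint probability is 1.5 times
# the product of the marginals, so the properties are NOT independent.
p1_rel, p2_rel, joint_rel = 0.5, 0.4, 0.30   # 0.30 != 0.5 * 0.4
p1_non, p2_non, joint_non = 0.2, 0.1, 0.03   # 0.03 != 0.2 * 0.1

independent_in_rel = math.isclose(joint_rel, p1_rel * p2_rel)
linked = linked_dependence_holds(joint_rel, p1_rel, p2_rel,
                                 joint_non, p1_non, p2_non)
print(independent_in_rel, linked)
```

With these values independence fails in both sets while the linked dependence equality is satisfied, which is the sense in which the assumption is weaker.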
The role of the third equation, as developed for this experiment at least, is to correct
for deficiencies in the second equation. There are two major sources of distortion to
contend with. One is that the validity of Eq. (2) depends on the Linked Dependence
Assumption, a simplifying assumption that is at best only approximately true. Especially
when the number N of term matches is large, Eq. (2) (if uncorrected) is capable of
grossly overestimating the logodds of relevance. The other source of distortion is that
Eq. (2) as it stands fails to take into account the fact that longer documents will tend to
produce more term matches than shorter ones simply by chance. If nothing were done to
correct it, longer documents could receive much higher relevance probability estimates
than shorter ones merely by virtue of their length.
This latter failing is related to a subtle criticism that can be raised against the policy
of using only term matches, never mismatches, as the composite clues. Actually, a term
match (i.e. term present in both query and document) is only one of four conditions that
might obtain for a term vis-a-vis a given query and document. The others are: ii) term
present in query but absent from document; iii) term present in document but absent from
query; and iv) term absent from both query and document. Had clues of other types been
recognized, accorded their own regression equations, and allowed to make their own
contribution to Z, it might not have been necessary to make any correction for document
length. But because we have taken the computationally convenient shortcut of ignoring
all evidence of types ii) - iv) at the first stage of the analysis, we must compensate
somehow at the second stage for the distortion caused by that oversimplification.
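The four conditions can be made concrete with a short sketch. It is illustrative only: the function name, the label i) for the term match itself, and the toy stems are assumptions, not part of the experiment. It classifies each stem of a small vocabulary with respect to one query and one document.

```python
from collections import Counter


def term_condition(term, query_terms, doc_terms):
    """Return which of the four conditions holds for a term.

    "i"   : present in both query and document (a term match)
    "ii"  : present in query, absent from document
    "iii" : present in document, absent from query
    "iv"  : absent from both
    """
    in_query = term in query_terms
    in_doc = term in doc_terms
    if in_query and in_doc:
        return "i"
    if in_query:
        return "ii"
    if in_doc:
        return "iii"
    return "iv"


# Hypothetical toy data.
vocabulary = {"merger", "bank", "stock", "oil", "tax"}
query = {"merger", "bank"}
document = {"bank", "stock", "oil"}

counts = Counter(term_condition(t, query, document) for t in vocabulary)
print(dict(counts))
```

A longer document contains more distinct stems, so it accumulates more conditions of type i) and iii) and fewer of type ii) and iv). Only type i) is used at the first stage, which is why a length correction is needed later.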
The corrective equation, as developed for the WSJ collection through another
application of logistic regression to the learning sample, is
$$\log O(R \mid A_1, \ldots, A_N) = 6.08 + 3.63 \log \max(Z, 1) - 1.45 \log L \tag{3}$$
where
Z = the value of the summation expression in the right side of Eq. (2);
max(Z, 1) is the larger of Z and 1; and
L = the length of the document under consideration, expressed as the total number
of stem occurrences in the document counting separate occurrences of the same
stem separately.
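As a minimal sketch of how Eq. (3) might be applied, the code below evaluates the corrected logodds for a given Z and document length and converts the result to a probability. The coefficients are copied as printed above for the WSJ collection. The function names are hypothetical, and the use of natural logarithms is an assumption, since the base is not stated here.

```python
import math


def corrected_logodds(z, doc_length):
    """Second-stage estimate of the logodds of relevance, per Eq. (3).

    z          : value of the summation in the right side of Eq. (2)
    doc_length : total number of stem occurrences in the document (> 0)
    Assumes natural logarithms (the base is an assumption).
    """
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    return 6.08 + 3.63 * math.log(max(z, 1.0)) - 1.45 * math.log(doc_length)


def logodds_to_probability(logodds):
    """Convert logodds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-logodds))


# Same first-stage evidence Z, two hypothetical document lengths.
short_doc = corrected_logodds(z=20.0, doc_length=200)
long_doc = corrected_logodds(z=20.0, doc_length=2000)
print(logodds_to_probability(short_doc) > logodds_to_probability(long_doc))
```

For equal Z the longer document receives the lower estimate, which is the length compensation described above. Taking the logarithm of max(Z, 1) rather than using Z directly damps the overestimation that occurs when many terms match.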