NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
W. Cooper, F. Gey, A. Chen
Donna K. Harman (editor), National Institute of Standards and Technology

Eq. (2) follows from the 'Assumption of Linked Dependence.' Considering for simplicity only the special case of two properties, this assumption can be expressed as the equality

$$\frac{P(A_1, A_2 \mid R)}{P(A_1, A_2 \mid \bar{R})} = \frac{P(A_1 \mid R)}{P(A_1 \mid \bar{R})} \cdot \frac{P(A_2 \mid R)}{P(A_2 \mid \bar{R})}$$

Intuitively this asserts that the degree of dependency that exists between properties $A_1$ and $A_2$ in the set of relevance-related query-document pairs is linked in a certain way with the degree of dependency that exists among the nonrelevance-related pairs. It is a weaker assumption than the independence postulates commonly encountered in the literature on probabilistic IR. For discussion and a derivation of Eq. (2) from the Linked Dependence Assumption see Cooper (1991) and Cooper, Dabney & Gey (1992 op. cit.).

The role of the third equation, as developed for this experiment at least, is to correct for deficiencies in the second equation. There are two major sources of distortion to contend with. One is that the validity of Eq. (2) depends on the Linked Dependence Assumption, a simplifying assumption that is at best only approximately true. Especially when the number $N$ of term matches is large, Eq. (2) (if uncorrected) is capable of grossly overestimating the logodds of relevance. The other source of distortion is that Eq. (2) as it stands fails to take into account the fact that longer documents will tend to produce more term matches than shorter ones simply by chance. If nothing were done to correct it, longer documents could receive much higher relevance probability estimates than shorter ones merely by virtue of their length.
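The length effect can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes, purely for illustration, that every stem occurrence in a document is drawn independently and uniformly from a vocabulary of `vocab_size` stems, and the function name and all parameter values are hypothetical. Under that assumption the expected number of query terms that match by chance grows steadily with document length.

```python
def expected_chance_matches(num_query_terms, doc_length, vocab_size):
    """Expected number of query terms that occur at least once in a
    random document (toy model: uniform, independent stem draws)."""
    # Probability that one particular stem shows up at least once
    # among doc_length independent uniform draws.
    p_term_present = 1.0 - (1.0 - 1.0 / vocab_size) ** doc_length
    # Linearity of expectation over the query terms.
    return num_query_terms * p_term_present


# Hypothetical values: a 10-term query and a 50,000-stem vocabulary.
short_doc = expected_chance_matches(10, 100, 50_000)
long_doc = expected_chance_matches(10, 5_000, 50_000)
print(f"short document: {short_doc:.4f} expected chance matches")
print(f"long document:  {long_doc:.4f} expected chance matches")
```

Real term distributions are far from uniform, so the numbers themselves carry no weight; the point is only that the count of accidental matches rises with length even when relevance plays no part.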
This latter failing is related to a subtle criticism that can be raised against the policy of using only term matches, never mismatches, as the composite clues. Actually, a term match (i.e. term present in both query and document) is only one of four conditions that might obtain for a term vis-a-vis a given query and document. The others are: ii) term present in query but absent from document; iii) term present in document but absent from query; and iv) term absent from both query and document. Had clues of other types been recognized, accorded their own regression equations, and allowed to make their own contribution to $Z$, it might not have been necessary to make any correction for document length. But because we have taken the computationally convenient shortcut of ignoring all evidence of types ii) - iv) at the first stage of the analysis, we must compensate somehow at the second stage for the distortion caused by that oversimplification.

The corrective equation, as developed for the WSJ collection through another application of logistic regression to the learning sample, is

$$\log O(R \mid A_1, \ldots, A_N) = 6.08 + 3.63 \log \max(Z, 1) - 1.45 \log L \tag{3}$$

where $Z$ = the value of the summation expression in the right side of Eq. (2); $\max(Z, 1)$ is the larger of $Z$ and 1; and $L$ = the length of the document under consideration, expressed as the total number of stem occurrences in the document counting separate occurrences of the same stem separately.
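As a minimal sketch of how Eq. (3) might be evaluated, the function below applies the coefficients as they appear in the text. Two points are assumptions rather than statements from the paper: the logarithms are taken to be natural logarithms, and the constant term is used with the sign shown in the text. The function names and the sample values of Z and L are hypothetical.

```python
import math


def corrected_log_odds(z, doc_length):
    """Second-stage log odds of relevance following Eq. (3).

    z          -- value of the summation in Eq. (2)
    doc_length -- total stem occurrences in the document (L)
    Assumes natural logarithms; the text does not state the base.
    """
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    # max(z, 1) keeps the logarithm non-negative when Z is small.
    return 6.08 + 3.63 * math.log(max(z, 1.0)) - 1.45 * math.log(doc_length)


def log_odds_to_probability(log_odds):
    """Convert log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical inputs: the same Z for a short and a long document.
for length in (200, 2000):
    lo = corrected_log_odds(5.0, length)
    print(length, round(lo, 3), round(log_odds_to_probability(lo), 4))
```

With Z held fixed, the longer document receives the lower estimate, which is the length compensation the second stage is meant to provide.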