NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression

W. Cooper, A. Chen, F. Gey

National Institute of Standards and Technology, D. K. Harman (editor)

The second way of determining the coefficients -- the one used in TREC-2 -- is the `pairs-only' approach. It requires only one regression analysis, performed on a sample of query-document pairs. It is based on the trivial observation that in the right side of the above array, instead of adding across rows and then down the resulting column of sums, one can equivalently add down columns and across the resulting row of sums. Under either procedure the grand total value for $\log O(R \mid A_1, \ldots, A_M)$ will be the same. Summing down the columns and then across the totals gives the expression shown in the bottom line of the array. It simplifies to

$$\log O(R \mid A_1, \ldots, A_M) = a_1 \sum_{m=1}^{M} X_{m,1} + \cdots + a_K \sum_{m=1}^{M} X_{m,K} + (a_0 - b_0 + b_1) M + (a_{K+1} - b_1) M^2$$

Since there is no need to keep the $a$ coefficients segregated from the $b$ coefficients to get a predictive equation, this suggests the adoption of a regression equation of form

$$\log O(R \mid A_1, \ldots, A_M) = C_0 + C_1 \sum_{m=1}^{M} X_{m,1} + \cdots + C_K \sum_{m=1}^{M} X_{m,K} + C_{K+1} M + C_{K+2} M^2$$

The coefficients $C_0, \ldots, C_{K+2}$ may be found by a logistic regression on a sample of query-document pairs constructed from the learning sample. In the sample each pair is accompanied by its $K$ different $X_{m,k}$-values, each already summed over all match terms for the pair, the values of $M$ and $M^2$, and (to serve as the dependent variable in the regression) the human relevance judgement for the pair.

But if only one level of regression analysis is to be performed, where is the correction for the Assumption of Linked Dependence to take place?
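As a minimal sketch of how one row of the pairs-only regression sample described above might be assembled, the Python below sums each clue over the match terms of a query-document pair and appends $M$ and $M^2$. The function names are hypothetical and not taken from the paper, and natural logarithms are assumed when converting log odds to a probability.

```python
import math
from typing import List, Sequence


def pair_features(term_clues: Sequence[Sequence[float]]) -> List[float]:
    """Build one regression row for a query-document pair.

    term_clues[m][k] holds X_{m,k}, the value of clue k for match term m.
    Returns [1, sum_m X_{m,1}, ..., sum_m X_{m,K}, M, M^2], where the
    leading 1 multiplies the intercept C_0.
    """
    M = len(term_clues)
    if M == 0:
        raise ValueError("pair has no match terms")
    K = len(term_clues[0])
    if any(len(row) != K for row in term_clues):
        raise ValueError("every match term needs the same K clue values")
    # Sum down the columns: one total per clue, over all match terms.
    sums = [sum(row[k] for row in term_clues) for k in range(K)]
    return [1.0] + sums + [float(M), float(M * M)]


def log_odds(coeffs: Sequence[float], features: Sequence[float]) -> float:
    """Evaluate C_0 + C_1*sum X_{m,1} + ... + C_{K+1}*M + C_{K+2}*M^2."""
    if len(coeffs) != len(features):
        raise ValueError("need exactly K + 3 coefficients")
    return sum(c * x for c, x in zip(coeffs, features))


def probability(logodds: float) -> float:
    """Convert natural-log odds of relevance to a probability."""
    return 1.0 / (1.0 + math.exp(-logodds))
```

In a fitting step, each such row would be paired with the human relevance judgement for the pair and passed to a logistic regression routine; at retrieval time, `log_odds` with the fitted coefficients would rank the documents.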
The Assumption of Linked Dependence causes mischief because it creates a tendency for the predicted logodds of relevance to increase roughly linearly with the number of match terms, whereas the true increase is less than linear. This can be corrected by modifying the variables in such a way that their values rise less rapidly than the number of match terms as the number of match terms increases. The variables can, for instance, be multiplied by some function $f(M)$ that drops gently with increasing $M$, say [illegible] or $\frac{1}{1+\log M}$. The exact form of the function can be decided during the course of the regression analysis. With such a damping factor included, the foregoing regression equation becomes Equation (2):

$$\log O(R \mid A_1, \ldots, A_M) = C_0 + C_1 f(M) \sum_{m=1}^{M} X_{m,1} + \cdots + C_K f(M) \sum_{m=1}^{M} X_{m,K} + C_{K+1} M + C_{K+2} M^2$$

In our experiments, this simple modification was found to improve the regression fit and the precision/recall performance. It would appear therefore to be a worthwhile refinement of the basic model.

Note, however, that this adjustment only removes a general bias. It does nothing to exploit the possibility of measuring dependencies between particular stems to improve retrieval effectiveness. To exploit individual dependencies would be desirable in principle, but would require a substantial elaboration of the model for what might well turn out to be an insignificant improvement in effectiveness (for discussion see Cooper (1991)).

Optimally Relativized Frequencies

The philosophy of the project called for the use of a few well-chosen retrieval clues. The most obvious clues to be exploited in connection with match terms are their frequencies of occurrence in the query and document. What is not so obvious is the exact form the frequencies should take. For instance, should they be absolute or relative frequencies, or something in between?

The IR literature mentions two ways in which frequencies might be relativized for use in retrieval.
The first is to divide the absolute frequency of occurrence of the term in the query or document by the length of the query or document, or some parameter closely associated with length. The second is to divide the relative frequency so obtained by the relative frequency of occurrence of the term in the entire collection considered as one long running text. Both kinds of relativization seem potentially beneficial, but the question remains whether these relativizations, if they are indeed helpful, should be carried out in full strength, or whether some sort of blend of absolute and relative frequencies might serve better. To answer this question, regression techniques were used in a side investigation to discover the optimal extent