SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
chapter
W. Cooper
A. Chen
F. Gey
National Institute of Standards and Technology
D. K. Harman
The second way of determining the coefficients, the one
used in TREC-2, is the `pairs-only' approach. It requires
only one regression analysis, performed on a sample of
query-document pairs. It is based on the trivial observation
that in the right side of the above array, instead of adding
across rows and then down the resulting column of sums, one
can equivalently add down columns and across the resulting
row of sums. Under either procedure the grand total value
for $\log O(R \mid A_1, \ldots, A_M)$ will be the same.
Summing down the columns and then across the totals gives
the expression shown in the bottom line of the array. It
simplifies to

$$\log O(R \mid A_1, \ldots, A_M) = a_1 \sum_{m=1}^{M} X_{m,1} + \cdots + a_K \sum_{m=1}^{M} X_{m,K} + (a_0 - b_0 + b_1) M + (a_{K+1} - b_1) M^2$$

Since there is no need to keep the a coefficients segregated
from the b coefficients to get a predictive equation, this
suggests the adoption of a regression equation of form

$$\log O(R \mid A_1, \ldots, A_M) = C_0 + C_1 \sum_{m=1}^{M} X_{m,1} + \cdots + C_K \sum_{m=1}^{M} X_{m,K} + C_{K+1} M + C_{K+2} M^2$$

The coefficients $C_0, \ldots, C_{K+2}$ may be found by a
logistic regression on a sample of query-document pairs
constructed from the learning sample. In the sample each
pair is accompanied by its $K$ different $X_{m,k}$-values,
each already summed over all match terms for the pair, the
values of $M$ and $M^2$, and (to serve as the dependent
variable in the regression) the human relevance judgement
for the pair.
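
The construction of the regression sample described above can be
sketched in a few lines of Python. This is only an illustration
under stated assumptions, not the authors' software: the function
names, the list-of-lists layout for the per-term clues, the toy
data, and the use of plain gradient ascent in place of a
statistical package are all hypothetical choices made for the
example.

```python
import math

def pair_features(term_clues, K):
    """Build one regression row for a query-document pair.

    term_clues holds one list [X_m1, ..., X_mK] per match term
    (a hypothetical layout). The row is the K column sums over
    all match terms, followed by M and M**2.
    """
    M = len(term_clues)
    sums = [sum(term[k] for term in term_clues) for k in range(K)]
    return sums + [float(M), float(M * M)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(coef, row):
    # coef[0] plays the role of C_0; coef[1:] line up with the row.
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row))

def log_likelihood(coef, rows, labels):
    total = 0.0
    for row, y in zip(rows, labels):
        p = sigmoid(log_odds(coef, row))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(rows, labels, steps=2000, lr=0.01):
    """Gradient ascent on the mean log-likelihood (illustrative only)."""
    coef = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(coef)
        for row, y in zip(rows, labels):
            err = y - sigmoid(log_odds(coef, row))
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        coef = [c + lr * g / n for c, g in zip(coef, grad)]
    return coef

# Toy learning sample with K = 1 clue per match term.
# The clue values and relevance labels are made up.
sample = [
    ([[2.0], [1.5]], 1),
    ([[0.5]], 0),
    ([[1.0], [1.0], [2.0]], 1),
    ([[0.2], [0.3]], 0),
]
rows = [pair_features(clues, K=1) for clues, _ in sample]
labels = [y for _, y in sample]
coef = fit_logistic(rows, labels)  # C_0, C_1, then the M and M^2 terms
```

Each row carries the summed clue values together with M and M^2,
so a single regression over the pairs yields all coefficients at
once. In practice a tested statistical routine would replace the
hand-written fitting loop.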
But if only one level of regression analysis is to be
performed, where is the correction for the Assumption of
Linked Dependence to take place? That assumption
causes mischief because it creates a tendency for the
predicted log odds of relevance to increase roughly linearly
with the number of match terms, whereas the true increase
is less than linear. This can be corrected by modifying the
variables in such a way that their values rise less rapidly
than the number of match terms as the number of match
terms increases. The variables can, for instance, be
multiplied by some function $f(M)$ that drops gently with
increasing $M$, say $1/\sqrt{M}$ or $1/(1 + \log M)$. The
exact form of the function can be decided during the course
of the regression analysis.
With such a damping factor included, the foregoing
regression equation becomes

Equation (2):

$$\log O(R \mid A_1, \ldots, A_M) = C_0 + C_1 f(M) \sum_{m=1}^{M} X_{m,1} + \cdots + C_K f(M) \sum_{m=1}^{M} X_{m,K} + C_{K+1} M + C_{K+2} M^2$$

In our experiments, this simple modification was found to
improve the regression fit and the precision/recall
performance. It would appear therefore to be a worthwhile
refinement of the basic model. Note, however, that this
adjustment only removes a general bias. It does nothing
to exploit the possibility of measuring dependencies
between particular stems to improve retrieval effective-
ness. To exploit individual dependencies would be desir-
able in principle, but would require a substantial elabora-
tion of the model for what might well turn out to be an
insignificant improvement in effectiveness (for discussion
see Cooper (1991)).
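
To make the effect of the damping factor in Equation (2) concrete,
the following Python sketch evaluates the damped log odds for one
query-document pair. It is a minimal illustration, not the code
used in the experiments: the function names and the coefficient
values are hypothetical, the inverse-square-root damping is an
assumed example of a gently dropping f(M), and the logarithm is
taken to be natural.

```python
import math

def damp_inv_sqrt(M):
    # Assumed example of a damping function: 1 / sqrt(M).
    return 1.0 / math.sqrt(M)

def damp_inv_log(M):
    # Alternative damping function: 1 / (1 + log M), natural log assumed.
    return 1.0 / (1.0 + math.log(M))

def log_odds_eq2(term_clues, coef, f=damp_inv_sqrt):
    """Evaluate Equation (2) for one query-document pair.

    term_clues: one list [X_m1, ..., X_mK] per match term.
    coef: [C_0, C_1, ..., C_K, C_{K+1}, C_{K+2}] (hypothetical values).
    """
    M = len(term_clues)
    if M == 0:
        raise ValueError("Equation (2) needs at least one match term")
    K = len(coef) - 3
    total = coef[0]
    for k in range(K):
        column_sum = sum(term[k] for term in term_clues)
        total += coef[k + 1] * f(M) * column_sum
    total += coef[K + 1] * M + coef[K + 2] * M * M
    return total

def probability(log_odds):
    # Convert log odds of relevance to a probability of relevance.
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two match terms, one clue each; all numbers are invented.
value = log_odds_eq2([[1.0], [3.0]], [-1.0, 0.5, 0.1, -0.01])
```

With M identical match terms, the undamped sum grows in proportion
to M, whereas the sum multiplied by 1/sqrt(M) grows only as
sqrt(M). Both damping functions equal 1 at M = 1, so single-term
matches are left unchanged.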
Optimally Relativized Frequencies
The philosophy of the project called for the use of a
few well-chosen retrieval clues. The most obvious clues
to be exploited in connection with match terms are their
frequencies of occurrence in the query and document.
What is not so obvious is the exact form the frequencies
should take. For instance, should they be absolute or rela-
tive frequencies, or something in between?
The IR literature mentions two ways in which fre-
quencies might be relativized for use in retrieval. The first
is to divide the absolute frequency of occurrence of the
term in the query or document by the length of the query
or document, or some parameter closely associated with
length. The second is to divide the relative frequency so
obtained by the relative frequency of occurrence of the
term in the entire collection considered as one long run-
ning text. Both kinds of relativization seem potentially
beneficial, but the question remains whether these rela-
tivizations, if they are indeed helpful, should be carried
out in full strength, or whether some sort of blend of
absolute and relative frequencies might serve better.
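
The two relativizations just described, and one possible blend,
can be written out as a short Python sketch. The function names,
the token lists, and in particular the blending formula with its
constant c are hypothetical, introduced only for illustration;
nothing here is taken from the side investigation itself.

```python
from collections import Counter

def absolute_frequency(term, tokens):
    # Raw count of the term in a query or document.
    return Counter(tokens)[term]

def relative_frequency(term, tokens):
    # First relativization: divide by length (tokens must be non-empty).
    return absolute_frequency(term, tokens) / len(tokens)

def collection_relativized(term, tokens, collection_tokens):
    # Second relativization: divide the relative frequency by the
    # term's relative frequency in the collection, treated as one
    # long running text. The term must occur in the collection.
    return (relative_frequency(term, tokens)
            / relative_frequency(term, collection_tokens))

def blended_frequency(term, tokens, c):
    # Hypothetical blend: c = 0 gives the plain relative frequency,
    # while a large c makes the value nearly proportional to the
    # absolute frequency.
    return absolute_frequency(term, tokens) / (len(tokens) + c)

# Invented toy data: a 4-token document inside a 20-token collection.
doc = ["stem", "rank", "stem", "text"]
collection = doc + ["rank"] * 5 + ["other"] * 11
ratio = collection_relativized("stem", doc, collection)
```

Here "stem" has absolute frequency 2 and relative frequency 0.5 in
the document, and relative frequency 0.1 in the collection, so the
fully relativized value is about 5. Varying c in the blend moves
the measure between the relative and absolute extremes.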
To answer this question, regression techniques were
used in a side investigation to discover the optimal extent