SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
chapter
W. Cooper
F. Gey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
such as the Akaike Information Criterion. Another was the extent to which the ordering
imposed on the triples by the logodds estimates assigned to them by the model resembled
the ideal ordering in which all relevant triples precede all nonrelevant triples. This was
measured statistically using various rank correlation coefficients, including Kendall's
Tau-a and the Goodman-Kruskal Gamma. Still another property that was taken into account
was the shape of a variable's graph when the observed logodds was plotted against the
variable values arranged along the X-axis in deciles. In such a graph, a shape that
roughly resembles a straight line is considered desirable. In some cases it was found that
pretransforming a variable by taking its logarithm helped to produce such a straight line.
This was in fact one of the motivations for logging the variables. Another motive for
doing so was the observation that it seemed to improve the general fit of the model as
measured by the other indicators.
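
As an illustration of the rank-correlation criterion described above, the following minimal sketch computes Kendall's Tau-a and the Goodman-Kruskal Gamma between estimated logodds and binary relevance judgements. It is not the SAS procedure used in the study; the function names and the sample data are hypothetical, and pairs tied on either variable are assumed to count as neither concordant nor discordant.

```python
from itertools import combinations


def concordance_counts(scores, relevance):
    """Count concordant and discordant pairs of triples.

    A pair is concordant when the triple with the higher estimated logodds
    is also the relevant one; pairs tied on either variable count as neither.
    """
    concordant = discordant = 0
    for (s1, r1), (s2, r2) in combinations(zip(scores, relevance), 2):
        product = (s1 - s2) * (r1 - r2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return concordant, discordant


def kendall_tau_a(scores, relevance):
    """(C - D) divided by the total number of pairs, n(n - 1)/2."""
    c, d = concordance_counts(scores, relevance)
    n = len(scores)
    return (c - d) / (n * (n - 1) / 2)


def goodman_kruskal_gamma(scores, relevance):
    """(C - D) divided by the number of untied pairs, C + D."""
    c, d = concordance_counts(scores, relevance)
    if c + d == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (c - d) / (c + d)


# Hypothetical sample: estimated logodds and relevance (1 = relevant).
example_scores = [2.0, 1.5, 0.3, -0.4, -1.2]
example_relevance = [1, 1, 0, 1, 0]
tau_a = kendall_tau_a(example_scores, example_relevance)
gamma = goodman_kruskal_gamma(example_scores, example_relevance)
```

In the ideal ordering, where all relevant triples precede all nonrelevant ones, Gamma equals 1. Tau-a stays below 1 for binary judgements because pairs tied on relevance remain in its denominator.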
Fortunately, the various criteria used to optimize the choice of variables were rarely
in qualitative conflict. But there was insufficient time to explore all variables of potential
interest, or to investigate more elaborate possibilities such as interaction terms. All
regression analyses were run using the SAS statistical package (Version 6.06) on an IBM
3090 mainframe computer with accelerated math capabilities. The SAS package
automatically supplies most of the diagnostic statistics mentioned above. Detailed
discussions of logistic model building can be found in works on logistic regression (Hosmer &
Lemeshow 1989; Collett 1991).
It is worth noting that all the choices among possible predictor variables were
made, and their weights in the equation determined, as part of the regression analysis.
There was no recourse to traditional retrieval trials based on precision, recall, and the
like. The regression procedures were found to be more convenient and efficient than
experimentation of the usual kind in which there is no way other than trial-and-error to
converge on optimal numeric coefficients. Ideally one might think of combining the two
techniques -- that is, of using traditional methods to confirm the findings of the regression
studies. However, time pressures prevented the pursuit of that luxury.
Next, Eq. (1) was applied to each triple in the sample to calculate the estimate of
the logodds of relevance $\log O(R \mid A_i)$ for the query and document in question, given the
characteristics of the match stem. Then, in accordance with Eq. (2), the estimated prior
logodds of relevance $\log O(R) = -6.725$ was subtracted from each of these estimates,
and the differences summed within each query-document pair to obtain the value of Z for
the pair.
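
The summation just described can be sketched as follows. This is only an illustration under stated assumptions: the triple layout (query id, document id, first-stage logodds estimate), the function name, and the sample values are hypothetical, and only the prior logodds of -6.725 comes from the text.

```python
from collections import defaultdict

# Estimated prior logodds of relevance, as reported in the text.
PRIOR_LOGODDS = -6.725


def z_by_pair(triples, prior_logodds=PRIOR_LOGODDS):
    """Sum (first-stage logodds - prior logodds) within each query-document pair.

    `triples` is an iterable of (query_id, doc_id, logodds) tuples, one per
    match stem; this layout is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    for query_id, doc_id, logodds in triples:
        totals[(query_id, doc_id)] += logodds - prior_logodds
    return dict(totals)


# Hypothetical first-stage estimates for three match stems.
example_triples = [
    ("q1", "d1", -5.0),
    ("q1", "d1", -6.0),
    ("q1", "d2", -4.725),
]
z_values = z_by_pair(example_triples)
```

Each match stem contributes the amount by which its estimated logodds exceeds the prior, so a pair with more (or stronger) matching stems accumulates a larger Z.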
For the second-stage regression, the value of Z calculated as just described,
together with the length L of the document, was recorded for each query-document pair
in the sample. From these the value of $\frac{\log Z}{L^{0.4}}$ was calculated for each pair. (To ensure
that the logarithm would always be well defined, Z was first replaced by max(Z, 1).)
Using this ratio as the independent variable and the binary relevance judgements as the
dependent variable, a second weighted logistic regression was run. This produced two
coefficients -- an intercept and a slope -- specifying a regression equation. Simple
manipulations were used to transform it into the form displayed earlier as Eq. (3).
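
A minimal sketch of this second stage is given below. It is not the SAS procedure used in the study. It assumes the independent variable is log(max(Z, 1)) divided by a power of L, with the exponent 0.4 treated as an assumption and exposed as a parameter, and it fits the intercept and slope by a plain Newton-Raphson iteration. The function names, the optional weights argument, and the sample data are hypothetical.

```python
import math


def second_stage_variable(z, length, exponent=0.4):
    """log(max(Z, 1)) / L**exponent; the exponent value is an assumption."""
    return math.log(max(z, 1.0)) / length ** exponent


def fit_logistic(xs, ys, weights=None, iterations=25):
    """Weighted one-variable logistic regression by Newton-Raphson.

    Returns (intercept, slope). Assumes the data are not perfectly
    separable, so that the maximum-likelihood estimates are finite.
    """
    if weights is None:
        weights = [1.0] * len(xs)
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in zip(xs, ys, weights):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += w * (y - p)            # gradient w.r.t. intercept
            g1 += w * (y - p) * x        # gradient w.r.t. slope
            v = w * p * (1.0 - p)
            h00 += v                     # information matrix entries
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical values of the independent variable and relevance judgements.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 1, 0, 1, 1]
intercept, slope = fit_logistic(xs, ys)
```

The fitted intercept and slope play the role of the two coefficients mentioned above; per-pair weights, if any, can be passed through the `weights` argument.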
Here are some of the considerations that led to the use of $\frac{\log Z}{L^{0.4}}$ as the variable on