NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
W. Cooper
F. Gey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
were to be treated as though they were a random sample of one out of every ten members
of the first of these two sets. This implies among other things the supposition that the
professional searchers had found 10% of the relevance-related pairs that could have been
found had the entire collection been examined for every query. The 10% figure was an
unsupported guess.
This working assumption made it possible to construct a two-part (stratified) sam-
ple for the WSJ collection. The first portion or stratum of the sample consisted of the set
of 2194 query-document pairs for which human judgements of relevance and nonrele-
vance were available. In the subsequent regression analysis, each of these was weighted
by a factor of 10, reflecting the assumption stated above. The second part of the con-
structed sample consisted of a set of 1687 query-document pairs drawn essentially ran-
domly from among all query-document pairs that shared at least one stem in common.
(The latter sample was obtained by considering all query-document pairs constructible
from the 52 training queries and the documents in the WSJ training collection, ordering
them by query number and within query number by document number, choosing every
2444th pair from this ordering, and discarding any pairs so chosen for which there was no
overlap between query and document. The result may be considered a random sample,
though technically it was slightly preferable to a random sample insofar as it was stratified
by query.) The pairs in this second set were assumed to lie outside the first set and to be
nonrelevance-related (though not verified, this assumption was probably approximately
true). Each pair was accorded a weight of 2400 in the subsequent regression analysis.
It may be worth reiterating that the necessity of specifying the number of rele-
vance-related pairs by sheer guesswork, together with the general artificiality of the con-
structed sample, dashed all hope of producing well-calibrated estimates to present to the
users.
The Regression Analysis
As the first stage of the regression analysis, for each query-document pair in the
sample the set of all stems shared in common by the query and the document was assem-
bled. This resulted in an expanded sample of 30,234 query-document-stem triples. For
each triple, the values of the six statistics X1, . . . , X6 were calculated and recorded along
with the relevance judgement associated with the query and document in question. A
weighted logistic regression analysis was performed on this data, with X1, . . . , X6 serving
as the independent variables and the binary relevance judgements as the dependent vari-
able. The result was the set of coefficients in Eq. (1). In other words, Eq. (1) is the
regression equation that was found to be the logistic equation of maximum likelihood
(the `best fit') for the data in the sample.
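The weighted maximum-likelihood fit described above can be sketched with a small self-contained routine. This is not the authors' implementation: the example below uses a generic gradient-ascent fit with a single predictor rather than the six statistics X1, . . . , X6, and all names are hypothetical; only the idea of case-weighted logistic regression follows the text.

```python
# Minimal sketch of weighted logistic regression fitted by gradient
# ascent on the weighted log-likelihood (hypothetical illustration).
import math

def fit_weighted_logistic(X, y, w, lr=0.1, iters=5000):
    """Return coefficients (intercept first) maximizing the weighted
    log-likelihood of P(rel) = 1 / (1 + exp(-(b0 + b . x)))."""
    n_feat = len(X[0])
    beta = [0.0] * (n_feat + 1)
    for _ in range(iters):
        grad = [0.0] * (n_feat + 1)
        for xi, yi, wi in zip(X, y, w):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = wi * (yi - p)          # weighted residual
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        total_w = sum(w)
        beta = [b + lr * g / total_w for b, g in zip(beta, grad)]
    return beta
```

The fitted coefficients play the role of those in Eq. (1): the linear combination b0 + b . x is the estimated log-odds of relevance for a triple with statistics x.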
How were the six retrieval clue-types chosen? In fact, the regression analysis described above was performed repeatedly for various types and combinations of independent variables before these six were settled upon. Of those tried out, these turned out in
combination to exhibit the most predictive power -- that is, to offer the best prospects of
yielding useful estimates of logodds of relevance.
`Predictive power' was judged according to several interrelated criteria. One was
the extent to which a model based on a clue-combination under investigation was found
to fit the data, as measured by the -2 Log Likelihood statistic or one of its minor variants
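The -2 Log Likelihood criterion mentioned above can be sketched directly. This is an illustrative, hypothetical formulation: given each sample member's estimated relevance probability, its relevance judgement, and its stratum weight, a smaller value of the statistic indicates a better-fitting model, so competing clue-combinations can be ranked by it.

```python
# Sketch of the weighted -2 Log Likelihood fit statistic (hypothetical
# names); probs are a model's estimated relevance probabilities.
import math

def minus_two_log_likelihood(probs, y, w):
    ll = 0.0
    for p, yi, wi in zip(probs, y, w):
        ll += wi * (yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return -2.0 * ll
```

For example, a model assigning probability 0.9 to a relevant pair and 0.1 to a nonrelevant one yields a smaller statistic than an uninformative model assigning 0.5 to both.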