SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression (chapter)
W. Cooper, F. Gey, A. Chen
National Institute of Standards and Technology
Donna K. Harman (editor)

were to be treated as though they were a random sample of one out of every ten members of the first of these two sets. This implies, among other things, the supposition that the professional searchers had found 10% of the relevance-related pairs that could have been found had the entire collection been examined for every query. The 10% figure was an unsupported guess.

This working assumption made it possible to construct a two-part (stratified) sample for the WSJ collection. The first portion or stratum of the sample consisted of the set of 2194 query-document pairs for which human judgements of relevance and nonrelevance were available. In the subsequent regression analysis, each of these was weighted by a factor of 10, reflecting the assumption stated above. The second part of the constructed sample consisted of a set of 1687 query-document pairs drawn essentially randomly from among all query-document pairs that shared at least one stem in common. (The latter sample was obtained by considering all query-document pairs constructible from the 52 training queries and the documents in the WSJ training collection, ordering them by query number and, within query number, by document number, choosing every 2444th pair from this ordering, and discarding any pairs so chosen for which there was no overlap between query and document. The result may be considered a random sample, though technically it was slightly preferable to a random sample insofar as it was stratified by query.) The pairs in this second set were assumed to lie outside the first set and to be nonrelevance-related (though not verified, this assumption was probably approximately true).
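The systematic draw described in the parenthetical above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `Pair` record and its field names are hypothetical, and any real implementation would read actual query and document identifiers from the collection.

```python
from collections import namedtuple

# Hypothetical record for a candidate query-document pair; the
# shared_stems count stands in for the query/document stem overlap.
Pair = namedtuple("Pair", ["query_id", "doc_id", "shared_stems"])

def systematic_sample(pairs, step=2444):
    """Systematic sample of the kind described above: order the
    candidate pairs by query number and, within query number, by
    document number; take every `step`-th pair from that ordering;
    and discard any chosen pair with no stem overlap."""
    ordered = sorted(pairs, key=lambda p: (p.query_id, p.doc_id))
    chosen = ordered[step - 1 :: step]
    return [p for p in chosen if p.shared_stems > 0]
```

Because the ordering groups pairs by query before the fixed-interval draw, the result is stratified by query, which is the property the authors note makes it slightly preferable to a simple random sample.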
Each pair was accorded a weight of 2400 in the subsequent regression analysis.

It may be worth reiterating that the necessity of specifying the number of relevance-related pairs by sheer guesswork, together with the general artificiality of the constructed sample, dashed all hope of producing well-calibrated estimates to present to the users.

The Regression Analysis

As the first stage of the regression analysis, for each query-document pair in the sample the set of all stems shared in common by the query and the document was assembled. This resulted in an expanded sample of 30,234 query-document-stem triples. For each triple, the values of the six statistics X1, ..., X6 were calculated and recorded along with the relevance judgement associated with the query and document in question. A weighted logistic regression analysis was performed on this data, with X1, ..., X6 serving as the independent variables and the binary relevance judgements as the dependent variable. The result was the set of coefficients in Eq. (1). In other words, Eq. (1) is the regression equation that was found to be the logistic equation of maximum likelihood (the 'best fit') for the data in the sample.

How were the six retrieval clue types chosen? Actually, the regression analysis described above was performed repeatedly for various types and combinations of independent variables before these six were settled upon. Of those tried out, these turned out in combination to exhibit the most predictive power -- that is, to offer the best prospects of yielding useful estimates of the log odds of relevance. 'Predictive power' was judged according to several interrelated criteria. One was the extent to which a model based on a clue combination under investigation was found to fit the data, as measured by the -2 Log Likelihood statistic or one of its minor variants.
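The fitting procedure just described can be sketched as a weighted logistic regression. The data below are synthetic stand-ins (the paper's actual triples and clue statistics are not reproduced here), the mix of case weights of 10 and 2400 mimics the two sampling strata, and the iteratively reweighted least squares fitter is a generic maximum-likelihood routine, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the training sample: six clue statistics
# X1..X6 per case, a binary relevance label, and a case weight
# (10 for judged pairs, 2400 for the systematically drawn,
# presumed-nonrelevant pairs). The coefficients are invented.
n = 500
X = rng.normal(size=(n, 6))
true_beta = np.array([0.8, -0.5, 0.3, 0.0, 0.2, -0.1])
p = 1.0 / (1.0 + np.exp(-(X @ true_beta - 1.0)))
y = (rng.random(n) < p).astype(float)
w = np.where(rng.random(n) < 0.5, 10.0, 2400.0)

def weighted_logistic_fit(X, y, w, iters=25):
    """Maximum-likelihood logistic fit with case weights, via
    Newton-Raphson (iteratively reweighted least squares)."""
    Xd = np.column_stack([np.ones(len(X)), X])   # intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(Xd @ beta)))
        grad = Xd.T @ (w * (y - mu))                      # weighted score
        H = (Xd * (w * mu * (1.0 - mu))[:, None]).T @ Xd  # weighted Hessian
        beta += np.linalg.solve(H, grad)
    return Xd, beta

Xd, beta = weighted_logistic_fit(X, y, w)

# The -2 Log Likelihood (deviance) statistic used to compare
# candidate clue combinations: smaller means a better fit.
mu = 1.0 / (1.0 + np.exp(-(Xd @ beta)))
neg2ll = -2.0 * np.sum(w * (y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu)))
print(beta)     # fitted intercept and six coefficients, as in Eq. (1)
print(neg2ll)   # fit statistic for this clue combination
```

Refitting with different subsets of candidate statistics and comparing the resulting -2 Log Likelihood values is the model-selection loop the paragraph above describes.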