NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
W. Cooper
F. Gey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
Calling this sum Z, Eq. (3) is applied to obtain the estimate of the logodds that the document is relevant to the query. Using Eq. (4), this logodds is translated into a probability, and the documents are sorted for the user in descending order of their estimated probabilities.
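This ranking step can be sketched in a few lines of Python. The sketch assumes the standard logistic transform for Eq. (4); the function and variable names are illustrative, not taken from the paper:

```python
import math

def logodds_to_probability(z):
    # Eq. (4), assumed here to be the usual logistic transform:
    # P = 1 / (1 + e^(-Z))
    return 1.0 / (1.0 + math.exp(-z))

def rank_documents(scored_docs):
    # scored_docs: list of (doc_id, z) pairs, where z is the summed
    # logodds estimate for that query-document pair from Eq. (3)
    ranked = [(doc_id, logodds_to_probability(z)) for doc_id, z in scored_docs]
    # sort in descending order of estimated relevance probability
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# hypothetical logodds scores for three documents against one query
print(rank_documents([("d1", -1.2), ("d2", 0.8), ("d3", 2.5)]))
```

Because the logistic transform is monotonic, the ordering produced is the same whether one sorts on Z directly or on the derived probabilities; the probabilities matter only when calibrated estimates are to be shown to the user.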
Constructing the Sample
The SLR method calls for the use of a learning sample of relevance judgements that can be assumed accurate. Among other requirements, the sample must (a) be sufficiently large and include a sufficient number of queries; and (b) have a definite, complete statistical structure of some kind (e.g. that of a random or a stratified sample) that can serve as a basis for making statistical inferences. Arguably the TIPSTER data supplied for TREC experimentation fulfilled condition (a) for one or two of the five collections (WSJ and perhaps ZIFF). However, so far as could be ascertained from the information made available to participants, the second condition was not satisfied for any of the five collections. The samples of relevance data that were provided had no apparent statistical structure of the kind upon which statistical reasoning is ordinarily based. The omission is understandable but unfortunate from the point of view of probabilistic retrieval.
For the present experiment the lack of a known statistical sample structure was
highly problematical and almost fatal. It meant that the method of interest could not be
rigorously applied to the available data. The theory on which SLR is based consists of a
careful chain of statistical reasoning. If one of the links is weak, scientifically sound estimates of relevance probability cannot be expected.
Rather than abandon the experiment entirely, however, the following stopgap measure was adopted. An arbitrary assumption would be made about the structure of the sample, and the analysis would be carried out on the basis of this fictitious assumption as though it were known to be true. It was thought that this desperation measure would at least allow the investigators to gain experience in applying the method to a large data set (albeit on a hypothetical basis), and to illustrate the methodology for other interested parties. Also, it was thought that the fictitious assumption could be chosen in such a way
that the experimental evidence gathered about the comparative worth of various retrieval
clues would probably have at least some value. Finally, it was hoped that the output
orderings might be reasonably effective in spite of the likely miscalibration of the
retrieval status values as probability estimates.
While this policy allowed participation in the venture to continue, the artificiality
of the constructed sample diminished to some unknown extent the retrieval effectiveness
of the prototype system. This should be remembered when interpreting the results of the
experiment: they do not fairly represent the SLR methodology's capabilities. Had the
investigators been in a position to design their own sampling procedure, the results would
presumably have been better. In the same spirit, it should be kept in mind that the level of accuracy of the probability estimates themselves cannot be taken as indicative of the reliability of SLR.
The arbitrary assumption that was settled upon for the sample construction was that
for the WSJ collection the universe of all query-document pairs is separable into two sets: one a small set rich in relevance-related pairs, the other a large set containing no relevance-related pairs. The supplied judgements of relevance and irrelevance for the WSJ