NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
W. Cooper, F. Gey, A. Chen

Calling this sum Z, Eq. (3) is applied to obtain the estimate of the logodds that the document is relevant to the query. Using Eq. (4), this logodds is translated into a probability, and the documents are sorted for the user in descending order of their estimated probabilities.

Constructing the Sample

The SLR method calls for the use of a learning sample of relevance judgements that can be assumed accurate. Among other requirements, the sample must (a) be sufficiently large and include a sufficient number of queries; and (b) have a definite, complete statistical structure of some kind (e.g. that of a random or a stratified sample) that can serve as a basis for making statistical inferences.

Arguably the TIPSTER data supplied for TREC experimentation fulfilled condition (a) for one or two of the five collections (WSJ and perhaps ZIFF). However, so far as could be ascertained from the information made available to participants, the second condition was not satisfied for any of the five collections. The samples of relevance data that were provided had no apparent statistical structure of the kind upon which statistical reasoning is ordinarily based. The omission is understandable but unfortunate from the point of view of probabilistic retrieval.

For the present experiment the lack of a known statistical sample structure was highly problematical and almost fatal. It meant that the method of interest could not be rigorously applied to the available data. The theory on which SLR is based consists of a careful chain of statistical reasoning. If one of the links is weak, scientifically sound estimates of relevance probability cannot be expected.
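The logodds-to-probability ranking step described above can be sketched in a few lines. Since Eqs. (3) and (4) are not reproduced in this excerpt, the sketch assumes the standard logistic transform P = e^Z / (1 + e^Z); the function name and document identifiers are illustrative only.

```python
import math

def rank_by_probability(logodds):
    """Given a mapping of document id -> estimated logodds Z of relevance,
    convert each Z to a probability via the logistic transform and return
    the documents sorted in descending order of estimated probability."""
    ranked = []
    for doc_id, z in logodds.items():
        p = 1.0 / (1.0 + math.exp(-z))  # P = e^Z / (1 + e^Z)
        ranked.append((doc_id, p))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# Example: three documents with different estimated logodds.
ranking = rank_by_probability({"d1": 2.0, "d2": -1.0, "d3": 0.5})
```

Because the logistic transform is monotonic, the ordering by probability is the same as the ordering by Z itself; the transform matters only for presenting calibrated probability estimates to the user.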
Rather than abandon the experiment entirely, however, the following stopgap measure was adopted. An arbitrary assumption would be made about the structure of the sample, and the analysis would be carried out on the basis of this fictitious assumption as though it were known to be true. It was thought that this desperation measure would at least allow the investigators to gain experience in applying the method to a large data set (albeit on a hypothetical basis), and to illustrate the methodology for other interested parties. Also, it was thought that the fictitious assumption could be chosen in such a way that the experimental evidence gathered about the comparative worth of various retrieval clues would probably have at least some value. Finally, it was hoped that the output orderings might be reasonably effective in spite of the likely miscalibration of the retrieval status values as probability estimates.

While this policy allowed participation in the venture to continue, the artificiality of the constructed sample diminished to some unknown extent the retrieval effectiveness of the prototype system. This should be remembered when interpreting the results of the experiment: they do not fairly represent the SLR methodology's capabilities. Had the investigators been in a position to design their own sampling procedure, the results would presumably have been better. In the same spirit it should be kept in mind that the level of accuracy of the probability estimates themselves cannot be taken as indicative of the reliability of SLR.

The arbitrary assumption that was settled upon for the sample construction was that for the WSJ collection the universe of all query-document pairs is separable into two sets: one a small set rich in relevance-related pairs, the other a large set containing no relevance-related pairs. The supplied judgements of relevance and irrelevance for the WSJ