NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology, Donna K. Harman (ed.)

Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression

Wm. S. Cooper
Fredric C. Gey
Aitao Chen

S.L.I.S., University of California
Berkeley, CA 94720

ABSTRACT: In this experiment the TIPSTER test collections were used as a vehicle for evaluating an approach to probabilistic retrieval for full-text documents. The methodology in question, called `staged logistic regression,' involves two or more stages of logistic regression analysis of a learning sample of relevance judgements. The aim is to produce effective initial probability rankings of documents, without undue computational complexity at run time, by applying regression equations derived with the help of standard statistical software packages. In addition, the experiment explored the feasibility of using equations derived from training data for one document collection in a different document collection for which no training data happens to be available, and of calculating document relevance probabilities accurately enough that they can be displayed as part of the output seen by the user. The regression equations were implemented as retrieval rules in an experimental prototype system obtained by modifying the SMART retrieval system.

Introduction

The Berkeley group's interest in participating in the NIST/TREC Conference was stimulated by the opportunity it offered to gain experience with a methodology called `Staged Logistic Regression.' This technique (hereinafter abbreviated `SLR') is a systematic approach to retrieval system design based on probabilistic and statistical principles.
It has been under study at Berkeley as a possible means of achieving effective probabilistic retrieval, including acceptably accurate estimates of relevance probability, without undue computational complexity.

In order to test out the SLR methodology on the TIPSTER data base in as straightforward a fashion as possible, attention was restricted to the problem of how to form the initial document ranking -- that is, the output ranking first offered by the system to the user in response to the user's original query. This problem should not be confused with the subsequent task of exploiting any relevance judgements that may be obtainable from the user once he or she has started down the output ranking and is actively examining documents. The latter problem -- the matter of how to exploit intra-search `relevance