SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
chapter
W. Cooper
F. Gey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
Probabilistic Retrieval in the TIPSTER Collections:
An Application of Staged Logistic Regression
Wm. S. Cooper
Fredric C. Gey
Aitao Chen
S.L.I.S., University of California
Berkeley, CA 94720
ABSTRACT: In this experiment the TIPSTER test collections were
used as a vehicle for evaluating an approach to probabilistic retrieval for
full-text documents. The methodology in question, called `staged logistic
regression,' involves two or more stages of logistic regression analysis of
a learning sample of relevance judgements. The aim is to produce effective
initial probability rankings of documents, without undue computational
complexity at run time, by applying regression equations derived
with the help of standard statistical software packages. In addition, the
experiment explored the feasibility of using equations derived from train-
ing data for one document collection in a different document collection for
which no training data happens to be available, and of calculating docu-
ment relevance probabilities accurately enough so that they can be dis-
played as part of the output seen by the user. The regression equations
were implemented as retrieval rules in an experimental prototype system
obtained by modifying the SMART retrieval system.
Introduction
The Berkeley group's interest in participating in the NIST/TREC Conference was
stimulated by the opportunity it offered to gain experience with a methodology called
`Staged Logistic Regression.' This technique (hereinafter abbreviated `SLR') is a sys-
tematic approach to retrieval system design based on probabilistic and statistical princi-
ples. It has been under study at Berkeley as a possible means of achieving effective prob-
abilistic retrieval including acceptably accurate estimates of relevance probability without
undue computational complexity.
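As a rough illustration of what applying a regression equation at run time might look like, the sketch below scores documents with a linear log-odds expression and converts each score into a probability of relevance using the logistic function. It is a minimal sketch under stated assumptions and not the SLR procedure itself: the function names, the two numeric clues per document, and the coefficient values are all hypothetical, standing in for whatever a statistical package would estimate from a learning sample.

```python
import math

def relevance_probability(log_odds):
    """Convert log-odds of relevance to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def rank_documents(doc_features, intercept, weights):
    """Score each document with a regression equation and sort by probability.

    doc_features maps a document id to a list of numeric clue values.
    intercept and weights are hypothetical coefficients (assumptions for
    illustration), standing in for values fitted to a learning sample.
    """
    scored = []
    for doc_id, features in doc_features.items():
        log_odds = intercept + sum(w * x for w, x in zip(weights, features))
        scored.append((doc_id, relevance_probability(log_odds)))
    # Highest estimated probability of relevance first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Illustrative values only; not taken from the experiment.
docs = {"d1": [2.0, 0.5], "d2": [0.0, 0.1], "d3": [1.0, 1.5]}
ranking = rank_documents(docs, intercept=-3.0, weights=[1.0, 0.8])
for doc_id, prob in ranking:
    print(f"{doc_id}\t{prob:.3f}")
```

The run-time cost in this sketch is one weighted sum and one exponential per document, which is the sense in which a fitted equation can yield probability estimates without heavy computation; the estimates are only as accurate as the coefficients supplied.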
In order to test out the SLR methodology on the TIPSTER data base in as straightforward
a fashion as possible, attention was restricted to the problem of how to form the
initial document ranking -- that is, the output ranking first offered by the system to the
user in response to the user's original query. This problem should not be confused with
the subsequent task of exploiting any relevance judgements that may be obtainable from
the user once he or she has started down the output ranking and is actively examining
documents. The latter problem -- the matter of how to exploit intra-search `relevance