SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
chapter
W. Cooper
F. Grey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
processing system utilizing similar kinds of evidence.
3. Standard statistical program packages can be used for an SLR analysis even in
large collections, circumventing much of the need for retrieval trials of
conventional design and allowing the choice of variables and the determination of
optimal numerical parameters to be carried out more conveniently.
4. The TREC results suggest that use of the SLR approach need not necessarily be
ruled out because of a lack of training data for the particular document collection to
which it must be applied. A respectable level of effectiveness can (under at least
some conditions) be achieved through the extrapolation into the new collection of
regression equations derived from a collection for which training data already
exists.
5. The present experiment failed to demonstrate that the SLR method is capable of
producing, in large collections, probability of relevance estimates sufficiently
well-calibrated to be presented to the users as part of the output display. It is
suspected that this failure is associated with the peculiar limitations of the
training data supplied for this initial TREC conference. If so, use of the more
extensive training data to be made available for future conferences may be
sufficient to resolve this problem.
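The distinction drawn in points 4 and 5, between ranking effectiveness and well-calibrated probability estimates, can be illustrated with a small sketch. It is not the SLR method itself: it fits an ordinary single-stage logistic regression on one clue, using hypothetical synthetic data, and then extrapolates the fitted equation to a second hypothetical collection whose overall relevance rate is lower. All function names, coefficients, and data below are assumptions made for illustration only; nothing here reproduces the TREC experiment or explains its outcome.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=2000, rate=0.5):
    """Fit log-odds(relevant) = b0 + b1*x by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1

def make_collection(b0, b1, per_level=50):
    """Hypothetical data: at each clue value x, a fixed share of documents is relevant."""
    xs, ys = [], []
    for level in range(-4, 5):
        x = level / 2.0
        n_rel = round(per_level * sigmoid(b0 + b1 * x))
        xs += [x] * per_level
        ys += [1] * n_rel + [0] * (per_level - n_rel)
    return xs, ys

def calibration_table(probs, ys, n_bins=5):
    """Per probability bin: (mean predicted probability, observed relevant fraction, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, ys):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    rows = []
    for b in bins:
        if b:  # skip empty bins
            rows.append((sum(p for p, _ in b) / len(b),
                         sum(y for _, y in b) / len(b),
                         len(b)))
    return rows

# Assumed "true" equations: same slope, lower intercept in the new collection.
train_x, train_y = make_collection(-1.0, 1.5)
new_x, new_y = make_collection(-2.0, 1.5)

b0, b1 = fit_logistic(train_x, train_y)              # equation from the training collection
new_probs = [sigmoid(b0 + b1 * x) for x in new_x]    # extrapolated to the new collection

for predicted, observed, count in calibration_table(new_probs, new_y):
    print(f"predicted {predicted:.2f}  observed {observed:.2f}  n={count}")
```

Because the fitted slope is positive, the extrapolated equation still orders the new collection's documents sensibly, yet in this constructed case its probability estimates run higher than the observed relevance fractions. A table of this kind is one simple way to judge whether estimates are calibrated well enough to show to users.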
Acknowledgements
The theory of staged logistic regression, developed in collaboration with Dan
Dabney of U.C.L.A., was originally stimulated by discussions with James Allen and
Gerard Salton of Cornell University and Stephen Robertson of City University, London.
The computer science department at Cornell University provided a hospitable
environment for the early stages of the theoretical development. Ray Larson at U.C.
Berkeley contributed experienced advice on the conversion of SMART and on general
systems problems. We are indebted to Chris Buckley for supporting our efforts to make
SMART regressive, and to all past contributors to SMART for making this valuable
research tool available. The work stations used for the experiment were DEC 5000s
supplied by the Sequoia 2000 project at the University of California, a project
principally funded by the Digital Equipment Corporation. A DARPA grant supported the
programming effort.
References
Bookstein, A. Probability and fuzzy set applications to information retrieval. In
M. Williams (ed.), Annual Review of Information Science and Technology, 20. White
Plains, NY: Knowledge Industry Publications; 1985.
Collett, D. Modelling Binary Data. London: Chapman & Hall; 1991.
Cooper, W. S. Exploiting the maximum entropy principle to increase retrieval
effectiveness. Journal of the American Society for Information Science, 34(1): 31-39;
1983.