NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
National Institute of Standards and Technology
D. K. Harman
Full Text Retrieval based on Probabilistic Equations
with Coefficients fitted by Logistic Regression
Wm. S. Cooper
Aitao Chen
Fredric C. Gey
S.L.I.S., University of California
Berkeley, CA 94720
ABSTRACT
The experiments described here are part of a research
program whose objective is to develop a full-text
retrieval methodology that is statistically sound and
powerful, yet reasonably simple. The methodology is
based on the use of a probabilistic model whose
parameters are fitted empirically to a learning set of
relevance judgements by logistic regression. The
method was applied to the TIPSTER data with opti-
mally relativized frequencies of occurrence of match
stems as the regression variables. In a routing
retrieval experiment, these were supplemented by
other variables corresponding to sums of logodds asso-
ciated with particular match stems.
Introduction
The full-text retrieval design problem is largely a
problem in the combination of statistical evidence. With
this as its premise, the Berkeley group has concentrated on
the challenge of finding a statistical methodology for com-
bining retrieval clues in as powerful a way as possible,
consistent with reasonable analytic and computational
simplicity. Thus our research focus has been on the gen-
eral logic of how to combine clues, with no attempt made
at this stage to exploit as many clues as possible. We feel
that if a straightforward statistical methodology can be
found that extracts a maximum of retrieval power from a
few good clues, and the methodology is clearly hospitable
to the introduction of further clues in future, progress will
have been made.
We join Fuhr and Buckley (1991, 1992) in thinking
that an especially promising path to such a methodology is
to combine a probabilistic retrieval model with the tech-
niques of statistical regression. Under this approach a
probabilistic model is used to deduce the general form that
the document-ranking equation should take, after which
regression analysis is applied to obtain empirically-based
values for the constants that appear in the equation. In
this way the probabilistic theory is made to constrain the
universe of logically possible retrieval rules that could be
chosen, and the regression techniques complete the choice
by optimizing the model's fit to the learning data.
The probabilistic model adopted by the Berkeley
group is derived from a statistical assumption of `linked
dependence'. This assumption is weaker than the historic
independence assumptions usually discussed. In its sim-
plest form the Berkeley model also differs from most tra-
ditional models in that it is of `Type 0'-- meaning that the
analysis is carried out w.r.t. sets of query-document pairs
rather than w.r.t. particular queries or particular docu-
ments. (For a fuller explanation of this typology see
Robertson, Maron & Cooper 1982.) But when relevance
judgement data specialized to the currently submitted
search query is available, say in the form of relevance
feedback or routing history data, the model is flexible
enough to accommodate it (resulting in `Type 2' retrieval).
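The linked dependence assumption can be illustrated in its simplest two-clue form. The following sketch (based on the general line of derivation in the cited work, not reproduced verbatim from it) shows how the assumption lets per-clue logodds be combined additively, with A and B standing for two retrieval clues, R for relevance, and O for odds:

```latex
% Linked dependence for two clues A, B:
\frac{P(A, B \mid R)}{P(A, B \mid \bar{R})}
  = \frac{P(A \mid R)}{P(A \mid \bar{R})}
    \cdot \frac{P(B \mid R)}{P(B \mid \bar{R})}

% Writing O(R \mid \cdot) = P(R \mid \cdot)/P(\bar{R} \mid \cdot)
% and applying Bayes' rule to each ratio, taking logs gives:
\log O(R \mid A, B) \;=\; \log O(R \mid A) + \log O(R \mid B) - \log O(R)
```

Each clue thus contributes an additive logodds term, which is what makes the linear form fitted by regression a natural candidate for the ranking equation.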
Logistic regression (see e.g. Hosmer & Lemeshow
(1989)) is the most appropriate type of regression for this
kind of IR prediction. Although standard multiple regres-
sion analysis has been used successfully by others in com-
parable circumstances (Fuhr & Buckley op. cit.), we
believe logistic regression to be logically more appropri-
ate for reasons set forth elsewhere (Cooper, Dabney &
Gey 1992). Logistic regression, which accepts binary
training data and yields probability estimates in the form
of logodds values, goes hand-in-glove with a probabilistic
IR model that is to be fitted to binary relevance judgement
data and whose predictor variables are themselves
logodds.
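The fit itself can be sketched in a few lines. The toy learning set and the single predictor below are invented for illustration (in the actual experiments the predictors are relativized match-stem frequencies), and the gradient-ascent fitter is a minimal stand-in for a proper statistics package:

```python
import math

# Hypothetical learning set: each row is a query-document pair with one
# predictor value x (standing in for a relativized match-stem frequency)
# and a binary relevance judgement (1 = relevant, 0 = not relevant).
pairs = [
    (0.05, 0), (0.10, 0), (0.20, 0), (0.30, 0), (0.35, 1),
    (0.40, 0), (0.50, 1), (0.60, 1), (0.70, 1), (0.90, 1),
]

def fit_logistic(data, lr=0.5, epochs=5000):
    """Fit log O(R|x) = b0 + b1*x to binary judgements by gradient
    ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # P(R | x)
            g0 += (y - p)
            g1 += (y - p) * x
        b0 += lr * g0 / len(data)
        b1 += lr * g1 / len(data)
    return b0, b1

b0, b1 = fit_logistic(pairs)

# The fitted equation yields a logodds of relevance for any new
# query-document pair; documents can then be ranked by this value.
logodds = b0 + b1 * 0.65
prob = 1.0 / (1.0 + math.exp(-logodds))
print(f"coefficients: b0={b0:.2f}, b1={b1:.2f}")
print(f"logodds at x=0.65: {logodds:.2f}  (P={prob:.2f})")
```

Because the regression consumes binary judgements and emits logodds directly, its output slots into the additive model of the previous section without any conversion step.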