SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
chapter
W. Cooper
A. Chen
F. Gey
National Institute of Standards and Technology
D. K. Harman
SAS provides the most complete built-in diagnostics for
logistic regression. BLSS was found to be especially con-
venient for interactive use in a UNIX environment, and
ended up being the most heavily used.
The prototype retrieval system itself was imple-
mented as a modification of the SMART system with
SMART's vector-similarity subroutines replaced by the
probabilistic computations of Eqs. (3) and (5). For the
runs Brkly3 and Brkly4, which used only Eq. (3), only
minimal modifications of SMART were needed, and at
run time the retrieval efficiency remained essentially the
same as for the unmodified SMART system. This demonstrates
that probabilistic retrieval need be no more cumbersome
computationally than the vector processing alternatives.
For Brkly5, which used Eq. (5), the modifications
were somewhat more extensive and retrieval took about
20% longer.
Retrieval Effectiveness
The Berkeley system achieved an average precision
over all documents (an `11-point average') of 32.7% for
the ad hoc retrieval run Brkly3, and 29.0% and 35.4%
respectively for the routing runs Brkly4 and Brkly5. The
distinct improvement in effectiveness of Brkly5 over
Brkly4 suggests that in routing retrieval the use of frequency
information about individual query stems is worthwhile.
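As an illustration of how an `11-point average' of this kind is computed, here is a minimal sketch. It assumes the conventional interpolated definition (precision at recall levels 0.0, 0.1, ..., 1.0, each taken as the highest precision at any equal or higher recall). It is not the official TREC-2 evaluation code, and the function name and input format are hypothetical.

```python
def eleven_point_average(ranked_relevance, total_relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_relevance: list of booleans, True where the document at that
    rank was judged relevant (hypothetical input format).
    total_relevant: number of relevant documents for the query.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")

    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    interpolated = []
    for level in range(11):
        recall_level = level / 10
        # Interpolation: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= recall_level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11


if __name__ == "__main__":
    # Relevant documents at ranks 1 and 3, two relevant documents in total.
    print(eleven_point_average([True, False, True, False], 2))
```

Under this assumed definition, the term for recall level 0 is simply the highest precision reached anywhere in the ranking.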
At the `0 recall level' a precision of 84.7%, the
highest recorded at the conference, was achieved in the ad
hoc run. The high effectiveness of the Berkeley system
for the first few retrieved documents may be explainable
in terms of the practice, mentioned earlier, of redoing the
regression analysis for the highest-ranked 500 documents
for each query. This technique ensures an especially good
regression fit for the query-document pairs that are especially
likely to be relevant, thus emphasizing good performance
near the top of the ranking where it is most important.
The generally high retrieval effectiveness of the
Berkeley system should be interpreted in the light of the
fact that the system probably uses less evidence -- that is,
fewer retrieval clues -- than any of the other high-
performing TREC-2 systems. In fact, the only clues used
were the frequency characteristics of single stems (not
even phrases were included). What this suggests is that
the underlying probabilistic logic may have the capacity to
exploit exceptionally fully whatever clues may be
available.
Summary and Conclusions
The Berkeley design approach is based on a proba-
bilistic model derived from the linked dependence
assumption. The variables of the probability-ranking
retrieval equation and their coefficients are determined by
logistic regression on a judgement sample. Though the
model is hospitable to the utilization of other kinds of evi-
dence, in this particular investigation the only variables
used were optimally relativized frequencies (ORF's) of
match stems.
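The general idea of fitting coefficients by logistic regression on a judgement sample, then ranking by the resulting log-odds, can be sketched as follows. This is a generic illustration under stated assumptions, not the Berkeley equations: the feature vectors stand in for whatever variables are used (they are not the ORF variables of Eqs. (3) and (5)), the fit uses plain gradient ascent rather than a statistical package, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_odds(coeffs, x):
    """Linear predictor: coeffs[0] is the intercept, the rest weight x."""
    return coeffs[0] + sum(c * xj for c, xj in zip(coeffs[1:], x))


def predict_probability(coeffs, x):
    """Estimated probability of relevance for feature vector x."""
    return sigmoid(log_odds(coeffs, x))


def fit_logistic(samples, labels, steps=2000, rate=0.1):
    """Fit all coefficients together by gradient ascent on the log-likelihood.

    samples: list of feature vectors for judged query-document pairs.
    labels: 1 if the pair was judged relevant, else 0.
    """
    n_features = len(samples[0])
    coeffs = [0.0] * (n_features + 1)
    for _ in range(steps):
        grads = [0.0] * (n_features + 1)
        for x, y in zip(samples, labels):
            err = y - predict_probability(coeffs, x)
            grads[0] += err
            for j, xj in enumerate(x, start=1):
                grads[j] += err * xj
        coeffs = [c + rate * g / len(samples) for c, g in zip(coeffs, grads)]
    return coeffs


def rank_documents(coeffs, doc_features):
    """Document ids sorted by decreasing estimated log-odds of relevance."""
    return sorted(doc_features,
                  key=lambda d: log_odds(coeffs, doc_features[d]),
                  reverse=True)


if __name__ == "__main__":
    # Toy judgement sample with one made-up feature per pair.
    xs = [[0.0], [1.0], [2.0], [3.0], [1.0], [2.0]]
    ys = [0, 0, 1, 1, 1, 0]
    fitted = fit_logistic(xs, ys)
    print(rank_documents(fitted, {"doc_a": [0.5], "doc_b": [2.5]}))
```

Because the logistic function is monotonic, ranking by log-odds gives the same order as ranking by estimated probability of relevance.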
The approach was found to have the following
advantages:
1. Experimental Efficiency. Since the numeric coefficients
in a regression equation are determined
simultaneously in one computation, trial-and-error
experimentation involving the evaluation of
retrieval output to optimize parameters is largely
avoidable.
2. Computational Simplicity. For ad hoc retrieval and
routing retrieval that does not involve individual
stem statistics, the computational simplicity and
efficiency achieved by the model at run time are
comparable to that of simple vector processing
retrieval models. For routing retrieval that exploits
individual stem frequencies the programming is
somewhat more complicated and runs slightly
slower.
3. Effective Retrieval. The level of retrieval effectiveness
as measured by precision and recall is high relative
to the simple clue-types used.
4. Potential for Well-Calibrated Probability Estimates.
In-the-ballpark estimates of document relevance
probabilities suitable for output display would
appear to be within reach.
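One common way to check whether probability estimates are well calibrated is to group predictions into bins and compare the mean predicted probability in each bin with the observed fraction of relevant documents. The sketch below illustrates that check only. It is an assumption about how calibration might be inspected, not a procedure described in this paper, and the function name and bin scheme are hypothetical.

```python
def calibration_table(probabilities, outcomes, n_bins=5):
    """Compare predicted probabilities with observed relevance rates.

    probabilities: estimated relevance probabilities in [0, 1].
    outcomes: 1 if the document was judged relevant, else 0.
    Returns one (mean predicted, observed rate, count) tuple per
    non-empty equal-width bin, in order of increasing probability.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        # A probability of exactly 1.0 belongs in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            table.append((mean_p, rate, len(members)))
    return table


if __name__ == "__main__":
    for mean_p, rate, count in calibration_table(
            [0.1, 0.2, 0.8, 0.9, 0.85], [0, 0, 1, 1, 0], n_bins=2):
        print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

Estimates are well calibrated to the extent that the predicted and observed columns agree across bins.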
Acknowledgements
We are indebted to Ray Larson and Chris Plaunt for
helpful systems advice, as well as to the several colleagues
already acknowledged in our TREC-1 report. The
work stations used for the experiment were supplied by
the Sequoia 2000 project at the University of California, a
project principally funded by the Digital Equipment Cor-
poration. A DARPA grant supported the programming
effort.