NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
chapter
W. Cooper
A. Chen
F. Gey
National Institute of Standards and Technology
D. K. Harman
Calibration of Probability Estimates
The most important role of the relevance probability
estimates calculated by a probabilistic IR system is to rank
the output documents in as effective a search order as pos-
sible. For this ranking function it is only the relative sizes
of the probability estimates that matter, not their absolute
magnitudes. However, it is also desirable that the absolute
sizes of these estimates be at least somewhat realistic. If
they are, they can be displayed to provide guidance to the
users in their decisions as to when to stop searching down
the ranking. This capability is a potentially important side
benefit of the probabilistic approach.
One way of testing the realism of the probability
estimates is to see whether they are `well-calibrated'.
Good calibration means that when a large number of prob-
ability estimates whose magnitudes happen to fall in a cer-
tain small range are examined, the proportion of the trials
in question with positive outcomes also falls in or close to
that range. To test the calibration of the probability pre-
dictions produced by the Berkeley approach, the 50,000
query-document pairs in the ad hoc entry Brkly3 along
with their accompanying relevance probability estimates
were sorted in descending order of magnitude of estimate.
Pairs for which human judgements of relevance-
relatedness were unavailable were discarded; this left
22,352 sorted pairs for which both the system's probabil-
ity estimates of relevance and the `correct' binary judge-
ments of relevance were available. This shorter list was
divided into blocks of 1,000 pairs each -- the thousand
pairs with the highest probability estimates, the thousand
with the next highest, and so forth. Within each block the
`actual' probability was estimated as the proportion of the
1,000 pairs that had been judged to be relevance-related
by humans. This was compared against the mean of all
the system probability estimates in the block. For a well-
calibrated system these figures should be approximately
equal.
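The blocking procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code; the input names `estimates` (system relevance probabilities) and `judged` (binary human relevance judgements) are hypothetical.

```python
def calibration_table(estimates, judged, block_size=1000):
    """Sort query-document pairs by system estimate (descending),
    divide them into blocks, and compare the mean system estimate in
    each block with the observed proportion of relevant pairs.  For a
    well-calibrated system the two figures should be close."""
    pairs = sorted(zip(estimates, judged), key=lambda p: p[0], reverse=True)
    rows = []
    for start in range(0, len(pairs), block_size):
        block = pairs[start:start + block_size]
        mean_estimate = sum(p for p, _ in block) / len(block)
        proportion_relevant = sum(r for _, r in block) / len(block)
        rows.append((start + 1, start + len(block),
                     mean_estimate, proportion_relevant))
    return rows
```

Applied to the 22,352 judged Brkly3 pairs with a block size of 1,000, this yields the comparison reported in Table III.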
The results of the comparison are displayed in Table
III. It can be seen that the system's probability predic-
tions, while not wildly inaccurate, are generally somewhat
higher than the actual proportions of relevant pairs. The
same phenomenon of mild overestimation was observed
when the runs Brkly4 and Brkly5 were tested for well-
calibratedness in a similar way.
Since no systematic overestimation was observed
when the calibration of the formula was originally tested
against the learning data, it seems likely that the
TABLE III: Calibration of Ad Hoc
Relevance-Probability Estimates

Query-doc          Mean System   Proportion
Pair               Probability   Actually
Ranks              Estimate      Relevant
1 to 1,000             0.66         0.60
1,001 to 2,000         0.63         0.47
2,001 to 3,000         0.61         0.44
3,001 to 4,000         0.58         0.41
4,001 to 5,000         0.55         0.38
5,001 to 6,000         0.53         0.34
6,001 to 7,000         0.50         0.36
7,001 to 8,000         0.48         0.36
8,001 to 9,000         0.46         0.36
9,001 to 10,000        0.44         0.38
10,001 to 11,000       0.42         0.39
11,001 to 12,000       0.41         0.36
12,001 to 13,000       0.39         0.37
13,001 to 14,000       0.37         0.36
14,001 to 15,000       0.36         0.35
15,001 to 16,000       0.34         0.31
16,001 to 17,000       0.32         0.29
17,001 to 18,000       0.31         0.28
18,001 to 19,000       0.29         0.23
19,001 to 20,000       0.28         0.22
20,001 to 21,000       0.25         0.21
21,001 to 22,000       0.23         0.23
22,001 to 22,352       0.18         0.19
overestimation seen in the table is due mainly to the shift
from learning data to test data. Naturally, predictive for-
mulae that have been fine-tuned to a certain set of learning
data will perform less well when applied to a new set of
data to which they have not been fine-tuned. If this is
indeed the root cause of the observed overestimation, it
could perhaps be compensated for (at least to an extent
sufficient for practical purposes) by the crude expedient of
lowering all predicted probabilities to, say, around four
fifths of their originally calculated values before display-
ing them to the users.
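The crude correction suggested above amounts to a uniform rescaling of the displayed probabilities. A sketch, where the factor 0.8 is the rough "four fifths" figure from the text rather than a fitted constant:

```python
def deflate(probabilities, factor=0.8):
    """Lower each predicted relevance probability to `factor` of its
    originally calculated value before display, to compensate for the
    systematic overestimation observed on test data."""
    return [factor * p for p in probabilities]
```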
Computational Experience
The statistical program packages used in the course
of the analysis included SAS, S, and BLSS. Of these,