SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression

W. Cooper, A. Chen, F. Gey

National Institute of Standards and Technology, D. K. Harman

Calibration of Probability Estimates

The most important role of the relevance probability estimates calculated by a probabilistic IR system is to rank the output documents in as effective a search order as possible. For this ranking function it is only the relative sizes of the probability estimates that matter, not their absolute magnitudes. However, it is also desirable that the absolute sizes of these estimates be at least somewhat realistic. If they are, they can be displayed to provide guidance to the users in their decisions as to when to stop searching down the ranking. This capability is a potentially important side benefit of the probabilistic approach.

One way of testing the realism of the probability estimates is to see whether they are `well-calibrated'. Good calibration means that when a large number of probability estimates whose magnitudes happen to fall in a certain small range are examined, the proportion of the trials in question with positive outcomes also falls in or close to that range. To test the calibration of the probability predictions produced by the Berkeley approach, the 50,000 query-document pairs in the ad hoc entry Brkly3, along with their accompanying relevance probability estimates, were sorted in descending order of magnitude of estimate. Pairs for which human judgements of relevance-relatedness were unavailable were discarded; this left 22,352 sorted pairs for which both the system's probability estimates of relevance and the `correct' binary judgements of relevance were available.
This shorter list was divided into blocks of 1,000 pairs each -- the thousand pairs with the highest probability estimates, the thousand with the next highest, and so forth. Within each block the `actual' probability was estimated as the proportion of the 1,000 pairs that had been judged to be relevance-related by humans. This was compared against the mean of all the system probability estimates in the block. For a well-calibrated system these figures should be approximately equal.

The results of the comparison are displayed in Table III. It can be seen that the system's probability predictions, while not wildly inaccurate, are generally somewhat higher than the actual proportions of relevant pairs. The same phenomenon of mild overestimation was observed when the runs Brkly4 and Brkly5 were tested for well-calibratedness in a similar way. Since no systematic overestimation was observed when the calibration of the formula was originally tested against the learning data, it seems likely that the overestimation seen in the table is due mainly to the shift from learning data to test data.

TABLE III: Calibration of Ad Hoc Relevance-Probability Estimates

  Query-doc           Mean System    Proportion
  Pair Ranks          Probability    Actually
                      Estimate       Relevant
  1 to 1,000             0.66          0.60
  1,001 to 2,000         0.63          0.47
  2,001 to 3,000         0.61          0.44
  3,001 to 4,000         0.58          0.41
  4,001 to 5,000         0.55          0.38
  5,001 to 6,000         0.53          0.34
  6,001 to 7,000         0.50          0.36
  7,001 to 8,000         0.48          0.36
  8,001 to 9,000         0.46          0.36
  9,001 to 10,000        0.44          0.38
  10,001 to 11,000       0.42          0.39
  11,001 to 12,000       0.41          0.36
  12,001 to 13,000       0.39          0.37
  13,001 to 14,000       0.37          0.36
  14,001 to 15,000       0.36          0.35
  15,001 to 16,000       0.34          0.31
  16,001 to 17,000       0.32          0.29
  17,001 to 18,000       0.31          0.28
  18,001 to 19,000       0.29          0.23
  19,001 to 20,000       0.28          0.22
  20,001 to 21,000       0.25          0.21
  21,001 to 22,000       0.23          0.23
  22,001 to 22,352       0.18          0.19
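The block-by-block calibration check described above can be sketched in a few lines of code. The following Python fragment is illustrative only and was not part of the original study; the function name and the NumPy-based implementation are assumptions, but the procedure it implements (sort pairs by descending estimate, partition into fixed-size blocks, compare mean estimate with observed relevance rate) follows the text.

```python
import numpy as np

def calibration_table(estimates, judgements, block=1000):
    """For each block of `block` pairs (sorted by descending system
    estimate), return (mean system estimate, proportion judged relevant).
    For a well-calibrated system the two numbers in each row should be
    approximately equal."""
    est = np.asarray(estimates, dtype=float)
    rel = np.asarray(judgements, dtype=float)
    order = np.argsort(-est)          # descending order of estimate
    est, rel = est[order], rel[order]
    return [(est[i:i + block].mean(), rel[i:i + block].mean())
            for i in range(0, len(est), block)]

# Tiny demonstration with made-up numbers (block size 2):
for mean_est, prop_rel in calibration_table(
        [0.9, 0.8, 0.2, 0.1], [1, 0, 1, 0], block=2):
    print(f"mean estimate {mean_est:.2f}, proportion relevant {prop_rel:.2f}")
```

Applied to the 22,352 judged Brkly3 pairs with the default block size of 1,000, a routine like this would produce the rows shown in Table III.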
Naturally, predictive formulae that have been fine-tuned to a certain set of learning data will perform less well when applied to a new set of data to which they have not been fine-tuned. If this is indeed the root cause of the observed overestimation, it could perhaps be compensated for (at least to an extent sufficient for practical purposes) by the crude expedient of lowering all predicted probabilities to, say, around four fifths of their originally calculated values before displaying them to the users.

Computational Experience

The statistical program packages used in the course of the analysis included SAS, S, and BLSS. Of these,