NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
W. Cooper, A. Chen, F. Gey
National Institute of Standards and Technology, D. K. Harman

SAS provides the most complete built-in diagnostics for logistic regression. BLSS was found to be especially convenient for interactive use in a UNIX environment, and ended up being the most heavily used.

The prototype retrieval system itself was implemented as a modification of the SMART system with SMART's vector-similarity subroutines replaced by the probabilistic computations of Eqs. (3) and (5). For the runs Brkly3 and Brkly4, which used only Eq. (3), only minimal modifications of SMART were needed, and at run time the retrieval efficiency remained essentially the same as for the unmodified SMART system. This demonstrates that probabilistic retrieval need be no more cumbersome computationally than the vector processing alternatives. For Brkly5, which used Eq. (5), the modifications were somewhat more extensive and retrieval took about 20% longer.

Retrieval Effectiveness

The Berkeley system achieved an average precision over all documents (an '11-point average') of 32.7% for the ad hoc retrieval run Brkly3, and 29.0% and 35.4% respectively for the routing runs Brkly4 and Brkly5. The distinct improvement in effectiveness of Brkly5 over Brkly4 suggests that in routing retrieval the use of frequency information about individual query stems is worthwhile. At the '0 recall level' a precision of 84.7%, the highest recorded at the conference, was achieved in the ad hoc run. The high effectiveness of the Berkeley system for the first few retrieved documents may be explainable in terms of the practice, mentioned earlier, of redoing the regression analysis for the highest-ranked 500 documents for each query.
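The re-fitting practice just mentioned can be sketched as a two-stage fit. What follows is only an illustration under stated assumptions, not the authors' implementation: the single numeric feature per query-document pair, the toy judgement sample, the helper names (`fit_logistic`, `two_stage_fit`), the plain gradient-ascent fitting (the regressions reported here were run in statistical packages) and the cutoff `top_k=4`, which stands in for the 500 documents per query, are all hypothetical. The variables of Eqs. (3) and (5) are not reproduced.

```python
import math
from collections import defaultdict


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_odds(w, x):
    # w[0] is the intercept; w[1:] are the feature coefficients.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_logistic(X, y, steps=2000, lr=0.1):
    """Fit a logistic regression by gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(log_odds(w, xi))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def two_stage_fit(sample, top_k):
    """sample: list of (query_id, features, relevant) triples.

    Stage 1 fits on the whole sample; stage 2 refits on the top_k
    pairs per query as ranked by the stage-1 equation.
    """
    w1 = fit_logistic([f for _, f, _ in sample], [r for _, _, r in sample])
    by_query = defaultdict(list)
    for q, f, r in sample:
        by_query[q].append((log_odds(w1, f), f, r))
    top = []
    for rows in by_query.values():
        rows.sort(key=lambda t: t[0], reverse=True)
        top.extend(rows[:top_k])
    w2 = fit_logistic([f for _, f, _ in top], [r for _, _, r in top])
    return w1, w2, top


# Hypothetical judgement sample: two queries, six judged documents each.
sample = [
    ("q1", [0.1], 0), ("q1", [0.4], 0), ("q1", [0.9], 1),
    ("q1", [1.5], 0), ("q1", [2.0], 1), ("q1", [2.5], 1),
    ("q2", [0.2], 0), ("q2", [0.6], 0), ("q2", [1.0], 0),
    ("q2", [1.2], 1), ("q2", [1.8], 1), ("q2", [2.2], 1),
]
w1, w2, top = two_stage_fit(sample, top_k=4)
print("stage-1 coefficients:", w1)
print("stage-2 coefficients:", w2)
```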
This technique ensures an especially good regression fit for the query-document pairs that are especially likely to be relevant, thus emphasizing good performance near the top of the ranking where it is most important.

The generally high retrieval effectiveness of the Berkeley system should be interpreted in the light of the fact that the system probably uses less evidence -- that is, fewer retrieval clues -- than any of the other high-performing TREC-2 systems. In fact, the only clues used were the frequency characteristics of single stems (not even phrases were included). What this suggests is that the underlying probabilistic logic may have the capacity to exploit exceptionally fully whatever clues may be available.

Summary and Conclusions

The Berkeley design approach is based on a probabilistic model derived from the linked dependence assumption. The variables of the probability-ranking retrieval equation and their coefficients are determined by logistic regression on a judgement sample. Though the model is hospitable to the utilization of other kinds of evidence, in this particular investigation the only variables used were optimally relativized frequencies (ORF's) of match stems.

The approach was found to have the following advantages:

1. Experimental Efficiency. Since the numeric coefficients in a regression equation are determined simultaneously in one computation, trial-and-error experimentation involving the evaluation of retrieval output to optimize parameters is largely avoidable.
2. Computational Simplicity. For ad hoc retrieval and routing retrieval that does not involve individual stem statistics, the computational simplicity and efficiency achieved by the model at run time are comparable to that of simple vector processing retrieval models. For routing retrieval that exploits individual stem frequencies the programming is somewhat more complicated and runs slightly slower.
3. Effective Retrieval.
The level of retrieval effectiveness as measured by precision and recall is high relative to the simple clue-types used.
4. Potential for Well-Calibrated Probability Estimates. In-the-ballpark estimates of document relevance probabilities suitable for output display would appear to be within reach.

Acknowledgements

We are indebted to Ray Larson and Chris Plaunt for helpful systems advice, as well as to the several colleagues already acknowledged in our TREC-1 report. The work stations used for the experiment were supplied by the Sequoia 2000 project at the University of California, a project principally funded by the Digital Equipment Corporation. A DARPA grant supported the programming effort.
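
The effectiveness figures quoted under Retrieval Effectiveness (the '11-point average' and the precision at the '0 recall level') can be illustrated with a small sketch. It assumes the commonly used interpolated definition, in which precision at a recall level is the best precision observed at that recall or higher; the exact evaluation procedure used at the conference is not specified in this section. The function name `eleven_point_average` and the toy ranking are hypothetical.

```python
def eleven_point_average(ranking, relevant):
    """Return (interpolated precisions at recall 0.0..1.0, their mean).

    ranking:  document ids in ranked order (best first)
    relevant: set of ids judged relevant for the query
    """
    if not relevant:
        raise ValueError("at least one relevant document is required")
    hits = 0
    points = []  # (recall, precision) observed at each relevant document
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    interpolated = []
    for i in range(11):
        level = i / 10
        # Best precision at any recall >= level (small tolerance for floats).
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated, sum(interpolated) / 11


# Toy example: two relevant documents, found at ranks 1 and 4.
precisions, average = eleven_point_average(
    ["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}
)
print("precision at 0 recall level:", precisions[0])
print("11-point average:", average)
```

In this toy ranking the precision at the 0 recall level is 1.0 because the top-ranked document is relevant, which is why that figure is sensitive mainly to the first few retrieved documents.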