NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, editor, National Institute of Standards and Technology

Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression

Wm. S. Cooper, Aitao Chen, Fredric C. Gey
S.L.I.S., University of California, Berkeley, CA 94720

ABSTRACT

The experiments described here are part of a research program whose objective is to develop a full-text retrieval methodology that is statistically sound and powerful, yet reasonably simple. The methodology is based on the use of a probabilistic model whose parameters are fitted empirically to a learning set of relevance judgements by logistic regression. The method was applied to the TIPSTER data with optimally relativized frequencies of occurrence of match stems as the regression variables. In a routing retrieval experiment, these were supplemented by other variables corresponding to sums of logodds associated with particular match stems.

Introduction

The full-text retrieval design problem is largely a problem in the combination of statistical evidence. With this as its premise, the Berkeley group has concentrated on the challenge of finding a statistical methodology for combining retrieval clues in as powerful a way as possible, consistent with reasonable analytic and computational simplicity. Thus our research focus has been on the general logic of how to combine clues, with no attempt made at this stage to exploit as many clues as possible. We feel that if a straightforward statistical methodology can be found that extracts a maximum of retrieval power from a few good clues, and the methodology is clearly hospitable to the introduction of further clues in the future, progress will have been made.
We join Fuhr and Buckley (1991, 1992) in thinking that an especially promising path to such a methodology is to combine a probabilistic retrieval model with the techniques of statistical regression. Under this approach a probabilistic model is used to deduce the general form that the document-ranking equation should take, after which regression analysis is applied to obtain empirically-based values for the constants that appear in the equation. In this way the probabilistic theory is made to constrain the universe of logically possible retrieval rules that could be chosen, and the regression techniques complete the choice by optimizing the model's fit to the learning data.

The probabilistic model adopted by the Berkeley group is derived from a statistical assumption of `linked dependence'. This assumption is weaker than the historic independence assumptions usually discussed. In its simplest form the Berkeley model also differs from most traditional models in that it is of `Type 0' -- meaning that the analysis is carried out with respect to sets of query-document pairs rather than with respect to particular queries or particular documents. (For a fuller explanation of this typology see Robertson, Maron & Cooper 1982.) But when relevance judgement data specialized to the currently submitted search query is available, say in the form of relevance feedback or routing history data, the model is flexible enough to accommodate it (resulting in `Type 2' retrieval).

Logistic regression (see e.g. Hosmer & Lemeshow 1989) is the most appropriate type of regression for this kind of IR prediction. Although standard multiple regression analysis has been used successfully by others in comparable circumstances (Fuhr & Buckley, op. cit.), we believe logistic regression to be logically more appropriate for reasons set forth elsewhere (Cooper, Dabney & Gey 1992).
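To make the role of the linked dependence assumption concrete, the following is a hedged sketch of the additive logodds form it licenses (the notation here is assumed for illustration; the derivation follows Cooper, Dabney & Gey 1992). For relevance R and retrieval clues A_1, ..., A_n:

\log O(R \mid A_1, \ldots, A_n) \;=\; \log O(R) \;+\; \sum_{i=1}^{n} \Bigl[ \log O(R \mid A_i) - \log O(R) \Bigr]

where O(\cdot) denotes odds. The regression step then replaces the equal weighting of the per-clue terms that this idealized form implies with coefficients fitted empirically to the learning data.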
Logistic regression, which accepts binary training data and yields probability estimates in the form of logodds values, goes hand in glove with a probabilistic IR model that is to be fitted to binary relevance judgement data and whose predictor variables are themselves logodds.
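The fitting step described above can be sketched in miniature. The following is an illustrative example only, not the system's actual implementation: a single predictor (standing in for a relative match-stem frequency) is fitted to binary relevance judgements by maximum-likelihood logistic regression, here via simple gradient ascent in place of a statistics package, and the fitted logodds serve directly as a ranking score. The data values are invented for the example.

```python
import math

# Toy learning set: each row is a (query, document) pair with one
# predictor x (a relative match-stem frequency, illustrative) and a
# binary relevance judgement y (1 = judged relevant).
pairs = [
    (0.00, 0), (0.05, 0), (0.10, 0), (0.15, 0), (0.20, 1),
    (0.25, 0), (0.30, 1), (0.35, 1), (0.40, 1), (0.45, 1),
]

def fit_logistic(data, lr=1.0, epochs=2000):
    """Fit logodds(relevance) = b0 + b1*x by gradient ascent on the
    log-likelihood (concave, so this converges to the MLE)."""
    b0, b1 = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (y - p)          # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

b0, b1 = fit_logistic(pairs)

def logodds(x):
    # Ranking score: documents are sorted by estimated logodds of relevance.
    return b0 + b1 * x
```

Because relevance rises with the match frequency in the training pairs, the fitted slope is positive and a heavily matching document outranks a lightly matching one; on real data each match stem would contribute its own regression variable.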