NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, National Institute of Standards and Technology

Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection

Norbert Fuhr, Ulrich Pfeifer, Christoph Bremkamp, Michael Pollmann
University of Dortmund, Germany

Chris Buckley
Cornell University

Abstract

In this paper, we describe the application of probabilistic models for indexing and retrieval with the TREC-2 collection. This database consists of about a million documents (2 gigabytes of data) and 100 queries (50 routing and 50 adhoc topics). For document indexing, we use a description-oriented approach which exploits relevance feedback data in order to produce a probabilistic indexing with single terms as well as with phrases. With the adhoc queries, we present a new query term weighting method based on a training sample of other queries. For the routing queries, the RPI model is applied, which combines probabilistic indexing with query term weighting based on query-specific feedback data. The experimental results of our approach show very good performance for both types of queries.

1 Introduction

The good TREC-1 results of our group described in [Fuhr & Buckley 93] have confirmed the general concept of probabilistic retrieval as a learning approach. In this paper, we describe some improvements of the indexing and retrieval procedures. To this end, we first give a brief outline of the document indexing procedure, which is based on description-oriented indexing in combination with polynomial regression. Section 3 describes query term weighting for adhoc queries, where we have developed a new learning method based on a training sample of other queries and corresponding relevance judgements. In section 4, the construction of the routing queries is presented, which is based on the probabilistic RPI retrieval model for query-specific feedback data. In the final conclusions, we suggest some further improvements of our method.

2 Document indexing

The task of probabilistic document indexing can be described as follows (see [Fuhr & Buckley 91] for more details): Let $d_m$ denote a document, $t_i$ a term and $R$ the fact that a query-document pair is judged relevant. Then $P(R \mid t_i, d_m)$ denotes the probability that document $d_m$ will be judged relevant w.r.t. an arbitrary query that contains term $t_i$. Since these weights can hardly be estimated directly, we use the description-oriented indexing approach. Here term-document pairs $(t_i, d_m)$ are mapped onto so-called relevance descriptions $\vec{x}(t_i, d_m)$. The elements $x_i$ of the relevance description contain values of features of $t_i$, $d_m$ and their relationship, e.g.

  tf = within-document frequency (wdf) of $t_i$,
  logidf = log(inverse document frequency),
  lognumterms = log(number of different terms in $d_m$),
  imaxtf = 1/(maximum wdf of a term in $d_m$),
  is_single = 1 if the term is a single word, 0 otherwise,
  is_phrase = 1 if the term is a phrase, 0 otherwise.

(As phrases, we considered all adjacent non-stopwords that occurred at least 25 times in the D1+D2 (training) document set.)

Based on these relevance descriptions, we estimate the probability $P(R \mid \vec{x}(t_i, d_m))$ that an arbitrary term-document pair having relevance description $\vec{x}$ will be involved in a relevant query-document relationship. This probability is estimated by a so-called indexing function $u(\vec{x})$. Different regression methods or probabilistic classification algorithms can serve as indexing functions. For our retrieval runs submitted to TREC-2, we used polynomial regression for developing an indexing function of the form

  $u(\vec{x}) = \vec{b}^{T} \cdot \vec{v}(\vec{x})$    (1)

where $\vec{b}$ is a coefficient vector and the components of $\vec{v}(\vec{x})$ are products of elements of $\vec{x}$. The indexing function actually used has the form