NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection

N. Fuhr, U. Pfeifer, C. Bremkamp, M. Pollmann

National Institute of Standards and Technology (D. K. Harman, ed.)

$$
\begin{aligned}
u(\vec{x}) = {} & b_0 \\
& + b_1 \cdot \mathit{is\_single} \cdot \mathit{tf} \cdot \mathit{logidf} \cdot \mathit{imaxtf} \\
& + b_2 \cdot \mathit{is\_single} \cdot \mathit{tf} \cdot \mathit{imaxtf} \\
& + b_3 \cdot \mathit{is\_single} \cdot \mathit{logidf} \\
& + b_4 \cdot \mathit{lognumterms} \cdot \mathit{imaxtf} \\
& + b_5 \cdot \mathit{is\_phrase} \cdot \mathit{tf} \cdot \mathit{logidf} \cdot \mathit{imaxtf} \\
& + b_6 \cdot \mathit{is\_phrase} \cdot \mathit{tf} \cdot \mathit{imaxtf} \\
& + b_7 \cdot \mathit{is\_phrase} \cdot \mathit{logidf}.
\end{aligned}
$$

The coefficient vector $\vec{b}$ is computed based on a training sample of query-document pairs with relevance judgements. Since polynomial functions may yield results outside the interval [0,1], these values were mapped onto the corresponding boundaries of this interval. For each phrase occurring in a document, indexing weights for the phrase as well as for its two components (as single words) were computed.

There are two major problems with this approach which we are investigating currently:

1. Which factors should be used for defining the indexing functions? We are developing a tool that supports a statistical analysis of single factors for this purpose.

2. What is the best type of indexing function? Previous experiments have suggested that regression methods outperform other probabilistic classification methods. As a reasonable alternative to polynomial regression, logistic regression seems to offer some advantages (see also [Fuhr & Pfeifer 94]). As a major benefit, logistic functions yield only values between 0 and 1, so there is no problem with outliers. We are performing experiments with logistic regression and comparing the results to those based on polynomial regression.

3 Query term weighting for ad-hoc queries

3.1 Theoretical background

The basis of our query term weighting scheme for ad-hoc queries is the linear utility-theoretic retrieval function described in [Wong & Yao 89].
Let $q_k^T$ denote the set of terms occurring in the query, and $u_{im}$ the indexing weight $u(\vec{x}(t_i, d_m))$ (with $u_{im} = 0$ for terms $t_i$ not occurring in $d_m$). If $c_{ik}$ gives the utility of term $t_i$ for the actual query $q_k$, then the utility of document $d_m$ w.r.t. query $q_k$ can be computed by the retrieval function

$$\varrho(q_k, d_m) = \sum_{t_i \in q_k^T} c_{ik} \cdot u_{im}. \qquad (2)$$

For the estimation of the utility weights $c_{ik}$, we applied two different methods.

As a heuristic approach, we used tf weights (the number of occurrences of the term $t_i$ in the query), which had shown good results in the experiments described in [Fuhr & Buckley 91].

As a second method, we applied linear regression to this problem. Based on the concept of polynomial retrieval functions as described in [Fuhr 89b], one can estimate the probability of relevance of $q_k$ w.r.t. $d_m$ by the formula

$$P(R \mid q_k, d_m) \approx \sum_{t_i \in q_k^T} c_{ik} \cdot u_{im}. \qquad (3)$$

If we had relevance feedback data for the specific query (as is the case for the routing queries), this function could be used directly for regression. For the ad-hoc queries, however, we have only feedback information about other queries. For this reason, we regard query features instead of specific queries. This can be done by considering for each query term the same features as described before in the context of document indexing. Assume that we have a set of features $\{f_0, f_1, \ldots, f_l\}$ and that $x_{ji}$ denotes the value of feature $f_j$ for term $t_i$. Then we assume that query term weight $c_{ik}$ can be estimated by linear regression according to the formula

$$c_{ik} = \frac{1}{|q_k^T|} \sum_{j=0}^{l} a_j x_{ji}. \qquad (4)$$

Here the factor $1/|q_k^T|$ serves for the purpose of normalization across different queries, since queries with a larger number of terms tend to yield higher retrieval status values with formula 2. The factors $a_j$ are the coefficients that are to be derived by means of regression.
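The two weighting methods can be illustrated with a minimal Python sketch of formulas 2 and 4. All function names, feature values, indexing weights, and coefficients below are hypothetical and chosen only for illustration; in particular, treating $f_0$ as a constant feature is an assumption, and the coefficients $a_j$ would in practice come from the regression rather than being set by hand.

```python
from typing import Dict, List


def tf_query_weights(query_terms: List[str]) -> Dict[str, float]:
    """Heuristic weights: c_ik = number of occurrences of term t_i in the query."""
    weights: Dict[str, float] = {}
    for term in query_terms:
        weights[term] = weights.get(term, 0.0) + 1.0
    return weights


def regression_query_weights(
    term_features: Dict[str, List[float]], coefficients: List[float]
) -> Dict[str, float]:
    """Formula 4: c_ik = (1 / |q_k^T|) * sum_j a_j * x_ji."""
    n_terms = len(term_features)  # |q_k^T|, the number of distinct query terms
    return {
        term: sum(a * x for a, x in zip(coefficients, features)) / n_terms
        for term, features in term_features.items()
    }


def retrieval_status_value(
    query_weights: Dict[str, float], doc_weights: Dict[str, float]
) -> float:
    """Formula 2: sum of c_ik * u_im over the query terms.

    Terms that do not occur in the document get u_im = 0.
    """
    return sum(c * doc_weights.get(term, 0.0) for term, c in query_weights.items())


# Hypothetical indexing weights u_im of one document.
doc_weights = {"retrieval": 0.5, "indexing": 0.25}

# Method 1: tf weights for a query in which "retrieval" occurs twice.
tf_weights = tf_query_weights(["probabilistic", "retrieval", "retrieval"])
rsv_tf = retrieval_status_value(tf_weights, doc_weights)

# Method 2: regression-based weights with two made-up features per term
# (x_0i = 1.0 as an assumed constant feature, x_1i arbitrary) and
# made-up coefficients a_0, a_1.
term_features = {"probabilistic": [1.0, 2.0], "retrieval": [1.0, 4.0]}
coefficients = [0.5, 0.25]
reg_weights = regression_query_weights(term_features, coefficients)
rsv_reg = retrieval_status_value(reg_weights, doc_weights)

print(rsv_tf, rsv_reg)
```

Both methods feed the same retrieval function; they differ only in how the weights $c_{ik}$ are obtained, which is why the sketch keeps weight estimation and scoring in separate functions.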
Now we have the problem that regression cannot be applied to eqn 4, since we do not observe $c_{ik}$ values directly. Instead, we observe relevance judgements. This leads us back to the polynomial retrieval function 3, where we substitute eqn 4 for $c_{ik}$:

$$
\begin{aligned}
P(R \mid q_k, d_m) &\approx \sum_{t_i \in q_k^T} \left( \frac{1}{|q_k^T|} \sum_{j=0}^{l} a_j x_{ji} \right) u_{im} \\
&= \sum_{j=0}^{l} a_j \, \frac{1}{|q_k^T|} \sum_{t_i \in q_k^T} x_{ji} u_{im} \\
&= \sum_{j=0}^{l} a_j y_j \qquad (5)
\end{aligned}
$$

with

$$y_j = \frac{1}{|q_k^T|} \sum_{t_i \in q_k^T} x_{ji} u_{im}. \qquad (6)$$

Equation 5 shows that we can apply linear regression of the form $P(R \mid q_k, d_m) \approx \vec{a}^T \cdot \vec{y}$ to a training sample