SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) UCLA-Okapi at TREC-2: Query Expansion Experiments chapter E. Efthimiadis P. Biron National Institute of Standards and Technology D. K. Harman

of documents that provide terms for query expansion, because of the noise introduced by the terms taken from the irrelevant stories.

(b) This last issue relates to a limitation of Okapi. The version of Okapi used at UCLA retrieves documents at the record level only. Retrieval at the paragraph level, which would have facilitated better handling of issues like the above, is not currently available.

2 The weighting functions

The weighting of search terms can be said to involve two levels:

level 1: A weighting function is used to weight the terms for the initial query as well as the terms for subsequent search iterations of the same query or some modified version of the query.

level 2: A weighting function is used for the weighting of candidate terms for query expansion.

Sections 2.1 and 2.2 discuss the functions used in level 1 and level 2 respectively.

2.1 Search term weighting

The theory of relevance weights (Robertson & Sparck Jones, 1976) provides the basic probabilistic model. The binary independence or relevance weight model assigns a weight to each term, and the matching function for each document is given by the `simple sum-of-weights' over all of the terms in the query. The weight of a term is calculated by the following function, which is also known as the F4 point-5 formula:

$$ w_{F4} = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)} $$

where $N$ is the total number of documents in the collection; $R$ is the number of documents in the sample of relevant documents as defined by the user's feedback; $n$ is the number of documents indexed by term $t$; and $r$ is the number of relevant documents (from the sample $R$) assigned to term $t$.

When relevance information is not available the above weight reduces to approximately the inverse document frequency (IDF).
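The F4 point-5 weight of Section 2.1 can be illustrated with a minimal Python sketch. The function name `f4_weight`, the use of the natural logarithm, and the collection figures in the example are assumptions made for illustration; they are not taken from the Okapi implementation.

```python
import math


def f4_weight(N, R, n, r):
    """F4 point-5 relevance weight of a single term.

    N: total number of documents in the collection
    R: number of documents in the relevant sample
    n: number of documents indexed by the term
    r: number of relevant documents indexed by the term
    The logarithm base is an assumption (natural log).
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


# Hypothetical collection: 10,000 documents, term indexed in 100 of them.
N, n = 10_000, 100

# With no relevance information (R = r = 0) the 0.5 factors cancel and
# the weight becomes log((N - n + 0.5) / (n + 0.5)), an IDF-like value.
w_no_feedback = f4_weight(N, R=0, n=n, r=0)
idf_like = math.log((N - n + 0.5) / (n + 0.5))

# With feedback: 8 of 10 relevant documents contain the term.
w_feedback = f4_weight(N, R=10, n=n, r=8)
```

Setting R = r = 0 shows why the weight falls back to an IDF-like quantity when no feedback is available, while a term concentrated in the relevant sample receives a noticeably higher weight.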
To calculate the total weight of a document, the following function was used. It is based on the binary independence model and takes into consideration the 2-Poisson model for within-document frequency (tf) and the document length. These are described in detail in Robertson et al. (1993b). The purpose of the UCLA Okapi system was to evaluate the existing Okapi models, and it therefore did not allow for modifications of the existing functions. For compatibility and comparison purposes it was decided to use the BM15 (best match) function for the runs. The BM15 best match weighting function is:

$$ \mathrm{docweight}_{bm15} = \sum \left( \frac{tf}{k_1 + tf} \times w_{F4} \right) + k_2 \times n_q \times \frac{avedl - dl}{avedl + dl} $$

where $k_1$ and $k_2$ are unknown constants. In the UCLA-Okapi implementation the values for these constants are $k_1 = 1$ and $k_2 = 1$.

2.2 Query expansion term weighting

The ranking algorithms that were considered for the ranking of terms for query expansion were: wpq, emim, porter, r-lohi and r-hilo. These algorithms are described briefly below.

2.2.1 The wpq algorithm

This algorithm is based on an independence assumption that holds between a query expansion term and the terms in the entire previous search formulation (Robertson, 1990). According to the relevance weighting theory, the inclusion of term $t$ in the search formulation with weight $w_t$ will increase the effectiveness of retrieval by

$$ \mathrm{wpq} = w_t (p_t - q_t) \qquad (3) $$

where $w_t$ is a weighting function, which in this case is $w_{F4}$; $p_t$ is the probability of term $t$ occurring in a relevant document; and $q_t$ is the probability of term $t$ occurring in a non-relevant document. This means that, irrespective of the weighting function ($w_t$) used, the rule for deciding the inclusion of a term in a query expansion search should be based on the ranking of wpq instead of $w_t$ alone.
Substituting the weighting function and the probabilities in wpq with their expressions in $r$, $R$, $n$ and $N$, we get:

$$ \mathrm{wpq} = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)} \cdot \left( \frac{r}{R} - \frac{n - r}{N - R} \right) $$
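The wpq ranking rule of Section 2.2.1 can be sketched in Python as follows. This is a minimal illustration, not the UCLA-Okapi code: the function names, the natural logarithm, the estimates p_t = r/R and q_t = (n - r)/(N - R), and the three candidate terms with their counts are all assumptions chosen to show how ranking by wpq can differ from ranking by the F4 weight alone.

```python
import math


def f4_weight(N, R, n, r):
    """F4 point-5 relevance weight (natural log assumed)."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def wpq(N, R, n, r):
    """wpq = w_t * (p_t - q_t), with p_t and q_t estimated from counts.

    Requires R > 0 and N > R, otherwise the estimates are undefined.
    """
    p_t = r / R              # assumed estimate: share of relevant docs with the term
    q_t = (n - r) / (N - R)  # assumed estimate: share of non-relevant docs with the term
    return f4_weight(N, R, n, r) * (p_t - q_t)


def rank_terms(stats, N, R, score):
    """Return the terms in stats sorted by descending score.

    stats maps each candidate term to its (n, r) counts.
    """
    return sorted(stats, key=lambda t: score(N, R, *stats[t]), reverse=True)


# Hypothetical collection and candidate terms: term -> (n, r).
N, R = 10_000, 10
candidates = {
    "okapi": (20, 2),         # rare, found in few relevant documents
    "retrieval": (2000, 9),   # frequent, found in most relevant documents
    "feedback": (150, 6),     # moderately rare, found in many relevant documents
}

by_wpq = rank_terms(candidates, N, R, wpq)
by_f4 = rank_terms(candidates, N, R, f4_weight)
```

In this made-up example the F4 weight alone favours the rarest term, whereas wpq also rewards terms that occur in a large share of the relevant sample, which is the motivation for ranking expansion terms by wpq rather than by the weight alone.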