NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) UCLA-Okapi at TREC-2: Query Expansion Experiments chapter E. Efthimiadis P. Biron National Institute of Standards and Technology D. K. Harman The wpq algorithm combines the effects of the relevance weighting theory, as expressed by the Wf4 component, which assign greater importance to the infrequent terms with the frequency of occurrence of a term in the relevant document set. 2.2.2 The emim algorithm The expected mutual information measure (emim) is a term weighting model incorporating relevance information in which it is assumed that index terms may not be dis- tributed independently of each other. (van Rijsbergen, 1977; Harper and van Rusbergen, 1978; van Rijsbergen, Harper & Porter, 1981) The emim weight reduces to the f[OCRerr] weight when the "de- gree of involvement", i.e. the joint probabilities, are all unity. Assuming the same definitions for n, N, r, R, as those already used earlier, the emim weight of a term is calculated as follows: Eiq = pilill - P12i12 - p2ii21 + p22i22 = 1og;[OCRerr].r (n-[OCRerr]N -log .(n-r) (N-R)n (R-r)N -log .(R-r) (N-n)R (N-n-R+r)N +log (`v-n-R+r) (N-n)(N-R) 2.2.3 The porie[OCRerr] algorithm Porter and Galpin (1988) describe a ranking formula used in the MUSCAT online catalogue: r n porler = - N where r, R, n, N are defined as in the f[OCRerr] weight (eq. 1). 2.2.4 The r[OCRerr]lohi algorithm The r[OCRerr]lohi algorithm has been proposed by Efthimiadis (1993a) as the result of the observation of the ranking be- havior of six algorithms used for ranking terms for query expansion. The r[OCRerr]lohi ranking algorithm: * ranks terms according to r, i.e. their frequency of oc- currence in the relevant document set, in descending order and * resolves ties according to their term frequency, n, from low-to-high frequency. It was hypothesized that the r[OCRerr]lohi algorithm would have an almost identical ranking to porier and a performance ap- proaching that of wpq and emim. More differences between the algorithms may occur if the size of the set of relevant documents (R) gets larger. Conclusions about the algo- rithm however could not be drawn before it was evaluated against the other algorithms. The results of that evaluation are reported in Efthimiadis (in press) where the r[OCRerr]lohi al- gorithm demonstrated better performance when compared to the other algorithms. 2.2.5 The r[OCRerr]hilo sort A variant of the r[OCRerr]lohi algorithm is to rank candidate terms for query expansion using the r[OCRerr]hilo rank which: * ranks terms according to r, i.e. their frequency of oc- currence in the relevant document set, and * resolves ties according to their term frequency, n, from high-t[OCRerr]low frequency. Since the r[OCRerr]hilo algorithm will result in sorting terms in exactly the opposite way of the r[OCRerr]loh i algorithm it was in- cluded as a control for the study. 3 Methodology 3.1 Runs (5) Initial tests were performed in topics 1-50 where the depen- dent variables were the weighting function and the query processing of terms. [OCRerr]From the results obtained it was es- tablished that the function to use will be BM15 and that the parsing of the Topics would include both single terms and "phrases" as defined by comma delimited text in the Topics. The table below (Table 3.1 gives all the variables used in constructing the runs. The options available for each variable are also provided. Weighting Function: Best match function BM15 (see equation 2). Phrases: Choice of YES, NO, or BOTH. This determines the type of parsing of the "Concepts" and "Title" fields 281