NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Combining Evidence for Information Retrieval
N. Belkin, P. Kantor, C. Cool, R. Quatrain
National Institute of Standards and Technology, D. K. Harman

modification of the query formulation. To do this, we compared performance of unweighted 5-way query combination (comb1) with performance using the best-performing query formulations in the training database (best1), the best-performing query formulations in the test database (best2), the weighted 5-way query combination using weights from the training database (combx), the weighted 5-way query combination using weights from the test database (comby), and the 5-way query combination weighted by the mean of the weights for the test and training databases (combxy). The weights that we used were the precision at 100 retrieved documents for each query formulation. In the official results, we used average 11-point precision. The reason for the change is that precision at some cutoff level is a realistic measure for the routing task in general, and especially in an operational environment, whereas the average precision is a measure that we cannot realistically expect to have in an operational environment. When we compared the performance of both weightings in the combx formulation, there was no significant difference. The results are presented in Tables 9 and 9a, and show that taking account of subsequent evidence has a positive and significant effect on performance. When reading Tables 9 and 9a, note that the entries for comb1 and fusion have already appeared in Table 7, as "5-way" and "fusion", respectively. Also, "best2" has already appeared in Table 8, as the best "1-way" combination.
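The combination schemes compared here can be sketched as a simple weighted score fusion. The function and variable names below are illustrative, not from the paper, and the per-document scores stand in for whatever belief values the underlying retrieval system (INQUERY, in this work) assigns; this is a minimal sketch of the idea, not the authors' implementation.

```python
def combine(scores_per_query, weights=None):
    """Fuse ranked-retrieval scores from several query formulations.

    scores_per_query: one dict per formulation mapping doc_id -> score
    (five formulations in the paper's 5-way runs).
    weights: one weight per formulation, e.g. precision at 100 retrieved
    documents on the training set (combx) or on the test set (comby);
    None gives the unweighted combination (comb1).
    """
    if weights is None:                      # comb1: all formulations equal
        weights = [1.0] * len(scores_per_query)
    combined = {}
    for qscores, w in zip(scores_per_query, weights):
        for doc, score in qscores.items():
            combined[doc] = combined.get(doc, 0.0) + w * score
    # rank documents by fused score, best first
    return sorted(combined, key=combined.get, reverse=True)

def combxy_weights(train_p100, test_p100):
    """combxy: weight each formulation by the mean of its training-set
    and test-set precision at 100 documents."""
    return [(a + b) / 2 for a, b in zip(train_p100, test_p100)]
```

Under this reading, combx, comby, and combxy differ only in where the weight vector comes from, which is what makes the scheme easy to retune as new relevance evidence arrives.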
         comb1   best1   best2   combx   comby   combxy  fusion
comb1      -      29      21     13.5    16*     14.5    28
best1     21       -      13**   14**    14**    12.5    22.5
best2     29      37**     -     23      20      23      36**
combx     36.5    36**    27      -      21.5    18**    40**
comby     34*     36**    30     28.5     -      25.5    36.5
combxy    35.5    37.5    27     32*     24.5     -      37**
fusion    22      27.5    14**   10**    13.5    13**     -

** = significant difference at p < .01, sign test
* = significant difference at p < .05, sign test
Read row with respect to column, e.g. combx performed better than comb1 36.5 times, or comb1 better than combx 13.5 times.

Table 9a. Number of times that one treatment for routing topics performed better than another.

comb1   best1   best2   combx   comby   combxy  fusion
.2807   .2721   .2931   .3012   .3090   .3068   .2661

comb1 = unweighted combination of all queries for each topic
best1 = best performing query (on training set) for each topic
best2 = best performing query (on test set) for each topic
combx = weighted (by prec. @ 100 docs in training set) combination of all queries for each topic
comby = weighted (by prec. @ 100 docs in test set) combination of all queries for each topic
combxy = weighted (by mean of prec. @ 100 docs in training and test sets) combination of all queries for each topic

Table 9. For routing topics, mean 11-point precision for seven treatments.

Table 9a encapsulates all of the key concepts of the several approaches to combination that we have explored. We have two approaches which are a priori and symmetric in their treatment of the query formulations (fusion and comb1). As expected, the fusion system, using the least information, performs worse. comb1, the symmetric formulation, does better, although the difference is not statistically significant. Both of these methods often perform better than the best of the individual formulations, and their relations to other combination schemes are (except for the relation to best2) quite similar.
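The paired comparisons in Table 9a can be reproduced with a sign test over per-topic scores. The sketch below is an assumption-laden reconstruction: the half-counts in the table (e.g. 13.5) suggest ties were split evenly between the two treatments, and the p-value here is an exact two-sided binomial test on the untied pairs, which is one standard form of the sign test rather than necessarily the authors' exact procedure.

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Compare two treatments over paired per-topic scores.

    Returns (wins_a, wins_b, p). Ties are split half to each side,
    which is how half-counts such as 13.5 can arise. The two-sided
    p-value is an exact binomial tail on the untied pairs under
    H0: P(a beats b) = 0.5.
    """
    wins_a = wins_b = 0.0
    n_untied = k = 0                 # k = untied wins for treatment a
    for a, b in zip(scores_a, scores_b):
        if a > b:
            wins_a += 1; n_untied += 1; k += 1
        elif b > a:
            wins_b += 1; n_untied += 1
        else:                        # tie: split between the two
            wins_a += 0.5; wins_b += 0.5
    # exact two-sided binomial tail probability
    tail = min(k, n_untied - k)
    p = sum(comb(n_untied, i) for i in range(tail + 1)) / 2 ** n_untied
    return wins_a, wins_b, min(1.0, 2 * p)
```

For example, a 9-to-1 split over ten untied topics gives p ≈ 0.021, significant at the .05 level marked with a single asterisk in the table.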
The query that performs best on the training set (best1) does not perform significantly better than any of the combination schemes. But that formulation which performs best on the test set (best2, also called 1-way in Table 8) is significantly better than best1 and the fusion scheme. Of greater interest are the methods representing adaptive weighting schemes: combx, comby and combxy. Most significantly, combx, the adaptive weighting formulation, is better than the symmetrically weighted combination (comb1), the fusion rule, and the best single formulation in a substantial fraction (over 70%) of all cases. The weighting based on the test set (comby) stands in essentially the same relation to those three other schemes. Finally, the weighting scheme combxy simulates a situation which might arise in updating or tuning a combination rule after two batches of documents have been retrieved. This is accomplished by averaging the weights assigned to each formulation in the training run with those assigned based on the test run. This scheme shows essentially the same profile as combx and comby when compared with the comb1, fusion, best1 and best2 schemes. It performs significantly better than combx, but not significantly better than comby.

4. Discussion

4.1 General Results

As is customary, we begin this section with a general disclaimer. In this case, we need to point out that all of our results were obtained with a very specific kind of query formulation technique and very special kinds of queries, and that all of our results were obtained within a very special retrieval context, the INQUERY system. It is certainly possible that these circumstances strongly affected our results, so that we cannot make widely general claims for them.
On the other hand, the results reported by Fox and Shaw (this volume), using queries generated in quite different ways, and using a quite different IR system and retrieval technique, are quite similar in general form and trend to