seen that the CombSUM method performs significantly better than the best single individual run, Pn2.0; a two-tailed paired t-test on the CombSUM and Pn2.0 average precisions results in a p value of 3.1e-05, which indicates these results are conclusive. However, comparing the CombSUM results with the best individual run on a per-query basis results in a p value of approximately 0.16, indicating that there is a 16 percent chance that the CombSUM method is no better than the best individual run, Pn2.0, for any specific query. Performing the same calculation on the R-Precision values yields similar significance findings.

While combining all five runs produced an overall improvement in retrieval effectiveness over each of the runs, the same does not always hold true when combining only two or three runs. Each of the ten combinations of two runs under CombSUM was performed for both of the AP test collections, as well as a run combining all three of the P-norm runs. The results are given in Table 8. Most of the combinations of two runs performed worse than the better of the two runs while performing better than the poorer of the two. One notable exception is the combination of the two vector runs, which performed noticeably worse than either of the two runs.

3.4 Collection Merging

The retrieval results for each of the collections were combined by simply merging the results based solely on the combined similarity values. Since the retrieval runs were based on term weights without collection statistics such as inverse document frequency, the similarity values were directly comparable across collections. The results of merging the CombSUM results by summed similarity value for both disks are shown in the last column of Table 5.

4 TREC-2 Results

The procedure described above was used for both our official TREC-2 routing and ad-hoc results. The same queries for ad-hoc topics 51 to 100 that were used for testing the method above were used as the routing queries against the new collections on disk 3. The results obtained from performing the CombSUM retrieval runs for each of the four collections, as well as the merged results, are shown in Table 9. The two CombSUM entries in the last column of the table are the official TREC-2 results. Since we concentrated on the ad-hoc evaluations, these routing results are included primarily for the benefit of other groups, for purposes of comparison. The ad-hoc queries for topics 101 to 150 were evaluated in the same manner, and are reported in Table 10. Again, the official results are the two CombSUM entries in the last column of the table.

As can be seen from Table 12, the CombSUM method performs quite poorly for certain topics while performing very well for others, compared to the best single run's results for that topic. Comparing the CombSUM results to the single best individual run (Pn2.0) shows an improvement for 46 out of the 50 topics, which indicates that the CombSUM run performs much better than any single individual run. Performing a two-tailed paired t-test on the Pn2.0 and CombSUM precisions results in a p value of about 1.1e-11, which indicates these results are very conclusive.
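As a concrete illustration of the combination and significance calculations discussed above, the following Python sketch shows one way the CombSUM scores, the cross-collection merge by summed similarity, and the two-tailed paired t-test could be computed. It is a minimal sketch, not the code used in these experiments: the function names, data structures, placeholder precision values, and the use of scipy are assumptions introduced here for illustration.

    # Illustrative sketch (not the original experimental code): CombSUM,
    # merging by summed similarity, and a two-tailed paired t-test.
    from collections import defaultdict
    from scipy.stats import ttest_rel

    def comb_sum(runs):
        """Sum each document's similarity over all runs (CombSUM).

        `runs` is a list of {doc_id: similarity} dictionaries, one per
        individual retrieval run; a document absent from a run contributes 0.
        """
        combined = defaultdict(float)
        for run in runs:
            for doc_id, sim in run.items():
                combined[doc_id] += sim
        return dict(combined)

    def merge_collections(per_collection_results, depth=1000):
        """Merge per-collection CombSUM results into one ranked list.

        Because the similarity values are assumed directly comparable across
        collections (no collection-level statistics such as idf), the merged
        ranking simply sorts all documents by their combined similarity.
        Document identifiers are assumed distinct across collections.
        """
        pooled = {}
        for results in per_collection_results:
            pooled.update(results)
        ranking = sorted(pooled.items(), key=lambda item: item[1], reverse=True)
        return ranking[:depth]

    # Hypothetical per-topic average precision values for two runs;
    # ttest_rel performs a two-tailed paired t-test like the one reported above.
    pn20_ap    = [0.31, 0.22, 0.40, 0.18]   # best single run (placeholder values)
    combsum_ap = [0.35, 0.27, 0.43, 0.21]   # CombSUM run     (placeholder values)
    t_statistic, p_value = ttest_rel(combsum_ap, pn20_ap)

In such a sketch, a very small p value on the paired per-topic precisions corresponds to the "conclusive" differences reported above, while a larger value (such as the 0.16 noted earlier) leaves open the possibility that the combined run is no better than the best individual run on a given topic.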
However, comparing the CombSUM results with the best individual runs on a per-query basis results in a p value of about 0.2, indicating that there is a 20 percent chance that the CombSUM method is no better than the best individual run for each specific query. Again, performing the same calculation on the R-Precision values yields similar results.

4.1 The CEO Model

The Combination of Expert Opinion (CEO) model [6, 7] of Thompson can be used to treat the different retrieval methods as experts, and allows their weighting probability distributions to be combined to improve performance. This could be used in a variety of ways to combine results from a variety of runs and indexing schemes (which could include stemming and/or morphological analysis). For TREC-2, the CEO experiments completed consisted of combining seven individual runs: the three P-norm extended boolean retrieval run types described above, and retrieval runs based on the long vector queries, using both cosine correlation and inner product similarity measures with the SMART system term-weighting schemes lnn and atn. Further discussion of this process and its results is given elsewhere in these proceedings.

4.2 Evaluation

Combining evidence from multiple sources to improve retrieval effectiveness has been performed before in various incarnations, most recently by Belkin et al. [1], who evaluated the progressive effect of considering multiple soft boolean representations to improve on a base INQUERY natural language retrieval run. In their experiments, the base INQUERY natural language run performed better than any of the boolean representations, and they report that combining the results from the natural language representation and the combined boolean representations with equal weights performed worse than the best single run. Not until weighting the natural language run four times more