seen that the CombSUM method performs significantly better than the best single individual run, Pn2.0; a two-tailed paired t test on the CombSUM and Pn2.0 average precisions results in a p value of 3.1e-05, which indicates these results are conclusive. However, comparing the CombSUM results with the best individual run for each query results in a p value of approximately 0.16, indicating that there is a 16 percent chance that the CombSUM method is no better than the best individual run, Pn2.0, for any specific query. Performing the same calculation on the R-Precision values yields similar significance findings.
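
As an illustration, this significance test can be sketched in a few lines of Python; the per-query average precision values below are placeholders rather than the measured TREC-2 figures.

    # Two-tailed paired t test on per-query average precision, as described
    # above.  The score lists are placeholders, not the measured values.
    from scipy import stats

    combsum_ap = [0.42, 0.31, 0.55, 0.27]   # hypothetical CombSUM average precision per query
    pn20_ap    = [0.38, 0.30, 0.49, 0.22]   # hypothetical Pn2.0 average precision per query

    t_stat, p_value = stats.ttest_rel(combsum_ap, pn20_ap)
    print("t = %.3f, p = %.3g" % (t_stat, p_value))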
While combining all five runs produced an overall
improvement in retrieval effectiveness over each of the
runs, the same does not always hold true when com-
bining only two or three runs. Each of the ten CombSUM combinations of two runs was performed for both
of the AP test collections, as well as a run combining
all three of the P-norm runs. The results of these are
given in Table 8. Most of the combinations of two runs
performed worse than the better of the two runs while
performing better than the poorer of the two runs. One
notable exception to this is the combination of the two
vector runs, which performed noticeably poorer than
either of the two runs.
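
For concreteness, the CombSUM combination itself can be sketched as follows; the run contents and document identifiers are illustrative only, and the function name comb_sum is ours rather than part of any existing system.

    # CombSUM: for each document, sum the similarity values it receives from
    # the individual runs (a run that does not retrieve the document adds 0),
    # then rank by the summed value.  Run contents are illustrative.
    from collections import defaultdict

    def comb_sum(*runs):
        """Each run maps document id -> similarity value."""
        combined = defaultdict(float)
        for run in runs:
            for doc_id, score in run.items():
                combined[doc_id] += score
        return sorted(combined.items(), key=lambda item: item[1], reverse=True)

    vector_run = {"AP890101-0001": 0.61, "AP890102-0042": 0.35}
    pnorm_run  = {"AP890101-0001": 0.48, "AP890103-0007": 0.52}
    print(comb_sum(vector_run, pnorm_run))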
3.4 Collection Merging
The retrieval results for each of the collections were
combined by simply merging the results based solely on
the combined similarity values. Since the retrieval runs
were based on term weights without collection statis-
tics such as inverse document frequency, the similarity
values were directly comparable across collections. The
results of merging the CombSUM results by summed similarity value for both disks are shown in the last column of Table 5.
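
A minimal sketch of this merging step follows, assuming each collection's CombSUM output is a list of (document id, summed similarity) pairs; the identifiers and scores shown are invented for illustration.

    # Merge the per-collection CombSUM result lists into a single ranking by
    # sorting the pooled (document id, summed similarity) pairs.  Because the
    # term weights use no collection-dependent statistics, the scores are
    # directly comparable.  Identifiers and scores are invented.
    def merge_collections(*result_lists):
        pooled = [pair for results in result_lists for pair in results]
        return sorted(pooled, key=lambda pair: pair[1], reverse=True)

    disk1_results = [("WSJ870324-0001", 1.92), ("WSJ880511-0013", 1.40)]
    disk2_results = [("AP880212-0047", 1.75)]
    print(merge_collections(disk1_results, disk2_results))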
4 TREC-2 Results
The procedure described above was used for both our
official TREC-2 routing and ad-hoc results. The exact queries for ad-hoc topics 51 to 100 that were used above to test our method were used as the routing queries against the new collections on disk 3. The results obtained
from performing the CombSUM retrieval runs for each
of the four collections as well as the merged results are
shown in Table 9. The two CombSUM entries in the
last column of the table are the official TREC-2 results.
Since we concentrated on the ad-hoc evaluations, these
routing results are included primarily for the benefit
of other groups, for purposes of comparison. The ad-
hoc queries for topics 101 to 150 were evaluated in the
same manner, and are reported in Table 10. Again,
the official results are the two CombSUM entries in the last column of the table.
As can be seen from Table 12, the CombSUM method performs quite poorly for certain topics while performing very well for others, compared to the best single run's results for that topic. Comparing the CombSUM results to the single best individual run (Pn2.0) shows an improvement for 46 out of the 50 topics, which shows that the CombSUM run performs much better than any single individual run. Performing a two-tailed paired t test on the Pn2.0 and CombSUM precisions results in a p value of about 1.1e-11, which indicates these results are very conclusive. However, comparing the CombSUM results with the best individual runs on a per-query basis results in a p value of about 0.2, indicating that there is a 20 percent chance that the CombSUM method is no better than the best individual run for each specific query. Again, performing the same calculation on the R-Precision values yields similar results.
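
The per-topic comparison and the accompanying t test can be sketched in the same way as before; the per-topic precision values are again placeholders, not the actual TREC-2 measurements.

    # Count the topics where CombSUM beats the single best run, then apply the
    # same two-tailed paired t test.  The per-topic values are placeholders.
    from scipy import stats

    combsum = {101: 0.44, 102: 0.29, 103: 0.61}   # hypothetical per-topic precision
    pn20    = {101: 0.37, 102: 0.31, 103: 0.50}   # hypothetical per-topic precision

    topics = sorted(combsum)
    wins = sum(combsum[t] > pn20[t] for t in topics)
    t_stat, p_value = stats.ttest_rel([combsum[t] for t in topics],
                                      [pn20[t] for t in topics])
    print("CombSUM better on %d of %d topics; p = %.3g" % (wins, len(topics), p_value))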
4.1 The CEO Model
The Combination of Expert Opinion (CEO)
model [6, 7] of Thompson treats the different retrieval methods as experts and combines their weighting probability distributions to improve performance. This could be used in a variety of ways
to combine results from a variety of runs and indexing
schemes (that could include stemming and/or morpho-
logical analysis). For TREC-2, the CEO experiments
completed consisted of combining seven individual runs: the three P-norm extended boolean retrieval run types described above, and retrieval runs based on the long vector queries, using both cosine correlation and inner product similarity measures with the SMART system term weighting schemes lnn and atn. This process and its results are described further elsewhere in these proceedings.
4.2 Evaluation
Combining evidence from multiple sources to improve retrieval effectiveness has been investigated before in various incarnations, most recently by Belkin et al. [1], who evaluated the progressive effect of considering multiple soft boolean representations to improve on a base INQUERY natural language retrieval run. In their experiments, the base INQUERY natural
language run performed better than any of the boolean
representations, and they report that combining the re-
sults from the natural language representation and the
combined boolean representations with equal weights
performed worse than the best single run. Not until
weighting the natural language run four times more