Table 3: Average Precision and Exact R-Precision for P-norm experiments on weighting with the AP and WSJ collections (Ad-hoc Topics 51-100).

                        Average Precision             R-Precision
 Coll.   P-value     ann      bnn      mnn       ann      bnn      mnn
 AP-1      1.0     0.2810   0.2419   0.1419    0.2688   0.2660   0.1689
           1.5     0.3122   0.2581   0.1444    0.2976   0.2732   0.1757
           2.0     0.3027   0.2510   0.1457    0.2968   0.2775   0.1707
 AP-2      1.0     0.3004   0.2672   0.1826    0.3165   0.2864   0.2046
           1.5     0.3332   0.2999   0.1831    0.3412   0.3118   0.2161
           2.0     0.3300   0.2922   0.1847    0.3339   0.3057   0.2284
 WSJ-1     1.0     0.2941   0.2485   0.1742    0.3221   0.2830   0.2181
           1.5     0.3199   0.2753   0.1774    0.3443   0.2994   0.2225
           2.0     0.3217   0.2752   0.1776    0.3470   0.3013   0.2277
 WSJ-2     1.0     0.2206   0.1881   0.1356    0.2367   0.2094   0.1722
           1.5     0.2327   0.2013   0.1174    0.2511   0.2234   0.1549
           2.0     0.2325   0.1970   0.1098    0.2442   0.2158   0.1445
Table 5. In general, the P-norm queries performed better than the vector queries. The most effective P-value, however, differed between the collections: the AP runs performed better with a P-value of 1.5, while a P-value of 2.0 performed better for the WSJ collections.
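For reference, the following is a minimal sketch (in Python) of the unweighted P-norm OR and AND operators as they are commonly defined in the extended Boolean model; the term weights shown are hypothetical, and the runs in Table 3 additionally vary the document term weighting scheme (ann, bnn, mnn) rather than the P-value alone.

    def pnorm_or(weights, p):
        """Unweighted P-norm OR over document term weights in [0, 1].
        P = 1 behaves like a simple average (vector-like); as P grows,
        the operator approaches the strict Boolean MAX."""
        n = len(weights)
        return (sum(w ** p for w in weights) / n) ** (1.0 / p)

    def pnorm_and(weights, p):
        """Unweighted P-norm AND; approaches the strict Boolean MIN as P grows."""
        n = len(weights)
        return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

    # Hypothetical term weights for one document against a two-term clause.
    doc_weights = [0.8, 0.3]
    for p in (1.0, 1.5, 2.0):
        print(p, round(pnorm_or(doc_weights, p), 4), round(pnorm_and(doc_weights, p), 4))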
3.3 Combination Retrieval Runs
Our experiments in TREC-1 involved combining the results from several different retrieval runs for a given collection, either by simply taking the top N documents retrieved for each run or by modifying the value of N for each run based on the eleven-point average precision for that run. We felt these efforts suffered from considering only the rank of a retrieved document and not the actual similarity value itself. In TREC-2, our experiments concentrated on methods of combining runs based on the similarity values of a document to each query for each of the runs. Additionally, combining the similarities at retrieval time provided more evidence than combining separate results files, since the similarity of every document was available for each run instead of just the similarities for the top 1000 documents of each run. While our results for four of the training collections indicated that the P-norm queries performed better than the vector queries, this result was likely specific to the actual queries involved and not necessarily true in general. This led to the decision to weight each of the separate runs equally and not favor any individual run or method. In general, it may be desirable or necessary to weight a single run more, or less, depending on its overall performance; this could be especially useful in a routing situation.
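As a concrete illustration, a weighted sum of per-run similarities could be sketched as follows; the run names, document identifiers, and similarity values are hypothetical, and the equal-weight case corresponds to the TREC-2 runs described here.

    from collections import defaultdict

    def combine_runs(runs, run_weights=None):
        """Merge per-run similarity values into one ranking by a weighted sum.
        `runs` maps run name -> {doc_id: similarity}; `run_weights` maps
        run name -> weight, defaulting to equal weights for every run."""
        if run_weights is None:
            run_weights = {name: 1.0 for name in runs}   # treat all runs equally
        combined = defaultdict(float)
        for name, sims in runs.items():
            for doc_id, sim in sims.items():
                combined[doc_id] += run_weights[name] * sim
        return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

    # Hypothetical similarities from a vector run and a P-norm run.
    runs = {"vector": {"d1": 0.42, "d2": 0.10},
            "pnorm":  {"d1": 0.35, "d3": 0.28}}
    print(combine_runs(runs))                                  # equal weighting
    print(combine_runs(runs, {"vector": 0.7, "pnorm": 1.3}))   # favor one run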
For any given information retrieval ranking method,
there are two primary types of errors that can occur:
Table 6: Formulas for combining similarity values.
Name        Combined Similarity
CombMAX     MAX(Individual Similarities)
CombMIN     MIN(Individual Similarities)
CombSUM     SUM(Individual Similarities)
CombANZ     SUM(Individual Similarities) / (Number of Nonzero Similarities)
CombMNZ     SUM(Individual Similarities) * (Number of Nonzero Similarities)
CombMED     MED(Individual Similarities)
assigning a relatively high rank to a non-relevant docu-
ment, and assigning a relatively low rank to a relevant
document. It has been shown that different retrieval paradigms will perform differently on the same set of data, often with little overlap in the set of retrieved documents [5]. For instance, when one retrieval method
assigns a high rank to a non-relevant document, a differ-
ent retrieval method is likely to assign that document a
much lower rank. Similarly, when one retrieval method
fails to assign a high rank to a relevant document, a
different retrieval method is likely to assign that doc-
ument a high rank. This characteristic of information
retrieval methods indicates that some method of considering both retrieval methods together should help to decrease the probability of this happening; of course, it is also possible for both methods to rank a non-relevant document highly or to rank a relevant document poorly.
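The combining functions of Table 6 can be stated directly; the sketch below (again with hypothetical similarity values) assumes each document carries a similarity value from every run, with zero meaning the run did not retrieve the document.

    import statistics

    def comb_max(sims): return max(sims)
    def comb_min(sims): return min(sims)
    def comb_sum(sims): return sum(sims)
    def comb_med(sims): return statistics.median(sims)

    def comb_anz(sims):
        """CombSUM divided by the number of nonzero similarities."""
        nonzero = sum(1 for s in sims if s != 0)
        return sum(sims) / nonzero if nonzero else 0.0

    def comb_mnz(sims):
        """CombSUM multiplied by the number of nonzero similarities,
        which favors documents retrieved by several of the runs."""
        return sum(sims) * sum(1 for s in sims if s != 0)

    # Similarities of a single document under three hypothetical runs.
    sims = [0.42, 0.0, 0.35]
    for f in (comb_max, comb_min, comb_sum, comb_anz, comb_mnz, comb_med):
        print(f.__name__, round(f(sims), 4))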
Six methods of combining the similarity values were
tested in our TREC-2 experiments, as summarized in