Table 3: Average Precision and Exact R-Precision for P-norm experiments on weighting with the AP and WSJ collections (Ad-hoc Topics 51-100).

                         Average Precision             R-Precision
  Coll.   P-value     ann      bnn      mnn        ann      bnn      mnn
  AP-1      1.0      0.2810   0.2419   0.1419     0.2688   0.2660   0.1689
            1.5      0.3122   0.2581   0.1444     0.2976   0.2732   0.1757
            2.0      0.3027   0.2510   0.1457     0.2968   0.2775   0.1707
  AP-2      1.0      0.3004   0.2672   0.1826     0.3165   0.2864   0.2046
            1.5      0.3332   0.2999   0.1831     0.3412   0.3118   0.2161
            2.0      0.3300   0.2922   0.1847     0.3339   0.3057   0.2284
  WSJ-1     1.0      0.2941   0.2485   0.1742     0.3221   0.2830   0.2181
            1.5      0.3199   0.2753   0.1774     0.3443   0.2994   0.2225
            2.0      0.3217   0.2752   0.1776     0.3470   0.3013   0.2277
  WSJ-2     1.0      0.2206   0.1881   0.1356     0.2367   0.2094   0.1722
            1.5      0.2327   0.2013   0.1174     0.2511   0.2234   0.1549
            2.0      0.2325   0.1970   0.1098     0.2442   0.2158   0.1445

… Table 5. In general, the P-norm queries performed better than the vector queries. The most effective P-value, however, differed between the collections: the AP runs performed better with a P-value of 1.5, while a P-value of 2.0 performed better for the WSJ collections.
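For reference, since the P-norm model itself is not restated in this excerpt: the P-value p is the parameter of the P-norm (extended Boolean) retrieval model. A sketch of the standard formulation, with notation ours, where a_i is the query weight and d_i the document weight (in [0, 1]) of term i:

    \mathrm{sim}(Q_{\mathrm{or}(p)}, D) = \left( \frac{\sum_{i=1}^{n} a_i^{p} d_i^{p}}{\sum_{i=1}^{n} a_i^{p}} \right)^{1/p}
    \qquad
    \mathrm{sim}(Q_{\mathrm{and}(p)}, D) = 1 - \left( \frac{\sum_{i=1}^{n} a_i^{p} (1 - d_i)^{p}}{\sum_{i=1}^{n} a_i^{p}} \right)^{1/p}

With p = 1 both forms collapse to a weighted vector-style score, and as p grows they approach strict Boolean evaluation, which is why intermediate values such as 1.5 and 2.0 are compared in Table 3.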
3.3 Combination Retrieval Runs

Our experiments in TREC-1 involved combining the results from several different retrieval runs for a given collection, either by simply taking the top N documents retrieved for each run or by modifying the value of N for each run based on the eleven-point average precision for that run. We felt these efforts suffered from considering only the rank of a retrieved document and not the actual similarity value itself. In TREC-2, our experiments concentrated on methods of combining runs based on the similarity values of a document to each query for each of the runs. Additionally, combining the similarities at retrieval time had the advantage of extra evidence over combining separate results files, since the similarity of every document for each run was available instead of just the similarities for the top 1000 documents for each run. While our results for four of the training collections indicated that the P-norm queries performed better than the vector queries, this result was likely specific to the actual queries involved and not necessarily true in general. This led to a decision to weight each of the separate runs equally and not favor any individual run or method. In general, it may be desirable or necessary to weight a single run more, or less, depending on its overall performance; this could be especially useful in a routing situation.

Table 6: Formulas for combining similarity values.

  Name       Combined Similarity
  CombMAX    MAX(Individual Similarities)
  CombMIN    MIN(Individual Similarities)
  CombSUM    SUM(Individual Similarities)
  CombANZ    SUM(Individual Similarities) / Number of Nonzero Similarities
  CombMNZ    SUM(Individual Similarities) * Number of Nonzero Similarities
  CombMED    MED(Individual Similarities)

For any given information retrieval ranking method, there are two primary types of errors that can occur: assigning a relatively high rank to a non-relevant document, and assigning a relatively low rank to a relevant document. It has been shown that different retrieval paradigms will perform differently on the same set of data, often with little overlap in the set of retrieved documents [5]. For instance, when one retrieval method assigns a high rank to a non-relevant document, a different retrieval method is likely to assign that document a much lower rank. Similarly, when one retrieval method fails to assign a high rank to a relevant document, a different retrieval method is likely to assign that document a high rank. This characteristic of information retrieval methods indicates that some way of considering both retrieval methods together should help to decrease the probability of either error occurring; of course, it is also possible for both methods to highly rank a non-relevant document or to poorly rank a relevant document. Six methods of combining the similarity values were tested in our TREC-2 experiments, as summarized in Table 6.
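As an illustration of how the rules in Table 6 operate on per-run similarity values, the following is a minimal Python sketch; the function, variable names, and example document IDs are ours, not the authors' implementation, and a document that a run did not retrieve is treated in this sketch as having similarity zero.

    from statistics import median
    from typing import Dict, List

    def combine(runs: List[Dict[str, float]], method: str = "CombSUM") -> Dict[str, float]:
        """Fuse per-run similarity values into a single score per document."""
        doc_ids = set().union(*(run.keys() for run in runs))
        fused = {}
        for doc in doc_ids:
            sims = [run.get(doc, 0.0) for run in runs]   # missing from a run -> 0.0
            nonzero = sum(1 for s in sims if s > 0.0)
            total = sum(sims)
            if method == "CombMAX":
                fused[doc] = max(sims)
            elif method == "CombMIN":
                fused[doc] = min(sims)
            elif method == "CombSUM":
                fused[doc] = total
            elif method == "CombANZ":      # sum averaged over the non-zero similarities
                fused[doc] = total / nonzero if nonzero else 0.0
            elif method == "CombMNZ":      # sum boosted by the number of runs retrieving the doc
                fused[doc] = total * nonzero
            elif method == "CombMED":
                fused[doc] = median(sims)
            else:
                raise ValueError(f"unknown combination method: {method}")
        return fused

    # Hypothetical example: fusing a vector run and a P-norm run with equal weight.
    vector_run = {"WSJ0001-0001": 0.62, "AP0001-0001": 0.35}
    pnorm_run = {"WSJ0001-0001": 0.48, "AP0002-0002": 0.41}
    ranking = sorted(combine([vector_run, pnorm_run], "CombMNZ").items(),
                     key=lambda item: item[1], reverse=True)

Because each run contributes its raw similarity unchanged, all runs are weighted equally, matching the decision described above; per-run weights could be introduced by scaling each run's similarities before fusion.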