NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Combining Evidence for Information Retrieval
N. Belkin, P. Kantor, C. Cool, R. Quatrain
National Institute of Standards and Technology
D. K. Harman

ours, although their specific figures are different. So we are willing to believe that the influence of our experimental situation is probably not enough to invalidate our results at some level of generality.

There are several aspects of our general results which are of some interest, apart from the issues of query combination and data fusion. One has to do with the lack of any significant relationship between the number of words in a query and the performance of that query. It has been at least informally suggested in the IR community that the retrieval performance of queries increases with the number of words in the query. There is no support in our data for this hypothesis. Indeed, in our data, there is at least one one-word query which performed better than all of the other, multi-word queries for that topic.

It is also of interest that none of the query/searcher characteristics was related to performance. This may be a characteristic of our particular data set, but it also suggests that it will be rather difficult to identify characteristics of people or topics at this level which will be predictive of performance of the query.

Although the level of familiarity of the searchers with the topics was in general rather low, our searchers nevertheless found it not too difficult to formulate queries (mean of 2.82 on a scale of 1 to 5), and felt that they had sufficient information to construct a reasonable query on the basis of the topic (mean of 3.2 on a scale of 1 to 5). This makes us think that the queries are likely to be reasonable formulations of the search topics, at least as far as the searchers are concerned.
But the range and variability of the numbers of words and numbers of operators per topic seem to indicate that the query formulations themselves are rather different (we have not yet compared them for overlap in specific words, but work on this issue is in progress). These two results seem to us to confirm our initial idea that each query formulation is indeed a "different" interpretation of the information problem, and thus to substantiate our general approach.

4.2 Query Combination Results

Our results, for both ad hoc and routing topics, seem clearly to show that, in general, the more evidence one has, and uses, in the form of different query formulations, the better the IR performance is going to be. In particular, tables 4, 6, 7 and 9 support this conclusion, in various respects.

From the results of tables 6 and 9, we can see that taking advantage of what one learns about query performance from one iteration doesn't help a lot after the first iteration, but on the other hand, it doesn't hurt, either. This suggests to us that continual modification and reweighting of the multiple query formulations in a combined query is likely to be useful in the general routing environment. But even doing it once, given the initial evidence, seems to help. This also suggests that continuing to add new query formulations to a combined query will likely help performance on subsequent runs.

Having said all this, it is worth considering the results of tables 5 and 8, which showed that picking the best 2-way or 3-way combination of query formulations was significantly better than using 4-way or 5-way combinations. On the face of it, this runs counter to the general result of "the more, the better". However, it is possible that this result is an artifact of our data. For both 2-way and 3-way combinations, it was possible to choose the best from ten different combinations.
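The pool sizes behind this remark can be checked with a few lines of arithmetic. The sketch below only counts how many k-way combinations can be drawn from five query formulations per topic; the names `N_FORMULATIONS` and `pool_sizes` are introduced here for illustration.

```python
from math import comb

# Five query formulations per topic, as in the experiments described here.
N_FORMULATIONS = 5

# Number of distinct k-way combinations available when picking "the best" one.
pool_sizes = {k: comb(N_FORMULATIONS, k) for k in range(1, N_FORMULATIONS + 1)}

for k, size in pool_sizes.items():
    print(f"{k}-way: best chosen from {size} candidate(s)")
```

The best 2-way or 3-way combination is the maximum over ten candidates, while the best 4-way combination is the maximum over five and the 5-way combination has no alternatives at all, so some of the apparent advantage may come from the larger pool alone.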
Because we had only five different query formulations for each topic, we had smaller pools from which to choose, for both single query formulations and for the 4-way and 5-way combinations. This issue needs further investigation.

4.3 Data Fusion Results

There are several points to be made with regard to the median-fusion scheme as implemented here. First, as may be expected from general arguments, it sometimes performs better than the best of the lists which are joined by the fusion process.

Second, it does not perform as well as even the symmetric (unweighted) combinations made using the internal scores generated by the INQUERY system. This is expected, since those scores contain more information than the rankings alone. One can imagine special cases in which the distribution of scores assigned to a document by several queries is such that the internal combination rule of unweighted sum does not perform as well, but this has apparently not occurred in the cases studied here.

Third, in the application to the routing problem we have, in fact, operated in a batch mode. For a true routing situation, it would be necessary to estimate cutoff scores for the several query formulations, corresponding to the cutoff rank on the fused list. For large data sets this can be done easily. Without this step, it is not possible to make an immediate decision about a newly presented document.

Fourth, the stability induced by using this system was manifested in the case of the one query for which we discovered, too late to make the change, that one of the query formulations was in error. For this case, all of the "average combination of evidence formulations" performed more poorly than the fusion rule. This is because one, or even two, disastrously bad query formulations will have little effect on the results of the 3-of-5 fusion rule.
Of course, except for the case of combining all five query formulations, the best of one, two, three or four query formulations can do well because the one bad formulation will be missing from the combinations that are best.

Finally, the application of data fusion here, at the so-called decision level (that is to say, after the documents have been ranked according to several rules), is a simulation for the case to which it should be applied. Since the specific system that we used permits internal manipulation of scores, there is no need to delay combination until after the output lists have been formed. But in realistic settings, several distinct systems will have internal operations which are not compatible, so that, even if it were possible to extract the internal scores, it would not be apparent how to combine them.
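
The median-fusion rule discussed in this section can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: it assumes that the 3-of-5 rule assigns each document the third-smallest of its five ranks, that a document absent from a list is ranked just past the longest list, and that ties are broken by document identifier. The names `fused_ranks`, `median_fusion` and `route`, and the example document identifiers and scores, are hypothetical. The `route` function shows one possible reading of the cutoff-score step for a true routing situation.

```python
from typing import Dict, List, Optional, Sequence


def fused_ranks(rankings: Sequence[Sequence[str]],
                k: Optional[int] = None) -> Dict[str, int]:
    """Map each document to the k-th smallest of its ranks across the lists.

    With five lists and k=3 this is the median rank, i.e. the smallest
    cutoff at which the document appears on at least 3 of the 5 lists.
    """
    n = len(rankings)
    if k is None:
        k = n // 2 + 1  # simple majority, e.g. 3 of 5
    # Assumption: a document absent from a list is ranked just past the longest list.
    missing = max(len(r) for r in rankings) + 1
    positions = [{doc: i + 1 for i, doc in enumerate(r)} for r in rankings]
    docs = set().union(*positions)
    return {d: sorted(p.get(d, missing) for p in positions)[k - 1] for d in docs}


def median_fusion(rankings: Sequence[Sequence[str]],
                  k: Optional[int] = None) -> List[str]:
    """Return documents ordered by fused rank (ties broken by identifier)."""
    ranks = fused_ranks(rankings, k)
    return sorted(ranks, key=lambda d: (ranks[d], d))


def route(scores: Sequence[float], cutoff_scores: Sequence[float], k: int = 3) -> bool:
    """Accept a new document if at least k formulations score it at or
    above their estimated cutoff scores (hypothetical routing decision)."""
    votes = sum(s >= c for s, c in zip(scores, cutoff_scores))
    return votes >= k


# Three formulations agree; two are disastrously bad.
good = ["a", "b", "c"]
bad = ["z1", "z2", "z3", "z4", "z5", "z6", "b", "c"]
fused = median_fusion([good, good, good, bad, bad])
print(fused[:3])  # ['a', 'b', 'c']
```

With two of the five lists disastrously bad, the three agreeing lists still determine the top of the fused ranking, which is the stability property noted above.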