NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2), edited by D. K. Harman, National Institute of Standards and Technology

Combining Evidence for Information Retrieval

N.J. Belkin, P. Kantor, C. Cool, R. Quatrain
School of Communication, Information & Library Studies
Rutgers University
New Brunswick, NJ 08903 USA
[belkin|kantorp|ccool|quatrain]@cs.rutgers.edu

Abstract

This study investigated the effect on retrieval performance of two methods of combining multiple representations of TREC topics. Five separate Boolean queries for each of the 50 TREC routing topics and for 25 of the TREC ad hoc topics were generated by 75 experienced online searchers. Using the INQUERY retrieval system, these queries were both combined into single queries and used to produce five separate retrieval results for each topic. In the former case, results indicate that progressive combination of queries leads to progressively improving retrieval performance, significantly better than that of single queries and at least as good as that of the best individual query formulation. In the latter case, data fusion of the ranked lists also led to performance better than that of any single list.

1. Introduction

The general goal of our project in the TREC-2 program was to investigate the effect on information retrieval (IR) system performance of making use of several different formulations of a single information problem. The basis for this work lies in both theory and empirical evidence. From the empirical point of view, it has been noted for some time that different representations of the same information problem retrieve sets (or ranked lists) of documents which contain different relevant, as well as non-relevant, documents (see, e.g., McGill, Koll & Norreault, 1979; Saracevic & Kantor, 1988).
There is some implication from this evidence (made explicit by Saracevic and Kantor, 1988) that taking account of the different results of the different formulations could lead to retrieval performance that is better than that of any of the individual query formulations.

From the theoretical point of view, IR can be considered as a problem of inference (see, e.g., van Rijsbergen, 1986). That is, IR is concerned with estimating, given available evidence about such things as information problems and documents (or, in general, retrievable information objects), the likelihood (or probability, or degree) of relevance of a document to the information problem. From this point of view, different query formulations constitute different sources of evidence which could be used to infer the probable relevance of a document to an information problem, and it is thus reasonable to consider ways in which to use (i.e., combine) these sources of evidence in the inference process.

These ideas are general to any source of evidence which might be used for IR, such as the evidence of different retrieval techniques, different document representation techniques, or, in general, different IR systems. One aspect of our project uses the example of different query formulations as a simulation of the general problem of combination of evidence from different systems.

An additional argument is available for the special case of different query representations. That is, if we consider an information problem to be a complex and, in general, difficult-to-specify entity (see, e.g., Taylor, 1968; Belkin, Oddy & Brooks, 1982), then we might conclude that each different representation, derived from some statement by the user, is a different interpretation of the user's underlying information problem, highly unlikely to be like anyone else's (or any other system's) interpretation.
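The idea of combining such sources of evidence can be made concrete with a small sketch. The following Python fragment is our illustration only, not the procedure actually used with INQUERY; the document identifiers and scores are invented. It sums the scores that several query formulations assign to each document and ranks documents by the fused score, so a document supported by more than one formulation can outrank one supported strongly by a single formulation:

```python
# Illustrative sketch (not the authors' actual method): combine
# evidence from several query formulations by summing, for each
# document, the scores the formulations assign to it.

def fuse(runs):
    """Sum per-document scores across runs; rank documents by fused score."""
    fused = {}
    for run in runs:
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused, key=fused.get, reverse=True)

run_a = {"d1": 0.9, "d2": 0.4, "d3": 0.1}  # scores from formulation A
run_b = {"d2": 0.8, "d3": 0.7}             # scores from formulation B

print(fuse([run_a, run_b]))  # d2 (1.2) now outranks d1 (0.9) and d3 (0.8)
```

Here d2 is only the second choice of formulation A, but because formulation B also supports it, the combined evidence places it first.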
Given the empirical evidence, whether any one such interpretation is `better' than another seems moot. However, we might say that each captures some different, yet pertinent, aspect of the user's underlying problem; or that those aspects of the different interpretations which are common to them all (or to more than one) reflect some `core' aspect of the problem. Although techniques for making use of the different interpretations might vary according to which of these two views one takes, the general position suggests that it will always be a good idea to take advantage of as many such interpretations as possible. For this case, we therefore consider the issue of combination of different query representations within the `same' IR system.

Our project thus considers the problem of inference in IR at two levels of analysis. The first level, as introduced by Turtle & Croft (1991), asks about the effect of evidence obtained when two or more formal query statements are produced for the same information problem. The second level, which is simulated in this study, asks about combination of evidence provided by two or more distinct systems, each ranking the same set of documents in response to the same problem. To distinguish these two levels, and in keeping with earlier discussions of the issues involved, we henceforth refer to the combination of query statements as "query combination", and to the combination of evidence from differing systems as "data fusion".

Others have also addressed various aspects of this general question. Apart from those already cited, we mention in particular the work of Fox and his colleagues (Fox et al., 1993; Fox and Shaw, this volume), and that of Belkin et al. (1993). These studies in fact address precisely the question of query combination, the Belkin et al. work being a direct precursor to this one, and the Fox et al. studies using different query formulation, combination and retrieval techniques, but with very similar results.

Why ought either of these two methods work in the IR situation? The central idea is that either the specific internal score, assigned to a document for a query, or the rank of