and routing topics. The first, which we label "comb1", was applied to the ad hoc topics. In this procedure, we simply combine the five query formulations for each topic directly into one query, using the INQUERY "unweighted sum" operator. This query is then used as the search statement in our experiments. In the ad hoc search environment, we cannot expect to have relevance judgments, and so we can do no more than simple combination.

The second combination procedure, called "combx", was used for the routing topics. Here, we did a separate search for each query formulation for all 50 topics, on the training set supplied from the TREC-1 data. From these results, we used the average 11-point precision (in the "official" results reported at TREC-2; precision at 100 documents for the "unofficial" results reported in this paper) of each query formulation as a weight for that formulation in the combination of all five formulations for each topic. For this, we used INQUERY's "weighted sum" operator. This procedure corresponds to constructing a simple combined query, learning something about how that query's components perform on the current database, and taking account of that evidence to modify the query formulation for searching the next database.

These methods of combining queries give us a very straightforward way to test our hypotheses about the effectiveness of multiple sources of evidence. For our experiments (as opposed to the results submitted to TREC-2, which were just the comb1 and combx results described above), we divided the query formulations for both the ad hoc and routing topics into five different groups. In each group, each topic was represented by one query, and no searcher was represented more than once in any one group. This distribution was meant to control for possible searcher effects. We then did runs for each single group, and for each combination of groups, for both ad hoc and routing topics. With these data, we were able to compare retrieval performance at different levels of query combination, and to compare the retrieval performance of combined queries with uncombined queries.

2.3 Data Fusion Experiments

Data fusion was accomplished by a list-merging method which is the natural extension of a 3-out-of-5 data fusion logic in the binary case. The basic data used were the five lists of documents retrieved by the five different query formulations for each topic. Every document has some rank in each of the five lists being joined together. An effective rank is calculated by taking the third highest of the five ranks which the document has. This has the same effect as moving a threshold along the list of effective ranks, and including a document in the output when it has appeared on three of the lists. Since there are five scores altogether, this can also be thought of as a median rule.

In practice, to maintain consistency with other parts of our work, we did not calculate the rank of every document, but worked with the lists of the top 1000 documents produced in response to each query formulation. This meant that some documents would appear on all five of the lists, others on just four, or three, or even fewer. Of course, the whole logic of data fusion suggests that those which appear on more lists are more likely to be relevant. We implemented this by forming a combined sort key consisting of (10 - degeneracy, third rank), where the degeneracy is the number of lists on which a specific document appears in the top 1000. We used a lexicographic sort, so that all items with degeneracy 5 appeared before any items with degeneracy 4, and so on. Within a given degeneracy, items with lower values for the third rank were ranked first.
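To make the fusion rule concrete, the following is a minimal Python sketch of the (10 - degeneracy, third rank) sort just described. It is an illustration under stated assumptions, not the authors' implementation: document identifiers are plain strings, missing ranks for documents appearing on fewer than three lists are padded with depth + 1 (a convention of ours, not specified in the text), and INQUERY's actual output format is not modeled.

from collections import defaultdict

def fuse_rankings(ranked_lists, depth=1000):
    # doc -> {list index: rank}; ranks are 1-based, best first
    ranks = defaultdict(dict)
    for i, docs in enumerate(ranked_lists):
        for r, doc in enumerate(docs[:depth], start=1):
            ranks[doc][i] = r

    n_lists = len(ranked_lists)

    def sort_key(doc):
        degeneracy = len(ranks[doc])          # number of lists containing doc
        # Pad missing ranks with depth + 1 (assumed convention), then take
        # the third-best rank as the effective (median) rank.
        padded = sorted(ranks[doc].values()) + [depth + 1] * (n_lists - degeneracy)
        effective = padded[2]
        # Lexicographic key mirroring the paper's (10 - degeneracy, third rank):
        # higher degeneracy first, then lower effective rank.
        return (10 - degeneracy, effective)

    return sorted(ranks, key=sort_key)

# Toy example: five formulations' result lists for one topic.
lists = [
    ["d1", "d2", "d3", "d4"],
    ["d2", "d1", "d5"],
    ["d2", "d3", "d1"],
    ["d4", "d2"],
    ["d1", "d6"],
]
print(fuse_rankings(lists)[:5])

Because the key is sorted lexicographically, all documents appearing on five lists precede those appearing on four, and within a degeneracy level the effective (median) rank breaks ties, which is what the median rule requires.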
3. Results

3.1 Caveats

The results presented in this paper differ in several ways from those submitted as "official" results to TREC-2, which are published at the end of this volume. According to our experimental design, there are five independently produced query formulations for each of the TREC topics. However, due to an uneven return rate among our searchers, we were missing one searcher's set of queries for the ad hoc topics, and three searchers' sets of queries for the routing topics, when we did the "official" runs. Consequently, in the official results, five ad hoc topics and fifteen routing topics are represented by four searches, rather than five. However, we were subsequently able to obtain substitute searchers, and so for the "unofficial" results presented and discussed in this paper, we have the full complement of 75 searchers and five query formulations per topic.

We were unable to report the data fusion results for the routing topics in the official results because of time constraints. We have subsequently been able to do those runs, and we report them here as unofficial results.

We also caution that one query for one of our ad hoc topics is known to have a syntactic error which resulted in very poor performance for that single query, and for all unweighted combinations of queries in which it was present. Therefore, some of our comparative results in the ad hoc case may be slightly incorrect.

3.2 General Results

Because our analyses of ad hoc topics are based on a subset of the total sample, we here consider questions of sample representativeness. As explained above, the sample was originally chosen to represent topic domains. To see whether this had introduced some other bias, we compared the distribution of our 25 topics along the three dimensions of topics proposed by Harman (this volume). These are: broadness, operationally defined as the total number of relevant documents found for that topic; hardness, operationally defined as the inverse of the median average precision for that topic; and restriction, defined according to linguistic characteristics of the topic. The distribution of the 25 topics in our sample did not differ significantly from the total ad hoc topic distribution on any of these dimensions, so we feel reasonably confident that we did not select a markedly biased subset of topics.

Tables 1 and 2 present a descriptive profile of the queries and topics in our study, based upon the query formulations