SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Combining Evidence for Information Retrieval
chapter
N. Belkin
P. Kantor
C. Cool
R. Quatrain
National Institute of Standards and Technology
D. K. Harman
and routing topics. The first, which we label "comb1", was
applied to the ad hoc topics. In this procedure, we simply
combine the five query formulations for each topic directly
into one query, using the INQUERY "unweighted sum" op-
erator. This query is then used as the search statement in
our experiments. In the ad hoc search environment, we
cannot expect to have relevance judgments, and so we can
do no more than simple combination.
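As a rough illustration of this step, a combined query can be formed by wrapping the five formulations in a single unweighted sum operator. The #sum syntax and the sample formulations in the sketch below are assumptions for illustration only, not our actual queries.

```python
# Minimal sketch of the "comb1" combination. The #sum operator syntax and
# the example formulations are illustrative assumptions, not actual queries.
def comb1(formulations):
    """Concatenate five searchers' formulations under one unweighted sum."""
    return "#sum( " + " ".join(formulations) + " )"

formulations = [
    "oil spill cleanup",            # searcher 1 (hypothetical)
    "tanker accident pollution",    # searcher 2 (hypothetical)
    "coastal contamination",        # searcher 3 (hypothetical)
    "environmental damage tanker",  # searcher 4 (hypothetical)
    "crude oil leak response",      # searcher 5 (hypothetical)
]
print(comb1(formulations))
```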
The second combinatorial procedure, called 1'combx",
was used for the routing topics. Here, we did a separate
search for each separate query formulation for all 50 topics,
on the training set supplied from the TRBC- 1 data. From
these results, we used the average 11-point precision (in the
"official" results reporLed at TRBC-2; precision at 100 doc-
uments for the "unofficial results" reported in this paper) of
each query formulation as a weight for that formulation in
the combination of all five formulations for each topic.
For this, we used INQUERY's `tweighted sum" operator.
This procedure corresponds to constructing a simple com-
bined query, learning something about how that query's
components perform on the current database, and taking ac-
count of that evidence to modify the query formulation for
searching the next database.
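The weighted combination can be sketched analogously. The #wsum syntax and the precision values below are assumed for illustration; each formulation is weighted by its score on the training collection.

```python
# Minimal sketch of the "combx" combination. The #wsum syntax and the
# training-set precision values are illustrative assumptions.
def combx(formulations, training_scores):
    """Weight each formulation by its performance on the training data."""
    parts = [f"{w:.2f} {q}" for w, q in zip(training_scores, formulations)]
    return "#wsum( " + " ".join(parts) + " )"

formulations = ["oil spill cleanup", "tanker accident pollution",
                "coastal contamination", "environmental damage tanker",
                "crude oil leak response"]
scores = [0.21, 0.34, 0.18, 0.29, 0.25]   # e.g., avg. 11-pt precision (hypothetical)
print(combx(formulations, scores))
```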
These methods of combining queries give us a very
straightforward way to test our hypotheses about the effec-
tiveness of multiple sources of evidence. For our experi-
ments (as opposed to the results which were submitted to
TREC-2, which were just the comb1 and combx results as
described above), we divided the query formulations for both
ad hoc and routing topics into five different groups. In
each group, each topic was represented by one query, and no
searcher was represented more than once in any one group.
This distribution was meant to control for possible searcher
effects. We then did runs for each single group, and for
each combination of groups, for both ad hoc and routing
topics. With these data, we were able to compare retrieval
performance of different levels of query combination, and to
compare retrieval performance of combined queries with un-
combined ones.
2.3 Data Fusion Experiments
Data fusion was accomplished by a list-merging
method which is the natural extension of a 3-out-of-5 data
fusion logic in the binary case. The basic data used was the
five lists of documents retrieved by the five different query
formulations for each topic. Every document has some
rank in each of the five lists being joined together. An ef-
fective rank is calculated by taking the third highest of the
five ranks which the document has. This has the same ef-
fect as moving a threshold along the list of effective ranks,
and including a document in the output when it has ap-
peared on three of the lists. Since there are five scores all
together, this can also be thought of as a median rule.
In practice, to maintain consistency with other parts of
our work, we did not calculate the rank of every document,
but worked with the lists of the top 1000 documents pro-
duced in response to each query formulation. This meant
that some documents would appear on all five of the lists,
others on just four, or three, or even fewer. Of course, the
whole logic of data fusion suggests that those which appear
on more lists are more likely to be relevant. We imple-
mented this, in fact, by forming a combined sort key con-
sisting of (10 - degeneracy, 3rd rank). The degeneracy is the
number of lists on which a specific document appears in the
top 1000. We used a lexicographic sort, so that all items
with degeneracy 5 appeared before any items with degener-
acy 4, and so on. Within a given degeneracy, items with
lower values for the 3rd rank were ranked first.
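The following sketch shows one way to implement this keying and sorting. The treatment of documents appearing on fewer than three lists (padding their missing ranks with a value just past the 1000-document cutoff) is our assumption for the illustration.

```python
# Sketch of the fusion rule: sort documents by (10 - degeneracy, 3rd-best rank).
from collections import defaultdict

def fuse(ranked_lists, cutoff=1000):
    """Merge five top-1000 lists of document ids (best rank first)."""
    ranks = defaultdict(list)
    for lst in ranked_lists:
        for r, doc in enumerate(lst[:cutoff], start=1):
            ranks[doc].append(r)
    keyed = []
    for doc, rs in ranks.items():
        degeneracy = len(rs)          # number of lists containing this document
        rs = sorted(rs)
        while len(rs) < 3:            # assumed padding for docs on < 3 lists
            rs.append(cutoff + 1)
        keyed.append(((10 - degeneracy, rs[2]), doc))
    keyed.sort()                      # lexicographic: degeneracy first, then 3rd rank
    return [doc for _, doc in keyed]
```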
3. Results
3.1 Caveats
The results presented in this paper differ in several ways
from those submitted as "official" results to TREC-2,
which are published at the end of this volume. According
to our experimentai design, there are five independently pro-
duced query formulations for each of the TREC topics.
However, due to uneven return rate among our searchers, we
were missing one searcher's set of queries for the ad hoc
topics, and three searchers' sets of queries for the routing
topics, when we did the "official" runs. Consequently, in
the official results, five ad hoc topics and fifteen routing
topics are represented by four searches, rather than five.
However, we were subsequently able to obtain substitute
searchers, and so for the "unofficial" results presented and
discussed in this paper, we have the full complement of 75
searchers and five query formulations per topic.
We were unable to report the data fusion results for
routing topics for the official results, because of time con-
straints. We have subsequently been able to do those runs,
and report them here as unofficial results.
We also caution that one query for one of our ad hoc
topics is known to have a syntactic error which resulted in
very poor performance for that single query, and for all un-
weighted combinations of queries in which it was present.
Therefore, some of our comparative results in the ad hoc
case may be slightly incorrect.
3.2 General Results
Because our analyses of ad hoc topics are based on a
subset of the total sample, we consider here the question of
sample representativeness. As explained above, the sample
was originally chosen to represent topic domains. To see if
this had introduced some other bias, we compared the distri-
bution of our 25 topics along the three dimensions of top-
ics proposed by Harman (this volume). These are: broad-
ness, operationally defined as the total number of relevant
documents found for that topic; hardness, operationally de-
fined as inverse to the median average precision for that
topic; and, restriction, defined according to linguistic charac-
teristics of the topic. The distribution of the 25 topics in
our sample did not differ significantly from the total ad hoc
topic distribution on any of these dimensions, so we feel
reasonably confident that we did not select a markedly bi-
ased subset of topics.
Tables 1 and 2 present a descriptive profile of the
queries and topics in our study, based upon the query formu-