SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Combining Evidence for Information Retrieval
chapter
N. Belkin
P. Kantor
C. Cool
R. Quatrain
National Institute of Standards and Technology
D. K. Harman
ours, although their specific figures are different. So we are
willing to believe that the influence of our experimental si-
tuation is probably not enough to invalidate our results at
some level of generality.
There are several aspects of our general results which
are of some interest, apart from the issues of query combination
and data fusion. One is the lack of any significant relationship
between the number of words in a query and the performance of
that query. It has been suggested, at least informally, in the IR
community that retrieval performance increases with the number of
words in the query. There is no support in our data for this
hypothesis. Indeed, in our data there is at least one one-word
query which performed better than all of the other, multi-word
queries for that topic.
It is also of interest that none of the query/searcher
characteristics was related to performance. This may be a
characteristic of our particular data set, but it also suggests
that it will be rather difficult to identify characteristics of
people or topics at this level, which will be predictive of
performance of the query.
Although the searchers' level of familiarity with
the topics was in general rather low, our searchers neverthe-
less found it not too difficult to formulate queries (mean of
2.82 on a scale of 1 to 5), and felt that they had sufficient
information to construct a reasonable query, on the basis of
the topic (mean of 3.2 on a scale of 1 to 5). This makes us
think that the queries are likely to be reasonable formula-
tions of the search topics, at least as far as the searchers are
concerned. But the range and variability of the numbers of
words and numbers of operators per topic seems to indicate
that the query formulations themselves are rather different
(we have not yet compared them for overlap in specific
words, but work on this issue is in progress). These two
results seem to us to confirm our initial idea that each query
formulation is indeed a "different" interpretation of the in-
formation problem, and thus to substantiate our general ap-
proach.
4.2 Query Combination Results
Our results, for both ad hoc and routing topics, seem
clearly to show that, in general, the more evidence one has,
and uses, in the form of different query formulations, the
better the IR performance is going to be. In particular,
tables 4, 6, 7 and 9 support this conclusion in various
respects. From the results of tables 6 and 9, we can see that
taking advantage of what one learns about query performance
from one iteration does not help a lot after the first
iteration, but neither does it hurt. This
suggests to us that continual modification and reweighting
of the multiple query formulations in a combined query is
likely to be useful in the general routing environment. But
even doing it once, given the initial evidence, seems to
help. This also suggests that continuing to add new query
formulations to a combined query will likely help perfor-
mance on subsequent runs.
Having said all this, it is worth considering the results
of tables 5 and 8, which showed that picking the best 2-way
or 3-way combination of query formulations was signifi-
cantly better than using 4-way or 5-way combinations. On
the face of it, this runs counter to the general result of "the
more, the better". However, it is possible that this result is
an artifact of our data. For both 2-way and 3-way combinations,
it was possible to choose the best from ten different
combinations. Because we had only five different query
formulations for each topic, we had smaller pools from
which to choose, for both single query formulations, and
for the 4-way and 5-way combinations. This issue needs
further investigation.
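
The pool sizes behind this argument can be checked by enumeration. The
following is a minimal sketch, assuming five formulations per topic as
stated above; the labels q1 to q5 are hypothetical placeholders, not
names used in the experiment.

```python
from itertools import combinations

# Hypothetical labels for the five query formulations of one topic.
formulations = ["q1", "q2", "q3", "q4", "q5"]

# Number of distinct k-way combinations available for each k.
pool_sizes = {
    k: len(list(combinations(formulations, k))) for k in range(1, 6)
}

print(pool_sizes)  # {1: 5, 2: 10, 3: 10, 4: 5, 5: 1}
```

Picking the best of ten candidates gives more chances for a high score
than picking the best of five, or taking the single 5-way combination,
which is the sense in which the result may be an artifact.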
4.3 Data Fusion Results
There are several points to be made with regard to the
median-fusion scheme as implemented here. First, as may
be expected from general arguments, it sometimes performs
better than the best of the lists which are joined by the fu-
sion process. Second, it does not perform as well as even
the symmetric (unweighted) combinations made using the
internal scores generated by the INQUERY system. This is
expected, since those scores contain more information than
the rankings alone. One can imagine special cases in which
the distribution of scores assigned to a document by several
queries is such that the internal combination rule of un-
weighted sum does not perform as well, but this has appar-
ently not occurred in the cases studied here.
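
The difference between the two kinds of combination can be sketched as
follows. This is an illustration only, not the implementation used in
the experiments: the function names, the toy scores, and the treatment
of missing documents are all assumptions made for the example.

```python
from statistics import median


def median_rank_fusion(rankings):
    """Order documents by the median of their ranks across the lists.

    rankings: list of ranked lists of document ids, best first.
    Assumption: a document absent from a list gets rank len(list) + 1.
    """
    docs = sorted(set().union(*rankings))
    fused = {}
    for doc in docs:
        ranks = [
            ranking.index(doc) + 1 if doc in ranking else len(ranking) + 1
            for ranking in rankings
        ]
        fused[doc] = median(ranks)
    # Ties on median rank are broken by document id, for determinism.
    return sorted(docs, key=lambda d: (fused[d], d))


def unweighted_sum_fusion(score_lists):
    """Order documents by the plain sum of their scores.

    score_lists: list of dicts mapping document id to score.
    Assumption: a document absent from a dict contributes 0.
    """
    docs = sorted(set().union(*score_lists))
    totals = {d: sum(s.get(d, 0.0) for s in score_lists) for d in docs}
    return sorted(docs, key=lambda d: (-totals[d], d))


# Toy scores for three hypothetical formulations.
scores = [
    {"d2": 0.51, "d1": 0.50, "d3": 0.10},
    {"d2": 0.52, "d1": 0.50, "d3": 0.10},
    {"d1": 0.95, "d3": 0.30, "d2": 0.05},
]
rankings = [sorted(s, key=s.get, reverse=True) for s in scores]

print(median_rank_fusion(rankings))   # ['d2', 'd1', 'd3']
print(unweighted_sum_fusion(scores))  # ['d1', 'd2', 'd3']
```

In the toy data, two formulations prefer d2 over d1 by a very small
margin and the third prefers d1 by a large one. The rankings discard
the size of those margins, while the summed scores retain it.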
Third, in the application to the routing problem we
have, in fact, operated in a batch mode. For a true routing
situation, it would be necessary to estimate cutoff scores for
the several query formulations, corresponding to the cutoff
rank on the fused list. For large data sets this can be done
easily. Without this step, it is not possible to make an
immediate decision about a newly presented document.
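
One way such cutoff scores might be estimated is sketched below. The
text does not say how this would be done, so everything here is an
assumption: the function names, the use of a training collection, and
the majority-vote rule. The sketch relies on the fact that, with an odd
number of lists, a document's median rank is at or above a cutoff rank
exactly when a majority of the lists place it at or above that rank.

```python
def cutoff_scores(training_scores, cutoff_rank):
    """Score found at cutoff_rank in each formulation's own ranking.

    training_scores: one non-empty list of document scores per
    formulation, taken from a (hypothetical) training collection.
    """
    cutoffs = []
    for scores in training_scores:
        ordered = sorted(scores, reverse=True)
        # Fall back to the lowest score if the list is shorter than the cutoff.
        cutoffs.append(ordered[min(cutoff_rank, len(ordered)) - 1])
    return cutoffs


def accept(new_doc_scores, cutoffs, votes_needed=3):
    """Accept a new document if enough formulations score it at or
    above their cutoff (3 of 5 by default)."""
    votes = sum(s >= c for s, c in zip(new_doc_scores, cutoffs))
    return votes >= votes_needed


# Toy training scores for five hypothetical formulations.
training = [
    [0.90, 0.80, 0.30, 0.20],
    [0.70, 0.60, 0.50, 0.10],
    [0.95, 0.40, 0.35, 0.30],
    [0.80, 0.75, 0.20, 0.10],
    [0.60, 0.50, 0.40, 0.30],
]
cutoffs = cutoff_scores(training, cutoff_rank=2)

print(cutoffs)                                          # [0.8, 0.6, 0.4, 0.75, 0.5]
print(accept([0.85, 0.65, 0.10, 0.10, 0.55], cutoffs))  # True
print(accept([0.85, 0.10, 0.10, 0.10, 0.10], cutoffs))  # False
```

With cutoffs of this kind in hand, each newly presented document can be
judged on its own scores, without re-ranking the whole collection.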
Fourth, the stability induced by using this system was
manifested in the case of the one query for which we dis-
covered, too late to make the change, that one of the query
formulations was in error. For this case, all of the "average
combination of evidence formulations" performed more
poorly than the fusion rule. This is because one, or even
two, disastrously bad query formulations will have little effect
on the results of the 3-of-5 fusion rule. Of course, except
for the case of combining all five query formulations,
the best of one, two, three or four query formulations can
do well because the one bad formulation will be missing
from the combinations that are best.
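
The stability argument can be illustrated with a small numerical
sketch. The ranks below are invented for the example and are not taken
from the experiment. They show one relevant document as ranked by five
formulations, with zero to three of the formulations badly in error.

```python
from statistics import mean, median

# Hypothetical ranks of one relevant document under five formulations.
cases = {
    "none bad": [3, 5, 4, 6, 7],
    "one bad": [3, 5, 4, 6, 950],
    "two bad": [3, 5, 4, 900, 950],
    "three bad": [3, 5, 800, 900, 950],
}

# An averaging rule is represented by the mean rank, the 3-of-5 rule
# by the median rank.
summary = {
    label: (mean(ranks), median(ranks)) for label, ranks in cases.items()
}

for label, (avg, med) in summary.items():
    print(f"{label}: mean rank {avg}, median rank {med}")
```

With one or two bad formulations the median rank stays at 5, while the
mean rank is dragged into the hundreds. Only when three of the five
formulations fail does the median move as well.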
Finally, the application of data fusion here, at the so-
called decision level (that is to say, after the documents
have been ranked according to several rules) is a simulation
of the case to which it should be applied. Since the specific
system that we used permits internal manipulation of
scores, there is no need to delay combination until after the
output lists have been formed. But in realistic settings,
several distinct systems will have internal operations which
are not compatible, so that, even if it were possible to ex-
tract the internal scores, it would not be apparent how to
combine them.