SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Combining Evidence for Information Retrieval
chapter
N. Belkin
P. Kantor
C. Cool
R. Quatrain
National Institute of Standards and Technology
D. K. Harman
lations, and the searchers' responses to our questionnaire.
Table 1 shows the distribution of numbers of words and op-
erators per query, and also of time required to construct a
query. Table 2 shows the distribution of searchers' attitudes
to the topics, each indicated on a scale of one to five, from
least to most.
                 Mean   StdDev.   Min.     Max.     N
Operators        9.94    5.72     1.00    44.00   375
Words           19.40   14.63     1.00   145.00   375
Time (minutes)  11.31    7.48     1.00    40.00   367
Table 1. Characteristics of queries for ad hoc and routing
topics.
                      Mean   StdDev.   Min.   Max.     N
Familiarity           1.81    1.15     1.00   5.00   388
Ease of construction  2.82    1.11     1.00   5.00   372
Enough information    3.20    1.11     1.00   5.00   322
Table 2. Characterization of topics by searchers, for routing
and ad hoc topics.
Our ad hoc questionnaire also included a question on
how many years of experience each searcher had in online
searching. The mean response was 6.8 years. Unfortu-
nately, we do not have these data for the routing searchers.
We wished to consider whether there were any relation-
ships between the various characteristics of queries and top-
ics and the performance of the queries themselves. For this
purpose, we constructed a table in which each separate query
formulation (75x5=375) is associated with performance
measures, the characteristics enumerated in tables 1 and 2,
and the three topic categories of broadness, hardness and re-
striction defined by Harman (this volume). For perfor-
mance, we considered using one or more of three measures:
average of 11-point precision; precision at 100 documents;
and R-precision (defined by Harman, this volume). Factor
analysis of these three measures showed that a single factor
accounts for more than 90% of the variance among them,
so that they represent, in effect, a single aspect or factor of
performance. The average precision was chosen as represen-
tative of this factor, and we have used it both in evaluation
of our retrieval results, and in attempting to determine the
effect of the other variables we have considered, on retrieval
performance. Since this variate does not exhibit a normal
distribution, logarithmic and logistic transforms were
explored. The logistic transform gives the most nearly normal
distribution of the transformed score, but we still cannot say
that the transformed variable follows a normal distribution.
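
As an illustration of the logistic transform mentioned above, here is a minimal Python sketch. It assumes the transform is the usual logit, log(p/(1-p)), applied to an average precision p between 0 and 1. The clipping constant `eps` and the sample values are hypothetical, introduced only so the example runs; they are not taken from the experiment.

```python
import math


def logit(p, eps=1e-6):
    """Logistic (logit) transform of a proportion p in [0, 1].

    eps is a hypothetical clipping constant (not from the paper) that
    keeps the result finite when p is exactly 0 or 1.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


# Illustrative values only, not data from the experiment.
average_precisions = [0.05, 0.20, 0.50, 0.80]
transformed = [logit(p) for p in average_precisions]
```

The transform stretches scores near 0 and 1 and is symmetric about p = 0.5, which is why it can bring a bounded, skewed score closer to a normal shape.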
The results of applying ANOVA to seek a predictor of
p are shown in Table 3. No significant relations appear.
Because of the range of values assumed by the variables
Operators, Words and Time, the relation was sought using
regression analysis. Once again, no significant relations
were found, and the scatter plots (not included here) make it
clear that there is no trend to be found. Both hardness and
broadness are significantly related to performance. The
former is expected, since the hardness is determined by me-
dian average precision; the latter is less obvious.
Analysis of variance for the logistically transformed
average precision

Independent variable   Significance
Familiarity            0.149
Easiness               0.169
Information            0.907
Table 3. Significance levels of F-tests using ANOVA to
seek dependence of the logistically transformed average pre-
cision on the searcher's assessments of their query formula-
tion.
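
The kind of test summarized in Table 3 can be sketched as a one-way analysis of variance, in which the transformed scores are grouped by the searcher's rating and the between-group variance is compared with the within-group variance. The following is a minimal pure-Python sketch under that assumption. The function name `one_way_anova_f` and the toy scores are hypothetical, and the exact ANOVA design used for Table 3 is not stated here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups is a list of lists of observations, one list per factor level.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up transformed scores for three rating levels (not real data).
toy_groups = [
    [-3.0, -2.0, -1.0],
    [-2.0, -1.0, 0.0],
    [-1.0, 0.0, 1.0],
]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
```

Turning the F statistic into a significance level requires the F distribution with (df_between, df_within) degrees of freedom, which a statistics package such as SciPy can supply.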
The search for relations between average precision and
characteristics of the query formulation, whether provided
by the searcher, or determined from the query text itself, was
motivated by the results, discussed below, which show that
it is desirable to weight formulations in proportion to their
average precision. Thus, if we could find a surrogate for
average precision which can be known without evaluating
the retrieved documents, it would be possible to approxi-
mate the effective combination on the first pass of a re-
trieval operation. This hope is frustrated at this time.
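
To make the idea of weighting formulations in proportion to their average precision concrete, here is a hedged Python sketch. It assumes each formulation returns a numeric score per document and that the scores are comparable across formulations. The weighted sum, the function names, and the sample values are illustrative assumptions, not the combination mechanism used in the experiments.

```python
def proportional_weights(avg_precisions):
    """Normalize average precisions so that the weights sum to 1."""
    total = sum(avg_precisions)
    return [ap / total for ap in avg_precisions]


def combine_scores(score_lists, weights):
    """Weighted sum of per-document scores.

    score_lists holds one {doc_id: score} dict per query formulation.
    A document missing from a formulation contributes 0 for it
    (an assumption made for this sketch).
    """
    combined = {}
    for scores, weight in zip(score_lists, weights):
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs: two formulations with made-up average precisions.
weights = proportional_weights([0.1, 0.3])
ranked = combine_scores(
    [{"d1": 1.0, "d2": 0.0}, {"d1": 0.0, "d2": 1.0}],
    weights,
)
```

In this toy case the second formulation has three times the average precision of the first, so the document it favors ranks first. A surrogate for average precision would take the place of the known values passed to `proportional_weights`.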
3.3 Query Combination and Data Fusion
Results: Ad hoc Topics
The official results reported to TREC-2 were for the
overall performance of each of two treatments for the ad hoc
topics, and of one treatment for the routing topics. For
those results, we refer the reader to the relevant section of
this volume. Here we report on our further investigations
on the effect of combination of queries, and of data fusion,
on performance.
Our first investigation in query combination was to see
if combining query formulations has a regular, beneficial ef-
fect, as hypothesized. To do this, we generated the five dif-
ferent search groups for the ad hoc topics, as described in
section 2.2, and did experimental runs on all single query
groups, all 2-way combinations of queries, all 3-way com-
binations of queries, all 4-way combinations of queries, and
the combination of all 5 query formulations. The results
are presented in Table 4, where it is evident that the average
performance increases monotonically as more evidence is
added. The increase is strict and significant, as shown in
Table 4a, where we display the number of times that each
combination level performed better than each other level.
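
The set of runs described above, from single query groups up to the combination of all five, can be enumerated as in the following Python sketch. The group labels are hypothetical placeholders, and the sketch only lists which groups enter each run; it does not perform any retrieval or combination.

```python
from itertools import combinations

# Hypothetical labels for the five search groups.
groups = ["G1", "G2", "G3", "G4", "G5"]

# For each combination level k, list every k-way subset of the groups.
runs_by_level = {
    k: list(combinations(groups, k))
    for k in range(1, len(groups) + 1)
}

# Number of experimental runs at each level.
counts = {k: len(runs) for k, runs in runs_by_level.items()}
```

With five groups this gives 5 single runs, 10 two-way, 10 three-way, 5 four-way and 1 five-way combination, 31 runs in all.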
We note that the data fusion results are not significantly
better than any but 1-way combination (that is, average per-
formance for single queries), but also that its performance is
not significantly different from unweighted 5-way combina-
tion