NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Overview of the Second Text REtrieval Conference (TREC-2)
D. K. Harman
National Institute of Standards and Technology
6. Some Preliminary Analysis
6.1 Introduction
The recall/precision curves shown in section 5 represent the average performance of the various systems on the full sets of topics. It is important to look beyond these averages in order to learn more about how a given system is performing and to discover some generalizable principles of retrieval.
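As a reminder of how those averaged curves are built, the following is a minimal sketch (in the style of, but not taken from, the official evaluation software): for each topic the interpolated precision is computed at the 11 standard recall points, and the per-topic curves are then averaged. All names here are illustrative.

```python
# Sketch of averaged interpolated recall/precision curves (illustrative only).

RECALL_POINTS = [i / 10.0 for i in range(11)]  # 0.0, 0.1, ..., 1.0

def interpolated_curve(ranked_rel_flags, num_relevant):
    """ranked_rel_flags: 0/1 relevance flags in rank order for one topic."""
    # Recall and precision after each retrieved document.
    points = []
    rel_so_far = 0
    for rank, rel in enumerate(ranked_rel_flags, start=1):
        rel_so_far += rel
        points.append((rel_so_far / num_relevant, rel_so_far / rank))
    # Interpolated precision at recall r = max precision at any recall >= r.
    curve = []
    for r in RECALL_POINTS:
        candidates = [p for (rec, p) in points if rec >= r]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

def average_curve(per_topic_curves):
    """Average the interpolated curves over all topics."""
    n = len(per_topic_curves)
    return [sum(c[i] for c in per_topic_curves) / n
            for i in range(len(RECALL_POINTS))]
```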
Individual systems can look beyond these averages by performing failure analysis (see the Dumais paper in these proceedings for a good example) and by running specific experiments to test hypotheses about retrieval behavior within a given system. However, additional information can be gained by doing some cross-system comparison: information about specific system behavior and information about generalized information retrieval principles. One way to do this is to examine system behavior with respect to test collection characteristics. A second method is to compare system behavior on a topic-by-topic basis.
6.2 The Effects of Test Collection Characteristics
One particular test collection characteristic is the length of documents: both the average length of documents in a collection and the variation in document length across a collection. Document length has a significant effect on system performance. A term that appears 10 times in a "short" document is likely to be more important to that document than if the same term appeared 10 times in a "long" document. Table 3 shows system performance across the different document subcollections for each of the adhoc topics, listing the total number of documents that were retrieved by the system as well as the number of relevant documents that were retrieved.
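To make the intuition concrete, the hypothetical sketch below contrasts the same raw term count under a deliberately simple length normalization (relative frequency). Actual TREC systems used a variety of more elaborate weighting and normalization schemes, so this is only illustrative.

```python
# Illustrative only: the same raw count matters more in a short document
# once the weight is divided by document length.

def normalized_tf(term_count, doc_length):
    """Relative frequency of a term within a document (a hypothetical,
    deliberately simple normalization)."""
    return term_count / doc_length

short_doc_weight = normalized_tf(10, 200)     # 10 occurrences in a 200-term abstract
long_doc_weight = normalized_tf(10, 20000)    # 10 occurrences in a 20,000-term document

print(short_doc_weight, long_doc_weight)      # 0.05 versus 0.0005
```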
Two particular points can be seen from Table 3. First, the better systems retrieve about 50% of the relevant documents from all the subcollections except the Federal Register (FR). For this subcollection the retrieval rates are in the 25% range, because the varied length of these documents makes retrieval difficult.
The second point concerning Table 3 is that the retrieval rate across the subcollections varies widely among the systems. For example, the "Brkly3" results show that many fewer Federal Register documents and more AP documents were retrieved than for the INQUERY system, whereas the "CLARTA" results show more DOE abstracts and fewer Wall Street Journal articles being retrieved. These "biases" towards particular subcollections reflect the methods used by the systems, such as how document lengths are normalized, the domain concentration of the terminology, and the methods used to "merge" results across subcollections (often implicit merges during indexing).
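A breakdown of this kind can be computed directly from a run and the relevance judgments. The sketch below is a minimal, hypothetical version of the Table 3 computation: for each subcollection it reports how many documents were retrieved, how many of those were relevant, and what fraction of the available relevant documents that represents. The function and argument names are illustrative, not part of any TREC software.

```python
from collections import defaultdict

def subcollection_breakdown(retrieved_ids, relevant_ids, subcollection_of):
    """Per-subcollection retrieval statistics (WSJ, AP, FR, DOE, ...).

    retrieved_ids    -- document ids retrieved by a run (all topics pooled)
    relevant_ids     -- set of document ids judged relevant
    subcollection_of -- function mapping a document id to its subcollection
    """
    retrieved = defaultdict(int)
    rel_retrieved = defaultdict(int)
    rel_available = defaultdict(int)

    for doc in relevant_ids:
        rel_available[subcollection_of(doc)] += 1
    for doc in retrieved_ids:
        sub = subcollection_of(doc)
        retrieved[sub] += 1
        if doc in relevant_ids:
            rel_retrieved[sub] += 1

    # (retrieved, relevant retrieved, fraction of available relevant found)
    return {sub: (retrieved[sub], rel_retrieved[sub],
                  rel_retrieved[sub] / rel_available[sub])
            for sub in rel_available}
```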
A second test collection characteristic worth examining is the varied broadness and varied difficulty of the topics. An analysis was done [Harman 1994] to find the topics for which the systems retrieved the lowest percentage of the relevant documents on average. These topics are 61, 67, 76, 77, 81, 85, 90, 91, 93, and 98 for the routing topics and 101, 114, 120, 121, 124, 131, 139, 140, 141, and 149 for the adhoc topics. Tables 4 and 5 show the top 8 system runs for the individual topics based on the average precision (noninterpolated). These tables mix automatic, manual, and feedback results for category A, and also category B results, so they should be interpreted carefully. However, they do demonstrate that no consistent patterns appear for the "hard" topics. The two best routing runs ("crnlC1" and "do[OCRerr]1") only do well on about half of these topics, and the adhoc results are even more varied. Often a system that does not perform well on average is the top performer for a given topic. This verifies that, as usual, the variation across the topics is greater than the variation across the systems.
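For reference, noninterpolated average precision for a single topic can be computed as in the sketch below, which follows the standard definition rather than the actual evaluation program: precision is recorded at the rank of each relevant document retrieved, summed, and divided by the total number of relevant documents for the topic, so unretrieved relevant documents contribute zero.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Noninterpolated average precision for one topic (illustrative sketch)."""
    relevant_ids = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_doc_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Example: relevant documents at ranks 1 and 3, with 4 relevant in total,
# gives (1/1 + 2/3) / 4, roughly 0.417.
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d8", "d9"}))
```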
6.3 Cross-System Analysis
Tables 4 and 5 not only show the wide variation in system
performance, but also raise several questions about system
performance in general.
1. Does better average performance for a system result from better performance on most topics, or from comparable performance on most topics and significantly better performance on other topics?
2. If two systems perform similarly on a given topic, does that mean that they have retrieved a large proportion of the same relevant documents?
3. Do systems that use "similar" approaches have a
high overlap in the particular relevant documents
they retrieve?
4. And, if number 3 is not true, what are the issues
that affect high overlap of relevant documents?
Work is ongoing at NIST on these questions and other related issues.
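One simple way to quantify the overlap asked about in questions 2 and 3 is sketched below: the fraction of relevant documents found by both systems relative to the relevant documents found by either. The names and the particular measure are illustrative assumptions, not the analysis actually used at NIST.

```python
def relevant_overlap(retrieved_a, retrieved_b, relevant):
    """Overlap of the relevant documents retrieved by two runs for one topic."""
    rel_a = set(retrieved_a) & set(relevant)
    rel_b = set(retrieved_b) & set(relevant)
    union = rel_a | rel_b
    # 1.0 means the two runs found exactly the same relevant documents.
    return len(rel_a & rel_b) / len(union) if union else 0.0
```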