NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Overview of the Second Text REtrieval Conference (TREC-2)
D. K. Harman
National Institute of Standards and Technology
[Figure: "Effects of Cutoff on Evaluation" — recall/precision curves (precision 0.00-1.00 vs. recall 0.00-1.00), with curves for evaluation at 200, at 500, and at 1000 documents, and for the full ranking.]
Figure 4. Effect of evaluation cutoffs on recall/precision curves.
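The cutoff effect shown in Figure 4 can be illustrated with a toy scorer that truncates a ranked list before computing recall and precision. The document IDs and the small ranked list below are invented for illustration, not TREC data:

```python
# Sketch of how an evaluation cutoff truncates a ranked list before
# scoring, as in Figure 4. All IDs and counts here are illustrative.

def recall_precision_at_cutoff(ranked_ids, relevant_ids, cutoff):
    """Score only the top `cutoff` documents of a ranked list."""
    retrieved = ranked_ids[:cutoff]
    hits = sum(1 for d in retrieved if d in relevant_ids)
    recall = hits / len(relevant_ids)
    precision = hits / len(retrieved)
    return recall, precision

ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d9", "d2"}

# Deeper cutoffs can only raise recall; precision may rise or fall.
for cutoff in (2, 4, 6):
    r, p = recall_precision_at_cutoff(ranked, relevant, cutoff)
    print(cutoff, round(r, 2), round(p, 2))
```

This is why the curves in Figure 4 diverge at the high-recall end: a shallow cutoff caps the recall that any run can achieve.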
difficult to perform failure analysis on the results to better
understand the retrieval processes being tested. Without a
better understanding of underlying system performance, it
will be hard to consolidate research progress. Some preliminary
analysis of per-topic performance is provided in
section 6, and more attention will be given to this
problem in the future.
5. Results
5.1 Introduction
In general the TREC-2 results showed significant
improvements over the TREC-1 results. Many of the
original TREC-1 groups were able to "complete" their
system rebuilding and tuning tasks. The results for
TREC-2 therefore can be viewed as the "best first-pass"
that most groups can accomplish on this large amount of
data. The adhoc results in particular represent baseline
results from the scaling-up of current algorithms to large
test collections. The better systems produced similar
results, results that are comparable to those seen using
these algorithms on smaller test collections.
The routing results showed even more improvement over
TREC-1 routing results. Some of this improvement was
due to the availability of large numbers of accurate
relevance judgments for training (unlike TREC-1), but
most of the improvements came from new research by
participating groups into better ways of using the training
data.
For full descriptions of each system discussed in this
section, see the individual papers in these proceedings.
5.2 Adhoc Results
The adhoc evaluation used new topics (101-150) against
the two disks of training documents (disks 1 and 2).
There were 44 sets of results for adhoc evaluation in
TREC-2, with 32 of them based on runs for the full data
set. Of these, 23 used automatic construction of queries,
9 used manual construction, and 2 used feedback.
Figure 5 shows the recall/precision curves for the six
TREC-2 groups with the highest non-interpolated average
precision using automatic construction of queries.
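Non-interpolated average precision, the measure used to rank these runs, can be sketched as follows: average the precision observed at the rank of each relevant document, counting 0 for relevant documents never retrieved. The document IDs below are invented for illustration:

```python
# Minimal sketch of non-interpolated average precision.
# Relevant documents never retrieved contribute 0, since the
# sum is divided by the total number of relevant documents.

def average_precision(ranked_ids, relevant_ids):
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)

ranked = ["d2", "d5", "d1", "d8", "d3"]
relevant = {"d2", "d1", "d3"}
# Precisions at the relevant ranks are 1/1, 2/3, and 3/5.
print(round(average_precision(ranked, relevant), 4))
```

Because every relevant document contributes to the score, this measure rewards runs that rank relevant documents early across the whole list rather than only at a fixed cutoff.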
results marked "INQ001" are the INQUERY system from
the University of Massachusetts (see Croft, Callan &
Broglio paper). This system uses probabilistic term
weighting and a probabilistic inference net to combine
various topic and document features. The results marked
"dortQ2", "Brkly3" and "crnlL2" are all based on the use
of the Cornell SMART system, but with important variations.
The "crnlL2" run is the basic SMART system from