NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
chapter
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
5.2 Official Results
Table 3 gives the official results as reported by NIST. The figures for "precision at 30 docs"
show, for example, that on average, in the first 30 documents returned by the CLARIT-TREC
system, more than half of the routing and 60% of the ad-hoc query documents were relevant.
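The "precision at 30 docs" figure is simply the fraction of relevant documents among the first 30 returned for a topic. As a minimal sketch (with hypothetical document identifiers, not the official NIST evaluation code):

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

# Hypothetical ranking in which 18 of the first 30 documents are relevant.
ranking = [f"d{i}" for i in range(1, 31)]
relevant = {f"d{i}" for i in range(1, 19)}
print(precision_at_k(ranking, relevant, k=30))  # 0.6
```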
Tables 4 and 5 present the official calculations of precision by topic, compared to the best,
median, and worst performance across all evaluated TREC-participant systems. The tables also
give the ranking of CLARIT precision relative to the best precision for each topic ("B200/Best").
Tables 6 and 7 show the official results of CLARIT for the first-100 and full-200 documents
retrieved for each query, along with 11-pt precision scores. Each line in the table gives a topic
number ("T") followed by the total number of documents found relevant by the TREC judges
("Rel"). This is followed by the CLARIT results for the first 100 documents ("B100")
and the global results (based on all TREC-participant systems) for the greatest ("Best"), the
median ("Med"), and the smallest ("Worst") number of documents returned in the first 100 for
each topic. This, in turn, is followed by results for the first 200 documents along with the global
best, median, and worst performance and the 11-pt average precision figure for CLARIT, along
with the best, median, and worst 11-pt performances.
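The 11-pt average precision cited here is the standard 11-point interpolated average: precision is interpolated at the recall levels 0.0, 0.1, ..., 1.0 and the eleven values are averaged. A rough sketch of the computation (a generic re-implementation for illustration, not NIST's evaluation code):

```python
def eleven_point_avg_precision(ranked_ids, relevant_ids):
    """11-point interpolated average precision for one ranked result list."""
    hits = 0
    recall_precision = []  # (recall, precision) after each retrieved document
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
        recall_precision.append((hits / len(relevant_ids), hits / rank))

    # Interpolated precision at recall r = max precision at any recall >= r.
    points = []
    for r in (i / 10 for i in range(11)):
        candidates = [p for rec, p in recall_precision if rec >= r]
        points.append(max(candidates) if candidates else 0.0)
    return sum(points) / 11

# Hypothetical ranking: relevant documents at ranks 1 and 3 out of 4 returned.
print(eleven_point_avg_precision(["a", "b", "c", "d"], {"a", "c"}))
```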
5.3 CLARIT "A" / "B" Comparative Results
Tables 8 and 9 present the official results with a focus on CLARIT-TREC differential pro-
cessing. Here "R#" gives the number of documents found relevant by the TREC judges. "T"
is the topic number, followed by "A2000", which gives the number of the relevant documents
that were present in the partition of 2000 documents created by the routing thesaurus evoked
for the topic. Since the actual identifiers of relevant documents were not reported for some
topics, there are zeroes (signifying missing data) for some A2000 amounts. (For example, we
do not know how many relevant documents were in our partitions for Topics 22, 45, 49, etc.)
When the A2000 number is present, we can measure the effectiveness of our "discrimination"
processing, the final steps in the CLARIT-TREC process. As a measure of effectiveness in
bringing the relevant documents to the top of the final ranked list, we give the percentage of
relevant documents present in the 2000 document partition that were promoted to the first 100
returned ("% A2000"). For the routing queries, these values range from 3% for Topic 18, in which
only 3 of the 118 relevant documents available were promoted to the first 100, up to 95% for
Topic 21 and 100% for Topics 6 and 23, in which all relevant documents were promoted into the
first 100 documents returned. The average was about 42% promoted from among all the 2000
into the top 100. For the ad-hoc queries, the discrimination step was more successful, averaging
a 52% promotion rate and promoting all the relevant documents in the partition in six of the
48 topics (Topics 51, 52, 70, 78, 81, and 92).
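The "% A2000" figure for a topic is just this promoted fraction expressed as a percentage. A minimal sketch, using the Topic 18 numbers quoted above:

```python
def promotion_rate(relevant_in_partition, promoted_to_top100):
    """Percentage of the relevant documents in the 2000-document partition
    that the discrimination step promoted into the first 100 returned."""
    return 100.0 * promoted_to_top100 / relevant_in_partition

# Topic 18 (routing): only 3 of the 118 relevant documents in the
# partition reached the first 100 documents returned.
print(f"{promotion_rate(118, 3):.0f}%")  # 3%
```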
These tables also present the results ranked according to performance, taking the average
results of all TREC systems as a baseline. The columns marked "B100/m" and "B200/m" show
the ratio of CLARIT results to the baseline for the first-100 and final-200 documents returned.
6 Analysis
We are continuing to evaluate CLARIT-TREC processing results and to interpret CLARIT-
TREC performance. This section presents initial observations.
6.1 General Observations
It is extremely difficult to evaluate system performance on a task such as the TREC experiments.
First, as with many such experiments involving information retrieval, it is difficult to establish