NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Overview of the First Text REtrieval Conference (TREC-1)
Donna K. Harman
National Institute of Standards and Technology
4.2 Problems with Evaluation
Since this was the first time such a large collection of text had been used in evaluation, there were some
problems applying the existing methods of evaluation. First, groups were asked to send in only the top 200
documents retrieved by their systems. This artificial document cutoff is relatively low, and systems did not
retrieve all the relevant documents for most topics within the cutoff. All documents retrieved beyond the top
200 were considered nonrelevant by default, and therefore the recall/precision curves become inaccurate beyond
about 40% recall on average. Table 5 shows a comparison of one system using no threshold (so that relevant
documents found beyond the 200 limit are counted as relevant) versus using the 200-document threshold.
TABLE 5. COMPARISON OF TABLES FROM TIPSTER

               Full Ranking    Top 200 Ranking
   Recall      Precision       Precision
   0.0         0.821           0.8208
   0.1         0.672           0.6710
   0.2         0.581           0.5759
   0.3         0.528           0.5030
   0.4         0.472           0.3819
   0.5         0.424           0.2999
   0.6         0.368           0.1773
   0.7         0.315           0.1075
   0.8         0.244           0.0487
   0.9         0.154           0.0117
   1.0         0.039           0.0000
   11 Pt. average  0.421       0.3271

        Full Ranking          Top 200 Ranking
   Recall   Precision      Recall   Precision
   0.25     0.559          0.20     0.5759
   0.50     0.424          0.50     0.2999
   0.75     0.280          0.80     0.0487
   3 Pt. average    0.421           0.3082
It can be seen from these tables that not only are the recall-level statistics beyond about 40% recall inaccurate,
but the 11 Pt. and 3 Pt. averages based on this table are also inaccurate. Since all systems were compared
using the same measures, this problem is not serious in terms of comparing methods within TREC-1.
However, it could be reduced by raising the document cutoff, and TREC-2 will be run such that at least the top
500 documents are used for evaluation.
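The effect described above can be sketched in code. The function below is an illustrative implementation of the interpolated 11-point average (using the standard interpolation rule of taking the maximum precision at any recall greater than or equal to each level); it is not the actual TREC-1 evaluation software, and the ranking and relevance data are invented for the example.

```python
# Illustrative sketch, not the TREC evaluation code: how a rank cutoff
# (all documents beyond it treated as nonrelevant) depresses the
# interpolated 11-point average precision for a single topic.

def eleven_point_average(ranking, relevant, cutoff=None):
    """Interpolated 11-point average precision for one topic.

    Documents ranked beyond `cutoff` are treated as nonrelevant,
    mirroring the top-200 submission limit discussed in the text.
    """
    if cutoff is not None:
        ranking = ranking[:cutoff]
    total_relevant = len(relevant)
    hits = 0
    points = []  # (recall, precision) at each relevant document found
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    # Interpolated precision at level r: max precision at any recall >= r.
    averages = []
    for level in [i / 10 for i in range(11)]:
        candidates = [p for rec, p in points if rec >= level]
        averages.append(max(candidates) if candidates else 0.0)
    return sum(averages) / 11

# Toy data: 1000 ranked documents with 20 relevant ones scattered deep
# in the ranking, so many fall beyond a 200-document cutoff.
ranking = list(range(1000))
relevant = set(range(0, 1000, 50))
full = eleven_point_average(ranking, relevant)
capped = eleven_point_average(ranking, relevant, cutoff=200)
assert capped < full  # the cutoff zeroes out the high-recall levels
```

Because the capped run can never reach the higher recall levels, those levels contribute interpolated precision 0.0, pulling the average down exactly as in the Top 200 column of Table 5.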
A related problem occurred because some systems in TREC-1 used a variable thresholding scheme, with the
threshold set separately for each topic. Documents not matching sufficient system criteria were rejected, even if
this meant that fewer than 200 documents were returned. Sometimes as few as 10 documents were sent as
results, and the evaluation method again assumed that all documents beyond those 10 were not relevant. This
badly hurt performance for these systems in some cases, as the individual system papers discuss. The plans for
TREC-2 include some additional thresholding tests, so that these systems can evaluate how their thresholding
performs as well as evaluate the standard ranking as done by other systems.
The third problem was more general in nature. The current recall/precision measures do not include any
indication of the collection size. This means that the recall and precision of a system run on a 1,400-document
collection could be the same as those of a system run on a million-document collection, even though the
discrimination power needed on a million-document collection is obviously much greater. This may not have
been a problem on the smaller collections, but the discrimination power of systems on TREC-sized collections
is very important. Clearly some new evaluation measures are needed for this.
One new measure being tried in TREC-1 is the ROC (Relative Operating Characteristic) curve used in signal
processing. These curves are similar to recall/precision curves, but allow the total size of the collection
to influence performance. The two variables being used here are the probability of detection or probability of a