NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Overview of the First Text REtrieval Conference (TREC-1)

Donna K. Harman
National Institute of Standards and Technology

4.2 Problems with Evaluation

Since this was the first time that such a large collection of text had been used in evaluation, there were some problems using the existing methods of evaluation. First, groups were asked to send in only the top 200 documents retrieved by their systems. This artificial document cutoff is relatively low, and systems did not retrieve all the relevant documents for most topics within the cutoff. All documents retrieved beyond the 200 were considered nonrelevant by default, and therefore the recall/precision curves become inaccurate after about 40% recall on average. Table 5 shows a comparison of one system using no cutoff (so relevant documents found beyond the 200 limit are marked as relevant) versus using the 200-document cutoff.

TABLE 5. COMPARISON OF TABLES FROM TIPSTER

                     Full Ranking    Top 200 Ranking
    Recall           Precision       Precision
    0.0              0.821           0.8208
    0.1              0.672           0.6710
    0.2              0.581           0.5759
    0.3              0.528           0.5030
    0.4              0.472           0.3819
    0.5              0.424           0.2999
    0.6              0.368           0.1773
    0.7              0.315           0.1075
    0.8              0.244           0.0487
    0.9              0.154           0.0117
    1.0              0.039           0.0000
    11 Pt. average   0.421           0.3271

    Recall  Precision        Recall  Precision
    0.25    0.559            0.20    0.5759
    0.50    0.424            0.50    0.2999
    0.75    0.280            0.80    0.0487
    3 Pt. average   0.421            0.3082

It can be seen from these tables that not only are the recall-level statistics beyond about 40% recall inaccurate, but both the 11 Pt. and the 3 Pt. averages based on this table are also inaccurate. Since all systems were compared using the same measures, this problem is not serious in terms of comparing methods within TREC-1. However, it could be improved by raising the cutoff, and TREC-2 will be run such that at least the top 500 documents are used for evaluation.
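The effect of the cutoff on the averaged figures can be sketched in Python. This is a minimal illustration with an invented ranking, not actual TREC data; the function name and the cutoff handling are assumptions made for demonstration, mirroring the top-200 evaluation described above:

```python
# Hedged sketch: 11-point interpolated average precision, with and without a
# fixed rank cutoff. Hypothetical data only; not the official TREC software.

def interpolated_averages(ranked_rels, total_relevant, cutoff=None):
    """ranked_rels: booleans, True where the document at that rank is relevant.
    Documents beyond `cutoff` are treated as nonrelevant, as in the TREC-1
    top-200 evaluation."""
    if cutoff is not None:
        ranked_rels = ranked_rels[:cutoff]

    # (recall, precision) at each rank where a relevant document appears
    points = []
    found = 0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            found += 1
            points.append((found / total_relevant, found / rank))

    # Interpolated precision at recall r: max precision at any recall >= r.
    def interp(r):
        best = [p for rec, p in points if rec >= r]
        return max(best) if best else 0.0

    levels = [i / 10 for i in range(11)]       # 0.0, 0.1, ..., 1.0
    return sum(interp(r) for r in levels) / 11
```

With a ranking whose last relevant documents fall beyond rank 200, the `cutoff=200` call returns zero interpolated precision at every recall level reachable only past the cutoff, which is exactly the collapse visible in the Top 200 column of Table 5 at high recall.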
A related problem occurred because some systems in TREC-1 worked on a variable thresholding system, with that threshold set for each topic. Documents not matching sufficient system criteria were rejected, even if fewer than 200 were returned. Sometimes as few as 10 documents were sent as results, and the evaluation method again assumed that all documents beyond the 10 were not relevant. This badly hurt performance for these systems in some cases, and the individual system papers discuss this. The plans for TREC-2 are to include some additional thresholding tests, so that these systems can evaluate how their thresholding performs and also evaluate the standard ranking as done by other systems.

The third problem was more general in nature. The current recall/precision measures do not include any indication of the collection size. This means that the recall and precision of a system based on a 1400-document collection could be the same as those of a system based on a million-document collection, but obviously the discrimination powers on a million-document collection would be much greater. This may not have been a problem on the smaller collections, but the discrimination power of systems on TREC-sized collections is very important. Clearly some new evaluation measures are needed for this.

One new measure being tried in TREC-1 is the ROC (Relative Operating Characteristic) curves used in signal processing. These curves are similar to the recall/precision curves, but allow the total size of the collection to influence performance. The two variables being used here are the probability of detection or probability of a