NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
chapter
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
5.2 Official Results
Table 3 gives the official results as reported by NIST. The figures for "precision at 30 docs"
show, for example, that on average, in the first 30 documents returned by the CLARIT-TREC
system, more than half of the routing and 60% of the ad-hoc query documents were relevant.
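The "precision at 30 docs" figure is simply the fraction of relevant documents among the first 30 returned for a topic. As a minimal sketch (with hypothetical document identifiers, not the official NIST evaluation code):

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

# Hypothetical ranking in which 18 of the first 30 documents are relevant.
ranking = [f"d{i}" for i in range(1, 31)]
relevant = {f"d{i}" for i in range(1, 19)}
print(precision_at_k(ranking, relevant, k=30))  # 0.6
```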
Tables 4 and 5 present the official calculations of precision by topic, compared to the best,
median, and worst performance across all evaluated TREC-participant systems. The tables also
give the ranking of CLARIT precision relative to the best precision for each topic ("B200/Best").
Tables 6 and 7 show the official results of CLARIT for the first-100 and full-200 documents
retrieved for each query, along with 11-pt precision scores. Each line in the table gives a topic
number ("T") followed by the total number of documents found relevant by the TREC judges
("Rel"). This is followed by the CLARIT results for the first 100 documents ("B100")
and the global results (based on all TREC-participant systems) for the greatest ("Best"), the
median ("Med"), and the smallest ("Worst") number of documents returned in the first 100 for
each topic. This, in turn, is followed by results for the first 200 documents along with the global
best, median, and worst performance and the 11-pt average precision figure for CLARIT, along
with the best, median, and worst 11-pt performances.
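The 11-pt average precision cited here is the standard 11-point interpolated average: precision is interpolated at the recall levels 0.0, 0.1, ..., 1.0 and the eleven values are averaged. A rough sketch of the computation (a generic re-implementation for illustration, not NIST's evaluation code):

```python
def eleven_point_avg_precision(ranked_ids, relevant_ids):
    """11-point interpolated average precision for one ranked result list."""
    hits = 0
    recall_precision = []  # (recall, precision) after each retrieved document
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
        recall_precision.append((hits / len(relevant_ids), hits / rank))

    # Interpolated precision at recall r = max precision at any recall >= r.
    points = []
    for r in (i / 10 for i in range(11)):
        candidates = [p for rec, p in recall_precision if rec >= r]
        points.append(max(candidates) if candidates else 0.0)
    return sum(points) / 11

# Hypothetical ranking: relevant documents at ranks 1 and 3 out of 4 returned.
print(eleven_point_avg_precision(["a", "b", "c", "d"], {"a", "c"}))
```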
5.3 CLARIT "A" / "B" Comparative Results
Tables 8 and 9 present the official results with a focus on CLARIT-TREC differential pro-
cessing. Here "R#" gives the number of documents found relevant by the TREC judges. "T"
is the topic number, followed by "A2000", which gives the number of the relevant documents
that were present in the partition of 2000 documents created by the routing thesaurus evoked
for the topic. Since the actual identifiers of relevant documents were not reported for some
topics, there are zeroes (signifying missing data) for some A2000 amounts. (For example, we
do not know how many relevant documents were in our partitions for Topics 22, 45, 49, etc.)
When the A2000 number is present, we can measure the effectiveness of our "discrimination"
processing, the final steps in the CLARIT-TREC process. As a measure of effectiveness in
bringing the relevant documents to the top of the final ranked list, we give the percentage of
relevant documents present in the 2000 document partition that were promoted to the first 100
returned ("% A2000"). For the routing queries, these values range from 3% for Topic 18, in which
only 3 of the 118 relevant documents available were promoted to the first 100, up to 95% for
Topic 21 and 100% for Topics 6 and 23, in which all relevant documents were promoted into the
first 100 documents returned. The average was about 42% promoted from among all the 2000
into the top 100. For the ad-hoc queries, the discrimination step was more successful, averaging
a 52% promotion rate and promoting all the relevant documents in the partition in six of the
48 topics (Topics 51, 52, 70, 78, 81, and 92).
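The "% A2000" figure for a topic is just this promoted fraction expressed as a percentage. A minimal sketch, using the Topic 18 numbers quoted above:

```python
def promotion_rate(relevant_in_partition, promoted_to_top100):
    """Percentage of the relevant documents in the 2000-document partition
    that the discrimination step promoted into the first 100 returned."""
    return 100.0 * promoted_to_top100 / relevant_in_partition

# Topic 18 (routing): only 3 of the 118 relevant documents in the
# partition reached the first 100 documents returned.
print(f"{promotion_rate(118, 3):.0f}%")  # 3%
```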
These tables also present the results ranked according to performance, taking the average
results of all TREC systems as a baseline. The columns marked "B100/m" and "B200/m" show
the ratio of CLARIT results to the baseline for the first-100 and final-200 documents returned.
6 Analysis
We are continuing to evaluate CLARIT-TREC processing results and to interpret CLARIT-
TREC performance. This section presents initial observations.
6.1 General Observations
It is extremely difficult to evaluate system performance on a task such as the TREC experiments.
First, as with many such experiments involving information retrieval, it is difficult to establish