NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Overview of the Second Text REtrieval Conference (TREC-2)

D. K. Harman, National Institute of Standards and Technology

[Figure 4. Effect of evaluation cutoffs on recall/precision curves. Panel title: "Effects of Cutoff on Evaluation"; curves for cutoffs at 200, 500, and 1000 documents, plus the full ranking; axes: recall (x) vs. precision (y), each from 0.00 to 1.00.]

difficult to perform failure analysis on the results to better understand the retrieval processes being tested. Without better understanding of underlying system performance, it will be hard to consolidate research progress. Some preliminary analysis of per-topic performance is provided in section 6, and more attention will be given to this problem in the future.

5. Results

5.1 Introduction

In general the TREC-2 results showed significant improvements over the TREC-1 results. Many of the original TREC-1 groups were able to "complete" their system rebuilding and tuning tasks. The results for TREC-2 therefore can be viewed as the "best first pass" that most groups can accomplish on this large amount of data. The adhoc results in particular represent baseline results from the scaling-up of current algorithms to large test collections. The better systems produced similar results, comparable to those seen using these algorithms on smaller test collections. The routing results showed even more improvement over the TREC-1 routing results. Some of this improvement was due to the availability of large numbers of accurate relevance judgments for training (unlike TREC-1), but most of the improvement came from new research by participating groups into better ways of using the training data. For full descriptions of each system discussed in this section, see the individual papers in these proceedings.
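The evaluation cutoffs shown in Figure 4 measure precision and recall after scoring only the top k retrieved documents rather than the full ranking. A minimal sketch of this computation, using a made-up ranked list and relevance set (not TREC data), with small k values standing in for the 200/500/1000 cutoffs:

```python
def precision_recall_at_cutoff(ranked_ids, relevant_ids, k):
    """Precision and recall computed over only the top-k retrieved documents."""
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc in retrieved if doc in relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Toy example: a 10-document ranking with 4 relevant documents.
ranking = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d3", "d4", "d9"}

for k in (2, 5, 10):  # small stand-ins for the cutoffs in Figure 4
    p, r = precision_recall_at_cutoff(ranking, relevant, k)
    print(f"cutoff {k}: precision {p:.2f}, recall {r:.2f}")
```

As the cutoff grows, recall can only increase while precision typically falls, which is why the curves in Figure 4 diverge at the high-recall end.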
5.2 Adhoc Results

The adhoc evaluation used new topics (101-150) against the two disks of training documents (disks 1 and 2). There were 44 sets of results for adhoc evaluation in TREC-2, with 32 of them based on runs for the full data set. Of these, 23 used automatic construction of queries, 9 used manual construction, and 2 used feedback.

Figure 5 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using automatic construction of queries. The results marked "INQ001" are from the INQUERY system from the University of Massachusetts (see the Croft, Callan & Broglio paper). This system uses probabilistic term weighting and a probabilistic inference net to combine various topic and document features. The results marked "dortQ2", "Brkly3" and "crnlL2" are all based on the use of the Cornell SMART system, but with important variations. The "crnlL2" run is the basic SMART system from
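Non-interpolated average precision, the measure used above to rank the runs, averages the precision observed at the rank of each relevant document, with relevant documents that are never retrieved contributing zero. A minimal sketch with a made-up ranking (not an actual TREC run):

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision: the mean of precision at each
    relevant document's rank; unretrieved relevant docs count as zero."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Toy example: relevant documents retrieved at ranks 1, 3, and 6.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d1", "d3", "d6"}
print(round(average_precision(ranking, relevant), 4))  # (1 + 2/3 + 1/2) / 3
```

Averaging this value over all topics gives a single per-run score, which is why it is a convenient basis for comparing the systems in Figure 5.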