NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Overview of the Second Text REtrieval Conference (TREC-2)
D. K. Harman, National Institute of Standards and Technology

6. Some Preliminary Analysis

6.1 Introduction

The recall/precision curves shown in section 5 represent the average performance of the various systems on the full sets of topics. It is important to look beyond these averages in order to learn more about how a given system is performing and to discover some generalizable principles of retrieval. Individual systems are able to do this by performing failure analysis (see the Dumais paper in this proceedings for a good example) and by running specific experiments to test hypotheses on retrieval behavior within a given system. However, additional information can be gained by doing some cross-system comparison: information about specific system behavior and information about generalized information retrieval principles. One way to do this is to examine system behavior with respect to test collection characteristics. A second method is to compare system behavior on a topic-by-topic basis.

6.2 The Effects of Test Collection Characteristics

One particular test collection characteristic is the length of documents, both the average length of documents in a collection and the variation in document length across a collection. Document length has a significant effect on system performance. A term that appears 10 times in a "short" document is likely to be more important to that document than if the same term appeared 10 times in a "long" document. Table 3 shows system performance across the different document subcollections for each of the adhoc topics, listing the total number of documents that were retrieved by the system as well as the number of relevant documents that were retrieved.

Two particular points can be seen from table 3. First, the better systems retrieve about 50% of the relevant documents from all the subcollections except the Federal Register (FR). For this subcollection the retrieval rates are in the 25% range because the varied length of these documents makes retrieval difficult. The second point concerning table 3 is that the retrieval rate across the subcollections is highly varied among the systems. For example, the "Brkly3" results show that many fewer Federal Register documents and more AP documents were retrieved than for the INQUERY system, whereas the "CLARTA" results show more DOE abstracts and fewer Wall Street Journal documents being retrieved. These "biases" towards particular subcollections reflect the methods used by the systems, such as their handling of length normalization, domain concentrations of terminology, and the methods used to "merge" results across subcollections (often implicit merges during indexing).

A second test collection characteristic worth examining is the varied broadness and varied difficulty of the topics. An analysis was done [Harman 1994] to find the topics for which the systems retrieved the lowest percentage of the relevant documents on average. These topics are 61, 67, 76, 77, 81, 85, 90, 91, 93, and 98 for the routing topics and 101, 114, 120, 121, 124, 131, 139, 140, 141, and 149 for the adhoc topics. Tables 4 and 5 show the top 8 system runs for the individual topics based on the average precision (noninterpolated). These tables mix automatic, manual, and feedback results for category A, and also category B results, so they should be interpreted carefully.
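As a point of reference, noninterpolated average precision for a single topic is the mean of the precision values at the ranks where relevant documents are retrieved, with relevant documents that are never retrieved contributing zero. The sketch below is illustrative only; it is not the NIST evaluation code, and the document identifiers are made up.

```python
# Sketch of noninterpolated average precision for one run on one topic,
# assuming the standard definition: average the precision at each rank
# where a relevant document appears, divided by the total number of
# relevant documents for the topic.
def average_precision(ranked_docs, relevant):
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

# Example: relevant documents found at ranks 1 and 3, with 4 relevant
# documents overall, gives (1/1 + 2/3) / 4, or about 0.417.
print(average_precision(["d1", "d5", "d2", "d9"], {"d1", "d2", "d3", "d4"}))
```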
Even with these caveats, tables 4 and 5 demonstrate that no consistent patterns appear for the "hard" topics. The two best routing runs ("crnlCl" and "do[OCRerr]1") only do well on about half of these topics, and the adhoc results are even more varied. Often systems that do not perform well on average are the top-performing system for a given topic. This verifies that, as usual, the variation across the topics is greater than the variation across the systems.

6.3 Cross-System Analysis

Tables 4 and 5 not only show the wide variation in system performance, but also raise several questions about system performance in general.

1. Does better average performance for a system result from better performance on most topics, or from comparable performance on most topics and significantly better performance on other topics?

2. If two systems perform similarly on a given topic, does that mean that they have retrieved a large proportion of the same relevant documents?

3. Do systems that use "similar" approaches have a high overlap in the particular relevant documents they retrieve?

4. And, if number 3 is not true, what are the issues that affect high overlap of relevant documents?

Work is ongoing at NIST on these questions and other related issues.
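One simple way to start on questions 2 and 3 is to measure, for a given topic, how many of the relevant documents retrieved by one run are also retrieved by another. The sketch below is an assumption-laden illustration, not part of the TREC evaluation: the function names, the Jaccard-style ratio, and the toy data are all invented for the example.

```python
# Overlap of relevant retrieved documents between two runs on one topic.
def relevant_retrieved(run_ranking, relevant):
    """Relevant documents that appear anywhere in a run's ranking."""
    return set(run_ranking) & relevant

def relevant_overlap(run_a, run_b, relevant):
    """Fraction of the relevant documents found by either run that were
    found by both (1.0 means the two runs found the same relevant set)."""
    rel_a = relevant_retrieved(run_a, relevant)
    rel_b = relevant_retrieved(run_b, relevant)
    union = rel_a | rel_b
    return len(rel_a & rel_b) / len(union) if union else 0.0

# Toy example: the runs share only one of the three relevant documents
# found between them, giving an overlap of 1/3.
print(relevant_overlap(["d1", "d2", "d7"], ["d2", "d3", "d8"], {"d1", "d2", "d3"}))
```

A ratio near 1.0 would indicate that two runs are finding essentially the same relevant documents, while a low ratio for two runs with similar average precision would suggest that comparable scores can mask quite different retrieval behavior.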