NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, Editor. National Institute of Standards and Technology.

Design and Evaluation of the CLARIT-TREC-2 System
D. Evans and R. Lefferts

Table 2 gives the official CLARIT-TREC-2 system ad-hoc query results as reported by NIST. A graph of the precision-recall curves for the two sets of results is given in Figure 3. The total number of relevant documents retrieved under the ad-hoc query task was 8,229 (CLARTM) and 8,109 (CLARTA), representing, respectively, 76.30% and 75.19% of the total known relevants (10,785).

The graph in Figure 4 shows the average precision score for each process at N documents, for selected values of N. It should be noted that the maximum possible precision score at 500 and 1,000 documents is less than 100%. In particular, the average number of relevants per routing topic is 209.78; this corresponds to a maximum precision of 41.96% at 500 documents and 20.98% at 1,000 documents. The average number of relevants per ad-hoc query topic is 215.70; this corresponds to a maximum precision of 43.14% at 500 documents and 21.57% at 1,000 documents.
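This ceiling follows from simple arithmetic: a topic with R known relevant documents can contribute at most min(R, N) relevant documents at a cutoff of N, so precision at N is bounded by min(R, N)/N. A minimal sketch that reproduces the figures quoted above (the function name is ours, for illustration only):

    def max_precision(relevants, cutoff):
        # Even a perfect ranking cannot return more relevant
        # documents than exist for the topic.
        return min(relevants, cutoff) / cutoff

    print(max_precision(209.78, 500))   # 0.41956 -> 41.96% (routing)
    print(max_precision(209.78, 1000))  # 0.20978 -> 20.98%
    print(max_precision(215.70, 500))   # 0.4314  -> 43.14% (ad-hoc)
    print(max_precision(215.70, 1000))  # 0.2157  -> 21.57%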
Tables 3 and 4 provide another view of total performance. The numbers in each cell give the number of times the CLARIT-TREC-2 system produced results above, equal to, or below the median for all TREC-participant systems. Numbers in brackets give the instances of `extreme' performance (best and worst) among all systems. For the routing topics, for example, CLARIT retrieval results at 1,000 documents were better than the median 36 times in both "manual" and "automatic" modes; CLARIT scored the maximum 10 and 11 times, respectively. For the ad-hoc query topics, CLARIT retrieval results at 1,000 documents were better than the median 44 times in "manual" mode and 42 times in "automatic" mode; CLARIT scored the maximum 4 and 9 times, respectively.

3.2 CLARIT Automatic vs. Manual Modes of Processing

In both tasks (routing and ad-hoc querying), CLARIT-TREC-2 automatic processing results are virtually identical to manual results. This confirms our hypothesis that the principal contribution to performance derives from (1) the base-level CLARIT process (using linguistic phrases as information units) and (2) the effect of query augmentation via thesaural terms. On this latter point, we note that, on average, the final query vector for a topic will contain many more terms that derive from thesaurus extraction than terms that derive from the source topic. In general, then, when reliable information is available (as in sample known relevants or highly-likely relevants returned in a first-pass retrieval), the CLARIT process will succeed in finding good supplemental terminology for a topic, and the overall effects of manual intervention will be minimized.[1]

Figures 5 and 6 illustrate the relative absence of a positive effect for manual intervention in the selection and weighting of query terms. There are approximately as many instances of decreased performance as there are instances of increased performance. Most topics show very little percentage difference in numbers of documents returned;[2] this is especially underscored in the results for routing topics at 1,000 documents.

[1] Of course, there may be some forms of manual intervention, not utilized in the CLARIT "manual" process, that would have effects dramatically better than the CLARIT automatic process. We know of no such process that can be applied efficiently to arbitrary topics and databases, however.

[2] Indeed, even in the absolute number of documents returned for each topic (not shown in the figures) there is very little difference.

4 Analysis

4.1 CLARIT Precision

As in TREC-1, CLARIT precision-recall curves demonstrate very high precision at low percentages of recall. The first few documents returned by the system are extremely likely to be relevant for the given topic. This fact of CLARIT processing was successfully exploited in the TREC-2 processing method: query augmentation was possible because there was, in general, a good concentration of topic-relevant information among the sub-documents of the first-pass returned documents. As shown in Figure 4, precision remains quite stable for all methods across the first 30 documents retrieved and is relatively high across the full retrieved set of 1,000 documents.

5 Query Augmentation Experiments

A distinguishing feature of the CLARIT-TREC-2 system is the use of fully-automatic query augmentation. As noted above, the selection of terms for query augmentation depends on (1) the selection of a source set of known- or nominated-relevant documents and (2) the application of the CLARIT thesaurus-discovery procedure. Since the size (and quality) of the source set of documents can vary, and since CLARIT thesaurus-discovery processing can be adjusted to nominate relatively greater or fewer numbers of terms, the `query-augmentation' facet of the CLARIT process is a natural source of potential variation in system performance.
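The control flow of this kind of first-pass augmentation can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not CLARIT code: the toy term-overlap scoring stands in for the real first-pass retrieval engine, frequency-based term nomination stands in for CLARIT thesaurus discovery, and the 0.5 weight for supplemental terms is an assumed value. The two parameters n_source and n_terms correspond to the two sources of variation named above.

    from collections import Counter

    def retrieve(query, docs, k):
        # Rank documents by weighted term overlap with the query
        # vector (a stand-in for the real first-pass retrieval).
        def score(doc):
            return sum(w for t, w in query.items() if t in doc)
        return sorted(docs, key=score, reverse=True)[:k]

    def augment_query(topic_terms, docs, n_source=10, n_terms=50):
        # Run a first pass, treat the top documents as nominated
        # relevants, and add their most frequent terms to the query.
        query = {t: 1.0 for t in topic_terms}
        source_set = retrieve(query, docs, k=n_source)
        freq = Counter(t for doc in source_set for t in doc)
        for term, _ in freq.most_common(n_terms):
            # Supplemental terms typically end up outnumbering the
            # source-topic terms in the final query vector.
            query.setdefault(term, 0.5)  # assumed down-weighting
        return query

    # Toy usage: each "document" is a set of tokens.
    docs = [{"pollution", "air", "emissions"},
            {"air", "quality", "smog", "emissions"},
            {"sports", "score"}]
    print(augment_query({"air", "pollution"}, docs, n_source=2, n_terms=5))

Enlarging n_source admits noisier nominated relevants, while enlarging n_terms admits lower-frequency (and potentially off-topic) supplemental terms; varying the two together is the natural experimental grid suggested by the discussion above.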