provement over the method employed in the CLARIT-TREC-2 system.) Second, larger sample sizes seem to perform better than smaller ones. In the experiments that we ran, this effect peaks at 500 sub-documents, but such results probably interact with the average number of relevant sub-documents available. As noted previously, we would ideally like to use a variable number of sub-documents for each query, but such an approach requires an accurate measure of `absolute' relevance.

5.3 Thesaurus Size

CLARIT thesaurus discovery techniques allow for the selection (sampling) of greater or fewer numbers of terms from a collection. In practice, the size of the sample is determined by a real-number threshold between 0.00 and 1.00. A larger threshold results in the inclusion of more general terminology from the document collection; a smaller threshold results in the selection of terms that are more specific to the collection. (Intuitively, such variation correlates with the `breadth' or `narrowness' of the thesaurus.) The set of terms selected at a larger threshold will always properly include all of the terms that would be selected from the same collection at a smaller threshold (a schematic sketch of this behavior appears at the end of this section). In CLARIT-TREC-2 processing, the threshold was set at 0.50. Note, however, that in the document-sample-size experiments reported above, we used a threshold of 0.75.

Using sub-document samples of relevant documents with N = 300 (reflecting the `second-best' performance obtained in the experiments on sample size), we have explored the effects of using different thesaurus extraction thresholds, at the 0.50, 0.75, 0.85, and 0.95 levels. Results for routing topics are given in Figure 12.

As with sample-size variation, differences in performance between increments are not dramatic, but trends do appear. First, the 0.50 thesaurus is clearly inferior and actually performs very much like the baseline CLARIT-TREC-2 system. Such a result indicates that much of the variation observed between baseline CLARIT-TREC-2 processing and our current `best' technique is due to changes in the thesaurus threshold rather than changes in the document sample size. Second, while the 0.75, 0.85, and 0.95 thesauri all have similar precision at low recall levels, the 0.75 thesaurus performs slightly better in the average case. From this we might hypothesize that the 0.75 value is close to the optimal threshold for thesaurus discovery in the context of query augmentation.

Figure 13 gives the results for the automatic routing task with the optimal number of sub-documents and thesaurus threshold, compared to the TREC-2 reported results and to the unaugmented baseline. Here we can see that a simple refinement of parameter settings yields a 5.5% overall improvement in average precision and 9.1%, 8.8%, and 7.6% improvements in precision at the 10%, 20%, and 30% recall levels, respectively. In addition, there is a 6.3% improvement in the total number of relevant documents retrieved (7,241 vs. 6,811). Furthermore, our experiments suggest mechanisms, such as measures of absolute relevance, that might result in further significant gains.
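The thresholding behavior described above can be made concrete with a small, hypothetical sketch. The scoring model is an assumption on our part (the text does not specify the CLARIT term-scoring function): each candidate term carries a score in [0.00, 1.00], with lower scores marking terms more specific to the collection, and a thesaurus at threshold t keeps every term whose score does not exceed t. Under that assumption, the nesting property noted above follows directly.

```python
# Minimal sketch of threshold-based thesaurus term selection.  The scores
# below are invented; in CLARIT they would come from thesaurus discovery
# over the document collection.

def select_thesaurus_terms(scored_terms, threshold):
    """Keep every term whose score does not exceed the threshold.

    scored_terms: dict mapping term -> score in [0.00, 1.00]
                  (lower = more specific to the collection)
    threshold:    real-number cut-off in [0.00, 1.00]
    """
    return {term for term, score in scored_terms.items() if score <= threshold}


if __name__ == "__main__":
    scored = {
        "myocardial infarction": 0.12,   # highly collection-specific
        "coronary artery": 0.31,
        "clinical trial": 0.58,
        "patient": 0.82,                 # general terminology
    }
    narrow = select_thesaurus_terms(scored, 0.50)
    broad = select_thesaurus_terms(scored, 0.75)
    # The nesting property from the text: the broader thesaurus properly
    # includes every term of the narrower one.
    assert narrow <= broad
    print(sorted(narrow))   # ['coronary artery', 'myocardial infarction']
    print(sorted(broad))    # adds 'clinical trial'
```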
6 Conclusion

The CLARIT-TREC-2 system has successfully demonstrated the ability to operate as a fully automatic IR system. Since the performance differences between the CLARIT manual and automatic processing modes are negligible, one can use CLARIT in fully automatic mode and expect high precision and very good recall on retrieval tasks.

The TREC-2 results also demonstrate the efficacy of the CLARIT technique of automatic query augmentation. It is generally difficult for a user to predict whether the addition of terms to a query will have a positive or negative effect on performance. CLARIT query augmentation, using CLARIT thesaurus-discovery techniques, however, shows positive effects. Because the technique is fully automatic, it can be applied either at the time of query formulation (if exemplary relevant texts are known) or at the time of `first-pass' retrieval (sketched schematically below). In either case, final results will be improved.

In several experiments, we have already identified two simple adjustments to CLARIT parameters that will improve performance beyond the CLARIT-TREC-2 system baseline. The system is not yet optimized; we expect to make other straightforward improvements.

Many text-processing functions currently available in the CLARIT system, or nearing completion, were not used on TREC-2 documents. In future evaluations, we plan to utilize some of the more sophisticated functionality in the system. For example, we have been developing grammars for recognizing complex tokens such as proper names, dates, times, and monetary values, but did not use the token-recognition modules in CLARIT-TREC processing. We believe that such token recognition will improve the results for queries involving specific persons or time intervals. Finally, we have also been experimenting with generating sub-corpus-derived equivalence classes for words and terms. We expect to use equivalence classes selectively to supplement thesaurus terms in query augmentation.

In sum, we believe that the CLARIT-TREC-2 processing results demonstrate the power of CLARIT tools to solve IR tasks. The CLARIT-TREC-2 system represents only one of many possible configurations of CLARIT modules. In subsequent work, we plan to explore other configurations.
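As a closing illustration of the `first-pass' augmentation route mentioned above, the following sketch shows only the overall flow: retrieve an initial ranking, treat the top-ranked sub-documents as a sample of relevant text, discover terms from that sample, and add the new terms to the query. Everything in it is a stand-in: the term-overlap scorer and frequency-based term discovery are placeholders for CLARIT's vector-space retrieval and thesaurus-discovery machinery, and all function names are ours, not CLARIT's.

```python
# Hypothetical sketch of `first-pass' query augmentation (pseudo relevance
# feedback).  The retrieval and term-discovery steps are deliberately crude
# placeholders, included only to make the flow runnable end to end.
from collections import Counter


def first_pass_retrieve(query_terms, sub_documents, n=3):
    """Rank sub-documents by term overlap with the query and keep the top n."""
    def score(doc):
        words = doc.lower().split()
        return sum(words.count(t) for t in query_terms)
    return sorted(sub_documents, key=score, reverse=True)[:n]


def discover_terms(sub_documents, max_terms=5):
    """Stand-in for thesaurus discovery: most frequent words in the sample."""
    counts = Counter()
    for doc in sub_documents:
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(max_terms)]


def augment_query(query_terms, sub_documents, n=3, max_terms=5):
    """Add terms discovered from the top-ranked sub-documents to the query."""
    sample = first_pass_retrieve(query_terms, sub_documents, n)
    new_terms = [t for t in discover_terms(sample, max_terms)
                 if t not in query_terms]
    return list(query_terms) + new_terms


if __name__ == "__main__":
    docs = [
        "oil spill tanker coast cleanup crews oil",
        "stock market rally bond prices",
        "tanker runs aground oil slick wildlife",
    ]
    # The augmented query picks up sample-derived terms such as 'spill'.
    print(augment_query(["oil", "tanker"], docs, n=2, max_terms=4))
```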