NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans, R. Lefferts, G. Grefenstette, S. Henderson, W. Hersh, A. Archbold
National Institute of Standards and Technology
Donna K. Harman

a `gold standard' for the task: authoritative and comprehensive knowledge of the `correct' responses to a query. A gold standard is difficult to establish in general and is genuinely problematic in the case of the TREC experiments because of the sheer size of the corpus. Second, for the CLARIT-TREC effort in particular, many errors resulted from simple mistakes (i.e., human errors) made in the course of processing. It is difficult to isolate such incidental errors from actual flaws in the design and performance of the CLARIT-TREC system. In the following sections, we offer thoughts about the `official' performance of the CLARIT-TREC system, several hypotheses about sources of failure, and a list of known problems in the design and application of the CLARIT-TREC system to the TREC tasks.

6.2 Observations About CLARIT-TREC Performance

An evaluation of CLARIT-TREC performance must certainly begin with the comparison of CLARIT-TREC results to the NIST-identified `correct' results. Such results are reflected in, but are not restricted to, recall-precision statistics, e.g., as given in Table 3 and the comparative results in Tables 4 through 7. We must bear in mind, however, that recall-precision statistics grossly oversimplify the analysis of a system's performance in a retrieval task. CLARIT recall-precision curves demonstrate very high precision at low percentages of recall: the first few documents returned by the system are extremely likely to be relevant to the given query. Such a result is encouraging, and it suggests straightforward methods for improving the recall rates of the system.
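For concreteness, the recall-precision statistics discussed above are derived by scoring a ranked result list against the set of judged-relevant documents for a query. A minimal sketch of that computation (the document IDs are hypothetical, not the TREC data):

```python
def recall_precision_points(ranked_ids, relevant_ids):
    """Compute a (recall, precision) point after each retrieved document.

    ranked_ids: system output, best-ranked first.
    relevant_ids: the judged-relevant set (the `gold standard').
    """
    relevant = set(relevant_ids)
    points, hits = [], 0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

# High precision at low recall means the earliest points on the
# curve have precision at or near 1.0.
pts = recall_precision_points(["d3", "d7", "d1", "d9"], {"d3", "d7", "d2"})
```

A full evaluation would interpolate these points at standard recall levels and average across queries, but the per-query points above are the raw material.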
A simple, automatic iteration of the query, augmented with the top few relevant documents, should extend the `net' of retrieved relevant documents, as has been well demonstrated in past IR experiments.17

Such high precision at low recall tends to validate several hypotheses about the performance characteristics of the CLARIT system. A priori, we expect that one of the benefits of accurate and appropriate NLP in information retrieval is an improved ability to discriminate among similar documents. Furthermore, increased precision is an expected result of our `evoke-and-discriminate' system design. Because only a small subset of candidate relevant documents was considered in the discrimination phase of CLARIT-TREC processing, the distinctions among the documents could be highlighted through more `expensive' processing of the smaller topical partitions. We were able to use a vector-space model with a large number of dimensions (all multi-word terms and individual words) relative to the number of documents under consideration.

The CLARIT-TREC results are clearly competitive with those of other state-of-the-art information retrieval systems. As indicated in Tables 1 and 2 and 4 through 7, CLARIT performance relative to other TREC-participant systems is quite good. CLARIT performs consistently above the median and often at or near the top of the group. There are relatively few cases where CLARIT performance is the worst; for the ad-hoc queries, CLARIT does not perform minimally on any of the topics.

6.3 Hypotheses About Failure

Comparison of CLARIT "A" recall rates against the full results of CLARIT "B" (Tables 8 and 9) helps to isolate some sources of failure and possible flaws in CLARIT processing. CLARIT "B" processing is confined to the restricted document set identified by the partitioning procedure; the final results therefore cannot achieve recall beyond the relevant documents actually present in the 2,000-document partition.
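The automatic query iteration proposed earlier in this section (augmenting the query with terms from the top-ranked documents and re-running it) is classically realized as Rocchio-style relevance feedback. A minimal sketch over sparse term-weight dictionaries; the term names, weights, and parameter values are hypothetical, and CLARIT's actual vectors used multi-word terms as dimensions:

```python
def rocchio_expand(query, feedback_docs, alpha=1.0, beta=0.75):
    """One feedback iteration: blend the original query vector with the
    centroid of the top-ranked (assumed-relevant) documents.

    query and each doc are {term: weight} dicts (sparse vectors).
    """
    expanded = {t: alpha * w for t, w in query.items()}
    for doc in feedback_docs:
        for term, w in doc.items():
            expanded[term] = expanded.get(term, 0.0) + beta * w / len(feedback_docs)
    return expanded

# Terms drawn from the top documents widen the `net' cast by the
# next retrieval pass, trading a little precision for recall.
q2 = rocchio_expand({"nuclear": 1.0}, [{"nuclear": 0.5, "reactor": 0.8}])
```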
In some cases, many of the actually relevant documents simply were not available in the partition. As noted previously, on average, at

17 Cf. [Salton & McGill 1983] for discussion, for example.
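The recall ceiling imposed by the partitioning step can be made concrete: no ranking over the partition, however good, can recall a relevant document the partition does not contain. A minimal sketch (the document IDs are hypothetical):

```python
def recall_ceiling(partition_ids, relevant_ids):
    """Best recall any ranking over the partition can achieve: the
    share of judged-relevant documents actually in the partition."""
    relevant = set(relevant_ids)
    return len(relevant & set(partition_ids)) / len(relevant)

# If only 3 of 4 relevant documents survived into the partition,
# final recall is capped at 0.75 regardless of the ranking stage.
ceiling = recall_ceiling({"d1", "d2", "d3", "d5"}, {"d1", "d2", "d3", "d4"})
```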