NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans, R. Lefferts, G. Grefenstette, S. Henderson, W. Hersh, A. Archbold
Edited by Donna K. Harman, National Institute of Standards and Technology
a `gold standard' for the task: authoritative and comprehensive knowledge of the `correct'
responses to a query. A gold standard is difficult to establish in general and is genuinely
problematic in the case of the TREC experiments because of the sheer size of the corpus.
Second, for the CLARIT-TREC effort in particular, many errors resulted from simple mistakes
(i.e., human errors) made in the course of processing. It is difficult to isolate such incidental
errors from actual flaws in the design and performance of the CLARIT-TREC system.
In the following sections, we offer thoughts about the `official' performance of the CLARIT-
TREC system, several hypotheses about sources of failure, and a list of known problems in the
design and application of the CLARIT-TREC system to the TREC tasks.
6.2 Observations About CLARIT-TREC Performance
An evaluation of CLARIT-TREC performance must certainly begin with the comparison of
CLARIT-TREC results to the NIST-identified `correct' results. Such results are reflected in,
but are not restricted to, recall-precision statistics, e.g., as given in Table 3 and the comparative
results in Tables 4 through 7. We must bear in mind, however, that recall-precision statistics
grossly oversimplify the analysis of a system's performance in a retrieval task.
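To make the underlying statistic concrete, the following sketch computes recall-precision points from a single ranked retrieval run. The document identifiers and relevance judgments are hypothetical toy data, not TREC results; TREC tables report averages interpolated over many such runs.

```python
# Minimal sketch: recall-precision points for one ranked run.
# Ranked list and relevance set below are illustrative only.

def recall_precision_points(ranked_ids, relevant_ids):
    """Return (recall, precision) after each retrieved document."""
    relevant_ids = set(relevant_ids)
    points, hits = [], 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
        points.append((hits / len(relevant_ids), hits / rank))
    return points

ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
for r, p in recall_precision_points(ranked, relevant):
    print(f"recall={r:.2f} precision={p:.2f}")
```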
CLARIT recall-precision curves demonstrate very high precision at low percentages of recall.
The first few documents returned by the system are extremely likely to be relevant for the given
query. Such a result is encouraging, and suggests straightforward methods to improve the recall
rates of the system. A simple, automatic iteration of the query, augmented with the top few
relevant documents, should extend the `net' of retrieved relevant documents, as has been well
demonstrated in past IR experiments.17
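The feedback iteration suggested above can be sketched as follows. The term-selection heuristic (raw frequency in the top-ranked documents) and the toy documents are assumptions for illustration, not the CLARIT implementation.

```python
# Minimal sketch of pseudo-relevance feedback: augment the query with
# frequent terms from the top-ranked documents, then re-run retrieval.
# The weighting scheme (raw counts) is a hypothetical simplification.
from collections import Counter

def expand_query(query_terms, top_docs, n_new_terms=3):
    """Add the most frequent non-query terms of the top documents."""
    counts = Counter()
    for doc in top_docs:
        counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_new_terms)]

query = ["text", "retrieval"]
top = [["text", "retrieval", "index", "query", "index"],
       ["retrieval", "ranking", "index"]]
print(expand_query(query, top))
```

In practice the expanded query casts a wider `net', trading a small amount of precision for the recall gains noted in the text.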
Such high precision at low recall tends to validate several hypotheses about performance
characteristics of the CLARIT system. A priori, we expect that one of the benefits of accurate
and appropriate NLP in information retrieval is an improved ability to discriminate among
similar documents. Furthermore, increased precision is an expected result of our `evoke-and-
discriminate' system design. Because only a small subset of candidate relevant documents was
considered in the discrimination phase of CLARIT-TREC processing, the distinctions among
the documents could be highlighted through more `expensive' processing of the smaller topical
partitions. We were able to use a vector-space model with a large number of dimensions (all
multi-word terms and individual words) relative to the number of documents under considera-
tion.
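The discrimination step described above can be illustrated with a small vector-space sketch: each document in a partition is scored against the query by cosine similarity over a term vocabulary of single words and multi-word terms. The term weights and documents below are hypothetical, not CLARIT data.

```python
# Minimal sketch: cosine-similarity ranking over sparse term vectors,
# where dimensions include multi-word terms as well as single words.
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = {"information retrieval": 1.0, "natural language": 0.5}
docs = {
    "doc1": {"information retrieval": 0.8, "indexing": 0.3},
    "doc2": {"natural language": 0.9, "parsing": 0.6},
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)
```

Because the partition is small, scoring every document against every dimension remains affordable, which is the `expensive' processing the text refers to.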
The CLARIT-TREC results are clearly competitive with other state-of-the-art information
retrieval systems. As indicated in Tables 1 and 2 and 4 through 7, CLARIT performance relative
to other TREC-participant systems is quite good. CLARIT performs consistently above the
median and often at or near the top of the group. There are relatively few cases where CLARIT
performance is the worst; for the ad-hoc queries, CLARIT is never the worst performer on any
of the topics.
6.3 Hypotheses About Failure
Comparison of CLARIT "A" recall rates against the full results of CLARIT "B" (Tables 8 and 9)
helps to isolate some sources of failure and possible flaws in CLARIT processing. CLARIT "B"
processing is confined to the restricted document set identified by the partitioning procedure;
the final results cannot demonstrate recall higher than the proportion of relevant
documents present in the 2,000-document partition. In some cases, many of the actually relevant
documents simply were not available in the partition. As noted previously, on average, at
17 Cf. [Salton & McGill 1983] for discussion, for example.