provement over the method employed in the CLARIT-
TREC-2 system.) Second, larger sample sizes seem to
perform better than smaller ones. In the experiments
that we ran, this effect peaks at 500 sub-documents, but
such results probably interact with the average number
of relevant sub-documents available. As noted previ-
ously, we would ideally like to use a variable number
of sub-documents for each query, but such an approach
requires an accurate measure of `absolute' relevance.
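To make the contrast concrete, here is a minimal sketch of fixed-size versus threshold-cutoff (variable-size) sub-document selection; the score field, the function names, and the cutoff value are illustrative assumptions, not CLARIT internals.

```python
# Illustrative sketch only: contrasts a fixed-N sub-document sample with a
# variable-size sample cut off by an (assumed) absolute-relevance threshold.
# Field names and threshold values are hypothetical, not CLARIT internals.

def fixed_sample(scored_subdocs, n=300):
    """Take the top-n sub-documents by score, regardless of score values."""
    ranked = sorted(scored_subdocs, key=lambda d: d["score"], reverse=True)
    return ranked[:n]

def variable_sample(scored_subdocs, min_score=0.01):
    """Take every sub-document whose score clears an absolute threshold.
    This presupposes scores comparable across queries, which is exactly
    the 'absolute relevance' measure the text says is still missing."""
    return [d for d in scored_subdocs if d["score"] >= min_score]

if __name__ == "__main__":
    subdocs = [{"id": i, "score": 1.0 / (1 + i)} for i in range(1000)]
    print(len(fixed_sample(subdocs, n=300)))      # always 300
    print(len(variable_sample(subdocs, 0.01)))    # depends on the score distribution
```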
5.3 Thesaurus Size
CLARIT thesaurus discovery techniques allow for the
selection (sampling) of larger or smaller numbers of
terms from a collection. In practice, the size of the
sample is determined by a real-number threshold be-
tween 0.00 and 1.00. A larger threshold results in the
inclusion of more general terminology from the docu-
ment collection; a smaller threshold results in selection
of terms that are more specific to the collection. (In-
tuitively, such variation correlates with the `breadth'
or `narrowness' of the thesaurus.) The set of terms se-
lected at a larger threshold will always properly include
all of the terms that would be selected from the same
collection at a smaller threshold. In CLARIT-TREC-2
processing, the threshold was set at 0.50. Note, how-
ever, that in the document-sample-size experiments re-
ported above, we used a threshold of 0.75.
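The threshold behavior, including the nesting property just described, can be pictured with a small sketch. The per-term generality scores and the selection rule below are assumptions chosen to match the description; they are not the CLARIT thesaurus-discovery computation itself.

```python
# Sketch of threshold-based thesaurus term selection. The generality scores
# are made up; only the selection rule is meant to mirror the description:
# a larger threshold admits more (and more general) terms, and the term set
# at a larger threshold always contains the set at a smaller one.

def select_terms(term_generality, threshold):
    """Return all terms whose generality score does not exceed the threshold."""
    return {t for t, g in term_generality.items() if g <= threshold}

terms = {
    "myocardial infarction": 0.12,  # highly collection-specific
    "coronary artery": 0.35,
    "treatment": 0.72,
    "patient": 0.88,                # very general
}

narrow = select_terms(terms, 0.50)
broad = select_terms(terms, 0.75)
assert narrow <= broad   # nesting property: smaller threshold -> subset
print(sorted(narrow))    # ['coronary artery', 'myocardial infarction']
print(sorted(broad))     # adds 'treatment'
```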
Using sub-document samples of relevant docu-
ments with N = 300 (reflecting the `second-best' perfor-
mance obtained in the experiments on sample size), we
have explored the effects of using different thesaurus
extraction thresholds at the 0.50, 0.75, 0.85, and 0.95 lev-
els. Results for routing topics are given in Figure 12.
As with sample-size variation, differences in per-
formance between increments are not dramatic, but
trends do appear. First, the 0.50 thesaurus is clearly
inferior and actually performs very much like the base-
line CLARIT-TREC-2 system. Such a result indicates
that much of the variation observed between base-
line CLARIT-TREC-2 processing and our current `best'
technique is due to changes in thesaurus thresholds,
rather than changes in the document sample size. Sec-
ond, while all of the 0.75, 0.85, and 0.95 thesauri have
similar precision at low recall levels, the 0.75 the-
saurus performs slightly better in the average case.
From this we might hypothesize that the 0.75 value is
close to the optimal threshold for thesaurus discovery
in the context of query augmentation.
Figure 13 gives the results for the automatic routing
task, using the optimal number of sub-documents and
thesaurus threshold, compared to the unaugmented
baseline reported at TREC-2. Here we can
see that a simple refinement to parameter setting yields
a 5.5% overall improvement in average precision and
9.1%, 8.8%, and 7.6% improvement in precision at 10%,
20%, and 30% recall levels, respectively. In addition,
there is a 6.3% improvement in total relevant docu-
ments (7,241 vs. 6,811). Furthermore, our experiments
suggest mechanisms, such as measures of absolute rel-
evance, that might result in further significant gains.
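As a quick arithmetic check, the 6.3% figure follows directly from the two totals quoted above:

```python
# Relative gain in total relevant documents retrieved (figures from the text).
baseline, augmented = 6811, 7241
print(f"{(augmented - baseline) / baseline:.1%}")   # 6.3%
```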
6 Conclusion
The CLARIT-TREC-2 system has successfully demon-
strated the ability to operate as a fully-automatic IR
system. Since the performance differences between
CLARIT manual and automatic processing modes are
negligible, one can use CLARIT in fully-automatic
mode and expect high precision and very good recall
on retrieval tasks.
The TREC-2 results also demonstrate the efficacy of
the CLARIT technique of automatic query augmenta-
tion. It is generally difficult for a user to predict whether
the addition of terms to a query will have a positive or
negative effect on performance. CLARIT query aug-
mentation, using CLARIT thesaurus-discovery tech-
niques, however, shows positive effects. Because the
technique is fully automatic, it can be applied either at
the time of query formulation (if exemplary relevant
texts are known) or at the time of `first-pass' retrieval.
In either case, final results will be improved.
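Schematically, the augmentation step amounts to appending discovered thesaurus terms to the original query, whether those terms come from exemplary relevant texts or from first-pass retrieval results. The sketch below is a simplified illustration; the sample terms and the cap on added terms are assumptions, not the CLARIT weighting scheme.

```python
# Schematic sketch of thesaurus-based query augmentation: terms discovered
# from exemplary relevant texts (or from first-pass retrieval results) are
# appended to the original query terms. The terms and the cap on additions
# are illustrative placeholders.

def augment_query(query_terms, thesaurus_terms, max_added=20):
    """Add up to max_added thesaurus terms not already in the query."""
    added = [t for t in thesaurus_terms if t not in query_terms][:max_added]
    return list(query_terms) + added

query = ["oil", "spill", "cleanup"]
discovered = ["crude oil", "tanker", "containment boom", "spill"]
print(augment_query(query, discovered))
# ['oil', 'spill', 'cleanup', 'crude oil', 'tanker', 'containment boom']
```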
In several experiments, we have already identified
two simple adjustments to CLARIT parameters that
will improve performance beyond the CLARIT-TREC-
2 system baseline. The system is not yet optimized; we
expect to make other straightforward improvements.
Many text-processing functions currently available
in the CLARIT system, or nearing completion, were not
used on TREC-2 documents. In future evaluations, we
plan to utilize some of the more sophisticated function-
ality in the system. For example, we have been devel-
oping grammars for recognizing complex tokens such
as proper names, dates, times, monetary values, etc.,
but did not use token recognition modules in CLARIT-
TREC processing. We believe that such token recog-
nition will improve the results for queries involving
specific persons or time intervals. Finally, we have
also been experimenting with generating sub-corpus-
derived equivalence classes for words and terms. We
expect to use equivalence classes selectively to supple-
ment thesaurus terms in query augmentation.
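As an illustration of the kind of complex-token recognition intended, a minimal sketch using regular expressions follows; the patterns are deliberately simple stand-ins for the grammars under development, not part of the CLARIT system.

```python
import re

# Minimal stand-in for complex-token recognition (dates, monetary values).
# These expressions are simple illustrations, not the CLARIT grammars.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s+(?:million|billion))?"),
    "DATE": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
        r"\s+\d{1,2},\s+\d{4}\b"),
}

def tag_tokens(text):
    """Return (label, matched text) pairs for each recognized complex token."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((label, m.group(0)) for m in pattern.finditer(text))
    return hits

print(tag_tokens("The settlement of $1.5 million was announced on March 24, 1992."))
# [('MONEY', '$1.5 million'), ('DATE', 'March 24, 1992')]
```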
In sum, we believe that CLARIT-TREC-2 process-
ing results demonstrate the power of CLARIT tools to
solve IR tasks. The CLARIT-TREC-2 system represents
only one of many possible configurations of CLARIT
modules. In subsequent work, we plan to explore other
configurations.