NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans
R. Lefferts
G. Grefenstette
S. Henderson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
routing topic. The rules of the exercise required each group to submit `models' for each routing
topic (e.g., a set of procedures or a `query vector'), which then were `on record' and had to
be used in the final evaluation of the routing task. That evaluation required that each group
retrieve documents from the second installment of the collection, approximately 0.9 gigabytes of
text. "Ad-hoc" querying corresponds to situations in which a topic is presented to a system
and appropriate documents must be found; no example documents are available. In TREC, the
second 50 topics were designed as "ad-hoc-query" topics. The rules required each group to use
the full 2-gigabyte database as the search space for ad-hoc queries. All results were reported
as a ranked list of the 200 `top' documents in response to each topic, whether a routing topic
or an ad-hoc-query topic.
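The ranked-list requirement described above can be sketched in a few lines of code. The overlap scoring function and the miniature document collection below are hypothetical stand-ins for illustration only; they are not the CLARIT system's actual retrieval method.

```python
# Sketch: produce a ranked "top k" document list for a topic, as the
# TREC rules required (k = 200 in the actual exercise).
# The term-overlap score and the collection are illustrative only.
import heapq

def top_documents(topic_terms, documents, k=200):
    """Rank documents by a toy term-overlap score and keep the top k.

    `documents` maps a document id to its set of index terms.
    """
    def score(doc_terms):
        # Naive score: number of topic terms found in the document.
        return len(topic_terms & doc_terms)

    ranked = heapq.nlargest(
        k, documents.items(), key=lambda item: score(item[1])
    )
    return [doc_id for doc_id, _ in ranked]

# Hypothetical miniature collection.
docs = {
    "WSJ-001": {"oil", "prices", "market"},
    "AP-002": {"election", "senate"},
    "FR-003": {"oil", "export", "policy"},
}
print(top_documents({"oil", "market"}, docs, k=2))
# -> ['WSJ-001', 'FR-003']
```

In the actual task the same output format, a ranked list of the 200 top documents, was required for both routing topics and ad-hoc topics; only the search space and the query-construction rules differed.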
1.2 Notes on CLARIT Team Participation
The CLARIT team submitted results, labeled "A" and "B",5 representing the top 200 documents at the end of each of two sequential steps in the processing of topics. Since the actual
processing of topics was designed to give `best' results only after both stages of processing were
completed, the "A" results are known to be suboptimal; the "B" results represent the true test
of the CLARIT-TREC design.
The large scale of the tasks challenged the resources that were available to the CLARIT
team. Storage for the source data and topics alone required 2 gigabytes of space. The research-
prototype version of the CLARIT system, which was used in the task, generates various sec-
ondary and intermediate resources in the course of processing. Such intermediate files also
require temporary storage. In all, approximately 8 gigabytes of disk space was used for the
process. The system-engineering work required to manage the data represented a significant ef-
fort for the team; more than 75% of the team effort was devoted to (a) re-implementing critical
CLARIT processes to deal with larger volumes of data and limited space and (b) monitoring
and directing the use of resources and the sequence of processes when making actual `runs' over
the data.
Data for the final tests was made available by NIST only after preliminary processing
results were submitted. The CLARIT team submitted its preliminary results (the `frozen'
forms of the routing queries) on Friday, August 21, 1992. NIST express-mailed the new test
data to Carnegie Mellon on the same day, but the package was misaddressed and did not arrive.
A second mailing finally did arrive on Tuesday, August 25, one week before the deadline for
final results. Thus, all final processing took place in seven days.
The CLARIT team used six machines (including a DECsystem 5820 and DECstation 5000s and 3100s) and approximately 8 gigabytes of dedicated storage for TREC-processing tasks. Actual processing occurred in batch mode over several machines and across a network (as some storage was remote).
2 Background Description of Basic CLARIT Processing in TREC
Basic CLARIT processing is described elsewhere.6 A schematic representation of the `standard'
CLARIT process for document indexing is given in Figure 1. A representation of the simplified
CLARIT process that was employed in the case of CLARIT-TREC document indexing is given
in Figure 2.
5The Conference provided a special category ("Category B") for groups that intended to work only with a
subset (100 megabytes) of the TREC data. This should not be confused with what we call the CLARIT "A"
and "B" results: all CLARIT processing involved the full set of TREC data.
6Cf. [Evans 1990], [Evans et al. 1991a,b,c].