SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
chapter
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
WSJ891102-0187
McDermott International Inc. said its Babcock & Wilcox
unit completed the sale of its Bailey Controls Operations
to Finmeccanica S.p.A. for $295 million.
Finmeccanica is an Italian state-owned holding company
with interests in the mechanical engineering industry.
Bailey Controls, based in Wickliffe, Ohio,
makes computerized industrial controls systems.
It employs 2,700 people and has annual revenue
of about $370 million.
Figure 5: Sample of Data-Document After Text Formatting
4 Details of the CLARIT-TREC Experiments
Both "routing" and "ad-hoc" query experiments took advantage of basic CLARIT processing.
The two experiments share several features. They differ in that "routing" involved a special
step, the creation of a partitioning thesaurus from larger sets of supplied relevant documents,
whereas "ad-hoc" queries involved partitioning the document set once, using only automatically
derived (but manually weighted) query terms, and then choosing a small set of relevant
documents to expand the final query vector.
4.1 Preparing Data
Each TREC document had to be formatted for CLARIT processing. This involved making
the unique text ID accessible to CLARIT as a special field and delimiting the beginning and
end of each text in a file. Figure 5 gives a sample formatted document. As can be seen in the
sample, the beginning and end of the record are marked by a backslash followed by "*". The
unique ID is set off by a backslash followed by "#". The beginning and end of the text of the
document are marked by a backslash followed by "!". Each paragraph is separated from the
next by a backslash followed by "C".13
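
The delimiter scheme just described can be illustrated with a minimal Python sketch. It is not the CLARIT formatter itself: the function names (format_record, parse_record), the sample ID, and the exact placement of the markers (each on its own line, with the ID wrapped on both sides) are assumptions made for illustration. The parser treats everything between the "!" marks as a single unit of text and drops the paragraph marks.

```python
import re

# Delimiters as described in the text: a backslash followed by one character.
REC, ID, TEXT, PARA = "\\*", "\\#", "\\!", "\\C"


def format_record(doc_id, paragraphs):
    """Wrap one document in record, ID, text, and paragraph markers.

    The line layout (one marker per line, ID wrapped on both sides)
    is an assumption; the source only names the markers.
    """
    body = f"\n{PARA}\n".join(p.strip() for p in paragraphs)
    return "\n".join([REC, f"{ID}{doc_id}{ID}", TEXT, body, TEXT, REC])


def parse_record(record):
    """Recover (doc_id, text) from a record built by format_record.

    Paragraph marks are discarded, so the whole document text is
    returned as one whitespace-normalized string.
    """
    id_match = re.search(re.escape(ID) + r"(.*?)" + re.escape(ID), record, re.S)
    text_match = re.search(re.escape(TEXT) + r"(.*?)" + re.escape(TEXT), record, re.S)
    if id_match is None or text_match is None:
        raise ValueError("record is missing an ID field or a text field")
    doc_id = id_match.group(1).strip()
    text = " ".join(text_match.group(1).replace(PARA, " ").split())
    return doc_id, text


if __name__ == "__main__":
    # Hypothetical ID and text, for illustration only.
    record = format_record("DOC-0001", ["First paragraph.", "Second paragraph."])
    print(record)
    print(parse_record(record))
```

Keeping the markers in named constants means a different layout of the same scheme only requires changing format_record, while parse_record continues to locate fields by their delimiters.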
4.2 Processing TREC Corpora (NLP)
Figure 6 gives a schematic representation of the processing steps that occurred subsequent
to data formatting. The process labeled "NLP" in the figure includes all the steps illustrated in
the "NLP" portion of Figure 2: morphological analysis of words and parsing for simplex NPs.
Simplex NPs were extracted for all TREC documents; words were morphologically normalized.14
13 Though CLARIT data preparation demarcates paragraph units, the CLARIT-TREC process did not distinguish
divisions of text at this level. For CLARIT-TREC purposes, all the text between the "!"-marks was used as the
source of information about a document. Thus, longer and shorter documents were treated uniformly as 'unit'
texts.
14 The manually-supplied keywords attached to some TREC documents in a "keyword field" were discarded.