SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Design and Evaluation of the CLARIT-TREC-2 System
chapter
D. Evans
R. Lefferts
National Institute of Standards and Technology
D. K. Harman
* Sub-Document Processing. The CLARIT-TREC-1
system treated all documents as whole texts;
retrieval `scores' were calculated over full
documents. The CLARIT-TREC-2 system treats all
documents as collections of one or more sub-
documents, operationalized as variable-sized units
of approximately paragraph length. Such units are
used as the basis for all statistical calculations and
for measuring `similarity' to a query. A full document
is assigned the score (e.g., for ranking) of the
highest-scoring sub-document it contains.
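
The max-over-sub-documents rule described above can be sketched as follows. This is a minimal illustration, not the CLARIT implementation: the blank-line splitter stands in for the system's variable-sized, roughly paragraph-length units, and the term-overlap similarity is a placeholder for its vector-space score. The function names are hypothetical.

```python
from typing import List


def split_into_subdocuments(text: str) -> List[str]:
    """Assumption: blank lines delimit paragraph-length sub-documents."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def overlap_score(query_terms: List[str], subdoc: str) -> float:
    """Placeholder similarity: fraction of query terms found in the sub-document."""
    if not query_terms:
        return 0.0
    words = set(subdoc.lower().split())
    hits = sum(1 for term in query_terms if term.lower() in words)
    return hits / len(query_terms)


def document_score(query_terms: List[str], text: str) -> float:
    """A document receives the score of its highest-scoring sub-document."""
    subdocs = split_into_subdocuments(text)
    return max((overlap_score(query_terms, s) for s in subdocs), default=0.0)
```

Under this rule, a document whose query terms are scattered across unrelated paragraphs scores lower than one in which they co-occur within a single sub-document, even if the two documents would look identical when scored as whole texts.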
2.2 Processing Method
Figure 1 offers a schematic overview of processing in
the CLARIT-TREC-2 system. All topics were parsed
for noun phrases. These, in turn, were either manually
("CLARTM") or automatically ("CLARTA") assigned
weights (values "1", "2", or "3") for `importance'. The
terms for each topic were automatically supplemented
with terms from a (pseudo-)thesaurus, automatically
extracted from available known-relevant documents
(in the case of routing topics) or from the top-ranked
sub-documents returned in a first-pass querying of the
TREC-2 collection (in the case of ad-hoc topics). All in-
stances of retrieval took place over the applicable full set
of documents, which had undergone an initial round
of CLARIT processing (parsing).
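
The supplementation step can be illustrated with a small sketch. It assumes each feedback sub-document (a known-relevant document for routing, or a top-ranked first-pass sub-document for ad-hoc topics) has already been reduced to a list of noun-phrase terms. The selection criterion (number of feedback sub-documents containing a term), the cutoff `max_new_terms`, and the weight given to added terms are all hypothetical choices for illustration; the text above does not specify how CLARIT thesaurus discovery selects or weights terms.

```python
from collections import Counter
from typing import Dict, List


def augment_query(
    topic_terms: Dict[str, int],
    feedback_subdocs: List[List[str]],
    max_new_terms: int = 3,     # hypothetical cutoff
    new_term_weight: int = 1,   # hypothetical weight for added terms
) -> Dict[str, int]:
    """Add the most widely shared feedback terms to a weighted topic query."""
    # Count how many feedback sub-documents contain each term.
    doc_freq: Counter = Counter()
    for terms in feedback_subdocs:
        doc_freq.update(set(terms))

    # Candidates are terms not already in the topic; ties break alphabetically.
    candidates = [(t, n) for t, n in doc_freq.items() if t not in topic_terms]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))

    augmented = dict(topic_terms)  # original importance weights (1-3) are kept
    for term, _ in candidates[:max_new_terms]:
        augmented[term] = new_term_weight
    return augmented
```

For example, a topic with terms {"oil spill": 3, "tanker": 2} and feedback sub-documents that repeatedly mention "cleanup cost" and "coast guard" would gain those two phrases as additional query terms.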
The CLARIT-TREC-2 system incorporates a vector-
space retrieval system that uses several CLARIT-
specific techniques to improve retrieval results. The
principal techniques involve the use of (1) natural-
language processing to identify and normalize index-
ing terms, (2) fully automatic query augmentation
based on CLARIT thesaurus discovery, and (3) sim-
ple text-analysis heuristics to approximate the effect of
more sophisticated discourse analysis of texts. These
techniques are described in greater detail in the follow-
ing sections.
2.2.1 Natural-Language Processing
CLARIT natural-language processing (NLP) encom-
passes an inflectional morphological analyzer for word
recognition and normalization and a deterministic
rule-based parser for phrase identification. For TREC-2
processing, only simplex noun phrases (NPs) were used.
Simplex NPs are phrasal constituents that include the
modifiers and head noun(s) of an NP but not the post-
head prepositional phrases, relative clauses, or verb
constructions. The CLARIT parser can provide a more
complex linguistic analysis of texts, but such additional
detail was not used in TREC-2 experiments.
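
The definition of a simplex NP can be made concrete with a toy chunker. This is a sketch under stated assumptions, not the CLARIT parser: it assumes the input is already tagged with a coarse, hypothetical tag set (ADJ, NOUN, DET, PREP, VERB), treats any run of adjectives and nouns as modifiers plus head noun(s), and omits determiners and morphological normalization.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, coarse part-of-speech tag); tag set is an assumption


def simplex_noun_phrases(tagged: List[Token]) -> List[str]:
    """Collect runs of ADJ/NOUN tokens that end in a noun.

    Prepositions, verbs, and other tags close the current run, so post-head
    prepositional phrases and clauses never become part of the phrase.
    """
    phrases: List[str] = []
    run: List[Token] = []
    for token in tagged + [("", "END")]:  # sentinel flushes the final run
        if token[1] in ("ADJ", "NOUN"):
            run.append(token)
            continue
        # Trim trailing modifiers so the phrase ends at a head noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            phrases.append(" ".join(word.lower() for word, _ in run))
        run = []
    return phrases
```

Given the tagged fragment "the deterministic parser for phrase identification", the sketch yields two simplex NPs, "deterministic parser" and "phrase identification"; the prepositional phrase is not attached to the first head noun.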
[Figure 1 diagram not recoverable from the scanned text. Legible labels:
Topic; Heuristics; Optional Manual Correction; Query Vector Construction;
Thesaurus Extraction; Relevant Documents; Retrieval Corpus and Training
Corpus (each parsed; possibly identical for ad-hoc queries); Vector-Space;
Fully Automatic Query Feedback (ad-hoc).]
Figure 1: Overview of CLARIT-TREC-2 Processing