NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Design and Evaluation of the CLARIT-TREC-2 System
D. Evans, R. Lefferts
National Institute of Standards and Technology
D. K. Harman

* Sub-Document Processing. The CLARIT-TREC-1 system treated all documents as whole texts; retrieval `scores' were calculated over full documents. The CLARIT-TREC-2 system treats all documents as collections of one or more sub-documents, operationalized as variable-sized units of approximately paragraph length. Such units are used as the basis for all statistical calculations and for measuring `similarity' to a query. A full document is assigned the score (e.g., for ranking) of the highest-scoring sub-document it contains.

2.2 Processing Method

Figure 1 offers a schematic overview of processing in the CLARIT-TREC-2 system. All topics were parsed for noun phrases. These, in turn, were either manually ("CLARTM") or automatically ("CLARTA") assigned weights (values "1", "2", or "3") for `importance'. The terms for each topic were automatically supplemented with terms from a (pseudo-)thesaurus, automatically extracted from available known-relevant documents (in the case of routing topics) or from the top-ranked sub-documents returned in a first-pass querying of the TREC-2 collection (in the case of ad-hoc topics). All instances of retrieval took place over the applicable full set of documents, which had undergone an initial round of CLARIT processing (parsing).

The CLARIT-TREC-2 system incorporates a vector-space retrieval system that uses several CLARIT-specific techniques to improve retrieval results. The principal techniques involve the use of (1) natural-language processing to identify and normalize indexing terms, (2) fully automatic query augmentation based on CLARIT thesaurus discovery, and (3) simple text-analysis heuristics to approximate the effect of more sophisticated discourse analysis of texts.
These techniques are described in greater detail in the following sections.

2.2.1 Natural-Language Processing

CLARIT natural-language processing (NLP) encompasses an inflectional morphological analyzer for word recognition and normalization and a deterministic rule-based parser for phrase identification. For TREC-2 processing, only simplex noun phrases (NPs) were used. Simplex NPs are phrasal constituents that include the modifiers and head noun(s) of an NP but not the post-head prepositional phrases, relative clauses, or verb constructions. The CLARIT parser can provide a more complex linguistic analysis of texts, but such additional detail was not used in TREC-2 experiments.

Figure 1: Overview of CLARIT-TREC-2 Processing
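
The sub-document scoring rule stated earlier (a full document receives the score of its highest-scoring sub-document) can be illustrated with a minimal Python sketch. It is not the CLARIT implementation. Several details are assumptions made only for illustration: sub-documents are approximated by splitting on blank lines, terms are lowercase word tokens rather than parsed noun phrases, term weights are raw frequencies, and similarity is plain cosine. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sub_documents(text):
    """Split text into paragraph-like units (assumption: blank-line breaks)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def term_vector(text):
    """Raw term frequencies over word tokens (stand-in for noun-phrase terms)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_document(query, document):
    """Score a document as the maximum score over its sub-documents."""
    q = term_vector(query)
    scores = [cosine(q, term_vector(s)) for s in split_sub_documents(document)]
    return max(scores, default=0.0)
```

Under these assumptions, a document with one paragraph closely matching the query scores higher than it would if scored as a single whole text, because unrelated paragraphs no longer dilute the match.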