SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) CLARIT TREC Design, Experiments, and Results chapter D. Evans R. Lefferts G. Grefenstette S. Handerson W. Hersh A. Archbold National Institute of Standards and Technology Donna K. Harman

WS3891102-0187 McDermott International Inc. said its Babcock & Wilcox unit completed the sale of its Bailey Controls Operations to Finmeccanica S.p.A. for $295 million. Finmeccanica is an Italian state-owned holding company with interests in the mechanical engineering industry. Bailey Controls, based in Wickliffe, Ohio, makes computerized industrial controls systems. It employs 2,700 people and has annual revenue of about $370 million.

Figure 5: Sample of Data-Document After Text Formatting

4 Details of the CLARIT-TREC Experiments

Both "routing" and "ad-hoc" query experiments took advantage of basic CLARIT processing, and the two experiments share several features. They differ in that "routing" involved a special step, the creation of a partitioning thesaurus from larger sets of supplied relevant documents, whereas "ad-hoc" queries involved partitioning the document set once, using only automatically derived (but manually weighted) query terms, and choosing a small set of relevant documents to expand the final query vector.

4.1 Preparing Data

Each TREC document had to be formatted for CLARIT processing. This involved making the unique text ID accessible to CLARIT as a special field and delimiting the beginning and end of each text in a file. Figure 5 gives a sample formatted document. As can be seen in the sample, the beginning and end of the record are marked by a backslash followed by "*". The unique ID is set off by a backslash followed by "#". The beginning and end of the text of the document are marked by a backslash followed by "!".
Each paragraph is separated from the next by a backslash followed by "C".13

4.2 Processing TREC Corpora (NLP)

Figure 6 gives a schematic representation of the processing steps that occurred subsequent to data formatting. The process labeled "NLP" in the figure includes all the steps illustrated in the "NLP" portion of Figure 2: morphological analysis of words and parsing for simplex NPs. Simplex NPs were extracted for all TREC documents; words were morphologically normalized.14

13 Though CLARIT data preparation demarcates paragraph units, the CLARIT-TREC process did not distinguish divisions of text at this level. For CLARIT-TREC purposes, all the text between the "!"-marks was used as the source of information about a document. Thus, longer and shorter documents were treated uniformly as 'unit' texts.

14 The manually-supplied keywords attached to some TREC documents in a "keyword field" were discarded.
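
The record layout described in Section 4.1 can be illustrated with a minimal Python sketch. It is not the CLARIT implementation. The four markers (backslash plus "*", "#", "!", and "C") come from the text, but their exact placement within a record (line breaks, and whether the ID marker is closed by a second marker) is not specified here and is assumed. The names `format_record` and `parse_record` are hypothetical. The parser discards paragraph marks, in keeping with footnote 13, so each document is recovered as a single unit text.

```python
import re

# Markers as described in Section 4.1. Their exact placement within a record
# (line breaks, unpaired ID marker) is an assumption made for this sketch.
RECORD_MARK = "\\*"   # beginning and end of a record
ID_MARK = "\\#"       # sets off the unique text ID
TEXT_MARK = "\\!"     # beginning and end of the document text
PARA_MARK = "\\C"     # separates one paragraph from the next


def format_record(doc_id, paragraphs):
    """Wrap one document in the delimiters (hypothetical layout)."""
    body = f" {PARA_MARK} ".join(p.strip() for p in paragraphs)
    return "\n".join([
        RECORD_MARK,
        f"{ID_MARK}{doc_id}",
        f"{TEXT_MARK}{body}{TEXT_MARK}",
        RECORD_MARK,
    ])


def parse_record(record):
    """Recover (doc_id, text) from a record, treating the text as one unit."""
    id_match = re.search(re.escape(ID_MARK) + r"(\S+)", record)
    text_match = re.search(
        re.escape(TEXT_MARK) + r"(.*?)" + re.escape(TEXT_MARK),
        record,
        re.DOTALL,
    )
    if id_match is None or text_match is None:
        raise ValueError("record is missing an ID field or a text field")
    # Paragraph marks are dropped and whitespace is collapsed, since
    # paragraph divisions were not used in the CLARIT-TREC process.
    text = " ".join(text_match.group(1).replace(PARA_MARK, " ").split())
    return id_match.group(1), text
```

For example, `parse_record(format_record("WS3891102-0187", ["First paragraph.", "Second paragraph."]))` returns the ID together with the two paragraphs joined into one string, with no paragraph marker left in the text.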