NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR-LINK's Linguistic-Conceptual Approach to Document Detection
E. Liddy and S. Myaeng
National Institute of Standards and Technology, Donna K. Harman

modules, the system description will also be organized according to these same divisions. The documents and the topic statements are analyzed in basically similar ways by the system's modules, with a few exceptions, which are detailed below within the relevant module's description.

2.a. Pre-Processing

We have chosen to perform rather substantive pre-processing of the raw text received from DARPA because much of our later processing depends on clean, well-demarcated text. For example, we identify sub-headlines and embedded figures, and we identify and restore correct sentence boundaries. We also identify multiple stories within a single document, so that each separate story can have its own SFC vector produced, representative of just that one story, and so that accurate text structuring can be accomplished at the individual story level.

The text is then processed by the POST part-of-speech tagger (Meteer et al., 1991), loaned to us by BBN, which stochastically attaches a part-of-speech tag to each word. The part-of-speech tagged text is then fed into a bracketer, a deterministic finite-state automaton that adds several different types of brackets for linguistic constituents (e.g., noun phrases, prepositional phrases, clauses) essential for several tasks in our system.

2.b. Subject Field Coder

The Subject Field Coder (SFCer) produces a summary-level semantic representation of a text's contents that is usable either for ranking a large set of incoming documents for their broad subject appropriateness to a standing query, or for dividing a database into clusters of documents on the same topic.
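The bracketing step described in the pre-processing stage above can be sketched as a small finite-state pass over tagged text. This is a minimal illustration under assumed Penn-Treebank-style tags and a toy noun-phrase grammar, not the actual DR-LINK bracketer:

```python
# Sketch of a deterministic finite-state noun-phrase bracketer over
# POS-tagged text. The tag sets and grammar below are illustrative
# assumptions, not the DR-LINK system's actual rules.

NP_START = {"DT", "JJ", "NN", "NNS", "NNP"}   # tags that may open an NP
NP_CONT = {"JJ", "NN", "NNS", "NNP"}          # tags that may continue one

def bracket_noun_phrases(tagged):
    """Insert [NP ... ] brackets around maximal noun-phrase runs.

    `tagged` is a list of (word, tag) pairs, as produced by a
    part-of-speech tagger such as POST.
    """
    out, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag in NP_START:
            phrase = [word]
            i += 1
            # stay in the "inside NP" state while continuation tags arrive
            while i < len(tagged) and tagged[i][1] in NP_CONT:
                phrase.append(tagged[i][0])
                i += 1
            out.append("[NP " + " ".join(phrase) + " ]")
        else:
            out.append(word)
            i += 1
    return " ".join(out)

tagged = [("A", "DT"), ("federal", "JJ"), ("magistrate", "NN"),
          ("ordered", "VBD"), ("the", "DT"), ("defendant", "NN"),
          ("held", "VBN")]
print(bracket_noun_phrases(tagged))
# [NP A federal magistrate ] ordered [NP the defendant ] held
```

A real bracketer would also handle prepositional phrases and clauses with further states, but the same single left-to-right pass with no backtracking is what makes the automaton deterministic.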
One important benefit of the SFC representation is that it implicitly handles both the synonymy and polysemy problems which have plagued the use of NLP in IR systems, because this representation is one level above the actual words in a text. For example, Figure 2 presents a short WSJ article and a humanly readable version of the normalized SFC vector which serves as the document's semantic summary representation.

    A U.S. magistrate in Florida ordered Carlos Lehder Rivas, described as
    among the world's leading cocaine traffickers, held without bond on 11
    drug-smuggling counts. Lehder, who was captured last week in Colombia
    and immediately extradited to the U.S., pleaded innocent to the charges
    in federal court in Jacksonville.

    LAW .2667            [OCRerr] .1333      BUSINESS .1333    ECONOMICS .0667
    DRUGS .1333          MILITARY .0667      POLITICAL SCIENCE .1333
    OCCUPATIONS .0667

Fig. 2: Sample WSJ document and its SFC representation

The SFCer uses the Subject Codes from Longman's Dictionary of Contemporary English (LDOCE) to produce this semantic representation of a text's contents. The machine-readable tape of the 1987 edition of LDOCE contains 35,899 headwords and 53,838 senses, for an average of 1.499 senses per headword, plus several fields of information not visible in the hard-copy version.
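The normalized SFC vector of Figure 2 can be sketched as a simple count-and-normalize over the subject codes assigned to a document's words. The tiny word-to-code lexicon here is a hypothetical stand-in for the LDOCE Subject Codes, and the procedure illustrates only the shape of the representation, not DR-LINK's actual sense-disambiguation step:

```python
# Sketch of building a normalized Subject Field Code (SFC) vector.
# SUBJECT_CODES is an invented mini-lexicon standing in for LDOCE;
# real entries would come from the machine-readable dictionary.

from collections import Counter

SUBJECT_CODES = {
    "magistrate": "LAW", "bond": "LAW", "court": "LAW", "charges": "LAW",
    "cocaine": "DRUGS", "drug": "DRUGS",
    "traffickers": "BUSINESS",
    "extradited": "POLITICAL SCIENCE",
}

def sfc_vector(words):
    """Return {subject code: relative frequency} over the coded words."""
    counts = Counter(SUBJECT_CODES[w] for w in words if w in SUBJECT_CODES)
    total = sum(counts.values())
    return {code: round(n / total, 4) for code, n in counts.items()}

words = ["magistrate", "cocaine", "traffickers", "bond",
         "drug", "court", "extradited", "charges"]
print(sfc_vector(words))
# {'LAW': 0.5, 'DRUGS': 0.25, 'BUSINESS': 0.125, 'POLITICAL SCIENCE': 0.125}
```

Because two vectors built this way live in the same fixed code space rather than the word space, a standing query's SFC vector can be compared directly against each incoming document's vector (e.g., by a vector similarity measure) regardless of which synonyms the texts happen to use.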