NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
National Institute of Standards and Technology, D. K. Harman

DR-LINK: A System Update for TREC-2

Elizabeth D. Liddy
Sung H. Myaeng
School of Information Studies
Syracuse University
Syracuse, New York 132100-4100
liddy@mailbox.syr.edu; slunyaeng@mailbox.syr.edu

1. Overview of DR-LINK's Approach

The theoretical goal underlying the DR-LINK System is to represent and match documents and queries at the various linguistic levels at which human language conveys meaning. Accordingly, we have developed a modular system which processes and represents text at the lexical, syntactic, semantic, and discourse levels of language. In concert, these levels of processing permit DR-LINK to achieve a level of intelligent retrieval beyond more traditional approaches. In addition, the rich annotations to text produced by DR-LINK are replete with much of the semantics necessary for document extraction. The system was planned and developed in a modular fashion, and functional modularity has been achieved, while a full integration of these multiple levels of linguistic processing is within reach.

As currently configured, DR-LINK performs a staged processing of documents, with each module adding a meaningful annotation to the text. For matching, a Topic Statement undergoes analogous processing to determine its relevancy requirements for documents at each stage. Among the many benefits of staged processing are: improvements and changes can be easily made within any module; the contribution of the various stages can be empirically tested by simply turning them on or off; modules can be re-ordered (as was done within the last six months) in order to utilize document annotations in various ways; and individual modules can be incorporated in other evolving systems.
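The staged design described above can be illustrated with a minimal sketch. It is not DR-LINK's implementation: the names `Document`, `StagedPipeline`, `toy_text_structurer`, and `toy_subject_field_coder` are hypothetical, and the two toy stages are trivial stand-ins that perform no real linguistic analysis. The sketch only shows how each stage can add its own annotation and how a stage can be switched off to test its contribution.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Document:
    """A text plus the annotations that each stage attaches to it."""
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# A stage reads the document and returns the annotation it contributes.
Stage = Callable[[Document], object]


def toy_text_structurer(doc: Document) -> object:
    # Placeholder only: splits on periods instead of tagging discourse components.
    return [s.strip() for s in doc.text.split(".") if s.strip()]


def toy_subject_field_coder(doc: Document) -> object:
    # Placeholder only: word counts stand in for a subject-field vector.
    counts: Dict[str, int] = {}
    for word in doc.text.lower().replace(".", " ").split():
        counts[word] = counts.get(word, 0) + 1
    return counts


@dataclass
class StagedPipeline:
    """Runs named stages in list order, skipping any that are disabled."""
    stages: List[Tuple[str, Stage]]
    disabled: Set[str] = field(default_factory=set)

    def run(self, text: str) -> Document:
        doc = Document(text)
        for name, stage in self.stages:
            if name in self.disabled:
                continue  # stage switched off, e.g. for an ablation run
            doc.annotations[name] = stage(doc)
        return doc


pipeline = StagedPipeline(
    stages=[
        ("text_structurer", toy_text_structurer),
        ("subject_field_coder", toy_subject_field_coder),
    ]
)
full = pipeline.run("Prices rose. Prices fell.")

# Turn one stage off and rerun to compare against the full configuration.
pipeline.disabled.add("subject_field_coder")
partial = pipeline.run("Prices rose. Prices fell.")
```

In this sketch, re-ordering modules amounts to changing the order of the `stages` list, and later stages can read earlier annotations from `doc.annotations`.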
The purpose of each of the processing modules will be briefly introduced here (also see Figure 1) in the order in which the system is currently run, with fuller explanations provided in the section below:

1) the Text Structurer labels clauses or sentences with a text-component tag which provides a means for responding to the discourse level Topic Statement requirements of time, source, intentionality, and state of completion;
2) the Subject Field Coder provides a subject-based, summary-level vector representation of the content of each text;
3) the Proper Noun Interpreter and
4) the Complex Nominal Phraser provide precise levels of content representation in the form of concepts and relations, as well as controlled expansion of group nouns and content-bearing nominal phrases;
5) the Relation-Concept Detector produces concept-relation-concept triples with a range of semantic relations expressed via various syntactic classes, e.g. verbs, nominalized verbs, complex nominals, and proper nouns;
6) the Conceptual Graph Generator combines the triples to form a CG and adds Roget's International Thesaurus (RIT) codes to concept nodes; and
7) the Conceptual Graph Matcher determines the degree of overlap between a query graph and graphs of those documents which surpass a statistically predetermined criterion of likelihood of relevance based on ranking by the integrated processing of the first four system modules.

2. Detailed System Description

In the following system description, emphasis is placed on work accomplished within the last year, plus a basic overview description of each module. The more rudimentary processing details of each module, plus a fuller description of earlier development, are available in the TREC-1 Proceedings (Harman, 1993).

2. A.
Text Structurer

Since human interpretation of text is influenced by expectations regarding the text to be read, discourse level analysis is required for a system to approximate the same level of meaningful representation and matching. DR-LINK's Text Structurer is based on discourse linguistic theory which suggests that texts of a particular type have a predictable