SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
DR-LINK: A System Update for TREC-2
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
D. K. Harman
DR-LINK: A System Update for TREC-2
Elizabeth D. Liddy
Sung H. Myaeng
School of Information Studies
Syracuse University
Syracuse, New York 13244-4100
liddy@mailbox.syr.edu; slunyaeng@mailbox.syr.edu
1. Overview of DR-LINK's Approach
The theoretical goal underlying the DR-LINK System is to represent and match documents and queries at the various
linguistic levels at which human language conveys meaning. Accordingly, we have developed a modular system
which processes and represents text at the lexical, syntactic, semantic, and discourse levels of language. In concert,
these levels of processing permit DR-LINK to achieve a level of intelligent retrieval beyond more traditional
approaches. In addition, the rich annotations to text produced by DR-LINK are replete with much of the semantics
necessary for document extraction.
The system was planned and developed in a modular fashion and functional modularity has been achieved, while a
full integration of these multiple levels of linguistic processing is within reach. As currently configured, DR-LINK
performs a staged processing of documents, with each module adding a meaningful annotation to the text. For
matching, a Topic Statement undergoes analogous processing to determine its relevancy requirements for documents
at each stage. Among the many benefits of staged processing are: improvements and changes can be easily made
within any module; the contribution of the various stages can be empirically tested by simply turning them on or
off; modules can be re-ordered (as was done within the last six months) in order to utilize document annotations in
various ways; and individual modules can be incorporated in other evolving systems.
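
The staged design described above can be illustrated with a minimal Python sketch. It is not DR-LINK's implementation: the stage functions, annotation keys, and the `run_pipeline` helper are hypothetical placeholders, chosen only to show how stages that each add one annotation can be switched off or re-ordered without touching one another.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass
class Document:
    text: str
    # Each stage stores its output under its own key (hypothetical layout).
    annotations: Dict[str, object] = field(default_factory=dict)


Stage = Tuple[str, Callable[[Document], None]]


def structure_text(doc: Document) -> None:
    # Placeholder stage: split into sentences and attach a dummy label.
    sentences = [s.strip() for s in doc.text.split(".") if s.strip()]
    doc.annotations["components"] = [(s, "UNLABELED") for s in sentences]


def code_subject_fields(doc: Document) -> None:
    # Placeholder stage: word counts stand in for a subject-field vector.
    counts: Dict[str, int] = {}
    for word in doc.text.lower().replace(".", " ").split():
        counts[word] = counts.get(word, 0) + 1
    doc.annotations["subject_vector"] = counts


def run_pipeline(doc: Document, stages: List[Stage],
                 disabled: FrozenSet[str] = frozenset()) -> Document:
    # Run the stages in the given order, skipping any that are switched off.
    for name, stage in stages:
        if name not in disabled:
            stage(doc)
    return doc


STAGES: List[Stage] = [
    ("text_structurer", structure_text),
    ("subject_field_coder", code_subject_fields),
]

text = "Prices rose. Analysts expect cuts."
full = run_pipeline(Document(text), STAGES)
# Switching one stage off leaves the others untouched.
ablated = run_pipeline(Document(text), STAGES,
                       disabled=frozenset({"subject_field_coder"}))
# Re-ordering is just a different list of the same stages.
reordered = run_pipeline(Document(text), list(reversed(STAGES)))
```

Because each placeholder stage writes only its own annotation key, comparing `full` with `ablated` isolates the contribution of a single stage, which is the kind of empirical test the paragraph describes.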
The purpose of each of the processing modules will be briefly introduced here (also see Figure 1) in the order in
which the system is currently run, with fuller explanations provided in the section below: 1) the Text Structurer
labels clauses or sentences with a text-component tag which provides a means for responding to the discourse level
Topic Statement requirements of time, source, intentionality, and state of completion; 2) the Subject Field Coder
provides a subject-based, summary-level vector representation of the content of each text; 3) the Proper Noun
Interpreter and 4) the Complex Nominal Phraser provide precise levels of content representation in the form of
concepts and relations, as well as controlled expansion of group nouns and content-bearing nominal phrases; 5) the
Relation-Concept Detector produces concept-relation-concept triples with a range of semantic relations expressed via
various syntactic classes, e.g. verbs, nominalized verbs, complex nominals, and proper nouns; 6) the Conceptual
Graph Generator combines the triples to form a CG and adds Roget International Thesaurus (RIT) codes to concept
nodes; and 7) the Conceptual Graph Matcher determines the degree of overlap between a query graph and graphs of
those documents which surpass a statistically predetermined criterion of likelihood of relevance based on ranking by
the integrated processing of the first four system modules.
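
The two-step matching in item 7 can be sketched as follows, under loudly stated assumptions. A graph is reduced here to a plain set of concept-relation-concept triples, the overlap measure is simply the fraction of query triples found in the document, and the first-pass scores, cutoff value, and relation labels are invented for illustration. None of this is taken from DR-LINK's actual Conceptual Graph Matcher or its scoring.

```python
from typing import Dict, List, Set, Tuple

# A concept-relation-concept triple, e.g. ("company", "AGENT", "acquire").
Triple = Tuple[str, str, str]


def triple_overlap(query: Set[Triple], doc: Set[Triple]) -> float:
    # Assumed measure: share of the query's triples that the document contains.
    if not query:
        return 0.0
    return len(query & doc) / len(query)


def rerank(query: Set[Triple],
           doc_triples: Dict[str, Set[Triple]],
           first_pass: Dict[str, float],
           cutoff: float) -> List[Tuple[str, float]]:
    # Step 1: keep only documents whose first-pass score reaches the cutoff.
    candidates = [d for d, score in first_pass.items() if score >= cutoff]
    # Step 2: score the surviving documents by triple overlap with the query.
    scored = [(d, triple_overlap(query, doc_triples[d])) for d in candidates]
    # Highest overlap first; ties broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical data: relation labels and scores are illustrative only.
query = {("company", "AGENT", "acquire"), ("acquire", "OBJECT", "firm")}
doc_triples = {
    "d1": {("company", "AGENT", "acquire"), ("acquire", "OBJECT", "firm")},
    "d2": {("company", "AGENT", "acquire"), ("acquire", "OBJECT", "patent")},
    "d3": {("company", "AGENT", "acquire"), ("acquire", "OBJECT", "firm")},
}
first_pass = {"d1": 0.9, "d2": 0.6, "d3": 0.2}

ranking = rerank(query, doc_triples, first_pass, cutoff=0.5)
```

In this toy data `d3` matches the query graph fully but is never graph-matched, because its first-pass score falls below the cutoff. That is the trade-off of restricting the more expensive graph comparison to documents already judged likely to be relevant.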
2. Detailed System Description
In the following system description, emphasis is placed on work accomplished within the last year, plus a basic
overview description of each module. The more rudimentary processing details of each module plus fuller description
of earlier development are available in the TREC-1 Proceedings (Harman, 1993).
2. A. Text Structurer
Since human interpretation of text is influenced by expectations regarding the text to be read, discourse level analysis
is required for a system to approximate the same level of meaningful representation and matching. DR-LINK's Text
Structurer is based on discourse linguistic theory which suggests that texts of a particular type have a predictable