SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection chapter E. Liddy S. Myaeng National Institute of Standards and Technology Donna K. Harman

which sub-divides a text into its discourse-level segments in order to focus later matching to the appropriate discourse component in response to particular types of information need. All of the structured texts, with the appropriate components highlighted, are passed to the Relation-Concept Detector, whose purpose is to raise the level at which we do matching from a keyword or key-phrase level to a more conceptual level by expanding terms in the topic statement to all terms which have been shown to be 'substitutable' for them, and then by extracting semantic relations between concepts from both documents and topic statements. This component produces concept-relation-concept triples which are passed to the Conceptual Graph Generator, which converts these triples into the CG formalism (Sowa, 1984). The resultant CGs are passed to the Conceptual Graph Matcher, which measures the degree to which a particular topic statement CG and candidate document CGs share a common structure, and ranks the documents accordingly.

The five modules in DR-LINK have well-specified interfaces, making it possible for some of the modules to be re-combined, when appropriate, in a different order for a more advantageous flow of processing. For example, the Subject Field Coder can produce vectors for any-size unit of text (e.g. a sentence, a paragraph, a discourse-level text-type component, or the full document).
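The unit-independence of the Subject Field Coder described above can be illustrated with a minimal sketch. It is not DR-LINK's actual coder: the toy lexicon, the code names (ECONOMICS, LAW, BUSINESS), the function name sfc_vector, and the choice to normalize code counts to relative frequencies are all assumptions made for illustration. The only point shown is that the same routine accepts a sentence or a full document.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: word -> subject field codes (not DR-LINK's real resource).
TOY_SFC_LEXICON = {
    "bank": ["ECONOMICS"],
    "loan": ["ECONOMICS"],
    "merger": ["ECONOMICS", "BUSINESS"],
    "court": ["LAW"],
    "ruling": ["LAW"],
}


def sfc_vector(text):
    """Return a relative-frequency vector of subject field codes for any unit of text."""
    words = re.findall(r"[a-z]+", text.lower())
    # Count every code attached to every word found in the lexicon.
    counts = Counter(code for w in words for code in TOY_SFC_LEXICON.get(w, []))
    total = sum(counts.values())
    if total == 0:
        return {}  # no coded words in this unit of text
    return {code: n / total for code, n in counts.items()}


# The same function is applied to units of different size.
sentence = "The court issued a ruling on the bank merger."
document = sentence + " Analysts expect the loan market to recover."

sentence_vector = sfc_vector(sentence)
document_vector = sfc_vector(document)
```

Because the function takes plain text and nothing else, it does not matter whether the caller hands it a sentence, a paragraph, a discourse-level component, or the whole document.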
Therefore, the Subject Field Coder can create an SFC representation for the full document before the Text Structurer has decomposed the text into its constituent discourse-level components. Alternatively, it can be run on the document after the Text Structurer has recognized the discourse-level components of a text, and can then produce separate vectors for the differing types of information (e.g. current event, past event, opinion, potential future event) contained in the various discourse-level components (e.g. Main Event, History, Evaluation, Expectation) within a newspaper text. This permits the SFC vector of a particular topic statement to be matched to the representation of just that component whose content is most likely to be appropriate.

Another vital aspect of our approach, evidenced in the various semantic enrichments (e.g. Subject Field Codes, discourse components, concept-relation-concept triples, Conceptual Graphs) added to the basic text, is the attention paid to representation at a deeper than surface level. That is, DR-LINK deals with lexical entities using more conceptually-based syntactic groupings. For example, complex nominals are processed as meaningful multi-word constituents because the combination of terms in a complex nominal conveys a quite different meaning than the constituents would if interpreted individually. In addition, verbs are represented in case-frames so that the other lexical entities in the sentence which perform particular semantic roles in conjunction with the verb are represented according to these semantic roles. Also, the very rich semantic data (e.g.
location, purpose, nationality) that is conveyed in the formulaic, appositional phrases typically accompanying proper nouns is represented in such a way that the semantic relations implicitly conveyed in the appositions are explicitly available for more refined matching and for the creation of CGs which contain this relational information. In each of these three examples, relations are important in that they contextually bind concepts which otherwise would be treated as if they were independent of each other. To accomplish this task, given a text database, DR-LINK extracts important relations by relying on relation-revealing formulae (RRF), which are patterns of linguistic (lexical, syntactic, and semantic) clues by which particular relations are detected.

2. Detailed System Description

Since our system is modular in design, with well-defined boundaries between the various