SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
Donna K. Harman
which sub-divides a text into its discourse-level segments in order to focus later matching to
the appropriate discourse component in response to particular types of information need. All of
the structured texts, with the appropriate components highlighted, are passed to the
Relation-Concept Detector, whose purpose is to raise the level at which we do matching from a
keyword or key-phrase level to a more conceptual level by expanding terms in the topic statement
to all terms which have been shown to be `substitutable' for them, and then by extracting
semantic relations between concepts from both documents and topic statements. This component
produces concept-relation-concept triples which are passed to the Conceptual Graph
Generator which converts these triples into the CG formalism (Sowa, 1984). The resultant
CGs are passed to the Conceptual Graph Matcher, which measures the degree to which a
particular topic statement CG and candidate document CGs share a common structure, and ranks
the documents accordingly.
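
The flow from triples to ranked documents can be illustrated with a minimal sketch. It is not the DR-LINK matcher: it assumes each CG can be approximated by a set of concept-relation-concept triples and uses plain triple overlap as a stand-in for the measurement of shared structure. The names `Triple`, `overlap_score`, and `rank_documents`, the relation labels, and the sample data are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    """One concept-relation-concept triple (field names are illustrative)."""
    head: str
    relation: str
    tail: str


def overlap_score(topic: Set[Triple], document: Set[Triple]) -> float:
    """Fraction of the topic's triples that also occur in the document.

    Assumption: exact triple overlap stands in for CG structure matching.
    """
    if not topic:
        return 0.0
    return len(topic & document) / len(topic)


def rank_documents(
    topic: Set[Triple], documents: Dict[str, Set[Triple]]
) -> List[Tuple[str, float]]:
    """Score every document against the topic and sort by descending score."""
    scored = [(doc_id, overlap_score(topic, triples))
              for doc_id, triples in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical triples for one topic statement and three documents.
topic = {
    Triple("company", "AGENT", "acquire"),
    Triple("acquire", "OBJECT", "subsidiary"),
}
documents = {
    "doc_a": {
        Triple("company", "AGENT", "acquire"),
        Triple("acquire", "OBJECT", "subsidiary"),
    },
    "doc_b": {
        Triple("company", "AGENT", "acquire"),
        Triple("acquire", "LOCATION", "Europe"),
    },
    "doc_c": set(),
}

ranking = rank_documents(topic, documents)
```

In this toy data, `doc_a` shares both topic triples, `doc_b` shares one, and `doc_c` shares none, so the ranking places them in that order.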
The five modules in DR-LINK have well-specified interfaces, making it possible for some of the
modules to be recombined, when appropriate, in a different order for a more advantageous flow
of processing. For example, the Subject Field Coder can produce vectors for a unit of text of
any size (e.g. a sentence, a paragraph, a discourse-level text-type component, or the full
document). Therefore, the Subject Field Coder can create an SFC representation for the full
document before the text has been decomposed into its constituent discourse-level components
by the Text Structurer, or the Subject Field Coder can be run on the document after the Text
Structurer has recognized the discourse-level components of a text, and can therefore produce
separate vectors for the differing types of information (e.g. current event, past event, opinion,
potential future event) contained in the various discourse-level components (e.g. Main Event,
History, Evaluation, Expectation) within a newspaper text. This permits an SFC-vector of a
particular topic statement to be matched to the representation for just that component whose
content is most likely to be appropriate.
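
The difference between matching against the full document and matching against a single discourse component can be shown with a small sketch. It assumes that an SFC vector is a normalized frequency vector over subject field codes and that cosine similarity is the comparison measure. Neither assumption is stated above, and the code labels (`ECON`, `LAW`, `MIL`, `POL`) and function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def sfc_vector(codes: List[str]) -> Dict[str, float]:
    """Normalized frequency vector over subject field codes."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(weight * v.get(code, 0.0) for code, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical codes already assigned to the words of each component.
document = {
    "Main Event": ["ECON", "ECON", "LAW"],
    "History": ["MIL", "POL"],
    "Expectation": ["ECON", "POL"],
}
topic_codes = ["ECON", "LAW"]

# One vector per discourse component (Subject Field Coder run after the
# Text Structurer) and one for the whole text (run before it).
component_vectors = {name: sfc_vector(codes) for name, codes in document.items()}
full_vector = sfc_vector([c for codes in document.values() for c in codes])
topic_vector = sfc_vector(topic_codes)

main_event_score = cosine(topic_vector, component_vectors["Main Event"])
history_score = cosine(topic_vector, component_vectors["History"])
full_document_score = cosine(topic_vector, full_vector)
```

With this toy data, the topic vector is closer to the Main Event vector than to the full-document vector, because the codes contributed by the History and Expectation components dilute the full-document representation.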
Another vital aspect of our approach, which is evidenced in the various semantic enrichments
(e.g. Subject Field Codes, discourse components, concept-relation-concept triples, Conceptual
Graphs) added to the basic text, is the attention paid to representation at a deeper than
surface level. That is, DR-LINK deals with lexical entities using more conceptually-based
syntactic groupings. For example, complex nominals will be processed as meaningful multi-word
constituents because the combination of terms in a complex nominal conveys a quite different
meaning than the constituents would if each were interpreted individually. In
addition, verbs are represented in case-frames so that the other lexical entities in the sentence
which perform particular semantic roles in conjunction with the verb are represented
according to these semantic roles. Also, the very rich semantic data (e.g. location, purpose,
nationality) conveyed in the formulaic appositional phrases that typically accompany
proper nouns is represented in such a way that the semantic relations implicitly conveyed in
the appositions are explicitly available for more refined matching and the creation of CGs which
contain this relational information. In each of these three examples, relations are important in
that they contextually bind concepts which otherwise would be treated as if they were
independent of each other. To accomplish this task, given a text database, DR-LINK extracts
important relations by relying on relation-revealing formulae (RRF) that are patterns of
linguistic (lexical, syntactic, and semantic) clues by which particular relations are detected.
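
As a rough illustration of how a relation-revealing formula might make an apposition's implicit relations explicit, the sketch below uses a single regular expression. It is only a toy under strong assumptions: an actual RRF combines lexical, syntactic, and semantic clues, whereas this pattern relies on capitalization and punctuation alone. The pattern, the relation labels `ROLE` and `AFFILIATION`, and the sample sentence are all hypothetical.

```python
import re
from typing import List, Tuple

# Toy formula for appositions of the form
#   "<Proper Name>, the <role> of <Organization>,"
# Assumption: proper nouns are runs of capitalized words.
APPOSITION_RRF = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*), "
    r"(?:the |a |an )?(?P<role>[a-z]+(?: [a-z]+)*?) of "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*),"
)


def extract_triples(sentence: str) -> List[Tuple[str, str, str]]:
    """Return concept-relation-concept triples found by the toy formula."""
    triples = []
    for match in APPOSITION_RRF.finditer(sentence):
        name = match.group("name")
        triples.append((name, "ROLE", match.group("role")))
        triples.append((name, "AFFILIATION", match.group("org")))
    return triples


sentence = "John Smith, the president of Acme Corporation, announced the merger."
triples = extract_triples(sentence)
no_match = extract_triples("The merger was announced yesterday.")
```

Here the apposition yields two triples that bind the proper noun to its role and to its affiliation, while the second sentence, which contains no apposition, yields none.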
2. Detailed System Description
Since our system is modular in design, with well-defined boundaries between the various