SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
Donna K. Harman
[agree] -
    (A) -> [country: *1 Venezuela]
    (A) -> [creditor_bank: *2]
    (AT) -> [restructure] -
        (A) -> [country: *1 Venezuela]
        (A) -> [creditor_bank: *2]
        (P) -> [debt] -
            (ME) -> [money]
            (CH) -> [foreign].
where *1 and *2 indicate that the nodes with the same number represent the same concept.
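The linear form above can be held in memory as a set of concept nodes joined by labeled relation edges. The following is a minimal sketch, not the DR-LINK implementation: the class names `Concept` and `Edge` are hypothetical, the relation abbreviations are kept exactly as printed, and the nesting (a concept followed by "-" owns the relations listed under it) is our reading of the example. Nodes that carry the same *n marker are modeled as one shared object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    """A concept node: a type label, an optional referent, an optional *n marker."""
    type_label: str
    referent: Optional[str] = None
    coref: Optional[int] = None


@dataclass(frozen=True)
class Edge:
    """A relation node linking a source concept to a target concept."""
    source: Concept
    relation: str
    target: Concept


# Concept nodes from the example; *1 and *2 become single shared objects.
venezuela = Concept("country", "Venezuela", coref=1)
bank = Concept("creditor_bank", coref=2)
agree = Concept("agree")
restructure = Concept("restructure")
debt = Concept("debt")
money = Concept("money")
foreign = Concept("foreign")

# Assumed nesting: [agree] owns (A), (A), (AT); [restructure] owns (A), (A), (P);
# [debt] owns (ME), (CH).
graph = [
    Edge(agree, "A", venezuela),
    Edge(agree, "A", bank),
    Edge(agree, "AT", restructure),
    Edge(restructure, "A", venezuela),
    Edge(restructure, "A", bank),
    Edge(restructure, "P", debt),
    Edge(debt, "ME", money),
    Edge(debt, "CH", foreign),
]


def distinct_concepts(edges):
    """Collect the distinct concept nodes; coreferent nodes are counted once."""
    seen = set()
    for edge in edges:
        seen.add(edge.source)
        seen.add(edge.target)
    return seen
```

Because the two occurrences of [country: *1 Venezuela] resolve to the same object, the eight relations connect seven distinct concepts rather than nine.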
While this process of extracting relations and constructing CGs is applied both to documents and
topic statements, the latter require additional processing to capture unique features of
information needs often found in the topic statement. Accordingly, CGs generated from topic
statements have such additional features as importance weights on concept and relation nodes and
ways of indicating whether an instantiation of a concept must exist in relevant documents. This
specialized processing, in comparison with document processing, is accomplished by treating
topic statements as a sub-language and building a model for them. For example, some
information on weights is revealed by phrases like "... optional" and "... must exist ...", whereas
the need for an instantiation of a concept is indicated by phrases like "Identification of the
company must be included".
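As an illustration of how such cue phrases might be mapped onto topic-statement features, the sketch below scans a sentence for the phrases quoted above and sets a weight or an instantiation flag on a concept. It is a toy stand-in for the sub-language model, which is not specified here: the names `TopicConcept`, `CUE_RULES` and `annotate`, the regular expressions, and the numeric weights 0.5 and 2.0 are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class TopicConcept:
    """A topic-statement concept with the extra features described in the text."""
    label: str
    weight: float = 1.0             # importance weight (default value is assumed)
    must_instantiate: bool = False  # whether a specific instance must appear


# Hypothetical cue rules: (pattern, feature, value). The numeric values are
# illustrative only and do not come from the paper.
CUE_RULES = [
    (re.compile(r"\boptional\b", re.IGNORECASE), "weight", 0.5),
    (re.compile(r"\bmust exist\b", re.IGNORECASE), "weight", 2.0),
    (re.compile(r"\bidentification of\b.*\bmust be included\b", re.IGNORECASE),
     "instantiate", True),
]


def annotate(label, sentence):
    """Return a TopicConcept for `label`, adjusted by any cue phrase in `sentence`."""
    concept = TopicConcept(label)
    for pattern, feature, value in CUE_RULES:
        if pattern.search(sentence):
            if feature == "weight":
                concept.weight = value
            else:
                concept.must_instantiate = True
    return concept
```

For instance, `annotate("company", "Identification of the company must be included")` leaves the weight at its default and sets `must_instantiate`, so a matcher could require a named company rather than the bare concept.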
While CG theory provides a framework in which IR entities can be represented adequately, much
of the representation task involves intellectual analysis of topic statements and documents so
that we capture and store concepts and relations that are ontologically adequate for IR. For
example, it is essential to choose, organize and classify a restricted set of relations in such a
way that they facilitate matching and inferencing with two CGs representing a document and a
topic statement. The efficacy of the relations we have chosen will be determined with full
experiments and failure analyses.
Conceptual Graph Matcher
The main function of the CG matching component is to determine the degree to which two CGs
share a common structure and to score each document with respect to the topic statement. This is
accomplished by employing techniques necessary to model plausible inferences with CGs
(Myaeng & Khoo, 1992). In order to allow for approximate matching between concept nodes or
relation nodes, we have developed a matrix that represents similarities among relations being
used in CG representation, as well as some concepts. Our goal is to enhance both precision and
recall. By exploiting the structure of the CGs and the nature of the relations, we attempt to meet
the specific information needs in topic statements. By allowing for partial matching (e.g.
between `debt' and `bank debt') and inexact matching (e.g. between `debt' and `loan' and between
`CO-AGENT' and `AGENT') at the node level, we can increase recall.
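The node-level matching described above can be sketched as a lookup in a similarity table, with a fallback for partial matches. This is a simplified illustration and not the matcher itself: the table entries, the scores 0.8, 0.7 and 0.6, the function names, and the reduction of a CG to (concept, relation, concept) triples are all assumptions. Scoring a document by the best-matching triple for each topic triple is a crude stand-in for measuring shared graph structure.

```python
# Hypothetical similarity tables; keys are unordered label pairs.
CONCEPT_SIM = {frozenset(("debt", "loan")): 0.8}
RELATION_SIM = {frozenset(("CO-AGENT", "AGENT")): 0.7}
PARTIAL_SCORE = 0.6  # assumed score when one label's words are contained in the other's


def label_similarity(a, b, table):
    """Score two labels: exact match, table (inexact) match, partial match, or 0."""
    if a == b:
        return 1.0
    key = frozenset((a, b))
    if key in table:
        return table[key]
    words_a, words_b = set(a.split()), set(b.split())
    if words_a and words_b and (words_a <= words_b or words_b <= words_a):
        return PARTIAL_SCORE
    return 0.0


def triple_similarity(topic_triple, doc_triple):
    """Multiply the similarities of the two concepts and the relation between them."""
    (ts, tr, tt), (ds, dr, dt) = topic_triple, doc_triple
    return (label_similarity(ts, ds, CONCEPT_SIM)
            * label_similarity(tr, dr, RELATION_SIM)
            * label_similarity(tt, dt, CONCEPT_SIM))


def score_document(topic_triples, doc_triples):
    """Average, over topic triples, of the best similarity found in the document."""
    if not topic_triples:
        return 0.0
    total = sum(
        max((triple_similarity(t, d) for d in doc_triples), default=0.0)
        for t in topic_triples
    )
    return total / len(topic_triples)
```

Under these assumed values, a topic triple (restructure, AGENT, debt) still receives a nonzero score against a document triple (restructure, CO-AGENT, loan), whereas exact matching would score it zero. That is the sense in which partial and inexact matching can raise recall.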
For CG matching, we first developed and implemented a base algorithm that is flexible enough to
allow for various types of partial matching between two CGs and ran experiments to test its
practicality (Myaeng & Lopez-Lopez, 1992). While the general subgraph isomorphism
problem is known to be computationally intractable, matching CGs containing conceptual
information (i.e. labels on nodes) appears to be practical. With improved understanding of the