SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
DR-LINK: A System Update for TREC-2
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
D. K. Harman
text-level structure which is used by both producers and readers of that text-type as an indication of how and where
certain information endemic to that text-type will be conveyed. We have implemented a Text Structurer for the
newspaper text-type, which produces an annotated version of a news article in which each clause or sentence is tagged
for the specific slot it instantiates in the news-text model, an extension of van Dijk's earlier model (1988). The
structural annotations are used to respond more precisely to information needs expressed in Topic Statements, where
some aspects of relevancy can only be met by understanding a Topic Statement's discourse requirements. For
example, Topic Statement 75 states that:
Document will identify an instance in which automation has clearly paid off or conversely,
has failed.
which contains the implicit discourse requirement that relevant instances should occur in the CONSEQUENCE
component of a news article. DR-LINK extracts this requirement from the Topic Statement and assigns a
similarity value for the discourse level of relevance only to those documents in which the sought information occurs in a
CONSEQUENCE component.
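The gating behaviour described above can be illustrated with a minimal sketch. It assumes a document is available as a list of (component label, clause text) pairs and that the sought information is approximated by a set of query terms. The function name, the data layout, and the term-overlap score are hypothetical and are not DR-LINK's actual implementation.

```python
def discourse_similarity(tagged_clauses, required_component, query_terms):
    """Score a document at the discourse level, or return None.

    A score is produced only when at least one query term occurs inside a
    clause tagged with the required component (an illustrative assumption
    about how "the sought information occurs in" a component is tested).
    """
    terms = {t.lower() for t in query_terms}
    best = 0.0
    for label, text in tagged_clauses:
        if label != required_component:
            continue  # clauses in other components never contribute
        words = {w.strip(".,;:").lower() for w in text.split()}
        overlap = len(terms & words) / len(terms) if terms else 0.0
        best = max(best, overlap)
    return best if best > 0 else None


# Hypothetical tagged document for a topic that requires CONSEQUENCE.
doc = [
    ("MAIN EVENT", "The plant installed robots."),
    ("CONSEQUENCE", "Automation cut costs sharply."),
]
score = discourse_similarity(doc, "CONSEQUENCE", ["automation", "costs"])
no_score = discourse_similarity(doc, "CONSEQUENCE", ["robots"])
```

Here the second call yields no discourse-level value, because the only mention of robots falls outside the CONSEQUENCE component.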
The current news-text model consists of thirty-eight recognizable components of information observed in a large
sample of training texts (e.g. MAIN EVENT, VERBAL REACTION, EVALUATION, FUTURE
CONSEQUENCE, PREVIOUS EVENT). The Text Structurer assigns these component labels to document clauses
or sentences on the basis of lexical clues learned from text, which now comprise a special lexicon. We considered
expanding the lexicon via available lexical resources such as [OCRerr] or [OCRerr], but our
analysis of these resources suggested that they do not capture the particularities of lexical usage in the sublanguage
of newspaper reporting.
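As a rough illustration of labelling by lexical clues, the sketch below looks up each token of a clause in a small clue lexicon and returns the component with the most hits. The clue words, their assigned components, and the voting rule are invented for this example; the actual lexicon was learned from training text and is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical clue lexicon (illustrative entries only, not the real one).
CLUE_LEXICON = {
    "said": "VERBAL REACTION",
    "criticized": "EVALUATION",
    "previously": "PREVIOUS EVENT",
}


def label_clause(clause, lexicon=CLUE_LEXICON):
    """Return the component label with the most clue hits, or None."""
    tokens = re.findall(r"[a-z]+", clause.lower())
    votes = Counter(lexicon[t] for t in tokens if t in lexicon)
    if not votes:
        return None  # no clue found; a real system would need a fallback
    return votes.most_common(1)[0][0]


label = label_clause("The minister said the plan was sound.")
```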
The Text Structurer has recently been improved to assign structural tags at the clause level, a refinement which has
corrected most of the anomalies that were observed in earlier tests of the Text Structurer. For example, given the
new clause-level structuring, the following sentence is correctly interpreted as containing both future-oriented
information in the LEAD-FUTURE segment and some nested information regarding a past situation in the
LEAD-HISTORY segment.
<LEAD-FUT> South Korea's trade surplus, <LEAD-HIST> which more than doubled in 1987
to $6.55 billion, </LEAD-HIST> is expected to narrow this year to above $4 billion.
</LEAD-FUT>
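Clause-level annotations of this kind can be separated into their segments with a small stack-based reader. The sketch below assumes paired tags of the form <LEAD-FUT> ... </LEAD-FUT> and <LEAD-HIST> ... </LEAD-HIST>; the exact tag spelling and the function name are assumptions made for illustration.

```python
import re

TAG = re.compile(r"<(/?)([A-Z-]+)>")


def segment_text(annotated):
    """Map each tag to its own text, excluding text of nested segments."""
    segments = {}
    stack = []  # currently open tags, innermost last
    pos = 0
    for m in TAG.finditer(annotated):
        chunk = annotated[pos:m.start()].strip()
        if stack and chunk:
            # Text belongs to the innermost open segment only.
            segments.setdefault(stack[-1], []).append(chunk)
        closing, name = m.group(1), m.group(2)
        if closing:
            if not stack or stack[-1] != name:
                raise ValueError(f"unbalanced closing tag: {name}")
            stack.pop()
        else:
            stack.append(name)
        pos = m.end()
    if stack:
        raise ValueError(f"unclosed tag: {stack[-1]}")
    return {tag: " ".join(parts) for tag, parts in segments.items()}


sentence = (
    "<LEAD-FUT> South Korea's trade surplus, <LEAD-HIST> which more than "
    "doubled in 1987 to $6.55 billion, </LEAD-HIST> is expected to narrow "
    "this year to above $4 billion. </LEAD-FUT>"
)
parts = segment_text(sentence)
```

The future-oriented main clause and the nested historical clause end up under separate keys, which is the distinction the clause-level tagging is meant to preserve.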
We have recently implemented new matching techniques which more fully realize the Text Structurer's potential
contribution to the system's performance. This was achieved as one outcome of a study which greatly increased our
understanding of how text structure requirements in Topic Statements should be used for matching documents to
Topic Statements. Analysis of relevant and non-relevant documents retrieved for a test sample of Topic Statements
indicated that most of the errors in the Text Structurer's matching were not serious errors, but only slight
mismatches in terms of the conceptual definitions of some of the text model's components. This suggested that our
model was overly specific for the task of responding to discourse aspects of information requirements, and that
matching Text Structure needs from a Topic Statement to structured documents called for a more generalized model.
That is, Topic Statement text-structure requirements are not expressed at the same level of specificity at which Text
Structure components are recognizable in documents.
Given this, we reduced the matching complexity via a function that maps the thirty-eight news-text components to
seven meta-components. These are: LEAD-MAIN, HISTORY, FUTURE, CONSEQUENCE, EVALUATION,
ONGOING, and OTHERS. The new approach allows the system to continue to impose the finer-level,
38-component structure on the newspaper articles themselves with excellent precision, but maps this fuller set of text
components to the seven meta-components at the matching stage, as the Topic Statements' text structure
requirements are coded at the meta-component level. Unofficial experimental results indicate that this new scheme
has significantly increased the Text Structurer's contribution to an improved level of precision in the retrieval of
relevant documents.
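The mapping step can be sketched as a simple lookup applied at matching time. The table below covers only a few of the thirty-eight components, and every individual assignment in it, the spellings FUTURE and EVALUATION for two of the meta-components, and the use of OTHERS as a fallback are assumptions made for illustration; the actual mapping function is not given in this section.

```python
META_COMPONENTS = (
    "LEAD-MAIN", "HISTORY", "FUTURE", "CONSEQUENCE",
    "EVALUATION", "ONGOING", "OTHERS",
)

# Partial, assumed fine-to-meta table (the real one has 38 entries).
FINE_TO_META = {
    "MAIN EVENT": "LEAD-MAIN",
    "LEAD-HISTORY": "HISTORY",
    "PREVIOUS EVENT": "HISTORY",
    "LEAD-FUTURE": "FUTURE",
    "CONSEQUENCE": "CONSEQUENCE",
    "EVALUATION": "EVALUATION",
}


def to_meta(fine_label):
    """Map a fine-grained component to a meta-component (OTHERS if unknown)."""
    return FINE_TO_META.get(fine_label, "OTHERS")


def satisfies(document_labels, required_meta):
    """True if any fine-grained label in the document maps to the requirement."""
    if required_meta not in META_COMPONENTS:
        raise ValueError(f"unknown meta-component: {required_meta}")
    return any(to_meta(label) == required_meta for label in document_labels)


# Documents keep their fine-grained tags; only the comparison is coarse.
doc_labels = ["MAIN EVENT", "PREVIOUS EVENT", "LEAD-FUTURE"]
has_history = satisfies(doc_labels, "HISTORY")
has_consequence = satisfies(doc_labels, "CONSEQUENCE")
```

Keeping the lookup separate from the tagging reflects the design described above: the documents retain the finer 38-component structure, and generalization happens only when a Topic Statement requirement is compared against them.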