SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
Donna K. Harman
the routing situation, each topic statement SFC vector is compared to the incoming document SFC
vectors and the documents are then ranked according to similarity to the topic statement SFC
vector. Either a predetermined or adjustable criterion can be used to select those documents
whose SFC vectors exhibit the required degree of similarity to the topic statement SFC
vector. This set is then passed to later system components for more refined representation and
matching.
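
The routing step described above can be sketched as follows. This is a minimal illustration, not the DR-LINK implementation: the text does not name the similarity measure, so cosine similarity is an assumption here, and the function names, the four-element toy vectors, and the 0.5 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros).

    Assumption: the source does not say which similarity measure is used.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def route(topic_vec, doc_vecs, threshold=0.5):
    """Rank documents by similarity to the topic vector, then keep
    those whose similarity meets the (adjustable) threshold."""
    scored = [(doc_id, cosine(topic_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in scored if score >= threshold]


# Toy SFC-style vectors: one weight per subject field code (hypothetical values).
topic = [0.6, 0.3, 0.1, 0.0]
docs = {
    "doc_a": [0.5, 0.4, 0.1, 0.0],
    "doc_b": [0.0, 0.1, 0.2, 0.7],
    "doc_c": [0.3, 0.3, 0.2, 0.2],
}

passed = route(topic, docs, threshold=0.5)
passed_ids = [doc_id for doc_id, _ in passed]
```

Only the documents in `passed` would be handed to the later, more expensive components; raising or lowering `threshold` corresponds to the adjustable criterion mentioned in the text.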
For use with retrospective or ad hoc queries, the SFC vectors are clustered using Ward's
agglomerative clustering algorithm (Ward, 1963) to form classes in the document database. For
retrieval, queries are likewise represented as SFC vectors and then matched to the prototype
SFC vector of each cluster in the database. Clusters whose prototype SFC vectors exhibit a
predetermined criterion of similarity to the query SFC vector are passed on to other system
components for more computationally expensive representation and matching (Liddy, Paik, &
Woelfel, 1992).
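
A small sketch of this cluster-then-match idea follows. It is a naive, quadratic illustration of Ward's merge criterion rather than the system's actual code. Several points are assumptions not stated in the text: the cluster centroid is used as the "prototype" vector, cosine similarity is used for query matching, and the data, the number of clusters, and the 0.5 threshold are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity (assumed measure; 0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ward_cluster(vectors, k):
    """Agglomerative clustering that repeatedly merges the pair of clusters
    whose merge least increases the within-cluster sum of squares
    (Ward's criterion), stopping when k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                ca = centroid([vectors[m] for m in a])
                cb = centroid([vectors[m] for m in b])
                cost = len(a) * len(b) / (len(a) + len(b)) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy document vectors (hypothetical values).
doc_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]
clusters = ward_cluster(doc_vectors, k=2)

# Match a query vector against each cluster's prototype (assumed: the centroid).
query = [0.7, 0.3, 0.0]
selected = [
    members
    for members in clusters
    if cosine(query, centroid([doc_vectors[m] for m in members])) >= 0.5
]
```

Here `selected` holds the member indices of the clusters that would be passed on for further processing. A production system would more likely use an optimized library routine for Ward linkage than this pairwise loop.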
Text Structurer
The purpose of the Text Structuring module in DR-LINK is to delineate the discourse-level
organization of each document's contents, so that the components most likely to contain the
type of information suggested by the topic statement can be selected for
higher weighting. For example, in newspaper texts, opinions will be found in EVALUATION
components, basic facts of the news story will be found in MAIN EVENT components, and
predictions will be found in EXPECTATION components. The Text Structurer produces an
enriched representation of each document by decomposing it into these smaller, conceptually
labelled components. In parallel, the Topic Statement Processor evaluates each topic statement
to determine if there is an indication that a particular component in the documents should be
more highly weighted when matched to the topic statement representation. For example, topic
statement indicator-terms such as predict or anticipate or proposed reveal that the time frame
of the event being searched for must be in the future, in order for the document to be relevant.
Therefore, documents in which this event is reported in a piece of text which has been marked
by the Text Structurer as being either EXPECTATION or MAIN, FUTURE would be ranked more
highly than those in which this event is reported in a different component.
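
The interaction between the Topic Statement Processor and the Text Structurer can be illustrated with a short sketch. It assumes documents have already been split into sentences labelled with components. The indicator-term table, the simple word-overlap matching, the lack of stemming, and the boost factor of 2.0 are all hypothetical simplifications; only the indicator terms and the EXPECTATION and MAIN, FUTURE labels come from the text.

```python
# Hypothetical table: indicator terms in a topic statement mapped to the
# components that should receive a higher weight during matching.
FUTURE_COMPONENTS = {"EXPECTATION", "MAIN, FUTURE"}
INDICATOR_COMPONENTS = {
    "predict": FUTURE_COMPONENTS,
    "anticipate": FUTURE_COMPONENTS,
    "proposed": FUTURE_COMPONENTS,
}


def tokenize(text):
    """Lowercased word set with surrounding punctuation stripped (no stemming)."""
    return {w.strip(".,;:()").lower() for w in text.split()}


def preferred_components(topic_statement):
    """Collect the components suggested by any indicator term in the topic."""
    words = tokenize(topic_statement)
    preferred = set()
    for term, components in INDICATOR_COMPONENTS.items():
        if term in words:
            preferred |= components
    return preferred


def score_document(labelled_sentences, query_terms, preferred, boost=2.0):
    """Count query-term matches per sentence, multiplying matches that fall
    in a preferred component by `boost` (an assumed weighting scheme)."""
    score = 0.0
    for component, sentence in labelled_sentences:
        matches = len(tokenize(sentence) & query_terms)
        weight = boost if component in preferred else 1.0
        score += weight * matches
    return score


topic = "Document must predict a merger between two airlines"
query_terms = {"merger", "airlines"}
preferred = preferred_components(topic)

doc_future = [
    ("EXPECTATION", "Analysts expect a merger of the two airlines next year."),
    ("HISTORY", "The company was founded in 1950."),
]
doc_past = [
    ("HISTORY", "A merger of the two airlines failed in 1985."),
]

score_future = score_document(doc_future, query_terms, preferred)
score_past = score_document(doc_past, query_terms, preferred)
```

Both documents mention the same query terms, but the one reporting the event in an EXPECTATION component receives the higher score, which is the ranking behaviour described above.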
Operationally, DR-LINK evaluates each sentence in the input text, comparing it to the known
characteristics of the prototypical sentence of each component of the text-type model, and then
assigns a component label to the sentence. For the newspaper text-type model, we took as a
starting point the hierarchical newspaper text model proposed by van Dijk (1988). With this
as a preliminary model, several iterations of coding a sample of 149 randomly chosen Wall
Street Journal articles from 1987-1988 resulted in a revised News Schema which organized
van Dijk's terminal node categories according to a more temporally oriented perspective. The
News Schema Components account for all the text in the sample of articles. The components are:
CIRCUMSTANCE, CONSEQUENCE, CREDENTIALS, DEFINITION, ERROR, EVALUATION,
EXPECTATION, HISTORY, LEAD, MAIN EVENT, NO COMMENT, PREVIOUS EVENT, REFERENCES, and
VERBAL REACTION.
Manually coding the sample also suggested to us that, in decomposing texts intellectually,
we were in fact relying on six different types of linguistic
information to make our decisions. The data from the sample set that could serve as
raw data for these evidence sources were then analyzed statistically and translated into