SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
Donna K. Harman
computationally recognizable text characteristics to be used by the Text Structurer to assign a
component label to each sentence. Briefly defined, the six sources of evidence used in the Text
Structurer are:
Likelihood of Component Occurring - The unit of analysis for the first source of evidence is the
sentence; the evidence is based on the observed frequency of each component in our coded sample set.
Order of Components - This source of evidence relies on the tendency of components to occur in
a particular, relative order determined by calculating across the coded files of the sample
documents, looking not at the content of the individual documents but at the component labels. The
results are contained in two 19 by 19 matrices, one for probability of which component follows
a given component and one for probability of which component precedes a given component.
Lexical Clues - The third source of evidence is a set of one, two and three word phrases for each
component. The set of lexical clues for each component was chosen based on observed frequencies
and distributions. We were looking for words with sufficient occurrences, statistically skewed
observed frequency of occurrence in a particular component, and semantic indication of the role
or purpose of each component.
Syntactic Sources - We make use of two types of syntactic evidence: 1) typical sentence length
as measured in average number of words per sentence for each component; 2) individual
part-of-speech distribution based on the output of the part-of-speech tagging of each document, using
POST, a part-of-speech tagger loaned to us by BBN (Meteer et al., 1991). This evidence helps to
recognize those components which, because of their nature, tend to have a disproportionate
percentage of words of a particular part of speech.
Tense Distribution - Some components, as might be expected by their name alone, tend to
contain verbs of a particular tense more than verbs of other tenses. For example, DEFINITION
sentences seldom contain past tense, whereas the predominant tense in HISTORY and PREVIOUS
EVENT sentences is the past tense, based on POST tags.
Continuation Clues - The sixth and final source of evidence is based on the conjunctive
relations suggested in Halliday and Hasan's Cohesion Theory (1976). The continuation clues are
lexical clues which occur in a sentence-initial position and which were observed in our coded
sample data to predictably indicate either that the current sentence continues the same
component as the prior sentence or that there is a change in the component.
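
As an illustration of the Order of Components evidence described above, the following is a minimal sketch of how the two matrices could be estimated from coded label sequences. It is not the authors' implementation: the function name order_matrices, the three-label inventory, and the two toy sequences are assumptions made for the example, whereas the actual system uses 19 components and the coded sample documents.

```python
def order_matrices(label_sequences, labels):
    """Estimate 'follows' and 'precedes' probabilities from label sequences.

    follows[i][j]  approximates P(next component is j | current component is i)
    precedes[i][j] approximates P(previous component is j | current component is i)
    Only the component labels are used, never the sentence content.
    """
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)

    # counts[a][b] = number of times component b immediately follows component a
    counts = [[0] * n for _ in range(n)]
    for sequence in label_sequences:
        for prev, nxt in zip(sequence, sequence[1:]):
            counts[index[prev]][index[nxt]] += 1

    follows = [[0.0] * n for _ in range(n)]
    precedes = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_total = sum(counts[i])                       # bigrams starting at i
        col_total = sum(counts[r][i] for r in range(n))  # bigrams ending at i
        for j in range(n):
            if row_total:
                follows[i][j] = counts[i][j] / row_total
            if col_total:
                precedes[i][j] = counts[j][i] / col_total
    return follows, precedes


# Hypothetical label inventory and coded files (assumptions for illustration).
LABELS = ["DEFINITION", "HISTORY", "PREVIOUS EVENT"]
coded_files = [
    ["DEFINITION", "HISTORY", "HISTORY"],
    ["DEFINITION", "PREVIOUS EVENT", "HISTORY"],
]

follows, precedes = order_matrices(coded_files, LABELS)
print("P(next | DEFINITION):", dict(zip(LABELS, follows[0])))
print("P(previous | HISTORY):", dict(zip(LABELS, precedes[1])))
```

Rows for components that never start (or never end) an adjacent pair are left at zero in this sketch; a real estimate from sparse data would probably need some form of smoothing.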
These evidence sources for instantiating a discourse-level model of the newspaper text type
have been incorporated in the Text Structurer, which evaluates each sentence of an input
newspaper article against these six evidence sources for the purpose of assigning a text-level
label to each sentence. The implementation uses the Dempster-Shafer Theory of Evidence
Combination (Shafer, 1976) to coordinate information from the very complex matrices of
statistical values for the various evidence sources which were generated from the intellectual
analysis of the sample of 149 WSJ articles (Liddy, Paik, McVearry & Yu, in press).
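
For readers unfamiliar with Dempster-Shafer combination, the sketch below shows the general rule for combining two mass functions over sets of hypotheses. It illustrates the theory only and should not be read as the Text Structurer's code: the function name combine, the two example evidence sources, and all mass values are hypothetical, and the paper does not state how the system structures its mass functions.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass, with the
    masses summing to 1. Mass falling on empty intersections (conflict) is
    discarded and the remainder is renormalized.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b
    if not combined:
        raise ValueError("sources are in total conflict; combination undefined")
    return {hyp: mass / (1.0 - conflict) for hyp, mass in combined.items()}


# Hypothetical frame of discernment: three of the component labels.
FRAME = frozenset({"DEFINITION", "HISTORY", "PREVIOUS EVENT"})

# Assumed tense evidence: a past-tense sentence lends support to the set
# {HISTORY, PREVIOUS EVENT}; the rest of the mass stays uncommitted.
tense = {frozenset({"HISTORY", "PREVIOUS EVENT"}): 0.6, FRAME: 0.4}

# Assumed lexical-clue evidence: a clue phrase lends support to HISTORY alone.
lexical = {frozenset({"HISTORY"}): 0.5, FRAME: 0.5}

result = combine(tense, lexical)
for hypothesis, mass in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 3))
```

Because the rule is associative and commutative, further evidence sources can be folded in by calling combine repeatedly on the running result.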
Operationally within DR-LINK, each document is processed a sentence at a time and each source
of evidence assigns a number between 0 and 1 to indicate the degree of support that evidence
source provides to the belief that a sentence is of a particular news-text component. Then, a
simple supporting function for each component is computed and the component with the greatest