NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1). TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection. E. Liddy, S. Myaeng. National Institute of Standards and Technology. Donna K. Harman.

computationally recognizable text characteristics to be used by the Text Structurer to assign a component label to each sentence. Briefly defined, the six sources of evidence used in the Text Structurer are:

Likelihood of Component Occurring - The unit of analysis for the first source of evidence is the sentence; the evidence is based on the observed frequency of each component in our coded sample set.

Order of Components - This source of evidence relies on the tendency of components to occur in a particular relative order. The order was determined by calculating across the coded files of the sample documents, looking not at the content of the individual documents but at the component labels. The results are contained in two 19 by 19 matrices, one giving the probability that each component follows a given component and one giving the probability that each component precedes a given component.

Lexical Clues - The third source of evidence is a set of one-, two-, and three-word phrases for each component. The set of lexical clues for each component was chosen based on observed frequencies and distributions. We were looking for words with sufficient occurrences, a statistically skewed observed frequency of occurrence in a particular component, and a semantic indication of the role or purpose of each component.

Syntactic Sources - We make use of two types of syntactic evidence: 1) typical sentence length, measured as the average number of words per sentence for each component; 2) individual part-of-speech distribution, based on the output of the part-of-speech tagging of each document using POST, a part-of-speech tagger loaned to us by BBN (Meteer et al., 1991).
This evidence helps to recognize those components which, because of their nature, tend to have a disproportionate percentage of words of a particular part of speech.

Tense Distribution - Some components, as might be expected from their names alone, tend to contain verbs of a particular tense more than verbs of other tenses. For example, based on POST tags, DEFINITION sentences seldom contain past tense, whereas the predominant tense in HISTORY and PREVIOUS EVENT sentences is the past tense.

Continuation Clues - The sixth and final source of evidence is based on the conjunctive relations suggested in Halliday and Hasan's Cohesion Theory (1976). The continuation clues are lexical clues which occur in sentence-initial position and which were observed in our coded sample data to predictably indicate either that the current sentence continues the same component as the prior sentence or that there is a change in the component.

These evidence sources for instantiating a discourse-level model of newspaper text have been incorporated in the Text Structurer, which evaluates each sentence of an input newspaper article against the six evidence sources in order to assign a text-level label to each sentence. The implementation uses the Dempster-Shafer Theory of Evidence Combination (Shafer, 1976) to coordinate information from the very complex matrices of statistical values for the various evidence sources, which were generated from the intellectual analysis of the sample of 149 WSJ articles (Liddy, Paik, McVearry & Yu, In press).

Operationally within DR-LINK, each document is processed a sentence at a time, and each source of evidence assigns a number between 0 and 1 to indicate the degree of support that the evidence source provides to the belief that a sentence is of a particular news-text component. Then, a simple supporting function for each component is computed and the component with the greatest
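
The per-sentence scoring described in this passage can be illustrated with a minimal sketch. It is not the DR-LINK implementation. It assumes that each evidence source's score for a component is treated as a simple support function focused on that component, in which case Dempster's rule for same-focus simple support functions reduces to 1 - (1 - s_1)(1 - s_2)...(1 - s_n). The function names and the numeric scores are hypothetical; only the component names HISTORY and DEFINITION come from the text.

```python
def combined_support(scores):
    """Combine degrees of support (each in [0, 1]) for ONE component.

    Assumption: every score is a simple support function with the same
    focus, so Dempster's rule gives 1 - product(1 - s_i).
    """
    remaining_doubt = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("each degree of support must lie in [0, 1]")
        remaining_doubt *= (1.0 - s)
    return 1.0 - remaining_doubt


def label_sentence(evidence):
    """Pick the component with the greatest combined support.

    `evidence` maps a component name to the list of scores that the
    evidence sources assigned to it for the current sentence.
    """
    supports = {name: combined_support(s) for name, s in evidence.items()}
    best = max(supports, key=supports.get)
    return best, supports


# Hypothetical scores from six evidence sources for one sentence.
evidence = {
    "HISTORY":    [0.2, 0.5, 0.0, 0.1, 0.6, 0.0],
    "DEFINITION": [0.1, 0.1, 0.3, 0.1, 0.0, 0.0],
}
best, supports = label_sentence(evidence)
print(best, supports)
```

In this sketch a source that offers no support (a score of 0) leaves the combined value unchanged, while several moderate scores reinforce one another. This is the qualitative behaviour expected when combining independent evidence.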