SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection chapter E. Liddy S. Myaeng National Institute of Standards and Technology Donna K. Harman

the routing situation, each topic statement SFC vector is compared to the incoming document SFC vectors, and the documents are then ranked according to their similarity to the topic statement SFC vector. Either a predetermined or an adjustable criterion can be used to select those documents whose SFC vectors are sufficiently similar to the topic statement SFC vector. This set is then passed to later system components for more refined representation and matching.

For use with retrospective or ad hoc queries, the SFC vectors are clustered using Ward's agglomerative clustering algorithm (Ward, 1963) to form classes in the document database. For retrieval, queries are likewise represented as SFC vectors and then matched to the prototype SFC vector of each cluster in the database. Clusters whose prototype SFC vectors meet a predetermined criterion of similarity to the query SFC vector are passed on to other system components for more computationally expensive representation and matching (Liddy, Paik, & Woelfel, 1992).

Text Structurer

The purpose of the Text Structuring module in DR-LINK is to delineate the discourse-level organization of each document's contents, so that the document components most likely to contain the type of information suggested by the topic statement can be selected for higher weighting. For example, in newspaper texts, opinions will be found in EVALUATION components, basic facts of the news story will be found in MAIN EVENT components, and predictions will be found in EXPECTATION components. The Text Structurer produces an enriched representation of each document by decomposing it into these smaller, conceptually labelled components.
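The idea of weighting matches by discourse component can be illustrated with a minimal sketch. This is not DR-LINK's implementation: the component labels come from the text above, but the `LabelledSentence` type, the `score_document` function, the weight values, and the simple term-overlap matching are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights: the values are assumptions for illustration only,
# not parameters reported for DR-LINK.
PREFERRED_COMPONENT_WEIGHT = 2.0
DEFAULT_WEIGHT = 1.0


@dataclass
class LabelledSentence:
    component: str  # e.g. "MAIN EVENT", "EVALUATION", "EXPECTATION"
    text: str


def score_document(sentences, query_terms, preferred_components):
    """Sum query-term matches per sentence, boosting preferred components."""
    query = {term.lower() for term in query_terms}
    score = 0.0
    for sentence in sentences:
        tokens = set(sentence.text.lower().split())
        matches = len(tokens & query)
        if sentence.component in preferred_components:
            weight = PREFERRED_COMPONENT_WEIGHT
        else:
            weight = DEFAULT_WEIGHT
        score += weight * matches
    return score


# Two toy documents whose sentences already carry component labels.
doc_a = [
    LabelledSentence("MAIN EVENT", "the company announced a merger"),
    LabelledSentence("EXPECTATION", "analysts expect the merger to close next year"),
]
doc_b = [
    LabelledSentence("MAIN EVENT", "the merger closed last week"),
    LabelledSentence("EVALUATION", "critics called the deal risky"),
]

query_terms = ["merger"]
preferred = {"EXPECTATION"}  # a topic asking for predictions

score_a = score_document(doc_a, query_terms, preferred)
score_b = score_document(doc_b, query_terms, preferred)
```

Here `doc_a` scores higher than `doc_b` because its query-term match inside an EXPECTATION component receives the boosted weight, which mirrors the intent of selecting components "for higher weighting" without claiming anything about the actual scoring function.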
In parallel, the Topic Statement Processor evaluates each topic statement to determine whether it indicates that a particular component in the documents should be more highly weighted when matched to the topic statement representation. For example, topic statement indicator terms such as predict, anticipate, or proposed reveal that the time frame of the event being searched for must be in the future for the document to be relevant. Therefore, documents in which this event is reported in a piece of text that the Text Structurer has marked as either EXPECTATION or MAIN, FUTURE would be ranked more highly than those in which the event is reported in a different component.

Operationally, DR-LINK evaluates each sentence in the input text, comparing it to the known characteristics of the prototypical sentence of each component of the text-type model, and then assigns a component label to the sentence.

For the newspaper text-type model, we took as a starting point the hierarchical newspaper text model proposed by van Dijk (1988). With this as a preliminary model, several iterations of coding a sample of 149 randomly chosen Wall Street Journal articles from 1987-1988 resulted in a revised News Schema, which organizes van Dijk's terminal node categories according to a more temporally oriented perspective. The News Schema components account for all the text in the sample of articles. The components are: CIRCUMSTANCE, CONSEQUENCE, CREDENTIALS, DEFINITION, ERROR, EVALUATION, EXPECTATION, HISTORY, LEAD, MAIN EVENT, NO COMMENT, PREVIOUS EVENT, REFERENCES, and VERBAL REACTION.

The process of manually coding the sample also suggested to us that, during our intellectual decomposition of texts, we were in fact relying on six different types of linguistic information to make our decisions. The data from the sample set that could supply the raw data for these evidence sources were then analyzed statistically and translated into