NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

DR-LINK: A System Update for TREC-2
E. Liddy, S. Myaeng
National Institute of Standards and Technology, D. K. Harman

text-level structure which is used by both producers and readers of that text-type as an indication of how and where certain information endemic to that text-type will be conveyed. We have implemented a Text Structurer for the newspaper text-type, which produces an annotated version of a news article in which each clause or sentence is tagged for the specific slot it instantiates in the news-text model, an extension of van Dijk's earlier model (1988).

The structural annotations are used to respond more precisely to information needs expressed in Topic Statements, where some aspects of relevancy can only be met by understanding a Topic Statement's discourse requirements. For example, Topic Statement 75 states that:

    Document will identify an instance in which automation has clearly paid off or conversely, has failed.

This statement contains the implicit discourse requirement that relevant instances should occur in the CONSEQUENCE component of a news article. DR-LINK extracts this requirement from the Topic Statement and will only assign a similarity value for the discourse level of relevance to those documents in which the sought information occurs in a CONSEQUENCE component.

The current news-text model consists of thirty-eight recognizable components of information observed in a large sample of training texts (e.g., MAIN EVENT, VERBAL REACTION, EVALUATION, FUTURE CONSEQUENCE, PREVIOUS EVENT). The Text Structurer assigns these component labels to document clauses or sentences on the basis of lexical clues learned from text, which now comprise a special lexicon.
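The lexicon-driven tagging and the CONSEQUENCE requirement described above can be illustrated with a minimal sketch. It is not the DR-LINK implementation: the clue phrases, the fallback label, and the function names `tag_clause`, `tag_document`, and `meets_discourse_requirement` are all hypothetical, and simple substring matching stands in for whatever clue recognition the real Text Structurer performs.

```python
from typing import Dict, List, Tuple

# Hypothetical clue lexicon. These phrases are invented for illustration;
# the paper does not list the entries of the actual lexicon.
CLUE_LEXICON: Dict[str, str] = {
    "as a result": "CONSEQUENCE",
    "is expected to": "FUTURE CONSEQUENCE",
    "last year": "PREVIOUS EVENT",
    "said": "VERBAL REACTION",
}

# Assumed fallback label for clauses that contain no known clue.
DEFAULT_LABEL = "MAIN EVENT"


def tag_clause(clause: str) -> str:
    """Return the label of the first lexicon clue found in the clause."""
    lowered = clause.lower()
    for clue, label in CLUE_LEXICON.items():
        if clue in lowered:
            return label
    return DEFAULT_LABEL


def tag_document(clauses: List[str]) -> List[Tuple[str, str]]:
    """Pair every clause with its component label."""
    return [(tag_clause(clause), clause) for clause in clauses]


def meets_discourse_requirement(
    tagged: List[Tuple[str, str]], required: str, query_terms: List[str]
) -> bool:
    """True if a clause tagged `required` mentions at least one query term."""
    for label, clause in tagged:
        if label != required:
            continue
        lowered = clause.lower()
        if any(term.lower() in lowered for term in query_terms):
            return True
    return False


clauses = [
    "The plant installed robots last year.",
    "As a result, output per worker rose sharply.",
]
tagged = tag_document(clauses)
# "output" appears inside the CONSEQUENCE clause, so the requirement is met;
# "robots" appears only in the PREVIOUS EVENT clause, so it is not.
print(meets_discourse_requirement(tagged, "CONSEQUENCE", ["output"]))
print(meets_discourse_requirement(tagged, "CONSEQUENCE", ["robots"]))
```

In this sketch a document contributes to the discourse level of relevance only when the sought term falls inside a clause carrying the required component label, which mirrors the Topic Statement 75 example.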
We considered expanding the lexicon via available lexical resources such as [OCRerr] or [OCRerr], but our analysis of these resources suggested that they do not capture the particularities of lexical usage in the sublanguage of newspaper reporting.

The Text Structurer has recently been improved to assign structural tags at the clause level, a refinement which has corrected most of the anomalies that were observed in earlier testing of the Text Structurer. For example, given the new clause-level structuring, the following sentence is correctly interpreted as containing both future-oriented information in the LEAD-FUTURE segment and some nested information regarding a past situation in the LEAD-HISTORY segment:

    <LEAD-FUT> South Korea's trade surplus, <LEAD-HIST> which more than doubled in 1987 to $6.55 billion, </LEAD-HIST> is expected to narrow this year to above $4 billion. </LEAD-FUT>

We have recently implemented new matching techniques which more fully realize the Text Structurer's potential contribution to the system's performance. This was achieved as one outcome of a study which greatly increased our understanding of how text structure requirements in Topic Statements should be used for matching documents to Topic Statements. Analysis of relevant and non-relevant documents retrieved for a test sample of Topic Statements indicated that most of the errors in the Text Structurer's matching were not serious errors, but only slight mismatches in terms of the conceptual definitions of some of the text model's components. This suggested that our model was overly specific for the task of responding to discourse aspects of information requirements, and that matching Text Structure needs from a Topic Statement to structured documents called for a more generalized model. That is, Topic Statement text-structure requirements are not expressed at the same level of specificity at which Text Structure components are recognizable in documents.
Given this, we reduced the matching complexity via a function that maps the thirty-eight news-text components to seven meta-components. These are: LEAD-MAIN, HISTORY, FUTURE, CONSEQUENCE, EVALUATION, ONGOING, and OTHERS. The new approach allows the system to continue to impose the finer-level, 38-component structure on the newspaper articles themselves with excellent precision, but maps this fuller set of text components to the seven meta-components at the matching stage, as the Topic Statements' text structure requirements are coded at the meta-component level. Unofficial experimental results indicate that this new scheme has significantly increased the Text Structurer's contribution to an improved level of precision in the retrieval of relevant documents.
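The mapping step can be sketched as a simple lookup applied only at matching time. The paper does not give the actual assignment of the thirty-eight components to meta-components, so the entries in `FINE_TO_META` below are assumptions covering only a few component names mentioned in the text, and the helper names `to_meta` and `satisfies_requirement` are hypothetical.

```python
from typing import Dict, List, Set

# The seven meta-components at which Topic Statement requirements are coded.
META_COMPONENTS = (
    "LEAD-MAIN", "HISTORY", "FUTURE", "CONSEQUENCE",
    "EVALUATION", "ONGOING", "OTHERS",
)

# Assumed, partial mapping from fine-grained components to meta-components.
# The real table covers all thirty-eight components and is not published here.
FINE_TO_META: Dict[str, str] = {
    "MAIN EVENT": "LEAD-MAIN",
    "LEAD-HISTORY": "HISTORY",
    "PREVIOUS EVENT": "HISTORY",
    "LEAD-FUTURE": "FUTURE",
    "CONSEQUENCE": "CONSEQUENCE",
    "EVALUATION": "EVALUATION",
}


def to_meta(fine_label: str) -> str:
    """Map a fine-grained label to its meta-component (OTHERS if unmapped)."""
    return FINE_TO_META.get(fine_label, "OTHERS")


def satisfies_requirement(fine_labels: List[str], required_meta: str) -> bool:
    """Check a document's clause labels against a meta-level requirement."""
    if required_meta not in META_COMPONENTS:
        raise ValueError(f"unknown meta-component: {required_meta}")
    present: Set[str] = {to_meta(label) for label in fine_labels}
    return required_meta in present


# Documents keep their fine-grained tags; the mapping is applied only here.
document_labels = ["MAIN EVENT", "LEAD-HISTORY", "VERBAL REACTION"]
print(satisfies_requirement(document_labels, "HISTORY"))
print(satisfies_requirement(document_labels, "CONSEQUENCE"))
```

Keeping the fine-grained labels on the documents and collapsing them only inside the matching function reflects the design described above: annotation stays at the 38-component level, while comparison with Topic Statements happens at the coarser level.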