NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

DR-LINK: A System Update for TREC-2
E. Liddy, S. Myaeng
National Institute of Standards and Technology; D. K. Harman

the multiple SFC vectors produced for each document, including both a Dempster-Shafer combination and a straight averaging. Although official results are not yet available, our internal test results indicate that the combination of Text Structuring and Subject Field Coding produces an improved ranking of documents, especially when using the Dempster-Shafer method.

2. D. Proper Noun Interpreter

Our earlier work with the SFCoder suggested that the most important factor in improving the performance of this upstream ranking module would be to integrate the general subject-level representation provided by SFCodes with a level of text representation that enabled more refined discrimination. Analysis of earlier test results suggested that proper noun (PN) matching that incorporated both particular proper nouns (e.g. Argentina, FAA) and `category' level proper nouns (e.g. third-world country, government agency) would improve precision performance. The Proper Noun Interpreter (Paik et al., 1993) that we developed provides: a canonical representation of each proper noun; a classification of each proper noun into one of thirty-seven categories; and a means for expanding group nouns into their constituent members (e.g. all the countries comprising the Third World). Recent work on our proper noun algorithms, context-based rules, and knowledge bases has improved the module's ability to recognize and categorize proper nouns to 93% correct categorization using 37 categories, as tested on a sample set of 545 proper nouns from newspaper text.
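The three services attributed to the Proper Noun Interpreter above (canonical form, category assignment, and group expansion) can be pictured with a minimal Python sketch. The `ProperNoun` record, the `interpret` function, and the three lookup tables are hypothetical names introduced here for illustration. The table entries reuse only the examples given in the text and are not taken from the DR-LINK knowledge bases. The context-based rules mentioned above are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProperNoun:
    surface: str                 # string as it appeared in the text
    canonical: str               # normalized form used for matching
    category: Optional[str]      # one of the PN categories, if known
    members: List[str] = field(default_factory=list)  # group expansion


# Hypothetical toy tables; a real knowledge base would be far larger.
ALIASES: Dict[str, str] = {"faa": "Federal Aviation Administration"}
CATEGORIES: Dict[str, str] = {
    "Federal Aviation Administration": "government agency",
    "Argentina": "third-world country",
}
GROUP_MEMBERS: Dict[str, List[str]] = {"Third World": ["Argentina"]}


def interpret(surface: str) -> ProperNoun:
    """Look up canonical form, category, and group members for one PN."""
    key = " ".join(surface.split())            # collapse extra whitespace
    canonical = ALIASES.get(key.lower(), key)  # fall back to the surface form
    return ProperNoun(
        surface=surface,
        canonical=canonical,
        category=CATEGORIES.get(canonical),
        members=list(GROUP_MEMBERS.get(canonical, [])),
    )
```

Under these assumptions, a query term such as "Third World" can be matched against documents through its expanded members, while "FAA" and its full name reduce to one canonical entry.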
The improved performance has a double impact on the system's retrieval performance, as proper nouns contribute both to the downstream relation-concept representation used in CG matching and to the upstream proper noun, complex nominal, and text structure ranking of documents in relation to individual queries. Details of processing Topic Statements for their PN requirements and the use of this similarity value in document ranking are described in the later section on the Query Constructor.

2. E. Complex Nominal Phraser

A new level of natural language processing has been incorporated in the DR-LINK System with the implementation of the Complex Nominal (CN) Phraser. The motivation behind this addition was our recognition that, either in addition to proper nouns or in the absence of proper nouns, most of the substantive content requirements of Topic Statements are expressed in complex nominals (i.e. noun + noun, "[...] reduction", "government assistance", "health hazards"). Complex nominals provide a linguistic means for precise conceptual matching, as do proper nouns. However, the conceptual content of complex nominals can be expressed in synonymous phrases, in a different way than can the conceptual content of proper nouns, which are more particularized. Therefore, for complex nominals, a controlled expansion step was incorporated in the CN matching process in order to improve recall as well as precision.

For input to the CN Phraser, the complex nominals in Topic Statements are recognizable as adjacent noun pairs or non-predicating adjective + noun pairs in the output of the part-of-speech tagger. Having recognized all CNs, the substitutable phrases for each complex nominal are found by computationally determining the overlap of synonymous terms suggested by RIT and statistical corpus analysis.
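As a rough illustration of the two steps just described, the following Python sketch picks out candidate complex nominals from tagged text and then keeps only the substitutes on which two sources agree. It assumes Penn Treebank style tags (`NN`, `NNS`, `JJ`), which may differ from the tagger actually used, and it does not attempt to tell non-predicating adjectives from other adjectives. The function names and the toy synonym sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple

NOUN_TAGS = {"NN", "NNS"}   # assumed tagset, not confirmed by the source
ADJ_TAGS = {"JJ"}           # all adjectives; a simplification


def complex_nominals(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return adjacent noun+noun and adjective+noun pairs, lowercased."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t2 in NOUN_TAGS and (t1 in NOUN_TAGS or t1 in ADJ_TAGS):
            pairs.append((w1.lower(), w2.lower()))
    return pairs


def substitutes(term: str,
                thesaurus: Dict[str, Set[str]],
                corpus: Dict[str, Set[str]]) -> Set[str]:
    """Keep only candidates proposed by BOTH the thesaurus and the corpus."""
    return thesaurus.get(term, set()) & corpus.get(term, set())


# Toy input; the tags and synonym sets are made up for illustration.
tagged = [("government", "NN"), ("assistance", "NN"), ("poses", "VBZ"),
          ("health", "NN"), ("hazards", "NNS")]
cns = complex_nominals(tagged)

thesaurus = {"assistance": {"aid", "help"}}
corpus = {"assistance": {"aid", "funding"}}
agreed = substitutes("assistance", thesaurus, corpus)
```

Taking the intersection rather than the union is what keeps the expansion controlled: a candidate proposed by only one source is dropped.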
These processes serve to identify all second-order associations between each complex nominal constituent and terms in the database. Second-order associations exist between terms that are used interchangeably in certain contexts. The premise here is that if, for example, terms a and b are both frequently premodified by the same set of terms in a corpus, it is highly likely that terms a and b are substitutable for each other within these phrases. The use of both corpus and RIT information appears to limit the over-generation that frequently results from automatic term expansion. Ongoing experiments on this new addition to the system will help us further refine the process and will be reported more extensively in the near future.

The terms that exhibit second-order associations are compiled into equivalence classes. These equivalence classes provide substitutable synonymous phrases for Topic Statement complex nominals and are used by the matching algorithms in the same manner that the original complex nominals are used. The complex nominals and their substitutes are first used in the upstream matching of Topic Statements to documents as one contributing factor to the integrated similarity value, to be further explained in the section on the Query Constructor. In addition, each complex nominal and its assigned relation provides a CRC to the RCD module for use in the final round of matching. For that module, semantic relations between the constituent nouns of each complex nominal are