SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
DR-LINK: A System Update for TREC-2
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
D. K. Harman
the multiple SFC vectors produced for each document, including both a Dempster-Shafer combination and a straight
averaging. Although official results are not yet available, our internal test results indicate that the combination of
Text Structuring and Subject Field Coding produces an improved ranking of documents, especially when using the
Dempster-Shafer method.
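As a rough illustration of the two combination strategies named above, the following sketch combines SFC evidence from two text components either by straight averaging or by Dempster's rule of combination. The representation is an assumption made for illustration only: each vector assigns mass to single subject field codes plus a residual mass on the whole frame (the hypothetical key THETA), and the code names are invented. The source does not state how masses are actually assigned in DR-LINK.

```python
THETA = "THETA"  # hypothetical key: mass left on the whole frame (ignorance)


def average_vectors(vectors):
    """Straight averaging: mean weight per code across all vectors."""
    codes = set().union(*vectors)
    return {c: sum(v.get(c, 0.0) for v in vectors) / len(vectors) for c in codes}


def dempster_combine(m1, m2):
    """Dempster's rule for masses on singleton codes plus THETA.

    Two different singleton codes have an empty intersection, so their
    product counts as conflict and is normalized away at the end.
    """
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            if a == b:
                key = a
            elif a == THETA:
                key = b
            elif b == THETA:
                key = a
            else:
                conflict += wa * wb
                continue
            combined[key] = combined.get(key, 0.0) + wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Invented example masses for two components of one document.
m1 = {"ECON": 0.6, THETA: 0.4}
m2 = {"ECON": 0.5, "LAW": 0.3, THETA: 0.2}
ds_result = dempster_combine(m1, m2)
avg_result = average_vectors([m1, m2])
```

In this toy case Dempster's rule concentrates more mass on the code that both components support than averaging does, which is one plausible reason the two methods could rank documents differently.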
2. D. Proper Noun Interpreter
Our earlier work with the SFCoder suggested that the most important factor in improving the performance of this
upstream ranking module would be to integrate the general subject-level representation provided by SFCodes with a
level of text representation that enabled more refined discrimination. Analysis of earlier test results suggested that
proper noun (PN) matching that incorporated both particular proper nouns (e.g. Argentina, FAA) and
`category' level proper nouns (e.g. third-world country, government agency) would improve precision performance.
The Proper Noun Interpreter (Paik et al., 1993) that we developed provides: a canonical representation of each proper
noun; a classification of each proper noun into one of thirty-seven categories; and a means for expanding group
nouns into their constituent members (e.g. all the countries comprising the Third World). Recent work on our proper
noun algorithms, context-based rules, and knowledge bases has improved the module's ability to recognize and
categorize proper nouns to 93% correct categorization using 37 categories, as tested on a sample set of 545 proper
nouns from newspaper text. The improved performance has a double impact on the system's retrieval performance, as
proper nouns contribute both to the downstream relation-concept representation used in CG matching and to
the upstream proper noun, complex nominal, and text structure ranking of documents in relation to individual
queries. Details of processing Topic Statements for their PN requirements and the use of this similarity value in
document ranking are described in the later section on the Query Constructor.
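The three services listed above (canonical form, category, and group expansion) can be pictured with a minimal lookup sketch. Everything here is hypothetical: the table names, the function interpret, the category labels, and the entries are toy stand-ins, not the system's actual knowledge bases or its context-based rules.

```python
# Toy knowledge bases; every entry is illustrative, not the system's data.
CANONICAL = {
    "faa": "Federal Aviation Administration",
    "federal aviation administration": "Federal Aviation Administration",
    "argentina": "Argentina",
    "benelux": "Benelux",
}
CATEGORY = {
    "Federal Aviation Administration": "government agency",
    "Argentina": "country",
    "Benelux": "country group",
}
GROUP_MEMBERS = {
    "Benelux": ["Belgium", "Netherlands", "Luxembourg"],
}


def interpret(proper_noun):
    """Return canonical form, category, and group members for a proper noun."""
    key = " ".join(proper_noun.lower().split())  # normalize case and spacing
    canonical = CANONICAL.get(key, proper_noun)
    return {
        "canonical": canonical,
        "category": CATEGORY.get(canonical, "unknown"),
        "members": GROUP_MEMBERS.get(canonical, []),
    }


faa = interpret("FAA")
group = interpret("benelux")
```

With such a representation, a query that names a group could be matched against documents that mention only individual members, and variant spellings of one entity would collapse to a single form.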
2. E. Complex Nominal Phraser
A new level of natural language processing has been incorporated in the DR-LINK System with the implementation
of the Complex Nominal (CN) Phraser. The motivation behind this addition was our recognition that, whether in
addition to proper nouns or in their absence, most of the substantive content requirements of Topic
Statements are expressed in complex nominals (i.e. noun + noun, reduction", "government assistance",
"health hazards"). Complex nominals provide a linguistic means for precise conceptual matching, as do proper
nouns. However, the conceptual content of complex nominals can be expressed in synonymous phrases, in a
different way than can the conceptual content of proper nouns, which are more particularized. Therefore, for complex
nominals, a controlled expansion step was incorporated in the CN matching process in order to accomplish the
desired goals of improved recall, as well as improved precision.
For input to the CN Phraser, the complex nominals in Topic Statements are recognizable as adjacent noun pairs or
non-predicating adjective + noun pairs in the output of the part-of-speech tagger. Having recognized all CNs, the
substitutable phrases for each complex nominal are found by computationally determining the overlap of
synonymous terms suggested by RIT and statistical corpus analysis. These processes serve to identify all second
order associations between each complex nominal constituent and terms in the database. Second order associations
exist between terms that are used interchangeably in certain contexts. The premise here is that if, for example, terms
a and b are both frequently premodified by the same set of terms in a corpus, it is highly likely that terms a and b are
substitutable for each other within these phrases. The use of both corpus and RIT information appears to limit the
over-generation that frequently results from automatic term expansion. Ongoing experiments on this new addition to
the system will help us further refine the process and will be reported more extensively in the near future.
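The corpus side of this process can be sketched as follows, under several stated assumptions: tags follow Penn Treebank conventions (NN*, JJ), every adjective is accepted (the non-predicating test is omitted), and premodifier overlap is scored with a Jaccard ratio. The source does not name the measure actually used, and the RIT evidence is not modeled here. All function names are hypothetical.

```python
from collections import defaultdict


def extract_complex_nominals(tagged_tokens):
    """Return (modifier, head) pairs for adjacent noun+noun or adjective+noun tokens."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t2.startswith("NN") and (t1.startswith("NN") or t1 == "JJ"):
            pairs.append((w1.lower(), w2.lower()))
    return pairs


def premodifier_sets(pairs):
    """Map each head term to the set of terms observed premodifying it."""
    mods = defaultdict(set)
    for modifier, head in pairs:
        mods[head].add(modifier)
    return mods


def second_order_score(mods, a, b):
    """Jaccard overlap of the premodifier sets of heads a and b (assumed measure)."""
    sa, sb = mods.get(a, set()), mods.get(b, set())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Invented tagged fragment for illustration.
tagged = [
    ("government", "NN"), ("assistance", "NN"), ("and", "CC"),
    ("government", "NN"), ("aid", "NN"),
    ("foreign", "JJ"), ("assistance", "NN"),
    ("foreign", "JJ"), ("aid", "NN"),
]
pairs = extract_complex_nominals(tagged)
mods = premodifier_sets(pairs)
score = second_order_score(mods, "assistance", "aid")
```

In this fragment "assistance" and "aid" share the same premodifiers, so they would be candidates for substitution. A real system would require a minimum frequency and, as the text notes, agreement with the thesaurus evidence before accepting a pair.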
The terms that exhibit second order associations are compiled into equivalence classes. These equivalence classes
provide substitutable synonymous phrases for Topic Statement complex nominals and are used by the matching
algorithms in the same manner that the original complex nominals are used. The complex nominals and their
substitutes are first used in the upstream matching of Topic Statements to documents as one contributing factor to
the integrated similarity value, to be further explained in the section on the Query Constructor.
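One way to picture the compilation of equivalence classes and their use in expansion is the sketch below. It assumes that associations are merged transitively (a union-find closure) and that substitute phrases are formed by replacing each constituent with any member of its class. Both choices, and all names, are assumptions for illustration; the source says only that associated terms are compiled into classes.

```python
from itertools import product


def build_equivalence_classes(associated_pairs):
    """Merge pairwise associations into classes; return term -> class members."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in associated_pairs:
        parent[find(a)] = find(b)

    classes = {}
    for term in list(parent):
        classes.setdefault(find(term), set()).add(term)
    return {term: members for members in classes.values() for term in members}


def expand_complex_nominal(cn, term_classes):
    """List every phrase obtained by substituting class members for constituents."""
    options = [sorted(term_classes.get(word, {word})) for word in cn]
    return [" ".join(phrase) for phrase in product(*options)]


# Invented associations for illustration.
term_classes = build_equivalence_classes([("assistance", "aid"), ("aid", "support")])
phrases = expand_complex_nominal(("government", "assistance"), term_classes)
```

The expanded phrases can then be matched against documents exactly as the original complex nominal would be. Note that transitive merging can over-generate, which is one reason a stricter grouping criterion might be preferred in practice.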
In addition, each complex nominal and its assigned relation provide a CRC to the RCD module for use in the final
round of matching. For that module, semantic relations between the constituent nouns of each complex nominal are