NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

DR-LINK: A System Update for TREC-2
E. Liddy, S. Myaeng
National Institute of Standards and Technology
D. K. Harman

2.H. Topic Statement Processing for Conceptual Graph Generation

The processing of topic statements for CG generation does not make use of the output of the Natural Language Query Constructor. Instead, the current system first applies the same RCD and CG generator modules to produce topic statement (TS) CGs. Several TS-specific processing requirements have been identified; some have been implemented as post-processing routines and others are under development.

- Elimination of concept and relation nodes corresponding to contentless meta-phrases (e.g., "Relevant document must identify ..."). If both of the concept nodes in a concept-relation-concept (CRC) triple belong to a meta-phrase, the CRC is ignored. When only one of them is a meta-phrase concept, the triple is not removed blindly: it is removed only if the other concept also occurs in another triple.

- Handling of negated parts of topic statements. The weights are adjusted so that an occurrence of the negated concept in a document contributes negative evidence for the document's relevance. In effect, the two weights for the concept are switched.

- Automatic assignment of weights to concept and relation nodes. There are several factors we consider: the conventional way of determining the importance of terms using inverse document frequency (IDF) and total frequency; the location of terms occurring in topic statements; the part-of-speech information for each term; and indications in the topic statement sublanguage (e.g., "the document MUST contain...").
Although we have implemented a program that tags individual words with a degree of importance based on the sublanguage patterns, for the evaluation we assigned concept weights based on the IDF values of terms in the collection, due to time constraints.

- Merging common concepts appearing in different sections of topic statements. Although it is not safe in general to assume that two concepts sharing the same concept name actually refer to the same concept instantiation and to merge them blindly, we have observed that this risk does not arise in the topic statements. In fact, we believe that it is desirable to merge CG fragments using common concept nodes. This is an important process that eliminates undesirable effects on scoring. Without it, a document containing a concept that occurs repeatedly in the <desc>, <narr>, and <con> fields would be ranked unnecessarily high (or low if it is negated), because each occurrence of the concept would make an independent contribution to the overall score.

Since an integrated automatic topic processing module was not available, the mechanical aspects of the process were hand-simulated, with some parts done automatically and others done manually.

2.I. Relation-Concept Detector (RCD)

The outputs of the Complex Nominal Phraser and the Proper Noun Interpreter modules described above provide concept-relation-concept triples directly to the Relation-Concept Detector (RCD) module. In addition, the following RCD handlers are operative.

One of the more distinctive aspects of the DR-LINK system is its capability of extracting relations and using them in the final CG representations of documents and topic statements. This module provides building blocks for the CG representation by generating concept-relation-concept triples based on the domain-independent knowledge bases we have been constructing from machine-readable resources and corpus statistics.
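
To make the concept-relation-concept triples and the topic-statement post-processing steps described earlier more concrete, the following minimal Python sketch shows one way they might be represented. It is illustrative only and is not the DR-LINK implementation. The names `Triple`, `drop_meta_triples`, and `concept_weights` are hypothetical, as are the example concepts and relations. Three further assumptions are made: a triple with one meta-phrase concept is dropped only when its other concept survives in a triple free of meta-phrase concepts; the "two weights" of a concept are modeled as a (positive, negative) evidence pair; and IDF is computed with the textbook formula log(N / df), since the paper does not give its exact formula.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A concept-relation-concept (CRC) triple; field names are hypothetical."""
    head: str
    relation: str
    tail: str


def drop_meta_triples(triples, meta):
    """Filter out triples that only carry meta-phrase content.

    A triple is dropped when both concepts are meta-phrase concepts, or when
    one is and the other concept also occurs in another triple that contains
    no meta-phrase concept (an assumed reading of "occurs in another triple").
    """
    kept = []
    for i, t in enumerate(triples):
        head_meta, tail_meta = t.head in meta, t.tail in meta
        if head_meta and tail_meta:
            continue  # both concepts belong to a meta-phrase
        if head_meta or tail_meta:
            other = t.tail if head_meta else t.head
            survives_elsewhere = any(
                other in (u.head, u.tail)
                and u.head not in meta and u.tail not in meta
                for j, u in enumerate(triples) if j != i
            )
            if survives_elsewhere:
                continue  # safe to drop: the content concept is not lost
        kept.append(t)
    return kept


def concept_weights(occurrences, doc_freq, num_docs):
    """Merge same-named concepts into one node and weight each node once.

    occurrences: iterable of (concept, field, negated) tuples.
    Returns {concept: (positive_weight, negative_weight)}.
    A merged concept is treated as negated if any occurrence is negated
    (an assumption); negation switches the two weights.
    """
    negated = {}
    for concept, _field, is_negated in occurrences:
        negated[concept] = negated.get(concept, False) or is_negated
    weights = {}
    for concept, is_negated in negated.items():
        w = math.log(num_docs / doc_freq[concept])  # assumed IDF formula
        weights[concept] = (0.0, w) if is_negated else (w, 0.0)
    return weights


# Hypothetical example data.
triples = [
    Triple("document", "identify", "merger"),     # "document" is a meta concept
    Triple("company", "agent", "merger"),
    Triple("relevant", "attribute", "document"),  # both concepts are meta
    Triple("document", "mention", "lawsuit"),     # "lawsuit" occurs nowhere else
]
kept = drop_meta_triples(triples, meta={"document", "relevant"})

occurrences = [
    ("merger", "desc", False),
    ("merger", "narr", False),   # merged with the <desc> occurrence
    ("lawsuit", "narr", True),   # negated in the topic statement
]
weights = concept_weights(occurrences, {"merger": 10, "lawsuit": 100}, 1000)
```

In this sketch the repeated concept "merger" yields a single weighted node, so it contributes to a document score once rather than once per topic field, and the negated concept "lawsuit" carries its weight on the negative-evidence side.
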
In this module, several handlers are activated selectively depending on the input sentence.

2.I.1. Case Frame (CF) Handler

The main function of the CF Handler is to generate concept-relation-concept triples in which one of the concepts typically comes from a verb. It identifies a verb in a sentence and connects it to the other constituents surrounding the verb. Since the relations included in our representation (we currently use about 50) originate from theories of linguistic case roles (Somers, 1987; Cook, 1989) and are all semantic in nature, this module consults the