NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

DR-LINK: A System Update for TREC-2
E. Liddy, S. Myaeng
National Institute of Standards and Technology, D. K. Harman

[`,uy] -> (GOAL) -> [satisfy]

The knowledge base contains a small number of simple patterns involving BE verbs and more than 350 pattern rules for phrasal patterns across phrase boundaries, by which important relations are extracted. The pattern rules specify certain lexical patterns, the order of occurrence of words belonging to certain part-of-speech categories, and the concept-relation-concept triples to be generated. These patterns require a processing capability no more powerful than a finite state automaton. Due to time constraints, however, the current ad-hoc handler has not been generalized to process all the patterns; as a result, about 30% of the patterns in the knowledge base are recognized and handled correctly.

2.J. Conceptual Graph (CG) Generator

After the individual RCD modules have generated concept-relation-concept triples for a document, the CG generator merges them to form a set of conceptual graphs, each corresponding to a clause in most cases. Since more than one handler can generate different triples for the same concept pair (e.g. a prepositional phrase handled by both the CF handler and the NP/PP handler), based on independently constructed rules and independent processes, a form of conflict resolution is necessary. In the current implementation, we simply order the execution of the different handlers according to the general quality of their rules and resulting triples, so that more reliable handlers have higher precedence.

The concept nodes in the resulting CGs can contain not only general concept names but also instantiations (referents) of the concepts. Such a concept can be derived either from a proper noun, such as a company name, or from a subordinate clause.
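The precedence-based conflict resolution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DR-LINK implementation: the `Triple` type, the `merge_triples` function, the handler names, their precedence order, and the sample relations are all hypothetical. The paper resolves conflicts by ordering handler execution; the sketch models the same preference with an explicit rank per handler.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    source: str    # first concept
    relation: str  # relation label
    target: str    # second concept
    handler: str   # name of the handler that produced the triple


# Hypothetical precedence order: handlers listed earlier are assumed more reliable.
HANDLER_PRECEDENCE = ["case_frame", "np_pp", "ad_hoc"]


def merge_triples(triples):
    """Keep one triple per concept pair, preferring the most reliable handler."""
    rank = {name: i for i, name in enumerate(HANDLER_PRECEDENCE)}
    best = {}
    for t in triples:
        key = (t.source, t.target)
        # Replace the stored triple only if this handler ranks higher (lower index).
        if key not in best or rank[t.handler] < rank[best[key].handler]:
            best[key] = t
    return list(best.values())


# Invented sample data: two handlers disagree about the (buy, market) pair.
candidates = [
    Triple("buy", "LOCATION", "market", "np_pp"),
    Triple("buy", "GOAL", "market", "case_frame"),
    Triple("company", "AGENT", "buy", "ad_hoc"),
]
merged = merge_triples(candidates)
```

In this toy input the triple from the higher-precedence handler replaces the conflicting one, while the triple for the unrelated concept pair is kept unchanged.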
When a concept is derived from a subordinate clause, the instantiation is itself a CG, producing a CG such as

[country: {US}] <- (SOURCE-OF-INFO) <- [C#: [[pact] <- (PATIENT) <- ...]

In the current implementation, concepts with the same instantiation are merged across sentences to form a larger CG, but concepts that share a label and have no referents are treated as separate concepts across sentences and are not merged. A pronoun resolution method is being implemented to merge a pronoun with its antecedent, as a way to increase the connectivity of CGs and hence the usefulness of relation nodes.

As a way to make our current representation more "conceptual", we have implemented a module that adds RIT (Roget's International Thesaurus) codes to individual concept nodes, so that the label on a node is not a word but a position in the RIT hierarchy. The lowest-level position above individual lexical items in the RIT hierarchy is called a semicolon group; it consists of several terms delimited by semicolons and represents a concept. The mapping from a word in text (called the target) to a position in RIT requires sense disambiguation. Our approach is to use the words surrounding the target word as the context within which the sense of the target word is determined and one or more RIT codes are selected. The algorithm selects a minimal number of RIT codes (i.e. one or more), not just the best one, for each target word, since we feel that some of the sense distinctions made in RIT are unnecessarily subtle, and it is unlikely that any attempt to make such fine distinctions would be successful and hence contribute to information retrieval.

We have produced RIT-coded documents and topic statements for the San Jose Mercury collection and the routing queries. All the concept nodes derived from nouns now have RIT codes selected using the surrounding text as the context. Concept nodes derived from verbs also have RIT codes, but these are assigned in a different way.
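The context-based selection of RIT codes for nouns can be illustrated with a small sketch. The paper does not give the scoring method, so everything here is an assumption made for illustration: the `select_codes` function, the word-overlap score, the tie rule that keeps every top-scoring code, and the toy thesaurus, whose codes and semicolon groups are invented and are not real RIT entries.

```python
def select_codes(target, context, thesaurus):
    """Return the candidate codes whose semicolon group best overlaps the context.

    All codes tied for the highest overlap are kept, so the result may
    contain more than one code rather than a single "best" sense.
    """
    context_words = {w.lower() for w in context}
    scores = {}
    for code, group in thesaurus.get(target, {}).items():
        # Score = number of context words that also appear in the group.
        scores[code] = len(context_words & group)
    if not scores:
        return []
    top = max(scores.values())
    return sorted(code for code, s in scores.items() if s == top)


# Toy data (hypothetical): each code maps to the terms of its semicolon group.
TOY_THESAURUS = {
    "bank": {
        "code_A": {"money", "loan", "deposit"},
        "code_B": {"river", "shore", "water"},
        "code_C": {"credit", "loan", "fund"},
    }
}

codes = select_codes("bank", ["the", "loan", "was", "approved"], TOY_THESAURUS)
```

Here two closely related codes tie on the context and both are retained, which mirrors the stated preference for a small set of codes over a forced single choice between subtle senses.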
Instead of using the surrounding text as the context to disambiguate verb senses (we concluded that this method is not reliable for verbs), we first assign RIT codes to each sense of the LDOCE verb entries using the same method. In this case the context becomes the definition text in LDOCE. Once the Case Frame Handler selects the right case frame while text is processed, the RIT codes attached to that case frame are automatically assigned to the target verb.

2.K. Conceptual Graph (CG) Matcher

The main function of the CG matcher is to determine the relevance of each document against a topic statement CG