NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
DR-LINK: A System Update for TREC-2
E. Liddy
S. Myaeng
National Institute of Standards and Technology
D. K. Harman
and produce a ranked list of documents as the third and final output of the system. Using the techniques necessary to
model plausible inferences with CGs (Myaeng and Khoo, 1992), this module computes the degree to which the topic
statement CG is covered by the CGs in the document (see Myaeng and Liddy (1993) and Myaeng and Lopez-Lopez
(1992) for details).
While the most obvious strength of the CG approach is its ability to enhance precision by exploiting the structure
of the CGs and the semantics of relations in document and topic statement CGs, and by attempting to meet the
specific semantic constraints of topic statements, we also attempt to increase recall by allowing flexibility in
node-level matching. Concept labels can be matched partially (e.g. between `Bill Clinton' and `Clinton'), and both
relation and concept labels can be matched inexactly (e.g. between `aid' and `loan' or between `AGENT' and
`EXPERIENCER'). For both inexact and partial matches, we determine the degree of matching and apply a
multiplication factor less than 1 to the resulting score. For inexact matching cases, we have used a relation
similarity table that determines the degree of similarity between pairs of relations. Although this type of matching
slows down the matching time, we feel that until we have a more accurate way of determining the conceptual
relations and a way to represent them at a truly conceptual level (e.g. our attempt to use RIT codes), it is necessary. More
importantly, the similarity table reflects our ontology of relations and allows for matching between relations
produced by different RCD handlers, whose operations in turn are heavily dependent on the domain-independent
knowledge bases.
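The flexible node-level matching described above can be sketched roughly as follows. This is an illustrative reconstruction, not DR-LINK code: the function names, the similarity table entries, and the 0.5 partial-match discount are all assumptions; the paper only states that a multiplication factor less than 1 is applied and that a relation similarity table is consulted.

```python
# Hypothetical sketch of partial and inexact node-level matching.
# REL_SIMILARITY entries and the 0.5 discount are assumed values.

REL_SIMILARITY = {
    ("AGENT", "EXPERIENCER"): 0.8,  # assumed degree of similarity
    ("AGENT", "INSTRUMENT"): 0.3,   # assumed degree of similarity
}

def relation_similarity(r1: str, r2: str) -> float:
    """Exact relation labels match fully; otherwise consult the
    relation similarity table (looking up either ordering)."""
    if r1 == r2:
        return 1.0
    return REL_SIMILARITY.get((r1, r2), REL_SIMILARITY.get((r2, r1), 0.0))

def concept_match(c1: str, c2: str) -> float:
    """Exact concept labels score 1.0.  A partial match -- one label's
    tokens contained in the other's, e.g. 'Clinton' vs 'Bill Clinton' --
    receives a multiplication factor less than 1 (0.5 here, assumed)."""
    if c1 == c2:
        return 1.0
    t1, t2 = set(c1.split()), set(c2.split())
    if t1 and t2 and (t1 <= t2 or t2 <= t1):
        return 0.5
    return 0.0
```

A table-driven scheme like this makes the similarity judgments inspectable, which is consistent with the paper's point that the table encodes an ontology of relations.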
We have done a series of matching experiments internally to evaluate various strategies in CG matching/scoring and
document representation, with the goal of selecting the best one for the final TIPSTER 24-month runs. The first
question we had was how to "normalize" the score assigned to a document based on the current scoring scheme. As
described above, the scoring algorithm is query-oriented in the sense that the score reflects to what extent the query
CG is covered by the document CG. While this approach is theoretically justifiable, one potential drawback is that a
document containing the entire query CG is not ranked higher than one that contains fragments of the query CG
scattered in the document, as long as they cover the same query CG. That is, the "connectivity" or "coherence" of the
matching document CG is not fully taken into account.
With the intuitive notion that the number of matching CG fragments in a document would be inversely proportional
to "connectivity", we have been experimenting with various normalization factors that are a function of the number
of matching CG fragments. At the time of writing, our experimental data show that when we consider 12 sentential
CGs as a unit (called a "paragraph") and use the number of units containing one or more matching CG fragments in
the normalization function, we obtain the best result. Among all the functions we have tried, the best normalization
factor we have found experimentally so far is:
1.05^(1-x)
where x is the number of text units that contain one or more matching CG fragments. When this is combined with
the maximum of the scores assigned to individual "paragraphs" as follows:
S * 1.05^(1-x) + 0.4 * M
where S is the unnormalized score and M the maximum "paragraph" score, we obtained the best results. Since
we determined the constants incrementally, it is entirely possible that a different combination of the constants could give
better results. It is relatively clear from these experiments that either the first or the second term alone is always
inferior to the combination. The number of sentential CGs per "paragraph", 12, also seems quite stable.
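The normalized scoring just described can be written out directly. This is a minimal sketch under the assumptions stated in the text (12-CG "paragraph" units, the 1.05^(1-x) factor, and the 0.4 weight on the maximum paragraph score); the function and parameter names are illustrative.

```python
def count_matching_units(fragment_flags, unit_size=12):
    """fragment_flags[i] is True if sentential CG i contains a matching
    fragment.  Group the sentential CGs into fixed-size "paragraph" units
    (12 per the paper) and count the units with at least one match."""
    return sum(
        1
        for start in range(0, len(fragment_flags), unit_size)
        if any(fragment_flags[start:start + unit_size])
    )

def combined_score(s, m, x):
    """S * 1.05^(1-x) + 0.4 * M: s is the unnormalized document score,
    m the maximum "paragraph" score, and x the number of units that
    contain one or more matching CG fragments."""
    return s * 1.05 ** (1 - x) + 0.4 * m
```

Note that when x = 1 (all matching fragments fall in a single unit) the normalization factor is exactly 1, so the more the matches scatter across units, the more the first term is discounted, which captures the intended "connectivity" penalty.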
We have produced TIPSTER runs using the RIT-coded documents and topic statements. The current matching
program attempts to match on RIT codes only when the concept names (words) don't match. Because of this
conservative approach, the RIT codes do not block a match between two different polysemous words and thus do not have
any direct impact on the word ambiguity problems in IR. With the disambiguation process employed when RIT
codes are chosen for a noun or verb, however, the net effect is analogous to term expansion with sense
disambiguation. It should be noted that since RIT codes are used for both document and query concepts, this amounts
to sense-disambiguated term expansion on both queries and documents.
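The conservative fallback order described above (surface words first, thesaurus codes only on a word-level miss) might be sketched as follows. The function name, the code values, and the 0.8 discount for a code-only match are hypothetical; the paper specifies only the ordering, not the weights.

```python
def node_score(word1, word2, code1=None, code2=None):
    """Conservative two-stage node match: compare concept names (words)
    first, and fall back to thesaurus codes only when the words differ.
    A code-only match is discounted (0.8 is an assumed value)."""
    if word1 == word2:
        return 1.0                      # surface words agree: full match
    if code1 is not None and code1 == code2:
        return 0.8                      # words differ, senses agree
    return 0.0
```

Because the code comparison never runs when the words already match, this scheme can only add matches (sense-disambiguated expansion), never suppress one, which is exactly why it leaves the polysemy problem untouched.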