SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
Donna K. Harman
nature of RRF and the exact features of the representation of documents and topic statements, we
have developed and implemented heuristics for scoring documents for their relevancy with
respect to a given topic statement, and added them to the base algorithm. With the rich
representation of topic statements and documents, the expanded matching component can currently
discriminate among documents on the basis of rather unusual information
needs specified in a topic statement, such as a certain status of an event or a requirement that a document contain a
specific entity (e.g., a company name) satisfying a certain condition. It also facilitates
reliable identification of "hot spots" in retrieved documents.
An example of a match between a document CG and a topic statement CG is shown in Figure
5, where the entire CG represents the example sentence discussed above and the topic statement
sentence:
"...a current debt rescheduling agreement... between a debtor developing
country and one or more of its creditors, commercial and/or official. It will
identify the debtor country and the creditor(s), the payment time period...,
and the interest rate, ..."
The dark area and the individual nodes with gray shades represent the matched parts: a connected
sub-CG that is the main contributor to the final score and some single-node matches,
respectively. It should be noted that information on the relative importance of the text
components delineated by the Text Structure component with respect to the topic statement is
incorporated in the scoring heuristics.
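
The combination of a connected matched subgraph, isolated node matches, and a text-component weight can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the triple representation, the function names, and the weights `edge_weight`, `node_weight`, and `component_weight` are hypothetical and are not the scoring heuristics actually used in DR-LINK.

```python
from typing import Set, Tuple

# Hypothetical representation: a graph is a set of (concept, relation, concept) triples.
Triple = Tuple[str, str, str]


def concepts(triples: Set[Triple]) -> Set[str]:
    """Return every concept node mentioned in a set of triples."""
    return {c for left, _, right in triples for c in (left, right)}


def largest_connected_match(shared: Set[Triple]) -> Set[Triple]:
    """Return the largest group of shared triples linked through common concepts."""
    remaining = set(shared)
    best: Set[Triple] = set()
    while remaining:
        seed = remaining.pop()
        component = {seed}
        nodes = {seed[0], seed[2]}
        while True:
            # Pull in every remaining triple that touches a node already reached.
            linked = {t for t in remaining if t[0] in nodes or t[2] in nodes}
            if not linked:
                break
            remaining -= linked
            component |= linked
            nodes |= concepts(linked)
        if len(component) > len(best):
            best = component
    return best


def match_score(doc: Set[Triple], topic: Set[Triple],
                component_weight: float = 1.0,
                edge_weight: float = 2.0,
                node_weight: float = 0.5) -> float:
    """Score a document graph against a topic graph (illustrative weights only)."""
    core = largest_connected_match(doc & topic)
    # Concepts matched in both graphs but lying outside the connected core.
    single_nodes = (concepts(doc) & concepts(topic)) - concepts(core)
    raw = edge_weight * len(core) + node_weight * len(single_nodes)
    # component_weight stands in for the importance of the text component.
    return component_weight * raw


topic = {("agreement", "between", "country"),
         ("agreement", "between", "creditor"),
         ("agreement", "has", "rate")}
doc = {("agreement", "between", "country"),
       ("agreement", "between", "creditor"),
       ("country", "owes", "rate")}
score = match_score(doc, topic)
```

In this toy case the two shared triples form the connected core and "rate" is matched only as a single node, so the connected part dominates the score, in the spirit of the description above.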
An independent module under development, which will have direct bearing on the final results of
matching and which will be added at a later stage to determine its efficacy, is the RIT (Roget's
International Thesaurus) Coder. The goal is to simulate the use of a type hierarchy in the CG
theory by replacing lexical terms in concept nodes in our representation with RIT semicolon
group numbers, each of which represents a set of semantically similar words and phrases in its
hierarchy. Our approach is to use the words surrounding the target word to be replaced with an
RIT code as context words, by which we try to disambiguate the sense of the target word and find
its exact location in the RIT. While this approach is intended to increase both recall and precision at the
same time through implicit expansion of terms and sense disambiguation, we have yet to determine how
sensitive the overall retrieval results are to incomplete disambiguation.
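
As a rough illustration of context-based coding, the sketch below picks, for a target word, the thesaurus group whose members overlap most with the surrounding words. It is only a sketch under stated assumptions: the group numbers and memberships in `TOY_GROUPS` are invented and are not real RIT entries, and the overlap-count rule and the name `code_for` are hypothetical rather than the method of the RIT Coder itself.

```python
from typing import Dict, Iterable, Optional, Set

# Invented toy data: group number -> words in that group (not real RIT entries).
TOY_GROUPS: Dict[int, Set[str]] = {
    101: {"bank", "lender", "creditor", "loan", "debt"},
    202: {"bank", "shore", "river", "stream"},
}


def code_for(target: str, context: Iterable[str],
             groups: Dict[int, Set[str]] = TOY_GROUPS) -> Optional[int]:
    """Return a group number for target, or None if it cannot be resolved."""
    target = target.lower()
    candidates = sorted(g for g, members in groups.items() if target in members)
    if not candidates:
        return None              # word is not in the toy thesaurus
    if len(candidates) == 1:
        return candidates[0]     # unambiguous word, no context needed
    context_words = {w.lower() for w in context} - {target}
    # Score each candidate sense by how many context words share its group.
    overlap = {g: len(groups[g] & context_words) for g in candidates}
    best = max(overlap.values())
    winners = [g for g in candidates if overlap[g] == best]
    if best == 0 or len(winners) > 1:
        return None              # incomplete disambiguation: keep the lexical term
    return winners[0]


financial = code_for("bank", ["the", "debt", "loan"])
riverside = code_for("bank", ["river"])
unresolved = code_for("bank", ["the"])
```

Returning `None` when the context gives no clear winner makes the incomplete-disambiguation case explicit, which is the case whose effect on retrieval remains to be measured.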
Testing and Results
Although the DR-LINK entry in TREC was not tested in the same manner as the other systems,
the status of the system and the testing which was done were consistent with the milestones that
had been established in consultation with our TIPSTER contractor. DR-LINK was not tested as a
full system because it is based on several theories that have never before been
implemented in an information retrieval system, and none of the system's components was
even at the design stage when the project began. As a result, our
system is not yet fully implemented. The results that follow are for the three modules (Subject
Field Coder, Text Structurer, Conceptual Graph Matcher) that have been implemented to date.
The full system will be tested at the eighteenth month meeting of TIPSTER.