SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection
E. Liddy
S. Myaeng
National Institute of Standards and Technology
Donna K. Harman
DR-LINK's Linguistic-Conceptual Approach to Document Detection [1]
Elizabeth D. Liddy
Sung H. Myaeng
School of Information Studies
Syracuse University
Syracuse, New York 132100-4100
liddy@mailbox.syr.edu; shmyaeng@mailbox.syr.edu
1. Overview
Our approach to the difficult problem of selecting only those documents which satisfy a user's
specified information need is to pay parallel attention to two important aspects of the task.
First, many documents have no real likelihood of being relevant to either a standing query or a
query newly put to the system. These documents should be filtered from further consideration at
an early stage of the system's processing, particularly when the later processing is
computationally expensive and when their presence introduces unnecessary ambiguity, so that
their removal produces more accurate results. This focusing process should continue at
subsequent stages, using additional linguistic features of the query and documents to further
refine the flow of documents. Second, there is a continuum of levels of linguistic-conceptual
processing which can enrich the original text in order to explicitly represent documents at
more conceptual levels for more accurate matching to queries.
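The staged filtering idea described above can be illustrated with a minimal sketch. This is not
the DR-LINK implementation: the names Document, cheap_overlap_score, adjacent_pair_score, and
cascade are hypothetical, and both scoring functions are simple stand-ins chosen only so that the
example runs. The point it shows is that an inexpensive first stage removes unlikely documents so
that a costlier later stage sees only the survivors.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Document:
    doc_id: str
    text: str


# A stage pairs a scoring function with the minimum score needed to survive it.
Stage = Tuple[Callable[[str, Document], float], float]


def cheap_overlap_score(query: str, doc: Document) -> float:
    """Inexpensive first pass (assumption): fraction of query words found in the document."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    doc_terms = set(doc.text.lower().split())
    return len(query_terms & doc_terms) / len(query_terms)


def adjacent_pair_score(query: str, doc: Document) -> float:
    """Stand-in for costlier linguistic processing (assumption):
    fraction of adjacent query word pairs that also occur adjacently in the document."""
    def pairs(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))

    query_pairs = pairs(query)
    if not query_pairs:
        return 0.0
    return len(query_pairs & pairs(doc.text)) / len(query_pairs)


def cascade(query: str, docs: Iterable[Document], stages: List[Stage]) -> List[Document]:
    """Apply the stages in order, cheapest first; each stage only sees prior survivors."""
    survivors = list(docs)
    for score_fn, threshold in stages:
        survivors = [d for d in survivors if score_fn(query, d) >= threshold]
    return survivors


if __name__ == "__main__":
    collection = [
        Document("d1", "the company acquired a rival firm"),
        Document("d2", "rainfall totals for the region"),
        Document("d3", "firm denies the rival company was acquired"),
    ]
    stages = [(cheap_overlap_score, 1.0), (adjacent_pair_score, 0.5)]
    kept = cascade("company acquired rival", collection, stages)
    print([d.doc_id for d in kept])
```

In this toy collection the first stage discards d2, which shares no words with the query, and the
second stage then separates d1 from d3 using word adjacency. The thresholds are arbitrary values
picked for the illustration.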
Our approach also recognizes that, as reflected by the topic statements, the retrieval task in
TREC requires capabilities beyond what has been required in the past for traditional IR systems.
Topic statements describe not only `aboutness' but also more detailed information such as
relationships among entities, characteristics of participants in an event, and temporality. We
believe that richer representations of documents and topic statements are essential to meet the
extended retrieval requirements of such complex information needs and to reduce the ambiguities
resulting from keyword-based retrieval. To produce this enriched representation, the system
applies lexical, syntactic, semantic, and discourse-level linguistic processing to distill from
documents and topic statements the rich layers of knowledge incorporated in their deceptively
simple textual surface. The final document representation is shaped by all of these levels of
linguistic processing.
To achieve the goals stated above, we have developed a system whose architecture is modular in
design, with five separate processing modules which continuously refine the flow of documents
both in terms of pure numbers and in terms of continual semantic enrichments (see Figure 1).
Briefly previewed, the processing performed by the five modules is as follows:
The Subject Field Coder uses semantic word knowledge to produce a summary-level topical
vector representation of a document's contents that is matched to a vector representation of a
topic statement in order to select for further processing only those documents which have real
potential of being relevant. This subset of documents is then passed to the Text Structurer,
[1] Ken McVearry, Woojin Paik, Ming Li, Edmund Yu, and Chris Khoo contributed to the design,
data analysis, and implementation of the system for TREC-1.