NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Donna K. Harman, National Institute of Standards and Technology
TIPSTER Panel

DR-LINK's Linguistic-Conceptual Approach to Document Detection¹

Elizabeth D. Liddy
Sung H. Myaeng

School of Information Studies
Syracuse University
Syracuse, New York 13244-4100
liddy@mailbox.syr.edu; shmyaeng@mailbox.syr.edu

1. Overview

Our approach to the difficult problem of selecting only those documents which satisfy a user's specified information need is to pay parallel attention to two very important aspects of the task. Firstly, there are many documents which have little likelihood of being relevant to either a standing query or a query newly put to the system. These documents should be filtered from further consideration at an early stage in the system's processing, both because the system's later processing is computationally expensive and because their presence introduces unnecessary ambiguity, while their removal produces more accurate results. This focusing process should continue at subsequent stages, using additional linguistic features of the query and documents to further refine the flow of documents. Secondly, there is a continuum of levels of linguistic-conceptual processing which can enrich the original text so that documents are explicitly represented at more conceptual levels for more accurate matching to queries.

Our approach also recognizes that, as reflected by the topic statements, the retrieval task in TREC requires capabilities beyond what has been required in the past of traditional IR systems. Topic statements describe not only 'aboutness' but also more detailed information such as relationships among entities, characteristics of participants in an event, and temporality.
We believe that richer representations of documents and topic statements are essential to meet the extended retrieval requirements of such complex information needs and to reduce the ambiguities resulting from keyword-based retrieval. To produce this enriched representation, the system uses lexical, syntactic, semantic, and discourse linguistic processing techniques to distill from documents and topic statements all the rich layers of knowledge incorporated in their deceptively simple textual surface, and produces a final document representation which has been shaped by all these levels of linguistic processing.

To achieve the goals stated above, we have developed a system whose architecture is modular in design, with five separate processing modules which continuously refine the flow of documents, both in terms of pure numbers and in terms of continual semantic enrichment (see Figure 1). Briefly previewed, the processing in the five modules is as follows:

The Subject Field Coder uses semantic word knowledge to produce a summary-level topical vector representation of a document's contents, which is matched to a vector representation of a topic statement in order to select for further processing only those documents which have real potential of being relevant. This subset of documents is then passed to the Text Structurer,

¹ Ken McVearry, Woojin Paik, Ming Li, Edmund Yu, and Chris Khoo contributed to the design, data analysis, and implementation of the system for TREC-1.