SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Recent Developments in Natural Language Text Retrieval chapter T. Strzalkowski J. Carballo National Institute of Standards and Technology D. K. Harman

RECENT DEVELOPMENTS IN NATURAL LANGUAGE TEXT RETRIEVAL

Tomek Strzalkowski and Jose Perez Carballo
Courant Institute of Mathematical Sciences
New York University
715 Broadway, rm. 704
New York, NY 10003
tomek@cs.nyu.edu

ABSTRACT

This paper reports on some recent developments in our natural language text retrieval system. The system uses advanced natural language processing techniques to enhance the effectiveness of term-based document retrieval. The backbone of our system is a traditional statistical engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process the user's natural language requests into effective search queries. For the present TREC-2 effort, a total of 550 MBytes of Wall Street Journal articles (ad-hoc queries database) and 300 MBytes of San Jose Mercury articles (routing data) have been processed. In terms of text quantity this represents approximately 130 million words of English. Unlike in TREC-1, we were able to create a single compound index for each database, and therefore avoid merging of results. While the general design of the system has not changed since the TREC-1 conference, we nonetheless replaced several components and added a number of new features, which are described in the present paper.

INTRODUCTION

A typical information retrieval (IR) task is to select documents from a database in response to a user's query, and rank these documents according to relevance.
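The statistical backbone described above, an inverted index built from preprocessed documents and then searched against user queries, can be sketched minimally as follows. This is an illustrative toy, not the authors' implementation: the function names, the toy document collection, and the idf-based scoring are all assumptions introduced for the example.

```python
from collections import defaultdict
import math

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def rank(query, docs, index):
    """Score each document by the summed idf of the query terms it matches."""
    n = len(docs)
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, set())
        if not postings:
            continue
        idf = math.log(n / len(postings))  # rarer terms weigh more
        for doc_id in postings:
            scores[doc_id] += idf
    # highest-scoring documents first
    return sorted(scores.items(), key=lambda kv: -kv[1])

# hypothetical miniature collection
docs = {
    1: "joint venture with a bank",
    2: "the bank opened a branch",
    3: "a new joint venture was formed",
}
index = build_inverted_index(docs)
print(rank("joint venture", docs, index))
```

In a real engine the postings would also carry term frequencies and positions, and the index would live on disk; the sketch only shows the match-and-rank cycle the paper refers to.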
This has usually been accomplished using statistical methods (often coupled with manual encoding) that (a) select terms (words, phrases, and other units) from documents that are deemed to best represent their content, and (b) create an inverted index file (or files) that provides easy access to documents containing these terms. A subsequent search process will attempt to match a preprocessed user query (or queries) against term-based representations of documents, in each case determining a degree of relevance between the two which depends upon the number and types of matching terms. Although many sophisticated search and matching methods are available, the crucial problem remains that of an adequate representation of content for both the documents and the queries.

The simplest word-based representations of content are usually inadequate, since single words are rarely specific enough for accurate discrimination, and their grouping is often accidental. A better method is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, joint venture is an important term in the Wall Street Journal (WSJ henceforth) database, while neither joint nor venture is important by itself. In retrieval experiments with the training TREC database, we noticed that both joint and venture were dropped from the list of terms by the system because their idf (inverted document frequency) weights were too low. In large databases, such as TIPSTER, the use of phrasal terms is not just desirable, it becomes necessary.

An accurate syntactic analysis is an essential prerequisite for the selection of phrasal terms. Various statistical methods, e.g., those based on word co-occurrences and mutual information, as well as partial parsing techniques, are prone to high error rates (sometimes as high as 50%), turning out many unwanted associations.
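The joint venture example can be made concrete with a small idf calculation. The sketch below shows how common single words can fall below a weight cutoff while the rarer compound term survives; the toy collection, the `joint_venture` token spelling, and the 0.5 threshold are all hypothetical choices for illustration, not values from the paper's system.

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / document frequency)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

# toy collection: 'joint' and 'venture' each occur in most documents,
# but the compound term 'joint_venture' is rarer and hence more selective
docs = [
    {"joint", "venture", "joint_venture", "bank"},
    {"joint", "session", "congress"},
    {"venture", "capital", "fund"},
    {"joint", "venture", "joint_venture", "deal"},
]

THRESHOLD = 0.5  # hypothetical cutoff below which terms are dropped
for term in ["joint", "venture", "joint_venture"]:
    w = idf(term, docs)
    status = "kept" if w >= THRESHOLD else "dropped"
    print(f"{term}: idf={w:.2f} ({status})")
```

Here both single words score log(4/3) ≈ 0.29 and fall under the cutoff, while the phrase scores log(4/2) ≈ 0.69 and is retained, mirroring the behavior the paper observed on the training TREC database.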
Therefore a good, fast parser is necessary, but it is by no means sufficient. While syntactic phrases are often better indicators of content than 'statistical phrases' -- where words are grouped solely on the basis of physical proximity (e.g., "college junior" is not the same as "junior college") -- the creation of compound terms makes the term-matching process more complex, since in addition to the usual problems of synonymy and subsumption, one must deal with their structure (e.g., "college junior" is the same as "junior in college"). In order to deal with structure, the parser's output needs to be "normalized" or "regularized" so that complex terms with the same or closely related meanings would indeed receive matching representations. This goal has been achieved to a certain extent in the present work. As will be discussed in more detail below, indexing terms were selected from among head-modifier pairs extracted from predicate-argument