NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

NATURAL LANGUAGE PROCESSING IN LARGE-SCALE TEXT RETRIEVAL TASKS

Tomek Strzalkowski
Courant Institute of Mathematical Sciences
New York University
715 Broadway, rm. 704
New York, NY 10003
tomek@cs.nyu.edu

ABSTRACT

We developed a prototype text retrieval system which uses advanced natural language processing techniques to enhance the effectiveness of keyword-based document retrieval. The backbone of our system is a traditional statistical engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process the user's natural language requests into effective search queries. For the present TREC effort, a total of 500 MBytes of Wall Street Journal articles was processed in two batches of 250 MBytes each. Due to time and space limits, two separate inverted indexes were produced, one for each half of the data, with a partial concept hierarchy built from the first 250 MBytes only but used for retrieval on either half. Retrievals were performed independently on both halves of the database and the partial results were merged to produce the final rankings.

INTRODUCTION

A typical information retrieval (IR) task is to select documents from a database in response to a user's query, and rank these documents according to relevance.
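The select-and-rank pipeline described above can be illustrated with a minimal sketch: build an inverted index mapping terms to document ids, then score each document by the number of query terms it contains. This is only a toy illustration of the general mechanism, not the authors' system; the sample documents and the bare term-count scoring are invented for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def rank(query, docs, index):
    """Score documents by the count of matching query terms, best first."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical three-document collection.
docs = {
    1: "joint venture announced by two firms",
    2: "the venture failed after a year",
    3: "stock market report",
}
index = build_inverted_index(docs)
print(rank("joint venture", docs, index))  # document 1 matches both terms
```

A real engine would weight terms (e.g. by idf) rather than count raw matches, but the index structure and the match-then-rank flow are the same.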
This has usually been accomplished using statistical methods (often coupled with manual encoding) that (a) select terms (words, phrases, and other units) from documents that are deemed to best represent their contents, and (b) create an inverted index file (or files) that provides easy access to documents containing these terms. A subsequent search process will attempt to match a preprocessed user query (or queries) against term-based representations of documents, in each case determining a degree of relevance between the two which depends upon the number and types of matching terms. Although many sophisticated search and matching methods are available, the crucial problem remains that of an adequate representation of contents for both the documents and the queries.

The simplest word-based representations of contents are usually inadequate, since single words are rarely specific enough for accurate discrimination, and their grouping is often accidental. A better method is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, joint venture is an important term in the Wall Street Journal (WSJ henceforth) database, while neither joint nor venture is important by itself. In the retrieval experiments with the training TREC database, we noticed that both joint and venture were dropped from the list of terms by the system because their idf (inverted document frequency) weights were too low. In large databases, such as TIPSTER, the use of phrasal terms is not just desirable, it becomes necessary.

The question thus becomes, how to identify the correct phrases in the text? Both statistical and syntactic methods have been used before with only limited success. Statistical methods based on word co-occurrences and mutual information are prone to high error rates, turning out many unwanted associations.
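The idf effect described above can be made concrete with a small sketch. The counts below are invented for illustration, but they show the mechanism: words like joint and venture that occur in many documents receive a low idf (log of collection size over document frequency) and may fall below a cutoff, while the phrase joint venture, indexed as a single term, is rarer and scores higher.

```python
import math

def idf(term, doc_term_sets):
    """Inverse document frequency log(N/df); higher means rarer and more discriminating."""
    n = len(doc_term_sets)
    df = sum(1 for terms in doc_term_sets if term in terms)
    return math.log(n / df) if df else 0.0

# Hypothetical collection: each document is its set of indexed terms,
# with the phrase "joint venture" indexed as a single phrasal term.
docs = [
    {"joint", "venture", "joint venture", "firms"},
    {"joint", "session", "congress"},
    {"venture", "capital", "fund"},
    {"joint", "venture", "joint venture", "profit"},
]
for t in ("joint", "venture", "joint venture"):
    print(t, round(idf(t, docs), 3))
# The single words appear in 3 of 4 documents (idf = log(4/3));
# the phrase appears in only 2 (idf = log 2), so it survives a cutoff
# that would drop both of its constituent words.
```

This is why, on a large collection, phrasal terms carry discriminating power that their component words lack.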
Syntactic methods suffered from the low quality of generated parse structures, which could be attributed to limited-coverage grammars and the lack of adequate lexicons. In fact, the difficulties encountered in applying computational linguistics technologies to text processing have contributed to a widespread belief that automated natural language processing may not be suitable in IR. These difficulties included inefficiency, lack of robustness, and the prohibitive cost of the manual effort required to build lexicons and knowledge bases for each new text domain. On the other hand, while numerous experiments did not establish the usefulness of linguistic methods in IR, they cannot be considered conclusive because of their