NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Donna K. Harman, National Institute of Standards and Technology
Roger Thompson, OCLC Online Computer Library Center, Inc.

Introduction

We are interested in determining the extent to which the effects of syntactic phrase indexing scale up to large databases. Previous investigations of the hypothesis that syntactic phrase indexing improves retrieval performance were conducted on databases ranging from 250 records to a few thousand records (Dillon and Gray 1983, Fagan 1987, Lewis 1991, Burgin and Dillon 1992). The results have been conflicting or equivocal. However, we believe that the issue is not settled. Technologies for phrase extraction and thesaurus construction have been evolving, and the arguments favoring syntactic phrase indexing have not been convincingly dispelled.

The major argument in favor of phrase indexing is that it enhances precision, for two reasons: words in context are less ambiguous than isolated terms, and documents represented in the index by key phrases indicative of their content, rather than by the sum of their individual terms, have undergone considerable noise reduction. Nevertheless, we believe that there are upper limits to the effectiveness of syntactic phrase indexing. One of our goals in this project is to understand and elucidate these limits.

In keeping with our interest in answering the simple question of whether syntactic phrase indexing scales up, we tested the effectiveness of an existing program for phrase extraction, FASIT (described in Dillon and Gray 1983, Dillon and McDonald 1983, and Burgin and Dillon 1992), in the SMART retrieval environment. The primary advantages of FASIT are that the phrase extraction process is fully automatic, the parse is shallow and time-efficient, and the logic is table-driven and easily modified.
Thus, FASIT represents a kind of lower-bound estimate of what is necessary to enhance retrieval performance through automatic indexing.

FASIT Description

FASIT identifies noun phrases appropriate for indexing by determining the part of speech of each word in the input text. This is done by looking up the word in a dictionary created by assigning tags derived from the Brown Corpus to all entries in the Oxford Advanced Learner's Dictionary. If the word is not found in the dictionary, its part of speech is determined from the word's suffix. Words with more than one part of speech have multiple tag assignments and are eventually disambiguated by examining the tags of the words in the surrounding context.

Once tagging is complete, the concept selection module consults a template to identify index phrases. Concepts in FASIT are a subset of the noun phrases encountered in the input text that are judged by syntactic criteria to be useful for indexing. These include all proper nouns; adjective-noun combinations such as "federal agency"; noun-noun combinations such as "metals technology"; and noun-prepositional phrase combinations such as "maker of furniture", which might be paraphrased as the noun-noun construction "furniture maker". The selected concepts are normalized by eliminating determiners and pronouns, and the head noun is stemmed. Table I shows a portion of a sentence as it passes through the phases of FASIT processing.

Table I -- Stages of FASIT Processing

Input    Tagging    Disambiguation    Selected Concepts
through the major