SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
OCLC Online Computer Library Center, Inc.
chapter
R. Thompson
National Institute of Standards and Technology
Donna K. Harman
Roger Thompson
Introduction
We are interested in determining the extent to which the effects of
syntactic phrase indexing scale up to large databases. Previous
investigations of the hypothesis that syntactic phrase indexing leads to
improvements in retrieval performance were conducted on databases ranging
from 250 records to a few thousand records (Dillon and Gray 1983, Fagan
1987, Lewis 1991, Burgin and Dillon 1992). The results have been
conflicting or equivocal. However, we believe that the issue isn't
settled. Technologies for phrase extraction and thesaurus construction
have been evolving, and the arguments favoring syntactic phrase indexing
have not been convincingly dispelled.
The major argument in favor of phrase indexing is that it enhances
precision: words in context are less ambiguous than isolated terms, and
documents represented in indexes by key phrases indicative of their
content, rather than by the sum of their individual terms, have
undergone considerable noise reduction. Nevertheless, we believe that
there are upper limits to the effectiveness of syntactic phrase indexing.
One of our goals in this project is to understand and elucidate these limits.
In keeping with our interest in answering the simple question of whether
syntactic phrase indexing scales up, we tested the effectiveness of an
existing program for phrase extraction, FASIT (described in Dillon and
Gray 1983, Dillon and McDonald 1983 and Burgin and Dillon 1992), in the
SMART retrieval environment. The primary advantages of FASIT are that the
phrase extraction process is fully automatic, the parse is shallow and
time-efficient, and the logic is table-driven and easily modified. Thus,
FASIT represents a kind of lower-bound estimate of what is necessary to
enhance retrieval performance through automatic indexing.
FASIT Description
FASIT identifies noun phrases appropriate for indexing by
determining the part of speech for each word in the input text.
This is done by looking up the word in a dictionary created by
assigning tags derived from the Brown Corpus to all entries in the
Oxford Advanced Learner's Dictionary. If the word is not found in
the dictionary, its part of speech is determined from the word's suffix.
Words with more than one part of speech have multiple tag assignments and
are eventually disambiguated by examining the tags of the words in the
surrounding context.
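The tagging steps described above (dictionary lookup, suffix fallback, then contextual disambiguation) can be illustrated with a minimal sketch. It is not FASIT's implementation: the lexicon, the suffix table, the tag names, the default tag, and the context rules below are all hypothetical stand-ins, and the disambiguation step looks only at the preceding word, which is a simplification of "surrounding context".

```python
# Toy illustration only. LEXICON, SUFFIX_TAGS, and the context rules are
# hypothetical; FASIT's actual dictionary and tables are not given in the text.
LEXICON = {
    "the": {"DET"},
    "federal": {"ADJ"},
    "agency": {"NOUN"},
    "plans": {"NOUN", "VERB"},   # ambiguous: more than one part of speech
    "new": {"ADJ"},
    "rules": {"NOUN", "VERB"},   # ambiguous
}

# Checked in order; the first matching suffix wins.
SUFFIX_TAGS = [("tion", "NOUN"), ("ment", "NOUN"), ("ous", "ADJ"),
               ("ly", "ADV"), ("ing", "VERB")]


def candidate_tags(word):
    """Return all possible tags: dictionary first, then suffix fallback."""
    w = word.lower()
    if w in LEXICON:
        return set(LEXICON[w])
    for suffix, tag in SUFFIX_TAGS:
        if w.endswith(suffix):
            return {tag}
    return {"NOUN"}  # assumed default for unknown words


def tag_sentence(words):
    """Assign one tag per word, resolving ambiguity from the previous tag."""
    resolved = []
    for i, word in enumerate(words):
        tags = candidate_tags(word)
        if len(tags) == 1:
            resolved.append(next(iter(tags)))
            continue
        prev = resolved[i - 1] if i > 0 else None
        if prev in ("DET", "ADJ") and "NOUN" in tags:
            resolved.append("NOUN")       # e.g. "new rules"
        elif prev == "NOUN" and "VERB" in tags:
            resolved.append("VERB")       # e.g. "agency plans"
        else:
            resolved.append(sorted(tags)[0])  # arbitrary but deterministic
    return list(zip(words, resolved))


if __name__ == "__main__":
    print(tag_sentence("The federal agency plans new rules".split()))
```

Keeping the lexicon and suffix rules as plain data mirrors the table-driven design mentioned in the Introduction: behavior changes by editing tables rather than code.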
Once tagging is complete, the concept selection module consults a
template to identify index phrases. Concepts in FASIT are a subset
of the noun phrases encountered in the input text which are judged
by syntactic criteria to be useful for indexing. These include
all proper nouns; adjective-noun combinations such as "federal
agency"; noun-noun combinations such as "metals technology"; and
noun-prepositional phrase combinations such as "maker of furniture",
which might be paraphrased as the noun-noun construction "furniture
maker". The selected concepts are normalized by eliminating determiners
and pronouns, and the head noun is stemmed.
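As a rough illustration of template-driven concept selection and normalization, the sketch below matches tag sequences against a small template table, drops determiners and pronouns from each match, and stems the head noun. The templates, the tag names, the choice of head position, and the `toy_stem` function are assumptions made for this example; the text does not specify FASIT's actual template or stemmer.

```python
# Toy illustration only. TEMPLATES, tag names, and toy_stem are hypothetical.
# Each template is (tag pattern, index of the head noun in the pattern);
# a head index of None means nothing is stemmed (used for proper nouns).
TEMPLATES = [
    (("NOUN", "PREP", "DET", "NOUN"), 0),  # "maker of the furniture"
    (("NOUN", "PREP", "NOUN"), 0),         # "maker of furniture"
    (("ADJ", "NOUN"), 1),                  # "federal agency"
    (("NOUN", "NOUN"), 1),                 # "metals technology"
    (("PROPN",), None),                    # any proper noun
]

DROP_TAGS = {"DET", "PRON"}  # removed during normalization


def toy_stem(word):
    """Strip simple plural endings; a stand-in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def select_concepts(tagged):
    """Scan (word, tag) pairs left to right; emit normalized concepts."""
    concepts = []
    i = 0
    while i < len(tagged):
        for pattern, head in TEMPLATES:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) != pattern:
                continue
            words = []
            for k, (word, tag) in enumerate(window):
                if tag in DROP_TAGS:
                    continue  # eliminate determiners and pronouns
                words.append(toy_stem(word) if k == head else word)
            concepts.append(" ".join(words))
            i += len(pattern)
            break
        else:
            i += 1  # no template starts at this position
    return concepts


if __name__ == "__main__":
    sample = [("the", "DET"), ("federal", "ADJ"), ("agencies", "NOUN"),
              ("regulate", "VERB"), ("makers", "NOUN"), ("of", "PREP"),
              ("the", "DET"), ("furniture", "NOUN"), ("in", "PREP"),
              ("Ohio", "PROPN")]
    print(select_concepts(sample))
```

Longer templates are listed first so that a noun-prepositional phrase is preferred over a shorter match starting at the same word. The sketch does not rewrite "maker of furniture" as "furniture maker"; the text presents that only as a possible paraphrase.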
Table I shows a portion of a sentence as it passes through the major
phases of FASIT processing.
Table I -- Stages of FASIT Processing
Input    Tagging    Disambiguation    Selected Concepts