NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
chapter
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
representations of sentences.
Introduction of compound terms also complicates
the task of discovery of various semantic relationships
among them, including synonymy and subsumption. For
example, the term natural language can be considered, in
certain domains at least, to subsume any term denoting a
specific human language, such as English. Therefore, a
query containing the former may be expected to retrieve
documents containing the latter. The same can be said
about language and English, unless language is in fact a
part of the compound term programming language in
which case the association language - Fortran is
appropriate. This is a problem because (a) it is a standard
practice to include both simple and compound terms in
document representation, and (b) term associations have
thus far been computed primarily at word level (includ-
ing fixed phrases) and therefore care must be taken when
such associations are used in term matching. This may
prove particularly troublesome for systems that attempt
term clustering in order to create "meta-terms" to be used
in document representation.
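To make this context dependence concrete, the short Python sketch below checks whether a word-level association is licensed by a compound term actually present in the text. It is an illustration only: the association table and the licensed function are hypothetical stand-ins, not the system's actual data structures.

    # Illustrative sketch: a word-level association such as
    # "language" -> "Fortran" should fire only when "language" appears
    # inside a compound term that licenses it, e.g. "programming language".
    word_associations = {
        ("language", "Fortran"): {"programming language"},
        ("language", "English"): {"natural language"},
    }

    def licensed(word, candidate, compound_terms):
        """True if the association word -> candidate is licensed by a
        compound term actually present in the document or query."""
        contexts = word_associations.get((word, candidate), set())
        return any(c in compound_terms for c in contexts)

    print(licensed("language", "Fortran", {"programming language"}))  # True
    print(licensed("language", "Fortran", {"natural language"}))      # False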
The system presented here computes term associa-
tions from text at word and fixed phrase level and then
uses these associations in query expansion. A fairly
primitive filter is employed to separate synonymy and
subsumption relationships from others including anto-
nymy and complementation, some of which are strongly
domain-dependent. This process has led to an increased
retrieval precision in experiments with both ad-hoc and
routing queries for TREC-1 and TREC-2.
However, the actual improvement levels can vary sub-
stantially between different databases, types of runs (ad-
hoc vs. routing), as well as the degree of prior processing
of the queries. We continue to study more advanced
clustering methods along with the changes in interpreta-
tion of resulting associations, as signaled in the previous
paragraph.
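The following Python sketch gives a rough picture of this filtering-plus-expansion step. The relation labels, threshold, and function names are assumed for illustration and do not reflect the system's actual implementation.

    # Sketch: keep only similarity links classified as synonymy or
    # subsumption, then use the survivors to expand a query.
    ADMISSIBLE = {"synonymy", "specialization", "generalization"}

    def filter_links(links):
        """links: iterable of (term, related_term, relation, strength)."""
        return [l for l in links if l[2] in ADMISSIBLE]

    def expand_query(query_terms, links, threshold=0.5):
        expanded = set(query_terms)
        for term, related, relation, strength in filter_links(links):
            # Only expand terms already in the query, and only through
            # sufficiently strong, admissible links.
            if term in query_terms and strength >= threshold:
                expanded.add(related)
        return expanded

    links = [
        ("illegal", "unlawful", "synonymy", 0.8),
        ("legal", "illegal", "antonymy", 0.9),   # rejected by the filter
    ]
    print(expand_query({"illegal", "activity"}, links))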
In the remainder of this paper we discuss particu-
lars of the present system and some of the observations
made while processing TREC-2 data. The above com-
ments will provide the background for situating our
present effort and the state-of-the-art with respect to
where we should be in the future.
OVERALL DESIGN
Our information retrieval system consists of a trad-
itional statistical backbone (NIST's PRISE system; Har-
man and Candela, 1989) augmented with various natural
language processing components that assist the system in
database processing (stemming, indexing, word and
phrase clustering, selectional restrictions), and translate a
user's information request into an effective query. This
design is a careful compromise between purely statistical
non-linguistic approaches and those requiring rather
accomplished (and expensive) semantic analysis of data,
often referred to as `conceptual retrieval'.
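As a rough illustration of how such a hybrid can be put together (the functions below are simplified placeholders, not the PRISE interfaces or the actual NLP modules), linguistic preprocessing can be viewed as a stage that feeds both words and phrases into an otherwise standard inverted index:

    def extract_phrases(tokens):
        # Naive adjacent-pair "phrases" stand in for parse-tree phrases.
        return [" ".join(p) for p in zip(tokens, tokens[1:])]

    def nlp_preprocess(text):
        tokens = text.lower().split()      # stand-in for stemming/tagging
        return tokens + extract_phrases(tokens)

    def index_document(doc_id, text, inverted_index):
        # The inverted index stands in for the statistical backbone.
        for term in nlp_preprocess(text):
            inverted_index.setdefault(term, set()).add(doc_id)

    index = {}
    index_document("d1", "natural language text retrieval", index)
    print(sorted(index))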
In our system the database text is first processed
with a fast syntactic parser. Subsequently certain types of
phrases are extracted from the parse trees and used as
compound indexing terms in addition to single-word
terms. The extracted phrases are statistically analyzed as
syntactic contexts in order to discover a variety of simi-
larity links between smaller subphrases and words occur-
ring in them. A further filtering process maps these simi-
larity links onto semantic relations (generalization, spe-
cialization, synonymy, etc.) after which they are used to
transform a user's request into a search query.
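A minimal sketch of the similarity step follows, under the simplifying assumption that a word's syntactic contexts are the other words of the phrases it occurs in and that similarity is measured by a plain cosine; the system's actual statistics may differ.

    import math
    from collections import defaultdict

    def context_vectors(phrases):
        # Each word is described by the other words it co-occurs with
        # inside extracted phrases (its syntactic contexts).
        vectors = defaultdict(lambda: defaultdict(int))
        for phrase in phrases:
            words = phrase.split()
            for w in words:
                for other in words:
                    if other != w:
                        vectors[w][other] += 1
        return vectors

    def cosine(u, v):
        shared = set(u) & set(v)
        num = sum(u[c] * v[c] for c in shared)
        den = math.sqrt(sum(x * x for x in u.values())) * \
              math.sqrt(sum(x * x for x in v.values()))
        return num / den if den else 0.0

    phrases = ["illegal activity", "unlawful activity", "criminal activity"]
    vecs = context_vectors(phrases)
    print(cosine(vecs["illegal"], vecs["unlawful"]))  # shared context: "activity"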
The user's natural language request is also parsed,
and all indexing terms occurring in it are identified. Cer-
tain highly ambiguous, usually single-word terms may be
dropped, provided that they also occur as elements in
some compound terms. For example, "natural" is deleted
from a query already containing "natural language"
because "natural" occurs in many unrelated contexts:
"natural numher", "natural logarithin", "natural
approach", etc. At the same time, other terms may he
added, namely those which are linked to some query
term through admissible similarity relations. For exam-
ple, "unlawful activity" is added to a query [OCRerr]EC topic
055) containing the compound term "illegal activity via
a synonymy link hetween "illegal" and "unlawful".
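The deletion step can be pictured roughly as follows; the ambiguity list and the coverage test are illustrative stand-ins for whatever criteria the system actually applies.

    # Sketch: drop a highly ambiguous single-word term when it already
    # occurs inside a compound term of the query ("natural" is covered
    # by "natural language").
    AMBIGUOUS = {"natural", "general", "high"}   # illustrative list only

    def prune_query(terms):
        compounds = [t for t in terms if " " in t]
        pruned = []
        for t in terms:
            covered = any(t in c.split() for c in compounds)
            if t in AMBIGUOUS and covered:
                continue      # its content is carried by the compound term
            pruned.append(t)
        return pruned

    print(prune_query(["natural", "natural language", "retrieval"]))
    # -> ['natural language', 'retrieval']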
One of the striking observations made during the
course of TREC-2 was to note that removing low-quality
terms from the queries is at least as important (and often
more so) as adding synonyms and specializations. In
some instances (e.g., routing runs) low-quality terms had
to be removed (or inhibited) before similar terms could
be added to the query or else the effect of query expan-
sion was all but drowned out by the increased noise.¹
After the final query is constructed, the database
search follows, and a ranked list of documents is
returned. It should be noted that all the processing steps,
those performed by the backbone system, and those per-
formed by the natural language processing components,
are fully automated, and no human intervention or
manual encoding is required.
FAST PARSING WITH TTP PARSER
TTP (Tagged Text Parser) is based on the Linguis-
tic String Grammar developed by Sager (1981). The
parser currently encompasses some 400 grammar pro-
ductions, but it is by no means complete. The parser's
output is a regularized parse tree representation of each
¹ We would like to thank Donna Harman for turning our attention
to the importance of term weighting schemes, including term deletion.