SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
specified.4 An altogether different situation arises
when the query actually requests that certain
underspecified information be found before a docu-
ment can be judged relevant. In TREC topics (e.g.,
058, as in many others) the following request is com-
monplace: to be relevant, the document will identify
the location of the strike or potential strike. What we
ask the system here is to extract the value of a certain
variable that satisfies certain conditions, i.e., find X
such that location-of-strike(X). It is impossible to
properly evaluate such a query using any kind of
constant-term-based retrieval. What is required, at a
minimum, is a general pattern matching capability
and an appropriately advanced representation of con-
tents (including more careful NLP processes).
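The contrast can be sketched in a few lines of purely illustrative Python: constant-term retrieval can only test whether terms co-occur, while even a simple pattern with a variable slot can bind X in location-of-strike(X). The regular expression, document, and function names below are invented for illustration and are not part of the system described here.

```python
import re

def keyword_match(document, terms):
    # Constant-term retrieval: checks only that the terms co-occur.
    words = set(re.findall(r"\w+", document.lower()))
    return all(t in words for t in terms)

def extract_strike_location(document):
    # Pattern matching: binds the variable X in location-of-strike(X).
    # A toy pattern; a real system would need a far richer grammar.
    m = re.search(r"strike (?:at|in) ([A-Z][A-Za-z]*)", document)
    return m.group(1) if m else None

doc = "Workers staged a strike at Gdansk over unpaid wages."
print(keyword_match(doc, {"strike", "location"}))   # False: "location" never appears
print(extract_strike_location(doc))                 # Gdansk
```

A query asking for documents that "identify the location of the strike" fails under pure term matching (the word "location" need not occur), but the variable-binding pattern recovers the answer.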
In the remainder of this paper we discuss par-
ticulars of the present system and some of the obser-
vations made while processing TREC data. The
above comments will provide the background for
situating our present effort and the state of the art with
respect to where we should be in the future.
OVERALL DESIGN
Our information retrieval system consists of a
traditional statistical backbone (Harman and Candela,
1989) augmented with various natural language pro-
cessing components that assist the system in database
processing (stemming, indexing, word and phrase
clustering, selectional restrictions), and translate a
user's information request into an effective query.
This design is a careful compromise between purely
statistical non-linguistic approaches and those requir-
ing rather accomplished (and expensive) semantic
analysis of data, often referred to as `conceptual
retrieval'. The conceptual retrieval systems, though
quite effective, are not yet mature enough to be con-
sidered in serious information retrieval applications,
the major problems being their extreme inefficiency
and the need for manual encoding of domain
knowledge (Mauldin, 1991). However, as pointed out
in the previous section, a more careful text process-
ing may be required for certain types of requests.
In our system the database text is first pro-
cessed with a fast syntactic parser. Subsequently cer-
tain types of phrases are extracted from the parse
trees and used as compound indexing terms in addi-
tion to single-word terms. The extracted phrases are
statistically analyzed as syntactic contexts in order to
discover a variety of similarity links between smaller
subphrases and words occurring in them. A further
filtering process maps these similarity links onto
semantic relations (generalization, specialization,
synonymy, etc.) after which they are used to
transform the user's request into a search query.
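A minimal, hypothetical sketch of the indexing step just described: head-modifier pairs extracted from parse trees become compound indexing terms alongside single-word terms. The mock parse below merely stands in for TTP output, and the stopword list and term normalization are invented for illustration.

```python
from collections import Counter

# Mock regularized parse: tokens plus (head, modifier) pairs,
# standing in for the phrases extracted from TTP parse trees.
mock_parse = {
    "tokens": ["joint", "venture", "in", "china"],
    "pairs": [("venture", "joint")],  # (head, modifier) after regularization
}

def index_terms(parse):
    stopwords = {"in", "the", "of"}   # illustrative stopword list
    terms = Counter(t for t in parse["tokens"] if t not in stopwords)
    for head, mod in parse["pairs"]:
        terms[f"{mod} {head}"] += 1   # compound indexing term, e.g. "joint venture"
    return terms

print(index_terms(mock_parse))
```

The compound term "joint venture" is indexed in addition to the single-word terms, which is what later allows similarity links between subphrases and whole phrases to be computed.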
The user's natural language request is also
parsed, and all indexing terms occurring in it are
identified. Certain highly ambiguous, usually single-
word terms may be dropped, provided that they also
occur as elements in some compound terms. For
example, "natural" is deleted from a query already
containing "natural language" because "natural"
occurs in many unrelated contexts: "natural number",
"natural logarithm", "natural approach", etc. At the
same time, other terms may be added, namely those
which are linked to some query term through admis-
sible similarity relations. For example, "unlawful
activity" is added to a query (TREC topic 055) con-
taining the compound term "illegal activity" via a
synonymy link between "illegal" and "unlawful".
After the final query is constructed, the database
search follows, and a ranked list of documents is
returned.
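The two query-construction rules above can be sketched as follows. This is a hypothetical simplification: the real system derives its ambiguous-term and synonymy information statistically from the corpus, whereas here both are hard-coded stand-ins.

```python
# Illustrative stand-ins for corpus-derived knowledge:
AMBIGUOUS = {"natural"}             # highly ambiguous single-word terms
SYNONYMS = {"illegal": "unlawful"}  # admissible synonymy links

def build_query(terms):
    compounds = {t for t in terms if " " in t}
    query = set()
    for t in terms:
        # Drop an ambiguous single-word term if a compound already covers it.
        if t in AMBIGUOUS and any(t in c.split() for c in compounds):
            continue
        query.add(t)
    # Add terms reachable through admissible similarity relations.
    for c in list(query):
        for a, b in SYNONYMS.items():
            if a in c.split():
                query.add(c.replace(a, b))
    return query

print(build_query({"natural", "natural language", "illegal activity"}))
```

On the example from the text, "natural" is dropped (it occurs inside "natural language") and "unlawful activity" is added via the synonymy link from "illegal activity", as in TREC topic 055.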
It should be noted that all the processing steps,
those performed by the backbone system and those
performed by the natural language processing com-
ponents, are fully automated, and no human interven-
tion or manual encoding is required.
FAST PARSING WITH TTP PARSER
TTP (Tagged Text Parser) is based on the
Linguistic String Grammar developed by Sager
(1981). The parser currently encompasses some 400
grammar productions, but it is by no means complete.
The parser's output is a regularized parse tree
representation of each sentence, that is, a representa-
tion that reflects the sentence's logical predicate-
argument structure. For example, logical subject and
logical object are identified in both passive and active
sentences, and noun phrases are organized around
their head elements. The significance of this
representation will be discussed below. The parser is
equipped with a powerful skip-and-fit recovery
mechanism that allows it to operate effectively in the
face of ill-formed input or under severe time pres-
sure. In runs with approximately 83 million words
of TREC's Wall Street Journal texts,5 the parser's
speed averaged between 0.45 and 0.5 seconds per
sentence, or up to 2600 words per minute, on a 21
MIPS SparcStation ELC.
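As an illustration of what "regularized" means here, the toy sketch below maps an active and a passive surface form onto the same logical predicate-argument structure. It is not actual TTP output; the function, verb, and names are invented.

```python
def regularize(surface_subject, verb, surface_object, passive=False):
    # In a passive sentence ("Columbia was acquired by Sony") the
    # surface subject is the logical object, so the roles swap back.
    if passive:
        surface_subject, surface_object = surface_object, surface_subject
    return {
        "predicate": verb,
        "logical_subject": surface_subject,
        "logical_object": surface_object,
    }

# "Sony acquired Columbia" vs. "Columbia was acquired by Sony"
active = regularize("Sony", "acquire", "Columbia")
passive = regularize("Columbia", "acquire", "Sony", passive=True)
assert active == passive
print(active)
```

Because both surface forms reduce to the same structure, phrases extracted from parse trees match regardless of voice, which is what makes them usable as compound indexing terms.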
4 This does not mean that the query has to accurately reflect
the user's intentions. We take what we've got and give it our
best try.
5 Approximately 0.5 GBytes of text, over 4 million sen-
tences.