SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
Let us consider a specific example from WSJ
database:
The former Soviet president has been a local
hero ever since a Russian tank invaded Wisconsin.
The tagged sentence is given below, followed by the
regularized parse structure generated by TTP, given
in Figure 1.
The/dt former/jj Soviet/jj president/nn has/vbz
been/vbn a/dt local/jj hero/nn ever/rb since/in
a/dt Russian/jj tank/nn invaded/vbd
Wisconsin/np ./per
It should be noted that the parser's output is a
predicate-argument structure centered around main
elements of various phrases. In Figure 1, BE is the
main predicate (modified by HAVE) with 2 argu-
ments (subject, object) and 2 adjuncts (adv, sub_ord).
INVADE is the predicate in the subordinate clause
with 2 arguments (subject, object). The subject of
BE is a noun phrase with PRESIDENT as the head
element, two modifiers (FORMER, SOVIET) and a
determiner (THE). From this structure, we extract
head-modifier pairs that become candidates for
compound terms. The following types of pairs are
considered: (1) a head noun and its left adjective or noun
adjunct, (2) a head noun and the head of its right
[assert
  [[perf [HAVE]]
   [[verb [BE]]
    [subject
      [np
        [n PRESIDENT]
        [t_pos THE]
        [adj [FORMER]]
        [adj [SOVIET]]]]
    [object
      [np
        [n HERO]
        [t_pos A]
        [adj [LOCAL]]]]
    [adv EVER]
    [sub_ord
      [SINCE
        [[verb [INVADE]]
         [subject
           [np
             [n TANK]
             [t_pos A]
             [adj [RUSSIAN]]]]
         [object
           [np
             [name [WISCONSIN]]]]]]]]]]
Figure 1. Predicate-argument parse structure.
adjunct, (3) the main verb of a clause and the head of
its object phrase, and (4) the head of the subject
phrase and the main verb. These types of pairs
account for most of the syntactic variants for relating
two words (or simple phrases) into pairs carrying
compatible semantic content. For example, the pair
retrieve+information will be extracted from any of
the following fragments: information retrieval system,
retrieval of information from databases, and
information that can be retrieved by a
user-controlled interactive search process. In the example
at hand, the following head-modifier pairs are
extracted (pairs containing low-content elements,
such as BE and FORMER, or names, such as
WISCONSIN, will be later discarded):
[PRESIDENT,BE]
[PRESIDENT,FORMER]
[PRESIDENT,SOVIET]
[BE,HERO]
[HERO,LOCAL]
[TANK,INVADE]
[TANK,RUSSIAN]
[INVADE,WISCONSIN]
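The extraction step for pair types (1), (3) and (4) can be illustrated with a minimal Python sketch. It is not the TTP pair extractor: the dictionary encoding of the Figure 1 structure, the function names, and the LOW_CONTENT and NAMES sets are assumptions made for illustration, with the set contents taken only from the example above. Type (2), right adjuncts, is omitted because the example sentence contains none.

```python
# Hypothetical, simplified encoding of the Figure 1 parse structure.
# Determiners and the adverb are omitted since they produce no pairs.
def noun_phrase(head, *modifiers):
    return {"head": head, "adj": list(modifiers)}

CLAUSE = {
    "verb": "BE",
    "subject": noun_phrase("PRESIDENT", "FORMER", "SOVIET"),
    "object": noun_phrase("HERO", "LOCAL"),
    "sub_ord": [{
        "verb": "INVADE",
        "subject": noun_phrase("TANK", "RUSSIAN"),
        "object": noun_phrase("WISCONSIN"),
    }],
}

# Assumed filter lists, populated only from the example in the text.
LOW_CONTENT = {"BE", "FORMER"}
NAMES = {"WISCONSIN"}

def modifier_pairs(phrase):
    # Type (1): head noun paired with each left adjective or noun adjunct.
    return [(phrase["head"], m) for m in phrase["adj"]]

def extract_pairs(clause):
    pairs = []
    verb = clause["verb"]
    subj = clause.get("subject")
    obj = clause.get("object")
    if subj:
        pairs.append((subj["head"], verb))   # type (4): subject head + verb
        pairs.extend(modifier_pairs(subj))
    if obj:
        pairs.append((verb, obj["head"]))    # type (3): verb + object head
        pairs.extend(modifier_pairs(obj))
    for sub in clause.get("sub_ord", []):    # recurse into subordinate clauses
        pairs.extend(extract_pairs(sub))
    return pairs

def discard_weak_pairs(pairs):
    # Drop any pair containing a low-content element or a name.
    blocked = LOW_CONTENT | NAMES
    return [p for p in pairs if not (set(p) & blocked)]

all_pairs = extract_pairs(CLAUSE)
kept_pairs = discard_weak_pairs(all_pairs)
```

Under these assumptions, all_pairs reproduces the eight pairs listed above in the same order, and kept_pairs retains only the four pairs that contain neither a low-content element nor a name.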
We may note that the three-word phrase former
Soviet president has been broken into two pairs,
former president and Soviet president, both of which
denote things that are potentially quite different from
what the original phrase refers to, and this fact may
have a potentially negative effect on retrieval
precision. This is one place where a longer phrase appears
more appropriate. A further example is shown in
Figure 2.¹¹ One difficulty in obtaining head-modifier
pairs of highest accuracy is the notorious ambiguity
of nominal compounds. For example, the phrase
natural language processing should generate
language+natural and processing+language, while
dynamic information processing is expected to yield
processing+dynamic and processing+information.
Since our parser has no knowledge about the text
domain, and uses no semantic preferences, it does not
attempt to guess any internal associations within such
phrases. Instead, this task is passed to the pair extrac-
tor module which processes ambiguous parse struc-
tures in two phases. In phase one, all and only unam-
biguous head-modifier pairs are extracted, and the
frequencies of their occurrences are recorded. In
phase two, frequency information about pairs gen-
erated in the first pass is used to form associations
from ambiguous structures. For example, if
¹¹ Note that working with the parsed text ensures a degree of
precision in capturing the meaningful phrases, which is especially
evident when compared with the results usually obtained from
either unprocessed or only partially processed text (Lewis and Croft,
1990). Note also that names, pronouns and dummy verbs are not
allowed to create pairs.
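The two-phase treatment of ambiguous nominal compounds described in the main text can be sketched in Python as follows. This is only an illustration under stated assumptions: the text says that phase-one frequencies are used to form associations, but it does not give the decision rule, so the comparison of the two candidate counts, the handling of ties, the function name resolve_compound, and the sample counts are all hypothetical.

```python
from collections import Counter

# Phase one (assumed sample data): head+modifier pairs collected from
# unambiguous structures, with their occurrence counts.
unambiguous_pairs = [
    ("language", "natural"), ("language", "natural"),
    ("processing", "dynamic"),
    ("processing", "information"),
]
freq = Counter(unambiguous_pairs)

def resolve_compound(w1, w2, w3, freq):
    """Phase two for a three-word compound 'w1 w2 w3' headed by w3.

    The pair (w3, w2) holds under either bracketing, so it is always kept.
    The first word attaches either to w2 or to w3; the reading with the
    higher phase-one count is chosen (an assumed rule). On a tie, no
    attachment is guessed for w1.
    """
    attach_to_middle = (w2, w1)   # reading [[w1 w2] w3]
    attach_to_head = (w3, w1)     # reading [w1 [w2 w3]]
    pairs = [(w3, w2)]
    if freq[attach_to_middle] > freq[attach_to_head]:
        pairs.insert(0, attach_to_middle)
    elif freq[attach_to_head] > freq[attach_to_middle]:
        pairs.insert(0, attach_to_head)
    return pairs

nlp_pairs = resolve_compound("natural", "language", "processing", freq)
dip_pairs = resolve_compound("dynamic", "information", "processing", freq)
```

With the sample counts above, nlp_pairs corresponds to language+natural and processing+language, and dip_pairs to processing+dynamic and processing+information, matching the two outcomes the text expects.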