SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
chapter
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
sentence, that is, a representation that reflects the
sentence's logical predicate-argument structure. For
example, logical subject and logical object are identified
in both passive and active sentences, and noun phrases
are organized around their head elements. The parser is
equipped with a powerful skip-and-fit recovery mechanism
that allows it to operate effectively in the face of
ill-formed input or under severe time pressure. In the runs
with approximately 130 million words of TREC's Wall
Street Journal and San Jose Mercury texts,2 the parser's
speed averaged between 0.3 and 0.5 seconds per sentence,
or up to 70 words per second, on a Sun SparcStation.
In addition, TTP has been shown to produce
parse structures which are no worse than those generated
by full-scale linguistic parsers when compared to
hand-coded Treebank parse trees.
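As a rough consistency check on the reported speed, the corpus figures quoted here and in footnote 2 can be combined. The calculation below assumes exactly 130 million words and 6 million sentences, whereas the text gives both only as approximations, so the results are indicative only.

```latex
\[
\frac{130 \times 10^{6}\ \text{words}}{6 \times 10^{6}\ \text{sentences}} \approx 21.7\ \text{words per sentence},
\qquad
\frac{21.7}{0.3\ \text{s}} \approx 72\ \text{words/s},
\qquad
\frac{21.7}{0.5\ \text{s}} \approx 43\ \text{words/s}.
\]
```

Under these assumptions the faster end of the range agrees with the stated figure of up to 70 words per second.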
TTP is a full grammar parser, and initially it
attempts to generate a complete analysis for each sentence.
However, unlike an ordinary parser, it has a built-in
timer which regulates the amount of time allowed for
parsing any one sentence. If a parse is not returned
before the allotted time elapses, the parser enters the
skip-and-fit mode in which it will try to "fit" the parse.
While in the skip-and-fit mode, the parser will attempt to
forcibly reduce incomplete constituents, possibly skipping
portions of input in order to restart processing at the
next unattempted constituent. In other words, the parser
will favor reduction over backtracking while in the
skip-and-fit mode. The result of this strategy is an
approximate parse, partially fitted using top-down predictions.
The fragments skipped in the first pass are not thrown
out; instead, they are analyzed by a simple phrasal parser
that looks for noun phrases and relative clauses and then
attaches the recovered material to the main parse structure.
Full details of the TTP parser have been described in
the TREC-1 report (Strzalkowski, 1993a), as well as in
other works (Strzalkowski, 1992; Strzalkowski &
Scheyen, 1993).
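The timer-driven switch between the two modes can be sketched in Python as follows. This is only an illustration of the control flow described above, not the parser's actual implementation: the function names, the cooperative deadline check, and both toy parsers are assumptions made for the example, and the real recovery mode (forced reduction, top-down fitting, phrasal re-analysis of skipped fragments) is far richer than the stand-in shown here.

```python
import time
from typing import Callable, List, Optional, Tuple

Parse = List[str]  # placeholder for a real parse structure (assumption)


def parse_with_time_limit(
    tokens: List[str],
    full_parse: Callable[[List[str], float], Optional[Parse]],
    skip_and_fit: Callable[[List[str]], Parse],
    time_limit_s: float,
) -> Tuple[str, Parse]:
    """Try a full parse first; fall back to the approximate mode on timeout."""
    deadline = time.monotonic() + time_limit_s
    result = full_parse(tokens, deadline)
    if result is not None:
        return "full", result
    return "skip-and-fit", skip_and_fit(tokens)


def toy_full_parse(tokens: List[str], deadline: float) -> Optional[Parse]:
    """Hypothetical stand-in for a full parser.

    It checks the clock before each token and gives up (returns None)
    once the deadline has passed.
    """
    analysis = []
    for token in tokens:
        if time.monotonic() >= deadline:
            return None
        analysis.append(token)
    return analysis


def toy_skip_and_fit(tokens: List[str]) -> Parse:
    """Hypothetical stand-in for the recovery mode.

    It always returns some analysis; here it simply skips
    non-alphabetic tokens instead of failing on them.
    """
    return [token for token in tokens if token.isalpha()]
```

Calling `parse_with_time_limit(tokens, toy_full_parse, toy_skip_and_fit, 0.0)` forces the fallback path, while a generous limit lets the full parse finish. The point of the design is that every sentence yields some analysis within a bounded time.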
As may be expected, the skip-and-fit strategy will
only be effective if the input skipping can be performed
with a degree of determinism. This means that most of
the lexical-level ambiguity must be removed from the
input text prior to parsing. We achieve this using a
stochastic part-of-speech tagger to preprocess the text (see
the TREC-1 report for details). For TREC-2 a number of
problems have been corrected in the tagger, including
improper tokenization of input and handling of
abbreviations.
2 Approximately 0.85 GBytes of text, over 6 million sentences.
WORD SUFFIX TRIMMER
Word stemming has been an effective way of
improving document recall since it reduces words to their
common morphological root, thus allowing more suc-
cessful matches. On the other hand, stemming tends to
decrease retrieval precision, if care is not taken to
prevent situations where otherwise unrelated words are
reduced to the same stem. In our system we replaced a
traditional morphological stemmer with a conservative
dictionary-assisted suffix trimmer.3 The suffix trimmer
performs essentially two tasks: (1) it reduces inflected
word forms to their root forms as specified in the dictionary,
and (2) it converts nominalized verb forms (e.g.,
"implementation", "storage") to the root forms of
corresponding verbs (i.e., "implement", "store"). This is
accomplished by removing a standard suffix, e.g.,
"stor+age", replacing it with a standard root ending
("+e"), and checking the newly created word against the
dictionary, i.e., we check whether the new root ("store")
is indeed a legal word.
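The remove-replace-check procedure can be illustrated with a minimal Python sketch. The suffix rules, the minimum-stem-length guard, and the tiny word list below are all assumptions made for the example; the system described here relied on a full dictionary, and its actual rule set is not given in this section.

```python
# Hypothetical toy dictionary; a real system would consult a full lexicon.
DICTIONARY = {"implement", "store", "retrieve", "index", "station", "nation"}

# Each rule pairs a suffix with the root endings to try in its place.
# These rules are illustrative only, not the paper's actual rule set.
NOMINAL_SUFFIXES = [
    ("ation", ["", "e"]),
    ("age", ["", "e"]),
    ("ment", ["", "e"]),
]
INFLECTIONAL_SUFFIXES = [
    ("ies", ["y"]),
    ("es", ["", "e"]),
    ("s", [""]),
    ("ing", ["", "e"]),
    ("ed", ["", "e"]),
]


def trim(word, dictionary=DICTIONARY):
    """Return a dictionary-attested root for `word`, or `word` unchanged."""
    word = word.lower()
    for suffix, endings in NOMINAL_SUFFIXES + INFLECTIONAL_SUFFIXES:
        # Require at least three letters to remain after removing the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            for ending in endings:
                candidate = stem + ending
                # Accept a candidate only if it is a legal word.
                if candidate in dictionary:
                    return candidate
    return word
```

With this toy dictionary, `trim("storage")` yields `"store"` and `trim("implementation")` yields `"implement"`, while `trim("passage")` is left unchanged because neither "pass" nor "passe" is in the word list. That last case shows the conservative behavior: a suffix is stripped only when the dictionary confirms the result.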
HEAD-MODIFIER STRUCTURES
Syntactic phrases extracted from TTP parse trees
are head-modifier pairs. The head in such a pair is a
central element of a phrase (main verb, main noun, etc.),
while the modifier is one of the adjunct arguments of the
head. In the TREC experiments reported here we
extracted head-modifier word and fixed-phrase pairs
only. While TREC databases are large enough to warrant
generation of larger compounds, we were in no position
to verify their effectiveness in indexing. This was largely
because of the tight schedule, but also because of the rapidly
escalating complexity of the indexing process: even with
2-word phrases, compound terms accounted for nearly
88% of all index entries; in other words, including
2-word phrases increased the index size approximately 8
times.
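The two figures are related by simple arithmetic. As a sketch, assume the single-word entries are unchanged when phrases are added, and let p be the fraction of index entries that are compound terms. These symbols are introduced here only for illustration.

```latex
\[
\frac{N_{\text{total}}}{N_{\text{single}}}
  = \frac{1}{1 - p}
  = \frac{1}{1 - 0.88}
  \approx 8.3
\]
```

So an index in which nearly 88% of entries are compound terms is roughly 8 times the size of the single-word index alone.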
Let us consider a specific example from the WSJ
database:
The former Soviet president has been a local hero
ever since a Russian tank invaded Wisconsin.
The tagged sentence is given below, followed by the
regularized parse structure generated by TTP, given in
Figure 1.
The/dt former/jj Soviet/jj president/nn has/vbz
been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt
Russian/jj tank/nn invaded/vbd Wisconsin/np ./per
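For readers who want to work with the word/tag notation used above, the following Python sketch splits such a string into (word, tag) pairs. The function name and the error handling are assumptions made for the example and are not part of the system described here.

```python
from typing import List, Tuple


def read_tagged(text: str) -> List[Tuple[str, str]]:
    """Split whitespace-separated word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash so that a word containing "/" stays intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, tag))
    return pairs


TAGGED = (
    "The/dt former/jj Soviet/jj president/nn has/vbz "
    "been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt "
    "Russian/jj tank/nn invaded/vbd Wisconsin/np ./per"
)

pairs = read_tagged(TAGGED)
nouns = [word for word, tag in pairs if tag == "nn"]
```

Here `nouns` collects the words tagged `nn`, namely "president", "hero" and "tank". These are the kind of tokens a later stage would consider as candidate phrase heads.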
3 Dealing with prefixes is a more complicated matter, since they
may have quite a strong effect upon the meaning of the resulting term,
e.g., un- usually introduces explicit negation.