NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
TTP is a full grammar parser, and initially, it
attempts to generate a complete analysis for each
sentence. However, unlike an ordinary parser, it has a
built-in timer which regulates the amount of time
allowed for parsing any one sentence. If a parse is not
returned before the allotted time elapses, the parser
enters the skip-and-fit mode in which it will try to
"fit" the parse. While in the skip-and-fit mode, the
parser will attempt to forcibly reduce incomplete
constituents, possibly skipping portions of input in
order to restart processing at the next unattempted
constituent. In other words, the parser will favor
reduction over backtracking while in the skip-and-fit mode.
The result of this strategy is an approximate parse,
partially fitted using top-down predictions. The frag-
ments skipped in the first pass are not thrown out;
instead, they are analyzed by a simple phrasal parser
that looks for noun phrases and relative clauses and
then attaches the recovered material to the main parse
structure. As an illustration, consider the following
sentence taken from the CACM-3204 corpus:
The method is illustrated by the automatic con-
struction of both recursive and iterative programs
operating on natural numbers, lists, and
trees, in order to construct a program satisfying
certain specifications a theorem induced by
those specifications is proved, and the desired
program is extracted from the proof.
The italicized fragment is likely to cause additional
complications in parsing this lengthy string, and the
parser may be better off ignoring this fragment alto-
gether. To do so successfully, the parser must close
the currently open constituent (i.e., reduce a program
satisfying certain specifications to NP), and possibly
a few of its parent constituents, removing
corresponding productions from further considera-
tion, until an appropriate production is reactivated.
In this case, TTP may force the following reductions:
S1 → to V NP; SA → S1; S → NP V NP SA, until the
production S → S and S is reached. Next, the parser
skips input to find and, and resumes normal process-
ing.
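The timer-driven control loop described above can be sketched as follows. This is a minimal illustration, not the actual TTP implementation: the function and helper names (parse_with_timer, toy_full_parser, toy_phrasal_parser) are hypothetical, and the "full parser" here merely simulates a timeout on long inputs.

```python
import time

def parse_with_timer(tokens, full_parser, phrasal_parser, budget=0.1):
    # Run the full parser under a time budget; if it does not finish,
    # enter skip-and-fit mode -- simplified here to skipping ahead to
    # the next coordinating conjunction, parsing the remainder, and
    # recovering phrases from the skipped span with a phrasal pass.
    deadline = time.monotonic() + budget
    result = full_parser(tokens, deadline)
    if result is not None:              # complete parse returned in time
        return {"parse": result, "fitted": False}
    cut = tokens.index("and") if "and" in tokens else 0
    return {
        "parse": full_parser(tokens[cut + 1:], time.monotonic() + budget),
        "recovered": phrasal_parser(tokens[:cut]),
        "fitted": True,
    }

def toy_full_parser(tokens, deadline):
    # Stand-in: pretend short inputs parse instantly, long ones time out.
    if len(tokens) > 8:
        return None
    return ("S", list(tokens))

def toy_phrasal_parser(tokens):
    # Crude stand-in for the phrasal pass that recovers noun phrases.
    return [t for t in tokens if t not in {"in", "to", "a"}]

sent = ("in order to construct a program satisfying certain "
        "specifications a theorem is proved and the desired "
        "program is extracted").split()
print(parse_with_timer(sent, toy_full_parser, toy_phrasal_parser))
```

The long clause triggers the fallback path: the parser closes off the material before "and", parses what follows it, and the skipped fragment is chunked separately rather than discarded.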
As may be expected, the skip-and-fit strategy
will only be effective if the input skipping can be per-
formed with a degree of determinism. This means
that most of the lexical level ambiguity must be
removed from the input text, prior to parsing. We
achieve this using a stochastic part-of-speech tagger
to preprocess the text. Full details of the parser can
be found in (Strzalkowski, 1992).
PART OF SPEECH TAGGER
One way of dealing with lexical ambiguity is to
use a tagger to preprocess the input, marking each
word with a tag that indicates its syntactic
categorization: a part of speech with selected mor-
phological features such as number, tense, mode,
case and degree. The following are tagged sentences
from the CACM-3204 collection: 6
The/dt paper/nn presents/vbz a/dt proposal/nn
for/in structured/vbn representation/nn of/in
multiprogramming/vbg in/in a/dt high/jj level/nn
language/nn ./per
The/dt notation/nn used/vbn explicitly/rb
associates/vbz a/dt data/nns structure/nn
shared/vbn by/in concurrent/jj processes/nns
with/in operations/nns defined/vbn on/in it/pp
./per
The tags are understood as follows: dt - determiner,
nn - singular noun, nns - plural noun, in - preposition,
jj - adjective, vbz - verb in present tense third person
singular, to - particle "to", vbg - present participle,
vbn - past participle, vbd - past tense verb, vb -
infinitive verb, cc - coordinate conjunction.
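The "word/tag" annotation format shown in the examples above is straightforward to consume programmatically; a minimal reader (the function name is hypothetical) might look like this:

```python
def read_tagged(sentence):
    # Each token is written "word/tag"; splitting on the last "/"
    # keeps words that themselves happen to contain a slash.
    return [tuple(tok.rsplit("/", 1)) for tok in sentence.split()]

pairs = read_tagged("The/dt paper/nn presents/vbz a/dt proposal/nn")
print(pairs)
# [('The', 'dt'), ('paper', 'nn'), ('presents', 'vbz'), ('a', 'dt'), ('proposal', 'nn')]
```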
Tagging of the input text substantially reduces
the search space of a top-down parser since it
resolves most of the lexical level ambiguities. In the
examples above, tagging of presents as "vbz" in the
first sentence cuts off a potentially long and costly
"garden path" with presents as a plural noun followed
by a headless relative clause starting with (that) a
proposal.... In the second sentence, tagging resolves
ambiguity of used (vbn vs. vbd), and associates (vbz
vs. nns). Perhaps more importantly, elimination of
word-level lexical ambiguity allows the parser to
make projections about the input which is yet to be
parsed, using a simple lookahead; in particular,
phrase boundaries can be determined with a degree
of confidence (Church, 1988). This latter property is
critical for implementing the skip-and-fit recovery tech-
nique outlined in the previous section.
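To see how tags alone can delimit phrase boundaries, consider this crude noun-phrase detector in the spirit of Church (1988). It is only a sketch, not the actual TTP lookahead: an NP here is taken to be an optional determiner, then adjectives, then one or more nouns.

```python
def np_chunks(tagged):
    # Scan a list of (word, tag) pairs and collect maximal spans
    # matching: dt? jj* (nn|nns)+  -- a toy NP boundary detector.
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i + 1 if tagged[i][1] == "dt" else i
        while j < n and tagged[j][1] == "jj":
            j += 1
        k = j
        while k < n and tagged[k][1] in ("nn", "nns"):
            k += 1
        if k > j:                      # at least one noun head found
            chunks.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

sent = ("The/dt paper/nn presents/vbz a/dt proposal/nn for/in "
        "structured/vbn representation/nn of/in multiprogramming/vbg "
        "in/in a/dt high/jj level/nn language/nn ./per")
tagged = [tuple(t.rsplit("/", 1)) for t in sent.split()]
print(np_chunks(tagged))
```

Run on the first tagged example above, this finds "The paper", "a proposal", "representation", and "a high level language" without consulting the words at all, which is the determinism that skip-and-fit relies on.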
Tagging of input also helps to reduce the
number of parse structures that can be assigned to a
sentence, decreases the demand for consulting the
dictionary, and simplifies dealing with unknown
words. Since every item in the sentence is assigned a
tag, so are the words for which we have no entry in
the lexicon. Many of these words will be tagged as
"np" (proper noun); however, the surrounding tags
may force other selections. In the following exam-
ple, chinese, which does not appear in the dictionary,7
is tagged as "jj":
this/dt paper/nn dates/vbz back/rb the/dt
genesis/nn of/in binary/jj conception/nn circa/in
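How surrounding tags can override the default "np" guess for an out-of-vocabulary word can be illustrated with a toy contextual scorer. This is not the actual stochastic tagger: the function name is hypothetical and the bigram probabilities below are invented for illustration only.

```python
def guess_unknown_tag(prev_tag, next_tag, trans, candidates=("np", "jj", "nn")):
    # For an unknown word, score each candidate tag t by
    # P(t | prev_tag) * P(next_tag | t) and take the best.
    return max(candidates,
               key=lambda t: trans.get((prev_tag, t), 1e-4)
                             * trans.get((t, next_tag), 1e-4))

TRANS = {  # invented bigram probabilities, not estimated from data
    ("dt", "jj"): 0.40, ("dt", "nn"): 0.50, ("dt", "np"): 0.05,
    ("jj", "nn"): 0.90, ("nn", "nn"): 0.20, ("np", "nn"): 0.30,
}

# a/dt chinese/?? restaurant/nn -> the dt ... nn context favors "jj"
print(guess_unknown_tag("dt", "nn", TRANS))  # jj
```

Here "jj" wins (0.40 x 0.90) over both "np" (0.05 x 0.30) and "nn" (0.50 x 0.20), mirroring how the tagger settles on "jj" for chinese despite the proper-noun default.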
6 Tagged using the 35-tag Penn Treebank Tagset created at
the University of Pennsylvania.
7 We use the machine readable version of the Oxford Ad-
vanced Learner's Dictionary (OALD).