SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
5000/cd years/nns ago/rb ,/com as/rb
derived/vbn by/in the/di chinese/jj ancients/nns
./per
The tagger which we use to process the input
text prior to parsing is based upon a bi-gram model; it
selects the most likely tag for a word given
co-occurrence probabilities computed from a relatively
small training set.8 While the peak accuracy of the
best-tag option of the tagger is predicted to approach
97% (Meteer et al., 1991), we noted that the actual
performance on unprocessed WSJ text was in fact
somewhat worse. The main problem, it appears, was
frequent mistakes in tokenization of input, especially
in recognizing sentence boundaries. For example,
when a sentence ended with a period but wasn't fol-
lowed by at least two blanks or an end-of-line, this
and the next sentence would be collapsed together.
On the other hand, intra-sentential periods (like those
following abbreviated words) were occasionally
found followed by a new-line character, and the
sentence was split into two. While the parser contains a
provision to deal with the case of collapsed
sentences, the tags were likely to be incorrect. The
following example is typical; note the tagging errors at the
second apostrophe (tagged nn) and at "plans" (tagged vbz).
Gorbachev was running into trouble at home, including the
August coup, "which I thought would be the end of it," Mr.
Costa says. Still, plans to send the tank to the U.S. somehow
moved ahead.
Gorbachev/np was/vbd running/vbg into/in
trouble/nn at/in home/nn ,/com including/vbg
the/di August/np coup/nn ,/com "/apos
which/wdt I/pp thought/vbd would/md be/vb
the/di end/nn of/in it/pp ,/com "/nn Mr/nn ./per
Costa/np says/vbz ./per still/rb ,/com plans/vbz
to/to send/vb the/di tank/nn to/to the/di U.S./np
somehow/rb moved/vbd ahead/rb ./per
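The two tokenization failures described above can be reproduced with a minimal sketch. It assumes a splitter that breaks only after a period followed by at least two blanks or a new-line; the function name naive_split and the regular expression are hypothetical and are not taken from the tagger itself.

```python
import re


def naive_split(text):
    """Split after a period that is followed by 2+ blanks or a new-line."""
    parts = re.split(r"(?<=\.)(?: {2,}|\n)", text)
    return [p.strip() for p in parts if p.strip()]


# Single blanks after the periods: the two sentences are collapsed into one.
collapsed = naive_split("Mr. Costa says. Still, plans moved ahead.")

# A new-line after the abbreviation "Mr.": one sentence is split into two.
split_early = naive_split("He spoke to Mr.\nCosta.  Plans moved ahead.")

print(len(collapsed), len(split_early))
```

A rule of this kind relies on layout alone, which is why it is sensitive to the spacing of unprocessed text.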
WORD SUFFIX TRIMMER
Word stemming has been an effective way of
improving document recall since it reduces words to
their common morphological root, thus allowing
more successful matches. On the other hand, stem-
ming tends to decrease retrieval precision, if care is
not taken to prevent situations where otherwise
unrelated words are reduced to the same stem. In our
8 The program, supplied to us by Bolt Beranek and Newman,
operates in two alternative modes, either selecting a single
most likely tag for each word (best-tag option, the one we use at
present), or supplying a short ranked list of alternatives (Meteer et
al., 1991).
system we replaced a traditional morphological stemmer
with a conservative dictionary-assisted suffix
trimmer.9 The suffix trimmer performs essentially
two tasks: (1) it reduces inflected word forms to their
root forms as specified in the dictionary, and (2) it
converts nominalized verb forms (e.g., "implementa-
tion", "storage") to the root forms of corresponding
verbs (i.e., "implement", "store"). This is accom-
plished by removing a standard suffix, e.g.,
"stor+age", replacing it with a standard root ending
("+e"), and checking the newly created word against
the dictionary, i.e., we check whether the new root
("store") is indeed a legal word, and whether the
original word ("storage") is defined using the new root
("store") or one of its standard inflectional forms
(e.g., "storing"). For example, the following
definitions are excerpted from the Oxford Advanced
Learner's Dictionary (OALD):
storage n [U] (space used for, money paid for)
the storing of goods ...
diversion n [U] diverting...
procession n [C] number of persons, vehicles,
etc moving forward and following each other in
an orderly way.
Therefore, we can reduce "diversion" to "divert" by
removing the suffix "+sion" and adding root form
suffix "+t". On the other hand, "process+ion" is not
reduced to "process".10
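The procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the system's actual code: the toy dictionary holds only the OALD excerpts quoted above, and the suffix rules, the inflection generator, and the names DEFINITIONS, SUFFIX_RULES, inflected_forms and trim are all hypothetical. The sketch includes the definition check as the text describes it.

```python
import re

# Toy dictionary: headword -> definition text. The three definitions are the
# OALD excerpts quoted above; the empty entries only mark legal words.
DEFINITIONS = {
    "storage": "(space used for, money paid for) the storing of goods",
    "diversion": "diverting",
    "procession": "number of persons, vehicles, etc moving forward and "
                  "following each other in an orderly way",
    "store": "",
    "divert": "",
    "process": "",
}

# (suffix to remove, root ending to add). Illustrative rules, tried in order.
SUFFIX_RULES = [("age", "e"), ("sion", "t"), ("ion", "")]


def inflected_forms(root):
    """Return the root plus a few regular inflections (a crude approximation)."""
    base = root[:-1] if root.endswith("e") else root
    return {root, root + "s", base + "ed", base + "ing"}


def trim(word):
    """Reduce a nominalization to its verb root if the dictionary supports it."""
    definition_words = set(re.findall(r"[a-z]+", DEFINITIONS.get(word, "").lower()))
    for suffix, ending in SUFFIX_RULES:
        if not word.endswith(suffix):
            continue
        root = word[: -len(suffix)] + ending
        # Check 1: the candidate root must be a legal word.
        if root not in DEFINITIONS:
            continue
        # Check 2: the original word must be defined using the root
        # or one of its inflected forms.
        if definition_words & inflected_forms(root):
            return root
    return word


print(trim("storage"), trim("diversion"), trim("procession"))
```

With this toy dictionary, "storage" and "diversion" are reduced because their definitions contain "storing" and "diverting", while "procession" is left unchanged because its definition contains no form of "process".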
Earlier experiments with the CACM-3204 collection
showed an improvement in retrieval precision of
6% to 8% over the base system equipped with a standard
morphological stemmer (the SMART stemmer).
Due to time limitations these numbers are not available
for the TREC database at this time.
HEAD-MODIFIER STRUCTURES
Syntactic phrases extracted from TTP parse
trees are head-modifier pairs. The head in such a pair
is a central element of a phrase (main verb, main
noun, etc.), while the modifier is one of the adjunct
arguments of the head. In the TREC experiments
reported here we extracted head-modifier word and
fixed-phrase pairs only. While the TREC WSJ database
is large enough to warrant generation of larger
compounds, we were in no position to verify their
effectiveness in indexing (largely because of the tight
schedule). We discuss some options below.
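As a rough illustration of head-modifier pair extraction, the sketch below walks a simplified phrase structure and emits one pair per modifier. The Phrase class and the function head_modifier_pairs are hypothetical; the actual TTP parse tree format is not described in this section, so the representation here is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Phrase:
    """A phrase reduced to its head word and its modifier phrases (assumed shape)."""
    head: str
    modifiers: list = field(default_factory=list)


def head_modifier_pairs(phrase):
    """Collect (head, modifier) word pairs from a phrase and its sub-phrases."""
    pairs = []
    for modifier in phrase.modifiers:
        pairs.append((phrase.head, modifier.head))
        pairs.extend(head_modifier_pairs(modifier))
    return pairs


# "derived by the chinese ancients", reduced to root forms by hand.
tree = Phrase("derive", [Phrase("ancient", [Phrase("chinese")])])
print(head_modifier_pairs(tree))
```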
9 Dealing with prefixes is a more complicated matter, since
they may have quite a strong effect upon the meaning of the
resulting term, e.g., un- usually introduces explicit negation.
10 Definition checking is not implemented yet.