SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
SENTENCE:
Gorbachev ordered the tank shipped to a plant in the Ukraine,
where seven mechanics worked for three months restoring it.
PARSE STRUCTURE:
[assert
 [[verb [ORDER]]
  [subject
   [np
    [name [GORBACHEV]]]]
  [object
   [[verb []]
    [subject
     [np
      [n TANK]
      [t_pos THE]
      [m_wh
       [[verb [SHIP]]
        [subject ANYONE]
        [object VAR]
        [TO
         [np
          [n PLANT]
          [t_pos A]]]
        [IN
         [np
          [name [UKRAINE]]
          [t_pos THE]
          [m_wh
           [[verb [WORK]]
            [subject
             [np
              [n MECHANIC]
              [count [SEVEN]]]]
            [FOR
             [np
              [n MONTH]
              [count [THREE]]]]
            [sa_wh
             [[verb [RESTORE]]
              [subject PRO]
              [object
               [np
significantly fewer times or perhaps none at all, then
we will prefer the former association as valid.
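The preference rule described here can be illustrated with a minimal Python sketch. It is not the authors' disambiguation routine: the function name `prefer_association`, the `ratio` threshold standing in for "significantly fewer times", and the counts in the example are all hypothetical.

```python
from collections import Counter


def prefer_association(head_a, head_b, modifier, pair_counts, ratio=2.0):
    """Choose which head the modifier should be attached to.

    pair_counts maps (head, modifier) to the number of times that pair
    was seen in unambiguous contexts. The `ratio` threshold is an
    assumption of this sketch, not a value taken from the paper.
    Returns the preferred head, or None if neither clearly dominates.
    """
    a = pair_counts[(head_a, modifier)]
    b = pair_counts[(head_b, modifier)]
    if a > 0 and a >= ratio * b:
        return head_a
    if b > 0 and b >= ratio * a:
        return head_b
    return None


# Hypothetical counts for the phrase "natural language processing".
counts = Counter({("language", "natural"): 12, ("processing", "natural"): 1})
choice = prefer_association("language", "processing", "natural", counts)
print(choice)
```

Returning `None` when neither count dominates leaves the ambiguous phrase unresolved, which is in the spirit of the conservative behaviour described in the text.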
Although the noun phrase disambiguation routine
has been implemented to work with the pair
extractor program, it has not been used in the current
installment of TREC. A more conservative version of
the pair extractor, which generates fewer pairs, was
used instead, since that version was found effective
in test runs. 12
TERM CORRELATIONS FROM TEXT
Head-modifier pairs form compound terms
used in database indexing. They also serve as
occurrence contexts for smaller terms, including
single-word terms. In order to determine whether
such pairs signify any important association between
terms, we calculate the value of the Informational
Contribution (IC) function for each element in a pair.
This is important because not every syntactic associa-
tion necessarily translates into a semantic one, and
we may also need to eliminate spurious pairs gen-
erated by the parser. Higher values of IC indicate
stronger association, and moreover the element
which has the larger value is considered semantically
dominant. These values are context dependent, and
will vary from one corpus to another. 13
The likelihood of a given word being paired
with another word, within one predicate-argument
structure, can be expressed in statistical terms as a
conditional probability. In our present approach, the
required measure had to be uniform for all word
occurrences, covering different types of predicate-
argument links, i.e., verb-object, noun-adjunct, etc.
This is reflected by an additional dispersion parame-
ter, introduced to evaluate the heterogeneity of word
associations. The resulting new formula IC (x, [x,y])
is based on (an estimate of) the conditional probabil-
ity of seeing a word y as a modifier of the word x,
normalized with a dispersion parameter for x.
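As a rough illustration, the measure can be sketched in Python under the reading IC(x,[x,y]) = f(x,y) / (n(x) + d(x) - 1). Several points are assumptions of this sketch rather than statements from the paper: that n(x) counts pair occurrences (not distinct pairs), that d(x) is the number of distinct words paired with x in the given position, and the function name `ic_scores` together with the sample data, which are hypothetical.

```python
from collections import Counter, defaultdict


def ic_scores(pairs):
    """Compute IC for both elements of every (head, modifier) pair.

    pairs: iterable of (head, modifier) tuples, one per occurrence.
    Returns {(head, modifier): (ic_of_head, ic_of_modifier)}.
    """
    pairs = list(pairs)
    f = Counter(pairs)                       # f(x,y): frequency of the pair
    n_head = Counter(h for h, _ in pairs)    # n(x) for x in head position
    n_mod = Counter(m for _, m in pairs)     # n(y) for y in modifier position
    partners_head = defaultdict(set)         # distinct modifiers seen per head
    partners_mod = defaultdict(set)          # distinct heads seen per modifier
    for h, m in f:
        partners_head[h].add(m)
        partners_mod[m].add(h)

    scores = {}
    for (h, m), freq in f.items():
        # The denominator is at least 1, since n >= 1 and d >= 1.
        ic_head = freq / (n_head[h] + len(partners_head[h]) - 1)
        ic_mod = freq / (n_mod[m] + len(partners_mod[m]) - 1)
        scores[(h, m)] = (ic_head, ic_mod)
    return scores


# Hypothetical pair occurrences.
sample = ([("language", "natural")] * 3
          + [("language", "programming"), ("processing", "natural")])
scores = ic_scores(sample)
for pair, (ic_h, ic_m) in sorted(scores.items()):
    print(pair, round(ic_h, 3), round(ic_m, 3))
```

In this sketch the element with the larger of the two values in a pair would be the one treated as semantically dominant.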
$$ IC(x, [x,y]) = \frac{f_{x,y}}{n_x + d(x) - 1} $$
EXTRACTED PAIRS:
MECHANIC WORK
SHIP TANK
TANK SHIP
Figure 2. Extraction of syntactic pairs.
language+natural has occurred unambiguously a
number of times in contexts such as parser for natural
language, while processing+natural has occurred
where $f_{x,y}$ is the frequency of [x,y] in the corpus, $n_x$
is the number of pairs in which x occurs at the same
position as in [x,y], and d(x) is the dispersion parameter,
understood as the number of distinct words with
12 In a few test runs with TREC topics 001 to 005 against the
training database (disk 1), we observed that the inclusion of
compound terms obtained from head-modifier pairs increased both
recall and precision quite significantly. In particular, for topic 003 no
relevant documents could be found without compound terms.
13 For more details please refer to (Strzalkowski and Vauthey,
1992).