SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
chapter
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
[assert
  [[perf [HAVE]]
    [[verb [BE]]
      [subject
        [np
          [n PRESIDENT]
          [t_pos THE]
          [adj [FORMER]]
          [adj [SOVIET]]]]
      [object
        [np
          [n HERO]
          [t_pos A]
          [adj [LOCAL]]]]
      [adv EVER]
      [sub_ord
        [SINCE
          [[verb [INVADE]]
            [subject
              [np
                [n TANK]
                [t_pos A]
                [adj [RUSSIAN]]]]
            [object
              [np
                [name [WISCONSIN]]]]]]]]]]
Figure 1. Predicate-argument parse structure.
It should be noted that the parser's output is a
predicate-argument structure centered around the main
elements of various phrases. In Figure 1, BE is the main
predicate (modified by HAVE) with 2 arguments (subject,
object) and 2 adjuncts (adv, sub_ord). INVADE is the
predicate in the subordinate clause with 2 arguments
(subject, object). The subject of BE is a noun phrase
with PRESIDENT as the head element, two modifiers
(FORMER, SOVIET) and a determiner (THE). From this
structure, we extract head-modifier pairs that become
candidates for compound terms. The following types of
pairs are considered: (1) a head noun and its left
adjective or noun adjunct, (2) a head noun and the head of
its right adjunct, (3) the main verb of a clause and the
head of its object phrase, and (4) the head of the subject
phrase and the main verb. These types of pairs account
for most of the syntactic variants for relating two words
(or simple phrases) into pairs carrying compatible
semantic content. For example, the pair
retrieve+information will be extracted from any of the
following fragments: information retrieval system;
retrieval of information from databases; and information
that can be retrieved by a user-controlled interactive
search process. In the example at hand, the following
head-modifier pairs are extracted (pairs containing
low-content elements, such as BE and FORMER, or names,
such as WISCONSIN, will be later discarded):
PRESIDENT+BE, PRESIDENT+FORMER, PRESIDENT+SOVIET,
BE+HERO, HERO+LOCAL,
TANK+INVADE, TANK+RUSSIAN, INVADE+WISCONSIN
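The four pair types listed above can be illustrated with a minimal sketch.
It is not the extraction code used in the experiments: the NounPhrase and
Clause classes, their field names, and the traversal order are assumptions
introduced here only to mirror the structure of Figure 1. Determiners are
simply left out of the representation, and the later filtering of
low-content elements and names is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NounPhrase:
    # Hypothetical stand-in for an [np ...] node of the parse structure.
    head: str
    modifiers: List[str] = field(default_factory=list)  # left adjectives / noun adjuncts
    right_adjunct: Optional["NounPhrase"] = None


@dataclass
class Clause:
    # Hypothetical stand-in for a clause centered on its main verb.
    verb: str
    subject: Optional[NounPhrase] = None
    obj: Optional[NounPhrase] = None
    subordinate: List["Clause"] = field(default_factory=list)


def np_pairs(np: NounPhrase) -> List[Tuple[str, str]]:
    # Type (1): head noun + each left adjective or noun adjunct.
    pairs = [(np.head, m) for m in np.modifiers]
    # Type (2): head noun + head of its right adjunct.
    if np.right_adjunct is not None:
        pairs.append((np.head, np.right_adjunct.head))
        pairs.extend(np_pairs(np.right_adjunct))
    return pairs


def clause_pairs(clause: Clause) -> List[Tuple[str, str]]:
    pairs: List[Tuple[str, str]] = []
    if clause.subject is not None:
        # Type (4): head of the subject phrase + main verb.
        pairs.append((clause.subject.head, clause.verb))
        pairs.extend(np_pairs(clause.subject))
    if clause.obj is not None:
        # Type (3): main verb + head of its object phrase.
        pairs.append((clause.verb, clause.obj.head))
        pairs.extend(np_pairs(clause.obj))
    for sub in clause.subordinate:
        pairs.extend(clause_pairs(sub))
    return pairs


# The sentence of Figure 1, entered by hand.
figure1 = Clause(
    verb="BE",
    subject=NounPhrase("PRESIDENT", ["FORMER", "SOVIET"]),
    obj=NounPhrase("HERO", ["LOCAL"]),
    subordinate=[
        Clause(
            verb="INVADE",
            subject=NounPhrase("TANK", ["RUSSIAN"]),
            obj=NounPhrase("WISCONSIN"),
        )
    ],
)

pairs = ["+".join(p) for p in clause_pairs(figure1)]
print(", ".join(pairs))
```

Under these assumptions the sketch yields the same eight pairs, in the
same order, as the list given above.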
We may note that the three-word phrase former Soviet
president has been broken into two pairs, former
president and Soviet president, both of which denote
things that are potentially quite different from what the
original phrase refers to, and this fact may have a
potentially negative effect on retrieval precision. This
is one place where a longer phrase appears more
appropriate. The representation of this sentence may
therefore contain the following terms (along with their
inverse document frequency weights):
PRESIDENT           2.623519
SOVIET              5.416102
PRESIDENT+SOVIET   11.556747
PRESIDENT+FORMER   14.594853
HERO                7.896426
HERO+LOCAL         14.314775
INVADE              8.435012
TANK                6.848128
TANK+INVADE        17.402237
TANK+RUSSIAN       16.030809
RUSSIAN             7.383342
WISCONSIN           7.785689
While generating compound terms we took care to identify
`negative' terms, that is, those whose denotations have
been explicitly excluded by negation. Even though
matching of negative terms was not used in retrieval (nor
did we use negative weights), we could easily prevent
matching a negative term in a query against its positive
counterpart in the database by removing known negative
terms from queries. As an example consider the following
fragment from topic 067:
    It should NOT be about economically-motivated
    civil disturbances and NOT be about a civil
    disturbance directed against a second country.
The corresponding compound terms are:
NOT disturb+civil
NOT country+second
NOT direct+disturb
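Removing known negative terms from a query can be sketched as follows.
This is only an illustration under stated assumptions: the QueryTerm type,
the convention of marking a negative term with a leading "NOT ", and the
positive term unrest+civil are hypothetical and are not taken from the
system or from topic 067.

```python
from typing import List, NamedTuple


class QueryTerm(NamedTuple):
    # Hypothetical representation of a query term and its polarity.
    text: str
    negated: bool


def parse_term(raw: str) -> QueryTerm:
    """Split an optional leading 'NOT ' marker off a compound term."""
    marker = "NOT "
    if raw.startswith(marker):
        return QueryTerm(raw[len(marker):], True)
    return QueryTerm(raw, False)


def drop_negative_terms(raw_terms: List[str]) -> List[str]:
    """Keep only terms not marked as negated, so that a negative term
    cannot match its positive counterpart in the database."""
    return [t.text for t in map(parse_term, raw_terms) if not t.negated]


query = [
    "unrest+civil",        # hypothetical positive term, for illustration only
    "NOT disturb+civil",
    "NOT country+second",
    "NOT direct+disturb",
]

kept = drop_negative_terms(query)
print(kept)
```

The negative terms are dropped rather than given negative weights, which
matches the statement above that neither negative matching nor negative
weights were used in retrieval.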
The particular way of interpreting syntactic contexts
was dictated, to some degree at least, by statistical
considerations. Our initial experiments were performed
on a relatively small collection (CACM-3204), and
therefore we combined pairs obtained from different
syntactic relations (e.g., verb-object, subject-verb,
noun-adjunct) in order to increase the frequencies of
some associations. This became largely unnecessary in a
large collection such as TIPSTER, but we had no means to
test alternative options, and thus decided to stay with
the original. It should not be difficult to see that this
was a compromise solution, since many important
distinctions were