NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
T. Strzalkowski, J. Carballo
National Institute of Standards and Technology
D. K. Harman

    [assert
      [[perf [HAVE]]
        [[verb [BE]]
          [subject [np [n PRESIDENT] [t_pos THE] [adj [FORMER]] [adj [SOVIET]]]]
          [object [np [n HERO] [t_pos A] [adj [LOCAL]]]]
          [adv EVER]
          [sub_ord [SINCE
            [[verb [INVADE]]
              [subject [np [n TANK] [t_pos A] [adj [RUSSIAN]]]]
              [object [np [name WISCONSIN]]]]]]]]]

Figure 1. Predicate-argument parse structure.

The parser's output is a predicate-argument structure centered around the main elements of various phrases. In Figure 1, BE is the main predicate (modified by HAVE) with 2 arguments (subject, object) and 2 adjuncts (adv, sub_ord). INVADE is the predicate in the subordinate clause with 2 arguments (subject, object). The subject of BE is a noun phrase with PRESIDENT as the head element, two modifiers (FORMER, SOVIET) and a determiner (THE).

From this structure, we extract head-modifier pairs that become candidates for compound terms. The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system; retrieval of information from databases; and information that can be retrieved by a user-controlled interactive search process.
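The four pair types can be illustrated with a minimal sketch, assuming the parse has already been reduced to clause and noun-phrase records. The names `NounPhrase`, `Clause`, `np_pairs` and `extract_pairs` are hypothetical and are not taken from the parser described here; determiners are simply left out, and the later filtering of low-content elements and names is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical, simplified stand-ins for the parser's output (not the real format).
@dataclass
class NounPhrase:
    head: str
    left_modifiers: List[str] = field(default_factory=list)       # adjectives, noun adjuncts
    right_adjunct_heads: List[str] = field(default_factory=list)  # heads of right adjuncts

@dataclass
class Clause:
    verb: str
    subject: Optional[NounPhrase] = None
    obj: Optional[NounPhrase] = None
    subordinates: List["Clause"] = field(default_factory=list)

def np_pairs(np: NounPhrase) -> List[Tuple[str, str]]:
    """Pairs of type (1) and (2): head noun with its left and right adjuncts."""
    pairs = [(np.head, m) for m in np.left_modifiers]
    pairs += [(np.head, h) for h in np.right_adjunct_heads]
    return pairs

def extract_pairs(clause: Clause) -> List[Tuple[str, str]]:
    """Collect candidate pairs from a clause and its subordinate clauses."""
    pairs: List[Tuple[str, str]] = []
    if clause.subject is not None:
        pairs.append((clause.subject.head, clause.verb))  # type (4): subject head + verb
        pairs.extend(np_pairs(clause.subject))
    if clause.obj is not None:
        pairs.append((clause.verb, clause.obj.head))      # type (3): verb + object head
        pairs.extend(np_pairs(clause.obj))
    for sub in clause.subordinates:
        pairs.extend(extract_pairs(sub))
    return pairs

# The structure of Figure 1, entered by hand.
figure1 = Clause(
    verb="BE",
    subject=NounPhrase("PRESIDENT", ["FORMER", "SOVIET"]),
    obj=NounPhrase("HERO", ["LOCAL"]),
    subordinates=[
        Clause(
            verb="INVADE",
            subject=NounPhrase("TANK", ["RUSSIAN"]),
            obj=NounPhrase("WISCONSIN"),
        )
    ],
)

pairs = ["+".join(p) for p in extract_pairs(figure1)]
print(pairs)
```

For the Figure 1 structure this prints PRESIDENT+BE, PRESIDENT+FORMER, PRESIDENT+SOVIET, BE+HERO, HERO+LOCAL, TANK+INVADE, TANK+RUSSIAN and INVADE+WISCONSIN, in that order.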
In the example at hand, the following head-modifier pairs are extracted (pairs containing low-content elements, such as BE and FORMER, or names, such as WISCONSIN, will be later discarded):

    PRESIDENT+BE, PRESIDENT+FORMER, PRESIDENT+SOVIET, BE+HERO, HERO+LOCAL, TANK+INVADE, TANK+RUSSIAN, INVADE+WISCONSIN

We may note that the three-word phrase former Soviet president has been broken into two pairs, former president and Soviet president, both of which denote things that are potentially quite different from what the original phrase refers to, and this fact may have a potentially negative effect on retrieval precision. This is one place where a longer phrase appears more appropriate. The representation of this sentence may therefore contain the following terms (along with their inverse document frequency weights):

    PRESIDENT            2.623519
    SOVIET               5.416102
    PRESIDENT+SOVIET    11.556747
    PRESIDENT+FORMER    14.594853
    HERO                 7.896426
    HERO+LOCAL          14.314775
    INVADE               8.435012
    TANK                 6.848128
    TANK+INVADE         17.402237
    TANK+RUSSIAN        16.030809
    RUSSIAN              7.383342
    WISCONSIN            7.785689

While generating compound terms we took care to identify `negative' terms, that is, those whose denotations have been explicitly excluded by negation. Even though matching of negative terms was not used in retrieval (nor did we use negative weights), we could easily prevent matching a negative term in a query against its positive counterpart in the database by removing known negative terms from queries. As an example consider the following fragment from topic 067:

    It should NOT be about economically-motivated civil disturbances and NOT be about a civil disturbance directed against a second country.

The corresponding compound terms are:

    NOT disturb+civil
    NOT country+second
    NOT direct+disturb

The particular way of interpreting syntactic contexts was dictated, to some degree at least, by statistical considerations.
Our initial experiments were performed on a relatively small collection (CACM-3204), and therefore we combined pairs obtained from different syntactic relations (e.g., verb-object, subject-verb, noun-adjunct, etc.) in order to increase frequencies of some associations. This became largely unnecessary in a large collection such as TIPSTER, but we had no means to test alternative options, and thus decided to stay with the original. It should not be difficult to see that this was a compromise solution, since many important distinctions were