SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Natural Language Processing in Large-Scale Text Retrieval Tasks chapter T. Strzalkowski National Institute of Standards and Technology Donna K. Harman

Let us consider a specific example from the WSJ database:

    The former Soviet president has been a local hero ever since a Russian tank invaded Wisconsin.

The tagged sentence is given below, followed by the regularized parse structure generated by TTP, given in Figure 1.

    The/dt former/jj Soviet/jj president/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

    [assert
      [[perf [HAVE]]
       [[verb [BE]]
        [subject [np [n PRESIDENT] [t_pos THE] [adj [FORMER]] [adj [SOVIET]]]]
        [object [np [n HERO] [t_pos A] [adj [LOCAL]]]]
        [adv EVER]
        [sub_ord [SINCE
          [[verb [INVADE]]
           [subject [np [n TANK] [t_pos A] [adj [RUSSIAN]]]]
           [object [np [name [WISCONSIN]]]]]]]]]]

    Figure 1. Predicate-argument parse structure.

It should be noted that the parser's output is a predicate-argument structure centered around main elements of various phrases. In Figure 1, BE is the main predicate (modified by HAVE) with 2 arguments (subject, object) and 2 adjuncts (adv, sub_ord). INVADE is the predicate in the subordinate clause with 2 arguments (subject, object). The subject of BE is a noun phrase with PRESIDENT as the head element, two modifiers (FORMER, SOVIET) and a determiner (THE).

From this structure, we extract head-modifier pairs that become candidates for compound terms. The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb.
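The four pair types can be illustrated with a small sketch. This is not the TTP pair extractor: the `NounPhrase` and `Clause` classes and the `extract_pairs` function are hypothetical names introduced here, and the parse of the example sentence is entered by hand rather than produced by a parser. Pairs are written as (head, modifier), following the bracketed notation used for the example sentence, and the later filtering of low-content elements and names is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Pair = Tuple[str, str]


@dataclass
class NounPhrase:
    head: str
    left_mods: List[str] = field(default_factory=list)  # adjectives or noun adjuncts
    right_adjunct_head: Optional[str] = None             # head of a right adjunct, if any


@dataclass
class Clause:
    verb: str
    subject: Optional[NounPhrase] = None
    obj: Optional[NounPhrase] = None
    sub_clauses: List["Clause"] = field(default_factory=list)


def np_pairs(np: NounPhrase) -> List[Pair]:
    """Pair types (1) and (2): head noun with left modifiers and right adjunct head."""
    pairs = [(np.head, mod) for mod in np.left_mods]
    if np.right_adjunct_head is not None:
        pairs.append((np.head, np.right_adjunct_head))
    return pairs


def extract_pairs(clause: Clause) -> List[Pair]:
    """Collect candidate pairs from a clause and its subordinate clauses."""
    pairs: List[Pair] = []
    if clause.subject is not None:
        pairs.append((clause.subject.head, clause.verb))  # type (4): subject head + verb
        pairs.extend(np_pairs(clause.subject))
    if clause.obj is not None:
        pairs.append((clause.verb, clause.obj.head))      # type (3): verb + object head
        pairs.extend(np_pairs(clause.obj))
    for sub in clause.sub_clauses:
        pairs.extend(extract_pairs(sub))
    return pairs


# Hand-built version of the structure in Figure 1 (determiners omitted).
sentence = Clause(
    verb="BE",
    subject=NounPhrase("PRESIDENT", ["FORMER", "SOVIET"]),
    obj=NounPhrase("HERO", ["LOCAL"]),
    sub_clauses=[
        Clause("INVADE", NounPhrase("TANK", ["RUSSIAN"]), NounPhrase("WISCONSIN")),
    ],
)

pairs = extract_pairs(sentence)
for head, modifier in pairs:
    print(f"[{head},{modifier}]")
```

Keeping the clause and noun-phrase levels separate mirrors the predicate-argument layout of Figure 1, so each of the four pair types maps to one line of the extractor.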
These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system, retrieval of information from databases, and information that can be retrieved by a user-controlled interactive search process. In the example at hand, the following head-modifier pairs are extracted (pairs containing low-content elements, such as BE and FORMER, or names, such as WISCONSIN, will be later discarded):

    [PRESIDENT,BE] [PRESIDENT,FORMER] [PRESIDENT,SOVIET] [BE,HERO] [HERO,LOCAL] [TANK,INVADE] [TANK,RUSSIAN] [INVADE,WISCONSIN]

We may note that the three-word phrase former Soviet president has been broken into two pairs, former president and Soviet president, both of which denote things that are potentially quite different from what the original phrase refers to, and this fact may have a negative effect on retrieval precision. This is one place where a longer phrase appears more appropriate. A further example is shown in Figure 2.[11]

One difficulty in obtaining head-modifier pairs of highest accuracy is the notorious ambiguity of nominal compounds. For example, the phrase natural language processing should generate language+natural and processing+language, while dynamic information processing is expected to yield processing+dynamic and processing+information. Since our parser has no knowledge about the text domain, and uses no semantic preferences, it does not attempt to guess any internal associations within such phrases. Instead, this task is passed to the pair extractor module, which processes ambiguous parse structures in two phases. In phase one, all and only unambiguous head-modifier pairs are extracted, and the frequencies of their occurrences are recorded.
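The two-phase idea can be sketched for three-word compounds as follows. The text only states that frequencies recorded in phase one are used in phase two, which is described next; the concrete decision rule here (prefer whichever candidate pair for the first word was seen more often, and emit only the shared pair on a tie) is an assumption, as are the function names `count_unambiguous` and `resolve_compound` and the toy counts.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (head, modifier)


def count_unambiguous(pairs: Iterable[Pair]) -> Counter:
    """Phase one: record how often each unambiguous head-modifier pair occurs."""
    return Counter(pairs)


def resolve_compound(w1: str, w2: str, w3: str, freq: Counter) -> List[Pair]:
    """Phase two (assumed rule): decide what w1 modifies in the compound 'w1 w2 w3'.

    Both bracketings share the pair (w3, w2); they differ in whether w1
    attaches to w2, as in [[w1 w2] w3], or to w3, as in [w1 [w2 w3]].
    """
    left = (w2, w1)
    right = (w3, w1)
    shared = (w3, w2)
    if freq[left] > freq[right]:
        return [left, shared]
    if freq[right] > freq[left]:
        return [right, shared]
    return [shared]  # no evidence either way: keep only the unambiguous part


# Toy phase-one counts; in practice these would come from the whole corpus.
freq = count_unambiguous([
    ("language", "natural"),
    ("language", "natural"),
    ("processing", "dynamic"),
    ("processing", "information"),
])

print(resolve_compound("natural", "language", "processing", freq))
print(resolve_compound("dynamic", "information", "processing", freq))
```

With these counts the sketch reproduces the two associations named in the text: language+natural with processing+language, and processing+dynamic with processing+information.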
[11] Note that working with the parsed text ensures a degree of precision in capturing the meaningful phrases, which is especially evident when compared with the results usually obtained from either unprocessed or only partially processed text (Lewis and Croft, 1990). Note also that names, pronouns and dummy verbs are not allowed to create pairs.

In phase two, frequency information about pairs generated in the first pass is used to form associations from ambiguous structures. For example, if