NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Recent Developments in Natural Language Text Retrieval
T. Strzalkowski, J. Carballo
National Institute of Standards and Technology, D. K. Harman

sentence, that is, a representation that reflects the sentence's logical predicate-argument structure. For example, logical subject and logical object are identified in both passive and active sentences, and noun phrases are organized around their head elements. The parser is equipped with a powerful skip-and-fit recovery mechanism that allows it to operate effectively in the face of ill-formed input or under severe time pressure. In the runs with approximately 130 million words of TREC's Wall Street Journal and San Jose Mercury texts,[2] the parser's speed averaged between 0.3 and 0.5 seconds per sentence, or up to 70 words per second, on a Sun SparcStation. In addition, TTP has been shown to produce parse structures which are no worse than those generated by full-scale linguistic parsers when compared to hand-coded Treebank parse trees.

TTP is a full grammar parser, and initially it attempts to generate a complete analysis for each sentence. However, unlike an ordinary parser, it has a built-in timer which regulates the amount of time allowed for parsing any one sentence. If a parse is not returned before the allotted time elapses, the parser enters the skip-and-fit mode, in which it will try to "fit" the parse. While in the skip-and-fit mode, the parser will attempt to forcibly reduce incomplete constituents, possibly skipping portions of input in order to restart processing at the next unattempted constituent. In other words, the parser will favor reduction over backtracking while in the skip-and-fit mode. The result of this strategy is an approximate parse, partially fitted using top-down predictions.
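The control flow described above (a time-limited full parse, then a fallback that prefers reduction to backtracking and skips input it cannot reduce) might be sketched roughly as follows. This is a toy illustration only, not the TTP implementation: the two tag patterns, the names `full_parse`, `skip_and_fit`, `parse` and `BudgetExceeded`, and the decision to fall back when the full parse fails outright are all assumptions made for the example.

```python
import time
from typing import Callable, List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, start, end), end exclusive


class BudgetExceeded(Exception):
    """Raised by the toy full parser when its time allowance runs out."""


def match_np(tags: List[str], i: int) -> Optional[int]:
    """Toy noun phrase: optional dt, any number of jj, one or more nn/np."""
    j = i
    if j < len(tags) and tags[j] == "dt":
        j += 1
    while j < len(tags) and tags[j] == "jj":
        j += 1
    k = j
    while k < len(tags) and tags[k] in ("nn", "np"):
        k += 1
    return k if k > j else None


def match_vp(tags: List[str], i: int) -> Optional[int]:
    """Toy verb group: one or more tags starting with 'vb'."""
    j = i
    while j < len(tags) and tags[j].startswith("vb"):
        j += 1
    return j if j > i else None


# Hypothetical two-pattern "grammar"; a real grammar is far richer.
PATTERNS = [("NP", match_np), ("VP", match_vp)]


def full_parse(tags: List[str], deadline: float,
               clock: Callable[[], float]) -> Optional[List[Span]]:
    """Backtracking search for constituents that cover every token."""
    def go(i: int) -> Optional[List[Span]]:
        if clock() >= deadline:          # the built-in timer
            raise BudgetExceeded
        if i == len(tags):
            return []
        for label, match in PATTERNS:
            end = match(tags, i)
            if end is not None:
                rest = go(end)
                if rest is not None:
                    return [(label, i, end)] + rest
        return None                      # caller backtracks
    return go(0)


def skip_and_fit(tags: List[str]) -> Tuple[List[Span], List[int]]:
    """Greedy pass: reduce where possible, otherwise skip one token."""
    spans: List[Span] = []
    skipped: List[int] = []
    i = 0
    while i < len(tags):
        for label, match in PATTERNS:
            end = match(tags, i)
            if end is not None:
                spans.append((label, i, end))
                i = end
                break
        else:
            skipped.append(i)            # kept for later analysis
            i += 1
    return spans, skipped


def parse(tagged: List[Tuple[str, str]], budget_seconds: float,
          clock: Callable[[], float] = time.monotonic):
    """Return (mode, spans, skipped_indices) for one tagged sentence."""
    tags = [tag for _, tag in tagged]
    deadline = clock() + budget_seconds
    try:
        spans = full_parse(tags, deadline, clock)
    except BudgetExceeded:
        spans = None
    if spans is not None:
        return "full", spans, []
    # Assumption: a failed full parse is handled like a timeout.
    spans, skipped = skip_and_fit(tags)
    return "skip-and-fit", spans, skipped
```

The clock is passed in as a parameter so that the timeout path can be exercised deterministically; with a budget of zero the sketch goes straight to the greedy pass. Skipped token positions are only recorded here, not analyzed further.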
The fragments skipped in the first pass are not thrown out; instead, they are analyzed by a simple phrasal parser that looks for noun phrases and relative clauses and then attaches the recovered material to the main parse structure. Full details of the TTP parser have been described in the TREC-1 report (Strzalkowski, 1993a), as well as in other works (Strzalkowski, 1992; Strzalkowski & Scheyen, 1993).

As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the lexical-level ambiguity must be removed from the input text prior to parsing. We achieve this using a stochastic part-of-speech tagger to preprocess the text (see the TREC-1 report for details). For TREC-2, a number of problems have been corrected in the tagger, including improper tokenization of input and handling of abbreviations.

[2] Approximately 0.85 GBytes of text, over 6 million sentences.

WORD SUFFIX TRIMMER

Word stemming has been an effective way of improving document recall, since it reduces words to their common morphological root, thus allowing more successful matches. On the other hand, stemming tends to decrease retrieval precision if care is not taken to prevent situations where otherwise unrelated words are reduced to the same stem. In our system we replaced a traditional morphological stemmer with a conservative dictionary-assisted suffix trimmer.[3] The suffix trimmer performs essentially two tasks: (1) it reduces inflected word forms to their root forms as specified in the dictionary, and (2) it converts nominalized verb forms (e.g., "implementation", "storage") to the root forms of corresponding verbs (i.e., "implement", "store").
This is accomplished by removing a standard suffix, e.g., "stor+age", replacing it with a standard root ending ("+e"), and checking the newly created word against the dictionary, i.e., we check whether the new root ("store") is indeed a legal word.

HEAD-MODIFIER STRUCTURES

Syntactic phrases extracted from TTP parse trees are head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head. In the TREC experiments reported here we extracted head-modifier word and fixed-phrase pairs only. While TREC databases are large enough to warrant generation of larger compounds, we were in no position to verify their effectiveness in indexing. This was largely because of the tight schedule, but also because of the rapidly escalating complexity of the indexing process: even with 2-word phrases, compound terms accounted for nearly 88% of all index entries; in other words, including 2-word phrases increased the index size approximately 8 times.

Let us consider a specific example from the WSJ database:

    The former Soviet president has been a local hero ever since a Russian tank invaded Wisconsin.

The tagged sentence is given below, followed by the regularized parse structure generated by TTP, given in Figure 1.

    The/dt former/jj Soviet/jj president/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

[3] Dealing with prefixes is a more complicated matter, since they may have quite a strong effect upon the meaning of the resulting term, e.g., un- usually introduces explicit negation.
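The suffix-trimming procedure described under WORD SUFFIX TRIMMER (strip a standard suffix, append a standard root ending, and accept the result only if the dictionary lists it) could be sketched as follows. The rule table, the `min_root` threshold, the function name `trim_suffix`, and the choice to return the word unchanged when no candidate is found in the dictionary are illustrative assumptions, not the rule set actually used in the system. In line with footnote 3, prefixes are left untouched.

```python
from typing import List, Set, Tuple

# Hypothetical rules: (suffix to strip, root endings to try in order).
SUFFIX_RULES: List[Tuple[str, List[str]]] = [
    ("ation", ["", "e", "ate"]),
    ("age", ["", "e"]),
    ("ment", [""]),
    ("ing", ["", "e"]),
    ("ed", ["", "e"]),
    ("es", ["", "e"]),
    ("s", [""]),
]


def trim_suffix(word: str, dictionary: Set[str],
                rules: List[Tuple[str, List[str]]] = SUFFIX_RULES,
                min_root: int = 3) -> str:
    """Return a dictionary-approved root for `word`, or `word` itself."""
    w = word.lower()
    for suffix, endings in rules:
        if not w.endswith(suffix):
            continue
        stem = w[: -len(suffix)]
        if len(stem) < min_root:
            continue  # too short to be a plausible root
        for ending in endings:
            candidate = stem + ending
            # The dictionary check is what keeps the trimmer conservative.
            if candidate in dictionary:
                return candidate
    return w  # assumption: leave the word alone if nothing is legal


# Tiny stand-in for a real machine-readable dictionary.
toy_dictionary = {"implement", "store", "invade", "president"}

roots = [trim_suffix(w, toy_dictionary)
         for w in ["implementation", "storage", "invaded", "village"]]
```

With the toy dictionary above, "implementation" and "storage" reduce to "implement" and "store", while "village" is returned unchanged because neither "vill" nor "ville" is listed.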