NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman

TTP is a full grammar parser, and initially it attempts to generate a complete analysis for each sentence. However, unlike an ordinary parser, it has a built-in timer which regulates the amount of time allowed for parsing any one sentence. If a parse is not returned before the allotted time elapses, the parser enters the skip-and-fit mode, in which it will try to "fit" the parse. While in the skip-and-fit mode, the parser will attempt to forcibly reduce incomplete constituents, possibly skipping portions of the input in order to restart processing at the next unattempted constituent. In other words, the parser will favor reduction over backtracking while in the skip-and-fit mode. The result of this strategy is an approximate parse, partially fitted using top-down predictions. The fragments skipped in the first pass are not thrown out; instead, they are analyzed by a simple phrasal parser that looks for noun phrases and relative clauses and then attaches the recovered material to the main parse structure.

As an illustration, consider the following sentence taken from the CACM-3204 corpus:

The method is illustrated by the automatic construction of both recursive and iterative programs operating on natural numbers, lists, and trees, *in order to construct a program satisfying certain specifications a theorem induced by those specifications is proved*, and the desired program is extracted from the proof.

The italicized fragment is likely to cause additional complications in parsing this lengthy string, and the parser may be better off ignoring this fragment altogether.
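The timer-regulated control flow described above can be sketched as follows. Here `full_parse` and `skip_and_fit` are hypothetical callables standing in for TTP's two parsing modes, and the Unix SIGALRM mechanism is just one plausible way to implement the per-sentence time budget; only the control flow mirrors the paper.

```python
import signal

class ParseTimeout(Exception):
    """Raised when the full-grammar parse exceeds its time budget."""

def parse_with_budget(sentence, full_parse, skip_and_fit, seconds=2):
    """Attempt a complete parse; fall back to skip-and-fit on timeout.

    `full_parse` and `skip_and_fit` are assumed parser entry points;
    they are illustrative names, not part of the actual TTP system.
    """
    def _on_alarm(signum, frame):
        raise ParseTimeout
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)                  # start the per-sentence timer
    try:
        return full_parse(sentence)        # normal, complete analysis
    except ParseTimeout:
        return skip_and_fit(sentence)      # approximate, fitted parse
    finally:
        signal.alarm(0)                    # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

A sentence that parses within the budget gets its complete analysis; one that runs over is re-analyzed in the recovery mode rather than being abandoned.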
To do so successfully, the parser must close the currently open constituent (i.e., reduce a program satisfying certain specifications to NP), and possibly a few of its parent constituents, removing the corresponding productions from further consideration, until an appropriate production is reactivated. In this case, TTP may force the following reductions: S1 → to V NP; SA → S1; S → NP V NP SA, until the production S → S and S is reached. Next, the parser skips input to find and, and resumes normal processing.

As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the lexical-level ambiguity must be removed from the input text prior to parsing. We achieve this using a stochastic part-of-speech tagger to preprocess the text. Full details of the parser can be found in (Strzalkowski, 1992).

PART OF SPEECH TAGGER

One way of dealing with lexical ambiguity is to use a tagger to preprocess the input, marking each word with a tag that indicates its syntactic categorization: a part of speech with selected morphological features such as number, tense, mode, case and degree. The following are tagged sentences from the CACM-3204 collection: [6]

The/dt paper/nn presents/vbz a/dt proposal/nn for/in structured/vbn representation/nn of/in multiprogramming/vbg in/in a/dt high/jj level/nn language/nn ./per

The/dt notation/nn used/vbn explicitly/rb associates/vbz a/dt data/nns structure/nn shared/vbn by/in concurrent/jj processes/nns with/in operations/nns defined/vbn on/in it/pp ./per

The tags are understood as follows: dt - determiner, nn - singular noun, nns - plural noun, in - preposition, jj - adjective, vbz - verb in present tense third person singular, to - particle "to", vbg - present participle, vbn - past participle, vbd - past tense verb, vb - infinitive verb, cc - coordinate conjunction.
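The disambiguation a stochastic tagger performs can be illustrated with a toy Viterbi search over hand-made lexical and tag-bigram probabilities. The tiny lexicon and all the numbers below are illustrative assumptions, not the model used in the paper.

```python
# Toy stochastic (Viterbi) tagger. LEX holds P(tag | word) and
# TRANS holds P(tag | previous tag); "<s>" marks sentence start.
# All probabilities here are made-up illustrative values.
LEX = {
    "the": {"dt": 1.0},
    "paper": {"nn": 1.0},
    "presents": {"vbz": 0.4, "nns": 0.6},  # lexically prefers the noun
    "a": {"dt": 1.0},
    "proposal": {"nn": 1.0},
}
TRANS = {
    ("<s>", "dt"): 0.5, ("dt", "nn"): 0.9, ("nn", "vbz"): 0.5,
    ("nn", "nns"): 0.05, ("vbz", "dt"): 0.6, ("nns", "dt"): 0.1,
}

def viterbi_tag(words):
    """Return the most probable tag sequence for `words`."""
    # paths maps the last tag to (probability, tag sequence so far)
    paths = {"<s>": (1.0, [])}
    for word in words:
        new_paths = {}
        for tag, p_lex in LEX[word].items():
            best = None
            for prev_tag, (p, seq) in paths.items():
                p_new = p * TRANS.get((prev_tag, tag), 1e-6) * p_lex
                if best is None or p_new > best[0]:
                    best = (p_new, seq + [tag])
            new_paths[tag] = best
        paths = new_paths
    return max(paths.values())[1]
```

On the first example sentence, context overrides the lexical preference: although presents is the noun reading more often in this toy lexicon, `viterbi_tag(["the", "paper", "presents", "a", "proposal"])` selects vbz because the tag bigram nn vbz is far more likely than nn nns.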
Tagging of the input text substantially reduces the search space of a top-down parser, since it resolves most of the lexical-level ambiguities. In the examples above, tagging presents as "vbz" in the first sentence cuts off a potentially long and costly "garden path" with presents as a plural noun followed by a headless relative clause starting with (that) a proposal.... In the second sentence, tagging resolves the ambiguity of used (vbn vs. vbd) and associates (vbz vs. nns). Perhaps more importantly, the elimination of word-level lexical ambiguity allows the parser to make projections about the input which is yet to be parsed, using a simple lookahead; in particular, phrase boundaries can be determined with a degree of confidence (Church, 1988). This latter property is critical for implementing the skip-and-fit recovery technique outlined in the previous section.

Tagging of the input also helps to reduce the number of parse structures that can be assigned to a sentence, decreases the need to consult the dictionary, [7] and simplifies dealing with unknown words. Since every item in the sentence is assigned a tag, so are the words for which we have no entry in the lexicon. Many of these words will be tagged as "np" (proper noun); however, the surrounding tags may force other selections. In the following example, chinese, which does not appear in the dictionary, is tagged as "jj":

this/dt paper/nn dates/vbz back/rb the/dt genesis/nn of/in binary/jj conception/nn circa/in

[6] Tagged using the 35-tag Penn Treebank Tagset created at the University of Pennsylvania.
[7] We use the machine-readable version of the Oxford Advanced Learner's Dictionary (OALD).
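The tag-based phrase-boundary detection mentioned above can be sketched as a greedy scan over a tag sequence. The NP pattern used here (optional determiner, adjectives, then one or more nouns) is an illustrative assumption, not TTP's grammar or Church's actual method.

```python
def np_spans(tags):
    """Greedy left-to-right scan for NP spans of the form
    [dt] jj* (nn|nns)+ over a tag sequence.  Returns half-open
    (start, end) token index pairs."""
    spans, i, n = [], 0, len(tags)
    while i < n:
        start = i
        if tags[i] == "dt":
            i += 1
        while i < n and tags[i] == "jj":
            i += 1
        if i < n and tags[i] in ("nn", "nns"):
            while i < n and tags[i] in ("nn", "nns"):
                i += 1
            spans.append((start, i))
        else:
            i = start + 1      # no noun head here; advance one token
    return spans

# Tags of the second example sentence in the tagger section above
tags = ("dt nn vbn rb vbz dt nns nn vbn in jj nns "
        "in nns vbn in pp per").split()
```

On that sentence the scan recovers the spans of "the notation", "a data structure", "concurrent processes" and "operations" from the tags alone, without consulting the words, which is what makes deterministic input skipping feasible.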