SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Natural Language Processing in Large-Scale Text Retrieval Tasks chapter T. Strzalkowski National Institute of Standards and Technology Donna K. Harman

SENTENCE: Gorbachev ordered the tank shipped to a plant in the Ukraine, where seven mechanics worked for three months restoring it.

PARSE STRUCTURE: [assert [[verb [ORDER]] [subject [np [name [GORBACHEV]]]] [object [[verb []] [subject [np [n TANK] [t_pos THE] [m_wh [[verb [SHIP]] [subject ANYONE] [object VAR] [TO [np [n PLANT] [t_pos A]]] [IN [np [name [UKRAINE]] [t_pos THE] [m_wh [[verb [WORK]] [subject [np [n MECHANIC] [count [SEVEN]]] [FOR [np [n MONTH] [count [THREE]]]] [sa_wh [[verb [RESTORE]] [subject PRO] [object [np

EXTRACTED PAIRS: MECHANIC WORK SHIP TANK TANK SHIP

Figure 2. Extraction of syntactic pairs.

language+natural has occurred unambiguously a number of times in contexts such as parser for natural language, while processing+natural has occurred significantly fewer times or perhaps none at all, then we will prefer the former association as valid.

Although the noun phrase disambiguation routine has been implemented to work with the pair extractor program, it has not been used in the current installment of TREC. A more conservative version of the pair extractor, which generates fewer pairs, was used instead, since that version was found effective in test runs.[12]

TERM CORRELATIONS FROM TEXT

Head-modifier pairs form compound terms used in database indexing. They also serve as occurrence contexts for smaller terms, including single-word terms. To determine whether such pairs signify any important association between terms, we calculate the value of the Informational Contribution (IC) function for each element in a pair. This is important because not every syntactic association necessarily translates into a semantic one, and we may also need to eliminate spurious pairs generated by the parser. Higher values of IC indicate stronger association; moreover, the element with the larger value is considered semantically dominant. These values are context dependent and will vary from one corpus to another.[13]

The likelihood of a given word being paired with another word, within one predicate-argument structure, can be expressed in statistical terms as a conditional probability. In our present approach, the required measure had to be uniform for all word occurrences, covering different types of predicate-argument links, i.e., verb-object, noun-adjunct, etc. This is reflected by an additional dispersion parameter, introduced to evaluate the heterogeneity of word associations. The resulting new formula IC(x,[x,y]) is based on (an estimate of) the conditional probability of seeing a word y as a modifier of the word x, normalized with a dispersion parameter for x:

\[ IC(x,[x,y]) = \frac{f_{x,y}}{n_x + d_x - 1} \]

where f_{x,y} is the frequency of [x,y] in the corpus, n_x is the number of pairs in which x occurs at the same position as in [x,y], and d_x is the dispersion parameter understood as the number of distinct words with

[12] In a few test runs with TREC topics 001 to 005 against the training database (disk 1), we observed that the inclusion of compound terms obtained from head-modifier pairs increased both recall and precision quite significantly. In particular, for topic 003 no relevant documents could be found without compound terms.

[13] For more details please refer to (Strzalkowski and Vauthey, 1992).
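The IC computation described in the section on term correlations can be illustrated with a minimal Python sketch. It is not the system's actual implementation. The function name `informational_contribution`, the `position` parameter, and the sample pairs are hypothetical. Two assumptions are made where the text is not explicit: n_x is counted over pair occurrences (tokens) rather than distinct pairs, and d_x is taken to be the number of distinct words that x is paired with at the given position.

```python
from collections import Counter, defaultdict


def informational_contribution(pairs, position=0):
    """Return {(head, modifier): IC} for the element at `position`.

    pairs    -- iterable of (head, modifier) tuples, one per extracted occurrence
    position -- 0 scores the head x, 1 scores the modifier y

    Assumed formula: IC = f_pair / (n_x + d_x - 1), where
      f_pair = frequency of the pair in the corpus,
      n_x    = number of pair occurrences with x at `position` (assumption),
      d_x    = number of distinct partner words of x (assumption).
    """
    pair_freq = Counter(pairs)
    n = Counter()                 # n_x: occurrences of x at this position
    partners = defaultdict(set)   # distinct partners of x, giving d_x

    for pair, freq in pair_freq.items():
        x = pair[position]
        other = pair[1 - position]
        n[x] += freq
        partners[x].add(other)

    scores = {}
    for pair, freq in pair_freq.items():
        x = pair[position]
        # n_x >= 1 and d_x >= 1, so the denominator is never zero.
        scores[pair] = freq / (n[x] + len(partners[x]) - 1)
    return scores


if __name__ == "__main__":
    # Hypothetical head-modifier pairs, loosely echoing the example in the text.
    sample = (
        [("language", "natural")] * 3
        + [("language", "programming")]
        + [("processing", "natural")]
    )
    for pair, ic in sorted(informational_contribution(sample).items()):
        print(pair, round(ic, 3))
```

Calling the function once with `position=0` and once with `position=1` gives an IC value for each element of a pair. Comparing the two values for the same pair is one way to read off which element is "semantically dominant" in the sense used above.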