NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Natural Language Processing in Large-Scale Text Retrieval Tasks. T. Strzalkowski. National Institute of Standards and Technology, Donna K. Harman.

which x is paired. When IC(x, [x,y]) = 0, x and y never occur together (i.e., f_{x,y} = 0); when IC(x, [x,y]) = 1, x occurs only with y (i.e., f_{x,y} = n_x and d_x = 1). So defined, the IC function is asymmetric, a property found desirable by Wilks et al. (1990) in their study of word co-occurrences in the Longman dictionary. In addition, IC is quite stable even for relatively low-frequency words (the dispersion parameter helps here), and in this respect it compares favourably to Fano's mutual information formula. The lack of stability on low-frequency terms is particularly worrisome for IR applications, since many important indexing terms could be eliminated from consideration.

It should also be pointed out that the particular way of generating syntactic pairs was dictated, to some degree at least, by statistical considerations. Our original experiments with the IC formula were performed on the relatively small CACM-3204 collection, and therefore we combined pairs obtained from different syntactic relations (e.g., verb-object, subject-verb, noun-adjunct, etc.) in order to increase the frequencies of some associations. This became largely unnecessary in a large collection such as TIPSTER, but we had no means to test alternative options, and thus decided to stay with the original. It should not be difficult to see that this was a compromise solution, since many important distinctions were potentially lost, and strong associations could be produced where there weren't any. A way to improve things is to consider different syntactic relations independently, perhaps as independent sources of evidence that could lend support (or not) to certain term similarity predictions. We have already started testing this option.
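The behaviour of IC described above can be illustrated with a small sketch. The defining formula falls outside this excerpt, so the code assumes the form IC(x, [x,y]) = f_{x,y} / (n_x + d_x - 1), where f_{x,y} is the frequency of the pair, n_x is the number of pair occurrences with x in the same position, and d_x is the number of distinct words with which x is paired. This assumed form reproduces the two boundary cases stated in the text. The function name `ic_coefficients` and the toy counts are hypothetical and are not taken from the paper or from any corpus.

```python
from collections import Counter, defaultdict


def ic_coefficients(pairs):
    """Map each distinct (head, modifier) pair to (IC of head, IC of modifier).

    Assumed form (not given in this excerpt): IC(x, [x,y]) = f_xy / (n_x + d_x - 1).
    """
    f = Counter(pairs)                       # f_xy: frequency of each pair
    n_head, n_mod = Counter(), Counter()     # n_x: occurrences of x in a given position
    partners_head = defaultdict(set)         # distinct modifiers seen with each head
    partners_mod = defaultdict(set)          # distinct heads seen with each modifier
    for (head, mod), count in f.items():
        n_head[head] += count
        n_mod[mod] += count
        partners_head[head].add(mod)
        partners_mod[mod].add(head)

    result = {}
    for (head, mod), count in f.items():
        d_head = len(partners_head[head])    # dispersion of the head
        d_mod = len(partners_mod[mod])       # dispersion of the modifier
        result[(head, mod)] = (
            count / (n_head[head] + d_head - 1),
            count / (n_mod[mod] + d_mod - 1),
        )
    return result


# Hypothetical toy counts, not taken from any corpus.
toy_pairs = (
    [("system", "parallel")] * 1
    + [("system", "distribute")] * 2
    + [("system", "file")] * 3
    + [("process", "parallel")] * 2
    + [("editor", "text")] * 4
)
ic = ic_coefficients(toy_pairs)
print(ic[("system", "parallel")])   # (0.125, 0.25): asymmetric
print(ic[("editor", "text")])       # (1.0, 1.0): each word occurs only with the other
```

In the toy data, the frequent and widely dispersed head "system" contributes less to the pair than the modifier "parallel" does, which is the asymmetry the text refers to. Counts are kept per position, so pooling pairs from several syntactic relations, as was done for CACM-3204, would amount to concatenating their pair lists before calling the function.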
A few examples of IC coefficients obtained from the CACM-3204 corpus are listed in Table 1.

word        head+modifier      IC coeff.
----------  -----------------  ---------
distribute  distribute+normal  0.040
normal      distribute+normal  0.115
minimum     minimum+relative   0.200
relative    minimum+relative   0.016
retrieve    retrieve+inform    0.086
inform      retrieve+inform    0.004
size        size+medium        0.009
medium      size+medium        0.250
editor      editor+text        0.142
text        editor+text        0.025
system      system+parallel    0.001
parallel    system+parallel    0.014
read        read+character     0.023
character   read+character     0.007
implicate   implicate+legal    0.035
legal       implicate+legal    0.083
system      system+distribute  0.002
distribute  system+distribute  0.037
make        make+recommend     0.024
recommend   make+recommend     0.142
infer       infer+deductive    0.095
deductive   infer+deductive    0.142
share       share+resource     0.054
resource    share+resource     0.042

Table 1. Examples of IC coefficients.

IC values for terms become the basis for calculating term-to-term similarity coefficients. If two terms tend to be modified with a number of common modifiers and otherwise appear in few distinct contexts, we assign them a similarity coefficient, a real number between 0 and 1. The similarity is determined by comparing distribution characteristics for both terms within the corpus: how much information content do they carry, does their information contribution over contexts vary greatly, are the common contexts in which these terms occur specific enough? In general we will credit high-content terms appearing in identical contexts, especially if these contexts are not too commonplace.[14] The relative similarity between two words x1 and x2 is obtained using the following formula (α is a large constant):[15]

$$\mathrm{SIM}(x_1, x_2) = \log\Bigl(\alpha \sum_{y} \mathrm{sim}_y(x_1, x_2)\Bigr)$$

where

$$\mathrm{sim}_y(x_1, x_2) = \mathrm{MIN}\bigl(IC(x_1, [x_1, y]),\ IC(x_2, [x_2, y])\bigr) \cdot \mathrm{MIN}\bigl(IC(y, [x_1, y]),\ IC(y, [x_2, y])\bigr)$$

[14] It would not be appropriate to predict similarity between language and logarithm on the basis of their co-occurrence with natural.

[15] This is inspired by a formula used by Hindle (1990), and subsequently modified to take into account the asymmetry of the IC measure.

The similarity function is further normalized with respect to SIM(x1, x1). In addition, we require that words x1 and x2 appear in at least two distinct common contexts, where a common context is a couple of pairs [x1,y] and [x2,y], or [y,x1] and [y,x2], such that each occurred at least twice. Thus, banana and baltic will