SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
which x is paired. When IC(x, [x,y]) = 0, x and y
never occur together (i.e., f_{x,y} = 0); when
IC(x, [x,y]) = 1, x occurs only with y (i.e., f_{x,y} = n_x
and d_x = 1).
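The two limiting cases can be checked with a short Python sketch. It assumes the definition IC(x, [x,y]) = f_{x,y} / (n_x + d_x - 1), where f_{x,y} is the frequency of the pair, n_x is the number of pair occurrences with x in the same position, and d_x is the number of distinct words with which x is paired. This form agrees with both limiting cases but is an assumption here; the function name ic_table and the toy pairs are hypothetical.

```python
from collections import Counter

def ic_table(pairs):
    """Map each (head, modifier) pair to (IC of head, IC of modifier).

    Assumed definition: IC(x, [x,y]) = f_xy / (n_x + d_x - 1).
    """
    f = Counter(pairs)                     # f_xy: frequency of each pair
    n_head, n_mod = Counter(), Counter()   # n_x: pair occurrences with x in that position
    d_head, d_mod = Counter(), Counter()   # d_x: distinct partners of x in that position
    for (h, m), count in f.items():
        n_head[h] += count
        n_mod[m] += count
        d_head[h] += 1
        d_mod[m] += 1
    return {
        (h, m): (count / (n_head[h] + d_head[h] - 1),
                 count / (n_mod[m] + d_mod[m] - 1))
        for (h, m), count in f.items()
    }

# Toy data (made up for illustration): "a" is seen with "b" twice and with "c" once.
pairs = [("a", "b"), ("a", "b"), ("a", "c")]
ic = ic_table(pairs)
```

In this toy input the modifier "b" occurs only with "a", so its IC reaches the upper limit, while the head "a" is dispersed over two partners and scores lower. A pair that never occurs simply has no entry, which corresponds to IC = 0.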
So defined, the IC function is asymmetric, a property
found desirable by Wilks et al. (1990) in their
study of word co-occurrences in the Longman
dictionary. In addition, IC is quite stable even for
relatively low frequency words (the dispersion parameter
helps here), and in this respect it compares
favourably to Fano's mutual information formula. The lack
of stability on low frequency terms is particularly
worrisome for IR applications, since many important
indexing terms could be eliminated from
consideration.
It should also be pointed out that the particular
way of generating syntactic pairs was dictated, to
some degree at least, by statistical considerations.
Our original experiments with the IC formula were
performed on the relatively small CACM-3204
collection, and therefore we combined pairs obtained from
different syntactic relations (e.g., verb-object,
subject-verb, noun-adjunct, etc.) in order to increase
the frequencies of some associations. This became
largely unnecessary in a large collection such as
TIPSTER, but we had no means to test alternative
options, and thus decided to stay with the original. It
should not be difficult to see that this was a
compromise solution, since many important
distinctions were potentially lost, and strong associations
could be produced where there were none. A way to
improve things is to consider different syntactic
relations independently, perhaps as independent sources
of evidence that could lend support (or not) to certain
term similarity predictions. We have already started
testing this option. A few examples of IC
coefficients obtained from the CACM-3204 corpus are
listed in Table 1.
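The trade-off between pooling syntactic relations and keeping them apart can be sketched in Python as follows. The (relation, head, modifier) triple format, the toy triples, and the function names are assumptions made for illustration only; they are not the output format of the parser used in the experiments.

```python
from collections import Counter, defaultdict

# Hypothetical parser output: (relation, head, modifier) triples.
triples = [
    ("verb-object", "retrieve", "inform"),
    ("noun-adjunct", "retrieve", "inform"),
    ("subject-verb", "retrieve", "system"),
]

def pooled_counts(triples):
    """Ignore the relation label: one frequency table for all pairs."""
    return Counter((head, mod) for _, head, mod in triples)

def per_relation_counts(triples):
    """Keep one frequency table per syntactic relation."""
    by_relation = defaultdict(Counter)
    for relation, head, mod in triples:
        by_relation[relation][(head, mod)] += 1
    return dict(by_relation)

pooled = pooled_counts(triples)
separate = per_relation_counts(triples)
```

Pooling raises the count of retrieve+inform by merging two relations, which helps in a small collection but hides the fact that the evidence came from different constructions. The per-relation tables keep that distinction, at the cost of lower counts in each table.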
IC values for terms become the basis for
calculating term-to-term similarity coefficients. If two
terms tend to be modified with a number of common
modifiers and otherwise appear in few distinct
contexts, we assign them a similarity coefficient, a real
number between 0 and 1. The similarity is
determined by comparing distribution characteristics for
both terms within the corpus: how much information
content do they carry, does their information
contribution vary greatly over contexts, and are the common
contexts in which these terms occur specific enough? In
general we will credit high-content terms appearing
in identical contexts, especially if these contexts are
not too commonplace.14 The relative similarity
14 It would not be appropriate to predict similarity between
language and logarithm on the basis of their co-occurrence with
natural.
word        head+modifier       IC coeff.
distribute  distribute+normal   0.040
normal      distribute+normal   0.115
minimum     minimum+relative    0.200
relative    minimum+relative    0.016
retrieve    retrieve+inform     0.086
inform      retrieve+inform     0.004
size        size+medium         0.009
medium      size+medium         0.250
editor      editor+text         0.142
text        editor+text         0.025
system      system+parallel     0.001
parallel    system+parallel     0.014
read        read+character      0.023
character   read+character      0.007
implicate   implicate+legal     0.035
legal       implicate+legal     0.083
system      system+distribute   0.002
distribute  system+distribute   0.037
make        make+recommend      0.024
recommend   make+recommend      0.142
infer       infer+deductive     0.095
deductive   infer+deductive     0.142
share       share+resource      0.054
resource    share+resource      0.042
Table 1. Examples of IC coefficients.
between two words x1 and x2 is obtained using the
following formula (α is a large constant):15

$$SIM(x_1, x_2) = \log\Big(\alpha \sum_{y} sim_y(x_1, x_2)\Big)$$

where

$$sim_y(x_1, x_2) = MIN\big(IC(x_1, [x_1, y]),\, IC(x_2, [x_2, y])\big) \cdot MIN\big(IC(y, [x_1, y]),\, IC(y, [x_2, y])\big)$$

The similarity function is further normalized with
respect to SIM(x1, x1).
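A minimal Python sketch of this computation follows. It rests on several assumptions: IC values are stored as a dictionary from (head, modifier) to (IC of head, IC of modifier); only contexts of the form [x,y] with a shared modifier y are considered; normalization is taken to mean division by SIM(x1, x1); and the value of α, the zero result for terms with no shared context, and all function names are illustrative choices rather than details given in the text.

```python
import math

def shared_sum(x1, x2, ic):
    """Sum sim_y(x1, x2) over modifiers y shared by heads x1 and x2."""
    mods1 = {m for (h, m) in ic if h == x1}
    mods2 = {m for (h, m) in ic if h == x2}
    total = 0.0
    for y in mods1 & mods2:
        ic_x1, ic_y1 = ic[(x1, y)]   # IC(x1, [x1,y]), IC(y, [x1,y])
        ic_x2, ic_y2 = ic[(x2, y)]   # IC(x2, [x2,y]), IC(y, [x2,y])
        total += min(ic_x1, ic_x2) * min(ic_y1, ic_y2)
    return total

def sim(x1, x2, ic, alpha=1000.0):
    """SIM(x1, x2) = log(alpha * sum_y sim_y); 0.0 if no shared context (assumption)."""
    s = shared_sum(x1, x2, ic)
    return math.log(alpha * s) if s > 0 else 0.0

def normalized_sim(x1, x2, ic, alpha=1000.0):
    """Assumed normalization: divide SIM(x1, x2) by SIM(x1, x1)."""
    self_sim = sim(x1, x1, ic, alpha)
    return sim(x1, x2, ic, alpha) / self_sim if self_sim > 0 else 0.0

# Made-up IC values for two heads sharing one modifier "y".
ic_toy = {("x1", "y"): (0.5, 0.2), ("x2", "y"): (0.25, 0.4)}
score = normalized_sim("x1", "x2", ic_toy)
```

Under these assumptions a term compared with itself scores 1, and because the denominator depends on the first argument, the normalized score is not symmetric in general.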
In addition, we require that words x1 and x2
appear in at least two distinct common contexts,
where a common context is a couple of pairs [x1,y]
and [x2,y], or [y,x1] and [y,x2], such that they each
occurred at least twice. Thus, banana and baltic will
15 This is inspired by a formula used by Hindle (1990), and
subsequently modified to take into account the asymmetry of the IC
measure.