Scientific Report No. IRS-13 Information Storage and Retrieval
Correlation Measures
K. Reitsma
J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
$$\text{P-R-N}(v, w) = \frac{\sum v_i w_i}{\sum v_i + \sum w_i - \sum v_i w_i}$$
where all summations are taken over i = 1 to t, and where t equals the
number of documents in the collection. Since the term vectors are binary,
the interpretation of the terms in the denominator is simple. The first
term is the number of documents containing term v, the second is the number
of documents containing term w, and the third is the number of documents
containing both terms v and w. Taken as a whole, the denominator gives the
number of documents containing at least one of the terms.
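
The counting interpretation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the report: the function name `prn_association` and the sample vectors are hypothetical, the vectors are assumed to be binary lists with one entry per document, and returning 0 when neither term occurs anywhere is an assumption made here to avoid dividing by zero.

```python
def prn_association(v, w):
    """P-R-N association of two binary term vectors (one entry per document)."""
    if len(v) != len(w):
        raise ValueError("term vectors must cover the same documents")
    both = sum(a * b for a, b in zip(v, w))  # documents containing both terms
    n_v = sum(v)                             # documents containing term v
    n_w = sum(w)                             # documents containing term w
    either = n_v + n_w - both                # documents containing at least one term
    if either == 0:
        return 0.0  # assumption: neither term occurs, so define the association as 0
    return both / either


# Hypothetical collection of five documents.
v = [1, 1, 0, 1, 0]
w = [0, 1, 1, 1, 0]
print(prn_association(v, w))  # 2 / (3 + 3 - 2) = 0.5
```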
For two identical terms, the denominator equals the numerator and
the association is 1. For two independent terms, where no document
contains both terms, the numerator is zero and the association is 0.
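
As a quick check of these two limiting cases, and of one case in between, consider hypothetical counts in which term v occurs in 3 documents and term w in 2 (for the identical case, w is taken to be v itself):

$$\text{identical terms: } \frac{3}{3 + 3 - 3} = 1, \qquad \text{no shared document: } \frac{0}{3 + 2 - 0} = 0, \qquad \text{one shared document: } \frac{1}{3 + 2 - 1} = \frac{1}{4}$$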
When term-term associations are calculated, all the terms are
usually compared with all the other terms at the same time, using matrix
multiplication. The result is a matrix whose elements are terms of the
above formula. Since matrix multiplication requires the calculation of
many inner products, each of the entries in the association matrix is the
result of an inner product, and therefore so is each term in the P-R-N
formula. Thus, the summations $\sum v_i$ and $\sum w_i$ are in practice
calculated by $v \cdot v$ and $w \cdot w$, which is the same as
$\sum v_i v_i$ and $\sum w_i w_i$, which is the same as $\sum v_i^2$ and
$\sum w_i^2$, where the summations are taken