Scientific Report No. IRS-13 Information Storage and Retrieval
Correlation Measures
chapter
K. Reitsma
J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
term associations, n equals the number of documents in the collection, and
all the summations are taken from i = 1 to n. Stiles defines his formula
as based upon the chi-square formula; it gives the distance from the expected
frequency of occurrence assuming no association. The magnitude of this
function may be greater than 1 due to the presence of the log function.
By a simple analysis, it can be seen that the four factors in the
denominator are the number of documents containing term v, the number
containing term w, the number not containing term v, and the number
not containing term w, respectively.
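
The four denominator factors can be illustrated with a small sketch for binary document vectors. The vectors `v` and `w` below are made-up data, and `denominator_factors` is a hypothetical helper name that does not come from the report.

```python
def denominator_factors(v, w):
    """Return the four denominator counts for binary document vectors.

    v[i] and w[i] are 1 if document i contains the term and 0 otherwise.
    """
    if len(v) != len(w):
        raise ValueError("v and w must cover the same documents")
    n = len(v)           # number of documents in the collection
    with_v = sum(v)      # documents containing term v
    with_w = sum(w)      # documents containing term w
    # Order: containing v, containing w, not containing v, not containing w.
    return with_v, with_w, n - with_v, n - with_w


# Illustrative data only: 8 documents, two terms.
v = [1, 1, 0, 0, 1, 0, 0, 0]
w = [1, 0, 0, 0, 1, 1, 0, 0]
print(denominator_factors(v, w))
```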
This formula has been adapted for use with weighted vectors.
The modified formula is

\[
\mathrm{St} = \ln \frac{n \left( \left| n \sum v_i w_i - \sum v_i \sum w_i \right| - \tfrac{1}{2} n \right)^2}
{\sum (v_i)^2 \, \sum (w_i)^2 \, \left[ n - \sum (v_i)^2 \right] \left[ n - \sum (w_i)^2 \right]}
\]
Ignoring the factor of [illegible], the function is the same as Stiles' original
function except that the denominator contains the sum of squares instead of
only the sum of the terms. The reason for this change has already been
explained in the discussion of the Parker-Rhodes-Needham coefficient. One
other variation from the original function is the use of the natural logarithm
instead of the base 10 logarithm. This substitution was made in order to
facilitate coding on the computer, where a natural logarithm function exists.
No difficulty should arise, since both logarithms are increasing functions
and differ only by the constant factor ln 10, so the ordering of term pairs
is unchanged.
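
A minimal Python sketch of the weighted coefficient may make the description concrete. It rests on assumptions: the denominator uses sums of squares as stated above, while the numerator is assumed to follow the usual form of Stiles' factor with sums of weights substituted for document counts. The function name `stiles_weighted`, its `log` parameter, and the sample vectors are hypothetical.

```python
import math


def stiles_weighted(v, w, log=math.log):
    """Stiles-style association between two weighted document vectors.

    v[i] and w[i] are the weights of the two terms in document i.
    Assumed form: log(n * (|n*sum(vw) - sum(v)*sum(w)| - n/2)^2 / denominator),
    where the denominator uses sums of squares, as the text describes.
    """
    if len(v) != len(w):
        raise ValueError("v and w must cover the same documents")
    n = len(v)
    sum_vw = sum(a * b for a, b in zip(v, w))
    sq_v = sum(a * a for a in v)
    sq_w = sum(b * b for b in w)
    numerator = n * (abs(n * sum_vw - sum(v) * sum(w)) - n / 2) ** 2
    denominator = sq_v * sq_w * (n - sq_v) * (n - sq_w)
    if numerator <= 0 or denominator <= 0:
        # The logarithm is undefined here, e.g. a term in no or all documents.
        raise ValueError("coefficient undefined for these vectors")
    return log(numerator / denominator)


# Illustrative binary data: squares equal the values themselves, so the
# sums of squares reduce to plain document counts.
v = [1, 1, 0, 0, 1, 0, 0, 0]
w = [1, 0, 0, 0, 1, 1, 0, 0]
natural = stiles_weighted(v, w)                  # natural logarithm
base10 = stiles_weighted(v, w, log=math.log10)   # base 10 logarithm
# The two values differ only by the constant factor ln 10.
print(natural, base10 * math.log(10))
```

On binary vectors the sums of squares equal the document counts, so under these assumptions the sketch gives the same value as the unweighted form. Passing `log=math.log10` switches the base, and the two results differ only by the factor ln 10.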