Scientific Report No. IRS-13
Information Storage and Retrieval
Correlation Measures
K. Reitsma, J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

term associations, n equals the number of documents in the collection and all the summations are taken from i = 1 to n.

Stiles defines his formula as based upon the chi-square formula; it gives the distance from the expected frequency of occurrence assuming no association. The magnitude of this function may be greater than 1 due to the presence of the logarithm. By a simple analysis, it can be seen that the four factors in the denominator are the number of documents containing term $v$, the number containing term $w$, the number not containing term $v$, and the number not containing term $w$, respectively.

This formula has been adapted for use with weighted vectors. The modified formula is

$$
S_t = \ln \left[ \frac{ n \left( \left| \, n \sum v_i w_i - \sum v_i \sum w_i \, \right| - \dfrac{n}{2} \right)^{2} }{ \sum (v_i)^2 \; \sum (w_i)^2 \left[ n - \sum (v_i)^2 \right] \left[ n - \sum (w_i)^2 \right] } \right]
$$

Ignoring the factor of [illegible], the function is the same as Stiles's original function except that the denominator contains the sum of squares instead of only the sum of the terms. The reason for this change has already been explained in the discussion of the Parker-Rhodes-Needham coefficient.

One other variation from the original function is the use of the natural logarithm instead of the base 10 logarithm. This substitution was made in order to facilitate coding on the computer, where a natural logarithm function is available. No difficulty should arise, since both logarithms are increasing functions and differ only by the constant factor ln 10.
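The modified coefficient can be illustrated with a minimal Python sketch. It is not the report's program. It assumes the formula has the form ln[ n(|n Σv_i w_i − Σv_i Σw_i| − n/2)² / (Σv_i² Σw_i² (n − Σv_i²)(n − Σw_i²)) ], which is Stiles's published form with sums of squares substituted in the denominator, and it assumes weights between 0 and 1. The function name `stiles_weighted` and the sample vectors are hypothetical.

```python
import math


def stiles_weighted(v, w):
    """Weighted variant of Stiles's association factor (assumed form).

    v and w hold the weights of two terms over the same n documents.
    Assumption: weights lie in [0, 1], so n minus a sum of squares
    cannot be negative.
    """
    if len(v) != len(w):
        raise ValueError("term vectors must cover the same documents")
    n = len(v)
    cooccurrence = sum(a * b for a, b in zip(v, w))
    sum_v, sum_w = sum(v), sum(w)
    squares_v = sum(a * a for a in v)
    squares_w = sum(b * b for b in w)

    # Numerator: n * (|n*f - A*B| - n/2)^2, using plain sums for A and B.
    numerator = n * (abs(n * cooccurrence - sum_v * sum_w) - n / 2) ** 2
    # Denominator: the four factors, using sums of squares.
    denominator = squares_v * squares_w * (n - squares_v) * (n - squares_w)
    if numerator <= 0 or denominator <= 0:
        raise ValueError("coefficient is undefined for these vectors")
    return math.log(numerator / denominator)


# Hypothetical binary term vectors over ten documents.
v = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
w = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
x = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

for name, (a, b) in {"v,w": (v, w), "x,y": (x, y)}.items():
    natural = stiles_weighted(a, b)
    base10 = natural / math.log(10)  # log10(z) = ln(z) / ln(10)
    print(name, natural, base10)
```

For binary vectors the sums of squares equal the plain sums, so under these assumptions the sketch reduces to Stiles's original argument inside the logarithm. Dividing by ln 10 converts the result to base 10 without changing the ranking of term pairs.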