Scientific Report No. IRS-13 Information Storage and Retrieval
Correlation Measures
K. Reitsma
J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The use of the factor l[OCRerr] is intended to simulate the original
function. In essence, dividing by l[OCRerr]4 partially eliminates the effect of
weights and therefore approximates binary terms.
The definition of N presents some problems. Originally, it was
intended to let N equal the number of concepts in the thesaurus, about
610. However, if this were done, the last two factors in the
denominator could become negative. Therefore, to avoid this problem,
N is defined as (4)(610), the (4) being the average concept weight divided
by 12, the base of the weighting system. (48 was arbitrarily chosen as the
average concept weight.) The coefficient is assured of being real, and no
attempt to normalize it has been made, so that values greater than 1 are
possible.
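
Taking the average concept weight as 48 and the base of the weighting system as 12, the arithmetic behind this choice of N can be written out as a short worked line (the thesaurus size of 610 is the approximate figure quoted above):

```latex
\[
N = \frac{48}{12} \times 610 = 4 \times 610 = 2440
\]
```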
H) The Average Coefficient
This formula simply calculates the average weight of all those
concepts which are found in both description vectors v and w. The
formula is
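\[
AV = \frac{\sum_{i=1}^{d} \delta_i \, (v_i + w_i)}{2N}
\]
where
\[
\delta_i =
\begin{cases}
1 & \text{if both } v_i > 0 \text{ and } w_i > 0 \\
0 & \text{if either } v_i = 0 \text{ or } w_i = 0
\end{cases}
\]
and where N equals the number of matching concepts. The summation is
taken over i = 1, ..., d, where d equals the number of concepts in the
description vector.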
It was originally intended to use this function, time permitting,
to determine whether it is more important to have fewer matching concepts
at higher weights than it is to have more matching concepts at lower weights.
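
The average coefficient described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the original program: the function name `average_coefficient`, the sample vectors, and the choice to return 0 when no concepts match are all assumptions. The sketch also assumes that the numerator sums both weights of each matching concept, which is what the 2N in the denominator suggests.

```python
from typing import Sequence


def average_coefficient(v: Sequence[float], w: Sequence[float]) -> float:
    """Average weight of the concepts present (weight > 0) in both v and w."""
    if len(v) != len(w):
        raise ValueError("description vectors must have the same length d")
    # Keep only the positions where both vectors carry a positive weight.
    matches = [(vi, wi) for vi, wi in zip(v, w) if vi > 0 and wi > 0]
    n = len(matches)
    if n == 0:
        # Assumption: with no matching concepts the coefficient is taken as 0.
        return 0.0
    # Assumed numerator: both weights of every matching concept, hence 2 * n.
    return sum(vi + wi for vi, wi in matches) / (2 * n)


# Hypothetical vectors: two matches at high weights vs. four at low weights.
few_heavy = average_coefficient([48, 36, 0, 0], [48, 24, 0, 12])
many_light = average_coefficient([12, 12, 12, 12], [12, 24, 12, 12])
print(few_heavy, many_light)  # 39.0 13.5
```

Under these assumptions the measure depends only on the weights of the matching concepts and not on how many there are, so the pair with two heavily weighted matches scores higher than the pair with four lightly weighted ones.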