IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Correlation Measures
chapter
K. Reitsma
J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
by making the simple substitution $\sum v_i = \sum v_i \bar{w}_i + \sum v_i w_i$ and a similar
substitution for $\sum w_i$. The first term in the expression for $\delta$ gives
the number of documents containing both terms v and w, and the second term
is proportional to the frequency with which documents would contain both
terms v and w if both v and w were random vectors.
For random vectors $\delta = 0$, giving a value of 0 for the coefficient.
For vectors in which the number of matching documents is greater or smaller
than the expected number, $\delta$ is greater than or less than 0. The range
of the function is then $-1 \le \text{M-K} \le +1$, with +1 signifying perfectly
correlated terms and -1 signifying perfectly negatively correlated terms.
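
The behaviour described above can be checked numerically. The following is a minimal sketch for binary vectors only. The full Maron-Kuhns formula is stated earlier in the report and is not reproduced in this passage, so the sketch assumes the usual 2x2 contingency form: the observed-minus-expected quantity is taken as the co-occurrence count minus the product of the two term counts divided by the vector length, and the normalizing denominator is assumed. The function name `maron_kuhns` and the handling of a zero denominator are illustrative choices, not taken from the report.

```python
def maron_kuhns(v, w):
    """Return (delta, coefficient) for two equal-length binary vectors.

    Assumed form (not quoted from the report):
        delta       = sum(v_i * w_i) - sum(v_i) * sum(w_i) / n
        coefficient = n * delta / (both * neither + v_only * w_only)
    """
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    n = len(v)
    pairs = list(zip(v, w))
    both = sum(vi * wi for vi, wi in pairs)                 # v_i = 1 and w_i = 1
    v_only = sum(vi * (1 - wi) for vi, wi in pairs)         # v_i = 1 and w_i = 0
    w_only = sum((1 - vi) * wi for vi, wi in pairs)         # v_i = 0 and w_i = 1
    neither = sum((1 - vi) * (1 - wi) for vi, wi in pairs)  # v_i = 0 and w_i = 0

    # Observed matches minus the number expected if v and w were random.
    delta = both - (both + v_only) * (both + w_only) / n

    denom = both * neither + v_only * w_only
    if denom == 0:
        # Assumption: treat the degenerate case (delta is also 0 here) as 0.
        return delta, 0.0
    return delta, n * delta / denom


# Identical vectors, complementary vectors, and independent vectors.
print(maron_kuhns([1, 1, 0, 0], [1, 1, 0, 0]))
print(maron_kuhns([1, 1, 0, 0], [0, 0, 1, 1]))
print(maron_kuhns([1, 1, 0, 0], [1, 0, 1, 0]))
```

Under these assumptions the three calls give coefficients at the top of the range, the bottom of the range, and zero respectively, matching the interpretation in the text.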
When the Maron-Kuhns coefficient is modified to be used as a
document-document correlation coefficient, its interpretation is altered.
The summations must now be taken from i = 1 to d where d equals the
number of concepts in the description vector. The formula then gives a
measure of the number of concepts found in both document v and document
w over and above the number expected if both v and w were random vectors.
Further problems arise when the document description vectors are
weighted vectors instead of binary vectors. One problem is the question of
complementation. To solve this problem, the complement of an element of a
vector is defined as the maximum concept weight found in the entire collection
in which that vector is found, minus the concept weight to be complemented.
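
The complementation rule just stated can be illustrated with a short sketch. It applies the definition literally to every element, zeros included, and represents the collection as a list of equal-length weight lists. The function name `complement_weights` and the sample weights are hypothetical, chosen only for illustration.

```python
def complement_weights(collection):
    """Complement every weight as (collection-wide maximum weight) - weight."""
    if not collection or not all(collection):
        raise ValueError("collection must contain non-empty vectors")
    # The maximum is taken over the entire collection, not per vector.
    max_weight = max(max(vector) for vector in collection)
    return [[max_weight - weight for weight in vector] for vector in collection]


# Two hypothetical weighted document vectors; the largest weight is 5.
docs = [[0, 2, 5], [1, 0, 3]]
print(complement_weights(docs))  # [[5, 3, 0], [4, 5, 2]]
```

For binary vectors the maximum weight is 1, so this reduces to the ordinary complement 1 - v_i.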
A second problem is concerned with all the zero elements of the
document description vector. If the above method of complementation were
used, the complement of a concept weight of zero would equal the maximum