Scientific Report No. IRS-13 Information Storage and Retrieval
Correlation Measures
chapter
K. Reitsma
J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
the elements are denoted by a second subscript, i.e., for term i, the
vector $c_i = (c_{i1}, c_{i2}, \ldots, c_{ik})$
$$\sum_{k} c_{ik}\, c_{jk} \quad \text{for all } i, j \qquad (3)$$

$$\sum_{k} c_{ki}\, c_{kj} \quad \text{for all } i, j \qquad (4)$$
The first summation gives the number of documents having both terms i and
j. The second summation gives the number of terms that documents i and
j have in common. It is identical to expression (1) discussed previously.
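The two summations described above can be illustrated with a small sketch. It assumes a binary incidence matrix (1 if a term occurs in a document, 0 otherwise), which is what makes each sum a count. The matrix `c` and the function names `term_cooccurrence` and `document_overlap` are hypothetical and chosen only for illustration; they do not come from the report.

```python
# Hypothetical binary term-document incidence matrix (an assumption for
# illustration): rows are terms, columns are documents, and
# c[i][k] == 1 when term i occurs in document k.
c = [
    [1, 1, 0, 1],  # term 0
    [1, 0, 0, 1],  # term 1
    [0, 1, 1, 0],  # term 2
]


def term_cooccurrence(c, i, j):
    """First summation: number of documents containing both term i and term j."""
    return sum(c[i][k] * c[j][k] for k in range(len(c[i])))


def document_overlap(c, i, j):
    """Second summation: number of terms shared by document i and document j."""
    return sum(c[k][i] * c[k][j] for k in range(len(c)))


print(term_cooccurrence(c, 0, 1))  # terms 0 and 1 co-occur in documents 0 and 3
print(document_overlap(c, 1, 2))   # documents 1 and 2 share only term 2
```

The first function sums over the columns (documents) of two rows, the second over the rows (terms) of two columns, so both operate on the same matrix and differ only in which subscript is held fixed.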
3. The Correlation Coefficients
This section contains an analysis of the various correlation coefficients
considered in this study. Each is analyzed according to its origin,
initial interpretation, modifications made, and final interpretation as a
document-document correlation coefficient.
It must be noted that there is a basic difference between the
document description vector and the request description vector. The former
is taken from an abstract of the article, which may consist of several
sentences. The latter is taken from a very short request. In the 82-document
ADI collection, the maximum number of concepts in one description vector
is [OCRerr]4, and the maximum weight found is 96. Among the 35 requests, the
maximum number of concepts in one description vector is 11, and the maximum
weight found is [OCRerr]8. In fact, most of the weights in the request vectors
are 12. The document description space is therefore not the same as the
request description space. This must be kept in mind when analyzing the