Scientific Report No. IRS-13 Information Storage and Retrieval
Correlation Measures
K. Reitsma
J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
the two vectors have in common. The summation equals 3, meaning there are
three concepts found in both v and w, namely concepts 1, 4, and 6.
However, the same expression used with weighted vectors does not
produce the same simple interpretation. For example, given the two vectors
v = (12, 24, 0, 36, 0, 12, 0)
w = (24, 0, 12, 24, 0, 12, 36)
the above equation (Eq. 1) gives a value of 1296. Although each of these
vectors contains the same concepts as the binary vectors above, and they have
the same three concepts in common, there is no simple interpretation for
the number 1296. The closest interpretation is that it is a relative
value which can be compared with the figure obtained by applying the same
summation to v and some other vector w', as a measure of the matching
concepts; the comparison thereby determines which vector, w or w', matches
better with v.
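The arithmetic above can be checked with a short sketch. It assumes the summation is the sum of component-wise products, and it reads the two weighted vectors as v = (12, 24, 0, 36, 0, 12, 0) and w = (24, 0, 12, 24, 0, 12, 36), a reading that reproduces the stated value 1296. The function names `inner_product` and `to_binary` are illustrative and do not come from the report.

```python
def inner_product(v, w):
    """Sum of component-wise products of two equal-length vectors."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same number of concepts")
    return sum(a * b for a, b in zip(v, w))


def to_binary(v):
    """Replace every nonzero weight by 1 (concept present) and keep 0 otherwise."""
    return [1 if x else 0 for x in v]


# Weighted vectors as read from the text (an assumption, see the lead-in).
v = [12, 24, 0, 36, 0, 12, 0]
w = [24, 0, 12, 24, 0, 12, 36]

weighted_value = inner_product(v, w)                       # relative value only
binary_value = inner_product(to_binary(v), to_binary(w))   # count of shared concepts

# 1-based positions of the concepts present in both vectors.
common_concepts = [i for i, (a, b) in enumerate(zip(v, w), start=1) if a and b]

print(weighted_value, binary_value, common_concepts)
```

With binary vectors the result is a count of shared concepts; with weighted vectors it is only useful relative to the value obtained for another candidate vector.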
An example of an expression which doesn't lose its meaning when
weighted vectors are used instead of binary vectors is the following
$$\sum_{i=1}^{t} (v_i)^2 \qquad (2)$$
This expression represents the absolute length of the vector in t-space,
where t is the number of concepts possible in the description vector.
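As a rough illustration of the length of a vector in t-space, the sketch below computes the sum of squared components and its square root, the Euclidean length. Which of the two quantities the expression denotes is an assumption here, so both are shown. The example vectors and the names `sum_of_squares` and `euclidean_length` are illustrative, not from the report.

```python
import math


def sum_of_squares(v):
    """Sum of the squared components of a description vector."""
    return sum(x * x for x in v)


def euclidean_length(v):
    """Square root of the sum of squares: the length of v in t-space."""
    return math.sqrt(sum_of_squares(v))


# Illustrative vectors with t = 7 concepts (assumed values).
v_weighted = [12, 24, 0, 36, 0, 12, 0]
v_binary = [1, 1, 0, 1, 0, 1, 0]

print(sum_of_squares(v_binary), euclidean_length(v_binary))
print(sum_of_squares(v_weighted), euclidean_length(v_weighted))
```

For the binary vector the sum of squares is simply the number of concepts present; for the weighted vector the same computation still yields a geometric length, which is why the expression keeps its meaning.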
There exist coefficients other than these two to measure the simi-
larity between documents. For the most part, these coefficients are used
in thesaurus construction and measure the similarity between concepts.
When calculating the term-term association coefficient, several of the
expressions discussed above have a different interpretation. For example,
given the term description vectors c1, c2, ... where for each term vector