Scientific Report No. IRS-13 Information Storage and Retrieval
Correlation Measures
chapter
K. Reitsma
J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
concept number one, the third position concept three, etc.). Thus for binary
vectors, if a 0 occurs in the third position, concept three is absent from
that document, and if a 1 occurs in the third position, concept three is
present.
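As an illustration of the binary representation just described, the following minimal Python sketch builds such a vector from the set of concepts found in a document. The function name `binary_vector` and the seven-concept example are assumptions made for illustration only; they are not taken from the SMART system.

```python
def binary_vector(concepts_present, num_concepts):
    """Return a list of 0/1 flags; position k (counting from 1) refers to concept k.

    Hypothetical helper for illustration, not part of SMART.
    """
    present = set(concepts_present)
    return [1 if k in present else 0 for k in range(1, num_concepts + 1)]


# A document containing concepts 1, 2, 4 and 6 out of 7 possible concepts.
v = binary_vector({1, 2, 4, 6}, 7)
print(v)  # [1, 1, 0, 1, 0, 1, 0]
```

Here the third entry is 0, so concept three is absent from the document.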
The second type of vector is a weighted vector. Ideally, the positions
in the vector have the same interpretation as for binary vectors.
The difference is that the value in each position is 0 if the concept is
not found in the document, or some integer j where j > 0 is proportional
to the number of times the concept appears in the document. In the SMART
system a weight of 12 is given to concept k if concept k occurs once
in the document, 24 if k occurs twice, etc. Since approximately 600
concepts occur in the thesaurus, each document description vector would
normally have a length of 600 positions. To reduce the memory space needed,
only those concepts with non-zero weights are retained in the vector, the
concept number and weight both being packed into the same memory location.
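The compressed weighted representation can be sketched in Python as follows. The weight of 12 per occurrence comes from the text; the 16-bit field width and the names `pack`, `unpack`, and `sparse_weighted_vector` are assumptions chosen for illustration, since the report does not give the actual memory layout used in SMART.

```python
from collections import Counter

WEIGHT_PER_OCCURRENCE = 12  # from the text: one occurrence gives weight 12
WEIGHT_BITS = 16            # assumption: hypothetical width of the weight field


def pack(concept, weight):
    """Pack a concept number and its weight into a single integer word."""
    return (concept << WEIGHT_BITS) | weight


def unpack(word):
    """Recover (concept number, weight) from a packed word."""
    return word >> WEIGHT_BITS, word & ((1 << WEIGHT_BITS) - 1)


def sparse_weighted_vector(concept_occurrences):
    """Keep only concepts with non-zero weight, one packed word per concept."""
    counts = Counter(concept_occurrences)
    return [pack(k, WEIGHT_PER_OCCURRENCE * n) for k, n in sorted(counts.items())]


# Concept 3 occurs twice; concepts 17 and 598 occur once each.
packed = sparse_weighted_vector([3, 17, 3, 598])
print([unpack(w) for w in packed])  # [(3, 24), (17, 12), (598, 12)]
```

Only three words are stored for this document, rather than one position for each of the roughly 600 thesaurus concepts.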
The use of a correlation coefficient in the two systems poses
some problems. A coefficient defined for binary vectors may have a specific
interpretation, either logical or statistical. However, the same coefficient
used with weighted vectors may lose its former interpretation. For
example, given the two vectors
v = (1,1,0,1,0,1,0)
w = (1,0,1,1,0,1,1)
the expression
$$\sum_i v_i w_i \tag{1}$$
is interpreted as the number of matching terms or the number of concepts