ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Operating Instructions for the SMART Text Processing and Document Retrieval System
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
measure of relation between the rows $A_{pi}$ (all $i$) and $A_{qi}$ (all $i$). It has
the value 0 for totally dissimilar rows, and 1 for identical rows. Two
different numerical algorithms are available for the computation of
$r_{pq}$; these are the cosine algorithm and the overlap algorithm. The cosine
algorithm is defined as follows:
$$r_{pq} = \frac{\sum_k A_{pk} A_{qk}}{\sqrt{\left(\sum_k A_{pk}^2\right)\left(\sum_k A_{qk}^2\right)}}$$
The overlap algorithm is defined as follows:
$$r_{pq} = \frac{\sum_k \min(A_{pk}, A_{qk})}{\min\left(\sum_k A_{pk},\ \sum_k A_{qk}\right)}$$
where $\min(x, y)$ is the numerically smaller of $x$ and $y$. Note that both
measures are symmetric; i.e., $r_{pq} = r_{qp}$.
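The two correlation measures can be illustrated with a minimal Python sketch. It assumes, consistent with the later discussion of concepts and their document environments, that each row of A is a concept and each column a document; the function names and the sample matrix are hypothetical, and the square root in the cosine denominator is assumed so that identical rows yield exactly 1.

```python
import math

def cosine(row_p, row_q):
    """Cosine correlation between two rows of the matrix A."""
    numerator = sum(a * b for a, b in zip(row_p, row_q))
    denominator = math.sqrt(sum(a * a for a in row_p) * sum(b * b for b in row_q))
    # An all-zero row has no defined correlation; treat it as 0 (an assumption).
    return numerator / denominator if denominator else 0.0

def overlap(row_p, row_q):
    """Overlap correlation between two rows of the matrix A."""
    numerator = sum(min(a, b) for a, b in zip(row_p, row_q))
    denominator = min(sum(row_p), sum(row_q))
    return numerator / denominator if denominator else 0.0

# Hypothetical matrix: rows are concepts, columns are documents.
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [1, 0, 1, 0],
]

for name, measure in (("cosine", cosine), ("overlap", overlap)):
    print(name, [[round(measure(p, q), 3) for q in A] for p in A])
```

Both functions return 1 for the identical rows 0 and 1, 0 for the disjoint rows 0 and 2, and the same value whichever row is given first.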
These correlations can now be subjected to a cutoff process defining
$S_{pq}$ as follows: $S_{pq} = 1$ if $r_{pq}$ is greater than the cutoff value; $S_{pq} = 0$
otherwise. The matrix S with both rows and columns labeled by concepts
now identifies the concepts which have similar document environments;
that is, the pairs of concepts which occur in the same documents have a
one at the intersection of their row and column.
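As a hedged illustration of the cutoff step, the sketch below turns a matrix of correlations into the binary matrix S. The function name `apply_cutoff`, the sample correlations, and the cutoff value 0.5 are all hypothetical; only the strict "greater than" comparison comes from the text.

```python
def apply_cutoff(r, cutoff):
    """Return S with S[p][q] = 1 where r[p][q] exceeds the cutoff, else 0."""
    return [[1 if value > cutoff else 0 for value in row] for row in r]

# Hypothetical symmetric concept-concept correlations.
r = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

S = apply_cutoff(r, 0.5)
for row in S:
    print(row)
```

Because the correlations are symmetric, S is symmetric as well, so a one at row p, column q always has a matching one at row q, column p.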
This process can now be used for the expansion of the document vector
by augmenting all concepts by the list of concepts with similar environments.
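The expansion step might be sketched as follows. The text does not say what weight an added concept receives, so this sketch treats a document simply as a set of concept numbers; the function name `expand_document` and the sample matrix are hypothetical, and the expansion is assumed to be a single pass rather than repeated until nothing new is added.

```python
def expand_document(doc_concepts, S):
    """Add to a document every concept marked similar to one it already contains."""
    expanded = set(doc_concepts)
    for p in doc_concepts:
        for q, related in enumerate(S[p]):
            if related:
                expanded.add(q)
    return expanded

# Hypothetical similarity matrix over four concepts (rows and columns are concepts).
S = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
]

print(sorted(expand_document({0}, S)))
print(sorted(expand_document({1, 2}, S)))
```

A document containing only concept 0 gains concept 1, but not concept 3, since concept 3 is related to concept 1 rather than to anything in the original document.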
The similarity matrix S can also be used as the starting point for
further correlations, however. Some writers feel that the truly significant
question is not "which concepts have similar document environments" but
"which concepts have similar concept environments". This question requires
an additional correlation, a correlation of the matrix S, to identify the