Scientific Report No. ISR-11
Information Storage and Retrieval

Chapter: Operating Instructions for the SMART Text Processing and Document Retrieval System
M. E. Lesk

Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

measure of relation between the rows $A_{pi}$ (all $i$) and $A_{qi}$ (all $i$). It has the value 0 for totally dissimilar rows, and 1 for identical rows. Two different numerical algorithms are available for the computation of $r_{pq}$: the cosine algorithm and the overlap algorithm.

The cosine algorithm is defined as follows:

$$ r_{pq} = \frac{\sum_k A_{pk} A_{qk}}{\sqrt{\left(\sum_k A_{pk}^2\right)\left(\sum_k A_{qk}^2\right)}} $$

The overlap algorithm is defined as follows:

$$ r_{pq} = \frac{\sum_k \min(A_{pk}, A_{qk})}{\min\left(\sum_k A_{pk}, \sum_k A_{qk}\right)} $$

where $\min(x, y)$ is the numerically smaller of $x$ and $y$. Note that both measures are symmetric; i.e., $r_{pq} = r_{qp}$.

These correlations can now be subjected to a cutoff process defining $S_{pq}$ as follows: $S_{pq} = 1$ if $r_{pq}$ is greater than the cutoff value; $S_{pq} = 0$ otherwise. The matrix $S$, with both rows and columns labeled by concepts, now identifies the concepts which have similar document environments; that is, the pairs of concepts which occur in the same documents have a one at the intersection of their row and column. This process can now be used for the expansion of the document vector by augmenting all concepts by the list of concepts with similar environments.

The similarity matrix $S$ can also be used as the starting point for further correlations, however. Some writers feel that the truly significant question is not "which concepts have similar document environments" but "which concepts have similar concept environments". This question requires an additional correlation, a correlation of the matrix $S$, to identify the