ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
weights of the properties are reasonably similar for both terms,
so that neither term dominates the other, and they are placed in
the same concept class;
3) terms A and B are identified by the same properties, but the
property weights are higher for term A than for term B; then A
may be said to dominate B, and may be placed on a higher level in
the hierarchy;
4) terms A and B are identified by the same properties, and B dominates
A.
In order to be able to make a decision concerning the similarity
between two property vectors, it is necessary to compute a similarity
coefficient between them. In the present context, it is best to use an
asymmetric coefficient such that the similarity between term i and term
j is not necessarily the same as between term j and term i. Given
property vectors $v^i$ and $v^j$, representing terms i and j
respectively, a possible similarity measure is

$$c_{ij} = \frac{\sum_k \min(v_k^i, v_k^j)}{\sum_k v_k^i}$$
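
As an illustration, the coefficient can be sketched in Python. This is a
minimal sketch, not the report's own program: it assumes non-negative
property weights, assumes the normalisation is by the summed weights of
term i (the reading taken from the damaged formula), and returns 0 for an
all-zero vector. The function names and the sample vectors are
hypothetical.

```python
from typing import List, Sequence


def asymmetric_similarity(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """Return c_ij = sum_k min(v_i[k], v_j[k]) / sum_k v_i[k]."""
    if len(v_i) != len(v_j):
        raise ValueError("property vectors must have the same length")
    total = sum(v_i)
    if total == 0:
        # Assumption: a term with no weighted properties is similar to nothing.
        return 0.0
    overlap = sum(min(a, b) for a, b in zip(v_i, v_j))
    return overlap / total


def correlation_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build the term-term matrix; entry [i][j] holds c_ij."""
    return [[asymmetric_similarity(vi, vj) for vj in vectors] for vi in vectors]


# Hypothetical property vectors: A and B share properties, with A weighted
# more heavily; C shares no properties with either.
term_a = [4, 2, 0]
term_b = [2, 1, 0]
term_c = [0, 0, 3]

matrix = correlation_matrix([term_a, term_b, term_c])
print(matrix[0][1])  # c_AB = (2 + 1) / (4 + 2) = 0.5
print(matrix[1][0])  # c_BA = (2 + 1) / (2 + 1) = 1.0
print(matrix[0][2])  # c_AC = 0.0, no common properties
```

In this example the coefficient is asymmetric: c_AB is 0.5 while c_BA is
1.0, because the weights of B are wholly contained in those of A.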
Using this measure, a term-term correlation matrix can now be
constructed, giving for each pair of terms the similarity measure
$c_{ij}$. It may be noticed that if the two vectors $v^i$ and $v^j$ are
identical, then $c_{ij}$ equals 1, and when $v^i$ and $v^j$ have no
common properties, then $c_{ij}$ equals 0. A cut-off value K may now be
applied to the similarity
coefficients, and a hierarchy may be formed based on the following
algorithm: [11]