ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
chapter
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
ix-8
similarity measures discussed in section 2. S is then transformed into
a binary mat?ix whose elements are zero (0) if the calculated similarity
coefficient is below an empirically determined threshold, or one (1)
otherwise.
Subroutine SIMSIM then calculates the similarity matrix of the
document-document matrix in a manner completely analagous to DOCDOC. This
is obviously equivalent to calculating the similarity matrix of another
similarity matrix and is repeated as many times as necessary to better
define the clusters which result.
Next, subroutine CLUSTER uses the algorithm as outlined in Figure 2
to define a 1tclustertt as that set of documents of maximum size [OCRerr]ere each
document is similar (in the final document-document matrix of SIb[OCRerr]3IM) to
all other members of the set. In addition, no cluster can be a subset of
another cluster.
After CLUSTER has found the "tight" clusters or, in graph theoretic
terms, the maximal complete subloops of the document collection, subroutine
ADJLJSTCL attempts to redefine the clusters in a manner more conducive to
search optimization. This is accomplished by transferring a member of a
cluster with membership less than a predetermined constant into the large
cluster to which it is most similar. This similarity is the percentage of
documents in the large cluster to which the transferring document is
linked in the first document-document sjinjlarity matrix. Before the
transfer is made, however, this percentage is checked against a predet-
ermined value (sI[OCRerr][OCRerr]HRES), below which the transfer will not be made.
After all shifting is completed, the classification vector is calcu-
lated. This is merely the centroid vector of all of the document vectors