ISR11 Scientific Report No. ISR-11 Information Storage and Retrieval On Some Clustering Techniques for Information Retrieval chapter J. D. Broffitt H. L. Morgan J. V. Soden Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government. ix-8 similarity measures discussed in section 2. S is then transformed into a binary mat?ix whose elements are zero (0) if the calculated similarity coefficient is below an empirically determined threshold, or one (1) otherwise. Subroutine SIMSIM then calculates the similarity matrix of the document-document matrix in a manner completely analagous to DOCDOC. This is obviously equivalent to calculating the similarity matrix of another similarity matrix and is repeated as many times as necessary to better define the clusters which result. Next, subroutine CLUSTER uses the algorithm as outlined in Figure 2 to define a 1tclustertt as that set of documents of maximum size [OCRerr]ere each document is similar (in the final document-document matrix of SIb[OCRerr]3IM) to all other members of the set. In addition, no cluster can be a subset of another cluster. After CLUSTER has found the "tight" clusters or, in graph theoretic terms, the maximal complete subloops of the document collection, subroutine ADJLJSTCL attempts to redefine the clusters in a manner more conducive to search optimization. This is accomplished by transferring a member of a cluster with membership less than a predetermined constant into the large cluster to which it is most similar. This similarity is the percentage of documents in the large cluster to which the transferring document is linked in the first document-document sjinjlarity matrix. Before the transfer is made, however, this percentage is checked against a predet- ermined value (sI[OCRerr][OCRerr]HRES), below which the transfer will not be made. After all shifting is completed, the classification vector is calcu- lated. This is merely the centroid vector of all of the document vectors