Scientific Report No. ISR-11
Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt, H. L. Morgan, J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

This entire procedure is repeated with all unclustered documents. When this is completed, there is no guarantee that the required minimum number of clusters has been formed. Hence, some documents which failed the region density test in the first pass are chosen as cluster centers. If the number of categories formed in the first pass was too high, the density test may be made stricter and the first pass repeated.

There may still be some relatively isolated documents at the end of the clustering process. These correlate very poorly with any of the classification vectors. In order to test the procedures properly with the limited collections available, these documents are included in the cluster with which they correlate most highly.

This procedure has been programmed for the CDC 1604 computer in FORTRAN 63 and CODAP by V. Lesser. The output of this program consists of a deck of punched cards which specify the documents belonging to each category and the classification vector for each category formed. These cards are then used as input to a two-level search program, written by the authors in FORTRAN 63, which compares each query only to the documents in the clusters whose classification vectors correlate most highly with the query vector.

Bonner's Procedure

This algorithm is based upon Clustering Programs I and II and the Cluster Adjustment Program presented by Bonner in reference [2], and has been programmed by the authors for the CDC 1604 computer as FORTRAN 63 subroutines DOCDOC, SI[OCRerr]IM, CLUSTER, and ADJUSTCL.
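As an aside on the two-level search used with the first clustering procedure, the idea can be sketched in a few lines of modern code. This is only an illustrative sketch, not the authors' FORTRAN 63 program: the function names, the data layout, the sample vectors, and the choice of the cosine as the correlation measure are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine correlation of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_level_search(query, clusters, documents, n_clusters=1):
    """Hypothetical two-level search.

    clusters:  cluster id -> (classification vector, list of document ids)
    documents: document id -> term vector
    Returns the ids of the candidate documents, best match first.
    """
    # Level 1: rank the clusters by correlation of their
    # classification vectors with the query.
    ranked = sorted(clusters,
                    key=lambda c: cosine(query, clusters[c][0]),
                    reverse=True)
    # Level 2: compare the query only with documents belonging
    # to the best-correlating clusters.
    candidates = {d for c in ranked[:n_clusters] for d in clusters[c][1]}
    return sorted(candidates,
                  key=lambda d: cosine(query, documents[d]),
                  reverse=True)


# Made-up sample data: four documents over four terms, two clusters.
documents = {
    "d1": [1, 1, 0, 0],
    "d2": [1, 0, 0, 0],
    "d3": [0, 0, 1, 1],
    "d4": [0, 0, 0, 1],
}
clusters = {
    "A": ([2, 1, 0, 0], ["d1", "d2"]),
    "B": ([0, 0, 1, 2], ["d3", "d4"]),
}

result = two_level_search([1, 0, 0, 0], clusters, documents)
```

With this sample data the query is matched against cluster A only, so documents d3 and d4 are never compared with it. Raising the hypothetical `n_clusters` parameter widens the search at the cost of more document comparisons.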
Subroutine DOCDOC accepts as input a binary document-term matrix and calculates a document-document similarity matrix S, using either of the