Scientific Report No. ISR-11, Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt, H. L. Morgan, J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Rocchio states that since the cosine is used as the matching function to retrieve documents, clustering with it should give better results. To test this, Bonner's method is being tried using both coefficients.

Rocchio's Procedure*

The stated objective of Rocchio's procedure is that of jointly maximizing search efficiency and minimizing the loss of relevant documents retrieved in the search. It is a heuristically derived algorithm which is meant to be used in conjunction with a two-level search. The input parameters to the algorithm are: (1) the number of categories desired; (2) lower and upper bounds on the number of elements to be allowed in a category; (3) a lower bound on the correlation between a document and a classification vector, below which a document will not be placed in a category.

All documents are first considered unclustered, and pass from this state into one of two other possible states, clustered or loose. The algorithm proceeds as follows. An unclustered document is selected as a possible cluster center. All of the other unclustered and loose documents are correlated with it, and the selected document is subjected to a region density test to see if a category should be formed around it. This test specifies that more than N1 documents should be correlated higher than ρ1 with the candidate, and that more than N2 documents should be correlated higher than ρ2 with the candidate. This ensures that documents on the edge of large groups do not become centers of groups.
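The region density test described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `cosine` and `passes_density_test` are hypothetical, the cosine is assumed as the correlation coefficient, documents are assumed to be plain lists of term weights, and the two count thresholds and two correlation thresholds are called `n1`, `rho1`, `n2`, and `rho2` here. It is not the program of reference [3].

```python
import math


def cosine(u, v):
    """Cosine correlation between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector is treated as uncorrelated with everything (assumption).
        return 0.0
    return dot / (norm_u * norm_v)


def passes_density_test(candidate, others, n1, rho1, n2, rho2):
    """Return True if a category may be formed around `candidate`.

    `others` holds the remaining unclustered and loose documents.
    The test requires more than n1 of them to correlate above rho1
    with the candidate, and more than n2 to correlate above rho2.
    """
    correlations = [cosine(candidate, doc) for doc in others]
    above_rho1 = sum(1 for c in correlations if c > rho1)
    above_rho2 = sum(1 for c in correlations if c > rho2)
    return above_rho1 > n1 and above_rho2 > n2


# Hypothetical two-term collection: three documents close together, one apart.
dense_group = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.1]]
outlier = [0.0, 1.0]

# A document inside the dense group qualifies as a cluster center.
center_ok = passes_density_test([1.0, 0.0], dense_group + [outlier],
                                n1=2, rho1=0.5, n2=1, rho2=0.9)

# The isolated document does not.
outlier_ok = passes_density_test(outlier, dense_group + [[1.0, 0.0]],
                                 n1=2, rho1=0.5, n2=1, rho2=0.9)
```

With these illustrative thresholds, the candidate inside the dense group passes the test and the isolated document fails it. Using two threshold pairs lets one pair demand a broad neighborhood and the other a tight core, which is how a document on the fringe of a large group is kept from becoming a center.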
* This section is a summary of section [OCRerr].5 of reference [3], to which the reader is referred for a more detailed discussion and program flowcharts.

For example, in