Scientific Report No. ISR-11
Information Storage and Retrieval
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

IX. On Some Clustering Techniques for Information Retrieval

J. D. Broffitt, H. L. Morgan, and J. V. Soden

Abstract

Document clustering methods proposed by R. E. Bonner and J. J. Rocchio are compared. Bonner's method is found to give higher precision than Rocchio's method, while the recall for the two methods is about the same. Bonner's method requires about twice as many comparisons against a query vector as Rocchio's method; this is to be expected, since Rocchio controls the cluster size in order to maximize search efficiency. Manual relevance judgments are used, as well as relevance judgments determined by query-document cosines. The results are found to be invariant under the two measures.

1. Introduction

The organization of information into homogeneous groups plays a major role in many fields of research. Some areas of application are information retrieval, biological taxonomy, isolation of disease syndromes in medicine, anthropology (categorization of tribes), and business applications such as categorizing TV audiences, sales offices, etc. Indeed, the applications of information organization are numerous, and the particular application being studied dictates the type of classification needed.

Needham [1] has divided classification problems into three types: (1) the assignment of given objects to given classes, (2) the extraction of class characteristics from given classes and their objects, and