Scientific Report No. ISR-11, Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt, H. L. Morgan, J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

the groups with whose classification vector the query vector was found to correlate highly on the first level. While increasing the search efficiency in terms of the number of comparisons made with a query vector, the clustering process inherently decreases the information which categorizes the individual documents. Thus the problem of document clustering is clear: classifying the documents into homogeneous groups in order to increase search efficiency without seriously sacrificing the ability to retrieve the documents.

Various schemes have been suggested to this end. These procedures gather documents into groups based on some type of correlation function which associates similar documents. Since there exists no purely analytical method for comparing the relative efficiency of these various methods, and since this efficiency may differ with the use to which the groups are put, it is necessary to compare the clustering procedures in the context of an automatic information retrieval system, and to evaluate the merits of the clustering procedures by evaluating the overall performance of such a system, with only a change in the clustering method from experiment to experiment. The evaluation of the overall performance of such a system clearly involves the environment in which the system is used, and the users themselves.

Specifically, the present study consists of a comparison of two clustering procedures: a method of R. E. Bonner [2], and the method proposed by J. J. Rocchio [3].
These techniques can reasonably be studied by a comparison of the documents retrieved by an automatic retrieval system using a two-level search based on the clusters produced by the methods with those retrieved by a full search of the document collection.
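
The comparison just described can be illustrated with a small sketch. It is not the experimental system of this report. It assumes cosine correlation as the correlation function, takes the mean of a group's document vectors as its classification vector, and uses a hand-made toy collection with a hand-made grouping in place of the output of either clustering method. All names (`cosine`, `centroid`, `full_search`, `two_level_search`, `overlap`, `docs`, `clusters`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Classification vector of a group, assumed here to be the mean vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def full_search(query, docs, k):
    """Compare the query with every document; return the k best document ids."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

def two_level_search(query, docs, clusters, k, n_groups=1):
    """Level 1 ranks the groups; level 2 searches only the best n_groups."""
    cents = {c: centroid([docs[d] for d in members])
             for c, members in clusters.items()}
    best = sorted(cents, key=lambda c: cosine(query, cents[c]),
                  reverse=True)[:n_groups]
    candidates = {d: docs[d] for c in best for d in clusters[c]}
    # Comparisons made: one per classification vector, one per candidate.
    comparisons = len(clusters) + len(candidates)
    return full_search(query, candidates, k), comparisons

def overlap(two_level, full):
    """Fraction of the full-search results also found by the two-level search."""
    return len(set(two_level) & set(full)) / len(full)

# Toy collection (assumed data): five documents over four terms.
docs = {
    "d1": [2, 1, 0, 0],
    "d2": [1, 2, 0, 0],
    "d3": [0, 0, 2, 1],
    "d4": [0, 0, 1, 2],
    "d5": [1, 0, 0, 1],
}
# Hand-made grouping standing in for the output of a clustering procedure.
clusters = {"A": ["d1", "d2"], "B": ["d3", "d4", "d5"]}

query = [1, 0, 0, 0]
full = full_search(query, docs, k=3)
two_level, comparisons = two_level_search(query, docs, clusters, k=3)
print(full, two_level, comparisons, overlap(two_level, full))
```

In this toy case the two-level search makes four comparisons instead of five but misses `d5`, which correlates well with the query yet sits in a group whose classification vector does not. This is the trade-off between search efficiency and retrieval ability that the comparison is meant to measure.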