Scientific Report No. ISR-11
Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt, H. L. Morgan, J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

(3) the setting up of appropriate classes, clusters, clumps, or groups given a set of objects and some information about them.

Even more specifically, this last problem may be viewed as a problem of either structuring the objects into a hierarchy or tree arrangement, or collecting the objects into coherent groups without regard for hierarchical relations among the objects or groups. Both of these problems find their place within the realm of an automatic information retrieval system, where documents are identified by vectors of weighted measures of occurrences of concepts. Thesaurus construction, for example, while being concerned with the grouping of cooccurring concepts for synonym references, has as a major objective the determination of hierarchies of information among the concepts. On the other hand, document clustering deals with the collection of similar documents into groups based only on similarities among the documents. The application of information organization to document clustering is the concern of this study.

In order effectively to operate an automatic information retrieval system based on vector matching between queries and documents, an efficient document clustering procedure is necessary. The number of documents which must be compared with the query in order reasonably to satisfy the demands of an information request is too large to permit individual comparisons with each document in the collection. Once the documents have been clustered into homogeneous groups, a two-level search procedure greatly reduces the number of comparisons needed to answer a request.
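The two-level search can be illustrated with a minimal sketch in Python. It is not the report's own procedure: the cosine correlation as the matching function, the use of cluster centroids as classification vectors, the restriction of the second level to the best-matching groups, and the names `cosine`, `centroid`, and `two_level_search` are all assumptions made here for illustration. The toy vectors are invented and carry no experimental meaning.

```python
import math


def cosine(u, v):
    """Cosine correlation between two equal-length weighted concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean, used here as an assumed classification vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def two_level_search(query, docs, clusters, top_clusters=1, top_docs=3):
    """Return (ranked document ids, number of vector comparisons made).

    docs:     dict mapping document id -> concept vector
    clusters: dict mapping cluster id -> list of member document ids
    """
    # Level 1: compare the query with one classification vector per group.
    class_vectors = {
        cid: centroid([docs[d] for d in members])
        for cid, members in clusters.items()
    }
    ranked_clusters = sorted(
        class_vectors,
        key=lambda cid: cosine(query, class_vectors[cid]),
        reverse=True,
    )
    chosen = ranked_clusters[:top_clusters]

    # Level 2: compare the query only with documents in the chosen groups.
    candidates = [d for cid in chosen for d in clusters[cid]]
    ranked_docs = sorted(
        candidates, key=lambda d: cosine(query, docs[d]), reverse=True
    )

    comparisons = len(clusters) + len(candidates)
    return ranked_docs[:top_docs], comparisons


# Hypothetical toy collection: six documents over six concepts, three groups.
docs = {
    "d1": [3, 1, 0, 0, 0, 0],
    "d2": [2, 2, 0, 0, 0, 0],
    "d3": [0, 0, 4, 1, 0, 0],
    "d4": [0, 0, 1, 3, 0, 0],
    "d5": [0, 0, 0, 0, 2, 1],
    "d6": [0, 0, 0, 0, 1, 5],
}
clusters = {"A": ["d1", "d2"], "B": ["d3", "d4"], "C": ["d5", "d6"]}
query = [1, 0, 0, 0, 0, 0]

hits, comparisons = two_level_search(query, docs, clusters)
print(hits, comparisons)
```

In this toy case the search makes five comparisons (three classification vectors plus two documents) instead of the six a full scan would need. The saving grows with collection size, since a full scan costs one comparison per document while the two-level search costs one per group plus one per document in the groups selected.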
This method first compares the query vector with the classification vectors characterizing the groups. The second level compares the query vector with all documents belonging to