ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
chapter
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
(3) the setting up of appropriate classes, clusters, clumps, or groups
given a set of objects and some information about them. Even more
specifically, this last problem may be viewed as a problem of either
structuring the objects into a hierarchy or tree arrangement, or collecting
the objects into coherent groups without regard for hierarchical relations
among the objects or groups. Both of these problems find their place
within the realm of an automatic information retrieval system, where
documents are identified by vectors of weighted measures of occurrences
of concepts.
Thesaurus construction, for example, although concerned with the
grouping of co-occurring concepts for synonym references, has as a major
objective the determination of hierarchies of information among the concepts.
On the other hand, document clustering deals with the collection of
similar documents into groups based only on similarities among the documents.
The application of information organization to document clustering is the
concern of this study.
To operate effectively, an automatic information retrieval
system based on vector matching between queries and documents requires an efficient
document clustering procedure. The number of documents which
must be compared with the query in order reasonably to satisfy the demands
of an information request is too large to permit individual comparisons
with each document in the collection. Once the documents have been clustered
into homogeneous groups, a two-level search procedure greatly reduces the
number of comparisons needed to answer a request. This method first compares
the query vector with the classification vectors characterizing the groups.
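
The two-level procedure can be illustrated with a minimal sketch. It rests on several assumptions not stated in the text: the cosine correlation stands in for the unspecified "vector matching" function, the classification vectors are taken as given, and the names `cosine`, `two_level_search`, and `top_groups` are hypothetical. The query is first ranked against the classification vectors, and only the documents of the best-matching groups are then compared individually.

```python
import math


def cosine(u, v):
    """Cosine correlation between two equal-length weight vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_level_search(query, clusters, top_groups=1):
    """clusters: list of (classification_vector, [(doc_id, doc_vector), ...])."""
    # Level 1: compare the query with each group's classification vector.
    ranked = sorted(clusters, key=lambda c: cosine(query, c[0]), reverse=True)
    # Level 2: compare the query only with documents in the best-matching groups.
    candidates = [doc for _, docs in ranked[:top_groups] for doc in docs]
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)


# Illustrative data only: two groups of two documents over three concepts.
clusters = [
    ([1.0, 0.5, 0.0], [("d1", [1, 0, 0]), ("d2", [1, 1, 0])]),
    ([0.0, 0.5, 1.0], [("d3", [0, 0, 1]), ("d4", [0, 1, 1])]),
]
results = two_level_search([1, 0, 0], clusters)
```

With this toy collection the query is compared with two classification vectors and two documents, rather than with all four documents. The saving grows with the size of the collection, provided the groups are reasonably homogeneous.
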
The second level compares the query vector with all documents belonging to