Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Figure 1 below, document A would pass the test while document B would not
(the documents are here represented by their endpoints on the unit
hypersphere).

[Figure 1: Example of the density test.]

If the document passes the region density test, the cutoff on category size
is used to find the lowest
correlated document that would be included in the group. This correlation
value is used to set up a correlation cutoff: if a document correlates below
the cutoff with a classification vector, it will not be included in the cluster. By
using this cutoff, documents that are in an area between two classification
vectors are likely to be included only in the cluster to which they are
correlated more highly. This means that the boundaries between groups of
documents which lie near each other will be sharpened, although some
documents may still be included in both groups.
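
The cutoff rule described above can be illustrated with a minimal sketch. It
assumes cosine correlation between vectors and a single upper limit on
category size; the names cosine, correlation_cutoff, members and max_size are
hypothetical and do not come from the report, which may use additional
parameters.

```python
import math

def cosine(u, v):
    """Cosine correlation of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def correlation_cutoff(class_vector, docs, max_size):
    """Correlation of the lowest-ranked document that still fits the category.

    Assumption: max_size is the only size parameter in this sketch.
    """
    if not docs or max_size < 1:
        raise ValueError("need at least one document and max_size >= 1")
    ranked = sorted((cosine(class_vector, d) for d in docs), reverse=True)
    return ranked[min(max_size, len(ranked)) - 1]

def members(class_vector, docs, cutoff):
    """Indices of documents that correlate at or above the cutoff."""
    return [i for i, d in enumerate(docs) if cosine(class_vector, d) >= cutoff]

docs = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
cutoff = correlation_cutoff([1.0, 0.0], docs, max_size=2)
print(members([1.0, 0.0], docs, cutoff))  # -> [0, 1]
```

A document lying between two classification vectors is tested against each
vector's own cutoff, so it joins both clusters only if it clears both.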
A classification vector is then formed by taking the centroid of all
of the document vectors belonging to the cluster at this time. This
centroid is then matched against the entire collection, and the cutoff
parameters on category size are used to create the cluster. At this
point, some documents may be in more than one cluster. Also, some documents
which were in a cluster when the centroid was formed may no longer
be in the cluster. These documents, as well as those which fail the
region density test, are then marked loose, and those in the cluster are
marked clustered.
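
The centroid step and the loose/clustered marking can be sketched as follows.
This is an illustration under stated assumptions, not the report's program:
cosine correlation is assumed, the category-size cutoff is reduced to one
hypothetical max_size parameter, and a document dropped from this cluster is
marked loose only if no other cluster has already claimed it. The function
names and the status dictionary are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine correlation of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def form_cluster(docs, provisional, max_size, status, failed_density=()):
    """Form the classification vector, rematch the collection, mark documents.

    docs: list of document vectors; provisional: indices in the cluster now;
    status: dict index -> "loose" or "clustered", updated in place;
    failed_density: indices that failed the region density test (assumed to
    be computed elsewhere).
    """
    class_vector = centroid([docs[i] for i in provisional])
    corr = [cosine(class_vector, d) for d in docs]
    ranked = sorted(corr, reverse=True)
    cutoff = ranked[min(max_size, len(ranked)) - 1]
    cluster = [i for i, r in enumerate(corr) if r >= cutoff]
    # Documents that dropped out, or failed the density test, become loose
    # unless an earlier cluster already holds them (an assumption here).
    for i in list(provisional) + list(failed_density):
        if i not in cluster and status.get(i) != "clustered":
            status[i] = "loose"
    for i in cluster:
        status[i] = "clustered"
    return class_vector, cluster

docs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
status = {}
_, cluster = form_cluster(docs, provisional=[0, 1, 3], max_size=2, status=status)
print(cluster)  # -> [1, 2]
print(status)
```

In this toy collection documents 0 and 3 helped form the centroid but fall
below the cutoff once the centroid is matched against the whole collection,
while document 2, which was not in the provisional group, is drawn in.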