Scientific Report No. ISR-11
Information Storage and Retrieval

On Some Clustering Techniques for Information Retrieval

J. D. Broffitt, H. L. Morgan, J. V. Soden
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

In Figure 1 below, document A would pass the test while document B would not (the documents are here represented by their endpoints on the unit hypersphere).

Figure 1. Example of the density test.

If the document passes the region density test, the cutoff on category size is used to find the lowest-correlated document that would be included in the group. This value is used to set up a correlation cutoff. If a document correlates below this cutoff with a classification vector, it will not be included in the cluster. By using this cutoff, documents that are in an area between two classification vectors are likely to be included only in the cluster to which they are correlated more highly. This means that the boundaries between groups of documents which lie near each other will be sharpened, although some documents may still be included in both groups.

A classification vector is then formed by taking the centroid of all of the document vectors belonging to the cluster at this time. This centroid is then matched against the entire collection, and the cutoff parameters on category size are used to create the cluster. At this point, some documents may be in more than one cluster. Also, some documents which were in a cluster when the centroid was formed may no longer be in the cluster. These documents, as well as those which fail the region density test, are then marked loose, and those in the cluster are marked clustered.
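
The centroid step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' program: the excerpt does not define the cutoff parameters precisely, so `max_size` and `min_corr` are hypothetical stand-ins for the category-size and correlation cutoffs, cosine correlation is assumed as the matching function, and the names `form_cluster`, `centroid`, and `cosine` are invented for the example. Documents that fail the region density test are not modeled here; only documents that drop out after the centroid is re-matched are reported as loose.

```python
import math


def cosine(u, v):
    """Cosine correlation of two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def form_cluster(docs, seed_ids, max_size, min_corr):
    """Re-form a cluster around the centroid of its provisional members.

    docs      -- dict mapping document id to term vector (whole collection)
    seed_ids  -- ids belonging to the cluster when the centroid is formed
    max_size  -- assumed cutoff on category size (hypothetical parameter)
    min_corr  -- assumed correlation cutoff (hypothetical parameter)
    Returns (classification_vector, clustered_ids, loose_ids).
    """
    vector = centroid([docs[i] for i in seed_ids])
    # Match the centroid against the entire collection, best match first.
    ranked = sorted(docs, key=lambda i: cosine(docs[i], vector), reverse=True)
    # Keep at most max_size documents, and only those at or above the cutoff.
    clustered = [i for i in ranked[:max_size]
                 if cosine(docs[i], vector) >= min_corr]
    # Provisional members that no longer qualify are marked loose.
    loose = [i for i in seed_ids if i not in clustered]
    return vector, clustered, loose


# Toy two-term collection; "d" lies between the two groups.
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.6, 0.8]}
vector, clustered, loose = form_cluster(docs, ["a", "b", "d"],
                                        max_size=3, min_corr=0.9)
```

In this toy run, document `d` contributes to the centroid but then correlates below the assumed cutoff with it, so it ends up loose, which mirrors the remark that a document in the cluster when the centroid was formed may no longer belong afterwards.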