ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
chapter
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
This entire procedure is repeated with all unclustered documents.
When this is completed, there is no guarantee that the required minimum
number of clusters has been formed. Hence, some documents which failed
the region density test in the first pass are chosen as cluster centers.
If the number of categories formed in the first pass was too high, the
density test may be made stricter and the first pass repeated.
There may still be some relatively isolated documents at the end of
the clustering process. These correlate very poorly with any of the
classification vectors. In order to properly test the procedures with
the limited collections available, these documents are included in the
cluster with which they correlate most highly.
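The assignment of these leftover documents can be sketched as follows. This is
a minimal illustration, not the original program: it assumes the cosine
coefficient as the correlation measure, and the names cosine and
assign_isolated, as well as the sample vectors, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation between two equal-length term vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign_isolated(isolated_docs, classification_vectors):
    """Map each leftover document id to the index of the cluster whose
    classification vector it correlates with most highly."""
    assignment = {}
    for doc_id, vec in isolated_docs.items():
        scores = [cosine(vec, c) for c in classification_vectors]
        # Ties go to the lowest-numbered cluster (an arbitrary choice).
        assignment[doc_id] = max(range(len(scores)), key=scores.__getitem__)
    return assignment

# Hypothetical data: two classification vectors and two isolated documents.
centroids = [[1, 1, 0, 0], [0, 0, 1, 1]]
leftovers = {"d7": [2, 1, 0, 1], "d9": [0, 0, 1, 0]}
placement = assign_isolated(leftovers, centroids)
```

Every leftover document is placed somewhere, however poor its best
correlation, which matches the forced inclusion described above.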
This procedure has been programmed for the CDC 1604 computer in
FORTRAN 63 and CODAP by V. Lesser. The output of this program consists
of a deck of punched cards which specify the documents belonging to each
category and the classification vector for each category formed. These
cards are then used as input to a two-level search procedure program,
written by the authors in FORTRAN 63, which compares the queries to the
documents in the clusters correlating most highly with the query vector.
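A two-level search of this kind can be sketched as follows. This is a hedged
illustration rather than the authors' program: the cosine coefficient, the
function name two_level_search, the parameter n_clusters, and the sample
data are all assumptions introduced here.

```python
import math

def cosine(u, v):
    """Cosine correlation between two equal-length term vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def two_level_search(query, clusters, n_clusters=1):
    """clusters is a list of (classification_vector, {doc_id: doc_vector}).

    Level 1 ranks the clusters by the correlation of their classification
    vector with the query; level 2 scores only the documents inside the
    n_clusters best clusters.
    """
    ranked = sorted(clusters, key=lambda c: cosine(query, c[0]), reverse=True)
    results = []
    for _, docs in ranked[:n_clusters]:
        for doc_id, vec in docs.items():
            results.append((doc_id, cosine(query, vec)))
    results.sort(key=lambda r: r[1], reverse=True)
    return results

# Hypothetical collection of two clusters with two documents each.
clusters = [
    ([1, 1, 0, 0], {"d1": [1, 0, 0, 0], "d2": [1, 1, 0, 0]}),
    ([0, 0, 1, 1], {"d3": [0, 0, 1, 0], "d4": [0, 0, 1, 1]}),
]
hits = two_level_search([0, 0, 1, 1], clusters, n_clusters=1)
ranked_ids = [doc_id for doc_id, _ in hits]
```

Only the documents of the selected clusters are compared with the query, so
the number of correlations computed is smaller than in a full search of the
collection.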
Bonner's Procedure
This algorithm is based upon Clustering Programs I and II and the
Cluster Adjustment Program presented by Bonner in reference [2], and
has been programmed by the authors for the CDC 1604 computer as FORTRAN
63 subroutines DOCDOC, SI[OCRerr]IM, CLUSTER, and ADJUSTCL.
Subroutine DOCDOC accepts as input a binary document-term matrix and
calculates a document-document similarity matrix S, using either of the