ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
chapter
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Rocchio states that since the cosine is used as the matching function to
retrieve documents, clustering with it should give better results. To test
this, Bonner's method is being tried using both coefficients.
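As a reminder of the matching function referred to above, the cosine correlation between two term-weight vectors can be sketched as follows. The function name and the plain-list representation of a document vector are illustrative assumptions, not part of the report, and returning zero for an empty vector is a convention chosen here.

```python
import math


def cosine(u, v):
    """Cosine correlation between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a vector with no terms correlates with nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Identical directions give a value of 1, and vectors with no terms in common give 0.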
[OCRerr]. Rocchio's Procedure*
The stated objective of Rocchio's procedure is that of jointly maximizing
search efficiency and minimizing the loss of relevant documents retrieved
in the search. It is a heuristically derived algorithm which is meant to be
used in conjunction with a two-level search. The input parameters to the
algorithm are:
(1) the number of categories desired;
(2) lower and upper bounds on the number of elements to be
allowed in a category;
(3) a lower bound on the correlation between a document and a
classification vector, below which a document will not be
placed in a category.
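The three input parameters listed above might be grouped into a single record, as in the minimal sketch below. The class name, the field names, and the validation rules are hypothetical; in particular, the check that the cutoff lies between 0 and 1 assumes a correlation such as the cosine over non-negative term weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterParams:
    n_categories: int       # (1) number of categories desired
    min_size: int           # (2) lower bound on elements in a category
    max_size: int           # (2) upper bound on elements in a category
    min_correlation: float  # (3) cutoff on document/classification-vector correlation

    def __post_init__(self):
        if self.n_categories < 1:
            raise ValueError("at least one category is required")
        if not 0 <= self.min_size <= self.max_size:
            raise ValueError("size bounds must satisfy 0 <= min_size <= max_size")
        # Assumption: correlations fall in [0, 1].
        if not 0.0 <= self.min_correlation <= 1.0:
            raise ValueError("min_correlation must lie in [0, 1]")

    def admits(self, correlation):
        """A document is not placed when its correlation is below the cutoff."""
        return correlation >= self.min_correlation
```

For instance, `ClusterParams(10, 5, 50, 0.25).admits(0.2)` is false, so such a document would not be placed in the category.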
All documents are first considered unclustered, and pass from this state
into one of two other possible states, clustered or loose. The algorithm
proceeds as follows. An unclustered document is selected as a possible
cluster center. All of the other unclustered and loose documents are
correlated with it, and the selected document is subjected to a region
density test to see if a category should be formed around it. This test
specifies that more than $N_1$ documents should be correlated higher than
$\rho_1$ with the candidate, and that more than $N_2$ documents should be
correlated higher than $\rho_2$ with the candidate. This ensures that documents on the
edge of large groups do not become centers of groups. For example, in
* This section is a summary of section [OCRerr].5 of reference [3], to which the
reader is referred for a more detailed discussion and program flowcharts.
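The region density test described in this section can be illustrated with a short sketch. It assumes that the correlations between the candidate and all other unclustered and loose documents have already been computed; the function name and the parameter names `n1`, `rho1`, `n2`, and `rho2` (the two document counts and the two correlation thresholds of the test) are hypothetical labels chosen for this example.

```python
def passes_density_test(correlations, n1, rho1, n2, rho2):
    """Return True if a candidate may become a cluster center.

    correlations: correlations of the candidate with every other
    unclustered or loose document (assumed precomputed).
    """
    above_rho1 = sum(1 for c in correlations if c > rho1)
    above_rho2 = sum(1 for c in correlations if c > rho2)
    # Both conditions are strict: "more than" N documents "higher than" the threshold.
    return above_rho1 > n1 and above_rho2 > n2


# A candidate in a dense region: six documents above 0.25, three above 0.7.
dense = [0.9, 0.8, 0.75, 0.4, 0.35, 0.3, 0.1]
# A candidate with only two documents above 0.25.
sparse = [0.9, 0.3, 0.2, 0.1]

print(passes_density_test(dense, n1=4, rho1=0.25, n2=2, rho2=0.7))   # True
print(passes_density_test(sparse, n1=4, rho1=0.25, n2=2, rho2=0.7))  # False
```

The threshold values used here are arbitrary and serve only to show how a candidate with too few well-correlated neighbours is rejected.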