Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
7. Results and Conclusions
The results of the study are summarized in Table 1. One striking
characteristic of Bonner's method is the large number of clusters it
produces. This is to be expected since Bonner refuses to associate
dissimilar documents, whereas Rocchio allows dissimilar documents to be
associated in order to build clusters of a size conducive to search
efficiency.
Thus, one would expect that the mean number of matches made with a
query vector using Rocchio's method would be less than the mean number of
matches made using Bonner's method. The results support this hypothesis
since approximately twice as many matches are required using Bonner's
method as when using Rocchio's method.
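The quantities compared in this section (query matches, recall, and precision) can be illustrated with a small sketch. It rests on several assumptions that are not taken from the report: a "match" is counted as one comparison of the query with either a cluster centroid or a document in a retrieved cluster, similarity is the cosine, and the two partitions are written by hand to mimic many small clusters versus few large ones. They are not the output of Bonner's or Rocchio's algorithm, and the function names and toy vectors are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_search(query, clusters, docs, top_k=1):
    """Rank clusters by centroid similarity and retrieve the top_k clusters.

    Returns (retrieved_ids, matches). ASSUMPTION: matches = one comparison
    per centroid plus one per document in the retrieved clusters.
    """
    centroids = [centroid([docs[d] for d in c]) for c in clusters]
    ranked = sorted(range(len(clusters)),
                    key=lambda i: cosine(query, centroids[i]),
                    reverse=True)
    retrieved = [d for i in ranked[:top_k] for d in clusters[i]]
    matches = len(clusters) + len(retrieved)
    return retrieved, matches

def recall_precision(retrieved, relevant):
    """Recall and precision of a retrieved set against relevance judgments."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical toy collection: six documents over three terms.
docs = {
    "d1": [1.0, 0.0, 0.0], "d2": [0.9, 0.1, 0.0], "d3": [0.8, 0.2, 0.0],
    "d4": [0.0, 1.0, 0.0], "d5": [0.0, 0.9, 0.1], "d6": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]
relevant = {"d1", "d2"}  # hypothetical manual relevance judgments

# Hand-made partitions: many tight clusters versus few larger ones.
fine = [["d1", "d2"], ["d3"], ["d4", "d5"], ["d6"]]
coarse = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]

for name, clusters in (("fine", fine), ("coarse", coarse)):
    retrieved, matches = cluster_search(query, clusters, docs, top_k=1)
    recall, precision = recall_precision(retrieved, relevant)
    print(name, matches, round(recall, 2), round(precision, 2))
```

On this toy collection the fine partition needs more matches because there are more centroids to compare, and it gives higher precision at equal recall because the retrieved cluster is smaller. The numbers only illustrate the definitions; they say nothing about the magnitudes reported in Table 1.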
Next, restricting our attention to the recall and precision results
obtained using the manual relevance judgments, it is apparent that Bonner's
method exhibits higher precision than, and nearly equivalent recall to,
Rocchio's method. One would expect the higher precision since there are
far fewer members in each cluster, and hence, when a cluster is retrieved,
it is more likely to contain a high percentage of relevant documents. Also,
there is a higher similarity between members of the same cluster with
Bonner's method than with Rocchio's method. Hence, if a cluster is similar
to the query, more of the documents in it are likely to be relevant, since
they are all very similar to one another. The nearly equivalent recall between the two methods is
somewhat surprising, as one would expect the large number of documents
retrieved by using Rocchio's method to include more of the relevant ones.
This is the case if the two highest ranking clusters are used, but is not
true if only the highest ranking cluster is used, although using only the