Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
the groups with whose classification vector the query vector was found to
correlate highly on the first level. While increasing the search efficiency
in terms of the number of comparisons made with a query vector, the clustering
process inherently decreases the information which categorizes the individual
documents.
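
The two-level search described above can be illustrated with a minimal sketch. It rests on assumptions not stated in this report: the classification vector of a group is taken to be the mean (centroid) of its document vectors, the correlation is taken to be the cosine, and the names `cosine`, `centroid`, `two_level_search`, and `top_clusters`, as well as the toy vectors, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Mean vector of a group (assumed here to be its classification vector)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def two_level_search(query, docs, clusters, top_clusters=1):
    """Return (ranked document ids, number of vector comparisons made).

    docs maps a document id to its vector; clusters is a list of id lists.
    """
    centroids = [centroid([docs[d] for d in group]) for group in clusters]
    # Level 1: correlate the query with every classification vector.
    order = sorted(range(len(clusters)),
                   key=lambda i: cosine(query, centroids[i]), reverse=True)
    comparisons = len(clusters)
    # Level 2: correlate the query only with documents of the best groups.
    candidates = [d for i in order[:top_clusters] for d in clusters[i]]
    comparisons += len(candidates)
    ranked = sorted(candidates, key=lambda d: cosine(query, docs[d]),
                    reverse=True)
    return ranked, comparisons

# Toy collection: nine documents over three terms, in three groups.
docs = {
    "d1": [1.0, 0.0, 0.0], "d2": [0.9, 0.1, 0.0], "d3": [0.8, 0.0, 0.2],
    "d4": [0.0, 1.0, 0.0], "d5": [0.1, 0.9, 0.0], "d6": [0.0, 0.8, 0.2],
    "d7": [0.0, 0.0, 1.0], "d8": [0.1, 0.0, 0.9], "d9": [0.0, 0.2, 0.8],
}
clusters = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
ranked, comparisons = two_level_search([1.0, 0.2, 0.0], docs, clusters)
```

With one group expanded, this toy query costs 3 comparisons on the first level and 3 on the second, against 9 for a full search. The saving comes at a price: documents outside the selected group are never examined at all.
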
Thus the problem of document clustering is clear: classifying the
documents into homogeneous groups in order to increase search efficiency
without seriously sacrificing the ability to retrieve the documents. Various
schemes have been suggested to this end. These procedures gather documents
into groups based on some type of correlation function which associates
similar documents. Since there exists no purely analytical method for
comparing the relative efficiency of these various methods, and since this
efficiency may differ with the use to which the groups are put, it is
necessary to compare the clustering procedures in the context of an automatic
information retrieval system, and to evaluate the merits of the clustering
procedures by evaluating the overall performance of such a system, with only
a change in the clustering method from experiment to experiment. The
evaluation of the overall performance of such a system clearly involves
the environment in which the system is used, and the users themselves.
Specifically, the present study consists of a comparison of two
clustering procedures: a method of R. E. Bonner [2], and the method proposed
by J. J. Rocchio [3]. These techniques can reasonably be studied by a
comparison of the documents retrieved by an automatic retrieval system
using a two-level search based on the clusters produced by the methods
with those retrieved by a full search of the document collection.
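
One simple way to make such a comparison concrete is sketched below. It is an illustration only, not the evaluation measure used in this study: the function `recovered_fraction`, the `cutoff` parameter, and the two example rankings are all hypothetical.

```python
def recovered_fraction(cluster_ranked, full_ranked, cutoff):
    """Fraction of the full search's top `cutoff` documents that the
    two-level search also places within its own top `cutoff`."""
    full_top = set(full_ranked[:cutoff])
    if not full_top:
        return 0.0
    cluster_top = set(cluster_ranked[:cutoff])
    return len(full_top & cluster_top) / len(full_top)

# Made-up rankings for one query (illustrative input, not experimental data).
full_ranked = ["d2", "d1", "d5", "d3", "d4"]   # full search of the collection
cluster_ranked = ["d2", "d1", "d3"]            # two-level search on clusters
score = recovered_fraction(cluster_ranked, full_ranked, cutoff=3)
```

Computing this fraction once per clustering method, with the rest of the retrieval system held fixed, gives one number per method that can be compared directly.
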