ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
IX. On Some Clustering Techniques for Information Retrieval
J. D. Broffitt, H. L. Morgan, and J. V. Soden
Abstract
Document clustering methods which have been proposed by R. E. Bonner
and J. J. Rocchio are compared. Bonner's method is found to give higher
precision than Rocchio's method, while the recall for the two methods is
about the same. Bonner's method necessitates about twice as many comparisons
against a query vector as Rocchio's method; this is to be expected since
Rocchio controls the cluster size in order to maximize search efficiency.
Manual relevance judgments are used as well as relevance judgments
determined by query-document cosines. The results are found to be invariant
under the two measures.
1. Introduction
The organization of information into homogeneous groups plays a major
role in many fields of research. Some areas of application are information
retrieval, biological taxonomy, isolation of disease syndromes in medicine,
anthropology (categorization of tribes), and business applications such
as categorizing TV audiences, sales offices, etc. Indeed, the applications
of information organization are numerous, and the particular application
being studied dictates the type of classification needed.
Needham [1] has divided classification problems into three types:
(1) the assignment of given objects to given classes, (2) the extraction
of class characteristics from given classes and their objects, and