ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
in the cluster. This set of classification vectors is used in the two-level
search procedure, as described at the end of section 3.
5. The Experiment
The experimental environment consisted of 82 documents and 20 queries
from an American Documentation Institute (ADI) collection. These documents
were automatically indexed by the SMART automatic document retrieval
system [5]. This system provided 82 document vectors and 20 query vectors
in 601-dimensional Euclidean space. These vectors were then normalized
to length one and used as input to both Bonner's and Rocchio's clustering
procedures.
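
The normalization step can be illustrated with a minimal sketch. It assumes each vector is held as a plain list of term weights; the function name normalize and the three-term sample vector are hypothetical and are not taken from the SMART system.

```python
import math

def normalize(vector):
    """Scale a vector to Euclidean length one (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0.0:
        return list(vector)
    return [x / length for x in vector]

# Hypothetical three-term vector; the actual vectors have 601 components.
doc = [3.0, 0.0, 4.0]
unit = normalize(doc)
unit_length = math.sqrt(sum(x * x for x in unit))
```

Once all vectors have length one, the cosine of the angle between two of them reduces to their inner product.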
Since each of these procedures depends on several parameters, many
runs were planned with each method in order to determine empirically
desirable values for these parameters. The results presented below are
based mainly on six computer runs, four using Bonner's clustering method
and two using Rocchio's clustering method.
Each run consists of a clustering process for the 82 documents,
resulting in the generation of the classification vectors, followed by a
two-level search for each of the 20 queries. In addition, a run was made
to match each query with the entire document collection using the cosine
matching function, so as to obtain an ordering of the documents by this
correlation for each of the 20 queries.
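
The full search and the two-level search described above might be sketched as follows. This is an illustrative outline under stated assumptions, not the program used in the experiment: the names cosine, full_search and two_level_search are hypothetical, the clusters and classification vectors are taken as given inputs, and the toy data are invented for the example. The default keep=2 reflects the two clusters retained at the first level of the search.

```python
import math

def cosine(u, v):
    """Cosine correlation between two vectors (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def full_search(query, docs):
    """Order all document indices by decreasing cosine correlation with the query."""
    return sorted(range(len(docs)),
                  key=lambda i: cosine(query, docs[i]), reverse=True)

def two_level_search(query, docs, clusters, class_vectors, keep=2):
    """Level 1: keep the clusters whose classification vectors correlate
    highest with the query. Level 2: rank only the documents in those clusters."""
    best = sorted(range(len(clusters)),
                  key=lambda c: cosine(query, class_vectors[c]),
                  reverse=True)[:keep]
    candidates = sorted({i for c in best for i in clusters[c]})
    return sorted(candidates,
                  key=lambda i: cosine(query, docs[i]), reverse=True)

# Invented toy data: five documents in three clusters.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
        [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
clusters = [[0, 1], [2, 3], [4]]
class_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.2, 0.0]

full_ranking = full_search(query, docs)
two_level_ranking = two_level_search(query, docs, clusters, class_vectors)
```

In this sketch the two-level search compares the query with every classification vector and then only with the documents of the retained clusters, whereas the full search compares it with every document; the full ordering serves as the reference against which the two-level results can be judged.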
At the end of the first level of the search, only the two clusters
which correlate highest with a query are retained. This is done because
of the small size of the collection. If more than two clusters are
retained, the total number of comparisons with a given query vector becomes