ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
too large to obtain any savings over a search of the entire collection. In
larger collections, one would probably wish to look not only at a certain
number of clusters, but to take into account their correlations with a
query as well, when determining how many clusters to search.
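One way to read this suggestion is sketched below. It is a minimal illustration, not the procedure used in the experiment: the function name `select_clusters`, the cap `max_clusters`, the cutoff `min_correlation`, and the rule of always searching the best-matching cluster are all assumptions introduced here.

```python
def select_clusters(cluster_correlations, max_clusters=3, min_correlation=0.25):
    """Choose which clusters to search for a single query.

    cluster_correlations maps a cluster id to the correlation of that
    cluster with the query. Both cutoffs are hypothetical values chosen
    only for illustration.
    """
    # Rank clusters from highest to lowest correlation with the query.
    ranked = sorted(cluster_correlations.items(),
                    key=lambda item: item[1], reverse=True)

    # Keep at most max_clusters, and only those correlating well enough.
    chosen = [cid for cid, corr in ranked[:max_clusters]
              if corr >= min_correlation]

    # Assumption: search at least the best cluster so no query goes unsearched.
    if not chosen and ranked:
        chosen = [ranked[0][0]]
    return chosen


if __name__ == "__main__":
    print(select_clusters({"c1": 0.60, "c2": 0.10, "c3": 0.40}))
```

With this rule, a query that correlates strongly with only one cluster costs fewer comparisons than a fixed number of clusters would, while a query that matches several clusters still has them all searched, up to the cap.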
6. Evaluation
The results of the experiment are evaluated by using the recall and
precision measures defined in reference [5]*, and the mean number of
comparisons per query. In calculating the recall and precision measures,
the question of which documents are relevant to a given query arises. For
the 82-document ADI collection, judgments have previously been made by
searching the entire collection manually to identify all documents relevant
to a given query. It is not clear that recall and precision calculated by
using these manual relevance judgments provide a good measuring device for
comparing the clustering procedures. Hence, an additional set of relevance
judgments has been made based upon the results of the search of the entire
collection described in the previous section. For each query, all documents
correlating above .30 with a query (.30 was chosen to give the same average
number of relevant documents per query as the manual judgments gave) are
considered relevant. In the results presented below, the recall and
precision figures for both the manual and the "automatic" relevance judgments
are given. The authors feel that both sets of measures should be considered
in evaluating the performance of a clustering procedure as part of an
overall automatic information retrieval system.
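The "automatic" relevance judgments described above amount to thresholding the full-search correlations. The sketch below illustrates that step under stated assumptions: the function names and the dictionary layout are hypothetical, and the correlation values in the usage example are invented toy data, not results from the ADI collection.

```python
def automatic_relevant(correlations, threshold=0.30):
    """Return the ids of documents correlating above the threshold.

    correlations maps a document id to its correlation with one query,
    as obtained from a search of the entire collection.
    """
    return {doc_id for doc_id, corr in correlations.items() if corr > threshold}


def mean_relevant_per_query(correlations_by_query, threshold=0.30):
    """Average number of automatically relevant documents per query.

    This is the quantity one would compare against the manual judgments
    when choosing the threshold.
    """
    counts = [len(automatic_relevant(c, threshold))
              for c in correlations_by_query.values()]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = {
        "q1": {"d1": 0.45, "d2": 0.30, "d3": 0.10},
        "q2": {"d1": 0.31, "d2": 0.50, "d3": 0.05},
    }
    print(automatic_relevant(toy["q1"]))
    print(mean_relevant_per_query(toy))
```

The comparison is strict, following the wording "correlating above .30", so a document correlating at exactly the threshold is not counted as relevant.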
* Recall = (number of relevant documents retrieved) / (number of relevant documents in the collection)
  Precision = (number of relevant documents retrieved) / (number of documents retrieved)
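The two measures defined in this footnote can be computed directly from sets of document identifiers. The following is a minimal sketch: the function names are hypothetical, and returning 0.0 when a denominator is empty is an assumption made here, since the definitions leave that case open.

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved / relevant documents in the collection."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # assumption: undefined case reported as 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Relevant documents retrieved / documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # assumption: undefined case reported as 0.0
    return len(retrieved & relevant) / len(retrieved)


if __name__ == "__main__":
    # Toy example: 4 documents retrieved, 3 relevant, 2 in common.
    retrieved = {"d1", "d2", "d3", "d4"}
    relevant = {"d2", "d4", "d7"}
    print(recall(retrieved, relevant), precision(retrieved, relevant))
```

The same pair of functions can be applied once with the manual relevance set and once with the automatic one, giving the two sets of figures for each query.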