Scientific Report No. ISR-11, Information Storage and Retrieval. On Some Clustering Techniques for Information Retrieval. J. D. Broffitt, H. L. Morgan, J. V. Soden. Harvard University. Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

... too large to obtain any savings over a search of the entire collection. In larger collections, one would probably wish to look not only at a certain number of clusters, but also to take into account their correlations with a query when determining how many clusters to search.

6. Evaluation

The results of the experiment are evaluated by using the recall and precision measures defined in reference [5]*, and the mean number of comparisons per query. In calculating the recall and precision measures, the question arises of which documents are relevant to a given query. For the 82-document ADI collection, judgments have previously been made by searching the entire collection manually to identify all documents relevant to a given query. It is not clear that recall and precision calculated by using these manual relevance judgments provide a good measuring device for comparing the clustering procedures. Hence, an additional set of relevance judgments has been made, based upon the results of the search of the entire collection described in the previous section. For each query, all documents correlating above .30 with the query are considered relevant (.30 was chosen to give the same average number of relevant documents per query as the manual judgments gave).

In the results presented below, the recall and precision figures for both the manual and the "automatic" relevance judgments are given. The authors feel that both sets of measures should be considered in evaluating the performance of a clustering procedure as part of an overall automatic information retrieval system.
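The "automatic" relevance judgments described above amount to a simple threshold on the query-document correlations from the full search. The following is a minimal sketch in Python, not the authors' program: the function name `automatic_relevant`, the document identifiers, and the correlation values are all hypothetical, and only the .30 cutoff comes from the text.

```python
from typing import Dict, Set

# Cutoff stated in the text; everything else here is illustrative.
THRESHOLD = 0.30


def automatic_relevant(correlations: Dict[str, float],
                       threshold: float = THRESHOLD) -> Set[str]:
    """Return the documents whose correlation with the query is above the threshold.

    `correlations` maps a (hypothetical) document identifier to its
    correlation with one query, as obtained from a full-collection search.
    "Above" is read here as strictly greater than the threshold.
    """
    return {doc for doc, value in correlations.items() if value > threshold}


# Made-up correlations for a single query, for illustration only.
example = {"d1": 0.45, "d2": 0.30, "d3": 0.12, "d4": 0.31}
relevant = automatic_relevant(example)
```

With these made-up values, `d1` and `d4` are judged relevant, while `d2` is not, because its correlation equals the cutoff and is not above it.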
* $\text{Recall} = \dfrac{\text{number of relevant documents retrieved}}{\text{number of relevant documents in the collection}}$

$\text{Precision} = \dfrac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}$
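The two ratios in the footnote can be computed directly from the set of retrieved documents and the set of relevant documents for a query. The sketch below is illustrative only: the function name `recall_precision` and the document identifiers are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch, not something the report specifies.

```python
from typing import Iterable, Tuple


def recall_precision(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Compute (recall, precision) for one query.

    recall    = relevant documents retrieved / relevant documents in the collection
    precision = relevant documents retrieved / documents retrieved
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents retrieved

    # Assumption: an empty denominator yields 0.0 rather than an error.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    return recall, precision


# Made-up example: four documents retrieved, three relevant in the collection.
r, p = recall_precision(["d1", "d2", "d3", "d4"], ["d1", "d4", "d7"])
```

In this made-up case two of the three relevant documents are retrieved, so recall is 2/3, and two of the four retrieved documents are relevant, so precision is 1/2. The same function applies to either set of relevance judgments, manual or automatic, since only the `relevant` argument changes.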