Scientific Report No. IRS-13
Information Storage and Retrieval

An Experiment in Automatic Thesaurus Construction

R. T. Dattola
D. M. Murray

Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

c) formation of merged classes by combining related initial classes;
d) formation of final thesaurus classes by eliminating merged classes that are subsets of other merged classes.

A) Clustering the Document Collection

Rocchio's clustering algorithm [5] is used to divide the original document collection into subcollections of most similar documents. These subcollections contain many closely related concepts, and hence represent very broad concept classes. Table 1 summarizes the clustering results for the 82-document ADI collection and the 200-document Cranfield collection.

B) Formation of Initial Classes

In this step, a set of initial concept classes is formed for each subcollection. Let $C$ denote the binary concept-document matrix constructed from each subcollection, where $C$ consists of row vectors $C_i$ that specify the documents in which concept $i$ occurs. Then for any concepts $i$ and $j$, a similarity coefficient $S_{ij}$ is computed by correlating $C_i$ with $C_j$. A concept-concept similarity matrix $S$ is produced by computing these coefficients between each pair of concepts.

Several functions may be used to compute the elements of $S$, the most desirable ones producing a symmetric matrix in which the magnitude of each element lies between 0 and 1 [1]. Let $L_i$ be the number of documents in which concept $i$ occurs; $L_j$, the number of documents in which concept $j$ occurs; and $N$, the number of the of each