IRS13 Scientific Report No. IRS-13 Information Storage and Retrieval An Experiment in Automatic Thesaurus Construction chapter R. T. Dattola D. M. Murray Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government. VIII-6 c) if both a) and b) hold, then add node j to the new initial class; 3) repeat steps 1) and 2) until every node occurs in at least one initial class. As an example of the formation of initial classes, consider the concept-document matrix in Fig. 1(a). Using the cosine correlation, a concept-concept similarity matrix S is constructed as shown in Fig. 1(b). Fig. 2(a) illustrates the different graphs produced by varying the cut-off value k. Finally, the resulting initial classes are shown in Fig. 2(b) for each value of k. C) Formation of Merged Classes If every concept in the given collection occurs in only one sub- collection, then the initial classes represent the final thesaurus classes. However, it is very probable that many concepts occur in several sub- collections, possibly resulting in duplicate or very similar initial classes. These similar initial classes are combined into merged classes. Let C' denote the class-concept matrix formed from all the sub- collections. Then C' consists of row vectors C'. that specify the concepts 1 contained in initial class i. Following the same procedure that was used to produce the initial classes, a class-class similarity matrix S' is formed. Each element S' of S' is a measure of the similarity between class i and class j. As before, a cut-off value k is applied to the elements of S', allowing S' to be interpreted as a binary connection matrix for a graph G'. The maximal complete subgraphs of G' are the merged classes.