IRS13 Scientific Report No. IRS-13 Information Storage and Retrieval An Experiment in Automatic Thesaurus Construction chapter R. T. Dattola D. M. Murray Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government. VIII-9 D) Formation of Final Classes The final thesaurus classes consist of merged classes, omitting those that are subsets or duplicates of other merged classes. The algorithm used to construct the merged classes sometimes results in the generation of merged classes that are subsets of others. For example, Fig. 3 illustrates the formation of merged classes for a typical set of initial classes. Using the overlap correlation, S12 = 1, SI3 = 1, and S' = 0 Therefore, for k=l, two merged classes are formed; the 23 first consists of initial classes 21 and 52, and the second consists of initial classes 21 and 83 However, both merged classes contain the same original concepts; hence, one of them should be eliminated. Let C" denote the class-concept matrix formed from all the merged classes. Then C" consists of row vectors C" that specify the concepts 1 contained in merged class i. Following the same procedure used to form the initial and merged classes, a similarity matrix S" is computed, and then a graph G" is formed. The maximal complete subgraphs of G" represent the final classes. If broader concept classes are desired, the final classes can be treated as merged classes and the above step can be repeated as often as desired. Naturally, the lower the cut-off value k, the broader the final classes. However, in very broad concept classes some of the concepts might have little or no relation to one another. On the other hand, if the number of final classes is nearly as large as the number of original concepts, then an attempt should be made to combine some of the final classes. £7] Although the final evaluation of a thesaurus is determined by its performance when used in information searches, the four principles