IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
chapter
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
VIII-9
D) Formation of Final Classes
The final thesaurus classes consist of merged classes, omitting
those that are subsets or duplicates of other merged classes.
The algorithm used to construct the merged classes sometimes results
in the generation of merged classes that are subsets of others. For
example, Fig. 3 illustrates the formation of merged classes for a typical
set of initial classes. Using the overlap correlation, S12 = 1, SI3 = 1,
and S' = 0 Therefore, for k=l, two merged classes are formed; the
23
first consists of initial classes 21 and 52, and the second consists of
initial classes 21 and 83 However, both merged classes contain the same
original concepts; hence, one of them should be eliminated.
Let C" denote the class-concept matrix formed from all the merged
classes. Then C" consists of row vectors C" that specify the concepts
1
contained in merged class i. Following the same procedure used to form
the initial and merged classes, a similarity matrix S" is computed, and
then a graph G" is formed. The maximal complete subgraphs of G" represent
the final classes.
If broader concept classes are desired, the final classes can be
treated as merged classes and the above step can be repeated as often as
desired. Naturally, the lower the cut-off value k, the broader the final
classes. However, in very broad concept classes some of the concepts
might have little or no relation to one another. On the other hand, if
the number of final classes is nearly as large as the number of original
concepts, then an attempt should be made to combine some of the final
classes. £7] Although the final evaluation of a thesaurus is determined
by its performance when used in information searches, the four principles