IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
chapter
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
VIII-6
c) if both a) and b) hold, then add node j to the new
initial class;
3) repeat steps 1) and 2) until every node occurs in at least one
initial class.
As an example of the formation of initial classes, consider the
concept-document matrix in Fig. 1(a). Using the cosine correlation, a
concept-concept similarity matrix S is constructed as shown in Fig. 1(b).
Fig. 2(a) illustrates the different graphs produced by varying the cut-off
value k. Finally, the resulting initial classes are shown in Fig. 2(b)
for each value of k.
C) Formation of Merged Classes
If every concept in the given collection occurs in only one sub-
collection, then the initial classes represent the final thesaurus classes.
However, it is very probable that many concepts occur in several sub-
collections, possibly resulting in duplicate or very similar initial
classes. These similar initial classes are combined into merged classes.
Let C' denote the class-concept matrix formed from all the sub-
collections. Then C' consists of row vectors C'. that specify the concepts
1
contained in initial class i. Following the same procedure that was used
to produce the initial classes, a class-class similarity matrix S' is formed.
Each element S' of S' is a measure of the similarity between class i
and class j. As before, a cut-off value k is applied to the elements of S',
allowing S' to be interpreted as a binary connection matrix for a graph G'.
The maximal complete subgraphs of G' are the merged classes.