Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
c) formation of merged classes by combining related initial classes;
d) formation of final thesaurus classes by eliminating merged
classes that are subsets of each other.
A) Clustering the Document Collection
Rocchio's Clustering Algorithm [5] is used to divide the original
document collection into subcollections of most similar documents. These
subcollections contain many closely related concepts, and hence represent
very broad concept classes.
Table 1 summarizes the clustering results for the 82-document ADI
collection and the 200-document Cranfield collection.
B) Formation of Initial Classes
In this step, a set of initial concept classes is formed for each
subcollection.
Let C denote the binary concept-document matrix constructed from
each subcollection, where C consists of row vectors C_i that specify the
documents in which concept i occurs. Then for any concepts i and j, a
similarity coefficient S_ij is computed by correlating C_i with C_j.
A concept-concept similarity matrix S is produced by computing these
coefficients between each pair of concepts.
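As an illustration only, the construction of S from C can be sketched in Python. The sketch assumes a cosine-type coefficient, which is one possible choice and is not necessarily the function used in this experiment; the function name concept_similarity and the small example matrix are likewise hypothetical.

```python
import numpy as np

def concept_similarity(C):
    """Build a concept-concept similarity matrix S from a binary
    concept-document matrix C (rows = concepts, columns = documents).

    ASSUMPTION: a cosine-type coefficient is used here purely for
    illustration; the report's own coefficient may differ.
    """
    C = np.asarray(C, dtype=float)
    # overlap[i, j] = number of documents containing both concept i and concept j
    overlap = C @ C.T
    # counts[i] = number of documents in which concept i occurs
    counts = np.diag(overlap)
    denom = np.sqrt(np.outer(counts, counts))
    S = np.zeros_like(overlap)
    # Concepts that occur in no document keep similarity 0 (avoids division by zero).
    np.divide(overlap, denom, out=S, where=denom > 0)
    return S

# Hypothetical subcollection: 3 concepts, 4 documents.
C = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
S = concept_similarity(C)
```

With this choice S is symmetric and every element lies between 0 and 1; concepts that never occur in the same document receive a coefficient of 0.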
Several functions may be used to compute the elements of S, the
most desirable ones producing a symmetric matrix with the magnitude of each
element between 0 and 1. [1]
Let L_i be the number of documents in which concept i occurs; L_j,
the number of documents in which concept j occurs; and N, the number of