Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Naturally, the evaluation of a thesaurus is based on its performance
when used in information searches. In its construction, the following
criteria should ideally be met:
a) closely related pieces of information should be assigned the
same concept number;
b) the number of thesaurus classes should be significantly
smaller than the number of original concepts;
c) the number of concepts appearing in more than one thesaurus
class should be small; and
d) the concepts in a thesaurus class should be homogeneous, i.e.,
they should all occur in approximately the same number of
documents.
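
Criteria (b) through (d) can be checked numerically once a set of classes is available. The following is a minimal sketch, not part of the original study: the function name `thesaurus_stats`, the dictionary layout, and the use of a max/min ratio as a homogeneity measure are all illustrative assumptions. It also assumes every concept occurs in at least one document.

```python
from collections import Counter


def thesaurus_stats(classes, doc_freq):
    """Summarize a thesaurus against criteria (b), (c) and (d).

    classes  : dict mapping class id -> set of concept numbers (hypothetical layout)
    doc_freq : dict mapping concept number -> number of documents containing it
    """
    concepts = set().union(*classes.values())

    # Criterion (b): classes per original concept; smaller means more merging.
    compression = len(classes) / len(concepts)

    # Criterion (c): concepts that were assigned to more than one class.
    membership = Counter(c for members in classes.values() for c in members)
    overlapping = sorted(c for c, n in membership.items() if n > 1)

    # Criterion (d): ratio of highest to lowest document frequency in each
    # class; a value near 1.0 indicates a homogeneous class (assumed measure).
    spread = {
        cid: max(doc_freq[c] for c in members) / min(doc_freq[c] for c in members)
        for cid, members in classes.items()
    }
    return compression, overlapping, spread


classes = {1: {10, 11}, 2: {11, 12, 13}}
doc_freq = {10: 4, 11: 5, 12: 2, 13: 8}
compression, overlapping, spread = thesaurus_stats(classes, doc_freq)
```

In this toy input, concept 11 belongs to both classes, and class 2 mixes a concept found in 2 documents with one found in 8, so it would count as less homogeneous than class 1.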
In the present study, a document collection in a single subject
area is taken as a sample vocabulary. The vocabulary is represented by
previously assigned concept numbers with their associated weights.
Concept-concept association techniques are then used to derive the thesaurus
classes. The principle behind these techniques is co-occurrence: concepts
that occur together often enough may be replaced by a single concept (a
concept class).
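
The co-occurrence principle can be illustrated with a short sketch. The text above does not specify the association measure, so the cosine correlation between concept vectors used here is an assumption, as are the function name `concept_associations`, the threshold of 0.8, and the representation of each document as a dictionary of concept numbers with strictly positive weights.

```python
import math
from collections import defaultdict
from itertools import combinations


def concept_associations(docs):
    """Return {(ci, cj): cosine similarity} for concepts that co-occur.

    docs : list of dicts mapping concept number -> positive weight
           (hypothetical layout; one dict per document)
    """
    dot = defaultdict(float)   # sum over documents of w_i * w_j
    norm = defaultdict(float)  # sum over documents of w_i squared
    for doc in docs:
        for concept, weight in doc.items():
            norm[concept] += weight * weight
        # Sorting makes every pair key ordered, so (ci, cj) has ci < cj.
        for (ci, wi), (cj, wj) in combinations(sorted(doc.items()), 2):
            dot[(ci, cj)] += wi * wj
    return {
        (ci, cj): s / math.sqrt(norm[ci] * norm[cj])
        for (ci, cj), s in dot.items()
    }


docs = [
    {1: 1.0, 2: 1.0},
    {1: 1.0, 2: 1.0, 3: 1.0},
    {3: 1.0},
]
sims = concept_associations(docs)
# Pairs above an assumed threshold are candidates for a shared concept class.
candidates = {pair for pair, s in sims.items() if s >= 0.8}
```

Here concepts 1 and 2 always appear together and so become candidates for a single class, while concept 3 co-occurs with them only once and falls below the threshold.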
2. The Construction Algorithm
A thesaurus is constructed in four steps:
a) formation of subcollections of documents by clustering;
b) formation of initial classes;