Scientific Report No. IRS-13
Information Storage and Retrieval

An Experiment in Automatic Thesaurus Construction

R. T. Dattola
D. M. Murray

Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Naturally, the evaluation of a thesaurus is based on its performance when used in information searches. In its construction, the following criteria should ideally be followed:

a) closely related pieces of information should be assigned the same concept number;
b) the number of thesaurus classes should be significantly smaller than the number of original concepts;
c) the number of concepts appearing in more than one thesaurus class should be small; and
d) the concepts in a thesaurus class should be homogeneous, i.e., they should all occur in approximately the same number of documents.

In the present study, a document collection in a single subject area is taken as a sample vocabulary. The vocabulary is represented by previously assigned concept numbers with their associated weights. Concept-concept association techniques are then used to derive the thesaurus classes. The principle behind these techniques is co-occurrence: concepts that occur together often enough may be replaced by a single concept (a concept class).

2. The Construction Algorithm

A thesaurus is constructed in four steps:

a) formation of subcollections of documents by clustering;
b) formation of initial classes;