Scientific Report No. IRS-13
Information Storage and Retrieval

An Experiment in Automatic Thesaurus Construction

R. T. Dattola, D. M. Murray

Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

mentioned in section 1 should also be kept in mind to aid in the thesaurus construction.

3. Evaluation

The evaluation of a thesaurus for information retrieval operations is based primarily on the results of information searches using thesaurus lookup. However, the results are dependent on the characteristics of the thesaurus classes; hence, the classes are subject to an independent evaluation.

A) Evaluation of the Classes

Initial classes are composed of the similar concepts in each subcollection. Here, similarity is related to the fact that the concepts occur and do not occur in the same documents. The overlap correlation function measures similarity only on the basis of overlap between concepts and is therefore not used in the formation of initial classes. For example, given $c_i = (1,1,0,0)$, $c_j = (1,1,0,0)$, and $c_k = (1,1,1,1)$, then $S_{ij} = 1.0$ and $S_{ik} = 1.0$ for the overlap function. On the other hand, $S_{ij} = 1.0$ for the cosine and Tanimoto functions, but $S_{ik} = .71$ and $.50$, respectively. Thus, these two functions measure similarity on the basis of co-occurrence and are, therefore, of significant interest.

In order to evaluate the formation of initial classes within a subcollection, let k be the chosen cut-off value; N, the number of concepts in the subcollection; L, the number of classes formed; and M, the number of concepts appearing in more than one class. Then, define the overlap