Scientific Report No. IRS-13, Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola, D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Because of the smaller class coefficient, the cut-off value .71 is used. Any higher value does not significantly improve any of the evaluation measures and destroys some of the useful relations in the subcollection.

An added factor to be considered is that for concepts appearing only once, the only graph connections possible are to other concepts appearing once (in the same document). Such concepts may be removed from the concept-document matrix and placed in an initial class before the computation of the similarity matrix. Hence, a saving in computation results.

Instead of combining concepts which occur only once into a single concept class, each of these concepts can be treated as an individual concept class. In order to avoid confusion, the thesaurus constructed by this method will be known as THS 2, and the thesaurus constructed by combining concepts which occur only once will be called THS 1.

The merged and final classes are combinations of closely related initial classes. Therefore, it is always desirable to combine initial classes which are subsets of each other. The overlap correlation function, measuring similarity only on the basis of co-occurrence of concepts, gives a correlation value of 1.0 in such cases. For this reason, the overlap function is used in the formation of the merged and final classes. Fig. 5 gives some statistics on THS 1 and THS 2.
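The two ideas above, setting aside concepts that occur only once and scoring class similarity with an overlap function, can be illustrated with a minimal Python sketch. It assumes the overlap function is the usual set form, the number of shared concepts divided by the size of the smaller class; the report's exact formula is not reproduced in this section. The function names, the sample data, and the literal reading of THS 1 as one class holding every once-occurring concept are all assumptions made for illustration, not the authors' program.

```python
def overlap(a, b):
    """Overlap coefficient |a & b| / min(|a|, |b|) for two concept classes.

    Assumed form of the overlap correlation; it equals 1.0 whenever one
    class is a subset of the other.
    """
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def split_singletons(concept_docs):
    """Separate concepts occurring in exactly one document from the rest.

    concept_docs maps each concept to the set of documents containing it.
    Only the remaining concepts need to enter the similarity matrix.
    """
    singletons = {c for c, docs in concept_docs.items() if len(docs) == 1}
    remaining = {c: d for c, d in concept_docs.items() if c not in singletons}
    return singletons, remaining


def singleton_classes(singletons, variant):
    """Build initial classes for the once-occurring concepts.

    "THS 1": all such concepts share a single class (literal reading).
    "THS 2": each such concept forms its own class.
    """
    if variant == "THS 1":
        return [set(singletons)] if singletons else []
    if variant == "THS 2":
        return [{c} for c in sorted(singletons)]
    raise ValueError("variant must be 'THS 1' or 'THS 2'")


# Hypothetical initial classes, given as sets of concept numbers.
small = {3, 7}
large = {3, 7, 12, 20}
print(overlap(small, large))  # 1.0, since small is a subset of large

# Hypothetical concept-document incidence.
concept_docs = {"a": {1}, "b": {1, 2}, "c": {2}}
once, rest = split_singletons(concept_docs)
print(singleton_classes(once, "THS 2"))
```

Because the denominator is the size of the smaller class, a class fully contained in another always scores 1.0, which is the property the text relies on when merging initial classes.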
B) Retrieval Evaluation

To evaluate the retrieval performance of an automatic thesaurus, three information searches are used: one search with a document and query collection before the thesaurus lookup; one search after the lookup in a manual thesaurus; and one search after the lookup in the automatic thesaurus. By comparing the precision and recall statistics for all three searches, the