Scientific Report No. IRS-13
Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola, D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

b) concepts which occur in only one document within a group should be treated as individual concept classes, as in THS 2;

c) the concepts within a thesaurus class should be homogeneous; i.e., they should all occur in approximately the same number of documents;

d) when expanding a query or document by a thesaurus, the concept class weights should be divided by the number of concept classes in which a concept appears.

A) Overlap

Because the original ADI collection is already a thesaurus, THS 1 and THS 2 have in effect combined many of the original concept classes, thereby producing more overlap between the classes. For example, in query 15, document 67 (a relevant document) is ranked 68th using the original thesaurus and 26th using THS 2. As shown in Fig. 7, in the original thesaurus only one out of eight concepts in the query also occurred in the document, while in THS 2 there were five out of eighteen matches. The improvement is due to concepts 10, 22, and 104, which appear in query 15 but not in document 67. Specifically, concept class 36 contains concepts 1 and 10; concept class 136 contains concepts 1 and 104; and concept class 203 contains concepts 9 and 22. Therefore, both document 67 and query 15 contain concept classes 36, 136, and 203 after the lookup in THS 2.

B) Unique Concepts

THS 1 combines all concepts which occur in only one document of a group into a single concept class. The disadvantage of this method is illustrated by examining, for example, query 2 and document 12 (a relevant