Scientific Report No. IRS-13
Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola, D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

document), which was ranked 35th by THS 1 and 26th by THS 2. This document contains the following concepts which do not appear in query 2: 18, 25, 97, 116, 154, 261, 304, 338, 399. Concept class 82 of THS 1 contains the following concepts: 25, 97, 116, 154, 338, 399. These concepts occur only in document 12 in group 4; therefore, they become a single concept class, class 82. Thus, their weights are all added together during the expansion of document 12, producing a very high weight for concept class 82. Because query 2 contains none of these concepts, this heavily weighted class contributes nothing to the match. However, if one of the concepts in this class had appeared in query 2, the correlation with document 12 would have been much higher. Therefore, the presence or absence of one concept in the query makes a large difference in retrieval. When the query happens to contain one of these "unique" concepts, THS 1 usually performs better than the original thesaurus or THS 2.

C) Homogeneous Concept Classes

Another disadvantage of THS 1 is that the concept classes are not very homogeneous. Fig. 5 shows that the average standard deviation of frequency among concepts in a concept class is 3.9 for THS 1 and 1.4 for THS 2. Thus, a query containing a concept which occurs in few documents, but which shares a concept class with a concept occurring in many documents, may retrieve several irrelevant documents. For example, query 5 contains concepts 1, 5, 13, 38, 94, 115, and 533, with frequencies 44, 29, 10, 4, 3, 5, and 11, respectively. Concept 94 (frequency = 3) occurs in concept class 70 along with concept 67 (frequency = 13), concept 89 (frequency = 9), and concept 21 (frequency = 13). Document 46 is the only document containing concepts 21, 67, and 89.
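The frequency spread within concept class 70 can be checked directly from the figures quoted above. The following is a minimal sketch only: the report does not state which form of the standard deviation was used for Fig. 5, so the population form is assumed here, and the names CLASS_70_FREQUENCIES and frequency_spread are hypothetical and introduced only for illustration.

```python
from statistics import pstdev

# Frequencies quoted in the text for the four concepts in class 70 of THS 1,
# keyed by concept number.
CLASS_70_FREQUENCIES = {94: 3, 67: 13, 89: 9, 21: 13}


def frequency_spread(frequencies):
    """Return the standard deviation of the concept frequencies in one class.

    Assumption: the population standard deviation is used; the report does
    not say whether the population or the sample form was intended.
    """
    return pstdev(list(frequencies.values()))


spread = frequency_spread(CLASS_70_FREQUENCIES)
print(f"class 70 frequency spread: {spread:.2f}")  # prints 4.09
```

Under this assumption the spread for class 70 is about 4.1, which is of the same order as the average of 3.9 reported for THS 1 and well above the 1.4 reported for THS 2.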
Since concept 94 maps into concept