IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
chapter
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
document), which was ranked 35th by THS 1 and 26th by THS 2. This document
contains the following concepts which do not appear in query 2: 18, 25,
97, 116, 154, 261, 304, 338, 399. Concept class 82 of THS 1 contains
the following concepts: 25, 97, 116, 154, 338, 399. These concepts occur
only in document 12 in group 4; therefore, they become a single concept
class, class 82. Thus, their weights are all added together during the
expansion of document 12, producing a very high weight for concept class
82. Since none of these concepts appears in query 2, this heavily weighted
class contributes nothing to the match. If even one of them had appeared in
query 2, the correlation with document 12 would have been much higher.
Therefore, the presence or absence of a single concept in the query makes a
large difference in retrieval. When the query happens to contain one of these
"unique" concepts, THS 1 usually performs better than the original
thesaurus or THS 2.
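
The effect described above can be illustrated with a minimal sketch. Only the membership of class 82 and the list of document 12 concepts absent from query 2 come from the text. Everything else is an assumption made for illustration: unit concept weights, a cosine correlation, singleton classes for all concepts outside class 82, and concepts 1, 2, and 3 as stand-ins for the unknown contents of query 2.

```python
import math
from collections import defaultdict

# Class 82 of THS 1 as listed in the text. Every other concept is treated
# as its own singleton class (an assumption made only for this sketch).
CLASS_82 = {25, 97, 116, 154, 338, 399}


def concept_class(concept):
    """Map a concept number to a class label (hypothetical labelling)."""
    return "class82" if concept in CLASS_82 else f"c{concept}"


def expand(vector):
    """Replace concepts by their classes, adding weights within a class."""
    expanded = defaultdict(float)
    for concept, weight in vector.items():
        expanded[concept_class(concept)] += weight
    return dict(expanded)


def cosine(a, b):
    """Cosine correlation between two sparse vectors (assumed measure)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical data: unit weights, with concepts 1 and 2 standing in for
# whatever query 2 and document 12 actually share.
doc12 = {c: 1.0 for c in (1, 2, 18, 25, 97, 116, 154, 261, 304, 338, 399)}
query_without = {1: 1.0, 2: 1.0, 3: 1.0}   # no concept from class 82
query_with = {1: 1.0, 2: 1.0, 25: 1.0}     # one concept from class 82

doc12_expanded = expand(doc12)
corr_without = cosine(expand(query_without), doc12_expanded)
corr_with = cosine(expand(query_with), doc12_expanded)
```

Under these assumed values, class 82 receives a weight of 6 in the expanded document, and the correlation rises from about 0.18 to about 0.72 when a single class 82 concept is added to the query. The actual figures for query 2 depend on weights and concepts not given here.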
C) Homogeneous Concept Classes
Another disadvantage of THS 1 is that the concept classes are not
very homogeneous. Fig. 5 shows that the average standard deviation of
frequency among concepts in a concept class is 3.9 for THS 1 and 1.4 for
THS 2. Thus, a query containing a concept which occurs in few documents,
but which shares a concept class with a concept occurring in many documents,
may retrieve several irrelevant documents. For example, query 5 contains
concepts 1, 5, 13, 38, 94, 115, and 533, with frequencies 44, 29, 10, 4, 3, 5,
and 11, respectively. Concept 94 (frequency = 3) occurs in concept class
70 along with concept 67 (frequency = 13), concept 89 (frequency = 9),
and concept 21 (frequency = 13). Document 46 is the only document
containing concepts 21, 67, and 89. Since concept 94 maps into concept