Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
UT 11-22
b) concepts which occur in only one document within a group should
be treated as individual concept classes as in THS 2;
c) the concepts within a thesaurus class should be homogeneous; i.e.,
they should all occur in approximately the same number of
documents;
d) when expanding a query or document by a thesaurus, the concept
class weights should be divided by the number of concept classes
in which a concept appears.
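The weighting rule in d) can be illustrated with a minimal sketch. The function name `expand`, the dictionary layout, the class labels, and the weights are all hypothetical and are not taken from the report. The sketch also assumes that the shares arriving at one class from several concepts are summed, and that a concept absent from the thesaurus forms its own class, in the spirit of b).

```python
from collections import defaultdict


def expand(vector, concept_to_classes):
    """Map a concept vector onto concept classes.

    vector: dict {concept_id: weight}
    concept_to_classes: dict {concept_id: list of class ids}
    Each concept's weight is divided by the number of classes it belongs to,
    and that share is credited to every one of those classes.
    """
    class_weights = defaultdict(float)
    for concept, weight in vector.items():
        # Assumption: a concept missing from the thesaurus is its own class.
        classes = concept_to_classes.get(concept, [("single", concept)])
        share = weight / len(classes)
        for cls in classes:
            # Assumption: shares from different concepts are summed.
            class_weights[cls] += share
    return dict(class_weights)


# Hypothetical thesaurus: concept 5 belongs to two classes, concept 8 to one.
thesaurus = {5: ["A", "B"], 8: ["B"]}
document = {5: 12.0, 8: 6.0, 40: 3.0}
expanded = expand(document, thesaurus)
```

Under these assumptions, a concept that appears in many classes contributes no more total weight than a concept that appears in one, which is the purpose of the division.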
A) Overlap
Because the original ADI collection is already a thesaurus, THS 1
and THS 2 have in effect combined many of the original concept classes,
thereby producing more overlap between the classes. For example, in query
15, document 67 (a relevant document) is ranked 68th using the original
thesaurus and 26th using THS 2. As shown in Fig. 7, in the original
thesaurus, only one out of eight concepts in the query also occurred in
the document, while in THS 2, there were five out of eighteen matches.
The improvement is due to concepts 10, 22, and 104, which appear in query 15
but not in document 67. Specifically, concept class 36 contains concepts 1
and 10; concept class 136 contains concepts 1 and 104; and concept
class 203 contains concepts 9 and 22. Therefore, both document 67 and
query 15 contain concept classes 36, 136, and 203 after the lookup in
THS 2.
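The effect of the class lookup in this example can be sketched as follows. The class memberships and the query concepts 10, 22, and 104 are those cited above. The contents of document 67 are not listed in the text, so the sketch assumes it holds concepts 1 and 9; the function name `lookup` is likewise hypothetical. Only the concepts relevant to the three classes are shown, not the full query or document.

```python
# Class memberships as cited in the text (only the members mentioned there).
classes = {36: {1, 10}, 136: {1, 104}, 203: {9, 22}}


def lookup(concepts, classes):
    """Return the ids of all classes sharing at least one concept with `concepts`."""
    return {cid for cid, members in classes.items() if members & concepts}


query_15 = {10, 22, 104}  # concepts in query 15 that are absent from document 67
doc_67 = {1, 9}           # assumption: not stated explicitly in the text

direct_matches = query_15 & doc_67
class_matches = lookup(query_15, classes) & lookup(doc_67, classes)
```

Under this assumption the concepts themselves produce no match, while the lookup yields three shared concept classes (36, 136, and 203).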
B) Unique Concepts
THS 1 combines all concepts which occur in only one document of a
group into a single concept class. The disadvantage of this method is
illustrated by examining, for example, query 2 and document 12 (a relevant