IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
mentioned in section 1 should also be kept in mind to aid in the thesaurus
construction.
3. Evaluation
The evaluation of a thesaurus for information retrieval operations
is based primarily on the results of information searches using thesaurus
lookup. However, the results are dependent on the characteristics of the
thesaurus classes; hence, the classes are subject to an independent
evaluation.
A) Evaluation of the Classes
Initial classes are composed of the similar concepts in each
subcollection. Here, two concepts are considered similar when they occur
in, and are absent from, the same documents. The overlap correlation
function measures similarity only on the basis of overlap between concepts
and is therefore not used in the formation of initial classes. For example,
given c_i = (1,1,0,0), c_j = (1,1,0,0), and c_k = (1,1,1,1), then
S_ij = 1.0 and S_ik = 1.0 for the overlap function. On the other hand,
S_ij = 1.0 for the cosine and Tanimoto functions, but S_ik = .71 and .50
respectively. Thus, these two functions measure similarity on the basis
of co-occurrence and are, therefore, of significant interest.
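
The figures in the example above can be checked with a short sketch. The
report does not give the formulas in this section, so the definitions below
are assumptions: overlap is taken as the sum of element-wise minima divided
by the smaller of the two vector sums, and cosine and Tanimoto are taken in
their usual forms. The function and variable names are illustrative only.

```python
import math


def overlap(a, b):
    # Assumed form: shared weight divided by the smaller total weight.
    shared = sum(min(x, y) for x, y in zip(a, b))
    return shared / min(sum(a), sum(b))


def cosine(a, b):
    # Usual cosine: inner product over the product of vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm


def tanimoto(a, b):
    # Usual Tanimoto: inner product over (|a|^2 + |b|^2 - inner product).
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


# Concept vectors from the example: one entry per document.
c_i = (1, 1, 0, 0)
c_j = (1, 1, 0, 0)
c_k = (1, 1, 1, 1)

for name, func in (("overlap", overlap), ("cosine", cosine), ("Tanimoto", tanimoto)):
    print(f"{name}: S_ij = {func(c_i, c_j):.2f}, S_ik = {func(c_i, c_k):.2f}")
# overlap: S_ij = 1.00, S_ik = 1.00
# cosine: S_ij = 1.00, S_ik = 0.71
# Tanimoto: S_ij = 1.00, S_ik = 0.50
```

Under these assumed definitions, the overlap function cannot distinguish c_k
from c_j relative to c_i, whereas the cosine and Tanimoto values drop because
c_k also occurs in documents where c_i does not.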
In order to evaluate the formation of initial classes within a
subcollection, let k be the chosen cut-off value; N, the number of concepts
in the subcollection; L, the number of classes formed, and M, the number
of concepts appearing in more than one class. Then, define the overlap