Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
ratio, class ratio, and class coefficient as follows:
1) overlap ratio = M/N,
2) class ratio = L/N, and
3) class coefficient = 100 (M/N) (L/N).
The class coefficient is used as a single evaluation measure for the
initial classes formed from the subcollection. Because it is desirable that
both the overlap ratio and the class ratio be small, it follows that the
class coefficient should be small. However, if each concept were put into
its own class, the overlap ratio and class coefficient would be 0, and the
class ratio 1. Therefore, the three evaluation measures are best considered
in conjunction with each other.
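The relation among the three measures can be illustrated with a short sketch. It assumes that the class coefficient is the product 100 (M/N) (L/N), which agrees with the statement that the coefficient is 0 whenever the overlap ratio is 0. The function name and the sample counts are hypothetical, and the meanings of M, N, and L are taken as given by the definitions preceding this passage.

```python
def evaluation_measures(M, N, L):
    """Return (overlap ratio, class ratio, class coefficient).

    Assumes class coefficient = 100 * (M/N) * (L/N); M, N, and L are
    the counts used in the definitions of the three measures.
    """
    if N <= 0:
        raise ValueError("N must be positive")
    overlap_ratio = M / N
    class_ratio = L / N
    class_coefficient = 100 * overlap_ratio * class_ratio
    return overlap_ratio, class_ratio, class_coefficient


# Degenerate case described in the text: each concept in its own class.
# Hypothetical counts chosen so that M/N = 0 and L/N = 1.
singletons = evaluation_measures(M=0, N=50, L=50)

# A second hypothetical classification with some overlap and fewer classes.
grouped = evaluation_measures(M=10, N=50, L=20)
```

In the degenerate case the coefficient is as small as possible even though the classification is useless, which is why the class ratio must be examined alongside it.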
Table 2 and Fig. 4 give the values of the three evaluation measures
for various cut-offs and correlation functions. Three subcollections from
the ADI collection are used for comparison.
Both the cosine and Tanimoto's function give very similar results.
However, the cosine function is used in the initial document clustering
and in the retrieval searches. Therefore, to provide consistency, it is
also used in the formation of the initial classes.
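For comparison, the two correlation functions can be sketched as follows. The Tanimoto form used here is the commonly cited one, (u.v) / (|u|^2 + |v|^2 - u.v). It is an assumption that the report uses exactly this form, and the sample vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine correlation; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def tanimoto(u, v):
    """Tanimoto correlation in its commonly cited form (assumed here)."""
    denom = dot(u, u) + dot(v, v) - dot(u, v)
    return dot(u, v) / denom if denom else 0.0


# Hypothetical concept vectors: one entry per document (1 = concept occurs).
u = [1, 1, 0, 1]
v = [1, 0, 0, 1]

cos_uv = cosine(u, v)    # 2 / sqrt(3 * 2)
tan_uv = tanimoto(u, v)  # 2 / (3 + 2 - 2)
```

Both functions give 1 for identical vectors and 0 for vectors with no document in common; between these extremes the Tanimoto value is never larger than the cosine value for the same pair.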
The large difference in the class coefficient between the cut-off
values of .70 and .71 (cosine) is explained by the general nature of the
subcollections. There are many concepts which occur in only one document.
Correlating one of these concepts with any other concept in the subcollection
yields one of the following values: 1, 1/√2, 1/√3, ... , 0. The .71
cut-off value permits graph connections only between this concept and other
concepts in the same document which occur only once in the subcollection
(correlation 1). Because 1/√2 is approximately .707, the .70 cut-off value
permits these connections and also connections with concepts which appear
twice in the subcollection.
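The effect of the two cut-off values can be checked with a small sketch. It assumes binary concept vectors (one entry per document) and that a graph connection is made when the cosine correlation is at least the cut-off. The four-document subcollection, the vector names, and the function names are hypothetical. Under these assumptions a concept occurring in one document correlates with a concept occurring in k documents, including that one, at 1/√k.

```python
import math


def cosine(u, v):
    """Cosine correlation; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def connected(u, v, cutoff):
    # Assumption: a connection requires correlation >= cutoff.
    return cosine(u, v) >= cutoff


# Hypothetical subcollection of four documents d1..d4.
single_a = [1, 0, 0, 0]   # occurs only in d1
single_b = [1, 0, 0, 0]   # also occurs only in d1
double_c = [1, 1, 0, 0]   # occurs in d1 and d2
elsewhere = [0, 0, 1, 0]  # occurs only in d3

for cutoff in (0.71, 0.70):
    print(cutoff,
          connected(single_a, single_b, cutoff),
          connected(single_a, double_c, cutoff),
          connected(single_a, elsewhere, cutoff))
```

Lowering the cut-off from .71 to .70 admits the pair whose correlation is 1/√2, which is the only change between the two settings for concepts of this kind.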