IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
chapter
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Because of the smaller class coefficient, the cut-off value .71 is
used. Any higher value does not significantly improve any of the evaluation
measures and destroys some of the useful relations in the subcollection.
An added factor to be considered is that a concept appearing only once
can be connected in the graph only to other concepts appearing once
(in the same document). Such concepts may be removed from the
concept-document matrix and placed in an initial class before the
similarity matrix is computed, which reduces the amount of computation.
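The removal step described above can be sketched as follows. This is an
illustrative sketch only, not the program used in the experiment: the
function name split_singletons and the dictionary representation of the
concept-document matrix are assumptions, as is the choice to group
once-occurring concepts by the document in which they appear.

```python
from collections import defaultdict


def split_singletons(concept_docs):
    """Separate once-occurring concepts from the rest (hypothetical helper).

    concept_docs maps a concept id to a list of document ids, one entry
    per occurrence of the concept. This layout is an assumption made for
    the sketch, not the representation used in the report.
    """
    frequent = {}                          # concepts kept for the similarity matrix
    singleton_classes = defaultdict(list)  # document id -> concepts occurring once
    for concept, docs in concept_docs.items():
        if len(docs) == 1:
            # A concept occurring once can only co-occur with other
            # concepts of its single document, so it is set aside here.
            singleton_classes[docs[0]].append(concept)
        else:
            frequent[concept] = docs
    return frequent, dict(singleton_classes)


concept_docs = {"c1": [1, 2], "c2": [3], "c3": [3], "c4": [4]}
frequent, singles = split_singletons(concept_docs)
```

Only the concepts left in frequent need to enter the similarity
computation, so the matrix shrinks by the number of concepts set aside.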
Instead of combining concepts which occur only once into a single
concept class, each of these concepts can be treated as an individual
concept class. In order to avoid confusion, the thesaurus constructed by
this method will be known as THS 2, and the thesaurus constructed by
combining concepts which occur only once will be called THS 1.
The merged and final classes are combinations of closely related
initial classes. Therefore, it is always desirable to combine initial
classes which are subsets of each other. The overlap correlation function,
measuring similarity only on the basis of co-occurrence of concepts,
gives a correlation value of 1.0 in such cases. For this reason, the
overlap function is used in the formation of the merged and final classes.
Fig. 5 gives some statistics on THS 1 and THS 2.
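The subset property of the overlap function can be illustrated with a
short sketch. The exact definition used in the experiment is not given
in this section; the set-based form below (size of the intersection
divided by the size of the smaller class) is an assumption, and the name
overlap is chosen here for illustration.

```python
def overlap(class_a, class_b):
    """Overlap coefficient between two concept classes given as sets.

    Assumed form: |A & B| / min(|A|, |B|). Returns 0.0 if either class
    is empty, a convention adopted for this sketch.
    """
    a, b = set(class_a), set(class_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


subset_value = overlap({"c1", "c2"}, {"c1", "c2", "c3"})   # one class inside the other
partial_value = overlap({"c1", "c2"}, {"c2", "c3"})        # one shared concept
disjoint_value = overlap({"c1"}, {"c2"})                   # nothing shared
```

Under this assumed form, a class contained in another always scores 1.0
regardless of how much larger the containing class is, which is the
behavior the text relies on when merging initial classes.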
B) Retrieval Evaluation
To evaluate the retrieval performance of an automatic thesaurus,
three information searches are used - one search with a document and query
collection before the thesaurus lookup; one search after the lookup in a
manual thesaurus; and one search after the lookup in the automatic thesaurus.
By comparing the precision and recall statistics for all three searches, the