IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
class 70, query 5 might be expected to retrieve document 46, an irrelevant
document. This is exactly what happens, for document 46 is ranked 23rd
using the original thesaurus and is ranked 4th using THS 1.
D) Dividing Weights
A common objection against automatic thesauruses is that they
contain too much overlap between concept classes. Thus, the concepts which
occur in several concept classes (which are in fact the most common concepts)
do not contribute much to the thesaurus, as their weights are divided by
the number of concept classes in which they occur. The evaluation results
using THS 1 and THS 2 indicate that manual thesauruses do not contain enough
overlap, rather than that automatic thesauruses contain too much. However,
an argument might be raised against dividing the weights. To settle
this argument, THS 2 was also evaluated without dividing the weights during
the lookup. The results were much worse:
N.R. = .74, down from .83, and
N.P. = .52, down from .65.
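The effect of dividing the weights during the lookup can be illustrated with a minimal sketch. The function name lookup, the dictionary layout of the thesaurus, and the sample concepts and weights are all hypothetical and are not taken from the SMART implementation; the sketch only assumes that each concept maps to a list of concept classes and that its weight is either split evenly among them or assigned in full to each.

```python
from collections import defaultdict


def lookup(concepts, thesaurus, divide_weights=True):
    """Map weighted concepts to a weighted concept-class vector.

    concepts:  dict of concept -> weight (hypothetical input layout)
    thesaurus: dict of concept -> list of concept-class numbers
    """
    vector = defaultdict(float)
    for concept, weight in concepts.items():
        classes = thesaurus.get(concept, [])
        if not classes:
            continue  # concept not in the thesaurus; nothing to assign
        # With division, a concept in k classes contributes weight / k to each.
        share = weight / len(classes) if divide_weights else weight
        for cls in classes:
            vector[cls] += share
    return dict(vector)


# Illustrative data only: one concept in three classes, one in a single class.
thesaurus = {"retrieval": [70, 71, 72], "thesaurus": [70]}
document = {"retrieval": 12.0, "thesaurus": 12.0}

divided = lookup(document, thesaurus)
undivided = lookup(document, thesaurus, divide_weights=False)
```

Without division, the concept that occurs in three classes contributes its full weight to each of them, so the most common concepts dominate the class vector. This is consistent with the poorer results reported when the weights are not divided, although the sketch itself does not reproduce those figures.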
E) Cranfield Collection
Although the results from the ADI text are encouraging, the goal is
to produce an automatic thesaurus starting from a word stem thesaurus rather
than a regular manual thesaurus. Since the original concepts in a stem
thesaurus are not themselves manually constructed concept classes, it can be
expected that many more connections exist between the original concepts than
in a regular thesaurus. Thus, THS 1 constructed from the Cranfield stem
thesaurus contains too much overlap between concept classes. In fact, over