Scientific Report No. IRS-13 Information Storage and Retrieval
Summary
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Thesaurus dictionaries, phrase dictionaries and hierarchical
arrangements of terms are described and evaluated for retrieval
effectiveness in section VII. A thesaurus is generally used to assemble
certain terms into common thesaurus groups according to specified
similarity criteria. Terms within the same group can then be reduced to
a unique class number, thus providing a certain amount of language
normalization. The best thesaurus dictionaries produce an average
retrieval performance superior to that provided by the stem dictionaries.
For high-precision users, the thesaurus results are not, however, very
different from the stem results. Thesaurus construction rules have
been devised to ensure that a thesaurus is obtained which will, in fact,
operate satisfactorily in a retrieval environment, and produce the
expected improvements for high-recall users.
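
The reduction of terms to class numbers described above can be illustrated with a minimal sketch. It is not the procedure used in the report: the thesaurus entries, the class numbers, and the function names `normalize` and `overlap` are all hypothetical, chosen only to show how mapping terms in the same group to one class number lets a query match a document that uses different words for the same concepts.

```python
from collections import Counter

# Hypothetical thesaurus: each term maps to the class number of its group.
# The groups and numbers are illustrative assumptions, not taken from the report.
THESAURUS = {
    "retrieval": 101,
    "search": 101,
    "lookup": 101,
    "document": 102,
    "text": 102,
    "report": 102,
}


def normalize(terms, thesaurus=THESAURUS):
    """Replace each known term by its class number; keep unknown terms as-is."""
    return Counter(thesaurus.get(t.lower(), t.lower()) for t in terms)


def overlap(query_terms, doc_terms, thesaurus=THESAURUS):
    """Count the identifiers shared by query and document after normalization."""
    q = normalize(query_terms, thesaurus)
    d = normalize(doc_terms, thesaurus)
    # Counter intersection keeps the minimum count of each shared identifier.
    return sum((q & d).values())


query = ["search", "report"]
document = ["document", "retrieval", "system"]

# Without a thesaurus (empty mapping) the two word lists share nothing.
raw_match = overlap(query, document, thesaurus={})
# With the thesaurus, both query terms fall into classes present in the document.
class_match = overlap(query, document)
```

Here the query and the document have no words in common, yet both of the query's terms share a class with a document term once the class numbers are substituted. This is the kind of language normalization that would be expected to help recall more than precision.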
The results exhibited in section VII for the phrase dictionaries
and hierarchical subject arrangements show that the effect of these
devices is not as yet sufficiently reliable to warrant their inclusion
in operational situations.
Suggestions are also made in section VII for additional retrieval
experiments using stored dictionaries, and for the generation of additional
language normalization tools.
An experiment in fully automatic thesaurus construction is described
in section VIII by R. T. Dattola and D. M. Murray. The procedure
consists in breaking a document collection down into sub-collections,
using document-document correlation methods. For each sub-collection, a
thesaurus is then constructed using term-term correlation methods. Finally,