Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
where $q_{ik}$ is the weight of concept k in query i, $d_{jk}$ is the weight of
concept k in document j, and t is the total number of concepts.
Because the original ADI collection is a manual thesaurus, the auto-
matic thesauruses constructed from this collection are actually super-
thesauruses. However, both THS 1 and THS 2 give better results than the
original manual thesaurus. Two evaluation functions that are useful for
comparing the retrieval results of a given query using different thesauruses
are the normalized recall and the normalized precision. Specifically,
$$\mathrm{N.R.} = 1.0 - \frac{1}{(N-n)\,n} \sum_{i=1}^{n} (r_i - i)$$
and
$$\mathrm{N.P.} = 1.0 - \frac{\sum_{i=1}^{n} \ln r_i - \ln n!}{\ln \frac{N!}{(N-n)!} - \ln n!}$$
where N is the total number of documents, n is the number of relevant
documents, and $r_i$ is the rank of the i-th relevant document. The normalized
recall and precision values for the three ADI searches are given in Table 3.
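
As an illustration of how the two measures behave, the following is a minimal Python sketch, assuming the standard definitions in which the denominator of the normalized precision equals ln(N! / ((N-n)! n!)). The function names, and the collection size and ranks in the usage note, are hypothetical and are not taken from the ADI experiments.

```python
import math


def normalized_recall(ranks, total_docs):
    """Normalized recall for one query.

    ranks      -- ranks (1-based) at which the relevant documents were retrieved
    total_docs -- N, the total number of documents in the collection
    Assumes 0 < n < N; otherwise the denominator is zero.
    """
    n = len(ranks)
    ordered = sorted(ranks)
    # Sum of (r_i - i): how far each relevant document lies below its ideal rank.
    excess = sum(r - i for i, r in enumerate(ordered, start=1))
    return 1.0 - excess / (n * (total_docs - n))


def normalized_precision(ranks, total_docs):
    """Normalized precision for one query (same arguments and assumptions)."""
    n = len(ranks)
    ordered = sorted(ranks)
    # Sum of ln r_i minus ln n!, written term by term as ln r_i - ln i.
    numerator = sum(math.log(r) - math.log(i)
                    for i, r in enumerate(ordered, start=1))
    # ln(N! / ((N-n)! n!)), computed with lgamma to avoid huge factorials.
    denominator = (math.lgamma(total_docs + 1)
                   - math.lgamma(total_docs - n + 1)
                   - math.lgamma(n + 1))
    return 1.0 - numerator / denominator
```

For a hypothetical collection of 10 documents, a perfect ranking such as `normalized_recall([1, 2, 3], 10)` gives 1.0, and the worst possible ranking `[8, 9, 10]` gives 0.0 for both measures (up to floating-point rounding). Intermediate rankings fall between these bounds, which is what makes the measures convenient for comparing the same query across different thesauruses.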
Although THS 2 gives the best results overall, there are several
queries where the original thesaurus is best and several queries where THS 1
is best. A closer inspection of the results indicates the following con-
clusions:
a) the amount of overlap between concept classes of a manual
thesaurus such as the ADI can be increased by automatic pro-
cedures to produce better results;