Scientific Report No. IRS-13
Information Storage and Retrieval

An Experiment in Automatic Thesaurus Construction
R. T. Dattola, D. M. Murray

Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

where $q_k^i$ is the weight of concept $k$ in query $i$, $d_k^j$ is the weight of concept $k$ in document $j$, and $t$ is the total number of concepts.

Because the original ADI collection is a manual thesaurus, the automatic thesauruses constructed from this collection are actually super-thesauruses. However, both THS 1 and THS 2 give better results than the original manual thesaurus.

Two evaluation functions that are useful for comparing the retrieval results of a given query using different thesauruses are the normalized recall and the normalized precision. Specifically,

$$\text{N.R.} = 1.0 - \frac{1}{n\,(N-n)} \sum_{i=1}^{n} (r_i - i)$$

and

$$\text{N.P.} = 1.0 - \frac{\sum_{i=1}^{n} \ln r_i - \ln n!}{\ln \dfrac{N!}{(N-n)!} - \ln n!}$$

where $N$ is the total number of documents, $n$ is the number of relevant documents, and $r_i$ is the rank of the $i$th relevant document. The normalized recall and precision values for the three ADI searches are given in Table 3.

Although THS 2 gives the best results overall, there are several queries for which the original thesaurus is best and several for which THS 1 is best. A closer inspection of the results leads to the following conclusions:

a) the amount of overlap between concept classes of a manual thesaurus such as the ADI can be increased by automatic procedures to produce better results;