Scientific Report No. IRS-13
Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola, D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

effectiveness of the automatic thesaurus may be decided.

The thesaurus collections are formed by treating each document or query independently. For each concept-weight pair (n, w), the thesaurus classes N1, N2, ..., Nk corresponding to n are determined by a table lookup procedure. The concept-weight pairs added to the new document (query) are (N1, w/k), (N2, w/k), ..., (Nk, w/k). If k is greater than 6 for a given concept n, the concept is dropped from the thesaurus. This is done because of space limitations, but these concepts would probably have very small weights anyway, since the weight is divided by k. At the end of the lookup, concept pairs with duplicate concept numbers are eliminated. The duplicates are replaced by a single concept-weight pair whose weight is the sum of the weights in the duplicates.

In the ADI collection, the lookup procedure produces a document and query collection with more concepts per document than in the original. The weights associated with these concepts are smaller than before, although the sum of the weights in both collections is nearly equal for THS 1.

4. Analysis of Results

The results of the search evaluation for the ADI thesauruses are given in Fig. 6. The weighted cosine function is used to match the queries against the documents. Given query i and document j, the correlation is defined as follows:

\[
S_{ij} = \frac{\sum_{k=1}^{t} q_k^{i} \, d_k^{j}}{\sqrt{\sum_{k=1}^{t} \left(q_k^{i}\right)^2 \; \sum_{k=1}^{t} \left(d_k^{j}\right)^2}}
\]
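The lookup procedure and the weighted cosine correlation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used in the report. The function names (`apply_thesaurus`, `cosine`), the dictionary layout of the thesaurus table, and the treatment of concepts missing from the table (dropped here) are assumptions made for the example. The limit of 6 classes and the division of each weight by k come from the text.

```python
from collections import defaultdict
from math import sqrt

MAX_CLASSES = 6  # from the text: a concept with k > 6 classes is dropped


def apply_thesaurus(vector, thesaurus):
    """Map a concept-weight vector onto thesaurus classes.

    vector:    dict {concept n: weight w}
    thesaurus: dict {concept n: [N1, ..., Nk]} (hypothetical table layout)
    """
    merged = defaultdict(float)
    for n, w in vector.items():
        classes = thesaurus.get(n, [])
        k = len(classes)
        # Assumption: concepts absent from the table are dropped as well.
        if k == 0 or k > MAX_CLASSES:
            continue
        for cls in classes:
            # Each class receives w/k; duplicate class numbers are merged
            # by summing their weights.
            merged[cls] += w / k
    return dict(merged)


def cosine(query, document):
    """Weighted cosine correlation between two concept-weight vectors."""
    dot = sum(w * document.get(c, 0.0) for c, w in query.items())
    norm = sqrt(sum(w * w for w in query.values())
                * sum(w * w for w in document.values()))
    return dot / norm if norm else 0.0


# Made-up example data, not taken from the ADI collection.
thesaurus = {1: [10, 11], 2: [11], 3: [20, 21, 22, 23, 24, 25, 26]}
doc = apply_thesaurus({1: 4.0, 2: 3.0, 3: 7.0}, thesaurus)
print(doc)  # {10: 2.0, 11: 5.0}  (concept 3 has 7 classes and is dropped)
```

In this made-up example the weights of the retained concepts (4.0 + 3.0) are preserved in the output (2.0 + 5.0), which illustrates why the sum of the weights can stay nearly equal between the original and the thesaurus collection when few concepts are dropped.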