IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
effectiveness of the automatic thesaurus may be decided.
The thesaurus collections are formed by treating each document or
query independently. For each concept-weight pair (n, w), the thesaurus
classes N1, N2, ..., Nk corresponding to n are determined by a table
lookup procedure. The concept-weight pairs added to the new document (query)
are (N1, w/k), (N2, w/k), ..., (Nk, w/k). If k is greater than 6 for a
given concept n, the concept is dropped from the thesaurus. This is done
because of space limitations, but these concepts would probably have very
small weights anyway since the weight is divided by k. At the end of the
lookup, concept pairs with duplicate concept numbers are eliminated. The
duplicates are replaced by a single concept-weight pair whose weight is
the sum of the weights in the duplicates.
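
The lookup procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the experiment: the function name, the dictionary layout of the thesaurus table, and the decision to skip concepts that have no thesaurus entry are all hypothetical. The cutoff of 6 classes, the division of the weight by k, and the merging of duplicates by summing their weights follow the text.

```python
from collections import defaultdict

MAX_CLASSES = 6  # from the text: a concept with k > 6 classes is dropped


def expand_with_thesaurus(vector, thesaurus, max_classes=MAX_CLASSES):
    """Map a concept vector onto thesaurus classes.

    vector    -- list of (concept, weight) pairs for one document or query
    thesaurus -- dict: concept number -> list of class numbers
                 (hypothetical layout for the lookup table)
    Returns a list of (class, weight) pairs sorted by class number.
    """
    merged = defaultdict(float)
    for concept, weight in vector:
        classes = thesaurus.get(concept, [])
        k = len(classes)
        # Assumption: a concept with no thesaurus entry is skipped.
        # From the text: a concept with more than max_classes classes is dropped.
        if k == 0 or k > max_classes:
            continue
        share = weight / k  # each class receives w/k
        for cls in classes:
            # Duplicate class numbers collapse into one pair whose
            # weight is the sum of the duplicates' weights.
            merged[cls] += share
    return sorted(merged.items())


# Illustrative data only: concept 3 maps to 7 classes and is dropped.
example_doc = [(1, 12.0), (2, 6.0), (3, 12.0)]
example_thesaurus = {1: [10, 11], 2: [11], 3: [20, 21, 22, 23, 24, 25, 26]}
print(expand_with_thesaurus(example_doc, example_thesaurus))
```

When no concept is dropped, the total weight of the expanded vector equals that of the original, since each weight w is split into k shares of w/k.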
In the ADI collection, the lookup procedure produces a document and
query collection with more concepts per document than in the original. The
weights associated with these concepts are smaller than before, although
the sum of the weights in both collections is nearly equal for THS 1.
4. Analysis of Results
The results of the search evaluation for the ADI thesauruses are
given in Fig. 6. The weighted cosine function is used to match the queries
against the documents. Given query i and document j, the correlation is
defined as follows:
\[
S_{ij} = \frac{\sum_{k=1}^{t} q_k^{i}\, d_k^{j}}
              {\sqrt{\sum_{k=1}^{t} \left(q_k^{i}\right)^{2} \, \sum_{k=1}^{t} \left(d_k^{j}\right)^{2}}}
\]
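
As an illustration of the weighted cosine match between a query and a document, the following Python sketch computes the correlation from two sparse concept vectors. It is a minimal example under stated assumptions rather than the SMART implementation: the function name and the dictionary representation (concept number mapped to weight) are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine_correlation(query, document):
    """Weighted cosine correlation of two sparse concept vectors.

    query, document -- dict: concept number -> weight (hypothetical layout)
    """
    # Numerator: sum of weight products over the concepts both vectors share.
    numerator = sum(w * document[c] for c, w in query.items() if c in document)
    # Denominator: product of the Euclidean lengths of the two vectors.
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    d_norm = math.sqrt(sum(w * w for w in document.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # assumption: an empty vector matches nothing
    return numerator / (q_norm * d_norm)


# Illustrative vectors only.
print(cosine_correlation({1: 3.0, 2: 4.0}, {1: 3.0, 2: 4.0}))
print(cosine_correlation({1: 1.0}, {2: 1.0}))
```

Identical vectors give a correlation of 1, and vectors with no concept in common give 0.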