ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
No matter what particular method of thesaurus construction is adopted,
the main virtue of an automatic process is to eliminate the human element,
either completely if a fully-automatic method can be found, or partially
if the process is semi-automatic. In the latter case, it is desirable to
restrict the human activities to questions which require only local
decisions within the given subject area, rather than global considerations
involving linguistic knowledge, and experience in subject classification
and indexing.
Some systematic procedures for thesaurus construction are described
in the next few paragraphs, and a simplified example is given of one
particular semi-automatic process.
A) Fully Automatic Methods
Most automatic methods for thesaurus construction are based on the
vocabulary contained in a sample document collection assumed to be typical
for a given subject area.[i.,5,6] In particular, a frequency count is made
of the words contained in a set of documents, and each document is identi-
fied by certain high frequency words included in it. The choice of these
words may be based strictly on frequency characteristics, or alternatively
on more complicated properties of the word distribution for the given
collection. In any case, the sample collection is initially represented
by a term-document matrix, or a term-document graph as shown in Fig. 15.
The matrix element at the intersection of row i and column j of the
matrix represents the weight of term j in document i; this same weight
is represented in the graph of Fig. 15(b) by the labelled branch between
nodes T_j and D_i.
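
The construction of the term-document matrix described above may be
sketched as follows. This is only an illustrative sketch under stated
assumptions, not the procedure of the report: the weight of a term is
taken to be its raw frequency count in the document, a word is kept as
a term if it occurs at least min_count times in some document, and the
names term_document_matrix, term_document_graph, and min_count are
hypothetical. The second function lists the labelled branches of the
equivalent term-document graph, one for each nonzero matrix element.

```python
import re
from collections import Counter


def term_document_matrix(documents, min_count=2):
    """Return (terms, matrix); matrix[i][j] is the weight of term j in document i.

    Assumption: the weight is the raw frequency count, and a word becomes
    a term if it occurs at least min_count times in at least one document.
    """
    # Frequency count of the words contained in each document.
    counts = [Counter(re.findall(r"[a-z]+", text.lower())) for text in documents]
    # Vocabulary: words that are high-frequency in some document.
    terms = sorted({w for c in counts for w, n in c.items() if n >= min_count})
    # Row i is document i, column j is term j (a Counter returns 0 if absent).
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


def term_document_graph(terms, matrix):
    """Return the labelled branches (term, document index, weight) of the graph."""
    return [
        (terms[j], i, w)
        for i, row in enumerate(matrix)
        for j, w in enumerate(row)
        if w > 0
    ]


# Two invented sample documents, used only for illustration.
docs = [
    "retrieval of documents; document retrieval systems",
    "thesaurus construction and thesaurus use in retrieval",
]
terms, matrix = term_document_matrix(docs)
edges = term_document_graph(terms, matrix)
```

For the two sample documents, terms is ["retrieval", "thesaurus"] and
matrix is [[2, 0], [1, 2]]; the graph then has three branches, one for
each nonzero element. A more elaborate choice of terms, based on the
word distribution over the whole collection, would change only the
selection step.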