Scientific Report No. ISR-11, Information Storage and Retrieval
Information Analysis and Dictionary Construction (chapter)
G. Salton, M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

No matter what particular method of thesaurus construction is adopted, the main virtue of an automatic process is that it eliminates the human element, either completely if a fully automatic method can be found, or partially if the process is semi-automatic. In the latter case, it is desirable to restrict the human activities to questions which require only local decisions within the given subject area, rather than global considerations involving linguistic knowledge and experience in subject classification and indexing. Some systematic procedures for thesaurus construction are described in the next few paragraphs, and a simplified example is given of one particular semi-automatic process.

A) Fully Automatic Methods

Most automatic methods for thesaurus construction are based on the vocabulary contained in a sample document collection assumed to be typical for a given subject area.[i.,5,6] In particular, a frequency count is made of the words contained in a set of documents, and each document is identified by certain high-frequency words included in it. The choice of these words may be based strictly on frequency characteristics, or alternatively on more complicated properties of the word distribution for the given collection. In any case, the sample collection is initially represented by a term-document matrix, or a term-document graph, as shown in Fig. 15. The matrix element at the intersection of row i and column j represents the weight of term j in document i; this same weight is represented in the graph of Fig. 15 (b) by the labelled branch between nodes T_j and D_i.
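
The frequency count and the two equivalent representations described above can be illustrated with a minimal sketch. It is not the procedure used in the report: the toy documents, the function names `build_term_document_matrix` and `matrix_to_graph`, the whitespace tokenization, the use of raw counts as weights, and the `min_count` cutoff for "high-frequency" words are all assumptions made for illustration. Rows correspond to documents and columns to terms, following the convention stated in the text.

```python
from collections import Counter


def build_term_document_matrix(documents, min_count=2):
    """Return (terms, matrix), where matrix[i][j] is the count of term j in document i.

    min_count is a hypothetical cutoff: a word is kept as a term only if it
    occurs at least that many times in the whole collection.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # Frequency count over the whole sample collection.
    totals = Counter(word for words in tokenized for word in words)
    # Keep only the high-frequency words, in alphabetical order.
    terms = sorted(word for word, count in totals.items() if count >= min_count)
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        matrix.append([counts[term] for term in terms])
    return terms, matrix


def matrix_to_graph(terms, matrix):
    """Return the labelled branches (term, document index, weight) for nonzero entries."""
    return [
        (terms[j], i, weight)
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight > 0
    ]


# Toy sample collection (invented for this example).
documents = [
    "retrieval of stored information",
    "automatic thesaurus construction",
    "thesaurus use in information retrieval",
]

terms, matrix = build_term_document_matrix(documents)
edges = matrix_to_graph(terms, matrix)

print(terms)
for row in matrix:
    print(row)
print(edges)
```

Each nonzero matrix entry yields exactly one labelled branch in the graph, so the two forms carry the same information; the graph form simply omits the zero entries.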