Scientific Report No. ISR-11, Information Storage and Retrieval
Chapter: Information Analysis and Dictionary Construction
G. Salton and M. E. Lesk, Harvard University
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

After the list of word-uses to be included in the thesaurus is available, it becomes necessary to group them into thesaurus classes. This can be done in various ways:

1) an informal judgment can be made for each pair of word-uses to decide whether, in the subject area under consideration, they are synonymous, and if so, they can be grouped in the same thesaurus class;

2) a set of "syntactic frames" can be used, and those word-uses which fit into the same frames can be collected in the same thesaurus group, or, equivalently, a decision is made based on whether term A can replace term B in a given context. [9] This decision is of course not mechanized, but the dictionary maker is faced only with local choices within certain narrow limits;

3) a set of questions can be prepared, designed to elicit answers about the terms to be grouped, and each term can be identified by the set of answers obtained in response to the proposed questions; for example, one might ask "does this term represent a physical object or process, or does it represent an abstraction, or is this question inapplicable"; a score of 1 may then be assigned for a physical object, 2 for an abstraction, and 3 if the question is not applicable.

At the end of such a procedure, each term is then identified by a set of properties (in the form of contexts which fit a given term, or in the form of answers to questions about the terms), and the complete vocabulary may be represented by a property matrix, as shown in simplified form in Fig. 18.
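As a rough illustration of the question-based procedure, the sketch below builds a small property matrix in Python, with one row per term and one column per question. The terms, the question identifiers, the second question, and the function name `build_property_matrix` are all invented for this example and do not come from the report; only the 1/2/3 scoring of the first question follows the text. Treating an unanswered question as "not applicable" is likewise an assumption.

```python
from typing import Dict, List

# Answer codes following the scoring scheme described in the text:
# 1 = physical object or process, 2 = abstraction, 3 = question not applicable.
PHYSICAL, ABSTRACT, NOT_APPLICABLE = 1, 2, 3


def build_property_matrix(
    answers: Dict[str, Dict[str, int]],
    questions: List[str],
) -> Dict[str, List[int]]:
    """Return one row of scores per term, with columns ordered as in `questions`."""
    matrix: Dict[str, List[int]] = {}
    for term, scored in answers.items():
        # Assumption: a question with no recorded answer counts as inapplicable.
        matrix[term] = [scored.get(q, NOT_APPLICABLE) for q in questions]
    return matrix


# Hypothetical question identifiers. The first mirrors the example question
# in the text; the second is made up and reuses the same 1/2/3 codes.
questions = ["physical_or_abstract", "hardware_or_software"]

# Hypothetical answers supplied by the dictionary maker.
answers = {
    "transistor": {"physical_or_abstract": PHYSICAL, "hardware_or_software": 1},
    "amplification": {"physical_or_abstract": PHYSICAL},
    "stability": {"physical_or_abstract": ABSTRACT},
}

matrix = build_property_matrix(answers, questions)

for term, row in matrix.items():
    print(f"{term:15s} {row}")
```

Each row of `matrix` plays the role of one row of the property matrix: a fixed-length list of scores that identifies a term by its answers.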
It remains, then, to find the semantic distance between terms by comparing the rows of properties representing the respective word-uses. Specifically, rows which are completely identical can be coalesced into a single group immediately; terms which are not identical may be