Scientific Report No. ISR-11 Information Storage and Retrieval Information Analysis and Dictionary Construction chapter G. Salton M. E. Lesk Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

grouped by judiciously eliminating certain properties (certain columns in the property matrix); alternatively, terms which have already been grouped may be split apart by introducing new properties to differentiate them. For example, if property P3 is removed from the property matrix of Fig. 18, terms T and T will both exhibit the same set of 3 assigned properties (although with differing weights), and may therefore be grouped. Similarly, the removal of property P results in the grouping of terms T and T.

In practice, it may be useful to consider more formal methods first for comparing the rows of the property matrix (that is, for computing a similarity coefficient between each pair of terms), and then for generating term clusters.

a) Sample Thesaurus Generation

The procedure previously outlined may now be summarized as follows: automatic methods are used to prepare a word frequency list, as well as a concordance, for the principal words included in a sample document collection. A decision is then made concerning the number of word-uses to be included in the thesaurus for each distinct word, and discriminating questions are prepared to serve for purposes of word classification. The property lists which result are then compared, and word-uses which are identified by similar property lists are assigned to the same thesaurus category. [9,11]

Consider a typical example in which P non-common word-uses are identified by M properties, and it is desired to create a thesaurus with N concept groups.[14] If the P word-uses have a total frequency of occurrence of n in the collection, each thesaurus class should account