Scientific Report No. ISR-11, Information Storage and Retrieval. Chapter: Information Analysis and Dictionary Construction. G. Salton, M. E. Lesk. Harvard University, Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

for a total frequency of n/N, assuming that classes of approximately equal frequency are wanted. The process of generating N classes from P initial property sets may now be carried out as follows:

1) a P × M word-use versus property matrix (similar to that shown in Fig. 18) is constructed;
2) the property vectors are sorted into numeric order, and the set of P property vectors is reduced to only the distinct property vectors, say Q1 < P;
3) since each of the distinct vectors is to account for a word-use frequency of n/N, each vector is examined to see whether the total frequency represented by that vector is approximately n/N;
4) if a given concept vector occurs with a frequency smaller than n/N, it represents too small a class and should be combined with other vectors; this is done by deleting a sufficient number of questions (columns of the property matrix) to obtain a resulting combined concept class of frequency approximately equal to n/N; let the number of distinct property vectors which result be equal to Q2 < Q1;
5) some property vectors account for too large a frequency count, and ought to be broken up by using the concordance to formulate additional questions so as to resolve finer shades of meaning; this eventually produces Q3 distinct vectors (Q3 > Q2);
6) by alternately using the procedures of parts 4) and 5), the frequency count of each of the resulting N vectors eventually may approach n/N, at which point the process terminates.

Consider, as an example, the list of word-uses shown in Fig. 19 (a), accounting for a total frequency count of 2198 word instances, and assume that it is desired to form a thesaurus with 5 concept classes.
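Before continuing with the example, the grouping and checking of steps 2) to 4) above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure as implemented in the report: the function names, the 25 percent tolerance used to decide what "approximately n/N" means, and the word-use frequencies are all invented for illustration and are not taken from Fig. 19.

```python
from collections import defaultdict


def class_frequencies(word_uses, keep_columns):
    """Steps 2) and 3): group word-uses whose property vectors agree on the
    retained questions (columns) and sum their frequencies."""
    totals = defaultdict(int)
    for vector, freq in word_uses:
        key = tuple(vector[c] for c in keep_columns)
        totals[key] += freq
    return dict(totals)


def label_classes(totals, target, tolerance=0.25):
    """Compare each class frequency with the target n/N.
    The tolerance is an assumption; the text only says 'approximately'."""
    labels = {}
    for key, freq in totals.items():
        if freq < target * (1 - tolerance):
            labels[key] = "small"   # candidate for merging, step 4)
        elif freq > target * (1 + tolerance):
            labels[key] = "large"   # candidate for splitting, step 5)
        else:
            labels[key] = "ok"
    return labels


# Hypothetical data: (property vector, frequency) pairs, three questions each.
word_uses = [
    ((1, 0, 1), 300),
    ((1, 0, 0), 140),
    ((0, 1, 0), 450),
    ((0, 1, 1), 900),
    ((0, 0, 0), 410),
]
n = sum(freq for _, freq in word_uses)   # total frequency
N = 5                                    # desired number of classes
target = n / N

# All three questions retained.
before = class_frequencies(word_uses, keep_columns=[0, 1, 2])
labels_before = label_classes(before, target)

# Step 4): delete the third question (column 2) to merge small classes.
after = class_frequencies(word_uses, keep_columns=[0, 1])
labels_after = label_classes(after, target)
```

In this invented data, deleting the third question merges the two undersized classes into one of roughly the target size, but it also merges two other vectors into an oversized class. That side effect is why the text alternates merging (step 4) with the formulation of additional questions (step 5).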
Each concept vector should then cover approximately 2200/5 = 440 word