MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Automatic Classification and Categorization chapter Mary Elizabeth Stevens National Bureau of Standards

theoretical aspects of automatic indexing and sentence extraction involving co-occurrences of words. While Tanimoto's studies with respect to linguistic information processing for classification purposes have apparently been limited to the theoretical considerations, similar concepts of probabilistic, computational, and matrix manipulative operations to derive and use coefficients of correlation of associations between such attributes as words occurring in text or the index terms assigned to documents are involved in the factor analysis and theory of clumps techniques as applied in actual experiments in documentary classification.

5.1 Factor Analysis

The factor analysis technique which seeks to derive from word associations in representative documents an automatically generated classification schedule for use in actual indexing experiments has previously been mentioned. 1/ Reasons suggested for its use in research at SDC have been reported as follows:

"The development of automatic procedures for purposes of classification and abstracting requires the identification and specification of attributes of words or passages so that the relevancy of topics or content can be determined. Automatic procedures to detect such attributes may be based on a number of characteristics of the text: word frequencies, syntactical information, semantic information and pragmatic contextual clues. Currently, word frequency information can be generated and manipulated by automatic procedures, whereas the other attributes are not as readily handled this way. However, a correlation matrix of content words becomes very unwieldy because of its size and the complexity of relationships. For this reason, factor analysis is used to identify clusters of relationships.
Current work concentrates primarily on determining the usefulness of factors identified in this way as classification and indexing schemes." 2/

As noted above, Borko and Bernick (1961 [73], 1962 [77], 1963 [78]) have applied this technique to abstracts drawn from psychological literature and to the same computer literature abstracts as had been used by Maron (1961 [395]). This technique had also been investigated in the studies looking toward information retrieval classification and grouping undertaken at the Cambridge Language Research Unit from about 1957 onward. However, certain apparent limitations of the factor analysis approach led Parker-Rhodes and Needham to the alternative of the "theory of clumps" (1960 [465], 1961 [435, 464]).

Parker-Rhodes gives the rationale, and some of the distinctions between the two techniques, as follows:

"It has been assumed that statistical methods could be applied to the data in such a way as to reveal any objectively existing classes which may be there. The general

1/ Pp. 94-97 of this report.
2/ System Development Corporation, 1962 [590], p. 15.