MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Classification and Categorization
chapter
Mary Elizabeth Stevens
National Bureau of Standards
theoretical aspects of automatic indexing and sentence extraction involving co-occurrences
of words. While Tanimoto's studies with respect to linguistic information processing for
classification purposes have apparently been limited to the theoretical considerations,
similar concepts of probabilistic, computational, and matrix manipulative operations to
derive and use coefficients of correlation of associations between such attributes as words
occurring in text or the index terms assigned to documents are involved in the factor
analysis and theory of clumps techniques as applied in actual experiments in documentary
classification.
5.1 Factor Analysis
The factor analysis technique, which seeks to derive from word associations in
representative documents an automatically generated classification schedule for use in
actual indexing experiments, has previously been mentioned. 1/ Reasons suggested for its
use in research at SDC have been reported as follows:
"The development of automatic procedures for purposes of classification and
abstracting requires the identification and specification of attributes of words or
passages so that the relevancy of topics or content can be determined.
Automatic procedures to detect such attributes may be based on a number of
characteristics of the text: word frequencies, syntactical information, semantic
information and pragmatic contextual clues. Currently, word frequency
information can be generated and manipulated by automatic procedures, whereas the
other attributes are not as readily handled this way. However, a correlation
matrix of content words becomes very unwieldy because of its size and the
complexity of relationships. For this reason, factor analysis is used to identify
clusters of relationships. Current work concentrates primarily on determining
the usefulness of factors identified in this way as classification and indexing
schemes." 2/
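The sequence described in the passage quoted above (word frequencies, then a correlation matrix of content words, then extraction of factors that group related words) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy abstracts, the vocabulary, and the function names are hypothetical, and power iteration on the correlation matrix is used as a simple stand-in for a principal-component style of factor extraction. It is not the procedure used at SDC, which this section does not specify.

```python
import math

# Hypothetical toy "abstracts" and content-word vocabulary (not from the report).
DOCS = [
    "computer program memory",
    "computer program",
    "program memory computer",
    "learning anxiety behavior",
    "anxiety behavior",
    "behavior learning anxiety",
]
VOCAB = ["computer", "program", "learning", "anxiety"]


def term_counts(docs, vocab):
    """Return one row per vocabulary term and one column per document."""
    tokenized = [doc.lower().split() for doc in docs]
    return [[tokens.count(term) for tokens in tokenized] for term in vocab]


def pearson(x, y):
    """Pearson correlation of two equal-length frequency vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a term with constant frequency carries no correlation
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(rows):
    """Term-by-term correlation matrix; it grows as len(rows) squared."""
    return [[pearson(r, s) for s in rows] for r in rows]


def leading_factor(matrix, iterations=200):
    """Approximate the dominant eigenpair of the matrix by power iteration.

    Returns the eigenvalue and the loadings (eigenvector scaled by the
    square root of the eigenvalue). The overall sign is arbitrary.
    """
    n = len(matrix)
    v = [1.0 + 0.1 * i for i in range(n)]  # fixed, deterministic start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * mv[i] for i in range(n))  # Rayleigh quotient
    loadings = [x * math.sqrt(eigenvalue) for x in v]
    return eigenvalue, loadings


rows = term_counts(DOCS, VOCAB)
corr = correlation_matrix(rows)
eigenvalue, loadings = leading_factor(corr)
for term, loading in zip(VOCAB, loadings):
    print(f"{term:10s} {loading:+.3f}")
```

On this toy data the words drawn from the same group of abstracts receive loadings of the same sign on the leading factor, and the two groups receive opposite signs, which is the kind of cluster of relationships the quoted passage refers to. A real vocabulary would yield a far larger matrix and several factors, which is the unwieldiness the passage mentions.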
As noted above, Borko and Bernick (1961 [73], 1962 [77], 1963 [78]) have applied
this technique to abstracts drawn from psychological literature and to the same computer
literature abstracts as had been used by Maron (1961 [395]). This technique had also
been investigated in the studies looking toward information retrieval classification and
grouping undertaken at the Cambridge Language Research Unit from about 1957 onward.
However, certain apparent limitations of the factor analysis approach led Parker-Rhodes
and Needham to the alternative of the "theory of clumps" (1960 [465], 1961 [435, 464]).
Parker-Rhodes gives the rationale, and some of the distinctions between the two tech-
niques, as follows:
"It has been assumed that statistical methods could be applied to the data in such
a way as to reveal any objectively existing classes which may be there. The general
1/ Pp. 94-97 of this report.
2/ System Development Corporation, 1962 [590], p. 15.