MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Classification and Categorization
chapter
Mary Elizabeth Stevens
National Bureau of Standards
dichotomy can be observed. There is, on the one hand, a spate of examples of automatic
derivative indexing where words used by the author himself or by human analysis are
sorted and arranged, by machine, to provide index listings, announcement bulletins, and
current awareness distribution notices. There are also, on the other hand, at least a
few instances of investigations where the machine assigns category labels, indexing
terms, or "heads" and "headings" from a classification schedule, to new items.
In general, as Needham 1/ points out, proposed automatic assignment indexing procedures
can be investigated with reference to a previously existing index term vocabulary,
an existing classification system or schedule, or to specially designed vocabularies and
subject heading lists. On the other hand, if it is not known how well existing systems do
in fact characterize documents and if it is not known whether all pertinent properties of
the documents have been consistently identified, then it may be preferable to develop
methods for assigning documents to the appropriate class in a classification system which
is itself set up automatically. 2/ Needham also suggests still a third possibility: that of
setting up automatically a classification within which the subsequent classifying of
documents is done by hand.
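The two requirements Needham names for automatic assignment (a list of facts about each class, and a matching algorithm that tells which class best suits a document) can be sketched as follows. This is only a minimal illustration in Python, assuming that the "facts" are sets of characteristic terms and that the match score is the number of terms shared with the document. The class labels, the term lists, and the function name assign_class are hypothetical and are not taken from Needham or from any existing classification schedule.

```python
def assign_class(doc_terms, class_profiles):
    """Return (label, score) for the class sharing the most terms with the document.

    Assumption: each class is described only by a set of characteristic terms.
    Returns (None, 0) when no class shares any term with the document.
    Ties go to the alphabetically first label.
    """
    doc = set(doc_terms)
    best_label, best_score = None, 0
    for label in sorted(class_profiles):
        score = len(doc & set(class_profiles[label]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical class profiles, invented for illustration only.
CLASS_PROFILES = {
    "computing": {"machine", "program", "algorithm", "storage"},
    "linguistics": {"syntax", "grammar", "word", "translation"},
}

label, score = assign_class(["machine", "translation", "algorithm"], CLASS_PROFILES)
print(label, score)
```

A scheme of this kind presupposes that such term lists can actually be drawn up for every class, which is the uniformity that Needham doubts in the case of a schedule like the U.D.C.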
The principal experimental results, to date, of attempts to achieve automatic
classification of documentary items, especially in the sense of machine-generated
groupings or categorizations of such items, have been those of applying techniques of
"clumping", 3/ factor analysis, and "latent class analysis". 4/ We shall briefly consider
below some typical investigations into automatic classification or categorization procedures
that have already had, or may have, applicability in automatic indexing techniques.
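The general idea of machine-generated groupings, in which items fall together because they share attributes not singled out in advance, can be illustrated with a small sketch. The Python below is a simple stand-in (single-link grouping over a shared-attribute threshold) and is not an implementation of clump theory, factor analysis, or latent class analysis. The item names, the attribute sets, the threshold parameter min_shared, and the function name shared_attribute_groups are all hypothetical.

```python
from itertools import combinations


def shared_attribute_groups(items, min_shared=2):
    """Group items that are linked, directly or through a chain of other items,
    by at least `min_shared` common attributes."""
    parent = {name: name for name in items}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sorted(items), 2):
        if len(items[a] & items[b]) >= min_shared:
            parent[find(b)] = find(a)  # merge the two groups

    groups = {}
    for name in sorted(items):
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical items and attributes, invented for illustration only.
ITEMS = {
    "doc1": {"index", "machine", "term"},
    "doc2": {"index", "term", "vocabulary"},
    "doc3": {"factor", "matrix", "analysis"},
    "doc4": {"matrix", "analysis", "cluster"},
}

print(shared_attribute_groups(ITEMS))
```

No attribute is designated beforehand as the basis for a group; the groupings follow only from which attributes the items happen to share.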
In the late 1950's, Tanimoto undertook theoretical studies of mathematical
approaches to problems of classification and prediction with special reference to matrix
manipulations of sets of attributes of items to be classified. 5/ He also investigated
1/ Needham, 1963 [432], p. 1.
2/ Ibid., pp. 1-2: "If we are to assign a document to a class automatically, we must
have a) a list of facts about the classes which will make ascription possible:
b) an algorithm, usually some sort of matching algorithm, to tell us which class
best suits a document. Given a classification like the U.D.C., it is not at all
obvious that a) and b) exist, or even, if they can be found. a) and b) imply a degree
of uniformity about the classification which may just not be there."
3/ That is, the clustering of objects that are in some sense similar because they
share certain attributes or properties, even if, and especially when, the identity
of cluster-producing common properties is not known in advance.
4/ Compare Doyle, 1963 [162], p. 13: "There are other statistical techniques besides
factor analysis whose output is document clusters, such as latent class analysis
and clump theory, and there is a surprising increase in research in this kind of
analysis just within the last two years."
5/ Tanimoto, 1958 [593], 1961 [594]. See also Borko, 1963 [76], pp. 4-5: "In
1958, Tanimoto published a theoretical paper on the applications of mathematics to
the problems of classification and prediction. Specifically, he pointed out how the
problems of classification can be formulated in terms of sets of attributes and
manipulated as matrix functions."