MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Classification and Categorization
chapter
Mary Elizabeth Stevens
National Bureau of Standards
name for the techniques evolved in this way is factor analysis. Insofar as it
is practically applicable this technique has worked well enough; but... it has two
limitations (a) that some classification problems are outside its scope, and
(b) that it is not susceptible (at least as hitherto conceived) of adaptation
computationally to the study of really large universes." 1/
"The procedure of factor analysis first finds certain clumps, but then, as
output, it gives us vectors relating the descriptors of the universe to the
clumps found...
"In most cases, factor analysis is used (especially in psychology) to debug the
descriptor space; more conventionally put, to eliminate those tests (descriptors)
which have an equivocal membership in several factors (Clumps) in favor of
those which, having more definite allegiances, convey more information of the
kind which the analysis suggests as valuable. It is thus only related to the
classification of the universe at one remove; the classification it suggests is a
simple categorical classification defined by the descriptors suggested as the
most valuable...
"The descriptive array of a universe is a table giving the applicability or
inapplicability of each descriptor to each element. To classify the elements
of the universe, we calculate for every pair of elements a similarity as a
function of the corresponding rows of the descriptive array, and then regard
the similarity matrix as a sufficient description of the universe. In factor
analysis, on the contrary, we start with the matrix of correlations between
the descriptors, each being a function of a pair of columns of the descriptive
array..." 2/
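The contrast drawn in the passage above, between similarities computed on rows of the descriptive array and correlations computed on its columns, can be illustrated with a minimal sketch. The array values, the function names, and the choice of the Tanimoto coefficient for rows and the Pearson coefficient for columns are all assumptions made for illustration; the quoted passage specifies only that the similarity is "a function of the corresponding rows" and that factor analysis starts from correlations between descriptors.

```python
from math import sqrt

# Hypothetical descriptive array: rows are elements, columns are descriptors.
# 1 means the descriptor applies to the element, 0 means it does not.
DESCRIPTIVE_ARRAY = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]


def tanimoto(a, b):
    """Assumed row similarity: shared descriptors over descriptors in either row."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def pearson(a, b):
    """Assumed column correlation: Pearson coefficient, 0.0 for a constant vector."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def pairwise(vectors, measure):
    """Square matrix of the measure applied to every pair of vectors."""
    n = len(vectors)
    return [[measure(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def columns(array):
    """Transpose the array so that each descriptor becomes one vector."""
    return [list(col) for col in zip(*array)]


# Classification of elements starts from the element-by-element matrix (rows).
element_similarity = pairwise(DESCRIPTIVE_ARRAY, tanimoto)

# Factor analysis starts from the descriptor-by-descriptor matrix (columns).
descriptor_correlation = pairwise(columns(DESCRIPTIVE_ARRAY), pearson)
```

Both matrices are derived from the same table; the first has one row and column per element, the second one row and column per descriptor, which is the sense in which factor analysis bears on the classification of the universe only "at one remove".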
Other investigators who have considered factor analysis techniques for possible
applications to automatic indexing, automatic categorization of items in a collection of
items, or search prescription renegotiation in a mechanized selection and retrieval
system include Stiles (1962 [573]), Doyle (1963 [162]), and Hammond (1962 [251]).
Stiles, whose principal experimental results relate rather to the use of statistical
associations between terms manually assigned to documents for search prescription
formulation and renegotiation than to automatic indexing procedures as such 3/ has also
considered both automatic indexing and automatic classification approaches. Specifi-
cally, he has made at least preliminary investigations of the factor analysis technique
independently developed for similar purposes by Borko. For a large collection of
105,000 items, the statistics of co-occurrence of indexing terms were in some cases not
as precise as desired because the same terms were used in different senses for different
items in the collection.
1/ Note that Borko himself confirms this limitation as recently as November 1963,
   in stating, of the CLRU work on clumps: "However, even now these techniques
   have been applied to a 346x346 matrix which is beyond the capabilities of presently
   available factor analysis programs." (1963 [76], p. 8).
2/ Parker-Rhodes, 1961, [464], pp. 3-6.
3/ This principal concern is discussed below with reference to potentially
   related research, pp. 119-122 of this report.