MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Classification and Categorization
chapter
Mary Elizabeth Stevens
National Bureau of Standards
The possibilities of using factor analysis to sort out the different meanings were
therefore explored. 1/ Using an IBM 704 program, the centroid method of factor analysis
was applied to a matrix of correlation coefficients of terms that had co-occurred signifi-
cantly with the term "exposure". Three factors were derived, one generally relating to
the corrosive effects of exposure, another to "exposure" in the sense of photographic
exposure, and the third dealing with both exposure-to-weather and exposure-to-radiation.
Although the results were considered quite satisfactory, more extensive experimentation
and use did not appear feasible because of computer matrix manipulation limitations.
Doyle notes, in particular, that factor analysis might be used to give well-defined
clusters separated one from another by clear boundaries rather than the less precise
clusters found by most document grouping techniques. He emphasizes, however, that
"its success in doing so of course, depends on the well-defined clusters actually being
present in the data". 2/ He suggests that a combination of factor analysis and human
editing to select items most typical of statistically derived categories could be valuable
in such applications as the sorting of Congressional mail or the identification of trends
in political or military intelligence materials free from the personal biases of an analyst.
Hammond and his Datatrol associates, who have worked on applying the Stiles
association factor technique for search question negotiation to legal literature, have
also considered the potentialities of factor analysis. Thus they report:
"The present association factor gives the relationship of one term to another.
A factor analysis study would allow us to determine the relationship of a single
term to a group of terms. From this we could learn how terms cluster when
related to the same concept." 3/
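As an illustration of the kind of computation involved, the following is a minimal sketch of a textbook centroid extraction applied to a small term correlation matrix. It is not the IBM 704 program referred to above. The function name `centroid_factors`, the use of unit diagonals in place of communality estimates, the sign-reflection rule, and the four-term matrix are all assumptions made for the example.

```python
import numpy as np

def centroid_factors(R, n_factors, tol=1e-12):
    """Extract factor loadings from a correlation matrix R by a
    centroid-style procedure (illustrative sketch only)."""
    residual = np.array(R, dtype=float)
    n = residual.shape[0]
    loadings = np.zeros((n, n_factors))
    for k in range(n_factors):
        # Reflect (sign-flip) variables until no column has a negative
        # off-diagonal sum; each flip increases the grand total.
        signs = np.ones(n)
        for _ in range(100 * n):
            reflected = residual * np.outer(signs, signs)
            off_diag_sums = reflected.sum(axis=0) - np.diag(reflected)
            worst = int(np.argmin(off_diag_sums))
            if off_diag_sums[worst] >= -tol:
                break
            signs[worst] = -signs[worst]
        reflected = residual * np.outer(signs, signs)
        total = reflected.sum()
        if total <= tol:
            break  # nothing left to extract
        # Centroid loadings: column sums divided by sqrt of grand total,
        # with the reflections undone.
        a = signs * reflected.sum(axis=0) / np.sqrt(total)
        loadings[:, k] = a
        residual = residual - np.outer(a, a)
    return loadings

# Hypothetical correlations among four terms: terms 0 and 1 co-occur,
# terms 2 and 3 co-occur, and the two pairs are unrelated.
R = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
])
loadings = centroid_factors(R, 2)
print(loadings)
```

In this toy case the second factor loads the two pairs of terms with opposite signs, which is the sense in which factor loadings can separate clusters of terms. Separation of this kind depends on the clusters actually being present in the data.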
5.2 The Theory of Clumps
It is assumed, in the work on the theory of clumps, that we have a population of
objects or items among which at least some classes or groupings do objectively exist,
but that we do not have any bases for precisely determining class membership require-
ments. There may, therefore, be many possible ways of grouping and many possible
definitions of clumps. On the other hand, such diverse definitions must conform to the
extent of some similarities of membership in the clumps that they define if in fact they
do define any of the existing classes. Assuming further that we are given information
about properties ascribable to various members of the population, it is theorized that
useful clumps can be discovered by investigating similarity connections between pairs
of items, such as the number of co-occurrences of specific properties. Thereafter, only
these similarity connections are considered, and the connection matrix is used as the
basis for trial partitions of the population into various possible subsets.
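The general idea can be sketched in a few lines, under stated assumptions. The sketch below builds a connection matrix by counting the properties shared by each pair of items and then adjusts a trial partition by moving each item to the side with which it has the stronger total connection. That criterion is only one possible clump definition, chosen here for illustration. The function names `connection_matrix` and `refine_partition`, the tie rule, and the six-item data are hypothetical.

```python
import numpy as np

def connection_matrix(item_properties):
    """Similarity connections: number of properties shared by each
    pair of items (diagonal set to zero)."""
    X = np.asarray(item_properties, dtype=int)
    C = X @ X.T
    np.fill_diagonal(C, 0)
    return C

def refine_partition(C, in_subset, max_rounds=100):
    """Adjust a trial partition so that each item sits on the side
    (subset or complement) to which it is more strongly connected.
    Ties leave the item where it is (an assumption of this sketch)."""
    in_subset = np.array(in_subset, dtype=bool)
    for _ in range(max_rounds):
        moved = False
        for i in range(len(in_subset)):
            to_inside = C[i, in_subset].sum()
            to_outside = C[i, ~in_subset].sum()
            if to_inside > to_outside and not in_subset[i]:
                in_subset[i] = True
                moved = True
            elif to_inside < to_outside and in_subset[i]:
                in_subset[i] = False
                moved = True
        if not moved:
            break
    return in_subset

# Hypothetical data: rows are items, columns are properties (1 = present).
items = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
C = connection_matrix(items)
# Trial partition: items 0, 1 and 3 on one side, the rest on the other.
clump = refine_partition(C, [True, True, False, True, False, False])
print(clump)
```

Only the connection matrix is consulted once it has been built. The original property data play no further part, which matches the assumption that only the similarity connections are considered.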
1/ Stiles, 1962 [573], pp. 10-12.
2/ Doyle, 1963 [162], p. 12.
3/ Hammond, et al., 1962 [251], p. 17.