MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Classification and Categorization
chapter
Mary Elizabeth Stevens
National Bureau of Standards
In early work on clump definition, Kuhns of Ramo-Wooldridge 1/ proposed the use
of a threshold value such that if a subset is a clump every pair of members in it has a
connection strength equal to or greater than the threshold value and no member of the
subset's complement has connections of more than threshold value to the members of the
subset. In the more extensive investigations carried out by Parker-Rhodes and Needham
(1960 [465], 1961 [434, 435, 464]), other clump definitions have been explored and
specifically that of the "GR-Clump". This is defined as a subset of the universe such
that all its members have a positive (or zero) bias to the subset and all non-members
have a negative bias to it, where bias is defined as the excess (positive or negative) of the
total connections of a member of the population to the members of the subset over its
total connections to the members of the subset's complement, following the convention
that the connection of the element to itself is taken as zero.
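The bias and the GR-clump condition just defined can be sketched in a few lines of Python. This is only an illustrative sketch: the names `bias` and `is_gr_clump`, the list-of-lists connection matrix `conn`, and the small four-element example are assumptions made here and do not come from the reports cited.

```python
def bias(conn, subset, x):
    """Excess of x's total connections to `subset` over its total
    connections to the complement; the self-connection counts as zero."""
    inside = sum(conn[x][y] for y in range(len(conn)) if y != x and y in subset)
    outside = sum(conn[x][y] for y in range(len(conn)) if y != x and y not in subset)
    return inside - outside


def is_gr_clump(conn, subset):
    """True if every member has a positive or zero bias to `subset`
    and every non-member has a negative bias to it."""
    return all((bias(conn, subset, x) >= 0) == (x in subset)
               for x in range(len(conn)))


# Hypothetical symmetric connection matrix: elements 0 and 1 are strongly
# connected, as are 2 and 3, with weak connections between the two pairs.
conn = [
    [0, 3, 1, 0],
    [3, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
]

print(is_gr_clump(conn, {0, 1}))  # members and non-members all satisfy the condition
print(is_gr_clump(conn, {0, 2}))  # element 0 is a member with negative bias
```

In this example the subset {0, 1} satisfies the definition, while {0, 2} does not, because element 0 is more strongly connected to the complement than to the subset.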
An iterative procedure for discovering GR-clumps can now be followed. This is
based on an arbitrary initial partition of the given universe of elements into a subset and
its complement. Then, since each element has a bias toward both the subset and its
complement, differing only in sign, the biases of each element are computed. If the bias
of a particular element is positive with respect to the subset, it is transferred to the sub-
set if it is not already a member of it, and conversely if its bias is negative, it is trans-
ferred to the subset's complement if it is not already there. Each time a transfer is
made, the biases are recomputed and the process is repeated until for a complete scan of
all elements no further transfers can be made. The result is a GR-clump even though it
may have no members or may contain all the elements of the universe. In such a case, a
new initial partition is made and the procedure is re-applied.
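The iterative procedure described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the program used in the investigations cited: the function name `find_gr_clump`, the `max_scans` safety limit, and the example matrix are hypothetical, the connection matrix is assumed to be symmetric, and an element with exactly zero bias is treated as belonging to the subset, in line with the "positive (or zero)" wording of the definition.

```python
def find_gr_clump(conn, initial_subset, max_scans=1000):
    """Transfer elements between a subset and its complement according to
    the sign of their bias until a complete scan makes no transfer."""
    n = len(conn)
    subset = set(initial_subset)

    def bias(x):
        # Connections to the subset minus connections to the complement,
        # with the self-connection taken as zero. Computed from the current
        # subset, so biases are effectively recomputed after every transfer.
        inside = sum(conn[x][y] for y in range(n) if y != x and y in subset)
        outside = sum(conn[x][y] for y in range(n) if y != x and y not in subset)
        return inside - outside

    for _ in range(max_scans):  # hypothetical guard for non-symmetric input
        transferred = False
        for x in range(n):
            b = bias(x)
            if b >= 0 and x not in subset:   # assumption: zero bias joins the subset
                subset.add(x)
                transferred = True
            elif b < 0 and x in subset:
                subset.remove(x)
                transferred = True
        if not transferred:
            return subset
    raise RuntimeError("no stable partition reached within max_scans")


# Hypothetical symmetric connection matrix with two strongly connected pairs.
conn = [
    [0, 3, 1, 0],
    [3, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
]

print(find_gr_clump(conn, {0, 1, 2}))  # settles on the pair {0, 1}
print(find_gr_clump(conn, {0, 2}))     # degenerate result: the empty set
```

The second call illustrates the degenerate outcome mentioned in the text: starting from {0, 2}, every element is transferred to the complement, so a different initial partition would have to be tried.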
These GR-clump finding procedures have been applied to such diverse collections
of items to be classified as archaeological artefacts and patients' symptoms as related
to specific disease diagnosis. In the latter case, groupings were obtained that corre-
sponded satisfactorily to certain specific disease syndromes, but no group was found
corresponding to Hodgkin's disease where a great variety of symptoms typically occur.
Needham comments: "I can scarcely conceive of a clump definition that would be likely
to group these patients; I am unsure whether this is a reflection on clump theory or on
Hodgkin's disease." 2/
In applications more directly related to documentation, some investigations have
been made of the use of co-occurrence coefficients of index terms assigned to documents
in order to form a connection matrix from which clumps were then derived (Needham,
1963 [431]). These experiments covered 342 terms occurring more than once in the
index-term sets assigned to several hundred documents in the general subject field of
machine translation. Computation of the matrix required 20 minutes of computer time
and the 40 clumps found took 6-8 minutes each to find. Needham reports on the results
as follows:
1/ See Kuhns, 1959 [336] and Needham, 1961 [435], pp. 20-21.
2/ Needham, 1961 [435], p. 46.