MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Automatic Classification and Categorization chapter Mary Elizabeth Stevens National Bureau of Standards

"Evaluation of the results was unexpectedly difficult. The acid test is presumably the efficiency of the retrieval system embodying the grouping given by the program; but the efficiency of retrieval systems cannot be easily measured. An apparently simpler test would be to see if the clumps were intuitively satisfactory, i.e., were groupings that a classifier in his right mind could have made. This also was unsatisfactory because the groups are mostly rather large, larger in fact than classifiers ordinarily make, and were thus very difficult to judge. The test eventually adopted was to group the terms not distinguished by the clump classification, and look at these. Accordingly, for each term, a list of the clumps to which it belongs was prepared, and groups of terms were found which had all their clumps in common. These groups were quite small (2-6 terms) and could be studied easily. It turned out that some groups were ones of which a human classifier could have thought (e.g., words concerning suffix removal for machine translation came together) while others were quite justified by the documents concerned, but would never have been thought of a priori. For example, the group: 'phrase marker, phoneme, Markov process, terminal language' was entirely justified by the ... contents of the library. It is groups of the latter kind that represent a success for clump theory, for they function usefully in retrieval but in no way form part of the structure of thought ... which the human classifier's work is likely to reflect." 1/

Still another application of the theory of clumps may be of use in the construction of thesauri (Sparck-Jones, 1962 [564]).
Here the assumption is that rows of a correlation matrix can be formed for words giving other words which are synonymous with respect to meaning. The overlaps of the same word's occurrence in two or more rows can then be used to find clumps which are presumed to represent conceptual groupings.

Applications of clump theory to problems of mechanized documentation are also being investigated by Dale and Dale of the Linguistics Research Center, the University of Texas. 2/ They have begun experimentation to derive clumps for the 90 clue words used by Borko and the 260 source-item computer abstracts used by both Maron and Borko. Preliminary results reported so far are principally limited to considerations of the associative networks between terms as derived from the structure of the clumps discovered by several clump definitions.

Mention should also be made of the work of Meetham and Vaswani at the National Physical Laboratory, Teddington, England, looking toward the use of similar techniques for machine-generated index vocabularies, with preliminary emphasis on testing them against a "library" consisting of the propositions of Euclid's geometry. 3/

1/ Needham, 1963 [431], p. 285-286.
2/ Dale and Dale, an unpublished report dated February 1964, [147].
3/ National Science Foundation's CR&D report No. 11, [430], p. 137; and Meetham, 1963 [413].
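
The evaluation test quoted from Needham above (list the clumps to which each term belongs, then collect the terms that have all their clumps in common) can be illustrated with a minimal sketch. This is not Needham's program: the function name `indistinguishable_groups`, the clump labels "A" to "C", and the clump memberships are hypothetical, invented only to show the grouping step.

```python
from collections import defaultdict


def indistinguishable_groups(clumps):
    """Return groups of terms that belong to exactly the same set of clumps.

    `clumps` maps a clump label to the set of terms it contains. The name
    and the input format are assumptions made for this illustration.
    """
    # Step 1: for each term, prepare the list of clumps to which it belongs.
    membership = defaultdict(set)
    for clump_id, terms in clumps.items():
        for term in terms:
            membership[term].add(clump_id)

    # Step 2: bucket terms by their complete clump signature, so that terms
    # with all their clumps in common fall into the same bucket.
    by_signature = defaultdict(list)
    for term, clump_ids in membership.items():
        by_signature[frozenset(clump_ids)].append(term)

    # Only buckets with two or more terms are groups the clump
    # classification fails to distinguish.
    return sorted(sorted(group) for group in by_signature.values() if len(group) > 1)


# Hypothetical toy data: the memberships below are invented, not taken
# from the library Needham describes.
clumps = {
    "A": {"phrase marker", "phoneme", "Markov process", "suffix"},
    "B": {"phrase marker", "phoneme", "Markov process", "stem"},
    "C": {"suffix", "stem", "dictionary"},
}
groups = indistinguishable_groups(clumps)
print(groups)
```

In this toy input, "phrase marker", "phoneme", and "Markov process" each occur in clumps A and B and in no others, so they form one group; every other term has a clump signature of its own and is dropped.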