MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Automatic Assignment Indexing Techniques chapter Mary Elizabeth Stevens National Bureau of Standards

The index words selected by Maron are decidedly specific to the documents from which they were derived and are of less generality than the frequency based terms. The Bayesian formula coupled with the Maron words correctly predicted the classification of 79.6% of the documents in Group I [`source items'] but only 45.5% of the documents in Group II [`test items']. The coupling of the Bayesian formula with the Borko words resulted in a slight decrease in the percentage of Group I documents whose classification was correctly predicted (62.7%) but increased the percentage of correct prediction for Group II documents to 58.0%." 1/

Other findings from the later experiments indicated that despite the differences in the two word-sets, the factor categories derived from them were very similar. It was also found that, at least for the source items (Group I), the two machine techniques and the manual process classified 56.1 percent of the items into the same categories. It should be noted, however, that in the case of the automatic assignment methods: "Eleven documents contained no clue words and could not be automatically classified by either system." 2/

4.4 Williams' Discriminant Analysis Method

The work of Williams in automatic assignment indexing, reported in the fall of 1963 [642], has also involved tests on abstracts of the computer literature, directly comparable to but not necessarily identical with those used by Maron and by Borko and Bernick. This work at IBM's Federal Systems Division, Bethesda, is based in part on earlier work by Meadow which involved computer studies of matching functions for document word lists and category word lists for test items drawn from such fields as psychology, law, computer abstracts, and news items.
3/ What has subsequently been developed is termed a "discriminant" method which begins with a hierarchical classification structure of pre-established subject categories and with a small set of sample documents previously indexed by people into these categories. Frequency counts of words in each of the sample documents lead to computations, for each category, of the theoretically probable frequencies of its most statistically significant words. For new items, observed word frequencies are compared with the theoretical word-category associations and a relevance value is computed for the item in terms of each category.

The corpus selected for experimentation consisted of 400 items from "Computer Abstracts on Cards". 4/ These had previously been indexed using a classification structure of 15 major categories, each of which is divided in turn into 10 subcategories. The experimental sample, however, was so selected as to provide exactly 15 "source" items and 5 "new" items for each of 5 subdivisions of 4 of these major categories.

1/ Borko and Bernick, 1963 [78], p. 23.
2/ Ibid., p. 11.
3/ Williams, 1963 [642], cites H. R. Meadow, "Statistical Analysis and Classification of Documents", IRAD Task No. 0353, FSD IBM, Rockville, Maryland, 1962, but this is apparently a company-confidential document, containing proprietary information. Meadow gave an informal report on her work at the Computing Center seminars, University of Maryland, in March of 1963.
4/ Available on a subscription basis from Cambridge Communications Corporation, Cambridge, Mass.
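
As an illustration only, the discriminant procedure outlined in Section 4.4 (word counts from pre-indexed sample documents, expected word frequencies for each category, and a relevance value computed for each new item) might be sketched as follows. The chi-square-style score, the smoothing constant, the use of the whole vocabulary rather than only the most statistically significant words, and all function names are assumptions made for this sketch; the report does not give the statistics Williams actually used.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(samples, smoothing=0.5):
    """Build one word-frequency profile per category.

    samples: list of (category, text) pairs, standing in for the
    human-indexed "source" items. Returns {category: {word: expected
    relative frequency}}. The additive smoothing is an assumption that
    keeps every expected frequency above zero.
    """
    counts = {}
    for category, text in samples:
        counts.setdefault(category, Counter()).update(tokenize(text))
    vocab = sorted({word for c in counts.values() for word in c})
    profiles = {}
    for category, c in counts.items():
        total = sum(c.values()) + smoothing * len(vocab)
        profiles[category] = {w: (c[w] + smoothing) / total for w in vocab}
    return profiles


def relevance(text, profile):
    """Score a new item against one category profile.

    Observed counts of known words are compared with the counts the
    profile would predict; the negated chi-square-style sum is returned,
    so a higher value means a closer fit. Returns None when the item
    contains no known word and so cannot be scored.
    """
    observed = Counter(w for w in tokenize(text) if w in profile)
    n = sum(observed.values())
    if n == 0:
        return None
    chi2 = sum((observed[w] - n * p) ** 2 / (n * p) for w, p in profile.items())
    return -chi2


def classify(text, profiles):
    """Return (best category or None, {category: relevance value})."""
    scores = {c: relevance(text, p) for c, p in profiles.items()}
    scores = {c: s for c, s in scores.items() if s is not None}
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores


# Hypothetical miniature corpus, invented for this sketch.
SAMPLES = [
    ("logic", "boolean switching circuit logic design circuit"),
    ("logic", "switching circuit minimization boolean functions"),
    ("programming", "compiler language program translation compiler"),
    ("programming", "program language syntax compiler routine"),
]

if __name__ == "__main__":
    profiles = train(SAMPLES)
    print(classify("a compiler for a new program language", profiles))
```

An item whose words all fall outside the vocabulary is left unclassified in this sketch rather than forced into a category.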