MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Assignment Indexing Techniques
chapter
Mary Elizabeth Stevens
National Bureau of Standards
". . . The index words selected by Maron are decidedly specific to the documents
from which they were derived and are of less generality than the frequency based
terms. The Bayesian formula coupled with the Maron words correctly predicted
the classification of 79.6% of the documents in Group I [`source items'] but only
45.5% of the documents in Group II [`test items']. The coupling of the Bayesian
formula with the Borko words resulted in a slight decrease in the percentage of
Group I documents whose classification was correctly predicted (62.7%) but
increased the percentage of correct prediction for Group II documents to 58.0%." 1/
Other findings from the later experiments indicated that despite the differences in
the two word-sets, the factor categories derived from them were very similar. It was
also found that, at least for the source items (Group I), the two machine techniques and
the manual process classified 56.1 percent of the items into the same categories. It
should be noted, however, that in the case of the automatic assignment methods: "Eleven
documents contained no clue words and could not be automatically classified by either
system." 2/
4.4 Williams' Discriminant Analysis Method
The work of Williams in automatic assignment indexing, reported in the fall of
1963 [642], has also involved tests on abstracts of the computer literature, directly
comparable to but not necessarily identical with those used by Maron and by Borko and
Bernick. This work at IBM's Federal Systems Division, Bethesda is based in part on
earlier work by Meadow which involved computer studies of matching functions for
document word lists and category word lists for test items drawn from such fields as
psychology, law, computer abstracts, and news items. 3/ What has subsequently been
developed is termed a "discriminant" method which begins with a hierarchical
classification structure of pre-established subject categories and with a small set of sample
documents previously indexed by people into these categories. Frequency counts of words
in each of the sample documents lead to computations, for each category, of the theoreti-
cally probable frequencies of its most statistically significant words. For new items,
observed word frequencies are compared with the theoretical word-category associations
and a relevance value is computed for the item in terms of each category.
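The general shape of such a procedure can be illustrated with a minimal sketch. It is not Williams' actual computation, which is not given here: the relative-frequency profile, the chi-square-style distance, the `floor` parameter, and all function names below are assumptions introduced for illustration only.

```python
from collections import Counter

def category_profiles(samples):
    """Build a word profile per category from hand-indexed sample documents.

    samples: dict mapping category name -> list of documents (lists of words).
    Returns: dict mapping category name -> {word: relative frequency}.
    """
    profiles = {}
    for category, docs in samples.items():
        counts = Counter(word for doc in docs for word in doc)
        total = sum(counts.values())
        profiles[category] = {w: c / total for w, c in counts.items()}
    return profiles

def relevance(tokens, profile, floor=1e-3):
    """Hypothetical relevance value (an assumption, not Williams' statistic).

    Compares observed word counts in a new item with the counts expected
    under the category profile, and returns the negated chi-square-style
    distance, so that a larger value means a closer fit.
    """
    observed = Counter(tokens)
    n = len(tokens)
    distance = 0.0
    for word in set(observed) | set(profile):
        # Words unseen in the category get a small floor probability
        # so the expected count is never zero.
        expected = n * max(profile.get(word, 0.0), floor)
        distance += (observed[word] - expected) ** 2 / expected
    return -distance

def classify(tokens, profiles):
    """Return the category with the highest relevance, or None for an empty item."""
    if not tokens:
        return None
    return max(profiles, key=lambda cat: relevance(tokens, profiles[cat]))

# Toy data, invented for this illustration.
samples = {
    "programming": [["compiler", "language", "program"],
                    ["program", "code", "compiler"]],
    "circuits": [["transistor", "circuit", "logic"],
                 ["circuit", "diode", "logic"]],
}
profiles = category_profiles(samples)
new_item = ["compiler", "program", "program"]
print(classify(new_item, profiles))
```

In this sketch an item sharing no words with a category receives a very low relevance value for it rather than being left unclassified; how such items should be handled in practice is a separate design decision.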
The corpus selected for experimentation consisted of 400 items from "Computer
Abstracts on Cards". 4/ These had previously been indexed using a classification
structure of 15 major categories, each of which is divided in turn into 10 subcategories.
The experimental sample, however, was so selected as to provide exactly 15 "source"
items and 5 "new" items for each of 5 subdivisions of 4 of these major categories.
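As a consistency check on these figures, the sample design accounts for the full corpus of 400 items, that is, 300 source items and 100 new items:

```latex
\[
4 \text{ categories} \times 5 \text{ subdivisions} \times (15 + 5) \text{ items}
  = 20 \times 20 = 400 \text{ items}
\]
```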
1/ Borko and Bernick, 1963 [78], p. 23.
2/ Ibid., p. 11.
3/ Williams, 1963 [642], cites H. R. Meadow, "Statistical Analysis and Classification
of Documents", IRAD Task No. 0353, FSD IBM, Rockville, Maryland, 1962, but
this is apparently a company-confidential document, containing proprietary
information. Meadow gave an informal report on her work at the Computing Center
seminars, University of Maryland, in March of 1963.
4/ Available on a subscription basis from Cambridge Communications Corporation,
Cambridge, Mass.