MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Assignment Indexing Techniques
chapter
Mary Elizabeth Stevens
National Bureau of Standards
In these additional experiments, 27 articles in the nuclear physics subject area were
included in a corpus of 100 articles, the remainder covering a variety of topics. Fre-
quency counts of word occurrences for the physics material were obtained and the 12 most
frequent words that were judged to be discriminatory for the subject were selected. The
hypothesis was then tested, that if any document pertained to nuclear physics it would
contain at least two of these words. Retrieval was achieved for 25 of the 27 documents
and the two "irrelevant" documents also retrieved did include information at least
peripherally related to the subject. It was thus evident that the retrieval effectiveness of
automatic recognition of nuclear physics subject material in the general collection was
considerably greater than the average effectiveness of retrieving responses to the highly
specific search questions in nuclear physics that had been used in the full text searching
experiments (Swanson, 1961 [586]).
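The two-word test described in this paragraph can be sketched as follows. This is only a minimal illustration, not Swanson's program: the report does not list the 12 words actually selected, so the clue words, the tokenizer, and the function names below are hypothetical, and "at least two of these words" is read here as at least two distinct words.

```python
import re

# Hypothetical clue words; the report does not name the 12 words actually chosen.
CLUE_WORDS = {"neutron", "proton", "nucleus", "isotope", "fission", "scattering"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def is_on_subject(text, clue_words=CLUE_WORDS, threshold=2):
    """Return True if the text contains at least `threshold` distinct clue words."""
    found = set(tokenize(text)) & clue_words
    return len(found) >= threshold


def recall(relevant_retrieved, relevant_total):
    """Fraction of the relevant documents that were retrieved."""
    return relevant_retrieved / relevant_total
```

With the figures reported above, `recall(25, 27)` comes to roughly 0.93 for the nuclear physics articles.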
This second set of experiments provided a transition from the full text searching
work, which, if it can be considered indexing at all, is obviously derivative indexing, to
work in the application of an automatic assignment indexing method to 1,200 newspaper
clippings (Swanson, 1962 [584], 1963 [580]). These were brief news items for which
machine-readable texts in the form of punched paper tape were available. Thesaurus-
groups of words likely to be associated with each of 20 to 24 subject headings were first
compiled on the basis of human analysis of 1,000 or more representative items. These
word groups were further screened so that no word appeared in more than one group and
so that each word retained should be uniquely indicative of the particular subject
category. In the machine assignment procedure, subsequently, if a word occurs that
belongs to a particular thesaurus group, the corresponding subject heading is assigned
to the item in which that word occurs.
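The assignment procedure just described can be sketched as below, under stated assumptions. The subject headings and word groups are invented for illustration; the report does not reproduce Swanson's lists. The mechanical screening step covers only the rule that no word may appear in more than one group; the human judgment that each retained word be uniquely indicative of its category is not modeled.

```python
import re
from collections import Counter

# Hypothetical thesaurus groups, keyed by subject heading.
RAW_GROUPS = {
    "SPORTS": {"baseball", "pitcher", "strike"},
    "LABOR": {"union", "strike", "wages"},
    "WEATHER": {"storm", "rainfall"},
}


def screen_groups(groups):
    """Drop every word that appears in more than one thesaurus group."""
    counts = Counter(word for words in groups.values() for word in set(words))
    return {
        heading: {word for word in words if counts[word] == 1}
        for heading, words in groups.items()
    }


def build_index(groups):
    """Invert the screened groups into a word -> heading lookup table."""
    return {word: heading for heading, words in groups.items() for word in words}


def assign_headings(text, index):
    """Assign every heading whose group contributes at least one word to the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {index[token] for token in tokens if token in index}


INDEX = build_index(screen_groups(RAW_GROUPS))
```

In this toy data, "strike" is dropped from both SPORTS and LABOR during screening, so an item containing only that word receives no heading, while an item mentioning "pitcher" and "storm" receives both SPORTS and WEATHER.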
Results achieved with this technique appear to be highly promising, at least for this
type of material. Swanson reports as follows:
"Approximately 1,200 brief news items were classified into 20 nonhierarchical
subject categories, both by a human and a machine procedure. Each item was
assigned on the average to about four categories. The results of the two
processes were compared. With the human process as a standard, the machine
missed only seven percent of the correct subject assignments and made a number
of irrelevant assignments equal to about 17 percent of the total. Nearly 40 per-
cent of the automatic subject assignments judged finally to be correct were
missed by the human catalogers."
While this accomplishment is actually due to the extensive human effort in compiling,
organizing, and pruning the uniquely indicative word lists, it is pointed out that this
intellectual effort and the programming tasks need to be done only "once and for all".
It is further pointed out that garbles or misspellings in the input text do not appear to
affect the procedure, there being enough redundancy in the messages so that even if one or
two clue words are missed, others will be present. 3/
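The comparison between machine and human assignments quoted above can be sketched as a pair of set differences. This is a hedged illustration only: the sample data is invented, the function name is hypothetical, and the choice of denominators is an assumption, since the quotation does not say precisely what "the total" refers to. Here missed assignments are taken as a fraction of the human (standard) assignments and irrelevant assignments as a fraction of the machine assignments.

```python
def compare_assignments(human, machine):
    """Compare two sets of (item_id, heading) pairs, with `human` as the standard.

    Assumed denominators: misses are divided by the number of human assignments,
    irrelevant assignments by the number of machine assignments.
    """
    missed = human - machine      # in the standard, not assigned by machine
    irrelevant = machine - human  # assigned by machine, not in the standard
    return {
        "miss_rate": len(missed) / len(human),
        "irrelevant_rate": len(irrelevant) / len(machine),
    }


# Invented sample data: (item number, subject heading) pairs.
human = {(1, "LABOR"), (1, "POLITICS"), (2, "SPORTS"), (3, "WEATHER")}
machine = {(1, "LABOR"), (2, "SPORTS"), (3, "WEATHER"), (3, "SPORTS")}
rates = compare_assignments(human, machine)
```

Swanson's further observation, that many machine assignments finally judged correct had been missed by the human catalogers, amounts to revising the standard set after review and recomputing the same rates.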
1/ Swanson, 1962 [584], p. 468.
2/ Ibid., p. 469.
3/ Swanson, 1963 [580], p. 5.