MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Assignment Indexing Techniques
chapter
Mary Elizabeth Stevens
National Bureau of Standards
the theoretical probabilities of word occurrence by category or of discrimination
coefficients and thresholds. Instead, the technique involves ad hoc statistical associations
between the words occurring in the title and in the abstract of a sample item and the
descriptors previously assigned to that item. A master selection-word vocabulary is
thus built up where each word is listed in terms of the frequencies of its co-occurrence
with each of the descriptors with which it has co-occurred, regardless of whether or not
such prior associations are either relevant or significant. No attempt has as yet been
made to "purge" the resulting association lists. Instead, reliance is placed on the
patterns of multiple word usage and of redundancy of words used in titles and cited titles
of new items to minimize the effects of irrelevant or accidental prior word-descriptor
associations and to enhance the significant ones.
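
The construction of the master selection-word vocabulary described above can be sketched as follows. This is a minimal illustration, not the original program: the function name `build_master_vocabulary`, the whitespace tokenization, the choice to count each distinct word once per sample item, and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_master_vocabulary(sample_items):
    """Map each word to the descriptors it has co-occurred with, with counts.

    sample_items: iterable of (title_and_abstract_text, assigned_descriptors).
    No associations are purged, so accidental pairings are kept as well.
    """
    vocabulary = defaultdict(Counter)
    for text, descriptors in sample_items:
        # Assumption: each distinct word counts once per sample item.
        for word in set(text.lower().split()):
            for descriptor in descriptors:
                vocabulary[word][descriptor] += 1
    return vocabulary


# Hypothetical teaching sample: (title plus abstract, prior descriptors).
sample = [
    ("automatic indexing of documents", ["INDEXING", "DOCUMENTATION"]),
    ("automatic pattern recognition", ["PATTERN RECOGNITION"]),
]
vocab = build_master_vocabulary(sample)
```

Here the word "automatic" ends up associated with all three descriptors, whether or not each association is meaningful, which is the unpurged behavior the text describes.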
The SADSACT method (for "Self Assigned Descriptors from Self and Cited Titles")
proceeds with the assumption, which it shares with the arguments for citation indexing
previously discussed, that the literature references cited by an author are indicative of
the subject content or contents of his paper. 1/ For the automatic indexing of new items,
their titles and the titles of up to ten bibliographic references cited are keystroked, con-
verted to punched cards, and fed to the computer. This input material is run against the
master vocabulary to obtain for each input word which matches a vocabulary word a
"descriptor-selection score" for each of the descriptors previously associated with that
word. These scores are summed up for all words and at an appropriate cutting level
those descriptors having the highest scores are assigned to the new item.
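
The scoring step just described can be sketched as below. The text does not say how a "descriptor-selection score" is derived from the stored frequencies, so this sketch assumes the raw co-occurrence frequency is used as the score; the function name `assign_descriptors`, the `cutoff` parameter, and the vocabulary entries are likewise hypothetical.

```python
from collections import Counter


def assign_descriptors(titles, vocabulary, cutoff=3):
    """Sum per-word descriptor scores over all titles and keep the top few.

    titles: the item's own title plus the titles of its cited references.
    vocabulary: word -> {descriptor: score}; raw frequency is assumed here.
    cutoff: the "cutting level", taken here as a fixed number of descriptors.
    """
    scores = Counter()
    for title in titles:
        for word in title.lower().split():
            # Words absent from the master vocabulary contribute nothing.
            for descriptor, score in vocabulary.get(word, {}).items():
                scores[descriptor] += score
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [descriptor for descriptor, _ in ranked[:cutoff]]


# Hypothetical master vocabulary and input titles.
vocabulary = {
    "automatic": {"INDEXING": 3, "COMPUTERS": 4},
    "indexing": {"INDEXING": 5, "DOCUMENTATION": 2},
    "retrieval": {"INFORMATION RETRIEVAL": 6, "DOCUMENTATION": 3},
}
titles = ["Automatic indexing", "Indexing and retrieval"]
assigned = assign_descriptors(titles, vocabulary, cutoff=2)
# assigned == ["INDEXING", "DOCUMENTATION"]
```

Because "indexing" appears in two titles, its associations are counted twice. This illustrates how redundancy across the title and cited titles can reinforce a descriptor while a single accidental association stays below the cutting level.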
Preliminary results based on the titles and cited titles of items that were "source
items" in the sense that their titles and abstracts had been used in the teaching sample
were reported at the NATO Advanced Study Institute on Automatic Document Analysis
held in Venice in July, 1963. For 30 items drawn from such subject fields as computer
technology, information selection and retrieval, mathematical logic, pattern recognition,
and operations research, all of which had previously been indexed by ASTIA personnel in
1960, the machine assigned 64.8 percent of the descriptors previously assigned. Sub-
sequent tests on genuinely new items, however, resulted in a drop to only 48.2 percent
"hit" accuracy.
These "new" item results were also evaluated by having several representative
users of the collection analyze the test items and assign descriptors to them from a list
of the descriptors available to the machine. The extent to which the descriptors assigned
by machine were also independently chosen by one or more of these indexers was then
checked. In general, the fewer descriptors assigned by the machine, the better was the
human agreement, ranging from 47.4 percent overall in the case where the machine had
assigned twelve descriptors to each item to 76 percent agreement where the machine assigned
only one. In particular, for ten items which were analyzed by five different indexers,
the chances that one or more would also select the machine's first choice (highest scoring)
descriptor averaged 90 percent.
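
The two evaluation measures reported above can be sketched as follows. The exact computations used in the tests are not given in the text, so both functions are assumptions: `hit_rate` pools all previously assigned descriptors across items, and `first_choice_agreement` reads the "chances" figure as the percentage of items whose top machine descriptor was picked by at least one indexer. All data shown are hypothetical.

```python
def hit_rate(machine, reference):
    """Percent of previously assigned descriptors that the machine also assigned."""
    hits = sum(len(set(m) & set(r)) for m, r in zip(machine, reference))
    total = sum(len(set(r)) for r in reference)
    return 100.0 * hits / total if total else 0.0


def first_choice_agreement(first_choices, indexer_choices):
    """Percent of items whose top machine descriptor was chosen by >= 1 indexer."""
    agreed = sum(
        any(choice in picks for picks in indexers)
        for choice, indexers in zip(first_choices, indexer_choices)
    )
    return 100.0 * agreed / len(first_choices) if first_choices else 0.0


# Hypothetical data for two test items.
machine = [["A", "B", "C"], ["D", "E"]]
reference = [["A", "B"], ["D", "F", "G"]]
pooled_hits = hit_rate(machine, reference)  # 3 of 5 reference descriptors

first_choices = ["A", "D"]
indexer_choices = [[{"A", "B"}, {"C"}], [{"F"}, {"G"}]]
top_agreement = first_choice_agreement(first_choices, indexer_choices)
```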
4.6 Assignment Indexing from Citation Data
Certain phases in the program of investigation of information selection and retrieval
problems at the Harvard Computation Laboratory have been mentioned previously. The
work of Storm and of Lesk and Storm on the use of first-noun-occurrences as selection
clues for both automatic indexing and abstracting was discussed in connection with tech-
niques for improved derivative indexing. The studies on citation indexing have included,
as noted, experiments to assign indexing terms to a new document by finding the indexing
1/ If necessary or desirable, however, abstracts or portions of text can be used in
addition to or in lieu of the cited titles.