MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Appendix B: Progress and Prospects in Mechanized Indexing
appendix
Mary Elizabeth Stevens
National Bureau of Standards
"The combination of these two automatic indexing methods, whereby a number of
indexing terms would be assigned to a document on the basis of its category
dependency, and the rest extracted from text, might be a desirable solution."
Automatic assignment indexing, with clue-words in the input textual material used to
determine the proper assignments of indexing terms to incoming items, is generally
equivalent to automatic classification techniques that assign a single classification category to
items, again on the basis of clue-words in the input text, because a minimum cut-off level
in the automatic assignment procedure, combined with a sufficiently generic vocabulary,
can achieve classificatory as well as indexing results. The present state of the art in
automatic assignment indexing and classification is marked by intriguing demonstrations of
technical feasibility for the relatively small samples so far investigated. Present
difficulties associated with automatic assignment indexing or classification techniques,
however, relate to problems of input processing requirements, computational limitations,
the special purpose nature of results demonstrated to date, and problems of evaluation.
A listing of automatic classification and assignment indexing experiments as of 1964
is provided in Table 2, pp. 101-103, of the text of this report. To this we should add more
recent results of our own as well as additional results reported by O'Connor 27/,
Williams 28, 29/, Dale and Dale 30, 31/, and others.
In the SADSACT method, we start with a "teaching sample" of items representative of
our collection, to which indexing terms have previously been assigned. We then derive the
statistics of co-occurrences of substantive words in the titles and abstracts of these items
with descriptors assigned to them, ending with a vocabulary of clue words weighted with
respect to prior co-occurrences with various descriptors with which they have been
associated.
Then, for new items, we look up each word of input (typically consisting of 100 words
or less: title and up to 10 cited titles, or title and brief abstract, or title and first or last
paragraphs) and derive "descriptor-selection-scores" based upon the prior ad hoc
word-descriptor associations. The highest ranking descriptors, in terms of the accumulated
selection scores, are then assigned, at some appropriate cut-off level, to the new item.
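
The two SADSACT steps described above (deriving weighted clue words from a teaching sample, then scoring descriptors for a new item) can be illustrated with a minimal Python sketch. It is not the report's implementation: the stop-word list, the weighting formula (each word's share of co-occurrences per descriptor), the function names, and the sample titles are all assumptions made for illustration, since the text does not give the actual weighting used.

```python
import re
from collections import Counter, defaultdict

# Assumed stop-word list; the report does not specify how "substantive" words were chosen.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}


def substantive_words(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def train(teaching_sample):
    """Build clue-word weights from (text, descriptors) pairs.

    Assumed weighting: the fraction of a word's co-occurrences that fall on
    each descriptor. This is a stand-in for the unspecified SADSACT weights.
    """
    cooccurrence = defaultdict(Counter)
    for text, descriptors in teaching_sample:
        for word in set(substantive_words(text)):
            for descriptor in descriptors:
                cooccurrence[word][descriptor] += 1
    weights = {}
    for word, counts in cooccurrence.items():
        total = sum(counts.values())
        weights[word] = {d: n / total for d, n in counts.items()}
    return weights


def assign(text, weights, cutoff=1):
    """Accumulate descriptor-selection scores and return the top `cutoff` descriptors."""
    scores = Counter()
    for word in substantive_words(text):
        for descriptor, weight in weights.get(word, {}).items():
            scores[descriptor] += weight
    return [descriptor for descriptor, _ in scores.most_common(cutoff)]


# Hypothetical teaching sample: titles with previously assigned descriptors.
teaching_sample = [
    ("Automatic indexing of documents by computer", ["INDEXING"]),
    ("Magnetic core memory design", ["STORAGE"]),
    ("Computer memory and storage devices", ["STORAGE", "COMPUTERS"]),
]
weights = train(teaching_sample)
print(assign("Indexing documents automatically", weights))
print(assign("Core memory", weights, cutoff=2))
```

With `cutoff=1` the sketch yields a machine first choice, which is the kind of assignment the agreement figures below refer to; raising the cutoff assigns more descriptors per item.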
To date, machine first-choice assignments (corresponding to performance figures
reported for other automatic classification and indexing experiments) have been checked
for 213 test items either against prior DDC indexing or against user evaluations, or both,
with 72.3 percent mean overall agreement.
Our most recent results involved 150 test items. Machine assignments of descriptors
to items were checked by having up to five actual users of our collection rate the relevance
to a given one of 14 descriptors of items whose titles were listed under that descriptor by
the machine assignment procedure. A total of 451 pairings of user-relevance-ratings with
the machine assignments has now been analyzed, with a mean relevance rating of 74.9 percent. With
respect to machine first-choices, there were 206 pairings with 85.4 percent of the machine
assignments rated as at least somewhat relevant.
Checks have also been made of SADSACT results as compared to which of these same
documents would be directly retrievable if a KWIC or some other title-only index were to
be used. For the first 50 machine assignments rated as "highly relevant" in user-
evaluations, a check was made to determine whether or not the same item would be
retrievable by lookup under the name of the descriptor in a KWIC index. There were 9
such cases, or 18 percent. In 48 percent of the cases, a part of the descriptor name
occurred in the document title. For 17 cases, or 34 percent, there were no title words
identical with any part of the descriptor name.
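
The title check just described can be sketched as a three-way comparison between the words of a title and the words of a descriptor name. This is only an approximation under stated assumptions: the function names are hypothetical, a "full" match is taken to mean that every word of the descriptor name appears in the title, and the count of 24 partial matches is inferred here from the reported 48 percent of 50 cases rather than stated in the text.

```python
import re


def words(text):
    """Return the set of lower-cased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_match(title, descriptor):
    """Classify how much of the descriptor name occurs in the title.

    Assumption: 'full' means every descriptor word is a title word, which
    approximates lookup under the descriptor name in a title-only index.
    """
    title_words, descriptor_words = words(title), words(descriptor)
    if descriptor_words and descriptor_words <= title_words:
        return "full"
    if title_words & descriptor_words:
        return "partial"
    return "none"


# Counts for the 50 "highly relevant" assignments; 9 and 17 are reported,
# 24 is inferred from the reported 48 percent.
counts = {"full": 9, "partial": 24, "none": 17}
total = sum(counts.values())
percentages = {kind: 100 * n / total for kind, n in counts.items()}
print(total, percentages)
```

The tally confirms that the three reported shares (18, 48, and 34 percent) account for all 50 cases.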