NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Appendix B: Progress and Prospects in Mechanized Indexing
Mary Elizabeth Stevens
National Bureau of Standards
One evaluator was also asked to review the titles of 150 test items and to indicate
which, if any, he would wish to retrieve under each of 14 descriptors. He requested 353
items in all, and 209 of these were retrieved on the basis of the SADSACT assignments, for a
recall ratio of 59.2 percent. Of these, 167 had been previously evaluated by the same user,
for an overall relevance ratio of 81.4 percent.
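The recall ratio quoted above is simply the number of requested items actually retrieved, divided by the total number requested. A minimal sketch (the counts are taken from the text; the function name is my own):

```python
def recall(retrieved_relevant: int, total_relevant: int) -> float:
    """Recall ratio: fraction of the items the user wanted that
    the system actually retrieved."""
    return retrieved_relevant / total_relevant

# Counts from the SADSACT evaluation described above:
# 353 items requested, 209 retrieved under the automatic assignments.
print(round(100 * recall(209, 353), 1))  # → 59.2
```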
Summary accounts of automatic classification and assignment indexing experiments
have been provided by Schultz 32/ in the form of an "imaginary panel discussion" (in which,
hypothetically, Borko, Schultz, and Stevens discuss their respective systems), and by
Black 33/, who concludes: "Provided that overall effectiveness is nearly equal, the system
that depends less on the human element would clearly seem to be more desirable from a
standpoint of reliability and efficiency, and perhaps even from a standpoint of economics
as well."
Additional work has been reported by Dale and Dale 30, 31/, Damerau 34/, Dolby et
al. 35/, Kreithen 26/, O'Connor 27/, and Williams 28, 29/, among others. Borko's 36, 37/
more recent papers on this subject consider problems of reliability and evaluation. He
reports comparisons of automatic and manual classifications of 997 psychological abstracts
into 11 categories, factor-analytically derived from the 65 percent of these abstracts used as
source items. He concluded that the percentage of agreement between automatic
classification and a perfectly reliable human classification could reach 67 percent.
O'Connor's 1965 report L2/ provides further promising results of his "machine-like
indexing by people" studies and also discussions of other techniques and of difficulties and
limitations in automatic indexing experiments to date. Using Merck, Sharp and Dohme
indexing data, O'Connor tested additional recognition-of-clue-word rules based on syntactic
emphasis, a first sentence and first paragraph measure, a syntactic-distance measure,
negations forbidden near clue words, and words naming substances or types of operations
being required in close proximity to clue words.
He reports considerable success with these new rules as follows: "The computer
rules selected 92% of 180 toxicity papers. Allowing for sampling error, these rules would
select between 88 and 95 percent of the toxicity papers. Thus the computer rules would be
roughly comparable to, or perhaps superior to, MSD indexers in identifying toxicity
papers."
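The 88-to-95-percent range that O'Connor allows for sampling error is close to what a standard normal-approximation (Wald) confidence interval for a binomial proportion would give for 92 percent of 180 papers. A sketch under that assumption (O'Connor's own method is not stated in the report, so the correspondence is only approximate):

```python
import math

def wald_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """95% normal-approximation (Wald) confidence interval for a
    binomial proportion -- one common way to allow for sampling error."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# O'Connor's figures: the rules selected 92% of 180 toxicity papers.
# The success count is recovered by rounding the stated percentage.
lo, hi = wald_interval(round(0.92 * 180), 180)
print(f"{100 * lo:.0f}% to {100 * hi:.0f}%")
```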
With respect to the difficulties to be observed in automatic indexing experimentation,
O'Connor questions the adequacy of samplings of subject specifications, documents, and
collections, the size of clue word vocabularies, and the human judgments used as stand-
ards in many of the studies that have been made.
The question of sampling adequacy in terms of the representativeness of clue word
vocabularies as related to index terms or classification categories may be particularly
critical for methods using small teaching samples. Spiegel and Bennett 38/ report that:
"There seems to be no simple relation between the size of the corpus and the size of the
vocabulary but after a certain point vocabulary size increases very slowly."
Findings by Williams 29/ are encouraging. Working with teaching samples of 35, 70,
and 140 items respectively, he reports that in the first 10,000 word tokens processed from
the text of 2,700 abstracts 1,800 different word types were encountered, but that in the
80,000 to 90,000 range only 255 new types appeared. He found further that "an increase in
sample size beyond 140 would not appear to offer any significant increase in classification
performance."
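Williams's observation, that early text introduces many new word types while later text introduces very few, can be illustrated by counting previously unseen types in each successive fixed-size window of tokens. A toy sketch with a synthetic corpus (the function name and the data are my own, not Williams's):

```python
import random

def new_types_per_window(tokens, window=10_000):
    """Count how many previously unseen word types appear in each
    successive window of tokens -- the saturation effect Williams reports."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), window):
        window_tokens = tokens[start:start + window]
        new = {t for t in window_tokens if t not in seen}
        counts.append(len(new))
        seen |= new
    return counts

# Toy corpus drawn from a fixed 500-word vocabulary: the first window
# encounters nearly all types, and later windows add almost none.
random.seed(0)
vocab = [f"w{i}" for i in range(500)]
tokens = [random.choice(vocab) for _ in range(30_000)]
print(new_types_per_window(tokens))
```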