NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Appendix B: Progress and Prospects in Mechanized Indexing
Mary Elizabeth Stevens, National Bureau of Standards

One evaluator was also asked to review the titles of 150 test items and to indicate which, if any, he would wish to retrieve under each of 14 descriptors. He requested 353 items in all, and 209 of these were retrieved on the basis of the SADSACT assignments, for a recall ratio of 59.2 percent. Of these, 167 had been previously evaluated by the same user, for an overall relevance ratio of 81.4 percent.

Summary accounts of automatic classification and assignment indexing experiments have been provided by Schultz 32/ in the form of an "imaginary panel discussion" (in which, hypothetically, Borko, Schultz, and Stevens discuss their respective systems), and by Black 33/, who concludes: "Provided that overall effectiveness is nearly equal, the system that depends less on the human element would clearly seem to be more desirable from a standpoint of reliability and efficiency, and perhaps even from a standpoint of economics as well." Additional work has been reported by Dale and Dale 30, 31/, Damerau 34/, Dolby et al. 35/, Kreithen 26/, O'Connor 27/, and Williams 28, 29/, among others.

Borko's 36, 37/ more recent papers on this subject consider problems of reliability and evaluation. He reports comparisons of automatic and manual classifications of 997 psychological abstracts into 11 categories, factor-analytically derived from the 65 percent of these abstracts used as source items. He concluded that it was possible to determine that the percentage of agreement between automatic classification and perfectly reliable human classification could reach 67 percent.

O'Connor's 1965 report L2/ provides further promising results of his "machine-like indexing by people" studies, and also discussions of other techniques and of difficulties and limitations in automatic indexing experiments to date. Using Merck, Sharp and Dohme indexing data, O'Connor tested additional recognition-of-clue-word rules based on syntactic emphasis, a first-sentence and first-paragraph measure, a syntactic-distance measure, negations forbidden near clue words, and a requirement that words naming substances or types of operations occur in close proximity to clue words. He reports considerable success with these new rules: "The computer rules selected 92% of 180 toxicity papers. Allowing for sampling error, these rules would select between 88 and 95 percent of the toxicity papers. Thus the computer rules would be roughly comparable to, or perhaps superior to, MSD indexers in identifying toxicity papers."
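O'Connor's allowance for sampling error amounts to placing a confidence interval around an observed binomial proportion. The following is a minimal sketch of such a computation in Python, assuming a normal-approximation interval and a selected count of about 166 papers (92 percent of 180, an inferred figure); O'Connor's exact method is not stated here, and the resulting bounds are close to, but not necessarily identical with, his 88-to-95-percent range.

    import math

    def proportion_interval(successes, trials, z=1.96):
        # Normal-approximation confidence interval for a binomial proportion.
        p = successes / trials
        half_width = z * math.sqrt(p * (1.0 - p) / trials)
        return p - half_width, p + half_width

    # About 166 of 180 toxicity papers selected (inferred from the 92% figure).
    low, high = proportion_interval(166, 180)
    print("%.1f%% to %.1f%%" % (100 * low, 100 * high))  # roughly 88% to 96%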
With respect to the difficulties to be observed in automatic indexing experimentation, O'Connor questions the adequacy of samplings of subject specifications, documents, and collections, the size of clue-word vocabularies, and the human judgments used as standards in many of the studies that have been made. The question of sampling adequacy, in terms of the representativeness of clue-word vocabularies as related to index terms or classification categories, may be particularly critical for methods using small teaching samples. Spiegel and Bennett 38/ report that "There seems to be no simple relation between the size of the corpus and the size of the vocabulary but after a certain point vocabulary size increases very slowly."

Findings by Williams 29/ are encouraging. Working with teaching samples of 35, 70, and 140 items respectively, he reports that in the first 10,000 word tokens processed from the text of 2,700 abstracts, 1,800 different word types were encountered, but that in the 80,000 to 90,000 token range only 255 new types appeared. He found further that "an increase in sample size beyond 140 would not appear to offer any significant increase in classification performance."
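The vocabulary-growth behavior that Spiegel and Bennett and Williams describe can be examined directly by counting how many previously unseen word types appear in each successive block of running text. The sketch below is illustrative only: the whitespace tokenization, lower-casing, 10,000-token block size, and file name are assumptions, not a reconstruction of either study's procedure.

    def new_types_per_block(tokens, block_size=10000):
        # Count previously unseen word types in each successive block of text.
        seen = set()
        counts = []
        for start in range(0, len(tokens), block_size):
            block = tokens[start:start + block_size]
            new_types = {t for t in block if t not in seen}
            counts.append(len(new_types))
            seen.update(new_types)
        return counts

    # Illustrative use with a hypothetical file of abstract text:
    # tokens = open("abstracts.txt").read().lower().split()
    # print(new_types_per_block(tokens))

Under such a count, Williams's figures correspond to roughly 1,800 new types in the first 10,000-token block but only about 255 new types in the 80,000-to-90,000-token block.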