NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Conclusion chapter
Mary Elizabeth Stevens, National Bureau of Standards

Thus, Borko and Bernick point out: "Up to this point we have used human classification as our criterion for the accuracy of automatic document classification. Against this criterion we have been able to predict with approximately 55% accuracy, and no more. Is this because our techniques of automatic classification are not very good, or is it because our criterion of human classification is not very reliable? There is some evidence to indicate that the reliability of human indexers is not very high. The reliability of classifying technical reports needs investigating and, perhaps even more basically, the reasons for using human classification as a criterion at all." 1/

In general, the results of automatic index-term assignment procedures appear to run in the area of 45-75 percent agreement with prior human indexing, 2/ and this in turn is well within range of, and often superior to, estimates of human inter-indexer consistency based on actual observations and tests. There can be little or no doubt that the results of automatic assignment indexing experiments to date (if extrapolation from the small and often highly specialized samples so far used in actual tests is in fact warranted 3/) do suggest that an indexing quality generally comparable to that achievable by run-of-the-mill manual operations, at comparable costs and with increased timeliness, can be achieved by machine. The question which remains is simply that of practicality, today. Extrapolation from small samples is highly dangerous, as is well noted even by enthusiasts for machine techniques.
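The "percent agreement" figures quoted above can be made concrete as a simple set-overlap measure between two assignments of index terms. The sketch below is illustrative only: the function name and the particular definition (shared terms relative to the larger assignment) are assumptions for exposition, not the measures actually used in the studies cited, which varied from experiment to experiment.

```python
def indexing_agreement(terms_a, terms_b):
    """Percent agreement between two sets of assigned index terms,
    computed as shared terms relative to the larger assignment.
    One of several possible definitions; illustrative only."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 100.0  # both indexers assigned nothing: trivially agree
    return 100.0 * len(a & b) / max(len(a), len(b))

# A machine indexing and a human indexing of the same (hypothetical) document:
machine = {"indexing", "classification", "automation"}
human = {"indexing", "classification", "documentation", "retrieval"}
print(indexing_agreement(machine, human))  # 2 shared terms / 4 -> 50.0
```

Under a definition like this, the reported 45-75 percent machine-versus-human agreement is directly comparable to human inter-indexer consistency measured the same way, which is the comparison the text relies on.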
The fact that, for at least some systems, the limitations on the number of clue words that can be handled (due in part to computational requirements, matrix manipulations, and the like) are such that, even in an experimental situation, certain "tests" are excluded from the result statistics because the items contained an insufficient number of clues, is a serious indictment of reasonable extrapolations for these techniques today. Most tests so far reported have involved not only a highly specialized "sample" library or collection, but a severe limitation on the total number of "descriptors", subject headings, or classification categories to be assigned. Maron used 32, Borko 21, Williams 20, SADSACT 70, Swanson 24. How would any of these approaches fare, given several hundred, much less

1/ Borko and Bernick, 1963 [78], pp. 31-32.
2/ See Table 2.
3/ This is an important, perhaps crucial, caveat. See, for example, Goldwyn, 1963 [233], p. 321: "In the micro-experiments of many of those who would apply statistical techniques ... The document collection consists of 0-100 units. Results based on the manipulation, real or imagined, of such a collection can be valid for it, yet become shaky or even nonapplicable to larger collections"; Perry, 1958 [471], p. 415: "A degree of selectivity quite acceptable for files of moderate size may prove quite inadequate in dealing with large files. This fact often makes it necessary to exert unusual care and considerable reserve in evaluating the results of small-scale tests and demonstrations which may tend to cause the mass effects of large files to be underestimated or overlooked completely"; Swanson, 1962 [586], p. 288: "The extent to which semantic characteristics of natural language are susceptible to being generalized from small sample data is deceptive."