MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Conclusion
chapter
Mary Elizabeth Stevens
National Bureau of Standards
Thus, Borko and Bernick point out:
"Up to this point we have used human classification as our criterion for the accuracy
of automatic document classification. Against this criterion we have been able to
predict with approximately 55% accuracy, and no more. Is this because our tech-
niques of automatic classification are not very good, or is it because our criterion
of human classification is not very reliable? There is some evidence to indicate that
the reliability of human indexers is not very high. The reliability of classifying
technical reports needs investigating and, perhaps even more basically, the reasons
for using human classification as a criterion at all." 1/
In general, the results of automatic index-term assignment procedures appear to run
in the area of 45-75 percent agreement with prior human indexing, 2/ and this in turn is well
within range of, and often superior to, estimates of human inter-indexer consistency based
on actual observations and tests. There can be little or no doubt that the results of auto-
matic assignment indexing experiments to date (if extrapolation from the small and often
highly specialized samples so far used in actual tests is in fact warranted 3/) do suggest
that an indexing quality generally comparable to that achievable by run-of-the-mill manual
operations, at comparable costs and with increased timeliness, can be achieved by machine.
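Agreement figures of the kind cited above are commonly computed as the overlap between two indexers' assigned term sets. The following sketch illustrates one such calculation, assuming a simple overlap ratio (shared terms over the union of terms assigned by either indexer); the actual measure varies from study to study, and the term lists here are purely hypothetical:

```python
def consistency(terms_a, terms_b):
    """Inter-indexer consistency as an overlap ratio:
    terms assigned by both indexers, divided by the
    terms assigned by either (intersection over union)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty assignments agree trivially
    return len(a & b) / len(a | b)

# Hypothetical term assignments for a single document
human = ["indexing", "classification", "automation", "reliability"]
machine = ["indexing", "classification", "computers"]

print(round(consistency(human, machine), 2))  # 0.4
```

Note that this measure penalizes both omitted and extra terms symmetrically, which is one reason reported human inter-indexer consistency figures are themselves often no higher than the 45-75 percent range mentioned above.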
The question which remains is simply that of practicality, today. Extrapolation
from small samples is highly dangerous, as is well noted even by enthusiasts for machine
techniques. For at least some systems, the limitations on the number of clue words that
can be handled (due in part to computational requirements, matrix manipulations, and the
like) are such that, even in an experimental situation, certain "tests" are excluded from
the result statistics because the items contained an insufficient number of clues; this
fact casts serious doubt on any reasonable extrapolation from these techniques today. Most tests
so far reported have involved not only a highly specialized "sample" library or collection,
but a severe limitation on the total number of "descriptors", subject headings, or classi-
fication categories to be assigned. Maron used 32, Borko 21, Williams 20, SADSACT 70,
Swanson 24. How would any of these approaches fare, given several hundred, much less
1/
Borko and Bernick, 1963 [78], pp. 31-32.
2/ See Table 2.
3/ This is an important, perhaps crucial, caveat. See, for example, Goldwyn, 1963
[233], p. 321: "In the micro-experiments of many of those who would apply statis-
tical techniques ... The document collection consists of 0-100 units. Results based
on the manipulation, real or imagined, of such a collection can be valid for it, yet
become shaky or even nonapplicable to larger collections"; Perry, 1958 [471], p. 415:
"A degree of selectivity quite acceptable for files of moderate size may prove quite
inadequate in dealing with large files. This fact often makes it necessary to exert
unusual care and considerable reserve in evaluating the results of small-scale tests
and demonstrations which may tend to cause the mass effects of large files to be
underestimated or overlooked completely"; Swanson, 1962 [586], p. 288: "The
extent to which semantic characteristics of natural language are susceptible to being
generalized from small sample data is deceptive."