MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Indexes Generated by Machine-Automatic Derivative Indexing chapter Mary Elizabeth Stevens National Bureau of Standards

many properly indexable topics or points of interest because the authors did not emphasize them or used new and unusual terminology to describe them, failures to achieve consistency both of reference and index-vocabulary control for the papers of more than one author, and the like. Additional difficulties are engendered, for word indexing by machine from text as against word indexing by people, because of complexities required in programming to achieve recognition of even such simple indicia as endings of sentences, 1/ inconsistencies of capitalization 2/ and misspellings. 3/ Context distinctions between multiple meanings of homographic words are even more difficult. Difficulties in achieving good indexing quality are increased if only titles are used; those of keystroking and machine cost requirements increase as the amount of input material grows.

For these reasons, early criticisms such as those of Bar-Hillel are largely as pertinent today as they were when statistical techniques for computer generation of document extracts and index terms were first proposed. For example:

"There can be no doubt but that computers are in a position to select out of the words or word-strings occurring in the encoded form of the original document those words or strings which fulfill certain formal, statistical conditions, such as occurring more than five times, occurring with a relative frequency at least double the relative frequency in general ... However, it is ... unlikely that the set obtained thereby will be of a quality commensurate with that obtained by a competent indexer. First, there will be serious difficulties as to what is to be regarded as instances of the same word ... Second, there arises ... the problem of synonyms.
Third, and most important, this procedure will yield at its best a set of words and word strings exclusively taken from the document itself." 4/

On the other hand, there are many situations where, because of time factors or lack of conventional indexing resources, even unmodified derivative indexing by machine is itself of value and therefore modifications to improve the quality of results, whether made by man or by machine, may be well worthwhile. As Anzlowar claims: "The increasingly widespread KWIC indexes ... can save so much in time and effort that they surely deserve better than the somewhat haphazard 'slash-dash-ing' now done in most instances as the only cerebral operations thereon." 5/

1/ See Luhn, 1959 [384], p. 22: "Amongst the difficulties encountered in the processing of machine readable texts, inconsistencies in the use of punctuation marks, compounds, capitals, spacing and indentations have been a problem way out of proportion with respect to the simple functions these devices stand for. For instance, even with the aid of a dozen different tests performed by the machine, the true end of a sentence cannot be determined with certainty."
2/ See Artandi, 1963 [20], pp. 52ff, on problems of capitalization of proper names.
3/ See Wyllys, 1963 [653], p. 15.
4/ Bar-Hillel, 1962 [35], pp. 417-418.
5/ Anzlowar, 1963 [16], p. 104.
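
The formal, statistical conditions named in the Bar-Hillel passage quoted above (a word occurring more than five times, and with a relative frequency at least double its relative frequency in general) can be illustrated with a minimal Python sketch. This is not a procedure given in the source. The function name `select_index_terms`, the `general_rel_freq` table, the choice to require both conditions at once, and the treatment of words missing from the table as having a general frequency of zero are all assumptions made for illustration.

```python
import re
from collections import Counter


def select_index_terms(text, general_rel_freq, threshold=5, ratio=2.0):
    """Return words that meet both illustrative statistical conditions.

    text             -- the document text to index
    general_rel_freq -- hypothetical table mapping a word to its relative
                        frequency in general usage (an assumption; the
                        source does not say where such figures come from)
    threshold        -- a word must occur MORE than this many times
    ratio            -- document relative frequency must be at least
                        this multiple of the general relative frequency
    """
    # Crude tokenization: lower-cased runs of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    counts = Counter(words)

    selected = []
    for word, n in counts.items():
        if n <= threshold:
            continue  # fails "occurring more than five times"
        doc_rel = n / total
        # Assumption: a word absent from the table has general frequency 0.
        gen_rel = general_rel_freq.get(word, 0.0)
        if doc_rel >= ratio * gen_rel:
            selected.append(word)
    return sorted(selected)


if __name__ == "__main__":
    sample = "index " * 6 + "indexes " * 3 + "the " * 10 + "machine " * 6
    table = {"the": 0.3, "machine": 0.001}  # made-up illustrative figures
    print(select_index_terms(sample, table))
```

The sketch also exhibits the objections raised in the quotation: "index" and "indexes" are counted as different words, synonyms are not merged, and every selected term is necessarily a word taken from the document itself.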