MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
many properly indexable topics or points of interest because the authors did not emphasize
them or used new and unusual terminology to describe them, failures to achieve
consistency both of reference and index-vocabulary control for the papers of more than one
author, and the like.
Additional difficulties are engendered, for word indexing by machine from text as
against word indexing by people, because of complexities required in programming to
achieve recognition of even such simple indicia as endings of sentences, 1/ inconsistencies
of capitalization 2/ and misspellings. 3/ Context distinctions between multiple
meanings of homographic words are even more difficult. Difficulties in achieving good
indexing quality are increased if only titles are used; those of keystroking and machine
cost requirements increase as the amount of input material grows.
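The sentence-ending difficulty can be illustrated with a minimal sketch. It assumes a deliberately naive rule (a period ends a sentence when followed by whitespace and a capital letter, or by the end of the text); the function name naive_sentence_ends is hypothetical and the rule is not drawn from any of the systems cited here.

```python
import re


def naive_sentence_ends(text):
    """Return the offsets just past each period that the naive rule
    accepts as a sentence ending.

    Assumed rule (illustrative only): a period counts when it is followed
    by whitespace and a capital letter, or by the end of the text.
    """
    pattern = r"\.(?=\s+[A-Z]|\s*$)"
    return [match.end() for match in re.finditer(pattern, text)]


# The abbreviation "Dr." is wrongly accepted as a sentence ending,
# while the decimal point in "3.5" is correctly skipped.
sample = "See Dr. Smith. He agreed."
boundaries = naive_sentence_ends(sample)
```

For the two-sentence sample the rule reports three endings, because the period after "Dr" also satisfies it. Each further test added to repair such a case tends to introduce its own exceptions.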
For these reasons, early criticisms such as those of Bar-Hillel are largely as
pertinent today as they were when statistical techniques for computer generation of
document extracts and index terms were first proposed. For example:
"There can be no doubt but that computers are in a position to select out of the
words or word-strings occurring in the encoded form of the original document
those words or strings which fulfill certain formal, statistical conditions, such
as occurring more than five times, occurring with a relative frequency at least
double the relative frequency in general ... However, it is ... unlikely that the
set obtained thereby will be of a quality commensurate with that obtained by a
competent indexer. First, there will be serious difficulties as to what is to be
regarded as instances of the same word ... Second, there arises ... the problem
of synonyms. Third, and most important, this procedure will yield at its best a
set of words and word strings exclusively taken from the document itself." 4/
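The two formal conditions named in the quotation can be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of any program discussed in this report: the function name select_terms, the tokenizing rule, the general_rel_freq table, and the decision to require both conditions together are all hypothetical.

```python
import re
from collections import Counter


def select_terms(text, general_rel_freq, min_count=5, ratio=2.0):
    """Select words that occur more than `min_count` times and whose
    relative frequency in the document is at least `ratio` times their
    relative frequency in general usage.

    `general_rel_freq` is an assumed lookup table {word: relative frequency}.
    Words missing from the table are treated as having general frequency 0,
    so they pass the second test (an assumption made for illustration).
    """
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    counts = Counter(words)
    selected = []
    for word, count in counts.items():
        if count <= min_count:
            continue  # fails "occurring more than five times"
        doc_rel = count / total
        if doc_rel >= ratio * general_rel_freq.get(word, 0.0):
            selected.append(word)
    return sorted(selected)


# Toy input: "index" and "the" each occur six times, "machine" three times.
toy_text = "index " * 6 + "the " * 6 + "machine " * 3
toy_general = {"index": 0.01, "the": 0.3, "machine": 0.01}
terms = select_terms(toy_text, toy_general)
```

On the toy input only "index" is selected: "the" is too common in general usage and "machine" occurs too rarely. The sketch also exhibits the objections raised in the quotation. It would count "index" and "indexes" as different words, it does nothing about synonyms, and it can return only words that appear in the text itself.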
On the other hand, there are many situations where, because of time factors or lack
of conventional indexing resources, even unmodified derivative indexing by machine is
itself of value and therefore modifications to improve the quality of results, whether
made by man or by machine, may be well worthwhile. As Anzlowar claims: "The increasingly
widespread KWIC indexes ... can save so much in time and effort that they surely deserve
better than the somewhat haphazard 'slash-dash-ing' now done in most instances as the
only cerebral operations thereon." 5/
1/ See Luhn, 1959 [384], p. 22: "Amongst the difficulties encountered in the processing
of machine readable texts, inconsistencies in the use of punctuation marks, compounds,
capitals, spacing and indentations have been a problem way out of proportion
with respect to the simple functions these devices stand for. For instance,
even with the aid of a dozen different tests performed by the machine, the true end
of a sentence cannot be determined with certainty."
2/ See Artandi, 1963 [20], pp. 52ff, on problems of capitalization of proper names.
3/ See Wyllys, 1963 [653], p. 15.
4/ Bar-Hillel, 1962 [35], pp. 417-418.
5/ Anzlowar, 1963 [16], p. 104.