MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Operational Considerations
chapter
Mary Elizabeth Stevens
National Bureau of Standards
Moreover, to date, very little material in the scientific and technical literature is
available in this form. As of 1961, it was reported that a survey by McGraw-Hill indicated
that only about 2 or 3 percent of the publications in the United States were then prepared by
typesetting tape, that most of this was in the form of Monotype tape which because of its
30-column width and special format is not generally compatible with tape reading equipment,
and that tapes had many errors in them which would require considerable effort to
correct. 1/ As of late 1963, Bennett reports:
"Computer processing of natural language text material requires that a body of data
be available in machine-readable form. At present such a body of data results only
from a direct human copying process. An inquiry into existing transcriptions of
text which were machine-readable showed that they were abbreviated both in terms
of completeness and in number of symbols represented. As an alternative text produced
as a by-product of typesetting operations is clearly an eventual possibility,
but present practices make the detection of unit delimiters such as ends-of-sentences
difficult." 2/
In the future, both machine-usable text from publishers and printers and the similarly
machine-usable paper tape produced as a byproduct from the original keystroking of
manuscript on such equipment as Flexowriters and Justowriters may alleviate this problem
for new items. Nevertheless, the wealth of the world's present literature, the informal
and unpublished technical reports of high current interest but limited initial distribution,
and material acquired from foreign sources, will continue to pose for the foreseeable
future major problems either of automatic reading of the printed page or of human
re-transcription at high cost.
While there have been many promising developments in automatic character recognition
techniques, the devices that are now available for production use are limited to
small character sets, such as a single alphabet in a single font, often of special design.
The multi-font page reader is not only not yet commercially available but may not become
so for some years to come. Even if it were, there are many unresolved and as yet incompletely
specified problems involved in the development of suitable rules for the machine
so that it can distinguish between title or page number and text, figure caption and text,
author's name in a cited reference and the title of the paper cited, and the like. A case in
point, not only for automatic reading equipment of the future but for machine processing
of machine-usable material available today, is the difficulty of machine recognition of
punctuation marks as used for different purposes. 3/
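
The punctuation difficulty noted above can be illustrated with a minimal sketch in Python. It is not part of the original report: the sample text, the abbreviation list, and both function names are hypothetical and chosen only for illustration. The first rule treats every period as an end-of-sentence mark and so breaks the text at abbreviations; the second consults a small abbreviation list and the capitalization of the following word, which handles this sample but remains a heuristic.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "fig.", "no.", "p.", "vol."}

def naive_split(text):
    """Treat every period followed by whitespace as a sentence end."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [p for p in parts if p]

def guarded_split(text):
    """Split after a period only when the token is not a listed
    abbreviation and the next token starts with a capital letter
    (or there is no next token)."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        is_last = i == len(tokens) - 1
        is_abbrev = tok.lower() in ABBREVIATIONS
        next_is_capital = (not is_last) and tokens[i + 1][0].isupper()
        if tok.endswith(".") and not is_abbrev and (is_last or next_is_capital):
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing words that did not end in a period.
        sentences.append(" ".join(current))
    return sentences

# Hypothetical sample text mixing abbreviations and true sentence ends.
SAMPLE = "See Fig. 3 on p. 22 of the report. The tape had errors."

if __name__ == "__main__":
    for sentence in naive_split(SAMPLE):
        print("naive  :", sentence)
    for sentence in guarded_split(SAMPLE):
        print("guarded:", sentence)
```

The guarded rule still fails when a listed abbreviation happens to end a sentence, which is the kind of incompletely specified problem the text describes.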
In the absence, then, both of scientific and technical documents already in machine
language form and of character recognition equipment capable of reading the printed page,
we are left with the unsatisfactory situation of re-transcribing input material either by
use of a tape typewriter or by keypunching to punched cards. That this situation is unsatisfactory
and is a major bottleneck in machine processing of text in excess of the
bibliographic citation data only is evidenced by such typical statements as these:
1/ Cornelius, 1962 [140], p. 47.
2/ Bennett, 1963 [50], p. 141.
3/ See Bennett quotation above; Luhn, 1959 [384], p. 22, and Coyaud, 1963 [143].