MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Operational Considerations
chapter
Mary Elizabeth Stevens
National Bureau of Standards
ization, especially in the case of citation indexes compiled by machine. 1/ O'Connor notes,
however, that "the provision of pre-editing information can slow down the keypuncher or
typist, increase the chance of mistakes, and require more intelligence or training on the
typist's part." 2/
Questions of error detection and error correction apply both to the original text and
to transcribed versions if these are necessary. That is, the basic documents themselves
may contain typographical errors, misspellings, and the like, and additional errors are
bound to occur at all subsequent stages requiring human processing. Wyllys discusses the
need for the correction of spelling errors, mentions suggested computer programs for
detection, and cites a private communication from Stiles suggesting that the criteria for
accepting words as valid be either that they are identified as already being in the system
vocabulary or that they occur at least twice in the input item. 3/
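The acceptance criteria attributed to Stiles can be illustrated with a minimal sketch. It assumes a simple lowercase alphabetic tokenizer and a vocabulary held as a set of words; the function name suspect_words and both of those choices are hypothetical illustrations, not details given by Wyllys or Stiles.

```python
import re
from collections import Counter


def suspect_words(item_text, system_vocabulary):
    """Return words that fail both acceptance criteria.

    A word is accepted as valid if it is already in the system
    vocabulary OR it occurs at least twice in the input item.
    Everything else is flagged as a possible spelling error.
    """
    # Assumed tokenizer: runs of letters, folded to lowercase.
    words = re.findall(r"[a-z]+", item_text.lower())
    counts = Counter(words)
    return sorted(
        word
        for word, n in counts.items()
        if word not in system_vocabulary and n < 2
    )


vocabulary = {"the", "of", "index", "automatic"}
text = "Automatic indexing of the indxeing index; indexing"
print(suspect_words(text, vocabulary))
```

In this example "indexing" is not in the vocabulary but occurs twice, so it is accepted, while the single occurrence of "indxeing" is flagged for review.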
Swanson's analysis of the reasons for retrieving irrelevant, and failing to retrieve
relevant, material in the case of text searching on the nuclear physics abstracts includes
typical data on the effect of errors. 4/ He found, for example, that failures to record
hyphenated words, subscripts, superscripts and other special symbols accounted for about
5 percent of failures to retrieve relevant items, and errors in transcription of either text
or search instructions accounted for another 3 percent of these failures. Errors in key-
punching of the search requests alone accounted for 4 percent of the cases of irrelevant
retrievals. By contrast, in the newspaper clippings experiments, where the input material
was already in machine-usable form, transcription errors were not a factor, but the input
tape itself had many errors. In this special case, however, Swanson reports: "Garbles
are not important simply because messages are sufficiently redundant to insure that even
if one or two keywords for a given category are garbled, almost invariably others are
present." 5/
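The redundancy argument can be made concrete with a small sketch. It assumes that a category is assigned whenever at least one of its keywords appears in the message; the keyword lists, the threshold parameter min_hits, and the function name assign_categories are hypothetical and are not taken from Swanson's experiments.

```python
def assign_categories(words, category_keywords, min_hits=1):
    """Assign every category with at least `min_hits` keywords present.

    `category_keywords` maps a category name to a set of keywords.
    The rule and the threshold are assumptions for illustration only.
    """
    present = set(words)
    return sorted(
        category
        for category, keywords in category_keywords.items()
        if len(present & set(keywords)) >= min_hits
    )


keywords = {
    "baseball": {"pitcher", "homer", "inning"},
    "finance": {"stock", "bond"},
}
clean = "the pitcher allowed a homer in the ninth inning".split()
garbled = "the pitchxr allowed a homer in the ninth innnig".split()

print(assign_categories(clean, keywords))
print(assign_categories(garbled, keywords))
```

Two of the three keywords are garbled in the second message, yet the surviving keyword is enough to assign the same category. A stricter threshold would make the assignment more sensitive to garbles.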
The news clippings material used by Swanson represents one class of materials that
are today initially available in machine-usable form, because the original recording of the
message or text resulted in a machine-usable medium, such as punched paper tape. A
punched paper tape is produced as the product of many typesetting operations, especially
for newspaper and magazine publication, and this will be increasingly true in the future,
together with computer-prepared tapes for input to automatic typographic composing
equipment. To date, however, equipment to convert from these tapes to the particular
machine language of a given computer processing system is largely unavailable,
costly, and highly subject to error. 6/
1/ See, for example, Atherton, 1962 [25], p. 4; Marthaler, 1963 [399], p. 22.
However, at least one computer program has been developed to assist in this
process. See Thompson, 1963 [600], p. 11-1: "The present program takes
bibliographic citations and automatically arranges them into a standard format in
such a way that the various parts of the citation are unambiguously identified.
These standardized citations can later be processed by sorting and matching
procedures to identify similar citations and to effect various rearrangements."
2/ O'Connor, 1960 [444], p. 8.
3/ Wyllys, 1963 [653], p. 15.
4/ Swanson, 1961 [586], Appendix.
5/ Swanson, 1963 [580], p. 5.
6/ Compare, for example, Savage, 1958 [521], p. 11: "The use of tape as the
original input to the process has offered a number of problems which have yet to be
solved. One is the occurrence of typographical errors."