MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Compiled by Machine
chapter
Mary Elizabeth Stevens
National Bureau of Standards
Mauchly's suggestion was, in effect, the idea of a complete index that could be
searched by machine. We should note, however, that although subsequent technological
advances could significantly decrease his original time estimate, the crucial questions
that remain are those of what, assuming one-to-one representation of document text, one
would search for. 1/ Natural language searching by machine, in the sense of full text
inspection, is a "pay-as-you-go" concordance technique. It is, however, a technique
which must be aided and abetted by various forms of synonym reduction, syntactic
normalization, homograph resolution and other special processing operations if it is to be
in any sense an effective tool for selection of clues to be retrieved.
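The role of synonym reduction in such a concordance search can be illustrated by a minimal sketch. It assumes a small hand-built synonym table and treats lower-casing as the only other normalization; the names SYNONYMS, normalize, build_concordance and search are hypothetical and are not taken from any of the systems discussed in this report. Syntactic normalization and homograph resolution are not attempted.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: each variant maps to one canonical term.
SYNONYMS = {"automobile": "car", "cars": "car", "autos": "car", "auto": "car"}


def normalize(word):
    """Lower-case a word and reduce it to its canonical synonym, if any."""
    word = word.lower()
    return SYNONYMS.get(word, word)


def build_concordance(documents):
    """Map each normalized term to the (doc_id, position) pairs where it occurs."""
    concordance = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"[A-Za-z]+", text)):
            concordance[normalize(word)].append((doc_id, position))
    return concordance


def search(concordance, term):
    """Return the sorted ids of documents containing the term or a synonym."""
    return sorted({doc_id for doc_id, _ in concordance.get(normalize(term), [])})


documents = {
    "d1": "The automobile was repaired.",
    "d2": "Cars require regular service.",
    "d3": "The train departed on time.",
}
concordance = build_concordance(documents)
print(search(concordance, "car"))  # ['d1', 'd2']
```

Without the synonym table the request "car" would match neither document, since neither contains that exact word. This is the sense in which unaided full text inspection falls short as a selection tool.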
Gardin, in a series of recent lectures on automatic documentation (Gardin, 1963
[207, 208]), 2/ refers to the opinions of some investigators that it should be possible to
"jump" the stage of indexing and to search the natural language texts directly. The
problem, he points out, then shifts to the determination of all the various ways in which
the possible answers to a question may have been expressed in these natural language
"complete indexes". Instead of carrying out reductions or condensations of the documents,
as in normal indexing procedures, amplifications of questions are required. "Reductive"
indexing of the source documents can only be eliminated at the expense of "expansive"
indexing of questions. Gardin concludes that the gain from this is very doubtful.
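The shift from "reductive" indexing of documents to "expansive" indexing of questions can be sketched as follows. This is an illustration only, under the assumption that the alternative expressions for a question term are supplied by a hand-compiled table; EXPANSIONS, expand_question and search_full_text are hypothetical names, and the sample texts are invented.

```python
import re

# Hypothetical table of alternative expressions for a question term.
EXPANSIONS = {
    "indexing": ["indexing", "index", "subject headings", "descriptors"],
}


def expand_question(terms):
    """Amplify each question term into every listed way of expressing it."""
    expanded = []
    for term in terms:
        expanded.extend(EXPANSIONS.get(term.lower(), [term.lower()]))
    return expanded


def search_full_text(documents, terms):
    """Scan the unreduced text of every document for any expanded expression."""
    patterns = [
        re.compile(r"\b" + re.escape(expression) + r"\b", re.IGNORECASE)
        for expression in expand_question(terms)
    ]
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(pattern.search(text) for pattern in patterns)
    )


documents = {
    "d1": "Descriptors were assigned by machine.",
    "d2": "Automatic indexing of legal statutes.",
    "d3": "Keypunching costs remain high.",
}
print(search_full_text(documents, ["indexing"]))  # ['d1', 'd2']
```

The documents are left untouched, but every question must now carry its own table of equivalent expressions, and the whole text is scanned once per expression. The effort saved on the document side reappears on the question side, which is the substance of Gardin's objection.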
There is also the presently staggering burden of time and cost to convert full texts to
machine-usable form. As of February, 1961, it was estimated that the natural language
text material available for machine processing amounted to little more than the words
contained in the Harvard Classics five-foot shelf (Stevens, 1962 [567]). Perhaps up to
ten times that amount is now available, notably in the 6,000,000 words of the statutes of
Pennsylvania 3/ and in several million additional words that have since been keypunched
at the Center for Automation of Literature Analysis, Gallarate, Italy. 4/ A very recently
1/
See, for example, Yngve, 1959 [657], pp. 978-979: "We will have to find formal
connections between widely divergent ways of saying essentially the same thing. In
addition there is much that we will have to learn about searching. If we had today a
complete grammar of English which was capable of rendering explicit all the relations
and distinctions implicit in the document, I doubt that we would know how to use it
effectively in a machine search situation. We would be embarrassed by the very
wealth of the information available. Much more must be learned about search
situations."
2/
See also Bar-Hillel, 1962 [35], p. 415: "Could not the stage of clue assignment be
completely skipped and the request topic be directly compared with the original
documents? It is very natural that such a thought should have arisen, but it must
be stressed that there is nothing in our knowledge of the workings of communication
which would indicate that such a proposal is, or ever will be, practical."
3/
See various references by J. F. Horty, W. B. Eldridge and S. F. Dennis, E. M. Fels,
R. Wilson.
4/
R. Busa, data reported at the NATO Advanced Study Institute on Automatic Document
Analysis, Venice, July 1963.