NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report

Indexes Compiled by Machine
Mary Elizabeth Stevens
National Bureau of Standards

Mauchly's suggestion was, in effect, the idea of a complete index that could be searched by machine. We should note, however, that although subsequent technological advances could significantly decrease his original time estimate, the crucial questions that remain are those of what, assuming one-to-one representation of document text, one would search for. 1/ Natural language searching by machine, in the sense of full text inspection, is a "pay-as-you-go" concordance technique. It is, however, a technique which must be aided and abetted by various forms of synonym reduction, syntactic normalization, homograph resolution and other special processing operations if it is to be in any sense an effective tool for selection of clues to be retrieved.

Gardin, in a series of recent lectures on automatic documentation (Gardin, 1963 [207, 208]), 2/ refers to the opinions of some investigators that it should be possible to "jump" the stage of indexing and to search the natural language texts directly. The problem, he points out, then shifts to the determination of all the various ways in which the possible answers to a question may have been expressed in these natural language "complete indexes". Instead of carrying out reductions or condensations of the documents, as in normal indexing procedures, amplifications of questions are required. "Reductive" indexing of the source documents can only be eliminated at the expense of "expansive" indexing of questions. Gardin concludes that the gain from this is very doubtful. There is also the presently staggering burden of time and cost to convert full texts to machine-usable form.
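The trade-off between "reductive" indexing of documents and "expansive" indexing of questions can be illustrated with a minimal sketch. It is not drawn from Gardin or from any system discussed in this report; the synonym table, the sample documents, and the function names are all hypothetical, chosen only to show how a full-text search depends on anticipating every way a term may have been expressed.

```python
import re

# Hypothetical synonym table (an assumption for illustration only).
# A working system would need a far larger, curated thesaurus.
SYNONYMS = {
    "car": {"car", "automobile", "motorcar"},
    "index": {"index", "indexes", "indexing"},
}

# Hypothetical sample texts standing in for unindexed full documents.
DOCS = {
    "d1": "The automobile was indexed by machine.",
    "d2": "Indexing of car manuals is costly.",
    "d3": "Concordances list every word of a text.",
}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def expand_query(terms, synonyms=SYNONYMS):
    """'Expansive' indexing of the question: each term becomes a set of variants."""
    return [synonyms.get(term.lower(), {term.lower()}) for term in terms]


def full_text_search(documents, terms):
    """Return ids of documents containing at least one variant of every term."""
    expanded = expand_query(terms)
    hits = []
    for doc_id, text in documents.items():
        words = set(tokenize(text))
        if all(words & variants for variants in expanded):
            hits.append(doc_id)
    return hits


print(full_text_search(DOCS, ["car"]))
print(full_text_search(DOCS, ["car", "index"]))
```

In this sketch the query for "car" and "index" misses document d1, because the variant "indexed" was not anticipated in the table. Every such omission must be repaired on the question side, which is the burden that shifts to the searcher when the documents themselves are not reduced beforehand.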
As of February, 1961, it was estimated that the natural language text material available for machine processing amounted to little more than the words contained in the Harvard Classics five-foot shelf (Stevens, 1962 [567]). Perhaps up to ten times that amount is now available, notably in the 6,000,000 words of the statutes of Pennsylvania 3/ and in several million additional words that have since been keypunched at the Center for Automation of Literature Analysis, Gallarate, Italy. 4/ A very recently

1/ See, for example, Yngve, 1959 [657], pp. 978-979: "We will have to find formal connections between widely divergent ways of saying essentially the same thing. In addition there is much that we will have to learn about searching. If we had today a complete grammar of English which was capable of rendering explicit all the relations and distinctions implicit in the document, I doubt that we would know how to use it effectively in a machine search situation. We would be embarrassed by the very wealth of the information available. Much more must be learned about search situations."

2/ See also Bar-Hillel, 1962 [35], p. 415: "Could not the stage of clue assignment be completely skipped and the request topic be directly compared with the original documents? It is very natural that such a thought should have arisen, but it must be stressed that there is nothing in our knowledge of the workings of communication which would indicate that such a proposal is, or ever will be, practical."

3/ See various references by J. F. Horty, W. B. Eldridge and S. F. Dennis, E. M. Fels, R. Wilson.

4/ R. Busa, data reported at the NATO Advanced Study Institute on Automatic Document Analysis, Venice, July 1963.