MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Chapter: Indexes Compiled by Machine
Mary Elizabeth Stevens, National Bureau of Standards

completed study made by the TRW Computer Division, Thompson Ramo Wooldridge, involves the investigation of the possibilities for a center to provide text in machine-usable form. The report gives a total figure of approximately 50,000,000 words of text so available as of February 28, 1964, but this includes non-scientific text, such as newspaper and popular magazine materials (Mersel and Smith, 1964 [415]). Mersel and Smith also report on the estimated requirements for machine-usable text for various research groups, averaging over a million words per year per group. Yet, at present keypunching costs of one cent or more per word, is it reasonable to assume that any of these research groups can provide a budget of over $100,000 per year for this purpose alone? Moreover, this budget would provide for the conversion of no more than a thousand 1,000-word items or a hundred 10,000-word items at costs, respectively, of $100 or $1,000 per item. For the present, therefore, the conclusion is inescapable: either indexing or search based upon full text processing is not yet practical. Even the most enthusiastic proponents of "searching full natural language text" (Swanson, 1960 [589]) and "maximum-depth indexing" (Simmons and McConlogue, 1962 [555]) generally agree as to the present impracticality of full-text mechanized indexing except for special limited cases. The two problems of determining what to search for, given full text, and of feasibility of conversion of text into machine-usable form thus combine to limit "complete indexing" largely to the special cases of providing corpora for studies in the field of computational linguistics and of compiling the traditional scholarly tool, the concordance to all the words in a given literary work or works.
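The budget arithmetic in the paragraph above can be checked with a short sketch. The function names and the two rates tried are illustrative assumptions, not taken from the report. Note that the quoted item costs ($100 per 1,000-word item, $1,000 per 10,000-word item) work out to ten cents per word; at exactly one cent per word, a million words would come to $10,000, so the sketch takes the rate as a parameter and shows both cases.

```python
def conversion_cost(words: int, cents_per_word: int) -> float:
    """Dollar cost of keypunching `words` words at a flat rate in cents per word."""
    return words * cents_per_word / 100


def items_affordable(budget_dollars: int, words_per_item: int, cents_per_word: int) -> int:
    """Number of whole items of a given length that a dollar budget can convert."""
    cost_per_item_cents = words_per_item * cents_per_word
    return (budget_dollars * 100) // cost_per_item_cents


WORDS_PER_YEAR = 1_000_000  # "over a million words per year per group"

# 1 cent/word is the stated lower bound; 10 cents/word is the rate implied
# by the per-item costs quoted in the text (an inference, not a stated figure).
for cents in (1, 10):
    annual = conversion_cost(WORDS_PER_YEAR, cents)
    per_short_item = conversion_cost(1_000, cents)
    per_long_item = conversion_cost(10_000, cents)
    print(
        f"{cents:>2} cent(s)/word: ${annual:,.0f} per year, "
        f"${per_short_item:,.0f} per 1,000-word item, "
        f"${per_long_item:,.0f} per 10,000-word item"
    )
```

At the ten-cent rate, `items_affordable(100_000, 1_000, 10)` gives the thousand short items and `items_affordable(100_000, 10_000, 10)` the hundred long items mentioned in the text.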
Apparent exceptions, including experimental work with abstracts only and the law statutes studies, are usually cases in which the selective principle of disregarding common words (and hence the bulk of the actual text) is applied automatically either on input or in subsequent processing (Cleverdon and Mills, 1963 [131]). These cases, therefore, may be considered machine-generated indexes rather than machine-compiled. Moreover, it should be noted that: "... The law, itself, is an appropriate field for data retrieval. The statutes, especially, are written in relatively clear, concise language. At least, this is their intent. Practically, this means that input and output can both be relatively short and that retrieval of legal information will be involved with fewer semantic difficulties." 1/

In the area of concordance-making, however, the potentialities of machine compilation have been put to good use. The pioneer efforts in this area are unquestionably those of Father Roberto Busa, S. J., of the Gallarate Center. As early as 1946, Busa proposed to his superiors that a card file recording all the words used in all of the works of St. Thomas Aquinas should be set up, and he began his actual experiments using IBM punched card equipment in 1949 (Busa, 1953 [87], 1960 [91], and 1958 [92]; Secrest, 1958 [540]). 2/ Appearing in 1951, his Sancti Thomae Aquinatis Hymnorum Ritualium Varia Specimina Concordantiarum is the first known example of a complete word index that was compiled by machine techniques. The early Gallarate work was carried out on standard punched card equipment, but from the time of the concordance to the Dead Sea Scrolls, computers have also been used (Tasman, 1959 [595], [596], and [597]). The major continuing task is still to other works of St. Thomas. Other machine-compiled concordances produced by Busa's Center include one to Goethe's Farbenlehre, Bd. 3.

1/ Asher and Kurfeerst, 1963 [24], pp. 1-2.
2/ See also Scheele (ed.), 1961 [522], pp. 206-209.
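The distinction drawn above between a complete concordance and an index that automatically disregards common words can be illustrated with a minimal sketch. This is not a reconstruction of the punched-card or computer procedures used at Gallarate or in the statutes studies; the common-word list, the sample lines, and the function name `build_index` are all hypothetical, chosen only to show the selective principle.

```python
import re
from collections import defaultdict

# Illustrative common-word list (an assumption; real exclusion lists were far longer).
COMMON_WORDS = frozenset({"the", "of", "and", "a", "to", "in", "is"})


def build_index(lines, skip_common=False):
    """Map each word to the (line number, line text) pairs in which it occurs.

    With skip_common=False every word is recorded, as in a concordance.
    With skip_common=True words in COMMON_WORDS are dropped on input.
    """
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for match in re.finditer(r"[a-z]+", line.lower()):
            word = match.group()
            if skip_common and word in COMMON_WORDS:
                continue
            index[word].append((line_no, line))
    # Return entries in alphabetical order, as in a printed word index.
    return dict(sorted(index.items()))


SAMPLE = [
    "The law is written in clear language",
    "The intent of the law is clarity",
]

complete = build_index(SAMPLE)                      # every word, concordance style
selective = build_index(SAMPLE, skip_common=True)   # common words disregarded

for word, occurrences in selective.items():
    line_numbers = [line_no for line_no, _ in occurrences]
    print(f"{word:<10} lines {line_numbers}")
```

In this toy sample the complete index carries ten entries and the selective one six, which gives a small-scale picture of how much of the running text the exclusion of common words removes.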