MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Compiled by Machine
chapter
Mary Elizabeth Stevens
National Bureau of Standards
completed study made by the TRW Computer Division, Thompson Ramo Wooldridge,
involves the investigation of the possibilities for a center to provide text in machine-
usable form. The report gives a total figure of approximately 50,000,000 words of text
so available as of February 28, 1964, but this includes non-scientific text, such as news-
paper and popular magazine materials (Mersel and Smith, 1964 [415]).
Mersel and Smith also report on the estimated requirements for machine-usable text
for various research groups, averaging over a million words per year per group. Yet, at
present keypunching costs of one cent or more per word, is it reasonable to assume that
any of these research groups can provide a budget of over $100,000 per year for this
purpose alone? Moreover, this budget would provide for the conversion of no more than
a thousand 1,000-word items or a hundred 10,000-word items at costs, respectively, of
$100 or $1,000 per item. For the present, therefore, the conclusion is inescapable: either
indexing or search based upon full text processing is not yet practical. Even the most
enthusiastic proponents of "searching full natural language text" (Swanson, 1960 [589])
and "maximum-depth indexing" (Simmons and McConlogue, 1962 [555]) generally agree as
to the present impracticality of full-text mechanized indexing except for special limited
cases.
The two problems of determining what to search for, given full text, and of feasibility
of conversion of text into machine-usable form thus combine to limit "complete indexing"
largely to the special cases of providing corpora for studies in the field of computational
linguistics and of compiling the traditional scholarly tool, the concordance to all the words
in a given literary work or works. Apparent exceptions, including experimental work
with abstracts only and the law statutes studies, are usually cases in which the selective
principle of disregarding common words (and hence the bulk of the actual text) is applied
automatically either on input or in subsequent processing (Cleverdon and Mills, 1963
[131]). These cases, therefore, may be considered machine-generated indexes rather
than machine-compiled. Moreover, it should be noted that:
"... The law, itself, is an appropriate field for data retrieval. The statutes,
especially, are written in relatively clear, concise language. At least, this
is their intent. Practically, this means that input and output can both be
relatively short and that retrieval of legal information will be involved with
fewer semantic difficulties." 1/
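
The distinction drawn above between a complete (machine-compiled) word index and one in which common words are disregarded automatically can be illustrated with a minimal sketch. It is not a reconstruction of any system cited here: the function name `compile_index`, the short list `COMMON_WORDS`, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop list for illustration; a working list would be far longer.
COMMON_WORDS = frozenset({"the", "of", "and", "a", "in", "to", "is", "that", "are"})


def compile_index(text, stop_words=frozenset()):
    """Map each word to the list of word positions at which it occurs.

    With an empty stop list every word is indexed (complete indexing);
    with a stop list, common words are dropped on input (selective indexing).
    """
    index = defaultdict(list)
    words = re.findall(r"[a-z]+", text.lower())
    for position, word in enumerate(words):
        if word not in stop_words:
            index[word].append(position)
    return dict(index)


sample = "The statutes are written in relatively clear, concise language."
complete = compile_index(sample)                 # every word retained
selective = compile_index(sample, COMMON_WORDS)  # common words disregarded

print(len(complete), "entries in the complete index")
print(len(selective), "entries in the selective index")
```

Positions are counted over the full text in both cases, so an entry in the selective index still points to the same place in the original as the corresponding entry in the complete index.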
In the area of concordance-making, however, the potentialities of machine com-
pilation have been put to good use. The pioneer efforts in this area are unquestionably
those of Father Roberto Busa, S. J., of the Gallarate Center. As early as 1946, Busa
proposed to his superiors that a card file recording all the words used in all of the works
of St. Thomas Aquinas should be set up, and he began his actual experiments using IBM
punched card equipment in 1949 (Busa, 1953 [87], 1960 [91], and 1958 [92]; Secrest,
1958 [540]). 2/ Appearing in 1951, his Sancti Thomae Aquinatis Hymnorum Ritualium
Varia Specimina Concordantiarum is the first known example of a complete word index
that was compiled by machine techniques. The early Gallarate work was carried out on
standard punched card equipment, but from the time of the concordance to the Dead Sea
Scrolls, computers have also been used (Tasman, 1959 [595], [596], and [597]). The
major continuing task is still to extend the concordance to other works of St. Thomas. Other machine-compiled
concordances produced by Busa's Center include one to Goethe's Farbenlehre, Bd. 3.
1/ Asher and Kurfeerst, 1963 [24], pp. 1-2.
2/ See also Scheele (ed.), 1961 [522], pp. 206-209.