2. INDEXES COMPILED BY MACHINE

A first and obvious use of machines in indexing processes is in the manipulation of index entries, previously selected on the basis of human analysis, to produce various orderings, duplications and listings of these entries. The power of machine techniques to speed and economize the sorting, ordering and listing operations in the preparation or compilation of indexes was recognized quite early, both in the field of library science and in the consideration of potential areas of application by specialists in machine potentialities. In particular, two specialized types of index, at least in the broad sense, are such that their compilation would be almost prohibitive in terms of time and cost were it not for the use of machines. These are, respectively, the complete index, that is, the index to all words of a text in their various contexts, which is a concordance, 1/ and the "citation index", which has been used in the field of law for many years but has only quite recently been suggested for literature search purposes related to scientific and technical information.

1/ See, for example, Doyle, 1963 [162], p. 11: "Without data-processing machinery, concordances are prohibitively expensive to generate for most uses except in those cases where it is well known that a given volume of text is going to be used again and again, by large numbers of people over a long period of time. As we know, clergymen have made use of manually prepared concordances of the Bible since the 12th century".

In machine-compiled indexes, no items or entries are eliminated by the machine, whereas in even the most rudimentary of machine-generated indexes, such as KWIC, various reductive or extractive operations are automatically applied as a part of the machine procedure.
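The distinction just drawn can be made concrete with a small sketch. It is illustrative only and is not taken from any system described in this report: the function names, the sample stop list, and the data layout are all assumptions. The first function merely orders entries a human has already chosen and drops nothing; the second lets the program itself pick entry words from titles, discarding common words along the way.

```python
from collections import defaultdict

# Hypothetical stop list; real keyword-index programs used much longer ones.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the"}


def compile_index(entries):
    """Machine-COMPILED index: sort and group (heading, item) pairs that
    were selected by a human indexer. No entry is eliminated."""
    index = defaultdict(set)
    for heading, item in entries:
        index[heading].add(item)
    return {heading: sorted(index[heading]) for heading in sorted(index)}


def generate_keyword_index(titles):
    """Machine-GENERATED index: the program chooses the entry words itself,
    applying a reductive step (dropping stop words) automatically."""
    index = defaultdict(set)
    for item, title in titles.items():
        for word in title.lower().split():
            if word not in STOP_WORDS:
                index[word].add(item)
    return {word: sorted(index[word]) for word in sorted(index)}
```

In the first case every pair supplied on input appears somewhere in the output; in the second, the output depends on a selection rule built into the program.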
We shall be concerned in this section with brief discussions of machine-compiled indexes and related devices, specifically, concordances, card or book catalogs mechanically prepared, citation indexes, and special indexes such as Tabledex. The use of machines to compile, sort, duplicate and list index entries can only be considered to be mechanized indexing in a relatively trivial sense. We shall consider, therefore, only a few representative examples, emphasizing early work and some of the pioneering instances.

2.1 Concordances and Complete Text Processing

When, as early as 1856, Crestadoro proposed the use of permutations of the words in titles as a subject-content index, the only "machines" available for the processing operations were people acting in a strictly clerical way. Precisely such clerical operations have been used for centuries in a process that is, in the special sense of full representation of document contents, an index-producing operation: the making of concordances. 1/ The task of listing each separate word in a book in all the contexts in which it appears is incredibly time-consuming and tedious when carried out by manual means. There are those who have spent the major part of their lifetimes at this task. For example: "It took James Strong thirty years to compile his exhaustive Concordance of the Bible..." 2/

The use of machines capable of processing signals which represent and preserve information offered a potentially revolutionary change, and with the advent of the electronic computer even more radical possibilities of very high speed processing were opened up. As early as 1949, J. W. Mauchly (the co-inventor of ENIAC and UNIVAC) envisioned the use of computers for documentation and library science activities.
He suggested that the full information contents of the Library of Congress collections could be recorded in machine language, stored in this form on magnetic tape, and searched by machine in a procedure which would match words or other selection indicia occurring in the recorded information to the specified words or selection criteria of a query or search prescription. Specifically, he estimated that the entire collection, then amounting to 10,000,000 books, could, when transcribed to binary-code representation, 3/ be serially searched in 20 hours. 4/

1/ See, for example, Black, 1962 [65], p. 314: "The oldest book in the world has had such an index for many years--the concordance to the Bible;" Markus, 1962 [394], p. 19: "The ultimate in permutation for indexing is a published concordance;" Linder, 1960 [363], p. 99: "We know of a concordance prepared in the 13th Century;" Simmons and McConlogue, 1962 [555], p. 3: "Complete indexing has been used of course for centuries in the preparation of concordances."
2/ Carlson, 1963 [101], p. 211.
3/ That is, markings which have one of two values (thus, binary digits or "bits") can be used to distinguish between 2^n different other symbols such as alphabetic characters by using log2(2^n) = n of such markings. A binary code for the 26 letters of the English alphabet requires a five-bit representation for each letter. If numeric digit characters are also recorded (26+10), a six-bit code representation is required.
4/ Mauchly, 1949 [406], p. 295. See also "Report to the Secretary of Commerce on the application of machines..." 1954 [620], p. 67.

Mauchly's suggestion was, in effect, the idea of a complete index that could be searched by machine. We should note, however, that although subsequent technological advances could significantly decrease his original time estimate, the crucial question that remains is that of what, assuming one-to-one representation of document text, one would search for.
1/ Natural language searching by machine, in the sense of full text inspection, is a "pay-as-you-go" concordance technique. It is, however, a technique which must be aided and abetted by various forms of synonym reduction, syntactic normalization, homograph resolution and other special processing operations if it is to be in any sense an effective tool for selection of clues to be retrieved. Gardin, in a series of recent lectures on automatic documentation (Gardin, 1963 [207, 208]), 2/ refers to the opinions of some investigators that it should be possible to "jump" the stage of indexing and to search the natural language texts directly. The problem, he points out, then shifts to the determination of all the various ways in which the possible answers to a question may have been expressed in these natural language "complete indexes". Instead of carrying out reductions or condensations of the documents, as in normal indexing procedures, amplifications of questions are required. "Reductive" indexing of the source documents can only be eliminated at the expense of "expansive" indexing of questions. Gardin concludes that the gain from this is very doubtful.

There is also the presently staggering burden of time and cost to convert full texts to machine-usable form. As of February 1961, it was estimated that the natural language text material available for machine processing amounted to little more than the words contained in the Harvard Classics five-foot shelf (Stevens, 1962 [567]). Perhaps up to ten times that amount is now available, notably in the 6,000,000 words of the statutes of Pennsylvania 3/ and in several million additional words that have since been keypunched at the Center for Automation of Literature Analysis, Gallarate, Italy. 4/

1/ See, for example, Yngve, 1959 [657], pp. 978-979: "We will have to find formal connections between widely divergent ways of saying essentially the same thing. In addition there is much that we will have to learn about searching. If we had today a complete grammar of English which was capable of rendering explicit all the relations and distinctions implicit in the document, I doubt that we would know how to use it effectively in a machine search situation. We would be embarrassed by the very wealth of the information available. Much more must be learned about search situations."
2/ See also Bar-Hillel, 1962 [35], p. 415: "Could not the stage of clue assignment be completely skipped and the request topic be directly compared with the original documents? It is very natural that such a thought should have arisen, but it must be stressed that there is nothing in our knowledge of the workings of communication which would indicate that such a proposal is, or ever will be, practical."
3/ See various references by J. F. Horty, W. B. Eldridge and S. F. Dennis, E. M. Fels, R. Wilson.
4/ R. Busa, data reported at the NATO Advanced Study Institute on Automatic Document Analysis, Venice, July 1963.

A very recently completed study made by the TRW Computer Division, Thompson Ramo Wooldridge, involves the investigation of the possibilities for a center to provide text in machine-usable form. The report gives a total figure of approximately 50,000,000 words of text so available as of February 28, 1964, but this includes non-scientific text, such as newspaper and popular magazine materials (Mersel and Smith, 1964 [415]). Mersel and Smith also report on the estimated requirements for machine-usable text for various research groups, averaging over a million words per year per group. Yet, at present keypunching costs of one cent or more per word, is it reasonable to assume that any of these research groups can provide a budget of over $100,000 per year for this purpose alone? Moreover, this budget would provide for the conversion of no more than a thousand 1,000-word items or a hundred 10,000-word items at costs, respectively, of $100 or $1,000 per item.
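The budget figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a reader's check, not part of the cited study. It assumes exactly one million words per year (the text says "over a million") and takes the quoted per-item costs at face value. The ten-cents-per-word rate it derives is implied by those figures, not stated in the text, which gives only a floor of "one cent or more per word".

```python
WORDS_PER_YEAR = 1_000_000          # assumption: "over a million", taken as exactly 1M
ANNUAL_BUDGET_CENTS = 100_000 * 100  # the $100,000 annual budget quoted in the text

# Per-word rate implied by spending the whole budget on one million words.
implied_cents_per_word = ANNUAL_BUDGET_CENTS // WORDS_PER_YEAR


def item_cost_dollars(words_per_item, cents_per_word=implied_cents_per_word):
    """Cost in dollars of keypunching one item of the given length."""
    return words_per_item * cents_per_word / 100


def items_per_year(words_per_item, words=WORDS_PER_YEAR):
    """How many items of the given length fit in the yearly word volume."""
    return words // words_per_item


def annual_cost_dollars(cents_per_word, words=WORDS_PER_YEAR):
    """Yearly conversion cost in dollars at a given per-word rate."""
    return words * cents_per_word / 100
```

Under these assumptions `item_cost_dollars(1000)` and `item_cost_dollars(10_000)` reproduce the $100 and $1,000 per-item figures, and `items_per_year` reproduces the thousand and hundred items. At the one-cent floor, `annual_cost_dollars(1)` gives $10,000 for the same volume.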
For the present, therefore, the conclusion is inescapable: neither indexing nor search based upon full text processing is yet practical. Even the most enthusiastic proponents of "searching full natural language text" (Swanson, 1960 [589]) and "maximum-depth indexing" (Simmons and McConlogue, 1962 [555]) generally agree as to the present impracticality of full-text mechanized indexing except for special limited cases. The two problems of determining what to search for, given full text, and of feasibility of conversion of text into machine-usable form thus combine to limit "complete indexing" largely to the special cases of providing corpora for studies in the field of computational linguistics and of compiling the traditional scholarly tool, the concordance to all the words in a given literary work or works.

Apparent exceptions, including experimental work with abstracts only and the law statutes studies, are usually cases in which the selective principle of disregarding common words (and hence the bulk of the actual text) is applied automatically either on input or in subsequent processing (Cleverdon and Mills, 1963 [131]). These cases, therefore, may be considered machine-generated indexes rather than machine-compiled. Moreover, it should be noted that:

"... The law, itself, is an appropriate field for data retrieval. The statutes, especially, are written in relatively clear, concise language. At least, this is their intent. Practically, this means that input and output can both be relatively short and that retrieval of legal information will be involved with fewer semantic difficulties." 1/

In the area of concordance-making, however, the potentialities of machine compilation have been put to good use. The pioneer efforts in this area are unquestionably those of Father Roberto Busa, S. J., of the Gallarate Center. As early as 1946, Busa proposed to his superiors that a card file recording all the words used in all of the works of St.
Thomas Aquinas should be set up, and he began his actual experiments using IBM punched card equipment in 1949 (Busa, 1953 [87], 1960 [91], and 1958 [92]; Secrest, 1958 [540]). 2/ Appearing in 1951, his Sancti Thomae Aquinatis Hymnorum Ritualium Varia Specimina Concordantiarum is the first known example of a complete word index that was compiled by machine techniques. The early Gallarate work was carried out on standard punched card equipment, but from the time of the concordance to the Dead Sea Scrolls, computers have also been used (Tasman, 1959 [595], [596], and [597]). The major continuing task is still that of concordances to other works of St. Thomas. Other machine-compiled concordances produced by Busa's Center include one to Goethe's Farbenlehre, Bd. 3.

1/ Asher and Kurfeerst, 1963 [24], pp. 1-2.
2/ See also Scheele (ed.), 1961 [522], pp. 206-209.

Other relatively well-known examples of machine-compiled concordances include those to the Revised Standard Version of the Bible (Ellison, 1957 [186]; Cook, 1957 [139]) and to Matthew Arnold's poetry (Painter, 1960 [461]; Parrish [467, 468]). The Cornell Concordance Series, under the general editorial supervision of Parrish, includes investigations of Old English, such as The Anglo-Saxon Poetic Records (Bessinger, 1961 [59]). The November 1962 issue of Current Research and Development in Scientific Documentation, No. 11 [430], lists several concordances compiled by machine, including the work of Sebeok [533, 534] and associates at Indiana University on Cheremis folksongs, the work on the National Vocabulary of the French language under Quemada at the University of Besancon, 1/ the preparation of glossaries and concordances to the works of Kant at the University of Bonn, 2/ and concordances to medieval German texts being compiled by Wisbey at the University of Cambridge (Wisbey, 1962 [646], [647]).
At the University of Gothenburg in Sweden, work has begun on mechanical linguistic analysis of English language texts, using the machine-readable teletypesetter tapes used for the printing of paperback books (Ellegard, 1960 [184] and 1962 [185]). 3/ Another recent example is that of the work at the Summer School of Linguistics, University of Mexico (Grimes and Alvarez, 1961 [243]). By 1963, Marthaler writes that "Compiling concordances with the aid of a computer is already standard routine to such an extent that it needs hardly be described in detail." 4/ As of January 1964, a general-purpose computer program for the IBM 7090 which can compile various types of concordances has been announced as available from the Mechanolinguistics Project at the University of California (1964 [95]). 5/ The major advantage of using machines to compile concordances is, of course, the enormous difference in the time required to complete the work. Thus, only 120 hours were required on the UNIVAC computer to prepare the 800,000 words of the Concordance to the Revised Standard Version of the Bible (Cook, 1957 [139]; Ellison, 1957 [186]). 6/

1/ See "Actes du colloque sur la mecanisation...", 1961 [1]; Quemada, 1961 [485] and 1959 [486]; Centre d'Etude du Vocabulaire Francaise, "Specimens de Travaux lexicographiques...", 1960 [106].
2/ National Science Foundation's CR&D Report No. 11 [430], p. 316.
3/ Ibid., p. 321.
4/ Marthaler, 1963 [399], p. l~
5/ "California Concordance Program Available", 1964 [95].
6/ Carlson, 1963 [101], p. 211.

In the use of the IBM 705 for the concordance to the Summa Theologiae, Fr. Busa reports that only 60 hours were required to arrange in alphabetical order 1,600,000 words. 1/ This advantage of speed, with the concomitant benefits of both economy and timeliness, is illustrated by Tasman as follows: "... It has been estimated that it would take 50 scholars 40 years... to manually index the 13 million or so words of St. Thomas Aquinas' complete works.
IBM punched card machines would produce the indexes and concordances much more accurately and would take ten scholars about four years. Large-scale data processing techniques would reduce the time to about 25 percent... (or)... ten scholars to do the job in less than a year." 2/

Other advantages stem from the facility with which further machine processing can be introduced. Once the text is in machine-readable form, a number of valuable byproducts can be derived. Examples are statistics on the number of words that have 2, 3, ... n letters; frequencies of letter usage; printouts of occurrences of specified words or groups of words; and lists alphabetized on terminal rather than initial letters. Added advantages of computer processing are further exemplified in the options available with the California concordance computer program (1964 [95]), some of which are as follows:

(1) The user may obtain a restricted rather than a full concordance by supplying a list of words for which no entries are to be made.
(2) The user may obtain a selective concordance by supplying a list of words for which, and only for which, entries are to be made.
(3) Each entry word may be centered with its preceding and succeeding context, up to the limits of one full line of 131 characters, or each entry word may be listed together with the full sentence or verse in which it occurs.
(4) Text with interlinear information such as grammatical symbols can be used and selective concordances can be compiled on the basis of such interlinear information.
(5) The citations of an entry can be listed in order of textual occurrence, in an order determined by preceding or following words in its context, or in an order determined by accompanying interlinear symbols.
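Options (1) to (3) and the first ordering in option (5) can be sketched in a few lines. This is a minimal modern illustration under stated assumptions, not a reconstruction of the California program: the function name, parameter names, tokenizing rule, and default context width are all hypothetical, and interlinear information (option 4) is not handled.

```python
import re


def concordance(text, stop_words=None, only_words=None, width=30):
    """Return (keyword, position, context_line) entries for a text.

    stop_words -- words for which no entries are made (a restricted concordance)
    only_words -- if given, entries are made for these words and no others
    width      -- characters of context kept on each side of the entry word
                  (the program described above worked to a 131-character line)
    """
    stop = {w.lower() for w in (stop_words or ())}
    only = None if only_words is None else {w.lower() for w in only_words}
    entries = []
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().lower()
        if word in stop or (only is not None and word not in only):
            continue
        start, end = match.span()
        # Pad both sides so the entry word always begins at column `width`.
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width].ljust(width)
        entries.append((word, start, left + text[start:end] + right))
    # Alphabetical by entry word, then in order of textual occurrence.
    entries.sort(key=lambda entry: (entry[0], entry[1]))
    return entries
```

A call such as `concordance(text, stop_words=["the", "and"])` gives the restricted form, and `concordance(text, only_words=["heaven"])` the selective form. Sorting on following context instead of position would only require a different sort key.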
1/ See his statement in Scheele, 1961 [522], p. 209.
2/ Tasman, 1958 [596], p.1~.

2.2 Card Catalogs, Book Catalogs, Bibliographies and Subject Index Listings Prepared by Machine

The use of machines such as punched card equipment for the preparation and processing of library card catalogs and of index listings was advocated by a few far-sighted documentalists at least as early as the 1930's (Parker, 1938 [463]; Dewey, 1959 [153]). McCormick's bibliography on mechanized library processes (1963 [407]) lists a number of early suggestions, notably those of Fair in 1936 [187], Shera in 1938 [547], and Gates [225] and Callander [96, 97, 98] in 1946. Cox, Bailey and Casey proposed the use of punched card equipment for the preparation of bibliographies in the field of chemistry in 1945 [142]. By 1946, Gull claimed that:

"... Punched cards and present equipment offer new possibilities right now for solving the problems of the indexes to Chemical Abstracts. These indexes are large undertakings in themselves, and the work of arranging, cumulating, and printing them can be simplified by placing the index information on punched cards at the time the abstracts are made. With current indexes on punched cards, two or three cumulations of the author index during the year will greatly reduce the work required in using current issues from that approach. Cumulations of the subject, patent, and formula indexes immediately become possible for intervals more frequent than once a year." [245]

The following year (1947) saw a summary by Gull of potential applications of punched cards in special libraries [247], and Becker, as a student in the Library School of Catholic University, surveyed some of the then discernible prospects for library mechanization. He stressed such advantages as flexibility in the processing of new material for abstracting, indexing, filing, and interfiling purposes and the printing out of various listings in any format.
1/ The potential use of machines for library science and documentation was not actually recognized, however, until many years after the invention of punched card equipment. Both the punched card developments (beginning with Hollerith and Powers in the 1880's) and the electronic computers developed from 1946 onward were first applied to the automatic manipulation of information in the sense of statistical, mathematical, or engineering data, rather than to information about data or information about other information.

1/ Becker, 1947 [43], pp. 11-12: "From the flexible arrangement of the cards, bibliographies become readily available by subject, author, and title. In special libraries, where material on one subject is concentrated, the research possibilities of gathering, sorting, filing, and printing information are almost limitless. Continuous machine interfiling permits keeping current with new entry additions."
2/ "With the masters...", 1963 [648], p. 18.
3/ Larkey, 1953 [351], p. 34.

Dr. John Shaw Billings, himself a librarian of note, was apparently the first to suggest to Herman Hollerith the idea of recording information as holes punched in cards which could then be sorted mechanically. 2/ Larkey comments: "It is not known if Billings ever thought of applying the principle to bibliographic work, but it would seem eminently fitting that it might be so utilized." 3/ Larkey himself, as head of the Army Medical Library Research Project at the Welch Medical Library, Johns Hopkins University, was certainly one of the pioneers in such utilization, but this was almost 70 years from the date of the Billings-Hollerith conversations. The Army Project, begun in late 1948 or early 1949, had as its contract objective "to explore existing and projected methods, emphasizing machine methods, applicable to such pilot projects as may be necessary" (Larkey, 1949 [348], 1956 [349], and 1953 [351]).
Also as of 1949, the Library of the Department of Agriculture is reported to have "conducted an experiment in the use of electronic data-processing machines to produce the author and subject indexes to the 'Bibliography of Agriculture'." 1/

It was not until the early 1950's, however, that punched card machine techniques were actively put to use for the preparation of card catalogs, book catalogs, bibliographies and various index listings. Then, a number of independent but largely concurrent applications were tried out on at least an experimental basis, including, in addition to the work of the Welch Medical Library Project, pioneering efforts in mechanized book catalog production (Griffin, 1960 [242]; Martin, 1953 [400]; Berry, 1958 [58]) and what is claimed to be the "first successful non-experimental punched-card catalog of periodicals", the Serial Titles Newly Received (now New Serial Titles), as published by the Library of Congress from 1951 onwards. 2/

The work at the Welch Medical Library continued for several years, the final report being issued in 1955 [234]. Beginning in 1951, the project maintained in punched card form the subject heading authority list used for the Current List of Medical Literature (Larkey, 1953 [351]; Garfield, 1953 [217] and 1954 [220]). Garfield has stated that this work "clearly demonstrated the ease of converting alphabetic subject heading lists to categorized or classified lists of terms by the use of punched card equipment." 3/ That is, each heading or subheading had assigned to it a numeric code reflecting its appropriate position in the classified system, which could then be used by machine for sorting, ordering and listing. Ingenious use was made of the IBM 101 Statistical Machine in the preparation of printed subject indexes (Garfield, 1953 [218] and 1954 [216]). Other subject heading lists maintained by punched card techniques by 1953 or earlier included those of the U.S.
Patent Office and the Technical Information Division of the Library of Congress. 4/ The first loose-leaf printed book catalog to be produced by machine methods was apparently that of the King County Public Library in the State of Washington in 1951, and the following year the Los Angeles County Library inaugurated a similar system for the distribution of a master book catalog prepared by mechanized techniques (Berry, 1958 [58]; Griffin, 1960 [242]; Martin, 1953 [400]; Alvord, 1952 [4]).

1/ U.S. Congress, Senate Committee on Government Operations, 1960 [619], p. 147.
2/ Dewey, 1959 [153], p. 36.
3/ Garfield, 1959 [221], p. 471.
4/ Garfield, 1954 [220], p. 1.

The work on mechanized preparation of lists of periodicals at the Library of Congress has been reported as follows:

"In 1951, the Library began publishing, at monthly intervals, Serial Titles Newly Received. In 1953, its title was changed to New Serial Titles. Ever since its inception, the fundamental ingredient of the publication has been the IBM punched card...

"Two important advantages of the punched-card method were foreseen when the publication began. First, it would be possible to print lists from the cards at will, without any further editing or proofreading, once the information was in punched-card form. Second, there was the possibility of mechanically preparing special lists of titles, selected on the basis of subject, country, or language..." 1/

Thus, by 1953, "a number of instances of printed indexes prepared by machine" could be claimed. 2/ The use of punched cards to sort, to prepare tabular listings for various drafts and revisions, and to interfile corrected or revised entries greatly facilitated the preparation at Battelle Memorial Institute of the subject index to the Proceedings of the International Conference on the Peaceful Uses of Atomic Energy, 1955 (Lipetz, 1960 [367]).
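Two of the punched-card operations described in this section are easy to picture in code. One is turning an alphabetic heading list into a classified list by sorting on an assigned numeric code. The other is pulling a special list selected by subject, country, or language. The sketch below is purely illustrative: the record layout, field names, and the form of the class code are assumptions, not the actual card formats of the Welch project or the Library of Congress.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    """One hypothetical punched-card record; all field names are assumed."""
    title: str
    subject: str
    class_code: str  # assumed zero-padded code fixing position in a classified scheme
    country: str
    language: str


def alphabetical_order(cards):
    """List cards by subject heading, then title (the alphabetic arrangement)."""
    return sorted(cards, key=lambda c: (c.subject.lower(), c.title.lower()))


def classified_order(cards):
    """List the same cards by assigned class code (the classified arrangement)."""
    return sorted(cards, key=lambda c: (c.class_code, c.title.lower()))


def special_list(cards, **criteria):
    """Select cards whose fields equal every given criterion,
    e.g. special_list(cards, country="Italy", language="Italian")."""
    return [c for c in cards
            if all(getattr(c, name) == value for name, value in criteria.items())]
```

The same deck of records serves every listing, so no re-editing or re-keying is needed between one arrangement and the next.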
Developments in the use of punched card machine techniques in bibliographic operations of these types, beginning in the 1950's, have by no means been limited to the United States. For example, Remington Rand punched cards have been used in the preparation of a national union catalog of Italian libraries, 3/ and Mikhailov reports for the All-Union Institute of Scientific and Technical Information (VINITI) as follows:

"The development program for machine production of indexes has been underway at the Institute for a number of years... In fact, operational use of Soviet-made punch-card machines to compile the author indexes for some of the series of our Abstract Journal has been practiced at the Institute since 1957..." 4/

In France, at the Centre d'Etudes Nucleaires, Saclay, a program has been developed for mechanization of the production of biweekly and cumulative indexes and for demand searches (Chonez, 1960 [116, 117, 118]).

With the advent of automatic data processing systems, the speed, the flexibility and the capability for multiple-purpose processing buttress the claim that the card catalog can be "replaced or supplemented by book catalogs made with the aid of mechanized equipment". 5/ It is further claimed that "The printed catalog produced by means of automatic equipment combines the best features of the conventional card catalog and the traditional printed catalog, and adds to both new dimensions that would have been unbelievable a generation ago." 6/ A joint project is under way by the Medical Libraries of Columbia, Harvard, and Yale Universities for computer preparation of book catalogs for books published from 1960 onward (Kilgour, et al, 1963 [324]). Another recent illustrative example of the production of printed book catalogs by means of computer compilation is that of the Boeing "SLIP" System (Weinstein and Spry, 1963 [633]).

1/ U.S. Congress, Senate Committee on Government Operations, 1960 [619], p. 85.
2/ Larkey, 1953 [351], p. 38.
3/ Berry, 1958 [58], p. 287.
4/ Mikhailov, 1962 [410], p. 50.
5/ McCormick, 1963 [408], p. 195.
6/ Vertanes, 1961 [625], p. 242. This is with reference to the LILCO Library Printed Catalog, which is prepared by sorting and processing information on titles, authors and titles-by-subject-groupings serving as indexes to the holdings at the Long Island Lighting Company.

Along with recognition of computer-processing potentialities there has emerged increased awareness of the desirability of taking advantage of one-time recording of information to serve multiple purposes: the principle of by-product data generation. The advantages for the library and document collection are that a single recording of bibliographic information in machine-usable form can lead to a variety of products, specifically including printed book catalogs, 1/ recurrent and demand bibliographies, the requisite number of copies for conventional card catalogs, card catalog sets or catalog listings for the personal use of the individual worker, input to mechanized selection and retrieval systems, and machine-manipulatable data for such other purposes as circulation control. Turner and Kennedy report, for example, the initial use of a Flexowriter to prepare library catalog cards and the by-product generation, via a 1401 computer, of bi-weekly listings of unclassified report titles at the Lawrence Radiation Laboratory, the "SAPIR" System (Turner and Kennedy, 1961 [615]). Chasen discusses a change from a previous punched card system for circulation and recall at General Electric's Missile and Space Division Laboratory to a combined Flexowriter and G.E.
225 computer procedure to provide mechanized retrieval, compilation of desk catalogs, computer updating of catalogs and files, and the maintenance of subscription lists (Chasen, 1963 [108]). Fasana describes a system at the Air Force Cambridge Research Laboratory Library where typing indications in the tape are used as boundary codes. He reports:

"Input tapes are currently being processed on a computer to automatically produce catalog card sets, circulation control records, and book form indexes. Original input tapes now being accumulated will form the basis of a machine-searchable file to be used in the future for more sophisticated printouts and searches." 2/

For such applications, Durkin and White make the following typical claims:

"The system described has permitted the IBM Command Control Center Engineering Library to produce its catalog cards and library bulletin both faster and cheaper. Since a by-product of this process is the preparation of all catalog information in punched card form, it has also permitted the establishment of a circulation control system, the publication of overdue notices and reading lists, and the eventual institution of a computer retrieval program" (Durkin and White, 1961 [173]; White, 1963 [638]).

1/ See, for example, Olney, 1963 [458], p. 42: "During the past few years a number of libraries have initiated a program of mechanization... by punching on IBM cards or paper tape some of the bibliographic information normally given on catalog cards. Recording this information in machine-readable form makes it very easy to prepare printed book catalogs."
2/ Fasana, 1963 [195], p. 326. This system involves the "Machine-Interpretable Natural Format" and procedures developed for AFCRL by Itek Corporation; see also Lipetz et al, 1962 [368].
Heiliger reports for the library of the new Chicago Campus of the University of Illinois as follows:

"The type of bibliography the computer can produce does make greater use of LC card information than do present card catalogs. With the computer programmed with a set of library filing rules and a set of symbols that describes for the computer the various parts of the bibliographic unit, it can print-out, for instance, a list of books published in a given country, between certain years, on a certain subject (or combination of subjects), that are illustrated and have bibliographies. It will also be possible to permute on individual items in LC subject headings in the same fashion that Chemical Titles does on titles. This index has been dubbed POSH (permuted on subject headings)." 1/

Some recent experimental work at Inforonics, Inc. puts major emphasis on by-product data generation, beginning with the actual preparation of manuscripts for publication. Tape typewriter processing of manuscript for journal articles is being studied from the point of view of producing machine-usable text. This text, together with coded identification of the separate items in the text, is so prepared that computer programs can produce, from the single input, automatic typesetting tapes for the article itself, author and subject index entries, and the like. Computer text transformations can also produce entries for citation indexes, abstract journals and search files (Buckland, 1963 [83, 84]).

Other computer-produced indexes or special indexes involving compilation rather than selection by machine include indexes to Nuclear Science Abstracts (Day and Lebow, 1960 [151]), the Current List of Medical Literature (Chonez, 1960 [116, 117, 118]), the Retrieval Guide to Thermophysical Properties Research Literature, 2/ and the Research and Development Abstracts of the USAEC (Sherrod, 1963 [541]).
At the Atomic Energy Commission also, a modification of this RDA computer program is used for author, corporate author, number and subject indexes for the Engineering Materials List, which includes announcements of blueprints and drawings. 3/ In several instances, machine processing capabilities are used for permuted listings under various assigned indexing terms. 4/ Special cases of machine permutation operations involve compilation and organization of chain indexes, used to reflect the various key entries in faceted classification systems (Dowell and Marshall, 1962 [159]; Foskett, 1962 [199]; Olney, 1963 [458]).

1/ Heiliger, 1962 [259], p. 475.
2/ Markus, 1962 [394], p. 19; Touloukian, 1962, 1963 [607].
3/ Davis, 1963 [150], p. 237.
4/ See, for example, the reports on the SWIFT program for NASA's STAR (Newbaker and Savage, 1963 [438]); the AIMS System (Heller, 1963 [260]); and the SPINSTRE System (Wheater, 1963 [639]).

A final special case of a computer-compiled index should be noted. This is the work of Schultz and Shepherd with reference to the annual meetings of the Federation of American Societies for Experimental Biology (FASEB) (Schultz and Shepherd, 1960 [532]; Schultz, 1963 [527]; Shepherd, 1963 [545]). 1/ The indexing terms are generated first by the authors of the papers but are then run against a computer program, which by thesaurus-type look-up eliminates synonyms and supplies syndetic devices in addition to formatting the subject index for printout. The machine-readable thesaurus developed for this project presently performs the following four basic functions (Schultz, 1963 [527]):

1. It accepts words from titles and indicia supplied by the authors without modification if they match acceptable indexing terms.
2. It recognizes certain other words as acceptable if modified and modifies them accordingly, for example, by "use" directions for synonyms and near-synonyms.
3. It adds additional indexing terms when certain words occur, an example being 'penicillin', use also 'antibiotics'.
4. It deletes certain words if they do not occur in the context of an acceptable indexing phrase.

1/ These investigators claim the first production of a conventional subject index by computer.

2.3 Tabledex and Other Special Purpose Indexes

The uses of machine techniques in index compilation so far discussed represent instances in which conventional tools of bibliographic control can be prepared at lower cost or more rapidly, or both. In addition, however, certain new and unconventional types of index have been or are being produced with the aid of computers.

The Tabledex method, as proposed by Ledley in 1958 (Ledley, 1958 [352]; Zusman et al, 1962 [661]; O'Connor, 1960 [442]), involves coordinate indexing in bound book form, with special features to facilitate search, conserve space and display index terms co-occurring with a given term for a given item. 2/ A major advantage claimed for this method is that by the use of computers bibliographies and book-form indexes can be organized, compiled, and printed in page format within a matter of hours. A Tabledex index typically consists of a bibliography proper, in which each citation has been assigned an identifying number; an alphabetical list of the indexing terms used, which may also have numeric codes; and a set of indexing tables. These tables contain item numbers in the leftmost column, and either the names or the codes for indexing terms assigned to an item along the row.

2/ See, for example, O'Connor, 1960 [446], p. 241: "Ledley approximately halves the average size of the document descriptions required by imposing an order on the vocabulary of indexing terms. When a document description belongs in a term subset, only those terms of the description need to be recorded which come later in term order than the term of the subset. This illustrates another type of storage organization."
There is one such table for each distinct term used in indexing the items. To facilitate searching, only those terms which are of higher numeric or alphabetic order than the term for which the particular table is compiled are recorded in the rows. Thus, to make a search on several terms, the user turns to the table for whichever of these terms has the lowest term value (that table records all items to which the term has been assigned) and checks its rows for the second lowest ranking term, the third, and so on. Variations in the Tabledex method allow for the automatic assignment of numeric codes to the indexing terms based on relative frequency of use within the collection. Ledley also discusses methods for finding articles associated with all except one, all except two, or all except n of the given words in a search prescription. 1/

A first example of a computer-compiled Tabledex index was that to a bibliography prepared by the Library of Congress for the International Geophysical Year (Zusman et al, 1962 [661]). 2/ The computer program for the IBM 7090 carried out the operations of assigning accession numbers, extracting index terms and compiling the term lists, determining frequencies so as to assign frequency numbers to the terms, organizing and preparing the tables, and developing an author index. Two formats were used, one giving terms by numeric code and the other spelling out the terms as normal words. The latter feature provides a measure of browsability in the system. 3/ A Tabledex compilation program is also in use at the Applied Physics Laboratory of Johns Hopkins University (Olmer and Rich, 1963 [454]).

Another coordinate index search tool, making use of what is in effect a document-descriptor matrix with special codes and column arrangements to save space and facilitate rapid scanning, is the Scan-Column Index suggested in 1960 by O'Connor [449].
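Returning briefly to Tabledex: the table layout and search rule described above can be made concrete with a minimal sketch. It is only an illustration under stated assumptions, not Ledley's or Zusman's program: plain alphabetical order stands in for the "numeric or alphabetic order" of terms, and the function names `build_tabledex` and `search` and the sample items are hypothetical.

```python
from collections import defaultdict


def build_tabledex(items):
    """Build one table per indexing term.

    items maps an item number to the indexing terms assigned to it.
    Each table row holds the item number plus only those co-assigned
    terms that come later in term order (assumed alphabetical here).
    """
    tables = defaultdict(list)
    for item_no in sorted(items):
        terms = sorted(set(items[item_no]))
        for i, term in enumerate(terms):
            # Record only the terms of higher order than this table's term.
            tables[term].append((item_no, terms[i + 1:]))
    return dict(tables)


def search(tables, query_terms):
    """Return item numbers indexed under every term in query_terms."""
    terms = sorted(set(query_terms))
    if not terms:
        return []
    # Turn to the table of the lowest-ranking term, then check each row
    # for the remaining (higher-ranking) terms.
    lowest, rest = terms[0], set(terms[1:])
    return [item_no for item_no, higher in tables.get(lowest, [])
            if rest.issubset(higher)]


# Hypothetical sample data, for illustration only.
items = {
    1: ["aurora", "ionosphere", "radio"],
    2: ["aurora", "magnetism"],
    3: ["ionosphere", "radio"],
}
tables = build_tabledex(items)
print(tables["aurora"])                      # [(1, ['ionosphere', 'radio']), (2, ['magnetism'])]
print(search(tables, ["radio", "aurora"]))   # [1]
```

Because each row omits terms of lower order, every term pair is stored once rather than twice, which is the space saving the method is credited with; the cost is that a search must always start from the lowest-ranking query term.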
1/ Ledley, 1959 [352], pp. 1235-1239.
2/ See also National Science Foundation CR&D No. 11 [430], pp. 130-131.
3/ Zusman et al, 1962 [661], p. ii: "... The word tables have the advantage that browsing can be accomplished and possible associations made during the search... Such 'browsing' can be enhanced by including at the end of each row in a table all the other words also associated with the article of that row."

O'Connor further suggested the use of computers for compilation, as follows: "A computer can organize information about documents into a scan-column index. The input needed consists of the document identifications and their accompanying index terms... and an indication of either the number of columns desired or the column density desired. The computer will determine the frequency of each term, the positive and negative correlations of terms, and the quantity of these correlations by counting or sampling key figures, such as the average number of terms per document. It then can assign column-character codes accordingly." 1/

In 1961, Costello described the use of computer techniques for compilation and computer printout of a dual dictionary for a coordinate indexing system using links and roles at DuPont's Polychemicals Department. After manual analysis, term-role assignments are keypunched; the cards are listed for editing, including the elimination of synonyms and the indication of appropriate postings to more generic terms, and are then re-keypunched for conversion to magnetic tape. Tapes for posting of items and links to term-roles are merged by computer with tapes giving alphabetical equivalents of term codes and with appropriate syndetic indications for final output on an IBM 407 high-speed printer [141].

Still another instance of a coordinate index, modified to show pre-coordination of terms as compiled by computer, is that of the Electronic Properties Information Center (Johnson, 1963 [301]).
The system consists of abstract cards maintained in accession number order, together with machine printouts that pre-coordinate descriptors within nine major categories. The listings of pre-coordinated descriptors are arranged in three different indexes: alphabetically arranged within each category; alphabetized without respect to category but with code indication of the category reference; and a non-categorized listing arranged alphabetically in reverse order. Advantages of machine processing include the ease with which various statistical counts can be made, such as the average number of items in the system for a given material and a specified property. Summary indications of the state-of-the-art in the field of interest can be obtained, "for the system will indicate not only areas where research has been done, but also areas where gaps in the literature occur, and a measure of the growth of research activities in the field can be developed." 2/

1/ O'Connor, 1962 [449], pp 18-49.
2/ Johnson, 1963 [301], p. 296.

2.4 Citation Indexes

"A citation index is a directory of cited references in which each reference is accompanied by a list of source documents which cite it" (Sher and Garfield, 1963 [546], p. 63). This is a relatively new type of bibliographic search tool that would be almost impossible to compile without the use of machines. 1/ In at least one case, moreover, the availability of mechanical devices was itself the inspiration for the idea of a citation index to the scientific literature. Garfield states in a 1954 paper that he was led to the idea of "Shepardizing" from an earlier concern with the development of citation codes or "coden" 2/ that would facilitate machine processing of bibliographic and index entries. 3/ The value of Shepard's Citations in tracking down precedents and decisions has been recognized in the legal field for many years. 4/ The desirability of a similar tool for literature searchers in the fields of scientific and technical information was suggested about a decade and a half ago, when Seidell and others proposed its use for patent searching (Seidell, 1949 [541]; Hart, 1949 [255]). In 1954, the Bush Committee in its considerations of the potential applicability of machines to Patent Office problems received a proposal from the Atlantic Research Corporation of Alexandria, Virginia, which was to cover "the development of a Patent Citation Index, comparable to Shepard's Citations". 5/ In the period 1954-1956, both Garfield 6/ and Fano 7/ independently advocated the development of a citation indexing tool for scientific and technical literature.

1/ See, for example, Atherton, 1962 [25], p. 4: "The volume of data to be processed is so massive that processing machines are a necessity"; Garfield, 1954 [210], p. 4: "Where such large volume of data is to be handled it must be expected that mechanical devices of high speed and versatility... would probably be a determining factor in the system's success."
2/ That is, brief codes, often mnemonic, for journal title abbreviations and other clues to publisher and date of publication.
3/ Garfield, 1954 [210], p. 2.
4/ How to Use Shepard's Citations [281] has been published periodically by Shepard's Citations, Inc., Colorado Springs, since 1873.
5/ U.S. Dept. of Commerce "Report to the Secretary of Commerce...," 1954 [620], p. 27.
6/ Garfield [210, 211, 212]. Adair, writing in January, 1955, specifically acknowledges a suggestion of Garfield's (1955 [2], p. 32), but Garfield in turn credits Adair (1963 [214], p. 290).
7/ Fano, 1956 [191], p. 3: "Let us accept, at least for the sake of this argument, the conclusion that linguistic associations between documents cannot lead to a satisfactory definition of a bibliography. Then the only other type of association for which evidence is available is that provided by simultaneous references in the literature, by the concomitant use of documents by experts as evidenced by library records, and by other similar joint events."

As of today, there are at least five or six instances of citation indexes that have been produced, several different experimental investigations are under way, and new interest has been generated by the considerations of the Weinberg panel. Thus:

"Of the newer approaches to the indexing of scientific documents, the Weinberg Panel was particularly impressed with the citation index as a promising bibliography tool. In order to learn more about this approach, the National Science Foundation is currently sponsoring the compilation and publication of extensive citation indexes for the fields of genetics and also for statistics and probability; and is supporting two kinds of experiments to evaluate different techniques for using citation data in indexes and searching systems in the field of physics." 1/

In general, the principle of citation indexing is based upon the hypothesis that the bibliographic references cited by an author provide significant clues to the subject content of the author's own paper and/or that there is a certain commonality in subject between papers that cite the same references or that are co-cited. 2/ The principle can be applied to the compilation of bibliographical or indexing tools in several different ways. First, there is the method of citedness, which groups for a given item the identifications of subsequent items that have cited it. The converse of this is, of course, the bibliography or reference list of a given item. 3/ In the first case, we are concerned with "descendants," and in the list of references with "ancestors". 4/

1/ Committee on Scientific Information, 1963 [135], p. 16.
2/ Compare Adair, 1955 [2], p. 32, with respect to Shepard's Citations itself: "Since all of the cases listed under a given case have cited it, it follows that they must all be, more or less, pertinent to the case cited." See also Kessler, 1963 [320], p. 1: "This method ... originated in the hypothesis that the bibliography of technical papers is one way by which the author can indicate the intellectual environment within which he operates, and if two papers show similar bibliographies there is an implied relation between them."
3/ See Salton, 1962 [520], p. III-3: "A citation index consists of a set of bibliographic references (the set of 'cited' documents), each being followed by a list of all those documents (the 'citing' documents) which include the given cited document as a reference. A citation index is to be distinguished from a reference index which lists all cited documents under each citing document."
4/ See, for example, Tukey, 1962 [611], p. 5: "Any user's greatest need is likely to be for access to the latest information rather than to the oldest, but the latest items are children, not ancestors. Genealogy is important, but progress requires tracing descendants." Iung and Vandeputte, 1960 [291], p. 11, make a similar distinction between "histoire" (antecedents) and "filiation" (successors).

A second method, implied in Fano's suggestions for the use of relative frequencies of association between items found in the literature, is one of citingness, which groups together items that cite one or more identical references. This method has been developed by Kessler and his associates as the technique of "bibliographic coupling" (Kessler, [317] through [323]). The purpose here is to identify groupings of related items where relatedness is defined in terms of the number of references shared by each of the members of the group with some given test paper or with each other.
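The two methods just described can be sketched briefly. The following is an illustration only, not any of the cited programs: it assumes that each paper's reference list is already available as clean, comparable identifiers, and the function names and sample data are hypothetical. The first function inverts reference lists into a citation index ("citedness"); the second counts the references shared by each pair of papers ("citingness", which Kessler's group measures as coupling strength).

```python
from collections import defaultdict
from itertools import combinations


def citation_index(reference_lists):
    """Invert paper -> references into cited reference -> citing papers."""
    index = defaultdict(list)
    for paper in sorted(reference_lists):
        for ref in sorted(set(reference_lists[paper])):
            index[ref].append(paper)
    return dict(index)


def coupling_strengths(reference_lists):
    """Count shared references for every pair of papers sharing at least one."""
    strengths = {}
    for a, b in combinations(sorted(reference_lists), 2):
        shared = set(reference_lists[a]) & set(reference_lists[b])
        if shared:
            strengths[(a, b)] = len(shared)
    return strengths


# Hypothetical sample data, for illustration only.
reference_lists = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3"],
    "P3": ["R4"],
}
print(citation_index(reference_lists))
# {'R1': ['P1'], 'R2': ['P1', 'P2'], 'R3': ['P1', 'P2'], 'R4': ['P3']}
print(coupling_strengths(reference_lists))
# {('P1', 'P2'): 2}
```

In this sample, P3 shares no reference with any other paper and so drops out of the coupling result entirely, although it still contributes an entry to the citation index.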
It is noted that where the citedness index and the reference list typically give the bibliographic references themselves as the searching or retrieval tool, the bibliographic coupling technique seeks rather to define groups of similar papers. 1/

A third method, and one which may be combined with either of the other two, is to derive indexing terms for a given paper from the overlay of indexing terms previously assigned to any papers which it cites. Salton 2/ further suggests that: "... Citation indexes could be used to extend a given set of index terms by starting with the terms attached to a given document or document set, and adding to them the 'related' terms obtained from new documents which cite the original ones."

The suggested advantages of citation indexing include the claims that this tool does not require trained indexers, 3/ that it is highly susceptible to mechanization (Garfield, 1955 [213], 1956 [212], 1957 [211]; Atherton, 1962 [25]; Becker and Hayes, 1963 [45]), and that it may cost significantly less than subject indexing. 4/ A major advantage claimed is responsiveness to user, rather than indexer, interests and viewpoints. 5/ Some of the representative claims with respect to this factor are as follows:

1/ See Atherton and Yovich, 1962 [26], p. 3: "Kessler's method, however, does not retrieve the references cited by a paper. Instead these references are examined to determine the 'bonds' between papers; e.g., if two papers share six references in common, they are said to have a 'coupling strength' of six. By applying either of two criteria of coupling, one can 'filter out smaller groups of papers' related to a given paper."
2/ Salton, 1962 [520], p. III-8; see also Lesk, 1963 [356].
3/ Atherton, 1962 [25], p. 3.
4/ See Atherton and Yovich, 1962 [26], pp. 3-4: "Garfield estimates cost of abstracting and indexing 200,000 articles in one year to be $3 million. He estimates the cost of a citation index for these same articles (approximately 3 million citations) to be $300,000." See also Doyle, 1963 [162], p. 8: "The editing labor, the input preparation cost, and the automatic processing time are all so small that it's very likely citation indexing is destined for a great surge of popularity in the immediate future."
5/ Committee on Scientific Information, 1963 [135], pp. 55-56: "Because the indexing is based on the author's rather than on an indexer's estimate of what articles are related to what other articles, citation indexes are particularly responsive to the user's, rather than to the indexer's viewpoint."

"The most feasible scheme for alerting individuals to what is of interest in their own field requires an on-going up-to-date citation index. For each narrow field of interest of an individual there are, it is believed with good reason, three to five to ten key items such that: (c1) If he knew that a new item referred to one of his key items, the individual would be glad to skim the new item, (c2) An individual who skimmed all new items referring to one of his key items would be adequately alerted to the newest results in his own specialties." 1/

"A research worker who finds one article several years old can relate later developments by locating all subsequent articles that have referred to it. Corrections and errata can be brought together by a citation index." 2/

"Citation indexing will overcome artificial dividing lines that are drawn in various abstracting services." 3/

"It is believed that citation indexes will be useful... in bringing together related materials in different fields where the interrelationships are not readily identifiable from other types of indexes." 4/

"Since the end product of a citation indexing is a listing which collects in one place the bibliographical descendants of a given cited author, bringing these titles together helps to illuminate for the searcher the extent and nature of information association patterns employed by other authors who had a similar or related interest to his own. Its development, therefore, serves as an approach to the user's frame of reference, not the indexer's." 5/

The importance of being able to pick up more than the principal subject matter clues is indeed an advantage of citation indexing. Garfield, commenting on the potential cross-breeding of interests, gives an example of a personal search for more information on the RCA electronic scanning pencil in which he was led to one of Busa's reports on machine use in philological analysis and to an article of interest in the field of information theory. 6/

1/ Tukey, 1962 [611], p. 9.
2/ Atherton, 1962 [25], p. 2. See also Garfield, 1955 [213], p. 1.
3/ Atherton and Yovich, 1962 [26], p. 3.
4/ Brownson, 1963 [82], p. 3. See also Garfield, 1957 [211], p. 4.
5/ Becker and Hayes, 1963 [45], p. 137.
6/ Garfield, 1954 [210], pp. 4-5.

Garfield further points out that the cross-breeding can extend across changes of terminology with time, 1/ and Lipetz suggests that it can break down barriers with respect to use of foreign literature. 2/ Other claimed advantages relate to the usefulness of the citation index for purposes other than those of direct literature search.
Such other purposes include identification of significant research by "equating frequency of citation with relative significance of subject matter" (Salton, 1962 [520]), determinations of the number of references cited in a given field or by journal or publication date (Atherton, 1962 [25]), evaluation of the relative importance of various scientific journals (Westbrook, 1960 [636]; Kessler, 1961 [322]), tracing of trends in the history of ideas or in a particular field of literature 3/ (Brownson, 1963 [82]; Salton, 1962 [520]), and empirical studies of the frequencies of self-citation, multiple authorship, and the like (Atherton, 1962 [25]).

A number of disadvantages of the citation index are to be noted, however. First is the obvious lack of consistency between authors in terms of whether or not they cite the prior literature at all and in terms of the completeness and correctness of the citations they do make. 4/ Atherton quotes Westbrook as saying: "Science is subject to changing fashions of interest that lead to a distorted number of published papers in a given subject and an inordinately high level of citations to any one who reports first on the fashionable subject. The method will not appraise work performed but not published." 5/

1/ Ibid, p. 6: "Changes in terminology are to a certain extent overcome through the citation approach, since the author who makes a reference to a paper that is forty or fifty years old is making the jump in terminology for us." See also Garfield, 1956 [212], p. 11.
2/ Lipetz, 1963 [366], p. 265: "It is reasoned that availability of a citation index derived from Soviet physics journals and approachable through familiar American references should stimulate utilization of the Soviet physics journals in the United States."
3/ See also Reisner, 1963 [497], p. 71: "Citation indexes are receiving increasing attention as bibliographic aids and as sociometric tools. As sociometric tools, they are being used to explore the flow of information across national boundaries and from pure to applied fields, to determine the structure of a field, and to determine the 'value' of documents or authors."
4/ See, for example, Doyle, 1963 [162], p. 8: "The disadvantages of this kind of indexing is, of course, that it depends on authors providing ample and suitable references"; Salton, 1962 [520], p. III-7: "In many cases personal preferences are evident both as to number and types of papers cited; authors have varying backgrounds, and there may also exist a tendency toward self-citation regardless of relevancy"; Thompson, 1963 [600], p. II-1: "The difficulties... are largely due to the extreme variability of format and to the lack of standardization which prevails in the publication of citations."
5/ Atherton, 1962 [25], p. 4, citing J. H. Westbrook.

An author not cited frequently enough or not cited within a given time period will not appear in the citation index. Doyle points out that there are "many kinds of documents we would like to retrieve where it is not customary to provide citations at all". 1/ In the bibliographic coupling method, both those papers which make no references to any other paper and those papers which do not share at least one reference with some other paper in the system are automatically excluded. 2/

Other disadvantages of the citation indexing technique relate to the lack of standard practices in the citing of references and to problems of recognizing whether one citation is or is not equivalent to another. These are, of course, related to the normal difficulties arising from non-standardized formats and practices in descriptive cataloging, in use of journal abbreviations, in transliterations of foreign language titles and names, and the like, but they are now aggravated by the present prospects for direct machine processing.
As Lipetz points out: "Author's names may be cited in somewhat different ways, and there is no simple mechanical procedure for bringing together the different versions. For example, an author's name may be cited both with and without initials; it would take a comparison of the additional information on the cited reference to establish that these authors are the same. Even more difficult are the problems of mechanically determining that a misspelling has occurred." 3/

Both the disadvantages of incomplete and disproportionate coverage and of failures to equate equivalent citations are quite readily obvious to the user of a citation index if he is reasonably familiar with the subject field or document set that is covered. Thus, the use of the citation index as the exclusive tool for literature search is subject to defects of both oversight and 'over-cite' which are cumulative and which are often easily recognizable. Atherton and Yovich emphasize that: "Knowledge of these weaknesses tends to prevent anyone from trusting the system's ability to retrieve the pertinent literature." 4/

In general, however, the citation index has not been proposed as an exclusive means for literature search and retrieval, but rather as one of a set of tools or as a supplement to other indexes. 5/

1/ Doyle, 1963 [162], p. 8.
2/ See Atherton and Yovich, 1962 [26], p. 39; Marthaler, 1963 [399], p. 23.
3/ Lipetz, 1962 [364], p. 262.
4/ Atherton and Yovich, 1962 [26], p. 39.
5/ See, for example, Tukey, 1962 [611], p. 10: "The citation index, in its retrieval and pursuit uses, is not something to be used alone. Rather, it is the tool whose presence makes all the other tools more effective."

In this connection, it is of interest to note that a manual technique of literature search tested at the Thermophysical Properties Research Center, while not using a citation index as such, makes use of a supplementary citation tracing technique both to shorten manual search time through abstract journals and to follow up additional search leads (Lykoudis et al, 1959 [387]; Cezairliyan, 1962 [107]). The technique is briefly described as follows: "One starts searching the abstracting journal beginning with the most recent issue and going back through a number of years, a. Next, the bibliographies of the papers located in these a years are searched for new references. The references found in this second step of the search will, in general, cover a period of years (b - a). Then one reverts back to searching through the abstracting journal again for another period of a years starting with the year b. This cyclic procedure of alternate searches through the abstracting journal, followed by searching the bibliographies of uncovered papers, is repeated until the total number of desired years of search is covered." 1/ In a sample search on the thermophysical properties of metals, the results showed that the cost of the cyclic procedure was only 65% of the cost of conventional manual search using the abstract journals only.

Recent efforts in the development and use of citation indexes proper include experiments in evaluation at the American Institute of Physics, 2/ an extensive compilation and processing program at the Institute for Scientific Information, 3/ and a cooperative program between the Statistical Techniques Research Group of Princeton University and the Bell Telephone Laboratories (Tukey, 1962 [611] and [612]). Reisner has reported work on the compilation of a citation index to 30,000 patent disclosures and its experimental evaluation in progress at IBM's Thomas J. Watson Research Center (1963 [497]). Goodman is concerned with a citation index to the literature of new educational media, especially that on programmed learning and teaching machines (1963 [235]).
At the Centre d'Etudes Nucleaires de Saclay, a citation index to papers in the field of thermonuclear fusion and plasma physics is being prepared. 4/ Lipetz is carrying on work in the preparation and evaluation of citation indexes, begun at the Itek Corporation, as an independent worker and consultant to the A.I.P. project. 5/ Carroll and Summit report that citation indexing is under consideration at Lockheed's Missile and Space Division (1962 [102]). Kessler and associates at M.I.T. 6/ and Salton's group at the Harvard Computation Laboratory (Salton, 1961 [512], 1962 [513], 1963 [514] and [515]) are concerned with citations as a basis for grouping and categorizing sets of related documents.

1/ Lykoudis et al, 1959 [387], abstract, p. 351.
2/ Atherton and Yovich, 1962 [26]; National Science Foundation's CR&D Report No. 11, p. 12.
3/ Ibid, pp. 27-28.
4/ Ibid, p. 76.
5/ Ibid, p. 181.
6/ Ibid, p. 128.

Early examples of citation indexes that have been produced include the precedents in the fields of statistics and information theory listed by Tukey. 1/ Tukey also refers to early experimentation involving manually manipulated card files by J. L. Hodges, Jr., Charles H. Kraft, and William H. Kruskal. 2/ Goodman (1963 [235]) describes the use of Termatrex cards showing for each item other items cited by it. Examples of machine-compiled citation indexes, however, are those of Garfield and Sher in the field of genetics (1963 [546]), Lipetz's experimental index to the citations in the proceedings of the two United Nations conferences on the peaceful uses of atomic energy (1961 [364], 1960 [365]), and the citation index to references listed in the "Short Papers" submitted for the 1963 Annual Meeting of the American Documentation Institute (Luhn, 1963 [377]). As of January, 1964, the first five volumes of Science Citation Index are available from the Institute for Scientific Information.
These volumes are reported to have 2,250,000 lines of copy representing the computer-compiled citation trails for 102,000 articles published in 1961. 3/

Preliminary evaluations of the citation indexing principle have, as noted previously, been carried out in an American Institute of Physics project supported by the National Science Foundation. One experiment involved the selection of a single paper from the December 1, 1961 issue of The Physical Review and the tracing of references and citations through that journal for the period 1956 to 1960. A bibliography of 64 papers was produced as a result. This was then evaluated by a nuclear physicist, who found that the titles alone were an insufficient basis for judging whether or not these papers should all have been included, and who commented critically that there was no way of knowing if all the papers really relevant to the subject of the test paper had indeed been found. A further check by search of the subject index did in fact reveal six pertinent papers which had been missed by the citation indexing technique.

A second experiment at the American Institute of Physics involved application of Kessler's "coupling strength" criteria to 41 of the 64 papers selected in the first experiment, the remainder being excluded because they shared no references with any other paper. The resultant groupings of presumably highly related papers were also evaluated by a subject matter specialist, who found them relevant to each other but the selection incomplete. Atherton and Yovich, reporting these A.I.P. experiments, concluded that: "More work will have to be done before the usefulness of citation indexing can be accurately determined." 4/

1/ Tukey, 1962 [611], pp. 23-24.
2/ Ibid, p. 24.
3/ See news note, Special Libraries, Jan. 1964, p. 58.
4/ Atherton and Yovich, 1962 [26], p. 22.
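The first A.I.P. experiment started from one paper and traced both its references and the papers citing it. The report does not spell out the tracing rule, so the sketch below is only one plausible reading: a breadth-first trace that follows reference lists ("ancestors") and citations ("descendants") alike, up to an assumed step limit. The function name `trace`, the `max_steps` parameter and the sample data are all hypothetical.

```python
from collections import deque


def trace(seed, references, max_steps=2):
    """Collect papers linked to seed by references or citations.

    references maps a paper to the papers it cites. Returns a dict of
    paper -> number of steps from the seed, following links in both
    directions, up to max_steps (an assumed cut-off).
    """
    # Invert the reference lists so citations can be followed as well.
    cited_by = {}
    for paper, refs in references.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper)

    found = {seed: 0}
    queue = deque([seed])
    while queue:
        paper = queue.popleft()
        if found[paper] == max_steps:
            continue
        neighbours = set(references.get(paper, ())) | cited_by.get(paper, set())
        for other in sorted(neighbours):
            if other not in found:
                found[other] = found[paper] + 1
                queue.append(other)
    return found


# Hypothetical sample data, for illustration only.
references = {"A": ["B", "C"], "D": ["B"], "E": ["D"], "F": ["G"]}
print(trace("A", references))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

Papers F and G are never reached in this sample because no chain of references or citations connects them to the seed, which mirrors the physicist's objection that such a trace gives no assurance of completeness.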
Kessler himself and his associates have also conducted some experiments in comparative evaluation of indexing aids derived from citation data on the one hand and from conventional subject indexing on the other. The basis for evaluation was a total of 334 papers published in The Physical Review in 1958. The study involved detailed comparison of the ways in which these papers fell into related groups according to the "analytic subject index" used by the journal's editors and according to the method of "bibliographic coupling". The essentials of the latter method are described as follows:

"a. A single item of reference used by two papers is called one unit of coupling between them.
"b. A number of papers constitute a related group G_A, if each member of the group has at least one coupling unit to a given test paper, P_0.
"c. The coupling strength between P_0 and any member of G_A is measured by the number of coupling units (n) between them." 1/

For the 334 papers, 73 categories of the Analytic Subject Index (ASI) had been used. For the bibliographic coupling method, each of the papers was in turn considered as the test paper and groups were formed for any of the 333 other papers that shared one or more citations with it. In general, it was concluded that there was good correlation between the groupings of papers achieved by the two methods. It should be noted, however, that 44 papers fell into no groups at all on the basis of the bibliographic coupling criterion. 2/

Salton and associates at the Harvard Computation Laboratory are also concerned with the citation indexing principle as a possible basis for grouping similar documents. They are also concerned with evaluation of results so obtained by comparison with document groups obtained by subject indexing means. In the comparative experiments, data were first compiled for a closed document set of 62 items as to similarities with respect to both "citedness" and "citingness".
The same items were manually indexed and similarity coefficients between these items were derived from overlappings of assigned index terms. When the two measures of similarity were compared with each other and with document associations obtained by random assignments of "citations" and "terms", the conclusions reached were as follows: "The similarity coefficients obtained by comparing overlapping citations for a sample document collection with overlapping, manually generated index terms are much larger than those obtained by assuming a random assignment of citations and terms to the documents; relatively large similarity coefficients are generated for nearly all documents which exhibit at least a minimum number of citations; little seems to be gained by using citation links of length greater than two; for early documents, citedness furnishes a better indication than the amount of citing, and vice versa for recent documents; for documents which can both cite and be cited, equally good indications seem to be obtained by comparing citing and cited documents." 3/

1/ Kessler, 1963 [320], p. 1, footnote.
2/ Ibid., p. 5.
3/ Salton, 1962 [520], p. III-42.

In the Salton project, tests of the value of citation links for the assignment of index terms have been made by comparing the citation pattern of an "unknown" document with those of other documents in the collection to derive a set of five "related" documents, where relatedness is decided on the basis of the magnitude of the similarity coefficients for the citation links. Any index term that appears at least twice in the set of terms previously assigned to the five related documents is then assigned to the new item. In general, approximately 50% of the terms so assigned were also assigned to the same "new" items by human indexing procedures. 1/

As we have previously noted, however, the advantages of citation indexing are likely to be most effectively applied when used as part of an array of other tools.
Tukey suggests, in particular, that permutation indexes of titles, as in KWIC systems, would be of great value as "starter" and "re-check" mechanisms for the use of citation indexes. 2/ Brownson reports: "Consideration is now being given to the possibility of experimenting with a 'hybrid' type of index that would combine permuted titles, authors, and citation data. Such an index might be more useful than any of the individual types of indexes issued singly; and, since no human indexing judgment would be involved, it could be prepared largely by machine and issued rapidly." 3/

Williams, while at ITEK, proposed a hybrid integrated index combining listings by authors, corporate authors or author affiliations, keywords-in-context from title, and references to works cited by and to works citing an item, and she also developed a sample format for selected items from several journals in the field of philosophy. 4/

Precisely such a hybrid tool was provided with the Short Papers for the A. D. I. Annual Meeting 1963, and it was indeed issued rapidly. A brief period of only two or three weeks elapsed between receipt of many of the manuscripts and the distribution of two automatically typeset volumes. The second of these volumes contains a KWIC and an author index to these papers themselves, a bibliography and citation index to all papers referenced by them, and KWIC and author indexes to the cited papers, all computer-compiled within this time period. 5/

1/ Ibid. See also Lesk, 1963 [357], p. V-8.
2/ Tukey, 1962 [611], p. 12.
3/ Brownson, 1963 [82], p. 4.
4/ T. M. Williams, private communication, dated January 4, 1962.
5/ Luhn, 1963 [376] and [377], pp. 353-382.

2.5 Machine Conversion From One Index Set to Another

A final possibility in the general area of machine compilation of indexes and machine use to improve the availability of indexes is as yet in a highly speculative stage.
This is the possibility of converting from one index set to another by machine look-up procedures. In the Welch Medical Library project, mentioned earlier, use was made of punched card techniques to convert from one index arrangement to another, 1/ but machine-recognizable identifiers for both arrangements were explicitly encoded in the material. In recent studies at Datatrol, however, preliminary investigations have been conducted looking toward machine lookup of index-term equivalence tables in order to convert, for example, DDC descriptors to corresponding subject headings used in the AEC vocabulary.

Hammond and Rosenborg (1962 [250] and [252]) report on the compilation of a unilateral table of "indexing equivalents" between approximately 7,000 DDC descriptors and those AEC subject headings judged by them to be identical, synonymous, or "usefully" equivalent, such as one or the other being subsumed by a broader or more generic term. Findings showed 23.8% of the terms of the DDC vocabulary presumably identical to those of AEC, 38.1% of lower generic level, 7.4% of higher generic level, and 10.9% for which no useful equivalents could be found. A sample table of indexing equivalents was prepared for DDC-to-AEC conversion, but not in the opposite direction.

Since, in general, convertibility of indexing vocabularies would be desirable wherever duplication of cataloging and indexing effort is likely to occur (that is, where two or more different documentation organizations receive at least some of the same material as inputs to their systems), the results of these preliminary studies are provocative and appear to merit the further study that is being sponsored by an Interagency Task Group on Vocabulary Study of the Committee on Scientific Information, under the Federal Council for Science and Technology.

There are many substantial difficulties, however.
When applied to actual indexing of the same items by the two agencies, it was found that for 277 items indexed by both AEC and DDC (then ASTIA): "ASTIA used a total of 2,571 descriptors, and AEC 840 subject headings... of these, 392, or roughly half of the AEC terms, were either completely or, for all practical purpose, identical." 2/

1/ Garfield, 1959 [221], p. 471.
2/ Hammond, 1962 [250], p. 4.

Painter (1963 [460]) made further studies of equivalency in her investigations of duplication and consistency of subject indexing at several Government agencies. For 200 items indexed by both AEC and DDC, she found 20% DDC equivalency, 67% AEC equivalency, and 30% similarity of actual indexing. She concludes, in part: "In considering these solutions and the statistics revealed by the studies it should be concluded that with a maximum of only 69 percent equivalency, or convertibility, and a minimum of 28 percent, there is still a large proportion of terms which will necessitate some other form of retrieval. This is the proportion which is involved with the problem of generics, where a term in one system subsumes two of another--and vice-versa. An additional problem evolves in attempting to reconcile two different subject concepts, one, the subject heading which usually has a single access point and one, the uniterm or descriptor which has multiple access through coordination. Thus the practicality of a system made up of many units supplying information indexed differently, using as a basis for retrieval a table of equivalents, is questionable." 1/

Moreover, the results of tests of inter-indexer consistency rates within the same agency were not encouraging. Thus Painter further concludes: "The study, in combining the results of the equivalency analysis and the consistency of indexing within each system and an equivalency of only 30 percent within the broadest system, a table of equivalents is at present of little value in either a manual or a machine system.
In order to apply a table of equivalents efficiently, both a high degree of consistency and a high degree of equivalency is essential." 2/

She therefore stresses that the possibilities for conversion by machine techniques from one indexing set to an equivalent set for another vocabulary are adversely affected by the generally poor rates of inter-indexer consistency. With reference both to the Datatrol Studies 3/ and to corroborative findings of her own, she states: "The value of equivalency studies and most particularly the table of equivalents presuppose the consistency of indexing. Convertibility between systems is thus dependent on the consistency of indexing. Without consistency, the vocabularies as units are not sound; equivalencies cannot be drawn or effectively used for convertibility." 4/

1/ Painter, 1963 [460], p. 104.
2/ Ibid., p. ix.
3/ Hammond, 1962 [250]; Hammond and Rosenborg, 1962 [252].
4/ Painter, 1963 [460], p. 109. Note that these estimates of inter-indexer consistency may be quite optimistic, as discussed on pp. 157-160 of this report.
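The look-up conversion considered in this section, and the residue of unconvertible terms to which Painter draws attention, can be illustrated by a minimal sketch. The table entries, the term names, and the function name `convert` are hypothetical and are not taken from the Datatrol tables; Python is used only for convenience.

```python
def convert(terms, table):
    """Convert index terms through a one-directional table of equivalents.

    Returns (converted, unmatched): the target-vocabulary terms found by
    look-up, and the source terms for which the table has no entry.
    """
    converted, unmatched = [], []
    for term in terms:
        target = table.get(term)
        if target is None:
            unmatched.append(term)
        elif target not in converted:  # avoid repeating a shared heading
            converted.append(target)
    return converted, unmatched


# Hypothetical descriptor -> subject heading entries.
table = {
    "REACTOR FUELS": "Fuels",        # subsumed by a broader heading
    "FUEL ELEMENTS": "Fuels",        # subsumed by the same heading
    "NEUTRON FLUX": "Neutron flux",  # treated as identical
}

source_terms = ["REACTOR FUELS", "FUEL ELEMENTS", "NEUTRON FLUX", "TELEMETRY"]
converted, unmatched = convert(source_terms, table)

# Share of source terms for which the table supplied an equivalent.
rate = 1 - len(unmatched) / len(source_terms)
```

In this hypothetical case two descriptors collapse into a single broader heading, so the same table cannot be used to convert in the opposite direction, and the unmatched term would require some other form of retrieval.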