8. OPERATIONAL CONSIDERATIONS

Whatever the verdict of evaluation of one or more automatic indexing techniques, whether of the derivative, modified derivative, or assignment type, there are certain operational considerations and problems that typically affect any attempt to apply such techniques in actual production operations. These considerations, which also affect linguistic data processing operations in general, include input considerations, availability of methods or devices for converting text to machine-usable form, programming considerations, questions of format and content of output, and problems of customer acceptance of the machine products.

8.1 Questions of Input

Input considerations include, first, questions of the extent and availability of material which can be handled directly by the machine. This may be limited to title only, to title plus abstract, title plus other material, 1/ preselected text, or automatically generated extracts; or it may in a few cases extend to full running text. Possible future requirements may extend to the processing not only of full text but of interspersed graphic material (equations, charts, diagrams, drawings, photographs) as well. We have considered typical arguments for and against the limitation of input to titles only, to augmented titles, and to abstracts in other sections of this report. The points to be emphasized here are requirements for pre-editing or post-editing, provisions for error detection and error correction, the time and cost requirements of conversion equipment if material is not already available in machine-usable form, and the like. As Cornelius suggests: "Present day computers, if used for machine indexing, will be generally input limited and will require excessive data preparation. Causes of these limitations are: time required for translation to machine language, verification of this machine language, and the capability or lack of capability of correction in the input media."
2/

Examples of pre-editing requirements, even for the simple case of keyword-in-title indexing, include the spelling out of chemical symbols, the encoding or the omission of subscripts and superscripts, insertions of hyphens to prevent indexing of a word, and substitutions of blanks for hyphens in compound words to assure indexing of each component. 3/ For full text, a far more extensive and elaborate set of rules and conventions must be developed and applied. 4/ Other editing may be required for format standardization, especially in the case of citation indexes compiled by machine. 1/

O'Connor notes, however, that "the provision of pre-editing information can slow down the keypuncher or typist, increase the chance of mistakes, and require more intelligence or training on the typist's part." 2/

Questions of error detection and error correction apply both to the original text and to transcribed versions if these are necessary. That is, the basic documents themselves may contain typographical errors, misspellings, and the like, and additional errors are bound to occur at all subsequent stages requiring human processing. Wyllys discusses the need for the correction of spelling errors, mentions suggested computer programs for detection, and cites a private communication from Stiles suggesting that the criteria for accepting words as valid be either that they are identified as already being in the system vocabulary or that they occur at least twice in the input item.

1/ This may specifically include cited titles, as suggested variously by Bohnert, 1962 [69], p. 19; Giuliano and Jones, 1962 [229], p. 10; Swanson, 1963 [580], p. 1; Gallagher and Toomey, 1963 [205], p. 53; and as used in the SADSACT method, see pp. 98-99 of this report.
2/ Cornelius, 1962 [140], p. 42.
3/ See, for example, Kennedy, 1961 [311], p. 120.
4/ See, for example, the sophisticated proposals of Nugent, 1959 [441], and Newman et al., 1960 [439].
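The two-part acceptance criterion that Wyllys attributes to Stiles can be sketched in modern terms as follows. This is an illustrative Python sketch only; the vocabulary and sample tokens are invented for the example, and no such implementation is described in the sources cited.

```python
# Sketch of the word-acceptance rule attributed to Stiles: a token is
# treated as valid (rather than as a probable typographical error) if
# it is already in the system vocabulary, or if it occurs at least
# twice within the input item itself.
from collections import Counter

def accept_words(tokens, system_vocabulary):
    """Return the tokens judged valid under the two-part criterion."""
    counts = Counter(tokens)
    return [t for t in tokens
            if t in system_vocabulary or counts[t] >= 2]

# Hypothetical input item and vocabulary: "nuclaer" is a one-time
# misspelling and is rejected; "reactor" and "flux" repeat and so
# are accepted even though they are not yet in the vocabulary.
item = ["nuclear", "physics", "nuclaer", "reactor",
        "reactor", "flux", "flux", "flux"]
vocab = {"nuclear", "physics", "neutron"}
print(accept_words(item, vocab))
```

Note that the rule deliberately favors recall: a consistently repeated misspelling would still be accepted, which is in keeping with Swanson's observation, quoted below, that redundancy in the text limits the damage done by individual garbles.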
3/

Swanson's analysis of the reasons for retrieving irrelevant, and failing to retrieve relevant, material in the case of text searching on the nuclear physics abstracts includes typical data on the effect of errors. 4/ He found, for example, that failures to record hyphenated words, subscripts, superscripts, and other special symbols accounted for about 5 percent of failures to retrieve relevant items, and errors in transcription of either text or search instructions accounted for another 3 percent of these failures. Errors in keypunching of the search requests alone accounted for 4 percent of the cases of irrelevant retrievals. By contrast, in the newspaper clippings experiments, where the input material was already in machine-usable form, transcription errors were not a factor, but the input tape itself had many errors. In this special case, however, Swanson reports: "Garbles are not important simply because messages are sufficiently redundant to insure that even if one or two keywords for a given category are garbled, almost invariably others are present." 5/

The news clippings material used by Swanson represents one class of materials that are today initially available in machine-usable form, because the original recording of the message or text resulted in a machine-usable medium, such as punched paper tape. A punched paper tape is produced as the product of many typesetting operations, especially for newspaper and magazine publication, and this will be increasingly true in the future, together with computer-prepared tapes for input to automatic typographic composing equipment. To date, however, equipment to convert from these tapes to the particular machine language of a given computer processing system is largely unavailable, is costly, and is highly subject to error. 6/

1/ See, for example, Atherton, 1962 [25], p. 4; Marthaler, 1963 [399], p. 22. However, at least one computer program has been developed to assist in this process.
See Thompson, 1963 [600], p. 11-1: "The present program takes bibliographic citations and automatically arranges them into a standard format in such a way that the various parts of the citation are unambiguously identified. These standardized citations can later be processed by sorting and matching procedures to identify similar citations and to effect various rearrangements."
2/ O'Connor, 1960 [444], p. 8.
3/ Wyllys, 1963 [653], p. 15.
4/ Swanson, 1961 [586], Appendix.
5/ Swanson, 1963 [580], p. 5.
6/ Compare, for example, Savage, 1958 [521], p. 11: "The use of tape as the original input to the process has offered a number of problems which have yet to be solved. One is the occurrence of typographical errors."

Moreover, to date, very little material in the scientific and technical literature is available in this form. As of 1961, it was reported that a survey by McGraw-Hill indicated that only about 2 or 3 percent of the publications in the United States were then prepared by typesetting tape, that most of this was in the form of Monotype tape which because of its 30-column width and special format is not generally compatible with tape-reading equipment, and that tapes had many errors in them which would require considerable effort to correct. 1/ As of late 1963, Bennett reports:

"Computer processing of natural language text material requires that a body of data be available in machine-readable form. At present such a body of data results only from a direct human copying process. An inquiry into existing transcriptions of text which were machine-readable showed that they were abbreviated both in terms of completeness and in number of symbols represented. As an alternative, text produced as a by-product of typesetting operations is clearly an eventual possibility, but present practices make the detection of unit delimiters such as ends-of-sentences difficult." 2/
" In the future, both machine-usable text from publishers and printers and the similar- ly machine-usable paper tape produced as a byproduct from the original keystroking of manuscript on such equipment as Flexowriters and Justowriters may alleviate this problem for new items. Nevertheless, the wealth of the world's present literature, the informal and unpublished technical reports of high current interest but limited initial distribution, and material acquired from foreign sources, will continue to pose for the foreseeable future major problems either of automatic reading of the printed page or of human re- transcription at high cost. While there have been many promising developments in automatic character recog- nition techniques, the devices that are now available for production use are limited to small character sets, such as a single alphabet in a single font, often of special design. The multi-font page reader is not only not yet commercially available but may not become so for some years to come. Even if it were, there are many unresolved and as yet in- completely specified problems involved in the development of suitable rules for the machine so that it can distinguish between title or page number and text, figure caption and text, author's name in a cited reference and the title of the paper cited, and the like. A case in point, not only for automatic reading equipment of the future but for machine processing of machine -usable material available today, is the difficulty of machine recognition of punctuation marks as used for different purposes. 3/ In the absence, then, both of scientific and technical documents already in machine language form and of character recognition equipment capable of reading the printed page, we are left with the unsatisfactory situation of re-transcribing input material either by use of a tape typewriter or by keypunching to punched cards. 
That this situation is unsatisfactory and is a major bottleneck in machine processing of text in excess of the bibliographic citation data only is evidenced by such typical statements as these:

1/ Cornelius, 1962 [140], p. 47.
2/ Bennett, 1963 [50], p. 141.
3/ See Bennett quotation above; Luhn, 1959 [384], p. 22, and Coyaud, 1963 [143].

"The expense of transcribing such documents in their entirety will be justifiable to a limited extent only and it may, therefore, be assumed that automatic processing will be mainly applied to future literature." 1/

"As long as we are limited to using the equipment that is available now, the preparation of data for input will be an expensive procedure and a major cost factor in automatic processing of natural language." 2/

"... In a discussion of indexing by machine, we must recognize the preparation of input to the system as the major item of cost of operation." 3/

"Present inability to read documents automatically would make it necessary to punch cards or tapes, an operation likely to be even more expensive than reading by humans." 4/

In addition to the high costs of manual retranscription, it is also noted that keypunching "tends to undermine the purpose of natural text retrieval by requiring human effort at the input end of the process." 5/ In particular, keypunching or keystroking requirements undermine the purposes of rapid indexing as well as filing for retrieval by virtue of the time required to transcribe text. Horty and Walsh report, for example: "Flexowriter operators can produce between 1400 and 1800 lines per day of statutory text. Keypunch operators used in previous experiments could punch approximately 100 lines per hour of alphabetic materials, but could not maintain this rate for a sustained period of time."
" 6/ Thus, until such time as more versatile character recognition equipment is available, even some of the most ardent advocates of full text processing are forced to the use of considerably less than full text for other than research purposes. Swanson comments, for example: "... One must note that the manual recording of text may be exorbitantly expensive. If so, a judicious selection process may permit a reasonable compromise between the expense of input and the depth of indexing which results. For example, it is reasonable to select the title, abstract, table of contents (if any), sub-headings, and key sentences or paragraphs." 7/ 1/ Luhn, 1959 [384], p. Z. Ray, 1961 [496], p. 51. 3/ Howerton, 1961 [z8z], p. 3Z7. 4/ Levery, 1963 [359], p. Z35. 5/ Doyle, 1959 [168], p. Z. 6/ Horty and Walsh, 1963 [z80], p. Z59. 7/ Swanson, 1963 [580], p. 1. 167 "Costs come much more into line if we make available to the machine something on the order of one per cent of the full text. Then, 0£ course, the problem of selecting that one per cent presents itself. I' 1/ 8. Z Examples of Processing Considerations A second major area of operational considerations involves the machine processing problems, given a specified input. For most of the automatic derivative, and modified or normalized derivative, schemes, this is primarily a question of the limitations of machine language to a vocabulary of, typically, no more than 64 distinct characters for input, internal manipulation, and output. In addition, the limited number of characters that can ~e packed into a single machine-word complicates internal processing, storage, file look- up (i.e., against exclusion or inclusion lists), and sorting operations. Arbitrary truncation of text words to, say, 6 characters per word, leads to certain computer processing or storage economics. 
However, it leads also to complications in the selection of words either to be included (clue word lists) or excluded (stop lists) in many of the proposed methods both for derivative and for assignment indexing. Additional problems of artificial homography are created. Obvious examples are "Probab-le, -ility"; "Condit-ion, -ional"; "Freque-nt, -ntly, -ncy"; "Commun-ity, -ication, -al"; and the like. Barnes and Resnick include in their studies of the effectiveness of an SDI system 2/ the use of 6 different truncation levels (from 4 to 9 characters). No significant differences were found in terms of the number of hits (matches of a new item to a user's profile which he considered to be of definite interest to him), but there were significant differences in the number of notifications sent him, as presumably matching his interest, and the amount of "trash" (irrelevant items) among these notifications.

The importance of the selection criteria in derivative indexing, operationally considered, is largely a matter of the length and the contents of the stop lists. Variability in practice among the various producers of KWIC indexes has previously been noted, 3/ but there are some interrelated and interlocking factors which affect the quality, the costs, and the customer acceptance of this type of machine-generated index. First, the number of pages in a printed index is directly related to the total costs of producing that index. 4/ The amount of material covered on a single page can be increased by photographic or other type of reduction (e.g., the 96 lines per page of the Bell Laboratories KWIC program output are reduced by xerography to 62 percent of the machine output page size) (Kennedy, 1961 [311]), but the reduction must not be such as to exceed reasonable limits of legibility. This, in turn, means that the number of entries generated for each title (obviously, a function of the words that survive stop list purging) needs to be held to a reasonable minimum.
Thus:

"One of the major limitations of the published index stems from the conflict between the quantity of text that must be placed between the covers and the capacity of the printed page to handle it. The size of the page and the legibility of the printing determines the maximum density of characters which can be read without special aids." 5/

1/ Swanson, 1962 [584], pp. 470-471.
2/ Barnes and Resnick, 1963 [36]. See also p. 148 of this report.
3/ See discussion, pp. 65-66.
4/ See Markus, 1963 [394], p. 16.
5/ Tame, 1961 [592], p. 153.

The question of stop list effectiveness therefore becomes an operational factor as well as one that may affect the quality and acceptability of the product. On the other hand, too generous a purging of the input titles may of course reduce the utility of the title index by the elimination of too many potential access points and, in particular, many that users may be most tempted to look for.

A related problem has to do with the number of pages required because of the length of the title line allowed in the listings. A suggestion advanced by Brandenberg (1963 [80]) is the assignment of numeric codes to the machine stop words used and the insertion of these codes into the listed title line in the place of these presumably insignificant words. Thus one of the KWIC entries for the title, "Determining Aspects of the Russian Verb from Context in Machine Translation", might go from:

    RMINING ASPECT OF THE         CONTEXT IN MACHINE TRANSLATION. /DETE

to:

    ERMINING 032 416 712 RUS      CONTEXT 308 MACHINE TRANSLATION. /DET

This particular example was picked at random from a KWIC index utilizing a 103-106 character title line, 1/ but it was deliberately shortened to the 60-character line length found in many such indexes in order to illustrate effects of chopping and wrap-around.
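The fixed-width line, the wrap-around of the title past the right edge, and the chopping of whatever still does not fit can be sketched schematically as follows. This is an illustrative Python sketch only: the line width, keyword column, and the use of "/" as an end-of-title mark follow the example above, not the conventions of any particular KWIC program.

```python
def kwic_line(title, keyword, width=60, slot=22):
    """One KWIC entry on a fixed-width line.

    The keyword is aligned at column `slot`; the rest of the title
    follows it, wraps around to the left edge of the same line, and
    anything that still does not fit is chopped (lost). A "/" marks
    the end of the original title.
    """
    text = title.upper()
    i = text.index(keyword.upper())
    # Cyclic form of the title: keyword..end, "/", start..keyword.
    full = text[i:] + ". /" + text[:i]
    right = full[:width - slot]            # keyword and what follows it
    left = full[width - slot:width]        # wrap-around; the rest is chopped
    return left.ljust(slot) + right

title = ("Determining Aspects of the Russian Verb "
         "from Context in Machine Translation")
print(kwic_line(title, "Context"))
```

Run on the example title, the sketch reproduces the behavior discussed in the text: the words immediately preceding the keyword ("Russian Verb from") are the ones chopped from the 60-character line, while the opening of the title wraps around to the left edge.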
Coincidentally, it also illustrates some of the difficulties of designing a well-balanced exclusion list, since in this case the purged word "aspect" is apparently being used in a technical sense rather than in the common one of "Various aspects of...". By accident, this case does show rather severe "aspects" of the chopping problem in the loss also, for this entry, of "Russian" and "verb", although they would of course be picked up in the entry blocks for these words. Certainly, however, the claimed advantages of context checking are not striking, even without the introduction of the numeric codes. It is true that for excluded words longer than those in our example the possible conservation of character-space to reduce the chopping effects for the same length line may result in improvements. However, the replacement of, for example, "Preliminary investigations of..." by numeric codes would hardly assist the user in determining quickly from the many possible entries under "..." which he should select for further personal perusal.

Turning to the case of automatic assignment indexing, the processing considerations likely to be involved in operational factors affecting the evaluation of a system are much less easily exemplified. Obviously, conditions that hold for research experiments on small (and usually, especially selected) samples do not necessarily relate to requirements in potential productive applications. Exceptions are the problems of the sizes of term-term and term-document co-occurrence correlation matrices that can be readily manipulated, previously mentioned, 2/ and the concurrent problems of the size, and hence the representativeness, of inclusion lists or clue-word vocabularies that can be accommodated. Both Maron and Borko found, even in their limited test samples, a certain proportion of new items that could not be indexed or categorized at all because these new items did not contain any of the clue words recognizable by the system.
3/

Due perhaps to longer selective clue word lists, as well as to the special nature of his items, Swanson found no instances, for 775 test items, of failure to assign because of lack of indicative clues in the input material. In the case of 60 tests against the SADSACT model, which uses approximately 1,600 words drawn from a "teaching sample" of items previously indexed to descriptors (related by frequency of co-occurrence to any of 70-odd descriptors with whose assignment they had co-occurred), the machine had a sufficient basis in the input material for the derivation of a selection score for at least 12 descriptors for each new item. The items were closely similar to, though not identical with, the source items from which the word associations with descriptors assigned had been drawn. The sample is obviously critically small. Nevertheless, the possibility that extensive clue word lists, notwithstanding the incorporation of trivial and even erroneous associations, can be used as effectively as smaller, more precise, and more carefully tailored lists, but with significant gains in memory space or computational requirements, is suggestive.

1/ Walkowicz, 1963 [629], pp. 136 and 137.
2/ See pp. 108 and 160 of this report.
3/ See Maron, 1961 [395]; also Borko and Bernick, 1963 [78].

A somewhat related conclusion, again reflecting the effect of processing requirements, is stated by Needham as follows:

"The main point to be made is that theoretical elegance must be sacrificed to computational possibility: there is no merit in a classification program which can only be applied to a couple of hundred objects." 1/

In KWIC type derivative indexing by machine, except in terms of allowable character sets and word-lengths conveniently processed, the problem of appropriate programming languages does not arise to any serious extent.
For the processing of material in research on natural language text, however, the choice of interpretive and compiler types of automatic programming languages may involve computational requirements which, while inappropriate in a production situation, offer considerable flexibility and versatility for experimental purposes. Examples of special programs of this type include the use of Yngve's COMIT by Baxendale and Knowlton, the development and use of FEAT by Olney, Doyle, and others at SDC, and the use of list-processing techniques in the General Inquirer system. 2/

Yngve describes the use of his program as follows:

"COMIT has also been used in the experimental work in information retrieval of Baxendale and Knowlton at IBM. The purpose of their COMIT program was to accept as input the title of a document and to produce as output, not only descriptors, but pairs of descriptors which are roughly of the form adjective-noun. The purpose of the work is to automatically generate, from document titles, retrieval words of a more specific nature than simply Boolean functions of the existence of certain words in a title." 3/

The FEAT program was designed originally for word and significant-word-pair frequency counts. Olney describes the program, in part, as follows:

"FEAT is designed to perform frequency and summary counts of words and word pairs occurring in its natural text input; i.e., text written in ordinary English and transcribed into Hollerith code according to some set of keypunching rules. To focus attention on the semantic aspects of word pairs rather than on their syntactic aspect, pairs of which one member is a function word, such as `the', `is', `by', etc., are excluded.

"Using a bucket list structure of the type proposed by C. J. Sheen in FN-1634, the program sorts each incoming word serially, constructing a list within each of 256 buckets for good words of a given alphabetic range ...
and another list within each good word entry for the Doubles and Reverses which will be ordered alphabetically on that word ... If there are four different Double types of which the first word is `external', the addresses of the four different second words form a new list which is linked to the entry for `external'. Each word type occurs only once in core, and all word pairs of which it is a member refer to it by means of its core addresses.

"The program could process millions of words, automatically generating frequency counts far larger than the Thorndike and Lorge counts, which cost many man-years, and in addition, FEAT would provide complete lists of word pairs (Doubles and Reverses), which, so far as we know, have never been counted in a sample of appreciable size, despite their importance for semantic analysis of text."

1/ Needham, 1963 [433], p. 8.
2/ Stone et al., various references; see p. 137 of this report.
3/ Yngve, 1962 [655], p. 26.

FEAT is used, together with a modified version of the Proto-Synthex program, and special output formatting routines, for another SDC program, the Descriptor Word Index Program, which produces a content-word concordance for natural language text as well as statistics reflecting the type of words that occur, frequencies of occurrence, and positional data (Olney, 1960 [457], 1961 [456]; Stone, 1962 [574]).

The IPL-V list-processing language is used by Kochen in some of his work on simulated concept processing by machine. Programs for accepting sentences written in a formal language which was constructed of names and logical predicates (inserted either from a console or in the form of punched cards), for updating and re-organizing a file of such sentences, for storing and manipulating metalinguistic sentences such as "If X is author of Y and Y pertains to topic Z, then X has worked on Topic Z", for interrogating the file, and for tracing associations between names linked through various predicates, have been written in this language.
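The kind of counting FEAT performed, word frequencies together with adjacent word pairs (Doubles) and their reversed forms (Reverses), with any pair containing a function word excluded, can be sketched in modern terms as follows. This is an illustrative Python sketch only: the function-word list and sample text are invented for the example, and FEAT's bucket-list core storage is replaced by ordinary hash tables.

```python
from collections import Counter

# Illustrative function-word list; FEAT's actual exclusion list is
# not reproduced here.
FUNCTION_WORDS = {"the", "is", "by", "of", "a", "an", "in", "and", "to"}

def feat_counts(text):
    """Frequencies of words and of adjacent word pairs ("Doubles"),
    excluding pairs in which either member is a function word."""
    words = text.lower().split()
    word_freq = Counter(words)
    doubles = Counter(
        (w1, w2) for w1, w2 in zip(words, words[1:])
        if w1 not in FUNCTION_WORDS and w2 not in FUNCTION_WORDS)
    # "Reverses" are the same pairs keyed in reversed word order.
    reverses = Counter((w2, w1) for (w1, w2) in doubles.elements())
    return word_freq, doubles, reverses

freq, doubles, reverses = feat_counts(
    "the machine indexing of natural text and machine indexing of titles")
print(doubles.most_common(1))
```

The exclusion step is what makes the surviving pairs semantically suggestive: in the sample text, "machine indexing" survives as a Double, while pairs such as "of natural" are discarded because one member is a function word.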
1/

8.3 Output Considerations

Turning to operational problems of output, the question of limitations of computer printout language to, in most cases, a single set of upper case alphabetic characters, numerals, and a few special symbols 2/ is a serious factor in customer acceptance with respect to appearance -- format, legibility, readability. Involved here are questions previously mentioned. Where, in the only presently available outputs of machine-generated indexes, the KWIC type permuted title indexes, should the indexing access point "slot" be on the page? Should all or only part of the title be displayed? Should 60- or 106-character lines be used? More detailed discussion of these and related points is provided by, for example, Youden (1963 [658]), Kennedy (1962 [311]), and Brandenberg (1963 [80]).

A separate, but related, question is how much identification, and in what form, should be provided for the item itself, either directly as a part of the index entry or by cross-reference to the address of more detailed information. There seems to be quite general agreement that the typical user needs something more than author's name and title alone to guide him.

1/ Kochen et al., 1962 [328], p. 34.
2/ See, for example, Lipetz, 1960 [365], p. 252: "A disadvantage of keypunched cards, however, is the lack of capacity to record or to print other symbols than a one-case alphabet, one case of arabic numerals, and about a dozen punctuation marks and miscellaneous symbols. Citations in the scientific literature generally make use of a much larger number of significant symbols: multiple cases, multiple fonts, italics, boldface, Greek letters, mathematical symbols, etc." Note, however, that Chemical-Biological Activities, a digest produced by Chemical Abstracts Service, uses printouts of the modified IBM 1403 chain printer, using 120 characters (see Fig. 5).
1/

However, if the full bibliographic citation, perhaps the abstract as well, is to be printed out by machine, the problems of limited character set are even more severe. This problem is today being solved, in some cases, by separate operations involving sorting and assembly of the full citations and abstracts of the items indexed, separately prepared, for photographic reproduction or typesetting. Hopefully, this partial solution will become obsolete as automatic type-composition equipment and computer-prepared typesetting techniques become more generally available.

Operational considerations thus involve the costs, the availability, and the limitations of equipment now usable for machine-generated index production. Schultz and Schwartz report, as of October 1962:

"There are two major bottlenecks in automated index production caused by inadequate equipment development at the present state-of-the-art:

"1. There is no way of using automatic input of the printed page or the indexer's notes;

"2. There is insufficient flexibility in the forms of output available for a computer-produced index.

Both of these areas are being worked on by equipment manufacturers, and an early solution has been promised." 2/

In general, operational considerations of this type do not affect the appraisal of automatic assignment indexing techniques, because these have not yet been developed to the point of practical application on any realistic scale. Moreover, the difficulties of problem definition and basic understanding of language and meaning yet remaining to be resolved are such that radical new advances in computer technology, associative memories, character readers, and pattern recognition devices may completely alter the picture before practical systems are ready for operational tests.
Thus, for example, it is claimed:

"It appears desirable to begin experimentation with automatic indexing so that solutions will become known by the time character recognition equipment will have passed the laboratory stage." 3/

Similarly, Doyle suggests that the "present rate of solution of the intellectual problems of IR is sufficiently slow that these advanced devices will be in common use long before IR will truly benefit from their presence", and he urges that researchers proceed as though such machines were already with us. 4/

1/ Compare, for example, Montgomery and Swanson, 1962 [421], p. 366: "This study suggests that indexing should be based on more than titles and that a bibliographic citation system should present to the requestor something more than titles"; see also, in addition to references cited, p. 61, footnote 1, IBM "ACSI-matic auto-abstracting project ...", Vol. 3, 1961 [290], p. 89: "The use of titles in document searching without any additional abstract seems to lead to a high number of errors, i.e., accepting documents which should be rejected, as not enough information is available to judge the pertinence of documents."
2/ Schultz and Schwartz, 1962 [531], p. 432.
3/ Levery, 1963 [359], p. 235.
4/ Doyle, 1961 [169], p. 3.