MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Operational Considerations chapter
Mary Elizabeth Stevens
National Bureau of Standards

"Costs come much more into line if we make available to the machine something on the order of one per cent of the full text. Then, of course, the problem of selecting that one per cent presents itself." 1/

8.2 Examples of Processing Considerations

A second major area of operational considerations involves the machine processing problems, given a specified input. For most of the automatic derivative, and modified or normalized derivative, schemes, this is primarily a question of the limitation of machine language to a vocabulary of, typically, no more than 64 distinct characters for input, internal manipulation, and output. In addition, the limited number of characters that can be packed into a single machine-word complicates internal processing, storage, file look-up (i.e., against exclusion or inclusion lists), and sorting operations.

Arbitrary truncation of text words to, say, 6 characters per word leads to certain computer processing or storage economies. However, it also complicates the selection of words either to be included (clue word lists) or excluded (stop lists) in many of the proposed methods both for derivative and for assignment indexing. Additional problems of artificial homography are created. Obvious examples are "Probab-le, -ility"; "Condit-ion, -ional"; "Freque-nt, -ntly, -ncy"; "Commun-ity, -ication, -al"; and the like. Barnes and Resnick include in their studies of the effectiveness of an SDI system 2/ the use of 6 different truncation levels (from 4 to 9 characters).
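The artificial homography that truncation creates can be illustrated with a minimal sketch. The function names and the small vocabulary below are assumptions made for illustration (the vocabulary is drawn from the word families quoted above); the code is not a reconstruction of any program described in the report. It truncates each word to a fixed length and reports the stems shared by two or more distinct words, at each of the truncation levels from 4 to 9 characters.

```python
from collections import defaultdict


def truncate(word, length=6):
    """Keep only the first `length` characters of the upper-cased word."""
    return word.upper()[:length]


def artificial_homographs(words, length=6):
    """Group distinct words that become identical after truncation."""
    groups = defaultdict(set)
    for word in words:
        groups[truncate(word, length)].add(word.upper())
    # Only stems shared by two or more distinct words count as homographs.
    return {stem: sorted(forms) for stem, forms in groups.items() if len(forms) > 1}


# Hypothetical vocabulary built from the word families cited in the text.
VOCABULARY = [
    "probable", "probability",
    "condition", "conditional",
    "frequent", "frequently", "frequency",
    "community", "communication",
]

if __name__ == "__main__":
    for length in range(4, 10):  # six truncation levels: 4 through 9 characters
        collisions = artificial_homographs(VOCABULARY, length)
        print(length, len(collisions), sorted(collisions))
```

On this toy vocabulary, all four word families collide at 6 characters, while at 9 characters only "condition" and "conditional" remain indistinguishable. A stop list or clue word list keyed on truncated forms would therefore match every member of a colliding family at once.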
No significant differences were found in terms of the number of hits (matches of a new item to a user's profile which he considered to be of definite interest to him), but there were significant differences in the number of notifications sent to him as presumably matching his interest, and in the amount of "trash" (irrelevant items) among these notifications.

The importance of the selection criteria in derivative indexing, operationally considered, is largely a matter of the length and the contents of the stop lists. Variability in practice among the various producers of KWIC indexes has previously been noted, 3/ but there are some interrelated and interlocking factors which affect the quality, the costs, and the customer acceptance of this type of machine-generated index.

First, the number of pages in a printed index is directly related to the total costs of producing that index. 4/ The amount of material covered on a single page can be increased by photographic or other type of reduction (e.g., the 96 lines per page of the Bell Laboratories KWIC program output are reduced by xerography to 62 percent of the machine output page size) (Kennedy, 1961 [311]), but the reduction must not be such as to exceed reasonable limits of legibility. This, in turn, means that the number of entries generated for each title (obviously, a function of the words that survive stop list purging) needs to be held to a reasonable minimum. Thus:

"One of the major limitations of the published index stems from the conflict between the quantity of text that must be placed between the covers and the capacity of the printed page to handle it. The size of the page and the legibility of the printing determines the maximum density of characters which can be read without special aids." 5/

1/ Swanson, 1962 [584], pp. 470-471.
2/ Barnes and Resnick, 1963 [36]. See also p. 148 of this report.
3/ See discussion, pp. 65-66.
4/ See Markus, 1963 [394], p. 16.
5/ Tame, 1961 [592], p. 153.
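The relation between stop list contents, entries per title, and page count discussed in this section can be sketched as follows. This is a hedged illustration, not the Bell Laboratories program: the stop list, the rotation format, the function names, and the assumption of one index entry per printed line are all hypothetical. Only the figure of 96 lines per page comes from the text.

```python
import math

# Hypothetical stop list; real KWIC producers used lists of varying length.
STOP_LIST = {"A", "AN", "AND", "FOR", "IN", "OF", "ON", "THE", "TO"}


def kwic_entries(title, stop_list=STOP_LIST):
    """Return one (keyword, rotated title) entry per word that survives the stop list."""
    words = title.upper().split()  # simplification: whitespace split, no punctuation handling
    entries = []
    for i, word in enumerate(words):
        if word in stop_list:
            continue  # purged words generate no index entry
        head, tail = words[i:], words[:i]
        # Rotate the title so the keyword leads; "/" marks the original title start.
        rotated = " ".join(head + (["/"] + tail if tail else []))
        entries.append((word, rotated))
    return entries


def pages_needed(titles, stop_list=STOP_LIST, lines_per_page=96):
    """Total entries and page count, assuming one entry per printed line."""
    total = sum(len(kwic_entries(title, stop_list)) for title in titles)
    return total, math.ceil(total / lines_per_page)


if __name__ == "__main__":
    titles = ["Automatic Indexing of the Technical Literature"] * 100
    print(pages_needed(titles))               # with the stop list
    print(pages_needed(titles, stop_list=set()))  # with no stop list at all
```

Under these assumptions, the six-word sample title yields four entries with the stop list and six without it, so 100 such titles need 5 pages rather than 7. The sketch shows only the direction of the effect; actual page counts depend on line width, reduction ratio, and the stop list actually used.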