MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Chapter: Indexes Generated by Machine-Automatic Derivative Indexing
Mary Elizabeth Stevens, National Bureau of Standards

"As it is currently run, the auto-indexing program selects about one word in ten as a keyword in articles of three thousand words or less. In articles longer than three thousand words it tends to pick about one word in fifteen. This high incidence of keywords naturally increases the amount of noise results returned by the query program, although good search strategy cuts them down considerably." 1/

As of October 1963, the system was reported to be fully operative although not as yet extensively tested in actual use. Gallagher and Toomey give illustrative auto-extract results on two tested papers, one being Luhn's own "Automatic Creation of Literature Abstracts". They give comparative results for manual versus machine selection of keywords as index or search terms with 88.6 percent agreement, the human indexers having selected, in 6 tests reported, 132 words and the machine method 117. Modifications under consideration include pre-edit flagging of terms in author and cited-reference fields for special weighting, setting the length of the abstract as a function of the total number of words in an item, and, in the search program, generating additional search terms by means of association factor techniques such as those suggested by Stiles.

To the basic approach of straight-forward word frequency counting, Luhn himself has suggested that improvements might be obtained from considering closely adjacent words, 2/ word pairs, 3/ and reference to vocabularies specific to a given field. 4/ Other possibilities are capitalized words and lookup against an inclusion list. He also suggests: "If certain words could be given in their relationships to other words, more specific meanings may be identified by such combinations.
These relationships may range from the mere co-occurrence of certain words within a phrase or sentence to the combinations of specific parts of speech." 5/

Various investigators have proceeded to explore these and other possible improvements, including incorporation of relative frequency information, use of information about distances between high-ranked significant words, word pairs and word n-tuples,

1/ Gallagher and Toomey, 1963 [205], p. 51.
2/ Luhn, 1959 [384], p. 10.
3/ Luhn, 1962 [373], p. 11.
4/ Luhn, 1959 [384], pp. 8 and 10.
5/ Ibid., p. 5.
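
The frequency-based keyword selection described on this page, together with two of the refinements attributed to Luhn (adjacent word pairs and lookup against an inclusion list), can be illustrated with a minimal sketch. It is not the Gallagher and Toomey program, nor Luhn's own procedure: the stop list, the `min_count` threshold, and the function names `tokenize`, `select_keywords`, and `select_word_pairs` are all assumptions introduced here for illustration.

```python
import re
from collections import Counter

# Hypothetical stop list (an assumption); a working system would use a
# much larger exclusion list of common function words.
STOP_WORDS = frozenset({
    "the", "of", "a", "an", "and", "in", "to", "is", "by",
    "for", "on", "as", "are", "be", "that", "with",
})


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


def select_keywords(text, min_count=2, inclusion_list=frozenset()):
    """Select non-stop words that occur at least min_count times,
    plus any non-stop word found on the (assumed) inclusion list."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return {w for w, n in counts.items()
            if n >= min_count or w in inclusion_list}


def select_word_pairs(text, min_count=2):
    """Select pairs of immediately adjacent non-stop words that recur
    at least min_count times in the text."""
    tokens = tokenize(text)
    pairs = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    )
    return {pair for pair, n in pairs.items() if n >= min_count}
```

Calling `select_keywords(text)` on the full text of an item yields its candidate index terms, and `select_word_pairs(text)` yields recurring two-word phrases. Raising `min_count` for longer items would lower the proportion of words selected, which is one plausible way to obtain the kind of length-dependent selection rate quoted above; the source does not say how the reported program achieves it.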