NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Chapter: Indexes Generated by Machine-Automatic Derivative Indexing
Mary Elizabeth Stevens, National Bureau of Standards

Actual experiments in application of relative frequency techniques to automatic extracting processes have been pursued since 1959 by various investigators. Edmundson and Wyllys and Damerau (1963 [148]) were certainly among the first. Edmundson and Bohnert were engaged in experimental investigations at Planning Research Corporation in 1959, 1/ and the following year Edmundson, Oswald, and Wyllys worked on the auto-indexing and auto-extracting of the 40,000 words of text contained in nine articles in the subject field of missilery. 2/ Wyllys has continued work on relative frequencies (1963 [6533]). At the System Development Corporation, Doyle has also explored the relative frequency approach in some of his work (1961 [161]). An example in Europe is work reported by Meyer-Uhlenried and Lustig, where significant keywords from abstracts are used not only directly as indexing terms but also, by means of keyword lists and micro-thesauri, to assign documents to specific subject fields (1963 [417]).

3.3.4 Significant Word Distances

Another technique that has been investigated for the improvement of automatic extraction operations based on the statistics of word frequencies is that of distances between significant words. The desirability of attaching greater weight to n-tuples of immediately adjacent words and to the co-occurrences of words within the same sentence has been mentioned previously. Savage, in relatively early work developing some of the initial proposals of Luhn, considered intra-sentence distances between significant words as follows:

"The criterion is the relationship of the high-frequency words to each other, rather than their distribution over the whole sentence.
Consequently, it seems reasonable to consider only those portions of sentences which are bracketed by high-frequency words and to set a limit for the distance at which any two such words shall be considered as being significantly related . . . An analysis of many sentences and many documents indicates that a useful limit is four or five non-significant words between any two high-frequency words." 3/

Doyle has also noted the tendency of words that are in fact highly related in a content-revealing sense to co-occur in the same sentence or as quite direct neighbors. The same investigator has also suggested that word distances can be used to provide "clustering" effects that might, for example, sort out the possibly different topics covered in introductory or background discussions, the main text, and various appendices. 4/

1/ National Science Foundation's CR&D Report No. 5 [430], p. 33; Bar-Hillel 1962 [35], p. 418.
2/ National Science Foundation's CR&D Report No. 6 [430], pp. 43-44.
3/ Savage 1958 [521], p. 4. Later related work has included a method for generating auto-extracts which adds to the high-frequency word sentence scores a correction factor for the number of words in gaps between such words. (See Rath et al., 1961 [493].)
4/ Doyle 1961 [166], p. 7.
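The intra-sentence distance criterion quoted from Savage in Section 3.3.4 can be illustrated with a minimal sketch. It is not taken from the report: the function name `bracketed_spans`, the parameter `max_gap`, and the decision to discard isolated significant words are all assumptions made for illustration. The sketch only locates the portions of a sentence bracketed by significant words; it does not reproduce any particular scoring formula.

```python
from typing import List, Set, Tuple


def bracketed_spans(tokens: List[str],
                    significant: Set[str],
                    max_gap: int = 4) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) token indices of sentence portions
    that begin and end with a significant word and in which consecutive
    significant words are separated by at most `max_gap` other words.

    Assumptions (hypothetical, not from the source): `significant` holds
    lowercase words, and a significant word with no related neighbour is
    dropped because it brackets nothing.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index of the first significant word in the current span
    last = None   # index of the most recent significant word

    def close(s: int, e: int) -> None:
        # Keep only spans bracketed by two distinct significant words.
        if e > s:
            spans.append((s, e))

    for i, tok in enumerate(tokens):
        if tok.lower() not in significant:
            continue
        if last is None:
            start = i
        elif i - last - 1 > max_gap:
            # Too many intervening words: end the current span here.
            close(start, last)
            start = i
        last = i

    if start is not None:
        close(start, last)
    return spans


sentence = ("the frequency of words in a long document may not "
            "on the whole be related to indexing").split()
keywords = {"frequency", "words", "document", "indexing"}
print(bracketed_spans(sentence, keywords, max_gap=4))
```

With the limit set to four intervening words, "frequency", "words", and "document" fall into one bracketed portion, while "indexing" lies too far away and is left out. Lowering `max_gap` splits the portion further, which is one way to see how sensitive the result is to the chosen limit.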