MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
Actual experiments in the application of relative frequency techniques to automatic
extracting processes have been pursued since 1959 by various investigators. Edmundson
and Wyllys and Damerau (1963 [148]) were certainly among the first. Edmundson and
Bohnert were engaged in experimental investigations at Planning Research Corporation
in 1959, 1/ and the following year Edmundson, Oswald, and Wyllys worked on the
auto-indexing and auto-extracting of the 40,000 words of text contained in nine articles in the
subject field of missilery. 2/ Wyllys has continued work on relative frequencies
(1963 [6533]). At the System Development Corporation, Doyle, in some of his work, has also
explored the relative frequency approach (1961 [161]). An example in Europe is work
reported by Meyer-Uhlenried and Lustig, where significant keywords from abstracts are
used not only directly as indexing terms but also, by means of keyword lists and
micro-thesauri, to assign documents to specific subject fields (1963 [417]).
3.3.4 Significant Word Distances
Another technique that has been investigated for the improvement of automatic
extraction operations based on the statistics of word frequencies is that of distances between
significant words. The desirability of attaching greater weight to n-tuples of immediately
adjacent words and to the co-occurrences of words within the same sentence has been
mentioned previously. Savage, in relatively early work developing some of the initial
proposals of Luhn, considered intra-sentence distances between significant words as
follows:
"The criterion is the relationship of the high-frequency words to each other,
rather than their distribution over the whole sentence. Consequently, it seems
reasonable to consider only those portions of sentences which are bracketed by
high-frequency words and to set a limit for the distance at which any two such
words shall be considered as being significantly related . . . An analysis of many
sentences and many documents indicates that a useful limit is four or five
non-significant words between any two high-frequency words." 3/
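The bracketing rule in the passage quoted above can be illustrated with a short sketch in Python. It is not Savage's program: the function names, the sample sentence, and the set of "significant" words are all hypothetical, and the scoring function (square of the number of significant words in a bracketed portion, divided by the length of that portion) is the scoring commonly attributed to Luhn rather than anything stated in the quotation. Only the gap limit of four or five non-significant words comes from the text.

```python
def bracketed_spans(tokens, significant, max_gap=4):
    """Return inclusive (start, end) index pairs for the portions of a
    sentence bracketed by significant words, where neighbouring significant
    words are separated by at most max_gap non-significant words."""
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in significant]
    spans = []
    if not positions:
        return spans
    start = prev = positions[0]
    for pos in positions[1:]:
        # Number of non-significant words lying between two significant ones.
        if pos - prev - 1 > max_gap:
            spans.append((start, prev))  # close the current portion
            start = pos                  # and open a new one
        prev = pos
    spans.append((start, prev))
    return spans


def span_score(tokens, significant, span):
    """Assumed Luhn-style score: (significant words in span)^2 / span length."""
    start, end = span
    hits = sum(1 for tok in tokens[start:end + 1] if tok.lower() in significant)
    return hits * hits / (end - start + 1)


# Hypothetical sentence and hypothetical high-frequency word list.
tokens = ("the indexing of documents by machine requires that word frequency "
          "counts be made first for all of the text before any indexing").split()
significant = {"indexing", "documents", "machine", "frequency"}

for span in bracketed_spans(tokens, significant, max_gap=4):
    print(span, tokens[span[0]:span[1] + 1], span_score(tokens, significant, span))
```

In this sketch an isolated significant word forms a one-word portion of its own; a stricter reading of "bracketed by high-frequency words" might discard such portions. Lowering `max_gap` splits the sentence into more and shorter portions.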
Doyle has also noted the tendency of words that are in fact highly related in a
content-revealing sense to co-occur in the same sentence or as quite direct neighbors. The same
investigator has also suggested that word distances can be used to provide "clustering"
effects that might, for example, sort out the possibly different topics covered in
introductory or background discussions, the main text, and various appendices. 4/
1/ National Science Foundation's CR&D Report No. 5 [430], p. 33; Bar-Hillel
   1962 [35], p. 418.
2/ National Science Foundation's CR&D Report No. 6 [430], pp. 43-44.
3/ Savage 1958 [521], p. 4. Later related work has included a method for generating
   auto-extracts which adds to the high-frequency word sentence scores a correction
   factor for the number of words in gaps between such words. (See Rath et al., 1961
   [493].)
4/ Doyle 1961 [166], p. 7.