MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
Mary Elizabeth Stevens
National Bureau of Standards
3.3.3 Relative Frequency Techniques - Edmundson and Wyllys, and Others
The first comprehensive critique of word frequency approaches to automatic extract-
ing and indexing was undoubtedly that of Bar-Hillel (1959 [33], 1960 [34]), followed closely
by Edmundson and Wyllys (1961 [181]), who themselves have experimented with various
alternative or improved methods for obtaining measures of word significance by statistical
analysis. These critics have been in agreement both on many points of specific criticism
and on suggested possibilities for amelioration of observed difficulties, especially in
terms of considering relative word frequencies within a particular subject field. In
addition, several other investigators independently proposed a relative frequency approach
at about the same time. 1/
Some typical expressions of opinion on the importance of relative frequency criteria
are as follows:
"Let me propose here a system of auto-indexing which, to my knowledge, has never
been publicly proposed before in this form and which seems to me superior to any
other system I have heard of ... Assume that ... we are given a list of the average
relative frequencies of all English `words' . . . It would then be possible, for any
given document, to rank-order all the `words' occurring in this document according
to the excess of their relative frequency within the document over their average
relative frequency. By some mechanically implementable standard or other, an
initial segment of this list is selected as the index-set." 2/
"Very general considerations from information theory suggest that a word's
information should vary inversely with its frequency rather than directly, its
lower probability evidencing greater selectivity or deliberation in its use. It is
the rare, special, or technical word that will indicate most strongly the subject
of an author's discussion. Here, however, it is clear that by `rare' we must
mean rare in general usage, not rare within the document itself. In fact it would
seem natural to regard the contrast between the word's relative frequency f
within the document and its relative frequency r in general use ... as a more
revealing indication of the word's value in indicating the subject matter of a
document." 3/
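The ranking procedure proposed in the first quotation can be illustrated with a short sketch. It is offered only as one possible reading, not as a reconstruction of any program run by Bar-Hillel or by Edmundson and Wyllys. The function names, the tokenizing rule, and the treatment of words absent from the background list (taken here as having an average relative frequency of zero) are assumptions of the sketch. The quotation leaves the selection standard open ("some mechanically implementable standard or other"), so a fixed list length stands in for it.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all word tokens in the text.

    Assumption: a 'word' is a run of ASCII letters, lower-cased.
    """
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def rank_by_excess(document, background_freq):
    """Rank the document's words by the excess of relative frequency.

    excess = (relative frequency within the document)
             - (average relative frequency in general use)

    Assumption: a word missing from background_freq is treated as
    having an average relative frequency of 0.0.
    """
    doc_freq = relative_frequencies(document)
    excess = {w: f - background_freq.get(w, 0.0) for w, f in doc_freq.items()}
    # Largest excess first; ties are broken alphabetically.
    return sorted(excess, key=lambda w: (-excess[w], w))


def index_set(document, background_freq, size):
    """Select an initial segment of the ranked list as the index set.

    Assumption: the selection standard is simply a fixed list length.
    """
    return rank_by_excess(document, background_freq)[:size]
```

Given a background list such as `{"the": 0.5, "cat": 0.1, "dog": 0.1}` (invented figures, for illustration only), a call like `index_set("the cat the dog cat", background, 2)` ranks "cat" and "dog" ahead of "the", since "the" occurs in the document slightly less often than its background rate. The second quotation speaks only of a "contrast" between f and r without fixing its form; the sketch uses the simple difference named in the first quotation.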
1/
Compare, for example, Kochen, 1963 [327], p.7: "The idea of contrasting words
which occur frequently in a document against the frequency of this word in the
background language for purposes of selecting index terms seem to have been
suggested first by Bohnert and the author, then described in more detail by
Edmundson and Wyllys, and tested empirically by Damerau. Something similar
was suggested even earlier by Bar-Hillel." See Bar-Hillel, 1962 [35], p.418,
footnote, with respect to himself, Edmundson, and Bohnert. See also, however,
Doyle 1962 [163], p.388: "Edmundson and Wyllys were probably the first to
publicly advocate contrasting word frequencies within a document to word fre-
quencies within a given field and using these relative frequencies as criteria for
scoring and selecting sentences."
2/
Bar-Hillel, 1959 [33], pp 4-8-9.
3/
Edmundson and Wyllys, 1961 [181], p.227.