NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report

Indexes Generated by Machine - Automatic Derivative Indexing (chapter)
Mary Elizabeth Stevens, National Bureau of Standards

3.3.3 Relative Frequency Techniques - Edmundson and Wyllys, and Others

The first comprehensive critique of word frequency approaches to automatic extracting and indexing was undoubtedly that of Bar-Hillel (1959 [33], 1960 [34]), followed closely by Edmundson and Wyllys (1961 [181]), who themselves have experimented with various alternative or improved methods for obtaining measures of word significance by statistical analysis. These critics have been in agreement both on many points of specific criticism and on suggested possibilities for amelioration of observed difficulties, especially in terms of considering relative word frequencies within a particular subject field. In addition, several other investigators independently proposed a relative frequency approach at about the same time. 1/

Some typical expressions of opinion on the importance of relative frequency criteria are as follows:

"Let me propose here a system of auto-indexing which, to my knowledge, has never been publicly proposed before in this form and which seems to me superior to any other system I have heard of ... Assume that ... we are given a list of the average relative frequencies of all English 'words' ... It would then be possible, for any given document, to rank-order all the 'words' occurring in this document according to the excess of their relative frequency within the document over their average relative frequency. By some mechanically implementable standard or other, an initial segment of this list is selected as the index-set." 2/

"Very general considerations from information theory suggest that a word's information should vary inversely with its frequency rather than directly, its lower probability evidencing greater selectivity or deliberation in its use.
It is the rare, special, or technical word that will indicate most strongly the subject of an author's discussion. Here, however, it is clear that by 'rare' we must mean rare in general usage, not rare within the document itself. In fact it would seem natural to regard the contrast between the word's relative frequency f within the document and its relative frequency r in general use ... as a more revealing indication of the word's value in indicating the subject matter of a document." 3/

1/ Compare, for example, Kochen, 1963 [327], p.7: "The idea of contrasting words which occur frequently in a document against the frequency of this word in the background language for purposes of selecting index terms seem to have been suggested first by Bohnert and the author, then described in more detail by Edmundson and Wyllys, and tested empirically by Damerau. Something similar was suggested even earlier by Bar-Hillel." See Bar-Hillel, 1962 [35], p.418, footnote, with respect to himself, Edmundson, and Bohnert. See also, however, Doyle 1962 [163], p.388: "Edmundson and Wyllys were probably the first to publicly advocate contrasting word frequencies within a document to word frequencies within a given field and using these relative frequencies as criteria for scoring and selecting sentences."

2/ Bar-Hillel, 1959 [33], pp 4-8-9.

3/ Edmundson and Wyllys, 1961 [181], p.227.
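
The ranking procedure in the Bar-Hillel passage quoted above can be illustrated with a minimal sketch. It is not drawn from any of the cited papers: the function names, the tokenization rule, the treatment of words absent from the background list (given r = 0), and the fixed cutoff `index_size` (standing in for the "mechanically implementable standard") are all assumptions made here for illustration. The background frequencies are invented and are scaled to suit a very short sample text.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word to its share of all word tokens in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    return {word: n / total for word, n in counts.items()} if total else {}


def rank_by_excess(document, background, index_size=3):
    """Rank document words by the excess f - r; keep the first `index_size`.

    f is the word's relative frequency within the document; r is its
    average relative frequency in general use. Words missing from
    `background` are given r = 0.0 (an assumption of this sketch).
    """
    f = relative_frequencies(document)
    excess = {word: f[word] - background.get(word, 0.0) for word in f}
    # Descending excess; ties are broken alphabetically for determinism.
    ranked = sorted(excess, key=lambda word: (-excess[word], word))
    return ranked[:index_size]


# Hypothetical background frequencies, invented for illustration only.
BACKGROUND = {"the": 0.20, "of": 0.15, "a": 0.08, "is": 0.04}

DOCUMENT = (
    "The index of a document is a list of index terms. "
    "The index terms of the document indicate the subject of the document."
)

index_set = rank_by_excess(DOCUMENT, BACKGROUND)
print(index_set)
```

With these toy inputs the selected index-set is `document`, `index`, `terms`: the most frequent word in the text, "the", is excluded because its frequency within the document barely exceeds its assumed frequency in general use. Other measures of the contrast between f and r could be substituted in `rank_by_excess` without changing the rest of the sketch.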