MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Indexes Generated by Machine-Automatic Derivative Indexing chapter Mary Elizabeth Stevens National Bureau of Standards

"We naturally find that the words of greatest interest are those for which there exists the greatest contrast between general usage frequency and local (within the article) usage frequency." 1/

"Luhn has bypassed syntactical analysis by taking advantage of the information content of the most frequently used topical words in articles ... Edmundson et al take a further step in a desirable direction by bringing in information from outside the article being analyzed: words and terms are given greater topical value as the contrast increases between the frequency of use within the article and the rarity of general usage." 2/

"A further refinement of the process of automatic analysis would be the development of special sets of reference frequencies for special fields of interest. This would have two benefits: it would become possible to classify documents as to field, and it would become possible to note the significance of words which are frequent in the document and frequent in a very large reference class c0 of literature (i.e., these words would not be significant with respect to c0) but which are rare in the special field. For example, the word 'emotion' might be too common in general usage to seem significant, but frequent occurrence of the word would stand out in a paper on electronic circuitry (e.g., of a robot) when compared with its frequency in general electrical engineering literature." 3/

"One of the ... goals is to investigate a relative-frequency approach to the categorization of documents. ... For this investigation it will be necessary to develop sets of reference frequencies for words used in different subject fields.
It was suggested by Edmundson and Wyllys that these sets of reference frequencies, when developed, could be used to categorize a document as belonging to a particular subject-field, by means of measuring the degree of matching (e.g., with the chi-squared test) between the proportional frequencies of words in the documents and the sets of reference frequencies." 4/

Two points in the comments quoted above appear especially worthy of note. The first is that of introducing at least some measure of reference to material other than the individual author's own choice of linguistic expression and specific terms. We shall discuss this factor in more detail in a later section of this report. The second point, derived in part from the first, is the specific suggestion of movement away from purely derivative indexing by machine in the direction of automatic assignment indexing and automatic categorization or classification.

1/ Doyle, 1959 [165], p. 9.
2/ Doyle, 1961 [169], p. 3.
3/ Edmundson and Wyllys, 1961 [181], p. 228.
4/ Wyllys, 1963 [653], p. 10.
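
The two ideas quoted in this section can be illustrated with a minimal modern sketch. The first is weighting a word by the contrast between its frequency in the document and its frequency in a reference class. The second is assigning a document to the subject field whose reference frequencies it matches best under a chi-squared measure. The sketch is not drawn from the cited authors. The function names, the ratio used as the contrast measure, the floor applied to words absent from the reference set, and the restriction of the chi-squared statistic to the reference vocabulary are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def contrast_scores(doc_tokens, reference_freq, floor=1e-6):
    """Score each word by (proportional frequency in the document) divided by
    (proportional frequency in the reference class).

    Assumption: words missing from reference_freq are given the small
    frequency `floor`, so they receive very high scores.
    """
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {
        word: (count / total) / max(reference_freq.get(word, 0.0), floor)
        for word, count in counts.items()
    }


def chi_squared(doc_tokens, reference_freq):
    """Pearson chi-squared statistic between the observed word counts and the
    counts expected under reference_freq.

    Assumptions: only words listed in reference_freq are compared, and every
    reference frequency is strictly positive.
    """
    counts = Counter(t for t in doc_tokens if t in reference_freq)
    n = sum(counts.values())
    if n == 0:
        return math.inf  # no overlap with this field's vocabulary
    norm = sum(reference_freq.values())
    stat = 0.0
    for word, freq in reference_freq.items():
        expected = n * freq / norm
        observed = counts.get(word, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


def categorize(doc_tokens, fields):
    """Return the name of the field whose reference frequencies give the
    smallest chi-squared statistic (the closest match)."""
    return min(fields, key=lambda name: chi_squared(doc_tokens, fields[name]))
```

With hypothetical reference sets in which 'emotion' is fairly common in general usage but rare in electrical engineering literature, `contrast_scores` gives the word a much higher score against the electrical engineering set than against the general one, which is the effect described in the third quotation. A practical system would also need reference frequencies estimated from real corpora and some handling of very small expected counts, neither of which is attempted here.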