MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
"We naturally find that the words of greatest interest are those for which there
exists the greatest contrast between general usage frequency and local (within the
article) usage frequency." 1/
"Luhn has bypassed syntactical analysis by taking advantage of the information
content of the most frequently used topical words in articles ... Edmundson et al
take a further step in a desirable direction by bringing in information from outside
the article being analyzed: words and terms are given greater topical value as the
contrast increases between the frequency of use within the article and the rarity of
general usage." 2/
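The contrast described in the two quotations above can be illustrated with a minimal sketch. It assumes that "contrast" is measured as the ratio of a word's relative frequency within the article to its relative frequency in general usage; the quoted authors do not commit to this particular formula here. The function name `contrast_scores`, the `floor` parameter, and the reference frequencies are hypothetical values chosen only for illustration.

```python
from collections import Counter

def contrast_scores(doc_tokens, reference_freq, floor=1e-6):
    """Score each word by local relative frequency / general relative frequency.

    Assumption: a simple ratio stands in for the "contrast" the quotations
    describe. `floor` is a hypothetical fallback for words that are absent
    from the reference table.
    """
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    scores = {}
    for word, n in counts.items():
        local = n / total                         # frequency within the article
        general = reference_freq.get(word, floor) # frequency in general usage
        scores[word] = local / general
    return scores

# Illustrative data only; these reference frequencies are invented.
doc = "the circuit uses the diode and the circuit uses the transistor".split()
reference = {
    "the": 0.06, "and": 0.03, "uses": 0.001,
    "circuit": 0.0001, "diode": 0.00001, "transistor": 0.00001,
}

scores = contrast_scores(doc, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under these invented figures, the common words "the" and "and" fall to the bottom of the ranking even though "the" is the most frequent word in the sample, while the topical words rise to the top.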
"A further refinement of the process of automatic analysis would be the development
of special sets of reference frequencies for special fields of interest. This
would have two benefits: it would become possible to classify documents as to
field, and it would become possible to note the significance of words which are
frequent in the document and frequent in a very large reference class c0 of
literature (i.e., these words would not be significant with respect to c0) but which
are rare in the special field. For example, the word `emotion' might be too
common in general usage to seem significant, but frequent occurrence of the word
would stand out in a paper on electronic circuitry (e.g., of a robot) when compared
with its frequency in general electrical engineering literature." 3/
"One of the ... goals is to investigate a relative-frequency approach to the categorization
of documents. ... For this investigation it will be necessary to develop
sets of reference frequencies for words used in different subject fields. It was
suggested by Edmundson and Wyllys that these sets of reference frequencies,
when developed, could be used to categorize a document as belonging to a particular
subject-field, by means of measuring the degree of matching (e.g., with the chi-squared
test) between the proportional frequencies of words in the documents and
the sets of reference frequencies." 4/
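The matching procedure suggested in the last quotation might be sketched as follows. This is an assumption-laden illustration, not the procedure of Edmundson and Wyllys: it computes a Pearson chi-squared statistic between the word counts observed in a document and the counts expected under each field's reference proportions, and assigns the document to the field with the smallest statistic. The function names, the field names, and the reference proportions are all hypothetical; each field's proportions are assumed to be positive and to sum to 1 over a shared vocabulary.

```python
from collections import Counter

def chi_squared(doc_tokens, reference_props):
    """Pearson chi-squared statistic of document counts against one field.

    Assumption: only words in the field's reference vocabulary are counted,
    and every reference proportion is strictly positive.
    """
    counts = Counter(t for t in doc_tokens if t in reference_props)
    n = sum(counts.values())
    if n == 0:
        return float("inf")  # no overlap with the vocabulary: worst match
    stat = 0.0
    for word, prop in reference_props.items():
        expected = n * prop
        stat += (counts[word] - expected) ** 2 / expected
    return stat

def categorize(doc_tokens, fields):
    """Return the field whose reference frequencies best match the document."""
    return min(fields, key=lambda name: chi_squared(doc_tokens, fields[name]))

# Illustrative reference proportions only; these figures are invented.
fields = {
    "electronics": {"circuit": 0.5, "voltage": 0.3, "emotion": 0.05, "behavior": 0.15},
    "psychology": {"circuit": 0.05, "voltage": 0.05, "emotion": 0.5, "behavior": 0.4},
}
doc = "circuit voltage circuit circuit voltage behavior".split()
best = categorize(doc, fields)
```

A smaller statistic indicates closer agreement with the reference set, so the minimum is taken. A fuller treatment would also consider whether the best match is itself close enough to justify any assignment at all.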
Two points in the comments quoted above appear especially worthy of note. The first
is that of introducing at least some measure of reference to material other than the
individual author's own choice of linguistic expression and specific terms. We shall dis-
cuss this factor in more detail in a later section of this report. The second point,
derived in part from the first, is the specific suggestion of movement away from purely
derivative indexing by machine in the direction of automatic assignment indexing and
automatic categorization or classification.
1/ Doyle, 1959 [165], p. 9.
2/ Doyle, 1961 [169], p. 3.
3/ Edmundson and Wyllys, 1961 [181], p. 228.
4/ Wyllys, 1963 [653], p. 10.