MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
"As it is currently run, the auto-indexing program selects about one word in ten
as a keyword in articles of three thousand words or less. In articles longer than
three thousand words it tends to pick about one word in fifteen. This high incidence
of keywords naturally increases the amount of noise results returned by the query
program, although good search strategy cuts them down considerably." 1/
As of October 1963, the system was reported to be fully operative although not as
yet extensively tested in actual use. Gallagher and Toomey give illustrative auto-extract
results on two tested papers, one being Luhn's own "Automatic Creation of Literature
Abstracts". They report 88.6 percent agreement between manual and machine selection
of keywords as index or search terms: in the 6 tests reported, the human indexers
selected 132 words and the machine method 117. Modifications
under consideration include pre-edit flagging of terms in author and cited-reference
fields for special weighting, setting the length of the abstract as a function of the total
number of words in an item, and, in the search program, generating additional search
terms by means of association factor techniques such as those suggested by Stiles.
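The 88.6 percent agreement figure is consistent with the simple ratio of machine-selected to human-selected words, 117/132. The source does not state how the figure was computed, so the sketch below is an inference from the quoted numbers. The function name `agreement_percent` is hypothetical.

```python
def agreement_percent(machine_count: int, human_count: int) -> float:
    """Machine-selected keywords as a percentage of human-selected keywords.

    Assumption: agreement is taken as the plain ratio of the two totals;
    the source reports only the totals and the resulting percentage.
    """
    if human_count <= 0:
        raise ValueError("human_count must be positive")
    return 100.0 * machine_count / human_count


# Totals reported for the 6 tests: 117 machine-selected, 132 human-selected.
print(round(agreement_percent(117, 132), 1))  # 88.6
```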
To the basic approach of straight-forward word frequency counting, Luhn himself
has suggested that improvements might be obtained from considering closely adjacent
words, 2/ word pairs, 3/ and reference to vocabularies specific to a given field. 4/
Other possibilities are capitalized words and lookup against an inclusion list. He also
suggests:
"If certain words could be given in their relationships to other words, more
specific meanings may be identified by such combinations. These relationships
may range from the mere co-occurrence of certain words within a phrase or
sentence to the combinations of specific parts of speech." 5/
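The simplest of these relationships, co-occurrence of adjacent words within a sentence, can be illustrated alongside plain frequency counting. The following is a minimal sketch and not Luhn's own procedure: the stop list `COMMON_WORDS`, the function name `word_and_pair_counts`, and the sample text are all assumptions made for illustration, and "adjacent" is taken here to mean adjacent after common words are dropped.

```python
import re
from collections import Counter

# Hypothetical stop list; a working system would use a far larger one.
COMMON_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "are", "by", "for", "on", "with"}


def word_and_pair_counts(text):
    """Count significant words and adjacent significant-word pairs.

    Pairs are formed only within a sentence, so a pair never spans
    a sentence boundary.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        # Lowercase, keep alphabetic tokens, drop common words.
        words = [w for w in re.findall(r"[a-z]+", sentence.lower())
                 if w not in COMMON_WORDS]
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))
    return word_counts, pair_counts


SAMPLE = ("Automatic indexing selects index terms. "
          "Automatic indexing by machine counts words. "
          "The machine counts words in the text.")

words, pairs = word_and_pair_counts(SAMPLE)
print(words.most_common(3))
print(pairs.most_common(3))
```

In this sample the pair ("automatic", "indexing") occurs twice, which would single it out as a more specific candidate term than either word taken alone.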
Various investigators have proceeded to explore these and other possible improve-
ments, including incorporation of relative frequency information, use of information
about distances between high-ranked significant words, word pairs and word n-tuples,
1/ Gallagher and Toomey, 1963 [205], p. 51.
2/ Luhn, 1959 [384], p. 10.
3/ Luhn, 1962 [373], p. 11.
4/ Luhn, 1959 [384], pp. 8 and 10.
5/ Ibid., p. 5.