MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
Considering first the most frequently occurring words in a given text as too common
to be subject-indicative (those usually stopped or purged by a suitable exclusion dictionary
or stop list, for example) and next the least frequent words as being rarely topical in a
content-revealing sense, Luhn settles upon a middle range of frequency of word occur-
rence as the basis for his auto-condensation processes. The actual frequency counts are
computed, together with indications of page, line, and occurrence within the same
sentence. When this has been done for the complete text, each individual sentence is then
checked for the "score" of relatively high frequency words occurring in it, and sentences
with the highest scores are then automatically selected, in textually-occurring order, and
are printed out as an abstract, more properly an extract, of the document.
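
The selection procedure just described can be illustrated with a minimal sketch. It is not Luhn's actual program or scoring formula: the stop list, the `min_count` lower cutoff, the naive sentence splitter, and the scoring rule (a plain count of significant-word occurrences per sentence) are all assumptions made for illustration, and every function name is hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop list; a real exclusion dictionary would be far larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are",
              "it", "on", "for", "by", "was"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def significant_words(sentences, min_count=2):
    """Middle-range words: not on the stop list, seen at least min_count times."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOP_WORDS
    )
    return {w for w, c in counts.items() if c >= min_count}


def extract(text, n_sentences=2, min_count=2):
    """Return the n highest-scoring sentences in their original textual order."""
    sentences = split_sentences(text)
    sig = significant_words(sentences, min_count)
    # Assumed score: number of significant-word occurrences in the sentence.
    scores = [sum(1 for w in tokenize(s) if w in sig) for s in sentences]
    # Rank by descending score (earlier sentence wins ties), keep the top n.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:n_sentences]
    # Restore textual order before printing the extract.
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = ("Indexing by machine uses word counts. The weather was fine. "
              "Word counts drive machine indexing. Nobody came.")
    for sentence in extract(sample):
        print(sentence)
```

Here the stop list stands in for the upper frequency cutoff and `min_count` for the lower one, so that only the middle range of word frequencies contributes to a sentence's score.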
The automatic encoding of documents may be achieved either by taking the high
ranking words of the selected sentences or by selecting the highest ranking of the words
in the entire document as index entries. Luhn typically justifies these procedures as
follows:
"Of various automatic procedures for deriving typical patterns for characterizing
documents, the systems here proposed are based on operations involving
statistical properties of words . . . It is held that the more often a certain word
appears in a document the more it becomes representative of the subject matter
treated by the author. In grading words in accordance with the frequency of usage
within a document, a pattern is derived which is typical of that document and unique
amongst all similarly derived patterns of a collection of documents. It is proposed
that the more similar two such patterns are the more similar is the intellectual
contents of the documents they represent...
The creation of an encoding pattern may consist of listing an appropriate
portion of the words ranking highest on the word frequency list derived from a
document. Experiments conducted so far on documents ranging in size from 500
to 5000 words have indicated that word patterns consisting of from ten to twenty-
four of the highest ranking words furnish adequate discrimination and resolution
for retrieval, sixteen such words being a likely average." 1/
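
As a rough illustration of the encoding pattern described in the passage above, the following sketch lists the highest ranking words of a document and compares two such patterns. The default pattern size of sixteen follows the "likely average" cited in the quotation; the stop list, the alphabetical tie-breaking, and the use of Jaccard overlap as the similarity measure are assumptions, since the quoted passage names no specific measure. All function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop list; a real exclusion dictionary would be far larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are",
              "it", "on", "for", "by", "was"}


def encoding_pattern(text, size=16):
    """Return the `size` highest-ranking non-stop words, most frequent first."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    # Descending frequency; ties broken alphabetically so output is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:size]


def pattern_similarity(p1, p2):
    """Jaccard overlap of two patterns (an assumed measure, not Luhn's)."""
    a, b = set(p1), set(p2)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = "index index index word word file"
    doc2 = "word word index card"
    p1 = encoding_pattern(doc1, size=2)
    p2 = encoding_pattern(doc2, size=2)
    print(p1, p2, pattern_similarity(p1, p2))
```

Under this sketch, two documents whose patterns share most of their words receive a similarity near 1, in the spirit of the proposal that similar patterns indicate similar intellectual content.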
At Wright-Patterson Air Force Base an automated information selection and
retrieval system has been developed jointly by Air Force and IBM personnel
(Gallagher and Toomey, 1963 [205]). It involves both auto-indexing and auto-abstracting
techniques following the Luhn word-frequency-counting approach. Pre-editing is applied
to demarcate fields (e.g., title, author) and to flag certain text words,
particularly proper names, for special treatment. Special treatment, over and above the
frequency-based selection score, is also given to words in the title field.
On the abstracting side, modifications to the original Luhn formula involve
segmenting sentences in terms of strings of both high and low valued words separated
by either periods or continuous strings of low valued words, on the assumption that
long consecutive strings of low valued words should be weighted negatively. The automatic
extract consists of the highest ranking 20 percent of the sentences, subject to the
restriction that no fewer than 7 and no more than 20 sentences be selected. On the
indexing side, the investigators report:
1/ Luhn, 1959 [371], p. 47.