MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Indexes Generated by Machine-Automatic Derivative Indexing chapter Mary Elizabeth Stevens National Bureau of Standards

Considering first the most frequently occurring words in a given text as too common to be subject-indicative (those usually stopped or purged by a suitable exclusion dictionary or stop list, for example) and next the least frequent words as being rarely topical in a content-revealing sense, Luhn settles upon a middle range of frequency of word occurrence as the basis for his auto-condensation processes. The actual frequency counts are computed, together with indications of page, line, and occurrence within the same sentence. When this has been done for the complete text, each individual sentence is then checked for the "score" of relatively high frequency words occurring in it, and sentences with the highest scores are then automatically selected, in textually-occurring order, and are printed out as an abstract, more properly an extract, of the document. The automatic encoding of documents may be achieved either by taking the high ranking words of the selected sentences or by selecting the highest ranking of the words in the entire document as index entries.

Luhn typically justifies these procedures as follows:

"Of various automatic procedures for deriving typical patterns for characterizing documents, the systems here proposed are based on operations involving statistical properties of words . . . It is held that the more often a certain word appears in a document the more it becomes representative of the subject matter treated by the author. In grading words in accordance with the frequency of usage within a document, a pattern is derived which is typical of that document and unique amongst all similarly derived patterns of a collection of documents.
It is proposed that the more similar two such patterns are the more similar is the intellectual contents of the documents they represent . . . The creation of an encoding pattern may consist of listing an appropriate portion of the words ranking highest on the word frequency list derived from a document. Experiments conducted so far on documents ranging in size from 500 to 5000 words have indicated that word patterns consisting of from ten to twenty-four of the highest ranking words furnish adequate discrimination and resolution for retrieval, sixteen such words being a likely average." 1/

At Wright-Patterson Air Force Base an automated information selection and retrieval system has been developed jointly by Air Force and IBM personnel (Gallagher and Toomey, 1963 [205]). It involves both auto-indexing and auto-abstracting techniques following the Luhn word-frequency-counting techniques. Pre-editing is applied to demarcate fields (e.g., title, author) and to flag certain text words, particularly proper names, for special treatment. Special treatment, over and above the frequency-based selection score, is also given to words in the title field. On the abstracting side, modifications to the original Luhn formula involve segmenting sentences in terms of strings of both high and low valued words separated by either periods or continuous strings of low valued words, on the assumption that long consecutive strings of low value words should weight negatively. The automatic extract consists of the highest ranking 20 percent of the sentences subject to the restriction that no less than 7 and no more than 20 sentences should be selected. On the indexing side, the investigators report:

1/ Luhn, 1959 [371], p. 47.
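The frequency-based procedures described above (a stop list, a lower frequency cutoff, a per-sentence score, selection in textual order, an encoding pattern of the highest ranking words, and the 20 percent rule with its bounds of 7 and 20 sentences) can be illustrated with a minimal Python sketch. It is not a reconstruction of Luhn's or the Wright-Patterson programs. The tiny stop list, the tokenizer, the sentence splitter, the `min_count` cutoff, rounding the 20 percent downward, and capping the extract at the document length are all assumptions introduced here for illustration. The Wright-Patterson refinements (segmenting sentences, negative weighting of long runs of low valued words, special treatment of title words and proper names) are not modelled.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real exclusion dictionary is far larger.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"})


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def significant_words(text, min_count=2):
    """Middle-range words: not on the stop list, and occurring at least min_count times."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return {w: c for w, c in counts.items() if c >= min_count}


def encoding_pattern(text, size=16, min_count=2):
    """Up to `size` highest-frequency significant words (ties broken alphabetically)."""
    ranked = sorted(significant_words(text, min_count).items(),
                    key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


def extract_size(n_sentences, percent=20, lower=7, upper=20):
    """`percent` of the sentences (rounded down), held within [lower, upper].

    The result is also capped at n_sentences, since a short document
    cannot supply more sentences than it contains (an assumption).
    """
    target = n_sentences * percent // 100
    return min(n_sentences, max(lower, min(upper, target)))


def auto_extract(text, min_count=2, percent=20, lower=7, upper=20):
    """Select the highest-scoring sentences and return them in textual order."""
    sentences = split_sentences(text)
    sig = significant_words(text, min_count)
    # Score = number of significant-word occurrences in the sentence.
    scores = [sum(1 for w in tokenize(s) if w in sig) for s in sentences]
    k = extract_size(len(sentences), percent, lower, upper)
    # Rank by descending score; earlier sentences win ties.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [sentences[i] for i in sorted(top)]
```

Under these assumptions, a 50-sentence document yields a 10-sentence extract, a 200-sentence document is held to 20 sentences, and a 10-sentence document is raised to the minimum of 7. The default `size=16` in `encoding_pattern` follows the "likely average" quoted from Luhn; any value from ten to twenty-four falls within the range he reports.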