NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report

Chapter: Indexes Generated by Machine-Automatic Derivative Indexing
Mary Elizabeth Stevens, National Bureau of Standards

Some of the reasons for keeping stop lists short, however, may reflect unnecessary programming difficulties. Turner and Kennedy have reported that in the SAPIR system a title word is compared only with the group of nonsignificant words that have the same number of characters, in order to reduce the machine time required for the exclusion list search. 1/ Skaggs and Spangler give an account of an exclusion list system developed for general text processing as follows:

"A representative form developed by General Electric is composed of three groups of words, high frequency, special and standard. The high frequency words (25) occur most frequently in English text. A compression of approximately 35 percent will occur for most kinds of text when these 25 words are deleted. The special words are derived from the particular body of text being processed. The composition of this group is left to the program user. Normally the words for this group are selected by making an Editing list in alphabetical sequence. The words appearing in the index position on the preliminary listing are then reviewed.

"Standard words are words that occur with a relatively high frequency in most types of text and therefore are appropriate for a general purpose screen. In the GE program, 375 words are used in this group.

"To minimize computer processing time, it is desirable that words in the Exclusion Dictionary be arranged in approximate order of their frequency of occurrence." 2/

It should be noted, however, that in most cases stop list searches can be programmed in the form of so-called "logarithmic", "partitioning" or "bifurcation" searches, in which the number of comparisons required to look up a word is at most about log2 N + 1, where N is the number of words in the list, provided the list is held in sorted order.
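The two search strategies mentioned above can be illustrated with a minimal sketch in Python. It is not a reconstruction of the SAPIR or General Electric programs; the stop list, the function names (binary_search, group_by_length, is_stop_word) and the probe counter are all assumptions introduced here for illustration. The sketch groups the stop words by character count, as reported for SAPIR, and searches a sorted list by repeated halving, counting the comparisons so that the log2 N + 1 bound can be checked.

```python
from collections import defaultdict
from math import floor, log2

# Hypothetical stop list for illustration only; it must be kept sorted
# for the bifurcation search below to work.
STOP_WORDS = sorted(
    ["a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"]
)


def binary_search(sorted_words, word):
    """Bifurcation search: return (found, probes) for a sorted list."""
    lo, hi = 0, len(sorted_words)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1  # one list entry examined per halving step
        if sorted_words[mid] == word:
            return True, probes
        if sorted_words[mid] < word:
            lo = mid + 1
        else:
            hi = mid
    return False, probes


def group_by_length(words):
    """Partition stop words by character count, each group kept sorted."""
    groups = defaultdict(list)
    for w in sorted(words):
        groups[len(w)].append(w)
    return dict(groups)


def is_stop_word(groups, word):
    """Compare a title word only against stop words of the same length."""
    word = word.lower()
    found, _ = binary_search(groups.get(len(word), []), word)
    return found


if __name__ == "__main__":
    bound = floor(log2(len(STOP_WORDS))) + 1
    worst = max(binary_search(STOP_WORDS, w)[1] for w in STOP_WORDS + ["index"])
    print(f"N = {len(STOP_WORDS)}, worst probes = {worst}, bound = {bound}")
    groups = group_by_length(STOP_WORDS)
    print(is_stop_word(groups, "The"), is_stop_word(groups, "indexing"))
```

Under these assumptions the cost of a lookup grows only with the logarithm of the list length, so enlarging the stop list from a few dozen words to several hundred adds only a handful of comparisons per title word.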
The more words excluded, the fewer the title entry lines that must be included in the final index. This is a factor involving first of all the user in the sequential scanning he must do, where, as Coates has remarked, the retrieval effectiveness is usually in inverse proportion to the amount of such scanning required. 3/ Secondly, longer stop lists help to minimize the long block problem, since it is obviously the most frequently occurring title words that have not been excluded that cause the longest blocks of entries.

1/ Turner and Kennedy, 1961 [614], p. 7.
2/ Skaggs and Spangler, 1963 [557], p. 29.
3/ Coates, 1962 [134], p. 430.
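The effect of stop list length on index size and on the long block problem can be shown with a small sketch. The titles, both stop lists and the function names (index_entries, longest_block) are hypothetical and chosen only for illustration; the sketch assumes one index entry line per non-excluded title word occurrence, as in a permuted title index.

```python
from collections import Counter

# Hypothetical titles, used only to illustrate the counting.
TITLES = [
    "Automatic Indexing of Titles",
    "A Study of the Literature of Chemistry",
    "The Analysis of Titles by Machine",
]

SHORT_STOP_LIST = {"a", "the"}
LONG_STOP_LIST = {"a", "the", "of", "by"}


def index_entries(titles, stop_words):
    """Return one (keyword, title) entry per non-excluded word occurrence."""
    entries = []
    for title in titles:
        for word in title.lower().split():
            if word not in stop_words:
                entries.append((word, title))
    return sorted(entries)


def longest_block(entries):
    """Return the size of the largest run of entries sharing one keyword."""
    counts = Counter(keyword for keyword, _ in entries)
    return max(counts.values()) if counts else 0


if __name__ == "__main__":
    for name, stops in (("short", SHORT_STOP_LIST), ("long", LONG_STOP_LIST)):
        entries = index_entries(TITLES, stops)
        print(name, "stop list:", len(entries), "entries,",
              "longest block", longest_block(entries))
```

With these sample titles, the word "of" forms the longest block when only the short list is applied; once it is excluded, both the total number of entry lines and the longest block shrink.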