IR4873 NIST Interagency Report 4873: Automatic Indexing
Automatic Indexing chapter
Donna Harman
National Institute of Standards and Technology

frequency for some text collections, such as "below", "near", "always", and "that". Note that unlike high-frequency words, "fluff" words do not necessarily hurt retrieval performance and will not seriously affect storage. Often these words become crucial to retrieval, as in the queries "stocks with costs below X dollars" or "restaurants near the harbor".

A more suitable method of constructing a stop list is to produce a word frequency listing for the text to be indexed and then examine each of the high-frequency words. If a word has no known importance in the application, it can safely be placed on the stop list. An example of this procedure is the work done at the National Institute of Standards and Technology (NIST) with a 25-megabyte collection of the Wall Street Journal. The top twenty-seven high-frequency words were examined, and four words were removed as possibly important ("a", "at", "from", and "to"). The remaining twenty-three words then became the stop list. This was a reduction from the 418-word stop list previously used, which came from the SMART project. The shrinkage of the stop list caused an increase of about 25% in index storage, but made an additional 395 words available for searching. This new stop list is shown in Table 2 as an illustration of an abbreviated stop list rather than as a particularly recommended one.

TABLE 2
Sample Stop Words

an     been   in     or     which
and    but    is     that   will
are    by     it     the    with
as     for    of     this
be     have   on     was

It should be noted that commercial systems are even more conservative in their use of stop lists. ORBIT Search Service has only eight stop words: "and", "an", "by", "from", "of", "or", "the", and "with" (Fox 1992). The MEDLARS system has even fewer stop words.
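The frequency-based procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used at NIST: the function name `build_stop_list`, the simple lowercase tokenizer, and the `keep` parameter (standing in for the manual step of judging which high-frequency words matter to the application) are all hypothetical.

```python
import re
from collections import Counter


def build_stop_list(text, top_n=27, keep=()):
    """Return the top_n most frequent words, minus those judged important.

    `keep` stands in for the manual review step: words a person has
    decided are potentially important to retrieval in the application.
    """
    # Assumed tokenizer: lowercase runs of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Candidate stop words are the highest-frequency words in the text.
    candidates = [word for word, _ in counts.most_common(top_n)]
    important = set(keep)
    return [word for word in candidates if word not in important]


if __name__ == "__main__":
    sample = "the the the the of of of near near cat"
    # "near" is frequent here but is kept searchable by the reviewer.
    print(build_stop_list(sample, top_n=3, keep={"near"}))
```

With `top_n=27` and `keep={"a", "at", "from", "to"}`, the same function mirrors the shape of the Wall Street Journal exercise, although the resulting list depends entirely on the text supplied.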
2.4 Use of suffixing or stemming

Many information retrieval systems also use suffixing or stemming to replace all indexed words with their root forms. Different stemming algorithms have been used, including "standard" algorithms and algorithms built for a specific domain such as medical English (Pacak 1978). For a survey of the various algorithms see Frakes (1992). Three standard algorithms are most often used: an "S" stemming algorithm, the Lovins (1968) algorithm, and the Porter (1980) algorithm. The following excerpts (Harman 1991) show some of their characteristics.

The "S" stemming algorithm, a basic algorithm conflating singular and plural word forms, is commonly used for minimal stemming. The rules for a version of this stemmer, shown in Table 3, are applied only to words of sufficient length (three or more characters) and are applied in an order-dependent manner (i.e., the first applicable rule encountered is the only one used). Each rule has three parts: a specification of the qualifying word ending, such as "ies"; a list of exceptions; and the necessary action.
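The three-part rule structure and the first-match behaviour can be illustrated with a short Python sketch. Several things here are assumptions rather than part of the text above: the endings and exceptions in `S_RULES` follow the commonly cited form of the "S" stemmer and should be checked against Table 3, the names `S_RULES` and `s_stem` are hypothetical, and the sketch assumes that a word matching a rule's ending but falling under one of its exceptions is left unchanged rather than tried against later rules.

```python
# Each rule: (qualifying ending, exceptions, replacement for the ending).
# Assumed rule set; verify against Table 3 before relying on it.
S_RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]


def s_stem(word, min_length=3):
    """Conflate simple plural forms with their singular forms."""
    # Rules apply only to words of sufficient length.
    if len(word) < min_length:
        return word
    for ending, exceptions, replacement in S_RULES:
        if word.endswith(ending):
            # Assumption: the first rule whose ending matches is the only
            # one consulted; an exception leaves the word unchanged.
            if word.endswith(exceptions):
                return word
            return word[: -len(ending)] + replacement
    return word


if __name__ == "__main__":
    for w in ["ponies", "plates", "cats", "glass", "corpus", "is"]:
        print(w, "->", s_stem(w))
```

Keeping the rules in an ordered list makes the order dependence explicit: moving the "s" rule ahead of the "ies" rule would turn "ponies" into "ponie" instead of "pony".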