IR4873
NIST Interagency Report 4873: Automatic Indexing
Automatic Indexing
chapter
Donna Harman
National Institute of Standards and Technology
frequency for some text collections, such as "below", "near", "always", and "that". Note that unlike high-frequency
words, "fluff" words do not necessarily hurt retrieval performance, and will not seriously affect storage.
Often these words become crucial to retrieval, such as in a query "stocks with costs below X dollars", or
"restaurants near the harbor".
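The effect on the second query can be illustrated with a minimal sketch. The two stop lists and the function name remove_stop_words are hypothetical and chosen only for this illustration; they are not taken from any particular system.

```python
def remove_stop_words(query, stop_words):
    """Return the query terms that survive stop-word removal."""
    return [term for term in query.lower().split() if term not in stop_words]

# Hypothetical stop lists, for illustration only.
with_fluff_words = {"the", "with", "below", "near"}
without_fluff_words = {"the", "with"}

query = "restaurants near the harbor"
print(remove_stop_words(query, with_fluff_words))     # ['restaurants', 'harbor']
print(remove_stop_words(query, without_fluff_words))  # ['restaurants', 'near', 'harbor']
```

With "near" on the stop list, the query can no longer express the spatial relationship the user asked for.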
A more suitable method of constructing a stop list would be to produce a word frequency listing for the text
to be indexed, and then examine each of the high-frequency words. If a given word has no known importance in
the application, then that word can be safely placed on a stop list. An example of this procedure is the
work done at the National Institute of Standards and Technology (NIST) with a 25-megabyte collection of the
Wall Street Journal. The top twenty-seven high-frequency words were examined, and four words were removed
as possibly important ("a", "at", "from" and "to"). The remaining twenty-three words then became the stop list.
This was a reduction from the 418-word stop list from the SMART project that had been used previously. The
shrinkage of the stop list caused an increase of about 25% in the index storage, but made an additional
395 words available for searching. This new stop list is shown as Table 2 as an illustration of an abbreviated
stop list rather than as a particularly recommended one.
TABLE 2
Sample Stop Words
an     been   in     or     which
and    but    is     that   will
are    by     it     the    with
as     for    of     this
be     have   on     was
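The frequency-based procedure described above can be sketched as follows. This is a minimal illustration, not the NIST implementation: the function name build_stop_list, the simple tokenizer, and the sample text are assumptions, and the keep argument stands in for the manual examination of each high-frequency word.

```python
import re
from collections import Counter

def build_stop_list(text, top_n, keep):
    """Take the top_n most frequent words and drop those judged important."""
    # Assumed tokenizer: lowercase runs of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    candidates = [word for word, _ in Counter(words).most_common(top_n)]
    # 'keep' holds the words a human reviewer decided to leave searchable.
    return [word for word in candidates if word not in keep]

# Hypothetical sample text, for illustration only.
sample = "the cost of the stock rose to the top of the range and fell to half of it"
print(build_stop_list(sample, top_n=3, keep={"to"}))  # ['the', 'of']
```

In the NIST example the same steps give 27 - 4 = 23 stop words, and replacing the 418-word list frees 418 - 23 = 395 words for searching.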
It should be noted that commercial systems are even more conservative in the use of stop lists. ORBIT
Search Service has only eight stop words: "and", "an", "by", "from", "of", "or", "the", and "with" (Fox 1992).
The MEDLARS system has even fewer stop words.
2.4 Use of suffixing or stemming
Many information retrieval systems also use suffixing or stemming to replace all indexed words with their
root forms. Different stemming algorithms have been used, including "standard" algorithms, and algorithms built
for a specific domain such as medical English (Pacak 1978). For a survey of the various algorithms see Frakes
(1992). Three standard algorithms, an "S" stemming algorithm, the Lovins (1968) algorithm, and the Porter
(1980) algorithm, are most often used, and the following excerpts (Harman 1991) show some of their
characteristics.
The "S" stemming algorithm, a basic algorithm conflating singular and plural word forms, is commonly used
for minimal stemming. The rules for a version of this stemmer, shown in Table 3, are only applied to words of
sufficient length (three or more characters), and are applied in an order-dependent manner (i.e., the first
applicable rule encountered is the only one used). Each rule has three parts: a specification of the qualifying word
ending, such as "ies"; a list of exceptions; and the necessary action.
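The three-part rule structure can be sketched in code as follows. This is an illustration under stated assumptions, not a definitive implementation: the three rules are recalled from commonly cited versions of the "S" stemmer and may differ from Table 3, which remains authoritative, and the name s_stem is hypothetical. The sketch also assumes that the first rule whose ending matches is the only one considered, so a word that hits one of that rule's exceptions is left unchanged.

```python
# Each rule: (qualifying ending, exceptions, replacement for the ending).
# ASSUMPTION: rules as commonly cited for the "S" stemmer; see Table 3.
RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]
MIN_LENGTH = 3  # rules apply only to words of three or more characters

def s_stem(word):
    """Conflate plural forms to singular using order-dependent rules."""
    if len(word) < MIN_LENGTH:
        return word
    for ending, exceptions, replacement in RULES:
        if word.endswith(ending):
            # ASSUMPTION: only the first matching ending is considered;
            # an exception leaves the word as it is.
            if word.endswith(exceptions):
                return word
            return word[: -len(ending)] + replacement
    return word

for w in ["ponies", "houses", "shoes", "cats", "glass", "is"]:
    print(w, "->", s_stem(w))
```

Under these assumptions "ponies" becomes "pony", "houses" becomes "house", and "cats" becomes "cat", while "shoes", "glass", and "is" are returned unchanged.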