IR4873
NIST Interagency Report 4873: Automatic Indexing
Donna Harman
National Institute of Standards and Technology
TABLE 3
An "S" Stemmer
IF a word ends in "ies", but not "eies" or "aies"
THEN "ies" --> "y"
IF a word ends in "es", but not "aes", "ees", or "oes"
THEN "es" --> "e"
IF a word ends in "s", but not "us" or "ss"
THEN "s" --> NULL
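The rules in Table 3 can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation. The function name `s_stem` is hypothetical, and two behaviors are assumptions that the table does not state: the rules are tried in order with only the first matching suffix considered, and a word that hits an exception (for example "oes") is left unchanged rather than passed on to a later rule.

```python
def s_stem(word: str) -> str:
    """Sketch of the Table 3 "S" stemmer (hypothetical name).

    Assumption: rules are tried top to bottom, the first rule whose
    suffix matches decides the outcome, and an exception leaves the
    word unchanged instead of falling through to the next rule.
    """
    w = word.lower()

    # Rule 1: "ies" --> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies"):
        if w.endswith(("eies", "aies")):
            return w
        return w[:-3] + "y"

    # Rule 2: "es" --> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es"):
        if w.endswith(("aes", "ees", "oes")):
            return w
        return w[:-2] + "e"

    # Rule 3: "s" --> NULL, unless the word ends in "us" or "ss".
    if w.endswith("s"):
        if w.endswith(("us", "ss")):
            return w
        return w[:-1]

    return w


if __name__ == "__main__":
    for term in ["panels", "ponies", "plates", "goes", "corpus", "glass"]:
        print(term, "->", s_stem(term))
```

Under these assumptions "panels" and "panel" share a stem, which matches the S column of Table 4, while "goes", "corpus", and "glass" are left untouched.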
The Lovins stemmer works similarly, but on a much larger scale. It contains a list of over 260 possible
suffixes, a large exception list, and many cleanup rules. In contrast, the Porter algorithm looks for about 60
suffixes, producing word variant conflation intermediate between a simple singular-plural technique and the Lovins
algorithm. Table 4 shows an example of the differences among the three stemmers. The first column shows the
actual words (full words) from the query. The next three columns show the words that are conflated with the
original words (words that stem to the same root for that stemmer) based on three different stemmers. The
starred terms are the ones that were useful in retrieval for this particular query and are shown only to indicate
the "quasi-random" matching that occurs when matching query terms with terms in relevant documents.
TABLE 4
Stemmer Differences for query 109 of the Cranfield test collection
Query -- panels subjected to aerodynamic heating
FULL WORD S PORTER LOVINS
*panels *panel *panel *panel
______________ *panels *panels *panels
subjected subjected subjected subjected
*subject *subject
subjective subjective
__________ subjects subjects
*aerodynamic *aerodynamic *aerodynamic *aerodynamic
aerodynamics aerodynamics aerodynamics
*aerodynamically *aerodynamically
_____________ ______________ _________________ aerodynamicist
*heating *heating *heating *heating
*heated *heated
*heat
heats
heater
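The notion of conflation used in Table 4 (words that stem to the same root as a query word under a given stemmer) can be illustrated with a short sketch. The names `conflated_with` and `strip_final_s` are hypothetical, the vocabulary is a small made-up sample, and `strip_final_s` is only a toy stand-in for a real stemmer, not the S, Porter, or Lovins algorithm.

```python
from typing import Callable, Iterable, List


def conflated_with(query_word: str,
                   vocabulary: Iterable[str],
                   stem: Callable[[str], str]) -> List[str]:
    """Return the vocabulary words whose stem equals the query word's stem."""
    target = stem(query_word)
    return sorted(w for w in set(vocabulary) if stem(w) == target)


def strip_final_s(word: str) -> str:
    """Toy stand-in stemmer (an assumption): drop one trailing "s"."""
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


if __name__ == "__main__":
    # Illustrative vocabulary only; not data from the Cranfield collection.
    vocab = ["panel", "panels", "aerodynamic", "aerodynamics",
             "aerodynamically", "heating", "heated"]
    for term in ["panels", "aerodynamic", "heating"]:
        print(term, "->", conflated_with(term, vocab, strip_final_s))
```

Swapping in a stronger stemmer for `strip_final_s` would enlarge each returned list, which is the effect the later columns of Table 4 show.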
Stemming or suffixing is done for two principal reasons: the reduction in index storage required and the
increase in performance due to the use of word variants. The storage savings using stemming is data and
implementation dependent. For small text collections on machines with little storage, a sizable amount of inverted file
storage can be saved using stemming. For the 1.6 megabyte manual shown in Table 1, approximately 20% of
storage was saved by using the Lovins stemmer. Lennon et al. (1981) showed compression percentages for the
Lovins stemmer of 45.8% for the Brown Corpus. However, for the larger text collections normally used in
online retrieval, less storage is saved. The savings was less than 14% for the text of 50 megabytes in Table 1,
probably because this text contains large amounts of numbers, misspellings, proper names, etc. (items that
usually cannot be stemmed).
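One way to see why savings depend on the data is to measure how much a stemmer shrinks the vocabulary. The sketch below assumes a compression percentage defined as 100 * (N - S) / N, where N is the number of distinct words and S the number of distinct stems. That definition is an assumption here, since this section does not spell out the measure. The names `compression_percent` and `toy_stem` are hypothetical, and the word lists are made-up examples.

```python
from typing import Callable, Iterable


def compression_percent(vocabulary: Iterable[str],
                        stem: Callable[[str], str]) -> float:
    """Percentage reduction in distinct index terms after stemming.

    Assumed definition: 100 * (N - S) / N, with N distinct words and
    S distinct stems. Returns 0.0 for an empty vocabulary.
    """
    words = set(vocabulary)
    if not words:
        return 0.0
    stems = {stem(w) for w in words}
    return 100.0 * (len(words) - len(stems)) / len(words)


def toy_stem(word: str) -> str:
    """Toy stand-in stemmer (an assumption): drop one trailing "s"."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


if __name__ == "__main__":
    # Ordinary words conflate; numbers and proper names mostly do not.
    plain = ["panel", "panels", "heat", "heats", "heating", "glass"]
    noisy = ["panel", "panels", "1981", "4873", "Lovins2", "NIST"]
    print(round(compression_percent(plain, toy_stem), 1))
    print(round(compression_percent(noisy, toy_stem), 1))
```

The second list, padded with numbers and names that no rule touches, compresses less than the first, which mirrors the explanation given above for the 50 megabyte text.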