NIST Interagency Report 4873: Automatic Indexing

Donna Harman
National Institute of Standards and Technology

TABLE 3  An "S" Stemmer

IF a word ends in "ies", but not "eies" or "aies"
    THEN "ies" --> "y"
IF a word ends in "es", but not "aes", "ees", or "oes"
    THEN "es" --> "e"
IF a word ends in "s", but not "us" or "ss"
    THEN "s" --> NULL

The Lovins stemmer works similarly, but on a much larger scale. It contains a list of over 260 possible suffixes, a large exception list, and many cleanup rules. In contrast, the Porter algorithm looks for about 60 suffixes, producing word variant conflation intermediate between a simple singular-plural technique and the Lovins algorithm.

Table 4 shows an example of the differences among the three stemmers. The first column shows the actual words (full words) from the query. The next three columns show, for each of the three stemmers, the words that are conflated with the original word (that is, the words that stem to the same root under that stemmer). The starred terms are the ones that were useful in retrieval for this particular query and are shown only to indicate the "quasi-random" matching that occurs when matching query terms with terms in relevant documents.

TABLE 4  Stemmer Differences for Query 109 of the Cranfield Test Collection

Query -- panels subjected to aerodynamic heating

FULL WORD       S                PORTER             LOVINS
*panels         *panel           *panel             *panel
                *panels          *panels            *panels

subjected       subjected        subjected          subjected
                                 *subject           *subject
                                 subjective         subjective
                                 subjects           subjects

*aerodynamic    *aerodynamic     *aerodynamic       *aerodynamic
                aerodynamics     aerodynamics       aerodynamics
                                 *aerodynamically   *aerodynamically
                                                    aerodynamicist

*heating        *heating         *heating           *heating
                                 *heated            *heated
                                                    *heat
                                                    heats
                                                    heater

Stemming or suffixing is done for two principal reasons: the reduction in index storage required and the increase in performance due to the use of word variants.
The storage saved by stemming depends on the data and the implementation. For small text collections on machines with little storage, a sizable amount of inverted file storage can be saved using stemming. For the 1.6 megabyte manual shown in Table 1, approximately 20% of storage was saved by using the Lovins stemmer. Lennon et al. (1981) showed compression percentages for the Lovins stemmer of 45.8% for the Brown Corpus. However, for the larger text collections normally used in online retrieval, less storage is saved. The savings were less than 14% for the 50 megabyte text in Table 1, probably because this text contains large amounts of numbers, misspellings, proper names, etc. (items that usually cannot be stemmed).
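
The three rules of the "S" stemmer in Table 3 are small enough to sketch directly, and the same sketch can illustrate how stemming shrinks the set of distinct index terms. The Python below is only an illustration under stated assumptions, not the implementation used for the figures above. The names s_stem, compression_percent, and TABLE_4_WORDS are hypothetical. The sketch assumes the rules are tried in the order listed and that only the first rule whose condition holds is applied. It also assumes compression is measured as the percentage of distinct words eliminated by conflation, which is a proxy for, not a measurement of, inverted file storage.

```python
def s_stem(word: str) -> str:
    """Apply the first matching rule of the "S" stemmer (Table 3).

    Assumption: rules are tried in table order and at most one is applied.
    """
    w = word.lower()
    # Rule 1: "ies" --> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" --> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: "s" --> NULL, unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


def compression_percent(vocabulary, stem) -> float:
    """Percentage of distinct words removed when words are replaced by stems.

    Assumption: this counts distinct terms only; it ignores postings,
    so it is not the same as bytes of inverted file storage saved.
    """
    words = set(vocabulary)
    if not words:
        return 0.0
    stems = {stem(w) for w in words}
    return 100.0 * (len(words) - len(stems)) / len(words)


# The distinct word forms that appear in Table 4 (stars omitted).
TABLE_4_WORDS = [
    "panel", "panels",
    "subjected", "subject", "subjective", "subjects",
    "aerodynamic", "aerodynamics", "aerodynamically", "aerodynamicist",
    "heating", "heated", "heat", "heats", "heater",
]

if __name__ == "__main__":
    for w in ("panels", "queries", "phrases", "corpus"):
        print(w, "-->", s_stem(w))
    print(f"{compression_percent(TABLE_4_WORDS, s_stem):.1f}%")
```

On the fifteen word forms of Table 4, the "S" stemmer merges only the four plurals (panels, subjects, aerodynamics, heats) into their singulars, consistent with the S column of the table. A stemmer that conflates more aggressively, such as Porter or Lovins, would be passed in place of s_stem and would leave fewer distinct terms.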