IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Suffix Dictionaries
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
These average results may be supplemented by the individual request
data given in Figures 5, 6, 7 and 8. Using the normalized recall and
precision measures as indicators of merit, it can be seen that 71% to 7[OCRerr]% of the
requests favor stem on IRE-3 (Figure 5), and 53% to 75% of the requests
favor stem on ADI abstracts (Figure 7) and text (Figure 8). The Cran-1
result favoring suffix `s' is confirmed by figures relating to the individual
requests also, with 72% to 77% preferring suffix `s', ignoring those requests
which have equal merit for both dictionaries. Each figure includes plots of
both normalized recall and precision versus the individual requests. In
the case of Cran-1 these plots show that suffix `s' is superior on the average
because many of the requests favor suffix `s' by very small amounts. In
the IRE-3 and ADI collections the stem dictionary displays some large changes
in individual requests in its superiority over suffix `s'.
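
The per-request comparison described above can be sketched in a few lines.
The sketch below is illustrative only and is not the procedure used to
produce the figures. It assumes the usual rank-based definitions of
normalized recall and normalized precision, and the function names, the tie
tolerance `tol`, and the sample scores are hypothetical. Requests with equal
merit for both dictionaries are ignored when the percentages are formed.

```python
import math


def normalized_recall(ranks, n_docs):
    """Normalized recall for one request.

    ranks  -- 1-based ranks at which the relevant documents were retrieved
    n_docs -- collection size N (assumes 0 < len(ranks) < N)
    """
    n = len(ranks)
    ideal = sum(range(1, n + 1))  # relevant documents ranked 1..n
    return 1.0 - (sum(ranks) - ideal) / (n * (n_docs - n))


def normalized_precision(ranks, n_docs):
    """Normalized precision for one request (same arguments as above)."""
    n = len(ranks)
    actual = sum(math.log(r) for r in ranks)
    ideal = sum(math.log(i) for i in range(1, n + 1))
    # log of N! / ((N - n)! n!), computed with lgamma to avoid huge factorials
    denom = (math.lgamma(n_docs + 1)
             - math.lgamma(n_docs - n + 1)
             - math.lgamma(n + 1))
    return 1.0 - (actual - ideal) / denom


def preference_share(scores_a, scores_b, tol=1e-9):
    """Percentage of requests favoring A and B, ignoring ties.

    scores_a, scores_b -- per-request merit values for two dictionaries,
                          listed in the same request order
    tol                -- hypothetical tolerance below which a difference
                          is treated as equal merit
    """
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a - b > tol)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b - a > tol)
    decided = wins_a + wins_b
    if decided == 0:
        return 0.0, 0.0
    return 100.0 * wins_a / decided, 100.0 * wins_b / decided


# Made-up merit values for four requests under two dictionaries.
stem_scores = [0.9, 0.8, 0.5, 0.7]
suffix_s_scores = [0.8, 0.8, 0.6, 0.6]
stem_pct, suffix_s_pct = preference_share(stem_scores, suffix_s_scores)
```

A tally of this kind records only which dictionary is ahead on each request
and not by how much, so a majority built from very small differences and a
minority containing a few large ones can both occur in the same collection.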
[OCRerr]. Performance Analyses
Two phenomena require explanation: firstly, the IRE and ADI runs
involving logical vectors and overlap correlation, which sometimes show
suffix `s' superior to stem; and secondly, the superiority of suffix `s' on the
Cran-1 collection.
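
For orientation, the two matching schemes contrasted here may be sketched as
follows. This is a minimal illustration under stated assumptions, not the
system's own code: the overlap coefficient is taken to be the sum of the
term-wise minima divided by the smaller of the two vector sums, cosine is
the usual normalized inner product, and the vectors, the concept names, and
the function names are hypothetical.

```python
import math


def overlap(query, doc):
    """Overlap correlation (assumed form): sum of min weights / smaller sum."""
    shared = sum(min(w, doc.get(term, 0)) for term, w in query.items())
    smaller = min(sum(query.values()), sum(doc.values()))
    return shared / smaller if smaller else 0.0


def cosine(query, doc):
    """Cosine correlation between two sparse weighted vectors."""
    dot = sum(w * doc.get(term, 0) for term, w in query.items())
    norm = (math.sqrt(sum(w * w for w in query.values()))
            * math.sqrt(sum(w * w for w in doc.values())))
    return dot / norm if norm else 0.0


def to_logical(vec):
    """Reduce a numeric (weighted) vector to a logical one: weight 1 if present."""
    return {term: 1 for term, w in vec.items() if w > 0}


# Hypothetical concept vectors keyed by dictionary entry.
query = {"retriev": 2, "suffix": 1}
doc = {"retriev": 1, "dictionari": 3}

logical_overlap = overlap(to_logical(query), to_logical(doc))
numeric_cosine = cosine(query, doc)
```

In the logical case every concept present counts equally, so a single
conflated word that matches a non-relevant document carries as much weight
as any other match.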
The first phenomenon is less important than the second, because
logical and overlap runs are inferior to cosine numeric runs in any case.
Cases where suffix `s' is better than stem must be caused by circumstances
of the type considered in part 2, where full suffix removal conflates some
words that match with non-relevant documents and thus adversely affect
performance. It was noted in section III that the use of numeric vectors
(weighted) gives a clear advantage over logical vectors when a dictionary
is in use that includes a reasonably large amount of mapping (i.e., it