Scientific Report No. IRS-13 Information Storage and Retrieval
Suffix Dictionaries
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
conflates many words), and a similar, but unexplained, relationship is noted
when the use of cosine is compared to overlap. From a strictly experimental
viewpoint, dictionaries such as suffix `s' and stem should be compared without
the addition of weighting procedures and cosine, so that the dictionary
mapping characteristics may be tested alone. In this case, the overlap logical
results show that the stem and suffix `s' dictionaries perform very similarly,
and therefore, within the context of the requests and relevance decisions
in use, no advantage is to be gained from full suffix recognition as performed
automatically. This finding is in accordance with the general conclusions
of the second Aslib-Cranfield project [8], although in those results
the nearest equivalent to the stem dictionary does perform a little better
than suffix `s'.
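
The distinction between the two matching procedures compared above can be illustrated with a minimal sketch. It assumes that overlap on logical vectors is the number of shared concepts divided by the size of the smaller vector, and that cosine on numeric vectors is the usual normalized inner product; the function names, the toy query, and the toy document are hypothetical and are not taken from SMART itself.

```python
import math

def overlap_logical(query_concepts, doc_concepts):
    """Overlap on logical (binary) vectors.

    Assumed definition: shared concepts / size of the smaller concept set.
    """
    q, d = set(query_concepts), set(doc_concepts)
    if not q or not d:
        return 0.0
    return len(q & d) / min(len(q), len(d))

def cosine_numeric(query_weights, doc_weights):
    """Cosine correlation on numeric vectors given as {concept: weight}."""
    dot = sum(w * doc_weights.get(c, 0.0) for c, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

# Hypothetical toy vectors: the keys are concept classes, the values are weights.
query = {"retrieval": 2, "dictionary": 1}
doc = {"retrieval": 1, "dictionary": 1, "suffix": 3}

print(overlap_logical(query, doc))  # weights ignored, only concept presence counts
print(cosine_numeric(query, doc))   # weights and vector length both matter
```

Under these assumptions the overlap score ignores the weights entirely, which is why a comparison on overlap logical results isolates the effect of the dictionary mapping alone.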
However, a more practical conclusion in the case of SMART is that
stem is the superior dictionary on the IRE-3 and ADI collections, since the
cosine correlation and numeric vectors have clearly been proved to be
advantageous, and would be advocated for use in any operational version of
SMART.
The superiority of suffix `s' on Cran-1 is one of several instances
where the Cran-1 result differs from the other collections. In the case of
Cran-1 the difference in word mapping between suffix `s' and stem is less
marked than in the other collections, since Figure 9 shows that the Cran-1
stem dictionary includes 8[OCRerr] of the concept classes contained in suffix
`s', whereas the IRE-3 and ADI stem dictionaries are based on more mapping
characteristics, including only 76% and 74% of suffix `s', respectively.
As expected, this affects the match with requests and documents, since
Figure 10 shows that at a cosine correlation cut-off of 0.35, the stem
dictionary in Cran-1 does not retrieve as many additional documents over
suffix `s' as is true for the other collections.
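
The concept-class percentages quoted from Figure 9 can be read as the number of classes in the stem dictionary expressed as a fraction of the number of classes in the suffix `s' dictionary. The sketch below shows that calculation on a toy vocabulary. Both mapping functions are crude, hypothetical stand-ins (neither is the SMART suffix `s' or stem procedure), and the word list is invented for illustration only.

```python
def suffix_s(word):
    """Hypothetical suffix `s' mapping: remove one terminal "s" and nothing else."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def crude_stem(word):
    """Hypothetical stem mapping: strip a few common suffixes (not SMART's rules)."""
    for suffix in ("ations", "ation", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def class_ratio(words):
    """Stem concept classes as a fraction of suffix `s' concept classes."""
    s_classes = {suffix_s(w) for w in words}
    stem_classes = {crude_stem(w) for w in words}
    return len(stem_classes) / len(s_classes)

# Invented vocabulary: three word families with several inflected forms each.
words = ["index", "indexes", "indexing", "indexed",
         "match", "matches", "matched", "term", "terms"]

print(f"{class_ratio(words):.0%}")
```

A ratio close to 100% means the stem mapping conflates few words beyond what suffix `s' already conflates, which is the situation the text describes for Cran-1; lower ratios, as reported for IRE-3 and ADI, indicate a larger difference between the two dictionaries.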