ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Fig. 12 does for the "Harris 3" thesaurus what Fig. 9 did for the
null dictionary: specifically, it shows the effect of using the thesaurus
for title words only, compared to using it throughout, and of applying
higher weights to the title than to the remainder of the text. The
results are substantially in agreement with those previously obtained
for the null thesaurus: the "title only" process is again much poorer,
indicating that synonym recognition for title words alone, while better
than no synonym recognition at all, is still not nearly so effective as
full synonym detection; also as before, the increased weighting of title
words does not substantially add to the retrieval effectiveness.
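
The three procedures compared above (thesaurus applied to title words
only, thesaurus applied throughout, and title words weighted more heavily
than the remaining text) may be illustrated by the following minimal
sketch. The function name, the toy thesaurus, and the title weight of 2.0
are assumptions made for illustration only; the report does not state the
weight actually used, and this is not the original program.

```python
from collections import Counter

def weighted_concepts(title, body, thesaurus, title_weight=1.0, title_only=False):
    """Return a concept vector for one document.

    title, body  : lists of word tokens
    thesaurus    : dict mapping a word to a concept identifier (toy data)
    title_weight : weight of a title occurrence; body occurrences weigh 1.0
                   (the value is hypothetical, not taken from the report)
    title_only   : if True, only the title words are looked up
    """
    vector = Counter()
    # Title words are always looked up, at the chosen title weight.
    for word in title:
        concept = thesaurus.get(word)
        if concept is not None:
            vector[concept] += title_weight
    # The remainder of the text is used unless the "title only" process is chosen.
    if not title_only:
        for word in body:
            concept = thesaurus.get(word)
            if concept is not None:
                vector[concept] += 1.0
    return dict(vector)

# Toy thesaurus: "retrieval" and "search" are treated as synonyms.
toy_thesaurus = {"retrieval": "C2", "search": "C2", "index": "C4"}
full = weighted_concepts(["retrieval"], ["search", "index"], toy_thesaurus, title_weight=2.0)
title = weighted_concepts(["retrieval"], ["search", "index"], toy_thesaurus, title_only=True)
```

In this sketch the "title only" vector loses the concept contributed by
"index", which is the kind of information loss that would account for the
poorer performance reported for that process.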
C) The Phrase Dictionary
The performance of the statistical phrase dictionary may be evaluated
by using the output of Figs. 13 and 14. Fig. 13 presents a comparison
between the early "Harris 2" thesaurus, and the same thesaurus supplemented
by statistical phrases of equal weight. The same procedures are compared
in Fig. 14 for the more powerful "Harris 3" thesaurus. Fig. 14 also includes
performance figures for two combined searches consisting first of the regular
thesaurus look-up followed by a statistical phrase look-up, in which phrases
are weighted one and a half times as much as individual concepts.
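
The combined procedure just described (regular thesaurus look-up followed
by a statistical phrase look-up) may be sketched as follows. The sketch
assumes, for illustration only, that a statistical phrase is recognized
whenever all of its component concepts occur in the same text; the function
name and the toy dictionaries are hypothetical. Setting phrase_weight to 1.0
corresponds to phrases of equal weight, and 1.5 to phrases weighted one and
a half times as much as individual concepts.

```python
from collections import Counter

def concept_vector(tokens, thesaurus, phrases, phrase_weight=1.5):
    """Build a weighted concept vector from a list of word tokens.

    thesaurus     : dict mapping a word to a concept identifier (toy data)
    phrases       : dict mapping a frozenset of concept identifiers to a
                    phrase identifier (toy data)
    phrase_weight : weight given to each detected phrase; single concepts
                    receive 1.0 per occurrence
    """
    # Step 1: regular thesaurus look-up.
    concepts = [thesaurus[t] for t in tokens if t in thesaurus]
    vector = Counter()
    for concept in concepts:
        vector[concept] += 1.0

    # Step 2: statistical phrase look-up. Assumption: a phrase is detected
    # when every one of its component concepts is present in the text.
    present = set(concepts)
    for components, phrase_id in phrases.items():
        if components <= present:
            vector[phrase_id] += phrase_weight
    return dict(vector)

toy_thesaurus = {"information": "C1", "retrieval": "C2", "storage": "C3"}
toy_phrases = {frozenset({"C1", "C2"}): "P1"}
doc = ["information", "retrieval", "information"]
equal = concept_vector(doc, toy_thesaurus, toy_phrases, phrase_weight=1.0)
heavy = concept_vector(doc, toy_thesaurus, toy_phrases, phrase_weight=1.5)
```

The individual concepts are retained alongside the phrase identifier, so
that the phrase look-up supplements the thesaurus look-up rather than
replacing it.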
Fig. 13 shows that the statistical phrase process affords a noticeable
improvement in retrieval effectiveness, compared with the "Harris 2"
thesaurus alone; a much smaller improvement is obtained over "Harris 3",
as seen in Fig. 14. The third dictionary includes fewer ambiguities, thus
explaining why the phrase process is less important in this case.
For both synonym dictionaries it may be noticed that for very high