Scientific Report No. ISR-11, Information Storage and Retrieval
Information Analysis and Dictionary Construction (chapter)
G. Salton, M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Fig. 12 does for the "Harris 3" thesaurus what Fig. 9 did for the null dictionary: specifically, it shows the effect of using the thesaurus for title words only, compared to using it throughout, and of applying higher weights to the title than to the remainder of the text. The results are substantially in agreement with those previously obtained for the null thesaurus: the "title only" process is again much poorer, indicating that synonym recognition for title words alone, while better than no synonym recognition at all, is still not nearly so effective as full synonym detection; also as before, the increased weighting of title words does not substantially add to the retrieval effectiveness.

C) The Phrase Dictionary

The performance of the statistical phrase dictionary may be evaluated by using the output of Figs. 13 and 14. Fig. 13 presents a comparison between the early "Harris 2" thesaurus, and the same thesaurus supplemented by statistical phrases of equal weight. The same procedures are compared in Fig. 14 for the more powerful "Harris 3" thesaurus. Fig. 14 also includes performance figures for two combined searches consisting first of the regular thesaurus look-up followed by a statistical phrase look-up, in which phrases are weighted one and a half times as much as individual concepts. Fig. 13 shows that the statistical phrase process affords a noticeable improvement in retrieval effectiveness, compared with the "Harris 2" thesaurus alone; a much smaller improvement is obtained over "Harris 3", as seen in Fig. 14.
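The combined procedure described above (a thesaurus look-up followed by a statistical phrase look-up, with phrases weighted either equally or one and a half times as much as individual concepts) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy dictionaries, the concept and phrase names, the rule that a phrase is recognized when all of its component concepts occur in the same text, and the use of a cosine correlation for matching. It is not the report's actual implementation.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy thesaurus: word -> concept class (names are illustrative only).
THESAURUS = {
    "retrieval": "C_RETRIEVE", "search": "C_RETRIEVE",
    "document": "C_DOC", "text": "C_DOC",
    "storage": "C_STORE",
}

# Hypothetical statistical phrase dictionary. Assumption: a phrase is
# recognized when all of its component concepts occur in the same text.
PHRASES = {
    frozenset({"C_DOC", "C_RETRIEVE"}): "P_DOC_RETRIEVAL",
}


def concept_vector(words, phrase_weight=0.0):
    """Thesaurus look-up, optionally followed by a phrase look-up.

    Each thesaurus concept counts 1 per occurrence; each detected phrase
    is added once with weight `phrase_weight` (0 disables phrases).
    """
    vec = Counter(THESAURUS[w] for w in words if w in THESAURUS)
    if phrase_weight > 0:
        present = set(vec)  # concepts found by the thesaurus look-up
        for components, phrase in PHRASES.items():
            if components <= present:
                vec[phrase] += phrase_weight
    return vec


def cosine(a, b):
    """Cosine correlation between two weighted concept vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match(query, document, phrase_weight=0.0):
    """Correlate a query and a document under the same look-up settings."""
    return cosine(concept_vector(query, phrase_weight),
                  concept_vector(document, phrase_weight))


query = ["document", "search"]
document = ["text", "retrieval", "storage"]

thesaurus_only = match(query, document)                   # thesaurus alone
equal_phrases = match(query, document, phrase_weight=1.0)  # phrases of equal weight
heavy_phrases = match(query, document, phrase_weight=1.5)  # phrases weighted 1.5x
print(thesaurus_only, equal_phrases, heavy_phrases)
```

In this toy case the shared phrase raises the query-document correlation, and raises it further at the 1.5 weight. That is a property of the constructed example only and says nothing about the measured results in Figs. 13 and 14.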
The third dictionary includes fewer ambiguities, which explains why the phrase process is less important in this case. For both synonym dictionaries it may be noticed that for very high