Scientific Report No. IRS-13 Information Storage and Retrieval

Search Matching Functions (chapter)
E. M. Keen
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

... and ADI give the same result. The exceptions are the stem dictionary on Cran-1 abstracts and titles, and the suffix 's' dictionary on ADI abstracts and text.

Figures 18, 19, 20 and 21 present precision versus recall graphs for the stem and thesaurus dictionaries on the IRE-3 collection (Figure 18), the Cran-1 collection (Figure 19), and the ADI collection using text (Figure 20) and abstracts (Figure 21). General merit strongly favors numeric vectors, the only exceptions being the low-recall, high-precision area on Cran-1 stem and the small differences between the curves on ADI abstracts stem. Since the normalized measures for both recall and precision show ADI text suffix 's' to prefer logical vectors, a precision versus recall graph of this output, together with ADI abstracts suffix 's', is given in Figure 22. The graphs show numeric to be superior on both plots up to 0.8 recall; the difference between the merit indicated by the normalized measures and that shown by the graphs of standard measures is considered in Section II.

Comparisons of individual request merit are given in Figure 23: 76.5% to 88.2% of the requests favor numeric on IRE-3, 51.14% to 77.8% favor numeric on Cran-1, and 145.[illegible]% to 65.[illegible]% favor numeric on ADI.

C) Analysis of Performance

The thesaurus dictionaries show a greater improvement for numeric over logical than the stem and suffix 's' dictionaries do; a specific reason for this is suggested by the data in Figure 24. Using four ADI dictionaries and the ADI text results, it seems that numeric gives the largest increases in performance over logical with dictionaries that contain few concept classes.
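The distinction between logical and numeric vectors, and the per-request comparison reported in Figure 23, can be illustrated with a minimal sketch. It is not taken from the report: the function names, the toy concept counts, the use of cosine correlation as the matching function, and the exclusion of tied requests from the percentage are all assumptions made for illustration.

```python
from math import sqrt


def numeric_vector(concept_counts):
    """Numeric vector: keep the weight (here, a raw count) of each concept."""
    return {c: w for c, w in concept_counts.items() if w > 0}


def logical_vector(concept_counts):
    """Logical vector: record only presence (weight 1) of each concept."""
    return {c: 1 for c, w in concept_counts.items() if w > 0}


def cosine(u, v):
    """Cosine correlation between two sparse concept vectors (assumed matching function)."""
    dot = sum(w * v.get(c, 0) for c, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def percent_favoring(first, second):
    """Percentage of requests on which `first` scores higher than `second`.

    Assumption: tied requests are left out of the count; the report does not
    say how ties were treated.
    """
    decided = [(a, b) for a, b in zip(first, second) if a != b]
    if not decided:
        return 0.0
    wins = sum(1 for a, b in decided if a > b)
    return 100.0 * wins / len(decided)


# Hypothetical request and document concept counts.
request = {"retrieval": 3, "index": 1}
document = {"retrieval": 1, "index": 3}

logical_score = cosine(logical_vector(request), logical_vector(document))
numeric_score = cosine(numeric_vector(request), numeric_vector(document))

# Hypothetical per-request merit scores for the two vector types.
numeric_merit = [0.8, 0.5, 0.7, 0.4]
logical_merit = [0.6, 0.5, 0.9, 0.3]
favoring_numeric = percent_favoring(numeric_merit, logical_merit)
```

In this toy case the logical vectors of the request and document are identical, so weighting is the only thing that separates them; this is the kind of information that numeric vectors retain and logical vectors discard.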
The dictionary with the smallest number of classes is an exception to this for four of the performance measures used; this dictionary, however, has a performance that is inferior to the stem dictionary, thus explaining the discrepancy. The grouping of words achieved by a thesaurus provides a greater