Scientific Report No. IRS-13 Information Storage and Retrieval

Chapter: Search Matching Functions
E. M. Keen
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

The overlap correlation provides generally higher correlation coefficients than cosine, but this is of no direct importance, since the correlations are used only to order the documents into a ranked list in relation to each search request, so that the positions taken up by the relevant documents may be determined. The correlation values could be displayed for the user to permit him to examine only those documents above a certain correlation; however, since a ranked output is provided, it seems more likely that users will examine the highest ranked documents anyway and continue to look at the ranked list until they are satisfied, or until they are unwilling to examine additional documents on the basis of the document titles or abstracts.

B) Retrieval Performance Results

Retrieval runs are made on SMART comparing the overlap and cosine correlation coefficients, without weights (i.e. logical vectors), and keeping other variables such as document length and dictionary type constant. Twelve comparison runs on three collections are presented in Figure 3, and evaluated by normalized recall and normalized precision. In every case, the run with the cosine correlation gives a higher normalized recall and precision than the run with the overlap correlation. The ADI text thesaurus run shows an .0[illegible]3 increase in normalized recall, and the Cran-1 Abstract Stem run shows a normalized precision increase of .055, both increases in favor of cosine.

Figures 4, 5, 6, and 7 present precision versus recall graphs for the stem and thesaurus dictionaries on the IRE-3 collection (Figure 4), the Cran-1 collection (Figure 5), and the ADI collection using text (Figure 6) and abstracts (Figure 7).
General merit still favors cosine, although the ADI results show very small differences in the curves, and overlap is superior to cosine in the low recall-high precision area on the stem dictionary, text, and abstract runs.
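
The comparison described in this section can be illustrated with a minimal sketch in Python. It is not the SMART implementation. It assumes the usual set-based forms of the two coefficients for unweighted (logical) vectors: overlap as the number of shared terms divided by the size of the smaller vector, and cosine as the number of shared terms divided by the geometric mean of the two sizes. It also assumes the standard rank-based definitions of normalized recall and normalized precision. The toy request, the four documents, and all function names are hypothetical and chosen only for illustration.

```python
import math

def overlap(q, d):
    """Overlap correlation of two logical vectors, given as term sets (assumed form)."""
    if not q or not d:
        return 0.0
    return len(q & d) / min(len(q), len(d))

def cosine(q, d):
    """Cosine correlation of two logical vectors, given as term sets."""
    if not q or not d:
        return 0.0
    return len(q & d) / math.sqrt(len(q) * len(d))

def rank_documents(query, docs, correlation):
    """Order document ids by decreasing correlation with the request."""
    return sorted(docs, key=lambda doc_id: correlation(query, docs[doc_id]),
                  reverse=True)

def _relevant_ranks(ranking, relevant):
    # 1-based positions of the relevant documents in the ranked list.
    return [pos for pos, doc_id in enumerate(ranking, start=1) if doc_id in relevant]

def normalized_recall(ranking, relevant):
    """1 - (sum of actual ranks - sum of ideal ranks) / (n * (N - n)); needs 0 < n < N."""
    N, n = len(ranking), len(relevant)
    ranks = _relevant_ranks(ranking, relevant)
    return 1.0 - (sum(ranks) - sum(range(1, n + 1))) / (n * (N - n))

def normalized_precision(ranking, relevant):
    """Same idea on log ranks, scaled by ln of the binomial coefficient C(N, n)."""
    N, n = len(ranking), len(relevant)
    ranks = _relevant_ranks(ranking, relevant)
    excess = sum(math.log(r) for r in ranks) - sum(math.log(i) for i in range(1, n + 1))
    return 1.0 - excess / math.log(math.comb(N, n))

# Hypothetical toy data: one request and four documents as sets of index terms.
query = {"information", "retrieval", "correlation"}
docs = {
    "d1": {"information", "retrieval"},
    "d2": {"information", "retrieval", "correlation", "ranking", "vector", "thesaurus"},
    "d3": {"aerodynamics", "wing", "flow"},
    "d4": {"information", "storage"},
}
relevant = {"d2", "d4"}  # assumed relevance judgments

for name, fn in (("overlap", overlap), ("cosine", cosine)):
    ranking = rank_documents(query, docs, fn)
    print(name, ranking,
          round(normalized_recall(ranking, relevant), 3),
          round(normalized_precision(ranking, relevant), 3))
```

Under these assumed definitions the overlap value is never lower than the cosine value for the same pair of vectors, since the smaller of two sizes cannot exceed their geometric mean. This is consistent with the observation that overlap yields generally higher coefficients, and because only the ordering enters the evaluation measures, the absolute size of the coefficients plays no part in the comparison.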