Scientific Report No. IRS-13 Information Storage and Retrieval
Search Matching Functions
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The overlap correlation provides generally higher correlation coefficients
than cosine, but this is of no direct importance, since the correlations
are used only to order the documents into a ranked list in relation
to each search request so that the positions taken up by the relevant
documents may be determined. The correlation values could be displayed for the
user to permit him to examine only those documents above a certain correlation;
however, since a ranked output is provided, it seems more likely that users
will examine the highest ranked documents anyway and continue to look at the
ranked list until they are satisfied, or until they are unwilling to examine
additional documents on the basis of the document titles or abstracts.
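
The point that only the ordering matters can be illustrated with a small
sketch. It assumes the usual forms of the two coefficients for logical
(unweighted) vectors: cosine as the number of shared terms divided by the
square root of the product of the two vector lengths, and overlap as the
number of shared terms divided by the smaller of the two lengths. These
definitions are not restated in this section, and the function names, query,
and documents below are hypothetical.

```python
from math import sqrt

def cosine(query, doc):
    """Cosine correlation for logical vectors represented as term sets (assumed form)."""
    if not query or not doc:
        return 0.0
    return len(query & doc) / sqrt(len(query) * len(doc))

def overlap(query, doc):
    """Overlap correlation for logical vectors represented as term sets (assumed form)."""
    if not query or not doc:
        return 0.0
    return len(query & doc) / min(len(query), len(doc))

def rank(query, docs, measure):
    """Return document ids ordered by decreasing correlation with the query."""
    return sorted(docs, key=lambda d: measure(query, docs[d]), reverse=True)

# Hypothetical request and three hypothetical documents.
query = {"search", "matching", "function"}
docs = {
    "d1": {"search", "matching", "function", "cosine", "overlap", "ranking"},
    "d2": {"search", "matching"},
    "d3": {"thesaurus", "stem"},
}

for doc_id, terms in docs.items():
    print(doc_id, round(cosine(query, terms), 3), round(overlap(query, terms), 3))

print(rank(query, docs, cosine))   # ['d2', 'd1', 'd3']
print(rank(query, docs, overlap))  # ['d1', 'd2', 'd3'] (d1 and d2 tie at 1.0)
```

Under these assumed definitions the overlap value is never smaller than the
cosine value for the same pair, because the smaller of two lengths cannot
exceed their geometric mean. In this toy case the two measures also order the
documents differently, which is the only difference that affects retrieval
performance.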
B) Retrieval Performance Results
Retrieval runs are made on SMART comparing the overlap and cosine
correlation coefficients, without weights (i.e. logical vectors), and keeping
other variables such as document length and dictionary type constant.
Twelve comparison runs on three collections are presented in Figure 3, and
evaluated by normalized recall and normalized precision. In every case, the
run with the cosine correlation gives a higher normalized recall and precision
than the run with the overlap correlation. The ADI text thesaurus run shows
an .0[OCRerr]3 increase in normalized recall, and the Cran-1 Abstract Stem run shows
a normalized precision increase of .055, both increases in favor of cosine.
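
For readers who want to see how the two evaluation measures behave, the
following is a minimal sketch. It assumes the commonly cited definitions:
normalized recall as 1 - (sum of r_i - sum of i) / (n (N - n)), and normalized
precision as 1 - (sum of log r_i - sum of log i) / log(N! / ((N - n)! n!)),
where r_i are the ranks of the n relevant documents in a collection of N
documents. This section does not restate the formulas, so their form and the
function names here are assumptions.

```python
from math import log

def _check(relevant_ranks, collection_size):
    n = len(relevant_ranks)
    if not 0 < n < collection_size:
        raise ValueError("need at least one relevant and one non-relevant document")
    return n

def normalized_recall(relevant_ranks, collection_size):
    """1.0 for a perfect ranking, 0.0 when all relevant documents are ranked last."""
    n = _check(relevant_ranks, collection_size)
    actual = sum(relevant_ranks)
    ideal = sum(range(1, n + 1))
    return 1.0 - (actual - ideal) / (n * (collection_size - n))

def normalized_precision(relevant_ranks, collection_size):
    """Same idea as normalized recall, but on log ranks, so early ranks weigh more."""
    n = _check(relevant_ranks, collection_size)
    actual = sum(log(r) for r in relevant_ranks)
    ideal = sum(log(i) for i in range(1, n + 1))
    # log(N! / ((N - n)! n!)) written as a difference of sums of logs.
    worst = sum(log(i) for i in range(collection_size - n + 1, collection_size + 1)) - ideal
    return 1.0 - (actual - ideal) / worst

# Hypothetical request: 3 relevant documents in a collection of 10,
# retrieved at ranks 1, 3, and 5.
print(normalized_recall([1, 3, 5], 10))
print(normalized_precision([1, 3, 5], 10))
```

Both measures depend only on the ranks of the relevant documents, which is
why the absolute size of the correlation coefficients does not enter the
comparison.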
Figures 4, 5, 6, and 7 present precision versus recall graphs for the
stem and thesaurus dictionaries on the IRE-3 collection (Figure 4), the
Cran-1 collection (Figure 5), and the ADI collection using text (Figure 6)
and abstracts (Figure 7). General merit still favors cosine, although the ADI
results show very small differences in the curves, and overlap is superior
to cosine in the low recall-high precision area on the stem dictionary, text,
and abstract runs.
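
As a rough illustration of what a precision versus recall curve records, the
sketch below computes the (recall, precision) pair at each rank where a
relevant document appears in one ranked list. It is a simplified, per-request
view with hypothetical document ids and a hypothetical function name. It does
not reproduce the averaging over requests used to draw the figures.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) after each relevant document in the ranking."""
    points = []
    found = 0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            found += 1
            points.append((found / len(relevant), found / position))
    return points

# Hypothetical ranked output for one request with two relevant documents.
ranking = ["d2", "d5", "d1", "d4", "d3"]
relevant = {"d2", "d1"}
print(precision_recall_points(ranking, relevant))
```

A measure that places the first relevant documents higher produces better
precision at the low recall end of such a curve, even if another measure does
better over the whole ranked list.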