IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Search Matching Functions
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
produces better denominator values as well for the relevant documents (see
Figure 30); no explanation for this phenomenon is suggested.
Support for the second hypothesis is obtained in Figure 31, where
the Cranfield request on ablation is again used as an example. The two relevant
documents have poor matches with the request, but since the matching concept
is the most important request word, weights of [illegible] are derived from the
frequency of occurrence in the document; a non-relevant document with more
matching concepts, but spurious ones, is ranked below the relevant documents
when the weights are in use.
5. Conclusions and Suggested Further Studies
A matching function that consists of the cosine correlation with
numeric vectors has been shown to be nearly always superior to the use of
either the overlap correlation or logical vectors. A simplified table of
results using precision versus recall graphs, for normalized measures, and
individual requests is given in Figure 32.
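
As an illustration of the matching functions compared above, the following is a minimal sketch in Python. It assumes the usual forms of the two coefficients (cosine as the normalized inner product; overlap as the sum of the element-wise minima divided by the smaller of the two vector sums), which are not restated in this section. The concept names, the weights, and the two sample documents are hypothetical and are not taken from the Cranfield data.

```python
import math


def cosine(q, d):
    """Cosine correlation of two concept vectors stored as {concept: weight}."""
    dot = sum(w * d.get(c, 0) for c, w in q.items())
    norm = math.sqrt(sum(w * w for w in q.values()) *
                     sum(w * w for w in d.values()))
    return dot / norm if norm else 0.0


def overlap(q, d):
    """Overlap correlation (assumed form): shared weight / smaller vector sum."""
    shared = sum(min(w, d.get(c, 0)) for c, w in q.items())
    smaller = min(sum(q.values()), sum(d.values()))
    return shared / smaller if smaller else 0.0


def logical(v):
    """Reduce a numeric (weighted) vector to a logical (0/1) vector."""
    return {c: 1 for c, w in v.items() if w > 0}


# Hypothetical request: one heavily weighted concept and two minor ones.
request = {"ablation": 3, "heat": 1, "shield": 1}

# Hypothetical short document matching only the important concept.
short_doc = {"ablation": 4, "material": 1}

# Hypothetical long document matching the two minor concepts among many others.
long_doc = {"heat": 1, "shield": 1}
long_doc.update({"other%d" % i: 1 for i in range(10)})

for name, doc in (("short", short_doc), ("long", long_doc)):
    print(name,
          "overlap/logical = %.3f" % overlap(logical(request), logical(doc)),
          "cosine/numeric = %.3f" % cosine(request, doc))
```

With these made-up vectors, overlap on logical vectors places the long document first because it has more matching concepts, while cosine on numeric vectors places the short document first. The sketch only shows how the two functions can disagree; it does not reproduce any result reported in the figures.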
The cosine correlation coefficient works better than the overlap
coefficient because the factor of document length included in the cosine
coefficient reduces the request/document correlation for a number of the
highly matched non-relevant documents, since there is a strong correlation
among non-relevant documents between number of matching concepts and the length
of the document. The superiority of weighted concepts, evidenced by the better
performance of numeric as opposed to logical vectors, can be attributed to two reasons.
The first is that highly weighted matching concepts tend to distinguish between
important and trivial occurrences of those concepts in the documents, and
thus tend to make better distinctions between relevant and non-relevant
documents. The second reason is that if different concepts in a request
receive different weights, such weighting does discriminate between vital