Scientific Report No. IRS-13 Information Storage and Retrieval
Search Matching Functions
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
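\[
\text{OVERLAP} = \frac{\sum \min(a_i, b_i)}{\min\left(\sum a_i, \sum b_i\right)} = \frac{5}{\min(8, 18)} = \frac{5}{8} = 0.63 \tag{1}
\]

\[
\text{COSINE} = \frac{\sum a_i b_i}{\sqrt{\sum a_i^2 \cdot \sum b_i^2}} = \frac{5}{12} = 0.42 \tag{2}
\]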
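The two worked values can be checked with a short Python sketch. It assumes binary (weight 1) concept vectors in which the request holds 8 concepts, the document 18, and 5 are shared, which is one reading of the numbers in the example. The function names, the dictionary representation, and the concept numbers are illustrative assumptions and are not taken from the report.

```python
import math


def overlap(a, b):
    """Sum of the smaller weight on each shared concept, divided by the
    smaller of the two total weights. Returns 0.0 for an empty vector."""
    smaller_total = min(sum(a.values()), sum(b.values()))
    if smaller_total == 0:
        return 0.0
    shared = sum(min(weight, b[concept]) for concept, weight in a.items() if concept in b)
    return shared / smaller_total


def cosine(a, b):
    """Inner product divided by the product of the two vector lengths.
    Returns 0.0 for an empty vector."""
    norm = math.sqrt(sum(w * w for w in a.values()) * sum(w * w for w in b.values()))
    if norm == 0:
        return 0.0
    dot = sum(weight * b[concept] for concept, weight in a.items() if concept in b)
    return dot / norm


# Hypothetical concept numbers, chosen only so that the counts are 8, 18 and 5.
request = {concept: 1 for concept in range(1, 9)}     # concepts 1..8
document = {concept: 1 for concept in range(4, 22)}   # concepts 4..21, sharing 4..8

overlap_value = overlap(request, document)   # 5 / min(8, 18) = 5/8
cosine_value = cosine(request, document)     # 5 / sqrt(8 * 18) = 5/12

print(overlap_value, cosine_value)
```

Under these assumptions the sketch gives 5/8 for overlap and 5/12 for cosine, which the report quotes to two places as 0.63 and 0.42. The same functions accept non-binary weights, since both are defined on weighted concept numbers.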
Both functions are designed for use with weighted concept numbers, and their
use in this manner is illustrated in part 4. Since in the tests carried out,
the requests are generally shorter than the documents (except occasionally
when title runs are being made), the overlap function in documentary terms
measures the inclusion of the request terms in the document only. Thus, if
a request with eight concepts matches five of them in several documents, all
such documents will receive identical correlations with the request. The
cosine function measures the similarity of the total request to the total
document, and non-matching concepts in both requests and documents affect the
final correlation. Thus, for a request that matches five out of eight concepts
in several documents, the document that has the fewest non-matching
concepts will receive the highest correlation. Cosine thus takes into account
document length, following the principle that if two documents have equal
request/document matching concepts, the shorter document has a higher proba-
bility of being useful to the requestor, since it will contain less extraneous
material. In documentary terms this principle seems of doubtful validity
since a requestor may be equally satisfied by treatment of the requested
topic in a long document as in a short one.
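
The contrast between the two functions can be illustrated with a minimal Python sketch, restricted to the binary case in which every concept has weight 1. The three documents, their lengths (10, 18 and 40 concepts), and all names in the code are hypothetical and chosen only for illustration; each document matches the same five of the request's eight concepts.

```python
import math


def overlap_binary(request, document):
    """Binary overlap: matched concepts over the size of the smaller set."""
    matched = len(request & document)
    return matched / min(len(request), len(document))


def cosine_binary(request, document):
    """Binary cosine: matched concepts over the geometric mean of the set sizes."""
    matched = len(request & document)
    return matched / math.sqrt(len(request) * len(document))


request = set(range(1, 9))      # eight request concepts (hypothetical numbers)
matched = {4, 5, 6, 7, 8}       # the five concepts every document shares


def make_document(total_concepts):
    """Build a document with the five matched concepts plus non-matching filler."""
    extras = set(range(100, 100 + total_concepts - len(matched)))
    return matched | extras


documents = {
    "short": make_document(10),
    "medium": make_document(18),
    "long": make_document(40),
}

overlap_scores = {name: overlap_binary(request, doc) for name, doc in documents.items()}
cosine_scores = {name: cosine_binary(request, doc) for name, doc in documents.items()}
ranking = sorted(cosine_scores, key=cosine_scores.get, reverse=True)

print(overlap_scores)   # every document receives the same overlap correlation
print(ranking)          # cosine orders the documents from shortest to longest
```

In this sketch overlap assigns 5/8 to all three documents, while cosine ranks the 10-concept document first and the 40-concept document last, which is the document-length effect described above.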