Scientific Report No. IRS-13 Information Storage and Retrieval
Search Matching Functions
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
do not favor cosine occur when non-relevant documents are increased in match
and relevant documents decreased. Figure 11 shows that both these changes
take place on the Cran-1 Stem run at any rate, but that changes favoring
cosine as against overlap occur at a ratio of about 3 to 2.
In seeking an explanation for this, it must be assumed that only the
factor of document length brought to bear in the cosine correlation can be
causing this result. It seems likely therefore that the distribution of
documents by length among the strongly matched and weakly matched, and among
the relevant and non-relevant is not even in the ordering induced by overlap.
The first suggested explanation is simply that relevant documents tend to be
short in length, and non-relevant documents are long. This, however, does
not turn out to be the case: in the ADI collection, 70 of the total collection
of 82 are relevant to one or more of the requests, and in the Cran-1 collection
153 out of the 200 are at some time relevant. Figure 13 shows, for the Cran-1
collection, that the average document length differs only trivially among
all 200 documents, the 153 which are sometimes relevant, and the 47 which are
never relevant.
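The way document length enters the cosine correlation but not overlap can be illustrated with a minimal sketch. It assumes binary concept vectors and takes overlap as the number of shared concepts divided by the size of the smaller vector; both are assumptions made for illustration and may differ from the exact definitions used in the runs. The query, the documents, and their lengths (40 and 100 concepts) are hypothetical.

```python
import math


def overlap(query, doc):
    """Assumed overlap form for binary vectors: shared concepts
    divided by the size of the smaller of the two vectors."""
    shared = len(query & doc)
    return shared / min(len(query), len(doc))


def cosine(query, doc):
    """Cosine correlation for binary vectors: shared concepts divided
    by the geometric mean of the two vector sizes."""
    shared = len(query & doc)
    return shared / math.sqrt(len(query) * len(doc))


# Hypothetical data: a 10-concept query sharing 6 concepts with each document.
query = {f"q{i}" for i in range(10)}
shared_concepts = {f"q{i}" for i in range(6)}
short_doc = shared_concepts | {f"s{i}" for i in range(34)}  # 40 concepts
long_doc = shared_concepts | {f"l{i}" for i in range(94)}   # 100 concepts

for name, doc in (("short", short_doc), ("long", long_doc)):
    print(name, len(doc), round(overlap(query, doc), 3), round(cosine(query, doc), 3))
```

Under these assumptions both documents receive the same overlap value, since the query is the smaller vector in each case, while cosine gives the longer document a lower correlation. A long document that ties with a short one on overlap is therefore ranked below it on cosine.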
The remaining explanation is that the distribution of documents by
length differs between relevant and non-relevant documents for a given level
of match. Specifically, it is hypothesized that highly matched non-relevant
documents (i.e. highly ranked on overlap) are longer than average. Analysis
is performed to test this hypothesis by taking pairs of relevant and non-relevant
documents, the two members of each pair having an almost identical (and [OCRerr]t[OCRerr]ing) match
on overlap, and comparing document lengths. Figure 14 gives an individual
example, showing how the non-relevant document has 104 stem concepts and the
relevant one 39. When more than one relevant and non-relevant document have
identical matches, lengths can be averaged over all such documents; the results
given in Figure 15 are thus based on over 100 documents. Figure 15 shows