Scientific Report No. IRS-13 Information Storage and Retrieval

Search Matching Functions (chapter)
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

do not favor cosine occur when non-relevant documents are increased in match and relevant documents decreased. Figure 11 shows that both these changes take place, at least on the Cran-1 Stem run, but that changes favoring cosine over overlap outnumber those against it by a ratio of about 3 to 2. In seeking an explanation for this, it must be assumed that document length, the only additional factor brought to bear by the cosine correlation, is causing this result. It therefore seems likely that, in the ordering induced by overlap, documents of different lengths are not evenly distributed among the strongly matched and weakly matched, or among the relevant and non-relevant.

The first suggested explanation is simply that relevant documents tend to be short, and non-relevant documents long. This, however, does not turn out to be the case: in the ADI collection 70 of the total of 82 documents are relevant to one or more of the requests, and in the Cran-1 collection 153 of the 200 are at some time relevant. Figure 13 shows, for the Cran-1 collection, that the average document length differs only trivially between all 200 documents, the 153 which are sometimes relevant, and the 47 which are never relevant.

The remaining explanation is that the distribution of documents by length differs between relevant and non-relevant documents for a given level of match. Specifically, it is hypothesized that highly matched non-relevant documents (i.e. highly ranked on overlap) are longer than average. An analysis is performed to test this hypothesis by taking pairs of relevant and non-relevant documents, both members of each pair having an almost identical (and [illegible]) match on overlap, and comparing document lengths.
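The effect of document length on the two correlations can be illustrated with a minimal sketch. It assumes binary concept vectors represented as sets, and it takes overlap to be the number of shared concepts divided by the size of the smaller vector, and cosine to be the number of shared concepts divided by the geometric mean of the two sizes; the exact definitions used in the report are not given in this excerpt. The document lengths of 104 and 39 concepts are those quoted for the Figure 14 example, while the query size (8), the number of matching concepts (5), and the helper `make_doc` are hypothetical values chosen only for illustration.

```python
import math


def overlap(query: set, doc: set) -> float:
    """Shared concepts divided by the size of the smaller vector (assumed definition)."""
    return len(query & doc) / min(len(query), len(doc))


def cosine(query: set, doc: set) -> float:
    """Cosine correlation for binary vectors: shared / sqrt(|query| * |doc|)."""
    return len(query & doc) / math.sqrt(len(query) * len(doc))


def make_doc(n_shared: int, length: int) -> set:
    """Hypothetical helper: a document sharing n_shared concepts with the query.

    Concepts 0..n_shared-1 come from the query; the rest are filler
    identifiers (100 and up) that the query does not contain.
    """
    return set(range(n_shared)) | set(range(100, 100 + length - n_shared))


# Hypothetical 8-concept query; both documents match it on 5 concepts.
query = set(range(8))
long_nonrelevant = make_doc(5, 104)  # length quoted for the non-relevant document
short_relevant = make_doc(5, 39)     # length quoted for the relevant document

for name, doc in [("non-relevant", long_nonrelevant), ("relevant", short_relevant)]:
    print(f"{name:13s} length={len(doc):3d} "
          f"overlap={overlap(query, doc):.3f} cosine={cosine(query, doc):.3f}")
```

Under these assumptions both documents receive the same overlap score, because the query is the shorter vector in each case, whereas cosine gives the shorter document the higher score. This is the kind of re-ranking that would favor cosine if highly matched non-relevant documents do tend to be longer.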
Figure 14 gives an individual example, showing how the non-relevant document has 104 stem concepts and the relevant one 39. When more than one relevant and non-relevant document have identical matches, lengths can be averaged over all such documents; the results given in Figure 15 are thus based on over 100 documents. Figure 15 shows