IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Search Matching Functions
chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
that, on overlap, non-relevant documents are longer than relevant, the length
being considerably above the average. Similar parameters are calculated
for the cosine ordering in Figures 1[illegible] and 15, and very similar lengths are
then obtained for relevant and non-relevant documents in this case.
Since the analysis shows that non-relevant documents with strong
matches are longer than average, it is now obvious that cosine effectively
lowers the ranks of these documents, and thus provides a better retrieval
performance than overlap. Although it is certainly the case that non-relevant
documents with weak matches must be shorter than average, it seems that their
low match (sometimes as low as zero) is never sufficient to increase their
rank by any significant amount on overlap; it is the strongly matched non-
relevant only that are responsible for the superiority of the cosine correlation.
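
The effect described above can be illustrated with a minimal sketch. It assumes binary concept vectors, and it assumes that the overlap coefficient is the sum of element-wise minima divided by the smaller of the two vector totals; the concept names and document contents are hypothetical and are not taken from the test collection.

```python
import math

def overlap(request, doc):
    """Overlap coefficient (assumed form): shared weight / smaller vector total."""
    shared = sum(min(w, doc.get(c, 0)) for c, w in request.items())
    denom = min(sum(request.values()), sum(doc.values()))
    return shared / denom if denom else 0.0

def cosine(request, doc):
    """Cosine correlation: inner product divided by the product of vector lengths."""
    dot = sum(w * doc.get(c, 0) for c, w in request.items())
    norm = math.sqrt(sum(w * w for w in request.values())) * \
           math.sqrt(sum(w * w for w in doc.values()))
    return dot / norm if norm else 0.0

# Hypothetical request of six equally weighted concepts.
request = {f"c{i}": 1 for i in range(1, 7)}

# Both documents match three of the six request concepts.
matched = {"c1": 1, "c2": 1, "c3": 1}

# Short document: 3 matching concepts plus 5 others (8 in all).
short_doc = {**matched, **{f"s{i}": 1 for i in range(5)}}

# Long document: the same 3 matching concepts plus 27 others (30 in all).
long_doc = {**matched, **{f"n{i}": 1 for i in range(27)}}

for name, doc in [("short", short_doc), ("long", long_doc)]:
    print(name, round(overlap(request, doc), 3), round(cosine(request, doc), 3))
```

Under these assumptions both documents receive the same overlap value, since the denominator is fixed by the shorter request, whereas cosine gives the longer document a distinctly lower correlation and therefore a lower rank.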
This phenomenon is probably caused by the fact that not all non-
relevant documents have an equal probability of resulting in spurious matches;
as seems logical, the probability of spurious matches is greater in larger
documents. Spurious matches result from spurious concept combinations, which
arise because no judgments of importance are made to discriminate between
request concepts; that is, any combination of, say, three concepts (out of
six in a request) is assumed to be as important as any other. An example of
this is given in Figure 14, where both a non-relevant and a relevant document
match in three out of the six concepts; the data of Figure 16 show, however,
that the non-relevant document matches on words such as [illegible], "report" and
"measurement", which turn out to be spurious. Such spurious matches are more
likely to occur for long non-relevant documents than for short ones. The
logical search formulations used in post-coordinate manual systems would
eliminate many such false matches; some success in this direction can be
achieved without manual search formulation by use of weighting methods, to be
described next.
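
The dependence of spurious matching on document length can be made concrete with a rough calculation. The sketch below assumes a purely random model, which is not taken from the report: a document of a given length is treated as a set of distinct concepts drawn uniformly from a vocabulary of hypothetical size 500, and a spurious match is taken to mean that at least three of the six request concepts appear by chance. The function name and all parameter values are illustrative only.

```python
from math import comb

def p_spurious(doc_len, k=3, request_len=6, vocab=500):
    """Chance that at least k of the request concepts occur in a document of
    doc_len distinct concepts drawn uniformly at random from the vocabulary
    (hypergeometric upper tail; an assumed model, not the report's)."""
    total = comb(vocab, doc_len)
    hits = sum(
        comb(request_len, j) * comb(vocab - request_len, doc_len - j)
        for j in range(k, min(request_len, doc_len) + 1)
    )
    return hits / total

# Compare documents of increasing length under the same assumptions.
for n in (10, 20, 40, 60):
    print(n, p_spurious(n))
```

In this model the chance of a three-concept match rises steadily with document length, which is consistent with the observation that strongly matched non-relevant documents tend to be longer than average.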