Scientific Report No. IRS-13 Information Storage and Retrieval
Search Matching Functions
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Examining Figures 26, 27, and 28, the following individual cases may be noted:
1. Relevant documents that are long receive low ranks on cosine
logical unless very highly matched. The great length of document
797 is offset on cosine numeric by the very high weights associated
with the matching concepts.
2. Relevant documents having few matching concepts, which cosine
logical ranks below certain higher-matching non-relevant documents,
receive improved ranks on cosine numeric when the matching concepts
are highly weighted (see documents 1420 and 7914). When the
matching concepts of relevant documents are not highly weighted,
the numeric measure usually worsens their rank positions (see documents
793, 795 and 796).
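
For reference, the two measures compared in these cases can be written out as follows. This is a sketch that assumes the standard cosine correlation between a request vector q and a document vector d; the symbols q_i, d_i, m, |q| and |d| are notation introduced here for illustration and are not taken from the report.

```latex
% Cosine numeric: q_i and d_i are the concept weights.
\cos_{\mathrm{numeric}}(q, d) =
  \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}

% Cosine logical: q_i, d_i \in \{0, 1\}, so the sums reduce to counts.
% m = number of matching concepts,
% |q|, |d| = number of concepts in the request and in the document.
\cos_{\mathrm{logical}}(q, d) = \frac{m}{\sqrt{|q|\,|d|}}
```

Under these assumptions, a long document has a large |d|, which lowers its logical cosine unless m is also large (case 1). In the numeric form the numerator grows with the weights of the matching concepts, so a few highly weighted matches can offset a large document norm (case 2).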
From these data two hypotheses emerge. First, if a relevant and a
non-relevant document have similar numbers of matching concepts or similar
rank positions using cosine logical, the introduction of weights will on average
result in higher matches for the relevant than for the non-relevant documents.
It seems reasonable that low-weighted matching concepts should have a higher
probability of reflecting a trivial occurrence of those concepts in the
document than do concepts that are highly weighted.
The second hypothesis is that weights assigned to the matching
concepts provide some measure of discrimination between concepts according to
their importance; this discrimination is of value in matching relevant
documents. In such cases spurious matches on many concepts are distinguished
from correct matches, even when the correct matches are obtained with fewer concepts.
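
The effect described by the second hypothesis can be illustrated with a small sketch. The vectors and weights below are hypothetical, invented for illustration only; they are not data from the report. The `cosine` and `logical` helpers are likewise assumptions about how the two measures might be computed, using the standard cosine correlation.

```python
import math

def cosine(query, doc):
    """Cosine correlation of two sparse vectors given as {concept: weight}."""
    numerator = sum(w * doc[c] for c, w in query.items() if c in doc)
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    return numerator / (norm_q * norm_d) if norm_q and norm_d else 0.0

def logical(vec):
    """Drop the weights: every concept that is present counts as 1."""
    return {c: 1 for c, w in vec.items() if w > 0}

# Hypothetical vectors (assumed for illustration, not taken from the report).
request = {"a": 12, "b": 12, "c": 12, "d": 12}
# Relevant document: only two matching concepts, but highly weighted.
relevant = {"a": 48, "b": 36, "x": 12, "y": 12}
# Non-relevant document: three matching concepts, all with low weights.
non_relevant = {"a": 12, "b": 12, "c": 12, "p": 48, "q": 48, "r": 36}

logical_rel = cosine(logical(request), logical(relevant))
logical_non = cosine(logical(request), logical(non_relevant))
numeric_rel = cosine(request, relevant)
numeric_non = cosine(request, non_relevant)

print(f"cosine logical: relevant={logical_rel:.3f} non-relevant={logical_non:.3f}")
print(f"cosine numeric: relevant={numeric_rel:.3f} non-relevant={numeric_non:.3f}")
```

With these invented weights, cosine logical places the non-relevant document above the relevant one because it matches on more concepts, while cosine numeric reverses the order because the relevant document's matches carry higher weights.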
Evidence that the first hypothesis holds for request 0137 is given
in Figure 29, which shows that the change from logical to numeric produces far
better values in the numerator of the cosine correlation for relevant documents
than for the non-relevant documents. In this example, numeric also