IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Summary
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
single number serves as an indication of system performance, and continuous
recall-precision curves. The construction of the recall-precision curves
is described in detail in section II, as are the methods used for producing
curves averaged over many search requests. Extensions of the basic evaluation
techniques are also discussed to cover cases where variable relevance
grades are assigned to documents, and to indicate the problems inherent
in a comparison between experimental and operational retrieval systems.
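As a rough illustration of the quantities behind a recall-precision curve, the following minimal sketch computes one (recall, precision) point after each retrieved document for a single search request. It is not the construction given in section II, which this summary does not reproduce; the function name, the document identifiers, and the use of binary relevance judgments are all assumptions made for the example.

```python
def recall_precision_points(ranked_ids, relevant_ids):
    """Return one (recall, precision) pair after each retrieved document.

    ranked_ids   -- document identifiers in decreasing order of correlation
    relevant_ids -- identifiers judged relevant to the request (binary judgments)
    Both arguments are hypothetical inputs chosen for illustration.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        recall = hits / len(relevant)   # fraction of relevant documents retrieved so far
        precision = hits / rank         # fraction of retrieved documents that are relevant
        points.append((recall, precision))
    return points


# Hypothetical request: four documents retrieved, two of them relevant.
points = recall_precision_points(["d3", "d7", "d1", "d9"], ["d3", "d1"])
```

Averaging over many search requests would additionally require a rule for combining such point sets, for example interpolating precision at fixed recall levels; that choice is left open here.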
Detailed test results, covering the correlation methods used to
compare analyzed documents with analyzed search requests, are given in
section III. The analysis includes, in particular, a comparison between
the "overlap" coefficient, which represents a measure proportional to the
number of matching terms, and the "cosine" coefficient, which also takes into
account the total number of terms present in a given document. In
each case, the terms are either weighted in accordance with their pre-
sumed importance, or unweighted. The conclusion reached is that the
cosine correlation used with the weighted content identifiers produces
a superior retrieval performance in comparison with the other possible
correlation procedures.
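The difference between the two coefficients can be sketched as follows. The formulas are commonly cited forms and are an assumption here, since this summary does not state them: the cosine is the inner product divided by the product of the vector lengths, and the overlap is the sum of the term-wise minimum weights divided by the smaller of the two total weights. The term vectors are hypothetical; unweighted terms are represented by a weight of 1.

```python
import math


def cosine(query, doc):
    """Cosine correlation between two term-weight dictionaries (assumed form)."""
    shared = set(query) & set(doc)
    dot = sum(query[t] * doc[t] for t in shared)
    norm = (math.sqrt(sum(w * w for w in query.values()))
            * math.sqrt(sum(w * w for w in doc.values())))
    return dot / norm if norm else 0.0


def overlap(query, doc):
    """Overlap correlation: matched weight over the smaller total weight (assumed form)."""
    shared = set(query) & set(doc)
    matched = sum(min(query[t], doc[t]) for t in shared)
    denom = min(sum(query.values()), sum(doc.values()))
    return matched / denom if denom else 0.0


# Hypothetical unweighted vectors: both documents contain every query term,
# but the long document carries six additional terms.
query = {"retrieval": 1, "evaluation": 1}
short_doc = {"retrieval": 1, "evaluation": 1, "recall": 1}
long_doc = dict(short_doc, **{f"t{i}": 1 for i in range(6)})

overlap_short, overlap_long = overlap(query, short_doc), overlap(query, long_doc)
cosine_short, cosine_long = cosine(query, short_doc), cosine(query, long_doc)
```

Under these assumed definitions the overlap scores of the two documents are identical, because only the matching terms count, while the cosine ranks the shorter document higher, because the total number of terms in each document enters the denominator.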
Ten additional correlation measures are examined in section IV by
K. Reitsma and J. Sagalyn, using the ADI collection for test purposes.
The coefficients used include, in particular, the cosine function, the
overlap measure, the inner product, the Maron-Kuhns measure, the Parker-
Rhodes-Needham measure, and others. Overall, when all recall levels are
taken into account, the cosine measure again produces the best results.