Scientific Report No. IRS-13 Information Storage and Retrieval
Evaluation Parameters
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
"-9
These measures are desirable even with a ranking system, since they alone seem
capable of representing a user's viewpoint (Viewpoint 2, Properties 5 and 6,
Figure 2). It is a simple matter to construct performance curves of this type
from a ranked output, since a series of cut-off points may be chosen, precision
and recall calculated, and the points joined to form a curve.
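The construction can be made concrete with a short sketch (not part of the original report); the document identifiers and relevance judgments used below are hypothetical.

```python
# Illustrative sketch (not part of the original report): computing precision
# and recall at a cut-off placed after each document in a ranked output.

def precision_recall_points(ranked_docs, relevant_docs):
    """Return a list of (cutoff, recall, precision) tuples, one per cut-off."""
    total_relevant = len(relevant_docs)
    points = []
    retrieved_relevant = 0
    for cutoff, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            retrieved_relevant += 1
        precision = retrieved_relevant / cutoff
        recall = retrieved_relevant / total_relevant
        points.append((cutoff, recall, precision))
    return points

# Hypothetical ranked output for one request; "d1" and "d4" are the relevant items.
example = precision_recall_points(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"})
# Joining the (recall, precision) points yields the step-shaped curve
# obtained for a single request.
```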
A precision versus recall curve for an individual request is presented
in Figure [OCRerr], using the familiar graph, and showing the shape of the curve
when a cut-off is established after each document. Results for a single
request always exhibit this step pattern, but the interpolation and extrapolation
techniques to be described in Part 4 produce a smoother curve. The practice,
as with the normalized measures, is to present results averaged over a whole
set of search requests, so Figure 5 shows as an example some averages for
two retrieval runs, in the form of a tabular computer print-out and a graph
of the precision versus recall curves.
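As an illustration of such averaging, the sketch below interpolates each request's curve at a set of standard recall levels and averages the precision values across requests; the interpolation rule it uses is an assumption made for illustration only and is not necessarily the technique described in Part 4.

```python
# A minimal sketch of averaging precision versus recall results over a set of
# requests. The interpolation rule shown (take the highest precision found at
# or beyond each standard recall level) is assumed for illustration only.

def interpolated_precision(points, recall_level):
    """Highest precision among cut-offs whose recall reaches recall_level.

    'points' is a list of (cutoff, recall, precision) tuples for one request.
    """
    candidates = [p for (_, r, p) in points if r >= recall_level]
    return max(candidates) if candidates else 0.0

def averaged_curve(per_request_points, recall_levels=None):
    """Average the interpolated precision across requests at each recall level."""
    if recall_levels is None:
        recall_levels = [i / 10 for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0
    curve = []
    for level in recall_levels:
        precisions = [interpolated_precision(points, level)
                      for points in per_request_points]
        curve.append((level, sum(precisions) / len(precisions)))
    return curve
```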
A quite similar "performance characteristic" curve is proposed for
use with ranking systems by Giuliano and Jones [8]; it seems to offer no
advantage over the precision versus recall curve, but it is advocated for
another reason to be discussed in Part 5. The normalized "sliding ratio" measure,
also proposed by Giuliano and Jones, uses either the recall or the precision ratio
at each cut-off point. The equation is given and an example is calculated
in Figure 6, showing that up to a cut-off equal to the number of relevant
items this measure is the precision ratio, and at higher cut-offs the
measure equals recall.
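The behaviour just described can be illustrated with a short sketch for the unweighted (binary relevance) case; this follows the description in the text rather than the equation given in Figure 6.

```python
# Sketch of the sliding-ratio behaviour described above, for the case of
# unweighted (binary) relevance judgments. This illustrates the behaviour
# stated in the text, not the equation of Figure 6 itself.

def sliding_ratio(ranked_docs, relevant_docs, cutoff):
    retrieved_relevant = sum(1 for d in ranked_docs[:cutoff] if d in relevant_docs)
    # Denominator: the best attainable number of relevant items at this cut-off.
    ideal = min(cutoff, len(relevant_docs))
    return retrieved_relevant / ideal

# For cut-offs up to the number of relevant items the value equals the precision
# ratio; at higher cut-offs it equals the recall ratio. A perfect ranking
# therefore scores 1.0 at every cut-off point, which is the property questioned
# in the text below.
```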
While it is true that a perfect result would produce a perfect measure of
performance, it would do so at every cut-off point, which would not seem to
be a desirable result. In the perfect case, for example, a user who wanted
high recall, not knowing how many relevant items the system contained, might
suggest a cut-off too early in the list, and miss some relevant items, yet
this measure would show a perfect result at