Scientific Report No. IRS-13
Information Storage and Retrieval

Chapter: Evaluation Parameters
E. M. Keen
Harvard University

Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

These measures are desirable even with a ranking system, since they alone seem capable of representing a user's viewpoint (Viewpoint 2, Properties 5 and 6, Figure 2). It is a simple matter to construct performance curves of this type from a ranked output, since a series of cut-off points may be chosen, precision and recall calculated at each, and the points joined to form a curve. A precision versus recall curve for an individual request is presented in Figure 4, using the familiar graph and showing the shape of the curve when a cut-off is established after each document. Results for a single request always exhibit this step pattern, but the interpolation and extrapolation techniques to be described in Part 4 produce a smoother curve. The practice, as with the normalized measures, is to present results averaged over a whole set of search requests, so Figure 5 shows as an example some averages for two retrieval runs, in the form of a tabular computer print-out and a graph of the precision versus recall curves.

A quite similar "performance characteristic" curve is proposed for use with ranking systems by Giuliano and Jones [8], but it seems to offer no advantage over the precision versus recall curve; it is advocated for another reason, to be discussed in Part 5. The normalized "sliding ratio" measure, also proposed by Giuliano and Jones, uses either the recall or the precision ratio at each cut-off point. The equation is given and an example is calculated in Figure 6, showing that up to a cut-off equal to the number of relevant items this measure is the precision ratio, and at higher cut-offs the measure equals recall. While it is true that a perfect result would produce a perfect measure of performance, it would do so at every cut-off point, which would not seem to be a desirable property. In the perfect case, for example, a user who wanted high recall, not knowing how many relevant items the system contained, might suggest a cut-off too early in the list and miss some relevant items, yet this measure would show a perfect result at that cut-off.
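The construction of the stepped curve described above can be made concrete with a short computation: given the ranked output for one request and the set of documents judged relevant, precision and recall are evaluated at a cut-off placed after each document, and the resulting points are joined as in Figure 4. The sketch below is illustrative only; it assumes binary relevance judgments, and the ranked list and relevance set are invented for the example rather than drawn from the report's test collections.

    def precision_recall_points(ranked_ids, relevant_ids):
        """Precision and recall at a cut-off placed after each document in a ranked list."""
        relevant_ids = set(relevant_ids)
        total_relevant = len(relevant_ids)
        points = []
        hits = 0
        for cutoff, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
            precision = hits / cutoff            # relevant retrieved / total retrieved
            recall = hits / total_relevant       # relevant retrieved / total relevant
            points.append((cutoff, recall, precision))
        return points

    # Illustrative request: ten documents ranked, four of them relevant.
    ranked = [3, 7, 1, 9, 4, 12, 8, 2, 11, 5]
    relevant = {3, 9, 4, 2}
    for cutoff, recall, precision in precision_recall_points(ranked, relevant):
        print(f"cut-off {cutoff:2d}: recall {recall:.2f}, precision {precision:.2f}")

Plotting precision against recall for these cut-offs reproduces the step pattern that a single request always shows; averaging such curves over a set of requests gives figures of the kind shown in Figure 5.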
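The behaviour of the sliding ratio noted above, namely that it equals precision up to a cut-off equal to the number of relevant items and recall thereafter, can likewise be checked numerically. The sketch below assumes the binary-relevance form of the measure, in which the number of relevant items retrieved at each cut-off is divided by the number an ideal ranking would have retrieved at the same cut-off; Giuliano and Jones's full formulation with graded relevance weights is not reproduced here.

    def sliding_ratio(ranked_ids, relevant_ids):
        """Binary-relevance sliding ratio at each cut-off: relevant items retrieved
        divided by the number an ideal ranking would have retrieved by that point."""
        relevant_ids = set(relevant_ids)
        total_relevant = len(relevant_ids)
        values = []
        hits = 0
        for cutoff, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
            ideal_hits = min(cutoff, total_relevant)   # what a perfect ranking achieves
            values.append(hits / ideal_hits)
        return values

    # For cut-offs up to the number of relevant items the value equals precision;
    # beyond that point it equals recall.
    ranked = [3, 7, 1, 9, 4, 12, 8, 2, 11, 5]
    relevant = {3, 9, 4, 2}
    print(sliding_ratio(ranked, relevant))
    print(sliding_ratio([3, 9, 4, 2, 7, 1], relevant))   # perfect ranking: 1.0 at every cut-off

The second call illustrates the objection raised in the text: a perfect ranking scores 1.0 at every cut-off, so the measure gives no warning to a user who stops too early in the list and misses relevant items.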