IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Evaluation Parameters
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Cranfield curve, but any differences due to this effect are very small indeed.
In fact the Quasi-Cranfield and "Semi-Cranfield" methods result in a quite
similar performance curve, but the latter does give the theoretical maximum
performance that a user could achieve. Other choices of cut-off to be used
at the vertical segments would give curves positioned lower on the graph
than for these two methods, and would probably give performance curves that
would be more typical of user experience. However, for experimental test
comparisons, the procedures used are completely adequate.
C) Extrapolation Techniques for Request Generality Variations
Discussion of the recall level cut-off techniques suggests consideration
of one further problem, caused by the variation in numbers of relevant
documents for different requests. The problem is that requests having few
relevant documents cannot exhibit low recall values, and therefore have shorter
precision-recall curves than those that have many relevant documents. The
extreme example is furnished by a request with only one relevant document,
whose performance is reflected by only a single point on the graph, somewhere
at 1.0 recall. The question arises as to whether the performance of
such a request should still be incorporated in the average results at recall
levels lower than 1.0, and five possible methods are suggested.
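The size of the problem can be illustrated with a short sketch. It assumes
that recall is the fraction of a request's relevant documents retrieved so
far, and that a point is plotted each time a further relevant document is
retrieved; the function names and the three request sizes are hypothetical
and are not taken from the test collections discussed in this report.

```python
from fractions import Fraction


def achievable_recall_values(num_relevant):
    """Recall values a request can exhibit, one per relevant document retrieved.

    Assumes recall = (relevant retrieved) / (total relevant for the request).
    """
    return [Fraction(k, num_relevant) for k in range(1, num_relevant + 1)]


def lowest_recall(num_relevant):
    """Smallest non-zero recall the request can show on its curve."""
    return Fraction(1, num_relevant)


# Hypothetical request set: number of relevant documents per request.
relevant_counts = {"request_A": 1, "request_B": 4, "request_C": 10}

for name, n in relevant_counts.items():
    values = achievable_recall_values(n)
    print(name, "points on curve:", len(values),
          "lowest recall:", float(lowest_recall(n)))
```

Under these assumptions a request with n relevant documents has no point
below recall 1/n, so the request with a single relevant document contributes
only the point at 1.0 recall, as described above.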
The first method is to use individual precision-recall curves only
at points where they can in fact be drawn by methods discussed in part B;
at low recall values, only those requests having many relevant documents will
then enter into the averages. Figure 18 gives an example based on 42 requests,
where the numbers of requests that would enter into the averages are given
at each of ten recall levels. Although this method is quite simple to use and
gives quite acceptable results for "internal" test comparisons, any attempts
to compare dissimilar request sets are complicated by different request