Scientific Report No. IRS-13 Information Storage and Retrieval

Evaluation Parameters (chapter). E. M. Keen. Harvard University. Gerard Salton.

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Cranfield curve, but any differences due to this effect are very small indeed. In fact the Quasi-Cranfield and "Semi-Cranfield" methods result in quite similar performance curves, but the latter does give the theoretical maximum performance that a user could achieve. Other choices of cut-off to be used at the vertical segments would give curves positioned lower on the graph than for these two methods, and would probably give performance curves that would be more typical of user experience. However, for experimental test comparisons, the procedures used are completely adequate.

C) Extrapolation Techniques for Request Generality Variations

Discussion of the recall level cut-off techniques suggests consideration of one further problem, caused by the variation in numbers of relevant documents for different requests. The problem is that requests having few relevant documents cannot exhibit low recall values, and therefore have shorter precision-recall curves than those that have many relevant documents. The extreme example is furnished by a request with only one relevant document, where the performance is reflected by only a single point on the graph, somewhere at 1.0 recall. The question arises as to whether the performance of such a request should still be incorporated in the average results at recall levels lower than 1.0, and five possible methods are suggested.

The first method is to use individual precision-recall curves only at points where they can in fact be drawn by methods discussed in part B; at low recall values, only those requests having many relevant documents will then enter into the averages.
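As a rough illustration of this first method, the following Python sketch counts how many requests contribute at each of ten recall levels and averages precision over the contributing requests only. It rests on an assumption not spelled out in the text: a request with n relevant documents is taken to have its lowest attainable recall at 1/n, so its curve is used only at levels of 1/n or higher. The function names and the sample data are hypothetical and do not reproduce the figures of the report.

```python
from fractions import Fraction

# Ten recall levels 0.1, 0.2, ..., 1.0, held as exact fractions so that
# comparisons such as 1/10 <= 0.1 are not disturbed by float rounding.
RECALL_LEVELS = [Fraction(k, 10) for k in range(1, 11)]


def requests_entering(relevant_counts, levels=RECALL_LEVELS):
    """Count, per recall level, the requests whose curve reaches that level.

    Assumption: a request with n relevant documents cannot show a recall
    below 1/n, so it enters the average only at levels >= 1/n.
    """
    return [sum(1 for n in relevant_counts if Fraction(1, n) <= level)
            for level in levels]


def average_precision(requests, levels=RECALL_LEVELS):
    """Average precision per level over the contributing requests only.

    `requests` is a list of (n_relevant, precisions) pairs, where
    precisions[i] is the precision at levels[i]. Entries at levels below
    a request's lowest attainable recall are ignored. A level with no
    contributing request yields None.
    """
    averages = []
    for i, level in enumerate(levels):
        values = [precisions[i] for n, precisions in requests
                  if Fraction(1, n) <= level]
        averages.append(sum(values) / len(values) if values else None)
    return averages


if __name__ == "__main__":
    # Hypothetical request set: numbers of relevant documents per request.
    counts = [1, 2, 5, 10]
    for level, count in zip(RECALL_LEVELS, requests_entering(counts)):
        print(f"recall {float(level):.1f}: {count} request(s) averaged")
```

Under this assumption a request with a single relevant document enters only at recall 1.0, which is why the number of requests averaged falls off toward the low-recall end of the curve.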
Figure 18 gives an example based on 42 requests, where the numbers of requests that would enter into the averages are given at each of ten recall levels. Although this method is quite simple to use and gives quite acceptable results for "internal" test comparisons, any attempts to compare dissimilar request sets are complicated by different request