Scientific Report No. IRS-13 Information Storage and Retrieval

Evaluation Parameters (chapter). E. M. Keen. Harvard University. Gerard Salton.

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

... for test purposes and meets the need of indicating a user-oriented view of the result; the micro method, on the other hand, tends to give undue weight to requests that have many relevant documents. As Salton and Rocchio have shown [i,s], the macro method results in somewhat better precision-recall curves, but the difference between the two methods with current collections and requests is near to or less than 5%, as seen in the comparison of Figure 11. An occasional use of the micro method has usually given the same performance merit when two options are compared, so that this issue does not affect comparative test results at all.

Further work on the averaging problem may reveal that the arithmetic mean is not the only suitable method to use. Averaging is a problem simply because of the extreme variance in individual results, as can be seen from the plot of individual precision-recall curves for 22 requests given in Figure 12. The macro evaluation curve for these 22 requests is given in Figure 13, together with a curve based on the median rather than the mean. The scatter of results raises the question of statistical significance; this matter is discussed elsewhere [9].

B) Cut-off Techniques

Cut-off techniques in conventional manual and mechanized retrieval systems usually depend on the search terms used, with specified term-matches establishing the cut-off points. The equivalent in SMART is the use of the correlation coefficient that is obtained between the request and each document, but the provision of ranked output permits other cut-off criteria to be used, specifically related to the exact number, or acceptability, of the documents as they are examined.
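As a rough illustration of the two kinds of cut-off criteria just described, the following sketch applies a correlation threshold and a fixed rank cut-off to a single ranked output list. The function names, document identifiers and correlation values are hypothetical and invented for this example; they are not taken from SMART.

```python
from typing import List, Tuple

# Hypothetical ranked output for one request: (document id, correlation),
# already sorted by decreasing correlation. Values are invented.
RANKED: List[Tuple[str, float]] = [
    ("d7", 0.62), ("d2", 0.48), ("d9", 0.31), ("d4", 0.12), ("d1", 0.05),
]


def cut_by_correlation(ranked: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep documents whose correlation with the request is at least `threshold`."""
    return [doc for doc, corr in ranked if corr >= threshold]


def cut_by_rank(ranked: List[Tuple[str, float]], n: int) -> List[str]:
    """Keep the first `n` documents of the ranked list."""
    return [doc for doc, _ in ranked[:n]]


print(cut_by_correlation(RANKED, 0.30))  # ['d7', 'd2', 'd9']
print(cut_by_rank(RANKED, 2))            # ['d7', 'd2']
```

With a correlation threshold the number of documents retrieved varies from request to request, whereas a rank cut-off retrieves the same number for every request.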
Cut-off techniques for experimental purposes must be based on methods applicable to all requests, regardless of variations in the number of relevant items. For this reason only the ranked output list is used.
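A minimal sketch of how a rank cut-off and the two averaging methods interact is given here. It assumes the usual definitions: the macro average is the arithmetic mean of the per-request precision and recall values, and the micro average is computed from document counts pooled over all requests. The two requests, their relevance sets and the function names are hypothetical, and the sketch assumes every request has at least one relevant document and a cut-off of at least one.

```python
from statistics import mean

# Hypothetical requests: (ranked output list, set of relevant documents).
REQUESTS = {
    "A": (["a1", "a2", "a3", "a4"], {"a1"}),
    "B": (["b1", "b2", "b3", "b4"], {"b1", "b2", "b3", "b4"}),
}


def counts_at_cutoff(ranked, relevant, n):
    """Return (relevant retrieved, retrieved, relevant) for the top n ranks."""
    retrieved = ranked[:n]
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits, len(retrieved), len(relevant)


def macro_average(requests, n):
    """Arithmetic mean of the per-request precision and recall values."""
    per_request = [counts_at_cutoff(r, rel, n) for r, rel in requests.values()]
    precision = mean(h / ret for h, ret, _ in per_request)
    recall = mean(h / rel for h, _, rel in per_request)
    return precision, recall


def micro_average(requests, n):
    """Precision and recall from counts pooled over all requests."""
    per_request = [counts_at_cutoff(r, rel, n) for r, rel in requests.values()]
    hits = sum(h for h, _, _ in per_request)
    retrieved = sum(ret for _, ret, _ in per_request)
    relevant = sum(rel for _, _, rel in per_request)
    return hits / retrieved, hits / relevant


print(macro_average(REQUESTS, 2))  # (0.75, 0.75)
print(micro_average(REQUESTS, 2))  # (0.75, 0.6)
```

In this invented example request B, with four relevant documents, pulls the micro recall down to 0.6, while the macro recall weights both requests equally and gives 0.75.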