Scientific Report No. IRS-13
Information Storage and Retrieval
Evaluation Parameters (chapter)
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

The examples given are practically the only ones observed in over one hundred performance comparisons, and thus are definitely the exception rather than the rule. The reasons for the discrepancies lie in the way in which the different measures apply different weights to different distributions of the relevant documents; some research proposed by Michael Lesk is designed to investigate this problem.

The Construction of Average Precision Versus Recall Curves

In the context of the SMART experiments, the construction of a precision versus recall curve for a set of search requests requires techniques for averaging over individual requests, choosing cut-off points to construct curves, and coping with problems that arise because individual requests have differing numbers of relevant documents. Different methods of meeting these three problems are suggested, and these methods are divided into those that are suitable only for test comparisons (Purposes 1 and 2, Figure 2), and those that satisfy the need to simulate accurately the result experienced by real users (Purpose 3, Figure 2). An additional problem that arises only for the fast cluster searches is also discussed.

A) Averaging Techniques

The two main alternative averaging techniques have been described as "micro evaluation" and "macro evaluation" [1,5,6]. The micro method requires the compilation over all requests of the number of documents both retrieved and relevant (for a given cut-off) so that one final precision-recall pair can be calculated, whereas the macro method requires the computation of precision-recall pairs for each request, with the final precision-recall pair obtained by averaging, using the arithmetic mean.
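The difference between the two averaging techniques can be illustrated with a minimal sketch. It is not taken from the SMART system. The function names, the tuple layout, and the counts are hypothetical, and the sketch assumes every request retrieves at least one document and has at least one relevant document, so that no division by zero occurs.

```python
from typing import List, Tuple

# Hypothetical summary of one request at one cut-off:
# (relevant documents retrieved, documents retrieved, relevant documents in collection)
RequestCounts = Tuple[int, int, int]


def micro_average(requests: List[RequestCounts]) -> Tuple[float, float]:
    """Pool the counts over all requests, then form one precision-recall pair."""
    rel_ret = sum(r[0] for r in requests)
    retrieved = sum(r[1] for r in requests)
    relevant = sum(r[2] for r in requests)
    return rel_ret / retrieved, rel_ret / relevant


def macro_average(requests: List[RequestCounts]) -> Tuple[float, float]:
    """Form a precision-recall pair per request, then take arithmetic means."""
    precisions = [r[0] / r[1] for r in requests]
    recalls = [r[0] / r[2] for r in requests]
    n = len(requests)
    return sum(precisions) / n, sum(recalls) / n


# Made-up counts for two requests, both cut off after 10 retrieved documents.
# The first request has 2 relevant documents, the second has 30.
example = [(1, 10, 2), (6, 10, 30)]
micro_p, micro_r = micro_average(example)
macro_p, macro_r = macro_average(example)
```

With these made-up counts the two methods agree on precision (7/20 in both cases, because the cut-off is the same for both requests), but the micro recall is 7/32 while the macro recall is the mean of 1/2 and 6/30. In the pooled figure the request with many relevant documents carries more weight, whereas the macro method weights each request equally.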
The macro method is generally preferred because it provides both adequate comparisons