Scientific Report No. IRS-13 Information Storage and Retrieval
Evaluation Parameters
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The examples given are practically the only ones observed in over
one hundred performance comparisons, and thus are definitely the exception
rather than the rule. The reasons for the discrepancies lie in the way in
which the different measures apply different weights to different distributions
of the relevant documents; some research proposed by Michael Lesk is designed
to investigate this problem.
The Construction of Average Precision Versus Recall Curves
In the context of the SMART experiments, the construction of a
precision versus recall curve for a set of search requests requires techniques
for averaging over individual requests, choosing cut-off points to construct
curves, and coping with problems that arise because individual requests have
differing numbers of relevant documents. Different methods of meeting these
three problems are suggested, and these methods are divided into those that
are suitable only for test comparisons (Purposes 1 and 2, Figure 2), and
those that satisfy the need to accurately simulate the result experienced
by real users (Purpose 3, Figure 2). An additional problem that arises only
for the fast cluster searches is also discussed.
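To make the interaction between averaging and differing numbers of relevant documents concrete, the following is a minimal sketch in Python, not part of the SMART system. It contrasts pooling counts over all requests with taking the arithmetic mean of per-request values, which correspond roughly to the micro and macro methods described under A) below. The class `RequestResult`, the two function names, and the two sample requests are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """Counts for one search request at a single cut-off (hypothetical type)."""
    retrieved: int           # documents retrieved at the cut-off
    relevant_retrieved: int  # retrieved documents that are also relevant
    relevant_total: int      # relevant documents in the whole collection


def pooled_average(results):
    """Sum the counts over all requests, then form one precision-recall pair."""
    hits = sum(r.relevant_retrieved for r in results)
    retrieved = sum(r.retrieved for r in results)
    relevant = sum(r.relevant_total for r in results)
    return hits / retrieved, hits / relevant


def per_request_average(results):
    """Form a precision-recall pair per request, then take arithmetic means."""
    n = len(results)
    precision = sum(r.relevant_retrieved / r.retrieved for r in results) / n
    recall = sum(r.relevant_retrieved / r.relevant_total for r in results) / n
    return precision, recall


# Two invented requests with the same cut-off (10 documents retrieved) but
# very different numbers of relevant documents (2 versus 20).
sample = [
    RequestResult(retrieved=10, relevant_retrieved=2, relevant_total=2),
    RequestResult(retrieved=10, relevant_retrieved=5, relevant_total=20),
]

pooled_precision, pooled_recall = pooled_average(sample)
mean_precision, mean_recall = per_request_average(sample)
print(f"pooled:      precision={pooled_precision:.3f} recall={pooled_recall:.3f}")
print(f"per-request: precision={mean_precision:.3f} recall={mean_recall:.3f}")
```

With these invented figures the two precision values agree, because both requests retrieve the same number of documents. The recall values differ: pooling lets the request with 20 relevant documents dominate, whereas the per-request mean gives each request equal weight.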
A) Averaging Techniques
The two main alternative averaging techniques have been described
as "micro evaluation" and "macro evaluation" [1,5,6]. The micro method
requires the compilation over all requests of the number of documents both
retrieved and relevant (for a given cut-off) so that one final
precision-recall pair can be calculated, whereas the macro method requires the
computation of precision-recall pairs for each request, with the final
precision-recall pair obtained by averaging, using the arithmetic mean. The macro
method is generally preferred because it provides both adequate comparisons