Scientific Report No. IRS-13 Information Storage and Retrieval
Evaluation Parameters
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Although generality tends to vary between requests, an average value for a
set of requests serves to characterize a particular series of experiments.
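The averaging described above can be illustrated with a minimal sketch. It assumes that generality is the fraction of the collection that is relevant to a request, which is a common definition but is not stated in this passage; the function names and the sample figures are hypothetical.

```python
def generality(num_relevant, collection_size):
    """Fraction of the collection relevant to one request (assumed definition)."""
    if collection_size <= 0:
        raise ValueError("collection_size must be positive")
    return num_relevant / collection_size


def mean_generality(relevant_counts, collection_size):
    """Average generality over a set of requests run against one collection."""
    if not relevant_counts:
        raise ValueError("at least one request is required")
    values = [generality(n, collection_size) for n in relevant_counts]
    return sum(values) / len(values)


# Hypothetical request set: three requests with 2, 4, and 6 relevant
# documents in a 200-document collection.
average = mean_generality([2, 4, 6], 200)
```

A single figure of this kind summarizes a request set, while the per-request values remain available if the spread between requests is of interest.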
A second purpose of performance measurement is that of making
"external" comparisons between results obtained in different situations,
in which generality is expected to differ. Such comparisons may be made even
within an experimental test environment, if different request sets or collection
sizes are introduced and compared.
A third purpose that may be distinguished is a specific need to
interpret experimental results in terms of expected real-life merit, rather
than merely comparing different techniques in a laboratory. Experimental
tests of the kind conducted by SMART are simulation tests, and any conclusions
drawn from the results may need to be presented in a way that would
be typical of the performance if the system were being used operationally.
The choice of performance measures is also affected by viewpoint,
either the viewpoint of the user, or of a researcher seeking fundamental insight
into retrieval capability. User satisfaction is restricted to properties
"a", "b", and "c" in Figure 1, since a user is interested in examining as
few non-relevant items as possible, and as many relevant items as he wishes
to see, but he is not concerned about "d", or about the total collection
size. From a system efficiency viewpoint, which is of concern in some types
of research, the value of "d" and the collection size are needed. For
example, test comparisons between situations of differing generality require
measures that include "d" if a strict comparison of efficiency is the object.
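The distinction between the two viewpoints can be sketched in code. Figure 1 is not reproduced here, so the sketch assumes the conventional contingency table: "a" relevant and retrieved, "b" non-relevant and retrieved, "c" relevant and not retrieved, "d" non-relevant and not retrieved. Precision and recall use only "a", "b", and "c"; fallout is shown as one standard measure that includes "d". The class name, function names, and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    a: int  # relevant, retrieved (assumed meaning of "a")
    b: int  # non-relevant, retrieved (assumed meaning of "b")
    c: int  # relevant, not retrieved (assumed meaning of "c")
    d: int  # non-relevant, not retrieved (assumed meaning of "d")


def _ratio(numerator, denominator):
    """Return the ratio, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def precision(t):
    """User-oriented: share of retrieved items that are relevant."""
    return _ratio(t.a, t.a + t.b)


def recall(t):
    """User-oriented: share of relevant items that are retrieved."""
    return _ratio(t.a, t.a + t.c)


def fallout(t):
    """System-oriented: share of non-relevant items retrieved (uses "d")."""
    return _ratio(t.b, t.b + t.d)


# Same "a", "b", "c" in a 200-document and a 2000-document collection.
small = Contingency(a=4, b=6, c=4, d=186)
large = Contingency(a=4, b=6, c=4, d=1986)
```

With these figures, precision and recall are identical for the two collections, and only fallout differs, since it is the only one of the three that depends on "d".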
Still more sophisticated techniques may be needed, since correct system
efficiency comparisons require adjustment for differing concentrations of
documents by subject in different collections, so that the actual collection
size can be replaced by the real number of documents within the subject