identify the desired outcomes of an information service and to select output
measures that are at least predictors of the desired outcomes. Looked at in
this way, an appropriate output measure may be regarded as at least a distant
approximation of an outcome measure. To take one example, the desired
outcome of an SDI service is presumably to make the users of the service
better informed. The degree to which this outcome is achieved, however, is
virtually impossible to measure. Nevertheless, it seems reasonable to suppose
that an SDI service is more likely to make a user better informed if it brings
to his attention documents that directly match his interest and were
previously unknown to him than if it is unable to deliver any matching
items. In this case, then, we have identified output measures (recall, precision,
novelty) that can be regarded as approximations of the desired outcome
measure. Likewise, in certain situations, we can identify input measures that
can be regarded as predictors of outcomes. The size of the collection of a
library, or its rate of growth, for example, might be regarded as a reasonable
predictor of the document delivery capabilities of that library.
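To make these relationships concrete, the following sketch (in Python, with invented variable names and simple set-based definitions that are assumptions for illustration, not taken from the text) shows how recall, precision and novelty might be computed for a single SDI notification run.

    # Illustrative sketch only: set-based versions of the output
    # measures discussed above, computed for one SDI notification run.

    def sdi_output_measures(delivered, relevant, previously_known):
        """Return (recall, precision, novelty) for one SDI run.

        delivered        -- set of document ids sent to the user
        relevant         -- ids the user judges to match his interest
        previously_known -- relevant ids the user already knew about
        """
        hits = delivered & relevant          # relevant items actually delivered
        recall = len(hits) / len(relevant) if relevant else 0.0
        precision = len(hits) / len(delivered) if delivered else 0.0
        new_hits = hits - previously_known   # relevant and previously unknown
        novelty = len(new_hits) / len(hits) if hits else 0.0
        return recall, precision, novelty

    # Example: four items delivered, five judged relevant in all,
    # one of the two hits already known to the user.
    r, p, n = sdi_output_measures(
        delivered={'d1', 'd2', 'd3', 'd4'},
        relevant={'d1', 'd2', 'd5', 'd6', 'd7'},
        previously_known={'d1'},
    )
    print('recall=%.2f precision=%.2f novelty=%.2f' % (r, p, n))
    # recall=0.40 precision=0.50 novelty=0.50

On this reading, high recall, precision and novelty together make it more plausible, though never certain, that the service has left the user better informed.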
The input/output/outcome distinction may be considered related to the
distinction between long-range and short-range objectives. Drucker⁴ has
pointed out that it is virtually impossible to evaluate any type of service
institution against its long-range objectives. Instead, we should back away
from the long-range objectives and identify short-range objectives that are
distant approximations of the long-range objectives and that can be converted
into meaningful evaluation criteria. As one example, Drucker points to the
'saving of souls' as the long-range objective of the church. The extent to
which this objective is reached by a particular church, however, is, to say the
least, an unpromising evaluation problem. On the other hand, a short-range
objective of the church may be to encourage young people in the community
to attend services and other church activities. The extent to which this is
achieved is precisely measurable. If we accept that church attendance may
contribute to the saving of souls, evaluation against the short-range objective
may be regarded as a distant approximation of evaluation against the long-range
objective.
Before leaving the subject of evaluation levels, it may be worth pointing
out that, in certain information service applications at least, purely
quantitative measures may record only successes while ignoring failures
completely. An obvious example is library circulation figures. A book
borrowed by a user reflects, in some sense, a library success, but circulation
figures tell us nothing about the library's failures: how many users are unable
to find the items they seek. In this case a purely quantitative measure gives
us a very incomplete picture of the library's performance. We need, instead,
a qualitative measure, one that balances the successes against the failures, in
this case some type of document delivery score.
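A minimal sketch of such a score, again in Python and again an assumption for illustration rather than a formula taken from the text: it simply balances the items users sought against the items the library could actually supply.

    # Illustrative sketch only: a document delivery score that balances
    # successes (items supplied) against failures (items sought but not
    # found), which raw circulation counts cannot do.

    def document_delivery_score(items_sought, items_supplied):
        """Fraction of sought items the library was able to supply."""
        if items_sought == 0:
            return 0.0
        return items_supplied / items_sought

    # Survey period: 300 titles sought, 240 found. Circulation alone
    # would report 240 successes; the score also reflects 60 failures.
    print(document_delivery_score(300, 240))   # 0.8

Unlike a circulation count, the score falls whenever users fail to find what they seek, so it reflects failures as well as successes.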
6.2 Evaluation criteria
The users of services of any kind usually evaluate them, consciously or
unconsciously, against cost, time and quality criteria. Users of information
services also tend to judge them against these same criteria. The specific