identify the desired outcomes of an information service and to select output measures that are at least predictors of the desired outcomes. Looked at in this way, an appropriate output measure may be regarded as at least a distant approximation of an outcome measure. To take one example, the desired outcome of an SDI service is presumably to make the users of the service better informed. The degree to which this outcome is achieved, however, is virtually impossible to measure. Nevertheless, it seems reasonable to suppose that an SDI service is more likely to make a user better informed if it brings to his attention documents that directly match his interest, and were previously unknown to him, than if it is unable to deliver any matching items. In this case, then, we have identified output measures (recall, precision, novelty) that can be regarded as approximations of the desired outcome measure.
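These three measures can be expressed as simple ratios. As a rough sketch (the notation below is ours, not the chapter's), for a single SDI notification:

$$\text{recall} = \frac{\text{relevant items retrieved}}{\text{relevant items in the file}}, \qquad \text{precision} = \frac{\text{relevant items retrieved}}{\text{items retrieved}}$$

$$\text{novelty ratio} = \frac{\text{relevant items retrieved that were previously unknown to the user}}{\text{relevant items retrieved}}$$

For instance, if a notification brings 20 items to a user's attention, 10 of which prove relevant to his interest, and 8 of those 10 were previously unknown to him, precision is 0.5 and the novelty ratio is 0.8; recall additionally requires an estimate of how many relevant items the file contains as a whole.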
Likewise, in certain situations, we can identify input measures that can be regarded as predictors of outcomes. The size of a library's collection, or its rate of growth, for example, might be regarded as a reasonable predictor of the document delivery capabilities of that library.

The input/output/outcome distinction may be considered related to the distinction between long range and short range objectives. Drucker⁴ has pointed out that it is virtually impossible to evaluate any type of service institution against its long range objectives. Instead, we should back away from the long range objectives and identify short range objectives that are distant approximations of the long range objectives and that can be converted into meaningful evaluation criteria. As one example, Drucker points to the 'saving of souls' as the long range objective of the church. The extent to which this objective is reached by a particular church, however, is, to say the least, an unpromising evaluation problem. On the other hand, a short range objective of the church may be to encourage young people in the community to attend services and other church activities. The extent to which this is achieved is precisely measurable. If we accept that church attendance may contribute to the saving of souls, evaluation against the short range objective may be regarded as a distant approximation of evaluation against the long range objective.

Before leaving the subject of evaluation levels, it may be worth pointing out that, in certain information service applications at least, purely quantitative measures may relate only to successes and ignore failures completely. An obvious example is library circulation figures. A book borrowed by a user reflects, in some sense, a library success, but circulation figures tell us nothing about the library's failures: how many users are unable to find the items they seek. In this case a purely quantitative measure gives us a very incomplete picture of the library's performance. We need, instead, a qualitative measure, one that balances the successes against the failures, in this case some type of document delivery score.

6.2 Evaluation criteria

The users of services of any kind usually evaluate them, consciously or unconsciously, against cost, time and quality criteria. Users of information services also tend to judge them against these same criteria. The specific