`given') and focus instead on one of the other levels. Most of the evaluations of operating systems have, in fact, been restricted to evaluations of their effectiveness (e.g. in terms of the number of users who express subjective satisfaction or the number of actual demands that are satisfied according to some more objective criteria). Few detailed cost analyses have been conducted or, at least, few are reported in the literature. And realistic cost-effectiveness analyses are even more scarce. This is a pity because it can be argued that a study of effectiveness has little real meaning unless related to costs and that, certainly, a cost analysis has little real value unless related to level of effectiveness. Managers of information services are, or should be, concerned with optimum allocation of the resources available (i.e. one that achieves the maximum quality of service possible within budgetary constraints), and optimum resource allocation is only likely to come from a true cost-effectiveness analysis.

A useful distinction, first made by King and Bryant13, is that between macroevaluation and microevaluation. A macroevaluation of a system is one that measures its present level of performance (e.g. in terms of recall and precision or as a document delivery score) and is content to let the study rest there. A macroevaluation, then, merely establishes a benchmark. But a microevaluation goes much beyond this. It seeks to answer such questions as `Why is the system operating at this level?', `Under what conditions does the system perform well and under what conditions does it perform badly?', and `What can be done to raise the level of performance in the future?' A microevaluation, then, is diagnostic while a macroevaluation is not.

Another possibly useful distinction in the information services environment is that between inputs, outputs and outcomes. Again, this is a sequence of increasing complexity. Inputs to an information service are the easiest things to measure. They can be expressed in purely quantitative terms: how many documents, how many people, how much money? Outputs are more difficult to deal with because output measures must take into account quality as well as quantity. For example, in the evaluation of a question-answering service the appropriate output measure is not the number of questions submitted. It is not even the proportion of questions for which an answer is supplied. It is the proportion of questions submitted for which a complete and correct answer is supplied. The outcome of an information service is the most difficult aspect to study, for the notion of outcome brings us back to that of impact, effect or benefit. It is more difficult to evaluate outcomes than it is to evaluate outputs, and it is more difficult to evaluate outputs than it is to quantify inputs.
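Put as a simple formula (a sketch of the idea rather than any standard notation), the question-answering output measure suggested above is

\[
\text{output measure} \;=\; \frac{\text{number of questions receiving a complete and correct answer}}{\text{number of questions submitted}}
\]

If, say, 200 questions are submitted, 180 receive some answer and 150 receive a complete and correct answer, the measure is 150/200 = 0.75 rather than the more flattering 180/200 = 0.90.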
All types of information services will probably have reliable input data, but few have meaningful qualitative output data, and data on outcomes are likely to be non-existent. Where standards exist in the information services field (e.g. applied to various types of libraries), they tend to be entirely related to inputs. This is not because inputs are most important (far from it) but merely because inputs are easiest to look at, quantify and reduce to `standard' form.

In the evaluation of an operating information service we should primarily be interested in its outcomes. After all, it is the beneficial outcomes that presumably justify the existence of the service. But it may not be possible to evaluate outcomes; or, at least, the evaluation of outcomes may be so complex as to discourage the attempt. On the other hand it should be possible to j