obtrusively or unobtrusively (Bunge18, Powell19, Crowley and Childers4, King and Berry20). On the surface, simulations of this kind may be regarded as imperfect substitutes for real life studies. But not all advantages lie with the real life situation. To begin with, simulations do not disturb the users of the system; they are also likely to be considerably cheaper than the real life study. Moreover, it could be argued that a real life study tells us only how the system performs in relation to actual demands, and tells us nothing about the potential performance of the system in relation to the latent needs that may never be converted into demands. There are obvious dangers associated with looking at demands only (Line21, Lancaster22), since the demands (expressed needs) of users are likely to be influenced by their expectations of the capabilities of the system. Evaluation of a service in relation to expressed needs, with no concern for information needs that are unexpressed, may cause the managers of an information service to move that service further towards the expressed needs and further away from the unexpressed needs. A simulation such as the document delivery test, insofar as it can be assumed to reflect latent needs as well as expressed needs, offers certain advantages over an evaluation that looks at expressed needs only.

A major difference between the evaluation of an operating information service and the evaluation of a system in a laboratory environment is that the latter, lacking real users with real needs, must adopt some 'ideal' as the appropriate standard for performance. But the real service should not be evaluated against an ideal, only against the real needs of its users, which may fall well short of the ideal. An obvious example is the use of such measures as recall and precision ratios. In the experimental environment it is reasonable to evaluate the performance of a system in relation to the ideal of 100 per cent recall and 100 per cent precision. But the ideal is not necessarily the best measure to use in the evaluation of an operating system. It makes no sense to regard as a failure a search that achieves only, say, 20 per cent recall if the requester does not require a high level of recall and is perfectly satisfied with 'a few good relevant items'. In an operating environment, evaluation measures must always be related to the precise needs of the users of the service.

Moreover, evaluation measures that seem perfectly appropriate in the experimental situation may be inappropriate to the operational situation because better (e.g. more direct) measures may exist. The precision ratio is a good example of this. In effect, the precision ratio is a user cost factor associated with achieving a particular level of recall (another way of looking at it is as a form of penalty associated with the attainment of a particular recall ratio).
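Schematically, the two ratios may be expressed as follows (the standard formulation; the notation is not spelled out in the passage itself):

\[
\text{recall ratio} = \frac{\text{relevant items retrieved}}{\text{relevant items in the collection}} \times 100\%,
\qquad
\text{precision ratio} = \frac{\text{relevant items retrieved}}{\text{total items retrieved}} \times 100\%.
\]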
The precision ratio may be the only meaningful cost factor available for use in an experimental environment. But in many operating environments there may be much more direct cost factors, such as the unit cost (in $ or time) to the user per relevant item retrieved. Thus, if a user pays $25 for the results of an online search and finds five relevant documents in the search output, he is paying $5 for each relevant item retrieved. If he conducts his own search in a printed index and finds six relevant items in two hours, we could say that the unit cost (in time) per relevant item retrieved is 20 minutes. These are much more direct measures of the cost of achieving a particular level of recall than is the precision ratio.
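In general terms, the arithmetic of the two examples above may be written as

\[
\text{unit cost per relevant item} = \frac{\text{total cost of the search (in money or time)}}{\text{number of relevant items retrieved}},
\]

so that \$25 / 5 = \$5 per relevant item in the online case, and 120 minutes / 6 = 20 minutes per relevant item in the printed-index case.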