Information Retrieval Experiment (Karen Sparck Jones, ed.; Butterworth & Company)
Chapter: Evaluation within the environment of an operating information service, by F. Wilfrid Lancaster

Some problems of evaluation applied to operating systems

The evaluation of an operating information service is likely to involve many more compromises than the evaluation of an experimental system. In the latter case, for example, we might be able to determine the `true' recall ratio for a search, since it may be possible to have every document in the collection judged for relevance against every request used in the evaluation. In a real operating environment, however, it is impossible to establish true recall (Lancaster5), and we must instead be satisfied with some method of estimating the recall ratio of the search.

There is likely to be a difference, too, between the evaluation standards appropriate to the experimental and to the operating environments. In the former, it is impossible to evaluate the results of a literature search against information needs: since there are no real users, there are no real information needs. The best we can do is to evaluate the results of a search against a request statement (relevance as opposed to pertinence, Lancaster5). But this is not good enough in the operating environment, where the evaluation of a search against a request statement is an artificial situation.
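The distinction between true and estimated recall above can be made concrete with a small sketch. The function name and the idea of scoring a search against a pool of independently identified relevant documents are illustrative assumptions for exposition, not a specific procedure prescribed by the chapter: in an operating service the full set of relevant documents is unknowable, so some such pool must stand in for it.

```python
def estimated_recall(retrieved, known_relevant):
    """Estimate the recall ratio of a search against a pool of relevant
    documents identified by means independent of the search itself.
    (Hypothetical helper: the pool is a stand-in for the unknowable
    complete set of relevant items in an operating collection.)"""
    retrieved = set(retrieved)
    known_relevant = set(known_relevant)
    if not known_relevant:
        return None  # no basis for an estimate
    hits = len(retrieved & known_relevant)
    return hits / len(known_relevant)

# Illustrative figures: a search retrieves documents 1-5, while
# independent sources had identified documents 2, 4, 7 and 9 as relevant.
print(estimated_recall({1, 2, 3, 4, 5}, {2, 4, 7, 9}))  # → 0.5
```

The estimate is only as good as the pool: a pool biased toward easily found documents will overstate recall, which is why the chapter treats this as a compromise rather than a measurement.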
Since we have real users with real needs, we must ask those users to evaluate the results of a search in terms of the degree to which they contribute to the satisfaction of the information need that prompted the request to the system.

The problems of controlled experimentation within an operating environment have already been mentioned. In a purely experimental situation it should be possible to control all extraneous variables, so that one can be quite sure of what is affecting what. In a real-life environment it is not so easy to experiment: one finds oneself always compromising between the experimental design and concern for the needs of the users. For example, if a completely new information service is introduced, one that promises to be much more effective than any of its predecessors, it is difficult to explain to members of a `control group' why they are denied use of the service. Yet, if we really want to assess the impact of the service, some type of control group of this kind will be necessary. There may always be self-selected control groups (e.g. those people who choose not to use a new service), but a self-selected group is likely to be quite different from a group that is randomly selected to form a control.

A rare example of a true experimental design in the evaluation of various approaches to the provision of information services can be found in a recent paper by Olson23. The design used was a 3 × 3 factorial design, as illustrated in Figure 6.1. Two levels of `technical information intervention' and two of

[Figure 6.1. Factorial design: levels of interventions (from Olson23). The grid crosses behavioural interventions (control, level 1, level 2) with technical information interventions (control, level 1, level 2).]
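The 3 × 3 factorial design of Figure 6.1 can be sketched as follows. The axis labels are taken from the figure; the participant counts and the random-assignment step are illustrative assumptions added to show why randomization, rather than self-selection, is what makes the control cells comparable.

```python
from itertools import product
from random import Random

# The two factors of Figure 6.1, each with a control plus two levels.
behavioural = ["control", "level 1", "level 2"]
technical = ["control", "level 1", "level 2"]

# The nine treatment combinations (cells) of the 3 x 3 factorial design.
cells = list(product(technical, behavioural))

# Hypothetical illustration: randomly assign 18 participants, two per cell,
# so membership in each cell (including the controls) is decided by chance
# rather than by the participants' own choices.
rng = Random(0)
participants = list(range(18))
rng.shuffle(participants)
assignment = {p: cells[i % len(cells)] for i, p in enumerate(participants)}
```

A self-selected "control group" would correspond to letting participants pick their own cell, which is exactly the comparability problem the passage warns against.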