performance of various processing options in SMART. Obtrusiveness may, however, have some effect on a diagnostic microevaluation, since certain system components may benefit from the spotlight effect while others are unable to benefit. The obtrusiveness of a study cannot in any way minimize failures attributable to the database searched (i.e. indexing and vocabulary failures), but it might reduce failures relating to the exploitation of the database, since a searcher who knows he is observed may put more effort into his interaction with the user and into the construction of the search strategy itself.

The evaluation of an operating information system usually requires many more compromises than the evaluation of an experimental system. To begin with, we probably don't want to evaluate all searches conducted (or even all conducted within a restricted time period) but only a sample of these searches. Ideally we would like to draw these searches completely at random. But in a national system, with potential users spread over great distances, a purely random assignment may be impracticable: the difficulties of dealing remotely with many geographically dispersed users may be too great. Instead of drawing a completely random sample of users, we may have to be content with some compromise. In his evaluation of MEDLARS, for example, Lancaster15 identified a number of organizations whose members, based on records of searches conducted in the past, might be considered to form a microcosm of the complete user population. Not only could these organizations, collectively, be expected to generate the required number of searches, but the distribution of their searches by subject could be expected to resemble rather closely the subject distribution of all requests from whatever source. Defining a search to be evaluated as one coming from a selected group of organizations greatly facilitated the conduct of the evaluation, since contacts with the requesters, including distribution of the necessary evaluation forms and other materials, could be entrusted to librarians or other information specialists on the staff of these organizations. Moreover, with a limited number of organizations involved, it was possible to secure agreement to co-operate from the executive officer of each organization. This encouraged the co-operation of the individual staff members without in any way influencing the type of requests they made to the system.

The problems involved in securing the co-operation of large numbers of users of information services have encouraged the use of 'realistic simulations' of these services in certain evaluation applications. In such simulations a 'proxy' for a real user is employed. The proxy behaves in a way that is assumed to be typical of the behaviour of a real user, and the performance of the system in relation to the needs of the proxy is evaluated.
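The MEDLARS compromise described above amounts to checking that a convenience sample of organizations mirrors the wider request population. The sketch below illustrates one way such a check might be carried out; the organizations, subject categories, records, and acceptance threshold are entirely hypothetical and are not taken from the MEDLARS study itself.

from collections import Counter

# Hypothetical past search records: (originating organization, subject category).
# The MEDLARS comparison used real records of earlier requests; these values
# are invented purely for illustration.
past_requests = [
    ("Org A", "cardiology"), ("Org A", "pharmacology"), ("Org A", "oncology"),
    ("Org B", "pharmacology"), ("Org B", "oncology"), ("Org B", "cardiology"),
    ("Org C", "oncology"), ("Org C", "oncology"), ("Org C", "cardiology"),
    ("Org D", "pharmacology"), ("Org D", "cardiology"), ("Org D", "oncology"),
]

candidate_group = {"Org A", "Org C"}   # organizations proposed as the 'microcosm'

def subject_distribution(records):
    """Return the proportion of requests falling in each subject category."""
    counts = Counter(subject for _, subject in records)
    total = sum(counts.values())
    return {subject: n / total for subject, n in counts.items()}

all_dist = subject_distribution(past_requests)
group_dist = subject_distribution(
    [(org, subj) for org, subj in past_requests if org in candidate_group]
)

# Total variation distance between the two subject distributions:
# 0 means identical proportions, 1 means completely disjoint.
subjects = set(all_dist) | set(group_dist)
tv_distance = 0.5 * sum(abs(all_dist.get(s, 0) - group_dist.get(s, 0)) for s in subjects)

print("All requests:  ", all_dist)
print("Sampled group: ", group_dist)
print(f"Total variation distance: {tv_distance:.3f}")
if tv_distance < 0.1:   # threshold chosen arbitrarily for the example
    print("Subject distribution of the group resembles that of all requests.")
else:
    print("Group may not be a satisfactory microcosm of the user population.")

A small distance suggests the selected organizations generate requests with roughly the same subject mix as the system as a whole, which was the practical justification offered for this sampling compromise.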
One example of such a simulation is the document delivery test (Orr et al.3). In this test, 300 citations, presumed representative of the document needs of the users of a particular centre, are checked against the centre on a particular day to determine (a) how many of the items are owned, and (b) how available each owned item is on that day. A similar test has been described by De Prospo et al.17. In essence, the document delivery test simulates 300 users walking into the centre on a particular day, each one seeking a particular document. Another form of simulation is the use of a set of questions for which complete and correct answers are known to test the question-answering ability of an information centre. The set of test questions can be applied to the centre
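The document delivery test described above reduces to a simple tally of two quantities: the proportion of the test citations owned by the centre, and the proportion of owned items actually obtainable on the test day. The fragment below sketches such a tally; the record format and field names are hypothetical, and the scoring procedure of Orr et al. is more elaborate than this.

from dataclasses import dataclass

@dataclass
class TestCitation:
    """One of the test citations checked against the centre on the test day."""
    citation_id: str
    owned: bool            # does the centre hold the item at all?
    on_shelf_today: bool   # if owned, could it be supplied on the test day?

# Invented sample records; a real test would use 300 citations judged
# representative of the users' document needs.
results = [
    TestCitation("c001", owned=True,  on_shelf_today=True),
    TestCitation("c002", owned=True,  on_shelf_today=False),  # owned but out on loan
    TestCitation("c003", owned=False, on_shelf_today=False),  # not in the collection
    TestCitation("c004", owned=True,  on_shelf_today=True),
]

total = len(results)
owned = sum(1 for r in results if r.owned)
available = sum(1 for r in results if r.owned and r.on_shelf_today)

print(f"Ownership rate:    {owned}/{total} = {owned / total:.0%}")
print(f"Availability rate: {available}/{owned} = {available / owned:.0%} of owned items")
print(f"Overall delivery:  {available}/{total} = {available / total:.0%} of all test citations")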