IRE: Information Retrieval Experiment. Chapter: Evaluation within the environment of an operating information service. F. Wilfrid Lancaster. Butterworth & Company. Karen Sparck Jones.
recall and precision failures. The failure analysis itself entails an examination of each document involved, the indexing records for the documents, the requests that caused the searches to be conducted, the search strategies, the system vocabulary, and the relevance assessments of the users. By examining each of these, it should be possible to determine which component of the system was largely responsible for the failures that occurred.
In addition to analysing the failures that occur in particular searches, the evaluator can use the recall and precision ratios, or alternative measures of search performance, as indicators of the conditions under which the system seems to perform well and those under which it seems to perform badly. For example, searches can be grouped by broad subject category, and an average performance figure, or figures, can be derived for each group. It would then be possible to identify subject areas in which unusually low scores occur. Through the joint use of performance figures of this kind and analyses of failures in particular searches, the evaluator learns a great deal about the characteristics of the system, its weaknesses and limitations as well as its strong points.
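The grouping of searches by broad subject category can be illustrated with a short sketch. It is not taken from the chapter: the record fields (`subject`, `retrieved`, `relevant_retrieved`, `relevant_total`), the sample figures, and the `margin` used to decide what counts as an "unusually low" score are all assumptions made for illustration. In an operating system the total number of relevant items is rarely known, so `relevant_total` should be read as an estimate.

```python
from collections import defaultdict
from statistics import mean


def recall(search):
    """Relevant items retrieved / relevant items in the collection (an estimate)."""
    total = search["relevant_total"]
    return search["relevant_retrieved"] / total if total else 0.0


def precision(search):
    """Relevant items retrieved / all items retrieved."""
    total = search["retrieved"]
    return search["relevant_retrieved"] / total if total else 0.0


def average_by_subject(searches):
    """Average recall and precision over the searches in each subject group."""
    groups = defaultdict(list)
    for search in searches:
        groups[search["subject"]].append(search)
    return {
        subject: {
            "searches": len(items),
            "recall": mean(recall(s) for s in items),
            "precision": mean(precision(s) for s in items),
        }
        for subject, items in groups.items()
    }


def low_scoring_subjects(averages, measure="recall", margin=0.1):
    """Subjects whose average falls more than `margin` below the mean of all groups.

    The margin is an arbitrary illustrative threshold, not a value from the text.
    """
    overall = mean(group[measure] for group in averages.values())
    return sorted(
        subject
        for subject, group in averages.items()
        if group[measure] < overall - margin
    )


# Hypothetical search records, invented purely for illustration.
SAMPLE_SEARCHES = [
    {"subject": "chemistry", "retrieved": 10, "relevant_retrieved": 8, "relevant_total": 10},
    {"subject": "chemistry", "retrieved": 20, "relevant_retrieved": 10, "relevant_total": 20},
    {"subject": "medicine", "retrieved": 10, "relevant_retrieved": 2, "relevant_total": 10},
    {"subject": "medicine", "retrieved": 5, "relevant_retrieved": 1, "relevant_total": 5},
]

if __name__ == "__main__":
    averages = average_by_subject(SAMPLE_SEARCHES)
    for subject, figures in sorted(averages.items()):
        print(subject, figures)
    print("low recall:", low_scoring_subjects(averages, "recall"))
```

A subject flagged in this way is only a pointer to where the failure analysis of individual searches should be concentrated; the averages alone do not say which component of the system is responsible.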
The joint use of the performance figures and failure analyses should answer most of the questions identified in the work statement for the evaluation. The final element in the analysis and interpretation phase is that in which the evaluator presents his report to the managers of the system, including in his report recommendations on what might be done to improve its performance.
The fifth and final step of the evaluation programme is that in which some or all of the recommendations are implemented, that is, the step in which the evaluation results are applied to the improvement of the system.
Although not specifically mentioned in the discussion above, the value of a pretest should be recognized. Before the complete evaluation is carried out, it is important to follow through all the proposed procedures on a small sample of transactions, to ensure that the procedures are, in fact, viable and that they are capable of gathering the data needed to complete the study.
One obvious problem associated with an operating system is the fact that, because a rather high level of user co-operation may be demanded (e.g. in assessing the relevance of items retrieved), it is virtually impossible to conduct the evaluation unobtrusively. Not only will the users know they are participating in a study but, since we may need their co-operation in collecting and delivering various records, the staff may also be aware that an evaluation is taking place. How much effect this obtrusiveness is likely to have on the evaluation results is a matter of some debate. It could be argued that the overall performance figures for 'observed' searches may exceed, on the average, overall performance figures for unobserved searches. In this sense, the evaluation may be considered a study of the system under somewhat ideal conditions. On the other hand, the effect of the obtrusiveness may be considered to apply equally to all searches included in the evaluation.
This being so, the obtrusive nature of the study may not have any significant effect for certain evaluation purposes. For example, the obtrusiveness of a study is unlikely to affect a comparison of performance across various subject fields since there is little reason to suppose that the obtrusiveness will affect one subject area more than another. This is somewhat similar to the findings of Lesk and Salton16 that even major differences in relevance decisions, from one judge to another, had no significant effect on comparison of the 1