Evaluation criteria

(1) A particular document whose identity is known.
(2) Specific factual information of the type that might come from some type of reference book or from a machine-readable data bank; for example, thermophysical property data on a particular substance.
(3) A few 'good' articles, or references to them, on a specific topic.
(4) A comprehensive literature search in a particular subject area.
(5) A current alerting service by which the user is kept informed of new literature relevant to his current professional interests.

These different needs have different response time requirements associated with them. The requirement relating to the current alerting service is that it should deliver regularly and frequently, and that the information supplied should be as up-to-date as possible. The user needing a comprehensive literature search is usually engaged in a relatively long-term research project. Speed of response may not be critical to him, except that there may be some date beyond which the search results will have no value or, at least, greatly reduced value; he is willing to wait longer in order to achieve completeness; that is, completeness is more important to him than speed. For the other types of information needs, on the other hand, the user generally wants fairly rapid response.

The cost and time criteria relevant to the evaluation of information services seem fairly obvious and are relatively constant from one activity to another. But the quality criteria are perhaps less obvious and vary considerably with the particular service being evaluated. They may also vary with the kind of need that a particular user has in relation to a service. There seem to be two major qualitative measures of success as applied to information services:

(1) Does the user get what he is seeking or not?
(2) How completely or accurately does he get it?

The first of these measures, which applies, for example, to the search for a particular item or the answer to a particular factual question, is simple and unequivocal. The second, however, is much more difficult to apply in practice because it implies both a human value judgement and the use of some graduated scale to reflect degree of success. The second type of measure is necessary, however, in the evaluation of most types of information retrieval activity.

'Recall' and 'precision' are two criteria frequently used to judge the performance of a search in an information retrieval system. Because these measures are well known and well accepted in the evaluation of operating information services, they will not be defined here. The precision ratio and the recall ratio, used jointly, express the filtering capacity of the system: its ability to let through what is wanted and to hold back what is not. Neither one on its own gives a complete picture of the effectiveness of a search.
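By way of illustration only (this sketch is not part of the original chapter), both ratios for a single search can be computed from the set of items retrieved and the set of items judged relevant to the request. The Python fragment below uses hypothetical document identifiers and relevance judgements:

def recall_and_precision(retrieved, relevant):
    # Relevant items actually retrieved by the search.
    hits = set(retrieved) & set(relevant)
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical search: 8 items retrieved, 10 items relevant in the collection.
retrieved_items = {"d01", "d02", "d03", "d04", "d05", "d06", "d07", "d08"}
relevant_items = {"d01", "d03", "d05", "d09", "d10",
                  "d11", "d12", "d13", "d14", "d15"}
r, p = recall_and_precision(retrieved_items, relevant_items)
print(f"recall = {r:.2f}, precision = {p:.2f}")  # recall = 0.30, precision = 0.38

On such a calculation, a search that simply retrieved the entire collection would score a recall of 1.0 but a precision close to zero, which is the point developed next.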
It is always possible to get 100 per cent recall if we retrieve enough of the total collection; if we retrieve the entire collection, we certainly achieve 100 per cent recall. Unfortunately, however, precision would be extremely low in this situation because, for any typical request, the great majority of the items in the collection are not relevant. The precision ratio may be viewed as a type of cost factor in user time: the time required to separate the relevant citations from the irrelevant ones in