Information Retrieval Experiment
Retrieval effectiveness chapter
Cornelis J. van Rijsbergen
(Karen Sparck Jones, ed.; Butterworth & Company)

Since P(w2/x) = 1 - P(w1/x), we can rewrite this inequality as

    P(w1/x) > λ12/(λ12 + λ21) (= β)

This rule is similar to the one specified by the probability ranking principle, the difference being that now we have an explicitly defined cut-off in terms of a cost function. If we choose λ12 = λ21, i.e. λ12/λ21 = 1, our retrieval rule becomes:

    If P(w1/x) > ½ then retrieve, else do not retrieve

Or, in other words, if the probability of relevance is greater than the probability of non-relevance for a document, we should retrieve that document. We vary the importance we attach to the cost of a false drop in comparison with the cost of a recall failure by changing the cut-off. For example, we may decide that 2λ12 = λ21, which means that the failure to retrieve a relevant document is twice as costly as the retrieval of a non-relevant document. For this the cut-off β is set to λ12/(λ12 + 2λ12) = ⅓. By ranking we avoid having to specify the cut-off in advance; on the other hand, we pay the price of ranking. It is important to realize that setting a cut-off on P(w1/x) still maximizes the expected number of relevant documents in the retrieved set.

3.4 Measurement of effectiveness

The measurement of retrieval effectiveness within an experimental set-up is beset with many difficulties. These difficulties have been with us for many years, and are likely to remain unresolved for many years yet. A typical retrieval experiment has been described in Chapter 2, so I shall not repeat it here, except to emphasize that its aim is usually to establish the absolute or relative effectiveness of some search strategy, information structure, ordering process, etc., within the context of an overall retrieval system. The output of such an experiment may be a ranking (partial or full) of documents, or simply an unordered set. Each query will have associated with it some output for which retrieval effectiveness measures can be calculated. In comparing the results of different tests with the same queries and document collection, one aims to produce statistical summary data which will enable statements to be made about the comparative merits of differently designed subsystems. In the main, experimentalists have concentrated on two types of statement:

(1) What is the probability that a retrieved document is relevant, for the system operating at a given level of recall?
(2) What is the probability of a retrieved document being relevant for a query at a particular recall value?

How well these two questions are answered depends on the method of evaluation adopted. The discussion will concentrate on the evaluation of rankings; evaluating unordered sets is a special case of this.
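
One way to read the difference between questions (1) and (2) is as a choice between pooling the retrieved documents of all queries and averaging a per-query proportion over queries. The sketch below is not from the chapter: the rankings, relevance judgements, recall level and the helper retrieved_at_recall are all hypothetical, and a fixed recall level stands in for whatever operating point an experiment actually fixes.

# Illustrative sketch only -- the data and helper below are hypothetical,
# not part of the chapter's experimental design.

rankings = {                                     # ranked output per query
    "q1": ["d1", "d3", "d2", "d4", "d5"],
    "q2": ["d7", "d8", "d9", "d6", "d10"],
}
relevant = {                                     # relevance judgements per query
    "q1": {"d1", "d2"},
    "q2": {"d6"},
}
recall_level = 1.0                               # operating point: all relevant documents found

def retrieved_at_recall(ranking, rel, level):
    """Shortest prefix of the ranking that reaches the required recall."""
    needed = level * len(rel)
    found = 0
    for i, doc in enumerate(ranking, start=1):
        found += doc in rel
        if found >= needed:
            return ranking[:i]
    return ranking                               # recall level never reached

prefixes = {q: retrieved_at_recall(r, relevant[q], recall_level)
            for q, r in rankings.items()}

# Question (1): system-oriented estimate -- pool the retrieved documents of all
# queries and take the proportion of the pool that is relevant.
total_retrieved = sum(len(p) for p in prefixes.values())
total_relevant_retrieved = sum(sum(d in relevant[q] for d in p)
                               for q, p in prefixes.items())
system_estimate = total_relevant_retrieved / total_retrieved      # 3/7

# Question (2): query-oriented estimate -- compute the proportion per query,
# then average over queries.
per_query = [sum(d in relevant[q] for d in p) / len(p)
             for q, p in prefixes.items()]
query_estimate = sum(per_query) / len(per_query)                  # (2/3 + 1/4) / 2

print(system_estimate, query_estimate)

With this toy data the two estimates differ (about 0.43 against 0.46), which is the point: the same runs can support either type of statement, and an experiment has to say which of the two it is reporting.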
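
For reference, the inequality the excerpt opens with can be reconstructed from the expected costs of the two actions. This is a sketch under the usual assumptions of the cost model (λ12 is the cost of retrieving a non-relevant document, λ21 the cost of not retrieving a relevant one, and correct decisions cost nothing); it is not additional material from the chapter.

% Expected cost (risk) of each action for a document described by x,
% assuming correct decisions incur zero cost:
\[
  R(\text{retrieve} \mid x) = \lambda_{12} P(w_2 \mid x), \qquad
  R(\text{do not retrieve} \mid x) = \lambda_{21} P(w_1 \mid x).
\]
% Retrieve when retrieving is the cheaper action:
\[
  \lambda_{21} P(w_1 \mid x) > \lambda_{12} P(w_2 \mid x)
  \;\Longleftrightarrow\;
  P(w_1 \mid x) > \frac{\lambda_{12}}{\lambda_{12} + \lambda_{21}} = \beta,
\]
% using P(w_2 | x) = 1 - P(w_1 | x).  The special cases quoted in the text:
\[
  \lambda_{12} = \lambda_{21} \;\Rightarrow\; \beta = \tfrac{1}{2},
  \qquad
  2\lambda_{12} = \lambda_{21} \;\Rightarrow\;
  \beta = \frac{\lambda_{12}}{\lambda_{12} + 2\lambda_{12}} = \tfrac{1}{3}.
\]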