data made available to the system for this purpose, then the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.' The original formulation of this principle was in terms of the probability of 'usefulness' instead of 'relevance'. To date most implementations of the principle have worked with the probability of relevance. It is not too difficult to formulate a relationship between usefulness and relevance so that optimality in terms of relevance implies optimality in terms of usefulness. The crucial point, though, is that the estimation of the probability of relevance is based on content-derivable data, whereas an estimate of the probability of usefulness can only be based on data not concerned with the content, e.g. the language of the document or its date of publication. A detailed discussion of this distinction can be found in Harper's thesis⁴.

As I mentioned at the start of this section, retrieval effectiveness and retrieval strategy have been fitted together to guarantee optimal retrieval effectiveness for certain strategies. The easiest way to see this is to use the definition of precision and recall in terms of the expected number of relevant documents in a set. If the probability of relevance of a document represented by x is given by P(relevance/x), or P(A/x), then the probability ranking principle tells us that to achieve optimal retrieval we should rank the documents in decreasing order of this probability. Now the retrieved set B, defined by setting some cut-off on the ranking, will contain those documents with the greatest values of P(A/x). Therefore, compared with any other set of documents of the same size as B, the sum ΣP(A/x) over the documents in B will be a maximum, or in words, the expected number of relevant documents in B will be maximized. This is true for any set B defined by a cut-off on the ranking. Since expected precision and recall are defined by dividing the expected number of relevant documents in B by the size of B and A respectively, expected precision and recall will be maximized at any cut-off by ranking the documents in order of their probability of relevance. The interplay of the measures of retrieval effectiveness and the definition of the retrieval strategy is quite clear. In fact, ranking documents in this way ensures the optimization of a host of effectiveness measures expressed in terms of precision and recall. For example, any linear combination of precision and recall will be maximized as well.

It is important to realize that in formulating this principle very little has been said about the structure of the description x associated with a document. To estimate the probability of relevance for a particular document some assumptions will have to be made about the form of x. A common assumption is that x is a binary vector representing the absence or presence of index terms.
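As a small illustrative sketch of the argument above (not from the chapter; the document identifiers and probability estimates are invented), ranking a handful of documents by an estimated P(A/x) and evaluating each cut-off shows how the expected number of relevant documents in B, and with it expected precision and recall, is maximized at every cut-off:

    # Illustrative sketch only: invented estimates of P(A/x) for five documents.
    prob_relevance = {"d1": 0.9, "d2": 0.7, "d3": 0.4, "d4": 0.2, "d5": 0.1}

    # Rank documents in decreasing order of their estimated probability of relevance.
    ranking = sorted(prob_relevance, key=prob_relevance.get, reverse=True)

    # Expected number of relevant documents in the whole collection (expected size of A).
    expected_A = sum(prob_relevance.values())

    # At each cut-off the retrieved set B holds the top-ranked documents, so the
    # expected number of relevant documents in B is as large as any set of that
    # size permits; expected precision and recall follow by dividing by |B| and
    # the expected size of A respectively.
    for cutoff in range(1, len(ranking) + 1):
        expected_rel_in_B = sum(prob_relevance[d] for d in ranking[:cutoff])
        expected_precision = expected_rel_in_B / cutoff
        expected_recall = expected_rel_in_B / expected_A
        print(cutoff, round(expected_precision, 2), round(expected_recall, 2))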
Given such a representation, assumptions about the statistical dependence or independence of the occurrence of index terms can then be made to help in the estimation of P(A/x). Briefly, this estimation is usually implemented through Bayes' rule,

P(A/x) = P(x/A)P(A) / P(x)
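A brief sketch of how such an estimate might be computed for a binary term vector x, assuming index terms occur independently given relevance and non-relevance; the per-term probabilities and the prior below are invented for illustration:

    import math

    # Invented figures: per-term occurrence probabilities in relevant and
    # non-relevant documents, and a prior probability of relevance P(A).
    p = [0.8, 0.6, 0.3]    # P(term i present / A)      (relevant documents)
    q = [0.3, 0.4, 0.2]    # P(term i present / not A)  (non-relevant documents)
    prior_A = 0.05         # P(A)

    def prob_relevance(x):
        """Estimate P(A/x) for a binary vector x via Bayes' rule, with P(x/A)
        and P(x/not A) factored into products over independent index terms."""
        px_given_A = math.prod(p[i] if x[i] else 1 - p[i] for i in range(len(x)))
        px_given_notA = math.prod(q[i] if x[i] else 1 - q[i] for i in range(len(x)))
        px = px_given_A * prior_A + px_given_notA * (1 - prior_A)   # P(x)
        return px_given_A * prior_A / px

    print(prob_relevance([1, 1, 0]))   # document containing the first two terms only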