Information Retrieval Experiment
Retrieval effectiveness
Cornelis J. van Rijsbergen
Karen Sparck Jones (ed.)
Butterworth & Company
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

This analogy between relevance and an observable in quantum mechanics (QM) should not be taken too seriously; its use is mainly to highlight the inherent uncertainty associated with relevance. One interesting aspect of the analogy is, however, that a further similarity between the formalism for information retrieval and QM becomes apparent when one considers the well-established trade-off between precision and recall. This is similar to the Heisenberg uncertainty principle in physics, where for example momentum and position cannot be measured simultaneously to any desired level of accuracy: increasing the accuracy for one leads to a necessary decrease in accuracy for the other. Similarly, in attempting to increase precision we always find a decrease in recall. In fact, under some mathematical models in information retrieval the trade-off is a necessary one, and not simply observed empirically. It must be emphasized that measuring retrieval effectiveness is a form of derived, as opposed to fundamental, measurement. The fundamental quantity involved is relevance; once this has been established we can attempt to measure retrieval effectiveness.
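The precision-recall trade-off described above can be made concrete with a small sketch (not from the chapter; the document names and relevance judgments are invented for illustration). Precision is the ratio of retrieved-and-relevant documents to documents retrieved, and recall is the ratio of retrieved-and-relevant documents to all relevant documents; computing both at increasing cut-off points in a ranked output typically shows recall rising while precision falls.

```python
# Illustrative sketch: precision and recall as ratios over a ranked
# output, evaluated at several cut-off depths k. The document
# identifiers and the relevant set are hypothetical.

def precision_recall(ranked_docs, relevant_docs, k):
    """Precision = (retrieved and relevant) / retrieved;
    Recall = (retrieved and relevant) / relevant."""
    retrieved = ranked_docs[:k]
    hits = sum(1 for d in retrieved if d in relevant_docs)
    return hits / k, hits / len(relevant_docs)

ranked = ["d3", "d1", "d7", "d2", "d9", "d5"]   # system's ranking
relevant = {"d1", "d2", "d5", "d8"}             # judged in advance

for k in (1, 3, 6):
    p, r = precision_recall(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Going deeper in the ranking can only increase recall (more relevant documents are eventually retrieved) while precision tends to drop, which is the empirical trade-off the text refers to.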
Because of the difficulties involved in establishing a theory of relevance, relevance-based measures have typically been used in an experimental (artificial as opposed to operational) context, that is, in one where the relevance of a document has been decided in advance. Given such an experimental set-up, it would appear that retrieval strategies can be evaluated for their effectiveness in terms of, say, precision and recall without any difficulty. Unfortunately life is not that simple, and difficulties arise with the form of measurement at different levels. A few are as follows:

(1) Sampling level?
(2) One or more variables?
(3) How to normalize for relevance feedback information?
(4) Effect of ranking on form of measurement?
(5) Effect of interpolation, interpretation?
(6) Effect of averaging technique?

Each of these technical problems, except the first, which has already been covered by Robertson, will be touched upon in the rest of this chapter.

3.2 Theoretical foundations

The problems of measurement in information retrieval differ from those encountered in the physical sciences in one important respect. In the physical sciences there is usually an empirical ordering of the quantities we wish to measure. For example, we can establish empirically by means of a scale which masses are equal, and which are greater or less than others. Such a situation does not hold in information retrieval. There is no empirical ordering for retrieval effectiveness, and therefore any measure of retrieval effectiveness will of necessity be artificial. The basic variables underlying any measure of retrieval effectiveness are usually precision and recall, or some other equivalent pair. The conventional way to define these is in terms of ratios; however, more recently it has proved fruitful to define them as probabilities. Recall is defined as the probability