Information Retrieval Experiment (IRE), edited by Karen Sparck Jones. Butterworth & Company.
Chapter: Retrieval effectiveness, by Cornelis J. van Rijsbergen.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Measurement of effectiveness

In the main there are two distinct ways of looking at the problem of measuring effectiveness. One way assumes that the effectiveness of a given system for a set of queries is a direct function of the effectiveness for individual queries, without reference to how these might have been arrived at; let us call this the predictive approach. Another way is to insist that to represent the effectiveness for a set of queries one should set up a correspondence between levels of effectiveness of different queries based on some control variable (e.g. match value) used to generate the ranking of documents in the first place; let us call this the descriptive approach. A minor variation of this latter approach is to use the rank order number of the document as the control variable; the actual value is then ignored.

Perhaps if the above is illustrated by describing what happens in terms of precision and recall it will be clearer. I shall limit the example to these two parameters, as any other two parameters would be treated analogously. The first approach attempts to summarize, by simple averaging, precision values at given recall values. The second approach averages both precision and recall at a given value of a control parameter, e.g. co-ordination level.
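The contrast between the two kinds of averaging may be easier to see in a small sketch. The Python below is an illustration only, not a procedure taken from this chapter: the data layout, the function names, the two toy queries, and in particular the interpolation rule (take the best precision observed at any recall at or above the requested level) are all assumptions made for the example. For the second approach it pools the counts over queries at each co-ordination level, which is one way of averaging at a fixed value of the control variable.

```python
from typing import Dict, List, Tuple

# Hypothetical data model: a query is (ranked_docs, total_relevant), where
# ranked_docs lists (coordination_level, is_relevant) in decreasing level order.
Query = Tuple[List[Tuple[int, bool]], int]


def pr_points(query: Query) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each rank position of one query."""
    docs, total_relevant = query
    points, hits = [], 0
    for rank, (_, relevant) in enumerate(docs, start=1):
        hits += relevant
        points.append((hits / total_relevant, hits / rank))
    return points


def interpolated_precision(points: List[Tuple[float, float]], recall_level: float) -> float:
    """Assumed rule: best precision at any point whose recall >= recall_level."""
    return max((p for r, p in points if r >= recall_level), default=0.0)


def predictive_average(queries: List[Query], recall_levels: List[float]) -> Dict[float, float]:
    """First approach: average per-query precision at fixed recall values."""
    per_query = [pr_points(q) for q in queries]
    return {
        level: sum(interpolated_precision(pts, level) for pts in per_query) / len(queries)
        for level in recall_levels
    }


def descriptive_average(queries: List[Query], cutoffs: List[int]) -> Dict[int, Tuple[float, float]]:
    """Second approach: pool counts over queries at each co-ordination level.

    Returns {cutoff: (recall, precision)} computed from the pooled counts.
    """
    result = {}
    for cutoff in cutoffs:
        retrieved = relevant_retrieved = total_relevant = 0
        for docs, n_relevant in queries:
            selected = [rel for level, rel in docs if level >= cutoff]
            retrieved += len(selected)
            relevant_retrieved += sum(selected)
            total_relevant += n_relevant
        precision = relevant_retrieved / retrieved if retrieved else 0.0
        result[cutoff] = (relevant_retrieved / total_relevant, precision)
    return result


# Two toy queries (invented data, for illustration only).
q1: Query = ([(3, True), (2, False), (2, True), (1, False)], 2)
q2: Query = ([(2, True), (1, False), (1, True)], 3)

print(predictive_average([q1, q2], [0.5, 1.0]))
print(descriptive_average([q1, q2], [3, 2, 1]))
```

Note that the first function must supply a precision value at recall 1.0 for the second toy query even though that query never reaches it (here the assumed rule gives zero), whereas the second function has to treat co-ordination level 2 in one query as equivalent to level 2 in the other.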
Both methods have problems: the first requires the precision value to be defined at recall values not necessarily achieved at any control variable value. The second method requires a decision about which value of the control variable for one query will correspond to what value of the control variable for another query, so that averaging may be done across queries for precision-recall values at corresponding values of the control variable. This still leaves open the question, for either method, how might these averages be computed? The predictive approach requires interpolation and extrapolation of precision values so that averages can be computed at given recall values. On the other hand, in the descriptive approach one need not calculate precision-recall values for individual queries at any given value of the control variable; instead one pools the documents and calculates what are known as micro-averages. In other words, for all queries one pools the documents retrieved and the relevant documents retrieved and then calculates an average recall and precision. Once the averages have been calculated it would appear that the descriptive approach answers question (1), whereas the predictive approach answers question (2). Of course, once the averages have been calculated we still only have a set of average precision-recall values; a final step is to link these points into a continuous curve.

There are arguments for both approaches. In an earlier publication I have strongly argued in favour of the predictive approach [7]. Sparck Jones [8] has argued in favour of the descriptive approach.

Although the above discussion has assumed that retrieval output is subject to a control variable leading to a sequence of nested sets of retrieved documents, some strategies will only retrieve one set of documents.
For example, output from a boolean search, or from a cluster-based retrieval strategy, will be just one unordered set of documents. Problems arise when attempting to compare 'set retrieval' with 'ranked retrieval', which requires some statement about the comparative performance of two retrieval strategies, one for which the effectiveness is represented by a graph, the other by a point. It is for cases like this that a single number effectiveness measure, call it F, can be useful. If one assumes that for every point of the graph a single number measure can be calculated, then one way of comparing