Decision 9: How to analyse the data?

Microaverage of precision is 18/41 = 0.439. Macroaverage of precision is 2.5/4 = 0.625.

The choice of averaging method hinges on whether one wishes to give documents or queries equal weight in the averaging process. However, if the averages are to be used as sample estimates of population values, as discussed in the next section, then the microaverages should be used, as these have the statistically desirable property of maximum likelihood (see Tague and Farradane15). Another advantage of microaveraging is that one does not usually have to deal with the undefined value 0/0. In macroaveraging, one can either set such ratios equal to 1 or throw out the query. Neither course is really satisfactory. (A worked sketch of the two averages is given at the end of this section.)

Another problem, thoroughly discussed by Sparck Jones21 and others, relates to the recall-precision graph. Given ordered document output for a set of queries, the recall-precision graph will depend on both the measure of document-query similarity (the scores) and the choice of points to be displayed on the graph. As described in Section 5.3, there are a number of ways in which the document-query similarity can be measured. These include:

(1) Co-ordination level, i.e. the number of terms matching between query and document.
(2) Cosine coefficient and other weighting functions.

Documents may be ranked on the basis of any of these measures. In order to construct a recall-precision graph, the points at which recall and precision values will be averaged over queries and displayed on the graph must then be determined. There are four possibilities:

(1) Average recall and precision across queries at fixed document-query similarity scores. This method works well with co-ordination level scores but creates problems with document-query weights which assume a large number of values.
(2) Average recall and precision across queries at fixed document ranks. This method is useful when the document-query scores assume a large number of values.
(3) Average recall and precision values at either fixed scores or fixed ranks and then interpolate precision at standard recall values, for example 0, 0.1, 0.2, ..., 0.9, 1. This gives a smoother curve than Methods 1 and 2. Two interpolation methods have been suggested: (a) linear interpolation, (b) interpolation to the left between averaged recall values ('pessimistic' interpolation).
(4) Interpolate precision values at standard recall values for each query and then average the precision values over the queries. (A sketch of this method is given at the end of this section.)

When the number of terms matching between document and query (co-ordination level) is an independent variable, a set of average recall and precision values can be obtained for a query at each degree of match, i.e. at 1, 2, 3, ... matching terms. A problem arises because not all queries have the same number of terms, so that the average will be over different numbers of queries at some co-ordination levels. One can examine only subsets consisting
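
As an illustration of the two averages, the following sketch (not part of the original chapter) computes micro- and macroaveraged precision in Python from per-query counts of documents retrieved and relevant documents retrieved. The per-query counts are hypothetical, chosen only so that the pooled totals reproduce the 18/41 and 2.5/4 figures quoted above; they are not the chapter's own worked example.

    # Micro- and macroaveraged precision from per-query retrieval counts.
    # Each pair is (relevant documents retrieved, documents retrieved) for
    # one query; the counts are illustrative only.
    queries = [(3, 3), (3, 4), (7, 14), (5, 20)]

    # Microaverage: pool the counts over all queries and take a single
    # ratio, so every retrieved document carries equal weight.
    micro = sum(rel for rel, ret in queries) / sum(ret for rel, ret in queries)

    # Macroaverage: take the precision of each query and average the
    # ratios, so every query carries equal weight. A query retrieving
    # nothing would give the undefined ratio 0/0; here such queries are
    # simply dropped, one of the two unsatisfactory remedies noted above.
    ratios = [rel / ret for rel, ret in queries if ret > 0]
    macro = sum(ratios) / len(ratios)

    print(f"microaverage precision = {micro:.3f}")   # 18/41 = 0.439
    print(f"macroaverage precision = {macro:.3f}")   # 2.5/4 = 0.625

The weighting is what separates the two figures: queries with long retrieved lists dominate the microaverage, while every query counts equally in the macroaverage.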
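
The next sketch (again not from the original chapter, and using two hypothetical queries) illustrates Method 4: precision is interpolated at the standard recall values 0, 0.1, ..., 1.0 for each query, and the interpolated values are then averaged over the queries. The interpolation rule used here, taking at each standard recall level the highest precision attained at any recall greater than or equal to it, is offered only as one concrete possibility and may differ in detail from the 'pessimistic' interpolation described above.

    # Method 4: interpolate precision at standard recall values for each
    # query, then average the interpolated values over queries.
    # The ranked outputs below are hypothetical.

    STANDARD_RECALL = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0

    def recall_precision_points(ranking, n_relevant):
        # Recall and precision after each position of the ranked output;
        # `ranking` is a list of relevance flags in rank order.
        points, found = [], 0
        for rank, is_relevant in enumerate(ranking, start=1):
            if is_relevant:
                found += 1
            points.append((found / n_relevant, found / rank))
        return points

    def interpolated_precision(points, level):
        # Highest precision attained at any recall >= the standard level
        # (0 if the query never reaches that recall).
        return max((p for r, p in points if r >= level), default=0.0)

    # Each query: relevance flags of its ranked output and the total number
    # of relevant documents for that query in the collection.
    queries = [
        ([True, False, True, False, False, True], 3),
        ([False, True, True, False, True, False], 4),
    ]

    curves = []
    for ranking, n_relevant in queries:
        points = recall_precision_points(ranking, n_relevant)
        curves.append([interpolated_precision(points, r) for r in STANDARD_RECALL])

    # Average the per-query interpolated precision values at each level.
    averaged = [sum(values) / len(values) for values in zip(*curves)]
    for r, p in zip(STANDARD_RECALL, averaged):
        print(f"recall {r:.1f}: average precision {p:.3f}")

Averaging at fixed document ranks (Method 2) can be read off the same recall_precision_points output by taking, for each query, the pair at the chosen rank.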