Information Retrieval Experiment (IRE), edited by Karen Sparck Jones. Butterworth & Company.
Chapter: The pragmatics of information retrieval experimentation, by Jean M. Tague.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Decision 9: How to analyse the data?

users, are shown in Table 5.5. The appropriate conditions for each test are also indicated.

TABLE 5.5

Design / Variable type                                            Test
Independent samples
  Normal, equal variances                                         T test
  Continuous or discrete with many values, large sample (>30)     Z test
  Continuous, discrete, some ordinal                              Wilcoxon-Mann-Whitney test, Median test
Dependent samples
  Normal                                                          T test of differences
  Continuous or discrete with many values, large sample (>30)     Z test of differences
  Continuous, discrete, some ordinal                              Sign test

By dependent samples, we mean that the samples under the two treatments are matched in some fashion: for example, two indexing languages applied to the same set of documents, or two search strategies used with the same set of queries.

The non-parametric tests such as the Wilcoxon-Mann-Whitney test and the Sign test have generally been developed on the basis of the assumption that the data are continuous. Modified procedures have been developed for situations in which the data are discrete and ties are present. Noether²⁵ points out that, in the long run, the proportion of times that H0 is rejected when true corresponds to the chosen significance level. Many texts also suggest that these tests can be applied to ordinal data.
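As an illustration of a Sign test on dependent samples with discrete data and ties, the following is a minimal sketch and not a procedure taken from the chapter. It assumes one common convention for ties, namely discarding tied pairs before counting signs; the function name `sign_test` and the sample counts are hypothetical and serve only to show the arithmetic.

```python
from math import comb


def sign_test(a, b):
    """Two-sided exact Sign test for paired samples a and b.

    Assumption of this sketch: tied pairs (a[i] == b[i]) are discarded
    before the signs are counted.
    Returns (n_plus, n_minus, p_value).
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    n_plus = sum(1 for x, y in zip(a, b) if x > y)
    n_minus = sum(1 for x, y in zip(a, b) if x < y)
    n = n_plus + n_minus  # number of non-tied pairs
    if n == 0:
        return n_plus, n_minus, 1.0  # nothing but ties: no evidence against H0
    k = min(n_plus, n_minus)
    # Under H0 each non-tied pair is "+" or "-" with probability 1/2,
    # so the smaller count follows a Binomial(n, 1/2) distribution.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_plus, n_minus, min(1.0, 2 * tail)


# Hypothetical data: relevant documents found in the top five results
# for the same eight queries under two search strategies.
strategy_1 = [3, 5, 2, 4, 4, 1, 5, 3]
strategy_2 = [2, 5, 1, 3, 2, 2, 4, 1]
print(sign_test(strategy_1, strategy_2))
```

With these illustrative counts one pair is tied and dropped, leaving six positive and one negative difference among seven pairs. At a significance level of 0.05, H0 would be rejected only if the returned p-value fell below 0.05.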
However, because the derivation of the tests depends on an assumption of continuous or discrete data, they should not be applied to ordinal data unless it makes sense to consider the ranks as merely representing an underlying continuous scale. For example, one might ask users to rank documents from two search strategies as to relevance and use a Wilcoxon test to compare the results if it was felt that the ranks represented a continuous relevance weight. Whether this assumption is justified is a theoretical rather than pragmatic question.

The condition that the population variances are equal, required for the Student T test, may be tested using an F test.

As an example of both a classical and a non-parametric test for the same hypothesis, consider the following test comparing two indexing languages. Ten queries are searched using both language A and language B. The null hypothesis and alternative hypothesis are:

H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0

where μ1 and μ2 represent the average precisions for the two languages. The significance level is set to 0.05. The sample precision values for the ten queries for each method are:

Method A: 0.65, 0.18, 0.32, 0.49, 0.64, 0.30, 0.86, 0.22, 0.35, 0.20
Method B: 0.78, 0.19, 0.33, 0.47, 0.66, 0.77, 0.97, 0.21, 0.36, 0.13

Since the sample size is small, if one were not certain of the normality of