IRE Information Retrieval Experiment The pragmatics of information retrieval experimentation chapter Jean M. Tague Butterworth & Company Karen Sparck Jones

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

precision, the Sign test could be used here, as the samples are dependent. To determine the test statistic $K$, a + or a - is assigned to each query according to whether the precision score for B is greater than or less than that for A. This gives the following sequence of signs:

+ + + - + + + - + -

The test statistic $K$ is the number of plus signs, so that $K = 7$. Under the null hypothesis, $K$ has a binomial distribution with parameters $n = 10$ and $p = 1/2$. Since the test is two-sided, i.e. the null hypothesis will be rejected for both high and low values of $K$, a two-sided critical region is needed. With discrete distributions taking few values, it is not always possible to define a critical region whose probability is exactly equal to the significance level. Here, the binomial distribution provides the following probabilities under $H_0$:

$$P(K < 2 \text{ or } K > 8) = 0.022$$

$$P(K < 3 \text{ or } K > 7) = 0.11$$

With a significance level of either 0.022 or 0.11 we would accept the null hypothesis, since the observed value $K = 7$ falls in neither critical region. Thus, we can conclude that for $\alpha = 0.05$ it will also be accepted.

If we are willing to assume a normal distribution for precision scores, perhaps from previous evidence, then the Student t test can be applied. Here, instead of assigning a + or - to each query, we determine the difference between the A and B precisions.
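The Sign test calculation described above can be checked with a few lines of code. The following is a minimal sketch, assuming the ten paired precision differences reported for this example are taken as B minus A; the names `differences`, `binom_pmf` and `region_probability` are illustrative and do not come from the chapter. Dropping zero differences is a common convention for ties and is an assumption here, since the example contains none.

```python
from math import comb

# Paired precision differences for the 10 queries, assumed to be B minus A,
# as reported for the worked example.
differences = [0.11, 0.01, 0.01, -0.02, 0.02, 0.47, 0.11, -0.01, 0.01, -0.07]


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def region_probability(lower, upper, n):
    """P(K <= lower or K >= upper) when K ~ Binomial(n, 1/2)."""
    return sum(binom_pmf(k, n) for k in range(n + 1) if k <= lower or k >= upper)


# Ties (zero differences) carry no sign information and are dropped (assumption).
nonzero = [d for d in differences if d != 0]
n = len(nonzero)
k_plus = sum(1 for d in nonzero if d > 0)  # test statistic K

p_narrow = region_probability(1, 9, n)  # region K < 2 or K > 8
p_wide = region_probability(2, 8, n)    # region K < 3 or K > 7

# H0 is rejected only if K falls inside the chosen critical region.
reject_narrow = k_plus <= 1 or k_plus >= 9
reject_wide = k_plus <= 2 or k_plus >= 8

print(f"n = {n}, K = {k_plus}")
print(f"P(K<2 or K>8) = {p_narrow:.4f}, reject H0: {reject_narrow}")
print(f"P(K<3 or K>7) = {p_wide:.4f}, reject H0: {reject_wide}")
```

The two region probabilities are exactly 22/1024 and 112/1024, which agree with the rounded figures quoted in the text. Because the observed K lies outside even the wider region, the null hypothesis is retained at any smaller significance level as well.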
For the ten queries, the differences (B minus A) are:

0.11, 0.01, 0.01, -0.02, 0.02, 0.47, 0.11, -0.01, 0.01, -0.07

The test statistic is

$$T = \frac{\sqrt{n}\,\bar{D}}{S} = 1.325$$

where $\bar{D}$ is the average of the 10 differences, $S$ is their sample standard deviation, and $n$ is the sample size, 10. The critical region for this test, determined from a table of the t distribution with $n - 1 = 9$ degrees of freedom, is $T > 2.26$ or $T < -2.26$. Thus, the same conclusion as in the Sign test is reached: accept $H_0$.

Though one may be surprised at the lack of a significant difference for these data, it must be remembered that small samples in general require very large differences to attain significance. Essentially, the test is saying that the observed superiority of Method B could arise from random fluctuations among queries. Unlike the Sign test, which uses only the direction of each difference, the t test is sensitive to the magnitude of the differences.

Tests for comparing three or more treatments are shown in Table 5.6. ANOVA procedures also exist for many more complicated multifactor designs, such as Latin squares. Until recently, corresponding non-parametric tests did not exist. However, there is active development in this area, and the investigator is advised to consult the recent statistics literature.

If the variances of the samples under different treatments appear to be unequal, they may be stabilized by a transformation of the original observations. Some common transformations are the square root, the logarithmic, and the arcsine (see Winer¹⁸ for details). Winer also points out