IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation (chapter)
Jean M. Tague
Butterworth & Company
Karen Sparck Jones

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Decision 9: How to analyse the data?

TABLE 5.6

                                        Type of variable
Design                   Approx. normal, equal variances     Continuous, discrete, some ordinal
Single factor
  Independent samples    One-way ANOVA                       Kruskal-Wallis test
  Dependent samples      One-way ANOVA, repeated measures    Friedman test
  Complete blocks        One-way ANOVA, complete blocks      Noether's T test
  Incomplete blocks      One-way ANOVA, incomplete blocks    Durbin test

that the F test is relatively insensitive to moderate departures from normality. Thus, it may be used when the data are only approximately normal. Many of the variance-stabilizing transformations also make the data more nearly normal. In general, data consisting of counts, e.g. number of relevant documents, or times, e.g. search time, should be analysable by parametric methods. The arcsin transformation is useful in stabilizing the variances and improving the normality of proportions such as recall and precision. Times whose distribution is skewed towards low values can have their distributions improved by the logarithmic transformation.

Following a significant ANOVA, i.e. a significant difference in treatments, the experimenter may wish to test which particular treatment pairs differ. A number of tests are available for such contrasts: the Newman-Keuls, Duncan, Tukey, and Scheffé tests. Details may be found in Winer.

Wherever possible, a parametric test is to be preferred to a non-parametric one because of its greater efficiency.
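The transformations and the one-way ANOVA for independent samples described above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own procedure: the function names are hypothetical, the arcsin transformation is taken in its common square-root form, arcsin(sqrt(p)), which is an assumption about the variant intended, and the precision values are invented purely for demonstration.

```python
import math


def arcsin_sqrt(p):
    """Angular (arcsin square-root) transform of a proportion in [0, 1].

    Assumption: the common arcsin(sqrt(p)) variant is used here.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(p))


def log_time(t):
    """Natural-log transform of a strictly positive time measurement."""
    if t <= 0:
        raise ValueError("time must be positive")
    return math.log(t)


def one_way_f(groups):
    """F statistic for a one-way ANOVA on independent samples.

    groups is a list of lists, one list of observations per treatment.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups and n > k")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-treatment and within-treatment sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical precision values for two retrieval systems (illustration only).
precision_a = [0.20, 0.35, 0.50, 0.10]
precision_b = [0.40, 0.55, 0.70, 0.30]
f_stat = one_way_f([[arcsin_sqrt(p) for p in precision_a],
                    [arcsin_sqrt(p) for p in precision_b]])
print(f"F on arcsin-transformed precision: {f_stat:.3f}")
```

The resulting F value would still have to be compared with the tabulated F distribution on (k - 1, n - k) degrees of freedom; that lookup is left out of the sketch.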
Pitman (see Noether) defines efficiency as follows: 'If we have two tests of the same hypothesis and significance level and if for the same power with respect to the same alternative one test requires a sample size N1 and the other a sample size N2, the relative efficiency of the first with respect to the second is given by e = N2/N1.' Noether gives specific examples of the efficiency of non-parametric tests against normal curve alternatives. The asymptotic (i.e. large sample) efficiency of the T, Kruskal-Wallis, Durbin, Friedman, and Wilcoxon-Mann-Whitney tests will not fall below 0.864 and may be as high as 0.955. The Sign test, however, has an efficiency of only 0.64. By the definition above, an efficiency of 0.864 means that, for large samples, the non-parametric test needs roughly 100 observations to reach the power that the parametric test achieves with about 86.

Another advantage of parametric tests is that they are easier to compute. Most non-parametric tests require ranking the observations, an operation whose time is proportional to n^2, or at least n log n. Parametric tests, on the other hand, are based on adding and squaring operations whose time is proportional to n. For large samples, this difference may be important.

Exploring relationships

Exploring relationships may involve either:

(1) Determining if two variables are related or independent, e.g. is search time related to searcher experience?
(2) Estimating the degree of relationship between them, e.g. what is the correlation between the frequency of use of a document and its age?
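The two questions above can be illustrated with the product-moment (Pearson) correlation coefficient. The sketch below is only one possible approach and is not prescribed by the text: the function names and the small data set are hypothetical, and the t statistic assumes approximately bivariate normal data, under which it follows a t distribution on n - 2 degrees of freedom when the variables are unrelated.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic for testing zero correlation (n - 2 degrees of freedom)."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical data (illustration only): document age in years and
# number of uses recorded for each document.
age = [1, 2, 4, 6, 9, 12]
uses = [30, 22, 18, 9, 7, 3]

r = pearson_r(age, uses)          # question (2): degree of relationship
t = t_for_r(r, len(age))          # question (1): related or independent?
print(f"r = {r:.3f}, t = {t:.3f} on {len(age) - 2} degrees of freedom")
```

The sign and size of r address the degree of relationship; comparing t with tabulated critical values addresses whether the variables can be regarded as independent.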