IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
Decision 9: How to analyse the data?
users, are shown in Table 5.5. The appropriate conditions for each test are
also indicated.
TABLE 5.5

Design/Variable type                                            Test

Independent samples
  Normal, equal variances                                       T test
  Continuous or discrete with many values, large sample (>30)   Z test
  Continuous, discrete, some ordinal                            Wilcoxon-Mann-Whitney test, Median test

Dependent samples
  Normal                                                        T test of differences
  Continuous or discrete with many values, large sample (>30)   Z test of differences
  Continuous, discrete, some ordinal                            Sign test
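Table 5.5 amounts to a lookup from the experimental design and the kind of data to a suggested test. A minimal sketch of that lookup in Python follows. The name `choose_test` and the string labels used as keys are hypothetical conveniences for this sketch, not part of the original table.

```python
# Table 5.5 as a lookup: (design, data description) -> suggested test(s).
# The key strings are illustrative labels chosen for this sketch.
TABLE_5_5 = {
    ("independent", "normal, equal variances"): ["T test"],
    ("independent", "many values, large sample"): ["Z test"],
    ("independent", "continuous, discrete, some ordinal"): [
        "Wilcoxon-Mann-Whitney test",
        "Median test",
    ],
    ("dependent", "normal"): ["T test of differences"],
    ("dependent", "many values, large sample"): ["Z test of differences"],
    ("dependent", "continuous, discrete, some ordinal"): ["Sign test"],
}


def choose_test(design, data):
    """Return the tests Table 5.5 lists for this design and data type."""
    try:
        return TABLE_5_5[(design, data)]
    except KeyError:
        raise ValueError(f"no entry in Table 5.5 for {design!r}, {data!r}") from None


print(choose_test("dependent", "normal"))
```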
By dependent samples, we mean that the samples under the two treatments
are matched in some fashion: for example, two indexing languages applied
to the same set of documents, or two search strategies used with the same set
of queries.
The non-parametric tests such as the Wilcoxon-Mann-Whitney test and
the Sign test have generally been developed on the basis of the assumption
that the data are continuous. Modified procedures have been developed for
situations in which the data are discrete and ties are present. Noether25
points out that, in the long run, the proportion of times that H0 is rejected
when true corresponds to the chosen significance level. Many texts also
suggest that these tests can be applied to ordinal data. However, because the
derivation of the tests depends on an assumption of continuous or discrete
data, this approach should not be taken unless it makes sense to consider the
ranks as merely representing an underlying continuous scale. For example,
one might ask users to rank documents from two search strategies as to
relevance and use a Wilcoxon test to compare the results if it was felt that the
ranks represented a continuous relevance weight. Whether this assumption
is justified is a theoretical rather than pragmatic question.
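The text does not say which modified procedure for ties it has in mind. One widely used convention, assumed here, is to give tied observations the average of the ranks they would otherwise occupy (midranks) before forming the rank sum used by the Wilcoxon-Mann-Whitney test. The sketch below illustrates only that step; the function names and the relevance grades are hypothetical.

```python
def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + 1 + j + 1) / 2  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum(sample_a, sample_b):
    """Sum of the midranks of sample_a within the pooled samples."""
    ranks = midranks(list(sample_a) + list(sample_b))
    return sum(ranks[: len(sample_a)])


# Hypothetical relevance grades from two independent groups (not from the text).
grades_a = [1, 2, 2, 4]
grades_b = [2, 3, 5, 5]
print(midranks(grades_a + grades_b))
print(rank_sum(grades_a, grades_b))
```

The rank sum would then be referred to tables of the Wilcoxon-Mann-Whitney statistic, or to a normal approximation corrected for ties, neither of which is shown here.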
The condition that the population variances are equal, required for the
Student T test, may be tested using an F test.
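The F test compares the ratio of the two sample variances with a critical value of the F distribution. A minimal sketch of the statistic follows. Putting the larger variance in the numerator is a convention assumed here, and the name `f_statistic` and the sample values are hypothetical. The critical value is not computed; it would be read from F tables for the two degrees of freedom returned.

```python
from statistics import variance


def f_statistic(sample_1, sample_2):
    """Variance ratio with the larger sample variance in the numerator.

    Returns (F, numerator df, denominator df). Raises ZeroDivisionError
    if the smaller sample variance is exactly zero.
    """
    v1, v2 = variance(sample_1), variance(sample_2)  # unbiased (n - 1) estimates
    if v1 >= v2:
        return v1 / v2, len(sample_1) - 1, len(sample_2) - 1
    return v2 / v1, len(sample_2) - 1, len(sample_1) - 1


# Hypothetical precision values for two independent groups of queries.
group_1 = [0.2, 0.4, 0.6, 0.8]
group_2 = [0.45, 0.5, 0.5, 0.55]
f, df_num, df_den = f_statistic(group_1, group_2)
print(round(f, 2), df_num, df_den)
```

With the larger variance on top, a two-sided test at level 0.05 compares F with the upper 0.025 point of the F distribution.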
As an example of both a classical and a non-parametric test for the same
hypothesis, consider the following test comparing two indexing languages.
Ten queries are searched using both language A and language B. The null
hypothesis and alternative hypothesis are:
    H0: μ1 - μ2 = 0
    H1: μ1 - μ2 ≠ 0
where μ1 and μ2 represent the average precisions for the two languages. The
significance level is set to 0.05. The sample precision values for the ten
queries for each method are:
Method A: 0.65, 0.18, 0.32, 0.49, 0.64, 0.30, 0.86, 0.22, 0.35, 0.20
Method B: 0.78, 0.19, 0.33, 0.47, 0.66, 0.77, 0.97, 0.21, 0.36, 0.13
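Because the same ten queries are searched under both languages, the samples are dependent, and Table 5.5 points to the T test of differences and the Sign test. The sketch below computes both statistics for the values above. The function names are hypothetical, and dropping zero differences in the Sign test is an assumed convention (no zero differences occur in these data).

```python
from math import comb, sqrt
from statistics import mean, stdev

method_a = [0.65, 0.18, 0.32, 0.49, 0.64, 0.30, 0.86, 0.22, 0.35, 0.20]
method_b = [0.78, 0.19, 0.33, 0.47, 0.66, 0.77, 0.97, 0.21, 0.36, 0.13]


def paired_t(x, y):
    """T statistic for the mean of the paired differences x - y, with its df."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


def sign_test(x, y):
    """Counts of positive and negative differences and a two-sided p-value.

    Zero differences are dropped, a common convention assumed here.
    """
    d = [a - b for a, b in zip(x, y)]
    plus = sum(1 for v in d if v > 0)
    minus = sum(1 for v in d if v < 0)
    n = plus + minus
    k = min(plus, minus)
    # Exact binomial tail under H0: each sign has probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)


t, df = paired_t(method_a, method_b)
plus, minus, p = sign_test(method_a, method_b)
print(f"t = {t:.3f} on {df} df")
print(f"signs: {plus} plus, {minus} minus, two-sided p = {p:.4f}")
```

For these data t is about -1.36 on 9 degrees of freedom, inside the two-tailed 0.05 critical value of 2.262 from standard t tables, and the Sign test finds 3 positive and 7 negative differences with a two-sided p-value of about 0.34. Under either test H0 is not rejected at the 0.05 level.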
Since the sample size is small, if one were not certain of the normality of