IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
precision, the Sign test could be used here, as samples are dependent. To
determine the test statistic K, a + or a - is assigned to each query depending
on whether the precision score for B is greater than or is less than that for A.
This gives the following sequence of signs:
+ + + - + + + - + -
The test statistic K is the number of plus signs, so that K = 7. Under the null
hypothesis, K has a binomial distribution with parameters n = 10 and p = 1/2.
Since the test is two sided, i.e. the null hypothesis will be rejected for both
high and low values, a two-sided critical region is needed. With discrete
distributions taking few values, it is not always possible to define a critical
region which will have a probability exactly equal to the significance level.
Here, the binomial distribution provides the following probabilities under
H0:
P(K < 2 or K > 8) = 0.022
P(K < 3 or K > 7) = 0.1
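With a significance level of either 0.022 or 0.1 we would accept the null
hypothesis with a K value of 7, since 7 lies in neither critical region. Any
critical region for α = 0.05 is contained in the region for 0.1, so we can
conclude that for α = 0.05 the null hypothesis will also be accepted.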
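The two critical-region probabilities quoted above can be checked directly from the binomial distribution. The following is a minimal sketch using only the Python standard library; the function names are illustrative and do not come from the text, while n = 10 and K = 7 are the values given above.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability that a Binomial(n, p) variable equals k."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_region_probability(n: int, lower: int, upper: int) -> float:
    """P(K <= lower or K >= upper) under H0, where K ~ Binomial(n, 1/2)."""
    low_tail = sum(binomial_pmf(k, n) for k in range(0, lower + 1))
    high_tail = sum(binomial_pmf(k, n) for k in range(upper, n + 1))
    return low_tail + high_tail


n = 10          # number of queries, from the text
k_observed = 7  # number of plus signs, from the text

# Region K < 2 or K > 8, i.e. K <= 1 or K >= 9.
p_narrow = two_sided_region_probability(n, lower=1, upper=9)
# Region K < 3 or K > 7, i.e. K <= 2 or K >= 8.
p_wide = two_sided_region_probability(n, lower=2, upper=8)

# H0 is rejected only if the observed K falls inside the critical region.
reject_narrow = k_observed <= 1 or k_observed >= 9
reject_wide = k_observed <= 2 or k_observed >= 8

print(f"P(K < 2 or K > 8) = {p_narrow:.4f}")
print(f"P(K < 3 or K > 7) = {p_wide:.4f}")
print(f"Reject H0 at either level: {reject_narrow or reject_wide}")
```

The exact values are 22/1024 and 112/1024, which agree with the quoted 0.022 and 0.1 to within rounding. Since K = 7 falls in neither region, the sketch reaches the same decision as the text.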
If we are willing to assume a normal distribution for precision scores,
perhaps from previous evidence, then the Student T test can be applied. Here,
instead of assigning a + or - to each query, we determine the difference
between the B and A precisions (B minus A). This gives the following differences:
0.11, 0.01, 0.01, -0.02, 0.02, 0.47, 0.11, -0.01, 0.01, -0.07
The test statistic is
$T = \sqrt{n}\,D/S = 1.325$
where D is the average of the 10 differences, S is their sample standard
deviation, and n is the sample size, 10.
The critical region for this test, determined from a table of the T distribution,
is T > 2.26 or T < -2.26. Thus, the same conclusion as in the Sign test is
reached: accept H0.
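The T statistic can be recomputed from the ten differences listed above. This is a minimal sketch using the Python standard library; the variable names are illustrative, and the critical value 2.26 is taken from the text rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Differences in precision (B minus A) for the 10 queries, from the text.
differences = [0.11, 0.01, 0.01, -0.02, 0.02, 0.47, 0.11, -0.01, 0.01, -0.07]

n = len(differences)
d_bar = mean(differences)   # D: average of the differences
s = stdev(differences)      # S: sample standard deviation (n - 1 divisor)

# Paired T statistic: sqrt(n) * D / S.
t_stat = sqrt(n) * d_bar / s

# Two-sided critical value quoted in the text for this test.
critical_value = 2.26
reject_h0 = abs(t_stat) > critical_value

print(f"D = {d_bar:.3f}, S = {s:.4f}, T = {t_stat:.3f}")
print(f"Reject H0: {reject_h0}")
```

The computed statistic agrees with the quoted 1.325 to within rounding and lies well inside the interval from -2.26 to 2.26, so H0 is accepted.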
Though one may be surprised at the lack of a significant difference for these
data, it must be remembered that small samples in general require very large
differences to attain significance. Essentially, the test is saying that the
observed superiority of Method B could arise from random fluctuations
among queries. Unlike the Sign test, which uses only the direction of each
difference, the T test is sensitive to the magnitude of the differences.
Tests for comparing three or more treatments are shown in Table 5.6.
ANOVA procedures also exist for many more complicated multifactor
designs, such as Latin squares. Until recently, corresponding non-parametric
tests did not exist. However, there is active development in this area, and the
investigator is advised to consult the recent statistics literature.
If variances of the samples under different treatments appear to be
unequal, they may be stabilized by a transformation of the original
observations. Some common transformations are the square root, the
logarithmic, and the arcsin (see Winer18 for details). Winer also points out