Comparison of runs via randomization testing

randCompareMeans.pl implements a partial randomization test of the hypothesis that two search runs, whose effectiveness is measured, e.g., by (mean) average precision, are significantly different, against the null hypothesis that the differences are due to chance.

Here is the thinking behind this randomization test. Given observed scores for two systems on the same 24 topics, we can calculate the mean score for each system and the difference of these means. We would like to know if the difference is due to the systems or to chance. To test this we generate a distribution of differences between the means under the null hypothesis that the difference is due to chance and count how many such differences are equal to or more extreme than the observed difference. We take this count divided by the total number of generated differences as the probability that the observed difference in means is due to chance.

The code generates the distribution under the null hypothesis by calculating the difference for each topic's pair of observed scores and then iterating some tens of thousands of times, randomly choosing for each topic whether to multiply that topic's difference by -1 or 1 (i.e., whether to reverse the score-to-run assignment or not) before summing, calculating the mean, and finding the difference in the means.

The Perl script is written to be run against the search.results.table provided by NIST, which contains a row for each run evaluated, with a column for the run name and a column for the average precision score on each topic searched.
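The sign-flipping procedure described above can be sketched as follows. This is a minimal Python illustration, not the Perl script itself; all names are invented, and it simply counts all generated differences rather than tracking unique ones as the script's output suggests it does:

```python
import random

def randomization_test(scores_a, scores_b, iterations=100000, seed=None):
    """Estimate the probability that the observed difference in mean
    scores between two runs (paired by topic) is due to chance, by
    randomly flipping the sign of each per-topic difference, i.e.,
    randomly reversing the score-to-run assignment for that topic."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n  # observed difference in means (run A - run B)
    extreme = 0
    for _ in range(iterations):
        # For each topic, keep or reverse the score-to-run assignment.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Count generated differences equal to or more extreme (two-sided).
        if abs(total / n) >= abs(observed):
            extreme += 1
    return observed, extreme / iterations

# Toy example with per-topic scores for two hypothetical runs:
a = [0.42, 0.31, 0.55, 0.28]
b = [0.35, 0.30, 0.40, 0.25]
obs, p = randomization_test(a, b, iterations=20000, seed=1)
# Here every per-topic difference is positive, so only the all-plus and
# all-minus sign assignments are as extreme as the observed difference,
# and p should be near the exact value of 2/16 = 0.125.
```

With only four topics the full set of 2^4 sign assignments could be enumerated exactly; the random sampling above is what makes the test "partial", and is what the script does for realistic topic counts.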
The script takes the following arguments:
 - target number of iterations used in the randomization (usually >= 10,000)
 - significance level to be applied (0.01, 0.05, etc.)
 - the file in which to find the run results (search.results.table)
 - up to four strings, each of which must be contained by a run's name for that run to be included in the comparison

-------------------------------------------------------------------------
Example usage (to compare fully automatic runs from Tsinghua University in TRECVID 2005):

   randCompareMeans.pl 100000 0.01 search.results.table F_ THU

-------------------------------------------------------------------------
Example output contains a line for each comparison that showed one run to be significantly better than another. Each line contains the following:
 - name of run 1
 - comparison result
 - name of run 2
 - probability the observed difference is due to chance
 - count of generated differences equal to or more extreme than observed
 - total number of unique differences generated under the null hypothesis
 - observed difference in means (here: MAP) (Run 1 - Run 2)

   F_A_1_THU02_2 > F_A_2_THU04_4  0.001  120  99726  0.057
   F_A_2_THU01_1 > F_A_2_THU03_3  0.000    1  99687  0.058
   F_A_2_THU01_1 > F_A_2_THU04_4  0.000   43  99707  0.071
   F_A_2_THU01_1 > F_A_2_THU05_5  0.002  208  99681  0.012
   F_A_2_THU01_1 > F_A_2_THU06_6  0.001   65  99714  0.054
   F_A_2_THU05_5 > F_A_2_THU03_3  0.000    9  99685  0.045
   F_A_2_THU05_5 > F_A_2_THU04_4  0.001   73  99682  0.059
   F_A_2_THU05_5 > F_A_2_THU06_6  0.001  112  99673  0.041
   F_A_2_THU07_7 > F_A_2_THU04_4  0.002  238  99699  0.059

   Target iterations: 100000
   significance level: 0.01
   scores file: search.results.table
   run substring 1: F_
   run substring 2: THU
   run substring 3:
   run substring 4:

Number of runs each run is significantly better than
according to the current test:

   4 F_A_2_THU01_1
   3 F_A_2_THU05_5
   1 F_A_2_THU07_7
   1 F_A_1_THU02_2
   0 F_A_2_THU06_6
   0 F_A_2_THU04_4
   0 F_A_2_THU03_3

-------------------------------------------------------------------------
Background

Randomization tests have been around (at least as thought experiments) for a long time. In 1935 R. A. Fisher developed the idea of a paired comparison test based on randomly assigning the signs of the differences in paired scores. Randomization tests came into their own, along with other computer-intensive methods in statistics, as computer power increased.

According to Bryan Manly's 1997 "Randomization, Bootstrap and Monte Carlo Methods in Biology":

   "Randomization tests have the advantages of being valid with non-random samples and allowing the user to choose a test statistic that is appropriate for the particular situation being considered. A disadvantage, that results cannot necessarily be generalized to a population of interest, is less serious than it might seem at first sight because the generalization that is often made with conventional statistical procedures is based on the unverifiable assumption that non-random samples are equivalent to random samples, or that the data are for the entire population of interest." p. 23

   "Randomization tests and classical parametric tests tend to have similar power when the conditions for the parametric test are justified. With data from non-standard distributions there is some evidence to suggest that randomization tests are more powerful than classical alternatives." p. 90

   "[...] Kempthorne and Doerfler (1969) concluded from a study involving sampling eight distributions (normal, uniform, triangular, etc.) that Fisher's randomization test [...] is generally superior to the Wilcoxon signed ranks test, the sign test and the F test for testing for a treatment effect with a paired comparison design." p. 80

Kempthorne, Oskar and Doerfler, T. E.
The Behaviour of Some Significance Tests Under Experimental Randomization. Biometrika, Vol. 56, No. 2 (Aug. 1969), pp. 231-248.

Joseph P. Romano (On the Behavior of Randomization Tests Without a Group Invariance Assumption. Journal of the American Statistical Association, Vol. 85, No. 411 (Sep. 1990), pp. 686-692. See www.jstor.org/view/01621459/di985985/98p0348x/0) discusses limits on the use of randomization tests when the samples come from populations which differ in more than their location:

   "Within the context of comparing two samples, however, the behavior of the randomization approach depends on the functional of interest. For example, if the two samples have equal sizes, the randomization approach for testing means is generally valid, but is invalid if the ratio of the sample sizes does not tend to 1. On the other hand, the randomization approach is generally invalid for comparing medians, even when the sample sizes are identical." p. 687

See also:

Smucker, M. D., Allan, J., and Carterette, B. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007). CIKM '07. ACM, New York, NY, 623-632. DOI=http://doi.acm.org/10.1145/1321440.1321528