Comparison of runs via randomization testing
randCompareMeans.pl implements a partial randomization test of the
hypothesis that two search runs, whose effectiveness is measured,
e.g., by (mean) average precision, are significantly different,
against the null hypothesis that the differences are due to chance.
Here is the thinking behind this randomization test. Given observed
scores for two systems on the same 24 topics, we can calculate the
mean score for each system and the difference of these means. We would
like to know if the difference is due to the systems or to chance.
To test this we generate a distribution of differences between the
means under the null hypothesis that the difference is due to chance
and count how many such differences are equal to or more extreme than
the observed difference. We take this count divided by the total
number of generated differences as the probability that the observed
difference in means is due to chance.
The code generates the distribution under the null hypothesis by
calculating the difference for each topic's pair of observed scores
and then iterating some tens of thousands of times randomly choosing
for each pair of scores for a given topic whether to multiply the
difference by -1 or 1 (i.e., to reverse the score-to-run assignment or
not) before summing, calculating the mean, and finding the difference
in the means.
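The Perl source is not reproduced here. As an illustration only, here
is a minimal Python sketch of the same sign-flipping procedure (the
function and variable names are mine, and the actual script also
tallies unique generated differences, which this sketch omits):

```python
import random

def rand_compare_means(scores_a, scores_b, iterations=100_000, seed=None):
    """Partial (Monte Carlo) randomization test on paired per-topic scores.

    Returns (p_value, extreme_count, total): the fraction of randomized
    differences in means that are equal to or more extreme than the
    observed difference, plus the raw counts behind it.
    """
    assert len(scores_a) == len(scores_b), "scores must be paired by topic"
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n  # observed difference of the two means
    extreme = 0
    for _ in range(iterations):
        # Randomly reverse the score-to-run assignment for each topic,
        # i.e. multiply that topic's difference by -1 or 1, then take
        # the mean of the randomized differences.
        randomized = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(randomized) >= abs(observed):
            extreme += 1
    return extreme / iterations, extreme, iterations
```

Note that with two identical score lists the observed difference is 0,
so every randomized difference is at least as extreme and the sketch
reports p = 1.0, as a two-sided test should.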
The perl script is written to be run against the search.results.table
file provided by NIST, which contains a row for each run evaluated,
with a column for the run name and one column for the average
precision score on each topic searched.
The script takes the following arguments:
target number of iterations used in the randomization (usually >= 10,000)
significance level to be applied (0.01, 0.05, etc.)
the file in which to find the run results (search.results.table)
up to four strings, each of which must be contained in a run's name
for that run to be included in the comparison
-------------------------------------------------------------------------
Example usage (to compare fully automatic runs from Tsinghua University
in TRECVID 2005):
randCompareMeans.pl 100000 0.01 search.results.table F_ THU
-------------------------------------------------------------------------
Example output contains a line for each comparison that showed
one run to be significantly better than another. Each line contains
the following:
- name of run 1
- comparison result
- name of run 2
- probability that the observed difference is due to chance,
- count of generated differences equal to or more extreme than the observed one,
- total number of unique differences generated under the null hypothesis,
- observed difference in means (here: MAP) (Run 1 - Run 2)
F_A_1_THU02_2 > F_A_2_THU04_4 0.001 120 99726 0.057
F_A_2_THU01_1 > F_A_2_THU03_3 0.000 1 99687 0.058
F_A_2_THU01_1 > F_A_2_THU04_4 0.000 43 99707 0.071
F_A_2_THU01_1 > F_A_2_THU05_5 0.002 208 99681 0.012
F_A_2_THU01_1 > F_A_2_THU06_6 0.001 65 99714 0.054
F_A_2_THU05_5 > F_A_2_THU03_3 0.000 9 99685 0.045
F_A_2_THU05_5 > F_A_2_THU04_4 0.001 73 99682 0.059
F_A_2_THU05_5 > F_A_2_THU06_6 0.001 112 99673 0.041
F_A_2_THU07_7 > F_A_2_THU04_4 0.002 238 99699 0.059
Target iterations: 100000
significance level: 0.01
scores file: search.results.table
run substring 1: F_
run substring 2: THU
run substring 3:
run substring 4:
Number of runs each run is significantly better than
according to current test:
4 F_A_2_THU01_1
3 F_A_2_THU05_5
1 F_A_2_THU07_7
1 F_A_1_THU02_2
0 F_A_2_THU06_6
0 F_A_2_THU04_4
0 F_A_2_THU03_3
-------------------------------------------------------------------------
Background
Randomization tests have been around (at least as thought experiments)
for a long time. In 1935 R. A. Fisher developed the idea of a paired
comparison test that is based on randomly assigning the signs of the
differences in paired scores.
Randomization tests came into their own along with other computer
intensive methods in statistics as computer power increased.
According to Bryan Manly's 1997 "Randomization, Bootstrap and Monte
Carlo Methods in Biology":
Randomization tests have the advantages of being valid with non-
random samples and allowing the user to choose a test statistic that
is appropriate for the particular situation being considered. A
disadvantage, that results cannot necessarily be generalized to a
population of interest, is less serious than it might seem at first
sight because the generalization that is often made with conventional
statistical procedures is based on the unverifiable assumption that
non-random samples are equivalent to random samples, or the data are
for the entire population of interest. p. 23
Randomization tests and classical parametric tests tend to have
similar power when the conditions for the parametric test are
justified. With data from non-standard distributions there is some
evidence to suggest that randomization tests are more powerful than
classical alternatives. p. 90
[...]Kempthorne and Doerfler (1969) concluded from a study involving
sampling eight distributions (normal, uniform, triangular, etc.) that
Fisher's randomization test [...] is generally superior to the
Wilcoxon signed ranks test, the sign test and the F test for testing
for a treatment effect with a paired comparison design. p.80
Kempthorne, Oskar and Doerfler, T. E. The Behaviour of Some
Significance Tests Under Experimental Randomization. Biometrika,
Vol. 56, No. 2 (Aug. 1969), pp. 231-248.
Joseph P. Romano (On the Behavior of Randomization Tests Without a
Group Invariance Assumption. Journal of the American Statistical
Association, Vol. 85, No. 411 (Sep. 1990), pp. 686-692. See
www.jstor.org/view/01621459/di985985/98p0348x/0) discusses limits on
the use of randomization tests when the samples come from populations
which differ in more than their location:
"Within the context of comparing two samples, however, the behavior
of the randomization approach depends on the functional of interest.
For example, if the two samples have equal sizes, the randomization
approach for testing means is generally valid, but is invalid if
the ratio of the sample sizes does not tend to 1. On the other
hand, the randomization approach is generally invalid for comparing
medians, even when the sample sizes are identical." p.687
See also: Smucker, M. D., Allan, J., and Carterette, B. 2007. A
comparison of statistical significance tests for information retrieval
evaluation. In Proceedings of the Sixteenth ACM Conference on
Conference on information and Knowledge Management (Lisbon, Portugal,
November 06 - 10, 2007). CIKM '07. ACM, New York, NY, 623-632. DOI=
http://doi.acm.org/10.1145/1321440.1321528