Comparison of runs via randomization testing

randCompareMeans.pl implements a partial randomization test of the hypothesis that two search runs, whose effectiveness is measured, e.g., by (mean) average precision, are significantly different, against the null hypothesis that the differences are due to chance.

Here is the thinking behind this randomization test. Given observed scores for two systems on the same 24 topics, we can calculate the mean score for each system and the difference of these means. We would like to know if the difference is due to the systems or to chance. To test this we generate a distribution of differences between the means under the null hypothesis that the difference is due to chance and count how many such differences are equal to or more extreme than the observed difference. We take this count divided by the total number of generated differences as the probability that the observed difference in means is due to chance.

The code generates the distribution under the null hypothesis by calculating the difference for each topic's pair of observed scores and then iterating some tens of thousands of times, randomly choosing for each topic whether to multiply that topic's difference by -1 or 1 (i.e., whether to reverse the score-to-run assignment or not) before summing, calculating the mean, and finding the difference in the means.

The Perl script is written to be run against the search.results.table provided by NIST, which contains a row for each run evaluated, with a column for the run name and a column for the average precision score on each topic searched.
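The sign-flipping procedure described above can be sketched as follows. This is a minimal Python illustration, not the Perl script itself; all names are invented, and it simply counts all generated differences rather than tracking unique ones as the script's output suggests it does:

```python
import random

def randomization_test(scores_a, scores_b, iterations=100000, seed=None):
    """Estimate the probability that the observed difference in mean
    scores between two runs (paired by topic) is due to chance, by
    randomly flipping the sign of each per-topic difference, i.e.,
    randomly reversing the score-to-run assignment for that topic."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n  # observed difference in means (run A - run B)
    extreme = 0
    for _ in range(iterations):
        # For each topic, keep or reverse the score-to-run assignment.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Count generated differences equal to or more extreme (two-sided).
        if abs(total / n) >= abs(observed):
            extreme += 1
    return observed, extreme / iterations

# Toy example with per-topic scores for two hypothetical runs:
a = [0.42, 0.31, 0.55, 0.28]
b = [0.35, 0.30, 0.40, 0.25]
obs, p = randomization_test(a, b, iterations=20000, seed=1)
# Here every per-topic difference is positive, so only the all-plus and
# all-minus sign assignments are as extreme as the observed difference,
# and p should be near the exact value of 2/16 = 0.125.
```

With only four topics the full set of 2^4 sign assignments could be enumerated exactly; the random sampling above is what makes the test "partial", and is what the script does for realistic topic counts.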
The script takes the following arguments:
 - target number of iterations used in the randomization (usually >= 10,000)
 - significance level to be applied (0.01, 0.05, etc.)
 - the file in which to find the run results (search.results.table)
 - up to four strings, each of which must be contained by a run's name for that run to be included in the comparison

-------------------------------------------------------------------------
Example usage (to compare fully automatic runs from Tsinghua University in TRECVID 2005):

   randCompareMeans.pl 100000 0.01 search.results.table F_ THU

-------------------------------------------------------------------------
Example output contains a line for each comparison that showed one run to be significantly better than another. Each line contains the following:
 - name of run 1
 - comparison result
 - name of run 2
 - probability the observed difference is due to chance
 - count of generated differences equal to or more extreme than observed
 - total number of unique differences generated under the null hypothesis
 - observed difference in means (here: MAP) (Run 1 - Run 2)

   F_A_1_THU02_2 > F_A_2_THU04_4  0.001  120  99726  0.057
   F_A_2_THU01_1 > F_A_2_THU03_3  0.000    1  99687  0.058
   F_A_2_THU01_1 > F_A_2_THU04_4  0.000   43  99707  0.071
   F_A_2_THU01_1 > F_A_2_THU05_5  0.002  208  99681  0.012
   F_A_2_THU01_1 > F_A_2_THU06_6  0.001   65  99714  0.054
   F_A_2_THU05_5 > F_A_2_THU03_3  0.000    9  99685  0.045
   F_A_2_THU05_5 > F_A_2_THU04_4  0.001   73  99682  0.059
   F_A_2_THU05_5 > F_A_2_THU06_6  0.001  112  99673  0.041
   F_A_2_THU07_7 > F_A_2_THU04_4  0.002  238  99699  0.059

   Target iterations: 100000
   significance level: 0.01
   scores file: search.results.table
   run substring 1: F_
   run substring 2: THU
   run substring 3:
   run substring 4:

Number of runs each run is significantly better than
according to the current test:

   4 F_A_2_THU01_1
   3 F_A_2_THU05_5
   1 F_A_2_THU07_7
   1 F_A_1_THU02_2
   0 F_A_2_THU06_6
   0 F_A_2_THU04_4
   0 F_A_2_THU03_3

-------------------------------------------------------------------------
Background

Randomization tests have been around (at least as thought experiments) for a long time. In 1935 R. A. Fisher developed the idea of a paired comparison test based on randomly assigning the signs of the differences in paired scores. Randomization tests came into their own, along with other computer-intensive methods in statistics, as computer power increased.

According to Bryan Manly's 1997 "Randomization, Bootstrap and Monte Carlo Methods in Biology":

   "Randomization tests have the advantages of being valid with non-random samples and allowing the user to choose a test statistic that is appropriate for the particular situation being considered. A disadvantage, that results cannot necessarily be generalized to a population of interest, is less serious than it might seem at first sight because the generalization that is often made with conventional statistical procedures is based on the unverifiable assumption that non-random samples are equivalent to random samples, or that the data are for the entire population of interest." p. 23

   "Randomization tests and classical parametric tests tend to have similar power when the conditions for the parametric test are justified. With data from non-standard distributions there is some evidence to suggest that randomization tests are more powerful than classical alternatives." p. 90

   "[...] Kempthorne and Doerfler (1969) concluded from a study involving sampling eight distributions (normal, uniform, triangular, etc.) that Fisher's randomization test [...] is generally superior to the Wilcoxon signed ranks test, the sign test and the F test for testing for a treatment effect with a paired comparison design." p. 80

Kempthorne, Oskar and Doerfler, T. E.
The Behaviour of Some Significance Tests Under Experimental Randomization. Biometrika, Vol. 56, No. 2 (Aug. 1969), pp. 231-248.

Joseph P. Romano (On the Behavior of Randomization Tests Without a Group Invariance Assumption. Journal of the American Statistical Association, Vol. 85, No. 411 (Sep. 1990), pp. 686-692. See www.jstor.org/view/01621459/di985985/98p0348x/0) discusses limits on the use of randomization tests when the samples come from populations which differ in more than their location:

   "Within the context of comparing two samples, however, the behavior of the randomization approach depends on the functional of interest. For example, if the two samples have equal sizes, the randomization approach for testing means is generally valid, but is invalid if the ratio of the sample sizes does not tend to 1. On the other hand, the randomization approach is generally invalid for comparing medians, even when the sample sizes are identical." p. 687

See also:

Smucker, M. D., Allan, J., and Carterette, B. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007). CIKM '07. ACM, New York, NY, 623-632. DOI=http://doi.acm.org/10.1145/1321440.1321528