Approximate Randomization Tests
February 5th, 2013

Classic t-test

Why AR testing?
• Classic tests often assume a given distribution (Student's t, normal, …) of the variable.
• This is ≈OK for recall, but not for precision or F-score.
• The hypotheses that can be tested with standard nonparametric tests are limited.

Illustration
• 30,000 runs, 1,000 instances, 500 of class A
• True positives (TP): 400 (stdev: 80)
• False positives (FP): 60 (stdev: 15)
• Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are bounded by 0 and the number of instances.

Definitions
• Recall = truly predicted A / A in reference. The denominator is a constant, so if TP is normal, recall is normal.
• Precision = truly predicted A / A predicted by the system. The denominator (TP + FP) is itself random, and precision is a non-linear (ratio) combination of TP and FP, so precision is not normal.
• F-score: a non-linear combination of recall and precision, so not normal.

Approximate randomization (AR) test
• No assumption on the distribution
• Can handle complicated statistics
• Only assumption: independence between the shuffled elements
• References:
  – Noreen (1989), Computer-Intensive Methods for Testing Hypotheses.
  – Yeh (2000), More accurate tests for the statistical significance of result differences.

Basic idea
• Exact randomization test

            Glass 1   Glass 2   Glass 3   Glass 4
  Contents  Polish    Premium   Russian   Budget
  Expert    Polish    Premium   Budget    Russian

Exact probability
• H0: the expert's judgements are independent of the contents.
• P(ncorrect ≥ 2) = 7/24 ≈ 0.29. (Under H0, all 4! = 24 assignments of the expert's labels to the glasses are equally likely; 7 of them have at least two correct.)
• Do not reject H0, because this probability is larger than alpha = 0.05.

Approximate probability
• The number of permutations is n!, so it grows very quickly.
• If there are too many permutations to compute exhaustively, approximate:
  P = (nge + 1) / (NS + 1)
  – nge: the number of shuffles for which the pseudo-statistic ≥ the actual statistic
  – NS: the number of shuffles
  – +1: correction to keep the test valid

DIFFERENT SETUPS

Translation to instances
• Each glass is an instance.
• Contents and Expert are two labelling systems.
• Contents has an accuracy of 100%; Expert has an accuracy of 50%.
• The statistic is precision, F-score, recall, … instead of accuracy.

Stratified shuffling
• For labelled instances, it makes no sense to shuffle the class label of one instance onto another instance.
• Only shuffle labels per instance: for each instance, the two systems' labels are either kept or swapped.

MBT (memory-based tagging)
• Assumption of independence between instances
• Shuffle per sentence rather than per token

  Token   System 1   System 2
  This    DT         NNS
  is      VBZ        VB
  nice    JJ         RB
  .       .          .

Term extraction
• Shuffle extracted terms between the outputs of two term-extraction systems.

  Reference   System 1   System 2
  happy       happy      sad
  good        good       lively
              happy      angry

Script
• http://www.clips.ua.ac.be/~vincent/software.html#art
• http://www.clips.ua.ac.be/scripts/art
• Options:
  – Exact and approximate randomization tests
  – Instance-based, also for MBT
  – Term-extraction based
  – Stratified shuffling
  – Two-sided / one-sided (check the code!)

Remarks on usage
• It makes no sense to shuffle if the exact randomization test can be computed.
• The value of p depends on NS: the larger NS, the lower p can be (the smallest attainable p is 1 / (NS + 1)).
• Validity checks:
  – Sign test
  – Re-test: to alleviate a bad randomization

Sign test
• Its p-value can be compared with the AR p-value for accuracy.
• H0: correctness is independent of the system, i.e. P(green) = 0.5.
• Binomial test
[Figure: System 1 and System 2 outputs side by side, correctly labelled instances marked in green]

Interpretation (1)

  Reference   System 1   System 2
  A           A          B
  B           A          B
  C           A          B

How much do these two systems differ, based on precision for the A label?
• Maximally
• Intermediate
• Minimally
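The table on the next slide can be reproduced by exhaustively enumerating the per-instance swaps for this three-instance example. The following is only a minimal Python sketch, not the CLIPS art script; the function and variable names are mine, and it uses the convention from the table that precision is 0 when a label is never predicted.

```python
from itertools import product

def precision_a(reference, predictions, label="A"):
    """Precision for one label: correct predictions of `label` / all predictions of `label`.
    Returns 0.0 when the label is never predicted (the convention used on these slides)."""
    predicted = [r for r, p in zip(reference, predictions) if p == label]
    if not predicted:
        return 0.0
    return sum(1 for r in predicted if r == label) / len(predicted)

reference = ["A", "B", "C"]
system1   = ["A", "A", "A"]
system2   = ["B", "B", "B"]

# Stratified exact randomization: for every instance, either keep the two
# systems' labels (0) or swap them (1) -> 2^3 = 8 possible shuffles.
for swaps in product([0, 1], repeat=len(reference)):
    shuffled1 = [b if s else a for a, b, s in zip(system1, system2, swaps)]
    shuffled2 = [a if s else b for a, b, s in zip(system1, system2, swaps)]
    p1 = precision_a(reference, shuffled1)
    p2 = precision_a(reference, shuffled2)
    print(swaps, round(p1, 2), round(p2, 2), round(p1 - p2, 2))
```

Each printed line corresponds to one row of the table on the next slide (keep = AB, swap = BA), although not necessarily in the same order.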
Interpretation (2)

  Labels         PrecisionA
  A   B   C      System 1   System 2   Δ
  AB  AB  AB     1/3        0          1/3
  BA  AB  AB     0          1          -1
  AB  AB  BA     1/2        0          1/2
  BA  BA  AB     0          1/2        -1/2
  BA  AB  BA     0          1/2        -1/2
  AB  BA  BA     1          0          1
  BA  BA  BA     0          1/3        -1/3
  AB  BA  AB     1/2        0          1/2

(AB = the instance keeps its original System 1 / System 2 labels; BA = the two labels are swapped.)
Two-sided: all eight shuffles give an absolute difference at least as large as the observed 1/3, so the exact p-value is 8/8 = 1 and the observed difference between the two systems is not significant.

Conclusion
• Approximate randomization testing can be used for many applications.
• The basic idea is to judge how (im)probable the observed difference between two systems is when all possible permutations of the outputs are evaluated.
• The difference can be computed in many ways, as long as the shuffled elements are independent.
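To make the basic idea concrete, here is a minimal Python sketch of a stratified approximate randomization test for an arbitrary per-label statistic. It is a sketch only, not the CLIPS art script: the function names, the two-sided |difference| convention and the default number of shuffles are assumptions made here.

```python
import random

def precision_a(reference, predictions, label="A"):
    """Precision for one label; 0.0 when the label is never predicted."""
    predicted = [r for r, p in zip(reference, predictions) if p == label]
    if not predicted:
        return 0.0
    return sum(1 for r in predicted if r == label) / len(predicted)

def approximate_randomization(reference, system1, system2, statistic, ns=10000, seed=1):
    """Two-sided stratified approximate randomization test.

    Returns p = (nge + 1) / (NS + 1), where nge counts the shuffles whose
    pseudo-statistic |difference| is at least the observed |difference|.
    """
    rng = random.Random(seed)
    observed = abs(statistic(reference, system1) - statistic(reference, system2))
    nge = 0
    for _ in range(ns):
        shuffled1, shuffled2 = [], []
        # Stratified shuffling: per instance, swap the two systems' labels
        # with probability 0.5; labels never move to another instance.
        for a, b in zip(system1, system2):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled1.append(a)
            shuffled2.append(b)
        pseudo = abs(statistic(reference, shuffled1) - statistic(reference, shuffled2))
        if pseudo >= observed:
            nge += 1
    return (nge + 1) / (ns + 1)

# The three-instance example from the Interpretation slides: every shuffle
# produces an absolute difference >= the observed 1/3, so p comes out as 1.0.
print(approximate_randomization(["A", "B", "C"], ["A", "A", "A"], ["B", "B", "B"], precision_a))
```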