CSI5388: Functional Elements of Statistics for Machine Learning, Part II

Contents of the Lecture
Part I (the previous set of lecture notes):
• Definition and Preliminaries
• Hypothesis Testing: Parametric Approaches
Part II (this set of lecture notes):
• Hypothesis Testing: Non-Parametric Approaches
• Power of a Test
• Statistical Tests for Comparing Multiple Classifiers

Non-Parametric Approaches to Hypothesis Testing
The hypothesis testing procedures discussed in the previous lecture are called parametric. This means that they are based on assumptions regarding the distribution of the population on which the test is run, and that they rely on the estimation of parameters of these distributions. In our case, we assumed that the distributions were either normal or followed a Student t distribution, and the parameters we estimated were the mean and the variance. The problem we now turn to is hypothesis testing that is not based on distributional assumptions and does not rely on the estimation of parameters.

The Different Types of Non-Parametric Hypothesis Testing Approaches I
There are two important families of tests that do not involve distributional assumptions and parameter estimation:
• Nonparametric tests, which rely on ranking the data and performing a statistical test on the ranks.
• Resampling statistics, which consist of drawing samples repeatedly from a population and evaluating the distribution of the result.
Resampling statistics will be discussed in the next lecture.

The Different Types of Non-Parametric Hypothesis Testing Approaches II
Nonparametric tests are quite useful for populations in which outliers skew the distribution too much; ranking eliminates that problem. However, they are typically less powerful (see the discussion of power below) than parametric tests. Resampling statistics are useful when the statistic of interest cannot be derived analytically (e.g., statistics about the median of a population) unless we assume a normal distribution.

Non-Parametric Tests: Wilcoxon's Rank-Sum Test
• The case of independent samples
• The case of matched pairs

Wilcoxon's Rank-Sum Tests
Wilcoxon's rank-sum tests are equivalent to the t-test, but apply when the normality assumption of the distribution is not met. As a result of their non-parametric nature, however, power is lost (see further for a formal discussion of power). In particular, the tests are not as specific as their parametric equivalents. This means that, although we interpret the result of these non-parametric tests to mean one thing of a central nature about the distributions under study, they could mean something else.

Wilcoxon's Rank-Sum Test (for two independent samples): Informal Description I
Consider two samples, with n1 observations in group 1 and n2 observations in group 2. The null hypothesis we are trying to reject is: "H0: The two samples come from identical populations (not just populations with the same mean)". We consider two cases:
• Case 1: The null hypothesis is false (to a substantial degree) and the scores from population 1 are generally lower than those of population 2.
• Case 2: The null hypothesis is true. This means that the two samples came from the same population.

Wilcoxon's Rank-Sum Test (for two independent samples): Informal Description II
In both cases, the procedure consists of ranking the scores of the two groups taken together.
• Case 1: In the first case, we expect the ranks from group 1 to be generally lower than those of group 2. In particular, we expect the sum of the ranks in group 1 to be smaller than the sum of the ranks in group 2.
• Case 2: In the second case, we expect the sum of the ranks of the first group to be about equal to the sum of the ranks of the second group.

Wilcoxon's Rank-Sum Test (for two independent samples): n1 and n2 ≤ 25
Consider the two groups of data, of sizes n1 and n2 respectively, where n1 is the smaller sample size. Rank their scores together from lowest to highest. In case of an x-way tie just after rank y, assign the average rank ((y+1) + (y+2) + … + (y+x))/x to all the tied elements. Add the ranks of the group containing the smaller number of samples (n1); if both groups contain the same number of samples, take the smaller of the two rank sums. Call this sum Ws. Find the critical value V in the Wilcoxon table for n1, n2 and the required significance level, where n1 in the table also corresponds to the smaller group. Compare Ws to V and conclude that the difference between the two groups is significant at the chosen level (α for a one-tailed test, 2α for a two-tailed test) only if Ws < V. If Ws ≥ V, the null hypothesis cannot be rejected.

Wilcoxon's Rank-Sum Test (for two independent samples): n1 and n2 > 25
Compute Ws as before. Use the fact that Ws approaches a normal distribution as the sample size increases, with:
• a mean of m = n1(n1+n2+1)/2, and
• a standard error of std = sqrt(n1*n2*(n1+n2+1)/12).
Compute the z statistic z = (Ws – m)/std and use the tables of the normal distribution.
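The following is a minimal Python sketch of the rank-sum procedure just described, assuming NumPy and SciPy are available; the function name rank_sum_test and the example scores are made up for illustration. For small samples, Ws would be compared to the tabulated value V rather than to the normal approximation; SciPy also ships a closely related ready-made test (scipy.stats.mannwhitneyu).

```python
# A minimal sketch of the rank-sum procedure above; data are illustrative only.
import numpy as np
from scipy.stats import rankdata, norm

def rank_sum_test(group1, group2):
    """Compute Ws for the smaller group and the large-sample z approximation."""
    # Make group 1 the smaller group, as the procedure requires.
    if len(group1) > len(group2):
        group1, group2 = group2, group1
    n1, n2 = len(group1), len(group2)

    # Rank all scores together; rankdata's default 'average' method assigns
    # tied observations the average of the ranks they would occupy.
    ranks = rankdata(np.concatenate([group1, group2]))
    ws = ranks[:n1].sum()          # sum of the ranks of the smaller group

    # Large-sample normal approximation (intended for n1 and n2 > 25).
    m = n1 * (n1 + n2 + 1) / 2.0
    std = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (ws - m) / std
    p_two_tailed = 2 * norm.sf(abs(z))
    return ws, z, p_two_tailed

# Example: error rates of two classifiers on disjoint sets of runs (made-up numbers).
ws, z, p = rank_sum_test([0.12, 0.18, 0.11, 0.15], [0.21, 0.19, 0.25, 0.22, 0.20])
print(f"Ws = {ws:.1f}, z = {z:.2f}, two-tailed p = {p:.3f}")
```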
Wilcoxon's Matched Pairs Signed Ranks Test (for paired scores): Informal Description
Logic of the test: consider the same population tested under two different circumstances, C1 and C2. If there is improvement in C2, then most of the results recorded in C2 will be greater than those recorded in C1, and those that are not greater will be smaller by only a small amount.

Wilcoxon's Matched Pairs Signed Ranks Test (for paired scores): n ≤ 50
• We calculate the difference score for each pair of measurements.
• We rank all the difference scores without paying attention to their signs (i.e., we rank their absolute values).
• We assign the algebraic sign of the differences to the ranks.
• We sum the positive and negative ranks separately.
• We choose as test statistic T the smaller of the absolute values of the two sums.
• We compare T to a Wilcoxon T table.

Wilcoxon's Matched Pairs Signed Ranks Test (for paired scores): n > 50
Compute T as before. Use the fact that T approaches a normal distribution as the sample size increases, with:
• a mean of m = n(n+1)/4, and
• a standard error of std = sqrt(n(n+1)(2n+1)/24).
Compute the z statistic z = (T – m)/std and use the tables of the normal distribution.
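Below is a minimal Python sketch of the matched-pairs procedure, again assuming NumPy and SciPy; the function name signed_rank_test and the paired accuracies are illustrative. Dropping zero differences is one common convention, not part of the procedure above, and for small n the statistic T would be compared to the Wilcoxon T table instead of the normal approximation. SciPy's scipy.stats.wilcoxon offers a ready-made version of this test.

```python
# A minimal sketch of the matched-pairs signed-ranks procedure above.
import numpy as np
from scipy.stats import rankdata, norm

def signed_rank_test(scores_c1, scores_c2):
    """Compute T and the large-sample z approximation for paired scores."""
    diffs = np.asarray(scores_c2, dtype=float) - np.asarray(scores_c1, dtype=float)
    diffs = diffs[diffs != 0]        # one common convention: zero differences are dropped
    n = len(diffs)

    # Rank the absolute differences (ties get the average rank), then re-attach the signs.
    ranks = rankdata(np.abs(diffs))
    t_plus = ranks[diffs > 0].sum()
    t_minus = ranks[diffs < 0].sum()
    t = min(t_plus, t_minus)         # test statistic T

    # Large-sample normal approximation (intended for n > 50).
    m = n * (n + 1) / 4.0
    std = np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (t - m) / std
    return t, z, 2 * norm.sf(abs(z))

# Example: accuracies of one classifier under two settings on the same folds (made-up numbers).
t, z, p = signed_rank_test([0.81, 0.79, 0.84, 0.80, 0.77, 0.83],
                           [0.85, 0.80, 0.86, 0.83, 0.78, 0.82])
print(f"T = {t:.1f}, z = {z:.2f}, two-tailed p = {p:.3f}")
```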
Power Analysis

Type I and Type II Errors
Definition: A Type I error (α) corresponds to the error of rejecting H0, the null hypothesis, when it is, in fact, true. A Type II error (β) corresponds to the error of failing to reject H0 when it is false.
Definition: The power of a test is the probability of rejecting H0 given that it is false: Power = 1 – β.

Why does Power Matter? I
All the hypothesis tests described in the previous three sections are only concerned with reducing the Type I error, i.e., they try to ensure that when we reject the null hypothesis, we do so rightly. They are not at all concerned with the case where the null hypothesis is really false but we do not reject it.

Why does Power Matter? II
In the case of machine learning, reducing the Type I error means reducing the probability of saying that there is a difference in the performance of the two classifiers when, in fact, there isn't. Reducing the Type II error means reducing the probability of saying that there is no difference in the performance of the two classifiers when, in fact, there is. Power matters because we do not want to discard a classifier that shouldn't have been discarded. If a test does not have enough power, this kind of situation can arise.

What is the Effect Size?
The effect size measures how strong the relationship between two entities is. In particular, if we consider a particular procedure, in addition to knowing how statistically significant its effect is, we may want to know how large that effect is. There are different measures of effect size, including:
• Pearson's correlation coefficient
• the odds ratio
• Cohen's d statistic
Cohen's d statistic is appropriate in the context of a t-test on means. It is thus the effect-size measure we concentrate on here. [Wikipedia: http://en.wikipedia.org/wiki/Effect_size]

Cohen's d Statistic
Cohen's d statistic is expressed as:
d = (X1 – X2) / sp
where sp^2, the pooled variance estimate, is:
sp^2 = ((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2)
and sp is its square root. [Note that this is not exactly Cohen's d measure, which was expressed in terms of population parameters; what we show above is an estimate of d.]

Usefulness of the d Statistic
d is useful in that it standardizes the difference between the two means: we can talk about deviations in terms of proportions of a standard deviation, which is more useful than raw differences that are domain dependent. Cohen came up with a set of guidelines concerning d:
• d = .2 is a small effect, but is probably meaningful;
• d = .5 is a medium effect that is noticeable;
• d = .8 is a large effect.
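Here is a minimal sketch of the pooled-variance estimate of Cohen's d defined above, assuming NumPy; the function name cohens_d and the sample accuracies are made up for illustration.

```python
# A minimal sketch of the estimate of Cohen's d given above; data are illustrative only.
import numpy as np

def cohens_d(sample1, sample2):
    """Estimate Cohen's d from two independent samples."""
    x1 = np.asarray(sample1, dtype=float)
    x2 = np.asarray(sample2, dtype=float)
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = x1.var(ddof=1), x2.var(ddof=1)     # unbiased sample variances
    # Pooled variance estimate sp^2 = ((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2)
    sp = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    return (x1.mean() - x2.mean()) / sp

# Example: accuracies of two classifiers over made-up runs.
d = cohens_d([0.84, 0.86, 0.83, 0.85, 0.87], [0.80, 0.82, 0.81, 0.79, 0.83])
print(f"Estimated Cohen's d = {d:.2f}")   # interpret with the guidelines above (.2 / .5 / .8)
```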
Statistical Tests for Comparing Multiple Classifiers

What is the Analysis of Variance (ANOVA)?
The analysis of variance is similar to the t-test in that it deals with differences between sample means. However, unlike the t-test, which is restricted to the difference between two means, ANOVA allows us to assess whether the differences observed between any number of means are statistically significant. In addition, ANOVA allows us to deal with more than one independent variable. For example, we could choose, as two independent variables, 1) the learning algorithm and 2) the domain to which the learning algorithm is applied.

Why is ANOVA Useful?
One may wonder why ANOVA is useful in the context of classifier evaluation. Very simply, if we want to answer the common question "How do various classifiers fare on different data sets?", then we have two independent variables, the learning algorithm and the domain, and a lot of results. ANOVA makes it easy to tell whether the differences observed are indeed significant.

Variations on the ANOVA Theme
There are different implementations of ANOVA:
• One-way ANOVA is a linear model that tries to assess whether the difference in the performance measures of classifiers over different datasets is statistically significant, but it does not distinguish between the performance measures' within-dataset variability and their between-dataset variability.
• Two-way/multi-way ANOVA can deal with more than one independent variable; for instance, the performance of different classifiers (one variable) over various datasets (a second variable).
There are other related tests as well: Friedman's test, post-hoc tests, the Tukey test, etc.

How does One-Way ANOVA work? I
It considers various groups of observations and sets as the null hypothesis that all the group means are equal. The alternative hypothesis is that they are not all equal. The ANOVA model is as follows:
xij = μi + eij
where xij is the jth observation from group i, μi is the mean of group i, and eij is noise that is normally distributed with mean 0 and common standard deviation σ.

How does One-Way ANOVA work? II
ANOVA monitors three different kinds of variation in the data:
• within-group variation
• between-group variation
• total variation = within-group variation + between-group variation
Each of these variations is represented by a sum of squares (SS) of the variations. The statistic of interest in ANOVA is F:
F = between-group variation / within-group variation
(more precisely, the ratio of the corresponding mean squares, i.e., each SS divided by its degrees of freedom). Larger values of F provide stronger evidence of a difference than smaller ones. As for z and t, there are tables of significance levels associated with the F-ratio.

How does One-Way ANOVA work? III
The goal of ANOVA is to find out whether or not the differences in means between different groups are statistically significant. To do so, ANOVA partitions the total variance into variance caused by random error (the within-group SS) and variance caused by actual differences between means (the between-group SS). If the null hypothesis holds, then the within-group SS should be about the same as the between-group SS (once each is scaled by its degrees of freedom). We can compare the two using the F test, which checks whether their ratio is significantly greater than 1.

What is Multi-Way ANOVA?
In one-way ANOVA, we simply considered several groups. For example, this could correspond to comparing the performance of 10 different classifiers on one domain. What about the case where we compare the performance of these same 10 classifiers on 5 domains? Two-way ANOVA can help with that. If we were to add a further dimension, such as the consideration of 6 different (but matched) threshold levels (as in AUC) for each classifier on the same 5 domains, then three-way ANOVA could be used, and so on.

How does Multi-Way ANOVA work?
In our example, the difference between one-way ANOVA and two-way ANOVA can be illustrated as follows:
• In one-way ANOVA, we would calculate the within-group SS by collapsing together the results obtained on all the data sets within each classifier's results.
• In two-way ANOVA, we would calculate all the within-classifier, within-domain variances separately and group the results together.
• As a result, the pooled within-group SS of two-way ANOVA is smaller than the pooled within-group SS of one-way ANOVA.
Multi-way ANOVA is thus a more statistically powerful test than one-way ANOVA, since we need fewer observations to find significant effects.
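To make the one-way partition concrete, here is a minimal Python sketch of the between-group and within-group sums of squares and the resulting F ratio, assuming NumPy and SciPy; the function name one_way_anova and the accuracy values are made up for illustration, and scipy.stats.f_oneway provides a ready-made equivalent.

```python
# A minimal sketch of the one-way ANOVA SS partition and F ratio described above.
import numpy as np
from scipy.stats import f as f_dist

def one_way_anova(*groups):
    """Return the F statistic and p-value for a one-way ANOVA over several groups."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()

    # Between-group SS: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group SS: how far each observation is from its own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    # F is the ratio of the mean squares (each SS divided by its degrees of freedom).
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    p_value = f_dist.sf(f_stat, df_between, df_within)
    return f_stat, p_value

# Example: accuracies of three classifiers, each evaluated over five made-up runs on one domain.
f_stat, p = one_way_anova([0.81, 0.83, 0.80, 0.82, 0.84],
                          [0.78, 0.79, 0.77, 0.80, 0.78],
                          [0.85, 0.86, 0.84, 0.87, 0.85])
print(f"F = {f_stat:.2f}, p = {p:.4f}")
```

For two-way or multi-way designs (classifier and domain as separate factors), the same idea extends with additional sums of squares per factor; in practice a statistics package such as statsmodels (an OLS fit followed by anova_lm) is typically used rather than hand-coding the partition.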