Wilcoxon and Kruskal-Wallis: Rank-based alternatives to t-tests and ANOVA

T-tests and ANOVA can perform poorly when we have large outliers in our data or when the data are not normally distributed. Non-normal distributions violate the assumptions of ANOVA, and can greatly reduce the power to detect significant effects. Even when graphs of our data suggest there is a difference among the groups, we may get nonsignificant p-values from t-tests and ANOVA as a result of the outliers or non-normal distributions. In these cases, we can use alternatives to the t-test and ANOVA based on ranks:

Non-parametric (rank) test     Corresponding parametric test
Wilcoxon rank sum test         T-test (ordinary, two-sample version)
Wilcoxon signed rank test      Paired t-test
Kruskal-Wallis test            ANOVA

T-tests and ANOVA are called parametric tests because they assume that the data follow a distribution (such as the normal distribution) that is described by parameters such as the mean and standard deviation.

The relationship between the rank-based tests and the parametric tests is like the relationship between the median and the mean. The median (which is based on rank) is less affected by outliers than is the mean. Because the rank-based methods do not rely on assumptions about the distribution (such as normal) or parameters (such as mean and standard deviation), they are also called distribution-free or nonparametric methods.

The Wilcoxon rank sum test is also known as the Mann-Whitney U test. It is sometimes called the Wilcoxon-Mann-Whitney test to give credit to everyone.

If the sample size is large, deviations from the normal distribution are less of a problem for t-tests and ANOVA.

Wilcoxon rank sum for two independent samples:

Recall a t-test example where we asked the question: do colon cancer patients have elevated levels of the mucin protein in their blood? We measured the level of the protein mucin in the blood of patients with colon cancer and in healthy controls.

Group              Mucin level
Colon cancer       83, 89, 90, 93, 98
Healthy control    99, 100, 103, 104, 141

The boxplot shows a separation of the two groups, but the p-value for the t-test is not significant (p=0.054).

The Wilcoxon rank sum test is an alternative to the t-test that uses the rank value of each observation, rather than the actual value. Here are the rank values of the original observations.

Group              Mucin level               Rank of mucin level
Colon cancer       83, 89, 90, 93, 98        1, 2, 3, 4, 5
Healthy control    99, 100, 103, 104, 141    6, 7, 8, 9, 10

Using the rank values, the Wilcoxon rank sum test yields a p-value of p=0.0079, so we reject the null hypothesis and conclude that colon cancer patients have mucin levels different from those of healthy controls.

Excel does not have the rank-based tests built in. If you only have Excel, you can convert the original values to ranks and analyze the rank values using a t-test or ANOVA. This method does not give identical results to using the Wilcoxon tests or Kruskal-Wallis, but the results are quite similar, and the method is an acceptable alternative if you don't have software that will do the rank tests. For the mucin data, the t-test on the rank values gives p=0.0010, which leads to the same conclusion as the Wilcoxon rank sum test (p=0.0079).

Recall that we tried a log transform of the data to reduce the effect of the outlier. For these data, the log transform yields a t-test p-value of p=0.036. If a log transform makes your data more normally distributed, it may be worthwhile trying that before trying the rank methods.

Wilcoxon signed rank test for paired (matched) samples:

Recall that we used the paired t-test to test for a difference in paired (matched) samples, such as the difference before and after treatment. The Wilcoxon signed rank test is the rank-based analog of the paired t-test. Here's an example: do bears lose weight between winter and spring?
We previously used the paired t-test to examine the change in weight of bears, where the same bears were weighed in winter and in spring. We'll analyze the data using the Wilcoxon signed rank test.

Measurement time    Bear weights
Winter              300, 470, 550, 650, 750, 760, 800, 985, 1100, 1200
Spring              280, 420, 500, 620, 690, 710, 790, 935, 1050, 1110
Difference          20, 50, 50, 30, 60, 50, 10, 50, 50, 90

Notice that all the bears lose weight. Using the paired t-test, we get p = 0.0001053, which is significant. We reject the null hypothesis that the change in weight between winter and spring is zero. The Wilcoxon signed rank test gives us p = 0.0053, so we still reject the null hypothesis.

In this case, the paired t-test and the Wilcoxon signed rank test lead to the same conclusion. Construct some data sets with outliers to see when the t-test and Wilcoxon tests give different results.

The Kruskal-Wallis test for two or more treatment groups

The Kruskal-Wallis test is the non-parametric version of one-way ANOVA, and allows us to compare two or more treatment groups. For the chickwts example:

    Kruskal-Wallis rank sum test

    data:  weight by feed
    Kruskal-Wallis chi-squared = 37.3427, df = 5, p-value = 5.113e-07

When should I use t-tests and ANOVA versus Wilcoxon tests and Kruskal-Wallis?

The Wilcoxon tests and Kruskal-Wallis don't require the assumption of normal distributions, and are much less affected by outliers, so why don't we always use them instead of t-tests and ANOVA? There are several considerations.

If the data are normally distributed, ANOVA and t-tests have slightly more power to detect differences than do the rank-based tests.
It is easier to include additional variables in ANOVA and regression models (which also assume normally distributed residuals) than it is to include additional variables in rank-based models.
ANOVA and t-tests give us a quantitative measure of the difference between the group means, and can provide group means adjusted for covariates.
It is generally easier to get confidence intervals using methods that assume the data are normally distributed. But newer computer-intensive methods such as the bootstrap now make getting confidence intervals easier for rank methods.
When we have very small sample sizes (5 to 10 per group) it may be mathematically nearly impossible to get a significant p-value using a non-parametric test. But we may be able to get significance using a parametric test, if we are willing to make the assumption of normal or t distributions.

Do the Wilcoxon tests and Kruskal-Wallis test compare the medians of groups?

The Wilcoxon tests and Kruskal-Wallis do not compare the medians of the treatment groups. They compare the entire distributions of the groups (that is, the ranks of all the observations, not just the median). You can construct two treatment groups that have identical medians but differ in their rank sums, and that will give significant differences in the Wilcoxon or Kruskal-Wallis tests.

group.A=(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1, 1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1)

group.B=(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 2,2,2,2,2,2,2,22,2,2,2,2,2,2,2,2,2,2,2,2,2,2,22,2,2,2,2,2,2,2)

The median of both groups is 0. The Wilcoxon rank sum test gives p = 0.0295, indicating that the two groups are significantly different. This result shows that, although group A and group B have identical medians, they differ in location, and differ significantly in the Wilcoxon test.
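
The group A versus group B comparison can be checked by hand. The sketch below is one way to do it in Python using only the standard library; it is not the software used to produce the p-value quoted in the text. It assumes the group contents are as read off the listing (36 zeros and 32 ones in group A; 36 zeros, 28 twos, and two values of 22 in group B), and it assumes a normal approximation with a tie correction and a continuity correction, which is a common default when the data contain ties. The function names are illustrative.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist, median


def midranks(values):
    """Return ({value: average rank}, {value: tie count}) for the pooled sample."""
    counts = Counter(values)
    ranks, next_rank = {}, 1
    for v in sorted(counts):
        t = counts[v]
        # Tied observations share the average of the ranks they occupy.
        ranks[v] = next_rank + (t - 1) / 2
        next_rank += t
    return ranks, counts


def rank_sum_test(x, y):
    """Rank sum of x and an approximate two-sided p-value.

    Assumption: normal approximation with tie and continuity corrections.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, counts = midranks(list(x) + list(y))
    w = sum(ranks[v] for v in x)              # rank sum of the first group
    mean_w = n1 * (n + 1) / 2                 # expected rank sum under the null
    tie_term = sum(t**3 - t for t in counts.values()) / (n * (n - 1))
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = max(abs(w - mean_w) - 0.5, 0.0) / sqrt(var_w)
    p = min(1.0, 2 * (1 - NormalDist().cdf(z)))
    return w, p


# Assumed reading of the two listings in the text.
group_a = [0] * 36 + [1] * 32
group_b = [0] * 36 + [2] * 28 + [22] * 2

w_a, p_value = rank_sum_test(group_a, group_b)
print("median A:", median(group_a), "median B:", median(group_b))
print("rank sum of A:", w_a)
print("approximate two-sided p:", round(p_value, 4))
```

Both medians come out as 0, yet the rank sum of group A falls well below its expected value because every non-zero value in group B outranks every non-zero value in group A. Under the stated assumptions the approximate p-value lands close to the 0.0295 quoted in the text.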
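
The point about very small samples can be made concrete with an exact rank sum test, which is feasible by brute force when the groups are tiny. The following is a minimal sketch in Python (standard library only), assuming no tied values and a two-sided test; the function names are illustrative and are not taken from any statistics package. It enumerates every way the ranks could be split between the two groups and counts the splits at least as extreme as the one observed, using the mucin data from the rank sum example.

```python
from itertools import combinations
from math import comb


def exact_rank_sum_p(x, y):
    """Exact two-sided Wilcoxon rank sum p-value (assumes no tied values)."""
    pooled = sorted(list(x) + list(y))
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle ties")
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n1, n = len(x), len(pooled)
    observed = sum(rank[v] for v in x)
    center = n1 * (n + 1) / 2                 # expected rank sum under the null
    deviation = abs(observed - center)
    # Count rank assignments at least as far from the center as the observed one.
    extreme = sum(
        1
        for subset in combinations(range(1, n + 1), n1)
        if abs(sum(subset) - center) >= deviation
    )
    return extreme / comb(n, n1)


def smallest_possible_p(n1, n2):
    """Smallest two-sided p-value the exact test can produce (complete separation)."""
    return 2 / comb(n1 + n2, n1)


cancer = [83, 89, 90, 93, 98]
control = [99, 100, 103, 104, 141]
print("mucin example:", round(exact_rank_sum_p(cancer, control), 4))

for size in (3, 4, 5):
    print(size, "per group, smallest p:", round(smallest_possible_p(size, size), 4))
```

For the mucin data the two groups are completely separated, so only 2 of the 252 possible rank assignments are as extreme, giving 2/252, which rounds to the 0.0079 reported in the text. With 3 observations per group the smallest achievable two-sided p-value is 2/20 = 0.1, so the exact test cannot reach significance at the 0.05 level no matter how far apart the groups are.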