This time: ANOVA examples, course wrap-up.

ANOVA/Review example:

We have a zombie outbreak. We’ve identified the cause: the nefarious Bayes Virus. So far, we have five possible treatments, and we’re testing each one by administering it to 40 petri dishes of infected blood, in the hope that some or all of the treatments are effective in stopping someone from becoming a zombie.

The first thing we want to know: Is there a significant difference in the number of viruses per dish (called viral load in epidemiology) between the treatments?

First, do a scatterplot of the number of viruses over each of the treatments. There are a lot of points, and they overlap, which makes the scatterplot hard to read. We’ll also do a side-by-side boxplot so we can see the median and related statistics.

Potential issue: The number of viruses is much more scattered for treatment E. The box is larger; so are the whiskers. This is a problem because if one group is a lot more scattered than the others, we can’t make the assumption of equal variance. (Can’t use pooled standard deviation.)

Another potential issue is the positive skew. The upper whiskers are longer than the lower whiskers, and all the outliers are on the upper side. ANOVA assumes each group is (approximately) normal, but the normal distribution is symmetric, not skewed. If the distribution in each group is far from normal, this could also make ANOVA inaccurate. (Small and moderate breaks from normality won’t cause problems.) The skew is also apparent in the histogram of each group.

To use ANOVA, we need to make the equal variance assumption; we need to pool the standard deviation. Let’s try to find patterns related to this spread. (From Analyze > Descriptive Statistics > Explore)

Treatment   Mean     Std. Dev.   Median   IQR
A            18.85     9.71       17.50    14
B            27.15    14.19       23.50    11
C            34.70    17.91       26       29
D            62.10    27.92       57       31
E           149.82    78.19      129       77

The mean is larger than the median in every case, which is more evidence of a positive skew.
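The spread pattern in the summary table above can be checked with a few lines of Python. This is only a sketch outside the course's SPSS workflow: the numbers are copied from the table, the function names are made up for illustration, and the "largest SD no more than about twice the smallest" guideline mentioned in the comments is a common textbook rule of thumb, assumed here rather than taken from this lecture.

```python
# Summary statistics copied from the table above (raw virus counts).
means = {"A": 18.85, "B": 27.15, "C": 34.70, "D": 62.10, "E": 149.82}
sds = {"A": 9.71, "B": 14.19, "C": 17.91, "D": 27.92, "E": 78.19}


def sd_ratio(sds):
    """Ratio of the largest group SD to the smallest group SD."""
    return max(sds.values()) / min(sds.values())


def sd_over_mean(means, sds):
    """SD divided by mean for each group (the coefficient of variation)."""
    return {group: sds[group] / means[group] for group in means}


ratio = sd_ratio(sds)
cv = sd_over_mean(means, sds)

# A common rule of thumb (an assumption here, not from the lecture) is that
# pooling is reasonable when this ratio is no more than about 2.
print(f"largest SD / smallest SD = {ratio:.2f}")

# If SD / mean is roughly constant across groups, the spread grows in
# proportion to the mean, which is the pattern a log transform evens out.
for group, value in cv.items():
    print(f"treatment {group}: SD / mean = {value:.2f}")
```

On the raw scale the largest SD is several times the smallest, while SD divided by mean stays in a narrow band across all five treatments. In other words, the scatter grows along with the typical count.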
The treatments with higher mean virus counts also have higher standard deviations. The same trend holds, to a lesser degree, between the median and the IQR (interquartile range).

Why might this be? Consider: what’s a more meaningful difference? 1 virus turning into 2, or 10,000 viruses turning into 10,001?

Viruses multiply, so what really matters is by what factor they’ve multiplied, not how many have been added to the group. A difference of 1 virus means more when there are fewer of them.

We’ve run into this problem before when looking at the correlation between GDP per capita (money earned per person) and life expectancy. Some countries were 100 times richer than others. Also, in terms of health, money is a lot like the viruses. One dollar means a lot more when you’re only making $2-3 a day than if you live in a wealthy country.

The problem was scaling. We needed a lens through which to see both very small and very large amounts of money. That was the log transform.

Without a transform, one of the requirements of a correlation wasn’t met because we had a non-linear relationship. Without the log transform, we still found a correlation, but it was weaker than it should have been. We have a similar problem if we neglect to log-transform our zombie virus data.

The fairy godmother tried to transform a pumpkin into a royal coach, but her wand was broken. Bibbidi-bobbidi-beardie.

Let’s look at the log transform of the number of viruses. If you’re trying this analysis at home, just use the log10virus variable instead of virus.

First, the scatterplot. Then the side-by-side boxplot. The IQRs are much closer in size than before. Aside from treatment B, they’re all about the same.
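As a brief aside on the transform itself, the "1 to 2 versus 10,000 to 10,001" comparison can be made concrete in Python. This is a minimal sketch, not part of the course's SPSS steps, and the function names are hypothetical. On the raw scale both changes are the same size; on the log10 scale the size of a change depends only on the multiplying factor.

```python
import math


def additive_change(before, after):
    """Change measured on the raw scale: how many were added."""
    return after - before


def log10_change(before, after):
    """Change measured on the log10 scale: depends only on after / before."""
    return math.log10(after) - math.log10(before)


small = (1, 2)            # 1 virus turning into 2 (a doubling)
large = (10_000, 10_001)  # 10,000 viruses turning into 10,001

for before, after in (small, large):
    print(
        f"{before} -> {after}: "
        f"raw change = {additive_change(before, after)}, "
        f"log10 change = {log10_change(before, after):.6f}"
    )

# A doubling is the same step on the log scale wherever it starts.
print(f"10,000 -> 20,000: log10 change = {log10_change(10_000, 20_000):.6f}")
```

The doubling from 1 to 2 is a large step on the log scale, the change from 10,000 to 10,001 is nearly invisible, and a doubling from 10,000 to 20,000 is the same step as the doubling from 1 to 2.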
In the log-scale boxplot, the skew is gone too: the whiskers are about the same length, and the outliers appear on both the upper and lower ends. In other words, the data in these groups looks symmetric.

Finally, the summary stats.

Log-Transformed Data
Treatment    A      B      C      D      E
Mean         1.22   1.38   1.49   1.76   2.12
Std. Dev.    0.22   0.22   0.22   0.18   0.21
Median       1.24   1.37   1.42   1.76   2.11
IQR          0.37   0.20   0.40   0.24   0.25

The mean increases from treatment A to treatment E, but the standard deviation is very similar for all five treatments. We can use pooled standard deviation now.

Also, the mean and median are very similar in most cases now. There’s no trend of mean > median. These groups are close enough to symmetrical that we’ll assume normality.

Now we can do an ANOVA and have confidence in the results.

The p-value is very small, so we have strong evidence that there is some difference between the means. With N = 200, even small differences between the means will be detected with a small p-value. That’s from having such a large sample.

Also, 20.128 / 28.922 = 0.696, so 69.6% of the variation in the number of viruses (specifically, in log(number of viruses)) can be explained by the different treatments. This leaves 30.4% to unknown factors, like any good zombie movie.

With statistics, even a zombie epidemic feels like a stroll.

Different analysis - Same results.

Consider the following data set and question. We have a collection of 17 trucks and 13 cars, and we’ve tracked the distance they are driven per workday in km. We want to know if the amount driven per day is different between the two types of vehicles.

Response: km driven (interval)
Explanatory: type of vehicle (car or truck)

This is only two means, so we could do a two-sample t-test. There is no pairing structure, so this is an independent samples test.
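For two independent groups with no pairing, the pooled-variance t statistic can be computed by hand, and the same numbers can be run through a one-way ANOVA. The sketch below does both in plain Python to show that the two analyses agree: with two groups, F equals t squared and the error degrees of freedom match. The kilometre values are made up for illustration (they are not the course data set), and the function names are hypothetical.

```python
from statistics import mean, variance


def pooled_t(x, y):
    """Independent-samples t statistic with pooled variance, and its df."""
    n1, n2 = len(x), len(y)
    # Pooled variance: the sample variances weighted by their df.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = (sp2 * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(x) - mean(y)) / se, n1 + n2 - 2


def one_way_f(groups):
    """One-way ANOVA F statistic with between-group and within-group df."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical km-per-workday values, for illustration only.
trucks = [52.0, 61.5, 48.0, 70.2, 55.3, 64.1]
cars = [49.5, 58.0, 45.2, 66.3, 51.7]

t_stat, t_df = pooled_t(trucks, cars)
f_stat, df_between, df_within = one_way_f([trucks, cars])

print(f"t = {t_stat:.4f} on {t_df} df, so t squared = {t_stat ** 2:.4f}")
print(f"F = {f_stat:.4f} on ({df_between}, {df_within}) df")
```

Because both tests use the same pooled estimate of spread, they also lead to the same p-value.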
Looking at the scatterplot, it appears that pooled variance is reasonable.

We run the independent samples t-test.

First, the Levene test has a large p-value (greater than 0.05), so we fail to reject the hypothesis of equal variance. With equal variance, we can pool the standard deviation. That means we use the top row (equal variances assumed).

Against the null that the means of both groups are equal, we have a p-value of 0.483. At alpha = 0.05, we fail to reject this null. The means are not significantly different.

Also, the confidence interval includes zero, so a zero difference is feasible (this also means we fail to reject the null).

We could also have used the t-score of 0.711 and the degrees of freedom of 29 (17 + 14 – 2) to test in the t-table whether there was a difference between the two means (car driving distance and truck driving distance). (The two-tailed t-critical at df = 29 is 2.045. 0.711 < 2.045, so t < t-critical and we fail to reject.)

We could also have done an ANOVA, which works for 2 or more groups. Here we’re testing the null hypothesis that all the population means are the same. There are only two means, so really we’re testing if those two are the same.

Since we’re essentially testing the same thing under the same assumptions as an independent t-test with pooled variance, we get the same p-value of 0.483, and the same degrees of freedom of 29.

For interest: The F-stat for two groups is the t-stat squared.

This concludes the course material for Stat 203.

Mission Accomplished?

“My hope is at the end of the semester you are…
- Less intimidated by stats than at the beginning of the semester.
- Able to handle the most common kinds of statistical problems, and know what kinds of questions to ask of a specialist when something more complex comes up.
- 3 credits wiser.”
Lecture 1-1

Where can you go with Stat 203?

ARCH 376, POL 315, and SA 355 all require Stat 203 (or equivalent). It’s also recommended for Criminology Honours.
From my limited experience with these courses, the program JMP is used more than SPSS, but some knowledge will carry over.

If you end up using SPSS a lot in future work, there is a certificate you can get through IBM. By doing your assignments, you have 20-30% of the level 1 certification already.
http://www-03.ibm.com/certify/certs/47100101.shtml
I’m afraid that’s all I know, as I don’t have the certificate myself.

If you REALLY enjoyed this course and statistics in general, the first course for a minor in Stats is Stat 270 – Probability. However, it’s quite different from this course and is a LOT more mathematical (calculus is a required prerequisite). Unfortunately, Stat 203 doesn’t count for credit towards a minor in stats either.

For everyone else, who isn’t taking another stats course: I hope this serves you when the need to handle data does come up. A large portion of new research papers in psych/crimin/sociology or health sciences use data in some form, and often the analysis is at or near the level that we covered in this course. At the very least, it’s one more requirement out of the way.

Practice session: West Mall Centre 3260, 10am-noon Tuesday. Feel free to leave early / drop in late.
Final exam: B9201, 3:30-6:30pm Thor’s day.

West Mall is roughly across from the gym. It’s the building with Tim Hortons. B9201 is one of the lecture theatres just off the AQ main floor (same floor as the main exit to this room).

BLUE: You are here (AQ 3181)
RED: Practice session (WMC 3260)
BLACK: Final exam (B 9201)

Recommended reading for after the final (what I wish the textbook was):
Outliers by Malcolm Gladwell
Freakonomics and SuperFreakonomics by Levitt and Dubner
Predictably Irrational by Dan Ariely
The Numerati by Stephen Baker
Moneyball by Michael Lewis