Lecture 22: ANOVA
Statistics 101, Prof. Rundel, November 29, 2011

Announcements
- Homework 10 due Thursday 12/1
- OpenIntro quiz tonight 6pm - Thursday noon
- Presentations next Monday. Attend the whole session you're signed up for.

Recap

Review question: Below is the output of a regression model for predicting % voted for McCain in the 2008 presidential election from region (levels = midwest, northeast, south, west). Cases are the states. Which region has the highest average % voted for McCain?

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       25.750      3.779   6.814 1.73e-08 ***
regionnortheast  -15.417      5.773  -2.671   0.0104 *
regionsouth        8.812      4.999   1.763   0.0846 .
regionwest         1.327      5.241   0.253   0.8012

(a) midwest
(b) northeast
(c) south
(d) west

Review question: Which of the following is false about assumptions and conditions for multiple linear regression?

(a) Residuals should have a linear relationship with the response variable.
(b) Explanatory variables should have a linear relationship with the response variable.
(c) Explanatory variables should not have strong linear relationships with each other.
(d) Residuals should be normally distributed around 0.

ANOVA

To compare the means of 2 groups we use a Z or a T statistic. To compare the means of 3+ groups we use a new test called ANOVA and a new statistic called F.

Clicker question: What does ANOVA mean?

(a) Another variance analysis
(b) Analysis of variance
(c) An omnibus variance analysis
(d) Aardvarks not over vanilla ants

ANOVA is used to assess whether the mean of the outcome variable is different for different levels of a categorical variable.

H0: The mean outcome is the same across all categories,

    µ1 = µ2 = · · · = µk,

where µi represents the mean of the outcome for observations in category i.
HA: The mean of the outcome is different for some (or all) groups.

Model and assumptions

    y_{i,j} = µ_i + ε_{i,j},

where observation j belongs to group i and has error ε_{i,j}. In fitting this model we assume that the errors are
- independent,
- nearly normal, and
- of nearly constant variance.
These are more or less the same conditions as for multiple linear regression. If the assumptions are met, we perform ANOVA to evaluate whether the data provide strong evidence against the null hypothesis that all the µ_i are equal.

z/t test vs. ANOVA: Purpose

z/t test: Compare the means from two groups to see whether they are so far apart that the observed difference cannot reasonably be attributed to sampling variability.

    H0: µ1 = µ2

ANOVA: Compare the means from two or more groups to see whether they are so far apart that the observed differences cannot all reasonably be attributed to sampling variability.

    H0: µ1 = µ2 = · · · = µk

z/t test vs. ANOVA: Method

Both compute a test statistic (a ratio):

    z/t = (x̄1 − x̄2) / SE(x̄1 − x̄2)

    F = (variability between groups) / (variability within groups)

With only two groups, the t test and ANOVA are equivalent, but only if we use a pooled standard deviation in the denominator of the test statistic (which we skipped over, since it is rarely used). With more than two groups, ANOVA compares the sample means to an overall grand mean.

Large test statistics lead to small p-values. If the p-value is small enough, H0 is rejected and we conclude that the population means are not all equal.
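The F ratio just defined can be sketched in a few lines of code. This is an illustrative sketch, not lecture code: the function name `anova_f` and the toy data are made up, and only the Python standard library is assumed.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic: F = MSG / MSE
    (variability between groups over variability within groups)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # Between-group sum of squares (df = k - 1)
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares (df = n - k)
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)

    msg = ssg / (k - 1)   # mean square between groups
    mse = sse / (n - k)   # mean square error
    return msg / mse

# Hypothetical toy data: group means far apart relative to spread -> large F
print(anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))   # 27.0
```

With only two groups this ratio equals the square of the pooled two-sample t statistic, which is the equivalence noted above.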
Aldrin in the Wolf River

The Wolf River in Tennessee flows past an abandoned site once used by the pesticide industry for dumping wastes, including chlordane (a pesticide), aldrin, and dieldrin (both insecticides). These highly toxic organic compounds can cause various cancers and birth defects.

The standard method to test whether these substances are present in a river is to take samples at six-tenths depth. But since these compounds are denser than water and their molecules tend to stick to particles of sediment, they are more likely to be found in higher concentrations near the bottom than near mid-depth.

Data

Aldrin concentration (nanograms per liter) at three levels of depth:

     aldrin  depth       aldrin  depth        aldrin  depth
 1     3.80  bottom        3.20  middepth       3.10  surface
 2     4.80  bottom        3.80  middepth       3.60  surface
 3     4.90  bottom        4.30  middepth       3.70  surface
 4     5.30  bottom        4.80  middepth       3.80  surface
 5     5.40  bottom        4.90  middepth       4.30  surface
 6     5.70  bottom        5.20  middepth       4.40  surface
 7     6.30  bottom        5.20  middepth       4.40  surface
 8     7.30  bottom        6.20  middepth       4.40  surface
 9     8.10  bottom        6.30  middepth       5.10  surface
10     8.80  bottom        6.60  middepth       5.20  surface

Exploratory analysis

(Figure: side-by-side box plots of aldrin concentration, nanograms per liter, for the bottom, mid-depth, and surface groups.)

            mean    SD
Bottom      6.04  1.58
Mid-depth   5.05  1.10
Surface     4.20  0.66

Hypotheses

Clicker question: What are the correct hypotheses for testing for a difference between the mean aldrin concentrations among the three levels?

(a) H0: µB = µM = µS; HA: µB ≠ µM ≠ µS
(b) H0: µB ≠ µM ≠ µS; HA: µB = µM = µS
(c) H0: µB = µM = µS; HA: At least one pair of means are different.
(d) H0: µB = µM = µS = 0; HA: At least one pair of means are different.
(e) H0: µB = µM = µS; HA: µB > µM > µS

ANOVA and the F test

Test statistic

    F = (variability between groups) / (variability within groups) = MSG / MSE

MSG is the mean square between groups, with dfG = k − 1, where k is the number of groups. MSE is the mean square error, the variability in the residuals, with dfE = n − k, where n is the number of observations.

Does there appear to be a lot of variability within groups? How about between groups?

ANOVA output for the aldrin data:

              Df  Sum Sq  Mean Sq  F value  Pr(>F)
    depth      2   16.96     8.48     6.13  0.0064
    Residuals 27   37.33     1.38

Let's see what this table shows:

    dfG = 3 − 1 = 2
    dfE = n − k = 30 − 3 = 27
    MSG = 16.96 / 2 = 8.48
    MSE = 37.33 / 27 = 1.38
    F = 8.48 / 1.38 = 6.13
    p-value = P(F > 6.13) ≈ 0.0064

p-value

(Figure: F distribution with dfG = 2 and dfE = 27; the p-value is the upper tail area beyond the observed F = 6.13.)

In order to be able to reject H0, we need a small p-value, which requires a large F statistic. In order to obtain a large F statistic, the variability between sample means needs to be greater than the variability within sample means.

Conclusion - in context

Clicker question: What is the conclusion of the hypothesis test?
The data provide convincing evidence that the average aldrin concentration
(a) is different for all groups.
(b) on the surface is lower than at the other levels.
(c) is different for at least two of the groups.
(d) is the same for all groups.

Conclusion

If the p-value is small (less than α), reject H0. The data provide convincing evidence that at least one pair of means are different from each other (but we can't tell which ones).

If the p-value is large, fail to reject H0. The data do not provide convincing evidence that at least one pair of means are different from each other; the observed differences in sample means are attributable to sampling variability (or chance).

Graphical diagnostics for ANOVA

(1) Independent residuals

We will assume independence of observations, and hence of residuals. If we believe the order in which the data were collected might introduce dependence (as with time series data), we should verify this assumption by plotting the residuals vs. the order of data collection. In this case the data have already been sorted, so we have no way of telling in what order they were collected.

(2) Nearly normal residuals

(Figures: normal probability plot and histogram of the residuals.) Does this assumption appear to be satisfied?
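The residuals used in these diagnostics are just each observation minus its group mean. A minimal numeric sketch (not from the slides), assuming the aldrin data as tabulated earlier and only the Python standard library:

```python
from statistics import mean, stdev

# Aldrin concentrations (ng/L) from the Wolf River table earlier
groups = {
    "bottom":   [3.8, 4.8, 4.9, 5.3, 5.4, 5.7, 6.3, 7.3, 8.1, 8.8],
    "middepth": [3.2, 3.8, 4.3, 4.8, 4.9, 5.2, 5.2, 6.2, 6.3, 6.6],
    "surface":  [3.1, 3.6, 3.7, 3.8, 4.3, 4.4, 4.4, 4.4, 5.1, 5.2],
}

# Residual = observation minus its group mean
residuals = [x - mean(g) for g in groups.values() for x in g]

# Residuals are centered at 0 by construction, so the normal probability
# plot and histogram are checked for shape, not location
print(abs(mean(residuals)) < 1e-9)                     # True

# The mean square of the residuals reproduces the MSE from the ANOVA
# output: SSE / (n - k) = 37.33 / 27 ~ 1.38
sse = sum(r ** 2 for r in residuals)
print(round(sse / (30 - 3), 2))                        # 1.38

# Group standard deviations, relevant to the constant variance check
print([round(stdev(g), 2) for g in groups.values()])   # [1.58, 1.1, 0.66]
```

The group standard deviations match the exploratory summary table (1.58, 1.10, 0.66), which is the quantity the side-by-side box plots display visually.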
(3) Constant variance

(Figure: side-by-side box plots of the outcome for the surface, bottom, and mid-depth groups.) Does this assumption appear to be satisfied?

Assumptions - recap

- Independence is always important, though sometimes difficult to check.
- The normality condition is important when the sample sizes for the groups are relatively small.
- The constant variance condition is especially important when the sample sizes differ between groups.

Multiple comparisons & Type 1 error rate

Which means differ?

Earlier we concluded that at least one pair of means differ. The natural question that follows is "which ones?" We can do two-sample t tests for the difference in each possible pair of groups. Can you see any pitfalls with this approach? When we run too many tests, the Type 1 error rate increases. This issue is resolved by using a modified significance level.

Multiple comparisons

The scenario of testing many pairs of groups is called multiple comparisons. The Bonferroni correction suggests that a more stringent significance level is more appropriate for these tests:

    α* = α / K,

where K is the number of comparisons being considered. If there are k groups, then usually all possible pairs are compared, so K = k(k − 1)/2.

Determining the modified α

Clicker question: In the aldrin data set depth has 3 levels: bottom, mid-depth, and surface.
If α = 0.05, what should be the modified significance level for two-sample t tests for determining which pairs of groups have significantly different means?

(a) α* = 0.05
(b) α* = 0.05/2 = 0.025
(c) α* = 0.05/3 = 0.0167
(d) α* = 0.05/6 = 0.0083

Which means differ?

Clicker question: Based on the p-values given below, which pairs of means are significantly different?

    t test between bottom & surface:    p-value = 0.005244
    t test between bottom & mid-depth:  p-value = 0.1236
    t test between mid-depth & surface: p-value = 0.05441

(a) bottom & surface
(b) bottom & mid-depth
(c) mid-depth & surface
(d) bottom & mid-depth; mid-depth & surface
(e) bottom & mid-depth; bottom & surface; mid-depth & surface

Using ANOVA for multiple regression

Revisit - Kid's scores

Predicting cognitive test scores of three- and four-year-old children using characteristics of their mothers. Data are from a survey of adult American women and their children.

      kid score  mom hs  mom iq  mom work  mom age
  1        65     yes    121.12    yes       27
  ...
  5       115     yes     92.75    yes       27
  6        98     no     107.90    no        18
  ...
434        70     yes     91.25    yes       25

Testing the model as a whole

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.59241    9.21906   2.125   0.0341 *
mom_hsyes    5.09482    2.31450   2.201   0.0282 *
mom_iq       0.56147    0.06064   9.259   <2e-16 ***
mom_workyes  2.53718    2.35067   1.079   0.2810
mom_age      0.21802    0.33074   0.659   0.5101

Residual standard error: 18.14 on 429 degrees of freedom
Multiple R-squared: 0.2171, Adjusted R-squared: 0.2098
F-statistic: 29.74 on 4 and 429 DF, p-value: < 2.2e-16

The F statistic and its p-value are used to test the significance of the model as a whole.

H0: The model as a whole is not significant.
HA: The model as a whole is significant.

What is the conclusion of the test?
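The Bonferroni logic from the multiple comparisons slides can be worked through numerically. This is a minimal sketch, not lecture code; the dictionary below simply records the pairwise p-values reported above.

```python
alpha = 0.05
k = 3                     # depth levels: bottom, mid-depth, surface
K = k * (k - 1) // 2      # number of pairwise comparisons: K = k(k-1)/2 = 3
alpha_star = alpha / K    # Bonferroni-corrected significance level

# Pairwise two-sample t test p-values reported in the slides
p_values = {
    ("bottom", "surface"):    0.005244,
    ("bottom", "mid-depth"):  0.1236,
    ("mid-depth", "surface"): 0.05441,
}

print(round(alpha_star, 4))   # 0.0167
for pair, p in p_values.items():
    # Reject H0 for a pair only when p < alpha*, not plain alpha
    verdict = "significantly different" if p < alpha_star else "not significant"
    print(pair, verdict)
```

Only bottom & surface clears the corrected threshold (0.005244 < 0.0167); note that mid-depth & surface, which would pass at α = 0.05 without correction, does not.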