Stat 401 A – HW 10 answers

1) Anscombe data sets.

a) 1 pt. All four data sets have the same estimated regression coefficients and se’s.

b) 1 pt. All four data sets have the same R2 values.

c) 2 pts. All four data sets have the same answer. Either all are good descriptions or all are bad descriptions (your choice).
Note: Whether any one of them is good or bad depends on the context for the analysis. You might have a desired se for the slope, a desired p-value for the slope, or a desired R2 value. Whether the analysis is good or bad then depends on whether the results are sufficiently precise, have a sufficiently small p-value, or have a sufficiently large R2. The key point is that the same decision would be made for all four.

d) 2 pts. (plots not included) The regression line is a good description only for set 1. The other three sets have issues:
set 2: quadratic, not linear
set 3: regression line clearly affected by one point
set 4: regression line completely dependent on one point
Note: Most of the points lost on this question came from misreading it. Many people still said the regression on set 3 was fine because it included only 1 outlier, but the regression line is affected by that point.
Note: These four data sets were published in a classic (1973) paper by Anscombe. It was written when statistical software was just beginning to be widely available. Because it was now easy to get numbers (and hard to draw plots with 1970’s technology), the tendency was to focus on the numbers. Anscombe’s paper demonstrated the folly of that practice. This is why I emphasize looking at data or derived quantities (e.g. residuals). The citation to Anscombe’s paper is on the references page of the class web site.

2) pace of life data.

a) 2 pts. (scatterplot matrix of bank, walk, talk, and heart; plot not included)

b) Not graded, because no answer asked for.

c) 2 pts. I don’t see any pattern. Certainly, no pattern worth worrying about.
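For question 1, the claims in parts a and b can be checked directly. The following is a minimal Python sketch, assuming the standard published values of Anscombe's four data sets (typed in here, not read from the class files). The function name simple_regression and the variable names are illustrative only and are not part of the assignment.

```python
from math import sqrt

# Anscombe (1973) quartet as commonly published; sets 1-3 share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def simple_regression(x, y):
    """Least-squares fit of y on x; returns (intercept, slope, se of slope, R2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    syy = sum((yi - ybar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = syy - slope * sxy              # residual sum of squares
    se_slope = sqrt(rss / (n - 2) / sxx)  # residual variance uses n - 2 df
    r2 = 1 - rss / syy
    return intercept, slope, se_slope, r2


RESULTS = {k: simple_regression(x, y) for k, (x, y) in ANSCOMBE.items()}

for k, (b0, b1, se1, r2) in RESULTS.items():
    print(f"set {k}: intercept={b0:.2f} slope={b1:.2f} se(slope)={se1:.2f} R2={r2:.2f}")
```

To two decimal places the four printed lines should agree (intercept about 3.00, slope about 0.50, se of the slope about 0.12, R2 about 0.67), which is why the numbers alone cannot separate the four data sets and the plots in part d are needed.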
(plot of residuals against predicted values not included)

d) 2 pts.
Heart = 3.18 + 0.405 bank + 0.452 walk - 0.179 talk
       (6.33) (0.197)      (0.200)      (0.222)

e) 2 pts. Cities with an additional 1 unit of walking speed but the same bank rate and talk rate have, on average, an additional 0.45 units of death rate from heart disease.
Note: A causal explanation is ok (because the non-causal statement above is somewhat clumsy). The keys are a statement about amounts and the clause about same bank rate and talk rate.

f) 2 pts. This is the overall (3 df) F test. F = 3.07
Note: This is part of the default output for SAS and JMP. It is the last line of output from summary() in R. It can also be obtained by fitting the intercept-only model, e.g., lm(heart ~ +1, data=pace), and then comparing that model to the full model by passing both models to anova().
Note: Some students reported the individual tests that each parameter equals zero instead; the wording of the question caused some confusion.

g) 2 pts. walk: p = 0.65, jog: p = 0.57

h) 2 pts. Walk and jog are moderately correlated (e.g., plot them or look at a scatterplot matrix of all 5 variables). Hence, adding jog to a model that has walk doesn’t contribute much. Similarly, adding walk to a model that has jog doesn’t contribute much. Hence the large p-values for both terms.
Note: If you tested the 2 df hypothesis that both walk slope = 0 and jog slope = 0, you would find that the p-value is much smaller.
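The model comparison described in the note to 2f and the 2 df test mentioned in the note to 2h both use the usual F statistic for comparing nested regression models. As a sketch, in notation that is not in the original answers (RSS for a residual sum of squares, n for the number of cities, p for the number of slopes in the full model, q for the number of slopes set to zero in the reduced model):

```latex
F \;=\; \frac{\left(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}}\right)/q}
             {\mathrm{RSS}_{\text{full}}/(n - p - 1)},
\qquad F \sim F_{q,\; n-p-1} \text{ under the null hypothesis.}
```

For 2f the reduced model is the intercept-only model, so q = p = 3. For the note in 2h the reduced model drops both walk and jog, so q = 2.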