Homework #10 answers

Stat 401 A – HW 10 answers
1) Anscombe data sets.
a) 1 pt. All four data sets have the same estimated regression coefficients and se’s.
b) 1 pt. All four data sets have the same R² values.
c) 2 pts. All four data sets have the same answer. Either all are good descriptions or all are bad
descriptions (your choice).
Note: Whether any one of them is good or bad depends on the context for the analysis. You might have
a desired se for the slope, or a desired p-value for the slope, or a desired R² value. Whether the analysis
is good or bad then depends on whether the results are sufficiently precise, have a sufficiently small
p-value, or have a sufficiently large R². The key point is that the same decision would be made for all four data sets.
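Note: The agreement in parts a) to c) can be checked numerically. The sketch below is not part of the assigned solution. It assumes the data values as they are commonly reproduced from Anscombe's paper (for example in R's built-in anscombe data set), and the function name fit_line is my own. It fits a straight line to each set by least squares and reports the intercept, slope, se of the slope, and R².

```python
from math import sqrt

# Anscombe's quartet; sets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Least-squares line: returns (intercept, slope, se of slope, R^2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = syy - slope * sxy              # error sum of squares
    se_slope = sqrt(sse / (n - 2) / sxx)
    r2 = 1 - sse / syy
    return intercept, slope, se_slope, r2


results = {k: fit_line(x, y) for k, (x, y) in quartet.items()}
for k, (b0, b1, se, r2) in results.items():
    print(f"set {k}: intercept={b0:.2f} slope={b1:.3f} se(slope)={se:.3f} R2={r2:.2f}")
```

To the precision printed, the four lines of output should be identical, which is the point of parts a) to c): the numbers alone cannot tell the four sets apart.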
d) 2 pts. (plots not included) The regression line is a good description only for set 1. The other three
sets have issues:
set 2: quadratic, not linear
set 3: regression line clearly affected by one point
set 4: regression line completely dependent on one point
Note: Most points lost on this question came from misreading it. Many people said the regression on
set 3 was fine because the set includes only one outlier, but the regression line is clearly affected by
that point.
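Note: One way to see how much a single point matters is to refit without it. The sketch below is an illustration, not part of the required answer. It assumes the usual published values for sets 3 and 4, and the helper names sxx and slope are my own. It drops the outlying point from set 3 and the single x = 19 point from set 4.

```python
def sxx(x):
    """Sum of squared deviations of x about its mean."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x)


def slope(x, y):
    """Least-squares slope of y on x."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx(x)


# Set 3: the point at index 2, (13, 12.74), is the outlier.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
keep = [i for i in range(len(x3)) if i != 2]
slope_all = slope(x3, y3)
slope_without = slope([x3[i] for i in keep], [y3[i] for i in keep])
print(f"set 3 slope: all points {slope_all:.3f}, outlier removed {slope_without:.3f}")

# Set 4: every x is 8 except one point at x = 19.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
x4_without = [v for v in x4 if v != 19]
sxx4_without = sxx(x4_without)
print(f"set 4 Sxx without the x = 19 point: {sxx4_without}")
```

For set 3 the slope changes noticeably once the outlier is removed. For set 4, removing the one point leaves no variation in x at all (Sxx is zero), so no slope can be estimated. That is what "completely dependent on one point" means.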
Note: These four data sets were published in a classic (1973) paper by Anscombe. This was written
when statistical software was just beginning to be widely available. Because it was now easy to get
numbers (and hard to draw plots with 1970s technology), the tendency was to focus on the numbers.
Anscombe’s paper demonstrated the folly of that practice. This is why I emphasize looking at data or
derived quantities (e.g. residuals). The citation to Anscombe’s paper is on the references page of the
class web site.
2) Pace of life data.
a) 2 pts. (scatterplot matrix of bank, walk, talk, and heart; plot not included)
b) Not graded, because no answer asked for.
c) 2 pts. (plot of residuals against predicted values not included) I don’t see any pattern. Certainly,
no pattern worth worrying about.
d) 2 pts. Heart = 3.18 + 0.405 bank + 0.452 walk – 0.179 talk
Standard errors: intercept 6.33, bank 0.197, walk 0.200, talk 0.222.
e) 2 pts. Cities with 1 additional unit of walking speed but the same bank rate and talk rate have, on
average, a heart disease death rate that is 0.45 units higher.
Note: a causal explanation is ok (because the non-causal statement above is somewhat clumsy). The
keys are a statement about amounts and the clause about same bank rate and talk rate.
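Note: The "same bank rate and talk rate" clause can be made concrete by plugging two cities into the fitted equation from part d). In the sketch below the coefficients come from part d). The two cities and their bank, walk, and talk values are made up for illustration, and the function name predicted_heart is my own.

```python
# Estimated coefficients from part d).
COEF = {"intercept": 3.18, "bank": 0.405, "walk": 0.452, "talk": -0.179}


def predicted_heart(bank, walk, talk):
    """Predicted heart disease death rate from the fitted equation."""
    return (COEF["intercept"] + COEF["bank"] * bank
            + COEF["walk"] * walk + COEF["talk"] * talk)


# Two hypothetical cities: same bank and talk values, walk differs by 1 unit.
city_a = predicted_heart(bank=20, walk=22, talk=21)
city_b = predicted_heart(bank=20, walk=23, talk=21)
diff = city_b - city_a
print(f"difference in predicted heart rate: {diff:.3f}")
```

The difference equals the walk coefficient whatever common values are chosen for bank and talk, because those terms cancel. If the two cities also differed in bank or talk, the difference would no longer be 0.45.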
f) 2 pts. This is the overall (3 df) F test. F = 3.07
Note: This is part of the default output for SAS and JMP. It is the last line of output from summary() in R.
It can also be obtained by fitting the intercept-only model, e.g., by lm(heart ~ +1, data=pace), and then
comparing that model to the full model by specifying the two models to anova().
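Note: The model comparison described above boils down to one formula: F = [(SSE reduced – SSE full) / (df reduced – df full)] / (SSE full / df full). The sketch below shows the arithmetic on a tiny made-up data set with a single predictor, so its F test has 1 numerator df rather than the 3 df in this question. The data and the function name fit_sse are assumptions for illustration only. They are not the pace of life data.

```python
def fit_sse(x, y):
    """Error sums of squares for the intercept-only model and the straight-line model."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sse_reduced = sum((yi - ybar) ** 2 for yi in y)  # intercept-only model
    sse_full = sse_reduced - (sxy / sxx) * sxy       # model with the predictor
    return sse_reduced, sse_full


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(y)

sse_reduced, sse_full = fit_sse(x, y)
df_reduced = n - 1   # intercept only: one parameter
df_full = n - 2      # intercept and slope: two parameters

f_stat = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
print(f"F = {f_stat:.2f} on {df_reduced - df_full} and {df_full} df")
```

With three predictors in the full model, the same formula applies with df full = n – 4 and a numerator df of 3.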
Note: Some students included the individual tests for the parameter equal to zero; there was some
confusion in the wording.
g) 2 pts. walk: p = 0.65, jog: p = 0.57
h) 2 pts. Walk and jog are moderately correlated (e.g., plot them or look at a scatterplot matrix of all 5
variables). Hence, adding jog to a model that has walk doesn’t contribute much. Similarly, adding walk
to a model that has jog doesn’t contribute much. Hence the large p-values for both terms.
Note: If you tested the 2 df hypothesis that both walk slope = 0 and jog slope = 0, you would find that
the p-value is much smaller.
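Note: The contrast between the two 1 df tests and the 2 df test can be reproduced on a small constructed example. Everything in the sketch below is an assumption for illustration: the eight "walk", "jog", and "heart" values are made up (they are not the pace of life data), the model has only these two predictors, and the least-squares fit uses a hand-written normal-equations solver rather than a statistics package. The two predictors have a correlation of about 0.9.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def sse(predictors, y):
    """Error sum of squares from regressing y on an intercept plus the predictors."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def f_test(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for comparing a reduced model to a full model."""
    return ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)


# Made-up data: walk and jog are strongly correlated.
walk = [7.5, 6.5, 9.5, 8.5, 11.5, 10.5, 13.5, 12.5]
jog = [6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5, 13.5]
heart = [16, 16, 16, 16, 20, 20, 28, 28]
n = len(heart)

sse_full = sse([walk, jog], heart)                            # n - 3 error df
f_walk = f_test(sse([jog], heart), n - 2, sse_full, n - 3)    # add walk to jog
f_jog = f_test(sse([walk], heart), n - 2, sse_full, n - 3)    # add jog to walk
f_both = f_test(sse([], heart), n - 1, sse_full, n - 3)       # both slopes = 0
print(f"F(walk | jog) = {f_walk:.2f}, F(jog | walk) = {f_jog:.2f}, F(both) = {f_both:.2f}")
```

In this example each 1 df F statistic is close to 1, while the 2 df F statistic is about 12.5. Either variable can stand in for the other, so neither adds much once the other is in the model, yet the pair together clearly matters.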