Solutions 5: Regression

Sxx, Syy represent corrected sums of squares for x and y, and Sxy is the corrected sum of products. S² represents the residual mean square, i.e., residual sum of squares divided by residual degrees of freedom. For simple linear regression, the residual sum of squares is Syy − Sxy²/Sxx with n − 2 degrees of freedom.
The following method of calculating sums of squares and products is usually faster and
more accurate than using the direct formulas.
Sxx = Σx² − (Σx)²/n,    Syy = Σy² − (Σy)²/n,    Sxy = Σxy − (Σx)(Σy)/n
1. Summary statistics:
n = 5,  Σx = 275,  Σy = 705,  Σx² = 16125,  Σy² = 101341,  Σxy = 40155
Sxx = 1000,  Syy = 1936,  Sxy = 1380
For this example, unusually, there is no loss of accuracy in using the direct formulas Sxx = Σ(x − x̄)², etc.
Regression slope b̂ = 1380/1000 = 1.38, regression SumSq 1380²/1000 = 1904.4 with 1 d.f. Residual SumSq 1936 − 1904.4 = 31.6 with 3 d.f. ANOVA table is
Source        Df   Sum Sq   Mean Sq       F
Regression     1   1904.4    1904.4   180.9
Residual       3     31.6     10.53
Total          4   1936
The slope estimate b̂ has standard error √(10.53/1000) = 0.1026 and t = 13.45 with
3 d.f. (P < 0.001). Alternatively, use the F statistic. There is strong evidence that
blood pressure is increasing with age, and the best estimate of the increase is about
13.8 mm (±1.0 mm) every 10 years.
The confidence interval for the slope is 1.38 ± 3.18 × 0.1026, or (1.05, 1.71).
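These hand calculations can be reproduced in R directly from the summary totals above (a short sketch; no raw data are needed):
## Question 1: reproduce the hand calculation from the summary totals
n <- 5; sum_x <- 275; sum_y <- 705
sum_x2 <- 16125; sum_y2 <- 101341; sum_xy <- 40155
Sxx <- sum_x2 - sum_x^2 / n            # 1000
Syy <- sum_y2 - sum_y^2 / n            # 1936
Sxy <- sum_xy - sum_x * sum_y / n      # 1380
b     <- Sxy / Sxx                     # slope 1.38
regSS <- Sxy^2 / Sxx                   # regression SS 1904.4
resSS <- Syy - regSS                   # residual SS 31.6
s2    <- resSS / (n - 2)               # residual mean square 10.53
se    <- sqrt(s2 / Sxx)                # standard error 0.1026
b / se                                 # t = 13.45 on 3 d.f.
b + c(-1, 1) * qt(0.975, n - 2) * se   # 95% CI (1.05, 1.71)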
2. From the summary statistics we calculate x̄ = 4.32, ȳ = 9.73, Sxx = 13.016, Syy = 3.381, Sxy = 5.444.
Slope b̂ = 5.444/13.016 = 0.418, intercept â = 9.73 − 0.418 × 4.32 = 7.92. Regression
equation is y = 7.92 + 0.418 x.
Regression SumSq 5.444²/13.016 = 2.277, residual SumSq 3.381 − 2.277 = 1.104, S² = 1.104/8 = 0.138. Hence b̂ has standard error √(0.138/13.016) = 0.103 and
t = 4.06 with 8 d.f. (P < 0.01). The relationship between purity and yield is significant
at the 1% level.
3. Following on from the plotting code given as part of this question, calculate both
regressions and draw the regression line (D2 on D1 ):
fit1 <- lm(D2 ~ D1, weights = frq, data = shells)
fit2 <- lm(D1 ~ D2, weights = frq, data = shells)
abline(fit1, lty = "dashed")
We have to remember that the roles of the x and y axes are interchanged when plotting
the second line on the existing graph:
b0 <- coef(fit2)[1]; b1 <- coef(fit2)[2]
abline(a = -b0/b1, b = 1/b1, lty = "dotted")
For the second regression, b0 and b1 are the intercept and slope relative to the (y, x) axes, while −b0/b1 and 1/b1 are the intercept and slope relative to (x, y).
The two lines are different because they solve different problems: predicting D1 from
the value of D2 , and predicting D2 from the value of D1 . With a high correlation, as
here, the two lines are close together (with small angular separation). With zero correlation, the two lines are at right angles. Low to moderate correlations give something
between these two extremes.
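A small simulated illustration (not the shells data) of the same axis-interchange trick: with a high correlation the two lines almost coincide, and with a weaker correlation they separate.
## Simulated example: regression of y on x (dashed) and of x on y (dotted)
set.seed(1)
x <- rnorm(100)
y <- 0.9 * x + rnorm(100, sd = 0.3)    # strongly correlated
plot(x, y, las = 1)
abline(lm(y ~ x), lty = "dashed")
cf <- coef(lm(x ~ y))                  # intercept and slope on the (y, x) axes
abline(a = -cf[1] / cf[2], b = 1 / cf[2], lty = "dotted")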
4. After fitting a regression, it is good practice to check the assumptions of the model. The
first and most useful diagnostic plot is the ‘residuals v fitted values’ plot. This looks
satisfactory. There is no obvious relationship between size of residual and fitted value.
The three largest residuals are labelled but these are not necessarily outliers. Humans
and rhesus monkeys have a brain size larger than expected from the relationship with
body weight, while the brain of the water opossum is smaller than expected.
The Q-Q (quantile-quantile) plot is a check on the distribution of the residuals. Any
non-linearity in this plot is evidence of some kind of deviation from normality (e.g.,
skewness or kurtosis). Here everything is OK.
(See ?plot.lm for other types of diagnostic plot.)
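The normal Q-Q check can also be built directly from the residuals (a sketch assuming a fitted lm object called fit, as in the question; this is essentially what plot(fit, which = 2) draws):
## Normal Q-Q plot of the standardised residuals by hand
r <- rstandard(fit)      # standardised residuals, as used by plot.lm
qqnorm(r, las = 1)
qqline(r, lty = "dotted")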
5. Sxx = 42, Sxy = 65, Syy = 135.5, b̂ = 1.5476, ȳ = 18.75, x̄ = 3.5, â = 13.333,
regression equation is y = 13.333 + 1.5476 x.
Source       d.f.   Sum of Squares   Mean Square   F-ratio
Regression      1          100.595       100.595     17.29
Residual        6           34.905         5.818
Total           7          135.500
Standard error of b̂ is √(S²/Sxx) = 0.3722, and t = 1.5476/0.3722 = 4.16, with 6 d.f.
From F tables (1 and 6 d.f.), or t tables (6 d.f.), P < 0.01. There is strong evidence
that yield increases with the amount of irrigation (about 1.5 tonnes/hectare for every
additional cm of irrigation).
Estimated variance of â + b̂x is S²[1/n + (x − x̄)²/Sxx]. Setting x = 0 gives intercept 13.33 ± 1.56. Setting x = 6 gives prediction 22.62 ± 1.26, and setting x = 8 gives prediction 25.72 ± 1.88.
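These standard errors can be verified numerically from the formula above (a sketch using the summary quantities already calculated):
## Question 5: prediction standard errors from S^2 * (1/n + (x - xbar)^2 / Sxx)
s2 <- 5.818; n <- 8; xbar <- 3.5; Sxx <- 42
x  <- c(0, 6, 8)
pred <- 13.333 + 1.5476 * x                       # 13.33, 22.62, 25.71
se   <- sqrt(s2 * (1 / n + (x - xbar)^2 / Sxx))   # 1.56, 1.26, 1.88
cbind(x, pred, se)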
Assumptions: deviations from regression are normally distributed with constant variance. For prediction at x = 8 we also assume that the regression line remains appropriate this far beyond the range of the data. In R,
water <- 0:7
yield <- c(12, 18, 13, 19, 22, 20, 21, 25)
lmfit <- lm(yield ~ water)
summary(lmfit)
anova(lmfit)
newdata <- data.frame(water = c(0,6,8))
predict(lmfit, newdata = newdata, se.fit = TRUE)
The mean yields come from question 2, week 4. So we have analysed these data in two stages, first as an ANOVA and then as a regression. The total sum of squares in the regression analysis is (apart from a factor of 3) the treatment sum of squares in the ANOVA of week 4. Note that the regression mean square has been tested against the ‘deviation from regression’ mean square with 6 d.f. rather than the ‘between plot within treatment’ mean square from week 4.
6. For all three data sets, approximately, Sxx = 110, Sxy = 55.0, Syy = 41.2, b̂ = 0.5,
ȳ = 7.5 and the fitted line is y = 3 + 0.5 x.
However, the same ANOVA and regression estimates can arise from very different data
patterns. Check by plotting the original data, or residuals against fitted values. The
regression is straightforward for y1 . For y2 , there is a quadratic relationship, i.e. the
points tend to lie on a parabolic curve rather than a straight line. In y3 a strong
regression is hidden by an ‘outlier’.
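A sketch of such a check, assuming (these names are not given in the question) that the three responses are stored as columns y1, y2, y3 of a data frame dat with a common predictor x:
## Residuals v fitted values for each of the three regressions
par(mfrow = c(1, 3))
for (v in c("y1", "y2", "y3")) {
  fit <- lm(dat[[v]] ~ dat$x)
  plot(fitted(fit), resid(fit), main = v,
       xlab = "Fitted values", ylab = "Residuals", las = 1)
  abline(h = 0, lty = "dotted")
}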
7. The data plot with fitted line suggests that the data for location 12 might be an outlier.
The diagnostic plot obtained with
plot(fit, which = 1, add.smooth = FALSE, id.n = 0)
shows the discrepancy even more clearly. In the original paper, this point was marked
with the comment ‘evidently a mistake’. Subsequently various statisticians have used
these data as a testbed for methods of outlier detection. To the best of my knowledge,
all have decided that the point is an outlier.
With very large sets of data, a careful approach to detecting outliers is often abandoned
in favour of filtering the data by an arbitrary rule, such as ‘reject all residuals greater
than 3 times the residual standard deviation’. This inevitably rejects some genuine
observations.
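For illustration, the arbitrary rule quoted above might be coded as follows (a sketch assuming a fitted lm object called fit):
## Crude rule: flag residuals larger than 3 residual standard deviations
r <- resid(fit)
s <- summary(fit)$sigma     # residual standard deviation
which(abs(r) > 3 * s)       # observations the rule would reject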
8. The code below gives t = 2.018 with 9 d.f. to test for zero regression of sister’s height
on brother’s height (P = 0.07, not significant at the 5% level). ‘Multiple R-squared’
is 0.3114, and the correlation estimate is the square root 0.558 (with the sign of the
regression slope, here positive). Regressing brother’s height on sister’s height produces
the same answer. This small data set does not firmly establish a relationship between
brother’s and sister’s height.
brother <- c(71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66)
sister <- c(69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62)
fit <- lm(sister ~ brother)
summary(fit)
# alternatively,
cor.test(brother,sister)
By hand calculation, with x = brother’s height, y = sister’s height, n = 11, Sxx = 74, Syy = 66, Sxy = 39. Calculate the correlation as r = Sxy/√(Sxx Syy) = 0.558. To check significance, calculate t = r√(n − 2)/√(1 − r²) with 9 d.f. This value of t is identical to that obtained by either of the two possible regressions.
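These hand calculations are easily confirmed in R:
## Question 8: correlation and t statistic from the corrected sums
Sxx <- 74; Syy <- 66; Sxy <- 39; n <- 11
r <- Sxy / sqrt(Sxx * Syy)         # 0.558
r * sqrt(n - 2) / sqrt(1 - r^2)    # t = 2.018 on 9 d.f.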
An alternative approach, which gives a slightly different answer, is to analyse the data
as a two-way anova (compare problem 6 in week 4).
heights <- c(brother, sister)
sex <- gl(2, 11, labels = c("bro", "sis"))
family <- gl(11, 1, 22)
fit <- aov(heights ~ sex + family)
summary(fit)
            Df Sum Sq Mean Sq F value   Pr(>F)
sex          1  137.5   137.5  44.355 5.64e-05 ***
family      10  109.0    10.9   3.516   0.0299 *
Residuals   10   31.0     3.1
If we regard family as a random effect, the residual and family mean squares have expectations σW² and σW² + 2σB², where σB² and σW² are between and within family
components of variance. An estimate of the intra-class correlation coefficient is
(MB − MW )/(MB + MW ) = (10.9 − 3.1)/(10.9 + 3.1) = 0.557
which is very close to the ordinary (inter-class) correlation 0.558. The small difference arises because the inter-class version is calculated on the assumption that the brother measurements might have a different variance from the sister measurements, whereas the intra-class correlation assumes the same variance for brothers and sisters.
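The intra-class estimate can also be extracted directly from the aov fit above (a sketch using the fit object just created):
## Intra-class correlation from the family and residual mean squares
ms <- summary(fit)[[1]][["Mean Sq"]]   # rows: sex, family, Residuals
MB <- ms[2]; MW <- ms[3]
(MB - MW) / (MB + MW)                  # approximately 0.557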
9. (a) There is an obvious relationship between winning time and distance, approximately
linear.
hills <- read.csv("hills.csv")
fit <- lm(Time ~ Distance, data = hills)
with(hills, {
  plot(Distance, fitted(fit), type = "n", las = 1)
  points(Distance, Time)
})
abline(fit, lty = "dashed")
(If we plot in the normal way, parts of the fitted line lie outside the plotting region. This way, the first plot call sets the scale on the y axis taking account of the fitted values but draws nothing; ‘points’ then adds the data points to the graph.)
plot(fit, which = 1:2, add.smooth = FALSE)
(b) The residual versus fitted plot is satisfactory for the less difficult races, but the
three largest residuals occur among the four most difficult races (as judged by the
fitted values). The same three races (Bens of Jura, Lairig Ghru, and Two Breweries
Fell) also look out of place on the Q-Q plot.
(c) Next week we will learn how to use lm to investigate the relationship between a
dependent variable like Time and two or more predictors (like Distance and Climb).
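As a preview of that syntax (interpretation left until next week), the two-predictor fit would be:
## Multiple regression of Time on both Distance and Climb
fit_mult <- lm(Time ~ Distance + Climb, data = hills)
summary(fit_mult)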