Chapter 9: Multiple Regression

We can use several predictors to build a model for the brain weight of mammals.
> brain <- read.csv("data/brain-body.csv")
> plot(brain) ## matrix of scatterplots
> plot(log(brain)) ## use log on all variables.
[Figure: two scatterplot matrices of brain, body, gestation, and litter; one on the raw scale and one on the log scale.]
Which scale is preferred? Which predictor has the strongest correlation with brain weight (logged
or raw)?
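One way to explore these questions numerically is to compare correlations on both scales. The sketch below uses synthetic data, since the real brain-body.csv values are not reproduced here; the data frame `sim` and all of its numbers are assumptions made only for illustration.

```r
# Synthetic stand-in for the mammal data (NOT the real brain-body.csv values).
set.seed(1)
n <- 96
body      <- exp(rnorm(n, mean = 1, sd = 3))
gestation <- exp(4.5 + 0.25 * log(body) + rnorm(n, sd = 0.4))
litter    <- exp(1 - 0.10 * log(body) + rnorm(n, sd = 0.5))
brain     <- exp(2 + 0.70 * log(body) + rnorm(n, sd = 0.5))
sim <- data.frame(brain, body, gestation, litter)

# Correlation of each predictor with the response, on the raw and log scales.
raw_cor <- cor(sim)["brain", -1]
log_cor <- cor(log(sim))["brain", -1]
print(round(rbind(raw = raw_cor, logged = log_cor), 2))
```

With the real data, the same two calls on `brain` and `log(brain)` would give the numbers needed to answer the questions.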
> brainL2 <- log(brain)
> names(brainL2) = c("Lbrain","Lbody","Lgest", "Llitter")
> brain.fit1 <- lm(Lbrain ~ Lbody, brainL2)
> summary(brain.fit1)
Call:
lm(formula = Lbrain ~ Lbody, data = brainL2)

Residuals:
     Min       1Q   Median       3Q      Max
-1.16218 -0.44640 -0.04525  0.35076  1.83561

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.33235    0.07325   31.84   <2e-16
Lbody        0.71919    0.02037   35.30   <2e-16

Residual standard error: 0.5781 on 94 degrees of freedom
Multiple R-squared:  0.9299,  Adjusted R-squared:  0.9291
F-statistic:  1246 on 1 and 94 DF,  p-value: < 2.2e-16
Two possible predictors remain: gestation and litter size. Do they improve the fit?
> par(mfrow=c(1,2))
> plot(resid(brain.fit1)~brainL2$Lgest)
> ## plot(resid(brain.fit1)~brainL2$Llitter)
> plot(resid(brain.fit1)~ jitter(brainL2$Llitter))
> cor(resid(brain.fit1),brainL2$Lgest) ## .29
> cor(resid(brain.fit1),brainL2$Llitter) ## -.44
[Figure: resid(brain.fit1) plotted against brainL2$Lgest (left) and against jitter(brainL2$Llitter) (right).]
Add a second predictor. This takes us into another dimension.
> brain.fit2 <- lm(Lbrain ~ Lbody + Llitter, data=brainL2)
> summary(brain.fit2)
Residuals:
     Min       1Q   Median       3Q      Max
-1.13795 -0.32132 -0.02347  0.35217  1.65212

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.79827    0.10014  27.945  < 2e-16
Lbody        0.65153    0.02079  31.339  < 2e-16
Llitter     -0.53845    0.09029  -5.963 4.42e-08

Residual standard error: 0.4943 on 93 degrees of freedom
Multiple R-squared:  0.9493,  Adjusted R-squared:  0.9482
F-statistic: 869.9 on 2 and 93 DF,  p-value: < 2.2e-16
Important: the t-test in the Lbody row is for the linear effect of log body weight on log brain
weight, given that log litter size is in the model.
Similarly, the t-test in the Llitter row is for the linear effect of log litter size on log brain weight,
given that log body weight is in the model.
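The "given that the other term is in the model" reading can be checked directly: for a single added term, the squared t statistic equals the F statistic from comparing the models with and without that term. A minimal sketch on synthetic data follows; `sim`, `reduced`, and `full` are illustrative names, not objects from these notes.

```r
# Synthetic data on the log scale (NOT the real mammal data).
set.seed(2)
n <- 96
Lbody   <- rnorm(n, mean = 1, sd = 3)
Llitter <- 1 - 0.1 * Lbody + rnorm(n, sd = 0.5)
Lbrain  <- 2.8 + 0.65 * Lbody - 0.5 * Llitter + rnorm(n, sd = 0.5)
sim <- data.frame(Lbrain, Lbody, Llitter)

reduced <- lm(Lbrain ~ Lbody, data = sim)            # Llitter left out
full    <- lm(Lbrain ~ Lbody + Llitter, data = sim)  # Llitter added

# t statistic for Llitter in the full model
t_litter <- coef(summary(full))["Llitter", "t value"]
# F statistic for adding Llitter to a model that already has Lbody
F_litter <- anova(reduced, full)$F[2]

print(c(t_squared = t_litter^2, F = F_litter))
```

The two printed values agree, which is the sense in which the t-test is conditional on the other predictor.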
> anova(brain.fit2)
Analysis of Variance Table

Response: Lbrain
          Df Sum Sq Mean Sq  F value    Pr(>F)
Lbody      1 416.40  416.40 1704.268 < 2.2e-16
Llitter    1   8.69    8.69   35.562 4.417e-08
Residuals 93  22.72    0.24
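The anova command is sequential. It first does an extra-sum-of-squares (ESS) F test for the SLR
of Lbrain on Lbody compared to a null model of a single mean (no Lbody effect). Then it adopts
the model with the Lbody effect as the null model and tests whether Llitter improves the fit. The
second test is for the Llitter effect conditional on having Lbody in the model. Both F statistics
use the residual mean square of the full model (0.24 on 93 df) as the denominator, which is why
the Lbody F value here (1704) differs from the F-statistic in summary(brain.fit1) (1246).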
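The sequential sums of squares can be reproduced by fitting the nested models one at a time and differencing their residual sums of squares. The sketch below uses synthetic data; the names `fit0`, `fit1`, `fit2`, and `sim` are assumptions for illustration only.

```r
# Synthetic data on the log scale (NOT the real mammal data).
set.seed(3)
n <- 96
Lbody   <- rnorm(n, mean = 1, sd = 3)
Llitter <- 1 - 0.1 * Lbody + rnorm(n, sd = 0.5)
Lbrain  <- 2.8 + 0.65 * Lbody - 0.5 * Llitter + rnorm(n, sd = 0.5)
sim <- data.frame(Lbrain, Lbody, Llitter)

fit0 <- lm(Lbrain ~ 1, data = sim)                # single mean
fit1 <- lm(Lbrain ~ Lbody, data = sim)            # add Lbody
fit2 <- lm(Lbrain ~ Lbody + Llitter, data = sim)  # then add Llitter

# Drop in residual sum of squares at each step.
rss    <- c(deviance(fit0), deviance(fit1), deviance(fit2))
seq_ss <- -diff(rss)

tab <- anova(fit2)
print(cbind(from_rss = seq_ss, from_anova = tab[["Sum Sq"]][1:2]))

# The first-row F uses the residual mean square of the largest model.
F1 <- seq_ss[1] / (deviance(fit2) / df.residual(fit2))
print(c(by_hand = F1, from_anova = tab[["F value"]][1]))
```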
Try the other predictor:
> brain.fit3 <- lm(Lbrain ~ Lbody + Lgest, data=brainL2)
> summary(brain.fit3)
Residuals:
     Min       1Q   Median       3Q      Max
-1.00286 -0.30372 -0.05242  0.37851  1.58788

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.45728    0.45848  -0.997    0.321
Lbody        0.55117    0.03236  17.033  < 2e-16
Lgest        0.66782    0.10875   6.141 2.00e-08

Residual standard error: 0.4902 on 93 degrees of freedom
Multiple R-squared:  0.9501,  Adjusted R-squared:  0.949
F-statistic: 885.2 on 2 and 93 DF,  p-value: < 2.2e-16
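The t-test in the Lbody row is for the linear effect of log body weight on log brain weight, given that log gestation is in the model.
The t-test in the Lgest row is for the linear effect of log gestation on log brain weight, given that log body weight is in the model.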
The F test uses the “single mean” null model, so it answers the question: “is either predictor helping to explain the response?”
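The overall F statistic printed by summary() can be recovered by comparing the fitted model with an intercept-only model. This is a sketch on synthetic data; `fit0`, `full`, and `sim` are illustrative names and the numbers are not those of the mammal data.

```r
# Synthetic data on the log scale (NOT the real mammal data).
set.seed(4)
n <- 96
Lbody  <- rnorm(n, mean = 1, sd = 3)
Lgest  <- 4.5 + 0.25 * Lbody + rnorm(n, sd = 0.4)
Lbrain <- -0.5 + 0.55 * Lbody + 0.65 * Lgest + rnorm(n, sd = 0.5)
sim <- data.frame(Lbrain, Lbody, Lgest)

fit0 <- lm(Lbrain ~ 1, data = sim)              # "single mean" null model
full <- lm(Lbrain ~ Lbody + Lgest, data = sim)  # both predictors

# F statistic reported at the bottom of summary(full)
overall_F <- summary(full)$fstatistic   # named vector: value, numdf, dendf
# The same test written as an explicit model comparison
cmp <- anova(fit0, full)

print(overall_F)
print(cmp)
```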
> anova(brain.fit3)
Analysis of Variance Table

Response: Lbrain
          Df Sum Sq Mean Sq  F value    Pr(>F)
Lbody      1 416.40  416.40 1732.785 < 2.2e-16
Lgest      1   9.06    9.06   37.713 2.002e-08
Residuals 93  22.35    0.24
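First test: the Lbody effect, compared to a null model of a single mean.
Second test: the Lgest effect, conditional on having Lbody in the model.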
> brain.fit4 <- lm(Lbrain ~ Lbody + Lgest +Llitter, data=brainL2)
> summary(brain.fit4)
Residuals:
     Min       1Q   Median       3Q      Max
-0.95415 -0.29639 -0.03105  0.28111  1.57491

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.85482    0.66167   1.292  0.19962
Lbody        0.57507    0.03259  17.647  < 2e-16
Lgest        0.41794    0.14078   2.969  0.00381
Llitter     -0.31007    0.11593  -2.675  0.00885

Residual standard error: 0.4748 on 92 degrees of freedom
Multiple R-squared:  0.9537,  Adjusted R-squared:  0.9522
F-statistic: 631.6 on 3 and 92 DF,  p-value: < 2.2e-16
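Lbody line tests: the linear effect of log body weight, given that Lgest and Llitter are in the model.
Lgest line tests: the linear effect of log gestation, given that Lbody and Llitter are in the model.
Llitter line tests: the linear effect of log litter size, given that Lbody and Lgest are in the model.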
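> anova(brain.fit4)
Analysis of Variance Table

Response: Lbrain
          Df Sum Sq Mean Sq   F value    Pr(>F)
Lbody      1 416.40  416.40 1847.4486 < 2.2e-16
Lgest      1   9.06    9.06   40.2084 8.388e-09
Llitter    1   1.61    1.61    7.1541  0.008852
Residuals 92  20.74    0.23

First test: the Lbody effect, compared to a null model of a single mean.
Second test: the Lgest effect, given that Lbody is in the model.
Third test: the Llitter effect, given that Lbody and Lgest are in the model.

Coefficient estimates:
Model number  intercept    Lbody   Llitter     Lgest
1               2.33235  0.71919
2               2.79827  0.65153  -0.53845
3              -0.45728  0.55117             0.66782
4               0.85482  0.57507  -0.31007   0.41794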
These predictors are correlated, sharing some of the same information, so when we add a
term to the model, it changes coefficient estimates for the others.
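A table of coefficient estimates across models can be assembled programmatically, which makes the shifts easy to see. The sketch below uses synthetic, deliberately correlated predictors; `sim`, `fits`, and `coef_table` are hypothetical names and the values are not the mammal estimates.

```r
# Synthetic data with correlated predictors (NOT the real mammal data).
set.seed(5)
n <- 96
Lbody   <- rnorm(n, mean = 1, sd = 3)
Lgest   <- 4.5 + 0.25 * Lbody + rnorm(n, sd = 0.4)
Llitter <- 1 - 0.10 * Lbody + rnorm(n, sd = 0.5)
Lbrain  <- 0.9 + 0.6 * Lbody + 0.4 * Lgest - 0.3 * Llitter + rnorm(n, sd = 0.5)
sim <- data.frame(Lbrain, Lbody, Lgest, Llitter)

fits <- list(
  m1 = lm(Lbrain ~ Lbody, data = sim),
  m2 = lm(Lbrain ~ Lbody + Llitter, data = sim),
  m3 = lm(Lbrain ~ Lbody + Lgest, data = sim),
  m4 = lm(Lbrain ~ Lbody + Lgest + Llitter, data = sim)
)

# One row per model; NA where a term is not in that model.
terms_all  <- c("(Intercept)", "Lbody", "Llitter", "Lgest")
coef_table <- t(sapply(fits, function(f) coef(f)[terms_all]))
colnames(coef_table) <- terms_all
print(round(coef_table, 3))

# The predictors share information, which is why the estimates move.
print(round(cor(sim[, c("Lbody", "Lgest", "Llitter")]), 2))
```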
With multiple regression, all inferences are conditional on the other terms in the model. The
t-tests are conditional on all other terms (above or below). The anova F tests are sequential:
each is conditional only on the terms listed above it.
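The difference between the two kinds of conditioning shows up when the same terms are entered in a different order: the t-tests do not change, but the sequential F tests do. A minimal sketch on synthetic data, with `fit_a` and `fit_b` as illustrative names:

```r
# Synthetic data with correlated predictors (NOT the real mammal data).
set.seed(6)
n <- 96
Lbody   <- rnorm(n, mean = 1, sd = 3)
Lgest   <- 4.5 + 0.25 * Lbody + rnorm(n, sd = 0.4)
Llitter <- 1 - 0.10 * Lbody + rnorm(n, sd = 0.5)
Lbrain  <- 0.9 + 0.6 * Lbody + 0.4 * Lgest - 0.3 * Llitter + rnorm(n, sd = 0.5)
sim <- data.frame(Lbrain, Lbody, Lgest, Llitter)

fit_a <- lm(Lbrain ~ Lbody + Lgest + Llitter, data = sim)
fit_b <- lm(Lbrain ~ Llitter + Lgest + Lbody, data = sim)  # same terms, reversed

# t-tests condition on all other terms, so the order does not matter.
ta <- coef(summary(fit_a))[, "t value"]
tb <- coef(summary(fit_b))[, "t value"]
print(rbind(order_a = ta, order_b = tb[names(ta)]))

# Sequential F tests condition only on the terms above, so the order matters.
aov_a <- anova(fit_a)
aov_b <- anova(fit_b)
print(aov_a)
print(aov_b)
```

Only the last row of each anova table matches the corresponding t-test (F equals t squared), because only the last term is tested given all the others.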
Assumptions are the same as with SLR. Diagnostics are the same:
> par(mfrow=c(1,3))
> plot(brain.fit4, which = 1:3)
[Figure: diagnostic plots for brain.fit4: Residuals vs Fitted, Normal Q-Q, and Scale-Location. Labeled points: Human being, Dolphin, Tapir.]

Normality: seems good; no problems in the normal Q-Q plot.
Linearity: there is curvature in the first plot, so this assumption is questionable.
Constant variance: slight fan shape in plot 1, slight trend in plot 3.
Independent observations: no reason to doubt independence.
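The points that plot() labels can also be listed numerically from the standardized residuals. The sketch below uses synthetic data with made-up row names ("species1", "species2", ...); nothing here reproduces the real mammal values.

```r
# Synthetic data with hypothetical row names (NOT the real mammal data).
set.seed(7)
n <- 96
Lbody   <- rnorm(n, mean = 1, sd = 3)
Lgest   <- 4.5 + 0.25 * Lbody + rnorm(n, sd = 0.4)
Llitter <- 1 - 0.10 * Lbody + rnorm(n, sd = 0.5)
Lbrain  <- 0.9 + 0.6 * Lbody + 0.4 * Lgest - 0.3 * Llitter + rnorm(n, sd = 0.5)
sim <- data.frame(Lbrain, Lbody, Lgest, Llitter,
                  row.names = paste0("species", seq_len(n)))

fit <- lm(Lbrain ~ Lbody + Lgest + Llitter, data = sim)

# Standardized residuals, as used in the Normal Q-Q and Scale-Location panels.
std_res <- rstandard(fit)

# The three most extreme cases; plot.lm labels three points per panel by default.
extreme <- head(sort(abs(std_res), decreasing = TRUE), 3)
print(round(std_res[names(extreme)], 2))
```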
> summary(brain.fit5 <- update(brain.fit4, . ~ . +I(Lbody^2)))
Call:
lm(formula = Lbrain ~ Lbody + Lgest + Llitter + I(Lbody^2), data = brainL2)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.927918   0.643674   1.442  0.15285
Lbody        0.618311   0.035979  17.185  < 2e-16
Lgest        0.415145   0.136820   3.034  0.00314
Llitter     -0.280850   0.113250  -2.480  0.01498
I(Lbody^2)  -0.013114   0.005178  -2.532  0.01304

Residual standard error: 0.4614 on 91 degrees of freedom
Multiple R-squared:  0.9567,  Adjusted R-squared:  0.9548
F-statistic: 503.2 on 4 and 91 DF,  p-value: < 2.2e-16
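The I() wrapper in the update() call matters: inside a model formula, ^ is a formula operator rather than arithmetic, so a bare x^2 does not add a squared term. A small sketch on synthetic data, where `d`, `x`, and `y` are hypothetical names:

```r
# Synthetic data with a curved relationship (illustration only).
set.seed(8)
n <- 96
x <- rnorm(n, mean = 1, sd = 3)
y <- 1 + 0.6 * x - 0.02 * x^2 + rnorm(n, sd = 0.5)
d <- data.frame(x, y)

base      <- lm(y ~ x, data = d)
with_I    <- lm(y ~ x + I(x^2), data = d)     # I() protects the arithmetic
without_I <- lm(y ~ x + x^2, data = d)        # x^2 collapses to x in a formula
via_update <- update(base, . ~ . + I(x^2))    # same idiom as in the notes

print(names(coef(with_I)))
print(names(coef(without_I)))
```

The model fitted without I() has only an intercept and a slope, so the quadratic term would be silently missing.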