Chapter 10: Inferential Tools for Multiple Regression

10.2 Inferences About Regression Coefficients
When possible, structure the model so that important questions can be answered by looking at
a single coefficient (instead of having to add or subtract several).
10.2.1 Least Squares Estimates and Standard Errors
• Bat Echolocation Case Study
Question: Does echolocation add more energy expenditure on top of that required for flight, after accounting for body mass?
[Figure: coded scatterplot of log(energy) vs. log(body mass), with points marked E-bat, Bird, and N-bat.]
• Estimates of the coefficients (β’s) are the least squares (minimizing the sum of squared
residuals) estimates
– Computer software does the calculations (more complex than SLR). Read estimates
from the output.
• Write out the parallel regression lines model for the bat echolocation study:
• Preliminary R-work for setting up the indicator variables:
batBird <- read.csv("data/batBirds.csv", header = TRUE)
batBird$lenergy <- log(batBird$energy)
batBird$lmass   <- log(batBird$mass)
## take non-echolocating bats as the reference or baseline level.
## set up two indicator variables
batBird$ebat <- ifelse(batBird$type == "echolocating bats", 1, 0)
batBird$bird <- ifelse(batBird$type == "non-echolocating birds", 1, 0)
echo.fit1 <- lm(lenergy ~ lmass + ebat + bird, data = batBird)
summary(echo.fit1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.57636    0.28724  -5.488 4.96e-05
lmass        0.81496    0.04454  18.297 3.76e-12
ebat         0.07866    0.20268   0.388    0.703
bird         0.10226    0.11418   0.896    0.384

Residual standard error: 0.186 on 16 degrees of freedom
Multiple R-Squared: 0.9815,     Adjusted R-squared: 0.9781
F-statistic: 283.6 on 3 and 16 DF,  p-value: 4.464e-14
Note: the R2 is very high, yet neither indicator has a small p-value.
10.2.2 Tests and Confidence Intervals for Single Coefficients
• Need to understand what (if any) question each parameter answers.
• Computer output always gives tests of whether each parameter is equal to zero AFTER
accounting for (i.e. at a fixed value of) the other explanatory variables in the model!
– t-test:
– Confidence interval:
• Significance depends on what other explanatory variables are included.
– The meaning of a coefficient depends on what other explanatory variables are included
in the model.
– The p-value must be interpreted according to which other variables are included.
– Compare the results from three models for the echolocation study.
> echo.fit0 <- lm(lenergy ~ ebat + bird, data=batBird)
> echo.fit2 <- lm(lenergy ~ lmass + ebat + bird + lmass:ebat + lmass:bird, data=batBird)
> summary( echo.fit0)$coef
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  3.3961199  0.4223589  8.040839 3.405224e-07
ebat        -2.7432722  0.5973057 -4.592744 2.590334e-04
bird        -0.6087747  0.4876981 -1.248262 2.288543e-01
> summary( echo.fit1)$coef
               Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -1.57636019 0.28723641 -5.4880235 4.960396e-05
lmass        0.81495749 0.04454143 18.2966188 3.757574e-12
ebat         0.07866369 0.20267926  0.3881191 7.030432e-01
bird         0.10226193 0.11418264  0.8955996 3.837429e-01
> summary( echo.fit2)$coef
              Estimate Std. Error    t value   Pr(>|t|)
(Intercept) -0.2024478  1.2613342 -0.1605029 0.87477786
lmass        0.5897821  0.2061380  2.8611031 0.01257151
ebat        -1.2680675  1.2854200 -0.9865005 0.34063099
bird        -1.3783900  1.2952413 -1.0641955 0.30525147
lmass:ebat   0.2148750  0.2236226  0.9608819 0.35291437
lmass:bird   0.2455882  0.2134322  1.1506616 0.26914645
– What does the ebat coefficient measure for each of the models?
1. µ̂{lenergy | type} = 3.40 − 2.74 ebat − 0.61 bird
2. µ̂{lenergy | lmass, type} = −1.58 + 0.815 lmass + 0.08 ebat + 0.10 bird
3. µ̂{lenergy | lmass, type} = −0.20 + 0.59 lmass − 1.27 ebat − 1.38 bird +
0.21 (ebat × lmass) + 0.25 (bird × lmass)
10.2.3 Tests and Confidence Intervals for Linear Combinations of Coefficients
What if your question isn’t answered by a single β? How can we draw inferences about linear combinations of the regression coefficients?
Some strategies:
1. Redefining the reference level trick:
• Redefine the reference level so that question is answered by a single β.
> levels(batBird$type)
[1] "echolocating bats"      "non-echolocating bats"  "non-echolocating birds"
> batBird$type = factor(batBird$type, levels = levels(batBird$type)[c(2,1,3)])
> coef(lm(lenergy ~ lmass + type, batBird))
               (Intercept)                      lmass
               -1.57636019                 0.81495749
     typeecholocating bats typenon-echolocating birds
                0.07866369                 0.10226193
## new reference level is non-echolocating bats
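A tidier way to get the same effect is relevel(), which changes only the baseline level without reordering the rest by hand (a sketch, assuming batBird is set up as above):

```r
## make non-echolocating bats the baseline level directly
batBird$type <- relevel(batBird$type, ref = "non-echolocating bats")
coef(lm(lenergy ~ lmass + type, batBird))   # same fit, new reference level
```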
2. Computer centering trick:
• Use if you want to estimate the mean of Y at some specific combination of the X’s.
> summary( lm(lenergy ~ I(lmass - 2) + type, batBird))$coef
                             Estimate Std. Error    t value     Pr(>|t|)
(Intercept)                0.05355479 0.20498864  0.2612574 7.972271e-01
I(lmass - 2)               0.81495749 0.04454143 18.2966188 3.757574e-12
typeecholocating bats      0.07866369 0.20267926  0.3881191 7.030432e-01
typenon-echolocating birds 0.10226193 0.11418264  0.8955996 3.837429e-01
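The payoff of centering is visible in the intercept: it now estimates the mean log energy for the baseline group (non-echolocating bats) at lmass = 2, i.e. at a body mass of about e^2 ≈ 7.4 g. You can verify this from the uncentered echo.fit1 estimates:

```r
## intercept after centering = fitted mean at lmass = 2 for the baseline group
-1.57636019 + 2 * 0.81495749   # equals the 0.05355479 intercept above
## the slope and indicator coefficients are unchanged by centering
```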
3. Making confidence bands for multiple regression surfaces:
plot(energy ~ mass, batBird, ylim = c(0,60), col = rep(c(4,3,2), c(4,12,4)))
with(batBird, tapply(mass, type, summary))
$`non-echolocating bats`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  258.0   300.8   471.5   495.0   665.8   779.0
$`echolocating bats`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6.70    7.45    7.85   28.85   29.25   93.00
$`non-echolocating birds`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   24.3   108.2   302.5   263.2   391.0   480.0
### add 3 back-transformed fitted curves
curve( exp( -1.5763 + 0.815 * log(x) ), from = 258, to = 780, add=T, col=4)            ## non-echo bats
curve( exp( -1.5763 + 0.0786 + 0.815 * log(x) ), from = 7, to = 100, add=T, col=2)     ## echo-bats
curve( exp( -1.5763 + 0.10226 + 0.815 * log(x) ), from = 24, to = 480, add=T, col=3)   ## birds
## to get the book’s intervals:
multiplier = sqrt(4*qf(.95,4,16)) ## Scheffe’s f multiplier = 3.468
lenergy.fits <- predict( lm(lenergy ~ lmass + type, batBird) , se.fit = T,
newdata = list(lmass=log(rep(c(100,400),3)), type=rep(levels(batBird$type),each=2)))
exp(cbind( lenergy.fits$fit - multiplier * lenergy.fits$se.fit,
lenergy.fits$fit + multiplier * lenergy.fits$se.fit) )
       [,1]     [,2]
1  5.929275 13.11050
2 19.757080 37.68814
3  6.124656 14.85476
4 16.038811 54.33526
5  7.919106 12.04393
6 24.249319 37.67485
## to fool R into using this multiplier and build the intervals for me:
fake.Tlevel <- 1- pt(-multiplier, 16)*2 ## 1 - alpha = 0.9968
batPred <- predict(echo.fit1, interval = "conf", level = fake.Tlevel)
## add intervals based on 1:4 are non-ebats, 5:16 are birds, 17:20 are ebats
arrows(batBird$mass[1:4], exp(batPred[1:4,2]), batBird$mass[1:4], exp(batPred[1:4,3]),col=4,
code=3,length=.1, angle=90)
arrows(batBird$mass[5:16], exp(batPred[5:16,2]), batBird$mass[5:16], exp(batPred[5:16,3]),col=3,
code=3,length=.1, angle=90)
arrows(batBird$mass[17:20], exp(batPred[17:20,2]), batBird$mass[17:20], exp(batPred[17:20,3]),col=2,
code=3,length=.1, angle=90)
[Figure: scatterplot of energy vs. mass showing the three groups, the back-transformed fitted curves, and the Scheffé-based confidence bars.]
4. Direct calculation of the standard errors using formulas (Section 10.4.3):
• Linear combination of regression coefficients:
γ = C0 β0 + C1 β1 + C2 β2 + . . . + Cp βp
g = C0 β̂0 + C1 β̂1 + C2 β̂2 + . . . + Cp β̂p
• Calculating the standard error is more complicated because we cannot assume that
the β’s are independent (as we did for the µ’s). We now must consider covariances
when calculating SE(g):
V ar(g) = C02 SE(β̂0 )2 + C12 SE(β̂1 )2 + . . . + Cp2 SE(β̂p )2 +
2C0 C1 Cov(β̂0 , β̂1 ) + 2C0 C2 Cov(β̂0 , β̂2 ) + . . . + 2Cp−1 Cp Cov(β̂p−1 , β̂p )
• Use a computer program to get the estimated covariances.
– R code: vcov(lm.fit)
• Example: Suppose we want SE(β̂2 − β̂3 ) for the echolocation data to compare the
intercepts for ebats and birds.
> vcov(echo.fit1)
            (Intercept)        lmass         ebat         bird
(Intercept)  0.08250476 -0.012105043 -0.050561460 -0.019207031
lmass       -0.01210504  0.001983939  0.006869742  0.001730953
ebat        -0.05056146  0.006869742  0.041078880  0.014639320
bird        -0.01920703  0.001730953  0.014639320  0.013037675
> sqrt( 0.041078880 + 0.013037675 -2 * 0.014639320)
[1] 0.1576005
## or use matrix multiplication:
> contrastCis <- c(0,0,1,-1)
> sqrt( contrastCis %*% vcov(echo.fit1) %*% contrastCis)
          [,1]
[1,] 0.1576005
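With SE(g) in hand, a confidence interval for the linear combination follows the usual estimate plus or minus multiplier times SE recipe. A sketch, assuming echo.fit1 from above:

```r
## 95% CI for the ebat - bird intercept difference
Cs <- c(0, 0, 1, -1)                          # contrast coefficients
g  <- sum(Cs * coef(echo.fit1))               # 0.07866 - 0.10226
se <- sqrt(t(Cs) %*% vcov(echo.fit1) %*% Cs)  # the 0.1576 computed above
g + c(-1, 1) * qt(0.975, df = 16) * c(se)     # interval comfortably covers 0
```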
10.2.4 Prediction
• If prediction is the only objective, then there is no need to interpret the coefficients.
• As in SLR, prediction error is calculated by combining the residual SE with the SE of the
estimate of the population mean at values of the explanatory variables:
SE[P red{Y |X1 = x1 , X2 = x2 , . . .}] =
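In R, predict() with interval = "prediction" assembles this standard error (residual SE combined with the SE of the estimated mean) and returns the interval directly. A sketch, assuming echo.fit1 from above; the 400 g bird is a hypothetical new case:

```r
## 95% prediction interval for the energy expenditure of a new 400 g bird
new <- data.frame(lmass = log(400), ebat = 0, bird = 1)
exp(predict(echo.fit1, newdata = new, interval = "prediction"))
## exp() back-transforms the interval to the original energy scale
```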
10.3 Extra-Sums-of-Squares F -Tests
What if we want to test whether several coefficients are all zero simultaneously?
• Bat echolocation example. Could all types have the same intercept?
– Null hypothesis:
– Alternative hypothesis:
• Can t-tests be used to test the above hypothesis?
• When are t-tests equivalent to F -tests?
• What two models should we compare via an ESS F-test?
– Full:
– Reduced:
• An “Overall Significance” F -test:
– Null hypothesis: All regression coefficients except β0 are zero
– Alternative: At least one is non-zero
– What two models are being compared?
– Bat Echolocation example:
∗ Questions of interest:
1. Is there a difference in the mean in-flight energy expenditures of echolocating
and non-echolocating bats after body size is accounted for?
2. Is there a difference between birds and the two bat groups?
∗ Look at the coded scatterplot of the data.
[Figure: the coded scatterplot of log(energy) vs. log(body mass) again, with points marked E-bat, Bird, and N-bat.]
It appears that a parallel lines model may be appropriate, and this would also be the most convenient inferential model.
∗ Now, let’s investigate whether the parallel lines model really is appropriate statistically.
1. What is a fuller (richer) model that we could test the parallel lines model
against?
2. Fit the fuller model. Examine residual plots.
[Figure: diagnostic plots for the fuller model (Residuals vs Fitted, Normal Q-Q, and Scale-Location), with observations 14, 15, and 16 flagged as the most extreme residuals.]
3. Perform the F -test for the hypothesis that both interaction terms can be
dropped (i.e. are zero)
echo.fit2 <- lm(lenergy ~ lmass + ebat + bird + lmass:ebat + lmass:bird,
                data = batBird)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.2024     1.2613  -0.161   0.8748
lmass         0.5898     0.2061   2.861   0.0126
ebat         -1.2681     1.2854  -0.987   0.3406
bird         -1.3784     1.2952  -1.064   0.3053
lmass:ebat    0.2149     0.2236   0.961   0.3529
lmass:bird    0.2456     0.2134   1.151   0.2691
> anova(echo.fit1, echo.fit2)
Analysis of Variance Table

Model 1: lenergy ~ lmass + ebat + bird
Model 2: lenergy ~ lmass + ebat + bird + lmass:ebat + lmass:bird
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     16 0.55332
2     14 0.50487  2   0.04845 0.6718 0.5265
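The F statistic in this table can be reproduced by hand from the two residual sums of squares, which makes the extra-sums-of-squares recipe concrete:

```r
## ESS F-test by hand: (drop in RSS per dropped df) / (full-model residual MS)
Fstat <- ((0.55332 - 0.50487) / (16 - 14)) / (0.50487 / 14)   # about 0.672
pval  <- 1 - pf(Fstat, df1 = 2, df2 = 14)                     # about 0.53
```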
4. If it is safe to go with the parallel lines model, we then want to test whether
the intercepts are the same for the three groups:
> anova(lm(lenergy ~ lmass, batBird), echo.fit1)
Analysis of Variance Table

Model 1: lenergy ~ lmass
Model 2: lenergy ~ lmass + ebat + bird
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     18 0.58289
2     16 0.55332  2  0.029574 0.4276 0.6593
> 0.07866 + c(-1,1) * qt(.975,16) * 0.20268   ## CI for ebat intercept adjustment
[1] -0.3510024  0.5083224
> exp(c( -0.3510024, 0.5083224))              ## back-transformed
[1] 0.703982 1.662500
> 0.10226 + c(-1,1) * qt(.975,16) * 0.11418   ## CI for bird intercept adjustment
[1] -0.1397908  0.3443108
> exp(c( -0.1397908, 0.3443108))              ## back-transformed
[1] 0.8695401 1.4110171
∗ Compare the results from R when we create our own indicator variables vs.
letting R do it:
# LET R define the indicator variables for us:
> batBird$TYPE <- read.csv("data/batBirds.csv", header = TRUE)$type
> anova(lm(lenergy ~ lmass + TYPE, batBird))
Analysis of Variance Table

Response: lenergy
          Df  Sum Sq Mean Sq  F value    Pr(>F)
lmass      1 29.3919 29.3919 849.9108 2.691e-15
TYPE       2  0.0296  0.0148   0.4276    0.6593
Residuals 16  0.5533  0.0346
> summary(lm(lenergy ~ lmass + TYPE, batBird))$coef
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                -1.49770    0.14987  -9.993 2.77e-08
lmass                       0.81496    0.04454  18.297 3.76e-12
TYPEnon-echolocating bats  -0.07866    0.20268  -0.388    0.703
TYPEnon-echolocating birds  0.02360    0.15760   0.150    0.883
### USE OUR OWN INDICATOR VARIABLES to match the output above ###
batBird$nbat <- ifelse(batBird$type == "non-echolocating bats", 1, 0)
lm(lenergy ~ lmass + nbat + bird, data = batBird)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.49770    0.14987  -9.993 2.77e-08
lmass        0.81496    0.04454  18.297 3.76e-12
nbat        -0.07866    0.20268  -0.388    0.703
bird         0.02360    0.15760   0.150    0.883
10.4 Related Issues
10.4.1 More on R-Squared
• How can we always make R2 100%?
• What are more appropriate tools for model building and checking?
• Adjusted R-squared:
– A version of R2 that includes a penalty for unnecessary explanatory variables. It is
useful for a casual assessment of improvement of fit.
Adjusted R2 = [(Total MS) − (Residual MS)] / (Total MS) × 100%
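Plugging in the mean squares from the anova table in Section 10.3 (Total SS = 29.3919 + 0.0296 + 0.5533 on 19 df) reproduces the adjusted R2 reported in the echo.fit1 summary:

```r
## adjusted R-squared from mean squares
total.ms <- (29.3919 + 0.0296 + 0.5533) / 19   # Total MS on n - 1 = 19 df
resid.ms <- 0.5533 / 16                        # Residual MS on 16 df
(total.ms - resid.ms) / total.ms               # about 0.978, as reported
```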
– What is the disadvantage of the Adjusted R2 vs. the usual R2 ?
10.4.2 Improving a study with replication
• Replication = taking repeated observations at the same X values.
• What does replication allow us to do?
• Where does our estimate of σ 2 come from in the absence of replication?
10.4.6 The Principle of Occam’s Razor
• Simple models are preferred over complicated models. Shave off the excess.
• It is founded in common sense and successful experience.
• Often called the Principle of Parsimony