Chapter 10: Inferential Tools for Multiple Regression

10.2 Inferences About Regression Coefficients

When possible, structure the model so that important questions can be answered by looking at a single coefficient (instead of having to add or subtract several).

10.2.1 Least Squares Estimates and Standard Errors

• Bat Echolocation Case Study

  [Figure: coded scatterplot of log(energy) versus log(body mass), with points labeled E-bat, Bird, and N-bat.]

  Question: Does echolocation add energy expenditure on top of that required for flight, after accounting for body mass?

• Estimates of the coefficients (the β's) are the least squares estimates, i.e. they minimize the sum of squared residuals.

  – Computer software does the calculations (more complex than in SLR). Read the estimates from the output.

• Write out the parallel regression lines model for the bat echolocation study:

• Preliminary R work for setting up the indicator variables:

  batBird <- read.csv("data/batBirds.csv", header = TRUE)
  batBird$lenergy <- log(batBird$energy)
  batBird$lmass <- log(batBird$mass)
  ## take non-echolocating bats as the reference (baseline) level
  ## and set up two indicator variables
  batBird$ebat <- ifelse(batBird$type == "echolocating bats", 1, 0)
  batBird$bird <- ifelse(batBird$type == "non-echolocating birds", 1, 0)
  echo.fit1 <- lm(lenergy ~ lmass + ebat + bird, data = batBird)
  summary(echo.fit1)

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) -1.57636    0.28724  -5.488 4.96e-05
  lmass        0.81496    0.04454  18.297 3.76e-12
  ebat         0.07866    0.20268   0.388    0.703
  bird         0.10226    0.11418   0.896    0.384

  Residual standard error: 0.186 on 16 degrees of freedom
  Multiple R-squared: 0.9815, Adjusted R-squared: 0.9781
  F-statistic: 283.6 on 3 and 16 DF, p-value: 4.464e-14

  Note: the R² is very high, yet neither indicator variable has a small p-value.

10.2.2 Tests and Confidence Intervals for Single Coefficients

• Need to understand what (if any) question each parameter answers.

• Computer output always gives tests of whether each parameter is equal to zero AFTER accounting for (i.e. at fixed values of) the other explanatory variables in the model!

  – t-test:

  – Confidence interval:

• Significance depends on which other explanatory variables are included.

  – The meaning of a coefficient depends on which other explanatory variables are included in the model.

  – The p-value must be interpreted according to which other variables are included.

  – Compare the results from three models for the echolocation study.

  > echo.fit0 <- lm(lenergy ~ ebat + bird, data = batBird)
  > echo.fit2 <- lm(lenergy ~ lmass + ebat + bird + lmass:ebat + lmass:bird, data = batBird)
  > summary(echo.fit0)$coef
                Estimate Std. Error   t value     Pr(>|t|)
  (Intercept)  3.3961199  0.4223589  8.040839 3.405224e-07
  ebat        -2.7432722  0.5973057 -4.592744 2.590334e-04
  bird        -0.6087747  0.4876981 -1.248262 2.288543e-01
  > summary(echo.fit1)$coef
                 Estimate Std. Error    t value     Pr(>|t|)
  (Intercept) -1.57636019 0.28723641 -5.4880235 4.960396e-05
  lmass        0.81495749 0.04454143 18.2966188 3.757574e-12
  ebat         0.07866369 0.20267926  0.3881191 7.030432e-01
  bird         0.10226193 0.11418264  0.8955996 3.837429e-01
  > summary(echo.fit2)$coef
                Estimate Std. Error    t value   Pr(>|t|)
  (Intercept) -0.2024478  1.2613342 -0.1605029 0.87477786
  lmass        0.5897821  0.2061380  2.8611031 0.01257151
  ebat        -1.2680675  1.2854200 -0.9865005 0.34063099
  bird        -1.3783900  1.2952413 -1.0641955 0.30525147
  lmass:ebat   0.2148750  0.2236226  0.9608819 0.35291437
  lmass:bird   0.2455882  0.2134322  1.1506616 0.26914645

  – What does the ebat coefficient measure for each of the models?

    1. µ̂{lenergy | type} = 3.40 − 2.74 ebat − 0.61 bird

    2. µ̂{lenergy | lmass, type} = −1.58 + 0.08 ebat + 0.10 bird + 0.815 lmass

    3. µ̂{lenergy | lmass, type} = −0.20 − 1.27 ebat − 1.38 bird + 0.59 lmass + 0.21 (ebat × lmass) + 0.25 (bird × lmass)
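  For example, the t-statistic and 95% confidence interval for the ebat coefficient of echo.fit1 can be built directly from its estimate and standard error. This is only a minimal sketch; confint() reports the same interval.

  ## t statistic and 95% CI for the ebat coefficient in echo.fit1,
  ## built by hand from the estimate and standard error
  est <- coef(summary(echo.fit1))["ebat", "Estimate"]      ## 0.0787
  se  <- coef(summary(echo.fit1))["ebat", "Std. Error"]    ## 0.2027
  est / se                                   ## t statistic (compare to a t with 16 df): 0.388
  est + c(-1, 1) * qt(0.975, df = 16) * se   ## hand-built 95% CI for the ebat coefficient
  confint(echo.fit1, "ebat")                 ## the same interval directly from R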
10.2.3 Tests and Confidence Intervals for Linear Combinations of Coefficients

What if your question isn't answered by a single β? How can we make inferences about linear combinations of regression coefficients? Some strategies:

1. The "redefine the reference level" trick:

   • Redefine the reference level so that the question is answered by a single β.

   > levels(batBird$type)
   [1] "echolocating bats"      "non-echolocating bats"  "non-echolocating birds"
   > batBird$type <- factor(batBird$type, levels = levels(batBird$type)[c(2,1,3)])
   > coef(lm(lenergy ~ lmass + type, batBird))
                  (Intercept)                      lmass
                  -1.57636019                 0.81495749
        typeecholocating bats typenon-echolocating birds
                   0.07866369                 0.10226193
   ## the new reference level is non-echolocating bats

2. The "computer centering" trick:

   • Use this when you want to estimate the mean of Y at some specific combination of the X's.

   > summary( lm(lenergy ~ I(lmass - 2) + type, batBird) )$coef
                                 Estimate Std. Error    t value     Pr(>|t|)
   (Intercept)                 0.05355479 0.20498864  0.2612574 7.972271e-01
   I(lmass - 2)                0.81495749 0.04454143 18.2966188 3.757574e-12
   typeecholocating bats       0.07866369 0.20267926  0.3881191 7.030432e-01
   typenon-echolocating birds  0.10226193 0.11418264  0.8955996 3.837429e-01

3. Confidence bands for multiple regression surfaces:

   plot(energy ~ mass, batBird, ylim = c(0, 60), col = rep(c(4,3,2), c(4,12,4)))
   with(batBird, tapply(mass, type, summary))

   $`non-echolocating bats`
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     258.0   300.8   471.5   495.0   665.8   779.0
   $`echolocating bats`
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      6.70    7.45    7.85   28.85   29.25   93.00
   $`non-echolocating birds`
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      24.3   108.2   302.5   263.2   391.0   480.0

   ### add the 3 back-transformed fitted curves
   curve( exp( -1.5763 + 0.815*log(x) ),           from = 258, to = 780, add = T, col = 4)  ## non-echolocating bats
   curve( exp( -1.5763 + 0.0786  + 0.815*log(x) ), from = 7,   to = 100, add = T, col = 2)  ## echolocating bats
   curve( exp( -1.5763 + 0.10226 + 0.815*log(x) ), from = 24,  to = 480, add = T, col = 3)  ## non-echolocating birds

   ## to get the book's intervals:
   multiplier <- sqrt( 4*qf(.95, 4, 16) )   ## Scheffe multiplier = 3.468
   lenergy.fits <- predict( lm(lenergy ~ lmass + type, batBird), se.fit = T,
                            newdata = list(lmass = log(rep(c(100, 400), 3)),
                                           type = rep(levels(batBird$type), each = 2)) )
   exp( cbind( lenergy.fits$fit - multiplier * lenergy.fits$se.fit,
               lenergy.fits$fit + multiplier * lenergy.fits$se.fit ) )
          [,1]     [,2]
   1  5.929275 13.11050
   2 19.757080 37.68814
   3  6.124656 14.85476
   4 16.038811 54.33526
   5  7.919106 12.04393
   6 24.249319 37.67485

   ## to fool R into using this multiplier and build the intervals for me:
   fake.Tlevel <- 1 - pt(-multiplier, 16)*2   ## 1 - alpha = 0.9968
   batPred <- predict(echo.fit1, interval = "conf", level = fake.Tlevel)
   ## add the intervals; rows 1:4 are non-echolocating bats, 5:16 are birds, 17:20 are echolocating bats
   arrows(batBird$mass[1:4],   exp(batPred[1:4,2]),   batBird$mass[1:4],   exp(batPred[1:4,3]),
          col = 4, code = 3, length = .1, angle = 90)
   arrows(batBird$mass[5:16],  exp(batPred[5:16,2]),  batBird$mass[5:16],  exp(batPred[5:16,3]),
          col = 3, code = 3, length = .1, angle = 90)
   arrows(batBird$mass[17:20], exp(batPred[17:20,2]), batBird$mass[17:20], exp(batPred[17:20,3]),
          col = 2, code = 3, length = .1, angle = 90)

   [Figure: energy versus mass scatterplot with the three back-transformed fitted curves and the Scheffe-based interval bars for each group.]
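   To make the construction explicit, here is the same Scheffe-type interval for a single new point, a hypothetical 400 g non-echolocating bird. This is a minimal sketch; the choice of point is only illustrative, and the result should agree with the corresponding row of the matrix above.

   ## Scheffe-type interval for the median energy of a 400 g non-echolocating bird,
   ## spelled out one piece at a time
   new.pt <- data.frame(lmass = log(400), type = "non-echolocating birds")
   fit.pt <- predict( lm(lenergy ~ lmass + type, batBird), newdata = new.pt, se.fit = TRUE )
   multiplier <- sqrt( 4*qf(.95, 4, 16) )                     ## Scheffe multiplier, about 3.468
   exp( fit.pt$fit + c(-1, 1) * multiplier * fit.pt$se.fit )  ## back-transformed interval endpoints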
4. Direct calculation of the standard errors using formulas (Section 10.4.3):

   • Linear combination of regression coefficients: γ = C0β0 + C1β1 + C2β2 + · · · + Cpβp. Its estimate is

     g =

   • Calculating the standard error is more complicated because we cannot assume that the β̂'s are independent (as we did for the µ's). We now must account for the covariances when calculating SE(g):

     Var(g) = C0²SE(β̂0)² + C1²SE(β̂1)² + · · · + Cp²SE(β̂p)²
              + 2C0C1Cov(β̂0, β̂1) + 2C0C2Cov(β̂0, β̂2) + · · · + 2Cp−1CpCov(β̂p−1, β̂p)

   • Use a computer program to get the estimated covariances.

     – R code: vcov(lm.fit)

   • Example: Suppose we want SE(β̂2 − β̂3) for the echolocation data, to compare the intercept adjustments for echolocating bats (ebat) and birds (bird).

   > vcov(echo.fit1)
               (Intercept)        lmass         ebat         bird
   (Intercept)  0.08250476 -0.012105043 -0.050561460 -0.019207031
   lmass       -0.01210504  0.001983939  0.006869742  0.001730953
   ebat        -0.05056146  0.006869742  0.041078880  0.014639320
   bird        -0.01920703  0.001730953  0.014639320  0.013037675
   > sqrt( 0.041078880 + 0.013037675 - 2*0.014639320 )
   [1] 0.1576005
   ## or use matrix multiplication:
   > contrastCis <- c(0, 0, 1, -1)
   > sqrt( contrastCis %*% vcov(echo.fit1) %*% contrastCis )
             [,1]
   [1,] 0.1576005

10.2.4 Prediction

• If prediction is the only objective, then there is no need to interpret the coefficients.

• As in SLR, the prediction error is calculated by combining the residual SE with the SE of the estimate of the population mean at the given values of the explanatory variables:

  SE[Pred{Y | X1 = x1, X2 = x2, . . .}] =

10.3 Extra-Sums-of-Squares F-Tests

What if we want to test whether several coefficients are all zero (simultaneously)?

• Bat echolocation example. Could all types have the same intercept?

  – Null hypothesis:

  – Alternative hypothesis:

• Can t-tests be used to test the above hypothesis?

• When are t-tests equivalent to F-tests?

• What two models should we compare via an ESS F-test?

  – Full:

  – Reduced:

• An "Overall Significance" F-test:

  – Null hypothesis: all regression coefficients except β0 are zero.

  – Alternative: at least one is non-zero.

  – What two models are being compared?

  – Bat echolocation example:

    ∗ Questions of interest:

      1. Is there a difference in the mean in-flight energy expenditure of echolocating and non-echolocating bats after body size is accounted for?

      2. Is there a difference between birds and the two bat groups?

    ∗ Look at the coded scatterplot of the data.

      [Figure: coded scatterplot of log(energy) versus log(body mass) by type (E-bat, Bird, N-bat).]

      It appears that a parallel lines model may be appropriate, and this would also be the most convenient inferential model.

    ∗ Now, let's investigate whether the parallel lines model really is appropriate statistically.

      1. What is a fuller (richer) model that we could test the parallel lines model against?

      2. Fit the fuller model. Examine residual plots.

         [Figure: residual diagnostic plots for the fuller model — Residuals vs Fitted, Normal Q–Q, and Scale–Location.]

      3. Perform the F-test of the hypothesis that both interaction terms can be dropped (i.e. are zero):

         echo.fit2 <- lm(lenergy ~ lmass + ebat + bird + lmass:ebat + lmass:bird, data = batBird)

         Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
         (Intercept)  -0.2024     1.2613  -0.161   0.8748
         lmass         0.5898     0.2061   2.861   0.0126
         ebat         -1.2681     1.2854  -0.987   0.3406
         bird         -1.3784     1.2952  -1.064   0.3053
         lmass:ebat    0.2149     0.2236   0.961   0.3529
         lmass:bird    0.2456     0.2134   1.151   0.2691

         > anova(echo.fit1, echo.fit2)
         Analysis of Variance Table

         Model 1: lenergy ~ lmass + ebat + bird
         Model 2: lenergy ~ lmass + ebat + bird + lmass:ebat + lmass:bird
           Res.Df     RSS Df Sum of Sq      F Pr(>F)
         1     16 0.55332
         2     14 0.50487  2   0.04845 0.6718 0.5265
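         To see where this F statistic comes from, here is the extra-sum-of-squares calculation done by hand from the residual sums of squares printed in the anova() table above (a minimal sketch using those rounded RSS values).

         ## extra-sum-of-squares F statistic, by hand:
         ## F = [(RSS_reduced - RSS_full) / (extra df)] / [RSS_full / df_full]
         rss.reduced <- 0.55332   ## parallel lines model, 16 df
         rss.full    <- 0.50487   ## model with both interaction terms, 14 df
         f.stat <- ((rss.reduced - rss.full) / 2) / (rss.full / 14)   ## 0.6718
         1 - pf(f.stat, 2, 14)                                        ## p-value, 0.5265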
      4. If it is safe to go with the parallel lines model, we then want to test whether the intercepts are the same for the three groups:

         > anova(lm(lenergy ~ lmass, batBird), echo.fit1)
         Analysis of Variance Table

         Model 1: lenergy ~ lmass
         Model 2: lenergy ~ lmass + ebat + bird
           Res.Df     RSS Df Sum of Sq      F Pr(>F)
         1     18 0.58289
         2     16 0.55332  2  0.029574 0.4276 0.6593

         > 0.07866 + c(-1,1)*qt(.975, 16)*0.20268   ## CI for the ebat intercept adjustment
         [1] -0.3510024  0.5083224
         > exp(c(-0.3510024, 0.5083224))            ## back-transformed
         [1] 0.703982 1.662500
         > 0.10226 + c(-1,1)*qt(.975, 16)*0.11418   ## CI for the bird intercept adjustment
         [1] -0.1397908  0.3443108
         > exp(c(-0.1397908, 0.3443108))            ## back-transformed
         [1] 0.8695401 1.4110171

    ∗ Compare the results from R when we create our own indicator variables vs. letting R define them:

      ## LET R define the indicator variables for us:
      > batBird$TYPE <- read.csv("data/batBirds.csv", header = TRUE)$type
      > anova(lm(lenergy ~ lmass + TYPE, batBird))
      Analysis of Variance Table

      Response: lenergy
                Df  Sum Sq Mean Sq  F value    Pr(>F)
      lmass      1 29.3919 29.3919 849.9108 2.691e-15
      TYPE       2  0.0296  0.0148   0.4276    0.6593
      Residuals 16  0.5533  0.0346

      > summary(lm(lenergy ~ lmass + TYPE, batBird))$coef
                                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)                 -1.49770    0.14987  -9.993 2.77e-08
      lmass                        0.81496    0.04454  18.297 3.76e-12
      TYPEnon-echolocating bats   -0.07866    0.20268  -0.388    0.703
      TYPEnon-echolocating birds   0.02360    0.15760   0.150    0.883

      ### USE OUR OWN INDICATOR VARIABLES to match the output above
      ### (nbat is an indicator for non-echolocating bats, defined analogously to ebat and bird):
      ## lm(formula = lenergy ~ lmass + nbat + bird)
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept) -1.49770    0.14987  -9.993 2.77e-08
      lmass        0.81496    0.04454  18.297 3.76e-12
      nbat        -0.07866    0.20268  -0.388    0.703
      bird         0.02360    0.15760   0.150    0.883

10.4 Related Issues

10.4.1 More on R-Squared

• How can we always make R² equal 100%?

• What are more appropriate tools for model building and checking?

• Adjusted R-squared:

  – A version of R² that includes a penalty for unnecessary explanatory variables. It is useful for a casual assessment of improvement in fit (a numerical check for echo.fit1 appears at the end of these notes).

    Adjusted R² = [(Total MS) − (Residual MS)] / (Total MS) × 100%

  – What is the disadvantage of the adjusted R² relative to the usual R²?

10.4.2 Improving a Study with Replication

• Replication = taking repeated observations at the same X values.

• What does replication allow us to do?

• Where does our estimate of σ² come from in the absence of replication?

10.4.6 The Principle of Occam's Razor

• Simple models are preferred over complicated models. Shave off the excess.

• It is founded in common sense and successful experience.

• It is often called the Principle of Parsimony.
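As the numerical check mentioned in Section 10.4.1, the adjusted R² formula can be verified for echo.fit1 directly from the two mean squares (a minimal sketch, assuming batBird and echo.fit1 from earlier are still loaded).

  ## adjusted R-squared for echo.fit1 from the mean squares, checked against summary()
  total.ms    <- var(batBird$lenergy)          ## Total SS / (n - 1)
  residual.ms <- summary(echo.fit1)$sigma^2    ## Residual SS / (n - p) = 0.186^2
  (total.ms - residual.ms) / total.ms          ## about 0.978
  summary(echo.fit1)$adj.r.squared             ## the same value reported by R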