Multiple Regression: Part II (§12.4 - 12.7)

• Testing hierarchical models.
• Model building and variable selection.
• Checking model assumptions.
• Dummy variables.
• The Peruvian Indian Data - a complete example.

Testing Hierarchical Models

• Suppose we fit all available independent variables in a general multiple regression model (the complete model, Model 1):

  Y = b0 + b1x1 + b2x2 + b12x1x2 + e

• Now fit the same model with one or more of the terms removed (the reduced model, Model 2):

  Y = b0 + b1x1 + e

• Does the reduced model fit as well as the complete model?
• We test two parameters simultaneously: H0: b2 = b12 = 0.

Let SSE1 be the error sum of squares for the complete model (Y = b0 + b1x1 + b2x2 + b12x1x2) and SSE2 the error sum of squares for the reduced model (Y = b0 + b1x1). Since Model 1 includes more terms than Model 2, it should fit at least as well, hence SSE1 ≤ SSE2. The difference, SSE2 - SSE1, measures the drop in the error sum of squares attributable to the variables removed from the complete model.

F-Statistic for Hierarchical Models

Define the mean square drop as

  MSdrop = (SSE2 - SSE1) / (k - g),

where k is the number of terms in the complete model (Model 1) and g (< k) is the number of terms in the reduced model (Model 2). The mean square error for the complete model is

  MSE1 = SSE1 / (n - k - 1).

To test the hypothesis that the terms left out of the complete model do not contribute significantly to explaining the variability in y, we use the F statistic

  F = MSdrop / MSE1.

Reject H0 (all of the left-out parameters equal 0) if F > F(k-g, n-k-1, α).

Sequential Models

With this concept of partial and full models, we can look at models in two different ways.

  Y = b0 + e                              (add constant:      Mean Model)
  Y = b0 + b1x1 + e                       (add x1:            Model 1)
  Y = b0 + b1x1 + b2x2 + e                (add x2:            Model 2)
  Y = b0 + b1x1 + b2x2 + b12x1x2 + e      (add cross product: Model 3)

Questions: Is Model 1 better than the Mean Model? Is Model 2 better than Model 1? Is Model 3 better than Model 2? All of these are tested with a drop-type test.

Last-In Significance (Partial) Tests

  y = b0 + b1x1 + b2x2 + b3x3 + e    Does x3 add to the model containing x1, x2?
  y = b0 + b1x1 + b3x3 + b2x2 + e    Does x2 add to the model containing x1, x3?
  y = b0 + b2x2 + b3x3 + b1x1 + e    Does x1 add to the model containing x2, x3?

Our interest centers on how important a predictor is in explaining variability in y over and above what is explained by the predictors already in the model. All of these are also tested with a drop-type test.

Reduction (Drop) Sums of Squares

  Y = b0 + e                              (Mean Model)
  Y = b0 + b1x1 + e                       (Model 1)
  Y = b0 + b1x1 + b2x2 + e                (Model 2)
  Y = b0 + b1x1 + b2x2 + b12x1x2 + e      (Model 3)

Total SS = Model SS + Error SS

  R(b2 | b0, b1)      = Model SS2 - Model SS1 = Error SS1 - Error SS2
  R(b12 | b0, b1, b2) = Model SS3 - Model SS2 = Error SS2 - Error SS3

These are called adjusted or partial sums of squares: the additional variability in Y explained by Model 3, above and beyond what is already explained by Model 2.

SAS Type Sums of Squares

In SAS, Reduction (or Drop, or Sequential) Sums of Squares are presented in the Type I Sums of Squares table, which has the form:

  Source   Sums of Squares      Model fitted
  x1       R(b1 | b0)           Y = b0 + b1X1
  x2       R(b2 | b0, b1)       Y = b0 + b1X1 + b2X2
  x3       R(b3 | b0, b1, b2)   Y = b0 + b1X1 + b2X2 + b3X3

In SAS, Last-In (or Partial) Sums of Squares are presented in the Type III Sums of Squares table, which has the form:

  Source   Sums of Squares      Model fitted
  x1       R(b1 | b0, b2, b3)   Y = b0 + b1X1 + b2X2 + b3X3
  x2       R(b2 | b0, b1, b3)   Y = b0 + b1X1 + b2X2 + b3X3
  x3       R(b3 | b0, b1, b2)   Y = b0 + b1X1 + b2X2 + b3X3
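In R, the drop (hierarchical) F test is produced directly by anova() applied to two nested fits. A minimal sketch, using a hypothetical data frame dat with response y and predictors x1 and x2 (these names are illustrative, not from the slides):

  # Drop-type F test of H0: b2 = b12 = 0, comparing reduced and complete models
  fit.reduced  <- lm(y ~ x1, data = dat)               # Model 2 (reduced)
  fit.complete <- lm(y ~ x1 + x2 + x1:x2, data = dat)  # Model 1 (complete)
  anova(fit.reduced, fit.complete)   # F = MSdrop/MSE1 on (k - g, n - k - 1) df

The F statistic and p-value in the last row of the anova() table are exactly the MSdrop/MSE1 test described above.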
Variable/Predictor Selection

• Goal: find the "best" (an appropriate) subset of regressors (predictors) for the model from among all possible candidate regressors (all simple terms, polynomial terms, cross-product terms, etc.).
• Problem: how do we define "best"?
• The model should include as many regressors as possible, so that the information content in these factors can account comprehensively for the predictions of y.
• The model should include as few regressors as possible, since the variance of the predictions increases as the number of regressors increases. Besides, one wants a parsimonious (simple) model for ease of interpretation.

  Y = b0 + b1X1 + ... + bkXk + e

• Deleting variables from the complete model actually decreases the variances of the parameter estimates for the remaining explanatory variables. (R² always increases when variables are added to a model and decreases when variables are deleted.) The standard error of a coefficient estimate is

  $$ s_{\hat b_j} \;=\; \hat\sigma_e \sqrt{\frac{1}{S_{x_j x_j}\left(1 - R^2_{x_j \mid x_1,\ldots,x_{j-1},x_{j+1},\ldots,x_k}\right)}} $$

  where S_{xjxj} = Σ(x_ij - x̄_j)² and R²_{xj | other x's} is the R² from regressing x_j on the remaining predictors. Deleting predictors lowers that R² in the denominator, which shrinks s_{b̂j}.
• Thus, deleting variables from the complete model improves the precision of the estimates of the remaining parameters, as well as the precision of the predictions.
• However, if we remove an explanatory variable that is strongly associated with Y, we produce biased estimates of the remaining parameters, the residual variance, and the predictions.

Model Selection Criteria

• The coefficient of multiple determination: R² = SSR_k / S_YY. Choose models with high R². However, R² increases every time more predictors are added, regardless of their importance in predicting Y.
• Adjusted R²: Adj R² = 1 - (n - 1)(1 - R²)/(n - k - 1). Choose models with high adjusted R². It is better suited for model selection than R²: it increases only when an important predictor is added and decreases when an unimportant one is added.
• Residual mean square: MSE_k = SSE_k/(n - k - 1). Choose models with low MSE.
• Mallows' Cp statistic: Cp = SSE_k/MSE1 - n + 2(k + 1). Choose models with Cp ≈ k + 1. It measures the adequacy of the predictions from the reduced model relative to those from the full model. (MSE1 is the MSE of the full model, i.e. the model containing all available predictors.)
• Predicted Residual Sum of Squares (PRESS) statistic: PRESS = Σ[y_i - ŷ_(i)]², where ŷ_(i) is the estimate of the ith observation based on the model fitted to the remaining n - 1 observations. Choose models with low PRESS.
• Highly recommended: Akaike's Information Criterion (AIC). It uses concepts from information theory to measure how far the candidate model is from the "true" model. Choose models with low AIC. (Burnham, K. P., and D. R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York, NY. ISBN 0-387-95364-7.) For a multiple regression model with n observations and k predictors, the small-sample corrected version is

  $$ \mathrm{AICc} \;=\; n \log\!\left(\frac{\mathrm{MSE}\,(n-k-1)}{n}\right) + \frac{2n(k+2)}{n-k-3} $$

• Keep in mind that one should have (about) at least 10 times more observations than predictors in the candidate models, i.e. 10k < n.
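The PRESS and AICc criteria can be computed directly from a fitted lm object in R. A minimal sketch, assuming a fitted model stored in an object called fit (a hypothetical name); the AICc expression is the formula given above, with k the number of predictors:

  # PRESS via the deleted-residual identity e_(i) = e_i / (1 - h_i)
  press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

  # AICc = n log(SSE/n) + 2n(k + 2)/(n - k - 3), as defined above
  n    <- length(residuals(fit))
  k    <- length(coef(fit)) - 1      # number of predictors (intercept excluded)
  sse  <- sum(residuals(fit)^2)
  aicc <- n * log(sse / n) + 2 * n * (k + 2) / (n - k - 3)

Note that MSE(n - k - 1)/n = SSE/n, so this matches the AICc formula above.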
Automatic Model Selection Approaches

• Backward Elimination - First fit the model with all possible predictors, then sequentially eliminate the predictors that are least significant in a last-in (partial) test.
• Forward Selection - First find the one predictor, call it x1, that does the best job of explaining variation in y. Then add to the model the predictor that is most significant (via a last-in test) when added after x1. Continue until no additional predictor contributes significantly according to a last-in test.
• Stepwise Selection - Begin as with forward selection, but each time a new predictor is added to the model, check all other predictors with a last-in test to determine whether they should remain in the model. Drop any predictor that cannot pass the last-in test. (This method is generally preferred over the other two.)

Checking Model Assumptions in Multiple Regression

Basic assumptions:
1. Zero expectation: E(e_i) = 0 for all i.
2. Constant variance: V(e_i) = σ²_e for all i.
3. Normality: the e_i are normally distributed.
4. Independence: Corr(e_i, e_j) = 0 for all i ≠ j.

These are checked using the residuals as estimates of the e_i: ê_i = y_i - ŷ_i.

Tools for Assumption Checking

1. Zero expectation: E(e_i) = 0 for all i. Automatically satisfied if we implement least squares estimation correctly.
2. Constant variance: V(e_i) = σ²_e for all i. Use a plot of residuals versus predicted values; the points should display the same spread regardless of the value of the predicted response. Alternatives if this fails: weighted least squares, or generalized linear models.
3. Normality: the e_i are normally distributed. Use normal probability plots (quantile-quantile plots), formal tests of normality, histograms, and boxplots.
4. Independence: Corr(e_i, e_j) = 0 for all i ≠ j. For time (or space) series data: a time series plot, serial autocorrelation estimates and plots, the Durbin-Watson statistic. For non-time-series data: know how the data were collected; was randomization used? There are no formal statistical tests or indices.

Comparing the Slopes of Two or More Regression Lines

Suppose we have a quantitative explanatory variable, X1, and two possible regression lines: one for situation 1 (location A), the other for situation 2 (location B). This is handled with dummy (or classification) variables in regression.

  y = b0 + b1x1   (location A)
  y = b2 + b3x1   (location B)

  Equal intercepts: H0: b0 = b2.     Equal slopes: H0: b1 = b3.

Reformulate the model with a dummy variable. Define a new variable, x2, such that x2 = 0 for situation 1 (location A) and x2 = 1 for situation 2 (location B). Then use multiple regression:

  ŷ = β0 + β1x1 + β2x2 + β3x1x2

When x2 = 0 we have:  ŷ = β0 + β1x1 = b0 + b1x1.
When x2 = 1 we have:  ŷ = (β0 + β2) + (β1 + β3)x1 = b2 + b3x1.

The test of β2 = 0 is equivalent to testing for no intercept difference, and the test of β3 = 0 is equivalent to testing for no slope difference. The tests are based on reduction (drop) sums of squares, as previously defined; a code sketch appears after the generalized linear model summary below.

[Figure: four panels showing the possible cases - β2 = 0 and β3 = 0 (a single common line); β2 ≠ 0 and β3 = 0 (parallel lines with different intercepts); β2 = 0 and β3 ≠ 0 (common intercept, different slopes); β2 ≠ 0 and β3 ≠ 0 (different intercepts and different slopes).]

Generalized Linear Models

Each generalized linear model has three parts: a distributional specification, a link function, and a linear predictor η = b0 + b1x1 + b2x2 + ...

• Normal regression: y ~ N(μ, σ²_e), with μ = η.
• Poisson regression: Y ~ Poisson(λ), with E(Y) = λ = exp(η).
• Logistic regression: Y ~ Binomial(n, p), with E(Y) = np, where p = exp(η)/[1 + exp(η)].
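As promised above, here is a minimal R sketch of the dummy-variable comparison of two regression lines. The data frame dat, the response y, the quantitative predictor x1, and the 0/1 location indicator x2 are hypothetical names used only for illustration:

  fit.common   <- lm(y ~ x1, data = dat)       # one common line for both locations
  fit.separate <- lm(y ~ x1 * x2, data = dat)  # expands to x1 + x2 + x1:x2

  summary(fit.separate)            # t test of x2 -> intercepts; t test of x1:x2 -> slopes
  anova(fit.common, fit.separate)  # drop test of H0: beta2 = beta3 = 0

Coding x2 as a two-level factor gives an equivalent fit, since R then creates the 0/1 dummy automatically.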
Multiple Regression Example: The Peruvian Indian Data

Anthropologists studying the long-term effects of an environmental change on systolic blood pressure measured this and several other characteristics of 39 Indians who migrated from a primitive environment high in the Andes into mainstream Peruvian society at a lower elevation. The variables were:

  X1 = Age (years)
  X2 = Years since migration
  X3 = Weight (kg)
  X4 = Height (mm)
  X5 = Chin skin fold (mm)
  X6 = Forearm skin fold (mm)
  X7 = Calf skin fold (mm)
  X8 = Pulse rate (beats/min)
  X9 = Systolic blood pressure
  X10 = Diastolic blood pressure
  X11 = Years since migration divided by age

The data are available in Minitab as PERU.MTW. The response Y is X9 (systolic blood pressure). An excerpt of the 39 observations:

  Obs  Age  Years  Weight  Height  Chin  Forearm  Calf  Pulse  Systol  Diastol  Fraction
  1    21   1      71.0    1629    8.0   7.0      12.7  88     170     76       0.047619
  ...
  8    28   25     53.0    1568    3.7   4.3      0.0   80     108     62       0.89286
  ...
  39   54   40     87.0    1542    11.3  11.7     11.3  92     152     88       0.740741

Pairwise scatterplots

R: if the data are read into a data frame called "peru" with read.table, issue the command

  > pairs(peru)

MTB: Graph > Matrix Plot

Matrix of all pairwise correlations (R)

(In MTB: Stat > Basic Stats > Correlation)

  > cor(peru)
                V1           V2         V3           V4           V5          V6
  V1   1.000000000  0.588212502  0.4316630  0.055777982  0.157908294  0.05520278
  V2   0.588212502  1.000000000  0.4811534  0.072594154  0.221697674  0.14302404
  V3   0.431662982  0.481153366  1.0000000  0.450330307  0.561748764  0.54373244
  V4   0.055777982  0.072594154  0.4503303  1.000000000 -0.007898078 -0.06893212
  V5   0.157908294  0.221697674  0.5617488 -0.007898078  1.000000000  0.63788150
  V6   0.055202779  0.143024038  0.5437324 -0.068932124  0.637881501  1.00000000
  V7  -0.005374411  0.001099438  0.3918655 -0.002845856  0.515999762  0.73552594
  V8   0.090654502  0.236904643  0.3117934  0.007829993  0.223100921  0.42190760
  V9   0.005844807 -0.087480460  0.5213643  0.219114553  0.170192453  0.27228023
  V10  0.038725834  0.075792139  0.3944963  0.253040787  0.088787528  0.21237426
  V11  0.364523488  0.938145452  0.2930832  0.051187749  0.120091662  0.02801547
                V7          V8           V9         V10         V11
  V1  -0.005374411 0.090654502  0.005844807  0.03872583  0.36452349
  V2   0.001099438 0.236904643 -0.087480460  0.07579214  0.93814545
  V3   0.391865474 0.311793359  0.521364290  0.39449626  0.29308318
  V4  -0.002845856 0.007829993  0.219114553  0.25304079  0.05118775
  V5   0.515999762 0.223100921  0.170192453  0.08878753  0.12009166
  V6   0.735525936 0.421907596  0.272280231  0.21237426  0.02801547
  V7   1.000000000 0.208715412  0.250789289  0.30649050 -0.11301589
  V8   0.208715412 1.000000000  0.135477107  0.05969512  0.21419489
  V9   0.250789289 0.135477107  1.000000000  0.47519113 -0.27614544
  V10  0.306490503 0.059695117  0.475191134  1.00000000 -0.05101310
  V11 -0.113015894 0.214194894 -0.276145438 -0.05101310  1.00000000

Note the very high correlation (0.938) between V2 (years since migration) and V11 (fraction), which foreshadows the multicollinearity seen in the VIFs below.

Fit full model for subsequent comparison (MTB)

Stat > Regression > Options > Variance Inflation Factors

The regression equation is
Systol = 142 - 1.05 Age + 2.28 Years + 1.35 Weight - 0.0361 Height - 0.841 Chin
         - 1.09 Forearm - 0.232 Calf + 0.116 Pulse + 0.100 Diastol - 108 Fraction

  Predictor   Coef      SE Coef   T      P      VIF
  Constant    141.93    49.98     2.84   0.008
  Age         -1.0487   0.3480    -3.01  0.005  3.6
  Years       2.2775    0.8644    2.63   0.014  37.9
  Weight      1.3546    0.4441    3.05   0.005  4.9
  Height      -0.03606  0.03728   -0.97  0.342  1.9
  Chin        -0.8412   0.7637    -1.10  0.280  2.2
  Forearm     -1.088    1.211     -0.90  0.377  3.8
  Calf        -0.2316   0.5532    -0.42  0.679  2.5
  Pulse       0.1161    0.1721    0.67   0.505  1.3
  Diastol     0.1003    0.1500    0.67   0.509  1.5
  Fraction    -108.36   32.17     -3.37  0.002  27.2

Some of these VIFs (37.9 for Years, 27.2 for Fraction) are too high!

S = 8.739   R-Sq = 67.3%   R-Sq(adj) = 55.6%

Analysis of Variance

  Source          DF  SS       MS      F     P
  Regression      10  4393.03  439.30  5.75  0.000
  Residual Error  28  2138.41  76.37
  Total           38  6531.44

  Source    DF  Seq SS
  Age       1   0.22
  Years     1   82.55
  Weight    1   2693.40
  Height    1   61.37
  Chin      1   366.86
  Forearm   1   42.69
  Calf      1   14.67
  Pulse     1   2.95
  Diastol   1   261.93
  Fraction  1   866.38

Unusual Observations

  Obs  Age   Systol  Fit     SE Fit  Residual  St Resid
  1    21.0  170.00  155.02  6.21    14.98     2.44R
  8    28.0  108.00  95.72   6.55    12.28     2.12R

R denotes an observation with a large standardized residual.

VIFs in R: see p. 205 of Faraway's Practical Regression and ANOVA using R (downloadable from the R Resources on the course website). The package "car", available from CRAN, also computes them.
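The VIFs reported by Minitab can also be reproduced directly, since VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. A minimal sketch, assuming the peru data frame above (the object name gfull is ours, not from the slides):

  gfull <- lm(V9 ~ ., data = peru)          # same full model as the Minitab fit
  X     <- model.matrix(gfull)[, -1]        # predictor columns, intercept dropped
  vifs  <- sapply(seq_len(ncol(X)), function(j)
             1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared))
  names(vifs) <- colnames(X)
  round(vifs, 1)                            # compare with the VIF column above

The vif() function in the "car" package mentioned above returns the same values.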
Stepwise selection (R)

(In MTB: Stat > Regression > Stepwise)

  > g <- lm(V9 ~ ., data = peru)
  > step(g, trace = F)

  Call:
  lm(formula = V9 ~ V1 + V2 + V3 + V5 + V11, data = peru)

  Coefficients:
  (Intercept)        V1        V2        V3        V5       V11
      109.359    -1.012     2.407     1.098    -1.192  -110.811

All subsets adjusted R² selection (R)

(In MTB: Stat > Regression > Best Subsets, which reports all subsets based on R², adjusted R², Cp, and s, the root MSE.)

  > library(leaps)
  > g <- leaps(peru[,-9], peru$V9, nbest=1,
               names=c("V1","V2","V3","V4","V5","V6","V7","V8","V10","V11"),
               method="adjr2")
  > plot(g$size, g$adjr2)        # g$size is the number of parameters, k + 1
  > for (i in 1:dim(g$which)[1]) { print(colnames(g$which)[g$which[i,]]) }
  [1] "V3"
  [2] "V3" "V11"
  [3] "V3" "V10" "V11"
  [4] "V1" "V2" "V3" "V11"
  [5] "V1" "V2" "V3" "V5" "V11"
  [6] "V1" "V2" "V3" "V5" "V6" "V11"
  [7] "V1" "V2" "V3" "V4" "V5" "V6" "V11"
  [8] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V11"
  [9] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V10" "V11"
  [10] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V10" "V11"

The model with V3, V11 is the best two-predictor subset; the model with V1, V2, V3, V5, V11 is best by adjusted R² (it is also the model selected by stepwise).

All subsets Cp selection (R)

  > g <- leaps(peru[,-9], peru$V9, nbest=1,
               names=c("V1","V2","V3","V4","V5","V6","V7","V8","V10","V11"),
               method="Cp")
  > plot(g$size, g$Cp)
  > abline(0,1)
  > g$which
  > for (i in 1:dim(g$which)[1]) { print(colnames(g$which)[g$which[i,]]) }
  [1] "V3"
  [2] "V3" "V11"
  [3] "V3" "V10" "V11"
  [4] "V1" "V2" "V3" "V11"
  [5] "V1" "V2" "V3" "V5" "V11"
  [6] "V1" "V2" "V3" "V5" "V6" "V11"
  [7] "V1" "V2" "V3" "V4" "V5" "V6" "V11"
  [8] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V11"
  [9] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V10" "V11"
  [10] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V10" "V11"

The Cp plot (the abline(0,1) line marks Cp = k + 1) labels the models with V3, V11; with V1, V2, V3, V11; with V1, V2, V3, V5, V11; and the full model, and shows which subset the Cp ≈ k + 1 criterion favors.

All subsets AICc selection (R)

  > g <- leaps(peru[,-9], peru$V9, nbest=1,
               names=c("V1","V2","V3","V4","V5","V6","V7","V8","V10","V11"),
               method="r2")
  > k <- g$size - 1; r2 <- g$r2; n <- length(peru$V9); sst <- (n-1)*var(peru$V9)
  > plot(k+1, n*log((1-r2)*sst/n) + 2*n*(k+2)/(n-k-3), xlab="k+1", ylab="AICc")
  > g$which
  > for (i in 1:dim(g$which)[1]) { print(colnames(g$which)[g$which[i,]]) }
  [1] "V3"
  [2] "V3" "V11"
  [3] "V3" "V10" "V11"
  [4] "V1" "V2" "V3" "V11"
  [5] "V1" "V2" "V3" "V5" "V11"
  [6] "V1" "V2" "V3" "V5" "V6" "V11"
  [7] "V1" "V2" "V3" "V4" "V5" "V6" "V11"
  [8] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V11"
  [9] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V10" "V11"
  [10] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V10" "V11"

Within a given subset size, the ranking of models by high R² is the same as the ranking by low AICc, so the best-R² model of each size is also the lowest-AICc model of that size. Plotting the AICc for each of these ten best models shows that the model with V1, V2, V3, V5, V11 is best according to AICc.

Fit the 5-variable model

Systol = 109 - 1.01 Age + 2.41 Years + 1.10 Weight - 1.19 Chin - 111 Fraction

  Predictor  Coef     SE Coef  T      P      VIF
  Constant   109.36   21.48    5.09   0.000
  Age        -1.0120  0.3059   -3.31  0.002  2.939
  Years      2.4067   0.7426   3.24   0.003  29.853
  Chin       -1.1918  0.6140   -1.94  0.061  1.484
  Weight     1.0976   0.2980   3.68   0.001  2.378
  Fraction   -110.81  27.28    -4.06  0.000  20.886

  S = 8.45707   R-Sq = 63.9%   R-Sq(adj) = 58.4%

  Analysis of Variance
  Source          DF  SS       MS      F      P
  Regression      5   4171.21  834.24  11.66  0.000
  Residual Error  33  2360.23  71.52
  Total           38  6531.44

Fit the 2-variable model

Systol = 60.9 + 1.22 Weight - 26.8 Fraction

  Predictor  Coef     SE Coef  T      P      VIF
  Constant   60.90    14.28    4.26   0.000
  Weight     1.2169   0.2337   5.21   0.000  1.094
  Fraction   -26.767  7.218    -3.71  0.001  1.094

  S = 9.77719   R-Sq = 47.3%   R-Sq(adj) = 44.4%

  Analysis of Variance
  Source          DF  SS      MS      F      P
  Regression      2   3090.1  1545.0  16.16  0.000
  Residual Error  36  3441.4  95.6
  Total           38  6531.4
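The drop-SS comparison of these two candidate models, worked by hand next, can also be obtained directly in R with anova(), assuming the same peru data frame (the object names g2 and g5 are ours):

  g2 <- lm(V9 ~ V3 + V11, data = peru)                  # 2-variable model
  g5 <- lm(V9 ~ V1 + V2 + V3 + V5 + V11, data = peru)   # 5-variable model
  anova(g2, g5)    # F on (3, 33) df; compare with the hand calculation below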
Carry out a "Drop SS Test" to compare these two models

  H0: Model 2 (V3, V11) is adequate;  SSE2 = 3441.4, df2 = 36.
  Ha: Model 1 (V1, V2, V3, V5, V11) is needed;  SSE1 = 2360.2, df1 = 33, MSE1 = 71.5.

  F = [(SSE2 - SSE1)/(df2 - df1)] / MSE1 = [(3441.4 - 2360.2)/(36 - 33)] / 71.5 = 5.04

  F(df2-df1, df1, α) = F(3, 33, .05) = 2.89

Since 5.04 > 2.89, the "explanatory" capability of Model 1 (on Y) is significantly greater than that of Model 2, but that doesn't mean it is a "better" model. Because of the multicollinearity and the added complexity, I'll go with Model 2.

Check for Interaction in Final Model (R)

  > g0 <- lm(V9 ~ V3 + V11, data = peru)
  > g1 <- lm(V9 ~ V3 * V11, data = peru)
  > summary(g1)

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  52.2224    34.1710   1.528   0.1354
  V3            1.3560     0.5501   2.465   0.0187 *
  V11          -9.8619    60.7797  -0.162   0.8720
  V3:V11       -0.2672     0.9536  -0.280   0.7810
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 9.905 on 35 degrees of freedom
  Multiple R-Squared: 0.4743, Adjusted R-squared: 0.4292
  F-statistic: 10.53 on 3 and 35 DF, p-value: 4.433e-05

The interaction term (V3:V11) is not significant.

  > summary(g0)

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  60.8959    14.2809   4.264 0.000138 ***
  V3            1.2169     0.2337   5.207 7.97e-06 ***
  V11         -26.7672     7.2178  -3.708 0.000699 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 9.777 on 36 degrees of freedom
  Multiple R-Squared: 0.4731, Adjusted R-squared: 0.4438
  F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06

Both variables are highly significant! The final fitted model is

  Systol = 60.9 + 1.2*weight - 26.8*fraction

Influence (R)

  > influence.measures(g0)
         dfb.1.    dfb.V3   dfb.V11    dffit  cov.r    cook.d     hat  inf
  1   -0.611150   0.82840  -0.96911  1.23380  0.664  4.19e-01  0.1508  *
  8    0.270300  -0.32305   0.40802  0.48303  1.431  7.86e-02  0.2776  *
  39  -0.392296   0.37889   0.07270  0.44230  1.574  6.63e-02  0.3314  *

  (Only the three observations flagged as potentially influential are shown.)

  > g0$resid[c(1,8,39)]
          1         8        39
  23.981894  6.509972  5.065127

[Figure: the fitted regression equation (plane) and the data points.]

If you have to do a detailed regression analysis, get the following book: Applied Linear Regression Models, by Kutner, Nachtsheim, & Neter, 4th edition, 2004, McGraw-Hill/Irwin.
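Finally, the assumption checks described earlier in the lecture can be applied to the chosen two-variable model. A minimal sketch using the g0 fit from above:

  plot(fitted(g0), residuals(g0),
       xlab = "Fitted values", ylab = "Residuals")   # constant-variance check
  abline(h = 0, lty = 2)
  qqnorm(residuals(g0)); qqline(residuals(g0))       # normality check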