Multiple Linear Regression

Extend model to many explanatory variables
• Fitting as before, output essentially the same
• F-test for overall significance
• Individual t-tests for each variable
  – results depend on which other variables are in the model (sequential sums of squares also depend on the order of inclusion)
  – explanatory variables can be highly related

Multiple Regression - Example

Regression Analysis: edible versus height, width, length, shell

The regression equation is
edible = - 10.2 + 0.144 height + 0.204 width - 0.0199 length + 0.0884 shell

Predictor      Coef   SE Coef      T      P
Constant    -10.189     4.640  -2.20  0.031
height      0.14350   0.07481   1.92  0.059
width        0.2043    0.1600   1.28  0.206
length     -0.01989   0.03712  -0.54  0.594
shell       0.08843   0.01550   5.71  0.000

S = 4.04328   R-Sq = 88.7%   R-Sq(adj) = 88.1%

ANOVA – sums of squares

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       4   9874.0  2468.5  151.00  0.000
Residual Error  77   1258.8    16.3
Total           81  11132.8

Terms entered in the order height, width, length, shell:
Source  DF  Seq SS
height   1  8631.4
width    1   710.3
length   1     0.1
shell    1   532.1

Terms entered in the order shell, length, width, height:
Source  DF  Seq SS
shell    1  9671.1
length   1    95.1
width    1    47.6
height   1    60.2

Order of terms matters!

Variable Selection Methods

• Best subsets – looks at all possible selections of explanatory variables
• Stepwise
  – Forward – add in variables one at a time
  – Backward – start with full model and remove insignificant variables
  – Full stepwise – combination of forward and backward

Best subsets

Response is edible

Vars  R-Sq  R-Sq(adj)  Mallows C-p       S  height  width  length  shell
   1  86.9       86.7         11.4  4.2744                             X
   2  88.5       88.2          2.6  4.0339       X                     X
   3  88.7       88.2          3.3  4.0248       X      X              X
   4  88.7       88.1          5.0  4.0433       X      X       X      X

Look for maximum R-Sq(adj) and minimum C-p

The regression equation is
edible = - 9.81 + 0.0994 shell + 0.161 height

Predictor     Coef  SE Coef      T      P
Constant    -9.813    4.402  -2.23  0.029
shell      0.09939  0.01150   8.64  0.000
height     0.16070  0.04884   3.29  0.001

S = 4.03386   R-Sq = 88.5%   R-Sq(adj) = 88.2%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       2   9847.3  4923.6  302.58  0.000
Residual Error  79   1285.5    16.3
Total           81  11132.8

Forward Selection

Response is edible on 4 predictors, with N = 82

Step               1       2
Constant       4.423  -9.813

shell         0.1327  0.0994
T-Value        23.01    8.64
P-Value        0.000   0.000

height                 0.161
T-Value                 3.29
P-Value                0.001

S               4.27    4.03
R-Sq           86.87   88.45
R-Sq(adj)      86.71   88.16
Mallows C-p     11.4     2.6

Leads to the same model: shell + height

Model Checking – not great!

[Figure: Residual Plots for edible. Four panels: Normal Probability Plot of the Residuals (Percent vs. Residual); Residuals Versus the Fitted Values; Histogram of the Residuals (Frequency vs. Residual); Residuals Versus the Order of the Data (Residual vs. Observation Order)]

Categorical Data; Factors

• Categorical variables defined as factors are not handled automatically by regression routines => need to transform to dummy variables!
• A k-category factor has
  – k levels
  – unique labels for each level
  – Use Make Indicator Variables to create dummy 0/1 indicators

Factor - example

• Create 0/1 indicators for SpeciesGroup
  – Name them oyster, mussel
• Tally for Discrete Variables: SpeciesGroup, oyster, mussel

SpeciesGroup  Count
M                92
O                81
N=              173

mussel  Count
0          81
1          92
N=        173

oyster  Count
0          92
1          81
N=        173

Comparing Two Groups

• Simply fit a regression model with a dummy variable for the two-level factor as covariate.
• This regression analysis is equivalent to a 2-sample t-test with equal variances.

Regression Output

The regression equation is
Cadmium = 0.166 + 0.219 oyster

Predictor     Coef  SE Coef      T      P
Constant   0.16562  0.01286  12.87  0.000
oyster     0.21919  0.01876  11.68  0.000

• The intercept is the mean for the first level of the factor (M)
• The oyster estimate is the difference between the means
• The t-value tests whether this difference is significantly different from zero, i.e. whether the two means are equal

t-test output

Two-sample T for Cadmium

SpeciesGroup   N    Mean   StDev  SE Mean
M             89  0.1656  0.0870   0.0092
O             79   0.385   0.151    0.017

Difference = mu (M) - mu (O)
Estimate for difference: -0.219192
95% CI for difference: (-0.256232, -0.182152)
T-Test of difference = 0 (vs not =): T-Value = -11.68  P-Value = 0.000  DF = 166
Both use Pooled StDev = 0.1214

ANOVA table: Cadmium versus oysters

Analysis of Variance
Source       DF       SS       MS       F      P
Regression    1  2.01075  2.01075  136.51  0.000
Error       166  2.44516  0.01473
Total       167  4.45591

Same conclusion of significant species effect – the F-test is equivalent to the t-test for the slope parameter (here: oyster). The F-test is more general and extends to factors with more than 2 levels …

ANOVA with one k-level factor (k ≥ 3): "One-Way ANOVA"

Comparison of means between groups
• Linear regression model with a k-level factor
• Assumptions: normality, independence, equal variances within groups, continuous response … but ANOVA is quite robust and tolerates violations to some extent if the group sample sizes are similar.

For the 3-category species variable

Regression Analysis: Cadmium versus CG; ME

Predictor      Coef  SE Coef       T      P
Constant    0.46031  0.02013   22.87  0.000
CG         -0.12691  0.02609   -4.86  0.000
ME         -0.29469  0.02347  -12.56  0.000

Analysis of Variance
Source           DF      SS      MS      F      P
Regression        2  2.3174  1.1587  89.40  0.000
Residual Error  165  2.1385  0.0130
Total           167  4.4559

Overall test of differences between groups
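
The dummy-variable results in the slides can be checked numerically. The sketch below is not the Minitab workflow used in the slides: it is a NumPy illustration on synthetic data (the group labels A, B, C, the group means, and the helper names `dummy_regression`, `one_way_anova_f` and `pooled_t` are all assumptions made for this example). It fits a regression on 0/1 indicators and compares the result with a one-way ANOVA F statistic and a pooled two-sample t statistic computed directly from the group means.

```python
import numpy as np


def dummy_regression(y, groups, baseline):
    """OLS of y on 0/1 indicators for every level except `baseline`.

    Returns the coefficients (keyed by name) and the overall regression F.
    """
    levels = [g for g in np.unique(groups).tolist() if g != baseline]
    X = np.column_stack(
        [np.ones(len(y))] + [(groups == g).astype(float) for g in levels]
    )
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sse = float(resid @ resid)                    # residual error SS
    sst = float(((y - y.mean()) ** 2).sum())      # total SS
    df_reg = X.shape[1] - 1
    df_err = len(y) - X.shape[1]
    f_stat = ((sst - sse) / df_reg) / (sse / df_err)
    return dict(zip(["Constant"] + levels, coef.tolist())), f_stat


def one_way_anova_f(y, groups):
    """Classical one-way ANOVA F from between- and within-group SS."""
    levels = np.unique(groups).tolist()
    grand = y.mean()
    ss_between = sum(
        (groups == g).sum() * (y[groups == g].mean() - grand) ** 2
        for g in levels
    )
    ss_within = sum(
        ((y[groups == g] - y[groups == g].mean()) ** 2).sum() for g in levels
    )
    return (ss_between / (len(levels) - 1)) / (ss_within / (len(y) - len(levels)))


def pooled_t(y1, y0):
    """Two-sample t statistic with pooled (equal) variances."""
    n1, n0 = len(y1), len(y0)
    sp2 = ((n1 - 1) * y1.var(ddof=1) + (n0 - 1) * y0.var(ddof=1)) / (n1 + n0 - 2)
    return (y1.mean() - y0.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n0))


# Synthetic data only: three made-up groups with made-up means.
rng = np.random.default_rng(0)
sizes = {"A": 30, "B": 25, "C": 35}
true_means = {"A": 1.0, "B": 0.7, "C": 0.4}
groups = np.repeat(list(sizes), list(sizes.values()))
y = np.concatenate([rng.normal(true_means[g], 0.2, n) for g, n in sizes.items()])

# Three levels: regression on two dummies versus one-way ANOVA.
coef3, f_reg3 = dummy_regression(y, groups, baseline="A")
f_anova3 = one_way_anova_f(y, groups)

# Two levels (A and B only): regression on one dummy versus pooled t-test.
two = groups != "C"
coef2, f_reg2 = dummy_regression(y[two], groups[two], baseline="A")
t_ab = pooled_t(y[groups == "B"], y[groups == "A"])

print("intercept vs baseline mean:", coef3["Constant"], y[groups == "A"].mean())
print("regression F vs ANOVA F   :", f_reg3, f_anova3)
print("regression F vs t squared :", f_reg2, t_ab ** 2)
```

Each printed pair should agree up to floating-point error: the intercept equals the baseline group mean, each dummy coefficient equals that group's mean minus the baseline mean, the overall regression F equals the one-way ANOVA F, and with a single dummy the F statistic is the square of the pooled t statistic.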