Maths 0146 Statistical Modelling in Biology
SPSS Practical: Multiple Regression - Solutions

1.6 Exercise - Mass and Physical Measurements for Male Subjects

Examine graphically the relationships between mass and the possible explanatory variables. Do the assumptions needed to perform linear regression seem reasonable?

[Ten partial regression plots, dependent variable MASS, one for each of the explanatory variables FORE, BICEP, CHEST, NECK, SHOULDER, WASTE, HEIGHT, CALF, THIGH and HEAD.]

Fit the 'full model', containing all of the possible explanatory variables. Which variables have significant coefficients?

Coefficients (a)

Model          B           Std. Error   Beta     t       Sig.
1 (Constant)   -69.517     29.037                -2.394  .036
  FORE         1.782       .855         .312     2.085   .061
  BICEP        .155        .485         .041     .320    .755
  CHEST        .189        .226         .116     .838    .420
  NECK         -.482       .721         -.081    -.669   .518
  SHOULDER     -2.931E-02  .239         -.017    -.122   .905
  WASTE        .661        .116         .470     5.679   .000
  HEIGHT       .318        .130         .180     2.438   .033
  CALF         .446        .413         .098     1.081   .303
  THIGH        .297        .305         .097     .974    .351
  HEAD         -.920       .520         -.105    -1.768  .105

a. Dependent Variable: MASS
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

In the full model only the waist and height coefficients (and the constant) are significant at the 5% level; the forearm coefficient is borderline (Sig. = .061).

How would you decide which explanatory variables to include in the model?
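The "which coefficients are significant" question can be checked by hand from the table: each t statistic is simply B divided by its standard error. A minimal sketch in Python (not part of the SPSS session; B and SE are the rounded printed values, so the recomputed t's agree with the printed ones only approximately):

```python
# Recompute t = B / SE(B) for three rows of the full-model table.
# Values are copied from the rounded SPSS output, so recomputed
# t statistics match the printed ones only to rounding error.
rows = {
    # name: (B, SE, printed t, printed Sig.)
    "WASTE":  (0.661, 0.116, 5.679, 0.000),
    "HEIGHT": (0.318, 0.130, 2.438, 0.033),
    "FORE":   (1.782, 0.855, 2.085, 0.061),
}
for name, (b, se, t_printed, sig) in rows.items():
    t = b / se
    flag = "significant at 5%" if sig < 0.05 else "not significant at 5%"
    print(f"{name}: t = {t:.2f} (printed {t_printed}), {flag}")
```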
Choose a model which you think provides a reasonable fit to these data.

You could run a stepwise regression, but take great care over how, and which, variables it includes at each step. You should consider performing some sensitivity analysis: where there is a close decision as to which variable to include at a particular stage, explore what would happen under both possibilities. Alternatively, and a better option, you could build the models yourself, using diagnostics at each stage (examining residuals, testing coefficients and sums of squares) and bearing in mind the purpose of the model and its interpretation. This may be more time consuming, and it is advisable to consider a series of nested models as you build up from a simple model to more complex ones.

Here are the coefficients in the models at the four stages of a stepwise regression:

Model          B          Std. Error   Beta    t        Sig.
1 (Constant)   -36.191    10.895               -3.322   .003
  WASTE        1.287      .127         .915    10.148   .000
2 (Constant)   -68.718    9.199                -7.470   .000
  WASTE        .773       .125         .550    6.199    .000
  FORE         2.755      .506         .482    5.439    .000
3 (Constant)   -107.488   15.998               -6.719   .000
  WASTE        .732       .108         .520    6.772    .000
  FORE         2.579      .439         .452    5.870    .000
  HEIGHT       .264       .095         .149    2.787    .012
4 (Constant)   -113.312   14.639               -7.740   .000
  WASTE        .647       .104         .460    6.201    .000
  FORE         2.036      .462         .356    4.402    .000
  HEIGHT       .272       .085         .154    3.179    .005
  THIGH        .540       .237         .177    2.275    .036

a. Dependent Variable: MASS
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

The final model then contains the waist, forearm, height and thigh measurements. The ANOVA tables for the models at each stage are as follows; note that the residual mean square is reduced at each stage, indicating a progressively better fit to the data.
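Each stepwise decision can also be checked with an extra-sum-of-squares (partial F) test. A sketch in Python (not part of the SPSS session), using the residual sums of squares printed in the stage 3 and stage 4 ANOVA tables below:

```python
# Partial F test for adding THIGH at stage 4, using the residual sums
# of squares printed in the ANOVA tables (stages 3 and 4).
rss_3, df_3 = 112.149, 18   # WASTE, FORE, HEIGHT
rss_4, df_4 = 85.974, 17    # WASTE, FORE, HEIGHT, THIGH

f = ((rss_3 - rss_4) / (df_3 - df_4)) / (rss_4 / df_4)
print(f"partial F = {f:.2f}")  # partial F = 5.18
```

When a single term is added, this F equals the square of that term's t statistic: THIGH has t = 2.275, and 2.275 squared is about 5.18, matching the F above.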
Model          Sum of Squares   df   Mean Square   F         Sig.
1 Regression   2113.643         1    2113.643      102.978   .000 (a)
  Residual     410.505          20   20.525
  Total        2524.148         21
2 Regression   2363.612         2    1181.806      139.871   .000 (b)
  Residual     160.536          19   8.449
  Total        2524.148         21
3 Regression   2411.999         3    804.000       129.043   .000 (c)
  Residual     112.149          18   6.230
  Total        2524.148         21
4 Regression   2438.173         4    609.543       120.527   .000 (d)
  Residual     85.974           17   5.057
  Total        2524.148         21

a. Predictors: (Constant), WASTE
b. Predictors: (Constant), WASTE, FORE
c. Predictors: (Constant), WASTE, FORE, HEIGHT
d. Predictors: (Constant), WASTE, FORE, HEIGHT, THIGH

The results from a similar procedure, regression by leaps and bounds (done in S-Plus), are the same:

> leaps.mass <- leaps(physical[,2:11], Mass, nbest=3)
> df.mass <- data.frame(p=leaps.mass$size, Cp=leaps.mass$Cp)
> round(df.mass, 2)

                                                               p      Cp
Waist                                                          2   60.50
Fore                                                           2   74.80
Shoulder                                                       2  110.36
Fore,Waist                                                     3   14.70
Waist,Calf                                                     3   25.25
Shoulder,Waist                                                 3   29.54
Fore,Waist,Height                                              4    7.45
Fore,Waist,Calf                                                4   11.18
Fore,Waist,Thigh                                               4   12.21
Fore,Waist,Height,Thigh                                        5    4.44
Fore,Waist,Height,Calf                                         5    6.10
Fore,Waist,Height,Head                                         5    6.83
Fore,Waist,Height,Thigh,Head                                   6    4.14
Fore,Waist,Height,Calf,Thigh                                   6    4.82
Fore,Waist,Height,Calf,Head                                    6    5.35
Fore,Waist,Height,Calf,Thigh,Head                              7    4.38
Fore,Chest,Waist,Height,Calf,Head                              7    4.81
Fore,Chest,Waist,Height,Thigh,Head                             7    5.50
Fore,Chest,Waist,Height,Calf,Thigh,Head                        8    5.47
Fore,Bicep,Waist,Height,Calf,Thigh,Head                        8    6.07
Fore,Shoulder,Waist,Height,Calf,Thigh,Head                     8    6.12
Fore,Chest,Neck,Waist,Height,Calf,Thigh,Head                   9    7.13
Fore,Chest,Shoulder,Waist,Height,Calf,Thigh,Head               9    7.45
Fore,Bicep,Chest,Waist,Height,Calf,Thigh,Head                  9    7.47
Fore,Bicep,Chest,Neck,Waist,Height,Calf,Thigh,Head             10   9.01
Fore,Chest,Neck,Shoulder,Waist,Height,Calf,Thigh,Head          10   9.10
Fore,Bicep,Chest,Shoulder,Waist,Height,Calf,Thigh,Head         10   9.45
Fore,Bicep,Chest,Neck,Shoulder,Waist,Height,Calf,Thigh,Head    11  11.00

> lm.mass <- lm(Mass ~ Fore + Waist + Height + Thigh)
> summary(lm.mass, cor=F)

Call: lm(formula = Mass ~ Fore + Waist + Height + Thigh)
Residuals:
    Min      1Q   Median     3Q    Max
 -3.882 -0.6756  -0.1017 0.9641  4.992

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -113.3120    14.6391  -7.7404    0.0000
Fore           2.0356     0.4624   4.4020    0.0004
Waist          0.6469     0.1043   6.2015    0.0000
Height         0.2717     0.0855   3.1789    0.0055
Thigh          0.5401     0.2374   2.2750    0.0361

Residual standard error: 2.249 on 17 degrees of freedom
Multiple R-Squared: 0.9659
F-statistic: 120.5 on 4 and 17 degrees of freedom, the p-value is 3.079e-012

Note the layout of the summary of the final model, and ensure that you can get all the information that you need from it (RSS, 95% CIs, etc.).

Examine the residuals from your model. Are they consistent with the assumptions of linear regression?

The histogram of the residuals should look normal.

[Histogram of the regression standardized residuals, dependent variable MASS: Mean = 0.00, Std. Dev = .90, N = 22.]

This is a normal P-P (or Q-Q) plot. The axes have been transformed so that the cumulative distribution of the residuals is plotted against the cumulative normal distribution; hence, if the residuals do follow a normal distribution, they should lie on a straight line. Apart from a small deviation, this seems to be the case here. (The deviation is due to one or two of the variables being used in linear form when their logged values should have been used.)

[Normal P-P plot of the regression standardized residuals, dependent variable MASS.]

[Scatterplot of MASS against the regression standardized residuals.]
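As a check on "getting all the information you need" from the summary, here is a sketch in Python (not part of the S-Plus session) that recovers R-squared, the residual standard error and a 95% CI for the thigh coefficient from the printed numbers, and reproduces one Cp value from the leaps table. The full-model residual mean square is not printed anywhere above; the value of about 5.23 used here is an assumption, consistent with Cp = p = 11.00 for the full model:

```python
import math

n = 22
rss, df_res = 85.974, 17            # stage-4 residual line of the ANOVA table
tss = 2524.148                      # total sum of squares

r_squared = 1 - rss / tss           # matches the printed 0.9659
resid_se = math.sqrt(rss / df_res)  # matches the printed 2.249

# 95% CI for the thigh coefficient: estimate +/- t(0.975, 17) * SE
b, se = 0.5401, 0.2374
t_crit = 2.110                      # upper 2.5% point of t on 17 df
ci = (b - t_crit * se, b + t_crit * se)   # roughly (0.04, 1.04): excludes 0

# Mallows' Cp for this p = 5 parameter model: Cp = RSS / s^2 - (n - 2p),
# with s^2 the full-model residual mean square (assumed ~5.23, see above).
s2_full = 5.23
cp = rss / s2_full - (n - 2 * 5)    # close to the tabulated 4.44

print(round(r_squared, 4), round(resid_se, 3),
      tuple(round(x, 2) for x in ci), round(cp, 2))
```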