
Maths 0146 Statistical Modelling in Biology
SPSS Practical
Multiple regression
Solutions
1.6 Exercise - Mass and Physical Measurements for Male Subjects

Examine graphically the relationships between mass and the possible explanatory
variables. Do the assumptions needed to perform linear regression seem reasonable?
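For reference, the points of a partial regression (added-variable) plot are obtained by regressing out the other explanatory variables from both the response and the predictor of interest, then plotting one set of residuals against the other. A minimal sketch in Python (not SPSS, and on made-up data):

```python
# Illustrative sketch of an added-variable plot's coordinates, using
# numpy least squares on synthetic data (not the practical's dataset).
import numpy as np

rng = np.random.default_rng(0)
n = 22
X = rng.normal(size=(n, 3))                    # three explanatory variables
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

def resid(response, design):
    """Residuals from an OLS fit of response on design (with intercept)."""
    A = np.column_stack([np.ones(len(design)), design])
    beta, *_ = np.linalg.lstsq(A, response, rcond=None)
    return response - A @ beta

# Added-variable plot for X[:, 0]: residuals of y on the OTHER predictors
# against residuals of X[:, 0] on the other predictors.
others = X[:, 1:]
ey = resid(y, others)                          # y adjusted for the others
ex = resid(X[:, 0], others)                    # X1 adjusted for the others

# The slope of ey on ex equals X1's coefficient in the full regression.
slope = (ex @ ey) / (ex @ ex)
print(round(slope, 3))                         # close to the true value, 2
```

A roughly linear, even scatter in such a plot supports including the variable linearly; curvature or funnelling suggests a transformation.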
[Figures: SPSS partial regression plots, dependent variable MASS, one for each of the explanatory variables FORE, BICEP, CHEST, NECK, SHOULDER, WASTE, HEIGHT, CALF, THIGH and HEAD.]

Fit the ‘full model’, containing all of the possible explanatory variables. Which
variables have significant coefficients?
Coefficients

                    Unstandardized        Standardized
                    Coefficients          Coefficients
Model               B           Std. Error    Beta          t       Sig.
1  (Constant)       -69.517     29.037                     -2.394   .036
   FORE               1.782       .855        .312          2.085   .061
   BICEP               .155       .485        .041           .320   .755
   CHEST               .189       .226        .116           .838   .420
   NECK               -.482       .721       -.081          -.669   .518
   SHOULDER       -2.931E-02      .239       -.017          -.122   .905
   WASTE               .661       .116        .470          5.679   .000
   HEIGHT              .318       .130        .180          2.438   .033
   CALF                .446       .413        .098          1.081   .303
   THIGH               .297       .305        .097           .974   .351
   HEAD               -.920       .520       -.105         -1.768   .105
a Dependent Variable: MASS

At the 5% level, only the waist (p < .001) and height (p = .033) coefficients are significant; the forearm coefficient is borderline (p = .061).

How would you decide which explanatory variables to include in the model? Choose
a model which you think provides a reasonable fit to these data.
You could run a stepwise regression, but with great care over how and which variables it
includes at each step. You should consider performing some sensitivity analysis: where
there is a close decision as to which variable to include at a particular stage, explore what
would happen under both possibilities. Alternatively, and a better option, you could build the
models yourself, using diagnostics at each stage (examining residuals, testing coefficients and
sums of squares) and bearing in mind the purpose of the model and its interpretation. This may be
more time-consuming, and it is advisable to consider a series of nested models as you build up
from a simple model to more complex ones.
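As an aside, the core of a forward stepwise procedure is small enough to sketch directly. The following illustrative Python version (not the SPSS algorithm, and on made-up data) adds, at each step, the variable that most reduces the residual sum of squares:

```python
# Sketch of forward selection by residual sum of squares, using numpy
# least squares on synthetic data (for illustration only).
import numpy as np

rng = np.random.default_rng(1)
n, names = 30, ["x1", "x2", "x3", "x4"]
X = rng.normal(size=(n, 4))
y = 3 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

def rss(cols):
    """Residual sum of squares for a model with intercept + these columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(((y - A @ beta) ** 2).sum())

chosen, remaining = [], list(range(4))
for _ in range(2):                     # two forward steps, for brevity
    best = min(remaining, key=lambda j: rss(chosen + [j]))
    chosen.append(best)
    remaining.remove(best)
    print(names[best], round(rss(chosen), 2))
```

A real procedure would also apply an entry criterion (e.g. an F-to-enter threshold) and stop when no remaining variable passes it, which is the part that needs the care described above.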
Here are the coefficients in the models at four stages of performing a stepwise regression:
                    Unstandardized        Standardized
                    Coefficients          Coefficients
Model               B           Std. Error    Beta          t       Sig.
1  (Constant)       -36.191     10.895                     -3.322   .003
   WASTE              1.287       .127        .915         10.148   .000
2  (Constant)       -68.718      9.199                     -7.470   .000
   WASTE               .773       .125        .550          6.199   .000
   FORE               2.755       .506        .482          5.439   .000
3  (Constant)      -107.488     15.998                     -6.719   .000
   WASTE               .732       .108        .520          6.772   .000
   FORE               2.579       .439        .452          5.870   .000
   HEIGHT              .264       .095        .149          2.787   .012
4  (Constant)      -113.312     14.639                     -7.740   .000
   WASTE               .647       .104        .460          6.201   .000
   FORE               2.036       .462        .356          4.402   .000
   HEIGHT              .272       .085        .154          3.179   .005
   THIGH               .540       .237        .177          2.275   .036
a Dependent Variable: MASS
The final model then contains the waist, forearm, height and thigh measurements.
The ANOVA tables for the models at each stage are as follows; note that the residual mean
square is reduced at each stage, indicating that each model fits the data better than the last.
Model              Sum of Squares   df   Mean Square        F    Sig.
1  Regression            2113.643    1      2113.643  102.978   .000 (a)
   Residual               410.505   20        20.525
   Total                 2524.148   21
2  Regression            2363.612    2      1181.806  139.871   .000 (b)
   Residual               160.536   19         8.449
   Total                 2524.148   21
3  Regression            2411.999    3       804.000  129.043   .000 (c)
   Residual               112.149   18         6.230
   Total                 2524.148   21
4  Regression            2438.173    4       609.543  120.527   .000 (d)
   Residual                85.974   17         5.057
   Total                 2524.148   21
a Predictors: (Constant), WASTE
b Predictors: (Constant), WASTE, FORE
c Predictors: (Constant), WASTE, FORE, HEIGHT
d Predictors: (Constant), WASTE, FORE, HEIGHT, THIGH
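Each entry in these tables follows from the sums of squares: a mean square is the sum of squares divided by its degrees of freedom, and F is the ratio of the regression to the residual mean square. Checking the model-4 row:

```python
# Checking the model-4 ANOVA entries from the table above:
# Mean Square = Sum of Squares / df, and F = MS(regression) / MS(residual).
ss_reg, df_reg = 2438.173, 4
ss_res, df_res = 85.974, 17

ms_reg = ss_reg / df_reg        # 609.543
ms_res = ss_res / df_res        # 5.057, the residual mean square
F = ms_reg / ms_res             # about 120.5

print(round(ms_reg, 3), round(ms_res, 3), round(F, 1))
```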
The results from a similar procedure, regression by leaps and bounds, are the same (done
in S-Plus).
> leaps.mass <- leaps(physical[,2:11],Mass,nbest=3)
> df.mass <- data.frame(p=leaps.mass$size,Cp=leaps.mass$Cp)
> round(df.mass,2)
                                                              p     Cp
Waist                                                         2  60.50
Fore                                                          2  74.80
Shoulder                                                      2 110.36
Fore,Waist                                                    3  14.70
Waist,Calf                                                    3  25.25
Shoulder,Waist                                                3  29.54
Fore,Waist,Height                                             4   7.45
Fore,Waist,Calf                                               4  11.18
Fore,Waist,Thigh                                              4  12.21
Fore,Waist,Height,Thigh                                       5   4.44
Fore,Waist,Height,Calf                                        5   6.10
Fore,Waist,Height,Head                                        5   6.83
Fore,Waist,Height,Thigh,Head                                  6   4.14
Fore,Waist,Height,Calf,Thigh                                  6   4.82
Fore,Waist,Height,Calf,Head                                   6   5.35
Fore,Waist,Height,Calf,Thigh,Head                             7   4.38
Fore,Chest,Waist,Height,Calf,Head                             7   4.81
Fore,Chest,Waist,Height,Thigh,Head                            7   5.50
Fore,Chest,Waist,Height,Calf,Thigh,Head                       8   5.47
Fore,Bicep,Waist,Height,Calf,Thigh,Head                       8   6.07
Fore,Shoulder,Waist,Height,Calf,Thigh,Head                    8   6.12
Fore,Chest,Neck,Waist,Height,Calf,Thigh,Head                  9   7.13
Fore,Chest,Shoulder,Waist,Height,Calf,Thigh,Head              9   7.45
Fore,Bicep,Chest,Waist,Height,Calf,Thigh,Head                 9   7.47
Fore,Bicep,Chest,Neck,Waist,Height,Calf,Thigh,Head           10   9.01
Fore,Chest,Neck,Shoulder,Waist,Height,Calf,Thigh,Head        10   9.10
Fore,Bicep,Chest,Shoulder,Waist,Height,Calf,Thigh,Head       10   9.45
Fore,Bicep,Chest,Neck,Shoulder,Waist,Height,Calf,Thigh,Head  11  11.00
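leaps ranks subsets by Mallows' Cp = RSS_p / s² − (n − 2p), where s² is the full-model residual mean square and p counts the fitted coefficients (intercept included); Cp values close to p indicate a reasonable subset. As a rough check for the chosen subset — the full-model residual mean square is not shown above, so the value 5.23 used here is back-calculated purely for illustration:

```python
# Mallows' Cp for the chosen subset: Cp = RSS_p / s2_full - (n - 2p).
# NOTE: s2_full = 5.23 is an assumed, back-calculated value for this
# illustration; it does not appear in the S-Plus output above.
n, p = 22, 5                 # p = intercept plus 4 predictors
rss_p = 85.974               # residual SS of the 4-predictor model (ANOVA table)
s2_full = 5.23               # assumed full-model residual mean square

cp = rss_p / s2_full - (n - 2 * p)
print(round(cp, 2))          # close to the tabulated value of 4.44
```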
> lm.mass <- lm(Mass~Fore+Waist+Height+Thigh)
> summary(lm.mass,cor=F)
Call: lm(formula = Mass ~ Fore + Waist + Height + Thigh)
Residuals:
    Min      1Q  Median     3Q   Max
 -3.882 -0.6756 -0.1017 0.9641 4.992

Coefficients:
                Value Std. Error t value Pr(>|t|)
(Intercept) -113.3120    14.6391 -7.7404   0.0000
Fore           2.0356     0.4624  4.4020   0.0004
Waist          0.6469     0.1043  6.2015   0.0000
Height         0.2717     0.0855  3.1789   0.0055
Thigh          0.5401     0.2374  2.2750   0.0361

Residual standard error: 2.249 on 17 degrees of freedom
Multiple R-Squared: 0.9659
F-statistic: 120.5 on 4 and 17 degrees of freedom, the p-value is 3.079e-012

Note the layout of the summary of the final model, and ensure that you can extract from it
all the information you need (RSS, 95% CIs, etc.).
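For example, a 95% confidence interval for a coefficient comes from the estimate, its standard error, and the t distribution on the residual degrees of freedom. A quick check for the waist coefficient:

```python
# 95% confidence interval for a coefficient from the summary above:
# estimate +/- t(0.975, df) * standard error.  For Waist, with 17
# residual degrees of freedom, the critical t value is about 2.110.
b, se, t_crit = 0.6469, 0.1043, 2.110

lower, upper = b - t_crit * se, b + t_crit * se
print(round(lower, 3), round(upper, 3))
```

Similarly, the RSS can be recovered as the squared residual standard error times its degrees of freedom: 2.249² × 17 ≈ 86.0, agreeing with the residual sum of squares (85.974) in the ANOVA table.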

Examine the residuals from your model. Are they consistent with the assumptions of
linear regression?
The histogram of the residuals should look approximately normal:
[Figure: histogram of the regression standardized residuals, Dependent Variable: MASS (Std. Dev = .90, Mean = 0.00, N = 22.00).]
This is a normal P-P (or Q-Q) plot. The axes have been transformed so that the cumulative
distribution of the residuals is plotted against the cumulative normal distribution; hence if the
residuals do follow a normal distribution, they should lie on a straight line. Apart from a small
deviation, this seems to be the case here. (The deviation is due to one or two of the variables
being used in a linear form when their logged values should have been used.)
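The coordinates of such a P-P plot are straightforward to compute: sort the standardized residuals, take empirical plotting positions as the observed cumulative probabilities, and compare them with the normal cumulative probabilities of the sorted values. A small sketch in Python, with made-up residuals:

```python
# Sketch of a normal P-P plot's coordinates using only the standard
# library.  The residual values below are invented for illustration.
from statistics import NormalDist

residuals = [-1.4, -0.9, -0.5, -0.2, 0.0, 0.3, 0.7, 1.1, 1.8]
n = len(residuals)

observed = [(i + 0.5) / n for i in range(n)]          # empirical cum. prob.
expected = [NormalDist().cdf(r) for r in sorted(residuals)]

# Normally distributed residuals give points near the diagonal.
for o, e in zip(observed, expected):
    print(round(o, 2), round(e, 2))
```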
[Figure: Normal P-P plot of the regression standardized residuals, Dependent Variable: MASS — expected versus observed cumulative probability.]
[Figure: scatterplot of MASS against the regression standardized residuals, Dependent Variable: MASS.]