Multiple Regression: Part II
(§12.4 - 12.7)
• Testing hierarchical models.
• Model building and variable selection.
• Checking model assumptions.
• Dummy variables.
• The Peruvian Indian Data – a complete example.
15-1
Testing Hierarchical Models
• Suppose we fit all available independent
variables in a general multiple regression model
(complete model - model 1).
Y = b0 + b1x1 + b2x2 + b12x1x2 + e
• Now fit the same model with one or more of the
terms removed (reduced model - model 2).
Y = b0 + b1x1 + e
• Does the reduced model fit as well as the
complete model?
• Test two parameters simultaneously.
H0: b2 = b12 = 0
15-2
Let SSE1 be the error sums of squares for the complete
model (Y = b0 + b1x1 + b2x2 + b12x1x2).
Let SSE2 be the error sums of squares for the reduced
model (Y = b0 + b1x1).
Since Model 1 includes more terms than Model 2, it will fit at least as well; hence
SSE1 ≤ SSE2
The difference, SSE2 - SSE1 is a measure of the drop in
the sum of squares for error attributable to the
variables removed from the complete model.
15-3
F-Statistic for Hierarchical Models
Define the mean square drop as:
MSdrop = (SSE2 - SSE1 ) / (k-g)
where k is the number of terms in the complete model (Model 1) and
g (<k) is the number of terms in the reduced model (Model 2).
The mean square error for the complete model is:
MSE1 = SSE1 / (n-k-1)
To test the hypothesis that the terms left out of the complete model do
not contribute significantly to explaining the variability in y we use the
following F statistic.
F = MSdrop/MSE1
Reject H0: left-out parameters = 0 if F > F(k-g, n-k-1, α)
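A minimal R sketch of this drop test, assuming a hypothetical data frame dat with response y and predictors x1, x2: fit both models with lm() and compare them with anova(), which computes exactly this F statistic.

complete <- lm(y ~ x1 + x2 + x1:x2, data = dat)   # Model 1: all terms
reduced  <- lm(y ~ x1, data = dat)                # Model 2: b2 and b12 set to 0
anova(reduced, complete)                          # F = [(SSE2 - SSE1)/(k - g)] / MSE1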
15-4
Sequential Models
With this concept of partial and full models, we can look at models
in two different ways.
Mean Model:  Y = b0 + e                              (add constant)
Model 1:     Y = b0 + b1x1 + e                       (add x1)
Model 2:     Y = b0 + b1x1 + b2x2 + e                (add x2)
Model 3:     Y = b0 + b1x1 + b2x2 + b12x1x2 + e      (add cross product)
Questions:
Is Model 1 better than the Mean Model?
Is Model 2 better than Model 1?
Is Model 3 better than Model 2?
All of these are tested with a drop type test.
15-5
Last In Significant (Partial) Tests
Y = b0 + b1x1 + b2x2 + b3x3 + e      Does x3 add to a model already containing x1, x2?
Y = b0 + b1x1 + b3x3 + b2x2 + e      Does x2 add to a model already containing x1, x3?
Y = b0 + b2x2 + b3x3 + b1x1 + e      Does x1 add to a model already containing x2, x3?
Our interest centers on how important a predictor is in explaining variability in y,
over and above what is explained by the predictors already in the model.
All these are tested with a drop type test.
15-6
Reduction (Drop) Sums of Squares
Mean Model:  Y = b0 + e                              (add constant)
Model 1:     Y = b0 + b1x1 + e                       (add x1)
Model 2:     Y = b0 + b1x1 + b2x2 + e                (add x2)
Model 3:     Y = b0 + b1x1 + b2x2 + b12x1x2 + e      (add cross product)
Total SS = Model SS + Error SS
R(b2 | b0, b1)      = Model SS2 - Model SS1 = Error SS1 - Error SS2
R(b12 | b0, b1, b2) = Model SS3 - Model SS2 = Error SS2 - Error SS3
These are called adjusted or partial sums of squares.
R(b12 | b0, b1, b2) is the additional variability in Y explained by Model 3, above and beyond
what is already explained by Model 2.
15-7
SAS Type Sums of Squares
In SAS, Reduction (or Drop or Sequential) Sums of Squares are presented in the Type I Sums of
Squares table. This table has the following form:

Source    Sums of Squares      Model
x1        R(b1|b0)             Y = b0 + b1X1
x2        R(b2|b0, b1)         Y = b0 + b1X1 + b2X2
x3        R(b3|b0, b1, b2)     Y = b0 + b1X1 + b2X2 + b3X3

In SAS, Last In (or Partial) Sums of Squares are presented in the Type III Sums of Squares
table. This table has the following form:

Source    Sums of Squares      Model
x1        R(b1|b0, b2, b3)     Y = b0 + b1X1 + b2X2 + b3X3
x2        R(b2|b0, b1, b3)     Y = b0 + b1X1 + b2X2 + b3X3
x3        R(b3|b0, b1, b2)     Y = b0 + b1X1 + b2X2 + b3X3
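The analogous tables can be obtained in R; a minimal sketch for a hypothetical fit with predictors x1, x2, x3 in a data frame dat:

fit <- lm(y ~ x1 + x2 + x3, data = dat)
anova(fit)              # sequential (Type I) SS: R(b1|b0), R(b2|b0,b1), R(b3|b0,b1,b2)
drop1(fit, test = "F")  # last-in (partial) F tests for each predictor

The Anova() function in the car package is another common way to obtain Type II/III tables.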
15-8
Variable/Predictor Selection
• Find the “best” (an appropriate) subset of regressors
(predictors) for the model from among all possible
candidate regressors (all simple terms, polynomial
terms, cross-product terms, etc.)
• Problem: How do we define “best”?
• The model should include as many regressors as possible, so that all the information they
carry about y is available for prediction.
• The model should include as few regressors as possible, since the variance of the predictions
increases as the number of regressors increases. In addition, a parsimonious (simple) model is
easier to interpret.
15-9
Y = b0 + b1X1 + ... + bkXk + e
• Deleting variables from the complete model actually decreases the
variances of the parameter estimates for the remaining explanatory
variables. (R² always increases when variables are added to the model and decreases when
variables are deleted.)
s(b̂j) = σ̂e / sqrt[ Sxjxj (1 - R²j) ]
where R²j is the R² from regressing xj on the other predictors x1, ..., xj-1, xj+1, ..., xk.
• Thus, deleting variables from the complete model improves the
precision of the estimates for the remaining parameters, as well as the
precision of the predictions.
• However, if we remove an explanatory variable which is strongly
associated with Y, we produce biased estimates of the remaining
parameters, the residual variance, and the predictions.
15-10
Model Selection Criteria
• The coefficient of multiple determination.
R² = SSRk / SYY
Choose models with high R². However, R² increases every time more predictors are added,
regardless of their importance in predicting Y.
• Adjusted R².
Adj R² = 1 - (n-1)(1 - R²)/(n-k-1)
Choose models with high adj. R². Better suited for model selection than R²: it increases when
important predictors are added to the model and decreases when unimportant ones are.
• Residual mean square.
MSEk = SSEk/(n-k-1)
Choose models with low MSE.
• Mallows' Cp statistic.
Cp = (SSEk/MSE1) - n + 2(k+1)
Choose models with Cp ≈ k+1. Measures the adequacy of predictions from the reduced model
relative to those from the full model. (MSE1 is the MSE for the full model, i.e. the model
containing all available predictors.)
(A small R sketch computing these criteria for one candidate model is given below.)
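A minimal R sketch computing these four criteria for one candidate model (the data frame and variable names are hypothetical):

cand <- lm(y ~ x1 + x2, data = dat)                  # candidate model with k predictors
full <- lm(y ~ ., data = dat)                        # full model with all available predictors
n <- nrow(dat); k <- length(coef(cand)) - 1
r2    <- summary(cand)$r.squared                     # R²
adjr2 <- summary(cand)$adj.r.squared                 # adjusted R²
mse   <- sum(residuals(cand)^2) / (n - k - 1)        # residual mean square
cp    <- sum(residuals(cand)^2) / summary(full)$sigma^2 - n + 2 * (k + 1)   # Mallows' Cp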
15-11
• Predicted Residual Sum of Squares (PRESS) statistic.
PRESS = Σ [ yi - ŷ(i) ]²
Choose models with low PRESS. ŷ(i) is the prediction of the ith observation from a model
fitted to the remaining n-1 observations. (See the R sketch at the end of this slide.)
• Highly recommended: Akaike’s Information Criterion (AIC). Uses
sophisticated concepts from information theory to measure how far the
candidate model is from the “true” model. Choose models with low AIC.
(Burnham, K. P., and D. R. Anderson. 2002. Model selection and
multimodel inference: a practical information-theoretic approach. Springer-Verlag, New York,
NY. ISBN 0-387-95364-7.) For a multiple regression
model with n obs and k predictors,
AICc = n log[ (n - k - 1) MSE / n ] + 2n(k + 2) / (n - k - 3)
(Note that (n-k-1)MSE = SSE, the model's error sum of squares.)
Also keep in mind that one should have (roughly) at least 10 times as many observations as
predictors in the candidate models, i.e. 10k < n.
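A minimal R sketch of both PRESS and this AICc for a fitted lm object fit; the PRESS computation uses the standard leave-one-out identity e(i) = ei/(1 - hii), so no refitting is needed:

# PRESS via the leave-one-out identity: deleted residual = e_i / (1 - h_ii)
press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# AICc as defined on this slide: n*log((n-k-1)*MSE/n) + 2n(k+2)/(n-k-3)
aicc <- function(fit) {
  n <- length(residuals(fit))
  k <- length(coef(fit)) - 1                    # number of predictors (excluding intercept)
  mse <- sum(residuals(fit)^2) / (n - k - 1)
  n * log((n - k - 1) * mse / n) + 2 * n * (k + 2) / (n - k - 3)
}

(R's built-in AIC() is the uncorrected criterion and uses a different additive constant, so its numerical values will not match this AICc.)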
15-12
Automatic Model Selection Approaches
• Backward Elimination - First fit the model with all possible predictors, then sequentially
eliminate the predictors that are least significant in a last-in (partial) test.
• Forward Selection - First find the one predictor, call it x1, that does the best job of
explaining variation in y. Then add to the model the predictor that is most significant (via a
last-in test) when added after x1. Continue until no additional predictor contributes
significantly in a last-in test.
• Stepwise Selection - Begin as in forward selection, but each time a new predictor is added to
the model, check all other predictors with a last-in test to determine whether they should stay
in the model, and drop any predictor that fails the test. (This method is generally preferred
over the other two. A small R sketch of all three approaches follows.)
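A minimal R sketch of the three approaches using step(); note that step() chooses variables by AIC rather than by last-in F tests, so it only approximates the procedures described above (dat and its response y are hypothetical names):

full <- lm(y ~ ., data = dat)    # model with all candidate predictors
null <- lm(y ~ 1, data = dat)    # intercept-only (mean) model
step(full, direction = "backward")                          # backward elimination
step(null, scope = formula(full), direction = "forward")    # forward selection
step(null, scope = formula(full), direction = "both")       # stepwise selection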
15-13
Checking Model Assumptions in
Multiple Regression
Basic Assumptions:
1. Zero expectation: E(ei) = 0 for all i.
2. Constant variance: V(ei) = σ²e for all i.
3. Normality: the ei are normally distributed.
4. Independence: Corr(ei, ej) = 0 for all i ≠ j.
These are checked using the residuals as estimates of the ei:
êi = yi - ŷi
15-14
Tools for Assumption Checking
1. Zero expectation: E(ei) = 0 for all i.
Automatically satisfied if we implement least squares estimation correctly.
2. Constant variance: V(ei) = σ²e for all i.
Use a plot of residuals versus predicted values.
Points should display the same spread regardless of the value of the
predicted response.
Alternatives:
Weighted least squares
Generalized linear models
15-15
3. Normality: ei are normally distributed.
Normal probability plots (Quantile-quantile plots)
Formal tests of normality
Histograms and Boxplots
4. Independence: Corr(ei, ej) = 0 for all i ≠ j.
For time (or space) series data:
Time series plot
Serial autocorrelation estimates and plot
Durbin-Watson Statistic
Non-time series data:
Know how the data were collected; was randomization used?
No formal statistical tests or indices.
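A minimal R sketch of these residual checks for a fitted model (the model and data frame names are hypothetical):

fit <- lm(y ~ x1 + x2, data = dat)
plot(fitted(fit), residuals(fit)); abline(h = 0, lty = 2)   # constant variance: even spread?
qqnorm(residuals(fit)); qqline(residuals(fit))              # normality: points near the line?
plot(residuals(fit), type = "b")                            # time-ordered data: trends or cycles?
acf(residuals(fit))                                         # serial autocorrelation estimates/plot

For time (or space) series data a formal Durbin-Watson test is available, e.g. dwtest() in the lmtest package or durbinWatsonTest() in the car package.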
15-16
Comparing the Slopes of Two or More Regression Lines
Suppose we have a quantitative explanatory variable, X1, and we have two
possible regression lines: one for situation 1 (location A), the other for situation
2 (location B).
Use of dummy (or classification) variables in regression.
[Figure: y versus x1 showing two regression lines -
 Location A: y = b0 + b1x1;  Location B: y = b2 + b3x1.
 Equal intercepts: H0: b0 = b2.  Equal slopes: H0: b1 = b3.]
15-17
Reformulate the model using a dummy (indicator) variable.
Define a new variable x2 such that
x2 = 0 for situation 1 (Location A)
x2 = 1 for situation 2 (Location B)
Then use multiple regression:
ŷ = β0 + β1x1 + β2x2 + β3x1x2
When x2 = 0 we have:
ŷ = β0 + β1x1 = b0 + b1x1
When x2 = 1 we have:
ŷ = (β0 + β2) + (β1 + β3)x1 = b2 + b3x1
The test of β2 = 0 is equivalent to testing for no intercept difference.
The test of β3 = 0 is equivalent to testing for no slope difference.
Both tests are based on reduction (drop) sums of squares, as previously defined.
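A minimal R sketch of this comparison, assuming a hypothetical data frame dat with response y, predictor x1, and the 0/1 dummy x2:

fit <- lm(y ~ x1 + x2 + x1:x2, data = dat)   # equivalently, y ~ x1 * x2
summary(fit)        # t test on the x2 row: intercept difference; on the x1:x2 row: slope difference
anova(lm(y ~ x1, data = dat), fit)           # drop test of both differences at once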
15-18
[Figure: four panels of y versus x1, one for each combination of β2 and β3 being zero or not -
 β2 ≠ 0, β3 ≠ 0: y = β0 + β1x1 and y = (β0 + β2) + (β1 + β3)x1 (different intercepts and slopes);
 β2 ≠ 0, β3 = 0: y = β0 + β1x1 and y = (β0 + β2) + β1x1 (parallel lines);
 β2 = 0, β3 ≠ 0: y = β0 + β1x1 and y = β0 + (β1 + β3)x1 (common intercept, different slopes);
 β2 = 0, β3 = 0: both locations share the single line y = β0 + β1x1.]
15-19
Generalized Linear Models
Each generalized linear model has three components: a distributional specification, a link
function, and a linear predictor η.
Normal Regression:    Y ~ N(μ, σ²e),       μ = η,                               η = b0 + b1x1 + b2x2 + ...
Poisson Regression:   Y ~ Poisson(λ),      E(Y) = λ = exp(η),                   η = b0 + b1x1 + ...
Logistic Regression:  Y ~ Binomial(n, p),  E(Y) = np, p = exp(η)/[1 + exp(η)],  η = b0 + b1x1 + ...
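A minimal sketch of how these three specifications are fit in R with lm()/glm() (data frame and variable names are hypothetical):

lm(y ~ x1 + x2, data = dat)                                            # normal regression
glm(counts ~ x1 + x2, family = poisson(link = "log"), data = dat)      # Poisson regression
glm(success ~ x1 + x2, family = binomial(link = "logit"), data = dat)  # logistic regression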
15-20
Multiple Regression Example: The Peruvian Indian Data
Anthropologists studying the long-term effects of an environmental change on systolic blood
pressure measured this and several other characteristics of 39 Indians who migrated from a
primitive environment high in the Andes into mainstream Peruvian society at a lower elevation.
The variables were:
X1 = Age (years)
X2 = Years since migration
X3 = Weight (kg)
X4 = Height (mm)
X5 = Chin skin fold (mm)
X6 = Forearm skin fold (mm)
X7 = Calf skin fold (mm)
X8 = Pulse rate (beats/min)
X9 = Systolic blood pressure
X10 = Diastolic blood pressure
X11 = Years since migration divided by age
Data available in Minitab as PERU.MTW. Response Y is X9.
 #   Age  Years  Weight  Height  Chin  Forearm  Calf  Pulse  Systol  Diastol  Fraction
 1    21      1    71.0    1629   8.0      7.0  12.7     88     170       76  0.047619
 .     .      .       .       .     .        .     .      .       .        .         .
 8    28     25    53.0    1568   3.7      4.3   0.0     80     108       62   0.89286
 .     .      .       .       .     .        .     .      .       .        .         .
39    54     40    87.0    1542  11.3     11.7  11.3     92     152       88  0.740741
15-21
Pairwise scatterplots

R: If the data are read into a data frame called "peru" with read.table, issue the command
> pairs(peru)

MTB: Graph > Matrix Plot
15-22
Matrix of all pairwise correlations (R)
> cor(peru)
              V1           V2          V3           V4            V5          V6
V1 1.000000000 0.588212502 0.4316630 0.055777982 0.157908294 0.05520278
V2 0.588212502 1.000000000 0.4811534 0.072594154 0.221697674 0.14302404
V3 0.431662982 0.481153366 1.0000000 0.450330307 0.561748764 0.54373244
V4 0.055777982 0.072594154 0.4503303 1.000000000 -0.007898078 -0.06893212
V5 0.157908294 0.221697674 0.5617488 -0.007898078 1.000000000 0.63788150
V6 0.055202779 0.143024038 0.5437324 -0.068932124 0.637881501 1.00000000
V7 -0.005374411 0.001099438 0.3918655 -0.002845856 0.515999762 0.73552594
V8 0.090654502 0.236904643 0.3117934 0.007829993 0.223100921 0.42190760
V9 0.005844807 -0.087480460 0.5213643 0.219114553 0.170192453 0.27228023
V10 0.038725834 0.075792139 0.3944963 0.253040787 0.088787528 0.21237426
V11 0.364523488 0.938145452 0.2930832 0.051187749 0.120091662 0.02801547
              V7           V8           V9          V10          V11
V1  -0.005374411  0.090654502  0.005844807   0.03872583   0.36452349
V2   0.001099438  0.236904643 -0.087480460   0.07579214   0.93814545
V3   0.391865474  0.311793359  0.521364290   0.39449626   0.29308318
V4  -0.002845856  0.007829993  0.219114553   0.25304079   0.05118775
V5   0.515999762  0.223100921  0.170192453   0.08878753   0.12009166
V6   0.735525936  0.421907596  0.272280231   0.21237426   0.02801547
V7   1.000000000  0.208715412  0.250789289   0.30649050  -0.11301589
V8   0.208715412  1.000000000  0.135477107   0.05969512   0.21419489
V9   0.250789289  0.135477107  1.000000000   0.47519113  -0.27614544
V10  0.306490503  0.059695117  0.475191134   1.00000000  -0.05101310
V11 -0.113015894  0.214194894 -0.276145438  -0.05101310   1.00000000

In MTB: Stat > Basic Stats > Correlation
15-23
Fit full model for subsequent comparison (MTB)
Stat > Regression > Options > Variance Inflation Factors
The regression equation is
Systol = 142 - 1.05 Age + 2.28 Years + 1.35 Weight - 0.0361 Height
- 0.841 Chin - 1.09 Forearm - 0.232 Calf + 0.116 Pulse
+ 0.100 Diastol - 108 Fraction
Predictor        Coef   SE Coef      T      P    VIF
Constant       141.93     49.98   2.84  0.008
Age           -1.0487    0.3480  -3.01  0.005    3.6
Years          2.2775    0.8644   2.63  0.014   37.9
Weight         1.3546    0.4441   3.05  0.005    4.9
Height       -0.03606   0.03728  -0.97  0.342    1.9
Chin          -0.8412    0.7637  -1.10  0.280    2.2
Forearm        -1.088     1.211  -0.90  0.377    3.8
Calf          -0.2316    0.5532  -0.42  0.679    2.5
Pulse          0.1161    0.1721   0.67  0.505    1.3
Diastol        0.1003    0.1500   0.67  0.509    1.5
Fraction      -108.36     32.17  -3.37  0.002   27.2

(Some of these VIFs are too high!)

S = 8.739   R-Sq = 67.3%   R-Sq(adj) = 55.6%

Analysis of Variance
Source            DF       SS      MS     F      P
Regression        10  4393.03  439.30  5.75  0.000
Residual Error    28  2138.41   76.37
Total             38  6531.44
15-24
Source     DF   Seq SS
Age         1     0.22
Years       1    82.55
Weight      1  2693.40
Height      1    61.37
Chin        1   366.86
Forearm     1    42.69
Calf        1    14.67
Pulse       1     2.95
Diastol     1   261.93
Fraction    1   866.38

Unusual Observations
Obs   Age   Systol     Fit  SE Fit  Residual  St Resid
  1  21.0   170.00  155.02    6.21     14.98     2.44R
  8  28.0   108.00   95.72    6.55     12.28     2.12R
R denotes an observation with a large standardized residual

VIFs in R: see p. 205 of Faraway's Practical Regression and ANOVA using R (downloadable from
R Resources on the course website). The package "car", available from CRAN, provides a vif()
function as well.
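The same full-model fit and its VIFs can be obtained in R; a minimal sketch, assuming the peru data frame from the earlier slides:

g <- lm(V9 ~ ., data = peru)   # full model: Systol (V9) on all other variables
summary(g)
library(car)                   # car provides vif()
vif(g)                         # the largest VIFs appear for Years (V2) and Fraction (V11)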
Stepwise selection (R)
> g <- lm(V9 ~ ., data = peru)
> step(g, trace = F)

Call:
lm(formula = V9 ~ V1 + V2 + V3 + V5 + V11, data = peru)

Coefficients:
(Intercept)       V1       V2      V3      V5       V11
    109.359   -1.012    2.407   1.098  -1.192  -110.811

In MTB: Stat > Regression > Stepwise
15-25
All subsets adj. R² selection (R)
> library(leaps)
> g <- leaps(peru[,-9], V9, nbest=1, names=c("V1","V2","V3","V4","V5","V6","V7","V8","V10","V11"), method="adjr2")
> plot(g$size, g$adjr2)      # size = k+1
> for (i in 1:dim(g$which)[1]) { print(colnames(g$which)[g$which[i,]]) }
[1] "V3"
[2] "V3" "V11"
[3] "V3" "V10" "V11"
[4] "V1" "V2" "V3" "V11"
[5] "V1" "V2" "V3" "V5" "V11"
[6] "V1" "V2" "V3" "V5" "V6" "V11"
[7] "V1" "V2" "V3" "V4" "V5" "V6" "V11"
[8] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V11"
[9] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V10" "V11"
[10] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V10" "V11"

In MTB: Stat > Regression > Best Subsets will do all subsets based on R², adj R², Cp, and s (MSE).

[Plot of adj R² versus size (= k+1): the model with V1, V2, V3, V5, V11 is best by adj R²
(it is also the model selected by stepwise); the model with V3, V11 is also marked.]
15-26
All subsets Cp selection (R)
> g <- leaps(peru[,-9], V9, nbest=1, names=c("V1","V2","V3","V4","V5","V6","V7","V8","V10","V11"), method="Cp")
> plot(g$size, g$Cp)
> abline(0,1)
> g$which
> for (i in 1:dim(g$which)[1]) { print(colnames(g$which)[g$which[i,]]) }
[1] "V3"
[2] "V3" "V11"
[3] "V3" "V10" "V11"
[4] "V1" "V2" "V3" "V11"
[5] "V1" "V2" "V3" "V5" "V11"
[6] "V1" "V2" "V3" "V5" "V6" "V11"
[7] "V1" "V2" "V3" "V4" "V5" "V6" "V11"
[8] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V11"
[9] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V10" "V11"
[10] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V10" "V11"

[Plot of Cp versus size with the reference line Cp = size (abline(0,1)): the full model is best
by Cp; the models with V3, V11; with V1, V2, V3, V11; and with V1, V2, V3, V5, V11 are also marked.]
15-27
All subsets AICc selection (R)
> g <- leaps(peru[,-9], V9, nbest=1, names=c("V1","V2","V3","V4","V5","V6","V7","V8","V10","V11"), method="r2")
> k <- g$size - 1; r2 <- g$r2; n <- length(V9); sst <- (n-1)*var(V9)
> plot(k+1, n*log((1-r2)*sst/n) + 2*n*(k+2)/(n-k-3), xlab="k+1", ylab="AICc")
> g$which
> for (i in 1:dim(g$which)[1]) { print(colnames(g$which)[g$which[i,]]) }
[1] "V3"
[2] "V3" "V11"
[3] "V3" "V10" "V11"
[4] "V1" "V2" "V3" "V11"
[5] "V1" "V2" "V3" "V5" "V11"
[6] "V1" "V2" "V3" "V5" "V6" "V11"
[7] "V1" "V2" "V3" "V4" "V5" "V6" "V11"
[8] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V11"
[9] "V1" "V2" "V3" "V4" "V5" "V6" "V8" "V10" "V11"
[10] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V10" "V11"

Within a given subset size, ranking models by high R² is the same as ranking them by low AICc,
so these are also the best models of each size by AICc. Plotting AICc for each of these 10 best
models shows that the model with V1, V2, V3, V5, V11 is best according to AICc.
15-28
Fit the 5 variable model
Systol = 109 - 1.01 Age + 2.41 Years + 1.10 Weight - 1.19 Chin - 111 Fraction
Predictor      Coef  SE Coef      T      P     VIF
Constant     109.36    21.48   5.09  0.000
Age         -1.0120   0.3059  -3.31  0.002   2.939
Years        2.4067   0.7426   3.24  0.003  29.853
Chin        -1.1918   0.6140  -1.94  0.061   1.484
Weight       1.0976   0.2980   3.68  0.001   2.378
Fraction    -110.81    27.28  -4.06  0.000  20.886

S = 8.45707   R-Sq = 63.9%   R-Sq(adj) = 58.4%

Analysis of Variance
Source            DF       SS      MS      F      P
Regression         5  4171.21  834.24  11.66  0.000
Residual Error    33  2360.23   71.52
Total             38  6531.44
15-29
Fit the 2 variable model
Systol = 60.9 + 1.22 Weight - 26.8 Fraction
Predictor      Coef  SE Coef      T      P    VIF
Constant      60.90    14.28   4.26  0.000
Weight       1.2169   0.2337   5.21  0.000  1.094
Fraction    -26.767    7.218  -3.71  0.001  1.094

S = 9.77719   R-Sq = 47.3%   R-Sq(adj) = 44.4%

Analysis of Variance
Source            DF      SS      MS      F      P
Regression         2  3090.1  1545.0  16.16  0.000
Residual Error    36  3441.4    95.6
Total             38  6531.4
15-30
Carry out a “Drop SS Test” to compare these two models
H0: Model 2 (V3, V11), with SSE2=3441.4, df2=36.
Ha: Model 1 (V1, V2, V3, V5, V11), with SSE1=2360.2, df1=33, MSE1=71.5.
F = [(SSE2 - SSE1) / (df2 - df1)] / MSE1
  = [(3441.4 - 2360.2) / (36 - 33)] / 71.5
  = 5.04 > F(df2-df1, df1, α) = F(3, 33, .05) = 2.89
So the "explanatory" capability of Model 1 (for Y) is significantly greater than that of
Model 2, but that does not mean it is a "better" model…
Because of the multicollinearity and complexity, I'll go with Model 2.
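The same drop test can be run directly in R with anova() on the two nested fits (assuming the peru data frame from the earlier slides):

m2 <- lm(V9 ~ V3 + V11, data = peru)                  # Model 2: Weight, Fraction
m1 <- lm(V9 ~ V1 + V2 + V3 + V5 + V11, data = peru)   # Model 1
anova(m2, m1)                                         # reproduces F = 5.04 on 3 and 33 df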
15-31
Check for Interaction in Final Model (R)
> g0 <- lm(V9~V3+V11)
> g1 <- lm(V9~V3*V11)
> summary(g1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  52.2224    34.1710   1.528   0.1354
V3            1.3560     0.5501   2.465   0.0187 *
V11          -9.8619    60.7797  -0.162   0.8720
V3:V11       -0.2672     0.9536  -0.280   0.7810
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Insignificant interaction.

Residual standard error: 9.905 on 35 degrees of freedom
Multiple R-Squared: 0.4743, Adjusted R-squared: 0.4292
F-statistic: 10.53 on 3 and 35 DF, p-value: 4.433e-05

> summary(g0)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   60.8959    14.2809   4.264 0.000138 ***
V3             1.2169     0.2337   5.207 7.97e-06 ***
V11          -26.7672     7.2178  -3.708 0.000699 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 9.777 on 36 degrees of freedom
Multiple R-Squared: 0.4731, Adjusted R-squared: 0.4438
F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06

Both variables highly significant!
Systol = 60.9 + 1.2*Weight - 26.8*Fraction
15-32
Influence (R)
> influence.measures(g0)
      dfb.1_   dfb.V3  dfb.V11    dffit cov.r   cook.d    hat inf
1 -0.611150 0.82840 -0.96911 1.23380 0.664 4.19e-01 0.1508 *
8 0.270300 -0.32305 0.40802 0.48303 1.431 7.86e-02 0.2776 *
39 -0.392296 0.37889 0.07270 0.44230 1.574 6.63e-02 0.3314 *
> g0$resid[c(1,8,39)]
        1         8        39
23.981894  6.509972  5.065127
15-33
[Figure: the fitted regression equation (plane) for Systol on Weight and Fraction, with the data points.]
If you have to do a detailed regression analysis, get the following book:
Applied Linear Regression Models, by Kutner, Nachtsheim, & Neter, 4th edition,
2004, McGraw-Hill/Irwin.
15-34