12_Multiple and Complex Regression 2013

Multiple and complex regression
Extensions of simple linear regression
• Multiple regression models: predictor variables are continuous
• Analysis of variance: predictor variables are categorical (grouping variables)
• But… general linear models can include both continuous and categorical predictors
Relative abundance of C3 and C4 plants
• Paruelo & Lauenroth (1996)
• Geographic distribution and the effects of climate variables on the relative abundance of a number of plant functional types (PFTs): shrubs, forbs, succulents, C3 grasses and C4 grasses.
Data
73 sites across temperate central North America
Response variable
• Relative abundance of PFTs (based on cover, biomass, and primary production) for each site
Predictor variables
• Longitude
• Latitude
• Mean annual temperature
• Mean annual precipitation
• Winter (%) precipitation
• Summer (%) precipitation
• Biomes (grassland, shrubland)
[Figure: four histograms (y-axis: Frequency) of C3, log_10_C3, log_C3 and SQRT_C3]
Relative abundance was transformed as ln(dat + 1) because its distribution is positively skewed
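The effect of such a transformation on skewness can be checked numerically. The sketch below is only an illustration: the abundance values are hypothetical (they are not the Paruelo & Lauenroth data), and `skewness` is a helper defined here using the simple moment-based formula.

```python
import math

def skewness(values):
    """Moment-based sample skewness m3 / m2**1.5 (positive = long right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical relative abundances, chosen to be positively skewed.
c3 = [0.0, 0.0, 0.01, 0.02, 0.05, 0.10, 0.15, 0.30, 0.50, 0.80]

ln_c3 = [math.log(x + 1) for x in c3]   # ln(dat + 1), as stated on the slide
sqrt_c3 = [math.sqrt(x) for x in c3]    # square-root alternative (cf. SQRT_C3)

for label, data in (("raw", c3), ("ln(x + 1)", ln_c3), ("sqrt(x)", sqrt_c3)):
    print(f"{label:10s} skewness = {skewness(data):.2f}")
```

Both transformations are concave, so they pull in the long right tail; which one is preferable for real data is usually judged from histograms like those on the slide.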
Collinearity
• Causes computational problems because it makes the determinant of the matrix of X-variables close to zero, and matrix inversion basically involves dividing by the determinant (so the result is very sensitive to small differences in the numbers)
• Standard errors of the estimated regression slopes are inflated
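A minimal sketch of the determinant argument, assuming just two standardized predictors so that the matrix in question is their 2 x 2 correlation matrix. The correlation values are hypothetical and the helper names are made up for this illustration.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inverse2(m):
    """Inverse of a 2 x 2 matrix; every element is divided by the determinant."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]

# Hypothetical correlations between the two predictors.
for r in (0.0, 0.5, 0.9, 0.99, 0.999):
    corr = [[1.0, r], [r, 1.0]]
    d = det2(corr)              # equals 1 - r**2
    inv = inverse2(corr)
    print(f"r = {r:5.3f}   determinant = {d:.6f}   inverse diagonal = {inv[0][0]:8.1f}")
```

As r approaches 1 the determinant approaches zero and the elements of the inverse grow without bound, which is the mechanism behind the inflated standard errors.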
Detecting collinearity
• Check tolerance values
• Plot the variables
• Examine a matrix of correlation coefficients between predictor variables
Dealing with collinearity
• Omit predictor variables if they are highly correlated with other predictor variables that remain in the model
Correlations
[Figure: scatterplot matrix of the predictors LAT, LONG, MAP, MAT, JJAMAP and DJFMAP]
Correlations
(Pearson Correlation, with Sig. (2-tailed) in parentheses; N = 73 for every pair)

         LAT              LONG             MAP              MAT              JJAMAP           DJFMAP
LAT      1                .097 (.416)      -.247* (.036)    -.839** (.000)   .074 (.533)      -.065 (.584)
LONG     .097 (.416)      1                -.734** (.000)   -.213 (.070)     -.492** (.000)   .771** (.000)
MAP      -.247* (.036)    -.734** (.000)   1                .355** (.002)    .112 (.344)      -.405** (.000)
MAT      -.839** (.000)   -.213 (.070)     .355** (.002)    1                -.081 (.497)     .001 (.990)
JJAMAP   .074 (.533)      -.492** (.000)   .112 (.344)      -.081 (.497)     1                -.792** (.000)
DJFMAP   -.065 (.584)     .771** (.000)    -.405** (.000)   .001 (.990)      -.792** (.000)   1

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
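One way to use this correlation matrix for screening is to list the predictor pairs whose absolute correlation exceeds some cut-off. The sketch below uses the Pearson correlations reported above; the cut-off of 0.7 is an assumption made for illustration, not a value given in the slides.

```python
# Pearson correlations as reported in the table (upper triangle only).
REPORTED_R = {
    ("LAT", "LONG"): 0.097, ("LAT", "MAP"): -0.247, ("LAT", "MAT"): -0.839,
    ("LAT", "JJAMAP"): 0.074, ("LAT", "DJFMAP"): -0.065,
    ("LONG", "MAP"): -0.734, ("LONG", "MAT"): -0.213,
    ("LONG", "JJAMAP"): -0.492, ("LONG", "DJFMAP"): 0.771,
    ("MAP", "MAT"): 0.355, ("MAP", "JJAMAP"): 0.112, ("MAP", "DJFMAP"): -0.405,
    ("MAT", "JJAMAP"): -0.081, ("MAT", "DJFMAP"): 0.001,
    ("JJAMAP", "DJFMAP"): -0.792,
}

THRESHOLD = 0.7  # assumed cut-off for "highly correlated"

def highly_correlated(pairs, threshold):
    """Return (var1, var2, r) with |r| >= threshold, strongest first."""
    flagged = [(a, b, r) for (a, b), r in pairs.items() if abs(r) >= threshold]
    return sorted(flagged, key=lambda item: -abs(item[2]))

flagged = highly_correlated(REPORTED_R, THRESHOLD)
for a, b, r in flagged:
    print(f"{a:7s} {b:7s} r = {r:+.3f}")
```

Under this cut-off, four pairs are flagged, so at most one variable from each pair would normally be kept in the same model.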
ln(C3) = β0 + β1(lat) + β2(long) + β3(lat × long)
Coefficients (a)

Model 1       B        Std. Error   Beta      t        Sig.   Tolerance   VIF
(Constant)    7.391    3.625                  2.039    .045
LAT           -.191    .091         -3.095    -2.101   .039   .003        307.745
LONG          -.093    .035         -1.824    -2.659   .010   .015        66.784
LOXLA         .002     .001         4.323     2.572    .012   .002        400.939

B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients. Tolerance, VIF: Collinearity Statistics.
a. Dependent Variable: LC3
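The two collinearity statistics carry the same information: VIF is the reciprocal of tolerance, and tolerance is 1 minus the r2 obtained when a predictor is regressed on the other predictors. A quick check against the values reported above (the variable names in the code are just labels for this sketch):

```python
# VIF and tolerance as reported in the coefficients table (uncentered model).
REPORTED_VIF = {"LAT": 307.745, "LONG": 66.784, "LOXLA": 400.939}
REPORTED_TOLERANCE = {"LAT": 0.003, "LONG": 0.015, "LOXLA": 0.002}

for name, vif in REPORTED_VIF.items():
    tolerance = 1.0 / vif          # tolerance = 1 / VIF
    r2_other = 1.0 - tolerance     # r2 of this predictor on the other predictors
    print(f"{name:6s} VIF = {vif:8.3f}  tolerance = {tolerance:.3f}  "
          f"r2 with other predictors = {r2_other:.4f}")
```

A tolerance of .003 therefore means that more than 99% of the variation in LAT is shared with LONG and the product term.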
After centering both lat and long

Coefficients (a)

Model 1       B        Std. Error   Beta      t         Sig.   Tolerance   VIF
(Constant)    -.553    .027                   -20.131   .000
LONRE         -.003    .004         -.051     -.597     .552   .980        1.020
LATRE         .048     .006         .783      8.484     .000   .827        1.209
RELALO        .002     .001         .238      2.572     .012   .820        1.220

B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients. Tolerance, VIF: Collinearity Statistics.
a. Dependent Variable: LC3
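Centering helps because a raw product such as lat × long is dominated by the size of its components, so it is strongly correlated with them; subtracting the means first removes most of that dependence. The sketch below illustrates this with hypothetical coordinates (not the study sites); `pearson` and `center` are helpers defined only for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical site coordinates.
lat = [30, 33, 37, 41, 44, 48]
long = [118, 97, 110, 102, 95, 113]

raw_product = [a * b for a, b in zip(lat, long)]
c_lat, c_long = center(lat), center(long)
centered_product = [a * b for a, b in zip(c_lat, c_long)]

r_raw = pearson(lat, raw_product)
r_centered = pearson(c_lat, centered_product)
print(f"corr(lat, lat x long), raw:      {r_raw:.2f}")
print(f"corr(lat, lat x long), centered: {r_centered:.2f}")
```

The interaction coefficient and its t value are unchanged by centering (.002 and 2.572 in both tables); only the main-effect estimates and the collinearity statistics change.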
Analysis of variance

Source of variation   SS                  df        MS
Regression            Σ(yhat − Ȳ)²        p         Σ(yhat − Ȳ)² / p
Residual              Σ(yobs − yhat)²     n−p−1     Σ(yobs − yhat)² / (n−p−1)
Total                 Σ(yobs − Ȳ)²        n−1
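The table can be turned into a small calculation. The sketch below adds the usual F-ratio (MS Regression / MS Residual), which is not shown in the table itself. As an assumption for illustration, the sums of squares are scaled so that SS Total = 1, which makes SS Regression equal to the r2 of 0.48 reported in these slides for the Lat + Long model (n = 73, p = 2); the function name is made up.

```python
def anova_table(ss_regression, ss_residual, n, p):
    """Fill in df, MS and F for a regression with p predictors and n observations."""
    df_regression = p
    df_residual = n - p - 1
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "df_regression": df_regression,
        "df_residual": df_residual,
        "df_total": n - 1,
        "MS_regression": ms_regression,
        "MS_residual": ms_residual,
        "F": ms_regression / ms_residual,
        "r2": ss_regression / (ss_regression + ss_residual),
    }

# SS scaled so that SS_total = 1 (assumption), hence SS_regression = r2 = 0.48.
table = anova_table(ss_regression=0.48, ss_residual=0.52, n=73, p=2)
for key, value in table.items():
    print(f"{key:14s} {value:.4f}")
```

Because F depends only on the ratio of the two sums of squares, rescaling them does not change it.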
Matrix algebra approach to OLS estimation of multiple regression models
• Y = Xβ + ε
• X’Xb = X’Y
• b = (X’X)⁻¹(X’Y)
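The normal equations can be solved directly. The following is a minimal pure-Python sketch, assuming a small made-up data set with an exact linear relationship; it solves X'Xb = X'Y by Gaussian elimination rather than forming the inverse explicitly, and all function names are illustrative.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting.
    Assumes a is square and non-singular (no collinearity check)."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

def ols(x_rows, y):
    """Return b (intercept first) from the normal equations X'Xb = X'Y."""
    X = [[1.0] + list(row) for row in x_rows]   # add the column of ones
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, XtY)

# Made-up data generated from y = 1 + 2*x1 - 3*x2 with no error term.
x_rows = [[0, 1], [1, 0], [2, 2], [3, 1]]
y = [-2, 3, -1, 4]
b = ols(x_rows, y)
print([round(v, 6) for v in b])
```

With nearly collinear columns in X, the pivots in `solve` become very small, which is the numerical face of the determinant problem described under Collinearity.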
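Criteria for “best” fitting in multiple regression with p predictors.

Criterion: Formula
r2: $r^2 = \dfrac{SS_{Regression}}{SS_{Total}} = 1 - \dfrac{SS_{Residual}}{SS_{Total}}$
Adjusted r2: $1 - \left(\dfrac{n-1}{n-p}\right)(1 - r^2)$
Akaike Information Criteria AIC: $-2\left[-\dfrac{n}{2}\left(\ln\left(\dfrac{2\pi\,SS_{Residual}}{n}\right) + 1\right)\right] + 2\left[\dfrac{pn}{n-p-1}\right]$
Akaike Information Criteria AIC: $n\left[\ln\left(\dfrac{SS_{Residual}}{n}\right)\right] + 2\left[\dfrac{pn}{n-p-1}\right]$

(The Adj r2 values reported for the fitted models in these slides appear to correspond to p counting the intercept as well as the predictors, that is, to a denominator equal to the residual df.)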
Hierarchical partitioning and model selection

No. pred   Model                      r2       Adj r2   P        AIC (R)
1          Long                       0.0006   -0.013   0.84     30.15
1          Lat                        0.47     0.46     <0.001   -16.16
2          Long + Lat                 0.48     0.46     <0.001   -15.25
3          Long + Lat + Long x Lat    0.54     0.52     <0.001   -22.55
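The adjusted r2 column can be approximately recovered from the r2 column. The sketch below assumes the standard form 1 − [(n − 1)/(n − p − 1)](1 − r2), with p the number of predictors and n = 73 sites; because the reported r2 values are rounded, agreement is only expected to about two decimals.

```python
def adjusted_r2(r2, n, n_predictors):
    """Adjusted r2 using the residual degrees of freedom n - p - 1."""
    return 1.0 - (n - 1) / (n - n_predictors - 1) * (1.0 - r2)

N_SITES = 73

# (model, number of predictors, reported r2, reported adjusted r2)
ROWS = [
    ("Long", 1, 0.0006, -0.013),
    ("Lat", 1, 0.47, 0.46),
    ("Long + Lat", 2, 0.48, 0.46),
    ("Long + Lat + Long x Lat", 3, 0.54, 0.52),
]

for model, p, r2, reported in ROWS:
    computed = adjusted_r2(r2, N_SITES, p)
    print(f"{model:26s} r2 = {r2:<7} adj r2 = {computed:7.4f} (reported {reported})")
```

Adding Long to Lat raises r2 from 0.47 to 0.48 but leaves the adjusted value at 0.46, which shows how the adjustment penalizes a predictor that explains almost nothing.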
[Figure: C3 plotted against Longitude and Latitude; Model Lat + Long, R2 = 0.48]
[Figure: fitted values Y_hats.longlat and Y_hats.longxlat plotted against cLAT and cLONG]
[Figure: C3 grasses in North America, Model Lat * Long; relative abundance against Longitude (95 to 120) at 35 Lat and 45 Lat]
The final forward selection model is:

Step: AIC=-228.67
SQRT_C3 ~ LAT + MAP + JJAMAP + DJFMAP

         Df  Sum of Sq     RSS      AIC
<none>                  2.7759  -228.67
+ LONG    1  0.0209705  2.7549  -227.23
+ MAT     1  0.0001829  2.7757  -226.68

Call:
lm(formula = SQRT_C3 ~ LAT + MAP + JJAMAP + DJFMAP)

Coefficients:
(Intercept)          LAT          MAP       JJAMAP       DJFMAP
 -0.7892663    0.0391180    0.0001538   -0.8573419   -0.7503936
The final backward selection model is:

Step: AIC=-229.32
SQRT_C3 ~ LAT + JJAMAP + DJFMAP

           Df  Sum of Sq     RSS      AIC
<none>                    2.8279  -229.32
- DJFMAP    1    0.26190  3.0898  -224.85
- JJAMAP    1    0.31489  3.1428  -223.61
- LAT       1    2.82772  5.6556  -180.72

Call:
lm(formula = SQRT_C3 ~ LAT + JJAMAP + DJFMAP)

Coefficients:
(Intercept)          LAT       JJAMAP       DJFMAP
   -0.53148      0.03748     -1.02823     -1.05164
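The AIC values in the selection output above can be recovered from the RSS column. The sketch below assumes the form n ln(RSS/n) + 2k, where k is the number of estimated coefficients including the intercept and n = 73; this appears to be what R's step() reports for lm fits, and it differs from the small-sample penalty 2[pn/(n − p − 1)] given in the criteria table. The function name is made up for this example.

```python
import math

def step_aic(rss, n, n_coef):
    """AIC in the form n*ln(RSS/n) + 2k, with k counting the intercept."""
    return n * math.log(rss / n) + 2 * n_coef

N_SITES = 73

# (model, RSS from the output, number of coefficients, AIC from the output)
MODELS = [
    ("LAT + MAP + JJAMAP + DJFMAP", 2.7759, 5, -228.67),
    ("LAT + JJAMAP + DJFMAP", 2.8279, 4, -229.32),
    ("LAT + JJAMAP (DJFMAP dropped)", 3.0898, 3, -224.85),
    ("LAT + DJFMAP (JJAMAP dropped)", 3.1428, 3, -223.61),
    ("JJAMAP + DJFMAP (LAT dropped)", 5.6556, 3, -180.72),
]

for model, rss, k, reported in MODELS:
    computed = step_aic(rss, N_SITES, k)
    print(f"{model:32s} AIC = {computed:8.2f} (reported {reported})")
```

Backward selection stops at the three-predictor model because every possible deletion raises the AIC; forward selection stops one predictor later because adding LONG or MAT raises it.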