Assumptions of linear regression

[Figure: # predator species plotted against # prey species; fitted line y = 1.16x + 4.17, R² = 0.49]

There is a hypothesis about dependent and independent variables.
The relation is supposed to be linear.
We have a hypothesis about the distribution of errors around the hypothesized regression line.

[Figure: Mammals; brain weight [g] plotted against body weight [kg] on log-log axes; fitted curve y = 9.24x^0.73, R² = 0.95]

There is a hypothesis about dependent and independent variables.
The relation is non-linear.
We have no data about the distribution of errors around the hypothesized regression line.

[Figure: Ground beetles at two adjacent sites; agricultural field plotted against poplar plantation on log-log axes; fitted curve y = 4.4x^0.53, R² = 0.19]

There is no clear hypothesis about dependent and independent variables.
The relation is non-linear.
We have no data about the distribution of errors around the hypothesized regression line.

Assumptions:

[Figure: # predator species plotted against # prey species; fitted line y = 1.16x + 4.17, R² = 0.49]

A linear model applies.
The x-variable has no error term.
The distribution of the y errors around the regression line is normal.

Least squares method
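[Figure: y plotted against x with a fitted line; the vertical deviations Δy between the data points and the line are marked]

$$D = \sum_{i=1}^{n} (\Delta y_i)^2 = \sum_{i=1}^{n} [y_i - (a x_i + b)]^2$$

$$\frac{\partial D}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0$$

$$\frac{\partial D}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0$$

$$a = \frac{\sigma_{xy}}{\sigma_x^2}$$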
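
The least squares solution above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the six data points are hypothetical, and the intercept b = mean(y) - a·mean(x) is the standard consequence of setting ∂D/∂b = 0.

```python
def least_squares_line(xs, ys):
    """Return slope a and intercept b minimising D = sum (y_i - (a*x_i + b))**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance; the common 1/(n-1) factor cancels in the ratio.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    b = mean_y - a * mean_x  # follows from dD/db = 0
    return a, b


def sum_of_squares(xs, ys, a, b):
    """The quantity D that the least squares method minimises."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data points, for illustration only.
xs = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]
ys = [3.0, 8.0, 9.0, 16.0, 18.0, 26.0]
a, b = least_squares_line(xs, ys)
d_min = sum_of_squares(xs, ys, a, b)
```

Any other choice of a or b gives a larger D than `d_min`, which is a quick way to check the result.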
The second example is nonlinear.
We hypothesize the allometric relation W = aB^z.

[Figure: Mammals; brain weight [g] plotted against body weight [kg] on log-log axes; fitted curve y = 9.24x^0.73, R² = 0.95]

Linearised regression model

$$W = aB^z \;\Rightarrow\; \ln W = \ln a + z \ln B$$

$$\ln W = \ln a + z \ln B + \varepsilon \;\Rightarrow\; W = aB^z e^{\varepsilon}$$

Assumption: The distribution of errors is lognormal.

Nonlinear regression model

$$W = aB^z \;\Rightarrow\; W = aB^z + \varepsilon$$

Assumption: The distribution of errors is normal.
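
A minimal sketch of the linearised approach, assuming hypothetical body and brain weights generated from W = 9.24·B^0.73 with small multiplicative errors (the data and the function name are illustrative, not taken from the slides). The fit is an ordinary least squares regression of ln W on ln B.

```python
import math


def fit_power_law_loglinear(bs, ws):
    """Fit W = a * B**z by least squares on ln W = ln a + z * ln B."""
    lx = [math.log(b) for b in bs]
    ly = [math.log(w) for w in ws]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    z = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    ln_a = my - z * mx
    return math.exp(ln_a), z


# Hypothetical data: W = 9.24 * B**0.73 * exp(eps); the eps values sum to zero.
body = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
eps = [0.2, -0.2, 0.1, -0.1, 0.2, -0.2]
brain = [9.24 * b ** 0.73 * math.exp(e) for b, e in zip(body, eps)]
a_hat, z_hat = fit_power_law_loglinear(body, brain)
```

Because the errors enter as exp(ε), they become additive after taking logarithms, which is the lognormal assumption of the linearised model.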
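[Figure, first panel: simulated data Y = e^{0.1X} + norm(0; Y), y plotted against x; fitted curves y = 1.04e^{0.098x} and y = 1.16e^{0.089x}]
[Figure, second panel: simulated data Y = X^{0.5} e^{norm(0; Y)}, y plotted against x on log-log axes; fitted curves y = 1.57x^{0.46} and y = 0.60x^{0.56}]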
In both cases we have some sort of autocorrelation.
Using logarithms reduces the effect of autocorrelation and makes the distribution of errors more homogeneous.
Nonlinear estimation, in contrast, puts more weight on the larger y-values.
If there is no autocorrelation, the log-transformation puts more weight on smaller values.
Linear regression
European bat species and environmental correlates
N=62
Data (first 33 of the 62 cases shown):

ln(Area)    ln(Number of species)
10.26632    3.258097
6.148468    0
11.33704    3.218876
7.696213    0.693147
8.519989    2.70805
12.24361    2.890372
10.3264     2.995732
10.84344    3.178054
12.40519    2.890372
11.61702    3.496508
8.891512    2.197225
5.703782    1.609438
9.068777    3.044522
9.019059    2.833213
10.94366    3.526361
7.824046    1.098612
9.132379    2.890372
11.27551    3.178054
10.67112    2.639057
7.887209    2.639057
10.71945    2.397895
7.243513    0
12.73123    2.397895
13.20664    3.465736
12.78555    3.218876
1.871802    1.609438
11.7905     3.496508
11.44094    3.332205
11.54248    0
11.16014    2.397895
12.6162     3.433987
9.615805    2.564949
11.07637    2.772589

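$$\ln(S) = a_0 + a_1 \ln(A)$$

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = a_0 \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} + a_1 \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \end{pmatrix}$$

$$Y = XA$$
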
Matrix approach to linear regression
X is not a square matrix, hence $X^{-1}$ doesn't exist.

$$X'Y = X'XA$$

$$(X'X)^{-1}X'Y = (X'X)^{-1}X'XA = IA = A$$

$$A = (X'X)^{-1}X'Y$$
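
The matrix solution can be sketched for the two-parameter case without any library. This is an illustrative example under stated assumptions: the function name and the four data points are hypothetical, and the 2x2 inverse is written out by hand.

```python
def normal_equations_simple(xs, ys):
    """Solve A = (X'X)^-1 X'Y for the design matrix X = [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X'X = [[n, sx], [sx, sxx]] and X'Y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X'X is singular; all x values are identical")
    # Inverse of the symmetric 2x2 matrix X'X
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    a0 = inv[0][0] * sy + inv[0][1] * sxy
    a1 = inv[1][0] * sy + inv[1][1] * sxy
    return a0, a1


# Hypothetical data, for illustration only.
a0, a1 = normal_equations_simple([1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8])
```

X'X is square even though X is not, which is why the inverse exists as long as the x values are not all identical.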
The species – area relationship of European bats
Y = ln(Number of species); X = (Constant, ln(Area)); first 27 cases shown:

ln(Number of species)    Constant    ln(Area)
3.258097                 1           10.26632
0                        1           6.148468
3.218876                 1           11.33704
0.693147                 1           7.696213
2.70805                  1           8.519989
2.890372                 1           12.24361
2.995732                 1           10.3264
3.178054                 1           10.84344
2.890372                 1           12.40519
3.496508                 1           11.61702
2.197225                 1           8.891512
1.609438                 1           5.703782
3.044522                 1           9.068777
2.833213                 1           9.019059
3.526361                 1           10.94366
1.098612                 1           7.824046
2.890372                 1           9.132379
3.178054                 1           11.27551
2.639057                 1           10.67112
2.639057                 1           7.887209
2.397895                 1           10.71945
0                        1           7.243513
2.397895                 1           12.73123
3.465736                 1           13.20664
3.218876                 1           12.78555
1.609438                 1           1.871802
3.496508                 1           11.7905

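X' (first 13 columns shown):
1         1         1         1         1         1         1        1         1         1         1         1         1
10.26632  6.148468  11.33704  7.696213  8.519989  12.24361  10.3264  10.84344  12.40519  11.61702  8.891512  5.703782  9.068777

$$X'X = \begin{pmatrix} 62 & 607.1316 \\ 607.1316 & 6518.161 \end{pmatrix}, \qquad X'Y = \begin{pmatrix} 154.2937 \\ 1647.908 \end{pmatrix}$$

$$(X'X)^{-1} = \begin{pmatrix} 0.183521 & -0.01709 \\ -0.01709 & 0.001746 \end{pmatrix}$$

$$A = (X'X)^{-1}(X'Y) = \begin{pmatrix} a_0 \\ a_1 \end{pmatrix} = \begin{pmatrix} 0.146808 \\ 0.239144 \end{pmatrix}$$

[Figure: ln(# species) plotted against ln(Area); fitted line y = 0.2391x + 0.1468, R² = 0.4614]
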
What about the part of variance
explained by our model?
$$R = \frac{1}{n-1} \Sigma_X^{-1} (X - M)'(X - M) \Sigma_X^{-1}$$

$$\ln S = 0.24 \ln A + 0.15$$

$$S = e^{0.15} A^{0.24} = 1.16 A^{0.24}$$

1.16: Average number of species per unit area (species density)
0.24: Spatial species turnover

$$R = \frac{1}{n-1} \Sigma_X^{-1} (X - M)'(X - M) \Sigma_X^{-1}$$

X - M (both variables centred on their means; first 32 cases shown):

ln(Number of species) - mean    ln(Area) - mean
0.769488                        0.473878
-2.48861                        -3.64398
0.730267                        1.54459
-1.79546                        -2.09623
0.219442                        -1.27246
0.401763                        2.451164
0.507124                        0.533954
0.689445                        1.050991
0.401763                        2.612741
1.007899                        1.824579
-0.29138                        -0.90093
-0.87917                        -4.08866
0.555914                        -0.72367
0.344605                        -0.77339
1.037752                        1.151213
-1.39                           -1.9684
0.401763                        -0.66007
0.689445                        1.48306
0.150449                        0.878671
0.150449                        -1.90524
-0.09071                        0.927004
-2.48861                        -2.54893
-0.09071                        2.938785
0.977127                        3.414195
0.730267                        2.993105
-0.87917                        -7.92064
1.007899                        1.998051
0.843596                        1.64849
-2.48861                        1.750039
-0.09071                        1.367698
0.945379                        2.823752
0.076341                        -0.17664

(X-M)' (first 6 columns shown):
0.769488  -2.48861  0.730267  -1.79546  0.219442  0.401763
0.473878  -3.64398  1.54459   -2.09623  -1.27246  2.451164

$$(X-M)'(X-M) = \begin{pmatrix} 71.0087 & 136.9954 \\ 136.9954 & 572.8582 \end{pmatrix}$$

$$\frac{1}{n-1}(X-M)'(X-M) = \begin{pmatrix} 1.164077 & 2.245826 \\ 2.245826 & 9.391119 \end{pmatrix}$$

$$\Sigma_X = \begin{pmatrix} 1.078924 & 0 \\ 0 & 3.064493 \end{pmatrix}, \qquad \Sigma_X^{-1} = \begin{pmatrix} 0.926849 & 0 \\ 0 & 0.326318 \end{pmatrix}$$

$$\Sigma_X^{-1} \frac{(X-M)'(X-M)}{n-1} = \begin{pmatrix} 1.078924 & 2.081542 \\ 0.732854 & 3.064493 \end{pmatrix}$$

$$R = \Sigma_X^{-1} \frac{(X-M)'(X-M)}{n-1} \Sigma_X^{-1} = \begin{pmatrix} 1 & 0.679245 \\ 0.679245 & 1 \end{pmatrix}$$

Squaring each element of R gives the coefficients of determination:

$$\begin{pmatrix} 1 & 0.461374 \\ 0.461374 & 1 \end{pmatrix}$$

[Figure: ln(# species) plotted against ln(Area); fitted line y = 0.2391x + 0.1468, R² = 0.4614]

How to interpret the coefficient of determination
[Figure: ln(# species) plotted against ln(Area); fitted line y = 0.2391x + 0.1468, R² = 0.4614]

Total variance:

$$\sigma^2_{Y;M} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

Rest (unexplained) variance:

$$\sigma^2_{Y;Y(X)} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - Y(X_i))^2$$

Explained variance:

$$\sigma^2_{Y(X);M} = \frac{1}{n-1} \sum_{i=1}^{n} (Y(X_i) - \bar{Y})^2$$

$$\sigma^2_{Y;M} = \sigma^2_{Y;Y(X)} + \sigma^2_{Y(X);M}$$

$$R^2 = 1 - \frac{\text{Residual variance}}{\text{Total variance}} = 1 - \frac{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - Y(X_i))^2}{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2} = \frac{\sigma^2_{Y(X);M}}{\sigma^2_{Y;M}} = \frac{\sum_{i=1}^{n} (Y(X_i) - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

Statistical testing is done by an F or a t-test.

$$F = \frac{R^2}{1 - R^2} \, df$$

$$t = \sqrt{F} = \frac{R}{\sqrt{1 - R^2}} \sqrt{df}$$
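
The decomposition and the two test statistics can be sketched as follows. The function names are hypothetical, and taking df = n - 2 for a regression with one predictor is an assumption, since the slides do not state the degrees of freedom; R² = 0.4614 and N = 62 are the values quoted for the bat data.

```python
import math


def r_squared(ys, predicted):
    """R^2 = 1 - residual variance / total variance; the 1/(n-1) factors cancel."""
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return 1.0 - residual / total


def f_and_t(r2, df):
    """F = R^2 / (1 - R^2) * df and t = sqrt(F)."""
    f = r2 / (1.0 - r2) * df
    return f, math.sqrt(f)


# R^2 and N as quoted in the text; df = n - 2 is an assumption.
f_value, t_value = f_and_t(0.4614, 62 - 2)
```

A perfect fit gives `r_squared` of 1, and predicting the mean for every case gives 0, which brackets the values seen in practice.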

$$\ln(S) = a_0 + a_1 \ln(A) + a_2 T + a_3 N_{T<0} + a_4 L$$

The general linear model
$$Y = a_0 + a_1 X_1 + a_2 X_2 + a_3 X_3 + \dots + a_n X_n = a_0 + \sum_{i=1}^{n} a_i X_i$$
A model that assumes that a dependent variable Y can be expressed by a linear combination of predictor variables X is called a linear model.

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,n} \\ 1 & x_{2,1} & \dots & x_{2,n} \\ \vdots & \vdots & & \vdots \\ 1 & x_{m,1} & \dots & x_{m,n} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} = XA$$

$$X'Y = X'XA$$

$$(X'X)^{-1}X'Y = (X'X)^{-1}X'XA = IA = A$$

$$A = (X'X)^{-1}X'Y$$

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,n} \\ 1 & x_{2,1} & \dots & x_{2,n} \\ \vdots & \vdots & & \vdots \\ 1 & x_{m,1} & \dots & x_{m,n} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_m \end{pmatrix} = XA + E$$

The vector E contains the error terms of each regression. The aim is to minimize E.
The general linear model
$$Y = a_0 + a_1 X_1 + a_2 X_2 + a_3 X_3 + \dots + a_n X_n = a_0 + \sum_{i=1}^{n} a_i X_i$$

If the errors of the predictor variables are Gaussian, the error term ε should also be Gaussian, and means and variances are additive.

$$Y = a_0 + a_1 X_1 + a_2 X_2 + a_3 X_3 + \dots + a_n X_n + \varepsilon = a_0 + \sum_{i=1}^{n} a_i X_i + \varepsilon$$

$$\mu(Y) = a_0 + \sum_{i=1}^{n} a_i \mu(X_i) + \mu(\varepsilon)$$

$$\sigma^2(Y) = \sigma^2\left(a_0 + \sum_{i=1}^{n} a_i X_i\right) + \sigma^2(\varepsilon)$$

Total variance = Explained variance + Unexplained (rest) variance

$$R^2 = \frac{\sigma^2\left(a_0 + \sum_{i=1}^{n} a_i X_i\right)}{\sigma^2(Y)} = \frac{\sigma^2(Y) - \sigma^2(\varepsilon)}{\sigma^2(Y)}$$

$$\ln(S) = a_0 + a_1 \ln(A) + a_2 N_{T<0} + a_3 L$$

Y = ln(Number of species); X = (Constant, ln(Area), Days below zero, Latitude of capitals in decimal degrees); first 29 cases shown:

Country/Island            ln(Number of species)  Constant  ln(Area)  Days below zero  Latitude
Albania                   3.258097               1         10.26632  34               41.33
Andorra                   0                      1         6.148468  60               42.5
Austria                   3.218876               1         11.33704  92               48.12
Azores                    0.693147               1         7.696213  1                37.73
Baleary Islands           2.70805                1         8.519989  18               39.55
Belarus                   2.890372               1         12.24361  144              53.87
Belgium                   2.995732               1         10.3264   50               50.9
Bosnia and Herzegovina    3.178054               1         10.84344  114              43.82
British islands           2.890372               1         12.40519  64               51.15
Bulgaria                  3.496508               1         11.61702  102              42.65
Canary Islands            2.197225               1         8.891512  1                27.93
Channel Is.               1.609438               1         5.703782  12               49.22
Corsica                   3.044522               1         9.068777  11               41.92
Crete                     2.833213               1         9.019059  1                35.33
Croatia                   3.526361               1         10.94366  114              45.82
Cyclades Is.              1.098612               1         7.824046  1                37.1
Cyprus                    2.890372               1         9.132379  2                35.15
Czech Republic            3.178054               1         11.27551  119              50.1
Denmark                   2.639057               1         10.67112  85               55.63
Dodecanese Is.            2.639057               1         7.887209  2                36.4
Estonia                   2.397895               1         10.71945  143              59.35
Faroe Is.                 0                      1         7.243513  35               62
Finland                   2.397895               1         12.73123  169              60.32
France                    3.465736               1         13.20664  50               48.73
Germany                   3.218876               1         12.78555  97               52.38
Gibraltar                 1.609438               1         1.871802  0                36.1
Greece                    3.496508               1         11.7905   2                37.9
Hungary                   3.332205               1         11.44094  100              47.43
Iceland                   0                      1         11.54248  133              64.13

Multiple regression
1. Model formulation
2. Estimation of model parameters
3. Estimation of statistical significance
$$Y = XA$$

$$A = (X'X)^{-1}X'Y$$

X' (first 8 columns shown):
1         1         1         1         1         1         1        1
10.26632  6.148468  11.33704  7.696213  8.519989  12.24361  10.3264  10.84344
34        60        92        1         18        144       50       114
41.33     42.5      48.12     37.73     39.55     53.87     50.9     43.82

$$X'X = \begin{pmatrix} 62 & 607.1316 & 4328 & 2906.4 \\ 607.1316 & 6518.161 & 48545.59 & 29086.57 \\ 4328 & 48545.59 & 534136 & 228951.7 \\ 2906.4 & 29086.57 & 228951.7 & 141148.1 \end{pmatrix}$$

$$(X'X)^{-1} = \begin{pmatrix} 1.019166 & -0.02275 & 0.00261 & -0.02053 \\ -0.02275 & 0.002458 & -7.5 \times 10^{-5} & 8.3 \times 10^{-5} \\ 0.00261 & -7.5 \times 10^{-5} & 1.3 \times 10^{-5} & -5.9 \times 10^{-5} \\ -0.02053 & 8.3 \times 10^{-5} & -5.9 \times 10^{-5} & 0.000509 \end{pmatrix}$$

(X'X)^-1 X' (first 8 columns shown):
0.025783  0.163309  0.013407  0.07203   0.060295  0.010457  -0.13031  0.170347
0.003376  -0.00859  0.002243  -0.00078  0.00013   0.001069  0.003124  -0.00097
-0.00017  0.000405  9.87E-05  -0.00019  -0.00014  0.000364  -0.00054  0.000676
-0.00066  -0.00195  -0.00056  -0.00074  -0.00076  -0.00064  0.003269  -0.00409

$$X'Y = \begin{pmatrix} 154.2937 \\ 1647.908 \\ 11289.32 \\ 7137.716 \end{pmatrix}$$

$$A = (X'X)^{-1}X'Y = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix} = \begin{pmatrix} 2.679757 \\ 0.290121 \\ 0.002155 \\ -0.06789 \end{pmatrix}$$

Multiple R and R²

$$SE = \frac{\mathrm{trace}(R^{-1})(1 - R^2)}{n - k - 1}$$

R: correlation matrix
n: number of cases
k: number of independent variables in the model

$$t = \frac{\text{parameter}}{SE(\text{parameter})}$$

D<0 (days below zero) is statistically not significant and should be eliminated from the model.

Adjusted R²

$$R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - k - 1}$$

$$F = \frac{\sigma_1^2 / df_1}{\sigma_2^2 / df_2} = \frac{R^2}{1 - R^2} \cdot \frac{n - k - 1}{k} = \frac{0.66646}{0.33354} \cdot \frac{62 - 3 - 1}{3} = 38.6307$$
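
The adjusted R² and the F value can be recomputed directly from the numbers quoted above (R² = 0.66646, n = 62, k = 3). A minimal sketch with hypothetical function names:

```python
def adjusted_r_squared(r2, n, k):
    """R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """F = R^2 / (1 - R^2) * (n - k - 1) / k."""
    return r2 / (1.0 - r2) * (n - k - 1) / k


# Values quoted in the text for the model with three predictors.
r2, n, k = 0.66646, 62, 3
r2_adj = adjusted_r_squared(r2, n, k)
f_value = f_statistic(r2, n, k)
```

The adjustment lowers R² slightly because each additional predictor uses up one degree of freedom.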
A mixed model
$$\ln S = a_0 + a_1 \ln A + a_2 D_{T<0} + a_3 L + a_4 L^2$$

Y = ln(Number of species); X = (Constant, ln(Area), Days below zero, Latitude of capitals in decimal degrees, Latitude²); first 33 cases shown:

Country/Island            ln(Number of species)  Constant  ln(Area)  Days below zero  Latitude  Latitude²
Albania                   3.258097               1         10.26632  34               41.33     1708.169
Andorra                   0                      1         6.148468  60               42.5      1806.25
Austria                   3.218876               1         11.33704  92               48.12     2315.534
Azores                    0.693147               1         7.696213  1                37.73     1423.553
Baleary Islands           2.70805                1         8.519989  18               39.55     1564.203
Belarus                   2.890372               1         12.24361  144              53.87     2901.977
Belgium                   2.995732               1         10.3264   50               50.9      2590.81
Bosnia and Herzegovina    3.178054               1         10.84344  114              43.82     1920.192
British islands           2.890372               1         12.40519  64               51.15     2616.323
Bulgaria                  3.496508               1         11.61702  102              42.65     1819.023
Canary Islands            2.197225               1         8.891512  1                27.93     780.0849
Channel Is.               1.609438               1         5.703782  12               49.22     2422.608
Corsica                   3.044522               1         9.068777  11               41.92     1757.286
Crete                     2.833213               1         9.019059  1                35.33     1248.209
Croatia                   3.526361               1         10.94366  114              45.82     2099.472
Cyclades Is.              1.098612               1         7.824046  1                37.1      1376.41
Cyprus                    2.890372               1         9.132379  2                35.15     1235.523
Czech Republic            3.178054               1         11.27551  119              50.1      2510.01
Denmark                   2.639057               1         10.67112  85               55.63     3094.697
Dodecanese Is.            2.639057               1         7.887209  2                36.4      1324.96
Estonia                   2.397895               1         10.71945  143              59.35     3522.423
Faroe Is.                 0                      1         7.243513  35               62        3844
Finland                   2.397895               1         12.73123  169              60.32     3638.502
France                    3.465736               1         13.20664  50               48.73     2374.613
Germany                   3.218876               1         12.78555  97               52.38     2743.664
Gibraltar                 1.609438               1         1.871802  0                36.1      1303.21
Greece                    3.496508               1         11.7905   2                37.9      1436.41
Hungary                   3.332205               1         11.44094  100              47.43     2249.605
Iceland                   0                      1         11.54248  133              64.13     4112.657
Ireland                   2.397895               1         11.16014  23               53.43     2854.765
Italy                     3.433987               1         12.6162   18               41.8      1747.24
Kaliningrad Region        2.564949               1         9.615805  110              52.7      2777.29
Latvia                    2.772589               1         11.07637  124              56.96     3244.442

X' (first 8 columns shown):
1         1         1         1         1         1         1        1
10.26632  6.148468  11.33704  7.696213  8.519989  12.24361  10.3264  10.84344
34        60        92        1         18        144       50       114
41.33     42.5      48.12     37.73     39.55     53.87     50.9     43.82
1708.169  1806.25   2315.534  1423.553  1564.203  2901.977  2590.81  1920.192

$$X'X = \begin{pmatrix} 62 & 607.1316 & 4328 & 2906.4 & 141148.1 \\ 607.1316 & 6518.161 & 48545.59 & 29086.57 & 1441737 \\ 4328 & 48545.59 & 534136 & 228951.7 & 12488619 \\ 2906.4 & 29086.57 & 228951.7 & 141148.1 & 7106497 \\ 141148.1 & 1441737 & 12488619 & 7106497 & 3.71 \times 10^{8} \end{pmatrix}$$

$$(X'X)^{-1} = \begin{pmatrix} 6.45421 & 0.000497 & 0.001087 & -0.25606 & 0.002409 \\ 0.000497 & 0.002557 & -8.1 \times 10^{-5} & -0.00092 & 1.03 \times 10^{-5} \\ 0.001087 & -8.1 \times 10^{-5} & 1.34 \times 10^{-5} & 6.63 \times 10^{-6} & -6.8 \times 10^{-7} \\ -0.25606 & -0.00092 & 6.63 \times 10^{-6} & 0.010716 & -0.0001 \\ 0.002409 & 1.03 \times 10^{-5} & -6.8 \times 10^{-7} & -0.0001 & 1.07 \times 10^{-6} \end{pmatrix}$$

(X'X)^-1 X' (first 8 columns shown):
0.028519  -0.00857  -0.18332  0.227512  0.119213  -0.18587  -0.27812  -0.01106
0.003388  -0.00932  0.001402  -0.00011  0.000382  0.000229  0.002492  -0.00174
-0.00017  0.000453  0.000154  -0.00024  -0.00016  0.000419  -0.00049  0.000727
-0.00078  0.0055    0.007968  -0.00748  -0.00331  0.007864  0.009674  0.003767
1.21E-06  -7.6E-05  -8.7E-05  6.89E-05  2.61E-05  -8.7E-05  -6.6E-05  -8E-05

$$A = (X'X)^{-1}X'Y = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} = \begin{pmatrix} -3.40816 \\ 0.264082 \\ 0.003862 \\ 0.195932 \\ -0.0027 \end{pmatrix}$$

The final model
$$\ln S = -3.41 + 0.26 \ln A + 0.004 D_{T<0} + 0.196 L - 0.0027 L^2$$

-3.41: Very low species density (log-scale!)
0.26: Realistic increase of species richness with area
0.004: Increase of species richness with winter length
0.196: Increase of species richness at higher latitudes
-0.0027: A peak of species richness at intermediate latitudes

Is this model realistic?
The model makes realistic predictions.

[Figure: ln(# species predicted) plotted against ln(# species observed); fitted line y = 0.6966x + 0.7481, R² = 0.6973]

Problems might arise from the intercorrelation between the predictor variables (multicollinearity).
We solve the problem by a step-wise approach, eliminating the variables that are either not significant or give unreasonable parameter values.
The variance explanation of this final model is higher than that of the previous one.
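
The peak at intermediate latitudes follows from the quadratic term: setting d(ln S)/dL = a3 + 2·a4·L = 0 gives the latitude of maximum predicted richness. A minimal sketch using the coefficients quoted above; the function name is hypothetical, and because a4 = -0.0027 is given in rounded form the result is only approximate.

```python
def predicted_ln_species(ln_area, days_below_zero, latitude):
    """ln S predicted by the final model, using the coefficients quoted in the text."""
    return (-3.40816
            + 0.264082 * ln_area
            + 0.003862 * days_below_zero
            + 0.195932 * latitude
            - 0.0027 * latitude ** 2)


# Latitude of the peak: a3 + 2 * a4 * L = 0, hence L = -a3 / (2 * a4)
peak_latitude = -0.195932 / (2 * -0.0027)

# Prediction for the first row of the data table (Albania).
albania = predicted_ln_species(10.26632, 34, 41.33)
```

With these coefficients the peak lies at roughly 36 degrees of latitude, and the prediction for Albania (about 2.9) can be compared with the observed value of 3.26.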
Multiple regression solves systems of intrinsically linear algebraic equations
Polynomial regression
General additive model

$$Y = a_{10} + a_{11} X_1 + a_{12} X_1^2 + a_{13} X_1^3 + \dots + a_{21} X_2 + a_{22} X_2^2 + a_{23} X_2^3 + \dots + a_{n1} X_n + a_{n2} X_n^2 + a_{n3} X_n^3 + \dots$$

$$A = (X'X)^{-1}X'Y$$

• The matrix X'X must not be singular; that is, the variables have to be independent. Otherwise we speak of multicollinearity. Collinearities of r < 0.7 are in most cases tolerable.
• To be safely applied, multiple regression needs at least 10 times as many cases as variables in the model.
• Statistical inference assumes that errors have a normal distribution around the mean.
• The model assumes linear (or algebraic) dependencies. Check first for non-linearities.
• Check the distribution of residuals Yexp - Yobs. This distribution should be random.
• Check whether the parameters have realistic values.

Multiple regression is a hypothesis testing and not a hypothesis generating technique!

[Figure: ln(# species predicted) plotted against ln(# species observed); fitted line y = 0.6966x + 0.7481, R² = 0.6973]
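
A polynomial regression is still linear in its parameters, so the same normal equations apply once the powers of X are added as extra columns. The sketch below is illustrative only: the function names and the data are hypothetical, and Gaussian elimination is used here instead of forming the inverse explicitly.

```python
def solve_linear_system(m, v):
    """Solve m * a = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (collinear predictors)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * a[j] for j in range(i + 1, n))
        a[i] = (aug[i][n] - s) / aug[i][i]
    return a


def fit_linear_model(rows, ys):
    """Least squares coefficients from X'X A = X'Y; rows hold the predictor values."""
    x = [[1.0] + list(r) for r in rows]  # prepend the constant column
    p = len(x[0])
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * y for xi, y in zip(x, ys)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 1 + 2x - 0.5x^2
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1 + 2 * x - 0.5 * x ** 2 for x in xs]
coefficients = fit_linear_model([(x, x ** 2) for x in xs], ys)
```

If two predictor columns are identical or exact multiples of each other, X'X becomes singular and the solver raises an error, which is the multicollinearity problem named in the first bullet.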
Standardized coefficients of correlation
Z-transformed distributions have a mean of 0 and a standard deviation of 1.

$$Z = \frac{x - \mu}{\sigma}$$

$$B = (Z_X' Z_X)^{-1} Z_X' Z_Y$$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{s_X s_Y} = \frac{1}{n-1} \sum_{i=1}^{n} Z_X Z_Y$$

$$R = \frac{1}{n-1} Z'Z = \begin{pmatrix} r_{11} & \dots & r_{1n} \\ \vdots & & \vdots \\ r_{n1} & \dots & r_{nn} \end{pmatrix}$$

$$B = R_{XX}^{-1} R_{XY}$$

$$R_{XY} = R_{XX} B$$

In the case of bivariate regression Y = aX + b, R_XX = 1. Hence B = R_XY.
Hence the use of Z-transformed values results in standardized correlation coefficients, termed β-values.
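
The bivariate case can be checked numerically: after Z-transformation, the regression slope equals the correlation coefficient. A minimal sketch with hypothetical function names, using the first six (ln(Area), ln(Number of species)) pairs from the bat data as input.

```python
import math


def z_transform(values):
    """Centre on the mean and scale by the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson_r(xs, ys):
    """r = 1/(n-1) * sum of the products of the Z-transformed values."""
    zx, zy = z_transform(xs), z_transform(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


def standardized_slope(xs, ys):
    """B = (Zx'Zx)^-1 Zx'Zy for a single predictor."""
    zx, zy = z_transform(xs), z_transform(ys)
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)


# First six cases of the bat data: ln(Area) and ln(Number of species).
xs = [10.26632, 6.148468, 11.33704, 7.696213, 8.519989, 12.24361]
ys = [3.258097, 0.0, 3.218876, 0.693147, 2.70805, 2.890372]
beta = standardized_slope(xs, ys)
r = pearson_r(xs, ys)
```

Both values agree because Zx'Zx equals n - 1 for a Z-transformed variable, so the two formulas differ only in notation.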