Assumptions of linear regression

Example 1: predator and prey species
[Figure: # predator species against # prey species; regression line y = 1.16x + 4.17, R² = 0.49]
• There is a hypothesis about dependent and independent variables.
• The relation is supposed to be linear.
• We have a hypothesis about the distribution of errors around the hypothesized regression line.

Example 2: mammals
[Figure: brain weight [g] against body weight [kg], logarithmic axes; y = 9.24x^0.73, R² = 0.95]
• There is a hypothesis about dependent and independent variables.
• The relation is non-linear.
• We have no data about the distribution of errors around the hypothesized regression line.

Example 3: ground beetles at two adjacent sites
[Figure: agricultural field against poplar plantation, logarithmic axes; y = 4.4x^0.53, R² = 0.19]
• There is no clear hypothesis about dependent and independent variables.
• The relation is non-linear.
• We have no data about the distribution of errors around the hypothesized regression line.

Assumptions (example 1, y = 1.16x + 4.17, R² = 0.49):
• A linear model applies.
• The x-variable has no error term.
• The distribution of the y errors around the regression line is normal.

Least squares method
[Figure: scatter plot of y against x with the vertical deviations Δy between the data points and the regression line]

$$D = \sum_{i=1}^{n} (\Delta y_i)^2 = \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right]^2$$

$$\frac{\partial D}{\partial a} = -2 \sum_{i=1}^{n} x_i \,(y_i - a x_i - b) = 0$$

$$\frac{\partial D}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0$$

$$a = \frac{\sigma_{xy}}{\sigma_x^2}$$

The second condition gives the intercept, $b = \bar{y} - a\bar{x}$.

The second example is nonlinear
[Figure: mammals, brain weight [g] against body weight [kg]; y = 9.24x^0.73, R² = 0.95]
We hypothesize the allometric relation $W = aB^z$.

Linearised regression model:
$$W = aB^z \;\Rightarrow\; \ln W = \ln a + z \ln B + \varepsilon \;\Rightarrow\; W = aB^z e^{\varepsilon}$$
Assumption: the distribution of errors is lognormal.

Nonlinear regression model:
$$W = aB^z + \varepsilon$$
Assumption: the distribution of errors is normal.

[Figure, two simulated data sets. Left: Y = e^(0.1X) + norm(0; Y), with fitted curves y = 1.04e^(0.098x) and y = 1.16e^(0.089x). Right, logarithmic axes: Y = X^0.5 · e^norm(0; Y), with fitted curves y = 1.57x^0.46 and y = 0.60x^0.56.]

In both cases we have some sort of autocorrelation: the size of the error depends on Y. Using logarithms reduces the effect of autocorrelation and makes the distribution of errors more homogeneous. Nonlinear estimation instead puts more weight on the larger y-values. If there is no autocorrelation, the log-transformation puts more weight on the smaller values.

Linear regression
European bat species and environmental correlates, N = 62
[Data columns: ln(Area) and ln(Number of species), one row per country or island]

$$\ln(S) = a_0 + a_1 \ln(A)$$

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= a_0 \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} + a_1 \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
= \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \end{pmatrix} = XA$$

Matrix approach to linear regression
X is not a square matrix, hence X⁻¹ doesn't exist.
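The least squares estimates and the matrix layout above can be tried out on a small example. The following is a minimal sketch, not part of the lecture: the function name `least_squares_line` and the five data points are made up for illustration, and NumPy is assumed to be available. It computes a = σ_xy / σ_x² and b = ȳ − a·x̄, cross-checks them against `np.polyfit`, and builds the design matrix X to show that it is not square.

```python
import numpy as np


def least_squares_line(x, y):
    """Return slope a and intercept b minimising D = sum (y_i - (a*x_i + b))^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # a = sigma_xy / sigma_x^2 (the 1/(n-1) factors cancel in the ratio)
    a = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    # b follows from dD/db = 0
    b = y.mean() - a * x.mean()
    return a, b


# Made-up illustration data (hypothetical, not taken from the slides)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])

a, b = least_squares_line(x, y)

# Cross-check against NumPy's own degree-1 polynomial fit
a_ref, b_ref = np.polyfit(x, y, 1)

# Design matrix of Y = XA: a column of ones and the x-values
X = np.column_stack([np.ones_like(x), x])
print(a, b, X.shape)  # X has n rows and 2 columns, so it is not square
```

Because X has n rows and only two columns, `np.linalg.inv(X)` would raise an error, which is the practical meaning of "X⁻¹ doesn't exist".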
Multiplying both sides of Y = XA by the transpose X' yields the square matrix X'X, which can be inverted:

$$X'Y = X'XA$$
$$(X'X)^{-1} X'Y = (X'X)^{-1} X'X A = IA = A$$
$$A = (X'X)^{-1} X'Y$$

The species-area relationship of European bats
[Data: Y = ln(Number of species); X = a constant column of ones and ln(Area); X' is the transpose of X; N = 62]

$$X'X = \begin{pmatrix} 62 & 607.1316 \\ 607.1316 & 6518.161 \end{pmatrix}, \qquad
X'Y = \begin{pmatrix} 154.2937 \\ 1647.908 \end{pmatrix}$$

$$(X'X)^{-1} = \begin{pmatrix} 0.183521 & -0.01709 \\ -0.01709 & 0.001746 \end{pmatrix}$$

$$A = (X'X)^{-1}(X'Y) = \begin{pmatrix} a_0 \\ a_1 \end{pmatrix} = \begin{pmatrix} 0.146808 \\ 0.239144 \end{pmatrix}$$

[Figure: ln(# species) against ln(Area); regression line y = 0.2391x + 0.1468, R² = 0.4614]

What about the part of variance explained by our model?
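The matrix solution A = (X'X)⁻¹X'Y can be retraced from the cross-product matrices printed on the slide. This is a sketch under stated assumptions: it uses NumPy, the variable names `XtX` and `XtY` are illustrative, and the helper `normal_equations` is a hypothetical convenience function for the case where the raw X and Y are available. The slide values are rounded, so the results agree with the slide only to about three or four decimals.

```python
import numpy as np

# Cross-product matrices as printed on the slide (N = 62)
XtX = np.array([[62.0, 607.1316],
                [607.1316, 6518.161]])
XtY = np.array([154.2937, 1647.908])

# A = (X'X)^-1 X'Y; solve() avoids forming the inverse explicitly
a0, a1 = np.linalg.solve(XtX, XtY)

# The explicit inverse, for comparison with the slide
XtX_inv = np.linalg.inv(XtX)

# Back-transformed intercept: S = exp(a0) * A**a1
species_density = np.exp(a0)


def normal_equations(X, Y):
    """Solve Y = XA in the least squares sense for a full design matrix X."""
    return np.linalg.solve(X.T @ X, X.T @ Y)


print(a0, a1, species_density)
```

For larger or nearly collinear design matrices, `np.linalg.lstsq(X, Y, rcond=None)` is usually the numerically safer route to the same estimates.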
The correlation matrix R is obtained from the centred data matrix:

$$R = \Sigma_X^{-1} \, \frac{1}{n-1} (X - M)'(X - M) \, \Sigma_X^{-1}$$

Here X − M is the data matrix (columns ln S and ln A) with the column means subtracted, and Σ_X (S_x in the numbers below) is the diagonal matrix of standard deviations.

[Figure: ln(# species) against ln(Area); regression line y = 0.2391x + 0.1468, R² = 0.4614]

$$\ln S = 0.24 \ln A + 0.15$$
$$S = e^{0.15} A^{0.24} = 1.16\, A^{0.24}$$

1.16: average number of species per unit area (species density)
0.24: spatial species turnover

Step by step for the bat data:

$$(X-M)'(X-M) = \begin{pmatrix} 71.0087 & 136.9954 \\ 136.9954 & 572.8582 \end{pmatrix}$$

$$\frac{(X-M)'(X-M)}{n-1} = \begin{pmatrix} 1.164077 & 2.245826 \\ 2.245826 & 9.391119 \end{pmatrix}$$

$$S_x = \begin{pmatrix} 1.078924 & 0 \\ 0 & 3.064493 \end{pmatrix}, \qquad
S_x^{-1} = \begin{pmatrix} 0.926849 & 0 \\ 0 & 0.326318 \end{pmatrix}$$

$$S_x^{-1} \, \frac{(X-M)'(X-M)}{n-1} = \begin{pmatrix} 1.078924 & 2.081542 \\ 0.732854 & 3.064493 \end{pmatrix}$$

$$R = S_x^{-1} \, \frac{(X-M)'(X-M)}{n-1} \, S_x^{-1} = \begin{pmatrix} 1 & 0.679245 \\ 0.679245 & 1 \end{pmatrix}$$

Squaring the elements of R gives the coefficients of determination:

$$\begin{pmatrix} 1 & 0.461374 \\ 0.461374 & 1 \end{pmatrix}$$

This matches R² = 0.4614 of the regression line.

How to interpret the coefficient of determination

$$\sigma^2_{Y;M} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad \text{Total variance}$$

$$\sigma^2_{Y;Y(X)} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - Y(X_i))^2 \qquad \text{Rest (unexplained) variance}$$

$$\sigma^2_{Y;M} = \sigma^2_{Y;Y(X)} + \sigma^2_{Y(X);M}$$

where σ²_{Y(X);M} is the variance explained by the regression.

$$R^2 = 1 - \frac{\text{Residual variance}}{\text{Total variance}}
= 1 - \frac{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - Y(X_i))^2}{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

Statistical testing is done by an F or a t-test.
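The step from the covariance matrix to the correlation matrix and R² can be checked numerically. The sketch below assumes NumPy and starts from the covariance matrix (X−M)'(X−M)/(n−1) as printed on the slide; the function name `correlation_from_covariance` is illustrative. As a cross-check it also recovers the regression slope as σ_xy / σ_x², taking ln A as the x-variable.

```python
import numpy as np


def correlation_from_covariance(cov):
    """R = S^-1 C S^-1, with S the diagonal matrix of standard deviations."""
    s_inv = np.diag(1.0 / np.sqrt(np.diag(cov)))
    return s_inv @ cov @ s_inv


# Covariance matrix of (ln S, ln A) as printed on the slide
cov = np.array([[1.164077, 2.245826],
                [2.245826, 9.391119]])

R = correlation_from_covariance(cov)
r = R[0, 1]          # correlation between ln S and ln A
r_squared = r ** 2   # coefficient of determination

# Slope of ln S on ln A: covariance divided by the variance of ln A
slope = cov[0, 1] / cov[1, 1]

print(r, r_squared, slope)
```

When the raw data are at hand, `np.corrcoef` returns the same matrix directly; the explicit version is shown only because it mirrors the matrix products on the slide.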
The variance explained by the regression is

$$\sigma^2_{Y(X);M} = \frac{1}{n-1} \sum_{i=1}^{n} (Y(X_i) - \bar{Y})^2 \qquad \text{Explained variance}$$

so that

$$R^2 = \frac{\sigma^2_{Y(X);M}}{\sigma^2_{Y;M}} = \frac{\sum_{i=1}^{n} (Y(X_i) - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

The test statistics are

$$F = \frac{R^2}{1 - R^2}\, df, \qquad t = \sqrt{F} = \sqrt{\frac{R^2}{1 - R^2}\, df}$$

More than one predictor:

$$\ln(S) = a_0 + a_1 \ln(A) + a_2 T + a_3 N_{T<0} + a_4 L$$

The general linear model

$$Y = a_0 + a_1 X_1 + a_2 X_2 + a_3 X_3 + \dots + a_n X_n = a_0 + \sum_{i=1}^{n} a_i X_i$$

A model that assumes that a dependent variable Y can be expressed by a linear combination of predictor variables X is called a linear model.

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}
= \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,n} \\ 1 & x_{2,1} & \dots & x_{2,n} \\ \vdots & \vdots & & \vdots \\ 1 & x_{m,1} & \dots & x_{m,n} \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} = XA$$

With error terms:

$$Y = XA + E, \qquad E = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_m \end{pmatrix}$$

$$X'Y = X'XA$$
$$(X'X)^{-1} X'Y = (X'X)^{-1} X'X A = IA = A$$
$$A = (X'X)^{-1} X'Y$$

The vector E contains the error terms of each regression. The aim is to minimize E.

If the errors of the predictor variables are Gaussian, the error term ε should also be Gaussian, and means and variances are additive:

$$Y = a_0 + \sum_{i=1}^{n} a_i X_i + \varepsilon$$

$$\mu(Y) = a_0 + \sum_{i=1}^{n} a_i\, \mu(X_i) + \mu(\varepsilon)$$

$$\underbrace{\sigma^2(Y)}_{\text{Total variance}} = \underbrace{\sigma^2\!\left(a_0 + \sum_{i=1}^{n} a_i X_i\right)}_{\text{Explained variance}} + \underbrace{\sigma^2(\varepsilon)}_{\text{Unexplained (rest) variance}}$$

$$R^2 = \frac{\sigma^2\!\left(a_0 + \sum_{i=1}^{n} a_i X_i\right)}{\sigma^2(Y)} = \frac{\sigma^2(Y) - \sigma^2(\varepsilon)}{\sigma^2(Y)}$$

Example: European bats

$$\ln(S) = a_0 + a_1 \ln(A) + a_2 N_{T<0} + a_3 L$$

S: number of species; A: area; N_{T<0}: number of days below zero; L: latitude of the capital.

Data (Y = ln(Number of species); X = Constant, ln(Area), Days below zero, Latitude; the Constant column is 1 in every row). The slides list the first 33 of the 62 rows; Latitude² is used only in the mixed model further below.

Country/Island           ln(Number of species)   ln(Area)    Days below zero   Latitude of capitals (decimal degrees)   Latitude²
Albania                  3.258097                10.26632     34               41.33                                    1708.169
Andorra                  0                        6.148468    60               42.5                                     1806.25
Austria                  3.218876                11.33704     92               48.12                                    2315.534
Azores                   0.693147                 7.696213     1               37.73                                    1423.553
Baleary Islands          2.70805                  8.519989    18               39.55                                    1564.203
Belarus                  2.890372                12.24361    144               53.87                                    2901.977
Belgium                  2.995732                10.3264      50               50.9                                     2590.81
Bosnia and Herzegovina   3.178054                10.84344    114               43.82                                    1920.192
British islands          2.890372                12.40519     64               51.15                                    2616.323
Bulgaria                 3.496508                11.61702    102               42.65                                    1819.023
Canary Islands           2.197225                 8.891512     1               27.93                                     780.0849
Channel Is.              1.609438                 5.703782    12               49.22                                    2422.608
Corsica                  3.044522                 9.068777    11               41.92                                    1757.286
Crete                    2.833213                 9.019059     1               35.33                                    1248.209
Croatia                  3.526361                10.94366    114               45.82                                    2099.472
Cyclades Is.             1.098612                 7.824046     1               37.1                                     1376.41
Cyprus                   2.890372                 9.132379     2               35.15                                    1235.523
Czech Republic           3.178054                11.27551    119               50.1                                     2510.01
Denmark                  2.639057                10.67112     85               55.63                                    3094.697
Dodecanese Is.           2.639057                 7.887209     2               36.4                                     1324.96
Estonia                  2.397895                10.71945    143               59.35                                    3522.423
Faroe Is.                0                        7.243513    35               62                                       3844
Finland                  2.397895                12.73123    169               60.32                                    3638.502
France                   3.465736                13.20664     50               48.73                                    2374.613
Germany                  3.218876                12.78555     97               52.38                                    2743.664
Gibraltar                1.609438                 1.871802     0               36.1                                     1303.21
Greece                   3.496508                11.7905       2               37.9                                     1436.41
Hungary                  3.332205                11.44094    100               47.43                                    2249.605
Iceland                  0                       11.54248    133               64.13                                    4112.657
Ireland                  2.397895                11.16014     23               53.43                                    2854.765
Italy                    3.433987                12.6162      18               41.8                                     1747.24
Kaliningrad Region       2.564949                 9.615805   110               52.7                                     2777.29
Latvia                   2.772589                11.07637    124               56.96                                    3244.442

Multiple regression
1. Model formulation
2. Estimation of model parameters
3. Estimation of statistical significance

$$Y = XA, \qquad A = (X'X)^{-1} X'Y$$

[The slide shows excerpts of X' and of (X'X)⁻¹X'.]

$$X'X = \begin{pmatrix}
62 & 607.1316 & 4328 & 2906.4 \\
607.1316 & 6518.161 & 48545.59 & 29086.57 \\
4328 & 48545.59 & 534136 & 228951.7 \\
2906.4 & 29086.57 & 228951.7 & 141148.1
\end{pmatrix}, \qquad
X'Y = \begin{pmatrix} 154.2937 \\ 1647.908 \\ 11289.32 \\ 7137.716 \end{pmatrix}$$

$$(X'X)^{-1} = \begin{pmatrix}
1.019166 & -0.02275 & 0.00261 & -0.02053 \\
-0.02275 & 0.002458 & -7.5\times10^{-5} & 8.3\times10^{-5} \\
0.00261 & -7.5\times10^{-5} & 1.3\times10^{-5} & -5.9\times10^{-5} \\
-0.02053 & 8.3\times10^{-5} & -5.9\times10^{-5} & 0.000509
\end{pmatrix}$$

$$A = (X'X)^{-1}(X'Y) = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix} = \begin{pmatrix} 2.679757 \\ 0.290121 \\ 0.002155 \\ -0.06789 \end{pmatrix}$$

Multiple R and R²

$$SE = \sqrt{\frac{\mathrm{trace}(R^{-1})\,(1 - R^2)}{n - k - 1}}$$

R: correlation matrix
n: number of cases
k: number of independent variables in the model

$$t = \frac{\text{parameter}}{SE(\text{parameter})}$$

D_{T<0} (days below zero) is statistically not significant and should be eliminated from the model.

Adjusted R²

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

$$F = \frac{\sigma_1^2 / df_1}{\sigma_2^2 / df_2} = \frac{R^2}{1 - R^2} \cdot \frac{n - k - 1}{k} = \frac{0.66646}{0.33354} \cdot \frac{62 - 3 - 1}{3} = 38.6307$$

A mixed model

$$\ln S = a_0 + a_1 \ln A + a_2 D_{T<0} + a_3 L + a_4 L^2$$

The data are those of the table above, with Latitude² as an additional column of X.

[The slide shows excerpts of X' and of (X'X)⁻¹X'.]

$$X'X = \begin{pmatrix}
62 & 607.1316 & 4328 & 2906.4 & 141148.1 \\
607.1316 & 6518.161 & 48545.59 & 29086.57 & 1441737 \\
4328 & 48545.59 & 534136 & 228951.7 & 12488619 \\
2906.4 & 29086.57 & 228951.7 & 141148.1 & 7106497 \\
141148.1 & 1441737 & 12488619 & 7106497 & 3.71\times10^{8}
\end{pmatrix}$$

$$(X'X)^{-1} = \begin{pmatrix}
6.45421 & 0.000497 & 0.001087 & -0.25606 & 0.002409 \\
0.000497 & 0.002557 & -8.1\times10^{-5} & -0.00092 & 1.03\times10^{-5} \\
0.001087 & -8.1\times10^{-5} & 1.34\times10^{-5} & 6.63\times10^{-6} & -6.8\times10^{-7} \\
-0.25606 & -0.00092 & 6.63\times10^{-6} & 0.010716 & -0.0001 \\
0.002409 & 1.03\times10^{-5} & -6.8\times10^{-7} & -0.0001 & 1.07\times10^{-6}
\end{pmatrix}$$

$$A = (X'X)^{-1} X'Y = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} = \begin{pmatrix} -3.40816 \\ 0.264082 \\ 0.003862 \\ 0.195932 \\ -0.0027 \end{pmatrix}$$

The final model

$$\ln S = -3.41 + 0.26 \ln A + 0.004\, D_{T<0} + 0.196\, L - 0.0027\, L^2$$

• −3.41: very low species density (log-scale!)
• 0.26 ln A: realistic increase of species richness with area
• 0.004 D_{T<0}: increase of species richness with winter length
• 0.196 L: increase of species richness at higher latitudes
• −0.0027 L²: a peak of species richness at intermediate latitudes

Is this model realistic?
The model makes realistic predictions.
[Figure: ln(# species predicted) against ln(# species observed); y = 0.6966x + 0.7481, R² = 0.6973]

Problems might arise from the intercorrelation between the predictor variables (multicollinearity). We solve the problem by a step-wise approach, eliminating the variables that are either not significant or give unreasonable parameter values.

The variance explanation of this final model is higher than that of the previous one (R² = 0.70 compared with 0.67).

Multiple regression solves systems of intrinsically linear algebraic equations

$$Y = a_0 + a_{11} X_1 + a_{12} X_1^2 + a_{13} X_1^3 + \dots + a_{21} X_2 + a_{22} X_2^2 + a_{23} X_2^3 + \dots + a_{n1} X_n + a_{n2} X_n^2 + a_{n3} X_n^3 + \dots$$

$$A = (X'X)^{-1} X'Y$$

Polynomial regression

• The matrix X'X must not be singular. That is, the variables have to be independent. Otherwise we speak of multicollinearity. Collinearities of r < 0.7 are in most cases tolerable.
• To be applied safely, multiple regression needs at least 10 times as many cases as variables in the model.
• Statistical inference assumes that errors have a normal distribution around the mean.
• The model assumes linear (or algebraic) dependencies. Check first for non-linearities.
• Check the distribution of residuals Y_exp − Y_obs. This distribution should be random.
• Check whether the parameters have realistic values.

Multiple regression is a hypothesis testing and not a hypothesis generating technique!

General additive model
[Figure: ln(# species predicted) against ln(# species observed); y = 0.6966x + 0.7481, R² = 0.6973]

Standardized coefficients of correlation

Z-transformed distributions have a mean of 0 and a standard deviation of 1.

$$Z = \frac{x - \bar{x}}{s}$$

$$B = (Z_X' Z_X)^{-1} Z_X' Z_Y$$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{s_X s_Y}
= \frac{1}{n-1} \sum_{i=1}^{n} \frac{(X_i - \bar{X})}{s_X} \frac{(Y_i - \bar{Y})}{s_Y}
= \frac{1}{n-1} \sum_{i=1}^{n} Z_X Z_Y$$

Each element of Z'Z is such a sum of products of Z-transformed values, hence

$$R = \frac{1}{n-1} Z'Z = \begin{pmatrix} r_{11} & \dots & r_{1n} \\ \vdots & & \vdots \\ r_{n1} & \dots & r_{nn} \end{pmatrix}$$

$$B = R_{XX}^{-1} R_{XY}, \qquad R_{XY} = R_{XX} B$$

In the case of bivariate regression Y = aX + b, R_XX = 1. Hence B = R_XY.
Hence the use of Z-transformed values results in standardized correlation coefficients, termed β-values.
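The identity between regression on Z-transformed values and B = R_XX⁻¹ R_XY can be verified on synthetic data. The sketch below is an illustration under assumptions, not the lecture's own computation: it uses NumPy, a seeded random data set invented for the purpose, and hypothetical helper names (`zscore`, `beta_from_z`, `beta_from_correlations`). It also checks the bivariate case, where the standardized coefficient equals the correlation coefficient.

```python
import numpy as np


def zscore(a):
    """Z-transform each column: mean 0, standard deviation 1 (n - 1 in the denominator)."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


def beta_from_z(x, y):
    """B = (Zx' Zx)^-1 Zx' Zy, solved without forming the inverse explicitly."""
    zx, zy = zscore(x), zscore(y)
    return np.linalg.solve(zx.T @ zx, zx.T @ zy)


def beta_from_correlations(x, y):
    """B = Rxx^-1 Rxy, taken from the joint correlation matrix of predictors and response."""
    k = x.shape[1]
    r = np.corrcoef(np.column_stack([x, y]), rowvar=False)
    return np.linalg.solve(r[:k, :k], r[:k, k])


# Synthetic data (hypothetical): two correlated predictors and a noisy response
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
x[:, 1] += 0.5 * x[:, 0]
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=0.5, size=50)

b_z = beta_from_z(x, y)
b_r = beta_from_correlations(x, y)

# Bivariate case: a single predictor, so Rxx = 1 and B equals the correlation r
b_single = beta_from_z(x[:, :1], y)
r_xy = np.corrcoef(x[:, 0], y)[0, 1]

print(b_z, b_r, b_single[0], r_xy)
```

Both routes give the same coefficients because the factor 1/(n − 1) that turns Z'Z into R cancels between (Z_X'Z_X)⁻¹ and Z_X'Z_Y.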