Lecture 12
Correlation and linear regression

The least squares method of Carl Friedrich Gauß

[Figure: scatter plot of Y against X with the fitted line OLRy, y = ax + b. The vertical deviation Δy of each point from the line enters the sum as Δy².]

$$D = \sum_{i=1}^{n} (\Delta y_i)^2 = \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right]^2$$

$$\frac{\partial D}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0 \qquad \frac{\partial D}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0$$

$$b = \bar{y} - a\bar{x} \qquad a = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$

$$y = ax + \bar{y} - a\bar{x} \quad\Rightarrow\quad (y - \bar{y}) = a(x - \bar{x})$$

$$a = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}$$

The numerator s_xy is the covariance of x and y; the denominator s_x² is the variance of x.

Correlation coefficient

$$r = \frac{s_{xy}}{s_x s_y} \qquad a = \frac{s_{xy}}{s_x^2} = r\,\frac{s_y}{s_x} \qquad r = a\,\frac{s_x}{s_y} \qquad -1 \le r \le 1$$

Slope a and coefficient of correlation r are zero if the covariance is zero.

$$r^2 = \frac{s_{xy}^2}{s_x^2 s_y^2} \qquad 0 \le r^2 \le 1$$

Coefficient of determination

$$R^2 = \frac{\text{Explained variance}}{\text{Total variance}}$$

Relationships between macropterous, dimorphic and brachypterous ground beetles on 17 Mazurian lake islands

[Figure: brachypterous species against macropterous species; y = 0.192x + 0.4671, R² = 0.1723]
Positive correlation; r = √R² = 0.41.
The regression is weak.
Macropterous species richness explains only 17% of the variance in brachypterous species richness.
Some islands have no brachypterous species.
We do not really know which variable is the independent one. There is no clear-cut logical connection.

[Figure: dimorphic species against macropterous species; y = 0.3875x + 3.7188, R² = 0.4455]
Positive correlation; r = √R² = 0.67.
The regression is moderate.
Macropterous species richness explains only 45% of the variance in dimorphic species richness.
The relationship appears to be non-linear. A log-transformation is indicated (there are no zero counts).
We do not really know which variable is the independent one. There is no clear-cut logical connection.

[Figure: brachypterous species against isolation; y = -36.203x + 5.5585, R² = 0.2311]
[Figure: brachypterous species against ln Area; y = 0.4894x + 22.094, R² = 0.0037]

Brachypterous species against isolation:
Negative correlation; r = -√R² = -0.48.
The regression is weak.
Island isolation explains only 23% of the variance in brachypterous species richness.
We have two apparent outliers. Without them the whole relationship would vanish, that is, R² ≈ 0.
Outliers have to be eliminated from regression analysis.
We have a clear hypothesis about the logical relationships. Isolation should be the predictor of species richness.

Brachypterous species against ln Area:
No correlation; r = √R² = 0.06.
The regression slope is nearly zero.
Area explains less than 1% of the variance in brachypterous species richness.
We have a clear hypothesis about the logical relationships. Area should be the predictor of species richness.

The matrix perspective

Brachy  Macro  Constant
4       7      1
6       12     1
3       13     1
4       18     1
1       10     1
4       14     1
2       7      1
5       22     1
1       9      1
0       7      1
0       15     1
0       13     1
1       8      1
4       10     1
2       8      1
6       14     1
2       6      1

Y is the column Brachy, X holds the columns Macro and Constant, and a = (a1, a0) holds the coefficients. Each island gives one equation:

$$4 = a_0 + 7a_1, \quad 6 = a_0 + 12a_1, \quad 3 = a_0 + 13a_1, \quad \ldots, \quad 2 = a_0 + 6a_1 \qquad\Longleftrightarrow\qquad Y = Xa$$

X is not square. It does not possess an inverse, so we multiply by the transpose first:

$$Y = Xa \;\Rightarrow\; X^T Y = X^T X a \;\Rightarrow\; (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T X a = Ia = a$$

$$a = (X^T X)^{-1} X^T Y$$

Dispersion matrix
$$X^T X = \begin{pmatrix} 2499 & 193 \\ 193 & 17 \end{pmatrix}$$

Inverse
$$(X^T X)^{-1} = \begin{pmatrix} 0.003248 & -0.03687 \\ -0.03687 & 0.477455 \end{pmatrix}$$

$$X^T Y = \begin{pmatrix} 570 \\ 45 \end{pmatrix}$$

Coefficients
a1 = 0.192014
a0 = 0.467138

[Figure: brachypterous species against macropterous species; y = 0.192x + 0.4671, R² = 0.1723]
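The matrix solution can be retraced with a short sketch in plain Python. It uses the Brachy and Macro columns from the data table; the function names are illustrative and not part of the lecture. The 2 x 2 inverse is written out by hand, and the same fit is repeated with a = s_xy / s_x² as a cross-check.

```python
# Brachypterous (y) and macropterous (x) species counts on the 17 islands,
# copied from the data table of the lecture.
brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]


def fit_line_normal_equations(x, y):
    """Solve a = (X^T X)^-1 X^T Y for the design matrix X = [x, 1]."""
    n = len(x)
    # Entries of the 2 x 2 dispersion matrix X^T X.
    sum_xx = sum(xi * xi for xi in x)
    sum_x = sum(x)
    # Entries of the vector X^T Y.
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_y = sum(y)
    det = sum_xx * n - sum_x * sum_x
    if det == 0:
        raise ValueError("X^T X is singular; no inverse exists")
    # Explicit inverse of the 2 x 2 matrix multiplied by X^T Y.
    slope = (n * sum_xy - sum_x * sum_y) / det
    intercept = (sum_xx * sum_y - sum_x * sum_xy) / det
    return slope, intercept


def fit_line_covariance(x, y):
    """Same fit from a = s_xy / s_x^2 and b = mean(y) - a * mean(x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x) / n
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


a1, a0 = fit_line_normal_equations(macro, brachy)
a1_cov, a0_cov = fit_line_covariance(macro, brachy)
print(f"normal equations: a1 = {a1:.6f}, a0 = {a0:.6f}")
print(f"covariance form:  a1 = {a1_cov:.6f}, a0 = {a0_cov:.6f}")
```

Both routes reproduce the coefficients given in the lecture, a1 = 0.192014 and a0 = 0.467138.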
$$a = (X^T X)^{-1} X^T Y$$

The dispersion matrix contains the sums of squares and cross products of the columns of X (here Macro and Constant):

$$X^T X = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i\,\mathrm{const}_i \\ \sum_{i=1}^{n} x_i\,\mathrm{const}_i & \sum_{i=1}^{n} \mathrm{const}_i^2 \end{pmatrix} = \begin{pmatrix} 2499 & 193 \\ 193 & 17 \end{pmatrix}$$

[Figure: brachypterous species against macropterous species; y = 0.192x + 0.4671, R² = 0.1723]

Variance
$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$$

Covariance
$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}$$

Raw data and arithmetic means

X now holds the three raw data columns Brachy, Macro and Constant for the 17 islands. M is the matrix of arithmetic means; every one of its 17 rows is

Brachy   Macro    Constant
2.64706  11.3529  1

Dispersion matrix
$$X^T X = \begin{pmatrix} 185 & 570 & 45 \\ 570 & 2499 & 193 \\ 45 & 193 & 17 \end{pmatrix}$$

Squared means
$$M^T M = \begin{pmatrix} 119.12 & 510.88 & 45 \\ 510.88 & 2191.1 & 193 \\ 45 & 193 & 17 \end{pmatrix}$$

$$X^T X - M^T M = \begin{pmatrix} 65.882 & 59.118 & 0 \\ 59.118 & 307.88 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

Variances and covariances (division by n = 17)

          Brachy  Macro   Constant
Brachy    3.8754  3.4775  0
Macro     3.4775  18.111  0
Constant  0       0       0

$$\Sigma = \frac{1}{n}\left(X^T X - M^T M\right) = \frac{1}{n}\left[(X - M)^T (X - M)\right]$$

Correlation matrix

$$R = \begin{pmatrix} 1 & r_{xy} \\ r_{xy} & 1 \end{pmatrix} = V \Sigma V \qquad V = \begin{pmatrix} 1/\sigma_x & 0 \\ 0 & 1/\sigma_y \end{pmatrix}$$

For Brachy and Macro:

$$\Sigma = \begin{pmatrix} 3.8754 & 3.4775 \\ 3.4775 & 18.111 \end{pmatrix} \qquad V = \begin{pmatrix} 0.508 & 0 \\ 0 & 0.235 \end{pmatrix}$$

$$V\Sigma = \begin{pmatrix} 1.9686 & 1.7665 \\ 0.8171 & 4.2557 \end{pmatrix} \qquad V\Sigma V = \begin{pmatrix} 1 & 0.4151 \\ 0.4151 & 1 \end{pmatrix}$$

r = 0.4151, r² = 0.1723

In general, for n variables:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \ldots & \sigma_{1n} \\ \sigma_{12} & \sigma_2^2 & \ldots & \sigma_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ \sigma_{1n} & \sigma_{2n} & \ldots & \sigma_n^2 \end{pmatrix}$$
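The step from the dispersion matrix to the covariance and correlation matrices can be retraced with a minimal sketch in plain Python. It assumes the Brachy and Macro columns of the raw data; the helper names covariance_matrix and correlation_matrix are illustrative only.

```python
import math

# Raw data columns for the 17 islands, copied from the lecture's data table.
brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]


def covariance_matrix(columns):
    """Sigma = (X^T X - M^T M) / n for a list of equally long columns."""
    n = len(columns[0])
    k = len(columns)
    means = [sum(col) / n for col in columns]
    sigma = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            # Element (i, j) of the dispersion matrix X^T X.
            cross = sum(a * b for a, b in zip(columns[i], columns[j]))
            # Element (i, j) of M^T M is n * mean_i * mean_j.
            sigma[i][j] = (cross - n * means[i] * means[j]) / n
    return sigma


def correlation_matrix(sigma):
    """R = V Sigma V with V = diag(1 / standard deviation)."""
    k = len(sigma)
    v = [1.0 / math.sqrt(sigma[i][i]) for i in range(k)]
    return [[v[i] * sigma[i][j] * v[j] for j in range(k)] for i in range(k)]


sigma = covariance_matrix([brachy, macro])
r_matrix = correlation_matrix(sigma)
r = r_matrix[0][1]
print("Sigma =", [[round(value, 4) for value in row] for row in sigma])
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

The printed values agree with the lecture's numbers: variances 3.8754 and 18.111, covariance 3.4775, r = 0.4151 and r² = 0.1723.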
The diagonal of the covariance matrix holds the variances; the off-diagonal elements are the covariances.
The covariance matrix is square and symmetric.

Non-linear relationships
Ground beetles on Mazurian lake islands

[Figure: species against individuals, three fits]
Linear function: y = 0.0056x + 24.305, R² = 0.2963
Logarithmic function: y = 6.0987 ln(x) - 8.3513, R² = 0.6003
Power function: y = 6.7337 x^0.2306, R² = 0.67

The species-individuals relationship is obviously non-linear.
The power function has the highest R² and therefore explains most of the variance in species richness.
The coefficient of determination is a measure of goodness of fit.

Taking logarithms turns the power function into a straight line:

$$S = 6.733\, I^{0.2308} \quad\Rightarrow\quad \ln S = \ln(6.733) + 0.2308 \ln I = 1.907 + 0.2308 \ln I$$

Here 1.907 is the intercept and 0.2308 the slope.

Having more than one predictor

Island  Species  Individuals  Area   Isolation
1pog    13       55           0.01   0.088719
2pog    24       149          0.9    0.088592
3pog    31       206          2.1    0.081131
cor     29       3450         6.84   0.089384
dab     31       505          10     0.080644
ful     37       996          9.9    0.094508
gil     54       1895         10     0.093676
guc     27       476          0.92   0.097195
hel     25       325          2.3    0.088938
lip     30       459          4.19   0.088367
mil     34       1410         0.2    0.089204
sos     33       829          20.09  0.087405
swi     34       1704         2.08   0.096915
ter     16       91           0.03   0.085875
wil     21       102          1      0.096584
wron    28       342          0.15   0.01
wros    21       258          0.15   0.01

Describe species richness as a function of the number of individuals, area, and isolation of the islands.
We need a clear hypothesis about the dependent variable and the independent predictors. Use a block diagram.

[Block diagram: Individuals, Area and Isolation point to Species; Area and Isolation also point to Individuals.]

Collinearity
Predictors are not independent. The number of individuals depends on area and degree of isolation.

We need linear relationships
We use ln-transformed variables of species, area, and individuals.
Check for multicollinearity using a correlation matrix.
We check for non-linearities using plots.
The correlation between area and individuals is highly significant. The probability of H0 is 0.004.
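The multicollinearity check can be sketched in plain Python as a correlation matrix of the three predictors. The data are the Individuals, Area and Isolation columns of the island table; the function names and the threshold parameter are illustrative assumptions, with the default of 0.7 taken from the rule of thumb stated in this lecture.

```python
import math

# Predictor columns for the 17 islands, copied from the island table.
individuals = [55, 149, 206, 3450, 505, 996, 1895, 476, 325, 459,
               1410, 829, 1704, 91, 102, 342, 258]
area = [0.01, 0.9, 2.1, 6.84, 10, 9.9, 10, 0.92, 2.3, 4.19,
        0.2, 20.09, 2.08, 0.03, 1, 0.15, 0.15]
isolation = [0.088719, 0.088592, 0.081131, 0.089384, 0.080644, 0.094508,
             0.093676, 0.097195, 0.088938, 0.088367, 0.089204, 0.087405,
             0.096915, 0.085875, 0.096584, 0.01, 0.01]


def pearson_r(x, y):
    """Pearson coefficient of correlation r = s_xy / (s_x * s_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Individuals and area enter the model ln-transformed, isolation untransformed.
predictors = {
    "ln_Ind": [math.log(v) for v in individuals],
    "ln_Area": [math.log(v) for v in area],
    "Isolation": isolation,
}


def correlation_matrix(variables):
    """Return r for every ordered pair of variable names."""
    names = list(variables)
    return {(p, q): pearson_r(variables[p], variables[q])
            for p in names for q in names}


def collinear_pairs(variables, threshold=0.7):
    """List predictor pairs whose |r| reaches the chosen threshold."""
    names = list(variables)
    flagged = []
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            r = pearson_r(variables[p], variables[q])
            if abs(r) >= threshold:
                flagged.append((p, q, r))
    return flagged


corr = correlation_matrix(predictors)
for (p, q), r in corr.items():
    print(f"{p:>9} vs {q:<9} r = {r:+.3f}")
print("pairs at or above the threshold:", collinear_pairs(predictors))
```

A pair that is flagged carries largely the same information twice; in that case one of the two predictors would be a candidate for removal before the regression is run.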
Of the predictors, area and individuals are highly correlated.
In linear regression analysis, correlations between predictors below 0.7 are acceptable.

The final data for our analysis

Island  ln_S      Constant  ln_Ind.   ln_Area   Isolation
1pog    2.564949  1         4.007333  -4.60517  0.088719
2pog    3.164068  1         5.003946  -0.10536  0.088592
3pog    3.427515  1         5.327876  0.741937  0.081131
cor     3.366817  1         8.14613   1.922788  0.089384
dab     3.443352  1         6.224558  2.302585  0.080644
ful     3.609114  1         6.903747  2.292535  0.094508
gil     3.985008  1         7.546974  2.302585  0.093676
guc     3.294602  1         6.165418  -0.08338  0.097195
hel     3.236061  1         5.783825  0.832909  0.088938
lip     3.401197  1         6.12905   1.432701  0.088367
mil     3.521447  1         7.251345  -1.60944  0.089204
sos     3.483143  1         6.72022   3.000222  0.087405
swi     3.531251  1         7.440734  0.732368  0.096915
ter     2.772589  1         4.51086   -3.50656  0.085875
wil     3.060271  1         4.624973  0         0.096584
wron    3.332205  1         5.834811  -1.89712  0.01
wros    3.020425  1         5.55296   -1.89712  0.01

The vector Y (ln_S) contains the response variable.
The matrix X (Constant, ln_Ind., ln_Area, Isolation) contains the effect (predictor) variables.

The model
$$\ln S = a_0 + a_1 \ln \mathrm{Ind} + a_2 \ln \mathrm{Area} + a_3\, \mathrm{Isolation}$$

The predictor variables have to contain different information.
If X^T X is singular, no inverse exists.

Multiple linear regression

$$Y = Xa \qquad a = (X^T X)^{-1} X^T Y$$

$$F = t^2 = (n - 2)\,\frac{r^2}{1 - r^2}$$

The probability that R² is zero is only 0.01%. R² > 0 with 99.99% probability and is hence statistically significant.
The model explains 78.6% of the variance in species richness. 21.4% of the variance remains unexplained.

$$t = \frac{\text{Coefficient}}{\text{Standard error}}$$

Fitted coefficients (magnitudes as printed; the signs between the terms are not legible): constant 2.48, ln Ind 0.15, ln Area 0.07, Isolation 0.91.
The t-tests give the probabilities that the coefficients deviate from zero.
Isolation is not a significant predictor.

What distance to minimize?
[Figure: scatter plot of Y against X with the lines OLRy and OLRx. OLRy minimizes the squared vertical distances Δy², OLRx the squared horizontal distances Δx².]

$$a_{OLRy} = \frac{s_{xy}}{s_x^2} \qquad a_{OLRx} = \frac{s_y^2}{s_{xy}} \qquad a_{OLRx} \cdot a_{OLRy} = \frac{s_y^2}{s_x^2}$$

Model I regression

[Figure: the same scatter plot with the RMA line between OLRy and OLRx. RMA minimizes the product ΔxΔy.]

$$a_{RMA}^2 = a_{OLRx}\, a_{OLRy} = \frac{s_y^2}{s_x^2} \qquad a_{RMA} = \frac{s_y}{s_x}$$

$$a_{OLRy} = \frac{s_{xy}}{s_x^2} = r\,\frac{s_y}{s_x} = r\, a_{RMA} \qquad a_{RMA} = \frac{a_{OLRy}}{r}$$

The reduced major axis regression slope is the geometric average of a_OLRy and a_OLRx.

Model II regression

Past standard output of linear regression

The output lists:
Reduced major axis
Parameters and standard errors
Parametric probability for r = 0

$$t(df = n - 2) = r\sqrt{\frac{n - 2}{1 - r^2}} \qquad F = t^2 = (n - 2)\,\frac{r^2}{1 - r^2}$$

Permutation test for statistical significance

We don't have a clear hypothesis about the causal relationships. In this case RMA is indicated.
Both tests indicate that Brach and Macro are not significantly correlated.
The RMA regression slope is insignificant.

Permutation test for statistical significance

[Spreadsheet: the Brach column is kept fixed. Columns headed Los() hold random numbers between 0 and 1, and each is paired with a reshuffled copy of the Macro column.]

Simulation output as given: 0.099125; N = 1000; mean r = 0.061; lower CL = -0.538; upper CL = 0.768; observed r = 0.41508801.

Randomize x or y 1000 times. Calculate r each time.
Plot the statistical distribution and calculate the lower and upper confidence limits.
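The randomization procedure can be sketched in plain Python. This is a minimal sketch, not the spreadsheet used in the lecture: it shuffles the Brachy column with Python's random module, uses a fixed seed (an arbitrary choice) so that runs are repeatable, and reads the limits at rank positions 25 and 975 of the 1000 sorted r values, the positions named in this lecture. The function names are illustrative.

```python
import math
import random

# Species counts for the 17 islands, copied from the lecture's data table.
brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]


def pearson_r(x, y):
    """Pearson coefficient of correlation r = s_xy / (s_x * s_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def permutation_test(x, y, n_perm=1000, seed=1):
    """Shuffle y n_perm times and collect r of each randomized pairing."""
    rng = random.Random(seed)  # fixed seed: an assumption for repeatability
    observed = pearson_r(x, y)
    shuffled = list(y)
    perm_rs = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        perm_rs.append(pearson_r(x, shuffled))
    perm_rs.sort()
    # Rank positions 25 and 975 of 1000 (list indices 24 and 974).
    lower = perm_rs[round(0.025 * n_perm) - 1]
    upper = perm_rs[round(0.975 * n_perm) - 1]
    # Share of randomized r values at least as extreme as the observed r.
    p_two_sided = sum(abs(r) >= abs(observed) for r in perm_rs) / n_perm
    return observed, lower, upper, p_two_sided


observed_r, lower_cl, upper_cl, p_value = permutation_test(macro, brachy)
print(f"observed r = {observed_r:.4f}")
print(f"95% limits of randomized r: {lower_cl:.3f} to {upper_cl:.3f}")
print(f"two-sided permutation p = {p_value:.3f}")
```

If the observed r lies between the two limits, the correlation cannot be distinguished from the randomized pairings at the 5% level. The limits and p vary slightly with the seed and with the number of randomizations.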
[Figure: histogram of the 1000 randomized r values (x axis from -0.8 to 1) with the observed r and the lower and upper CL marked. Σ N(2.5%) = 25 values lie in each tail.]

Calculating confidence limits
Rank all 1000 coefficients of correlation and take the values at rank positions 25 and 975.

The 95% confidence limits of the regression slope mark the range that contains the regression slope with 95% probability.
The lower CL is negative, hence a zero slope lies within the 95% CL.

The RMA regression has a much steeper slope. This slope is often intuitively better.
The coefficient of correlation is independent of the regression method.
In OLRy regression, insignificance of the slope also means insignificance of r and R².

[Figure: scatter plot of Y against X with the OLRy line pulled towards a single outlier.]

Outliers have a disproportionate influence on correlation and regression.
Outliers should be eliminated from regression analysis.
Instead of the Pearson coefficient of correlation, use Spearman's rank order correlation.

[Figure: Y against X on ranked data (ranks 0 to 7).]
rPearson = 0.79
Spearman's coefficient is a normal correlation on ranked data: rSpearman = 0.77

Home work and literature

Refresh:
• Coefficient of correlation
• Pearson correlation
• Spearman correlation
• Linear regression
• Non-linear regression
• Model I and model II regression
• RMA regression

Prepare for the next lecture:
• F-test
• F-distribution
• Variance

Literature:
Łomnicki: Statystyka dla biologów
http://statsoft.com/textbook/