Multiple and complex regression

Extensions of simple linear regression
• Multiple regression models: predictor variables are continuous
• Analysis of variance: predictor variables are categorical (grouping variables)
• But general linear models can include both continuous and categorical predictors

Relative abundance of C3 and C4 plants
• Paruelo & Lauenroth (1996)
• Geographic distribution and the effects of climate variables on the relative abundance of a number of plant functional types (PFTs): shrubs, forbs, succulents, C3 grasses and C4 grasses

Data
• 73 sites across temperate central North America

Response variable
• Relative abundance of PFTs (based on cover, biomass, and primary production) for each site

Predictor variables
• Longitude (LONG)
• Latitude (LAT)
• Mean annual temperature (MAT)
• Mean annual precipitation (MAP)
• Winter (%) precipitation (DJFMAP)
• Summer (%) precipitation (JJAMAP)
• Biomes (grassland, shrubland)

[Figure: four histograms of the response, titled "Histogram of C3", "Histogram of log_10_C3", "Histogram of log_C3" and "Histogram of SQRT_C3"; y-axis: Frequency]

• Relative abundance was transformed as ln(dat + 1) because its distribution is positively skewed

Collinearity
• Causes computational problems because it makes the determinant of the matrix of X-variables (X'X) close to zero, and matrix inversion essentially involves dividing by the determinant, so the result is very sensitive to small differences in the numbers
• Standard errors of the estimated regression slopes are inflated

Detecting collinearity
• Check tolerance values
• Plot the variables
• Examine a matrix of correlation coefficients between predictor variables

Dealing with collinearity
• Omit predictor variables if they are highly correlated with other predictor variables that remain in the model

Correlations
[Figure: scatterplot matrix of LAT, LONG, MAP, MAT, JJAMAP and DJFMAP]
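The correlation and tolerance checks listed under "Detecting collinearity" can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the latitude and temperature values are invented for the example and are not the Paruelo & Lauenroth (1996) data, and the helper names pearson_r and tolerance_and_vif are hypothetical. In general the tolerance of a predictor is 1 - R2 from regressing it on all the other predictors; with only two predictors this reduces to 1 - r2, and VIF is the reciprocal of tolerance.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def tolerance_and_vif(r):
    """Tolerance and VIF for the two-predictor case (tolerance = 1 - r^2)."""
    tolerance = 1.0 - r ** 2
    return tolerance, 1.0 / tolerance


# Invented values for illustration only (not the study data):
# temperature falls almost linearly as latitude increases.
lat = [30.0, 35.0, 40.0, 45.0, 50.0]
mat = [20.0, 16.5, 12.0, 8.5, 4.0]

r = pearson_r(lat, mat)
tol, vif = tolerance_and_vif(r)
print(f"r = {r:.3f}, tolerance = {tol:.4f}, VIF = {vif:.1f}")
```

A tolerance close to zero (equivalently, a very large VIF) signals that one predictor is nearly a linear function of the other, which is the situation the bullets above warn about.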
Correlations
(Pearson correlation with 2-tailed significance beneath each coefficient; N = 73 for every pair)

                      LAT       LONG      MAP       MAT       JJAMAP    DJFMAP
LAT      Pearson r    1         .097      -.247*    -.839**   .074      -.065
         Sig.         .         .416      .036      .000      .533      .584
LONG     Pearson r    .097      1         -.734**   -.213     -.492**   .771**
         Sig.         .416      .         .000      .070      .000      .000
MAP      Pearson r    -.247*    -.734**   1         .355**    .112      -.405**
         Sig.         .036      .000      .         .002      .344      .000
MAT      Pearson r    -.839**   -.213     .355**    1         -.081     .001
         Sig.         .000      .070      .002      .         .497      .990
JJAMAP   Pearson r    .074      -.492**   .112      -.081     1         -.792**
         Sig.         .533      .000      .344      .497      .         .000
DJFMAP   Pearson r    -.065     .771**    -.405**   .001      -.792**   1
         Sig.         .584      .000      .000      .990      .000      .

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).

$\ln(\mathrm{C3}) = \beta_0 + \beta_1(\mathrm{lat}) + \beta_2(\mathrm{long}) + \beta_3(\mathrm{lat} \times \mathrm{long})$

Coefficients (dependent variable: LC3)

              Unstandardized         Standardized                        Collinearity statistics
              B        Std. Error    Beta           t          Sig.      Tolerance    VIF
(Constant)    7.391    3.625                        2.039      .045
LAT           -.191    .091          -3.095         -2.101     .039      .003         307.745
LONG          -.093    .035          -1.824         -2.659     .010      .015         66.784
LOXLA         .002     .001          4.323          2.572      .012      .002         400.939

After centering both lat and long

Coefficients (dependent variable: LC3)

              Unstandardized         Standardized                        Collinearity statistics
              B        Std. Error    Beta           t          Sig.      Tolerance    VIF
(Constant)    -.553    .027                         -20.131    .000
LONRE         -.003    .004          -.051          -.597      .552      .980         1.020
LATRE         .048     .006          .783           8.484      .000      .827         1.209
RELALO        .002     .001          .238           2.572      .012      .820         1.220

Analysis of variance

Source of variation    SS                             df           MS
Regression             $\sum(\hat{y} - \bar{y})^2$    $p$          $\sum(\hat{y} - \bar{y})^2 / p$
Residual               $\sum(y_{obs} - \hat{y})^2$    $n-p-1$      $\sum(y_{obs} - \hat{y})^2 / (n-p-1)$
Total                  $\sum(y_{obs} - \bar{y})^2$    $n-1$

Matrix algebra approach to OLS estimation of multiple regression models
• $Y = X\beta + \varepsilon$
• $X'Xb = X'Y$
• $b = (X'X)^{-1}(X'Y)$

Criteria for "best" fitting in multiple regression with p predictors

Criterion: Formula
• $r^2$: $\dfrac{SS_{Regression}}{SS_{Total}} = 1 - \dfrac{SS_{Residual}}{SS_{Total}}$
• Adjusted $r^2$: $1 - \dfrac{n-1}{n-p-1}\,(1 - r^2)$
• Akaike Information Criterion (AIC): $2\left[\dfrac{n}{2}\left(\ln\left(2\pi\,\dfrac{SS_{Residual}}{n}\right) + 1\right)\right] + \dfrac{2pn}{n-p-1}$
• Akaike Information Criterion (AIC): $n\left[\ln\left(\dfrac{SS_{Residual}}{n}\right)\right] + \dfrac{2pn}{n-p-1}$

Hierarchical partitioning and model selection

No. of predictors    Model                    r2        Adj r2    P         AIC (R)
1                    Lon                      0.0006    -0.013    0.84      30.15
1                    Lat                      0.47      0.46      <0.001    -16.16
2                    Lon + Lat                0.48      0.46      <0.001    -15.25
3                    Lon + Lat + Lon x Lat    0.54      0.52      <0.001    -22.55

[Figure: C3 (R2 = 0.48) with predictors Longitude and Latitude]

[Figure: panels titled "Model Lat + Long" and "Model Lat * Long" showing predicted values (Y_hats.longlat, Y_hats.longxlat) over cLAT and cLONG]

[Figure: "C3 grasses in North America": relative abundance against Longitude, with lines labelled 35 Lat and 45 Lat]

The final forward selection model is:

    Step:  AIC=-228.67
    SQRT_C3 ~ LAT + MAP + JJAMAP + DJFMAP

             Df  Sum of Sq     RSS      AIC
    <none>                  2.7759  -228.67
    + LONG    1  0.0209705  2.7549  -227.23
    + MAT     1  0.0001829  2.7757  -226.68

    Call:
    lm(formula = SQRT_C3 ~ LAT + MAP + JJAMAP + DJFMAP)

    Coefficients:
    (Intercept)         LAT         MAP      JJAMAP      DJFMAP
     -0.7892663   0.0391180   0.0001538  -0.8573419  -0.7503936

The final backward selection model is:

    Step:  AIC=-229.32
    SQRT_C3 ~ LAT + JJAMAP + DJFMAP

               Df  Sum of Sq     RSS      AIC
    <none>                    2.8279  -229.32
    - DJFMAP    1    0.26190  3.0898  -224.85
    - JJAMAP    1    0.31489  3.1428  -223.61
    - LAT       1    2.82772  5.6556  -180.72

    Call:
    lm(formula = SQRT_C3 ~ LAT + JJAMAP + DJFMAP)

    Coefficients:
    (Intercept)         LAT      JJAMAP      DJFMAP
       -0.53148     0.03748    -1.02823    -1.05164
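The AIC values on the two "Step:" lines can be checked by hand from the RSS column. The sketch below assumes the output comes from R's step(), which for a linear model scores each candidate as n*ln(RSS/n) + 2k, where k counts the estimated coefficients including the intercept. The function name step_aic is hypothetical, and n = 73 is the number of sites stated for the data set.

```python
import math

N_SITES = 73  # number of sites reported for the data set


def step_aic(rss, n_coef, n=N_SITES):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * k.

    Assumption: this is the scale used in the step tables; k (n_coef)
    counts the intercept plus the slopes in the model.
    """
    return n * math.log(rss / n) + 2 * n_coef


# Forward model: intercept + LAT + MAP + JJAMAP + DJFMAP (k = 5), RSS = 2.7759
forward = step_aic(2.7759, n_coef=5)

# Backward model: intercept + LAT + JJAMAP + DJFMAP (k = 4), RSS = 2.8279
backward = step_aic(2.8279, n_coef=4)

print(f"forward:  {forward:.2f}")
print(f"backward: {backward:.2f}")
```

Under these assumptions the two results round to the -228.67 and -229.32 printed in the step output. The backward model has the lower AIC despite its slightly larger RSS, because it estimates one coefficient fewer.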