EXERCISES FOR GLMs and GAMs

The following is adapted from a really awesome suite of tutorials by Dave Roberts on vegetation dynamics using the labdsv package: http://ecology.msu.montana.edu/labdsv/R/.

I. GLMs

Introduction to Modeling Species-Environment Relations

Introduction

We often want to characterize the distribution of vegetation concisely and quantitatively, as well as to assess the statistical significance of observed relationships. The typical approach to such inferential analysis is (multiple) linear regression by least squares. Unfortunately, vegetation data generally violate the basic assumptions of linear regression. Specifically, the dependent variable is generally species presence/absence, which does not meet the assumption of an unbounded dependent variable: presence/absence is either 0 or 1, with no intermediate value possible and no values less than 0 or greater than 1. In addition, linear regression by least squares assumes that the errors are normally distributed with zero mean and constant variance; this is rarely if ever true for vegetation data.

Fortunately, alternative inferential statistics have been developed which eliminate (or at least finesse) these problems. The first technique we will explore is "generalized linear models," specifically the technique known as "logistic regression."

Generalized Linear Models

Generalized linear models eliminate the problem of bounded dependent variables by transformation to the logit (the log of the odds ratio) for logistic regression. While these transformations can be employed in linear regression by least squares, generalized linear models also simultaneously eliminate the problem that the variance is no longer approximately constant, by employing the appropriate variance for binomial distributions in a weighted, iterated least squares calculation. While direct analytical solution of the least squares problem is no longer possible, efficient computer algorithms exist to solve the iterated problem.

Example Logistic Regression

Generalized linear models for logistic regression are based on minimizing deviance. Deviance is a concept similar to variance, in that it can be used to measure variability around an estimate or model, but it has a different calculation for logistic values. Deviance is often a difficult concept to grasp, and like many things, may be best appreciated through an example. The following is an extremely simplified example of logistic regression to help demonstrate definitions and calculations. Suppose we had two vectors:

demo = c(0,0,0,0,1,0,1,1,1,1)
x = 1:10

where demo is the presence/absence of a species of interest, and x is the value of a variable of interest, perhaps along a gradient. If you look at the distribution, you can easily visualize that demo is generally absent at low values of x, and generally present at high values of x.

Suppose we try fitting a linear regression through these data.

demo.lm = lm(demo~x)
summary(demo.lm)

# Call:
# lm(formula = demo ~ x)
#
# Residuals:
#        Min         1Q     Median         3Q        Max
# -5.697e-01 -1.455e-01 -3.469e-18  1.455e-01  5.697e-01
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.26667    0.22874  -1.166  0.27728
# x            0.13939    0.03687   3.781  0.00538 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.3348 on 8 degrees of freedom
# Multiple R-Squared: 0.6412, Adjusted R-squared: 0.5964
# F-statistic: 14.3 on 1 and 8 DF, p-value: 0.005379
The results show that demo is highly related to x, or, in other words, that x is "highly significant"; note the ** for x in the summary. Looks good! Let's see what the model looks like.

plot(x,fitted(demo.lm))
points(x,demo,col='red')

Notice how the fitted points (black) start below 0 and slant upward past 1.0. If we interpret these values as probabilities that species demo is present, we get some nonsensical values: when x is less than or equal to 1, we get negative probabilities, and when x is greater than 9, we get probabilities greater than 1.0. Neither of those is possible, of course. Actually, there are two primary problems:

1. the fitted values lie outside the range of possible values, and
2. the residuals are not even approximately balanced, and the variance is not constant across the range of values of x.

Alternatively, we can fit a generalized linear model that models the log of the odds (called the logit) rather than the probability, and finesse both problems simultaneously. An odds ratio for a binary response variable is the ratio of the probability of getting a 1 to the probability of getting a 0. If we call p the probability of getting a 1, then this ratio is p / (1-p). If we take the log of this ratio, it's known as the logit function, and is written as logit(p) = log( p / (1-p) ). We fit a GLM as follows:

demo.glm = glm(demo~x,family=binomial)
plot(x,demo,ylim=c(-0.2,1.2))
points(x,fitted(demo.glm),col=2,pch="+")

The red crosses represent the fitted values from the GLM. Notice how they never go below 0.0 or above 1.0. In addition, the family=binomial specification in the glm() function specifies a logistic regression and makes sure that the fitting function knows that the variance is proportional to the fitted value times 1 minus the fitted value, rather than constant (recall that the variance of the binomial distribution is p*(1-p); see http://en.wikipedia.org/wiki/Binomial_distribution).

OK, so where does deviance enter in? Deviance is calculated as -2 * the log-likelihood. In a logistic regression you can visualize the deviance of a fitted point as a pair of arrows drawn from the fitted point up to 1 and down to 0. We take the length of each arrow, multiply it by the log of that length, add the same calculation for the other arrow at that point, and finally take -2 * the sum of all those values. This works because the likelihood of a data point for the binomial distribution is proportional to p^y * (1-p)^(n-y), where y is the number of 1s and n is the number of Bernoulli trials (e.g. if you toss a fair coin 10 times and get 6 heads, then p=0.5, n=10, y=6). In this example n=1, y=0 or 1, and p equals the value of the red cross. Taking logs turns the multiplication into a sum.
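Since we will repeat this per-point calculation several times below, it may help to wrap it in a tiny helper function (a sketch in base R; the name arrow.dev is mine, not part of any package). The manual calculations that follow simply expand this formula.

# Deviance contribution of a single fitted probability p, built from the
# two "arrows": one of length p, one of length 1-p.
arrow.dev = function(p) -2 * (p*log(p) + (1-p)*log(1-p))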
If you look at the first point, for example, the fitted value is almost exactly on top of the actual value, so the length of one of the arrows is almost zero (too small to see in a plot). For this point, the deviance is very small. In fact,

fitted(demo.glm)
#           1           2           3           4           5           6
# 0.002850596 0.010397540 0.037179991 0.124285645 0.342804967 0.657195033
#           7           8           9          10
# 0.875714355 0.962820009 0.989602460 0.997149404

shows that the fitted value of the first point is 0.002850596. If we calculate the deviance for just that point:

-2 * (0.002850596*log(0.002850596)+(1-0.002850596)*log(1-0.002850596))
# [1] 0.03910334

it's only 0.03910334, a very small value. If we look at point 5, its fitted value is 0.342804967. The deviance associated with point 5 is

-2 * (0.342804967*log(0.342804967)+(1-0.342804967)*log(1-0.342804967))
# [1] 1.285757

This is a much higher value. In a logistic regression, deviances greater than one for a single point are indicative of poor fits.

What's the most deviance a single point can contribute? Well, let's look at the distribution of deviance for all probabilities between 0 and 1.

q = seq(0.0,1.0,0.05)
plot(q,-2*(q*log(q)+(1-q)*log(1-q)))

Notice how the curve is unimodal, symmetric, and achieves its maximum value at 0.5. Notice too how probabilities between 0.2 and 0.8 have deviance greater than 1. The maximum deviance for a single point in a logistic regression occurs at probability = 0.5, where it reaches 1.386294.
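As a consistency check (a sketch, assuming demo.glm and the arrow.dev helper from above are in the workspace), at the fitted model the per-point contributions add up to the residual deviance that glm() reports:

p = fitted(demo.glm)
sum(arrow.dev(p))    # total of the per-point contributions, about 5.02
deviance(demo.glm)   # residual deviance reported by glm(); the same value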
A CASE STUDY

The Data Set

We will attempt to model the distribution of Berberis (Mahonia) repens, a common shrub in mesic portions of Bryce Canyon. You can download these data and load them into R all at once as follows:

veg=read.table('http://ecology.msu.montana.edu/labdsv/R/labs/lab1/bryceveg.s')
attach(veg) # to allow us to reference the columns by name
head(veg)[,1:8]
#         junost ameuta arcpat arttri atrcan berfre ceamar cerled
# bcnp__1      0    0.0    1.0      0      0      0    0.5      0
# bcnp__2      0    0.5    0.5      0      0      0    0.0      0
# bcnp__3      0    0.0    1.0      0      0      0    0.5      0
# bcnp__4      0    0.5    1.0      0      0      0    0.5      0
# bcnp__5      0    0.0    4.0      0      0      0    0.5      0
# bcnp__6      0    0.5    1.0      0      0      0    1.0      0

This file, bryceveg.s, contains all the vegetation data from a vegetation study of Bryce Canyon National Park, with the vegetation abundance recorded by cover class (see below). The rows are sampling locations and the columns are species. Bryce Canyon National Park is a 14000 ha park in southern Utah, with vegetation representative of a transition from the cordilleran flora of the central Rocky Mountains to the Colorado Plateau. The Park vegetation ranges from small areas of badlands and sagebrush steppe at low elevations, through an extensive pinyon-juniper woodland, to ponderosa pine savanna at mid-elevations, reaching closed mixed conifer forest at high elevations.

The dataset contains 160 0.1-acre circular sample plots, where the cover of all vascular plant species (except trees) was estimated by eye according to the following scale:

code   range (%)   mid-point   presence/absence   nominal
+      (present)      0.0             1              0.2
T      0-1            0.5             1              0.5
1      1-5            3.0             1              1.0
2      5-25          15.0             1              2.0
3      25-50         37.5             1              3.0
4      50-75         62.5             1              4.0
5      75-95         85.0             1              5.0
6      95-100        97.5             1              6.0

The abundance of trees was estimated by basal area (cross-sectional area of stems at breast height), but is not included in the data set because it is more representative of successional status than environmental relations. The cover scale is commonly referred to as the "Pfister" scale after R.D. Pfister, who first used it for vegetation analysis in the western U.S. (Pfister et al. 1977). It is similar to the Braun-Blanquet or Domin scale commonly used in Europe.

There are 169 vascular plant species in the data set. Since we'll be focusing on Berberis (Mahonia) repens, note that you can reference just this column from the veg data. This gives the abundance at each of the 160 sites.

berrep
#   [1] 1.0 0.0 0.5 1.0 0.5 1.0 0.5 0.0 1.0 0.5 1.0 0.0 0.5 0.5 0.5
#  [16] 1.0 0.5 0.5 0.5 0.0 0.5 0.5 0.0 0.5 0.5 0.5 0.0 0.5 0.5 0.5
#  [31] 0.5 1.0 0.0 1.0 0.5 0.0 0.5 0.0 0.5 0.5 0.0 0.5 0.5 0.0 0.5
#  [46] 0.5 0.5 1.0 0.5 0.5 0.5 0.5 0.5 0.0 0.5 0.5 0.5 0.0 1.0 1.0
#  [61] 1.0 0.5 1.0 1.0 1.0 0.5 0.5 0.5 1.0 0.5 0.0 0.5 0.0 0.0 0.0
#  [76] 0.5 0.5 1.0 1.0 0.0 0.0 0.5 0.5 0.5 3.0 1.0 1.0 0.0 0.0 0.0
#  [91] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# [106] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# [121] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# [136] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# [151] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Next we need to obtain the data on the sampling sites. Again, we can read this in from the website and look at the data.

site=read.table('http://ecology.msu.montana.edu/labdsv/R/labs/lab2/brycesit.s',header=T)
attach(site)
head(site)
#   labels annrad asp   av   depth east elev grorad north
# 1  50001    241  30 1.00    deep  390 8720    162  4100
# 2  50002    222  50 0.96 shallow  390 8360    156  4100
# 3  50003    231  50 0.96 shallow  390 8560    159  4100
# 4  50004    254 360 0.93 shallow  390 8660    166  4100
# 5  50005    232 300 0.48 shallow  390 8480    159  4100
# 6  50006    216 330 0.76 shallow  390 8560    155  4100
#         pos quad slope
# 1     ridge   pc     9
# 2 mid_slope   pc     2
# 3 mid_slope   pc     2
# 4     ridge   pc     0
# 5  up_slope   pc     2
# 6 mid_slope   pc     2

Preliminary graphical analysis (Berberis locations plotted in red) suggests that the distribution of Berberis repens is related to elevation and possibly aspect value. We'll use this information to start constructing a model. The term elev[berrep>0] below means that we only want the values of elevation where the abundance of barberry is greater than 0.

plot(elev,av)
points(elev[berrep>0],av[berrep>0],col='red',pch='+')

The Model

To model the presence/absence of Berberis as a function of elevation, we use:

bere1.glm = glm(berrep>0 ~ elev, family = binomial)

Remember that in R equations are given in a general form, and that we can use logical subscripts. bere1.glm=glm(...) evaluates to "store the result of the generalized linear model in an object called 'bere1.glm'." Any object name could be used, but "variable.glm" is concise and self-explanatory. The number 1 is in anticipation of fitting more models later. berrep>0 evaluates to a logical, FALSE (absent) or TRUE (present), and ~ elev evaluates to "as a function of elevation." The family = binomial tells R to perform logistic regression, as opposed to other forms of GLM (like Poisson regression, which is for count data).

R calls the estimated probabilities "fitted values", and we can use the fitted function to extract the probabilities from our GLM object (bere1.glm).
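As an aside, the same probabilities are also available through predict(); a quick sketch (assuming bere1.glm from above):

fitted(bere1.glm)[1:5]                       # estimated probabilities
predict(bere1.glm)[1:5]                      # linear predictor, in logits
predict(bere1.glm,type="response")[1:5]      # back-transformed; same as fitted()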
To see a graphical representation of the fitted model, do:

plot(elev,fitted(bere1.glm),xlab='Elevation',ylab='Probability of Occurrence')

Notice how the fitted curve is a smooth sigmoidal curve. The logistic regression is linear in the logit, but when back-transformed from the logit to probabilities, it's sigmoidal.

This is a good start, but we originally guessed that the presence/absence of Berberis was a function of both elevation and aspect value. In addition, following conventional ecological theory we might assume that the probability exhibits a unimodal response to environment. Getting a smooth unimodal response is simple: just as a linear logistic regression is sigmoidal when back-transformed to probability, a quadratic logistic regression is unimodal symmetric when back-transformed to probabilities. Accordingly, let's fit a model that is quadratic for elevation and aspect value. We use:

bere2.glm=glm(berrep>0~elev+I(elev^2)+av+I(av^2),family=binomial)

~elev+I(elev^2)+av+I(av^2) evaluates to "as a function of elevation, elevation squared, aspect value, and aspect value squared." The I(elev^2) tells R that "I really do mean elevation squared," so that it doesn't attempt to interpret the ^2 as an R formula operator. All together, the call models the probability of the presence of Berberis as a unimodal function of elevation and aspect value. The details of the object are available with the summary function.

summary(bere2.glm)
# Call:
# glm(formula = berrep > 0 ~ elev + I(elev^2) + av + I(av^2),
#     family = binomial)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -2.3801  -0.7023  -0.4327   0.5832   2.3540
#
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)  7.831e+01  4.016e+01   1.950   0.0512 .
# elev        -2.324e-02  1.039e-02  -2.236   0.0254 *
# I(elev^2)    1.665e-06  6.695e-07   2.487   0.0129 *
# av           4.385e+00  2.610e+00   1.680   0.0929 .
# I(av^2)     -4.447e+00  2.471e+00  -1.799   0.0719 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 218.19 on 159 degrees of freedom
# Residual deviance: 143.19 on 155 degrees of freedom
# AIC: 153.19
#
# Number of Fisher Scoring iterations: 5

The listing includes the deviance residual quartiles, the variable coefficients, the null deviance vs. residual deviance, and an AIC (Akaike information criterion) statistic. This last item doesn't concern us yet, but will be handy later on. The output includes a Z statistic (standardized normal deviate) and p value for each variable in the model, testing the hypothesis that the true value of each coefficient is 0. As you can determine from the output, it appears that elev and I(elev^2) are significantly different from 0, and the intercept is marginal.

While GLM models do not have quite the same properties as ordinary least squares (e.g. deviance instead of variance), you can still get good indications about the performance of the model. Analogous to R^2, 1-(residual deviance/null deviance) is a good indicator of overall model fit. E.g.

1-(143.1878/218.1935)
# [1] 0.3437577
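This calculation is handy enough to wrap in a one-line function (a sketch; the name pseudo.r2 is mine, not from any package):

pseudo.r2 = function(model) 1 - model$deviance/model$null.deviance
pseudo.r2(bere2.glm)   # ~0.34, matching the hand calculation above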
In addition, each term in the model can be tested in several ways. First, the terms can be tested with the anova function, using a Chi-squared test statistic.

anova(bere2.glm,test="Chi")
# Analysis of Deviance Table
#
# Model: binomial, link: logit
#
# Response: berrep > 0
#
# Terms added sequentially (first to last)
#
#           Df Deviance Resid. Df Resid. Dev P(>|Chi|)
# NULL                        159    218.193
# elev       1   66.247       158    151.947 3.979e-16
# I(elev^2)  1    5.320       157    146.627     0.021
# av         1    0.090       156    146.537     0.765
# I(av^2)    1    3.349       155    143.188     0.067

The test being performed assumes that the difference in deviance (see the Deviance column) attributed to a particular variable is distributed as Chi-squared with the number of degrees of freedom attributed to that variable (see the Df column). E.g., the elev variable achieves a reduction in deviance of 66.247 at a cost of 1 degree of freedom; the probability of that much reduction arising by chance (Type I error) is essentially 0. It is important to realize that this table enters the variables in the order given in the equation, and that each variable is tested against the residual deviance after earlier variables (including the intercept) have been taken into account. The indication is that elevation is very important, and that the quadratic term is also statistically significant. Neither av nor I(av^2) is significant, at least after accounting for elevation.

For nested models (where a simpler model is a formal subset of the more complex model) it is also possible to compare GLMs with the anova function, as follows:

bere3.glm=glm(berrep>0~elev+I(elev^2),family=binomial)
anova(bere3.glm,bere2.glm,test="Chi")
# Analysis of Deviance Table
#
# Model 1: berrep > 0 ~ elev + I(elev^2)
# Model 2: berrep > 0 ~ elev + I(elev^2) + av + I(av^2)
#   Resid. Df Resid. Dev Df Deviance P(>|Chi|)
# 1       157    146.627
# 2       155    143.188  2    3.439     0.179

The table is telling us that the reduction in deviance attributable to adding av and I(av^2) is 3.439, and that the probability of getting that much reduction with an additional 2 degrees of freedom is 0.179. This suggests strongly that aspect value is not significant.

Alternatively, we can compare the AIC values for the two models with the AIC() function. Here, we choose the model with the smaller AIC as superior. The output is much more terse, and we have to remember which terms are in each model.

AIC(bere2.glm,bere3.glm)
#           df      AIC
# bere2.glm  5 153.1878
# bere3.glm  3 152.6268

The analysis suggests that the addition of av and av^2 is not a significant improvement, as we determined before. AIC balances model accuracy and complexity by rewarding models with good fit (high likelihood) while penalizing models that use more parameters. Smaller values of AIC are better, and a difference greater than about three units usually means (the cutoff is admittedly a bit subjective) that there isn't much support for the model with the higher AIC. A rule of thumb is that differences in AIC greater than 3-5 indicate support for the model with the lower value, and differences greater than 10 indicate strong support. The small difference in AIC values here indicates that the models aren't appreciably different, so Occam and his razor suggest the simpler model.
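One way to make "not appreciably different" concrete is to convert the AIC values to Akaike weights, which rescale the AIC differences to relative support on a 0-1 scale (a sketch; the weights follow the standard exp(-(AIC difference)/2) formula):

aics = AIC(bere2.glm,bere3.glm)$AIC
w = exp(-0.5*(aics - min(aics)))
w/sum(w)   # roughly 0.43 vs 0.57: nearly even support for the two models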
So, how good is the model? For the time being (and for the sake of demonstration), we're going to stick with our full model even though we don't believe av matters. We'll delete it later and see the results. Our pseudo-R^2 of 0.34 suggests that the model is not great, but plant species are often difficult to model well, and we've just started. We can calculate the fraction (or equivalently, percent) of plots correctly predicted with the table function. Since R refers to the estimated probabilities as fitted values, we can use the fitted function to extract them and do the following:

table(berrep>0,fitted(bere2.glm)>0.5)
#         FALSE TRUE
#   FALSE    83    9
#   TRUE     15   53

This is slightly complicated, so let's look at the call. berrep>0, you remember, evaluates to a logical (TRUE or FALSE); fitted(bere2.glm)>0.5 does the same for the estimated probabilities. Here, we classify all plots with a probability > 0.5 as "present" (in this case TRUE), and those with a probability less than 0.5 as absent (FALSE). The first variable in the call is listed as the rows, and the second variable as the columns. As you can see, we correctly predicted 136 out of 160 plots (0.85). In addition, we made slightly fewer "errors of commission" (predicted Berberis to be present when it's absent = 9) than "errors of omission" (predicted absent when present = 15). Overall, our model is fair, but it is biased, and under-predicts Berberis slightly.
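The 2x2 table can be turned into an overall accuracy figure directly (a quick sketch, reusing the table above):

confusion = table(berrep>0,fitted(bere2.glm)>0.5)
sum(diag(confusion))/sum(confusion)   # proportion of plots predicted correctly, 0.85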
That's all well and good, but what does our model really look like? Well, the current model is multi-dimensional (elev and av), so it's problematic to look at directly. In this case we can look at the model one dimension at a time. The first line below plots the probability of presence for all points, and the second line makes the points that correspond to observed presences red.

plot(elev,fitted(bere2.glm))
points(elev[berrep>0],fitted(bere2.glm)[berrep>0],col=2)

This is often called a response curve. Notice how the points make a fairly smooth curve as elevation increases; this is the effect due to elevation. Notice also that there is a small vertical distribution of points at any given elevation; this is the effect due to aspect value. As we noted before, the av and I(av^2) variables are not contributing significantly to the fit. If we look at the model without them, we get:

plot(elev,fitted(bere3.glm))

Notice how the scatter due to av is now gone, and we get a clear picture of the relationship of the probability of Berberis to elevation. Notice too that the curve actually goes up slightly at lower elevations; this is unexpected, and probably ecologically improbable given what we know about species response to environment. While it first appeared that the GLM was successful in fitting a modal response model to the elevation data, what we actually observe is a truncated portion of an inverted modal curve. If you go back and look at the coefficients fit by the model, you see:

# elev      -2.323e-02  1.030e-02  -2.256   0.0241 *
# I(elev^2)  1.664e-06  6.631e-07   2.510   0.0121 *

What's important here is that the linear term elev is negative, while the quadratic term I(elev^2) is positive. This is indicative of an inverted fit; what we want is a positive linear term and a negative quadratic term, to get a classical unimodal response curve. Some ecologists (e.g. ter Braak and Looman [1995]) suggest that inverted curves should not be accepted, as they violate our ecological understanding and are dangerous if extrapolated. On the other hand, if accurate prediction is the primary objective, we might prefer an "improper" fit if it is more accurate within the range of our data. We can test for this, of course, as follows:

anova(bere3.glm,glm(berrep>0~elev,family=binomial),test="Chi")
# Analysis of Deviance Table
#
# Response: berrep > 0
#                  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
# elev + I(elev^2)       157    146.627
# elev                   158    151.947 -1   -5.320     0.021

In this case I didn't even store the linear logistic model; I simply embedded the call to glm within the anova function. (Actually, we already calculated this model as bere1.glm, but I wanted to demonstrate nested calls for future use.) As you can see, the quadratic term achieves a small but significant reduction in deviance. The AIC analysis agrees.

AIC(bere3.glm,glm(berrep>0~elev,family=binomial))
#                                           df      AIC
# bere3.glm                                  3 152.6268
# glm(berrep > 0 ~ elev, family = binomial)  2 155.9469

If we decide not to accept the inverted modal GLM, going back to the simpler model would only be slightly less accurate. In fact, converted to a simple test, the simple GLM has equal prediction accuracy (0.85).

table(berrep>0,fitted(bere1.glm)>0.5)
#         FALSE TRUE
#   FALSE    83    9
#   TRUE     15   53

This is an example where minimizing deviance and maximizing prediction accuracy are not the same thing. Some statistical approaches let you choose the criterion to optimize, but GLM does not, at least not directly. As another example of nested calls, we can plot the linear model without storing it, as follows:

plot(elev,fitted(glm(berrep>0~elev,family=binomial)))

Models can be modified by adding or dropping specific terms, using an abbreviated formula, as follows:

bere4.glm=update(bere2.glm,.~.+depth,na.action=na.omit)
anova(bere4.glm,test="Chi")
# Analysis of Deviance Table
#
# Model: binomial, link: logit
#
# Response: berrep > 0
#
# Terms added sequentially (first to last)
#
#           Df Deviance Resid. Df Resid. Dev P(>|Chi|)
# NULL                        144    197.961
# elev       1   69.702       143    128.258 6.896e-17
# I(elev^2)  1    2.677       142    125.581     0.102
# av         1    0.080       141    125.502     0.778
# I(av^2)    1    7.592       140    117.910     0.006
# depth      1   12.422       139    105.488 4.243e-04

The analysis added soil depth to the existing model. The notation update(bere2.glm,.~.+depth) means "update model bere2.glm, use the same formula, but add depth." The na.action=na.omit means omit cases where soil depth is unknown; unfortunately, several plots in the data set don't have soil data. As you might guess, soil depth and elevation are not independent, with deeper soils generally occurring at lower elevations. This last model suggests that soil depth is significant, even after accounting for elevation. We actually observe a slightly higher deviance reduction due to elevation than before, but this is due to dropping those plots where soil depth is unknown, not to a change in the effect of elevation.
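Because the anova table is sequential, the apparent importance of a term depends on where it enters the formula. A complementary check (a sketch; drop1() is part of base R's stats package) is to test each term as if it were the last one added:

drop1(bere4.glm,test="Chisq")   # change in deviance when each term alone
                                # is removed from the full model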
Variables can be dropped as well, for example:

bere5.glm=update(bere2.glm,.~.-av-I(av^2))
summary(bere5.glm)
# Call:
# glm(formula = berrep > 0 ~ elev + I(elev^2), family = binomial)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -2.5304  -0.6914  -0.4661   0.5889   2.1420
#
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)  7.310e+01  3.979e+01   1.837   0.0662 .
# elev        -2.167e-02  1.027e-02  -2.110   0.0349 *
# I(elev^2)    1.560e-06  6.613e-07   2.359   0.0183 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 218.19 on 159 degrees of freedom
# Residual deviance: 146.63 on 157 degrees of freedom
# AIC: 152.63
#
# Number of Fisher Scoring iterations: 4

Note, however, that I had to use I(av^2) instead of the simpler av^2 to get the correct result.

II. GAMs

Modeling Species-Environment Relations with Generalized Additive Models (GAMs)

Introduction

In the last set of exercises, we developed sets of models of the distribution of Berberis repens on environmental gradients in Bryce Canyon National Park. The models were developed as "generalized linear models" (GLMs), and included logistic regression. As you will recall, GLMs finesse the problems of bounded dependent variables and heterogeneous variances by transforming the dependent variable and employing a specific variance function for that transformation. While GLMs were shown to work reasonably well, and are in fact the method of choice for many ecologists, they are limited to linear functions. When you examine the predicted values from GLMs, they form sigmoidal or modal curves, leading to the impression that they are not really linear. This is an artifact of the transformations employed, however; the models are linear in the logit (log of the odds) or log scale. This is simultaneously a benefit (the models are parametric, with a wealth of theory applicable to their analysis and interpretation) and a hindrance (they require a priori specification of the curve shape, and have difficulty fitting data that don't follow a simple parametric curve shape).

Generalized additive models (GAMs) are designed to capitalize on the strengths of GLMs (the ability to fit logistic and Poisson regressions) without requiring the problematic steps of a priori specification of the response curve shape or a specific parametric response function. They employ a class of equations called "smoothers," or "scatterplot smoothers," that attempt to generalize data into smooth curves by local fitting to subsections of the data. The design and development of smoothers is a very active area of research in statistics, and a broad array of such functions has been developed. The simplest example likely to be familiar to most ecologists is the running average, where you calculate the average value of the data in a "window" along some gradient. While the running average is an example of a smoother, it's rather primitive, and much more efficient smoothers have been developed.

The idea behind GAMs is to "plot" (conceptually, not literally) the value of the dependent variable along a single independent variable, and then to calculate a smooth curve that goes through the data as well as possible, while being parsimonious. The trick is in the parsimony. It would be possible, using a polynomial of high enough order, to get a curve that went through every point. It is likely, however, that the curve would "wiggle" excessively, and not represent a parsimonious fit. The approach generally employed with GAMs is to divide the data into some number of sections, using "knots" at the ends of the sections. Then a low-order polynomial or spline function is fit to the data in each section, with the added constraint that the second derivative of the function at the knots must be the same for both sections sharing that knot. This eliminates kinks in the curve, and ensures that it is smooth and continuous at all points.
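For intuition, base R's smooth.spline() is one such spline smoother, and we can watch it in action on the toy data from Part I (a sketch; demo and x are assumed to still be in the workspace, and this is an ordinary least-squares spline, purely illustrative for 0/1 data):

plot(x,demo)
lines(smooth.spline(x,demo,df=3))   # a smooth curve drawn through the 0/1 data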
The problem with GAMs is that they are simultaneously very simple and extraordinarily complex. The idea is simple: let the data speak, and draw a simple smooth curve through the data. Most ecologists are quite capable of doing this by eye; the problem is determining goodness-of-fit and error terms for a curve fit by eye. GAMs make this unnecessary, fitting the curve algorithmically in a way that allows error terms to be estimated precisely. On the other hand, the algorithm that fits the curve is usually iterative and non-parametric, masking a great deal of complex numerical processing. It's much too complex to address here, but there are at least two significant approaches to solving the GAM parsimony problem, and it is an area of active research. As a practical matter, we can view GAMs as non-parametric curve fitters that attempt to achieve an optimal compromise between goodness-of-fit and parsimony of the final curve. Similar to GLMs, on species data they operate on deviance, rather than variance, and attempt to achieve the minimal residual deviance on the fewest degrees of freedom.

One of the interesting aspects of GAMs is that they can only approximate the appropriate number of degrees of freedom, and the number of degrees of freedom is often not an integer, but rather a real number with some fractional component. This seems very odd at first, but is actually a fairly straightforward extension of concepts you are already familiar with. A second-order polynomial (or quadratic equation) in a GLM uses two degrees of freedom (plus one for the intercept). A curve that is slightly less regular than a quadratic might require two and a half degrees of freedom (plus one for the intercept), but might fit the data better.

The other aspect of GAMs that is different is that they don't handle interactions well (i.e. a product of two predictor variables). Rather than fit multiple variables simultaneously, the algorithm fits a smooth curve to each variable and then combines the results additively, thus giving rise to the name "generalized additive models." This limitation matters less here, in that we never fit interaction terms in our GLMs in the previous exercise either. Interactions can matter: for example, it is possible that slope only matters at some range of elevations, giving rise to an interaction of slope and elevation. In practice, interaction terms can be significant, but they often require fairly large amounts of data. The Bryce data set is relatively small, and tests of interaction were generally not significant.

GAMs in R

There are at least two implementations of GAMs in R. The one we will employ is in the package gam, but it's worth noting that an alternate version is in mgcv. GAMs are invoked with

result = gam(y~s(x))

where y and x represent the dependent variable and an independent variable, respectively. The notation s(x) means to fit the dependent variable as a "smooth" of x.
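In the gam package, s() also accepts a target degrees of freedom for the smooth; the default is df=4 (one linear degree of freedom plus three nonparametric ones). A quick sketch, with y and x as placeholders:

result = gam(y~s(x))          # default smoothness, df = 4
result2 = gam(y~s(x,df=6))    # a more flexible (wigglier) smooth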
We'll be using the two data sets named veg and site from the GLM exercises, so make sure you've got those loaded. Let's start by installing and loading the package for GAMs.

install.packages('gam',dependencies=T)
library(gam)

To simplify matters, we will parallel our attempts from the last lab to model the distribution of Berberis repens. To model the presence/absence of Berberis as a smooth function of elevation:

bere1.gam = gam(berrep>0~s(elev),family=binomial)

Just as for the GLM, to get a logistic fit as appropriate for presence/absence values we specify family=binomial. To see the result, use the summary() function.

summary(bere1.gam)
# Call: gam(formula = berrep > 0 ~ s(elev), family = binomial)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -2.2252  -0.6604  -0.4280   0.5318   1.9649
#
# (Dispersion Parameter for binomial family taken to be 1)
#
#     Null Deviance: 218.1935 on 159 degrees of freedom
# Residual Deviance: 133.4338 on 155 degrees of freedom
# AIC: 143.4339
#
# Number of Local Scoring Iterations: 8
#
# DF for Terms and Chi-squares for Nonparametric Effects
#
#             Df Npar Df Npar Chisq P(Chi)
# (Intercept)  1
# s(elev)      1       3    18.2882 0.0004

As in the last lab, the important elements of the summary are the reduction in deviance and the degrees of freedom used. Here, we see that a null deviance of 218.1935 was reduced to 133.4338 for 4 degrees of freedom (one for the linear component of the smooth, and 3 for its nonparametric part). As shown in the lower section, the probability of achieving such a reduction by chance is about 0.0004.

To really see what the results look like, use the plot() function.

plot(bere1.gam,se=TRUE)

The default plot shows several things. The solid line is the predicted value of the dependent variable as a function of the x axis. The se=TRUE means to plot twice the standard errors of the estimates (the dashed lines). The small lines along the x axis are the "rug," showing the locations of the sample plots. The y axis is in the units of the linear predictor, in this case logits, so the values are centered on 0 (50/50 odds) and extend to both positive and negative values. To see the predicted values on the probability scale we need the back-transformed values, which are available as fitted(bere1.gam). So:

plot(elev,fitted(bere1.gam))

Notice how the curve has multiple inflection points and modes, even though we did not specify a high-order function. This is the beauty and bane of GAMs. In order to fit the data, the function fit a bimodal curve. It seems unlikely ecologically that a species would have a multi-modal response to a single variable. Rather, it would appear to suggest competitive displacement by some other species from 7300 to 8000 feet, or the effects of another variable that interacts strongly over that elevation range.

To compare to the GLM fit, we can superimpose the GLM predictions on the previous plot.

plot(elev,fitted(bere1.gam))
bere1.glm = glm(berrep>0 ~ elev, family = binomial)
points(elev,fitted(bere1.glm),col='red')
bere3.glm=glm(berrep>0~elev+I(elev^2),family=binomial)
points(elev,fitted(bere3.glm),col='green')

That's the predicted values from the first-order logistic GLM in red, and the quadratic GLM in green. Notice how even though the GAM curve is quite flexible, it avoids the problematic upturn at low values shown by the quadratic GLM.

The GAM achieves a residual deviance of 133.4 on 155 degrees of freedom (this was shown in the summary above). The following table compares it to the GLMs.

model            residual deviance   residual degrees of freedom
GAM                   133.43                   155
linear GLM            151.95                   158
quadratic GLM         146.63                   157
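The same table can be assembled directly from the fitted objects (a quick sketch; deviance() and the df.residual component work for both model classes here, since the gam objects inherit from glm):

data.frame(res.dev = c(deviance(bere1.gam),deviance(bere1.glm),deviance(bere3.glm)),
           res.df  = c(bere1.gam$df.residual,bere1.glm$df.residual,bere3.glm$df.residual),
           row.names = c("GAM","linear GLM","quadratic GLM"))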
How do we compare the fits? Well, since they're not nested, we can't properly use the anova method from the last lab. However, since they're fit to the same data, we can at least heuristically compare the goodness-of-fit and residual degrees of freedom. As I mentioned previously, GAMs often employ fractional degrees of freedom.

anova(bere3.glm,bere1.gam,test='Chi')
# NOTE: this is not a legitimate test because the models are not nested;
# we're just getting the residual deviances from this table.
# Analysis of Deviance Table
#
# Model 1: berrep > 0 ~ elev + I(elev^2)
# Model 2: berrep > 0 ~ s(elev)
#   Resid. Df Resid. Dev     Df Deviance P(>|Chi|)
# 1  157.0000    146.627
# 2  155.0000    133.434 2.0000   13.193     0.001

AIC(bere3.glm,bere1.gam)
#           df      AIC
# bere3.glm  3 152.6268
# bere1.gam  2 143.4339

The GAM clearly achieves a lower residual deviance, as shown in the anova table, and at a cost of only two degrees of freedom compared to the quadratic GLM. The probability of achieving a reduction in deviance of 13.2 for 2 degrees of freedom in a NESTED model is about 0.001. While this probability is not strictly correct for a non-nested comparison, I think it's safe to say that the GAM is a better fit. In addition, the AIC results are in agreement.

One approach to a simple evaluation is to compare the residuals in a boxplot.

boxplot(bere1.glm$residuals,bere3.glm$residuals,bere1.gam$residuals,
        names=c("GLM 1","GLM 2","GAM"))

The plot shows that the primary difference among the models is in the number of extreme residuals; the mid-quartiles of all models are very similar.

Multi-Variable GAMs

I mentioned above that GAMs combine multiple variables by fitting separate smooths to each variable and then combining them additively. Let's look at their ability with Berberis repens.

bere2.gam = gam(berrep>0~s(elev)+s(slope),family=binomial)

Plotting this model results in one plot for each independent variable.

plot(bere2.gam,se=TRUE)

The interpretation of the slope plot is that Berberis is most likely on mid-slopes (10-50%), but distinctly less likely on flats or steep slopes. Could a variable with that much error contribute significantly to the fit? We can use the anova() function for GAMs to look directly at the influence of slope on the fit.

anova(bere2.gam,bere1.gam,test="Chi")
# Analysis of Deviance Table
#
# Model 1: berrep > 0 ~ s(elev) + s(slope)
# Model 2: berrep > 0 ~ s(elev)
#   Resid. Df Resid. Dev Df Deviance P(>|Chi|)
# 1       151    113.479
# 2       155    133.434 -4  -19.954     0.001

We get a reduction in deviance of 19.95 for 4.00 degrees of freedom, with a probability of < 0.001. We can again see the result by plotting the fitted values against the independent variables.

plot(elev,fitted(bere2.gam))

Where before we had a smoothly varying line, we now have a scatterplot with some significant departures from the line, especially at low elevations, due to slope effects. As for slope itself:

plot(slope,fitted(bere2.gam))

The plot suggests a broad range of values for any specific slope, but there is a modal trend to the data.
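To see one variable's effect with the other held constant, we can predict over a grid (a sketch; predict() with newdata works for gam objects, and median(slope) is an arbitrary choice of reference value):

grid = data.frame(elev=seq(min(elev),max(elev),length=100), slope=median(slope))
plot(grid$elev, predict(bere2.gam,newdata=grid,type="response"),
     type='l', xlab='Elevation', ylab='Probability of occurrence')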
Can a third variable help still more? Let's try aspect value (av).

bere3.gam = gam(berrep>0~s(elev)+s(slope)+s(av),family=binomial)
anova(bere3.gam,bere2.gam,test="Chi")
# Analysis of Deviance Table
#
# Model 1: berrep > 0 ~ s(elev) + s(slope) + s(av)
# Model 2: berrep > 0 ~ s(elev) + s(slope)
#   Resid. Df Resid. Dev      Df Deviance P(>|Chi|)
# 1  147.0001    100.756
# 2  151.0000    113.479 -3.9999  -12.723     0.013

AIC(bere3.gam,bere2.gam)
#           df      AIC
# bere3.gam  4 126.7560
# bere2.gam  3 131.4795

Somewhat surprisingly (since it hasn't shown much power before), aspect value appears to be significant. Let's look (I'll omit the plots you've already seen). Surprisingly, since aspect value is designed to linearize aspect, the response curve is modal. Berberis appears to prefer warm aspects, but not south slopes. While we might expect some interaction here with elevation (preferring different aspects at different elevations to compensate for temperature), GAMs are not directly useful for interaction terms. Although aspect value is statistically significant, its effect size is small, as a plot of the fitted values against av demonstrates: almost all fitted values are possible at most aspect values. Nonetheless, it is statistically significant.

GAMs versus GLMs

GAMs are extremely flexible models for fitting smooth curves to data. On ecological data they often achieve results superior to GLMs, at least in terms of goodness-of-fit. On the other hand, because they lack a parametric equation for their results, they are somewhat hard to evaluate except graphically; you can't provide an equation of the result in print. Because of this, some ecologists find GLMs more parsimonious, and prefer the results of GLMs to GAMs even at the cost of a lower goodness-of-fit, due to the increased understanding of the results. Many ecologists fit GAMs as a means of determining the correct curve shape for GLMs, deciding whether to fit low- or high-order polynomials as suggested by the GAM plot. Given the ease of creating and presenting graphical output, the increased goodness-of-fit of GAMs and the insights available from analysis of their curvature over sub-regions of the data make GAMs a potentially superior tool in the analysis of ecological data. Certainly, you can argue either way.

Summary

GAMs are extremely flexible models for fitting smooth curves to data. They reflect a non-parametric perspective that says "let the data determine the shape" of the response curve. They are sufficiently computationally intense that only a few years ago they were perhaps infeasible on some data sets, but they are now easy and quick to apply to most data sets. Like GLMs, they are suitable for fitting logistic and Poisson regressions, which is of course very useful in ecological analysis.