Exercises - GLMs and GAMs

EXERCISES FOR GLMs and GAMs
The following is adapted from a really awesome suite of tutorials by Dave Roberts for Vegetation
dynamics using the labdsv package http://ecology.msu.montana.edu/labdsv/R/.
I. GLMs
Introduction to Modeling Species-Environment Relations
Introduction
We often want to characterize the distribution of vegetation concisely and quantitatively, as well
as to assess the statistical significance of observed relationships. The typical approach to such
inferential analysis is (multiple) linear regression by least squares. Unfortunately, vegetation
data generally violate the basic assumptions of linear regression. Specifically, the dependent
variable is generally species presence/absence, which does not meet the assumption of an
unbounded dependent variable. Presence/absence is either 0 or 1, with no intermediate value
possible and no values less than 0 or greater than 1. In addition, linear regression by least squares
assumes that the errors are normally distributed with zero mean and constant variance; this is
rarely if ever true for vegetation data.
Fortunately, alternative inferential statistics have been developed which eliminate (or at least
finesse) these problems. The first technique we will explore is "Generalized Linear Models",
specifically techniques known as "logistic regression."
Generalized Linear Models
Generalized linear models eliminate the problem of bounded dependent variables by
transformation to the logit (the log of the odds ratio) for logistic regression. While this
transformation can also be employed in linear regression by least squares, generalized linear models
simultaneously address the problem of non-constant variance
by employing the appropriate variance for binomial distributions in a weighted, iterated least
squares calculation. While direct analytical solution of the least squares problem is no longer
possible, efficient computer algorithms exist to solve the iterated problem.
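For concreteness, the logit link and its inverse can be written out directly in R (a minimal sketch; the helper names logit and inv.logit are introduced here, not taken from any package):
logit = function(p) log(p/(1-p))            # maps probabilities in (0,1) onto (-Inf, Inf)
inv.logit = function(z) exp(z)/(1+exp(z))   # back-transforms a logit to a probability
logit(c(0.1, 0.5, 0.9))                     # -2.197  0.000  2.197
inv.logit(logit(0.25))                      # recovers 0.25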
Example Logistic Regression
Generalized linear logistic regressions are based on minimizing deviance. Deviance is a concept
similar to variance, in that it can be used to measure variability around an estimate or model, but
has a different calculation for logistic values. Deviance is often a difficult concept to grasp, and
like many things, may be best appreciated through an example. The following is an extremely
simplified example of logistic regression to help demonstrate definitions and calculations.
Suppose we had two vectors:
demo = c(0,0,0,0,1,0,1,1,1,1)
x = 1:10
where demo is the presence/absence of a species of interest, and x is the value of a variable of
interest, perhaps along a gradient. If we look at the distribution, we can easily see that
demo is generally absent at low values of x, and generally present at high values of x.
Suppose we try fitting a linear regression through these data.
demo.lm = lm(demo~x)
summary(demo.lm)
# Call:
# lm(formula = demo ~ x)
#
# Residuals:
#        Min         1Q     Median         3Q        Max
# -5.697e-01 -1.455e-01 -3.469e-18  1.455e-01  5.697e-01
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.26667    0.22874  -1.166  0.27728
# x            0.13939    0.03687   3.781  0.00538 **
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.3348 on 8 degrees of freedom
# Multiple R-Squared: 0.6412,     Adjusted R-squared: 0.5964
# F-statistic: 14.3 on 1 and 8 DF, p-value: 0.005379
The results show that demo is highly related to x, or, in other words, that x is "highly significant",
note the ** for x in the summary. Looks good! Let's see what the model looks like.
plot(x,fitted(demo.lm))
points(x,demo,col='red')
Notice how the fitted points (black) start below 0 and slant upward past 1.0. If we interpret these
values as probabilities that species demo is present, we get some nonsensical values. When x is
less than or equal to 1, we get negative probabilities. When x is greater than 9, we get
probabilities greater than 1.0. Neither of those is possible of course.
Actually, there are two primary problems:
- the fitted values lie outside the range of possible values
- the residuals are not nearly balanced, and the variance is not constant across the range of values of x
Alternatively, we can fit a generalized linear model that models the log of the odds (called the
logit) rather than the probability, and finesse both problems simultaneously. An odds ratio for a
binary response variable is the ratio of the probability of getting a 1 to the probability of getting a
0. If we call p the probability of getting a 1, then this ratio is p / (1-p) . If we take the log of this
ratio, its known as the logit function, and is written as logit(p) = log( p / (1-p) ). If we fit a GLM as
follows
demo.glm = glm(demo~x,family=binomial)
plot(x,demo,ylim=c(-0.2,1.2))
points(x,fitted(demo.glm),col=2,pch="+")
we get
The red crosses represent the fitted values from the GLM. Notice how they never go below 0.0 or
above 1.0. In addition, the family=binomial specification in the glm() function specifies a logistic
regression and makes sure that the fitting function knows that the variance is proportional to the
fitted value times 1 minus the fitted value rather than constant (recall that the variance of the
binomial distribution is p*(1-p) - http://en.wikipedia.org/wiki/Binomial_distribution).
OK, so where does deviance enter in? Deviance is calculated as -2 * the log-likelihood. In a logistic
regression you can visualize the deviance as lines drawn from the fitted point to 0 and 1, as
shown below.
We take the length of each arrow and multiply that times the log of that length and add to the
same calculation for the other arrow for that point. Finally, take -2 * the sum of all those values.
This works because the likelihood of a data point for the binomial distribution is proportional to
p^y (1-p)^(n-y), where y is the number of 1s and n is the number of Bernoulli trials (e.g. if you toss a
fair coin 10 times and get 6 heads, then p=0.5, n=10, y=6). In this example n=1, y=0 or 1,
and p equals the value of the red cross. Taking logs turns the multiplication into a sum. If you
look at the first point, for example, the fitted value is almost exactly on top of the actual value, so
the length of one of the arrows is almost zero (and in fact not shown on the drawing as they are
too small to draw). For this point, the deviance is very small. In fact
fitted(demo.glm)
#           1           2           3           4           5           6
# 0.002850596 0.010397540 0.037179991 0.124285645 0.342804967 0.657195033
#           7           8           9          10
# 0.875714355 0.962820009 0.989602460 0.997149404
shows that the fitted value of the first point is 0.002850596. If we calculate the deviance for just
that point
-2 * (0.002850596*log(0.002850596)+(1-0.002850596)*log(1-0.002850596))
# [1] 0.03910334
it's only 0.03910334, a very small value. If we look at point 5, its fitted value is 0.342804967.
The deviance associated with point 5 is
-2 * (0.342804967*log(0.342804967)+(1-0.342804967)*log(1-0.342804967))
# [1] 1.285757
This is a much higher value. In a logistic regression, deviances of greater than one for a single
point are indicative of poor fits.
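The same calculation can be done for all ten points at once from the fitted values (just a vectorized restatement of the formula used above):
p = fitted(demo.glm)
-2 * (p*log(p) + (1-p)*log(1-p))
# point 1 comes out near 0.039 and point 5 near 1.286, matching the values computed above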
What's the most deviance a single point can contribute? Well, let's look at the distribution for
deviance for all probabilities between 0 and 1.
q = seq(0.0,1.0,0.05)
plot(q,-2*(q*log(q)+(1-q)*log(1-q)))  # note: q = 0 and q = 1 involve log(0) and evaluate to NaN, so those two points are simply not drawn
Notice how the curve is uni-modal, symmetric, and achieves its maximum value at 0.5. Notice too
how probabilities greater than 0.2 and less than 0.8 have deviance greater than 1. The maximum
deviance for a single point in a logistic regression is for probability = 0.5, where it reaches
1.386294.
A CASE STUDY
The Data Set
We will attempt to model the distribution of Berberis (Mahonia) repens, a common shrub in
mesic portions of Bryce Canyon. You can download this data and load it in to R all at once as
follows:
veg=read.table('http://ecology.msu.montana.edu/labdsv/R/labs/lab1/bryceveg.s')
attach(veg) # to allow us to reference the columns by name
head(veg)[,1:8]
#         junost ameuta arcpat arttri atrcan berfre ceamar cerled
# bcnp__1      0    0.0    1.0      0      0      0    0.5      0
# bcnp__2      0    0.5    0.5      0      0      0    0.0      0
# bcnp__3      0    0.0    1.0      0      0      0    0.5      0
# bcnp__4      0    0.5    1.0      0      0      0    0.5      0
# bcnp__5      0    0.0    4.0      0      0      0    0.5      0
# bcnp__6      0    0.5    1.0      0      0      0    1.0      0
This file, bryceveg.s, contains all the vegetation data from a vegetation study of Bryce Canyon
National Park, with the vegetation abundance recorded by cover class (see below). The rows are
sampling locations and the columns are species.
Bryce Canyon National Park is a 14000 ha park in southern Utah with vegetation representative
of a transition from the cordilleran flora of the central Rocky Mountains to the Colorado Plateau.
The Park vegetation ranges from small areas of badlands and sagebrush steppe at low elevations
through an extensive pinyon-juniper woodland, to ponderosa pine savanna at mid-elevations,
reaching closed mixed conifer forest at high elevations.
The dataset contains 160 0.1 acre circular sample plots, where the cover of all vascular plant
species (except trees) was estimated by eye according to the following scale:
code          range (%)   mid-point   presence/absence   nominal
+ (present)           0           0                  1       0.2
T                   0-1         0.5                  1       0.5
1                   1-5         3.0                  1       1.0
2                  5-25        15.0                  1       2.0
3                 25-50        37.5                  1       3.0
4                 50-75        62.5                  1       4.0
5                 75-95        85.0                  1       5.0
6                95-100        97.5                  1       6.0
The abundance of trees was estimated by basal area (cross-sectional area of stems at breast
height), but is not included in the data set because it is more representative of successional
status than environmental relations. The cover scale is commonly referred to as the "Pfister"
scale after R.D. Pfister, who first used it for vegetation analysis in the western U.S. (Pfister et al.
1977). It is similar to the Braun Blanquet or Domin scale commonly used in Europe. There are
169 vascular plant species in the data set.
Since we’ll be focusing on Berberis (Mahonia) repens, note that you can reference just this column
from the veg data. This gives the abundance at each of the 160 sites.
berrep
#   [1] 1.0 0.0 0.5 1.0 0.5 1.0 0.5 0.0 1.0 0.5 1.0 0.0 0.5 0.5 0.5
# [16] 1.0 0.5 0.5 0.5 0.0 0.5 0.5 0.0 0.5 0.5 0.5 0.0 0.5 0.5 0.5
# [31] 0.5 1.0 0.0 1.0 0.5 0.0 0.5 0.0 0.5 0.5 0.0 0.5 0.5 0.0 0.5
# [46] 0.5 0.5 1.0 0.5 0.5 0.5 0.5 0.5 0.0 0.5 0.5 0.5 0.0 1.0 1.0
# [61] 1.0 0.5 1.0 1.0 1.0 0.5 0.5 0.5 1.0 0.5 0.0 0.5 0.0 0.0 0.0
# [76] 0.5 0.5 1.0 1.0 0.0 0.0 0.5 0.5 0.5 3.0 1.0 1.0 0.0 0.0 0.0
# [91] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# [106] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# [121] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# [136] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# [151] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Next we need to obtain the data on the sampling sites. Again, we can read this in from the
website and look at the data
site=read.table('http://ecology.msu.montana.edu/labdsv/R/labs/lab2/brycesit.s',header=T)
attach(site)
head(site)
#   labels annrad asp   av   depth east elev grorad north
# 1  50001    241  30 1.00    deep  390 8720    162  4100
# 2  50002    222  50 0.96 shallow  390 8360    156  4100
# 3  50003    231  50 0.96 shallow  390 8560    159  4100
# 4  50004    254 360 0.93 shallow  390 8660    166  4100
# 5  50005    232 300 0.48 shallow  390 8480    159  4100
# 6  50006    216 330 0.76 shallow  390 8560    155  4100
#         pos quad slope
# 1     ridge   pc     9
# 2 mid_slope   pc     2
# 3 mid_slope   pc     2
# 4     ridge   pc     0
# 5  up_slope   pc     2
# 6 mid_slope   pc     2
Preliminary graphical analysis (shown below, Berberis locations in red) suggests that the
distribution of Berberis repens is related to elevation and possibly aspect value. We’ll use this
information to start constructing a model. The term elev[berrep>0] below means that we
only want the values of elevation where the abundance of barberry is greater than 0.
plot(elev,av)
points(elev[berrep>0],av[berrep>0],col='red',pch='+')
The Model
To model the presence/absence of Berberis as a function of elevation, we use:
bere1.glm = glm (berrep>0 ~ elev, family = binomial)
Remember that in R equations are given in a general form, and that we can use logical subscripts.
bere1.glm=glm evaluates to "store the result of the generalized linear model in an object called
'bere1.glm'." Any object name could be used, but "variable.glm" is concise and self-explanatory.
The number 1 is in anticipation of fitting more models later. berrep>0 evaluates to a logical,
FALSE (absent) or TRUE (present), and ~ elev evaluates to "as a function of elevation." The family = binomial
tells R to perform logistic regression, as opposed to other forms of GLM (like Poisson, which is
for count data).
R calls the estimated probabilities "fitted values", and we can use the fitted function to extract
the probabilities from our GLM object (bere1.glm). To see a graphical representation of the fitted
model, do
plot(elev,fitted(bere1.glm),xlab='Elevation',ylab='Probability of Occurrence')
Notice how the fitted curve is a smooth sigmoidal curve. The logistic regression is linear in the
logit, but when back transformed from the logit to probabilities, it's sigmoidal. This is a good
start, but we originally guessed that the presence/absence of Berberis was a function of both
elevation and aspect. In addition, following conventional ecological theory we might assume that
the probability exhibits a unimodal response to environment. To get a smooth unimodal
response is simple. Just as a linear logistic regression is sigmoidal when back transformed to
probability, a quadratic logistic regression is unimodal symmetric when back transformed to
probabilities. Accordingly, let's fit a model that is quadratic for elevation and aspect value. We
use:
bere2.glm=glm(berrep>0~elev+I(elev^2)+av+I(av^2),family=binomial)
~elev+I(elev^2)+av+I(av^2) evaluates to "as a function of elevation, elevation squared, aspect
value, and aspect value squared." The I(elev^2) tells R that "I really do mean elevation squared"
so that it doesn't attempt to interpret the ^2 as an R formula operator. Altogether,
bere2.glm=glm(berrep>0~elev+I(elev^2)+av+I(av^2),family=binomial) models the probability
of the presence of Berberis as a unimodal function of elevation and aspect value.
The details of the object are available with the summary function.
summary(bere2.glm)
Call:
glm(formula = berrep > 0 ~ elev + I(elev^2) + av + I(av^2), family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3801  -0.7023  -0.4327   0.5832   2.3540

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  7.831e+01  4.016e+01   1.950   0.0512 .
elev        -2.324e-02  1.039e-02  -2.236   0.0254 *
I(elev^2)    1.665e-06  6.695e-07   2.487   0.0129 *
av           4.385e+00  2.610e+00   1.680   0.0929 .
I(av^2)     -4.447e+00  2.471e+00  -1.799   0.0719 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 218.19  on 159  degrees of freedom
Residual deviance: 143.19  on 155  degrees of freedom
AIC: 153.19

Number of Fisher Scoring iterations: 5
The listing includes deviance residuals quartiles, variable coefficients, null deviance vs residual
deviance, and an AIC (Akaike information criterion) statistic. This last item doesn't concern us
yet, but will be handy later on.
The output includes a Z statistic (standardized normal deviate) and p value for each variable in
the model, and tests the hypothesis that the true value of each coefficient is 0. As you can
determine from the output, it appears that elev and I(elev^2) are significantly different from 0,
and the intercept is marginal.
While glm models do not have quite the same properties as ordinary least squares (e.g. deviance
instead of variance), you can still get good indications about the performance of the model.
Analogous to R^2, 1-(residual deviance/null deviance) is a good indicator of overall model fit.
E.g.
1-(143.1878/218.1935)
# [1] 0.3437577
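The same value can be pulled directly from the fitted object rather than retyping the deviances (a small convenience using standard components of a glm object):
1 - deviance(bere2.glm)/bere2.glm$null.deviance
# [1] 0.3437577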
In addition, each term in the model can be tested in several ways. First, the terms can be
tested, in the order they were entered, with the anova function and a Chi-squared test statistic.
anova(bere2.glm,test="Chi")
# Analysis of Deviance Table
#
# Model: binomial, link: logit
#
# Response: berrep > 0
#
# Terms added sequentially (first to last)
#
#           Df Deviance Resid. Df Resid. Dev P(>|Chi|)
# NULL                        159    218.193
# elev       1   66.247       158    151.947 3.979e-16
# I(elev^2)  1    5.320       157    146.627     0.021
# av         1    0.090       156    146.537     0.765
# I(av^2)    1    3.349       155    143.188     0.067
The test being performed assumes that the difference in deviance (See Deviance column)
attributed to a particular variable is distributed as Chi-squared with the number of degrees of
freedom attributed to that variable (See Df column). E.g., the elev variable achieves a reduction in
deviance of 66.247 at a cost of 1 degree of freedom; the associated probability of that much
reduction by chance (Type I error) is essentially 0. It is important to realize that this table uses
the variables in the order given in the equation, and that each variable is tested against the
residual deviance after earlier variables (including the intercept) have been taken into account.
The indication is that elevation is very important, and that the quadratic term is also statistically
significant. Neither av nor I(av^2) is significant, at least after accounting for elevation.
For nested models (where a simpler model is a formal subset of the more complex model) it is
also possible to compare glms with the anova function as follows:
bere3.glm=glm(berrep>0~elev+I(elev^2),family=binomial)
anova(bere3.glm,bere2.glm,test="Chi")
Analysis of Deviance Table

Model 1: berrep > 0 ~ elev + I(elev^2)
Model 2: berrep > 0 ~ elev + I(elev^2) + av + I(av^2)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       157    146.627
2       155    143.188  2    3.439     0.179
The table is telling us that the reduction in deviance attributable to adding av and I(av^2) is 3.439,
and that the probability of getting that with an additional 2 degrees of freedom is 0.179. This
suggests strongly that aspect value is not significant.
Alternatively, we can compare the AIC values for the two models with the AIC() function. Here,
we choose the model with the smaller AIC as superior. The output is much more terse, and we
have to remember which terms are in each model.
AIC(bere2.glm,bere3.glm)
#           df      AIC
# bere2.glm  5 153.1878
# bere3.glm  3 152.6268
The analysis suggests that addition of av and av^2 is not a significant improvement, as we
determined before. AIC balances model accuracy and complexity by giving good scores to
models with good fit (high likelihood) but penalizing models that use more parameters.
Smaller values of AIC are better, and a difference of more than about three units usually means
(it's a bit subjective) that there isn't much support for the model with the higher AIC. The small
difference in AIC values here indicates that the models aren't appreciably different. A rule of thumb is
that differences in AIC greater than 3-5 indicate support for the model with the lower value, and
differences greater than 10 indicate strong support. So in this case, Occam and his razor suggest the simpler model.
So, how good is the model? For the time being (and for the sake of demonstration), we're going
to stick with our full model even though we don't believe av matters. We'll delete it later and see
the results. Our pseudo-R^2 of 0.34 suggests that the model is not great, but plant species are
often difficult to model well, and we've just started.
We can calculate the fraction (or equivalently, percent) of plots correctly predicted with the table
function. Since R refers to the estimated probabilities as fitted values, we can use the fitted
function to extract the fitted values and do the following:
table(berrep>0,fitted(bere1.glm)>0.5)
        FALSE TRUE
  FALSE    83    9
  TRUE     15   53
This is slightly complicated, so let's look at the call. berrep>0 you remember evaluates to a logical
(TRUE or FALSE); fitted(bere1.glm)>0.5 does the same for the estimated probabilities. Here,
however, we want to classify all plots with a probability > 0.5 as "present" (or in this case TRUE),
and those less than 0.5 as 0 or FALSE. The first variable in the call is listed as the rows, and the
second variable as the columns. As you can see, we correctly predicted 136 out of 160 plots
(0.85). In addition, we made slightly fewer "errors of commission" (predicted Berberis to be
present when it's absent = 9) than "errors of omission" (predicted absent when present = 15 ).
These errors can be seen on the first residual plot as well. Overall, our model is fair, but is biased,
and under-predicts Berberis slightly.
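If you prefer not to count cells by hand, the proportion correct is just the diagonal of that table over the total (a small sketch; tab is a temporary name introduced here, and the model in the table() call should be whichever one you are evaluating):
tab = table(berrep>0, fitted(bere1.glm)>0.5)
sum(diag(tab))/sum(tab)   # (83 + 53) / 160 = 0.85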
That's all well and good, but what does our model really look like? Well, the current model is
multi-dimensional for elev and av, so it's problematic to look at directly. In this case we can look
at the model one dimension at a time. The first line below plots the probability of presence for all
points and the second line makes the points that correspond to observed presences red.
plot(elev,fitted(bere2.glm))
points(elev[berrep>0],fitted(bere2.glm)[berrep>0],col=2)
This is often called a response curve. Notice how the points make a fairly smooth curve as
elevation increases; this is the effect due to elevation. Notice also that there is a small vertical
distribution of points at any given elevation; this is the effect due to aspect value.
As we noted before, the av and I(av^2) variables are not contributing significantly to the fit. If we
look at the model without them, we get:
plot(elev,fitted(bere3.glm))
Notice how the scatter due to av is now gone, and we get a clear picture of the relationship of the
probability of Berberis to elevation. Notice too that the curve actually goes up slightly at lower
elevations; this is unexpected and probably ecologically improbable given what we know about
species response to environment. While it first appeared that the GLM was successful in fitting a
modal response model to the elevation data, what we actually observe is a truncated portion of
an inverted modal curve. If you go back and look at the coefficients fit by the model, you see:
# elev        -2.323e-02  1.030e-02  -2.256   0.0241 *
# I(elev^2)    1.664e-06  6.631e-07   2.510   0.0121 *
What's important here is that the linear term elev is negative, while the quadratic term I(elev^2)
is positive. This is indicative of an inverted fit; what we want is a positive linear term and
negative quadratic term to get a classical unimodal response curve. Some ecologists (e.g. ter
Braak and Looman [1995]) suggest that inverted curves should not be accepted, as they violate
our ecological understanding and are dangerous if extrapolated. On the other hand, if accurate
prediction is the primary objective, we might prefer an "improper" fit if it is more accurate
within the range of our data. We can test for this, of course, as follows:
anova(bere3.glm,glm(berrep>0~elev,family=binomial),test="Chi")
Analysis of Deviance Table

Response: berrep > 0

                 Resid. Df Resid. Dev Df Deviance P(>|Chi|)
elev + I(elev^2)       157    146.627
elev                   158    151.947 -1   -5.320     0.021
In this case I didn't even store the linear logistic equation, I simply embedded the call to glm
within the anova function. (Actually, we already calculated this model as bere1.glm, but I wanted
to demonstrate nested calls for future use). As you can see, the quadratic term achieves a small
but significant reduction in deviance. The AIC analysis agrees.
AIC(bere3.glm,glm(berrep>0~elev,family=binomial))
#                                            df      AIC
# bere3.glm                                   3 152.6268
# glm(berrep > 0 ~ elev, family = binomial)   2 155.9469
If we decide not to accept the inverted modal GLM, going back to the simpler model would only
be slightly less accurate.
In fact, judged as a simple classification test, the simple GLM has equal prediction accuracy (0.85).
table(berrep>0,fitted(bere1.glm)>0.5)
#         FALSE TRUE
#   FALSE    83    9
#   TRUE     15   53
This is an example where minimizing deviance and maximizing prediction accuracy are not the
same thing. Some statistical approaches let you choose the criterion to optimize, but GLM does
not, at least directly.
As another example of nested calls, we can plot the linear model without storing it as follows:
plot(elev,fitted(glm(berrep>0~elev,family=binomial)))
Models can be modified by adding or dropping specific terms, with an abbreviated formula as
follows:
bere4.glm=update(bere2.glm,.~.+depth,na.action=na.omit)
anova(bere4.glm,test="Chi")
Analysis of Deviance Table

Model: binomial, link: logit

Response: berrep > 0

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                        144    197.961
elev       1   69.702       143    128.258 6.896e-17
I(elev^2)  1    2.677       142    125.581     0.102
av         1    0.080       141    125.502     0.778
I(av^2)    1    7.592       140    117.910     0.006
depth      1   12.422       139    105.488 4.243e-04
The analysis added soil depth to the existing model. The notation update(bere2.glm,.~.+depth)
means update model bere2.glm, use the same formula, but add depth. The na.action=na.omit
means omit cases where soil depth is unknown. Unfortunately, several plots in the data set don't
have soil data.
As you might guess, soil depth and elevation are not independent, with deeper soils generally
occurring at lower elevations. This last model suggests that soil depth is significant, even after
accounting for elevation. We actually observe slightly higher deviance reduction due to elevation
than before, but this is due to dropping those plots where soil depth is unknown, not to a change
in the effect of elevation.
Variables can be dropped as well, as for example:
bere5.glm=update(bere2.glm,.~.-av-I(av^2))
summary(bere5.glm)
Call:
glm(formula = berrep > 0 ~ elev + I(elev^2), family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5304  -0.6914  -0.4661   0.5889   2.1420

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  7.310e+01  3.979e+01   1.837   0.0662 .
elev        -2.167e-02  1.027e-02  -2.110   0.0349 *
I(elev^2)    1.560e-06  6.613e-07   2.359   0.0183 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 218.19  on 159  degrees of freedom
Residual deviance: 146.63  on 157  degrees of freedom
AIC: 152.63

Number of Fisher Scoring iterations: 4
Note, however, that I had to use the I(av^2) instead of the simpler av^2 to get the correct result.
II. GAMs
Modeling Species/Environment Relations with Generalized Additive Models (GAMs)
Introduction
In the last set of exercises, we developed sets of models of the distribution of Berberis repens on
environmental gradients in Bryce Canyon National Park. The models were developed as
"Generalized Linear Models" (or GLMs), and included logistic regression. As you will recall, GLMs
finesse the problems of bounded dependent variables and heterogeneous variances by
transforming the dependent variable and employing a specific variance function for that
transformation. While GLMs were shown to work reasonably well, and are in fact the method of
choice for many ecologists, they are limited to linear functions. When you examine the predicted
values from GLMs, they are sigmoidal or modal curves, leading to the impression that they are
not really linear. This is an artifact of the transformations employed, however, and the models
are linear in the logit (log of the odds) or log scale. This is simultaneously a benefit (the models
are parametric, with a wealth of theory applicable to their analysis and interpretation), and a
hindrance (they require a priori estimation of the curve shape, and have difficulty in fitting data
that doesn't follow a simple parametric curve shape).
Generalized Additive Models (GAMs) are designed to capitalize on the strengths of GLMs (ability
to fit logistic and poisson regressions) without requiring the problematic steps of a priori
estimation of response curve shape or a specific parametric response function. They employ a
class of equations called "smoothers" or "scatterplot smoothers" that attempt to generalize data
into smooth curves by local fitting to subsections of the data. The design and development of
smoothers is a very active area of research in statistics, and a broad array of such functions has
been developed. The simplest example that is likely to be familiar to most ecologists is the
running average, where you calculate the average value of data in a "window" along some
gradient. While the running average is an example of a smoother, it's rather primitive and much
more efficient smoothers have been developed.
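To make the running-average idea concrete, here is a minimal sketch of such a smoother (purely illustrative; it is not the smoother that gam() uses, and the window width k = 5 is an arbitrary choice):
run.avg = function(x, y, k = 5) {
  ord = order(x)                                                   # sort the data along the gradient
  sm = stats::filter(as.numeric(y[ord]), rep(1/k, k), sides = 2)   # centered moving average of width k
  list(x = x[ord], y = as.numeric(sm))
}
# e.g. plot(elev, berrep > 0); lines(run.avg(elev, berrep > 0), col = 'blue')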
The idea behind GAMs is to "plot" (conceptually, not literally) the value of the dependent variable
along a single independent variable, and then to calculate a smooth curve that goes through the
data as well as possible, while being parsimonious. The trick is in the parsimony. It would be
possible using a polynomial of high enough order to get a curve that went through every point. It
is likely, however, that the curve would "wiggle" excessively, and not represent a parsimonious
fit. The approach generally employed with GAMs is to divide the data into some number of
sections, using "knots" at the ends of the sections. Then a low order polynomial or spline
function is fit to the data in the section, with the added constraint that the second derivative of
the function at the knots must be the same for both sections sharing that knot. This eliminates
kinks in the curve, and ensures that it is smooth and continuous at all points.
The problem with GAMs is that they are simultaneously very simple and extraordinarily
complex. The idea is simple; let the data speak, and draw a simple smooth curve through the
data. Most ecologists are quite capable of doing this by eye. The problem is determining
goodness-of-fit and error terms for a curve fit by eye. GAMs make this unnecessary, and fit the
curve algorithmically in a way that allows error terms to be estimated precisely. On the other
hand, the algorithm that fits the curve is usually iterative and non-parametric, masking a great
deal of complex numerical processing. It's much too complex to address here, but there are at
least two significant approaches to solving the GAM parsimony problem, and it is an area of
active research.
As a practical matter, we can view GAMs as non-parametric curve fitters that attempt to achieve
an optimal compromise between goodness-of-fit and parsimony of the final curve. Similar to
GLMs, on species data they operate on deviance, rather than variance, and attempt to achieve the
minimal residual deviance on the fewest degrees of freedom. One of the interesting aspects of
GAMs is that they can only approximate the appropriate number of degrees of freedom, and that
the number of degrees of freedom is often not an integer, but rather a real number with some
fractional component. This seems very odd at first, but is actually a fairly straightforward
extension of the concepts you are already familiar with. A second order polynomial (or quadratic
equation) in a GLM uses two degrees of freedom (plus one for the intercept). A curve that is
slightly less regular than a quadratic might require two and a half degrees of freedom (plus one
for the intercept), but might fit the data better.
The other aspect of GAMs that is different is that they don't handle interaction well (i.e. a product
of two predictor variables). Rather than fit multiple variables simultaneously, the algorithm fits a
smooth curve to each variable and then combines the results additively, thus giving rise to the
name "Generalized Additive Models." The point is minimized here, somewhat, in that we never fit
interaction terms in our GLMs in the previous exercise. For example, it is possible that slope only
matters at some range of elevations, giving rise to an interaction of slope and elevation. In
practice, interaction terms can be significant, but often require fairly large amounts of data. The
Bryce data set is relatively small, and tests of interaction were generally not significant.
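For reference, an interaction like that would be specified in the GLM framework with the * (or :) formula operator, e.g. (a sketch only; this model was not fit in the previous exercise, and the object name is introduced here):
bereint.glm = glm(berrep>0 ~ elev*slope, family=binomial)   # elev*slope expands to elev + slope + elev:slope
summary(bereint.glm)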
GAMs in R
There are at least two implementations of GAMs in R. The one we will employ is in the package gam,
but it's worth noting that an alternate version is in mgcv. GAMs are invoked with
result = gam(y~s(x))
where y and x represent the dependent variable and an independent variable respectively. The
notation s(x) means to fit the dependent variable as a "smooth" of x.
We’ll be using the two data sets named veg and site from the GLM exercises so make sure you’ve
got those loaded. Let’s start by loading a package for GAMs
install.packages('gam',dependencies=T)
library(gam)
To simplify matters we will parallel our attempts from last lab to model the distribution of
Berberis repens. To model the presence/absence of Berberis as a smooth function of elevation
bere1.gam = gam(berrep>0~s(elev),family=binomial)
Just as for the GLM, to get a logistic fit as appropriate for presence/absence values we specify
family=binomial. To see the result, use the summary() function.
summary(bere1.gam)
# Call: gam(formula = berrep > 0 ~ s(elev), family = binomial)
# Deviance Residuals:
#     Min      1Q  Median      3Q     Max
# -2.2252 -0.6604 -0.4280  0.5318  1.9649
#
# (Dispersion Parameter for binomial family taken to be 1)
#
# Null Deviance: 218.1935 on 159 degrees of freedom
# Residual Deviance: 133.4338 on 155 degrees of freedom
# AIC: 143.4339
#
# Number of Local Scoring Iterations: 8
#
# DF for Terms and Chi-squares for Nonparametric Effects
#
#             Df Npar Df Npar Chisq  P(Chi)
# (Intercept)  1
# s(elev)      1       3    18.2882  0.0004
As in the last lab, the important elements of the summary are the reduction in deviance for the
degrees of freedom used. Here, we see that the null deviance of 218.1935 was reduced to 133.4338
for 4 degrees of freedom (one parametric degree of freedom for the smooth, plus 3 nonparametric).
As shown in the lower section, the probability associated with the nonparametric part of the fit is about 0.0004.
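That 0.0004 is the tail probability of the nonparametric chi-square (18.2882 on 3 degrees of freedom); if you want to verify it by hand:
1 - pchisq(18.2882, df = 3)
# approximately 0.0004, matching P(Chi) in the summary table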
To really see what the results look like, use the plot() function.
plot(bere1.gam,se=TRUE)
The default plot shows several things. The solid line is the predicted value of the dependent
variable as a function of the x axis. The se=TRUE means to plot two times the standard errors of
the estimates (in dashed lines). The small lines along the x axis are the "rug", showing the
location of the sample plots. The y axis is in the linear units, which in this case is logits, so that
the values are centered on 0 (50/50 odds), and extend to both positive and negative values. To
see the predicted values on the probability scale we need to use the back transformed values,
which are available as fitted(bere1.gam). So,
plot(elev,fitted(bere1.gam))
Notice how the curve has multiple inflection points and modes, even though we did not specify a
high order function. This is the beauty and bane of GAMs. In order to fit the data, the function fit a
bimodal curve. It seems unlikely ecologically that a species would have a multi-modal response
to a single variable. Rather, it would appear to suggest competitive displacement by some other
species from 7300 to 8000 feet, or the effects of another variable that interacts strongly over that
elevation range.
To compare to the GLM fit, we can superimpose the GLM predictions on the previous plot.
plot(elev,fitted(bere1.gam))
bere1.glm = glm (berrep>0 ~ elev, family = binomial)
points(elev,fitted(bere1.glm),col='red')
bere3.glm=glm(berrep>0~elev+I(elev^2),family=binomial)
points(elev,fitted(bere3.glm),col='green')
That's the predicted values from the first order logistic GLM in red, and the quadratic GLM in
green. Notice how even though the GAM curve is quite flexible, it avoids the problematic upturn
at low values shown by the quadratic GLM.
The GAM function achieves a residual deviance of 133.4 on 155 degrees of freedom (this was
shown in the summary above). The following table gives the comparison to the GLMs.
model            residual deviance   residual degrees of freedom
GAM                         133.43                           155
linear GLM                  151.95                           158
quadratic GLM               146.63                           157
How do we compare the fits? Well, since they're not nested, we can't properly use the anova
method from last lab. However, since they're fit to the same data, we can at least heuristically
compare the goodness-of-fits and residual degrees of freedom. As I mentioned previously, GAMs
often employ fractional degrees of freedom.
anova(bere3.glm,bere1.gam,test='Chi')
# NOTE: This is not a legitimate test because the models are not nested;
# we're just getting the residual deviances from this table
# Analysis of Deviance Table
#
# Model 1: berrep > 0 ~ elev + I(elev^2)
# Model 2: berrep > 0 ~ s(elev)
#   Resid. Df Resid. Dev     Df Deviance P(>|Chi|)
# 1  157.0000    146.627
# 2  155.0000    133.434 2.0000   13.193     0.001
AIC(bere3.glm,bere1.gam)
#           df      AIC
# bere3.glm  3 152.6268
# bere1.gam  2 143.4339
The GAM clearly achieves lower residual deviance as shown in the anova table, and at a cost of
only two degrees of freedom compared to the quadratic GLM. The probability of achieving a
reduction in deviance of 13.2 for 2 degrees of freedom in a NESTED model is about 0.001.
While this probability is not strictly correct for a non-nested test, I think it's safe to say that the
GAM is a better fit. In addition, the AIC results are in agreement. One approach to a simple
evaluation is to compare the residuals in a boxplot.
boxplot(bere1.glm$residuals,bere3.glm$residuals,bere1.gam$residuals,
names=c("GLM 1","GLM 2","GAM"))
The plot shows that the primary difference among the models is in the number of extreme
residuals; the mid-quartiles of all models are very similar.
Multi-Variable GAMs
I mentioned above that GAMs combine multiple variables by fitting separate smooths to each
variable and then combining them linearly. Let's look at their ability with Berberis repens.
bere2.gam = gam(berrep>0~s(elev)+s(slope),family=binomial)
Plotting this model results in one plot for each independent variable.
plot(bere2.gam,se=TRUE)
The interpretation of the slope plot is that Berberis is most likely on midslopes (10-50%), but
distinctly less likely on flats or steep slopes. Could a variable with that much error contribute
significantly to the fit? We can use the anova() function for GAMs to look directly at the
influence of slope on the fit.
anova(bere2.gam,bere1.gam,test="Chi")
# Analysis of Deviance Table
#
# Model 1: berrep > 0 ~ s(elev) + s(slope)
# Model 2: berrep > 0 ~ s(elev)
#   Resid. Df Resid. Dev Df Deviance P(>|Chi|)
# 1       151    113.479
# 2       155    133.434 -4  -19.954     0.001
We get a reduction in deviance of 19.95 for 4.00 degrees of freedom, with a probability of <
0.001. We can again see the result by plotting the fitted values against the independent variables.
plot(elev,fitted(bere2.gam))
Where before we had a smoothly varying line we now have a scatterplot with some significant
departures from the line, especially at low elevations, due to slope effects. As for slope itself
plot(slope,fitted(bere2.gam))
The plot suggests a broad range of values for any specific slope, but there is a modal trend to the
data.
Can a third variable help still more? Let's try aspect value (av).
bere3.gam = gam(berrep>0~s(elev)+s(slope)+s(av),family=binomial)
anova(bere3.gam,bere2.gam,test="Chi")
# Analysis of Deviance Table
#
# Model 1: berrep > 0 ~ s(elev) + s(slope) + s(av)
# Model 2: berrep > 0 ~ s(elev) + s(slope)
#   Resid. Df Resid. Dev      Df Deviance P(>|Chi|)
# 1  147.0001    100.756
# 2  151.0000    113.479 -3.9999  -12.723     0.013

AIC(bere3.gam,bere2.gam)

          df      AIC
bere3.gam  4 126.7560
bere2.gam  3 131.4795
Somewhat surprisingly (since it hasn't shown much power before), aspect value appears to be
significant. Let's look (I'll omit the plots you've already seen).
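(The plot referred to next is not reproduced in the text; by analogy with the earlier GAM plots, it was presumably produced with something like plot(bere3.gam,se=TRUE).)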
Surprisingly, since aspect value is designed to linearize aspect, the response curve is modal.
Berberis appears to prefer warm aspects, but not south slopes. While we might expect some
interaction here with elevation (preferring different aspects at different elevations to
compensate for temperature), GAMs are not directly useful for interaction terms. Although it's
statistically significant, its effect size is small, as demonstrated in the following plot.
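(That plot is likewise not reproduced here; by analogy with the earlier calls it was presumably something like plot(av,fitted(bere3.gam)).)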
Notice how almost all values are possible at most aspect values. Nonetheless, it is statistically
significant.
GAMs versus GLMs
GAMs are extremely flexible models for fitting smooth curves to data. On ecological data they
often achieve results superior to GLMs, at least in terms of goodness-of-fit. On the other hand,
because they lack a parametric equation for their results, they are somewhat hard to evaluate
except graphically; you can't provide an equation of the result in print. Because of this, some
ecologists find GLMs more parsimonious, and prefer the results of GLMs to GAMs even at the cost
of a lower goodness-of-fit, due to increased understanding of the results. Many ecologists fit
GAMs as a means of determining the correct curve shape for GLMs, deciding whether to fit low or
high order polynomials as suggested by the GAM plot.
Given the ease of creating and presenting graphical output, the increased goodness-of-fit of
GAMs and the insights available from analysis of their curvature over sub-regions of the data
make GAMs a potentially superior tool in the analysis of ecological data. Certainly, you can argue
either way.
Summary
GAMs are extremely flexible models for fitting smooth curves to data. They reflect a nonparametric perspective that says "let the data determine the shape" of the response curve. They
are sufficiently computationally intense that only a few years ago they were perhaps infeasible
on some data sets, but are now easy and quick to apply to most data sets.
Like GLMs, they are suitable for fitting logistic and poisson regressions, which is of course very
useful in ecological analysis.