bridges - Environmental Statistics Group

Melissa Bridges
BIOL 504 Fall 2010
Species distribution modeling within a hierarchical Bayesian framework: evaluating the role of
land use
Introduction
Explaining and modeling species distributions are of particular interest in ecology. These types
of models are frequently used to assess habitat availability and fragmentation (Guisan and
Zimmermann, 2000), predict resource use (Keating and Cherry, 2004), and help identify
environments vulnerable to invasion by non-native species (Rew et al., 2005; Shafii et al., 2003).
Frequently used methods for generating species distribution models include relating binary
presence/absence data of a species of interest to a set of environmental variables through
regression techniques such as logistic regression or generalized additive models (Guisan et al., 2002).
Conventional regression methods used in species distribution modeling typically do not
adequately characterize the uncertainty in the predictions, account for spatial dependency in
species distributions, or address the role of human-related disturbances (Latimer et al., 2006).
Human-related disturbances such as land use change have been implicated in facilitating the
invasion of non-native plant species (Hobbs and Humphries, 1995; Corbin and D’Antonio,
2004); however, few studies exist that specifically evaluate the role of land use in statistical
models of non-native plant distributions (but see Kuhman et al., 2010 for a recent example).
Latimer et al. (2006) described techniques for addressing several of these problems, including
how human transformation of the landscape can be incorporated within a Bayesian model
framework. Although Latimer et al. (2006) provided instructions for spatially modeling plant
distributions across a landscape using logistic regression within a hierarchical Bayesian
framework, the authors did not make clear how they explicitly addressed human-related land use
change.
Low-density residential development in close proximity to national parks and other
natural amenities has been termed exurban development and has been hypothesized to influence
ecosystem processes and services such as biodiversity, nutrient cycling, and vegetation patterns
(Hansen et al., 2005). In particular, it is hypothesized that exurban development can increase
the occupancy and abundance of non-native or weedy plant species. The objective of this
project was to evaluate the relationship between land use change, specifically the transformation
of grasslands to low-density residential developments in a portion of the Greater Yellowstone
Ecosystem, and the probability of occurrence of one non-native plant species, Centaurea maculosa.
More specifically, this project was to serve as one of many exploratory data analyses meant to
contribute to my understanding of how exurban development could be related to non-native plant
distributions.
Methods
Study Area and Data Collection
The locations of the presence and absence of Centaurea maculosa (CEMA), a suspected invader
of rangeland systems and a Montana state listed noxious weed, were recorded within areas of
Paradise Valley, MT where land use and cover were classified (Figure 1). Presence/absence data
were collected at the 10m scale (i.e., 10m x 10m quadrat size) along transects that started at
random points on roads and extended perpendicular to the road for between 300m and 1000m.
The presence/absence data were scaled up from the 10m resolution to a 30m resolution to match
the environmental variables used in the analysis. After scaling up, the number
of occurrences within any 30m grid cell could be treated as binomially distributed, with the
10m quadrats surveyed in that cell as the trials. Each 30m grid cell
containing observations became an individual case in the following analyses. This methodology
of scaling the observations up to the resolution of the environmental predictor variables was used
in Latimer et al. (2006).
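The scaling-up step can be illustrated with a minimal Python sketch. It is not the code used in this project (which was written in R); the function name, the coordinate layout, and the toy records are all hypothetical. Each 10m quadrat is assigned to the 30m cell that contains it, and each cell then carries a number of trials and a number of occurrences.

```python
from collections import defaultdict


def aggregate_quadrats(quadrats, cell_size=30.0):
    """Group quadrat records into square grid cells of `cell_size` metres.

    `quadrats` is an iterable of (x, y, present) tuples, where x and y are
    hypothetical projected coordinates in metres and present is 0 or 1.
    Returns {cell_index: (n_trials, n_occurrences)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for x, y, present in quadrats:
        # Integer index of the grid cell that contains this quadrat.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell][0] += 1              # one more trial in this cell
        counts[cell][1] += int(present)   # one more occurrence if present
    return {cell: (n, occ) for cell, (n, occ) in counts.items()}


# Toy records (illustrative only): five quadrats along one transect.
quadrats = [(5, 5, 1), (15, 5, 0), (25, 5, 1), (35, 5, 0), (45, 5, 0)]
cells = aggregate_quadrats(quadrats)
```

In this toy input the first three quadrats fall in one 30m cell and the last two in the next, so each cell becomes one case with its own number of trials.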
Model Building
Typically, a logistic regression model would be used to relate several environmental variables to
the probability of occurrence; however, for the sake of simplicity in this project, I
chose to model the probability of occurrence as a function of only one environmental variable,
elevation. For my purposes, I assumed that elevation was an adequate surrogate for a variety of
environmental conditions. I generated two beta/binomial exchangeable hierarchical Bayesian
models, one for the set of cases within undeveloped grassland and one for the set of cases within
developed residential areas (formerly undeveloped grassland).
I let Y = number of CEMA occurrences. I assumed that all case-specific y_i are independent. I
assumed that y_i ~ binomial(n_i, p_i), where n_i = case-specific number of trials and p_i =
case-specific probability of occurrence. Further, I assumed that each p_i was a function of
elevation modeled as a simple logistic regression with intercept β0 and slope β1. Therefore, the
likelihood function used was the equation for the binomial distribution where p_i was equal to a
logistic regression function (Eq. 1).
Eq. 1
\[
\log P(y \mid n, p) = \sum_{i=1}^{k} \log\!\left[ \frac{n_i!}{y_i!\,(n_i - y_i)!}\; p_i^{\,y_i}\,(1 - p_i)^{\,n_i - y_i} \right],
\qquad p_i = \frac{1}{1 + \exp\!\left[-(\beta_0 + \beta_1 x_i)\right]}
\]
where x_i is the elevation of case i.
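As a hedged illustration of Eq. 1, the following Python sketch evaluates the binomial log-likelihood with a logistic link on elevation. It is not the R code used for the analyses; the function and argument names are assumptions made for the example, and the logarithms of p_i and 1 - p_i are computed in a numerically stable form rather than by taking the log of the probability directly.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_binom_coef(n, y):
    """log of n! / (y! (n - y)!) via the log-gamma function."""
    return math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)


def log_likelihood(beta0, beta1, elev, n, y):
    """Binomial log-likelihood summed over cases, with
    logit(p_i) = beta0 + beta1 * elev_i.
    """
    total = 0.0
    for x_i, n_i, y_i in zip(elev, n, y):
        eta = beta0 + beta1 * x_i
        log_p = -softplus(-eta)      # log(p_i)
        log_q = -softplus(eta)       # log(1 - p_i)
        total += log_binom_coef(n_i, y_i) + y_i * log_p + (n_i - y_i) * log_q
    return total


# One illustrative case: 3 trials, 2 occurrences, p = 0.5 when both
# coefficients are zero, so the likelihood is 3 * 0.5**3 = 0.375.
ll = log_likelihood(0.0, 0.0, elev=[1500.0], n=[3], y=[2])
```

In practice elevation would usually be centred or standardized before fitting, so that the intercept and slope are on comparable scales; that choice is an assumption of this note and is not stated in the analysis above.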
To construct a beta/binomial exchangeable hierarchical model, I assumed that p1,...,pk, where k =
number of cases within a particular land use type (i.e., grassland or residential), were a random
sample from a beta distribution with shape parameters, a and b. The beta (a,b) distribution was
the prior on pi. The parameters, a and b, were hyperparameters that were assigned uniform
hyperpriors. Vague uniform priors were placed on the global beta (coefficient) parameters for
the logistic regression.
All analyses were coded and administered in R (www.r-project.org) using methodologies
illustrated in Goodman (2009), Albert (2009), Geyer (2009), and Crawley (2007). The joint
posterior distribution for each Bayesian logistic regression model was sampled using a
Metropolis random walk algorithm within a Markov Chain Monte Carlo (MCMC) simulation
(mcmc package in R, Geyer, 2009).
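The general idea of a Metropolis random walk can be sketched in a few lines of Python. This is a generic illustration and not the implementation in the R mcmc package; the function name, the proposal scale, and the toy target (a standard bivariate normal standing in for a log posterior of two coefficients) are assumptions made for the example.

```python
import math
import random


def metropolis(log_post, start, n_iter, scale, seed=0):
    """Random-walk Metropolis sampler.

    log_post : function returning the unnormalized log posterior of a
               parameter vector (a list of floats)
    start    : starting parameter vector
    scale    : standard deviation of the normal proposal steps
    Returns (chain, acceptance_rate).
    """
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_post(current)
    chain, accepted = [], 0
    for _ in range(n_iter):
        # Propose a symmetric random step around the current state.
        proposal = [c + rng.gauss(0.0, scale) for c in current]
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio); 1 - random()
        # lies in (0, 1], which keeps the logarithm finite.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        chain.append(list(current))
    return chain, accepted / n_iter


def toy_log_post(theta):
    """Toy target: standard bivariate normal, up to a constant."""
    return -0.5 * sum(t * t for t in theta)


chain, acc_rate = metropolis(toy_log_post, start=[0.0, 0.0],
                             n_iter=5000, scale=1.0)
```

The proposal scale is the main tuning parameter: steps that are too small or too large both produce slowly mixing chains.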
The ultimate goal of these models was to compare the predicted logistic regression curves
resulting from cases within undeveloped grasslands and cases within developed residential areas.
Therefore, functions in R were coded to output only the MCMC diagnostics and marginal
posterior distributions of the parameters of interest for inference (i.e., the coefficients for the two
simple logistic regression models).
Results and Discussion
MCMC Output
It is important to plot the output of the MCMC chain as a time series plot. This allows for the
visual evaluation of convergence for each of the parameters in the joint posterior distribution
being sampled. Figures 2 and 4 show the MCMC diagnostic output for each of the models.
Ideally, one can feel reasonably certain that convergence has been reached when the time series
plots appear similar to the plots for the β1 parameters of both models (Figures 2 and 4). Despite
much experimentation with tuning the MCMC sampler, the best time series plots for the β0
parameters of both models were less than satisfactory. For the sake of completing the analyses that
were part of my objectives for this project, I assumed that adequate samples were drawn from the
posterior distribution for each parameter of interest.
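A simple numerical complement to the time series plots is the autocorrelation of a chain at a given lag. It is not a diagnostic reported in this project, and the function below is only a sketch with hypothetical names. Values near 1 at short lags indicate a slowly mixing chain, which is the pattern a wandering time series plot shows visually.

```python
def autocorr(chain, lag):
    """Sample autocorrelation of a one-parameter chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain)
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


# A steadily drifting sequence is almost perfectly autocorrelated,
# while a sequence that flips sign every step is negatively autocorrelated.
drifting = autocorr(list(range(100)), 1)
flipping = autocorr([1.0, -1.0, 1.0, -1.0], 1)
```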
Figures 3 and 5 show the marginal posterior distributions for each coefficient parameter for
each model. For the purposes of graphically plotting the relationship between the probability of
occurrence and elevation for each land use, I assumed that the marginal posterior distributions
were normally distributed. This assumption might be adequate for the marginal distributions of the
β1 parameters of both models, but it may be inadequate for the distributions of the
β0 parameters of both models (Figures 3 and 5). The marginal distributions of both parameters
were influenced by the MCMC sampling algorithm. The failure of adequate convergence leads
me to be suspicious of the resulting marginal posterior distributions. However, for the purposes
of this project, I assumed that the marginal distributions could be interpreted and used to
compare the relationships between the probability of CEMA occurrence and elevation for the
two land uses of interest.
Comparison of Probabilities of CEMA Occurrence between Land Uses
It was clear from the resulting marginal distributions of the coefficient parameters that there were
differences in both the intercept and slope parameters between the two land use models. For
instance, the grassland model yielded much higher mean intercept and slope values as compared
to the residential model. In the context of a logistic regression, these differences could lead to
dramatic differences in the placement of the resulting curves. This is very clearly illustrated in
Figure 6 where both models are plotted. Figure 6 shows the mean response curve as well as the
upper and lower bounds of the 95% credible intervals around those mean responses.
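One way to build a mean response curve with a 95% credible band is sketched below in Python. Note that this sketch works directly from posterior draws of (β0, β1), which is a different route from the normal approximation of the marginal distributions assumed in this project; the function names and the toy draws are hypothetical.

```python
import math


def inv_logit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def quantile(sorted_vals, q):
    """Quantile of pre-sorted values using linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def response_band(draws, elevations, level=0.95):
    """For each elevation, return (elevation, mean, lower, upper) of the
    predicted probability across posterior draws of (beta0, beta1).
    """
    tail = (1.0 - level) / 2.0
    band = []
    for x in elevations:
        probs = sorted(inv_logit(b0 + b1 * x) for b0, b1 in draws)
        mean = sum(probs) / len(probs)
        band.append((x, mean, quantile(probs, tail),
                     quantile(probs, 1.0 - tail)))
    return band


# Toy draws (illustrative only) on a standardized elevation scale.
draws = [(0.0, -1.0), (0.5, -1.0), (1.0, -1.0)]
band = response_band(draws, elevations=[0.0, 2.0])
```

Working from the draws avoids having to assume a distributional shape for each coefficient and keeps any correlation between the intercept and slope intact.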
As expected, the uncertainty around the mean response of the grassland model is less than that of
the residential model. This was expected because the number of cases used to draw inference for
the grassland model (k = 1292) parameters was far greater than the number of cases used in the
residential model (k = 70). Furthermore, uncertainty for both models decreased as the elevation
increased. I believe this was due to the majority of cases in both datasets having zero
occurrences at higher elevation values.
The relationships between the probability of CEMA occurrence and elevation were different
between the two land uses. The probability of CEMA occurrence within undeveloped grassland
was much higher at lower elevations compared to the probability of occurrence within developed
residential areas. As elevation increased, the probability of CEMA occurrence within the
grassland areas sharply decreased while the probability of occurrence within the residential areas
gradually decreased. The grassland model suggests that there was a threshold elevation above which
the probability of CEMA occurrence goes to zero. This, along with the shape of the residential
curve, suggests that human-related disturbances associated with residential development have
allowed for substantially increased probabilities of CEMA occurrence at higher elevations.
Interestingly, undeveloped grassland areas at low elevations had substantially higher
probabilities of occurrence than locations of the same elevation in residential areas. This trend
could possibly be explained by active management of CEMA populations by homeowners.
Conclusions and Future Directions
Despite less than satisfactory convergence within the MCMC simulations, the resulting
relationships between the probability of CEMA occurrence and elevation differed markedly between
the two land uses. The results of this exploratory data analysis using a hierarchical Bayesian
approach suggested that residential development could facilitate the occurrence of CEMA at higher
elevations than previously documented.
The inspiration for this project was from Latimer et al. (2006). The authors illustrated methods
for modeling species distributions within a hierarchical Bayesian framework, and suggested that
models incorporating spatial dependency of species distributions as well as a hierarchical
structure can result in better probability of occurrence predictions. It would be necessary to
integrate a spatial random effect into the model as suggested by Latimer et al. (2006) to make
inference on those cases (i.e., 30m grid cells) where observations do not exist. Furthermore, it
would be necessary to integrate this level of complexity for the practical application of
producing a map of predicted probabilities of occurrence. Another important aspect to be
considered is the addition of other environmental variables that could further explain variability
in predicted probabilities. Moreover, experimenting with different samplers within the MCMC
simulation may be necessary to achieve satisfactory convergence diagnostics. For instance, a Gibbs
sampler might be necessary for the integration of the spatial random effect.
References
Albert, J. 2009. Bayesian Computation with R Second Edition. Springer Science+Business
Media, LLC. New York, New York.
Corbin, J.D. and C.M. D’Antonio. 2004. Competition between native perennial and exotic
annual grasses: implications for an historical invasion. Ecology. 85:1273-1283.
Crawley, M.J. 2007. The R Book. John Wiley and Sons, Ltd. West Sussex, England.
Geyer, C.J. 2009. MCMC Package Example (Version 0.7-3). Online resource. www.r-project.org
Goodman, D. 2009. Empirical Bayes, Bayes empirical Bayes, and hierarchical analysis. Online
resource (www.esg.montana.edu).
Guisan, A., T.C.J. Edwards, and T. Hastie. 2002. Generalized linear and generalized additive
models in studies of species distributions: setting the scene. Ecological Modelling.
157:89-100.
Guisan, A. and N.E. Zimmermann. 2000. Predictive habitat distribution models in ecology.
Ecological Modelling. 135:147-186.
Hansen, A.J., R.L. Knight, J.M. Marzluff, S. Powell, K. Brown, P.H. Gude, and A. Jones. 2005.
Effects of exurban development on biodiversity: patterns, mechanisms, and research
needs. Ecological Applications. 15:1893-1905.
Hobbs, R.J. and S.E. Humphries. 1995. An integrated approach to the ecology and management
of plant invasions. Conservation Biology. 9:761-770.
Keating, K.A., and S. Cherry. 2004. Use and interpretation of logistic regression in habitat
selection studies. Journal of Wildlife Management. 68:774-789.
Latimer, A.M., S. Wu, A.E. Gelfand, and J.A. Silander Jr. 2006. Building statistical models to
analyze species distributions. Ecological Applications. 16:33-50.
Rew, L.J., B.D. Maxwell, and R. Aspinall. 2005. Predicting the occurrence of non-indigenous
species using environmental and remotely sensed data. Weed Science. 53:236-241.
Shafii, B., W.J. Price, T.S. Prather, L.W. Lass, and D.C. Thill. 2003. Predicting the likelihood of
yellow starthistle (Centaurea solstitialis) occurrence using landscape characteristics.
Weed Science. 51:748-751.
Figures
Figure 1. Location of each case for both sets of cases within undeveloped grassland (green) and
developed residential (brown)
Figure 2. Time series plots of the sampled marginal posterior distributions for both the
coefficients of the grassland model
Figure 3. Marginal posterior distributions of β0 and β1 corresponding to the binomial
probabilities of Centaurea maculosa occurrence within undeveloped grassland areas of Paradise
Valley, MT
Figure 4. Time series plots of the sampled marginal posterior distributions for both the
coefficients of the residential model
Figure 5. Marginal posterior distributions of β0 and β1 corresponding to the binomial
probabilities of Centaurea maculosa occurrence within residentially developed areas of Paradise
Valley, MT
Figure 6. The mean response and 95% credible intervals for each land use model