Melissa Bridges BIOL 504 Fall 2010 Species distribution modeling within a hierarchical Bayesian framework: evaluating the role of land use Introduction Explaining and modeling species distributions are of particular interest in ecology. These types of models are frequently used to assess habitat availability and fragmentation (Guisan and Zimmerman, 2000), predict resource use (Keating and Cherry, 2004), and to help predict environments vulnerable to invasion by non-native species (Rew et al., 2005; Shafii et al., 2003). Frequently used methods for generating species distribution models include relating binary presence/absence data of a species of interest to a set of environmental variables through regression techniques such as logistic regression or general additive models (Guisan et al., 2002). Conventional regression methods used in species distribution modeling typically do not adequately characterize the uncertainty in the predictions, account for spatial dependency in species distributions, or address the role of human related disturbances (Latimer et al., 2006). Human related disturbances such as land use change have been implicated in facilitating the invasion of non-native plant species (Hobbs and Humphries, 1995; Corbin and D’Antonio, 2004); however, few studies exist that specifically evaluate the role of land use in statistical models of non-native plant distributions (but see Kuhman et al., 2010 for a recent example). Latimer et al. (2006) provided some techniques for how several problems with typical approaches to species distribution modeling could be addressed including how human transformation of the landscape could be used within a Bayesian model framework. Although Latimer et al. (2006) provided instructions for how to spatially model plant distributions across a landscape using logistic regression within a hierarchical Bayesian framework, the authors were unclear in how they explicitly addressed human related land use change. Low density residential developments in areas of close proximity to national parks and other natural amenities has been termed exurban development and has been hypothesized to influence ecosystem processes and services such as biodiversity, nutrient cycling, and vegetation patterns (Hansen et al., 2005). In particular, it is hypothesized that exurban development can cause increases in non-native or weedy plant species occupancy and abundance. The objective of this project was to evaluate the role of land use change, specifically the transformation of grasslands to low density residential developments in a portion of the Greater Yellowstone Ecosystem, and the probability of occurrence of one non-native plant species, Centaurea maculosa. More specifically, this project was to serve as one of many exploratory data analyses meant to contribute to my understanding of how exurban development could be related to non-native plant distributions. Methods Study Area and Data Collection The locations of the presence and absence of Centaurea maculosa (CEMA), a suspected invader of rangeland systems and a Montana state listed noxious weed, were recorded within areas of Paradise Valley, MT where land use and cover were classified (Figure 1). Presence/absence data were collected at the 10m scale (i.e., 10m x 10m quadrat size) along transects that randomly started on roads and varied in length between 300m and 1000m perpendicular from their start points along roads. The presence/absence data were scaled up from the 10m resolution to a 30m resolution to match the environmental variables used in the analysis. The scaling up process allowed for the number of occurrences within any 30m grid cell to be binomially distributed. Each 30m grid cell containing observations became an individual case in the following analyses. This methodology of scaling the observations up to the resolution of the environmental predictor variables was used in Latimer et al. (2006). Model Building Typically, a logistic regression model would be used to relate several environmental variables to the probability of occurrence; however, for the sake of simplicity for this particular project, I choose to model the probability of occurrence as a function of only one environmental variable, elevation. For my purposes, I assumed that elevation was an adequate surrogate for a variety of environmental conditions. I generated two beta/binomial exchangeable hierarchical Bayesian models, one for each set of cases within undeveloped grassland and developed residential (formally undeveloped grassland). I let Y = number of CEMA occurrences. I assumed that all case-specific yi are independent. I assumed that yi ~ binomial (ni,pi), where ni = case-specific number of trials and pi = case-specific probability of occurrence. Further, I assumed that each pi was a function of elevation modeled as a simple logistic regression. Therefore, the likelihood function used was the equation for the binomial distribution where p was equal to a logistic regression function (Eq. 1). Eq. 1 log (P(yi|ni,pi)) = ∑ (i=1 to k) [(ni!/yi!(ni-yi)!) * piyi * (1-pi)yi] To construct a beta/binomial exchangeable hierarchical model, I assumed that p1,...,pk, where k = number of cases within a particular land use type (i.e., grassland or residential), were a random sample from a beta distribution with shape parameters, a and b. The beta (a,b) distribution was the prior on pi. The parameters, a and b, were hyperparameters that were assigned uniform hyperpriors. Vague uniform priors were placed on the global beta (coefficient) parameters for the logistic regression. All analyses were coded and administered in R (www.r-project.org) using methodologies illustrated in Goodman (2009), Albert (2009), Geyer (2009), and Crawley (2007). The joint posterior distribution for each Bayesian logistic regression model was sampled using a Metropolis random walk algorithm within a Markov Chain Monte Carlo (MCMC) simulation (mcmc package in R, Geyer, 2009). The ultimate goal of these models were to compare the predicted logistic regression curves resulting from cases within undeveloped grasslands and cases within developed residential areas. Therefore, functions in R were coded to output only the MCMC diagnostics and marginal posterior distributions of the parameters of interest for inference (i.e., the coefficients for the two simple logistic regression models). Results and Discussion MCMC Output It is important to plot the output of the MCMC chain as a time series plot. This allows for the visual evaluation of convergence for each of the parameters in the joint posterior distribution being sampled. Figures 2 and 4 refer to the MCMC diagnostic output for each of the models. Ideally, one can feel reasonably certain about convergence being met when the time series plots appear similar to the plots for the β1 parameters of both models (Figures 2 and 4). Despite much experimentation with tuning the MCMC sampler, the best time series plots for the β 0 parameters for both models were less than satisfactory. For the sake of completing the analyses that were part of my objectives for this project, I assumed that adequate samples were drawn from the posterior distribution for each parameter of interest. Figures 3 and 5 refer to the marginal posterior distributions for each coefficient parameter for each model. For the purposes of graphically plotting the relationship between the probability of occurrence and elevation for each land use, I assumed that the marginal posterior distributions were normally distributed. This assumption might adequate for the marginal distributions for the β1 parameters of both models, but this assumption may be inadequate for the distributions of the β0 parameters of both models (Figures 3 and 5). The marginal distributions of both parameters were influenced by the MCMC sampling algorithm. The failure of adequate convergence leads me to be suspicious of the resulting marginal posterior distributions. However, for the purposes of this project, I assumed that the marginal distributions could be interpreted and used to compare the relationships between the probability of CEMA occurrence and elevation for the two land uses of interest. Comparison of Probabilities of CEMA Occurrence between Land Uses It was clear from the resulting marginal distributions on the coefficient parameters that were differences in both the intercept and slope parameters between the two land use models. For instance, the grassland model yielded much higher mean intercept and slope values as compared to the residential model. In the context of a logistic regression, these differences could lead to dramatic differences in the placement of the resulting curves. This is very clearly illustrated in Figure 6 where both models are plotted. Figure 6 shows the mean response curve as well as the upper and lower bounds of the 95% credible intervals around those mean responses. As expected, the uncertainty around the mean response of the grassland model is less than that of the residential model. This was expected because the number cases used to draw inference for the grassland model (k = 1292) parameters was far greater than the number of cases used in the residential model (k = 70). Furthermore, uncertainty for both models decreased as the elevation increased. I believe this was due to the majority of cases in both datasets having zero number of occurrences at higher elevation values. The relationships between the probability of CEMA occurrence and elevation were different between the two land uses. The probability of CEMA occurrence within undeveloped grassland was much higher at lower elevations compared to the probability of occurrence within developed residential areas. As elevation increased, the probability of CEMA occurrence within the grassland areas sharply decreased while the probability of occurrence within the residential areas gradually decreased. The grassland model suggests that there was a threshold elevation where the probability of CEMA occurrence goes to zero. This along with the shape of the residential curve suggest that human related disturbances associated with residential development has allowed for significantly increased probabilities of CEMA occurrence at higher elevations. Interestingly, undeveloped grassland areas at low elevation locations had significantly higher probabilities of occurrence than locations of the same elevation in residential areas. This trend could possibly be explained by active management CEMA populations by homeowners. Conclusions and Future Directions Despite less than satisfactory convergence within the MCMC simulations, the resulting relationships between the probability of CEMA occurrence and elevation were quite different. The results of this exploratory data analysis using a hierarchical Bayesian approach suggested that residential development could facilitate the occurrence of CEMA into higher elevations than previously documented. The inspiration for this project was from Latimer et al. (2006). The authors illustrated methods for modeling species distributions within a hierarchical Bayesian framework, and suggested that models incorporating spatial dependency of species distributions as well as a hierarchical structure can result in better probability of occurrence predictions. It would be necessary to integrate a spatial random effect into the model as suggested by Latimer et al. (2006) to make inference on those cases (i.e., 30m grid cells) where observations do not exist. Furthermore, it would be necessary to integrate this level of complexity from the practical application of producing a map of predicted probabilities of occurrence. Another important aspect to be considered is the addition of other environmental variables that could further explain variability in predicted probabilities. Moreover, experimenting with different samplers within an MCMC may be necessary to result in satisfactory convergence diagnostics. For instance, a Gibbs sampler might be necessary for the integration of the spatial random effect. References Albert, J. 2009. Bayesian Computation with R Second Edition. Springer Science+Business Media, LLC. New York, New York. Corbin, J.D. and C.M. D’Antonio. 2004. Competition between native perennial and exotic annual grasses: implications for an historical invasion. Ecology. 85:1273-1283. Crawley, M.J. 2007. The R Book. John Wiley and Sons, Ltd. West Sussex, England. Geyer, C.J. 2009. MCMC Package Example (Version 0.7-3). Online resource. www.r-project.org Goodman, D. 2009. Empirical Bayes, Bayes empirical Bayes, and hierarchical analysis. Online resource (www.esg.montana.edu). Guisan, A., T.C.J. Edwards, and T. Hastie. 2002. Generalized linear and generalized additive models in studies of species distributions: setting the scene. Ecological Modelling. 157:89-100. Guisan, A. and N.E. Zimmerman. 2000. Predictive habitat distribution models in ecology. Ecological Modelling. 135:147-186. Hansen, A.J., R.L. Knight, J.M. Marzluff, S. Powell, K. Brown, P.H. Gude, and A. Jones. 2005. Effects of exurban development on biodiversity: patterns, mechanisms, and research needs. Ecological Applications. 15:1893-1905. Hobbs, R.J. and S.E. Humphries. 1995. An integrated approach to the ecology and management of plant invasions. Conservation Biology. 9:761-770. Latimer, A.M., S. Wu, A.E. Gelfand, ad J.A Silander Jr. 2006. Building statistical models to analyze species distributions. Ecological Applications. 16:33-50. Keating, K.A., and S. Cherry. 2004. Use and interpretation of logistic regression in habitat selection studies. Journal of Wildlife Management. 68:774-789. Rew, L.J., B.D. Maxwell, and R. Aspinall. 2005. Predicting the occurrence of non-indigenous species using environmental and remotely sensed data. Weed Science. 53:236-241. Shafii, B., W.J. Price, T.S. Prather, L.W. Lass, and D.C. Thill. 2003. Predicting the likelihood of yellow starthistle (Centaurea solstitialis) occurrence using landscape characteristics. Weed Science. 51:748-751. Figures Figure 1. Location of each case for both sets of cases within undeveloped grassland (green) and developed residential (brown) Figure 2. Time series plots of the sampled marginal posterior distributions for both the coefficients of the grassland model Figure 3. Marginal posterior distributions of β0 and β1 corresponding to the binomial probabilities of Centaurea maculosa occurrence within undeveloped grassland areas of Paradise Valley, MT Figure 4. Time series plots of the sampled marginal posterior distributions for both the coefficients of the residential model Figure 5. Marginal posterior distributions of β0 and β1 corresponding to the binomial probabilities of Centaurea maculosa occurrence within residentially developed areas of Paradise Valley, MT Figure 6. The mean response and 95% credible intervals for each land use model