Spatial Modelling of Count Data: A Case Study in Modelling Breeding Bird Survey Data on Large Spatial Domains

Christopher K. Wikle
University of Missouri-Columbia

0.1 Introduction

The North American Breeding Bird Survey (BBS) is conducted each breeding season by volunteer observers (e.g., Robbins et al. 1986). The observers count the number of various species of birds along specified routes. The collected data are used for several purposes, including the study of the range of bird species, and the variation of the range and abundance over time (e.g., Link and Sauer, 1998). Such studies usually require spatial maps of relative abundance. Traditional methods for producing such maps are somewhat ad hoc (e.g., inverse distance methods) and do not always account for the special discrete, positive nature of the count data (e.g., Sauer et al. 1995). In addition, corresponding prediction uncertainties for maps produced in this fashion are not typically available. Providing such uncertainties is critical as the prediction maps are often used as "data" in other studies and for the design of auxiliary sampling plans.

We consider the BBS modeling problem from a hierarchical perspective, modeling the count data as Poisson, conditional on a spatially-varying intensity process. The intensities are then assumed to follow a log-normal distribution with fixed effects and with spatial and non-spatial random effects. Model-based geostatistical methods for generalized linear mixed models (GLMMs) of this type have been available since the seminal work of Diggle et al. (1998). However, implementation is problematic when there are large data sets and prediction is desired over large domains. We show that by utilizing spectral representations of the spatial random effects process, Bayesian spatial prediction can easily be carried out on very large data sets over extensive prediction domains.

The BBS sampling unit is a roadside route 39.2 km in length.
Over each route, an observer makes 50 stops, at which birds are counted by sight and sound for a period of 3 minutes. Over 4000 routes have been included in the North American survey, but not all routes are available each year. As might be expected due to the subjectivity involved in counting birds by sight and sound, and the relative experience and expertise of the volunteer observers, there is substantial observer error in the BBS (e.g., Sauer et al. 1994).

In this study, we are concerned with the relative abundance of the House Finch (Carpodacus mexicanus). Figure 0.1 shows the location of the sampling route midpoints and observed counts over the continental United States (U.S.) for the 1999 House Finch BBS. The size of the circle radius is proportional to the number of birds observed at each site. This figure suggests that the House Finch is more prevalent in the Eastern and Western U.S. than in the interior. Indeed, this species is native to the Western U.S. and Mexico. The Eastern population is a result of a 1940 release of caged birds in New York. The birds were being sold illegally in New York City as "Hollywood Finches" and were supposedly released by dealers in an attempt to avoid prosecution. Within three years there were reports of the birds breeding in the New York area. Because the birds are prolific breeders and their juveniles disperse over long distances, the House Finch quickly expanded to the west (Elliott and Arbib, 1953). Simultaneously, as the human population on the west coast expanded eastward (and correspondingly, changed the environment) the House Finch expanded eastward as well. By the late 1990s, the two populations met in the Central Plains of North America.

Figure 0.1 Observation locations for 1999 BBS of House Finch (Carpodacus mexicanus). Radius and color are proportional to the observed counts.

From Figure 0.1 it is clear that there are many regions of the U.S.
that were not sampled in the 1999 House Finch BBS. Our interest here is to predict abundance over a relatively dense network of spatial locations, every quarter degree of latitude and longitude. The network of prediction grid locations includes 228 points in the longitudinal and 84 in the latitudinal direction, for a total of 19,152 prediction grid locations.

0.2 The Poisson Random Effects Model

Consider the model for the count process $y(x)$ given a spatially varying mean process $\lambda(x)$:

$$y(x) \mid \lambda(x) \sim \text{Poisson}(\lambda(x)). \tag{0.1}$$

The log of the spatial mean process is given by:

$$\log(\lambda(x)) = \mu + z(x) + \eta(x), \tag{0.2}$$

where $\mu$ is a deterministic mean component, $z(x)$ is a spatially correlated random component, and $\eta(x)$ is an uncorrelated spatial random component. In general, the fixed component $\mu$ might be related to spatially-varying covariates (such as habitat) and could include "regression" terms. We will consider the simple constant mean formulation in this application.

The correlated process, $z(x)$, is necessary in this application because we have substantial prior belief that the counts at "nearby" routes are correlated. From a scientific point of view, this is likely due (at least in part) to the fact that the birds are attracted to specific habitats, and we know that habitat is correlated in space. Typically, one can view the $z$-process as accounting for the effects of "unknown" covariates, since it induces spatial structure in the $\lambda$-process, and thus the observed counts. In that sense, maps of the $z$-process may be interesting and lead to greater understanding as to the preferred habitat of the modeled bird species (e.g., Royle et al. 2001).

The random component $\eta(x)$ accounts for observer effects. A major concern in the analysis of BBS data is the known observer bias, as discussed previously. Typically, we can assume that since the observers produce counts on different routes, they are independent with regard to space.
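As a rough illustration of (0.1) and (0.2), the following sketch simulates counts along a one-dimensional transect. The Gaussian forms for $z$ and $\eta$, the exponential correlation function, and every numerical value (`mu`, `theta1`, `sigma_eta`, the number of sites) are assumptions chosen only for illustration; they are not the settings or the code used in this study.

```python
import numpy as np


def simulate_counts(n_sites=200, mu=0.7, theta1=5.0, sigma_eta=0.5, seed=0):
    """Simulate y(x) | lambda(x) ~ Poisson(lambda(x)) with
    log(lambda(x)) = mu + z(x) + eta(x) on a 1-D transect (illustrative only)."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(0.0, 1.0, size=n_sites))  # hypothetical site locations

    # Spatially correlated component z(x): zero-mean Gaussian with
    # correlation exp(-theta1 * distance) (an assumed choice).
    dist = np.abs(x[:, None] - x[None, :])
    cov_z = np.exp(-theta1 * dist)
    chol = np.linalg.cholesky(cov_z + 1e-8 * np.eye(n_sites))  # small jitter for stability
    z = chol @ rng.standard_normal(n_sites)

    # Spatially uncorrelated component eta(x), e.g. observer effects.
    eta = sigma_eta * rng.standard_normal(n_sites)

    log_lam = mu + z + eta
    y = rng.poisson(np.exp(log_lam))  # counts, conditional on the intensity
    return x, z, eta, log_lam, y


x, z, eta, log_lam, y = simulate_counts()
print(y[:10], y.mean(), y.var())
```

Because the intensity is itself random, counts simulated this way are typically overdispersed relative to a Poisson distribution with a fixed mean.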
The above discussion suggests that we might model $z(x)$ as a Gaussian random field with zero mean and covariance given by $c_\theta(x, x')$, where $\theta$ represents parameters (possibly vector-valued) of the covariance function $c$. In addition, we assume $\eta(x) \sim N(0, \sigma_\eta^2)$, where $\text{cov}(\eta(x), \eta(x')) = 0$ if $x \neq x'$.

As presented, the Poisson spatial model follows the framework for generalized geostatistical prediction formulated in Diggle et al. (1998). An example of this approach applied to the BBS problem can be found in Royle et al. (2001). However, implementation in that case involved relatively small data sets over limited geographical regions. The Gaussian random field-based Bayesian hierarchical approach becomes increasingly difficult to implement as the dimensionality of the data and the number of prediction locations increase. Consequently, such an approach is not feasible at the continental scale and high resolution that we require in the present application. However, as outlined in Royle and Wikle (2001), one can still use the Bayesian GLMM methodology in these high-dimensional settings if one makes use of spectral representations. This approach is summarized in the next section.

0.2.1 Spectral Formulation

Let $\{x_i\}_{i=1}^{m}$ be the set of data locations, at which counts $y(x_i)$ were observed. Further, let $\{x_j\}_{j=1}^{n}$ be the set of prediction locations, which may, but need not, include some or all of the $m$ data locations. We now rewrite the mean-process model (0.2):

$$\log(\lambda(x_i)) = \mu + \mathbf{k}_i' \mathbf{z}_n + \eta(x_i), \tag{0.3}$$

where $\mathbf{z}_n$ is an $n \times 1$ vector representation of the $z$-process at the prediction locations, and the vector $\mathbf{k}_i$ relates the log-mean process at observation location $x_i$ to one or more elements of the $z$-process at prediction locations (e.g., Wikle et al. 1998; Wikle et al. 2001). We then assume:

$$\mathbf{z}_n = \boldsymbol{\Psi}\boldsymbol{\alpha} + \boldsymbol{\epsilon}, \tag{0.4}$$

where $\boldsymbol{\Psi}$ is an $n \times p$ matrix, fixed and known, $\boldsymbol{\alpha}$ is a $p \times 1$ vector of coefficients with $\boldsymbol{\alpha} \sim N(\mathbf{0}, \boldsymbol{\Sigma}_\alpha)$, and $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma_\epsilon^2 \mathbf{I})$.
We let $\boldsymbol{\Psi}$ consist of spectral basis functions $[\psi_{j,k}]_{j=1,k=1}^{n,p}$ that are orthogonal. That is, if $\boldsymbol{\psi}_k \equiv [\psi_{1,k}, \ldots, \psi_{n,k}]'$ then $\boldsymbol{\psi}_k' \boldsymbol{\psi}_j = 0$ if $k \neq j$ and 1 otherwise. In this case, we say that $\boldsymbol{\alpha}$ are spectral coefficients. From a hierarchical perspective, we can write:

$$\mathbf{z}_n \mid \boldsymbol{\alpha}, \sigma_\epsilon^2 \sim N(\boldsymbol{\Psi}\boldsymbol{\alpha}, \sigma_\epsilon^2 \mathbf{I}) \tag{0.5}$$

and

$$\boldsymbol{\alpha} \mid \boldsymbol{\Sigma}_\alpha \sim N(\mathbf{0}, \boldsymbol{\Sigma}_\alpha). \tag{0.6}$$

In general, the covariance function for the $\alpha$-process depends on some parameters $\theta$; we denote this covariance by $\boldsymbol{\Sigma}_\alpha(\theta)$. The modeling motivation for the hierarchy is apparent if we note that the random $z$-process can be written $\mathbf{z}_n \sim N(\mathbf{0}, \boldsymbol{\Sigma}_z(\theta) + \sigma_\epsilon^2 \mathbf{I})$, where $\sigma_\epsilon^2$ accounts for the "nugget effect" due to small scale variability. Given (0.4), the covariance function for the $z$-process can be written $\boldsymbol{\Sigma}_z(\theta) = \boldsymbol{\Psi}\boldsymbol{\Sigma}_\alpha(\theta)\boldsymbol{\Psi}'$.

In principle, any set of orthogonal spectral basis functions could be used for $\boldsymbol{\Psi}$. For example, one could use the leading variance modes of the covariance matrix $\boldsymbol{\Sigma}_z$. Such modes are just the eigenvectors that diagonalize the spatial covariance matrix and thus are just principal components. These spatial principal components are known as Empirical Orthogonal Functions (EOFs) in the geostatistical literature (e.g., Obled and Creutin, 1986; Wikle and Cressie, 1999). Such a formulation is advantageous because it allows for non-stationary spatial correlation and dimension reduction ($p \ll n$).

Another possibility would be to use Fourier basis functions in $\boldsymbol{\Psi}$. This could apply if the prediction locations were defined in continuous space or on a grid. However, as we will demonstrate, if we choose a grid implementation, one need not actually form the matrix $\boldsymbol{\Psi}$, which would be problematic for grid sizes of order $10^5$ as we consider here. That is, the operation $\boldsymbol{\Psi}\boldsymbol{\alpha}$ is actually an inverse Fourier transform operation on the vector $\boldsymbol{\alpha}$. On a discrete lattice, one can use Fast Fourier Transform (FFT) procedures to efficiently implement this transform without having to make or store the matrix of basis functions. In this case, $p = n$.
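The statement that the product of the basis matrix and the coefficient vector can be computed without forming the matrix can be checked on a small example. The sketch below assumes a one-dimensional grid and the complex unitary discrete Fourier basis (NumPy's `norm="ortho"` convention), so orthogonality is with respect to the conjugate transpose. A real-valued field also requires conjugate-symmetric coefficients, which the sketch obtains by transforming a real vector. The grid size and all variable names are illustrative.

```python
import numpy as np

n = 16  # small grid so that the basis matrix can be formed explicitly
rng = np.random.default_rng(1)

# Explicit unitary inverse-DFT matrix: Psi[j, k] = exp(2*pi*i*j*k/n) / sqrt(n).
j = np.arange(n)
Psi = np.exp(2j * np.pi * np.outer(j, j) / n) / np.sqrt(n)

# Spectral coefficients of a real-valued field (conjugate-symmetric by construction).
z_true = rng.standard_normal(n)
alpha = np.fft.fft(z_true, norm="ortho")  # plays the role of Psi' z_n

z_matrix = Psi @ alpha                    # O(n^2) matrix-vector product
z_fft = np.fft.ifft(alpha, norm="ortho")  # O(n log n); Psi is never formed

print(np.allclose(Psi.conj().T @ Psi, np.eye(n)))  # orthonormal columns
print(np.allclose(z_matrix, z_fft))                # same result either way
print(np.allclose(z_fft.real, z_true))             # recovers the real field
```

On a two-dimensional lattice such as the prediction grid considered here, `np.fft.fft2` and `np.fft.ifft2` would play the same roles.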
If the $z$-process is stationary, the use of Fourier basis functions suggests that the matrix $\boldsymbol{\Sigma}_\alpha(\theta)$ is diagonal (asymptotically).

For situations where it is more appropriate to assume that the process is nonstationary and the prediction locations can be thought of as a discrete grid, one could consider a wavelet basis function for $\boldsymbol{\Psi}$. In this case, the operation $\boldsymbol{\Psi}\boldsymbol{\alpha}$ is just an inverse discrete wavelet transform of $\boldsymbol{\alpha}$; again, $\boldsymbol{\Psi}$ need not be constructed directly. Depending on the class of wavelets chosen, the matrix $\boldsymbol{\Sigma}_\alpha(\theta)$ may be diagonal (asymptotically) or nearly so.

In the hierarchical implementation, the parameterization of $\boldsymbol{\Sigma}_\alpha(\theta)$ is especially critical. For example, with wavelet basis functions, we might assume a fractional scaling behavior in the variance of the different wavelet scales. This is particularly useful when the process is known to exhibit such behavior, such as turbulence examples in atmospheric science (e.g., Wikle et al. 2001). Alternatively, we might assume a common stationary class for the $z$-process, such as the Matérn class of covariance functions,

$$c(d_{ij}) = \phi\,(\theta_1 d_{ij})^{\theta_2} K_{\theta_2}(\theta_1 d_{ij}), \quad \phi > 0,\ \theta_1 > 0,\ \theta_2 > 0, \tag{0.7}$$

where $d_{ij}$ is the distance between two spatial locations, $K_{\theta_2}$ is the modified Bessel function, $\theta_2$ is related to the degree of smoothness of the spatial process, $\theta_1$ is related to the correlation range, and $\phi$ is proportional to the variance of the process (e.g., Stein 1999, p. 48). The corresponding spatial spectral density function at frequency $\omega$ is

$$f(\omega; \theta_1, \theta_2, \phi, g) = \frac{2^{\theta_2 - 1}\,\phi\,\Gamma(\theta_2 + g/2)\,\theta_1^{2\theta_2}}{\pi^{g/2}\,(\theta_1^2 + \omega^2)^{\theta_2 + g/2}}, \tag{0.8}$$

where $g$ is the dimension of the spatial process (e.g., Stein 1999, p. 49). Thus, if one chooses Fourier basis functions for $\boldsymbol{\Psi}$ and assumes the Matérn class, then $\boldsymbol{\Sigma}_\alpha(\theta)$ should be diagonal (asymptotically) with diagonal elements corresponding to $f$ given by (0.8). If not known, one must specify prior distributions for $\theta$ and $\phi$ at the next level of the model hierarchy.
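Equation (0.8) translates directly into a function that could supply the diagonal elements of the spectral covariance matrix. The sketch below is a minimal transcription using only the Python standard library. The function name and the numerical check are illustrative assumptions: the check uses the one-dimensional exponential case (smoothness 1/2), where (0.7) reduces to $\phi\sqrt{\pi/2}\,\exp(-\theta_1 d)$ and, under the convention that the covariance is the integral of $e^{i\omega d} f(\omega)$ over frequency, the density should integrate to $\phi\sqrt{\pi/2}$. The mapping from grid indices to frequencies is not shown.

```python
import math


def matern_spectral_density(omega, theta1, theta2, phi, g):
    """Spectral density (0.8) of the Matern covariance (0.7) in g dimensions."""
    num = (2.0 ** (theta2 - 1.0) * phi * math.gamma(theta2 + g / 2.0)
           * theta1 ** (2.0 * theta2))
    den = math.pi ** (g / 2.0) * (theta1 ** 2 + omega ** 2) ** (theta2 + g / 2.0)
    return num / den


# Check in the exponential case (theta2 = 1/2, g = 1): integrating f over
# frequency should approximately give the variance c(0) = phi * sqrt(pi / 2).
theta1, theta2, phi, g = 2.0, 0.5, 1.0, 1
limit, steps = 2000.0, 400_000
h = 2.0 * limit / steps
# Midpoint rule on [-limit, limit]; the neglected tails are small but not zero.
integral = h * sum(
    matern_spectral_density(-limit + (i + 0.5) * h, theta1, theta2, phi, g)
    for i in range(steps)
)
print(integral, phi * math.sqrt(math.pi / 2.0))
```

The two printed values should agree to roughly three decimal places; the remaining gap comes from truncating the slowly decaying tails of the density.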
0.2.2 Model Implementation and Prediction

The hierarchical Poisson model with a spectral spatial component is summarized as follows. The joint likelihood for all observations $\mathbf{y}$ (an $m \times 1$ vector) is

$$[\mathbf{y} \mid \boldsymbol{\lambda}] = \prod_{i=1}^{m} \text{Poisson}(\lambda(x_i)), \tag{0.9}$$

where $\boldsymbol{\lambda}$ is an $m \times 1$ vector, corresponding to the locations of the vector $\mathbf{y}$. The joint prior distribution for $\log(\lambda(x_i))$ is:

$$[\log(\boldsymbol{\lambda}) \mid \mu, \gamma, \mathbf{z}_n, \sigma_\eta^2] = N(\mu\mathbf{1} + \gamma\mathbf{K}\mathbf{z}_n, \sigma_\eta^2 \mathbf{I}), \tag{0.10}$$

where $\mathbf{1}$ is an $m \times 1$ vector of ones, $\log(\boldsymbol{\lambda})$ is the $m \times 1$ vector with elements $\log(\lambda(x_i))$, $\mathbf{K}$ is an $m \times n$ matrix with rows $\mathbf{k}_i'$, and $\gamma$ is a scaling coefficient (introduced for computational reasons as discussed below). Then, let

$$[\mathbf{z}_n \mid \boldsymbol{\alpha}, \sigma_e^2] = N(\boldsymbol{\Psi}\boldsymbol{\alpha}, \sigma_e^2 \mathbf{I}), \tag{0.11}$$

and allow the spectral coefficients to have distribution

$$[\boldsymbol{\alpha} \mid \mathbf{R}_\alpha(\theta_1)] = N(\mathbf{0}, \mathbf{R}_\alpha(\theta_1)), \tag{0.12}$$

where $\mathbf{R}_\alpha(\theta_1)$ is a diagonal matrix. For the BBS illustration presented here, we let $\theta_2 = 1/2$ in (0.7) (i.e., we assume the covariance model is exponential) but assume the dependence parameter $\theta_1$ is random. Note that as a consequence of including the $\gamma$ parameter in (0.10) we are able to specify the conditional covariance of $\boldsymbol{\alpha}$ as the diagonalization of a correlation matrix rather than a covariance matrix (see discussion below).

Finally, to complete the model hierarchy, we assume the remaining parameters are independent and specify the following prior distributions:

$$\mu \sim N(\mu_0, \sigma_\mu^2), \quad \sigma_\eta^2 \sim IG(q_\eta, r_\eta), \quad \sigma_e^2 \sim IG(q_e, r_e), \tag{0.13}$$

$$\gamma \sim U[0, b], \quad \theta_1 \sim U[u_1, u_2], \tag{0.14}$$

where $IG(\,)$ refers to an inverse gamma distribution, and $U[\,]$ a uniform distribution. For the BBS House Finch data we select $q_\eta = 0.5$, $q_e = 1$, $r_\eta = 2$, $r_e = 10$, $\mu_0 = 0$, $\sigma_\mu^2 = 10$, $b = 100$, $u_1 = 1$, and $u_2 = 100$ (note, our parameterization of the exponential is $r(d) \propto \exp(-\theta_1 d)$, where $d$ is the distance). These hyperparameters correspond to rather vague proper priors.

The alternative to specifying $\gamma$ in (0.10) is to let the conditional covariance of $\boldsymbol{\alpha}$ be $\sigma_\alpha^2 \mathbf{R}_\alpha(\theta_1)$.
However, as is often the case for parameters of Bayesian spatial models that sit deep in the model hierarchy (and thus relatively far from the data), the MCMC implementation has difficulty converging because of the tradeoff between the spatial process variance, $\sigma_\alpha^2$, and the dependence parameter, $\theta_1$. By allowing the $z$-process to have unit variance, as in the above formulation, we need not estimate $\sigma_\alpha^2$ (which is 1 in this case). The variance in the spatial process is then achieved through $\gamma$. In situations where the implied assumption of homogeneous variance is unrealistic, a more complicated reparameterization would be required. Note that the $\gamma$ parameterization also affects the interpretation of the variance of the $z$-process (i.e., $\sigma_e^2 = \sigma_\epsilon^2 / \gamma^2$).

Our goal is the estimation of the joint posterior distribution,

$$[\log(\boldsymbol{\lambda}), \mathbf{z}_n, \boldsymbol{\alpha}, \theta_1, \gamma, \sigma_\eta^2, \sigma_e^2, \mu \mid \mathbf{y}] \propto [\mathbf{y} \mid \log(\boldsymbol{\lambda})]\,[\log(\boldsymbol{\lambda}) \mid \mu, \gamma, \mathbf{z}_n, \sigma_\eta^2]\,[\mathbf{z}_n \mid \boldsymbol{\alpha}, \sigma_e^2] \times [\boldsymbol{\alpha} \mid \theta_1]\,[\theta_1]\,[\gamma]\,[\mu]\,[\sigma_\eta^2]\,[\sigma_e^2].$$

Although this distribution cannot be analyzed directly, we are able to use MCMC approaches as suggested by Diggle et al. (1998) to draw samples from this posterior and appropriate marginals. In particular, we utilized a Gibbs sampler with Metropolis-Hastings sampling of $\log(\boldsymbol{\lambda})$ and $\theta_1$ (e.g., see Royle et al. 2001). Perhaps more importantly, we would like estimates from the posterior distribution of $\boldsymbol{\lambda}_n$, the $\lambda$-process at prediction grid locations. The key difficulty in the traditional (non-spectral) geostatistical formulation is the dimensionality of the full-conditional update for the $z$-process given all other parameters. As we show below, this is no longer a serious problem if we make use of the spectral representation.

Selected Full-Conditional Distributions

As mentioned above, for the most part the full-conditional distributions follow those outlined generally in Diggle et al. (1998) and specifically, those in Royle et al. (2001). However, the spectral representation allows simpler forms for the $\mathbf{z}_n$ and $\boldsymbol{\alpha}$ full-conditionals.
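The full-conditional distribution for $\mathbf{z}_n$ can be shown to be:

$$\mathbf{z}_n \mid \cdot \sim N(\mathbf{S}_z^{-1}\mathbf{a}_z, \mathbf{S}_z^{-1}), \tag{0.15}$$

where $\mathbf{S}_z = \mathbf{I}/\sigma_e^2 + \mathbf{K}'\mathbf{K}\gamma^2/\sigma_\eta^2$ and $\mathbf{a}_z = \boldsymbol{\Psi}\boldsymbol{\alpha}/\sigma_e^2 + \mathbf{K}'(\log(\boldsymbol{\lambda}) - \mu\mathbf{1})\gamma/\sigma_\eta^2$. In our case, $\mathbf{K}$ is an incidence matrix (a matrix of ones and zeros) such that each observation is only associated with one prediction grid location (a reasonable assumption at the resolution presented here). Thus, $\mathbf{K}'\mathbf{K}$ can be shown to be a diagonal matrix with 1's and 0's along the diagonal. Although the matrix $\mathbf{S}_z$ is very high-dimensional (order $10^5 \times 10^5$), it is diagonal and trivial to invert. In addition, $\boldsymbol{\Psi}\boldsymbol{\alpha}$ can be calculated by the inverse FFT function (a fast operation), and the elements of $\mathbf{z}_n$ are updated as independent univariate normal draws. In practice, we update these simultaneously in a matrix language implementation.

Similarly,

$$\boldsymbol{\alpha} \mid \cdot \sim N(\mathbf{S}_\alpha^{-1}\mathbf{a}_\alpha, \mathbf{S}_\alpha^{-1}), \tag{0.16}$$

where $\mathbf{S}_\alpha = (\boldsymbol{\Psi}'\boldsymbol{\Psi}/\sigma_e^2 + \mathbf{R}_\alpha(\theta_1)^{-1})$ and $\mathbf{a}_\alpha = \boldsymbol{\Psi}'\mathbf{z}_n/\sigma_e^2$. At first glance, this appears problematic due to the $\boldsymbol{\Psi}'\boldsymbol{\Psi}$ and $\mathbf{R}_\alpha(\theta_1)^{-1}$ terms in the full-conditional variance. However, since the spectral operators are orthogonal, $\boldsymbol{\Psi}'\boldsymbol{\Psi} = \mathbf{I}$ and the matrix $\mathbf{R}_\alpha(\theta_1)^{-1}$ is diagonal as discussed previously. Furthermore, $\boldsymbol{\Psi}'\mathbf{z}_n$ is just the FFT operation on $\mathbf{z}_n$ and is very fast. Thus,

$$\boldsymbol{\alpha} \mid \cdot \sim N\big((\mathbf{I}/\sigma_e^2 + \mathbf{R}_\alpha(\theta_1)^{-1})^{-1}\boldsymbol{\Psi}'\mathbf{z}_n/\sigma_e^2,\ (\mathbf{I}/\sigma_e^2 + \mathbf{R}_\alpha(\theta_1)^{-1})^{-1}\big) \tag{0.17}$$

and can be sampled as individual univariate normals, or easily in a block update.

Prediction

To obtain predictions of $\boldsymbol{\lambda}_n$, the $\lambda$-process at the prediction grid locations, we sample from

$$[\log(\boldsymbol{\lambda}_n^{(t)}) \mid \mathbf{z}_n^{(t)}, \gamma^{(t)}, \mu^{(t)}, \sigma_\eta^{2\,(t)}] = N(\mu^{(t)}\mathbf{1} + \gamma^{(t)}\mathbf{z}_n^{(t)}, \sigma_\eta^{2\,(t)}\mathbf{I}), \tag{0.18}$$

where $\mathbf{1}$ is $n \times 1$ and $\mu^{(t)}$, $\gamma^{(t)}$, $\mathbf{z}_n^{(t)}$, $\sigma_\eta^{2\,(t)}$ are the $t$-th samples from the MCMC simulation. We obtain $\boldsymbol{\lambda}_n^{(t)}$ by simply exponentiating these samples.

Implementation

The MCMC simulation must be run long enough to achieve precise estimation of model parameters and predictions.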
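The two conjugate updates (0.15) and (0.17) and the prediction draw (0.18) can be sketched as follows. This is a simplified one-dimensional illustration, not the MATLAB code used for the analysis. It assumes a real orthonormal cosine basis formed explicitly for a small grid (where the application would use FFT calls), an incidence matrix represented by a hypothetical index array `obs_idx` with at most one observation per grid cell, and made-up values for the diagonal of the spectral correlation matrix and for all scalar parameters. The Metropolis-Hastings steps for the log-intensities and the dependence parameter are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 64, 20  # grid cells and observations (illustrative sizes)

# Real orthonormal cosine basis, formed explicitly only because n is small.
j = np.arange(n)
Psi = np.sqrt(2.0 / n) * np.cos(np.pi * np.outer(j + 0.5, j) / n)
Psi[:, 0] /= np.sqrt(2.0)

obs_idx = rng.choice(n, size=m, replace=False)  # K as an index array, one cell per observation
r_alpha = 1.0 / (1.0 + j.astype(float)) ** 2    # assumed diagonal of R_alpha(theta1)


def update_z(alpha, log_lam, mu, gamma, sig2_eta, sig2_e):
    """Draw z_n from (0.15); S_z is diagonal, so the draw is elementwise."""
    prec = np.full(n, 1.0 / sig2_e)                     # I / sigma_e^2
    lin = Psi @ alpha / sig2_e                          # Psi alpha / sigma_e^2
    prec[obs_idx] += gamma ** 2 / sig2_eta              # + K'K gamma^2 / sigma_eta^2
    lin[obs_idx] += gamma * (log_lam - mu) / sig2_eta   # + K'(log(lambda) - mu 1) gamma / sigma_eta^2
    return lin / prec + rng.standard_normal(n) / np.sqrt(prec)


def update_alpha(z, sig2_e):
    """Draw alpha from (0.17), using Psi'Psi = I and a diagonal R_alpha."""
    prec = 1.0 / sig2_e + 1.0 / r_alpha
    return (Psi.T @ z / sig2_e) / prec + rng.standard_normal(n) / np.sqrt(prec)


def predict_lambda(z, mu, gamma, sig2_eta):
    """Draw log(lambda_n) on the grid from (0.18) and exponentiate."""
    return np.exp(mu + gamma * z + np.sqrt(sig2_eta) * rng.standard_normal(n))


# A few sweeps with the other quantities held fixed at made-up values.
mu, gamma, sig2_eta, sig2_e = 0.7, 1.4, 0.8, 0.2
log_lam = rng.standard_normal(m)  # stand-in for the current log(lambda) state
alpha = np.zeros(n)
for _ in range(5):
    z = update_z(alpha, log_lam, mu, gamma, sig2_eta, sig2_e)
    alpha = update_alpha(z, sig2_e)
lam_n = predict_lambda(z, mu, gamma, sig2_eta)
print(lam_n[:5])
```

Every operation in the two updates is elementwise apart from the basis products, which is why the cost per sweep is dominated by the transform rather than by any matrix inversion.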
For the BBS House Finch data, the MCMC simulation was run for 200,000 iterations after a 50,000-iteration burn-in period. For the sake of comparison, the algorithm took approximately 0.5 seconds per iteration with a MATLAB implementation on a 500 MHz Pentium III processor running Linux. Considering there are nearly 20,000 prediction locations and relatively strong spatial structure, this is quite fast. We examined many shorter runs to establish burn-in time and to evaluate model sensitivity to the fixed parameters and starting values. The model does not seem overly sensitive to these parameters.

0.3 Results

The posterior mean and posterior standard deviation for the scalar parameters are shown in Table 0.1.

Table 0.1 Posterior mean and standard deviation of univariate model parameters.

Parameter          Posterior Mean    Posterior Standard Deviation
$\mu$              0.74              0.105
$\gamma$           1.41              0.138
$\sigma_\eta^2$    0.84              0.100
$\sigma_e^2$       0.23              0.064
$\theta_1$         14.78             4.605

Figure 0.2 shows the posterior mean for the gridded $z$-process. We note the agreement with the data shown in Figure 0.1. One might examine this map to identify possible habitat covariates that are represented by the spatial random field. One possibility in this case might be elevation and population, both of which are thought to be associated with the prevalence of the House Finch. We note that the prediction grid extends beyond the continental United States. Clearly, estimates over ocean regions are meaningless with regard to House Finch data. These estimates are a result of the large-scale Fourier coefficients in the model. Fortunately, the map of posterior standard deviations for this process, shown in Figure 0.3, indicates that estimates in these regions with no data are highly suspect. This is also true of the northern plains region, which has few observations. Of course, having the prediction grid extend over the ocean is not ideal in this case, but the FFT-based algorithm requires rectangular grids.
We could control for the land-sea effect by having an indicator covariate or possibly, a regime-specific model. Such modifications would be simple to implement in the hierarchical Bayesian framework presented here. However, simulation studies have shown that these are not necessary and if desired, one could simply mask the water portions of the map for presentation.

Figure 0.2 Posterior mean of $\mathbf{z}_n$ for the 1999 BBS House Finch data.

Figure 0.3 Posterior standard deviation of $\mathbf{z}_n$ for the 1999 BBS House Finch data.

Finally, Figure 0.4 and Figure 0.5 show the posterior mean and standard deviation of the $\lambda$-process on the prediction grid. These plots show clearly that the posterior standard deviation is proportional to the predicted mean, as expected with Poisson count data. In addition, the standard deviations are high in data-sparse regions, as we expect.

Figure 0.4 Posterior mean of gridded $\boldsymbol{\lambda}_n$ for the 1999 BBS House Finch data.

Figure 0.5 Posterior standard deviation of $\boldsymbol{\lambda}_n$ for the 1999 BBS House Finch data.

0.4 Conclusion

In summary, we have demonstrated how the Bayesian implementation of geostatistical-based GLMM Poisson spatial models can be carried out in problems with very large numbers of prediction locations. By utilizing relatively simple spectral transforms and associated orthogonality and decorrelation, we are able to implement the modeling approach very efficiently in general MCMC algorithms.

Acknowledgement

This research has been supported by a grant from the U.S. Environmental Protection Agency's Science to Achieve Results (STAR) program, Assistance Agreement No. R827257-01-0. The author would like to thank Andy Royle for providing the BBS data and for helpful discussions.

References

Diggle, P.J., J.A. Tawn, and R.A. Moyeed. 1998. Model-based geostatistics (with discussion). Applied Statistics 47:299-350.
Elliott, J.J., and R.S. Arbib. 1953. Origin and status of the house finch in the eastern United States. Auk 70:31-37.
Link, W.A., and J.R.
Sauer. 1998. Estimating population change from count data: application to the North American Breeding Bird Survey. Ecological Applications 8:258-268.
Obled, C., and J.D. Creutin. 1986. Some developments in the use of empirical orthogonal functions for mapping meteorological fields. J. Climate and Applied Meteorology 25:1189-1204.
Robbins, C.S., D.A. Bystrak, and P.H. Geissler. 1986. The Breeding Bird Survey: its first fifteen years, 1965-1979. USDOI, Fish and Wildlife Service Resource Publication 157. Washington, D.C.
Royle, J.A., W.A. Link, and J.R. Sauer. 2001. Statistical mapping of count survey data. In Predicting Species Occurrences: Issues of Scale and Accuracy (J.M. Scott, P.J. Heglund, M. Morrison, M. Raphael, J. Haufler, and B. Wall, editors). Island Press, Covelo, CA. (to appear)
Royle, J.A., and C.K. Wikle. 2001. Large-scale spatial modeling of breeding bird survey data. Under review.
Sauer, J.R., B.G. Peterjohn, and W.A. Link. 1994. Observer differences in the North American Breeding Bird Survey. Auk 111:50-62.
Sauer, J.R., G.W. Pendleton, and S. Orsillo. 1995. Mapping of bird distributions from point count surveys. Pages 151-160 in C.J. Ralph, J.R. Sauer, and S. Droege, eds. Monitoring Bird Populations by Point Counts, USDA Forest Service, Pacific Southwest Research Station, General Technical Report PSW-GTR-149.
Stein, M. 1999. Interpolation of Spatial Data: Some Theory for Kriging. Springer-Verlag, New York.
Wikle, C.K., L.M. Berliner, and N. Cressie. 1998. Hierarchical Bayesian space-time models. Journal of Environmental and Ecological Statistics 5:117-154.
Wikle, C.K., and N. Cressie. 1999. A dimension reduction approach to space-time Kalman filtering. Biometrika 86:815-829.
Wikle, C.K., R.F. Milliff, D. Nychka, and L.M. Berliner. 2001. Spatiotemporal hierarchical Bayesian modeling: Tropical ocean surface winds. Journal of the American Statistical Association 96:382-397.