Non-Gaussian Spatiotemporal Modelling through Scale Mixing

Thaís C.O. da Fonseca and Mark F.J. Steel∗
Department of Statistics, University of Warwick, U.K.

Abstract

The aim of this work is to construct non-Gaussian and nonseparable covariance functions for processes that vary continuously in space and time. Stochastic modelling of phenomena over space and time is important in many areas of application, but the choice of an appropriate model can be difficult, as one must take care to use valid covariance structures. We start from a general and flexible way of constructing valid nonseparable covariance functions derived through mixing over separable Gaussian covariance functions. We then generalize the resulting models by allowing for individual outliers as well as regions with larger variances. We induce this through scale mixing with separate positive-valued processes. Smooth mixing processes are applied to the underlying correlated Gaussian processes in space and in time, thus leading to regions in space and time of increased spread. We also apply a separate uncorrelated mixing process to the nugget effect to generate individual outliers. We consider posterior and predictive Bayesian inference with these models, implemented through a Markov chain Monte Carlo sampler, and apply our modelling approach to temperature data in the Basque Country.

Key words: Bayesian Inference; Flexible tails; Mixtures; Nonseparability; Outliers; Temperature data.

1 Introduction

The motivation of this work is to develop and study non-Gaussian models for processes that vary continuously in space and time. This is a problem of interest in many fields of science, such as geology, hydrology and meteorology.
Consider the problem of modelling a phenomenon of interest over space and time as a random process

{Z(s, t); (s, t) ∈ D × T},   (1)

where (s, t) ∈ D × T, with D ⊆ ℝ^d and T ⊆ ℝ, are space-time coordinates that vary continuously in D × T. We usually observe a realization of this process at locations si, i = 1, ..., I and time points tj, j = 1, ..., J. The usual assumption for the finite dimensional distributions implied by this process is that, for these spatiotemporal coordinates, the random vector Z = (Z(s1, t1), ..., Z(sI, t1), ..., Z(s1, tJ), ..., Z(sI, tJ))′ has a multivariate normal distribution with covariance matrix Σ with elements Σkk′ = Cov(Zk, Zk′), k, k′ = 1, 2, ..., N = IJ. This class is mathematically very convenient, but Gaussianity is a very restrictive assumption and the data may well present non-Gaussian characteristics. For instance, if there are aberrant observations in the data set, it would be useful to consider heavy-tailed distributions in order to accommodate these observations. In recent years, some models describing departures from Gaussianity for spatial processes were presented in the literature. Cressie and Hawkins (1980) discuss robust estimation of the variogram when the distribution has heavier tails than the normal in spatial models. De Oliveira et al. (1997) used nonlinear transformations of random fields in order to accommodate moderate departures from Gaussianity; for instance, their proposal includes the Gaussian and the lognormal models as sampling distributions. Palacios and Steel (2006) proposed a geostatistical model that accommodates non-Gaussian tail behaviour in space. Their proposal has the Gaussian model as a limiting case.

∗ Thaís Fonseca acknowledges financial support from the Center for Research in Statistical Methodology (CRiSM) and we thank Blanca Palacios for providing us with the temperature data.

CRiSM Paper No. 09-33, www.warwick.ac.uk/go/crism
The proposed class of processes is based on scale mixing a Gaussian process, which allows for modelling regions with larger observational variance. Here we consider similar mixing in the nugget effect component, allowing for individual outliers. In addition, the ideas in Palacios and Steel (2006) are here extended to processes in space and time, while avoiding the restrictive assumption of separability between space and time. In the context of larger observational variance, Damian et al. (2001) considered temporally independent samples from a spatiotemporal process Z(s, t) = √ν(s) Y1(s) + Y2(s, t), where Y1(s) is a spatial process and Y2(s, t) accounts for the nugget effect. Their model addresses the problem of anisotropy through deformation of the spatial coordinates and uses Bayesian semi-parametric modelling of the deformation function. The general model potentially also accounts for different variances in space, but they adopted the simplifying assumption of constant variances ν(s) = ν, ∀s. Damian et al. (2003) consider the complete model that incorporates spatial heterogeneity by modelling ν(s) as latent variables with a log-Gaussian distribution. To deal with heterogeneity in time, Stein (2009) proposed a model that can account for occasional bursts of increased variability in time. This was done by considering the transformed spatiotemporal process divided by a function of time, which was estimated by computing the sample standard deviation at each time point and then smoothed by cubic splines. Notice that this approach does not allow for predictions in time. The model we propose here is able to capture heterogeneous variability both in time and space, as well as outliers in space through a mixed nugget effect. In addition, the covariance structure is nonseparable between space and time.
We present an application to maximum temperature data in the Spanish Basque Country, in which the model massively outperforms Gaussian modelling, both in terms of within-sample data support and out-of-sample predictive fit. Moreover, the model easily allows for prediction in space and in time, since we can also predict the mixing processes. We use a Bayesian inferential framework with mildly informative priors. The flexibility of the proposed model does not substantially complicate posterior and predictive inference since, conditional on the mixing processes, the finite dimensional distributions are all Gaussian.

2 Spatiotemporal modelling

Building adequate models for processes observed over space and time is not an easy task. Many features have to be considered, like stationarity, separability, isotropy and Gaussianity. Adequate specification of the sampling distribution plays an important role in this context, since misspecification can lead to poor forecasts or interpolations in space and time. In particular, Gaussian models will not perform well if the data are contaminated by outliers or if there are regions in space or time with larger observational variance. For this reason, we propose a general model able to capture individual outliers as well as regions with different variance. We use the idea of scale mixing in order to construct processes that imply finite dimensional distributions with heavier tails than the normal distribution. We consider nonseparable models in space and time generated as proposed in Fonseca and Steel (2008). This construction takes advantage of the models proposed for spatial and temporal processes separately and combines them by using a continuous mixture of separable covariance functions. Let (U, V) be a bivariate nonnegative random vector with distribution µ(u, v), independent of {Z1(s); s ∈ D} and {Z2(t); t ∈ T}, which are purely spatial and temporal random processes, respectively, taken to be independent.
Define the process

Z(s, t) = Z1(s; U) Z2(t; V),   (2)

where Z1(s; u) is a purely spatial random process for every u ∈ ℝ+ with covariance function C1(s; u) = σ1 exp{−γ1(s)u}, which is a stationary covariance for s ∈ D and every u ∈ ℝ+ and a measurable function of u ∈ ℝ+ for every s ∈ D. Z2(t; v) is a purely temporal random process for every v ∈ ℝ+ with covariance function C2(t; v) = σ2 exp{−γ2(t)v}, which is a stationary covariance for t ∈ T and every v ∈ ℝ+ and a measurable function of v ∈ ℝ+ for every t ∈ T. γ1(s) is a purely spatial variogram on D and γ2(t) is a purely temporal variogram on T. Then the corresponding covariance function of Z(s, t) is a convex combination of separable covariance functions. It is valid (see Ma, 2002, 2003) and generally nonseparable, and is given by

C(s, t) = ∫ C1(s; u) C2(t; v) dµ(u, v).   (3)

In particular, if we define U = X0 + X1 and V = X0 + X2, where X0, X1 and X2 are independent nonnegative random variables with finite moment generating functions M0, M1 and M2, respectively, then the resulting covariance function is given by

C(s, t) = σ² M0(−γ1(s) − γ2(t)) M1(−γ1(s)) M2(−γ2(t)),   (4)

where σ² = σ1σ2. For some interesting classes generated by this approach see Fonseca and Steel (2008). We now consider a more general process {Z̃(s, t); (s, t) ∈ D × T} defined by

Z̃(s, t) = Z̃1(s; U) Z̃2(t; V),   (5)

where

Z̃1(s; U) = √(1 − τ²) Z1(s; U)/√λ1(s) + τ ε(s)/√h(s),   (6)

where {λ1(s); s ∈ D} is a positively valued mixing process which is independent of ε(s) and Z1(s; u). {ε(s); s ∈ D} denotes an uncorrelated Gaussian process with zero mean and unitary variance which introduces a nugget effect parameterised by τ. {h(s); s ∈ D} is an uncorrelated process in ℝ+ with distribution Ph. The mixing process λ1(s) is spatially correlated and allows for regions in space with larger variance, while the process h(s) can create traditional outliers, i.e.
observations with unusually large nugget effects. We also want to allow heterogeneous observational variances in time, so we consider the following process in time:

Z̃2(t; V) = Z2(t; V)/√λ2(t),   (7)

where {λ2(t); t ∈ T} is a positive mixing process which is independent of Z2(t; v). The covariance function for the process {Z̃(s, t); (s, t) ∈ D × T} is given by

C̃(s, t) = Cov(Z̃(s0, t0), Z̃(s0 + s, t0 + t)) = E_{U,V}[C̃1(s; U) C̃2(t; V)],   (8)

where s0, s0 + s ∈ D and t0, t0 + t ∈ T, C̃1(s; u) = Cov(Z̃1(s0; u), Z̃1(s0 + s; u)) and C̃2(t; v) = Cov(Z̃2(t0; v), Z̃2(t0 + t; v)). Throughout, we assume independence between λ1(s), h(s) and λ2(t).

3 Scale mixing in space

In this section we consider scale mixing in the space dimension. This will account for individual outliers (through the process h(s)) and regions in space with larger observational variance (through the process λ1(s)). The latter is quite common, e.g. in meteorological applications where outliers are often associated with severe weather events such as tornados and hurricanes. Lu et al. (2007) pointed out that these events do not usually happen in a single location but cover an extended region. Initially, we consider the case where λ2(t) = 1, ∀t ∈ T. The results in Palacios and Steel (2006) are directly applicable to the purely spatial process Z̃1(s; u). The mixing process λ1(s) needs to be correlated to induce mean square continuity in the process Z̃1(s; u) for τ = 0, that is, we need to correlate the mixing variables so that locations that are close together will have very similar values of λ1(s). The simplest way to do this is to consider a common mixing variable λ1(s) = λ ∼ Pλ. Then we have that

C̃1(s; u) = (1 − τ²) E[λ⁻¹] C1(s; u) + τ² E[h⁻¹] I(s=0),   (9)

where E[h⁻¹] = E[h(s)⁻¹], ∀s, with h ∼ Ph.
Solving the integral in (8) and assuming throughout that σ² = σ2 and σ1 = 1, we obtain

C̃(s, t) = σ² M0(−γ1(s) − γ2(t)) M̃1(−γ1(s)) M2(−γ2(t)),   (10)

where M̃1(−γ1(s)) = (1 − τ²) E[λ⁻¹] M1(−γ1(s)) + τ² E[h⁻¹] I(s=0). Therefore the correlation structure for s ≠ 0 is given by

ρ̃(s, t) = C̃(s, t)/C̃(0, 0) = {E[λ⁻¹]/(E[λ⁻¹] + w² E[h⁻¹])} M0(−γ1(s) − γ2(t)) M1(−γ1(s)) M2(−γ2(t)),   (11)

where w² = τ²/(1 − τ²). When τ² = 0 (no nugget effect) the mixing does not affect the correlation structure, that is, ρ̃(s, t) = ρ(s, t), where ρ(s, t) = C(s, t)/C(0, 0) is the correlation function of {Z(s, t); (s, t) ∈ D × T}. For instance, if we take Pλ to be Ga(ν/2, ν/2), then the unconditional distribution of Z̃ = (Z̃(s1, t1), ..., Z̃(sI, tJ))′ is IJ-variate Student-t with ν degrees of freedom. Roislien and Omre (2006) present some characteristics of Student-t random fields. But setting λ1(s) = λ ∼ Pλ is an extreme situation and we would like to have individual mixing variables that account for spatial heterogeneity. Thus, we consider a process λ1(s)^(−1/2) that is mean square continuous, which implies that Z̃1(s; u) is also mean square continuous. This means that we need to satisfy the condition E[(λ1(si)^(−1/2) − λ1(si′)^(−1/2))²] → 0 as si → si′. The latter is satisfied for the log-Gaussian process proposed in Palacios and Steel (2006), where {ln(λ1(s)); s ∈ D} is a Gaussian process with mean −ν/2 and covariance structure νC1∗(·), with ν > 0 and C1∗(·) a valid correlation function. This implies a lognormal distribution for each λ1(si) with mean one and Var[λ1(si)] = exp(ν) − 1, so that the marginal distribution becomes more spread out as ν increases. For large ν the distribution also becomes more right-skewed with the mode shifting towards zero, allowing for substantial variance inflation for some spatial regions.
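The marginal moments of this mean-one log-Gaussian specification can be checked by simulation. The sketch below (our own illustration, not from the paper) draws independent variables with ln(λ1(si)) ∼ N(−ν/2, ν) and verifies E[λ1(si)] = 1 and E[λ1(si)⁻¹] = exp(ν), the moment that drives the variance inflation; the spatial correlation of ln λ1(s) does not affect these marginal moments, so i.i.d. draws suffice here.

```python
import math
import random

# Sketch (ours): Monte Carlo check of the marginal moments of the mean-one
# log-Gaussian mixing variables, ln(lambda_1(s_i)) ~ N(-nu/2, nu), which imply
# E[lambda_1] = 1 and E[1/lambda_1] = exp(nu).
def draw_lognormal_mixing(nu, n, seed=0):
    rng = random.Random(seed)
    return [math.exp(rng.gauss(-nu / 2.0, math.sqrt(nu))) for _ in range(n)]

nu = 0.5
lam = draw_lognormal_mixing(nu, 200_000)
mean_lam = sum(lam) / len(lam)                    # should be close to 1
mean_inv = sum(1.0 / x for x in lam) / len(lam)   # should be close to exp(0.5)
expected_inv = math.exp(nu)
```

With ν = 0.5 the simulated means agree with 1 and exp(0.5) ≈ 1.649 to within Monte Carlo error.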
The spatial covariance function is then given by

C̃1(s; u) = (1 − τ²) exp{(ν/4)[C1∗(s) − 1] + ν} C1(s; u) + τ² E[h⁻¹] I(s=0),   (12)

and the resulting covariance function for the spatiotemporal process obtained by solving (8) is given by (10) with M̃1(−γ1(s)) = (1 − τ²) exp{(ν/4)[C1∗(s) − 1] + ν} M1(−γ1(s)) + τ² E[h⁻¹] I(s=0). Therefore the correlation structure for s ≠ 0 is given by

ρ̃(s, t) = [exp{(ν/4)[C1∗(s) − 1] + ν}/(exp{ν} + w² E[h⁻¹])] M0(−γ1(s) − γ2(t)) M1(−γ1(s)) M2(−γ2(t)).   (13)

Throughout, we will use C1∗(s) = M1(−γ1(s)). If, in addition, we take h(s) = 1 we have the same model presented in Palacios and Steel (2006) for the space dimension. As commented in Palacios and Steel (2006), we could use a different correlation function C1∗(·) for the mixing process, but then we would need to estimate the parameters in C1∗(·) and this might not be easy on the basis of typically available data. The smoothness properties presented in Palacios and Steel (2006) and Fonseca and Steel (2008) extend to this framework. In the following, f^(q)(x) will denote the qth derivative of a function f(x) with respect to x.

Proposition 3.1 In the case without nugget effect and with C1∗(s) = M1(−γ1(s)), the purely spatial process {Z̃(s, t0); s ∈ D} at a fixed time point t0 ∈ T is m times mean square differentiable if and only if M1^(2m)(r) and γ1^(2m)(s) exist and are finite at 0.

3.1 Process h(s)

We define the process {h(s); s ∈ D} as an uncorrelated mixing process that allows for larger nugget effects. This accommodates traditional outlying observations. The process is uncorrelated with ε(s), Z1(s; U) and Z2(t; V). Aberrant observations are common in time series analysis and might also be encountered in processes observed in space. Therefore it is essential to consider this possibility when modelling phenomena over space-time.
It is important to understand the effect of outliers on the estimation of the parameters in the correlation structure, as this will directly affect the predictions. We consider the detection of outliers jointly with the estimation procedure. The variables hi = h(si), i = 1, ..., I are considered latent variables and their posterior distribution provides an indication of outlying observations. If the marginal posterior distribution of hi has a lot of mass close to 0, this indicates inflation of the scale τ² and therefore an outlying observation. We consider the following i.i.d. assumptions for hi, i = 1, ..., I, where νh > 0:

1. ln(hi) ∼ N(−νh/2, νh), that is, E[hi] = 1 and Var(hi) = exp(νh) − 1. If νh is close to 0 then the distribution of hi is very tight around 1.

2. hi ∼ Ga(1/νh, 1/νh), where Ga(a, b) denotes the Gamma distribution with density function f(x) = (b^a/Γ(a)) x^(a−1) exp(−bx), so that E[hi] = 1 and Var(hi) = νh. If νh is close to 0 then again the distribution of hi is very tight around 1.

In order to evaluate the tail behaviour of the finite dimensional distributions of the proposed process we consider the kurtosis, which is given by E[Z̃ij⁴]/E²[Z̃ij²], where Z̃ij = Z̃(si, tj), i = 1, ..., I and j = 1, ..., J. The kurtosis of the marginal finite dimensional distributions implied by the process defined in (5)-(6) in combination with a log-Gaussian λ1(s) is given by

kurt[Z̃ij] = 3{exp(3ν) + 2w² exp(ν) E[hi⁻¹] + w⁴ E[hi⁻²]}/{exp(ν) + w² E[hi⁻¹]}².   (14)

Notice that when τ² = 0, that is, if there is no nugget effect, then the kurtosis is given by 3 exp(ν), as in Palacios and Steel (2006). In the case of ln(hi) ∼ N(−νh/2, νh), E[hi⁻¹] = exp(νh) and E[hi⁻²] = exp(3νh), for any νh > 0. If instead hi ∼ Ga(1/νh, 1/νh), then E[hi⁻¹] = 1/(1 − νh) and E[hi⁻²] = 1/{(1 − νh)(1 − 2νh)}. The latter case requires that νh < 0.5.
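The kurtosis formula (14), together with the two sets of moments of hi⁻¹, is straightforward to evaluate numerically; a minimal sketch (the function name and interface are ours):

```python
import math

# Sketch of kurtosis formula (14); w^2 = tau^2 / (1 - tau^2).
def kurtosis_mixed(nu, tau2, nuh, h_dist="lognormal"):
    """Marginal kurtosis under log-Gaussian lambda_1 mixing (parameter nu) and
    nugget mixing h with parameter nuh (lognormal or Gamma)."""
    w2 = tau2 / (1.0 - tau2)
    if h_dist == "lognormal":
        e_inv, e_inv2 = math.exp(nuh), math.exp(3.0 * nuh)   # E[1/h], E[1/h^2]
    else:  # Gamma case, valid only for nuh < 0.5
        e_inv = 1.0 / (1.0 - nuh)
        e_inv2 = 1.0 / ((1.0 - nuh) * (1.0 - 2.0 * nuh))
    num = math.exp(3.0 * nu) + 2.0 * w2 * math.exp(nu) * e_inv + w2 ** 2 * e_inv2
    den = (math.exp(nu) + w2 * e_inv) ** 2
    return 3.0 * num / den

# without nugget mixing (tau^2 = 0) the kurtosis reduces to 3 exp(nu),
# and with nu = nuh = 0 we recover the Gaussian value 3
no_nugget = kurtosis_mixed(0.5, 0.0, 1.0)
target_no_nugget = 3.0 * math.exp(0.5)
gaussian_limit = kurtosis_mixed(0.0, 0.2, 0.0)
```

The two checks at the end reproduce the limiting cases mentioned in the text.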
Figure 1 shows the implied kurtosis for several values of w² for both models when ν = 0.5. The Gamma distribution for Ph looks less flexible, since it gives kurtosis very close to 3 exp(ν) for almost all values of νh; we need to go very close to νh = 0.5 to get larger values of the kurtosis coefficient. Without mixing through λ1(s), the kurtosis is an increasing function of νh. For instance, for the case of ln(hi) ∼ N(−νh/2, νh) we have that kurt[Z̃ij] = 3{1 + 2w² exp(νh) + w⁴ exp(3νh)}/{1 + 2w² exp(νh) + w⁴ exp(2νh)}, which is an increasing function of νh. On the other hand, for ν > 0 the kurtosis is not monotonic in νh, as illustrated in Figure 1. Notice that without mixing in the nugget effect the kurtosis is an increasing function of ν, and without either mixing we obtain kurt[Z̃ij] = 3 (Gaussian case).

Figure 1: Kurtosis for different values of w² and ν = 0.5. [Panels show the kurtosis against νh for w² ∈ {0.01, 0.05, 0.11, 0.18}; (a) hi lognormal, (b) hi Gamma.]

3.2 Parameterisation

In what follows we make particular choices for the variogram functions γ1(s), γ2(t) and the variables X0, X1 and X2 introduced in Section 2. Consider γ1(s) = ||s/a||^α and γ2(t) = |t/b|^β, with a, b > 0, α ∈ (0, 2] and β ∈ (0, 2]. Let X0 ∼ Ga(η0, 1) and X2 ∼ Ga(η2, 1), which results in Cauchy covariance functions, that is, M0(x) = (1 − x)^(−η0) and M2(x) = (1 − x)^(−η2), η0, η2 > 0. For X1 we consider the following distributions:

1. X1 ∼ Ga(η1, 1), resulting in the Cauchy function M1(x) = (1 − x)^(−η1), η1 > 0;

2. X1 ∼ InvGa(η1, 1), resulting in the Matérn function M1(x) = {(2√(−x))^η1/(2^(η1−1) Γ(η1))} Kη1(2√(−x)), η1 > 0.
See Stein (1999) for details of this class of covariance functions;

3. X1 ∼ GIG(η1, δ, δ), resulting in the Generalized Matérn function M1(x) = (1 − x/δ)^(−η1/2) Kη1(2δ√(1 − x/δ))/Kη1(2δ), η1 ∈ ℝ, δ > 0. See Shkarofsky (1968) for details of this class of covariance functions.

Here InvGa(η1, 1) denotes the Inverse Gamma distribution with density function f(x) = {1/Γ(η1)} x^(−η1−1) exp(−1/x), and GIG(η1, δ, δ) denotes the Generalized Inverse Gaussian distribution with density function f(x) = {1/(2Kη1(2δ))} x^(η1−1) exp{−(δx + δ/x)}. We use the correlation between the variables U and V as an indication of interaction between space and time components. This correlation is given by

c = η0/√{(η0 + V1)(η0 + η2)},   (15)

where V1 = Var(X1). Thus, 0 ≤ c ≤ 1 could be used as a measure of space-time interaction, with c = 0 indicating separability and c = 1 meaning high dependence between space and time. Notice that in the case X1 ∼ InvGa(η1, 1) the variance of X1 does not exist (unless η1 > 2 is imposed through the prior) and the dependence between space and time is then measured by

c̃ = η0/√{(η0 + Ṽ1(η1))(η0 + η2)},   (16)

where Ṽ1(η1) = (Q(0.75; η1) − Q(0.25; η1))² and Q(x; η1) is the quantile of X1 corresponding to 100x%.

3.3 Inference

Suppose we observe realizations z̃ij of Z̃ij ≡ Z̃(si, tj) at locations si, i = 1, ..., I and time points tj, j = 1, ..., J. Defining λ1 = (λ1(s1), ..., λ1(sI)), h = (h(s1), ..., h(sI)) and µ = (µ(s1, t1), ..., µ(sI, tJ))′ a location function, the likelihood function is given by

L1(θ, λ1, h; z̃) = fN(z̃ | µ, Σ),   (17)
where θ = (η0, η1, η2, α, β, a, b, σ², τ²) or θ = (δ, η0, η1, η2, α, β, a, b, σ², τ²), depending on which distribution we choose for X1, fN(·|µ, Σ) denotes the multivariate Gaussian density function with mean µ and covariance matrix Σ, and Σkk′ = Cov[(Z̃)k, (Z̃)k′], where

Cov(Z̃ij, Z̃i′j′) = σ² M0(−γ1 − γ2) {(1 − τ²) M1(−γ1)/√(λ1i λ1i′) + τ² I(si = si′)/√(hi hi′)} M2(−γ2),

i, i′ = 1, ..., I, j, j′ = 1, ..., J, γ1 = γ1(si − si′), γ2 = γ2(tj − tj′), λ1i = λ1(si) and (Z̃)k is the kth element of the IJ-dimensional vector Z̃. Note that Gaussian behaviour is only assumed given the mixing variables λ1 and h. Integrating out with respect to these mixing variables leads to non-Gaussian distributions. We augment with the latent variables λ1 and h in order to identify possible regions with larger observational variance and/or traditional outliers. The vector ln(λ1) is multivariate normally distributed with mean −ν/2 and covariance matrix with elements νM1(−γ1(si − si′)). And ln(hi), i = 1, ..., I is either normal with mean −νh/2 and variance νh, or hi ∼ Ga(1/νh, 1/νh). We use stochastic simulation via MCMC to obtain an approximation of the posterior distribution of (θ, λ1, h). We obtain samples from the target distribution p(θ, λ1, h|z) by successive generations from the full conditional distributions. More specifically, we adopt a hybrid Gibbs sampler scheme with Metropolis-Hastings steps. We use random walk proposals to generate values of λ1 and h. We also consider groups in space in order to block the sampler. For a more elaborate algorithm see Palacios and Steel (2006). Model comparison is conducted on the basis of Bayes factors. These are computed from the MCMC output using methods to approximate the marginal predictive density of z.
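Given the mixing variables, one element of Σ is cheap to evaluate. The sketch below (our own illustration; the function name, argument layout and unit defaults are assumptions) uses the Cauchy choices of Subsection 3.2 for M0, M1 and M2:

```python
import math

# Sketch: one element of Sigma conditional on the mixing variables lambda_1 and h,
# with Cauchy-type MGFs M0, M1, M2 from Subsection 3.2. Names and defaults are ours.
def cov_element(s_dist, t_lag, lam_i, lam_ip, h_i, h_ip,
                sigma2, tau2, a, b, alpha, beta, eta0, eta1, eta2):
    g1 = (s_dist / a) ** alpha          # gamma_1(s_i - s_i')
    g2 = (abs(t_lag) / b) ** beta       # gamma_2(t_j - t_j')
    m0 = (1.0 + g1 + g2) ** (-eta0)     # M0(-g1 - g2) for X0 ~ Ga(eta0, 1)
    m1 = (1.0 + g1) ** (-eta1)          # M1(-g1)
    m2 = (1.0 + g2) ** (-eta2)          # M2(-g2)
    smooth = (1.0 - tau2) * m1 / math.sqrt(lam_i * lam_ip)
    # exact float comparison is fine for this sketch: the indicator I(s_i = s_i')
    nugget = tau2 / math.sqrt(h_i * h_ip) if s_dist == 0.0 else 0.0
    return sigma2 * m0 * (smooth + nugget) * m2

# with all mixing variables equal to 1, the variance at a single site/time is sigma^2
var_same_point = cov_element(0.0, 0.0, 1, 1, 1, 1, 2.0, 0.3, 1, 1, 1, 1, 1, 1, 1)
```

Large λ1i or hi shrink the corresponding contribution, which is exactly how the latent variables flag regions of inflated variance or outliers.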
In previous simulation studies (Fonseca and Steel, 2008) we noticed that the estimator p4 of Newton and Raftery (1994) (with their d as small as 0.01), the optimal bridge sampling approach of Meng and Wong (1996), and the shifted Gamma estimator proposed by Raftery et al. (2007) (with values of their λ1 close to one) give essentially the same results, especially the last two.

3.4 Prediction and interpolation

Gaussian model

Suppose we are interested in predicting Zp at a location sp and time point tp, where sp is not necessarily included within the sampling design. Under a Bayesian approach, the prediction of Zp is based on the posterior predictive distribution P(Zp|Zo), where Zo are the available observations of the process Z(s, t). We have that

P(Zp|Zo) = ∫ P(Zp|Zo, θ) P(θ|Zo) dθ.   (18)

Under the Gaussian model, (Zo, Zp|θ) has a multivariate Gaussian distribution. Using the properties of conditional distributions in the Gaussian family, we obtain the mean and the variance for Zp|Zo, θ given by

µ̄ = µp + Σpo Σoo⁻¹ (zo − µo)   (19)

and

Σ̄ = Σpp − Σpo Σoo⁻¹ Σop.   (20)

Suppose we have θ(1), ..., θ(M) generated from the posterior distribution of θ|Zo (by the MCMC sampler). Then we approximate the predictive distribution of Zp by averaging over P(Zp|Zo, θ(k)), for k = 1, ..., M.

Non-Gaussian model

Suppose we have samples from the distribution of θ, λ1o|Zo. We need to sample λ1p|θ, λ1o, Zo and then use the Gaussian sampler just described to sample Zp|Zo, θ, λ1p, λ1o. Notice that p(λ1p|θ, λ1o, Zo) ≡ p(λ1p|θ, λ1o), where

(ln λ1o, ln λ1p)′|θ ∼ N(−ν/2, νM1),   (21)

with M1ij = M1(−γ1(si − sj)). Thus, given λ1o and θ we can easily generate values of λ1p. Analogously, we deal with λ2(t), discussed in the next section. Taking into account h(s) is trivial due to the independence of the hi's.
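The conditioning step in (19)-(20) can be written down explicitly in the simplest, univariate case, where Σoo is a scalar; a minimal sketch (the function name is ours):

```python
# Univariate sketch of (19)-(20): conditional mean and variance of Z_p given a
# single observed value Z_o = z_o; the function name and interface are ours.
def predict_gaussian(mu_p, mu_o, var_p, var_o, cov_po, z_o):
    mean = mu_p + (cov_po / var_o) * (z_o - mu_o)   # (19) with scalar Sigma_oo
    var = var_p - cov_po ** 2 / var_o               # (20)
    return mean, var

# example: standard margins with covariance 0.5 and observation z_o = 2
mean, var = predict_gaussian(0.0, 0.0, 1.0, 1.0, 0.5, 2.0)
```

Averaging such conditionals over posterior draws θ(1), ..., θ(M) gives the Monte Carlo approximation to the predictive distribution (18) described in the text.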
Predictive model comparison

In order to check the predictive accuracy of each model we use a predictive scoring rule. Scoring rules provide summaries for the evaluation of probabilistic forecasts by comparing the predictive distribution with the actual value observed for the process. For more details about scoring rules see Gneiting and Raftery (2007). In particular, we use the log predictive score (LPS) based on the predictive distribution p (which can be multivariate) and on the observed value z,

LPS(p; z) = − ln(p(z)).   (22)

The smaller the LPS is, the better the model does in forecasting Zp.

4 Scale mixing in time

We also want to consider the case where λ2(t) ≠ 1. Let {λ2(t); t ∈ T} be a mixing process in time. Notice that the process {Y(s, t); (s, t) ∈ D × T}, where

Y(s, t) = λ2(t)^(1/2) Z̃(s, t),   (23)

is exactly the process proposed in Section 3. We derive properties and conduct inference by using this fact. We will not consider the case with a temporal nugget effect, but the model presented here can be easily extended to allow for a nugget effect in time as done in the spatial dimension. Defining λ2 = (λ2(t1), ..., λ2(tJ)) and using (17) and (23), the resulting likelihood function is given by

L(θ, λ1, h, λ2; z̃) = L1(θ, λ1, h; Λ2 z̃) ∏_{j=1}^{J} λ2j^(I/2),   (24)

where Λ2 = diag((λ21^(1/2), ..., λ2J^(1/2))′ ⊗ 1I), λ2j = λ2(tj) and 1I is a vector of ones of size I. We use in the time dimension the same kind of mixing process used in space, i.e. {ln(λ2(t)); t ∈ T} is a Gaussian process with mean −ν2/2 and covariance structure ν2C2∗(·), with ν2 > 0 and C2∗(·) a valid correlation function.
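The Jacobian factor in (24) comes from the change of variables y = Λ2 z̃. A univariate numerical check (our own sketch, with a standard normal Y standing in for the Section 3 process) shows the two ways of evaluating the density of an observed z agree:

```python
import math

# Sketch: univariate check of the change of variables behind (24).
# With Y = sqrt(lam2) * Z, the density of an observed z equals the density of
# y = sqrt(lam2) * z under the Y-model times the Jacobian lam2^(1/2).
def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

lam2, z_obs = 4.0, 0.3
direct = normal_pdf(z_obs, 0.0, 1.0 / lam2)   # Z ~ N(0, 1/lam2) directly
via_jacobian = normal_pdf(math.sqrt(lam2) * z_obs, 0.0, 1.0) * math.sqrt(lam2)
```

In (24) the same factor appears I times per time point, giving the product of λ2j^(I/2) terms.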
The temporal covariance function is given by

C̃2(t; v) = exp{(ν2/4)[C2∗(t) − 1] + ν2} C2(t; v),   (25)

and the resulting covariance function for the spatiotemporal process obtained by solving (8) is given by

C̃(s, t) = σ² M0(−γ1(s) − γ2(t)) M̃1(−γ1(s)) M̃2(−γ2(t)),   (26)

where M̃1(−γ1(s)) = (1 − τ²) exp{(ν/4)[C1∗(s) − 1] + ν} M1(−γ1(s)) + τ² E[h(s)⁻¹] I(s=0) and M̃2(−γ2(t)) = exp{(ν2/4)[C2∗(t) − 1] + ν2} M2(−γ2(t)). Therefore, the correlation structure for s ≠ 0 is given by

ρ̃(s, t) = [exp{(ν/4)[C1∗(s) − 1] + ν + (ν2/4)[C2∗(t) − 1] + ν2}/({exp{ν} + w² E[h(s)⁻¹]} exp{ν2})] M0(−γ1(s) − γ2(t)) M1(−γ1(s)) M2(−γ2(t)),   (27)

where we set C1∗(s) = M1(−γ1(s)) and C2∗(t) = M2(−γ2(t)). This scale mixing in time will capture periods in time with larger observational variance, which can be seen as a way to address the issue of volatility clustering, quite a common occurrence in e.g. financial time series data. This aspect of our model is reminiscent of a stochastic volatility model used in this literature. Smoothness properties, such as the one in Proposition 3.1, are easily derived for the temporal process. Thus, if C2∗(t) = M2(−γ2(t)), the purely temporal process {Z̃(s0, t); t ∈ T} at a fixed location s0 ∈ D is m times mean square differentiable if and only if M2^(2m)(r) and γ2^(2m)(t) exist and are finite at 0. To summarize, the full model, with scale mixing through λ1, λ2 and h, will be able to accommodate smooth spatial heterogeneity in the variance (through λ1), gradual temporal changes in the variance (through λ2) and spatial outliers (through the fat-tailed distribution of the nugget effect, induced by mixing with h).

5 Empirical Results

Throughout, the prior distribution for each of the hyperparameters ν, νh and ν2 is Ga(1, 5), assigning a large probability mass to values close to zero, which indicate the simpler cases (without scale mixing, i.e.
no spatial or temporal heterogeneity or fat tails). The prior distribution for the remaining parameters is mildly informative, as proposed and discussed in Fonseca and Steel (2008).

5.1 Simulation results

We have analysed a substantial number of generated datasets, with and without perturbations with respect to Gaussianity. This has illustrated that the priors are reasonable and the inference methods are reliable and efficient. In addition, we can successfully identify outliers and regions of increased variance in time and space. In fact, we are able to separate these effects even when they are all present in the data simultaneously. The use of Bayes factors leads to sensible model choices, in line with the way the data were generated.

5.2 Application to temperature data

We now present an application to the maximum temperatures recorded daily in July of 2006 (J = 31) in 70 locations within the Spanish Basque Country. We consider I = 67 of these locations for estimation of the parameters and we leave out 3 locations for predictive comparison. Figure 2 (a)-(b) presents the boxplots for the maximum temperature (in degrees centigrade) over space and time, respectively. Notice in Figure 2 (c) and (d) that the empirical variance over time and space is far from constant, which suggests that a simple Gaussian model with constant variance might be unsuitable. In order to model nonstationarities in the mean of the process we considered a mean function that depends on the spatial coordinates and on time. In addition, the region considered is quite mountainous, with altitudes ranging from 0 to 1188 meters; therefore altitude (x) was included in the mean function as a covariate. The resulting mean function is given by

µ(s, t) = δ0 + δ1 s1 + δ2 s2 + δ3 x + δ4 t + δ5 t².   (28)

The covariance model considered here is nonseparable, as presented in (4), allowing for interactions between space and time.
The model is parameterized as in Subsection 3.2 and the chosen covariance in space is of the Cauchy type. The parameters estimated are the trend coefficients (δ0, ..., δ5), the covariance parameters (η0, α, β, a, b, σ², τ²) and the mixing parameters (ν, ν2, νh). We also generate the auxiliary variables (λ1, λ2, h) in our MCMC algorithm, as described in Subsection 3.3, in order to identify regions in space and time with larger observational variance. The parameters η1 and η2 are set to 1. See Fonseca and Steel (2008) for a discussion of parameterisation in this class of covariance models. For the mixing on the nugget effect we only consider lognormally distributed hi, as this seems the most flexible option (see the discussion at the end of Subsection 3.1). In order to calculate the likelihood function we need to invert a matrix of dimension 2077 × 2077. Since this is very time demanding, we approximate the likelihood by using conditional distributions, as described in Stein et al. (2004). In summary, we consider a partition of Z into subvectors Z1, ..., Z31, where Zj = (Z(s1, tj), ..., Z(s67, tj))′, and we define Z(j) = (Zj−L+1, ..., Zj). Then, taking φ = (θ, λ1, λ2, h), we use

p(z|φ) ≈ p(z1|φ) ∏_{j=2}^{31} p(zj|z(j−1), φ).   (29)

Since we have a natural time ordering of Z, this means the distribution of Zj will only depend on the observations in space for the previous L time points. In this application we used L = 5 to make the MCMC feasible. We checked that this approximation is quite accurate for these data. We estimated the parameters for the following models: the Gaussian model, the non-Gaussian model with λ1 only, the non-Gaussian model with h, the non-Gaussian model with h and λ1, the non-Gaussian model with λ2 only, the non-Gaussian model with λ2 and h, the non-Gaussian model with λ1 and λ2, and finally the non-Gaussian model with h, λ1 and λ2.
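The structure of the conditional-distribution approximation (29) can be illustrated in a stripped-down, univariate Gaussian time-series setting (our own sketch; all names are ours, and the pure-Python Cholesky is only there to keep the example self-contained). We use an AR(1) covariance because the L-lag conditioning is then exact for any L ≥ 1, which provides a check on the implementation:

```python
import math

# Sketch of approximation (29) for a zero-mean Gaussian time series with
# stationary covariance function cov(k): each z_j is conditioned only on the
# previous L_lag observations.
def chol(A):
    # Cholesky factor (lower triangular) of a small SPD matrix
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def chol_solve(L, b):
    # solve A x = b given A = L L^T, by forward then backward substitution
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gauss_logpdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def approx_loglik(z, cov, L_lag):
    ll = gauss_logpdf(z[0], 0.0, cov(0))
    for j in range(1, len(z)):
        past = list(range(max(0, j - L_lag), j))
        S = [[cov(p - q) for q in past] for p in past]  # covariance of past block
        c = [cov(j - p) for p in past]                  # cross-covariances with z_j
        w = chol_solve(chol(S), c)
        mean = sum(wi * z[p] for wi, p in zip(w, past))
        var = cov(0) - sum(wi * ci for wi, ci in zip(w, c))
        ll += gauss_logpdf(z[j], mean, var)
    return ll

# AR(1): Markov, so any L_lag >= 1 gives the exact likelihood and lags must agree
phi = 0.6
def ar1_cov(k):
    return phi ** abs(k) / (1.0 - phi ** 2)

z = [0.3, -1.2, 0.5, 2.0, -0.7]
ll1 = approx_loglik(z, ar1_cov, 1)
ll3 = approx_loglik(z, ar1_cov, 3)
```

In the application, each "observation" zj is itself a 67-dimensional spatial vector and the conditioning covariances come from (26), but the looping structure is the same.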
Notice that all the models considered here have a nugget effect in space, parameterized by τ² > 0, that accounts for measurement errors and small-scale variation.

The estimated Bayes factors presented in Table 1 indicate that the non-Gaussian models are much more adequate for this dataset than the Gaussian model. The complete model, which includes λ1, λ2 and h, is by far the best one according to both estimators. As expected on the basis of the large variations in empirical variance over time and space (see Figure 2), the models that include both λ1 and λ2 perform well.

Table 2 presents some posterior summaries for the parameters of interest. The models with λ1 tend to give rather different results for the smoothness parameter in space α, the range in space a and the nugget effect τ² (not reported). In particular, models with λ1 tend to suggest rougher processes, with smaller values of α, than the models without λ1. This may be related to the use of the same covariance structure for the processes λ1(s) and Z1(s; u); it may be the case that the process λ1(s) is rougher than the process Z1(s; u), but estimation of a different covariance structure for λ1 would probably be too much to ask from the data. Furthermore, h seems to capture some of this roughness, as the models with λ1 and also h have larger estimates of α. Inference on the separability measure c in (15) is relatively unaffected by the model choice and indicates that the data are fairly close to separable. Notice that the process is very rough in time, with very small values of β; for β < 1 the process is not even mean square continuous in time.

The posterior distributions of the parameters of the mixing distributions, which drive the tail behaviour, are depicted in Figure 3. It is clear that the posteriors are very different from the priors (indicated by dashed lines), and point very strongly towards non-Gaussian behaviour.
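The marginal likelihood estimators behind Table 1 are based on the harmonic mean identity. The sketch below shows only the basic (unstabilised) harmonic-mean form of the Newton and Raftery (1994) estimator, computed from posterior draws of the log-likelihood; the shifted-gamma and d-mixture variants actually used in the paper add stabilisation on top of this idea, and the function name here is illustrative:

```python
import numpy as np
from scipy.special import logsumexp

def log_marglik_harmonic(loglik_draws):
    """Basic harmonic-mean estimator of the log marginal likelihood:
    log p(y) ~ -log( (1/M) * sum_m exp(-loglik_m) ),
    where loglik_m = log p(y | theta_m) for posterior draws theta_m.
    Computed stably in log space; known to have high variance in practice,
    which is why stabilised variants are preferred."""
    ll = np.asarray(loglik_draws, dtype=float)
    return -(logsumexp(-ll) - np.log(ll.size))
```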
Remember that the Gaussian model corresponds to the limiting case where all ν parameters tend to zero. The posteriors clearly suggest that all three forms of scale mixing in the model are supported by the data, especially for the smooth spatial process. Of course, this is in line with the Bayes factors discussed earlier. Inference on the coefficients in the mean function (28) shows that altitude is an important covariate, with a similar effect for all the models (in particular, it indicates a drop in mean level of about 0.8 of a degree centigrade per 100 meters of altitude).

In the most complete model, the variance of Z̃(si, tj) is

Var[Z̃(si, tj)] ≡ σ²ij = (σ²/λ2j) [ (1 − τ²)/λ1i + τ²/hi ],

from which we can deduce the variance structures over stations and time points. In particular, if we marginalise over space by assigning the spatial mixing variables an "average" constant value, which we can take to be the prior mean, i.e. λ1i = hi = 1, then we can trace the temporal evolution of the variance as σ²/λ2j. Similarly, if we assume λ2j = 1, we can decompose the spatial variance into a part for the correlated process, σ²(1 − τ²)/λ1i, and a nugget part, σ²τ²/hi.

Figure 5 gives the posterior boxplots for the variance evolution over time and space for the full model, whereas Figure 4 does the same for the model with only h and λ2 (here the constant contribution to the variance of the spatially smooth process is not depicted). Notice the clear separation of the effects, showing that stations 17, 18 and 66 (both panels (b)) are outliers, whereas some regions in time have larger observational variance (panels (a)). The 3 stations identified as outliers are all located in the north but quite far from each other, indicating that this is not a case of a region in space with larger variance. The temporal variance pattern in both models is roughly in line with the empirical variances over time shown in Figure 2 (d).
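The variance decomposition above translates directly into code. A minimal sketch (function and argument names are illustrative), computing the full I × J variance surface together with its two spatial components:

```python
import numpy as np

def variance_surface(sigma2, tau2, lam1, lam2, h):
    """Variance decomposition of the full model:
    sigma2_ij = sigma^2 [ (1 - tau^2)/lam1_i + tau^2/h_i ] / lam2_j.
    lam1, h : length-I spatial mixing variables; lam2 : length-J temporal ones.
    Returns the I x J matrix of variances plus the two per-station components."""
    lam1, lam2, h = map(np.asarray, (lam1, lam2, h))
    smooth = sigma2 * (1.0 - tau2) / lam1        # correlated-process part, by station
    nugget = sigma2 * tau2 / h                   # nugget part, by station
    total = (smooth + nugget)[:, None] / lam2[None, :]
    return total, smooth, nugget
```

Setting all mixing variables to their prior mean of 1 recovers the constant-variance Gaussian case, while λ1i < 1 or hi < 1 inflates the variance at station i and λ2j < 1 inflates it at time j.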
The component containing λ1 (Figure 5 (c)) does indicate appreciable differences in variances over spatial regions, in line with the Bayes factor in favour of this model and the posterior inference on ν. The Gaussian model is, of course, not as flexible, assuming just one global variance, with posterior 95% credible interval given by (8.64, 17.03). Figure 6 presents the total posterior variance σ²ij for all time points at stations 1 (small nugget effect) and 17 (large nugget effect) using the complete model, while horizontal lines indicate the posterior credible interval for the variance under the Gaussian model. This clearly illustrates the inadequacy of the Gaussian model in adapting to the temporal changes in the variance. For instance, for station 1 and time points 5 to 10 it is clear that the Gaussian model overestimates the variance, and for time point 31 it is clear that the variance should be larger than the one estimated by the Gaussian model. This underestimation is even more pronounced for location 17, where the nugget effect is more important, and even time period 22 seems problematic then.

In view of the lack of smoothness of the process in time, we do not present any predictions in time, but we will conduct interpolation to unobserved sites. In our case, we left some stations out of the estimation sample in order to compare the predictions with the actually observed maximum temperatures. Table 3 shows the estimated value of the LPS as in (22), using the Gaussian and non-Gaussian models, for the 3 stations left out of the estimation step. The non-Gaussian models that include λ2 predict better than the Gaussian model; especially in the tails, the extreme events are predicted in a much more adequate way.
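A log predictive score of this kind can be estimated by averaging the conditional predictive density over posterior draws. The sketch below assumes Gaussian conditional predictives per draw, which is consistent with scale mixing of Gaussian processes, though the function name and interface are illustrative rather than the paper's implementation:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_predictive_score(y_obs, pred_mean, pred_sd):
    """LPS for K held-out observations, mixing the per-draw Gaussian
    predictive densities over M posterior draws:
    LPS = -sum_k log( (1/M) sum_m N(y_k ; mu_mk, sd_mk) ).
    pred_mean, pred_sd : (M, K) arrays of per-draw predictive moments.
    Lower values indicate better predictive performance."""
    logdens = norm.logpdf(np.asarray(y_obs)[None, :], loc=pred_mean, scale=pred_sd)
    M = pred_mean.shape[0]
    log_mix = logsumexp(logdens, axis=0) - np.log(M)   # Rao-Blackwellised density
    return -np.sum(log_mix)
```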
This can be seen in Figure 7, which plots the predictions (medians and 95% credible intervals) obtained for various models against the actual observations at the out-of-sample stations (labelled 1∗ to 3∗) for all 31 time periods. Station 1∗ is very close to other stations and even the Gaussian model gives reasonable predictions. The non-Gaussian models provide interval widths that are similar to those of the Gaussian model on average, but the intervals are smaller for many points in time and larger in the tails (e.g. j = 31), where the posterior prediction interval for the Gaussian model misses the observed value of the process at stations 2∗ and 3∗. This suggests that scale mixing in time is essential in this application in order to produce good interpolations in space, especially for extreme temperatures.

In order to verify how the Gaussian and non-Gaussian models would predict in other regions of the spatial domain, we repeated the estimation and prediction steps for 5 different partitions of the data, selecting at random three testing locations and leaving the remaining 67 for estimation. The results obtained for the Gaussian model and the 2 best non-Gaussian models are presented in Table 4. The model with λ2 and h has the smallest average log predictive score, followed by the model with only λ2. In summary, whereas Bayes factors favour the model with all three mixing mechanisms, the model with smoothly varying heterogeneous variance in time and a fat-tailed nugget effect is the one that does best in out-of-sample predictions.

Model                        h     λ1    λ1 & h   λ2    λ2 & h   λ1 & λ2   λ1, h & λ2
Shifted Gamma estimator     172   148    345     138    279      417        547
Newton-Raftery estimator    115   116    227     116    203      302        327

Table 1: Log Bayes factor in favour of the model in the column versus the Gaussian model, using the Shifted Gamma (λ = 0.98) and Newton-Raftery (d = 0.01) estimators for the marginal likelihood.
Figure 2: Data summaries. (a) Boxplots of maximum temperature (degrees centigrade) over space, by station; (b) boxplots of maximum temperature over time, by day; (c) empirical variance for each location; (d) empirical variance for each time point.

Model          a               b              c              α              β
Gaussian       (0.99, 2.04)    (3.50, 8.13)   (0.00, 0.15)   (1.31, 1.83)   (0.33, 0.50)
h              (0.99, 1.98)    (3.29, 8.22)   (0.01, 0.13)   (1.39, 1.74)   (0.31, 0.48)
λ1             (4.17, 24.08)   (3.49, 7.94)   (0.01, 0.13)   (0.58, 0.94)   (0.38, 0.55)
λ1 & h         (1.44, 3.15)    (3.30, 7.70)   (0.01, 0.12)   (1.20, 1.55)   (0.36, 0.53)
λ2             (0.95, 2.01)    (4.28, 9.74)   (0.01, 0.10)   (1.35, 1.84)   (0.32, 0.46)
λ2 & h         (1.10, 2.33)    (3.33, 7.88)   (0.01, 0.13)   (1.29, 1.68)   (0.30, 0.45)
λ1 & λ2        (7.28, 21.20)   (0.78, 4.41)   (0.01, 0.15)   (0.75, 1.00)   (0.29, 0.52)
λ1, h & λ2     (6.36, 18.14)   (0.39, 2.58)   (0.01, 0.20)   (0.89, 1.11)   (0.31, 0.68)

Table 2: 95% posterior credible intervals for some parameters of interest.

Figure 3: Prior and posterior densities of the parameters of the mixing distributions for the model with h, λ1 and λ2: (a) ν, (b) νh, (c) ν2. Prior densities are given by dashed lines and posterior densities by solid lines.

Figure 4: Posterior boxplots of the variance structure over time and space corresponding to the model with h and λ2, where σ²ij = σ²[(1 − τ²) + τ²/hi]/λ2j: (a) temporal evolution σ²/λ2j, j = 1, ..., 31; (b) spatial nugget effect σ²τ²/hi, i = 1, ..., 67.

Figure 5: Posterior boxplots of the variance structure over time and space for the model with h, λ1 and λ2, where σ²ij = σ²[(1 − τ²)/λ1i + τ²/hi]/λ2j: (a) temporal evolution σ²/λ2j, j = 1, ..., 31; (b) spatial nugget effect σ²τ²/hi, i = 1, ..., 67; (c) smooth spatial process σ²(1 − τ²)/λ1i, i = 1, ..., 67.

Figure 6: Posterior boxplots of the total variance for two stations at each time point, corresponding to the model with h, λ1 and λ2, where σ²ij = σ²[(1 − τ²)/λ1i + τ²/hi]/λ2j. The left panel is for station 1 (small nugget effect) and the right panel for station 17 (large nugget effect). Horizontal lines indicate the 95% credible interval for the variance of the Gaussian model.
Figure 7: Posterior predictive median of the maximum temperature versus the observed maximum temperatures (points) at the out-of-sample stations 1∗, 2∗ and 3∗ (columns) for times j = 1, ..., 31, for the Gaussian model (top row), the model with λ2 (middle row) and the model with λ2 & h (bottom row). The dashed lines are the 95% credible predictive intervals, and the solid line indicates y = x.

model          LPS
Gaussian        97.25
h              112.56
λ1             107.43
λ1 & h         117.20
λ2              76.73
λ2 & h          77.60
λ1 & λ2         96.35
λ1, h & λ2      90.30

Table 3: Log predictive score (LPS) for the predicted maximum temperature at the out-of-sample stations.

model       LPS
Gaussian    88.70
λ2          86.64
λ2 & h      80.59

Table 4: Average log predictive score (LPS) for the predicted maximum temperature at the out-of-sample stations over 5 partitions of the data set.
6 Conclusions and future work

We present a non-Gaussian spatiotemporal model that is able to capture departures from Gaussianity in terms of outlier contamination and regions in space or time with larger observational variance. The proposed finite dimensional distributions have heavier tails than the normal distribution and have the normal as a limiting case. The general model includes correlated mixing in the spatiotemporal process (both in time and space) and in the nugget effect. This model is quite flexible, combining nonseparability and non-Gaussian behaviour, and performed well on simulated data. This was also illustrated in an application to Spanish temperature data, where simultaneous mixing in the nugget effect and in time seems an essential feature. Prediction is straightforward, using the fact that we are scale mixing Gaussian processes, and efficient MCMC algorithms for posterior and predictive inference also immediately allow for the identification of outliers and regions in time and space with inflated variances.

As a topic of future research, it might be interesting to explore the effect of using C1∗(s) = M1(−γ1(s)) and C2∗(t) = M2(−γ2(t)) in the model for posterior and predictive inference, and to investigate ways of separately modelling correlation structures for the mixing variables and the observables. In the simulated examples we examined here this did not seem to be an important restriction, but it might be for some applications.

Appendix

A Proof of Proposition 3.1

The Gaussian-Log-Gaussian process {Z̃(s, t); (s, t) ∈ D × T}, as defined in (5) and (10) with no nugget effect, can be rewritten as

Z̃(s, t) = λ1(s)^{−1/2} Z1(s; U) Z2(t; V) = λ1(s)^{−1/2} Z(s, t).

Then for a fixed time point t0 we have C̃(s, 0) = σ²C(s, 0)f(s), where f(s) = exp{(ν/4)[M1(−γ1(s)) − 1] + ν} and C(s, 0) = σ²M0(−γ1(s))M1(−γ1(s)). Therefore,

C̃^(2m)(s, 0) = Σ_{i=0}^{2m} (2m choose i) C^(i)(s, 0) f^(2m−i)(s).

By Faà di Bruno's formula, termwise differentiation of C(s, 0) results in

C^(2m)(s, 0) = Σ_A [(2m)! / (k1! k2! ... k2m!)] y^(k)(−γ1(s)) Π_{kj ≠ 0} (−γ1^(j)(s)/j!)^{kj}
             = {y^(1)(−γ1(s))[γ1^(2m)(s)] + ... + y^(2m)(−γ1(s))[γ1^(1)(s)]^{2m}},   (30)

where A = {k1, k2, ..., k2m : k1 + 2k2 + ... + 2m k2m = 2m}, k = k1 + k2 + ... + k2m, ki ≥ 0, i = 1, 2, ..., 2m, and y^(k)(x) = Σ_{i=0}^{k} (k choose i) M0^(i)(x) M1^(k−i)(x).

In expression (30), the highest order derivative of γ1(s) is 2m, obtained when k2m = 1 and k1 = ... = k2m−1 = 0. Thus, in this case, the behaviour of C^(2m)(s, 0) as s → 0 depends only on the local behaviour of γ1^(2m)(s) and M1^(1)(r) as s, r → 0. The highest order derivative of M1(−γ1(s)) is 2m, obtained when i = 2m; thus, the behaviour of C^(2m)(s, 0) as s → 0 depends only on the local behaviour of M1^(2m)(r) and γ1^(1)(s) as s, r → 0. Notice that M0^(k)(0) = E[X0^k] always exists, since X0 ∼ Gamma(λ0, 1) in our model.

This proves that C^(2m)(s, 0) exists and is finite at 0 if and only if M1^(2m)(r) and γ1^(2m)(s) exist and are finite at 0. Also, f^(2m)(s) = (ν/4) f(s){M1^(2m)(−γ1(s)) + g(ν, s)} exists and is finite under the same conditions (see Palacios and Steel, 2006). The result follows.

References

Cressie, N. and Hawkins, D. M. (1980). "Robust Estimation of the Variogram: I." Mathematical Geology, 12, 115–125.

Damian, D., Sampson, P. D., and Guttorp, P. (2001). "Bayesian estimation of semi-parametric non-stationary spatial covariance structures." Environmetrics, 12, 161–178.

— (2003). "Variance modeling for nonstationary spatial processes with temporal replications." Journal of Geophysical Research, 108, 1–12.

De Oliveira, V., Kedem, B., and Short, D. A. (1997). "Bayesian Prediction of Transformed Gaussian Random Fields." Journal of the American Statistical Association, 92, 1422–1433.

Fonseca, T. C. O. and Steel, M. F. J. (2008).
"A New Class of Nonseparable Spatiotemporal Models." CRiSM Working Paper 08-13, University of Warwick.

Gneiting, T. and Raftery, A. E. (2007). "Strictly proper scoring rules, prediction and estimation." Journal of the American Statistical Association, 102, 360–378.

Lu, C., Kou, Y., Zhao, J., and Chen, L. (2007). "Detecting and tracking regional outliers in meteorological data." Information Sciences, 177, 1609–1632.

Ma, C. (2002). "Spatio-Temporal Covariance Functions Generated by Mixtures." Mathematical Geology, 34, 965–975.

— (2003). "Spatio-Temporal Stationary Covariance Models." Journal of Multivariate Analysis, 86, 97–107.

Meng, X. L. and Wong, W. H. (1996). "Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration." Statistica Sinica, 6, 831–860.

Newton, M. A. and Raftery, A. E. (1994). "Approximate Bayesian Inference With the Weighted Likelihood Bootstrap." Journal of the Royal Statistical Society Series B, 56, 3–48.

Palacios, M. B. and Steel, M. F. J. (2006). "Non-Gaussian Bayesian Geostatistical Modeling." Journal of the American Statistical Association, 101, 604–618.

Raftery, A. E., Newton, M. A., Satagopan, J. M., and Krivitsky, P. N. (2007). "Estimating the Integrated Likelihood via Posterior Simulation Using the Harmonic Mean Identity." In Bayesian Statistics 8, 371–416. Oxford University Press.

Roislien, J. and Omre, H. (2006). "T-distributed Random Fields: A Parametric Model for Heavy-tailed Well-log Data." Mathematical Geology, 38, 821–849.

Shkarofsky, I. P. (1968). "Generalized Turbulence Space-Correlation and Wave-Number Spectrum-Function Pairs." Canadian Journal of Physics, 46, 2133–2153.

Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York.

— (2009). "Spatial interpolation of high-frequency monitoring data." Annals of Applied Statistics, 272–291.

Stein, M. L., Chi, Z., and Welty, L. J. (2004).
"Approximating Likelihoods for Large Spatial Data Sets." Journal of the Royal Statistical Society Series B, 66, 275–296.