Gaussian Spatial Processes: Introduction

The Model

The models we've used to this point have all been what would ordinarily be called "parametric"; that is, the functional form of the distributional dependence of the response on the controlled variables has been specified, and the "unknowns" of the problem have been a fixed and finite set of parameters, e.g.

$$ y = m(u, \theta) + \epsilon \qquad \text{or} \qquad y = \sum_{i=1}^{k} \theta_i f_i(u) + \epsilon $$

where the experimental effort is aimed at producing inferences about the k θ's. In reality, a model form often is not known in advance, and more general, possibly more "flexible", functions are needed; here we'll use the notation:

$$ y = z(u) + \epsilon $$

One approach to more flexible modeling is to stipulate that z is a "random function" – a realization of a spatial stochastic process (where "spatial" refers to U in our context). Of course, a "random function" approach also can't be completely assumption-free; even "random" functions have structure, and that structure must be specified. However, specification of spatial stochastic processes leads to what might be called a "softer" structure than is present in a parametric regression model.

Here, we begin this specification at the level of a priori means and variances for z at any pair of locations in the design region. In particular, for any pair of "sites" u1 and u2 in U we will need to specify the functional forms in the following statement:

$$ E\begin{pmatrix} z(u_1) \\ z(u_2) \end{pmatrix} = \begin{pmatrix} \mu(u_1) \\ \mu(u_2) \end{pmatrix}, \qquad Var\begin{pmatrix} z(u_1) \\ z(u_2) \end{pmatrix} = \begin{pmatrix} \delta^2(u_1) & \rho(u_1, u_2)\,\delta(u_1)\delta(u_2) \\ \rho(u_1, u_2)\,\delta(u_1)\delta(u_2) & \delta^2(u_2) \end{pmatrix} $$

That is, we will need to specify forms for a mean function µ, a variance function δ², and a correlation function ρ. As in parametric modeling, these functions may be (and usually are) fully defined by parameters that we don't know, and our inference will need to account for this. Hence, if we take observations at u1 and u2,

$$ y_1(u_1) = z(u_1) + \epsilon_1, \qquad y_2(u_2) = z(u_2) + \epsilon_2 $$

u1 and u2 are known to us, all other quantities are random, and only y1 and y2 can be observed. If both observations are taken at the same u,

$$ y_1(u) = z(u) + \epsilon_1, \qquad y_2(u) = z(u) + \epsilon_2 $$

all uncontrolled quantities are again random variables, but the two z's are the same, while the ε's aren't.

This model has a structure that is different from that of the models we have focused on to this point. A typical regression model, for example, specifies that the mean function of the data is known up to a finite set of unknown parameters, and our focus is on estimating these. In contrast, here the analogous functional form is one realization of the spatial process. That is, regardless of the number of experimental observations we are allowed, we say that they are all comprised of a common component – one functional realization of z – plus independent noise associated with each observation. As a result, inference focuses on prediction of the random function z over its domain U, rather than estimation of the parameters in an assumed functional form. As usual, that inference is based on observing y at u ∈ U, the design.

For most applications, we generally want a process model that is continuous, and perhaps smooth. If it is assumed that a simple trend is the primary pattern in the function, this can be expressed with a regression-like function in the mean, e.g.

$$ \mu(u) = \theta_0 + \sum_{i=1}^{r} \theta_i u_i + \dots $$

where other basis functions may be added if desired, e.g. higher-order monomials in the elements of u. (Note, however, that specifying a linear model for µ is not the same thing as saying that z – the function of real interest – is of this form.)
Otherwise, we may want to specify a second-order stationary process if there is no a priori difference in knowledge about the z's at different points in U:

$$ E(z(u)) = \mu, \qquad Var(z(u)) = \delta^2, \qquad Corr(z(u), z(u + \Delta)) = R(\Delta) $$

for any u ∈ U and displacement vector ∆. In these notes, we will focus on modeling with second-order stationary processes, for which the definition of R largely defines the character of the model. A few observations concerning this function are in order:

1. Suppose u2 = u1 + ∆. Since Corr(z(u1), z(u2)) must be the same as Corr(z(u2), z(u1)),

$$ Corr(z(u_1), z(u_1 + \Delta)) = R(\Delta) = Corr(z(u_2), z(u_2 - \Delta)) = R(-\Delta) $$

that is, R must be a symmetric function of its vector-valued argument.

2. For any collection of N u-vectors, say T, the N × N covariance matrix Var(z(T)), for which the i, j element is δ²R(ui − uj), must be positive semi-definite. Functions R which produce positive (semi-)definite covariance matrices for z over any set of N vectors u, for any finite value of N, are called positive (semi-)definite functions.

3. A continuous process requires that R be specified in such a way that R(∆) → 1 as ∆ → 0. Intuitively, this means that knowing z(u) is equivalent to "knowing" the limiting value of z(u + ∆) with probability 1.

While not a mathematical requirement (as the previous points are), for our purposes it is generally desirable to use functions for which R(∆) is positive, and is a decreasing function of each element of ∆. The intuitive backdrop for this is the idea that, other things being equal, values of z at u-vectors that are relatively close together are expected (but not required) to be similar, while values of z at u-vectors that are relatively farther apart are less likely to be similar. Unless there is a reason to expect periodic or oscillating patterns in z as a function of u, there seems to be little intuitive basis for using correlation functions that can take negative values.

Here we will focus on Gaussian stochastic processes (GaSP). That is, beyond everything that has been said above, for any finite collection of points T ⊂ U, z(T) has a multivariate Gaussian distribution. Because Gaussian distributions and processes are fully specified by their collective first- and second-order moments, we replace the phrase "second-order stationary" with "stationary" in this context.

Example: Consider r = 2, with U = [0, 1]². The contour plots below are of two realizations of a stationary Gaussian process with µ = 10, δ² = 1, and R(∆) = exp{−5∆′∆}.

[Figure: contour plots of two realizations of the process over [0, 1]²; contour labels range from roughly 8.6 to 10.4.]
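Realizations like those in the example can be simulated directly, since the process restricted to any finite grid of sites is just a multivariate normal vector. Below is a minimal sketch in Python/NumPy (an assumed choice of language; the variable names are illustrative) using the values µ = 10, δ² = 1, and R(∆) = exp{−5∆′∆} from the example; the small jitter added to the diagonal is only a numerical safeguard for the Cholesky factorization, not part of the model.

```python
import numpy as np

# Stationary GaSP parameters from the example
mu, delta2 = 10.0, 1.0
theta = 5.0  # R(Delta) = exp(-theta * Delta'Delta)

# Grid of sites in U = [0, 1]^2
g = np.linspace(0.0, 1.0, 30)
u1, u2 = np.meshgrid(g, g)
U_grid = np.column_stack([u1.ravel(), u2.ravel()])            # (900, 2)

# Correlation matrix: R(u_i - u_j) = exp(-theta * ||u_i - u_j||^2)
diff = U_grid[:, None, :] - U_grid[None, :, :]
R = np.exp(-theta * np.sum(diff**2, axis=-1))

# Draw one realization z(grid) ~ N(mu 1, delta2 * R)
rng = np.random.default_rng(1)
L = np.linalg.cholesky(delta2 * R + 1e-10 * np.eye(len(U_grid)))  # jitter for stability
z = mu + L @ rng.standard_normal(len(U_grid))

z_surface = z.reshape(u1.shape)   # ready for a contour plot over the grid
```

Contouring `z_surface` against the grid reproduces figures of the kind shown above, up to Monte Carlo variation.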
Predictive Inference

As noted above, the emphasis in using this model is not on estimating unknown parameters, even though the parameters of the GaSP model generally are estimated. Instead, the focus is on nonparametric, predictive inference about the "response surface", i.e. the random function z, or equivalently, the long-term average y that would be seen with repeated observations. That is, we want to make inferences about z(u), perhaps at all u ∈ U, after observing (or conditioning on) values of y in an experiment:

$$ z(u) \mid y(u_1), y(u_2), \dots, y(u_N) $$

where u is an arbitrary element of U, the ui ∈ U form a design, and y(u) = z(u) + ε with ε ∼ N(0, σ²).

Suppose, for the moment, that we know µ, δ², σ², and the function R. Then z(u), y(u1), ..., y(uN) are jointly multivariate normal, and inference about z is based on its conditional distribution given the y's. In particular, the

• minimum variance linear unbiased predictor (classical BLUP), or
• minimum expected squared-error loss predictor (Bayes)

of z(u) is:

$$ \hat{z}(u) = \mu + \underbrace{Cov(z(u), y(U))}_{1 \times N} \; \underbrace{Var^{-1}(y(U))}_{N \times N} \; \underbrace{(y(U) - \mu 1)}_{N \times 1} $$

Note that for Gaussian processes (and so for multivariate normal distributions) this is just the mean of z conditional on the observed y's. Because we are working with stationary GaSP models, conditional variances are not functions of y, and "Cov" and "Var" can be written in terms of the corresponding vector and matrix of correlations:

$$ Cov = \delta^2 r' = \delta^2 \left( R(u - u_1), R(u - u_2), \dots, R(u - u_N) \right) $$

$$ Var = \delta^2 \left( \tfrac{\sigma^2}{\delta^2} I + C \right) = \delta^2 \begin{pmatrix} \tfrac{\sigma^2}{\delta^2} + 1 & R(u_1 - u_2) & \dots & R(u_1 - u_N) \\ R(u_2 - u_1) & \tfrac{\sigma^2}{\delta^2} + 1 & \dots & R(u_2 - u_N) \\ \vdots & \vdots & \ddots & \vdots \\ R(u_N - u_1) & R(u_N - u_2) & \dots & \tfrac{\sigma^2}{\delta^2} + 1 \end{pmatrix} $$

Corresponding to this, ẑ(u) can be simplified due to cancellation of δ² in "Cov" and δ⁻² in the inverse of "Var":

$$ \hat{z}(u) = \mu + r' \left( \tfrac{\sigma^2}{\delta^2} I + C \right)^{-1} (y(U) - \mu 1) $$

Given values of the parameters, and standard results for multivariate normal distributions, the conditional variance of z(u) (or posterior variance in the Bayesian setting) is:

$$ Var(z(u) \mid y(U)) = \delta^2 \left( 1 - r' \left( \tfrac{\sigma^2}{\delta^2} I + C \right)^{-1} r \right) $$

For a completely characterized probability model (i.e. µ, σ, δ, and R), this provides a complete framework for statistical prediction. That is, conditional on y(U), we know that any z(u) is normally distributed, and we know how to compute its mean and standard deviation.

Example: Continuing with the setup described in the last example, now let σ² = 0.1, one-tenth of δ², and suppose the following data have been collected:

u:  (.2, .2)   (.2, .8)   (.8, .2)   (.8, .8)   (.5, .5)
y:     9.3        8.9        9.0       12.2       10.6

The following figures display ẑ(u) and SD(z(u)) for u ∈ U. Note that ẑ is a function of the observed y's, and displays the asymmetry suggested by those values, but SD(z) is not a function of the observed data, and so reflects the symmetry of the experimental design.

[Figure: contour plots of ẑ(u) ("zhat") and SD(z(u)) ("sd(z)") over U = [0, 1]².]
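As a concrete illustration of these two formulas, the sketch below computes ẑ(u) and the conditional standard deviation at an arbitrary site for the five-run example just given. Python/NumPy is an assumed choice here, and names such as `predict` are purely illustrative.

```python
import numpy as np

# Design, data, and (assumed known) GaSP parameters from the example
U = np.array([[.2, .2], [.2, .8], [.8, .2], [.8, .8], [.5, .5]])
y = np.array([9.3, 8.9, 9.0, 12.2, 10.6])
mu, delta2, sigma2, theta = 10.0, 1.0, 0.1, 5.0

def corr(a, b):
    """R(a - b) = exp(-theta * ||a - b||^2), the (product) Gaussian correlation."""
    d = np.atleast_2d(a)[:, None, :] - np.atleast_2d(b)[None, :, :]
    return np.exp(-theta * np.sum(d**2, axis=-1))

C = corr(U, U)                                  # N x N correlation matrix of z at the design
V = (sigma2 / delta2) * np.eye(len(U)) + C      # (sigma^2/delta^2) I + C

def predict(u):
    """Return (z_hat, sd) at a single site u, conditional on y(U)."""
    r = corr(u, U).ravel()                      # correlations between z(u) and z at the design
    z_hat = mu + r @ np.linalg.solve(V, y - mu)
    var = delta2 * (1.0 - r @ np.linalg.solve(V, r))
    return z_hat, np.sqrt(var)

print(predict(np.array([0.5, 0.2])))            # e.g. a site midway between two design points
```

Evaluating `predict` over a grid of u values and contouring the results reproduces figures of the kind shown above.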
More About R

The functional form of the correlation function R largely determines the character of the fitted GaSP model, and selection of this form is an important step in any application. In most applications, the entire function is not specified a priori, but a functional form with parameters that can be "fitted" or "tuned" is specified. For example, in the numerical examples above, R(∆) = exp{−5(∆₁² + ∆₂²)} might have been written as exp{−θ(∆₁² + ∆₂²)}, with the parameter θ restricted to positive values. Changing the value of θ doesn't change the analytical properties of the function R, but does change the "rate" at which correlation "dies off" between two values of z as a function of the distance between their corresponding locations u. More generally, we'll use the vector θ to represent all parameters required to fully define R, and may write Rθ(∆). As noted earlier, R must be a positive definite function (regardless of its parameter values), so that the resulting correlation matrix for z over any finite set of u is positive definite. Selection of such functions is fairly well understood for r = 1, but is trickier for larger r. In fact, several "reasonable-looking" distance-based functions that are legal as correlation functions in 2 or 3 dimensions are not positive definite when directly generalized to higher dimensions. One widely used approach to specifying positive definite correlation functions for any value of r is through a separable or product correlation form:

• let Ri(∆i), i = 1, 2, 3, ..., r, be positive definite one-dimensional correlation functions;
• then R(∆) = R1(∆1) R2(∆2) ⋯ Rr(∆r) is a positive definite r-dimensional correlation function.

Note that the correlation function used in the above examples is actually of this form. For simplicity, we will focus on product-form correlations in this discussion.

Perhaps the most important characteristic of the correlation function is its smoothness, in particular, its analytical behavior at 0. For simplicity, consider r = 1-dimensional u and note that the form of the function ẑ(u) is linear in each of R(u − u1), R(u − u2), ..., R(u − uN). Hence ẑ(u) is of the same degree of smoothness – i.e. has the same number of derivatives – as R(·).

A different smoothness characteristic of GaSP models is the degree of smoothness of the functional realizations of the process. This issue is a bit more delicate mathematically because these realizations are random. However, consider the limiting behavior of two adjacent "divided differences" of z:

$$ \lim_{\Delta \to 0} Cov\!\left( \frac{z(u) - z(u - \Delta)}{\Delta}, \frac{z(u + \Delta) - z(u)}{\Delta} \right) = \lim_{\Delta \to 0} \frac{\delta^2}{\Delta^2} \left\{ -R(0) + 2R(\Delta) - R(2\Delta) \right\} = -\delta^2 R''(0) $$

Loosely, this implies that the first derivative is continuous with probability 1 – i.e. the first derivative of realizations exists for all u – if the second derivative of R exists at 0. A similar argument can be made for higher-order derivatives of the realization; realizations have d derivatives if R has 2d derivatives at 0. A related point is that the realizations of a stationary GaSP are continuous if R is continuous.

Examples of 1-Dimensional Covariance Functions

In each of the following, let

• U = {0.2, 0.5, 0.8}
• µ = 10, δ² = 1, σ² = 0.1
• y(U) = (9, 11, 12)′

In each case shown, the parameterization is defined so that θ is positive, and relatively larger θ corresponds to relatively weaker correlation at any ∆ (although different θ values are used in each case to make the behavior of the processes more comparable).

The nonnegative linear correlation function:

$$ R_\theta(\Delta) = \begin{cases} 1 - \tfrac{1}{\theta}|\Delta| & |\Delta| < \theta \\ 0 & |\Delta| \ge \theta \end{cases} $$

R is continuous, but has no derivative at −θ, 0, and θ. The following are plots of R, and of the conditional mean and ±2 s.d. bounds of z for the conditions specified above, computed with θ = 0.5.

[Figure: R(∆) for the linear correlation (left), and ẑ(u) with ±2 s.d. bounds and the observed y's (right).]

The Ornstein-Uhlenbeck correlation function:

$$ R_\theta(\Delta) = \exp(-\theta |\Delta|) $$

R is continuous, but has no derivative at 0. The following are plots of R, and of the conditional mean and ±2 s.d. bounds of z for the conditions specified above, computed with θ = 5.

[Figure: R(∆) for the Ornstein-Uhlenbeck correlation (left), and ẑ(u) with ±2 s.d. bounds and the observed y's (right).]

The Gaussian correlation function:

$$ R_\theta(\Delta) = \exp(-\theta \Delta^2) $$

R is continuous and infinitely differentiable everywhere. The following are plots of R, and of the conditional mean and ±2 s.d. bounds of z for the conditions specified above, computed with θ = 10.

[Figure: R(∆) for the Gaussian correlation (left), and ẑ(u) with ±2 s.d. bounds and the observed y's (right).]
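The three panels just described can be reproduced numerically with the same predictor and conditional-variance formulas as before, swapping in each one-dimensional correlation function. The sketch below assumes Python/NumPy and uses the θ values quoted above; with only three runs the ±2 s.d. bands are wide, which is the point of the illustration.

```python
import numpy as np

U = np.array([0.2, 0.5, 0.8])          # 1-dimensional design
y = np.array([9.0, 11.0, 12.0])
mu, delta2, sigma2 = 10.0, 1.0, 0.1

# The three 1-dimensional correlation functions discussed above
corr_fns = {
    "linear, theta=0.5":           lambda d, th=0.5:  np.maximum(1.0 - np.abs(d) / th, 0.0),
    "Ornstein-Uhlenbeck, theta=5": lambda d, th=5.0:  np.exp(-th * np.abs(d)),
    "Gaussian, theta=10":          lambda d, th=10.0: np.exp(-th * d**2),
}

u_grid = np.linspace(0.0, 1.0, 201)
for name, R in corr_fns.items():
    C = R(U[:, None] - U[None, :])                      # 3 x 3 correlations among the design sites
    V = (sigma2 / delta2) * np.eye(len(U)) + C          # (sigma^2/delta^2) I + C
    r = R(u_grid[:, None] - U[None, :])                 # grid x 3 cross-correlations
    z_hat = mu + r @ np.linalg.solve(V, y - mu)
    sd = np.sqrt(delta2 * (1.0 - np.sum(r * np.linalg.solve(V, r.T).T, axis=1)))
    lower, upper = z_hat - 2.0 * sd, z_hat + 2.0 * sd   # +-2 s.d. bounds, as in the plots
    print(name, "zhat(0.5) =", z_hat[100].round(2),
          "band =", lower[100].round(2), upper[100].round(2))
```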
The Matérn correlation function:

$$ R_{\theta,\nu}(\Delta) = \frac{1}{\Gamma(\nu)\,2^{\nu - 1}} \left( \frac{\sqrt{2\nu}\,|\Delta|}{\theta} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\,|\Delta|}{\theta} \right) $$

where Γ is the gamma function, and Kν is the modified Bessel function of the second kind of order ν. R is continuous, and has differentiability characteristics determined by the positive parameter ν. For ν = 0.5, this correlation is equivalent to the exponential correlation (very rough), and for ν → ∞, it approaches the Gaussian (very smooth).

Estimation and Inference

Everything discussed above is written in the context of predicting the random function z based on knowing the parameters of the GaSP. In practice, both parameter estimation and process prediction are made based on collected data. Briefly, the two most common approaches to parameter estimation are maximum likelihood (sometimes modified to "penalized maximum likelihood", usually to control numerical difficulties) and Bayesian approaches. Our emphasis here is on experimental design, but it is worth spending just a few words on how practical estimation and inference can proceed.

With the notation we've adopted, likelihood estimation focuses on the parameters µ, δ², σ², and the correlation parameters θ. Apart from constants, the log likelihood function for these, based on y observed at U, is

$$ L(\mu, \delta^2, \sigma^2, \theta \mid y) = -\left( N \log \delta^2 + \log|\phi I + C| + (y - \mu 1)'(\phi I + C)^{-1}(y - \mu 1)/\delta^2 \right) $$

where φ = σ²/δ² (the "noise-to-signal ratio"). Given φ and θ, µ and δ² can be estimated in closed form:

$$ \frac{\partial L}{\partial \mu}\Big|_{\phi, \theta} = 0 \;\;\rightarrow\;\; \hat{\mu} = \frac{1'(\phi I + C)^{-1} y}{1'(\phi I + C)^{-1} 1} $$

$$ \frac{\partial L}{\partial \delta^2}\Big|_{\phi, \theta} = 0 \;\;\rightarrow\;\; \hat{\delta}^2 = (y - \hat{\mu} 1)'(\phi I + C)^{-1}(y - \hat{\mu} 1)/N $$

The complete likelihood must generally be maximized numerically over φ and θ. Given parameter values, z at any u is conditionally normal with easily computed mean and standard deviation.

In many applications, the likelihood function is relatively flat, and positive penalty terms that are increasing functions of the elements of θ are sometimes subtracted from the log likelihood to improve the numerical conditioning of the problem (e.g. Li and Sudjianto, 2005). The general result of this is:

• Elements of the penalized estimate of θ tend to be smaller (due to the penalty terms) than the corresponding MLEs.
• The penalized estimate of δ² is often larger than the MLE.

Neither of these is necessarily a concern in itself, since the GaSP parameters are generally not the quantities of real interest anyway. The more important question is: how does the penalty affect ẑ and the conditional standard deviation? There are no guaranteed answers to this, and even asymptotic results are difficult to obtain here because of the increasing dependence between y values as the number of experimental runs increases in a limited U – the so-called "infill problem". However, the general observations are that:

• µ usually has little influence on predictions unless the data are very sparse or correlations are very weak.
• Smaller correlation parameters (e.g., leading to smaller conditional standard deviations) are often offset by larger δ (e.g., leading to larger conditional standard deviations).

The intent of penalizing the likelihood is that the numerical problem of parameter estimation is made easier, but the impact on prediction of z is minimal.
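A minimal numerical sketch of this recipe, assuming Python with NumPy and SciPy and, for concreteness, a 1-dimensional Gaussian correlation with the 3-run toy data above: µ̂ and δ̂² are profiled out in closed form as shown, and the remaining criterion is minimized numerically over (log φ, log θ). With so few runs the likelihood is quite flat, so this only illustrates the mechanics.

```python
import numpy as np
from scipy.optimize import minimize

def neg_profile_loglik(log_params, U, y):
    """Negative profile log likelihood (up to constants) in (log phi, log theta);
    mu and delta^2 are replaced by their closed-form estimates given phi and theta."""
    phi, theta = np.exp(log_params)
    N = len(y)
    C = np.exp(-theta * (U[:, None] - U[None, :])**2)   # Gaussian correlation, r = 1
    V = phi * np.eye(N) + C                             # phi I + C, with phi = sigma^2/delta^2
    one = np.ones(N)
    mu_hat = (one @ np.linalg.solve(V, y)) / (one @ np.linalg.solve(V, one))
    resid = y - mu_hat
    delta2_hat = resid @ np.linalg.solve(V, resid) / N
    _, logdet = np.linalg.slogdet(V)
    return N * np.log(delta2_hat) + logdet + N          # last N: the quadratic form at delta2_hat

# Toy data: the 3-run, 1-dimensional example above
U = np.array([0.2, 0.5, 0.8])
y = np.array([9.0, 11.0, 12.0])

fit = minimize(neg_profile_loglik, x0=np.log([0.1, 10.0]), args=(U, y), method="Nelder-Mead")
phi_hat, theta_hat = np.exp(fit.x)
print("phi-hat =", phi_hat, " theta-hat =", theta_hat)
```

A penalized version would simply add the chosen penalty terms in θ to the value returned by `neg_profile_loglik` before minimizing.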
Because the likelihood function is relatively simple, Bayesian approaches are also popular. With priors on µ, δ², σ², and θ, posterior probability distributions can be computed via MCMC simultaneously for the parameters and for z at any specified values of u; depending on the form of priors selected, at least some of this may be possible via Gibbs steps rather than generic Metropolis-Hastings steps (a minimal sketch of such a sampler is given at the end of this section).

In most of the notes on experimental design to follow, GaSP parameters are treated as known. Design can be formulated around joint parameter estimation and function prediction, but this is much more difficult, and has all of the complications encountered in design for nonlinear models (and sometimes more). In addition, in most applications there is considerably more uncertainty associated with function prediction (given parameter values) than with parameter estimation; it is often the case that varying the values of the GaSP parameters substantially has very little impact on predictions of z. Hence, control of the uncertainty associated with prediction is a good starting point for experimental design in this context.
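As referenced above, here is a minimal random-walk Metropolis sketch for the GaSP parameters, again assuming Python/NumPy, the 1-dimensional Gaussian correlation, and the 3-run toy data. Flat priors on µ and on the logs of δ², φ, and θ are a convenience assumption for the sketch, not a recommendation, and Gibbs steps under conjugate priors are not shown. Posterior draws of z(u) would then be obtained by sampling from the conditional normal distribution of z(u) given y at each retained parameter draw.

```python
import numpy as np

U = np.array([0.2, 0.5, 0.8])            # toy 1-dimensional design and data, as above
y = np.array([9.0, 11.0, 12.0])
N = len(y)

def log_post(p):
    """Log posterior of (mu, log delta^2, log phi, log theta), flat priors on this scale."""
    mu, delta2, phi, theta = p[0], np.exp(p[1]), np.exp(p[2]), np.exp(p[3])
    C = np.exp(-theta * (U[:, None] - U[None, :])**2)   # Gaussian correlation
    V = delta2 * (phi * np.eye(N) + C)                  # Var(y) = delta^2 (phi I + C)
    _, logdet = np.linalg.slogdet(V)
    resid = y - mu
    return -0.5 * (logdet + resid @ np.linalg.solve(V, resid))

rng = np.random.default_rng(0)
p = np.array([10.0, 0.0, np.log(0.1), np.log(10.0)])    # start: (mu, log d2, log phi, log theta)
step = 0.2                                              # random-walk step size (a tuning choice)
draws, lp = [], log_post(p)
for _ in range(5000):
    prop = p + step * rng.standard_normal(4)            # symmetric proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:            # Metropolis accept/reject
        p, lp = prop, lp_prop
    draws.append(p.copy())
draws = np.array(draws)                                 # parameter draws; discard an initial burn-in
```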