Gaussian Spatial Processes: Introduction

The Model
The models we’ve used to this point have all been what would ordinarily be called “parametric”; that is, the functional form of the distributional dependence of the response on the controlled variables has been specified, and the “unknowns” of the problem have been a fixed and finite set of parameters:

y = m(u, θ) + ε,   e.g.   y = θ1 f1(u) + θ2 f2(u) + · · · + θk fk(u) + ε
where the experimental effort is aimed at producing inferences about the k θ’s. In reality, a model form often is not known in advance, and more general, and possibly more “flexible”, functions are needed; here we’ll use the notation:

y = z(u) + ε

One approach to more flexible modeling is to stipulate that z is a “random function” – a realization of a spatial stochastic process (where “spatial” refers to U in our context).
Of course, a “random function” approach also can’t be completely assumption-free; even
“random” functions have structure, and that structure must be specified. However, specification of spatial stochastic processes leads to what might be called a “softer” structure than
is present in a parametric regression model. Here, we begin this specification at the level of a
priori means and variances for z at any pair of locations in the design region. In particular,
for any pair of “sites” u1 and u2 in U we will need to specify the functional forms in the
following statement:




E [ z(u1) ]   =   [ µ(u1) ]
  [ z(u2) ]       [ µ(u2) ]

Var [ z(u1) ]   =   [ δ²(u1)                    ρ(u1, u2) δ(u1) δ(u2) ]
    [ z(u2) ]       [ ρ(u1, u2) δ(u1) δ(u2)     δ²(u2)                ]
That is, we will need to specify forms for a mean function µ, a variance function δ 2 , and a
correlation function ρ. As in parametric modeling, these functions may be (and usually are)
fully defined by parameters that we don’t know, and our inference will need to account for this. Hence, if we take observations at u1 and u2,

y1(u1) = z(u1) + ε1,   y2(u2) = z(u2) + ε2,
u1 and u2 are known to us, all other quantities are random, and only y1 and y2 can be
observed. If both observations are taken at the same u,

y1(u) = z(u) + ε1,   y2(u) = z(u) + ε2,

all uncontrolled quantities are again random variables, but the two z’s are the same, while the ε’s aren’t.
This model has a structure that is different from that of models we have focused on to this
point. A typical regression model, for example, specifies that the mean function of the data
is known up to a finite set of unknown parameters, and our focus is on estimating these. In
contrast, here the analogous functional form is one realization of the spatial process. That is,
regardless of the number of experimental observations we are allowed, we say that they are all
comprised of a common component – one functional realization of z – plus independent noise
associated with each observation. As a result, inference focuses on prediction of the random
function z over its domain U, rather than estimation of the parameters in an assumed
functional form. As usual, that inference is based on observing y at u ∈ U , the design.
For most applications, we generally want a process model that is continuous, and perhaps
smooth. If it is assumed that a simple trend is the primary pattern in the function, this can
be expressed with a regression-like function in the mean, e.g.
µ(u) = θ0 + θ1 u1 + θ2 u2 + · · · + θr ur + ...
and where other basis functions may be added if desired, e.g. higher-order monomials in the
elements of u. (Note, however, that specifying a linear model for µ is not the same thing as
saying that z – the function of real interest – is of this form.) Otherwise, we may want to
specify a second-order stationary process if there is no a priori difference in knowledge about
z’s at different points in U:
E(z(u)) = µ
V ar(z(u)) = δ 2
Corr(z(u), z(u + ∆)) = ρ(z(u), z(u + ∆)) = R(∆)
for any u ∈ U, and displacement vector ∆.
In these notes, we will focus on modeling with second-order stationary processes, for
which the definition of R largely defines the character of the model. A few observations
concerning this function are in order:
1. Suppose u2 = u1 +∆. Since Corr(z(u1 ), z(u2 )) must be the same as Corr(z(u2 ), z(u1 )),
Corr(z(u1 ), z(u1 + ∆)) = R(∆) = Corr(z(u2 ), z(u2 − ∆)) = R(−∆)
that is, R must be a symmetric function of its vector-valued argument.
2. For any collection of N u vectors, say T , the N × N covariance matrix V ar(z(T )),
for which the i, j element is δ 2 R(ui − uj ), must be positive semi-definite. Functions
R which produce positive (semi-)definite covariance matrices for z over any set of N
vectors u, for any finite value of N , are called positive (semi-)definite functions.
3. A continuous process requires that R be specified in such a way that:
R(∆) → 1 as ∆ → 0
Intuitively, this means that knowing z(u) is equivalent to “knowing” the limiting value
of z(u + ∆) with probability 1.
While not a mathematical requirement (as the previous points are), for our purposes it is
generally desirable to use functions for which R(∆) is positive, and is a decreasing function of
each element of ∆. The intuitive backdrop for this is the idea that, other things being equal,
values of z at u-vectors that are relatively close together are expected (but not required) to
be similar, while values of z at u-vectors that are relatively farther apart are less likely to
have similar values. Unless there is a reason to expect periodic or oscillating patterns in z
as a function of u, there seems to be little intuitive basis for using correlation functions that
can take negative values.
Here we will focus on Gaussian stochastic processes (GaSP). That is, beyond everything
that has been said above, for any finite collection of points T ⊂ U, z(T ) has a multivariate Gaussian distribution. Because Gaussian distributions and processes are fully specified
by their collective first- and second-order moments, we replace the phrase “second-order
stationary” with “stationary” in this context.
Example:
Consider r = 2, with U = [0, 1]2 . The contour plots below are of two realizations of a
stationary Gaussian process with µ = 10, δ 2 = 1, and R(∆) = exp {−5∆0 ∆}.
[Figure: contour plots of two realizations of this process over U = [0, 1]².]
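Such realizations can be drawn directly from the multivariate normal characterization of the process: build the correlation matrix over a grid of sites and sample from N(µ1, δ²C). A minimal Python sketch under the parameter values above (the grid resolution, random seed, and small diagonal “jitter” are arbitrary implementation choices, not part of the model):

```python
import numpy as np

# Grid of sites in U = [0, 1]^2 (resolution is an arbitrary choice)
n = 21
g = np.linspace(0.0, 1.0, n)
sites = np.array([(u1, u2) for u1 in g for u2 in g])      # N x 2, N = n^2

# Stationary GaSP parameters from the example
mu, delta2 = 10.0, 1.0

# Correlation matrix C with entries R(ui - uj) = exp{-5 (ui - uj)'(ui - uj)}
diff = sites[:, None, :] - sites[None, :, :]
C = np.exp(-5.0 * np.sum(diff**2, axis=-1))

# One realization z ~ N(mu 1, delta2 C); jitter keeps the Cholesky factorization stable
rng = np.random.default_rng(1)
L = np.linalg.cholesky(delta2 * C + 1e-8 * np.eye(len(sites)))
z = mu + L @ rng.standard_normal(len(sites))
z_grid = z.reshape(n, n)        # values of z(u) on the grid, ready for a contour plot
```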
Predictive Inference
As noted above, the emphasis in using this model is not on estimating unknown parameters, even though the parameters of the GaSP model generally are estimated. Instead, the
focus is on nonparametric, predictive inference about the “response surface”, i.e. the random function z, or equivalently, the long-term average y that would be seen with repeated
observations. That is, we want to make inferences about z(u), perhaps at all u ∈ U, after
observing (or conditioning on) values of y in an experiment:
z(u) | y(u1 ), y(u2 ), ... y(uN )
where u is an arbitrary element of U, ui ∈ U (a design), and y(u) = z(u) + ε, ε ∼ N(0, σ²). Suppose, for the moment, that we know µ, δ², σ², and the function R. Then z(u), y(u1), ..., y(uN) are jointly multivariate normal, and inference about z is based on its conditional distribution given the y’s. In particular, the
• minimum variance linear unbiased predictor (classical BLUP), or
• minimum expected squared-error loss predictor (Bayes)
of z(u) is:
ẑ(u) = µ + Cov(z(u), y(U)) Var⁻¹(y(U)) (y(U) − µ1)

where Cov(z(u), y(U)) is 1 × N, Var(y(U)) is N × N, and y(U) − µ1 is N × 1.
Note that for Gaussian processes (and so for multivariate normal distributions) this is just
the mean of z conditional on observed y’s. Because we are working with stationary GaSP
models, conditional variances are not functions of y, and “Cov” and “Var” can be written
in terms of the corresponding vector and matrix of correlations, where
“Cov” = δ² r′ = δ² ( R(u − u1), R(u − u2), ... , R(u − uN) )

“Var” = δ² ( (σ²/δ²) I + C ) = δ² ×

    [ σ²/δ² + 1      R(u1 − u2)     ...    R(u1 − uN)  ]
    [ R(u2 − u1)     σ²/δ² + 1      ...    R(u2 − uN)  ]
    [ ...            ...            ...    ...          ]
    [ R(uN − u1)     R(uN − u2)     ...    σ²/δ² + 1   ]
Corresponding to this, ẑ(u) can be simplified due to cancellation of δ² in “Cov” and δ⁻² in the inverse of “Var”:

ẑ(u) = µ + r′ ( (σ²/δ²) I + C )⁻¹ (y(U) − µ1)

Given values of the parameters, and standard results from multivariate normal distributions, the conditional variance of z(u) (or posterior variance in the Bayesian setting) is:

Var( z(u) | y(U) ) = δ² ( 1 − r′ ( (σ²/δ²) I + C )⁻¹ r )
For a completely characterized probability model (i.e. µ, σ², δ², and R), this provides a complete framework for statistical prediction. That is, conditional on y(U), we know that any z(u) is normally distributed, and we know how to compute its mean and standard deviation.
Example:
Continuing with the setup described in the last example, now let σ 2 = 0.1, one-tenth of
δ 2 , and suppose the following data have been collected:
U    (.2, .2)   (.2, .8)   (.8, .2)   (.8, .8)   (.5, .5)
y      9.3        8.9        9.0       12.2       10.6
The following figures display ẑ(u) and SD(z(u)) for u ∈ U . Note that ẑ is a function of the
observed y’s, and displays asymmetry suggested by those values, but SD(z) is not a function
of the observed data, and so reflects the symmetry of the experimental design.
[Figure: contour plots of ẑ(u) (“zhat”) and SD(z(u)) (“sd(z)”) over U = [0, 1]².]
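As a concrete sketch of these calculations (function and variable names are our own, not code from the notes), the predictor and conditional standard deviation for this example can be computed directly from the formulas above, with R(∆) = exp{−5∆′∆}, µ = 10, δ² = 1, and σ² = 0.1:

```python
import numpy as np

# Design, data, and GaSP parameters from the example
U = np.array([[.2, .2], [.2, .8], [.8, .2], [.8, .8], [.5, .5]])
y = np.array([9.3, 8.9, 9.0, 12.2, 10.6])
mu, delta2, sigma2 = 10.0, 1.0, 0.1

def R(d):
    # Correlation R(Delta) = exp(-5 Delta'Delta), applied along the last axis
    return np.exp(-5.0 * np.sum(np.asarray(d)**2, axis=-1))

C = R(U[:, None, :] - U[None, :, :])              # N x N correlations among design sites
A = (sigma2 / delta2) * np.eye(len(U)) + C        # (sigma^2/delta^2) I + C

def predict(u):
    """Return (zhat, sd) for z(u) | y(U) at a single new site u."""
    r = R(u - U)                                  # correlations between u and the design
    zhat = mu + r @ np.linalg.solve(A, y - mu)    # mu + r'((sigma^2/delta^2)I + C)^{-1}(y - mu 1)
    var = delta2 * (1.0 - r @ np.linalg.solve(A, r))
    return zhat, np.sqrt(var)

print(predict(np.array([0.5, 0.2])))              # zhat and sd at an arbitrary new site
```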
More About R
The functional form of the correlation function R largely determines the character of the
fitted GaSP model, and selection of this form is an important step in any application.
In most applications, the entire function is not specified a priori, but a functional form
with parameters that can be “fitted” or “tuned” is specified. For example, in the numerical
examples above, R(∆) = exp{−5(∆1² + ∆2²)} might have been written as exp{−θ(∆1² + ∆2²)}, with the parameter θ restricted to positive values. Changing the value of θ doesn’t change
the analytical properties of the function R, but does change the “rate” at which correlation
“dies off” between two values of z as a function of distance between their corresponding
locations u. More generally, we’ll use the vector θ to represent all parameters required to
fully define R, and may write Rθ (∆).
As noted earlier, R must be a positive definite function (regardless of its parameter
values), so that the resulting correlation matrix for z over any finite set of u is positive
definite. Selection of such functions is fairly well-understood for r = 1, but is trickier for
large r. In fact, several “reasonable-looking” distance-based functions that are legal as
correlation functions in 2 or 3 dimensions are not positive definite when directly generalized
to higher dimension. One widely used approach to specifying positive definite correlation
matrices for any value of r is through a separable or product correlation form:
• let Ri (∆i ), i = 1, 2, 3, ..., r be positive definite one-dimensional correlation functions,
• then R(∆) = R1(∆1) R2(∆2) · · · Rr(∆r) is a positive definite r-dimensional correlation function.
Note that the correlation function used in the above examples is actually of this form. For
simplicity, we will focus on product-form correlations in this discussion.
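For instance, a product of one-dimensional Gaussian correlations, one rate parameter per dimension, gives a valid correlation for any r. A short sketch (the per-dimension θi parameterization is our own choice for illustration):

```python
import numpy as np

def product_corr(delta, theta):
    """Separable correlation R(Delta) = R1(Delta1) R2(Delta2) ... Rr(Delta_r),
    here with one-dimensional Gaussian factors Ri(Delta_i) = exp(-theta_i Delta_i^2)."""
    delta = np.asarray(delta, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return np.prod(np.exp(-theta * delta**2), axis=-1)

# With theta = (5, 5) this reproduces R(Delta) = exp{-5 Delta'Delta} used in the examples
print(product_corr([0.1, 0.2], [5.0, 5.0]))   # exp(-5*(0.01 + 0.04)) = exp(-0.25)
```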
Perhaps the most important characteristic of the correlation function is its smoothness,
in particular, its analytical behavior at 0. For simplicity, consider r = 1-dimensional u and
note that the form of the function ẑ(u) is linear in each of R(u − u1 ), R(u − u2 ), ...R(u − uN ).
Hence ẑ(u) is of the same degree of smoothness – i.e. has the same number of derivatives –
as R(·).
A different smoothness characteristic of GaSP models is the degree of smoothness of
the functional realizations of the process. This issue is a bit more delicate mathematically
because these realizations are random. However, consider the limiting case of two adjacent
“divided differences”:
lim_{∆→0} Corr( (y(u) − y(u−∆))/∆, (y(u+∆) − y(u))/∆ )
  = lim_{∆→0} (1/∆²) { Corr(y(u), y(u+∆)) − Corr(y(u), y(u))
                        − Corr(y(u−∆), y(u+∆)) + Corr(y(u−∆), y(u)) }
  = lim_{∆→0} (1/∆²) { −R(0) + 2R(∆) − R(2∆) }
  = −R″(0)
Loosely, this implies that the first derivative is continuous with probability 1 – i.e. the first derivative of realizations exists for all u – if the second derivative of R exists at 0. A similar argument can be made for higher-order derivatives of the realization; realizations
have d derivatives if R has 2d derivatives at 0. A related point is that the realizations of a
stationary GaSP are continuous if R is continuous.
Examples of 1-Dimensional Correlation Functions
In each of the following, let
• U = { 0.2, 0.5, 0.8}
• µ = 10, δ 2 = 1, σ 2 = 0.1
• y(U) = (9, 11, 12)′
In each case shown, the parameterization is defined so that θ is positive, and relatively larger
θ corresponds to relatively weaker correlation at any ∆ (although different θ values are used
in each case to make the behavior of the processes more comparable).
The nonnegative linear correlation function:
Rθ(∆) = 1 − (1/θ)|∆|   if |∆| < θ
       = 0              if |∆| ≥ θ
R is continuous, but has no derivative at −θ, 0, and θ. The following are plots of R, and of
the conditional mean and ± 2 s.d. bounds of z for the conditions specified above, computed
with θ = 0.5.
[Figure: R(∆) versus ∆ (left); y, ẑ(u), and ±2 s.d. bounds versus u (right).]
The Ornstein-Uhlenbeck correlation function:
Rθ (∆) = exp(−θ|∆|)
R is continuous, but has no derivative at 0. The following are plots of R, and of the
conditional mean and ± 2 s.d. bounds of z for the conditions specified above, computed
with θ = 5.
[Figure: R(∆) versus ∆ (left); y, ẑ(u), and ±2 s.d. bounds versus u (right).]
The Gaussian correlation function:
Rθ (∆) = exp(−θ∆2 )
R is continuous and infinitely differentiable everywhere. The following are plots of R, and of
the conditional mean and ± 2 s.d. bounds of z for the conditions specified above, computed
with θ = 10.
[Figure: R(∆) versus ∆ (left); y, ẑ(u), and ±2 s.d. bounds versus u (right).]
The Matérn correlation function:
Rθ,ν(∆) = [ 1 / (Γ(ν) 2^(ν−1)) ] ( √(2ν) d / θ )^ν Kν( √(2ν) d / θ ),   with d = |∆|,

where Γ is the gamma function, and Kν is the modified Bessel function of the second kind of order ν. R is continuous, and has differentiability characteristics determined by the positive parameter ν.
For ν = 0.5, this correlation is equivalent to the exponential correlation (very rough), and
for ν → ∞, it approaches the Gaussian (very smooth).
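The four correlation functions above can be coded directly; here is a sketch in Python (using scipy’s gamma and modified Bessel functions for the Matérn, with the limiting value R(0) = 1 handled explicitly):

```python
import numpy as np
from scipy.special import gamma, kv

def linear_corr(d, theta):
    """Nonnegative linear: 1 - |d|/theta for |d| < theta, 0 otherwise."""
    a = np.abs(d)
    return np.where(a < theta, 1.0 - a / theta, 0.0)

def ou_corr(d, theta):
    """Ornstein-Uhlenbeck (exponential): exp(-theta |d|)."""
    return np.exp(-theta * np.abs(d))

def gaussian_corr(d, theta):
    """Gaussian: exp(-theta d^2)."""
    return np.exp(-theta * d**2)

def matern_corr(d, theta, nu):
    """Matern: (1/(Gamma(nu) 2^(nu-1))) (sqrt(2 nu)|d|/theta)^nu K_nu(sqrt(2 nu)|d|/theta)."""
    a = np.sqrt(2.0 * nu) * np.abs(np.asarray(d, dtype=float)) / theta
    safe = np.where(a > 0.0, a, 1.0)              # avoid evaluating K_nu at 0
    val = (safe**nu) * kv(nu, safe) / (gamma(nu) * 2.0**(nu - 1.0))
    return np.where(a > 0.0, val, 1.0)            # R(0) = 1 by continuity

# Sanity check: at nu = 0.5 the Matern reduces to an exponential correlation
print(matern_corr(0.3, 1.0, 0.5), ou_corr(0.3, 1.0))   # both equal exp(-0.3)
```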
Estimation Inference
Everything discussed above is written in the context of predicting the random function
z based on knowing the parameters of the GaSP. In practice, both parameter estimation
and process prediction are made based on collected data. Briefly, the two most common approaches to parameter estimation are maximum likelihood (sometimes modified to “penalized
maximum likelihood”, usually to control numerical difficulties), and Bayesian approaches.
Our emphasis here is on experimental design, but it is worth spending just a few words on
how practical estimation inference can proceed.
With the notation we’ve adopted, likelihood estimation focuses on the parameters µ, δ 2 ,
σ 2 , and the correlation parameters θ. Apart from constants, the log likelihood function for
these, based on y observed at U , is
L(µ, δ², σ², θ | y) = −( N log δ² + log|φI + C| + (y − µ1)′(φI + C)⁻¹(y − µ1)/δ² )
where φ = σ 2 /δ 2 (the “noise-to-signal ratio”). Given φ and θ, µ and δ 2 can be estimated in
closed form:
∂L/∂µ |φ,θ = 0   →   µ̂ = 1′(φI + C)⁻¹ y / 1′(φI + C)⁻¹ 1

∂L/∂δ² |φ,θ = 0  →   δ̂² = (y − µ̂1)′(φI + C)⁻¹(y − µ̂1) / N
The complete likelihood must generally be maximized numerically over φ and θ. Given
parameter values, z at any u is conditionally normal with easily computed mean and standard
deviation.
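A sketch of how that maximization might be organized (a product Gaussian correlation is assumed for Rθ, and the function and variable names are ours): profile out µ and δ² in closed form, then minimize the resulting negative log likelihood over log φ and log θ numerically.

```python
import numpy as np
from scipy.optimize import minimize

def corr_matrix(U, theta):
    """C with entries prod_i exp(-theta_i (u_ki - u_li)^2) for design sites u_k, u_l."""
    diff = U[:, None, :] - U[None, :, :]
    return np.exp(-np.sum(theta * diff**2, axis=-1))

def neg_profile_loglik(log_params, U, y):
    """-(log likelihood), up to a constant, with mu and delta^2 replaced by their
    closed-form estimates; optimized over log(phi) and log(theta)."""
    N = len(y)
    phi, theta = np.exp(log_params[0]), np.exp(log_params[1:])
    A = phi * np.eye(N) + corr_matrix(U, theta)                   # phi I + C
    one = np.ones(N)
    mu_hat = (one @ np.linalg.solve(A, y)) / (one @ np.linalg.solve(A, one))
    resid = y - mu_hat
    delta2_hat = resid @ np.linalg.solve(A, resid) / N
    _, logdet = np.linalg.slogdet(A)
    return N * np.log(delta2_hat) + logdet + N

# Example: the 5-run design and data used earlier; starting values are arbitrary
U = np.array([[.2, .2], [.2, .8], [.8, .2], [.8, .8], [.5, .5]])
y = np.array([9.3, 8.9, 9.0, 12.2, 10.6])
fit = minimize(neg_profile_loglik, x0=np.log([0.1, 5.0, 5.0]), args=(U, y),
               method="Nelder-Mead")
phi_hat, theta_hat = np.exp(fit.x[0]), np.exp(fit.x[1:])
```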
In many applications, the likelihood function is relatively flat, and positive penalty terms
that are increasing functions of the elements of θ are sometimes subtracted from the log
likelihood to improve the numerical conditioning of the problem (e.g. Li and Sudjianto,
2005). The general result of this is:
• Elements of the penalized estimate of θ tend to be smaller (due to the penalty terms) than their corresponding MLEs.
• The penalized estimate of δ² is often larger than the MLE.
Neither of these is necessarily of concern in itself, since the GaSP parameters are
generally not the quantities of real interest anyway. The more important question is: How
does the penalty affect ẑ and the conditional standard deviation? There are no guaranteed
answers to this, and even asymptotic results are difficult to obtain here because of the
increasing dependence between y values as the number of experimental runs increases in a
limited U – the so-called “infill problem”. However, the general observation is that:
• µ usually has little influence on predictions unless the data are very sparse or correlations are very weak.
• Smaller correlation parameters (e.g., leading to smaller conditional standard deviations) are often offset by larger δ (e.g., leading to larger conditional standard deviations).
The intent of penalizing the likelihood is that the numerical problem of parameter estimation
is made easier, but that the impact on prediction of z is minimal.
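A minimal sketch of this modification (the specific penalty, a multiple λ of the sum of the θ’s, is just one simple increasing penalty chosen for illustration, not necessarily the form used by Li and Sudjianto):

```python
import numpy as np

def penalize(neg_loglik, lam=1.0):
    """Wrap a negative log likelihood so that lam * sum(theta) is added to it,
    i.e. the penalty is subtracted from the log likelihood itself."""
    def penalized(log_params, *args):
        theta = np.exp(log_params[1:])   # log_params = (log phi, log theta_1, ..., log theta_r)
        return neg_loglik(log_params, *args) + lam * np.sum(theta)
    return penalized
```

Such a wrapper can be dropped in place of the unpenalized negative log likelihood in the optimization sketch above.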
Because the likelihood function is relatively simple, Bayesian approaches are also popular.
With priors on µ, δ 2 , σ 2 , and θ, posterior probability distributions can be computed via
MCMC simultaneously for the parameters and for z at any specified values of u, and
depending on the form of priors selected, at least some of this may be possible via Gibbs
steps rather than generic Metropolis-Hastings steps.
In most of the notes on experimental design to follow, GaSP parameters are treated as
known. Design can be formulated around joint parameter estimation and function prediction,
but this is much more difficult, and has all of the complications encountered in design
for nonlinear models (and sometimes more). In addition, in most applications there is
considerably more uncertainty associated with function prediction (given parameter values)
than with parameter estimation; it is often the case that varying the values of the GaSP
parameters substantially has very little impact on predictions of z. Hence, control of the
uncertainty associated with prediction is a good starting point for experimental design in
this context.