Gaussian Spatial Processes: Design

In designing experiments for spatial models, the general goal is to find a set of points $u \in \mathcal{U}$ that leads to the greatest amount of information, or the least amount of uncertainty -- in some sense -- about the function $z(u)$. When modeling is based on a Gaussian stochastic process model, a key quantity for these purposes is the "prediction variance" at $u$:
\[
\delta^2 - \mathrm{Cov}[z(u), y(U)] \, \mathrm{Var}^{-1}[y(U)] \, \mathrm{Cov}[y(U), z(u)],
\]
the (frequentist) conditional or (Bayes) posterior variance of $z(u)$ given $y(U)$. In practice, $\mu$, $\delta^2$, $\sigma^2$, and $\theta$ (the vector of correlation parameters) must also be estimated, but here we'll only consider design for which at least $\sigma^2/\delta^2$ and $\theta$ are treated as known.

G-Optimality

Recall from our study of parametric models that the G-optimality criterion focuses directly on expected response estimation, rather than model parameter estimation. That logic is perhaps even more appealing here, since the only quantities being treated as parameters are part of the predictive stochastic process rather than a physically meaningful model. Hence, for the GaSP models we are studying, define a G-optimal design to be the $U$ of fixed size $N$ that minimizes the largest conditional/posterior variance of $z(u)$:
\[
\max_{u \in \mathcal{U}} \mathrm{Var}[z(u) \mid y(U)].
\]
Note that for any $\delta^2$, this is equivalent to finding the design that maximizes the slightly simpler expression
\[
\phi = \min_{u \in \mathcal{U}} \mathrm{Cov}[z(u), y(U)] \, \mathrm{Var}^{-1}[y(U)] \, \mathrm{Cov}[y(U), z(u)],
\]
and so we take this as the criterion function and, for convenience, write it (apart from a factor of $\delta^2$) as
\[
\phi = \min_{u \in \mathcal{U}} r' \left( \tfrac{\sigma^2}{\delta^2} I + C \right)^{-1} r = \min_{u \in \mathcal{U}} Q(u, U),
\]
where $r$ is the $N$-element vector of correlations with $i$th element $R(u - u_i)$, and $C$ is the $N \times N$ matrix of correlations with $(i,j)$ element $R(u_i - u_j)$.

Note, for purposes of finding an optimal design, that for a fully defined $R$, $C$ is a function only of the experimental design, but $r$ is a function of both the design and $u$, the argument of the minimization. Hence an idealized sketch of an algorithm for finding an optimal design would look like:

search over possible designs
    compute $(\frac{\sigma^2}{\delta^2} I + C)^{-1}$
    search over points in $\mathcal{U}$
        compute $r'(\frac{\sigma^2}{\delta^2} I + C)^{-1} r$
        ...
    ...

The "outer loop" generally needs a compromise, since trying all possible designs is usually too computationally intensive. A point-exchange approach, similar to what we discussed for parametric models, can be taken, in which a randomly chosen initial design is improved through iterations that involve adding and then deleting a point:

• add the point $u^+$ that results in the largest $\min_u Q(u, U + u^+)$
• delete the point $u^-$ that results in the largest $\min_u Q(u, U - u^-)$

Even here, if $N$ is not small we would like to avoid computing and inverting $N \times N$ and $(N+1) \times (N+1)$ matrices any more than necessary. With this in mind, let $M = \frac{\sigma^2}{\delta^2} I + C$, and use the following "update" formula:

• let
\[
M_+ = \begin{pmatrix} M & v \\ v' & s \end{pmatrix}, \quad M \text{ symmetric and p.s.d.}
\]
• then
\[
M_+^{-1} = \begin{pmatrix}
M^{-1} + \frac{1}{s - v'M^{-1}v}\,(M^{-1}v)(M^{-1}v)' & -\frac{1}{s - v'M^{-1}v}\,M^{-1}v \\[4pt]
-\frac{1}{s - v'M^{-1}v}\,(M^{-1}v)' & \frac{1}{s - v'M^{-1}v}
\end{pmatrix}
\]

The formula can be used directly when adding a point, to get $M_+^{-1}$ from $M^{-1}$.
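As a concrete illustration, here is a minimal numerical sketch (Python/NumPy; the matrix sizes and values are arbitrary, chosen only for the check) of this add-point update, verified against direct inversion of the bordered matrix:

```python
import numpy as np

def add_point_inverse(M_inv, v, s):
    """Given M^{-1}, return the inverse of the bordered matrix [[M, v], [v', s]]."""
    Mv = M_inv @ v                      # M^{-1} v
    k = s - v @ Mv                      # Schur complement s - v' M^{-1} v
    top_left = M_inv + np.outer(Mv, Mv) / k
    top_right = -Mv[:, None] / k
    return np.block([[top_left, top_right],
                     [top_right.T, np.array([[1.0 / k]])]])

# small check with an arbitrary symmetric positive-definite matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
M_plus = A @ A.T + 5 * np.eye(5)        # the bordered matrix, built directly
M, v, s = M_plus[:4, :4], M_plus[:4, 4], M_plus[4, 4]
print(np.allclose(add_point_inverse(np.linalg.inv(M), v, s), np.linalg.inv(M_plus)))
```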
When deleting a point, partition the current inverse as
\[
M_+^{-1} = \begin{pmatrix} A & b \\ b' & c \end{pmatrix},
\]
and use
\[
M^{-1} = A - \tfrac{1}{c}\, b b'.
\]

With that, here's a somewhat more detailed sketch of how an algorithm could be constructed:

• set $N$
• pick a random (or other) starting design, $U$
• compute $M^{-1}$
• to add a point:
  – loop over all $u^+ \in \mathcal{U}$
    ∗ $U_+ \leftarrow U + u^+$
    ∗ $M_+^{-1} \leftarrow$ update
    ∗ find $\min_u Q(u, U_+)$, recording $Q_{\min}$ and $u_{\min}$
    ∗ keep the $u^+$ for which $Q_{\min}$ is largest
  – $U \leftarrow U + u^+$
  – $N \leftarrow N + 1$
  – $M^{-1} \leftarrow$ update
• to delete a point:
  – loop over all $u^- \in U$
    ∗ $U_- \leftarrow U - u^-$
    ∗ $M_-^{-1} \leftarrow$ update
    ∗ find $\min_u Q(u, U_-)$, recording $Q_{\min}$ and $u_{\min}$
    ∗ keep the $u^-$ for which $Q_{\min}$ is largest
  – $U \leftarrow U - u^-$
  – $N \leftarrow N - 1$
  – $M^{-1} \leftarrow$ update
• alternately add and delete until no further change results

Examples

The figures below display 3 approximately G-optimal $N = 10$-point designs generated using an algorithm as outlined above. For these calculations, $\mathcal{U} = \{0, 0.05, 0.10, 0.15, \ldots, 1\}^2$, and the design displayed in each case is the best of 10 tries, each beginning with a random design and iterating until a sequential add-delete cycle does not change the design. Each design was constructed for a stationary process with Gaussian correlation function with $\theta = 10$ for both $u_1$ and $u_2$. From left to right, the panels display the resulting designs for $\sigma^2/\delta^2 = 0.1$, $0.2$, and $0.5$.

[Figure: three panels, each plotting the 10 design points against $(u_1, u_2) \in [0,1]^2$, for $\sigma^2/\delta^2 = 0.1$, $0.2$, and $0.5$.]

A-Optimality

As we discussed with parametric models, G-optimality is computationally difficult since it is essentially a nested optimization problem -- minimizing, through choice of design, the maximum predictive variance over $\mathcal{U}$. An alternative but related approach is an analogue of what we earlier called A-optimality, where we focus on the integrated (or average) predictive variance over $\mathcal{U}$, i.e.
\[
\int_{u \in \mathcal{U}} \mathrm{Var}(z(u) \mid y(U)) \, du.
\]
For any $\delta^2$, this is equivalent to finding the design that maximizes
\[
\phi = \int_{u \in \mathcal{U}} r' \left( \tfrac{\sigma^2}{\delta^2} I + C \right)^{-1} r \, du
= \int_{u \in \mathcal{U}} \mathrm{trace}\!\left[ \left( \tfrac{\sigma^2}{\delta^2} I + C \right)^{-1} r r' \right] du
= \mathrm{trace}\!\left[ \left( \tfrac{\sigma^2}{\delta^2} I + C \right)^{-1} \int_{u \in \mathcal{U}} r r' \, du \right],
\]
the last expression being true because only $r$ is a function of the prediction "site" $u$. This means that for a given design, evaluation consists of computing the matrix $A = \int_{u \in \mathcal{U}} r r' \, du$, followed by the trace of $(\frac{\sigma^2}{\delta^2} I + C)^{-1} A$. This can still require substantial computational effort, but is generally less intensive than the complete search over $\mathcal{U}$ required by G-optimality.

In cases where a product correlation form is used for $R$ and $\mathcal{U}$ is a hyper-rectangle in $r$-dimensional space, substantial simplification can be realized in computing $A$ by expressing each element of the matrix as a product of $r$ one-dimensional integrals; if the one-dimensional correlations are of convenient functional form, these integrals can sometimes be performed analytically (and therefore quickly).

Other reasonable forms of design optimality can be formulated for this model, but each generally presents a difficult computational problem (even with the design assumption of known GaSP parameter values). For example, one might want to explore an analogue of D-optimality from parametric modeling, where the emphasis would be on minimizing the determinant of the variance matrix of the predicted $z(u)$ at all sites in $\mathcal{U}$ simultaneously. An immediate practical problem, of course, is that this would be an enormous matrix. One general difficulty here is the absence of an "equivalence strategy" of the kind provided by Fréchet derivatives in the parametric case.
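To make the A-criterion concrete, here is a small sketch (Python/NumPy) that evaluates $\mathrm{trace}[(\frac{\sigma^2}{\delta^2} I + C)^{-1} A]$ for a given design, approximating $A$ by an average over a fine grid of prediction sites; the Gaussian correlation with $\theta = 10$, the grid, and the candidate design are illustrative assumptions chosen to match the examples above:

```python
import numpy as np

def corr(x1, x2, theta=10.0):
    """Product Gaussian correlations between rows of x1 and rows of x2."""
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-theta * d2)

def a_criterion(design, noise_ratio=0.1, n_grid=21):
    """Approximate integrated-variance (A) criterion phi for a design in [0,1]^2."""
    g = np.linspace(0.0, 1.0, n_grid)
    sites = np.array([[a, b] for a in g for b in g])      # prediction grid standing in for U
    R = corr(sites, design)                               # each row is r(u)'
    A = (R.T @ R) / len(sites)                            # approximates the integral of r r'
    M = noise_ratio * np.eye(len(design)) + corr(design, design)
    return np.trace(np.linalg.solve(M, A))                # larger is better

rng = np.random.default_rng(1)
design = rng.uniform(size=(10, 2))                        # an arbitrary 10-point design
print(a_criterion(design))
```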
Instead of pursuing the general design problem for this model, we turn to an interesting special case that yields a bit more structure for design.

D-Optimality for No Noise

Suppose now that $\sigma^2 = 0$, i.e. $z(u) = y(u)$. This is the appropriate model for applications in which a deterministic computer model (with $u$ the input vector and $z$ an output of interest) generates the "data", and the aim is to construct an approximation to the computer model that can be quickly evaluated at any $u$. It can also be regarded as a reasonable approximate model for design purposes when $\delta^2 \gg \sigma^2$. But the change from "small" $\sigma^2$ to $\sigma^2 = 0$ is a fundamental structural change in the model, and provides substantial simplicity in the way optimality, especially D-optimality, can be formulated.

First, note that with this modification comes simplicity in the functional expressions for the predictive mean and variance of $z$ at any $u$:
\[
\hat{z}(u) = \mu + r' C^{-1} (z(U) - \mu 1)
\]
\[
\mathrm{Var}(z(u) \mid z(U)) = \delta^2 (1 - r' C^{-1} r)
\]
where $U$ is the experimental design of $N$ sites and $z(U)$ is the collection of responses observed there. In fact, in this case, predictions made at points in the design exactly replicate the data values observed there, and the conditional variances associated with these predictions are zero (since there is no uncertainty about $z(u)$ once it has been observed). To demonstrate this, the following graph displays $\hat{z}(u)$ with plus-and-minus 2 conditional standard deviation bounds, generated using the 3-point example data set and the nonnegative linear correlation function described in the previous chapter, but here with $\sigma^2 = 0$ rather than $\sigma^2 = 0.1$ as before.

[Figure: $y$, $\hat{z}(u)$, and $\hat{z}(u) \pm 2$ conditional standard deviations plotted against $u \in [0,1]$; the bounds collapse to the observed data values at the three design points.]

Focus now on a large-but-finite grid $\mathcal{V}$ and partition it into $U$ (the design) and $\bar{U}$ (everything else in $\mathcal{V}$). A version of "D-optimality" would be to pick $U$ so as to minimize the determinant of the conditional covariance matrix of $z(\bar{U})$:
\[
\mathrm{Var}[z(\bar{U}) \mid z(U)] = \delta^2 \left[ \mathrm{Corr}[z(\bar{U}), z(\bar{U})] - \mathrm{Corr}[z(\bar{U}), z(U)] \, \mathrm{Corr}[z(U)]^{-1} \, \mathrm{Corr}[z(U), z(\bar{U})] \right].
\]
On the face of it, this would lead to a very difficult calculation if $\mathcal{V}$, and therefore $\bar{U}$, is very large, since this determines the dimension of these matrices. However, consider the implications of the following fact:
\[
|\mathrm{Corr}[z(\mathcal{V})]| = |\mathrm{Corr}[z(U)]| \times \left| \mathrm{Corr}[z(\bar{U})] - \mathrm{Corr}[z(\bar{U}), z(U)] \, \mathrm{Corr}[z(U)]^{-1} \, \mathrm{Corr}[z(U), z(\bar{U})] \right|.
\]
For fixed GaSP parameters and a fixed grid $\mathcal{V}$, the expression on the left is fixed. The second factor on the right is the determinant of the (very large) conditional variance matrix for $z$ at all sites other than those in the design (apart from a fixed power of $\delta^2$) -- the quantity we would want to minimize for D-optimality. The first factor on the right is the determinant of the unconditional variance matrix for $z$ at the design sites (a much smaller matrix). Taken together, this says that we can minimize the determinant of the conditional variance matrix for $z(\bar{U})$ -- the criterion function for D-optimality -- by maximizing the determinant of the unconditional variance matrix for $z(U)$ with respect to the selection of $U$. This latter matrix is typically much, much smaller (of order $N$, the size of the design), and so computationally much more feasible.

The value of this fact actually extends beyond the context of a single selected finite grid $\mathcal{V}$. Suppose we maximize $|\mathrm{Corr}(z(U))|$ with respect to selection of design points from a possibly continuous/infinite $\mathcal{U}$. This, in fact, guarantees that $|\mathrm{Var}[z(\bar{U}) \mid z(U)]|$ is minimized for any grid $\mathcal{V}$ you might have selected from $\mathcal{U}$ that includes $U$.
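This determinant identity is just the standard block-determinant (Schur complement) factorization; a quick numerical check of it, under an assumed Gaussian correlation function and an arbitrary small "grid", might look like:

```python
import numpy as np

def gauss_corr(x1, x2, theta=10.0):
    """Gaussian (squared-exponential) correlations between rows of x1 and rows of x2."""
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-theta * d2)

rng = np.random.default_rng(2)
V = rng.uniform(size=(20, 2))           # a small candidate grid V in [0,1]^2
U, Ubar = V[:5], V[5:]                  # the first 5 points play the role of the design U

C_V = gauss_corr(V, V)
C_U = gauss_corr(U, U)
C_cross = gauss_corr(Ubar, U)
cond = gauss_corr(Ubar, Ubar) - C_cross @ np.linalg.solve(C_U, C_cross.T)

# |Corr z(V)| = |Corr z(U)| x |conditional correlation of z(Ubar)|, compared on the log scale
print(np.isclose(np.linalg.slogdet(C_V)[1],
                 np.linalg.slogdet(C_U)[1] + np.linalg.slogdet(cond)[1]))
```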
Whether the search is confined to points on a grid or not, this means that a D-optimal design (or near-D-optimal design) can be constructed using calculations involving only $N \times N$ matrices. An add-delete algorithm that takes advantage of this might follow this outline:

• set $N$
• pick a random (or other) starting design, $U$
• compute $C(U)$ (the $N \times N$ correlation matrix)
• to add a point:
  – loop over all $u^+ \in \mathcal{U}$
    ∗ $U_+ \leftarrow U + u^+$
    ∗ compute $|C(U_+)| = |C(U)| \times (1 - r' C(U)^{-1} r) = q$ (where $r$ is the $N$-element vector of correlations between $u^+$ and the points of $U$)
    ∗ keep the $u^+$ for which $q$ is largest
  – $U \leftarrow U + u^+$
  – $N \leftarrow N + 1$
• to delete a point:
  – loop over all $u^- \in U$
    ∗ $U_- \leftarrow U - u^-$
    ∗ compute $|C(U_-)| = |C(U)| / (1 - r' C(U_-)^{-1} r) = q$
    ∗ keep the $u^-$ for which $q$ is largest
  – $U \leftarrow U - u^-$
  – $N \leftarrow N - 1$
• alternately add and delete until no further change results

Asymptotic Optimality for No Noise

All discussion of optimal design construction to this point has been predicated on stating a value of $\theta$, even for the case of $\sigma^2 = 0$. Johnson, Moore, and Ylvisaker (1990) developed arguments that characterize designs for GaSP models (with $\sigma^2 = 0$) in the limiting case of "weak local correlation", i.e. of $\theta \to \infty$ in the parameterization we've adopted here.

Suppose the correlation function $R(\Delta)$ can be written in terms of what we will call a "distance function" $d(\Delta)$ that is non-negative, and zero only for $\Delta = 0$. Write the relationship between $R$ and $d$ as $R(\Delta) = r(d(\Delta))$, where we require that $r$ be a decreasing function of $d$ (so that correlation is relatively weak for distances that are relatively large). For example, the product-form Ornstein-Uhlenbeck correlation is $R_\theta(\Delta) = \exp(-\sum_{i=1}^{r} \theta_i |\Delta_i|)$, so $d = \sum_{i=1}^{r} \theta_i |\Delta_i|$; i.e. "distance" is the sum of distances in each dimension, each scaled by a corresponding parameter (sometimes called "rectangular distance"). Similarly, the product-form Gaussian correlation function corresponds to $d = \sum_{i=1}^{r} \theta_i \Delta_i^2$, i.e. squared Euclidean distance where, again, each dimension is scaled by a corresponding parameter.

For any design space $\mathcal{U}$, define the following:

• Call an $N$-point design $U$ a minimax distance design (with respect to $d$) if the largest distance from a point in $\mathcal{U}$ to the nearest point in $U$ is less than or equal to that of any other $N$-point design. That is, a minimax distance design is any solution to
\[
\mathrm{argmin}_U \; \max_{u \in \mathcal{U}} \; \min_{v \in U} \; d(u - v).
\]
For a minimax distance design, let this largest distance from a point in $\mathcal{U}$ to the nearest point in $U$ be $d_{mM}$. Call the points in $\mathcal{U}$ that are distance $d_{mM}$ from the nearest point in $U$ remote points. Let the smallest number of points in $U$ that are distance $d_{mM}$ from some remote point be $I_{mM}$. Then the design is a minimax distance design of maximum index if there is no minimax distance design with a larger value of $I_{mM}$.

• Call an $N$-point design $U$ a maximin distance design (with respect to $d$) if the smallest distance between two points in $U$ is greater than or equal to that of any other $N$-point design. That is, a maximin distance design is any solution to
\[
\mathrm{argmax}_U \; \min_{u \neq v \in U} \; d(u - v).
\]
For a maximin distance design, let this smallest distance between two points in $U$ be $d_{Mm}$, and let the number of pairs of points in the design separated by this distance be $I_{Mm}$. Then the design is a maximin distance design of minimum index if there is no maximin distance design with a smaller value of $I_{Mm}$.
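As a small illustration of these two criteria (not of the optimization itself), the following sketch evaluates the minimax and maximin distance measures of a candidate design using squared Euclidean distance, approximating $\mathcal{U}$ by a finite grid; the grid and the design shown are arbitrary choices:

```python
import numpy as np

def sq_dist(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def minimax_measure(design, grid):
    """Largest distance from a grid (candidate) point to its nearest design point."""
    return sq_dist(grid, design).min(axis=1).max()

def maximin_measure(design):
    """Smallest distance between two distinct design points."""
    d = sq_dist(design, design)
    return d[np.triu_indices(len(design), k=1)].min()

g = np.linspace(0.0, 1.0, 21)
grid = np.array([[a, b] for a in g for b in g])    # finite stand-in for U = [0,1]^2
rng = np.random.default_rng(3)
design = rng.uniform(size=(7, 2))                  # an arbitrary 7-point design

print("minimax measure (smaller is better):", minimax_measure(design, grid))
print("maximin measure (larger is better): ", maximin_measure(design))
```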
Now consider a progression of GaSP models indexed by positive $k$, for which the correlation function is $R_k(\Delta)$ and correlations become weaker as $k$ increases. The two central results of Johnson, Moore, and Ylvisaker are the following:

• Minimax distance designs of maximum index are asymptotically G-optimal as $k \to \infty$.
• Maximin distance designs of minimum index are asymptotically D-optimal as $k \to \infty$.

(The proof of the second result, for example, comes from the fact that in the limit, $d_{Mm}$ determines the largest-order term in $\log|C(U)|$, and $I_{Mm}$ determines the coefficient of that term.)

The two figures below, taken from the JMY paper, display a minimax distance design of maximum index (left) and a maximin distance design of minimum index (right), each with $N = 7$ points for $\mathcal{U} = [0,1]^2$ and squared Euclidean distance. The circles on the graphs emphasize reliance on the distance from design points to the "most distant" points not in the design (minimax) and on the minimum distance between design points (maximin).

Computing to find a minimax distance design is relatively difficult because, for any design, the distance to all other points in $\mathcal{U}$ must, in principle, be considered. Evaluation of a design by the maximin criterion is much simpler, since only the distances between the $\binom{N}{2}$ pairs of points in the design need to be evaluated.

A Cute Mm Distance Example (of limited practical value)

Consider the squared-Euclidean distance function $d(s,t) = \sum_{l=1}^{r} (s_l - t_l)^2$ and the hypercubic design region $\mathcal{U} = [-1, +1]^r$. (I've omitted a $\theta$ parameter here for simplicity; the argument to be presented also holds for scaled distance so long as the scaling is the same in each dimension.) For convenience, think about a conventional design matrix $X_{N \times r}$ with $(i,j)$ element $u_{i,j}$. For this problem, denote the distance between the $i$th and $j$th design points as $d(i,j) = \sum_{l=1}^{r} (u_{i,l} - u_{j,l})^2$. For any $N$-point design, the $\binom{N}{2}$ interpoint distances sum to
\[
\sum_{i < j} \sum_{l} (u_{i,l} - u_{j,l})^2 = \sum_{l} \sum_{i < j} (u_{i,l} - u_{j,l})^2 = \text{const} \times \sum_{l} S^2(l),
\]
where $S^2(l)$ is the sample variance of the elements in the $l$th column of $X$. We know that these sample variances are all as large as possible for any design that specifies $N/2$ values of $+1$ and $N/2$ values of $-1$ for each independent variable -- the so-called "balanced" two-level designs, including orthogonal arrays.

Now narrow consideration to $N = r + 1 \equiv 0 \pmod{4}$, i.e. the conditions needed for a "full-width" Plackett-Burman design. Let $X_*$ be the model matrix for a first-order linear model, including a column for the intercept:
\[
X_*' X_* = N I = X_* X_*',
\]
where the last equality holds because $X_*$ is square. The elements of this last matrix are:

• diagonal: $1 + \sum_{l} u_{i,l}^2 = 1 + r = N$
• off-diagonal: $1 + \sum_{l} u_{i,l} u_{j,l} = 0$

So
\[
\sum_{l} (u_{i,l} - u_{j,l})^2 = \sum_{l} u_{i,l}^2 - 2 \sum_{l} u_{i,l} u_{j,l} + \sum_{l} u_{j,l}^2 = r + 2 + r = 2N
\]
for all $i \neq j$; that is, the distance between every pair of points is $2N$. Putting this together, for Plackett-Burman designs,

• $\sum_{i,j} d(i,j)$ is maximized
• $d(i,j)$ is the same for every pair $(i,j)$

(If some other design had a larger minimum interpoint distance than $2N$, its average -- and hence its total -- interpoint distance would have to exceed the maximum possible total, a contradiction.) Hence, Plackett-Burman designs are maximin distance designs for the stated problem. (Further, this argument also works for rectangular distance.)
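A quick numerical check of this fact, using the smallest case $N = 4$, $r = 3$ (a Plackett-Burman-type design equivalent to the half-fraction of the $2^3$ design, written here as an explicit $\pm 1$ matrix), confirms that every pair of runs is separated by squared distance $2N = 8$:

```python
import numpy as np

# 4-run Plackett-Burman-type design in r = 3 two-level factors (entries are +/-1)
X = np.array([[+1, +1, +1],
              [+1, -1, -1],
              [-1, +1, -1],
              [-1, -1, +1]])

N = X.shape[0]
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # all pairwise squared distances
pairs = d2[np.triu_indices(N, k=1)]
print(pairs)                      # -> [8 8 8 8 8 8], i.e. 2N for every pair
print(np.all(pairs == 2 * N))     # True
```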
Latin Hypercube Designs

There are also other forms of spatial designs that are not directly motivated by the precision of prediction that can be expected for GaSP models, but that have been shown to be effective experimental plans for this purpose. Probably the most widely used of these is based on the Latin Hypercube Sample, LHS (not "design" yet, but we'll get to that soon), introduced by McKay, Beckman, and Conover (1979) in the context of computer experiments in which inputs are chosen randomly from some specified distribution, and analysis focuses on estimating properties, such as the mean or specified quantiles, of the resulting distribution of the outputs. In this kind of experiment, the values of the inputs actually selected are generally not used in the estimation exercise; that is, $N$ input vectors are randomly selected, the computer model is executed for each of them, and the analysis is based only on the resulting random sample of output values. McKay, Beckman, and Conover (1979) focused in particular on averages of functions of the output:
\[
T = \frac{1}{N} \sum_{i=1}^{N} g(z_i),
\]
where $z_i$, $i = 1, 2, \ldots, N$, is the value of the output of interest resulting from execution of the model with the $i$th selected set of inputs ($u_i$). In this setting $g$ is an arbitrary function that accommodates a useful variety of output statistics. For example, $g(z) = z$ leads to the sample mean, $g(z) = z^m$ for positive integer $m$ yields the $m$th noncentral sample moment, and $g(z) = 1$ for $z < z^*$ and $0$ otherwise results in the empirical distribution function evaluated at $z^*$.

Latin Hypercube sampling is based on the idea that a joint probability distribution has been specified for the input vector, $F(u)$, and that the elements of $u$ are independent, so that the joint distribution can be written as the product of the marginals, $F(u) = \prod_{i=1}^{r} F_i(u_i)$. Values of the inputs are selected individually. For the $i$th input, the range of $u_i$ is partitioned into $N$ non-overlapping intervals, each of probability $1/N$ under $F_i$, and one value of $u_i$ is drawn conditionally from each of these intervals. After $N$ values have been thus chosen for each input, they are combined randomly (with equal probability for each possible arrangement) to form the $N$ input vectors, each of order $r$. When $N$ is large, the conditional sampling from equal-probability intervals is often ignored, and values are simply taken from a regular grid. The following figure displays (in the left panel) how 5 sample values of one input are selected conditionally from equal-probability "slices" of a given univariate distribution, and (in the right panel) how values chosen in this way for 2 inputs can be randomly matched to construct a Latin Hypercube sample.

The basic result presented by McKay, Beckman, and Conover (1979) compares the efficiency of estimation under LHS to that under simple random sampling (SRS), and can be easily stated. For a fixed sample size $N$, let $T_{SRS}$ be the quantity of interest calculated from outputs resulting from a simple random sample of inputs, and let $T_{LHS}$ be the same quantity resulting from a Latin Hypercube sample. Then if the computer model is such that $z$ is a monotonic function of each of the inputs, and $g$ is a monotonic function of $z$, then $\mathrm{Var}(T_{LHS}) \le \mathrm{Var}(T_{SRS})$. Stein (1987) showed that, so long as $E(g(z)^2)$ is finite, the asymptotic (large $N$) variance of $T_{LHS}$ is no larger than that of $T_{SRS}$ without the monotonicity requirements, and that the asymptotic efficiency of $T_{LHS}$ relative to $T_{SRS}$ is governed by how well the computer model can be approximated by an additive function of the inputs.
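A minimal sketch of the sampling construction described above, assuming for simplicity that each input is independently Uniform(0,1) (so the equal-probability intervals are just $[(j-1)/N, j/N)$), could be:

```python
import numpy as np

def latin_hypercube_sample(N, r, rng=None):
    """Draw an N x r Latin Hypercube sample for independent Uniform(0,1) inputs."""
    rng = np.random.default_rng(rng)
    sample = np.empty((N, r))
    for i in range(r):
        # one value drawn from each of the N equal-probability intervals...
        values = (np.arange(N) + rng.uniform(size=N)) / N
        # ...then matched with the other inputs in random order
        sample[:, i] = rng.permutation(values)
    return sample

print(latin_hypercube_sample(5, 2, rng=0))
```

For a non-uniform marginal $F_i$, the same construction applies after transforming each sampled value by the inverse CDF $F_i^{-1}$.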
As described above, the original justification for the Latin Hypercube was as a stratified sampling plan, requiring a specified probability distribution for $u$. However, the structure of the LHS is frequently used in GaSP modeling applications where $u$ is not regarded as random, resulting in the Latin Hypercube design. The intuitive appeal of this approach to design for meta-modeling includes the following:

1. One-dimensional stratification: In a Latin Hypercube sample (or design), each input takes on $N$ distinct values spread across its experimental range. This is not particularly appealing in physical experimentation, since it implies that the design cannot have factorial structure or include replication. However, in experiments with deterministic models there is no uncontrolled "noise" as such; replication is not needed to estimate uncontrolled variation, and the benefits of factorial structure that come from maximizing "signal to noise" in the analysis are not relevant. The $N$-value one-dimensional projections of a Latin Hypercube provide (at least in some cases) the information needed to map out more complex functional $z$-to-$u$ behavior than can be supported by designs that rely on a small number of unique values for each input.

2. Space-filling potential: The modeling techniques that are most appropriate in this context are data interpolators rather than data smoothers; they generally perform best when the $N$ points of an experimental design "fill the space", as opposed to being arranged so that there are relatively large subregions of $\mathcal{U}$ containing no points (as is the case, for example, with factorial designs with relatively few levels attached to each input). While Latin Hypercube designs do not necessarily have good space-filling properties, they can be made to fill $\mathcal{U}$ effectively through judicious (non-random) arrangement of the combinations of input values used. As one example, Morris and Mitchell (1995) constructed maximin distance designs within the class of equally-spaced Latin Hypercube designs for use in computer experiments.

Sequential Experiments: Expected Improvement

As noted earlier, the GaSP model with $\sigma^2 = 0$ has become a popular statistical framework for modeling the behavior of deterministic computer models. The motivation for this is often that complex computer models take a substantial amount of computer time for each run, while many applications require a large number of such evaluations. An important example of this kind of computer "experimentation" is the problem of numerical function optimization. Suppose our interest in $z(u)$ is in finding the value or values of $u$ for which $z$ is maximized. Numerical optimization has traditionally not been treated as a statistical problem. However, if $z$ takes substantial computer time for each function evaluation, many or most traditional numerical optimization approaches may be infeasible due to the number of function evaluations they require, particularly when the problem dimension ($r$) is high. Recent statistical research has focused on how a "meta-model" such as a GaSP might be used to make more complete use of the data, and lead to effective function optimization through fewer evaluations of $z$. A very simple way to do this is to use the conditional/posterior predictor $\hat{z}(u)$ as what is sometimes called an "oracle", to predict the values of $u$ that are likely to lead to larger values of $z$. A simple algorithm (sketched in code after this list):

• Begin with a "standard" but small design $U$ of points taken from $\mathcal{U}$.
• Fit a GaSP model, and find the value or values of $u \in \mathcal{U}$ that lead to the largest value of $\hat{z}$. (This can be done relatively quickly, since $\hat{z}$ is typically much easier to calculate than $z$.)
• Evaluate $z$ at the value of $u$ identified in the last step, add this value of $z$ to the dataset, and update the GaSP predictor of $z$.
• Iterate the second and third steps until no appreciable improvement is attained.
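Here is a minimal sketch of that loop for a one-dimensional toy problem (Python/NumPy). The test function, the Gaussian correlation with $\theta = 10$, the candidate grid standing in for $\mathcal{U}$, the generalized-least-squares estimate of $\mu$, and the rule of not re-running already-evaluated points are all illustrative assumptions, not part of the algorithm as stated above:

```python
import numpy as np

def z(u):                                    # stand-in for an expensive computer model
    return np.sin(8.0 * u) + 0.5 * u

def corr(a, b, theta=10.0):
    """Gaussian correlations between one-dimensional input values a and b."""
    return np.exp(-theta * (a[:, None] - b[None, :]) ** 2)

def zhat(u_cand, u_obs, z_obs):
    """GaSP predictor with sigma^2 = 0, mu estimated by generalized least squares."""
    C_inv = np.linalg.inv(corr(u_obs, u_obs) + 1e-10 * np.eye(len(u_obs)))  # small jitter
    one = np.ones(len(u_obs))
    mu = (one @ C_inv @ z_obs) / (one @ C_inv @ one)
    return mu + corr(u_cand, u_obs) @ C_inv @ (z_obs - mu)

candidates = np.linspace(0.0, 1.0, 201)      # finite stand-in for the design space U
obs_idx = [20, 100, 180]                     # small starting design: u = 0.1, 0.5, 0.9
z_obs = z(candidates[obs_idx])

for _ in range(10):
    pred = zhat(candidates, candidates[obs_idx], z_obs)
    pred[obs_idx] = -np.inf                  # do not re-run points already evaluated
    nxt = int(np.argmax(pred))
    obs_idx.append(nxt)                      # "run" the expensive model at the new point
    z_obs = np.append(z_obs, z(candidates[nxt]))

best = int(np.argmax(z_obs))
print("best u found:", candidates[obs_idx][best], "best z found:", z_obs[best])
```

The same loop can later be driven by the expected-improvement criterion described below, substituted for $\hat{z}$ alone.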
The above approach is simple and heuristic and, more importantly, it often is actually quite effective. However, it does not take the uncertainty of the predictions $\hat{z}$ into account, and so can take $\hat{z}$ values "too seriously", especially in early iterations, leading to slow convergence or failure in some cases.

An alternative approach that takes advantage of more of the information in the stochastic process is based on expected improvement. For our function-maximization example, suppose that we have evaluated $z$ at each of $N$ values of $u$ (the "current design") and wish to select an $(N+1)$st value of $u$ for the next evaluation. Because our goal is function maximization, the next iteration yields "improvement" for our purposes only if the resulting value of $z$ is greater than any of the $N$ values already computed. Toward this end, let $z_{N,\max}$ denote the largest $z$ associated with the $N$ function evaluations to date, and define the improvement that would be realized from a new evaluation at $u$ to be
\[
I(u) = \begin{cases} z(u) - z_{N,\max}, & z(u) - z_{N,\max} > 0 \\ 0, & z(u) - z_{N,\max} \le 0. \end{cases}
\]
Since $z(u)$ is unknown at any $u$ other than those in the $N$-point design, $I(u)$ cannot be evaluated directly. However, under the GaSP model, $z(u)$ is a random variable with a conditional (on the first $N$ data values) distribution, and so $I(u)$ is also a random variable whose conditional distribution can be characterized. A sequential design approach based on expected improvement selects as the next $u$ the vector that maximizes the expectation of $I(u)$, conditional on all information collected through the first $N$ evaluations.

$I(u) \mid z(U)$ has a distribution that is a mixture of a truncated normal distribution and a point mass (at $I = 0$). For notational simplicity, represent the conditional standard deviation of $z(u)$ by $S(u) = \sqrt{\mathrm{Var}(z(u) \mid z(U))}$. Then the conditional expectation of $I(u)$ can be derived as
\[
E(I(u) \mid z(U)) = S(u) \left\{ \frac{\hat{z}(u) - z_{N,\max}}{S(u)} \, \Phi\!\left( \frac{\hat{z}(u) - z_{N,\max}}{S(u)} \right) + \phi\!\left( \frac{\hat{z}(u) - z_{N,\max}}{S(u)} \right) \right\},
\]
where $\Phi$ and $\phi$ denote the cumulative distribution function and the density function, respectively, of the standard normal random variable.

The algorithm described above can be altered by substituting $E(I(u) \mid z(U))$ for $\hat{z}(u)$ in the second step. The result is that, rather than always selecting the site at which $\hat{z}(u)$ is maximized, sites for which $\hat{z}(u)$ is somewhat smaller, but for which $S(u)$ is large (indicating a strong possibility of a larger value of $z$), are sometimes selected. Overall, this is a compromise between identifying the points that appear to maximize $z$ based on the best scalar-valued predictions, and points at which uncertainty is large enough that improved precision of these scalar-valued predictions is needed.

References

Johnson, M.E., L.M. Moore, and D. Ylvisaker (1990). "Minimax and Maximin Distance Designs," Journal of Statistical Planning and Inference 26, 131-148.

McKay, M.D., R.J. Beckman, and W.J. Conover (1979). "A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code," Technometrics 21, no. 2, 239-245.

Morris, M.D., and T.J. Mitchell (1995). "Exploratory Designs for Computer Experiments," Journal of Statistical Planning and Inference 43, 381-402.

Stein, M. (1987). "Large Sample Properties of Simulations Using Latin Hypercube Sampling," Technometrics 29, no. 2, 143-151.