Bayesian Approach to Optimal Design

Much of this section follows the development of the excellent review paper of Chaloner and Verdinelli (1995).

A note on notation: Recall that Silvey (and my notes on nonlinear models) make a distinction between design measures designated as η and µ, with the idea that they reflect probability over U or X_θ, respectively. This helps in Silvey's development, since the optimal design for θ1 is not generally the same as that for θ2. In this section, we discuss situations in which the value of θ is uncertain, but a single design is constructed. Accordingly, these notes retreat to labeling a design measure as η, following the notation of Chaloner and Verdinelli.

General Set-Up, and φ1

In the section on nonlinear models, we briefly mentioned so-called robust approaches to design that express prior knowledge about θ via intervals or other regions of the parameter space, and extend the traditional view of optimal design by focusing on minimax or average performance. If prior knowledge can instead be characterized by a probability distribution, this is a natural first step toward what has come to be known as Bayesian experimental design. It should be noted immediately that nothing in this approach requires that the data collected in the experiment be analyzed using Bayesian methods, although that is always a possibility.

Formal Bayesian arguments are set in terms of a utility U which, given all quantities both known and unknown, defines how well things turn out if you make decision d, or more specifically:

U(d, θ, η, y)

(We're recycling the notation U in this section; Silvey and others use it for the experimental design, but we'll follow standard convention for a different batch of literature here and use it to denote the utility function.)
In data analysis, the design and data are known, and θ is uncertain, as represented by its posterior distribution, and the Bayes decision maximizes this (posterior) expectation of utility:

d_B = d_B(y, η) = argmax_d ∫_θ U(d, θ, η, y) p(θ|y, η) dθ

In contrast to this, for a given design before it is executed, y is also uncertain, as represented by its prior distribution, which is derived from the prior for θ, the design η, and the form of the data model f(y). The expected utility, with respect to this prior uncertainty, is

EU(η) = ∫_y max_d [ ∫_θ U(d, θ, η, y) p(θ|y, η) dθ ] p(y|η) dy

The best design, from this perspective, maximizes the expectation, over the prior uncertainty of y, of the Bayes expected utility for each specific value of y:

η_B = argmax_η EU(η)

Traditionally, Bayesian arguments have featured the idea of a "decision", but more generally, d doesn't have to be an explicit argument of U. Without an explicit d, the utility can still be interpreted as any measure of value of the result of the experiment. For example,

U1 = log[ p(θ|y, η) / p(θ) ]

is a measure of how much more certain you are about θ once the experiment is completed than you were before. Note that, as is generally true for utility functions, this is defined (and generally different) for each possible value of θ and y. For analysis purposes (i.e. known y), the expectation of this quantity with respect to θ,

∫_θ log[ p(θ|y, η) / p(θ) ] p(θ|y, η) dθ

is called the Kullback-Leibler distance between the prior and posterior distributions for θ. The following two graphs show a rough depiction of how this index works. The graph on the left displays three curves: p(θ) (normal with a standard deviation of 1) as a dotted curve, p(θ|y, η) (normal with a standard deviation of 0.5) as a dashed curve, and the log of the ratio of the two densities as a solid curve.
The integral of the solid curve times the dashed curve is the Kullback-Leibler distance for this pair of distributions, and takes the value 0.3181, which is a measure of the "information gain" accompanying the reduction of uncertainty from a prior standard deviation of 1 to a posterior standard deviation of 0.5. The graph on the right is similar, but the posterior distribution here has a standard deviation of 0.25. Here, the Kullback-Leibler distance is 0.9175, reflecting the greater gain in information about θ associated with the larger reduction in standard deviation (from 1 to 0.25).

[Two graphs: p and the log-ratio plotted against θ, for posterior standard deviations of 0.5 (left panel) and 0.25 (right panel).]

For design purposes (i.e. without knowing y), we may also take the expectation with respect to y, and call this the expected Kullback-Leibler distance, or gain in Shannon information:

EU1*(η) = ∫_y ∫_θ log[ p(θ|y, η) / p(θ) ] p(θ|y, η) p(y|η) dθ dy = ∫_y ∫_θ log[ p(θ|y, η) / p(θ) ] p(y, θ|η) dθ dy

This can be simplified somewhat for design purposes, because the denominator (or, second term after the log) isn't a function of the design:

∫_y ∫_θ −log[p(θ)] p(y|θ, η) p(θ|η) dθ dy = ∫_θ −log[p(θ)] p(θ|η) dθ

where y is integrated out of the expression. But since there is no information about θ given only η (i.e. no data on which to base an inference), this quantity is not a function of the design, and so the expected Kullback-Leibler distance can be modified as:

EU1(η) = ∫_y ∫_θ log[p(θ|y, η)] p(y, θ|η) dθ dy

φ1 for Linear Models and Normal Errors

Suppose we are dealing with a standard linear model, y = x′θ + ε, where ε is normally distributed with mean 0 and variance σ². Suppose also that we adopt a normal prior to express uncertainty about θ, p(θ) = MVN(θ0, σ²R⁻¹), where σ² is also the variance of ε, but R⁻¹ can be a general positive definite matrix (i.e. not necessarily a correlation matrix).
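The Kullback-Leibler values quoted above for the two normal prior/posterior pairs can be checked directly. The short sketch below (function names are illustrative) compares a brute-force Riemann sum against the closed form for two zero-mean normals.

```python
# Check of the KL values quoted above: prior N(0, 1) against posteriors
# N(0, 0.5^2) and N(0, 0.25^2).
import math

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def kl_riemann(post_sd, prior_sd=1.0, lo=-8.0, hi=8.0, n=80001):
    # int log[p(theta|y,eta)/p(theta)] p(theta|y,eta) dtheta, by Riemann sum
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        q = normal_pdf(x, post_sd)   # posterior density
        p = normal_pdf(x, prior_sd)  # prior density
        if q > 0.0:                  # integrand -> 0 where q underflows
            total += math.log(q / p) * q * h
    return total

def kl_closed(post_sd, prior_sd=1.0):
    # KL(N(0, s1^2) || N(0, s0^2)) = log(s0/s1) + s1^2/(2 s0^2) - 1/2
    return math.log(prior_sd / post_sd) + post_sd ** 2 / (2 * prior_sd ** 2) - 0.5

print(round(kl_closed(0.5), 4), round(kl_riemann(0.5), 4))    # both 0.3181
print(round(kl_closed(0.25), 4), round(kl_riemann(0.25), 4))  # both 0.9175
```

Both routes reproduce 0.3181 for the posterior with standard deviation 0.5 and 0.9175 for standard deviation 0.25, matching the two panels described above.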
We select a design η; suppose for definiteness that it can be expressed as an exact design U in N runs, and let the N-row model matrix of x-vectors associated with the design points be denoted by X. It follows that the posterior distribution is

p(θ|y, η) = MVN( (N M(η) + R)⁻¹ (X′y + Rθ0), σ² (N M(η) + R)⁻¹ )

Under this model:

EU1(η) = −(k/2) log(2π) − k/2 + (1/2) log|σ⁻²(N M(η) + R)|

and this leads to what is sometimes called the Bayes version of D-optimality:

φ1,N(η) = log|N M(η) + R|    φ1(η) = log|M(η) + (1/N) R|

Note that we have to pretend to know σ² above, because it is part of the definition of p(θ). If σ² isn't known (the more common situation), this can be addressed by broadening the problem to include a conjugate inverse-gamma prior for σ², and the same conditional normal prior for θ|σ². In this case, the posterior distributions for θ and y are scaled-and-shifted t's, and EU1(η) is defined, but is not of simple or closed form.

Note also that for large N, φ1(η) → log|M(η)|. This limiting form reflects the fact/hope that "with enough data, the prior doesn't matter", and can be used as a basis to design experiments for any prior, so long as N is large relative to the elements of R. This also eliminates the problem of not knowing σ², because the posterior t becomes normal for large N.

Approximate φ1 for Nonlinear Models or Nonnormal Errors

While the standard linear-model-with-normal-errors problem can be developed in a Bayesian framework as shown above, after a "large N" approximation, this leads to the same result as the classical arguments. The greater appeal is to nonlinear cases where classical theory does not provide a complete solution. For nonlinear models, the φ1 criterion is generally computable, but is generally not available in closed form, and approximate versions of the problem are more popular.
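Returning to the linear-model criterion φ1(η) = log|M(η) + (1/N)R|, here is a small numerical sketch for a straight-line model y = θ0 + θ1·u + ε. The two-point design, the prior precision R, and the sample sizes below are illustrative choices only, not taken from the notes.

```python
# Sketch of the Bayes D-criterion phi_1(eta) = log|M(eta) + R/N| for a
# straight-line model with x = (1, u). Design, R, and N are illustrative.
import numpy as np

def info_matrix(points, weights):
    """Per-observation information M(eta) = sum_i w_i x_i x_i' with x = (1, u)."""
    M = np.zeros((2, 2))
    for u, w in zip(points, weights):
        x = np.array([1.0, u])
        M += w * np.outer(x, x)
    return M

def phi1(points, weights, R, N):
    # log-determinant of M(eta) + R/N, via the numerically stable slogdet
    M = info_matrix(points, weights)
    return np.linalg.slogdet(M + R / N)[1]

R = np.diag([1.0, 4.0])          # hypothetical prior precision (times 1/sigma^2)
eta = ([-1.0, 1.0], [0.5, 0.5])  # classical D-optimal two-point design on [-1, 1]

for N in (2, 10, 100, 10000):
    print(N, phi1(*eta, R, N))
# As N grows, phi_1(eta) approaches log|M(eta)| = log 1 = 0 for this design,
# illustrating the "with enough data, the prior doesn't matter" limit.
```

For this design M(η) is the 2×2 identity, so the criterion value decays to 0 as the prior contribution R/N is swamped by the data.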
Recall that the criterion we're working with is:

EU1(η) = ∫_y ∫_θ log[p(θ|y, η)] p(θ|y, η) p(y|η) dθ dy

with the first two factors of the integrand labeled (1) and the third labeled (2). The approximation argument requires two assumptions:

1. If p(θ|y, η) is approximately normal ("Bayes CLT"), then the terms labeled (1) above have a form analogous to those in the argument for the normal case.

2. If U1 depends on y (approximately) only through an asymptotically normal variate (e.g. the mle of θ), and a change of variables can "get rid of the residual data" and leave a normal form as the integrand (in the spirit of a sufficient statistic), then EU1 might be rewritten as:

EU1(η) = ∫_θ̂ ∫_θ log[p(θ|θ̂, η)] p(θ|θ̂, η) p(θ̂|η) p(θ) dθ dθ̂

Under these conditions, EU1 can then be made to look like the integral form for the linear model, integrated with respect to the prior over θ, which results in:

EU1(η) = −(k/2) log(2π) − k/2 + (1/2) ∫_θ log|σ⁻²(N M(η, θ) + R)| p(θ) dθ

where M(η, θ) is the per-observation information matrix, as used previously for nonlinear models, and so

φ1,N(η) = ∫_θ log|N M(η, θ) + R| p(θ) dθ    φ1(η) = ∫_θ log|M(η, θ) + (1/N) R| p(θ) dθ

The approximation required in this argument is of the same order as the assumption allowing us to ignore the prior, so (again, conveniently)

φ1(η) = ∫_θ log|M(η, θ)| p(θ) dθ

This is intuitively reasonable, even from a non-Bayesian perspective, because it is the prior-knowledge average of the criterion that would have been suggested by the classical theory.

φ2, Bayesian A-Optimality

By analogy to the idea of classical A-optimality, now define a second utility function

U2 = −(θ − θ̂)′ A (θ − θ̂)

where A is symmetric and positive definite, and the estimator θ̂ is defined to minimize the expectation of this loss. Then

EU2(η) = −∫_y ∫_θ (θ − θ̂)′ A (θ − θ̂) p(y, θ|η) dθ dy
Then under the linear model and normal p(θ),

EU2(η) = −σ² trace A[N M(η) + R]⁻¹

so

φ2,N(η) = −trace A[N M(η) + R]⁻¹    φ2(η) → −trace A[M(η) + (1/N) R]⁻¹

For the conjugate inverse-gamma prior setup for σ², U2 still leads to this same criterion function (unlike for D-optimality). Under the nonlinear model, we have the approximate criterion:

EU2(η) = c1 + c2 ∫_θ −trace A[N M(η, θ) + R]⁻¹ p(θ) dθ

where again, M(η, θ) is the per-observation information matrix, possibly apart from any constant multipliers, so

φ2,N(η) = ∫_θ −trace A[N M(η, θ) + R]⁻¹ p(θ) dθ    φ2(η) = ∫_θ −trace A[M(η, θ) + (1/N) R]⁻¹ p(θ) dθ

and for large N

φ2(η) = ∫_θ −trace A M(η, θ)⁻¹ p(θ) dθ

Equivalence Theory

In short, for linear models with normal errors, where Bayes and classical criteria are the same, the Frechet derivatives are also the same, and equivalence theory is identical for either approach. The main point of interest is in nonlinear cases. For a "classical criterion" φC(η, θ), define a parallel "Bayes criterion"

φB(η) = ∫_θ φC(η, θ) p(θ) dθ

Then by definition

FφB(η1, η2) = lim_{ε→0} (1/ε) [ φB((1 − ε)M(η1, θ) + εM(η2, θ)) − φB(M(η1, θ)) ]
= lim_{ε→0} (1/ε) [ ∫_θ φC((1 − ε)M(η1, θ) + εM(η2, θ)) p(θ) dθ − ∫_θ φC(M(η1, θ)) p(θ) dθ ]
= ∫_θ { lim_{ε→0} (1/ε) [ φC((1 − ε)M(η1, θ) + εM(η2, θ)) − φC(M(η1, θ)) ] } p(θ) dθ    (if lim and integral can switch)
= ∫_θ FφC(η1, η2, θ) p(θ) dθ

In particular:

• Fφ1(η1, η2) = E_θ [ trace M(η2, θ) M(η1, θ)⁻¹ ] − k
• Fφ2(η1, η2) = E_θ [ trace A M(η1, θ)⁻¹ M(η2, θ) M(η1, θ)⁻¹ ] + φ2(η1)

The General Equivalence Theorem for Bayesian criteria then follows exactly the same form as Whittle's version. The interesting and important case is for nonlinear models, where both criteria and derivatives are the θ-expectations of their classical counterparts.
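The claim that the Bayes Frechet derivative is the θ-expectation of the classical one can be checked numerically for Fφ1. The sketch below uses a one-parameter exponential decay model y = exp(−θu) + ε and a gamma prior sample; the model, designs, and prior are hypothetical choices for illustration, not taken from the notes.

```python
# Check: E_theta[trace M(eta2) M(eta1)^-1] - k  equals the finite-difference
# directional derivative of phi_B(eta) = E_theta log M(eta, theta), for a
# one-parameter model (k = 1), so M is scalar and trace is just the ratio.
import numpy as np

rng = np.random.default_rng(0)
thetas = rng.gamma(shape=4.0, scale=0.25, size=2000)  # hypothetical prior sample

def M(design, theta):
    # Scalar information for y = exp(-theta u) + eps:  x_theta = -u exp(-theta u)
    pts, wts = np.asarray(design[0]), np.asarray(design[1])
    x = -pts * np.exp(-theta * pts)
    return float(np.sum(wts * x ** 2))

eta1 = ([0.5, 2.0], [0.5, 0.5])  # design being perturbed
eta2 = ([1.0], [1.0])            # perturbing one-point design

# Prior-averaged classical derivative: E_theta[M(eta2)/M(eta1)] - k
expect = np.mean([M(eta2, t) / M(eta1, t) for t in thetas]) - 1.0

def mix(e):
    # phi_B evaluated at the mixture (1 - e) M(eta1, theta) + e M(eta2, theta)
    return np.mean([np.log((1 - e) * M(eta1, t) + e * M(eta2, t)) for t in thetas])

eps = 1e-6
fdiff = (mix(eps) - mix(0.0)) / eps  # finite-difference directional derivative

print(expect, fdiff)
```

The two printed numbers agree to finite-difference accuracy, illustrating the interchange of limit and integral used in the derivation above.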
Support of the Optimal Design

Recall that for the classical set-up, we were assured that an optimal design could be found on a limited number of support points by Caratheodory's Theorem, which essentially says that any M(η, θ) that corresponds to any design measure η (and a specific value of θ) can be generated by a continuous design on a limited number of support points. This is sufficient to say that there is an optimal design on the indicated number of support points, since φ is a function only of M(η, θ). Unfortunately, this theory does not carry over into the Bayesian case. Here, the Bayesian criterion is the expectation of the classical one, φB(η) = E_θ φC(M(η, θ)), which does not match the structure of Caratheodory's argument. Chaloner and Larntz (1986, 1989) gave examples of how, when the prior distribution of θ has support only over a small region, the Bayes optimal designs often have the same number of support points as locally (classical) optimal designs, and of how the number of support points increases as the prior becomes more dispersed. In some cases, Bayes optimal designs do not have finite numbers of support points.

Construction Algorithms

Algorithms for constructing near-optimal designs can be designed in much the same manner as was described in the section on classical algorithms, but new complications arise associated with taking expectations with respect to θ. At a "high level" (and therefore leaving out the most difficult computational aspects), the general iterative algorithm discussed before requires only replacement of the Frechet derivative with its expectation:

1. begin with an arbitrary η0 → η_current
2. find u_add = argmax_{u∈U} E_θ Fφ(M(η_current, θ), x_θ x_θ′)
3. if E_θ Fφ(M(η_current, θ), x_{θ,add} x_{θ,add}′) ≤ 0, STOP ... η_current is φB-optimal
4. replace η_current by η_next = (1 − α)η_current + α η_add (η_add puts all mass on u_add) for some α ∈ (0, 1)
5.
return to step 2

Computing the expectation is generally accomplished by evaluating the average classical Frechet derivative at a random sample of θ values drawn from the prior, or a probability-weighted average evaluated over a grid of θ values. Depending on the problem and computing resources, some memory-versus-speed trade-offs may need to be addressed. Specifically, the number of x_θ vectors that will be encountered in the calculation is the size of the collection of θ values used times the size of U. If there is sufficient memory to "pre-compute" and store all of these before the iterations, this can save execution time. If memory is more limited, values of x_θ may need to be repeatedly computed within the iterative loop as different designs are considered.

As a demonstration of the kind of designs that can be expected, an implementation of the "W" version of this algorithm for D-optimality was written and executed for the chemical kinetics model described in the section on algorithms:

y = θ1 θ3 u1 / (1 + θ1 u1 + θ2 u2) + ε

In the previous demonstration, "design values" of the three parameters were taken to be θ1 = 2.9, θ2 = 12.2, and θ3 = 0.69. In this exercise, they were given independent, truncated normal distributions, in which support was limited to non-negative values, and for which the mean and standard deviation were both 2.9 for θ1, 12.2 for θ2, and 0.69 for θ3. Using a stopping δ of 0.1 and starting with an initial random design on 10 points (again), the algorithm stopped after 84 additional equal-mass points had been added. The following graphs show the approximate locations of all added design points (slightly jittered), and an image map of the square root of probability mass for the final design. Note that, in contrast to the fixed-θ case, it seems clear here that the mass of the optimal design measure is not limited to 3 "points" in (u1, u2).
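A minimal sketch of steps 1-5, with the prior expectation computed from a random sample of θ values and all x_θ vectors pre-computed, is given below for this same kinetics model. The grid, prior-sample size, stopping δ, and the tiny ridge standing in for R/N are all illustrative choices, not the settings used in the demonstration.

```python
# Sketch of the vertex-direction ("W") algorithm for the Bayes D-criterion and
# the chemical kinetics model y = th1*th3*u1 / (1 + th1*u1 + th2*u2) + eps,
# adding equal-mass points with alpha = 1/(n+1). Settings are illustrative.
import numpy as np

rng = np.random.default_rng(1)

def truncnorm(mean, sd, n):
    # Normal(mean, sd) truncated to non-negative support, by rejection
    x = rng.normal(mean, sd, size=4 * n)
    return x[x >= 0][:n]

def grad(th, u):
    # x_theta: gradient of the mean function with respect to (th1, th2, th3)
    t1, t2, t3 = th
    u1, u2 = u
    D = 1.0 + t1 * u1 + t2 * u2
    return np.array([t3 * u1 * (1 + t2 * u2) / D ** 2,
                     -t1 * t3 * u1 * u2 / D ** 2,
                     t1 * u1 / D])

k, n_th = 3, 100
thetas = np.column_stack([truncnorm(m, m, n_th) for m in (2.9, 12.2, 0.69)])
grid = np.array([(a, b) for a in np.linspace(0.2, 3.0, 15)
                 for b in np.linspace(0.0, 3.0, 15)])
# Pre-compute every x_theta vector: G[t, j, :] = grad at prior draw t, point j
G = np.array([[grad(th, u) for u in grid] for th in thetas])

w = np.zeros(len(grid))
w[rng.choice(len(grid), size=10, replace=False)] = 0.1  # random 10-point start
n_pts, delta = 10, 0.5
for _ in range(200):
    # M(eta, theta) for each prior draw; the ridge stands in for R/N
    Ms = np.einsum('j,tja,tjb->tab', w, G, G) + 1e-9 * np.eye(k)
    Minv = np.linalg.inv(Ms)
    # Expected Frechet derivative E_theta[x' M^-1 x] - k at every grid point
    d = np.einsum('tja,tab,tjb->j', G, Minv, G) / n_th - k
    j = int(np.argmax(d))
    if d[j] <= delta:
        break                       # eta_current is (near) phi_B-optimal
    n_pts += 1
    w *= 1.0 - 1.0 / n_pts          # step 4 with alpha = 1/(n+1)
    w[j] += 1.0 / n_pts

print('support points:', int(np.sum(w > 1e-6)),
      'max expected derivative:', round(float(d[j]), 3))
```

Here the design measure is carried as a weight vector over the grid, so the "point added" in step 4 is just a reweighting; pre-computing G trades memory for speed exactly as discussed above.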
[Two graphs: "W-algorithm, all points", showing the jittered added design points in (u1, u2), and "W-algorithm, sqrt(mass)", an image map of the square root of probability mass for the final design over (u1, u2).]

Postscript

Bayesian experimental design, as very briefly outlined here, seems a reasonable approach to constructing designs for nonlinear models when knowledge of θ is imperfect – which is essentially always true in reality. In some cases, it provides useful continuous design measures that can be reasonably "rounded" to discrete designs for application. In others (such as the preceding example), the Bayes-optimal design measure cannot be so easily approximated by a discrete design. In any case, while the relationship between the theory for Bayes and classical design is interesting, calculation for the Bayes version is far more demanding. Whether it is better to spend that computing effort in constructing a single Bayesian design, or in constructing several fixed-θ designs (that are more likely to be "roundable") in the spirit of trying to find one design that does well across a spectrum of parameter values, may still be an open question.

References

Chaloner, K., and I. Verdinelli (1995). "Bayesian Experimental Design: A Review," Statistical Science 10, 273-304.