Bayesian Approach to Optimal Design

Much of this section follows the development of the excellent review paper of Chaloner
and Verdinelli (1995).
A note on notation: Recall that Silvey (and my notes on nonlinear models) make a
distinction between design measures designated as η and µ, with the idea that they reflect
probability over U or Xθ, respectively. This helps in Silvey's development since the optimal
design for θ1 is not generally the same as that for θ2. In this section, we discuss situations in
which the value of θ is uncertain, but a single design is constructed. Accordingly, these notes
retreat to labeling a design measure as η, following the notation of Chaloner and Verdinelli.
General Set-Up, and φ1
In the section on nonlinear models, we briefly mentioned so-called robust approaches to
design that express prior knowledge about θ via intervals or other regions of the parameter
space, and extend the traditional view of optimal design by focusing on minimax or average
performance. If prior knowledge can instead be characterized by a probability distribution,
this is a natural first step toward developing what has come to be known as Bayesian experimental
design. It should be immediately noted that nothing in this approach requires that the data
collected in the experiment be analyzed using Bayesian methods, although that is
always a possibility.
Formal Bayesian arguments are set in terms of a utility U which, given all quantities
both known and unknown, defines how well things turn out if you make decision d, or more
specifically:
U (d, θ, η, y)
(We’re recycling the notation U in this section; Silvey and others use it for the experimental
design, but we’ll follow standard convention for a different batch of literature here and use
it to denote the utility function.) In data analysis, the design and data are known, and θ
is uncertain, as represented by its posterior distribution, and the Bayes decision maximizes
(this) expectation of utility:
dB(y, η) = argmax_d ∫θ U(d, θ, η, y) p(θ|y, η) dθ
In contrast to this, for a given design, y is also uncertain before the experiment is executed,
as represented by its prior (predictive) distribution, which is derived from the prior for θ, the
design η, and the form of the data model f(y|θ, η). The expected utility, with respect to this
prior uncertainty, is

EU(η) = ∫y { max_d ∫θ U(d, θ, η, y) p(θ|y, η) dθ } p(y|η) dy
The best design, from this perspective:
• maximizes the expectation, over the prior uncertainty of y, of the
• Bayes expected utility for a specific value of y:

ηB = argmax_η EU(η)
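
As a small concrete illustration of this two-stage expectation (my own sketch, not an example from Chaloner and Verdinelli), consider a conjugate normal-mean problem in which the "design" is simply the sample size n, the utility is squared-error, U(d, θ) = −(θ − d)², and the Bayes decision is the posterior mean. In this case EU(η) has the closed form −1/(1/τ² + n/σ²), which the Monte Carlo estimate below should reproduce.

    import numpy as np

    rng = np.random.default_rng(1)

    def expected_utility(n, mu0=0.0, tau=1.0, sigma=1.0, n_sim=200_000):
        """Monte Carlo estimate of EU(eta) for a normal-mean problem.

        Prior: theta ~ N(mu0, tau^2); data: y_1..y_n ~ N(theta, sigma^2);
        utility: U(d, theta) = -(theta - d)^2; Bayes decision = posterior mean.
        """
        theta = rng.normal(mu0, tau, size=n_sim)               # draw theta from the prior
        ybar = rng.normal(theta, sigma / np.sqrt(n))           # sufficient statistic ybar | theta
        prec = 1.0 / tau**2 + n / sigma**2                     # posterior precision
        d_bayes = (mu0 / tau**2 + n * ybar / sigma**2) / prec  # posterior mean = Bayes decision
        return np.mean(-(theta - d_bayes) ** 2)

    for n in (1, 4, 16):
        # Monte Carlo estimate vs. closed form -1/(1/tau^2 + n/sigma^2) with tau = sigma = 1
        print(n, expected_utility(n), -1.0 / (1.0 + n))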
Traditionally, Bayesian arguments have featured the idea of a "decision", but more generally, "d" doesn't have to be an explicit argument of U. Without an explicit d, the utility can
still be interpreted as any measure of the value of the result of the experiment. For example,

U1 = log[ p(θ|y, η) / p(θ) ]
is a measure of how much more certain you are about θ once the experiment is completed
than you were before. Note that, as is generally true for utility functions, this is defined (and
generally different) for each possible value of θ and y. For analysis purposes (i.e. known y),
the expectation of this quantity with respect to θ:
∫θ log[ p(θ|y, η) / p(θ) ] p(θ|y, η) dθ
is called the Kullback-Leibler distance between the prior and posterior distributions for θ.
The following two graphs show a rough depiction of how this index works. The graph on the
left displays three curves, p(θ) (normal with a standard deviation of 1) as a dotted curve,
p(θ|y, η) (normal with a standard deviation of 0.5) as a dashed curve, and the log of the ratio
of the two densities as a solid curve. The integral of the solid curve times the dashed curve
is the Kullback-Leibler distance for this pair of distributions, and takes the value of 0.3181,
which is a measure of “information gain” accompanying the reduction of uncertainty from
a prior standard deviation of 1, to a posterior standard deviation of 0.5. The graph on the
right is similar, but the posterior distribution here has a standard deviation of 0.25. Here,
the Kullback-Leibler distance is 0.9175, reflecting the greater gain in information about θ
associated with the larger reduction in standard deviation (from 1 to 0.25).

[Two graphs: p(θ) (dotted), p(θ|y, η) (dashed), and the log-ratio of the two densities (solid), plotted against theta with "p & log−ratio" on the vertical axis; left panel with posterior standard deviation 0.5, right panel with posterior standard deviation 0.25.]
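
The two Kullback-Leibler values quoted above are easy to reproduce numerically. The following sketch (an illustration added here, not part of the original figures) evaluates the integral by quadrature and checks it against the closed form for two normal densities with a common mean.

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def kl_normal(sd_post, sd_prior=1.0, mean=0.0):
        """KL distance between posterior N(mean, sd_post^2) and prior N(mean, sd_prior^2),
        computed by integrating log[p(theta|y)/p(theta)] against the posterior density."""
        post, prior = norm(mean, sd_post), norm(mean, sd_prior)
        integrand = lambda t: (post.logpdf(t) - prior.logpdf(t)) * post.pdf(t)
        value, _ = quad(integrand, mean - 10 * sd_prior, mean + 10 * sd_prior)
        return value

    def kl_normal_closed(sd_post, sd_prior=1.0):
        # closed form for equal means: log(sd_prior/sd_post) + sd_post^2/(2 sd_prior^2) - 1/2
        return np.log(sd_prior / sd_post) + sd_post**2 / (2 * sd_prior**2) - 0.5

    print(kl_normal(0.5), kl_normal_closed(0.5))    # approximately 0.3181
    print(kl_normal(0.25), kl_normal_closed(0.25))  # approximately 0.9175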
For design purposes (i.e. without knowing y), we may also take the expectation with
respect to y, and call this the expected Kullback-Leibler distance, or gain in Shannon information:
EU1*(η) = ∫y ∫θ log[ p(θ|y, η) / p(θ) ] p(θ|y, η) p(y|η) dθ dy = ∫y ∫θ log[ p(θ|y, η) / p(θ) ] p(y, θ|η) dθ dy
This can be simplified somewhat for design purposes, because the denominator (or, equivalently, the
second term after taking the log) isn't a function of the design:
∫y ∫θ −log[p(θ)] p(y|θ, η) p(θ|η) dθ dy = ∫θ −log[p(θ)] p(θ|η) dθ
where y is integrated out of the expression. But since there is no information about θ given
only η (i.e., no data on which to base an inference), p(θ|η) = p(θ) and this quantity is not a
function of the design, so for design purposes the expected Kullback-Leibler distance can be replaced by:
EU1(η) = ∫y ∫θ log[p(θ|y, η)] p(y, θ|η) dθ dy
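
For a simple conjugate case (a normal-mean model of my own choosing, with the "design" again taken to be the sample size n), EU1(η) can be estimated by drawing (θ, y) from their joint distribution and averaging log p(θ|y, η); subtracting the design-free term E[log p(θ)] then gives the expected Kullback-Leibler gain, which should increase with n.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    def expected_kl_gain(n, tau=1.0, sigma=1.0, n_sim=200_000):
        """Monte Carlo estimate of E[log p(theta|y)] - E[log p(theta)] (expected KL gain)
        for a conjugate normal-mean model with prior mean 0; the 'design' is the sample size n."""
        theta = rng.normal(0.0, tau, size=n_sim)
        ybar = rng.normal(theta, sigma / np.sqrt(n))
        v_post = 1.0 / (1.0 / tau**2 + n / sigma**2)       # posterior variance
        m_post = v_post * (n * ybar / sigma**2)            # posterior mean (prior mean is 0)
        gain = norm.logpdf(theta, m_post, np.sqrt(v_post)) - norm.logpdf(theta, 0.0, tau)
        return gain.mean()

    for n in (1, 4, 16):
        # closed form of the gain is 0.5*log(1 + n*tau^2/sigma^2); here tau = sigma = 1
        print(n, expected_kl_gain(n), 0.5 * np.log(1 + n))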
φ1 for Linear Models and Normal Errors
Suppose we are dealing with a standard linear model, y = x′θ + ε, where ε is normally
distributed with mean 0 and variance σ². Suppose also that we adopt a normal prior to
express uncertainty about θ, p(θ) = MVN(θ0, σ²R⁻¹), where σ² is also the variance
of ε, but R⁻¹ can be a general positive definite matrix (i.e. not necessarily a correlation
matrix). We select a design η; suppose for definiteness that it can be expressed as an
exact design U in N runs, and let the N-row model matrix of x-vectors associated with
the design points be denoted by X. It follows that the posterior distribution is

p(θ|y, η) = MVN( (N M(η) + R)⁻¹ (X′y + R θ0) , σ² (N M(η) + R)⁻¹ ).

Under this model:
EU1(η) = −(k/2) log(2π) − k/2 + (1/2) log|σ⁻²(N M(η) + R)|

and this leads to what is sometimes called the Bayes version of D-optimality:

φ1,N(η) = log|N M(η) + R|     φ1(η) = log|M(η) + (1/N) R|
Note that we have to pretend to know σ 2 above, because it is part of the definition of
p(θ). If σ 2 isn’t known (the more common situation), this can be addressed by broadening
the problem to include a conjugate inverse-gamma prior for σ 2 , and the same conditional
normal prior for θ|σ². In this case, posterior distributions for θ and y are scaled-and-shifted
t's, and EU1(η) is defined, but is not of simple or closed form.
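
A brief numerical sketch (my own, for an assumed simple-regression setting with x′ = (1, u)) of the criterion φ1(η) = log|M(η) + R/N| for exact designs: M(η) = X′X/N, and the Bayes and classical log|M(η)| values can be compared directly.

    import numpy as np

    def phi1_linear(X, R):
        """Bayes D-criterion log|M(eta) + R/N| for an exact design with N-row model matrix X
        and prior matrix R (so that the prior covariance of theta is sigma^2 R^{-1})."""
        N = X.shape[0]
        M = X.T @ X / N                      # per-observation information matrix
        return np.linalg.slogdet(M + R / N)[1]

    # two candidate 4-run designs for a straight line on u in [-1, 1]
    X_spread = np.array([[1, -1], [1, -1], [1, 1], [1, 1]], dtype=float)
    X_center = np.array([[1, -0.2], [1, 0.0], [1, 0.0], [1, 0.2]], dtype=float)
    R = np.eye(2)                            # an assumed prior precision-type matrix

    for name, X in [("spread", X_spread), ("center", X_center)]:
        M = X.T @ X / X.shape[0]
        print(name, phi1_linear(X, R), np.linalg.slogdet(M)[1])  # Bayes vs. classical log|M|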
Note also that for large N, φ1(η) → log|M(η)|. This limiting form reflects the fact/hope
that "with enough data, the prior doesn't matter", and can be used as a basis to design
experiments for any prior – so long as N is large relative to the elements of R. This also
eliminates the problem of not knowing σ², because the posterior t becomes normal for
large N.
Approximate φ1 for Nonlinear Models or Nonnormal Errors
While the standard linear model with normal errors can be developed in a Bayesian
framework as shown above, after a "large N" approximation this leads to the same result
as the classical arguments. The greater appeal is to nonlinear cases where classical
theory does not provide a complete solution. For nonlinear models, the φ1 criterion is computable
in principle, but is generally not available in closed form, and approximate versions
of the problem are more popular. Recall that the criterion we're working with is:
EU1(η) = ∫y ∫θ log[p(θ|y, η)] p(θ|y, η) p(y|η) dθ dy

where log[p(θ|y, η)] and p(θ|y, η) are the terms labeled (1), and p(y|η) is the term labeled (2).
The approximation argument requires two assumptions:
1. If p(θ|y, η) is approximately normal (“Bayes CLT”), then the terms labeled (1) above
have analogous form to those in the argument for the normal case.
2. If U1 depends on y (approximately) only through an asymptotically normal variate
(e.g. the mle of θ), and a change of variables can “get rid of the residual data” and
leave a normal form as the integrand (in the spirit of a sufficient statistic), then EU1
might be rewritten as:
EU1(η) = ∫θ̂ ∫θ log[p(θ|θ̂, η)] p(θ|θ̂, η) p(θ̂|η) p(θ) dθ dθ̂
Under these conditions, EU1 can then be made to look like the integral form for the linear
model, integrated with respect to the prior over θ, which results in:
EU1(η) = −(k/2) log(2π) − k/2 + (1/2) ∫θ log|σ⁻²(N M(η, θ) + R)| p(θ) dθ
where M(η, θ) is the per-observation information matrix, as used previously for nonlinear
models, and so
φ1,N(η) = ∫θ log|N M(η, θ) + R| p(θ) dθ     φ1(η) = ∫θ log|M(η, θ) + (1/N) R| p(θ) dθ
The approximation required in this argument is of the same order as the assumption allowing
us to ignore the prior, so (again, conveniently)
φ1(η) = ∫θ log|M(η, θ)| p(θ) dθ
This is intuitively reasonable, even from a non-Bayesian perspective, because it is the prior-knowledge average of the criterion that would have been suggested by the classical theory.
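
As a minimal sketch of this prior-averaged criterion (using a one-parameter exponential-decay mean function of my own choosing, f(u, θ) = exp(−θu), not a model from these notes), the per-observation information is M(η, θ) = Σᵢ wᵢ (∂f(uᵢ, θ)/∂θ)², and φ1(η) is approximated by averaging log M(η, θ) over draws from the prior.

    import numpy as np

    rng = np.random.default_rng(3)

    def info_1param(weights, u, theta):
        """Per-observation information for mean f(u, theta) = exp(-theta*u):
        M(eta, theta) = sum_i w_i * (df/dtheta at u_i)^2, with df/dtheta = -u*exp(-theta*u)."""
        g = -u * np.exp(-theta * u)
        return np.sum(weights * g**2)

    def phi1_bayes(weights, u, theta_draws):
        """Monte Carlo version of phi1(eta) = E_theta[ log M(eta, theta) ]."""
        return np.mean([np.log(info_1param(weights, u, t)) for t in theta_draws])

    theta_draws = rng.gamma(shape=4.0, scale=0.25, size=2000)   # an assumed prior with mean 1.0
    u_grid = np.array([0.5, 1.0, 2.0])
    designs = {"all mass at u=1": np.array([0.0, 1.0, 0.0]),
               "equal mass":      np.array([1/3, 1/3, 1/3])}
    for name, w in designs.items():
        print(name, phi1_bayes(w, u_grid, theta_draws))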
φ2 , Bayesian A-Optimality
By analogy to the idea of classical A-optimality, now define a second utility function

U2 = −(θ − θ̂)′ A (θ − θ̂)

EU2(η) = − ∫y ∫θ (θ − θ̂)′ A (θ − θ̂) p(y, θ|η) dθ dy

where A is symmetric and positive definite, and the estimator θ̂ is defined to minimize the
expectation of this loss. Then under the linear model and normal p(θ),
EU2(η) = −σ² trace{A [N M(η) + R]⁻¹}

so

φ2,N(η) = −trace{A [N M(η) + R]⁻¹}     φ2(η) = −trace{A [M(η) + (1/N) R]⁻¹}

For the conjugate inverse-gamma prior setup for σ², U2 still leads to this same criterion
function (unlike for D-optimality).
Under the nonlinear model, we have the approximate criterion:
EU2(η) = c1 + c2 ∫θ −trace{A [N M(η, θ) + R]⁻¹} p(θ) dθ
where again, M(η, θ) is the per-observation information matrix, possibly apart from any
constant multipliers, so
φ2,N(η) = ∫θ −trace{A [N M(η, θ) + R]⁻¹} p(θ) dθ     φ2(η) = ∫θ −trace{A [M(η, θ) + (1/N) R]⁻¹} p(θ) dθ
and for large N
φ2(η) = ∫θ −trace{A M(η, θ)⁻¹} p(θ) dθ
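
The same prior-averaging applies to φ2. A short sketch (again a toy model of my own, f(u, θ) = θ1 exp(−θ2 u), with A = I) computes −Eθ trace{A M(η, θ)⁻¹} over a sample from an assumed prior.

    import numpy as np

    rng = np.random.default_rng(4)

    def info_matrix(weights, u, theta):
        """Per-observation information for f(u, theta) = theta1*exp(-theta2*u):
        M(eta, theta) = sum_i w_i g(u_i) g(u_i)', where g is the gradient of f in theta."""
        t1, t2 = theta
        e = np.exp(-t2 * u)
        G = np.column_stack([e, -t1 * u * e])     # rows are gradient vectors g(u_i)'
        return (G * weights[:, None]).T @ G

    def phi2_bayes(weights, u, theta_draws, A=np.eye(2)):
        """phi2(eta) = -E_theta[ trace(A M(eta, theta)^{-1}) ], estimated over prior draws."""
        vals = [-np.trace(A @ np.linalg.inv(info_matrix(weights, u, theta)))
                for theta in theta_draws]
        return np.mean(vals)

    theta_draws = np.column_stack([rng.normal(1.0, 0.2, 2000),           # assumed prior, theta1
                                   np.abs(rng.normal(1.0, 0.3, 2000))])  # and theta2 (positive)
    u_grid = np.array([0.2, 1.0, 2.5])
    w = np.array([0.5, 0.25, 0.25])
    print(phi2_bayes(w, u_grid, theta_draws))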
Equivalence Theory
In short, for linear models with normal errors, where Bayes and Classical criteria are the
same, the Frechet derivatives are also the same, and equivalence theory is identical for either
approach. The main point of interest is in nonlinear cases.
For a “classical criterion”, φC (η, θ), define a parallel “Bayes criterion”
φB(η) = ∫θ φC(η, θ) p(θ) dθ
Then by definition
FφB(η1, η2)
  = lim(ε→0) (1/ε) [φB((1 − ε)M(η1, θ) + ε M(η2, θ)) − φB(M(η1, θ))]
  = lim(ε→0) (1/ε) [∫θ φC((1 − ε)M(η1, θ) + ε M(η2, θ)) p(θ) dθ − ∫θ φC(M(η1, θ)) p(θ) dθ]
  = ∫θ lim(ε→0) (1/ε) [φC((1 − ε)M(η1, θ) + ε M(η2, θ)) − φC(M(η1, θ))] p(θ) dθ     (if lim and ∫ can be interchanged)
  = ∫θ FφC(η1, η2, θ) p(θ) dθ
In particular:
• Fφ1(η1, η2) = Eθ trace{M(η2, θ) M(η1, θ)⁻¹} − k
• Fφ2(η1, η2) = Eθ trace{A M(η1, θ)⁻¹ M(η2, θ) M(η1, θ)⁻¹} + φ2(η1)
The General Equivalence Theorem for Bayesian criteria then follows exactly the same form
as Whittle’s version. The interesting and important case is for nonlinear models, where both
criteria and derivatives are the θ-expectations of their classical counterparts.
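
A quick numerical check (my own, reusing the two-parameter toy model from the φ2 sketch) that the Bayes Fréchet derivative for φ1 is the θ-expectation of its classical counterpart: compare Eθ trace{M(η2, θ)M(η1, θ)⁻¹} − k with a finite-difference derivative of φ1 along the path (1 − ε)η1 + εη2.

    import numpy as np

    rng = np.random.default_rng(5)

    def info_matrix(weights, u, theta):
        # per-observation information for f(u, theta) = theta1*exp(-theta2*u)
        t1, t2 = theta
        e = np.exp(-t2 * u)
        G = np.column_stack([e, -t1 * u * e])
        return (G * weights[:, None]).T @ G

    def phi1_bayes(weights, u, theta_draws):
        # E_theta[ log|M(eta, theta)| ]
        return np.mean([np.linalg.slogdet(info_matrix(weights, u, t))[1] for t in theta_draws])

    def frechet_phi1_bayes(w1, w2, u, theta_draws, k=2):
        # E_theta[ trace(M(eta2, theta) M(eta1, theta)^{-1}) ] - k
        vals = [np.trace(info_matrix(w2, u, t) @ np.linalg.inv(info_matrix(w1, u, t)))
                for t in theta_draws]
        return np.mean(vals) - k

    theta_draws = np.column_stack([rng.normal(1.0, 0.2, 500), np.abs(rng.normal(1.0, 0.3, 500))])
    u = np.array([0.2, 1.0, 2.5])
    w1 = np.array([0.4, 0.4, 0.2])
    w2 = np.array([0.0, 0.0, 1.0])            # eta2 puts all mass on u = 2.5

    eps = 1e-5
    finite_diff = (phi1_bayes((1 - eps) * w1 + eps * w2, u, theta_draws)
                   - phi1_bayes(w1, u, theta_draws)) / eps
    print(frechet_phi1_bayes(w1, w2, u, theta_draws), finite_diff)   # should agree closely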
Support of the Optimal Design
Recall that for the classical set-up, we were assured that an optimal design could be
found on a limited number of support points by Caratheodory’s Theorem, which essentially
says that any M(η, θ) corresponding to a design measure η and a specific value of θ
can be generated by a continuous design on a limited number of support points. This is
sufficient to say that there is an optimal design on the indicated number of support points
since φ is a function only of M(η, θ).
Unfortunately, this theory does not carry over into the Bayesian case. Here, the Bayesian
criterion is the expectation of the classical one, φB (η) = Eθ φC (M(η, θ)), which does not
match the structure of Caratheodory’s argument. Chaloner and Larntz (1986, 1989) gave
examples of how, when the prior distribution of θ has support only over a small region, the
Bayes optimal designs often have the same number of support points as locally (classical)
optimal designs, and that the number of support points increases as the prior becomes more
dispersed. In some cases, Bayes optimal designs do not have finite numbers of support points.
Construction Algorithms
Algorithms for constructing near-optimal designs can be designed in much the same
manner as was described in the section on classical algorithms, but new complications arise
associated with taking expectations with respect to θ. At a “high level” (and therefore leaving out the most difficult computational aspects), the general iterative algorithm discussed
before requires only replacement of the Frechet derivative with its expectation:
1. begin with an arbitrary η0 → ηcurrent
2. find uadd = argmax_{u∈U} Eθ Fφ(M(ηcurrent, θ), xθ xθ′)
3. if Eθ Fφ(M(ηcurrent, θ), xθ,add xθ,add′) ≤ 0, STOP ... ηcurrent is φB-optimal
4. replace ηcurrent by
   ηnext = (1 − α) ηcurrent + α ηadd     (ηadd puts all mass on uadd)
   for some α ∈ (0, 1)
5. return to step 2
Computing the expectation is generally accomplished by averaging the classical
Frechet derivative over a random sample of θ values drawn from the prior, or by a probability-weighted
average evaluated over a grid of θ values. Depending on the problem and computing
resources, some memory-versus-speed trade-offs may need to be addressed. Specifically, the
number of xθ vectors that will be encountered in the calculation is the size of the collection
of θ values used times the size of U. If there is sufficient memory to “pre-compute” and store
all of these before the iterations, this can save execution time. If memory is more limited,
values of xθ may need to be repeatedly computed within the iterative loop as different
designs are considered.
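
A rough sketch of this scheme is given below, under assumptions of my own: D-optimality, a finite candidate grid U, a fixed prior sample of θ values, the simple step size α = 1/(m + 2), and the stopping rule applied with a tolerance δ as described above. The classical derivative used is Fφ1(M, x x′) = x′M⁻¹x − k, and the xθ vectors are pre-computed (the memory-heavy option just discussed).

    import numpy as np

    def bayes_w_algorithm(grad, U, theta_samples, delta=0.1, w0=None, max_iter=5000):
        """Bayesian D-optimal design by a W (vertex-direction) algorithm on a finite grid U.

        grad(u, theta) returns x_theta(u), the gradient of the mean function; the design is a
        weight vector over the rows of U.  The algorithm stops when the prior-averaged
        Frechet derivative max_u E_theta[ x' M(eta,theta)^{-1} x ] - k falls below delta.
        """
        n_cand, n_theta = len(U), len(theta_samples)
        k = len(grad(U[0], theta_samples[0]))
        # pre-compute x_theta(u) for every candidate point and every prior draw
        X = np.array([[grad(u, t) for t in theta_samples] for u in U])   # (n_cand, n_theta, k)
        w = np.full(n_cand, 1.0 / n_cand) if w0 is None else np.asarray(w0, float)
        for m in range(max_iter):
            # information matrices M(eta, theta) for each prior draw
            Ms = np.einsum('c,ctk,ctl->tkl', w, X, X)
            Minv = np.linalg.inv(Ms)                                     # (n_theta, k, k)
            # prior-averaged directional derivative at every candidate point
            d = np.einsum('ctk,tkl,ctl->c', X, Minv, X) / n_theta - k
            j = np.argmax(d)
            if d[j] <= delta:
                return w, m                                              # near phi_B-optimal
            alpha = 1.0 / (m + 2)
            w = (1 - alpha) * w
            w[j] += alpha
        return w, max_iter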
As a demonstration of the kind of designs that can be expected, an implementation of the
“W” version of this algorithm for D-optimality was written and executed for the chemical
kinetics model described in the section on algorithms:
y = θ1 θ3 u1 / (1 + θ1 u1 + θ2 u2) + ε

In the previous demonstration, "design values" of the three parameters were taken to be
θ1 = 2.9, θ2 = 12.2, and θ3 = 0.69. In this exercise, they were given independent, truncated
normal distributions, in which support was limited to non-negative values, and for which
the mean and standard deviation were both 2.9 for θ1 , 12.2 for θ2 , and 0.69 for θ3 . Using
a stopping δ of 0.1 and starting with an initial random design on 10 points (again), the
algorithm stopped after 84 additional equal-mass points had been added. The following
graphs show the approximate location of all added design points (slightly jittered), and an
image map of the square root of probability mass for the final design. Note that, in contrast
to the fixed-θ case, it seems clear here that the mass of the optimal design measure is not limited
to 3 "points" in (u1, u2).
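
For this kinetics model, the ingredients needed by the algorithm sketch given earlier would look roughly as follows (my own illustration, not the implementation used to produce the results above); the gradient is the derivative of the mean function with respect to (θ1, θ2, θ3), and the prior is the non-negative truncated normal described in the text.

    import numpy as np
    from scipy.stats import truncnorm

    rng = np.random.default_rng(6)

    def grad_kinetics(u, theta):
        """Gradient of f(u, theta) = theta1*theta3*u1 / (1 + theta1*u1 + theta2*u2)
        with respect to (theta1, theta2, theta3)."""
        u1, u2 = u
        t1, t2, t3 = theta
        D = 1.0 + t1 * u1 + t2 * u2
        return np.array([t3 * u1 * (1.0 + t2 * u2) / D**2,
                         -t1 * t3 * u1 * u2 / D**2,
                         t1 * u1 / D])

    def truncated_normal(mean, sd, size):
        # normal truncated to non-negative values, as described in the text
        return truncnorm.rvs(-mean / sd, np.inf, loc=mean, scale=sd, size=size, random_state=rng)

    theta_samples = np.column_stack([truncated_normal(2.9, 2.9, 200),
                                     truncated_normal(12.2, 12.2, 200),
                                     truncated_normal(0.69, 0.69, 200)])

    # candidate grid over (u1, u2) in [0, 3] x [0, 3]
    grid = np.linspace(0.0, 3.0, 16)
    U = np.array([(a, b) for a in grid for b in grid])

    # bayes_w_algorithm is the sketch defined earlier in these notes
    w, iters = bayes_w_algorithm(grad_kinetics, U, theta_samples, delta=0.1)
    print(iters, np.count_nonzero(w > 1e-3))   # number of candidate points carrying visible mass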
[Figures: "W-algorithm, all points" — locations of all added design points, u1 (jittered) vs. u2 (jittered); "W-algorithm, sqrt(mass)" — image map of the square root of probability mass over (u1, u2); both axes run from 0.0 to 3.0.]
Postscript
Bayesian experimental design, as very briefly outlined here, seems a reasonable approach
to constructing designs for nonlinear models when the knowledge of θ is imperfect – which is
essentially always true in reality. In some cases, it provides useful continuous design measures
that can be reasonably “rounded” to discrete designs for application. In others (such as the
preceding example), the Bayes-optimal design measure cannot be so easily approximated
by a discrete design. In any case, while the relationship between the theory for Bayes
and classical design is interesting, calculation for the Bayes version is far more demanding.
Whether it is better to spend that computing effort in constructing a single Bayesian design,
or in constructing several fixed-θ designs (that are more likely to be “roundable”) in the
spirit of trying to find one design that does well across a spectrum of parameter values, may
still be an open question.
References
Chaloner, K., and I. Verdinelli (1995). "Bayesian Experimental Design: A Review," Statistical Science 10, 273-304.