Numerical Approaches for
Sequential Bayesian Optimal Experimental Design
by
Xun Huan
B.A.Sc., University of Toronto (2008)
S.M., Massachusetts Institute of Technology (2010)
Submitted to the Department of Aeronautics and Astronautics
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computational Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2015
© Massachusetts Institute of Technology 2015. All rights reserved.
Author:
Department of Aeronautics and Astronautics
August 20, 2015
Certified by:
Youssef M. Marzouk
Class of 1942 Associate Professor of Aeronautics and Astronautics
Thesis Supervisor
Certified by:
John N. Tsitsiklis
Clarence J. Lebel Professor of Electrical Engineering
Thesis Committee
Certified by:
Mort D. Webster
Associate Professor of Energy Engineering, Pennsylvania State University
Thesis Committee
Certified by:
Karen E. Willcox
Professor of Aeronautics and Astronautics
Thesis Committee
Accepted by:
Paulo C. Lozano
Associate Professor of Aeronautics and Astronautics
Chair, Graduate Program Committee
Numerical Approaches for
Sequential Bayesian Optimal Experimental Design
by
Xun Huan
Submitted to the Department of Aeronautics and Astronautics
on August 20, 2015, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Computational Science and Engineering
Abstract
Experimental data play a crucial role in developing and refining models of physical systems.
Some experiments can be more valuable than others, however. Well-chosen experiments can
save substantial resources, and hence optimal experimental design (OED) seeks to quantify
and maximize the value of experimental data. Common current practice for designing a
sequence of experiments uses suboptimal approaches: batch (open-loop) design that chooses
all experiments simultaneously with no feedback of information, or greedy (myopic) design that optimally selects the next experiment without accounting for future observations
and dynamics. In contrast, sequential optimal experimental design (sOED) is free of these
limitations.
With the goal of acquiring experimental data that are optimal for model parameter
inference, we develop a rigorous Bayesian formulation for OED using an objective that incorporates a measure of information gain. This framework is first demonstrated in a batch
design setting, and then extended to sOED using a dynamic programming (DP) formulation. We also develop new numerical tools for sOED to accommodate nonlinear models
with continuous (and often unbounded) parameter, design, and observation spaces. Two
major techniques are employed to make solution of the DP problem computationally feasible. First, the optimal policy is sought using a one-step lookahead representation combined
with approximate value iteration. This approximate dynamic programming method couples backward induction and regression to construct value function approximations. It also
iteratively generates trajectories via exploration and exploitation to further improve approximation accuracy in frequently visited regions of the state space. Second, transport maps are
used to represent belief states, which reflect the intermediate posteriors within the sequential
design process. Transport maps offer a finite-dimensional representation of these generally
non-Gaussian random variables, and also enable fast approximate Bayesian inference, which
must be performed millions of times under nested combinations of optimization and Monte
Carlo sampling.
The overall sOED algorithm is demonstrated and verified against analytic solutions on a
simple linear-Gaussian model. Its advantages over batch and greedy designs are then shown
via a nonlinear application of optimal sequential sensing: inferring contaminant source location from a sensor in a time-dependent convection-diffusion system. Finally, the capability of
the algorithm is tested for multidimensional parameter and design spaces in a more complex
setting of the source inversion problem.
Thesis Supervisor: Youssef M. Marzouk
Title: Class of 1942 Associate Professor of Aeronautics and Astronautics
Committee Member: John N. Tsitsiklis
Title: Clarence J. Lebel Professor of Electrical Engineering
Committee Member: Mort D. Webster
Title: Associate Professor of Energy Engineering, Pennsylvania State University
Committee Member: Karen E. Willcox
Title: Professor of Aeronautics and Astronautics
Acknowledgments
First and foremost, I would like to thank my advisor Youssef Marzouk, for giving me the
opportunity to work with him, and for his constant guidance and support. Youssef has
been a great mentor, friend, and inspiration to me throughout my graduate school career. I
find myself incredibly lucky to have crossed paths with him right as he started as a faculty
member at MIT. I would also like to thank all my committee members, John Tsitsiklis,
Mort Webster, and Karen Willcox, and my readers Peter Frazier and Omar Knio. I have
benefited greatly from their support and insightful discussions, and I am honored to have
each of them in the making of this thesis.
There are many friends and colleagues that helped me through graduate school and
enriched my life: Huafei Sun, with a friendship that started all the way back in Toronto,
who has been like a big brother to me; Masayuki Yano, a great roommate of 3 years,
with whom I had many interesting discussions about research and life; Hemant Chaurasia,
together we endured through quals and classes, enjoyed MIT $100K events, and played many
intramural hockey games side-by-side; Matthew Parno for graciously sharing his MUQ code;
Tarek El Moselhy for the fun times in exploring Vancouver and Japan; Tiangang Cui for
many enjoyable outings outside of research; Chi Feng, Alessio Spantini, and Sergio Amaral
for performing “emergency surgery” on my desktop computer when its power supply died the
week before my defense; and many others that I cannot name all here. I want to thank the
entire UQ group and ACDL, all the students, post-docs, faculty and staff, past and present.
Special thanks go to Sophia Hasenfus, Beth Marois, Meghan Pepin, and Jean Sofronas, for
all the help behind the scenes. I am also grateful for all my friends from Toronto for making
my visits back home extra fun and memorable.
Last but not least, I want to thank my parents, who have always been there for me,
through tough and happy times. I would not have made it this far without their constant love,
support, and encouragement, and I am very proud to be their son.
My research was generously supported by funding from the BP-MIT Energy Fellowship,
the KAUST Global Research Partnership, the U.S. Department of Energy, Office of Science,
Office of Advanced Scientific Computing Research (ASCR), the Air Force Office of Scientific
Research (AFOSR) Computational Mathematics Program, the National Science Foundation
(NSF), and the Natural Sciences and Engineering Research Council of Canada (NSERC).
Contents
1 Introduction  19
1.1 Literature review  21
1.1.1 Batch (open-loop) optimal experimental design  21
1.1.2 Sequential (closed-loop) optimal experimental design  26
1.2 Thesis objectives  29

2 Batch Optimal Experimental Design  31
2.1 Formulation  31
2.2 Stochastic optimization  34
2.2.1 Robbins-Monro stochastic approximation  35
2.2.2 Sample average approximation  36
2.2.3 Challenges in optimal experimental design  39
2.3 Polynomial chaos expansions  40
2.4 Infinitesimal perturbation analysis  43
2.5 Numerical results: 2D diffusion source inversion problem  46
2.5.1 Problem setup  46
2.5.2 Results  48

3 Formulation for Sequential Design  65
3.1 Problem definition  65
3.2 Dynamic programming form  68
3.3 Information-based Bayesian experimental design  69
3.4 Notable suboptimal sequential design methods  71

4 Approximate Dynamic Programming for Sequential Design  73
4.1 Approximation approaches  73
4.2 Policy representation  74
4.3 Policy construction via approximate value iteration  76
4.3.1 Backward induction and regression  76
4.3.2 Exploration and exploitation  78
4.3.3 Iterative update of state measure and policy approximation  78
4.4 Connection to the rollout algorithm (policy iteration)  80
4.5 Connection to POMDP  81

5 Transport Maps for Sequential Design  87
5.1 Background  88
5.2 Bayesian inference using transport maps  90
5.3 Constructing maps from samples  93
5.3.1 Optimization objective  94
5.3.2 Constraints  95
5.3.3 Convexity and separability of the optimization problem  96
5.3.4 Map parameterization  97
5.4 Relationship between quality of joint and conditional maps  98
5.5 Sequential design using transport maps  104
5.5.1 Joint map structure  105
5.5.2 Distributions on design variables  108
5.5.3 Generating samples in sequential design  111
5.5.4 Evaluating the Kullback-Leibler divergence  112

6 Full Algorithm Pseudo-code for Sequential Design  117

7 Numerical Results  119
7.1 Linear-Gaussian problem  120
7.1.1 Problem setup  120
7.1.2 Results  122
7.2 1D contaminant source inversion problem  126
7.2.1 Case 1: comparison with greedy (myopic) design  131
7.2.2 Case 2: comparison with batch (open-loop) design  133
7.2.3 Case 3: sOED grid and map methods  136
7.3 2D contaminant source inversion problem  142
7.3.1 Problem setup  142
7.3.2 Results  145

8 Conclusions  155
8.1 Summary and conclusions  155
8.2 Future work  157
8.2.1 Computational advances  157
8.2.2 Formulational advances  159

A Analytic Derivation of the Unbiased Gradient Estimator  163

B Analytic Solution to the Linear-Gaussian Problem  169
B.1 Derivation from batch optimal experimental design  169
B.2 Derivation from sequential optimal experimental design  172
List of Figures
1-1 The learning process can be characterized as an iteration between theory and
practice via deductive and inductive reasoning. . . . . . . . . . . . . . . . . . 19
2-1 Example forward model solution and realizations from the likelihood. The
solid line represents the time-dependent contaminant concentration w(x, t; xsrc )
at x = xsensor = (0, 0), given a source centered at xsrc = (0.1, 0.1), source
strength s = 2.0, width h = 0.05, and shutoff time τ = 0.3. Parameters are
defined in Equation 2.18. The five crosses represent noisy measurements at
five designated measurement times. . . . . . . . . . . . . . . . . . . . . . . . . 47
2-2 Surface plots of independent Û_{N,M} realizations, evaluated over the entire design space [0, 1]² ∋ d = (x, y). Note that the vertical axis ranges and color scales vary among the subfigures. . . . . . . . . . . . . . . . . . . . . . . . . . 49
2-3 Contours of posterior densities for the source location, given different sensor
placements. The true source location, marked with a blue circle, is xsrc =
(0.09, 0.22). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2-4 Sample paths of the RM algorithm with N = 1, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values. The large marker is the starting position and the large × is the final position. . . . . . . . . . . . . . 52
2-5 Sample paths of the RM algorithm with N = 11, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values. The large marker is the starting position and the large × is the final position. . . . . . . . . . . . . . 53
2-6 Sample paths of the RM algorithm with N = 101, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values. The large marker is the starting position and the large × is the final position. . . . . . . . . . . . . . 54
2-7 Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 1. The large marker is the starting position and the large × is the final position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2-8 Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 11. The large marker is the starting position and the large × is the final position. . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2-9 Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 101. The large marker is the starting position and the large × is the final position. . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2-10 Mean squared error, defined in Equation 2.22, versus average run time for
each optimization algorithm and various choices of inner-loop and outer-loop
sample sizes. The highlighted curves are “optimal fronts” for RM (light red)
and SAA-BFGS (light blue). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3-1 Batch design exhibits an open-loop behavior, where no feedback of information
is involved, and the observations yk from any experiment do not affect the
design of any other experiments. Sequential design exhibits a closed-loop
behavior, where feedback of information takes place, and the data yk from an
experiment can be used to guide the design of future experiments. . . . . . . . 68
5-1 A log-normal random variable z can be mapped to a standard Gaussian random variable ξ via ξ = T(z) = ln(z). . . . . . . . . . . . . . . . . . . . . . . . 88
5-2 Example 5.3.1: samples and density contours. . . . . . . . . . . . . . . . . . . 99
5-3 Example 5.3.1: posterior density functions using different map polynomial
basis orders and sample sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5-4 Illustration of exact map and perspectives of approximate maps. Contour
plots on the left reflect the reference density, and on the right the target
density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5-5 Example 5.5.1: posteriors from joint maps constructed under different d distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5-6 Example 5.5.1: additional examples of posteriors from joint maps constructed
under different d distributions. The same legend in Figure 5-5 applies. . . . . 115
7-1 Linear-Gaussian problem: J̃1 surfaces and regression points used to build them. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7-2 Linear-Gaussian problem: d0 histograms from 1000 simulated trajectories.
The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. 125
7-3 Linear-Gaussian problem: d1 histograms from 1000 simulated trajectories.
The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. 126
7-4 Linear-Gaussian problem: (d0 , d1 ) pair scatter plots from 1000 simulated trajectories superimposed on top of the analytic expected utility surface. The
left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. . . 127
7-5 Linear-Gaussian problem: total reward histograms from 1000 simulated trajectories. The left, middle, and right columns correspond to ℓ = 1, 2, and 3,
respectively. The plus-minus quantity is 1 standard error. . . . . . . . . . . . 128
7-6 Linear-Gaussian problem: samples used to construct the exploration map and
samples generated from the resulting map. . . . . . . . . . . . . . . . . . . . . 129
7-7 1D contaminant source inversion problem, case 1: physical state and belief
state density progression of a sample trajectory. . . . . . . . . . . . . . . . . . 132
7-8 1D contaminant source inversion problem, case 1: (d0 , d1 ) pair scatter plots
from 1000 simulated trajectories for greedy design and sOED. . . . . . . . . . 133
7-9 1D contaminant source inversion problem, case 1: total reward histograms
from 1000 simulated trajectories for greedy design and sOED. The plus-minus
quantity is 1 standard error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7-10 1D contaminant source inversion problem, case 2: d0 and d1 pair scatter plots
from 1000 simulated trajectories for batch design and sOED. Roughly 55% of
the sOED trajectories qualify for the precise device in the second experiment.
However, there is no particular pattern or clustering of these designs, thus we
do not separately color-code them in the scatter plot. . . . . . . . . . . . . . . 135
7-11 1D contaminant source inversion problem, case 2: total reward histograms
from 1000 simulated trajectories for batch design and sOED. The plus-minus
quantity is 1 standard error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7-12 1D contaminant source inversion problem, case 3: d0 histograms from 1000
simulated trajectories for the sOED grid and map methods. The left, middle,
and right columns correspond to ℓ = 1, 2, and 3, respectively. . . . . . . . . . 138
7-13 1D contaminant source inversion problem, case 3: d1 histograms from 1000
simulated trajectories for the sOED grid and map methods. The left, middle,
and right columns correspond to ℓ = 1, 2, and 3, respectively. . . . . . . . . . 139
7-14 1D contaminant source inversion problem, case 3: total reward histograms
from 1000 simulated trajectories for the sOED grid and map methods. The
left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
The plus-minus quantity is 1 standard error. . . . . . . . . . . . . . . . . . . . 140
7-15 1D contaminant source inversion problem: samples used to construct exploration map and samples generated from the resulting map. . . . . . . . . . . . 141
7-16 1D contaminant source inversion problem, case 3: (d0 , d1 ) pair scatter plots
from 1000 simulated trajectories. The sOED result here is for ℓ = 1. . . . . . 142
7-17 1D contaminant source inversion problem, case 3: total reward histograms
from 1000 simulated trajectories using batch and greedy designs. The plusminus quantity is 1 standard error. . . . . . . . . . . . . . . . . . . . . . . . . 142
7-18 2D contaminant source inversion problem: plume signal and physical state
progression of sample trajectory 1. . . . . . . . . . . . . . . . . . . . . . . . . 146
7-19 2D contaminant source inversion problem: belief state posterior density contour progression of sample trajectory 1. . . . . . . . . . . . . . . . . . . . . . 147
7-20 2D contaminant source inversion problem: plume signal and physical state
progression of sample trajectory 2. . . . . . . . . . . . . . . . . . . . . . . . . 148
7-21 2D contaminant source inversion problem: belief state posterior density contour progression of sample trajectory 2. . . . . . . . . . . . . . . . . . . . . . 149
7-22 2D contaminant source inversion problem: dk histograms from 1000 simulated
trajectories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7-23 2D contaminant source inversion problem: total reward histograms from 1000
simulated trajectories. The left, middle, and right columns correspond to
ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error. . . 150
7-24 2D contaminant source inversion problem: samples used to construct exploration map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7-25 2D contaminant source inversion problem: samples generated from the resulting map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7-26 2D contaminant source inversion problem: samples used to construct the exploration map between dk and yk, with θ. The columns from left to right correspond to d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2, and the marginals for the row variables; the rows from top to bottom correspond to the marginal for the column variables and θ0, θ1, θ0, θ1, θ0, θ1, where each pair of θ rows corresponds to inference after 1, 2, and 3 experiments, respectively. 153
7-27 2D contaminant source inversion problem: samples generated from the resulting map between dk and yk, with θ. The columns from left to right correspond to d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2, and the marginals for the row variables; the rows from top to bottom correspond to the marginal for the column variables and θ0, θ1, θ0, θ1, θ0, θ1, where each pair of θ rows corresponds to inference after 1, 2, and 3 experiments, respectively. . . . . . . 154
B-1 Linear-Gaussian problem: analytic expected utility surface, with the “front”
of optimal designs in dotted black line. . . . . . . . . . . . . . . . . . . . . . . 172
List of Tables
2.1
Histograms of final search positions resulting from 1000 independent runs
of RM (top subrows) and SAA (bottom subrows) over a matrix of N and
M sample sizes. For each histogram, the bottom-right and bottom-left axes
represent the sensor coordinates x and y, respectively, while the vertical axis
represents frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.2
High-quality expected information gain estimates at the final sensor positions resulting from 1000 independent runs of RM (top subrows, blue) and
SAA-BFGS (bottom subrows, red). For each histogram, the horizontal axis
represents values of Û_{M=1001,N=1001} and the vertical axis represents frequency. 62
2.3
Histograms of optimality gap estimates for SAA-BFGS, over a matrix of sample sizes M and N. For each histogram, the horizontal axis represents the value
of the gap estimate and the vertical axis represents frequency. . . . . . . . . . 63
2.4
Number of iterations in each independent run of RM (top subrows, blue) and
SAA-BFGS (bottom subrows, red), over a matrix of sample sizes M and N .
For each histogram, the horizontal axis represents iteration number and the
vertical axis represents frequency. . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1
Different levels of scope are available when transport maps are used as the
belief state in the sOED context. niter represents the number of stochastic
optimization iterations from numerically evaluating Equation 5.38, and nMC
represents the Monte Carlo size of approximating its expectation. In our
implementation, these values are typically around 50 and 100, respectively. . . 105
5.2
Structure of joint maps needed to perform inference under different numbers
of experiments. For simplicity of notation, we omit the conditioning in the
subscript of map components; please see Equation 5.40 for the full subscripts.
The same pattern is repeated for higher numbers of experiments. The components grouped by the red rectangular boxes are identical. . . . . . . . . . . 108
5.3
Marginal distributions of d used to construct joint map. . . . . . . . . . . . . 110
7.1
Linear-Gaussian problem: total reward mean values (of histograms in Figure 7-5) from 1000 simulated trajectories. Monte Carlo standard errors are
all ±0.02. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2
Contaminant source inversion problem: problem settings.
. . . . . . . . . . . 130
7.3
Contaminant source inversion problem: algorithm settings. . . . . . . . . . . . 130
7.4
1D contaminant source inversion problem, case 3: total reward mean values
from 1000 simulated trajectories; the Monte Carlo standard errors are all
±0.02. The grid and map cases are all from sOED. . . . . . . . . . . . . . . . 138
Chapter 1
Introduction
Experiments play an essential role in the learning process. As George E. P. Box points out,
“. . . science is a means whereby learning is achieved, not by mere theoretical speculation on
the one hand, nor by the undirected accumulation of practical facts on the other, but rather
by a motivated iteration between theory and practice . . . ” [29]. As illustrated in Figure 1-1,
theory is used to deduce what is expected to be observed in practice, and observations from
experiments are used in turn to induce how theory may be further improved. In science
and engineering, experiments are a fundamental building block of the scientific method, and
crucial in the continuing development and refinement of models of physical systems.
[Figure: Theory and Practice (experimental observations), linked by deduction and induction]
Figure 1-1: The learning process can be characterized as an iteration between theory and
practice via deductive and inductive reasoning.
Whether obtained through field observations or laboratory experiments, experimental
data may be difficult and expensive to acquire. Even controlled experiments can be time-consuming or delicate to perform. Experiments are also not equally useful, with some providing valuable information while others may be irrelevant to the investigation goals. It
is therefore important to quantify the trade-off between costs and benefits, and maximize
the overall value of experimental data—to design experiments to be “optimal” by some appropriate measure. Not only is this an important economic consideration, it can also greatly
accelerate the advancement of scientific understanding. Experimental design thus encompasses questions of where and when to measure, which variables to interrogate, and what
experimental conditions to employ (some examples of real-life experimental design situations
are shown below). In this thesis, we develop a systematic framework for experimental design
that can help answer these questions.
Example 1.0.1. Combustion kinetics: Alternative fuels, such as biofuels [139] and synthetic fuels [86], are becoming increasingly popular [36]. They are attractive for safeguarding against volatile petroleum prices, ensuring energy security, and providing new and desirable
properties that traditional fossil fuels might not offer. The development of these new fuels
relies on a deep understanding of the underlying chemical combustion process, which is often
modeled by complicated, nonlinear chemical mechanisms composed of many elementary reactions. Parameters governing the rates of these reactions, known as kinetic rate parameters,
are usually inferred from experimental measurements such as ignition delay times [72, 58].
Many of these kinetic parameters have large uncertainties even today [12, 11, 141, 133],
and more data are needed to reduce the uncertainties. Combustion experiments, often conducted using shock tubes, are usually expensive and difficult to set up, and need to be
carefully planned. Furthermore, one may choose to carry out these experiments under different temperatures and pressures, with different initial concentrations of reactants, and with different output quantities observed at different times. Experimental design
provides guidance in making these choices such that the most information may be gained
on the kinetic rate parameters [89, 90].
Example 1.0.2. Optimal sensor placement: The United States government has initiated a number of terrorism prevention measures since the events of 9/11. For example,
the BioWatch program [152] focuses on the prevention and response to scenarios where a
biological pathogen is released in a city. One of its main goals is to find and intercept the
contaminant source and eliminate it as soon as possible. It is often too dangerous to dispatch personnel into the contamination zone, but a limited number of measurements may be
available from remote-controlled robotic vehicles. It is thus crucial for these measurements
to yield the most information on the location of the contaminant source [91]. This problem
will be revisited in this thesis, with particular focus on situations that allow a sequential
selection of measurement locations.
1.1 Literature review
Systematic design of experiments has received much attention in the statistics community
and in many science and engineering applications. Early design approaches primarily relied
on heuristics and experience, with the traditional factorial, composite, and Latin hypercube
designs all based on the concepts of space-filling and blocking [69, 31, 56, 32]. While these
methods can produce good designs in relatively simple situations involving a few design
variables, they generally do not take into account, or take advantage of, knowledge
of the underlying physical process. Simulation-based experimental design uses a model
to guide the choice of experiments, and optimal experimental design (OED) furthermore
incorporates specific and relevant metrics to design experiments for a particular purpose,
such as parameter inference, prediction, or model discrimination.
The design of multiple experiments can be pursued via two broad classes of approaches:
• Batch (open-loop) design involves the design of all experiments concurrently as a batch.
The outcome of any experiment would not affect the design of others. In some situations, this approach may be necessary, such as under certain scheduling constraints.
• Sequential (closed-loop) design allows experiments to be conducted in sequence, thus
permitting newly acquired data to help guide the design of future experiments.
1.1.1 Batch (open-loop) optimal experimental design
Extensive theory has been developed for OED of linear models, where the quantities probed
in the experiments depend linearly on the model parameters of interest. Common solution criteria for the OED problem are written as functionals of the Fisher information
matrix [66, 9]. These criteria include the well-known “alphabetic optimality” conditions,
e.g., A-optimality to minimize the average variance of parameter estimates, or G-optimality
to minimize the maximum variance of model predictions. The derivations may also adopt
a Bayesian perspective [94, 156], which provides a rigorous foundation for inference from
noisy, indirect, and incomplete data and a natural mechanism for incorporating physical
constraints and heterogeneous sources of information. Bayesian analogues of alphabetic optimality, reflecting prior and posterior uncertainty in the model parameters, can be attained
from a decision-theoretic point of view [18, 146, 130], with the formulation of an expected
utility quantity. For instance, Bayesian D-optimality can be obtained from a utility function containing Shannon information while Bayesian A-optimality may be derived from a
squared error loss. In the case of linear-Gaussian models, the criteria of Bayesian alphabetic
optimality reduce to mathematical forms that parallel their non-Bayesian counterparts [43].
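To make these criteria concrete, the following is a minimal sketch (not drawn from the thesis; the linear model and candidate designs are hypothetical) that evaluates A- and D-optimality from the Fisher information matrix of a linear-Gaussian model y = X(d)θ + ε:

```python
import numpy as np

def fisher_information(X, sigma_eps):
    """Fisher information matrix for the linear model y = X @ theta + eps,
    with eps ~ N(0, sigma_eps^2 I)."""
    return X.T @ X / sigma_eps**2

def a_optimality(X, sigma_eps):
    """A-optimality: average variance of the parameter estimates,
    i.e., trace of the inverse information matrix (smaller is better)."""
    return np.trace(np.linalg.inv(fisher_information(X, sigma_eps)))

def d_optimality(X, sigma_eps):
    """D-optimality: log-determinant of the information matrix (larger is
    better), equivalent to shrinking the confidence ellipsoid volume."""
    return np.linalg.slogdet(fisher_information(X, sigma_eps))[1]

# Compare two hypothetical 3-point designs for y = theta_0 + theta_1 * d:
# the spread-out design carries more information about the slope.
for design in ([0.0, 0.5, 1.0], [0.0, 0.1, 0.2]):
    X = np.column_stack([np.ones(3), np.array(design)])
    print(design, a_optimality(X, 1.0), d_optimality(X, 1.0))
```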
For nonlinear models, however, exact evaluation of optimal design criteria is much more
challenging. More tractable design criteria can be obtained by imposing additional assumptions, effectively changing the form of the objective; these assumptions include linearizations
of the forward model, Gaussian approximations of the posterior distribution, and additional
assumptions on the marginal distribution of the data [33, 43]. In the Bayesian setting, such
assumptions lead to design criteria that may be understood as approximations of an expected utility. Most of these involve prior expectations of the Fisher information matrix [49].
Cruder “locally optimal” approximations require selecting a “best guess” value of the unknown model parameters and maximizing some functional of the Fisher information evaluated at this point [70]. None of these approximations, though, is suitable when the parameter
distribution is broad or when it departs significantly from normality [51]. A more general
design framework, free of these limiting assumptions, is preferred [118, 80]. With recent
advances in algorithm development and computational power, OED for nonlinear systems
can now be tackled directly using numerical simulation [109, 145, 169, 108, 158, 89, 90, 91].
Information-based objectives
Our work accommodates nonlinear experimental design from a Bayesian perspective (e.g., [118]).
We focus on experiments described by a continuous design space, with the goal of choosing experiments that are optimal for Bayesian parameter inference. Rigorous information-theoretic criteria have been proposed throughout the literature (e.g., [75]). The seminal
paper of Lindley [105] suggests using the expected information gain in model parameters
from prior to posterior—or equivalently, the mutual information between parameters and
observations, conditioned on the design variables—as a measure of the information provided by an experiment. This objective can also be derived using the Kullback-Leibler
divergence from posterior to prior as a utility function [60, 43]. Sebastiani and Wynn [148]
propose selecting experiments for which the marginal distribution of the data has maximum
Shannon entropy; this may be understood as a special case of Lindley’s criterion. Maximum entropy sampling (MES) has seen use in applications ranging from astronomy [109] to
geophysics [169], and is well suited to nonlinear models. Reverting to Lindley’s criterion,
Ryan [145] introduces a Monte Carlo estimator of expected information gain to design experiments for a model of material fatigue. Terejanu et al. [166] use a kernel estimator of
mutual information to identify parameters in a chemical kinetic model. The latter two studies
evaluate their criteria on every element of a finite set of possible designs (on the order of
ten designs in these examples), and thus sidestep the challenge of optimizing the design
criterion over general design spaces. Both report significant limitations due to computational expense; [145] concludes that "full blown search" over the design space is infeasible,
and that two order-of-magnitude gains in computational efficiency would be required even
to discriminate among the enumerated designs.
The application of optimization methods to experimental design has thus favored simpler design objectives. The chemical engineering community, for example, has tended to use
linearized and locally optimal [117] design criteria or other objectives [144] for which deterministic optimization strategies are suitable. But in the broader context of decision-theoretic
design formulations, sampling is required. [120] proposes a curve fitting scheme wherein the
expected utility is fit with a regression model, using Monte Carlo samples over the design
space. This scheme relies on problem-specific intuition about the character of the expected
utility surface. Clyde et al. [52] explore the joint design, parameter, and data space with a
Markov chain Monte Carlo (MCMC) sampler, while Amzal et al. [6] expand this concept
to multiple MCMC chains in a sequential Monte Carlo framework; this strategy combines
integration with optimization, such that the marginal distribution of sampled designs is proportional to the expected utility. This idea is extended with simulated annealing in [121] to
achieve more efficient maximization of the expected utility. [52, 121] use expected utilities
as design criteria but do not pursue information-theoretic design metrics. Indeed, direct optimization of information-theoretic metrics has seen much less development. Building on the
enumeration approaches of [169, 145, 166] and the one-dimensional design space considered
in [109], [80] iteratively finds MES designs in multi-dimensional spaces by greedily choosing
one component of the design vector at a time. Hamada et al. [84] also find “near-optimal”
designs for linear and nonlinear regression problems by maximizing expected information
gain via genetic algorithms. Guestrin, Krause and others [81, 99, 100] find near-optimal
placement of sensors in a discretized domain by iteratively solving greedy subproblems, taking advantage of the submodularity of mutual information. More recently, the author has
made several contributions addressing the coupling of rigorous information-theoretic design
criteria, complex nonlinear physics-based models, and efficient optimization strategies on
continuous design spaces [89, 90, 91].
Stochastic optimization
There are many approaches for solving continuous optimization problems with stochastic
objectives. While some do not require the direct evaluation of gradients (e.g., Nelder-Mead [124], Kiefer-Wolfowitz [95], and simultaneous perturbation stochastic approximation [161]), other algorithms can use gradient evaluations to great advantage. Broadly,
these algorithms involve either stochastic approximation (SA) [102] or sample average approximation (SAA) [149], where the latter approach must also invoke a gradient-based deterministic optimization algorithm. Hybrids of the two approaches are possible as well. The
Robbins-Monro algorithm [142] is one of the earliest and most widely used SA methods,
and has become a prototype for many subsequent algorithms. It involves an iterative update that resembles steepest descent, except that it uses stochastic gradient information.
SAA (also referred to as retrospective method [85] and sample-path method [82]) is a more
recent approach, with theoretical analysis initially appearing in the 1990s [149, 82, 97].
Convergence rates and stochastic bounds, although useful, do not necessarily reflect empirical performance under finite computational resources and imperfect numerical optimization
schemes. To the best of our knowledge, extensive numerical testing of SAA has focused on
stochastic programming problems with special structure (e.g., linear programs with discrete
design variables) [3, 170, 16, 79, 147]. While numerical improvements to SAA have seen
continual development (e.g., estimators of optimality gap [127, 111] and sample size adaptation [46, 47]), the practical behavior of SAA in more general optimization settings is largely
unexplored. SAA is frequently compared to stochastic approximation methods such as RM.
For example, [150] suggests that SAA is more robust than SA because of sensitivity to step
size choice in the latter. On the other hand, variants of SA have been developed that, for
certain classes of problems (e.g., [125]), reach solution quality comparable to that of SAA
in substantially less time. In this thesis, we also make comparisons between SA and SAA,
but from a practical and numerical perspective and in the context of OED.
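For illustration, here is a minimal sketch of the Robbins-Monro iteration in ascent form; the step-size schedule, the noisy-gradient oracle, and the toy objective are hypothetical choices, not the settings studied in this thesis:

```python
import numpy as np

def robbins_monro(grad_estimate, d0, n_iter=1000, a=1.0, A=10.0, alpha=1.0):
    """Robbins-Monro stochastic approximation (ascent form): follow noisy but
    unbiased gradient estimates of the expected utility U(d) with diminishing
    step sizes a_k = a / (k + 1 + A)^alpha, chosen so that sum(a_k) diverges
    while sum(a_k^2) converges."""
    d = np.array(d0, dtype=float)
    for k in range(n_iter):
        step = a / (k + 1 + A) ** alpha
        d += step * grad_estimate(d)  # one stochastic gradient sample per step
    return d

# Toy usage: maximize U(d) = -|d|^2 from noisy gradient samples
rng = np.random.default_rng(0)
d_star = robbins_monro(lambda d: -2.0 * d + 0.1 * rng.standard_normal(d.shape),
                       d0=[0.8, 0.3])
```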
Surrogates for computationally intensive models
With either Robbins-Monro or SAA, information-based OED requires one to employ
gradients of an information gain objective. Typically, this objective function involves nested
integrations over possible model outputs and over the input parameter space, where the
model output may be a functional of the solution of a partial differential equation. In
many practical cases, the model may be essentially a black box; while in other cases, even
if gradients can be evaluated with adjoint methods, using the full model to evaluate the
expected information gain or its gradient is computationally prohibitive. To make these
calculations tractable, one would like to replace the forward model with a cheaper “surrogate”
model that is accurate over the entire range of the model input parameters.
Surrogates can be generally categorized into three classes [65, 71]: data-fit models,
reduced-order models, and hierarchical models. Data-fit models capture the input-output
relationship of a model from available data points, and assume regularity by imposing interpolation or regression. Given the data points, how the original model functions does not matter, and it may be treated as a black box. One common approach for constructing
data-fit models is Gaussian process regression [94, 140]; other approaches rely on so-called
polynomial chaos expansions (PCE) and related stochastic spectral methods [178, 74, 180,
59, 123, 179, 104, 53]. In the context of OED, the former can be used to replace the likelihood altogether, allowing quick inferences and objective evaluations from this statistical
model of much simpler structure [177]. The latter builds a subspace from a set of orthogonal
polynomial basis functions, and exploits the regularity in the dependence of model outputs
on uncertain input parameters. PCE capturing dependencies jointly on parameters and
design conditions further accelerates the overall OED process [90], and can be constructed
using dimension-adaptive sparse quadrature [73] that identifies and exploits anisotropic dependencies for efficiency in high dimensions.
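As a toy illustration of the data-fit PCE idea (one standard-Gaussian input, a least-squares regression fit, and a hypothetical forward model; this does not reproduce the dimension-adaptive sparse quadrature construction of [73, 90]):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander

def fit_pce(g, order=5, n_train=200, seed=0):
    """Least-squares fit of a 1D polynomial chaos expansion in probabilists'
    Hermite polynomials He_0, ..., He_order, suited to a model g(theta)
    whose input is distributed as theta ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n_train)
    V = hermevander(theta, order)                  # regression design matrix
    coeffs, *_ = np.linalg.lstsq(V, g(theta), rcond=None)
    return lambda t: hermevander(np.atleast_1d(t), order) @ coeffs

# Cheap surrogate for a hypothetical nonlinear model
surrogate = fit_pce(lambda t: np.exp(0.3 * t) + np.sin(t))
print(surrogate(np.array([0.0, 1.0])))
```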
Reduced-order models are based on a projection of the output space onto a smaller, lower-dimensional subspace. One example is the proper orthogonal decomposition (POD), where
a set of “snapshots” of the model outputs are used to construct a basis for the subspace [19,
155, 38]. Finally, hierarchical models are those where simplifications are performed based on
the underlying physics. Techniques based on grid coarsening, simplified mechanics, or added assumptions are of this type, and are often the basis of multifidelity analysis
and optimization [28, 4].
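A minimal POD sketch, assuming a generic snapshot matrix with one model-output snapshot per column:

```python
import numpy as np

def pod_basis(snapshots, r):
    """Proper orthogonal decomposition: the leading r left singular vectors
    of the snapshot matrix span the reduced subspace; the cumulative
    singular-value energy indicates how much snapshot variance is retained."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return U[:, :r], energy[r - 1]

# Hypothetical usage: project a new full-order solution u onto the subspace
# basis, retained = pod_basis(S, r=10)
# u_approx = basis @ (basis.T @ u)
```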
1.1.2 Sequential (closed-loop) optimal experimental design
Compared to batch OED, sequential optimal experimental design (sOED) has seen much less
development and use. The value of feedback through sequential design has been recognized
early on, with original approaches typically involving a heuristic partitioning of experiments
into batches. For instance, in the context of experimental design for improving chemical
plant filtration rate [30], an initial “empirical feedback” stage involving space-filling designs
is administered to "pick the winner" and find designs that best fix the problem, and a subsequent "scientific feedback" stage with adapted designs then follows to better understand the reasons why something went wrong or why a solution worked. Initial attempts at finding optimal
sequential designs relied heavily on results from batch OED, by simply repeating its design
methodology in a greedy manner. Some work made use of linear design theory by iteratively
alternating between parameter estimation and applications of linear optimality (e.g., [2]).
Since many physically realistic models involve output quantities that have nonlinear dependencies on model parameters, it is desirable to employ nonlinear OED tools. The key
challenge, then, is to represent and propagate general non-Gaussian posteriors beyond the
first experiment.
Various representation techniques have been tested within the greedy design framework,
with a large body of research based on sample representations of the posterior. For instance, posterior importance sampling has been employed for variance-based utility [158]
and in greedy augmentations of generalized linear models [61]. Sequential Monte Carlo
methods have also been utilized both in experimental design for parameter inference [62]
and model discrimination [42, 63]. Even grid-based discretizations/representations of posterior density functions have shown success in adaptive design optimization that makes use
of hierarchical models in visual psychophysics [96]. While these developments provide a
convenient and intuitive avenue of extending existing batch OED tools, greedy design is
ultimately suboptimal. A truly optimal sequential design framework needs to account for
all relevant future effects in making every decision, but such considerations are hampered
by challenges in computational feasibility. With recent advances in numerical algorithms
and computing power, sOED can now be made practical.
sOED is often posed in a dynamic programming (DP) form, a framework widely used to
describe sequential decision-making under uncertainty. While the DP description of sOED
is gaining traction in recent years [119, 172], implementations and applications of this framework remain few, due to notoriously large computational requirements. The few existing
attempts have mostly focused on optimal stopping problems [18], stemming predominantly
from applications of clinical trial designs. Under simple situations, direct backward induction with tabular storage may be used, but is only feasible for discrete variables that can
take on a few possible outcomes [37, 174]. Applications of more involved numerical solution
techniques all rely on special structures of the problem with careful choices of loss functions.
For example, Carlin et al. [41] propose a forward sampling method that directly optimizes
a Monte Carlo estimate of the expected utility, but targets monotonic loss functions and
certain conjugate priors that result in threshold policies based on the posterior mean. Continued development of backward induction has also found feasible numerical implementations
owing to policies that depend only on lower-dimensional sufficient statistics such as the posterior mean and standard deviation [21, 48]. Other approaches replace the simulation model
altogether, and instead use statistical models with assumed distribution forms [122]. None
of these works, however, uses an information-based objective. Incorporation of utilities that
reflect information gain induces quantities that are much more challenging to evaluate, and
has been attempted only for simple situations. For instance, Ben-Gal and Caramanis [15]
find near-optimal stopping policies in multidimensional design spaces by deriving and making use of diminishing returns (submodularity) of the expected incremental information gain;
however, this is possible only for linear-Gaussian problems, where mutual information does
not depend on the observations.
With the current state-of-the-art in sOED heavily relying on special problem structures
and often feasible only for discrete variables that can take on a few values, we seek to
contribute to its development with a more general framework and numerical tools that can
accommodate broader classes of problems.
Dynamic programming
The solution to the sOED problem directly relates to the solution of a DP problem. As
DP is a broad subject with a vast literature spanning many different fields of research, including control theory [24, 22, 23], operations research [138, 137], and machine learning [93, 164], we do not attempt a comprehensive review. Instead, we give a brief introduction and describe only the parts that are most relevant and promising for the
sOED problem, while referring readers to the references above.
Central to DP is the famous Bellman equation [13, 14], describing the relationship between the cost or reward incurred immediately and the expected cost or reward in the uncertain future, as a consequence of a decision. Its recursive definition leads to an exponential explosion of scenarios, and this "curse of dimensionality" remains the fundamental
challenge of DP. Typically, only special classes of problems have analytic solutions, such as
those described by linear dynamics and quadratic cost [8]. As a result, substantial research
has been devoted to developing efficient numerical strategies for accurately capturing DP
solutions—this field is known as approximate dynamic programming (ADP) (also referred
to as neuro-dynamic programming and reinforcement learning) [137, 24, 93, 164].
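To make this concrete, a generic finite-horizon form of Bellman's equation (the notation here is generic and not necessarily that used later in this thesis) is

$$
J_k(x_k) \;=\; \max_{d_k \in \mathcal{D}_k} \, \mathbb{E}_{y_k}\!\left[\, g_k(x_k, y_k, d_k) + J_{k+1}\big(\mathcal{F}_k(x_k, y_k, d_k)\big) \right], \qquad k = 0, \dots, N-1,
$$

where $J_k$ is the value function at stage $k$, $g_k$ the immediate reward, and $\mathcal{F}_k$ the system dynamics; the expectation is over the random outcome $y_k$, and the recursion is anchored by a terminal value $J_N(x_N) = g_N(x_N)$.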
With the goal of finding the (near) optimal policy, one must first be able to represent a
policy. While direct approximations can be made, a policy is more often portrayed implicitly,
such as through limited lookahead forms. These forms eventually relegate the approximation
to their associated value functions by probing their values at different states, leading to
broad branches of ADP strategy in approximate value iteration (AVI) and approximate policy iteration (API). The key difference between AVI and API is that the former updates
the policy immediately and maintains as good an approximation to the optimal policy as
possible, while the latter makes an accurate assessment of the value from a fixed policy (i.e.,
policy evaluation or learning) in an inner loop before improvements are made. Both of these
strategies have stimulated the development of a host of learning (policy evaluation) techniques based on the well-known temporal-differencing method (e.g., [163, 164, 34]), and API
further sparked the expansion of policy improvement methods such as least squares policy
iteration [103], actor-critic methods (e.g., [24]), and policy-gradient algorithms (e.g., [165]).
Finally, representation of value functions can be replaced by “model-free” Q-factors that capture the values in state-action pairs—this leads to the widely used reinforcement learning
technique of Q-learning [175, 176].
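As a small, generic illustration of the Q-factor idea (a tabular sketch with a hypothetical environment interface step_fn, not the method developed in this thesis):

```python
import numpy as np

def q_learning(n_states, n_actions, step_fn, n_episodes=500,
               alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Q-learning: learn state-action values from sampled transitions,
    without a model of the dynamics. step_fn(s, a) -> (s_next, reward, done)
    is a hypothetical environment interface; episodes start in state 0."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy choice balances exploration and exploitation
            a = int(rng.integers(n_actions)) if rng.random() < eps \
                else int(np.argmax(Q[s]))
            s_next, r, done = step_fn(s, a)
            # move Q(s, a) toward the one-step Bellman target
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```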
1.2 Thesis objectives
Current research in OED has seen rapid advances in the design of batch experiments.
Progress towards the optimal design of sequential experiments, however, remains in relatively early stages. Direct applications of batch OED methods to sequential settings are
suboptimal, and initial explorations of the optimal framework have been limited to problems
with discrete spaces of very few states and with special problem and solution structures. We
aim to extend the optimal sequential design framework to much more general settings.
The objectives of this thesis are:
• To advance the numerical methods for batch OED from the author's previous work [89, 90]
in order to accommodate nonlinear and computationally intensive models with an
information gain objective. This involves deriving and accessing gradient information
via the use of polynomial chaos and infinitesimal perturbation analysis in order to
enable the application of gradient-based optimization methods.
• To formulate the sOED problem in a rigorous manner, for a finite number of experiments, accommodating nonlinear and physically realistic models, under continuous
parameter, design, and observation spaces of multiple dimensions, using a Bayesian
treatment of uncertainty with general non-Gaussian distributions and an information
measure design objective. This goal includes formulating the DP form of the sOED
problem that is central to the subsequent development of numerical methods.
• To develop numerical methods for solving the sOED problem in a computationally
practical manner. This is achieved via the following sub-objectives.
– To implement ADP techniques based on a one-step lookahead policy representation, combined with approximate value iteration (in particular backward induction and regression) for constructing value function approximations.
– To represent continuous belief states numerically for general multivariate non-Gaussian random variables using transport maps.
– To construct and utilize transport maps in the joint design, observation, and
parameter space, in a form that enables fast and approximate Bayesian inference
by conditioning; this capability is necessary to achieve computational feasibility
in the ADP methods.
• To demonstrate the computational effectiveness of our sOED numerical tools on realistic design applications with multiple experiments and multidimensional parameters.
These applications include contaminant source inversion problems in both one- and
two-dimensional physical domains.
More broadly speaking, this thesis seeks to develop a rigorous mathematical framework
and a set of numerical tools for performing sequential optimal experimental design in a
computationally feasible manner.
The thesis is organized as follows. Chapter 2 begins with the formulation, numerical
methods, and results for the batch OED method, particularly focusing on the development
of gradient information. It also provides a foundation of understanding in the relatively
simpler batch design setting before extending to sequential designs for the rest of the thesis. Chapter 3 then presents the formulation of the sOED problem, including the DP form
that is the basis for developing our numerical methods. We also demonstrate the frequently
used batch and greedy design methods to be simplifications of the sOED problem, and
thus suboptimal for sequential settings. Chapter 4 details the ADP techniques we employ
to numerically solve the DP form of the sOED problem, including the development of an
adaptive strategy to refine the policy-induced state space. Chapter 5 introduces and describes the use of transport maps as belief states, along with the framework for using joint
maps to enable fast and approximate Bayesian inference. The full algorithm for the sOED
problem is summarized in Chapter 6. It is then applied to several numerical examples in
Chapter 7. We first illustrate the solution on a simple linear-Gaussian problem to provide
intuitive insights and establish comparisons with analytic references. We then demonstrate
these tools on contaminant source inversion problems of 1D and 2D convection-diffusion
scenarios. Finally, Chapter 8 provides concluding remarks and future work.
Chapter 2
Batch Optimal Experimental Design
Batch (open-loop) optimal experimental design (OED) involves the design of all experiments concurrently as a batch, where the outcome of any experiment would not affect the design of others. (For simplicity of terminology, we refer to the entire batch of experiments as a single entity, an "experiment," in this chapter.) This self-contained chapter introduces the framework of batch OED, assuming
the goal of the experiments is to infer uncertain model parameters from noisy and indirect observations. The framework developed here, however, can be used to accommodate
other experimental goals as well. Furthermore, it uses a Bayesian treatment of uncertainty,
employs an information measure objective, and accommodates nonlinear models under continuous parameter, design, and observation spaces. We pay particular attention to the use
of gradient information and the overall computational behavior of the method, and demonstrate its feasibility with a partial differential equation (PDE)-based 2D diffusion source
inversion problem. We then extend this foundation to sequential (closed-loop) OED in
subsequent chapters.
The content of this chapter is a continuation from the author’s previous work [89, 90],
and draws heavily from the author’s recent publication [91].
2.1 Formulation
Let (Ω, F, P) be a probability space, where Ω is a sample space, F is a σ-field, and P
is a probability measure on (Ω, F). Let the vector of real-valued random variables θ : Ω → R^{n_θ} (for simplicity, we will use lower case to represent both random variables and their realizations) denote the uncertain model parameters of interest (referred to as “parameters”
in this thesis), i.e., they are the parameters to be conditioned on experimental data. Here
n_θ is the dimension of the parameters. θ is associated with a measure µ on R^{n_θ}, such that µ(A) = P(θ^{−1}(A)) for measurable A ⊆ R^{n_θ}. We then define f(θ) = dµ/dθ to be the density of θ
with respect to the Lebesgue measure. For the present purposes, we will assume that such
a density always exists. Similarly, we treat the observations from the experiment, y ∈ Y
(referred to as “observations”, “noisy measurements”, or “data” in this thesis), as a real-valued
random vector endowed with an appropriate density, and d ∈ D as the vector of continuous
design variables (referred to as “design” in this thesis). If one performs an experiment under
design d and observes a realization of the data y, then the change in one’s state of knowledge
about the parameters is given by Bayes’ rule:
f(θ|y, d) = f(y|θ, d) f(θ|d) / f(y|d) = f(y|θ, d) f(θ) / f(y|d).    (2.1)
For simplicity of notation, we shall use f(·) to represent all density functions, with the specific distribution indicated by the arguments (when needed for clarity, we will explicitly include a subscript of the associated random variable). Here, f(θ|d) is
the prior density, f (y|θ, d) is the likelihood function, f (θ|y, d) is the posterior density, and
f (y|d) is the evidence. The second equality is due to the assumption that knowing the
design of an experiment without knowing its observations does not affect our belief about
the parameters (i.e., the prior would not change based on what experiment we plan to
do)—thus f (θ|d) = f (θ). The likelihood function is assumed to be given, and describes
the discrepancy between the observations and a forward model prediction in a probabilistic
way. The forward model, denoted by G(θ, d), is a function that maps both the parameters
and design into the observation space, and usually describes the outcome of some (possibly
computationally expensive, such as PDE-based) simulation process. For example, y can arise from (but is not limited to) an additive Gaussian likelihood model: y = G(θ, d) + ε, where ε ∼ N(0, σ_ε²), leading to a likelihood function of f(y|θ, d) = f_ε(y − G(θ, d)).
We take a decision-theoretic approach and follow the concept of expected utility (or expected reward) to quantify the value of experiments [18, 146, 130]. While utility functions
are quite flexible and can be based on loss functions defined for specific goals or tasks, we
focus on utility functions that lead to valid measures of information gain of experiments [75].
Taking an information-theoretic approach, we choose utility functions that reflect the expected information gain on the parameters θ [105, 106]. In particular, we use the relative
entropy, or the Kullback-Leibler (KL) divergence, from the posterior to the prior, and take
its expectation under the prior predictive distribution of the data to obtain an expected
utility U (d):
U(d) = ∫_Y ∫_H f(θ|y, d) ln[f(θ|y, d)/f(θ)] dθ f(y|d) dy = E_{y|d}[D_KL(f_{θ|y,d}(·|y, d) ‖ f_θ(·))],    (2.2)

where H ⊆ R^{n_θ} is the support of the prior. Because the observations y cannot be known
before the experiment is performed, taking the expectation over the prior predictive f (y|d)
lets the resulting utility function reflect the information gain on average, over all anticipated
outcomes of the experiment. The expected utility U (d) is thus the expected information gain
due to an experiment performed at design d. A more detailed derivation of the expected
utility can be found in [89, 90].
We choose to use the KL divergence for several reasons. First, KL is a special case of
a wide range of divergence measures that satisfy the minimal set of requirements to be a
valid measure of information on a set of experiments [75]. These requirements are based on
the sufficient ordering (or “always at least as informative” ordering) of experiments, and are
developed rigorously from likelihood ratio statistics, in a general setting without specifically
targeting decision-theoretic or Bayesian perspectives. Second, KL gives an intuitive indication of information gain in the sense of Shannon information [55]. Since KL reflects the
difference between two distributions, a large KL divergence from posterior to prior implies
that the observations y decrease entropy in θ by a large amount, and hence those observations are more informative for parameter inference. Indeed, the KL divergence reflects
the difference in information carried by two distributions in units of nats [55, 110], and the
expected information gain is also equivalent to the mutual information between the parameters θ and the observations y, given the design d. Third, such a formulation for general
nonlinear forward models (where G(θ, d) are nonlinear functions in the parameters θ) is
consistent with linear optimal design theory based on the Fisher information matrix [66, 9].
When a linear model is used in this formulation, it simplifies to linear D-optimal design, which is an attractive design approach due to, for example, its invariance under smooth model reparameterization [45]. Finally, the use of an information measure contrasts with a loss
function in that, while the former does not target a particular task (such as estimation) in the context of a decision problem, it provides general guidance for learning about the uncertain environment, gaining information that performs well for a wide range of tasks albeit not best for any particular one.
Typically, the expected utility in Equation 2.2 has no closed form (even if the forward
model is, for example, a polynomial function of θ). Instead, it must be approximated
numerically. By applying Bayes’ rule to the quantities inside and outside the logarithm in
Equation 2.2, and then introducing Monte Carlo approximations for the resulting integrals,
we obtain the nested Monte Carlo estimator proposed by Ryan [145]:

U(d) ≈ Û_{N,M}(d, θ_s, y_s) ≡ (1/N) Σ_{i=1}^{N} { ln f(y^{(i)}|θ^{(i)}, d) − ln[ (1/M) Σ_{j=1}^{M} f(y^{(i)}|θ̃^{(i,j)}, d) ] },    (2.3)

where θ_s ≡ {θ^{(i)}} ∪ {θ̃^{(i,j)}}, i = 1 . . . N, j = 1 . . . M, are i.i.d. samples from the prior f(θ); and y_s ≡ {y^{(i)}}, i = 1 . . . N, are independent samples from the likelihoods f(y|θ^{(i)}, d). The
variance of this estimator is approximately A(d)/N + B(d)/(NM) and its bias is (to leading order) C(d)/M [145], where A, B, and C are terms that depend only on the distributions at hand. While the estimator Û_{N,M} is biased for finite M, it is asymptotically unbiased.
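To make the estimator concrete, the following is a minimal Python sketch of Equation 2.3 for an additive Gaussian likelihood with a fixed noise standard deviation sigma; the names G, sample_prior, and sigma are illustrative placeholders for problem-specific inputs, not part of any established library.

    import numpy as np
    from scipy.special import logsumexp

    def u_hat(G, d, sample_prior, sigma, N, M, rng):
        """Nested Monte Carlo estimate of the expected information gain,
        Equation 2.3, for the likelihood y = G(theta, d) + eps with
        eps ~ N(0, sigma^2 I). A sketch; G and sample_prior are user-supplied."""
        def log_like(y, theta):
            r = np.atleast_1d(y - G(theta, d))
            # the -0.5*log(2*pi*sigma^2) terms cancel between the two
            # log-likelihood terms of Equation 2.3, so they are omitted
            return -0.5 * np.sum(r**2) / sigma**2

        total = 0.0
        for _ in range(N):
            theta = sample_prior(rng)                     # theta^(i) ~ f(theta)
            g = np.atleast_1d(G(theta, d))
            y = g + sigma * rng.standard_normal(g.shape)  # y^(i) ~ f(y|theta^(i), d)
            # inner sum over fresh prior samples theta~^(i,j), via logsumexp
            inner = np.array([log_like(y, sample_prior(rng)) for _ in range(M)])
            total += log_like(y, theta) - (logsumexp(inner) - np.log(M))
        return total / N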
Finally, the expected utility must be maximized over the design space D to find the
optimal design:
d* = argmax_{d∈D} U(d).    (2.4)
Since U can only be approximated by Monte Carlo estimators such as ÛN,M , optimization
methods for stochastic objective functions are needed.
2.2 Stochastic optimization
Optimization methods can be broadly categorized as gradient-based and non-gradient-based.
While gradient-based methods require additional gradient information, they are also generally more efficient than their non-gradient counterparts. With the intention of solving Equation 2.4, we make gradient information available for the batch OED problem in this
chapter. In particular, we consider two gradient-based stochastic optimization approaches:
Robbins-Monro (RM) stochastic approximation, and sample average approximation (SAA)
combined with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Both approaches
require some flavor of gradient information, but they do not use the exact gradient of U (d).
Calculating the latter is generally not possible, given that we only have a Monte Carlo
estimator of U(d). The use of non-gradient optimization methods in the context of batch OED, with simultaneous perturbation stochastic approximation [160, 161] and the Nelder-Mead simplex method [124], has been previously investigated by the author [89, 90].
Recasting the optimization problem in the convention of a minimization statement:
d* = argmin_{d∈D} [−U(d)] = argmin_{d∈D} { h(d) ≡ E_{y|d}[ĥ(d, y)] },    (2.5)
where ĥ(d, y) is the underlying unbiased estimator of the unavailable objective function
h(d) ≡ −U (d). We note that y is generally dependent on d.
2.2.1 Robbins-Monro stochastic approximation
The iterative update of the RM method is
d_{j+1} = d_j − a_j ĝ(d_j, y′),    (2.6)
where j is the optimization iteration index and ĝ(d_j, y′) is an unbiased estimator of the gradient (with respect to d) of h(d) evaluated at d_j. In other words, E_{y′|d}[ĝ(d, y′)] = ∇_d h(d), but ĝ is not necessarily equal to ∇_d ĥ. Also, y′ and y may, but need not, be related. The
gain sequence aj should satisfy the following properties:
Σ_{j=0}^{∞} a_j = ∞   and   Σ_{j=0}^{∞} a_j² < ∞.    (2.7)
One natural choice, used in this work, is the harmonic step size sequence aj = β/j, where
β is some appropriate scaling constant. For example, in the diffusion application problem
of Section 2.5, β is chosen to be 1.0 since the design space is [0, 1]². With various technical assumptions on ĝ and g, it can be shown that RM converges to the exact solution of
Equation 2.5 almost surely [102].
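As an illustration, a minimal Python sketch of the RM update (Equation 2.6) with the harmonic gain sequence follows; grad_est stands in for the unbiased gradient estimator ĝ (constructed in Section 2.4), and the clipping to [0, 1]² is an illustrative safeguard assumed here, not a step prescribed by the text.

    import numpy as np

    def robbins_monro(grad_est, d0, beta=1.0, max_iter=50, tol=1e-6, rng=None):
        """Robbins-Monro iteration d_{j+1} = d_j - a_j*g_hat(d_j) with a_j = beta/j.
        A sketch: grad_est(d, rng) must return an unbiased estimate of
        grad h(d) = -grad U(d); stopping criteria follow Section 2.2.1."""
        rng = np.random.default_rng() if rng is None else rng
        d = np.asarray(d0, dtype=float)
        stalled = 0
        for j in range(1, max_iter + 1):
            d_new = np.clip(d - (beta / j) * grad_est(d, rng), 0.0, 1.0)
            stalled = stalled + 1 if np.linalg.norm(d_new - d) < tol else 0
            d = d_new
            if stalled >= 5:   # change in d below tolerance for 5 iterations
                break
        return d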
Choosing the sequence aj is often viewed as the Achilles’ heel of RM, as the algorithm’s
performance can be very sensitive to the step size. We acknowledge this fact and do not
downplay the difficulty of choosing an appropriate gain sequence, but there exist logical
approaches to selecting aj that yield reasonable performance. More sophisticated strategies, such as search-then-converge learning rate schedules [57], adaptive stochastic step size
rules [17], and iterate averaging methods [135, 102], have been developed and successfully
demonstrated in applications.
We will also use relatively simple stopping criteria for the RM iterations: the algorithm
will be terminated when changes in d_j stall (e.g., ‖d_j − d_{j−1}‖ falls below some designated tolerance for 5 successive iterations) or when a maximum number of iterations has been reached (e.g., 50 iterations for the results of Section 2.5).
2.2.2 Sample average approximation
Transformation to design-independent noise
The central idea of SAA is to reduce the stochastic optimization problem to a deterministic
problem, by fixing the noise throughout the entire optimization process. In practice, if the
noise y is design-dependent, it is first transformed to a design-independent random variable
by effectively moving all the design dependence into the function ĥ. (An example of this
transformation is given in Section 2.4.) The noise variables at different d then share a
common distribution, and a common set of realizations is employed at all values of d.
Such a transformation is always possible in practice, since the random numbers in any
computation are fundamentally generated from uniform random (more precisely, pseudorandom) numbers. Thus one can always transform y back into these uniform random variables, which are of course independent of d. (One does not need to go all the way to the uniform random variables; any higher-level “transformed” random variables, as long as they remain independent of d, suffice.) For the remainder of this section (Section 2.2.2) we
shall, without loss of generality, assume that y has been transformed to a random variable
w that is independent of d, while retaining (with some abuse of notation) the same symbol ĥ(d, w).
Reduction to a deterministic problem
SAA approximates the optimization problem of Equation 2.5 with
d̂_s = argmin_{d∈D} { ĥ_N(d, w_s) ≡ (1/N) Σ_{i=1}^{N} ĥ(d, w_i) },    (2.8)
where d̂_s and ĥ_N(d̂_s, w_s) are the optimal design and objective values under a particular set of N realizations of the random variable w, w_s ≡ {w_i}_{i=1}^{N}. The same set of realizations is
used for different values of d during the optimization process, thus making the minimization
problem in Equation 2.8 deterministic. (One can view this approach as an application of
common random numbers.) A deterministic optimization algorithm can then be chosen to
find dˆs as an approximation to d∗ .
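In code, the reduction is simply a matter of drawing the noise set once and holding it fixed; a sketch (Python), with h_hat and sample_noise as placeholders for the problem-specific per-sample objective and the design-independent noise generator, and scipy's deterministic BFGS as one possible inner solver:

    import numpy as np
    from scipy.optimize import minimize

    def saa_solve(h_hat, sample_noise, d0, N, rng):
        """One SAA run (Equation 2.8): freeze N noise realizations, then solve
        the resulting deterministic problem. A sketch."""
        ws = [sample_noise(rng) for _ in range(N)]      # fixed for the whole run

        def h_N(d):                                     # deterministic in d
            return np.mean([h_hat(d, w) for w in ws])

        # any deterministic optimizer applies; BFGS is used here (Section 2.2.2)
        return minimize(h_N, d0, method="BFGS").x       # approximation to d*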
Estimates of h(d̂_s) can be improved by using ĥ_{N′}(d̂_s, w_{s′}) instead of ĥ_N(d̂_s, w_s), where ĥ_{N′}(d̂_s, w_{s′}) is computed from a larger set of realizations w_{s′} ≡ {w_m}_{m=1}^{N′} with N′ > N, in order to attain a lower variance. Finally, multiple (say R) optimization runs are often performed to obtain a sampling distribution for the optimal design values and the optimal objective values, i.e., d̂_s^r and ĥ_N(d̂_s^r, w_s^r), for r = 1 . . . R. The sets w_s^r are independently chosen for each optimization run, but remain fixed within each run. Under certain assumptions on the objective function and the design space, the optimal design and objective estimates in SAA generally converge to their respective true values in distribution at a rate of 1/√N [149, 97].⁴

⁴More precise properties of these asymptotic distributions depend on properties of the objective and the set of optimal solutions to the true problem. For instance, in the case of a singleton optimum d*, the SAA estimates ĥ_N(d̂_s, ·) converge to a Gaussian with variance Var_w[ĥ(d*, w)]/N. Faster convergence to the optimal objective value may be obtained when the objective satisfies stronger regularity conditions. The SAA solutions d̂_s are not in general asymptotically normal, however. Furthermore, discrete probability distributions lead to entirely different asymptotics of the optimal solutions.
For the solution of a particular deterministic problem d̂_s^r, stochastic bounds on the true optimal value can be constructed by estimating the optimality gap h(d̂_s^r) − h(d*) [127, 111]. The first term can simply be approximated using the unbiased estimator ĥ_{N′}(d̂_s^r, w_{s′}^r) since E_{w_{s′}}[ĥ_{N′}(d̂_s^r, w_{s′})] = h(d̂_s^r). The second term may be estimated using the average of the approximate optimal objective values across the R replicate optimization runs (based on w_s^r, rather than w_{s′}^r):
h̄_N = (1/R) Σ_{r=1}^{R} ĥ_N(d̂_s^r, w_s^r).    (2.9)
This is a negatively biased estimator and hence a stochastic lower bound on h(d*) [127, 111, 151].⁵,⁶ The difference ĥ_{N′}(d̂_s^r, w_{s′}^r) − h̄_N is thus a stochastic upper bound on the true optimality gap h(d̂_s^r) − h(d*). The variance of this optimality gap estimator can be derived from the Monte Carlo standard error formula [3]. One could then use the optimality gap estimator and its variance to decide whether more runs are required, or which approximate optimal designs are most trustworthy.

⁵Short proof from [151]: For any d ∈ D, we have that E_{w_s}[ĥ_N(d, w_s)] = h(d), and that ĥ_N(d, w_s^r) ≥ min_{d′∈D} ĥ_N(d′, w_s^r). Then h(d) = E_{w_s}[ĥ_N(d, w_s)] ≥ E_{w_s}[min_{d′∈D} ĥ_N(d′, w_s)] = E_{w_s}[ĥ_N(d̂_s^r, w_s)] = E_{w_s}[h̄_N].
⁶The bias decreases monotonically with N [127].
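A sketch of the gap computation for one replicate follows: the lower-bound term is averaged over the larger sample set, and its Monte Carlo standard error provides the dominant variance contribution (the spread of h̄_N across replicates would be added analogously).

    import numpy as np

    def optimality_gap(h_hat, d_r, ws_prime, h_bar_N):
        """Stochastic upper bound on the optimality gap h(d_r) - h(d*)
        (Section 2.2.2); a sketch."""
        vals = np.array([h_hat(d_r, w) for w in ws_prime])   # size N' sample
        gap = vals.mean() - h_bar_N                          # stochastic upper bound
        stderr = vals.std(ddof=1) / np.sqrt(len(vals))       # MC standard error of
        return gap, stderr                                   # the lower-bound term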
Pseudo-code for the SAA method is presented in Algorithm 1. At this point, we have
reduced the stochastic optimization problem to a series of deterministic optimization problems; a suitable deterministic optimization algorithm is still needed to solve them.
Algorithm 1: Pseudo-code for SAA.
1:  Set optimality gap tolerance η and number of replicate optimization runs R;
2:  r = 1;
3:  while optimality gap estimate > η and r ≤ R do
4:      Sample the set w_s^r = {w_i^r}_{i=1}^{N};
5:      Perform a deterministic optimization run and find d̂_s^r (see Algorithm 2);
6:      Sample the larger set w_{s′}^r = {w_m^r}_{m=1}^{N′}, where N′ > N;
7:      Compute ĥ_{N′}(d̂_s^r, w_{s′}^r) = (1/N′) Σ_{m=1}^{N′} ĥ(d̂_s^r, w_m^r);
8:      Estimate the optimality gap and its variance;
9:      r = r + 1;
10: end
11: Output the sets {d̂_s^r}_{r=1}^{R} and {ĥ_{N′}(d̂_s^r, w_{s′}^r)}_{r=1}^{R} for post-processing;
Broyden-Fletcher-Goldfarb-Shanno method
The BFGS method [126] is a gradient-based method for solving deterministic nonlinear
optimization problems, widely used for its robustness, ease of implementation, and efficiency.
It is a quasi-Newton method, iteratively updating an approximation to the (inverse) Hessian
matrix from objective and gradient evaluations at each stage. Pseudo-code for the BFGS
method is given in Algorithm 2. In the present implementation, a simple backtracking
line search is used to find a step size that satisfies the first (Armijo) Wolfe condition only.
The algorithm can be terminated according to many commonly used criteria: for example,
when the gradient stalls, the line search step size falls below a prescribed tolerance, the
design or function value stalls, or a maximum allowable number of iterations or objective
evaluations is reached. BFGS is shown to converge super-linearly to a local minimum if a
quadratic Taylor expansion exists near that minimum [126]. The limited memory BFGS
(L-BFGS) [126] method can also be used when the design dimension becomes very large
(e.g., more than 10⁴), such that the dense inverse Hessian cannot be stored explicitly.
Algorithm 2: Pseudo-code for BFGS. In this context, ĥ_N(d, w_s^r) is the deterministic objective function we want to minimize (as a function of d).
1:  Initialize starting point d_0, inverse Hessian approximation H_0, and gradient termination tolerance ε;
2:  Initialize any other termination conditions and parameters;
3:  j = 0;
4:  while ‖∇_d ĥ_N(d_j, w_s^r)‖ > ε and other termination conditions are not met do
5:      Compute search direction p_j = −H_j ∇_d ĥ_N(d_j, w_s^r);
6:      Find acceptable step size α_j via line search;
7:      Update position d_{j+1} = d_j + α_j p_j;
8:      Define vectors s_j = d_{j+1} − d_j and u_j = ∇_d ĥ_N(d_{j+1}, w_s^r) − ∇_d ĥ_N(d_j, w_s^r);
9:      Update inverse Hessian approximation H_{j+1} = (I − s_j u_j^T / (s_j^T u_j)) H_j (I − u_j s_j^T / (u_j^T s_j)) + s_j s_j^T / (s_j^T u_j);
10:     j = j + 1;
11: end
12: Output d̂_s^r = d_j;
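For concreteness, the rank-two inverse-Hessian update of line 9 can be written in a few lines of Python; this is a direct transcription of the standard BFGS formula, a sketch with no safeguard for the curvature condition s_j^T u_j > 0.

    import numpy as np

    def bfgs_update(H, s, u):
        """One BFGS inverse-Hessian update (line 9 of Algorithm 2).
        s = d_{j+1} - d_j, u = grad_{j+1} - grad_j; assumes s @ u > 0."""
        rho = 1.0 / (s @ u)
        I = np.eye(len(s))
        return (I - rho * np.outer(s, u)) @ H @ (I - rho * np.outer(u, s)) \
            + rho * np.outer(s, s)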
2.2.3 Challenges in optimal experimental design
The main challenge in applying the aforementioned stochastic optimization algorithms to
batch OED is the lack of readily available gradient information. For RM, we need an unbiased estimator of the gradient of the expected utility, i.e., ĝ in Equation 2.6. For SAA-BFGS,
we need the gradient of the finite-sample Monte Carlo approximation of the expected utility,
i.e., ∇_d ĥ_N(·, w_s^r).
We address these needs by introducing two concepts in the next two sections:
1. A simple surrogate model, based on polynomial chaos expansions (see Section 2.3),
replaces the often computationally intensive forward model. The purpose of the surrogate is twofold. First, it allows the nested Monte Carlo estimator in Equation 2.3
to be evaluated in a computationally tractable manner. Second, its polynomial form
allows the gradient of Equation 2.3, ∇_d ĥ_N(·, w_s^r), to be derived analytically. These
gains come at the expense of introducing additional error via the polynomial approximation of the original forward model, however. In other words, given a surrogate for
the forward model and the resulting expected information gain, we can derive exact
gradients of a Monte Carlo approximation of this expected information gain, and use
these gradients in SAA.
2. Infinitesimal perturbation analysis (see Section 2.4) applied to Equation 2.2, along with
the estimator in Equation 2.3 and the polynomial surrogate model, allows the analytic
derivation of an unbiased gradient estimator ĝ, as required for the RM approach.
2.3 Polynomial chaos expansions
This section introduces polynomial chaos expansions (PCE) for mitigating the cost of repeated forward model evaluations. In the next section, they will also be used to help evaluate appropriate gradient information for stochastic optimization methods.
Mathematical models of the experiment enter the inference and design formulation
through the likelihood function f (y|θ, d). For example, a simple likelihood function might
allow for an additive discrepancy ε between experimental observations and model predictions, y = G(θ, d) + ε, where G is the forward model. Computationally intensive forward
models can render Monte Carlo estimation of the expected information gain impractical.
In particular, drawing a sample from f (y|θ, d) requires evaluating G at a particular (θ, d).
Evaluating the density f(y|θ, d) = f_ε(y − G(θ, d)) again requires evaluating G.
To make these calculations tractable, one would like to replace G with a cheaper “surrogate” model that is accurate over the entire prior support and the entire design space
D. As discussed near the end of Section 1.1.1, various options exist with different properties. We focus on PCE, which has seen extensive use in a range of engineering applications (e.g., [88, 141, 173, 181]) including parameter estimation and inverse problems
(e.g., [113, 112, 114]). More recently, it has also been used in the batch OED setting [89, 90],
with excellent accuracy and multiple order-of-magnitude speedups over direct evaluations
of the forward model.
The formulation of PCE is as follows. Any random variable z with finite variance can
be represented by an infinite series
z = Σ_{|i|=0}^{∞} a_i Ψ_i(ξ₁, ξ₂, . . .),    (2.10)
where i = (i₁, i₂, . . .), i_j ∈ N₀, is an infinite-dimensional multi-index (we bold this index to emphasize its multidimensional nature); |i| = i₁ + i₂ + · · · is the ℓ₁ norm; a_i ∈ R are the expansion coefficients; ξ_i are independent random variables; and
Ψ_i(ξ₁, ξ₂, . . .) = Π_{j=1}^{∞} ψ_{i_j}(ξ_j)    (2.11)
are multivariate polynomial basis functions [180]. Here ψ_{i_j} is an orthogonal polynomial of order i_j in the variable ξ_j, where orthogonality is with respect to the density of ξ_j,
E_ξ[ψ_m(ξ) ψ_n(ξ)] = ∫_Ξ ψ_m(ξ) ψ_n(ξ) f(ξ) dξ = δ_{m,n} E_ξ[ψ_m²(ξ)],    (2.12)
and Ξ is the support of f(ξ). The expansion in Equation 2.10 is convergent in the mean-square sense [39]. For computational purposes, the infinite sum in Equation 2.10 must be
truncated to some finite stochastic dimension n_s and a finite number of polynomial terms.
A common choice is the “total-order” truncation |i| ≤ p, but other truncations that retain
fewer cross terms, a larger number of cross terms, or anisotropy among the dimensions are
certainly possible [53].
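To make the truncation concrete, the following sketch enumerates the total-order multi-index set {i : |i| ≤ p} in n_s dimensions and evaluates the corresponding tensor-product Legendre basis function at a point; for the n_s = 4, p = 12 setting used later in Section 2.5, this set contains 1820 terms.

    import numpy as np
    from itertools import product
    from numpy.polynomial.legendre import Legendre

    def total_order_indices(ns, p):
        """All multi-indices i in N_0^ns with |i| = i_1 + ... + i_ns <= p."""
        return [i for i in product(range(p + 1), repeat=ns) if sum(i) <= p]

    def psi(index, xi):
        """Tensor-product Legendre basis Psi_i(xi), Equation 2.11, evaluated
        at a point xi in [-1, 1]^ns (unnormalized); a sketch."""
        return np.prod([Legendre.basis(ij)(x) for ij, x in zip(index, xi)])

    # e.g., len(total_order_indices(4, 12)) == 1820 basis terms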
In the OED context, the model outputs depend on both the parameters and the design.
Constructing a new polynomial expansion at each value of d encountered during optimization
is generally impractical. Instead, we can construct a single PCE for each component of G,
depending jointly on θ and d [90]. To proceed, we assign one stochastic dimension to each
component of θ and one to each component of d. Further, we assume an affine transformation
between each component of d and the corresponding ξi ; any realization of d can thus be
uniquely associated with a vector of realizations ξi . Since the design variables will usually
be supported on a bounded domain (e.g., inside some hyper-rectangle), the corresponding
ξi are endowed with uniform distributions. The associated univariate ψi are thus Legendre
polynomials. These distributions effectively define a uniform weight function over the design
space D that governs where the L²-convergent PCE should be most accurate.⁷
⁷Ideally, we would like to use a weight function that is proportional to how often the different d values are visited over the entire algorithm (e.g., from stochastic optimization). This distribution, if known, could replace the uniform distribution and define a more efficient weighted L² norm; however, it is almost always too complex to extract in practice.
Constructing the PCE involves computing the coefficients ai . This computation generally
can proceed via two possible approaches, intrusive and nonintrusive. The intrusive approach
results in a new system of equations that is larger than the original deterministic system,
but it needs to be solved only once. The difficulty of this latter step depends strongly on the
character of the original equations, however, and may be prohibitive for arbitrary nonlinear
systems. The nonintrusive approach computes the expansion coefficients by directly using
the quantity of interest (e.g., the model outputs), for example, by projecting them onto the
basis functions Ψi . One advantage of this method is that the deterministic solver can be
reused and treated as a black box. The deterministic problem then needs to be solved many
times, but typically at carefully chosen parameter and design values. The nonintrusive
approach also offers flexibility in choosing arbitrary functionals of the state trajectory as
observation variables; these functionals may depend smoothly on ξ even when the state
itself has a less regular dependence. Here, we will employ a nonintrusive approach.
Applying orthogonality, the PCE coefficients for a forward model surrogate are simply
G_{c,i} = E_ξ[G_c(θ(ξ), d(ξ)) Ψ_i(ξ)] / E_ξ[Ψ_i²(ξ)] = ∫_Ξ G_c(θ(ξ), d(ξ)) Ψ_i(ξ) f(ξ) dξ / ∫_Ξ Ψ_i²(ξ) f(ξ) dξ,    (2.13)
where G_{c,i} is the coefficient of Ψ_i for the cth component of the model outputs. Analytic expressions are available for the denominators E_ξ[Ψ_i²(ξ)], but the numerators must be
evaluated numerically. When the evaluations of the integrand (and hence the forward model)
are expensive and ns is large, an efficient method for numerical integration in high dimensions
is essential.
To evaluate the numerators in Equation 2.13, we employ Smolyak sparse quadrature
based on one-dimensional Clenshaw-Curtis quadrature rules [50]. Care must be taken to
avoid significant aliasing errors when using sparse quadrature to construct polynomial approximations, however. Indeed, it is advantageous to recast the approximation as a Smolyak
sum of constituent full-tensor polynomial approximations, each associated with a tensor-product quadrature rule that is appropriate to its polynomials [54, 53]. This type of approximation may be constructed adaptively, thus taking advantage of weak coupling and
anisotropy in the dependence of G on θ and d. More details can be found in [53].
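As a small one-dimensional illustration of the projection in Equation 2.13 (a full-tensor Gauss-Legendre rule standing in for the adaptive Smolyak construction described above), consider a scalar model g of a single uniform random variable ξ on [−1, 1]:

    import numpy as np
    from numpy.polynomial.legendre import leggauss, Legendre

    def pce_coefficients(g, p, n_quad=25):
        """Nonintrusive projection a_i = E[g(xi) psi_i(xi)] / E[psi_i(xi)^2]
        for xi ~ Uniform(-1, 1) and Legendre basis psi_i; a 1D sketch of
        Equation 2.13."""
        nodes, weights = leggauss(n_quad)   # exact for degree <= 2*n_quad - 1
        weights = weights / 2.0             # account for the uniform density 1/2
        coeffs = []
        for i in range(p + 1):
            num = np.sum(weights * g(nodes) * Legendre.basis(i)(nodes))
            den = 1.0 / (2 * i + 1)         # E[P_i(xi)^2] for uniform on [-1, 1]
            coeffs.append(num / den)
        return np.array(coeffs)

    # usage: a degree-4 surrogate of g(x) = exp(x)
    a = pce_coefficients(np.exp, p=4)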
At this point, we may substitute the polynomial approximation of G into the likelihood
function f (y|θ, d), which in turn enters the expected information gain estimator in Equation 2.3. This enables fast evaluation of the expected information gain. The computation
of appropriate gradient information is discussed next.
2.4 Infinitesimal perturbation analysis
This section applies the method of infinitesimal perturbation analysis (IPA) [87, 76, 7] to
construct an unbiased estimator ĝ of the gradient of the expected information gain, for use
in RM. The same procedure yields the gradient ∇d ĥN,M (·, wsr ) of a finite-sample Monte
Carlo approximation of the expected information gain, for use in SAA. The central idea of
IPA is that under certain conditions, an unbiased estimator of the gradient of a function
can be obtained by simply taking the gradient of an unbiased estimator of the function. We
apply this idea in the context of batch OED.
The first requirement of IPA is the availability of an unbiased estimator of the objective
function. Unfortunately, as described in Section 2.1, ÛN,M from Equation 2.3 is a biased
estimator of U for finite M [145]. To circumvent this technicality, let us optimize the
following objective function instead of U :
Ū_M(d) ≡ E_{θ_s,y_s|d}[Û_{N,M}(d, θ_s, y_s)]
      = ∫_{Y_s} ∫_{H_s} Û_{N,M}(d, θ_s, y_s) f(θ_s, y_s|d) dθ_s dy_s
      = ∫_{Y_s} ∫_{H_s} Û_{N,M}(d, θ_s, y_s) Π_{(i,j)=(1,1)}^{(N,M)} f(y^{(i)}|θ^{(i)}, d) f(θ^{(i)}) f(θ̃^{(i,j)}) dθ_s dy_s.    (2.14)
Our original estimator ÛN,M is now unbiased for the new objective ŪM by construction!
The trade-off, of course, is that the function being optimized is no longer the true U . But
it is consistent in that ŪM (d) → U (d) as M → ∞, for any N > 0. (To illustrate this
convergence in the numerical results of Section 2.5, realizations of ÛN,M , i.e., Monte Carlo
approximations of ŪM , are plotted in Figure 2-2 for varying M .)
The second requirement of IPA comprises conditions allowing an unbiased gradient estimator to be constructed by taking the gradient of the unbiased function estimator. Standard
conditions (see, for example, [7]) require that the random quantity (e.g., ÛN,M ) be almost
surely continuous and differentiable. Here, because ÛN,M is parameterized by continuous
random variables that have densities with respect to Lebesgue measure, we can take a perspective that relies on Leibniz’s rule with the following conditions:
1. Û_{N,M} and ∇_d Û_{N,M} are continuous over the product space of design variables and random variables, D × H_s × Y_s;
2. the density of the “noise” random variable is independent of d.
The first condition supports the interchange of differentiation and integration according
to Leibniz’s rule. This condition might be difficult to verify in general cases, but the use
of finite-order polynomial forward models and continuous distributions for the prior and
observational noise ensures that we meet the requirement.
The second condition is needed to preserve the form of the expectation. If it is violated,
differentiation with respect to d must be performed on the f(θ_s, y_s|d) term as well via the product rule, in which case the additional term ∫_{Y_s} ∫_{H_s} Û_{N,M}(d, θ_s, y_s) ∇_d[f(θ_s, y_s|d)] dθ_s dy_s
would no longer be an expectation with respect to the original density. The likelihood-ratio
method may be used to restore the expectation [77, 7], but it is not pursued here. Instead,
it is simpler to transform the noise to a design-independent random variable as described in
Section 2.2.2.
In the context of OED, the outcome of the experiment y is a stochastic quantity that
depends on the design d. From the stochastic optimization perspective, y is thus the noise
variable. To demonstrate the transformation to design-independent noise, we assume a
likelihood where the data result from an additive Gaussian perturbation to the forward
model:
y = G(θ, d) + ε = G(θ, d) + C(θ, d) z.    (2.15)
Here C is a diagonal matrix with non-zero entries reflecting the dependence of the noise
standard deviation on other quantities, and z is a vector of i.i.d. standard normal random
variables. For example, “10% Gaussian noise on the cth component” would translate to C_{c,i} = δ_{ci} · 0.1 |G_c(θ, d)|, where δ_{ci} is the Kronecker delta function. For other forms of the
likelihood, the right-hand side of Equation 2.15 is simply replaced by a generic function of
θ, d, and some design-independent random variable z. Here, however, we will focus on the
additive Gaussian form in order to derive illustrative expressions.
By extracting a design-independent random variable z from the noise term ε ≡ C(θ, d)z,
we will satisfy the second condition above. The design dependence of y is incorporated into
ÛN,M by substituting Equation 2.15 into Equation 2.3:
Û_{N,M}(d, θ_s, z_s) = (1/N) Σ_{i=1}^{N} { ln f_{y|θ,d}(G(θ^{(i)}, d) + C(θ^{(i)}, d) z^{(i)} | θ^{(i)}, d)
                      − ln[ (1/M) Σ_{j=1}^{M} f_{y|θ,d}(G(θ^{(i)}, d) + C(θ^{(i)}, d) z^{(i)} | θ̃^{(i,j)}, d) ] },    (2.16)
where z_s = {z^{(i)}}. The new noise variables are now independent of d. The samples of y^{(i)}
drawn from the likelihood are instead realized by drawing z^{(i)} from a multivariate standard
Gaussian, then multiplying these samples by C and adding them to the model output.
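A sketch of this reparameterized sampling combines Equation 2.15 with a diagonal C of the “10% of signal” form used as the running example (the floor term is an option anticipating the noise model of Section 2.5.1); the same z can then be reused at every d visited by the optimizer.

    import numpy as np

    def sample_observation(G, theta, d, z, floor=0.0, rel=0.1):
        """Realize y = G(theta, d) + C(theta, d) z (Equation 2.15) from a
        design-independent standard normal vector z. A sketch: with floor=0
        this is the '10% Gaussian noise' example; floor=0.1 reproduces the
        noise model of Section 2.5.1."""
        g = np.atleast_1d(G(theta, d))
        c = floor + rel * np.abs(g)    # diagonal entries of C(theta, d)
        return g + c * z               # reuse the same z at different designs d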
With all conditions for IPA satisfied, an unbiased estimator of the gradient of ŪM ,
corresponding to ĝ in Equation 2.6, is simply ∇d ÛN,M (d, θs , zs ) since
E_{θ_s,z_s}[∇_d Û_{N,M}(d, θ_s, z_s)] = ∫_{Z_s} ∫_{H_s} ∇_d Û_{N,M}(d, θ_s, z_s) f(θ_s, z_s) dθ_s dz_s
  = ∇_d ∫_{Z_s} ∫_{H_s} Û_{N,M}(d, θ_s, z_s) f(θ_s, z_s) dθ_s dz_s
  = ∇_d E_{θ_s,z_s}[Û_{N,M}(d, θ_s, z_s)]
  = ∇_d Ū_M(d),    (2.17)
where Z_s is the support of f(z_s). This gradient estimator is therefore suitable for use in
RM.
The gradient of the finite-sample Monte Carlo approximation of U(d), i.e., ∇_d ĥ_{N,M}(·, w_s^r) used in SAA, takes exactly the same form. The only difference between the two is that ĝ lets θ_s and z_s be random at every iteration of the optimization process. When used as ∇_d ĥ_{N,M}(·, w_s^r), θ_s and z_s are frozen at some realization throughout the optimization process. In either case, these gradient expressions contain derivatives of the likelihood function
and thus derivatives ∇_d G(θ, d). When G is replaced with a polynomial expansion, these
derivatives can be computed inexpensively. Detailed derivations of the gradient estimator
using orthogonal polynomial expansions can be found in Appendix A.
2.5 Numerical results: 2D diffusion source inversion problem

2.5.1 Problem setup
We demonstrate the batch OED formulation and stochastic optimization tools on a source
inversion problem in a 2D diffusion field. The goal is to place a single sensor that yields
maximum information about the location of the contaminant source. Contaminant transport
is governed by a scalar diffusion equation on a square domain:
∂w/∂t = ∇²w + S(x_src, x, t),    x ∈ X = [0, 1]²,    (2.18)
where w(x, t; xsrc ) is the space-time concentration field parameterized by the coordinate of
the source center x_src. We impose homogeneous Neumann boundary conditions

∇w · n = 0 on ∂X,    (2.19)

where n is the normal vector, along with a zero initial condition

w(x, 0; x_src) = 0.    (2.20)
The source function has a Gaussian spatial profile
S(x_src, x, t) = (s/(2πh²)) exp(−‖x_src − x‖²/(2h²)) for 0 ≤ t < τ, and S(x_src, x, t) = 0 for t ≥ τ,    (2.21)
where s, h, and τ are known (prescribed) source intensity, width, and shutoff time parameters, respectively, and x_src ≡ (θ_x, θ_y) = θ is the unknown source location that we would ultimately like to infer. The design vector is the location of a single sensor, x_sensor ≡ (d_x, d_y) = d, and the observations {y_i}_{i=1}^{5} comprise five noisy point measurements of w at the sensor location and at five equally-spaced sample times. For this study, we choose s = 2.0, h = 0.05, τ = 0.3; a uniform prior θ_x, θ_y ∼ U(0, 1); and an additive noise likelihood model y_i = w(x_sensor, t_i; x_src) + ε_i, i = 1 . . . 5, such that the ε_i are zero-mean Gaussian random variables, mutually independent given x_sensor, t, and x_src, each with standard deviation σ_i = 0.1 + 0.1 |w(x_sensor, t_i; x_src)|. In other words, the measurement noise associated with
the data has a “floor” value of 0.1 plus an additional contribution that is 10% of the signal.
The sensor may be placed anywhere in the square domain, such that the design space is
(d_x, d_y) ∈ [0, 1]². Figure 2-1 shows an example concentration profile and measurements.
Figure 2-1: Example forward model solution and realizations from the likelihood. The solid line represents the time-dependent contaminant concentration w(x, t; x_src) at x = x_sensor = (0, 0), given a source centered at x_src = (0.1, 0.1), source strength s = 2.0, width h = 0.05, and shutoff time τ = 0.3. Parameters are defined in Equation 2.18. The five crosses represent noisy measurements at five designated measurement times.
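For reference, a sketch of the source term of Equation 2.21 and the measurement model of this section (names are illustrative):

    import numpy as np

    def source(x, x_src, t, s=2.0, h=0.05, tau=0.3):
        """Gaussian source term S(x_src, x, t) of Equation 2.21; a sketch."""
        if t >= tau:
            return 0.0
        return s / (2 * np.pi * h**2) * np.exp(
            -np.sum((np.asarray(x_src) - np.asarray(x))**2) / (2 * h**2))

    def noisy_measurements(w_sensor, rng):
        """y_i = w + eps_i with sigma_i = 0.1 + 0.1*|w| (Section 2.5.1); a sketch.
        w_sensor holds the concentrations at the five measurement times."""
        w = np.asarray(w_sensor, dtype=float)
        return w + (0.1 + 0.1 * np.abs(w)) * rng.standard_normal(w.shape)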
Evaluating the forward model thus requires solving the PDE in Equation 2.18 at fixed
realizations of θ = x_src and extracting the solution field at the design location d = x_sensor. We
discretize Equation 2.18 using 2nd-order centered differences on a 25 × 25 spatial grid and a
4th-order backward differentiation formula for time integration. As described in Section 2.3,
we replace the full forward model with a PCE surrogate, for computational efficiency. To
this end, we construct a Legendre polynomial approximation of the forward model output
over the 4-dimensional joint parameter and design space, using a total-order polynomial truncation of degree 12 and 10⁶ forward model evaluations. This high polynomial degree and rather large number of forward model evaluations are deliberately selected in order to render truncation and aliasing errors insignificant in our study. OED results of similar quality may be obtained for this problem with surrogates of lower order and with far fewer quadrature points (e.g., degree 4 with 10⁴ forward model evaluations), but for brevity they are not included here. The relative L² errors of the current surrogate range from 6 × 10⁻³ to 10⁻⁶.
The OED formulation now seeks the sensor location d* = x*_sensor such that, when the experiment is performed, on average—i.e., averaged over all possible source locations according to the prior, and over all possible resulting concentration measurements according to the likelihood—the five concentration readings {y_i}_{i=1}^{5} yield the greatest information gain
from prior to posterior.
2.5.2 Results
Objective function
Before we present the results of numerical optimization, we first explore the properties of
the expected information gain objective. Numerical realizations of Û_{N,M} for N = 1001 and M = 2, 11, 101, and 1001 are shown in Figure 2-2. These plots can be interpreted as 1-sample Monte Carlo approximations of Ū_M = E[Û_{N,M}], or equivalently, as l-sample Monte Carlo approximations of Ū_M = E[Û_{(N/l),M}]. As N grows, Û_{N,M} becomes a better approximation to Ū_M, and as M grows, Ū_M becomes a better approximation to U. The
figures show that values of ÛN,M increase when M increases (for fixed N ), suggesting a
negative bias at finite M . At the same time, the objective becomes less flat in d; since U is
certainly closer to the M = 1001 surface than the M = 2 surface, these results suggest that U
is not particularly flat in d. This feature of the current design problem is encouraging, since
stochastic optimization problems with higher curvature can be more easily solved; in the
context of stochastic optimization, for example, they effectively have a higher signal-to-noise
ratio.
The expected information gain objective inherits symmetries from the square, as expected from the physical nature of the problem. The plots also suggest a smooth albeit
non-convex underlying objective U , with inflection points lying on an interior circle and
four local maxima symmetrically located at the corners of the design space. The best placement for a single sensor is therefore at the corners of the design space, while the worst
placement is at the center. The reason for this perhaps counterintuitive result is that the
diffusion process is isotropic: a series of concentration measurements can only determine
the distance of the source from the sensor, not its orientation. The posterior distribution
thus resembles an annulus of constant radius surrounding the sensor. A sensor placement
that minimizes the area of these annuli, averaged over all possible source locations according
to the prior, tends to be optimal. In this problem, because of the domain geometry and
the magnitude of the observational noise, these optimal locations happen to be the furthest points from the domain center, i.e., the corners.

Figure 2-2: Surface plots of independent Û_{N,M} realizations, evaluated over the entire design space [0, 1]² ∋ d = (x, y), for N = 1001 and (a) M = 2, (b) M = 11, (c) M = 101, (d) M = 1001. Note that the vertical axis ranges and color scales vary among the subfigures.
Figure 2-3 shows posterior densities for the source location, under different sensor placements, given data generated from a “true” source centered at xsrc = (0.09, 0.22). The
posterior densities are evaluated using the PCE surrogate via Bayes’ rule, while the data
are generated by directly solving the diffusion equation on a denser (101 × 101) spatial grid
than before and then adding the Gaussian noise described in Section 2.5.1. Note that the
posteriors are extremely non-Gaussian. Moreover, they generally include the true source
location, but do not center on it. Reasons for not expecting the posterior mode to match
the true source location are twofold: first, we have only 5 measurements, each perturbed
with a relatively significant random noise; second, there is model error, due to mismatch
between the PCE approximation constructed from the coarser spatial discretization of the
PDE and the more finely discretized PDE model used to simulate the data.⁸,⁹ For this
source configuration, it appears that a sensor placed at any of the corners yields a “tighter”
posterior than a sensor placed at the center. But we must keep in mind that this result is
not guaranteed for all source locations and data realizations; it depends on where the source
actually is. [Imagine, for example, if the source happened to be very close to the center of
the domain; then the sensor at (0.5, 0.5) would yield the tightest posterior.] What the batch
OED method yields is the optimal sensor placement averaged over the prior distribution of
the source location and the predictive distribution of the data.
Figure 2-3: Contours of posterior densities for the source location, given different sensor placements: (a) x_sensor = (0.0, 0.0); (b) x_sensor = (0.0, 1.0); (c) x_sensor = (1.0, 0.0); (d) x_sensor = (1.0, 1.0); (e) x_sensor = (0.5, 0.5). The true source location, marked with a blue circle, is x_src = (0.09, 0.22).

⁸Indeed, there are two levels of model error: (1) between the PCE and the PDE model used to construct the PCE, which has a ∆x = ∆y = 1/24 spatial discretization; (2) between this PDE model and the more finely discretized (∆x = ∆y = 1/100) PDE model used to simulate the noisy data.
⁹Model error is an extremely important aspect of uncertainty quantification [94], but its treatment is beyond the scope of this thesis. Understanding the impact of model error on OED is an important direction for future work.
Stochastic optimization results
We now analyze the optimization results, first assessing the behavior of the two stochastic
optimization methods individually, and then comparing their performance. Simple termination criteria are used for both methods, stopping when ‖d_j − d_{j−1}‖ falls below a tolerance of 10⁻⁶ for 5 successive iterations, or when a maximum number of 50 iterations has been
reached.
Recall that the RM algorithm is essentially a steepest-ascent method (since we are maximizing the expected utility) with a stochastic gradient estimate. Figures 2-4–2-6 each show
four sample RM optimization paths overlaid on the ÛN,M surfaces from Figure 2-2. The
optimization does not always proceed in an ascent direction, due to the noise in the gradient
estimate, but even a noisy gradient can be useful in eventually guiding the algorithm to
regions of high objective value. Naturally, fewer iterations are needed and good designs are
more likely to be found when the variance of the gradient estimator is reduced by increasing
N and M . Note that one must be cautious not to over-generalize from these figures, since
the paths shown in each plot are not necessarily representative. Instead, their purpose is to
provide intuition about the optimization mechanics. Data derived from many runs are more
appropriate performance metrics, and will be used later in this section.
For SAA-BFGS, each choice of the sample set w_s^r yields a different deterministic objective; example realizations of this objective surface are shown in Figures 2-7–2-9. For each
realization, a local maximum is found efficiently by the BFGS algorithm, requiring only a
few (usually less than 10) iterations. For each set of results corresponding to a particular
N (i.e., each of Figures 2-7–2-9), the random numbers used for smaller values of M are
proper subsets of those used for larger M . We thus expect some similarity and a sense of
convergence among the subplots in each figure. Note also that when N is low, realizations
of the objective can be extremely different from Figure 2-2 (for example, the plots in Figure 2-7 have local maxima near the center of the domain), although improvement is observed
as N is increased. In general, each deterministic problem in SAA can have very different
features than the underlying objective function. None of the realizations encountered here
has maxima at the corners, or is even symmetric. Nonetheless, when sampling over many
SAA subproblems, even a low N can provide reasonably good results. This will be shown
in Tables 2.1 and 2.2, and discussed in detail below.
Figure 2-4: Sample paths of the RM algorithm with N = 1, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values: (a) M = 2, (b) M = 11, (c) M = 101, (d) M = 1001. The large marker is the starting position and the large × is the final position.
To compare the performance of RM and SAA-BFGS, 1000 independent runs are conducted for each algorithm, over a matrix of N and M values. The starting locations of these
runs are sampled from a uniform distribution over the design space. We make reasonable
choices for the numerical parameters in each algorithm (e.g., gain schedule scaling, termination criteria) leading to similar run times. Histograms of the final design parameters (sensor
positions) resulting from each set of 1000 optimization runs are shown in Table 2.1. The
top figures in each major row represent RM results, while the bottom figures in each major
row correspond to SAA-BFGS results. Columns correspond to different values of M . It is
immediately apparent that more designs cluster at the corners of the domain as N and M
are increased. For the case with the largest number of samples (N = 101 and M = 1001),
Figure 2-5: Sample paths of the RM algorithm with N = 11, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values: (a) M = 2, (b) M = 11, (c) M = 101, (d) M = 1001. The large marker is the starting position and the large × is the final position.
each corner has around 250 designs, suggesting that higher sample sizes cannot further improve the optimization results. An “overlap” in quality across the different N cases is also
observed: for example, results of the N = 101, M = 2 case are worse than those of the
N = 11, M = 1001 case. A balance is thus needed in choosing sample sizes N and M, and
it is not ideal to heavily favor sampling either the inner or outer Monte Carlo loop in ÛN,M .
Overall, comparing the RM and SAA-BFGS plots at intermediate values of M and N , we
see that RM has a slight advantage over SAA-BFGS by placing more designs at the corners.
The distribution of final designs alone does not reflect the robustness of the optimization
results. For example, if U is very flat near the optimum, then suboptimal designs need not
Figure 2-6: Sample paths of the RM algorithm with N = 101, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values: (a) M = 2, (b) M = 11, (c) M = 101, (d) M = 1001. The large marker is the starting position and the large × is the final position.
be very close to the true optimum in the design space to be considered good designs in
practice. To evaluate robustness, a “high-quality” objective estimate Û1001,1001 is computed
for each of the 1000 final designs considered above. The resulting histograms are shown
in Table 2.2, where again the top subrows are for RM and the bottom subrows are for
SAA-BFGS, with the results covering a full range of N and M values. In keeping with our
previous observations, performance is improved as N and M are increased—in that the mean
(over the optimization runs) expected information gain increases, while the variance in the
expected information gain decreases. Note, however, that even if all 1000 optimization runs
produced identical final designs, this variance will not reach zero, as there exists a “floor”
corresponding to the variance of the estimator Û1001,1001 . This minimum variance can be
Figure 2-7: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 1: (a) M = 2, (b) M = 11, (c) M = 101, (d) M = 1001. The large marker is the starting position and the large × is the final position.
observed in the histograms of the RM results with N = 101 and M = 101 or 1001.
One interesting feature of the histograms in Table 2.2 is their bimodality. The higher
mode reflects designs near the four corners, while the lower mode encompasses all other
suboptimal designs. As N or M increase, we observe a transfer of probability mass from the
lower mode to the upper mode. However, the sample sizes are not large enough for the lower
mode to completely disappear for most cases; it is only absent in the two RM cases with
the largest sample sizes. Overall, the histograms are similar in shape for both algorithms,
but RM appears to produce less variability in the expected information gain, particularly
at high N values.
Table 2.3 shows histograms of optimality gap estimates from the 1000 SAA-BFGS runs.
Figure 2-8: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 11: (a) M = 2, (b) M = 11, (c) M = 101, (d) M = 1001. The large marker is the starting position and the large × is the final position.
Since we are dealing with a maximization problem (for the expected information gain),
the estimator from Section 2.2.2 is reversed in sign, such that the upper bound is now h̄_N and the lower bound is ĥ_{N′}(d̂_s^r, w_{s′}^r). The lower bound must be evaluated with the same
inner-loop Monte Carlo sample size M used in the optimization run in order to represent
an identically-biased underlying objective; hence, the lower bound values will not be the
same as the “high-quality” objective estimates Û1001,1001 discussed above. From the table,
we observe that as N increases, values of the optimality gap estimate decrease. This is
a result of the lower bound rising with N (since the optimization is better able to find
designs in regions of large ŪM , e.g., corners of the domains in Table 2.1), and the upper
Figure 2-9: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 101: (a) M = 2, (b) M = 11, (c) M = 101, (d) M = 1001. The large marker is the starting position and the large × is the final position.
bound simultaneously falling (since its positive bias monotonically decreases with N [127]).
Consequently, both bounds become tighter and the gap estimates tend toward zero. As
M increases, the variance of the gap estimates increases. Since the upper bound (h̄N ) is
fixed for a given set of SAA runs, the spread is only affected by the variability of the lower
bound. Indeed, from Figure 2-2, it is apparent that the objective becomes less flat as M
increases, with the highest gradients (considering the good design regions only) occurring at
the corners. This translates to a higher sensitivity, as a small “imperfection” in the design
would lead to larger changes in the objective estimate; one would then expect the variation of ĥ_{N′}(d̂_s^r, w_{s′}^r) to become higher as well, leading to greater variance in the gap estimates.
Finally, as M increases, the histogram values tend to increase, but they increase more slowly
for larger values of N . Some intuition for this result may be obtained by considering the
relative rates of change of the upper and lower bounds with respect to M , given different
values of N . Again referring to Figure 2-2, the objective values generally increase with
M , indicating an increase of the lower bound. This increase should be more pronounced
for larger N , since the optimization converges to designs closer to the corners, where, as
mentioned earlier, the objective has larger gradient. The upper bound increases with M
as well, as indicated by the contour levels in Figures 2-7–2-9. But this rate of increase is
observed to be slowest at the highest N (i.e., in Figure 2-9). Combining these two effects,
it is reasonable that as N increases, the gap estimate will increase with M at a slower rate.
Can the optimality gap be used to choose values of M and N ? For a fixed M , we
certainly have convergence as N increases, and the gap estimate can be a good indicator of
solution quality. However, because different values of M correspond to different objective
surfaces (due to the bias of ÛN,M ), the optimality gap is unsuitable for comparisons across
different values of M ; indeed, in our example, even though solution quality is improved with
M , the gap estimates appear looser and noisier.
Another performance metric we extract from the stochastic optimization runs is the number of iterations required to reach a solution; histograms of iteration number for RM and
SAA, for the same matrix of M and N values, are shown in Table 2.4. At low sample sizes,
many of the SAA-BFGS runs take only a few iterations, while almost all of the RM runs
terminate at the maximum allowable number of iterations (50 in this case). This difference
again reflects the efficiency of BFGS for deterministic optimization problems. As N and M
are increased, the histograms show a “transfer of mass” from higher iteration numbers to
lower iteration numbers, coinciding somewhat with the bimodal behavior described previously. The reduction in iteration number with increased sample size implies that an n-fold
increase in sample size leads to an increase in computational time that is often much less
than a factor of n. Accounting for this sublinear relationship when allocating computational
resources, especially if samples can be drawn in parallel, can lead to substantial savings.
Although SAA-BFGS generally requires fewer iterations, each iteration takes longer than
a step of RM. RM thus offers a higher “resolution” in run times, potentially giving more
freedom to the user in stopping the algorithm. RM thus becomes more attractive as the
evaluation of the objective function becomes more expensive.
As a single integrated measure of the quality of the stochastic optimization solutions, we
evaluate the following mean squared error (MSE):
MSE = (1/R) Σ_{r=1}^{R} ( Û_{1001,1001}(d^r, θ_{s′}^r, z_{s′}^r) − U^ref )²,    (2.22)
where d^r, r = 1 . . . R, are the final designs from a given optimization algorithm, and U^ref is the true optimal value of the expected information gain. Since the true optimum is unavailable in this study, U^ref is taken to be the maximum value of the objective over all runs.
Recall that the MSE combines the effects of bias and variance; here it reflects the variance
in objective values plus the difference (squared) between the mean objective value and the
true optimum, calculated via R = 1000 replicated optimization runs. Figure 2-10 relates
solution quality to computational effort by plotting the MSE against average computational
time (per run). Each symbol represents a particular value of N (the three marker types represent N = 1, 11, and 101, respectively), while the four different M values are reflected through
the average run times. These plots confirm the behavior we have previously encountered.
Solution quality generally improves (lower MSE) with increasing sample sizes, although a balanced allocation of samples must be chosen. For instance, a large N with small M can yield solutions inferior to those from a smaller N with larger M, while, for any given N, continued increases in M beyond some threshold yield minimal improvements in MSE. The best sample allocation is described by the minimum of all the curves. We highlight these "optimal fronts" in light red for RM and in light blue for SAA-BFGS. Monte Carlo error in the "high-quality" estimator $\hat{U}_{1001,1001}$ may also be reflected in the non-zero MSE asymptote for the high-N RM cases.
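For concreteness, the replicate-based MSE of Equation 2.22 takes only a few lines of code to evaluate. The following is a minimal sketch, assuming a hypothetical high-quality estimator u_hat standing in for $\hat{U}_{1001,1001}$ and an array of final designs from R replicated runs; it is not the thesis implementation.

```python
import numpy as np

def replicate_mse(u_hat, final_designs):
    """Mean squared error over replicated optimization runs (cf. Equation 2.22).

    u_hat         -- callable returning a high-quality objective estimate at a design
                     (stands in for U-hat_{1001,1001}; hypothetical interface)
    final_designs -- array of shape (R, d_dim), one final design per replicate
    """
    values = np.array([u_hat(d) for d in final_designs])
    u_ref = values.max()                   # surrogate for the unknown true optimum
    return np.mean((values - u_ref) ** 2)

# Bias-variance view of the same quantity:
#   mse = values.var() + (values.mean() - u_ref) ** 2
```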
According to Figure 2-10, RM outperforms SAA-BFGS by consistently achieving a smaller MSE for a given computational effort. One should be cautious, however, in generalizing from these numerical experiments: the advantage of RM is relatively small, and other factors such as code optimization, choices of algorithm parameters, and of course the OED problem itself can reduce or even reverse this advantage.
[Figure 2-10 here — three log-linear panels of Optimization Result MSE versus Average Time [s], with curves for N = 1, 11, and 101: (a) RM, (b) SAA-BFGS, (c) RM and SAA-BFGS "optimal fronts".]

Figure 2-10: Mean squared error, defined in Equation 2.22, versus average run time for each optimization algorithm and various choices of inner-loop and outer-loop sample sizes. The highlighted curves are "optimal fronts" for RM (light red) and SAA-BFGS (light blue).
[Table 2.1 here — a matrix of 3-D histograms over sample sizes M and N; the individual bar heights are not recoverable from the extracted text.]

Table 2.1: Histograms of final search positions resulting from 1000 independent runs of RM (top subrows) and SAA (bottom subrows) over a matrix of N and M sample sizes. For each histogram, the bottom-right and bottom-left axes represent the sensor coordinates x and y, respectively, while the vertical axis represents frequency.
[Table 2.2 here — a matrix of histograms over sample sizes M and N; the individual bar heights are not recoverable from the extracted text.]

Table 2.2: High-quality expected information gain estimates at the final sensor positions resulting from 1000 independent runs of RM (top subrows, blue) and SAA-BFGS (bottom subrows, red). For each histogram, the horizontal axis represents values of $\hat{U}_{M=1001,N=1001}$ and the vertical axis represents frequency.
[Table 2.3 here — a matrix of histograms over sample sizes M and N; the individual bar heights are not recoverable from the extracted text.]

Table 2.3: Histograms of optimality gap estimates for SAA-BFGS, over a matrix of sample sizes M and N. For each histogram, the horizontal axis represents the value of the gap estimate and the vertical axis represents frequency.
[Table 2.4 here — a matrix of histograms over sample sizes M and N; the individual bar heights are not recoverable from the extracted text.]

Table 2.4: Number of iterations in each independent run of RM (top subrows, blue) and SAA-BFGS (bottom subrows, red), over a matrix of sample sizes M and N. For each histogram, the horizontal axis represents the iteration count and the vertical axis represents frequency.
Chapter 3
Formulation for Sequential Design
Having described batch optimal experimental design (OED) in the previous chapter, we now
extend our framework to the more general setting of sequential OED (sOED) for the rest of
this thesis.
sOED allows experiments to be conducted in sequence, thus permitting newly acquired experimental observations to help guide the design of future experiments. While batch OED techniques may be applied repeatedly to sequential problems, such a procedure would not be optimal. We provide the optimal design formulation for sequential experimental design. In particular, it targets a finite number of experiments, adopts a Bayesian treatment of uncertainty, employs an information measure objective, and accommodates nonlinear models under continuous parameter, design, and observation spaces. While the sOED notation remains similar to the batch OED formulation in Chapter 2, some conflicts do arise. To avoid confusion, we provide a full and detailed formulation of the sOED problem in this chapter. Numerical techniques for solving the problem will be presented in subsequent chapters.
3.1 Problem definition
A complete formulation for optimal sequential design needs to account for all sources of uncertainty over the entire relevant time period, under a full description of the system state and its evolution dynamics. In essence, we need to establish a mathematical description of all factors that determine which designs are optimal under different situations. With this goal in mind, we first define the core formulation components, and then state the
sOED problem. At this point, the formulation remains general, and does not assume an
experimental goal of parameter inference (we will specialize later in Section 3.3).
• Experiment index: $k = 0, \dots, N-1$. The experiments are assumed to be discrete and ordered by the integer index k, for a total of N experiments; we consider a finite horizon N.
• State: $x_k = [x_{k,b}, x_{k,p}]$. The state should encompass all information about the system necessary for making optimal future experimental design decisions. Generally, this includes a belief state component $x_{k,b}$ that reflects the current state of uncertainty, and a physical state component $x_{k,p}$ that describes any deterministic decision-relevant variables. We consider continuous, and possibly unbounded, state variables. Specific choices will be discussed later.
• Design: $d_k \in \mathcal{D}_k$. The design (also known as "control", "action", or "decision" in other contexts) represents the conditions under which the experiment is to be performed. Moreover, we seek a policy (also known as a "controller" or "decision rule") $\pi \equiv \{\mu_0, \mu_1, \dots, \mu_{N-1}\}$ consisting of a set of policy functions, one for each experiment, that indicate which design to use given the current state: $\mu_k(x_k) = d_k$. We consider continuous design variables.

Design methods that produce a policy are known as sequential (closed-loop) designs because feedback of observations from experiments is necessary to determine the current state, which in turn is needed to apply the policy. This is in contrast to batch (open-loop) designs, where the designs are determined before any experiments are performed; these designs depend only on the initial state and not on subsequent designs or their observations, and hence involve no feedback. These perspectives on batch and sequential designs are illustrated in Figure 3-1.
• Observations: $y_k \in \mathcal{Y}_k$. The observations (also referred to as "noisy measurements" or "data" in this thesis) from the experiment are assumed to be the only source of uncertainty in the system, and often incorporate measurement noise and model inadequacy. Some models also have internal stochasticity as part of the system dynamics; we do not currently study these cases. We consider continuous observation variables.
• Stage reward: $g_k(x_k, y_k, d_k)$. The stage reward reflects the immediate reward associated with performing a particular experiment. This quantity could depend on the state, observations, or design. Typically, it would reflect the monetary and time costs of performing the experiment, as well as any additional benefits or penalties.
• Terminal reward: $g_N(x_N)$. The terminal reward serves as a mechanism to end the system dynamics by providing a reward value based solely on the final system state $x_N$.
• System dynamics: $x_{k+1} = F_k(x_k, y_k, d_k)$. The system dynamics (also known as the "transition function", "transfer function", or simply "the model" in other contexts) describe the evolution of the system state after performing an experiment, incorporating the design and observations of that experiment. This includes the propagation of both the belief state and the physical state. The specific dynamics depend on the choice of the state variable, and will be discussed later. (A minimal code sketch collecting these components follows this list.)
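To make the ingredients above concrete, the following is a minimal sketch of how one might collect them in code. The names (SOEDProblem, sample_obs, etc.) are illustrative assumptions, not the thesis implementation.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

State = np.ndarray        # x_k = [belief-state summary x_{k,b}, physical state x_{k,p}]
Design = np.ndarray       # d_k
Observation = np.ndarray  # y_k

@dataclass
class SOEDProblem:
    """Container for the sOED problem components (illustrative names only)."""
    N: int                                                        # number of experiments
    stage_reward: Callable[[State, Observation, Design], float]   # g_k(x_k, y_k, d_k)
    terminal_reward: Callable[[State], float]                     # g_N(x_N)
    dynamics: Callable[[State, Observation, Design], State]       # F_k(x_k, y_k, d_k)
    sample_obs: Callable[[State, Design], Observation]            # draw y_k given x_k, d_k

def simulate(problem: SOEDProblem, policy: List[Callable[[State], Design]],
             x0: State) -> float:
    """Run one closed-loop trajectory under a policy and return its total reward."""
    x, total = x0, 0.0
    for k in range(problem.N):
        d = policy[k](x)                        # d_k = mu_k(x_k)
        y = problem.sample_obs(x, d)            # observe the experiment outcome
        total += problem.stage_reward(x, y, d)  # accumulate g_k
        x = problem.dynamics(x, y, d)           # x_{k+1} = F_k(x_k, y_k, d_k)
    return total + problem.terminal_reward(x)   # add g_N(x_N)
```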
Following the same decision-theoretic approach used to develop the expected utility for batch
OED in Section 2.1, we seek to maximize the expected total reward functional (while this
quantity is the expected utility, we use the term “expected total reward” in sOED to parallel
the definitions of stage and terminal rewards):
$$U(\pi) = \mathbb{E}_{y_0, \dots, y_{N-1} | \pi}\left[ \sum_{k=0}^{N-1} g_k\left(x_k, y_k, \mu_k(x_k)\right) + g_N(x_N) \right], \qquad (3.1)$$
subject to the system dynamics $x_{k+1} = F_k(x_k, y_k, d_k)$ for all experiments $k = 0, \dots, N-1$. The optimal policy is then

$$\pi^* = \left\{\mu_0^*, \dots, \mu_{N-1}^*\right\} = \operatorname*{argmax}_{\pi = \{\mu_0, \dots, \mu_{N-1}\}} U(\pi), \qquad (3.2)$$

subject to the design space constraints $\mu_k(x_k) \in \mathcal{D}_k$, $\forall x_k$, for $k = 0, \dots, N-1$. For simplicity, we also refer to Equation 3.2 as "the sOED problem" in this thesis. As shown later in Section 3.4, the commonly used batch (open-loop) and greedy (myopic) design approaches can be viewed as restrictions of this general formulation (and are thus suboptimal design methods).
[Figure 3-1 here — (a) Batch (open-loop) design: an optimizer (controller) issues designs $d_0, d_1, \dots, d_{N-1}$ to Experiments $0, 1, \dots, N-1$, which return observations $y_0, y_1, \dots, y_{N-1}$ with no feedback. (b) Sequential (closed-loop) design: a policy (controller) $\mu_k$ maps the state $x_k$ to the design $d_k$; the experiment returns observations $y_k$, and the system dynamics $x_{k+1} = F_k(x_k, y_k, d_k)$ update the state.]
Figure 3-1: Batch design exhibits an open-loop behavior, where no feedback of information
is involved, and the observations yk from any experiment do not affect the design of any
other experiments. Sequential design exhibits a closed-loop behavior, where feedback of
information takes place, and the data yk from an experiment can be used to guide the
design of future experiments.
3.2 Dynamic programming form
The sOED problem involves the optimization of a functional of a set of policy functions.
While this type of problem is studied in the field of calculus of variations, it is challenging
to solve directly. Instead, we express the problem in an alternative form using Bellman’s
principle of optimality [13, 14]: with the argument “the tail portion of an optimal policy is
optimal for the tail subproblem”, we can break it into a set of smaller subproblems. The
resulting form is the well-known dynamic programming (DP) formulation (e.g., [22, 23]):

$$J_k(x_k) = \max_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k}\left[ g_k(x_k, y_k, d_k) + J_{k+1}\left(F_k(x_k, y_k, d_k)\right) \right], \qquad (3.3)$$
$$J_N(x_N) = g_N(x_N), \qquad (3.4)$$
for $k = 0, \dots, N-1$. The $J_k(x_k)$ functions are known as the "reward-to-go" or "value" functions (or as "cost-to-go" or "cost" functions if $g_k$ and $g_N$ are defined as costs and the overall problem is to minimize the expected total cost), and are collectively known as Bellman's equation. The optimal policy functions are now implicitly represented by the arguments of the maximization expressions: if $d_k^* = \mu_k^*(x_k)$ maximizes the right side of Equation 3.3, then the policy $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{N-1}^*\}$ is optimal. Each evaluation of the value function now involves a function optimization, which can be tackled more readily.
Solving the DP problem has its own challenges, as its recursive structure of nested maximization and expectation leads to an exponential growth in computation with respect to the horizon N. The growth is further amplified by the sizes of the state, design, and observation spaces, leading to the "curse of dimensionality". Analytic solutions are rarely available, except for specific classes of problems; most of the time, DP problems can only be solved numerically and approximately. A combination of approximation techniques and numerical methods is required to solve the sOED problem in DP form, and we describe them in detail in Chapters 4 and 5.
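As a point of reference, the backward recursion in Equations 3.3–3.4 is straightforward to write down for a toy problem with discretized state and design spaces and sampled observations. The following is a minimal sketch under those assumptions; the grids, the Monte Carlo size, and the nearest-grid-point projection are all illustrative choices, not the thesis method, which treats continuous spaces in Chapters 4 and 5.

```python
import numpy as np

def solve_dp(x_grid, d_grid, sample_obs, dynamics, g_stage, g_term, N,
             n_mc=100, rng=np.random.default_rng(0)):
    """Backward induction for Equations 3.3-3.4 on a 1-D discretized toy problem."""
    J = {N: np.array([g_term(x) for x in x_grid])}            # J_N = g_N
    policy = {}
    for k in range(N - 1, -1, -1):                            # k = N-1, ..., 0
        Jk, mu_k = np.empty(len(x_grid)), np.empty(len(x_grid))
        for i, x in enumerate(x_grid):
            best_val, best_d = -np.inf, None
            for d in d_grid:                                  # maximize over designs
                total = 0.0
                for _ in range(n_mc):                         # Monte Carlo over y_k
                    y = sample_obs(x, d, rng)
                    x_next = dynamics(x, y, d)
                    j = np.argmin(np.abs(x_grid - x_next))    # project onto the grid
                    total += g_stage(x, y, d) + J[k + 1][j]
                val = total / n_mc
                if val > best_val:
                    best_val, best_d = val, d
            Jk[i], mu_k[i] = best_val, best_d
        J[k], policy[k] = Jk, mu_k
    return J, policy
```

Even this crude sketch makes the cost structure visible: the work grows with the product of grid sizes and the horizon, which is precisely the curse of dimensionality noted above.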
3.3 Information-based Bayesian experimental design
We now refine the sOED problem under the experimental goal of inferring uncertain model parameters θ from noisy and indirect observations $y_k$. With this specialization, we choose the appropriate state variable and reward functions.
We follow the Bayesian perspective described in Section 2.1 and generalize Bayes’ rule
for the sequential setting. If one performs the kth experiment under design dk and observes
a realization of the observations yk , then the change in one’s state of knowledge about the
parameters is given by:
$$f(\theta | y_k, d_k, I_k) = \frac{f(y_k | \theta, d_k, I_k)\, f(\theta | d_k, I_k)}{f(y_k | d_k, I_k)} = \frac{f(y_k | \theta, d_k, I_k)\, f(\theta | I_k)}{f(y_k | d_k, I_k)}. \qquad (3.5)$$
Here, $I_k = \{d_0, y_0, \dots, d_{k-1}, y_{k-1}\}$ is the information vector representing the history of the previous experiments, encompassing their designs and observations. Similar to Equation 2.1, we assume that knowing the design of the current (kth) experiment without knowing its observations does not affect our current belief about the parameters (i.e., the prior for the kth experiment would not change based on what experiment we plan to do)—thus
$f(\theta | d_k, I_k) = f(\theta | I_k)$. In this Bayesian setting, a belief state that fully describes the state of uncertainty after k experiments is then the posterior. This can be any set of properties that fully describes the posterior, including the posterior random variable itself $\theta | y_k, d_k, I_k$, its density function $f(\theta | y_k, d_k, I_k)$ or distribution function $F(\theta | y_k, d_k, I_k)$, other sufficient statistics, or even simply the prior along with the entire history of designs and observations from all previous experiments. For example, when θ is a discrete random variable that can take on a finite number of possible realizations, methods from partially observable Markov decision processes (POMDPs) [154, 138] typically designate the belief state to be a finite-dimensional vector of possible θ realizations combined with their corresponding probability mass function values. Since we deal with continuous (and often unbounded) θ, an analogous perspective manifests in an infinite-dimensional belief state entity; we thus seek alternative approaches. In this chapter, for the purpose of illustration, we denote the belief state to be the posterior random variable, i.e., $x_{k,b} = \theta | I_k$. In Chapters 5 and 7, the belief state will take on different meanings depending on the choice of its numerical representation; these choices will be made clear in context.
Following the same information-theoretic approach and discussions from Section 2.1, it
is natural to set the terminal reward as the Kullback-Leibler (KL) divergence from the final
posterior after all N experiments have been performed, to the prior before any experiment
is performed:
$$g_N(x_N) = D_{\mathrm{KL}}\left( f_{x_{N,b}}(x_{N,b}) \,\big\|\, f_{x_{0,b}}(x_{0,b}) \right) = \int_{\mathcal{H}} f(x_{N,b}) \ln \frac{f(x_{N,b})}{f(x_{0,b})} \, d\theta, \qquad (3.6)$$
where $\mathcal{H}$ is the support of the prior. The stage rewards then reflect all other immediate rewards or costs related to performing particular experiments, such as monetary, time, and personnel costs, or level of difficulty and risk. When the stage rewards are zero, we arrive at an expected total reward that is analogous to the expected utility developed for batch OED (Equation 2.2):
$$U(\pi) = \mathbb{E}_{y_0, \dots, y_{N-1} | \pi}\left[ D_{\mathrm{KL}}\left( f_{\theta | d_0, y_0, \dots, d_{N-1}, y_{N-1}}(x_{N,b}) \,\big\|\, f_{\theta}(x_{0,b}) \right) \right], \qquad (3.7)$$

subject to $d_k = \mu_k(x_k)$ and $x_{k+1} = F_k(x_k, y_k, d_k)$ for $k = 0, \dots, N-1$.
Another intuitive alternative is to use incremental information gain after each experiment
is performed by setting
$$g_k(x_k, y_k, d_k) = D_{\mathrm{KL}}\left( f_{x_{k+1,b}}(x_{k+1,b}) \,\big\|\, f_{x_{k,b}}(x_{k,b}) \right) = \int_{\mathcal{H}} f(x_{k+1,b}) \ln \frac{f(x_{k+1,b})}{f(x_{k,b})} \, d\theta$$
for $k = 0, \dots, N-1$, where $x_{k+1,b}$ is the belief state component of $x_{k+1} = F_k(x_k, y_k, d_k)$. The expected total reward from this specification is not equivalent to Equation 3.7, since the reference distributions in the KL divergence terms are different. While this approach intuitively reflects, in some sense, the amount of information gained from all the experiments, one should use caution in the quantitative interpretation of its results, as it involves the addition of (and comparison between) KL divergence terms with respect to different reference distributions (they are thus quantities expressed in different units). Additionally, such a formulation needs to evaluate the KL divergence after every experiment, an often approximate and expensive process that may degrade the overall computational performance. We therefore take the approach in Equation 3.6 in this thesis.
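Since the KL divergence in Equation 3.6 must typically be evaluated numerically, a common approach is a Monte Carlo estimate using samples from the posterior together with posterior and prior density evaluations. Below is a minimal sketch for a one-dimensional Gaussian case, where the exact value is available for comparison; the Gaussian choice is purely illustrative.

```python
import numpy as np
from scipy import stats

def kl_monte_carlo(post_sample, post_logpdf, prior_logpdf):
    """Estimate D_KL(posterior || prior) = E_post[ln f_post - ln f_prior] from samples."""
    return np.mean(post_logpdf(post_sample) - prior_logpdf(post_sample))

# Illustrative check: prior N(0, 1), posterior N(1, 0.5^2).
prior = stats.norm(0.0, 1.0)
post = stats.norm(1.0, 0.5)
theta = post.rvs(size=100_000, random_state=0)
est = kl_monte_carlo(theta, post.logpdf, prior.logpdf)

# Closed form for two Gaussians: ln(s0/s1) + (s1^2 + (m1 - m0)^2) / (2 s0^2) - 1/2
exact = np.log(1.0 / 0.5) + (0.5**2 + 1.0**2) / 2.0 - 0.5
print(f"MC estimate {est:.4f} vs exact {exact:.4f}")
```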
3.4 Notable suboptimal sequential design methods
Two notable design approaches frequently encountered in the OED literature are batch (open-loop) design (described in detail in Chapter 2) and greedy (myopic) design. Compared to the full sOED formulation derived in this chapter, batch and greedy design approaches are simpler to formulate and to solve. However, when applied to a sequential design problem, both are special cases obtained by simplifying the structure of the sOED problem, and thus are suboptimal. We discuss these two designs below for emphasis, but do not employ them in developing our numerical method for solving the sOED problem. (In this study, we take an approach that preserves the original problem as much as possible, and instead rely more heavily on techniques to approximately solve the exact problem.) These suboptimal designs, though, will be used as numerical comparisons in Chapter 7.

Batch OED involves the design of all experiments concurrently as a batch, where the outcome of any experiment does not affect the design of the others. Mathematically, the policy functions $\mu_k$ for batch design do not depend on the states $x_k$, since no feedback is involved.
Equation 3.2 thus reduces to a multidimensional vector space optimization problem¹

$$\left\{d_0^*, \dots, d_{N-1}^*\right\} = \operatorname*{argmax}_{d_0, \dots, d_{N-1}} \mathbb{E}_{y_0, \dots, y_{N-1} | d_0, \dots, d_{N-1}}\left[ \sum_{k=0}^{N-1} g_k(x_k, y_k, d_k) + g_N(x_N) \right], \qquad (3.8)$$

subject to the design space constraints $d_k \in \mathcal{D}_k$, $\forall k$. More specifically, setting $g_N$ to Equation 3.6, $g_k = 0$ for $k = 0, \dots, N-1$, $d = \{d_k\}_{k=0}^{N-1}$, and $y = \{y_k\}_{k=0}^{N-1}$, we recover exactly the batch OED problem (Equations 2.2 and 2.4). Since batch OED imposes stricter conditions on the sOED problem, it yields suboptimal designs.
Greedy design is a type of sequential (closed-loop) formulation in which only the next experiment is considered, without taking into account future consequences. Mathematically, the greedy policy is described by²

$$J_k(x_k) = \max_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k}\left[ g_k(x_k, y_k, d_k) \right], \qquad (3.9)$$
$$J_N(x_N) = g_N(x_N). \qquad (3.10)$$

If $d_k^{\mathrm{gr}} = \mu_k^{\mathrm{gr}}(x_k)$ maximizes the right side of Equation 3.9 for all $k = 0, \dots, N-1$, then the policy $\pi^{\mathrm{gr}} = \{\mu_0^{\mathrm{gr}}, \mu_1^{\mathrm{gr}}, \dots, \mu_{N-1}^{\mathrm{gr}}\}$ is the greedy policy. The primary advantage of greedy design is that, by ignoring future effects, Bellman's equation becomes decoupled, and the exponential growth of computation with respect to the horizon N is avoided. It may also be a reasonable choice under circumstances where the total number of experiments is unknown. Nonetheless, since the formulation is a truncation of the DP form of the sOED problem (Equations 3.3 and 3.4), the greedy policy is also suboptimal.
¹ Batch OED generally cannot be expressed in the DP form since it does not obey Bellman's principle of optimality: the truncated optimal batch design $\{d_i^*, \dots, d_{N-1}^*\}$ is generally not the optimal batch design for the tail subproblem of designing experiments i to N − 1.
² A greedy design formulation would require an incremental information gain formulation (Equation 3.8) in order to properly reflect the value of information after each experiment is performed.
Chapter 4
Approximate Dynamic Programming for Sequential Design
The sequential optimal experimental design (sOED) problem, even expressed in the dynamic
programming (DP) form (Equations 3.3 and 3.4 from Chapter 3), almost always needs to
be solved numerically and approximately. This chapter describes the techniques we use in
finding an approximate solution to a DP problem under continuous spaces, and focuses on
the optimality aspect of the approximate solution. For the most part, these techniques are
applicable outside the sOED context as well. We then specifically discuss the representation of the belief state, and the task of performing Bayesian inference for general non-Gaussian random variables, in the next chapter.
4.1 Approximation approaches
Approximate dynamic programming (ADP) broadly refers to numerical methods for finding an approximate solution to a DP problem. Substantial research has been devoted to developing these techniques across a number of different communities, targeting different variations of the DP expression. For example, the area of stochastic control in control theory usually deals with multidimensional continuous control variables [24, 22, 23], the study of Markov decision processes in operations research typically accommodates high-dimensional discrete decision vectors [138, 137], and the branch of reinforcement learning within machine learning often handles small, finite sets of discrete actions [93, 164]. While a plethora of different terminology is used across these fields, there is often a large overlap in the fundamental spirit of their solution approaches. We thus take the perspective of grouping the various ADP techniques into the following two broad categories.
1. Problem approximation: where there is no natural way to refine the approximation, or where refinement does not lead to the solution of the original problem—these methods typically lead to suboptimal designs.
Examples: batch and greedy designs, open-loop feedback control, certainty equivalent control, Gaussian approximation of distributions.

2. Solution approximation: where there is some natural way to refine the approximation, and the effects of approximation diminish with refinement—these methods have some sense of convergence, and may be refined towards the solution of the original problem.
Examples: policy iteration, value function and Q-factor approximations, numerical optimization, Monte Carlo sampling, regression, quadrature and numerical integration, discretization and aggregation, rolling horizon.
In practice, techniques from both categories are often combined to find an approximate solution to a DP problem. In this thesis, however, we try to preserve the original problem as much as possible, and rely more heavily on solution approximation techniques to approximately solve the exact problem. In keeping with this philosophy, we build our ADP method around a backbone of one-step lookahead policy representation, and approximate value iteration via backward induction and regression construction of approximate value functions.
4.2 Policy representation
In seeking the optimal policy, we first need to be able to represent a (generally suboptimal) policy $\pi = \{\mu_0, \mu_1, \dots, \mu_{N-1}\}$. On the one hand, one may represent a policy function $\mu_k(x_k)$ directly (and approximately), for example by tabulating its values on a discretized grid of $x_k$ or using functional approximation techniques. On the other hand, one can preserve the recursive relationship in Bellman's equation and "parameterize" the policy via value functions.
We proceed with a policy representation using one step of lookahead, to retain some level
of structural property from the original DP problem while keeping the method computationally feasible. By looking ahead only one step, the recursion between the value functions
is broken, and the exponential growth of computational cost with respect to the horizon
N is reduced to linear growth.¹ This leads to the one-step lookahead policy representation (e.g., [22]):

$$\mu_k(x_k) = \operatorname*{argmax}_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k}\left[ g_k(x_k, y_k, d_k) + \tilde{J}_{k+1}\left(F_k(x_k, y_k, d_k)\right) \right] \qquad (4.1)$$
for $k = 0, \dots, N-1$, and $\tilde{J}_N(x_N) \equiv g_N(x_N)$. The policy function $\mu_k$ is therefore indirectly represented via some value function $\tilde{J}_{k+1}$, and one can view the policy $\pi$ as implicitly parameterized by the set of value functions $\tilde{J}_1, \dots, \tilde{J}_N$.² If $\tilde{J}_{k+1}(x_{k+1}) = J_{k+1}(x_{k+1})$, we recover Bellman's equation (Equation 3.3) and $\mu_k = \mu_k^*$; we therefore would like to find $\tilde{J}_{k+1}$'s that are in some sense close to the $J_{k+1}$'s.

¹ Multi-step lookahead is possible in theory, but impractical, as the amount of online computation would be tremendous under continuous spaces.
² A similar method is the use of Q-factors [175, 176]: $\mu_k(x_k) = \operatorname{argmax}_{d_k \in \mathcal{D}_k} \tilde{Q}_k(x_k, d_k)$, where the Q-factor corresponding to the optimal policy is $Q_k(x_k, d_k) \equiv \mathbb{E}_{y_k | x_k, d_k}\left[ g_k(x_k, y_k, d_k) + J_{k+1}(F_k(x_k, y_k, d_k)) \right]$. The functions $\tilde{Q}_k(x_k, d_k)$ have a higher input dimension than $\tilde{J}_k(x_k)$, but once they are available, the corresponding policy can be evaluated without the system dynamics $F_k$; this is thus known as a "model-free" method. Q-learning via value iteration is a prominent method in reinforcement learning.
Before we describe how to construct good value function approximations in the next
section, we first describe how to numerically represent these approximations. We employ a
simple parametric linear architecture function approximator:
$$\tilde{J}_k(x_k) = r_k^\top \phi_k(x_k) = \sum_{i=1}^{m} r_{k,i}\, \phi_{k,i}(x_k), \qquad (4.2)$$
where $r_{k,i}$ is the coefficient (weight) corresponding to the ith feature (basis function) $\phi_{k,i}(x_k)$. While more sophisticated nonlinear, or even nonparametric, approximators are possible (e.g., k-nearest-neighbor [78], kernel regression [129], neural networks [24]), a linear approximator is easy to use and intuitive to understand [103], and is often required for algorithm analysis and convergence results [23]. It follows that the construction of $\tilde{J}_k(x_k)$ involves the selection of features and the training of the coefficients.
The choice of features is an important but difficult task. A concise set of features that captures the function values at the data points can substantially improve the accuracy and efficiency of the function approximators, and in turn, of the overall algorithm. Identifying helpful features, however, is non-trivial. Substantial research has been
dedicated to developing systematic procedures for both extracting and selecting features in the machine learning and statistics communities [83, 107], but in practice, finding good features often relies on experience, trial-and-error, and expert knowledge of the particular problem at hand. We acknowledge the difficulty of this process, but do not pursue a detailed discussion of general and systematic feature construction. Instead, we take a reasonable heuristic step in the sOED context, and choose features that are based on the mean
and log-variance of the belief state, along with the physical state component. The main motivation for this choice stems from the KL divergence term in the terminal reward, which is chiefly responsible for reflecting information gain. While the belief states are generally not Gaussian, the analytic formula for the KL divergence between two Gaussian random variables, which includes their mean and log-variance terms, provides a starting point for promising features. We will specify the feature choices in more detail in Chapter 7. For the present purpose of developing our ADP method in this chapter, we assume the features are set. We now focus on developing efficient procedures for training the coefficients.
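As a concrete illustration of this feature choice, the following is a minimal sketch of a feature map built from belief-state samples. The helper name and the exact feature set are illustrative assumptions; the actual choices are specified in Chapter 7.

```python
import numpy as np

def features(belief_samples, x_phys):
    """Illustrative feature vector phi(x_k) for the linear architecture in Equation 4.2.

    belief_samples -- 1-D array of samples representing the belief state x_{k,b}
    x_phys         -- scalar physical state component x_{k,p}
    """
    mu = belief_samples.mean()              # belief mean
    lv = np.log(belief_samples.var())       # belief log-variance
    # Constant, linear, and quadratic terms in (mean, log-variance), plus the
    # physical state; these mimic terms in the Gaussian KL divergence formula.
    return np.array([1.0, mu, lv, mu**2, lv**2, mu * lv, x_phys])

# The value approximation is then a dot product with the coefficient vector r_k:
#   value = r_k @ features(samples, x_phys)
```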
4.3 Policy construction via approximate value iteration

4.3.1 Backward induction and regression
Our goal is to find policy parameterizations (value function approximations) $\tilde{J}_k$ that are close to the value functions $J_k$ of the optimal policy, which satisfy Equation 3.3. We take a direct approach, and would like to solve the following ideal regression problem, which minimizes the least squares error of the approximation under the optimal-policy-induced state measure (also known as the D-norm in other works; its density function is denoted by $f_{\pi^*}(x_1, \dots, x_{N-1})$):

$$\min_{r_k, \forall k} \int_{\mathcal{X}_1, \dots, \mathcal{X}_{N-1}} \left[ \sum_{k=1}^{N-1} \left( J_k(x_k) - r_k^\top \phi_k(x_k) \right)^2 \right] f_{\pi^*}(x_1, \dots, x_{N-1}) \, dx_1 \cdots dx_{N-1}, \qquad (4.3)$$
where $\mathcal{X}_k$ is the support of $x_k$, and we impose the linear architecture $\tilde{J}_k(x_k) = r_k^\top \phi_k(x_k)$. The distribution of regression points reflects where emphasis is placed for the approximation to be more accurate. Intuitively, we would ultimately like more accurate approximations in regions of states that are more likely or frequently visited under the optimal policy. More precisely, we would like to use the state measure induced together
by the optimal policy and by the associated numerical methods. For example, the choice of stochastic optimization algorithm, as well as its settings (discussed in Section 2.2), affects which intermediate states are visited more frequently during the optimization process. The accuracy at the intermediate states can be crucial, since they can potentially mislead the optimization algorithm to arrive at completely different designs, and in turn change the regression and policy evaluation outcomes. It would then be prudent to include the states visited during the optimization procedure as regression points as well. In Chapter 7, we will demonstrate the importance of including these states through illustrative numerical examples. For the rest of this thesis, we use "policy-induced state measure" to include the effects of the associated numerical methods as well. As we do not have the optimal policy, $J_k(x_k)$, or $f_{\pi^*}(x_1, \dots, x_{N-1})$, we must solve Equation 4.3 approximately.
To sidestep the need for the optimal value functions $J_k(x_k)$ in the ideal regression problem (Equation 4.3), we construct the approximate functions by approximate value iteration, specifically using backward induction and regression. The resulting $\tilde{J}_k$'s will then be used as the parameterization of the one-step lookahead policy in Equation 4.1. Starting with $\tilde{J}_N(x_N) \equiv g_N(x_N)$, we proceed backwards from $k = N-1$ to $k = 1$ and form

$$\tilde{J}_k(x_k) = r_k^\top \phi_k(x_k) = \Pi\left[ \max_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k}\left[ g_k(x_k, y_k, d_k) + \tilde{J}_{k+1}\left(F_k(x_k, y_k, d_k)\right) \right] \right] = \Pi\, \hat{J}_k(x_k), \qquad (4.4)$$
where $\Pi$ is an approximation operator, for example regression. This leads to a set of ideal stage-k regression problems

$$\min_{r_k} \int_{\mathcal{X}_k} \left( \hat{J}_k(x_k) - r_k^\top \phi_k(x_k) \right)^2 f_{\pi^*}(x_k) \, dx_k, \qquad (4.5)$$

with $\hat{J}_k(x_k) \equiv \max_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k}\left[ g_k(x_k, y_k, d_k) + \tilde{J}_{k+1}(F_k(x_k, y_k, d_k)) \right]$, $f_{\pi^*}(x_k)$ being the marginal of $f_{\pi^*}(x_1, \dots, x_{N-1})$, and $\tilde{J}_k(x_k) = r_k^\top \phi_k(x_k)$. While we no longer need the optimal value functions $J_k(x_k)$ in constructing $\tilde{J}_k(x_k)$, we remain unable to select regression points according to $f_{\pi^*}(x_k)$; we discuss this issue in the next subsection. Furthermore, since $\tilde{J}_k(x_k)$ is built from $\tilde{J}_{k+1}(x_{k+1})$ through the backward induction process, the effects of numerical approximation error accumulate, potentially at an exponential rate [168]. The accuracy of all $\tilde{J}_k(x_k)$ approximations (i.e., for all k) is thus extremely important.
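The backward induction with regression in Equations 4.4–4.5 can be sketched compactly. The following is a minimal illustration assuming helper routines for the feature map, the stochastic-optimization inner maximization, and the regression points; all names are illustrative assumptions, with the actual implementation choices appearing in Chapter 7.

```python
import numpy as np

def approximate_value_iteration(N, regression_states, q_hat, phi, g_term):
    """Backward induction with linear regression (cf. Equations 4.4-4.5).

    regression_states   -- dict: k -> list of states x_k used as regression points
    q_hat(k, x, J_next) -- stochastic-optimization estimate of
                           max_d E[g_k + J_next(F_k(x, y, d))]  (assumed supplied)
    phi(x)              -- feature vector, as in Equation 4.2
    g_term(x)           -- terminal reward g_N
    """
    r = {}
    J_next = g_term                        # start from J-tilde_N = g_N

    def make_value_function(rk):
        return lambda x: rk @ phi(x)

    for k in range(N - 1, 0, -1):          # k = N-1, ..., 1
        X = regression_states[k]
        A = np.array([phi(x) for x in X])               # regression matrix
        b = np.array([q_hat(k, x, J_next) for x in X])  # targets J-hat_k(x_k)
        r[k], *_ = np.linalg.lstsq(A, b, rcond=None)    # fit r_k by least squares
        J_next = make_value_function(r[k])              # J-tilde_k, used at stage k-1
    return r
```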
4.3.2 Exploration and exploitation
Although we cannot generate regression points distributed exactly according to the optimal-policy-induced state measure, it is possible to generate them according to a given (suboptimal) policy. This includes heuristic policies, as well as the current approximation to the optimal policy in the algorithm (we shall refer to the latter as the "current policy" throughout this section). In particular, we generate regression points using a combination of exploration and exploitation. Exploration is conducted by randomly selecting designs (i.e., a random heuristic policy). For example, if the feasible design space is bounded, this can be done by uniform sampling; when it is unbounded, however, a designated exploration design measure needs to be prescribed, often selected from experience and understanding of the problem. The purpose of exploration is to allow a positive probability of probing regions that can potentially lead to better rewards than those reached by the current policy. Exploitation is conducted by applying the current policy, in this case by exercising the one-step lookahead policy using the parameterizing value functions $\tilde{J}_k$. The purpose of exploitation is to take advantage of the current understanding of a good policy. When states visited by exploitation are used as regression points, they increase the weight placed on accuracy in regions of states that would be reached and visited frequently via this policy. In practice, a balance of exploration and exploitation is used to achieve good results, but an infusion of exploration (or other heuristics) generally invalidates theoretical algorithm analysis and convergence results [23, 137]. In our algorithm, the states visited along both exploration and exploitation trajectories are used as regression points for the least squares problem in Equation 4.5.
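A minimal sketch of this state-generation step might look as follows; the mixing probability eps and the helper names are illustrative assumptions rather than the thesis settings.

```python
import numpy as np

def generate_regression_states(N, x0, policy, explore_design, simulate_step,
                               n_traj=100, eps=0.3, rng=np.random.default_rng(0)):
    """Collect states from a mixture of exploration and exploitation trajectories.

    policy(k, x)           -- current one-step lookahead policy (exploitation)
    explore_design(k)      -- random design drawn from the exploration measure
    simulate_step(k, x, d) -- sample y_k and return x_{k+1} = F_k(x_k, y_k, d_k)
    """
    states = {k: [] for k in range(1, N)}
    for _ in range(n_traj):
        x = x0
        for k in range(N - 1):
            # With probability eps explore (random design); otherwise exploit.
            d = explore_design(k) if rng.random() < eps else policy(k, x)
            x = simulate_step(k, x, d)
            states[k + 1].append(x)
    return states
```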
4.3.3 Iterative update of state measure and policy approximation
A dilemma that emerges from generating regression samples via exploitation is a "chicken or the egg" problem: exploitation requires the availability of a current policy from the algorithm, and the construction of such a policy (one that is not a heuristic policy) requires regression samples. We address this issue by introducing an iterative approach to update the state measure used for generating the regression samples. At a high level, we achieve this by alternating between generating regression points from exploitation of the current policy using Equation 4.1, and constructing an approximate optimal policy by solving the regression problem of Equation 4.5.
Here is a concrete description of the procedure. The algorithm starts with only an exploration heuristic, denoted by $\pi^{\mathrm{explore}}$. States from exploration trajectories generated from $\pi^{\mathrm{explore}}$ are then used as regression points to approximately solve Equation 4.5, producing $\tilde{J}_k^1$'s that parameterize $\pi^1$. $\pi^1$ is then used to generate exploitation trajectories via Equation 4.1, and together with a mixture of exploration states from $\pi^{\mathrm{explore}}$, the overall set of states is used as regression points to again solve Equation 4.5, giving us $\tilde{J}_k^2$'s that parameterize $\pi^2$. The process is repeated, and one would expect the regression points to distribute closer to the optimal-policy-induced state measure. Additionally, the biggest change is expected to occur when the first exploitation policy $\pi^1$ becomes available, with smaller changes in subsequent iterations. A rigorous proof of convergence of this iterative procedure is difficult with the infusion of exploration and the generally unpredictable state measure induced by the numerical methods and settings; we therefore begin with numerical investigations of this procedure in this thesis, and will develop formal proofs in the future.
Combining the stage-k regression problems from all stages (Equation 4.5), the overall regression problem being solved approximates the ideal regression problem of Equation 4.3:

$$\min_{r_k, \forall k} \int_{\mathcal{X}_1, \dots, \mathcal{X}_{N-1}} \left[ \sum_{k=1}^{N-1} \left( \hat{J}_k^{\ell+1}(x_k) - r_k^\top \phi_k(x_k) \right)^2 \right] f_{\pi^{\mathrm{explore}} + \pi^\ell}(x_1, \dots, x_{N-1}) \, dx_1 \cdots dx_{N-1}, \qquad (4.6)$$

where $f_{\pi^{\mathrm{explore}} + \pi^\ell}(x_1, \dots, x_{N-1})$ is the joint density corresponding to the mixture of exploration and exploitation from the $\ell$th iteration, and the approximation $r_k^\top \phi_k(x_k)$ at iteration $\ell$ is denoted $\tilde{J}_k^\ell(x_k)$. Note that $f_{\pi^{\mathrm{explore}} + \pi^\ell}(x_1, \dots, x_{N-1})$ lags one iteration behind $\hat{J}_k^{\ell+1}(x_k)$, since we need to have constructed the policy before we can sample trajectories from it.
Simulating exploitation trajectories, applying policies, and evaluating the regression system all involve finding the maximum of an expected value over a continuous design space (Equations 4.1 and 4.4). While the expected value generally cannot be found analytically, a robust and natural approximation may be obtained via a Monte Carlo estimate. As a result, the optimization objective is effectively noisy. Following the developments of Section 2.2, we employ the Robbins-Monro (or Kiefer-Wolfowitz, when the gradient is not analytically available) stochastic approximation algorithm for this stochastic optimization.
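For reference, a Robbins-Monro iteration for maximizing a noisy objective has a very simple skeleton. Below is a minimal sketch assuming a stochastic gradient oracle grad_estimate(d), e.g., a Monte Carlo gradient of the lookahead objective; the gain schedule a/(b + j) is one standard choice satisfying the usual Robbins-Monro conditions, not necessarily the thesis settings.

```python
import numpy as np

def robbins_monro(grad_estimate, d0, n_iter=50, a=1.0, b=10.0):
    """Stochastic approximation ascent: d_{j+1} = d_j + gamma_j * (noisy gradient).

    The gains gamma_j = a / (b + j) satisfy sum(gamma_j) = inf and
    sum(gamma_j^2) < inf, the classical Robbins-Monro conditions.
    """
    d = np.asarray(d0, dtype=float)
    for j in range(n_iter):
        gamma = a / (b + j)
        d = d + gamma * grad_estimate(d)   # ascend the (noisy) objective
    return d
```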
4.4 Connection to the rollout algorithm (policy iteration)
While the one-step lookahead policy representation described in Equation 4.1 has a similar form to the one-step lookahead rollout algorithm [167, 22],³ our implementation of approximate value iteration is different from rollout. A typical rollout algorithm involves three main steps:

1. policy initialization: choose a base (heuristic) policy;
2. policy evaluation: compute the corresponding value functions of this policy; and
3. policy improvement: apply these value functions in the one-step lookahead formula (Equation 4.1) to obtain a new policy that is guaranteed to be no worse than the previous policy [22].

Policy iteration simply repeats steps 2 and 3, and is more frequently used in infinite-horizon settings. Our approach differs from rollout in that the $\tilde{J}_k(x_k)$ are not necessarily value functions corresponding to any policy. Instead, we perform backward induction to construct $\tilde{J}_k(x_k)$ that directly approximate the value functions of the optimal policy (the key property of an approximate value iteration method). One-step lookahead is simply the means of applying the policy parameterized by the $\tilde{J}_k$'s.
A rollout implementation would include the construction of base policy value functions $J_{\pi^{\mathrm{base}},k}(x_k)$. These can either be approximated pointwise as needed, in an online manner, using Monte Carlo simulation of trajectories from $x_k$, or approximated offline using function approximation techniques such as Equation 4.2 combined with regression. The former involves fewer sources of approximation, but is computationally expensive. The latter is similar in spirit to the procedure introduced in the previous section, and is furthermore computationally cheaper: base policies are typically described in forms much simpler than the one-step lookahead representation, and producing values of $d_k$ from them normally does not require the maximization operation. Nonetheless, we perform the full backward induction in Equation 4.4, as the additional maximization is affordable under our current computational setup, and its inclusion can offer an advantage in leading to an overall better policy. This can be seen from the fact that the value function approximations produced from backward induction would generally be a better starting point heading into one-step lookahead, compared to the value functions of a (perhaps arbitrarily chosen) base policy. An interesting future investigation would be to compare the computational performance of the direct backward induction approach against multiple iterations of rollout (i.e., approximate policy iteration).

³ Multi-step lookahead rollout algorithms are also possible. As in the discussion in Section 4.2, the tremendous amount of online computation under continuous spaces makes them impractical.
4.5 Connection to POMDP
A partially observable Markov decision process (POMDP) is a generalization of the Markov decision process (MDP) in which the underlying state cannot be directly observed [154, 138]; as such, a probability distribution over the state is maintained. (In POMDP vernacular, the "partially observed state" is simply the parameter θ in our optimal experimental design (OED) terminology; we use the two interchangeably in this section.) While a general continuous version of the POMDP framework can be used to describe the sOED problem introduced in Chapter 3, traditional MDP and POMDP research largely focuses on problems with discrete (often finitely discrete) spaces and random variables. Nonetheless, we might expect existing POMDP algorithms to be suitable and insightful for handling discretized versions of the sOED problem.
There are two major limitations of the majority of state-of-the-art POMDP algorithms when applied to the sOED problem. First, these algorithms are often designed to handle only a handful of possible values of the design and state variables, while even a simple discretization of the type of sOED problems considered in this thesis would lead to an extremely large number of discretized values. Second, most POMDP algorithms are based on, and exploit, the property that the problem's cost functions are piecewise linear and convex (when minimizing) with respect to the belief state [159, 157] (examples of such algorithms include the witness algorithm [92], point-based value iteration [134], SARSOP [101], and Perseus [136]). (In problems with discrete partially observed states, the belief state is simply the vector of probability mass function values, which is itself a full and finite-dimensional representation of the uncertainty. The piecewise linear and convex property then arises naturally for many problems in the field of operations research, where a specific value of cost or reward is usually assigned to each possible realization of the partially observed state. The expected cost or reward then becomes a linear combination of these values weighted by
the probability masses.) However, we show that these algorithms would not be suitable for solving even the discretized version of a one-experiment OED problem that employs an information measure objective (i.e., information-based Bayesian experimental design). This is because such an objective (at least a practical one) necessarily leads to value functions that are not linear. By an induction argument, the value functions in a multi-experiment sOED problem would also generally not have the piecewise linear and convex (concave) property.

We demonstrate the second limitation in a one-experiment setting with an n-state (finitely) discrete random variable θ (recall that θ is the partially observed state in POMDP vernacular). We start with a rigorous definition of a measure of information in experimental design.
Definition 4.5.1. (Ginebra [75], with notation adaptations) A measure of the information about θ in an experiment d assigns a value U(d) such that

1. U(d) is a real number,
2. $U(d_{\mathrm{tni}}) = 0$, and
3. whenever $d_A$ and $d_B$ are such that $d_A$ is "sufficient for" $d_B$, then $U(d_A) \geq U(d_B)$.
The notation U(d) corresponds to the expected utility from the batch OED problem in Chapter 2, or the expected total reward (but with a fixed prior and N = 1) from the sOED problem in Chapter 3; in this one-experiment setting, it is also the sole value function. $d_{\mathrm{tni}}$ is the "totally non-informative" experiment, where one cannot learn about θ by observing the outcomes of $d_{\mathrm{tni}}$; in the Bayesian setting, this is when the posterior remains the same as the prior. In contrast, a "totally informative" experiment $d_{\mathrm{ti}}$ is one where for every pair $(\theta_i, \theta_j)$, $\theta_i \neq \theta_j$ (where $\theta_i$ is the ith realization of all possible values θ can take on), the intersection of the support sets of their likelihoods, $\mathcal{Y}_i = \mathrm{supp}(f(y|\theta_i, d_{\mathrm{ti}}))$ and $\mathcal{Y}_j = \mathrm{supp}(f(y|\theta_j, d_{\mathrm{ti}}))$, is empty; the likelihoods thus form a family of mutually singular distributions. After performing $d_{\mathrm{ti}}$, the value of θ can be determined with certainty, hence the total informativeness. As a consequence of the requirements in Definition 4.5.1, $0 = U(d_{\mathrm{tni}}) \leq U(d) \leq U(d_{\mathrm{ti}})$. The third requirement in Definition 4.5.1 requires the definition of "sufficient for".
Definition 4.5.2. (Originally Blackwell [25, 26], then Ginebra [75], with notation adaptations) Experiment $d_A$ is said to be "sufficient for" $d_B$ if there exists a stochastic transformation of $y|d_A$ to a random variable $w(y|d_A)$ such that $w(y|d_A)$ and $y|d_B$ have identical distributions under each θ.
The following proposition based on Definition 4.5.2 was first proven by Blackwell [25, 26],
Sherman [153], and Stein [162], and then generalized and stated in the Bayesian setting by
Ginebra [75].
Proposition 4.5.1. (Ginebra [75], with notation adaptations) Experiment $d_A$ is "sufficient for" $d_B$ if and only if, for a given strictly positive prior distribution p(θ),

$$\mathbb{E}_{y|d_A}\left[\phi\left(p(\theta|y, d_A)\right)\right] \geq \mathbb{E}_{y|d_B}\left[\phi\left(p(\theta|y, d_B)\right)\right] \qquad (4.7)$$

for every convex function φ(·) on the simplex of $\mathbb{R}^n$, where $p(\theta|y, d_A)$ and $p(\theta|y, d_B)$ are the posterior distributions under the same prior p(θ).
We use the notation p(·) to denote probability mass functions of discrete random variables. We now propose the following.
Theorem 4.5.1. When a measure of the information U about an n-state random variable θ in a single experiment d has linear form with respect to the posterior probability mass function,

$$U(d) = \mathbb{E}_{y|d}\left[ \sum_{i=1}^{n} \alpha_i\, p(\theta_i | y, d) \right], \qquad (4.8)$$

the measure of information is constant (zero) for all experiments and is therefore not useful. Here $\alpha_i \in \mathbb{R}$ and $p(\theta_i | y, d)$ is the posterior probability mass at $\theta = \theta_i$.
Proof. We first establish conditions under which Equation 4.8 is a valid measure of information, i.e., satisfies the requirements of Definition 4.5.1. Requirement 1 is satisfied by Equation 4.8 by definition. To meet requirement 2, the coefficients must satisfy

$$U(d_{\mathrm{tni}}) = \mathbb{E}_{y|d}\left[ \sum_{i=1}^{n} \alpha_i\, p(\theta_i | d_{\mathrm{tni}}, y) \right] = \mathbb{E}_{y|d}\left[ \sum_{i=1}^{n} \alpha_i\, p(\theta_i) \right] = \sum_{i=1}^{n} \alpha_i\, p(\theta_i) = 0, \qquad (4.9)$$
where the posterior remains unchanged from the prior by the definition of $d_{\mathrm{tni}}$. To meet requirement 3, we require that whenever Proposition 4.5.1 is satisfied, then $U(d_A) \geq U(d_B)$. This is satisfied by Equation 4.8, since it is a specialization in which φ is linear and hence convex—thus, if Proposition 4.5.1 is satisfied, then our choice of U(d) indeed satisfies $U(d_A) \geq U(d_B)$ by construction. Under these conditions, Equation 4.8 is a valid measure of information.
We now show that U(d) = 0 for all d, and thus that it is not a practically useful measure of information. Consider the totally informative experiment, i.e., an experiment (regardless of whether it can be physically achieved in practice) that can deterministically pinpoint the value of θ; in other words, the posterior is a Kronecker delta function. The totally informative experiment thus provides the theoretically highest achievable U. Its information value is
$$\begin{aligned}
U(d_{\mathrm{ti}}) &= \mathbb{E}_{y|d}\left[ \sum_{i=1}^{n} \alpha_i\, p(\theta_i | d_{\mathrm{ti}}, y) \right] \\
&= \int_{\mathcal{Y}} \sum_{i=1}^{n} \alpha_i\, p(\theta_i | d_{\mathrm{ti}}, y)\, f(y | d_{\mathrm{ti}}) \, dy \\
&= \sum_{j=1}^{n} \int_{\mathcal{Y}_j} \sum_{i=1}^{n} \alpha_i\, p(\theta_i | d_{\mathrm{ti}}, y)\, f(y | d_{\mathrm{ti}}) \, dy \\
&= \sum_{j=1}^{n} \int_{\mathcal{Y}_j} \alpha_j\, f(y | d_{\mathrm{ti}}) \, dy \\
&= \sum_{j=1}^{n} \alpha_j \int_{\mathcal{Y}_j} \sum_{m=1}^{n} f(y | \theta_m, d_{\mathrm{ti}})\, p(\theta_m) \, dy \\
&= \sum_{j=1}^{n} \alpha_j \int_{\mathcal{Y}_j} f(y | \theta_j, d_{\mathrm{ti}})\, p(\theta_j) \, dy \\
&= \sum_{j=1}^{n} \alpha_j\, p(\theta_j) \int_{\mathcal{Y}_j} f(y | \theta_j, d_{\mathrm{ti}}) \, dy \\
&= \sum_{j=1}^{n} \alpha_j\, p(\theta_j) = 0. \qquad (4.10)
\end{aligned}$$
The second equality is from the definition of expectation (here we assume the observation space $\mathcal{Y}$ is continuous, but the same result applies in discrete cases). The third equality breaks the integral over $\mathcal{Y}$ into the disjoint sets $\mathcal{Y}_j = \mathrm{supp}(f(y|\theta_j, d_{\mathrm{ti}}))$, $j = 1, \dots, n$. The fourth equality is due to $p(\theta_i | d_{\mathrm{ti}}, y) = \delta_{i,j}$ for all $y \in \mathcal{Y}_j$ (any $y \in \mathcal{Y}_j$ leads to a delta posterior at $\theta_j$). The fifth and sixth equalities apply the definition of conditional probability and use the fact that for all $y \in \mathcal{Y}_j$, $f(y | \theta_m, d_{\mathrm{ti}}) = 0$ for all $\theta_m \neq \theta_j$ (due to the disjoint likelihood functions under $d_{\mathrm{ti}}$). The next line simply rearranges. The eighth equality again uses the property of disjoint likelihood functions. The last equality is due to Equation 4.9.
Both the totally non-informative and the totally informative experiments yield information values of zero under the linear form of information measure in Equation 4.8. Since $0 = U(d_{\mathrm{tni}}) \leq U(d) \leq U(d_{\mathrm{ti}}) = 0$, we have U(d) = 0 for all d. Hence, the linear form of information measure is not a practically useful measure of information.

As a result, a practically useful measure of information necessarily has a nonlinear form with respect to the belief state in problems with discrete parameters.
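The conclusion of Theorem 4.5.1 is easy to verify numerically for a small example. The following is an illustrative sketch (not from the thesis) for n = 2: it chooses coefficients satisfying Equation 4.9 and checks that the linear information measure evaluates to (numerically) zero for an arbitrary discrete-observation experiment, whereas a nonlinear measure such as the expected KL divergence does not vanish.

```python
import numpy as np

# Two-state parameter theta with prior p(theta), and a discrete observation y with
# likelihood matrix L[i, j] = p(y_j | theta_i) for an arbitrary experiment.
prior = np.array([0.3, 0.7])
L = np.array([[0.8, 0.2],
              [0.4, 0.6]])

evidence = prior @ L                          # p(y_j)
posterior = (L * prior[:, None]) / evidence   # p(theta_i | y_j)

# Coefficients satisfying Equation 4.9: sum_i alpha_i p(theta_i) = 0.
alpha = np.array([1.0, -prior[0] / prior[1]])

# Linear measure (Equation 4.8): zero for any likelihood matrix, as the theorem
# states, since E_y[p(theta_i | y, d)] = p(theta_i).
u_linear = evidence @ (alpha @ posterior)

# A nonlinear measure (expected KL divergence from posterior to prior) is positive.
u_kl = evidence @ np.sum(posterior * np.log(posterior / prior[:, None]), axis=0)

print(f"linear measure: {u_linear:.2e}, expected KL: {u_kl:.4f}")
```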
Chapter 5
Transport Maps for Sequential Design
Another important challenge of the sequential optimal experimental design (sOED) problem lies in representing the belief state $x_{k,b}$ and performing Bayesian inference (the part of the system dynamics $F_k$ that propagates the belief state). In particular, we seek to accommodate nonlinear forward models involving non-Gaussian continuous random variables. Following the discussion in Chapter 3, the Bayesian perspective suggests that a belief state comprehensively describing the uncertain environment is simply the posterior. However, representing the posterior of a general continuous random variable in a finite-dimensional manner is difficult. We propose to represent the belief state using transport maps. As we demonstrate in this chapter, transport maps are especially attractive compared to traditional alternatives, in that they can be constructed directly from samples without requiring model knowledge, and the optimization problem arising in the construction process is dimensionally separable and convex. Furthermore, by constructing joint maps, they enable Bayesian inference to be performed very quickly, albeit approximately, by conditioning on different realizations of the design and observations.
We start the chapter with some general background on transport maps in Section 5.1, demonstrate how they can be used for Bayesian inference in Section 5.2, and provide the details of how to construct maps from samples in Section 5.3. The connection between the quality of joint and conditional maps is discussed in Section 5.4; this connection is important for justifying the construction of accurate posterior maps. Finally, the particular implementation and use of maps in the sOED problem is presented in Section 5.5. We note that Sections 5.1 and 5.3 contain material drawn heavily from the work of Parno and Marzouk [131, 132].
5.1 Background
Consider two Borel probability measures on $\mathbb{R}^n$, $\mu_z$ and $\mu_\xi$. We will refer to these as the target and reference measures, respectively, and associate them with random variables $z \sim \mu_z$ and $\xi \sim \mu_\xi$. A transport map $T: \mathbb{R}^n \to \mathbb{R}^n$ is a deterministic transformation that pushes forward $\mu_z$ to $\mu_\xi$, yielding

$$\mu_\xi = T_\sharp \mu_z. \qquad (5.1)$$

In other words, $\mu_\xi(A) = \mu_z\left(T^{-1}(A)\right)$ for any Borel set $A \subseteq \mathbb{R}^n$. In terms of the random variables, we write $\xi \stackrel{\mathrm{i.d.}}{=} T(z)$, where $\stackrel{\mathrm{i.d.}}{=}$ denotes equality in distribution. The transport map is equivalently a deterministic coupling of probability measures [171]. For example, Figure 5-1 illustrates a log-normal random variable z mapped to a standard Gaussian random variable ξ via $\xi \stackrel{\mathrm{i.d.}}{=} T(z) = \ln(z)$.
[Figure 5-1 here — densities f(z) and f(ξ) connected by the map T(z).]

Figure 5-1: A log-normal random variable z can be mapped to a standard Gaussian random variable ξ via $\xi \stackrel{\mathrm{i.d.}}{=} T(z) = \ln(z)$.
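This example is easy to check by simulation. The following is a minimal sketch (illustrative, not thesis code) that pushes log-normal samples through T(z) = ln(z) and compares the first two moments of the result against the standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# z ~ log-normal(0, 1), i.e., ln(z) ~ N(0, 1).
z = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

# Push z through the transport map T(z) = ln(z).
xi = np.log(z)

# The pushed-forward samples should be (approximately) standard Gaussian.
print(f"mean {xi.mean():.3f} (expect 0), std {xi.std():.3f} (expect 1)")
```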
Of course, there can be infinitely many transport maps between two probability measures. On the other hand, it is possible that no transport map exists: consider the case where $\mu_z$ has an atom but $\mu_\xi$ does not. If a transport map exists, one way of regularizing the problem and finding a unique map is to introduce a cost function c(z, ξ) on $\mathbb{R}^n \times \mathbb{R}^n$ that represents the work needed to move one unit of mass from z to ξ. Using this cost function, the total cost of pushing $\mu_z$ to $\mu_\xi$ is

$$C(T) = \int_{\mathbb{R}^n} c\left(z, T(z)\right) d\mu_z(z). \qquad (5.2)$$
Minimization of this cost subject to the constraint $\mu_\xi = T_\sharp \mu_z$ is called the Monge problem [116]. A transport map satisfying the measure constraint in Equation 5.1 and minimizing the cost in Equation 5.2 is an optimal transport map. The celebrated result of [35], later generalized by [115], shows that this map exists, is unique, and is monotone $\mu_z$-a.e. when $\mu_z$ is atomless and the cost function c(z, ξ) is quadratic. Generalizations of this result to other cost functions and spaces have been established in [44, 5, 68, 20].
The choice of cost function in Equation 5.2 naturally influences the structure of the map. For illustration, consider the Gaussian case of $z \sim \mathcal{N}(0, I)$ and $\xi \sim \mathcal{N}(0, \Sigma)$ for some positive definite covariance matrix Σ. The associated transport map is linear: $\xi \stackrel{\mathrm{i.d.}}{=} T(z) = Sz$, where the matrix S is any square root of Σ. When the transport cost is quadratic, $c(z, \xi) = |z - \xi|^2$, S is the symmetric square root obtained from the eigendecomposition of Σ: $\Sigma = V \Lambda V^\top$ and $S = V \Lambda^{\frac{1}{2}} V^\top$ [128]. If the cost is instead taken to be the weighted quadratic

$$c(z, \xi) = \sum_{i=1}^{n} t^{i-1} |z_i - \xi_i|^2, \quad t > 0, \qquad (5.3)$$

then as $t \to 0$, the optimal map becomes lower triangular and equal to the Cholesky factor of Σ. Generalizing to non-Gaussian $\mu_z$ and $\mu_\xi$, the optimal maps $T_t$ obtained with the cost function in Equation 5.3 are shown by [40] and [27] to converge, as $t \to 0$, to the Knothe-Rosenblatt (KR) rearrangement [143, 98] between probability measures. The KR map exists and is uniquely defined if $\mu_z$ is absolutely continuous with respect to Lebesgue measure. It is defined by, and typically constructed via, an iterative procedure that involves evaluating and inverting a series of marginalized conditional cumulative distribution functions. As a result, it inherits several useful properties: the Jacobian matrix of T is lower triangular and has positive diagonal entries $\mu_z$-a.e. (i.e., the map is monotone). Because of this triangular structure, the Jacobian determinant and the inverse of the map are easy to evaluate. This is an important computational advantage that we exploit.

We will employ KR maps (lower triangular and monotone), but without directly appealing to the transport cost in Equation 5.3. While this cost is meaningful for theoretical analysis and even numerical continuation schemes [40], we find that for small t, the sequence of weights $\{t^i\}$ quickly produces numerical underflow as the parameter dimension n increases. Instead, we will directly impose the lower triangular structure and search for a map $\tilde{T}$ that approximately satisfies the measure constraint, i.e., for which $\mu_\xi \approx \tilde{T}_\sharp \mu_z$. This approach is a key difference between our construction and classical optimal transportation.
approach is a key difference between our construction and classical optimal transportation.
Numerical challenges with Equation 5.3 are not the only reason to seek approximate maps. Suppose that the target measure µz is a Bayesian posterior or some other intractable distribution, but let the reference µξ be something simpler, e.g., a Gaussian distribution with identity covariance. In this case, the complex structure of µz is captured by the map T. Sampling and other tasks can then be performed with the simple reference distribution instead of the more complicated distribution. In particular, if a map exactly satisfying Equation 5.1 were available, sampling the target distribution µz would simply require drawing a sample ξ∗ ∼ µξ and pushing it to the target space with z∗ = T⁻¹(ξ∗). This concept was employed by [64] for posterior sampling. Depending on the structure of the reference and the target, however, finding an exact map may be computationally challenging. In particular, if the target contains many nonlinear dependencies that are not present in the reference distribution, the representation of the map T (e.g., in some canonical basis) can become quite complex. Hence, it is desirable to work with approximations to T.
5.2 Bayesian inference using transport maps
Transport maps can be viewed as a representation of random variables. We may then choose to perform Bayesian inference via transport maps, instead of via the probability density functions in the traditional form of Bayes' theorem in Equation 2.1. To illustrate this, we employ KR maps, which are lower triangular and monotone, and whose reference measure is standard Gaussian. Adopting the same notation as Equation 2.1 for this section, where θ denotes the parameters, y the observations, and d the experimental design, the KR map for the target joint random vector in the order (d, θ, y) is

[η1, η2, η3]⊤ =^{i.d.} [Td(d), Tθ|d(d, θ), Ty|θ,d(d, θ, y)]⊤ = [Φ⁻¹(F(d)), Φ⁻¹(F(θ|d)), Φ⁻¹(F(y|θ, d))]⊤,    (5.4)
where η1, η2, η3 are i.i.d. standard Gaussians, and the subscript on each map component denotes the corresponding conditional distribution. For simplicity, we omit the “i.d.” above the equality signs except when this property needs to be emphasized. We also use F(·) to represent all distribution functions; which specific distribution it corresponds to is indicated by its arguments (when needed for clarity, a subscript naming the random variable will be included explicitly). Equation 5.4 may be interpreted as the prior form of the joint map, where the associated conditional distribution functions are those of the prior and likelihood, both of which are available to us at the beginning of an inference procedure. Another ordering of the target joint random vector, (d, y, θ), yields the KR map
[ξ1, ξ2, ξ3]⊤ =^{i.d.} [Td(d), Ty|d(d, y), Tθ|y,d(d, y, θ)]⊤ = [Φ⁻¹(F(d)), Φ⁻¹(F(y|d)), Φ⁻¹(F(θ|y, d))]⊤,    (5.5)
where ξ1, ξ2, ξ3 are i.i.d. standard Gaussians. Equation 5.5 may be interpreted as the posterior form of the joint map, where the associated conditional distribution functions are those of the evidence and posterior, the components we seek through the inference process.¹ While the prior form of the joint map is easy to construct, even analytically, the inference process then involves reordering the random variables to obtain the posterior form of the joint map, a non-trivial task. In the next section, we will show how to construct an approximation to the posterior form in Equation 5.5 directly, circumventing the prior form of Equation 5.4 and the reordering process altogether.
To demonstrate that Equation 5.5 indeed carries the posterior information, consider the
posterior random variable of θ conditioned on a particular experimental design d = d∗ and
observations y = y ∗ . Its KR map is precisely
Tθ|y∗,d∗(θ) = Φ⁻¹(F(θ|y∗, d∗)) = Tθ|y,d(d∗, y∗, θ),    (5.6)
where the first equality is due to the definition of KR maps, and the second equality uses
the relationship of the last component in Equation 5.5. Therefore, once the posterior form of the joint map in Equation 5.5 is available, we can obtain the KR map of the posterior random variable by simply conditioning the last component. Effectively, we have attained a posterior map that is parameterized by y and d. This is extremely useful in the context of sOED, where many repeated inference computations need to be conducted on the same prior belief state but with different realizations of d and y when numerically evaluating the Bellman equation (Equation 3.3) with a stochastic optimization algorithm.

¹Other orderings of the random variables are also possible, such as (y, d, θ). Such a sequence would still associate with the posterior conditional distribution, but not the evidence. If the only interest is the posterior, then any ordering is suitable as long as θ is positioned after all the variables we plan to condition it on.
The probability density function of the joint can also be easily obtained via

f(d, y, θ) = fξ1,ξ2,ξ3(Td,y,θ(d, y, θ)) |det ∂Td,y,θ(d, y, θ)|,    (5.7)

where Td,y,θ(d, y, θ) denotes the entire joint map from Equation 5.5 evaluated at (d, y, θ), and ∂Td,y,θ(d, y, θ) is its Jacobian. The Jacobian determinant is easily computable: due to the triangular structure, it is simply the product of the diagonal terms. Similarly, the density function for the posterior can be obtained via

f(θ|y, d) = fξ3(Tθ|y,d(d, y, θ)) |det ∂θTθ|y,d(d, y, θ)|,    (5.8)

where ∂θTθ|y,d(d, y, θ) is the Jacobian of transformation (with respect to θ) for the last component of Equation 5.5.
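As a sketch of how Equation 5.8 might be evaluated in practice, suppose the last map component and its θ-derivative are available as callables; the cubic component below is purely illustrative (it is not a map constructed in this thesis), and all names are hypothetical.

import numpy as np
from scipy.stats import norm

# Illustrative last map component T_{theta|y,d}(d, y, theta) and its
# partial derivative with respect to theta; a real map would come from
# the optimization problem of Section 5.3.
def T_theta(d, y, theta):
    return 0.8 * theta + 0.1 * theta**3 - 0.2 * y + 0.05 * d

def dT_dtheta(d, y, theta):
    return 0.8 + 0.3 * theta**2

def posterior_pdf(theta, y_star, d_star):
    """Map-induced posterior density, Equation 5.8:
    f(theta | y*, d*) = f_xi3(T(d*, y*, theta)) * dT/dtheta."""
    xi3 = T_theta(d_star, y_star, theta)
    # Monotonicity (cf. Eq. 5.20) keeps the derivative positive, so no
    # absolute value is needed here.
    return norm.pdf(xi3) * dT_dtheta(d_star, y_star, theta)

thetas = np.linspace(-4, 4, 9)
print(posterior_pdf(thetas, y_star=1.0, d_star=0.5))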
Before we describe the map construction method in the next section, we first illustrate
the concepts from this section with an example.
Example 5.2.1. For simplicity, consider an inference problem where the design d is fixed, and thus omitted from the notation. The prior on θ is N(0, 1), and the observation (likelihood) model has the form

y = G(θ) + 1.7ǫ = 0.01θ⁵ + 0.1(θ − 1.5)³ + 0.2θ + 5 + 1.7ǫ,    (5.9)

where ǫ ∼ N(0, 1) is an independent noise random variable. The prior form of the joint map can be easily constructed using the prior and likelihood information:

η1 = θ,    (5.10)
η2 = (1/1.7)[y − G(θ)] = (1/1.7)[y − 0.01θ⁵ − 0.1(θ − 1.5)³ − 0.2θ − 5].    (5.11)
It is interesting to note that for likelihood models with additive Gaussian noise, the joint maps constructed in this manner are always monotone regardless of the form of the forward model G; this is due to the triangular form of the maps. To obtain the posterior form of the joint map, we require

ξ1 = Ty(y),    (5.12)
ξ2 = Tθ|y(y, θ),    (5.13)

which is difficult to attain analytically. Instead, we will construct an approximation to this joint map using the numerical techniques introduced in the next section. Once it is available, the map for the posterior conditioned on y = y∗ is simply Tθ|y∗(θ) = Tθ|y(y∗, θ), for any realization y∗. We will revisit this example after introducing the map construction method.
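Although the posterior form must be built numerically, the joint samples that feed that construction are trivial to generate from the prior and likelihood. A minimal sketch for this example (our own illustration, not the thesis code):

import numpy as np

rng = np.random.default_rng(0)
K = 10**5

def G(theta):
    # Forward model from Equation 5.9
    return 0.01 * theta**5 + 0.1 * (theta - 1.5)**3 + 0.2 * theta + 5.0

theta = rng.standard_normal(K)               # prior N(0, 1)
y = G(theta) + 1.7 * rng.standard_normal(K)  # likelihood, Eq. 5.9

# Samples of the target joint, ordered so that theta comes last, are
# exactly what the optimization of Section 5.3 consumes when building
# the posterior-form map (Equations 5.12-5.13).
Z = np.column_stack([y, theta])
print(Z.shape, Z.mean(axis=0))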
5.3 Constructing maps from samples
We now describe a method to numerically construct an approximate map from samples of the
target measure. The work presented in this section (Section 5.3) was originally developed by
Parno and Marzouk, with additional details in [131, 132]. We repeat much of the derivation
here for completeness.
We seek transport maps that have a lower triangular structure, i.e.,

T(z1, z2, . . . , zn) = [T1(z1), T2(z1, z2), . . . , Tn(z1, z2, . . . , zn)]⊤,    (5.14)

where zi denotes the ith component of z and Ti : Rⁱ → R is the ith component of the map T. We assume that both the target and reference measures are absolutely
continuous on Rⁿ. This assumption precludes the existence of atoms in µz and thus makes the KR coupling well-defined. To find a useful approximation of the KR coupling, we will define a map-induced density f̃z(z) and minimize the distance between this map-induced density and the target density fz(z).
5.3.1 Optimization objective
Let fξ be the probability density associated with the reference measure µξ, and consider a transformation T̃(z) that is monotone and differentiable µz-a.e. (In Section 5.3.2 we will discuss constraints to ensure monotonicity; moreover, we will employ maps that are everywhere differentiable by construction.) Now consider the pullback of µξ through T̃. The density of this pullback measure is

f̃z(z) = fξ(T̃(z)) |det ∂T̃(z)|,    (5.15)

where ∂T̃(z) is the Jacobian of the map evaluated at z, and |det ∂T̃(z)| is the absolute value of the Jacobian determinant.
If the measure constraint µξ = T̃♯µz were exactly satisfied, the map-induced density f̃z would equal the target density fz. This suggests finding T̃ by minimizing a distance or divergence between f̃z and fz; to this end, we use the Kullback-Leibler (KL) divergence from f̃z to fz:

DKL(fz ‖ f̃z) = Efz[ln(fz(z)/f̃z(z))] = Efz[ln fz(z) − ln fξ(T̃(z)) − ln |det ∂T̃(z)|].    (5.16)
We can then find transport maps by solving the following optimization problem:

min_{T∈T} Efz[− ln fξ(T(z)) − ln |det ∂T(z)|],    (5.17)
where T is some space of lower-triangular functions from Rn to Rn . If T is large enough
to include the KR map, then the solution of this optimization problem will exactly satisfy
Equation 5.1. Note that we have removed the ln fz (z) term in Equation 5.16 from the
optimization objective in Equation 5.17, as it is independent of T. If the exact coupling
condition is satisfied, however, then the quantity inside the expectation in Equation 5.16
becomes constant in z.
Note that the KL divergence is not symmetric. We choose the direction above so that
we can use Monte Carlo samples to approximate the expectation with respect to fz (z).
Furthermore, as we will show below, this direction allows us to dramatically simplify the solution of Equation 5.17 when fξ is Gaussian. Suppose that we have K samples from fz,
denoted by {z⁽¹⁾, z⁽²⁾, . . . , z⁽ᴷ⁾}. Taking the sample-average approximation (SAA) approach described in Section 2.2.2, we replace the objective in Equation 5.17 with its Monte Carlo estimate and, for this fixed set of samples, solve the corresponding deterministic optimization problem:

T̃ = argmin_{T∈T} (1/K) Σ_{k=1}^{K} [ − ln fξ(T(z⁽ᵏ⁾)) − ln |det ∂T(z⁽ᵏ⁾)| ].    (5.18)
The solution T̃ is an approximation to the exact transport map for two reasons: first, we have used an approximation of the expectation operator; and second, we have restricted the feasible domain of the optimization problem to T. The specification of T is the result of constraints, discussed in Section 5.3.2, and of the finite-dimensional parameterization of the map, such as a multivariate polynomial expansion.
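The following Python sketch evaluates the sample-average objective of Equation 5.18 for a standard Gaussian reference; the map is passed as a callable returning its values and diagonal Jacobian entries at the samples (an assumed, illustrative interface).

import numpy as np

def saa_objective(map_eval, Z):
    """Monte Carlo estimate of the objective in Equation 5.18 for a
    standard Gaussian reference, omitting the additive constant
    (n/2) ln(2 pi).

    map_eval(Z) -> (T, D): T is (K, n) map values, D is (K, n)
    diagonal Jacobian entries dT_i/dz_i at each sample."""
    T, D = map_eval(Z)
    # -ln f_xi(T(z)) for N(0, I) reduces to 0.5 * sum_i T_i(z)^2 plus a
    # constant; -ln|det dT| is -sum_i ln D_i (cf. Eq. 5.22).
    return np.mean(0.5 * np.sum(T**2, axis=1) - np.sum(np.log(D), axis=1))

# Toy diagonal linear map T_i(z) = a_i * z_i
def linear_map(Z, a=np.array([1.0, 2.0])):
    return Z * a, np.broadcast_to(a, Z.shape)

rng = np.random.default_rng(1)
Z = rng.standard_normal((10**4, 2))
print(saa_objective(linear_map, Z))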
5.3.2 Constraints
To write the map-induced density f̃z as in Equation 5.15, it is sufficient that T̃ be differentiable and monotone, i.e., (z′ − z)⊤(T̃(z′) − T̃(z)) ≥ 0 for distinct points z, z′ ∈ Rⁿ. Since we assume that µz has no atoms, to ensure that the pushforward T̃♯µz also has no atoms we only need to require that T̃ be strictly monotone. Our map is by construction everywhere differentiable and lower triangular, and we impose the monotonicity constraint via

∂T̃i/∂zi ≥ λmin > 0,  i = 1 . . . n.    (5.19)
Since T̃ is lower triangular, the Jacobian ∂T̃ is also lower triangular, and Equation 5.19 ensures that the Jacobian is positive definite. Because the Jacobian determinant is then positive, we can remove the absolute value from the determinant terms in Equation 5.17, Equation 5.18, and related expressions. This is an important step towards arriving at a convex optimization problem (see Section 5.3.3).

Unfortunately, we cannot generally enforce the lower bound in Equation 5.19 over the entire support of the target measure. A weaker, but practically enforceable, alternative is to require the map to be increasing at each sample used to approximate the KL divergence.
In other words, we use the constraints

∂T̃i/∂zi |_{z⁽ᵏ⁾} ≥ λmin > 0,  ∀i ∈ {1, 2, . . . , n}, ∀k ∈ {1, 2, . . . , K}.    (5.20)

In practice, we have found that Equation 5.20 is sufficient to ensure the monotonicity of a map represented by a finite basis expansion.
5.3.3 Convexity and separability of the optimization problem
Now we consider the task of minimizing the objective in Equation 5.18. The 1/K factor can immediately be discarded, and the derivative constraints above let us remove the absolute value from the determinant term. While one could tackle the resulting minimization problem directly, we can simplify it further by exploiting the structure of the reference density and the triangular map.

First, we let ξ ∼ N(0, I). This choice of reference distribution yields

ln fξ(ξ) = −(n/2) ln(2π) − (1/2) Σ_{i=1}^{n} ξi².    (5.21)

Next, the lower triangular Jacobian ∂T̃ simplifies the determinant term in Equation 5.18 to give

ln |det ∂T̃(z)| = ln(det ∂T̃(z)) = ln( ∏_{i=1}^{n} ∂T̃i/∂zi ) = Σ_{i=1}^{n} ln(∂T̃i/∂zi).    (5.22)
The objective function in Equation 5.18 now becomes

C(T̃) = Σ_{i=1}^{n} Σ_{k=1}^{K} [ (1/2) T̃i(z⁽ᵏ⁾)² − ln( ∂T̃i/∂zi |_{z⁽ᵏ⁾} ) ].    (5.23)
This objective is separable: it is a sum of n terms, each involving a single component T̃i of the map. The constraints in Equation 5.20 are also separable; there are K constraints for each T̃i, and no constraint involves multiple components of the map. Hence the entire optimization problem separates into n individual optimization problems, one for each dimension of the parameter space. Moreover, each optimization problem is convex: the objective is convex, and the feasible domain is closed (note the ≥ operator in the linear constraints of Equation 5.20) and convex.
In practice, we must solve the optimization problem over some finite-dimensional space of candidate maps. Let each component of the map be written as T̃i(z; γi), i = 1 . . . n, where γi ∈ R^{Mi} is a vector of parameters, e.g., coordinates in some basis. Throughout this thesis, we employ multivariate polynomial basis functions, but other choices are certainly possible. For instance, [131] found radial basis function representations of the map also to be useful. For any choice of basis, we will require that T̃i be linear in γi. The complete map is then defined by the parameters γ̄ = [γ1, γ2, . . . , γn]. Note that there is a distinct parameter vector for each component of the map. The optimization problem over the parameters remains separable, with each of the n subproblems given by:

min_{γi}  Σ_{k=1}^{K} [ (1/2) T̃i(z⁽ᵏ⁾; γi)² − ln( ∂T̃i(z; γi)/∂zi |_{z⁽ᵏ⁾} ) ]
s.t.  ∂T̃i(z; γi)/∂zi |_{z⁽ᵏ⁾} ≥ λmin > 0,  k ∈ {1, 2, . . . , K},    (5.24)
for i = 1 . . . n. All of these optimization subproblems can be solved in parallel, without evaluating the target density fz(z). Since the map components T̃i are linear in the coefficients γi, each finite-dimensional problem is still convex. Moreover, efficient matrix-matrix and matrix-vector operations can be used to evaluate the objective. This allows us to easily solve Equation 5.24 with a standard Newton method.
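As an illustration, the sketch below solves the subproblem in Equation 5.24 for a single 1D cubic component, enforcing the monotonicity constraint through a crude penalty; a production implementation would instead use the Newton method mentioned above with analytic gradients and Hessians. All names and tolerances here are illustrative.

import numpy as np
from scipy.optimize import minimize

# 1D component T(z; g) = g0 + g1 z + g2 z^2 + g3 z^3, linear in g, with
# derivative dT/dz = g1 + 2 g2 z + 3 g3 z^2 (also linear in g).
rng = np.random.default_rng(2)
z = rng.standard_normal(1000)                    # samples z^(k) of the target
Psi = np.column_stack([np.ones_like(z), z, z**2, z**3])        # basis
dPsi = np.column_stack([np.zeros_like(z), np.ones_like(z),
                        2.0 * z, 3.0 * z**2])                  # d(basis)/dz
lam_min = 1e-4

def objective(g):
    dT = dPsi @ g
    if np.any(dT <= lam_min):
        return 1e12          # crude penalty enforcing Eq. 5.20 at the samples
    T = Psi @ g
    return np.sum(0.5 * T**2 - np.log(dT))       # Eq. 5.24 objective

g0 = np.array([0.0, 1.0, 0.0, 0.0])              # feasible start: identity map
res = minimize(objective, g0, method="Nelder-Mead")
print(res.x)   # close to the identity map, since z is already ~ N(0, 1)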
5.3.4 Map parameterization
One way to parameterize each component T̃i of the map is with a multivariate polynomial expansion. We define each multivariate polynomial ψj as

ψj(z) = ∏_{i=1}^{n} φ_{ji}(zi),    (5.25)

where j = (j1, j2, . . . , jn) ∈ N₀ⁿ is a multi-index and φ_{ji} is a univariate polynomial of degree ji. The univariate polynomials can be chosen from any family of orthogonal polynomials (e.g., Hermite, Legendre, Jacobi). For simplicity, monomials are used for the present purposes. Using these multivariate polynomials, we express the map as a finite expansion of the form

T̃i(z; γi) = Σ_{j∈Ji} γ_{i,j} ψj(z),    (5.26)

where Ji is a set of multi-indices defining the polynomial terms in the expansion. Notice that the cardinality of the multi-index set defines the dimension of each parameter vector γi, i.e., Mi = |Ji|. An appropriate choice of each multi-index set Ji will force the entire map T̃ to be lower triangular.

A simple choice of multi-index set corresponds to a total-order polynomial basis, where the maximum degree of each multivariate polynomial is bounded by some integer p ≥ 0:

Ji^{TO} = {j : ‖j‖₁ ≤ p, jk = 0 ∀k > i}.    (5.27)

The first constraint in this set limits the polynomial order, while the second constraint, jk = 0 ∀k > i, applied over all i = 1 . . . n components of the map, forces T̃ to be lower triangular. In this work, we adopt a total-order (monomial) polynomial basis.
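A small Python sketch (our own illustration) that enumerates the multi-index set Ji^{TO} of Equation 5.27:

from itertools import product

def total_order_set(i, n, p):
    """Multi-index set J_i^TO of Equation 5.27: all j in N_0^n with
    ||j||_1 <= p and j_k = 0 for k > i (triangularity)."""
    J = []
    for j_head in product(range(p + 1), repeat=i):
        if sum(j_head) <= p:
            J.append(j_head + (0,) * (n - i))
    return J

# Component i = 2 of a 3-dimensional map with total order p = 2:
for j in total_order_set(2, 3, 2):
    print(j)
# The cardinality M_i = |J_i| grows combinatorially, as C(i + p, p):
print(len(total_order_set(2, 3, 2)))  # 6 = C(4, 2)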
Example 5.3.1. We now continue from Example 5.2.1 and present its numerical results. Samples from the joint distribution, and the joint density contours, are shown in Figure 5-2. A particular posterior, for y = y∗ = 1, is studied; this is represented by the dotted horizontal line in the joint density plot, and its exact posterior density is shown in Figure 5-3. An approximate joint map of the form in Equations 5.12 and 5.13 is constructed numerically using monomial bases of different total orders and with different numbers of samples; the densities corresponding to the posterior map Tθ|y∗ are shown in Figure 5-3. As expected, the density induced by the approximate posterior map better approximates the exact density as the polynomial basis order and sample size are increased. However, while the model in Equation 5.9 is a 5th-order polynomial, the posterior form of its joint map (Equations 5.12 and 5.13) can generally be of higher than 5th order. As a result, even with a 5th-order polynomial basis, we do not expect to obtain the exact posterior density.
5.4 Relationship between quality of joint and conditional maps
The map construction method described in the previous section can be used to construct an approximate map to the posterior form of the joint map (Equation 5.5) in the context of Bayesian inference. This construction method minimizes an objective that is the KL divergence between the map-induced joint density and the target joint density (Equation 5.16).

Figure 5-2: Example 5.3.1: samples and density contours. (a) Samples from the joint distribution. (b) Joint density contours; the dotted line marks the value y = 1 at which inference is performed.
As the goal of inference is to ultimately obtain the posterior map and its density by conditioning on the joint map (Equations 5.5 and 5.6), we would like to explore the implications
of joint map quality on the subsequent posterior conditional map quality. In other words,
does a “good” joint map also lead to “good” posterior maps? We prove this is indeed the case,
and that the optimal (in the KL sense) approximate joint map also produces the optimal
expected posterior maps.
Consider an exact n-dimensional, lower triangular, monotone transport map from a target random vector z1:n to an i.i.d. reference random vector ξ1:n:

ξ1:n = [ξ1, ξ2, . . . , ξn]⊤ =^{i.d.} [T1(z1), T2(z1, z2), . . . , Tn(z1, z2, . . . , zn)]⊤ = T1:n(z1:n),    (5.28)

where the subscript notation zj:k = (zj, . . . , zk). This exact map T1:n always exists, is unique, and has target density fz1:n(z1:n) = fξ1:n(T1:n(z1:n)) |det ∂T1:n(z1:n)|.
Figure 5-3: Example 5.3.1: posterior density functions using different map polynomial basis orders and sample sizes. (a) Exact posterior density for y = 1. (b) Posteriors for map basis orders p = 1, 3, 5, each using 10⁶ samples. (c) Posteriors for sample sizes 10³, 10⁴, 10⁵, each using a 5th-order map.

Let T̃1:n ∈ T̃1:n denote an approximate map, where T̃1:n ⊆ T1:n is an approximation subspace of the same dimension as T1:n, and T1:n is the space of all lower triangular diffeomorphisms on Rⁿ:
ξ̃1:n = [ξ̃1, ξ̃2, . . . , ξ̃n]⊤ =^{i.d.} [T̃1(z1), T̃2(z1, z2), . . . , T̃n(z1, z2, . . . , zn)]⊤ = T̃1:n(z1:n).    (5.29)
In Equation 5.29, the target random vector z1:n is unchanged, but the reference random vector ξ̃1:n is now approximate and, as a result, generally no longer i.i.d. Similarly, we could keep the reference random vector ξ1:n unchanged (and thus i.i.d.), and view ξ1:n =^{i.d.} T̃1:n(z̃1:n) for some approximate target random vector z̃1:n. Such a z̃1:n always exists, since it is precisely T̃1:n⁻¹(ξ1:n). Figure 5-4 illustrates these two perspectives.
Figure 5-4: Illustration of the exact map, ξ1:n = T1:n(z1:n), and the two perspectives on approximate maps, ξ̃1:n = T̃1:n(z1:n) and ξ1:n = T̃1:n(z̃1:n). Contour plots on the left reflect the reference density, and on the right the target density.

The approximate target density is fz̃1:n(z1:n) = fξ1:n(T̃1:n(z1:n)) |det ∂T̃1:n(z1:n)|, which in general only approximates the true target density fz1:n(z1:n). The map construction approach described in Section 5.3 finds a good approximate map by minimizing the KL divergence between fz̃1:n and fz1:n, where the KL divergence reflects the
map quality jointly in all of its dimensions. Through the following theorem and corollary,
we show that the optimal approximate joint map also produces optimal expected posterior
conditional maps.
Theorem 5.4.1. Let the optimal approximate joint map, satisfying

T̃1:n* = argmin_{T̃1:n ∈ T̃1:n} DKL(fz1:n ‖ fz̃1:n),    (5.30)
be denoted by the component structure

T̃1:n*(z1:n) = [T̃1*(z1), T̃2*(z1, z2), . . . , T̃n*(z1, z2, . . . , zn)]⊤.    (5.31)
Then for each k = 1, . . . , n, the dimension-truncated “head” map

T̃1:k*(z1:k) = [T̃1*(z1), T̃2*(z1, z2), . . . , T̃k*(z1, z2, . . . , zk)]⊤    (5.32)
is also the optimal approximate map for z1:k, in the sense that

T̃1:k* = argmin_{T̃1:k ∈ T̃1:k} DKL(fz1:k ‖ fz̃1:k),    (5.33)

where T̃1:k ⊆ T̃1:n is its truncation to the first k dimensions.
Proof. We want to show that Equation 5.33 holds for k = 1, . . . , n, and we proceed by induction. The base case k = n is clearly true by the definition of T̃1:n*. Now assume Equation 5.33 holds for k = m + 1; we want to show that it then holds for k = m as well.
For any approximate map T̃1:(m+1) ∈ T̃1:(m+1),

DKL(fz1:(m+1) ‖ fz̃1:(m+1))
  = Ez1:(m+1)[ ln( fz1:(m+1)(z1:(m+1)) / fz̃1:(m+1)(z1:(m+1)) ) ]
  = Ez1:(m+1)[ ln( fzm+1|z1:m(zm+1|z1:m) fz1:m(z1:m) / ( fz̃m+1|z̃1:m(zm+1|z1:m) fz̃1:m(z1:m) ) ) ]
  = Ez1:(m+1)[ ln( fzm+1|z1:m(zm+1|z1:m) / fz̃m+1|z̃1:m(zm+1|z1:m) ) ] + Ez1:(m+1)[ ln( fz1:m(z1:m) / fz̃1:m(z1:m) ) ]
  = Ez1:m[ Ezm+1|z1:m[ ln( fzm+1|z1:m(zm+1|z1:m) / fz̃m+1|z̃1:m(zm+1|z1:m) ) | z1:m ] ] + Ez1:m[ ln( fz1:m(z1:m) / fz̃1:m(z1:m) ) ]
  = Ez1:m[ DKL( fzm+1|z1:m(·|z1:m) ‖ fz̃m+1|z̃1:m(·|z1:m) ) ] + DKL(fz1:m ‖ fz̃1:m),    (5.34)
where the 2nd equality is due to

fz̃1:(m+1)(z1:(m+1)) = fξ1:(m+1)(T̃1:(m+1)(z1:(m+1))) det ∂T̃1:(m+1)(z1:(m+1))
  = fξm+1(T̃m+1(z1:(m+1))) fξ1:m(T̃1:m(z1:m)) det ∂m+1T̃m+1(z1:(m+1)) det ∂T̃1:m(z1:m)
  = fz̃m+1|z̃1:m(zm+1|z1:m) fz̃1:m(z1:m),    (5.35)

where fz̃m+1|z̃1:m(zm+1|z1:m) = fξm+1(T̃m+1(z1:(m+1))) det ∂m+1T̃m+1(z1:(m+1)) depends only on the map component of dimension m + 1 (i.e., T̃m+1), and not on any of the previous map components (i.e., T̃1:m). The decomposition of fξ1:(m+1)(T̃1:(m+1)(z1:(m+1))) in the 2nd equality of Equation 5.35 uses the independence of the components of ξ1:(m+1), and the decomposition of det ∂T̃1:(m+1)(z1:(m+1)) is due to the triangular structure of the map.
Now, taking the argmin on both sides of Equation 5.34, we obtain

argmin_{T̃1:(m+1) ∈ T̃1:(m+1)} DKL(fz1:(m+1) ‖ fz̃1:(m+1))
  = argmin_{T̃1:m ∈ T̃1:m, T̃m+1 ∈ T̃m+1} { Ez1:m[ DKL( fzm+1|z1:m(·|z1:m) ‖ fz̃m+1|z̃1:m(·|z1:m) ) ] + DKL(fz1:m ‖ fz̃1:m) }
  = argmin_{T̃m+1 ∈ T̃m+1} Ez1:m[ DKL( fzm+1|z1:m(·|z1:m) ‖ fz̃m+1|z̃1:m(·|z1:m) ) ] + argmin_{T̃1:m ∈ T̃1:m} DKL(fz1:m ‖ fz̃1:m),    (5.36)
where we have made use of the fact that the two terms in the summation depend on separate dimension components of the overall map.

As a result, we see that the optimal approximate map T̃1:(m+1)* for dimensions 1 to m + 1 is the concatenation of the optimal map T̃1:m* for dimensions 1 to m and the optimal (m + 1)th component T̃m+1*. This completes the proof.
Corollary 5.4.1. For each k = 1, . . . , n, the component map T̃k* is the optimal expected conditional map, in the sense that

T̃k* = argmin_{T̃k ∈ T̃k} Ez1:(k−1)[ DKL( fzk|z1:(k−1)(·|z1:(k−1)) ‖ fz̃k|z̃1:(k−1)(·|z1:(k−1)) ) ].    (5.37)

Proof. This is a direct consequence of Equation 5.36.
In the context of Bayesian inference through joint maps (Section 5.2), the component map used on the right-hand side of Equation 5.6 is therefore optimal under the joint expectation over d and y.
5.5 Sequential design using transport maps
We now shift focus back to the sOED problem described in Chapter 3. Recall that we would like to solve the dynamic programming form of the sOED problem, stated in Equations 3.3 and 3.4 and restated here for convenience:

Jk(xk) = max_{dk ∈ Dk} Eyk|xk,dk[ gk(xk, yk, dk) + Jk+1(Fk(xk, yk, dk)) ],    (5.38)
JN(xN) = gN(xN).    (5.39)
While approximate dynamic programming techniques have been introduced in Chapter 4
for finding an approximate solution to this form, the issue of choosing the belief state xk
remains unaddressed. Two major requirements emerge in considering this decision: the
belief state needs to be able to (1) represent general, non-Gaussian posteriors of multiple
dimensions in a finite-dimensional manner, and (2) perform Bayesian inference quickly. The
second requirement is driven by Equation 5.38, whose numerical evaluation even at a
single xk involves performing Bayesian inference (i.e., operations of Fk (xk , yk , dk )) many
times under different values of dk (stochastic optimization iterations) and yk (Monte Carlo
approximation of the expectation).
Traditional approaches for representing random variables, such as direct approximations to their probability density and distribution functions, or the use of Gaussian mixtures or particles, generally do not scale well with dimension, are constrained to limited forms and structures, or are computationally expensive to construct and propagate under inference. In
contrast, the transport map method introduced in this chapter satisfies both requirements
well. Following the discussions in Section 5.3.4, not only do the approximate KR maps
provide a finite-dimensional representation of general random variables, they also offer a
mechanism to adapt. This can be done by adjusting the selection of basis in different dimensions or regions, a topic to be explored as future work. Targeting an efficient representation
can greatly alleviate the burden of extending to multiple and higher-dimensional settings.
Additionally, as illustrated in Section 5.2, only a single joint map needs to be constructed,
which can then be used to perform inference almost trivially for many different realizations
of design and observations. The construction process itself is non-intrusive, requiring only
samples from the target measure, which can effectively be treated as a black box. The
construction procedure ends with a computationally attractive optimization problem that
is dimensionally-separable and convex, and its solution can be obtained inexpensively.
Motivated by these advantages, we proceed to use transport maps as the belief state for
the sOED problem.
5.5.1 Joint map structure
Transport maps of different levels of scope may be constructed for the sOED problem; we
explore these possibilities here, and discuss their pros and cons. To illustrate this idea,
consider the first experiment only at stage k = 0 with the prior x0 fixed, and the dimensions
nθ , ny and nd inherited from Section 2.1 and assumed constant for all experiments. We
then have three different levels of map construction under this situation, summarized in
Table 5.1.
Map targets    Dimension         Construct for each...    Total no. of constructions
θ              nθ                {d0, y0}                 niter × nMC
θ, y0          nθ + ny           {d0}                     niter
θ, d0, y0      nθ + nd + ny      (single map)             1

Table 5.1: Different levels of scope are available when transport maps are used as the belief state in the sOED context. niter represents the number of stochastic optimization iterations from numerically evaluating Equation 5.38, and nMC represents the Monte Carlo sample size for approximating its expectation. In our implementation, these values are typically around 50 and 100, respectively.
The first choice involves constructing an nθ -dimensional map for each realization of
{d0 , y0 }, which is equivalent to capturing each posterior directly. Such an approach does not
align with the map construction and inference tools described in this chapter: constructing
these maps requires samples from the posteriors, the very distributions we are trying to
recover in the first place; it also does not take advantage of the inference mechanism of conditioning on a joint map. The second choice involves constructing an (nθ + ny )-dimensional
map for each realization of {d0 }. This approach requires samples from the joint distribution
of θ and y0 conditioned on d0 , which can be attained by first sampling θ from the prior,
and then y0 from the likelihood conditioned on the θ sample and d0 . A map constructed at
d0 can then be used for inference on that d0 and any realization of y0 . However, niter map
constructions are required, one for each d0 encountered within the stochastic optimization
when evaluating Equation 5.38 numerically (niter, the number of stochastic optimization iterations, is typically capped at 50 in our implementation), as these maps cannot be reused
when the optimizer moves to a different value of d0 . The last choice involves a single
(nθ + nd + ny )-dimensional map that can be used for inference of any realizations of d0 and
y0 . Its construction involves samples from the joint distribution of θ, y0 , and d0 , which can
be attained but requires some predefined rule (or distribution) that governs the generation
of d0 ; we will discuss this requirement in detail within Section 5.5.3.
The primary trade-off between these choices is the map dimension and the number of
maps needed. For example, let us focus on the second and third choices, with nd around 2 in our numerical examples and niter at most 50. From experience, the extra time
to construct a map nd dimensions higher is usually substantially shorter than constructing
the smaller joint map niter times. While both approaches remain affordable, the larger map
choice appears to be more computationally economical.
Accuracy considerations become more important when the map dimension is high. On one front, with more dimensions and basis terms, especially using a total-order polynomial basis, more samples from the joint distribution are required to construct the map while maintaining the same level of accuracy. On another front, a map of higher dimension also introduces additional relationships among the new variables compared to its lower-dimensional counterpart, and is consequently more difficult to capture accurately. In context, a map attempting to accommodate the dependence on d tends to be less accurate at any particular value of d compared to a lower-dimensional map that is constructed for that specific d. Depending on the regularity of the problem, the map accuracy can have a huge impact on the overall sOED results.
The same pattern of trade-offs is observed when extending to multiple experiments. Following a similar argument, we propose to construct a single joint map at each stage k that can be used for inference on any realizations of designs and observations from the previous experiments. Table 5.2 illustrates these maps for the first three experiments. In particular, Tθ|d0,y0 (the last component of the first column) is used for inference after performing one experiment, Tθ|d0,y0,d1,y1 (the last component of the second column) is used for inference after performing two experiments, etc. A closer examination of their structure reveals two interesting observations. First, only the Tθ|d0,y0,...,dk,yk (bottom) component of each map is used for performing inference; all other components are not needed, but are created as by-products of constructing these maps. Second, there is substantial overlap of components between these maps at different stages. Specifically, the components grouped by the red rectangular boxes in Table 5.2 are identical. It is then natural to retain only the unique components from all these maps in an N-experiment design setting, arriving at the following single joint map that can be used for performing Bayesian inference upon any number of experiments:
ξd0 = Td0(d0)
ξy0 = Ty0|d0(d0, y0)
ξd1 = Td1|d0,y0(d0, y0, d1)
ξy1 = Ty1|d0,y0,d1(d0, y0, d1, y1)
⋮
ξdN−1 = TdN−1|d0,y0,...,dN−2,yN−2(d0, y0, . . . , dN−2, yN−2, dN−1)
ξyN−1 = TyN−1|d0,y0,...,dN−2,yN−2,dN−1(d0, y0, . . . , dN−2, yN−2, dN−1, yN−1)
ξθ0 = Tθ|d0,y0(d0, y0, θ)
ξθ1 = Tθ|d0,y0,d1,y1(d0, y0, d1, y1, θ)
⋮
ξθN−1 = Tθ|d0,y0,...,dN−1,yN−1(d0, y0, . . . , dN−1, yN−1, θ).    (5.40)
This final map has dimension N(nθ + nd + ny), and the entire map can be constructed all-at-once using the method described in Section 5.3. The components ξθk all correspond to the same θ variable, but with dependence structures involving different numbers of dk's and yk's, for inference after different numbers of experiments. This setup does not require any intermediate posterior maps Tθ|d0∗,y0∗,...,dk∗,yk∗ directly; inference is done by conditioning on the entire history of past dk∗ and yk∗ values. Consequently, intermediate posterior approximation errors are avoided altogether. The triangular structure is maintained, while the block of θ variables has a sparse dependence structure (e.g., Tθ|d0,y0 is for inference after the first experiment and thus has no dependence on dk and yk for k > 0; θ components corresponding to different numbers of experiments also do not depend on each other); this sparsity property is leveraged in our implementation.
k = 0:
ξd0 = Td0(d0)
ξy0 = Ty0(d0, y0)
ξθ0 = Tθ0(d0, y0, θ)

k = 1:
ξd0 = Td0(d0)
ξy0 = Ty0(d0, y0)
ξd1 = Td1(d0, y0, d1)
ξy1 = Ty1(d0, y0, d1, y1)
ξθ1 = Tθ1(d0, y0, d1, y1, θ)

k = 2:
ξd0 = Td0(d0)
ξy0 = Ty0(d0, y0)
ξd1 = Td1(d0, y0, d1)
ξy1 = Ty1(d0, y0, d1, y1)
ξd2 = Td2(d0, y0, d1, y1, d2)
ξy2 = Ty2(d0, y0, d1, y1, d2, y2)
ξθ2 = Tθ2(d0, y0, d1, y1, d2, y2, θ)

Table 5.2: Structure of the joint maps needed to perform inference after different numbers of experiments (columns k = 0, 1, 2, . . .). For simplicity of notation, we omit the conditioning in the subscripts of the map components; see Equation 5.40 for the full subscripts. The same pattern repeats for higher numbers of experiments. The components grouped by the red rectangular boxes (i.e., those repeated across columns) are identical.
5.5.2 Distributions on design variables
The joint maps presented in Table 5.2 and Equation 5.40 all involve dependence on dk, so that the same maps can be used for inference under different designs. To construct these joint maps, then, dk samples are required. While θ and yk samples can be generated naturally from the prior and likelihood model, it is not immediately clear how to generate dk. On the one hand, intuition tells us that the joint map can be made more accurate if we prescribe an appropriate distribution for dk that reflects how often the designs are visited. On the other hand, we must do so without compromising what we ultimately seek from the joint maps: the posteriors. We address these considerations generally in this subsection, and focus on dk generation specifically within the sOED context in the next subsection.

Consider a simple one-experiment joint map (i.e., the k = 0 column from Table 5.2); for simplicity we drop the subscripts on d0 and y0. The ultimate purpose of this joint map is to produce posterior maps Tθ|d∗,y∗(θ) = Tθ|d,y(d∗, y∗, θ). Assuming that (1) d and θ are (marginally) independent, and (2) the prior f(θ) and likelihood f(y|θ, d) are fixed, the posterior conditional remains unchanged regardless of the marginal distribution of d:
f(θ|y, d) = f(θ, y, d)/f(y, d) = [f(y|θ, d)f(d)f(θ)] / [f(y|d)f(d)] = [f(y|θ, d)f̃(d)f(θ)] / [f(y|d)f̃(d)] = [f(y|θ, d)f̃(d)f(θ)] / [f̃(y|d)f̃(d)] = f̃(θ, y, d)/f̃(y, d),    (5.41)
where f̃ denotes density functions resulting from an alternative choice of marginal distribution on d. The second equality uses the independence assumption between d and θ. The third equality employs the alternative f̃(d), which does not affect the prior and likelihood. The fourth equality can be seen more clearly from

f̃(y|d) = ∫ f(y|θ, d)f(θ) dθ = f(y|d),    (5.42)
again using the independence of θ and d. Equation 5.41 implies that, regardless of the d marginal, the same posteriors are maintained, and we indeed have the freedom to select a distribution from which to sample d. The joint distribution and joint map, however, would be different.

The linear-Gaussian example below demonstrates that when the d marginal is chosen to reflect designs that are visited more frequently, or are otherwise of interest, the quality of the posteriors is improved.
Example 5.5.1. Consider a linear model y = θd + ǫ with prior θ ∼ N(s0, σ0²) and noise variable ǫ ∼ N(0, σǫ²). This linear-Gaussian problem has conjugate Gaussian posteriors of the form

θ|y, d ∼ N( [ (y/d)/(σǫ²/d²) + s0/σ0² ] / [ 1/(σǫ²/d²) + 1/σ0² ] , [ 1/(σǫ²/d²) + 1/σ0² ]⁻¹ ).    (5.43)
We first point out that, even when the marginal of d is Gaussian, the joint distribution of θ, y, d is not multivariate Gaussian. This can be seen from the following argument. If the joint is Gaussian, then all of its marginals are also Gaussian; conversely, if any of its marginals is not Gaussian, then the joint cannot be Gaussian. Since θ and d are independent Gaussian random variables, their product θd cannot be Gaussian, and thus (y − ǫ) cannot be Gaussian. Knowing that ǫ is independent of d and θ, and that (y − ǫ) is neither Gaussian nor a constant, the marginal of y cannot be Gaussian either: if y were Gaussian, then its moment generating function would satisfy My(t) = exp(µy t + (1/2)σy²t²) = My−ǫ(t)Mǫ(t) = My−ǫ(t) exp((1/2)σǫ²t²), so (y − ǫ) would have to be either Gaussian or a constant, a contradiction. As a result, we can conclude that the joint distribution cannot be multivariate Gaussian, and using a linear polynomial basis to represent the joint map in this example incurs truncation error.
Now assume we are interested in performing inference accurately for designs distributed as d◦ ∼ N(1, 0.1²). To test the accuracy of posteriors from a given joint map, we randomly sample designs from d◦, θ from the prior with s0 = 0, σ0 = 1, and y from the likelihood with σǫ = 1, and produce different posterior maps by conditioning on these samples. A number of joint maps are tested. They all employ a first-order polynomial basis and are constructed from 10⁵ samples generated from the prior and likelihood, with the different choices of marginal distribution for d listed in Table 5.3. For cases 4 to 8, the samples are "fixed" on a grid in order to mimic uniform distributions while allowing us to maintain control over the uniformity and frequency of samples; their purpose is also to test the numerical effects of using multiple exactly repeated (i.e., stacked) design samples.
Case    d marginal
1       N(1, 0.1²)
2       N(0, 5²)
3       N(−2, 0.5²)
4       uniform on a grid over [0, 2]
5       uniform on a grid over [0, 2], with each point repeated 10 times
6       uniform on a grid over [−3, 0]
7       uniform on a grid over [−3, 0], but with 50% of points at exactly d = 1
8       same as case 7, but with a 3rd-order map

Table 5.3: Marginal distributions of d used to construct the joint maps.
Posterior density functions for a particular sample realization from the different joint maps are shown in Figure 5-5, with those from additional sample realizations shown in Figure 5-6. In all the figures, posteriors from cases 1 and 4 match the analytic results most closely. This is expected, since the joint map for the former is constructed using exactly the d◦ distribution, while the latter covers it well via a uniform grid. Case 2 is less accurate because it "over-covers," spending accuracy in regions we are not interested in. Case 3 is even less accurate because it is concentrated in a narrow region of d far from the bulk of d◦. Case 5 is essentially identical to case 4, since increasing the samples proportionally does not change the d distribution. Case 6 is again inaccurate because its samples do not cover the design region of interest well. Case 7 improves upon case 6 as more samples at the mean of d◦ are added, but remains inaccurate. Cases 4 to 8 also demonstrate that the map construction algorithm is numerically sound even when there are samples with identical values of d.
Overall, poor posterior estimates can be attributed to two main factors. First, accuracy deteriorates as the d marginal differs more (in a loose sense of "rough coverage," rather than precise form, such as whether it is Gaussian, uniform, or gridded) from d◦. Second, since the joint distribution is not multivariate Gaussian for this example, there is truncation error from using a linear polynomial basis for the joint map. This is further supported by case 8, which is the same as case 7 but uses a 3rd-order polynomial basis, showing improved results over case 7.
Figure 5-5: Example 5.5.1: posteriors from joint maps constructed under different d distributions (analytic posterior and cases 1-8), for the sample realization d = 0.87299, y = 3.0213.
5.5.3 Generating samples in sequential design
To construct the joint map described in Equation 5.40 for the sOED problem, we need to generate samples of θ, dk, yk for k = 0, . . . , N − 1. In particular, the joint density function has the form

f(θ, d0, y0, . . . , dN−1, yN−1) = [ ∏_{k=0}^{N−1} f(yk|dk, θ) f(dk) ] f(θ),    (5.44)
where the yk are independent conditioned on θ and dk (simply because the noise in the likelihood model is independent), and dk and θ are (marginally) independent. Samples of θ can be generated naturally from the prior f(θ), and yk from the likelihood f(yk|dk, θ) given dk and θ; the only missing part is f(dk).

As illustrated in the previous subsection, we may choose any marginal f(dk) without changing the posteriors generated from the joint map. Furthermore, it is advantageous to select an f(dk) that is proportional to how often we will visit the designs. This is precisely the dk distribution induced by the optimal policy and the associated numerical methods; the same concept was introduced and discussed in Section 4.3 in the context of regression for value function approximation. However, not only do we not have the optimal policy, we cannot even generate dk from an approximate policy in the one-step lookahead form of Equation 4.1, since doing so requires performing inference, a capability provided by the very joint map we are trying to construct. The only choice, then, is to generate dk from a distribution that does not require inference, such as random exploration. This is the method we employ: we generate samples of θ from the prior, d0, . . . , dN−1 from an exploration policy, and finally y0, . . . , yN−1 from the likelihood.
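A minimal sketch of this sampling procedure follows, with a placeholder linear-Gaussian likelihood standing in for the problem-specific forward model; all names and the exploration distribution below are illustrative.

import numpy as np

rng = np.random.default_rng(3)
N = 2          # number of experiments
K = 10**4      # number of exploration trajectories

def likelihood_sample(theta, d):
    # Placeholder forward model y = theta * d + noise; a real problem
    # would substitute its own G(theta, d).
    return theta * d + rng.standard_normal(theta.shape)

theta = rng.standard_normal(K)              # theta ~ prior f(theta)
cols = []
for k in range(N):
    d_k = rng.normal(1.25, 0.5, size=K)     # exploration policy f(d_k)
    y_k = likelihood_sample(theta, d_k)     # y_k ~ f(y_k | d_k, theta)
    cols += [d_k, y_k]
# Joint samples (d_0, y_0, ..., d_{N-1}, y_{N-1}, theta) from Eq. 5.44,
# ordered to match the triangular structure of Equation 5.40.
samples = np.column_stack(cols + [theta])
print(samples.shape)  # (K, N*(n_d + n_y) + n_theta)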
As this "exploration joint map" is constructed using only exploration samples, its performance is not optimal when used under other policies (i.e., exploitation). In practice, the design distributions induced by exploitation are usually more complicated and concentrated than those from exploration, so using an exploration joint map built from a generally "wider" marginal can be regarded as a conservative stance. A natural extension is to use exploitation samples to construct new maps that would be more accurate for exploitation. In fact, these samples are readily available from the state measure update procedure described in Section 4.3. However, preliminary tests of this idea frequently caused numerical instability in the sOED algorithm when inaccurate inference evaluations led to unrealistically high KL estimates; incorporating this idea in a stable and accurate manner is a promising direction for future research.
5.5.4 Evaluating the Kullback-Leibler divergence
Evaluation of the KL divergence is a core component of information-based OED; it is a non-trivial task for non-Gaussian random variables represented by transport maps.

A straightforward method is Monte Carlo sampling. Samples from the base distribution (the one with respect to which the expectation is taken) are first obtained by sampling from the reference distribution and pushing the samples through the inverse map. Their density values can then be evaluated via formulas such as Equation 5.8, and a Monte Carlo estimate of the KL integral can be formed. While the inversion of an exact map is always possible, monotonicity is only enforced at the sample points used when constructing an approximate map (Equation 5.24). When monotonicity is lost, not only does the map inversion yield multiple roots, the density function formula also becomes invalid. Subsequently, unrealistically high values of KL divergence may surface, leading to numerical instability in the ensuing regression systems. In practice, loss of monotonicity may occur for high-dimensional and non-Gaussian distributions, especially when observations are dominated by a highly nonlinear signal, and when exploitation joint maps are attempted.
An alternative approach for estimating the KL divergence is to first apply a linear truncation of the polynomial map basis, and then use the analytic formula for the KL divergence between Gaussian random variables. Effectively, the random variables are "Gaussianized," but this procedure differs from the Laplace approximation since the linearization here is not necessarily performed at the mode. We emphasize that this approach uses Gaussian approximations only for evaluating the KL divergence; it is different from simply using Gaussian approximations throughout the sOED process, since higher-order information is still propagated through the inference. While the Monte Carlo sampling method reflects higher-order information in its KL estimate as well, the truncation approach is numerically stable and can be computed much more quickly. We thus adopt this approach for the numerical examples presented in Chapter 7.

As examples, we describe the precise truncation process for 1D and 2D maps with a 3rd-order monomial basis. A 1D map has the form

ξ = a0 + a1 z + a2 z² + a3 z³,    (5.45)
where ξ ∼ N(0, 1) is the Gaussian reference random variable and z is the target. The linear truncation is then

ξ = a0 + a1 z̃,    (5.46)

where a simple inversion yields

z̃ = (ξ − a0)/a1.    (5.47)

This form implies that z̃ ∼ N(−a0/a1, 1/a1²). The 2D case is slightly more complicated; the map now has the form
ξ0 = a0 + a1 z0 + a2 z0² + a3 z0³,    (5.48)
ξ1 = b0 + b1 z0 + b2 z1 + b3 z0² + b4 z0 z1 + b5 z1² + b6 z0³ + b7 z0² z1 + b8 z0 z1² + b9 z1³.    (5.49)

The linear truncation is then

ξ0 = a0 + a1 z̃0,    (5.50)
ξ1 = b0 + b1 z̃0 + b2 z̃1,    (5.51)
and an inversion yields

z̃0 = (ξ0 − a0)/a1,    (5.52)
z̃1 = (ξ1 − b0 − b1 z̃0)/b2 = (ξ1 − b0 − b1(ξ0 − a0)/a1)/b2,    (5.53)
which can be summarized in matrix form as

z̃ = Σ^{1/2} ξ + µ    (5.54)

with

Σ^{1/2} = [ 1/a1, 0 ; −b1/(a1 b2), 1/b2 ],    µ = [ −a0/a1 ; −b0/b2 + a0 b1/(a1 b2) ].    (5.55)
This form implies that z̃ ∼ N(µ, Σ), where Σ = (Σ^{1/2})(Σ^{1/2})⊤. For completeness, the analytic KL divergence formula for two Gaussian random variables z̃A ∼ N(µA, ΣA) and z̃B ∼ N(µB, ΣB) is

DKL(fz̃A ‖ fz̃B) = (1/2) [ tr(ΣB⁻¹ΣA) + (µB − µA)⊤ΣB⁻¹(µB − µA) − n − ln(det ΣA / det ΣB) ],    (5.56)

where n is the dimension of the random variables.
Since the map coefficients carry information that fully describes the random variable
(up to truncation), it is promising and also valuable to develop an analytic formula for KL
directly from the map coefficients in the future.
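The 2D truncation above is straightforward to implement. The sketch below (our own illustration, with arbitrary toy coefficients) assembles µ and Σ from Equation 5.55 and evaluates the Gaussian KL formula of Equation 5.56.

import numpy as np

def gaussianize_2d(a, b):
    """Linear truncation of a 2D triangular map (Eqs. 5.50-5.55).

    a, b are the monomial coefficient vectors of Equations 5.48-5.49;
    only the constant and linear terms (a0, a1, b0, b1, b2) are used."""
    S = np.array([[1.0 / a[1], 0.0],
                  [-b[1] / (a[1] * b[2]), 1.0 / b[2]]])   # Sigma^{1/2}
    mu = np.array([-a[0] / a[1],
                   -b[0] / b[2] + a[0] * b[1] / (a[1] * b[2])])
    return mu, S @ S.T

def kl_gaussian(muA, SigA, muB, SigB):
    """Analytic KL divergence between Gaussians, Equation 5.56."""
    n = len(muA)
    SigB_inv = np.linalg.inv(SigB)
    dmu = muB - muA
    return 0.5 * (np.trace(SigB_inv @ SigA) + dmu @ SigB_inv @ dmu
                  - n - np.log(np.linalg.det(SigA) / np.linalg.det(SigB)))

# Toy coefficients for the 3rd-order maps of Equations 5.48-5.49:
a = np.array([0.2, 1.5, 0.05, 0.01])
b = np.array([-0.1, 0.3, 2.0, 0.0, 0.02, 0.0, 0.0, 0.0, 0.0, 0.01])
mu, Sig = gaussianize_2d(a, b)
print(mu, Sig)
print(kl_gaussian(mu, Sig, np.zeros(2), np.eye(2)))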
Figure 5-6: Example 5.5.1: additional examples of posteriors from joint maps constructed under different d distributions, shown for six more sample realizations of (d, y). The same legend as in Figure 5-5 applies.
Chapter 6
Full Algorithm Pseudo-code for Sequential Design
Combining the approximate dynamic programming techniques from Chapter 4 and the transport map technology from Chapter 5, we present the pseudo-code of our map-based algorithm
for sequential optimal experimental design, outlined in Algorithm 3.
We will also use a grid-based version of this algorithm for comparing numerical results of
examples in Chapter 7 that involve 1D parameter space. In this variation, instead of using
a transport map to represent the posterior, a grid is used to capture its probability density
function. Whenever Bayesian inference is performed within the algorithm, the grid needs
to be adapted in order to ensure reasonable coverage and grid resolution for the posterior
density. A simple scheme first computes the unnormalized posterior density values on the
current grid, and decides whether grid expansion is needed on either side based on a threshold
that is the ratio of grid end-point density value to the grid mode density value. Second,
a uniform grid is laid over the expanded regions, with new unnormalized posterior density
values computed. Finally, a new grid over the original and expanded regions is constructed
such that the probability masses between neighboring grid points are equal—this provides
a mechanism for sparsifying the grid in regions of low density values. Results from this
grid method are used as a reference for comparison with the map-based algorithm, since the
inference computations of the former generally involve fewer approximations. With respect
to Algorithm 3, the grid method no longer requires line 3, and the inference computations
in lines 5, 7, and 12 use the grid adaptation procedure described above.
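A rough Python sketch of this grid adaptation scheme follows; the threshold, expansion step, and grid sizes are illustrative placeholders rather than the values used in our implementation.

import numpy as np

def adapt_grid(grid, unnorm_pdf, ratio_tol=1e-3, expand_step=1.0,
               n_new=50):
    """Sketch of the grid adaptation scheme described above
    (illustrative parameter names and values).

    1) Expand either end while the end-point density is too large
       relative to the (grid-approximated) mode density.
    2) Evaluate the unnormalized posterior on a dense uniform grid
       over the expanded region.
    3) Re-grid so neighboring points enclose equal probability mass,
       sparsifying regions of low density."""
    f = unnorm_pdf(grid)
    lo, hi = grid[0], grid[-1]
    while unnorm_pdf(np.array([lo]))[0] > ratio_tol * f.max():
        lo -= expand_step
    while unnorm_pdf(np.array([hi]))[0] > ratio_tol * f.max():
        hi += expand_step
    dense = np.linspace(lo, hi, 10 * n_new)
    fd = unnorm_pdf(dense)
    cdf = np.cumsum(fd)
    cdf /= cdf[-1]
    # Equal-mass regridding via inverse-CDF lookup.
    return np.interp(np.linspace(0, 1, n_new), cdf, dense)

post = lambda t: np.exp(-0.5 * (t - 2.0)**2)   # toy unnormalized posterior
new_grid = adapt_grid(np.linspace(-1, 1, 50), post)
print(new_grid[:5], new_grid[-5:])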
Algorithm 3: Algorithm for map-based sequential optimal experimental design.
1   Set parameters: Select features {zk}, ∀k, exploration measure, L, R0, R, T;
2   Initial exploration: Simulate R0 exploration trajectories by sampling θ from the prior, dk from the exploration measure, yk from the likelihood, ∀k, without inference;
3   Make exploration joint map: Construct Texplore from these samples;
4   for ℓ = 1, . . . , L do
5       Exploration: Simulate R exploration trajectories by sampling θ from the prior, dk from the exploration measure, yk from the likelihood, ∀k, with inference using Texplore;
6       Store all states visited: X^ℓ_{k,explore} = {x^r_k}_{r=1}^{R}, ∀k;
7       Exploitation: (if ℓ > 1) Simulate T exploitation trajectories by sampling θ from the prior, dk from the one-step lookahead policy µ^{ℓ−1}_k(xk) = argmax_{d′k} Eyk|xk,d′k[gk(xk, yk, d′k) + J̃^{ℓ−1}_{k+1}(Fk(xk, yk, d′k))], and yk from the likelihood, ∀k, with inference using Texplore;
8       Store all states visited: X^ℓ_{k,exploit} = {x^t_k}_{t=1}^{T}, ∀k;
9       Approximate value iteration: Construct the J̃^ℓ_k functions via backward induction using the new regression points {X^ℓ_{k,explore} ∪ X^ℓ_{k,exploit}}, ∀k, as described by the loop below;
10      for k = N − 1, . . . , 1 do
11          for rt = 1, . . . , R + T, where the x^{rt}_k are all members of {X^ℓ_{k,explore} ∪ X^ℓ_{k,exploit}} do
12              Compute training values Ĵ^ℓ_k(x^{rt}_k) = max_{d′k} Eyk|x^{rt}_k,d′k[gk(x^{rt}_k, yk, d′k) + J̃^ℓ_{k+1}(Fk(x^{rt}_k, yk, d′k))], with inference performed using Texplore;
13          end
14          Construct J̃^ℓ_k = ΠĴ^ℓ_k by regression on the training values;
15      end
16  end
17  Extract final policy parameterization: J̃^L_k, ∀k;
k
k
Chapter 7
Numerical Results
We present several numerical examples of the sequential optimal experimental design (sOED)
problem in this chapter. Each example serves different purposes in highlighting various
properties and observations. Through them, we demonstrate
• Linear-Gaussian problem (Section 7.1):
– ability of the numerical methods developed in this thesis in solving an sOED
problem, where the analytic solution is available and can be compared to
– agreement between results generated from analytic, grid, and map representations
of the belief state, along with their associated inference methods
• 1D contaminant source inversion problem (Section 7.2):
– Case 1: advantages of sOED over batch (open-loop) design
– Case 2: advantages of sOED over greedy (myopic) design
– Case 3: performance of sOED using the map method, and comparison to the grid
method (as reference solution)
• 2D contaminant source inversion problem (Section 7.3):
– ability of the numerical methods in handling complicated situations of multiple
experiments and dimensions
Details of these numerical examples are described in the subsequent sections.
7.1 Linear-Gaussian problem

7.1.1 Problem setup
Consider a forward model that is linear with respect to the parameters, with no physical state component, and where observations are corrupted by additive Gaussian noise:

yk = G(θ, dk) + ǫ = θdk + ǫ.    (7.1)

The prior on θ is N(s0, σ0²), ǫ ∼ N(0, σǫ²), and d ∈ [dL, dR]. The resulting inference problem on θ has a conjugate Gaussian structure, and all subsequent posteriors are Gaussian, with parameters given by
(s_{k+1}, σ_{k+1}²) = ( [ (yk/dk)/(σǫ²/dk²) + sk/σk² ] / [ 1/(σǫ²/dk²) + 1/σk² ] , [ 1/(σǫ²/dk²) + 1/σk² ]⁻¹ ).    (7.2)
We consider the design of N = 2 experiments, with s0 = 0, σ0 = 3, σǫ = 1, dL = 0.1,
dR = 3.
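For reference, the update in Equation 7.2 is a two-line computation; the sketch below (our own illustration) applies it along a simulated two-experiment trajectory with an arbitrary design sequence.

import numpy as np

def posterior_update(s_k, var_k, y_k, d_k, var_eps=1.0):
    """Conjugate Gaussian update of Equation 7.2."""
    prec = d_k**2 / var_eps + 1.0 / var_k             # posterior precision
    s_next = ((y_k / d_k) * (d_k**2 / var_eps) + s_k / var_k) / prec
    return s_next, 1.0 / prec

# Simulate the two-experiment design problem of this section.
rng = np.random.default_rng(4)
s, var = 0.0, 3.0**2                                  # prior N(s0, sigma0^2)
theta = rng.normal(s, np.sqrt(var))
for d in (1.0, 2.5):                                  # some design sequence
    y = theta * d + rng.standard_normal()             # Equation 7.1
    s, var = posterior_update(s, var, y, d)
    print(f"d={d}: posterior N({s:.3f}, {var:.3f})")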
Three methods of belief state representation and inference are studied in this example:
• analytic representation:¹ xk,b = (sk, σk²), with exact inference using Equation 7.2;
• grid representation of the posterior density function: xk,b is a grid on f(θ|Ik), and the simple grid adaptation scheme described in Chapter 6 is used for inference; and
• map representation: xk,b is the set of posterior map coefficients, with inference performed by conditioning on a joint map composed of a total-order polynomial basis (with sparsification in the dk and θ dimensions, as illustrated in Equation 5.40) that is constructed using trajectories from a prescribed exploration policy.

For this linear-Gaussian problem, the grid method uses grids of 50 nodes; the map method uses monomial basis functions of total order 3. The joint map has a total of N(nd + ny + nθ) = 6 dimensions and 129 basis terms, and the coefficients are determined using 10⁵ exploration trajectories with the exploration policy designated by dk ∼ N(1.25, 0.5²). All posterior maps are 1D 3rd-order polynomials and thus have 4 coefficients.

¹For simplicity, these different methods of belief state representation will also be referred to as the "analytic method," "grid method," and "map method" in this chapter.
The reward functions used are

gk(xk, yk, dk) = 0,    (7.3)
gN(xN) = DKL(fθ|IN(·|IN) ‖ fθ(·)) − 2(ln σN² − ln 2)²,    (7.4)

for k = 0, 1. The terminal reward is a combination of information gain and a penalty on deviating from a log-variance target. The latter increases the difficulty of this problem by moving the optimal policy away from the design space boundary and avoiding constructions of fortuitous policies.² The analytic formula for the Kullback-Leibler (KL) divergence between two univariate Gaussians involves operations on the mean and log-variance of the Gaussians; this motivates the selection of value function features φk,i (in Equation 4.2) to be 1, sk, ln(σk²), sk², ln(σk²)², and sk ln(σk²). The features are evaluated by trapezoidal-rule integration for the grid method, and by inversion of a linear truncation for the map method. The KL divergence is approximated by first estimating the mean and variance using these techniques, and then applying the analytic KL formula for Gaussians. Since we know the posteriors should all be Gaussian in this example, these approximations are expected to be quite accurate. L = 3 iterations of state measure update are conducted, with regression points generated by exploration only for ℓ = 1, and by 30% exploration and 70% exploitation for subsequent iterations. The analytic method uses 1000 regression points, while the grid and map methods use 500.
The policies generated from different methods are compared by applying them in 1000
simulated trajectories; this procedure is summarized in Algorithm 4. Each policy is first
applied in trajectories under the same belief state representation method originally used to
construct that policy. Then, inference is performed on the resulting sequence of designs and
observations using a common evaluation framework, regardless of how the trajectory is produced: we use the analytic method as this common framework in this example. This ensures
a fair comparison between policies, where the designs are produced using the “native” belief
state representation that the policy was originally created for, while all final trajectories
are evaluated using a “common” method. We note that there is also a distribution for the
final policy due to the randomness involved in the numerical methods (e.g., repeating the
algorithm to construct the policy would not result in exactly the same policy, simply due to the different random numbers being used in the simulations). We currently do not take this policy distribution into account in the assessment; instead, only a single policy realization is used to generate all 1000 trajectories. A more comprehensive study that repeats the policy construction algorithm many times may be conducted in the future, although such an undertaking would be extremely expensive.

²Without the second term in the terminal reward, the optimal policies would always be those that lead to the highest achievable signal, which occurs at the dk = 3 boundary. The method is then more vulnerable to fortuitously producing policies that lead to boundary designs even when the overall value function approximation is poor.
Algorithm 4: Procedure for evaluating policies by simulating trajectories.
1. Select the “native” belief state representation used to generate the policy: for example, analytic, grid, or map; see Section 7.1.1.
2. Construct the policy: use the native belief state representation and the numerical methods developed in this thesis for solving the sOED problem.
3. for q = 1, . . . , ntrajectories do
4.   Apply the policy: generate a trajectory using the native belief state representation: sample θ from the prior, evaluate dk by applying the constructed policy, and sample yk from the likelihood, for k = 0, . . . , N − 1.
5.   Evaluate rewards via a “common” evaluation framework: perform inference on the dk and yk values from this trajectory and evaluate all rewards, using the analytic state representation.
6. end
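In code form, Algorithm 4 amounts to the following loop (a Python sketch; the policy, inference, sampling, and reward objects are placeholders standing in for the implementations developed in this thesis):

    import numpy as np

    def evaluate_policy(policy, native, common, sample_prior, sample_likelihood,
                        total_reward, N, n_traj=1000):
        # Algorithm 4 sketch: apply the policy under its native belief state
        # representation, then evaluate rewards under a common framework.
        totals = np.empty(n_traj)
        for q in range(n_traj):
            theta = sample_prior()
            belief = native.prior_belief()
            history = []
            for k in range(N):
                d = policy(k, belief)                  # design from native belief
                y = sample_likelihood(theta, d, k)     # simulated observation
                belief = native.update(belief, d, y, k)
                history.append((d, y))
            totals[q] = total_reward(common, history)  # common evaluation framework
        return totals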
7.1.2
Results
Since this example has a horizon of N = 2, only J̃1 is constructed via function approximation, while J2 is directly (and, for the grid and map methods, numerically) evaluated. These surfaces, plotted as functions of the posterior mean and variance, are shown in Figure 7-1, along with the regression points used to build them. Excellent agreement across all three methods is observed, and there is a noticeable change in the distribution of regression points from ℓ = 1 (regression points from exploration only) to ℓ = 2 (regression points from a mixture of exploration and exploitation), leading to a better approximation of the policy-induced state measure. The regression points appear to be grouped more closely together for ℓ = 1, even though they are from exploration, because exploration in fact covers a large region of dk space that leads to small values of σk². In this simple example, the particular choice of exploration design measure did not lead to a noticeable negative impact on the total reward for ℓ = 1 (Figure 7-5). However, this can easily be reversed for problems with more complicated value functions and less suitable choices of exploration design measures.
Figure 7-1: Linear-Gaussian problem: J̃1 surfaces (plotted over the posterior mean s1 and posterior variance σk²) and the regression points used to build them. Panels (a), (b), and (c) correspond to the analytic, grid, and map methods; the left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
Histograms for d0 and d1 are shown in Figures 7-2 and 7-3. While overall the agreement between the three methods is quite good, d0 for the grid and map methods at ℓ = 2 and ℓ = 3 is concentrated slightly to the left compared to the analytic method. The opposite is true for d1, where the grid and map methods at ℓ = 2 and ℓ = 3 are slightly to the right of the analytic method. This is because the optimal policy is not unique for this problem, and there is a natural notion of exchangeability between the two experimental designs d0 and d1. With no stage cost, the overall objective of this problem is the expected KL divergence plus the expected distance of the final log-variance from the target log-variance. This quantity can be shown to be a function of only the final variance, which is determined exactly given values of dk through Equation 7.2 (it is not affected by the observations yk). In fact, this linear-Gaussian problem (with constant noise variance) is a deterministic problem, and the optimal policy is reducible to optimal designs d0* and d1*. Batch design would produce the same optimal designs as sOED for deterministic problems. An analytic derivation of the optimal designs and the expected utility surface for this problem is presented in Appendix B, with
d0*² + d1*² = (1/9) [ exp( (18014398509481984 ln 3 − 5117414861322735) / 9007199254740992 ) − 1 ],   (7.5)

and

U(d0*, d1*) ≈ 0.783289.   (7.6)
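As a quick numerical check (not part of the derivation in Appendix B), the constant in Equation 7.5 can be evaluated directly in a couple of lines:

    import math

    # Right-hand side of Equation 7.5: the squared radius of the optimal
    # (d0, d1) design front.
    r2 = (math.exp((18014398509481984 * math.log(3) - 5117414861322735)
                   / 9007199254740992) - 1) / 9
    print(r2)  # approximately 0.4555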
Indeed, there is a “front” of optimal designs, as there are different combinations of d0 and
d1 that together lead to the underlying optimal final variance. The pairwise (d0 , d1 ) scatter
plots for 1000 simulated trajectories are shown in Figure 7-4, and superimposed on the
analytic expected utility surface. From the expected utility surface and the optimal design
front, we can immediately see the symmetry between the two designs. Furthermore, we
can now clearly understand the earlier observations, that the optimizer is hovering around
different parts of the optimal front in different cases. The expected utility surface also
appears quite flat around the optimal design front, thus we expect all these methods to
have performed fairly well. Indeed, the histograms of total rewards and their mean from
the simulated trajectories presented in Figure 7-5 and Table 7.1 show good agreement with
each other and the optimal expected utility, with grid and map methods exhibiting slightly
lower mean values but all within Monte Carlo standard error. For contrast, the exploration
policy produces a much lower mean reward of −8.5.
            ℓ = 1    ℓ = 2    ℓ = 3
Analytic    0.77     0.78     0.78
Grid        0.74     0.76     0.75
Map         0.77     0.75     0.75

Table 7.1: Linear-Gaussian problem: total reward mean values (of the histograms in Figure 7-5) from 1000 simulated trajectories. Monte Carlo standard errors are all ±0.02.
The pairwise and marginal kernel density estimates (KDEs) from samples used to construct the joint exploration map, and samples generated from the resulting map, are shown in Figure 7-6. Excellent agreement is observed between the two sets of KDEs.
Figure 7-2: Linear-Gaussian problem: d0 histograms from 1000 simulated trajectories. Panels (a), (b), and (c) correspond to the analytic, grid, and map methods; the left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
As evident from, for example, the pairwise KDEs between dk and yk, the joint distribution is in general not Gaussian even for a linear-Gaussian problem (and even with Gaussian marginals on dk from the prescribed exploration design measure); this is discussed from a theoretical perspective in Example 5.5.1.
In summary, for the linear-Gaussian example we have shown that the numerical results from sOED agree with the analytic optimum. Furthermore, we have demonstrated agreement between the results from the analytic, grid, and map methods, using their associated inference methods. This is also a starting point in displaying the strength of the transport map technology, as well as of the overall method for solving the sOED problem developed in this thesis. Moreover, the grid method can now be trusted as a comparison reference for the upcoming 1D nonlinear non-Gaussian example, where an analytic representation of the belief state is not possible.
Figure 7-3: Linear-Gaussian problem: d1 histograms from 1000 simulated trajectories. Panels (a), (b), and (c) correspond to the analytic, grid, and map methods; the left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
7.2
1D contaminant source inversion problem
Consider a situation where a chemical contaminant is accidentally released into the air. The contaminant plume diffuses and is carried by wind, posing a great danger to the general public. It is crucial to infer the source location of the contaminant, so that appropriate response actions may be taken to eliminate this threat. A state-of-the-art robotic vehicle is dispatched to take contaminant concentration measurements at a sequence of different locations under a fixed time schedule. We seek the optimal policy of where the vehicle should move to take measurements, in order to obtain the highest expected information gain about the source location.
Figure 7-4: Linear-Gaussian problem: (d0, d1) pair scatter plots from 1000 simulated trajectories superimposed on the analytic expected utility surface. Panels (a), (b), and (c) correspond to the analytic, grid, and map methods; the left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
For simplicity, assume the mean contaminant concentration G (scalar), with source location θ, measured at location z and time t, has the value

G(θ, z, t) = s / ( √(2π) · 2 √(0.3 + Dt) ) exp( − ‖θ + dw(t) − z‖² / ( 2(4)(0.3 + Dt) ) ),   (7.7)
where s, D, and dw(t) are the known source intensity, diffusion coefficient, and cumulative net displacement due to wind up to time t, respectively (their values will be specified later). A total of N measurements are taken, uniformly spaced in time, with the relationship t = k + 1 (while t is a continuous variable, it corresponds to the experiment index via this relationship; hence, y0 is taken at t = 1, y1 at t = 2, etc.).
Figure 7-5: Linear-Gaussian problem: total reward histograms from 1000 simulated trajectories. Panels (a), (b), and (c) correspond to the analytic, grid, and map methods; the left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error.
The state is a combination of belief and physical state components. The grid and map methods of belief state representation, introduced in Section 7.1.1, are studied here (the analytic method is no longer available, since the nonlinear forward model leads to general non-Gaussian posteriors). The relevant physical state is the current location of the vehicle, xk,p = z; the inclusion of the physical state is necessary since the optimal design is expected to depend on the vehicle position as well. The movement constraint of the vehicle for the next time unit is described by a box constraint dk ∈ [−dL, dR], where dL and dR reflect its movement range. The physical state dynamics then simply describe position and displacement: xk+1,p = xk,p + dk.
128
d0
y0
d1
y1
θ
θ
(a) Samples used for map construction
d0
y0
d1
y1
θ
θ
(b) Samples generated from map
Figure 7-6: Linear-Gaussian problem: samples used to construct the exploration map and
samples generated from the resulting map.
The concentration measurements are corrupted by additive Gaussian noise:

yk = G(θ, xk+1,p, k + 1) + εk(xk, dk),   (7.8)

where the noise εk ∼ N(0, σ²εk(xk, dk)) may depend on the state and the design. When simulating a trajectory, the physical state needs to be propagated first, before an observation yk can be generated, since the latter requires the evaluation of G at xk+1,p. Once yk is obtained, the belief state can then be propagated via Bayesian inference.
The reward functions used in this problem are

gk(xk, yk, dk) = −cb − cq ‖dk‖₂²,   (7.9)

gN(xN) = DKL( fθ|IN(·|IN) ‖ fθ(·) ),   (7.10)

for k = 0, . . . , N − 1. The terminal reward is simply the KL divergence, and the stage reward consists of a base cost of operation plus a penalty that is quadratic in the vehicle movement distance.
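As a minimal sketch (Python), the forward model of Equation 7.7, the noisy observation of Equation 7.8, and the stage reward of Equation 7.9 can be written as follows; the wind displacement dw and the noise standard deviation are passed in as arguments, since they vary case by case below:

    import numpy as np

    def G(theta, z, t, s, D, dw):
        # Mean concentration, Equation 7.7; theta and z may be scalars (1D case)
        # or 2D vectors.
        r2 = np.sum((np.atleast_1d(theta) + dw(t) - np.atleast_1d(z))**2)
        return s / (np.sqrt(2.0 * np.pi) * 2.0 * np.sqrt(0.3 + D * t)) \
               * np.exp(-r2 / (2.0 * 4.0 * (0.3 + D * t)))

    def observe(theta, z_next, k, sigma_eps, rng, **model):
        # Noisy measurement, Equation 7.8, taken at the propagated vehicle
        # position x_{k+1,p} = z_next, at time t = k + 1; model holds s, D, dw.
        return G(theta, z_next, k + 1, **model) + sigma_eps * rng.standard_normal()

    def stage_reward(d, c_b, c_q):
        # Equation 7.9: base operation cost plus quadratic movement penalty.
        return -c_b - c_q * np.sum(np.atleast_1d(d)**2)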
We start with the 1D version of the problem, where θ, dk , and xk,p are scalars (i.e., the
plume and vehicle are confined to movements in a line). Problem and algorithm settings
common to all 1D cases can be found in Tables 7.2 and 7.3, and additional variations will
be described in each of the following case subsections.
                                          1D case     2D case
Number of experiments N                   2           3
Prior on θ                                N(0, 2²)    N([0, 0]ᵀ, diag(2², 2²))
Design constraints on dk                  [−3, 3]     [−5, 5]²
Initial physical state x0,p               5.5         (5, 5)
State measure updates L                   3           3
Concentration strength s                  30          50
Diffusion coefficient D                   0.1         1
Base operation cost cb                    0.1         0
Quadratic movement cost coefficient cq    0.1         0.03

Table 7.2: Contaminant source inversion problem: problem settings.
                                                   1D case     2D case
Grid method: number of grid points                 100         —
Map method: map total order                        3           3
Map method: number of map construction samples     10⁶         5 × 10⁵
Exploration policy measure on dk                   N(0, 2²)    N([0, 0]ᵀ, diag(3², 3²))
Total number of regression points                  500         1000
% of regression points from exploration            30%         20%
Max number of optimization iterations              50          30
Monte Carlo sample size in optimization            100         10
Robbins-Monro harmonic gain sequence multiplier    5           15

Table 7.3: Contaminant source inversion problem: algorithm settings.
7.2.1
Case 1: comparison with greedy (myopic) design
This case highlights the advantage of sOED over greedy design, which is accentuated when there are future factors that are important for designing the current experiments. We illustrate this via the wind factor: the air is calm initially, and then a constant wind of velocity 10 commences at t = 1, leading to the following cumulative net displacement due to wind up to time t:
dw(t) = { 0, t < 1; 10(t − 1), t ≥ 1 }.   (7.11)
Intuitively, greedy design is not able to take the wind into account when designing the first experiment. Batch design (not presented in this case), however, would be able to. The observation noise standard deviation is set to σεk = 2.
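In code, Equation 7.11 is a direct transcription:

    def dw_case1(t):
        # Cumulative net wind displacement, Equation 7.11 (case 1).
        return 0.0 if t < 1 else 10.0 * (t - 1)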
For sOED, only the grid method (described in Section 7.1.1, but now using 100 nodes) is used in this case, for demonstration purposes (we focus on comparing sOED with greedy design here; the map method will be studied later in Case 3). Motivated by the analytic KL divergence formula between Gaussians, the value function features are selected to be 1, the posterior mean, the log-variance, the physical state, and their squares and cross terms, for a total of 10 terms (see the sketch below). The moments are evaluated by trapezoidal-rule integration. The KL divergence is approximated by first estimating the mean and variance using this technique, and then applying the analytic KL formula for Gaussians. No state measure update is performed in this case (i.e., L = 1); the effects of state measure updates will be studied later in Case 3.
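A minimal sketch of these 10 features (Python), assuming the belief state is summarized by a posterior mean m and variance v estimated on the grid, with z the scalar physical state:

    import numpy as np

    def case1_features(m, v, z):
        # 1, the three base quantities (mean, log-variance, physical state),
        # their squares, and their pairwise cross terms: 1 + 3 + 3 + 3 = 10.
        lv = np.log(v)
        return np.array([1.0, m, lv, z,
                         m**2, lv**2, z**2,
                         m * lv, m * z, lv * z])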
For greedy design, the same grid method is used to represent the belief state. Similar to
the linear-Gaussian problem, the policies generated from different methods are compared by
applying them in 1000 simulated trajectories using Algorithm 4, except that the common
evaluation framework at the end now uses a high-resolution grid method with 1000 nodes,
since the analytic method is no longer available for this non-Gaussian setting.
Before presenting the results, we first provide some intuition on the physical phenomenon through Figure 7-7, which shows the progression of a sample trajectory. The left figure displays the physical space, with the robotic vehicle starting at the black square location. For the first experiment, it moves to a new location and acquires the noisy observation indicated by the blue cross, while the solid blue curve indicates the plume signal profile G at that time. For the second experiment, the vehicle moves to another new location and acquires the noisy observation indicated by the red cross, while the dotted red curve indicates the plume signal profile G at that time, after it has diffused slightly and been carried to the right by the wind. The right figure shows the corresponding belief state density functions at different stages, constructed using the grid method. Starting from the solid blue prior density, the dashed red posterior density after the first experiment is only slightly narrower, since the first observation (blue cross) is in a region dominated by the measurement noise. The dotted yellow final posterior after both experiments, however, becomes much narrower, as the second observation (red cross) is in the high-gradient region of the plume profile (and thus carries high information for identifying θ). The posteriors can become quite non-Gaussian and even multimodal. The black circle indicates the true θ value; the posterior modes do not necessarily match this value, due to the noisy measurements and the finite number of observations.
Figure 7-7: 1D contaminant source inversion problem, case 1: (a) physical state and plume progression, and (b) belief state density progression, of a sample trajectory.
The pairwise (d0, d1) scatter plots for 1000 simulated trajectories are shown in Figure 7-8. Greedy designs generally move towards the left in the first design (negative values of d0), since for almost all realizations of θ (generated from the prior), the main part of the plume starts to the left of the initial vehicle location. When designing the first experiment, greedy design does not know that there will be a second experiment and that the wind will blow the plume back to the right, and thus exerts a great effort to move to the left. Similarly, when designing the second experiment, it then chases the plume, which is now on its right (positive values of d1). sOED, however, generally starts heading to the right in the first experiment right away, so that it can arrive in the regions of higher information gain in time for the second experiment, after the plume has been carried by the wind. In both approaches, there are a few cases where d1 is very close to zero. These correspond to cases where θ is sampled from the right tail of the prior, placing the plume much closer to the initial vehicle location. As a result, a high amount of information is obtained from the first observation. The plume is subsequently carried 10 units to the right by the wind, and the vehicle cannot reach regions that yield a high enough amount of information in the second experiment to justify its d1 movement cost. The best action is then to simply stay put. The “chasing” tendency of greedy design turns out to be costly overall, due to the quadratic movement penalty. This is reflected in Figure 7-9, which shows histograms of the total rewards from the trajectories. sOED yields a mean reward of 0.12 ± 0.02, whereas greedy design produces a much lower mean reward of 0.07 ± 0.02; the plus-minus quantity is 1 standard error.
Figure 7-8: 1D contaminant source inversion problem, case 1: (d0, d1) pair scatter plots from 1000 simulated trajectories for (a) greedy design and (b) sOED.
7.2.2
Case 2: comparison with batch (open-loop) design
This case highlights the advantage of sOED over batch design, which is accentuated when there is information useful for designing experiments that can be obtained by performing some of the experiments first (i.e., feedback). We illustrate this via different measurement devices: the robotic vehicle carries two measuring instruments, a “rough” device that achieves an observation noise standard deviation of σεk = 2, and a “precise” device with σεk = 0.5. The precise device is much more expensive to operate.
Figure 7-9: 1D contaminant source inversion problem, case 1: total reward histograms from 1000 simulated trajectories for (a) greedy design and (b) sOED. The plus-minus quantity is 1 standard error.
Fortunately, the device cost is charged to the funding agency, and is not reflected in our reward functions. However, the agency only permits (and requires) its use in promising situations where the current posterior variance is below a threshold of 3.0 (recall that the prior variance is 4.0).³ The observation noise standard deviation is then

σεk(xk,b) = { 0.5, if the variance corresponding to xk,b is below 3; 2, otherwise }.   (7.12)
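As a sketch, Equation 7.12 reduces to a one-line rule, given the posterior variance extracted from whichever belief state representation is in use:

    def noise_std_case2(posterior_variance):
        # Equation 7.12: the precise device (sigma = 0.5) is permitted once the
        # current posterior variance falls below 3; otherwise the rough device.
        return 0.5 if posterior_variance < 3.0 else 2.0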
Intuitively, batch design is not able to use the first observation to update the belief state, and is thus unable to take advantage of the feedback of information and, with it, the opportunity to use the precise device. Greedy design (not presented in this case), however, would be able to. The same wind conditions from Equation 7.11 are also applied.
The same grid method setup as Case 1 is used for both sOED and batch design. Policies
are also compared using the same technique as Case 1 with 1000 simulated trajectories, but
with one caveat. For this case, since the measurement noise is dependent on the belief state
and therefore the method of belief state representation, the noise standard deviation is also
recorded as the observations are generated. The correct corresponding standard deviation
is used when inference is performed on the common evaluation framework. In other words,
while the belief state governs which measuring device is used, we would know which device
is in fact used to obtain any particular observation.
³Which instrument is used is then not a design decision.
The pairwise (d0 , d1 ) scatter plots for 1000 simulated trajectories are shown in Figure 7-10. As expected, batch design is able to account for the future wind effect, and starts
moving to the right for the first experiment so that it can arrive in the regions of higher
information in time for the second experiment after the plume has been carried by the wind.
sOED, however, realizes that there is the possibility of using the precise device if it can
reduce the posterior variance to less than 3 from the first observation. Thus it moves to the
left towards the plume location in the first experiment to get a more informative observation
even though the movement cost is higher. Roughly 55% of these trajectories achieve the
requirement for using the precise device in the second experiment, and they produce a mean
reward of 0.51 in contrast to −0.01 for trajectories that fail to qualify for this technology.
Effectively, sOED has taken a risk in order to achieve an overall higher expected reward.
The risk factor is not in the current problem formulation, but it certainly should be considered in practice, especially for such crucial missions where there is perhaps only one chance
to ensure public safety. The histograms of total rewards from all trajectories are shown in Figure 7-11. The risk taken by sOED indeed pays off, as it sees a mean reward of 0.28 ± 0.02, whereas batch design produces a much lower mean reward of 0.11 ± 0.02; the plus-minus quantity is 1 standard error.
Figure 7-10: 1D contaminant source inversion problem, case 2: (d0, d1) pair scatter plots from 1000 simulated trajectories for (a) batch design and (b) sOED. Roughly 55% of the sOED trajectories qualify for the precise device in the second experiment; however, there is no particular pattern or clustering of these designs, so they are not separately color-coded in the scatter plot.
Figure 7-11: 1D contaminant source inversion problem, case 2: total reward histograms from 1000 simulated trajectories for (a) batch design and (b) sOED. The plus-minus quantity is 1 standard error.
7.2.3
Case 3: sOED grid and map methods
This case investigates the performance of the map method under the sOED algorithm developed in this thesis. The same wind conditions from Equation 7.11 are applied, and a two-tier measuring device system similar to that of Equation 7.12 is implemented with slightly different parameters:

σεk(xk,b) = { 0.2, if the variance corresponding to xk,b is below 2; 2, otherwise }.   (7.13)

This case setting is chosen so that sOED can show an advantage over both greedy and batch designs at the same time.
For sOED, the grid and map methods described in Section 7.1.1 are studied. The settings for both methods can be found in Table 7.3. The map method uses monomial basis functions of total order 3. The joint map has a total of N(nd + ny + nθ) = 6 dimensions and 129 basis terms, and the coefficients are determined using 10⁶ exploration trajectories with the exploration policy designated by dk ∼ N(0, 2²). All posterior maps are 1D 3rd-order polynomials and thus have 4 coefficients. Furthermore, for the map method, the moments used in the value function features are estimated by inverting the linear truncation. The KL divergence is approximated by first estimating the mean and variance using this technique, and then applying the analytic KL formula for Gaussians. In addition to the moment-based value function features, features composed of 1st- and 3rd-degree total-order polynomials in the posterior map coefficients and the physical state are also investigated. The main advantages of such a construction are the accessibility of the map coefficients (especially in multidimensional parameter spaces), and the fact that the coefficients carry all the information about the posterior (more than just the mean and variance). There are some caveats in the formulation of these features; we defer a detailed discussion to Section 7.3. For greedy and batch designs, the same grid method setup is used. Policies are also evaluated using the same technique as Case 2, with 1000 simulated trajectories.
sOED results using grid and map methods
We start with a comparison between the grid and map methods using the sOED algorithm. The analytic method is no longer appropriate for this 1D source inversion problem, since the forward model is nonlinear and the posteriors are non-Gaussian. As a result, having established the agreement between the analytic and grid methods in the linear-Gaussian problem, we now use the grid method as a reference against which to compare the map method.

Histograms for d0 and d1 from the grid and map methods are shown in Figures 7-12 and 7-13. Excellent agreement is observed for d0 between the two methods, while d1 from the map method is generally of lower values than that from the grid method. Since there are only N = 2 experiments in this problem, only J̃1 is constructed via function approximation, while J2 is directly and numerically evaluated. The fact that the d0 values from the grid and map methods are similar implies that J̃1 (and therefore the policy) generated from the two methods is in good agreement. The discrepancy in d1 then must be due to the less accurate inference and KL computations of the map method. This difference is also reflected in the total rewards, shown in Figure 7-14 and Table 7.4, where the mean rewards from the map method are generally slightly lower than those from the grid method.
There are two main approaches to improving the map quality: (1) improve the map construction process, and (2) use more relevant samples for the map construction. The joint map is currently constructed using exploration trajectory samples, with Figure 7-15 showing the pairwise and marginal KDEs from samples used to construct the exploration joint map, and samples generated from that map. Overall, the joint distribution appears quite non-Gaussian, with heavy tails especially in the dimensions involving y1, rendering the mapping to standard Gaussian random variables more nonlinear and thus more challenging to represent.
Figure 7-12: 1D contaminant source inversion problem, case 3: d0 histograms from 1000 simulated trajectories for the sOED (a) grid and (b) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
While the current map captures these features reasonably well, increasing the order of the polynomial basis beyond degree 3 is expected to further improve its performance. However, a higher-order polynomial basis brings new challenges as well. First, the map construction, evaluation, conditioning, and sampling procedures all require more computation (for example, moving to a 5th-order polynomial basis would increase the number of map coefficients from the current 129 to 467). Second, a higher-order polynomial is also more prone to losing monotonicity, making sampling and density evaluation difficult. Currently, we use 10⁶ samples to build the map. From experience, this sample size is much higher than needed in practice for producing reasonably accurate results for this 6D 3rd-order map, but we choose it intentionally in order to minimize this particular source of error.
                                          ℓ = 1    ℓ = 2    ℓ = 3
Grid                                      0.14     0.12     0.19
Map (moment features)                     0.13     0.16     0.16
Map (coefficient features, 1st-order)     0.12     0.14     0.12
Map (coefficient features, 3rd-order)     0.15     0.18     0.15
Batch design                              0.11
Greedy design                             0.09

Table 7.4: 1D contaminant source inversion problem, case 3: total reward mean values from 1000 simulated trajectories; the Monte Carlo standard errors are all ±0.02. The grid and map cases are all from sOED.
Figure 7-13: 1D contaminant source inversion problem, case 3: d1 histograms from 1000 simulated trajectories for the sOED (a) grid and (b) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
Another approach to improving the map representation is to use more relevant samples for its construction. The exploration map places much computational effort on ensuring accuracy over a wide region of the state space that may not be visited by the exploitation policies. As a direction of future research, the adaptation of the joint map to exploitation trajectory samples as they become available is also expected to further improve the performance of the map method.
Lastly, value function features based on map coefficients produced similar histograms of
d0 , d1 , and rewards, and these plots are omitted. Their mean rewards from 1000 trajectories
are shown in Table 7.4. While the 1st-order coefficient features perform only slightly worse
than the moment features, the 3rd-order coefficient features are able to achieve a similar level
of mean reward. These observations provide good motivation and support for using map
coefficients as features in higher-dimensional problems, especially where posteriors depart
further from normality and higher-order moment information becomes important.
Figure 7-14: 1D contaminant source inversion problem, case 3: total reward histograms from 1000 simulated trajectories for the sOED (a) grid and (b) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error.
Comparison between sOED, batch, and greedy designs
We now focus on comparisons between different design approaches. For simplicity, only
the ℓ = 3 grid sOED results are used in this part. All batch and greedy design results
are produced using the grid method. Intuitively, one would expect this problem to show
an advantage of sOED over both batch and greedy designs: batch design does not have a
feedback mechanism and thus is unable to make use of the precise device; greedy design
does not look into the future and thus is unable to account for the wind that will blow the
plume to the right.
Indeed, this is supported by the pairwise (d0, d1) scatter plots shown in Figure 7-16. Batch design foresees the wind and moves towards the right immediately, but abandons the chance to use the precise device, while greedy design chases after the high-information regions of the plume and incurs a high movement cost. sOED is able to balance the knowledge of the wind and of the precise device, moving slightly to the left in the first design in order to have a chance to qualify for the precise device in the second experiment (these turn out to be cases where the initial plume starts to the right of the origin) before shifting to the right in the second experiment.
Figure 7-15: 1D contaminant source inversion problem: samples used to construct the exploration map (a) and samples generated from the resulting map (b).
The mean rewards are shown in Table 7.4, where batch and greedy designs achieve lower values than any of the sOED variants. The sOED map methods outperform batch and greedy designs despite their less accurate inference and KL computations compared to their grid counterpart. For contrast, the exploration policy produces a much lower mean reward of around −0.5.
In summary, for the three cases of the 1D contaminant source inversion problem, we have demonstrated the advantages of sOED over batch and greedy designs in realistic situations. Furthermore, with the sOED grid method used as a comparison reference, the map method has shown good performance while employing both moment-based and map coefficient-based value function features.
Figure 7-16: 1D contaminant source inversion problem, case 3: (d0, d1) pair scatter plots from 1000 simulated trajectories for (a) batch design, (b) greedy design, and (c) sOED using the grid method. The sOED result here is for ℓ = 1.

Figure 7-17: 1D contaminant source inversion problem, case 3: total reward histograms from 1000 simulated trajectories using (a) batch and (b) greedy designs. The plus-minus quantity is 1 standard error.
7.3
2D contaminant source inversion problem
7.3.1
Problem setup
Consider a 2D version of the contaminant source inversion problem described in Section 7.2,
where now θ = [θ0 , θ1 ]⊤ , dk = [dk,0 , dk,1 ]⊤ , z = [z0 , z1 ]⊤ , and xk,p = [xk,p,0 , xk,p,1 ]⊤ are
2D vectors (i.e., the plume and vehicle are confined to movements in a 2D physical space).
The air is calm initially, and then a variable wind commences at t = 1. This leads to the
following values of cumulative net displacement due to wind at the time points coinciding
with the experiments:

dw(t = 1) = [0, 0]ᵀ,   dw(t = 2) = [0, 5]ᵀ,   dw(t = 3) = [5, 10]ᵀ.   (7.14)
The precise evolution of the wind between these time points is not relevant, since only these integrated quantities directly affect the contaminant profile in Equation 7.7. The concentration measurements are corrupted by additive Gaussian noise as described by Equation 7.8, with the noise variable having a constant standard deviation: εk ∼ N(0, 0.5²). Additional problem and algorithm settings are summarized in Tables 7.2 and 7.3.
In this multidimensional problem, the grid method is no longer practical. Such an implementation would require a sophisticated 2D grid adaptation strategy for inference in order to capture the posteriors with sufficient resolution; the overall setup would be very computationally expensive. The map method, however, is capable of accommodating multidimensional parameters relatively easily. For this problem, we employ a map method using polynomial basis functions of total order 3. With N = 3 experiments, the total dimension of the joint map is 3(nd + ny + nθ) = 15, and its coefficients are determined using 5 × 10⁵ exploration trajectories with an exploration design measure dk,j ∼ N(0, 3²),
j = 0, 1. Posterior maps, constructed by conditioning on the appropriate dimensions of the joint map, are then 2D 3rd-order polynomials of the form

ξθk,0 = a0 + a1 θ0 + a2 θ0² + a3 θ0³,   (7.15)

ξθk,1 = b0 + b1 θ0 + b2 θ1 + b3 θ0² + b4 θ0 θ1 + b5 θ1² + b6 θ0³ + b7 θ0² θ1 + b8 θ0 θ1² + b9 θ1³.   (7.16)
KL evaluations on the posteriors are performed using the linear truncation technique described in Section 5.5.4.
Features in the value function approximation are chosen to be based on the posterior map coefficients, instead of the posterior first and second moments as in the earlier examples. We make this choice for two main reasons. First, moment information is not directly available via a map representation, and must be approximated by, for example, linear truncation or sampling. Such estimates can become computationally cumbersome in multidimensional settings, and inaccurate for non-Gaussian posteriors. Map coefficients, however, are easily accessible. Second, information fully describing the posterior (up to basis limitations) is encoded within the entire set of coefficients. This includes all moment information as well. Map coefficients thus provide an accessible source of full posterior description without requiring additional approximations.
To be more specific, we construct the features only from the map coefficients corresponding to terms that are strictly below the highest total polynomial order. With reference to Equations 7.15 and 7.16, the features are then functions of

{ai}, i = 0, 1, 2, and {bi}, i = 0, . . . , 5.   (7.17)

The excluded coefficients correspond to terms that are not affected by dk and yk when conditioned from the joint map. (For example, in the k = 0 case with the total polynomial order capped at 3, the b5 θ1² term results from conditioning the joint map terms c0 d0,0 θ1² + c1 d0,1 θ1² + c2 y0 θ1² + c3 θ1² on a particular realization of d0 and y0. Here the ci are coefficients from the joint map; their particular ordering is unimportant in this illustration. The b6 θ0³ term, however, is contributed only by a single joint map term, c4 θ0³, which is not affected by either dk or yk. Note that the joint map does not have terms such as d0,0 θ0³, since that would exceed the total-order limit.) Consequently, the excluded coefficients are identical for all regression points, regardless of the experimental designs and observations; their inclusion would only introduce linear dependence into the regression system, with no positive contribution. We thus use features that are 2nd-order polynomials jointly in the map coefficients from Equation 7.17 and the 2D physical state, leading to a total of (9+2+2 choose 2) = (13 choose 2) = 78 features.
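The collapse of joint-map terms onto posterior-map coefficients described in the parenthetical above can be sketched in a few lines; the dictionary format and the coefficient values below are purely illustrative, not the thesis's actual data structures:

    import numpy as np

    def condition_joint_map(joint_terms, dy_values):
        # Collapse joint-map monomials onto theta monomials by fixing the
        # conditioned (d, y) variables at their realized values.
        # joint_terms: {(dy_exponents, theta_exponents): coefficient}, where each
        # exponents entry is a tuple, e.g. ((1, 0, 0), (0, 2)) for d00 * theta1^2.
        posterior = {}
        for (dy_exp, th_exp), c in joint_terms.items():
            factor = np.prod([v**e for v, e in zip(dy_values, dy_exp)])
            posterior[th_exp] = posterior.get(th_exp, 0.0) + c * factor
        return posterior

    # Example mirroring the text: the terms c0*d00*th1^2 + c1*d01*th1^2
    # + c2*y0*th1^2 + c3*th1^2 collapse into a single b5*th1^2 term.
    terms = {((1, 0, 0), (0, 2)): 0.7, ((0, 1, 0), (0, 2)): -0.2,
             ((0, 0, 1), (0, 2)): 0.1, ((0, 0, 0), (0, 2)): 0.4}
    b5 = condition_joint_map(terms, dy_values=[1.0, -0.5, 2.0])[(0, 2)]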
Before presenting the results, we first provide some intuition on the physical phenomenon through Figure 7-18, which shows the progression of a sample trajectory. The solid, dotted, and dash-dotted contour lines depict the plume signal when y0, y1, and y2 are observed, respectively. The plume diffuses with time, and is also carried by the wind, first northward and then towards the northeast. Figure 7-19 shows the probability density of the corresponding belief states, beginning with the prior in Figure 7-19(a). The black circle indicates the true θ value; the posterior modes do not necessarily match this value, due to the noisy measurements and the finite number of observations.

The vehicle starts at the initial location represented by the black square in Figure 7-18. For the first experiment, it moves to a new location indicated by the circle, southwest towards where the initial plume is generally situated (in accordance with the prior). Here, this location remains far from the main part of the plume signal, and only a little information is acquired. This is reflected in Figure 7-19(b), where the density contours remain fairly wide. At the same time, this location is close to the region of high information content (intuitively, the high-gradient area) for the second experiment, in anticipation of the wind. The vehicle then only needs to move a small amount to the diamond mark, and is able to make a fairly informative observation despite a slight loss of plume signal from diffusion. Indeed, Figure 7-19(c) shows a more concentrated density. Finally, with the wind carrying the plume much further away and the vehicle unable to catch it without accumulating substantial movement cost, it only nudges slightly towards the final plume position in the last design. As expected, very little additional information is obtained in the last measurement, and the final posterior remains largely unchanged.
Figures 7-20 and 7-21 show the physical and belief state progression of another sample
trajectory, where the plume starting location is to the southwest of the origin. In this
case, a meaningful amount of information is attainable in the final experiment, and justifies
its corresponding movement cost. The vehicle thus makes a large displacement towards
the final plume position, and indeed a significant narrowing in the posterior after the final
measurement is observed.
These trajectory samples illustrate that a policy produced from sOED is able to find good experimental designs in different situations, an inherently adaptive property.
7.3.2
Results
Histograms for designs d0 , d1 , and d2 are shown in Figure 7-22. Each dk has two components, corresponding to the two physical space dimensions. The middle column of the figure
provides three-dimensional histograms reflecting both components at the same time, while
the left and right columns display the marginal histograms of each dimension.
Figure 7-18: 2D contaminant source inversion problem: plume signal and physical state progression of sample trajectory 1.

The starting locations of most plume realizations (generated from the prior) are to the southwest of the initial physical state of the robotic vehicle. As a result, the vehicle generally has a southwest tendency in d0, with negative values in both components. However, the magnitude of the first movement is not the largest possible (recall that the design space is [−5, 5]²), due to three factors: first, there is a competing quadratic movement cost; second, the plume can still be fairly far away, where only a little information is acquirable (such as the situation depicted in Figure 7-18); and third, it may be better to get into a good position in anticipation of the second experiment instead. The trade-off between these factors is complicated, and the algorithm developed in this thesis helps address these difficulties in a quantitative manner. Moving on to the second design, d1 sees more variation in its second component, while its first component remains mostly around the same position. This is because, by this point, the vehicle is often in between the current and next plume positions in the first component, while south of both the current and next plume positions in the second component. This observation thus demonstrates characteristics of anticipating the subsequent plume movement towards the northeast. Finally, the last movement sees the largest spread in the histograms. Since this is the final decision in the sequence, the only consideration is the trade-off between movement cost and information gain, dependent on the current plume location, vehicle position, and belief state. The final design thus fully and clearly adjusts to this trade-off, with no need, or opportunity, for additional reservations due to future effects. Experiments later in the sequence are often where feedback effects are most influential.
Histograms of the trajectory rewards are plotted in Figure 7-23 for ℓ = 1, 2, and 3, with mean rewards of 1.04 ± 0.03, 1.10 ± 0.03, and 0.96 ± 0.03, respectively; the plus-minus quantity is 1 standard error. The dk histograms are similar for the subsequent ℓ iterations and thus are not included.
Figure 7-19: 2D contaminant source inversion problem: belief state posterior density contour progression of sample trajectory 1, with panels (a)–(d) showing the x0,b through x3,b densities.
In this example, we observe only small changes in the results as the state measure is updated with ℓ. First, this implies that the exploration design measure we selected is reasonable, since the ℓ = 1 iteration (where all regression points are from exploration) does not show a noticeable disadvantage compared to subsequent ℓ iterations (where exploitation samples are incorporated). More specifically, this is due either to designs from the exploitation policies being similarly distributed to those from the exploration policy (which is not the case here, as evident from Figure 7-22), or to the value function approximations being robust against the locations of the regression points. This leads to the second implication: that the features we selected span a subspace that is sufficiently rich to approximate the value functions well. (To illustrate this more simply, imagine a quadratic function being approximated using only linear basis functions: different regression sample distributions can then produce drastically different outcomes, whereas if the basis functions are quadratic, very similar approximations would be produced; see the small numerical illustration below.)
Figure 7-20: 2D contaminant source inversion problem: plume signal and physical state progression of sample trajectory 2.
This observation is particularly encouraging, as it provides support for our choice of features, made largely from heuristics.
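A tiny numerical illustration of this point (constructed for this discussion, not taken from the thesis):

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: 1.0 + 0.5 * x - 2.0 * x**2   # a "true" quadratic value function

    for lo, hi in [(-1.0, 0.0), (0.0, 1.0)]:   # two regression-point measures
        x = rng.uniform(lo, hi, 200)
        lin = np.polynomial.polynomial.polyfit(x, f(x), 1)    # basis {1, x}
        quad = np.polynomial.polynomial.polyfit(x, f(x), 2)   # basis {1, x, x^2}
        print(lin, quad)
    # The two linear fits differ drastically between the sample distributions,
    # while both quadratic fits recover the true coefficients (1.0, 0.5, -2.0).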
We also point out the importance of including regression samples produced from the
numerical methods (discussed in Section 4.3.1). When those samples are not included, the
ℓ = 2 iteration suffers tremendously and produces a much lower mean reward of 0.55. The culprit is inaccurate value function approximations that lead the optimizer to designs that are in fact far from the true optimum. For contrast, the exploration policy yields a much lower mean reward of −0.78.
The pairwise and marginal KDEs from samples used to construct the exploration joint map, and samples generated from that map, for the dk and yk dimensions, are shown in Figures 7-24 and 7-25. Figures 7-26 and 7-27 display those of the dimensions between dk and yk crossed with θ: the columns from left to right correspond to d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2, and the marginals for the row variables, and the rows from top to bottom correspond to the marginal for the column variables, then θ0, θ1, θ0, θ1, θ0, θ1, where each pair of rows corresponds to θ for inference after 1, 2, and 3 experiments, respectively. The only part of the joint map omitted is the set of pairwise KDEs between the θ's, which are independent Gaussians and uninteresting. Overall, the pairwise KDEs exhibit extremely non-Gaussian, heavy-tailed, and even borderline multimodal behavior.
Figure 7-21: 2D contaminant source inversion problem: belief state posterior density contour progression of sample trajectory 2, with panels (a)–(d) showing the x0,b through x3,b densities.
Nonetheless, the map is still able to
capture these characteristics reasonably well, with the map-generated KDEs matching fairly
well with their counterparts from the samples used to construct the map. As the problem
becomes more nonlinear and higher dimensional, the joint behavior will also become more
difficult to mirror. While one aspect of development is through the enrichment of map basis
and samples, another promising future research direction is to leverage the exploitation
samples, and to construct lower-dimensional targeted local maps that are more accurate for
specific realizations (as discussed at the beginning of Section 5.5.1). We will expand these
ideas in Chapter 8.
Figure 7-22: 2D contaminant source inversion problem: dk histograms from 1000 simulated trajectories, with panels (a) d0, (b) d1, and (c) d2.

Figure 7-23: 2D contaminant source inversion problem: total reward histograms from 1000 simulated trajectories. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error.
Figure 7-24: 2D contaminant source inversion problem: samples used to construct the exploration map (pairwise and marginal KDEs over d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2).
Figure 7-25: 2D contaminant source inversion problem: samples generated from the resulting map.
Figure 7-26: 2D contaminant source inversion problem: samples used to construct the exploration map, between dk and yk with θ. The columns from left to right correspond to d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2, and the marginals for the row variables; the rows from top to bottom correspond to the marginal for the column variables, then θ0, θ1, θ0, θ1, θ0, θ1, where each pair of rows corresponds to θ for inference after 1, 2, and 3 experiments, respectively.

Figure 7-27: 2D contaminant source inversion problem: samples generated from the resulting map, between dk and yk with θ. The columns and rows are arranged as in Figure 7-26.
Chapter 8
Conclusions
8.1
Summary and conclusions
This thesis has developed a rigorous mathematical framework and a set of numerical tools for performing sequential optimal experimental design (sOED) in a computationally feasible manner. Experiments play an essential role in the learning process, and a systematic design procedure for finding the optimal experiments can lead to tremendous resource savings. Propelled by recent algorithmic developments, simulation-based optimal experimental design (OED) has seen substantial advances in accommodating nonlinear and physically realistic processes. However, today's state-of-the-art OED tools are largely limited to batch (open-loop) and greedy (myopic) designs. While sufficient under some circumstances, these design approaches generally do not yield the optimal design of multiple experiments conducted in a sequence. The use of a fully optimal formulation for sequential design is still in its early stages.
We begin the thesis with an extension of our previous batch OED work. In addition to describing the framework and numerical tools for batch OED, particular focus is placed on enhancing the capability to accommodate nonlinear and computationally intensive models with an information gain objective. This involves deriving and accessing gradient information via the use of polynomial chaos and infinitesimal perturbation analysis in order to enable gradient-based optimization methods, which would otherwise be impossible or impractical. An extensive comparison between two gradient-based methods, Robbins-Monro stochastic approximation and sample average approximation, is made from a practical and numerical perspective in the context of batch OED, via a diffusion source inversion application governed by a 2D partial differential equation.
We then develop a rigorous mathematical framework for sOED. This framework is formulated from a decision-theoretic perspective, with a Bayesian treatment of uncertainty and
an information measure objective. It accommodates the sequential design of a finite number
of experiments, with nonlinear models and non-Gaussian distributions, over continuous
parameter, design, and observation spaces of multiple dimensions. What sets sOED apart
from batch OED is that it seeks an optimal policy: a set of functions that determines the
optimal design given the current system state. Directly solving for the optimal policy of the
sOED problem is challenging. Instead, we re-express it using a dynamic programming
formulation, and then make use of various approximate dynamic programming (ADP)
techniques to find an approximation to the optimal policy. The ADP techniques employed
are based on a one-step lookahead policy representation, combined with approximate value
iteration (in particular, backward induction with regression). Value functions are
approximated using a linear architecture, with features selected from heuristics and motivated
by moment terms in the analytic formula for the Kullback-Leibler divergence between
Gaussian distributions. The approximations are then constructed from the regression
problems resulting from the backward induction process. Regression samples are generated
from trajectory simulations, via both exploration and exploitation. In obtaining good
regression sample locations, we emphasize the notion of the state measure induced by the
policy and the numerical method. An iterative update procedure is introduced to help adapt
and refine this measure as better policy approximations are constructed. Lastly, we point
out the difficulty of the problem by mathematically showing that many advanced partially
observable Markov decision process algorithms are not suitable for information-based OED.
The next major challenge involves the representation of the belief state, which comprises posteriors
of multivariate non-Gaussian continuous random variables. Transport maps with finite-dimensional parameterizations are introduced to represent the belief states. This technology
is numerically attractive in that the maps can be constructed directly from samples without
requiring model knowledge, and the optimization problem in the construction process is
dimensionally separable and convex. More importantly, by building a map jointly over the
parameter and observation spaces, one can recover the posterior map by simply conditioning
the joint map. This allows Bayesian inference, which needs to be repeated millions of times
throughout the entire sOED process under different realizations of designs and observations,
to be performed very quickly, albeit approximately. This ability plays a key role in making
the overall method computationally feasible. We take this a step further, and build a single joint
map over the parameter, observation, and design spaces of all stages, such that only one map
is needed for all subsequent inferences in solving the sOED problem. Currently, samples
for map construction are generated from exploration only; future research will involve the
incorporation of exploitation samples as well.
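The following sketch shows the conditioning idea in the simplest possible setting (our illustration, with an assumed toy linear model, not the thesis implementation): for a jointly Gaussian parameter and observation, the lower-triangular (Knothe-Rosenblatt) map from a standard normal reference is simply the Cholesky factor of the joint covariance, and fixing the observation component of the map at the measured value yields posterior samples. The maps used in this thesis generalize this linear special case with monotone polynomial parameterizations constructed from samples.

import numpy as np

rng = np.random.default_rng(2)

# Joint prior samples of (y, theta) for a toy model y = theta*d + eps.
d = 1.5
theta = 3.0 * rng.standard_normal(5000)          # prior N(0, 9)
y = theta * d + rng.standard_normal(5000)        # noise eps ~ N(0, 1)
samples = np.column_stack([y, theta])            # conditioning variable first

# Linear lower-triangular map from N(0, I) to the joint samples: in the
# Gaussian special case this is the Cholesky factor of the covariance.
mu = samples.mean(axis=0)
L = np.linalg.cholesky(np.cov(samples.T))

# Conditioning the joint map: solve for the reference value matching the
# observed y*, then push standard normal samples through the theta component.
y_obs = 4.0
xi1 = (y_obs - mu[0]) / L[0, 0]
theta_post = mu[1] + L[1, 0] * xi1 + L[1, 1] * rng.standard_normal(5000)

# Compare against the exact conjugate Gaussian posterior.
v_post = 1.0 / (1.0 / 9.0 + d**2)
print(theta_post.mean(), theta_post.var(), "exact:", v_post * d * y_obs, v_post)

Because the map is built once from joint samples, each subsequent inference reduces to this cheap conditioning step rather than a full Bayesian solve.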
Finally, we demonstrate the computational effectiveness of these methods via three examples. The first is a linear-Gaussian problem, where an analytic solution is available. A
comparison of sOED using analytic, grid, and map representations of the belief state provides an understanding of the various sources of numerical error. Next is a realistic nonlinear
contaminant source inversion problem in a 1D physical space with diffusion and convection
effects. Through different settings, we demonstrate the advantage of sOED over the more
commonly used batch and greedy designs, and also establish confidence in our map-based
algorithm for handling nonlinear problems. The map-based method constructs excellent
policies, using both moment-based value function features and map coefficient-based
features. The last problem is the contaminant source inversion problem in a 2D physical
space setting. With multiple dimensions in many variables, this problem tests the limitations
of the numerical methods developed in this work, and offers insights into future research
directions.
8.2 Future work
Throughout this thesis, we have identified several promising avenues of future work, which
are briefly outlined below. We broadly divide them into areas of computational and formulational advances.
8.2.1 Computational advances
1. Transport maps and inference accuracy: One fruitful direction of research involves improving the accuracy of the transport map representation of the belief state,
and the accompanying inference method of conditioning a joint map. As the number
of experiments and variable dimensions increase, and as the problem becomes more
nonlinear and non-Gaussian, the accuracy of the maps becomes more difficult to maintain. Echoing the discussion from Section 7.2.3, the accuracy of a particular map can
be increased by improving the map construction process (such as enriching its basis
functions and boosting the number of construction samples), or by using more relevant
samples (such as those from exploitation trajectories that better reflect which states
the algorithm visits). On a higher level, we may also consider different maps altogether. Instead of the single joint map adopted in this thesis, targeted local maps may
be created that are more accurate for specific states. With reference to Equation 3.3,
one possible route is to construct a separate map for each $x_k$ visited and as needed,
which can then be used for inference on different realizations of $d_k$ and $y_k$. Such an
approach would require joint maps of only $n_{d_k} \times n_{y_k} \times n_\theta$ dimensions, independent of
$N$ (interestingly, information-based OED is expected to be most effective when only
a few experiments are available, as otherwise even suboptimal experiments can still
eventually lead to informative posteriors after many measurements; this suggests that
the number of experiments would be less of a problem than increases in other variable
dimensions). Naturally, lower-dimensional joint maps would produce more accurate
inference results, but many more such maps would need to be constructed, and in an
online fashion. Furthermore, performance would be affected by additional sources of
error in the propagation of truncated representations of $x_k$, as we would then need to
explicitly store these representations, whereas we only needed to store their associated
history of designs and observations in the current implementation. More generally,
hybrid methods of inference appear promising. For instance, we can use a "rough"
joint map to arrive at an initial approximation of the posterior, and further refine it
as needed using other techniques such as importance sampling. An easily tunable
setup is also attractive for numerical adaptation, another topic to be discussed shortly.
2. Alternative approximate dynamic programming techniques: There is a vast
literature on ADP techniques in addition to those used in this thesis. For many
possible alternative approaches, it is not immediately clear whether they can produce
accurate results more efficiently. For example, we have employed a direct backward
induction approach to construct the value function approximations. A rollout formulation, perhaps involving multiple iterations of rollout (approximate policy iteration),
can potentially be computationally cheaper, but produces "less optimal" policies. At
the same time, a whole family of algorithms for policy evaluation can be tested, such
as the temporal difference (TD) variations.
3. Adaptation of numerical methods: In tackling a difficult problem such as sOED,
numerous numerical and approximation techniques need to be employed, and accompanying them are different sources of error. The work of this thesis has largely relied
on techniques that have some natural means of refinement. For example, one can make
the value function approximations more accurate by enriching their features, increasing
the regression sample size and making more efficient sample choices, and improving
the quality of objective estimates in the stochastic optimization algorithm. We have
the choice of which components to improve, and by how much, through the allocation
of computational resources. Yet not all errors are equally important. The key is
to understand which sources of error the quality of the approximate optimal policy
is most sensitive to, and which sources are prominent but can also be economically
reduced. We would like to further investigate the behavior of these numerical errors,
with the aim both of creating a goal-oriented adaptation scheme that can efficiently
improve the accuracy of the overall method, and of achieving quantifiable and
meaningful error bounds on the results.
8.2.2 Formulational advances
1. Changes to the number of experiments: While the sOED formulation in this
thesis has assumed the number of experiments to be known and fixed, this is often
not the case in practice. For example, when a particular goal has been achieved
(e.g., enough information has been acquired, or the effect of a drug has reached its target),
additional experiments are often no longer necessary. For projects heavily influenced
by political climate or funding availability, the probability of project termination or
renewal is often ambiguous. Regardless of whether these changes are intentional,
their inclusion requires some mechanism that can change the number of experiments.
One well-studied variant of sOED that accommodates possible early termination is the
optimal stopping problem. In experimental design, the concept has been used in the
design of clinical trials and biological experiments (e.g., [41, 48]). On a high level, its
formulation involves using a "maximum possible" horizon, with each experiment
having the option of a termination design when certain conditions are met. Whether
such a maximum horizon even exists, or what a reasonable "long" substitute for this
number would be, may be problematic in itself. Despite these additional challenges, however,
an extension from our current formulation would not be difficult.
More subtle is how to accommodate unforeseen additional experiments, especially
when this news is revealed partway through the project. This is usually a good problem
to have. However, the policy constructed for the previously shorter horizon may no
longer be "good" for the remaining original experiments, and does not even apply to
the new ones. While one may simply solve the new sOED problem starting from the
current state, this can be an expensive and inefficient process. Of particular concern are
situations where there is some sense of diminishing returns in the experiments, or where only
a small change is made relative to the original total (e.g., adding 1 new experiment
to an original total of 100 may have an insignificant effect). These factors, combined
with the often limited time available for computing the new policy, present a need for updating or
modifying the original policy on the fly, perhaps suboptimally. Extending this line of
thought further, advanced structural approximations (such as mixtures of batch,
greedy, and sequential designs over blocks of experiments) may be more robust to these
changes, but at a trade-off in optimality.
2. Additional advances in formulation: There are several additional general formulation aspects of sOED that are worth incorporating in the future. First, the element
of risk can be extremely important in some missions (for instance, as mentioned in
the example from Section 7.2.2). Risk can be incorporated, for example, through the
objective function (such as adding terms that reflect the variance, or worst-case scenarios) and probabilistic constraints (such as the probability of some notion of failure
that reflects the reliability of the policy). These new structures naturally lead to more
sophisticated robust optimization algorithms (e.g., [10]). Second, the treatment of
nuisance parameters can help increase the efficiency of sOED. For example, in the
examples of Sections 7.2 and 7.3, if the wind condition is unknown, then acquiring information to reduce uncertainty in this nuisance parameter may still be useful towards
the primary goal of gaining information about the contaminant source location. While
nuisance parameter treatment has been investigated in batch OED [67], its incorporation in sOED would have an even bigger impact. Finally, the treatment of model
discrepancy is an interesting and challenging area in itself. Its inclusion in sOED
would help make the method even more reflective of real-life situations.
Appendix A
Analytic Derivation of the Unbiased Gradient Estimator
We derive the analytical form of the unbiased gradient estimator $\nabla_d \hat{U}_{N,M}(d, \theta_s, z_s)$ (recall that this estimator is unbiased with respect to the gradient of $\bar{U}_M$), following the method presented in Section 2.4.
The estimator $\hat{U}_{N,M}(d, \theta_s, z_s)$ is defined in Equation 2.16. Its gradient in component form is
$$\nabla_d \hat{U}_{N,M}(d, \theta_s, z_s) = \begin{bmatrix} \frac{\partial}{\partial d_1}\hat{U}_{N,M}(d, \theta_s, z_s) \\ \frac{\partial}{\partial d_2}\hat{U}_{N,M}(d, \theta_s, z_s) \\ \vdots \\ \frac{\partial}{\partial d_{n_d}}\hat{U}_{N,M}(d, \theta_s, z_s) \end{bmatrix}, \tag{A.1}$$
where $n_d$ is the dimension of the design vector $d$, and $d_a$ denotes the $a$th component of $d$.
The $a$th component of the gradient is then
$$\frac{\partial}{\partial d_a}\hat{U}_{N,M}(d,\theta_s,z_s) = \frac{1}{N}\sum_{i=1}^{N}\left\{ \frac{\frac{\partial}{\partial d_a} f_{y|\theta,d}\!\left(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\middle|\, \theta^{(i)},d\right)}{f_{y|\theta,d}\!\left(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\middle|\, \theta^{(i)},d\right)} - \frac{\sum_{j=1}^{M}\frac{\partial}{\partial d_a} f_{y|\theta,d}\!\left(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\middle|\, \theta^{(i,j)},d\right)}{\sum_{j'=1}^{M} f_{y|\theta,d}\!\left(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\middle|\, \theta^{(i,j')},d\right)} \right\}. \tag{A.2}$$
Partial derivatives of the likelihood function with respect to $d$ are required above. We assume that each component of $C(\theta^{(i)},d)$ is of the form $\alpha_c + \beta_c\,|G_c(\theta^{(i)},d)|$, $c = 1,\ldots,n_y$, where $n_y$ is the dimension of the observation vector $y$, and $\alpha_c$, $\beta_c$ are constants. Also, let the random vectors $z^{(i)}$ be mutually independent and composed of i.i.d. components, such that the observations are conditionally independent given $\theta$ and $d$. The derivative of the likelihood function then becomes
$$\begin{aligned}
\frac{\partial}{\partial d_a} f_{y|\theta,d}&\left(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\middle|\, \theta^{(i,j)},d\right) \\
&= \frac{\partial}{\partial d_a}\left[\prod_{c=1}^{n_y} f_{y_c|\theta,d}\!\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\middle|\, \theta^{(i,j)},d\right)\right] \\
&= \sum_{k=1}^{n_y}\frac{\partial}{\partial d_a} f_{y_k|\theta,d}\!\left(G_k(\theta^{(i)},d)+(\alpha_k+\beta_k|G_k(\theta^{(i)},d)|)z_k^{(i)} \,\middle|\, \theta^{(i,j)},d\right) \\
&\qquad\quad \times \prod_{\substack{c=1\\ c\neq k}}^{n_y} f_{y_c|\theta,d}\!\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\middle|\, \theta^{(i,j)},d\right).
\end{aligned} \tag{A.3}$$
Introducing a standard normal density for each $z_c^{(i)}$, the likelihood associated with a single component of the data vector is
$$f_{y_c|\theta,d}\!\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\middle|\, \theta^{(i,j)},d\right) = \frac{1}{\sqrt{2\pi}\left(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\right)}\exp\!\left[-\frac{\left(G_c(\theta^{(i,j)},d)-\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\right)\right)^2}{2\left(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\right)^2}\right], \tag{A.4}$$
and its derivatives are
$$\begin{aligned}
\frac{\partial}{\partial d_a} f_{y_c|\theta,d}&\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\middle|\, \theta^{(i,j)},d\right) \\
&= \frac{-\beta_c\,\mathrm{sgn}\!\left(G_c(\theta^{(i,j)},d)\right)\frac{\partial}{\partial d_a}G_c(\theta^{(i,j)},d)}{\sqrt{2\pi}\left(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\right)^2}\exp\!\left[-\frac{\left(G_c(\theta^{(i,j)},d)-\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\right)\right)^2}{2\left(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\right)^2}\right] \\
&\quad + \frac{1}{\sqrt{2\pi}\left(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\right)}\exp\!\left[-\frac{\left(G_c(\theta^{(i,j)},d)-\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\right)\right)^2}{2\left(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\right)^2}\right] \\
&\qquad \times\Bigg\{ -\frac{G_c(\theta^{(i,j)},d)-\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\right)}{\left(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\right)^2} \\
&\qquad\qquad \times\left[\frac{\partial}{\partial d_a}G_c(\theta^{(i,j)},d) - \frac{\partial}{\partial d_a}G_c(\theta^{(i)},d)\left(1+\beta_c\,\mathrm{sgn}\!\left(G_c(\theta^{(i)},d)\right)z_c^{(i)}\right)\right] \\
&\qquad\quad + \frac{\left(G_c(\theta^{(i,j)},d)-\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\right)\right)^2}{\left(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\right)^3}\,\beta_c\,\mathrm{sgn}\!\left(G_c(\theta^{(i,j)},d)\right)\frac{\partial}{\partial d_a}G_c(\theta^{(i,j)},d) \Bigg\}.
\end{aligned} \tag{A.5}$$
In cases where conditioning on $\theta^{(i,j)}$ is replaced by conditioning on $\theta^{(i)}$ (i.e., for the first term in the summation of Equation A.2), the expressions simplify to
$$f_{y_c|\theta,d}\!\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\middle|\, \theta^{(i)},d\right) = \frac{1}{\sqrt{2\pi}\left(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|\right)}\exp\!\left[-\frac{\left(z_c^{(i)}\right)^2}{2}\right] \tag{A.6}$$
and
$$\frac{\partial}{\partial d_a} f_{y_c|\theta,d}\!\left(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\middle|\, \theta^{(i)},d\right) = \frac{-\beta_c\,\mathrm{sgn}\!\left(G_c(\theta^{(i)},d)\right)\frac{\partial}{\partial d_a}G_c(\theta^{(i)},d)}{\sqrt{2\pi}\left(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|\right)^2}\exp\!\left[-\frac{\left(z_c^{(i)}\right)^2}{2}\right]. \tag{A.7}$$
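As a sanity check of Equation A.2, the following sketch (our illustration; the scalar linear model $G(\theta,d)=\theta d$, the constant noise amplitude with $\beta_c = 0$, and the sample sizes are all assumptions) implements the estimator and its analytic gradient, and compares the latter against a finite difference of $\hat{U}_{N,M}$ computed with the same fixed samples $(\theta_s, z_s)$:

import numpy as np

rng = np.random.default_rng(3)

# Toy setup: G(theta, d) = theta*d with constant noise C = alpha (beta_c = 0,
# which collapses the derivative in Equation A.5 to its Delta' term only).
alpha, N, M = 1.0, 2000, 100
theta_i = 3.0 * rng.standard_normal(N)           # outer prior samples
theta_ij = 3.0 * rng.standard_normal((N, M))     # inner prior samples
z = rng.standard_normal(N)                       # fixed noise samples z_s

def logpdf(y, mean):
    return -0.5 * np.log(2.0 * np.pi * alpha**2) - (y - mean)**2 / (2.0 * alpha**2)

def U_hat(d):
    # Nested Monte Carlo estimator \hat{U}_{N,M}(d) with fixed (theta_s, z_s).
    y = theta_i * d + alpha * z
    evidence = np.exp(logpdf(y[:, None], theta_ij * d)).mean(axis=1)
    return np.mean(logpdf(y, theta_i * d) - np.log(evidence))

def grad_U_hat(d):
    # Analytic gradient (Equation A.2); note y depends on d through G.
    y = theta_i * d + alpha * z
    dy = theta_i                                  # dy/dd with z held fixed
    term1 = -(y - theta_i * d) * (dy - theta_i) / alpha**2   # identically zero here
    f = np.exp(logpdf(y[:, None], theta_ij * d))
    df = f * (-(y[:, None] - theta_ij * d) * (dy[:, None] - theta_ij) / alpha**2)
    return np.mean(term1 - df.sum(axis=1) / f.sum(axis=1))

d, h = 1.3, 1e-5
print(grad_U_hat(d), (U_hat(d + h) - U_hat(d - h)) / (2.0 * h))  # should agree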
We now require the derivative of each model output $G_c$ with respect to $d$. In most cases, this quantity will not be available analytically. One could use an adjoint method to evaluate the derivatives, or instead employ a finite difference approximation, but embedding these approaches in a Monte Carlo sum may be prohibitive, particularly if each forward model evaluation is computationally expensive. The polynomial chaos surrogate introduced in Section 2.3 addresses this problem by replacing the forward model with polynomial expansions for either $G_c$,
$$G_c(\theta^{(i)},d) \approx \sum_{b\in\mathcal{J}} g_b\,\Psi_b\!\left(\xi(\theta^{(i)},d)\right), \tag{A.8}$$
or $\ln G_c$,
$$G_c(\theta^{(i)},d) \approx \exp\!\left[\sum_{b\in\mathcal{J}} g_b\,\Psi_b\!\left(\xi(\theta^{(i)},d)\right)\right]. \tag{A.9}$$
Here $g_b$ are the expansion coefficients and $\mathcal{J}$ is an admissible multi-index set indicating which polynomial terms are in the expansion. For instance, if $n_\theta$ is the dimension of $\theta$ and $n_d$ is the dimension of $d$, such that $n_\theta + n_d$ is the dimension of $\xi$, then $\mathcal{J} := \{b \in \mathbb{N}_0^{n_\theta+n_d} : |b|_1 \le p\}$ is a total-order expansion of degree $p$. This expansion converges in the $L^2$ sense as $p \to \infty$.
Consider the latter ($\ln G_c$) case; here, the derivative of the polynomial chaos expansion is
$$\frac{\partial}{\partial d_a}G_c(\theta^{(i)},d) = \exp\!\left[\sum_{b\in\mathcal{J}} g_b\,\Psi_b\!\left(\xi(\theta^{(i)},d)\right)\right]\sum_{b\in\mathcal{J}} g_b\,\frac{\partial}{\partial d_a}\Psi_b\!\left(\xi(\theta^{(i)},d)\right). \tag{A.10}$$
In the former ($G_c$ without the logarithm) case, we obtain the same expression except without the $\exp[\cdot]$ term.
To complete the derivation, we assume that each component of the input parameters $\theta$ and design vector $d$ is represented by an affine transformation of the corresponding basis random variable $\xi$:
$$\theta_l = \gamma_l + \delta_l \xi_l, \tag{A.11}$$
$$d_{l'-n_\theta} = \gamma_{l'} + \delta_{l'} \xi_{l'}, \tag{A.12}$$
where $\gamma_{(\cdot)}$ and $\delta_{(\cdot)} \neq 0$ are constants, and $l = 1,\ldots,n_\theta$ and $l' = n_\theta+1,\ldots,n_\theta+n_d$. This
is a reasonable assumption, since the $\xi$ can typically be chosen such that their distributions are of the same family as the prior on $\theta$ (or the uniform "prior" on $d$); this choice avoids any need for approximate representations of the prior. The derivative of $\Psi_b(\xi(\theta^{(i)},d))$ from Equation A.10 is thus
$$\begin{aligned}
\frac{\partial}{\partial d_a}\Psi_b\!\left(\xi(\theta^{(i)},d)\right) &= \frac{\partial}{\partial d_a}\left[\prod_{l=1}^{n_\theta}\psi_{b_l}\!\left(\xi_l(\theta_l^{(i)})\right)\prod_{l'=n_\theta+1}^{n_\theta+n_d}\psi_{b_{l'}}\!\left(\xi_{l'}(d_{l'-n_\theta})\right)\right] \\
&= \left[\prod_{l=1}^{n_\theta}\psi_{b_l}\!\left(\xi_l(\theta_l^{(i)})\right)\right]\left[\prod_{\substack{l'=n_\theta+1 \\ l'-n_\theta\neq a}}^{n_\theta+n_d}\psi_{b_{l'}}\!\left(\xi_{l'}(d_{l'-n_\theta})\right)\right]\frac{\partial}{\partial d_a}\psi_{b_{a+n_\theta}}\!\left(\xi_{a+n_\theta}(d_a)\right),
\end{aligned} \tag{A.13}$$
and the derivative of the univariate basis function $\psi$ with respect to $d_a$ is
$$\frac{\partial}{\partial d_a}\psi_{b_{a+n_\theta}}\!\left(\xi_{a+n_\theta}(d_a)\right) = \frac{\partial}{\partial \xi_{a+n_\theta}}\psi_{b_{a+n_\theta}}(\xi_{a+n_\theta})\,\frac{\partial \xi_{a+n_\theta}(d_a)}{\partial d_a} = \frac{\partial}{\partial \xi_{a+n_\theta}}\psi_{b_{a+n_\theta}}(\xi_{a+n_\theta})\,\frac{1}{\delta_{a+n_\theta}}, \tag{A.14}$$
where the second equality is a result of using Equation A.12. The derivative of the polynomial basis function with respect to its argument is available analytically for many standard
orthogonal polynomials, and may be evaluated using recurrence relationships [1]. For example, in the case of Legendre polynomials, the usual derivative recurrence relationship is $\frac{\partial}{\partial\xi}\psi_n(\xi) = \left[-n\xi\psi_n(\xi) + n\psi_{n-1}(\xi)\right]/(1-\xi^2)$, where $n$ is the polynomial degree. However, division by $(1-\xi^2)$ presents numerical difficulties when evaluated at $\xi$ on or near the boundaries of the domain. Instead, a more robust alternative that requires both previous polynomial function and derivative evaluations can be obtained by directly differentiating the three-term recurrence relationship for the polynomial, and is preferable in practice:
$$\frac{\partial}{\partial\xi}\psi_n(\xi) = \frac{2n-1}{n}\,\psi_{n-1}(\xi) + \frac{2n-1}{n}\,\xi\,\frac{\partial}{\partial\xi}\psi_{n-1}(\xi) - \frac{n-1}{n}\,\frac{\partial}{\partial\xi}\psi_{n-2}(\xi). \tag{A.15}$$
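In the Legendre case, Equation A.15 can be implemented by carrying the polynomial and derivative recurrences together, as in the following sketch (our illustration):

import numpy as np

def legendre_and_derivative(n, xi):
    # Jointly evaluate psi_n and its derivative at xi via the three-term
    # recurrence and its differentiated form (Equation A.15), avoiding the
    # division by (1 - xi^2) of the usual derivative relationship.
    p_prev, p = np.ones_like(xi), xi                      # psi_0, psi_1
    dp_prev, dp = np.zeros_like(xi), np.ones_like(xi)     # their derivatives
    if n == 0:
        return p_prev, dp_prev
    for k in range(2, n + 1):
        p_new = ((2*k - 1) * xi * p - (k - 1) * p_prev) / k
        dp_new = ((2*k - 1) * (p + xi * dp) - (k - 1) * dp_prev) / k
        p_prev, p = p, p_new
        dp_prev, dp = dp, dp_new
    return p, dp

# Quick check against a centered finite difference, including xi = +-1,
# where the (1 - xi^2) form of the derivative would break down.
xi = np.array([-1.0, -0.3, 0.7, 1.0])
p, dp = legendre_and_derivative(5, xi)
h = 1e-6
fd = (legendre_and_derivative(5, xi + h)[0]
      - legendre_and_derivative(5, xi - h)[0]) / (2.0 * h)
print(np.max(np.abs(dp - fd)))  # should be small, roughly 1e-8 or below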
This concludes the derivation of the analytical gradient estimator ∇d ÛN,M (d, θs , zs ).
Appendix B
Analytic Solution to the Linear-Gaussian Problem
We derive the analytic solution to the linear-Gaussian problem described in Section 7.1. As discussed in the main text, this problem is deterministic, and its optimal policy can be reduced to optimal designs (i.e., the expected utility or reward is a function of $d_0$ and $d_1$, rather than of a policy). In this case, batch optimal experimental design (OED) and sequential optimal experimental design (sOED) yield the same optimal designs, since feedback is not needed. We pursue the derivation first via the batch design formulation in Section B.1, which is simpler and also produces the entire analytic expected utility function. In Section B.2, we present the derivation under the sOED formulation. Both derivations are assisted by the MATLAB Symbolic Math Toolbox.
B.1 Derivation from batch optimal experimental design
Following the expected utility definition of Equation 2.2 and with the additional term introduced in Equation 7.4, the expected utility for this problem is
$$\begin{aligned}
U(d_0,d_1) &= \mathbb{E}_{y_0,y_1|d_0,d_1}\!\left[D_{\mathrm{KL}}\!\left(f_{\theta|y_0,y_1,d_0,d_1}(\cdot\,|\,y_0,y_1,d_0,d_1)\,\big\|\,f_\theta(\cdot)\right) - 2\left(\ln\sigma_2^2-\ln 2\right)^2\right] \\
&= \mathbb{E}_{y_0,y_1|d_0,d_1}\!\left[\mathbb{E}_{\theta|y_0,y_1,d_0,d_1}\!\left[\ln\frac{f(\theta|y_0,y_1,d_0,d_1)}{f(\theta)}\right]\right] - 2\left(\ln\sigma_2^2-\ln 2\right)^2 \\
&= \mathbb{E}_{\theta|d_0,d_1}\!\left[\mathbb{E}_{y_0,y_1|\theta,d_0,d_1}\!\left[\ln\frac{f(\theta|y_0,y_1,d_0,d_1)}{f(\theta)}\right]\right] - 2\left(\ln\sigma_2^2-\ln 2\right)^2,
\end{aligned} \tag{B.1}$$
where the second equality is due to $\sigma_2^2$ being independent of $y_0$ and $y_1$ given $d_0$ and $d_1$ (see Equation 7.2), and the last equality is from the re-arrangement of conditional expectations.
Let us first focus on the first term in Equation B.1, and substitute the following formulas for the log-posterior and log-prior density functions:
$$\ln f(\theta) = -\frac{1}{2}\ln\!\left(2\pi\sigma_0^2\right) - \frac{(s_0-\theta)^2}{2\sigma_0^2}, \tag{B.2}$$
$$\ln f(\theta|y_0,y_1,d_0,d_1) = -\frac{1}{2}\ln\!\left(2\pi\sigma_2^2\right) - \frac{(s_2-\theta)^2}{2\sigma_2^2}. \tag{B.3}$$
We then further substitute for $s_2$, $\sigma_2^2$, $s_1$, $\sigma_1^2$ with the formulas in Equation 7.2, and $s_0 = 0$, $\sigma_0^2 = 9$, and $\sigma_\epsilon^2 = 1$ from the problem setting. The resulting expression is
$$\mathbb{E}_{\theta|d_0,d_1}\!\Bigg[\mathbb{E}_{y_0,y_1|\theta,d_0,d_1}\!\Bigg[\frac{1}{2}\ln\!\left(d_0^2+d_1^2+\tfrac{1}{9}\right)+\ln 3 + \frac{1}{18d_0^2+18d_1^2+2}\Big( -\left(d_0^2+d_1^2+9d_0^4+9d_1^4+18d_0^2d_1^2\right)\theta^2 - 9d_0^2y_0^2 - 9d_1^2y_1^2 - 18d_0d_1y_0y_1 + 18d_0^3\theta y_0 + 18d_1^3\theta y_1 + 18d_0d_1^2\theta y_0 + 18d_0^2d_1\theta y_1 + 2d_0\theta y_0 + 2d_1\theta y_1 \Big)\Bigg]\Bigg]. \tag{B.4}$$
Next, we make use of the linearity of expectation operators, and apply the inner expectation $\mathbb{E}_{y_0,y_1|\theta,d_0,d_1}$ term-by-term, with the formulas
$$\mathbb{E}_{y_k|\theta,d_k}[y_k] = \mathbb{E}_{\epsilon|\theta,d_k}[\theta d_k+\epsilon] = \theta d_k, \tag{B.5}$$
$$\mathbb{E}_{y_k|\theta,d_k}\!\left[y_k^2\right] = \mathrm{Var}_{\epsilon|\theta,d_k}[y_k] + \mathbb{E}_{\epsilon|\theta,d_k}[y_k]^2 = \mathrm{Var}_{\epsilon|\theta,d_k}[\theta d_k+\epsilon] + \theta^2 d_k^2 = \sigma_\epsilon^2 + \theta^2 d_k^2, \tag{B.6}$$
for $k = 0, 1$. The substitution for $y_k$ invokes the model in Equation 7.1. For the cross term with $y_0$ and $y_1$ in Equation B.4, the observations are independent conditioned on $\theta$, $d_0$, and $d_1$, due to the independent $\epsilon$ assumption; hence the joint expectation is separable. Again using the linearity of expectation operators, we apply the outer expectation $\mathbb{E}_{\theta|d_0,d_1}$ term-by-term, using the formulas
$$\mathbb{E}_{\theta|d_0,d_1}[\theta] = s_0, \tag{B.7}$$
$$\mathbb{E}_{\theta|d_0,d_1}\!\left[\theta^2\right] = \mathrm{Var}_{\theta|d_0,d_1}[\theta] + \mathbb{E}_{\theta|d_0,d_1}[\theta]^2 = \sigma_0^2 + s_0^2. \tag{B.8}$$
Upon these substitutions, Equation B.4 simplifies to
$$\frac{1}{2}\ln\!\left(d_0^2+d_1^2+\tfrac{1}{9}\right) + \frac{2473854946935173}{2251799813685248}. \tag{B.9}$$
(The rational constant is the Symbolic Math Toolbox's floating-point representation of $\ln 3$.)
The second term in Equation B.1, upon substituting the formula in Equation 7.2 and simplifying, becomes
$$-2\left(\ln 2 - \ln\!\left(\frac{9}{9d_0^2+9d_1^2+1}\right)\right)^2. \tag{B.10}$$
Combining Equations B.9 and B.10, we obtain the analytic formula for the expected utility:
$$U(d_0,d_1) = \frac{1}{2}\ln\!\left(d_0^2+d_1^2+\tfrac{1}{9}\right) + \frac{2473854946935173}{2251799813685248} - 2\left(\ln 2 - \ln\!\left(\frac{9}{9d_0^2+9d_1^2+1}\right)\right)^2. \tag{B.11}$$
Finding the stationary points in the design space $(d_0,d_1) \in [0.1,3]^2$ by setting the gradient to zero and checking the boundaries, the optimal designs satisfy the condition
$$d_0^{*2} + d_1^{*2} = \frac{1}{9}\left[\exp\!\left(\frac{18014398509481984\,\ln 3 - 5117414861322735}{9007199254740992}\right) - 1\right], \tag{B.12}$$
with
$$U(d_0^*, d_1^*) \approx 0.783289. \tag{B.13}$$
The optimal solution is indeed not unique, as there is a “front” of optimal designs. The
expected utility contours and the optimal design front are plotted in Figure B-1.
171
3
0
2.5
−5
d1
2
1.5
−10
1
−15
0.5
−20
0.5
1
1.5
d0
2
2.5
3
−25
Figure B-1: Linear-Gaussian problem: analytic expected utility surface, with the “front” of
optimal designs in dotted black line.
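As an independent numerical check of Equations B.11-B.13 (our illustration, not part of the thesis code), note that $U$ depends on the designs only through $r = d_0^2 + d_1^2$, and that the condition in Equation B.12 is, up to the floating-point rounding of the symbolic constants, $r^* = e^{1/8}/2 - 1/9$:

import numpy as np

def expected_utility(d0, d1):
    # Analytic expected utility of Equation B.11; the long rational constant
    # there is the floating-point representation of ln 3.
    r = d0**2 + d1**2
    return (0.5 * np.log(r + 1.0/9.0) + np.log(3.0)
            - 2.0 * (np.log(2.0) - np.log(9.0 / (9.0*r + 1.0)))**2)

# Maximize over r = d0^2 + d1^2 on a fine grid.
r_grid = np.linspace(1e-4, 9.0, 200001)
u_grid = expected_utility(np.sqrt(r_grid), 0.0)
i = np.argmax(u_grid)
print("r* numerical:", r_grid[i], " analytic:", np.exp(0.125)/2.0 - 1.0/9.0)
print("U* numerical:", u_grid[i], " Equation B.13: 0.783289")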
B.2 Derivation from sequential optimal experimental design

We now present the first steps of a derivation using the sOED formulation. This approach reaches the same optimal designs as the derivation using the batch OED formulation.
We start the derivation from the terminal reward defined in Equation 7.4:
$$\begin{aligned}
J_2(x_2) &= D_{\mathrm{KL}}\!\left(f_{\theta|d_0,y_0,d_1,y_1}(\cdot\,|\,d_0,y_0,d_1,y_1)\,\big\|\,f_\theta(\cdot)\right) - 2\left(\ln\sigma_2^2-\ln 2\right)^2 \\
&= \frac{1}{2}\left[\frac{\sigma_2^2}{\sigma_0^2} + \frac{(s_2-s_0)^2}{\sigma_0^2} - \ln\frac{\sigma_2^2}{\sigma_0^2} - 1\right] - 2\left(\ln\sigma_2^2-\ln 2\right)^2 \\
&= \frac{1}{2}\left[\frac{\sigma_2^2}{9} + \frac{s_2^2}{9} - \ln\frac{\sigma_2^2}{9} - 1\right] - 2\left(\ln\sigma_2^2-\ln 2\right)^2,
\end{aligned} \tag{B.14}$$
where the second equality is due to the analytic formula for the Kullback-Leibler divergence between two univariate Gaussians, and the third equality follows upon simplification from $s_0 = 0$ and $\sigma_0^2 = 9$ for this problem. Substituting this into Bellman's equation (Equation 3.3) produces
$$\begin{aligned}
J_1(x_1) &= \max_{d_1} \mathbb{E}_{y_1|x_1,d_1}\!\left[g_1(x_1,y_1,d_1) + J_2(F_1(x_1,y_1,d_1))\right] \\
&= \max_{d_1} \mathbb{E}_{y_1|x_1,d_1}\!\left[\frac{1}{2}\left(\frac{\sigma_2^2}{9} + \frac{s_2^2}{9} - \ln\frac{\sigma_2^2}{9} - 1\right) - 2\left(\ln\sigma_2^2-\ln 2\right)^2\right] \\
&= \max_{d_1} \mathbb{E}_{y_1|x_1,d_1}\!\Bigg[\frac{1}{2}\Bigg\{\frac{\sigma_1^2}{9\left(\sigma_1^2d_1^2+1\right)} + \frac{1}{9}\left(\frac{y_1\sigma_1^2d_1+s_1}{\sigma_1^2d_1^2+1}\right)^2 - \ln\!\left(\frac{\sigma_1^2}{9\left(\sigma_1^2d_1^2+1\right)}\right) - 1\Bigg\} \\
&\qquad\qquad - 2\left(\ln\frac{\sigma_1^2}{\sigma_1^2d_1^2+1} - \ln 2\right)^2\Bigg] \\
&= \max_{d_1} \Bigg[\frac{1}{2}\Bigg\{\frac{\sigma_1^2}{9\left(\sigma_1^2d_1^2+1\right)} + \frac{\mathbb{E}_{y_1|x_1,d_1}\!\left[y_1^2\sigma_1^4d_1^2 + s_1^2 + 2y_1\sigma_1^2d_1s_1\right]}{9\left(\sigma_1^2d_1^2+1\right)^2} - \ln\!\left(\frac{\sigma_1^2}{9\left(\sigma_1^2d_1^2+1\right)}\right) - 1\Bigg\} \\
&\qquad\qquad - 2\left(\ln\frac{\sigma_1^2}{\sigma_1^2d_1^2+1} - \ln 2\right)^2\Bigg],
\end{aligned} \tag{B.15}$$
where we have substituted for $s_2$ and $\sigma_2^2$ with the analytic formulas from Equation 7.2, and also made use of $g_1 = 0$ and $\sigma_\epsilon^2 = 1$ for this problem.
The next step requires taking the expectation with respect to $y_1|x_1,d_1$; this is equivalent to taking the expectation with respect to $y_1|s_1,\sigma_1^2,d_1$, since $x_1$ is completely described by its mean and variance in this conjugate Gaussian setting. Intending to use the linearity of expectation and apply the expectation term-by-term, we develop the identities
$$\begin{aligned}
\mathbb{E}_{y_1|s_1,\sigma_1^2,d_1}[y_1] &= \int_{-\infty}^{+\infty} y_1\, f(y_1|s_1,\sigma_1^2,d_1)\,dy_1 \\
&= \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} y_1\, f(y_1,\theta|s_1,\sigma_1^2,d_1)\,dy_1\,d\theta \\
&= \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} y_1\, f(y_1|\theta,s_1,\sigma_1^2,d_1)\,f(\theta|s_1,\sigma_1^2,d_1)\,dy_1\,d\theta \\
&= \int_{-\infty}^{+\infty} d_1\theta\, f(\theta|s_1,\sigma_1^2,d_1)\,d\theta \\
&= d_1 s_1,
\end{aligned} \tag{B.16}$$
and
$$\begin{aligned}
\mathbb{E}_{y_1|s_1,\sigma_1^2,d_1}\!\left[y_1^2\right] &= \int_{-\infty}^{+\infty} y_1^2\, f(y_1|s_1,\sigma_1^2,d_1)\,dy_1 \\
&= \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} y_1^2\, f(y_1,\theta|s_1,\sigma_1^2,d_1)\,dy_1\,d\theta \\
&= \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} y_1^2\, f(y_1|\theta,s_1,\sigma_1^2,d_1)\,f(\theta|s_1,\sigma_1^2,d_1)\,dy_1\,d\theta \\
&= \int_{-\infty}^{+\infty} \left(\sigma_\epsilon^2 + d_1^2\theta^2\right) f(\theta|s_1,\sigma_1^2,d_1)\,d\theta \\
&= \sigma_\epsilon^2 + d_1^2\left(\sigma_1^2+s_1^2\right) \\
&= 1 + d_1^2\left(\sigma_1^2+s_1^2\right),
\end{aligned} \tag{B.17}$$
where we have applied the property $\mathrm{Var}(y) = \mathbb{E}\!\left[y^2\right] - \left(\mathbb{E}[y]\right)^2$ twice, and the last equality uses $\sigma_\epsilon^2 = 1$. Substituting Equations B.16 and B.17 into Equation B.15 then yields
" (
1
(1 + d21 (σ12 + s21 ))σ14 d21 + s21 + 2σ12 d21 s21
σ12
J1 (x1 ) = max
+
2
d1
2 9 σ12 d21 + 1
9 σ12 d21 + 1
!
)
2 #
σ12
σ12
− 1 − 2 ln 2 2
− ln 2
− ln
σ1 d1 + 1
9 σ12 d21 + 1
≡ max J¯1 (x1 , d1 ).
d1
(B.18)
(B.19)
To find the optimal $d_1$, we take the partial derivative of $\bar{J}_1$ with respect to $d_1$ and set it to zero, attaining three stationary points: $0$ and $\pm\sqrt{\dfrac{e^{1/8}\sigma_1^2-2}{2\sigma_1^2}}$. The only feasible candidate in the design space is $+\sqrt{\dfrac{e^{1/8}\sigma_1^2-2}{2\sigma_1^2}}$, and it needs to be checked for global optimality along with the boundary values. This is a hideous process involving verification of second-derivative properties over different regions of $x_1$, and we omit the details here. Nonetheless, the optimum can ultimately be shown to be $d_1^* = +\sqrt{\dfrac{e^{1/8}\sigma_1^2-2}{2\sigma_1^2}}$. Substituting for $\sigma_1^2$ using Equation 7.2 and $s_0 = 0$, $\sigma_0^2 = 9$, and $\sigma_\epsilon^2 = 1$ from this problem, we arrive at the final relationship between $d_0^*$ and $d_1^*$, which is exactly Equation B.12.
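A quick numerical check (our illustration) confirms this: substituting $\sigma_1^2 = 9/(9d_0^2+1)$ from Equation 7.2 into the expression for $d_1^{*2}$ gives $d_0^2 + d_1^{*2} = e^{1/8}/2 - 1/9$ for any $d_0$ on the front, recovering Equation B.12.

import numpy as np

# The sOED stationary point reproduces the batch optimal front:
# with sigma_1^2 = 9/(9*d0^2 + 1), d1*^2 = exp(1/8)/2 - 1/9 - d0^2,
# so d0^2 + d1*^2 is constant along the front.
for d0 in [0.2, 0.4, 0.6]:
    s1sq = 9.0 / (9.0 * d0**2 + 1.0)
    d1sq = (np.exp(0.125) * s1sq - 2.0) / (2.0 * s1sq)
    print(d0**2 + d1sq)   # all equal exp(1/8)/2 - 1/9, approximately 0.45546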
Bibliography
[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions With Formulas, Graphs, and Mathematical Tables. U.S. Department of Commerce, NIST (National
Institute of Standards and Technology), Washington, DC, 1972.
[2] A. K. Agarwal and M. L. Brisk. Sequential Experimental Design for Precise Parameter
Estimation. 2. Design Criteria. Industrial & Engineering Chemistry Process Design
and Development, 24(1):207–210, 1985.
[3] S. Ahmed and A. Shapiro. The Sample Average Approximation Method for Stochastic
Programs with Integer Recourse. Technical report, Georgia Institute of Technology,
2002.
[4] N. M. Alexandrov, R. M. Lewis, C. R. Gumbert, L. L. Green, and P. A. Newman.
Approximation and Model Management in Aerodynamic Optimization with Variable-Fidelity Models. Journal of Aircraft, 38(6):1093–1101, 2001.
[5] L. Ambrosio and N. Gigli. A User’s Guide to Optimal Transport. In Modelling and
Optimisation of Flows on Networks, pages 1–155. Springer Berlin Heidelberg, Berlin,
Germany, 2013.
[6] B. Amzal, F. Y. Bois, E. Parent, and C. P. Robert. Bayesian-Optimal Design
via Interacting Particle Systems. Journal of the American Statistical Association,
101(474):773–785, 2006.
[7] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis.
Springer New York, New York, NY, 2007.
[8] M. Athans. The Role and Use of the Stochastic Linear-Quadratic-Gaussian Problem
in Control System Design. IEEE Transactions on Automatic Control (Institute of
Electrical and Electronics Engineers), 16(6):529–552, 1971.
[9] A. C. Atkinson and A. N. Donev. Optimum Experimental Designs. Oxford University
Press, New York, NY, 1992.
[10] F. Augustin and Y. M. Marzouk. NOWPAC: A provably convergent derivative-free
nonlinear optimizer with path-augmented constraints. arXiv preprint arXiv:1403.1931,
2014.
[11] D. L. Baulch, C. T. Bowman, C. J. Cobos, R. A. Cox, T. Just, J. A. Kerr, T. Murrells,
M. J. Pilling, D. Stocker, J. Troe, W. Tsang, R. W. Walker, and J. Warnatz. Evaluated
Kinetic Data for Combustion Modeling: Supplement II. Journal of Physical and
Chemical Reference Data, 34(3):757–1397, 2005.
[12] D. L. Baulch, C. J. Cobos, R. A. Cox, P. Frank, G. Hayman, T. Just, J. A. Kerr,
T. Murrells, M. J. Pilling, J. Troe, R. W. Walker, and J. Warnatz. Evaluated Kinetic
Data for Combustion Modeling: Supplement I. Journal of Physical and Chemical
Reference Data, 23(6):847–1033, 1994.
[13] R. Bellman. Bottleneck Problems and Dynamic Programming. Proceedings of the
National Academy of Sciences of the United States of America, 39(9):947–951, 1953.
[14] R. Bellman. Dynamic Programming and Lagrange Multipliers. Proceedings of the
National Academy of Sciences of the United States of America, 42(10):767–769, 1956.
[15] I. Ben-Gal and M. Caramanis. Sequential DOE via dynamic programming. IIE Transactions (Institute of Industrial Engineers), 34(12):1087–1100, 2002.
[16] M. Benisch, A. Greenwald, V. Naroditskiy, and M. C. Tschantz. A Stochastic Programming Approach to Scheduling in TAC SCM. In Proceedings of the 5th ACM
Conference on Electronic Commerce (Association of Computing Machinery), pages
152–159, New York, NY, 2004.
[17] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic
Approximation. Springer Berlin Heidelberg, Berlin, Germany, 1990.
[18] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer New York,
New York, NY, 1985.
[19] G. Berkooz, P. Holmes, and J. L. Lumley. The Proper Orthogonal Decomposition
in the Analysis of Turbulent Flows. Annual Review of Fluid Mechanics, 25:539–575,
1993.
[20] P. Bernard and B. Buffoni. Optimal mass transportation and Mather theory. Journal
of the European Mathematical Society, 9(1):85–121, 2007.
[21] D. A. Berry, P. Müller, A. P. Grieve, M. Smith, T. Parke, R. Blazek, N. Mitchard, and
M. Krams. Adaptive Bayesian Designs for Dose-Ranging Drug Trials. In Case studies
in Bayesian statistics, pages 99–181. Springer New York, New York, NY, 2002.
[22] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. 1. Athena Scientific, Belmont, MA, 2005.
[23] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. 2. Athena Scientific, Belmont, MA, 2007.
[24] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific,
Belmont, MA, 1996.
[25] D. Blackwell. Comparison of Experiments. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, pages 93–102, Berkeley, CA, 1951.
[26] D. Blackwell. Equivalent Comparisons of Experiments. The Annals of Mathematical
Statistics, 24(2):265–272, 1953.
[27] N. Bonnotte. From Knothe’s Rearrangement to Brenier’s Optimal Transport Map.
SIAM Journal on Mathematical Analysis (Society for Industrial and Applied Mathematics), 45(1):64–87, 2013.
[28] A. J. Booker, J. E. Dennis, P. D. Frank, D. B. Serafini, V. Torczon, and M. W. Trosset.
A rigorous framework for optimization of expensive functions by surrogates. Structural
Optimization, 17(1):1–13, 1999.
[29] G. E. P. Box. Science and Statistics. Journal of the American Statistical Association,
71(356):791–799, 1976.
[30] G. E. P. Box. Sequential Experimentation and Sequential Assembly of Designs. Quality
Engineering, 5(2):321–330, 1992.
[31] G. E. P. Box and N. R. Draper. Empirical Model-Building and Response Surfaces.
John Wiley & Sons, Hoboken, NJ, 1987.
[32] G. E. P. Box, J. S. Hunter, and W. G. Hunter. Statistics for Experimenters: Design,
Innovation and Discovery. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2005.
[33] G. E. P. Box and H. L. Lucas. Design of Experiments in Non-Linear Situations.
Biometrika, 46(1-2):77–90, 1959.
[34] S. J. Bradtke and A. G. Barto. Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning, 22(1-3):33–57, 1996.
[35] Y. Brenier. Polar Factorization and Monotone Rearrangement of Vector-Valued Functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
[36] S. Bringezu, H. Schütz, M. O’Brien, L. Kauppi, R. W. Howarth, and J. McNeely.
Towards Sustainable Production and Use of Resources: Assessing Biofuels. Technical
report, United Nations Environment Programme, 2009.
[37] A. E. Brockwell and J. B. Kadane. A Gridding Method for Bayesian Sequential
Decision Problems. Journal of Computational and Graphical Statistics, 12(3):566–584,
2003.
[38] T. Bui-Thanh, K. Willcox, and O. Ghattas. Model Reduction for Large-Scale Systems with High-Dimensional Parametric Input Space. SIAM Journal on Scientific
Computing (Society for Industrial and Applied Mathematics), 30(6):3270–3288, 2008.
[39] R. H. Cameron and W. T. Martin. The Orthogonal Development of Non-Linear
Functionals in Series of Fourier-Hermite Functionals. The Annals of Mathematics,
48(2):385–392, 1947.
[40] G. Carlier, A. Galichon, and F. Santambrogio. From Knothe’s Transport to Brenier’s
Map and a Continuation Method for Optimal Transport. SIAM Journal on Mathematical Analysis (Society for Industrial and Applied Mathematics), 41(6):2554–2576,
2010.
[41] B. P. Carlin, J. B. Kadane, and A. E. Gelfand. Approaches for Optimal Sequential Decision Analysis in Clinical Trials. Biometrics, 54(3):964–975, 1998.
[42] D. R. Cavagnaro, J. I. Myung, M. A. Pitt, and J. V. Kujala. Adaptive Design Optimization: A Mutual Information-Based Approach to Model Discrimination in Cognitive Science. Neural Computation, 22(4):887–905, 2010.
[43] K. Chaloner and I. Verdinelli. Bayesian Experimental Design: A Review. Statistical
Science, 10(3):273–304, 1995.
[44] T. Champion and L. De Pascale. The Monge Problem in R^d. Duke Mathematical
Journal, 157(3):551–572, 2011.
[45] P. Chaudhuri and P. A. Mykland. Nonlinear Experiments: Optimal Design and
Inference Based on Likelihood. Journal of the American Statistical Association,
88(422):538–546, 1993.
[46] H. Chen and B. W. Schmeiser. Retrospective Approximation Algorithms for Stochastic
Root Finding. In Proceedings of the 1994 Winter Simulation Conference, pages 255–
261, Lake Buena Vista, FL, 1994.
[47] H. Chen and B. W. Schmeiser. Stochastic root finding via retrospective approximation.
IIE Transactions (Institute of Industrial Engineers), 33(3):259–275, 2001.
[48] J. A. Christen and M. Nakamura. Sequential Stopping Rules for Species Accumulation.
Journal of Agricultural, Biological & Environmental Statistics, 8(2):184–195, 2003.
[49] Y. Chu and J. Hahn. Integrating Parameter Selection with Experimental Design Under
Uncertainty for Nonlinear Dynamic Systems. AIChE Journal (American Institute of
Chemical Engineers), 54(9):2310–2320, 2008.
[50] C. W. Clenshaw and A. R. Curtis. A method for numerical integration on an automatic
computer. Numerische Mathematik, 2(1):197–205, 1960.
[51] M. A. Clyde. Bayesian Optimal Designs for Approximate Normality. PhD thesis,
University of Minnesota, 1993.
[52] M. A. Clyde, P. Müller, and G. Parmigiani. Exploring Expected Utility Surfaces by
Markov Chains. Technical report, Duke University, 1996.
[53] P. R. Conrad and Y. M. Marzouk. Adaptive Smolyak Pseudospectral Approximations.
SIAM Journal on Scientific Computing (Society for Industrial and Applied Mathematics), 35(6):A2643–A2670, 2013.
[54] P. G. Constantine, M. S. Eldred, and E. T. Phipps. Sparse pseudospectral approximation method. Computer Methods in Applied Mechanics and Engineering, 229-232:1–12,
2012.
[55] T. A. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons,
Hoboken, NJ, 2nd edition, 2006.
[56] D. R. Cox and N. Reid. The Theory of the Design of Experiments. Chapman &
Hall/CRC, Boca Raton, FL, 2000.
[57] C. Darken and J. E. Moody. Note on Learning Rate Schedules for Stochastic Optimization. In Advances in Neural Information Processing Systems 3, pages 832–838,
Denver, CO, 1990.
[58] D. F. Davidson and R. K. Hanson. Interpreting Shock Tube Ignition Data. International Journal of Chemical Kinetics, 36(9):510–523, 2004.
[59] B. J. Debusschere, H. N. Najm, P. P. Pébay, O. M. Knio, R. G. Ghanem, and O. P.
Le Maître. Numerical Challenges in the Use of Polynomial Chaos Representations for
Stochastic Processes. SIAM Journal on Scientific Computing (Society for Industrial
and Applied Mathematics), 26(2):698–719, 2004.
[60] M. H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, Hoboken, NJ,
2004.
[61] H. A. Dror and D. M. Steinberg. Sequential Experimental Designs for Generalized
Linear Models. Journal of the American Statistical Association, 103(481):288–298,
2008.
[62] C. C. Drovandi, J. M. McGree, and A. N. Pettitt. Sequential Monte Carlo for Bayesian
sequentially designed experiments for discrete data. Computational Statistics and Data
Analysis, 57:320–335, 2013.
[63] C. C. Drovandi, J. M. McGree, and A. N. Pettitt. A Sequential Monte Carlo Algorithm to Incorporate Model Uncertainty in Bayesian Sequential Design. Journal of
Computational and Graphical Statistics, 23(1):3–24, 2014.
[64] T. A. El Moselhy and Y. M. Marzouk. Bayesian inference with optimal maps. Journal
of Computational Physics, 231(23):7815–7850, 2012.
[65] M. S. Eldred, A. A. Giunta, and S. S. Collis. Second-Order Corrections for Surrogate-Based Optimization with Model Hierarchies. In 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference (American Institute of Aeronautics and Astronautics, International Society of Structural and Multidisciplinary Optimization), Albany, NY, 2004.
[66] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, NY,
1972.
[67] C. Feng. Optimal Bayesian experimental design in the presence of model error. Master’s thesis, Massachusetts Institute of Technology, 2015.
[68] D. Feyel and A. S. Üstünel. Monge-Kantorovitch Measure Transportation and Monge-Ampère Equation on Wiener Space. Probability Theory and Related Fields, 128(3):347–385, 2004.
[69] R. A. Fisher. The Design of Experiments. Oliver & Boyd, Edinburgh, United Kingdom,
8th edition, 1966.
[70] I. Ford, D. M. Titterington, and C. P. Kitsos. Recent Advances in Nonlinear Experimental Design. Technometrics, 31(1):49–60, 1989.
[71] M. Frangos, Y. M. Marzouk, K. Willcox, and B. van Bloemen Waanders. Surrogate
and Reduced-Order Modeling: A Comparison of Approaches for Large-Scale Statistical
Inverse Problems. In Large-Scale Inverse Problems and Quantification of Uncertainty,
pages 123–149. John Wiley & Sons, Chichester, United Kingdom, 2010.
[72] M. Frenklach. Transforming data into knowledge-Process Informatics for combustion
chemistry. Proceedings of the Combustion Institute, 31(1):125–140, 2007.
[73] T. Gerstner and M. Griebel. Dimension-Adaptive Tensor-Product Quadrature. Computing, 71(1):65–87, 2003.
[74] R. G. Ghanem and P. D. Spanos. Stochastic Finite Elements: A Spectral Approach.
Springer New York, New York, NY, 1st edition, 1991.
[75] J. Ginebra. On the Measure of the Information in a Statistical Experiment. Bayesian
Analysis, 2(1):167–212, 2007.
[76] P. Glasserman. Gradient Estimation via Perturbation Analysis. Kluwer Academic
Publishers, Boston, MA, 1991.
[77] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM (Association of Computing Machinery), 33(10):75–84, 1990.
[78] G. J. Gordon. Stable Function Approximation in Dynamic Programming. In Proceedings of the 12th International Conference on Machine Learning, pages 261–268, Tahoe
City, CA, 1995.
[79] A. Greenwald, B. Guillemette, V. Naroditskiy, and M. C. Tschantz. Scaling Up the
Sample Average Approximation Method for Stochastic Optimization with Applications
to Trading Agents. In Agent-Mediated Electronic Commerce. Designing Trading Agents
and Mechanisms, pages 187–199. Springer Berlin Heidelberg, Berlin, Germany, 2006.
[80] T. Guest and A. Curtis. Iteratively constructive sequential design of experiments and
surveys with nonlinear parameter-data relationships. Journal of Geophysical Research,
114(B04307):1–14, Apr. 2009.
[81] C. Guestrin, A. Krause, and A. P. Singh. Near-Optimal Sensor Placements in Gaussian
Processes. In Proceedings of the 22nd International Conference on Machine Learning,
pages 265–272, Bonn, Germany, 2005.
[82] G. Gürkan, A. Y. Özge, and S. M. Robinson. Sample-Path Optimization in Simulation.
In Proceedings of the 1994 Winter Simulation Conference, pages 247–254, Lake Buena
Vista, FL, 1994.
[83] I. Guyon, M. Nikravesh, S. Gunn, and L. A. Zadeh. Feature Extraction: Foundations
and Applications. Springer Berlin Heidelberg, Berlin, Germany, 2006.
[84] M. Hamada, H. F. Martz, C. S. Reese, and A. G. Wilson. Finding Near-Optimal
Bayesian Experimental Designs via Genetic Algorithms. The American Statistician,
55(3):175–181, 2001.
[85] K. Healy and L. W. Schruben. Retrospective Simulation Response Optimization. In
Proceedings of the 1991 Winter Simulation Conference, pages 901–906, Phoenix, AZ,
1991.
[86] D. A. Hickman and L. D. Schmidt. Production of Syngas by Direct Catalytic Oxidation
of Methane. Science, 259(5093):343–346, 1993.
[87] Y. C. Ho and X. Cao. Perturbation Analysis and Optimization of Queueing Networks.
Journal of Optimization Theory and Applications, 40(4):559–582, 1983.
[88] S. Hosder, R. Walters, and R. Perez. A non-intrusive polynomial chaos method for uncertainty propagation in CFD simulations. In Proceedings of the 44th AIAA Aerospace
Sciences Meeting and Exhibit (American Institute of Aeronautics and Astronautics),
Reno, NV, 2006.
[89] X. Huan. Accelerated Bayesian Experimental Design for Chemical Kinetic Models.
Master’s thesis, Massachusetts Institute of Technology, 2010.
[90] X. Huan and Y. M. Marzouk. Simulation-based optimal Bayesian experimental design
for nonlinear systems. Journal of Computational Physics, 232(1):288–317, 2013.
[91] X. Huan and Y. M. Marzouk. Gradient-Based Stochastic Optimization Methods in
Bayesian Experimental Design. International Journal for Uncertainty Quantification,
4(6):479–510, 2014.
[92] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially
observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
[93] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning: A Survey.
Journal of Artificial Intelligence Research, 4:237–285, 1996.
[94] M. C. Kennedy and A. O’Hagan. Bayesian calibration of computer models. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425–464, 2001.
[95] J. Kiefer and J. Wolfowitz. Stochastic Estimation of the Maximum of a Regression
Function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[96] W. Kim, M. A. Pitt, Z.-L. Lu, M. Steyvers, and J. I. Myung. A Hierarchical Adaptive
Approach to Optimal Experimental Design. Neural Computation, 26:2565–2492, 2014.
[97] A. J. Kleywegt, A. Shapiro, and T. Homem-de Mello. The Sample Average Approximation Method for Stochastic Discrete Optimization. SIAM Journal on Optimization
(Society for Industrial and Applied Mathematics), 12(2):479–502, 2002.
[98] H. Knothe. Contributions to the Theory of Convex Bodies. The Michigan Mathematical Journal, 4(1):39–52, 1957.
[99] A. Krause and C. Guestrin. Near-optimal Observation Selection Using Submodular
Functions. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (Association for the Advancement of Artificial Intelligence), pages 1650–1654, Vancouver,
Canada, 2007.
[100] A. Krause, J. Leskovec, C. Guestrin, J. VanBriesen, and C. Faloutsos. Efficient Sensor
Placement Optimization for Securing Large Water Distribution Networks. Journal of
Water Resources Planning and Management, 134(6):516–526, 2008.
[101] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient Point-Based POMDP
Planning by Approximating Optimally Reachable Belief Spaces. In Proceedings of
Robotics: Science and Systems, 2008, pages 65–72, Zurich, Switzerland, 2008.
[102] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms
and Applications. Springer New York, New York, NY, 2nd edition, 2003.
[103] M. Lagoudakis. Least-Squares Policy Iteration. The Journal of Machine Learning
Research, 4:1107–1149, 2003.
[104] O. P. Le Maître and O. M. Knio. Spectral Methods for Uncertainty Quantification:
with Applications to Computational Fluid Dynamics. Springer Netherlands, Houten,
Netherlands, 2010.
[105] D. V. Lindley. On a Measure of the Information Provided by an Experiment. The
Annals of Mathematical Statistics, 27(4):986–1005, 1956.
[106] D. V. Lindley. Bayesian Statistics: A Review. SIAM (Society for Industrial and
Applied Mathematics), Philadelphia, PA, 1972.
[107] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining.
Springer US, New York, NY, 1998.
[108] T. J. Loredo. Rotating Stars and Revolving Planets: Bayesian Exploration of the Pulsating Sky. In Bayesian Statistics 9: Proceedings of the Ninth Valencia International
Meeting, pages 361–392, Benidorm, Spain, 2010.
[109] T. J. Loredo and D. F. Chernoff. Bayesian Adaptive Exploration. In Statistical Challenges in Astronomy, pages 57–70. Springer New York, New York, NY, 2003.
[110] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, Cambridge, United Kingdom, 4th edition, 2005.
[111] W.-K. Mak, D. P. Morton, and R. K. Wood. Monte Carlo bounding techniques for
determining solution quality in stochastic programs. Operations Research Letters,
24(1-2):47–56, 1999.
[112] Y. M. Marzouk and H. N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics,
228(6):1862–1902, 2009.
[113] Y. M. Marzouk, H. N. Najm, and L. A. Rahn. Stochastic spectral methods for efficient
Bayesian solution of inverse problems. Journal of Computational Physics, 224(2):560–
586, 2007.
[114] Y. M. Marzouk and D. Xiu. A Stochastic Collocation Approach to Bayesian Inference
in Inverse Problems. Communications in Computational Physics, 6(4):826–847, 2009.
[115] R. J. McCann. Existence and Uniqueness of Monotone Measure-Preserving Maps.
Duke Mathematical Journal, 80(2):309–323, 1995.
[116] G. Monge. Mémoire sur la théorie des déblais et des remblais. In Histoire de l’Académie
Royale des Sciences de Paris, avec les Mémoires de Mathématique et de Physique pour
la même année, pages 666–704. De l’Imprimerie Royale, Paris, France, 1781.
[117] S. Mosbach, A. Braumann, P. L. W. Man, C. A. Kastner, G. P. E. Brownbridge, and
M. Kraft. Iterative improvement of Bayesian parameter estimates for an engine model
by means of experimental design. Combustion and Flame, 159(3):1303–1313, 2012.
[118] P. Müller. Simulation Based Optimal Design. Handbook of Statistics, 25:509–518,
2005.
[119] P. Müller, D. A. Berry, A. P. Grieve, M. Smith, and M. Krams. Simulation-based sequential Bayesian design. Journal of Statistical Planning and Inference, 137(10):3140–
3150, 2007.
[120] P. Müller and G. Parmigiani. Optimal Design via Curve Fitting of Monte Carlo
Experiments. Journal of the American Statistical Association, 90(432):1322–1330,
1995.
[121] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian Design by Inhomogeneous Markov Chain Simulation. Journal of the American Statistical Association,
99(467):788–798, 2004.
[122] S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 65(2):331–366, 2003.
[123] H. N. Najm. Uncertainty Quantification and Polynomial Chaos Techniques in Computational Fluid Dynamics. Annual Review of Fluid Mechanics, 41:35–52, 2009.
[124] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer
Journal, 7(4):308–313, 1965.
[125] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation
approach to stochastic programming. SIAM Journal on Optimization (Society for
Industrial and Applied Mathematics), 19(4):1574–1609, 2009.
[126] J. Nocedal and S. J. Wright. Numerical Optimization. Springer New York, New York,
NY, 2006.
[127] V. Norkin, G. Pflug, and A. Ruszczynski. A branch and bound method for stochastic
global optimization. Mathematical Programming, 83(1-3):425–450, 1998.
[128] I. Olkin and F. Pukelsheim. The Distance between Two Random Vectors with Given
Dispersion Matrices. Linear Algebra and its Applications, 48:257–263, 1982.
[129] D. Ormoneit and S. Sen. Kernel-Based Reinforcement Learning. Machine Learning,
49(2-3):161–178, 2002.
[130] G. Parmigiani and L. Y. T. Inoue. Decision Theory: Principles and Approaches. John
Wiley & Sons, West Sussex, United Kingdom, 2009.
[131] M. D. Parno. Transport maps for accelerated Bayesian computation. PhD thesis,
Massachusetts Institute of Technology, 2015.
[132] M. D. Parno and Y. M. Marzouk. Transport map accelerated Markov chain Monte
Carlo. arXiv preprint arXiv:1412.5492, 2015.
[133] B. D. Phenix, J. L. Dinaro, M. A. Tatang, J. W. Tester, J. B. Howard, and G. J.
Mcrae. Incorporation of Parametric Uncertainty into Complex Kinetic Mechanisms:
Application to Hydrogen Oxidation in Supercritical Water. Combustion and Flame,
112(1-2):132–146, 1998.
[134] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. International Joint Conference on Artificial Intelligence, 3:1025–
1032, 2003.
[135] B. T. Polyak and A. B. Juditsky. Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization (Society for Industrial and Applied
Mathematics), 30(4):838–855, 1992.
[136] J. Porta and N. Vlassis. Point-Based Value Iteration for Continuous POMDPs. The
Journal of Machine Learning Research, 7:2329–2367, 2006.
[137] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2011.
[138] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Hoboken, NJ, 1994.
[139] A. J. Ragauskas, C. K. Williams, B. H. Davison, G. Britovsek, J. Cairney, C. A. Eckert,
W. J. Frederick, J. P. Hallett, D. J. Leak, C. L. Liotta, J. R. Mielenz, R. Murphy,
R. Templer, and T. Tschaplinski. The Path Forward for Biofuels and Biomaterials.
Science, 311(5760):484–489, 2006.
[140] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning.
The MIT Press, Cambridge, MA, 2006.
[141] M. T. Reagan, H. N. Najm, R. G. Ghanem, and O. M. Knio. Uncertainty quantification
in reacting-flow simulations through non-intrusive spectral projection. Combustion and
Flame, 132(3):545–555, 2003.
[142] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of
Mathematical Statistics, 22(3):400–407, 1951.
[143] M. Rosenblatt. Remarks on a Multivariate Transformation. The Annals of Mathematical Statistics, 23(3):470–472, 1952.
[144] T. Russi, A. Packard, R. Feeley, and M. Frenklach. Sensitivity Analysis of Uncertainty
in Model Prediction. The Journal of Physical Chemistry A, 112(12):2579–2588, 2008.
[145] K. J. Ryan. Estimating Expected Information Gains for Experimental Designs With
Application to the Random Fatigue-Limit Model. Journal of Computational and
Graphical Statistics, 12(3):585–603, 2003.
[146] T. J. Santner, B. J. Williams, and W. I. Notz. The Design and Analysis of Computer
Experiments. Springer New York, New York, NY, 2003.
[147] P. Schütz, A. Tomasgard, and S. Ahmed. Supply chain design under uncertainty
using sample average approximation and dual decomposition. European Journal of
Operational Research, 199(2):409–419, 2009.
[148] P. Sebastiani and H. P. Wynn. Maximum entropy sampling and optimal Bayesian
experimental design. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 62(1):145–157, 2000.
[149] A. Shapiro. Asymptotic Analysis of Stochastic Programs. Annals of Operations Research, 30(1):169–186, 1991.
[150] A. Shapiro. Stochastic Programming by Monte Carlo Simulation Methods. Technical
report, Georgia Institute of Technology, 2003.
[151] A. Shapiro and A. Philpott. A Tutorial on Stochastic Programming. Technical report,
Georgia Institute of Technology, 2007.
[152] D. A. Shea and S. A. Lister. The BioWatch Program: Detection of Bioterrorism.
Technical report, Congressional Research Service Report, 2003.
[153] S. Sherman. On a theorem of Hardy, Littlewood, Polya, and Blackwell. Proceedings
of the National Academy of Sciences, 37(12):826–831, 1951.
[154] O. Sigaud and O. Buffet. Markov Decision Processes in Artificial Intelligence: MDPs,
beyond MDPs and applications. John Wiley & Sons, Hoboken, NJ, 2010.
[155] L. Sirovich. Turbulence and the Dynamics of Coherent Structures, Part I: Coherent
Structures. Quarterly of applied mathematics, 45(3):561–571, 1987.
[156] D. S. Sivia and J. Skilling. Data Analysis: A Bayesian Tutorial. Oxford University
Press, New York, NY, 2nd edition, 2006.
[157] R. D. Smallwood and E. J. Sondik. The Optimal Control of Partially Observable
Markov Processes Over a Finite Horizon. Operations Research, 21(5):1071–1088, 1973.
[158] A. Solonen, H. Haario, and M. Laine. Simulation-Based Optimal Design Using a
Response Variance Criterion. Journal of Computational and Graphical Statistics,
21(1):234–252, 2012.
[159] E. J. Sondik. The optimal control of partially observable Markov processes. PhD thesis,
Stanford University, 1971.
[160] J. C. Spall. Accelerated Second-Order Stochastic Optimization Using Only Function
Measurements. In Proceedings of the 36th IEEE Conference on Decision and Control
(Institute of Electrical and Electronics Engineers), pages 1417–1424, San Diego, CA,
1997.
[161] J. C. Spall. Implementation of the Simultaneous Perturbation Algorithm for Stochastic
Optimization. IEEE Transactions on Aerospace and Electronic Systems (Institute of
Electrical and Electronics Engineers), 34(3):817–823, 1998.
[162] C. Stein. Notes on a seminar on theoretical statistics; Comparison of experiments.
Technical report, University of Chicago, 1951.
[163] R. S. Sutton. Learning to Predict by the Methods of Temporal Differences. Machine
Learning, 3(1):9–44, 1988.
[164] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT
Press, Cambridge, MA, 1998.
[165] C. Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool, San
Rafael, CA, 2010.
[166] G. Terejanu, R. R. Upadhyay, and K. Miki. Bayesian experimental design for the
active nitridation of graphite by atomic nitrogen. Experimental Thermal and Fluid
Science, 36:178–193, 2012.
[167] G. Tesauro and G. R. Galperin. On-line Policy Improvement using Monte Carlo Search.
In Advances in Neural Information Processing Systems 9, pages 1068–1074, Denver,
CO, 1996.
[168] J. N. Tsitsiklis and B. Van Roy. Regression Methods for Pricing Complex American-Style Options. IEEE Transactions on Neural Networks (Institute of Electrical and
Electronics Engineers), 12(4):694–703, 2001.
[169] J. van den Berg, A. Curtis, and J. Trampert. Optimal nonlinear Bayesian experimental
design: an application to amplitude versus offset experiments. Geophysical Journal
International, 155(2):411–421, Nov. 2003.
[170] B. Verweij, S. Ahmed, A. J. Kleywegt, G. Nemhauser, and A. Shapiro. The Sample
Average Approximation Method Applied to Stochastic Routing Problems: A Computational Study. Computational Optimization and Applications, 24(2):289–333, 2003.
[171] C. Villani. Optimal Transport: Old and New. Springer-Verlag Berlin Heidelberg,
Berlin, Germany, 2008.
[172] U. Von Toussaint. Bayesian inference in physics. Reviews of Modern Physics, 83:943–
999, 2011.
[173] R. W. Walters. Towards Stochastic Fluid Mechanics via Polynomial Chaos. In Proceedings of the 41st AIAA Aerospace Sciences Meeting and Exhibit (American Institute
of Aeronautics and Astronautics), Reno, NV, 2003.
[174] J. K. Wathen and J. A. Christen. Implementation of Backward Induction for Sequentially Adaptive Clinical Trials. Journal of Computational and Graphical Statistics,
15(2):398–413, 2006.
[175] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, King’s College, 1989.
[176] C. J. C. H. Watkins and P. Dayan. Technical Note: Q-Learning. Machine Learning,
8(3-4):279–292, 1992.
[177] B. P. Weaver, B. J. Williams, C. M. Anderson-Cook, and D. M. Higdon. Computational Enhancements to Bayesian Design of Experiments Using Gaussian Processes.
Bayesian Analysis, 2015.
[178] N. Wiener. The Homogeneous Chaos. American Journal of Mathematics, 60(4):897–
936, 1938.
[179] D. Xiu. Fast Numerical Methods for Stochastic Computations: A Review. Communications in Computational Physics, 5(2-4):242–272, 2009.
[180] D. Xiu and G. E. Karniadakis. The Wiener-Askey Polynomial Chaos for Stochastic
Differential Equations. SIAM Journal on Scientific Computing (Society for Industrial
and Applied Mathematics), 24(2):619–644, 2002.
[181] D. Xiu and G. E. Karniadakis. A new stochastic approach to transient heat conduction modeling with uncertainty. International Journal of Heat and Mass Transfer,
46(24):4681–4693, 2003.