Numerical Approaches for Sequential Bayesian Optimal Experimental Design

by Xun Huan

B.A.Sc., University of Toronto (2008)
S.M., Massachusetts Institute of Technology (2010)

Submitted to the Department of Aeronautics and Astronautics in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computational Science and Engineering at the Massachusetts Institute of Technology, September 2015.

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Aeronautics and Astronautics, August 20, 2015
Certified by: Youssef M. Marzouk, Class of 1942 Associate Professor of Aeronautics and Astronautics, Thesis Supervisor
Certified by: John N. Tsitsiklis, Clarence J. Lebel Professor of Electrical Engineering, Thesis Committee
Certified by: Mort D. Webster, Associate Professor of Energy Engineering, Pennsylvania State University, Thesis Committee
Certified by: Karen E. Willcox, Professor of Aeronautics and Astronautics, Thesis Committee
Accepted by: Paulo C. Lozano, Associate Professor of Aeronautics and Astronautics, Chair, Graduate Program Committee

Numerical Approaches for Sequential Bayesian Optimal Experimental Design

by Xun Huan

Submitted to the Department of Aeronautics and Astronautics on August 20, 2015, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computational Science and Engineering.

Abstract

Experimental data play a crucial role in developing and refining models of physical systems. Some experiments can be more valuable than others, however. Well-chosen experiments can save substantial resources, and hence optimal experimental design (OED) seeks to quantify and maximize the value of experimental data. Common current practice for designing a sequence of experiments uses suboptimal approaches: batch (open-loop) design that chooses all experiments simultaneously with no feedback of information, or greedy (myopic) design that optimally selects the next experiment without accounting for future observations and dynamics. In contrast, sequential optimal experimental design (sOED) is free of these limitations. With the goal of acquiring experimental data that are optimal for model parameter inference, we develop a rigorous Bayesian formulation for OED using an objective that incorporates a measure of information gain. This framework is first demonstrated in a batch design setting, and then extended to sOED using a dynamic programming (DP) formulation. We also develop new numerical tools for sOED to accommodate nonlinear models with continuous (and often unbounded) parameter, design, and observation spaces.
Two major techniques are employed to make solution of the DP problem computationally feasible. First, the optimal policy is sought using a one-step lookahead representation combined with approximate value iteration. This approximate dynamic programming method couples backward induction and regression to construct value function approximations. It also iteratively generates trajectories via exploration and exploitation to further improve approximation accuracy in frequently visited regions of the state space. Second, transport maps are used to represent belief states, which reflect the intermediate posteriors within the sequential design process. Transport maps offer a finite-dimensional representation of these generally non-Gaussian random variables, and also enable fast approximate Bayesian inference, which must be performed millions of times under nested combinations of optimization and Monte Carlo sampling. The overall sOED algorithm is demonstrated and verified against analytic solutions on a simple linear-Gaussian model. Its advantages over batch and greedy designs are then shown via a nonlinear application of optimal sequential sensing: inferring contaminant source location from a sensor in a time-dependent convection-diffusion system. Finally, the capability of the algorithm is tested for multidimensional parameter and design spaces in a more complex setting of the source inversion problem.

Thesis Supervisor: Youssef M. Marzouk, Class of 1942 Associate Professor of Aeronautics and Astronautics
Committee Member: John N. Tsitsiklis, Clarence J. Lebel Professor of Electrical Engineering
Committee Member: Mort D. Webster, Associate Professor of Energy Engineering, Pennsylvania State University
Committee Member: Karen E. Willcox, Professor of Aeronautics and Astronautics

Acknowledgments

First and foremost, I would like to thank my advisor Youssef Marzouk, for giving me the opportunity to work with him, and for his constant guidance and support. Youssef has been a great mentor, friend, and inspiration to me throughout my graduate school career. I find myself incredibly lucky to have crossed paths with him right as he started as a faculty member at MIT. I would also like to thank all my committee members, John Tsitsiklis, Mort Webster, and Karen Willcox, and my readers Peter Frazier and Omar Knio. I have benefited greatly from their support and insightful discussions, and I am honored to have had each of them involved in the making of this thesis. There are many friends and colleagues who helped me through graduate school and enriched my life: Huafei Sun, with a friendship that started all the way back in Toronto, who has been like a big brother to me; Masayuki Yano, a great roommate of three years, with whom I had many interesting discussions about research and life; Hemant Chaurasia, with whom I endured quals and classes, enjoyed MIT $100K events, and played many intramural hockey games side by side; Matthew Parno, for graciously sharing his MUQ code; Tarek El Moselhy, for the fun times exploring Vancouver and Japan; Tiangang Cui, for many enjoyable outings outside of research; Chi Feng, Alessio Spantini, and Sergio Amaral, for performing "emergency surgery" on my desktop computer when its power supply died the week before my defense; and many others whom I cannot all name here. I want to thank the entire UQ group and ACDL, all the students, post-docs, faculty and staff, past and present.
Special thanks go to Sophia Hasenfus, Beth Marois, Meghan Pepin, and Jean Sofronas, for all the help behind the scenes. I am also grateful to all my friends from Toronto for making my visits back home extra fun and memorable. Last but not least, I want to thank my parents, who have always been there for me, through tough and happy times. I would not have made it this far without their constant love, support, and encouragement, and I am very proud to be their son.

My research was generously supported by funding from the BP-MIT Energy Fellowship, the KAUST Global Research Partnership, the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR), the Air Force Office of Scientific Research (AFOSR) Computational Mathematics Program, the National Science Foundation (NSF), and the Natural Sciences and Engineering Research Council of Canada (NSERC).

Contents

1 Introduction
  1.1 Literature review
    1.1.1 Batch (open-loop) optimal experimental design
    1.1.2 Sequential (closed-loop) optimal experimental design
  1.2 Thesis objectives
2 Batch Optimal Experimental Design
  2.1 Formulation
  2.2 Stochastic optimization
    2.2.1 Robbins-Monro stochastic approximation
    2.2.2 Sample average approximation
    2.2.3 Challenges in optimal experimental design
  2.3 Polynomial chaos expansions
  2.4 Infinitesimal perturbation analysis
  2.5 Numerical results: 2D diffusion source inversion problem
    2.5.1 Problem setup
    2.5.2 Results
3 Formulation for Sequential Design
  3.1 Problem definition
  3.2 Dynamic programming form
  3.3 Information-based Bayesian experimental design
  3.4 Notable suboptimal sequential design methods
4 Approximate Dynamic Programming for Sequential Design
  4.1 Approximation approaches
  4.2 Policy representation
  4.3 Policy construction via approximate value iteration
    4.3.1 Backward induction and regression
    4.3.2 Exploration and exploitation
    4.3.3 Iterative update of state measure and policy approximation
  4.4 Connection to the rollout algorithm (policy iteration)
  4.5 Connection to POMDP
5 Transport Maps for Sequential Design
  5.1 Background
  5.2 Bayesian inference using transport maps
  5.3 Constructing maps from samples
    5.3.1 Optimization objective
    5.3.2 Constraints
    5.3.3 Convexity and separability of the optimization problem
    5.3.4 Map parameterization
  5.4 Relationship between quality of joint and conditional maps
  5.5 Sequential design using transport maps
    5.5.1 Joint map structure
    5.5.2 Distributions on design variables
    5.5.3 Generating samples in sequential design
    5.5.4 Evaluating the Kullback-Leibler divergence
6 Full Algorithm Pseudo-code for Sequential Design
7 Numerical Results
  7.1 Linear-Gaussian problem
    7.1.1 Problem setup
    7.1.2 Results
  7.2 1D contaminant source inversion problem
    7.2.1 Case 1: comparison with greedy (myopic) design
    7.2.2 Case 2: comparison with batch (open-loop) design
    7.2.3 Case 3: sOED grid and map methods
  7.3 2D contaminant source inversion problem
    7.3.1 Problem setup
    7.3.2 Results
8 Conclusions
  8.1 Summary and conclusions
  8.2 Future work
    8.2.1 Computational advances
    8.2.2 Formulational advances
A Analytic Derivation of the Unbiased Gradient Estimator
B Analytic Solution to the Linear-Gaussian Problem
  B.1 Derivation from batch optimal experimental design
  B.2 Derivation from sequential optimal experimental design

List of Figures

1-1 The learning process can be characterized as an iteration between theory and practice via deductive and inductive reasoning.
2-1 Example forward model solution and realizations from the likelihood. The solid line represents the time-dependent contaminant concentration w(x, t; x_src) at x = x_sensor = (0, 0), given a source centered at x_src = (0.1, 0.1), source strength s = 2.0, width h = 0.05, and shutoff time τ = 0.3. Parameters are defined in Equation 2.18. The five crosses represent noisy measurements at five designated measurement times.
2-2 Surface plots of independent Û_{N,M} realizations, evaluated over the entire design space [0, 1]² ∋ d = (x, y). Note that the vertical axis ranges and color scales vary among the subfigures.
2-3 Contours of posterior densities for the source location, given different sensor placements. The true source location, marked with a blue circle, is x_src = (0.09, 0.22).
2-4 Sample paths of the RM algorithm with N = 1, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values. The large dot is the starting position and the large × is the final position.
2-5 Sample paths of the RM algorithm with N = 11, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values. The large dot is the starting position and the large × is the final position.
2-6 Sample paths of the RM algorithm with N = 101, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values. The large dot is the starting position and the large × is the final position.
2-7 Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 1. The large dot is the starting position and the large × is the final position.
2-8 Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 11. The large dot is the starting position and the large × is the final position.
2-9 Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 101. The large dot is the starting position and the large × is the final position.
2-10 Mean squared error, defined in Equation 2.22, versus average run time for each optimization algorithm and various choices of inner-loop and outer-loop sample sizes. The highlighted curves are "optimal fronts" for RM (light red) and SAA-BFGS (light blue).
3-1 Batch design exhibits an open-loop behavior, where no feedback of information is involved, and the observations y_k from any experiment do not affect the design of any other experiments. Sequential design exhibits a closed-loop behavior, where feedback of information takes place, and the data y_k from an experiment can be used to guide the design of future experiments.
5-1 A log-normal random variable z can be mapped to a standard Gaussian random variable ξ via ξ = T(z) = ln(z).
5-2 Example 5.3.1: samples and density contours.
5-3 Example 5.3.1: posterior density functions using different map polynomial basis orders and sample sizes.
5-4 Illustration of the exact map and perspectives of approximate maps. Contour plots on the left reflect the reference density, and on the right the target density.
5-5 Example 5.5.1: posteriors from joint maps constructed under different d distributions.
5-6 Example 5.5.1: additional examples of posteriors from joint maps constructed under different d distributions. The same legend as in Figure 5-5 applies.
7-1 Linear-Gaussian problem: J̃_1 surfaces and regression points used to build them. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
7-2 Linear-Gaussian problem: d_0 histograms from 1000 simulated trajectories. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
7-3 Linear-Gaussian problem: d_1 histograms from 1000 simulated trajectories. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
7-4 Linear-Gaussian problem: (d_0, d_1) pair scatter plots from 1000 simulated trajectories, superimposed on the analytic expected utility surface. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
7-5 Linear-Gaussian problem: total reward histograms from 1000 simulated trajectories. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error.
7-6 Linear-Gaussian problem: samples used to construct the exploration map and samples generated from the resulting map.
7-7 1D contaminant source inversion problem, case 1: physical state and belief state density progression of a sample trajectory.
7-8 1D contaminant source inversion problem, case 1: (d_0, d_1) pair scatter plots from 1000 simulated trajectories for greedy design and sOED.
7-9 1D contaminant source inversion problem, case 1: total reward histograms from 1000 simulated trajectories for greedy design and sOED. The plus-minus quantity is 1 standard error.
7-10 1D contaminant source inversion problem, case 2: (d_0, d_1) pair scatter plots from 1000 simulated trajectories for batch design and sOED. Roughly 55% of the sOED trajectories qualify for the precise device in the second experiment. However, there is no particular pattern or clustering of these designs, so we do not separately color-code them in the scatter plot.
7-11 1D contaminant source inversion problem, case 2: total reward histograms from 1000 simulated trajectories for batch design and sOED. The plus-minus quantity is 1 standard error.
7-12 1D contaminant source inversion problem, case 3: d_0 histograms from 1000 simulated trajectories for the sOED grid and map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
7-13 1D contaminant source inversion problem, case 3: d_1 histograms from 1000 simulated trajectories for the sOED grid and map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
7-14 1D contaminant source inversion problem, case 3: total reward histograms from 1000 simulated trajectories for the sOED grid and map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error.
7-15 1D contaminant source inversion problem: samples used to construct the exploration map and samples generated from the resulting map.
7-16 1D contaminant source inversion problem, case 3: (d_0, d_1) pair scatter plots from 1000 simulated trajectories. The sOED result here is for ℓ = 1.
7-17 1D contaminant source inversion problem, case 3: total reward histograms from 1000 simulated trajectories using batch and greedy designs. The plus-minus quantity is 1 standard error.
7-18 2D contaminant source inversion problem: plume signal and physical state progression of sample trajectory 1.
7-19 2D contaminant source inversion problem: belief state posterior density contour progression of sample trajectory 1.
7-20 2D contaminant source inversion problem: plume signal and physical state progression of sample trajectory 2.
7-21 2D contaminant source inversion problem: belief state posterior density contour progression of sample trajectory 2.
7-22 2D contaminant source inversion problem: d_k histograms from 1000 simulated trajectories.
7-23 2D contaminant source inversion problem: total reward histograms from 1000 simulated trajectories. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error.
7-24 2D contaminant source inversion problem: samples used to construct the exploration map.
7-25 2D contaminant source inversion problem: samples generated from the resulting map.
7-26 2D contaminant source inversion problem: samples used to construct the exploration map between d_k and y_k, with θ. The columns from left to right correspond to d_{0,0}, d_{0,1}, y_0, d_{1,0}, d_{1,1}, y_1, d_{2,0}, d_{2,1}, y_2, and the marginals for the row variables; the rows from top to bottom correspond to the marginal for the column variables, θ_0, θ_1, θ_0, θ_1, θ_0, θ_1, with each pair of rows corresponding to θ for inference after 1, 2, and 3 experiments, respectively.
7-27 2D contaminant source inversion problem: samples generated from the resulting map between d_k and y_k, with θ. The columns and rows are arranged as in Figure 7-26.
B-1 Linear-Gaussian problem: analytic expected utility surface, with the "front" of optimal designs shown as a dotted black line.

List of Tables

2.1 Histograms of final search positions resulting from 1000 independent runs of RM (top subrows) and SAA (bottom subrows) over a matrix of N and M sample sizes. For each histogram, the bottom-right and bottom-left axes represent the sensor coordinates x and y, respectively, while the vertical axis represents frequency.
2.2 High-quality expected information gain estimates at the final sensor positions resulting from 1000 independent runs of RM (top subrows, blue) and SAA-BFGS (bottom subrows, red). For each histogram, the horizontal axis represents values of Û_{M=1001,N=1001} and the vertical axis represents frequency.
2.3 Histograms of optimality gap estimates for SAA-BFGS, over a matrix of sample sizes M and N. For each histogram, the horizontal axis represents the value of the gap estimate and the vertical axis represents frequency.
2.4 Number of iterations in each independent run of RM (top subrows, blue) and SAA-BFGS (bottom subrows, red), over a matrix of sample sizes M and N. For each histogram, the horizontal axis represents iteration number and the vertical axis represents frequency.
5.1 Different levels of scope available when transport maps are used as the belief state in the sOED context. n_iter represents the number of stochastic optimization iterations from numerically evaluating Equation 5.38, and n_MC represents the Monte Carlo size for approximating its expectation. In our implementation, these values are typically around 50 and 100, respectively.
5.2 Structure of joint maps needed to perform inference under different numbers of experiments. For simplicity of notation, we omit the conditioning in the subscript of map components; please see Equation 5.40 for the full subscripts. The same pattern is repeated for higher numbers of experiments. The components grouped by the red rectangular boxes are identical.
5.3 Marginal distributions of d used to construct the joint map.
7.1 Linear-Gaussian problem: total reward mean values (of histograms in Figure 7-5) from 1000 simulated trajectories. Monte Carlo standard errors are all ±0.02.
7.2 Contaminant source inversion problem: problem settings.
7.3 Contaminant source inversion problem: algorithm settings.
7.4 1D contaminant source inversion problem, case 3: total reward mean values from 1000 simulated trajectories; the Monte Carlo standard errors are all ±0.02. The grid and map cases are all from sOED.

Chapter 1

Introduction

Experiments play an essential role in the learning process. As George E. P. Box points out, ". . . science is a means whereby learning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice . . . " [29]. As illustrated in Figure 1-1, theory is used to deduce what is expected to be observed in practice, and observations from experiments are used in turn to induce how theory may be further improved. In science and engineering, experiments are a fundamental building block of the scientific method, and crucial in the continuing development and refinement of models of physical systems.

[Figure 1-1: The learning process can be characterized as an iteration between theory and practice via deductive and inductive reasoning.]

Whether obtained through field observations or laboratory experiments, experimental data may be difficult and expensive to acquire. Even controlled experiments can be time-consuming or delicate to perform. Experiments are also not equally useful, with some providing valuable information while others may be irrelevant to the goals of the investigation. It is therefore important to quantify the trade-off between costs and benefits, and to maximize the overall value of experimental data—to design experiments that are "optimal" by some appropriate measure. Not only is this an important economic consideration, it can also greatly accelerate the advancement of scientific understanding.
Experimental design thus encompasses questions of where and when to measure, which variables to interrogate, and what experimental conditions to employ (some examples of real-life experimental design situations are shown below). In this thesis, we develop a systematic framework for experimental design that can help answer these questions.

Example 1.0.1. Combustion kinetics: Alternative fuels, such as biofuels [139] and synthetic fuels [86], are becoming increasingly popular [36]. They are attractive for hedging against volatile petroleum prices, ensuring energy security, and offering new and desirable properties that traditional fossil fuels might not provide. The development of these new fuels relies on a deep understanding of the underlying chemical combustion process, which is often modeled by complicated, nonlinear chemical mechanisms composed of many elementary reactions. Parameters governing the rates of these reactions, known as kinetic rate parameters, are usually inferred from experimental measurements such as ignition delay times [72, 58]. Many of these kinetic parameters have large uncertainties even today [12, 11, 141, 133], and more data are needed to reduce the uncertainties. Combustion experiments, often conducted using shock tubes, are usually expensive and difficult to set up, and need to be carefully planned. Furthermore, one may choose to carry out these experiments under different temperatures and pressures, with different initial concentrations of reactants, and with different output quantities observed at different times. Experimental design provides guidance in making these choices such that the most information may be gained on the kinetic rate parameters [89, 90].

Example 1.0.2. Optimal sensor placement: The United States government has initiated a number of terrorism prevention measures since the events of 9/11. For example, the BioWatch program [152] focuses on the prevention of, and response to, scenarios where a biological pathogen is released in a city. One of its main goals is to find and intercept the contaminant source and eliminate it as soon as possible. It is often too dangerous to dispatch personnel into the contamination zone, but a limited number of measurements may be available from remote-controlled robotic vehicles. It is thus crucial for these measurements to yield the most information on the location of the contaminant source [91]. This problem will be revisited in this thesis, with particular focus on situations that allow a sequential selection of measurement locations.

1.1 Literature review

Systematic design of experiments has received much attention in the statistics community and in many science and engineering applications. Early design approaches relied primarily on heuristics and experience, with the traditional factorial, composite, and Latin hypercube designs all based on the concepts of space-filling and blocking [69, 31, 56, 32]. While these methods can produce good designs in relatively simple situations involving a few design variables, they generally do not account for, or take advantage of, knowledge of the underlying physical process. Simulation-based experimental design uses a model to guide the choice of experiments, and optimal experimental design (OED) furthermore incorporates specific and relevant metrics to design experiments for a particular purpose, such as parameter inference, prediction, or model discrimination.
The design of multiple experiments can be pursued via two broad classes of approaches:

• Batch (open-loop) design involves the design of all experiments concurrently as a batch. The outcome of any experiment would not affect the design of the others. In some situations, this approach may be necessary, such as under certain scheduling constraints.

• Sequential (closed-loop) design allows experiments to be conducted in sequence, thus permitting newly acquired data to help guide the design of future experiments.

1.1.1 Batch (open-loop) optimal experimental design

Extensive theory has been developed for OED of linear models, where the quantities probed in the experiments depend linearly on the model parameters of interest. Common solution criteria for the OED problem are written as functionals of the Fisher information matrix [66, 9]. These criteria include the well-known "alphabetic optimality" conditions, e.g., A-optimality to minimize the average variance of parameter estimates, or G-optimality to minimize the maximum variance of model predictions. The derivations may also adopt a Bayesian perspective [94, 156], which provides a rigorous foundation for inference from noisy, indirect, and incomplete data, and a natural mechanism for incorporating physical constraints and heterogeneous sources of information. Bayesian analogues of alphabetic optimality, reflecting prior and posterior uncertainty in the model parameters, can be attained from a decision-theoretic point of view [18, 146, 130], with the formulation of an expected utility quantity. For instance, Bayesian D-optimality can be obtained from a utility function containing Shannon information, while Bayesian A-optimality may be derived from a squared error loss. In the case of linear-Gaussian models, the criteria of Bayesian alphabetic optimality reduce to mathematical forms that parallel their non-Bayesian counterparts [43].

For nonlinear models, however, exact evaluation of optimal design criteria is much more challenging. More tractable design criteria can be obtained by imposing additional assumptions, effectively changing the form of the objective; these assumptions include linearizations of the forward model, Gaussian approximations of the posterior distribution, and additional assumptions on the marginal distribution of the data [33, 43]. In the Bayesian setting, such assumptions lead to design criteria that may be understood as approximations of an expected utility. Most of these involve prior expectations of the Fisher information matrix [49]. Cruder "locally optimal" approximations require selecting a "best guess" value of the unknown model parameters and maximizing some functional of the Fisher information evaluated at this point [70]. None of these approximations, though, is suitable when the parameter distribution is broad or when it departs significantly from normality [51]. A more general design framework, free of these limiting assumptions, is preferred [118, 80]. With recent advances in algorithm development and computational power, OED for nonlinear systems can now be tackled directly using numerical simulation [109, 145, 169, 108, 158, 89, 90, 91].

Information-based objectives

Our work approaches nonlinear experimental design from a Bayesian perspective (e.g., [118]). We focus on experiments described by a continuous design space, with the goal of choosing experiments that are optimal for Bayesian parameter inference. Rigorous information-theoretic criteria have been proposed throughout the literature (e.g., [75]).
The seminal paper of Lindley [105] suggests using the expected information gain in model parameters from prior to posterior—or equivalently, the mutual information between parameters and observations, conditioned on the design variables—as a measure of the information provided by an experiment. This objective can also be derived using the Kullback-Leibler divergence from posterior to prior as a utility function [60, 43]. Sebastiani and Wynn [148] propose selecting experiments for which the marginal distribution of the data has maximum Shannon entropy; this may be understood as a special case of Lindley's criterion. Maximum entropy sampling (MES) has seen use in applications ranging from astronomy [109] to geophysics [169], and is well suited to nonlinear models. Reverting to Lindley's criterion, Ryan [145] introduces a Monte Carlo estimator of expected information gain to design experiments for a model of material fatigue. Terejanu et al. [166] use a kernel estimator of mutual information to identify parameters in a chemical kinetic model. The latter two studies evaluate their criteria on every element of a finite set of possible designs (on the order of ten designs in these examples), and thus sidestep the challenge of optimizing the design criterion over general design spaces. Both report significant limitations due to computational expense; [145] concludes that a "full blown search" over the design space is infeasible, and that gains of two orders of magnitude in computational efficiency would be required even to discriminate among the enumerated designs.

The application of optimization methods to experimental design has thus favored simpler design objectives. The chemical engineering community, for example, has tended to use linearized and locally optimal [117] design criteria or other objectives [144] for which deterministic optimization strategies are suitable. But in the broader context of decision-theoretic design formulations, sampling is required. [120] proposes a curve-fitting scheme wherein the expected utility is fit with a regression model, using Monte Carlo samples over the design space. This scheme relies on problem-specific intuition about the character of the expected utility surface. Clyde et al. [52] explore the joint design, parameter, and data space with a Markov chain Monte Carlo (MCMC) sampler, while Amzal et al. [6] expand this concept to multiple MCMC chains in a sequential Monte Carlo framework; this strategy combines integration with optimization, such that the marginal distribution of sampled designs is proportional to the expected utility. This idea is extended with simulated annealing in [121] to achieve more efficient maximization of the expected utility. [52, 121] use expected utilities as design criteria but do not pursue information-theoretic design metrics. Indeed, direct optimization of information-theoretic metrics has seen much less development. Building on the enumeration approaches of [169, 145, 166] and the one-dimensional design space considered in [109], [80] iteratively finds MES designs in multi-dimensional spaces by greedily choosing one component of the design vector at a time. Hamada et al. [84] also find "near-optimal" designs for linear and nonlinear regression problems by maximizing expected information gain via genetic algorithms.
Guestrin, Krause, and others [81, 99, 100] find near-optimal placements of sensors in a discretized domain by iteratively solving greedy subproblems, taking advantage of the submodularity of mutual information. More recently, the author has made several contributions addressing the coupling of rigorous information-theoretic design criteria, complex nonlinear physics-based models, and efficient optimization strategies on continuous design spaces [89, 90, 91].

Stochastic optimization

There are many approaches for solving continuous optimization problems with stochastic objectives. While some do not require the direct evaluation of gradients (e.g., Nelder-Mead [124], Kiefer-Wolfowitz [95], and simultaneous perturbation stochastic approximation [161]), other algorithms can use gradient evaluations to great advantage. Broadly, these algorithms involve either stochastic approximation (SA) [102] or sample average approximation (SAA) [149], where the latter approach must also invoke a gradient-based deterministic optimization algorithm. Hybrids of the two approaches are possible as well. The Robbins-Monro algorithm [142] is one of the earliest and most widely used SA methods, and has become a prototype for many subsequent algorithms. It involves an iterative update that resembles steepest descent, except that it uses stochastic gradient information. SAA (also referred to as the retrospective method [85] and the sample-path method [82]) is a more recent approach, with theoretical analysis initially appearing in the 1990s [149, 82, 97]. Convergence rates and stochastic bounds, although useful, do not necessarily reflect empirical performance under finite computational resources and imperfect numerical optimization schemes. To the best of our knowledge, extensive numerical testing of SAA has focused on stochastic programming problems with special structure (e.g., linear programs with discrete design variables) [3, 170, 16, 79, 147]. While numerical improvements to SAA have seen continual development (e.g., estimators of the optimality gap [127, 111] and sample size adaptation [46, 47]), the practical behavior of SAA in more general optimization settings is largely unexplored. SAA is frequently compared to stochastic approximation methods such as RM. For example, [150] suggests that SAA is more robust than SA because of the latter's sensitivity to step size choice. On the other hand, variants of SA have been developed that, for certain classes of problems (e.g., [125]), reach solution quality comparable to that of SAA in substantially less time. In this thesis, we also make comparisons between SA and SAA, but from a practical and numerical perspective and in the context of OED.

Surrogates for computationally intensive models

With either Robbins-Monro or SAA, information-based OED requires gradients of an information gain objective. Typically, this objective function involves nested integrations over possible model outputs and over the input parameter space, where the model output may be a functional of the solution of a partial differential equation. In many practical cases, the model may be essentially a black box; in other cases, even if gradients can be evaluated with adjoint methods, using the full model to evaluate the expected information gain or its gradient is computationally prohibitive. To make these calculations tractable, one would like to replace the forward model with a cheaper "surrogate" model that is accurate over the entire relevant region of the model input parameter space.
Surrogates can generally be categorized into three classes [65, 71]: data-fit models, reduced-order models, and hierarchical models. Data-fit models capture the input-output relationship of a model from available data points, and assume regularity by imposing interpolation or regression. Given the data points, how the original model functions internally is immaterial, and it may be treated as a black box. One common approach for constructing data-fit models is Gaussian process regression [94, 140]; other approaches rely on so-called polynomial chaos expansions (PCE) and related stochastic spectral methods [178, 74, 180, 59, 123, 179, 104, 53]. In the context of OED, the former can be used to replace the likelihood altogether, allowing quick inference and objective evaluations from a statistical model of much simpler structure [177]. The latter builds a subspace from a set of orthogonal polynomial basis functions, and exploits the regularity in the dependence of model outputs on uncertain input parameters. A PCE capturing dependencies jointly on parameters and design conditions further accelerates the overall OED process [90], and can be constructed using dimension-adaptive sparse quadrature [73] that identifies and exploits anisotropic dependencies for efficiency in high dimensions.

Reduced-order models are based on a projection of the output space onto a smaller, lower-dimensional subspace. One example is the proper orthogonal decomposition (POD), where a set of "snapshots" of the model outputs is used to construct a basis for the subspace [19, 155, 38]. Finally, hierarchical models are those where simplifications are made based on the underlying physics. Techniques based on grid coarsening, simplification of mechanics, or additional assumptions are of this type, and are often the basis of multifidelity analysis and optimization [28, 4].

1.1.2 Sequential (closed-loop) optimal experimental design

Compared to batch OED, sequential optimal experimental design (sOED) has seen much less development and use. The value of feedback through sequential design was recognized early on, with original approaches typically involving a heuristic partitioning of experiments into batches. For instance, in the context of experimental design for improving a chemical plant filtration rate [30], an initial "empirical feedback" stage involving space-filling designs is administered to "pick the winner" and find designs that best fix the problem, and a subsequent "scientific feedback" stage with adapted designs follows, to better understand the reasons for what went wrong or why a solution worked.

Initial attempts at finding optimal sequential designs relied heavily on results from batch OED, simply repeating its design methodology in a greedy manner. Some work made use of linear design theory by iteratively alternating between parameter estimation and applications of linear optimality (e.g., [2]). Since many physically realistic models involve output quantities that depend nonlinearly on model parameters, it is desirable to employ nonlinear OED tools. The key challenge, then, is to represent and propagate general non-Gaussian posteriors beyond the first experiment. Various representation techniques have been tested within the greedy design framework, with a large body of research based on sample representations of the posterior. For instance, posterior importance sampling has been employed for variance-based utility [158] and in greedy augmentations of generalized linear models [61].
Sequential Monte Carlo methods have also been utilized in experimental design both for parameter inference [62] and for model discrimination [42, 63]. Even grid-based discretizations of posterior density functions have shown success, in adaptive design optimization that makes use of hierarchical models in visual psychophysics [96]. While these developments provide a convenient and intuitive avenue for extending existing batch OED tools, greedy design is ultimately suboptimal. A truly optimal sequential design framework needs to account for all relevant future effects in making every decision, but such considerations have been hampered by challenges in computational feasibility. With recent advances in numerical algorithms and computing power, sOED can now be made practical.

sOED is often posed in a dynamic programming (DP) form, a framework widely used to describe sequential decision-making under uncertainty. While the DP description of sOED has gained traction in recent years [119, 172], implementations and applications of this framework remain few, due to notoriously large computational requirements. The few existing attempts have mostly focused on optimal stopping problems [18], stemming predominantly from clinical trial design applications. In simple situations, direct backward induction with tabular storage may be used, but this is only feasible for discrete variables that can take on a few possible outcomes [37, 174]. Applications of more involved numerical solution techniques all rely on special structures of the problem, with careful choices of loss functions. For example, Carlin et al. [41] propose a forward sampling method that directly optimizes a Monte Carlo estimate of the expected utility, but it targets monotonic loss functions and certain conjugate priors that result in threshold policies based on the posterior mean. Continued developments of backward induction have also found feasible numerical implementations, owing to policies that depend only on lower-dimensional sufficient statistics such as the posterior mean and standard deviation [21, 48]. Other approaches replace the simulation model altogether, and instead use statistical models with assumed distribution forms [122]. None of these works, however, uses an information-based objective. Incorporating utilities that reflect information gain induces quantities that are much more challenging to evaluate, and this has been attempted only in simple situations. For instance, Ben-Gal and Caramanis [15] find near-optimal stopping policies in multidimensional design spaces by deriving and exploiting the diminishing return (submodularity) of the expected incremental information gain; however, this is possible only for linear-Gaussian problems, where mutual information does not depend on the observations. With the current state of the art in sOED heavily reliant on special problem structures, and often feasible only for discrete variables that can take on a few values, we seek to contribute to its development with a more general framework and numerical tools that can accommodate broader classes of problems.

Dynamic programming

The solution to the sOED problem directly relates to the solution of a DP problem. As DP is a broad subject accompanied by a vast literature spanning many fields of research, including control theory [24, 22, 23], operations research [138, 137], and machine learning [93, 164], we do not attempt a comprehensive review.
Instead, we give a brief introduction and describe only the parts most relevant and promising for the sOED problem, referring readers to the references above for more. Central to DP is the famous Bellman equation [13, 14], which describes the relationship between the cost or reward incurred immediately and the expected cost or reward in the uncertain future, as a consequence of a decision. Its recursive definition leads to an exponential explosion of scenarios, and this "curse of dimensionality" is the fundamental challenge of DP. Typically, only special classes of problems have analytic solutions, such as those described by linear dynamics and quadratic cost [8]. As a result, substantial research has been devoted to developing efficient numerical strategies for accurately capturing DP solutions—this field is known as approximate dynamic programming (ADP) (also referred to as neuro-dynamic programming and reinforcement learning) [137, 24, 93, 164].

With the goal of finding a (near) optimal policy, one must first be able to represent a policy. While direct approximations can be made, a policy is more often represented implicitly, for example by limited lookahead forms. These forms ultimately relegate the approximation to the associated value functions, whose values are probed at different states, leading to the two broad branches of ADP strategy: approximate value iteration (AVI) and approximate policy iteration (API). The key difference between AVI and API is that the former updates the policy immediately and maintains as good an approximation to the optimal policy as possible, while the latter makes an accurate assessment of the value of a fixed policy (i.e., policy evaluation or learning) in an inner loop before improvements are made. Both of these strategies have stimulated the development of a host of learning (policy evaluation) techniques based on the well-known temporal-difference method (e.g., [163, 164, 34]), and API further sparked the expansion of policy improvement methods such as least-squares policy iteration [103], actor-critic methods (e.g., [24]), and policy-gradient algorithms (e.g., [165]). Finally, the representation of value functions can be replaced by "model-free" Q-factors that capture the values of state-action pairs—this leads to the widely used reinforcement learning technique of Q-learning [175, 176].

1.2 Thesis objectives

Current research in OED has seen rapid advances in the design of batch experiments. Progress towards the optimal design of sequential experiments, however, remains in relatively early stages. Direct applications of batch OED methods to sequential settings are suboptimal, and initial explorations of the optimal framework have been limited to problems with discrete spaces of very few states and with special problem and solution structures. We aim to extend the optimal sequential design framework to much more general settings. The objectives of this thesis are:

• To advance the numerical methods for batch OED from the author's previous work [89, 90] in order to accommodate nonlinear and computationally intensive models with an information gain objective. This involves deriving and accessing gradient information via the use of polynomial chaos and infinitesimal perturbation analysis, in order to enable the application of gradient-based optimization methods.
• To formulate the sOED problem in a rigorous manner, for a finite number of experiments, accommodating nonlinear and physically realistic models, under continuous parameter, design, and observation spaces of multiple dimensions, using a Bayesian treatment of uncertainty with general non-Gaussian distributions and an information measure design objective. This goal includes formulating the DP form of the sOED problem that is central to the subsequent development of numerical methods.

• To develop numerical methods for solving the sOED problem in a computationally practical manner. This is achieved via the following sub-objectives.

– To implement ADP techniques based on a one-step lookahead policy representation, combined with approximate value iteration (in particular, backward induction and regression) for constructing value function approximations.

– To represent continuous belief states numerically for general multivariate non-Gaussian random variables using transport maps.

– To construct and utilize transport maps in the joint design, observation, and parameter space, in a form that enables fast and approximate Bayesian inference by conditioning; this capability is necessary to achieve computational feasibility in the ADP methods.

• To demonstrate the computational effectiveness of our sOED numerical tools on realistic design applications with multiple experiments and multidimensional parameters. These applications include contaminant source inversion problems in both one- and two-dimensional physical domains.

More broadly speaking, this thesis seeks to develop a rigorous mathematical framework and a set of numerical tools for performing sequential optimal experimental design in a computationally feasible manner.

The thesis is organized as follows. Chapter 2 begins with the formulation, numerical methods, and results for the batch OED method, with particular focus on the development of gradient information. It also provides a foundation of understanding in the relatively simpler batch design setting before the extension to sequential designs in the rest of the thesis. Chapter 3 then presents the formulation of the sOED problem, including the DP form that is the basis for developing our numerical methods. We also show the frequently used batch and greedy design methods to be simplifications of the sOED problem, and thus suboptimal in sequential settings. Chapter 4 details the ADP techniques we employ to numerically solve the DP form of the sOED problem, including the development of an adaptive strategy to refine the policy-induced state space. Chapter 5 introduces and describes the use of transport maps as belief states, along with the framework for using joint maps to enable fast and approximate Bayesian inference. The full algorithm for the sOED problem is summarized in Chapter 6. It is then applied to several numerical examples in Chapter 7. We first illustrate the solution on a simple linear-Gaussian problem to provide intuitive insights and establish comparisons with analytic references. We then demonstrate these tools on contaminant source inversion problems in 1D and 2D convection-diffusion scenarios. Finally, Chapter 8 provides concluding remarks and future work.
30 Chapter 2 Batch Optimal Experimental Design Batch (open-loop) optimal experimental design (OED) involves the design of all experiments concurrently as a batch, where the outcome of any experiment would not affect the design of others.1 This self-contained chapter introduces the framework of batch OED, assuming the goal of the experiments is to infer uncertain model parameters from noisy and indirect observations. The framework developed here, however, can be used to accommodate other experimental goals as well. Furthermore, it uses a Bayesian treatment of uncertainty, employs an information measure objective, and accommodates nonlinear models under continuous parameter, design, and observation spaces. We pay particular attention to the use of gradient information and the overall computational behavior of the method, and demonstrate its feasibility with a partial differential equation (PDE)-based 2D diffusion source inversion problem. We then extend this foundation to sequential (closed-loop) OED in subsequent chapters. The content of this chapter is a continuation from the author’s previous work [89, 90], and draws heavily from the author’s recent publication [91]. 2.1 Formulation Let (Ω, F, P) be a probability space, where Ω is a sample space, F is a σ-field, and P is a probability measure on (Ω, F ). Let the vector of real-valued random variables2 θ : Ω → Rnθ denote the uncertain model parameters of interest (referred to as “parameters” 1 For simplicity of terminology, we refer to the entire batch of experiments as a single entity “experiment” in this chapter. 2 For simplicity, we will use lower case to represent both the random variables and their realizations. 31 in this thesis), i.e., they are the parameters to be conditioned on experimental data. Here nθ is the dimension of parameters. θ is associated with a measure µ on Rnθ , such that µ(A) = P θ−1 (A) for A ∈ Rnθ . We then define f (θ) = dµ/dθ to be the density of θ with respect to the Lebesgue measure. For the present purposes, we will assume that such a density always exists. Similarly, we treat the observations from the experiment, y ∈ Y (referred to as “observations”, “noisy measurements”, or “data” in this thesis), as a real-valued random vector endowed with an appropriate density, and d ∈ D as the vector of continuous design variables (referred to as “design” in this thesis). If one performs an experiment under design d and observes a realization of the data y, then the change in one’s state of knowledge about the parameters is given by Bayes’ rule: f (θ|y, d) = f (y|θ, d)f (θ|d) f (y|θ, d)f (θ) = . f (y|d) f (y|d) (2.1) For simplicity of notation, we shall use f (·) to represent all density functions, and which specific distribution it corresponds to is reflected by its arguments (when needed for clarity, we will explicitly include a subscript of the associated random variable). Here, f (θ|d) is the prior density, f (y|θ, d) is the likelihood function, f (θ|y, d) is the posterior density, and f (y|d) is the evidence. The second equality is due to the assumption that knowing the design of an experiment without knowing its observations does not affect our belief about the parameters (i.e., the prior would not change based on what experiment we plan to do)—thus f (θ|d) = f (θ). The likelihood function is assumed to be given, and describes the discrepancy between the observations and a forward model prediction in a probabilistic way. 
The forward model, denoted by G(θ, d), is a function that maps the parameters and design into the observation space, and usually describes the outcome of some (possibly computationally expensive, e.g., PDE-based) simulation process. For example, y can arise from (but is not limited to) an additive Gaussian likelihood model: y = G(θ, d) + ε, where ε ∼ N(0, σ_ε²), leading to a likelihood function of f(y|θ, d) = f_ε(y − G(θ, d)).

We take a decision-theoretic approach and follow the concept of expected utility (or expected reward) to quantify the value of experiments [18, 146, 130]. While utility functions are quite flexible and can be based on loss functions defined for specific goals or tasks, we focus on utility functions that lead to valid measures of the information gain of experiments [75]. Taking an information-theoretic approach, we choose utility functions that reflect the expected information gain on the parameters θ [105, 106]. In particular, we use the relative entropy, or Kullback-Leibler (KL) divergence, from the posterior to the prior, and take its expectation under the prior predictive distribution of the data to obtain an expected utility U(d):

$$ U(d) = \int_{\mathcal{Y}} \int_{\mathcal{H}} \ln\!\left[ \frac{f(\theta \mid y, d)}{f(\theta)} \right] f(\theta \mid y, d)\, d\theta \; f(y \mid d)\, dy = \mathbb{E}_{y \mid d}\!\left[ D_{\mathrm{KL}}\big( f_{\theta \mid y, d}(\cdot \mid y, d) \,\big\|\, f_{\theta}(\cdot) \big) \right], \tag{2.2} $$

where H ⊆ R^{n_θ} is the support of the prior. Because the observations y cannot be known before the experiment is performed, taking the expectation over the prior predictive f(y|d) lets the resulting utility function reflect the information gain on average, over all anticipated outcomes of the experiment. The expected utility U(d) is thus the expected information gain due to an experiment performed at design d. A more detailed derivation of the expected utility can be found in [89, 90].

We choose the KL divergence for several reasons. First, KL is a special case of a wide range of divergence measures that satisfy the minimal set of requirements for a valid measure of information on a set of experiments [75]. These requirements are based on the sufficient ordering (or "always at least as informative" ordering) of experiments, and are developed rigorously from likelihood ratio statistics, in a general setting that does not specifically target decision-theoretic or Bayesian perspectives. Second, KL gives an intuitive indication of information gain in the sense of Shannon information [55]. Since KL reflects the difference between two distributions, a large KL divergence from posterior to prior implies that the observations y decrease the entropy in θ by a large amount, and hence those observations are more informative for parameter inference. Indeed, the KL divergence reflects the difference in information carried by two distributions in units of nats [55, 110], and the expected information gain is also equivalent to the mutual information between the parameters θ and the observations y, given the design d. Third, such a formulation for general nonlinear forward models (where G(θ, d) is a nonlinear function of the parameters θ) is consistent with linear optimal design theory based on the Fisher information matrix [66, 9]. When a linear model is used in this formulation, it reduces to linear D-optimal design, which is an attractive design approach due to, for example, its invariance under smooth model reparameterization [45].
Finally, the use of an information measure contrasts with a loss function in that, while the former does not target a particular task (such as estimation) in the context of a decision problem, it provides general guidance for learning about the uncertain environment, gaining information that performs well across a wide range of tasks albeit not best for any particular one.

Typically, the expected utility in Equation 2.2 has no closed form (even if the forward model is, for example, a polynomial function of θ). Instead, it must be approximated numerically. By applying Bayes' rule to the quantities inside and outside the logarithm in Equation 2.2, and then introducing Monte Carlo approximations for the resulting integrals, we obtain the nested Monte Carlo estimator proposed by Ryan [145]:

$$ U(d) \approx \hat{U}_{N,M}(d, \theta_s, y_s) \equiv \frac{1}{N} \sum_{i=1}^{N} \left\{ \ln f\big(y^{(i)} \mid \theta^{(i)}, d\big) - \ln\!\left[ \frac{1}{M} \sum_{j=1}^{M} f\big(y^{(i)} \mid \tilde{\theta}^{(i,j)}, d\big) \right] \right\}, \tag{2.3} $$

where θ_s ≡ {θ^{(i)}} ∪ {θ̃^{(i,j)}}, i = 1 ... N, j = 1 ... M, are i.i.d. samples from the prior f(θ); and y_s ≡ {y^{(i)}}, i = 1 ... N, are independent samples from the likelihoods f(y|θ^{(i)}, d). The variance of this estimator is approximately A(d)/N + B(d)/(NM) and its bias is (to leading order) C(d)/M [145], where A, B, and C are terms that depend only on the distributions at hand. While the estimator Û_{N,M} is biased for finite M, it is asymptotically unbiased.

Finally, the expected utility must be maximized over the design space D to find the optimal design:

$$ d^* = \operatorname*{arg\,max}_{d \in \mathcal{D}} \; U(d). \tag{2.4} $$

Since U can only be approximated by Monte Carlo estimators such as Û_{N,M}, optimization methods for stochastic objective functions are needed.
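As a concrete illustration of the estimator in Equation 2.3, the following is a minimal NumPy sketch for a scalar parameter and design. The forward model and constant noise level are illustrative assumptions, and the inner average is computed via log-sum-exp purely as a numerical safeguard.

    import numpy as np

    rng = np.random.default_rng(0)

    def G(theta, d):
        return np.exp(-d * theta)  # hypothetical forward model, as before

    def log_likelihood(y, theta, d, sigma=0.1):
        return -0.5 * ((y - G(theta, d)) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

    def u_hat(d, N=1000, M=100, sigma=0.1):
        """Nested Monte Carlo estimator of Equation 2.3."""
        theta = rng.uniform(0.0, 1.0, size=N)             # outer prior samples theta^(i)
        y = G(theta, d) + sigma * rng.standard_normal(N)  # y^(i) ~ f(y | theta^(i), d)
        outer = log_likelihood(y, theta, d, sigma)        # ln f(y^(i) | theta^(i), d)
        theta_in = rng.uniform(0.0, 1.0, size=(N, M))     # inner prior samples
        inner = log_likelihood(y[:, None], theta_in, d, sigma)
        # ln of the inner Monte Carlo average, via log-sum-exp for stability
        log_evidence = np.logaddexp.reduce(inner, axis=1) - np.log(M)
        return np.mean(outer - log_evidence)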
2.2 Stochastic optimization

Optimization methods can be broadly categorized as gradient-based and non-gradient-based. While gradient-based methods require additional gradient information, they are also generally more efficient than their non-gradient counterparts. With the intention of solving Equation 2.4, we make gradient information available for the batch OED problem in this chapter. In particular, we consider two gradient-based stochastic optimization approaches: Robbins-Monro (RM) stochastic approximation, and sample average approximation (SAA) combined with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Both approaches require some flavor of gradient information, but they do not use the exact gradient of U(d); calculating the latter is generally not possible, given that we only have a Monte Carlo estimator of U(d). The use of non-gradient optimization methods in the context of batch OED, with simultaneous perturbation stochastic approximation [160, 161] and the Nelder-Mead simplex method [124], has been previously investigated by the author [89, 90].

Recasting the optimization problem as a minimization statement:

$$ d^* = \operatorname*{arg\,min}_{d \in \mathcal{D}} \; [-U(d)] = \operatorname*{arg\,min}_{d \in \mathcal{D}} \left\{ h(d) \equiv \mathbb{E}_{y|d}\big[ \hat{h}(d, y) \big] \right\}, \tag{2.5} $$

where ĥ(d, y) is the underlying unbiased estimator of the unavailable objective function h(d) ≡ −U(d). We note that y is generally dependent on d.

2.2.1 Robbins-Monro stochastic approximation

The iterative update of the RM method is

$$ d_{j+1} = d_j - a_j\, \hat{g}(d_j, y'), \tag{2.6} $$

where j is the optimization iteration index and ĝ(d_j, y′) is an unbiased estimator of the gradient (with respect to d) of h(d) evaluated at d_j. In other words, E_{y′|d}[ĝ(d, y′)] = ∇_d h(d), but ĝ is not necessarily equal to ∇_d ĥ. Also, y′ and y may, but need not, be related. The gain sequence a_j should satisfy the following properties:

$$ \sum_{j=0}^{\infty} a_j = \infty \quad \text{and} \quad \sum_{j=0}^{\infty} a_j^2 < \infty. \tag{2.7} $$

One natural choice, used in this work, is the harmonic step size sequence a_j = β/j, where β is an appropriate scaling constant. For example, in the diffusion application problem of Section 2.5, β is chosen to be 1.0 since the design space is [0, 1]². With various technical assumptions on ĝ and g, it can be shown that RM converges to the exact solution of Equation 2.5 almost surely [102].

Choosing the sequence a_j is often viewed as the Achilles' heel of RM, as the algorithm's performance can be very sensitive to the step size. We acknowledge this fact and do not downplay the difficulty of choosing an appropriate gain sequence, but there exist logical approaches to selecting a_j that yield reasonable performance. More sophisticated strategies, such as search-then-converge learning rate schedules [57], adaptive stochastic step size rules [17], and iterate averaging methods [135, 102], have been developed and successfully demonstrated in applications. We will also use relatively simple stopping criteria for the RM iterations: the algorithm is terminated when changes in d_j stall (e.g., ||d_j − d_{j−1}|| falls below some designated tolerance for 5 successive iterations) or when a maximum number of iterations has been reached (e.g., 50 iterations for the results of Section 2.5).
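A minimal sketch of the RM iteration in Equation 2.6 with harmonic gains follows. The gradient estimator passed in is a placeholder (constructing one for OED is the subject of Section 2.4), the demo gradient is synthetic, and the projection onto [0, 1]² is an added simplification for bounded design spaces.

    import numpy as np

    def robbins_monro(g_hat, d0, beta=1.0, max_iter=50, tol=1e-6, stall=5):
        """Robbins-Monro iteration (Equation 2.6) with harmonic gains a_j = beta/j.

        g_hat(d) must return an unbiased (stochastic) estimate of grad h(d).
        """
        d = np.asarray(d0, dtype=float)
        stalled = 0
        for j in range(1, max_iter + 1):
            d_new = np.clip(d - (beta / j) * g_hat(d), 0.0, 1.0)  # project onto [0,1]^2
            stalled = stalled + 1 if np.linalg.norm(d_new - d) < tol else 0
            d = d_new
            if stalled >= stall:  # terminate when the iterates stall
                break
        return d

    # Demo with a synthetic noisy quadratic gradient; in the OED setting g_hat
    # would be the estimator developed in Section 2.4.
    rng = np.random.default_rng(4)
    g_demo = lambda d: 2.0 * (d - np.array([0.8, 0.2])) + 0.1 * rng.standard_normal(2)
    d_final = robbins_monro(g_demo, d0=[0.5, 0.5])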
2.2.2 Sample average approximation

Transformation to design-independent noise

The central idea of SAA is to reduce the stochastic optimization problem to a deterministic one by fixing the noise throughout the entire optimization process. In practice, if the noise y is design-dependent, it is first transformed to a design-independent random variable by effectively moving all the design dependence into the function ĥ. (An example of this transformation is given in Section 2.4.) The noise variables at different d then share a common distribution, and a common set of realizations is employed at all values of d. Such a transformation is always possible in practice, since the random numbers in any computation are fundamentally generated from uniform random (more precisely, pseudorandom) numbers; one can thus always transform y back into these uniform random variables, which are of course independent of d. (One does not need to go all the way back to the uniform random variables; any higher-level "transformed" random variables suffice, as long as they remain independent of d.) For the remainder of this section (Section 2.2.2) we shall, without loss of generality, assume that y has been transformed to a random variable w that is independent of d, while abusing notation and writing ĥ(d, w).

Reduction to a deterministic problem

SAA approximates the optimization problem of Equation 2.5 with

$$ \hat{d}_s = \operatorname*{arg\,min}_{d \in \mathcal{D}} \left\{ \hat{h}_N(d, w_s) \equiv \frac{1}{N} \sum_{i=1}^{N} \hat{h}(d, w_i) \right\}, \tag{2.8} $$

where d̂_s and ĥ_N(d̂_s, w_s) are the optimal design and objective values under a particular set of N realizations of the random variable w, w_s ≡ {w_i}_{i=1}^N. The same set of realizations is used for all values of d during the optimization process, thus making the minimization problem in Equation 2.8 deterministic. (One can view this approach as an application of common random numbers.) A deterministic optimization algorithm can then be chosen to find d̂_s as an approximation to d*. Estimates of h(d̂_s) can be improved by using ĥ_{N′}(d̂_s, w_{s′}) instead of ĥ_N(d̂_s, w_s), where ĥ_{N′}(d̂_s, w_{s′}) is computed from a larger set of realizations w_{s′} ≡ {w_m}_{m=1}^{N′} with N′ > N, in order to attain a lower variance.

Finally, multiple (say R) optimization runs are often performed to obtain a sampling distribution for the optimal design values and the optimal objective values, i.e., d̂_s^r and ĥ_N(d̂_s^r, w_s^r), for r = 1 ... R. The sets w_s^r are independently chosen for each optimization run, but remain fixed within each run. Under certain assumptions on the objective function and the design space, the optimal design and objective estimates in SAA generally converge to their respective true values in distribution at a rate of 1/√N [149, 97]. (More precise properties of these asymptotic distributions depend on properties of the objective and the set of optimal solutions to the true problem. For instance, in the case of a singleton optimum d*, the SAA estimates ĥ_N(d̂_s, ·) converge to a Gaussian with variance Var_w[ĥ(d*, w)]/N. Faster convergence to the optimal objective value may be obtained when the objective satisfies stronger regularity conditions. The SAA solutions d̂_s are not in general asymptotically normal, however. Furthermore, discrete probability distributions lead to entirely different asymptotics of the optimal solutions.)

For the solution of a particular deterministic problem d̂_s^r, stochastic bounds on the true optimal value can be constructed by estimating the optimality gap h(d̂_s^r) − h(d*) [127, 111]. The first term can simply be approximated using the unbiased estimator ĥ_{N′}(d̂_s^r, w_{s′}^r), since E_{w_{s′}}[ĥ_{N′}(d̂_s^r, w_{s′})] = h(d̂_s^r). The second term may be estimated using the average of the approximate optimal objective values across the R replicate optimization runs (based on w_s^r, rather than w_{s′}^r):

$$ \bar{h}_N = \frac{1}{R} \sum_{r=1}^{R} \hat{h}_N(\hat{d}_s^r, w_s^r). \tag{2.9} $$

This is a negatively biased estimator, and hence a stochastic lower bound on h(d*) [127, 111, 151]; the bias decreases monotonically with N [127]. (A short proof from [151]: for any d ∈ D, we have E_{w_s}[ĥ_N(d, w_s)] = h(d), and ĥ_N(d, w_s^r) ≥ min_{d′∈D} ĥ_N(d′, w_s^r). Then h(d) = E_{w_s}[ĥ_N(d, w_s)] ≥ E_{w_s}[min_{d′∈D} ĥ_N(d′, w_s)] = E_{w_s}[ĥ_N(d̂_s^r, w_s)] = E_{w_s}[h̄_N].) The difference ĥ_{N′}(d̂_s^r, w_{s′}^r) − h̄_N is thus a stochastic upper bound on the true optimality gap h(d̂_s^r) − h(d*). The variance of this optimality gap estimator can be derived from the Monte Carlo standard error formula [3]. One could then use the optimality gap estimator and its variance to decide whether more runs are required, or which approximate optimal designs are most trustworthy. Pseudo-code for the SAA method is presented in Algorithm 1.

Algorithm 1: Pseudo-code for SAA.
1: Set optimality gap tolerance η and number of replicate optimization runs R; r = 1;
2: while optimality gap estimate > η and r ≤ R do
3:   Sample the set w_s^r = {w_i^r}_{i=1}^N;
4:   Perform a deterministic optimization run and find d̂_s^r (see Algorithm 2);
5:   Sample the larger set w_{s′}^r = {w_m^r}_{m=1}^{N′}, where N′ > N;
6:   Compute ĥ_{N′}(d̂_s^r, w_{s′}^r) = (1/N′) Σ_{m=1}^{N′} ĥ(d̂_s^r, w_m^r);
7:   Estimate the optimality gap and its variance;
8:   r = r + 1;
9: end
10: Output the sets {d̂_s^r}_{r=1}^R and {ĥ_{N′}(d̂_s^r, w_{s′}^r)}_{r=1}^R for post-processing;

At this point, we have reduced the stochastic optimization problem to a series of deterministic optimization problems; a suitable deterministic optimization algorithm is still needed to solve them.
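The essence of the reduction above is that freezing the noise set w_s turns ĥ_N(·, w_s) into an ordinary deterministic function of d. A minimal sketch, with a hypothetical per-sample objective standing in for the OED estimator:

    import numpy as np

    rng = np.random.default_rng(1)

    def h_hat(d, w):
        """Hypothetical per-sample objective h_hat(d, w); a stand-in for -U contributions."""
        return (d - 0.7) ** 2 + 0.5 * w * np.sin(3.0 * d)

    # Freeze one realization of the noise set w_s; h_N below is then an ordinary
    # deterministic function of d, and repeated calls return identical values.
    w_s = rng.standard_normal(200)

    def h_N(d):
        return np.mean(h_hat(d, w_s))  # sample average approximation, Equation 2.8

    assert h_N(0.3) == h_N(0.3)  # deterministic under common random numbers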
Broyden-Fletcher-Goldfarb-Shanno method

The BFGS method [126] is a gradient-based method for solving deterministic nonlinear optimization problems, widely used for its robustness, ease of implementation, and efficiency. It is a quasi-Newton method, iteratively updating an approximation to the (inverse) Hessian matrix from objective and gradient evaluations at each stage. Pseudo-code for the BFGS method is given in Algorithm 2. In the present implementation, a simple backtracking line search is used to find a step size that satisfies only the first (Armijo) Wolfe condition. The algorithm can be terminated according to many commonly used criteria: for example, when the gradient stalls, the line search step size falls below a prescribed tolerance, the design or function value stalls, or a maximum allowable number of iterations or objective evaluations is reached. BFGS can be shown to converge superlinearly to a local minimum if a quadratic Taylor expansion exists near that minimum [126]. The limited-memory BFGS (L-BFGS) method [126] can also be used when the design dimension becomes very large (e.g., more than 10^4), such that the dense inverse Hessian cannot be stored explicitly.

Algorithm 2: Pseudo-code for BFGS. In this context, ĥ_N(d, w_s^r) is the deterministic objective function we want to minimize (as a function of d).
1: Initialize starting point d_0, inverse Hessian approximation H_0, and gradient termination tolerance ε;
2: Initialize any other termination conditions and parameters;
3: j = 0;
4: while ||∇_d ĥ_N(d_j, w_s^r)|| > ε and other termination conditions are not met do
5:   Compute search direction p_j = −H_j ∇_d ĥ_N(d_j, w_s^r);
6:   Find an acceptable step size α_j via line search;
7:   Update position d_{j+1} = d_j + α_j p_j;
8:   Define vectors s_j = d_{j+1} − d_j and u_j = ∇_d ĥ_N(d_{j+1}, w_s^r) − ∇_d ĥ_N(d_j, w_s^r);
9:   Update inverse Hessian approximation H_{j+1} = (I − s_j u_j^T / (s_j^T u_j)) H_j (I − u_j s_j^T / (s_j^T u_j)) + s_j s_j^T / (s_j^T u_j);
10:  j = j + 1;
11: end
12: Output d̂_s^r = d_j;

2.2.3 Challenges in optimal experimental design

The main challenge in applying the aforementioned stochastic optimization algorithms to batch OED is the lack of readily available gradient information. For RM, we need an unbiased estimator of the gradient of the expected utility, i.e., ĝ in Equation 2.6. For SAA-BFGS, we need the gradient of the finite-sample Monte Carlo approximation of the expected utility, i.e., ∇_d ĥ_N(·, w_s^r). We address these needs by introducing two concepts in the next two sections:

1. A simple surrogate model, based on polynomial chaos expansions (see Section 2.3), replaces the often computationally intensive forward model. The purpose of the surrogate is twofold. First, it allows the nested Monte Carlo estimator in Equation 2.3 to be evaluated in a computationally tractable manner. Second, its polynomial form allows the gradient of Equation 2.3, ∇_d ĥ_N(·, w_s^r), to be derived analytically. These gains come at the expense of introducing additional error via the polynomial approximation of the original forward model, however. In other words, given a surrogate for the forward model and the resulting expected information gain, we can derive exact gradients of a Monte Carlo approximation of this expected information gain, and use these gradients in SAA.

2. Infinitesimal perturbation analysis (see Section 2.4), applied to Equation 2.2 along with the estimator in Equation 2.3 and the polynomial surrogate model, allows the analytic derivation of an unbiased gradient estimator ĝ, as required for the RM approach.
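Before moving to surrogates, the following is a runnable stand-in for the SAA-BFGS loop of Algorithms 1 and 2: SciPy's BFGS implementation is driven over replicate noise sets. The per-sample objective is hypothetical, the gradients here are finite-difference approximations computed internally by SciPy (whereas Sections 2.3 and 2.4 develop analytic gradients), and the optimality-gap bookkeeping of Algorithm 1 is omitted for brevity.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)

    def h_hat(d, w):
        # Same hypothetical per-sample objective as above, now with vector d.
        return (d[0] - 0.7) ** 2 + 0.5 * w * np.sin(3.0 * d[0])

    R, N = 10, 200  # replicate runs and per-run sample size
    designs, values = [], []
    for r in range(R):
        w_s = rng.standard_normal(N)            # fresh noise set for run r
        h_N = lambda d: np.mean(h_hat(d, w_s))  # deterministic SAA objective
        res = minimize(h_N, x0=rng.uniform(size=1), method="BFGS")
        designs.append(res.x)
        values.append(res.fun)
    # The spread of `values` across runs feeds the optimality-gap estimates
    # of Algorithm 1.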
2.3 Polynomial chaos expansions

This section introduces polynomial chaos expansions (PCE) for mitigating the cost of repeated forward model evaluations. In the next section, they will also be used to help evaluate appropriate gradient information for the stochastic optimization methods.

Mathematical models of the experiment enter the inference and design formulation through the likelihood function f(y|θ, d). For example, a simple likelihood function might allow for an additive discrepancy ε between experimental observations and model predictions, y = G(θ, d) + ε, where G is the forward model. Computationally intensive forward models can render Monte Carlo estimation of the expected information gain impractical: drawing a sample from f(y|θ, d) requires evaluating G at a particular (θ, d), and evaluating the density f(y|θ, d) = f_ε(y − G(θ, d)) again requires evaluating G. To make these calculations tractable, one would like to replace G with a cheaper "surrogate" model that is accurate over the entire prior support and the entire design space D. As discussed near the end of Section 1.1.1, various options exist with different properties. We focus on PCE, which has seen extensive use in a range of engineering applications (e.g., [88, 141, 173, 181]), including parameter estimation and inverse problems (e.g., [113, 112, 114]). More recently, it has also been used in the batch OED setting [89, 90], with excellent accuracy and multiple order-of-magnitude speedups over direct evaluations of the forward model.

The formulation of PCE is as follows. Any random variable z with finite variance can be represented by an infinite series

$$ z = \sum_{|i|=0}^{\infty} a_i\, \Psi_i(\xi_1, \xi_2, \ldots), \tag{2.10} $$

where i = (i_1, i_2, ...), i_j ∈ N_0, is an infinite-dimensional multi-index (we bold this index to emphasize its multidimensional nature); |i| = i_1 + i_2 + ... is the l_1 norm; a_i ∈ R are the expansion coefficients; the ξ_j are independent random variables; and

$$ \Psi_i(\xi_1, \xi_2, \ldots) = \prod_{j=1}^{\infty} \psi_{i_j}(\xi_j) \tag{2.11} $$

are multivariate polynomial basis functions [180]. Here ψ_{i_j} is an orthogonal polynomial of order i_j in the variable ξ_j, where orthogonality is with respect to the density of ξ_j,

$$ \mathbb{E}_{\xi}\big[\psi_m(\xi)\,\psi_n(\xi)\big] = \int_{\Xi} \psi_m(\xi)\,\psi_n(\xi)\, f(\xi)\, d\xi = \delta_{m,n}\, \mathbb{E}_{\xi}\big[\psi_m^2(\xi)\big], \tag{2.12} $$

and Ξ is the support of f(ξ). The expansion in Equation 2.10 is convergent in the mean-square sense [39]. For computational purposes, the infinite sum in Equation 2.10 must be truncated to some finite stochastic dimension n_s and a finite number of polynomial terms. A common choice is the "total-order" truncation |i| ≤ p, but other truncations that retain fewer cross terms, a larger number of cross terms, or anisotropy among the dimensions are certainly possible [53].

In the OED context, the model outputs depend on both the parameters and the design. Constructing a new polynomial expansion at each value of d encountered during optimization is generally impractical. Instead, we can construct a single PCE for each component of G, depending jointly on θ and d [90]. To proceed, we assign one stochastic dimension to each component of θ and one to each component of d. Further, we assume an affine transformation between each component of d and the corresponding ξ_i; any realization of d can thus be uniquely associated with a vector of realizations ξ_i. Since the design variables will usually be supported on a bounded domain (e.g., inside some hyper-rectangle), the corresponding ξ_i are endowed with uniform distributions. The associated univariate ψ_i are thus Legendre polynomials.
These distributions effectively define a uniform weight function over the design space D that governs where the L²-convergent PCE should be most accurate. (Ideally, we would like to use a weight function proportional to how often the different d values are visited over the entire algorithm, e.g., by the stochastic optimization. This distribution, if known, could replace the uniform distribution and define a more efficient weighted L² norm; however, it is almost always too complex to extract in practice.)

Constructing the PCE involves computing the coefficients a_i. This computation can generally proceed via two possible approaches, intrusive and nonintrusive. The intrusive approach results in a new system of equations that is larger than the original deterministic system, but it needs to be solved only once. The difficulty of this step depends strongly on the character of the original equations, however, and may be prohibitive for arbitrary nonlinear systems. The nonintrusive approach computes the expansion coefficients directly from the quantity of interest (e.g., the model outputs), for example by projecting it onto the basis functions Ψ_i. One advantage of this method is that the deterministic solver can be reused and treated as a black box. The deterministic problem then needs to be solved many times, but typically at carefully chosen parameter and design values. The nonintrusive approach also offers flexibility in choosing arbitrary functionals of the state trajectory as observation variables; these functionals may depend smoothly on ξ even when the state itself has a less regular dependence. Here, we employ a nonintrusive approach. Applying orthogonality, the PCE coefficients for a forward model surrogate are simply

$$ G_{c,i} = \frac{\mathbb{E}_{\xi}\big[G_c(\theta(\xi), d(\xi))\, \Psi_i(\xi)\big]}{\mathbb{E}_{\xi}\big[\Psi_i^2(\xi)\big]} = \frac{\int_{\Xi} G_c(\theta(\xi), d(\xi))\, \Psi_i(\xi)\, f(\xi)\, d\xi}{\int_{\Xi} \Psi_i^2(\xi)\, f(\xi)\, d\xi}, \tag{2.13} $$

where G_{c,i} is the coefficient of Ψ_i for the cth component of the model outputs. Analytic expressions are available for the denominators E_ξ[Ψ_i²(ξ)], but the numerators must be evaluated numerically. When the evaluations of the integrand (and hence the forward model) are expensive and n_s is large, an efficient method for numerical integration in high dimensions is essential.

To evaluate the numerators in Equation 2.13, we employ Smolyak sparse quadrature based on one-dimensional Clenshaw-Curtis quadrature rules [50]. Care must be taken to avoid significant aliasing errors when using sparse quadrature to construct polynomial approximations, however. Indeed, it is advantageous to recast the approximation as a Smolyak sum of constituent full-tensor polynomial approximations, each associated with a tensor-product quadrature rule that is appropriate to its polynomials [54, 53]. This type of approximation may be constructed adaptively, thus taking advantage of weak coupling and anisotropy in the dependence of G on θ and d. More details can be found in [53].

At this point, we may substitute the polynomial approximation of G into the likelihood function f(y|θ, d), which in turn enters the expected information gain estimator in Equation 2.3. This enables fast evaluation of the expected information gain. The computation of appropriate gradient information is discussed next.
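The following is a minimal sketch of nonintrusive projection (Equation 2.13) for one parameter and one design dimension, using a full tensor-product Gauss-Legendre rule in place of the Smolyak construction described above; the forward model is again a hypothetical stand-in.

    import numpy as np
    from numpy.polynomial import legendre

    def G(theta, d):
        return np.exp(-d * theta)  # hypothetical forward model on [0, 1]^2

    # Gauss-Legendre nodes/weights on [-1, 1]; 7 points per dimension integrate
    # polynomials up to degree 13 exactly (under-resolved quadrature causes the
    # aliasing errors discussed in the text).
    p = 6
    nodes, weights = legendre.leggauss(p + 1)
    X1, X2 = np.meshgrid(nodes, nodes, indexing="ij")  # tensor-product grid
    W = np.outer(weights, weights)
    Gvals = G((X1 + 1) / 2, (X2 + 1) / 2)              # map [-1, 1] -> [0, 1]

    # Project onto the total-order Legendre basis (Equation 2.13); the factor
    # (2i+1)(2j+1)/4 combines the uniform densities with E[Psi_i^2].
    coeffs = {}
    for i in range(p + 1):
        for j in range(p + 1 - i):                     # total-order truncation |i| <= p
            Li = legendre.Legendre.basis(i)(X1) * legendre.Legendre.basis(j)(X2)
            coeffs[(i, j)] = (2 * i + 1) * (2 * j + 1) / 4.0 * np.sum(W * Gvals * Li)

    def surrogate(theta, d):
        """Evaluate the PCE surrogate at (theta, d) in [0, 1]^2."""
        xi1, xi2 = 2 * theta - 1, 2 * d - 1
        return sum(c * legendre.Legendre.basis(i)(xi1) * legendre.Legendre.basis(j)(xi2)
                   for (i, j), c in coeffs.items())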
2.4 Infinitesimal perturbation analysis

This section applies the method of infinitesimal perturbation analysis (IPA) [87, 76, 7] to construct an unbiased estimator ĝ of the gradient of the expected information gain, for use in RM. The same procedure yields the gradient ∇_d ĥ_{N,M}(·, w_s^r) of a finite-sample Monte Carlo approximation of the expected information gain, for use in SAA. The central idea of IPA is that, under certain conditions, an unbiased estimator of the gradient of a function can be obtained by simply taking the gradient of an unbiased estimator of the function. We apply this idea in the context of batch OED.

The first requirement of IPA is the availability of an unbiased estimator of the objective function. Unfortunately, as described in Section 2.1, Û_{N,M} from Equation 2.3 is a biased estimator of U for finite M [145]. To circumvent this technicality, let us optimize the following objective function instead of U:

$$ \bar{U}_M(d) \equiv \mathbb{E}_{\theta_s, y_s \mid d}\big[\hat{U}_{N,M}(d, \theta_s, y_s)\big] = \int_{\mathcal{Y}_s} \int_{\mathcal{H}_s} \hat{U}_{N,M}(d, \theta_s, y_s)\, f(\theta_s, y_s \mid d)\, d\theta_s\, dy_s = \int_{\mathcal{Y}_s} \int_{\mathcal{H}_s} \hat{U}_{N,M}(d, \theta_s, y_s) \prod_{(i,j)=(1,1)}^{(N,M)} f\big(y^{(i)} \mid \theta^{(i)}, d\big)\, f\big(\theta^{(i)}\big)\, f\big(\tilde{\theta}^{(i,j)}\big)\, d\theta_s\, dy_s. \tag{2.14} $$

Our original estimator Û_{N,M} is now unbiased for the new objective Ū_M by construction! The trade-off, of course, is that the function being optimized is no longer the true U. But it is consistent, in that Ū_M(d) → U(d) as M → ∞, for any N > 0. (To illustrate this convergence in the numerical results of Section 2.5, realizations of Û_{N,M}, i.e., Monte Carlo approximations of Ū_M, are plotted in Figure 2-2 for varying M.)

The second requirement of IPA comprises conditions allowing an unbiased gradient estimator to be constructed by taking the gradient of the unbiased function estimator. Standard conditions (see, for example, [7]) require that the random quantity (e.g., Û_{N,M}) be almost surely continuous and differentiable. Here, because Û_{N,M} is parameterized by continuous random variables that have densities with respect to Lebesgue measure, we can take a perspective that relies on Leibniz's rule, with the following conditions:

1. Û_{N,M} and ∇_d Û_{N,M} are continuous over the product space of design variables and random variables, D × H_s × Y_s;
2. the density of the "noise" random variable is independent of d.

The first condition supports the interchange of differentiation and integration according to Leibniz's rule. This condition might be difficult to verify in general cases, but the use of finite-order polynomial forward models and continuous distributions for the prior and observational noise ensures that we meet the requirement. The second condition is needed to preserve the form of the expectation. If it is violated, differentiation with respect to d must also be performed on the f(θ_s, y_s|d) term via the product rule, in which case the additional term ∫_{Y_s}∫_{H_s} Û_{N,M}(d, θ_s, y_s) ∇_d[f(θ_s, y_s|d)] dθ_s dy_s would no longer be an expectation with respect to the original density. The likelihood-ratio method may be used to restore the expectation [77, 7], but it is not pursued here. Instead, it is simpler to transform the noise to a design-independent random variable, as described in Section 2.2.2.

In the context of OED, the outcome of the experiment y is a stochastic quantity that depends on the design d; from the stochastic optimization perspective, y is thus the noise variable. To demonstrate the transformation to design-independent noise, we assume a likelihood in which the data result from an additive Gaussian perturbation to the forward model:

$$ y = G(\theta, d) + \epsilon = G(\theta, d) + C(\theta, d)\, z. \tag{2.15} $$

Here C is a diagonal matrix with non-zero entries reflecting the dependence of the noise standard deviation on other quantities, and z is a vector of i.i.d. standard normal random variables.
For example, "10% Gaussian noise on the cth component" would translate to C_{c,i} = δ_{ci} 0.1 |G_c(θ, d)|, where δ_{ci} is the Kronecker delta. For other forms of the likelihood, the right-hand side of Equation 2.15 is simply replaced by a generic function of θ, d, and some design-independent random variable z. Here, however, we will focus on the additive Gaussian form in order to derive illustrative expressions.

By extracting a design-independent random variable z from the noise term ε ≡ C(θ, d)z, we satisfy the second condition above. The design dependence of y is incorporated into Û_{N,M} by substituting Equation 2.15 into Equation 2.3:

$$ \hat{U}_{N,M}(d, \theta_s, z_s) = \frac{1}{N} \sum_{i=1}^{N} \left\{ \ln f_{y|\theta,d}\big( G(\theta^{(i)}, d) + C(\theta^{(i)}, d)\, z^{(i)} \,\big|\, \theta^{(i)}, d \big) - \ln\!\left[ \frac{1}{M} \sum_{j=1}^{M} f_{y|\theta,d}\big( G(\theta^{(i)}, d) + C(\theta^{(i)}, d)\, z^{(i)} \,\big|\, \tilde{\theta}^{(i,j)}, d \big) \right] \right\}, \tag{2.16} $$

where z_s = {z^{(i)}}. The new noise variables are now independent of d. The samples y^{(i)} drawn from the likelihood are instead realized by drawing z^{(i)} from a multivariate standard Gaussian, multiplying these samples by C, and adding them to the model output.

With all conditions for IPA satisfied, an unbiased estimator of the gradient of Ū_M, corresponding to ĝ in Equation 2.6, is simply ∇_d Û_{N,M}(d, θ_s, z_s), since

$$ \mathbb{E}_{\theta_s, z_s}\big[\nabla_d \hat{U}_{N,M}(d, \theta_s, z_s)\big] = \int_{\mathcal{Z}_s} \int_{\mathcal{H}_s} \nabla_d \hat{U}_{N,M}(d, \theta_s, z_s)\, f(\theta_s, z_s)\, d\theta_s\, dz_s = \nabla_d \int_{\mathcal{Z}_s} \int_{\mathcal{H}_s} \hat{U}_{N,M}(d, \theta_s, z_s)\, f(\theta_s, z_s)\, d\theta_s\, dz_s = \nabla_d\, \mathbb{E}_{\theta_s, z_s}\big[\hat{U}_{N,M}(d, \theta_s, z_s)\big] = \nabla_d \bar{U}_M(d), \tag{2.17} $$

where Z_s is the support of f(z_s). This gradient estimator is therefore suitable for use in RM. The gradient of the finite-sample Monte Carlo approximation of U(d), i.e., ∇_d ĥ_{N,M}(·, w_s^r) used in SAA, takes exactly the same form. The only difference between the two is that ĝ lets θ_s and z_s be random at every iteration of the optimization process, whereas in ∇_d ĥ_{N,M}(·, w_s^r) the samples θ_s and z_s are frozen at some realization throughout the optimization process. In either case, these gradient expressions contain derivatives of the likelihood function, and thus the derivatives ∇_d G(θ, d). When G is replaced with a polynomial expansion, these derivatives can be computed inexpensively. Detailed derivations of the gradient estimator using orthogonal polynomial expansions can be found in Appendix A.
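As a sketch of the reparameterization idea, the code below builds Û_{N,M} from Equation 2.16 with design-independent noise z and frozen samples (θ_s, z_s), and then differentiates it with respect to d. A central finite difference stands in for the analytic PCE-based gradients of Appendix A; the scalar forward model and constant noise level are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    sigma = 0.1

    def G(theta, d):
        return np.exp(-d * theta)  # hypothetical forward model

    def u_hat_fixed(d, theta, theta_in, z):
        """U_hat_{N,M} of Equation 2.16; Gaussian normalizing constants cancel
        in the outer/inner difference and are dropped."""
        y = G(theta, d) + sigma * z  # reparameterized data, Equation 2.15
        outer = -0.5 * ((y - G(theta, d)) / sigma) ** 2
        inner = -0.5 * ((y[:, None] - G(theta_in, d)) / sigma) ** 2
        log_evid = np.logaddexp.reduce(inner, axis=1) - np.log(theta_in.shape[1])
        return np.mean(outer - log_evid)

    # Freeze all random inputs once; u_hat_fixed is then smooth in d, and the
    # common-random-numbers gradient used by SAA can be approximated directly.
    N, M = 500, 50
    theta = rng.uniform(size=N)
    theta_in = rng.uniform(size=(N, M))
    z = rng.standard_normal(N)

    def grad_u_hat(d, eps=1e-5):
        args = (theta, theta_in, z)
        return (u_hat_fixed(d + eps, *args) - u_hat_fixed(d - eps, *args)) / (2 * eps)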
2.5 Numerical results: 2D diffusion source inversion problem

2.5.1 Problem setup

We demonstrate the batch OED formulation and stochastic optimization tools on a source inversion problem in a 2D diffusion field. The goal is to place a single sensor that yields maximum information about the location of the contaminant source. Contaminant transport is governed by a scalar diffusion equation on a square domain:

$$ \frac{\partial w}{\partial t} = \nabla^2 w + S(x_{src}, x, t), \quad x \in \mathcal{X} = [0, 1]^2, \tag{2.18} $$

where w(x, t; x_{src}) is the space-time concentration field, parameterized by the coordinates of the source center x_{src}. We impose homogeneous Neumann boundary conditions on ∂X,

$$ \nabla w \cdot n = 0, \tag{2.19} $$

where n is the normal vector, along with a zero initial condition,

$$ w(x, 0; x_{src}) = 0. \tag{2.20} $$

The source function has a Gaussian spatial profile,

$$ S(x_{src}, x, t) = \begin{cases} \dfrac{s}{2\pi h^2} \exp\!\left( -\dfrac{\| x_{src} - x \|^2}{2 h^2} \right), & 0 \le t < \tau, \\[1ex] 0, & t \ge \tau, \end{cases} \tag{2.21} $$

where s, h, and τ are known (prescribed) source intensity, width, and shutoff time parameters, respectively, and x_{src} ≡ (θ_x, θ_y) = θ is the unknown source location that we would ultimately like to infer. The design vector is the location of a single sensor, x_{sensor} ≡ (d_x, d_y) = d, and the observations {y_i}_{i=1}^5 comprise five noisy point measurements of w at the sensor location, taken at five equally-spaced sample times.

For this study, we choose s = 2.0, h = 0.05, τ = 0.3; a uniform prior θ_x, θ_y ∼ U(0, 1); and an additive noise likelihood model y_i = w(x_{sensor}, t_i; x_{src}) + ε_i, i = 1 ... 5, such that the ε_i are zero-mean Gaussian random variables, mutually independent given x_{sensor}, t, and x_{src}, each with standard deviation σ_i = 0.1 + 0.1 |w(x_{sensor}, t_i; x_{src})|. In other words, the measurement noise associated with the data has a "floor" value of 0.1 plus an additional contribution that is 10% of the signal. The sensor may be placed anywhere in the square domain, such that the design space is (d_x, d_y) ∈ [0, 1]². Figure 2-1 shows an example concentration profile and measurements.

[Figure 2-1: Example forward model solution and realizations from the likelihood. The solid line represents the time-dependent contaminant concentration w(x, t; x_{src}) at x = x_{sensor} = (0, 0), given a source centered at x_{src} = (0.1, 0.1), source strength s = 2.0, width h = 0.05, and shutoff time τ = 0.3; parameters are defined in Equations 2.18 and 2.21. The five crosses represent noisy measurements at the five designated measurement times.]

Evaluating the forward model thus requires solving the PDE in Equation 2.18 at fixed realizations of θ = x_{src} and extracting the solution field at the design location d = x_{sensor}. We discretize Equation 2.18 using 2nd-order centered differences on a 25 × 25 spatial grid and a 4th-order backward differentiation formula for time integration. As described in Section 2.3, we replace the full forward model with a PCE surrogate for computational efficiency. To this end, we construct a Legendre polynomial approximation of the forward model output over the 4-dimensional joint parameter and design space, using a total-order polynomial truncation of degree 12 and 10^6 forward model evaluations. This high polynomial degree and rather large number of forward model evaluations are deliberately selected in order to render truncation and aliasing errors insignificant in our study. OED results of similar quality may be obtained for this problem with surrogates of lower order and with far fewer quadrature points (e.g., degree 4 with 10^4 forward model evaluations), but for brevity they are not included here. The relative L² errors of the current surrogate range from 6 × 10^{-3} to 10^{-6}.

The OED formulation now seeks the sensor location d* = x*_{sensor} such that when the experiment is performed, on average (i.e., averaged over all possible source locations according to the prior, and over all possible resulting concentration measurements according to the likelihood), the five concentration readings {y_i}_{i=1}^5 yield the greatest information gain from prior to posterior.
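A minimal finite-difference solver for Equations 2.18 through 2.21 is sketched below. Forward Euler in time stands in for the BDF4 scheme used in the text (chosen only to keep the sketch short), and nearest-grid-point sensor readout is a simplification.

    import numpy as np

    def solve_diffusion(x_src, x_sensor, n=25, T=0.4, s=2.0, h=0.05, tau=0.3):
        """Explicit finite-difference solve of Eqs. 2.18-2.21 on [0, 1]^2.

        Returns the concentration time series at the sensor location.
        """
        dx = 1.0 / (n - 1)
        dt = 0.2 * dx**2                        # explicit stability restriction
        xs = np.linspace(0.0, 1.0, n)
        X, Y = np.meshgrid(xs, xs, indexing="ij")
        source = s / (2 * np.pi * h**2) * np.exp(
            -((X - x_src[0]) ** 2 + (Y - x_src[1]) ** 2) / (2 * h**2))
        ii = int(round(x_sensor[0] * (n - 1)))  # nearest grid point to the sensor
        jj = int(round(x_sensor[1] * (n - 1)))
        w = np.zeros((n, n))                    # zero initial condition, Eq. 2.20
        t, times, trace = 0.0, [], []
        while t < T:
            wp = np.pad(w, 1, mode="edge")      # ghost cells give Neumann BCs, Eq. 2.19
            lap = (wp[2:, 1:-1] + wp[:-2, 1:-1] + wp[1:-1, 2:] + wp[1:-1, :-2]
                   - 4.0 * w) / dx**2
            w = w + dt * (lap + source * (t < tau))  # source shuts off at t = tau
            t += dt
            times.append(t)
            trace.append(w[ii, jj])
        return np.array(times), np.array(trace)

    times, conc = solve_diffusion(x_src=(0.1, 0.1), x_sensor=(0.0, 0.0))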
2.5.2 Results

Objective function

Before we present the results of numerical optimization, we first explore the properties of the expected information gain objective. Numerical realizations of Û_{N,M} for N = 1001 and M = 2, 11, 101, and 1001 are shown in Figure 2-2. These plots can be interpreted as 1-sample Monte Carlo approximations of Ū_M = E[Û_{N,M}], or equivalently, as l-sample Monte Carlo approximations of Ū_M = E[Û_{(N/l),M}]. As N grows, Û_{N,M} becomes a better approximation to Ū_M; as M grows, Ū_M becomes a better approximation to U. The figures show that values of Û_{N,M} increase as M increases (for fixed N), suggesting a negative bias at finite M. At the same time, the objective becomes less flat in d; since U is certainly closer to the M = 1001 surface than to the M = 2 surface, these results suggest that U is not particularly flat in d. This feature of the current design problem is encouraging, since stochastic optimization problems with higher curvature can be solved more easily; in the context of stochastic optimization, for example, they effectively have a higher signal-to-noise ratio.

The expected information gain objective inherits symmetries from the square domain, as expected from the physical nature of the problem. The plots also suggest a smooth albeit non-convex underlying objective U, with inflection points lying on an interior circle and four local maxima symmetrically located at the corners of the design space. The best placement for a single sensor is therefore at the corners of the design space, while the worst placement is at the center. The reason for this perhaps counterintuitive result is that the diffusion process is isotropic: a series of concentration measurements can only determine the distance of the source from the sensor, not its orientation. The posterior distribution thus resembles an annulus of constant radius surrounding the sensor. A sensor placement that minimizes the area of these annuli, averaged over all possible source locations according to the prior, tends to be optimal. In this problem, because of the domain geometry and the magnitude of the observational noise, these optimal locations happen to be the furthest points from the domain center, i.e., the corners.

[Figure 2-2: Surface plots of independent Û_{N,M} realizations, evaluated over the entire design space [0, 1]² ∋ d = (x, y): (a) N = 1001, M = 2; (b) N = 1001, M = 11; (c) N = 1001, M = 101; (d) N = 1001, M = 1001. Note that the vertical axis ranges and color scales vary among the subfigures.]

Figure 2-3 shows posterior densities for the source location, under different sensor placements, given data generated from a "true" source centered at x_{src} = (0.09, 0.22). The posterior densities are evaluated using the PCE surrogate via Bayes' rule, while the data are generated by directly solving the diffusion equation on a denser (101 × 101) spatial grid than before and then adding the Gaussian noise described in Section 2.5.1. Note that the posteriors are extremely non-Gaussian. Moreover, they generally include the true source location, but do not center on it. There are two reasons not to expect the posterior mode to match the true source location: first, we have only 5 measurements, each perturbed with relatively significant random noise; second, there is model error, due to mismatch between the PCE approximation constructed from the coarser spatial discretization of the PDE and the more finely discretized PDE model used to simulate the data. (Indeed, there are two levels of model error: (1) between the PCE and the PDE model used to construct it, which has a ∆x = ∆y = 1/24 spatial discretization; and (2) between this PDE model and the more finely discretized (∆x = ∆y = 1/100) PDE model used to simulate the noisy data. Model error is an extremely important aspect of uncertainty quantification [94], but its treatment is beyond the scope of this thesis; understanding the impact of model error on OED is an important direction for future work.) For this source configuration, it appears that a sensor placed at any of the corners yields a "tighter" posterior than a sensor placed at the center. But we must keep in mind that this result is not guaranteed for all source locations and data realizations; it depends on where the source actually is.
[Imagine, for example, if the source happened to be very close to the center of the domain; then the sensor at (0.5, 0.5) would yield the tightest posterior.] What the batch OED method yields is the optimal sensor placement averaged over the prior distribution of the source location and the predictive distribution of the data.

[Figure 2-3: Contours of posterior densities for the source location, given different sensor placements: (a) x_{sensor} = (0.0, 0.0); (b) x_{sensor} = (0.0, 1.0); (c) x_{sensor} = (1.0, 0.0); (d) x_{sensor} = (1.0, 1.0); (e) x_{sensor} = (0.5, 0.5). The true source location, marked with a blue circle, is x_{src} = (0.09, 0.22).]

Stochastic optimization results

We now analyze the optimization results, first assessing the behavior of the two stochastic optimization methods individually, and then comparing their performance. Simple termination criteria are used for both methods, stopping when ||d_j − d_{j−1}|| falls below a tolerance of 10^{-6} for 5 successive iterations, or when a maximum number of 50 iterations has been reached.

Recall that the RM algorithm is essentially a steepest-ascent method (since we are maximizing the expected utility) with a stochastic gradient estimate. Figures 2-4 through 2-6 each show four sample RM optimization paths overlaid on the Û_{N,M} surfaces from Figure 2-2. The optimization does not always proceed in an ascent direction, due to the noise in the gradient estimate, but even a noisy gradient can be useful in eventually guiding the algorithm to regions of high objective value. Naturally, fewer iterations are needed and good designs are more likely to be found when the variance of the gradient estimator is reduced by increasing N and M. Note that one must be cautious not to over-generalize from these figures, since the paths shown in each plot are not necessarily representative; their purpose is to provide intuition about the optimization mechanics. Data derived from many runs are more appropriate performance metrics, and will be used later in this section.

For SAA-BFGS, each choice of the sample set w_s^r yields a different deterministic objective; example realizations of this objective surface are shown in Figures 2-7 through 2-9. For each realization, a local maximum is found efficiently by the BFGS algorithm, requiring only a few (usually fewer than 10) iterations. For each set of results corresponding to a particular N (i.e., each of Figures 2-7 through 2-9), the random numbers used for smaller values of M are proper subsets of those used for larger M. We thus expect some similarity and a sense of convergence among the subplots in each figure.
Note also that when N is low, realizations of the objective can be extremely different from Figure 2-2 (for example, the plots in Figure 2-7 have local maxima near the center of the domain), although improvement is observed as N is increased. In general, each deterministic problem in SAA can have very different features than the underlying objective function. None of the realizations encountered here has maxima at the corners, or is even symmetric. Nonetheless, when sampling over many SAA subproblems, even a low N can provide reasonably good results. This will be shown in Tables 2.1 and 2.2, and discussed in detail below.

[Figure 2-4: Sample paths of the RM algorithm with N = 1, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values: (a) N = 1, M = 2; (b) N = 1, M = 11; (c) N = 1, M = 101; (d) N = 1, M = 1001. The large marker is the starting position and the large × is the final position.]

To compare the performance of RM and SAA-BFGS, 1000 independent runs are conducted for each algorithm, over a matrix of N and M values. The starting locations of these runs are sampled from a uniform distribution over the design space. We make reasonable choices for the numerical parameters in each algorithm (e.g., gain schedule scaling, termination criteria), leading to similar run times. Histograms of the final design parameters (sensor positions) resulting from each set of 1000 optimization runs are shown in Table 2.1. The top figures in each major row represent RM results, while the bottom figures in each major row correspond to SAA-BFGS results; columns correspond to different values of M. It is immediately apparent that more designs cluster at the corners of the domain as N and M are increased.

[Figure 2-5: Sample paths of the RM algorithm with N = 11, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values: (a) N = 11, M = 2; (b) N = 11, M = 11; (c) N = 11, M = 101; (d) N = 11, M = 1001. The large marker is the starting position and the large × is the final position.]

For the case with the largest number of samples (N = 101 and M = 1001), each corner has around 250 designs, suggesting that higher sample sizes cannot further improve the optimization results. An "overlap" in quality across the different N cases is also observed: for example, results of the N = 101, M = 2 case are worse than those of the N = 11, M = 1001 case. A balance is thus needed in choosing the sample sizes N and M; it is not ideal to heavily favor sampling either the inner or the outer Monte Carlo loop in Û_{N,M}. Overall, comparing the RM and SAA-BFGS plots at intermediate values of M and N, we see that RM has a slight advantage over SAA-BFGS by placing more designs at the corners.

The distribution of final designs alone does not reflect the robustness of the optimization results.
For example, if U is very flat near the optimum, then suboptimal designs need not be very close to the true optimum in the design space to be considered good designs in practice.

[Figure 2-6: Sample paths of the RM algorithm with N = 101, overlaid on Û_{N,M} surfaces from Figure 2-2 with the corresponding M values: (a) N = 101, M = 2; (b) N = 101, M = 11; (c) N = 101, M = 101; (d) N = 101, M = 1001. The large marker is the starting position and the large × is the final position.]

To evaluate robustness, a "high-quality" objective estimate Û_{1001,1001} is computed for each of the 1000 final designs considered above. The resulting histograms are shown in Table 2.2, where again the top subrows are for RM and the bottom subrows are for SAA-BFGS, with the results covering the full range of N and M values. In keeping with our previous observations, performance improves as N and M are increased, in that the mean (over the optimization runs) expected information gain increases while the variance in the expected information gain decreases. Note, however, that even if all 1000 optimization runs produced identical final designs, this variance would not reach zero, as there exists a "floor" corresponding to the variance of the estimator Û_{1001,1001}. This minimum variance can be observed in the histograms of the RM results with N = 101 and M = 101 or 1001.

[Figure 2-7: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 1: (a) N = 1, M = 2; (b) N = 1, M = 11; (c) N = 1, M = 101; (d) N = 1, M = 1001. The large marker is the starting position and the large × is the final position.]

One interesting feature of the histograms in Table 2.2 is their bimodality. The higher mode reflects designs near the four corners, while the lower mode encompasses all other suboptimal designs. As N or M increase, we observe a transfer of probability mass from the lower mode to the upper mode. However, the sample sizes are not large enough for the lower mode to disappear completely in most cases; it is only absent in the two RM cases with the largest sample sizes. Overall, the histograms are similar in shape for both algorithms, but RM appears to produce less variability in the expected information gain, particularly at high N values.

Table 2.3 shows histograms of optimality gap estimates from the 1000 SAA-BFGS runs.

[Figure 2-8: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 11: (a) N = 11, M = 2; (b) N = 11, M = 11; (c) N = 11, M = 101; (d) N = 11, M = 1001. The large marker is the starting position and the large × is the final position.]
Since we are dealing with a maximization problem (for the expected information gain), the estimator from Section 2.2.2 is reversed in sign, such that the upper bound is now h̄_N and the lower bound is ĥ_{N′}(d̂_s^r, w_{s′}^r). The lower bound must be evaluated with the same inner-loop Monte Carlo sample size M used in the optimization run, in order to represent an identically biased underlying objective; hence, the lower bound values will not be the same as the "high-quality" objective estimates Û_{1001,1001} discussed above. From the table, we observe that as N increases, values of the optimality gap estimate decrease. This is a result of the lower bound rising with N (since the optimization is better able to find designs in regions of large Ū_M, e.g., corners of the domains in Table 2.1), and the upper bound simultaneously falling (since its positive bias monotonically decreases with N [127]). Consequently, both bounds become tighter and the gap estimates tend toward zero.

[Figure 2-9: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, with N = 101: (a) N = 101, M = 2; (b) N = 101, M = 11; (c) N = 101, M = 101; (d) N = 101, M = 1001. The large marker is the starting position and the large × is the final position.]

As M increases, the variance of the gap estimates increases. Since the upper bound (h̄_N) is fixed for a given set of SAA runs, the spread is only affected by the variability of the lower bound. Indeed, from Figure 2-2, it is apparent that the objective becomes less flat as M increases, with the highest gradients (considering the good design regions only) occurring at the corners. This translates to a higher sensitivity, as a small "imperfection" in the design would lead to larger changes in the objective estimate; one would then expect the variation of ĥ_{N′}(d̂_s^r, w_{s′}^r) to become higher as well, leading to greater variance in the gap estimates.

Finally, as M increases, the histogram values tend to increase, but they increase more slowly for larger values of N. Some intuition for this result may be obtained by considering the relative rates of change of the upper and lower bounds with respect to M, given different values of N. Again referring to Figure 2-2, the objective values generally increase with M, indicating an increase of the lower bound. This increase should be more pronounced for larger N, since the optimization converges to designs closer to the corners, where, as mentioned earlier, the objective has larger gradients. The upper bound increases with M as well, as indicated by the contour levels in Figures 2-7 through 2-9, but this rate of increase is observed to be slowest at the highest N (i.e., in Figure 2-9). Combining these two effects, it is reasonable that as N increases, the gap estimate increases with M at a slower rate.

Can the optimality gap be used to choose values of M and N? For a fixed M, we certainly have convergence as N increases, and the gap estimate can be a good indicator of solution quality.
However, because different values of M correspond to different objective surfaces (due to the bias of Û_{N,M}), the optimality gap is unsuitable for comparisons across different values of M; indeed, in our example, even though solution quality improves with M, the gap estimates appear looser and noisier.

Another performance metric we extract from the stochastic optimization runs is the number of iterations required to reach a solution; histograms of iteration counts for RM and SAA, for the same matrix of M and N values, are shown in Table 2.4. At low sample sizes, many of the SAA-BFGS runs take only a few iterations, while almost all of the RM runs terminate at the maximum allowable number of iterations (50 in this case). This difference again reflects the efficiency of BFGS for deterministic optimization problems. As N and M are increased, the histograms show a "transfer of mass" from higher iteration numbers to lower iteration numbers, coinciding somewhat with the bimodal behavior described previously. The reduction in iteration count with increased sample size implies that an n-fold increase in sample size leads to an increase in computational time that is often much less than a factor of n. Accounting for this sublinear relationship when allocating computational resources, especially if samples can be drawn in parallel, can lead to substantial savings. Although SAA-BFGS generally requires fewer iterations, each iteration takes longer than a step of RM. RM thus offers a higher "resolution" in run times, potentially giving the user more freedom in stopping the algorithm, and it becomes more attractive as the evaluation of the objective function becomes more expensive.

As a single integrated measure of the quality of the stochastic optimization solutions, we evaluate the following mean squared error (MSE):

$$ \mathrm{MSE} = \frac{1}{R} \sum_{r=1}^{R} \left( \hat{U}_{1001,1001}\big(d^r, \theta_s^{r'}, z_s^{r'}\big) - U^{\mathrm{ref}} \right)^2, \tag{2.22} $$

where d^r, r = 1 ... R, are the final designs from a given optimization algorithm, and U^{ref} is the true optimal value of the expected information gain. Since the true optimum is unavailable in this study, U^{ref} is taken to be the maximum value of the objective over all runs. Recall that the MSE combines the effects of bias and variance; here it reflects the variance in objective values plus the squared difference between the mean objective value and the true optimum, calculated via R = 1000 replicate optimization runs.

Figure 2-10 relates solution quality to computational effort by plotting the MSE against average computational time (per run). Each symbol represents a particular value of N (the three symbols represent N = 1, 11, and 101, respectively), while the four different M values are reflected through the average run times. These plots confirm the behavior we have previously encountered. Solution quality generally improves (lower MSE) with increasing sample sizes, although a balanced allocation of samples must be chosen. For instance, a large N with small M can yield solutions inferior to a smaller N with larger M; meanwhile, for any given N, continued increases in M beyond some threshold yield minimal improvements in MSE. The best sample allocation is described by the minimum of all the curves. We highlight these "optimal fronts" in light red for RM and in light blue for SAA-BFGS. Monte Carlo error in the "high-quality" estimator Û_{1001,1001} may also be reflected in the non-zero MSE asymptote for the high-N RM cases.
According to Figure 2-10, RM outperforms SAA-BFGS by consistently achieving smaller MSE for a given computational effort. One should be cautious, however, in generalizing from these numerical experiments. The advantage of RM is relatively small, and other factors such as code optimization, choices of algorithm parameters, and of course the OED problem itself can affect or even reverse this advantage.

[Figure 2-10: Mean squared error, defined in Equation 2.22, versus average run time for each optimization algorithm and various choices of inner-loop and outer-loop sample sizes: (a) RM; (b) SAA-BFGS; (c) RM and SAA-BFGS "optimal fronts." The highlighted curves are "optimal fronts" for RM (light red) and SAA-BFGS (light blue).]

[Table 2.1: Histograms of final search positions resulting from 1000 independent runs of RM (top subrows) and SAA (bottom subrows) over a matrix of N and M sample sizes. For each histogram, the bottom-right and bottom-left axes represent the sensor coordinates x and y, respectively, while the vertical axis represents frequency.]

[Table 2.2: High-quality expected information gain estimates at the final sensor positions resulting from 1000 independent runs of RM (top subrows, blue) and SAA-BFGS (bottom subrows, red). For each histogram, the horizontal axis represents values of Û_{1001,1001} and the vertical axis represents frequency.]
[Table of histograms over a matrix of sample sizes M and N.] Table 2.3: Histograms of optimality gap estimates for SAA-BFGS, over a matrix of sample sizes M and N. For each histogram, the horizontal axis represents the value of the gap estimate and the vertical axis represents frequency.

[Table of histograms over a matrix of sample sizes M and N.] Table 2.4: Number of iterations in each independent run of RM (top subrows, blue) and SAA-BFGS (bottom subrows, red), over a matrix of sample sizes M and N. For each histogram, the horizontal axis represents iteration number and the vertical axis represents frequency.

Chapter 3

Formulation for Sequential Design

Having described batch optimal experimental design (OED) in the previous chapter, we now extend our framework to the more general setting of sequential OED (sOED) for the rest of this thesis. sOED allows experiments to be conducted in sequence, thus permitting newly acquired experimental observations to help guide the design of future experiments. While batch OED techniques may be repeatedly applied to sequential problems, such a procedure is not optimal. We provide the optimal design formulation for sequential experimental design. In particular, it targets a finite number of experiments, adopts a Bayesian treatment of uncertainty, employs an information measure objective, and accommodates nonlinear models under continuous parameter, design, and observation spaces. While the sOED notation remains similar to the batch OED formulation in Chapter 2, some conflicts do arise. To avoid confusion, we provide a full and detailed formulation of the sOED problem in this chapter. Numerical techniques for solving the problem will be presented in subsequent chapters.

3.1 Problem definition

A complete formulation for optimal sequential design needs to account for all sources of uncertainty over the entire relevant time period, under a full description of the system state and its evolution dynamics. In essence, we need to establish a mathematical description of all factors that determine which designs are optimal under different situations. With this goal in mind, we first define the core formulation components, and then state the sOED problem.
At this point, the formulation remains general, and does not assume an experimental goal of parameter inference (we will specialize later in Section 3.3).

• Experiment index: k = 0, . . . , N − 1. The experiments are assumed to be discrete and ordered by the integer index k, for a total of N experiments. We consider a finite horizon N.

• State: x_k = [x_{k,b}, x_{k,p}]. The state should encompass all information about the system necessary for making optimal future experimental design decisions. Generally, this includes a belief state component x_{k,b} that reflects the current state of uncertainty, and a physical state component x_{k,p} that describes any deterministic decision-relevant variables. We consider continuous, and possibly unbounded, state variables. Specific choices will be discussed later.

• Design: d_k ∈ D_k. The design (also known as a "control", "action", or "decision" in other contexts) represents the conditions under which the experiment is to be performed. Moreover, we seek a policy (also known as a "controller" or "decision rule") π ≡ {µ_0, µ_1, . . . , µ_{N−1}} consisting of a set of policy functions, one for each experiment, that indicates which design to perform depending on the current state: µ_k(x_k) = d_k. We consider continuous design variables. Design methods that produce a policy are known as sequential (closed-loop) designs, because feedback of observations from experiments is necessary to determine the current state, which in turn is needed to apply the policy. This is in contrast to batch (open-loop) designs, where the designs are determined before any experiments are performed. These designs depend only on the initial state, not on subsequent designs or their observations, and hence involve no feedback. These perspectives of batch and sequential design are illustrated in Figure 3-1.

• Observations: y_k ∈ Y_k. The observations (also referred to as "noisy measurements" or "data" in this thesis) from the experiment are assumed to be the only source of uncertainty in the system, and often incorporate measurement noise and model inadequacy. Some models also have internal stochasticity as part of the system dynamics; we currently do not study these cases. We consider continuous observation variables.

• Stage reward: g_k(x_k, y_k, d_k). The stage reward reflects the immediate reward associated with performing a particular experiment. This quantity could depend on the state, observations, or design. Typically, it would reflect the monetary and time costs of performing the experiment, as well as any additional benefits or penalties.

• Terminal reward: g_N(x_N). The terminal reward serves as a mechanism to end the system dynamics by providing a reward value based solely on the final system state x_N.

• System dynamics: x_{k+1} = F_k(x_k, y_k, d_k). The system dynamics (also known as the "transition function", "transfer function", or simply "the model" in other contexts) describe the evolution of the system state after performing an experiment, incorporating the design and observations of that experiment. This includes the propagation of both the belief state and the physical state. The specific dynamics depend on the choice of the state variable, and will be discussed later.
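To fix ideas, the following minimal Python sketch collects these components into one container; every name here is an illustrative placeholder rather than part of any implementation in this thesis, and the callables stand in for problem-specific choices of rewards, dynamics, and observation models.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class SOEDProblem:
    """Illustrative container for the sOED formulation components."""
    N: int                         # number of experiments (finite horizon)
    design_spaces: Sequence        # feasible design sets D_k
    stage_reward: Callable         # g_k(x_k, y_k, d_k) -> float
    terminal_reward: Callable      # g_N(x_N) -> float
    dynamics: Callable             # F_k(x_k, y_k, d_k) -> x_{k+1}
    sample_observation: Callable   # draws y_k given (x_k, d_k)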
Following the same decision-theoretic approach used to develop the expected utility for batch OED in Section 2.1, we seek to maximize the expected total reward functional (while this quantity is the expected utility, we use the term "expected total reward" in sOED to parallel the definitions of stage and terminal rewards):

$$U(\pi) = \mathbb{E}_{y_0, \ldots, y_{N-1} | \pi} \left[ \sum_{k=0}^{N-1} g_k\left(x_k, y_k, \mu_k(x_k)\right) + g_N(x_N) \right], \qquad (3.1)$$

subject to the system dynamics x_{k+1} = F_k(x_k, y_k, d_k) for all experiments k = 0, . . . , N − 1. The optimal policy is then

$$\pi^* = \left\{\mu_0^*, \ldots, \mu_{N-1}^*\right\} = \operatorname*{argmax}_{\pi = \{\mu_0, \ldots, \mu_{N-1}\}} U(\pi), \qquad (3.2)$$

subject to the design space constraints µ_k(x_k) ∈ D_k, ∀x_k, for k = 0, . . . , N − 1. For simplicity, we also refer to Equation 3.2 as "the sOED problem" in this thesis. As shown later in Section 3.4, the commonly used batch (open-loop) and greedy (myopic) design approaches can be viewed as derivatives of this general formulation (and thus as suboptimal design methods).

[Figure: (a) batch (open-loop) design, where an optimizer (controller) issues designs d_0, . . . , d_{N−1} to experiments 0 through N − 1 without feedback of the observations y_0, . . . , y_{N−1}; (b) sequential (closed-loop) design, where a policy (controller) µ_k maps the state x_k to the design d_k, and the system dynamics x_{k+1} = F_k(x_k, y_k, d_k) feed the observations back into the state.] Figure 3-1: Batch design exhibits an open-loop behavior, where no feedback of information is involved, and the observations y_k from any experiment do not affect the design of any other experiments. Sequential design exhibits a closed-loop behavior, where feedback of information takes place, and the data y_k from an experiment can be used to guide the design of future experiments.

3.2 Dynamic programming form

The sOED problem involves the optimization of a functional of a set of policy functions. While this type of problem is studied in the field of calculus of variations, it is challenging to solve directly. Instead, we express the problem in an alternative form using Bellman's principle of optimality [13, 14]: with the argument that "the tail portion of an optimal policy is optimal for the tail subproblem", we can break it into a set of smaller subproblems. The resulting form is the well-known dynamic programming (DP) formulation (e.g., [22, 23]):

$$J_k(x_k) = \max_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k} \left[ g_k(x_k, y_k, d_k) + J_{k+1}\left(F_k(x_k, y_k, d_k)\right) \right], \qquad (3.3)$$
$$J_N(x_N) = g_N(x_N), \qquad (3.4)$$

for k = 0, . . . , N − 1. The J_k(x_k) functions are known as the "reward-to-go" or "value" functions (also referred to as "cost-to-go" or "cost" functions if g_k and g_N are defined as costs and the overall problem is to minimize the expected total cost); together they constitute Bellman's equation. The optimal policy functions are now implicitly represented by the arguments of the maximization expressions: if d_k^* = µ_k^*(x_k) maximizes the right side of Equation 3.3, then the policy π^* = {µ_0^*, µ_1^*, . . . , µ_{N−1}^*} is optimal. Each evaluation of the value function now involves a function optimization, which can be tackled more readily. Solving the DP problem has its own challenges, as its recursive structure of nested maximization and expectation leads to exponential growth in computation with respect to the horizon N. The growth is further amplified by the dimensions of the state, design, and observation spaces, leading to the "curse of dimensionality". Analytic solutions are rarely available except for some specific classes of problems. Most of the time, DP problems can only be solved numerically and approximately.
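As a purely illustrative instance, the sketch below performs the exact backward recursion of Equations 3.3 and 3.4 on a hypothetical finite discretization, where states, designs, and observations take only a few values; the randomly generated rewards, dynamics, and observation probabilities are placeholders. The sOED problems of interest are continuous and require the approximations developed in Chapters 4 and 5.

import numpy as np

# A minimal, hypothetical discretization: S states, D designs, O observations.
rng = np.random.default_rng(0)
S, D, O, N = 6, 3, 4, 3
p_obs = rng.dirichlet(np.ones(O), size=(S, D))   # p(y | x, d)
F = rng.integers(S, size=(S, O, D))              # x' = F_k(x, y, d)
g = 0.1 * rng.standard_normal((S, O, D))         # stage rewards g_k
gN = rng.standard_normal(S)                      # terminal reward g_N

J = gN.copy()                                    # J_N = g_N (Equation 3.4)
policy = np.zeros((N, S), dtype=int)
for k in reversed(range(N)):                     # Equation 3.3, backwards in k
    Q = np.zeros((S, D))
    for x in range(S):
        for d in range(D):
            # expectation over y of [ g_k + J_{k+1}(F_k(x, y, d)) ]
            Q[x, d] = np.sum(p_obs[x, d] * (g[x, :, d] + J[F[x, :, d]]))
    policy[k] = Q.argmax(axis=1)                 # mu*_k(x) for each state
    J = Q.max(axis=1)                            # J_k(x)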
A combination of various approximation techniques and numerical methods is required to solve the sOED problem in DP form, and we will describe them in detail in Chapters 4 and 5.

3.3 Information-based Bayesian experimental design

We now refine the sOED problem under the experimental goal of inferring uncertain model parameters θ from noisy and indirect observations y_k. With this specialization, we now choose the appropriate state variable and reward functions. We follow the Bayesian perspective described in Section 2.1 and generalize Bayes' rule for the sequential setting. If one performs the kth experiment under design d_k and observes a realization of the observations y_k, then the change in one's state of knowledge about the parameters is given by:

$$f(\theta | y_k, d_k, I_k) = \frac{f(y_k | \theta, d_k, I_k)\, f(\theta | d_k, I_k)}{f(y_k | d_k, I_k)} = \frac{f(y_k | \theta, d_k, I_k)\, f(\theta | I_k)}{f(y_k | d_k, I_k)}. \qquad (3.5)$$

Here, I_k = {d_0, y_0, . . . , d_{k−1}, y_{k−1}} is the information vector representing the history of the previous experiments, encompassing their designs and observations. Similar to Equation 2.1, we assume that knowing the design of the current (kth) experiment without knowing its observations does not affect our current belief about the parameters (i.e., the prior for the kth experiment would not change based on what experiment we plan to do); thus f(θ|d_k, I_k) = f(θ|I_k). In this Bayesian setting, a belief state that fully describes the state of uncertainty after k experiments is then the posterior. This can be any set of properties that fully describes the posterior, including the posterior random variable itself θ|y_k, d_k, I_k, its density function f(θ|y_k, d_k, I_k) or distribution function F(θ|y_k, d_k, I_k), other sufficient statistics, or even simply the prior along with the entire history of designs and observations from all previous experiments. For example, in the event where θ is a discrete random variable that can take on a finite number of possible realizations, methods from partially observable Markov decision processes (POMDPs) [154, 138] typically designate the belief state to be a finite-dimensional vector of possible θ realizations combined with their corresponding probability mass function values. Since we deal with continuous (and often unbounded) θ, the analogous perspective manifests as an infinite-dimensional belief state; we thus seek alternative approaches. In this chapter, for the purpose of illustration, we denote the belief state to be the posterior random variable, i.e., x_{k,b} = θ|I_k. In Chapters 5 and 7, the belief state will take on different meanings depending on the choice of its numerical representation; these choices will be made clear in context. Following the same information-theoretic approach and discussions from Section 2.1, it is natural to set the terminal reward as the Kullback-Leibler (KL) divergence from the final posterior, after all N experiments have been performed, to the prior, before any experiment is performed:

$$g_N(x_N) = D_{\mathrm{KL}}\!\left( f_{x_{N,b}}(x_{N,b}) \,\|\, f_{x_{0,b}}(x_{0,b}) \right) = \int_{\mathcal{H}} f(x_{N,b}) \ln \frac{f(x_{N,b})}{f(x_{0,b})} \, d\theta, \qquad (3.6)$$

where H is the support of the prior. The stage rewards then reflect all other immediate rewards or costs related to performing particular experiments, such as monetary, time, and personnel costs, or level of difficulty and risk.
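For illustration, a simple Monte Carlo estimate of this terminal reward might look as follows; the two log-density callables are assumptions, standing in for whatever posterior representation is available (in this thesis, ultimately the transport maps of Chapter 5), and are assumed to accept vectorized inputs.

import numpy as np

def kl_terminal_reward(post_samples, log_posterior, log_prior):
    """Monte Carlo estimate of the terminal reward g_N in Equation 3.6:
    D_KL(posterior || prior) is approximated by the average, over samples
    theta_s drawn from the final posterior, of
    ln f(theta_s | all data) - ln f(theta_s)."""
    theta = np.asarray(post_samples)   # draws from the final posterior
    return float(np.mean(log_posterior(theta) - log_prior(theta)))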
When the stage rewards are zero, we arrive at an expected total reward that is analogous to the expected utility developed for batch OED (Equation 2.2):

$$U(\pi) = \mathbb{E}_{y_0, \ldots, y_{N-1} | \pi} \left[ D_{\mathrm{KL}}\!\left( f_{\theta | d_0, y_0, \ldots, d_{N-1}, y_{N-1}}(x_{N,b}) \,\|\, f_{\theta}(x_{0,b}) \right) \right], \qquad (3.7)$$

subject to d_k = µ_k(x_k) and x_{k+1} = F_k(x_k, y_k, d_k) for k = 0, . . . , N − 1. Another intuitive alternative is to use the incremental information gain after each experiment is performed, by setting

$$g_k(x_k, y_k, d_k) = D_{\mathrm{KL}}\!\left( f_{x_{k+1,b}}(x_{k+1,b}) \,\|\, f_{x_{k,b}}(x_{k,b}) \right) = \int_{\mathcal{H}} f(x_{k+1,b}) \ln \frac{f(x_{k+1,b})}{f(x_{k,b})} \, d\theta$$

for k = 0, . . . , N − 1, where x_{k+1,b} is the belief state component of x_{k+1} = F_k(x_k, y_k, d_k). The expected total reward from this specification is not equivalent to Equation 3.7, since the reference distributions in the KL divergence terms are different. While intuitively this approach reflects, in some sense, the amount of information gained from all the experiments, one should use caution in the quantitative interpretation of its results, as it involves the addition of (and comparisons between) KL divergence terms with respect to different reference distributions (they are thus quantities expressed in different units). Additionally, such a formulation needs to evaluate the KL divergence after every experiment, an often approximate and expensive process that may degrade the overall computational performance. We therefore take the approach of Equation 3.6 in this thesis.

3.4 Notable suboptimal sequential design methods

Two notable design approaches frequently encountered in the OED literature are batch (open-loop) design (described in detail in Chapter 2) and greedy (myopic) design. Compared to the full sOED formulation derived in this chapter, batch and greedy design approaches are simpler to formulate and to solve. However, when applied to a sequential design problem, they are both special cases arising from simplifying the structure of the sOED problem, and thus are suboptimal. We discuss these two designs below for emphasis, but do not employ them in developing our numerical method for solving the sOED problem. (In this study, we take an approach that preserves the original problem as much as possible, and instead rely more heavily on techniques to approximately solve the exact problem.) These suboptimal designs, though, will be used as numerical comparisons in Chapter 7. Batch OED involves the design of all experiments concurrently as a batch, where the outcome of any experiment does not affect the design of the others. Mathematically, the policy functions µ_k for batch design do not depend on the states x_k, since no feedback is involved. Equation 3.2 thus reduces to a multidimensional vector space optimization problem¹

$$\left\{d_0^*, \ldots, d_{N-1}^*\right\} = \operatorname*{argmax}_{d_0, \ldots, d_{N-1}} \mathbb{E}_{y_0, \ldots, y_{N-1} | d_0, \ldots, d_{N-1}} \left[ \sum_{k=0}^{N-1} g_k(x_k, y_k, d_k) + g_N(x_N) \right], \qquad (3.8)$$

subject to the design space constraints d_k ∈ D_k, ∀k. More specifically, setting g_N to Equation 3.6, g_k = 0 for k = 0, . . . , N − 1, d = {d_k}_{k=0}^{N−1}, and y = {y_k}_{k=0}^{N−1}, we recover exactly the batch OED problem (Equations 2.2 and 2.4). Since batch OED involves applying stricter conditions to the sOED problem, it therefore yields suboptimal designs. Greedy design is a type of sequential (closed-loop) formulation where only the next experiment is considered, without taking into account other future consequences. Mathematically, the greedy policy is described by²

$$J_k(x_k) = \max_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k} \left[ g_k(x_k, y_k, d_k) \right], \qquad (3.9)$$
$$J_N(x_N) = g_N(x_N). \qquad (3.10)$$
If d_k^gr = µ_k^gr(x_k) maximizes the right side of Equation 3.9 for all k = 0, . . . , N − 1, then the policy π^gr = {µ_0^gr, µ_1^gr, . . . , µ_{N−1}^gr} is the greedy policy. The primary advantage of greedy design is that, by ignoring future effects, Bellman's equation becomes decoupled, and the exponential growth of computation with respect to the horizon N is avoided. It may also be a reasonable choice under circumstances where the total number of experiments is unknown. Nonetheless, since the formulation is a truncation of the DP form of the sOED problem (Equations 3.3 and 3.4), the greedy policy is also suboptimal.

¹ Batch OED generally cannot be expressed in the DP form since it does not abide by Bellman's principle of optimality: the truncated optimal batch design {d_i^*, . . . , d_{N−1}^*} is generally not the optimal batch design for the tail subproblem of designing experiments i to N − 1.

² A greedy design formulation would require an incremental information gain formulation (Equation 3.8) in order to properly reflect the value of information after each experiment is performed.

Chapter 4

Approximate Dynamic Programming for Sequential Design

The sequential optimal experimental design (sOED) problem, even expressed in the dynamic programming (DP) form (Equations 3.3 and 3.4 from Chapter 3), almost always needs to be solved numerically and approximately. This chapter describes the techniques we use to find an approximate solution to a DP problem under continuous spaces, and focuses on the optimality aspect of the approximate solution. For the most part, these techniques are applicable outside the sOED context as well. We then specifically discuss the representation of the belief state, and the performance of Bayesian inference for general non-Gaussian random variables, in the next chapter.

4.1 Approximation approaches

Approximate dynamic programming (ADP) broadly refers to numerical methods for finding an approximate solution to a DP problem. Substantial research has been devoted to developing these techniques across a number of different communities, targeting different variations of the DP expression. For example, the area of stochastic control in control theory usually deals with multidimensional continuous control variables [24, 22, 23], the study of Markov decision processes in operations research typically accommodates high-dimensional discrete decision vectors [138, 137], and the branch of reinforcement learning in machine learning often handles small, finite sets of discrete actions [93, 164]. While a plethora of different terminology is used across these fields, there is often a large overlap in the fundamental spirit of their solution approaches. We thus take a perspective that groups the various ADP techniques into the following two broad categories.

1. Problem approximation: where there is no natural way to refine the approximation, or where refinement does not lead to the solution of the original problem—these methods typically lead to suboptimal designs. Examples: batch and greedy designs, open-loop feedback control, certainty equivalent control, Gaussian approximation of distributions.

2. Solution approximation: where there is some natural way to refine the approximation, and the effects of approximation diminish with refinement—these methods have some sense of convergence, and may be refined towards the solution of the original problem.
Examples: policy iteration, value function and Q-factor approximations, numerical optimization, Monte Carlo sampling, regression, quadrature and numerical integration, discretization and aggregation, rolling horizon.

In practice, techniques from both categories are often combined to find an approximate solution to a DP problem. In this thesis, however, we take an approach that tries to preserve the original problem as much as possible, and rely more heavily on solution approximation techniques to approximately solve the exact problem. Keeping in line with this philosophy, we proceed to build our ADP method around a backbone of one-step lookahead policy representation, and approximate value iteration via backward induction and regression construction of approximate value functions.

4.2 Policy representation

In seeking the optimal policy, we first need to be able to represent a (generally suboptimal) policy π = {µ_0, µ_1, . . . , µ_{N−1}}. On the one hand, one may represent a policy function µ_k(x_k) directly (and approximately), for example by tabulating its values on a discretized grid of x_k or by using functional approximation techniques. On the other hand, one can preserve the recursive relationship in Bellman's equation and "parameterize" the policy via value functions. We proceed with a policy representation using one step of lookahead, to retain some of the structural properties of the original DP problem while keeping the method computationally feasible. By looking ahead only one step, the recursion between the value functions is broken, and the exponential growth of computational cost with respect to the horizon N is reduced to linear growth.¹ This leads to the one-step lookahead policy representation (e.g., [22]):

$$\mu_k(x_k) = \operatorname*{argmax}_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k} \left[ g_k(x_k, y_k, d_k) + \tilde{J}_{k+1}\left(F_k(x_k, y_k, d_k)\right) \right] \qquad (4.1)$$

for k = 0, . . . , N − 1, with J̃_N(x_N) ≡ g_N(x_N). The policy function µ_k is therefore indirectly represented via some value function J̃_{k+1}, and one can view the policy π as implicitly parameterized by the set of value functions J̃_1, . . . , J̃_N.² If J̃_{k+1}(x_{k+1}) = J_{k+1}(x_{k+1}), we recover Bellman's equation (Equation 3.3) and µ_k = µ_k^*; we would therefore like to find J̃_{k+1}'s that are in some sense close to the J_{k+1}'s. Before we describe how to construct good value function approximations in the next section, we first describe how to numerically represent these approximations. We employ a simple parametric linear-architecture function approximator:

$$\tilde{J}_k(x_k) = r_k^\top \phi_k(x_k) = \sum_{i=1}^{m} r_{k,i} \, \phi_{k,i}(x_k), \qquad (4.2)$$

where r_{k,i} is the coefficient (weight) corresponding to the ith feature (basis function) φ_{k,i}(x_k). While more sophisticated nonlinear, or even nonparametric, approximators are possible (e.g., k-nearest-neighbors [78], kernel regression [129], neural networks [24]), a linear approximator is easy to use and intuitive to understand [103], and is often required for many algorithmic analyses and convergence results [23]. It follows that the construction of J̃_k(x_k) involves the selection of features and the training of the coefficients. The choice of features is an important, but difficult, task. A concise set of features that is relevant in reflecting the function values from data points can substantially improve the accuracy and efficiency of the function approximators, and in turn, of the overall algorithm.

¹ Multi-step lookahead is possible in theory, but impractical, as the amount of online computation would be tremendous under continuous spaces.

² A similar method is the use of Q-factors [175, 176]: µ_k(x_k) = argmax_{d_k ∈ D_k} Q̃_k(x_k, d_k), where the Q-factor corresponding to the optimal policy is Q_k(x_k, d_k) ≡ E_{y_k | x_k, d_k}[g_k(x_k, y_k, d_k) + J_{k+1}(F_k(x_k, y_k, d_k))]. The functions Q̃_k(x_k, d_k) have a higher input dimension than J̃_k(x_k), but once they are available, the corresponding policy can be evaluated without the system dynamics F_k; this is thus known as a "model-free" method. Q-learning via value iteration is a prominent method in reinforcement learning.
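A minimal sketch of applying this one-step lookahead policy with the linear architecture of Equation 4.2 follows. For brevity, the continuous maximization is replaced here by a search over a hypothetical finite candidate set (the thesis instead uses the stochastic approximation methods of Section 2.2), and sample_y, g, F, and phi_next are placeholder callables.

import numpy as np

def lookahead_design(x, r_next, phi_next, candidates, sample_y, g, F, n_mc=100):
    """One-step lookahead policy (Equation 4.1) with the linear value
    approximation J~_{k+1}(x) = r^T phi(x) of Equation 4.2. The expectation
    over y is replaced by a Monte Carlo average."""
    rng = np.random.default_rng()
    def objective(d):
        ys = (sample_y(x, d, rng) for _ in range(n_mc))
        return np.mean([g(x, y, d) + r_next @ phi_next(F(x, y, d)) for y in ys])
    return max(candidates, key=objective)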
Identifying helpful features, however, is non-trivial. Substantial research has been dedicated to developing systematic procedures for both extracting and selecting features in the machine learning and statistics communities [83, 107], but in practice, finding good features often relies on experience, trial-and-error, and expert knowledge of the particular problem at hand. We acknowledge the difficulty of this process, but do not further pursue detailed discussions of general and systematic feature formation. Instead, we take a reasonable heuristic step in the sOED context, and choose features based on the mean and log-variance of the belief state, along with the physical state component. The main motivation for this choice stems from the KL divergence term in the terminal reward, which is chiefly responsible for reflecting information gain. While the belief states are generally not Gaussian, the analytic formula for the KL divergence between two Gaussian random variables, which involves their mean and log-variance terms, provides a starting point for promising features. We will specify the feature choices in more detail in Chapter 7. For the present purpose of developing our ADP method in this chapter, we assume the features are set. We now focus on developing efficient procedures for training the coefficients.

4.3 Policy construction via approximate value iteration

4.3.1 Backward induction and regression

Our goal is to find policy parameterizations (value function approximations) J̃_k that are close to the optimal-policy value functions J_k satisfying Equation 3.3. We take a direct approach, and would like to solve the following ideal regression problem that minimizes the least squares error of the approximation under the optimal-policy-induced state measure (also known as the D-norm in other works; its density function is denoted by f_{π*}(x_1, . . . , x_{N−1})):

$$\min_{r_k, \forall k} \int_{\mathcal{X}_1, \ldots, \mathcal{X}_{N-1}} \left[ \sum_{k=1}^{N-1} \left( J_k(x_k) - r_k^\top \phi_k(x_k) \right)^2 \right] f_{\pi^*}(x_1, \ldots, x_{N-1}) \, dx_1 \cdots dx_{N-1}, \qquad (4.3)$$

where X_k is the support of x_k, and we impose the linear architecture J̃_k(x_k) = r_k^⊤ φ_k(x_k). The distribution of regression points reflects where emphasis is placed for the approximation to be more accurate. Intuitively, we would ultimately like more accurate approximations in regions of the state space that are more likely or frequently visited under the optimal policy. More precisely, we would like to use the state measure induced jointly by the optimal policy and by the associated numerical methods. For example, the choice of stochastic optimization algorithm, as well as its settings (discussed in Section 2.2), affects which intermediate states are more frequently visited during the optimization process. The accuracy at the intermediate states can be crucial, since they can potentially mislead the optimization algorithm to arrive at completely different designs, and in turn change the regression and policy evaluation outcomes.
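The coefficient training itself reduces to ordinary least squares once regression points and targets are available, as in the following sketch; the function name and arguments are illustrative, not part of the thesis's implementation.

import numpy as np

def fit_value_coefficients(states, targets, phi):
    """Least-squares solution of the stage-k regression problem (Eq. 4.5):
    fit r_k so that r_k^T phi(x) matches the targets J^_k(x) at the given
    regression points, which stand in for draws from the policy-induced
    state measure. 'phi' maps a state to its feature vector (Eq. 4.2)."""
    Phi = np.vstack([phi(x) for x in states])            # K x m feature matrix
    r_k, *_ = np.linalg.lstsq(Phi, np.asarray(targets), rcond=None)
    return r_k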
It would then be prudent to include the states visited during the optimization procedure as regression points as well. In Chapter 7, we will demonstrate the importance of including these states through illustrative numerical examples. For the rest of this thesis, we shall use "policy-induced state measure" to include the effects of the associated numerical methods as well. As we do not have the optimal policy, J_k(x_k), or f_{π*}(x_1, . . . , x_{N−1}), we must solve Equation 4.3 approximately. To sidestep the need for the optimal value functions J_k(x_k) in the ideal regression problem (Equation 4.3), we proceed to construct the approximate functions by approximate value iteration, specifically using backward induction and regression. The resulting J̃_k's will then be used as the parameterization of the one-step lookahead policy in Equation 4.1. Starting with J̃_N(x_N) ≡ g_N(x_N), we proceed backwards from k = N − 1 to k = 1 and form

$$\tilde{J}_k(x_k) = r_k^\top \phi_k(x_k) = \Pi \left[ \max_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k | x_k, d_k} \left[ g_k(x_k, y_k, d_k) + \tilde{J}_{k+1}\left(F_k(x_k, y_k, d_k)\right) \right] \right] = \Pi \, \hat{J}_k(x_k), \qquad (4.4)$$

where Π is an approximation operator that can be, for example, regression. This leads to a set of ideal stage-k regression problems

$$\min_{r_k} \int_{\mathcal{X}_k} \left( \hat{J}_k(x_k) - r_k^\top \phi_k(x_k) \right)^2 f_{\pi^*}(x_k) \, dx_k, \qquad (4.5)$$

with Ĵ_k(x_k) ≡ max_{d_k ∈ D_k} E_{y_k | x_k, d_k}[g_k(x_k, y_k, d_k) + J̃_{k+1}(F_k(x_k, y_k, d_k))], f_{π*}(x_k) being the marginal of f_{π*}(x_1, . . . , x_{N−1}), and J̃_k(x_k) = r_k^⊤ φ_k(x_k). While we no longer need the optimal value functions J_k(x_k) to construct J̃_k(x_k), we remain unable to select regression points according to f_{π*}(x_k); we discuss this issue in the next subsection. Furthermore, since J̃_k(x_k) is built from J̃_{k+1}(x_{k+1}) through the backward induction process, the effects of numerical approximation error accumulate, potentially at an exponential rate [168]. The accuracy of all the J̃_k(x_k) approximations (i.e., for all k) is thus extremely important.

4.3.2 Exploration and exploitation

Although we cannot generate regression points distributed exactly according to the optimal-policy-induced state measure, it is possible to generate them according to a given (suboptimal) policy. This includes heuristic policies, and the current approximation to the optimal policy in the algorithm (we shall refer to the latter as the "current policy" throughout this section). In particular, we proceed to generate regression points using a combination of exploration and exploitation. Exploration is conducted by randomly selecting designs (i.e., a random heuristic policy). For example, if the feasible design space is bounded, this can be performed by uniform sampling; when it is unbounded, however, a designated exploration design measure needs to be prescribed, often selected from experience and understanding of the problem. The purpose of exploration is to allow a positive probability of probing regions that can potentially lead to better rewards than those reached through the current policy. Exploitation is conducted by applying the current policy, in this case by exercising the one-step lookahead policy using the parameterizing value functions J̃_k. The purpose of exploitation is to take advantage of the current understanding of a good policy. When states visited by exploitation are used as regression points, they increase the weight placed on accuracy in regions of the state space that would be reached and visited frequently via this policy.
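Putting the pieces together, a skeletal version of the backward induction might read as below, reusing fit_value_coefficients from the sketch above; lookahead_max is a placeholder for the inner maximization of Equation 4.4, which in practice is a stochastic optimization.

def approximate_value_iteration(N, phi, regression_states, lookahead_max):
    """Backward induction with regression (Equation 4.4).
    lookahead_max(k, x, r_next) is assumed to return an estimate of
    max_d E_y[ g_k(x, y, d) + J~_{k+1}(F_k(x, y, d)) ],
    where r_next = None signals that J~_N = g_N should be used directly."""
    coeffs = {N: None}
    for k in range(N - 1, 0, -1):                  # k = N-1, ..., 1
        xs = regression_states[k]                  # exploration/exploitation points
        targets = [lookahead_max(k, x, coeffs[k + 1]) for x in xs]  # J^_k(x_k)
        coeffs[k] = fit_value_coefficients(xs, targets, phi)        # r_k
    return coeffs                                  # parameterizes the policy (Eq. 4.1)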
In practice, a balance of exploration and exploitation is used to achieve good results, though an infusion of exploration (or other heuristics) generally invalidates theoretical algorithm analyses and convergence results [23, 137]. In our algorithm, the states visited by both exploration and exploitation trajectories are used as regression points for the least squares problem in Equation 4.5.

4.3.3 Iterative update of state measure and policy approximation

A dilemma that emerges from generating regression samples via exploitation is a "chicken or the egg" problem: exploitation requires the availability of a current policy from the algorithm, and the construction of such a policy (one that is not a heuristic policy) requires regression samples. We address this issue by introducing an iterative approach to update the state measure used for generating the regression samples. At a high level, we achieve this by alternating between generating regression points from exploitation of the current policy using Equation 4.1, and constructing an approximate optimal policy by solving the regression problem of Equation 4.5. Here is a concrete description of the procedure. The algorithm starts with only an exploration heuristic, denoted by π^explore. States from exploration trajectories generated from π^explore are used as regression points to approximately solve Equation 4.5, producing J̃_k^1's that parameterize π^1. π^1 is then used to generate exploitation trajectories via Equation 4.1, and, together with a mixture of exploration states from π^explore, the overall set of states is used as regression points to again solve Equation 4.5, giving us J̃_k^2's that parameterize π^2. The process is repeated, and one would expect the regression points to be distributed closer and closer to the optimal-policy-induced state measure. Additionally, the biggest change is expected to occur when the first exploitation policy π^1 becomes available, with smaller changes in subsequent iterations. A rigorous proof of convergence of this iterative procedure is difficult, given the infusion of exploration and the generally unpredictable state measure induced by the numerical methods and settings; we therefore begin with numerical investigations of this procedure in this thesis, and will develop formal proofs in the future. Combining the stage-k regression problems from all stages (Equation 4.5), the overall regression problem being solved approximates the ideal regression problem of Equation 4.3:

$$\min_{r_k, \forall k} \int_{\mathcal{X}_1, \ldots, \mathcal{X}_{N-1}} \left[ \sum_{k=1}^{N-1} \left( \hat{J}_k^{\ell+1}(x_k) - r_k^\top \phi_k(x_k) \right)^2 \right] f_{\pi^{\mathrm{explore}} + \pi^\ell}(x_1, \ldots, x_{N-1}) \, dx_1 \cdots dx_{N-1}, \qquad (4.6)$$

where f_{π^explore + π^ℓ}(x_1, . . . , x_{N−1}) is the joint density corresponding to the mixture of exploration and exploitation from the ℓth iteration, and the approximation r_k^⊤ φ_k(x_k) at iteration ℓ is denoted J̃_k^ℓ(x_k). Note that f_{π^explore + π^ℓ}(x_1, . . . , x_{N−1}) lags one iteration behind Ĵ_k^{ℓ+1}(x_k), since we need to have constructed the policy before we can sample trajectories from it. Simulating exploitation trajectories, applying policies, and evaluating the regression system all involve finding the maximum of an expected value over a continuous design space (Equations 4.1 and 4.4). While the expected value generally cannot be found analytically, a robust and natural approximation may be obtained via a Monte Carlo estimate. As a result, the optimization objective is effectively noisy.
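A compressed sketch of this alternation is given below; the three callables are placeholders for problem-specific exploration, exploitation, and regression routines.

def iterate_policy(n_iters, simulate_explore, simulate_exploit, build_policy):
    """Iterative update of the regression-point measure (Section 4.3.3).
    simulate_explore() returns states visited under the exploration heuristic
    pi^explore; simulate_exploit(policy) returns states visited by exercising
    the current one-step lookahead policy; build_policy(states) performs
    backward induction and regression (Equations 4.4-4.6)."""
    policy = None
    for _ in range(n_iters):
        states = list(simulate_explore())
        if policy is not None:                 # exploitation available after iter 1
            states += list(simulate_exploit(policy))
        policy = build_policy(states)          # J~_k^l parameterizing pi^l
    return policy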
Following the developments of Section 2.2, we employ the Robbins-Monro (or, when the gradient is not analytically available, Kiefer-Wolfowitz) stochastic approximation algorithm for this stochastic optimization.

4.4 Connection to the rollout algorithm (policy iteration)

While the one-step lookahead policy representation described in Equation 4.1 has a form similar to the one-step lookahead rollout algorithm [167, 22],³ our implementation of approximate value iteration is different from rollout. A typical rollout algorithm involves three main steps: 1. policy initialization: choose a base (heuristic) policy; 2. policy evaluation: compute the corresponding value functions of this policy; and 3. policy improvement: apply these value functions in the one-step lookahead formula (Equation 4.1) to obtain a new policy that is guaranteed to be no worse than the previous policy [22]. Policy iteration simply repeats steps 2 and 3, and is more frequently used in infinite-horizon settings. Our approach differs from rollout in that the J̃_k(x_k) are not necessarily value functions corresponding to any policy. Instead, we perform backward induction to construct J̃_k(x_k) that directly approximate the value functions of the optimal policy (which is the key property of an approximate value iteration method). One-step lookahead is simply the means of applying the policy parameterized by the J̃_k's. A rollout implementation would include the construction of base policy value functions J_{π^base,k}(x_k). These can either be approximated pointwise as needed, in an online manner, using Monte Carlo simulation of trajectories from x_k, or approximated offline using function approximation techniques such as Equation 4.2 combined with regression. The former involves fewer sources of approximation, but is computationally expensive. The latter is similar in spirit to the procedure introduced in the previous section, and is furthermore computationally cheaper. This is because base policies are typically described in forms much simpler than the one-step lookahead representation, and producing values of d_k from them normally does not require the maximization operation. Nonetheless, we perform the full backward induction in Equation 4.4, as the additional maximization is affordable under our current computational setup, and its inclusion can offer an advantage in leading to an overall better policy. This can be seen from the fact that the value function approximations produced from backward induction would generally be a better starting point heading into one-step lookahead, compared to the value functions of a (perhaps arbitrarily chosen) base policy. An interesting future investigation would be to compare the computational performance of the direct backward induction approach with multiple iterations of rollout (i.e., approximate policy iteration).

³ Multi-step lookahead rollout algorithms are also possible. Similar to the discussion in Section 4.2, the tremendous amount of online computation under continuous spaces makes them impractical.

4.5 Connection to POMDP

A partially observable Markov decision process (POMDP) is a generalization of the Markov decision process (MDP) in which the underlying state cannot be directly observed [154, 138]; as such, a probability distribution over the state is maintained. (In POMDP vernacular, the "partially observed state" is simply the parameters θ in our optimal experimental design (OED) terminology; we may use the terms interchangeably in this section.)
While a general continuous version of the POMDP framework can be used to describe the sOED problem introduced in Chapter 3, traditional MDP and POMDP research largely focuses on problems with discrete (often finitely discrete) spaces and random variables. Nonetheless, we might expect existing POMDP algorithms to be suitable and insightful for handling discretized versions of the sOED problem. There are two major limitations to the majority of state-of-the-art POMDP algorithms when applied to the sOED problem. First, these algorithms are often designed to handle only a handful of possible values of the design and state variables, while even a simple discretization of the type of sOED problems considered in this thesis would lead to an extremely large number of discretized values. Second, most POMDP algorithms are based on, and exploit, the property that the problem's cost functions are piecewise linear and convex (when minimizing) with respect to the belief state [159, 157]; some examples of such algorithms include the witness algorithm [92], point-based value iteration [134], SARSOP [101], and Perseus [136]. (In problems with discrete partially observed states, the belief state is simply the vector of probability mass function values, which in itself is a full and finite-dimensional representation of the uncertainty. The piecewise linear and convex property then arises naturally for many problems in the field of operations research, where a specific value of cost or reward is usually assigned to each possible realization of the partially observed state. The expected cost or reward then becomes a linear combination of these values weighted by the probability masses.) However, we show that these algorithms would not be suitable for solving even the discretized version of a one-experiment OED problem that employs an information measure objective (i.e., information-based Bayesian experimental design). This is because such an objective (at least a practical one) necessarily leads to value functions that are not linear. By an induction argument, the value functions in a multi-experiment sOED problem would also generally not have the piecewise linear and convex (concave) property. We demonstrate the second limitation under a one-experiment setting with an n-state (finitely) discrete random variable θ (recall that θ is the partially observed state in the POMDP vernacular). We start with a rigorous definition for a measure of information in experimental design.

Definition 4.5.1. (Ginebra [75], with notation adaptations) A measure of the information about θ in an experiment d assigns a value U(d) such that 1. U(d) is a real number, 2. U(d_tni) = 0, and 3. whenever d_A and d_B are such that d_A is "sufficient for" d_B, then U(d_A) ≥ U(d_B).

The notation U(d) corresponds to the expected utility from the batch OED problem in Chapter 2, or the expected total reward (but with a fixed prior and N = 1) from the sOED problem in Chapter 3; in this one-experiment setting, it is also the sole value function. d_tni is the "totally non-informative" experiment, where one cannot learn about θ by observing the outcomes of d_tni. In the Bayesian setting, this is when the posterior remains the same as the prior.
In contrast, a "totally informative" experiment d_ti is one where for every pair (θ_i, θ_j), θ_i ≠ θ_j (where θ_i is the ith realization of all possible values θ can take on), the intersection of the support sets of their likelihoods, Y_i = supp(f(y|θ_i, d_ti)) and Y_j = supp(f(y|θ_j, d_ti)), is empty; the likelihoods thus form a family of mutually singular distributions. After performing d_ti, the value of θ can be determined with certainty, hence "totally informative". As a consequence of the requirements in Definition 4.5.1, 0 = U(d_tni) ≤ U(d) ≤ U(d_ti). The third requirement in Definition 4.5.1 needs the definition of "sufficient for".

Definition 4.5.2. (Originally Blackwell [25, 26], then Ginebra [75], with notation adaptations) Experiment d_A is said to be "sufficient for" d_B if there exists a stochastic transformation of y|d_A to a random variable w(y|d_A) such that w(y|d_A) and y|d_B have identical distributions under each θ.

The following proposition based on Definition 4.5.2 was first proven by Blackwell [25, 26], Sherman [153], and Stein [162], and then generalized and stated in the Bayesian setting by Ginebra [75].

Proposition 4.5.1. (Ginebra [75], with notation adaptations) Experiment d_A is "sufficient for" d_B if and only if, for a given strictly positive prior distribution p(θ),

$$\mathbb{E}_{y|d_A}\!\left[ \phi\!\left( p(\theta | y, d_A) \right) \right] \geq \mathbb{E}_{y|d_B}\!\left[ \phi\!\left( p(\theta | y, d_B) \right) \right] \qquad (4.7)$$

for every convex function φ(·) on the simplex of R^n, where p(θ|y, d_A) and p(θ|y, d_B) are the posterior distributions under the same prior p(θ).

We use the notation p(·) to denote probability mass functions of discrete random variables. We now propose the following.

Theorem 4.5.1. When a measure of the information U about an n-state random variable θ in a single experiment d is in linear form with respect to its probability mass function,

$$U(d) = \mathbb{E}_{y|d}\left[ \sum_{i=1}^{n} \alpha_i \, p(\theta_i | y, d) \right], \qquad (4.8)$$

the measure of information is constant (zero) for all experiments and therefore is not useful. Here α_i ∈ R and p(θ_i|y, d) is the posterior probability mass at θ = θ_i.

Proof. We start by establishing conditions under which Equation 4.8 is a valid measure of information, i.e., satisfies the requirements in Definition 4.5.1. Requirement 1 is satisfied by Equation 4.8 by definition. To meet requirement 2, the coefficients must satisfy

$$U(d_{\mathrm{tni}}) = \mathbb{E}_{y|d}\left[ \sum_{i=1}^{n} \alpha_i \, p(\theta_i | d_{\mathrm{tni}}, y) \right] = \mathbb{E}_{y|d}\left[ \sum_{i=1}^{n} \alpha_i \, p(\theta_i) \right] = \sum_{i=1}^{n} \alpha_i \, p(\theta_i) = 0, \qquad (4.9)$$

where the posterior remains unchanged from the prior by definition of d_tni. To meet requirement 3, we require that whenever Proposition 4.5.1 is satisfied, then U(d_A) ≥ U(d_B). This is satisfied by Equation 4.8, since it is a specialization where φ is linear and hence convex; thus if Proposition 4.5.1 is satisfied, then indeed our choice of U(d) satisfies U(d_A) ≥ U(d_B) by construction. Under these conditions, Equation 4.8 is a valid measure of information. We now show that U(d) = 0 for all d, and thus that it is not a practically useful measure of information. Consider the totally informative experiment, which is an experiment (regardless of whether it can be physically achieved in practice) that can deterministically pinpoint the value of θ. In other words, the posterior would be a Kronecker delta function. The information value of the totally informative experiment thus provides the theoretically highest achievable U.
Its information value is

$$U(d_{\mathrm{ti}}) = \mathbb{E}_{y|d}\left[ \sum_{i=1}^{n} \alpha_i \, p(\theta_i | d_{\mathrm{ti}}, y) \right]
= \int_{\mathcal{Y}} \sum_{i=1}^{n} \alpha_i \, p(\theta_i | d_{\mathrm{ti}}, y) \, f(y | d_{\mathrm{ti}}) \, dy
= \sum_{j=1}^{n} \int_{\mathcal{Y}_j} \sum_{i=1}^{n} \alpha_i \, p(\theta_i | d_{\mathrm{ti}}, y) \, f(y | d_{\mathrm{ti}}) \, dy
= \sum_{j=1}^{n} \int_{\mathcal{Y}_j} \alpha_j \, f(y | d_{\mathrm{ti}}) \, dy
= \sum_{j=1}^{n} \int_{\mathcal{Y}_j} \alpha_j \sum_{m=1}^{n} f(y | \theta_m, d_{\mathrm{ti}}) \, p(\theta_m) \, dy
= \sum_{j=1}^{n} \int_{\mathcal{Y}_j} \alpha_j \, f(y | \theta_j, d_{\mathrm{ti}}) \, p(\theta_j) \, dy
= \sum_{j=1}^{n} \alpha_j \, p(\theta_j) \int_{\mathcal{Y}_j} f(y | \theta_j, d_{\mathrm{ti}}) \, dy
= \sum_{j=1}^{n} \alpha_j \, p(\theta_j) = 0. \qquad (4.10)$$

The second equality is from the definition of expectation (here we assume the observation space Y is continuous, but the same result applies in discrete cases). The third equality breaks the integral over Y into the disjoint sets Y_j = supp(f(y|θ_j, d_ti)), j = 1, . . . , n. The fourth equality is due to p(θ_i|d_ti, y) = δ_{i,j} for all y ∈ Y_j (any y ∈ Y_j leads to a delta posterior at θ_j). The fifth and sixth equalities apply the definition of conditional probability and use the fact that for all y ∈ Y_j, f(y|θ_m, d_ti) = 0 for all θ_m ≠ θ_j (due to the disjoint likelihood functions under d_ti). The next line simply rearranges. The eighth equality again uses the property of disjoint likelihood functions. The last equality is due to Equation 4.9. Both the totally non-informative and totally informative experiments thus yield information values of zero under the linear form of information measure in Equation 4.8: 0 = U(d_tni) ≤ U(d) ≤ U(d_ti) = 0, and therefore U(d) = 0 for all d. Hence, the linear form is not a practically useful measure of information. As a result, a practically useful measure of information necessarily has a nonlinear form with respect to the belief state in problems with discrete parameters.

Chapter 5

Transport Maps for Sequential Design

Another important challenge of the sequential optimal experimental design (sOED) problem lies in representing the belief state x_{k,b} and performing Bayesian inference (the part of the system dynamics F_k that propagates the belief state). In particular, we seek to accommodate nonlinear forward models involving non-Gaussian continuous random variables. Following the discussion in Chapter 3, a Bayesian perspective suggests that a belief state comprehensively describing the uncertain environment is simply the posterior. However, representing a general continuous posterior random variable in a finite-dimensional manner is difficult. We propose to represent the belief state using transport maps. As we demonstrate in this chapter, transport maps are especially attractive compared to traditional alternatives: they can be constructed directly from samples without requiring model knowledge, and the optimization problem in the construction process is dimensionally separable and convex. Furthermore, by constructing joint maps, they enable Bayesian inference to be performed very quickly, albeit approximately, by conditioning on different realizations of the design and observations. We start the chapter with some general background on transport maps in Section 5.1, demonstrate how they can be used for Bayesian inference in Section 5.2, and provide the details of how to construct maps from samples in Section 5.3. The connection between the quality of joint and conditional maps is discussed in Section 5.4, which is important for justifying the construction of accurate posterior maps. Finally, the particular implementation and use of maps in the sOED problem is presented in Section 5.5. We note that Sections 5.1 and 5.3 contain material drawn heavily from the work of Parno and Marzouk [131, 132].
5.1 Background

Consider two Borel probability measures on R^n, µ_z and µ_ξ. We will refer to these as the target and reference measures, respectively, and associate them with random variables z ∼ µ_z and ξ ∼ µ_ξ. A transport map T : R^n → R^n is a deterministic transformation that pushes forward µ_z to µ_ξ, yielding

$$\mu_\xi = T_\sharp \, \mu_z. \qquad (5.1)$$

In other words, µ_ξ(A) = µ_z(T^{−1}(A)) for any Borel set A ⊆ R^n. In terms of the random variables, we write ξ =^{i.d.} T(z), where =^{i.d.} denotes equality in distribution. The transport map is equivalently a deterministic coupling of probability measures [171]. For example, Figure 5-1 illustrates a log-normal random variable z mapped to a standard Gaussian random variable ξ via ξ =^{i.d.} T(z) = ln(z).

[Figure: the densities f(z) and f(ξ) connected by the map T(z).] Figure 5-1: A log-normal random variable z can be mapped to a standard Gaussian random variable ξ via ξ =^{i.d.} T(z) = ln(z).

Of course, there can be infinitely many transport maps between two probability measures. On the other hand, it is possible that no transport map exists: consider the case where µ_z has an atom but µ_ξ does not. If a transport map exists, one way of regularizing the problem and finding a unique map is to introduce a cost function c(z, ξ) on R^n × R^n that represents the work needed to move one unit of mass from z to ξ. Using this cost function, the total cost of pushing µ_z to µ_ξ is

$$C(T) = \int_{\mathbb{R}^n} c\left(z, T(z)\right) \, d\mu_z(z). \qquad (5.2)$$

Minimization of this cost subject to the constraint µ_ξ = T_♯ µ_z is called the Monge problem [116]. A transport map satisfying the measure constraint in Equation 5.1 and minimizing the cost in Equation 5.2 is an optimal transport map. The celebrated result of [35], later generalized by [115], shows that this map exists, is unique, and is monotone µ_z-a.e. when µ_z is atomless and the cost function c(z, ξ) is quadratic. Generalizations of this result to other cost functions and spaces have been established in [44, 5, 68, 20]. The choice of cost function in Equation 5.2 naturally influences the structure of the map. For illustration, consider the Gaussian case of z ∼ N(0, I) and ξ ∼ N(0, Σ) for some positive definite covariance matrix Σ. The associated transport map is linear: ξ =^{i.d.} T(z) = Sz, where the matrix S is any square root of Σ. When the transport cost is quadratic, c(z, ξ) = |z − ξ|², S is the symmetric square root obtained from the eigendecomposition of Σ: Σ = V Λ V^⊤ and S = V Λ^{1/2} V^⊤ [128]. If the cost is instead taken to be the following weighted quadratic,

$$c(z, \xi) = \sum_{i=1}^{n} t^{i-1} |z_i - \xi_i|^2, \qquad t > 0, \qquad (5.3)$$

then as t → 0 the optimal map becomes lower triangular and equal to the Cholesky factor of Σ. Generalizing to non-Gaussian µ_z and µ_ξ, as t → 0 the optimal maps T_t obtained with the cost function in Equation 5.3 are shown by [40] and [27] to converge to the Knothe-Rosenblatt (KR) rearrangement [143, 98] between probability measures. The KR map exists and is uniquely defined if µ_z is absolutely continuous with respect to Lebesgue measure. It is defined by, and typically constructed via, an iterative procedure that involves evaluating and inverting a series of marginalized conditional cumulative distribution functions. As a result, it inherits several useful properties: the Jacobian matrix of T is lower triangular and has positive diagonal entries µ_z-a.e. (i.e., the map is monotone). Because of this triangular structure, the Jacobian determinant and the inverse of the map are easy to evaluate. This is an important computational advantage that we exploit.
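The Gaussian illustration above is easy to verify numerically: with a lower-triangular Cholesky factor, the linear map pushes N(0, I) samples to N(0, Σ). The covariance Σ below is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])              # reference covariance (illustrative)
S = np.linalg.cholesky(Sigma)               # lower-triangular square root of Sigma
z = rng.standard_normal((100_000, 2))       # target samples z ~ N(0, I)
xi = z @ S.T                                # xi = S z: the t -> 0 (KR) limit map
print(np.cov(xi, rowvar=False))             # approximately Sigma, as expected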
We will employ KR maps (lower triangular and monotone), but without directly appealing to the transport cost in Equation 5.3. While this cost is meaningful for theoretical analysis and even numerical continuation schemes [40], we find that for small t, the sequence of weights {t^i} quickly produces numerical underflow as the parameter dimension n increases. Instead, we will directly impose the lower triangular structure and search for a map T̃ that approximately satisfies the measure constraint, i.e., for which µ_ξ ≈ T̃_♯ µ_z. This approach is a key difference between our construction and classical optimal transportation. Numerical challenges with Equation 5.3 are not the only reason to seek approximate maps. Suppose that the target measure µ_z is a Bayesian posterior or some other intractable distribution, but let the reference µ_ξ be something simpler, e.g., a Gaussian distribution with identity covariance. In this case, the complex structure of µ_z is captured by the map T. Sampling and other tasks can then be performed with the simple reference distribution instead of the more complicated target distribution. In particular, if a map exactly satisfying Equation 5.1 were available, sampling the target distribution µ_z would simply require drawing a sample ξ* ∼ µ_ξ and pushing it to the target space with θ* = T^{−1}(ξ*). This concept was employed by [64] for posterior sampling. Depending on the structure of the reference and the target, however, finding an exact map may be computationally challenging. In particular, if the target contains many nonlinear dependencies that are not present in the reference distribution, the representation of the map T (e.g., in some canonical basis) can become quite complex. Hence, it is desirable to work with approximations to T.

5.2 Bayesian inference using transport maps

Transport maps can be viewed as a representation of random variables. We may then choose to perform Bayesian inference via transport maps, instead of via the probability density functions in the traditional form of Bayes' theorem in Equation 2.1. To illustrate this, we employ KR maps, which are lower triangular and monotone, and for which the reference measure is standard Gaussian. Adopting the same notation as Equation 2.1 for this section, where θ is the parameters, y the observations, and d the experimental design, the KR map from the target joint random vector in the order (d, θ, y) is

$$\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} = \begin{bmatrix} T_d(d) \\ T_{\theta|d}(d, \theta) \\ T_{y|\theta,d}(d, \theta, y) \end{bmatrix} = \begin{bmatrix} \Phi^{-1}\left(F(d)\right) \\ \Phi^{-1}\left(F(\theta|d)\right) \\ \Phi^{-1}\left(F(y|\theta, d)\right) \end{bmatrix}, \qquad (5.4)$$

where η_1, η_2, η_3 are i.i.d. standard Gaussians, and the subscript on each map component denotes the corresponding conditional distribution. For simplicity, we omit the "i.d." above the equality signs except when this property needs to be emphasized. We also use F(·) to represent all distribution functions; which specific distribution it corresponds to is reflected by its arguments (when needed for clarity, a subscript with the random variable will be explicitly included). Equation 5.4 may be interpreted as the prior form of the joint map, where the associated conditional distribution functions are those of the prior and likelihood, both of which are available to us at the beginning of an inference procedure. Another form, with the target joint random vector in the order (d, y, θ), yields the KR map

$$\begin{bmatrix} \xi_1 \\ \xi_2 \\ \xi_3 \end{bmatrix} = \begin{bmatrix} T_d(d) \\ T_{y|d}(d, y) \\ T_{\theta|y,d}(d, y, \theta) \end{bmatrix} = \begin{bmatrix} \Phi^{-1}\left(F(d)\right) \\ \Phi^{-1}\left(F(y|d)\right) \\ \Phi^{-1}\left(F(\theta|y, d)\right) \end{bmatrix}, \qquad (5.5)$$

where ξ_1, ξ_2, ξ_3 are i.i.d. standard Gaussians.
Equation 5.5 may be interpreted as the posterior form of the joint map, where the associated conditional distribution functions are those of the evidence and posterior, the components we seek through the inference process.¹ While the prior form of the joint map is easy to construct, even analytically, the inference process then involves reordering the random variables to obtain the posterior form of the joint map, a non-trivial task. In the next section, we will show how to construct an approximation to the posterior form of Equation 5.5 directly, circumventing the prior form of Equation 5.4 and the reordering process altogether.

¹ Other orderings of the random variables are also possible, such as (y, d, θ). Such an ordering would still associate with the posterior conditional distribution, but not with the evidence. If the only interest is the posterior, then any ordering is suitable as long as θ is positioned after all the variables we plan to condition on.

To demonstrate that Equation 5.5 indeed carries the posterior information, consider the posterior random variable of θ conditioned on a particular experimental design d = d* and observations y = y*. Its KR map is precisely

$$T_{\theta|y^*, d^*}(\theta) = \Phi^{-1}\left(F(\theta|y^*, d^*)\right) = T_{\theta|y,d}(d^*, y^*, \theta), \qquad (5.6)$$

where the first equality is due to the definition of KR maps, and the second equality uses the relationship of the last component in Equation 5.5. Therefore, once the posterior form of the joint map in Equation 5.5 is available, we can obtain the KR map of the posterior random variable by simply conditioning the last component. Effectively, we have attained a posterior map that is parameterized by y and d. This is extremely useful in the context of sOED, where many repeated inference computations need to be conducted on the same prior belief state, but with different realizations of d and y, in numerically evaluating Bellman's equation (Equation 3.3) using a stochastic optimization algorithm. The probability density function of the joint can also be easily obtained via

$$f(d, y, \theta) = f_{\xi_1, \xi_2, \xi_3}\left(T_{d,y,\theta}(d, y, \theta)\right) \left| \det \partial T_{d,y,\theta}(d, y, \theta) \right|, \qquad (5.7)$$

where T_{d,y,θ}(d, y, θ) denotes the entire joint map from Equation 5.5 evaluated at (d, y, θ), and ∂T_{d,y,θ}(d, y, θ) is its Jacobian of transformation. The Jacobian determinant is easily computable, as it is simply the product of the diagonal terms due to the triangular structure. Similarly, the density function of the posterior can be obtained via

$$f(\theta|y, d) = f_{\xi_3}\left(T_{\theta|y,d}(d, y, \theta)\right) \left| \det \partial_\theta T_{\theta|y,d}(d, y, \theta) \right|, \qquad (5.8)$$

where ∂_θ T_{θ|y,d}(d, y, θ) is the Jacobian of transformation (with respect to θ) of the last component of Equation 5.5. Before we describe the map construction method in the next section, we first illustrate the concepts from this section with an example.

Example 5.2.1. For simplicity, consider an inference problem where the design d is fixed, and thus omitted from the notation. The prior on θ is N(0, 1), and the observation (likelihood model) has the form

$$y = G(\theta) + 1.7\epsilon = 0.01\theta^5 + 0.1(\theta - 1.5)^3 + 0.2\theta + 5 + 1.7\epsilon, \qquad (5.9)$$

where ε ∼ N(0, 1) is an independent noise random variable. The prior form of the joint map can be easily constructed using the prior and likelihood information:

$$\eta_1 = \theta, \qquad (5.10)$$
$$\eta_2 = \frac{1}{1.7}\left[ y - G(\theta) \right] = \frac{1}{1.7}\left[ y - 0.01\theta^5 - 0.1(\theta - 1.5)^3 - 0.2\theta - 5 \right]. \qquad (5.11)$$
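Sampling both sides of Equations 5.10 and 5.11 gives a quick numerical check that this prior-form map indeed pushes the joint (θ, y) to a pair of independent standard Gaussians; the sample sizes and seed below are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
G = lambda t: 0.01 * t**5 + 0.1 * (t - 1.5)**3 + 0.2 * t + 5.0  # forward model (5.9)
theta = rng.standard_normal(50_000)                 # prior samples, theta ~ N(0, 1)
y = G(theta) + 1.7 * rng.standard_normal(50_000)    # likelihood model (5.9)
eta1 = theta                                        # map component (5.10)
eta2 = (y - G(theta)) / 1.7                         # map component (5.11)
eta = np.column_stack([eta1, eta2])
print(eta.mean(axis=0))                             # approximately [0, 0]
print(np.cov(eta, rowvar=False))                    # approximately the 2x2 identity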
Before we describe the map construction method in the next section, we first illustrate the concepts from this section with an example.

Example 5.2.1. For simplicity, consider an inference problem where the design d is fixed, and thus omitted from the notation. The prior on θ is N(0, 1), and the observation (likelihood model) has the form

\[
y = G(\theta) + 1.7\epsilon = 0.01\theta^5 + 0.1(\theta - 1.5)^3 + 0.2\theta + 5 + 1.7\epsilon,
\tag{5.9}
\]

where ε ∼ N(0, 1) is an independent noise random variable. The prior form of the joint map can be constructed easily using the prior and likelihood information:

\[
\eta_1 = \theta,
\tag{5.10}
\]
\[
\eta_2 = \frac{1}{1.7}\left[y - G(\theta)\right] = \frac{1}{1.7}\left[y - 0.01\theta^5 - 0.1(\theta - 1.5)^3 - 0.2\theta - 5\right].
\tag{5.11}
\]

It is interesting to note that for likelihood models with additive Gaussian noise, the joint maps constructed in this manner are always monotone regardless of the form of the forward model G; this is due to the triangular form of the maps. To obtain the posterior form of the joint map, we require

\[
\xi_1 = T_y(y),
\tag{5.12}
\]
\[
\xi_2 = T_{\theta|y}(y, \theta),
\tag{5.13}
\]

which is difficult to attain analytically. Instead, we will construct an approximation to this joint map using the numerical techniques introduced in the next section. Once it is available, the map for the posterior conditioned on y = y∗ is then simply T_{θ|y∗}(θ) = T_{θ|y}(y∗, θ), for any realization y∗. We will revisit this example after introducing the map construction method.

5.3 Constructing maps from samples

We now describe a method to numerically construct an approximate map from samples of the target measure. The work presented in this section (Section 5.3) was originally developed by Parno and Marzouk, with additional details in [131, 132]. We repeat much of the derivation here for completeness. We seek transport maps that have a lower triangular structure, i.e.,

\[
T(z_1, z_2, \ldots, z_n) =
\begin{bmatrix}
T_1(z_1) \\
T_2(z_1, z_2) \\
\vdots \\
T_n(z_1, z_2, \ldots, z_n)
\end{bmatrix},
\tag{5.14}
\]

where z_i denotes the ith component of z and T_i : R^i → R is the ith component of the map T. We assume that both the target and reference measures are absolutely continuous on R^n. This assumption precludes the existence of atoms in µ_z and thus makes the KR coupling well-defined. To find a useful approximation of the KR coupling, we will define a map-induced density f̃_z(z) and minimize the distance between this map-induced density and the target density f_z(z).

5.3.1 Optimization objective

Let f_ξ be the probability density associated with the reference measure µ_ξ, and consider a transformation T̃(z) that is monotone and differentiable µ_z-a.e. (In Section 5.3.2 we will discuss constraints that ensure monotonicity; moreover, we will employ maps that are everywhere differentiable by construction.) Now consider the pullback of µ_ξ through T̃. The density of this pullback measure is

\[
\tilde f_z(z) = f_\xi\left(\tilde T(z)\right)\left|\det \partial \tilde T(z)\right|,
\tag{5.15}
\]

where ∂T̃(z) is the Jacobian of the map evaluated at z, and |det ∂T̃(z)| is the absolute value of its determinant. If the measure constraint µ_ξ = T̃♯µ_z were exactly satisfied, the map-induced density f̃_z would equal the target density f_z. This suggests finding T̃ by minimizing a distance or divergence between f̃_z and f_z; to this end, we use the Kullback-Leibler (KL) divergence from f̃_z to f_z:

\[
D_{KL}(f_z \| \tilde f_z) = \mathbb{E}_{f_z}\left[\ln \frac{f_z(z)}{\tilde f_z(z)}\right]
= \mathbb{E}_{f_z}\left[\ln f_z(z) - \ln f_\xi\left(\tilde T(z)\right) - \ln\left|\det \partial \tilde T(z)\right|\right].
\tag{5.16}
\]

We can then find transport maps by solving the following optimization problem:

\[
\min_{T \in \mathcal{T}}\ \mathbb{E}_{f_z}\left[-\ln f_\xi(T(z)) - \ln\left|\det \partial T(z)\right|\right],
\tag{5.17}
\]

where 𝒯 is some space of lower triangular functions from R^n to R^n. If 𝒯 is large enough to include the KR map, then the solution of this optimization problem will exactly satisfy Equation 5.1. Note that we have removed the ln f_z(z) term of Equation 5.16 from the optimization objective in Equation 5.17, as it is independent of T. If the exact coupling condition is satisfied, the quantity inside the expectation in Equation 5.16 becomes constant in z. Note also that the KL divergence is not symmetric. We choose the direction above so that we can use Monte Carlo samples to approximate the expectation with respect to f_z(z). Furthermore, as we will show below, this direction allows us to dramatically simplify the solution of Equation 5.17 when f_ξ is Gaussian.
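This direction of the KL divergence means the objective in Equation 5.17 can be estimated with plain Monte Carlo over target samples. The sketch below does this for a hypothetical monotone 1D candidate map with a standard Gaussian reference; the map family and the sample source are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
z_samples = rng.standard_normal(5_000) * 2.0 + 1.0   # stand-in samples from f_z

def candidate_map(z, gamma):
    # Hypothetical monotone 1D map: affine plus an odd cubic term.
    return gamma[0] + gamma[1] * z + gamma[2] * z**3

def candidate_map_deriv(z, gamma):
    return gamma[1] + 3.0 * gamma[2] * z**2

def kl_objective(gamma, z):
    # Monte Carlo estimate of E_{f_z}[-ln f_xi(T(z)) - ln|det dT(z)|]
    # for a standard-Gaussian reference (additive constants dropped).
    Tz = candidate_map(z, gamma)
    dTz = candidate_map_deriv(z, gamma)
    return np.mean(0.5 * Tz**2 - np.log(np.abs(dTz)))

print(kl_objective(np.array([-0.5, 0.5, 0.0]), z_samples))  # exact standardizing map
print(kl_objective(np.array([0.0, 1.0, 0.1]), z_samples))   # a worse candidate
```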
Suppose that we have K samples from f_z, denoted by {z^(1), z^(2), …, z^(K)}. Taking the sample-average approximation (SAA) approach described in Section 2.2.2, we replace the objective in Equation 5.17 with its Monte Carlo estimate and, for this fixed set of samples, solve the corresponding deterministic optimization problem:

\[
\tilde T = \operatorname*{argmin}_{T \in \mathcal{T}}\ \frac{1}{K}\sum_{k=1}^{K}\left[-\ln f_\xi\left(T(z^{(k)})\right) - \ln\left|\det \partial T(z^{(k)})\right|\right].
\tag{5.18}
\]

The solution T̃ is an approximation to the exact transport map for two reasons: first, we have used an approximation of the expectation operator; and second, we have restricted the feasible domain of the optimization problem to 𝒯. The specification of 𝒯 is the result of constraints, discussed in Section 5.3.2, and of the finite-dimensional parameterization of the map, such as a multivariate polynomial expansion.

5.3.2 Constraints

To write the map-induced density f̃_z as in Equation 5.15, it is sufficient that T̃ be differentiable and monotone, i.e., (z′ − z)⊤(T̃(z′) − T̃(z)) ≥ 0 for distinct points z, z′ ∈ R^n. Since we assume that µ_z has no atoms, to ensure that the pushforward T̃♯µ_z also has no atoms we only need to require that T̃ be strictly monotone. Our map is by construction everywhere differentiable and lower triangular, and we impose the monotonicity constraint via

\[
\frac{\partial \tilde T_i}{\partial z_i} \geq \lambda_{\min} > 0, \quad i = 1, \ldots, n.
\tag{5.19}
\]

Since T̃ is lower triangular, the Jacobian ∂T̃ is also lower triangular, and Equation 5.19 ensures that the Jacobian is positive definite. Because the Jacobian determinant is then positive, we can remove the absolute value from the determinant terms in Equation 5.17, Equation 5.18, and related expressions. This is an important step towards arriving at a convex optimization problem (see Section 5.3.3). Unfortunately, we cannot generally enforce the lower bound in Equation 5.19 over the entire support of the target measure. A weaker, but practically enforceable, alternative is to require the map to be increasing at each sample used to approximate the KL divergence. In other words, we use the constraints

\[
\left.\frac{\partial \tilde T_i}{\partial z_i}\right|_{z^{(k)}} \geq \lambda_{\min} > 0, \quad \forall i \in \{1, 2, \ldots, n\},\ \forall k \in \{1, 2, \ldots, K\}.
\tag{5.20}
\]

In practice, we have found that Equation 5.20 is sufficient to ensure the monotonicity of a map represented by a finite basis expansion.

5.3.3 Convexity and separability of the optimization problem

Now we consider the task of minimizing the objective in Equation 5.18. The 1/K term can immediately be discarded, and the derivative constraints above let us remove the absolute value from the determinant term. While one could tackle the resulting minimization problem directly, we can simplify it further by exploiting the structure of the reference density and the triangular map. First, we let ξ ∼ N(0, I). This choice of reference distribution yields

\[
\ln f_\xi(\xi) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{n}\xi_i^2.
\tag{5.21}
\]

Next, the lower triangular Jacobian ∂T̃ simplifies the determinant term in Equation 5.18 to give

\[
\ln\left|\det \partial \tilde T(z)\right| = \ln\left(\det \partial \tilde T(z)\right) = \ln\left(\prod_{i=1}^{n}\frac{\partial \tilde T_i}{\partial z_i}\right) = \sum_{i=1}^{n}\ln\frac{\partial \tilde T_i}{\partial z_i}.
\tag{5.22}
\]

The objective function in Equation 5.18 now becomes

\[
C(\tilde T) = \sum_{k=1}^{K}\sum_{i=1}^{n}\left[\frac{1}{2}\tilde T_i\left(z^{(k)}\right)^2 - \ln\left.\frac{\partial \tilde T_i}{\partial z_i}\right|_{z^{(k)}}\right].
\tag{5.23}
\]

This objective is separable: it is a sum of n terms, each involving a single component T̃_i of the map. The constraints in Equation 5.20 are also separable; there are K constraints for each T̃_i, and no constraint involves multiple components of the map. Hence the entire optimization problem separates into n individual optimization problems, one for each dimension of the parameter space. Moreover, each optimization problem is convex: the objective is convex, and the feasible domain is closed (note the ≥ operator in the linear constraints of Equation 5.20) and convex.
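As a concrete, deliberately small instance of Equations 5.18 to 5.20, the sketch below fits a single 1D map component with a monomial parameterization (anticipating Section 5.3.4), enforcing the derivative lower bound at every sample via scipy's constrained optimizer. The basis, sample source, and λ_min value are illustrative choices, not the thesis implementation (which builds on MUQ).

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

rng = np.random.default_rng(2)
z = rng.standard_normal(500) * 1.5 + 0.5         # samples z^(k) from a 1D target

# Monomial parameterization T(z; gamma) = gamma . (1, z, z^2, z^3).
Psi = np.vander(z, 4, increasing=True)            # K x 4 basis matrix
dPsi = np.column_stack([np.zeros_like(z), np.ones_like(z), 2 * z, 3 * z**2])

def objective(gamma):
    # One-component SAA objective of Eq. 5.23 (Gaussian reference, constants dropped).
    return np.sum(0.5 * (Psi @ gamma)**2 - np.log(dPsi @ gamma))

def gradient(gamma):
    return Psi.T @ (Psi @ gamma) - dPsi.T @ (1.0 / (dPsi @ gamma))

lam_min = 1e-6                                    # lambda_min of Eq. 5.20
mono = LinearConstraint(dPsi, lam_min * np.ones(len(z)),
                        np.full(len(z), np.inf), keep_feasible=True)

gamma0 = np.array([0.0, 1.0, 0.0, 0.0])           # identity map: a feasible start
res = minimize(objective, gamma0, jac=gradient, method="trust-constr",
               constraints=[mono])
print(res.x)  # near [-1/3, 2/3, 0, 0], the standardizing map for N(0.5, 1.5^2)
```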
In practice, we must solve the optimization problem over some finite-dimensional space of candidate maps. Let each component of the map be written as T̃_i(z; γ_i), i = 1, …, n, where γ_i ∈ R^{M_i} is a vector of parameters, e.g., coordinates in some basis. Throughout this thesis, we employ multivariate polynomial basis functions, but other choices are certainly possible; for instance, [131] found radial basis function representations of the map to be useful as well. For any choice of basis, we will require that T̃_i be linear in γ_i. The complete map is then defined by the parameters γ̄ = [γ_1, γ_2, …, γ_n]. Note that there is a distinct parameter vector for each component of the map. The optimization problem over the parameters remains separable, with each of the n subproblems given by

\[
\begin{aligned}
\min_{\gamma_i}\quad & \sum_{k=1}^{K}\left[\frac{1}{2}\tilde T_i\left(z^{(k)}; \gamma_i\right)^2 - \ln\left.\frac{\partial \tilde T_i(z; \gamma_i)}{\partial z_i}\right|_{z^{(k)}}\right] \\
\text{s.t.}\quad & \left.\frac{\partial \tilde T_i(z; \gamma_i)}{\partial z_i}\right|_{z^{(k)}} \geq \lambda_{\min} > 0, \quad k \in \{1, 2, \ldots, K\},
\end{aligned}
\tag{5.24}
\]

for i = 1, …, n. All of these optimization subproblems can be solved in parallel without evaluating the target density f_z(z). Since the map components T̃_i are linear in the coefficients γ_i, each finite-dimensional problem is still convex. Moreover, efficient matrix-matrix and matrix-vector operations can be used to evaluate the objective. This allows us to solve Equation 5.24 easily with a standard Newton method.

5.3.4 Map parameterization

One way to parameterize each component T̃_i of the map is with a multivariate polynomial expansion. We define each multivariate polynomial ψ_j as

\[
\psi_{\mathbf{j}}(z) = \prod_{i=1}^{n} \varphi_{j_i}(z_i),
\tag{5.25}
\]

where j = (j_1, j_2, …, j_n) ∈ N_0^n is a multi-index and φ_{j_i} is a univariate polynomial of degree j_i. The univariate polynomials can be chosen from any family of orthogonal polynomials (e.g., Hermite, Legendre, Jacobi). For simplicity, monomials are used for the present purposes. Using these multivariate polynomials, we express the map as a finite expansion of the form

\[
\tilde T_i(z; \gamma_i) = \sum_{\mathbf{j} \in \mathcal{J}_i} \gamma_{i,\mathbf{j}}\,\psi_{\mathbf{j}}(z),
\tag{5.26}
\]

where 𝒥_i is a set of multi-indices defining the polynomial terms in the expansion. Notice that the cardinality of the multi-index set defines the dimension of each parameter vector γ_i, i.e., M_i = |𝒥_i|. An appropriate choice of each multi-index set 𝒥_i will force the entire map T̃ to be lower triangular. A simple choice of the multi-index set corresponds to a total-order polynomial basis, where the maximum degree of each multivariate polynomial is bounded by some integer p ≥ 0:

\[
\mathcal{J}_i^{TO} = \left\{\mathbf{j} : \|\mathbf{j}\|_1 \leq p,\ j_k = 0\ \forall k > i\right\}.
\tag{5.27}
\]

The first condition in this set limits the polynomial order, while the second condition, j_k = 0 ∀k > i, applied over all i = 1, …, n components of the map, forces T̃ to be lower triangular. In this work, we adopt a total-order (monomial) polynomial basis.
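To make Equation 5.27 concrete, here is a short sketch that enumerates the total-order multi-index set for component i; the function name is our own.

```python
from itertools import product

def total_order_multi_indices(n, i, p):
    """Multi-index set J_i^TO of Eq. 5.27: ||j||_1 <= p and j_k = 0 for k > i,
    so that component i depends only on z_1, ..., z_i (lower triangular map)."""
    indices = []
    for j in product(range(p + 1), repeat=i):   # only the first i entries may be nonzero
        if sum(j) <= p:
            indices.append(j + (0,) * (n - i))
    return indices

# Component 2 of a 3-dimensional map with total order p = 2 has 6 terms:
for j in total_order_multi_indices(n=3, i=2, p=2):
    print(j)   # (0,0,0), (0,1,0), (0,2,0), (1,0,0), (1,1,0), (2,0,0)
```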
Example 5.3.1. We now continue from Example 5.2.1 and present its numerical results. Samples from the joint distribution, and the joint density function contours, are shown in Figure 5-2. A particular posterior, for y = y∗ = 1, is studied; this is represented by the dotted horizontal line in the joint density plot, and its exact posterior density is shown in Figure 5-3. An approximate joint map of the form in Equations 5.12 and 5.13 is constructed numerically using monomial bases of different total orders and different numbers of samples; the densities corresponding to the posterior map T_{θ|y∗} are also shown in Figure 5-3. As expected, the density induced by the approximate posterior map better approximates the exact density as the polynomial basis order and sample size are increased. However, while the model in Equation 5.9 is a 5th-order polynomial, the posterior form of its joint map (Equations 5.12 and 5.13) can generally be of higher than 5th order. As a result, even with a 5th-order polynomial basis, we do not expect to obtain the exact posterior density.

Figure 5-2: Example 5.3.1: samples and density contours. (a) Samples from the joint distribution; (b) joint density contours, where the dotted line marks the y = 1 value at which inference is performed.

Figure 5-3: Example 5.3.1: posterior density functions using different map polynomial basis orders and sample sizes. (a) Exact posterior density for y = 1; (b) various map basis orders (p = 1, 3, 5), with 10⁶ samples; (c) various sample sizes (10³, 10⁴, 10⁵), with a 5th-order basis.

5.4 Relationship between quality of joint and conditional maps

The map construction method described in the previous section can be used to construct an approximate map in the posterior form of the joint map (Equation 5.5) in the context of Bayesian inference. This construction method minimizes an objective that is the KL divergence between the map-induced joint density and the target joint density (Equation 5.16). As the goal of inference is ultimately to obtain the posterior map and its density by conditioning the joint map (Equations 5.5 and 5.6), we would like to explore the implications of joint map quality for the quality of the subsequent posterior conditional maps. In other words, does a "good" joint map also lead to "good" posterior maps? We prove that this is indeed the case, and that the optimal (in the KL sense) approximate joint map also produces the optimal expected posterior maps.

Consider an exact n-dimensional, lower triangular and monotone transport map from a target random vector z_{1:n} to an i.i.d. reference random vector ξ_{1:n}:

\[
\xi_{1:n} =
\begin{bmatrix} \xi_1 \\ \xi_2 \\ \vdots \\ \xi_n \end{bmatrix}
\overset{i.d.}{=}
\begin{bmatrix} T_1(z_1) \\ T_2(z_1, z_2) \\ \vdots \\ T_n(z_1, z_2, \ldots, z_n) \end{bmatrix}
= T_{1:n}(z_{1:n}),
\tag{5.28}
\]

where the subscript notation z_{j:k} denotes (z_j, …, z_k). This exact map T_{1:n} always exists and is unique, with target density f_{z_{1:n}}(z_{1:n}) = f_{ξ_{1:n}}(T_{1:n}(z_{1:n})) |det ∂T_{1:n}(z_{1:n})|. Let T̃_{1:n} ∈ 𝒯̃_{1:n} denote an approximate map, where 𝒯̃_{1:n} ⊆ 𝒯_{1:n} is an approximation subspace of the same dimension, and 𝒯_{1:n} is the space of all lower triangular diffeomorphisms on R^n:

\[
\tilde\xi_{1:n} =
\begin{bmatrix} \tilde\xi_1 \\ \tilde\xi_2 \\ \vdots \\ \tilde\xi_n \end{bmatrix}
\overset{i.d.}{=}
\begin{bmatrix} \tilde T_1(z_1) \\ \tilde T_2(z_1, z_2) \\ \vdots \\ \tilde T_n(z_1, z_2, \ldots, z_n) \end{bmatrix}
= \tilde T_{1:n}(z_{1:n}).
\tag{5.29}
\]

In Equation 5.29, the target random vector z_{1:n} is unchanged, but the reference random vector ξ̃_{1:n} is approximate and, as a result, generally no longer i.i.d. Similarly, we could
keep the reference random vector ξ_{1:n} unchanged (and thus i.i.d.), and view ξ_{1:n} = T̃_{1:n}(z̃_{1:n}) (in distribution) for some approximate target random vector z̃_{1:n}. Such a z̃_{1:n} always exists, since it is precisely T̃_{1:n}^{-1}(ξ_{1:n}). Figure 5-4 illustrates these two perspectives. The approximate target density is f_{z̃_{1:n}}(z_{1:n}) = f_{ξ_{1:n}}(T̃_{1:n}(z_{1:n})) det ∂T̃_{1:n}(z_{1:n}), which in general only approximates the true target density f_{z_{1:n}}(z_{1:n}).

Figure 5-4: Illustration of the exact map (ξ_{1:n} = T_{1:n}(z_{1:n})) and the two perspectives on approximate maps (ξ̃_{1:n} = T̃_{1:n}(z_{1:n}) and ξ_{1:n} = T̃_{1:n}(z̃_{1:n})). Contour plots on the left reflect the reference density, and on the right the target density.

The map construction approach described in Section 5.3 finds a good approximate map by minimizing the KL divergence between f_{z̃_{1:n}}(z_{1:n}) and f_{z_{1:n}}(z_{1:n}), where the KL divergence reflects the map quality jointly across all of its dimensions. Through the following theorem and corollary, we show that the optimal approximate joint map also produces optimal expected posterior conditional maps.

Theorem 5.4.1. Let the optimal approximate joint map, satisfying

\[
\tilde T_{1:n}^* = \operatorname*{argmin}_{\tilde T_{1:n} \in \tilde{\mathcal{T}}_{1:n}} D_{KL}\left(f_{z_{1:n}} \| f_{\tilde z_{1:n}}\right),
\tag{5.30}
\]

be denoted by the component structure

\[
\tilde T_{1:n}^*(z_{1:n}) =
\begin{bmatrix}
\tilde T_1^*(z_1) \\
\tilde T_2^*(z_1, z_2) \\
\vdots \\
\tilde T_n^*(z_1, z_2, \ldots, z_n)
\end{bmatrix}.
\tag{5.31}
\]

Then for each k = 1, …, n, the dimension-truncated "head" map

\[
\tilde T_{1:k}^*(z_{1:k}) =
\begin{bmatrix}
\tilde T_1^*(z_1) \\
\tilde T_2^*(z_1, z_2) \\
\vdots \\
\tilde T_k^*(z_1, z_2, \ldots, z_k)
\end{bmatrix}
\tag{5.32}
\]

is also the optimal approximate map for z_{1:k}, in the sense that

\[
\tilde T_{1:k}^* = \operatorname*{argmin}_{\tilde T_{1:k} \in \tilde{\mathcal{T}}_{1:k}} D_{KL}\left(f_{z_{1:k}} \| f_{\tilde z_{1:k}}\right),
\tag{5.33}
\]

where 𝒯̃_{1:k} ⊆ 𝒯̃_{1:n} is its first k-dimensional truncation.

Proof. We want to show that Equation 5.33 holds for k = 1, …, n. We proceed by induction. The base case k = n is clearly true by the definition of T̃_{1:n}^*. Now assume that Equation 5.33 holds for k = m + 1; we show that it then holds for k = m as well. For any approximate map T̃_{1:(m+1)} ∈ 𝒯̃_{1:(m+1)},

\[
\begin{aligned}
& D_{KL}\left(f_{z_{1:(m+1)}} \,\|\, f_{\tilde z_{1:(m+1)}}\right) \\
&= \mathbb{E}_{z_{1:(m+1)}}\left[\ln\frac{f_{z_{1:(m+1)}}(z_{1:(m+1)})}{f_{\tilde z_{1:(m+1)}}(z_{1:(m+1)})}\right] \\
&= \mathbb{E}_{z_{1:(m+1)}}\left[\ln\frac{f_{z_{m+1}|z_{1:m}}(z_{m+1}|z_{1:m})\, f_{z_{1:m}}(z_{1:m})}{f_{\tilde z_{m+1}|\tilde z_{1:m}}(z_{m+1}|z_{1:m})\, f_{\tilde z_{1:m}}(z_{1:m})}\right] \\
&= \mathbb{E}_{z_{1:(m+1)}}\left[\ln\frac{f_{z_{m+1}|z_{1:m}}(z_{m+1}|z_{1:m})}{f_{\tilde z_{m+1}|\tilde z_{1:m}}(z_{m+1}|z_{1:m})}\right] + \mathbb{E}_{z_{1:(m+1)}}\left[\ln\frac{f_{z_{1:m}}(z_{1:m})}{f_{\tilde z_{1:m}}(z_{1:m})}\right] \\
&= \mathbb{E}_{z_{1:m}}\left[\mathbb{E}_{z_{m+1}|z_{1:m}}\left[\ln\frac{f_{z_{m+1}|z_{1:m}}(z_{m+1}|z_{1:m})}{f_{\tilde z_{m+1}|\tilde z_{1:m}}(z_{m+1}|z_{1:m})}\right]\right] + \mathbb{E}_{z_{1:m}}\left[\ln\frac{f_{z_{1:m}}(z_{1:m})}{f_{\tilde z_{1:m}}(z_{1:m})}\right] \\
&= \mathbb{E}_{z_{1:m}}\left[D_{KL}\left(f_{z_{m+1}|z_{1:m}}(\cdot|z_{1:m}) \,\|\, f_{\tilde z_{m+1}|\tilde z_{1:m}}(\cdot|z_{1:m})\right)\right] + D_{KL}\left(f_{z_{1:m}} \,\|\, f_{\tilde z_{1:m}}\right),
\end{aligned}
\tag{5.34}
\]

where the second equality is due to

\[
\begin{aligned}
f_{\tilde z_{1:(m+1)}}(z_{1:(m+1)}) &= f_{\xi_{1:(m+1)}}\left(\tilde T_{1:(m+1)}(z_{1:(m+1)})\right)\det \partial \tilde T_{1:(m+1)}(z_{1:(m+1)}) \\
&= f_{\xi_{m+1}}\left(\tilde T_{m+1}(z_{1:(m+1)})\right) f_{\xi_{1:m}}\left(\tilde T_{1:m}(z_{1:m})\right)\det \partial_{m+1}\tilde T_{m+1}(z_{1:(m+1)})\det \partial \tilde T_{1:m}(z_{1:m}) \\
&= f_{\tilde z_{m+1}|\tilde z_{1:m}}(z_{m+1}|z_{1:m})\, f_{\tilde z_{1:m}}(z_{1:m}),
\end{aligned}
\tag{5.35}
\]

in which f_{z̃_{m+1}|z̃_{1:m}}(z_{m+1}|z_{1:m}) = f_{ξ_{m+1}}(T̃_{m+1}(z_{1:(m+1)})) det ∂_{m+1}T̃_{m+1}(z_{1:(m+1)}) depends only on the map component of dimension m + 1 (i.e., T̃_{m+1}), and not on any of the previous map components (i.e., T̃_{1:m}). The decomposition of f_{ξ_{1:(m+1)}}(T̃_{1:(m+1)}(z_{1:(m+1)})) in the second equality of Equation 5.35 uses the independence of the components of ξ_{1:(m+1)}, and the decomposition of det ∂T̃_{1:(m+1)}(z_{1:(m+1)}) is due to the triangular structure of the map.
Now taking the argmin on both sides of Equation 5.34, we obtain

\[
\begin{aligned}
& \operatorname*{argmin}_{\tilde T_{1:(m+1)} \in \tilde{\mathcal{T}}_{1:(m+1)}} D_{KL}\left(f_{z_{1:(m+1)}} \,\|\, f_{\tilde z_{1:(m+1)}}\right) \\
&= \operatorname*{argmin}_{\tilde T_{1:m} \in \tilde{\mathcal{T}}_{1:m},\, \tilde T_{m+1} \in \tilde{\mathcal{T}}_{m+1}} \left\{\mathbb{E}_{z_{1:m}}\left[D_{KL}\left(f_{z_{m+1}|z_{1:m}}(\cdot|z_{1:m}) \,\|\, f_{\tilde z_{m+1}|\tilde z_{1:m}}(\cdot|z_{1:m})\right)\right] + D_{KL}\left(f_{z_{1:m}} \,\|\, f_{\tilde z_{1:m}}\right)\right\} \\
&= \operatorname*{argmin}_{\tilde T_{m+1} \in \tilde{\mathcal{T}}_{m+1}} \mathbb{E}_{z_{1:m}}\left[D_{KL}\left(f_{z_{m+1}|z_{1:m}}(\cdot|z_{1:m}) \,\|\, f_{\tilde z_{m+1}|\tilde z_{1:m}}(\cdot|z_{1:m})\right)\right] + \operatorname*{argmin}_{\tilde T_{1:m} \in \tilde{\mathcal{T}}_{1:m}} D_{KL}\left(f_{z_{1:m}} \,\|\, f_{\tilde z_{1:m}}\right),
\end{aligned}
\tag{5.36}
\]

where we have made use of the fact that the two terms in the sum depend on separate dimension components of the overall map. As a result, we see that the optimal approximate map T̃_{1:(m+1)}^* for dimensions 1 to m + 1 is the concatenation of the optimal map T̃_{1:m}^* for dimensions 1 to m and the (m + 1)th component T̃_{m+1}^*. This completes the proof.

Corollary 5.4.1. For each k = 1, …, n, the component map T̃_k^* is the optimal expected conditional map, in the sense that

\[
\tilde T_k^* = \operatorname*{argmin}_{\tilde T_k \in \tilde{\mathcal{T}}_k} \mathbb{E}_{z_{1:(k-1)}}\left[D_{KL}\left(f_{z_k|z_{1:(k-1)}}(\cdot|z_{1:(k-1)}) \,\|\, f_{\tilde z_k|\tilde z_{1:(k-1)}}(\cdot|z_{1:(k-1)})\right)\right].
\tag{5.37}
\]

Proof. This is a direct consequence of Equation 5.36.

In the context of Bayesian inference through joint maps (Section 5.2), the component map used on the right-hand side of Equation 5.6 is therefore optimal under the joint expectation over d and y.

5.5 Sequential design using transport maps

We now shift focus back to the sOED problem described in Chapter 3. Recall that we would like to solve the dynamic programming form of the sOED problem, stated as Equations 3.3 and 3.4 and restated here for convenience:

\[
J_k(x_k) = \max_{d_k \in \mathcal{D}_k} \mathbb{E}_{y_k|x_k,d_k}\left[g_k(x_k, y_k, d_k) + J_{k+1}\left(\mathcal{F}_k(x_k, y_k, d_k)\right)\right],
\tag{5.38}
\]
\[
J_N(x_N) = g_N(x_N).
\tag{5.39}
\]

While approximate dynamic programming techniques were introduced in Chapter 4 for finding an approximate solution to this form, the issue of choosing the belief state x_k remains unaddressed. Two major requirements emerge in considering this decision: the belief state needs to (1) represent general, non-Gaussian posteriors of multiple dimensions in a finite-dimensional manner, and (2) support fast Bayesian inference. The second requirement is driven by Equation 5.38, whose numerical evaluation even at a single x_k involves performing Bayesian inference (i.e., operations of F_k(x_k, y_k, d_k)) many times, under different values of d_k (stochastic optimization iterations) and y_k (Monte Carlo approximation of the expectation). Traditional approaches for representing random variables, such as direct approximations of their probability density and distribution functions, or Gaussian mixtures and particles, generally do not scale well with dimension, are constrained to limited forms and structures, or are computationally expensive to construct and propagate under inference. In contrast, the transport map method introduced in this chapter satisfies both requirements well. Following the discussion in Section 5.3.4, not only do the approximate KR maps provide a finite-dimensional representation of general random variables, they also offer a mechanism to adapt. This can be done by adjusting the selection of basis functions in different dimensions or regions, a topic to be explored in future work. Targeting an efficient representation can greatly alleviate the burden of extending to multiple and higher-dimensional settings. Additionally, as illustrated in Section 5.2, only a single joint map needs to be constructed, which can then be used to perform inference almost trivially for many different realizations of designs and observations; the computational pattern this must support is sketched below.
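The pattern behind Equation 5.38 is a nested loop: stochastic optimization over d_k, Monte Carlo over y_k, and an inference inside every term. A minimal sketch of that pattern follows, with a generic `bayes_update` callable standing in for F_k and the transport-map conditioning; everything named here is illustrative, not the thesis implementation.

```python
import numpy as np

def expected_value_to_go(x_k, d_k, bayes_update, g_k, J_next, sample_y,
                         n_mc=100, rng=None):
    """Monte Carlo estimate of E_{y_k|x_k,d_k}[ g_k + J_{k+1}(F_k(x_k, y_k, d_k)) ]
    from Equation 5.38; bayes_update plays the role of F_k (one inference per draw)."""
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(n_mc):
        y_k = sample_y(x_k, d_k, rng)           # draw an observation
        x_next = bayes_update(x_k, d_k, y_k)    # inference, e.g. one map conditioning
        total += g_k(x_k, y_k, d_k) + J_next(x_next)
    return total / n_mc

# A stochastic optimizer over d_k calls this estimator at every iterate, so one
# design choice costs roughly n_iter * n_mc inference evaluations; this count
# is why conditioning a single precomputed joint map, rather than re-running a
# sampler for each (d_k, y_k), matters in practice.
```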
The construction process itself is non-intrusive, requiring only samples from the target measure, which can effectively be treated as a black box. The construction procedure ends with a computationally attractive optimization problem that is dimensionally separable and convex, and whose solution can be obtained inexpensively. Motivated by these advantages, we proceed to use transport maps as the belief state for the sOED problem.

5.5.1 Joint map structure

Transport maps of different levels of scope may be constructed for the sOED problem; we explore these possibilities here and discuss their pros and cons. To illustrate the idea, consider only the first experiment, at stage k = 0, with the prior x_0 fixed, and the dimensions n_θ, n_y, and n_d inherited from Section 2.1 and assumed constant for all experiments. We then have three different levels of map construction in this situation, summarized in Table 5.1.

Map targets                   θ              θ, y_0          θ, d_0, y_0
Dimension                     n_θ            n_θ + n_y       n_θ + n_d + n_y
Construct one for each...     {d_0, y_0}     {d_0}           (single map)
Total no. of constructions    n_iter × n_MC  n_iter          1

Table 5.1: Different levels of scope available when transport maps are used as the belief state in the sOED context. n_iter represents the number of stochastic optimization iterations in the numerical evaluation of Equation 5.38, and n_MC represents the Monte Carlo sample size used to approximate its expectation. In our implementation, these values are typically around 50 and 100, respectively.

The first choice involves constructing an n_θ-dimensional map for each realization of {d_0, y_0}, which is equivalent to capturing each posterior directly. Such an approach does not align with the map construction and inference tools described in this chapter: constructing these maps requires samples from the posteriors, the very distributions we are trying to recover in the first place; it also does not take advantage of the inference mechanism of conditioning a joint map. The second choice involves constructing an (n_θ + n_y)-dimensional map for each realization of {d_0}. This approach requires samples from the joint distribution of θ and y_0 conditioned on d_0, which can be obtained by first sampling θ from the prior and then y_0 from the likelihood conditioned on the θ sample and d_0. A map constructed at d_0 can then be used for inference at that d_0 and any realization of y_0. However, n_iter map constructions are required, one for each d_0 encountered within the stochastic optimization when evaluating Equation 5.38 numerically (n_iter, the number of stochastic optimization iterations, is typically capped at 50 in our implementation), as these maps cannot be reused when the optimizer moves to a different value of d_0. The last choice involves a single (n_θ + n_d + n_y)-dimensional map that can be used for inference at any realizations of d_0 and y_0. Its construction involves samples from the joint distribution of θ, y_0, and d_0, which can be obtained but requires some predefined rule (or distribution) governing the generation of d_0; we will discuss this requirement in detail in Section 5.5.3. The primary trade-off between these choices is between map dimension and the number of maps needed. For example, let us focus on the second and third choices, with n_d around 2 in our numerical examples and n_iter at most 50. From experience, the extra time to construct a map n_d dimensions higher is usually substantially shorter than that of constructing the smaller joint map n_iter times.
While both approaches remain affordable, the larger map appears to be the more computationally economical choice. Accuracy considerations become more important when the map dimension is high. On one front, with more dimensions and basis terms, especially when using a total-order polynomial basis, more samples from the joint distribution are required to construct the map while maintaining the same level of accuracy. On another front, a map of higher dimension also carries additional relationships among the new variables compared to its lower-dimensional counterpart, and is consequently more difficult to capture accurately. In this context, a map that attempts to accommodate the dependence on d tends to be less accurate at any particular value of d than a lower-dimensional map constructed for that specific d. Depending on the regularity of the problem, the map accuracy can have a large impact on the overall sOED results. The same pattern of trade-offs is observed when extending to multiple experiments.

Following a similar argument, we propose to construct a single joint map at each stage k that can be used for inference at any realizations of the designs and observations from the previous experiments. Table 5.2 illustrates these maps for the first three experiments. In particular, T_{θ|d_0,y_0} (the last component of the first column) is used for inference after performing one experiment, T_{θ|d_0,y_0,d_1,y_1} (the last component of the second column) is used for inference after performing two experiments, and so on. A closer examination of their structure reveals two interesting observations. First, only the T_{θ|d_0,y_0,…,d_k,y_k} (bottom) component of each map is used for performing inference; all other components are not needed, but are created as a by-product of constructing these maps. Second, there is substantial overlap of components between the maps at different stages. Specifically, the components grouped by the red rectangular boxes in Table 5.2 are identical. It is then natural to retain only the unique components from all these maps in an N-experiment design setting, arriving at the following single joint map that can be used for performing Bayesian inference after any number of experiments:

\[
\begin{aligned}
\xi_{d_0} &= T_{d_0}(d_0) \\
\xi_{y_0} &= T_{y_0|d_0}(d_0, y_0) \\
\xi_{d_1} &= T_{d_1|d_0,y_0}(d_0, y_0, d_1) \\
\xi_{y_1} &= T_{y_1|d_0,y_0,d_1}(d_0, y_0, d_1, y_1) \\
&\ \ \vdots \\
\xi_{d_{N-1}} &= T_{d_{N-1}|d_0,y_0,\ldots,y_{N-2}}(d_0, y_0, \ldots, d_{N-2}, y_{N-2}, d_{N-1}) \\
\xi_{y_{N-1}} &= T_{y_{N-1}|d_0,y_0,\ldots,d_{N-1}}(d_0, y_0, \ldots, d_{N-2}, y_{N-2}, d_{N-1}, y_{N-1}) \\
\xi_{\theta_0} &= T_{\theta|d_0,y_0}(d_0, y_0, \theta) \\
\xi_{\theta_1} &= T_{\theta|d_0,y_0,d_1,y_1}(d_0, y_0, d_1, y_1, \theta) \\
&\ \ \vdots \\
\xi_{\theta_{N-1}} &= T_{\theta|d_0,y_0,\ldots,d_{N-1},y_{N-1}}(d_0, y_0, \ldots, d_{N-1}, y_{N-1}, \theta).
\end{aligned}
\tag{5.40}
\]

This final map has dimension N(n_θ + n_d + n_y), and the entire map can be constructed all at once using the method described in Section 5.3. The components ξ_{θ_k} correspond to the same θ variable, but with dependence structures involving different numbers of d_k's and y_k's, for inference after different numbers of experiments. This setup then does not require any intermediate posterior maps T_{θ|d_0^*,y_0^*,…,d_k^*,y_k^*} directly, and inference is done by conditioning on the entire history of past d_k^* and y_k^* values. Consequently, intermediate posterior approximation errors are avoided altogether. The triangular structure is maintained, while the block of θ variables has a sparse dependence structure (e.g., T_{θ|d_0,y_0} is for inference after the first experiment and thus has no dependence on d_k and y_k for k > 0; θ components corresponding to different numbers of experiments also do not depend on each other), and this sparsity property is leveraged in our implementation.
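The dependence pattern of Equation 5.40 is easy to enumerate programmatically; the sketch below lists, for each component of the stacked map, which inputs it is allowed to depend on. This is a bookkeeping helper of our own, useful when assembling the multi-index sets of Section 5.3.4.

```python
def stacked_map_inputs(N):
    """Input variables of each component of the Eq. 5.40 joint map,
    for N experiments, ordered d0, y0, d1, y1, ..., then the theta block."""
    comps, hist = [], []
    for k in range(N):
        hist.append(f"d{k}")
        comps.append((f"T_d{k}", list(hist)))
        hist.append(f"y{k}")
        comps.append((f"T_y{k}", list(hist)))
    for k in range(N):  # sparse theta block: only the first 2(k+1) history variables
        comps.append((f"T_theta({k + 1} exps)", hist[: 2 * (k + 1)] + ["theta"]))
    return comps

for name, deps in stacked_map_inputs(2):
    print(f"{name}: {deps}")
# T_d0: ['d0']
# T_y0: ['d0', 'y0']
# T_d1: ['d0', 'y0', 'd1']
# T_y1: ['d0', 'y0', 'd1', 'y1']
# T_theta(1 exps): ['d0', 'y0', 'theta']
# T_theta(2 exps): ['d0', 'y0', 'd1', 'y1', 'theta']
```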
k = 0:
ξ_{d_0} = T_{d_0}(d_0)
ξ_{y_0} = T_{y_0}(d_0, y_0)
ξ_{θ_0} = T_{θ_0}(d_0, y_0, θ)

k = 1:
ξ_{d_0} = T_{d_0}(d_0)
ξ_{y_0} = T_{y_0}(d_0, y_0)
ξ_{d_1} = T_{d_1}(d_0, y_0, d_1)
ξ_{y_1} = T_{y_1}(d_0, y_0, d_1, y_1)
ξ_{θ_1} = T_{θ_1}(d_0, y_0, d_1, y_1, θ)

k = 2:
ξ_{d_0} = T_{d_0}(d_0)
ξ_{y_0} = T_{y_0}(d_0, y_0)
ξ_{d_1} = T_{d_1}(d_0, y_0, d_1)
ξ_{y_1} = T_{y_1}(d_0, y_0, d_1, y_1)
ξ_{d_2} = T_{d_2}(d_0, y_0, d_1, y_1, d_2)
ξ_{y_2} = T_{y_2}(d_0, y_0, d_1, y_1, d_2, y_2)
ξ_{θ_2} = T_{θ_2}(d_0, y_0, d_1, y_1, d_2, y_2, θ)

Table 5.2: Structure of the joint maps needed to perform inference after different numbers of experiments (columns k = 0, 1, 2, …). For simplicity of notation, we omit the conditioning in the subscripts of the map components; see Equation 5.40 for the full subscripts. The same pattern repeats for higher numbers of experiments. The components grouped by the red rectangular boxes are identical.

5.5.2 Distributions on design variables

The joint maps presented in Table 5.2 and Equation 5.40 all involve dependence on d_k, so that the same maps can be used for inference under different designs. To construct these joint maps, then, d_k samples are required. While θ and y_k samples can be generated naturally from the prior and the likelihood model, it is not immediately clear how to generate d_k. On the one hand, intuition tells us that the joint map can be made more accurate if we prescribe an appropriate distribution for d_k that reflects how often the designs are visited. On the other hand, we must do so without compromising what we ultimately seek from the joint maps: the posteriors. We address these considerations generally in this subsection, and focus on d_k generation specifically within the sOED context in the next subsection.

Consider a simple one-experiment joint map (i.e., the k = 0 column of Table 5.2); for simplicity we drop the subscripts on d_0 and y_0. The ultimate purpose of this joint map is to produce posterior maps T_{θ|d∗,y∗}(θ) = T_{θ|d,y}(d∗, y∗, θ). Assuming that (1) d and θ are (marginally) independent, and (2) the prior f(θ) and likelihood f(y|θ, d) are fixed, the posterior conditional remains unchanged regardless of the marginal distribution of d:

\[
f(\theta|y,d) = \frac{f(\theta, y, d)}{f(y, d)} = \frac{f(y|\theta,d)\, f(d)\, f(\theta)}{f(y|d)\, f(d)} = \frac{f(y|\theta,d)\, \tilde f(d)\, f(\theta)}{f(y|d)\, \tilde f(d)} = \frac{f(y|\theta,d)\, \tilde f(d)\, f(\theta)}{\tilde f(y|d)\, \tilde f(d)} = \frac{\tilde f(\theta, y, d)}{\tilde f(y, d)},
\tag{5.41}
\]

where f̃ denotes density functions resulting from an alternative choice of marginal distribution on d. The second equality uses the independence assumption between d and θ. The third equality employs the alternative f̃(d), which does not affect the prior and likelihood. The fourth equality can be seen more clearly from

\[
\tilde f(y|d) = \int f(y|\theta, d)\, f(\theta)\, d\theta = f(y|d),
\tag{5.42}
\]

again using the independence between θ and d. Equation 5.41 implies that, regardless of the d marginal, the same posteriors are maintained, and we indeed have the freedom to select a distribution from which to sample d. The joint distribution and joint map, however, will differ. The linear-Gaussian example below demonstrates that when the d marginal is chosen to reflect designs that are visited more frequently or are otherwise of interest, the quality of the posteriors is improved.
Example 5.5.1. Consider a linear model y = θd + ε with prior θ ∼ N(s_0, σ_0²) and noise variable ε ∼ N(0, σ_ε²). This linear-Gaussian problem has conjugate Gaussian posteriors of the form

\[
\theta | y, d \sim \mathcal{N}\left(\frac{\frac{y/d}{\sigma_\epsilon^2/d^2} + \frac{s_0}{\sigma_0^2}}{\frac{1}{\sigma_\epsilon^2/d^2} + \frac{1}{\sigma_0^2}},\ \frac{1}{\frac{1}{\sigma_\epsilon^2/d^2} + \frac{1}{\sigma_0^2}}\right).
\tag{5.43}
\]

We first point out that, even when the marginal of d is Gaussian, the joint distribution of (θ, y, d) is not multivariate Gaussian. This can be seen from the following argument. If the joint is Gaussian, then all of its marginals are also Gaussian; conversely, if any of its marginals is not Gaussian, then the joint cannot be Gaussian. Since θ and d are independent Gaussian random variables, their product θd cannot be Gaussian, and thus (y − ε) cannot be Gaussian. Knowing that ε is independent of d and θ, and that (y − ε) is neither Gaussian nor a constant, the marginal on y cannot be Gaussian either: if y were Gaussian, its moment generating function would satisfy M_y(t) = exp(µ_y t + ½σ_y²t²) = M_{y−ε}(t) M_ε(t) = M_{y−ε}(t) exp(½σ_ε²t²), so that (y − ε) must be either Gaussian or a constant, which is a contradiction. As a result, we conclude that the joint distribution cannot be multivariate Gaussian, and using a linear polynomial basis to represent the joint map in this example incurs truncation error.
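The non-Gaussianity argument above can also be checked cheaply by simulation. The following sketch, our own check rather than a computation from the thesis, estimates the excess kurtosis of y = θd + ε when θ, d, and ε are independent Gaussians; this would be zero if y were Gaussian.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 1_000_000
theta = rng.standard_normal(n)           # prior theta ~ N(0, 1)
d = rng.normal(0.0, 5.0, n)              # a Gaussian design marginal (cf. case 2 below)
y = theta * d + rng.standard_normal(n)   # y = theta*d + eps, eps ~ N(0, 1)

print(kurtosis(y))  # clearly positive excess kurtosis, so y is not Gaussian
```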
Now assume we are interested in performing inference accurately for designs distributed as d◦ ∼ N(1, 0.1²). To test the accuracy of the posteriors from a given joint map, we randomly sample designs from d◦, θ from the prior with s_0 = 0, σ_0 = 1, and y from the likelihood with σ_ε = 1, and produce different posterior maps by conditioning on these samples. A number of joint maps are tested. They all employ a first-order polynomial basis and are constructed from 10⁵ samples generated from the prior and likelihood, with the different choices of marginal distribution for d listed in Table 5.3. For cases 4 to 8, the samples are "fixed" on a grid in order to mimic uniform distributions while maintaining control over the uniformity and frequency of the samples; their purpose is also to test the numerical effects of using multiple exactly repeated (i.e., stacked) design samples.

Case   d marginal
1      N(1, 0.1²)
2      N(0, 5²)
3      N(−2, 0.5²)
4      uniform on a grid over [0, 2]
5      uniform on a grid over [0, 2], with each point repeated 10 times
6      uniform on a grid over [−3, 0]
7      uniform on a grid over [−3, 0], but with 50% of the points at exactly d = 1
8      same as case 7, but with a 3rd-order map

Table 5.3: Marginal distributions of d used to construct the joint map.

Posterior density functions for a particular sample realization from the different joint maps are shown in Figure 5-5, with those from additional sample realizations shown in Figure 5-6. In all the figures, the posteriors from cases 1 and 4 match the analytic results most closely. This is expected, since the joint map for the former is constructed using exactly the d distribution d◦, while the latter covers it well via a uniform grid. Case 2 is less accurate because it "over-covers", placing accuracy weight also in regions we are not interested in. Case 3 is even less accurate because it is concentrated in a narrow region of d that is much farther from the bulk of d◦. Case 5 is essentially identical to case 4, since increasing the samples proportionally does not change the d distribution. Case 6 is again inaccurate because its samples do not provide good coverage of the design region of interest. Case 7 improves upon case 6 as more samples at the mean of d◦ are added, but remains inaccurate. Cases 4 to 8 also demonstrate that the map construction algorithm is numerically sound even when there are samples with identical values of d.

Overall, poor posterior estimates can be attributed to two main factors. First, accuracy deteriorates as the d marginal differs more (in a loose sense of "rough coverage", rather than its precise form, such as whether it is Gaussian, uniform, or grid-based) from d◦. Second, since the joint distribution is not multivariate Gaussian for this example, there is truncation error from using a linear polynomial basis for the joint map. This is further supported by case 8, which is the same as case 7 but uses a 3rd-order polynomial basis, and shows improved results over case 7.

Figure 5-5: Example 5.5.1: posteriors from joint maps constructed under different d distributions, shown for the realization d = 0.87299, y = 3.0213 (legend: analytic, cases 1–8).

5.5.3 Generating samples in sequential design

To construct the joint map described in Equation 5.40 for the sOED problem, we need to generate samples of θ and of d_k, y_k for k = 0, …, N − 1. In particular, the joint density function has the form

\[
f(\theta, d_0, y_0, \ldots, d_{N-1}, y_{N-1}) = \left[\prod_{k=0}^{N-1} f(y_k|d_k, \theta)\, f(d_k)\right] f(\theta),
\tag{5.44}
\]

where the y_k are independent conditioned on θ and d_k (simply because the noise in the likelihood model is independent), and d_k and θ are (marginally) independent. θ can be generated naturally from the prior f(θ), and y_k from the likelihood f(y_k|d_k, θ) given d_k and θ; the only missing piece is f(d_k).

As illustrated in the previous subsection, we may choose any marginal f(d_k) without changing the posteriors generated from the joint map. Furthermore, it is advantageous to select an f(d_k) that is in proportion to how often we will visit the designs. This is precisely the d_k distribution induced by the optimal policy and the associated numerical methods; the same concept was introduced and discussed in Section 4.3 in the context of regression for value function approximation. However, not only do we not have the optimal policy, we cannot even generate d_k from an approximate policy in the one-step lookahead form of Equation 4.1, since doing so requires performing inference, a capability provided by the very joint map we are trying to construct. The only choice, then, is to generate d_k from a distribution that does not require inference, such as random exploration. This is the method we employ: we generate samples of θ from the prior, d_0, …, d_{N−1} from an exploration policy, and finally y_0, …, y_{N−1} from the likelihood.
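A minimal sketch of this sampling scheme for Equation 5.44, assuming a scalar linear-Gaussian model and a Gaussian exploration measure for each d_k (both illustrative stand-ins for the problem-specific prior, likelihood, and exploration policy):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_exploration_trajectories(R, N, s0=0.0, sig0=3.0, sig_eps=1.0):
    """Draw R trajectories (theta, d_0, y_0, ..., d_{N-1}, y_{N-1}) per Eq. 5.44:
    theta from the prior, d_k from an exploration measure, y_k from the likelihood."""
    theta = rng.normal(s0, sig0, size=R)                   # prior
    cols = [theta]
    for k in range(N):
        d_k = rng.normal(1.25, 0.5, size=R)                # exploration measure for d_k
        y_k = theta * d_k + rng.normal(0.0, sig_eps, size=R)  # y = theta*d + eps
        cols += [d_k, y_k]
    return np.column_stack(cols)                           # R x (1 + 2N) training matrix

samples = sample_exploration_trajectories(R=100_000, N=2)
print(samples.shape)  # (100000, 5): columns theta, d0, y0, d1, y1
```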
As this "exploration joint map" is constructed using only exploration samples, its performance is not optimal when it is used under other policies (i.e., under exploitation). In practice, the design distributions induced by exploitation are usually more complicated and more concentrated than those of exploration, and thus using an exploration joint map built from a generally "wider" marginal can be regarded as a conservative stance. A natural extension is to use exploitation samples to construct new maps that would be more accurate for exploitation. In fact, these samples are readily available from the state measure update procedure described in Section 4.3. However, preliminary testing of this idea frequently caused numerical instability of the sOED algorithm, as inaccurate inference evaluations lead to unrealistically high KL estimates; incorporating this idea in a stable and accurate manner is a promising direction for future research.

5.5.4 Evaluating the Kullback-Leibler divergence

Evaluation of the KL divergence is a core component of information-based OED; it is a non-trivial task for non-Gaussian random variables represented by transport maps. A straightforward method is Monte Carlo sampling. Samples from the base distribution (the one with respect to which the expectation is taken) are first obtained by sampling from the reference distribution and pulling the samples through the map. Their density values can then be evaluated via formulas such as Equation 5.8, and a Monte Carlo estimate of the KL integral can be formed. While the inversion of an exact map is always possible, monotonicity is only enforced at the sample points used when constructing an approximate map (Equation 5.24). When monotonicity is lost, not only does the map inversion yield multiple roots, but the density function formula also becomes invalid. Subsequently, unrealistically high values of the KL divergence may surface, leading to numerical instability in the ensuing regression systems. In practice, loss of monotonicity may occur for high-dimensional and non-Gaussian distributions, especially when observations are dominated by a highly nonlinear signal, and when exploitation joint maps are attempted.

An alternative approach for estimating the KL divergence is to first apply a linear truncation to the polynomial map basis, and then apply the analytic formula for the KL divergence between Gaussian random variables. Effectively, the random variables are "Gaussianized", but this procedure differs from the Laplace approximation since the linearization here is not necessarily performed at the mode. We emphasize that this approach uses Gaussian approximations only for evaluating the KL divergence; it is different from simply using Gaussian approximations throughout the sOED process, since here higher-order information is still propagated throughout inference. While the Monte Carlo sampling method reflects higher-order information in its KL estimate as well, the computation of the truncation approach is stable and can be performed much more quickly. We thus adopt this approach for the numerical examples presented in Chapter 7.

As examples, we describe the precise truncation process for 1D and 2D maps with 3rd-order monomial bases. A 1D map has the form

\[
\xi = a_0 + a_1 z + a_2 z^2 + a_3 z^3,
\tag{5.45}
\]

where ξ ∼ N(0, 1) is the Gaussian reference random variable and z is the target. The linear truncation is then

\[
\xi = a_0 + a_1 \tilde z,
\tag{5.46}
\]

where a simple inversion yields

\[
\tilde z = \frac{\xi - a_0}{a_1}.
\tag{5.47}
\]

This form implies that z̃ ∼ N(−a_0/a_1, 1/a_1²). The 2D case is slightly more complicated, where the map now has the form

\[
\xi_0 = a_0 + a_1 z_0 + a_2 z_0^2 + a_3 z_0^3,
\tag{5.48}
\]
\[
\xi_1 = b_0 + b_1 z_0 + b_2 z_1 + b_3 z_0^2 + b_4 z_0 z_1 + b_5 z_1^2 + b_6 z_0^3 + b_7 z_0^2 z_1 + b_8 z_0 z_1^2 + b_9 z_1^3.
\tag{5.49}
\]

The linear truncation is then

\[
\xi_0 = a_0 + a_1 \tilde z_0,
\tag{5.50}
\]
\[
\xi_1 = b_0 + b_1 \tilde z_0 + b_2 \tilde z_1,
\tag{5.51}
\]

and an inversion yields

\[
\tilde z_0 = \frac{\xi_0 - a_0}{a_1},
\tag{5.52}
\]
\[
\tilde z_1 = \frac{\xi_1 - b_0 - b_1 \tilde z_0}{b_2} = \frac{\xi_1 - b_0 - b_1 \frac{\xi_0 - a_0}{a_1}}{b_2},
\tag{5.53}
\]

which can be summarized in matrix form as

\[
\tilde z = \Sigma^{\frac{1}{2}} \xi + \mu,
\tag{5.54}
\]

with

\[
\Sigma^{\frac{1}{2}} =
\begin{bmatrix}
\frac{1}{a_1} & 0 \\
-\frac{b_1}{a_1 b_2} & \frac{1}{b_2}
\end{bmatrix},
\qquad
\mu =
\begin{bmatrix}
-\frac{a_0}{a_1} \\[4pt]
-\frac{b_0}{b_2} + \frac{a_0 b_1}{a_1 b_2}
\end{bmatrix}.
\tag{5.55}
\]

This form implies that z̃ ∼ N(µ, Σ), where Σ = (Σ^{1/2})(Σ^{1/2})⊤.
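A small sketch of this Gaussianization for two 1D cubic map components, using the linear truncation of Equations 5.45 to 5.47 together with the closed-form Gaussian KL divergence stated next in Equation 5.56; the coefficient values here are made up for illustration.

```python
import numpy as np

def truncate_1d(coeffs):
    # Linear truncation of xi = a0 + a1*z + a2*z^2 + a3*z^3 (Eqs. 5.46-5.47):
    # keep xi = a0 + a1*z, so z ~ N(-a0/a1, 1/a1^2).
    a0, a1 = coeffs[0], coeffs[1]
    return -a0 / a1, 1.0 / a1**2              # (mean, variance)

def gauss_kl_1d(mA, vA, mB, vB):
    # Univariate case of the Gaussian KL formula (Eq. 5.56).
    return 0.5 * (vA / vB + (mB - mA)**2 / vB - 1.0 + np.log(vB / vA))

# Hypothetical posterior-map and prior-map coefficients [a0, a1, a2, a3]:
post = truncate_1d(np.array([-0.8, 2.0, 0.05, 0.01]))
prior = truncate_1d(np.array([0.0, 1.0, 0.0, 0.0]))
print(gauss_kl_1d(*post, *prior))   # information-gain estimate usable as a reward
```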
For completeness, the analytic KL divergence formula for two Gaussian random variables z̃_A ∼ N(µ_A, Σ_A) and z̃_B ∼ N(µ_B, Σ_B) is

\[
D_{KL}\left(f_{\tilde z_A} \| f_{\tilde z_B}\right) = \frac{1}{2}\left[\mathrm{tr}\left(\Sigma_B^{-1}\Sigma_A\right) + (\mu_B - \mu_A)^\top \Sigma_B^{-1}(\mu_B - \mu_A) - n - \ln\left(\frac{\det \Sigma_A}{\det \Sigma_B}\right)\right],
\tag{5.56}
\]

where n is the dimension of the random variables. Since the map coefficients carry information that fully describes the random variable (up to truncation), it is a promising and valuable direction of future work to develop an analytic KL formula directly in terms of the map coefficients.

Figure 5-6: Example 5.5.1: additional examples of posteriors from joint maps constructed under different d distributions, shown for six further (d, y) realizations. The legend of Figure 5-5 applies.

Chapter 6

Full Algorithm Pseudo-code for Sequential Design

Combining the approximate dynamic programming techniques from Chapter 4 and the transport map technology from Chapter 5, we present the pseudo-code of our map-based algorithm for sequential optimal experimental design, outlined in Algorithm 3. We will also use a grid-based version of this algorithm for comparing the numerical results of examples in Chapter 7 that involve a 1D parameter space. In this variation, instead of using a transport map to represent the posterior, a grid is used to capture its probability density function. Whenever Bayesian inference is performed within the algorithm, the grid needs to be adapted in order to ensure reasonable coverage and grid resolution for the posterior density. A simple scheme first computes the unnormalized posterior density values on the current grid, and decides whether grid expansion is needed on either side based on a threshold: the ratio of the grid end-point density value to the grid mode density value. Second, a uniform grid is laid over the expanded regions, with new unnormalized posterior density values computed. Finally, a new grid over the original and expanded regions is constructed such that the probability masses between neighboring grid points are equal; this provides a mechanism for sparsifying the grid in regions of low density. Results from this grid method are used as a reference of comparison for the map-based algorithm, since the inference computations of the former generally involve fewer approximations. With respect to Algorithm 3, the grid method no longer requires line 3, and the inference computations in lines 5, 7, and 12 use the grid adaptation procedure described above.

Algorithm 3: Algorithm for map-based sequential optimal experimental design.
1  Set parameters: select the features {z_k}, ∀k, the exploration measure, and L, R_0, R, T;
2  Initial exploration: simulate R_0 exploration trajectories by sampling θ from the prior, d_k from the exploration measure, and y_k from the likelihood, ∀k, without inference;
3  Make exploration joint map: construct T_explore from these samples;
4  for ℓ = 1, …, L do
5    Exploration: simulate R exploration trajectories by sampling θ from the prior, d_k from the exploration measure, and y_k from the likelihood, ∀k, with inference using T_explore;
6    Store all states visited: X_{k,explore}^ℓ = {x_k^r}_{r=1}^{R}, ∀k;
7    Exploitation (if ℓ > 1): simulate T exploitation trajectories by sampling θ from the prior, d_k from the one-step lookahead policy µ_k^{ℓ−1}(x_k) = argmax_{d'_k} E_{y_k|x_k,d'_k}[g_k(x_k, y_k, d'_k) + J̃_{k+1}^{ℓ−1}(F_k(x_k, y_k, d'_k))], and y_k from the likelihood, ∀k, with inference using T_explore;
8    Store all states visited: X_{k,exploit}^ℓ = {x_k^t}_{t=1}^{T}, ∀k;
9    Approximate value iteration: construct the J̃_k^ℓ functions via backward induction using the new regression points {X_{k,explore}^ℓ ∪ X_{k,exploit}^ℓ}, ∀k, as described by the loop below;
10   for k = N − 1, …, 1 do
11     for rt = 1, …, R + T, where the x_k^{rt} are all members of {X_{k,explore}^ℓ ∪ X_{k,exploit}^ℓ}, do
12       Compute training values Ĵ_k^ℓ(x_k^{rt}) = max_{d'_k} E_{y_k|x_k^{rt},d'_k}[g_k(x_k^{rt}, y_k, d'_k) + J̃_{k+1}^ℓ(F_k(x_k^{rt}, y_k, d'_k))], with inference performed using T_explore;
13     end
14     Construct J̃_k^ℓ = Π Ĵ_k^ℓ by regression on the training values;
15   end
16  end
17  Extract the final policy parameterization: J̃_k^L, ∀k;

Chapter 7

Numerical Results

We present several numerical examples of the sequential optimal experimental design (sOED) problem in this chapter. Each example serves a different purpose in highlighting various properties and observations. Through them, we demonstrate:

• Linear-Gaussian problem (Section 7.1):
  – the ability of the numerical methods developed in this thesis to solve an sOED problem for which the analytic solution is available for comparison;
  – agreement between results generated from the analytic, grid, and map representations of the belief state, along with their associated inference methods.
• 1D contaminant source inversion problem (Section 7.2):
  – Case 1: advantages of sOED over batch (open-loop) design;
  – Case 2: advantages of sOED over greedy (myopic) design;
  – Case 3: performance of sOED using the map method, and comparison to the grid method (as a reference solution).
• 2D contaminant source inversion problem (Section 7.3):
  – the ability of the numerical methods to handle the more complicated setting of multiple experiments and multiple dimensions.

Details of these numerical examples are described in the subsequent sections.

7.1 Linear-Gaussian problem

7.1.1 Problem setup

Consider a forward model that is linear with respect to the parameters, has no physical state component, and whose observations are corrupted by additive Gaussian noise:

\[
y_k = G(\theta, d_k) + \epsilon = \theta d_k + \epsilon.
\tag{7.1}
\]

The prior on θ is N(s_0, σ_0²), ε ∼ N(0, σ_ε²), and d ∈ [d_L, d_R]. The resulting inference problem on θ has a conjugate Gaussian structure, and all subsequent posteriors are Gaussian, with

\[
s_{k+1} = \frac{\frac{y_k/d_k}{\sigma_\epsilon^2/d_k^2} + \frac{s_k}{\sigma_k^2}}{\frac{1}{\sigma_\epsilon^2/d_k^2} + \frac{1}{\sigma_k^2}},
\qquad
\sigma_{k+1}^2 = \frac{1}{\frac{1}{\sigma_\epsilon^2/d_k^2} + \frac{1}{\sigma_k^2}}.
\tag{7.2}
\]

We consider the design of N = 2 experiments, with s_0 = 0, σ_0 = 3, σ_ε = 1, d_L = 0.1, and d_R = 3.
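A sketch of the conjugate update in Equation 7.2, which underlies the analytic belief state representation below; the posterior is returned as a (mean, variance) pair, and the design/observation values in the demo are arbitrary.

```python
def gaussian_update(s_k, var_k, d_k, y_k, var_eps=1.0):
    """Conjugate posterior update of Eq. 7.2 for y_k = theta * d_k + eps."""
    prec_data = d_k**2 / var_eps           # precision contributed by the data
    prec_post = prec_data + 1.0 / var_k    # posterior precision
    s_next = (y_k * d_k / var_eps + s_k / var_k) / prec_post
    return s_next, 1.0 / prec_post

# Two experiments starting from the prior N(0, 3^2):
s, v = 0.0, 9.0
for d, y in [(0.6, 0.9), (0.35, 0.2)]:     # illustrative design/observation values
    s, v = gaussian_update(s, v, d, y)
print(s, v)
```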
Three methods of belief state representation and inference are studied in this example:¹

• analytic representation: x_{k,b} = (s_k, σ_k²), with exact inference using Equation 7.2;
• grid representation of the posterior density function: x_{k,b} is a grid on f(θ|I_k), and the simple grid adaptation scheme described in Chapter 6 is used for inference; and
• map representation: x_{k,b} is the set of posterior map coefficients, with inference performed by conditioning a joint map composed of a total-order polynomial basis (with sparsification in the d_k and θ dimensions, as illustrated in Equation 5.40) that is constructed using trajectories from a prescribed exploration policy.

¹For simplicity, these different methods of belief state representation will also be referred to as the "analytic method", "grid method", and "map method" in this chapter.

For this linear-Gaussian problem, the grid method uses grids of 50 nodes; the map method uses monomial basis functions of total order 3. The joint map has a total of N(n_θ + n_d + n_y) = 6 dimensions and 129 basis terms, and the coefficients are determined using 10⁵ exploration trajectories with the exploration policy designated by d_k ∼ N(1.25, 0.5²). All posterior maps are 1D 3rd-order polynomials and thus have 4 coefficients.

The reward functions used are

\[
g_k(x_k, y_k, d_k) = 0, \quad k = 0, 1,
\tag{7.3}
\]
\[
g_N(x_N) = D_{KL}\left(f_{\theta|I_N}(\cdot|I_N) \,\|\, f_\theta(\cdot)\right) - 2\left(\ln \sigma_N^2 - \ln 2\right)^2.
\tag{7.4}
\]

The terminal reward is a combination of information gain and a penalty on deviation from a log-variance target. The latter increases the difficulty of this problem by moving the optimal policy away from the design space boundary and avoiding the construction of fortuitous policies.² The analytic formula for the Kullback-Leibler (KL) divergence between two univariate Gaussians involves operations on the means and log-variances of the Gaussians; this motivates the selection of the value function features φ_{k,i} (in Equation 4.2) to be 1, s_k, ln(σ_k²), s_k², (ln(σ_k²))², and s_k ln(σ_k²). The features are evaluated by trapezoidal-rule integration for the grid method, and by inversion of a linear truncation for the map method. The KL divergence is approximated by first estimating the mean and variance using these techniques, and then applying the analytic KL formula for Gaussians. Since we know the posteriors should all be Gaussian in this example, these approximations are expected to be quite accurate.
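A sketch of the terminal reward in Equation 7.4, using the analytic KL divergence between the Gaussian posterior and prior (a univariate special case of Equation 5.56); the function names are ours.

```python
import numpy as np

def gauss_kl(m_post, v_post, m_prior, v_prior):
    # KL( N(m_post, v_post) || N(m_prior, v_prior) ) for univariate Gaussians.
    return 0.5 * (v_post / v_prior + (m_post - m_prior)**2 / v_prior
                  - 1.0 + np.log(v_prior / v_post))

def terminal_reward(s_N, var_N, s0=0.0, var0=9.0):
    """Eq. 7.4: information gain minus a penalty on deviation of the final
    log-variance from the target ln 2."""
    return gauss_kl(s_N, var_N, s0, var0) - 2.0 * (np.log(var_N) - np.log(2.0))**2

print(terminal_reward(0.5, 2.0))  # zero penalty when var_N hits the target of 2
```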
L = 3 iterations of state measure updates are conducted, with regression points generated by exploration only for ℓ = 1, and by 30% exploration and 70% exploitation for subsequent iterations. The analytic method uses 1000 regression points, while the grid and map methods use 500.

The policies generated from the different methods are compared by applying them in 1000 simulated trajectories; this procedure is summarized in Algorithm 4. Each policy is first applied in trajectories under the same belief state representation method originally used to construct that policy. Then, inference is performed on the resulting sequence of designs and observations using a common evaluation framework, regardless of how the trajectory was produced; we use the analytic method as this common framework in this example. This ensures a fair comparison between policies: the designs are produced using the "native" belief state representation that the policy was originally created for, while all final trajectories are evaluated using a "common" method. We note that there is also a distribution over the final policy due to the randomness involved in the numerical methods (e.g., repeating the algorithm to construct the policy would not result in exactly the same policy, simply due to the different random numbers used in the simulations). We currently do not take this policy distribution into account in the assessment; instead, only a single policy realization is used to generate all 1000 trajectories. A more comprehensive study that repeats the policy construction algorithm many times may be conducted in the future, although such an undertaking would be extremely expensive.

²Without the second term in the terminal reward, the optimal policies will always be those that lead to the highest achievable signal, which occurs at the d_k = 3 boundary. The problem is then more vulnerable to fortuitously producing policies that lead to boundary designs, even when the overall value function approximation may be poor.

Algorithm 4: Procedure for evaluating policies by simulating trajectories.
1  Select the "native" belief state representation used to generate the policy: for example, analytic, map, or grid; see Section 7.1.1;
2  Construct the policy: use the native belief state representation and the numerical methods developed in this thesis for solving the sOED problem;
3  for q = 1, …, n_trajectories do
4    Apply the policy: generate a trajectory using the native belief state representation: sample θ from the prior, evaluate d_k by applying the constructed policy, and sample y_k from the likelihood, for k = 0, …, N − 1;
5    Evaluate rewards via a "common" evaluation framework: perform inference on the d_k and y_k values from this trajectory and evaluate all rewards, using the analytic state representation;
6  end

7.1.2 Results

Since this example has a horizon of N = 2, only J̃_1 is constructed via function approximation, while J_2 is evaluated directly (numerically, for the grid and map methods). These surfaces, plotted against the posterior mean and variance, are shown in Figure 7-1, along with the regression points used to build them. Excellent agreement across all three methods is observed, and there is a noticeable change in the distribution of regression points from ℓ = 1 (regression points from exploration only) to ℓ = 2 (regression points from a mixture of exploration and exploitation), leading to a better approximation of the policy-induced state measure. The regression points appear to be grouped more closely together for ℓ = 1 even though they come from exploration, because exploration in fact covers a large region of the d_k space that leads to small values of σ_k². In this simple example, the particular choice of exploration design measure did not have a noticeable negative impact on the total reward for ℓ = 1 (Figure 7-5). However, this can easily be reversed for problems with more complicated value functions and less suitable choices of exploration design measures.

Figure 7-1: Linear-Gaussian problem: J̃_1 surfaces and the regression points used to build them, for the (a) analytic, (b) grid, and (c) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.
Histograms of d_0 and d_1 are shown in Figures 7-2 and 7-3. While overall the agreement between the three methods is quite good, d_0 for the grid and map methods at ℓ = 2 and ℓ = 3 is concentrated slightly to the left of that for the analytic method. The opposite is true for d_1, where the grid and map methods at ℓ = 2 and ℓ = 3 are slightly to the right of the analytic method. This is because the optimal policy is not unique for this problem, and there is a natural notion of exchangeability between the two experimental designs d_0 and d_1. With no stage cost, the overall objective of this problem combines the expected KL divergence with a penalty on the distance of the final log-variance from the target log-variance. This quantity can be shown to be a function only of the final variance, which is determined exactly by the values of d_k through Equation 7.2 (it is not affected by the observations y_k). In fact, this linear-Gaussian problem (with constant noise variance) is a deterministic problem, and the optimal policy is reducible to optimal designs d_0^* and d_1^*. Batch design would produce the same optimal designs as sOED for deterministic problems. An analytic derivation of the optimal designs and the expected utility surface for this problem is presented in Appendix B, with

\[
d_0^{*2} + d_1^{*2} = \frac{1}{9}\left[\exp\left(\frac{18014398509481984 \ln 3 - 5117414861322735}{9007199254740992}\right) - 1\right],
\tag{7.5}
\]

and

\[
U\left(d_0^*, d_1^*\right) \approx 0.783289.
\tag{7.6}
\]

Indeed, there is a "front" of optimal designs, as different combinations of d_0 and d_1 can together lead to the same underlying optimal final variance.
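The optimal front can be recovered numerically from the deterministic structure noted above. For this conjugate model, the KL divergence from posterior to prior has expectation ½ ln(σ_0²/σ_N²) under the prior predictive (a standard conjugate-Gaussian identity), so the expected utility depends only on the final variance. The following quick check is our own; it is consistent with Equations 7.5 and 7.6.

```python
import numpy as np
from scipy.optimize import minimize_scalar

sig0_sq = 9.0   # prior variance (sigma_0 = 3)

def expected_utility_from_variance(v):
    # Expected KL divergence is 0.5 * ln(sig0^2 / v) for this conjugate model;
    # subtract the log-variance penalty of Eq. 7.4.
    return 0.5 * np.log(sig0_sq / v) - 2.0 * (np.log(v) - np.log(2.0))**2

res = minimize_scalar(lambda v: -expected_utility_from_variance(v),
                      bounds=(0.1, 9.0), method="bounded")
v_opt = res.x
print(v_opt)                                  # ~ 2*exp(-1/8) ~ 1.765
print(expected_utility_from_variance(v_opt))  # ~ 0.783289, matching Eq. 7.6
print(1.0 / v_opt - 1.0 / sig0_sq)            # front d0^2 + d1^2 ~ 0.455 (Eq. 7.5)
```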
The pairwise and marginal kernel density estimates (KDEs) from the samples used to construct the joint exploration map, and from samples generated from the resulting map, are shown in Figure 7-6. Excellent agreement is observed between the two sets of KDEs. As is evident from, for example, the pairwise KDEs between dk and yk, the joint distribution is in general not Gaussian even for a linear-Gaussian problem (and even with Gaussian marginals on dk from the prescribed exploration design measure); this is discussed in Example 5.5.1 from a theoretical perspective.

[Figure 7-2: Linear-Gaussian problem: d0 histograms from 1000 simulated trajectories, for the (a) analytic, (b) grid, and (c) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.]

In summary, for the linear-Gaussian example we have shown the numerical results from sOED to agree with the analytic optimum. Furthermore, we have demonstrated agreement between the results from the analytic, grid, and map methods, each using its associated inference method. This is also a starting point in displaying the strength of the transport map technology, as well as of the overall method developed in this thesis for solving the sOED problem. Furthermore, the grid method can now be trusted as a comparison reference for the upcoming 1D nonlinear, non-Gaussian example, where an analytic representation of the belief state is not possible.

[Figure 7-3: Linear-Gaussian problem: d1 histograms from 1000 simulated trajectories, for the (a) analytic, (b) grid, and (c) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.]

7.2 1D contaminant source inversion problem

Consider a situation where a chemical contaminant is accidentally released into the air. The contaminant plume diffuses and is carried by the wind, posing a great danger to the general public. It is crucial to infer the source location of the contaminant, so that appropriate response actions may be taken to eliminate this threat. A state-of-the-art robotic vehicle is dispatched to take contaminant concentration measurements at a sequence of different locations, under a fixed time schedule.
We seek the optimal policy for where the vehicle should move to take measurements, in order to obtain the highest expected information gain about the source location.

[Figure 7-4: Linear-Gaussian problem: (d0, d1) pair scatter plots from 1000 simulated trajectories, superimposed on the analytic expected utility surface, for the (a) analytic, (b) grid, and (c) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.]

For simplicity, assume that the mean contaminant concentration G (a scalar), for a source at location θ, measured at location z and time t, has the value

\[ G(\theta, z, t) = \frac{s}{\sqrt{2\pi}\, 2\sqrt{0.3 + Dt}} \exp\!\left( -\frac{\| \theta + d_w(t) - z \|^2}{2(4)(0.3 + Dt)} \right), \tag{7.7} \]

where s, D, and dw(t) are the known source intensity, diffusion coefficient, and cumulative net displacement due to wind up to time t, respectively (their values will be specified later). A total of N measurements are taken, uniformly spaced in time, with the relationship t = k + 1 (while t is a continuous variable, it corresponds to the experiment index via this relationship; hence, y0 is taken at t = 1, y1 at t = 2, etc.).

[Figure 7-5: Linear-Gaussian problem: total reward histograms from 1000 simulated trajectories, for the (a) analytic, (b) grid, and (c) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error.]

The state is a combination of belief and physical state components. The grid and map methods of belief state representation, introduced in Section 7.1.1, are studied here (the analytic method is no longer available, since the nonlinear forward model leads to generally non-Gaussian posteriors). The relevant physical state is the current location of the vehicle: xk,p = z; the inclusion of the physical state is necessary since the optimal design is expected to depend on the vehicle position as well. The movement constraint of the vehicle over the next time unit is described by a box constraint dk ∈ [−dL, dR], where dL and dR reflect its movement range. The physical state dynamics then simply describe position and displacement: xk+1,p = xk,p + dk.

[Figure 7-6: Linear-Gaussian problem: pairwise and marginal KDEs over (d0, y0, d1, y1, θ) for (a) samples used to construct the exploration map and (b) samples generated from the resulting map.]
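For concreteness, a minimal Python sketch of this forward model follows, using the 1D-case values from Table 7.2 (s = 30, D = 0.1) and, as an example, the Case 1 wind displacement of Equation 7.11 below; the function names are ours.

import numpy as np

def dw_case1(t):
    # Equation 7.11 (below): calm air before t = 1, then wind of velocity 10.
    return 0.0 if t < 1 else 10.0 * (t - 1)

def G(theta, z, t, s=30.0, D=0.1, dw=dw_case1):
    # Mean plume concentration of Equation 7.7 at sensor location z, time t.
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    z = np.atleast_1d(np.asarray(z, dtype=float))
    width = 0.3 + D * t
    dist2 = np.sum((theta + dw(t) - z) ** 2)
    return s / (np.sqrt(2.0 * np.pi) * 2.0 * np.sqrt(width)) \
           * np.exp(-dist2 / (2.0 * 4.0 * width))

# Example: the signal at the initial vehicle position (x0,p = 5.5 in Table 7.2)
# for a source at the origin is tiny at t = 1, i.e., dominated by noise.
print(G(0.0, 5.5, t=1))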
The concentration measurements are corrupted by additive Gaussian noise:

\[ y_k = G(\theta, x_{k+1,p}, k+1) + \epsilon_k(x_k, d_k), \tag{7.8} \]

where the noise εk ∼ N(0, σ²εk(xk, dk)) may depend on the state and the design. When simulating a trajectory, the physical state must be propagated before an observation yk can be generated, since the latter requires evaluating G at xk+1,p. Once yk is obtained, the belief state can then be propagated via Bayesian inference. The reward functions used in this problem are

\[ g_k(x_k, y_k, d_k) = -c_b - c_q \|d_k\|_2^2, \quad k = 0, \ldots, N-1, \tag{7.9} \]
\[ g_N(x_N) = D_{KL}\big( f_{\theta|I_N}(\cdot|I_N) \,\|\, f_\theta(\cdot) \big). \tag{7.10} \]

The terminal reward is simply the KL divergence, and the stage reward consists of a base cost of operation plus a penalty quadratic in the vehicle movement distance.

We start with the 1D version of the problem, where θ, dk, and xk,p are scalars (i.e., the plume and vehicle are confined to movement along a line). Problem and algorithm settings common to all 1D cases can be found in Tables 7.2 and 7.3, and additional variations will be described in each of the following case subsections.

Table 7.2: Contaminant source inversion problem: problem settings.

                                                1D case      2D case
    Number of experiments N                     2            3
    Prior on θ                                  N(0, 2²)     N((0,0)ᵀ, diag(2², 2²))
    Design constraints on dk                    [−3, 3]      [−5, 5]²
    Initial physical state x0,p                 5.5          (5, 5)
    State measure updates L                     3            3
    Concentration strength s                    30           50
    Diffusion coefficient D                     0.1          1
    Base operation cost cb                      0.1          0
    Quadratic movement cost coefficient cq      0.1          0.03

Table 7.3: Contaminant source inversion problem: algorithm settings.

                                                      1D case      2D case
    Grid method: number of grid points                100          —
    Map method: map total order                       3            3
    Map method: number of map construction samples    10⁶          5×10⁵
    Exploration policy measure on dk                  N(0, 2²)     N((0,0)ᵀ, diag(3², 3²))
    Total number of regression points                 500          1000
    % of regression points from exploration           30%          20%
    Max number of optimization iterations             50           30
    Monte Carlo sample size in optimization           100          10
    Robbins-Monro harmonic gain sequence multiplier   5            15

7.2.1 Case 1: comparison with greedy (myopic) design

This case highlights the advantage of sOED over greedy design, which is accentuated when factors in the future are important for designing the current experiments. We illustrate this via a wind factor: the air is calm initially, and then a constant wind of velocity 10 commences at t = 1, leading to the following cumulative net displacement due to wind up to time t:

\[ d_w(t) = \begin{cases} 0, & t < 1 \\ 10(t-1), & t \ge 1. \end{cases} \tag{7.11} \]

Intuitively, greedy design is not able to take the wind into account when designing the first experiment. Batch design (not presented in this case), however, would be able to. The observation noise standard deviation is set to σεk = 2. For sOED, only the grid method (described in Section 7.1.1, but now using 100 nodes) is used in this case, for demonstration purposes (we focus on comparing sOED with greedy design here; the map method will be studied later, in Case 3). Motivated by the analytic KL divergence formula between Gaussians, the value function features are selected to be 1, the posterior mean, the posterior log-variance, the physical state, and their squares and cross terms, for a total of 10 terms. The moments are evaluated by trapezoidal-rule integration. The KL divergence is approximated by first estimating the mean and variance using this technique, and then applying the analytic KL formula for Gaussians. No state measure update is performed in this case (i.e., L = 1); the effects of state measure updates will be studied later, in Case 3.
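For reference, the analytic KL divergence formula between univariate Gaussians invoked here is the standard identity

\[
D_{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\big\|\,\mathcal{N}(\mu_2, \sigma_2^2)\big)
= \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2},
\]

applied with the trapezoidal-rule estimates of the posterior moments (μ1, σ1²) and the prior moments (μ2, σ2²).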
For greedy design, the same grid method is used to represent the belief state. As in the linear-Gaussian problem, the policies generated from the different methods are compared by applying them in 1000 simulated trajectories using Algorithm 4, except that the common evaluation framework at the end now uses a high-resolution grid method with 1000 nodes, since the analytic method is no longer available in this non-Gaussian setting.

Before presenting the results, we first provide some intuition about the physical phenomenon through Figure 7-7, which shows the progression of a sample trajectory. The left panel displays the physical space, with the robotic vehicle starting at the black square location. For the first experiment, it moves to a new location and acquires the noisy observation indicated by the blue cross, while the solid blue curve indicates the plume signal profile G at that time. For the second experiment, the vehicle moves to another new location and acquires the noisy observation indicated by the red cross, while the dotted red curve indicates the plume signal profile G at that time, after having diffused slightly and been carried to the right by the wind. The right panel shows the corresponding belief state density functions at different stages, constructed using the grid method. Starting from the solid blue prior density, the dashed red posterior density after the first experiment is only slightly narrower, since the first observation (blue cross) lies in a region dominated by the measurement noise. The dotted yellow final posterior after both experiments, however, becomes much narrower, as the second observation (red cross) lies in the high-gradient region of the plume profile (and thus carries high information for identifying θ). The posteriors can become quite non-Gaussian and even multimodal. The black circle indicates the true θ value; the posterior modes do not necessarily match this value, due to the noisy measurements and the finite number of observations.

[Figure 7-7: 1D contaminant source inversion problem, case 1: (a) physical state and plume progression (plume profiles G at t = 1 and t = 2, vehicle start z0, observations y0 and y1) and (b) belief state density progression (densities of x0,b, x1,b, x2,b, and the true value θ*) of a sample trajectory.]
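A minimal sketch of this grid-based belief update, assuming the plume model G sketched earlier and a fixed uniform θ-grid (the thesis uses 100 nodes here; any adaptivity details are omitted):

import numpy as np

def grid_bayes_update(theta_grid, belief_pdf, y, z, t, sigma_eps, G):
    # One Bayesian update of a gridded belief state after observing y at
    # sensor location z and time t; normalization and moments use the
    # trapezoidal rule, as in the text.
    means = np.array([G(th, z, t) for th in theta_grid])
    likelihood = np.exp(-0.5 * ((y - means) / sigma_eps) ** 2)
    post = belief_pdf * likelihood
    post /= np.trapz(post, theta_grid)
    mean = np.trapz(theta_grid * post, theta_grid)
    var = np.trapz((theta_grid - mean) ** 2 * post, theta_grid)
    return post, mean, var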
The pairwise (d0, d1) scatter plots for 1000 simulated trajectories are shown in Figure 7-8. Greedy designs generally move towards the left in the first design (negative values of d0), since for almost all realizations of θ (generated from the prior), the main part of the plume starts to the left of the initial vehicle location. When designing the first experiment, greedy design does not know that there will be a second experiment and that the wind will blow the plume back to the right; it thus exerts a great effort to move to the left. Similarly, when designing the second experiment, it then chases the plume, which is now on its right (positive values of d1). sOED, however, generally starts heading to the right in the first experiment right away, so that it can arrive in the regions of higher information gain in time for the second experiment, after the plume has been carried by the wind. In both approaches, there are a few cases where d1 is very close to zero. These cases correspond to θ sampled from the right tail of the prior, which places the plume much closer to the initial vehicle location. As a result, a high amount of information is obtained from the first observation. The plume is subsequently carried 10 units to the right by the wind, and the vehicle cannot reach regions that would yield a high enough amount of information in the second experiment to justify its d1 movement cost. The best action is then to simply stay put.

The "chasing" tendency of greedy design turns out to be costly overall, due to the quadratic movement penalty. This is reflected in Figure 7-9, which shows histograms of the total rewards from the trajectories. sOED yields a mean reward of 0.12 ± 0.02, whereas greedy design produces a much lower mean reward of 0.07 ± 0.02; the plus-minus quantity is 1 standard error.

[Figure 7-8: 1D contaminant source inversion problem, case 1: (d0, d1) pair scatter plots from 1000 simulated trajectories for (a) greedy design and (b) sOED.]

[Figure 7-9: 1D contaminant source inversion problem, case 1: total reward histograms from 1000 simulated trajectories for (a) greedy design (mean = 0.07 ± 0.02) and (b) sOED (mean = 0.12 ± 0.02). The plus-minus quantity is 1 standard error.]

7.2.2 Case 2: comparison with batch (open-loop) design

This case highlights the advantage of sOED over batch design, which is accentuated when information useful for designing experiments can be obtained from performing some of the experiments first (i.e., feedback). We illustrate this via different measurement devices: the robotic vehicle carries two measuring instruments, a "rough" device that achieves an observation noise standard deviation of σεk = 2, and a "precise" device with σεk = 0.5. The precise device is much more expensive to operate. Fortunately, the device cost is charged to the funding agency, and is not reflected in our reward functions. However, the agency only permits (and requires) its use in promising situations where the current posterior variance is below a threshold of 3.0 (recall that the prior variance is 4.0).³ The observation noise standard deviation is then

\[ \sigma_{\epsilon_k}(x_{k,b}) = \begin{cases} 0.5, & \text{if the variance corresponding to } x_{k,b} < 3 \\ 2, & \text{otherwise.} \end{cases} \tag{7.12} \]

³ Which instrument is used is then not a design decision.

Intuitively, batch design is not able to use the first observation to update the belief state; it is thus unable to take advantage of the feedback of information and, with it, the opportunity to use the precise device. Greedy design (not presented in this case), however, would be able to. The same wind conditions from Equation 7.11 are also applied.

The same grid method setup as in Case 1 is used for both sOED and batch design. Policies are also compared using the same technique as in Case 1, with 1000 simulated trajectories, but with one caveat. In this case, since the measurement noise depends on the belief state, and therefore on the method of belief state representation, the noise standard deviation is also recorded as the observations are generated. The correct corresponding standard deviation is then used when inference is performed in the common evaluation framework. In other words, while the belief state governs which measuring device is used, we always know which device was in fact used to obtain any particular observation.
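The two-tier device rule of Equation 7.12 amounts to a one-line function of the belief-state variance; a minimal sketch (Case 3, Equation 7.13 below, reuses the same rule with precise=0.2 and threshold=2):

def sigma_eps(belief_var, threshold=3.0, precise=0.5, rough=2.0):
    # Equation 7.12: the precise device is permitted (and required) only
    # when the current posterior variance is below the threshold.
    return precise if belief_var < threshold else rough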
As expected, batch design is able to account for the future wind effect, and starts moving to the right in the first experiment so that it can arrive in the regions of higher information in time for the second experiment, after the plume has been carried by the wind. sOED, however, realizes that there is the possibility of using the precise device if it can reduce the posterior variance to less than 3 from the first observation. It thus moves to the left, towards the plume location, in the first experiment to get a more informative observation, even though the movement cost is higher. Roughly 55% of these trajectories achieve the requirement for using the precise device in the second experiment, and they produce a mean reward of 0.51, in contrast to −0.01 for trajectories that fail to qualify for this technology. Effectively, sOED has taken a risk in order to achieve an overall higher expected reward. The risk factor is not in the current problem formulation, but it certainly should be considered in practice, especially for crucial missions where there is perhaps only one chance to ensure public safety. The histograms of total rewards from all trajectories are shown in Figure 7-11. The risk taken by sOED indeed pays off, as it sees a mean reward of 0.28 ± 0.02, whereas batch design produces a much lower mean reward of 0.11 ± 0.02; the plus-minus quantity is 1 standard error.

[Figure 7-10: 1D contaminant source inversion problem, case 2: (d0, d1) pair scatter plots from 1000 simulated trajectories for (a) batch design and (b) sOED. Roughly 55% of the sOED trajectories qualify for the precise device in the second experiment. However, there is no particular pattern or clustering of these designs, and thus we do not separately color-code them in the scatter plot.]

[Figure 7-11: 1D contaminant source inversion problem, case 2: total reward histograms from 1000 simulated trajectories for (a) batch design (mean = 0.11 ± 0.02) and (b) sOED (mean = 0.28 ± 0.02). The plus-minus quantity is 1 standard error.]

7.2.3 Case 3: sOED grid and map methods

This case investigates the performance of the map method under the sOED algorithm developed in this thesis. The same wind conditions from Equation 7.11 are applied, and a two-tier measuring device system similar to Equation 7.12 is implemented, with slightly different parameters:

\[ \sigma_{\epsilon_k}(x_{k,b}) = \begin{cases} 0.2, & \text{if the variance corresponding to } x_{k,b} < 2 \\ 2, & \text{otherwise.} \end{cases} \tag{7.13} \]

This case setting is chosen so that sOED can show an advantage over both greedy and batch designs at the same time.

For sOED, the grid and map methods described in Section 7.1.1 are studied. The settings for both methods can be found in Table 7.3. The map method uses monomial basis functions of total order 3. The joint map has a total of N(nd + ny + nθ) = 6 dimensions and 129 basis terms, and its coefficients are determined using 10⁶ exploration trajectories with the exploration policy designated by dk ∼ N(0, 2²). All posterior maps are 1D 3rd-order polynomials and thus have 4 coefficients. Furthermore, for the map method, the moments used in the value function features are estimated by inverting the linear truncation. The KL divergence is approximated by first estimating the mean and variance using this technique, and then applying the analytic KL formula for Gaussians.
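The fast inference step, conditioning a joint polynomial map to obtain a posterior map, reduces to collapsing known powers of the conditioned variables into the coefficients. A minimal sketch for a single conditioning variable follows (the thesis conditions on all design and observation dimensions jointly; the monomial indexing here is illustrative, not the thesis implementation):

import numpy as np

def condition_map(c, y_star):
    # c[i, j] is the joint-map coefficient on y^i * theta^j for the
    # theta-component of a lower-triangular (Knothe-Rosenblatt) map.
    # Conditioning on y = y_star yields 1D posterior-map coefficients b[j]
    # with xi_theta = sum_j b[j] * theta^j.
    powers = y_star ** np.arange(c.shape[0])   # [1, y*, y*^2, ...]
    return powers @ c

# Example: xi_theta = theta - 0.5*y, conditioned at y* = 0.8.
c = np.zeros((4, 4))
c[0, 1] = 1.0       # theta term
c[1, 0] = -0.5      # y term
print(condition_map(c, 0.8))   # -> [-0.4, 1.0, 0.0, 0.0]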
In addition to the moment-based value function features, features composed of 1st- and 3rd-degree total-order polynomials in the posterior map coefficients and the physical state are also investigated. The main advantages of such a construction are the accessibility of the map coefficients (especially in multidimensional parameter spaces), and the fact that the coefficients carry all the information about the posterior (more than just the mean and variance). There are some caveats in the formulation of these features; we defer a detailed discussion to Section 7.3.

For greedy and batch designs, the same grid method setup is used. Policies are also evaluated using the same technique as in Case 2, with 1000 simulated trajectories.

sOED results using grid and map methods

We start with a comparison between the grid and map methods under the sOED algorithm. The analytic method is no longer appropriate for this 1D source inversion problem, since the forward model is nonlinear and the posteriors are non-Gaussian. Having established the agreement between the analytic and grid methods in the linear-Gaussian problem, we now use the grid method as the reference against which to compare the map method.

Histograms for d0 and d1 from the grid and map methods are shown in Figures 7-12 and 7-13. Excellent agreement is observed for d0 between the two methods, while the d1 values from the map method are generally lower than those from the grid method. Since there are only N = 2 experiments in this problem, only J̃1 is constructed via function approximation, while J2 is evaluated directly and numerically. The fact that the d0 values from the grid and map methods are similar implies that the J̃1 (and therefore the policy) generated by the two methods are in good agreement. The discrepancy in d1 must then be due to the less accurate inference and KL computations of the map method. This difference is also reflected in the total rewards, shown in Figure 7-14 and Table 7.4, where the mean rewards from the map method are generally slightly lower than those from the grid method.

There are two main approaches to improving the map quality: (1) improve the map construction process, and (2) use more relevant samples for the map construction. The joint map is currently constructed from exploration trajectory samples, with Figure 7-15 showing the pairwise and marginal KDEs from the samples used to construct the exploration joint map, and from samples generated from that map. Overall, the joint distribution appears quite non-Gaussian, with heavy tails especially in the dimensions involving y1, rendering the mapping to standard Gaussian random variables more nonlinear and thus more challenging to represent.

[Figure 7-12: 1D contaminant source inversion problem, case 3: d0 histograms from 1000 simulated trajectories for the sOED (a) grid and (b) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.]

While the current map captures these features reasonably well, increasing the order of the polynomial basis beyond degree 3 is expected to further improve its performance. However, a higher-order polynomial basis brings new challenges as well.
First, the map construction, evaluation, conditioning, and sampling procedures all require more computation (for example, moving to a 5th-order polynomial basis would increase the number of map coefficients from the current 129 to 467). Second, a higher-order polynomial is also more prone to losing monotonicity, making sampling and density evaluation difficult. Currently, we use 10⁶ samples to build the map. From experience, this sample size is much higher than needed in practice to produce reasonably accurate results for this 6D 3rd-order map, but we choose it intentionally, in order to minimize this particular source of error.

Table 7.4: 1D contaminant source inversion problem, case 3: total reward mean values from 1000 simulated trajectories; the Monte Carlo standard errors are all ±0.02. The grid and map cases are all from sOED.

                                                ℓ=1     ℓ=2     ℓ=3
    Grid                                        0.14    0.12    0.19
    Map (moment features)                       0.13    0.16    0.16
    Map (coefficient features, 1st-order)       0.12    0.14    0.12
    Map (coefficient features, 3rd-order)       0.15    0.18    0.15
    Batch design                                0.11
    Greedy design                               0.09

[Figure 7-13: 1D contaminant source inversion problem, case 3: d1 histograms from 1000 simulated trajectories for the sOED (a) grid and (b) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively.]

Another perspective on improving the map representation is to use more relevant samples for its construction. The exploration map places much of its computational effort on ensuring accuracy over a wide region of the state space that may never be visited by the exploitation policies. As a direction of future research, adapting the joint map to exploitation trajectory samples as they become available is also expected to further improve the performance of the map method.

Lastly, the value function features based on map coefficients produced similar histograms of d0, d1, and rewards, and these plots are omitted. Their mean rewards from 1000 trajectories are shown in Table 7.4. While the 1st-order coefficient features perform only slightly worse than the moment features, the 3rd-order coefficient features are able to achieve a similar level of mean reward. These observations provide good motivation and support for using map coefficients as features in higher-dimensional problems, especially where posteriors depart further from normality and higher-order moment information becomes important.

[Figure 7-14: 1D contaminant source inversion problem, case 3: total reward histograms from 1000 simulated trajectories for the sOED (a) grid and (b) map methods. The left, middle, and right columns correspond to ℓ = 1, 2, and 3, respectively. The plus-minus quantity is 1 standard error.]

Comparison between sOED, batch, and greedy designs

We now focus on comparisons between the different design approaches.
For simplicity, only the ℓ = 3 grid sOED results are used in this part. All batch and greedy design results are produced using the grid method. Intuitively, one would expect this problem to show an advantage of sOED over both batch and greedy designs: batch design has no feedback mechanism and thus is unable to make use of the precise device, while greedy design does not look into the future and thus is unable to account for the wind that will blow the plume to the right. Indeed, this is supported by the pairwise (d0, d1) scatter plots shown in Figure 7-16. Batch design foresees the wind and moves towards the right immediately, but abandons the chance to use the precise device, while greedy design chases after the high-information regions of the plume and incurs a high movement cost. sOED is able to balance the knowledge of the wind and of the precise device, moving slightly to the left in the first design in order to have a chance to qualify for the precise device in the second experiment (these turn out to be cases where the initial plume starts to the right of the origin), before shifting to the right in the second experiment.

[Figure 7-15: 1D contaminant source inversion problem: pairwise and marginal KDEs over (d0, y0, d1, y1, θ) for (a) samples used to construct the exploration map and (b) samples generated from the resulting map.]

The mean rewards are shown in Table 7.4, where batch and greedy designs achieve lower values than any of the sOED variants. The sOED map methods outperform batch and greedy designs despite their less accurate inference and KL computations compared to their grid counterpart. For contrast, the exploration policy produces a much lower mean reward of around −0.5.

In summary, across the 3 cases of the 1D contaminant source inversion problem, we have demonstrated the advantages of sOED over batch and greedy designs in realistic situations. Furthermore, with the sOED grid method used as a comparison reference, the map method has shown good performance while employing both moment-based and map coefficient-based value function features.

[Figure 7-16: 1D contaminant source inversion problem, case 3: (d0, d1) pair scatter plots from 1000 simulated trajectories for (a) batch design, (b) greedy design, and (c) sOED using the grid method. The sOED result here is for ℓ = 1.]

[Figure 7-17: 1D contaminant source inversion problem, case 3: total reward histograms from 1000 simulated trajectories using (a) batch design (mean = 0.11 ± 0.02) and (b) greedy design (mean = 0.09 ± 0.02). The plus-minus quantity is 1 standard error.]

7.3 2D contaminant source inversion problem

7.3.1 Problem setup

Consider a 2D version of the contaminant source inversion problem described in Section 7.2, where now θ = [θ0, θ1]ᵀ, dk = [dk,0, dk,1]ᵀ, z = [z0, z1]ᵀ, and xk,p = [xk,p,0, xk,p,1]ᵀ are 2D vectors (i.e., the plume and vehicle are confined to movements in a 2D physical space). The air is calm initially, and then a variable wind commences at t = 1. This leads to the following values of the cumulative net displacement due to wind at the time points coinciding with the experiments:

\[ d_w(t=1) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad d_w(t=2) = \begin{bmatrix} 0 \\ 5 \end{bmatrix}, \quad d_w(t=3) = \begin{bmatrix} 5 \\ 10 \end{bmatrix}. \tag{7.14} \]
The precise evolution of the wind in between these time points is not relevant, since only these integrated quantities directly affect the contaminant profile in Equation 7.7. The concentration measurements are corrupted by additive Gaussian noise as described by Equation 7.8, with the noise variable now having a constant standard deviation: εk ∼ N(0, 0.5²). Additional problem and algorithm settings are summarized in Tables 7.2 and 7.3.

In this multidimensional problem, the grid method is no longer practical. Such an implementation would require a sophisticated 2D grid adaptation strategy for inference in order to capture the posteriors with sufficient resolution; the overall setup would be very computationally expensive. The map method, however, is capable of accommodating multidimensional parameters relatively easily. For this problem, we employ a map method using polynomial basis functions of total order 3. With N = 3 experiments, the total dimension of the joint map is 3(nd + ny + nθ) = 15, and its coefficients are determined using 5 × 10⁵ exploration trajectories with an exploration design measure dk,j ∼ N(0, 3²), j = 0, 1. Posterior maps, constructed by conditioning on the appropriate dimensions of the joint map, are then 2D 3rd-order polynomials of the form

\[ \xi_{\theta_{k,0}} = a_0 + a_1 \theta_0 + a_2 \theta_0^2 + a_3 \theta_0^3, \tag{7.15} \]
\[ \xi_{\theta_{k,1}} = b_0 + b_1 \theta_0 + b_2 \theta_1 + b_3 \theta_0^2 + b_4 \theta_0 \theta_1 + b_5 \theta_1^2 + b_6 \theta_0^3 + b_7 \theta_0^2 \theta_1 + b_8 \theta_0 \theta_1^2 + b_9 \theta_1^3. \tag{7.16} \]

KL evaluations on the posteriors are performed using the linear truncation technique described in Section 5.5.4.

Features in the value function approximation are chosen to be based on the posterior map coefficients, instead of the posterior first and second moments as in the earlier examples. We make this choice for two main reasons. First, moment information is not directly available from a map representation, and must be approximated by, for example, linear truncation or sampling. Such estimates can become computationally cumbersome in multidimensional settings, and inaccurate for non-Gaussian posteriors. Map coefficients, however, are easily accessible. Second, information fully describing the posterior (up to basis limitations) is encoded within the entire set of coefficients. This includes all moment information as well. Map coefficients thus provide an accessible description of the full posterior without requiring additional approximations.

To be more specific, we construct the features only from the map coefficients corresponding to terms of total degree strictly less than the highest total polynomial order. With reference to Equations 7.15 and 7.16, the features are then functions of

\[ \{a_i\}_{i=0}^{2}, \quad \{b_i\}_{i=0}^{5}. \tag{7.17} \]

The excluded coefficients correspond to terms that are not affected by dk and yk when conditioned from the joint map. (For example, in the k = 0 case with the total polynomial order capped at 3, the b5 θ1² term results from conditioning the joint map terms c0 d0,0 θ1² + c1 d0,1 θ1² + c2 y0 θ1² + c3 θ1² on a particular realization of d0 and y0. Here the ci are coefficients of the joint map; their particular ordering is unimportant in this illustration. The b6 θ0³ term, however, is contributed only by the single joint map term c4 θ0³, which is not affected by either dk or yk. Note that the joint map does not have terms such as d0,0 θ0³, since these would exceed the total-order limit.) Consequently, those coefficients are identical for all regression points, regardless of the experimental designs and observations. Their inclusion would only introduce linear dependence into the regression system, with no positive contribution. We thus use features that are 2nd-order polynomials jointly in the map coefficients from Equation 7.17 and the 2D physical state, leading to a total of $\binom{9+2+2}{2} = \binom{13}{2} = 78$ features (the number of monomials of total degree at most 2 in 9 + 2 = 11 variables).
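The count can be verified directly by enumerating the monomials of total degree 0, 1, and 2 over the 11 inputs:

from itertools import combinations_with_replacement

n_inputs = 9 + 2   # retained posterior-map coefficients + 2D physical state
features = [m for degree in range(3)   # total degrees 0, 1, 2
            for m in combinations_with_replacement(range(n_inputs), degree)]
print(len(features))   # 78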
Before presenting the results, we first provide some intuition about the physical phenomenon through Figure 7-18, which shows the progression of a sample trajectory. The solid, dotted, and dash-dotted contour lines depict the plume signal when y0, y1, and y2 are observed, respectively. The plume diffuses with time, and is also carried by the wind, first northward and then towards the northeast. Figure 7-19 shows the probability densities of the corresponding belief states, beginning with the prior in Figure 7-19(a). The black circle indicates the true θ value; the posterior modes do not necessarily match this value, due to the noisy measurements and the finite number of observations.

The vehicle starts at the initial location represented by the black square in Figure 7-18. For the first experiment, it moves to a new location, indicated by the circle, southwest towards where the initial plume is generally situated (in accordance with the prior). Here, this location remains far from the main part of the plume signal, and acquires only a little information. This is reflected in Figure 7-19(b), where the density contours remain fairly wide. At the same time, this location is close to the region of high information content (intuitively, the high-gradient area) for the second experiment, in anticipation of the wind. The vehicle then only needs to move a small amount, to the diamond mark, and is able to make a fairly informative observation despite a slight loss of plume signal from diffusion. Indeed, Figure 7-19(c) shows a more concentrated density. Finally, with the wind carrying the plume much further away and the vehicle unable to catch it without accumulating substantial movement cost, it only nudges slightly towards the final plume position in the last design. As expected, very little additional information is obtained from the last measurement, and the final posterior remains largely unchanged.

Figures 7-20 and 7-21 show the physical and belief state progression of another sample trajectory, where the plume starting location is to the southwest of the origin. In this case, a meaningful amount of information is attainable in the final experiment, justifying the corresponding movement cost. The vehicle thus makes a large displacement towards the final plume position, and indeed a significant narrowing of the posterior after the final measurement is observed. These trajectory samples illustrate that a policy produced by sOED is able to find good experimental designs in different situations: an inherently adaptive property.

7.3.2 Results

Histograms for the designs d0, d1, and d2 are shown in Figure 7-22. Each dk has two components, corresponding to the two physical space dimensions. The middle column of the figure provides three-dimensional histograms reflecting both components at the same time, while the left and right columns display the marginal histograms of each dimension. The starting location of most plume realizations (generated from the prior) is to the southwest of the initial physical state of the robotic vehicle. As a result, the vehicle generally has a southwest tendency in d0, with negative values in both components.
However, the magnitude of the first movement is not the largest possible (recall that the design space is [−5, 5]²), due to three factors: first, there is a competing quadratic movement cost; second, the plume can still be fairly far away, where only little information is acquirable (such as the situation depicted in Figure 7-18); and third, it may be better to get into a good position in anticipation of the second experiment instead. The trade-off between these factors is complicated, and the algorithm developed in this thesis helps address these difficulties in a quantitative manner.

[Figure 7-18: 2D contaminant source inversion problem: plume signal and physical state progression of sample trajectory 1, in the (z0, z1) plane.]

Moving on to the second design, d1 sees more variation in its second component, while its first component remains mostly around the same position. This is because, by this point, the vehicle is often between the current and next plume positions in the first component, while south of both the current and next plume positions in the second component. This observation thus demonstrates anticipation of the subsequent plume movement towards the northeast. Finally, the last movement sees the largest spread in the histograms. Since this is the final decision in the sequence, the only consideration is the trade-off between movement cost and information gain, which depends on the current plume location, vehicle position, and belief state. The final design thus fully and clearly adjusts to this trade-off, with no need, or opportunity, for additional reservations due to future effects. Experiments later in the sequence are often where feedback effects are most influential.

Histograms of the trajectory rewards are plotted in Figure 7-23 for ℓ = 1, 2, and 3, with mean rewards of 1.04 ± 0.03, 1.10 ± 0.03, and 0.96 ± 0.03, respectively; the plus-minus quantity is 1 standard error. The dk histograms are similar for the subsequent ℓ iterations and thus are not included.

[Figure 7-19: 2D contaminant source inversion problem: belief state posterior density contour progression of sample trajectory 1; panels (a) through (d) show the densities of x0,b, x1,b, x2,b, and x3,b.]

In this example, we observe only small changes in the results as the state measure is updated with ℓ. First, this implies that the exploration design measure we selected is reasonable, since the ℓ = 1 iteration (where all regression points are from exploration) does not show a noticeable disadvantage compared to the subsequent ℓ iterations (where exploitation samples are incorporated). More specifically, this is due either to the designs from the exploitation policies being similarly distributed to those from the exploration policy (which is not the case here, as is evident from Figure 7-22), or to the value function approximations being robust against the locations of the regression points. This leads to the second implication: the features we selected span a subspace that is sufficiently rich to approximate the value functions well. (To illustrate this more simply, imagine a quadratic function being approximated by only linear basis functions; then different regression sample distributions can produce drastically different outcomes, whereas if the basis functions are quadratic, very similar approximations would be produced.)
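For orientation, the regression step being discussed can be sketched as an ordinary least-squares fit of a linear-architecture value function; the feature map and the Monte Carlo targets (one-step lookahead values at the regression states) are supplied by the surrounding backward-induction algorithm, and the names here are ours:

import numpy as np

def fit_value_function(phi, states, targets):
    # Backward-induction regression (sketch): fit J~_k(x) = w . phi(x) by
    # least squares, where `targets` are Monte Carlo estimates of
    # max_d { g_k + E[ J~_{k+1} ] } at each regression state.
    Phi = np.vstack([phi(x) for x in states])
    w, *_ = np.linalg.lstsq(Phi, np.asarray(targets, dtype=float), rcond=None)
    return lambda x: phi(x) @ w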
[Figure 7-20: 2D contaminant source inversion problem: plume signal and physical state progression of sample trajectory 2, in the (z0, z1) plane.]

This observation is particularly encouraging, as it provides support for our choice of features, made largely from heuristics. We also point out the importance of including regression samples produced from the numerical methods (discussed in Section 4.3.1). When those samples are not included, the ℓ = 2 iteration suffers tremendously and produces a much lower mean reward of 0.55. The culprit in this deterioration is inaccurate value function approximations that lead the optimizer to designs that are in fact far from the true optimum. For contrast, the exploration policy yields a much lower mean reward of −0.78.

The pairwise and marginal KDEs from the samples used to construct the exploration joint map, and from samples generated from that map, are shown for the dk and yk dimensions in Figures 7-24 and 7-25. Figures 7-26 and 7-27 display those of the dimensions crossed between dk and yk and θ: the columns from left to right correspond to d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2, and the marginals of the row variables; the rows from top to bottom correspond to the marginals of the column variables and θ0, θ1, θ0, θ1, θ0, θ1, where each pair of rows corresponds to θ for inference after 1, 2, and 3 experiments, respectively. The only part of the joint map omitted is the set of pairwise KDEs between the θ's, which are independent Gaussians and uninteresting. Overall, the pairwise KDEs exhibit extremely non-Gaussian, heavy-tailed, and even borderline multimodal behavior. Nonetheless, the map is still able to capture these characteristics reasonably well, with the map-generated KDEs matching fairly well with their counterparts from the samples used to construct the map. As the problem becomes more nonlinear and higher dimensional, the joint behavior will also become more difficult to mirror. While one avenue of development is the enrichment of the map basis and samples, another promising future research direction is to leverage the exploitation samples, and to construct lower-dimensional, targeted local maps that are more accurate for specific realizations (as discussed at the beginning of Section 5.5.1). We will expand on these ideas in Chapter 8.

[Figure 7-21: 2D contaminant source inversion problem: belief state posterior density contour progression of sample trajectory 2; panels (a) through (d) show the densities of x0,b, x1,b, x2,b, and x3,b.]
[Figure 7-22: 2D contaminant source inversion problem: dk histograms from 1000 simulated trajectories; panels (a), (b), and (c) show d0, d1, and d2, with the left and right columns giving the marginal histograms of components dk,0 and dk,1.]

[Figure 7-23: 2D contaminant source inversion problem: total reward histograms from 1000 simulated trajectories. The left, middle, and right columns correspond to ℓ = 1, 2, and 3 (means 1.04 ± 0.03, 1.10 ± 0.03, and 0.96 ± 0.03), respectively. The plus-minus quantity is 1 standard error.]

[Figure 7-24: 2D contaminant source inversion problem: samples used to construct the exploration map (pairwise and marginal KDEs over d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2).]

[Figure 7-25: 2D contaminant source inversion problem: samples generated from the resulting map (pairwise and marginal KDEs over d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2).]

[Figure 7-26: 2D contaminant source inversion problem: samples used to construct the exploration map, between dk and yk and θ. The columns from left to right correspond to d0,0, d0,1, y0, d1,0, d1,1, y1, d2,0, d2,1, y2, and the marginals of the row variables; the rows from top to bottom correspond to the marginals of the column variables and θ0, θ1, θ0, θ1, θ0, θ1, where each pair of rows corresponds to θ for inference after 1, 2, and 3 experiments, respectively.]

[Figure 7-27: 2D contaminant source inversion problem: samples generated from the resulting map, between dk and yk and θ. The columns and rows are arranged as in Figure 7-26.]

Chapter 8

Conclusions

8.1 Summary and conclusions

This thesis has developed a rigorous mathematical framework and a set of numerical tools for performing sequential optimal experimental design (sOED) in a computationally feasible manner. Experiments play an essential role in the learning process, and a systematic design procedure for finding the optimal experiments can lead to tremendous resource savings. Propelled by recent algorithmic developments, simulation-based optimal experimental design (OED) has seen substantial advances in accommodating nonlinear and physically realistic processes. However, the state-of-the-art OED tools today are largely limited to batch (open-loop) and greedy (myopic) designs. While sufficient under some circumstances, these design approaches generally do not yield the optimal design for multiple experiments conducted in a sequence. The use of fully optimal formulations for sequential design is still in its early stages.

We begin the thesis with an extension of our previous batch OED work. In addition to describing the framework and numerical tools for batch OED, particular focus is paid to enhancing the capability to accommodate nonlinear and computationally intensive models with an information gain objective.
This involves deriving and accessing gradient information via polynomial chaos and infinitesimal perturbation analysis, in order to enable the use of gradient-based optimization methods, which would otherwise be impossible or impractical. An extensive comparison between two gradient-based methods, Robbins-Monro stochastic approximation and sample average approximation, is made from a practical and numerical perspective in the context of batch OED, via a diffusion source inversion application governed by a 2D partial differential equation.

We then develop a rigorous mathematical framework for sOED. This framework is formulated from a decision-theoretic perspective, with a Bayesian treatment of uncertainty and an information measure objective. It is capable of accommodating the sequential design of a finite number of experiments, with nonlinear models and non-Gaussian distributions, under continuous parameter, design, and observation spaces of multiple dimensions. What sets sOED apart from batch OED is that it seeks an optimal policy: a set of functions that determines the optimal design depending on the current system state.

Directly solving for the optimal policy of the sOED problem is a challenging task. Instead, we re-express it using a dynamic programming formulation, and then make use of various approximate dynamic programming (ADP) techniques to find an approximation to the optimal policy. The ADP techniques employed are based on a one-step lookahead policy representation, combined with approximate value iteration (in particular, backward induction and regression). Value functions are approximated using a linear architecture, with features selected from heuristics and motivated by the moment terms in the analytic formula of the Kullback-Leibler divergence between Gaussian distributions. The approximations are then constructed from the regression problems arising in the backward induction process. Regression samples are generated from trajectory simulations, via both exploration and exploitation. In obtaining good regression sample locations, we emphasize the notion of the policy- and numerical-method-induced state measure. An iterative update procedure is introduced to help adapt and refine this measure as better policy approximations are constructed. Lastly, we further point out the difficulty of the problem, as we mathematically show that many advanced partially observable Markov decision process algorithms are not suitable for information-based OED.

The next major challenge involves the representation of the belief states, which are posteriors of multivariate, non-Gaussian, continuous random variables. Transport maps with finite-dimensional parameterizations are introduced to represent the belief states. This technology is numerically attractive in that the maps can be constructed directly from samples without requiring model knowledge, and the optimization problem in the construction process is dimensionally separable and convex. More importantly, by building a map jointly over the parameter and observation spaces, one can recover the posterior map by simply conditioning the joint map. This allows Bayesian inference, which needs to be repeated millions of times throughout the entire sOED process under different realizations of the designs and observations, to be performed very quickly, albeit approximately. This ability plays a key role in making the overall method computationally feasible.
We take a step further, and build a single joint map over the parameter, observation, and design spaces of all stages, such that only one map is needed for all subsequent inferences in solving the sOED problem. Currently, samples for the map construction are generated from exploration only; future research will involve the incorporation of exploitation samples as well.

Finally, we demonstrate the computational effectiveness of these methods via three examples. The first is a linear-Gaussian problem, where an analytic solution is available. A comparison of sOED using analytic, grid, and map representations of the belief state provides an understanding of the various sources of numerical error. Next is a realistic nonlinear contaminant source inversion problem in a 1D physical space, with diffusion and convection effects. Through different settings, we demonstrate the advantage of sOED over the more often used batch and greedy designs, and also establish confidence in our map-based algorithm for handling nonlinear problems. The map-based method has constructed excellent policies, using both moment-based and map coefficient-based value function features. The last problem is the contaminant source inversion problem in a 2D physical space setting. With multiple dimensions in many variables, this problem tests the limits of the numerical methods developed in this work, and offers insights into future research directions.

8.2 Future work

Throughout this thesis, we have identified several promising avenues of future work, which are briefly outlined below. We broadly divide them into areas of computational and formulational advances.

8.2.1 Computational advances

1. Transport maps and inference accuracy: One fruitful direction of research involves improving the accuracy of the transport map representation of the belief state, and the accompanying inference method of conditioning a joint map. As the number of experiments and the variable dimensions increase, and as the problem becomes more nonlinear and non-Gaussian, the accuracy of the maps also becomes more difficult to maintain. Echoing the discussion from Section 7.2.3, the accuracy of a particular map can be increased by improving the map construction process (such as enriching its basis functions and boosting the number of construction samples), or by using more relevant samples (such as those from exploitation trajectories, which better reflect the states the algorithm visits). On a higher level, we may also consider different maps altogether. Instead of the single joint map adopted in this thesis, targeted local maps may be created that are more accurate for specific states. With reference to Equation 3.3, one possible route is to construct a separate map for each xk visited, as needed, which can then be used for inference on different realizations of dk and yk. Such an approach would require joint maps of only ndk + nyk + nθ dimensions, independent of N (interestingly, information-based OED is expected to be most effective when only a few experiments are available, as otherwise even suboptimal experiments can still eventually lead to informative posteriors after many measurements; this suggests that the number of experiments would be less of a problem than increases in other variable dimensions). Naturally, lower-dimensional joint maps would produce more accurate inference results, but many more such maps would need to be constructed, and this must be done in an online fashion.
Furthermore, performance would be affected by additional sources of error in the propagation of truncated representations of xk, as we would then need to explicitly store these representations, whereas in the current implementation we only needed to store their associated history of designs and observations. More generally, hybrid methods of inference appear promising. For instance, we could use a "rough" joint map to arrive at an initial approximation of the posterior, and further refine it as needed using other techniques such as importance sampling. An easily tunable setup is also attractive for numerical adaptation, another topic to be discussed shortly.

2. Alternative approximate dynamic programming techniques: There is a vast literature on ADP techniques in addition to those used in this thesis. For many possible alternative approaches, it is not immediately clear whether they can produce accurate results more efficiently. For example, we have employed a direct backward induction approach to construct the value function approximations. A rollout formulation, perhaps involving multiple iterations of rollout (approximate policy iteration), can potentially be computationally cheaper, but produces "less optimal" policies. At the same time, a whole field of algorithms for policy evaluation can be tested, such as the temporal difference (TD) variations.

3. Adaptation of numerical methods: In tackling a difficult problem such as sOED, numerous numerical and approximation techniques need to be employed, and accompanying them are different sources of error. The work of this thesis has largely relied on techniques that have some natural means of refinement. For example, one can make the value function approximations more accurate by enriching their features, increasing the regression sample size and making more efficient sample choices, and improving the quality of the objective estimates in the stochastic optimization algorithm. We have the choice of which components to improve, and by how much, through the allocation of computational resources. Yet not all errors are equally important. The key is to understand to which sources of error the quality of the approximate optimal policy is most sensitive, and which sources are most prominent yet can be economically reduced. We would like to further investigate the behavior of these numerical errors, with the aim both of creating a goal-oriented adaptation scheme that can efficiently improve the accuracy of the overall method, and of achieving quantifiable and meaningful error bounds on the results.

8.2.2 Formulational advances

1. Changes to the number of experiments: While the sOED formulation in this thesis has assumed the number of experiments to be known and fixed, this is often not the case in practice. For example, when a particular goal has been achieved (e.g., enough information has been acquired, or the effect of the drug has reached its target), additional experiments are often no longer necessary. For projects heavily influenced by the political climate or funding availability, the probability of project termination or renewal is often ambiguous. Regardless of whether these changes are intentional, their inclusion requires some mechanism that can change the number of experiments. One well-studied variant of sOED that accommodates possible early termination is the optimal stopping problem. In experimental design, the concept has been used in the design of clinical trials and biological experiments (e.g., [41, 48]).
8.2.2 Formulational advances

1. Changes to the number of experiments: While the sOED formulation in this thesis has assumed the number of experiments to be known and fixed, this is often not the case in practice. For example, once a particular goal has been achieved (e.g., enough information has been acquired, or the effect of a drug has reached its target), additional experiments are often no longer necessary. For projects heavily influenced by political climate or funding availability, the probability of project termination or renewal is often ambiguous. Regardless of whether these changes are intentional, their inclusion requires some mechanism that can change the number of experiments. One well-studied variant of sOED that accommodates possible early termination is the optimal stopping problem. In experimental design, the concept has been used in the design of clinical trials and biological experiments (e.g., [41, 48]). At a high level, its formulation involves a "maximum possible" horizon, with each experiment having the option of a termination design when certain conditions are met. Whether such a maximum horizon even exists, or what a reasonable "long" substitute for this number would be, may be problematic in itself. Despite these additional challenges, however, an extension from our current formulation would not be difficult. More subtle is how to accommodate unforeseen additional experiments, especially when this news is revealed partway through the project. This is usually a good problem to have. However, the policy constructed for the previously shorter horizon may no longer be "good" for the remaining original experiments, and does not even apply to the new ones. While one may simply solve the new sOED problem starting from the current state, this can be an expensive and inefficient process. Of particular concern are situations where there is some sense of diminishing returns in the experiments, or where only a small change is made relative to the original total (e.g., adding 1 new experiment to an original total of 100 may have an insignificant effect). These factors, combined with the often limited time available for computing the new policy, present a need for updating or modifying the original policy on the fly, perhaps suboptimally. Extending this line of thought further, advanced structural approximations (such as mixtures of batch, greedy, and sequential designs over blocks of experiments) may be more robust to these changes, but at a trade-off in optimality.

2. Additional advances in formulation: There are several additional general formulation aspects of sOED that are worth incorporating in the future. First, the element of risk can be extremely important in some missions (for instance, as mentioned in the example from Section 7.2.2). Risk can be incorporated, for example, through the objective function (such as by adding terms that reflect the variance, or worst-case scenarios) and through probabilistic constraints (such as the probability of some notion of failure that reflects the reliability of the policy). These new structures naturally lead to more sophisticated robust optimization algorithms (e.g., [10]). Second, the treatment of nuisance parameters can help increase the efficiency of sOED. For example, in the examples of Sections 7.2 and 7.3, if the wind condition is unknown, then acquiring information to reduce uncertainty in this nuisance parameter may still be useful towards the primary goal of gaining information about the contaminant source location. While nuisance parameter treatment has been investigated in batch OED [67], its incorporation in sOED would have an even bigger impact. Finally, the treatment of model discrepancy is an interesting and challenging area in itself. Its inclusion in sOED would help make the method even more reflective of real-life situations.

Appendix A

Analytic Derivation of the Unbiased Gradient Estimator

We derive the analytical form of the unbiased gradient estimator $\nabla_d \hat{U}_{N,M}(d, \theta_s, z_s)$ (recall that this estimator is unbiased with respect to the gradient of $\bar{U}_M$), following the method presented in Section 2.4. The estimator $\hat{U}_{N,M}(d, \theta_s, z_s)$ is defined in Equation 2.16. Its gradient in component form is

$$\nabla_d \hat{U}_{N,M}(d, \theta_s, z_s) = \begin{bmatrix} \frac{\partial}{\partial d_1} \hat{U}_{N,M}(d, \theta_s, z_s) \\ \frac{\partial}{\partial d_2} \hat{U}_{N,M}(d, \theta_s, z_s) \\ \vdots \\ \frac{\partial}{\partial d_a} \hat{U}_{N,M}(d, \theta_s, z_s) \\ \vdots \\ \frac{\partial}{\partial d_{n_d}} \hat{U}_{N,M}(d, \theta_s, z_s) \end{bmatrix}, \tag{A.1}$$

where $n_d$ is the dimension of the design vector $d$, and $d_a$ denotes the $a$th component of $d$.
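Before differentiating term by term, it may help to see the estimator itself in executable form. The sketch below assumes the standard nested Monte Carlo form of $\hat{U}_{N,M}$ with fixed prior samples $\theta_s$ and noise samples $z_s$, so that the estimator is a deterministic, smooth function of $d$; the scalar model $y = \theta d + \epsilon$ is an illustrative stand-in for $G$ and $C$, with a constant noise scale rather than the $\alpha_c + \beta_c |G_c|$ form treated below. With the randomness frozen, the analytic gradient derived in this appendix can be checked against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 100
theta_i = rng.normal(0.0, 3.0, size=N)        # prior samples theta^(i)
theta_ij = rng.normal(0.0, 3.0, size=(N, M))  # inner prior samples theta^(i,j)
z = rng.standard_normal(N)                    # fixed noise samples z^(i)

def log_norm_pdf(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

def U_hat(d, sigma_eps=1.0):
    """Nested MC estimator of expected information gain at scalar design d."""
    y = theta_i * d + sigma_eps * z                     # y^(i) = G + C z^(i)
    log_like = log_norm_pdf(y, theta_i * d, sigma_eps)  # ln f(y^(i)|theta^(i), d)
    # ln of the evidence estimate (1/M) sum_j f(y^(i) | theta^(i,j), d)
    log_evid = np.logaddexp.reduce(
        log_norm_pdf(y[:, None], theta_ij * d, sigma_eps), axis=1) - np.log(M)
    return np.mean(log_like - log_evid)

# Central finite differences on the frozen-sample estimator approximate the
# same gradient that Equation A.2 below expresses analytically.
d0, h = 1.5, 1e-5
grad_fd = (U_hat(d0 + h) - U_hat(d0 - h)) / (2 * h)
print(U_hat(d0), grad_fd)
```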
The $a$th component of the gradient is then

$$\frac{\partial}{\partial d_a} \hat{U}_{N,M}(d,\theta_s,z_s) = \frac{1}{N}\sum_{i=1}^{N}\left\{ \frac{\frac{\partial}{\partial d_a} f_{y|\theta,d}\big(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\big|\, \theta^{(i)},d\big)}{f_{y|\theta,d}\big(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\big|\, \theta^{(i)},d\big)} - \frac{\sum_{j=1}^{M} \frac{\partial}{\partial d_a} f_{y|\theta,d}\big(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\big|\, \theta^{(i,j)},d\big)}{\sum_{j'=1}^{M} f_{y|\theta,d}\big(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\big|\, \theta^{(i,j')},d\big)} \right\}. \tag{A.2}$$

Partial derivatives of the likelihood function with respect to $d$ are required above. We assume that each component of $C(\theta^{(i)}, d)$ is of the form $\alpha_c + \beta_c |G_c(\theta^{(i)}, d)|$, $c = 1, \ldots, n_y$, where $n_y$ is the dimension of the observation vector $y$, and $\alpha_c$, $\beta_c$ are constants. Also, let the random vectors $z^{(i)}$ be mutually independent and composed of i.i.d. components, such that the observations are conditionally independent given $\theta$ and $d$. The derivative of the likelihood function then becomes

$$\begin{aligned} &\frac{\partial}{\partial d_a} f_{y|\theta,d}\big(G(\theta^{(i)},d)+C(\theta^{(i)},d)z^{(i)} \,\big|\, \theta^{(i,j)},d\big) \\ &\quad= \frac{\partial}{\partial d_a}\left[\prod_{c=1}^{n_y} f_{y_c|\theta,d}\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\big|\, \theta^{(i,j)},d\big)\right] \\ &\quad= \sum_{k=1}^{n_y} \frac{\partial}{\partial d_a} f_{y_k|\theta,d}\big(G_k(\theta^{(i)},d)+(\alpha_k+\beta_k|G_k(\theta^{(i)},d)|)z_k^{(i)} \,\big|\, \theta^{(i,j)},d\big) \prod_{\substack{c=1 \\ c\neq k}}^{n_y} f_{y_c|\theta,d}\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\big|\, \theta^{(i,j)},d\big). \end{aligned} \tag{A.3}$$

Introducing a standard normal density for each $z_c^{(i)}$, the likelihood associated with a single component of the data vector is

$$f_{y_c|\theta,d}\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\big|\, \theta^{(i,j)},d\big) = \frac{1}{\sqrt{2\pi}\,\big(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\big)} \exp\left(-\frac{\Big[G_c(\theta^{(i,j)},d)-\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\big)\Big]^2}{2\big(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\big)^2}\right), \tag{A.4}$$

and its derivative is

$$\begin{aligned} &\frac{\partial}{\partial d_a} f_{y_c|\theta,d}\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\big|\, \theta^{(i,j)},d\big) \\ &\quad= \frac{-\beta_c \operatorname{sgn}\!\big(G_c(\theta^{(i,j)},d)\big)\frac{\partial}{\partial d_a}G_c(\theta^{(i,j)},d)}{\sqrt{2\pi}\,\big(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\big)^2} \exp\left(-\frac{\Big[G_c(\theta^{(i,j)},d)-\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\big)\Big]^2}{2\big(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\big)^2}\right) \\ &\qquad+ \frac{1}{\sqrt{2\pi}\,\big(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\big)} \exp\left(-\frac{\Big[G_c(\theta^{(i,j)},d)-\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\big)\Big]^2}{2\big(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\big)^2}\right) \\ &\qquad\quad\times\Bigg\{ -\frac{G_c(\theta^{(i,j)},d)-\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\big)}{\big(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\big)^2} \left[\frac{\partial}{\partial d_a}G_c(\theta^{(i,j)},d) - \frac{\partial}{\partial d_a}G_c(\theta^{(i)},d)\big(1+\beta_c\operatorname{sgn}(G_c(\theta^{(i)},d))z_c^{(i)}\big)\right] \\ &\qquad\qquad+ \frac{\Big[G_c(\theta^{(i,j)},d)-\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)}\big)\Big]^2}{\big(\alpha_c+\beta_c|G_c(\theta^{(i,j)},d)|\big)^3}\, \beta_c\operatorname{sgn}\!\big(G_c(\theta^{(i,j)},d)\big)\frac{\partial}{\partial d_a}G_c(\theta^{(i,j)},d) \Bigg\}. \end{aligned} \tag{A.5}$$

In cases where conditioning on $\theta^{(i,j)}$ is replaced by conditioning on $\theta^{(i)}$ (i.e., for the first summation term in Equation A.2), the expressions simplify to

$$f_{y_c|\theta,d}\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\big|\, \theta^{(i)},d\big) = \frac{1}{\sqrt{2\pi}\,\big(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|\big)} \exp\left(-\frac{\big(z_c^{(i)}\big)^2}{2}\right) \tag{A.6}$$

and

$$\frac{\partial}{\partial d_a} f_{y_c|\theta,d}\big(G_c(\theta^{(i)},d)+(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|)z_c^{(i)} \,\big|\, \theta^{(i)},d\big) = \frac{-\beta_c \operatorname{sgn}\!\big(G_c(\theta^{(i)},d)\big)\frac{\partial}{\partial d_a}G_c(\theta^{(i)},d)}{\sqrt{2\pi}\,\big(\alpha_c+\beta_c|G_c(\theta^{(i)},d)|\big)^2} \exp\left(-\frac{\big(z_c^{(i)}\big)^2}{2}\right). \tag{A.7}$$

We now require the derivative of each model output $G_c$ with respect to $d$. In most cases, this quantity will not be available analytically. One could use an adjoint method to evaluate the derivatives, or instead employ a finite difference approximation, but embedding these approaches in a Monte Carlo sum may be prohibitive, particularly if each forward model evaluation is computationally expensive. The polynomial chaos surrogate introduced in Section 2.3 addresses this problem by replacing the forward model with polynomial expansions for either $G_c$,

$$G_c(\theta^{(i)},d) \approx \sum_{b\in\mathcal{J}} g_b \Psi_b\big(\xi(\theta^{(i)},d)\big), \tag{A.8}$$

or $\ln G_c$,

$$G_c(\theta^{(i)},d) \approx \exp\left[\sum_{b\in\mathcal{J}} g_b \Psi_b\big(\xi(\theta^{(i)},d)\big)\right]. \tag{A.9}$$

Here $g_b$ are the expansion coefficients and $\mathcal{J}$ is an admissible multi-index set indicating which polynomial terms are in the expansion. For instance, if $n_\theta$ is the dimension of $\theta$ and $n_d$ is the dimension of $d$, such that $n_\theta + n_d$ is the dimension of $\xi$, then $\mathcal{J} := \{b \in \mathbb{N}_0^{n_\theta+n_d} : |b|_1 \le p\}$ is a total-order expansion of degree $p$. This expansion converges in the $L^2$ sense as $p \to \infty$.
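A total-order index set of this kind is simple to enumerate; the short sketch below (with illustrative names) does so by brute force and confirms the well-known cardinality $\binom{n_\theta + n_d + p}{p}$.

```python
from itertools import product
from math import comb

def total_order_set(dim, p):
    """All multi-indices b in N_0^dim with |b|_1 <= p (the set J above)."""
    return [b for b in product(range(p + 1), repeat=dim) if sum(b) <= p]

# e.g., n_theta + n_d = 3 and total-order degree p = 2
J = total_order_set(3, 2)
print(len(J), comb(3 + 2, 2))  # both are 10
```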
Consider the latter ($\ln G_c$) case; here, the derivative of the polynomial chaos expansion is

$$\frac{\partial}{\partial d_a} G_c(\theta^{(i)},d) = \exp\left[\sum_{b\in\mathcal{J}} g_b \Psi_b\big(\xi(\theta^{(i)},d)\big)\right] \sum_{b\in\mathcal{J}} g_b \frac{\partial}{\partial d_a}\Psi_b\big(\xi(\theta^{(i)},d)\big). \tag{A.10}$$

In the former ($G_c$ without the logarithm) case, we obtain the same expression except without the $\exp[\cdot]$ term. To complete the derivation, we assume that each component of the input parameters $\theta$ and design vector $d$ is represented by an affine transformation of the corresponding basis random variable $\xi$:

$$\theta_l = \gamma_l + \delta_l \xi_l, \tag{A.11}$$
$$d_{l'-n_\theta} = \gamma_{l'} + \delta_{l'} \xi_{l'}, \tag{A.12}$$

where $\gamma_{(\cdot)}$ and $\delta_{(\cdot)} \neq 0$ are constants, $l = 1, \ldots, n_\theta$, and $l' = n_\theta+1, \ldots, n_\theta+n_d$. This is a reasonable assumption, since $\xi$ can typically be chosen such that their distributions are of the same family as the prior on $\theta$ (or the uniform "prior" on $d$); this choice avoids any need for approximate representations of the prior. The derivative of $\Psi_b(\xi(\theta^{(i)},d))$ from Equation A.10 is thus

$$\begin{aligned} \frac{\partial}{\partial d_a}\Psi_b\big(\xi(\theta^{(i)},d)\big) &= \frac{\partial}{\partial d_a}\left[\prod_{l=1}^{n_\theta} \psi_{b_l}\big(\xi_l(\theta_l^{(i)})\big) \prod_{l'=n_\theta+1}^{n_\theta+n_d} \psi_{b_{l'}}\big(\xi_{l'}(d_{l'-n_\theta})\big)\right] \\ &= \prod_{l=1}^{n_\theta} \psi_{b_l}\big(\xi_l(\theta_l^{(i)})\big) \prod_{\substack{l'=n_\theta+1 \\ l'-n_\theta\neq a}}^{n_\theta+n_d} \psi_{b_{l'}}\big(\xi_{l'}(d_{l'-n_\theta})\big)\, \frac{\partial}{\partial d_a}\psi_{b_{a+n_\theta}}\big(\xi_{a+n_\theta}(d_a)\big), \end{aligned} \tag{A.13}$$

and the derivative of the univariate basis function $\psi$ with respect to $d_a$ is

$$\frac{\partial}{\partial d_a}\psi_{b_{a+n_\theta}}\big(\xi_{a+n_\theta}(d_a)\big) = \frac{\partial}{\partial \xi_{a+n_\theta}}\psi_{b_{a+n_\theta}}(\xi_{a+n_\theta})\, \frac{\partial \xi_{a+n_\theta}}{\partial d_a} = \frac{1}{\delta_{a+n_\theta}}\, \frac{\partial}{\partial \xi_{a+n_\theta}}\psi_{b_{a+n_\theta}}(\xi_{a+n_\theta}), \tag{A.14}$$

where the second equality is a result of using Equation A.12. The derivative of the polynomial basis function with respect to its argument is available analytically for many standard orthogonal polynomials, and may be evaluated using recurrence relationships [1]. For example, in the case of Legendre polynomials, the usual derivative recurrence relationship is $\frac{\partial}{\partial\xi}\psi_n(\xi) = \left[-n\xi\psi_n(\xi) + n\psi_{n-1}(\xi)\right]/(1-\xi^2)$, where $n$ is the polynomial degree. However, division by $(1-\xi^2)$ presents numerical difficulties when evaluated at $\xi$ that fall on or near the boundaries of the domain. Instead, a more robust alternative, which requires both previous polynomial function and derivative evaluations, can be obtained by directly differentiating the three-term recurrence relationship for the polynomial, and is preferable in practice:

$$\frac{\partial}{\partial\xi}\psi_n(\xi) = \frac{2n-1}{n}\psi_{n-1}(\xi) + \frac{2n-1}{n}\,\xi\,\frac{\partial}{\partial\xi}\psi_{n-1}(\xi) - \frac{n-1}{n}\,\frac{\partial}{\partial\xi}\psi_{n-2}(\xi). \tag{A.15}$$
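As a quick numerical check of Equation A.15, the following sketch (with illustrative function names) evaluates Legendre polynomials and their derivatives jointly through the differentiated three-term recurrence, and verifies the degree-2 case, including at the domain endpoints where the $(1-\xi^2)$ form would fail.

```python
import numpy as np

def legendre_with_derivative(n, xi):
    """Evaluate Legendre psi_n and d(psi_n)/dxi jointly via the differentiated
    three-term recurrence (Equation A.15), avoiding division by 1 - xi^2."""
    p_prev, p = np.ones_like(xi), xi                    # psi_0, psi_1
    dp_prev, dp = np.zeros_like(xi), np.ones_like(xi)   # psi_0', psi_1'
    if n == 0:
        return p_prev, dp_prev
    for k in range(2, n + 1):
        p_new = ((2 * k - 1) * xi * p - (k - 1) * p_prev) / k
        dp_new = ((2 * k - 1) / k) * (p + xi * dp) - ((k - 1) / k) * dp_prev
        p_prev, p, dp_prev, dp = p, p_new, dp, dp_new
    return p, dp

xi = np.array([-1.0, 0.3, 1.0])  # endpoints included: A.15 remains stable here
p2, dp2 = legendre_with_derivative(2, xi)
print(np.allclose(p2, 0.5 * (3 * xi**2 - 1)), np.allclose(dp2, 3 * xi))  # True True
```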
This concludes the derivation of the analytical gradient estimator $\nabla_d \hat{U}_{N,M}(d, \theta_s, z_s)$.

Appendix B

Analytic Solution to the Linear-Gaussian Problem

We derive the analytic solution to the linear-Gaussian problem described in Section 7.1. As discussed in the main text, this problem is deterministic, and its optimal policy can be reduced to optimal designs (i.e., the expected utility or reward is a function of $d_0$ and $d_1$, rather than of a policy). In this case, batch optimal experimental design (OED) and sequential optimal experimental design (sOED) yield the same optimal designs, since feedback is not needed. We pursue the derivation first via the batch design formulation in Section B.1, which is simpler and also produces the entire analytic expected utility function. In Section B.2, we present the derivation under the sOED formulation. The production of these derivations was assisted by the MATLAB Symbolic Math Toolbox.

B.1 Derivation from batch optimal experimental design

Following the expected utility definition of Equation 2.2 and with the additional term introduced in Equation 7.4, the expected utility for this problem is

$$\begin{aligned} U(d_0,d_1) &= \mathbb{E}_{y_0,y_1|d_0,d_1}\left[ D_{\mathrm{KL}}\big(f_{\theta|y_0,y_1,d_0,d_1}(\cdot\,|\,y_0,y_1,d_0,d_1)\,\big\|\, f_\theta(\cdot)\big) - 2\big(\ln\sigma_2^2 - \ln 2\big)^2 \right] \\ &= \mathbb{E}_{y_0,y_1|d_0,d_1}\left[ \mathbb{E}_{\theta|y_0,y_1,d_0,d_1}\left[ \ln\frac{f(\theta|y_0,y_1,d_0,d_1)}{f(\theta)} \right] - 2\big(\ln\sigma_2^2 - \ln 2\big)^2 \right] \\ &= \mathbb{E}_{\theta|d_0,d_1}\left[ \mathbb{E}_{y_0,y_1|\theta,d_0,d_1}\left[ \ln\frac{f(\theta|y_0,y_1,d_0,d_1)}{f(\theta)} \right]\right] - 2\big(\ln\sigma_2^2 - \ln 2\big)^2, \end{aligned} \tag{B.1}$$

where the second equality is due to $\sigma_2^2$ being independent of $y_0$ and $y_1$ given $d_0$ and $d_1$ (see Equation 7.2), and the last equality follows from the re-arrangement of conditional expectations. Let us first focus on the first term in Equation B.1, and substitute the following formulas for the log-prior and log-posterior density functions:

$$\ln f(\theta) = -\frac{1}{2}\ln 2\pi\sigma_0^2 - \frac{(s_0-\theta)^2}{2\sigma_0^2}, \tag{B.2}$$
$$\ln f(\theta|y_0,y_1,d_0,d_1) = -\frac{1}{2}\ln 2\pi\sigma_2^2 - \frac{(s_2-\theta)^2}{2\sigma_2^2}. \tag{B.3}$$

We then further substitute for $s_2$, $\sigma_2^2$, $s_1$, $\sigma_1^2$ using the formulas in Equation 7.2, and set $s_0 = 0$, $\sigma_0^2 = 9$, and $\sigma_\epsilon^2 = 1$ from the problem setting. The resulting expression is

$$\mathbb{E}_{\theta|d_0,d_1}\left[\mathbb{E}_{y_0,y_1|\theta,d_0,d_1}\left[\frac{\begin{aligned}&-9d_0^4\theta^2 - 9d_1^4\theta^2 - d_0^2\theta^2 - d_1^2\theta^2 - 18d_0^2 d_1^2\theta^2 - 9d_0^2 y_0^2 - 9d_1^2 y_1^2 - 18d_0 d_1 y_0 y_1 \\ &\quad+ 18d_0^3\theta y_0 + 18d_1^3\theta y_1 + 18d_0 d_1^2\theta y_0 + 18d_0^2 d_1\theta y_1 + 2d_0\theta y_0 + 2d_1\theta y_1\end{aligned}}{18(d_0^2+d_1^2)+2} + \frac{1}{2}\ln\left(d_0^2+d_1^2+\frac{1}{9}\right) + \ln 3\right]\right]. \tag{B.4}$$

Next, we make use of the linearity of the expectation operators, and apply the inner expectation $\mathbb{E}_{y_0,y_1|\theta,d_0,d_1}$ term by term, with the formulas

$$\mathbb{E}_{y_k|\theta,d_k}[y_k] = \mathbb{E}_{\epsilon|\theta,d_k}[\theta d_k + \epsilon] = \theta d_k, \tag{B.5}$$
$$\mathbb{E}_{y_k|\theta,d_k}[y_k^2] = \mathrm{Var}_{\epsilon|\theta,d_k}[y_k] + \mathbb{E}_{\epsilon|\theta,d_k}[y_k]^2 = \mathrm{Var}_{\epsilon|\theta,d_k}[\theta d_k + \epsilon] + \theta^2 d_k^2 = \sigma_\epsilon^2 + \theta^2 d_k^2, \tag{B.6}$$

for $k = 0, 1$. The substitution for $y_k$ invokes the model in Equation 7.1. For the cross term with $y_0$ and $y_1$ in Equation B.4, the observations are independent conditioned on $\theta$, $d_0$, and $d_1$, due to the independent-$\epsilon$ assumption; hence the joint expectation is separable. Again using the linearity of the expectation operators, we apply the outer expectation $\mathbb{E}_{\theta|d_0,d_1}$ term by term, using the formulas

$$\mathbb{E}_{\theta|d_0,d_1}[\theta] = s_0, \tag{B.7}$$
$$\mathbb{E}_{\theta|d_0,d_1}[\theta^2] = \mathrm{Var}_{\theta|d_0,d_1}[\theta] + \mathbb{E}_{\theta|d_0,d_1}[\theta]^2 = \sigma_0^2 + s_0^2. \tag{B.8}$$

Upon these substitutions, Equation B.4 simplifies to

$$\frac{1}{2}\ln\left(d_0^2+d_1^2+\frac{1}{9}\right) + \frac{2473854946935173}{2251799813685248}, \tag{B.9}$$

where the rational constant is the double-precision representation of $\ln 3$ produced by the symbolic computation. The second term in Equation B.1, upon substituting the formula in Equation 7.2 and simplifying, becomes

$$-2\left(\ln\frac{9}{9d_0^2+9d_1^2+1} - \ln 2\right)^2. \tag{B.10}$$

Combining Equations B.9 and B.10, we obtain the analytic formula for the expected utility:

$$U(d_0,d_1) = \frac{1}{2}\ln\left(d_0^2+d_1^2+\frac{1}{9}\right) + \frac{2473854946935173}{2251799813685248} - 2\left(\ln\frac{9}{9d_0^2+9d_1^2+1} - \ln 2\right)^2. \tag{B.11}$$

Finding the stationary points in the design space $(d_0,d_1) \in [0.1, 3]^2$ by setting the gradient to zero and checking the boundaries, the optimal designs satisfy the condition

$$d_0^{*2} + d_1^{*2} = \frac{1}{9}\left[\exp\left(\frac{18014398509481984\,\ln 3 - 5117414861322735}{9007199254740992}\right) - 1\right], \tag{B.12}$$

with

$$U(d_0^*, d_1^*) \approx 0.783289. \tag{B.13}$$

The optimal solution is indeed not unique, as there is a "front" of optimal designs. The expected utility contours and the optimal design front are plotted in Figure B-1.

[Figure B-1: Linear-Gaussian problem: analytic expected utility surface over $(d_0, d_1)$, with the "front" of optimal designs shown as a dotted black line.]
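The closed form in Equation B.11 lends itself to a direct Monte Carlo check. The sketch below (illustrative code, using this problem's values $s_0 = 0$, $\sigma_0^2 = 9$, $\sigma_\epsilon^2 = 1$) estimates the expected Kullback-Leibler divergence by sampling, applies the deterministic penalty term, and compares the result against the analytic expected utility.

```python
import numpy as np

rng = np.random.default_rng(0)

def U_analytic(d0, d1):
    # Equation B.11 (the rational constant there is the float form of ln 3)
    q = d0**2 + d1**2
    return (0.5 * np.log(q + 1 / 9) + np.log(3)
            - 2 * (np.log(9 / (9 * q + 1)) - np.log(2)) ** 2)

def U_monte_carlo(d0, d1, n=200_000):
    theta = rng.normal(0.0, 3.0, size=n)        # prior N(0, 9)
    y0 = theta * d0 + rng.standard_normal(n)    # y_k = theta d_k + eps
    y1 = theta * d1 + rng.standard_normal(n)
    var2 = 9.0 / (9 * (d0**2 + d1**2) + 1)      # posterior variance
    s2 = var2 * (d0 * y0 + d1 * y1)             # posterior mean
    # KL divergence between the Gaussians N(s2, var2) and N(0, 9)
    kl = 0.5 * (var2 / 9 + s2**2 / 9 - np.log(var2 / 9) - 1)
    return kl.mean() - 2 * (np.log(var2) - np.log(2)) ** 2

print(U_analytic(1.0, 1.0), U_monte_carlo(1.0, 1.0))  # agree to ~3 digits
```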
B.2 Derivation from sequential optimal experimental design

We now present the first steps of a derivation using the sOED formulation. This approach reaches the same optimal designs as the derivation using the batch OED formulation. We start the derivation from the terminal reward defined in Equation 7.4:

$$\begin{aligned} J_2(x_2) &= D_{\mathrm{KL}}\big(f_{\theta|d_0,y_0,d_1,y_1}(\cdot\,|\,d_0,y_0,d_1,y_1)\,\big\|\, f_\theta(\cdot)\big) - 2\big(\ln\sigma_2^2-\ln 2\big)^2 \\ &= \frac{1}{2}\left[\frac{\sigma_2^2}{\sigma_0^2} + \frac{(s_2-s_0)^2}{\sigma_0^2} - \ln\frac{\sigma_2^2}{\sigma_0^2} - 1\right] - 2\big(\ln\sigma_2^2-\ln 2\big)^2 \\ &= \frac{1}{2}\left[\frac{\sigma_2^2}{9} + \frac{s_2^2}{9} - \ln\frac{\sigma_2^2}{9} - 1\right] - 2\big(\ln\sigma_2^2-\ln 2\big)^2, \end{aligned} \tag{B.14}$$

where the second equality is due to the analytic formula for the Kullback-Leibler divergence between two univariate Gaussians, and the third equality follows upon simplification using $s_0 = 0$ and $\sigma_0^2 = 9$ for this problem. Substituting this into Bellman's equation (Equation 3.3) produces

$$\begin{aligned} J_1(x_1) &= \max_{d_1} \mathbb{E}_{y_1|x_1,d_1}\big[g_1(x_1,y_1,d_1) + J_2(F_1(x_1,y_1,d_1))\big] \\ &= \max_{d_1} \mathbb{E}_{y_1|x_1,d_1}\left[\frac{1}{2}\left(\frac{\sigma_2^2}{9} + \frac{s_2^2}{9} - \ln\frac{\sigma_2^2}{9} - 1\right) - 2\big(\ln\sigma_2^2-\ln 2\big)^2\right] \\ &= \max_{d_1} \mathbb{E}_{y_1|x_1,d_1}\left[\frac{1}{2}\left(\frac{1}{9}\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1} + \frac{1}{9}\left(\frac{y_1\sigma_1^2 d_1 + s_1}{\sigma_1^2 d_1^2+1}\right)^2 - \ln\left(\frac{1}{9}\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1}\right) - 1\right) - 2\left(\ln\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1} - \ln 2\right)^2\right] \\ &= \max_{d_1}\left\{\frac{1}{2}\left(\frac{1}{9}\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1} + \frac{1}{9}\,\mathbb{E}_{y_1|x_1,d_1}\left[\frac{y_1^2\sigma_1^4 d_1^2 + s_1^2 + 2y_1\sigma_1^2 d_1 s_1}{\big(\sigma_1^2 d_1^2+1\big)^2}\right] - \ln\left(\frac{1}{9}\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1}\right) - 1\right) - 2\left(\ln\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1} - \ln 2\right)^2\right\}, \end{aligned} \tag{B.15}$$

where we have substituted for $s_2$ and $\sigma_2^2$ using the analytic formulas from Equation 7.2, and have also made use of $g_1 = 0$ and $\sigma_\epsilon^2 = 1$ for this problem. The next step requires taking the expectation with respect to $y_1|x_1,d_1$; this is equivalent to taking the expectation with respect to $y_1|s_1,\sigma_1^2,d_1$, since $x_1$ is completely described by its mean and variance in this conjugate Gaussian setting. With the intention of using the linearity of expectation and applying the expectation term by term, we develop the identities

$$\begin{aligned} \mathbb{E}_{y_1|s_1,\sigma_1^2,d_1}[y_1] &= \int_{-\infty}^{+\infty} y_1 f(y_1|s_1,\sigma_1^2,d_1)\, dy_1 \\ &= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} y_1 f(y_1,\theta|s_1,\sigma_1^2,d_1)\, dy_1\, d\theta \\ &= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} y_1 f(y_1|\theta,s_1,\sigma_1^2,d_1) f(\theta|s_1,\sigma_1^2,d_1)\, dy_1\, d\theta \\ &= \int_{-\infty}^{+\infty} d_1\theta\, f(\theta|s_1,\sigma_1^2,d_1)\, d\theta = d_1 s_1, \end{aligned} \tag{B.16}$$

and

$$\begin{aligned} \mathbb{E}_{y_1|s_1,\sigma_1^2,d_1}[y_1^2] &= \int_{-\infty}^{+\infty} y_1^2 f(y_1|s_1,\sigma_1^2,d_1)\, dy_1 \\ &= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} y_1^2 f(y_1,\theta|s_1,\sigma_1^2,d_1)\, dy_1\, d\theta \\ &= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} y_1^2 f(y_1|\theta,s_1,\sigma_1^2,d_1) f(\theta|s_1,\sigma_1^2,d_1)\, dy_1\, d\theta \\ &= \int_{-\infty}^{+\infty} \big(\sigma_\epsilon^2 + d_1^2\theta^2\big) f(\theta|s_1,\sigma_1^2,d_1)\, d\theta \\ &= \sigma_\epsilon^2 + d_1^2\big(\sigma_1^2 + s_1^2\big) = 1 + d_1^2\big(\sigma_1^2+s_1^2\big), \end{aligned} \tag{B.17}$$

where we have applied the property $\mathrm{Var}(y) = \mathbb{E}[y^2] - (\mathbb{E}[y])^2$ twice, and the last equality uses $\sigma_\epsilon^2 = 1$. Substituting Equations B.16 and B.17 into Equation B.15 then yields

$$J_1(x_1) = \max_{d_1}\left\{\frac{1}{2}\left(\frac{1}{9}\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1} + \frac{1}{9}\frac{\big(1+d_1^2(\sigma_1^2+s_1^2)\big)\sigma_1^4 d_1^2 + s_1^2 + 2\sigma_1^2 d_1^2 s_1^2}{\big(\sigma_1^2 d_1^2+1\big)^2} - \ln\left(\frac{1}{9}\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1}\right) - 1\right) - 2\left(\ln\frac{\sigma_1^2}{\sigma_1^2 d_1^2+1} - \ln 2\right)^2\right\} \tag{B.18}$$
$$\equiv \max_{d_1} \bar{J}_1(x_1,d_1). \tag{B.19}$$

To find the optimal $d_1$, we take the partial derivative of $\bar{J}_1$ with respect to $d_1$ and set it to zero, obtaining three stationary points: $0$ and $\pm\sqrt{\frac{e^{1/8}\sigma_1^2 - 2}{2\sigma_1^2}}$. The only feasible candidate in the design space is $+\sqrt{\frac{e^{1/8}\sigma_1^2 - 2}{2\sigma_1^2}}$, and it needs to be checked for global optimality along with the boundary values. This is a hideous process involving verification of second-derivative properties over different regions of $x_1$, and we omit the details here. Nonetheless, the optimum can ultimately be shown to be $d_1^* = +\sqrt{\frac{e^{1/8}\sigma_1^2 - 2}{2\sigma_1^2}}$. Substituting for $\sigma_1^2$ using Equation 7.2 and $s_0 = 0$, $\sigma_0^2 = 9$, and $\sigma_\epsilon^2 = 1$ from this problem, we arrive at the final relationship between $d_0^*$ and $d_1^*$, which is exactly Equation B.12.

Bibliography

[1] M.
Abramowitz and I. A. Stegun. Handbook of Mathematical Functions With Formulas, Graphs, and Mathematical Tables. U.S. Department of Commerce, NIST (National Institute of Standards and Technology), Washington, DC, 1972. [2] A. K. Agarwal and M. L. Brisk. Sequential Experimental Design for Precise Parameter Estimation. 2. Design Criteria. Industrial & Engineering Chemistry Process Design and Development, 24(1):207–210, 1985. [3] S. Ahmed and A. Shapiro. The Sample Average Approximation Method for Stochastic Programs with Integer Recourse. Technical report, Georgia Institute of Technology, 2002. [4] N. M. Alexandrov, R. M. Lewis, C. R. Gumbert, L. L. Green, and P. A. Newman. Approximation and Model Management in Aerodynamic Optimization with VariableFidelity Models. Journal of Aircraft, 38(6):1093–1101, 2001. [5] L. Ambrosio and N. Gigli. A User’s Guide to Optimal Transport. In Modelling and Optimisation of Flows on Networks, pages 1–155. Springer Berlin Heidelberg, Berlin, Germany, 2013. [6] B. Amzal, F. Y. Bois, E. Parent, and C. P. Robert. Bayesian-Optimal Design via Interacting Particle Systems. Journal of the American Statistical Association, 101(474):773–785, 2006. [7] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis. Springer New York, New York, NY, 2007. [8] M. Athans. The Role and Use of the Stochastic Linear-Quadratic-Gaussian Problem in Control System Design. IEEE Transactions on Automatic Control (Institute of Electrical and Electronics Engineers), 16(6):529–552, 1971. [9] A. C. Atkinson and A. N. Donev. Optimum Experimental Designs. Oxford University Press, New York, NY, 1992. [10] F. Augustin and Y. M. Marzouk. NOWPAC: A provably convergent derivative-free nonlinear optimizer with path-augmented constraints. arXiv preprint arXiv:1403.1931, 2014. [11] D. L. Baulch, C. T. Bowman, C. J. Cobos, R. A. Cox, T. Just, J. A. Kerr, T. Murrells, M. J. Pilling, D. Stocker, J. Troe, W. Tsang, R. W. Walker, and J. Warnatz. Evaluated Kinetic Data for Combustion Modeling: Supplement II. Journal of Physical and Chemical Reference Data, 34(3):757–1397, 2005. 175 [12] D. L. Baulch, C. J. Cobos, R. A. Cox, P. Frank, G. Hayman, T. Just, J. A. Kerr, T. Murrells, M. J. Pilling, J. Troe, R. W. Walker, and J. Warnatz. Evaluated Kinetic Data for Combustion Modeling: Supplement I. Journal of Physical and Chemical Reference Data, 23(6):847–1033, 1994. [13] R. Bellman. Bottleneck Problems and Dynamic Programming. Proceedings of the National Academy of Sciences of the United States of America, 39(9):947–951, 1953. [14] R. Bellman. Dynamic Programming and Lagrange Multipliers. Proceedings of the National Academy of Sciences of the United States of America, 42(10):767–769, 1956. [15] I. Ben-Gal and M. Caramanis. Sequential DOE via dynamic programming. IIE Transactions (Institute of Industrial Engineers), 34(12):1087–1100, 2002. [16] M. Benisch, A. Greenwald, V. Naroditskiy, and M. C. Tschantz. A Stochastic Programming Approach to Scheduling in TAC SCM. In Proceedings of the 5th ACM Conference on Electronic Commerce (Association of Computing Machinery), pages 152–159, New York, NY, 2004. [17] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximation. Springer Berlin Heidelberg, Berlin, Germany, 1990. [18] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer New York, New York, NY, 1985. [19] G. Berkooz, P. Holmes, and J. L. Lumley. The Proper Orthogonal Decomposition in the Analysis of Turbulent Flows. 
Annual Review of Fluid Mechanics, 25:539–575, 1993. [20] P. Bernard and B. Buffoni. Optimal mass transportation and Mather theory. Journal of the European Mathematical Society, 9(1):85–121, 2007. [21] D. A. Berry, P. Müller, A. P. Grieve, M. Smith, T. Parke, R. Blazek, N. Mitchard, and M. Krams. Adaptive Bayesian Designs for Dose-Ranging Drug Trials. In Case studies in Bayesian statistics, pages 99–181. Springer New York, New York, NY, 2002. [22] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. 1. Athena Scientific, Belmont, MA, 2005. [23] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. 2. Athena Scientific, Belmont, MA, 2007. [24] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996. [25] D. Blackwell. Comparison of Experiments. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, pages 93–102, Berkeley, CA, 1951. [26] D. Blackwell. Equivalent Comparisons of Experiments. The Annals of Mathematical Statistics, 24(2):265–272, 1953. [27] N. Bonnotte. From Knothe’s Rearrangement to Brenier’s Optimal Transport Map. SIAM Journal on Mathematical Analysis (Society for Industrial and Applied Mathematics), 45(1):64–87, 2013. 176 [28] A. J. Booker, J. E. Dennis, P. D. Frank, D. B. Serafini, V. Torczon, and M. W. Trosset. A rigorous framework for optimization of expensive functions by surrogates. Structural Optimization, 17(1):1–13, 1999. [29] G. E. P. Box. Science and Statistics. Journal of the American Statistical Association, 71(356):791–799, 1976. [30] G. E. P. Box. Sequential Experimentation and Sequential Assembly of Designs. Quality Engineering, 5(2):321–330, 1992. [31] G. E. P. Box and N. R. Draper. Empirical Model-Building and Response Surfaces. John Wiley & Sons, Hoboken, NJ, 1987. [32] G. E. P. Box, J. S. Hunter, and W. G. Hunter. Statistics for Experimenters: Design, Innovation and Discovery. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2005. [33] G. E. P. Box and H. L. Lucas. Design of Experiments in Non-Linear Situations. Biometrika, 46(1-2):77–90, 1959. [34] S. J. Bradtke and A. G. Barto. Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning, 22(1-3):33–57, 1996. [35] Y. Brenier. Polar Factorization and Monotone Rearrangement of Vector-Valued Functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991. [36] S. Bringezu, H. Schütz, M. O’Brien, L. Kauppi, R. W. Howarth, and J. McNeely. Towards Sustainable Production and Use of Resources: Assessing Biofuels. Technical report, United Nations Environment Programme, 2009. [37] A. E. Brockwell and J. B. Kadane. A Gridding Method for Bayesian Sequential Decision Problems. Journal of Computational and Graphical Statistics, 12(3):566–584, 2003. [38] T. Bui-Thanh, K. Willcox, and O. Ghattas. Model Reduction for Large-Scale Systems with High-Dimensional Parametric Input Space. SIAM Journal on Scientific Computing (Society for Industrial and Applied Mathematics), 30(6):3270–3288, 2008. [39] R. H. Cameron and W. T. Martin. The Orthogonal Development of Non-Linear Functionals in Series of Fourier-Hermite Functionals. The Annals of Mathematics, 48(2):385–392, 1947. [40] G. Carlier, A. Galichon, and F. Santambrogio. From Knothe’s Transport to Brenier’s Map and a Continuation Method for Optimal Transport. SIAM Journal on Mathematical Analysis (Society for Industrial and Applied Mathematics), 41(6):2554–2576, 2010. [41] P. Carlin, Bradley, J. B. Kadane, and A. E. Gelfand. 
Approaches for Optimal Sequential Decision Analysis in Clinical Trials. Biometrics, 54(3):964–975, 1998. [42] D. R. Cavagnaro, J. I. Myung, M. A. Pitt, and J. V. Kujala. Adaptive Design Optimization: A Mutual Information-Based Approach to Model Discrimination in Cognitive Science. Neural Computation, 22(4):887–905, 2010. 177 [43] K. Chaloner and I. Verdinelli. Bayesian Experimental Design: A Review. Statistical Science, 10(3):273–304, 1995. [44] T. Champion and L. De Pascale. The Monge Problem in Rˆd. Duke Mathematical Journal, 157(3):551–572, 2011. [45] P. Chaudhuri and P. A. Mykland. Nonlinear Experiments: Optimal Design and Inference Based on Likelihood. Journal of the American Statistical Association, 88(422):538–546, 1993. [46] H. Chen and B. W. Schmeiser. Retrospective Approximation Algorithms for Stochastic Root Finding. In Proceedings of the 1994 Winter Simulation Conference, pages 255– 261, Lake Buena Vista, FL, 1994. [47] H. Chen and B. W. Schmeiser. Stochastic root finding via retrospective approximation. IIE Transactions (Institute of Industrial Engineers), 33(3):259–275, 2001. [48] J. A. Christen and M. Nakamura. Sequential Stopping Rules for Species Accumulation. Journal of Agricultural, Biological & Environmental Statistics, 8(2):184–195, 2003. [49] Y. Chu and J. Hahn. Integrating Parameter Selection with Experimental Design Under Uncertainty for Nonlinear Dynamic Systems. AIChE Journal (American Institute of Chemical Engineers), 54(9):2310–2320, 2008. [50] C. W. Clenshaw and A. R. Curtis. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, 1960. [51] M. A. Clyde. Bayesian Optimal Designs for Approximate Normality. PhD thesis, University of Minnesota, 1993. [52] M. A. Clyde, P. Müller, and G. Parmigiani. Exploring Expected Utility Surfaces by Markov Chains. Technical report, Duke University, 1996. [53] P. R. Conrad and Y. M. Marzouk. Adaptive Smolyak Pseudospectral Approximations. SIAM Journal on Scientific Computing (Society for Industrial and Applied Mathematics), 35(6):A2643–A2670, 2013. [54] P. G. Constantine, M. S. Eldred, and E. T. Phipps. Sparse pseudospectral approximation method. Computer Methods in Applied Mechanics and Engineering, 229-232:1–12, 2012. [55] T. A. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2006. [56] D. R. Cox and N. Reid. The Theory of the Design of Experiments. Chapman & Hall/CRC, Boca Raton, FL, 2000. [57] C. Darken and J. E. Moody. Note on Learning Rate Schedules for Stochastic Optimization. In Advances in Neural Information Processing Systems 3, pages 832–838, Denver, CO, 1990. [58] D. F. Davidson and R. K. Hanson. Interpreting Shock Tube Ignition Data. International Journal of Chemical Kinetics, 36(9):510–523, 2004. 178 [59] B. J. Debusschere, H. N. Najm, P. P. Pébay, O. M. Knio, R. G. Ghanem, and O. P. Le Maître. Numerical Challenges in the Use of Polynomial Chaos Representations for Stochastic Processes. SIAM Journal on Scientific Computing (Society for Industrial and Applied Mathematics), 26(2):698–719, 2004. [60] M. H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, Hoboken, NJ, 2004. [61] H. A. Dror and D. M. Steinberg. Sequential Experimental Designs for Generalized Linear Models. Journal of the American Statistical Association, 103(481):288–298, 2008. [62] C. C. Drovandi, J. M. McGree, and A. N. Pettitt. Sequential Monte Carlo for Bayesian sequentially designed experiments for discrete data. 
Computational Statistics and Data Analysis, 57:320–335, 2013. [63] C. C. Drovandi, J. M. McGree, and A. N. Pettitt. A Sequential Monte Carlo Algorithm to Incorporate Model Uncertainty in Bayesian Sequential Design. Journal of Computational and Graphical Statistics, 23(1):3–24, 2014. [64] T. A. El Moselhy and Y. M. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012. [65] M. S. Eldred, A. A. Giunta, and S. S. Collis. Second-Order Corrections for SurrogateBased Optimization with Model Hierarchies. In 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference (American Institute of Aeronautics and Astronautics, International Society of Structural and Multidisciplinary Optimization), Albany, NY, 2004. [66] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, NY, 1972. [67] C. Feng. Optimal Bayesian experimental design in the presence of model error. Master’s thesis, Massachusetts Institute of Technology, 2015. [68] D. Feyel and A. S. Üstünel. Monge-Kantorovitch Measure Transportation and MongeAmpère Equation on Wiener Space. Probability Theory and Related Fields, 128(3):347– 385, 2004. [69] R. A. Fisher. The Design of Experiments. Oliver & Boyd, Edinburgh, United Kingdom, 8th edition, 1966. [70] I. Ford, D. M. Titterington, and C. P. Kitsos. Recent Advances in Nonlinear Experimental Design. Technometrics, 31(1):49–60, 1989. [71] M. Frangos, Y. M. Marzouk, K. Willcox, and B. van Bloemen Waanders. Surrogate and Reduced-Order Modeling: A Comparison of Approaches for Large-Scale Statistical Inverse Problems. In Large-Scale Inverse Problems and Quantification of Uncertainty, pages 123–149. John Wiley & Sons, Chichester, United Kingdom, 2010. [72] M. Frenklach. Transforming data into knowledge-Process Informatics for combustion chemistry. Proceedings of the Combustion Institute, 31(1):125–140, 2007. 179 [73] T. Gerstner and M. Griebel. Dimension-Adaptive Tensor-Product Quadrature. Computing, 71(1):65–87, 2003. [74] R. G. Ghanem and P. D. Spanos. Stochastic Finite Elements: A Spectral Approach. Springer New York, New York, NY, 1st edition, 1991. [75] J. Ginebra. On the Measure of the Information in a Statistical Experiment. Bayesian Analysis, 2(1):167–212, 2007. [76] P. Glasserman. Gradient Estimation via Perturbation Analysis. Kluwer Academic Publishers, Boston, MA, 1991. [77] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM (Association of Computing Machinery), 33(10):75–84, 1990. [78] G. J. Gordon. Stable Function Approximation in Dynamic Programming. In Proceedings of the 12th International Conference on Machine Learning, pages 261–268, Tahoe City, CA, 1995. [79] A. Greenwald, B. Guillemette, V. Naroditskiy, and M. C. Tschantz. Scaling Up the Sample Average Approximation Method for Stochastic Optimization with Applications to Trading Agents. In Agent-Mediated Electronic Commerce. Designing Trading Agents and Mechanisms, pages 187–199. Springer Berlin Heidelberg, Berlin, Germany, 2006. [80] T. Guest and A. Curtis. Iteratively constructive sequential design of experiments and surveys with nonlinear parameter-data relationships. Journal of Geophysical Research, 114(B04307):1–14, Apr. 2009. [81] C. Guestrin, A. Krause, and A. P. Singh. Near-Optimal Sensor Placements in Gaussian Processes. In Proceedings of the 22nd International Conference on Machine Learning, pages 265–272, Bonn, Germany, 2005. [82] G. Gürkan, A. Y. Özge, and S. M. Robinson. 
Sample-Path Optimization in Simulation. In Proceedings of the 1994 Winter Simulation Conference, pages 247–254, Lake Buena Vista, FL, 1994. [83] I. Guyon, M. Nikravesh, S. Gunn, and L. A. Zadeh. Feature Extraction: Foundations and Applications. Springer Berlin Heidelberg, Berlin, Germany, 2006. [84] M. Hamada, H. F. Martz, C. S. Reese, and A. G. Wilson. Finding Near-Optimal Bayesian Experimental Designs via Genetic Algorithms. The American Statistician, 55(3):175–181, 2001. [85] K. Healy and L. W. Schruben. Retrospective Simulation Response Optimization. In Proceedings of the 1991 Winter Simulation Conference, pages 901–906, Phoenix, AZ, 1991. [86] D. A. Hickman and L. D. Schmidt. Production of Syngas by Direct Catalytic Oxidation of Methane. Science, 259(5093):343–346, 1993. [87] Y. C. Ho and X. Cao. Perturbation Analysis and Optimization of Queueing Networks. Journal of Optimization Theory and Applications, 40(4):559–582, 1983. 180 [88] S. Hosder, R. Walters, and R. Perez. A non-intrusive polynomial chaos method for uncertainty propagation in CFD simulations. In Proceedings of the 44th AIAA Aerospace Sciences Meeting and Exhibit (American Institute of Aeronautics and Astronautics), Reno, NV, 2006. [89] X. Huan. Accelerated Bayesian Experimental Design for Chemical Kinetic Models. Master’s thesis, Massachusetts Institute of Technology, 2010. [90] X. Huan and Y. M. Marzouk. Simulation-based optimal Bayesian experimental design for nonlinear systems. Journal of Computational Physics, 232(1):288–317, 2013. [91] X. Huan and Y. M. Marzouk. Gradient-Based Stochastic Optimization Methods in Bayesian Experimental Design. International Journal for Uncertainty Quantification, 4(6):479–510, 2014. [92] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998. [93] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237–285, 1996. [94] M. C. Kennedy and A. O’Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425–464, 2001. [95] J. Kiefer and J. Wolfowitz. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics, 23(3):462–466, 1952. [96] W. Kim, M. A. Pitt, Z.-L. Lu, M. Steyvers, and J. I. Myung. A Hierarchical Adaptive Approach to Optimal Experimental Design. Neural Computation, 26:2565–2492, 2014. [97] A. J. Kleywegt, A. Shapiro, and T. Homem-de Mello. The Sample Average Approximation Method for Stochastic Discrete Optimization. SIAM Journal on Optimization (Society for Industrial and Applied Mathematics), 12(2):479–502, 2002. [98] H. Knothe. Contributions to the Theory of Convex Bodies. The Michigan Mathematical Journal, 4(1):39–52, 1957. [99] A. Krause and C. Guestrin. Near-optimal Observation Selection Using Submodular Functions. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (Association for the Advancement of Artificial Intelligence), pages 1650–1654, Vancouver, Canada, 2007. [100] A. Krause, J. Leskovec, C. Guestrin, J. VanBriesen, and C. Faloutsos. Efficient Sensor Placement Optimization for Securing Large Water Distribution Networks. Journal of Water Resources Planning and Management, 134(6):516–526, 2008. [101] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces. 
In Proceedings of Robotics: Science and Systems, 2008, pages 65–72, Zurich, Switzerland, 2008. [102] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer New York, New York, NY, 2nd edition, 2003. 181 [103] M. Lagoudakis. Least-Squares Policy Iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003. [104] O. P. Le Maître and O. M. Knio. Spectral Methods for Uncertainty Quantification: with Applications to Computational Fluid Dynamics. Springer Netherlands, Houten, Netherlands, 2010. [105] D. V. Lindley. On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956. [106] D. V. Lindley. Bayesian Statistics: A Review. SIAM (Society for Industrial and Applied Mathematics), Philadelphia, PA, 1972. [107] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Springer US, New York, NY, 1998. [108] T. J. Loredo. Rotating Stars and Revolving Planets: Bayesian Exploration of the Pulsating Sky. In Bayesian Statistics 9: Proceedings of the Nineth Valencia International Meeting, pages 361–392, Benidorm, Spain, 2010. [109] T. J. Loredo and D. F. Chernoff. Bayesian Adaptive Exploration. In Statistical Challenges in Astronomy, pages 57–70. Springer New York, New York, NY, 2003. [110] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, United Kingdom, 4th edition, 2005. [111] W.-K. Mak, D. P. Morton, and R. K. Wood. Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Research Letters, 24(1-2):47–56, 1999. [112] Y. M. Marzouk and H. N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862–1902, 2009. [113] Y. M. Marzouk, H. N. Najm, and L. A. Rahn. Stochastic spectral methods for efficient Bayesian solution of inverse problems. Journal of Computational Physics, 224(2):560– 586, 2007. [114] Y. M. Marzouk and D. Xiu. A Stochastic Collocation Approach to Bayesian Inference in Inverse Problems. Communications in Computational Physics, 6(4):826–847, 2009. [115] R. J. McCann. Existence and Uniqueness of Monotone Measure-Preserving Maps. Duke Mathematical Journal, 80(2):309–323, 1995. [116] G. Monge. Mémoire sur la théorie des déblais et de remblais. In Histoire de l’Académie Royale des Sciences de Paris, avec les Mémoires de Mathématique et de Physique pour la même année, pages 666–704. De l’Imprimerie Royale, Paris, France, 1781. [117] S. Mosbach, A. Braumann, P. L. W. Man, C. A. Kastner, G. P. E. Brownbridge, and M. Kraft. Iterative improvement of Bayesian parameter estimates for an engine model by means of experimental design. Combustion and Flame, 159(3):1303–1313, 2012. [118] P. Müller. Simulation Based Optimal Design. Handbook of Statistics, 25:509–518, 2005. 182 [119] P. Müller, D. A. Berry, A. P. Grieve, M. Smith, and M. Krams. Simulation-based sequential Bayesian design. Journal of Statistical Planning and Inference, 137(10):3140– 3150, 2007. [120] P. Müller and G. Parmigiani. Optimal Design via Curve Fitting of Monte Carlo Experiments. Journal of the American Statistical Association, 90(432):1322–1330, 1995. [121] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian Design by Inhomogeneous Markov Chain Simulation. Journal of the American Statistical Association, 99(467):788–798, 2004. [122] S. A. Murphy. Optimal dynamic treatment regimes. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–366, 2003. [123] H. N. Najm. Uncertainty Quantification and Polynomial Chaos Techniques in Computational Fluid Dynamics. Annual Review of Fluid Mechanics, 41:35–52, 2009. [124] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965. [125] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization (Society for Industrial and Applied Mathematics), 19(4):1574–1609, 2009. [126] J. Nocedal and S. J. Wright. Numerical Optimization. Springer New York, New York, NY, 2006. [127] V. Norkin, G. Pflug, and A. Ruszczynski. A branch and bound method for stochastic global optimization. Mathematical Programming, 83(1-3):425–450, 1998. [128] I. Olkin and F. Pukelsheim. The Distance between Two Random Vectors with Given Dispersion Matrices. Linear Algebra and its Applications, 48:257–263, 1982. [129] D. Ormoneit and S. Sen. Kernel-Based Reinforcement Learning. Machine Learning, 49(2-3):161–178, 2002. [130] G. Parmigiani and L. Y. T. Inoue. Decision Theory: Principles and Approaches. John Wiley & Sons, West Sussex, United Kingdom, 2009. [131] M. D. Parno. Transport maps for accelerated Bayesian computation. PhD thesis, Massachusetts Institute of Technology, 2015. [132] M. D. Parno and Y. M. Marzouk. Transport map accelerated Markov chain Monte Carlo. arXiv preprint arXiv:1412.5492, 2015. [133] B. D. Phenix, J. L. Dinaro, M. A. Tatang, J. W. Tester, J. B. Howard, and G. J. Mcrae. Incorporation of Parametric Uncertainty into Complex Kinetic Mechanisms: Application to Hydrogen Oxidation in Supercritical Water. Combustion and Flame, 112(1-2):132–146, 1998. [134] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. International Joint Conference on Artificial Intelligence, 3:1025– 1032, 2003. 183 [135] B. T. Polyak and A. B. Juditsky. Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization (Society for Industrial and Applied Mathematics), 30(4):838–855, 1992. [136] J. Porta and N. Vlassis. Point-Based Value Iteration for Continuous POMDPs. The Journal of Machine Learning Research, 7:2329–2367, 2006. [137] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2011. [138] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Hoboken, NJ, 1994. [139] A. J. Ragauskas, C. K. Williams, B. H. Davison, G. Britovsek, J. Cairney, C. A. Eckert, W. J. Frederick, J. P. Hallett, D. J. Leak, C. L. Liotta, J. R. Mielenz, R. Murphy, R. Templer, and T. Tschaplinski. The Path Forward for Biofuels and Biomaterials. Science, 311(5760):484–489, 2006. [140] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006. [141] M. T. Reagan, H. N. Najm, R. G. Ghanem, and O. M. Knio. Uncertainty quantification in reacting-flow simulations through non-intrusive spectral projection. Combustion and Flame, 132(3):545–555, 2003. [142] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. [143] M. Rosenblatt. Remarks on a Multivariate Transformation. The Annals of Mathematical Statistics, 23(3):470–472, 1952. [144] T. Russi, A. Packard, R. Feeley, and M. Frenklach. 
Sensitivity Analysis of Uncertainty in Model Prediction. The Journal of Physical Chemistry A, 112(12):2579–2588, 2008. [145] K. J. Ryan. Estimating Expected Information Gains for Experimental Designs With Application to the Random Fatigue-Limit Model. Journal of Computational and Graphical Statistics, 12(3):585–603, 2003. [146] T. J. Santner, B. J. Williams, and W. I. Notz. The Design and Analysis of Computer Experiments. Springer New York, New York, NY, 2003. [147] P. Schütz, A. Tomasgard, and S. Ahmed. Supply chain design under uncertainty using sample average approximation and dual decomposition. European Journal of Operational Research, 199(2):409–419, 2009. [148] P. Sebastiani and H. P. Wynn. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):145–157, 2000. [149] A. Shapiro. Asymptotic Analysis of Stochastic Programs. Annals of Operations Research, 30(1):169–186, 1991. [150] A. Shapiro. Stochastic Programming by Monte Carlo Simulation Methods. Technical report, Georgia Institute of Technology, 2003. 184 [151] A. Shapiro and A. Philpott. A Tutorial on Stochastic Programming. Technical report, Georgia Institute of Technology, 2007. [152] D. A. Shea and S. A. Lister. The BioWatch Program: Detection of Bioterrorism. Technical report, Congressional Research Service Report, 2003. [153] S. Sherman. On a theorem of Hardy, Littlewood, Polya, and Blackwell. Proceedings of the National Academy of Sciences, 37(12):826–831, 1951. [154] O. Sigaud and O. Buffet. Markov Decision Processes in Artificial Intelligence: MDPs, beyond MDPs and applications. John Wiley & Sons, Hoboken, NJ, 2010. [155] L. Sirovich. Turbulence and the Dynamics of Coherent Structures, Part I: Coherent Structures. Quarterly of applied mathematics, 45(3):561–571, 1987. [156] D. S. Sivia and J. Skilling. Data Analysis: A Bayesian Tutorial. Oxford University Press, New York, NY, 2nd edition, 2006. [157] R. D. Smallwood and E. J. Sondik. The Optimal Control of Partially Observable Markov Processes Over a Finite Horizon. Operations Research, 21(5):1071–1088, 1973. [158] A. Solonen, H. Haario, and M. Laine. Simulation-Based Optimal Design Using a Response Variance Criterion. Journal of Computational and Graphical Statistics, 21(1):234–252, 2012. [159] E. J. Sondik. The optimal control of partially observable Markov processes. PhD thesis, Stanford University, 1971. [160] J. C. Spall. Accelerated Second-Order Stochastic Optimization Using Only Function Measurements. In Proceedings of the 36th IEEE Conference on Decision and Control (Institute of Electrical and Electronics Engineers), pages 1417–1424, San Diego, CA, 1997. [161] J. C. Spall. Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization. IEEE Transactions on Aerospace and Electronic Systems (Institute of Electrical and Electronics Engineers), 34(3):817–823, 1998. [162] C. Stein. Notes on a seminar on theoretical statistics; Comparison of experiments. Technical report, University of Chicago, 1951. [163] R. S. Sutton. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3(1):9–44, 1988. [164] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. [165] C. Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool, San Rafael, CA, 2010. [166] G. Terejanu, R. R. Upadhyay, and K. Miki. 
Bayesian experimental design for the active nitridation of graphite by atomic nitrogen. Experimental Thermal and Fluid Science, 36:178–193, 2012. 185 [167] G. Tesauro and G. R. Galperin. On-line Policy Improvement using Monte Carlo Search. In Advances in Neural Information Processing Systems 9, pages 1068–1074, Denver, CO, 1996. [168] J. N. Tsitsiklis and B. Van Roy. Regression Methods for Pricing Complex AmericanStyle Options. IEEE Transactions on Neural Networks (Institute of Electrical and Electronics Engineers), 12(4):694–703, 2001. [169] J. van den Berg, A. Curtis, and J. Trampert. Optimal nonlinear Bayesian experimental design: an application to amplitude versus offset experiments. Geophysical Journal International, 155(2):411–421, Nov. 2003. [170] B. Verweij, S. Ahmed, A. J. Kleywegt, G. Nemhauser, and A. Shapiro. The Sample Average Approximation Method Applied to Stochastic Routing Problems: A Computational Study. Computational Optimization and Applications, 24(2):289–333, 2003. [171] C. Villani. Optimal Transport: Old and New. Springer-Verlag Berlin Heidelberg, Berlin, Germany, 2008. [172] U. Von Toussaint. Bayesian inference in physics. Reviews of Modern Physics, 83:943– 999, 2011. [173] R. W. Walters. Towards Stochastic Fluid Mechanics via Polynomial Chaos. In Proceedings of the 41st AIAA Aerospace Sciences Meeting and Exhibit (American Institute of Aeronautics and Astronautics), Reno, NV, 2003. [174] J. K. Wathen and J. A. Christen. Implementation of Backward Induction for Sequentially Adaptive Clinical Trials. Journal of Computational and Graphical Statistics, 15(2):398–413, 2006. [175] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, King’s College, 1989. [176] C. J. C. H. Watkins and P. Dayan. Technical Note: Q-Learning. Machine Learning, 8(3-4):279–292, 1992. [177] B. P. Weaver, B. J. Williams, C. M. Anderson-Cook, and D. M. Higdon. Computational Enhancements to Bayesian Design of Experiments Using Gaussian Processes. Bayesian Analysis, 2015. [178] N. Wiener. The Homogeneous Chaos. American Journal of Mathematics, 60(4):897– 936, 1938. [179] D. Xiu. Fast Numerical Methods for Stochastic Computations: A Review. Communications in Computational Physics, 5(2-4):242–272, 2009. [180] D. Xiu and G. E. Karniadakis. The Wiener-Askey Polynomial Chaos for Stochastic Differential Equations. SIAM Journal on Scientific Computing (Society for Industrial and Applied Mathematics), 24(2):619–644, 2002. [181] D. Xiu and G. E. Karniadakis. A new stochastic approach to transient heat conduction modeling with uncertainty. International Journal of Heat and Mass Transfer, 46(24):4681–4693, 2003. 186