Minimization for generating conditional realizations from unconditional realizations

Dean Oliver, Uni Centre for Integrated Petroleum Research
17 June 2013

Organization
- Data assimilation — geoscience applications
  - Large numbers of parameters to be estimated (10^5 – 10^7)
  - Large amounts of data (10^3 – 10^5)
  - Evaluation of the likelihood function is expensive (~1 hr)
- Sample updating — Bayes' rule + another condition
- Examples (very small)
- Comments

Initial uncertainty in reservoir model parameters --(data)--> final uncertainty in reservoir model parameters
[Figure: a relatively simple prior pdf is transformed by the data into a relatively complex posterior pdf.]
Bayes' rule relates the prior pdf to the posterior pdf. For high model dimensions, Monte Carlo methods are generally needed for the integration.

Initial ensemble of reservoir model parameters --(data)--> ensemble of conditional reservoir model parameters
[Figure: scatter plots of the prior ensemble and of the conditional ensemble.]
Ensembles of particles represent the prior pdf and the posterior pdf. Expectations are computed by summing over the samples.

Sampling from the conditional pdf
[Figure: prior and conditional pdfs.]
Samples need to be distributed correctly. MCMC? Rejection/acceptance sampling? Both scale poorly in high model/data dimensions.

Update the ensemble of samples
[Figure: prior ensemble and posterior ensemble of samples.]
Each sample from the prior is updated, not resampled from the posterior. How should the update be done?

How to update samples?
"The Ensemble Kalman Filter (EnKF) [is] a continuous implementation of the Bayesian update rule" (Li and Caers, 2011).
"These methods use ... 'samples' that are drawn independently from the given initial distribution and assigned equal weights. ... When observations become available, Bayes' rule is applied either to individual samples ..." (Park and Xu, 2009).
Note: Bayes' rule explains how to update probabilities, but not how to update samples. (Particle filters do update the probability of each sample using Bayes' rule.)

Two transformations from prior to posterior pdf
    y = µy + Ly [Ly Σx Ly]^{-1/2} Ly (x − µx)
    y = µy + Ly Lx^{-1} (x − µx)
[Figure: the two maps applied to the same prior samples.]
Two equally valid transformations from prior to posterior for the linear-Gaussian problem. Bayes' rule is not sufficient to specify the transformation.
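To make the point concrete, here is a minimal numerical sketch (not part of the original slides) in which both maps are applied to the same Gaussian prior ensemble: each produces the correct posterior mean and covariance, yet they send individual samples to different places. The means and covariances used here are illustrative values only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sqrtm_sym(A):
    """Symmetric positive-definite square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

# Illustrative prior N(mu_x, Sigma_x) and target posterior N(mu_y, Sigma_y).
mu_x = np.array([0.0, 0.0]); Sigma_x = np.array([[1.0, 0.6], [0.6, 1.0]])
mu_y = np.array([0.5, -0.3]); Sigma_y = np.array([[0.4, 0.1], [0.1, 0.3]])

Lx, Ly = sqrtm_sym(Sigma_x), sqrtm_sym(Sigma_y)
X = rng.multivariate_normal(mu_x, Sigma_x, size=100_000)   # prior samples

# Map 1: optimal-transport map, y = mu_y + Ly [Ly Sigma_x Ly]^{-1/2} Ly (x - mu_x).
A1 = Ly @ np.linalg.inv(sqrtm_sym(Ly @ Sigma_x @ Ly)) @ Ly
# Map 2: "whiten, then re-colour", y = mu_y + Ly Lx^{-1} (x - mu_x).
A2 = Ly @ np.linalg.inv(Lx)

Y1 = mu_y + (X - mu_x) @ A1.T
Y2 = mu_y + (X - mu_x) @ A2.T

# Both transformed ensembles match the target covariance (to sampling error) ...
print(np.round(np.cov(Y1.T) - Sigma_y, 3), np.round(np.cov(Y2.T) - Sigma_y, 3), sep="\n")
# ... yet individual samples end up in different places under the two maps.
print(np.mean(np.linalg.norm(Y1 - Y2, axis=1)))
```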
Why it matters
1. If the prior is Gaussian and the posterior is Gaussian, then any linear transformation of variables that obtains the correct posterior mean and covariance is acceptable. (Any version of EnKF or EnSRF.)
2. What transformation should be used when the observation operator is nonlinear?
   2.1 Randomized maximum likelihood (Kitanidis, 1995; Oliver et al., 1996)
   2.2 Optimal map (Sei, 2011; El Moselhy and Marzouk, 2012)
   2.3 Implicit filters (Chorin et al., 2010; Morzfeld et al., 2012)
   2.4 Continuous data assimilation or multiple data assimilation (Reich, 2011; Emerick and Reynolds, 2013)
   2.5 Ensemble-based iterative filters/smoothers

Monge's transport problem
- Let x ∼ p(x) and let y = s(x); then y ∼ qs(y) = p(s^{-1}(y)) |det(Ds^{-1}(y))|.
- q(y) is the target pdf of the transformed variables.
- Solve for the optimal transformation s*(·):
    s* = arg min_s ∫ ||x − s(x)||² p(x) dx   such that qs(y) = q(y).

Example: Gaussian to Gaussian (Olkin and Pukelsheim, 1982; Knott and Smith, 1984; Rüschendorf, 1990)
- Prior pdf: p(x) = N(µx, Σx)
- Target pdf: q(y) = N(µy, Σy)
- y = s(x) := µy + Ly [Ly Σx Ly]^{-1/2} Ly (x − µx), where Lx = Σx^{1/2} and Ly = Σy^{1/2}.
- This map minimizes the expected value of the squared distance ||x − y||².

Speculation — why would the minimum-distance solution be desirable?
Obtaining a transformation with optimal transport properties may only be important insofar as it simplifies the sampling problem, except that the 'natural' optimal transport solution might be more robust to deviations from ideality. Examples include those in which samples from the prior are complex geological models, in which case making as small a change as possible might be beneficial.

Transformations between spaces of unequal dimensions
Let x ∈ R^N and y = s(x) ∈ R^M.
[Figure: data D plotted against X, Y, Z; the image Ay and the prior covariance Σx.]
Find s(·) such that ∫ ||x − A s(x)||²_{Σx} p(x) dx is minimized and x and y = s(x) are distributed as x ∼ N(µx, Σx) and y ∼ N(µy, Σy).

Note that if
    A = [ I ; H ]   and   Σx = [ Cx 0 ; 0 Cd ],
then
    y = (A^T Σx^{-1} A)^{-1} A^T Σx^{-1} [ x ; d ]
      = x + Cx H^T (H Cx H^T + Cd)^{-1} (d − Hx)
is an optimal mapping of [x d] to y. This is the perturbed-observation form of the EnKF (Burgers et al., 1998), or RML for a linear inverse problem.

Nonlinear transformations
Consider now the problem of determining the transformation s* that satisfies
    s* = arg min_s ∫ || [ x ; d ] − [ s(x, d) ; h(s(x, d)) ] ||²_{Σxd} p(x, d) dx dd
subject to y = s(x, d) being distributed as q(y). The prior is Gaussian but the data relationship is nonlinear.
[Figures: samples (x, d) from the prior, the transformed samples y = s(x), the image Ay, and the prior covariance Σx.]

Approximate solution
Consider the transformation defined by the solution of
    y* = arg min_y ( [ x* ; d* ] − [ y ; h(y) ] )^T [ Cx 0 ; 0 Cd ]^{-1} ( [ x* ; d* ] − [ y ; h(y) ] ).
If h is differentiable, then the variables x*, d*, and y* are related as
    x* = y* + Cx H*^T Cd^{-1} [ h(y*) − d* ],
where H* is the sensitivity of h evaluated at y*.

For small y − y*, the pdf of the transformed variables,
    log(qs(y)) ≈ −½ (y − µ)^T Cx^{-1} (y − µ) − ½ (h(y) − do)^T Cd^{-1} (h(y) − do) + u(δ̂, d*, y*) + log|J|,
is approximately equal to the target pdf; u(δ̂, d*, y*) comprises terms that do not depend on y. The Jacobian determinant of the transformation is
    J = det( ∂(x, d) / ∂(y, δ) ) = det [ I   −Cx H*^T Cd^{-1} ; H*   I ].
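A minimal sketch (not from the slides) of how this minimization-based transform might be implemented: each prior sample x* is paired with a perturbed datum d* and mapped to the minimizer y* of the quadratic objective above. The quadratic observation operator h(y) = y², the covariance values, and the use of SciPy's Nelder–Mead optimizer are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def transform_sample(x_star, d_star, h, Cx_inv, Cd_inv):
    """Map one prior sample x* (paired with a perturbed datum d*) to the minimizer
    y* of  (y - x*)^T Cx^{-1} (y - x*) + (h(y) - d*)^T Cd^{-1} (h(y) - d*)."""
    def objective(y):
        rx = y - x_star
        rd = h(y) - d_star
        return rx @ Cx_inv @ rx + rd @ Cd_inv @ rd
    return minimize(objective, x_star, method="Nelder-Mead").x

# Illustrative one-dimensional setup (assumed values, not from the slides).
rng = np.random.default_rng(1)
h = lambda y: np.array([y[0] ** 2])              # assumed nonlinear observation operator
mu_x, sigma_x, sigma_d, d_obs = 0.8, 1.0, 0.1, 1.0
Cx_inv = np.array([[1.0 / sigma_x ** 2]])
Cd_inv = np.array([[1.0 / sigma_d ** 2]])

y_samples = []
for _ in range(500):
    x_star = np.array([rng.normal(mu_x, sigma_x)])    # unconditional realization
    d_star = np.array([rng.normal(d_obs, sigma_d)])   # perturbed observation
    y_samples.append(transform_sample(x_star, d_star, h, Cx_inv, Cd_inv)[0])

print(np.mean(y_samples), np.std(y_samples))          # moments of the mapped ensemble
```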
Examples
[Figure: posterior pdfs for the four test problems.]
1. Univariate: the posterior pdf is unimodal but skewed.
2. Univariate: the posterior pdf is bimodal.
3. Bivariate: the posterior pdf is bimodal.
4. 1000-D: the posterior pdf has 4 modes.

Example 2: single variable, bimodal
Figure 1: (a) likelihood h(x) = d; (b) prior and posterior. The observation tells us that x ≈ ±1; the prior says that x ≈ 0.8. We will sample from the prior and transform to the posterior.

Figure 2: (a) y vs x; (b) y vs (x, d). Transformed samples from the approximate posterior vs samples from the prior, obtained from
    y* = arg min_y ( [ x* ; d* ] − [ y ; h(y) ] )^T [ Cx 0 ; 0 Cd ]^{-1} ( [ x* ; d* ] − [ y ; h(y) ] ).

Figure 3: (a) y vs x; (b) y vs (x, d). Transformed samples from the approximate posterior vs samples from the prior. The red curve shows the (correct) optimal map defined by minimizing the expected distance between x and y.

Figure 4: (a) σd = 0.1; (b) σd = 0.5. Distributions of transformed samples for two values of noise in the observation. Samples from the prior distribution (gray curve) were mapped using the minimization transformation. The blue solid curve shows the true pdf. The region between the modes is clearly under-sampled.

Figure 5: (a) σd = 0.1; (b) σd = 0.5. Same as the previous slide, but the dashed curve shows the product of the true pdf and the Jacobian determinant of the transformation.

Example 3: two variables, bimodal
1. The locations of the modes in 20 experiments were randomly sampled.
2. Bimodality is established through the nonlinearity of one observation, of x1² + x2².
3. The second observation is of a linear combination of the two variables.
4. Approximately 4000 samples were generated for each experiment.

[Figure: (a) Example A: modes relatively close, good sampling. (b) Example B: modes relatively distant, good sampling. (c) Example C: modes relatively close, poor sampling. (d) Example B: mapping of individual samples from prior to posterior.]

Example 3: quantitative comparison
[Figure: true pdf vs sampled distribution.]
For each example, we compute the local mean for each mode of the distribution and the total probability of each mode.
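These per-mode diagnostics can be computed by assigning each transformed sample to its nearest mode. A minimal sketch (an assumed implementation, not taken from the slides):

```python
import numpy as np

def mode_statistics(samples, mode_locations):
    """Assign each sample (row) to the nearest known mode, then return the empirical
    local mean and the empirical weight (fraction of samples) of each mode."""
    samples = np.asarray(samples, dtype=float)        # shape (n_samples, dim)
    modes = np.asarray(mode_locations, dtype=float)   # shape (n_modes, dim)
    dist = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    nearest = np.argmin(dist, axis=1)
    means = np.array([samples[nearest == k].mean(axis=0) for k in range(len(modes))])
    weights = np.bincount(nearest, minlength=len(modes)) / len(samples)
    return means, weights

# Illustrative usage with two assumed modes of a one-dimensional sampled distribution.
rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(-1.0, 0.1, 700), rng.normal(1.0, 0.1, 300)])
means, weights = mode_statistics(y[:, None], [[-1.0], [1.0]])
print(means.ravel(), weights)    # roughly [-1, 1] and [0.7, 0.3]
```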
Example 3: summary results
Figure 7: (a) Comparison of the sampled estimate of the local mean of each mode with the true local mean. (b) Comparison of the sampled estimate of the weight on one of the modes with the true weight. Approximately 4000 optimization-based samples are used to compute the approximate weights and means for both modes of the sampled distributions. Labeled points (A, B, C) refer to the experiments shown on the previous slide.

Example 4: 4 modes, high dimension
The prior probability density is Gaussian with independent variables of unit variance:
    x ∼ A exp[−0.5 x^T x].
The likelihood is a sum of delta functions with random weights:
    L[x|d] = Σ_{i=1}^{4} b_i δ(x − α_i),
so that the posterior pdf we wish to sample is
    p[x = α_i | d] ∝ b_i exp[−0.5 α_i^T α_i].
The α_i were normally distributed with mean 0 but small variance.

Example 4: random test pdf
[Figure: 1-D and 3-D projections of a random test pdf.]
Computations were carried out in dimensions as high as 2000.

Example 4: summary results
[Figure: total absolute error in probability vs dimension of the model space.]
1000 random samples of the α_i and b_i; 10,000 samples were used to compute the error for each set of α_i and b_i.

Key points
1. Bayes' rule does not specify how to update individual samples. Some other criterion is needed (such as 'minimum expected distance' or 'maximum correlation').
2. For some types of problems, the optimal transport problem (for the probability density) can be simplified by a careful choice of cost function.
3. RML sampling is approximate for nonlinear observation operators — samples should probably be weighted according to the Jacobian determinant of the transformation.

Remaining questions
1. Defining a cost function was key to getting a good approximation of the optimal transport solution — how can this be generalized to a non-Gaussian prior?
2. How can the Jacobian of the transformation be computed efficiently for weighting the updated samples? (A rough 1-D sketch follows below.)
3. How useful is this in ensemble-based approaches?
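As a rough illustration of key point 3 and remaining question 2 (not from the slides): if the density of the transformed samples is approximately the target pdf times |J|, as Figures 5 and 8 suggest, then self-normalized importance weights proportional to 1/|J| would correct expectations toward the target. The 1-D determinant 1 + (Cx/Cd) h'(y*)² follows from the block form of the Jacobian given earlier; the quadratic h, the sample values, and the weighting direction itself are assumptions of this sketch.

```python
import numpy as np

def jacobian_weights(y_star, h_prime, Cx, Cd):
    """Self-normalized weights proportional to 1/|J| for one-dimensional
    minimization-based samples, using |J| = 1 + (Cx/Cd) * h'(y*)^2, the 1-D
    determinant of the 2x2 Jacobian on the 'Approximate solution' slide."""
    J = 1.0 + (Cx / Cd) * h_prime(y_star) ** 2
    w = 1.0 / np.abs(J)
    return w / w.sum()

# Illustrative usage with an assumed quadratic observation operator h(y) = y^2.
y_star = np.array([-1.05, -0.98, 0.97, 1.02, 1.10])   # assumed transformed samples
w = jacobian_weights(y_star, h_prime=lambda y: 2.0 * y, Cx=1.0, Cd=0.01)
print(w)                       # samples where h is steeper are down-weighted
print(np.sum(w * y_star))      # weighted estimate of the posterior mean
```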
References

Burgers, G., van Leeuwen, P. J., and Evensen, G. (1998). Analysis scheme in the ensemble Kalman filter. Monthly Weather Review, 126(6):1719–1724.
Chorin, A. J., Morzfeld, M., and Tu, X. (2010). Implicit particle filters for data assimilation. Communications in Applied Mathematics and Computational Science, 5(2):221–240.
El Moselhy, T. A. and Marzouk, Y. M. (2012). Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850.
Emerick, A. A. and Reynolds, A. C. (2013). Ensemble smoother with multiple data assimilation. Computers & Geosciences, 55:3–15.
Kitanidis, P. K. (1995). Quasi-linear geostatistical theory for inversing. Water Resources Research, 31(10):2411–2419.
Knott, M. and Smith, C. S. (1984). On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43(1):39–49.
Li, H. and Caers, J. (2011). Geological modelling and history matching of multi-scale flow barriers in channelized reservoirs: methodology and application. Petroleum Geoscience, 17:17–34.
Morzfeld, M., Tu, X., Atkins, E., and Chorin, A. J. (2012). A random map implementation of implicit filters. Journal of Computational Physics, 231(4):2049–2066.
Oliver, D. S., He, N., and Reynolds, A. C. (1996). Conditioning permeability fields to pressure data. In European Conference for the Mathematics of Oil Recovery, V, pages 1–11.
Olkin, I. and Pukelsheim, F. (1982). The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257–263.
Reich, S. (2011). A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235–249.
Rüschendorf, L. (1990). Corrigendum. Journal of Multivariate Analysis, 34(1):156.
Sei, T. (2011). Gradient modeling for multivariate quantitative data. Annals of the Institute of Statistical Mathematics, 63:675–688.

Example 1: unimodal but skewed
[Figure: prior pdf for the model variable; observation operator d = h(x) = tanh(x).]
10,000 independent samples were drawn from the prior distribution and mapped to the posterior.

Figure 8: Samples from the Gaussian prior distribution (mean 1.5) were mapped using the minimization transformation, for σd = 0.010, 0.025, 0.050, 0.10, 0.25, and 0.50. The blue solid curve shows the true pdf. The dashed curve shows the product of the true pdf and the Jacobian determinant of the transformation.

Figure 9: (a) Absolute error in the estimate of the mean. (b) Absolute error in the estimate of the standard deviation. Comparison of empirical moments from 10,000 independent samples obtained by minimization-based sampling with the true moments. The magenta dots show 2 standard deviations of the error for samples of 100 realizations from the true distribution.
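As a closing illustration (not part of the slides), a compact end-to-end sketch of Example 1: prior samples are mapped by the scalar minimization with h(x) = tanh(x), and the empirical moments are compared against moments computed on a fine grid, in the spirit of Figure 9. The prior mean 1.5 and the tanh observation operator follow the slides; the other parameter values and the sample size are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
mu_x, sigma_x, sigma_d = 1.5, 0.5, 0.1   # sigma_x, sigma_d assumed for illustration
d_obs = np.tanh(1.0)                     # assumed observed datum

def map_one(x_star, d_star):
    """Minimization-based transform of one prior sample / perturbed-datum pair."""
    obj = lambda y: ((y - x_star) / sigma_x) ** 2 + ((np.tanh(y) - d_star) / sigma_d) ** 2
    return minimize_scalar(obj, bounds=(mu_x - 6 * sigma_x, mu_x + 6 * sigma_x),
                           method="bounded").x

y = np.array([map_one(rng.normal(mu_x, sigma_x), rng.normal(d_obs, sigma_d))
              for _ in range(2000)])

# "True" posterior moments from a fine grid, for the kind of comparison in Figure 9.
grid = np.linspace(mu_x - 6 * sigma_x, mu_x + 6 * sigma_x, 4001)
logp = -0.5 * ((grid - mu_x) / sigma_x) ** 2 - 0.5 * ((np.tanh(grid) - d_obs) / sigma_d) ** 2
p = np.exp(logp - logp.max()); p /= p.sum()
true_mean = np.sum(grid * p)
true_std = np.sqrt(np.sum((grid - true_mean) ** 2 * p))

print(y.mean() - true_mean, y.std() - true_std)   # errors in the empirical moments
```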