Lecture 8: The Principle of Maximum Likelihood

Syllabus
Lecture 01  Describing Inverse Problems
Lecture 02  Probability and Measurement Error, Part 1
Lecture 03  Probability and Measurement Error, Part 2
Lecture 04  The L2 Norm and Simple Least Squares
Lecture 05  A Priori Information and Weighted Least Squares
Lecture 06  Resolution and Generalized Inverses
Lecture 07  Backus-Gilbert Inverse and the Trade Off of Resolution and Variance
Lecture 08  The Principle of Maximum Likelihood
Lecture 09  Inexact Theories
Lecture 10  Nonuniqueness and Localized Averages
Lecture 11  Vector Spaces and Singular Value Decomposition
Lecture 12  Equality and Inequality Constraints
Lecture 13  L1, L∞ Norm Problems and Linear Programming
Lecture 14  Nonlinear Problems: Grid and Monte Carlo Searches
Lecture 15  Nonlinear Problems: Newton's Method
Lecture 16  Nonlinear Problems: Simulated Annealing and Bootstrap Confidence Intervals
Lecture 17  Factor Analysis
Lecture 18  Varimax Factors, Empirical Orthogonal Functions
Lecture 19  Backus-Gilbert Theory for Continuous Problems; Radon's Problem
Lecture 20  Linear Operators and Their Adjoints
Lecture 21  Fréchet Derivatives
Lecture 22  Exemplary Inverse Problems, incl. Filter Design
Lecture 23  Exemplary Inverse Problems, incl. Earthquake Location
Lecture 24  Exemplary Inverse Problems, incl. Vibrational Problems

Purpose of the Lecture
Introduce the spaces of all possible data and all possible models, and the idea of likelihood.
Use maximization of likelihood as a guiding principle for solving inverse problems.

Part 1: The spaces of all possible data, all possible models, and the idea of likelihood

Viewpoint: the observed data are one point in the space of all possible observations; that is, dobs is a single point in the space S(d).

[Figure: plot of dobs in the space of all possible data (axes d1, d2, d3); the observation is a single point.]

Now suppose the data are independent and each is drawn from a Gaussian distribution with the same mean m1 and variance σ^2 (but m1 and σ are unknown).

[Figure: plot of p(d) in the same space; the p.d.f. is a cloud centered on the line d1 = d2 = d3 with radius proportional to σ.]

Now interpret p(dobs) as the probability that the observed data were in fact observed, and define
L = log p(dobs)
called the likelihood. To find the parameters of the distribution, maximize p(dobs) with respect to m1 and σ; that is, maximize the probability that the observed data were in fact observed. This is the Principle of Maximum Likelihood.

Example: for N independent Gaussian data,
L = -(N/2) log(2π) - N log σ - (1/(2σ^2)) Σi (di_obs - m1)^2
Setting ∂L/∂m1 = 0 and ∂L/∂σ = 0 and solving the two equations gives
m1_est = (1/N) Σi di_obs
σ_est^2 = (1/N) Σi (di_obs - m1_est)^2
The first is the usual formula for the sample mean; the second is almost the usual formula for the sample standard deviation (it divides by N rather than N-1). These two estimates are linked to the assumption that the data are Gaussian-distributed; a different p.d.f. might give different formulas.
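The following is a minimal numerical sketch of this Gaussian example in Python/NumPy (the synthetic data, grid ranges, and variable names are illustrative assumptions, not part of the lecture). It evaluates the likelihood L(m1, σ) on a grid and checks that the grid maximum agrees with the closed-form estimates above.

```python
import numpy as np

# hypothetical observations: independent Gaussian data, common mean and variance
np.random.seed(0)
dobs = np.random.normal(loc=5.0, scale=2.0, size=100)

def log_likelihood(m1, sigma, d):
    # L = log p(dobs) for independent Gaussian data with mean m1 and std. dev. sigma
    N = len(d)
    return -N * np.log(np.sqrt(2.0 * np.pi) * sigma) - np.sum((d - m1) ** 2) / (2.0 * sigma ** 2)

# closed-form maximum likelihood point:
#   m1_est    = sample mean (the usual formula)
#   sigma_est = root-mean-square deviation with 1/N rather than 1/(N-1)
#               ("almost" the usual sample standard deviation)
m1_est = np.mean(dobs)
sigma_est = np.sqrt(np.mean((dobs - m1_est) ** 2))

# evaluate L on a grid to trace out the likelihood surface L(m1, sigma)
m1_grid = np.linspace(m1_est - 1.0, m1_est + 1.0, 201)
sg_grid = np.linspace(0.5 * sigma_est, 1.5 * sigma_est, 201)
L = np.array([[log_likelihood(m1, sg, dobs) for m1 in m1_grid] for sg in sg_grid])
i, j = np.unravel_index(np.argmax(L), L.shape)

# the grid maximum should agree with the closed form, up to grid resolution
print(m1_est, sigma_est)
print(m1_grid[j], sg_grid[i])
```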
[Figure: L(m1, σ), an example of a likelihood surface, with the maximum likelihood point marked.]

The likelihood maximization process will fail if the p.d.f. has no well-defined peak.

[Figure: two p.d.f.s p(d1, ..., dN): (A) one with a well-defined peak, (B) one without.]

Part 2: Using the maximization of likelihood as a guiding principle for solving inverse problems

Consider the linear inverse problem Gm = d with Gaussian-distributed data of known covariance [cov d], and assume that Gm = d gives the mean of the data:
p(d) ∝ exp[ -(1/2) (d - Gm)^T [cov d]^-1 (d - Gm) ]

The principle of maximum likelihood says to maximize L = log p(dobs), which is the same as minimizing
E = (dobs - Gm)^T [cov d]^-1 (dobs - Gm)
with respect to m. This is just weighted least squares. So, when the data are Gaussian-distributed, the principle of maximum likelihood says: solve Gm = d with weighted least squares, with weighting matrix [cov d]^-1.

Special case of uncorrelated data, each datum with a different variance, [cov d]ii = σdi^2: minimize
E = Σi (di_obs - (Gm)i)^2 / σdi^2
so each error is weighted by its certainty, the reciprocal of its variance.

But what about a priori information?

Probabilistic representation of a priori information: the probability that the model parameters are near m is given by a p.d.f. pA(m), centered at the a priori value <m>, with a variance that reflects the uncertainty of the a priori information.

[Figure: contour plots of pA(m) in the (m1, m2) plane: a broad peak at (<m1>, <m2>) when the a priori information is uncertain and a narrow peak when it is certain; a linear relationship between m1 and m2 and its approximation with a Gaussian ridge; and a p.d.f. that is constant where the model is allowed and zero where it is forbidden.]

Assessing the information content in pA(m): do we know a little about m or a lot about m? This is quantified by the relative entropy S, also called the information gain,
S = ∫ pA(m) log[ pA(m) / pN(m) ] dm
where pN(m) is the null p.d.f. representing the state of no knowledge; a uniform p.d.f. might work for this.

Probabilistic representation of the data: the probability that the data are near d is given by a p.d.f. pA(d), centered at the observed data dobs, with a variance that reflects the uncertainty of the measurements.

Probabilistic representation of both prior information and observed data: assume the observations and the a priori information are uncorrelated, so the joint p.d.f. is the product
pA(m, d) = pA(m) pA(d)

[Figure: example of pA(m, d) in the (model m, datum d) plane, centered at the a priori model map and the observed datum dobs.]

The theory d = g(m) is a surface in the combined space of data and model parameters on which the estimated model parameters and predicted data must lie; for a linear theory the surface is planar. The principle of maximum likelihood says: maximize pA(m, d) on the surface d = g(m).

[Figure: three examples; in each, panel (A) shows the p.d.f. in the (model m, datum d) plane together with the curve d = g(m), marking the observed datum dobs, the predicted datum dpre, the a priori model map, and the estimate mest, and panel (B) shows the p.d.f. p(s) as a function of position s along the curve, with its maximum at smax. In the first case dpre, dobs and mest, map all differ; in the second case mest ≈ map; in the third case dpre ≈ dobs.]

The principle of maximum likelihood with Gaussian-distributed data and Gaussian-distributed a priori information: minimize
E = (dobs - Gm)^T [cov d]^-1 (dobs - Gm) + (m - <m>)^T [cov m]^-1 (m - <m>)
This is just weighted least squares with
F = [ [cov d]^-1/2 G ; [cov m]^-1/2 I ]   and   f = [ [cov d]^-1/2 dobs ; [cov m]^-1/2 <m> ]
so we already know the solution: solve Fm = f with simple least squares.
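As a concrete illustration, here is a minimal Python/NumPy sketch of solving Fm = f by simple least squares (the data kernel G, the covariances, and the prior values below are hypothetical choices for illustration, not from the lecture). It weights the data equations by [cov d]^-1/2 and the prior equations by [cov m]^-1/2, stacks them into F and f, and solves with an ordinary least squares routine.

```python
import numpy as np

# hypothetical linear problem: N data, M model parameters
np.random.seed(1)
M, N = 4, 20
G = np.random.randn(N, M)                      # data kernel (illustrative)
mtrue = np.array([1.0, -2.0, 0.5, 3.0])
dobs = G @ mtrue + 0.1 * np.random.randn(N)    # observed data with noise

covd = (0.1 ** 2) * np.eye(N)                  # [cov d], data covariance
mprior = np.zeros(M)                           # <m>, a priori model parameters
covm = (1.0 ** 2) * np.eye(M)                  # [cov m], a priori covariance

# square-root weights: W^T W = covariance^-1, so ||W r||^2 = r^T cov^-1 r
Wd = np.linalg.cholesky(np.linalg.inv(covd)).T   # [cov d]^-1/2
Wm = np.linalg.cholesky(np.linalg.inv(covm)).T   # [cov m]^-1/2

# stack the weighted data equations and weighted prior equations
F = np.vstack([Wd @ G, Wm @ np.eye(M)])
f = np.concatenate([Wd @ dobs, Wm @ mprior])

# mest = (F^T F)^-1 F^T f, computed with a numerically stable least squares solve
mest, *_ = np.linalg.lstsq(F, f, rcond=None)
print(mest)
```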
When [cov d] = σd^2 I and [cov m] = σm^2 I, this provides an answer to the question: what should be the value of ε^2 in damped least squares? The answer: it should be set to the ratio of the variances of the data and the a priori model parameters,
ε^2 = σd^2 / σm^2

If the a priori information is instead Hm = h with covariance [cov h]A, then Fm = f becomes
F = [ [cov d]^-1/2 G ; [cov h]A^-1/2 H ]   and   f = [ [cov d]^-1/2 dobs ; [cov h]A^-1/2 h ]

The most useful formula in inverse theory:
Gm = dobs with covariance [cov d]
Hm = h with covariance [cov h]A
mest = (F^T F)^-1 F^T f
with F and f as defined above.
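The generalized form can be sketched the same way in Python/NumPy. The example below assumes a hypothetical smoothness prior (Hm = h with H a second-difference operator and h = 0); the matrix sizes and covariance values are illustrative, not from the lecture. It builds F and f from Gm = dobs with covariance [cov d] and Hm = h with covariance [cov h]A, then evaluates mest = (F^T F)^-1 F^T f with a least squares solve.

```python
import numpy as np

def ml_solution(G, dobs, covd, H, h, covh):
    """Maximum likelihood estimate mest = (F^T F)^-1 F^T f for
    Gm = dobs with covariance covd and Hm = h with covariance covh."""
    Wd = np.linalg.cholesky(np.linalg.inv(covd)).T   # [cov d]^-1/2
    Wh = np.linalg.cholesky(np.linalg.inv(covh)).T   # [cov h]A^-1/2
    F = np.vstack([Wd @ G, Wh @ H])
    f = np.concatenate([Wd @ dobs, Wh @ h])
    mest, *_ = np.linalg.lstsq(F, f, rcond=None)
    return mest

# hypothetical example: a smooth model recovered from noisy linear data
np.random.seed(2)
M, N = 10, 15
G = np.random.randn(N, M)
mtrue = np.sin(np.linspace(0.0, np.pi, M))
dobs = G @ mtrue + 0.05 * np.random.randn(N)

H = np.diff(np.eye(M), n=2, axis=0)     # (M-2) x M second-difference operator
h = np.zeros(M - 2)                     # prior: second differences near zero

mest = ml_solution(G, dobs, (0.05 ** 2) * np.eye(N), H, h, (0.1 ** 2) * np.eye(M - 2))
print(np.round(mest, 3))
```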