Lecture 08
The Principle of Maximum Likelihood
Syllabus
Lecture 01: Describing Inverse Problems
Lecture 02: Probability and Measurement Error, Part 1
Lecture 03: Probability and Measurement Error, Part 2
Lecture 04: The L2 Norm and Simple Least Squares
Lecture 05: A Priori Information and Weighted Least Squares
Lecture 06: Resolution and Generalized Inverses
Lecture 07: Backus-Gilbert Inverse and the Trade Off of Resolution and Variance
Lecture 08: The Principle of Maximum Likelihood
Lecture 09: Inexact Theories
Lecture 10: Nonuniqueness and Localized Averages
Lecture 11: Vector Spaces and Singular Value Decomposition
Lecture 12: Equality and Inequality Constraints
Lecture 13: L1, L∞ Norm Problems and Linear Programming
Lecture 14: Nonlinear Problems: Grid and Monte Carlo Searches
Lecture 15: Nonlinear Problems: Newton’s Method
Lecture 16: Nonlinear Problems: Simulated Annealing and Bootstrap Confidence Intervals
Lecture 17: Factor Analysis
Lecture 18: Varimax Factors, Empirical Orthogonal Functions
Lecture 19: Backus-Gilbert Theory for Continuous Problems; Radon’s Problem
Lecture 20: Linear Operators and Their Adjoints
Lecture 21: Fréchet Derivatives
Lecture 22: Exemplary Inverse Problems, incl. Filter Design
Lecture 23: Exemplary Inverse Problems, incl. Earthquake Location
Lecture 24: Exemplary Inverse Problems, incl. Vibrational Problems
Purpose of the Lecture
Introduce the spaces of all possible data,
all possible models and the idea of likelihood
Use maximization of likelihood as a guiding principle for
solving inverse problems
Part 1
The spaces of all possible data,
all possible models and the idea of
likelihood
viewpoint
the observed data are one point in the space of all
possible observations
or
dobs is a point in S(d)
[Figure: plot of dobs as a single point in the (d1, d2, d3) data space.]
now suppose …
the data are independent
each is drawn from a Gaussian distribution
with the same mean m1 and variance σ²
(but m1 and σ unknown)
[Figure: plot of p(d) in the (d1, d2, d3) data space: a cloud centered on the line d1 = d2 = d3, with radius proportional to σ.]
now interpret …
p(dobs)
as the probability that the observed data were in
fact observed
L = log p(dobs)
called the likelihood
find parameters in the distribution
maximize
p(dobs)
with respect to m1 and σ
maximize the probability that the observed data
were in fact observed
the Principle of Maximum Likelihood
Example
for N independent Gaussian data, the log likelihood is
L(m1, σ) = -(N/2) log(2πσ²) - (1/(2σ²)) Σi (di - m1)²
setting ∂L/∂m1 = 0 and ∂L/∂σ = 0 and solving the two equations gives
m1est = (1/N) Σi di
the usual formula for the sample mean, and
(σest)² = (1/N) Σi (di - m1est)²
almost the usual formula for the sample standard deviation (the denominator is N, not N-1)
these two estimates are linked to the assumption of the
data being Gaussian-distributed
one might get a different formula for a different p.d.f.
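As a check, here is a minimal numerical sketch of this result (Python with NumPy and SciPy assumed; the data are synthetic, not from the lecture): maximizing the Gaussian log likelihood over m1 and σ reproduces the sample mean and the N-denominator standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical data: N independent observations with a common (unknown) mean and variance
rng = np.random.default_rng(0)
d_obs = rng.normal(loc=5.0, scale=2.0, size=100)
N = len(d_obs)

# negative Gaussian log likelihood -L(m1, sigma) = -log p(d_obs)
def neg_log_likelihood(params):
    m1, sigma = params
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + np.sum((d_obs - m1)**2) / (2 * sigma**2)

# maximize L by minimizing -L over (m1, sigma)
res = minimize(neg_log_likelihood, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
m1_est, sigma_est = res.x

# analytic maximum likelihood formulas: the usual sample mean, and the
# "almost usual" standard deviation (denominator N rather than N-1)
print(m1_est, np.mean(d_obs))
print(sigma_est, np.sqrt(np.sum((d_obs - np.mean(d_obs))**2) / N))
```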
[Figure: example of a likelihood surface L(m1, σ), with the maximum likelihood point marked.]
the likelihood maximization process will
fail if the p.d.f. has no well-defined peak
[Figure: two example p.d.f.s p(d1), one with a well-defined peak and one without.]
Part 2
Using the maximization of
likelihood as a guiding principle for
solving inverse problems
linear inverse problem Gm = d
with Gaussian-distributed data
with known covariance [cov d]
assume Gm = d gives the mean of d, so that
p(d) ∝ exp[ -(1/2) (d - Gm)^T [cov d]^-1 (d - Gm) ]
principle of maximum likelihood:
maximize L = log p(dobs)
equivalently, minimize
E = (dobs - Gm)^T [cov d]^-1 (dobs - Gm)
with respect to m
This is just weighted least squares
principle of maximum likelihood:
when the data are Gaussian-distributed,
solve Gm = d with weighted least squares
with weighting [cov d]^-1
special case of uncorrelated data
each datum with a different variance
[cov d]ii = σdi²
minimize
E = Σi σdi^-2 (di_obs - (Gm)i)²
errors weighted by their certainty (the reciprocal of their variance)
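A minimal sketch of this recipe, assuming a small hypothetical G and dobs (NumPy assumed): dividing each row of G and each datum by its σdi and then applying simple least squares minimizes E as written above.

```python
import numpy as np

# hypothetical linear problem Gm = d with uncorrelated, unequal-variance data
G = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
d_obs = np.array([2.1, 2.9, 4.2, 5.1])
sigma_d = np.array([0.1, 0.1, 0.5, 0.5])   # per-datum standard deviations

# weighting each row of G and d by 1/sigma_di implements
# minimize E = sum_i sigma_di^-2 (d_i - (Gm)_i)^2
Gw = G / sigma_d[:, np.newaxis]
dw = d_obs / sigma_d

# simple least squares on the weighted system
m_est, *_ = np.linalg.lstsq(Gw, dw, rcond=None)
print(m_est)
```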
but what about a priori information?
probabilistic representation of a priori information
the probability that the model parameters are near m
is given by the p.d.f. pA(m)
centered at the a priori value <m>
its variance reflects the uncertainty in the a priori information
[Figures: pA(m) in the (m1, m2) plane, centered on (<m1>, <m2>): a broad p.d.f. when the a priori information is uncertain and a narrow one when it is certain. A linear relationship between m1 and m2 corresponds to a p.d.f. that is constant along a line (p = constant) and zero off it (p = 0), and can be approximated with a Gaussian ridge.]
assessing the information content
in pA(m)
Do we know a little about m, or a lot about m?
the Information Gain, S (-S is also called the Relative Entropy):
S = ∫ pA(m) log[ pA(m) / pN(m) ] dm
where pN(m) is the null p.d.f., representing the state of no knowledge
a uniform p.d.f. might work for the null p.d.f.
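A sketch of how S could be evaluated numerically, under the convention S = ∫ pA log(pA/pN) dm used above, with a hypothetical Gaussian prior and a uniform null p.d.f. on a finite interval; a narrower (more certain) prior yields a larger information gain.

```python
import numpy as np

# a sketch, assuming the convention S = integral pA(m) log[ pA(m)/pN(m) ] dm,
# with a uniform null p.d.f. pN(m) on a (hypothetical) interval of interest
m = np.linspace(-10.0, 10.0, 10001)
dm = m[1] - m[0]
pN = np.ones_like(m) / (m[-1] - m[0])          # uniform null p.d.f.: no knowledge

def info_gain(sigma_m, m_mean=0.0):
    """Information gain of a Gaussian prior pA(m) relative to the uniform null p.d.f."""
    pA = np.exp(-0.5 * ((m - m_mean) / sigma_m)**2) / (np.sqrt(2 * np.pi) * sigma_m)
    return np.sum(pA * np.log(pA / pN)) * dm   # numerical integral

# a narrow (certain) prior carries more information than a broad (uncertain) one
print(info_gain(sigma_m=0.5))   # larger S: we know a lot about m
print(info_gain(sigma_m=3.0))   # smaller S: we know a little about m
```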
probabilistic representation of the observed data
the probability that the data are near d
is given by the p.d.f. p(d)
centered at the observed data dobs
its variance reflects the uncertainty in the measurements
probabilistic representation of both
prior information and observed data
assume the observations and the a priori information are uncorrelated,
so the joint p.d.f. is the product p(d, m) = p(d) pA(m)
[Figure: an example of the joint p.d.f. in the (model m, datum d) plane, centered on the observed datum dobs and the a priori model map.]
the theory
d = g(m)
is a surface in the combined space of
data and model parameters
on which the estimated model
parameters and predicted data must lie
for a linear theory
the surface is planar
the principle of maximum likelihood says:
maximize the joint p.d.f. p(d, m) = p(d) pA(m)
on the surface d = g(m)
[Figures: three examples in the (model m, datum d) plane. Panel (A) of each shows the theory curve d = g(m) together with the observed datum dobs, the predicted datum dpre, the a priori model map, and the estimate mest; panel (B) shows the likelihood p(s) as a function of position s along the curve, with its maximum at smax. In one example mest ≈ map; in another dpre ≈ dobs.]
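The behavior sketched in these figures can be reproduced with a short numerical experiment; the theory g(m), the data, and all the variances below are hypothetical choices, not values from the lecture. The sketch evaluates the joint p.d.f. along the curve d = g(m) and locates its maximum.

```python
import numpy as np

# hypothetical one-datum, one-model-parameter example, as in the figures:
# Gaussian data p.d.f. centered on dobs, Gaussian prior pA(m) centered on m_ap
d_obs, sigma_d = 2.0, 0.5
m_ap,  sigma_m = 3.0, 1.0

def g(m):
    # a hypothetical nonlinear theory d = g(m)
    return 1.0 + 0.2 * m**2

# evaluate the joint p.d.f. p(d, m) = p(d) pA(m) on the surface d = g(m);
# for simplicity, position along the curve is parameterized by m rather than arc length s
m = np.linspace(0.0, 5.0, 2001)
d = g(m)
log_p = (-0.5 * ((d - d_obs) / sigma_d)**2
         - 0.5 * ((m - m_ap) / sigma_m)**2)   # log joint p.d.f., up to a constant

# the maximum likelihood point balances fitting dobs against staying near m_ap
i = np.argmax(log_p)
m_est, d_pre = m[i], d[i]
print(m_est, d_pre)
```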
principle of maximum likelihood
with Gaussian-distributed data and Gaussian-distributed a priori information:
minimize
(dobs - Gm)^T [cov d]^-1 (dobs - Gm) + (m - <m>)^T [cov m]^-1 (m - <m>)
this is just weighted least squares for a combined system Fm = f
with
F = [ [cov d]^-1/2 G ; [cov m]^-1/2 I ]  and  f = [ [cov d]^-1/2 dobs ; [cov m]^-1/2 <m> ]
so we already know the solution:
solve Fm = f with simple least squares
when [cov d] = σd² I and [cov m] = σm² I this reduces to damped least squares,
mest = [G^T G + ε² I]^-1 [G^T dobs + ε² <m>]
this provides an answer to the question:
What should be the value of ε²
in damped least squares?
The answer:
it should be set to the ratio of the variances of
the data and the a priori model parameters,
ε² = σd² / σm²
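A sketch of damped least squares with ε² set this way, using a hypothetical G, dobs, and prior <m>; the normal equations below follow from minimizing the weighted data misfit plus the weighted distance from <m>.

```python
import numpy as np

# hypothetical problem: Gaussian data with variance sigma_d^2 and a Gaussian
# prior <m> with variance sigma_m^2; damping parameter eps^2 = sigma_d^2 / sigma_m^2
G = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 5.0]])
d_obs = np.array([3.1, 5.2, 8.0])
m_prior = np.array([0.0, 1.0])
sigma_d, sigma_m = 0.1, 1.0
eps2 = sigma_d**2 / sigma_m**2

# damped least squares: minimize |dobs - Gm|^2 / sigma_d^2 + |m - <m>|^2 / sigma_m^2
M = G.shape[1]
m_est = np.linalg.solve(G.T @ G + eps2 * np.eye(M),
                        G.T @ d_obs + eps2 * m_prior)
print(m_est)
```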
if the a priori information is
Hm = h
with covariance [cov h]A
then the combined system Fm = f becomes
F = [ [cov d]^-1/2 G ; [cov h]A^-1/2 H ]  and  f = [ [cov d]^-1/2 dobs ; [cov h]A^-1/2 h ]
the most useful formula in inverse theory:
Gm = dobs with covariance [cov d]
Hm = h with covariance [cov h]A
combine into
mest = (F^T F)^-1 F^T f
with F^T F = G^T [cov d]^-1 G + H^T [cov h]A^-1 H and F^T f = G^T [cov d]^-1 dobs + H^T [cov h]A^-1 h
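A sketch of assembling Fm = f and solving it with simple least squares; G, H, the data, and the (diagonal) covariances below are hypothetical, and the covariance weighting is done with inverse Cholesky factors.

```python
import numpy as np

# hypothetical example of combining Gm = dobs (covariance [cov d]) with a priori
# information Hm = h (covariance [cov h]A) into a single system Fm = f
G = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
d_obs = np.array([2.0, 3.0])
cov_d = np.diag([0.01, 0.04])

H = np.eye(3)                      # a priori information: m near h
h = np.array([1.0, 1.0, 1.0])
cov_h = np.diag([1.0, 1.0, 1.0])

# weight each system by the inverse Cholesky factor of its covariance,
# so that (W A)^T (W A) = A^T C^-1 A
Wd = np.linalg.inv(np.linalg.cholesky(cov_d))
Wh = np.linalg.inv(np.linalg.cholesky(cov_h))

F = np.vstack([Wd @ G, Wh @ H])
f = np.concatenate([Wd @ d_obs, Wh @ h])

# mest = (F^T F)^-1 F^T f, solved with simple least squares
m_est, *_ = np.linalg.lstsq(F, f, rcond=None)
print(m_est)
```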