2.160 System Identification, Estimation, and Learning
Lecture Notes No. 20
May 3, 2006
15. Maximum Likelihood
15.1 Principle
Consider an unknown stochastic process that generates the observed data

y^N = (y_1, y_2, \ldots, y_N)

[Figure: block diagram of an unknown stochastic process producing the observed data y^N]
Assume that each observed datum y_i is generated by an assumed stochastic process having a PDF:

f(m, λ; x) = \frac{1}{\sqrt{2πλ}} e^{-(x - m)^2 / (2λ)}        (1)

where m is the mean, λ is the variance, and x is the random variable associated with y_i.
We know that the mean m and variance λ are determined by

m = \frac{1}{N} \sum_{i=1}^{N} y_i , \qquad λ = \frac{1}{N} \sum_{i=1}^{N} (y_i - m)^2        (2)
Let us now obtain the same parameter values, m and λ , based on a different
principle: Maximum Likelihood.
Assuming that the N observations y_1, y_2, \ldots, y_N are stochastically independent, consider the joint probability density associated with the N observations:

f(m, λ; x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2πλ}} e^{-(x_i - m)^2 / (2λ)}        (3)
Now, once y_1, y_2, \ldots, y_N have been observed (have taken specific values), which parameter values m and λ give the highest probability in f(m, λ; x_1, x_2, \ldots, x_N)? In other words, which values of m and λ are most likely the case? This means that we maximize the following function with respect to m and λ:

\max_{m, λ} f(m, λ; y_1, y_2, \ldots, y_N)        (4)
Note that y_1, y_2, \ldots, y_N are known values. Therefore, f(m, λ; y_1, y_2, \ldots, y_N) is a function of m and λ only. Using our standard notation, this can be rewritten as

\hat{θ} = \arg\max_{θ} f(m, λ; y^N)        (5)

where \hat{θ} is the estimate of m and λ. Maximizing f(m, λ; y^N) is equivalent to maximizing \log f(\cdot):

\hat{θ} = \arg\max_{θ} \left[ \log f(m, λ; y^N) \right] = \arg\max_{θ} \sum_{i=1}^{N} \left( \log\frac{1}{\sqrt{2πλ}} - \frac{(y_i - m)^2}{2λ} \right)        (6)
Taking derivatives and setting them to zero,
\frac{∂ \log f}{∂m} = \sum_{i=1}^{N} \frac{1}{λ} (y_i - m) = 0        (7)

\frac{∂ \log f}{∂λ} = -\frac{N}{2λ} + \frac{1}{2λ^2} \sum_{i=1}^{N} (y_i - m)^2 = 0        (8)

Solving these for m and λ,

m = \frac{1}{N} \sum_{i=1}^{N} y_i        (9)

λ = \frac{1}{N} \sum_{i=1}^{N} (y_i - m)^2        (10)
The above results m = \frac{1}{N}\sum_{i=1}^{N} y_i and λ = \frac{1}{N}\sum_{i=1}^{N} (y_i - m)^2 provide the stochastic process model that is most likely to generate the observed data y^N, and they agree with (2).
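As a quick numerical check (an added illustration, not part of the original notes), the following Python sketch simulates Gaussian data and verifies that the sample mean and variance of (9) and (10) drive the score equations (7) and (8) to zero and maximize the log-likelihood (6). The simulated data, the seed, and the true parameter values are arbitrary assumptions.

# Minimal sketch: check that (9)-(10) satisfy (7)-(8) and maximize (6).
# The data are simulated here; any observed sequence y_1..y_N could be used.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
y = rng.normal(loc=2.0, scale=1.5, size=N)        # simulated observations y_1..y_N

m_hat = y.mean()                                  # Eq. (9)
lam_hat = np.mean((y - m_hat) ** 2)               # Eq. (10), 1/N (biased) form

def log_likelihood(m, lam):
    """Log-likelihood of Eq. (6) for the observed data y."""
    return np.sum(-0.5 * np.log(2 * np.pi * lam) - (y - m) ** 2 / (2 * lam))

# Score equations (7) and (8) at (m_hat, lam_hat): both evaluate to ~0
score_m = np.sum((y - m_hat) / lam_hat)
score_lam = -N / (2 * lam_hat) + np.sum((y - m_hat) ** 2) / (2 * lam_hat ** 2)
print(score_m, score_lam)

# Perturbing either parameter away from the MLE lowers the log-likelihood
assert log_likelihood(m_hat, lam_hat) > log_likelihood(m_hat + 0.1, lam_hat)
assert log_likelihood(m_hat, lam_hat) > log_likelihood(m_hat, 1.2 * lam_hat)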
This Maximum Likelihood Estimate (MLE) is formally stated as follows.
Maximum Likelihood Estimate
Consider a joint probability density function with parameter vector θ as a
stochastic model of an unknown process:
f(θ; x_1, x_2, \ldots, x_N)        (11)

Given the observed data y_1, y_2, \ldots, y_N, form a deterministic function of θ, called the likelihood function:

L(θ) = f(θ; y_1, y_2, \ldots, y_N)        (12)

Determine the parameter vector θ so that this likelihood function becomes maximum:

\hat{θ} = \arg\max_{θ} L(θ)        (13)
This Maximum Likelihood Estimator was found by Ronald A. Fisher in 1912, when
he was an undergraduate junior at Cambridge University. He was 22 years old.
Don’t be disappointed. You are still young!
15.2 Likelihood Function for Probabilistic Models of Dynamic Systems
Consider a model set in predictor form

M(θ): \quad \hat{y}(t \mid θ) = g(t, z^{t-1}; θ)        (14)

and an associated PDF of the prediction error:

f_ε(x, t; θ)        (15)

ε(t; θ) = y(t) - \hat{y}(t \mid θ)        (16)
Assume that the random process ε(t) is independent in t, and consider the joint probability density function:

f(θ; x_1, x_2, \ldots, x_N) = \prod_{t=1}^{N} f_ε(x_t, t; θ)        (17)
Substituting the observed data y_1, y_2, \ldots, y_N into (17) yields a likelihood function of the parameter θ:

L(θ) = \prod_{t=1}^{N} f_ε( y(t) - \hat{y}(t \mid θ), t; θ )        (18)

Maximizing this function is equivalent to maximizing its logarithm, normalized by 1/N:

\frac{1}{N} \log L(θ) = \frac{1}{N} \sum_{t=1}^{N} \log f_ε(ε, t; θ)        (19)
Replacing \log f_ε by l(ε, θ, t) = -\log f_ε(ε, t; θ), maximizing (19) becomes the minimization

\hat{θ}_{ML}(Z^N) = \arg\min_{θ} \frac{1}{N} \sum_{t=1}^{N} l(ε, θ, t)        (20)

The above minimization is of a deterministic function, and it is analogous to the prediction-error method with the least-squares criterion:

\hat{θ}_{LS}(z^N) = \arg\min_{θ} \frac{1}{N} \sum_{t=1}^{N} \frac{1}{2} ε^2(t, θ)        (21)

Replacing \frac{1}{2} ε^2(t, θ) with l(ε, θ, t) puts (21) in the same form:

\hat{θ}_{LS}(z^N) = \arg\min_{θ} \frac{1}{N} \sum_{t=1}^{N} l(ε, θ, t)
Therefore, the Maximum Likelihood Estimate is a special case of the prediction-error
method. Function l (ε ,θ , t ) may take a different form depending on the assumed
PDF of the process under consideration.
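As an illustration (an added sketch, not from the original notes), consider a hypothetical AR(1) predictor \hat{y}(t \mid a) = a\,y(t-1) with Gaussian prediction errors of known variance λ. Then l(ε, θ, t) differs from the least-squares term \frac{1}{2}ε^2 only by a positive scale factor and a constant, so the criteria (20) and (21) are minimized by the same coefficient. The model, noise level, and search grid below are illustrative assumptions.

# Sketch: for Gaussian prediction errors, the ML criterion (20) and the
# least-squares prediction-error criterion (21) pick the same AR(1) coefficient.
# The AR(1) system and its parameters are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(1)
N, a_true, lam = 500, 0.7, 0.25                   # assumed system and noise variance
e = rng.normal(scale=np.sqrt(lam), size=N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = a_true * y[t - 1] + e[t]               # simulated AR(1) data

def eps(a):
    """Prediction error eps(t; a) = y(t) - a*y(t-1), cf. (16)."""
    return y[1:] - a * y[:-1]

a_grid = np.linspace(0.0, 1.0, 2001)
V_ls = np.array([np.mean(0.5 * eps(a) ** 2) for a in a_grid])      # criterion (21)
V_ml = np.array([np.mean(eps(a) ** 2 / (2 * lam)
                         + 0.5 * np.log(2 * np.pi * lam))          # l(eps) for Gaussian f_eps
                 for a in a_grid])                                 # criterion (20)

print(a_grid[np.argmin(V_ls)], a_grid[np.argmin(V_ml)])            # same minimizer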
15.3 The Cramer-Rao Lower Bound
Consider the mean square error matrix of an estimate \hat{θ}_N relative to the true value θ_0:

P = E[ (\hat{θ}(y^N) - θ_0)(\hat{θ}(y^N) - θ_0)^T ]        (22)
This mean square error matrix measures the overall quality of an estimate. Our goal is to establish a lower limit on this matrix. The limit, called the Cramer-Rao Lower Bound, gives the best performance that we can expect among all unbiased estimators.
Theorem (The Cramer-Rao Lower Bound)
Let \hat{θ}(y^N) be an unbiased estimate of θ:

E[\hat{θ}(y^N)] = θ_0        (23)

where the PDF of y^N is f_y(θ_0; y^N). Then

E[ (\hat{θ}(y^N) - θ_0)(\hat{θ}(y^N) - θ_0)^T ] \ge M^{-1}        (24)

where M is the Fisher Information Matrix given by

M = E\left\{ \left[ \frac{d}{dθ} \log f_y(θ; x^N) \right] \left[ \frac{d}{dθ} \log f_y(θ; x^N) \right]^T \right\} \Big|_{θ_0}        (25-a)

  = -E\left[ \frac{d^2}{dθ^2} \log f_y(θ; x^N) \right] \Big|_{θ_0}        (25-b)
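As a quick illustration (added here; not part of the original notes), the Fisher Information Matrix for the Gaussian model (1) with θ = (m, λ)^T follows directly from (25-b). Using \log f = \sum_{i=1}^{N} \left[ -\frac{1}{2}\log(2πλ) - \frac{(y_i - m)^2}{2λ} \right],

-E\left[ \frac{∂^2 \log f}{∂m^2} \right] = \frac{N}{λ}, \qquad -E\left[ \frac{∂^2 \log f}{∂λ^2} \right] = \frac{N}{2λ^2}, \qquad -E\left[ \frac{∂^2 \log f}{∂m\,∂λ} \right] = 0

so M = \mathrm{diag}(N/λ,\; N/(2λ^2)), and the bound (24) says that no unbiased estimator of the mean can have variance below λ/N, and none of the variance below 2λ^2/N.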
Proof
From (23),

θ_0 = \int_{R^N} \hat{θ}(x^N) f_y(θ_0; x^N) \, dx^N        (26)

where the dummy vector x^N is an N \times 1 vector in the space R^N.

Differentiating (26), which is a d \times 1 vector equation, with respect to θ_0 \in R^{d \times 1} yields
I = \int_{R^N} \hat{θ}(x^N) \left[ \frac{d}{dθ_0} f_y(θ_0; x^N) \right]^T dx^N        (27)

where I \in R^{d \times d} is the d \times d identity matrix.
Using the following relationship
\frac{d}{dθ} \log f(θ) = \frac{1}{f(θ)} \frac{d f(θ)}{dθ} ,        (28)
(27) can be written as
I = \int_{R^N} \hat{θ}(x^N) \left[ \frac{d}{dθ_0} \log f_y(θ_0; x^N) \right]^T f_y(θ_0; x^N) \, dx^N        (29)

  = E\left\{ \hat{θ}(x^N) \left[ \frac{d}{dθ_0} \log f_y(θ_0; x^N) \right]^T \right\}
Similarly, differentiating the identity 1 = \int_{R^N} f_y(θ_0; x^N) \, dx^N yields

0 = \int_{R^N} \left[ \frac{d}{dθ_0} f_y(θ_0; x^N) \right]^T dx^N = \int_{R^N} \left[ \frac{d}{dθ_0} \log f_y(θ_0; x^N) \right]^T f_y(θ_0; x^N) \, dx^N

∴ \quad 0 = E\left[ \frac{d}{dθ_0} \log f_y(θ_0; x^N) \right]^T        (30)
Multiplying (30) by θ_0 and subtracting it from (29),

E\left\{ [\hat{θ}(y^N) - θ_0] \left[ \frac{d}{dθ_0} \log f_y(θ_0; x^N) \right]^T \right\} = I        (31)

Denote α = \hat{θ}(y^N) - θ_0 \in R^{d \times 1} and β^T = \left[ \frac{d}{dθ_0} \log f_y(θ_0; x^N) \right]^T \in R^{1 \times d}, so that (31) reads E(αβ^T) = I.

Consider the 2d \times 1 vector \binom{α}{β}:
E\left[ \binom{α}{β} (α^T \;\; β^T) \right] = \begin{bmatrix} E(αα^T) & E(αβ^T) \\ E(βα^T) & E(ββ^T) \end{bmatrix} \ge 0 \quad \text{(positive semi-definite)}        (32)

Note that E(αβ^T) = I and E(βα^T) = [E(αβ^T)]^T = I. Taking the Schur complement of the positive semi-definite block matrix in (32),

E(αα^T) - [E(ββ^T)]^{-1} \ge 0        (33)

E(αα^T) \ge [E(ββ^T)]^{-1}        (34)
Since E(ββ^T), evaluated at θ_0, is exactly the Fisher Information Matrix, this gives (24) with M in the first form, (25-a). Differentiating the transpose of (30) yields

0 = \int_{R^N} \left[ \frac{d^2}{dθ_0^2} \log f_y(θ_0; x^N) \right] f_y(θ_0; x^N) \, dx^N + \int_{R^N} \left[ \frac{d}{dθ_0} \log f_y(θ_0; x^N) \right] \left[ \frac{d}{dθ_0} f_y(θ_0; x^N) \right]^T dx^N        (35)

Using (28) in the second integral and rewriting (35) in expectation form yields (25-b).
Q.E.D.
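A small Monte Carlo illustration (added here; not part of the original notes) makes the bound concrete for the Gaussian-mean problem of Section 15.1 with known variance λ: the Fisher information of N samples is M = N/λ, so the bound (24) is λ/N, and the sample mean, being unbiased, attains it. The simulation parameters below are arbitrary assumptions.

# Monte Carlo sketch: the sample mean attains the Cramer-Rao bound lam/N
# for the mean of a Gaussian with known variance lam (M = N/lam).
import numpy as np

rng = np.random.default_rng(2)
m0, lam, N, trials = 1.0, 4.0, 50, 20000          # assumed true mean, variance, sizes

estimates = np.array([rng.normal(m0, np.sqrt(lam), N).mean() for _ in range(trials)])
print(estimates.var())       # empirical variance of the sample-mean estimator
print(lam / N)               # Cramer-Rao Lower Bound M^{-1}; the two should be close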
15.4 Best Unbiased Estimators for Dynamical Systems.
The Cramer-Rao Lower Bound, given by the inverse of the Fisher Information Matrix M, is the best error-covariance performance we can expect among all unbiased estimates. Interestingly, the Maximum Likelihood Estimate attains this best error-covariance performance for large sample sizes. The following formulates the problem for proving this; the complete proof, however, is left for your Extra-Credit Problem.
Based on the logarithmic likelihood function, we can compute the Fisher
Information Matrix, the inverse of which provides the lower bound of the error
covariance. To simplify the problem, we assume that the PDF of prediction error ε
is a function of ε only. Namely,
\frac{∂f_ε}{∂θ} = 0, \qquad \frac{∂f_ε}{∂t} = 0

f_ε(ε, t, θ) \;\xrightarrow{\;\text{reduces to}\;}\; f_ε(ε)        (36)
Note, however, that the total derivative of f_ε with respect to θ is not zero, since ε depends on θ: ε(t; θ) = y(t) - \hat{y}(t \mid θ). Denoting l_0(ε) ≡ -\log f_ε(ε), we can rewrite (19); that is, the log-likelihood function of a dynamical system is given by

\log L(θ) = \sum_{t=1}^{N} \log f_ε(ε(t, θ)) = -\sum_{t=1}^{N} l_0(ε(t, θ))        (37)
Differentiating this w.r.t. θ
\frac{d}{dθ} \log L(θ) = -\sum_{t=1}^{N} \frac{d l_0}{dε} \frac{dε}{dθ} = \sum_{t=1}^{N} l_0'[ε(t, θ)] \, ψ(t, θ)        (38)

Note that \frac{dε}{dθ} = -\frac{d\hat{y}}{dθ} = -ψ. Replacing \frac{d}{dθ} \log f_y in (25) by the above expression yields the Fisher Information Matrix M, evaluated at the true parameter value θ_0. At this θ_0 we can assume that the prediction errors ε(t, θ_0) are generated as a sequence of independent random variables ε(t, θ_0) = e_0(t) whose PDF is f_ε(x).
This leads to the Fisher Information Matrix given by

M = \frac{1}{k_0} \sum_{t=1}^{N} E[ ψ(t, θ_0) ψ^T(t, θ_0) ] , \qquad \frac{1}{k_0} ≡ \int_{-∞}^{∞} \frac{[f_ε'(x)]^2}{f_ε(x)} \, dx
On the other hand, as N tends to infinity, the prediction-error method provides the asymptotic variance P_θ, as discussed last time. From (20) we can treat the Maximum Likelihood Estimate as a special case of the prediction-error method. Therefore, the previous result on asymptotic variance applies to the MLE.
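To see how the two quantities line up in a concrete case (an added sketch, not from the original notes, using an assumed Gaussian AR(1) model), note that for \hat{y}(t \mid a) = a\,y(t-1) we have ψ(t, a) = y(t-1), that 1/k_0 = 1/λ for a Gaussian f_ε with variance λ, and that E[y(t-1)^2] = λ/(1 - a^2), so M^{-1} ≈ (1 - a^2)/N for large N. The Monte Carlo variance of the prediction-error (least-squares, here also ML) estimate of a should approach this value.

# Sketch: for an assumed Gaussian AR(1) process, compare the Monte Carlo variance
# of the least-squares (= ML) estimate of the coefficient with the inverse Fisher
# information M^{-1} = (1 - a0^2)/N obtained from the formula above.
import numpy as np

rng = np.random.default_rng(3)
a0, lam, N, trials = 0.6, 1.0, 400, 5000          # assumed system and experiment sizes

def estimate_once():
    e = rng.normal(scale=np.sqrt(lam), size=N)
    y = np.zeros(N)
    for t in range(1, N):
        y[t] = a0 * y[t - 1] + e[t]
    # least-squares / ML estimate: minimizes the sum of squared prediction errors
    return np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1])

a_hats = np.array([estimate_once() for _ in range(trials)])
print(a_hats.var())                # empirical variance of the estimate
print((1 - a0 ** 2) / N)           # Cramer-Rao bound M^{-1} for this model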
Extra Credit Problem
Obtain the asymptotic variance of the MLE for dynamical systems, and show that it agrees with the Cramer-Rao Lower Bound given by the inverse of the Fisher Information Matrix; that is, the Maximum Likelihood Estimate provides the best unbiased estimate for large N.