2.160 System Identification, Estimation, and Learning
Lecture Notes No. 20
May 3, 2006

15. Maximum Likelihood

15.1 Principle

Consider an unknown stochastic process that generates the observed data y^N = (y_1, y_2, …, y_N). Assume that each observed datum y_i is generated by an assumed stochastic process having the PDF

f(m, \lambda; x) = \frac{1}{\sqrt{2\pi\lambda}} \exp\left( -\frac{(x-m)^2}{2\lambda} \right)    (1)

where m is the mean, λ is the variance, and x is the random variable associated with y_i. We know that the mean m and the variance λ are determined by

m = \frac{1}{N} \sum_{i=1}^{N} y_i , \qquad \lambda = \frac{1}{N} \sum_{i=1}^{N} (y_i - m)^2    (2)

Let us now obtain the same parameter values, m and λ, based on a different principle: Maximum Likelihood. Assuming that the N observations y_1, y_2, …, y_N are stochastically independent, consider the joint probability density associated with the N observations:

f(m, \lambda; x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\lambda}} \exp\left( -\frac{(x_i-m)^2}{2\lambda} \right)    (3)

Now, once y_1, y_2, …, y_N have been observed (have taken specific values), which parameter values m and λ give the highest probability in f(m, λ; x_1, x_2, …, x_N)? In other words, which values of m and λ are most likely the case? This means that we maximize the following functional with respect to m and λ:

\max_{m, \lambda} f(m, \lambda; y_1, y_2, \ldots, y_N)    (4)

Note that y_1, y_2, …, y_N are known values. Therefore f(m, λ; y_1, y_2, …, y_N) is a function of m and λ only. Using our standard notation, this can be rewritten as

\hat{\theta} = \arg\max_{\theta} f(m, \lambda; y^N)    (5)

where θ̂ is the estimate of m and λ. Maximizing f(m, λ; y^N) is equivalent to maximizing log f(·):

\hat{\theta} = \arg\max_{\theta} \log f(m, \lambda; y^N) = \arg\max_{\theta} \sum_{i=1}^{N} \left( \log\frac{1}{\sqrt{2\pi\lambda}} - \frac{(y_i - m)^2}{2\lambda} \right)    (6)

Taking derivatives and setting them to zero,

\frac{\partial \log f}{\partial m} = \sum_{i=1}^{N} \frac{1}{\lambda} (y_i - m) = 0    (7)

\frac{\partial \log f}{\partial \lambda} = N \frac{d}{d\lambda}\log\frac{1}{\sqrt{2\pi\lambda}} - \frac{d}{d\lambda}\left(\frac{1}{2\lambda}\right) \sum_{i=1}^{N} (y_i - m)^2 = -\frac{N}{2\lambda} + \frac{1}{2\lambda^2}\sum_{i=1}^{N} (y_i - m)^2 = 0    (8)

Solving (7) and (8) gives

m = \frac{1}{N} \sum_{i=1}^{N} y_i    (9)

\lambda = \frac{1}{N} \sum_{i=1}^{N} (y_i - m)^2    (10)

The above results, m = (1/N) Σ y_i and λ = (1/N) Σ (y_i − m)², provide the stochastic process model that is most likely to generate the observed data y^N, and they agree with (2).

This Maximum Likelihood Estimate (MLE) is formally stated as follows.

Maximum Likelihood Estimate
Consider a joint probability density function with parameter vector θ as a stochastic model of an unknown process:

f(\theta; x_1, x_2, \ldots, x_N)    (11)

Given observed data y_1, y_2, …, y_N, form a deterministic function of θ, called the likelihood function:

L(\theta) = f(\theta; y_1, y_2, \ldots, y_N)    (12)

Determine the parameter vector θ so that this likelihood function is maximized:

\hat{\theta} = \arg\max_{\theta} L(\theta)    (13)

This Maximum Likelihood Estimator was found by Ronald A. Fisher in 1912, when he was a third-year undergraduate at Cambridge University. He was 22 years old. Don't be disappointed. You are still young!
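As a quick numerical illustration (not part of the original notes), the sketch below maximizes the log-likelihood (6) with a generic optimizer and checks that the result matches the closed-form estimates (9) and (10). It assumes NumPy and SciPy are available; the synthetic data, the seed, and all variable names are chosen purely for illustration.

```python
# Numerical check of Eqs. (9)-(10): maximizing the Gaussian log-likelihood of
# Eq. (6) over (m, lambda) reproduces the sample mean and the 1/N sample
# variance.  Illustrative sketch only; names are not from the notes.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=2.0, size=500)   # observed data y_1, ..., y_N
N = len(y)

def neg_log_likelihood(theta):
    m, lam = theta                             # lam plays the role of lambda
    if lam <= 0:
        return np.inf                          # keep the variance positive
    return 0.5 * N * np.log(2 * np.pi * lam) + np.sum((y - m) ** 2) / (2 * lam)

# Maximizing log f is the same as minimizing -log f, Eq. (6)
res = minimize(neg_log_likelihood, x0=np.array([0.0, 1.0]), method="Nelder-Mead")
m_ml, lam_ml = res.x

m_hat = np.mean(y)                             # Eq. (9)
lam_hat = np.mean((y - m_hat) ** 2)            # Eq. (10)

print("numerical MLE:", m_ml, lam_ml)
print("closed form  :", m_hat, lam_hat)        # the two pairs should agree closely
```

Note that (10) is the 1/N (biased) sample variance rather than the 1/(N−1) unbiased version; that is exactly what maximum likelihood yields here.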
15.2 Likelihood Function for Probabilistic Models of Dynamic Systems

Consider a model set in predictor form

\mathcal{M}(\theta): \quad \hat{y}(t \mid \theta) = g(t, z^{t-1}; \theta)    (14)

and an associated PDF of the prediction error,

f_\varepsilon(x, t; \theta)    (15)

\varepsilon(t; \theta) = y(t) - \hat{y}(t \mid \theta)    (16)

Assume that the random process ε(t) is independent in t, and consider the joint probability density function

f(\theta; x_1, x_2, \ldots, x_N) = \prod_{t=1}^{N} f_\varepsilon(x_t, t; \theta)    (17)

Substituting the observed data y_1, y_2, …, y_N into (17) yields a likelihood function of the parameter θ,

L(\theta) = \prod_{t=1}^{N} f_\varepsilon\big( y(t) - \hat{y}(t \mid \theta), t; \theta \big)    (18)

Maximizing this function is equivalent to maximizing its logarithm,

\frac{1}{N}\log L(\theta) = \frac{1}{N}\sum_{t=1}^{N} \log f_\varepsilon(\varepsilon, t; \theta)    (19)

Replacing log f_ε by l(ε, θ, t) = −log f_ε(ε, t; θ),

\hat{\theta}_{ML}(Z^N) = \arg\min_{\theta} \frac{1}{N}\sum_{t=1}^{N} l(\varepsilon, \theta, t)    (20)

The above minimization is of a deterministic function and is analogous to the prediction-error method with the least-squares criterion,

\hat{\theta}_{LS}(z^N) = \arg\min_{\theta} \frac{1}{N}\sum_{t=1}^{N} \frac{1}{2}\varepsilon(t, \theta)^2    (21)

Replacing (1/2) ε(t, θ)² by l(ε, θ, t),

\hat{\theta}_{LS}(z^N) = \arg\min_{\theta} \frac{1}{N}\sum_{t=1}^{N} l(\varepsilon, \theta, t)

Therefore, the Maximum Likelihood Estimate is a special case of the prediction-error method. The function l(ε, θ, t) may take a different form depending on the assumed PDF of the process under consideration.

15.3 The Cramer-Rao Lower Bound

Consider the mean square error matrix of an estimate θ̂_N compared to its true value θ_0,

P = E\left[ \big( \hat{\theta}(y^N) - \theta_0 \big)\big( \hat{\theta}(y^N) - \theta_0 \big)^T \right]    (22)

This mean square error matrix measures the overall quality of an estimate. Our goal is to find a lower limit to this matrix. The limit, called the Cramer-Rao Lower Bound, gives the best performance that we can expect among all unbiased estimators.

Theorem (The Cramer-Rao Lower Bound)
Let θ̂(y^N) be an unbiased estimate of θ,

E[\hat{\theta}(y^N)] = \theta_0    (23)

where the PDF of y^N is f_y(θ_0; y^N). Then

E\left[ \big( \hat{\theta}(y^N) - \theta_0 \big)\big( \hat{\theta}(y^N) - \theta_0 \big)^T \right] \geq M^{-1}    (24)

where M is the Fisher Information Matrix, given by

M = E\left[ \frac{d}{d\theta}\log f_y(\theta; x^N) \right]\left[ \frac{d}{d\theta}\log f_y(\theta; x^N) \right]^T \Bigg|_{\theta_0}    (25-a)

\phantom{M} = -E\left[ \frac{d^2}{d\theta^2}\log f_y(\theta; x^N) \right] \Bigg|_{\theta_0}    (25-b)

Proof
From (23),

\theta_0 = \int_{R^N} \hat{\theta}(x^N)\, f_y(\theta_0; x^N)\, dx^N    (26)

where the dummy vector x^N is an N × 1 vector in the space R^N. Differentiating (26), which is a d × 1 vector equation, with respect to θ_0 ∈ R^{d×1} yields

I = \int_{R^N} \hat{\theta}(x^N) \left[ \frac{d}{d\theta_0} f_y(\theta_0; x^N) \right]^T dx^N    (27)

where I ∈ R^{d×d} is the identity matrix. Using the relationship

\frac{d}{d\theta}\log f(\theta) = \frac{1}{f(\theta)} \frac{d}{d\theta} f(\theta)    (28)

(27) can be written as

I = \int_{R^N} \hat{\theta}(x^N) \left[ \frac{d}{d\theta_0}\log f_y(\theta_0; x^N) \right]^T f_y(\theta_0; x^N)\, dx^N = E\left[ \hat{\theta}(x^N) \left( \frac{d}{d\theta_0}\log f_y(\theta_0; x^N) \right)^T \right]    (29)

Similarly, differentiating the identity 1 = \int_{R^N} f_y(\theta_0; x^N)\, dx^N yields

0 = \int_{R^N} \left[ \frac{d}{d\theta_0} f_y(\theta_0; x^N) \right]^T dx^N = \int_{R^N} \left[ \frac{d}{d\theta_0}\log f_y(\theta_0; x^N) \right]^T f_y(\theta_0; x^N)\, dx^N

\therefore \quad 0 = E\left[ \frac{d}{d\theta_0}\log f_y(\theta_0; x^N) \right]^T    (30)

Multiplying (30) by θ_0 and subtracting it from (29),

E\left\{ \big[ \hat{\theta}(y^N) - \theta_0 \big] \left[ \frac{d}{d\theta_0}\log f_y(\theta_0; x^N) \right]^T \right\} = I    (31)

Let α = θ̂(y^N) − θ_0 ∈ R^{d×1} and β = (d/dθ_0) log f_y(θ_0; x^N) ∈ R^{d×1}, so that E(αβ^T) = I. Consider the 2d × 1 vector formed by stacking α and β:

E\left[ \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \begin{pmatrix} \alpha^T & \beta^T \end{pmatrix} \right] = \begin{bmatrix} E\,\alpha\alpha^T & E\,\alpha\beta^T \\ E\,\beta\alpha^T & E\,\beta\beta^T \end{bmatrix} \geq 0 \quad \text{(positive semi-definite)}    (32)

Note that Eαβ^T = I and Eβα^T = (Eαβ^T)^T = I. Since the block matrix in (32) is positive semi-definite, so is its Schur complement with respect to the lower-right block:

E(\alpha\alpha^T) - I\,[E(\beta\beta^T)]^{-1} I \geq 0    (33)

\therefore \quad E(\alpha\alpha^T) \geq [E(\beta\beta^T)]^{-1}    (34)

Since E(ββ^T) is exactly the Fisher Information Matrix in the form (25-a), this establishes the bound (24) with M given by the first equation, (25-a).

Differentiating the transpose of (30) yields

0 = \int_{R^N} \left[ \frac{d^2}{d\theta_0^2}\log f_y(\theta_0; x^N) \right] f_y(\theta_0; x^N)\, dx^N + \int_{R^N} \left[ \frac{d}{d\theta_0}\log f_y(\theta_0; x^N) \right] \left[ \frac{d}{d\theta_0} f_y(\theta_0; x^N) \right]^T dx^N    (35)

Rewriting this in expectation form (using (28) on the second term) shows that (25-a) and (25-b) are equal. Q.E.D.
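To make the two forms (25-a) and (25-b) and the bound (24) concrete, here is a small Monte Carlo sketch (not part of the original notes) for the scalar case θ = m with the variance λ known. In that case the score is (d/dm) log f_y = Σ (y_i − m)/λ, both forms give M = N/λ, and the unbiased sample-mean estimator attains the bound with variance λ/N. All numerical values and names are illustrative.

```python
# Monte Carlo check of the Cramer-Rao bound (24)-(25) for a scalar parameter:
# theta = m, variance lam known.  Both (25-a) and (25-b) give M = N/lam, and
# the sample mean attains the bound var(m_hat) = lam/N = 1/M.
import numpy as np

rng = np.random.default_rng(2)
m0, lam, N, runs = 1.0, 4.0, 50, 20000

scores, hessians, estimates = [], [], []
for _ in range(runs):
    y = rng.normal(m0, np.sqrt(lam), size=N)
    scores.append(np.sum(y - m0) / lam)        # (d/dm) log f_y at theta_0
    hessians.append(-N / lam)                  # (d^2/dm^2) log f_y (constant here)
    estimates.append(np.mean(y))               # unbiased estimate of m

M_a = np.mean(np.square(scores))               # Eq. (25-a): E[(score)^2]
M_b = -np.mean(hessians)                       # Eq. (25-b): -E[d^2 log f_y / dm^2]
print("M via (25-a), (25-b), exact:", M_a, M_b, N / lam)
print("var(m_hat) vs 1/M          :", np.var(estimates), lam / N)
```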
15.4 Best Unbiased Estimators for Dynamical Systems

The Cramer-Rao Lower Bound, given by the inverse of the Fisher Information Matrix M, is the best error-covariance performance we can expect among all unbiased estimates. Interestingly, the Maximum Likelihood Estimate attains this best error-covariance performance for large data sets. The following formulates the problem for proving this; the complete proof, however, is left for your Extra-Credit Problem.

Based on the logarithmic likelihood function, we can compute the Fisher Information Matrix, the inverse of which provides the lower bound of the error covariance. To simplify the problem, we assume that the PDF of the prediction error ε is a function of ε only; namely,

\frac{\partial f_\varepsilon}{\partial \theta} = 0, \quad \frac{\partial f_\varepsilon}{\partial t} = 0: \qquad f_\varepsilon(\varepsilon, t, \theta) \ \text{reduces to} \ f_\varepsilon(\varepsilon)    (36)

Note, however, that the total derivative of f_ε with respect to θ is not zero, since ε depends on θ: ε(t; θ) = y(t) − ŷ(t|θ).

Denoting l_0(ε) ≡ −log f_ε(ε), we can rewrite (19); the log likelihood function of a dynamical system is given by

\log L(\theta) = \sum_{t=1}^{N} \log f_\varepsilon(\varepsilon) = -\sum_{t=1}^{N} l_0\big( \varepsilon(t, \theta) \big)    (37)

Differentiating this with respect to θ,

\frac{d}{d\theta}\log L(\theta) = -\sum_{t=1}^{N} \frac{dl_0}{d\varepsilon}\frac{d\varepsilon}{d\theta} = \sum_{t=1}^{N} l_0'\big[ \varepsilon(t, \theta) \big]\, \psi(t, \theta)    (38)

Note that dε/dθ = −dŷ/dθ = −ψ. Replacing (d/dθ) log f_y in (25) by the above expression yields the Fisher Information Matrix M evaluated at the true parameter value θ_0. At θ_0 we can assume that the prediction errors ε(t, θ_0) are generated as a sequence of independent random variables ε(t, θ_0) = e_0(t) whose PDF is f_ε(x). This leads to the Fisher Information Matrix

M = \frac{1}{k_0} \sum_{t=1}^{N} E\big[ \psi(t, \theta_0)\, \psi^T(t, \theta_0) \big], \qquad \frac{1}{k_0} \equiv \int_{-\infty}^{\infty} \frac{[f_\varepsilon'(x)]^2}{f_\varepsilon(x)}\, dx

On the other hand, as N tends to infinity, the prediction-error method provides the asymptotic variance P_θ, as discussed last time. From (20) we can treat the Maximum Likelihood Estimate as a special case of the prediction-error method; therefore, the previous result on asymptotic variance applies to MLE.

Extra Credit Problem
Obtain the asymptotic variance of MLE for dynamical systems, and show that it agrees with the inverse of the Fisher Information Matrix. Namely, the Maximum Likelihood Estimate provides the best unbiased estimate for large N.
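As a final numerical illustration (not part of the original notes, and not a substitute for the extra-credit derivation), the sketch below checks the claim for a hypothetical AR(1) predictor ŷ(t|θ) = θ y(t−1) with Gaussian noise of variance λ. Here ψ(t, θ) = y(t−1) and k_0 = λ, so M ≈ (1/λ) · N · E[y²] = N/(1 − θ_0²), and the Cramer-Rao bound M^{-1} = (1 − θ_0²)/N should be approached by the Gaussian ML (here identical to least-squares) estimate for large N.

```python
# Monte Carlo sketch of Section 15.4: the variance of the ML (= least-squares,
# for Gaussian errors) estimate of an AR(1) coefficient approaches the
# Cramer-Rao bound M^{-1} = (1 - theta0^2)/N for large N.  Illustrative only.
import numpy as np

rng = np.random.default_rng(3)
theta0, lam, N, runs = 0.7, 1.0, 2000, 1000

estimates = []
for _ in range(runs):
    e = rng.normal(scale=np.sqrt(lam), size=N)
    y = np.zeros(N)
    for t in range(1, N):
        y[t] = theta0 * y[t - 1] + e[t]        # AR(1) data
    # Gaussian ML = least squares: theta_hat = sum y(t) y(t-1) / sum y(t-1)^2
    estimates.append(np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1]))

crlb = (1 - theta0 ** 2) / N                   # M^{-1} for this model
print("Monte Carlo variance of theta_hat:", np.var(estimates))
print("Cramer-Rao lower bound           :", crlb)   # the two should be close
```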