ECE 8443 – Pattern Recognition

LECTURE 05: MAXIMUM LIKELIHOOD ESTIMATION

• Objectives: Discrete Features; Maximum Likelihood
• Resources: D.H.S.: Chapter 3 (Part 1); D.H.S.: Chapter 3 (Part 2); J.O.S.: Tutorial; Nebula: Links; BGSU: Example; A.W.M.: Tutorial; A.W.M.: Links; S.P.: Primer; CSRN: Unbiased; A.W.M.: Bias

Discrete Features
• For problems where features are discrete, integrals over densities are replaced by sums over probabilities: $\int p(\mathbf{x}|\omega_j)\,d\mathbf{x} \rightarrow \sum_{\mathbf{x}} P(\mathbf{x}|\omega_j)$.
• Bayes formula involves probabilities (not densities): $P(\omega_j|\mathbf{x}) = \frac{P(\mathbf{x}|\omega_j)P(\omega_j)}{P(\mathbf{x})}$, where $P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x}|\omega_j)P(\omega_j)$.
• The Bayes decision rule remains the same: $\alpha^* = \arg\min_i R(\alpha_i|\mathbf{x})$.
• The maximum entropy distribution is the uniform distribution: $P(x = x_i) = 1/N$.
ECE 8443: Lecture 05, Slide 1

Discriminant Functions for Discrete Features
• Consider independent binary features: $p_i = \Pr[x_i = 1 \mid \omega_1]$, $q_i = \Pr[x_i = 1 \mid \omega_2]$, $\mathbf{x} = (x_1, \ldots, x_d)^t$.
• Assuming conditional independence:
  $P(\mathbf{x}|\omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i}$ and $P(\mathbf{x}|\omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1 - q_i)^{1 - x_i}$.
• The likelihood ratio is:
  $\frac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} = \prod_{i=1}^{d} \frac{p_i^{x_i}(1 - p_i)^{1 - x_i}}{q_i^{x_i}(1 - q_i)^{1 - x_i}}$.
• The discriminant function (the log of the likelihood ratio plus the log prior ratio) is linear in $\mathbf{x}$:
  $g(\mathbf{x}) = \sum_{i=1}^{d}\left[x_i \ln\frac{p_i}{q_i} + (1 - x_i)\ln\frac{1 - p_i}{1 - q_i}\right] + \ln\frac{P(\omega_1)}{P(\omega_2)} = \sum_{i=1}^{d} w_i x_i + w_0$,
  where $w_i = \ln\frac{p_i(1 - q_i)}{q_i(1 - p_i)}$ and $w_0 = \sum_{i=1}^{d}\ln\frac{1 - p_i}{1 - q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$.
ECE 8443: Lecture 05, Slide 2

Introduction to Maximum Likelihood Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, $P(\omega_i)$, and the class-conditional densities, $p(\mathbf{x}|\omega_i)$.
• What can we do if we do not have this information?
• What limitations do we face?
• There are two common approaches to parameter estimation: maximum likelihood and Bayesian estimation.
• Maximum Likelihood: treat the parameters as quantities whose values are fixed but unknown.
• Bayes: treat the parameters as random variables having some known prior distribution. Observing samples converts this prior to a posterior.
• Bayesian Learning: sharpen the a posteriori density, causing it to peak near the true value.
ECE 8443: Lecture 05, Slide 3

General Principle
• I.I.D.: $c$ data sets, $D_1, \ldots, D_c$, where $D_j$ is drawn independently according to $p(\mathbf{x}|\omega_j)$.
• Assume $p(\mathbf{x}|\omega_j)$ has a known parametric form and is completely determined by the parameter vector $\boldsymbol{\theta}_j$ (e.g., $p(\mathbf{x}|\omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$, where $\boldsymbol{\theta}_j = [\mu_1, \ldots, \mu_d, \sigma_{11}, \sigma_{12}, \ldots, \sigma_{dd}]^t$).
• $p(\mathbf{x}|\omega_j)$ has an explicit dependence on $\boldsymbol{\theta}_j$: $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$.
• Use the training samples to estimate $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_c$.
• Functional independence: assume $D_i$ gives no useful information about $\boldsymbol{\theta}_j$ for $i \neq j$.
• This simplifies notation to a single set $D$ of training samples $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ drawn independently from $p(\mathbf{x}|\boldsymbol{\theta})$, used to estimate $\boldsymbol{\theta}$.
• Because the samples were drawn independently: $p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})$.
ECE 8443: Lecture 05, Slide 4

Example of ML Estimation
• $p(D|\boldsymbol{\theta})$ is called the likelihood of $\boldsymbol{\theta}$ with respect to the data.
• The value of $\boldsymbol{\theta}$ that maximizes this likelihood, denoted $\hat{\boldsymbol{\theta}}$, is the maximum likelihood (ML) estimate of $\boldsymbol{\theta}$.
• Given several training points (figure not reproduced here):
• Top: candidate source distributions are shown. Which distribution is the ML estimate?
• Middle: the likelihood of the data as a function of $\theta$ (the mean).
• Bottom: the log likelihood.
ECE 8443: Lecture 05, Slide 5

General Mathematics
• Let $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_p)^t$ and $\nabla_{\boldsymbol{\theta}} = \left[\frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$.
• Define the log likelihood: $l(\boldsymbol{\theta}) \equiv \ln p(D|\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k|\boldsymbol{\theta})$, so that $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} l(\boldsymbol{\theta})$.
• The ML estimate is found by solving this equation:
  $\nabla_{\boldsymbol{\theta}} l = \sum_{k=1}^{n} \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_k|\boldsymbol{\theta}) = 0$.
• The solution to this equation can be a global maximum, a local maximum, or even an inflection point.
• Under what conditions is it a global maximum?
ECE 8443: Lecture 05, Slide 6
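As a concrete illustration of the two preceding slides, the short Python sketch below evaluates the log likelihood $l(\theta)$ of a one-dimensional Gaussian with known variance over a grid of candidate means, and checks that the grid maximizer agrees with the closed-form solution of $\nabla_{\boldsymbol{\theta}} l = 0$ (the sample mean). This is a sketch only; the sample values, grid limits, and known variance are illustrative assumptions, not part of the original slides.

```python
import numpy as np

# Minimal numerical sketch of the "Example of ML Estimation" and
# "General Mathematics" slides: evaluate l(mu) = sum_k ln p(x_k | mu) for a
# 1-D Gaussian with known sigma over a grid of candidate means, then compare
# the grid maximizer with the solution of grad l = 0 (the sample mean).
# All numerical values below are illustrative assumptions.

rng = np.random.default_rng(0)
sigma = 1.0                                      # assumed known standard deviation
x = rng.normal(loc=2.0, scale=sigma, size=25)    # training set D = {x_1, ..., x_n}

def log_likelihood(mu, x, sigma):
    """l(mu) = sum_k ln p(x_k | mu) for a Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                  - 0.5 * ((x - mu) / sigma) ** 2)

grid = np.linspace(0.0, 4.0, 2001)               # candidate values of the mean
l_values = np.array([log_likelihood(mu, x, sigma) for mu in grid])
mu_ml_grid = grid[np.argmax(l_values)]           # numerical maximizer of l(mu)
mu_ml_closed_form = x.mean()                     # sample mean

print(f"grid-search ML estimate : {mu_ml_grid:.4f}")
print(f"sample mean             : {mu_ml_closed_form:.4f}")
```

Plotting `l_values` against `grid` reproduces the likelihood and log-likelihood curves described on the Example of ML Estimation slide.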
Maximum A Posteriori Estimation
• A class of estimators, maximum a posteriori (MAP), maximizes $l(\boldsymbol{\theta})\,p(\boldsymbol{\theta})$, where $p(\boldsymbol{\theta})$ describes the prior probability of different parameter values.
• An ML estimator is a MAP estimator with a uniform prior.
• A MAP estimator finds the peak, or mode, of a posterior density.
• MAP estimators are not transformation invariant (if we perform a nonlinear transformation of the input data, the estimator is no longer optimal in the new space). This observation will be useful later in the course.
ECE 8443: Lecture 05, Slide 7

Gaussian Case: Unknown Mean
• Consider the case where only the mean, $\boldsymbol{\theta} = \boldsymbol{\mu}$, is unknown. We need to solve $\sum_{k=1}^{n}\nabla_{\boldsymbol{\theta}}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) = 0$.
• For a multivariate Gaussian:
  $\ln p(\mathbf{x}_k|\boldsymbol{\mu}) = \ln\left[(2\pi)^{-d/2}|\boldsymbol{\Sigma}|^{-1/2}\exp\left(-\tfrac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k - \boldsymbol{\mu})\right)\right] = -\tfrac{1}{2}\ln\left[(2\pi)^d|\boldsymbol{\Sigma}|\right] - \tfrac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k - \boldsymbol{\mu})$,
  which implies
  $\nabla_{\boldsymbol{\mu}}\ln p(\mathbf{x}_k|\boldsymbol{\mu}) = \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k - \boldsymbol{\mu})$,
  because the first term does not depend on $\boldsymbol{\mu}$ and the gradient of the quadratic term is $\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k - \boldsymbol{\mu})$.
ECE 8443: Lecture 05, Slide 8

Gaussian Case: Unknown Mean
• Substituting into the expression for the total likelihood:
  $\nabla_{\boldsymbol{\mu}} l = \sum_{k=1}^{n}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k - \boldsymbol{\mu}) = 0$.
• Rearranging terms:
  $\sum_{k=1}^{n}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}\mathbf{x}_k - n\hat{\boldsymbol{\mu}} = 0 \;\Rightarrow\; \hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$.
• Significance??? The ML estimate of the mean is simply the sample mean of the training data.
ECE 8443: Lecture 05, Slide 9

Gaussian Case: Unknown Mean and Variance
• Let $\boldsymbol{\theta} = [\mu, \sigma^2]^t = [\theta_1, \theta_2]^t$. The log likelihood of a SINGLE point is:
  $\ln p(x_k|\boldsymbol{\theta}) = -\tfrac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k - \theta_1)^2$,
  so
  $\nabla_{\boldsymbol{\theta}}\ln p(x_k|\boldsymbol{\theta}) = \left[\frac{1}{\theta_2}(x_k - \theta_1),\; -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2}\right]^t$.
• Setting the gradient of the full likelihood to zero leads to:
  $\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k - \hat{\theta}_1) = 0$ and $-\sum_{k=1}^{n}\frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k - \hat{\theta}_1)^2}{2\hat{\theta}_2^2} = 0$.
ECE 8443: Lecture 05, Slide 10

Gaussian Case: Unknown Mean and Variance
• Solving these equations gives:
  $\hat{\mu} = \hat{\theta}_1 = \frac{1}{n}\sum_{k=1}^{n}x_k$ and $\hat{\sigma}^2 = \hat{\theta}_2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$.
• In the multivariate case:
  $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$ and $\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^t$.
• The true covariance is the expected value of the matrix $(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t$, so this is a familiar result.
ECE 8443: Lecture 05, Slide 11

Convergence of the Mean
• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let's start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
  $E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu$,
  so the ML estimate of the mean is unbiased.
ECE 8443: Lecture 05, Slide 12

Variance of the ML Estimate of the Mean
• The variance of the ML estimate of the mean is:
  $\operatorname{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2 = E[\hat{\mu}^2] - \mu^2$, where $E[\hat{\mu}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i \cdot \frac{1}{n}\sum_{j=1}^{n}x_j\right] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_i x_j]$.
• The expected value of $x_i x_j$ is $\mu^2$ for $i \neq j$, since the two random variables are independent.
• The expected value of $x_i^2$ is $\mu^2 + \sigma^2$.
• Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$ and $n$ terms with expected value $\mu^2 + \sigma^2$.
• Thus,
  $\operatorname{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2 - n)\mu^2 + n(\mu^2 + \sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$,
  which implies $E[\hat{\mu}^2] = \operatorname{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \mu^2 + \frac{\sigma^2}{n}$.
• We see that the variance of the estimate goes to zero as $n$ goes to infinity, so our estimate converges to the true value (the error goes to zero).
ECE 8443: Lecture 05, Slide 13

Summary
• Discriminant functions for discrete features are completely analogous to the continuous case (end of Chapter 2).
• To develop an optimal classifier, we need reliable estimates of the statistics of the features.
• In Maximum Likelihood (ML) estimation, we treat the parameters as having unknown but fixed values.
• This justified many well-known results for estimating parameters (e.g., computing the mean by averaging the observations).
• Biased and unbiased estimators (a numerical check of the bias of the ML variance estimate follows below).
• Convergence of the mean and variance estimates.
ECE 8443: Lecture 05, Slide 14
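As referenced in the Summary, the sketch below is a Monte Carlo check (not part of the original slides) of the results on Slides 12 and 13, together with the standard bias result for the $1/n$ variance estimate, $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$, which is consistent with the "biased and unbiased estimators" bullet above. The values $\mu = 0$, $\sigma^2 = 4$, and $n = 10$ are illustrative assumptions.

```python
import numpy as np

# Monte Carlo check (illustrative, not part of the original slides) of:
#   E[mu_hat]     = mu
#   var[mu_hat]   = sigma^2 / n
#   E[sigma2_hat] = (n - 1)/n * sigma^2   (the 1/n variance estimate is biased)

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 0.0, 4.0, 10, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1)        # ML estimate of the mean for each trial
sigma2_hat = samples.var(axis=1)     # ML (1/n, ddof=0) estimate of the variance

print(f"E[mu_hat]     ~ {mu_hat.mean():.4f}   (true mu          = {mu})")
print(f"var[mu_hat]   ~ {mu_hat.var():.4f}   (sigma^2 / n      = {sigma2 / n:.4f})")
print(f"E[sigma2_hat] ~ {sigma2_hat.mean():.4f}   ((n-1)/n * sigma^2 = {(n - 1) / n * sigma2:.4f})")
```

Increasing `n` in the sketch drives the bias factor $\frac{n-1}{n}$ toward one and the variance of the mean estimate toward zero, consistent with the convergence discussion on Slide 13.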