Parameter Estimation: Maximum Likelihood Estimation
Chapter 3 (Duda et al.) – Sections 3.1-3.2
CS479/679 Pattern Recognition
Dr. George Bebis

Parameter Estimation
• Bayesian Decision Theory allows us to design an optimal classifier, provided we have first estimated the priors P(ωi) and the class-conditional densities p(x|ωi):

  P(ωj | x) = p(x | ωj) P(ωj) / p(x)

• Estimating P(ωi) is usually not very difficult.
• Estimating p(x|ωi) can be much harder:
  – The dimensionality of the feature space is large.
  – The number of samples is often too small.

Parameter Estimation (cont'd)
• We will make the following assumptions:
  – A set of training samples D = {x1, x2, ..., xn} is given, where the samples were drawn according to p(x|ωj).
  – p(x|ωj) has a known parametric form, e.g., p(x|ωi) ~ N(μi, Σi), also written p(x|θ) where θ = (μi, Σi).
• Parameter estimation problem: given D, find the best possible θ.

Main Methods in Parameter Estimation
• Maximum Likelihood (ML)
• Bayesian Estimation (BE)

Main Methods in Parameter Estimation (cont'd)
• Maximum Likelihood (ML)
  – The best estimate is obtained by maximizing the probability of obtaining the samples D = {x1, x2, ..., xn} actually observed:

    p(x1, x2, ..., xn | θ) = p(D | θ)
    θ̂ = arg max_θ p(D | θ)

  – ML assumes that θ is fixed and makes a point estimate:

    p(x | θ) ≈ p(x | θ̂)

Main Methods in Parameter Estimation (cont'd)
• Bayesian Estimation (BE)
  – Assumes that θ is a set of random variables with a known a-priori distribution p(θ).
  – Estimates a distribution rather than making a point estimate (unlike ML):

    p(x | D) = ∫ p(x | θ) p(θ | D) dθ

  Note: the BE solution p(x|D) might not be of the parametric form assumed for p(x|θ).

ML Estimation - Assumptions
• Consider c classes and c training data sets (i.e., one for each class): D1, D2, ..., Dc.
• Samples in Dj are drawn independently according to p(x|ωj).
• Problem: given D1, D2, ..., Dc and a model p(x|ωj) ~ p(x|θj), estimate θ1, θ2, ..., θc.

ML Estimation - Problem Formulation
• If the samples in Dj provide no information about θi (i ≠ j), we can solve c independent problems (i.e., one for each class).
• The ML estimate for D = {x1, x2, ..., xn} is the value θ̂ that maximizes p(D|θ), i.e., the value that best supports the training data:

  θ̂ = arg max_θ p(D | θ)

  – Using the independence assumption, p(D|θ) simplifies to:

    p(D | θ) = p(x1, x2, ..., xn | θ) = ∏_{k=1}^n p(xk | θ)

ML Estimation - Solution
• How can we find the maximum of p(D|θ)? Set the gradient with respect to θ to zero:

  ∇θ p(D | θ) = 0    (∇θ denotes the gradient with respect to θ)

ML Estimation Using Log-Likelihood
• Taking the log for simplicity:

  p(D | θ) = p(x1, x2, ..., xn | θ) = ∏_{k=1}^n p(xk | θ)
  ln p(D | θ) = ∑_{k=1}^n ln p(xk | θ)    (log-likelihood)

• Maximize ln p(D|θ):

  θ̂ = arg max_θ ln p(D | θ)
  ∇θ ln p(D | θ) = 0    or    ∑_{k=1}^n ∇θ ln p(xk | θ) = 0

Example
[Figure: training data with unknown mean and known variance; the likelihood p(D|θ) and the log-likelihood ln p(D|θ) both peak at θ̂ = μ.]

ML for Multivariate Gaussian Density: Case of Unknown θ=μ
• Assume p(x|μ) ~ N(μ, Σ) with Σ known:

  ln p(xk | μ) = -(1/2)(xk - μ)ᵗ Σ⁻¹ (xk - μ) - (d/2) ln 2π - (1/2) ln |Σ|

• Computing the gradient, we have:

  ∇μ ln p(D | μ) = ∑_k ∇μ ln p(xk | μ) = ∑_k Σ⁻¹ (xk - μ)

ML for Multivariate Gaussian Density: Case of Unknown θ=μ (cont'd)
• Setting ∇μ ln p(D|μ) = 0 we have:

  ∑_{k=1}^n Σ⁻¹ (xk - μ) = 0    or    ∑_{k=1}^n xk - nμ = 0

• The solution μ̂ is given by:

  μ̂ = (1/n) ∑_{k=1}^n xk

  The ML estimate is simply the "sample mean".
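The closed-form result above is easy to check numerically. The following is a minimal sketch (not part of the original slides): it draws samples from a Gaussian with a known covariance, takes the sample mean as the ML estimate μ̂, and verifies that μ̂ attains a log-likelihood at least as high as nearby perturbed values of μ. The function name gaussian_log_likelihood and the chosen "true" parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_log_likelihood(D, mu, sigma_inv, sigma_logdet):
    """ln p(D|mu) = sum_k ln p(x_k|mu) for a multivariate Gaussian with known Sigma."""
    d = D.shape[1]
    diff = D - mu                                            # residuals x_k - mu, shape (n, d)
    quad = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)   # (x_k - mu)^T Sigma^{-1} (x_k - mu)
    return np.sum(-0.5 * quad - 0.5 * d * np.log(2 * np.pi) - 0.5 * sigma_logdet)

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])                      # illustrative "unknown" mean
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])           # known covariance
D = rng.multivariate_normal(mu_true, sigma, size=500)  # training set D = {x_1, ..., x_n}

mu_ml = D.mean(axis=0)                               # ML estimate: the sample mean
sigma_inv = np.linalg.inv(sigma)
_, sigma_logdet = np.linalg.slogdet(sigma)

ll_ml = gaussian_log_likelihood(D, mu_ml, sigma_inv, sigma_logdet)
print("mu_hat (sample mean):", mu_ml)

# The log-likelihood is concave in mu, so the sample mean should beat any perturbed mean.
for _ in range(5):
    mu_other = mu_ml + rng.normal(scale=0.2, size=2)
    assert gaussian_log_likelihood(D, mu_other, sigma_inv, sigma_logdet) <= ll_ml
```

With n = 500 samples the printed estimate lands close to mu_true, consistent with the comment later in the section that ML becomes more accurate as the number of training samples grows.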
Special Case: Maximum A-Posteriori Estimator (MAP)
• Assume that θ is a random variable with a known prior p(θ). Consider:

  p(θ | D) = p(D | θ) p(θ) / p(D)

• Maximize p(θ|D), or equivalently p(D|θ) p(θ), or ln [p(D|θ) p(θ)]:

  ∏_{k=1}^n p(xk | θ) p(θ)    or    ∑_{k=1}^n ln p(xk | θ) + ln p(θ)

Special Case: Maximum A-Posteriori Estimator (MAP) (cont'd)
• What happens when p(θ) is uniform? Then ln p(θ) is constant, so

  ∑_{k=1}^n ln p(xk | θ) + ln p(θ)

  is maximized by the same θ̂ as ∑_{k=1}^n ln p(xk | θ), i.e., MAP is equivalent to ML.

MAP for Multivariate Gaussian Density: Case of Unknown θ=μ
• Assume p(x|μ) ~ N(μ, σμ² I) and p(μ) ~ N(μ0, σμ0² I), where both μ0 and σμ0 are known.
• Maximize ln p(μ|D), or equivalently ln [p(D|μ) p(μ)]:

  ∑_{k=1}^n ln p(xk | μ) + ln p(μ)
  ∇μ ( ∑_{k=1}^n ln p(xk | μ) + ln p(μ) ) = 0

MAP for Multivariate Gaussian Density: Case of Unknown θ=μ (cont'd)
• Setting the gradient to zero:

  ∑_{k=1}^n (1/σμ²)(xk - μ) - (1/σμ0²)(μ - μ0) = 0

  and solving for μ gives:

  μ̂ = ( μ0/σμ0² + (1/σμ²) ∑_{k=1}^n xk ) / ( 1/σμ0² + n/σμ² )

• If σμ0² ≫ σμ² (a very broad prior), then μ̂ ≈ (1/n) ∑_{k=1}^n xk, i.e., the ML estimate.
• What happens when σμ0 → 0? Then μ̂ → μ0: the prior dominates the data. (A numerical sketch of this estimator appears at the end of the section.)

ML for Univariate Gaussian Density: Case of Unknown θ=(μ,σ2)
• Assume p(x|θ) ~ N(μ, σ²) with θ = (θ1, θ2) = (μ, σ²):

  ln p(xk | θ) = -(1/2) ln 2πσ² - (1/(2σ²))(xk - μ)²

  or, in terms of θ,

  ln p(xk | θ) = -(1/2) ln 2πθ2 - (1/(2θ2))(xk - θ1)²

• Differentiating with respect to θ1 and θ2:

  ∂ ln p(xk | θ)/∂θ1 = (1/θ2)(xk - θ1)
  ∂ ln p(xk | θ)/∂θ2 = -1/(2θ2) + (xk - θ1)²/(2θ2²)

ML for Univariate Gaussian Density: Case of Unknown θ=(μ,σ2) (cont'd)
• Setting both sums of partial derivatives to zero:

  ∑_{k=1}^n ∂ ln p(xk | θ)/∂θ1 = 0    and    ∑_{k=1}^n ∂ ln p(xk | θ)/∂θ2 = 0

• The solutions are given by:

  μ̂ = (1/n) ∑_{k=1}^n xk          (sample mean)
  σ̂² = (1/n) ∑_{k=1}^n (xk - μ̂)²   (sample variance)

ML for Multivariate Gaussian Density: Case of Unknown θ=(μ,Σ)
• In the general case (i.e., multivariate Gaussian) the solutions are:

  μ̂ = (1/n) ∑_{k=1}^n xk                     (sample mean)
  Σ̂ = (1/n) ∑_{k=1}^n (xk - μ̂)(xk - μ̂)ᵗ      (sample covariance)

Biased and Unbiased Estimates
• An estimate θ̂ is unbiased when E[θ̂] = θ.
• The ML estimate μ̂ is unbiased, i.e., E[μ̂] = μ.
• The ML estimates σ̂² and Σ̂ are biased:

  E[σ̂²] = ((n-1)/n) σ²
  E[Σ̂] = ((n-1)/n) Σ

Biased and Unbiased Estimates (cont'd)
• The following are unbiased estimates of σ² and Σ (compare them numerically in the sketch at the end of the section):

  (1/(n-1)) ∑_{k=1}^n (xk - μ̂)²
  (1/(n-1)) ∑_{k=1}^n (xk - μ̂)(xk - μ̂)ᵗ

Comments
• ML estimation is simpler than the alternative methods.
• ML estimates become more accurate as the number of training samples increases.
• If the model for p(x|θ) is correct, and the independence assumptions among samples hold, ML works well.
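To make the MAP and bias results above concrete, here is a minimal univariate sketch (not from the original slides; the function name map_mean and all constants are illustrative assumptions). It shows the MAP estimate of a Gaussian mean shrinking toward the prior mean μ0 when the prior is narrow and approaching the ML sample mean when the prior is broad, and then contrasts the biased (1/n) and unbiased (1/(n-1)) variance estimates.

```python
import numpy as np

def map_mean(D, mu0, sigma2, sigma0_sq):
    """MAP estimate of a scalar Gaussian mean with known variance sigma2 and
    Gaussian prior N(mu0, sigma0_sq), using the closed form derived above:
    mu_hat = (mu0/sigma0_sq + sum(x_k)/sigma2) / (1/sigma0_sq + n/sigma2)."""
    n = len(D)
    return (mu0 / sigma0_sq + D.sum() / sigma2) / (1.0 / sigma0_sq + n / sigma2)

rng = np.random.default_rng(1)
mu_true, sigma2 = 3.0, 4.0                       # illustrative "unknown" mean, known variance
D = rng.normal(mu_true, np.sqrt(sigma2), size=20)

mu_ml = D.mean()                                 # ML estimate (sample mean)
mu0 = 0.0                                        # prior mean, deliberately off

# A narrow prior (sigma0_sq -> 0) pulls the estimate toward mu0;
# a broad prior (sigma0_sq >> sigma2) recovers the ML estimate.
for sigma0_sq in (0.01, 1.0, 100.0, 1e6):
    print(f"sigma0^2 = {sigma0_sq:>9}:  mu_MAP = {map_mean(D, mu0, sigma2, sigma0_sq):.4f}"
          f"  (mu_ML = {mu_ml:.4f})")

# Biased vs. unbiased variance estimates (1/n vs. 1/(n-1) normalization).
var_ml = np.mean((D - mu_ml) ** 2)               # ML / sample variance, biased
var_unbiased = np.sum((D - mu_ml) ** 2) / (len(D) - 1)
print("biased (1/n):", var_ml, " unbiased (1/(n-1)):", var_unbiased)
```

The small sample size (n = 20) is chosen deliberately: with few samples the prior visibly pulls the MAP estimate away from the sample mean, and the gap between the 1/n and 1/(n-1) variance estimates is noticeable.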