Statistics 550 Notes 12

Reading: Section 2.2.

I. Maximum Likelihood Properties

Key valuable features of maximum likelihood estimators:
1. The MLE is consistent.
2. The MLE is asymptotically normal: $(\hat{\theta}_{MLE} - \theta)/SE(\hat{\theta}_{MLE})$ converges in distribution to a standard normal distribution for a one-dimensional parameter.
3. The MLE is asymptotically optimal: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance for large samples.

Consistency of maximum likelihood estimates:

Theorem 1: Consider the model $X_1, \ldots, X_n$ iid with pmf or pdf $\{p(X_i \mid \theta), \theta \in \Theta\}$. Suppose (a) the parameter space $\Theta$ is finite; (b) $\theta$ is identifiable; and (c) the $p(X_i \mid \theta)$ have common support for all $\theta$. Then the maximum likelihood estimator $\hat{\theta}_{MLE}$ is consistent as $n \rightarrow \infty$.

Outline of Proof (see Notes 11 for full proof): Let $\theta_0$ denote the true parameter. We first show that for $\theta \neq \theta_0$,
$$P_{\theta_0}\left( l_X(\theta_0) > l_X(\theta) \right) = P_{\theta_0}\left( \frac{1}{n} \sum_{i=1}^n \log \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} < 0 \right) \rightarrow 1 \text{ as } n \rightarrow \infty \qquad (1.1)$$
using Jensen's inequality ($E_{\theta_0}\left[ \log \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} \right] < 0$ for $\theta \neq \theta_0$) and the law of large numbers.

Denote the points other than $\theta_0$ in the finite parameter space by $\theta_1, \ldots, \theta_K$. Let $A_{jn}$ be the event that for $n$ observations, $l_X(\theta_0) > l_X(\theta_j)$. The event $A_{1n} \cap \cdots \cap A_{Kn}$ is contained in the event $\{\hat{\theta}_{MLE} = \theta_0\}$ for $n$ observations. By (1.1), $P(A_{jn}) \rightarrow 1$ as $n \rightarrow \infty$ for $j = 1, \ldots, K$. Consequently, $P(A_{1n} \cap \cdots \cap A_{Kn}) \rightarrow 1$ as $n \rightarrow \infty$, and thus $P_{\theta_0}(\hat{\theta}_{MLE} = \theta_0) \rightarrow 1$ as $n \rightarrow \infty$.

Comments on Consistency:
(1) For infinite parameter spaces, the maximum likelihood estimator can be shown to be consistent under conditions (b)-(c) of the theorem plus the following two assumptions: (i) the parameter space contains an open set of which the true parameter is an interior point (i.e., the true parameter is not on the boundary of the parameter space); (ii) $p(x \mid \theta)$ is differentiable in $\theta$.
(2) The consistency theorem assumes that the parameter space does not depend on the sample size. Maximum likelihood can be inconsistent when the number of parameters increases with the sample size, e.g., $X_1, \ldots, X_n$ independent normals with mean $\mu_i$ and variance $\sigma^2$; the MLE of $\sigma^2$ is inconsistent.

Asymptotic Normality of Maximum Likelihood Estimates:

The consistency of the MLE says that under regularity conditions, $\hat{\theta}_{MLE}$ will be close to the true $\theta_0$ with high probability. We now consider the distribution of a magnified difference between $\hat{\theta}_{MLE}$ and $\theta_0$; this provides more precise information about the distribution of $\hat{\theta}_{MLE}$.

Consider the model $X_1, \ldots, X_n$ iid with pmf or pdf $\{p(X_i \mid \theta), \theta \in \Theta\}$ (we assume $\theta$ is one dimensional for simplicity; the basic ideas carry over to multidimensional $\theta$). The Fisher information number $I(\theta)$ is defined as
$$I(\theta) = E_\theta\left[ \left( \frac{\partial}{\partial \theta} \log p(X \mid \theta) \right)^2 \right].$$

Lemma 1: Under regularity conditions on the smoothness of $p(X \mid \theta)$ (see Note (2) on the regularity conditions below), we have
(i) $E_\theta\left[ \frac{\partial}{\partial \theta} \log p(X \mid \theta) \right] = 0$, and hence $I(\theta) = Var_\theta\left[ \frac{\partial}{\partial \theta} \log p(X \mid \theta) \right]$;
(ii) $I(\theta) = -E_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log p(X \mid \theta) \right]$.

Proof: First, we observe that since $\int p(x \mid \theta) \, dx = 1$ for all $\theta$, we have $\frac{\partial}{\partial \theta} \int p(x \mid \theta) \, dx = 0$. Combining this with the identity $\frac{\partial}{\partial \theta} p(x \mid \theta) = \left[ \frac{\partial}{\partial \theta} \log p(x \mid \theta) \right] p(x \mid \theta)$, we have
$$0 = \int \frac{\partial}{\partial \theta} p(x \mid \theta) \, dx = \int \left[ \frac{\partial}{\partial \theta} \log p(x \mid \theta) \right] p(x \mid \theta) \, dx = E_\theta\left[ \frac{\partial}{\partial \theta} \log p(X \mid \theta) \right],$$
where we have interchanged differentiation and integration, which is justified under the regularity conditions; this proves (i). Taking second derivatives of $\int p(x \mid \theta) \, dx = 1$, we have
$$0 = \frac{\partial^2}{\partial \theta^2} \int p(x \mid \theta) \, dx = \int \frac{\partial}{\partial \theta} \left\{ \left[ \frac{\partial}{\partial \theta} \log p(x \mid \theta) \right] p(x \mid \theta) \right\} dx = \int \left[ \frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) \right] p(x \mid \theta) \, dx + \int \left[ \frac{\partial}{\partial \theta} \log p(x \mid \theta) \right]^2 p(x \mid \theta) \, dx,$$
from which (ii) follows.

Example: For the Poisson distribution, $p(x \mid \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}$, so $\frac{\partial^2}{\partial \lambda^2} \log p(X \mid \lambda) = -\frac{X}{\lambda^2}$ and
$$I(\lambda) = -E_\lambda\left[ \frac{\partial^2}{\partial \lambda^2} \log p(X \mid \lambda) \right] = E_\lambda\left[ \frac{X}{\lambda^2} \right] = \frac{1}{\lambda}.$$
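The Poisson calculation can also be checked numerically. The following is a minimal Monte Carlo sketch (not part of the original notes; it assumes NumPy is available) that estimates $E_\lambda\left[\frac{\partial}{\partial \lambda} \log p(X \mid \lambda)\right]$, $E_\lambda\left[\left(\frac{\partial}{\partial \lambda} \log p(X \mid \lambda)\right)^2\right]$, and $-E_\lambda\left[\frac{\partial^2}{\partial \lambda^2} \log p(X \mid \lambda)\right]$ by simulation; the three printed values should be approximately $0$, $1/\lambda$, and $1/\lambda$, in agreement with Lemma 1 and the example.

```python
# Illustrative sketch (not from the notes): check Lemma 1 and I(lambda) = 1/lambda
# for the Poisson by Monte Carlo, using the score d/d(lambda) log p(x|lambda) = -1 + x/lambda
# and the second derivative d^2/d(lambda)^2 log p(x|lambda) = -x/lambda^2.
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0
x = rng.poisson(lam, size=200_000)

score = -1.0 + x / lam          # first derivative of log p(X | lambda) at lambda
second = -x / lam**2            # second derivative of log p(X | lambda) at lambda

print(score.mean())             # approx 0           (Lemma 1, part (i))
print(np.mean(score**2))        # approx 1/lambda    (definition of I(lambda))
print(-second.mean())           # approx 1/lambda    (Lemma 1, part (ii))
```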
Theorem 2: Under "regularity conditions,"
$$\sqrt{n}\left( \hat{\theta}_{MLE} - \theta_0 \right) \xrightarrow{L} N\left( 0, \frac{1}{I(\theta_0)} \right).$$

Notes:
(1) The $\xrightarrow{L}$ denotes convergence in law (or distribution); $\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{L} N\left(0, \frac{1}{I(\theta_0)}\right)$ means that the CDF of $\sqrt{n}(\hat{\theta}_{MLE} - \theta_0)$ evaluated at a point $x$ converges to the CDF of a $N\left(0, \frac{1}{I(\theta_0)}\right)$ random variable at $x$ for each $x$; see Appendix A.14.

(2) The regularity conditions are: (i) $\theta$ is identifiable; (ii) the $p(X \mid \theta), \theta \in \Theta$, have common support $A = \{x : p(x \mid \theta) > 0\}$; (iii) the parameter space $\Theta$ contains an open set containing the true parameter value $\theta_0$ as an interior point; (iv) for all $x \in A$, $p(x \mid \theta)$ is three times differentiable with respect to $\theta$, the third derivative is continuous in $\theta$, and $\int p(x \mid \theta) \, dx$ can be differentiated three times under the integral sign; (v) for any $\theta_0 \in \Theta$, there exist a positive number $c$ and a function $M(x)$ (both of which may depend on $\theta_0$) such that
$$\left| \frac{\partial^3}{\partial \theta^3} \log p(x \mid \theta) \right| \leq M(x) \text{ for all } x \in A, \ \theta_0 - c < \theta < \theta_0 + c,$$
with $E_{\theta_0}[M(X)] < \infty$.

(3) A key part of the proof is that under the regularity conditions, the MLE must satisfy the likelihood equation. If $\Theta$ is open, $l_X(\theta)$ is differentiable in $\theta$ and $\hat{\theta}_{MLE}$ exists, then $\hat{\theta}_{MLE}$ must satisfy the estimating equation
$$\frac{\partial}{\partial \theta} l_X(\theta) = 0. \qquad (1.2)$$
This is known as the likelihood equation. Solving (1.2) does not necessarily yield the MLE, as there may be solutions of (1.2) that are not maxima, or solutions that are only local maxima.

Outline of proof: For $X_1, \ldots, X_n$ iid, the log likelihood function is $l_X(\theta) = \sum_{i=1}^n \log p(X_i \mid \theta)$. Denote the derivatives with respect to $\theta$ by $l', l'', \ldots$ Expanding the first derivative of the log likelihood around the true value $\theta_0$, we have
$$l_X'(\theta) \approx l_X'(\theta_0) + (\theta - \theta_0) l_X''(\theta_0), \qquad (1.3)$$
where we are going to ignore the higher-order terms (a justifiable maneuver under the regularity conditions). Now substitute $\hat{\theta}_{MLE}$ for $\theta$ in (1.3) and note that $l_X'(\hat{\theta}_{MLE}) = 0$ (see Note (3) above). Rearranging and multiplying through by $\sqrt{n}$ gives us
$$\sqrt{n}\left( \hat{\theta}_{MLE} - \theta_0 \right) \approx \frac{\frac{1}{\sqrt{n}} l_X'(\theta_0)}{-\frac{1}{n} l_X''(\theta_0)}. \qquad (1.4)$$
Note that $l_X'(\theta_0) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log p(X_i \mid \theta) \big|_{\theta = \theta_0}$, so that from the central limit theorem and Lemma 1, we have
$$\frac{1}{\sqrt{n}} l_X'(\theta_0) \xrightarrow{L} N(0, I(\theta_0)).$$
Also $l_X''(\theta_0) = \sum_{i=1}^n \frac{\partial^2}{\partial \theta^2} \log p(X_i \mid \theta) \big|_{\theta = \theta_0}$, so that by the law of large numbers and Lemma 1, we have
$$-\frac{1}{n} l_X''(\theta_0) \xrightarrow{P} -E_{\theta_0}\left[ \frac{\partial^2}{\partial \theta^2} \log p(X \mid \theta_0) \right] = I(\theta_0).$$
Thus, if we let $W \sim N(0, I(\theta_0))$, then $\sqrt{n}(\hat{\theta}_{MLE} - \theta_0)$ converges in law to $W / I(\theta_0) \sim N(0, 1/I(\theta_0))$, proving the theorem.

Asymptotic Optimality of the MLE

Suppose that $X_1, \ldots, X_n \sim N(\theta, 1)$. The MLE is $\hat{\theta}_n = \bar{X}_n$, the sample mean based on the $n$ observations. Another reasonable estimator of $\theta$ is the sample median $\tilde{\theta}_n$. The MLE satisfies
$$\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{L} N(0, 1).$$
It can be proved that the median satisfies
$$\sqrt{n}(\tilde{\theta}_n - \theta) \xrightarrow{L} N\left(0, \frac{\pi}{2}\right).$$
This means that the median is consistent but has a larger variance than the MLE for large sample sizes.

More generally, consider two estimators $T_n$ and $U_n$ and suppose that
$$\sqrt{n}(T_n - \theta) \xrightarrow{L} N(0, t^2) \quad \text{and} \quad \sqrt{n}(U_n - \theta) \xrightarrow{L} N(0, u^2).$$
We define the asymptotic relative efficiency of $U$ to $T$ by $ARE(U, T) = t^2 / u^2$. In the normal example,
$$ARE(\tilde{\theta}_n, \hat{\theta}_n) = \frac{1}{\pi / 2} = \frac{2}{\pi} \approx 0.637.$$
The interpretation is that if you use the median as your estimator of $\theta$, you are effectively using only a fraction $2/\pi \approx 0.64$ of the data compared to using the MLE as your estimator.

Theorem: If $\hat{\theta}_n$ is the MLE and $\tilde{\theta}_n$ is any other estimator, then
$$ARE(\tilde{\theta}_n, \hat{\theta}_n) \leq 1.$$
Thus, the MLE has the smallest asymptotic variance and we say that the MLE is asymptotically efficient and asymptotically optimal.
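To make the asymptotic relative efficiency concrete, here is a small simulation sketch (not from the notes; it assumes NumPy, and the printed values are approximate Monte Carlo estimates) comparing the sample mean (the MLE) and the sample median for $N(\theta, 1)$ data. It checks that $n \cdot Var(\bar{X}_n) \approx 1$, that $n \cdot Var(\tilde{\theta}_n) \approx \pi/2 \approx 1.57$, and that the ratio of variances is approximately $2/\pi \approx 0.64$, the ARE computed above.

```python
# Illustrative sketch (not part of the notes): Monte Carlo comparison of the
# sample mean (the MLE) and the sample median for N(theta, 1) data.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 5.0, 400, 20_000
samples = rng.normal(theta, 1.0, size=(reps, n))   # reps independent samples of size n

means = samples.mean(axis=1)        # MLE in each replication
medians = np.median(samples, axis=1)

print(n * means.var())              # approx 1.0        (asymptotic variance of the MLE)
print(n * medians.var())            # approx pi/2 = 1.57 (asymptotic variance of the median)
print(means.var() / medians.var())  # approx 2/pi = 0.64, the ARE of median to mean
```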
Comments:
(1) We will provide an outline of the proof for this theorem when we study the Cramer-Rao (information) inequality in Chapter 3.4.
(2) The result is actually more subtle than the stated theorem because it only covers a certain class of well-behaved estimators; more details will be studied in Stat 552.

II. Uniqueness and Existence of the MLE

For a finite sample, when does the MLE exist, when is it unique, and how do we find the MLE?

Anomalies of maximum likelihood estimates: Maximum likelihood estimates are not necessarily unique and do not even have to exist.

Nonuniqueness of MLEs example: $X_1, \ldots, X_n$ are iid Uniform$\left(\theta - \frac{1}{2}, \theta + \frac{1}{2}\right)$.
$$L_x(\theta) = \begin{cases} 1 & \text{if } \max_i X_i - \frac{1}{2} \leq \theta \leq \min_i X_i + \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$
Thus any estimator $\hat{\theta}$ that satisfies $\max_i X_i - \frac{1}{2} \leq \hat{\theta} \leq \min_i X_i + \frac{1}{2}$ is a maximum likelihood estimator.

Nonexistence of maximum likelihood estimator: The likelihood function can be unbounded. An important example is a mixture of normal distributions, which is frequently used in applications. $X_1, \ldots, X_n$ iid with density
$$f(x) = p \, \frac{1}{\sqrt{2\pi} \sigma_1} \exp\left\{ -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right\} + (1 - p) \, \frac{1}{\sqrt{2\pi} \sigma_2} \exp\left\{ -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right\}.$$
This is a mixture of two normal distributions. The unknown parameters are $(p, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$. Let $\mu_1 = X_1$. Then as $\sigma_1 \rightarrow 0$, $f(X_1) \rightarrow \infty$, so that the likelihood function is unbounded.
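A quick numerical illustration of the unbounded mixture likelihood (a sketch added for illustration, not part of the notes; it assumes NumPy and SciPy are available): fix $\mu_1 = X_1$, hold the other parameters at reasonable values, and let $\sigma_1 \rightarrow 0$; the printed log likelihood increases without bound even though the corresponding fit is degenerate.

```python
# Illustrative sketch (not from the notes): the two-component normal mixture
# log likelihood diverges as sigma1 -> 0 when mu1 is set equal to X_1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=50)       # data actually generated from a single normal

def mixture_loglik(x, p, mu1, sigma1, mu2, sigma2):
    dens = p * norm.pdf(x, mu1, sigma1) + (1 - p) * norm.pdf(x, mu2, sigma2)
    return np.sum(np.log(dens))

for sigma1 in [1.0, 0.1, 0.01, 0.001, 1e-6]:
    print(sigma1, mixture_loglik(x, p=0.5, mu1=x[0], sigma1=sigma1,
                                 mu2=x.mean(), sigma2=x.std()))
# The log likelihood grows without bound as sigma1 shrinks, so no MLE exists.
```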