Statistics 512 Notes 14: Properties of Maximum Likelihood Estimates Continued

Good properties of maximum likelihood estimates:
(1) Invariance
(2) Consistency
(3) Asymptotic Normality
(4) Efficiency

Asymptotic Normality

Suppose $X_1, \ldots, X_n$ are iid with density $f(x;\theta)$, $\theta \in \Omega$. Under regularity conditions, the large-sample distribution of $\hat{\theta}_{MLE}$ is approximately normal with mean $\theta_0$ and variance $1/(nI(\theta_0))$, where $\theta_0$ is the true value of $\theta$.

Regularity Conditions:
(R0) The pdfs $f(x;\theta)$ are distinct, i.e., $\theta \neq \theta'$ implies $f(x;\theta) \neq f(x;\theta')$ (the model is identifiable).
(R1) The pdfs have common support for all $\theta$.
(R2) The point $\theta_0$ is an interior point of $\Omega$.
(R3) The pdf $f(x;\theta)$ is twice differentiable as a function of $\theta$.
(R4) The integral $\int f(x;\theta)\,dx$ can be differentiated twice under the integral sign as a function of $\theta$.

Note that $X_1, \ldots, X_n$ iid uniform on $[0,\theta]$ does not satisfy (R1).

Fisher information: Define $I(\theta)$ by
$$I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial \theta} \log f(X;\theta)\right)^2\right].$$
$I(\theta)$ is called the Fisher information about $\theta$. The greater the squared value of $\frac{\partial}{\partial \theta} \log f(X;\theta)$ is on average, the more information there is to distinguish between different values of $\theta$, making it easier to estimate $\theta$.

Lemma: Under the regularity conditions,
$$I(\theta) = -E_\theta\left[\frac{\partial^2}{\partial \theta^2} \log f(X;\theta)\right].$$

Proof: First, we observe that since $\int f(x;\theta)\,dx = 1$,
$$\frac{\partial}{\partial \theta} \int f(x;\theta)\,dx = 0.$$
Combining this with the identity
$$\frac{\partial}{\partial \theta} f(x;\theta) = \left[\frac{\partial}{\partial \theta} \log f(x;\theta)\right] f(x;\theta),$$
we have
$$0 = \int \frac{\partial}{\partial \theta} f(x;\theta)\,dx = \int \left[\frac{\partial}{\partial \theta} \log f(x;\theta)\right] f(x;\theta)\,dx,$$
where we have interchanged differentiation and integration using regularity condition (R4). Taking derivatives of the expression just above, we have
$$0 = \int \left[\frac{\partial^2}{\partial \theta^2} \log f(x;\theta)\right] f(x;\theta)\,dx + \int \left[\frac{\partial}{\partial \theta} \log f(x;\theta)\right]^2 f(x;\theta)\,dx,$$
so that
$$I(\theta) = \int \left[\frac{\partial}{\partial \theta} \log f(x;\theta)\right]^2 f(x;\theta)\,dx = -\int \left[\frac{\partial^2}{\partial \theta^2} \log f(x;\theta)\right] f(x;\theta)\,dx. \qquad \blacksquare$$

Example: Information for a Bernoulli random variable. Let $X$ be Bernoulli$(p)$. Then
$$\log f(x;p) = x \log p + (1-x) \log(1-p),$$
$$\frac{\partial}{\partial p} \log f(x;p) = \frac{x}{p} - \frac{1-x}{1-p},$$
$$\frac{\partial^2}{\partial p^2} \log f(x;p) = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}.$$
Thus,
$$I(p) = E\left[\frac{X}{p^2} + \frac{1-X}{(1-p)^2}\right] = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}.$$
There is more information about $p$ when $p$ is closer to zero or one.
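As a quick numerical check of this calculation, the defining formula for Fisher information says that the average squared score should be close to $1/(p(1-p))$. The following is a minimal R simulation sketch (the variable names are illustrative, not from the notes):

# Simulation check that I(p) = 1/(p(1-p)) for Bernoulli(p):
# average the squared score X/p - (1-X)/(1-p) over many draws
# and compare with the exact value.
set.seed(512)
p <- 0.3
x <- rbinom(100000, size=1, prob=p)
score <- x/p - (1-x)/(1-p)
mean(score^2)   # simulation estimate of I(p)
1/(p*(1-p))     # exact value, approximately 4.76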
Additional regularity condition:
(R5) The pdf $f(x;\theta)$ is three times differentiable as a function of $\theta$. Further, for all $\theta \in \Omega$, there exist a constant $c$ and a function $M(x)$ such that
$$\left|\frac{\partial^3}{\partial \theta^3} \log f(x;\theta)\right| \le M(x), \quad \theta_0 - c < \theta < \theta_0 + c,$$
with $E_{\theta_0}[M(X)] < \infty$, for all $\theta_0 \in \Omega$ and all $x$ in the support of $X$.

Theorem (6.2.2): Assume $X_1, \ldots, X_n$ are iid with pdf $f(x;\theta_0)$ for $\theta_0 \in \Omega$ such that the regularity conditions (R0)-(R5) are satisfied. Suppose further that the Fisher information satisfies $0 < I(\theta_0) < \infty$. Then
$$\sqrt{n}\left(\hat{\theta}_{MLE} - \theta_0\right) \xrightarrow{D} N\left(0, \frac{1}{I(\theta_0)}\right).$$

Proof (sketch): From a Taylor series expansion,
$$0 = l'(\hat{\theta}_{MLE}) \approx l'(\theta_0) + (\hat{\theta}_{MLE} - \theta_0)\, l''(\theta_0)$$
$$\Rightarrow \hat{\theta}_{MLE} - \theta_0 \approx -\frac{l'(\theta_0)}{l''(\theta_0)}$$
$$\Rightarrow \sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \approx \frac{n^{-1/2}\, l'(\theta_0)}{-\frac{1}{n} l''(\theta_0)}.$$
First, we consider the numerator of this last expression. Its expectation is
$$E_{\theta_0}\left[n^{-1/2} \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(X_i;\theta_0)\right] = 0$$
because
$$E_{\theta_0}\left[\frac{\partial}{\partial \theta} \log f(X_i;\theta_0)\right] = \int \frac{\frac{\partial}{\partial \theta} f(x;\theta_0)}{f(x;\theta_0)}\, f(x;\theta_0)\,dx = \frac{\partial}{\partial \theta} \int f(x;\theta)\,dx \bigg|_{\theta = \theta_0} = 0.$$
Its variance is
$$\mathrm{Var}_{\theta_0}\left[n^{-1/2}\, l'(\theta_0)\right] = \frac{1}{n} \sum_{i=1}^n E_{\theta_0}\left[\left(\frac{\partial}{\partial \theta} \log f(X_i;\theta_0)\right)^2\right] = I(\theta_0).$$
Next we consider the denominator:
$$-\frac{1}{n} l''(\theta_0) = -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial \theta^2} \log f(X_i;\theta_0).$$
By the law of large numbers, the latter expression converges to
$$-E_{\theta_0}\left[\frac{\partial^2}{\partial \theta^2} \log f(X;\theta_0)\right] = I(\theta_0).$$
We thus have
$$\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \approx \frac{n^{-1/2}\, l'(\theta_0)}{I(\theta_0)}.$$
Therefore, $E[\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0)] \approx 0$. Furthermore,
$$\mathrm{Var}[\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0)] \approx \frac{I(\theta_0)}{I(\theta_0)^2} = \frac{1}{I(\theta_0)},$$
and thus
$$\mathrm{Var}[\hat{\theta}_{MLE} - \theta_0] \approx \frac{1}{n I(\theta_0)}.$$
Finally, the central limit theorem may be applied to $l'(\theta_0)$, which is a sum of iid random variables:
$$l'(\theta_0) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(X_i;\theta_0);$$
this yields the claimed limiting normal distribution. $\blacksquare$

Corollary: Under the same assumptions as Theorem 6.2.2,
$$\sqrt{n}\left(\hat{\theta}_{MLE} - \theta_0\right) \xrightarrow{D} N\left(0, \frac{1}{I(\hat{\theta}_{MLE})}\right).$$

Informally, Theorem 6.2.2 and its corollary say that the distribution of the MLE can be approximated by
$$N\left(\theta_0, \frac{1}{n I(\hat{\theta}_{MLE})}\right).$$
From this fact, we can construct an asymptotically correct confidence interval. Let
$$C_n = \left(\hat{\theta}_{MLE} - z_{\alpha/2} \sqrt{\frac{1}{n I(\hat{\theta}_{MLE})}}, \;\; \hat{\theta}_{MLE} + z_{\alpha/2} \sqrt{\frac{1}{n I(\hat{\theta}_{MLE})}}\right).$$
Then $P_{\theta_0}(\theta_0 \in C_n) \to 1 - \alpha$ as $n \to \infty$. For $\alpha = 0.05$, $z_{\alpha/2} = 1.96 \approx 2$, so $\hat{\theta}_{MLE} \pm 2\sqrt{1/(n I(\hat{\theta}_{MLE}))}$ is an approximate 95% confidence interval for $\theta$.

Example 1: Let $X_1, \ldots, X_n$ be iid Bernoulli$(p)$. The MLE is $\hat{p} = \bar{X}$. We calculated above that $I(p) = \frac{1}{p(1-p)}$. Thus, an approximate 95% confidence interval for $p$ is
$$\hat{p} \pm 2\left(\frac{\hat{p}(1-\hat{p})}{n}\right)^{1/2}.$$
This is what the newspapers report when they say "the poll is accurate to within four points, 95 percent of the time."

Computation of maximum likelihood estimates

Example 2: Logistic distribution. Let $X_1, \ldots, X_n$ be iid with density
$$f(x;\theta) = \frac{\exp\{-(x-\theta)\}}{(1 + \exp\{-(x-\theta)\})^2}, \quad -\infty < x < \infty, \; -\infty < \theta < \infty.$$
The log of the likelihood simplifies to:
$$l(\theta) = \sum_{i=1}^n \log f(X_i;\theta) = n\theta - n\bar{X} - 2 \sum_{i=1}^n \log(1 + \exp\{-(X_i - \theta)\}).$$
Using this, the first derivative is
$$l'(\theta) = n - 2 \sum_{i=1}^n \frac{\exp\{-(X_i - \theta)\}}{1 + \exp\{-(X_i - \theta)\}}.$$
Setting this equal to 0 and rearranging terms results in the equation:
$$\sum_{i=1}^n \frac{\exp\{-(X_i - \theta)\}}{1 + \exp\{-(X_i - \theta)\}} = \frac{n}{2}. \qquad (*)$$
Although this does not simplify, we can show that equation (*) has a unique solution. The derivative of the left-hand side of (*) simplifies to
$$\frac{\partial}{\partial \theta} \sum_{i=1}^n \frac{\exp\{-(X_i - \theta)\}}{1 + \exp\{-(X_i - \theta)\}} = \sum_{i=1}^n \frac{\exp\{-(X_i - \theta)\}}{(1 + \exp\{-(X_i - \theta)\})^2} > 0.$$
Thus, the left-hand side of (*) is a strictly increasing function of $\theta$. Finally, the left-hand side of (*) approaches 0 as $\theta \to -\infty$ and approaches $n$ as $\theta \to \infty$. Thus, equation (*) has a unique solution. Also, the second derivative of $l(\theta)$ is strictly negative for all $\theta$, so the solution is a maximum.

How do we find the maximum likelihood estimate that is the solution to (*)? Newton's method is a numerical method for approximating solutions to equations. The method produces a sequence of values $\theta^{(0)}, \theta^{(1)}, \ldots$ that, under ideal conditions, converges to the MLE $\hat{\theta}_{MLE}$.

To motivate the method, we expand the derivative of the log likelihood around $\theta^{(j)}$:
$$0 = l'(\hat{\theta}_{MLE}) \approx l'(\theta^{(j)}) + (\hat{\theta}_{MLE} - \theta^{(j)})\, l''(\theta^{(j)}).$$
Solving for $\hat{\theta}_{MLE}$ gives
$$\hat{\theta}_{MLE} \approx \theta^{(j)} - \frac{l'(\theta^{(j)})}{l''(\theta^{(j)})}.$$
This suggests the following iterative scheme:
$$\theta^{(j+1)} = \theta^{(j)} - \frac{l'(\theta^{(j)})}{l''(\theta^{(j)})}.$$

The following is an R function that uses Newton's method to approximate the maximum likelihood estimate for a logistic distribution (here exp(-xvec+thetahatcurr) computes $\exp\{-(X_i - \theta)\}$):

mlelogisticfunc=function(xvec,toler=.001){
  startvalue=median(xvec);
  n=length(xvec);
  thetahatcurr=startvalue;
  # Compute first derivative of log likelihood
  firstderivll=n-2*sum(exp(-xvec+thetahatcurr)/(1+exp(-xvec+thetahatcurr)));
  # Continue Newton's method until the first derivative
  # of the log likelihood is within toler of 0
  while(abs(firstderivll)>toler){
    # Compute second derivative of log likelihood
    secondderivll=-2*sum(exp(-xvec+thetahatcurr)/(1+exp(-xvec+thetahatcurr))^2);
    # Newton's method update of estimate of theta
    thetahatnew=thetahatcurr-firstderivll/secondderivll;
    thetahatcurr=thetahatnew;
    # Compute first derivative of log likelihood at the updated estimate
    firstderivll=n-2*sum(exp(-xvec+thetahatcurr)/(1+exp(-xvec+thetahatcurr)));
  }
  list(thetahat=thetahatcurr);
}
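To see the function in action, here is a usage sketch (the simulated data and variable names are illustrative): draw a sample from a logistic distribution with known location using R's built-in rlogis, run Newton's method, and verify that the resulting estimate satisfies equation (*).

# Simulate 200 iid logistic draws with true location theta = 5
set.seed(512)
xvec <- rlogis(200, location=5, scale=1)
out <- mlelogisticfunc(xvec)
out$thetahat   # MLE; should be close to 5
median(xvec)   # the starting value used by the function
# Check that the estimate solves (*): this sum should be near n/2 = 100
sum(exp(-xvec+out$thetahat)/(1+exp(-xvec+out$thetahat)))

The median is a sensible starting value because it is itself a consistent estimate of the logistic location parameter, so Newton's method begins close to the solution of (*).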