Stat 550 Notes 10

Reading: Chapter 3.4.2

We will give an outline of a proof of the asymptotic optimality of the MLE:

Theorem: If $\hat{\theta}_n$ is the MLE and $\tilde{\theta}_n$ is any other estimator, then $ARE(\tilde{\theta}_n, \hat{\theta}_n) \le 1$.

As a tool for proving this fact, we develop a lower bound on the variance of an unbiased estimator. This is a fundamental result in mathematical statistics that shows the best that is achievable by a certain type of estimator.

I. The Information Inequality

The information inequality (sometimes called the Cramér-Rao lower bound) provides a lower bound on the variance of an unbiased estimator.

We will focus on a one-parameter model: the data $X$ is generated from $p(X|\theta)$, where $\theta$ is an unknown parameter and $\theta \in \Theta$. We make two "regularity" assumptions on the model $\{p(X|\theta) : \theta \in \Theta\}$:

(I) The support of $p(X|\theta)$, $A = \{x : p(x|\theta) > 0\}$, does not depend on $\theta$. Also, for all $x \in A$ and $\theta \in \Theta$, $\frac{\partial}{\partial\theta}\log p(x|\theta)$ exists and is finite.

(II) If $T$ is any statistic such that $E_\theta(|T|) < \infty$ for all $\theta \in \Theta$, then the operations of integration and differentiation by $\theta$ can be interchanged in $\int T(x)p(x|\theta)\,dx$. That is,

$$\frac{d}{d\theta} E_\theta[T(X)] = \frac{d}{d\theta}\int T(x)p(x|\theta)\,dx = \int T(x)\frac{\partial}{\partial\theta}p(x|\theta)\,dx \qquad (0.1)$$

whenever the right-hand side of (0.1) is finite.

Assumption II is not useful as written – what is needed are simple sufficient conditions on $p(x|\theta)$ for II to hold. Classical conditions may be found in an analysis book such as Rudin, Principles of Mathematical Analysis, pp. 236-237. Assumptions I and II generally hold for a one-parameter exponential family.

Proposition 3.4.1: If $p(x|\theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}$ is an exponential family and $\eta(\theta)$ has a nonvanishing continuous derivative on $\Theta$, then Assumptions I and II hold.

Recall the concept of Fisher information. For a model $\{p(X|\theta) : \theta \in \Theta\}$ and a value of $\theta$, the Fisher information number $I(\theta)$ is defined as

$$I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p(X|\theta)\right)^2\right].$$

The Fisher information can be thought of as a measure of how fast, on average, the likelihood is changing as $\theta$ changes – the faster the likelihood is changing (i.e., the higher the information), the easier it is to estimate $\theta$.

Recall Lemma 1 of Notes 9.

Lemma 1: Suppose Assumptions I and II hold and that $E_\theta\left|\frac{\partial}{\partial\theta}\log p(X|\theta)\right| < \infty$. Then $E_\theta\left[\frac{\partial}{\partial\theta}\log p(X|\theta)\right] = 0$ and thus

$$I(\theta) = Var_\theta\left(\frac{\partial}{\partial\theta}\log p(X|\theta)\right).$$

The information (Cramér-Rao) inequality provides a lower bound on the variance that an estimator can achieve, based on the Fisher information number of the model.

Theorem 3.4.1 (Information Inequality): Let $T(X)$ be any statistic such that $Var_\theta(T(X)) < \infty$ for all $\theta \in \Theta$. Denote $E_\theta(T(X))$ by $\psi(\theta)$. Suppose that Assumptions I and II hold and $0 < I(\theta) < \infty$. Then for all $\theta$, $\psi(\theta)$ is differentiable and

$$Var_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}.$$

The application of the Information Inequality to unbiased estimators is

Corollary 3.4.1: Suppose the conditions of Theorem 3.4.1 hold and $T(X)$ is an unbiased estimate of $\theta$. Then

$$Var_\theta(T(X)) \ge \frac{1}{I(\theta)}.$$

This corollary holds because for an unbiased estimator, $\psi(\theta) = \theta$, so that $\psi'(\theta) = 1$.
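As a quick numerical check of Lemma 1 and Corollary 3.4.1, here is a minimal Monte Carlo sketch in Python. The Bernoulli model and all numerical settings are illustrative choices (not from the notes): for a single Bernoulli($\theta$) observation, $I(\theta) = 1/(\theta(1-\theta))$, and the unbiased estimator $T(X) = X$ has variance $\theta(1-\theta) = 1/I(\theta)$, so it attains the lower bound.

```python
# Minimal Monte Carlo sketch: check Lemma 1 and Corollary 3.4.1 for a single
# Bernoulli(theta) observation, where p(x|theta) = theta^x (1-theta)^(1-x),
# the score is x/theta - (1-x)/(1-theta), and I(theta) = 1/(theta*(1-theta)).
# The Bernoulli model and all numerical settings here are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
n_sims = 200_000

x = rng.binomial(1, theta, size=n_sims)        # draws from p(x | theta)
score = x / theta - (1 - x) / (1 - theta)      # d/dtheta log p(x | theta)

print("mean of score (Lemma 1: should be ~0):", score.mean())
print("E[score^2] by simulation:             ", np.mean(score**2))
print("closed-form I(theta):                 ", 1 / (theta * (1 - theta)))

# T(X) = X is unbiased for theta; its variance theta*(1-theta) equals 1/I(theta),
# so this estimator attains the lower bound of Corollary 3.4.1.
print("Var(T) by simulation:                 ", x.var())
print("lower bound 1/I(theta):               ", theta * (1 - theta))
```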
Proof of Information Inequality: The proof of the theorem is a clever application of the Cauchy-Schwarz Inequality. Stated statistically, the Cauchy-Schwarz Inequality says that for any two random variables $X$ and $Y$,

$$[Cov(X,Y)]^2 \le Var(X)\,Var(Y)$$

(Bickel and Doksum, page 458). If we rearrange the inequality, we get a lower bound on the variance of $X$:

$$Var(X) \ge \frac{[Cov(X,Y)]^2}{Var(Y)}.$$

We choose $X$ to be the estimator $T(X)$ and $Y$ to be the quantity $\frac{d}{d\theta}\log f(X|\theta)$, and apply the Cauchy-Schwarz Inequality.

First, we compute $Cov_\theta\left(T(X), \frac{d}{d\theta}\log f(X|\theta)\right)$. We have, using Assumption II,

$$E_\theta\left[T(X)\frac{d}{d\theta}\log p(X|\theta)\right] = E_\theta\left[T(X)\frac{\frac{\partial}{\partial\theta}p(X|\theta)}{p(X|\theta)}\right] = \int T(x)\frac{\frac{\partial}{\partial\theta}p(x|\theta)}{p(x|\theta)}\,p(x|\theta)\,dx = \int T(x)\frac{\partial}{\partial\theta}p(x|\theta)\,dx = \frac{d}{d\theta}\int T(x)p(x|\theta)\,dx = \frac{d}{d\theta}E_\theta[T(X)] = \psi'(\theta).$$

From Lemma 1, $E_\theta\left[\frac{d}{d\theta}\log f(X|\theta)\right] = 0$, so we conclude that

$$Cov_\theta\left(T(X), \frac{d}{d\theta}\log f(X|\theta)\right) = \psi'(\theta).$$

Also from Lemma 1, we have $Var_\theta\left(\frac{d}{d\theta}\log f(X|\theta)\right) = I(\theta)$. Thus, applying the Cauchy-Schwarz inequality to $T(X)$ and $\frac{d}{d\theta}\log f(X|\theta)$, we conclude that

$$Var_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}.$$

II. The Information Inequality and Asymptotic Optimality of the MLE

Consider $X_1, \ldots, X_n$ iid from a distribution $p(X_i|\theta)$, $\theta \in \Theta$, which satisfies Assumptions I and II. Let $I_1(\theta)$ be the Fisher information for one observation $X_1$ alone:

$$I_1(\theta) = Var_\theta\left(\frac{d}{d\theta}\log p(X_1|\theta)\right).$$

Recall from Notes 9 that $I(\theta) = nI_1(\theta)$.

Theorem 2 of Notes 9 showed that the MLE is asymptotically normal: under "regularity conditions" (including Assumptions I and II),

$$\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{L} N\left(0, \frac{1}{I_1(\theta_0)}\right).$$

Thus, from Theorem 2, we have that for large $n$, $\hat{\theta}_{MLE}$ is approximately unbiased and has variance approximately $\frac{1}{nI_1(\theta)} = \frac{1}{I(\theta)}$. By the Information Inequality, the minimum variance of an unbiased estimator is $\frac{1}{I(\theta)}$. Thus the MLE approximately achieves the lower bound of the Information Inequality. This suggests that for large $n$, among all consistent estimators (which are approximately unbiased for large $n$), the MLE achieves approximately the lowest variance and is hence asymptotically optimal. There may be other estimators that perform as well as the MLE asymptotically, but no estimator is better.

Note: Making precise the sense in which the MLE is asymptotically optimal took many years of brilliant work by Lucien Le Cam and other mathematical statisticians.
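The asymptotic efficiency claim can be illustrated numerically. The following is a minimal Python sketch; the Exponential model and all settings are illustrative choices, not from the notes. For $X_1, \ldots, X_n$ iid Exponential with rate $\theta$, the MLE is $1/\bar{X}$ and $I_1(\theta) = 1/\theta^2$, so for large $n$ we expect $n \cdot Var(\hat{\theta}_{MLE}) \approx 1/I_1(\theta) = \theta^2$.

```python
# Minimal simulation sketch: for X_1,...,X_n iid Exponential with rate theta,
# the MLE is 1/Xbar and I_1(theta) = 1/theta^2, so by the asymptotic normality
# of the MLE we expect n * Var(MLE) to be close to 1/I_1(theta) = theta^2.
# The model choice and all numerical settings here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
theta, n, n_reps = 2.0, 500, 20_000

samples = rng.exponential(scale=1 / theta, size=(n_reps, n))
mle = 1 / samples.mean(axis=1)        # MLE of the rate in each replication

print("simulated n * Var(MLE):", n * mle.var())
print("1/I_1(theta) = theta^2:", theta**2)
print("simulated mean of MLE: ", mle.mean(), "(target:", theta, ")")
```

The agreement improves as $n$ grows; for small $n$ the MLE $1/\bar{X}$ is biased upward, which is consistent with the optimality claim being only asymptotic.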
III. Application of the Information Inequality to Best Unbiased Estimation

Before returning to the MLE, we provide another application of the information inequality. Consider the point estimation problem of estimating $g(\theta)$ when the data $X$ is generated from $p(X|\theta)$, where $\theta$ is an unknown parameter and $\theta \in \Theta$.

A fundamental problem in choosing a point estimator (more generally, a decision procedure) is that generally no procedure dominates all other procedures. The MLE is asymptotically optimal but not necessarily optimal for a given sample size. Two global approaches we have considered for choosing the estimator with the best risk function are (1) Bayes – minimize a weighted average of the risk; (2) minimax – minimize the worst-case risk. Another approach is to restrict the class of possible estimators and look for a procedure within the restricted class that dominates all others.

The most commonly used restricted class of estimators is the class of unbiased estimators:

$$U = \{\delta(X) : E_\theta[\delta(X)] = g(\theta) \text{ for all } \theta \in \Theta\}.$$

Under squared error loss, the risk of an unbiased estimator is its variance.

Uniformly minimum variance unbiased (UMVU) estimator: An estimator $\delta^*(X)$ which has the minimum variance among all unbiased estimators for all $\theta$, i.e.,

$$Var_\theta(\delta^*(X)) \le Var_\theta(\delta(X)) \quad \text{for all } \theta \in \Theta \text{ and all } \delta(X) \in U.$$

A UMVU estimator is at least as good as all other unbiased estimators under squared error loss. A UMVU estimator might or might not exist.

Application of the Information Inequality to find a UMVU estimator for the Poisson model: Consider $X_1, \ldots, X_n$ iid Poisson($\theta$) with parameter space $\Theta = \{\theta : \theta > 0\}$. This is a one-parameter exponential family,

$$p(X|\theta) = \exp\left(-n\theta + \log\theta\sum_{i=1}^n X_i - \sum_{i=1}^n\log X_i!\right).$$

We have $\eta(\theta) = \log\theta$, so $\eta'(\theta) = \frac{1}{\theta}$, which is greater than zero over the whole parameter space. Thus, by Proposition 3.4.1, Assumptions I and II hold for this model.

The Fisher information number is

$$I(\theta) = Var_\theta\left(\frac{d}{d\theta}\log p(X|\theta)\right) = Var_\theta\left(\frac{d}{d\theta}\left[-n\theta + \log\theta\sum_{i=1}^n X_i - \sum_{i=1}^n\log X_i!\right]\right) = Var_\theta\left(-n + \frac{1}{\theta}\sum_{i=1}^n X_i\right) = \frac{1}{\theta^2}\,n\,Var_\theta(X_1) = \frac{n}{\theta}.$$

Thus, by the Information Inequality, the lower bound on the variance of an unbiased estimator of $\theta$ is $\frac{1}{I(\theta)} = \frac{\theta}{n}$. The unbiased estimator $\bar{X}$ has $Var_\theta(\bar{X}) = \frac{\theta}{n}$. Thus, $\bar{X}$ achieves the Information Inequality lower bound and is hence a UMVU estimator.

Comment: While the Information Inequality can be used to establish that an estimator is UMVU for certain models, failure of an estimator to achieve the lower bound does not necessarily mean that the estimator is not UMVU for a model. There are some models for which no unbiased estimator achieves the lower bound.
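As a numerical check of the Poisson calculation above, here is a minimal Python sketch; the particular values of $\theta$, $n$, and the number of replications are illustrative choices. It compares the simulated variance of $\bar{X}$ with the information bound $\theta/n$.

```python
# Minimal Monte Carlo sketch: for n iid Poisson(theta) observations,
# I(theta) = n/theta, and the unbiased estimator Xbar has variance
# theta/n = 1/I(theta), attaining the information inequality lower bound.
# The numerical settings here are illustrative.
import numpy as np

rng = np.random.default_rng(2)
theta, n, n_reps = 3.0, 50, 100_000

xbar = rng.poisson(theta, size=(n_reps, n)).mean(axis=1)

print("simulated Var(Xbar):    ", xbar.var())
print("lower bound theta/n:    ", theta / n)
print("simulated mean of Xbar: ", xbar.mean(), "(target:", theta, ")")
```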