Statistics 550 Notes 11

Reading: Section 2.2.

Take-home midterm: I will e-mail it to you by Saturday, October 14th. It will be due Wednesday, October 25th by 5 p.m.

I. Maximum Likelihood

The method of maximum likelihood is an approach for estimating parameters in a "parametric" model, i.e., a model in which the family of possible distributions $\{ p(x \mid \theta),\ \theta \in \Theta \}$ has a parameter space $\Theta$ that is a subset of $\mathbb{R}^d$ for some finite $d$.

Motivating Example: A box of Dunkin Donuts munchkins contains 12 munchkins. Each munchkin is either glazed or not glazed. Let $\theta$ denote the number of glazed donuts in the box. To gain some information on $\theta$, you are allowed to select five of the munchkins from the box randomly without replacement and view them. Let $X$ denote the number of glazed munchkins in the sample. Suppose $X = 3$ of the munchkins in the sample are glazed. How should we estimate $\theta$?

Probability model: Imagine that the munchkins are numbered 1-12. A sample of five donuts thus consists of five distinct numbers from 1 to 12. All $\binom{12}{5} = 792$ samples are equally likely. The distribution of $X$ is hypergeometric:
$$P_\theta(X = x) = \frac{\binom{\theta}{x}\binom{12-\theta}{5-x}}{\binom{12}{5}}.$$

The following table shows the probability distribution for $X$ for each possible value of $\theta$, the number of glazed munchkins originally in the box; columns give $X$, the number of glazed munchkins in the sample.

 θ     X=0     X=1     X=2     X=3     X=4     X=5
  0    1       0       0       0       0       0
  1    .5833   .4167   0       0       0       0
  2    .3182   .5303   .1515   0       0       0
  3    .1591   .4773   .3182   .0454   0       0
  4    .0707   .3535   .4243   .1414   .0101   0
  5    .0265   .2210   .4419   .2652   .0442   .0012
  6    .0076   .1136   .3788   .3788   .1136   .0076
  7    .0012   .0442   .2652   .4419   .2210   .0265
  8    0       .0101   .1414   .4243   .3535   .0707
  9    0       0       .0454   .3182   .4773   .1591
 10    0       0       0       .1515   .5303   .3182
 11    0       0       0       0       .4167   .5833
 12    0       0       0       0       0       1

Once we obtain the sample $X = 3$, what should we estimate $\theta$ to be?

It is not clear how to apply the method of moments. We have $E_\theta(X) = 5\theta/12$, but solving $5\hat\theta/12 - 3 = 0$ gives $\hat\theta = 7.2$, which is not in the parameter space.

Maximum likelihood approach: We know that it is impossible that $\theta = 0, 1, 2, 11$ or $12$. The set of possible values for $\theta$ once we observe $X = 3$ is $\theta = 3, 4, 5, 6, 7, 8, 9, 10$. Although both $\theta = 3$ and $\theta = 7$ are possible, the occurrence of $X = 3$ would be more "likely" if $\theta = 7$ [$P_7(X = 3) = .4419$] than if $\theta = 3$ [$P_3(X = 3) = .0454$]. Among $\theta = 3, 4, 5, 6, 7, 8, 9, 10$, the $\theta$ that makes the observed data $X = 3$ most "likely" is $\theta = 7$.

General definitions for the maximum likelihood estimator

The likelihood function is defined by $L_X(\theta) = p(X \mid \theta)$. The likelihood function is just the joint probability mass or probability density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $L_X : \Theta \to [0, \infty)$. The likelihood function is not a probability mass function or a probability density function: in general, it is not true that $L_X(\theta)$ integrates to 1 with respect to $\theta$. In the motivating example, for $X = 3$, $\sum_{\theta} L_{X=3}(\theta) = 2.167$.

The maximum likelihood estimator (the MLE), denoted by $\hat\theta_{MLE}$, is the value of $\theta$ that maximizes the likelihood:
$$\hat\theta_{MLE} = \arg\max_\theta L_x(\theta).$$
For the motivating example, $\hat\theta_{MLE} = 7$.

Intuitively, the MLE is a reasonable choice for an estimator: the MLE is the parameter point for which the observed sample is most likely.

Equivalently, with the log likelihood function $l_x(\theta) = \log p(x \mid \theta)$,
$$\hat\theta_{MLE} = \arg\max_\theta l_x(\theta).$$

Example 2: Poisson distribution. Suppose $X_1, \ldots, X_n$ are iid Poisson($\theta$).
$$l_x(\theta) = \log \prod_{i=1}^n \frac{e^{-\theta}\theta^{X_i}}{X_i!} = -n\theta + \left(\sum_{i=1}^n X_i\right)\log\theta - \log\prod_{i=1}^n X_i!$$
To maximize the log likelihood, we set the first derivative of the log likelihood equal to zero,
$$l'(\theta) = \frac{1}{\theta}\sum_{i=1}^n X_i - n = 0.$$
$\bar X$ is the unique solution to this equation.
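Before confirming this analytically, here is a minimal numerical sketch (not part of the original notes) that maximizes the Poisson log likelihood for some made-up counts and checks that the maximizer agrees with the sample mean. It assumes NumPy and SciPy are available; the data vector is hypothetical.

```python
# Sketch: numerical check that the sample mean maximizes the Poisson log likelihood
# (illustration only; the counts below are made up).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

x = np.array([2, 0, 3, 1, 4, 2, 1])   # hypothetical iid Poisson counts

def neg_log_lik(theta):
    # negative of l_x(theta) = sum_i log p(X_i | theta)
    return -np.sum(poisson.logpmf(x, theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
print(res.x, x.mean())   # numerical maximizer agrees with the sample mean (~1.857)
```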
To confirm analytically that $\bar X$ in fact maximizes $l(\theta)$, we can use the second derivative test:
$$l''(\theta) = -\frac{1}{\theta^2}\sum_{i=1}^n X_i,$$
so $l''(\bar X) < 0$ as long as $\sum_{i=1}^n X_i > 0$, and $\bar X$ in fact maximizes $l_x(\theta)$. When $\sum_{i=1}^n X_i = 0$, it can be seen by inspection that $\hat\theta = 0$ maximizes $l_x(\theta)$.

Example 3: Suppose $X_1, \ldots, X_n$ are iid Uniform($0, \theta$].
$$L_x(\theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{if } \max_i X_i \le \theta \\[4pt] 0 & \text{if } \max_i X_i > \theta \end{cases}$$
Thus, $\hat\theta_{MLE} = \max_i X_i$.

Recall that the method of moments estimator is $2\bar X$. In Notes 4, we showed that $\max_i X_i$ dominates $2\bar X$ for the squared error loss function (although $\max_i X_i$ is itself dominated by $\frac{n+1}{n}\max_i X_i$).

Key valuable features of maximum likelihood estimators:

1. The MLE is consistent.

2. The MLE is asymptotically normal: $\dfrac{\hat\theta_{MLE} - \theta}{SE(\hat\theta_{MLE})}$ converges in distribution to a standard normal distribution for a one-dimensional parameter.

3. The MLE is asymptotically optimal: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance for large samples.

Motivation for maximum likelihood as a minimum contrast estimate:

The Kullback-Leibler distance (information divergence) between two density functions $g$ and $f$ for a random variable $X$ that have the same support is
$$K(g, f) = E_f\left[\log\frac{f(X)}{g(X)}\right] = \int \log\left[\frac{f(x)}{g(x)}\right] f(x)\,dx.$$
Note that by Jensen's inequality,
$$E_f\left[\log\frac{f(X)}{g(X)}\right] = E_f\left[-\log\frac{g(X)}{f(X)}\right] \ge -\log\left\{E_f\left[\frac{g(X)}{f(X)}\right]\right\} = 0,$$
where the inequality is strict if $f \ne g$ since $-\log$ is a strictly convex function. Also note that $K(f, f) = 0$. Thus, the Kullback-Leibler distance between $g$ and a fixed $f$ is minimized at $g = f$.

Suppose the family of models $\{p(x \mid \theta), \theta \in \Theta\}$ has the same support for each $\theta$ and that $\theta$ is identifiable. Consider the function $\rho(x, \theta) = -l_x(\theta)$. The discrepancy for this function is
$$D(\theta_0, \theta) = E_{\theta_0}[-\log p(x \mid \theta)] = K\big(p(x \mid \theta_0), p(x \mid \theta)\big) - E_{\theta_0}[\log p(x \mid \theta_0)].$$
By the results of the above paragraph, $\theta_0 = \arg\min_\theta D(\theta_0, \theta)$, so that $\rho(x, \theta) = -l_x(\theta)$ is a valid contrast function. The minimum contrast estimator associated with the contrast function $\rho(x, \theta) = -l_x(\theta)$ is
$$\arg\min_\theta \left[-l_x(\theta)\right] = \arg\max_\theta l_x(\theta) = \hat\theta_{MLE}.$$
Thus, the maximum likelihood estimator is a minimum contrast estimator for a contrast that is based on the Kullback-Leibler distance.

Consistency of maximum likelihood estimates:

A basic desirable property of estimators is that they are consistent, i.e., converge to the true parameter when there is a "large" amount of data. The maximum likelihood estimator is generally, although not always, consistent. We prove a special case of consistency here.

Theorem: Consider the model in which $X_1, \ldots, X_n$ are iid with pmf or pdf $p(X_i \mid \theta)$, $\theta \in \Theta$. Suppose (a) the parameter space $\Theta$ is finite; (b) $\theta$ is identifiable; and (c) the $p(X_i \mid \theta)$ have common support for all $\theta$. Then the maximum likelihood estimator $\hat\theta_{MLE}$ is consistent as $n \to \infty$.

Proof: Let $\theta_0$ denote the true parameter. First, we show that for each $\theta \ne \theta_0$,
$$P_{\theta_0}\big(l_x(\theta_0) > l_x(\theta)\big) \to 1 \text{ as } n \to \infty. \quad (1.1)$$
The inequality $l_x(\theta_0) > l_x(\theta)$ is equivalent to
$$\frac{1}{n}\sum_{i=1}^n \log\frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} < 0.$$
By the law of large numbers, the left side tends in probability toward
$$E_{\theta_0}\left[\log\frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)}\right].$$
Since $-\log$ is strictly convex, Jensen's inequality shows that
$$E_{\theta_0}\left[\log\frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)}\right] < \log E_{\theta_0}\left[\frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)}\right] = 0,$$
and (1.1) follows.

For a finite parameter space, $\hat\theta_{MLE}$ is consistent if and only if $P_{\theta_0}(\hat\theta_{MLE} = \theta_0) \to 1$.

Denote the points other than $\theta_0$ in the finite parameter space by $\theta_1, \ldots, \theta_K$. Let $A_{jn}$ be the event that, for $n$ observations, $l_x(\theta_0) > l_x(\theta_j)$. The event $A_{1n} \cap \cdots \cap A_{Kn}$ is contained in the event that $\hat\theta_{MLE} = \theta_0$ for $n$ observations. By (1.1), $P_{\theta_0}(A_{jn}) \to 1$ as $n \to \infty$ for $j = 1, \ldots, K$. Consequently, since $K$ is finite, $P_{\theta_0}(A_{1n} \cap \cdots \cap A_{Kn}) \to 1$ as $n \to \infty$, and since $A_{1n} \cap \cdots \cap A_{Kn}$ is contained in the event $\hat\theta_{MLE} = \theta_0$, it follows that $P_{\theta_0}(\hat\theta_{MLE} = \theta_0) \to 1$ as $n \to \infty$.
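The theorem can be illustrated by simulation. The following is a minimal sketch (not part of the original notes), assuming NumPy, with a Bernoulli model and a hypothetical three-point parameter space: the fraction of replications in which the MLE equals the true parameter approaches 1 as $n$ grows.

```python
# Sketch: consistency of the MLE over a finite parameter space (Bernoulli model).
import numpy as np

rng = np.random.default_rng(0)
param_space = np.array([0.2, 0.5, 0.8])   # hypothetical finite parameter space
theta0 = 0.5                              # true parameter
n_reps = 2000

def log_lik(theta, x):
    # Bernoulli log likelihood l_x(theta) = sum_i [x_i log theta + (1 - x_i) log(1 - theta)]
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

for n in [5, 20, 100]:
    hits = 0
    for _ in range(n_reps):
        x = rng.binomial(1, theta0, size=n)
        mle = param_space[np.argmax([log_lik(t, x) for t in param_space])]
        hits += (mle == theta0)
    print(n, hits / n_reps)   # empirical P(theta_hat = theta0) increases toward 1 with n
```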
For infinite parameter spaces, maximum likelihood can be shown to be consistent under conditions (b)-(c) of the theorem plus the following two assumptions: (1) the parameter space contains an open set of which the true parameter is an interior point (i.e., the true parameter is not on the boundary of the parameter space); (2) $p(x \mid \theta)$ is differentiable in $\theta$.

The consistency theorem assumes that the parameter space does not depend on the sample size. Maximum likelihood can be inconsistent when the number of parameters increases with the sample size, e.g., $X_1, \ldots, X_n$ independent normals with means $\mu_i$ and common variance $\sigma^2$: the MLE of $\sigma^2$ is inconsistent.
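To see where the inconsistency comes from, here is a short calculation added for illustration, under the reading that each mean $\mu_i$ is a separate free parameter and there is a single observation $X_i$ for each mean. For any fixed $\sigma^2$, the log likelihood
$$\sum_{i=1}^n \left[-\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(X_i - \mu_i)^2}{2\sigma^2}\right]$$
is maximized over the means by $\hat\mu_i = X_i$, which makes every fitted residual zero, so the maximizing value of the variance is driven to
$$\hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu_i)^2 = 0$$
for every $n$. The estimate never approaches the true $\sigma^2 > 0$, no matter how much data is collected, because each new observation brings a new mean parameter with it.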