Statistics 550 Notes 11
Reading: Section 2.2.
Take-home midterm: I will e-mail it to you by Saturday,
October 14th. It will be due Wednesday, October 25th by 5
p.m.
I. Maximum Likelihood
The method of maximum likelihood is an approach for estimating parameters in a "parametric" model, i.e., a model in which the family of possible distributions
$$\{ p(x \mid \theta),\ \theta \in \Theta \}$$
has a parameter space $\Theta$ that is a subset of $\mathbb{R}^d$ for some finite $d$.
Motivating Example: A box of Dunkin Donuts munchkins contains 12 munchkins. Each munchkin is either glazed or not glazed. Let $\theta$ denote the number of glazed munchkins in the box. To gain some information on $\theta$, you are allowed to select five of the munchkins from the box randomly without replacement and view them. Let $X$ denote the number of glazed munchkins in the sample. Suppose $X = 3$ of the munchkins in the sample are glazed. How should we estimate $\theta$?
Probability model: Imagine that the munchkins are numbered 1-12. A sample of five munchkins thus consists of five distinct numbers. All $\binom{12}{5} = 792$ samples are equally likely. The distribution of $X$ is hypergeometric:
$$P_\theta(X = x) = \frac{\binom{\theta}{x}\binom{12-\theta}{5-x}}{\binom{12}{5}}.$$
The following table shows the probability distribution of $X$ given $\theta$, for each possible value of $\theta$. Rows are indexed by $\theta$ (the number of glazed munchkins originally in the box); columns by $X$ (the number of glazed munchkins in the sample).

  θ     X=0     X=1     X=2     X=3     X=4     X=5
  0    1       0       0       0       0       0
  1     .5833   .4167  0       0       0       0
  2     .3182   .5303   .1515  0       0       0
  3     .1591   .4773   .3182   .0454  0       0
  4     .0707   .3535   .4243   .1414   .0101  0
  5     .0265   .2210   .4419   .2652   .0442   .0012
  6     .0076   .1136   .3788   .3788   .1136   .0076
  7     .0012   .0442   .2652   .4419   .2210   .0265
  8    0        .0101   .1414   .4243   .3535   .0707
  9    0       0        .0454   .3182   .4773   .1591
 10    0       0       0        .1515   .5303   .3182
 11    0       0       0       0        .4167   .5833
 12    0       0       0       0       0       1
Once we obtain the sample $X = 3$, what should we estimate $\theta$ to be?
It's not clear how to apply the method of moments. We have $E_\theta(X) = 5\theta/12$, but solving $5\hat\theta/12 - 3 = 0$ gives $\hat\theta = 7.2$, which is not in the parameter space.
Maximum likelihood approach: We know that it is impossible that $\theta = 0, 1, 2, 11$ or $12$. The set of possible values for $\theta$ once we observe $X = 3$ is $\theta = 3, 4, 5, 6, 7, 8, 9, 10$. Although both $\theta = 3$ and $\theta = 7$ are possible, the occurrence of $X = 3$ would be more "likely" if $\theta = 7$ [$P_{\theta=7}(X = 3) = .4419$] than if $\theta = 3$ [$P_{\theta=3}(X = 3) = .0454$]. Among $\theta = 3, 4, 5, 6, 7, 8, 9, 10$, the $\theta$ that makes the observed data $X = 3$ most "likely" is $\theta = 7$.
General definitions for the maximum likelihood estimator
The likelihood function is defined by $L_X(\theta) = p(X \mid \theta)$. The likelihood function is just the joint probability mass or probability density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $L_X : \Theta \to [0, \infty)$. The likelihood function is not a probability mass function or a probability density function: in general, it is not true that $L_X(\theta)$ integrates (or sums) to 1 with respect to $\theta$. In the motivating example, for $X = 3$, $\sum_{\theta \in \Theta} L_{X=3}(\theta) = 2.167$.
The maximum likelihood estimator (the MLE), denoted by $\hat\theta_{MLE}$, is the value of $\theta$ that maximizes the likelihood:
$$\hat\theta_{MLE} = \arg\max_\theta L_x(\theta).$$
For the motivating example, $\hat\theta_{MLE} = 7$.
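For concreteness, here is a small Python sketch (the helper name hypergeom_pmf is just an illustrative choice) that evaluates the likelihood $L_{X=3}(\theta)$ at each $\theta \in \{0, 1, \ldots, 12\}$ and picks out the maximizer; it reproduces the $X = 3$ column of the table above.

    from math import comb

    def hypergeom_pmf(x, theta, N=12, n=5):
        # P_theta(X = x) when n munchkins are sampled without replacement
        # from a box of N munchkins, theta of which are glazed.
        # math.comb(a, b) returns 0 when b > a, which handles the impossible cases.
        return comb(theta, x) * comb(N - theta, n - x) / comb(N, n)

    # Likelihood function for the observed data X = 3, evaluated at every theta in {0, ..., 12}
    likelihood = {theta: hypergeom_pmf(3, theta) for theta in range(13)}

    theta_mle = max(likelihood, key=likelihood.get)
    print(theta_mle)                           # 7
    print(round(likelihood[7], 4))             # 0.4419, as in the table
    print(round(sum(likelihood.values()), 3))  # 2.167: the likelihood does not sum to 1 over theta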
Intuitively, the MLE is a reasonable choice for an
estimator. The MLE is the parameter point for which the
observed sample is most likely.
Equivalently, the log likelihood function is $l_x(\theta) = \log p(x \mid \theta)$ and
$$\hat\theta_{MLE} = \arg\max_\theta l_x(\theta).$$
Example 2: Poisson distribution. Suppose $X_1, \ldots, X_n$ are iid Poisson($\lambda$). The log likelihood is
$$l_x(\lambda) = \sum_{i=1}^n \log \frac{e^{-\lambda} \lambda^{X_i}}{X_i!} = -n\lambda + \left(\sum_{i=1}^n X_i\right) \log \lambda - \sum_{i=1}^n \log X_i!$$
To maximize the log likelihood, we set the first derivative of the log likelihood equal to zero:
$$l'(\lambda) = \frac{1}{\lambda} \sum_{i=1}^n X_i - n = 0.$$
$\bar X$ is the unique solution to this equation. To confirm that $\bar X$ in fact maximizes $l(\lambda)$, we can use the second derivative test:
$$l''(\lambda) = -\frac{1}{\lambda^2} \sum_{i=1}^n X_i.$$
$l''(\bar X) < 0$ as long as $\sum_{i=1}^n X_i > 0$, so that $\bar X$ in fact maximizes $l(\lambda)$. When $\sum_{i=1}^n X_i = 0$, it can be seen by inspection that $\hat\lambda = 0$ maximizes $l_x(\lambda)$.
Example 3: Suppose $X_1, \ldots, X_n$ are iid Uniform$(0, \theta]$.
$$L_x(\theta) = \begin{cases} 0 & \text{if } \max_i X_i > \theta \\ \dfrac{1}{\theta^n} & \text{if } \max_i X_i \le \theta \end{cases}$$
Thus, $\hat\theta_{MLE} = \max_i X_i$.
Recall that the method of moments estimator is $2\bar X$. In Notes 4, we showed that $\max_i X_i$ dominates $2\bar X$ for the squared error loss function (although $\max_i X_i$ is itself dominated by $\frac{n+1}{n}\max_i X_i$).
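This domination can also be checked by simulation. The following Python sketch (the choices of $\theta$, $n$, and the number of replications are arbitrary) estimates the mean squared error of $2\bar X$, $\max_i X_i$, and $\frac{n+1}{n}\max_i X_i$.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 1.0, 10, 100_000

    x = rng.uniform(0, theta, size=(reps, n))   # reps samples of size n from Uniform(0, theta)
    mom = 2 * x.mean(axis=1)                    # method of moments: 2 * sample mean
    mle = x.max(axis=1)                         # MLE: sample maximum
    adj = (n + 1) / n * mle                     # rescaled maximum

    for name, est in [("2*Xbar", mom), ("max", mle), ("(n+1)/n*max", adj)]:
        print(name, np.mean((est - theta) ** 2))  # estimated MSE: smallest for (n+1)/n*max, largest for 2*Xbar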
Key valuable features of maximum likelihood estimators:
1. The MLE is consistent.
2. The MLE is asymptotically normal: $\dfrac{\hat\theta_{MLE} - \theta}{SE(\hat\theta_{MLE})}$ converges in distribution to a standard normal distribution for a one-dimensional parameter (a small simulation sketch follows this list).
3. The MLE is asymptotically optimal: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance for large samples.
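Here is the simulation sketch for property 2, using the Poisson example from above with the plug-in standard error $\sqrt{\hat\lambda/n}$ (the values of $\lambda$, $n$, and the number of replications are arbitrary illustrative choices): the standardized MLE should look approximately standard normal.

    import numpy as np

    rng = np.random.default_rng(1)
    lam, n, reps = 3.0, 200, 50_000

    x = rng.poisson(lam, size=(reps, n))
    lam_hat = x.mean(axis=1)                   # MLE of lambda in each replication
    se = np.sqrt(lam_hat / n)                  # estimated standard error of the MLE
    z = (lam_hat - lam) / se                   # standardized MLE

    print(z.mean(), z.std())                   # approximately 0 and 1, consistent with a standard normal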
Motivation for maximum likelihood as a minimum contrast
estimate:
The Kullback-Leibler distance (information divergence) between two density functions $g$ and $f$ for a random variable $X$ that have the same support is
$$K(g, f) = E_f\!\left[\log\frac{f(X)}{g(X)}\right] = \int \log\!\left[\frac{f(x)}{g(x)}\right] f(x)\, dx.$$
Note that by Jensen's inequality
$$E_f\!\left[\log\frac{f(X)}{g(X)}\right] = E_f\!\left[-\log\frac{g(X)}{f(X)}\right] \ge -\log\left\{E_f\!\left[\frac{g(X)}{f(X)}\right]\right\} = -\log \int g(x)\, dx = 0,$$
where the inequality is strict if $f \ne g$ since $-\log$ is a strictly convex function. Also note that $K(f, f) = 0$. Thus, the Kullback-Leibler distance between $g$ and a fixed $f$ is minimized at $g = f$.
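For concreteness, the following Python sketch (with two arbitrary pmfs on a common three-point support) illustrates that $K(g, f) \ge 0$, with equality when $g = f$.

    import numpy as np

    def kl(g, f):
        # Kullback-Leibler distance K(g, f) = sum over x of f(x) * log(f(x)/g(x))
        # for two pmfs with the same finite support
        g, f = np.asarray(g, float), np.asarray(f, float)
        return float(np.sum(f * np.log(f / g)))

    f = [0.2, 0.5, 0.3]          # "true" pmf
    g = [0.4, 0.4, 0.2]          # another pmf with the same support

    print(kl(g, f))              # strictly positive, since g differs from f
    print(kl(f, f))              # 0.0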
Suppose the family of models has the same support for each $\theta \in \Theta$ and that $\theta$ is identifiable. Consider the function $\rho(x, \theta) = -l_x(\theta)$. The discrepancy for this function is
$$D(\theta_0, \theta) = E_{\theta_0}[-\log p(X \mid \theta)] = K\bigl(p(x \mid \theta), p(x \mid \theta_0)\bigr) - E_{\theta_0}[\log p(X \mid \theta_0)].$$
By the results of the above paragraph, $\arg\min_\theta D(\theta_0, \theta) = \theta_0$, so that $\rho(x, \theta) = -l_x(\theta)$ is a valid contrast function. The minimum contrast estimator associated with the contrast function $\rho(x, \theta) = -l_x(\theta)$ is
$$\arg\min_\theta \bigl[-l_x(\theta)\bigr] = \arg\max_\theta l_x(\theta) = \hat\theta_{MLE}.$$
Thus, the maximum likelihood estimator is a minimum
contrast estimator for a contrast that is based on the
Kullback-Leibler distance.
Consistency of maximum likelihood estimates:
A basic desirable property of estimators is that they are
consistent, i.e., converge to the true parameter when there
is a “large” amount of data. The maximum likelihood
estimator is generally, although not always, consistent. We
prove a special case of consistency here.
Theorem: Consider the model in which $X_1, \ldots, X_n$ are iid with pmf or pdf
$$\{ p(X_i \mid \theta),\ \theta \in \Theta \}.$$
Suppose (a) the parameter space $\Theta$ is finite; (b) $\theta$ is identifiable; and (c) the $p(X_i \mid \theta)$ have common support for all $\theta \in \Theta$. Then the maximum likelihood estimator $\hat\theta_{MLE}$ is consistent as $n \to \infty$.
Proof: Let $\theta_0$ denote the true parameter. First, we show that, for any fixed $\theta \ne \theta_0$,
$$P_{\theta_0}\bigl(l_x(\theta_0) > l_x(\theta)\bigr) \to 1 \text{ as } n \to \infty. \qquad (1.1)$$
The inequality $l_x(\theta_0) > l_x(\theta)$ is equivalent to
$$\frac{1}{n}\sum_{i=1}^n \log\!\left(\frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)}\right) < 0.$$
By the law of large numbers, the left side tends in probability toward
$$E_{\theta_0}\!\left[\log\!\left(\frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)}\right)\right].$$
Since $-\log$ is strictly convex, Jensen's inequality shows that
$$E_{\theta_0}\!\left[\log\!\left(\frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)}\right)\right] < \log E_{\theta_0}\!\left[\frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)}\right] = \log 1 = 0,$$
and (1.1) follows.
For a finite parameter space, $\hat\theta_{MLE}$ is consistent if and only if $P_{\theta_0}(\hat\theta_{MLE} = \theta_0) \to 1$.
Denote the points other than $\theta_0$ in the finite parameter space by $\theta_1, \ldots, \theta_K$. Let $A_{jn}$ be the event that, for $n$ observations, $l_x(\theta_0) > l_x(\theta_j)$. The event $A_{1n} \cap \cdots \cap A_{Kn}$ is contained in the event $\{\hat\theta_{MLE} = \theta_0\}$ for $n$ observations. By (1.1), $P(A_{jn}) \to 1$ as $n \to \infty$ for $j = 1, \ldots, K$. Consequently, $P(A_{1n} \cap \cdots \cap A_{Kn}) \to 1$ as $n \to \infty$, and since $A_{1n} \cap \cdots \cap A_{Kn}$ is contained in the event $\{\hat\theta_{MLE} = \theta_0\}$, $P_{\theta_0}(\hat\theta_{MLE} = \theta_0) \to 1$ as $n \to \infty$.
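The convergence in (1.1) can be illustrated by simulation. The following Python sketch uses Poisson data with true mean $\theta_0 = 2$ and a single alternative value $\theta = 3$ (arbitrary illustrative choices) and estimates $P_{\theta_0}(l_x(\theta_0) > l_x(\theta))$ for increasing $n$.

    import numpy as np

    rng = np.random.default_rng(2)
    theta0, theta1, reps = 2.0, 3.0, 10_000    # true and alternative Poisson means

    def loglik(theta, x):
        # Poisson log likelihood up to the additive term -sum(log x_i!),
        # which cancels when comparing two values of theta
        return -x.shape[-1] * theta + x.sum(axis=-1) * np.log(theta)

    for n in [1, 5, 20, 100]:
        x = rng.poisson(theta0, size=(reps, n))
        prob = np.mean(loglik(theta0, x) > loglik(theta1, x))
        print(n, prob)                         # estimated P(l_x(theta0) > l_x(theta1)), increasing toward 1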
For infinite parameter spaces, the maximum likelihood estimator can be shown to be consistent under conditions (b)-(c) of the theorem plus the following two assumptions: (1) the parameter space contains an open set of which the true parameter is an interior point (i.e., the true parameter is not on the boundary of the parameter space); (2) $p(x \mid \theta)$ is differentiable in $\theta$.
The consistency theorem assumes that the parameter space does not depend on the sample size. Maximum likelihood can be inconsistent when the number of parameters increases with the sample size, e.g., $X_1, \ldots, X_n$ independent normals with mean $\mu_i$ and variance $\sigma^2$: the MLE of $\sigma^2$ is inconsistent (here the MLE sets $\hat\mu_i = X_i$, so $\hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu_i)^2 = 0$ for every $n$).