Statistics 550 Notes 12
Reading: Section 2.2.
I. Maximum Likelihood Properties
Key valuable features of maximum likelihood estimators:
1. The MLE is consistent.
2. The MLE is asymptotically normal:
$(\hat{\theta}_{MLE} - \theta)\,/\,SE(\hat{\theta}_{MLE})$ converges in distribution to a standard normal distribution for a one-dimensional parameter.
3. The MLE is asymptotically optimal: roughly, this means
that among all well-behaved estimators, the MLE has the
smallest variance for large samples.
Consistency of maximum likelihood estimates:
Theorem 1: Consider the model $X_1, \ldots, X_n$ iid with pmf or pdf $\{p(X_i \mid \theta),\ \theta \in \Theta\}$. Suppose (a) the parameter space $\Theta$ is finite; (b) $\theta$ is identifiable; and (c) the $p(X_i \mid \theta)$ have common support for all $\theta \in \Theta$. Then the maximum likelihood estimator $\hat{\theta}_{MLE}$ is consistent as $n \to \infty$.
Outline of Proof (see Notes 11 for full proof): Let $\theta_0$ denote the true parameter. We first show that for $\theta \neq \theta_0$,
$$P_{\theta_0}\left(l_x(\theta_0) > l_x(\theta)\right) = P_{\theta_0}\left(\frac{1}{n}\sum_{i=1}^{n} \log \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} < 0\right) \to 1 \text{ as } n \to \infty \qquad (1.1)$$
using Jensen's inequality ($E_{\theta_0}\left[\log \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)}\right] < 0$ for $\theta \neq \theta_0$) and the law of large numbers. Denote the points other than $\theta_0$ in the finite parameter space by $\theta_1, \ldots, \theta_K$. Let $A_{jn}$ be the event that, for $n$ observations, $l_x(\theta_0) > l_x(\theta_j)$. The event $\hat{\theta}_{MLE} = \theta_0$ for $n$ observations contains the event $A_{1n} \cap \cdots \cap A_{Kn}$. By (1.1), $P(A_{jn}) \to 1$ as $n \to \infty$ for $j = 1, \ldots, K$. Consequently, $P(A_{1n} \cap \cdots \cap A_{Kn}) \to 1$ as $n \to \infty$, and thus $P_{\theta_0}(\hat{\theta}_{MLE} = \theta_0) \to 1$ as $n \to \infty$.
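As a rough illustration of this argument (a sketch added to these notes, assuming numpy and scipy are available), the following Python code simulates a Poisson model whose mean is restricted to a finite parameter space; the candidate values, true value, and sample sizes are arbitrary illustrative choices.

import numpy as np
from scipy.stats import poisson

# Illustrative setup: finite parameter space for a Poisson mean.
rng = np.random.default_rng(0)
Theta = np.array([0.5, 1.0, 2.0, 4.0])   # finite parameter space
theta0 = 2.0                             # true parameter

def mle_over_finite_space(x):
    # evaluate the log likelihood l_x(theta) at each candidate and take the argmax
    loglik = [poisson.logpmf(x, th).sum() for th in Theta]
    return Theta[np.argmax(loglik)]

for n in [5, 50, 500]:
    hits = sum(mle_over_finite_space(rng.poisson(theta0, n)) == theta0
               for _ in range(2000))
    print(n, hits / 2000)   # the proportion of samples with MLE = theta0 approaches 1 as n grows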
Comments on Consistency:
(1) For infinite parameter spaces, the maximum likelihood estimator can be shown to be consistent under conditions (b)-(c) of the theorem plus the following two assumptions: (i) the parameter space contains an open set of which the true parameter is an interior point (i.e., the true parameter is not on the boundary of the parameter space); (ii) $p(x \mid \theta)$ is differentiable in $\theta$.
(2) The consistency theorem assumes that the parameter space does not depend on the sample size. Maximum likelihood can be inconsistent when the number of parameters increases with the sample size, e.g., $X_1, \ldots, X_n$ independent normals with mean $\mu_i$ and variance $\sigma^2$. The MLE of $\sigma^2$ is inconsistent.
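To make this concrete, here is a simulation sketch (added to these notes) of a closely related version of the problem with two observations per mean $\mu_i$ (the Neyman-Scott setup); the parameter values are arbitrary. The MLE of $\sigma^2$ converges to $\sigma^2/2$ rather than $\sigma^2$.

import numpy as np

# Illustrative Neyman-Scott setup: two observations per mean mu_i, common variance sigma^2;
# the number of mean parameters grows with the sample size.
rng = np.random.default_rng(1)
sigma2 = 4.0
for n in [100, 1000, 10000]:
    mu = rng.normal(0.0, 10.0, size=n)              # n nuisance means
    x = rng.normal(loc=mu[:, None], scale=np.sqrt(sigma2), size=(n, 2))
    xbar_i = x.mean(axis=1, keepdims=True)          # MLE of each mu_i
    sigma2_mle = np.mean((x - xbar_i) ** 2)         # MLE of sigma^2
    print(n, sigma2_mle)                            # approaches sigma2 / 2 = 2, not 4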
Asymptotic Normality of Maximum Likelihood Estimates:
The consistency of the MLE says that under regularity conditions, $\hat{\theta}_{MLE}$ will be close to the true $\theta_0$ with high probability. We now consider the distribution of a magnified difference between $\hat{\theta}_{MLE}$ and $\theta_0$; this provides more precise information about the distribution of $\hat{\theta}_{MLE}$.
Consider the model $X_1, \ldots, X_n$ iid with pmf or pdf $\{p(X_i \mid \theta),\ \theta \in \Theta\}$ (we assume $\theta$ is one dimensional for simplicity; the basic ideas carry over to multidimensional $\theta$).
The Fisher information number $I(\theta)$ is defined as
$$I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta} \log p(X \mid \theta)\right)^2\right].$$
Lemma 1: Under regularity conditions on the smoothness of $p(X \mid \theta)$ (see Note (3) below), we have
(i) $E_\theta\left[\frac{\partial}{\partial\theta} \log p(X \mid \theta)\right] = 0$, and
(ii) $I(\theta) = \mathrm{Var}_\theta\left[\frac{\partial}{\partial\theta} \log p(X \mid \theta)\right] = -E_\theta\left[\frac{\partial^2}{\partial\theta^2} \log p(X \mid \theta)\right]$.
Proof: First, we observe that since $\int p(x \mid \theta)\,dx = 1$ for all $\theta$, we have $\frac{\partial}{\partial\theta} \int p(x \mid \theta)\,dx = 0$. Combining this with the identity $\frac{\partial}{\partial\theta} p(x \mid \theta) = \left[\frac{\partial}{\partial\theta} \log p(x \mid \theta)\right] p(x \mid \theta)$, we have
$$0 = \frac{\partial}{\partial\theta} \int p(x \mid \theta)\,dx = \int \left[\frac{\partial}{\partial\theta} \log p(x \mid \theta)\right] p(x \mid \theta)\,dx = E_\theta\left[\frac{\partial}{\partial\theta} \log p(X \mid \theta)\right],$$
where we have interchanged differentiation and integration, which is justified under the regularity conditions; this proves (i). Taking second derivatives of $\int p(x \mid \theta)\,dx$, we have
$$0 = \frac{\partial^2}{\partial\theta^2} \int p(x \mid \theta)\,dx = \int \frac{\partial}{\partial\theta}\left\{\left[\frac{\partial}{\partial\theta} \log p(x \mid \theta)\right] p(x \mid \theta)\right\} dx = \int \left\{\left[\frac{\partial^2}{\partial\theta^2} \log p(x \mid \theta)\right] p(x \mid \theta) + \left[\frac{\partial}{\partial\theta} \log p(x \mid \theta)\right]^2 p(x \mid \theta)\right\} dx,$$
from which (ii) follows.
Example: For the Poisson distribution, $p(x \mid \theta) = \dfrac{e^{-\theta}\theta^x}{x!}$,
$$I(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta^2} \log p(X \mid \theta)\right] = -E_\theta\left[-\frac{X}{\theta^2}\right] = \frac{1}{\theta}.$$
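As a numerical check (a sketch added to these notes; the value $\theta = 3$ is an arbitrary choice), the two expressions for $I(\theta)$ in Lemma 1 can be verified by simulation for this Poisson example.

import numpy as np

# Numerical check: both forms of I(theta) from Lemma 1 are approximately 1/theta.
rng = np.random.default_rng(2)
theta = 3.0
x = rng.poisson(theta, size=1_000_000)

score = -1.0 + x / theta          # d/d theta of log p(x | theta)
second = -x / theta**2            # d^2/d theta^2 of log p(x | theta)

print(score.var())                # variance of the score, approx 1/theta
print(-second.mean())             # minus the mean second derivative, approx 1/theta
print(1 / theta)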
Theorem 2: Under "regularity conditions,"
$$\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{\;L\;} N\!\left(0, \frac{1}{I(\theta_0)}\right).$$
Notes:
(1) The $\xrightarrow{\;L\;}$ denotes convergence in law (or distribution); $\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{\;L\;} N\!\left(0, \frac{1}{I(\theta_0)}\right)$ means that the CDF of $\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0)$ evaluated at a point $x$ converges to the CDF of a $N\!\left(0, \frac{1}{I(\theta_0)}\right)$ random variable at $x$, for each $x$; see Appendix A.14.
(2) The regularity conditions are: (i) $\theta$ is identifiable; (ii) the $p(X \mid \theta),\ \theta \in \Theta$, have common support $A = \{x : p(x \mid \theta) > 0\}$; (iii) the parameter space $\Theta$ contains an open set containing the true parameter value $\theta_0$ as an interior point; (iv) for all $x \in A$, $p(x \mid \theta)$ is three times differentiable with respect to $\theta$, the third derivative is continuous in $\theta$, and $\int p(x \mid \theta)\,dx$ can be differentiated three times under the integral sign; (v) for any $\theta_0 \in \Theta$, there exist a positive number $c$ and a function $M(x)$ (both of which may depend on $\theta_0$) such that $\left|\frac{\partial^3}{\partial\theta^3} \log p(x \mid \theta)\right| \le M(x)$ for all $x \in A$, $\theta_0 - c < \theta < \theta_0 + c$, with $E_{\theta_0}[M(X)] < \infty$.
(3) A key part of the proof is that under the regularity conditions, the MLE must satisfy the likelihood equation. If $\Theta$ is open, $l_x(\theta)$ is differentiable in $\theta$, and $\hat{\theta}_{MLE}$ exists, then $\hat{\theta}_{MLE}$ must satisfy the estimating equation
$$\frac{\partial}{\partial\theta}\, l_x(\theta) = 0. \qquad (1.2)$$
This is known as the likelihood equation. Solving (1.2) does not necessarily yield the MLE, as there may be solutions of (1.2) that are not maxima, or solutions that are only local maxima.
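To make the likelihood equation concrete (an illustration added to these notes), consider the Poisson example again: the score is $\frac{\partial}{\partial\theta}\, l_X(\theta) = \sum_{i=1}^{n} X_i/\theta - n$, whose unique root is $\hat{\theta} = \bar{X}$. A generic root finder recovers the same value; the simulated data and true mean below are arbitrary choices.

import numpy as np
from scipy.optimize import brentq

# Solve the likelihood equation (1.2) numerically for a simulated Poisson sample.
rng = np.random.default_rng(3)
x = rng.poisson(2.5, size=200)

def score(theta):
    # derivative of the log likelihood: sum(x)/theta - n
    return x.sum() / theta - len(x)

theta_hat = brentq(score, 1e-6, 100.0)   # root of the likelihood equation
print(theta_hat, x.mean())               # agrees with the closed-form MLE, xbar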
Outline of proof: For $X_1, \ldots, X_n$ iid, the log likelihood function is $l_X(\theta) = \sum_{i=1}^{n} \log p(X_i \mid \theta)$. Denote the derivatives with respect to $\theta$ by $l', l'', \ldots$. Expanding the first derivative of the log likelihood around the true value $\theta_0$, we have
$$l_X'(\theta) = l_X'(\theta_0) + (\theta - \theta_0)\, l_X''(\theta_0) + \cdots \qquad (1.3)$$
where we are going to ignore the higher-order terms (a justifiable maneuver under the regularity conditions). Now substitute $\hat{\theta}_{MLE}$ for $\theta$ in (1.3) and note that $l_X'(\hat{\theta}_{MLE}) = 0$ (see Note (3) above). Rearranging and multiplying through by $\sqrt{n}$ gives us
$$\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) = -\sqrt{n}\,\frac{l_X'(\theta_0)}{l_X''(\theta_0)} = \frac{\frac{1}{\sqrt{n}}\, l_X'(\theta_0)}{-\frac{1}{n}\, l_X''(\theta_0)}. \qquad (1.4)$$
Note that $l_X'(\theta_0) = \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log p(X_i \mid \theta)\big|_{\theta = \theta_0}$, so that from the central limit theorem and Lemma 1, we have that
$$\frac{1}{\sqrt{n}}\, l_X'(\theta_0) \xrightarrow{\;L\;} N(0, I(\theta_0)).$$
Also, $l_X''(\theta_0) = \sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2} \log p(X_i \mid \theta)\big|_{\theta = \theta_0}$, so that by the law of large numbers and Lemma 1, we have that
$$-\frac{1}{n}\, l_X''(\theta_0) \xrightarrow{\;P\;} -E_{\theta_0}\left[\frac{\partial^2}{\partial\theta^2} \log p(X \mid \theta_0)\right] = I(\theta_0).$$
Thus, if we let $W \sim N(0, I(\theta_0))$, then $\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0)$ converges in law to $W / I(\theta_0) \sim N(0, 1/I(\theta_0))$, proving the theorem.
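A quick simulation (a sketch added to these notes) illustrates Theorem 2 for the Poisson example, where $\hat{\theta}_{MLE} = \bar{X}$ and $1/I(\theta_0) = \theta_0$; the values of $\theta_0$, $n$, and the number of replications are arbitrary.

import numpy as np

# Check of Theorem 2 for the Poisson model: sqrt(n)(MLE - theta0) should look like N(0, theta0).
rng = np.random.default_rng(4)
theta0, n, reps = 2.0, 400, 20000

x = rng.poisson(theta0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - theta0)   # sqrt(n)(theta_hat - theta0) for each replication

print(z.mean(), z.var())                     # approx 0 and 1/I(theta0) = theta0 = 2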
Asymptotic Optimality of the MLE
Suppose that $X_1, \ldots, X_n \sim N(\theta, 1)$. The MLE is $\hat{\theta}_n = \bar{X}_n$, the sample mean based on the $n$ observations. Another reasonable estimator of $\theta$ is the sample median $\tilde{\theta}_n$. The MLE satisfies
$$\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{\;L\;} N(0, 1).$$
It can be proved that the median satisfies
$$\sqrt{n}\,(\tilde{\theta}_n - \theta) \xrightarrow{\;L\;} N\!\left(0, \frac{\pi}{2}\right).$$
This means that the median is consistent but has a larger variance than the MLE for large sample sizes.
More generally, consider two estimators $T_n$ and $U_n$ and suppose that
$$\sqrt{n}\,(T_n - \theta) \xrightarrow{\;L\;} N(0, t^2)$$
and that
$$\sqrt{n}\,(U_n - \theta) \xrightarrow{\;L\;} N(0, u^2).$$
We define the asymptotic relative efficiency of $U$ to $T$ by $ARE(U, T) = t^2 / u^2$. In the normal example,
$$ARE(\tilde{\theta}_n, \hat{\theta}_n) = \frac{1}{\pi/2} \approx 0.63.$$
The interpretation is that if you use the median as your estimator of $\theta$, you are effectively using only a fraction (about 63%) of the data compared to using the MLE as your estimator.
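The following simulation sketch (added to these notes; sample size and replication count are arbitrary) estimates the two asymptotic variances and their ratio, which should be close to $2/\pi \approx 0.63$.

import numpy as np

# Estimate ARE(median, mean) for N(theta, 1) data by simulation.
rng = np.random.default_rng(5)
theta, n, reps = 0.0, 500, 20000

x = rng.normal(theta, 1.0, size=(reps, n))
var_mean = n * x.mean(axis=1).var()        # approx 1
var_med = n * np.median(x, axis=1).var()   # approx pi/2 = 1.57

print(var_mean, var_med, var_mean / var_med)   # ratio approx 2/pi = 0.63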
Theorem: If $\hat{\theta}_n$ is the MLE and $\tilde{\theta}_n$ is any other estimator, then
$$ARE(\tilde{\theta}_n, \hat{\theta}_n) \le 1.$$
Thus, the MLE has the smallest asymptotic variance and
we say that the MLE is asymptotically efficient and
asymptotically optimal.
Comments: (1) We will provide an outline of the proof of this theorem when we study the Cramer-Rao (information) inequality in Chapter 3.4. (2) The result is actually more subtle than the stated theorem because it only covers a certain class of well-behaved estimators; more details will be studied in Stat 552.
II. Uniqueness and Existence of the MLE
For a finite sample, when does the MLE exist, when is it
unique and how do we find the MLE?
Anomalies of maximum likelihood estimates:
Maximum likelihood estimates are not necessarily unique
and do not even have to exist.
Nonuniqueness of MLEs example: $X_1, \ldots, X_n$ are iid Uniform$\left(\theta - \frac{1}{2},\ \theta + \frac{1}{2}\right)$.
$$L_x(\theta) = \begin{cases} 1 & \text{if } \max_i X_i - \frac{1}{2} \le \theta \le \min_i X_i + \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$
Thus any estimator $\hat{\theta}$ that satisfies $\max_i X_i - \frac{1}{2} \le \hat{\theta} \le \min_i X_i + \frac{1}{2}$ is a maximum likelihood estimator.
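A small sketch (added to these notes; the true $\theta$ and sample size are arbitrary) computes the interval of maximizers for a simulated sample; every value in the printed interval attains the same likelihood of 1.

import numpy as np

# Interval of MLEs for Uniform(theta - 1/2, theta + 1/2) data.
rng = np.random.default_rng(6)
theta = 5.0
x = rng.uniform(theta - 0.5, theta + 0.5, size=20)

lo, hi = x.max() - 0.5, x.min() + 0.5
print(lo, hi)        # every theta_hat in [lo, hi] maximizes the likelihood

def likelihood(th):
    # likelihood is 1 if all observations lie in [th - 1/2, th + 1/2], else 0
    return float(np.all((x >= th - 0.5) & (x <= th + 0.5)))

print(likelihood(lo), likelihood((lo + hi) / 2), likelihood(hi))   # all equal 1.0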
Nonexistence of maximum likelihood estimator: The
likelihood function can be unbounded. An important
example is a mixture of normal distributions, which is
frequently used in applications.
$X_1, \ldots, X_n$ iid with density
$$f(x) = p\, \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left(-\frac{(x - \mu_1)^2}{2\sigma_1^2}\right) + (1 - p)\, \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left(-\frac{(x - \mu_2)^2}{2\sigma_2^2}\right).$$
This is a mixture of two normal distributions. The unknown parameters are $(p, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$.
Let $\mu_1 = X_1$. Then as $\sigma_1^2 \to 0$, $f(X_1) \to \infty$, while the factors $f(X_i)$ for $i \ge 2$ stay bounded away from zero because of the second mixture component, so the likelihood function is unbounded.
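A brief sketch (added to these notes, with arbitrary parameter and data choices) shows the log likelihood growing without bound as $\sigma_1$ shrinks with $\mu_1$ fixed at $X_1$.

import numpy as np
from scipy.stats import norm

# Unbounded likelihood of a two-component normal mixture.
rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=50)           # any data set will do

p, mu2, s2 = 0.5, 0.0, 1.0
mu1 = x[0]                                  # center the first component at X_1

def loglik(s1):
    dens = p * norm.pdf(x, mu1, s1) + (1 - p) * norm.pdf(x, mu2, s2)
    return np.log(dens).sum()

for s1 in [1.0, 0.1, 0.01, 0.001]:
    print(s1, loglik(s1))                   # the log likelihood increases without bound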