Stat 550 Notes 10
Reading: Chapter 3.4.2
We will give an outline of a proof of the asymptotic optimality of the MLE:
Theorem: If $\hat{\theta}_n$ is the MLE and $\tilde{\theta}_n$ is any other estimator, then
$$ARE(\tilde{\theta}_n, \hat{\theta}_n) \le 1.$$
As a tool for proving this fact, we develop a lower bound
on the variance of an unbiased estimator. This is a
fundamental result in mathematical statistics that shows the
best that is achievable by a certain type of estimator.
I. The Information Inequality
The information inequality (sometimes called the Cramér-Rao lower bound) provides a lower bound on the variance of an unbiased estimator.
We will focus on a one-parameter model: the data $X$ is generated from $p(X \mid \theta)$, where $\theta$ is an unknown parameter and $\theta \in \Theta$.
We make two "regularity" assumptions on the model $\{p(X \mid \theta) : \theta \in \Theta\}$:
(I) The support of $p(X \mid \theta)$, $A = \{x : p(x \mid \theta) > 0\}$, does not depend on $\theta$. Also, for all $x \in A$ and $\theta \in \Theta$, $\frac{\partial}{\partial\theta}\log p(x \mid \theta)$ exists and is finite.
(II) If $T$ is any statistic such that $E_\theta(|T|) < \infty$ for all $\theta \in \Theta$, then the operations of integration and differentiation by $\theta$ can be interchanged in $\int T(x)\,p(x \mid \theta)\,dx$. That is,
$$\frac{d}{d\theta}\int T(x)\,p(x \mid \theta)\,dx = \int T(x)\,\frac{d}{d\theta}p(x \mid \theta)\,dx \quad (0.1)$$
whenever the right hand side of (0.1) is finite.
Assumption II is not useful as written; what is needed is a set of simple sufficient conditions on $p(x \mid \theta)$ for II to hold. Classical conditions may be found in an analysis book such as Rudin, Principles of Mathematical Analysis, pp. 236-237.
Assumptions I and II generally hold for a one-parameter
exponential family.
Proposition 3.4.1: If $p(x \mid \theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}$ is an exponential family and $\eta(\theta)$ has a nonvanishing continuous derivative on $\Theta$, then Assumptions I and II hold.
Recall the concept of Fisher information.
For a model $\{p(X \mid \theta) : \theta \in \Theta\}$ and a value of $\theta$, the Fisher information number $I(\theta)$ is defined as
$$I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p(X \mid \theta)\right)^2\right].$$
The Fisher information can be thought of as a measure of how fast on average the likelihood is changing as $\theta$ changes: the faster the likelihood is changing (i.e., the higher the information), the easier it is to estimate $\theta$.
Recall Lemma 1 of Notes 9.
Lemma 1: Suppose Assumptions I and II hold and that $E_\theta\left|\frac{\partial}{\partial\theta}\log p(X \mid \theta)\right| < \infty$. Then
$$E_\theta\left[\frac{\partial}{\partial\theta}\log p(X \mid \theta)\right] = 0$$
and thus
$$I(\theta) = Var_\theta\left(\frac{\partial}{\partial\theta}\log p(X \mid \theta)\right).$$
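As an informal illustration of Lemma 1 (this sketch is my addition, not part of the original notes), consider the $N(\theta, 1)$ model, for which $\frac{\partial}{\partial\theta}\log p(x \mid \theta) = x - \theta$ and $I(\theta) = 1$; the simulated score should have mean near 0 and variance near 1. The seed, sample size, and model choice below are arbitrary.

```python
# Monte Carlo check of Lemma 1 for a N(theta, 1) model (illustrative sketch):
# the score X - theta should have mean ~ 0 and variance ~ I(theta) = 1.
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
x = rng.normal(loc=theta, scale=1.0, size=100_000)

score = x - theta  # d/dtheta log p(x | theta) for the N(theta, 1) density
print("mean of score:    ", score.mean())  # close to 0
print("variance of score:", score.var())   # close to I(theta) = 1
```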
The information (Cramér-Rao) inequality provides a lower bound on the variance that an estimator can achieve based on the Fisher information number of the model.
Theorem 3.4.1 (Information Inequality): Let $T(X)$ be any statistic such that $Var_\theta(T(X)) < \infty$ for all $\theta$. Denote $E_\theta(T(X))$ by $\psi(\theta)$. Suppose that Assumptions I and II hold and $0 < I(\theta) < \infty$. Then for all $\theta$, $\psi(\theta)$ is differentiable and
$$Var_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}.$$
The application of the Information Inequality to unbiased
estimators is Corollary 3.4.1:
Suppose the conditions of Theorem 3.4.1 hold and $T(X)$ is an unbiased estimate of $\theta$. Then
$$Var_\theta(T(X)) \ge \frac{1}{I(\theta)}.$$
This corollary holds because for an unbiased estimator, $\psi(\theta) = \theta$ so that $\psi'(\theta) = 1$.
Proof of Information Inequality: The proof of the theorem
is a clever application of the Cauchy-Schwarz Inequality.
Stated statistically, the Cauchy-Schwarz Inequality is that for any two random variables $X$ and $Y$,
$$[Cov(X, Y)]^2 \le Var(X)\,Var(Y)$$
(Bickel and Doksum, page 458).
If we rearrange the inequality, we can get a lower bound on the variance of $X$:
$$Var(X) \ge \frac{[Cov(X, Y)]^2}{Var(Y)}.$$
We choose $X$ to be the estimator $T(X)$ and $Y$ to be the quantity $\frac{d}{d\theta}\log p(X \mid \theta)$, and apply the Cauchy-Schwarz Inequality.
First, we compute $Cov_\theta\left(T(X), \frac{d}{d\theta}\log p(X \mid \theta)\right)$. We have, using Assumption II,
$$E_\theta\left[T(X)\frac{d}{d\theta}\log p(X \mid \theta)\right] = E_\theta\left[T(X)\,\frac{\frac{d}{d\theta}p(X \mid \theta)}{p(X \mid \theta)}\right] = \int T(x)\,\frac{\frac{d}{d\theta}p(x \mid \theta)}{p(x \mid \theta)}\,p(x \mid \theta)\,dx$$
$$= \int T(x)\,\frac{d}{d\theta}p(x \mid \theta)\,dx = \frac{d}{d\theta}\int T(x)\,p(x \mid \theta)\,dx = \frac{d}{d\theta}E_\theta[T(X)] = \psi'(\theta).$$
From Lemma 1, $E_\theta\left[\frac{d}{d\theta}\log p(X \mid \theta)\right] = 0$, so we conclude that
$$Cov_\theta\left(T(X), \frac{d}{d\theta}\log p(X \mid \theta)\right) = \psi'(\theta).$$
From Lemma 1, we also have $Var_\theta\left(\frac{d}{d\theta}\log p(X \mid \theta)\right) = I(\theta)$. Thus, we conclude from the Cauchy-Schwarz inequality applied to $T(X)$ and $\frac{d}{d\theta}\log p(X \mid \theta)$ that
$$Var_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}.$$
II. The Information Inequality and Asymptotic Optimality
of the MLE
Consider $X_1, \ldots, X_n$ iid from a distribution $p(X_i \mid \theta)$, $\theta \in \Theta$, which satisfies Assumptions I and II. Let $I_1(\theta)$ be the Fisher information for one observation $X_1$ alone:
$$I_1(\theta) = Var_\theta\left(\frac{d}{d\theta}\log p(X_1 \mid \theta)\right).$$
Recall from Notes 9 that $I(\theta) = nI_1(\theta)$.
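A quick simulation (my own sketch, using an exponential model for illustration) can confirm this additivity: for $X_i$ iid Exponential with rate $\theta$, the per-observation score is $\frac{1}{\theta} - X_i$, so $I_1(\theta) = 1/\theta^2$, and the variance of the full-sample score $\sum_{i=1}^n\left(\frac{1}{\theta} - X_i\right)$ should be close to $nI_1(\theta) = n/\theta^2$.

```python
# Sketch: check that the information in n iid observations is n * I_1(theta)
# for an Exponential(rate = theta) model, where I_1(theta) = 1 / theta^2.
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_sims = 2.0, 20, 200_000

samples = rng.exponential(scale=1 / theta, size=(n_sims, n))
sample_score = (1 / theta - samples).sum(axis=1)  # sum of per-observation scores

print("Var of full-sample score:", sample_score.var())  # roughly n / theta^2 = 5.0
print("n * I_1(theta):          ", n / theta ** 2)
```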
Theorem 2 of Notes 9 showed that the MLE was
asymptotically normal:
Under "regularity conditions" (including Assumptions I and II),
$$\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{L} N\left(0, \frac{1}{I_1(\theta_0)}\right).$$
Thus, from Theorem 2, we have that for large $n$, $\hat{\theta}_{MLE}$ is approximately unbiased and has variance $\frac{1}{nI_1(\theta)} = \frac{1}{I(\theta)}$.
By the Information Inequality, the minimum variance of an unbiased estimator is $\frac{1}{I(\theta)}$. Thus the MLE approximately achieves the lower bound of the Information Inequality.
This suggests that for large $n$, among all consistent estimators (which are approximately unbiased for large $n$), the MLE achieves approximately the lowest variance and is hence asymptotically optimal. There may be other estimators that perform as well as the MLE asymptotically, but no estimator is better.
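As an informal check of this asymptotic claim (a sketch I am adding, continuing the exponential-model illustration above): for $X_1, \ldots, X_n$ iid Exponential with rate $\theta$, the MLE is $\hat{\theta}_{MLE} = 1/\bar{X}$ and $I_1(\theta) = 1/\theta^2$, so for large $n$ its variance should be close to $\frac{1}{nI_1(\theta)} = \frac{\theta^2}{n}$.

```python
# Sketch: check that the MLE's variance approaches the information bound
# 1/(n * I_1(theta)) for iid Exponential(rate = theta) data, where MLE = 1 / X-bar.
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_sims = 2.0, 500, 100_000

samples = rng.exponential(scale=1 / theta, size=(n_sims, n))
mle = 1.0 / samples.mean(axis=1)

print("simulated Var(MLE):", mle.var())       # roughly theta^2 / n = 0.008
print("bound 1/(n * I_1): ", theta ** 2 / n)
```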
Note: Making precise the sense in which the MLE is
asymptotically optimal took many years of brilliant work
by Lucien Le Cam and other mathematical statisticians.
III. Application of Information Inequality to Best Unbiased
Estimation
Before returning to the MLE, we provide another
application of the information inequality.
Consider the point estimation problem of estimating $g(\theta)$ when the data $X$ is generated from $p(X \mid \theta)$, where $\theta$ is an unknown parameter, $\theta \in \Theta$.
A fundamental problem in choosing a point estimator
(more generally a decision procedure) is that generally no
procedure dominates all other procedures. The MLE is
asymptotically optimal but not necessarily optimal for the
given sample size.
Two global approaches we have considered for choosing an estimator with the best risk function are (1) Bayes: minimize a weighted average of the risk; (2) minimax: minimize the worst-case risk.
Another approach is to restrict the class of possible
estimators and look for a procedure within a restricted class
that dominates all others.
The most commonly used restricted class of estimators is the class of unbiased estimators:
$$U = \{\delta(X) : E_\theta[\delta(X)] = g(\theta) \text{ for all } \theta \in \Theta\}.$$
Under squared error loss, the risk of an unbiased estimator is the variance of the unbiased estimator.
Uniformly minimum variance unbiased (UMVU) estimator: an estimator $\delta^*(X)$ which has the minimum variance among all unbiased estimators for all $\theta \in \Theta$, i.e.,
$$Var_\theta(\delta^*(X)) \le Var_\theta(\delta(X)) \text{ for all } \theta \in \Theta,\ \text{for all } \delta(X) \in U.$$
A UMVU estimator is at least as good as all other unbiased
estimators under squared error loss.
A UMVU estimator might or might not exist.
Application of Information Inequality to find UMVU
estimator for Poisson model:
Consider $X_1, \ldots, X_n$ iid Poisson($\theta$) with parameter space $0 < \theta < \infty$. This is a one-parameter exponential family,
$$p(X \mid \theta) = \exp\left(-n\theta + \log\theta \sum_{i=1}^n X_i - \sum_{i=1}^n \log X_i!\right).$$
We have $\eta'(\theta) = \frac{d}{d\theta}\log\theta = \frac{1}{\theta}$, which is greater than zero over the whole parameter space. Thus, by Proposition 3.4.1, Assumptions I and II hold for this model.
The Fisher information number is
$$I(\theta) = Var_\theta\left(\frac{d}{d\theta}\log p(X \mid \theta)\right) = Var_\theta\left(\frac{d}{d\theta}\left(-n\theta + \log\theta \sum_{i=1}^n X_i - \sum_{i=1}^n \log X_i!\right)\right)$$
$$= Var_\theta\left(-n + \frac{1}{\theta}\sum_{i=1}^n X_i\right) = \frac{1}{\theta^2}\, n\, Var_\theta(X_i) = \frac{n\theta}{\theta^2} = \frac{n}{\theta}.$$
Thus, by the Information Inequality, the lower bound on the variance of an unbiased estimator of $\theta$ is $\frac{1}{I(\theta)} = \frac{\theta}{n}$. The unbiased estimator $\bar{X}$ has $Var_\theta(\bar{X}) = \frac{\theta}{n}$. Thus, $\bar{X}$ achieves the Information Inequality lower bound and is hence a UMVU estimator.
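A short simulation (an added sketch, with arbitrary values of $\theta$ and $n$) confirms the calculation: for iid Poisson($\theta$) data, the variance of $\bar{X}$ should match the lower bound $\theta/n$.

```python
# Sketch: for X_1,...,X_n iid Poisson(theta), check that Var(X-bar)
# matches the information lower bound theta / n.
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_sims = 3.0, 40, 200_000

samples = rng.poisson(lam=theta, size=(n_sims, n))
xbar = samples.mean(axis=1)

print("simulated Var(X-bar):", xbar.var())  # roughly theta / n = 0.075
print("lower bound theta/n: ", theta / n)
```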
Comment: While the Information Inequality can be used to
establish that an estimator is UMVU for certain models,
failure of an estimator to achieve the lower bound does not
necessarily mean that the estimator is not UMVU for a
model. There are some models for which no unbiased
estimator achieves the lower bound.