
Statistics 550 Notes 18

Reading: Section 3.3-3.4

I. Finding Minimax Estimators (Section 3.3)

Review: Theorem 3.3.2 is a tool for finding minimax estimators. It says that if an estimator is a Bayes estimator and has maximum risk equal to its Bayes risk, then the estimator is minimax. In particular, a Bayes estimator with constant risk is a minimax estimator.

Minimax as limit of Bayes rules:

If the parameter space $\Theta$ is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem 3.3.2. The theorem below says that if an estimator has maximum risk equal to the limit of the Bayes risks of the Bayes estimators for a sequence of priors, then the estimator is minimax. In particular, an estimator with constant risk equal to the limit of the Bayes risks of the Bayes estimators for a sequence of priors is minimax.

Theorem 3.3.3: Let $\delta^*$ be a decision rule such that $\sup_\theta R(\theta, \delta^*) = r < \infty$. Let $\{\pi_k\}$ be a sequence of prior distributions and let $r_k$ be the Bayes risk of the Bayes rule with respect to the prior $\pi_k$. If $r_k \to r$ as $k \to \infty$, then $\delta^*$ is minimax.

Proof: Suppose $\delta$ is any other estimator. Then

$$\sup_\theta R(\theta, \delta) \ge \int R(\theta, \delta)\, d\pi_k(\theta) \ge r_k ,$$

and this holds for every $k$. Hence,

$$\sup_\theta R(\theta, \delta) \ge \lim_{k \to \infty} r_k = r = \sup_\theta R(\theta, \delta^*).$$

Thus, $\delta^*$ is minimax.

Note: Unlike Theorem 3.3.2, even if the Bayes estimators for the priors $\pi_k$ are unique, the theorem does not guarantee that $\delta^*$ is the unique minimax estimator.

Example 2 (Example 3.3.3): $X_1, \ldots, X_n$ iid $N(\theta, 1)$. Suppose we want to estimate $\theta$ with squared error loss. We will show that $\bar{X}$ is minimax.

First, note that $\bar{X}$ has constant risk $1/n$. Consider the sequence of priors $\pi_k = N(0, k^2)$. In Notes 16, we showed that the Bayes estimator for squared error loss with respect to the prior $\pi_k$ is

$$\hat{\theta}_k = \frac{n}{n + 1/k^2}\,\bar{X}.$$

Its risk is

$$R(\theta, \hat{\theta}_k) = E_\theta\!\left[\left(\frac{n}{n + 1/k^2}\,\bar{X} - \theta\right)^2\right] = \left(\frac{n}{n + 1/k^2}\right)^2 \frac{1}{n} + \left(\frac{1/k^2}{n + 1/k^2}\right)^2 \theta^2 ,$$

so the Bayes risk with respect to $\pi_k$ is

$$r_k = \int \left[\left(\frac{n}{n + 1/k^2}\right)^2 \frac{1}{n} + \left(\frac{1/k^2}{n + 1/k^2}\right)^2 \theta^2\right] \frac{1}{\sqrt{2\pi}\,k}\exp\!\left(-\frac{\theta^2}{2k^2}\right) d\theta = \frac{n}{(n + 1/k^2)^2} + \frac{1/k^2}{(n + 1/k^2)^2} = \frac{1}{n + 1/k^2}.$$

As $k \to \infty$, $r_k \to \frac{1}{n}$, which is the constant risk of $\bar{X}$.

Thus, by Theorem 3.3.3, $\bar{X}$ is minimax.
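As an optional numerical check (an addition, not from the original notes), the following sketch simulates the Bayes risk of $\hat{\theta}_k$ under the assumed prior $N(0, k^2)$ and compares it with the closed form $1/(n + 1/k^2)$ and with the constant risk $1/n$; the sample size and the grid of $k$ values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                     # sample size
reps = 200_000             # Monte Carlo replications

for k in [1, 2, 5, 10, 50]:
    theta = rng.normal(0.0, k, size=reps)         # theta ~ N(0, k^2)
    xbar = rng.normal(theta, np.sqrt(1.0 / n))    # Xbar | theta ~ N(theta, 1/n)
    bayes_est = n / (n + 1.0 / k**2) * xbar       # Bayes estimator under prior pi_k
    mc_risk = np.mean((bayes_est - theta) ** 2)   # Monte Carlo estimate of r_k
    print(f"k={k:3d}  MC r_k={mc_risk:.4f}  "
          f"1/(n+1/k^2)={1.0 / (n + 1.0 / k**2):.4f}  1/n={1.0 / n:.4f}")
```

As $k$ grows, the simulated Bayes risk approaches $1/n$, matching the limit argument above.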

II. Unbiased Estimation and Uniformly Minimum Variance Unbiased Estimates (Section 3.4)

Consider the point estimation problem of estimating $g(\theta)$ when the data $X$ is generated from $p(X \,|\, \theta)$, where $\theta$ is an unknown parameter, $\theta \in \Theta$.


A fundamental problem in choosing a point estimator (more generally, a decision procedure) is that generally no procedure dominates all other procedures.

Two approaches we have considered are (1) Bayes – minimize weighted average of risk; (2) minimax – minimize worst case risk.

Another approach is to restrict the class of possible estimators and look for a procedure within the restricted class that dominates all other procedures in the class.

The most commonly used restricted class of estimators is the class of unbiased estimators:

$$\mathcal{U} = \{\delta(X) : E_\theta[\delta(X)] = g(\theta) \text{ for all } \theta \in \Theta\}.$$

Under squared error loss, the risk of an unbiased estimator is its variance: if $E_\theta[\delta(X)] = g(\theta)$, then $E_\theta[(\delta(X) - g(\theta))^2] = \mathrm{Var}_\theta(\delta(X))$.

Uniformly minimum variance unbiased estimator (UMVU): An estimator $\delta^*(X)$ that has the minimum variance among all unbiased estimators for all $\theta \in \Theta$, i.e.,

$$\mathrm{Var}_\theta(\delta^*(X)) \le \mathrm{Var}_\theta(\delta(X)) \quad \text{for all } \theta \in \Theta \text{ and all } \delta(X) \in \mathcal{U}.$$

A UMVU estimator is at least as good as all other unbiased estimators under squared error loss.

A UMVU estimator might or might not exist.


Example of challenges in finding a UMVU estimator (Poisson unbiased estimation):

Let $X_1, \ldots, X_n$ be iid Poisson($\lambda$). Consider the sample mean $\bar{X}$ and the sample variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 .$$

Because the mean and the variance of a Poisson($\lambda$) distribution are both $\lambda$, the sample mean and the sample variance are both unbiased estimators of $\lambda$.

To determine the better estimator, $\bar{X}$ or $S^2$, we should now compare variances. We easily have $\mathrm{Var}_\lambda(\bar{X}) = \frac{\lambda}{n}$, but computing $\mathrm{Var}_\lambda(S^2)$ is quite a lengthy calculation. This is one of the first problems in finding a UMVU estimator: the calculations may be long and involved. It turns out that $\mathrm{Var}_\lambda(\bar{X}) \le \mathrm{Var}_\lambda(S^2)$ for all $\lambda$.
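To make the comparison concrete, here is a small Monte Carlo sketch (an illustration added here, not part of the original notes); the sample size $n = 10$ and the values of $\lambda$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 100_000

for lam in [0.5, 1.0, 5.0]:
    x = rng.poisson(lam, size=(reps, n))      # reps independent samples of size n
    xbar = x.mean(axis=1)                     # sample means
    s2 = x.var(axis=1, ddof=1)                # sample variances (denominator n-1)
    print(f"lambda={lam}: Var(Xbar) ~ {xbar.var():.4f}, "
          f"Var(S^2) ~ {s2.var():.4f}, lambda/n = {lam / n:.4f}")
```

In each case the simulated variance of $\bar{X}$ is close to $\lambda/n$ and smaller than that of $S^2$.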

Even if we can establish that $\bar{X}$ is better than $S^2$, consider the class of estimators

$$W_a(\bar{X}, S^2) = a\bar{X} + (1 - a)S^2 .$$

For every constant $a$, $E_\lambda[W_a(\bar{X}, S^2)] = \lambda$, so we now have infinitely many unbiased estimators of $\lambda$. Even if $\bar{X}$ is better than $S^2$, is it better than every $W_a(\bar{X}, S^2)$?
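Extending the same simulation (again, only an illustration), one can estimate the variance of $W_a(\bar{X}, S^2)$ over a grid of $a$; in this setup the estimated variance is smallest near $a = 1$, i.e., at $\bar{X}$ itself.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, lam = 10, 200_000, 2.0

x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

for a in [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]:
    w = a * xbar + (1 - a) * s2               # W_a = a*Xbar + (1-a)*S^2
    print(f"a={a:4.2f}  E[W_a] ~ {w.mean():.3f}  Var(W_a) ~ {w.var():.4f}")
```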

Furthermore, how can we be sure that there are not other, better unbiased estimators lurking about?


III. The Information Inequality

The information inequality provides a lower bound on the variance of an unbiased estimator. If an unbiased estimator achieves this lower bound, then it must be UMVU.

We will focus on a one-parameter model: the data $X$ is generated from $p(X \,|\, \theta)$, where $\theta$ is an unknown parameter, $\theta \in \Theta$.

We make two “regularity” assumptions on the model $\{p(x \,|\, \theta) : \theta \in \Theta\}$:

(I) The support of $p(\cdot \,|\, \theta)$, $A = \{x : p(x \,|\, \theta) > 0\}$, does not depend on $\theta$. Also, for all $\theta \in \Theta$ and $x \in A$, $\frac{\partial}{\partial\theta}\log p(x \,|\, \theta)$ exists and is finite.

(II) If $T$ is any statistic such that $E_\theta(|T|) < \infty$ for all $\theta \in \Theta$, then the operations of integration and differentiation by $\theta$ can be interchanged in $E_\theta[T(X)] = \int T(x)\,p(x \,|\, \theta)\,dx$. That is, for integration over $\mathbb{R}^q$,

$$\frac{d}{d\theta}\int T(x)\,p(x \,|\, \theta)\,dx = \int T(x)\,\frac{\partial}{\partial\theta}p(x \,|\, \theta)\,dx \qquad (1.1)$$

whenever the right-hand side of (1.1) is finite.

Assumption II is not useful as written; what is needed is a set of simple sufficient conditions on $p(\cdot \,|\, \theta)$ for II to hold.

Classical conditions may be found in an analysis book such as Rudin, Principles of Mathematical Analysis, pp. 236-237.

Assumptions I and II generally hold for a one-parameter exponential family.

Proposition 3.4.1: If $p(x \,|\, \theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}$ is a one-parameter exponential family and $\eta(\theta)$ has a nonvanishing continuous derivative on $\Theta$, then Assumptions I and II hold.

Recall the concept of Fisher information from Notes 12, where we defined the Fisher information for a single random variable $X$. Here we define the Fisher information for data $X$ that might be multidimensional.

For a model $\{p(X \,|\, \theta) : \theta \in \Theta\}$ and a value of $\theta$, the Fisher information number $I(\theta)$ is defined as

$$I(\theta) = E_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log p(X \,|\, \theta)\right)^2\right].$$

Lemma 1 (similar to Lemma 1 from Notes 12): Suppose Assumptions I and II hold and that $E_\theta\!\left|\frac{\partial}{\partial\theta}\log p(X \,|\, \theta)\right| < \infty$. Then

$$E_\theta\!\left[\frac{\partial}{\partial\theta}\log p(X \,|\, \theta)\right] = 0, \quad \text{and thus} \quad I(\theta) = \mathrm{Var}_\theta\!\left(\frac{\partial}{\partial\theta}\log p(X \,|\, \theta)\right).$$

Proof: First, we observe that since $\int p(x \,|\, \theta)\,dx = 1$ for all $\theta$, we have $\frac{d}{d\theta}\int p(x \,|\, \theta)\,dx = 0$. Combining this with the identity

$$\frac{\partial}{\partial\theta}\log p(x \,|\, \theta) = \frac{\frac{\partial}{\partial\theta}p(x \,|\, \theta)}{p(x \,|\, \theta)},$$

we have

$$E_\theta\!\left[\frac{\partial}{\partial\theta}\log p(X \,|\, \theta)\right] = \int \frac{\frac{\partial}{\partial\theta}p(x \,|\, \theta)}{p(x \,|\, \theta)}\,p(x \,|\, \theta)\,dx = \int \frac{\partial}{\partial\theta}p(x \,|\, \theta)\,dx = \frac{d}{d\theta}\int p(x \,|\, \theta)\,dx = 0,$$

where we have interchanged differentiation and integration, which is justified under Assumption II.
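As a quick illustration of Lemma 1 (an addition, not from the original notes), consider a single Poisson($\lambda$) observation, for which the score is $\frac{d}{d\lambda}\log p(X \,|\, \lambda) = X/\lambda - 1$; a Monte Carlo sketch checks that it has mean 0 and variance $I(\lambda) = 1/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, reps = 3.0, 500_000

x = rng.poisson(lam, size=reps)
score = x / lam - 1.0                  # d/dlambda log p(X|lambda) for one Poisson observation
print("E[score]   ~", score.mean())    # should be close to 0 (Lemma 1)
print("Var[score] ~", score.var())     # should be close to I(lambda) = 1/lambda
print("1/lambda    =", 1.0 / lam)
```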

The information (Cramér-Rao) inequality provides a lower bound on the variance that an estimator can achieve, based on the Fisher information number of the model.

Theorem 3.4.1 (Information Inequality): Let $T(X)$ be any statistic such that $\mathrm{Var}_\theta(T(X)) < \infty$ for all $\theta$. Denote $E_\theta(T(X))$ by $\psi(\theta)$. Suppose that Assumptions I and II hold and $0 < I(\theta) < \infty$. Then for all $\theta$, $\psi(\theta)$ is differentiable and

$$\mathrm{Var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}.$$

The application of the Information Inequality to unbiased estimators is Corollary 3.4.1: Suppose the conditions of Theorem 3.4.1 hold and $T(X)$ is an unbiased estimate of $\theta$. Then

$$\mathrm{Var}_\theta(T(X)) \ge \frac{1}{I(\theta)}.$$

This corollary holds because for an unbiased estimator, $\psi(\theta) = E_\theta[T(X)] = \theta$, so that $\psi'(\theta) = 1$.

Proof of Information Inequality: The proof of the theorem is a clever application of the Cauchy-Schwarz Inequality.

Stated statistically, the Cauchy-Schwarz Inequality is that for any two random variables $X$ and $Y$,

$$[\mathrm{Cov}(X, Y)]^2 \le \mathrm{Var}(X)\,\mathrm{Var}(Y).$$

If we rearrange the inequality, we can get a lower bound on the variance of $X$:

$$\mathrm{Var}(X) \ge \frac{[\mathrm{Cov}(X, Y)]^2}{\mathrm{Var}(Y)}.$$

We choose $X$ to be the estimator $T(X)$ and $Y$ to be the quantity $\frac{d}{d\theta}\log p(X \,|\, \theta)$, and apply the Cauchy-Schwarz Inequality.

First, we compute $\mathrm{Cov}_\theta\!\left(T(X), \frac{d}{d\theta}\log p(X \,|\, \theta)\right)$. We have, using Assumption II,

$$E_\theta\!\left[T(X)\,\frac{d}{d\theta}\log p(X \,|\, \theta)\right] = \int T(x)\,\frac{\frac{d}{d\theta}p(x \,|\, \theta)}{p(x \,|\, \theta)}\,p(x \,|\, \theta)\,dx = \int T(x)\,\frac{d}{d\theta}p(x \,|\, \theta)\,dx = \frac{d}{d\theta}\int T(x)\,p(x \,|\, \theta)\,dx = \frac{d}{d\theta}E_\theta[T(X)] = \psi'(\theta).$$


From Lemma 1, $E_\theta\!\left[\frac{d}{d\theta}\log p(X \,|\, \theta)\right] = 0$, so that we conclude that

$$\mathrm{Cov}_\theta\!\left(T(X), \frac{d}{d\theta}\log p(X \,|\, \theta)\right) = E_\theta\!\left[T(X)\,\frac{d}{d\theta}\log p(X \,|\, \theta)\right] = \psi'(\theta).$$

From Lemma 1, we have $\mathrm{Var}_\theta\!\left(\frac{d}{d\theta}\log p(X \,|\, \theta)\right) = I(\theta)$. Thus, we conclude from the Cauchy-Schwarz inequality applied to $T(X)$ and $\frac{d}{d\theta}\log p(X \,|\, \theta)$ that

$$\mathrm{Var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}.$$

Application of the Information Inequality to find a UMVU estimator for the Poisson model:

Consider $X_1, \ldots, X_n$ iid Poisson($\lambda$) with parameter space $\lambda > 0$. This is a one-parameter exponential family,

$$p(X \,|\, \lambda) = \exp\!\left(-n\lambda + \log\lambda \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \log X_i!\right).$$

We have $\frac{d}{d\lambda}\log\lambda = \frac{1}{\lambda}$, which is greater than zero over the whole parameter space. Thus, by Proposition 3.4.1, Assumptions I and II hold for this model.

The Fisher information number is


$$I(\lambda) = \mathrm{Var}_\lambda\!\left(\frac{d}{d\lambda}\log p(X \,|\, \lambda)\right) = \mathrm{Var}_\lambda\!\left(\frac{d}{d\lambda}\left[-n\lambda + \log\lambda \sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\log X_i!\right]\right) = \mathrm{Var}_\lambda\!\left(-n + \frac{1}{\lambda}\sum_{i=1}^{n} X_i\right) = \frac{1}{\lambda^2}\, n\,\mathrm{Var}_\lambda(X_i) = \frac{n}{\lambda}.$$

Thus, by the Information Inequality, the lower bound on the variance of an unbiased estimator of $\lambda$ is $\frac{1}{I(\lambda)} = \frac{\lambda}{n}$.

The unbiased estimator $\bar{X}$ has $\mathrm{Var}_\lambda(\bar{X}) = \frac{\lambda}{n}$. Thus, $\bar{X}$ achieves the Information Inequality lower bound and is hence a UMVU estimator.
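A numerical check of this conclusion (illustrative only, not from the original notes): estimate the Fisher information of the Poisson sample by the Monte Carlo variance of the score and compare $\mathrm{Var}(\bar{X})$ with the bound $1/I(\lambda) = \lambda/n$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam, reps = 10, 2.0, 300_000

x = rng.poisson(lam, size=(reps, n))
score = -n + x.sum(axis=1) / lam       # d/dlambda log p(X_1,...,X_n | lambda)
xbar = x.mean(axis=1)

print("Var(score) ~", score.var(), " (n/lambda =", n / lam, ")")       # Fisher information
print("Var(Xbar)  ~", xbar.var(), " (bound lambda/n =", lam / n, ")")  # attains the bound
```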

Comment: While the Information Inequality can be used to establish that an estimator is UMVU for certain models, failure of an estimator to achieve the lower bound does not necessarily mean that the estimator is not UMVU for a model. There are some models for which no unbiased estimator achieves the lower bound.

Multiparameter Information Inequality: There is a version of the information inequality for multiparameter models, which is given in Theorem 3.4.3 on pg. 186 of Bickel and Doksum. The derivative $\psi'(\theta)$ is replaced by the gradient of the expected value of $T(X)$ with respect to $\theta$, and $1/I(\theta)$ is replaced by the inverse of the Fisher information matrix, where the Fisher information matrix is

$$I(\theta) = \mathrm{Cov}_\theta\!\left(\nabla_\theta \log p(X \,|\, \theta)\right).$$


IV. The Information Inequality and Asymptotic Optimality of the MLE

Theorem 2 of Notes 12 was about the asymptotic variance of the MLE. We will rewrite the result of the theorem in terms of the Fisher information number as we have defined it in these notes.

Consider $X_1, \ldots, X_n$ iid $p(X \,|\, \theta)$, which satisfies Assumptions I and II. Let

$$I_1(\theta) = \mathrm{Var}_\theta\!\left(\frac{d}{d\theta}\log p(X_1 \,|\, \theta)\right).$$

Theorem 2 of Notes 12 can be rewritten as

Theorem 2': Under “regularity conditions” (including Assumptions I and II),

$$\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{L} N\!\left(0, \frac{1}{I_1(\theta_0)}\right).$$

The relationship between $I(\theta)$ and $I_1(\theta)$ is

$$I(\theta) = \mathrm{Var}_\theta\!\left(\frac{d}{d\theta}\log p(X \,|\, \theta)\right) = \mathrm{Var}_\theta\!\left(\sum_{i=1}^{n}\frac{d}{d\theta}\log p(X_i \,|\, \theta)\right) = n\,\mathrm{Var}_\theta\!\left(\frac{d}{d\theta}\log p(X_1 \,|\, \theta)\right) = n I_1(\theta).$$


Thus, from Theorem 2', we have that for large $n$, $\hat{\theta}_{MLE}$ is approximately unbiased and has variance

$$\frac{1}{n I_1(\theta)} = \frac{1}{I(\theta)}.$$

By the Information Inequality, the minimum variance of an unbiased estimator is $\frac{1}{I(\theta)}$. Thus the MLE approximately achieves the lower bound of the Information Inequality.

This suggests that for large $n$, among all consistent estimators (which are approximately unbiased for large $n$), the MLE achieves approximately the lowest variance and is hence asymptotically optimal.
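As an illustration (added here, not part of the original notes), in the Poisson model the MLE of $\lambda$ is $\bar{X}$ and $I_1(\lambda) = 1/\lambda$, so $\sqrt{n}(\hat{\lambda}_{MLE} - \lambda_0)$ should have variance close to $1/I_1(\lambda_0) = \lambda_0$ for large $n$; a short simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
lam0, n, reps = 2.0, 200, 50_000

x = rng.poisson(lam0, size=(reps, n))
mle = x.mean(axis=1)                   # Poisson MLE of lambda is the sample mean
scaled = np.sqrt(n) * (mle - lam0)     # sqrt(n) * (MLE - lambda_0)

print("mean of scaled errors ~", scaled.mean())   # should be close to 0
print("variance of scaled    ~", scaled.var())    # should be close to 1/I_1(lambda_0) = lambda_0
print("lambda_0               =", lam0)
```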

Note: Making precise the sense in which the MLE is asymptotically optimal took many years of brilliant work by Lucien Le Cam and other mathematical statisticians. It will be covered in Stat 552.
