
Chapter 2: Maximum Likelihood Estimation

Advanced Econometrics - HEC Lausanne

Christophe Hurlin

University of Orléans

December 9, 2013


Section 1

Introduction


1. Introduction

The Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a model. This estimation method is one of the most widely used.

The method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data.

Maximum likelihood estimation gives a unified approach to estimation.


2. The Principle of Maximum Likelihood

What are the main properties of the maximum likelihood estimator?

- Is it asymptotically unbiased?
- Is it asymptotically efficient? Under which condition(s)?
- Is it consistent?
- What is the asymptotic distribution?

How to apply the maximum likelihood principle to the multiple linear regression model, to the Probit/Logit models, etc.?

All of these questions are answered in this lecture.

1. Introduction

The outline of this chapter is the following:

Section 2: The principle of the maximum likelihood estimation

Section 3: The likelihood function

Section 4: Maximum likelihood estimator

Section 5: Score, Hessian and Fisher information

Section 6: Properties of maximum likelihood estimators


1. Introduction

References

Amemiya T. (1985), Advanced Econometrics. Harvard University Press.

Greene W. (2007), Econometric Analysis, sixth edition, Pearson - Prentice Hall.

Pelgrin, F. (2010), Lecture notes, Advanced Econometrics, HEC Lausanne (a special thank).

Ruud P. (2000), An Introduction to Classical Econometric Theory, Oxford University Press.

Zivot, E. (2001), Maximum Likelihood Estimation, Lecture notes.


Section 2

The Principle of Maximum Likelihood


2. The Principle of Maximum Likelihood

Objectives

In this section, we present a simple example in order:

1. To introduce the notations.
2. To introduce the notions of likelihood and log-likelihood.
3. To introduce the concept of maximum likelihood estimator.
4. To introduce the concept of maximum likelihood estimate.

2. The Principle of Maximum Likelihood

Example

Suppose that X_1, X_2, .., X_N are i.i.d. discrete random variables, with X_i ∼ Pois(θ), i.e. with a pmf (probability mass function) defined as:

Pr(X_i = x_i) = exp(−θ) θ^{x_i} / x_i!

where θ is an unknown parameter to estimate.

2. The Principle of Maximum Likelihood

Question: What is the probability of observing the particular sample {x_1, x_2, .., x_N}, assuming that a Poisson distribution with as yet unknown parameter θ generated the data?

This probability is equal to:

Pr((X_1 = x_1) ∩ ... ∩ (X_N = x_N))

2. The Principle of Maximum Likelihood

Since the variables X_i are i.i.d., this joint probability is equal to the product of the marginal probabilities:

Pr((X_1 = x_1) ∩ ... ∩ (X_N = x_N)) = ∏_{i=1}^N Pr(X_i = x_i)

Given the pmf of the Poisson distribution, we have:

Pr((X_1 = x_1) ∩ ... ∩ (X_N = x_N)) = ∏_{i=1}^N exp(−θ) θ^{x_i} / x_i!
                                    = exp(−θN) θ^{∑_{i=1}^N x_i} / ∏_{i=1}^N x_i!

2. The Principle of Maximum Likelihood

Definition

This joint probability is a function of θ (the unknown parameter) and corresponds to the likelihood of the sample {x_1, .., x_N}, denoted by:

L_N(θ; x_1, .., x_N) = Pr((X_1 = x_1) ∩ ... ∩ (X_N = x_N))

with

L_N(θ; x_1, .., x_N) = exp(−θN) θ^{∑_{i=1}^N x_i} / ∏_{i=1}^N x_i!

2. The Principle of Maximum Likelihood

Example

Let us assume that for N = 10, we have a realization of the sample equal to {5, 0, 1, 1, 0, 3, 2, 3, 4, 1}. Then:

L_N(θ; x_1, .., x_N) = Pr((X_1 = x_1) ∩ ... ∩ (X_N = x_N))

L_N(θ; x_1, .., x_N) = e^{−10θ} θ^{20} / 207,360

2. The Principle of Maximum Likelihood

Question: What value of θ would make this sample most probable?

2. The Principle of Maximum Likelihood

This figure plots the function L_N(θ; x) for various values of θ. It has a single mode at θ = 2, which would be the maximum likelihood estimate, or MLE, of θ.

[Figure: likelihood function L_N(θ; x) plotted against θ ∈ [0, 4], with a single mode at θ = 2.]

2. The Principle of Maximum Likelihood

Consider maximizing the likelihood function L_N(θ; x_1, .., x_N) with respect to θ. Since the log function is monotonically increasing, we usually maximize ln L_N(θ; x_1, .., x_N) instead. In this case:

ln L_N(θ; x_1, .., x_N) = −θN + ln(θ) ∑_{i=1}^N x_i − ln(∏_{i=1}^N x_i!)

∂ ln L_N(θ; x_1, .., x_N) / ∂θ = −N + (1/θ) ∑_{i=1}^N x_i

∂² ln L_N(θ; x_1, .., x_N) / ∂θ² = −(1/θ²) ∑_{i=1}^N x_i < 0

2. The Principle of Maximum Likelihood

Under suitable regularity conditions, the maximum likelihood estimate (estimator) is defined as:

θ̂ = arg max_{θ ∈ ℝ₊} ln L_N(θ; x_1, .., x_N)

FOC: ∂ ln L_N(θ; x_1, .., x_N) / ∂θ |_{θ̂} = −N + (1/θ̂) ∑_{i=1}^N x_i = 0  ⟺  θ̂ = (1/N) ∑_{i=1}^N x_i

SOC: ∂² ln L_N(θ; x_1, .., x_N) / ∂θ² |_{θ̂} = −(1/θ̂²) ∑_{i=1}^N x_i < 0, so θ̂ is a maximum.

2. The Principle of Maximum Likelihood

The maximum likelihood estimate (realization) is:

θ̂(x) = (1/N) ∑_{i=1}^N x_i

Given the sample {5, 0, 1, 1, 0, 3, 2, 3, 4, 1}, we have θ̂(x) = 2.

The maximum likelihood estimator (random variable) is:

θ̂ = (1/N) ∑_{i=1}^N X_i
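A minimal numerical check of this example, assuming NumPy and SciPy are available: it evaluates the Poisson log-likelihood of the sample {5, 0, 1, 1, 0, 3, 2, 3, 4, 1} over a grid of values of θ and confirms that the maximizer coincides with the sample mean θ̂(x) = 2.

```python
import numpy as np
from scipy.special import gammaln

# Sample from the slides: N = 10 observations
x = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])

def poisson_loglik(theta, x):
    # ln L_N(theta; x) = -N*theta + ln(theta) * sum(x_i) - sum(ln(x_i!))
    return -len(x) * theta + np.log(theta) * x.sum() - gammaln(x + 1).sum()

# Evaluate the log-likelihood on a grid of candidate values for theta
grid = np.linspace(0.1, 5.0, 491)
values = np.array([poisson_loglik(t, x) for t in grid])

theta_grid = grid[values.argmax()]   # grid maximizer
theta_mle = x.mean()                 # analytical MLE (sample mean)

print(theta_grid, theta_mle)         # both equal to 2.0
```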

2. The Principle of Maximum Likelihood

Continuous variables

The reference to the probability of observing the given sample is not exact in a continuous distribution, since a particular sample has probability zero. Nonetheless, the principle is the same.

The likelihood function then corresponds to the pdf associated to the joint distribution of (X_1, X_2, .., X_N) evaluated at the point (x_1, x_2, .., x_N):

L_N(θ; x_1, .., x_N) = f_{X_1,..,X_N}(x_1, x_2, .., x_N; θ)

2. The Principle of Maximum Likelihood

Continuous variables

If the random variables {X_1, X_2, .., X_N} are i.i.d., then we have:

L_N(θ; x_1, .., x_N) = ∏_{i=1}^N f_X(x_i; θ)

where f_X(x_i; θ) denotes the pdf of the marginal distribution of X (or of X_i, since all the variables have the same distribution).

The values of the parameters that maximize L_N(θ; x_1, .., x_N) or its log are the maximum likelihood estimates, denoted θ̂(x).

Section 3

The Likelihood function

Definitions and Notations

3. The Likelihood Function

Objectives

1. Introduce the notations for an estimation problem that deals with a marginal distribution or a conditional distribution (model).
2. Define the likelihood and the log-likelihood functions.
3. Introduce the concept of conditional log-likelihood.
4. Propose various applications.

3. The Likelihood Function

Notations

Let us consider a continuous random variable X, with a pdf denoted f_X(x; θ), for x ∈ ℝ, where θ = (θ_1 .. θ_K)' is a K×1 vector of unknown parameters. We assume that θ ∈ Θ ⊆ ℝ^K.

Let us consider a sample {X_1, .., X_N} of i.i.d. random variables with the same arbitrary distribution as X.

The realisation of {X_1, .., X_N} (the data set) is denoted {x_1, .., x_N}, or x for simplicity.

3. The Likelihood Function

Example (Normal distribution)

If X ∼ N(m, σ²), then:

f_X(z; θ) = (1 / (σ √(2π))) exp(−(z − m)² / (2σ²)),  ∀ z ∈ ℝ

with K = 2 and θ = (m σ²)'.

3. The Likelihood Function

Definition (Likelihood Function)

The likelihood function is defined to be:

L_N : Θ × ℝ^N → ℝ₊
(θ; x_1, .., x_N) ↦ L_N(θ; x_1, .., x_N) = ∏_{i=1}^N f_X(x_i; θ)

3. The Likelihood Function

Definition (Log-Likelihood Function)

The log-likelihood function is defined to be:

ℓ_N : Θ × ℝ^N → ℝ
(θ; x_1, .., x_N) ↦ ℓ_N(θ; x_1, .., x_N) = ∑_{i=1}^N ln f_X(x_i; θ)

3. The Likelihood Function

Remark: the (log-)likelihood function depends on two types of arguments: the parameter vector θ and the data, i.e. the sample realisation (x_1, .., x_N).

3. The Likelihood Function

Notations: In the rest of the chapter, I will use the following alternative notations:

L_N(θ; x) ≡ L(θ; x_1, .., x_N) ≡ L_N(θ)

ℓ_N(θ; x) ≡ ln L_N(θ; x) ≡ ln L(θ; x_1, .., x_N) ≡ ln L_N(θ)

3. The Likelihood Function

Example (Sample of Normal Variables)

We consider an i.i.d. sample {Y_1, .., Y_N}, with Y_i ∼ N(m, σ²), and a realisation denoted {y_1, .., y_N} or y. Let us define θ = (m σ²)'. Then we have:

L_N(θ; y) = ∏_{i=1}^N (1 / (σ √(2π))) exp(−(y_i − m)² / (2σ²))
          = (2πσ²)^{−N/2} exp(−(1/(2σ²)) ∑_{i=1}^N (y_i − m)²)

ℓ_N(θ; y) = −(N/2) ln(σ²) − (N/2) ln(2π) − (1/(2σ²)) ∑_{i=1}^N (y_i − m)²

3. The Likelihood Function

Definition (Likelihood of one observation)

We can also define the (log-)likelihood of one observation x_i:

L_i(θ; x) = f_X(x_i; θ)   with   L_N(θ; x) = ∏_{i=1}^N L_i(θ; x)

ℓ_i(θ; x) = ln f_X(x_i; θ)   with   ℓ_N(θ; x) = ∑_{i=1}^N ℓ_i(θ; x)

3. The Likelihood Function

Example (Exponential Distribution)

Suppose that D_1, D_2, .., D_N are i.i.d. positive random variables (durations, for instance), with D_i ∼ Exp(θ), θ ≥ 0, and

L_i(θ; d_i) = f_D(d_i; θ) = (1/θ) exp(−d_i/θ)

Then we have:

ℓ_i(θ; d_i) = ln(f_D(d_i; θ)) = −ln(θ) − d_i/θ

L_N(θ; d) = θ^{−N} exp(−(1/θ) ∑_{i=1}^N d_i)

ℓ_N(θ; d) = −N ln(θ) − (1/θ) ∑_{i=1}^N d_i

3. The Likelihood Function

Remark: The (log-)likelihood and the maximum likelihood estimator are always based on an assumption (a bet?) about the distribution of Y:

Y_i ∼ distribution with pdf f_Y(y; θ)  ⟹  L_N(θ; y) and ℓ_N(θ; y)

In practice, we generally have no idea about the true distribution of Y_i....

A solution: the Quasi-Maximum Likelihood Estimator.

3. The Likelihood Function

Remark: We can also use the MLE to estimate the parameters of a model (with dependent and explicative variables) such that:

y = g(x; θ) + ε

where θ denotes the vector of parameters, X a set of explicative variables, ε an error term and g(.) the link function.

In this case, we generally consider the conditional distribution of Y given X, which is equivalent to the unconditional distribution of the error term ε:

Y | X ∼ D  ⟺  ε ∼ D

3. The Likelihood Function

Notations (model)

Let us consider two continuous random variables Y and X.

We assume that Y has a conditional distribution given X = x with a pdf denoted f_{Y|x}(y; θ), for y ∈ ℝ, where θ = (θ_1 .. θ_K)' is a K×1 vector of unknown parameters. We assume θ ∈ Θ ⊆ ℝ^K.

Let us consider a sample {X_i, Y_i}_{i=1}^N of i.i.d. random variables and a realisation {x_i, y_i}_{i=1}^N.

3. The Likelihood Function

Definition (Conditional likelihood function)

The (conditional) likelihood function is defined to be:

L_N(θ; y | x) = ∏_{i=1}^N f_{Y|X}(y_i | x_i; θ)

where f_{Y|X}(y_i | x_i; θ) denotes the conditional pdf of Y_i given X_i.

Remark: The conditional likelihood function is the joint conditional density of the data, in which the unknown parameter is θ.

3. The Likelihood Function

Definition (Conditional log-likelihood function)

The (conditional) log-likelihood function is defined to be:

ℓ_N(θ; y | x) = ∑_{i=1}^N ln f_{Y|X}(y_i | x_i; θ)

where f_{Y|X}(y_i | x_i; θ) denotes the conditional pdf of Y_i given X_i.

3. The Likelihood Function

Remark: The conditional probability density function (pdf) can be denoted by:

f_{Y|X}(y | x; θ) ≡ f_Y(y | X = x; θ) ≡ f_Y(y | X = x)

3. The Likelihood Function

Example (Linear Regression Model)

Consider the following linear regression model:

y_i = x_i'β + ε_i

where x_i is a K×1 vector of random variables and β = (β_1 .. β_K)' a K×1 vector of parameters. We assume that the ε_i are i.i.d. with ε_i ∼ N(0, σ²).

Then, the conditional distribution of Y_i given X_i = x_i is:

Y_i | x_i ∼ N(x_i'β, σ²)

L_i(θ; y | x) = f_{Y|x}(y_i | x_i; θ) = (1 / (σ √(2π))) exp(−(y_i − x_i'β)² / (2σ²))

where θ = (β' σ²)' is a (K+1)×1 vector.

3. The Likelihood Function

Example (Linear Regression Model, cont'd)

Then, if we consider an i.i.d. sample {y_i, x_i}_{i=1}^N, the corresponding conditional (log-)likelihood is defined to be:

L_N(θ; y | x) = ∏_{i=1}^N f_{Y|X}(y_i | x_i; θ) = ∏_{i=1}^N (1 / (σ √(2π))) exp(−(y_i − x_i'β)² / (2σ²))
             = (2πσ²)^{−N/2} exp(−(1/(2σ²)) ∑_{i=1}^N (y_i − x_i'β)²)

ℓ_N(θ; y | x) = −(N/2) ln(σ²) − (N/2) ln(2π) − (1/(2σ²)) ∑_{i=1}^N (y_i − x_i'β)²
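As an illustration, here is a minimal Python sketch of this conditional log-likelihood, assuming NumPy is available; the regressors, seed and parameter values are hypothetical and chosen only so the example runs.

```python
import numpy as np

def normal_reg_loglik(beta, sigma2, y, X):
    """Conditional log-likelihood of the Gaussian linear model y_i | x_i ~ N(x_i' beta, sigma2)."""
    N = len(y)
    resid = y - X @ beta
    return (-0.5 * N * np.log(sigma2)
            - 0.5 * N * np.log(2 * np.pi)
            - 0.5 * resid @ resid / sigma2)

# toy (hypothetical) data
rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=N)

print(normal_reg_loglik(beta_true, 1.5**2, y, X))
```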

3. The Likelihood Function

Remark: Given this principle, we can derive the (conditional) likelihood and the log-likelihood functions associated to a specific sample for any type of econometric model in which the conditional distribution of the dependent variable is known:

Dichotomic models: probit, logit models, etc.

Censored regression models: Tobit, etc.

Time series models: AR, ARMA, VAR, etc.

GARCH models

....

3. The Likelihood Function

Example (Probit/Logit Models)

Let us consider a dichotomic variable Y_i such that Y_i = 1 if the firm i is in default and 0 otherwise. X_i = (X_{i1} ... X_{iK})' denotes a K×1 vector of individual characteristics. We assume that the conditional probability of default is defined as:

Pr(Y_i = 1 | X_i = x_i) = F(x_i'β)

where β = (β_1 .. β_K)' is a vector of parameters and F(.) is a cdf (cumulative distribution function):

Y_i = 1 with probability F(x_i'β), and Y_i = 0 with probability 1 − F(x_i'β).

3. The Likelihood Function

Remark: Given the choice of the link function F(.), we get a probit or a logit model.

3. The Likelihood Function

Definition (Probit Model)

In a probit model, the conditional probability of the event Y_i = 1 is:

Pr(Y_i = 1 | X_i = x_i) = Φ(x_i'β) = ∫_{−∞}^{x_i'β} (1/√(2π)) exp(−u²/2) du

where Φ(.) denotes the cdf of the standard normal distribution.

3. The Likelihood Function

Definition (Logit Model)

In a logit model, the conditional probability of the event Y_i = 1 is:

Pr(Y_i = 1 | X_i = x_i) = Λ(x_i'β) = 1 / (1 + exp(−x_i'β))

where Λ(.) denotes the cdf of the logistic distribution.

3. The Likelihood Function

Example (Probit/Logit Models, cont'd)

What is the (conditional) log-likelihood of the sample {y_i, x_i}_{i=1}^N?

Whatever the choice of F(.), the conditional distribution of Y_i given X_i = x_i is a Bernoulli distribution, since:

Y_i = 1 with probability F(x_i'β), and Y_i = 0 with probability 1 − F(x_i'β).

Then, for θ = β, we have:

L_i(θ; y | x) = f_{Y|x}(y_i | x_i; θ) = [F(x_i'β)]^{y_i} [1 − F(x_i'β)]^{1−y_i}

where f_{Y|x}(y_i | x_i; θ) denotes the conditional probability mass function (pmf) of Y_i.

3. The Likelihood Function

Example (Probit/Logit Models, cont'd)

The (conditional) likelihood and log-likelihood of the sample {y_i, x_i}_{i=1}^N are defined to be:

L_N(θ; y | x) = ∏_{i=1}^N f_{Y|x}(y_i | x_i; θ) = ∏_{i=1}^N [F(x_i'β)]^{y_i} [1 − F(x_i'β)]^{1−y_i}

ℓ_N(θ; y | x) = ∑_{i=1}^N y_i ln[F(x_i'β)] + ∑_{i=1}^N (1 − y_i) ln[1 − F(x_i'β)]
             = ∑_{i: y_i = 1} ln F(x_i'β) + ∑_{i: y_i = 0} ln[1 − F(x_i'β)]

where f_{Y|x}(y_i | x_i; θ) denotes the conditional probability mass function (pmf) of Y_i.
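A minimal sketch of this log-likelihood, assuming NumPy and SciPy are available; the link argument selects F = Φ (probit) or F = Λ (logit), the data are simulated and hypothetical, and the small eps guard is only there to avoid ln(0) in finite precision.

```python
import numpy as np
from scipy.stats import norm

def binary_loglik(beta, y, X, link="probit"):
    """l_N(beta; y|x) = sum_i y_i ln F(x_i'beta) + (1 - y_i) ln(1 - F(x_i'beta))."""
    xb = X @ beta
    F = norm.cdf(xb) if link == "probit" else 1.0 / (1.0 + np.exp(-xb))
    eps = 1e-12                      # numerical guard against ln(0)
    return np.sum(y * np.log(F + eps) + (1.0 - y) * np.log(1.0 - F + eps))

# hypothetical toy data: a constant and one characteristic
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (X @ np.array([-0.3, 1.0]) + rng.normal(size=500) > 0).astype(float)

print(binary_loglik(np.array([-0.3, 1.0]), y, X, link="probit"))
print(binary_loglik(np.array([-0.3, 1.0]), y, X, link="logit"))
```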

3. The Likelihood Function

Key Concepts

1. Likelihood (of a sample) function.
2. Log-likelihood (of a sample) function.
3. Conditional likelihood and log-likelihood function.
4. Likelihood and log-likelihood of one observation.

Section 4

Maximum Likelihood Estimator


4. Maximum Likelihood Estimator

Objectives

1. This section will be concerned with obtaining estimates of the parameters θ.
2. We will define the maximum likelihood estimator (MLE).
3. Before we begin that study, we consider the question of whether estimation of the parameters is possible at all: the question of identification.
4. We will introduce the invariance principle.

4. Maximum Likelihood Estimator

Definition (Identification)

The parameter vector θ is identified (estimable) if, for any other parameter vector θ* ≠ θ and for some data y, we have:

L_N(θ; y) ≠ L_N(θ*; y)

4. Maximum Likelihood Estimator

Example

Let us consider a latent (continuous and unobservable) variable Y_i* such that:

Y_i* = X_i'β + ε_i

with β = (β_1 .. β_K)', X_i = (X_{i1} ... X_{iK})', and where the error term ε_i is i.i.d. such that E(ε_i) = 0 and V(ε_i) = σ².

The distribution of ε_i is symmetric around 0 and we denote by G(.) the cdf of the standardized error term ε_i/σ. We assume that this cdf does not depend on σ or β. Example: ε_i/σ ∼ N(0, 1).

4. Maximum Likelihood Estimator

Example (cont'd)

We observe a dichotomic variable Y_i such that:

Y_i = 1 if Y_i* > 0, and Y_i = 0 otherwise.

Problem: are the parameters θ = (β' σ²)' identifiable?

4. Maximum Likelihood Estimator

Solution:

To answer this question, we have to compute the (log-)likelihood of the sample of observed data {y_i, x_i}_{i=1}^N. We have:

Pr(Y_i = 1 | X_i = x_i) = Pr(Y_i* > 0 | X_i = x_i)
                        = Pr(ε_i > −x_i'β)
                        = 1 − Pr(ε_i ≤ −x_i'β)
                        = 1 − Pr(ε_i/σ ≤ −x_i'β/σ)

If we denote by G(.) the cdf associated to the distribution of ε_i/σ, since this distribution is symmetric around 0, then we have:

Pr(Y_i = 1 | X_i = x_i) = G(x_i'β/σ)

4. Maximum Likelihood Estimator

Solution (cont'd):

For θ = (β' σ²)', we have:

ℓ_N(θ; y | x) = ∑_{i=1}^N y_i ln G(x_i'β/σ) + ∑_{i=1}^N (1 − y_i) ln[1 − G(x_i'β/σ)]

This log-likelihood depends only on the ratio β/σ. So, for θ = (β' σ²)' and θ* = (kβ' kσ)', with k ≠ 1:

ℓ_N(θ; y | x) = ℓ_N(θ*; y | x)

The parameters β and σ² cannot be identified. We can only identify the ratio β/σ.

4. Maximum Likelihood Estimator

Remark:

In this latent model, only the ratio β/σ can be identified, since

Pr(Y_i = 1 | X_i = x_i) = Pr(ε_i/σ < x_i'β/σ) = G(x_i'β/σ)

The choice of a logit or probit model implies a normalisation on the variance of ε_i/σ and then on σ²:

probit: Pr(Y_i = 1 | X_i = x_i) = Φ(x_i'β̃)   with β̃ = β/σ and V(ε_i/σ) = 1

4. Maximum Likelihood Estimator

Definition (Maximum Likelihood Estimator)

A maximum likelihood estimator θ̂ of θ ∈ Θ is a solution to the maximization problem:

θ̂ = arg max_{θ ∈ Θ} ℓ_N(θ; y | x)

or equivalently

θ̂ = arg max_{θ ∈ Θ} L_N(θ; y | x)

4. Maximum Likelihood Estimator

Remarks

1. Do not confuse the maximum likelihood estimator θ̂ (which is a random variable) and the maximum likelihood estimate θ̂(x), which corresponds to the realisation of θ̂ on the sample x.

2. Generally, it is easier to maximise the log-likelihood than the likelihood (especially for the distributions that belong to the exponential family).

3. When we consider an unconditional likelihood, the MLE is defined by:

θ̂ = arg max_{θ ∈ Θ} ℓ_N(θ; x)

4. Maximum Likelihood Estimator

Definition (Likelihood equations)

Under suitable regularity conditions, a maximum likelihood estimator (MLE) of θ is defined to be the solution of the first-order conditions (FOC):

∂ℓ_N(θ; y | x) / ∂θ |_{θ̂} = 0   (K×1)

or

∂L_N(θ; y | x) / ∂θ |_{θ̂} = 0   (K×1)

These conditions are generally called the likelihood or log-likelihood equations.
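In applications, the likelihood equations often have no closed-form solution and are solved numerically. A minimal sketch, assuming NumPy and SciPy are available and reusing the exponential duration example of Section 3 with simulated (hypothetical) data: maximizing ℓ_N is done by minimizing −ℓ_N, and the numerical solution matches the analytical MLE θ̂ = (1/N) ∑ d_i.

```python
import numpy as np
from scipy.optimize import minimize

# simulated durations (hypothetical), with E(D_i) = theta = 2
rng = np.random.default_rng(1)
d = rng.exponential(scale=2.0, size=500)

def neg_loglik(theta, d):
    # -l_N(theta; d) = N ln(theta) + (1/theta) sum d_i
    theta = theta[0]
    return len(d) * np.log(theta) + d.sum() / theta

res = minimize(neg_loglik, x0=[1.0], args=(d,), bounds=[(1e-6, None)])
print(res.x[0], d.mean())   # numerical MLE vs analytical MLE (sample mean)
```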

4. Maximum Likelihood Estimator

Notations

The first derivative (gradient) of the (conditional) log-likelihood, evaluated at the point θ̂, satisfies:

∂ℓ_N(θ; y | x) / ∂θ |_{θ̂} = ∂ℓ_N(θ̂; y | x) / ∂θ = g(θ̂; y | x) = 0

4. Maximum Likelihood Estimator

Remark

The log-likelihood equations correspond to a linear/nonlinear system of K equations with K unknown parameters θ_1, .., θ_K:

∂ℓ_N(θ; y | x) / ∂θ = ( ∂ℓ_N(θ; y | x)/∂θ_1 , ... , ∂ℓ_N(θ; y | x)/∂θ_K )' = ( 0 , ... , 0 )'

4. Maximum Likelihood Estimator

Definition (Second Order Conditions)

Second order condition (SOC) of the likelihood maximisation problem: the Hessian matrix evaluated at θ̂ must be negative definite:

∂²ℓ_N(θ; y | x) / ∂θ∂θ' |_{θ̂} is negative definite

or

∂²L_N(θ; y | x) / ∂θ∂θ' |_{θ̂} is negative definite

4. Maximum Likelihood Estimator

Remark:

The Hessian matrix (realisation) is a K×K matrix:

∂²ℓ_N(θ; y | x) / ∂θ∂θ' =
  [ ∂²ℓ_N/∂θ_1²        ∂²ℓ_N/∂θ_1∂θ_2     ..    ∂²ℓ_N/∂θ_1∂θ_K
    ∂²ℓ_N/∂θ_2∂θ_1     ∂²ℓ_N/∂θ_2²        ..    ..
    ..                  ..                 ..    ..
    ∂²ℓ_N/∂θ_K∂θ_1     ..                 ..    ∂²ℓ_N/∂θ_K²     ]

4. Maximum Likelihood Estimator

Reminders

A negative definite matrix is a symmetric (Hermitian, if there are complex entries) matrix all of whose eigenvalues are negative.

The n×n Hermitian matrix M is said to be negative definite if:

x'Mx < 0 for all non-zero x in ℝⁿ.

4. Maximum Likelihood Estimator

Example (MLE problem with one parameter)

Let us consider a real-valued random variable X with a pdf given by:

f_X(x; σ²) = (x/σ²) exp(−x²/(2σ²)),  ∀ x ∈ [0, +∞[

where σ² is an unknown parameter. Let us consider a sample {X_1, .., X_N} of i.i.d. random variables with the same arbitrary distribution as X.

Problem: What is the maximum likelihood estimator (MLE) of σ²?

4. Maximum Likelihood Estimator

Solution:

We have:

ln f_X(x; σ²) = −x²/(2σ²) + ln(x) − ln(σ²)

So, the log-likelihood of the sample {x_1, .., x_N} is:

ℓ_N(σ²; x) = ∑_{i=1}^N ln f_X(x_i; σ²) = −(1/(2σ²)) ∑_{i=1}^N x_i² + ∑_{i=1}^N ln(x_i) − N ln(σ²)

4. Maximum Likelihood Estimator

Solution (cont'd):

The maximum likelihood estimator σ̂² of σ² ∈ ℝ₊ is a solution to the maximization problem:

σ̂² = arg max_{σ² ∈ ℝ₊} ℓ_N(σ²; x) = arg max_{σ² ∈ ℝ₊} [ −(1/(2σ²)) ∑_{i=1}^N x_i² + ∑_{i=1}^N ln(x_i) − N ln(σ²) ]

FOC (log-likelihood equation):

∂ℓ_N(σ²; x) / ∂σ² = (1/(2σ⁴)) ∑_{i=1}^N x_i² − N/σ²

∂ℓ_N(σ²; x) / ∂σ² |_{σ̂²} = (1/(2σ̂⁴)) ∑_{i=1}^N x_i² − N/σ̂² = 0  ⟺  σ̂² = (1/(2N)) ∑_{i=1}^N x_i²

4. Maximum Likelihood Estimator

Solution (cont'd):

Check that σ̂² is a maximum:

∂ℓ_N(σ²; x) / ∂σ² = (1/(2σ⁴)) ∑_{i=1}^N x_i² − N/σ²

SOC:

∂²ℓ_N(σ²; x) / ∂(σ²)² |_{σ̂²} = −(1/σ̂⁶) ∑_{i=1}^N x_i² + N/σ̂⁴ = −2Nσ̂²/σ̂⁶ + N/σ̂⁴ = −N/σ̂⁴ < 0

since σ̂² = (1/(2N)) ∑_{i=1}^N x_i², i.e. ∑_{i=1}^N x_i² = 2Nσ̂². So σ̂² is a maximum.

4. Maximum Likelihood Estimator

Conclusion:

The maximum likelihood estimator (MLE) of the parameter σ² is defined by:

σ̂² = (1/(2N)) ∑_{i=1}^N X_i²

The maximum likelihood estimate of the parameter σ² is equal to:

σ̂²(x) = (1/(2N)) ∑_{i=1}^N x_i²
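A numerical check of this conclusion, assuming NumPy and SciPy are available: the pdf above is a Rayleigh density with scale √σ², so we can simulate (hypothetical) data, maximize ℓ_N(σ²; x) numerically and compare with the analytical estimate (1/(2N)) ∑ x_i².

```python
import numpy as np
from scipy.optimize import minimize_scalar

# f_X(x; s2) = (x/s2) exp(-x^2/(2 s2)) is a Rayleigh density with scale sqrt(s2)
rng = np.random.default_rng(2)
s2_true = 4.0
x = rng.rayleigh(scale=np.sqrt(s2_true), size=1000)

def neg_loglik(s2):
    # -l_N(s2; x) = (1/(2 s2)) sum x_i^2 - sum ln(x_i) + N ln(s2)
    return 0.5 * np.sum(x**2) / s2 - np.sum(np.log(x)) + len(x) * np.log(s2)

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, np.sum(x**2) / (2 * len(x)))   # numerical optimum vs (1/(2N)) sum x_i^2
```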

4. Maximum Likelihood Estimator

Example (Sample of normal variables)

We consider a sample {Y_1, .., Y_N} i.i.d. N(m, σ²). Problem: what are the MLE of m and σ²?

Solution: Let us define θ = (m σ²)'. Then:

θ̂ = arg max_{m ∈ ℝ, σ² ∈ ℝ₊} ℓ_N(θ; y)

with

ℓ_N(θ; y) = −(N/2) ln(σ²) − (N/2) ln(2π) − (1/(2σ²)) ∑_{i=1}^N (y_i − m)²

4. Maximum Likelihood Estimator

Solution (cont'd):

ℓ_N(θ; y) = −(N/2) ln(σ²) − (N/2) ln(2π) − (1/(2σ²)) ∑_{i=1}^N (y_i − m)²

∂ℓ_N(θ; y) / ∂m = (1/σ²) ∑_{i=1}^N (y_i − m)

∂ℓ_N(θ; y) / ∂σ² = −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (y_i − m)²

The first derivative of the log-likelihood function is defined by:

∂ℓ_N(θ; y) / ∂θ = ( ∂ℓ_N(θ; y)/∂m , ∂ℓ_N(θ; y)/∂σ² )'
               = ( (1/σ²) ∑_{i=1}^N (y_i − m) , −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (y_i − m)² )'

4. Maximum Likelihood Estimator

Solution (cont'd):

FOC (log-likelihood equations):

∂ℓ_N(θ; y) / ∂θ |_{θ̂} = ( (1/σ̂²) ∑_{i=1}^N (y_i − m̂) , −N/(2σ̂²) + (1/(2σ̂⁴)) ∑_{i=1}^N (y_i − m̂)² )' = ( 0 , 0 )'

So, the MLE correspond to the empirical mean and variance:

m̂ = Ȳ_N = (1/N) ∑_{i=1}^N Y_i

σ̂² = (1/N) ∑_{i=1}^N (Y_i − Ȳ_N)²
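A quick numerical check of this result, assuming NumPy is available and using simulated (hypothetical) data: the MLE of m is the sample mean and the MLE of σ² divides by N (not N − 1).

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=1.0, scale=2.0, size=1000)   # m = 1, sigma^2 = 4

m_hat = y.mean()                      # MLE of m
s2_hat = np.mean((y - m_hat) ** 2)    # MLE of sigma^2 (divides by N, not N - 1)

print(m_hat, s2_hat)
```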

4. Maximum Likelihood Estimator

Solution (cont'd):

∂ℓ_N(θ; y) / ∂m = (1/σ²) ∑_{i=1}^N (y_i − m)

∂ℓ_N(θ; y) / ∂σ² = −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (y_i − m)²

The Hessian matrix (realization) is:

∂²ℓ_N(θ; y) / ∂θ∂θ' = [ ∂²ℓ_N/∂m²       ∂²ℓ_N/∂m∂σ²
                        ∂²ℓ_N/∂σ²∂m     ∂²ℓ_N/∂(σ²)²  ]

                    = [ −N/σ²                          −(1/σ⁴) ∑_{i=1}^N (y_i − m)
                        −(1/σ⁴) ∑_{i=1}^N (y_i − m)     N/(2σ⁴) − (1/σ⁶) ∑_{i=1}^N (y_i − m)²  ]

4. Maximum Likelihood Estimator

Solution (cont'd): SOC

Evaluated at θ̂ = (m̂ σ̂²)':

∂²ℓ_N(θ; y) / ∂θ∂θ' |_{θ̂} = [ −N/σ̂²                           −(1/σ̂⁴) ∑_{i=1}^N (y_i − m̂)
                               −(1/σ̂⁴) ∑_{i=1}^N (y_i − m̂)      N/(2σ̂⁴) − (1/σ̂⁶) ∑_{i=1}^N (y_i − m̂)²  ]

                           = [ −N/σ̂²    0
                               0          −N/(2σ̂⁴)  ]

since ∑_{i=1}^N (y_i − m̂) = 0 (because Nm̂ = ∑_{i=1}^N y_i) and ∑_{i=1}^N (y_i − m̂)² = Nσ̂². This matrix is negative definite.

4. Maximum Likelihood Estimator

Example (Linear Regression Model)

Consider the linear regression model:

y_i = x_i'β + ε_i

where x_i = (x_{i1} ... x_{iK})' and β = (β_1 .. β_K)' are K×1 vectors. We assume that the ε_i are i.i.d. N(0, σ²).

Then, the (conditional) log-likelihood of the observations (x_i, y_i) is given by:

ℓ_N(θ; y | x) = −(N/2) ln(σ²) − (N/2) ln(2π) − (1/(2σ²)) ∑_{i=1}^N (y_i − x_i'β)²

where θ = (β' σ²)' is a (K+1)×1 vector. Question: what are the MLE of β and σ²?

4. Maximum Likelihood Estimator

Notation 1: The derivative of a scalar y with respect to a K×1 vector x = (x_1 ... x_K)' is the K×1 vector:

∂y/∂x = ( ∂y/∂x_1 , .. , ∂y/∂x_K )'

Notation 2: If x and β are two K×1 vectors, then:

∂(x'β)/∂β = x   (K×1)

4. Maximum Likelihood Estimator

Solution

θ̂ = arg max_{β ∈ ℝ^K, σ² ∈ ℝ₊} [ −(N/2) ln(σ²) − (N/2) ln(2π) − (1/(2σ²)) ∑_{i=1}^N (y_i − x_i'β)² ]

The first derivative of the log-likelihood function is a (K+1)×1 vector:

∂ℓ_N(θ; y | x) / ∂θ = ( ∂ℓ_N(θ; y | x)/∂β , ∂ℓ_N(θ; y | x)/∂σ² )'
                   = ( ∂ℓ_N/∂β_1 , .. , ∂ℓ_N/∂β_K , ∂ℓ_N/∂σ² )'

4. Maximum Likelihood Estimator

Solution (cont'd)

with

∂ℓ_N(θ; y | x) / ∂β = (1/σ²) ∑_{i=1}^N x_i (y_i − x_i'β)   (K×1)

∂ℓ_N(θ; y | x) / ∂σ² = −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (y_i − x_i'β)²   (1×1)

4. Maximum Likelihood Estimator

Solution (cont'd):

FOC (log-likelihood equations):

∂ℓ_N(θ; y | x) / ∂θ |_{θ̂} = ( (1/σ̂²) ∑_{i=1}^N x_i (y_i − x_i'β̂) , −N/(2σ̂²) + (1/(2σ̂⁴)) ∑_{i=1}^N (y_i − x_i'β̂)² )' = 0_{K+1}

So, the MLE is defined by:

β̂ = ( ∑_{i=1}^N X_i X_i' )⁻¹ ( ∑_{i=1}^N X_i Y_i )

σ̂² = (1/N) ∑_{i=1}^N (Y_i − X_i'β̂)²
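A minimal sketch with simulated (hypothetical) data, assuming NumPy is available: β̂ solves the normal equations (it is identical to the OLS estimator) and σ̂² divides the sum of squared residuals by N.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 500, 3
X = rng.normal(size=(N, K))
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.7, size=N)

# beta_hat = (sum x_i x_i')^{-1} sum x_i y_i  (identical to the OLS estimator)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# sigma2_hat = (1/N) sum (y_i - x_i' beta_hat)^2  (no degrees-of-freedom correction)
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)

print(beta_hat, sigma2_hat)
```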

4. Maximum Likelihood Estimator

Solution (cont'd):

The Hessian is a (K+1)×(K+1) matrix:

∂²ℓ_N(θ; y | x) / ∂θ∂θ' = [ ∂²ℓ_N/∂β∂β'   (K×K)     ∂²ℓ_N/∂β∂σ²   (K×1)
                            ∂²ℓ_N/∂σ²∂β'  (1×K)     ∂²ℓ_N/∂(σ²)²  (1×1)  ]

4. Maximum Likelihood Estimator

Solution (cont'd):

∂ℓ_N(θ; y | x) / ∂β = (1/σ²) ∑_{i=1}^N x_i (y_i − x_i'β)

∂ℓ_N(θ; y | x) / ∂σ² = −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (y_i − x_i'β)²

So, the Hessian matrix (realization) is equal to:

∂²ℓ_N(θ; y | x) / ∂θ∂θ' = [ −(1/σ²) ∑_{i=1}^N x_i x_i'                  −(1/σ⁴) ∑_{i=1}^N x_i (y_i − x_i'β)
                            −(1/σ⁴) ∑_{i=1}^N x_i' (y_i − x_i'β)         N/(2σ⁴) − (1/σ⁶) ∑_{i=1}^N (y_i − x_i'β)²  ]

4. Maximum Likelihood Estimator

Solution (cont'd):

Second Order Conditions (SOC). Evaluated at θ̂:

∂²ℓ_N(θ; y | x) / ∂θ∂θ' |_{θ̂} = [ −(1/σ̂²) ∑_{i=1}^N x_i x_i'                 −(1/σ̂⁴) ∑_{i=1}^N x_i (y_i − x_i'β̂)
                                   −(1/σ̂⁴) ∑_{i=1}^N x_i' (y_i − x_i'β̂)        N/(2σ̂⁴) − (1/σ̂⁶) ∑_{i=1}^N (y_i − x_i'β̂)²  ]

Since ∑_{i=1}^N x_i (y_i − x_i'β̂) = 0 (FOC) and Nσ̂² = ∑_{i=1}^N (y_i − x_i'β̂)², this reduces to:

∂²ℓ_N(θ; y | x) / ∂θ∂θ' |_{θ̂} = [ −(1/σ̂²) ∑_{i=1}^N x_i x_i'    0
                                    0                              −N/(2σ̂⁴)  ]

4. Maximum Likelihood Estimator

Solution (cont'd):

Second Order Conditions (SOC):

∂²ℓ_N(θ; y | x) / ∂θ∂θ' |_{θ̂} = [ −(1/σ̂²) ∑_{i=1}^N x_i x_i'    0
                                    0                              −N/(2σ̂⁴)  ]   is negative definite.

Since ∑_{i=1}^N x_i x_i' is positive definite (assumption), the Hessian matrix is negative definite and θ̂ is the MLE of the parameters θ.

4. Maximum Likelihood Estimator

Theorem (Equivariance or Invariance Principle)

Under suitable regularity conditions, the maximum likelihood estimator of a function g(.) of the parameter θ is g(θ̂), where θ̂ is the maximum likelihood estimator of θ.

4. Maximum Likelihood Estimator

Invariance Principle

The MLE is invariant to one-to-one transformations of θ. Any transformation that is not one-to-one either renders the model inestimable, if it is one-to-many, or imposes restrictions, if it is many-to-one.

For the practitioner, this result is extremely useful. For example, when a parameter appears in a likelihood function in the form 1/θ, it is usually worthwhile to reparameterize the model in terms of γ = 1/θ.

Example: Olsen (1978) and the reparametrisation of the likelihood function of the Tobit model.

4. Maximum Likelihood Estimator

Example (Invariance Principle)

Suppose that the normal log-likelihood in the previous example is parameterized in terms of the precision parameter, γ² = 1/σ². The log-likelihood

ℓ_N(m, σ²; y) = −(N/2) ln(σ²) − (N/2) ln(2π) − (1/(2σ²)) ∑_{i=1}^N (y_i − m)²

becomes

ℓ_N(m, γ²; y) = (N/2) ln(γ²) − (N/2) ln(2π) − (γ²/2) ∑_{i=1}^N (y_i − m)²

4. Maximum Likelihood Estimator

Example (Invariance Principle, cont'd)

The MLE for m is clearly still Ȳ_N. But the likelihood equation for γ² is now:

∂ℓ_N(m, γ²; y) / ∂γ² = N/(2γ²) − (1/2) ∑_{i=1}^N (y_i − m)²

and the MLE for γ² is now defined by:

γ̂² = N / ∑_{i=1}^N (Y_i − m̂)² = 1/σ̂²

as expected.
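A short numerical illustration of the invariance principle on simulated (hypothetical) normal data, assuming NumPy is available: estimating the precision γ² directly gives exactly 1/σ̂².

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(loc=0.5, scale=3.0, size=2000)

m_hat = y.mean()
s2_hat = np.mean((y - m_hat) ** 2)            # MLE of sigma^2
g2_hat = len(y) / np.sum((y - m_hat) ** 2)    # MLE of the precision gamma^2

print(g2_hat, 1.0 / s2_hat)                   # identical, as the invariance principle states
```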

Key Concepts

1. Identification.
2. Maximum likelihood estimator.
3. Maximum likelihood estimate.
4. Log-likelihood equations.
5. Equivariance or invariance principle.
6. Gradient vector and Hessian matrix (deterministic elements).

Section 5

Score, Hessian and Fisher Information


5. Score, Hessian and Fisher Information

Objectives

We aim at introducing the following concepts:

1. Score vector and gradient.
2. Hessian matrix.
3. Fisher information matrix of the sample.
4. Fisher information matrix of one observation, for marginal and conditional distributions.
5. Average Fisher information matrix of one observation.

5. Score, Hessian and Fisher Information

Definition (Score Vector)

The (conditional) score vector is a K×1 vector defined by:

s_N(θ; Y | x) ≡ s(θ) = ∂ℓ_N(θ; Y | x) / ∂θ   (K×1)

5. Score, Hessian and Fisher Information

Remarks:

The score s_N(θ; Y | x) is a vector of random elements since it depends on the random variables Y_1, .., Y_N.

For an unconditional log-likelihood ℓ_N(θ; x), the score is denoted by s_N(θ; X) = ∂ℓ_N(θ; X)/∂θ.

The score is a K×1 vector such that:

s_N(θ; Y | x) = ( ∂ℓ_N(θ; Y | x)/∂θ_1 , .. , ∂ℓ_N(θ; Y | x)/∂θ_K )'

5. Score, Hessian and Fisher Information

Corollary

By definition, the score vector satisfies:

E_θ( s_N(θ; Y | x) ) = 0_K

where E_θ means the expectation with respect to the conditional distribution Y | X = x.

5. Score, Hessian and Fisher Information

Remark: If we consider a variable X with a pdf f_X(x; θ), ∀ x ∈ ℝ, then E_θ(.) means the expectation with respect to the distribution of X:

E_θ( s_N(θ; X) ) = ∫ s_N(θ; x) f_X(x; θ) dx = 0

Remark: If we consider a variable Y with a conditional pdf f_{Y|x}(y; θ), ∀ y ∈ ℝ, then E_θ(.) means the expectation with respect to the distribution of Y | X = x:

E_θ( s_N(θ; Y | x) ) = ∫ s_N(θ; y | x) f_{Y|x}(y; θ) dy = 0

5. Score, Hessian and Fisher Information

Proof.

If we consider a variable X with a pdf f_X(x; θ), ∀ x ∈ ℝ, then (using the i.i.d. structure, so that each of the N individual scores has the same expectation):

E_θ( s_N(θ; X) ) = ∫ s_N(θ; x) f_X(x; θ) dx
                = N ∫ (∂ ln f_X(x; θ)/∂θ) f_X(x; θ) dx
                = N ∫ (1/f_X(x; θ)) (∂ f_X(x; θ)/∂θ) f_X(x; θ) dx
                = N ∫ ∂ f_X(x; θ)/∂θ dx
                = N ∂/∂θ ∫ f_X(x; θ) dx
                = N ∂(1)/∂θ
                = 0

5. Score, Hessian and Fisher Information

Example (Exponential Distribution)

Suppose that D_1, D_2, .., D_N are i.i.d. positive random variables, with D_i ∼ Exp(θ) and E(D_i) = θ > 0:

f_D(d; θ) = (1/θ) exp(−d/θ),  ∀ d ∈ ℝ₊

ℓ_N(θ; d) = −N ln(θ) − (1/θ) ∑_{i=1}^N d_i

The score (scalar) is equal to:

s_N(θ; D) = −N/θ + (1/θ²) ∑_{i=1}^N D_i

5. Score, Hessian and Fisher Information

Example (Exponential Distribution, cont'd)

By definition:

E_θ( s_N(θ; D) ) = E_θ( −N/θ + (1/θ²) ∑_{i=1}^N D_i )
                = −N/θ + (1/θ²) ∑_{i=1}^N E_θ(D_i)
                = −N/θ + Nθ/θ²
                = 0

5. Score, Hessian and Fisher Information

Example (Linear Regression Model)

Let us consider the previous linear regression model y_i = x_i'β + ε_i. The score is defined by:

s_N(θ; Y | x) = ( (1/σ²) ∑_{i=1}^N x_i (Y_i − x_i'β) , −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (Y_i − x_i'β)² )'

Then, we have:

E_θ( s_N(θ; Y | x) ) = E_θ( (1/σ²) ∑_{i=1}^N x_i (Y_i − x_i'β) , −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (Y_i − x_i'β)² )'

5. Score, Hessian and Fisher Information

Example (Linear Regression Model, cont'd)

We know that E_θ(Y_i | x) = x_i'β. So, we have:

E_θ( (1/σ²) ∑_{i=1}^N x_i (Y_i − x_i'β) ) = (1/σ²) ∑_{i=1}^N x_i ( E_θ(Y_i | x) − x_i'β )
                                         = (1/σ²) ∑_{i=1}^N x_i ( x_i'β − x_i'β )
                                         = 0_K

5. Score, Hessian and Fisher Information

Example (Linear Regression Model, cont'd)

E_θ( −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (Y_i − x_i'β)² ) = −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N E_θ( (Y_i − x_i'β)² )
   = −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N E_θ( (Y_i − E_θ(Y_i | x))² )
   = −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N V_θ(Y_i | x)
   = −N/(2σ²) + Nσ²/(2σ⁴)
   = 0

5. Score, Hessian and Fisher Information

Definition (Gradient)

The gradient vector associated to the log-likelihood function is a K×1 vector defined by:

g_N(θ; y | x) ≡ g(θ) = ∂ℓ_N(θ; y | x) / ∂θ   (K×1)

5. Score, Hessian and Fisher Information

Remarks

1. The gradient g_N(θ; y | x) is a vector of deterministic entries since it depends on the realisation y_1, .., y_N.

2. For an unconditional log-likelihood, the gradient is defined by g_N(θ; x) = ∂ℓ_N(θ; x)/∂θ.

3. The gradient is a K×1 vector such that:

g_N(θ; y | x) = ( ∂ℓ_N(θ; y | x)/∂θ_1 , .. , ∂ℓ_N(θ; y | x)/∂θ_K )'

5. Score, Hessian and Fisher Information

Corollary

By definition of the FOC, the gradient vector satisfies:

g_N(θ̂; y | x) = 0_K

where θ̂ = θ̂(x) is the maximum likelihood estimate of θ.

5. Score, Hessian and Fisher Information

Example (Linear regression model)

In the linear regression model, the gradient associated to the log-likelihood function is defined to be:

g_N(θ; y | x) = ( (1/σ²) ∑_{i=1}^N x_i (y_i − x_i'β) , −N/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^N (y_i − x_i'β)² )'

Given the FOC, we have:

g_N(θ̂; y | x) = ( (1/σ̂²) ∑_{i=1}^N x_i (y_i − x_i'β̂) , −N/(2σ̂²) + (1/(2σ̂⁴)) ∑_{i=1}^N (y_i − x_i'β̂)² )' = ( 0_K , 0 )'

5. Score, Hessian and Fisher Information

Definition (Hessian Matrix)

The Hessian matrix (deterministic) is defined to be:

H_N(θ; y | x) = ∂²ℓ_N(θ; y | x) / ∂θ∂θ'

Remark: The matrix ∂²ℓ_N(θ; Y | x)/∂θ∂θ' is also called the Hessian matrix, but do not confuse the two matrices: ∂²ℓ_N(θ; Y | x)/∂θ∂θ' (a random matrix) and ∂²ℓ_N(θ; y | x)/∂θ∂θ' (its realisation).

5. Score, Hessian and Fisher Information

Random variable:   score vector ∂ℓ_N(θ; Y | x)/∂θ   and   Hessian matrix ∂²ℓ_N(θ; Y | x)/∂θ∂θ'

Constant:          gradient vector ∂ℓ_N(θ; y | x)/∂θ   and   Hessian matrix ∂²ℓ_N(θ; y | x)/∂θ∂θ'

5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix)

The (conditional) Fisher information matrix associated to the sample {Y_1, .., Y_N} is the K×K variance-covariance matrix of the score vector:

I_N(θ) = V_θ( s_N(θ; Y | x) )

or equivalently:

I_N(θ) = V_θ( ∂ℓ_N(θ; Y | x) / ∂θ )

where V_θ means the variance with respect to the conditional distribution Y | X.

5. Score, Hessian and Fisher Information

Corollary

Since by definition E_θ( s_N(θ; Y | x) ) = 0, an alternative definition of the Fisher information matrix of the sample {Y_1, .., Y_N} is:

I_N(θ) = E_θ( s_N(θ; Y | x) s_N(θ; Y | x)' )   (K×K)

5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix)

The (conditional) Fisher information matrix of the sample {Y_1, .., Y_N} is also given by:

I_N(θ) = −E_θ( ∂²ℓ_N(θ; Y | x) / ∂θ∂θ' ) = −E_θ( H_N(θ; Y | x) )

5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix, summary)

The (conditional) Fisher information matrix of the sample {Y_1, .., Y_N} can alternatively be defined by:

I_N(θ) = V_θ( s_N(θ; Y | x) )

I_N(θ) = E_θ( s_N(θ; Y | x) s_N(θ; Y | x)' )

I_N(θ) = −E_θ( H_N(θ; Y | x) )

where E_θ and V_θ denote the mean and the variance with respect to the conditional distribution Y | X, and where s_N(θ; Y | x) denotes the score vector and H_N(θ; Y | x) the Hessian matrix.

5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix, summary)

The (conditional) Fisher information matrix of the sample {Y_1, .., Y_N} can alternatively be defined by:

I_N(θ) = V_θ( ∂ℓ_N(θ; Y | x) / ∂θ )

I_N(θ) = E_θ( (∂ℓ_N(θ; Y | x)/∂θ) (∂ℓ_N(θ; Y | x)/∂θ)' )

I_N(θ) = −E_θ( ∂²ℓ_N(θ; Y | x) / ∂θ∂θ' )

where E_θ and V_θ denote the mean and the variance with respect to the conditional distribution Y | X.

5. Score, Hessian and Fisher Information

Remarks

1. There are three equivalent definitions of the Fisher information matrix, and as a consequence three different consistent estimates of the Fisher information matrix (see later).

2. The Fisher information matrix associated to the sample {Y_1, .., Y_N} can also be defined from the Fisher information matrix for the observation i.

5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix)

The (conditional) Fisher information matrix associated to the i-th individual can be defined by:

I_i(θ) = V_θ( ∂ℓ_i(θ; Y_i | x_i) / ∂θ )

I_i(θ) = E_θ( (∂ℓ_i(θ; Y_i | x_i)/∂θ) (∂ℓ_i(θ; Y_i | x_i)/∂θ)' )

I_i(θ) = −E_θ( ∂²ℓ_i(θ; Y_i | x_i) / ∂θ∂θ' )

where E_θ and V_θ denote the expectation and variance with respect to the true conditional distribution Y_i | X_i.

5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix)

The (conditional) Fisher information matrix associated to the i-th individual can alternatively be defined by:

I_i(θ) = V_θ( s_i(θ; Y_i | x_i) )

I_i(θ) = E_θ( s_i(θ; Y_i | x_i) s_i(θ; Y_i | x_i)' )

I_i(θ) = −E_θ( H_i(θ; Y_i | x_i) )

where E_θ and V_θ denote the expectation and variance with respect to the true conditional distribution Y_i | X_i.

5. Score, Hessian and Fisher Information

Theorem

The Fisher information matrix associated to the sample {Y_1, .., Y_N} is equal to the sum of the individual Fisher information matrices:

I_N(θ) = ∑_{i=1}^N I_i(θ)

5. Score, Hessian and Fisher Information

Remark:

1. In the case of a marginal log-likelihood, the Fisher information matrix associated to the variable X_i is the same for all observations i:

I_i(θ) = I(θ)   ∀ i = 1, .., N

2. In the case of a conditional log-likelihood, the Fisher information matrix associated to the variable Y_i given X_i = x_i depends on the observation i:

I_i(θ) ≠ I_j(θ)   for i ≠ j

5. Score, Hessian and Fisher Information

Example (Exponential marginal distribution)

Suppose that D_1, D_2, .., D_N are i.i.d. positive random variables with D_i ∼ Exp(θ), E(D_i) = θ and V(D_i) = θ²:

f_D(d; θ) = (1/θ) exp(−d/θ),  ∀ d ∈ ℝ₊

ℓ_i(θ; d_i) = −ln(θ) − d_i/θ

Question: what is the Fisher information number (scalar) associated to D_i?

5. Score, Hessian and Fisher Information

Solution

ℓ_i(θ; d_i) = −ln(θ) − d_i/θ

The score of the observation D_i is defined by:

s_i(θ; D_i) = ∂ℓ_i(θ; D_i) / ∂θ = −1/θ + D_i/θ²

Let us use the three definitions of the information quantity I_i(θ):

I_i(θ) = V_θ( s_i(θ; D_i) ) = E_θ( s_i(θ; D_i)² ) = −E_θ( H_i(θ; D_i) )

5. Score, Hessian and Fisher Information

Solution (cont'd)

s_i(θ; D_i) = ∂ℓ_i(θ; D_i) / ∂θ = −1/θ + D_i/θ²

First definition:

I_i(θ) = V_θ( s_i(θ; D_i) ) = V_θ( −1/θ + D_i/θ² ) = (1/θ⁴) V_θ(D_i) = θ²/θ⁴ = 1/θ²

Conclusion: I_i(θ) = I(θ) = 1/θ² does not depend on i.

5. Score, Hessian and Fisher Information

Solution (cont'd)

Second definition:

s_i(θ; D_i) = ∂ℓ_i(θ; D_i) / ∂θ = −1/θ + D_i/θ²

I_i(θ) = E_θ( s_i(θ; D_i)² ) = E_θ( (−1/θ + D_i/θ²)² ) = V_θ( −1/θ + D_i/θ² ) = 1/θ²

since E_θ( −1/θ + D_i/θ² ) = 0.

Conclusion: I_i(θ) = I(θ) = 1/θ² does not depend on i.

5. Score, Hessian and Fisher Information

Solution (cont'd)

Third definition:

s_i(θ; D_i) = ∂ℓ_i(θ; D_i) / ∂θ = −1/θ + D_i/θ²

H_i(θ; D_i) = ∂²ℓ_i(θ; D_i) / ∂θ² = 1/θ² − 2D_i/θ³

I_i(θ) = −E_θ( H_i(θ; D_i) ) = −E_θ( 1/θ² − 2D_i/θ³ ) = −1/θ² + (2/θ³) E_θ(D_i) = −1/θ² + 2θ/θ³ = 1/θ²

Conclusion: I_i(θ) = I(θ) = 1/θ² does not depend on i.
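The three definitions can be checked by simulation. A minimal sketch, assuming NumPy is available and taking θ = 2 as a hypothetical value: the sample analogues of V_θ(s_i), E_θ(s_i²) and −E_θ(H_i) all approach 1/θ² = 0.25.

```python
import numpy as np

theta = 2.0
rng = np.random.default_rng(6)
D = rng.exponential(scale=theta, size=1_000_000)   # E(D_i) = theta

s = -1.0 / theta + D / theta**2          # score of one observation
H = 1.0 / theta**2 - 2.0 * D / theta**3  # Hessian of one observation

print(s.var(), np.mean(s**2), -H.mean(), 1.0 / theta**2)  # all approximately 0.25
```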

5. Score, Hessian and Fisher Information

Example (Linear regression model)

We have shown that:

∂²ℓ_i(θ; Y_i | x_i) / ∂θ∂θ' = [ −(1/σ²) x_i x_i'                 −(1/σ⁴) x_i (Y_i − x_i'β)
                                −(1/σ⁴) x_i' (Y_i − x_i'β)         1/(2σ⁴) − (1/σ⁶) (Y_i − x_i'β)²  ]

Question: what is the Fisher information matrix associated to the observation Y_i?

5. Score, Hessian and Fisher Information

Solution

The information matrix is then defined by:

I_i(θ) = −E_θ( ∂²ℓ_i(θ; Y_i | x_i) / ∂θ∂θ' ) = −E_θ( H_i(θ; Y_i | x_i) )   ((K+1)×(K+1))

where E_θ means the expectation with respect to the conditional distribution Y_i | X_i = x_i:

I_i(θ) = [ (1/σ²) x_i x_i'                        (1/σ⁴) x_i ( E_θ(Y_i) − x_i'β )
           (1/σ⁴) x_i' ( E_θ(Y_i) − x_i'β )        −1/(2σ⁴) + (1/σ⁶) E_θ( (Y_i − x_i'β)² )  ]

5. Score, Hessian and Fisher Information

Solution (cont'd)

I_i(θ) =
[ (1/σ²) x_i x_i⊤                           (1/σ⁴) x_i ( E_θ(Y_i) − x_i⊤β )
  (1/σ⁴) x_i⊤ ( E_θ(Y_i) − x_i⊤β )          −1/(2σ⁴) + (1/σ⁶) E_θ( ( Y_i − x_i⊤β )² ) ]

Given that E_θ(Y_i) = x_i⊤β and E_θ( ( Y_i − x_i⊤β )² ) = σ², then we have:

I_i(θ) =
[ (1/σ²) x_i x_i⊤    0
  0                  1/(2σ⁴) ]

Conclusion: I_i(θ) depends on x_i and I_i(θ) ≠ I_j(θ) for i ≠ j.

5. Score, Hessian and Fisher Information

Definition (Average Fisher information matrix)

For a conditional model, the average Fisher information matrix for one observation is defined by:

I(θ) = E_X( I_i(θ) )

where E_X denotes the expectation with respect to X (the conditioning variable).

5. Score, Hessian and Fisher Information

Summary: For a conditional model (and only for a conditional model), we have:

I(θ) = E_X[ V_θ( ∂ℓ_i(θ; Y_i | X_i)/∂θ ) ] = E_X[ V_θ( s_i(θ; Y_i | X_i) ) ]

I(θ) = E_X[ E_θ( ∂ℓ_i(θ; Y_i | X_i)/∂θ · ∂ℓ_i(θ; Y_i | X_i)/∂θ⊤ ) ] = E_X[ E_θ( s_i(θ; Y_i | X_i) s_i(θ; Y_i | X_i)⊤ ) ]

I(θ) = −E_X[ E_θ( ∂²ℓ_i(θ; Y_i | X_i)/∂θ∂θ⊤ ) ] = −E_X[ E_θ( H_i(θ; Y_i | X_i) ) ]

5. Score, Hessian and Fisher Information

Summary: For a marginal distribution, we have:

I(θ) = V_θ( ∂ℓ_i(θ; Y_i)/∂θ ) = V_θ( s_i(θ; Y_i) )

I(θ) = E_θ( ∂ℓ_i(θ; Y_i)/∂θ · ∂ℓ_i(θ; Y_i)/∂θ⊤ ) = E_θ( s_i(θ; Y_i) s_i(θ; Y_i)⊤ )

I(θ) = −E_θ( ∂²ℓ_i(θ; Y_i)/∂θ∂θ⊤ ) = −E_θ( H_i(θ; Y_i) )

5. Score, Hessian and Fisher Information

Example (Linear Regression Model)

In the linear regression model, the individual Fisher information matrix is equal to:

I_i(θ) =
[ (1/σ²) x_i x_i⊤    0
  0                  1/(2σ⁴) ]

and the average Fisher information matrix for one observation is defined by:

I(θ) = E_X( I_i(θ) ) =
[ (1/σ²) E_X( X_i X_i⊤ )    0
  0                         1/(2σ⁴) ]
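In practice E_X(X_i X_i⊤) is unknown; a natural plug-in is the sample average of x_i x_i⊤. The sketch below is only an illustration of this idea, with simulated regressors (a constant plus one covariate) and a hypothetical value σ² = 1.5; it assembles the (K+1)×(K+1) average information matrix by blocks.

```python
import numpy as np

# Illustrative plug-in for the average Fisher information of the linear model.
# Assumptions: simulated regressors, hypothetical sigma2 = 1.5.
rng = np.random.default_rng(1)
N, sigma2 = 500, 1.5
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # N x K with K = 2

Qx = X.T @ X / N                       # sample analogue of E_X(X_i X_i')
K = X.shape[1]

I_avg = np.zeros((K + 1, K + 1))
I_avg[:K, :K] = Qx / sigma2            # beta block: (1/sigma^2) E(X X')
I_avg[K, K] = 1.0 / (2.0 * sigma2**2)  # sigma^2 block: 1/(2 sigma^4)
print(I_avg)
```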

5. Score, Hessian and Fisher Information

Summary: in order to compute the average information matrix I(θ) for one observation:

Step 1: Compute the Hessian matrix or the score vector for one observation:

H_i(θ; Y_i | x_i) = ∂²ℓ_i(θ; Y_i | x_i)/∂θ∂θ⊤        s_i(θ; Y_i | x_i) = ∂ℓ_i(θ; Y_i | x_i)/∂θ

Step 2: Take the expectation (or the variance) with respect to the conditional distribution Y_i | X_i = x_i:

I_i(θ) = V_θ( s_i(θ; Y_i | x_i) ) = −E_θ( H_i(θ; Y_i | x_i) )

Step 3: Take the expectation with respect to the conditioning variable X:

I(θ) = E_X( I_i(θ) )

5. Score, Hessian and Fisher Information

Theorem

In a sampling model (with i.i.d. observations), one has:

I_N(θ) = N × I(θ)

5. Score, Hessian and Fisher Information

                       Marginal distribution        Conditional distribution (model)
pdf                    f_{X_i}(θ; x_i)              f_{Y_i | x_i}(θ; y | x)
Score vector           s_i(θ; X_i)                  s_i(θ; Y_i | x_i)
Hessian matrix         H_i(θ; X_i)                  H_i(θ; Y_i | x_i)
Information matrix     I_i(θ) = I(θ)                I_i(θ)
Av. infor. matrix      I(θ) = I_i(θ)                I(θ) = E_X( I_i(θ) )

with I_i(θ) = V_θ( s_i(θ; Y_i | x_i) ) = E_θ( s_i(θ; Y_i | x_i) s_i(θ; Y_i | x_i)⊤ ) = −E_θ( H_i(θ; Y_i | x_i) )

5. Score, Hessian and Fisher Information

How to estimate the average Fisher Information Matrix?

This matrix is particularly important, since we will see that it corresponds to the asymptotic variance covariance matrix of the MLE.

Let us assume that we have a consistent estimator θ̂ of the parameter θ. How can we estimate the average Fisher information matrix?

5. Score, Hessian and Fisher Information

Definition (Estimators of the average Fisher Information Matrix)

If θ̂ converges in probability to θ0 (true value), then:

Î_1( θ̂ ) = (1/N) Σ_{i=1}^N Î_i( θ̂ )

Î_2( θ̂ ) = (1/N) Σ_{i=1}^N [ ∂ℓ_i(θ̂; y_i | x_i)/∂θ ] [ ∂ℓ_i(θ̂; y_i | x_i)/∂θ⊤ ]

Î_3( θ̂ ) = −(1/N) Σ_{i=1}^N ∂²ℓ_i(θ̂; y_i | x_i)/∂θ∂θ⊤

are three consistent estimators of the average Fisher information matrix.

5. Score, Hessian and Fisher Information

1. The first estimator corresponds to the average of the N Fisher information matrices (for Y_1, .., Y_N) evaluated at the estimated value θ̂. This estimator will rarely be available in practice.

2. The second estimator corresponds to the average of the outer products of the individual score vectors evaluated at θ̂. It is known as the BHHH (Berndt, Hall, Hall, and Hausman, 1974) estimator or OPG (outer product of gradients) estimator:

   Î_2( θ̂ ) = (1/N) Σ_{i=1}^N g_i( θ̂; y_i | x_i ) g_i( θ̂; y_i | x_i )⊤

5. Score, Hessian and Fisher Information

3. The third estimator corresponds to the opposite of the average of the Hessian matrices evaluated at θ̂:

   Î_3( θ̂ ) = −(1/N) Σ_{i=1}^N H_i( θ̂; y_i | x_i )
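For the exponential example above, all three estimators can be written down explicitly. The sketch below is illustrative only, with a hypothetical true value θ0 = 2: it evaluates the OPG and Hessian-based estimators at θ̂ (the sample mean) and compares them with the analytical value 1/θ̂².

```python
import numpy as np

# Three estimators of the average Fisher information, exponential example.
# Assumption: hypothetical true value theta0 = 2.0 and N = 5000 draws.
rng = np.random.default_rng(2)
theta0, N = 2.0, 5000
d = rng.exponential(scale=theta0, size=N)

theta_hat = d.mean()                                   # MLE of theta
score = -1.0 / theta_hat + d / theta_hat**2            # s_i(theta_hat; d_i)
hess = 1.0 / theta_hat**2 - 2.0 * d / theta_hat**3     # H_i(theta_hat; d_i)

I1 = 1.0 / theta_hat**2       # analytical I_i(theta) evaluated at theta_hat
I2 = (score**2).mean()        # BHHH / OPG estimator
I3 = -hess.mean()             # minus the average Hessian
print(I1, I2, I3)             # all close to 1/theta0^2 = 0.25 in large samples
```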

5. Score, Hessian and Fisher Information

Problem

These three estimators are asymptotically equivalent, but they could give different results in finite samples. Available evidence suggests that in small or moderate sized samples, the Hessian is preferable (Greene, 2007). However, in most cases, the BHHH estimator will be the easiest to compute.


5. Score, Hessian and Fisher Information

Example (CAPM)

The empirical analogue of the CAPM is given by:

e_it = α_i + β_i e_mt + ε_t

where e_it = r_it − r_ft is the excess return of security i at time t, e_mt = r_mt − r_ft is the market excess return at time t, and ε_t is an i.i.d. error term with:

E(ε_t) = 0    V(ε_t) = σ²    E(ε_t | e_mt) = 0

5. Score, Hessian and Fisher Information

Example (CAPM, cont'd)

Data (data file: capm.xls): Microsoft, SP500 and Tbill (closing prices) from 11/1/1993 to 04/03/2003.

[Figure: scatter plot of the Microsoft excess return (RMSFT) against the SP500 excess return (RSP500), and time series plot of RSP500 and RMSFT.]

5. Score, Hessian and Fisher Information

Example (CAPM, cont'd)

We consider the CAPM model rewritten as follows:

e_it = x_t⊤β + ε_t    t = 1, .., T

where x_t = ( 1  e_mt )⊤ is a 2×1 vector of random variables, θ = ( α_i : β_i : σ² )⊤ = ( β⊤ : σ² )⊤ is a 3×1 vector of parameters, and where the error term ε_t satisfies E(ε_t) = 0, V(ε_t) = σ² and E(ε_t | e_mt) = 0.

5. Score, Hessian and Fisher Information

Example (CAPM, cont'd)

Question: Compute three alternative estimators of the asymptotic variance covariance matrix of the MLE θ̂ = ( β̂⊤ : σ̂² )⊤, where

β̂ = ( Σ_{t=1}^T x_t x_t⊤ )^{-1} ( Σ_{t=1}^T x_t e_it )

σ̂² = (1/T) Σ_{t=1}^T ( e_it − x_t⊤β̂ )²

5. Score, Hessian and Fisher Information

Solution

The ML estimator is defined by:

θ̂ = arg max_{β ∈ R², σ² ∈ R+}  −(T/2) ln(σ²) − (T/2) ln(2π) − (1/(2σ²)) Σ_{t=1}^T ( e_it − x_t⊤β )²

The problem is regular, so we have:

√T ( θ̂ − θ0 ) →d N( 0, I^{-1}(θ0) )

or equivalently θ̂ ~asy N( θ0, (1/T) I^{-1}(θ0) ).

The asymptotic variance covariance matrix of θ̂ is:

V_asy( θ̂ ) = (1/T) I^{-1}(θ0)

5. Score, Hessian and Fisher Information

Solution (cont'd)

First estimator: The information matrix at time t is defined by (third definition):

I_t(θ) = −E_θ( ∂²ℓ_t(θ; e_it | x_t)/∂θ∂θ⊤ ) = −E_θ( H_t(θ; e_it | x_t) )

where E_θ means the expectation with respect to the conditional distribution e_it | X_t = x_t:

I_t(θ) =
[ (1/σ²) x_t x_t⊤                            (1/σ⁴) x_t ( E_θ(e_it) − x_t⊤β )
  (1/σ⁴) x_t⊤ ( E_θ(e_it) − x_t⊤β )          −1/(2σ⁴) + (1/σ⁶) E_θ( ( e_it − x_t⊤β )² ) ]

5. Score, Hessian and Fisher Information

Solution (cont'd)

First estimator: Given that E_θ( e_it ) = x_t⊤β and E_θ( ( e_it − x_t⊤β )² ) = σ², then we have:

I_t(θ) =
[ (1/σ²) x_t x_t⊤    0
  0                  1/(2σ⁴) ]

where the zero blocks are 2×1 and 1×2.

5. Score, Hessian and Fisher Information

Solution (cont'd)

First estimator:

I_t(θ) =
[ (1/σ²) x_t x_t⊤    0
  0                  1/(2σ⁴) ]

An estimator of the asymptotic variance covariance matrix of θ̂ is given by:

Î_1( θ̂ ) = (1/T) Σ_{t=1}^T I_t( θ̂ )

V̂_asy( θ̂ ) = (1/T) Î_1^{-1}( θ̂ ) = (1/T)
[ (1/(T σ̂²)) Σ_{t=1}^T x_t x_t⊤    0
  0                                1/(2σ̂⁴) ]^{-1}

5. Score, Hessian and Fisher Information

Solution (cont'd)

Second definition (BHHH):

V̂_asy( θ̂ ) = (1/T) Î_2^{-1}( θ̂ )    with    Î_2( θ̂ ) = (1/T) Σ_{t=1}^T [ ∂ℓ_t(θ̂; e_it | x_t)/∂θ ] [ ∂ℓ_t(θ̂; e_it | x_t)/∂θ⊤ ]

and

∂ℓ_t(θ̂; e_it | x_t)/∂θ =
( (1/σ̂²) x_t ( e_it − x_t⊤β̂ )
  −1/(2σ̂²) + (1/(2σ̂⁴)) ( e_it − x_t⊤β̂ )² )
=
( (1/σ̂²) x_t ε̂_t
  −1/(2σ̂²) + (1/(2σ̂⁴)) ε̂_t² )

where ε̂_t = e_it − x_t⊤β̂.

5. Score, Hessian and Fisher Information

Solution (cont'd)

Second definition (BHHH):

[ ∂ℓ_t(θ̂; e_it | x_t)/∂θ ] [ ∂ℓ_t(θ̂; e_it | x_t)/∂θ⊤ ] =
[ (1/σ̂⁴) x_t x_t⊤ ε̂_t²                                      (1/σ̂²) x_t ε̂_t ( −1/(2σ̂²) + (1/(2σ̂⁴)) ε̂_t² )
  (1/σ̂²) x_t⊤ ε̂_t ( −1/(2σ̂²) + (1/(2σ̂⁴)) ε̂_t² )            ( −1/(2σ̂²) + (1/(2σ̂⁴)) ε̂_t² )² ]

5. Score, Hessian and Fisher Information

Solution (cont'd)

Second definition (BHHH): so we have

V̂_asy( θ̂ ) = (1/T) Î_2^{-1}( θ̂ )

with

Î_2( θ̂ ) = (1/T) Σ_{t=1}^T
[ (1/σ̂⁴) x_t x_t⊤ ε̂_t²                                      (1/σ̂²) x_t ε̂_t ( −1/(2σ̂²) + (1/(2σ̂⁴)) ε̂_t² )
  (1/σ̂²) x_t⊤ ε̂_t ( −1/(2σ̂²) + (1/(2σ̂⁴)) ε̂_t² )            ( −1/(2σ̂²) + (1/(2σ̂⁴)) ε̂_t² )² ]

5. Score, Hessian and Fisher Information

Solution (cont'd)

Third definition (inverse of the Hessian): we know that

V̂_asy( θ̂ ) = (1/T) Î_3^{-1}( θ̂ )    with    Î_3( θ̂ ) = −(1/T) Σ_{t=1}^T H_t( θ̂; e_it | x_t )

and

H_t( θ̂; e_it | x_t ) =
[ −(1/σ̂²) x_t x_t⊤                          −(1/σ̂⁴) x_t ( e_it − x_t⊤β̂ )
  −(1/σ̂⁴) x_t⊤ ( e_it − x_t⊤β̂ )             1/(2σ̂⁴) − (1/σ̂⁶) ( e_it − x_t⊤β̂ )² ]

5. Score, Hessian and Fisher Information

Solution (cont'd)

Third definition (inverse of the Hessian):

H_t( θ̂; e_it | x_t ) =
[ −(1/σ̂²) x_t x_t⊤                          −(1/σ̂⁴) x_t ( e_it − x_t⊤β̂ )
  −(1/σ̂⁴) x_t⊤ ( e_it − x_t⊤β̂ )             1/(2σ̂⁴) − (1/σ̂⁶) ( e_it − x_t⊤β̂ )² ]

Given the FOC (log-likelihood equations), Σ_{t=1}^T x_t ( e_it − x_t⊤β̂ ) = 0 and Σ_{t=1}^T ( e_it − x_t⊤β̂ )² = T σ̂², so:

−Σ_{t=1}^T H_t( θ̂; e_it | x_t ) =
[ (1/σ̂²) Σ_{t=1}^T x_t x_t⊤    0
  0                            T/(2σ̂⁴) ]

5. Score, Hessian and Fisher Information

Solution (cont'd)

Third definition (inverse of the Hessian):

So, in this case, the third estimator coincides with the first one:

Î_3( θ̂ ) = −(1/T) Σ_{t=1}^T H_t( θ̂; e_it | x_t )

V̂_asy( θ̂ ) = (1/T) Î_3^{-1}( θ̂ ) = (1/T)
[ (1/(T σ̂²)) Σ_{t=1}^T x_t x_t⊤    0
  0                                1/(2σ̂⁴) ]^{-1}

5. Score, Hessian and Fisher Information

Solution (cont'd)

These three estimates of the asymptotic variance covariance matrix are asymptotically equivalent, but they can be very different in finite samples:

V̂_asy( θ̂ ) = (1/T) Î^{-1}( θ̂ )

with

Î_1( θ̂ ) = (1/T) Σ_{t=1}^T I_t( θ̂ )

Î_2( θ̂ ) = (1/T) Σ_{t=1}^T [ ∂ℓ_t(θ̂; e_it | x_t)/∂θ ] [ ∂ℓ_t(θ̂; e_it | x_t)/∂θ⊤ ]

Î_3( θ̂ ) = −(1/T) Σ_{t=1}^T H_t( θ̂; e_it | x_t )
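The three estimators are straightforward to code. The sketch below is purely illustrative and uses simulated data rather than the capm.xls file, with hypothetical parameters α = 0.01, β = 1.2 and σ² = 0.04: it builds Î_1 (analytical), Î_2 (OPG) and Î_3 (minus the average Hessian) and prints the implied asymptotic variances.

```python
import numpy as np

# Three estimators of V_asy(theta_hat) for the regression e_it = x_t'beta + eps_t.
# Hypothetical values: alpha = 0.01, beta = 1.2, sigma2 = 0.04, T = 1000.
rng = np.random.default_rng(3)
T, alpha, beta, sigma2 = 1000, 0.01, 1.2, 0.04
e_m = rng.normal(0.0, 0.05, size=T)                  # market excess return
X = np.column_stack([np.ones(T), e_m])               # x_t = (1, e_mt)'
e_i = alpha + beta * e_m + rng.normal(0.0, np.sqrt(sigma2), size=T)

b_hat = np.linalg.solve(X.T @ X, X.T @ e_i)          # ML = OLS estimator of beta
res = e_i - X @ b_hat                                # eps_hat_t
s2 = (res**2).mean()                                 # ML estimator of sigma^2

I1 = np.zeros((3, 3))                                # analytical information at theta_hat
I1[:2, :2] = (X.T @ X) / (T * s2)
I1[2, 2] = 1.0 / (2 * s2**2)

# BHHH / OPG: average outer product of the per-observation scores.
g = np.column_stack([X * (res / s2)[:, None], -1 / (2 * s2) + res**2 / (2 * s2**2)])
I2 = g.T @ g / T

I3 = np.zeros((3, 3))                                # minus the average Hessian
I3[:2, :2] = (X.T @ X) / (T * s2)
I3[:2, 2] = I3[2, :2] = (X * res[:, None]).sum(axis=0) / (T * s2**2)
I3[2, 2] = -1 / (2 * s2**2) + (res**2).mean() / s2**3

for name, I in (("I1", I1), ("I2", I2), ("I3", I3)):
    print(name, np.diag(np.linalg.inv(I) / T))       # estimated asymptotic variances
```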


Key Concepts

1. Gradient and Hessian Matrix (deterministic elements).
2. Score Vector (random elements).
3. Hessian Matrix (random elements).
4. Fisher information matrix associated to the sample.
5. (Average) Fisher information matrix for one observation.

Section 6

Properties of Maximum Likelihood Estimators


6. Properties of Maximum Likelihood Estimators

Objectives

Is the MLE a good estimator? Under which conditions is the MLE unbiased, consistent, and does it correspond to the BUE (Best Unbiased Estimator)? => regularity conditions

Is the MLE consistent?

Is the MLE optimal or efficient?

What is the asymptotic distribution of the MLE? The magic of the MLE...

6. Properties of Maximum Likelihood Estimators

Definition (Regularity conditions)

Greene (2007) identifies three regularity conditions:

R1 The first three derivatives of ln f_X(θ; x_i) with respect to θ are continuous and finite for almost all x_i and for all θ. This condition ensures the existence of a certain Taylor series approximation and the finite variance of the derivatives of ℓ_i(θ; x_i).

R2 The conditions necessary to obtain the expectations of the first and second derivatives of ln f_X(θ; X_i) are met.

R3 For all values of θ, | ∂³ ln f_X(θ; x_i) / ∂θ_i ∂θ_j ∂θ_k | is less than a function that has a finite expectation. This condition will allow us to truncate the Taylor series.

6. Properties of Maximum Likelihood Estimators

Definition (Regularity conditions, Zivot 2001)

A pdf f_X(θ; x) is regular if and only if:

R1 The support of the random variable X, S_X = { x : f_X(θ; x) > 0 }, does not depend on θ.

R2 f_X(θ; x) is at least three times differentiable with respect to θ, and these derivatives are continuous.

R3 The true value of θ lies in a compact set Θ.

6. Properties of Maximum Likelihood Estimators

Under these regularity conditions, the maximum likelihood estimator θ̂ possesses many appealing properties:

1. The maximum likelihood estimator is consistent.

2. The maximum likelihood estimator is asymptotically normal (the magic of the MLE..).

3. The maximum likelihood estimator is asymptotically optimal or efficient.

4. The maximum likelihood estimator is equivariant: if θ̂ is an estimator of θ then g(θ̂) is an estimator of g(θ).

6. Properties of Maximum Likelihood Estimators

Theorem (Consistency)

Under regularity conditions, the maximum likelihood estimator is consistent:

θ̂ →p θ0   as N → ∞

or equivalently plim_{N→∞} θ̂ = θ0, where θ0 denotes the true value of the parameter θ.

6. Properties of Maximum Likelihood Estimators

Sketch of the proof (Greene, 2007)

Because θ̂ is the MLE, in any finite sample, for any θ ≠ θ̂ (including the true θ0) it must be true that

ln L_N( θ̂; y | x ) ≥ ln L_N( θ; y | x )

Consider, then, the random variable L_N(θ; Y | x) / L_N(θ0; Y | x). Because the log function is strictly concave, from Jensen's inequality we have:

E_θ0 [ ln ( L_N(θ; Y | x) / L_N(θ0; Y | x) ) ] ≤ ln E_θ0 [ L_N(θ; Y | x) / L_N(θ0; Y | x) ]

6. Properties of Maximum Likelihood Estimators

Sketch of the proof, cont'd

The expectation on the right-hand side is exactly equal to one, as

E_θ0 [ L_N(θ; Y | x) / L_N(θ0; Y | x) ] = ∫ [ L_N(θ; y | x) / L_N(θ0; y | x) ] L_N(θ0; y | x) dy = ∫ L_N(θ; y | x) dy = 1

since it is simply the integral of a joint density.

6. Properties of Maximum Likelihood Estimators

Sketch of the proof, cont'd

So we have:

E_θ0 [ ln ( L_N(θ; Y | x) / L_N(θ0; Y | x) ) ] ≤ ln E_θ0 [ L_N(θ; Y | x) / L_N(θ0; Y | x) ] = ln(1) = 0

Divide this inequality by N to produce:

E_θ0 [ (1/N) ln L_N(θ; Y | x) ] ≤ E_θ0 [ (1/N) ln L_N(θ0; Y | x) ]

This produces a central result: the likelihood inequality.

6. Properties of Maximum Likelihood Estimators

Theorem (Likelihood Inequality)

The expected value of the log-likelihood is maximized at the true value of the parameters. For any θ, including θ̂:

E_θ0 [ (1/N) ℓ_N( θ0; Y | x ) ] ≥ E_θ0 [ (1/N) ℓ_N( θ; Y | x ) ]

6. Properties of Maximum Likelihood Estimators

Sketch of the proof, cont'd

Notice that

(1/N) ℓ_N( θ; Y | x ) = (1/N) Σ_{i=1}^N ℓ_i( θ; Y_i | x_i )

where the elements ℓ_i(θ; Y_i | x_i) for i = 1, .., N are i.i.d. So, using a law of large numbers, we get:

(1/N) ℓ_N( θ; Y | x )  →p  E_θ0 [ (1/N) ℓ_N( θ; Y | x ) ]   as N → ∞

6. Properties of Maximum Likelihood Estimators

Sketch of the proof, cont'd

The likelihood inequality for θ = θ̂ implies

E_θ0 [ (1/N) ℓ_N( θ0; Y | x ) ] ≥ E_θ0 [ (1/N) ℓ_N( θ̂; Y | x ) ]

with

(1/N) ℓ_N( θ0; Y | x )  →p  E_θ0 [ (1/N) ℓ_N( θ0; Y | x ) ]

(1/N) ℓ_N( θ̂; Y | x )  →p  E_θ0 [ (1/N) ℓ_N( θ̂; Y | x ) ]

and thus

lim_{N→∞} Pr [ (1/N) ℓ_N( θ0; Y | x ) ≥ (1/N) ℓ_N( θ̂; Y | x ) ] = 1

6. Properties of Maximum Likelihood Estimators

Sketch of the proof, cont'd

So we have two results:

lim_{N→∞} Pr [ (1/N) ℓ_N( θ0; Y | x ) ≥ (1/N) ℓ_N( θ̂; Y | x ) ] = 1

(1/N) ℓ_N( θ̂; Y | x ) ≥ (1/N) ℓ_N( θ0; Y | x )   ∀ N

It necessarily implies that

(1/N) ℓ_N( θ̂; Y | x )  →p  (1/N) ℓ_N( θ0; Y | x )

If θ is a scalar, we immediately have θ̂ →p θ0. For a more general case with dim(θ) = K, see a formal proof in Amemiya (1985).

Amemiya T. (1985), Advanced Econometrics. Harvard University Press.

6. Properties of Maximum Likelihood Estimators

Remark

The proof of the consistency of the MLE is much easier when we have a closed-form expression for the maximum likelihood estimator:

θ̂ = θ̂( X_1, .., X_N )

6. Properties of Maximum Likelihood Estimators

Example

Suppose that D_1, D_2, .., D_N are i.i.d. positive random variables, with D_i ~ Exp(θ0) and

f_D( d; θ ) = (1/θ) exp( −d/θ ),   ∀ d ∈ R+

E_θ( D_i ) = θ0    V_θ( D_i ) = θ0²

where θ0 is the true value of θ. Question: show that the MLE is consistent.

6. Properties of Maximum Likelihood Estimators

Solution

The log-likelihood function associated to the sample { d_1, .., d_N } is defined by:

ℓ_N( θ; d ) = −N ln(θ) − (1/θ) Σ_{i=1}^N d_i

We admit that the maximum likelihood estimator corresponds to the sample mean:

θ̂ = (1/N) Σ_{i=1}^N D_i

6. Properties of Maximum Likelihood Estimators

Solution, cont'd

Then, we have:

E_θ( θ̂ ) = (1/N) Σ_{i=1}^N E_θ( D_i ) = θ0    (θ̂ is unbiased)

V_θ( θ̂ ) = (1/N²) Σ_{i=1}^N V_θ( D_i ) = θ0²/N

As a consequence, E_θ( θ̂ ) = θ0 and lim_{N→∞} V_θ( θ̂ ) = 0, so

θ̂ →p θ0   as N → ∞
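A quick simulation makes this consistency visible: as N grows, the sample-mean MLE concentrates around the true value. The sketch below is illustrative only, with a hypothetical θ0 = 2.

```python
import numpy as np

# Consistency of the exponential MLE (sample mean): the estimate concentrates
# around the hypothetical true value theta0 = 2.0 as N grows.
rng = np.random.default_rng(4)
theta0 = 2.0
for N in (10, 100, 1_000, 10_000, 100_000):
    theta_hat = rng.exponential(scale=theta0, size=N).mean()
    print(f"N = {N:>6d}  theta_hat = {theta_hat:.4f}")
```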

6. Properties of Maximum Likelihood Estimators

Lemma

Under stronger conditions, the maximum likelihood estimator converges almost surely to θ0:

θ̂ →a.s. θ0   as N → ∞    ⟹    θ̂ →p θ0   as N → ∞

6. Properties of Maximum Likelihood Estimators

1. If we restrict ourselves to the class of unbiased estimators (linear and nonlinear), then we define the best estimator as the one with the smallest variance.

2. With linear estimators (next chapter), the Gauss-Markov theorem tells us that the ordinary least squares (OLS) estimator is best (BLUE).

3. When we expand the class of estimators to include linear and nonlinear estimators, it turns out that we can establish an absolute lower bound on the variance of any unbiased estimator of θ under certain conditions.

4. Then, if an unbiased estimator θ̂ has a variance that is equal to the lower bound, we have found the best unbiased estimator (BUE).

6. Properties of Maximum Likelihood Estimators

Definition (Cramer-Rao or FDCR bound)

Let X_1, .., X_N be an i.i.d. sample with pdf f_X(θ; x). Let θ̂ be an unbiased estimator of θ, i.e. E_θ( θ̂ ) = θ. If f_X(θ; x) is regular, then

V_θ( θ̂ ) ≥ I_N^{-1}( θ0 )    (FDCR or Cramer-Rao bound)

where I_N(θ0) denotes the Fisher information number for the sample, evaluated at the true value θ0.

6. Properties of Maximum Likelihood Estimators

Remarks

1. Hence, the Cramer-Rao bound is the inverse of the information matrix associated to the sample. Reminder: three definitions for I_N(θ0):

   I_N( θ0 ) = V_θ( ∂ℓ_N(θ; Y | x)/∂θ |_{θ0} )

   I_N( θ0 ) = E_θ( ∂ℓ_N(θ; Y | x)/∂θ |_{θ0} · ∂ℓ_N(θ; Y | x)/∂θ⊤ |_{θ0} )

   I_N( θ0 ) = −E_θ( ∂²ℓ_N(θ; Y | x)/∂θ∂θ⊤ |_{θ0} )

2. If θ is a vector, then V_θ( θ̂ ) ≥ I_N^{-1}( θ0 ) means that the matrix V_θ( θ̂ ) − I_N^{-1}( θ0 ) is positive semi-definite.

6. Properties of Maximum Likelihood Estimators

Theorem (Efficiency)

Under regularity conditions, the maximum likelihood estimator is asymptotically efficient and attains the FDCR (Frechet - Darmois - Cramer - Rao) or Cramer-Rao bound:

V_θ( θ̂ ) = I_N^{-1}( θ0 )

where I_N(θ0) denotes the Fisher information matrix associated to the sample, evaluated at the true value θ0.

6. Properties of Maximum Likelihood Estimators

Example (Exponential Distribution)

Suppose that D_1, D_2, .., D_N are i.i.d. positive random variables, with D_i ~ Exp(θ0) and

f_D( d; θ ) = (1/θ) exp( −d/θ ),   ∀ d ∈ R+

E_θ( D_i ) = θ0    V_θ( D_i ) = θ0²

where θ0 is the true value of θ. Question: show that the MLE is efficient.

6. Properties of Maximum Likelihood Estimators

Solution

We have shown that the maximum likelihood estimator corresponds to the sample mean:

θ̂ = (1/N) Σ_{i=1}^N D_i    E_θ( θ̂ ) = θ0    V_θ( θ̂ ) = θ0²/N

6. Properties of Maximum Likelihood Estimators

Solution, cont'd

The log-likelihood function is:

ℓ_N( θ; d ) = −N ln(θ) − (1/θ) Σ_{i=1}^N d_i

The score vector is defined by:

s_N( θ; D ) = ∂ℓ_N(θ; D)/∂θ = −N/θ + (1/θ²) Σ_{i=1}^N D_i

6. Properties of Maximum Likelihood Estimators

Solution, cont'd

Let us use one of the three definitions of the information quantity I_N(θ):

I_N( θ ) = V_θ( ∂ℓ_N(θ; D)/∂θ ) = V_θ( −N/θ + (1/θ²) Σ_{i=1}^N D_i ) = (1/θ⁴) Σ_{i=1}^N V_θ( D_i ) = N θ²/θ⁴ = N/θ²

Then θ̂ is efficient and attains the Cramer-Rao bound:

V_θ( θ̂ ) = I_N^{-1}( θ0 ) = θ0²/N
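As a check, the empirical variance of the MLE over repeated samples should be close to the Cramer-Rao bound θ0²/N. The sketch below is illustrative only, with hypothetical values θ0 = 2 and N = 200.

```python
import numpy as np

# Empirical variance of the exponential MLE versus the Cramer-Rao bound theta0^2 / N.
# Hypothetical values: theta0 = 2.0, N = 200, 20000 Monte Carlo replications.
rng = np.random.default_rng(5)
theta0, N, reps = 2.0, 200, 20_000

estimates = rng.exponential(scale=theta0, size=(reps, N)).mean(axis=1)
print("empirical variance :", estimates.var())
print("Cramer-Rao bound   :", theta0**2 / N)
```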

6. Properties of Maximum Likelihood Estimators

Theorem (Convergence of the MLE)

Under suitable regularity conditions, the MLE is asymptotically normally distributed:

√N ( θ̂ − θ0 ) →d N( 0, I^{-1}(θ0) )

where θ0 denotes the true value of the parameter and I(θ0) the (average) Fisher information matrix for one observation.

6. Properties of Maximum Likelihood Estimators

Corollary

Another way to write this result is to say that, for a large sample size N, the MLE θ̂ is approximately distributed according to a normal distribution:

θ̂ ~asy N( θ0, N^{-1} I^{-1}(θ0) )

or equivalently

θ̂ ~asy N( θ0, I_N^{-1}(θ0) )

where I_N(θ0) = N × I(θ0) denotes the Fisher information matrix associated to the sample.

6. Properties of Maximum Likelihood Estimators

Definition (Asymptotic Variance)

The asymptotic variance of the MLE is defined by:

V_asy( θ̂ ) = I_N^{-1}( θ0 )

where I_N(θ0) denotes the Fisher information matrix associated to the sample. This asymptotic variance of the MLE corresponds to the Cramer-Rao or FDCR bound.

6. Properties of Maximum Likelihood Estimators

The magic of the MLE


6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence)

At the maximum likelihood estimate, the gradient of the log-likelihood equals zero (FOC):

g_N( θ̂; y | x ) = ∂ℓ_N(θ; y | x)/∂θ |_{θ̂} = 0_K    (K×1 vector)

where θ̂ = θ̂(x) denotes here the ML estimate. Expand this set of equations in a Taylor series around the true parameters θ0. We will use the mean value theorem to truncate the Taylor series at the second term:

g_N( θ̂ ) = g_N( θ0 ) + H_N( θ̄ ) ( θ̂ − θ0 ) = 0

The Hessian is evaluated at a point θ̄ that is between θ̂ and θ0, for instance θ̄ = ω θ̂ + (1 − ω) θ0 for some 0 < ω < 1.

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)

We then rearrange this equation and multiply the result by √N to obtain:

√N ( θ̂ − θ0 ) = [ −H_N( θ̄ ) ]^{-1} √N g_N( θ0 )

By dividing H_N(θ̄) and g_N(θ0) by N, we obtain:

√N ( θ̂ − θ0 ) = [ −(1/N) H_N( θ̄ ) ]^{-1} √N (1/N) g_N( θ0 ) = [ −(1/N) H_N( θ̄ ) ]^{-1} √N ḡ( θ0 )

where ḡ(θ0) denotes the sample mean of the individual gradient vectors:

ḡ( θ0 ) = (1/N) g_N( θ0 ) = (1/N) Σ_{i=1}^N g_i( θ0; y_i | x_i )

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)

Let us now consider the same expression in terms of random variables: θ̂ now denotes the ML estimator, H_N(θ̄; Y | x) the Hessian and s_N(θ0; Y | x) the score vector. We have:

√N ( θ̂ − θ0 ) = [ −(1/N) H_N( θ̄; Y | x ) ]^{-1} √N s̄( θ0; Y | x )

where the score vectors associated to the variables Y_i are i.i.d. and

s̄( θ0; Y | x ) = (1/N) Σ_{i=1}^N s_i( θ0; Y_i | x_i )

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)

Let us consider the first element:

s̄( θ0 ) = (1/N) Σ_{i=1}^N s_i( θ0; Y_i | x_i )

The individual scores s_i(θ0; Y_i | x_i) are i.i.d. with

E_θ( s_i( θ0; Y_i | x_i ) ) = 0

E_X( V_θ( s_i( θ0; Y_i | x_i ) ) ) = E_X( I_i( θ0 ) ) = I( θ0 )

By using the Lindeberg-Levy Central Limit Theorem, we have:

√N s̄( θ0 )  →d  N( 0, I(θ0) )

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)

We know that:

(1/N) H_N( θ̄; Y | x ) = (1/N) Σ_{i=1}^N H_i( θ̄; Y_i | x_i )

where the Hessian matrices H_i(θ̄; Y_i | x_i) are i.i.d. Besides, because plim( θ̂ − θ0 ) = 0, plim( θ̄ − θ0 ) = 0 as well. By applying a law of large numbers, we get:

(1/N) H_N( θ̄; Y | x )  →p  E_X( E_θ( H_i( θ0; Y_i | x_i ) ) )

with

E_X( E_θ( H_i( θ0; Y_i | x_i ) ) ) = E_X( E_θ( ∂²ℓ_i(θ; Y_i | x_i)/∂θ∂θ⊤ |_{θ0} ) ) = −I( θ0 )

6. Properties of Maximum Likelihood Estimators

Reminder:

If X_N (K×K) and Y_N (K×1) verify

X_N  →p  X        Y_N  →d  N( 0, Σ )

then

X_N Y_N  →d  N( 0, X Σ X⊤ )

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)

Here we have:

√N ( θ̂ − θ0 ) = [ −(1/N) H_N( θ̄; Y | x ) ]^{-1} √N s̄( θ0; Y | x )

with

[ −(1/N) H_N( θ̄; Y | x ) ]^{-1}  →p  I^{-1}( θ0 )    (a symmetric matrix)

√N s̄( θ0 )  →d  N( 0, I(θ0) )

Then, we get:

√N ( θ̂ − θ0 )  →d  N( 0, I^{-1}(θ0) I(θ0) I^{-1}(θ0) )

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)

And finally....

√N ( θ̂ − θ0 )  →d  N( 0, I^{-1}(θ0) )

The magic of the MLE.....

6. Properties of Maximum Likelihood Estimators

Example (Exponential Distribution)

Suppose that D_1, D_2, .., D_N are i.i.d. positive random variables, with D_i ~ Exp(θ0) and

f_D( d; θ ) = (1/θ) exp( −d/θ ),   ∀ d ∈ R+

E_θ( D_i ) = θ0    V_θ( D_i ) = θ0²

where θ0 is the true value of θ. Question: what is the asymptotic distribution of the MLE? Propose a consistent estimator of the asymptotic variance of θ̂.

6. Properties of Maximum Likelihood Estimators

Solution

We have shown that θ̂ = (1/N) Σ_{i=1}^N D_i and:

s_i( θ; D_i ) = ∂ℓ_i(θ; D_i)/∂θ = −1/θ + D_i/θ²

The (average) Fisher information associated to D_i is:

I( θ ) = V_θ( −1/θ + D_i/θ² ) = (1/θ⁴) V_θ( D_i ) = 1/θ²

Then, the asymptotic distribution of θ̂ is:

√N ( θ̂ − θ0 )  →d  N( 0, θ0² )

or equivalently θ̂ ~asy N( θ0, θ0²/N ).

6. Properties of Maximum Likelihood Estimators

Solution, cont'd

The asymptotic variance of θ̂ is:

V_asy( θ̂ ) = θ0²/N

A consistent estimator of V_asy( θ̂ ) is simply defined by:

V̂_asy( θ̂ ) = θ̂²/N
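The normal approximation can be checked with a short simulation: standardize √N(θ̂ − θ0) by θ0 and compare its moments with those of a standard normal. The sketch below is illustrative only, with hypothetical values θ0 = 2 and N = 500.

```python
import numpy as np

# Asymptotic normality check: sqrt(N) (theta_hat - theta0) / theta0 should be
# approximately N(0, 1). Hypothetical values: theta0 = 2.0, N = 500.
rng = np.random.default_rng(6)
theta0, N, reps = 2.0, 500, 50_000

theta_hat = rng.exponential(scale=theta0, size=(reps, N)).mean(axis=1)
z = np.sqrt(N) * (theta_hat - theta0) / theta0

print("mean        :", z.mean())                     # close to 0
print("variance    :", z.var())                      # close to 1
print("P(|z|>1.96) :", (np.abs(z) > 1.96).mean())    # close to 0.05
```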

6. Properties of Maximum Likelihood Estimators

Example (Linear Regression Model)

Let us consider the previous linear regression model y_i = x_i⊤β + ε_i, with ε_i ~ N.i.d.( 0, σ² ). Let us denote θ the (K+1)×1 vector defined by θ = ( β⊤ σ² )⊤. The MLE of θ is defined by:

β̂ = ( Σ_{i=1}^N X_i X_i⊤ )^{-1} ( Σ_{i=1}^N X_i Y_i )

σ̂² = (1/N) Σ_{i=1}^N ( Y_i − X_i⊤β̂ )²

Question: what is the asymptotic distribution of θ̂? Propose an estimator of the asymptotic variance.

6. Properties of Maximum Likelihood Estimators

Solution

This model satisfies the regularity conditions. We have shown that the average Fisher information matrix is equal to:

I( θ ) =
[ (1/σ²) E_X( X_i X_i⊤ )    0
  0                         1/(2σ⁴) ]

From the MLE convergence theorem, we get immediately:

√N ( θ̂ − θ0 )  →d  N( 0, I^{-1}(θ0) )

where θ0 is the true value of θ.

6. Properties of Maximum Likelihood Estimators

Solution, cont'd

The asymptotic variance covariance matrix of θ̂ is equal to:

V_asy( θ̂ ) = N^{-1} I^{-1}( θ0 ) = I_N^{-1}( θ0 )

with

I_N( θ ) =
[ (N/σ²) E_X( X_i X_i⊤ )    0
  0                         N/(2σ⁴) ]

6. Properties of Maximum Likelihood Estimators

Solution, cont'd

A consistent estimate of I_N(θ) is:

Î_N( θ̂ ) =
[ (N/σ̂²) Q̂_X    0
  0              N/(2σ̂⁴) ]

with Q̂_X = (1/N) Σ_{i=1}^N x_i x_i⊤, and V̂_asy( θ̂ ) = Î_N^{-1}( θ̂ ).

6. Properties of Maximum Likelihood Estimators

Solution, cont'd

Thus we get:

β̂ ~asy N( β0, σ̂² ( Σ_{i=1}^N x_i x_i⊤ )^{-1} )

σ̂² ~asy N( σ0², 2σ̂⁴/N )
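Putting the pieces together, the sketch below (simulated data with hypothetical values β = (1, 0.5)⊤ and σ² = 0.25) computes the ML estimates and the plug-in asymptotic standard errors σ̂²(Σ x_i x_i⊤)^{-1} for β̂ and 2σ̂⁴/N for σ̂².

```python
import numpy as np

# Plug-in asymptotic variance of the ML estimator in the linear regression model.
# Hypothetical values: beta = (1.0, 0.5)', sigma2 = 0.25, N = 1000.
rng = np.random.default_rng(7)
N, beta, sigma2 = 1000, np.array([1.0, 0.5]), 0.25
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=N)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)   # ML = OLS estimator of beta
s2_hat = ((y - X @ b_hat)**2).mean()        # ML estimator of sigma^2 (no dof correction)

V_beta = s2_hat * np.linalg.inv(X.T @ X)    # estimated avar of beta_hat
V_s2 = 2.0 * s2_hat**2 / N                  # estimated avar of sigma2_hat

print("beta_hat:", b_hat, " s.e.:", np.sqrt(np.diag(V_beta)))
print("s2_hat  :", s2_hat, " s.e.:", np.sqrt(V_s2))
```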

6. Properties of Maximum Likelihood Estimators

Summary

Under regularity conditions:

1. The MLE is consistent.

2. The MLE is asymptotically efficient and its variance attains the FDCR or Cramer-Rao bound.

3. The MLE is asymptotically normally distributed.

6. Properties of Maximum Likelihood Estimators

But finite sample properties can be very different from large sample properties:

1. The maximum likelihood estimator is consistent but can be severely biased in finite samples.

2. The estimation of the variance-covariance matrix can be seriously doubtful in finite samples.

6. Properties of Maximum Likelihood Estimators

Theorem (Equivariance)

Under regularity conditions, and if g(.) is a continuously differentiable function of θ defined from R^K to R^P, then:

g( θ̂ )  →p  g( θ0 )

√N ( g(θ̂) − g(θ0) )  →d  N( 0, G(θ0) I^{-1}(θ0) G(θ0)⊤ )

where θ0 is the true value of the parameters and the P×K matrix G(θ0) is defined by:

G( θ ) = ∂g(θ)/∂θ⊤
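As an illustration, take the exponential example and the transformation g(θ) = 1/θ (the rate parameter), so that G(θ) = −1/θ² and the implied asymptotic variance of g(θ̂) is G(θ0)² θ0²/N = 1/(N θ0²). The sketch below (hypothetical θ0 = 2, N = 500) checks this delta-method variance by simulation.

```python
import numpy as np

# Delta method for g(theta) = 1/theta in the exponential model.
# G(theta) = -1/theta^2, so avar(g(theta_hat)) = G^2 * theta0^2 / N = 1 / (N * theta0^2).
# Hypothetical values: theta0 = 2.0, N = 500.
rng = np.random.default_rng(8)
theta0, N, reps = 2.0, 500, 50_000

theta_hat = rng.exponential(scale=theta0, size=(reps, N)).mean(axis=1)
g_hat = 1.0 / theta_hat

print("empirical variance of g(theta_hat):", g_hat.var())
print("delta-method variance             :", 1.0 / (N * theta0**2))
```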

Key Concepts of the Chapter 2

1. Likelihood and log-likelihood function
2. Maximum likelihood estimator (MLE) and maximum likelihood estimate
3. Gradient and Hessian matrix (deterministic elements)
4. Score vector and Hessian matrix (random elements)
5. Fisher information matrix associated to the sample
6. (Average) Fisher information matrix for one observation
7. FDCR or Cramer-Rao bound: the notion of efficiency
8. Asymptotic distribution of the MLE
9. Asymptotic variance of the MLE
10. Estimator of the asymptotic variance of the MLE

End of Chapter 2
