# Chapter2 MLE

```Chapter 2: Maximum Likelihood Estimation
Christophe Hurlin
University of Orléans
December 9, 2013
Christophe Hurlin (University of Orléans)
December 9, 2013
1 / 207
Section 1
Introduction
Christophe Hurlin (University of Orléans)
December 9, 2013
2 / 207
1. Introduction
The Maximum Likelihood Estimation (MLE) is a method of
estimating the parameters of a model. This estimation method is one
of the most widely used.
The method of maximum likelihood selects the set of values of the
model parameters that maximizes the likelihood function. Intuitively,
this maximizes the "agreement" of the selected model with the
observed data.
The Maximum-likelihood Estimation gives an uni…ed approach to
estimation.
Christophe Hurlin (University of Orléans)
December 9, 2013
3 / 207
2. The Principle of Maximum Likelihood
What are the main properties of the maximum likelihood estimator?
I
I
I
I
Is it asymptotically unbiased?
Is it asymptotically e¢ cient? Under which condition(s)?
Is it consistent?
What is the asymptotic distribution?
How to apply the maximum likelihood principle to the multiple linear
regression model, to the Probit/Logit Models etc. ?
... All of these questions are answered in this lecture...
Christophe Hurlin (University of Orléans)
December 9, 2013
4 / 207
1. Introduction
The outline of this chapter is the following:
Section 2: The principle of the maximum likelihood estimation
Section 3: The likelihood function
Section 4: Maximum likelihood estimator
Section 5: Score, Hessian and Fisher information
Section 6: Properties of maximum likelihood estimators
Christophe Hurlin (University of Orléans)
December 9, 2013
5 / 207
1. Introduction
References
Amemiya T. (1985), Advanced Econometrics. Harvard University Press.
Greene W. (2007), Econometric Analysis, sixth edition, Pearson - Prentice Hil
Pelgrin, F. (2010), Lecture notes Advanced Econometrics, HEC Lausanne (a
special thank)
Ruud P., (2000) An introduction to Classical Econometric Theory, Oxford
University Press.
Zivot, E. (2001), Maximum Likelihood Estimation, Lecture notes.
Christophe Hurlin (University of Orléans)
December 9, 2013
6 / 207
Section 2
The Principle of Maximum Likelihood
Christophe Hurlin (University of Orléans)
December 9, 2013
7 / 207
2. The Principle of Maximum Likelihood
Objectives
In this section, we present a simple example in order
1
To introduce the notations
2
To introduce the notion of likelihood and log-likelihood.
3
To introduce the concept of maximum likelihood estimator
4
To introduce the concept of maximum likelihood estimate
Christophe Hurlin (University of Orléans)
December 9, 2013
8 / 207
2. The Principle of Maximum Likelihood
Example
Suppose that X1 ,X2 ,
,XN are i.i.d. discrete random variables, such that
Xi
Pois (θ ) with a pmf (probability mass function) de…ned as:
Pr (Xi = xi ) =
exp ( θ ) θ xi
xi !
where θ is an unknown parameter to estimate.
Christophe Hurlin (University of Orléans)
December 9, 2013
9 / 207
2. The Principle of Maximum Likelihood
Question: What is the probability of observing the particular sample
fx1 , x2 , .., xN g, assuming that a Poisson distribution with as yet unknown
parameter θ generated the data?
This probability is equal to
Pr ((X1 = x1 ) \ ... \ (XN = xN ))
Christophe Hurlin (University of Orléans)
December 9, 2013
10 / 207
2. The Principle of Maximum Likelihood
Since the variables Xi are i.i.d. this joint probability is equal to the
product of the marginal probabilities
N
Pr ((X1 = x1 ) \ ... \ (XN = xN )) =
∏ Pr (Xi = xi )
i =1
Given the pmf of the Poisson distribution, we have:
N
Pr ((X1 = x1 ) \ ... \ (XN = xN )) =
∏
i =1
exp ( θ ) θ xi
xi !
N
= exp ( θN )
θ ∑ i =1 x i
N
∏ xi !
i =1
Christophe Hurlin (University of Orléans)
December 9, 2013
11 / 207
2. The Principle of Maximum Likelihood
De…nition
This joint probability is a function of θ (the unknown parameter) and
corresponds to the likelihood of the sample fx1 , .., xN g denoted by
LN (θ; x1 .., xN ) = Pr ((X1 = x1 ) \ ... \ (XN = xN ))
with
1
N
LN (θ; x1 .., xN ) = exp ( θN )
θ ∑ =1 x i
N
∏ xi !
i =1
Christophe Hurlin (University of Orléans)
December 9, 2013
12 / 207
2. The Principle of Maximum Likelihood
Example
Let us assume that for N = 10, we have a realization of the sample equal
to f5, 0, 1, 1, 0, 3, 2, 3, 4, 1g , then:
LN (θ; x1 .., xN ) = Pr ((X1 = x1 ) \ ... \ (XN = xN ))
LN (θ; x1 .., xN ) =
Christophe Hurlin (University of Orléans)
e 10θ θ 20
207, 360
December 9, 2013
13 / 207
2. The Principle of Maximum Likelihood
Question: What value of θ would make this sample most probable?
Christophe Hurlin (University of Orléans)
December 9, 2013
14 / 207
2. The Principle of Maximum Likelihood
This Figure plots the function LN (θ; x ) for various values of θ. It has a
single mode at θ = 2, which would be the maximum likelihood estimate,
or MLE, of θ.
1.2
x 10
-8
1
0.8
0.6
0.4
0.2
0
0
Christophe Hurlin (University of Orléans)
0.5
1
1.5
2
θ
2.5
3
3.5
4
December 9, 2013
15 / 207
2. The Principle of Maximum Likelihood
Christophe Hurlin (University of Orléans)
December 9, 2013
16 / 207
2. The Principle of Maximum Likelihood
Consider maximizing the likelihood function LN (θ; x1 .., xN ) with respect to
θ. Since the log function is monotonically increasing, we usually maximize
ln LN (θ; x1 .., xN ) instead. In this case:
N
ln LN (θ; x1 .., xN ) =
θN + ln (θ ) ∑ xi
N
ln
i =1
∂ ln LN (θ; x1 .., xN )
=
∂θ
N+
∂2 ln LN (θ; x1 .., xN )
=
∂θ 2
1
θ2
Christophe Hurlin (University of Orléans)
∏ xi !
i =1
1 N
xi
θ i∑
=1
N
∑ xi < 0
i =1
December 9, 2013
17 / 207
2. The Principle of Maximum Likelihood
Under suitable regularity conditions, the maximum likelihood estimate
(estimator) is de…ned as:
b
θ = arg max ln LN (θ; x1 .., xN )
θ 2R+
FOC :
∂ ln LN (θ; x1 .., xN )
∂θ
=
b
θ
N+
N
() bθ = (1/N ) ∑ xi
1 N
∑ xi = 0
b
θ i =1
i =1
SOC :
b
θ is a maximum.
∂2 ln LN (θ; x1 .., xN )
∂θ 2
Christophe Hurlin (University of Orléans)
=
b
θ
1
2
b
θ
N
∑ xi < 0
i =1
December 9, 2013
18 / 207
2. The Principle of Maximum Likelihood
The maximum likelihood estimate (realization) is:
b
θ
1
b
θ (x ) =
N
N
∑ xi
i =1
Given the sample f5, 0, 1, 1, 0, 3, 2, 3, 4, 1g , we have b
θ (x ) = 2.
The maximum likelihood estimator (random variable) is:
1
b
θ=
N
Christophe Hurlin (University of Orléans)
N
∑ Xi
i =1
December 9, 2013
19 / 207
2. The Principle of Maximum Likelihood
Continuous variables
The reference to the probability of observing the given sample is not
exact in a continuous distribution, since a particular sample has
probability zero. Nonetheless, the principle is the same.
The likelihood function then corresponds to the pdf associated to the
joint distribution of (X1 , X2 , .., XN ) evaluated at the point
(x1 , x2 , .., xN ) :
LN (θ; x1 .., xN ) = fX 1 ,..,X N (x1 , x2 , .., xN ; θ )
Christophe Hurlin (University of Orléans)
December 9, 2013
20 / 207
2. The Principle of Maximum Likelihood
Continuous variables
If the random variables fX1 , X2 , .., XN g are i.i.d. then we have:
N
LN (θ; x1 .., xN ) =
∏ fX (xi ; θ )
i =1
where fX (xi ; θ ) denotes the pdf of the marginal distribution of X (or
Xi since all the variables have the same distribution).
The values of the parameters that maximize LN (θ; x1 .., xN ) or its log
are the maximum likelihood estimates, denoted b
θ (x ).
Christophe Hurlin (University of Orléans)
December 9, 2013
21 / 207
Section 3
The Likelihood function
De…nitions and Notations
Christophe Hurlin (University of Orléans)
December 9, 2013
22 / 207
3. The Likelihood Function
Objectives
1
Introduce the notations for an estimation problem that deals with a
marginal distribution or a conditional distribution (model).
2
De…ne the likelihood and the log-likelihood functions.
3
Introduce the concept of conditional log-likelihood
4
Propose various applications
Christophe Hurlin (University of Orléans)
December 9, 2013
23 / 207
3. The Likelihood Function
Notations
Let us consider a continuous random variable X , with a pdf denoted
fX (x; θ ) , for x 2 R
θ = (θ 1 ..θ K )| is a K
that θ 2 Θ RK .
1 vector of unknown parameters. We assume
Let us consider a sample fX1 , .., XN g of i.i.d. random variables with
the same arbitrary distribution as X .
The realisation of fX1 , .., XN g (the data set..) is denoted fx1 , .., xN g
or x for simplicity.
Christophe Hurlin (University of Orléans)
December 9, 2013
24 / 207
3. The Likelihood Function
Example (Normal distribution)
If X
N m, σ2 then:
1
fX (z; θ ) = p
exp
σ 2π
(z
m )2
2σ2
!
8z 2 R
with K = 2 and
θ=
Christophe Hurlin (University of Orléans)
m
σ2
December 9, 2013
25 / 207
3. The Likelihood Function
De…nition (Likelihood Function)
The likelihood function is de…ned to be:
LN : Θ
RN ! R+
N
(θ; x1 , .., xn ) 7 ! LN (θ; x1 , .., xn ) = ∏ fX (xi ; θ )
i =1
Christophe Hurlin (University of Orléans)
December 9, 2013
26 / 207
3. The Likelihood Function
De…nition (Log-Likelihood Function)
The log-likelihood function is de…ned to be:
`N : Θ
RN ! R
N
(θ; x1 , .., xn ) 7 ! `N (θ; x1 , .., xn ) =
Christophe Hurlin (University of Orléans)
∑ ln fX (xi ; θ )
i =1
December 9, 2013
27 / 207
3. The Likelihood Function
Remark: the (log-)likelihood function depends on two type of arguments:
Christophe Hurlin (University of Orléans)
December 9, 2013
28 / 207
3. The Likelihood Function
Notations: In the rest of the chapter, I will use the following alternative
notations:
LN (θ; x )
`N (θ; x )
L (θ; x1 , .., xN )
ln LN (θ; x )
Christophe Hurlin (University of Orléans)
LN (θ )
ln L (θ; x1 , .., xN )
ln LN (θ )
December 9, 2013
29 / 207
3. The Likelihood Function
Example (Sample of Normal Variables)
We consider a sample fY1 , .., YN g N .i.d. m, σ2 and denote the
|
realisation by fy1 , .., yN g or y . Let us de…ne θ = m σ2 , then we have:
!
N
1
(yi m)2
LN (θ; y ) = ∏ p
exp
2σ2
i =1 σ 2π
!
N
1
N
/2
= σ2 2π
exp
(yi m)2
2σ2 i∑
=1
`N (θ; y ) =
N
ln σ2
2
Christophe Hurlin (University of Orléans)
N
ln (2π )
2
1
2σ2
N
∑ (yi
m )2
i =1
December 9, 2013
30 / 207
3. The Likelihood Function
De…nition (Likelihood of one observation)
We can also de…ne the (log-)likelihood of one observation xi :
N
Li (θ; x ) = fX (xi ; θ )
with LN (θ; x ) =
∏ Li (θ; x )
i =1
N
`i (θ; x ) = ln fX (xi ; θ )
with `N (θ; x ) =
∑ `i (θ; x )
i =1
Christophe Hurlin (University of Orléans)
December 9, 2013
31 / 207
3. The Likelihood Function
Example (Exponential Distribution)
Suppose that D1 , D2 , .., DN are i.i.d. positive random variables (durations
for instance), with Di
Exp (θ ) with θ 0 and
Li (θ; di ) = fD (di ; θ ) =
1
exp
θ
`i (θ; di ) = ln (fD (di ; θ )) =
ln (θ )
Then we have:
LN (θ; d ) = θ
`N (θ; d ) =
Christophe Hurlin (University of Orléans)
N
exp
N ln (θ )
di
θ
1 N
di
θ i∑
=1
di
θ
!
1 N
di
θ i∑
=1
December 9, 2013
32 / 207
3. The Likelihood Function
Remark: The (log-)likelihood and the Maximum Likelihood Estimator are
always based on an assumption (bet?) about the distribution of Y .
Yi
Distribution with pdf fY (y ; θ ) =) LN (θ; y ) and `N (θ; y )
In practice, generally we have no idea about the true distribution of Yi ....
A solution: the Quasi-Maximum Likelihood Estimator
Christophe Hurlin (University of Orléans)
December 9, 2013
33 / 207
3. The Likelihood Function
Remark: We can also use the MLE to estimate the parameters of a
model (with dependent and explicative variables) such that:
y = g (x; θ ) + ε
where β denotes the vector or parameters, X a set of explicative variables,
ε and error term and g (.) the link function.
In this case, we generally consider the conditional distribution of Y given
X , which is equivalent to unconditional distribution of the error term ε :
YjX
Christophe Hurlin (University of Orléans)
D () ε
D
December 9, 2013
34 / 207
3. The Likelihood Function
Notations (model)
Let us consider two continuous random variables Y and X
We assume that Y has a conditional distribution given X = x with a
pdf denoted f Y jx (y ; θ ) , for y 2 R
θ = (θ 1 ..θ K )| is a K
that θ 2 Θ RK .
1 vector of unknown parameters. We assume
Let us consider a sample fX1 , YN gN
i =1 of i.i.d. random variables and
N
a realisation fx1 , yN gi =1 .
Christophe Hurlin (University of Orléans)
December 9, 2013
35 / 207
3. The Likelihood Function
De…nition (Conditional likelihood function)
The (conditional) likelihood function is de…ned to be:
N
LN (θ; y j x ) =
∏ f Y jX ( yi j xi ; θ )
i =1
where f Y jX ( yi j xi ; θ ) denotes the conditional pdf of Yi given Xi .
Remark: The conditional likelihood function is the joint conditional
density of the data in which the unknown parameter is .
Christophe Hurlin (University of Orléans)
December 9, 2013
36 / 207
3. The Likelihood Function
De…nition (Conditional log-likelihood function)
The (conditional) log-likelihood function is de…ned to be:
N
`N (θ; y j x ) =
∑ ln f Y jX ( yi j xi ; θ )
i =1
where f Y jX ( yi j xi ; θ ) denotes the conditional pdf of Yi given Xi .
Christophe Hurlin (University of Orléans)
December 9, 2013
37 / 207
3. The Likelihood Function
Remark: The conditional probability density function (pdf) can denoted
by:
f Y jX ( y j x; θ ) fY ( y j X = x; θ ) fY ( y j X = x )
Christophe Hurlin (University of Orléans)
December 9, 2013
38 / 207
3. The Likelihood Function
Example (Linear Regression Model)
Consider the following linear regression model:
yi = Xi> β + εi
where Xi is a K 1 vector of random variables and β = ( β1 ..βK )> a
K 1 vector of parameters. We assume that the εi are i.i.d. with
εi
N 0, σ2 . Then, the conditional distribution of Yi given Xi = xi is:
Yi j xi
N xi> β, σ2
1
exp
Li (θ; y j x) = f Y jx ( yi j xi ; θ) = p
σ 2π
where θ = β> σ2
>
is K + 1
Christophe Hurlin (University of Orléans)
yi
xi> β
2σ2
2!
1 vector.
December 9, 2013
39 / 207
3. The Likelihood Function
Example (Linear Regression Model, cont’d)
Then, if we consider an i.i.d. sample fyi , xi gN
i =1 , the corresponding
conditional (log-)likelihood is de…ned to be:
LN (θ; y j x) =
N
N
i =1
i =1
∏ f Y jX ( yi j xi ; θ) = ∏
=
`N (θ; y j x) =
σ2 2π
N /2
N
ln σ2
2
Christophe Hurlin (University of Orléans)
exp
1
2σ2
yi
1
p exp
σ 2π
N
∑
yi
xi> β
i =1
N
ln (2π )
2
1
2σ2
N
∑
yi
2
!
xi> β
2σ2
xi> β
2!
2
i =1
December 9, 2013
40 / 207
3. The Likelihood Function
Remark: Given this principle, we can derive the (conditional) likelihood
and the log-likelihood functions associated to a speci…c sample for any
type of econometric model in which the conditional distribution of the
dependent variable is known.
Dichotomic models: probit, logit models etc.
Censored regression models: Tobit etc.
Times series models: AR, ARMA, VAR etc.
GARCH models
....
Christophe Hurlin (University of Orléans)
December 9, 2013
41 / 207
3. The Likelihood Function
Example (Probit/Logit Models)
Let us consider a dichotomic variable Yi such that Yi = 1 if the …rm i is in
default and 0 otherwise. Xi = (Xi 1 ...XiK ) denotes a a K 1 vector of
individual caracteristics. We assume that the conditional probability of
default is de…ned as:
Pr ( Yi = 1j Xi = xi ) = F xi> β
where β = ( β1 ..βK )> is a vector of parameters and F (.) is a cdf
(cumlative distribution function).
Yi =
Christophe Hurlin (University of Orléans)
1
0
with probability F xi> β
with probability 1 F xi> β
December 9, 2013
42 / 207
3. The Likelihood Function
Remark: Given the choice of the link function F (.) we get a probit or a
logit model.
Christophe Hurlin (University of Orléans)
December 9, 2013
43 / 207
3. The Likelihood Function
De…nition (Probit Model)
In a probit model, the conditional probability of the event Yi = 1 is:
Pr ( Yi = 1j Xi = xi ) = Φ (xi β) =
>
xR
i β
∞
1
p exp
2π
u2
2
du
where Φ (.) denotes the cdf of the standard normal distribution.
Christophe Hurlin (University of Orléans)
December 9, 2013
44 / 207
3. The Likelihood Function
De…nition (Logit Model)
In a logit model, the conditional probability of the event Yi = 1 is:
Pr ( Yi = 1j Xi = xi ) = Λ xi> β =
1
1 + exp
xi> β
where Λ (.) denotes the cdf of the logistic distribution.
Christophe Hurlin (University of Orléans)
December 9, 2013
45 / 207
3. The Likelihood Function
Example (Probit/Logit Models, cont’d)
What is the (conditional) log-likelihood of the sample fyi , xi gN
i =1 ?
Whatever the choice of F (.), the conditional distribution of Yi given
Xi = xi is a Bernouilli distribution since:
Yi =
with probability F xi> β
with probability 1 F xi> β
1
0
Then, for θ = β, we have:
h
iy i h
1
Li (θ; y j x) = f Y jx ( yi j xi ; θ) = F xi> β
F xi> β
i1
yi
where f Y jx ( yi j xi ; θ) denotes the conditional probability mass function
(pmf) of Yi .
Christophe Hurlin (University of Orléans)
December 9, 2013
46 / 207
3. The Likelihood Function
Example (Probit/Logit Models, cont’d)
The (conditional) likelihood and log-likelihood of the sample fyi , xi gN
i =1 are
de…ned to be:
N
LN (θ; y j x) =
∏ f Y jx ( yi j xi ; θ) =
i =1
N
`N (θ; y j x) =
=
∑ yi ln
i =1
∑
i : y i =1
h
F xi> β
ln F
xi> β
N
∏
i =1
i
+
h
F xi> β
N
+ ∑ (1
i =1
∑
i : y i =0
h
ln 1
iy i h
F xi> β
1
h
yi ) ln 1
F xi> β
i1
F xi> β
i
yi
i
where f Y jx ( yi j xi ; θ) denotes the conditional probability mass function
(pmf) of Yi .
Christophe Hurlin (University of Orléans)
December 9, 2013
47 / 207
3. The Likelihood Function
Key Concepts
1
Likelihood (of a sample) function
2
Log-likelihood (of a sample) function
3
Conditional Likelihood and log-likelihood function
4
Likelihood and log-likelihood of one observation
Christophe Hurlin (University of Orléans)
December 9, 2013
48 / 207
Section 4
Maximum Likelihood Estimator
Christophe Hurlin (University of Orléans)
December 9, 2013
49 / 207
4. Maximum Likelihood Estimator
Objectives
1
This section will be concerned with obtaining estimates of the
parameters θ.
2
We will de…ne the maximum likelihood estimator (MLE).
3
Before we begin that study, we consider the question of whether
estimation of the parameters is possible at all: the question of
identi…cation.
4
We will introduce the invariance principle
Christophe Hurlin (University of Orléans)
December 9, 2013
50 / 207
4. Maximum Likelihood Estimator
De…nition (Identi…cation)
The parameter vector θ is identi…ed (estimable) if for any other parameter
vector, θ 6= θ, for some data y , we have
LN (θ; y ) 6= LN (θ ; y )
Christophe Hurlin (University of Orléans)
December 9, 2013
51 / 207
4. Maximum Likelihood Estimator
Example
Let us consider a latent (continuous and unobservable) variable Yi such
that:
Yi = Xi> β + εi
with β = ( β1 ..βK )> , Xi = (Xi 1 ...XiK )> and where the error term εi is
i.i.d. such that E (εi ) = 0 and V (εi ) = σ2 . The distribution of εi is
symmetric around 0 and we denote by G (.) the cdf of the standardized
error term εi /σ. We assume that this cdf does not depend on σ or β.
Example: εi /σ N (0, 1).
Christophe Hurlin (University of Orléans)
December 9, 2013
52 / 207
4. Maximum Likelihood Estimator
Example (cont’d)
We observe a dichotomic variable Yi such that:
Yi =
1
0
if Yi > 0
otherwise
Problem: are the parameters θ = ( β> σ2 )> identi…able?
Christophe Hurlin (University of Orléans)
December 9, 2013
53 / 207
4. Maximum Likelihood Estimator
Solution:
To answer to this question we have to compute the (log-)likelihood of the
sample of observed data fyi , xi gN
i =1 . We have:
Pr ( Yi = 1j Xi = xi ) = Pr ( Yi > 0j Xi = xi )
xi> β
= Pr εi >
= 1
Pr εi
= 1
Pr
xi> β
εi
σ
xi>
β
σ
If we denote by G (.) the cdf associated to the distribution of εi /σ, since
this distribution is symetric around 0, then we have:
Pr ( Yi = 1j Xi = xi ) = G
Christophe Hurlin (University of Orléans)
xi>
β
σ
December 9, 2013
54 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
For θ = ( β> σ2 )> , we have
N
`N (θ; y j x) =
∑ yi ln G
i =1
xi>
β
σ
N
+ ∑ (1
yi ) ln 1
G
xi>
i =1
β
σ
This log-likelihood depends only on the ratio β/σ. So, for θ = ( β> σ2 )>
and θ = (k β> k σ)> , with k 6= 1 :
`N (θ; y j x) = `N (θ ; y j x)
The parameters β and σ2 cannot be identi…ed. We can only identify the
ratio β/σ.
Christophe Hurlin (University of Orléans)
December 9, 2013
55 / 207
4. Maximum Likelihood Estimator
Remark:
In this latent model, only the ratio β/σ can be identi…ed since
Pr ( Yi = 1j Xi = xi ) = Pr
εi
β
< xi>
σ
σ
=G
xi>
β
σ
The choice of a logit or probit model implies a normalisation on the
variance of εi /σ and then on σ2 :
e
probit : Pr ( Yi = 1j Xi = xi ) = Φ xi> β
Christophe Hurlin (University of Orléans)
e = β /σ, V εi
with β
i
σ
December 9, 2013
=1
56 / 207
4. Maximum Likelihood Estimator
De…nition (Maximum Likelihood Estimator)
A maximum likelihood estimator b
θ of θ 2 Θ is a solution to the
maximization problem:
b
θ = arg max `N (θ; y j x )
θ 2Θ
or equivalently
b
θ = arg max LN (θ; y j x )
θ 2Θ
Christophe Hurlin (University of Orléans)
December 9, 2013
57 / 207
4. Maximum Likelihood Estimator
Remarks
1
2
3
Do not confuse the maximum likelihood estimator b
θ (which is a
θ (x ) which
random variable) and the maximum likelihood estimate b
corresponds to the realisation of b
θ on the sample x.
Generally, it is easier to maximise the log-likelihood than the
likelihood (especially for the distributions that belong to the
exponential family).
When we consider an unconditional likelihood, the MLE is de…ned by:
b
θ = arg max`N (θ; x )
θ 2Θ
Christophe Hurlin (University of Orléans)
December 9, 2013
58 / 207
4. Maximum Likelihood Estimator
De…nition (Likelihood equations)
Under suitable regularity conditions, a maximum likelihood estimator
(MLE) of θ is de…ned to be the solution of the …rst-order conditions
(FOC):
∂`N (θ; y j x )
= 0
∂θ
(K ,1 )
b
θ
or
∂LN (θ; y j x )
= 0
∂θ
(K ,1 )
b
θ
These conditions are generally called the likelihood or log-likelihood
equations.
Christophe Hurlin (University of Orléans)
December 9, 2013
59 / 207
4. Maximum Likelihood Estimator
Notations
The …rst derivative (gradient) of the (conditional) log-likelihood evaluated
at the point b
θ satis…es:
∂LN (θ; y j x )
∂θ
Christophe Hurlin (University of Orléans)
b
θ
θ; y j x
∂LN b
∂θ
= g bθ; y j x = 0
December 9, 2013
60 / 207
4. Maximum Likelihood Estimator
Remark
The log-likelihood equations correspond to a linear/nonlinear system of
K equations with K unknown parameters θ 1 , .., θ K :
1
0
∂`N (θ; Y jx )
0
1
0
∂θ 1
b
θ
B
C
∂`N (θ; Y j x )
C = @ ... A
...
=B
A
@
∂θ
b
θ
∂`N (θ; Y jx )
0
∂θ K
Christophe Hurlin (University of Orléans)
b
θ
December 9, 2013
61 / 207
4. Maximum Likelihood Estimator
De…nition (Second Order Conditions)
Second order condition (SOC) of the likelihood maximisation problem: the
Hessian matrix evaluated at b
θ must be negative de…nite.
∂2 `N (θ; y j x )
∂θ∂θ >
or
∂2 LN (θ; y j x )
∂θ∂θ >
Christophe Hurlin (University of Orléans)
is negative de…nite
b
θ
is negative de…nite
b
θ
December 9, 2013
62 / 207
4. Maximum Likelihood Estimator
Remark:
The Hessian matrix (realisation) is a K
∂2 `
0
B
B
B
N ( θ; y j x )
B
=
B
∂θ∂θ >
B
@
Christophe Hurlin (University of Orléans)
K matrix:
∂2 `N (θ; y jx )
∂θ 21
2
∂ `N (θ; y jx )
∂θ 2 ∂θ 1
∂2 `N (θ; y jx )
∂θ 1 ∂θ 2
..
∂2 `N (θ; y jx )
∂θ 1 ∂θ K
∂2 `N (θ; y jx )
∂θ 22
..
..
..
..
..
∂2 `
N ( θ; y jx )
∂θ K ∂θ 1
..
..
..
∂2 `
N ( θ; y jx )
∂θ 2K
1
C
C
C
C
C
C
A
December 9, 2013
63 / 207
4. Maximum Likelihood Estimator
Reminders
A negative de…nite matrix is a symetric (Hermitian if there are
complex entries) matrix all of whose eigenvalues are negative.
The n
n Hermitian matrix M is said to be negative-de…nite if:
x| Mx < 0
for all non-zero x in Rn .
Christophe Hurlin (University of Orléans)
December 9, 2013
64 / 207
4. Maximum Likelihood Estimator
Example (MLE problem with one parameter)
Let us consider a real-valued random variable X with a pdf given by:
fX x; σ2 = exp
x2
2σ2
x
σ2
8x 2 [0, +∞[
where σ2 is an unknown parameter. Let us consider a sample fX1 , .., XN g
of i.i.d. random variables with the same arbitrary distribution as X .
Problem: What is the maximum likelihood estimator (MLE) of σ2 ?
Christophe Hurlin (University of Orléans)
December 9, 2013
65 / 207
4. Maximum Likelihood Estimator
Solution:
We have:
x2
+ ln (x ) ln σ2
2σ2
So, the log-likelihood of the sample fx1 , .., xN g is:
ln fX x; σ2 =
`N σ 2 ; x =
N
∑ ln fX
i =1
Christophe Hurlin (University of Orléans)
xi ; σ2 =
1
2σ2
N
N
i =1
i =1
∑ xi2 + ∑ ln (xi )
N ln σ2
December 9, 2013
66 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
b2 of σ2 2 R+ is a solution to the
The maximum likelihood estimator σ
maximization problem:
b2 = arg max`N σ2 ; x = arg max
σ
σ 2 2R+
σ 2 2R+
1
2σ2
∂ `N σ 2 ; x
1
= 4
∂σ2
2σ
N
N
i =1
i =1
∑ xi2 + ∑ ln (xi )
N
∑ xi2
i =1
N ln σ2
N
σ2
FOC (log-likelihood equation):
∂ `N σ 2 ; x
∂σ2
=
b2
σ
Christophe Hurlin (University of Orléans)
1
2b
σ4
N
∑ xi2
i =1
N
1
b2 =
= 0 () σ
2
2N
b
σ
N
∑ xi2
i =1
December 9, 2013
67 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
b2 is a maximum:
Check that σ
∂ `N σ 2 ; x
1
= 4
2
∂σ
2σ
N
∑ xi2
i =1
∂2 `N σ 2 ; x
=
∂σ4
N
σ2
1
σ6
N
N
∑ xi2 + σ4
i =1
SOC:
∂2 `N σ 2 ; x
∂σ4
=
b2
σ
=
=
Christophe Hurlin (University of Orléans)
1
b6
σ
N
N
∑ xi2 + σb4
i =1
b2
2N σ
N
+ 4
6
b
b
σ
σ
N
<0
b4
σ
b2 =
since σ
1
2N
N
∑ xi2
i =1
December 9, 2013
68 / 207
4. Maximum Likelihood Estimator
Conclusion:
The maximum likelihood estimator (MLE) of the parameter σ2 is de…ned
by:
1 N 2
b2 =
Xi
σ
2N i∑
=1
The maximum likelihood estimate of the parameter σ2 is equal to:
b 2 (x ) =
σ
Christophe Hurlin (University of Orléans)
1
2N
N
∑ xi2
i =1
December 9, 2013
69 / 207
4. Maximum Likelihood Estimator
Example (Sample of normal variables)
We consider a sample fY1 , .., YN g N.i.d. m, σ2 . Problem: what are
the MLE of m and σ2 ?
Solution: Let us de…ne θ = m σ2
b
θ=
with
`N (θ; y ) =
.
arg max `N (θ; y )
σ2 2R+ ,m 2R
N
ln σ2
2
Christophe Hurlin (University of Orléans)
|
N
ln (2π )
2
1
2σ2
N
∑ (yi
m )2
i =1
December 9, 2013
70 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
`N (θ; y ) =
N
ln σ2
2
N
ln (2π )
2
1
2σ2
N
∑ (yi
m )2
i =1
The …rst derivative of the log-likelihood function is de…ned by:
!
∂`N (θ;y )
∂`N (θ; y )
∂m
=
∂`N (θ;y )
∂θ
∂σ2
∂`N (θ; y )
1
= 2
∂m
σ
N
∑ (yi
i =1
Christophe Hurlin (University of Orléans)
m)
∂`N (θ; y )
=
∂σ2
N
1
+
2σ2 2σ4
N
∑ (yi
m )2
i =1
December 9, 2013
71 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
FOC (log-likelihood equations)
∂`N (θ; y )
∂θ
=
N
2b
σ2
b
θ
1
b2
σ
∑N
i =1 (yi
1
2b
σ4
+
m
b)
∑N
i =1 (yi
m
b )2
!
=
0
0
!
So, the MLE correspond to the empirical mean and variance:
b
θ=
with
m
b =
1
N
Christophe Hurlin (University of Orléans)
N
∑ Yi
i =1
m
b
b
σ2
b2 =
σ
1
N
N
∑
Yi
YN
2
i =1
December 9, 2013
72 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
∂`N (θ; y )
1
= 2
∂m
σ
N
∑ (yi
m)
i =1
∂`N (θ; y )
=
∂σ2
N
1
+ 4
2
2σ
2σ
N
∑ (yi
m )2
i =1
The Hessian matrix (realization) is:
∂2 `N (θ; y )
∂θ∂θ >
=
=
Christophe Hurlin (University of Orléans)
∂2 `N (θ;y )
∂m 2
∂2 `N (θ;y )
∂σ2 ∂m
1
σ4
∂2 `N (θ;y )
∂m∂σ2
∂2 `N (θ;y )
∂σ4
N
σ2
∑N
i =1 (yi
m)
!
1
σ4
N
2σ4
∑N
i =1 (yi
1
σ6
∑N
i =1 (yi
m)
m )2
December 9, 2013
!
73 / 207
4. Maximum Likelihood Estimator
Solution (cont’d): SOC
∂2 `N (θ; y )
∂θ∂θ >
=
b
θ
=
1
b4
σ
N
b2
σ
0
N
b2
σ
∑N
i =1 (yi
m
b)
!
0
N
2b
σ4
N
2b
σ4
b2
Nσ
b6
σ
1
b4
σ
∑N
i =1 (yi
1
b6
σ
∑N
i =1 (yi
m
b)
m
b )2
!
b 2 = ∑N
since since N m
b = ∑N
m
b )2
i =1 yi and N σ
i =1 (yi
!
N
0
∂2 `N (θ; y )
2
b
σ
is de…nite negative
=
N
0
4
∂θ∂θ > bθ
2b
σ
Christophe Hurlin (University of Orléans)
December 9, 2013
74 / 207
4. Maximum Likelihood Estimator
Example (Linear Regression Model)
Consider the linear regression model:
yi = xi> β + εi
where xi = (xi 1 ...xiK )> and β = ( β1 ..βK )> are K 1 vectors. We assume
that the εi are N .i.d. 0, σ2 . Then, the (conditional) log-likelihood of the
observations (xi , yi ) is given by
`N (θ; y j x ) =
N
ln σ2
2
where θ = ( β> σ2 )> is (K + 1)
of β and σ2 ?
Christophe Hurlin (University of Orléans)
N
ln (2π )
2
1
2σ2
N
∑
yi
xi> β
2
i =1
1 vector. Question: what are the MLE
December 9, 2013
75 / 207
4. Maximum Likelihood Estimator
Notation 1: The derivative of a scalar y by a K
x = (x1 ...xK )> is K 1 vector
0
1
1 vector
∂y
∂y
B ∂x1 C
= @ .. A
∂x
∂y
∂xK
Notation 2: If x and β are two K
1 vectors, then:
∂ x>β
= x
∂β
(K ,1 )
Christophe Hurlin (University of Orléans)
December 9, 2013
76 / 207
4. Maximum Likelihood Estimator
Solution
b
θ=
arg max
β2RK ,σ2 2R+
N
ln σ2
2
N
ln (2π )
2
1
2σ2
N
∑
xi> β
yi
i =1
The …rst derivative of the log-likelihood function is a (K + 1)
0
∂`N (θ; y j x )
[email protected]
∂θ
|
{z
}
(K +1 ) 1
Christophe Hurlin (University of Orléans)
∂`N (θ; y jx )
∂β
∂`N (θ; y jx )
∂σ2
1
0
B
B
A=B
B
@
2
∂`N (θ; y jx )
∂β1
..
∂`N (θ; y jx )
∂βK
∂`N (θ; y jx )
∂σ2
1 vector:
1
C
C
C
C
A
December 9, 2013
77 / 207
4. Maximum Likelihood Estimator
Solution (cont’d)
b
θ=
N
ln σ2
2
arg max
β2RK ,σ2 2R+
N
ln (2π )
2
1
2σ2
N
∑
yi
(K ,1 )
∂`N (θ; y j x )
=
2
∂σ
|
{z
}
(1,1 )
Christophe Hurlin (University of Orléans)
N
xi
∑ |{z}
i =1
(K ,1 )
N
1
+ 4
2σ2
2σ
|
(1,1 )
N
i =1 |
1 vector:
xi> β
{z
}
yi
∑
2
i =1
The …rst derivative of the log-likelihood function is a (K + 1)
1
∂`N (θ; y j x )
= 2
∂β
σ
|
{z
}
xi> β
yi
2
xi> β
{z
}
(1,1 )
December 9, 2013
78 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
FOC (log-likelihood equations)
0
1
xi> b
β
∑N
i =1 xi yi
∂`N (θ; y j x )
b2
σ
@
=
N
∂θ
b
θ
+ 2b1σ4 ∑Ni=1 yi xi> b
β
2b
σ2
So, the MLE is de…ned by:
b
β=
N
∑
i =1
Xi Xi>
!
Christophe Hurlin (University of Orléans)
1
N
b
θ=
∑ Xi Yi
i =1
!
b
β
b2
σ
b2 =
σ
1
N
2
1
A=
N
∑
i =1
Yi
0K
0
Xi> b
β
December 9, 2013
2
79 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
The Hessian is a (K + 1)
(K + 1) matrix:
0
B
B
B
2
B
∂ `N (θ; y j x )
B
=
B
>
∂θ∂θ
{z
} B
|
B
@
(K +1 ) (K +1 )
Christophe Hurlin (University of Orléans)
1
∂2 `N (θ; y j x )
C
∂β∂σ2
∂β∂β>
C
|
{z
}
|
{z
}
C
C
K 1
K K
C
∂2 `N (θ; y j x ) ∂2 `N (θ; y j x ) C
C
C
4
>
2
∂σ
A
∂σ ∂β
|
{z
}
|
{z
}
1 1
∂2 `N (θ; y j x )
1 K
December 9, 2013
80 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
∂`N (θ; y j x )
1
= 2
∂β
σ
∂`N (θ; y j x )
=
∂σ2
N
∑ xi
xi> β
yi
i =1
N
1
+ 4
2
2σ
2σ
N
∑
xi> β
yi
2
i =1
So, the Hessian matrix (realization) is equal to:
∂2 `N (θ; y j x )
∂θ∂θ
>
0
B
B
B
=B
B
@
Christophe Hurlin (University of Orléans)
1
σ2
xi xi>
∑N
i =1 |{z}
|{z}
1
σ4
K 1 1 K
1
σ4
>
y
x>β
∑N
i =1 xi
|{z}| i {z i }
1 K
1 1
xi yi xi> β
∑N
i =1 |{z}
|
{z
}
K 1
1 1
N
2σ4
1
σ6
xi> β
∑N
i =1 yi
|
{z
}
1 1
December 9, 2013
81 / 207
2
4. Maximum Likelihood Estimator
Solution (cont’d):
Second Order Conditions (SOC)
0
1
>
∑N
2
i =1 xi xi
∂ `N ( θ )
b2
σ
B
[email protected]
1
> y
∂θ∂θ > bθ
xi> b
β
∑N
i
i =1 xi
b4
σ
> y
Since ∑N
i
i =1 xi
N
2b
σ4
1
b4
σ
∑N
i =1 xi yi
1
b6
σ
∑N
i =1
yi
b 2 = ∑N
xi> b
β = 0 (FOC) and N σ
i =1 yi
∂2 `N ( θ )
∂θ∂θ >
Christophe Hurlin (University of Orléans)
=
b
θ
N
b2
σ
>
∑N
i = 1 xi xi
0
0
N
2b
σ4
b2
Nσ
b6
σ
!
xi> b
β
xi> b
β
xi> b
β
December 9, 2013
2
1
C
A
2
82 / 207
4. Maximum Likelihood Estimator
Solution (cont’d):
Second Order Conditions (SOC).
∂2 `N (θ; y j x )
∂θ∂θ >
=
b
θ
1
b2
σ
>
∑N
i =1 xi xi
0
0
N
2b
σ4
!
is de…nite negative
>
Since ∑N
i =1 xi xi is positive de…nite (assumption), the Hessian matrix is
de…nite negative and b
θ is the MLE of the parameters θ.
Christophe Hurlin (University of Orléans)
December 9, 2013
83 / 207
4. Maximum Likelihood Estimator
Theorem (Equivariance or Invariance Principle)
Under suitable regularity conditions, the maximum likelihood estimator of
a function g (.) of the parameter θ is g b
θ , where b
θ is the maximum
likelihood estimator of θ.
Christophe Hurlin (University of Orléans)
December 9, 2013
84 / 207
4. Maximum Likelihood Estimator
Invariance Principle
The MLE is invariant to one-to-one transformations of θ. Any
transformation that is not one to one either renders the model
inestimable if it is one to many or imposes restrictions if it is many to
one.
For the practitioner, this result is extremely useful. For example, when
a parameter appears in a likelihood function in the form 1/θ , it is
usually worthwhile to reparameterize the model in terms of γ = 1/θ.
Example: Olsen (1978) and the reparametrisation of the likelihood
function of the Tobit Model.
Christophe Hurlin (University of Orléans)
December 9, 2013
85 / 207
4. Maximum Likelihood Estimator
Example (Invariance Principle)
Suppose that the normal log-likelihood in the previous example is
parameterized in terms of the precision parameter, γ2 = 1/σ2 . The
log-likelihood
`N m, σ2 ; y =
N
ln σ2
2
N
ln (2π )
2
1
2σ2
N
ln γ2
2
N
ln (2π )
2
γ2
2
N
∑ (yi
m )2
i =1
becomes
`N m, γ2 ; y =
Christophe Hurlin (University of Orléans)
N
∑ (yi
m )2
i =1
December 9, 2013
86 / 207
4. Maximum Likelihood Estimator
Example (Invariance Principle, cont’d)
The MLE for m is clearly still Y N . But the likelihood equation for γ2 is
now:
∂`N m, γ2 ; y
N
1 N
=
(yi m)2
∂γ2
2γ2
2 i∑
=1
and the MLE for γ2 is now de…ned by:
as expected.
Christophe Hurlin (University of Orléans)
b2 =
γ
N
∑N
i =1
(Yi
m)
2
=
1
b2
σ
December 9, 2013
87 / 207
Key Concepts
1
Identi…cation.
2
Maximum likelihood estimator.
3
Maximum likelihood estimate.
4
Log-likelihood equations.
5
Equivariance or invariance principle.
6
Gradient Vector and Hessian Matrix (deterministic elements).
Christophe Hurlin (University of Orléans)
December 9, 2013
88 / 207
Section 5
Score, Hessian and Fisher Information
Christophe Hurlin (University of Orléans)
December 9, 2013
89 / 207
5. Score, Hessian and Fisher Information
Objectives
We aim at introducing the following concepts:
1
2
Hessian matrix
3
Fischer information matrix of the sample
4
Fischer information matrix of one observation for marginal and
conditional distributions
5
Average Fischer information matrix of one observation
Christophe Hurlin (University of Orléans)
December 9, 2013
90 / 207
5. Score, Hessian and Fisher Information
De…nition (Score Vector)
The (conditional) score vector is a K
sN (θ; Y j x )
(K ,1 )
Christophe Hurlin (University of Orléans)
1 vector de…ned by:
s (θ ) =
∂`N (θ; Y j x )
∂θ
December 9, 2013
91 / 207
5. Score, Hessian and Fisher Information
Remarks:
The score sN (θ; Y j x ) is a vector of random elements since it
depends on the random variables Y1 , .., .YN .
For an unconditional log-likelihood, `N (θ; x ) , the score is denoted by
sN (θ; X ) = ∂`N (θ; X ) /∂θ
The score is a K
1 vector such that:
0 ∂`
B
sN (θ; Y j x ) = @
Christophe Hurlin (University of Orléans)
N ( θ; Y
∂θ 1
jx )
.
∂`N (θ; Y jx )
∂θ K
1
C
A
December 9, 2013
92 / 207
5. Score, Hessian and Fisher Information
Corollary
By de…nition, the score vector satis…es
Eθ (sN (θ; Y j x )) = 0K
where Eθ means the expectation with respect to the conditional
distribution Y j X = x.
Christophe Hurlin (University of Orléans)
December 9, 2013
93 / 207
5. Score, Hessian and Fisher Information
Remark: If we consider a variable X with a pdf fX (x; θ ) , 8x 2 R, then
Eθ (.) means the expectation with respect to the distribution of X :
Eθ (sN (θ; X )) =
Z∞
sN (θ; x ) fX (x; θ ) dx = 0
∞
Remark: If we consider a variable Y with a conditional pdf f Y jx (y ; θ ) ,
8y 2 R, then Eθ (.) means the expectation with respect to the
distribution of Y j X = x :
Eθ (sN (θ; Y j x )) =
Christophe Hurlin (University of Orléans)
Z∞
∞
sN (θ; Y j x ) f Y jx (y ; θ ) dy = 0
December 9, 2013
94 / 207
5. Score, Hessian and Fisher Information
Proof.
If we consider a variable X with a pdf fX (x; θ ) , 8x 2 R, then:
Eθ (sN (θ; X )) =
Z
sN (θ; x ) fX (x; θ ) dx
Z
∂ ln fX (x; θ )
fX (x; θ ) dx
∂θ
Z
1
∂fX (x; θ )
= N
fX (x; θ ) dx
fX (x; θ )
∂θ
Z
∂
fX (x; θ ) dx
= N
∂θ
∂1
= N
=0
∂θ
= N
Christophe Hurlin (University of Orléans)
December 9, 2013
95 / 207
5. Score, Hessian and Fisher Information
Example (Exponential Distribution)
Suppose that D1 , D2 , .., DN are i.i.d., positive random variable with
Di
Exp (θ ) and E (Di ) = θ > 0.
fD (d; θ ) =
1
exp
θ
`N (θ; d ) =
d
θ
, 8d 2 R+
N ln (θ )
1 N
di
θ i∑
=1
The score (scalar) is equal to:
sN (θ; D ) =
Christophe Hurlin (University of Orléans)
1
N
+ 2
θ
θ
N
∑ Di
i =1
December 9, 2013
96 / 207
5. Score, Hessian and Fisher Information
Example (Exponential Distribution, cont’d)
By de…nition:
Eθ (sN (θ; D )) = Eθ
=
=
N
1
+ 2
θ
θ
N
∑ Di
i =1
!
N
1 N
+ 2 ∑ E θ ( Di )
θ
θ i =1
N
Nθ
+ 2
θ
θ
= 0
Christophe Hurlin (University of Orléans)
December 9, 2013
97 / 207
5. Score, Hessian and Fisher Information
Example (Linear Regression Model)
Let us consider the previous linear regression model yi = xi> β + εi . The
score is de…ned by:
1
0
N
1
>β
x
Y
x
∑
i
i
2
i
=
1
i
σ
A
sN (θ; Y j x ) = @
N
N
1
>β 2
Y
x
+
∑
i
i =1
i
2σ2
2σ4
Then, we have
0
Eθ (sN (θ; Y j x )) = Eθ @
Christophe Hurlin (University of Orléans)
1
σ2
N
2σ2
∑N
i =1 xi Yi
+
1
2σ4
∑N
i =1
xi> β
Yi
xi> β
2
1
A
December 9, 2013
98 / 207
5. Score, Hessian and Fisher Information
Example (Linear Regression Model, cont’d)
We know that Eθ ( Yi j x ) = xi> β. So, we have:
Eθ
1 N
∑ xi Yi
σ 2 i =1
Christophe Hurlin (University of Orléans)
xi> β
1 N
∑ xi Eθ ( Yi j x ) xi> β
σ 2 i =1
1 N
=
∑ xi xi> β xi> β
σ 2 i =1
= 0K
=
December 9, 2013
99 / 207
5. Score, Hessian and Fisher Information
Example (Linear Regression Model, cont’d)
Eθ
=
=
=
=
N
2σ2
N
2σ2
N
2σ2
N
2σ2
2
N
1
+ 4 ∑Ni=1 Yi xi> β
2
2σ
2σ
2
1
+ 4 ∑Ni=1 Eθ
Yi xi> β
2σ
1
+ 4 ∑Ni=1 Eθ (Yi Eθ ( Yi j x ))2
2σ
1
+ 4 ∑Ni=1 Vθ ( Yi j x )
2σ
Nσ2
+ 4
2σ
= 0
Christophe Hurlin (University of Orléans)
December 9, 2013
100 / 207
5. Score, Hessian and Fisher Information
The gradient vector associated to the log-likelihood function is a K
vector de…ned by:
gN (θ; y j x )
(K ,1 )
Christophe Hurlin (University of Orléans)
g (θ ) =
1
∂`N (θ; y j x )
∂θ
December 9, 2013
101 / 207
5. Score, Hessian and Fisher Information
Remarks
1
The gradient gN (θ; y j x ) is a vector of deterministic entries since it
depends on the realisation y1 , .., yN .
2
For an unconditional log-likelihood, the gradient is de…ned by
gN (θ; x ) = ∂`N (θ; x ) /∂θ
3
1 vector such that:
0 ∂` (θ; y jx ) 1
N
B
gN (θ; y j x ) = @
Christophe Hurlin (University of Orléans)
∂θ 1
.
∂`N (θ; y jx )
∂θ K
C
A
December 9, 2013
102 / 207
5. Score, Hessian and Fisher Information
Corollary
By de…nition of the FOC, the gradient vector satis…es
gN b
θ; y j x
= 0K
where b
θ=b
θ (x ) is the maximum likelihood estimate of θ.
Christophe Hurlin (University of Orléans)
December 9, 2013
103 / 207
5. Score, Hessian and Fisher Information
Example (Linear regression model)
In the linear regression model, the gradient associated to the log-likelihood
function is de…ned to be:
!
N
1
>β
x
y
x
∑
i
i
2
i
=
1
i
σ
gN (θ; y j x ) =
2
N
1
N
+
y
xi> β
i
2σ2
2σ4 ∑i =1
Given the FOC, we have:
0
gN b
θ; y j x
B
[email protected]
Christophe Hurlin (University of Orléans)
1
b2
σ
N
2b
σ2
∑N
i =1 xi yi
+
1
2b
σ4
∑N
i =1
yi
xi> b
β
xi> b
β
2
1
C
A=
0K
0
December 9, 2013
!
104 / 207
5. Score, Hessian and Fisher Information
De…nition (Hessian Matrix)
The Hessian matrix (deterministic) is de…ned as to be:
HN (θ; y j x ) =
∂2 `N (θ; y j x )
∂2 `N (θ; y jx )
is also
∂θ∂θ >
2
∂ ` (θ; Y x )
matrices N > j
∂θ∂θ
∂θ∂θ >
Remarks: The matrix
called the Hessian matrix, but do
not confuse the two
and
Christophe Hurlin (University of Orléans)
∂2 `N (θ; y jx )
.
∂θ∂θ >
December 9, 2013
105 / 207
5. Score, Hessian and Fisher Information
Random Variable
Score vector
Hessian Matrix
Christophe Hurlin (University of Orléans)
∂`N (θ; Y jx )
∂θ
∂2 `N (θ; Y jx )
∂θ∂θ >
Constant
∂`N (θ; y jx )
∂θ
Hessian Matrix
∂2 `N (θ; y jx )
∂θ∂θ >
December 9, 2013
106 / 207
5. Score, Hessian and Fisher Information
De…nition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated to the sample
fY1 , .., YN g is the variance-covariance matrix of the score vector:
I N (θ ) = Vθ (sN (θ; Y j x ))
| {z }
K K
or equivalently:
I N ( θ ) = Vθ
∂`N (θ; Y j x )
∂θ
where Vθ means the variance with respect to the conditional distribution
Y j X.
Christophe Hurlin (University of Orléans)
December 9, 2013
107 / 207
5. Score, Hessian and Fisher Information
Corollary
Since by de…nition Eθ (sN (θ; Y j x )) = 0, then an alternative de…nition of
the Fisher information matrix of the sample fY1 , .., YN g is:
1
0
B
I N (θ ) = Eθ @sN (θ; Y j x )
{z
}
|
| {z }
K K
Christophe Hurlin (University of Orléans)
K 1
C
sN (θ; Y j x )> A
|
{z
}
1 K
December 9, 2013
108 / 207
5. Score, Hessian and Fisher Information
De…nition (Fisher Information Matrix)
The (conditional) Fisher information matrix of the sample fY1 , .., YN g is
also given by:
I N ( θ ) = Eθ
Christophe Hurlin (University of Orléans)
∂2 `N (θ; Y j x )
∂θ∂θ >
= Eθ ( HN (θ; Y j x ))
December 9, 2013
109 / 207
5. Score, Hessian and Fisher Information
De…nition (Fisher Information Matrix, summary)
The (conditional) Fisher information matrix of the sample fY1 , .., YN g
can alternatively be de…ned by:
I N (θ ) = Vθ (sN (θ; Y j x ))
I N (θ ) = Eθ sN (θ; Y j x )
sN (θ; Y j x )>
I N (θ ) = Eθ ( HN (θ; Y j x ))
where Eθ and Vθ denote the mean and the variance with respect to the
conditional distribution Y j X , and where sN (θ; Y j x ) denotes the score
vector and HN (θ; Y j x ) the Hessian matrix.
Christophe Hurlin (University of Orléans)
December 9, 2013
110 / 207
5. Score, Hessian and Fisher Information
De…nition (Fisher Information Matrix, summary)
The (conditional) Fisher information matrix of the sample fY1 , .., YN g
can alternatively be de…ned by:
I N ( θ ) = Vθ
I N ( θ ) = Eθ
∂`N (θ; Y j x )
∂θ
∂`N (θ; Y j x )
∂θ
I N ( θ ) = Eθ
∂`N (θ; Y j x )
∂θ
>
!
∂2 `N (θ; Y j x )
∂θ∂θ >
where Eθ and Vθ denote the mean and the variance with respect to the
conditional distribution Y j X .
Christophe Hurlin (University of Orléans)
December 9, 2013
111 / 207
5. Score, Hessian and Fisher Information
Remarks
1
Three equivalent de…nitions of the Fisher information matrix, and as a
consequence three di¤erent consistent estimates of the Fisher
information matrix (see later).
2
The Fisher information matrix associated to the sample fY1 , .., YN g
can also be de…ned from the Fisher information matrix for the
observation i.
Christophe Hurlin (University of Orléans)
December 9, 2013
112 / 207
5. Score, Hessian and Fisher Information
De…nition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated to the i th
individual can be de…ned by:
I i ( θ ) = Vθ
I i ( θ ) = Eθ
∂`i (θ; Yi j xi )
∂θ
∂`i (θ; Yi j xi ) ∂`i (θ; Yi j xi )>
∂θ
∂θ
I i ( θ ) = Eθ
!
∂2 `i (θ; Yi j xi )
∂θ∂θ >
where Eθ and Vθ denote the expectation and variance with respect to the
true conditional distribution Yi j Xi .
Christophe Hurlin (University of Orléans)
December 9, 2013
113 / 207
5. Score, Hessian and Fisher Information
De…nition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated to the i th
individual can be alternatively be de…ned by:
I i (θ ) = Vθ (si (θ; Yi j xi ))
I i (θ ) = Eθ si (θ; Yi j xi ) si (θ; Yi j xi )>
I i (θ ) = Eθ ( Hi (θ; Yi j xi ))
where Eθ and Vθ denote the expectation and variance with respect to the
true conditional distribution Yi j Xi .
Christophe Hurlin (University of Orléans)
December 9, 2013
114 / 207
5. Score, Hessian and Fisher Information
Theorem
The Fisher information matrix associated to the sample fY1 , .., YN g is
equal to the sum of individual Fisher information matrices:
N
I N (θ ) =
∑ I i (θ )
i =1
Christophe Hurlin (University of Orléans)
December 9, 2013
115 / 207
5. Score, Hessian and Fisher Information
Remark:
1
In the case of a marginal log-likelihood, the Fisher information matrix
associated to the variable Xi is the same for the observations i :
I i (θ ) = I (θ )
2
8i = 1, ..N
In the case of a conditional log-likelihood, the Fisher information
matrix associated to the variable Yi given Xi = xi depends on the
observation i :
I i ( θ ) 6 = I j ( θ ) 8i 6 = j
Christophe Hurlin (University of Orléans)
December 9, 2013
116 / 207
5. Score, Hessian and Fisher Information
Example (Exponential marginal distribution)
Suppose that D1 , D2 , .., DN are i.i.d., positive random variable with
Di
Exp (θ )
E ( Di ) = θ
V ( Di ) = θ 2
fD (d; θ ) =
1
exp
θ
d
θ
, 8d 2 R+
di
θ
Question: what is the Fisher information number (scalar) associated to
Di ?
`i (θ; di ) =
Christophe Hurlin (University of Orléans)
ln (θ )
December 9, 2013
117 / 207
5. Score, Hessian and Fisher Information
Solution
` (θ; di ) =
ln (θ )
di
θ
The score of the observation Xi is de…ned by:
si (θ; Di ) =
∂`i (θ; Di )
=
∂θ
1 Di
+ 2
θ
θ
Let us use the three de…nitions of the information quantity I i (θ ) :
I i (θ ) = Vθ (si (θ; Di ))
= Eθ si (θ; Di )2
= Eθ ( Hi (θ; Di ))
Christophe Hurlin (University of Orléans)
December 9, 2013
118 / 207
5. Score, Hessian and Fisher Information
Solution, cont’d
si (θ; Di ) =
∂`i (θ; Di )
=
∂θ
1 Di
+ 2
θ
θ
First de…nition:
I i (θ ) = Vθ (si (θ; Di ))
1 Di
= Vθ
+ 2
θ
θ
1
= 4 V θ ( Di )
θ
1
= 2
θ
Conclusion: I i (θ ) =I (θ ) does not depend on i.
Christophe Hurlin (University of Orléans)
December 9, 2013
119 / 207
5. Score, Hessian and Fisher Information
Solution, cont’d
si (θ; Di ) =
∂`i (θ; Di )
=
∂θ
1 Di
+ 2
θ
θ
Second de…nition:
I i (θ ) = Eθ si (θ; Di )2
= Eθ
= Vθ
=
1 Di
+ 2
θ
θ
1 Di
+ 2
θ
θ
2
!
since Eθ
1 Di
+ 2
θ
θ
=0
1
θ2
Conclusion: I i (θ ) =I (θ ) does not depend on i.
Christophe Hurlin (University of Orléans)
December 9, 2013
120 / 207
5. Score, Hessian and Fisher Information
Solution, cont’d
si (θ; Di ) =
Hi (θ; Di ) =
∂`i (θ; Di )
=
∂θ
1 Di
+ 2
θ
θ
∂2 `i (θ; Di )
1
= 2
2
∂θ
θ
2Di
θ3
Third de…nition:
I i (θ ) = Eθ ( Hi (θ; Di ))
2Di
1
= Eθ
2
θ
θ3
1
2
=
+ 3 E θ ( Di )
2
θ
θ
1
2
1
+ 3θ = 2
=
2
θ
θ
θ
Conclusion: I i (θ ) =I (θ ) does not depend on i.
Christophe Hurlin (University of Orléans)
December 9, 2013
121 / 207
5. Score, Hessian and Fisher Information
Example (Linear regression model)
We shown that:
∂2 `i (θ; Yi j xi )
∂θ∂θ >
0
B
B
B
=B
B
@
1
xi xi>
σ2 |{z}
|{z}
K 1 1 K
1
xi>
σ4 |{z}
1 K
|
Yi
xi> β
{z
}
1
2σ4
1 1
1
xi> β
{z
} C
|
C
K 1
1 1
C
2 C
C
>
1
Yi xi β
A
σ6
|
{z
}
1
xi
σ4 |{z}
Yi
1 1
Question: what is the Fisher information matrix associated to the
observation Yi ?
Christophe Hurlin (University of Orléans)
December 9, 2013
122 / 207
5. Score, Hessian and Fisher Information
Solution
The information matrix is then de…ned by:
I i (θ )
| {z }
= Eθ
∂2 `i (θ; Yi j xi )
= Eθ ( Hi (θ; Yi j xi ))
∂θ∂θ >
K +1 K +1
where Eθ means the expectation with respect to the conditional
distribution Yi j Xi = xi
0
I i (θ ) = @
1
x
σ4 i
1
x x>
σ2 i i
1 >
x
σ4 i
Eθ (Yi )
Christophe Hurlin (University of Orléans)
xi> β
1
2σ4
+
Eθ (Yi )
1
E
σ6 θ
Yi
xi> β
xi> β
2
December 9, 2013
1
A
123 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
0
I i (θ ) = @
1
x x>
σ2 i i
1 >
x
σ4 i
Eθ (Yi )
1
2σ4
xi> β
Given that Eθ (Yi ) = xi> β and Eθ ( Yi
I i (θ ) =
Eθ (Yi )
1
x
σ4 i
1
E
σ6 θ
+
Yi
xi> β
xi> β
2
2
1
A
xi> β ) = σ2 , then we have:
!
1
>
x
x
0
i
i
σ2
0
1
2σ4
Conclusion: I i (θ ) depends on xi and I i (θ ) 6=I j (θ ) for i 6= j.
Christophe Hurlin (University of Orléans)
December 9, 2013
124 / 207
5. Score, Hessian and Fisher Information
De…nition (Average Fisher information matrix)
For a conditional model, the average Fisher information matrix for one
observation is de…ned by:
I (θ ) = EX (I i (θ ))
where EX denotes the expectation with respect to X (conditioning
variable).
Christophe Hurlin (University of Orléans)
December 9, 2013
125 / 207
5. Score, Hessian and Fisher Information
Summary: For a conditional model (and only for a conditional model),
we have:
I ( θ ) = EX
Vθ
∂`i (θ; Yi j Xi )
∂θ
I ( θ ) = EX E θ
= EX (Vθ (s (θ; Yi j Xi )))
∂`i (θ; Yi j Xi ) ∂`i (θ; Yi j Xi )>
∂θ
∂θ
!
= EX Eθ si (θ; Yi j Xi ) si (θ; Yi j Xi )>
I (θ ) = EX Eθ
Christophe Hurlin (University of Orléans)
∂2 `i (θ; Yi j Xi )
∂θ∂θ >
= EX Eθ ( Hi (θ; Yi j Xi ))
December 9, 2013
126 / 207
5. Score, Hessian and Fisher Information
Summary: For a marginal distribution, we have:
I ( θ ) = Vθ
∂`i (θ; Yi )
∂θ
I ( θ ) = Eθ
= Vθ (s (θ; Yi ))
∂`i (θ; Yi ) ∂`i (θ; Yi )>
∂θ
∂θ
!
= Eθ si (θ; Yi ) si (θ; Yi )>
I ( θ ) = Eθ
Christophe Hurlin (University of Orléans)
∂2 `i (θ; Yi )
∂θ∂θ >
= Eθ ( Hi (θ; Yi ))
December 9, 2013
127 / 207
5. Score, Hessian and Fisher Information
Example (Linear Regression Model)
In the linear model, the individual Fisher information matrix is equal to:
!
1
x x> 0
σ2 i i
I i (θ ) =
1
0
2σ4
and the average Fisher information Matrix for one observation is de…ned
by:
!
1
E Xi Xi>
0
σ2 X
I (θ ) = EX (I i (θ )) =
1
0
2σ4
Christophe Hurlin (University of Orléans)
December 9, 2013
128 / 207
5. Score, Hessian and Fisher Information
Summary: in order to compute the average information matrix I (θ ) for
one observation:
Step 1: Compute the Hessian matrix or the score vector for one
observation
Hi (θ; Yi j xi ) =
∂2 `i (θ; Yi j xi )
∂θ∂θ
>
si (θ; Yi j xi ) =
∂`i (θ; Yi j xi )
∂θ
Step 2: Take the expectation (or the variance) with respect to the
conditional distribution Yi j Xi = xi
I i (θ ) = Vθ (si (θ; Yi j xi )) = Eθ ( Hi (θ; Yi j xi ))
Step 3: Take the expectation with respect to the conditioning variable X
I (θ ) = EX (I i (θ ))
Christophe Hurlin (University of Orléans)
December 9, 2013
129 / 207
5. Score, Hessian and Fisher Information
Theorem
In a sampling model (with i.i.d. observations), one has:
IN (θ ) = N I (θ )
Christophe Hurlin (University of Orléans)
December 9, 2013
130 / 207
5. Score, Hessian and Fisher Information
pdf
Marginal Distribution
fX i (θ; xi )
Cond. Distribution (model)
f Y i jxi (θ; y j x )
Score Vector
si (θ; Xi )
si (θ; Yi j xi )
Hessian Matrix
Hi (θ; Xi )
Hi (θ; Yi j xi )
Information matrix
I i (θ ) = I (θ )
I i (θ )
Av. Infor. Matrix
I (θ ) = I i (θ )
I (θ ) = EX (I i (θ ))
with I i (θ ) = Vθ (si (θ; Yi j xi )) = Eθ si (θ; Yi j xi ) si (θ; Yi j xi )> =
Eθ ( Hi (θ; Yi j xi ))
Christophe Hurlin (University of Orléans)
December 9, 2013
131 / 207
5. Score, Hessian and Fisher Information
How to estimate the average Fisher Information Matrix?
This matrix is particularly important, since we will see that its
corresponds to the asymptotic variance covariance matrix of the
MLE.
Let us assume that we have a consistent estimator b
θ of the parameter
θ, how to estimate the average Fisher information matrix?
Christophe Hurlin (University of Orléans)
December 9, 2013
132 / 207
5. Score, Hessian and Fisher Information
De…nition (Estimators of the average Fisher Information Matrix)
If b
θ converges in probability to θ 0 (true value), then:
1
bI b
θ =
N
N
1
bI b
θ =
N
∑
i =1
1
bI b
θ =
N
N
∑ bI i
∂`i (θ; yi j xi )
∂θ
N
∑
i =1
b
θ
i =1
b
θ
∂`i (θ; yi j xi )
∂θ
∂2 `i (θ; yi j xi )
∂θ∂θ >
>
b
θ
!
b
θ
are three consistent estimators of the average Fisher information matrix.
Christophe Hurlin (University of Orléans)
December 9, 2013
133 / 207
5. Score, Hessian and Fisher Information
1
2
The …rst estimator corresponds to the average of the N Fisher
information matrices (for Y1 , .., YN ) evaluated at the estimated value
b
θ. This estimator will rarely be available in practice.
The second estimator corresponds to the average of the product of
the individual score vectors evaluated at b
θ. It is known as the BHHH
(Berndt, Hall, Hall, and Hausman, 1994) estimator or OPG estimator
1
bI b
θ =
N
Christophe Hurlin (University of Orléans)
N
∑
i =1
gi b
θ; yi j xi gi b
θ; yi j xi
>
December 9, 2013
134 / 207
5. Score, Hessian and Fisher Information
3. The third estimator corresponds to the opposite of the average of the
Hessian matrices evaluated at b
θ.
1
bI b
θ =
N
Christophe Hurlin (University of Orléans)
N
∑
i =1
θ; yi j xi
Hi b
December 9, 2013
135 / 207
5. Score, Hessian and Fisher Information
Problem
These three estimators are asymptotically equivalent, but they could give
di¤erent results in …nite samples. Available evidence suggests that in small
or moderate sized samples, the Hessian is preferable (Greene, 2007).
However, in most cases, the BHHH estimator will be the easiest to
compute.
Christophe Hurlin (University of Orléans)
December 9, 2013
136 / 207
5. Score, Hessian and Fisher Information
Christophe Hurlin (University of Orléans)
December 9, 2013
137 / 207
5. Score, Hessian and Fisher Information
Example (CAPM)
The empirical analogue of the CAPM is given by:
e
rit = αi + βi e
rmt + εt
e
rit = rit rft
| {z }
excess return of security i at time t
Christophe Hurlin (University of Orléans)
rft )
}
market excess return at time t
where εt is an i.i.d. error term with:
E ( εt ) = 0
e
rmt = (rmt
|
{z
V ( εt ) = σ 2
rmt ) = 0
E ( εt j e
December 9, 2013
138 / 207
5. Score, Hessian and Fisher Information
Example (CAPM, cont’d)
Data (data …le: capm.xls): Microsoft, SP500 and Tbill (closing prices)
from 11/1/1993 to 04/03/2003
0.10
0.08
RMSFT
0.05
0.04
0.00
0.00
-0.05
-0.04
-0.10
-0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08
-0.08
500
RSP500
Christophe Hurlin (University of Orléans)
1000
RSP500
1500
2000
RMSFT
December 9, 2013
139 / 207
5. Score, Hessian and Fisher Information
Example (CAPM, cont’d)
We consider the CAPM model rewritten as follows
e
rit = xt> β + εt
where xt = (1 e
rmt )> is 2
θ = αi : βi :
>
σ2
t = 1, ..T
1 vector of random variables,
>
= β : σ2
>
is 3
1 vector of parameters, and
where the error term εt satis…es E (εt ) = 0, V (εt ) = σ2 and
E ( εt j e
rmt ) = 0.
Christophe Hurlin (University of Orléans)
December 9, 2013
140 / 207
5. Score, Hessian and Fisher Information
Example (CAPM, cont’d)
Question: Compute three alternative estimators of the asymptotic
>
b2
αi b
variance covariance matrix of the MLE estimator b
θ= b
β σ
i
b=
β
b
αi
b
βi
T
=
t =1
b2 =
σ
Christophe Hurlin (University of Orléans)
∑ xt xt>
1
T
T
∑
t =1
e
rit
!
1
b
xt> β
T
∑ xt erit
t =1
!
2
December 9, 2013
141 / 207
5. Score, Hessian and Fisher Information
Solution The ML estimator is de…ned by:
b
θ=
T
ln σ2
2
arg max
β2R2 ,σ2 2R+
T
ln (2π )
2
1
2σ2
The problem is regular, so we have:
p
or equivalently
θ
T b
b
θ
asy
d
θ0 ! N 0, I
N
θ0 ,
1
I
T
1
1
T
∑
t =1
e
rit
b
xt> β
2
(θ0 )
(θ0 )
The asymptotic variance covariance matrix of b
θ is
Christophe Hurlin (University of Orléans)
1
V b
θ = I
T
1
(θ0 )
December 9, 2013
142 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
First estimator: The information matrix at time t is de…ned by (third
de…nition):
0
eit xt
∂2 `t θ; R
I t (θ) = Eθ @
∂θ∂θ>
1
A = Eθ
eit xt
Ht θ; R
where Eθ means the expectation with respect to the conditional
eit Xt = xt
distribution R
0
B
I t (θ) = @
1
x x>
σ2 t t
1 >
x
σ4 t
eit
Eθ R
Christophe Hurlin (University of Orléans)
1
x
σ4 t
xt> β
1
2σ4
+
eit
Eθ R
1
E
σ6 θ
eit
R
xt> β
xt> β
December 9, 2013
2
1
C
A
143 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
First estimator:
0
B
I t (θ) = @
1 >
x
σ4 t
eit
Given that Eθ R
eit
R
Eθ
1
2σ4
xt> β
eit
R
= xt> β and Eθ
I t (θ) =
Christophe Hurlin (University of Orléans)
eit
Eθ R
1
x
σ4 t
1
x x>
σ2 t t
1
x x>
σ2 t t
01
2
+
1
E
σ6 θ
xt> β
02
1
1
2σ4
2
eit
R
xt> β
xt> β
2
1
C
A
= σ2 , then we have:
!
December 9, 2013
144 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
First estimator:
I t (θ) =
1
x x>
σ2 t t
01
02
1
1
2σ4
2
!
An estimator of the asymptotic variance covariance matrix of b
θ is given by:
1
bI b
θ =
T
T
1
b asy b
V
θ = bI
T
∑ It
t =1
Christophe Hurlin (University of Orléans)
b
θ =
1
b2
Tσ
1
b
θ
∑Tt=1 xt xt> 02
01
2
1
2b
σ4
1
!
December 9, 2013
145 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
Second de…nition (BHHH):
with
1
bI b
θ =
T
rit j xt )
∂`t (θ; e
∂θ
b
θ
T
∑
t =1
0
B
[email protected]
Christophe Hurlin (University of Orléans)
1
b asy b
θ = bI
V
T
∂`t (θ; e
rit j xt )
∂θ
1
x
b2 t
σ
1
2b
σ2
+
e
rit
1
2b
σ4
e
rit
1
b
θ
b
xt> β
b
xt> β
b
θ
∂`t (θ; e
rit j xt )
∂θ
2
1
C
A=
>
b
θ
!
1
x bε
b2 t t
σ
1
+ 2b1σ4 bε2t
2b
σ2
December 9, 2013
!
146 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
Second de…nition (BHHH):
∂`t (θ; e
rit j xt )
∂θ
=
0
= @
b
θ
1
b
xε
b2 t t
σ
1
+ 2b1σ4 bε2t
2b
σ2
1 >
x bε
b2 t t
σ
Christophe Hurlin (University of Orléans)
!
1
x x>bε2
b4 t t t
σ
1
2b
σ2
∂`t (θ; e
rit j xt )
∂θ
+
1 >
x bε
b2 t t
σ
1 2
bε
2b
σ4 t
>
b
θ
1
x bε
b2 t t
σ
1
2b
σ2
+
1 2
bε
2b
σ4 t
2
1 2
bε
2b
σ4 t
1
2b
σ2
1
2b
σ2
+
1 2
bε
2b
σ4 t
+
1
A
December 9, 2013
147 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
Second de…nition (BHHH): so we have
1
b asy b
θ = bI
V
T
with
1
bI b
θ =
T
T
0
∑@
t =1
1 >
x bε
b2 t t
σ
Christophe Hurlin (University of Orléans)
1
x x>bε2
b4 t t t
σ
1
2b
σ2
+
1 2
bε
2b
σ4 t
1
b
θ
1
x bε
b2 t t
σ
1 2
bε
2b
σ4 t
2
1 2
bε
2b
σ4 t
1
2b
σ2
1
2b
σ2
+
+
December 9, 2013
1
A
148 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
Third de…nition (inverse of the Hessian): we know that
1
b asy b
θ = bI
V
T
Ht b
θ; e
rit j xt
1
bI b
θ =
T
0
[email protected]
Christophe Hurlin (University of Orléans)
1 >
x
b4 t
σ
T
∑
t =1
1
x x>
b2 t t
σ
e
rit
1
b
θ
Ht b
θ; e
rit j xt
b
xt> β
1
2b
σ4
1
x
b4 t
σ
1
b6
σ
e
rit
e
rit
b
xt> β
b
xt> β
December 9, 2013
2
1
A
149 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
Third de…nition (inverse of the Hessian):
Ht b
θ; e
rit j xt
0
[email protected]
1 >
x
b4 t
σ
1
x x>
b2 t t
σ
e
rit
b
xt> β
1
2b
σ4
1
x
b4 t
σ
1
b6
σ
Given the FOC (log-likelihood equations), ∑Tt=1 xt e
rit
e
rit
b
xt> β
2
b2 .
= Tσ
T
∑ Ht
t =1
b
θ; e
rit j xt
Christophe Hurlin (University of Orléans)
=
1
b2
σ
∑Tt=1 xt xt>
01 2
e
rit
e
rit
b
xt> β
b
xt> β
2
1
A
b = 0 and
xt> β
02
1
T
2b
σ4
!
December 9, 2013
150 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
Third de…nition (inverse of the Hessian):
θ coïncides with the …rst one:
So, in this case, the third estimator of bI b
1
bI b
θ =
T
T
∑
t =1
1
b asy b
V
θ = bI
T
θ; e
rit j xt
Ht b
Christophe Hurlin (University of Orléans)
=
1
b
θ
1
b2
Tσ
∑Tt=1 xt xt>
01 2
02
1
1
2b
σ4
December 9, 2013
!
151 / 207
5. Score, Hessian and Fisher Information
Solution (cont’d)
These three estimates of the asymptotic variance covariance matrix
are asymptotically equivalent, but can be largely di¤erent in …nite
sample...
1 1 b
b asy b
θ
θ = bI
V
T
with
1 T
bI b
θ =
It b
θ
T t∑
=1
>!
T
1
∂
`
θ;
e
r
x
∂
`
θ;
e
r
x
(
)
(
)
j
j
t
t
t
t
it
it
bI b
θ =
T t∑
∂θ
∂θ
b
b
θ
θ
=1
1
bI b
θ =
T
Christophe Hurlin (University of Orléans)
T
∑(
t =1
Ht (θ; e
rit j xt ))
December 9, 2013
152 / 207
5. Score, Hessian and Fisher Information
Christophe Hurlin (University of Orléans)
December 9, 2013
153 / 207
5. Score, Hessian and Fisher Information
Christophe Hurlin (University of Orléans)
December 9, 2013
154 / 207
5. Score, Hessian and Fisher Information
Christophe Hurlin (University of Orléans)
December 9, 2013
155 / 207
Key Concepts
1
Gradient and Hessian Matrix (deterministic elements).
2
Score Vector (random elements).
3
Hessian Matrix (random elements).
4
Fisher information matrix associated to the sample.
5
(Average) Fisher information matrix for one observation.
Christophe Hurlin (University of Orléans)
December 9, 2013
156 / 207
Section 6
Properties of Maximum Likelihood Estimators
Christophe Hurlin (University of Orléans)
December 9, 2013
157 / 207
6. Properties of Maximum Likelihood Estimators
Objectives
MLE is a good estimator? Under which conditions the MLE is
unbiased, consistent and corresponds to the BUE (Best Unbiased
Estimator)? => regularity conditions
Is the MLE consistent?
Is the MLE optimal or e¢ cient?
What is the asymptotic distribution of the MLE? The magic of the
MLE...
Christophe Hurlin (University of Orléans)
December 9, 2013
158 / 207
6. Properties of Maximum Likelihood Estimators
De…nition (Regularity conditions)
Greene (2007) identify three regularity conditions
R1 The …rst three derivatives of ln fX (θ; xi ) with respect to θ are
continuous and …nite for almost all xi and for all θ. This condition
ensures the existence of a certain Taylor series approximation and the
…nite variance of the derivatives of `i (θ; xi ).
R2 The conditions necessary to obtain the expectations of the …rst and
second derivatives of ln fX (θ; Xi ) are met.
R3 For all values of θ, ∂3 ln fX (θ; xi ) /∂θ i ∂θ j ∂θ k is less than a function
that has a …nite expectation. This condition will allow us to truncate the
Taylor series.
Christophe Hurlin (University of Orléans)
December 9, 2013
159 / 207
6. Properties of Maximum Likelihood Estimators
De…nition (Regularity conditions, Zivot 2001)
A pdf fX (θ; x ) is regular if and only of:
R1 The support of the random variables X , SX = fx : fX (θ; x ) > 0g,
does not depend on θ.
R2 fX (θ; x ) is at least three times di¤erentiable with respect to θ, and
these derivatives are continuous.
R3 The true value of θ lies in a compact set Θ.
Christophe Hurlin (University of Orléans)
December 9, 2013
160 / 207
6. Properties of Maximum Likelihood Estimators
Under these regularity conditions, the maximum likelihood estimator b
θ
possesses many appealing properties:
1
The maximum likelihood estimator is consistent.
2
The maximum likelihood estimator is asymptotically normal (the
magic of the MLE..).
3
The maximum likelihood estimator is asymptotically optimal or
e¢ cient.
4
The maximum likelihood estimator is equivariant: if b
θ is an estimator
of θ then g (b
θ ) is an estimator of g (θ ).
Christophe Hurlin (University of Orléans)
December 9, 2013
161 / 207
6. Properties of Maximum Likelihood Estimators
Theorem (Consistency)
Under regularity conditions, the maximum likelihood estimator is
consistent
p
b
θ ! θ0
N !∞
or equivalently:
p limb
θ = θ0
N !∞
where θ 0 denotes the true value of the parameter θ.
Christophe Hurlin (University of Orléans)
December 9, 2013
162 / 207
6. Properties of Maximum Likelihood Estimators
Sketch of the proof (Greene, 2007)
Because b
θ is the MLE, in any …nite sample, for any θ 6= b
θ (including the
true θ 0 ) it must be true that
θ; y j x
ln LN b
ln LN (θ; y j x )
Consider, then, the random variable LN (θ; Y j x ) /LN (θ 0 ; Y j x ). Because
the log function is strictly concave, from Jensen’s Inequality, we have
Eθ
ln
LN (θ; Y j x )
LN (θ 0 ; Y j x )
Christophe Hurlin (University of Orléans)
ln Eθ
LN (θ; Y j x )
LN (θ 0 ; Y j x )
December 9, 2013
163 / 207
6. Properties of Maximum Likelihood Estimators
Sketch of the proof, cont’d
The expectation on the right-hand side is exactly equal to one, as
Eθ
LN (θ; Y j x )
LN (θ 0 ; Y j x )
=
=
Z
Z
LN (θ; y j x )
LN (θ 0 ; y j x )
LN (θ 0 ; y j x ) dy
LN (θ; y j x ) dy
= 1
is simply the integral of a joint density.
Christophe Hurlin (University of Orléans)
December 9, 2013
164 / 207
6. Properties of Maximum Likelihood Estimators
Sketch of the proof, cont’d
So we have
Eθ
ln
LN (θ; Y j x )
LN (θ 0 ; Y j x )
ln Eθ
LN (θ; Y j x )
LN (θ 0 ; Y j x )
= ln (1) = 0
Divide the left hand side of this equation by N to produce
Eθ
1
ln LN (θ; Y j x )
N
Eθ
1
ln LN (θ 0 ; Y j x )
N
This produces a central result:
Christophe Hurlin (University of Orléans)
December 9, 2013
165 / 207
6. Properties of Maximum Likelihood Estimators
Theorem (Likelihood Inequality)
The expected value of the log-likelihood is maximized at the true value of
the parameters. For any θ, including b
θ:
Eθ
1
`N (θ 0 ; Yi j xi )
N
Christophe Hurlin (University of Orléans)
Eθ
1
`N (θ; Yi j xi )
N
December 9, 2013
166 / 207
6. Properties of Maximum Likelihood Estimators
Sketch of the proof, cont’d
Notice that
1
1
`N (θ; Yi j xi ) = ∑Ni=1 `i (θ; Yi j xi )
N
N
where the elements `i (θ; Yi j xi ) for i = 1, ..N are i.i.d.. So, using a law
of large numbers, we get:
1
p
`N (θ; Yi j xi ) ! Eθ
N !∞
N
Christophe Hurlin (University of Orléans)
1
`N (θ; Yi j xi )
N
December 9, 2013
167 / 207
6. Properties of Maximum Likelihood Estimators
Sketch of the proof, cont’d
The Likelihood inequality for θ = b
θ implies
Eθ
with
and thus
1
`N (θ 0 ; Yi j xi )
N
Eθ
1
`N bθ; Yi j xi
N
1
p
`N (θ 0 ; Yi j xi ) ! Eθ
N !∞
N
1
`N (θ 0 ; Yi j xi )
N
1
`N bθ; Yi j xi
N
1
`N bθ; Yi j xi
N
lim Pr
N !∞
p
! Eθ
N !∞
1
`N (θ 0 ; Yi j xi )
N
Christophe Hurlin (University of Orléans)
1
`N bθ; Yi j xi
N
=1
December 9, 2013
168 / 207
6. Properties of Maximum Likelihood Estimators
Sketch of the proof, cont’d So we have two results:
lim Pr
N !∞
1
`N (θ 0 ; Yi j xi )
N
1
`N bθ; Yi j xi
N
It necessarily implies that
1
`N bθ; Yi j xi
N
1
`N (θ 0 ; Yi j xi )
N
1
`N bθ; Yi j xi
N
p
!
N !∞
=1
8N
1
`N (θ 0 ; Yi j xi )
N
If θ is a scalar, we have immediatly:
b
θ
p
! θ0
N !∞
For a more general case with dim (θ ) = K , see a formal proof in Amemiya
(1985).
Amemiya T., (1985) Advanced Econometrics. Harvard University Press
Christophe Hurlin (University of Orléans)
December 9, 2013
169 / 207
6. Properties of Maximum Likelihood Estimators
Remark
The proof of the consistency of the MLE is largely easiest when we have a
formal expression for the maximum likelihood estimator b
θ
b
θ=b
θ (X1 , .., XN )
Christophe Hurlin (University of Orléans)
December 9, 2013
170 / 207
6. Properties of Maximum Likelihood Estimators
Example
Suppose that D1 , D2 , .., DN are i.i.d., positive random variable with
Di
Exp (θ 0 ), with
fD (d; θ ) =
1
exp
θ
E θ ( Di ) = θ 0
d
θ
, 8d 2 R+
Vθ (Di ) = θ 20
where θ 0 is the true value of θ. Question: show that the MLE is
consistent.
Christophe Hurlin (University of Orléans)
December 9, 2013
171 / 207
6. Properties of Maximum Likelihood Estimators
Solution
The log-likelihood function associated to the sample fd1 , .., dN g is de…ned
by:
1 N
`N (θ; d ) = N ln (θ )
di
θ i∑
=1
We admit that maximum likelihood estimator corresponds to the sample
mean:
1
b
θ = ∑N
Di
N i =1
Christophe Hurlin (University of Orléans)
December 9, 2013
172 / 207
6. Properties of Maximum Likelihood Estimators
Solution, cont’d
Then, we have:
1
E θ ( Di ) = θ
Eθ b
θ = ∑N
N i =1
As a consequence
and
b
θ is unbiased
1
θ2
Vθ b
θ = 2 ∑N
V
D
=
(
)
i
θ
i =1
N
N
Eθ b
θ =θ
b
θ
Christophe Hurlin (University of Orléans)
lim Vθ b
θ =0
N !∞
p
! θ
N !∞
December 9, 2013
173 / 207
6. Properties of Maximum Likelihood Estimators
Lemma
Under stronger conditions, the maximum likelihood estimator converges
almost surely to θ 0
p
a.s .
b
θ ! θ 0 =) b
θ ! θ0
N !∞
Christophe Hurlin (University of Orléans)
N !∞
December 9, 2013
174 / 207
6. Properties of Maximum Likelihood Estimators
1
If we restrict ourselves to the class of unbiased estimators (linear and
nonlinear) then we de…ne the best estimator as the one with the
smallest variance.
2
With linear estimators (next chapter), the Gauss-Markov theorem tells
us that the ordinary least squares (OLS) estimator is best (BLUE).
3
When we expand the class of estimators to include linear and
nonlinear estimators it turns out that we can establish an absolute
lower bound on the variance of any unbiased estimator b
θ of θ under
certain conditions.
4
θ has a variance that is equal to the
Then if an unbiased estimator b
lower bound then we have found the best unbiased estimator
(BUE).
Christophe Hurlin (University of Orléans)
December 9, 2013
175 / 207
6. Properties of Maximum Likelihood Estimators
De…nition (Cramer-Rao or FDCR bound)
Let X1 , .., XN be an i.i.d. sample with pdf fX (θ; x ). Let b
θ be an unbiased
estimator of θ; i.e., Eθ (b
θ ) = θ. If fX (θ; x ) is regular then
Vθ b
θ
I N 1 (θ 0 )
FDCR or Cramer-Rao bound
where I N (θ 0 ) denotes the Fisher information number for the sample
evaluated at the true value θ 0 .
Christophe Hurlin (University of Orléans)
December 9, 2013
176 / 207
6. Properties of Maximum Likelihood Estimators
Remarks
1
Hence, the Cramer-Rao Bound is the inverse of the information matrix
associated to the sample. Reminder: three de…nitions for I N (θ 0 ) .
!
∂`N (θ; Y j x )
I N ( θ 0 ) = Vθ
∂θ
θ0
I N ( θ 0 ) = Eθ
I N ( θ 0 ) = Eθ
2
If θ is a vector then Vθ b
θ
is positive semi-de…nite
Christophe Hurlin (University of Orléans)
∂`N (θ; Y j x )>
∂θ
θ0
!
∂2 `N (θ; Y j x )
∂`N (θ; Y j x )
∂θ
∂θ∂θ >
θ0
θ0
I N 1 (θ 0 ) means that Vθ b
θ
!
I N 1 (θ 0 )
December 9, 2013
177 / 207
6. Properties of Maximum Likelihood Estimators
Theorem (E¢ ciency)
Under regularity conditions, the maximum likelihood estimator is
asymptotically e¢ cient and attains the FDCR (Frechet - Darnois Cramer - Rao) or Cramer-Rao bound:
θ = I N 1 (θ 0 )
Vθ b
where I N (θ 0 ) denotes the Fisher information matrix associated to the
sample evaluated at the true value θ 0 .
Christophe Hurlin (University of Orléans)
December 9, 2013
178 / 207
6. Properties of Maximum Likelihood Estimators
Example (Exponential Distribution)
Suppose that D1 , D2 , .., DN are i.i.d., positive random variable with
Di
Exp (θ 0 ), with
fD (d; θ ) =
1
exp
θ
E θ ( Di ) = θ 0
d
θ
, 8d 2 R+
Vθ (Di ) = θ 20
where θ 0 is the true value of θ. Question: show that the MLE is e¢ cient.
Christophe Hurlin (University of Orléans)
December 9, 2013
179 / 207
6. Properties of Maximum Likelihood Estimators
Solution
We shown that the maximum likelihood estimator corresponds to the
sample mean,
1 N
b
Di
θ=
N i∑
=1
θ2
Vθ b
θ = 0
N
Eθ b
θ = θ0
Christophe Hurlin (University of Orléans)
December 9, 2013
180 / 207
6. Properties of Maximum Likelihood Estimators
Solution, cont’d
The log-likelihood function is
`N (θ; d ) =
N ln (θ )
1 N
di
θ i∑
=1
The score vector is de…ned by:
sN (θ; D ) =
Christophe Hurlin (University of Orléans)
∂`N (θ; D )
=
∂θ
N
1
+ 2
θ
θ
N
∑ Di
i =1
December 9, 2013
181 / 207
6. Properties of Maximum Likelihood Estimators
Solution, cont’d
Let us use one of the three de…nitions of the information quantity I N (θ ) :
I N ( θ ) = Vθ
= Vθ
=
=
∂`N (θ; D )
∂θ
N
1
+ 2
θ
θ
N
∑ Di
i =1
!
1 N
∑ i = 1 V θ ( Di )
θ4
Nθ 2
N
= 2
4
θ
θ
Then, b
θ is e¢ cient and attains the Cramer-Rao bound.
Christophe Hurlin (University of Orléans)
θ2
Vθ b
θ = I N 1 (θ 0 ) =
N
December 9, 2013
182 / 207
6. Properties of Maximum Likelihood Estimators
Theorem (Convergence of the MLE)
Under suitable regularity conditions, the MLE is asymptotically normally
distributed with
p
d
N b
θ θ 0 ! N 0, I 1 (θ 0 )
where θ 0 denotes the true value of the parameter and I (θ 0 ) the (average)
Fisher information matrix for one observation.
Christophe Hurlin (University of Orléans)
December 9, 2013
183 / 207
6. Properties of Maximum Likelihood Estimators
Corollary
Another way, to write this result, is to say that for large sample size N, the
MLE b
θ is approximatively distributed according a normal distribution
or equivalently
b
θ
asy
b
θ
N θ0 , N
asy
1
I
1
(θ 0 )
N θ 0 , I N 1 (θ 0 )
where I N (θ 0 ) = N I (θ 0 ) denotes the Fisher information matrix
associated to the sample.
Christophe Hurlin (University of Orléans)
December 9, 2013
184 / 207
6. Properties of Maximum Likelihood Estimators
De…nition (Asymptotic Variance)
The asymptotic variance of the MLE is de…ned by:
Vasy b
θ = I N 1 (θ 0 )
where I N (θ 0 ) denotes the Fisher information matrix associated to the
sample. This asymptotic variance of the MLE corresponds to the
Cramer-Rao or FDCR bound.
Christophe Hurlin (University of Orléans)
December 9, 2013
185 / 207
6. Properties of Maximum Likelihood Estimators
The magic of the MLE
Christophe Hurlin (University of Orléans)
December 9, 2013
186 / 207
6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence)
At the maximum likelihood estimator, the gradient of the log-likelihood
equals zero (FOC):
gN b
θ
(K ,1 )
gN b
θ; y j x
=
∂`N (θ; y j x )
∂θ
b
θ
= 0K
where b
θ=b
θ (x ) denotes here the ML estimate. Expand this set of
equations in a Taylor series around the true parameters θ 0 . We will use the
mean value theorem to truncate the Taylor series at the second term:
gN b
θ = gN (θ 0 ) + HN θ
b
θ
θ0 = 0
The Hessian is evaluated at a point θ that is between b
θ and θ 0 , for
b
instance θ = ω θ + (1 ω ) θ 0 for some 0 < ω < 1.
Christophe Hurlin (University of Orléans)
December 9, 2013
187 / 207
6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence, cont’d)
We then rearrange this equation and multiply the result by
p
1 p
N b
θ θ0 =
HN θ
NgN (θ 0 )
p
N to obtain:
By dividing HN θ and gN (θ 0 ) by N, we obtain:
p
N b
θ
θ0
1
HN θ
N
1
=
1
HN θ
N
1
=
p
p
N
1
gN (θ 0 )
N
Ng (θ 0 )
where g (θ 0 ) denotes the sample mean of the individual gradient vectors
g (θ 0 ) =
Christophe Hurlin (University of Orléans)
1
1
gN (θ 0 ) =
N
N
N
∑ gi (θ 0 ; yi j xi )
i =1
December 9, 2013
188 / 207
6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence, cont’d)
Let us now consider the same expression in terms of random variables: b
θ
now denotes the ML estimator, HN θ = HN θ; Y j x and sN (θ 0 ; Y j x )
the score vector. We have:
p
N b
θ
θ0 =
1
HN θ; Y j x
N
1
p
Ns (θ 0 ; Y j x )
where the score vectors associated to the variables Yi are i.i.d.
s (θ 0 ; Y j x ) =
Christophe Hurlin (University of Orléans)
1
N
N
∑ si (θ 0 ; Yi j xi )
i =1
December 9, 2013
189 / 207
6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence, cont’d)
Let us consider the …rst element:
s (θ 0 ) =
1
N
N
∑ si (θ 0 ; Yi j xi )
i =1
The individual scores si (θ 0 ; Yi j xi ) are i.i.d. with
Eθ (si (θ 0 ; Yi j xi )) = 0
Ex Vθ (si (θ 0 ; Yi j xi )) = Ex (I i (θ 0 )) = I (θ 0 )
By using the Lindberg-Levy Central Limit Theorem, we have:
p
Christophe Hurlin (University of Orléans)
d
Ns (θ 0 ) ! N (0, I (θ 0 ))
December 9, 2013
190 / 207
6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence, cont’d)
We known that:
1
HN θ; Y j x =
N
1
N
N
∑ Hi
θ; Yi j xi
i =1
where the hessian matrices Hi θ; Yi j xi are i.i.d. Besides, because
plim b
θ θ 0 = 0, plim θ θ 0 = 0 as well. By applying a law of large
numbers, we get:
1
p
HN θ; Y j x ! EX Eθ ( Hi (θ 0 ; Yi j xi ))
N
with
EX Eθ ( Hi (θ 0 ; Yi j xi )) = EX Eθ
Christophe Hurlin (University of Orléans)
∂2 `i (θ; Yi j xi )
∂θ∂θ >
= I (θ 0 )
December 9, 2013
191 / 207
6. Properties of Maximum Likelihood Estimators
Reminder:
If XN and YN verify
XN
(K ,K )
d
YN
(K ,1 )
!N
p
!
X
(K ,K )
0 , Σ
(K ,1 ) (K ,K )
then
d
XN YN
(K ,K )(K ,1 )
Christophe Hurlin (University of Orléans)
!N
0 , X
Σ
X>
(K ,1 ) (K ,K )(K ,K )(K ,K )
December 9, 2013
192 / 207
6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence, cont’d)
Here we have
p
θ
N b
θ0 =
1
HN θ; Y j x
N
1
p
Ns (θ 0 ; Y j x )
1
1
p
HN θ; Y j x
! I 1 (θ 0 ) symmetric matrix
N
p
d
Ns (θ 0 ) ! N (0, I (θ 0 ))
Then, we get:
p
N b
θ
Christophe Hurlin (University of Orléans)
d
θ 0 ! N 0, I
1
(θ 0 ) I (θ 0 ) I
1
(θ 0 )
December 9, 2013
193 / 207
6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence, cont’d)
And …nally....
p
θ
N b
d
θ 0 ! N 0, I
1
(θ 0 )
The magic of the MLE.....
Christophe Hurlin (University of Orléans)
December 9, 2013
194 / 207
6. Properties of Maximum Likelihood Estimators
Example (Exponential Distribution)
Suppose that D1 , D2 , .., DN are i.i.d., positive random variable with
Di
Exp (θ 0 ), with
fD (d; θ ) =
1
exp
θ
E θ ( Di ) = θ 0
d
θ
, 8d 2 R+
Vθ (Di ) = θ 20
where θ 0 is the true value of θ. Question: what is the asymptotic
distribution of the MLE? Propose a consistent estimator of the asymptotic
variance of b
θ.
Christophe Hurlin (University of Orléans)
December 9, 2013
195 / 207
6. Properties of Maximum Likelihood Estimators
Solution
We shown that b
θ = (1/N ) ∑N
i =1 Di and:
si (θ; Di ) =
∂`i (θ; Di )
=
∂θ
1 Di
+ 2
θ
θ
The (average) Fisher information matrix associated to Di is:
1 Di
+ 2
θ
θ
I ( θ ) = Vθ
=
1
1
V D = 2
4 θ ( i)
θ
θ
Then, the asymptotic distribution of b
θ is:
p
d
N b
θ θ 0 ! N 0, θ 2
or equivalently
Christophe Hurlin (University of Orléans)
b
θ
asy
N
θ2
θ0 ,
N
!
December 9, 2013
196 / 207
6. Properties of Maximum Likelihood Estimators
Solution, cont’d
The asymptotic variance of b
θ is:
θ2
Vasy b
θ =
N
θ is simply de…ned by:
A consistent estimator of Vas b
b asy
V
Christophe Hurlin (University of Orléans)
2
b
θ
b
θ =
N
December 9, 2013
197 / 207
6. Properties of Maximum Likelihood Estimators
Example (Linear Regression Model)
Let us consider the previous linear regression model yi = xi> β + εi , with εi
N .i.d. 0, σ2 . Let us denote θ the K + 1 1 vector de…ned by
>
θ = β> σ2
b
β=
N
∑
i =1
. The MLE estimator of θ is de…ned by:
Xi Xi>
!
1
N
∑
i =1
b
θ=
Xi> Yi
b
β
b2
σ
!
b2 =
σ
1
N
N
∑
i =1
Yi
Xi> b
β
2
Question: what is the asymptotic distribution of b
θ? Propose an estimator
of the asymptotic variance.
Christophe Hurlin (University of Orléans)
December 9, 2013
198 / 207
6. Properties of Maximum Likelihood Estimators
Solution
This model satisfy the regularity conditions. We shown that the average
Fisher information matrix is equal to:
1
E
σ2 X
I (θ ) =
Xi Xi>
0
0
1
2σ4
From the MLE convergence theorem, we get immediately:
p
N b
θ
d
θ 0 ! N 0, I
1
(θ 0 )
where θ 0 is the true value of θ.
Christophe Hurlin (University of Orléans)
December 9, 2013
199 / 207
6. Properties of Maximum Likelihood Estimators
Solution, cont’d
The asymptotic variance covariance matrix of b
θ is equal to:
with
Vasy b
θ =N
I N (θ ) =
Christophe Hurlin (University of Orléans)
1
I
N
E
σ2 X
1
(θ 0 ) = I N 1 (θ 0 )
Xi Xi>
0
0
N
2σ4
December 9, 2013
200 / 207
6. Properties of Maximum Likelihood Estimators
Solution, cont’d
A consistent estimate of I N (θ ) is:
with
bI N (θ ) = V
b asy1 b
θ =
bX = 1
Q
N
Christophe Hurlin (University of Orléans)
N b
Q
b2 X
σ
0
0
N
2b
σ4
!
N
∑ xi xi>
i =1
December 9, 2013
201 / 207
6. Properties of Maximum Likelihood Estimators
Solution, cont’d
Thus we get:
b
β
asy
Christophe Hurlin (University of Orléans)
N
b
σ
>
b 2 ∑N
β0 , σ
i =1 xi xi
2 asy
N
2b
σ4
σ20 ,
N
1
!
December 9, 2013
202 / 207
6. Properties of Maximum Likelihood Estimators
Summary
Under regular conditions
1
The MLE is consistent.
2
The MLE is asymptotically e¢ cient and its variance attains the
FDCR or Cramer-Rao bound.
3
The MLE is asymptotically normally distributed.
Christophe Hurlin (University of Orléans)
December 9, 2013
203 / 207
6. Properties of Maximum Likelihood Estimators
But, …nite sample properties can be very di¤erent from large sample
properties:
1
The maximum likelihood estimator is consistent but can be severely
biased in …nite samples
2
The estimation of the variance-covariance matrix can be seriously
doubtful in …nite samples.
Christophe Hurlin (University of Orléans)
December 9, 2013
204 / 207
6. Properties of Maximum Likelihood Estimators
Theorem (Equivariance)
Under regular conditions and if g (.) is a continuously di¤erentiable
function of θ and is de…ned from RK to RP , then:
p
N g b
θ
p
g b
θ ! g (θ 0 )
d
g (θ 0 ) ! N 0, G (θ 0 ) I
1
(θ 0 ) G (θ 0 )>
where θ 0 is the true value of the parameters and the matrix G (θ 0 ) is
de…ned by
∂g (θ )
G (θ ) =
∂θ >
(P ,K )
Christophe Hurlin (University of Orléans)
December 9, 2013
205 / 207
Key Concepts of the Chapter 2
1
Likelihood and log-likelihood function
2
Maximum likelihood estimator (MLE) and Maximum likelihood
estimate
3
Gradient and Hessian Matrix (deterministic elements)
4
Score Vector and Hessian Matrix (random elements)
5
Fisher information matrix associated to the sample
6
(Average) Fisher information matrix for one observation
7
FDCR or Cramer Rao Bound: the notion of e¢ ciency
8
Asymptotic distribution of the MLE
9
Asymptotic variance of the MLE
10
Estimator of the asymptotic variance of the MLE
Christophe Hurlin (University of Orléans)