Maximum Likelihood Estimation
1 Estimation
Let w = (w1, . . . , wn) denote the data, where, depending on the context, wi may denote a random variable or, with a slight abuse of notation, its (possible) realization(s). Let θ denote a finite-dimensional parameter. We let
L(θ; w) = f (w; θ)
denote the Likelihood function, i.e., the probability density or mass function
of the data considered as a function of the parameter. Throughout we will
assume that the data are iid. Therefore, we have
L(θ; w) = f(w; θ) = ∏_{i=1}^n f(wi; θ),
where (with a slight abuse of notation) f (wi ; θ) denotes the probability density or mass function of wi .
Example 1. wi is a Bernoulli random variable, i.e.,

wi = 1 with probability θ and wi = 0 with probability 1 − θ.
Then, the Likelihood function of a random sample (a sample of iid random
variables) is given by
L(θ; w) = ∏_{i=1}^n θ^{wi} (1 − θ)^{1−wi}.
In what follows, it will turn out to be convenient to work with a one-to-one transformation of the Likelihood function, namely the log-likelihood
(function) given by
l(θ; w) ≡ log L(θ; w) = log ∏_{i=1}^n f(wi; θ) = ∑_{i=1}^n log f(wi; θ),

where ≡ denotes “equals by definition.”
Example 1 - continued.
l(θ; w) = ∑_{i=1}^n [ wi log θ + (1 − wi) log(1 − θ) ].
The Maximum Likelihood Estimator (MLE) is given by the maximizer
of the Likelihood function or equivalently of the log-likelihood. Under some
regularity conditions, the maximizer, say θ̂, is given by the parameter value
that satisfies the first order condition(s), i.e.,
∂/∂θ l(θ̂; w) ≡ ∂/∂θ l(θ; w) |_{θ=θ̂} = 0.   (1)
Example 1 - continued.
∂/∂θ l(θ̂; w) = 0

⇔   ∑_{i=1}^n [ wi (1/θ̂) + (1 − wi) (1/(1 − θ̂)) (−1) ] = 0

⇔   (1/θ̂) ∑_{i=1}^n wi = (1/(1 − θ̂)) ∑_{i=1}^n (1 − wi)

⇔   (1 − θ̂) ∑_{i=1}^n wi = θ̂ ( n − ∑_{i=1}^n wi )

⇔   ∑_{i=1}^n wi = θ̂ n

⇔   θ̂ = (1/n) ∑_{i=1}^n wi ≡ w̄.
The MLE is given by the sample average, θ̂ = w̄.
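The following Python sketch is an illustration added to these notes, not part of the original derivation; the sample size, seed, and true value θ∗ = 0.3 are arbitrary choices, and numpy/scipy are assumed to be available. It simulates Bernoulli data and checks that numerically maximizing the log-likelihood reproduces the closed-form MLE θ̂ = w̄.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_star = 0.3                                   # assumed true parameter
w = rng.binomial(1, theta_star, size=1000)         # iid Bernoulli(theta_star) sample

def neg_loglik(theta, w):
    """Negative log-likelihood: -sum_i [ wi log(theta) + (1 - wi) log(1 - theta) ]."""
    return -np.sum(w * np.log(theta) + (1 - w) * np.log(1 - theta))

theta_hat_closed = w.mean()                        # closed-form MLE: the sample average

# Numerical MLE: maximize the log-likelihood over (0, 1)
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), args=(w,), method="bounded")

print(theta_hat_closed, res.x)                     # the two should essentially coincide
```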
2 Asymptotic properties
Next, we will derive several asymptotic properties of the MLE. We will rely
on (where all limits are taken as “n → ∞”)
• Law of Large Numbers (LLN): Let x1, . . . , xn be an iid random sample with µ ≡ E[xi]. Then,

  x̄ →p µ,

  where x̄ ≡ (1/n) ∑_{i=1}^n xi and where “→p” denotes “convergence in probability.”
• Central Limit Theorem (CLT): Let x1, . . . , xn be an iid random sample with µ ≡ E[xi] and σ² ≡ Var[xi] < ∞. Then,

  √n(x̄ − µ) →d N(0, σ²),

  where x̄ ≡ (1/n) ∑_{i=1}^n xi and where “→d” denotes “convergence in distribution.” (A small simulation illustrating the LLN and the CLT follows this list.)
• Continuous Mapping Theorem (CMT): Let {Xn} be a sequence of random variables (e.g., Xn = x̄; the sample average depends on the sample size n). Then, for any continuous function g(·),

  Xn →p c ⇒ g(Xn) →p g(c)   and   Xn →d X ⇒ g(Xn) →d g(X),

  where c denotes some constant and X some random variable.
• Slutsky: Let {Xn} and {Yn} denote two sequences of random variables such that Xn →d X and Yn →p c for some random variable X and some constant c. Then,

  Xn + Yn →d X + c,   Xn Yn →d cX,   and   Xn/Yn →d X/c (for c ≠ 0).
• Mean value Theorem: For any continuously differentiable function g(·),

  g(x) = g(x∗) + g′(x∗∗)(x − x∗),

  where x∗∗ ∈ [x, x∗] (assuming x∗ > x).
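As a quick illustration of the LLN and the CLT (a simulation sketch, not part of the original notes; the Bernoulli design, sample sizes, and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3                                   # assumed Bernoulli success probability
mu, sigma2 = theta_star, theta_star * (1 - theta_star)

# LLN: the sample average approaches mu as n grows
for n in (10, 100, 10_000, 1_000_000):
    print(n, rng.binomial(1, theta_star, size=n).mean())

# CLT: across many samples of size n, sqrt(n)*(xbar - mu) has mean ~ 0 and variance ~ sigma2
n, reps = 500, 20_000
xbar = rng.binomial(1, theta_star, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu)
print(z.mean(), z.var(), sigma2)
```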
In what follows, it will be convenient to define the “individual” log-likelihood
l(θ; wi ) ≡ log f (wi ; θ)
such that

l(θ; w) = ∑_{i=1}^n l(θ; wi).
Similarly, we call

∂/∂θ l(θ; wi)

the “individual” score and let

∂/∂θ l(θ; w) = ∑_{i=1}^n ∂/∂θ l(θ; wi)

be the score. Let θ∗ denote the true parameter under which the data were generated. Then, we have that
E[ ∂/∂θ l(θ∗; wi) ] = 0,   (2)

where

∂/∂θ l(θ∗; wi) ≡ ∂/∂θ l(θ; wi) |_{θ=θ∗}.
To show that equation (2) holds, note that
E[ ∂/∂θ l(θ∗; wi) ] = ∫ ( ∂/∂θ l(θ∗; wi) ) f(wi; θ∗) dwi,
where the expectation is taken with respect to the true data generating process (dgp), i.e., we are integrating with respect to f (wi ; θ∗ ). Using l(θ; wi ) =
log f (wi ; θ), we have (under some regularity conditions that allow us to
change the order of integration and differentiation)
E[ ∂/∂θ l(θ∗; wi) ] = ∫ (1/f(wi; θ∗)) ( ∂/∂θ f(wi; θ∗) ) f(wi; θ∗) dwi

= ∫ ∂/∂θ f(wi; θ∗) dwi = ∂/∂θ ∫ f(wi; θ∗) dwi = ∂/∂θ 1 = 0.
Note that equation (2) can be seen as a population analogue of the first order condition given in equation (1); the “missing” factor 1/n is immaterial. Similarly, this can be seen as motivation for the MLE: the true value θ∗ sets the expected (“individual”) score equal to zero, and the MLE θ̂ sets the sample analogue/average equal to zero. Since sample averages converge (in probability) to population moments (cf. LLN), we would hope that the solution to the “sample problem” (what value of θ solves equation (1)?) converges (in probability) to the solution of the “population problem” (what value of θ solves equation (2)?). It can be shown that this intuition is correct and that under certain regularity conditions the MLE is consistent, i.e.,

θ̂ →p θ∗.

For the purpose of this class, we take this as given.
Example 1 - continued. Check for yourself that equation (2) holds.
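A numerical version of that check (a sketch, not from the notes; θ∗ and the Monte Carlo sample size are assumptions): for the Bernoulli model the individual score at θ∗ is wi/θ∗ − (1 − wi)/(1 − θ∗), whose expectation is θ∗/θ∗ − (1 − θ∗)/(1 − θ∗) = 0, so a simulated average of the score should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = 0.3                                   # assumed true parameter
w = rng.binomial(1, theta_star, size=1_000_000)

# Individual Bernoulli score at theta_star: wi/theta - (1 - wi)/(1 - theta)
score = w / theta_star - (1 - w) / (1 - theta_star)

print(score.mean())                                # should be close to 0, cf. equation (2)
```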
In order to derive the asymptotic distribution of the MLE, we consider a mean value (first-order Taylor) expansion of the score ∂/∂θ l(·; w) at θ̂ around the true value θ∗,
∂/∂θ l(θ̂; w) = ∂/∂θ l(θ∗; w) + ∂²/∂θ² l(θ̄; w) (θ̂ − θ∗),

where θ̄ lies between θ̂ and θ∗.
Since the left hand side is zero by “construction” (see equation (1)), we have
−∂²/∂θ² l(θ̄; w) (θ̂ − θ∗) = ∂/∂θ l(θ∗; w)

⇔   (θ̂ − θ∗) = [ −∂²/∂θ² l(θ̄; w) ]^{-1} ∂/∂θ l(θ∗; w)

⇔   √n (θ̂ − θ∗) = [ −(1/n) ∂²/∂θ² l(θ̄; w) ]^{-1} √n (1/n) ∂/∂θ l(θ∗; w).   (3)
Let’s consider the two terms on the right hand side in turn. Note that
(1/n) ∂²/∂θ² l(θ∗; w) = (1/n) ∑_{i=1}^n ∂²/∂θ² l(θ∗; wi) →p E[ ∂²/∂θ² l(θ∗; wi) ]
by the LLN. Given that θ̂ →p θ∗, which implies that θ̄ →p θ∗ (since θ̄ is “between” θ̂ and θ∗), it can be shown that
(1/n) ∂²/∂θ² l(θ̄; w) →p E[ ∂²/∂θ² l(θ∗; wi) ].
Therefore, by the CMT we have that the first term on the right hand side of
equation (3) satisfies
[ −(1/n) ∂²/∂θ² l(θ̄; w) ]^{-1} →p [ −E( ∂²/∂θ² l(θ∗; wi) ) ]^{-1}.
Now, consider the second term on the right hand side of equation (3). Given
equation (2), we have
√n (1/n) ∂/∂θ l(θ∗; w) = √n [ (1/n) ∂/∂θ l(θ∗; w) − E( ∂/∂θ l(θ∗; wi) ) ] →d N( 0, E[ ( ∂/∂θ l(θ∗; wi) )² ] )   (4)
by the CLT, where we have used the fact that

Var[ ∂/∂θ l(θ∗; wi) ] = E[ ( ∂/∂θ l(θ∗; wi) )² ] − ( E[ ∂/∂θ l(θ∗; wi) ] )² = E[ ( ∂/∂θ l(θ∗; wi) )² ]
using equation (2) again.
Combining the two results (using Slutsky), we get
√n (θ̂ − θ∗) →d N( 0, [ −E( ∂²/∂θ² l(θ∗; wi) ) ]^{-2} E[ ( ∂/∂θ l(θ∗; wi) )² ] ).
This result further simplifies given the so-called information equality, i.e.,

E[ ( ∂/∂θ l(θ∗; wi) )² ] = −E[ ∂²/∂θ² l(θ∗; wi) ].   (5)
The information equality can be derived as follows (recall l(θ; wi ) = log f (wi ; θ))
E[ ∂²/∂θ² l(θ∗; wi) ] = E[ ∂/∂θ ( ∂/∂θ l(θ∗; wi) ) ] = E[ ∂/∂θ ( (1/f(wi; θ∗)) ∂/∂θ f(wi; θ∗) ) ]

= E[ −( (1/f(wi; θ∗)) ∂/∂θ f(wi; θ∗) )² ] + E[ (1/f(wi; θ∗)) ∂²/∂θ² f(wi; θ∗) ]

= −E[ ( ∂/∂θ l(θ∗; wi) )² ] + 0.
The second term is zero because
E[ (1/f(wi; θ∗)) ∂²/∂θ² f(wi; θ∗) ] = ∫ (1/f(wi; θ∗)) ( ∂²/∂θ² f(wi; θ∗) ) f(wi; θ∗) dwi

= ∫ ∂²/∂θ² f(wi; θ∗) dwi = ∂²/∂θ² ∫ f(wi; θ∗) dwi = ∂²/∂θ² 1 = 0,
where changing the order of integration and differentiation is allowed under
certain regularity conditions.
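For Example 1, the information equality can also be verified numerically (a sketch with an assumed θ∗ and simulation size); for the Bernoulli model both sides equal 1/(θ∗(1 − θ∗)).

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star = 0.3                                   # assumed true parameter
w = rng.binomial(1, theta_star, size=1_000_000)

score = w / theta_star - (1 - w) / (1 - theta_star)            # d/dtheta l(theta*; wi)
hess = -w / theta_star**2 - (1 - w) / (1 - theta_star)**2      # d^2/dtheta^2 l(theta*; wi)

print(np.mean(score**2))                           # ~ E[(dl/dtheta)^2]
print(-np.mean(hess))                              # ~ -E[d^2 l/dtheta^2], same by equation (5)
print(1 / (theta_star * (1 - theta_star)))         # analytic value of both sides
```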
Given equation (5), we obtain our “final” result
√n (θ̂ − θ∗) →d N( 0, [ −E( ∂²/∂θ² l(θ∗; wi) ) ]^{-1} ).   (6)

−E[ ∂²/∂θ² l(θ∗; wi) ] is called the Fisher “information”. The term information is
intuitive because it corresponds to (the negative of) the expected second derivative of the log-likelihood function. If the log-likelihood function has a lot of “curvature”, then it will be easy to find its maximum, i.e., we have a lot of information, and the MLE will have a small variance, which is given by the inverse of the information (a lot of information = small variance).
Example 1 - continued. The MLE is given by w̄. Since there exists a closed
form expression for the MLE (θ̂ = w̄) we can derive its asymptotic distribution directly. In particular, we have (using θ∗ as the true value for notational
consistency)
√n (θ̂ − θ∗) = √n (w̄ − θ∗) →d N( 0, θ∗(1 − θ∗) )
by a CLT, since Var(wi ) = θ∗ (1 − θ∗ ). Check for yourself that
−E[ ∂²/∂θ² l(θ∗; wi) ] = 1/( θ∗(1 − θ∗) ),
i.e., the above theory gives the same result as the standard CLT in this
example.
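A simulation sketch of this result (the design values n = 500, 20,000 replications, and θ∗ = 0.3 are assumptions): across replications, the variance of √n(θ̂ − θ∗) should be close to θ∗(1 − θ∗), the inverse of the information.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star, n, reps = 0.3, 500, 20_000             # assumed design

theta_hat = rng.binomial(1, theta_star, size=(reps, n)).mean(axis=1)   # MLE = sample mean
z = np.sqrt(n) * (theta_hat - theta_star)

print(z.var(), theta_star * (1 - theta_star))      # simulated vs. asymptotic variance
```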
3 Inference

3.1 t-test
We can use the result in equation (6) to do inference, i.e., test certain hypotheses of interest and construct confidence intervals. In order to test
H0 : θ∗ = θ0   (7)
we can for example rely on the t-test. Under H0 , i.e., if the true value of θ,
namely θ∗ , equals θ0 , we have that
√n (θ̂ − θ0) →d N( 0, [ −E( ∂²/∂θ² l(θ0; wi) ) ]^{-1} ).

We cannot use this result directly because E[ ∂²/∂θ² l(θ0; wi) ] is unknown. But
we can replace it with a consistent estimator:
(1/n) ∑_{i=1}^n ∂²/∂θ² l(θ0; wi) →p E[ ∂²/∂θ² l(θ0; wi) ].   (8)
Note that θ0 is “known” under the null hypothesis and therefore does not
need to be replaced with a (consistent) estimator. However, in practice we
often use
(1/n) ∑_{i=1}^n ∂²/∂θ² l(θ̂; wi) →p E[ ∂²/∂θ² l(θ0; wi) ].   (9)
The typical t-statistic used in practice is then given by

√n (θ̂ − θ0) / √( [ −(1/n) ∑_{i=1}^n ∂²/∂θ² l(θ̂; wi) ]^{-1} ) = (θ̂ − θ0) / √( [ −∑_{i=1}^n ∂²/∂θ² l(θ̂; wi) ]^{-1} )
and satisfies
√n (θ̂ − θ0) / √( [ −(1/n) ∑_{i=1}^n ∂²/∂θ² l(θ̂; wi) ]^{-1} ) →d N(0, 1),   (10)
by Slutsky. In order to test H0, we rely on the t-test that rejects if the absolute value of the t-statistic (because the testing problem is two-sided: H0 : θ∗ = θ0 vs. H1 : θ∗ ≠ θ0, where H1 denotes the “alternative hypothesis”) exceeds the corresponding critical value, z_{1−α/2}. The corresponding 1 − α confidence interval (CI) is given by


θ̂ ± z_{1−α/2} √( [ −∑_{i=1}^n ∂²/∂θ² l(θ̂; wi) ]^{-1} ).
Inspection of the CI explains why we typically use the estimator given in
equation (9) rather than the one given in equation (8): We would not be
able to write down the CI in such a compact form if we used equation (8).
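A minimal Python sketch of the t-test and the CI for the Bernoulli example (the dgp value, the null value θ0, and the level α are assumptions); for this model one can check that −∑_{i=1}^n ∂²/∂θ² l(θ̂; wi) = n/(θ̂(1 − θ̂)).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
theta_star, theta0, alpha = 0.3, 0.25, 0.05        # assumed dgp value, null value, level
w = rng.binomial(1, theta_star, size=500)
n, theta_hat = len(w), w.mean()

# Observed information: -sum_i d^2/dtheta^2 l(theta_hat; wi) = n / (theta_hat (1 - theta_hat))
neg_hess = n / (theta_hat * (1 - theta_hat))
se = np.sqrt(1 / neg_hess)

t_stat = (theta_hat - theta0) / se                 # t-statistic
z_crit = norm.ppf(1 - alpha / 2)                   # critical value z_{1-alpha/2}

print(t_stat, abs(t_stat) > z_crit)                # statistic and rejection decision
print(theta_hat - z_crit * se, theta_hat + z_crit * se)   # 1 - alpha confidence interval
```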
3.1.1 Delta method
A very useful result is given by the delta method (often used in combination with the t-test), which states that for any continuously differentiable function g(θ) for which g′(θ∗) ≠ 0 (where g′(θ) denotes the derivative of g(θ))
√n ( g(θ̂) − g(θ∗) ) →d N( 0, ( g′(θ∗) )² [ −E( ∂²/∂θ² l(θ∗; wi) ) ]^{-1} ).
This follows from the mean value theorem, i.e.,
g(θ̂) = g(θ∗) + g′(θ̄) (θ̂ − θ∗)

⇔   √n ( g(θ̂) − g(θ∗) ) = g′(θ̄) √n (θ̂ − θ∗),
g′(θ̄) →p g′(θ∗) (by the CMT), equation (6), and Slutsky. Similar to above, we can use the delta method to test H0 : g(θ∗) = g(θ0) or to construct confidence intervals for g(θ∗) by replacing the unknown variance by a (consistent) estimator such as
( g′(θ̂) )² [ −(1/n) ∑_{i=1}^n ∂²/∂θ² l(θ̂; wi) ]^{-1} →p ( g′(θ∗) )² [ −E( ∂²/∂θ² l(θ∗; wi) ) ]^{-1}.
See here for a multivariate version of the delta method which is very useful
(not only for this course).
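A sketch of the delta method for the Bernoulli example with the (illustrative, not from the notes) choice g(θ) = log(θ/(1 − θ)), the log-odds, for which g′(θ) = 1/(θ(1 − θ)); the design values are assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
theta_star, alpha = 0.3, 0.05                      # assumed dgp value and level
w = rng.binomial(1, theta_star, size=500)
n, theta_hat = len(w), w.mean()

g = lambda t: np.log(t / (1 - t))                  # illustrative g: the log-odds
g_prime = lambda t: 1 / (t * (1 - t))              # its derivative

var_theta_hat = theta_hat * (1 - theta_hat) / n    # [-sum d^2l/dtheta^2]^{-1} for Bernoulli
se_g = np.sqrt(g_prime(theta_hat) ** 2 * var_theta_hat)   # delta-method standard error

z_crit = norm.ppf(1 - alpha / 2)
print(g(theta_hat), (g(theta_hat) - z_crit * se_g, g(theta_hat) + z_crit * se_g))
```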
3.2 Alternative tests
Besides the t-test for testing H0 given in equation (7), which is a special case of what is generally referred to as Wald(-type) tests, there are two additional (ML-specific) tests: the Score (or Lagrange Multiplier) test and the Likelihood Ratio test.
3.2.1 Score test
We first consider the Score test. Note that under H0 : θ∗ = θ0 equation (4)
implies
√n (1/n) ∂/∂θ l(θ0; w) →d N( 0, E[ ( ∂/∂θ l(θ0; wi) )² ] ).
As above, the variance is unknown. One possible estimator is given by
(1/n) ∑_{i=1}^n ( ∂/∂θ l(θ0; wi) )².
Here, we typically do not replace θ0 by θ̂, because if we consider

√n (1/n) ∂/∂θ l(θ0; w) / √( (1/n) ∑_{i=1}^n ( ∂/∂θ l(θ0; wi) )² ) = ∂/∂θ l(θ0; w) / √( ∑_{i=1}^n ( ∂/∂θ l(θ0; wi) )² ) →d N(0, 1),
we note that the left hand side does not depend on θ̂. Put differently, we can
use the above test statistic, i.e., the left hand side, for testing H0 : θ∗ = θ0
without ever computing θ̂, which can be time consuming in certain models.
It is standard practice to rely on the square of the above test statistic, which
satisfies
2
∂
n n1 ∂θ
l(θ0 ; w)
d
 →
q
χ2 (1)
P
2
n
∂
1
i=1 ∂θ l(θ0 ; wi )
n
√

by the CMT, where χ²(k) denotes a chi-square distribution with degrees of freedom equal to k. As above, the resulting Score test rejects if the above (squared) test statistic exceeds the corresponding critical value.
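A sketch of the Score test for the Bernoulli example (the null value, level, and design are assumptions); note that θ̂ is never computed.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
theta_star, theta0, alpha = 0.3, 0.25, 0.05        # assumed dgp value, null value, level
w = rng.binomial(1, theta_star, size=500)
n = len(w)

# Individual scores evaluated at theta0 only (theta_hat is never needed)
score_i = w / theta0 - (1 - w) / (1 - theta0)

# Squared score statistic: [sqrt(n) * mean score]^2 / mean squared score
stat = n * score_i.mean() ** 2 / np.mean(score_i ** 2)

print(stat, stat > chi2.ppf(1 - alpha, df=1))      # compare with chi^2(1) critical value
```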
3.2.2 Likelihood Ratio test
Next, we consider the Likelihood Ratio test. The Likelihood Ratio statistic
for testing H0 : θ∗ = θ0 is given by
2(l(θ̂; w) − l(θ0 ; w)).
Under H0 , the above test statistic satisfies
2( l(θ̂; w) − l(θ0; w) ) →d χ²(1).
To show this, consider a second-order Taylor expansion (a generalization of
the mean value Theorem)
l(θ0; w) = l(θ̂; w) + ∂/∂θ l(θ̂; w) (θ0 − θ̂) + (1/2) ∂²/∂θ² l(θ̄; w) (θ0 − θ̂)²,
where θ̄ lies between θ0 and θ̂. Given equation (1), we obtain
2( l(θ̂; w) − l(θ0; w) ) = −∂²/∂θ² l(θ̄; w) (θ̂ − θ0)² = −(1/n) ∂²/∂θ² l(θ̄; w) n (θ̂ − θ0)².   (11)
Since

√( −(1/n) ∂²/∂θ² l(θ̄; w) ) √n (θ̂ − θ0) = √n (θ̂ − θ0) / √( [ −(1/n) ∂²/∂θ² l(θ̄; w) ]^{-1} ) →d N(0, 1)
similar to equation (10) above, the right hand side of equation (11) satisfies
−(1/n) ∂²/∂θ² l(θ̄; w) n (θ̂ − θ0)² →d χ²(1)
by the CMT. The Likelihood Ratio test therefore rejects the null hypothesis
when the Likelihood Ratio statistic exceeds the corresponding critical value.
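A sketch of the Likelihood Ratio test for the scalar Bernoulli example (the null value, level, and design are assumptions):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
theta_star, theta0, alpha = 0.3, 0.25, 0.05        # assumed dgp value, null value, level
w = rng.binomial(1, theta_star, size=500)
theta_hat = w.mean()                               # unrestricted MLE

def loglik(theta, w):
    """Bernoulli log-likelihood l(theta; w)."""
    return np.sum(w * np.log(theta) + (1 - w) * np.log(1 - theta))

lr_stat = 2 * (loglik(theta_hat, w) - loglik(theta0, w))
print(lr_stat, lr_stat > chi2.ppf(1 - alpha, df=1))   # one restriction => chi^2(1)
```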
The Likelihood Ratio statistic also applies more generally, i.e., when θ is not necessarily scalar and if the null hypothesis specifies only part of the (true) parameter vector. If the null hypothesis does not specify the entire parameter vector, θ0 needs to be replaced with a “restricted” estimator, say θ̃ (see also here). In general, the asymptotic distribution of the test statistic is given by a χ²(r), where r denotes the number of restrictions imposed by the null hypothesis. Take θ = (θ1, θ2)′. Then, some examples are given by
• H0 : θ1∗ = 0 ⇒ r = 1
• H0 : θ1∗ = θ2∗ ⇒ r = 1
• H0 : θ1∗ = θ2∗ = 0 ⇒ r = 2.
4 Conditional ML
So far, we have considered (unconditional) Maximum Likelihood estimation,
i.e., we have specified a distribution function for wi , f (wi ; θ), where we have
implicitly taken wi to be scalar. In many cases, however, we observe more
than one random variable for each entity/individual in our sample. Furthermore, we typically differentiate between an outcome variable yi and (a vector of) explanatory variables xi = (xi1, . . . , xiK)′. Let wi = (yi, xi′)′. Now, we can write the likelihood function as follows
L(θ; w) = f(w; θ) = ∏_{i=1}^n f(wi; θ) = ∏_{i=1}^n f(yi, xi; θ),
where f (yi , xi ; θ) denotes the joint distribution of yi and xi . By Bayes’ rule,
we have
f (yi , xi ; θ) = f (yi |xi ; θ)f (xi ; θ).
Plugging in and taking logarithms, we obtain
l(θ; w) = ∑_{i=1}^n log f(yi|xi; θ) + ∑_{i=1}^n log f(xi; θ) ≡ l(θ; y|x) + ∑_{i=1}^n log f(xi; θ),
where l(θ; y|x) denotes the so-called conditional log-likelihood (function).
(Here, we use the notation |x to highlight the conditioning on x.) In economics, we often feel more comfortable specifying f (yi |xi ; θ) rather than
f (yi , xi ; θ) or, equivalently, we do not feel comfortable specifying f (xi ; θ).
It turns out that basically all derivations above go through if we replace the
log-likelihood with a conditional log-likelihood. The resulting estimator is
sometimes referred to as conditional MLE, but since conditioning is so common in economics, we typically don’t explicitly mention the “conditional.”
Example 2. yi ∈ {0, 1}. Consider the probit model where we specify the
conditional mass function of yi given xi as follows
f(yi|xi; θ) = Φ(xi′θ)^{yi} (1 − Φ(xi′θ))^{1−yi},

where Φ denotes the standard normal cdf (and φ below its pdf).
The conditional “individual” log-likelihood is defined as
l(θ; yi|xi) = yi log( Φ(xi′θ) ) + (1 − yi) log( 1 − Φ(xi′θ) ).
Taking the derivative with respect to (the vector) θ, we obtain
∂/∂θ l(θ; yi|xi) = yi (1/Φ(xi′θ)) φ(xi′θ) xi − (1 − yi) (1/(1 − Φ(xi′θ))) φ(xi′θ) xi

= [ yi/Φ(xi′θ) − (1 − yi)/(1 − Φ(xi′θ)) ] φ(xi′θ) xi.
To show that equation (2) holds in the context of conditional Maximum
Likelihood (estimation), note that
E[ ∂/∂θ l(θ∗; yi|xi) | xi ] = [ E[yi|xi]/Φ(xi′θ∗) − (1 − E[yi|xi])/(1 − Φ(xi′θ∗)) ] φ(xi′θ∗) xi

= [ Φ(xi′θ∗)/Φ(xi′θ∗) − (1 − Φ(xi′θ∗))/(1 − Φ(xi′θ∗)) ] φ(xi′θ∗) xi = 0.
Therefore, by the law of iterated expectations
E[ ∂/∂θ l(θ∗; yi|xi) ] = E[ E( ∂/∂θ l(θ∗; yi|xi) | xi ) ] = E[0] = 0.
Similarly, the other “results” go through.
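A sketch of (conditional) ML estimation of the probit model in Example 2 (the simulated design, sample size, and the use of scipy's general-purpose BFGS optimizer are assumptions; a dedicated probit routine, e.g., in statsmodels, could be used instead):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 1000
theta_star = np.array([0.5, -1.0])                 # assumed true coefficients

x = np.column_stack([np.ones(n), rng.normal(size=n)])        # constant plus one regressor
y = (x @ theta_star + rng.normal(size=n) > 0).astype(float)  # probit data generating process

def neg_cond_loglik(theta, y, x):
    """Negative conditional log-likelihood of the probit model."""
    p = np.clip(norm.cdf(x @ theta), 1e-12, 1 - 1e-12)        # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_cond_loglik, x0=np.zeros(2), args=(y, x), method="BFGS")
print(res.x)                                       # should be close to theta_star
```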