Econometric Methods II
(EBC2120)
Lecture 6
Contact details:
Martin Schumann
m.schumann@maastrichtuniversity.nl
Office number currently unknown
Binary choice models
Last lecture, we discussed Maximum Likelihood estimation, which
is often used to estimate nonlinear models.
Nonlinear models arise frequently when the response variable is
not continuous but rather “jumps”.
In binary choice models, we observe a binary variable
Yi = 1 if “success” (i is employed, bought something etc.), and Yi = 0 if “failure”.
Our goal is to model Pr(Yi = 1∣Xi ), the probability of success
conditional on some observed covariates Xi .
Since Yi is Bernoulli-distributed,
E[Yi ∣Xi ] = 0 ⋅ Pr(Yi = 0∣Xi ) + 1 ⋅ Pr(Yi = 1∣Xi ) = Pr(Yi = 1∣Xi ).
Motivation: latent variable
Binary choice models can also be motivated using a latent variable,
i.e. an unobserved variable Yi∗ with the observed dependent
variable Yi depending on it.
We assume that Yi∗ follows a linear model:
Yi∗ = Xi′ θ + ϵi , i = 1, ..., n,
where Yi∗ for instance represents firm i’s expected returns from
innovation while Xi contains observed firm characteristics.
A researcher may only be able to observe whether or not firm i
conducts innovation, i.e. we observe
Yi = 1 if Yi∗ = Xi′θ + ϵi > 0, and Yi = 0 if Yi∗ = Xi′θ + ϵi ≤ 0.
Let’s assume that the model error has a continuous distribution F(⋅) that is symmetric around 0 (centering at 0 is without loss of generality if Xi contains a constant regressor). Then,
Pr(Yi = 1∣Xi ) = Pr(Yi∗ > 0∣Xi ) = Pr(Xi′ θ + ϵi > 0∣Xi )
= Pr(ϵi > −Xi′ θ∣Xi ) = 1 − F (−Xi′ θ) = F (Xi′ θ),
where we used the symmetry around 0 in the last equation.
Thus, in total,
E[Yi∣Xi] = Pr(Yi = 1∣Xi) = F(Xi′θ).    (1)
Compare this with the linear regression model
E[Yi∣Xi] = Xi′θ.    (2)
Marginal (or partial) effects
The nonlinearity in (1) has important consequences for the
interpretation of θ:
In (2),
mik(Xi, θ) = ∂E[Yi∣Xi]/∂Xik = θk = mik(θk),
where the k-index refers to the (continuous) k-th component of Xi .
In (1),
mik(Xi, θ) = ∂E[Yi∣Xi]/∂Xik = θk f(Xi′θ),
where f(⋅) denotes the pdf of F.
⇒ In nonlinear models, the marginal effects are not constant! They
depend on the values of each component of θ and on the data
through f (Xi′ θ)!
If Xik is discrete (e.g. a dummy), the marginal effect is usually
calculated as the effect of a one-unit increase in the regressor on
the conditional probability. Let
Xi(−k) = (Xi1 , ..., Xik−1 , Xik+1 , ..., XiK )′ . Then the marginal effect
of Xik for individual i is
mik(Xi, θ) = F((Xik + 1)θk + Xi(−k)′θ(−k)) − F(Xi′θ).
Since f(Xi′θ) > 0, the sign of θk coincides with the sign of the marginal effect of Xik. However, the magnitude of the effect depends on Xi and θ.
On the other hand, relative marginal effects are constant:
(∂E[Yi∣Xi]/∂Xik) / (∂E[Yi∣Xi]/∂Xil) = mik(Xi, θ)/mil(Xi, θ) = θk f(Xi′θ) / (θl f(Xi′θ)) = θk/θl.
It is also easy to find an upper bound for the magnitude of the marginal effect, namely sup_{z∈R} f(z) ⋅ |θk|.
However, since the marginal effects depend on the data through Xi, we typically need to evaluate the marginal effects at some meaningful values.
The most popular choice is probably the average partial effect (APE)
APE(Xik) = (1/n) ∑_{i=1}^n mik(Xi, θ).
For the APE, the averaging is done “on the outside”.
Another popular choice is to evaluate the marginal effect at the sample mean (or the median) x̄ to obtain the partial effect at the average (PEA)
PEA(Xik) = mik(x̄, θ) = f(x̄′θ) θk,    where x̄ = (1/n) ∑_{i=1}^n Xi.
To showcase the nonlinearity in the marginal effects, it is often
interesting to report marginal effects for a range of values (e.g.
marginal effects for each quintile of the data) or at values that are
representative or of particular interest in a specific application.
For estimation of APEs etc. we simply replace any unknown
parameters θ by their maximum likelihood estimates θ̂.
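To illustrate, here is a minimal sketch of how the APE and PEA could be computed for a probit model; the data-generating process, sample size, and use of statsmodels’ Probit are our own illustrative assumptions, not part of the lecture:

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Hypothetical probit data: Y = 1{X'theta0 + eps > 0} with standard normal errors
rng = np.random.default_rng(0)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))            # constant + two regressors
theta0 = np.array([0.5, 1.0, -0.5])
Y = (X @ theta0 + rng.normal(size=n) > 0).astype(int)

theta_hat = sm.Probit(Y, X).fit(disp=0).params           # MLE theta-hat

# APE: average m_ik(X_i, theta) = f(X_i'theta) * theta_k over the sample
ape = norm.pdf(X @ theta_hat).mean() * theta_hat
# PEA: marginal effect evaluated at the sample mean of X
pea = norm.pdf(X.mean(axis=0) @ theta_hat) * theta_hat
print(ape, pea)

In practice one would report only the entries corresponding to the non-constant regressors.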
Logit vs. probit
The two most popular choices for the link function F are
● F (Xi′ θ) = Φ(Xi′ θ), where Φ is the standard normal cdf ⇒ Probit
model
● F (Xi′ θ) = Λ(Xi′ θ), where Λ is the standard logistic cdf ⇒ Logit
model
There is surprisingly little guidance on how to choose between
both models.
In practice, it often boils down to computational aspects.
Typically, the results for the APEs are pretty similar for both
models, and the logit coefficient estimates are about 1.6 times as
large as the probit coefficient estimates.
● The latter point is a crude rule of thumb: both pdfs attain their
maximum at 0, and ϕ(0)/λ(0) ≈ 0.4/0.25 = 1.6.
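The rule of thumb can be checked directly; the quick simulation below (data and coefficient values are made up for illustration) fits both models on the same data:

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm, logistic

print(norm.pdf(0), logistic.pdf(0), norm.pdf(0) / logistic.pdf(0))   # ~0.399, 0.25, ratio ~1.6

# Fit logit and probit on the same simulated data and compare coefficients
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(2000, 1)))
Y = (X @ np.array([0.2, 0.8]) + rng.normal(size=2000) > 0).astype(int)
b_probit = sm.Probit(Y, X).fit(disp=0).params
b_logit = sm.Logit(Y, X).fit(disp=0).params
print(b_logit / b_probit)    # roughly in line with the 1.6 rule of thumb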
Remark: Log-odds ratio in logit
Let pi = Pr(Yi = 1∣Xi). The odds are defined as
odds = probability of success / probability of failure = pi/(1 − pi) ∈ [0, ∞).
The log-odds are thus given as
log-odds = log(pi/(1 − pi)) ∈ (−∞, ∞).
Assume that the log-odds are a linear function in Xi :
log(pi/(1 − pi)) = Xi′θ  ⇔  pi/(1 − pi) = exp(Xi′θ)  ⇔  pi(1 + exp(Xi′θ)) = exp(Xi′θ)
⇔  pi = exp(Xi′θ)/(1 + exp(Xi′θ)) = exp(Xi′θ)/(exp(Xi′θ)(1 + exp(−Xi′θ))) = 1/(1 + exp(−Xi′θ)) = Λ(Xi′θ).
Thus, estimates from the logit model can be used to estimate the
log-odds.
Unfortunately, the same argument does not apply to probit.
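As a small illustration of this interpretation (the coefficient values below are hypothetical): since odds = exp(Xi′θ) under the logit model, a one-unit increase in a regressor multiplies the odds by exp(θk).

import numpy as np

theta_hat = np.array([0.2, 0.8])            # hypothetical logit estimates (constant, x1)
x = np.array([1.0, 1.5])                    # evaluation point (constant, x1 = 1.5)
p = 1 / (1 + np.exp(-x @ theta_hat))        # p = Lambda(x'theta)
print(p / (1 - p), np.exp(x @ theta_hat))   # the odds equal exp(x'theta)
print(np.exp(theta_hat[1]))                 # odds ratio for a one-unit increase in x1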
Estimation
Assume the following:
Yi∗ = Xi′ θ0 + ϵi , where Yi∗ is latent
Yi = 1{Yi∗ > 0}
ϵi ∣Xi is i.i.d. across i with a distribution that is symmetric around
0 and a cdf F that is three times continuously differentiable.
Using these assumptions, the marginal pmf of Yi given Xi is
pmf(Yi∣Xi; θ) = F(Xi′θ)^Yi (1 − F(Xi′θ))^(1−Yi) I{0,1}(Yi).
By independence across i, the joint likelihood of Y1 , ..., Yn conditional
on X1 , ..., Xn is
L(θ) = ∏_{i=1}^n F(Xi′θ)^Yi (1 − F(Xi′θ))^(1−Yi) I{0,1}(Yi).    (3)
The likelihood in (3) is called conditional likelihood since we
condition on the observed characteristics.
By fully specifying the distribution of ϵi we make the model fully
parametric, as the only unknown is the parameter.
We implicitly assume that all moments are correctly specified. In
particular, we assume that the variance of ϵi conditional on Xi is
correctly specified as well.
From (3) we obtain the (scaled) log-likelihood
ℓ(θ) = (1/n) ∑_{i=1}^n [Yi log(F(Xi′θ)) + (1 − Yi) log(1 − F(Xi′θ))].
The likelihood is then maximized to derive the MLE θ̂.
However, there is typically no closed form for θ̂. We thus have to rely on numerical maximization routines (e.g. Newton-Raphson).
In logit and probit, the likelihood is strictly concave if E[Xi Xi′] > 0 (i.e. positive definite).
This is good news for numerical optimizers, as the maximizer will
truly be a global rather than a local maximizer.
Let Fi = F (Xi′ θ) and fi = f (Xi′ θ).
The score is given by:
∂ℓ(θ)/∂θ = (1/n) ∑_{i=1}^n [(Yi/Fi) fi Xi − ((1 − Yi)/(1 − Fi)) fi Xi]
         = (1/n) ∑_{i=1}^n [(Yi(1 − Fi) − (1 − Yi)Fi)/(Fi(1 − Fi))] fi Xi
         = (1/n) ∑_{i=1}^n [(Yi − Fi)/(Fi(1 − Fi))] fi Xi.
The first order conditions ∂ℓ(θ̂)/∂θ = 0 are therefore given by:
∑_{i=1}^n [(Yi − F̂i)/(F̂i(1 − F̂i))] f̂i Xi = 0 ∈ R^K,
where F̂i = F(Xi′θ̂) and f̂i = f(Xi′θ̂).
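A minimal Newton-Raphson sketch for the logit case, where fi = Fi(1 − Fi) so the score simplifies to (1/n) ∑ (Yi − Fi)Xi; the function and variable names are our own, and the routine is only an illustration of the first-order conditions above:

import numpy as np

def logit_mle(Y, X, tol=1e-10, max_iter=100):
    """Newton-Raphson for the (scaled) logit log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        F = 1 / (1 + np.exp(-X @ theta))               # F_i = Lambda(X_i'theta)
        score = X.T @ (Y - F) / len(Y)                 # (1/n) sum (Y_i - F_i) X_i
        hessian = -(X.T * (F * (1 - F))) @ X / len(Y)  # -(1/n) sum F_i(1 - F_i) X_i X_i'
        step = np.linalg.solve(hessian, score)
        theta = theta - step                           # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return theta

Because the logit log-likelihood is strictly concave, these iterations converge to the global maximizer from the zero starting value.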
Count data
Count data variables take non-negative integer values and appear
frequently in economic analysis.
● # patents granted
● # doctor visits
● # cigarettes smoked
Typically, there does not exist a natural upper bound and the
outcome is zero for some members of the population.
Let Y be a count variable and X be a vector of explanatory variables. As usual, we are interested in modeling E[Y∣X].
A linear model is often not appropriate, as it can produce negative predictions when Xi′β̂ < 0.
In practice, some people try to account for the non-negativity of
count data by using a log-transformation, i.e. they estimate
log(Yi ) = Xi′ β + ϵi .
Since often Yi = 0 for some i ∈ {1, ..., n}, people simply “add one to
each Yi ”, leading to
log(Yi + 1) = Xi′ β + ui .
This approach has several downsides:
● Since the log() is non-linear,
E[log(Yi + 1)∣Xi ] ≠ log(E[Yi ∣Xi ] + 1).
● log(Yi + 1) is discrete, so the usual %-change interpretation does
not apply, as it works only for “small” changes.
It is often best to model E[Yi ∣Xi ] directly, taking into account the
count nature of Yi .
A popular choice is the Poisson model, i.e. we assume that
Yi∣Xi ∼ Poisson(µ(Xi)), independently across i,
so that
pmf_{Yi∣Xi}(k) = Pr(Yi = k∣Xi) = exp(−µ(Xi)) µ(Xi)^k / k!,    k ∈ N0.
Typically, the mean µ(Xi) is modeled as µ(Xi) = exp(Xi′β) ≥ 0.
The (pooled) Poisson likelihood is
L(β) = ∏_{i=1}^n exp(−exp(Xi′β)) (exp(Xi′β))^Yi / Yi!,
so the (scaled) log-likelihood is
ℓ(β) = (1/n) ∑_{i=1}^n [−exp(Xi′β) + Yi Xi′β − log(Yi!)].
Since terms that do not depend on β can be ignored in the likelihood maximization, the log-likelihood is typically written as
ℓ(β) = (1/n) ∑_{i=1}^n [Yi Xi′β − exp(Xi′β)].
The FOC therefore is
∂ℓ(β)/∂β = (1/n) ∑_{i=1}^n Xi (Yi − exp(Xi′β)) = 0    (4)
and
∂²ℓ(β)/∂β∂β′ = −(1/n) ∑_{i=1}^n exp(Xi′β) Xi Xi′ < 0.
Thus, if E[Xi Xi′] has full rank, the likelihood is concave and the MLE β̂ is the unique maximizer of ℓ(β).
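Analogously to the binary-choice case, a minimal Newton-Raphson sketch that uses the FOC (4) and the Hessian above (an illustrative implementation with our own names, not a canned routine):

import numpy as np

def poisson_mle(Y, X, tol=1e-10, max_iter=100):
    """Newton-Raphson for the (scaled) Poisson log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)                  # mu_i = exp(X_i'beta)
        score = X.T @ (Y - mu) / len(Y)        # (1/n) sum X_i (Y_i - mu_i), cf. (4)
        hessian = -(X.T * mu) @ X / len(Y)     # -(1/n) sum exp(X_i'beta) X_i X_i'
        step = np.linalg.solve(hessian, score)
        beta = beta - step                     # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta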
Interpretation of marginal effects
Taking the derivative with respect to Xik yields
∂E[Yi∣Xi]/∂Xik = ∂exp(Xi′β0)/∂Xik = β0k exp(Xi′β0).
Thus, as in binary choice models, the marginal effect of Xik
depends not only on β0 but also on the level of Xi .
Notice however that
(∂E[Yi∣Xi]/∂Xik) / E[Yi∣Xi] = ∂log(E[Yi∣Xi])/∂Xik.
Thus, 100 ⋅ β0k is the semi-elasticity of E[Yi∣Xi] w.r.t. Xik, i.e. a one-unit change in Xik induces a 100 ⋅ β0k % change in E[Yi∣Xi].
As in a linear model, let
ϵ̂i = Yi − Ê[Yi∣Xi] = Yi − µ̂(Xi) = Yi − exp(Xi′β̂).
Then, (4) implies
∑_{i=1}^n ϵ̂i Xi = 0,
i.e. the regressors are orthogonal to the residuals in the sample.
There is another interesting similarity between the linear model and the Poisson model: as in the linear model, correct specification of E[Yi∣Xi] = exp(Xi′β) is sufficient for consistency of the Poisson MLE.
● This is not obvious: typically, maximum likelihood estimators are only consistent if the whole distribution is correctly specified.
● With the Poisson model, it is thus much more common to estimate the coefficients with maximum likelihood but to use a “robust” sandwich estimator for the standard errors.
Robust Poisson standard errors
Recall that if the Poisson likelihood ℓ(β) is correctly specified (i.e. the data is truly generated according to a Poisson model),
√n (β̂ − β0) →d N(0, F^{−1}),
where F = var0(∂ℓ(β0)/∂β).
Recall from our proof of asymptotic normality that if the information equality does not hold (for instance because the likelihood is incorrectly specified),
√n (β̂ − β0) →d N(0, H^{−1} F H^{−1}),
where H = −E[∂²ℓ(β0)/∂β∂β′] is the Hessian.
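A minimal sketch of the resulting sandwich standard errors for the Poisson MLE, estimating F by the outer product of the per-observation scores and H by the sample analogue of the expected Hessian (β̂ could come, for example, from the Newton routine sketched earlier; names are our own):

import numpy as np

def poisson_robust_se(Y, X, beta_hat):
    """Sandwich standard errors based on H^{-1} F H^{-1} for the Poisson MLE."""
    n = len(Y)
    mu = np.exp(X @ beta_hat)
    scores = X * (Y - mu)[:, None]     # per-observation scores X_i (Y_i - mu_i)
    F = scores.T @ scores / n          # outer-product estimate of var(score)
    H = (X.T * mu) @ X / n             # estimate of -E[d^2 l / d beta d beta']
    H_inv = np.linalg.inv(H)
    avar = H_inv @ F @ H_inv           # asymptotic variance of sqrt(n)(beta_hat - beta0)
    return np.sqrt(np.diag(avar) / n)  # standard errors for beta_hat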
Despite this robustness property, the Poisson model also has downsides:
As can easily be shown, E[Yi ∣Xi ] = µ(Xi ) = var[Yi ∣Xi ], a property
that is known as equi-dispersion.
This can be problematic if there are “excess-zeros”, for instance in
the analysis of # patents where many firms have 0 patents.
Assume that X ∼ Poisson(λ) and that Y satisfies
Y ∣Y > 0 ∼ X∣X > 0, i.e. Y behaves like a Poisson variable for
values larger than 0.
If Pr(Y = 0) > Pr(X = 0), then var(Y) > E(Y), so that the Poisson model is not appropriate.
Zero-inflated Poisson
An alternative that allows us to explicitly address excess zeros is the zero-inflated Poisson model.
● Other “zero-inflated” models exist as well.
The idea is that we combine a distribution that “generates the zeros” with a Poisson distribution for the non-zero counts:
pmf(Yi∣Xi) = p + (1 − p) Poisson_λ(0)  if Yi = 0,  and  pmf(Yi∣Xi) = (1 − p) Poisson_λ(Yi)  if Yi > 0,
where p is the probability of excess zeros.
As in the Poisson model, λ = exp(Xi′β0) is a popular choice.
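A minimal sketch of the zero-inflated Poisson pmf (the values of p and λ below are hypothetical and only show how the extra mass at zero enters):

import numpy as np
from scipy.stats import poisson

def zip_pmf(y, p, lam):
    """Zero-inflated Poisson pmf: extra probability mass p at zero."""
    base = poisson.pmf(y, lam)
    return np.where(y == 0, p + (1 - p) * base, (1 - p) * base)

y = np.arange(5)
print(zip_pmf(y, p=0.3, lam=2.0))   # Pr(Y = 0) is inflated relative to ...
print(poisson.pmf(y, 2.0))          # ... the plain Poisson(2) pmf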
Sample selection on observables
Selection on observables Xi can sometimes be ignored:
Assume that we have a cross-section of size n and define the
indicator
si = 1 if i stays in the sample, and si = 0 if i is dropped from the sample.
Further assume that the data is generated according to the linear
model
Yi = Xi′ β + ϵi .
The OLS estimator is
β̂ = (∑_{i=1}^n si Xi Xi′)^{−1} ∑_{i=1}^n si Xi Yi = β + (∑_{i=1}^n si Xi Xi′)^{−1} ∑_{i=1}^n si Xi ϵi.
Assume that Xi is exogenous and that E[Xi Xi′] has full rank in the population.
For identification of β, we need to assume that
E[si Xi Xi′ ∣ si = 1] has full rank,    (5)
i.e. the usual rank condition holds on the selected subsample.
Condition (5) may for instance be violated if our sample contains
only females and we include a gender dummy.
For consistency of β̂ we need
E[si Xi ϵi] = 0.    (6)
Condition (6) is satisfied if si is independent of Xi and ϵi (“missing completely at random”), as then E[si Xi ϵi] = E[si] E[Xi ϵi] = 0 by exogeneity of Xi.
A slightly weaker condition that is still sufficient for consistency is
E[ϵi ∣Xi , si ] = 0,
which is known as “exogenous sampling”. This assumption holds if
for instance Xi is exogenous and si is a deterministic function of
Xi .
For instance, let Yi = β0 + β1 Xi + ϵi where Xi captures wealth and
assume that Xi is an exogenous regressor.
Suppose we select only households with a certain level of wealth,
i.e.
si = 1 if wealth ≥ 100K, and si = 0 otherwise.
In that case, si = si (Xi ) and
E[ϵi ∣Xi , si ] = E[ϵi ∣Xi , si (Xi )] = E[ϵi ∣Xi ] = 0.
Intuitively, si (Xi ) does not contain any more information than Xi
and is thus irrelevant as a conditioning variable.
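A small simulated check of this point (the data-generating process, coefficients, and the wealth units are invented for illustration): selecting on the exogenous regressor leaves OLS approximately unbiased.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200_000
wealth = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # wealth in units of 100K (hypothetical)
eps = rng.normal(size=n)
Y = 1.0 + 0.5 * wealth + eps                          # Y_i = beta_0 + beta_1 * wealth_i + eps_i
s = wealth >= 1.0                                     # keep only households with wealth >= 100K

X = sm.add_constant(wealth)
print(sm.OLS(Y[s], X[s]).fit().params)                # close to (1.0, 0.5) despite the selection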
Moreover, si is not allowed to be correlated with unobserved
factors in the selected sample.
For example, suppose we estimate
wagei = β0 + β1 educationi + ϵi ,
where ϵi contains motivation.
Assume for the moment that education and motivation are
uncorrelated in the population.
Thus, if we truly had a random sample from the population, OLS
would be consistent.
If we however have a random sample from current workers, the
average motivation for workers with low levels of education is
likely to be higher than in the population.
Workers with low education and with (less than) average
motivation are less likely to be in our sample of workers.
Thus, in our selected sample, a correlation between education and
motivation exists that does not exist in the population, so that
E[educationi ⋅ ϵi ] = 0 while E[educationi ⋅ ϵi ∣si = 1] ≠ 0.
Another example due to Elwert and Winship (2014):
Suppose beauty and (acting) talent are uncorrelated in the
population (which sounds plausible) and that both talent and
beauty cause Hollywood success.
Suppose we want to assess the effect of talent on wage using the
model
wagei = β0 + β1 talenti + ϵi ,
where ϵi contains beauty.
If we had a random sample from the population, OLS would be
consistent, as E[talent ⋅ beauty] = 0.
Suppose however that we have a sample of Hollywood actors. In
this sample, talent and beauty are correlated: any person without
talent in our sample must be beautiful! Thus,
E[talent ⋅ beauty ∣ Hollywood success] ≠ 0.
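A quick simulation with purely hypothetical numbers makes the point: talent and beauty are uncorrelated in the full population, but negatively correlated once we condition on Hollywood success.

import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
talent = rng.normal(size=n)
beauty = rng.normal(size=n)                                  # independent of talent
success = talent + beauty + rng.normal(size=n) > 2.0         # both cause success

print(np.corrcoef(talent, beauty)[0, 1])                     # roughly 0 in the population
print(np.corrcoef(talent[success], beauty[success])[0, 1])   # clearly negative among the successful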
Censoring and truncation
While selection on the observables Xi can sometimes be ignored,
selection on the response usually renders OLS biased.
Specific examples where the data on the response is altered and
OLS is inconsistent are censoring and truncation:
Under censoring, the observed outcome is constrained while the
covariates are accurately observed.
For instance, income studies (where income is the response variable) often only report “income ≥ 100K”.
Thus, for a household with 200K income, we observe all covariates
but the response is set to 100K.
Under truncation, observations whose outcome variable is outside
a certain range are excluded from the sample.
Censoring and truncation can be handled by maximum likelihood
methods (e.g. Tobit models).
Here, we will only discuss the Heckit procedure, also known as the
two-step Heckman correction or the Heckman selection model.
The Heckman correction is particularly popular in cases where we
need to correct for bias arising from non-randomly selected
samples.
For his contributions, Heckman received the Nobel prize in
Economics (together with Daniel McFadden) in 2000.
With more than 200,000 citations, Heckman is probably one of the most prominent economists alive.
Let Y∗ be a latent variable. The observed variable Y is called left-censored (or censored from below) at L if
Y = Y∗ if Y∗ > L, and Y = L if Y∗ ≤ L.
Similarly, Y is called right-censored (or censored from above) at U if
Y = Y∗ if Y∗ < U, and Y = U if Y∗ ≥ U.
Consider the latent variable model
Yi∗ = Xi′β0 + ϵi,   ϵi∣Xi ∼ N(0, σ0²),
where we observe Yi = max(0, Yi∗ ).
A naive way of dealing with censored data would be to use OLS on
the subset of the data that is not affected by censoring.
However, in our model,
E[Yi∣Xi, Yi > 0] = E[Yi∗∣Xi, Yi∗ > 0] = Xi′β0 + E[ϵi∣Xi, Yi∗ > 0]
                = Xi′β0 + E[ϵi∣Xi, ϵi > −Xi′β0].
Now, if X ∼ N(µ, σ²), then
E[X∣X > a] = µ + σ λ((a − µ)/σ),
where
λ(x) = ϕ(x)/(1 − Φ(x))
is the inverse Mills ratio. Setting α0 = β0/σ0,
E[Yi∣Xi, Yi > 0] = Xi′β0 + σ0 λ(−Xi′β0/σ0) = Xi′β0 + σ0 λ(−Xi′α0).
Thus, using linear regression only on the observations with Yi > 0 will lead to inconsistent estimates.
For Yi > 0, we can always (trivially) write
Yi = E[Yi ∣Xi , Yi > 0] + ui ,
ui = Yi − E[Yi ∣Xi , Yi > 0],
so that E[ui ∣Xi , Yi > 0] = 0 by construction.
Our model for the data with Yi > 0 thus is
Yi = Xi′ β0 + σ0 λ(−Xi′ α0 ) + ui .
Using OLS in this model would yield consistent estimates for the
parameters β0 and σ0 .
Problem: The “data” λ(−Xi′ α0 ) is unobserved since α0 is
unknown.
Solution: Use a first-stage probit model for the selection indicator di = 1{Yi > 0} to estimate Pr(Yi∗ > 0):
Pr(di = 1) = Pr(Yi∗ > 0) = Pr(Xi′β0 > −ϵi) = Pr(ϵi/σ0 < Xi′β0/σ0) = Φ(Xi′α0),
where we used the symmetry of the distribution of ϵi in the second step.
The likelihood for α can thus be written as
L(α) = ∏_{i=1}^n Φ(Xi′α)^di (1 − Φ(Xi′α))^(1−di).
After estimating α̂ by maximum likelihood, use the subsample with Yi > 0 to run OLS on
Yi = Xi′β + σ λ(−Xi′α̂) + ui∗,    ui∗ = ui + probit error.
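A minimal sketch of the two-step procedure (assuming, as on this slide, that the same regressors Xi drive selection and outcome; the variable names and use of statsmodels are our own choices):

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckit_two_step(Y, X, d):
    """Two-step Heckman correction: probit for selection, then OLS with the inverse Mills ratio."""
    # Step 1: probit of the selection indicator d = 1{Y_i > 0} on X
    alpha_hat = sm.Probit(d, X).fit(disp=0).params
    # lambda(-X'alpha) = phi(X'alpha) / Phi(X'alpha)
    xa = X @ alpha_hat
    mills = norm.pdf(xa) / norm.cdf(xa)
    # Step 2: OLS on the subsample with d = 1, adding the Mills ratio as an extra regressor
    sel = d == 1
    X_aug = np.column_stack([X[sel], mills[sel]])
    return sm.OLS(Y[sel], X_aug).fit().params   # last coefficient is the estimate of sigma

Note that the second-step OLS standard errors would still have to be corrected for the fact that α̂ is estimated (the “probit error” in ui∗).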
The Heckit approach is also often used to address sample selection:
Suppose we want to analyze the effect of variable x on wage.
If the population of interest is “current workers”, a random sample
of current workers yields unbiased results.
If the population is “all people at working age”, our estimates are
likely to suffer from sample selection bias.
Sometimes the Heckit is also used to alleviate self-selection bias: if
one wants to assess the effect of a voluntary job training on wage,
participation is endogenous.
By explicitly modeling the participation decision, we can get
unbiased estimates, provided the participation equation is
correctly specified.
● Since this often entails strong assumptions, many researchers
nowadays rather rely on natural experiments (see DiD).
Assume that we have two latent variables
Y1∗ = X1′ β1 + ϵ1
Y2∗ = X2′ β2 + ϵ2
and assume that
(ϵ1, ϵ2)′ ∼ N( (0, 0)′ , Σ ) with Σ = ( 1, σ12 ; σ12, σ2² ),
i.e. var(ϵ1) is normalized to 1 (as in the probit model), var(ϵ2) = σ2², and cov(ϵ1, ϵ2) = σ12.
The participation equation is
d = 1 if Y1∗ > 0, and d = 0 else,
and the outcome equation is
Y = Y2∗ if d = 1, and Y = 0 if d = 0.
Suppose we only used the observations with d = 1 ⇔ Y1∗ > 0. Then
E[Y∣X1, X2, d = 1] = E[Y2∗∣X1, X2, Y1∗ > 0] = X2′β2 + σ12 λ(−X1′β1).
The Heckit procedure then estimates β1 with a probit model, using all observations (d ∈ {0, 1}).
In the second step, restrict the sample to observations with d = 1
and run OLS of Y on X2 and λ(−X1′ β̂1 ).
While very elegant, the Heckit procedure is sensitive to the distributional assumptions (and is thus currently a bit out of fashion).