
Dummy Variables in Econometrics: A Comprehensive Guide

LI Econometrics
Dummy Variables
Eric Melander
University of Birmingham
About Me
Eric Melander
Lecturer in Economics
Contact
Email: e.melander@bham.ac.uk
Office: University House Extension 2105
Office hours: Monday 9:00-10:30, Thursday 9:00-10:30
I am an “applied econometrician”
▶ I do not invent the types of tools and methods you will see in this module
▶ But they are extremely useful for the types of things I do do
▶ Which is: assembling datasets and testing important causal relationships
▶ Across a range of subfields: economic history, political economy,
economics of crime/conflict
▶ For example: do welfare cuts cause crime?
The Module So Far
Over the past five weeks, you have seen
▶ Key statistical concepts
▶ Distributions and their moments, statistical inference, ...
▶ Bivariate linear regression
▶ Models of relationships between two variables, desirable properties of
estimators, conditions under which OLS satisfies these properties,
hypothesis testing, ...
▶ Multivariate linear regression
▶ Now with more variables!
▶ Violation of classical assumptions
▶ Autocorrelation and heteroscedasticity: why a problem, how to detect,
how to address
The Shape of Econometrics to Come
Over the coming five weeks, you will see
▶ Incorporation of qualitative information: dummy variables
▶ Parameter stability and structural change
▶ Model selection and misspecification
▶ Measurement error
▶ Endogeneity and instrumental variables estimation
Thinking About Qualitative Data
So far, we have mainly considered variables with quantitative meaning
▶ Individuals’ education and wages, aggregate consumption and income, ...
▶ It makes sense to think about, e.g., the effect on wages (measured in £) of
one additional unit of education (measured in years)
But: often we want to consider qualitative factors in our regression models
What types of qualitative factors? To name a few:
▶ Gender
▶ Race and ethnicity
▶ Industry/sector of firms
▶ Economic geography: urban vs. rural, North vs. South, ...
▶ Specific time periods: recessions, wars, policy changes, ...
What Are Dummy Variables?
Dummy variables (dummies) assume only two values: {0,1}
▶ Also known as: binary variables, indicator variables, zero-one variables, ...
We use dummy variables to encode qualitative information, e.g.:
▶ Femalei = 1 if individual i is female, 0 otherwise
▶ Marriedi = 1 if individual i is married, 0 otherwise
▶ Recessiont = 1 if economy in recession in quarter t, 0 otherwise
Things to note at this stage:
▶ We wrote down the gender dummy as Femalei , which “switches on” for
female individuals
▶ We could equally have written Malei , switching on for male individuals
▶ In terms of econometrics, picking one or the other makes no difference
▶ In terms of economic interpretation, the choice matters and is
context-dependent, as we will see
How Might We Use Dummy Variables?
Suppose we are interested in the gender pay gap
We might specify a model like this:
ln(wagei ) = β0 + β1 Educi + δ0 Femalei + ϵi
Where ln(wagei ) is the log wage of individual i, Educi is years of education,
and Femalei is a dummy variable taking a value of 1 for female individuals
We know how to interpret β1 , but how to interpret δ0 ?
▶ (Note: to keep consistency of notation, these slides will use δ to indicate
parameters on dummy variables)
How to Interpret δ0 ?
Recall our model:
ln(wagei ) = β0 + β1 Educi + δ0 Femalei + ϵi
Let us first consider the expected value of the log wage of individuals
with Femalei = 1 and a given level of education (Educi )
E(ln(wagei )|Femalei = 1, Educi ) = (β0 + δ0 ) + β1 Educi
And now that of individuals with Femalei = 0
E(ln(wagei )|Femalei = 0, Educi ) = β0 + β1 Educi
Subtracting one from the other
E(ln(wagei )|Femalei = 1, Educi ) − E(ln(wagei )|Femalei = 0, Educi ) = δ0
So: δ0 is the difference in expected log wage between women and men
(women minus men) for a given level of education
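As a quick numerical check of the subtraction above, the sketch below evaluates the two conditional expectations at several education levels. The parameter values and the function name `expected_log_wage` are assumptions made up for this illustration, not estimates from any dataset.

```python
# Hypothetical parameter values, chosen only for illustration.
BETA0 = 1.5    # intercept for the reference group (men)
BETA1 = 0.08   # return to one additional year of education
DELTA0 = -0.2  # coefficient on the Female dummy


def expected_log_wage(educ, female):
    """E(ln(wage) | Female, Educ) under the additive dummy model."""
    return BETA0 + BETA1 * educ + DELTA0 * female


# The gap between women and men is the same at every education level.
gaps = [expected_log_wage(e, 1) - expected_log_wage(e, 0) for e in (8, 12, 16)]
print([round(g, 10) for g in gaps])
```

Whatever education level is chosen, the gap equals DELTA0 (up to floating-point rounding), which is the intercept-shift interpretation.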
Visual Interpretation of δ0
A few things to note:
▶ The intercept for men is at β0 , the intercept for women is at β0 + δ0
▶ As drawn, the intercept for women lies below that for men: β0 + δ0 < β0 ⟹ δ0 < 0
▶ The gradient on Educi (β1 ) is the same for men and women (parallel)
▶ Interpretation: δ0 is an intercept shift, with no change in gradient
A Note on Modelling Choices
Note that we chose to specify the model with a Femalei dummy
▶ Making men (Femalei = 0) the base, benchmark or reference group
▶ And therefore giving δ0 an interpretation as the difference in intercepts
for women compared to men
We could have equally used a Malei dummy, with women as reference group
ln(wagei ) = β0′ + β1 Educi + δ0′ Malei + ϵi
Econometrically, this would be equivalent, but of course δ0′ = −δ0
(and the intercept β0′ = β0 + δ0 now refers to women)
▶ Modelling choice will depend on what makes most sense for interpretation
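To see where the sign flip comes from, one can substitute Femalei = 1 − Malei into the original Female specification. The short derivation below uses only symbols already defined on these slides:

```latex
\begin{align*}
\ln(wage_i) &= \beta_0 + \beta_1 Educ_i + \delta_0 (1 - Male_i) + \epsilon_i \\
            &= (\beta_0 + \delta_0) + \beta_1 Educ_i - \delta_0 Male_i + \epsilon_i
\end{align*}
```

So the coefficient on Malei is −δ0, and the intercept of the Male specification is the intercept for women, β0 + δ0. The fitted values are identical under either choice of reference group.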
Another Note on Modelling Choices
Why don’t we include indicators for both men and women?
ln(wagei ) = β0 + β1 Educi + δ0 Femalei + δ0′ Malei + ϵi
This is because of the dummy variable trap:
▶ Since for each i we must have Femalei = 1 or Malei = 1, but not both:
Femalei + Malei = 1
▶ That is: Femalei and Malei are perfectly collinear
▶ We therefore cannot separately estimate δ0 and δ0′
In general: for a qualitative variable with m categories, we need to
include m − 1 dummy variables
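The trap can be made concrete with a tiny example. The sketch below uses a made-up sample of five individuals (an assumption for illustration) and shows that, with an intercept plus both dummies, the cross-product matrix X'X is singular, so the OLS normal equations have no unique solution. The helper `det3` is written for this example only.

```python
# Hypothetical sample: 1 = female, 0 = male.
female = [1, 0, 1, 1, 0]
male = [1 - f for f in female]
const = [1] * len(female)

# Female + Male reproduces the constant column exactly (perfect collinearity).
assert all(f + m == c for f, m, c in zip(female, male, const))

columns = [const, female, male]

# Cross-product matrix X'X for the design matrix [const, Female, Male].
xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


print(det3(xtx))  # prints 0: X'X cannot be inverted
```

Dropping either dummy (or the intercept) removes the exact linear dependence and restores a unique solution.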
Multiple Categories
So far, we considered qualitative variables with 2 categories
▶ But what if we want to consider multiple categories?
Suppose we have a categorical variable, Sectori , coded with integers and containing
information on the sector of the economy in which individual i works, with the following values
▶ Sectori = 1 if i is in agriculture
▶ Sectori = 2 if i is in manufacturing
▶ Sectori = 3 if i is in services
One could use Sectori directly, and estimate
ln(wagei ) = β0 + β1 Educi + β2 Sectori + ϵi
β2 can be estimated, but it is not clear how one would interpret it
▶ Using Sectori directly also forces the wage difference between agriculture and
manufacturing to equal that between manufacturing and services (which is unrealistic)
Multiple Categories
Instead, from m = 3 categories, construct m − 1 = 2 dummy variables:
▶ Manufi = 1 if i is in manufacturing, 0 otherwise
▶ Servi = 1 if i is in services, 0 otherwise
▶ This leaves agriculture as the baseline category
Then estimate
ln(wagei ) = β0 + β1 Educi + δ0 Manufi + δ1 Servi + ϵi
δ0 and δ1 capture the wage gap/premium in manufacturing and services,
respectively, relative to agriculture (for given years of education)
Note: the m categories from which the m − 1 dummy variables are built must be
▶ Mutually exclusive
▶ Exhaustive
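Constructing the two dummies from the coded sector variable is mechanical. The sketch below assumes a made-up list of sector codes for six workers; the variable names `manuf` and `serv` mirror Manufi and Servi but are otherwise arbitrary.

```python
# Hypothetical sector codes for six workers:
# 1 = agriculture, 2 = manufacturing, 3 = services.
sector = [1, 3, 2, 3, 1, 2]

# m = 3 categories -> m - 1 = 2 dummies; agriculture is the omitted baseline.
manuf = [1 if s == 2 else 0 for s in sector]
serv = [1 if s == 3 else 0 for s in sector]

for s, m, v in zip(sector, manuf, serv):
    # Each worker switches on at most one dummy;
    # agriculture workers have both dummies equal to 0.
    assert m + v <= 1
    print(s, m, v)
```

Choosing a different baseline only changes which category is left without a dummy; the coefficients are then read relative to that category.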
Too Many Categories!
We could, in principle, turn any variable into a set of dummy variables
▶ Take the variable Educi (years of education of i). We could define:
▶ Educ0i = 1 if Educi = 0, 0 otherwise
▶ Educ1i = 1 if Educi = 1, 0 otherwise
▶ Educ2i = 1 if Educi = 2, 0 otherwise
▶ And so on
But this would result in a large number of dummy variables ...
▶ And a lot of parameters to be estimated ...
▶ Which means a large loss of degrees of freedom ...
▶ And parameter estimates which may be difficult to interpret
In practice, we could instead create more sensible dummy variables, e.g.
▶ Secondaryi = 1 if i has completed secondary school
▶ Degreei = 1 if i has completed a degree
Multiplicative Dummy Variables
Recall our earlier regression model
ln(wagei ) = β0 + β1 Educi + δ0 Femalei + ϵi
We have seen how this allows the intercept to differ between men and women
But what if we suspect that the return to education differs by gender?
To allow for this possibility, we include a multiplicative dummy variable
ln(wagei ) = β0 + β1 Educi + δ0 Femalei + δ1 Femalei × Educi + ϵi
This will allow for a difference in gradients, as well as in intercepts
Differences in Gradients and Intercepts
In our regression model, how to interpret δ0 and δ1 ?
ln(wagei ) = β0 + β1 Educi + δ0 Femalei + δ1 Femalei × Educi + ϵi
First, take expectations for i with Femalei = 1 for given Educi
E(ln(wagei )|Femalei = 1, Educi ) = (β0 + δ0 ) + (β1 + δ1 ) Educi
And for i with Femalei = 0 for given Educi
E(ln(wagei )|Femalei = 0, Educi ) = β0 + β1 Educi
As before, there is a difference in intercepts when δ0 ≠ 0
But there is also a difference in gradients when δ1 ≠ 0
▶ Intuitively: a difference in the marginal effect of education on wages
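The two conditional expectations can be combined into a single worked statement. The derivation below uses only the model's own symbols and shows both the marginal effect of education and the gender gap at a given level of education:

```latex
\begin{align*}
\frac{\partial E(\ln(wage_i) \mid Female_i, Educ_i)}{\partial Educ_i}
  &= \beta_1 + \delta_1 Female_i \\
E(\ln(wage_i) \mid Female_i = 1, Educ_i) - E(\ln(wage_i) \mid Female_i = 0, Educ_i)
  &= \delta_0 + \delta_1 Educ_i
\end{align*}
```

So the return to education is β1 for men and β1 + δ1 for women, and the gender gap is no longer a single number: it varies with Educi whenever δ1 ≠ 0.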
Visualising Multiplicative Dummy Variables
Depending on sign and significance of δ̂0 and δ̂1 , four possible scenarios
▶ Scenario 1: δ̂0 = 0 and δ̂1 = 0 (coincident regressions)
▶ Scenario 2: δ̂0 ≠ 0 and δ̂1 = 0 (parallel regressions)
▶ Scenario 3: δ̂0 = 0 and δ̂1 ≠ 0 (concurrent regressions)
▶ Scenario 4: δ̂0 ≠ 0 and δ̂1 ≠ 0 (dissimilar regressions)
Interactive Dummy Variables
Just as we can multiply a dummy variable with a quantitative variable, we can
multiply (interact) two dummy variables
Suppose we suspect a differential gender wage gap in services
ln(wagei ) = β0 + β1 Educi + δ0 Femalei + δ1 Servi + δ2 Femalei × Servi + ϵi
Where we have defined the dummy variables Femalei and Servi as follows:
▶ Femalei = 1 if i is female, 0 otherwise
▶ Servi = 1 if i is in service sector, 0 otherwise
How do we interpret the three parameters δ0 , δ1 and δ2 ?
Interpreting Interactions
Take expectations for different values of Femalei and Servi , for given Educi
Male in non-service sector
E(ln(wagei )|Femalei = 0 & Servi = 0, Educi ) = β0 + β1 Educi
Female in non-service sector
E(ln(wagei )|Femalei = 1 & Servi = 0, Educi ) = (β0 + δ0 ) + β1 Educi
Male in service sector
E(ln(wagei )|Femalei = 0 & Servi = 1, Educi ) = (β0 + δ1 ) + β1 Educi
Female in service sector
E(ln(wagei )|Femalei = 1 & Servi = 1, Educi ) = (β0 + δ0 + δ1 + δ2 ) + β1 Educi
Interpreting Interactions
Difference between women and men in non-service
(with some abuse of notation)
E(.|F-NS, .) − E(.|M-NS, .) = ((β0 + δ0 ) + β1 Educi ) − (β0 + β1 Educi )
= δ0
Difference between men in service and non-service
E(.|M-S, .) − E(.|M-NS, .) = ((β0 + δ1 ) + β1 Educi ) − (β0 + β1 Educi )
= δ1
Difference between the service/non-service gap for women
and the service/non-service gap for men
[E(.|F-S, .) − E(.|F-NS, .)] − [E(.|M-S, .) − E(.|M-NS, .)]
= [((β0 + δ0 + δ1 + δ2 ) + β1 Educi ) − ((β0 + δ0 ) + β1 Educi )]
− [((β0 + δ1 ) + β1 Educi ) − (β0 + β1 Educi )]
= [δ1 + δ2 ] − [δ1 ]
= δ2
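The same algebra can be checked numerically. The sketch below plugs made-up parameter values (assumptions for illustration only) into the interacted model and computes the service-sector premium for each gender; the function name `expected_log_wage` and the fixed education level of 12 years are likewise arbitrary choices.

```python
# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1 = 1.5, 0.08
DELTA0 = -0.2   # coefficient on Female
DELTA1 = 0.1    # coefficient on Serv
DELTA2 = -0.05  # coefficient on Female x Serv


def expected_log_wage(female, serv, educ=12):
    """E(ln(wage) | Female, Serv, Educ) under the interacted model."""
    return (BETA0 + BETA1 * educ + DELTA0 * female
            + DELTA1 * serv + DELTA2 * female * serv)


# Service-sector premium within each gender, holding education fixed.
premium_women = expected_log_wage(1, 1) - expected_log_wage(1, 0)
premium_men = expected_log_wage(0, 1) - expected_log_wage(0, 0)

# The difference between the two premia recovers DELTA2.
diff_in_diff = premium_women - premium_men
print(round(premium_men, 10), round(premium_women, 10), round(diff_in_diff, 10))
```

The premium for men equals DELTA1, the premium for women equals DELTA1 + DELTA2, and their difference is DELTA2, matching the last line of the derivation.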
OLS Estimator with Dummy RHS Variables
So far, we have said little about estimation and inference
Does OLS give us what we need to estimate parameters on dummy variables
and test for their significance?
Thankfully: yes!
Consider again our basic regression model, continuous Yi and Xi
Yi = β0 + β1 Xi + ϵi
We saw in the first part of the course that our OLS estimator of β1 is:
\[
\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
\]
OLS Estimator with Dummy RHS Variables
Now replace Xi with dummy variable Di
Yi = β0 + δ0 Di + ϵi
What happens if we apply the same estimator to δ0 ?
\[
\hat{\delta}_0 = \frac{\sum (D_i - \bar{D})(Y_i - \bar{Y})}{\sum (D_i - \bar{D})^2}
\]
Does this correspond to something that makes intuitive sense?
OLS Estimator with Dummy RHS Variables
First, some definitions:
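▶ Let n0 be the number of observations with Di = 0
▶ Let n1 be the number of observations with Di = 1
▶ (n0 + n1 = n by definition of Di )
▶ Proportion with Di = 1 is p = n1 / n (so that D̄ = p)
▶ Order the observations so that those with Di = 0 come first
▶ Mean of Yi when Di = 0 is $\bar{Y}_0 = \frac{1}{n_0} \sum_{i=1}^{n_0} Y_i$
▶ Mean of Yi when Di = 1 is $\bar{Y}_1 = \frac{1}{n_1} \sum_{i=n_0+1}^{n} Y_i$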
OLS Estimator with Dummy RHS Variables
Starting with the numerator:
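\begin{align*}
\sum (D_i - \bar{D})(Y_i - \bar{Y})
  &= \sum D_i (Y_i - \bar{Y}) - \sum \bar{D} (Y_i - \bar{Y}) \\
  &= \sum D_i (Y_i - \bar{Y}) - \bar{D} \left( \sum Y_i - \sum \bar{Y} \right) \\
  &= \sum D_i (Y_i - \bar{Y}) - \bar{D} (n\bar{Y} - n\bar{Y}) \\
  &= \sum D_i (Y_i - \bar{Y}) \\
  &= \sum D_i Y_i - \bar{Y} \sum D_i \\
  &= \sum_{i=n_0+1}^{n} Y_i - n_1 \bar{Y} \\
  &= n_1 \bar{Y}_1 - n_1 \bar{Y} \\
  &= n_1 \bar{Y}_1 - n_1 [(1 - p)\bar{Y}_0 + p\bar{Y}_1] \\
  &= n_1 (1 - p)(\bar{Y}_1 - \bar{Y}_0)
\end{align*}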
OLS Estimator with Dummy RHS Variables
And now the denominator:
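\begin{align*}
\sum (D_i - \bar{D})^2
  &= \sum D_i^2 - 2\bar{D} \sum D_i + \sum \bar{D}^2 \\
  &= \sum D_i^2 - 2n\bar{D}^2 + n\bar{D}^2 \\
  &= \sum D_i^2 - n\bar{D}^2 \\
  &= n_1 - np^2 \\
  &= n_1 - n_1 p \\
  &= n_1 (1 - p)
\end{align*}
(The fourth line uses Di² = Di , so that the sum of Di² is n1 , and D̄ = p; the fifth uses np = n1 )
Therefore:
\[
\hat{\delta}_0 = \frac{\sum (D_i - \bar{D})(Y_i - \bar{Y})}{\sum (D_i - \bar{D})^2}
  = \frac{n_1 (1 - p)(\bar{Y}_1 - \bar{Y}_0)}{n_1 (1 - p)}
  = \bar{Y}_1 - \bar{Y}_0
\]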
So OLS estimator corresponds to our intuitive interpretation of δ̂0
▶ And conducting hypothesis tests (using t- and F -tests) proceeds as normal
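The equivalence between the OLS slope on a dummy and the difference in group means is easy to verify by hand. The sketch below applies the bivariate OLS formula to a small made-up dataset (eight observations, values chosen arbitrarily for illustration) and compares the result with the difference in means.

```python
# Hypothetical outcomes Y and dummy D for eight observations.
d = [0, 0, 0, 1, 1, 0, 1, 1]
y = [2.0, 3.0, 2.5, 4.0, 5.0, 3.5, 4.5, 3.5]

n = len(y)
d_bar = sum(d) / n
y_bar = sum(y) / n

# OLS slope from the bivariate formula.
numerator = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
denominator = sum((di - d_bar) ** 2 for di in d)
delta0_hat = numerator / denominator

# Group means of Y for D = 1 and D = 0.
y1 = [yi for di, yi in zip(d, y) if di == 1]
y0 = [yi for di, yi in zip(d, y) if di == 0]
diff_in_means = sum(y1) / len(y1) - sum(y0) / len(y0)

print(delta0_hat, diff_in_means)
```

Here n1 = 4 and p = 0.5, so the denominator equals n1(1 − p) = 2, and both the slope and the difference in means come out at 1.5.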
Dummy Dependent Variables
We have seen how to use dummy independent variables, but what about
dummy dependent variables?
That is, what we care about is some binary outcome, e.g.:
▶ Is an individual employed?
▶ Has an individual attained a degree?
▶ Is a politician re-elected?
▶ Is a pair of countries at war with each other?
Dummy Dependent Variables
If we write down a standard multiple regression model, with Di ∈ {0, 1}:
Di = β0 + β1 X1i + β2 X2i + ... + βk Xki + ϵi
This is called the linear probability model
▶ Commonly used in empirical economics research
But a few questions naturally arise
▶ How do we interpret βj ?
▶ Can we estimate this by OLS?
▶ Are any classical assumptions violated?
Interpreting βj with Dummy D.V.
Take expectations of the model, under assumption that E(ϵi |Xi ) = 0
E(Di |Xi ) = β0 + β1 X1i + β2 X2i + ... + βk Xki
With binary Di , either Di = 0 or Di = 1, and therefore:
E(Di |Xi ) = P (Di = 1|Xi )
So, can rewrite our equation for the expectation:
P (Di = 1|Xi ) = β0 + β1 X1i + β2 X2i + ... + βk Xki
And therefore we can interpret, e.g., β1 as:
\[
\beta_1 = \frac{\partial P(D_i = 1 \mid X_i)}{\partial X_{1i}}
\]
So β1 is change in probability that Di = 1 (“success”) as X1i changes
▶ And this can be estimated by OLS
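A minimal sketch of a linear probability model with a single regressor, estimated with the bivariate OLS formulas. The data are made up for illustration: `x` is taken to be years of education and `d` an employment indicator, neither of which comes from the slides.

```python
# Hypothetical data: x = years of education, d = 1 if employed, 0 otherwise.
x = [8, 10, 12, 14, 16, 18]
d = [0, 0, 1, 0, 1, 1]

n = len(x)
x_bar = sum(x) / n
d_bar = sum(d) / n

# Bivariate OLS: slope and intercept of the linear probability model.
s_xd = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xd / s_xx
b0 = d_bar - b1 * x_bar

print(round(b0, 4), round(b1, 4))
```

In this toy sample the slope is 0.1: one additional unit of x is associated with a 0.1 increase (10 percentage points) in the predicted probability that d = 1.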
Drawbacks of Linear Probability Model
Despite its many advantages, the LPM has four main drawbacks
1. Non-normality of error term
2. Heteroscedasticity
3. Predicted probabilities outside [0, 1]
4. Assumes constant marginal effects
Classical Assumptions with Dummy D.V.
I. Non-normality of error term
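Recall:
\[
\hat{\epsilon}_i = D_i - \hat{D}_i = D_i - \left[ \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki} \right]
\]
As Di ∈ {0, 1}, $\hat{\epsilon}_i$ can only take two values for a given Xi
▶ $0 - \left[ \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki} \right]$
▶ $1 - \left[ \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki} \right]$
Therefore: $\hat{\epsilon}_i$ cannot be normally distributed
But: not a big problem in practice
▶ With large N, under the CLT, the sampling distribution of the OLS estimator
converges to a normal distribution
▶ So we are “fine”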
Classical Assumptions with Dummy D.V.
II. Heteroscedasticity
Homoscedasticity requires Var(ϵi |Xi ) = σ 2
▶ That is, σ 2 independent of i
▶ Does that hold with a dummy dependent variable?
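Recall that $\hat{\epsilon}_i$ can only take two values for a given Xi
▶ $0 - \left[ \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki} \right]$
▶ $1 - \left[ \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki} \right]$
Therefore:
\begin{align*}
Var(\epsilon_i \mid X_i) &= E(\epsilon_i^2 \mid X_i) \\
  &= P(D_i = 0)\left( -[\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \dots] \right)^2
   + P(D_i = 1)\left( 1 - [\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \dots] \right)^2 \\
  &\neq \sigma^2
\end{align*}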
So LPM suffers from heteroscedasticity: employ usual tools
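The form of the heteroscedasticity can be made explicit. As a sketch, write p_i for P(Di = 1|Xi ), which under the LPM is the linear index evaluated at the true parameters (the shorthand p_i is introduced here and is not used elsewhere on these slides). The error then takes the value −p_i with probability 1 − p_i and the value 1 − p_i with probability p_i:

```latex
\begin{align*}
Var(\epsilon_i \mid X_i) &= (1 - p_i)(0 - p_i)^2 + p_i (1 - p_i)^2 \\
  &= p_i (1 - p_i) \left[ p_i + (1 - p_i) \right] \\
  &= p_i (1 - p_i)
\end{align*}
```

Because p_i depends on Xi , so does the error variance: it is largest when p_i = 0.5 and shrinks towards zero as p_i approaches 0 or 1.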
Predicted Probabilities Outside [0, 1]
Ideally, our estimated probabilities P (Di = 1|Xi ) should be between 0 and 1
▶ Because they are probabilities!
However, nothing in the LPM imposes this restriction
▶ And we can sometimes get nonsensical predictions as a result
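A small illustration of how such predictions arise. The coefficients below are hypothetical values for a one-regressor LPM, chosen only to make the point; `predicted_probability` is a helper defined for this sketch.

```python
# Hypothetical LPM coefficients for a single regressor x.
B0, B1 = -0.8, 0.1


def predicted_probability(x):
    """Fitted value of the linear probability model at x."""
    return B0 + B1 * x


for x in (6, 13, 20):
    p_hat = predicted_probability(x)
    flag = "" if 0.0 <= p_hat <= 1.0 else "  <- outside [0, 1]"
    print(f"x = {x:2d}: predicted P(D = 1) = {p_hat:.2f}{flag}")
```

Values of x far enough from the centre of the data push the fitted line below 0 or above 1, since a straight line is unbounded.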
Assumes Constant Marginal Effects
As LPM is linear, the marginal effect of Xi on P (Di = 1) is constant
across all values of Xi
This may not be particularly realistic
▶ Might expect changes in Xi to have larger impact in centre of distribution
▶ And less impact at the extremes
Beyond the Linear Probability Model
Our workhorse OLS serves us relatively well even with binary outcomes
Four issues of varying severity
▶ Non-normality of errors
▶ Heteroscedasticity
▶ Sometimes-nonsensical estimated probabilities
▶ Assumption of constant marginal effects
In practice, all can usually be overcome
And there exist specialised tools precisely for binary outcomes
▶ Logit
▶ Probit
(Although we do not cover them in this module)