# Diff in diff ```9: Difference-in-Differences
Simon D. Woodcock
ECON 836
Spring 2022
1 The Difference in Difference Estimator
1. Sometimes “nature” gives us a quasi-experiment. These natural experiments are often
policy-driven: an unanticipated policy change affects one group but not another. More
generally, we might imagine an unanticipated shock that affects one group but not
another. In either case, we can think of the affected group as our treatment group
(Di = 1), and the unaffected group as our control group (Di = 0). Under some
conditions, we can exploit these natural experiments to measure causal effects of the
treatment/shock.
2. When we observe outcomes for both groups both before and after the shock, then we
can improve upon simple before vs. after comparisons for the treatment group (where
we often can’t credibly separate causal effects of the shock from trend) and simple
treatment vs. control comparisons after the shock (where we often can’t credibly
separate causal effects from selection bias).
3. The basic idea is simple: compare the before vs. after change in outcomes for the
treatment group to the before vs. after change for the control group. The before
vs. after change for each group removes any unobserved heterogeneity in average
outcomes for that group. The before vs. after change for the treatment group should
therefore equal the causal effect plus trend. The before vs. after change for the control
group should contain trend only, since they are unaffected by the policy. Under the
assumption that trend is the same for the two groups, the difference between the two
before vs. after changes (i.e., the “difference in differences”) should isolate the causal
effect.
4. To formalize this idea, let T = 0 denote the time period before the shock, and T = 1
denote the time period after the shock. The difference in differences (DD) is defined
as:
$$
\begin{aligned}
DD = {} & \big( E[Y_i(1) \mid D_i = 1, T = 1] - E[Y_i(0) \mid D_i = 1, T = 0] \big) \\
& - \big( E[Y_i(0) \mid D_i = 0, T = 1] - E[Y_i(0) \mid D_i = 0, T = 0] \big).
\end{aligned}
$$
Our previous definition of ATT did not formally distinguish between time periods before and after treatment. A natural way to adapt our definition is
$ATT = E[Y_i(1) \mid D_i = 1, T = 1] - E[Y_i(0) \mid D_i = 1, T = 1]$. If we set this equal to the above expression for DD and solve,
we obtain the following condition under which DD = ATT:
$$
E[Y_i(0) \mid D_i = 1, T = 1] - E[Y_i(0) \mid D_i = 1, T = 0]
= E[Y_i(0) \mid D_i = 0, T = 1] - E[Y_i(0) \mid D_i = 0, T = 0].
$$
This is usually called the parallel trends or common trends assumption. It says that the
(counterfactual) change in outcomes that the treatment group would have experienced
in the absence of treatment is the same as the actual change in outcomes for the
control group. The observed change in outcomes for the control group thus provides a
counterfactual for what would have happened to the treatment group in the absence
of treatment. When this assumption holds, DD = ATT and it measures the before
vs. after change in outcomes for the treatment group, relative to what it would have
been under the counterfactual.
5. A sample estimator for DD is easy to operationalize using sample means: just compute
the difference between the sample mean of the outcome variable before and after the
shock for each group, and then compute the difference between the group differences:
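$$
\widehat{DD} = \left( \bar{Y}_{\text{TREAT},\text{AFTER}} - \bar{Y}_{\text{TREAT},\text{BEFORE}} \right) - \left( \bar{Y}_{\text{CONTROL},\text{AFTER}} - \bar{Y}_{\text{CONTROL},\text{BEFORE}} \right).
$$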
6. Of course we will usually want to control for observables X, in which case it’s easy to
operationalize a DD estimator using regression. There are various reasons we might
want to include Xs, but the usual reason is because we think the parallel trends
assumption is more likely to hold conditionally than unconditionally. For example, if
we think that Y depends on some X that varies over time and that has followed a
different time path for the two groups, then we might be more confident in the parallel
trends assumption when holding X constant. The simplest regression estimator is:
$$
y_{it} = \alpha + x_{it}'\beta + \gamma D_i + \mu T_t + \delta D_i T_t + \varepsilon_{it}
$$
where xit is a vector of time-varying control variables; Di = 1 for the treatment group,
and zero otherwise; and Tt = 1 for the “after” time period, and zero otherwise. Notice
that our variables have two indexes now. We have (at least) two time periods (indexed
by t) for each unit of observation (people, states, or whatever the case might be;
indexed by i). Each i belongs to either the treatment or control group. This doesn’t
vary over time in the basic setup, and hence there is no t subscript on Di (we’ll see
a generalization below). Time periods are common to all individuals, and hence Tt
doesn’t have an i subscript either.
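7. It’s straightforward to show that δ = DD. Holding $x_{it}$ fixed and suppressing the common term $x_{it}'\beta$, notice that:
$$
\begin{aligned}
\alpha &= E[y_{it} \mid x_{it}, D_i = 0, T_t = 0] \\
\alpha + \mu &= E[y_{it} \mid x_{it}, D_i = 0, T_t = 1] \\
\Rightarrow \mu &= E[y_{it} \mid x_{it}, D_i = 0, T_t = 1] - E[y_{it} \mid x_{it}, D_i = 0, T_t = 0] \\
\alpha + \gamma &= E[y_{it} \mid x_{it}, D_i = 1, T_t = 0] \\
\alpha + \gamma + \mu + \delta &= E[y_{it} \mid x_{it}, D_i = 1, T_t = 1] \\
\Rightarrow \mu + \delta &= E[y_{it} \mid x_{it}, D_i = 1, T_t = 1] - E[y_{it} \mid x_{it}, D_i = 1, T_t = 0] \\
\Rightarrow \delta &= (\mu + \delta) - \mu = DD.
\end{aligned}
$$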
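The equivalence between the sample-means estimator and the regression coefficient δ can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes NumPy, omits the controls x, and uses simulated data with a true effect of 2.0. The function names `dd_from_means` and `dd_from_regression` are hypothetical.

```python
import numpy as np

def dd_from_means(y, d, t):
    """Difference in differences from the four group-by-period sample means."""
    y, d, t = map(np.asarray, (y, d, t))
    mean = lambda g, p: y[(d == g) & (t == p)].mean()
    return (mean(1, 1) - mean(1, 0)) - (mean(0, 1) - mean(0, 0))

def dd_from_regression(y, d, t):
    """OLS of y on a constant, D, T and D*T; returns the coefficient on D*T."""
    y, d, t = map(np.asarray, (y, d, t))
    X = np.column_stack([np.ones_like(y, dtype=float), d, t, d * t])
    coef, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return coef[3]

# Simulated (hypothetical) data: group effect 0.5, common trend 0.3, causal effect 2.0
rng = np.random.default_rng(0)
n = 400
d = rng.integers(0, 2, n)   # 1 = treatment group
t = rng.integers(0, 2, n)   # 1 = "after" period
y = 1.0 + 0.5 * d + 0.3 * t + 2.0 * d * t + rng.normal(0, 1, n)

dd_means = dd_from_means(y, d, t)
dd_ols = dd_from_regression(y, d, t)
print(dd_means, dd_ols)     # the two numbers agree up to floating-point error
```

Without covariates the regression is saturated (one parameter per group-period cell), which is why the two estimates coincide exactly rather than approximately.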
1.1 Multiple groups and periods
1. People have long used a generalization of the DD estimator with multiple groups and
multiple time periods, where the different groups are treated at different times. However, recent papers by Goodman-Bacon (2018) and de Chaisemartin and d’Haultfoeuille
(2020) have cast some doubt on the usefulness of this approach. It’s worthwhile to see
how this setup works because it shows up in a lot of papers (really a lot! nearly 20
percent of all papers published in the AER between 2010 and 2012!), and then we’ll
discuss the recent critique of this approach.
2. The usual setting is one where we have multiple groups (usually countries, states, or
provinces) that introduce similar policy changes (e.g., a tax/transfer program or a
regulation) at different moments in time, and we want to estimate the effect of these
policy changes on some outcome. This setting is usually called “staggered adoption”
of a policy.
3. The idea is intuitive. When the first group makes its policy change, all the other
groups can act as the control group for that change. When the second group makes its
policy change, all the other groups (including the first one, since its policy has already
changed and hence it isn’t changing now!) could act as the control group for that
change. And so on. One way to operationalize such a DD estimator is to introduce a
dummy for each group (call it a state in what follows, indexed by s) in place of $D_i$; a
dummy for each time period in place of $T_t$; and a dummy $\text{AFTER}_{st}$ that equals one
if state s has already made its policy change in t, and zero otherwise. That is,
$$
y_{st} = \alpha + x_{st}'\beta + \sum_{j=1}^{S} \theta_j \, \text{STATE}_j + \sum_{i=1}^{T} \tau_i \, \text{TIME}_i + \delta \, \text{AFTER}_{st} + \varepsilon_{st}
$$
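where we’re using s, j = 1, ..., S to index states; $\text{STATE}_s$ is a dummy variable that
equals one if the state is s; t, i = 1, ..., T index time periods; $\text{TIME}_t$ is a dummy variable that equals one if the time period is t; and all other terms are as defined previously.
The summation notation is a little confusing, but notice that $\text{STATE}_s$, $\text{TIME}_t$ and
$\text{AFTER}_{st}$ are all dummy variables, and only one of the $\text{STATE}_s$ and $\text{TIME}_t$ dummies
will equal one for each observation (namely, those corresponding to the observation’s state
s and time period t). As a consequence, we can rewrite the model more simply as:
$$
\begin{aligned}
y_{st} &= \alpha + x_{st}'\beta + \theta_s \, \text{STATE}_s + \tau_t \, \text{TIME}_t + \delta \, \text{AFTER}_{st} + \varepsilon_{st} \\
&= \alpha + x_{st}'\beta + \theta_s + \tau_t + \delta \, \text{AFTER}_{st} + \varepsilon_{st}
\end{aligned}
$$
$\theta_s$ and $\tau_t$ are usually called state and time fixed effects, respectively. (In estimation we omit one state dummy and one time dummy, since the full sets are perfectly collinear with the constant α.)

Table 1:

| s | t | y_st | STATE_1 | STATE_2 | STATE_3 | TIME_1 | TIME_2 | TIME_3 | AFTER_st |
|---|---|------|---------|---------|---------|--------|--------|--------|----------|
| 1 | 1 | y11  | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | y12  | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 3 | y13  | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1 | y21  | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | y22  | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 3 | y23  | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1 | y31  | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 2 | y32  | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 3 | y33  | 0 | 0 | 1 | 0 | 0 | 1 | 1 |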
4. If we have 3 states and 3 time periods, and the states introduce the policy in periods
1, 2, and 3, respectively, then our data might look like Table 1. Often our data take
the form of state means in applications like this, but they need not. For example, we
might have data on (say) individuals, each of whom resides in some state s at time t.
In this case, our data would have an extra i subscript to index individuals.
5. The recent criticism of this approach is as follows. Goodman-Bacon (2019), Callaway
and Sant’Anna (2020), and de Chaisemartin and d’Haultfoeuille (2020) show that this
multi-period multi-group version of DD estimates a weighted sum of the ATT in each
group and period. That sounds like a good thing ... but the problem is that the weights
can get a little wonky. If ATT is homogeneous – the same for every group and every
time period – then the wonky weights don’t matter: any average will get us to the
“right” ATT. But if ATT is heterogeneous – varying over time or between groups or
both – then wonky weights can cause real problems, because sometimes the weights
are negative. This can give rise to situations where ATT is positive in every group
and period, but the weighted-average ATT we get from the multi-period multi-group
DD is negative! This clearly isn’t useful. The negative weights arise when we have
“control” states that are treated in two consecutive periods ... and so de Chaisemartin
and d’Haultfoeuille’s (2020) proposed alternative is an estimator that doesn’t use those
consecutive periods. Specifically, their alternative is to estimate the ATT across all
state-time pairs where treatment status changes between t and t + 1 and then average
those. Conveniently, they have provided Stata code that makes it easy to diagnose
and fix problems. You can use the twowayfeweights package to compute the weights
for each group and period. If many of them are negative, you can use the fuzzydid
and did_multiplegt packages to apply their estimator.
6. Callaway and Sant’Anna (2020) propose a different estimator with a similar idea:
estimate ATT as a weighted average of group-specific ATTs using sensible weights,
but in this case the weights account for differences between the xst of treatment and
control groups. These authors have provided R code to implement their estimator via
the did package, available on CRAN.
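The staggered-adoption design and the negative-weights problem can be illustrated on the 3-state, 3-period layout described above (adoption in periods 1, 2 and 3). The sketch below is an illustration only, not the authors' code: it assumes NumPy, no covariates, and a noise-free outcome with hypothetical cell-level ATTs that are all positive. The weights come from the Frisch-Waugh-Lovell theorem: the two-way fixed effects coefficient on AFTER equals a weighted sum of treated-cell outcomes, with weights proportional to the residual from regressing AFTER on the state and time dummies.

```python
import numpy as np

S, T = 3, 3
adopt = np.array([1, 2, 3])               # period in which each state adopts the policy

s = np.repeat(np.arange(1, S + 1), T)     # 1,1,1,2,2,2,3,3,3
t = np.tile(np.arange(1, T + 1), S)       # 1,2,3,1,2,3,1,2,3
after = (t >= adopt[s - 1]).astype(float) # AFTER_st dummy

# Fixed-effects design: constant, state dummies (state 1 omitted),
# time dummies (period 1 omitted) to avoid perfect collinearity.
state_d = (s[:, None] == np.arange(2, S + 1)).astype(float)
time_d = (t[:, None] == np.arange(2, T + 1)).astype(float)
F = np.column_stack([np.ones(S * T), state_d, time_d])

# Residualize AFTER on the fixed effects; normalize over treated cells.
b, *_ = np.linalg.lstsq(F, after, rcond=None)
resid = after - F @ b
treated = after == 1
weights = np.where(treated, resid / resid[treated].sum(), 0.0)

# Hypothetical heterogeneous effects: ATT = 1 in every treated cell,
# except ATT = 5 for state 1 in period 3. All are positive.
att = np.where(treated, 1.0, 0.0)
att[(s == 1) & (t == 3)] = 5.0
y = att * after                           # no noise, no fixed effects in the outcome

# Two-way fixed effects regression of y on the dummies and AFTER.
X = np.column_stack([F, after])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_hat = coef[-1]
weighted_sum = (weights * att).sum()

print(weights.reshape(S, T))              # rows = states, columns = periods
print(delta_hat, weighted_sum)            # both are negative
```

In this design the early adopter's last period receives a negative weight, so a large positive effect in that cell pulls the estimate below zero even though every cell-level ATT is positive. For applied work the twowayfeweights package mentioned above is the appropriate tool; this sketch only shows the mechanics.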
1.2 Triple Differences
1. In some cases we have more than one type of control group available to us, and we
can obtain a more convincing estimate by exploiting multiple control groups. For
example, suppose a state implements a change in health care policy aimed at the
elderly, say people 65 and older, and y is a health outcome. One possibility is to
estimate a DD model on data from the state with the policy change, both before and
after the change, with the control group being people under 65 and the treatment group
being people 65 and older. The potential problem with this DD is that other factors
unrelated to the state’s new policy might affect the health of the elderly relative to the
younger population, e.g., changes in health care technology or national/federal policy.
A different DD analysis would use the elderly in another state, one that did not
change its policy, as the control group. Here, the problem is that
changes in the health of the elderly might be systematically different across states due
to, say, income and wealth differences, rather than the policy change.
2. A more robust analysis can be obtained by combining both analyses, e.g., using control
groups within the treatment state and from a control state. If we let a index age
groups; define Tt = 1 for the “after” period and zero otherwise; Ds = 1 for the state
that implements the policy change and zero otherwise; and Ea = 1 for the elderly age
group and zero otherwise, then the expanded model is:
$$
y_{ast} = \alpha + x_{ast}'\beta + \gamma D_s + \varphi E_a + \mu T_t + \eta D_s E_a + \phi D_s T_t + \kappa E_a T_t + \delta D_s E_a T_t + \varepsilon_{ast}
$$
The coefficient of interest is δ, the coefficient on the triple interaction term, Ds Ea Tt .
Without covariates xast , the OLS estimate δ̂ can be expressed as follows:
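$$
\begin{aligned}
\hat{\delta} = {} & \left[ \left( \bar{y}_{\text{TREAT},\text{ELDERLY},\text{AFTER}} - \bar{y}_{\text{TREAT},\text{ELDERLY},\text{BEFORE}} \right) - \left( \bar{y}_{\text{CONTROL},\text{ELDERLY},\text{AFTER}} - \bar{y}_{\text{CONTROL},\text{ELDERLY},\text{BEFORE}} \right) \right] \\
& - \left[ \left( \bar{y}_{\text{TREAT},\text{YOUNG},\text{AFTER}} - \bar{y}_{\text{TREAT},\text{YOUNG},\text{BEFORE}} \right) - \left( \bar{y}_{\text{CONTROL},\text{YOUNG},\text{AFTER}} - \bar{y}_{\text{CONTROL},\text{YOUNG},\text{BEFORE}} \right) \right]
\end{aligned}
$$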
This is the difference between two DD estimators, and hence we call it the difference-in-difference-in-differences, or triple difference (DDD), estimator. The first DD measures
the before vs. after change in outcomes for the elderly in the treatment state vs.
the control state. The second DD measures the before vs. after change in outcomes for
the young in the treatment state vs. the control state. Since we don’t expect to
find any causal effect for the young, you can think of this second DD as a placebo
DD. By taking the difference between the two DD estimators, the DDD estimator
controls for two kinds of trend. The first DD should net out any trend in the health
outcomes of the elderly that is common to both states. The latter (placebo)
DD should capture the difference in trends between the treatment and control states.
Subtracting this from the first DD removes any difference in state-level trends from
the resulting DDD estimate.
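The triple-difference logic can be sketched numerically. The example below is an assumption-laden illustration, not part of the original notes: it uses NumPy, omits the covariates, and simulates data in which the treatment state has its own trend (0.3), the elderly have their own trend (0.5), and the true policy effect on the elderly in the treatment state is 1.5. The function names are hypothetical.

```python
import numpy as np

def ddd_from_means(y, d, e, t):
    """Triple difference from the eight state-by-age-by-period cell means."""
    y, d, e, t = map(np.asarray, (y, d, e, t))

    def dd(age):
        # DD across states (treatment vs. control) within one age group
        m = lambda state, period: y[(d == state) & (e == age) & (t == period)].mean()
        return (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))

    return dd(1) - dd(0)   # elderly DD minus young (placebo) DD

def ddd_from_regression(y, d, e, t):
    """OLS with all main effects and interactions; returns the coefficient on D*E*T."""
    y, d, e, t = (np.asarray(v, dtype=float) for v in (y, d, e, t))
    X = np.column_stack([np.ones_like(y), d, e, t, d * e, d * t, e * t, d * e * t])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[-1]

# Simulated (hypothetical) data
rng = np.random.default_rng(1)
n = 2000
d = rng.integers(0, 2, n)   # 1 = state that changes policy
e = rng.integers(0, 2, n)   # 1 = elderly
t = rng.integers(0, 2, n)   # 1 = "after" period
y = (0.2 * d + 0.4 * e + 0.1 * t
     + 0.3 * d * t          # state-specific trend
     + 0.5 * e * t          # age-specific trend
     + 1.5 * d * e * t      # true policy effect
     + rng.normal(0, 1, n))

ddd_means = ddd_from_means(y, d, e, t)
ddd_ols = ddd_from_regression(y, d, e, t)
print(ddd_means, ddd_ols)   # identical up to floating-point error
```

In this simulated setup either simple DD would be biased: comparing elderly to young within the treatment state absorbs the age-specific trend, and comparing the elderly across states absorbs the state-specific trend. The triple difference removes both.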
1.3
1. We can never directly test the parallel trends assumption, but we can look for supporting evidence. Are there pre-existing trends in y before the treatment is applied?
If so, are those trends (approximately) parallel for the treatment and control groups?
It is good practice to plot y over time for the treatment and control groups in the pre-treatment period, and see whether pre-treatment trends are (approximately) parallel.
If they are not, it is unlikely that they would have been parallel in the post-treatment
period under the counterfactual, which makes the parallel trends assumption dubious.
2. Policy changes are often bundled. We need to ensure that no other relevant policies
are changing over the same time period, lest we confound the effect of the policy of
interest with other changes.
3. DD is often conceptualized as two time periods: “before” and “after.” In reality, time is
continuous and we often need to make a judgement about how to define these periods,
i.e., how long before the policy change should the “before” period be, and how long
after the change should the “after” period be. Sometimes policy changes are anticipated,
and this might affect outcomes in the pre-treatment period. Similarly, policy changes
sometimes take time to be implemented and affect behavior and outcomes.
4. If the policy variation is at a different level of aggregation than our unit of observation,
then we need to be careful about the calculation of standard errors. For example, we
often have data on individuals but policies vary at the state level. In this case, we
should cluster standard errors at the state level (or at whatever level the policy
varies). This works best when we have a large number of clusters.
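To make the clustering point concrete, here is a minimal sketch of the standard cluster-robust (sandwich) variance estimator for OLS. It is an illustration under stated assumptions rather than a recommended implementation: it uses NumPy only, omits the finite-sample corrections that statistical packages typically apply, and runs on simulated individual-level data with a hypothetical shock shared by everyone in the same state and period. The function name `ols_cluster_se` is hypothetical.

```python
import numpy as np

def ols_cluster_se(X, y, cluster):
    """OLS coefficients and cluster-robust standard errors (no small-sample correction)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta                      # residuals
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(cluster):
        idx = cluster == g
        score = X[idx].T @ u[idx]         # sum of x_i * u_i within cluster g
        meat += np.outer(score, score)
    V = XtX_inv @ meat @ XtX_inv          # sandwich formula
    return beta, np.sqrt(np.diag(V))

# Simulated (hypothetical) data: 40 states, 25 individuals per state and period
rng = np.random.default_rng(2)
G, m = 40, 25
state = np.repeat(np.arange(G), 2 * m)
t = np.tile(np.repeat([0, 1], m), G)      # "before" and "after" within each state
d = (state < G // 2).astype(int)          # first 20 states are treated
shock = rng.normal(0, 1, (G, 2))          # shock common to a state-period
y = (1.0 + 0.5 * d + 0.3 * t + 2.0 * d * t
     + shock[state, t] + rng.normal(0, 1, state.size))

X = np.column_stack([np.ones(state.size), d, t, d * t])
beta, se_cluster = ols_cluster_se(X, y, state)
# Treating every individual as its own cluster gives heteroskedasticity-robust SEs
_, se_robust = ols_cluster_se(X, y, np.arange(state.size))
print(beta[3], se_cluster[3], se_robust[3])
```

Because the simulated shocks are shared within states, the standard error on the interaction term that ignores clustering is too small here; clustering at the state level accounts for the within-state correlation.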
```