9: Difference-in-Differences
Simon D. Woodcock
ECON 836
Spring 2022

1 The Difference in Difference Estimator

1. Sometimes “nature” gives us a quasi-experiment. These natural experiments are often policy-driven: an unanticipated policy change affects one group but not another. More generally, we might imagine an unanticipated shock that affects one group but not another. In either case, we can think of the affected group as our treatment group ($D_i = 1$), and the unaffected group as our control group ($D_i = 0$). Under some conditions, we can exploit these natural experiments to measure causal effects of the treatment/shock.

2. When we observe outcomes for both groups both before and after the shock, we can improve upon simple before vs. after comparisons for the treatment group (where we often can’t credibly separate causal effects of the shock from trend) and simple treatment vs. control comparisons after the shock (where we often can’t credibly separate causal effects from selection bias).

3. The basic idea is simple: compare the before vs. after change in outcomes for the treatment group to the before vs. after change for the control group. The before vs. after change for each group removes any unobserved heterogeneity in average outcomes for that group. The before vs. after change for the treatment group should therefore equal the causal effect plus trend. The before vs. after change for the control group should contain trend only, since that group is unaffected by the policy. Under the assumption that trend is the same for the two groups, the difference between the two before vs. after changes (i.e., the “difference in differences”) should isolate the causal effect.

4. To formalize this idea, let $T = 0$ denote the time period before the shock, and $T = 1$ denote the time period after the shock. The difference in differences (DD) is defined as:
$$DD = \left( E[Y_i(1) \mid D_i = 1, T = 1] - E[Y_i(0) \mid D_i = 1, T = 0] \right) - \left( E[Y_i(0) \mid D_i = 0, T = 1] - E[Y_i(0) \mid D_i = 0, T = 0] \right).$$
Our previous definition of ATT did not formally distinguish between time periods before and after treatment. A natural way to adapt our definition is
$$ATT = E[Y_i(1) \mid D_i = 1, T = 1] - E[Y_i(0) \mid D_i = 1, T = 1].$$
If we set this equal to the above expression for DD and solve, we obtain the following condition under which $DD = ATT$:
$$E[Y_i(0) \mid D_i = 1, T = 1] - E[Y_i(0) \mid D_i = 1, T = 0] = E[Y_i(0) \mid D_i = 0, T = 1] - E[Y_i(0) \mid D_i = 0, T = 0].$$
This is usually called the parallel trends or common trends assumption. It says that the (counterfactual) change in outcomes that the treatment group would have experienced in the absence of treatment is the same as the actual change in outcomes for the control group. The observed change in outcomes for the control group thus provides a counterfactual for what would have happened to the treatment group in the absence of treatment. When this assumption holds, $DD = ATT$ and it measures the before vs. after change in outcomes for the treatment group, relative to what it would have been under the counterfactual.

5. A sample estimator for DD is easy to operationalize using sample means: just compute the difference between the sample mean of the outcome variable before and after the shock for each group, and then compute the difference between the group differences:
$$\widehat{DD} = \left( \bar{Y}_{TREAT,AFTER} - \bar{Y}_{TREAT,BEFORE} \right) - \left( \bar{Y}_{CONTROL,AFTER} - \bar{Y}_{CONTROL,BEFORE} \right).$$

6. Of course we will usually want to control for observables $X$, in which case it’s easy to operationalize a DD estimator using regression. There are various reasons we might want to include $X$s, but the usual reason is that we think the parallel trends assumption is more likely to hold conditionally than unconditionally. For example, if we think that $Y$ depends on some $X$ that varies over time and that has followed a different time path for the two groups, then we might be more confident in the parallel trends assumption when holding $X$ constant.
The simplest regression estimator is:
$$y_{it} = \alpha + x_{it}'\beta + \gamma D_i + \mu T_t + \delta D_i T_t + \varepsilon_{it}$$
where $x_{it}$ is a vector of time-varying control variables; $D_i = 1$ for the treatment group, and zero otherwise; and $T_t = 1$ for the “after” time period, and zero otherwise. Notice that our variables have two indexes now. We have (at least) two time periods (indexed by $t$) for each unit of observation (people, states, or whatever the case might be; indexed by $i$). Each $i$ belongs to either the treatment or control group. This doesn’t vary over time in the basic setup, and hence there is no $t$ subscript on $D_i$ (we’ll see a generalization below). Time periods are common to all individuals, and hence $T_t$ doesn’t have an $i$ subscript either.

7. It’s straightforward to show that $\delta = DD$, holding $x_{it}$ fixed. Notice that:
$$\begin{aligned}
\alpha + x_{it}'\beta &= E[y_{it} \mid x_{it}, D_i = 0, T_t = 0] \\
\alpha + x_{it}'\beta + \mu &= E[y_{it} \mid x_{it}, D_i = 0, T_t = 1] \\
\Rightarrow \mu &= E[y_{it} \mid x_{it}, D_i = 0, T_t = 1] - E[y_{it} \mid x_{it}, D_i = 0, T_t = 0] \\
\alpha + x_{it}'\beta + \gamma &= E[y_{it} \mid x_{it}, D_i = 1, T_t = 0] \\
\alpha + x_{it}'\beta + \gamma + \mu + \delta &= E[y_{it} \mid x_{it}, D_i = 1, T_t = 1] \\
\Rightarrow \mu + \delta &= E[y_{it} \mid x_{it}, D_i = 1, T_t = 1] - E[y_{it} \mid x_{it}, D_i = 1, T_t = 0] \\
\Rightarrow \delta &= DD.
\end{aligned}$$

1.1 Multiple groups and periods

1. People have long used a generalization of the DD estimator with multiple groups and multiple time periods, where the different groups are treated at different times. However, recent papers by Goodman-Bacon (2018) and de Chaisemartin and d’Haultfoeuille (2020) have cast some doubt on the usefulness of this approach. It’s worthwhile to see how this setup works because it shows up in a lot of papers (really a lot! nearly 20 percent of all papers published in the AER between 2010 and 2012!), and then we’ll discuss the recent critique of this approach.

2. The usual setting is one where we have multiple groups (usually countries, states, or provinces) that introduce similar policy changes (e.g., a tax/transfer program or a regulation) at different moments in time, and we want to estimate the effect of these policy changes on some outcome.
This setting is usually called “staggered adoption” of a policy.

3. The idea is intuitive. When the first group makes its policy change, all the other groups can act as the control group for that change. When the second group makes its policy change, all the other groups (including the first one, since its policy has already changed and hence it isn’t changing now!) could act as the control group for that change. And so on. One way to operationalize such a DD estimator is to introduce a dummy for each group (call it a state in what follows, indexed by $s$) in place of $D_i$; a dummy for each time period in place of $T_t$; and a dummy $AFTER_{st}$ that equals one if state $s$ has already made its policy change in $t$, and zero otherwise. That is,
$$y_{st} = \alpha + x_{st}'\beta + \sum_{j=1}^{S} \theta_j STATE_j + \sum_{i=1}^{T} \tau_i TIME_i + \delta AFTER_{st} + \varepsilon_{st}$$
where we’re using $s, j = 1, ..., S$ to index states; $STATE_s$ is a dummy variable that equals one if the state is $s$; $t, i = 1, ..., T$ index time periods; $TIME_t$ is a dummy variable that equals one if the time period is $t$; and all other terms are as defined previously. The summation notation is a little confusing, but notice that $STATE_s$, $TIME_t$ and $AFTER_{st}$ are all dummy variables, and exactly one $STATE_s$ dummy and one $TIME_t$ dummy will equal one for each observation (namely, those corresponding to the observation’s state $s$ and time period $t$). As a consequence, we can rewrite the model more simply as:
$$\begin{aligned}
y_{st} &= \alpha + x_{st}'\beta + \theta_s STATE_s + \tau_t TIME_t + \delta AFTER_{st} + \varepsilon_{st} \\
&= \alpha + x_{st}'\beta + \theta_s + \tau_t + \delta AFTER_{st} + \varepsilon_{st}.
\end{aligned}$$
$\theta_s$ and $\tau_t$ are usually called state and time fixed effects, respectively.

Table 1:

s  t  y_st  STATE1  STATE2  STATE3  TIME1  TIME2  TIME3  AFTER_st
1  1  y11   1       0       0       1      0      0      1
1  2  y12   1       0       0       0      1      0      1
1  3  y13   1       0       0       0      0      1      1
2  1  y21   0       1       0       1      0      0      0
2  2  y22   0       1       0       0      1      0      1
2  3  y23   0       1       0       0      0      1      1
3  1  y31   0       0       1       1      0      0      0
3  2  y32   0       0       1       0      1      0      0
3  3  y33   0       0       1       0      0      1      1

4.
If we have 3 states and 3 time periods, and the states introduce the policy in periods 1, 2, and 3, respectively, then our data might look like Table 1. Often our data take the form of state means in applications like this, but they need not. For example, we might have data on (say) individuals, each of whom resides in some state $s$ at time $t$. In this case, our data would have an extra $i$ subscript to index individuals.

5. The recent criticism of this approach is as follows. Goodman-Bacon (2019), Callaway and Sant’Anna (2020), and de Chaisemartin and d’Haultfoeuille (2020) show that this multi-period multi-group version of DD estimates a weighted sum of the ATT in each group and period. That sounds like a good thing ... but the problem is that the weights can get a little wonky. If ATT is homogeneous (the same for every group and every time period) then the wonky weights don’t matter: any average will get us to the “right” ATT. But if ATT is heterogeneous (varying over time or between groups or both) then wonky weights can cause real problems, because sometimes the weights are negative. This can give rise to situations where ATT is positive in every group and period, but the weighted-average ATT we get from the multi-period multi-group DD is negative! This clearly isn’t useful. The negative weights arise when we have “control” states that are treated in two consecutive periods ... and so de Chaisemartin and d’Haultfoeuille’s (2020) proposed alternative is an estimator that doesn’t use those consecutive periods. Specifically, their alternative is to estimate the ATT across all state-time pairs where treatment status changes between $t$ and $t + 1$ and then average those. Conveniently, they have provided Stata code that makes it easy to diagnose and fix problems. You can use the twowayfeweights package to compute the weights for each group and period. If many of them are negative, you can use the fuzzy_did and did_multiplegt packages to apply their estimator.

6.
Callaway and Sant’Anna (2020) propose a different estimator with a similar idea: estimate ATT as a weighted average of group-specific ATTs using sensible weights, but in this case the weights account for differences between the $x_{st}$ of treatment and control groups. These authors have provided R code to implement their estimator via the did package, available on CRAN.

1.2 Triple Differences

1. In some cases we have more than one type of control group available to us, and we can obtain a more convincing estimate by exploiting multiple control groups. For example, suppose a state implements a change in health care policy aimed at the elderly, say people 65 and older, and $y$ is a health outcome. One possibility is to estimate a DD model on data from the state with the policy change, both before and after the change, with the control group being people under 65 and the treatment group being people 65 and older. The potential problem with this DD is that other factors unrelated to the state’s new policy might affect the health of the elderly relative to the younger population, e.g., changes in health care technology or national/federal policy. A different DD analysis would be to use another state for comparison: the elderly in the policy state are the treatment group, and the elderly in the non-policy state are the control group. Here, the problem is that changes in the health of the elderly might be systematically different across states due to, say, income and wealth differences, rather than the policy change.

2. A more robust analysis can be obtained by combining both analyses, i.e., using control groups within the treatment state and from a control state.
If we let $a$ index age groups; define $T_t = 1$ for the “after” period and zero otherwise; $D_s = 1$ for the state that implements the policy change and zero otherwise; and $E_a = 1$ for the elderly age group and zero otherwise, then the expanded model is:
$$y_{ast} = \alpha + x_{ast}'\beta + \gamma D_s + \varphi E_a + \mu T_t + \eta D_s E_a + \phi D_s T_t + \kappa E_a T_t + \delta D_s E_a T_t + \varepsilon_{ast}$$
The coefficient of interest is $\delta$, the coefficient on the triple interaction term, $D_s E_a T_t$. Without covariates $x_{ast}$, the OLS estimate $\hat{\delta}$ can be expressed as follows:
$$\begin{aligned}
\hat{\delta} = {} & \left[ \left( \bar{y}_{TREAT,ELDERLY,AFTER} - \bar{y}_{TREAT,ELDERLY,BEFORE} \right) - \left( \bar{y}_{CONTROL,ELDERLY,AFTER} - \bar{y}_{CONTROL,ELDERLY,BEFORE} \right) \right] \\
& - \left[ \left( \bar{y}_{TREAT,YOUNG,AFTER} - \bar{y}_{TREAT,YOUNG,BEFORE} \right) - \left( \bar{y}_{CONTROL,YOUNG,AFTER} - \bar{y}_{CONTROL,YOUNG,BEFORE} \right) \right]
\end{aligned}$$
This is the difference between two DD estimators, and hence we call it the difference-in-difference-in-differences, or triple difference (DDD), estimator. The first DD measures the before vs. after change in outcomes for the elderly in the treatment state vs. the control state. The second DD measures the before vs. after change in outcomes for the young in the treatment state vs. the control state. Since we don’t expect to find any causal effect for the young, you can think of this second DD as a placebo DD. By taking the difference between the two DD estimators, the DDD estimator controls for two kinds of trend. The first DD should net out any trend in the health outcomes of the elderly that is common to both states. The latter (placebo) DD should capture the difference in trends between the treatment and control states. Subtracting this from the first DD removes any difference in state-level trends from the resulting DDD estimate.

1.3 Caveats and Practical Advice

1. We can never directly test the parallel trends assumption, but we can look for supporting evidence. Are there pre-existing trends in $y$ before the treatment is applied? If so, are those trends (approximately) parallel for the treatment and control groups?
It is good practice to plot $y$ over time for the treatment and control groups in the pre-treatment period, and see whether pre-treatment trends are (approximately) parallel. If they are not, it is unlikely that they would have been parallel in the post-treatment period under the counterfactual, which makes the parallel trends assumption dubious.

2. Policy changes are often bundled. We need to ensure that no other relevant policies are changing over the same time period, lest we confound the effect of the policy of interest with other changes.

3. DD is often conceptualized as two time periods: “before” and “after.” In reality, time is continuous and we often need to make a judgement about how to define these periods, i.e., how long before the policy change should the “before” period be, and how long after the change should the “after” period be. Sometimes policy changes are anticipated, and this might affect outcomes in the pre-treatment period. Similarly, policy changes sometimes take time to be implemented and to affect behavior and outcomes.

4. If the policy variation is at a different level of aggregation than our unit of observation, then we need to be careful about the calculation of standard errors. For example, we often have data on individuals but policies vary at the state level. In this case, we should cluster standard errors at the state level (or whatever level we have the policy variation). This works best when we have a large number of clusters.
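
As a small numerical illustration of the basic two-group, two-period setup from Section 1, the sketch below computes the DD from the four sample means and compares it with the coefficient on the interaction term from an OLS regression of y on a constant, D, T, and D×T (no covariates). Everything in it is an assumption made for illustration: the data are simulated, the coefficient values (including a "true" effect of 2.0) are arbitrary, and the function names are hypothetical. Because the regression is saturated in D and T, the two numbers should agree up to floating-point error.

```python
import numpy as np


def dd_from_means(y, d, t):
    """DD from the four group-by-period sample means (no covariates)."""
    def cell_mean(group, period):
        return y[(d == group) & (t == period)].mean()

    treat_change = cell_mean(1, 1) - cell_mean(1, 0)      # effect + trend
    control_change = cell_mean(0, 1) - cell_mean(0, 0)    # trend only
    return treat_change - control_change


def dd_from_regression(y, d, t):
    """OLS of y on [1, D, T, D*T]; returns the interaction coefficient."""
    X = np.column_stack([np.ones_like(y), d, t, d * t])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[3]


# Simulated data (illustrative values only, not from the notes).
rng = np.random.default_rng(0)
n = 2000
d = rng.integers(0, 2, size=n).astype(float)   # treatment-group dummy
t = rng.integers(0, 2, size=n).astype(float)   # "after" dummy
true_delta = 2.0                               # assumed causal effect
y = 1.0 + 0.5 * d + 1.5 * t + true_delta * d * t + rng.normal(size=n)

dd_means = dd_from_means(y, d, t)
dd_ols = dd_from_regression(y, d, t)
print(f"DD from sample means: {dd_means:.4f}")
print(f"DD from regression:   {dd_ols:.4f}")
```

The simulated outcome satisfies parallel trends by construction (both groups share the same coefficient on T), which is why the estimates land near the assumed effect. The sketch reports point estimates only; it does not compute standard errors, clustered or otherwise.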