§»1 Digitized by the Internet Archive in 2011 with funding from Boston Library Consortium Member Libraries http://www.archive.org/details/howmuchshouldwetOObert DEWEY Massachusetts Institute of Technology Department of Economics Working Paper Series HOW MUCH SHOULD WE TRUST DIFFERENCES-IN-DIFFERENCES ESTIMATES? Marianne Bertrand Esther Duflo Sendhil Mullainathan Working Paper 01 -34 October 2001 Room E52-251 50 Memorial Drive Cambridge, MA 02142 This paper can be downloaded without charge from the Social Science Research Network Paper Collection at http://papers.ssrn.com/abstract id=xxxxx s\ Massachusetts Institute of Technology Department of Economics Working Paper Series HOW MUCH SHOULD WE TRUST DIFFERENCESIN-DIFFERENCES ESTIMATES? Marianne Bertrand Esther Duflo Sendhil Mullainathan Working Paper 01 -34 October 2001 Room E52-251 50 Memorial Drive Cambridge, MA 02142 This paper can be downloaded without charge from the Social Science Research Network Paper Collection at http://papers.ssm.com/abstract id= xxxxx MASSACHUSETTS INSTITUTE OF TECHNOLOGY NOV 5 2001 LIBRARIES How Much Should We Trust Differences-in-Differences Estimates? Marianne Bertrand Esther Duflo Sendhil Mullainathan* This Version: October 2001 Abstract Most Difference-in-Difference correlated outcomes. Yet almost (DD) papers all that serial correlation introduces. This of interest in will DD estimation (e.g. rely on many years of data and focus on serially these papers ignore the bias in the estimated standard errors is especially troubling because the independent variable the passage of law) is itself very serially correlated, which exacerbate the bias in standard errors. To illustrate the severity of this issue, we randomly generate placebo laws in state-level data on female wages from the Current Population Survey. For each law, we use OLS error for this estimate. DD to compute the The standard DD estimate of estimation finds an "effect" significant at the Two its "effect" as well as the standard errors are severely biased: with about 20 years of data, 5% level for up to 45% very simple techniques can solve this problem for large sample of the placebo laws. sizes. The first technique and ignoring the time-series variation altogether; the second technique is to estimate standard errors while allowing for an arbitrary covariance structure between time periods. We also suggest a third technique, based on randomization inference testing methods, which works well irrespective of sample size. This technique uses the empirical distribution of estimated effects for placebo laws to form the test distribution. consists in collapsing the data NBER and CEPR; MIT, NBER and CEPR, MIT and thank Alberto Abadie, Daron Acemoglu, Joshua Angrist, Abhijit Banerjee, Victor Chernozhukov, Kei Hirano, Guido Imbens, Larry Katz, Jeffrey Kling, Kevin Lang, Steve Levitt, Kevin Murphy, Emmanuel Saez, Doug Staiger, Bob Topel and seminar participants at Harvard, MIT, University of Chicago GSB, University of California at Los Angeles, University of California Santa Barbara, and University of Texas at Austin for many helpful comments. Tobias Adrian, Shawn Cole, and Francesco Franzoni provided excellent research assistance. We are especially grateful to Khaled for motivating us to write this paper, e-mail: marianne.bertrand@gsb.uchicago.edu; eduflo@mit.edu; 'University of Chicago Graduate School of Business, NBER.. We mullain@mit.edu. 1 Introduction 1 Difference-in-Difference relationships. DD passage of law). (DD) estimation has become an it the difference in outcomes after and before the intervention for one might first isolate One would then compare changes unemployment duration in simplicity as well as arise potential to circumvent The DD DD estimation also has its for residents of states raising great appeal of many when making comparisons between heterogeneous Obviously, a its unemployment insurance states that have raised benefits to residents of states not raising benefits. its to estimate causal to this difference for unaffected groups. For example, to identify the incentive effects of social insurance, benefits. way estimation consists of identifying a specific intervention or treatment (often the One then compares groups affected by increasingly popular DD estimation comes from of the endogeneity problems that typically individuals. 1 drawbacks. Most of the debate around the validity of 2 estimate revolves around the possible endogeneity of the laws or interventions themselves. Sensitive to this concern, researchers have developed a set of informal techniques to gauge the 3 extent of the endogeneity problem. In this paper, DD estimation. We assume away biases we address an altogether different problem with in estimating the intervention's effect and instead focus on possible biases in estimating the standard error around this effect. DD estimates and standard errors for these estimates most often derive from using Ordinary Least Squares (OLS) in repeated cross-sections (or a panel) of data on individuals in treatment and control groups for several years before outcome of interest for individual i in and group whether the intervention has affected group regression using s (such as s at time ist 2 As 4 t. a state) at time One then t and Ts t be a dummy for typically estimates the following OLS: Y = A + B + cX where Formally, let Yi s t be the after a specific intervention. and Bt are fixed s t effects for the states lst +(3Tst + e ist and years and Xi S t represents the relevant individual See Meyer (1994) for an overview. See Besley and Case (1994). Another prominent concern has been whether behavioral parameter. See Heckman (1) (1996) and Blundell and MaCurdy DD (1999). estimation ever isolates a specific Abadie (2000) discusses how well control groups serve as a control. 3 Such techniques include the inclusion of pre-existing trends in states passing a law, testing for an "effect" of the law before it takes effect, or using information on political parties to instrument for passage of the law (Besley and Case 1994). 4 For simplicity of exposition, we will often refer to interventions as laws, groups as states and time periods as years in what follows. Of course this discussion generalizes to other types of DD estimates. The estimated impact controls. around that estimate are each state-year (or OLS s-t) cell. we argue In this paper, of the intervention is then the Our survey of DD that the estimation of equation 1 DD context. First, we papers, which very little and an DD DD in at DD model, has been largely serial correlation the treatment variable an especially Tst series. changes coming from the OLS estimation of equation no 5% itself 5% level in as much as 45% 1. performs on placebo laws, where state of the time. effect. variable (from the Current Population Survey) the DD random. Since these laws are rejection rates of the null hypothesis of We it estimation are typically highly positively serially we examine how percent level should be found only effect at well-understood, within a state over time. These three factors reinforce each other to create potentially assess the extent of this bias, 5% in practice subject to a possibly discuss below, finds an average of 16.5 periods. Second, the intrinsic aspect of the and year of passage are chosen the Standard errors estimation usually relies on fairly long time large mis- measurement in the standard errors To is is Three factors make estimation. most commonly used dependent variables correlated. Third, (3. 5 ignored by researchers using DD estimate standard errors after accounting for the correlation of shocks within severe serial correlation problem. While serial correlation important issue in the OLS fictitious, In fact, we a significant "effect" at find dramatically higher For example, using female wages as a dependent and covering 21 years of data, we of the simulations. find a significant 6 propose three different techniques to solve the serial correlation problem. 7 The techniques are very simple and work well for sufficiently large samples. First, first two one can remove the time-series dimension by aggregating the data into two periods: pre- and post-intervention. Second, one can allow for an arbitrary covariance structure over time within each state. Both of these solutions work well when the number of groups is large (e.g. 50 states) but fare poorly as 5 This correction accounts for the presence of a common random effect at the state-year cell level. For example, economic shocks may affect all individuals in a state on an annual basis (Moulton 1990; Donald and Lang 2001). Ignoring this grouped data problem can lead to an under-statement of the standard error. In most of what follows, we will assume that the researchers estimating equation 1 have already accounted for this problem, either by allowing for appropriate random group effects or, as we do, by collapsing the data to a higher level of aggregation, such as state-year 6 cells. Similar magnitudes arise in data manufactured to match the sure that the placebo laws are not by chance picking 7 up a CPS distributions Other techniques fare poorly. Simple parametric corrections which estimate fare poorly because even long time series (by fails specific processes (such as an AR(1)) standards) are too short to allow precise estimation of the auto- and to identify the right assumption about the auto-correlation process. because the number of groups (e.g. 50 states) is not large enough. correlation parameters block bootstrap DD and where we can be absolutely real intervention. On the other hand, number the of groups gets small. irrespective of sample size. We propose a third (and preferred) solution which works well This solution, based on the randomization inference tests used statistics literature, uses the distribution of in the estimated effects for placebo laws to form the test statistic. The remainder of this paper proceeds as follows. In section of the auto-correlation problem: section 2.1 reviews in biased 5, how DD assess the potential relevance DD papers performs on placebo laws. to assess account how it will result affects them. Section 4 describes possible solutions. discusses implications for the existing literature. We conclude in Section 6. Auto-correlation and Standard Errors 2 2.1 Review It will be useful to quickly review exactly Consider the a we failing to take it into standard errors, and section 2.2 surveys existing Section 3 examines Section why 2, OLS estimation of equation the vector of parameters. variance of the OLS Assume estimate is OLS 1 serial correlation and denote , which we regress on vt with e has E[e] = and T 0. O,. The true (2) = of (W) -1 (3) consider a simple uni-variate time-series case in let's vt follows an AR(1) In this special case, equations 2 and 3 can be simplified to: ~ (1 + 2 P ^^-j r v2 v v z^t=i 2_,t=i t = periods of data. Suppose that the error term u t follows an AR(1) with auto-correlation parameter A A = E[ee] 1 process with auto-correlation parameter p and that the independent variable var(a) estimation. is: compare these expressions, yt OLS the vector of independent variables and = ^(V'VyWnViV'V)- estimate of the variance easily poses a problem for given by: est var(d) To more V that the error term var(d) while the why + 2P T T v~ 2 l-.t=i t and est var(d) = % t +-+ 2P ^^2) v 2-,t=i t As T— > These formulas make transparent three OLS ~P oo, the ratio of estimated to true variance equals known well A . about how facts serial correlation biases estimates of standard errors. First, positive serial correlation in the error term (p > 0) will cause an under-statement of the standard error while negative serial correlation will cause an overstatement. each how This new year is of data than serially correlated the = serially correlated (A important in the positive serial correlation intuitive: DD 0), OLS is no bias is less information in Second, the magnitude of the bias also depends on assumes. independent variables there means that there is. In fact, when the independent variable is not estimated standard errors. This second point in the context, where the intervention variable Tst is is in fact very serially correlated. Indeed, for affected states, the intervention variable typically equals period after period until one year where it turns to passed by state s 1 and then stays by time t" varies little at In other words, the variable "whether a law was 1. over time within a state, thereby exacerbating any serial correlation in the dependent variable. Finally, the magnitude of the problem depends on the length of the time series (T). All else held constant, as T increases, the bias in the OLS estimates of the standard errors worsens. A 2.2 Survey of DD Papers This quick review suggests the relevance of a depends on three the most serial correlation factors: (1) the typical length of the commonly used dependent variables; and problem time series used; (3) for existing DD papers (2) the serial correlation of whether any procedures are use to correct for serial correlation. Since these factors are inherently empirical, identified all DD we collected data on published DD papers. We papers published in 6 journals between 1990 and 2000: the American Economic Review, the Industrial and Labor Relations Review, the Journal of Political Economy, the Journal of Public Economics, the Journal of Labor Economics, and the Quarterly Journal of Economics. classified a paper as "DD" interventions. For example, as a DD if it met two following we would not paper (even though it criteria. classify a might suffer from paper must focus on specific paper that regressed wages on unemployment serial correlation issues as well). paper must use units unaffected by the law as a control. 5 First, the We We Second, the found 92 such papers. For each of these papers, we determined the number of time periods in the study, the nature of the and the technique(s) used to estimate standard variable, Table 1 summarizes the results of Sixty-nine of the 92 DD We this exercise. dependent errors. start with the lengths of the time series. papers used more than two periods of data. Four of these papers began with more than two periods but collapsed the data into two effective periods: before and Table 1 8 reports the distribution of time periods for the remaining 65 papers. of periods used of data. As we is 16.5 and the median More than 75% is 11. will see in the simulations below, lengths after. The average number of the papers use more than 5 periods such as these are more than enough to cause serious under-estimation of the standard errors. The most commonly used variables in DD estimation are employment and wages. 10 Eighteen papers study employment and thirteen study wages. Other labor market variables, such as ment and unemployment also receive significant attention, as To variables are clearly highly auto-correlated. cite do health outcomes. Most of these an example, Blanchard and Katz (1992) their survey of regional fluctuations find strong persistence in shocks to state and unemployment. It is retire- interesting to note that first-differenced variables, in employment, wages which might have a tendency to exhibit negative auto-correlation (and thereby over-state standard errors) are quite uncommon. In short, the bulk of DD papers focus on outcomes which are likely positively serially correlated. How do these 65 papers correct address the problem at all. Only for serial correlation? five of adjusting standard errors. state, The one of the solutions we suggest Two fifth When A 11 Of these five, does very four use a little in the allows for arbitrary variance-covariance matrix within in Section 4. additional points are worth noting. First, 80 of the original 92 variation. 9 it. will see later, this correction problem with grouped error terms as the unit of observation 8 vast majority of papers do not papers explicitly deal with parametric AR(k) GLS-based correction. As we way The Only 36 of these papers address is more this problem, either DD papers have a potential detailed than the level of the by clustering standard errors or we only recorded the shortest span. whatever unit of time is used. The very long time series in the data, such as the 51 or 83 at the and 99 percentile arise because several papers use monthly or quarterly data. 95 10 When a paper studies more than one variable, we record all the variables used. 11 For example, the effect of state level laws is studied using individual level data. a used several data sets with different time spans, period here is by aggregating the data. Second, several informal techniques are used endogeneity of the intervention variable. variable in equation For example, three papers include a lagged dependent seven include a trend specifically for treated states, 15 plot graphs of some 1, form to examine the dynamics of the treatment the law, two see (DDD) by the effect if is finding another control group. summary, our review suggests that DD existing DD it does not a specific data We Group old. tell set, us of the effect before triple-differences these informal techniques in Section 5. an important problem for serial correlation is likely CPS how DD many papers are likely to report under-estimated standard serious the problem women of residence. is in practice. for in their fourth interview the years 1979 to 1999. We extract information on weekly earnings, The sample contains Of the 900,000 women earnings). how return to the issue on an To assess magnitudes, we turn to a sample of women's wages from the Current Population Survey (CPS). 12 extract data on We is Estimation While the survey above shows that most errors, examine whether there papers and that this problem has been poorly addressed to date. Over-Rejection in 3 We problem effect, 3 and eleven formally attempt to do persistent, interact with the serial correlation In with the possible for dealing month focus on women between all employment Merged Outgoing Rotation in the 25 and 50 years status, education, age, We nearly 900,000 observations. define wage and state as log(weekly in the original sample, approximately 300,000 report strictly positive weekly earnings. This generates (50*21=1050) state-year cells with each cell containing on average a little less than 300 The correlogram of the women wage with positive weekly earnings. residuals is informative. We estimate first, second and third auto- correlation coefficients of the residuals from a regression of the logarithm of wages on state dummies (the relevant residuals since DD includes these dummies). The and year auto-correlation coefficients are obtained by a simple regression of the residuals on the corresponding lagged residuals. therefore imposing common auto-correlation 0.51, 2 The CPS is is and auto-correlation parameters for is strongly significant. one of the most commonly used data sets all states. The second and in the DD literature. The estimated We first are order third order auto-correlation are high as well .44 ( and the residual was following an AR(1) process. less rapidly than we would expect 13 quantify the bias induced by serial correlation in the which some affect states and not others. between 1985 and 1995. 14 Second, we them an affected state We Ts We DD draw first we randomly generate context, random from a uniform at t is then defined as a dummy after the intervention date, can then estimate DD variable which equals and 1 for all DD OLS on (equation 1) using If DD of 1.96 for the t-statistic. This exercise II error, or CPS tells no effect ((3 = 0) exactly 5% of the time The us about Type I error. Note that a small variant power. After constructing the placebo intervention, data by the outcome plus point (approximately inter- that live in The estimation To understand how new laws we would expect that when we use a threshold 15 T st times whichever effect Ts t, 2%) effect of the intervention. 16 new laws randomly drawn each time), we can By assess also allows us to assess to simulate. For example, * .02 to often in we generate a true .02 log repeatedly estimating how Type we can replace the outcome we wish can replace log(weekly earnings) by log(weekly earnings) plus Tst (with women these placebo laws. provides an appropriate estimate for the standard error, reject the null hypothesis of the effect). performs we can repeat this exercise a large number of times, each time drawing random. we distribution otherwise. generates an estimate of the laws' "effect" and a corresponding standard error. well laws, random and designate select exactly half the states (25) at by the law (even though the law does not actually have an as "affected" vention variable at if Placebo Interventions 3.1 To and decline much .33 respectively), DD finds an DD effect in this data when there 13 Solon (1984) points out that in panel data, when the number of time periods is fixed, the estimates of the autoOLS regression are biased. Using Solon's generalization of Nickell's (1981) formula for the bias, the first order auto-correlation coefficient of 0.51 we estimate in the wage data, with 21 time periods, would correspond to a true auto-correlation coefficient of 0.6 if the data generating process were correlation coefficients obtained using a simple an AR(1). However, Solon's formulas also imply that the second and third order auto-correlation coefficients would be much smaller than the coefficients we observe if the true data generating process were an AR(1) process with an auto-correlation coefficient of 0.6. To match the estimated second and third order auto-correlation parameters, the data would have to follow and AR(1) process with an auto-correlation coefficient of 0.8. We choose to limit the intervention date to the 1985-1995 period to ensure having enough observations prior and post intervention. I5 One might argue that the rejection rate could be higher than 5% we might accidentally be capturing some real on that data manufactured to track the variance structure of the CPS also produces rejection rates similar to those in the CPS. The 2% effect was chosen so that there is sufficient power in the data sets we study. interventions with the randomization procedure. However, 8 we will as show later actually is 17 one. Basic Rejection Rates 3.2 The row of Table 2 presents the first We without any correction for grouped error terms. dummies (less We was greater than first 18 column of row 2% is shows the results of 1 no true 67.5% effect no null hypothesis of One important effect of of this (we added .02 effect in for row performs a * Tst DD is micro CPS data. over-rejecting by a factor of similar exercise but on the to the data). 85.5% of the We CPS find that, in this case, data altered we reject the cases. (1990) general arguments to all cell; it DD inference. is diagonal while in practice problem, only 36 correct To properly account for state-year cell in equation for cell. OLS estimation above assumes that the variance might be block diagonal, with a constant As noted earlier, while 80 papers suffer from it. such shocks, one can assume that there is a random effect for each 1: Y lst will and year it The does not allow for aggregate year-to-year the observations within a state. In other words, correlation coefficient within each state we this exercise in the unaltered reason for this gross over-rejection has been described by Donald and Lang and matrix for the error term this case, above control variables Xi st include 4 education of the time. Thus, in this setup, who apply Moulton's shocks that affect n In as described the placebo laws, we find that the null of no effect does not account for correlation within state-year this micro data, level. The second column to contain a (2001), 5% at the rejected a stunning thirteen. CPS the fraction of simulations where the null hypothesis of no intervention i.e. Here, even though there is the report the fraction of simulations in which the absolute value of the t-statistic 1.96, was rejected The 1 in than high school, high school, some college and college and more) and a quartic in age as controls. effect re-estimate equation The 200 independent draws of placebo laws. at least when performed result of this exercise = A s + B + cX it t count only correct rejections, i.e. +P T + Vst + 3t the number of (4) tit DD estimates that are significant and positive. 18 The average of the estimated coefficients coefficients are unbiased. (3 was 0. Thus while OLS overestimates standard errors, the estimated where ust are group random group year by the random level introduced random In row 3, Row effect. 2 reports rejection rates We continue to find a we take a more 44% rejection rate. To and this aggregate data, we term diagonal. The the problem, since Row = a s +>y + /3Tst + e t cells it no effect. were the only reason for over-rejection, aggregation for the error effect is almost as high here as when use the micro data and In about 44% of the simulations, we reject the null 20 These magnitudes suggest that errors generates a dramatic bias. is (5) st would make the variance-covariance matrix correct the standard errors for clustering. context then compute means of these 3 displays the results of multiple estimations of equation 5 for placebo laws. rejection rate of the null of hypothesis of no regress log weekly refers to the aggregation. the correlation within state-year fully solve first estimate: where the bar above the variables ought to We we aggregate the data into year. This leaves us with 50 times 21 state-year cells in the aggregate data. Yst If We aggregate, earnings on the controls (education and age) and form residuals. Using allow for state- 19 drastic approach to solve this problem. state-year cells to construct a panel of states over time. residuals by state when we using the White correction (1984) which allows for an arbitrary intra-group effects correlation matrix. error can be corrected for the correlation at the The standard effects. failing to account for As we saw in serial correlation equation the serial correlation of the intervention variable 2.1, Ts t when computing standard one important factor itself. In fact, in the we would expect DD the un-corrected estimates of the standard errors for the intervention variable to be consistent in any variation of the this point, DD model where the intervention variable we construct a is not serially correlated. different type of intervention variable. half of the states to form the treatment group. As before, To illustrate we randomly select However, instead of randomly choosing one date 19 Practically, this is usually implemented by the "cluster" command in STATA. We also applied the correction procedure suggested in Moulton (1990). That procedure allows for a random effect for each group, which puts structure on the intra-cluster correlation matrices and therefore may perform better in finite samples. This is especially true when the number of clusters is small (if in fact the assumption of a constant correlation is a good approximation). The rate of rejection of the null hypothesis of no effect was not statistically different under the Moulton technique, possibly reflecting the fact that the 20 One might worry ticity in the data. number of clusters However, the results in large in this application. is deals with the clustering problem, introduces heteroskedasTable 2 do not change if we use standard heteroskedasticity-correction that the aggregation process, while it techniques. 10 after which all The law dates between 1979 and 1999. is now defined as intervention variable its is now repeatedly turned on and off, with its otherwise. In other words, the value yesterday telling us nothing Ts value today, and thereby eliminating the serial correlation in that the null of no effect is rejected only 6% select 10 the observation relates to a state that 1 if belongs to the treatment group at one of the 10 intervention dates, about we randomly the states in the treatment group are affected by the law, of the time in this case. t- In row Removing the we 4, see now serial correlation in the law removes the over-rejection. This strongly suggests that the bias in the standard errors due to than other properties of the error terms. is serial correlation, rather In rows 5 and states. When 6, we examine how we modify rejection rates vary as 12 states or 36 states are affected, the rejection rates are The placebo way laws so far have been constructed in such a affects all treated states in the same In row year. of passage can differ across treated states. We 7, we the number 39% and 43% of affected respectively. that the intervention variable create placebo laws such that the date randomly choose half of the form the states to treatment group but now randomly choose a passage date separately for each state (uniformly drawn between 1985 and 1995) if with the null of no One might still effect Or perhaps teristics of otherwise. The rejection rates continue to be high in being rejected about 35% of the time. t, and worry at this point that factors other than large rejection rates. Perhaps changes). Ts by time state s has passed the law this case, The in the treatment group. we we other features of the wage replicate our analysis in whose variance structure in variable defined to equal is still serial correlation give rise to these data, such as state specific trends or other characrise to the over-rejection. manufactured data. To directly address all Specifically, we generate data terms of relative contribution of state and year fixed the empirical variance decomposition in the CPS. The data AR(1) process with an auto-correlation parameter p. By is effects .8, we construction, find a rejection rate roughly equal to 11 matches normally distributed and follows an we can therefore be sure that there are no ambient trends and that the laws truly have no effect. Yet, in row assume that p equals 1 are by chance detecting actual laws (or other relatively discrete the distribution of the error term, give of these problems, t what we found 8, in the where we CPS, 37%. Magnitude of 3.3 Effects These excess rejection rates are stunning, but they might not be particularly problematic estimated "effects" of the placebo laws are economically insignificant. of the statistically significant intervention effects in Table We Using the aggregate data, we perform 3. 3 reports the empirical distribution of the estimated effects J3, when 2. Table these effects are significant. surprisingly, this distribution appears quite symmetric: half of the false rejections are negative Not and half are positive. On roughly to a 2 percent average, the absolute value of the effects effect. Nearly 10% range, and the remaining 60% fall in the 1 elasticities. 21 % to 2 is roughly range. .02, which corresponds About 30% fall DD estimates are often represented as For example, suppose that the law under study corresponds to be a in the child-care subsidy. Similarly, in An DD many meaning a measured 2% 5% increase increase in log earnings of .02 would correspond to an elasticity of estimates, the affected group effect in the 2 to are larger than 3%. These magnitudes are especially large considering that .4. the examine the magnitudes 200 independent simulations of equation 5 for placebo laws defined such as in row 3 of Table 3% if on the is often only a fraction of the sample, sample would indicate a much larger full effect on the treated subsample. 22 To summarize, our tions 1 findings suggest that the standard errors from the grossly understate the true standard errors, even the micro data. In other words, reporting simple accounting for serial correlation will generate DD many if one accounts OLS for estimation of equa- group data effects in estimates and their standard errors without spurious results. 23 Varying Serial Correlation 3.4 How does over-rejection vary with the question in two different ways. First, The serial correlation in the we use dependent variable? other outcome variables in the CPS We address this as left-hand side DD estimates are normalized using the magnitude of the change in the policy variable of interest. when studying the effects of changes unemployment insurance benefits on job search, one would examine the full sample of the unemployed, not only those who actually took up the program. This use of all people 22 For example, eligible for a program rather than all recipients is often motivated by an attempt to avoid the endogeneity caused by selective take-up. DD Mapping these findings to the existing literature on requires more care. Researchers often use informal techniques along with the mechanical procedure we have described. For example, to check for endogeneity, they will test whether a law appears to have an "effect" before it was passed. We discuss whether these informal techniques DD interact with the over-rejection rates in Section 5. 12 Second, we experiment with various auto-correlation parameters in the manufactured variables. data. The first part of Table 4 estimates equation 5 on employment rate, average weekly working hours and change in log wages, as well as the original log weekly earnings. As before, the table displays the rejection rate of the null hypothesis of random draws effect of of the intervention variable, both in the intervention. We no 200 independent simulations with effect in raw data and in also report in Table 4 estimates of the As we auto-correlation coefficients for each of these variables. see, data altered to create a first, second and third Order the false rejection problem diminishes with the serial correlation in the dependent variable. As expected, the first-order auto-correlation is negative (as it is 2% when the estimate the case for change in log wages), we of find that the conventional standard errors tend to underestimate the precision of the estimated treatment effect. The second The part of Table 4 uses manufactured data. error structure is designed to follow an AR(1) process. As before, the manufactured data has the same number of states and years as the CPS data and is constructed so that the relative importance of state effects, year effects and the error term in the total variance matches the variance composition of the wage data in the CPS. Not surprisingly, the false rejection rates increase with the auto-correlation There process. moderate levels, is exactly the right rejection rate such as a p of should be. As noted OLS is formula is close to negative, there is or .4, there is no auto-correlation. But even at rejection rates are already two to four times as large as they with an AR(1) parameter of what we observe in the CPS 0.8, data. the rejection rate using the standard And again, when the auto-correlation under-rejection. Varying the 3.5 The earlier, .2 when parameter in the AR(1) stylized exercise Number of States and Time Periods above focused data with 51 states and 21 time periods. Many DD papers use fewer states (or treated and control units), either because of data limitations or a desire to focus only on comparable controls. Table 5, before, we examine how the we use the CPS DD papers use fewer time periods. In over-rejection rate varies with these two important parameters. As For similar reasons, several as well as manufactured data to analyze these effects. 13 We also examine rejection rates Rows rejection. falls 1-4 24 when we have added a treatment and 10-13 show that varying the number of states does not change the extent of over- Rows as the time 5-9 and 14-18 vary the number of years. As expected, the extent of over-rejection span gets shorter, but it does so at a surprisingly slow rate. For example, even with only 7 years of data, the over-rejection rate of the DD 4 nearly 50% 8% (CPS in the data) and in the many CPS, three times too With periods. 17% (manufactured data). large. Around 70% 5 years of data, the rejection When T=50, the rejection rate manufactured data. Solutions Parametric Methods 4.1 A 16% is papers in our survey use at least that rate varies between rises to the data. effect to first natural solution to the serial correlation problem would be to specify the auto-correlation structure for the error term, estimate errors. We Row its parameters, and use equation (2) to estimate true standard implement several variations of this basic correction method Table in 2 performs the simplest of these parametric corrections, wherein an estimated in the data. 25 rejection rate is still This technique does The 34.5%. As correlation coefficient. is auto-correlation parameter is failure here biased downwards. the auto-correlation parameter .8 in the in part is is CPS .8, data (row .62 is 3), is the due to the under-estimation of the auto- In the Similarly, in the a p of AR(1) process to solve the serial correlation problem: OLS well understood, with short time-series the auto-correlation coefficient of only 0.4. autocorrelation of little 6. CPS data, OLS estimation of the estimates a first-order manufactured data where we know that estimated (row 7). the rejection rate goes If we impose a down first-order to 12.5%, a clear but only partial improvement. In row 24 8, we establish an upper-bound on the power of any potential correction, by examining CPS data, we vary the number of states by randomly selecting some states and discarding others. In both we continue to treat exactly half the states. 25 Computationally, we first estimate the first order auto-correlation coefficient of the residual by regressing the In the types of data, residual on its lag, the residual. are the and then uses The matrix same whether is or not this estimated coefficient to form an estimate of the variance-covariance matrix of block diagonal, with a matrix of the form we assume each state has its own 14 Q (in equation 2) in auto-correlation parameter. each block. The results how well the true variance-covariance matrix does. In other words, matrix implied by an AR(1) process with p of rejection rate there a is 2% is now effect, is now As to estimate standard errors. 5% when indistinguishable from the rejection rate .8 we use the variance-covariance there 32.3%. This is no effect. More expected, the interestingly, be a useful benchmark will when for other corrections. Another problem with this the auto-correlation process. CPS. In rows 9 and 10, parametric correction As noted = .35. 9, that an AR(1) does not we use the manufactured data the autocorrelation process. In row and p2 earlier, may be small sample bias given in Solon (1984). the correlogram of wages in fit to an These parameters were chosen because they match in the correctly specified to see the effect of such mis-specification of we generate data according and third auto-correlation parameters we have not AR(2) process with well the estimated wage data when we apply the formulas We then correct term follows an AR(1) process. The rejection rate p\ first, = .55 second to correct for the standard error assuming that the error rises significantly with this mis-specification of the auto-correlation structure (30.5%). In row 10, the CPS we use a process that provides an even better match of the time-series properties of data: the sum of an variance of the white noise is AR(1) (with auto-correlation parameter 13 % of the total variance of the residual). the auto-correlation in this data by fitting an AR(1), close to what we found for the CPS However, attempting to correct not look we like a plausible option: it data in row for is correct the standard errors in the we 26 26 reject the null in (the When trying to correct about 39% of the case, 2. the auto-correlation by specifying different processes does clearly difficult to find the right process. CPS data by imposing the white noise processes that we have seen match fairly well the high. and a white noise 0.95) CPS specific data. In rows 4 and AR(2) and AR(1) The 5, plus rejection rates remain 27 Note that an AR(1) plus white noise seems a priori a very reasonable process for the CPS data. Indeed, even AR(1) in the population, the repeated cross-section in the CPS implies that a non-persistent sampling error is added to the error term each year. "Similar results hold if we estimate the parameters of the processes, as in row 2, rather than impose them. if the wage follows an 15 Block Bootstrap 4.2 An Block bootstrap 1994). shirani, structure by keeping We first state. then draw this residual and compute the each state a for block bootstrap (Efron and Tib- new We implement block bootstrap as residuals. This gives a vector of residuals €j follows. for each residual vector from this distribution (with replacement). back to the original predicted value gives us a new outcome variable Y^ then estimate equation we can form 1 is a variant of bootstrap which maintains the auto-correlation is the residuals of a state together. all estimate equation We Adding ei, method with which we experiment alternative correction 1 for different this Y^ and By outcome. . We repeatedly sampling from the residual distribution of repeat this exercise to estimate a sequence of /?&. The distribution of these parameters then gives us a test statistic. The results The of the block bootstrap estimation are reported in Table rejection rates are well (row 3). still high: 35% The problem appears in the CPS data (row to be the small number 2) 29% 7. They in the are not encouraging. manufactured data as of blocks or states. In row when we 4, allow for 400 states (and 200 states passing the law), block bootstrap delivers a close to correct rejection rate. Since very few applications in practice can rely on that many groups, block bootstrap does not appear to be a realistic solution to the serial correlation problem. Empirical Variance-Covariance Matrix 4.3 Both the most parametric and the most non-parametric methods seem to fail because of lack of data: there are not enough time periods to estimate the time series process and perform a parametric correction, not make and not enough states use of the fact that for block bootstrap. However, the techniques we have a large auto-correlation process in a flexible way. is the same in In this case, all states. if number Specifically, the data is years, the variance-covariance matrix of the error of size T by the element T (where (i,i + j) T is is the number of states that above did can be used to estimate the sorted by states and (by decreasing order of) term ej tried suppose that the auto-correlation process is of time periods). the correlation between we and block diagonal, with 50 identical blocks Each ti-j- of these block We is symmetric, and can therefore use the variation across the 50 states to estimate each element of this matrix, and use this estimated matrix to 16 compute standard T of errors from equation This 2. equivalent to treating the problem as a system is seemingly unrelated equations estimated jointly for each state) in each year. (in 200 simulations) significant We 7.75% is implement in the CPS (1 for this technique in (row and 10% 2) each year), with 50 data points (one Table 8. in the The rejection rate manufactured data (row 9), of states drops, the rejection rates increase (rows method does not 6 and 8). 4, A all states. see that the rejection rate is if when a true 2% is number that this the data generating process Finally, the technique has low power. In only 8.5% First, as the second obvious issue deliver consistent estimates of the standard error not the same across effect is associated column 2 of row we 2, with the laws. Arbitrary Variance- Covariance Matrix 4.4 This procedure can be generalized to an estimator of the variance covariance matrix which consistent in the presence of any correlation pattern within states over time. consistently estimate each element of the matrix like is a improvement over the two previous methods. This correction method however has important limitations in practice. is we obtain formula to compute the standard errors. 28 Q, in this case, Of course, is we cannot but we can use a generalized White- This estimator for the variance-covariance matrix given by: V = (X'X)-'(f^u'J uA(X'X)where n c is the total number of states, X is 1 matrix of independent variables and Uj is defined for each state to be: T uj = Yl e^ x n t=i where the summation state) and Xj t is is over all is consistent as the 28 This results for this estimation procedure are 29 is the residual at time number of shown arbitrary variance-covariance matrix does quite well. allow for ejt is t (in that particular 29 This estimator of a row vector of dependent variables (including the constant). the variance-covariance matrix The elements in the state, The in states tends to infinity. Table 9. Despite its generality, rejection in the unaltered CPS the data is analogous to applying the Newey-West correction (Newey and West 1987) in the panel context where we lags to potentially be important. implemented in a straightforward way by using the cluster command in STATA and choosing entire states all This is (and not only state-year cells) as clusters. 17 6% (row 2, column found in Table 8 (row relative to the Moreover, the procedure has relatively high power compared to what we 1). 2, column We upper-bound. 2% rejection rate in the case of a and 32% in AR(1) data with p = we can In the manufactured data, 2). saw in Tables 4 6, that effect was 78% The .8. 74% and 27.5% when the number of treated units Time Ignoring 4.5 8.5% with 10 states, it does comes nearly these respectively. 5% when the number of groups This method, therefore, seems to work well, states). large enough. Series Information Another possible solution to the component is well manufactured data with no auto-correlation in Again, however, rejection rates increase significantly above (15% with 6 how with the correct covariance matrix, the arbitrary variance-covariance matrix upper-bounds, achieving rejection rates of declines also see in the estimation serial correlation problem is to simply ignore the time series and when computing the standard errors. simply average the data before and after the law and run equation variable as a panel of length 2. The rates are approximately correct 1 30 To do on this, this averaged outcome The rejection results of this exercise are reported in Table 10. now Moreover, the power of .295 in row 2 at 6%. one could is quite high relative to other techniques. Taken states at the longer the slightly however, this solution literally, same time. same and If states work only will for laws that are passed for the treated all pass the law at different times, "before" and "after" are no are not even defined for states that never pass the law. modify the technique in the following way. First, one regress Y st One can however on state fixed effects, dummies, and any relevant covariates. One then divides the residuals of the treatment into two groups: residuals from years before the estimate and the standard error dummy. This procedure does all passed at the same time. The downside when 30 the One number could still come from an OLS It is and residuals from years also does well small. when row states, residual aggregation use time-series data for specification checks. 18 after the law. The 2) for laws that are the laws are staggered over time (row 4). raw and residual aggregation) With 20 states only regression of this two-period panel on an after as well as the simple aggregation (row 3 vs. of these procedures (both of states law, year is that they do poorly has a rejection rate of 9.5%, nearly twice too large. With only 6 states, the rejection rate is as high as 31.5%. 31 extent of over-rejection in fact appears to increase faster in this case than in the clustering The method presented above. Randomization Inference 4.6 We have so two procedures that appear to do far isolated sufficiently large. In this section, the sample size. The we DD propose to compute when the number of states is present an estimation technique that does well irrespective of principle behind the test carried out throughout this paper. fairly well simple, is and motivated by the simulation To compute the standard number estimates for a large of exercises error for a specific experiment, randomly generated placebo laws and we to use the empirical distribution of the estimated effects for these placebo laws to form significance test for the true law. This test is 32 closely related to the randomization inference test (or Fisher's exact test), discussed in the statistical literature (see Rosenbaum Before laying out the test in more details, of the way randomization individuals, half of whom (1996) for an overview of this test it have received a to Ti. we have outcome data applications). illustration (such as sick days) on Define Yi to be the outcome and T, be the flu shot. i. Suppose that the we make the hypothesis that the = ai effect is constant: Yi Now, its be useful to give a simple, motivating inference works. Suppose indicator for vaccination, both for individual treatment will and let us assume that the To estimate the flu shot was administered to a effect of the shot, we could take the those receiving the shot and those not receiving hypotheses about this parameter? + 9Ti We it. random group, difference in so that a^ OLS orthogonal mean outcomes between Call this estimator 9(Ti, Yi). would use the is How can we test estimate of the standard error but this estimates rests on strong assumptions, such as homoskedasticity, normal distribution of the error and independence between observations. Instead, to 31 test whether the coefficient is different from 0, (as small as 12), the normal approximation to the t-distribution will clearly not work. not causing the over-rejection since for small degrees of freedom, a threshold of 1.96 should not produce this high of a rejection rate. For example, for 6 degrees of freedom, a 1.96 threshold should only produce roughly a 10% rejection rate. Donald and Lang (2001) discuss inference in small-sample aggregated data sets. 32 The code needed to perform this test is available from the authors upon request. But At these small samples this alone is 19 one can use the fact that under the statistically the One same. null of no people in the treatment and control groups are effect, can, therefore, generate a set of placebo interventions Ti and estimate the "effect" of these placebos. Repeated estimation will give us a distribution of effects for such placebos. We can then observe where the original estimate 9 hypothesis that the coefficient is 0. For example, to form a two-tailed test for whether the from at the cut-off value. 5% level, The we would use intuition behind random, we can generate other in the data we truly this test is way: make we want if (i.e 9 is significantly different flu shot is Y) as our assumed to be random, zero-effect shots and see what estimates these produce not indexed by to test whether the estimate is We itself). making no assumptions about It need not be homoskedastic, on any large sample Under the assumption that the treatment we could z), are this test does not rest therefore valid for any sample size. constant across unit Since the quite simple. normal or even independent across individuals. Further, effect is is the 2.5% and 97.5% values in the distribution 9(Ti, the error term (except for the randomness of the flu shot it is estimate initial have. Notice the advantages of this procedure. approximation: distribution to test the lies in this different from test other 9q, we the treatment and the control comparable under the null, and hypotheses in the same form Yio first = ai — 9oTiq, to we apply the same technique 33 to the transformed data. The statistical inference test To form the estimate we propose of the law's effect, DD model follows naturally from this example. we estimate by OLS the usual aggregate equation: Yst = We for the as + j t +pT st +€ St then generate a placebo law that affect 25 randomly chosen states in a random year, Ps t and estimate using OLS: Y = a s + 7t + 7 P + e^ st Repeating this procedure many times for st random placebo laws produces a G(.) to be this distribution and 7 to be the the hypothesis that our estimate lies in 33 the G(.) distribution. One can use the is random variable statistically different drawn from from 0, entire confidence intervals. 20 this distribution. we simply need For example, to form a two-tailed test of same technique to form distribution of 7. Define level p, To test to ask where $ we would simply 7 at the identify the \ lower the estimated coefficient to 0, otherwise we accept term. Instead, we have we In Table 11, lies and upper of the distribution and use these values as cutoffs: If outside these two cutoff value, it. As before, note that relied solely assess tail how we reject the hypothesis that a 5% Pst 400 draws of We of no We cutoff. known T st . CPS data as well Again, we randomly generate auto-correlation structure. perform 200 independent draws of Tst we construct the test statistic , For each of these draws, we perform to construct the distribution G(j). see that the basic randomization inference procedure leads us to reject the null hypothesis effect in 6% of the simulations (row 2, powerful, leading to rejecting in before, in the 23% column 1). Notice that the procedure also seems fairly when of the cases manufactured data, we can compare its there is an effect (row 2, was 78% performance to the upper-bound. manufactured data with no auto-correlation and 32% in performance of randomization inference, There 30% and an additional assumption behind is 84 % this test, is know we know will not 1988, and we its effect, between 1971 and 2000, or something The These AR(1) data with p 2% = effect The .8. (e.g. in the flu shot example, the lie in it lies 5, a different If a particular law took place in should we assume when generating the placebo laws that else? Rows it could have been passed at any random time 3 to 5 consider the effect of using different (and simpler) assumptions about the law generating process. window. In row saw namely the requirement that we know the the true process by which laws are generated. the law could only have been passed in 1988, that assume that We that half of the sample was randomly selected to get the treatment). In practice, are estimating distribution to As comparable to these upper-bounds. exact statistical process determining which units get treated statistician in 2). column in Tables 4 6, that with the correct covariance matrix, the rejection rate in the case of a 34 error on the random assignment of the laws. 34 intervention variables. For each randomly generated intervention for equal we have not used any information about the well this procedure performs in the aggregated as in manufactured data with a it is window around the Each row allows the laws actual law date. In row 3 for example, in a 5 year window centered on the actual date. In row we force the placebo laws to occur in the same year for the test 4, we assume a we 3 year as the actual laws; in other theoretical justification for this test rests on the proof of the validity of using the randomization distribution. assumption of randomization of the treatment to map out the distribution of the test tests use the strong statistic. random The only difference in our case is that state laws are not conditional on the state fixed effects and year dummies. 21 random unconditionally, but instead that they are words, we permute only which of type and type I II errors, all states were affected by the As can be laws. seen, both in terms 35 the procedures produce very similar results. repeat the exercise using a progressively smaller number of states. Even with The next few rows as few as 6 states, the procedure produces exactly the right rejection rates. To summarize, Table 11 makes clear that, of all the techniques it we have tion inference performs the best. It removes the over-rejection problem of sample size. Moreover, it and does so independently appears to have power comparable to that of the other randomization inference testing literature. considered, randomiza- is well known in statistics, it is tests. Although largely ignored in the econometrics Difference in differences estimation, which deals with small effective sample size, and complicated error distribution, seems a particularly fertile ground for the application of this testing technique. Implications for Existing Papers 5 This paper has highlighted an important problem with tions to deal with it. What DD estimation and proposed several solu- are the implications for the existing stock of papers which use the estimation technique but do not explicitly correct for serial correlation? most of these papers use long time possibility, however, is that some and variables which are series We DD have already seen that likely quite auto-correlated. One of the informal techniques used in these papers might indirectly help alleviate the auto-correlation problem. As we noted earlier, researchers have developed a formed in conjunction with the estimation of equations endogeneity of the interventions, something that random 35 OLS or 5. These tests are meant to assess the It is possible that these tests may inciden- problem. In Table 12, we report rejection rates when we perform estimation of equations may 1 not a problem in our setup, as we construct interventions that are by definition exogenous. tally lessen the auto-correlation the is set of diagnostic tests that are often per- 1 and 5 in combination with several commonly used diagnostic how many states were affected. If we see 25 of 50 states affected, was chance of being affected or was it one in which exactly 25 states will be affected? Simulations which vary this additional element show that the results are not affected by which selection process is assumed (even if it differs from the actual process). It is worth noting that while these results tell us that in simulations, using incorrect approximations do not make a difference, this may not be a general theoretical result. Similar problems arise in determining the process one in which each state had a 50% iid 22 techniques. 36 We concentrate on 3 data sets: micro term by state-year follows The aggregate cell), CPS wage data, CPS wage and manufactured data where the an AR(1) process with an autocorrelation parameter of first variable but also include a at then reject the null hypothesis of no on the pre-law dummy is dummy only effect if error term 0.8. We diagnostic test looks for pre-existing "effects" of the law. DD with the T coefficient data (with clustering of the error re-estimate the original for "this state will pass a law next year" the coefficient on the law is either insignificant or opposite signed. significant . We and the This diagnostic test implies only a small reduction in the rejection rates. The second no effect if diagnostic test examines persistence of the effect. the estimated coefficient is statistically significant We under reject the null hypothesis of OLS and the effect persists three years after the intervention date. Again, this does not lead to a significant reduction in the rejection rate in The any of the data sets. third test looks for pre-existing trends in the treatment sample. the pre-period and estimate whether there between control and treatment states). coefficient is statistically significant and is there this test, is no we reject the null of this test but reject the null effect if effect. The the OLS treatment rejection rates are lower remain above 20%. Finally, in the last test we no statistically significant pre-existing trend or a treatment trend of opposite sign of the intervention under in a significant time trend in these years for the difference Under if We perform a regression if the we allow OLS for a coefficient is treatment specific trend in the data. Under this significant and stays significant and of the same after controlling in the regression for a yearly trend interacted with the treatment test, size dummy. This diagnostic test leads to a bigger reduction in the rejection rate, especially in the aggregate data (.105 and .150), though this is still at least twice as large as in Table 12 therefore confirm that the existing stock of we would want. 37 DD papers is In short, the results very likely affected by the 36 0f course, we can only study the formal procedures which people use. One can always argue that informal procedures (such as looking at the data) will lead one to avoid the serial correlation problem. Such a claim is by construction hard to test using simulations. The only way to defend it would be go back to the original papers and submit them to the procedures discussed above. "Moreover, it is not clear that these tests are rejecting the "right" ones. Any stringent criterion will reduce the rejection rate, but how are we to assess whether the reduction is sensible? We investigated this by computing the marginal rejection rate of the randomization inference test, conditional on passing the treatment trend diagnostic test. The odds that, having passed the trend test, an intervention would pass the randomization inference test are only 26% suggesting that the treatment trend reduces the rejection rate but not in a way that alleviates the serial correlation problem. 23 estimation problem discussed in this paper. 38 Conclusion 6 Our results suggest that, because of serial correlation, DD estimation as it is commonly performed While the bias grossly under-states the standard errors around the estimated intervention effect. induced by the DD serial correlation is well understood in theory, the sheer magnitude of this problem in context should come as a surprise to most readers. Since a large fraction of the published DD papers we surveyed report t-statistics around 2, our results suggest that the findings in many of these papers may not be as precise as originally thought to be and that far too many false rejections of the null hypothesis of no effect have taken place. We propose three solutions to deal with the serial correlation problem in the DD context. Collapsing the data into pre- and post- periods or allowing for an arbitrary covariance matrix within state over time have been shown to be simple viable solutions when sample sizes are sufficiently large. Alternatively, a simple adaptation of the randomization inference testing techniques developed in the statistics literature appears to fully correct standard errors irrespective of sample We size. have attempted variants on these diagnostic tests, such as changing what constitutes an "effect" in the preWe have also attempted all the tests together. The results were qualitatively period or what constitutes a trend. similar. 24 References Abadie, Alberto, "Semiparametric Difference-in-Differences Estimators," Working Paper, Kennedy School of Government, Harvard University, 2000. Blundell, Richard and Thomas MaCurdy, "Labor Supply," in Handbook of Labor Economics, Orley Ashenfelter and David Card (eds.), December 1999. and Robert Tibshirani An Introduction Efron, Bradley Statistics and Probability, no Heckman, James, "Discussion," and James Poterba in 57, Chapman and to the Bootstrap Monograph in Applied Hall, 1994 Empirical Foundations of Household Taxation, Martin Feldstein (eds.), 1996. Donald, Stephen and Kevin Lang, "Inferences with Difference in Differences and Other Panel Data," Boston University Working Paper, 2001. MaCurdy, Thomas E., in a Longitudinal "The Use of Time Series Processes to Data Analysis," Journal of Econometrics, Meyer, Bruce, "Natural and Quasi-Natural Experiments Economic in v. of Earnings 18 (1982): 83-114. Economics," Journal of Business and 13 (1995): 151-162. Statistics, v. Moulton, Brent R., "An Illustration of a in Model the Error Structure Pitfall in Micro Units," Review of Economics and Estimating the Effects of Aggregate Variables Statistics, v. 72 (1990): 334-338. Newey, Whitney and K. D. West, "A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent-Covariance Matrix," Econometrica, v. 55, pp. 703-708, 1987. Nickell, Stephen, "Biases in Dynamic Models with Fixed Effects," Econometrica, v. 49 (1981): 1417-1426. Rosenbaum, Paul, "Hodges-Lehmann Point Estimates of Treatment ies," Journal of the American Statistical Association, v. Effect in Observational Stud- 88 (1993): 1250-1253. Rosenbaum, Paul, "Observational Studies and Nonrandomized Experiments," Rao, eds, Handbook of Ghosh and C.R. Statistics, v. 13 (1996). Solon, Gary, "Estimating Auto-correlations in Fixed-Effects Models," Paper S. NBER Technical Working no. 32, 1984. White, Halbert, Asymptotic Theory for Econometricians, Academic Press, 1984. 25 Table 1: Survey of DD Papers" of DD papers Number with more than 2 periods of data Number which collapse data into before-after 92 Number with 65 Number 69 4 potention serial correlation problem Number with some serial correlation correction 5 GLS 4 Arbitrary variance-covariance matrix 1 Distribution of time-span for papers with more than 2 periods Average 16.5 Percentile Value 1% 5% 3 10% 25% 50% 75% 90% 95% 99% Number Informal manipulations of data Graph time series of effect See if 15 2 effect persists Examine lags of law to see timing of effect 2 DDD 11 Include trend specific to passing states 7 Explicitly include lead to look for effect prior to law 3 Include laggged dependent variable 3 Number which have Number which clustering deal with Most commonly used problem 80 36 it variables Employment Wages 18 13 Health/Medical Expenditure 8 Unemployment Fertility/Teen Motherhood 4 Insurance 4 Poverty 3 Consumption/Savings 3 6 "Notes: Data comes from a survey of all articles in six journals between 1990 and 2000: American Economic Review; Industrial Labor Relations Review; Journal of Labor Economics; Journal of Political Economy; Journal of Public Economics; and Quarterly Journal of Economics. We define an article as "Difference-in-Difference" if it: (1) examines the effect of a specific interventions and (2) uses units unaffected by the intervention as a control group. 26 3 4 5.75 11 21.5 36 51 83 Table 2: DD Rejection Rates for Placebo Laws" Law Type Data Technique Rejection Rate No A. CPS CPS CPS CPS CPS CPS CPS micro 25 states, one date aggregate aggregate aggregate aggregate aggregate = .8 Cluster OLS 25 states, one date Serially uncorrelated laws OLS OLS 12 states, one date OLS 36 states, one date 25 states, multiple dates B. AR(1), p OLS 25 states, one date micro Effect 2% Effect REAL DATA OLS .675 .855 (.027) (.020) .44 .74 (.029) (.025) .435 .72 (.029) (.026) .06 .895 (.014) (.018) .433 .673 (.029) (.027) .398 .668 (.028) (.027) .48 .71 (.029) (.026) .373 .725 (.028) (.026) MANUFACTURED DATA OLS 25 states, one date 'Notes: 1. Each cell represents the rejection rate of the null hypothesis of no effect (at the 5% significance level) on the intervention (law) variable for randomly generated placebo interventions. The number of simulations for each 2. CPS cell is at least data are data for two hundred. women between 25 and 50 in the fourth interview month of the Merged 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings. In row 3 to 7, data are aggregated to state-year level cells after controlling for demographic variables (education and age). Manufactured data are data generated so that the variances match the CPS variances. The p refers to the auto-correlation parameter in manufactured data. Outgoing Rotation Group for the years 3. All regressions also include, in addition to the intervention variable, state and year fixed the individual level regression they include the demographic controls as well. 4. Standard errors are 5. "Effect" specifies whether an effect of the placebo law has been in parenthesis and computed using the number of simulations. 27 added to the data. effects. In Table 3: Magnitude of CPS No Average Coefficient Fraction of effects < Negative 2% Effect Positive Negative .18 .17 .715 .0075 (.022) (.022) (.026) (.005) .02 -.02 .026 -.017 (.000) (.000) (.000) (.000) .59 .58 .33 1 (.028) (.028) (.027) (.000) .01 in (.01, .02] in (.02, .03] in (.03,-04] > Estimates Effect Positive Rejection Rate DD Aggregate Data .31 .3 .39 (.027) (.026) (.028) .084 .12 .16 (.016) (.019) (.021) .014 .04 (.000) (.000) .12 (.007) (.000) (.019) (.000) "Notes: 1. The positive (negative) columns report results for estimated effects of interventions which are positive (negative) data are data for women between 25 and 50 in the fourth interview month of the Merged Outgoing Rotation Group for the years 1979 to 1999. The dependent variable is log weekly earnings. Data are aggregated to state-year level cells after controlling for the demographic variables (education and age). 2. CPS 3. All regressions also include, in addition to the intervention variable, state 4. Standard errors are in parenthesis and year and computed using the number of simulations. 28 fixed effects. Table 4 Varying Auto-correlation Data and Dependent Variable Technique Rejection Rate No A. Effect 2% Effect REAL DATA Pl,P2,p3 CPS CPS CPS CPS agg, agg, Log wage .509, .440, .332 Employment .470, .418, .367 agg, Hours worked agg, Changes in log .151, .114, .063 wage B. -.046, .032, 002 OLS OLS OLS .435 .72 (.029) (.026) .415 .698 (.028) (.010) .263 .265 (.025) (.025) (.000) (.009) OLS .978 MANUFACTURED DATA Pi AR(1 OLS AR(1 OLS .2 AR(1 OLS .6 AR(1 OLS .8 AR(1 .783 (.024) .123 .738 (.019) (.025) OLS .4 AR(1 .053 (.013) OLS -.4 .19 .713 (.023) (.026) .333 .700 (.027) (.026) .373 .725 (.028) (.026) .008 .7 (.005) .026) "Notes: 1. 2. Each cell represents the rejection rate of the null hypothesis of no effect (at the 5% significance level) on the intervention variable for randomly generated interventions. The number of simulations for each cell is at least two hundred. CPS data are data for women between 25 and 50 in fourth interview month of the Merged Outgoing Rotation Group for the years 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings. Data are aggregated to state-year level cells, after controlling for the demographic variables (education and age). Manufactured data are data generated so that the variances match the CPS variances. CPS regressions also include, in addition to the intervention variable, state 3. All 4. Standard errors are 5. The in parenthesis and computed using the number variables pi refer to the estimated auto-correlation parameters of lag 29 and year of simulations. i. fixed effects. Table 5 Varying N CPS CPS CPS CPS CPS CPS CPS CPS CPS aggregate 50 20 agg agg 10 agg 6 agg 50 50 agg agg 50 agg 50 agg 50 T AR(1), p=.8 AR(1), p=.8 AR(1), p=.8 AR(1), p=.8 AR(1), p = .8 AR(1), p=.8 AR(1), p=.8 AR(1), p=.8 50 20 10 6 50 50 Rejection rate Effect A. REAL DATA 2% Effect .435 .72 (.029) (.026) .36 .53 (.028) (.029) .425 .525 (.029) (.029) 21 .45 .433 (.029) (.029) 21 .29 .675 (.026) (.027) 11 7 .16 .63 (.021) (.028) .08 .503 (.016) (.029) 5 .0775 .39 (.015) (.028) 3 2 .073 .315 (.015) (.027) MANUFACTURED DATA 21 21 21 21 11 5 50 T No 21 50 and Rejection rate 21 B. N 3 50 .35 .638 (.028) (.028) .35 .538 (.028) (.029) .3975 .505 (.028) (.029) .393 .5 (.028) (.029) .335 .588 (.027) (.028) .175 .5525 (.022) (.029) .09 .435 (.017) (.029) .4975 .855 (.029) (.020) "Notes: 1. 2. Each cell represents the rejection rate of the null hypothesis of no effect (at the 5% significance level) on the intervention variable for randomly generated interventions. The number of simulations for each cell is at least two hundred. CPS data are data for women between 25 and 50 in the fourth interview month of the Merged Outgoing Rotation Group for the years 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings. Data are aggregated to state-year level cells after controlling for the demographic variables (education and age). Manufactured data are data generated so that the variances match the CPS variances. 3. All CPS regressions also include, in addition to the intervention variable state 4. Standard errors are 5. N refers to the the CPS data The parameter in p measures the auto-correlation. parenthesis and and year fixed effects. computed using the number of simulations. number of states used in the simulation and T refers to the number used, we randomly drop states or years to fulfill the criterion. is 30 of years. When Table Data Parametric Solutions 6: Technique Estimated p Rejection rate No A. CPS CPS CPS CPS agg Standard AR(1) agg agg .368 .705 (.026) AR(1) correction imposing p=.8 .12 .4435 (.019) (.029) correction = .55 and p2 = AR(1) + White = .95 Noise and n/s=.13 Standard AR(1) .622 correction AR(l),p = AR(1) correction imposing pAR(2), pi = P2 = = Standard AR(1) .55 -35 .95, .335 .638 (.027) (.028) .373 .765 (.028) (.024) .205 .715 (.023) (.026) .06 .323 (.023) .444 correction AR(1)+ white p .5725 (.029) MANUFACTURED DATA OLS p=.i .228 (.024) .35 B. AR(1), .72 (.026) .345 p AR(1), p=.8 .435 (.029) (.027) imposing p\ CPS effect correction AR(2) agg 2% REAL DATA OLS agg Effect Standard AR(1) noise noise/signah .13 .301 correction .305 .625 (.027) (.028) .385 .4 (.028) (.028) "Notes: 1. represents the rejection rate of the null hypothesis of no effect (at the 5% significance on the intervention variable for randomly generated interventions. The number of simulations for each cell is at least two hundred. Each cell level) 2. CPS data are data for women between Merged Outgoing Rotation Group 25 and 50 in the fourth interview for the years 1979 to 1999. The dependent month of the variable unless Data are aggregated to state-year level cells, after and age). Manufactured data are data generated so that the variances match the CPS variances. An AR(1) + white noise process is the sum of an AR(1) plus an iid process, where the auto-correlation for the AR(1) component is given by p* and the relative variance of the components is given by the noise to signal otherwise specified is log weekly earnings. controlling for the demographic variables (education ratio (n/s). 3. All CPS regressions also include, in addition to the intervention variable, state and year fixed effects. 4. Standard errors are 5. the AR(1) correction in parenthesis is and computed using the number of simulations. implemented in stata using the xtgls 31 command. Table 7 Block Bootstrap N Technique Data Rejection rate No A. CPS CPS aggregate aggregate OLS B. Block Bootstrap AR(1), p = OLS .S Effect 50 .435 .72 (.029) (.026) .35 .60 (.028) (.027) MANUFACTURED DATA OLS AR(1), P =.& 2% CPS DATA 50 Block Bootstrap Effect Block Bootstrap 50 50 400 400 .38 .735 (.028) (.025) .285 .645 (.026) (.028) .415 1 (.028) (.000) .075 .98 (.015) (.008) "Notes: 1. Each level) cell represents the rejection rate of the null hypothesis of no effect (at the 5% significance for randomly generated interventions. The number of on the intervention variable simulations for each 2. cell is at least two hundred. in the fourth interview month of the the years 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings. Data are aggregated to state-year level cells after controlling for the demographic variables (education and age). Manufactured data are data CPS data are data for women between 25 and 50 Merged Outgoing Rotation Group for generated so that the variances match the 3. All CPS CPS variances. regressions also include, in addition to the intervention variable, state and year effects. 4. Standard errors are in parenthesis and computed using the number of simulations. 32 fixed Table 8: Empirical Variance-Covariance Matrix" Data N No Technique A. CPS CPS OLS agg 50 variance CPS CPS OLS agg 20 variance CPS agg OLS 10 CPS agg Empirical 10 variance CPS CPS OLS agg AR(1), rho=.8 .72 (.026) .0775 .085 (.015) (.016) .36 .53 (.028) (.029) .0825 .08 (.016) (.016) .425 .525 (.029) (.029) .0825 .0975 (.016) (.017) .45 .433 (.029) (.029) .165 .1825 (.021) (.022) 6 variance B. .435 6 Empirical agg Effect (.029) 20 Empirical agg 2% REAL DATA 50 Empirical agg Effect MANUFACTURED DATA Empirical 50 variance .105 .16 (.018) (.021) "Notes: 1. represents the rejection rate of the null hypothesis of no effect (at the 5% significance on the intervention variable for randomly generated interventions. The number of simulations for each cell is at least two hundred. Each cell level) 2. CPS data are data for women between 25 and 50 in the fourth interview month of the 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings. Data are aggregated to state-year level cells after controlling for demographic variables (education and age). Manufactured data are data Merged Outgoing Rotation Group for the years generated so that the variances match the 3. All CPS CPS variances. regressions also include, in addition to the intervention variable, state and year effects. 4. Standard errors are in parenthesis and computed using the number of simulations. 33 fixed Table 9: Data Arbitrary Variance-Covariance Matrix 3 Technique # Rejection Rate States No CPS agg OLS 51 CPS agg Cluster 51 CPS agg OLS 20 CPS agg Cluster 20 CPS CPS CPS OLS agg agg 10 Cluster OLS agg agg B. AR(1), p=.8 AR(1), p=0 10 6 Cluster Cluster Cluster 2% Effect REAL DATA A. CPS Effect 6 .435 .72 (.029) (.026) .06 .27 (.014) (.026) .36 .53 (.028) (.029) .0625 .1575 (.014) (.021) .425 .525 (.029) (.029) .085 .1025 (.016) (.018) .450 .433 (.029) (.029) .15 .1875 (.021) (.023) MANUFACTURED DATA 50 50 .045 .275 (.012) (.026) .035 .74 (.011) (.025) "Notes: 1. 2. Each cell represents the rejection rate of the null hypothesis of no effect (at the 5% significance level) on the intervention variable for randomly generated interventions. The number of simulations for each cell is at least two hundred. CPS data are data for women between 25 and 50 in the fourth interview month of the Merged Outgoing Rotation Group for the years 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings. Data are aggregated to state-year level cells after controlling for the demographic variables (education and age). Manufactured data are data generated so that the variances match the CPS variances. CPS regressions also include, in addition to the intervention variable, state 3. All 4. Standard errors are in parenthesis and and year fixed computed using the number of simulations. 34 effects. Table 10 Ignoring Time Series Data" Data N Technique Rejection rate No CPS agg Federal Federal Federal Staggered CPS agg Federal Federal Federal Staggered CPS agg Federal Federal Federal Staggered CPS agg Federal Federal Federal Staggered Effect 2% Effect REAL DATA A. OLS 50 Simple Aggregation 50 Residual Aggregation Residual Aggregation OLS 50 50 20 Simple Aggregation 20 Residual Aggregation Residual Aggregation OLS 20 20 10 Simple Aggregation 10 Residual Aggregation Residual Aggregation OLS 10 10 6 Simple Aggregation 6 Residual Aggregation Residual Aggregation 6 6 .435 .72 (.029) (.026) .060 .295 (.014) (.026) .053 .210 (.013) (.024) .048 .335 (.012) (.027) .36 .53 (.028) (.029) .060 .188 (.014) (.023) .095 .193 (.017) (.023) .073 .210 (.015) (.024) .425 .525 (.029) (.029) .078 .095 (.015) (.017) .095 .198 (.017) (.023) .103 .223 (.018) (.024) .450 .433 (.029) (.029) .130 .138 (.019) (.020) .315 .388 (.027) (.028) .275 .335 (.026) (.027) B. MANUFAC'i'UUKI DAT.'Simple Aggregation .050 50 ' AR(1), P- =.8 Federal (.013) Federal Staggered AR(1), p-- =0 Federal Federal Staggered Residual Aggregation Residual Aggregation Simple Aggregation 50 .045 .235 (.012) (.024) 50 .075 .355 (.015) (.028) 50 Residual Aggregation Residual Aggregation .243 (.025) .053 .713 (.013) (.026) 50 .045 .773 (.012) (.024) 50 .105 .860 (.018) (.020) 'Notes: Each cell represents the rejection rate of the null hypothesis randomly generated interventions. The number of simulations of for no effect (at the each 5% cell is at least significance level) on the intervention variable for two hundred. data are data for women between 25 and 50 in the fourth interview month of the Merged Outgoing Rotation Group for the years 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings. Data are aggregated to state-year level cells after controlling for demographic variables (education and age). Manufactured data are data generated so that the variances match the CPS variances. All CPS regressions also include, in addition to the intervention variable, state and year fixed effects. Standard errors are in parenthesis and computed using the number of simulations. CPS 35 36 Table 11: Randomization Inference" Data N Technique Rejection rate No A. CPS CPS agg CPS CPS CPS CPS CPS CPS CPS CPS OLS agg agg agg agg CPS agg agg CPS CPS CPS agg agg CPS agg 5 year window 20 20 10 Randomization Inference RI 5 year window 10 RI same year 10 OLS 6 Randomization Inference 5 year 10 10 RI 3 year window RI 20 20 OLS agg agg 20 RI same year CPS agg CPS 51 RI 3 year window CPS agg CPS 51 Randomization Inference RI 51 51 OLS agg CPS agg window RI same year agg agg 5 year RI 3 year window agg agg 51 Randomization Inference RI CPS window 6 6 RI 3 year window 6 RI same year 6 !> Rejection rate 2% Effect Effect vr.. .435 .72 (.029) (.026) .065 .235 (.014) (.024) .06 .23 (.014) (.024) .06 .23 (.014) (.024) .04 .24 (.011) (.025) .36 .53 (.028) (.029) .045 .1 (.012) (.017) .04 .125 (.011) (.019) .065 .095 (.014) (.017) .045 .115 (.012) (.018) .425 .525 (.029) (.029) .055 .115 (.013) (.018) .055 .095 (.013) (.017) .05 .125 (.013) (.019) .07 .115 (.015) (.018) .450 .433 (.029) (.029) .04 .07 (.011) (.015) .035 .065 (.011) (.014) .03 .06 (.010) (.014) .055 .07 (.013) (.015) MANUFACTURED DATA AR(1), rho=0.8 iid data B. Randomization Inference Randomization Inference 50 50 .05 .3 (.011) (.025) .08 .84 (.019) (.026) 'Notes: 1- Each 2. CPS cell represents the rejection rate of the null hypothesis of no effect (at the variable for randomly generated interventions. 5% significance level) on the intervention data are data for women between 25 and 50 in the fourth interview month of the Merged Outgoing Rotation Group for the years 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings Data are aggregated to state-year level cells after controlling for education and age. Manufactured data are data generated so that the variances match the CPS variances. All CPS regressions also include, in addition to the intervention variable, state fixed effects and year fixed effects. Standard errors are in parenthesis and computed using the number of simulations. 37 Table 12: Effect of 2 CPS Data: No Ho: P = OLS Informal Tests' Rejected effect before OLS+Persistence of OLS+No effect No Effect = 2% .8 effect .360 .725 .345 .625 (.034) (.031) (.034) (.031) the law .400 .634 .325 .565 (.033) (.032) (.033) (.035) .378 .602 .305 .565 (.033) (.031) (.033) (.035) effect treatment specific trend in pre-period OLS+Coef 2% if: coef significant (OLS) OLS+No AR(1), p Aggregate Effect .248 .518 .235 .530 (.029) (.035) (.030) (.035) .106 .418 .150 .375 (.018) (.018) (.025) (.034) significant with treatment specific trend (TREND) "Notes: 1. 2. Each cell represents the rejection rate of the null hypothesis of no effect (at the 5% significance level) on the intervention variable for randomly generated interventions. The number of simulations for each cell is at least two hundred. CPS data are data for women between 25 and 50 in the fourth interview month of the Merged Outgoing Rotation Group for the years 1979 to 1999. The dependent variable unless otherwise specified is log weekly earnings. Data are aggregated to state-year level cells after controlling for demographic variables (age and education). Manufactured data are data generated so that the variances match the CPS variances. 3. All CPS regressions also include, in addition to the intervention variable, state and year fixed effects. 4. Standard errors are in parenthesis and computed using the number 38 of simulations. Date Due lM/'} Lib-26-67 MIT LIBRARIES 3 9080 02246 1112