PS5 for Econometrics 101, Warwick Econ Ph.D Exercise 1: Using number of calls before response to recover the average eect of the treatment within a subgroup (Behaghel et al., 2012) This question is a continuation of exercise 1 in the previous problem set + the question on non-response in the exam of last year. In this exercise, we assume that it was actually not a letter which was sent to households, but that they were called over the phone to ask them whether someone in the household runs a business or not. Some households did not answer the rst phone call so they had to be called again. Households were called a maximum of 25 times. If they could not be reached in less than 25 calls, they were regarded as non respondents. In this context, Ri = 0 if household i did not take any of the 25 calls, and Ri = 1 otherwise. The graph below shows response rate in the treatment and in the control groups as a function of the number of calls. For instance, you can see that a little bit less than 50% of treated households answered the questionnaire after 5 calls, while in the control group, this rate was reached after 25 calls only. 1 25 Control 0 5 10 15 Number of calls Treatment 20 Figure 3: Response rates by assignment according to the number of phone calls 0 .1 .2 .3 .4 .5 .6 a) What is the value of Pb(Ri = 1|Di = 1)? Of Pb(Ri = 0|Di = 0)? Will Lee bounds (those which were studied in the end term exam of last year) be wide or narrow in this application? b) Explain intuitively how you could use number of calls to recover the eect of the treatment in a subpopulation. Hint: the red lines on the25graph might help. Under which assumption does this method rely? c) How could you test whether that assumption is likely to hold? Solution 2 a) Pb(Ri = 1|Di = 1) is the sample response rate in the treatment group after 25 calls, which is slightly less than 65%, as you can see from the Figure. Pb(Ri = 1|Di = 0) is the sample response rate in the control group after 25 calls, which is equal to 50%. Lee bounds will be wide in this application, because the dierence in response rates across the treatment and the control group is large. b) A solution could be to compare the entire subpopulation of respondents in the control group (all of those who answered the questionnaire, irrespective of how many calls one had to make before reaching them), to households in the treatment group which replied in less than 6 calls. The reasoning in Behaghel et al. goes as follows. The graph clearly shows that treatment has an eect on response: the two groups were similar in the beginning of the experiment, but in the end the response rate is 15 percentage points larger in the treatment group. Because of this, respondents in the treatment and in the control group are probably not comparable: control respondents include only forgiving subjects, while treated respondents include both forgiving and unforgiving subjects. But number of calls also has an eect on response: the more you call people, the more they are likely to answer. Because of this, the response rate after 6 calls in the treatment group is the same as in the control group after 25 calls. Calling the control group 19 times more than the treatment group compensates, in terms of response rate, the holding grudge eect which leads some control group subjects not to reply in the rst place. If those 19 extra phone calls allow you to catch up control group subjects who have been induced not to reply in less than 6 calls because they have been denied the treatment, those two groups will indeed be comparable, and comparing them will capture the average eect of the treatment among subjects who respond in less than 6 calls when they are sent to the control group and in less than 25 calls when they are sent to the control group. But if those 19 extra phone calls do not allow you to catch up control group subjects who have been induced not to respond because of the fact they were denied treatment, then you cannot be sure that the two groups you are comparing are indeed comparable. c) The assumption under which this method relies is that subjects responding in less than 6 calls in the treatment group are comparable, in terms of their potential outcomes, to those who respond in less than 25 calls in the control group. It is not possible to directly test this assumption, but an indirect test amounts to comparing the mean of a number of baseline covariates in those two groups. Exercise 2: Group-level OLS regressions with time and group xed eects estimate a weighted sum of Wald-DIDs (de Chaisemartin and D'Haultfoeuille, 2015). Assume you observe repeated cross-sections of the same population at various dates and that units can belong to various groups (e.g. counties or states). The date at which a unit is observed is represented by a random variable T ∈ {0, 1, ..., t}, and the group a unit belongs to is represented by a random variable G ∈ {0, 1, ..., g}. For instance, if there are 5 periods 3 (0,1,2,3,4) in the data and groups are American states, t = 4 while g = 49. Let Y be an outcome variable. Let D be a treatment variable. Let β denote the coecient of D in a 2SLS regression of Y on a constant, group dummies (1{G = g})1≤g≤g , time dummies (1{T = t})1≤t≤t , and D, with a rst stage fully saturated in (T, G). The goal of this exercise is to show that if T ⊥⊥ G, then β is equal to a weighted sum of Wald-DIDs. To alleviate a little bit the notations, for every random variable X and for any (g, t) ∈ {0, 1, ..., g} × {0, 1, ..., t}, let Xgt denote a random variable with the same probability distribution as X|G = g, T = t. For instance, D00 denotes a random variable with the same probability distribution as D|G = 0, T = 0. Because these two random variables have the same probability distribution, E(Xgt ) = E(X|G = g, T = t). For every (g, g 0 , t) ∈ {0, ..., g}2 × {1, ..., t}, let DIDD (g, g 0 , t) = E(Dgt ) − E(Dgt−1 ) − E(Dg0 t ) − E(Dg0 t−1 ) . DIDD (g, g 0 , t) is the di-in-di comparing the evolution of the mean of the treatment variable between groups g and g 0 and between periods t − 1 and t. Assume that for every 1 ≤ t ≤ t the mean of treatment does not follow a parallel evolution in any pair of groups between t − 1 and t.1 This is equivalent to assuming that DIDD (g, g 0 , t) 6= 0 for every (g, g 0 , t). If this is satised, then we can dene 0 WDID (g, g , t) = E(Ygt ) − E(Ygt−1 ) − E(Yg0 t ) − E(Yg0 t−1 ) . E(Dgt ) − E(Dgt−1 ) − E(Dg0 t ) − E(Dg0 t−1 ) WDID (g, g 0 , t) is the di-in-di comparing the evolution of the mean of the outcome variable between groups g and g 0 and between periods t − 1 and t, divided by the same di-in-di but for the treatment variable. Finally, for (g, t) ∈ {1, ..., g} × {1, ..., t}, let DIDD (g, g − 1, t)P (G ≥ g)P (T ≥ t) (E (D|G ≥ g, T ≥ t) − E (D|G ≥ g) − E (D|T ≥ t) + E(D)) . Pt t=1 DIDD (g, g − 1, t)P (G ≥ g)P (T ≥ t) (E (D|G ≥ g, T ≥ t) − E (D|G ≥ g) − E (D|T ≥ t) + E(D)) g=1 a wgt = P g a) First, let's try to interpret the assumption that T ⊥⊥ G by considering an example. If the data is a survey of a representative sample of 100 000 individuals in the American population made in three dierent years (100 000 individuals surveyed in 1991, 100 000 dierent individuals surveyed in 1992, 100 000 dierent individuals surveyed in 1993), and G represents individuals' state of residence, what does the assumption that T ⊥⊥ G require? Is it likely to be satised in this example? b) What is the predicted value for D from the rst stage regression? (remember that the rst stage is fully saturated in (T, G), which means that it has all the time dummies, all the group dummies, and all their interactions) Z̃) , where Z̃ is the residual from an OLS regression c) Conclude from question b) that β = cov(Y, V (Z̃) of E(D|G, T ) on a constant, group dummies (1{G = g})1≤g≤g , and time dummies (1{T = t})1≤t≤t . 1 If for some t, there are groups which experience a parallel evolution of their mean treatment between t−1 and t, the formula that you will obtain in question f.8) remains valid after grouping together these groups. 4 e = cov(E(D|T, G), Z) e = cov(D, Z) e . Conclude that β = d) Show that V (Z) cov(Y,Z̃) . cov(D,Z̃) e) Let α, αg , and αt respectively denote the coecients of the constant and of the group and time dummies in the regression of E(D|G, T ) on a constant, group dummies (1{G = g})1≤g≤g , and time dummies (1{T = t})1≤t≤t . e.1) As T ⊥⊥ G, one can show that αt is equal to the coecient of the dummy 1{T = t} in a regression of E(D|G, T ) on a constant and time dummies (1{T = t})1≤t≤t (without the group dummies). Use this to give a formula for αt . e.2) As T ⊥⊥ G, one can show that αg is equal to the coecient of the dummy 1{G = g} in a regression of E(D|G, T ) on a constant and group dummies (1{G = g})1≤g≤g (without the time dummies). Use this to give a formula for αg . e.3) Conclude from the 2 preceding questions that e = E(D|G, T ) − α − (E(D|G) − E(D|G = 0)) − (E(D|T ) − E(D|T = 0)) . Z f) Now that we have an explicit expression for Ze, we can try to rewrite β . Let's consider rst its numerator. e = E((E(D|G, T ) − E(D|G) − E(D|T ) + E(D))E(Y |G, T )). f.1) Show that cov(Y, Z) f.2) Deduce from this that as T ⊥⊥ G, e = cov(Y, Z) g t X X P (G = g)P (T = t)(E(Dgt ) − E(D|G = g) − E(D|T = t) + E(D))E(Ygt ). t=0 g=0 f.3) Deduce from this that e = cov(Y, Z) g t X X P (G = g)P (T = t)(E(Dgt )−E(D|G = g)−E(D|T = t)+E(D)) (E(Ygt ) − E(Yg0 ) − (E(Y0t ) − E(Y00 ))) . t=0 g=0 You need to check that the summation of the three terms added wrt to question f.2) is equal to 0. f.4) Deduce from this that e = cov(Y, Z) g t X X P (G = g)P (T = t)(E(Dgt )−E(D|G = g)−E(D|T = t)+E(D)) (E(Ygt ) − E(Yg0 ) − (E(Y0t ) − E(Y00 ))) . t=1 g=1 f.5) Deduce from this that e = cov(Y, Z) g t X X P (G = g)P (T = t)(E(Dgt )−E(D|G = g)−E(D|T = t)+E(D)) g t X X WDID (g 0 , g 0 −1, t0 )DIDD (g 0 , g 0 −1, t0 ) t0 =1 g 0 =1 t=1 g=1 You need to show that E(Ygt ) − E(Yg0 ) − (E(Y0t ) − E(Y00 )) = g t X X t0 =1 g 0 =1 5 WDID (g 0 , g 0 − 1, t0 )DIDD (g 0 , g 0 − 1, t0 ). It's easier to do if you start from the right hand side. f.6) Deduce from this that e = cov(Y, Z) g t X X g t X X WDID (g, g−1, t)DIDD (g, g−1, t) P (G = g 0 )P (T = t0 )(E(Dg0 t0 )−E(D|G = g 0 )−E(D|T = t0 )+E(D)). t0 =t g 0 =g t=1 g=1 f.7) Deduce from this that e = cov(Y, Z) g t X X WDID (g, g−1, t)DIDD (g, g−1, t)P (G ≥ g)P (T ≥ t) (E (D|G ≥ g, T ≥ t) − E (D|G ≥ g) − E (D|T ≥ t) + E(D)) . t=1 g=1 f.8) Similarly, one can show that e = cov(D, Z) g t X X DIDD (g, g − 1, t)P (G ≥ g)P (T ≥ t) (E (D|G ≥ g, T ≥ t) − E (D|G ≥ g) − E (D|T ≥ t) + E(D)) . t=1 g=1 Conclude from this and question f.7) that β = g t X X a WDID (g, g − 1, t)wgt . t=1 g=1 As you saw in question b), this 2SLS regression is equivalent to an OLS regression of Y on time dummies, group dummies, and the mean of the treatment variable in each group × period cell. An interesting special case of this type of OLS regression is when the unit of observation is a group × period cell (say county × year), and the dependent variable is the average value of Y in each group × period cell. This type of group level regression is extremely frequent in economics. In such instances, groups are stable over time (unless counties or states appear or disappear over the period under consideration, which is not going to be the case with 20th century data), thus implying that the assumption that T ⊥⊥ G will be satised. For instance, Gentzkow et al. (2011) estimate a regression of political participation in county c and year t on county and year dummies, and on the number of newspapers available in county c and year t. The result you just derived shows that the coecient of the number of newspapers in their regression is equal to a weighted sum of Wald-DIDs which are comparisons of the evolution of political participation across pairs of counties and consecutive elections, scaled by the same comparison but for the evolution of the number of newspapers available. Solution a) We will have T ⊥⊥ G if for each state, P (G = g|T = 1991) = P (G = g|T = 1992) = P (G = g|T = 1993): the share that each state accounts for in the American population should not vary over time. This is unlikely to be exactly satised, but that should not be too strongly violated either. b) If the rst stage regression is fully saturated in (T, G), the regression function and the CEF are the same thing. Therefore, the predicted value for D from the rst stage is E(D|T, G). 6 c) Therefore, the second stage is an OLS regression of Y on a constant, the time dummies, the group dummies, and E(D|T, G). Then, the result follows from Frisch-Waugh. d) The rst equality follows from the fact that E(D|T, G) = α + g X αg 1{G = g} + g=1 t X αt 1{T = t} + Ze t=1 and Ze is by construction uncorrelated with the time and group dummies. The second equality follows from the law of iterated expectations. e.1) The regression of E(D|G, T ) on a constant and time dummies (1{T = t})1≤t≤t corresponds to the second example of a saturated regression in the lecture notes. Therefore, αt = E(E(D|G, T )|T = t) − E(E(D|G, T )|T = 0) = E(D|T = t) − E(D|T = 0). The second equality follows from the second law of iterated expectations. e.2) Following the same reasoning, αg = E(E(D|G, T )|G = g) − E(E(D|G, T )|G = 0) = E(D|G = g) − E(D|G = 0). e.3) Therefore, e Z = E(D|G, T ) − α − g X (E(D|G = g) − E(D|G = 0)) 1{G = g} − (E(D|T = t) − E(D|T = 0)) 1{T = t} t=1 g=1 = t X E(D|G, T ) − α − (E(D|G) − E(D|G = 0)) − (E(D|T ) − E(D|T = 0)) . f.1) e = cov(Y, E(D|G, T ) − E(D|G) − E(D|T )) cov(Y, Z) = E((E(D|G, T ) − E(D|G) − E(D|T ) + E(D))Y ) = E((E(D|G, T ) − E(D|G) − E(D|T ) + E(D))E(Y |G, T )). The rst equality follows from the fact that α, E(D|G = 0) and E(D|T = 0) are constant, and from the fact that the mean of E(D|G, T ) − E(D|G) − E(D|T ) + E(D) is equal to 0 (rst law of iterated of expectations). The second equality follows from the rst law of iterated expectations. f.2) The result merely follows from the denition of an expectation, and from the fact that G ⊥⊥ T . 7 f.3) g t X X P (G = g)P (T = t)(E(Dgt ) − E(D|G = g) − E(D|T = t) + E(D))E(Y0t ) t=0 g=0 = t X P (T = t)E(Y0t ) E(D) − = P (G = g)E(D|G = g) + g=0 t=0 t X g X g X P (G = g|T = t)E(Dgt ) − E(D|T = t) g=0 P (T = t)E(Y0t ) (E(D) − E(D) + E(D|T = t) − E(D|T = t)) t=0 = 0. One can show that the other two terms added in the summation when going from question f.2) to f.3) also sum up to 0. f.4) It's easy to see that each term in the summation in f.3) with g = 0 or t = 0 is equal to 0. f.5) g t X X WDID (g 0 , g 0 − 1, t0 )DIDD (g 0 , g 0 − 1, t0 ) t0 =1 g 0 =1 = g t X X E(Yg0 t0 ) − E(Yg0 t0 −1 ) − E(Yg0 −1t0 ) − E(Yg0 −1t0 −1 ) t0 =1 g 0 =1 = t X E(Ygt0 ) − E(Ygt0 −1 ) − (E(Y0t0 ) − E(Y0t0 −1 )) t0 =1 = E(Ygt ) − E(Yg0 ) − (E(Y0t ) − E(Y00 )) . f.6) This follows from permuting the summations. f.7) For instance, we have g t X X P (G = g 0 )P (T = t0 )E(Dg0 t0 ) t0 =t g 0 =g = g t X X P (G = g 0 , T = t0 )E(Dg0 t0 ) t0 =t g 0 =g g t X X P (G = g 0 , T = t0 ) = P (G ≥ g, T ≥ t) E(Dg0 t0 ) P (G ≥ g, T ≥ t) 0 0 t =t g =g = P (G ≥ g)P (T ≥ t) g t X X P (G = g 0 , T = t0 |G ≥ g, T ≥ t)E(Dg0 t0 ) t0 =t g 0 =g = P (G ≥ g)P (T ≥ t)E(D|G ≥ g, T ≥ t). The result can be derived similarly for the other terms. f.8) The result follows combining the formula for β obtained in d), the formula obtained in f.7), and that given to you in f.8). 8