TESTING IF THE HETEROGENEITY DISTRIBUTION OF A RANDOMIZED EXPERIMENT CHANGES DURING THE EXPERIMENTAL PERIOD: A STATISTICAL ANALYSIS OF A SOCIAL EXPERIMENT

Marcel Voia, Department of Economics, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario K1S 5B6, Canada. E-mail: mvoia@connect.carleton.ca

Ričardas Zitikis, Department of Statistical and Actuarial Sciences, University of Western Ontario, London, Ontario N6A 5B7, Canada. E-mail: zitikis@stats.uwo.ca

February 2006

Abstract. The paper considers statistical tests for determining whether the heterogeneity distribution of the treatment group in a randomized experiment changes during the experimental period. The problem is of practical interest because changes in heterogeneity may indicate serious selection problems in the randomized experiment. To make the tests easily implementable in practice, we discuss the estimation of critical values, for which we use the bootstrap. To assess the actual performance of the tests, we conduct simulation studies. Finally, the tests are applied to a social experiment data set, which is the main goal of the present paper.

Classification codes: C12, D31, D63.

Key words and phrases: Stochastic dominance, Kolmogorov-Smirnov type statistic, asymptotic distribution, bootstrap, selection, testing for intersection.

1. Introduction and motivation

In this paper we consider statistical tests that can assist researchers in identifying changes in distributions during the experimental periods of randomized experiments. Moreover, we argue that the tests can also identify whether there is unobserved heterogeneity. We now discuss these issues in detail.

When the use of a treatment is a matter of choice, selection issues might arise even though randomization is used for allocating individuals to the treatment and control groups. For example, those individuals who believe that they would benefit most from the treatment may disproportionately be the ones who choose to avail themselves of the treatment. Selection into the treatment group can therefore be a serious problem for a randomized experiment, and researchers should take appropriate measures in order to obtain consistent estimates of the treatment effect.

Athey and Imbens (2003) have introduced an estimator that can be used to estimate the effect of a treatment on the entire outcome distribution of a treatment program. Their estimator allows the unobserved heterogeneity to differ between the two groups, and therefore allows for self-selection, or noncompliance, in one of the groups. Thus, in the absence of a treatment, the differences between the two groups are determined by the differences in the conditional distributions of unobserved heterogeneity in the groups. These differences can in turn be compared by employing statistical tests for the equality of two distributions, or for their first-order stochastic dominance (FSD), second-order stochastic dominance (SSD), higher-order dominance, or intersection.

There is an extensive literature on testing for stochastic dominance, which essentially starts with the work of McFadden (1989), who proposes and analyzes a Kolmogorov-Smirnov-type test statistic for stochastic dominance. Subsequently, Anderson (1996), Davidson and Duclos (2000), Barrett and Donald (2003), and Whang, Linton, and Maasoumi (2005) develop powerful statistical inferential results for stochastic dominance of any order.
Horvath, Kokoszka and Zitikis (2004) contribute to the literature by showing how to modify the statistics in order to test for stochastic dominance over non-compact intervals.

In this paper we are interested in identifying whether the heterogeneity distribution of the individuals from the treatment group changes during the treatment period, and whether the heterogeneity distribution changes between groups during the treatment period. Using the fact that we can construct the counterfactual distribution of the treatment group in the second period (that is, the distribution of the treated individuals as if they had not been treated; cf. Athey and Imbens (2003) for details), we can identify whether the distribution of the individuals from the treatment group changes over the treatment period by testing whether the counterfactual treatment distribution in the second period intersects the treatment distribution at the baseline. Hence, our statistical tests are based on checking whether a distribution dominates or intersects another one. For example, using the tests we can check whether the treatment and control distributions dominate each other or intersect during the period following the baseline.

The paper is organized as follows. In Section 2 we formulate the problem rigorously. In Section 3 we describe the various tests. In Section 4 we apply the tests to different simulation designs. In Section 5 we apply the methodology to an experimental treatment data set (cf. Decker et al., 2000) and present our findings. Section 6 contains concluding notes. Technical results, tables, and figures are relegated to appendices at the end of the paper.

2. Mathematical formulation of the problem

Assume that each individual in the population of interest can be assigned to one and only one of two sub-groups, which correspond to control and treatment. Let G be a random variable taking two values: G = 0 if a randomly selected individual is assigned to the control group and G = 1 if assigned to the treatment group. (We use the upper-case G to indicate that this random variable assigns the individuals to groups.) Next, there are two time periods. The first one, which we denote by t = 0, is the time at the introduction of a certain treatment policy. The second period, which we denote by t = 1, is the period after the introduction of the treatment policy, or the time when the effect of the policy is measured. (We use the lower-case t to denote the time periods since the assignment of individuals to the two time periods is not random.) Hence, we have the pair (G, t), which can take on one of the four possible values: (0, 0), (0, 1), (1, 0), and (1, 1). The variable of interest is $X^{(G,t)}$, which can, for example, be the out-of-work time measured in weeks, as in the example we analyze later in the paper. We shall be interested in properties of the conditional distribution functions
\[
F^{(g,t)}(x) := \mathbb{P}\bigl[X^{(G,t)} \le x \,\big|\, G = g\bigr]
\]
for the various choices of $g, t \in \{0, 1\}$. We assume that at the time of the random assignment (when the treatment was not yet enforced) the control and the treatment groups have the same heterogeneity distribution, which we write as $F^{(0,0)} \equiv F^{(1,0)}$. As for the two distributions $F^{(0,1)}$ and $F^{(1,1)}$, there are three possibilities; their descriptions are given next, followed by a discussion and further notes.

(1) The distributions are equal. This means that the treatment does not have any effect on the outcome variable. In this case we write the null hypothesis as
\[
H_0^{(1)}: F^{(0,1)} \equiv F^{(1,1)}.
\]
(2) One of the distributions dominates the other. We shall concentrate on the case where $F^{(0,1)}(x) \le F^{(1,1)}(x)$ for all $x$. We formulate the corresponding null hypothesis as
\[
H_0^{(2)}: F^{(0,1)} \le F^{(1,1)}.
\]
(Data might suggest testing the null hypothesis $F^{(0,1)} \ge F^{(1,1)}$, which can be done analogously by interchanging the roles of $F^{(0,1)}$ and $F^{(1,1)}$.)

(3) The two distributions intersect, that is, there are two points $y_0$ and $z_0$ such that $F^{(0,1)}(y_0) < F^{(1,1)}(y_0)$ and $F^{(0,1)}(z_0) > F^{(1,1)}(z_0)$. In this case we write the null hypothesis as
\[
H_0^{(3)}: F^{(0,1)} \bowtie F^{(1,1)}.
\]

Several problems of practical interest can be formulated, and we discuss them now. First, we are naturally interested in whether the distributions of the control and the treatment groups differ after the introduction of a new policy. This requires testing the above defined null hypothesis $H_0^{(1)}$ against the alternative $H_1^{(1)}$, where $H_1^{(1)} :=$ "not $H_0^{(1)}$", and where we use the notation ":=" for "equality by definition".

In order for the treatment to be effective over the whole treatment group, the distributions of the control and the treatment groups have to be different and should not intersect. This would ensure that the treatment has had an effect on the whole treatment group. For example, in the context of reducing out-of-work time, we would be interested in testing whether $F^{(0,1)} \le F^{(1,1)}$ or not. In other words, we are interested in testing the above defined $H_0^{(2)}$ against the alternative $H_1^{(2)}$, where $H_1^{(2)} :=$ "not $H_0^{(2)}$".

On the other hand, if for some individuals the treatment is not effective, or if the individuals consider the treatment not worth taking (assuming that the treatment is not mandatory), then these individuals may not take the treatment. It may also happen that some individuals do not take the treatment because they are negatively affected by the fact that they have been selected for treatment and thus decide not to follow it. Therefore, if we look at the effect of the policy during the time period t = 1, then we shall observe that at some point the two distributions intersect. Hence, we are interested in testing whether $F^{(0,1)}$ and $F^{(1,1)}$ intersect, that is, we want to test $H_0^{(3)}$ against the alternative $H_1^{(3)}$, where $H_1^{(3)} :=$ "not $H_0^{(3)}$".

The other interesting (and similar) problem concerns the distributions $F^{(1,0)}$ and $F^{(1,1)}$. Specifically, we may be interested in testing whether the behavior of those in the treatment group has changed after the introduction of a new policy compared to their behavior before the introduction of the policy. For this, we may want to test whether $F^{(1,0)}$ and $F^{(1,1)}$ differ, dominate each other, or intersect. That is, just as above, we are interested in testing whether any of the null hypotheses $H_0: F^{(1,0)} \equiv F^{(1,1)}$, $H_0: F^{(1,0)} \le F^{(1,1)}$, or $H_0: F^{(1,0)} \bowtie F^{(1,1)}$ holds against the corresponding alternative "not $H_0$".

3. Methodology

Let $X_1^{(0,1)}, \ldots, X_n^{(0,1)}$ be independent and identically distributed random variables, each having the distribution function $F^{(0,1)}$. Denote the corresponding empirical distribution function by $\widehat{F}^{(0,1)}$. Likewise, let $X_1^{(1,1)}, \ldots, X_m^{(1,1)}$ be independent and identically distributed random variables, each having the distribution function $F^{(1,1)}$, and denote the corresponding empirical distribution function by $\widehat{F}^{(1,1)}$. We assume that all the $X$'s are independent random variables. In other words, we consider the case of two independent populations.
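All the tests below are built from these two empirical distribution functions. As a minimal illustrative sketch (the function name edf and the use of NumPy are our own choices, not part of the paper), an empirical distribution function can be computed as follows.

import numpy as np

def edf(sample):
    """Empirical distribution function of a one-dimensional sample."""
    xs = np.sort(np.asarray(sample, dtype=float))

    def F(x):
        # Proportion of observations less than or equal to x.
        return np.searchsorted(xs, x, side="right") / xs.size

    return F

# Small illustrative example.
F_hat = edf([3.1, 1.2, 2.7, 4.0])
print(F_hat(2.7))   # 0.5
print(F_hat(10.0))  # 1.0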
Furthermore, we assume that the sample sizes $n$ and $m$ are comparable, which is a natural assumption. Specifically, we assume that there exists a number $0 < \eta < 1$ such that both sample sizes tend to infinity in such a way that
\[
\frac{m}{n+m} \to \eta.
\]

Testing $H_0^{(1)}$ vs $H_1^{(1)}$. Considerations in this subsection are based on the classical Kolmogorov-Smirnov test. Namely, with the help of the parameter
\[
\kappa := \sup_x \bigl|F^{(0,1)}(x) - F^{(1,1)}(x)\bigr|,
\]
we rewrite the null and the alternative hypotheses as follows:
\[
H_0^{(1)}: \kappa = 0 \quad \text{vs} \quad H_1^{(1)}: \kappa > 0. \tag{3.1}
\]
Next, we need to construct an empirical estimator of $\kappa$ and to establish its asymptotic distribution (or a bound) so that critical values can be calculated, or estimated. We define an estimator of $\kappa$ by
\[
\widehat\kappa := \sup_x \bigl|\widehat{F}^{(0,1)}(x) - \widehat{F}^{(1,1)}(x)\bigr|.
\]
The estimator $\widehat\kappa$ is consistent (cf. Theorem 7.1 below). The asymptotic behavior of the estimator under the null and the alternative hypotheses is investigated in Theorem 7.2. Namely, based on the theorem,
\[
\widehat{K} := \sqrt{\frac{nm}{n+m}}\,\widehat\kappa
\]
is an appropriate statistic for testing the null hypothesis $H_0^{(1)}$ against the alternative $H_1^{(1)}$. The corresponding rejection (i.e., critical) region is $R: \widehat{K} > k_\alpha$, where $k_\alpha$ is the $\alpha$-critical value of the (classical) Kolmogorov-Smirnov test.

Testing $H_0^{(2)}$ vs $H_1^{(2)}$. Considerations in this subsection follow those in Whang, Linton, and Maasoumi (2005). Namely, with the help of the parameter
\[
\delta := \sup_x \bigl(F^{(0,1)}(x) - F^{(1,1)}(x)\bigr),
\]
we rewrite the hypotheses $H_0^{(2)}$ and $H_1^{(2)}$ as
\[
H_0^{(2)}: \delta = 0 \quad \text{vs} \quad H_1^{(2)}: \delta > 0. \tag{3.2}
\]
The empirical estimator of $\delta$ is
\[
\widehat\delta := \sup_x \bigl(\widehat{F}^{(0,1)}(x) - \widehat{F}^{(1,1)}(x)\bigr).
\]
The estimator $\widehat\delta$ is consistent (cf. Theorem 7.1). The asymptotic behavior of the estimator under the null and the alternative hypotheses is investigated in Theorem 7.3. Namely, based on the theorem we conclude that
\[
\widehat{D} := \sqrt{\frac{nm}{n+m}}\,\widehat\delta
\]
is an appropriate statistic for testing the null hypothesis $H_0^{(2)}$ against the alternative $H_1^{(2)}$. The corresponding rejection region is $R: \widehat{D} > d_\alpha$, where $d_\alpha$ is the $\alpha$-critical value of the supremum of a Gaussian stochastic process $\Gamma$ (cf. Appendix 7 below for the definition), which depends on both $F^{(0,1)}$ and $F^{(1,1)}$. Since the distributions are not, in general, identical, the critical value $d_\alpha$ is not distribution free and thus needs to be estimated. For this we use the bootstrap, whose detailed description follows.

From $X_1^{(0,1)}, \ldots, X_n^{(0,1)}$ we sample with replacement and obtain $n$ values $X_1^{(0,1)*}, \ldots, X_n^{(0,1)*}$. Let $\widehat{F}^{(0,1)*}(x)$ be the corresponding empirical distribution function. Next, from $X_1^{(1,1)}, \ldots, X_m^{(1,1)}$ we sample with replacement and obtain $m$ values $X_1^{(1,1)*}, \ldots, X_m^{(1,1)*}$. Let $\widehat{F}^{(1,1)*}(x)$ be the corresponding empirical distribution function. With the notation above, we define the process
\[
\Delta^*(x) := \sqrt{\frac{nm}{n+m}}\Bigl(\widehat{F}^{(0,1)*}(x) - \widehat{F}^{(0,1)}(x)\Bigr) - \sqrt{\frac{nm}{n+m}}\Bigl(\widehat{F}^{(1,1)*}(x) - \widehat{F}^{(1,1)}(x)\Bigr),
\]
and then, in turn, the quantity
\[
\widehat{D}^* := \sup_x \Delta^*(x).
\]
We repeat the above sampling procedure $M$ times and obtain $M$ values of $\widehat{D}^*$. Now we are in a position to define an estimator $d_\alpha^*$ of $d_\alpha$ as the smallest $x$ such that at least $100(1-\alpha)\%$ of the obtained $M$ values of $\widehat{D}^*$ are at or below $x$. With the just defined $d_\alpha^*$, the rejection region for testing the null hypothesis $H_0^{(2)}$ against the alternative $H_1^{(2)}$ is $R: \widehat{D} > d_\alpha^*$.
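As an informal illustration of how $\widehat{K}$, $\widehat{D}$ and the bootstrap critical value $d_\alpha^*$ can be computed in practice, consider the following sketch. The function names, the use of NumPy/SciPy, and the quantile-based approximation of $d_\alpha^*$ are our own choices; the recentring inside the bootstrap loop follows the definition of $\Delta^*(x)$ above.

import numpy as np
from scipy.stats import kstwobign

def edf_on_grid(sample, grid):
    """Evaluate the empirical distribution function of `sample` at the points in `grid`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def two_sample_tests(x, y, n_boot=1000, alpha=0.10, seed=0):
    """K_hat, D_hat and a bootstrap estimate of the critical value d_alpha.

    x plays the role of the control sample (F^(0,1)) and y of the treatment
    sample (F^(1,1)).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = x.size, y.size
    scale = np.sqrt(n * m / (n + m))
    grid = np.concatenate([x, y])
    fx, fy = edf_on_grid(x, grid), edf_on_grid(y, grid)

    K = scale * np.max(np.abs(fx - fy))   # two-sided Kolmogorov-Smirnov statistic
    D = scale * np.max(fx - fy)           # one-sided dominance statistic

    # Recentred bootstrap: Delta*(x) subtracts the original EDF gap pointwise.
    rng = np.random.default_rng(seed)
    d_star = np.empty(n_boot)
    for b in range(n_boot):
        fxs = edf_on_grid(rng.choice(x, size=n, replace=True), grid)
        fys = edf_on_grid(rng.choice(y, size=m, replace=True), grid)
        d_star[b] = scale * np.max((fxs - fx) - (fys - fy))

    return {
        "K": K,
        "p_value_KS": kstwobign.sf(K),              # 1 - F_KS(K)
        "D": D,
        "d_alpha_star": np.quantile(d_star, 1 - alpha),
        "p_value_D": np.mean(d_star > D),
    }

Evaluating the empirical distribution functions only on the pooled sample points is sufficient here because the suprema of the step-function differences are attained at those points.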
Testing $H_0^{(3)}$ vs $H_1^{(3)}$. Again, our considerations follow those in Whang, Linton, and Maasoumi (2005). First we note that if there is an $x_0$ such that the strict inequality $F^{(0,1)}(x_0) > F^{(1,1)}(x_0)$ holds, then the earlier introduced parameter $\delta$ is strictly positive. Likewise, the existence of an $x_1$ such that $F^{(0,1)}(x_1) < F^{(1,1)}(x_1)$ results in a strictly positive value of the parameter
\[
\theta := \sup_x \bigl(F^{(1,1)}(x) - F^{(0,1)}(x)\bigr).
\]
Hence, if the two distributions $F^{(0,1)}$ and $F^{(1,1)}$ intersect, then the parameter
\[
\tau := \min(\delta, \theta)
\]
is strictly positive. In view of the discussion above, we reformulate the null hypothesis as $H_0^{(3)}: \tau > 0$. Under the alternative, one of the two distribution functions dominates the other, and hence the parameter $\tau$ can never be positive. In fact, we then have $\tau = 0$, since $F^{(0,1)}(x)$ and $F^{(1,1)}(x)$ always coincide at $x = \pm\infty$. Hence, we reformulate the alternative as $H_1^{(3)}: \tau = 0$.

The way the null and alternative hypotheses appear above poses a serious problem in developing a statistical test of the desired size or level. To circumvent the problem, we formulate our problem somewhat differently. That is, we test the null hypothesis
\[
H_0^{(\text{not } 3)}: F^{(0,1)} \ \text{dom}\ F^{(1,1)},
\]
where "$F^{(0,1)} \ \text{dom}\ F^{(1,1)}$" means that one of the distributions dominates the other, without specifying whether $F^{(0,1)} \le F^{(1,1)}$ or $F^{(0,1)} \ge F^{(1,1)}$. The alternative $H_1^{(\text{not } 3)}$, which is the complement of $H_0^{(\text{not } 3)}$ by definition, coincides with the earlier specified $H_0^{(3)}: F^{(0,1)} \bowtie F^{(1,1)}$. Hence, if we reject the null hypothesis $H_0^{(\text{not } 3)}: \tau = 0$, then we have significant evidence to claim that the two distributions $F^{(0,1)}$ and $F^{(1,1)}$ intersect. In summary, we test
\[
H_0^{(\text{not } 3)}: \tau = 0 \quad \text{vs} \quad H_1^{(\text{not } 3)}: \tau > 0. \tag{3.3}
\]
We define an estimator of $\tau$ by
\[
\widehat\tau := \min(\widehat\delta, \widehat\theta),
\]
where $\widehat\delta$ is the same as above, and
\[
\widehat\theta := \sup_x \bigl(\widehat{F}^{(1,1)}(x) - \widehat{F}^{(0,1)}(x)\bigr).
\]
The estimator $\widehat\tau$ is consistent (cf. Theorem 7.1), and its asymptotic properties are described in Theorem 7.4. Namely, based on the theorem,
\[
\widehat{T} := \sqrt{\frac{nm}{n+m}}\,\widehat\tau
\]
is an appropriate statistic for testing the null hypothesis $H_0^{(\text{not } 3)}$ against the alternative $H_1^{(\text{not } 3)}$ (recall that the latter coincides with $H_0^{(3)}$). The corresponding rejection region is $R: \widehat{T} > t_\alpha$, where $t_\alpha$ is the $\alpha$-critical value of a distribution (cf. Theorem 7.4) that depends on $F^{(0,1)}$ and $F^{(1,1)}$. Hence, we need to estimate $t_\alpha$, for which we use the bootstrap as follows. With the same process $\Delta^*(x)$ as defined earlier, let
\[
\widehat{T}^* := \max\Bigl(\sup_x \Delta^*(x),\ \sup_x\bigl(-\Delta^*(x)\bigr)\Bigr)
\]
(the maximum here is not a typographical error). We repeat the above sampling procedure $M$ times and in this way obtain $M$ values of $\widehat{T}^*$. Now we define the estimator $t_\alpha^*$ as the smallest $x$ such that at least $100(1-\alpha)\%$ of the obtained $M$ values of $\widehat{T}^*$ are at or below $x$. With this $t_\alpha^*$, the rejection region for testing $H_0^{(\text{not } 3)}$ against $H_1^{(\text{not } 3)}$ is $R: \widehat{T} > t_\alpha^*$.
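A sketch of the corresponding computation is given below. As before, the function name, the NumPy implementation and the quantile-based bootstrap critical value are our own illustrative choices; the recentred bootstrap difference follows the definition of $\Delta^*(x)$, and the maximum (rather than the minimum) inside the bootstrap loop mirrors the definition of $\widehat{T}^*$.

import numpy as np

def intersection_test(x, y, n_boot=1000, alpha=0.20, seed=0):
    """T_hat = sqrt(nm/(n+m)) * min(delta_hat, theta_hat) and a bootstrap
    P-value for H0(not 3): one distribution dominates the other."""
    edf = lambda s, g: np.searchsorted(np.sort(s), g, side="right") / s.size
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = x.size, y.size
    scale = np.sqrt(n * m / (n + m))
    grid = np.concatenate([x, y])
    diff = edf(x, grid) - edf(y, grid)
    T = scale * min(np.max(diff), np.max(-diff))   # min(delta_hat, theta_hat)

    rng = np.random.default_rng(seed)
    t_star = np.empty(n_boot)
    for b in range(n_boot):
        d_b = (edf(rng.choice(x, size=n, replace=True), grid) - edf(x, grid)) \
            - (edf(rng.choice(y, size=m, replace=True), grid) - edf(y, grid))
        t_star[b] = scale * max(np.max(d_b), np.max(-d_b))  # note the maximum here

    return {"T": T,
            "t_alpha_star": np.quantile(t_star, 1 - alpha),
            "p_value": np.mean(t_star > T)}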
4. Simulation designs

In the present section we assess the performance of the tests discussed in the previous section, so as to gain confidence in their performance on the real data set analyzed in Section 5 below. We consider simulated data from a randomized treatment experiment in which the treatment is aimed at reducing the duration of unemployment. We consider six situations, presented in the six subsections below. The first subsection considers a linear data generating process (DGP) with no differences in distributions in the second period (no treatment effect). The second subsection considers a non-linear DGP with no differences in distributions in the second period (again, no treatment effect). The third subsection considers a linear DGP with no selection problem in the second period and differences in distributions (a treatment effect). The fourth one considers a non-linear DGP with no selection problem in the second period and differences in distributions (again, a treatment effect). The fifth subsection considers a linear DGP with a selection problem in the second period. Finally, the sixth subsection considers a non-linear DGP with a selection problem in the second period. We note at the outset that in the last two cases we have different heterogeneity distributions for the control and the treatment groups in the second period (which indicates a selection problem), whereas the distributions in the first time period are the same in all six cases. To simplify considerations, in all six subsections we assume that n = m, and thus specify only n throughout.

4.1. Linear DGP with equality of the two distributions in the second period. We simulate two data sets $X_1^{(0,1)}, \ldots, X_n^{(0,1)}$ and $X_1^{(1,1)}, \ldots, X_n^{(1,1)}$ using the model
\[
X^{(G,1)} = 16 + 4\varepsilon,
\]
where $\varepsilon \sim N(0,1)$ is a standard normal random variable. Using the simulated data, we perform the Kolmogorov-Smirnov test for the equality of the distributions $F^{(0,1)}$ (non-treated) and $F^{(1,1)}$ (treated). We use the formula $F^{(0,1)}(x) = F^{(1,1)}(x) = \Phi\bigl(\frac{x-16}{4}\bigr)$ to produce the two graphs in Figure 8.1.a.

We know from Theorem 7.2 that under the null hypothesis, which is $H_0^{(1)}: F^{(0,1)} = F^{(1,1)}$, the test statistic $\widehat{K}$ asymptotically has the Kolmogorov-Smirnov distribution, which we denote by $F_{KS}$. Hence, the asymptotic $P$-value of the test is $\mathbb{P}_*[\Gamma > \widehat{K}]$, where $\mathbb{P}_*$ denotes the conditional distribution given the (simulated) values of $X_1^{(0,1)}, \ldots, X_n^{(0,1)}$ and $X_1^{(1,1)}, \ldots, X_n^{(1,1)}$. (The statistic $\widehat{K}$ is calculated using the simulated values.) The right-hand side of the following equality is useful for practical calculation of the asymptotic $P$-value:
\[
\mathbb{P}_*[\Gamma > \widehat{K}] = 1 - F_{KS}(\widehat{K}).
\]
In each of the four cases n = 200, n = 500, n = 1000, n = 2000, we simulate 1000 sets of random variables and in this way obtain B = 1000 values of $\widehat{K}$. In each of the four cases we therefore obtain 1000 values of $1 - F_{KS}(\widehat{K})$ and draw histograms of these $P$-values in Figure 8.2. If the test gives a $P$-value smaller than a significance level $0 < \alpha < 0.5$ (cf., e.g., Abadie, 2000), then we reject the given null hypothesis. Considering now the level of significance $\alpha = 0.1$, our findings show (cf. Table 8.1, row 4.1) that we do not reject the null of equality of distributions for n = 200, 500, 1000, 2000. The histograms of $P$-values also show (cf. Figure 8.2) that as the sample size increases the test converges to its asymptotic distribution.

4.2. Non-linear DGP with equality of the two distributions in the second period. We simulate two data sets $X_1^{(0,1)}, \ldots, X_n^{(0,1)}$ and $X_1^{(1,1)}, \ldots, X_n^{(1,1)}$ using the (same) model
\[
X^{(G,1)} = \exp\{2.2 + 0.4\varepsilon\},
\]
where $\varepsilon \sim N(0,1)$. Using the simulated data, we perform the Kolmogorov-Smirnov test for the equality of the distributions $F^{(0,1)}$ and $F^{(1,1)}$. We use the formula $F^{(0,1)}(x) = F^{(1,1)}(x) = \Phi\bigl(\frac{\log x - 2.2}{0.4}\bigr)$ to produce the two graphs in Figure 8.1.d. For the non-linear DGP case, our findings show (cf. Table 8.1, row 4.2) that we do not reject the null of equality of distributions for n = 500, 1000, 2000 at the level of significance $\alpha = 0.1$, but we reject the null of equality of distributions for n = 200 at the same level of significance.
This result suggests that when dealing with data that come from a non-linear model the sample sizes need to be larger. The histograms of $P$-values confirm the above results by showing (cf. Figure 8.3) that as the sample size increases the test converges to its asymptotic distribution, but at a slower rate than in the case of the linear model above.

4.3. Linear DGP with no selection problem in the second period. Our model in this subsection is
\[
X^{(G,1)} = 16 + 3(1-G) + (1-G)4\varepsilon_1 + G\,4\varepsilon_2
\]
with independent random variables $\varepsilon_1, \varepsilon_2 \sim N(0,1)$ and
\[
G = \begin{cases} 0 & \text{with probability } \tfrac12, \\ 1 & \text{with probability } \tfrac12. \end{cases}
\]
Hence, we simulate two sets of random numbers, corresponding to the control and treatment groups, according to $X^{(G,1)}|_{G=0} = 19 + 4\varepsilon_1$ and $X^{(G,1)}|_{G=1} = 16 + 4\varepsilon_2$, respectively. Using the simulated data, we perform a test for dominance of $F^{(0,1)}$ and $F^{(1,1)}$. We use the formulas $F^{(0,1)}(x) = \Phi\bigl(\frac{x-19}{4}\bigr)$ and $F^{(1,1)}(x) = \Phi\bigl(\frac{x-16}{4}\bigr)$ to produce the two graphs in Figure 8.1.b.

We simulate two data sets $X_1^{(0,1)}, \ldots, X_n^{(0,1)}$ and $X_1^{(1,1)}, \ldots, X_n^{(1,1)}$ from the distributions $F^{(0,1)}$ and $F^{(1,1)}$, respectively. Using the simulated data, we test whether the distribution $F^{(0,1)}$ (non-treated) lies below the distribution $F^{(1,1)}$ (treated). We know from Theorem 7.3 that under the hypothesis $H_0^{(2)}: F^{(0,1)} \le F^{(1,1)}$ the test statistic $\widehat{D}$ is such that, asymptotically, $\mathbb{P}[\widehat{D} > x_\alpha]$ does not exceed the significance level $\alpha$ whenever $x_\alpha$ solves the equation $\mathbb{P}[\Gamma^+ > x_\alpha] = \alpha$. Hence, the critical region is $(x_\alpha, \infty)$. We estimate the asymptotic $P$-value $\mathbb{P}_*[\Gamma^+ > \widehat{D}]$ of the test using the bootstrap as follows:
\[
\mathbb{P}_*[\Gamma^+ > \widehat{D}] \approx \mathbb{P}_*[\widehat{D}^* > \widehat{D}],
\]
where $\widehat{D}^* := \sup_x \Delta^*(x)$ with the earlier notation $\Delta^*(x)$. In each of the four cases n = 200, 500, 1000, 2000, we simulate 1000 sets of random variables and in this way obtain 1000 values of $\widehat{D}$. For each value of $\widehat{D}$, we calculate $\mathbb{P}_*[\widehat{D}^* > \widehat{D}]$ using 1000 bootstrap iterations. Hence, for each value of $\widehat{D}$ we obtain a value of $\mathbb{P}_*[\widehat{D}^* > \widehat{D}]$, which is an approximate $P$-value of the test. To visualize the distribution of the 1000 $P$-values for each of the four sample sizes specified above, we have produced the histograms in Figure 8.4. For the linear DGP case, our findings show (cf. Table 8.1, row 4.3) that we do not reject the null of dominance of distributions for n = 500, 1000, 2000 at the level of significance $\alpha = 0.1$, but the test (mistakenly) rejects the null of dominance when n = 200 at the same level of significance.

4.4. Non-linear DGP without selection problem in the second period. The theoretical model of this subsection is
\[
X^{(G,1)} = \exp\bigl(2.7 + 0.2(1-G) + (1-G)0.4\varepsilon_1 + G\,0.4\varepsilon_2\bigr)
\]
with $\varepsilon_1$, $\varepsilon_2$, and $G$ as before. Hence, we have the equations $X^{(G,1)}|_{G=0} = \exp(2.9 + 0.4\varepsilon_1)$ and $X^{(G,1)}|_{G=1} = \exp(2.7 + 0.4\varepsilon_2)$. We use the formulas $F^{(0,1)}(x) = \Phi\bigl(\frac{\log x - 2.9}{0.4}\bigr)$ and $F^{(1,1)}(x) = \Phi\bigl(\frac{\log x - 2.7}{0.4}\bigr)$ to construct the distributions in Figure 8.1.e. We perform a simulation study along the lines of the previous subsections. For the non-linear DGP case, our findings show (cf. Table 8.1, row 4.4) that we do not reject the null of dominance of distributions for n = 500, 1000, 2000 at the level of significance $\alpha = 0.1$, but we reject the null of dominance of distributions for n = 200 at the level of significance $\alpha = 0.05$. The histograms of $P$-values are given in Figure 8.5.

4.5. Linear DGP with selection problem in the second period.
We simulate observations of
\[
X^{(G,1)} = 16 + 3(1-G) + (1-G)\varepsilon_1 + G\,4\varepsilon_2
\]
with the same independent random variables $\varepsilon_1$, $\varepsilon_2$ and $G$ as above. Hence, we have the equations $X^{(G,1)}|_{G=0} = 19 + \varepsilon_1$ and $X^{(G,1)}|_{G=1} = 16 + 4\varepsilon_2$. Using the simulated data, we perform a test for dominance vs intersection of $F^{(0,1)}$ and $F^{(1,1)}$. We use the formulas $F^{(0,1)}(x) = \Phi\bigl(\frac{x-19}{1}\bigr)$ and $F^{(1,1)}(x) = \Phi\bigl(\frac{x-16}{4}\bigr)$ to produce the two graphs in Figure 8.1.c.

We simulate two data sets $X_1^{(0,1)}, \ldots, X_n^{(0,1)}$ and $X_1^{(1,1)}, \ldots, X_n^{(1,1)}$ from the distributions $F^{(0,1)}$ and $F^{(1,1)}$, respectively. Using the simulated data, we perform the test for the null hypothesis $H_0^{(\text{not } 3)}: F^{(0,1)}\ \text{dom}\ F^{(1,1)}$. Theorem 7.4 says that under the hypothesis $H_0^{(\text{not } 3)}$ the test statistic $\widehat{T}$ is such that, asymptotically, $\mathbb{P}[\widehat{T} > x_\alpha]$ does not exceed the significance level $\alpha$ whenever $x_\alpha$ solves the equation $\mathbb{P}[\max(\Gamma^+, \Gamma^-) > x_\alpha] = \alpha$. The critical value $x_\alpha$ is not distribution free, and so the asymptotic $P$-value of the test, $\mathbb{P}_*[\max(\Gamma^+, \Gamma^-) > \widehat{T}]$, is not directly calculable. Hence, we use the bootstrap to estimate the $P$-value:
\[
\mathbb{P}_*[\max(\Gamma^+, \Gamma^-) > \widehat{T}] \approx \mathbb{P}_*[\widehat{T}^* > \widehat{T}],
\]
where $\widehat{T}^* := \max\bigl(\sup_x \Delta^*(x),\ \sup_x(-\Delta^*(x))\bigr)$ with the same $\Delta^*(x)$ as above. In each of the four cases n = 200, 500, 1000, 2000, we simulate 1000 sets of random variables and obtain 1000 values of $\widehat{T}$. For each value of $\widehat{T}$, we then calculate $\mathbb{P}_*[\widehat{T}^* > \widehat{T}]$ using 1000 bootstrap iterations. Hence, for each value of $\widehat{T}$ we obtain a value of $\mathbb{P}_*[\widehat{T}^* > \widehat{T}]$, which is an approximate $P$-value of the test. To visualize the distribution of the 1000 $P$-values for each of the four sample sizes specified above, we produce the histograms in Figure 8.6. At the level of significance $\alpha = 0.2$, our findings show (cf. Table 8.1, row 4.5) that we reject the null of dominance of distributions for n = 1000, 2000, which means that we accept the alternative of intersection of distributions. Also, at the same level of significance, we do not reject the null of dominance for n = 200, 500. Note that if we formulate the null hypothesis as equality of the two cdf's, then the critical values are those of the Kolmogorov-Smirnov test considered earlier. At the level of significance $\alpha = 0.1$, our findings show (cf. Table 8.1, row 4.5$_{KS}$) that we reject the null of equality of distributions for n = 500, 1000, 2000, which means we accept the alternative of intersection of distributions. This note shows that it is important to consider various plausible null hypotheses and to analyze data from various angles.

4.6. Non-linear DGP with selection problem in the second period. The model of this subsection is
\[
X^{(G,1)} = \exp\bigl(2.7 + 0.2(1-G) + (1-G)0.1\varepsilon_1 + G\,0.4\varepsilon_2\bigr)
\]
with the same independent random variables $\varepsilon_1$, $\varepsilon_2$ and $G$ as before. Hence, we have the equations $X^{(G,1)}|_{G=0} = \exp(2.9 + 0.1\varepsilon_1)$ and $X^{(G,1)}|_{G=1} = \exp(2.7 + 0.4\varepsilon_2)$. We use the two formulas $F^{(0,1)}(x) = \Phi\bigl(\frac{\log x - 2.9}{0.1}\bigr)$ and $F^{(1,1)}(x) = \Phi\bigl(\frac{\log x - 2.7}{0.4}\bigr)$ to produce the two graphs in Figure 8.1.f. For this case, our findings show (cf. Table 8.1, row 4.6) that we reject the null of dominance of distributions for n = 500, 1000, 2000 at the level of significance $\alpha = 0.2$, but we do not reject the null of dominance of distributions for n = 200 at the same level of significance. The histograms of $P$-values are given in Figure 8.7. Note that if we formulate the null hypothesis as equality of the two cdf's, then at the level of significance $\alpha = 0.05$ our findings show (cf. Table 8.1, row 4.6$_{KS}$) that we reject the null of equality of distributions for n = 200, 500, 1000, 2000, which means we accept the alternative of intersection of distributions for all sample sizes. In this case we can see that the null hypothesis of equality of distributions is rejected even for (relatively) small samples.
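Before turning to the application, the following sketch illustrates how the simulation designs above can be generated and analyzed, using the linear DGP with a selection problem (Subsection 4.5) as an example. The seed, the use of scipy.stats.ks_2samp for the equality test, and the reference to the intersection_test() sketch from Section 3 are our own illustrative choices; the sketch shows the pattern only and is not meant to reproduce Table 8.1 exactly.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(12345)   # arbitrary seed
n = 2000                             # one of the sample sizes used in Table 8.1

# Linear DGP with a selection problem (Subsection 4.5):
# control: X = 19 + eps1,   treatment: X = 16 + 4*eps2.
control = 19 + rng.standard_normal(n)
treatment = 16 + 4 * rng.standard_normal(n)

# Equality of the two distributions (Kolmogorov-Smirnov). With n = 2000 the two
# simulated distribution functions cross, so this null is typically rejected.
print(ks_2samp(control, treatment))

# The dominance-versus-intersection test can then be applied with the
# intersection_test() sketch from Section 3, e.g.:
# print(intersection_test(control, treatment, alpha=0.20))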
5. Application

5.1. Data. The data we analyze are from the Job Search Assistance (JSA) Demonstration Experiment (cf. Decker et al., 2000). The experiment tested whether the JSA demonstration services would speed up re-employment and reduce the unemployment insurance (UI) benefits claimed by the demonstration participants when workers are encouraged to search more effectively and aggressively for a new job. The demonstration was conducted in the District of Columbia (D.C.) and Florida. The D.C. demonstration operated in a single office and served a targeted sample of claimants from the full D.C. claimant population. Claimant selection occurred between June 1995 and June 1996, and a total of 8,071 claimants were randomly assigned to a control group and three alternative treatment groups. The three service strategies developed for promoting rapid re-employment and reduced UI spells among targeted UI claimants are:

(1) Structured Job Search Assistance (SJSA). Claimants assigned to this treatment were required to participate in an orientation, testing, a job search workshop, and a one-on-one assessment interview. Claimants who failed to participate in any service, unless explicitly excused, could be denied benefits. After completion of the services, claimants were required to have two additional contacts with demonstration staff to report on their job search progress (cf. Decker et al., 2000, p. VII).

(2) Individualized Job Search Assistance (IJSA). This treatment assigned claimants to services based on their assessed needs. All claimants were required to participate in an orientation and a one-on-one assessment interview. During the assessment interview, the claimant and a demonstration staff member developed a service plan to address the claimant's needs. If the service plan included demonstration-specific services, such as testing, a job search workshop, or additional counseling, these services became mandatory (cf. Decker et al., 2000, p. VII).

(3) Individualized Job Search Assistance with Training (IJSA+). This treatment was identical to the second treatment, except for the inclusion of a coordinated effort with local Economic Dislocation and Worker Adjustment Act (EDWAA) staff to enroll interested claimants in training. During the orientation, an EDWAA staff member discussed local opportunities for training. Training opportunities were also discussed during the assessment interview, and any claimant interested in training was scheduled to meet with an EDWAA staff member at the demonstration office (cf. Decker et al., 2000, p. VIII).

We consider applying the tests to the data associated with the SJSA treatment in D.C. because:

(1) The estimates obtained show that the JSA treatments reduced UI receipt significantly over the initial benefit year. The largest impact occurred in D.C. for the SJSA treatment, which reduced average UI receipt by more than a week, or by $182 per claimant (cf. Decker et al., 2000, p. 83).

(2) SJSA increased the rate at which D.C. claimants exited UI throughout the entire potential UI spell. The impact of SJSA is represented by the difference between the exit rates for the SJSA and control groups.
At the five-week mark, the cumulative exit rate for the SJSA group was 17.7%, which was more than 50% higher than the 11.6% rate for the control group. The absolute magnitude of this difference then remained relatively steady over time, even though the SJSA services were received early in the UI spell (cf. Decker et al., 2000, p. 99).

(3) At the same time, SJSA was associated with a modest increase in the likelihood of being employed in each quarter (about 2 to 3 percentage points), and the estimated impacts are statistically significant in about half of the quarters (cf. Decker et al., 2000, p. 138).

The following information can help in identifying a potential selection into the SJSA treatment in D.C. (cf. also Figure 8.8.a):

(1) About 78% of those who did not attend the orientation reported that they had gotten a job before their scheduled orientation (cf. Decker et al., 2000, p. 46).

(2) 5% of claimants in D.C. were excused from the orientation (cf. Decker et al., 2000, p. 48).

(3) About 15% failed to attend the orientation in D.C. Overall, about 82% of those who were not excused attended the orientation (cf. Decker et al., 2000, p. 48).

(4) About 97% of claimants who attended the orientation attended the assessment. Attendance rates at the assessment interview are lower in SJSA (81% for D.C.) than in IJSA and IJSA+ in D.C. (cf. Decker et al., 2000, p. 49).

(5) D.C. staff may not have aggressively assigned claimants to group services because they felt that one-on-one counseling was more effective or more acceptable to demonstration participants, and because of a shortage of resources for providing group services. The D.C. office had difficulty maintaining sufficient staff trained to conduct group services and had a shortage of space for providing group services. In contrast, the D.C. office had ample staff for one-on-one counseling and adequate office space for conducting one-on-one services (cf. Decker et al., 2000, p. 50).

5.2. Empirical results. Looking at the EDFs of the treatment and control groups for the SJSA experiment (cf. Figure 8.8.a), we observe that for lower durations of unemployment the treatment dominates the control (i.e., there is a treatment effect), but for higher durations of unemployment (above 30 weeks) the treatment is dominated by the control group (i.e., there is no treatment effect). Therefore, the treatment is not uniform over the group of treated individuals, and it is possible to observe a change in unobserved heterogeneity in the period t = 1 for the individuals from the treatment group. To test whether there is indeed a change in unobserved heterogeneity for the individuals from the treated group, we perform the test for dominance vs intersection of distributions. We obtain a $P$-value of 0.213 (cf. Table 8.2) for this test. Alternatively, if we use the Kolmogorov-Smirnov statistic to test the equality of distributions against the alternative that the distributions intersect, we obtain a $P$-value of 0.195. The results of the two tests are thus similar. Given that our sample is larger than 2000 observations, and that the simulations show that for n = 2000 the test is very close to its asymptotic behavior (cf. Figure 8.8.b), we conclude that the two distributions intersect at a higher level of significance. We can also conclude that there is a change in unobserved heterogeneity in the treated group at t = 1.
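For concreteness, the computation behind Table 8.2 can be organized along the following lines, reusing the two_sample_tests() and intersection_test() sketches from Section 3. The array and file names are purely hypothetical (the actual JSA data preparation is not described here), so this is only a usage illustration, not the authors' code.

import numpy as np

# Hypothetical file names holding out-of-work durations (in weeks) at t = 1.
control_weeks = np.loadtxt("sjsa_control_durations.txt")
sjsa_weeks = np.loadtxt("sjsa_treatment_durations.txt")

# Equality (Kolmogorov-Smirnov) and dominance-versus-intersection tests.
print(two_sample_tests(control_weeks, sjsa_weeks))   # K_hat and its P-value
print(intersection_test(control_weeks, sjsa_weeks))  # T_hat and its P-value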
6. Conclusions

In this paper we developed the theoretical framework necessary to test whether the heterogeneity distribution of the treatment group in a randomized experiment changes during the experimental period. To make the tests easily implementable in practice, we discuss how to estimate critical values using bootstrap methodology. To assess the performance of the tests, we conduct simulation studies. We then apply the tests to a social experiment data set (the SJSA experiment). The tests identify a change in unobserved heterogeneity in the treated group at t = 1. The change in unobserved heterogeneity for the treated individuals at time t = 1 is strong evidence of selection into the treatment. This selection can be explained by the burden of participating in the treatment felt by the individuals from the lower tail of the distribution (individuals with long spells of unemployment).

References

Abadie, A. (2000), Bootstrap tests for distributional treatment effects in instrumental variable models. NBER Technical Working Papers, National Bureau of Economic Research.

Anderson, G. (1996), Nonparametric tests of stochastic dominance in income distributions. Econometrica, 64, 1183-1193.

Athey, S. and Imbens, G. (2003), Identification and inference in nonlinear difference-in-differences models. NBER Technical Working Paper No. t0280.

Barrett, G.F. and Donald, S.G. (2003), Consistent tests for stochastic dominance. Econometrica, 71, 71-104.

Davidson, R. and Duclos, J.-Y. (2000), Statistical inference for stochastic dominance and for the measurement of poverty and inequality. Econometrica, 68, 1435-1464.

Decker, P.T., Olsen, R.B., Freeman, L. and Klepinger, D.H. (2000), Assisting Unemployment Insurance Claimants: The Long-Term Impacts of the Job Search Assistance Demonstration. W.E. Upjohn Institute for Employment Research, Kalamazoo, MI.

Fraker, T. and Maynard, R. (1987), The adequacy of comparison group designs for evaluations of employment related programs. Journal of Human Resources, 22, 194-227.

Heckman, J. (1992), Randomization and social policy evaluation. In: Evaluating Welfare and Training Programs (eds. C. Manski and I. Garfinkle). Harvard University Press, Cambridge, MA, 201-230.

Heckman, J. (1997), Randomization as an instrumental variables estimator: a study of implicit behavioral assumptions in one widely-used estimator. Journal of Human Resources, 32, 442-462.

Heckman, J. and Hotz, J. (1989), Choosing among alternative nonexperimental methods for estimating the impact of social programs: the case of manpower training. Journal of the American Statistical Association, 84 (408), 862-880.

McFadden, D. (1989), Testing for stochastic dominance. In: Studies in the Economics of Uncertainty (eds. T.B. Fomby and T.K. Seo). Springer-Verlag, New York.

Meyer, B., Viscusi, K. and Durbin, D. (1995), Workers' compensation and injury duration: evidence from a natural experiment. American Economic Review, 85, 322-340.

Schmid, F. and Trede, M. (1996), Testing for first order stochastic dominance in either direction. Computational Statistics, 11, 165-173.

Schmid, F. and Trede, M. (1998), A Kolmogorov-type test for second-order stochastic dominance. Statistics and Probability Letters, 37, 183-193.

Shaked, M. and Shanthikumar, J.G. (1994), Stochastic Orders and their Applications. Academic Press, Boston, MA.

Whang, Y.-J., Linton, O. and Maasoumi, E. (2005), Consistent testing for stochastic dominance under general sampling schemes. Review of Economic Studies, 72.
7. Appendix: technical results and proofs

Theorem 7.1. The statistics $\widehat\kappa$, $\widehat\delta$, $\widehat\theta$, and $\widehat\tau$ are strongly (and thus weakly) consistent estimators of, respectively, $\kappa$, $\delta$, $\theta$, and $\tau$.

The proofs of the above theorem and of those presented below are relegated to the second half of this section. The rest of this appendix is devoted to establishing distributional results for the three estimators $\widehat\kappa$, $\widehat\delta$, and $\widehat\tau$. Throughout, we use the following Gaussian stochastic process:
\[
\Gamma(x) := \sqrt{\eta}\, B_1\bigl(F^{(0,1)}(x)\bigr) - \sqrt{1-\eta}\, B_2\bigl(F^{(1,1)}(x)\bigr),
\]
where $B_1$ and $B_2$ are two independent (standard) Brownian bridges on the interval $[0,1]$. Note that when the two distributions $F^{(0,1)}$ and $F^{(1,1)}$ are equal, $\sup_x |\Gamma(x)|$ is not larger than $\sup_t |B(t)|$, and is exactly $\sup_t |B(t)|$ when the distributions are continuous. The distribution of the random variable $\sup_t |B(t)|$ does not depend on any unknown parameters, has been tabulated, and is known as the Kolmogorov-Smirnov distribution.

Theorem 7.2. Under the null hypothesis $H_0^{(1)}$, we have that
\[
\widehat{K} \to_d \Gamma, \tag{7.1}
\]
where $\Gamma := \sup_x |\Gamma(x)|$. Under the alternative hypothesis $H_1^{(1)}$, we have that
\[
\lim_{n,m\to\infty} \mathbb{P}\Bigl[\sqrt{\tfrac{nm}{n+m}}\,|\widehat\kappa - \kappa| > x\Bigr] \le \mathbb{P}[\Gamma > x]. \tag{7.2}
\]
Hence, the test statistic $\widehat{K}$ tends in probability to $+\infty$ under the alternative.

Theorem 7.3. Under the null hypothesis $H_0^{(2)}$, the test statistic $\widehat{D}$ is such that
\[
\limsup_{n,m\to\infty} \mathbb{P}\bigl[\widehat{D} > x\bigr] \le \mathbb{P}[\Gamma^+ > x], \tag{7.3}
\]
where $\Gamma^+ := \sup_x(\Gamma(x))$. Under the alternative hypothesis $H_1^{(2)}$, we have that
\[
\lim_{n,m\to\infty} \mathbb{P}\Bigl[\sqrt{\tfrac{nm}{n+m}}\,(\widehat\delta - \delta) > x\Bigr] \le \mathbb{P}[\Gamma > x], \tag{7.4}
\]
where $\Gamma$ is the same as in Theorem 7.2. Hence, under the alternative, the test statistic $\widehat{D}$ tends in probability to $+\infty$.

Theorem 7.4. Under the null hypothesis $H_0^{(\text{not } 3)}$, the test statistic $\widehat{T}$ is such that
\[
\limsup_{n,m\to\infty} \mathbb{P}\bigl[\widehat{T} > x\bigr] \le \mathbb{P}[\max(\Gamma^+, \Gamma^-) > x], \tag{7.5}
\]
where $\Gamma^- := \sup_x(-\Gamma(x))$, and $\Gamma^+$ is the same as in Theorem 7.3. Under the alternative hypothesis $H_1^{(\text{not } 3)}$, we have that
\[
\lim_{n,m\to\infty} \mathbb{P}\Bigl[\sqrt{\tfrac{nm}{n+m}}\,(\widehat\tau - \tau) > x\Bigr] \le \mathbb{P}[\Gamma > x]. \tag{7.6}
\]
Hence, the test statistic $\widehat{T}$ tends in probability to $+\infty$ under the alternative.

Proof of Theorem 7.1. The strong consistency of the four estimators follows from the classical Glivenko-Cantelli lemma, which implies that the two suprema
\[
\sup_x \bigl|\widehat{F}^{(0,1)}(x) - F^{(0,1)}(x)\bigr| \quad \text{and} \quad \sup_x \bigl|\widehat{F}^{(1,1)}(x) - F^{(1,1)}(x)\bigr|
\]
converge to 0 almost surely.

Proof of Theorem 7.2. Under the null hypothesis we have $F^{(0,1)}(x) = F^{(1,1)}(x)$ for all $x$, and so $\widehat{K} = \sup_x |\Delta(x)|$, where
\[
\Delta(x) := \sqrt{\frac{nm}{n+m}}\bigl(\widehat{F}^{(0,1)}(x) - F^{(0,1)}(x)\bigr) - \sqrt{\frac{nm}{n+m}}\bigl(\widehat{F}^{(1,1)}(x) - F^{(1,1)}(x)\bigr).
\]
Consequently, $\widehat{K}$ converges in distribution to $\Gamma$. Statement (7.1) is proved. To prove statement (7.2), we first note that under the alternative we have the equality $\widehat{K} = \sup_x |\Xi(x)|$, where
\[
\Xi(x) := \Delta(x) + \sqrt{\frac{nm}{n+m}}\bigl(F^{(0,1)}(x) - F^{(1,1)}(x)\bigr).
\]
Obviously now,
\[
\sqrt{\frac{nm}{n+m}}\,|\widehat\kappa - \kappa| \le \sup_x |\Delta(x)| \to_d \Gamma. \tag{7.7}
\]
Statement (7.2) follows. The last note of Theorem 7.2 follows from statement (7.7) and the fact that $\sqrt{\tfrac{nm}{n+m}}\,\kappa \to \infty$, since $\kappa > 0$ under the alternative.

Proof of Theorem 7.3. We first write the equation $\widehat{D} = \sup_x(\Xi(x))$ with the earlier defined function $\Xi(x)$. Now we note that, under the null hypothesis, $\sup_x(\Xi(x))$ does not exceed $\sup_x(\Delta(x))$, and we already know that the latter quantity converges in distribution to $\Gamma^+$. Statement (7.3) follows. To prove statement (7.4), we write the bound
\[
\sqrt{\frac{nm}{n+m}}\,|\widehat\delta - \delta| \le \sup_x |\Delta(x)|. \tag{7.8}
\]
Statement (7.4) follows. The last note of Theorem 7.3 follows from statement (7.8) and the fact that $\sqrt{\tfrac{nm}{n+m}}\,\delta \to \infty$, since $\delta > 0$ under the alternative.

Proof of Theorem 7.4.
With the earlier defined function $\Xi(x)$, we have that $\widehat{T}$ is the minimum of $\sup_x(\Xi(x))$ and $\sup_x(-\Xi(x))$. Following the arguments in the proof of Theorem 7.3, we have that $\sup_x(\Xi(x)) \le \sup_x(\Delta(x))$ provided that $F^{(0,1)} \le F^{(1,1)}$. If, however, $F^{(0,1)} \ge F^{(1,1)}$, then we analogously prove that $\sup_x(-\Xi(x)) \le \sup_x(-\Delta(x))$. Since we do not know which of the two cases holds, we estimate $\widehat{T}$ from above by the maximum of $\sup_x(\Delta(x))$ and $\sup_x(-\Delta(x))$. Since $\sup_x(\Delta(x)) \to_d \Gamma^+$ and $\sup_x(-\Delta(x)) \to_d \Gamma^-$, statement (7.5) follows. Statement (7.6) follows from the bound
\[
\sqrt{\frac{nm}{n+m}}\,|\widehat\tau - \tau| \le \sup_x |\Delta(x)|. \tag{7.9}
\]
The last note of Theorem 7.4 follows from statement (7.9) and the fact that $\sqrt{\tfrac{nm}{n+m}}\,\tau \to \infty$, since $\tau > 0$ under the alternative.

8. Appendix: tables and figures

Table 8.1. P-values of tests for the equality, dominance and intersection of the control and treatment distributions at t = 1. Linear and non-linear DGPs.

Subsection | n = 200 | n = 500 | n = 1000 | n = 2000
4.1        | 0.165   | 0.525   | 0.839    | 0.914
4.2        | 0.062   | 0.291   | 0.685    | 0.783
4.3        | 0.073   | 0.462   | 0.716    | 0.801
4.4        | 0.003   | 0.222   | 0.723    | 0.808
4.5        | 0.270   | 0.208   | 0.143    | 0.085
4.6        | 0.213   | 0.140   | 0.104    | 0.057
4.5_KS     | 0.146   | 0.096   | 0.023    | 0.005
4.6_KS     | 0.028   | 0.001   | 0.000    | 0.000

Table 8.2. P-values of tests for distributional differences between the control and treatment groups for SJSA at t = 1.

Intersection test P-value: 0.213. Kolmogorov-Smirnov P-value: 0.195.

[Figures 8.1-8.8 are not reproduced here; only their captions, recovered from the extracted text, are listed below.]

Figure 8.1. EDFs as functions of duration for linear models: a) equality of distributions, b) dominance of distributions, c) intersection of distributions. EDFs as functions of duration for non-linear models: d) equality of distributions, e) dominance of distributions, f) intersection of distributions.

Figure 8.2. Histograms of P-values for the equality of distributions. Linear model.

Figure 8.3. Histograms of P-values for the equality of distributions. Non-linear model.

Figure 8.4. Histograms of P-values for the dominance of distributions. Linear model.

Figure 8.5. Histograms of P-values for the dominance of distributions. Non-linear model.
Figure 8.6. Histograms of P-values for the intersection of distributions. Linear model.

Figure 8.7. Histograms of P-values for the intersection of distributions. Non-linear model.

Figure 8.8. a) Treatment and control EDFs in the second period; b) Histogram of P-values for the SJSA treatment.