Who Should Be Treated? Empirical Welfare Maximization Methods for Treatment Choice*

Toru Kitagawa† and Aleksey Tetenov‡
The Institute for Fiscal Studies, Department of Economics, UCL
cemmap working paper CWP10/15
9 March 2015

Abstract

One of the main objectives of empirical analysis of experiments and quasi-experiments is to inform policy decisions that determine the allocation of treatments to individuals with different observable covariates. We propose the Empirical Welfare Maximization (EWM) method, which estimates a treatment assignment policy by maximizing the sample analog of average social welfare over a class of candidate treatment policies. The EWM approach is attractive in terms of both statistical performance and practical implementation in realistic settings of policy design. Common features of these settings include: (i) feasible treatment assignment rules are constrained exogenously for ethical, legislative, or political reasons, (ii) a policy maker wants a simple treatment assignment rule based on one or more eligibility scores in order to reduce the dimensionality of individual observable characteristics, and/or (iii) the proportion of individuals who can receive the treatment is a priori limited due to a budget or a capacity constraint. We show that when the propensity score is known, the average social welfare attained by EWM rules converges at least at rate $n^{-1/2}$ to the maximum obtainable welfare uniformly over a minimally constrained class of data distributions, and that this uniform convergence rate is minimax optimal. In comparison with this benchmark rate, we examine how the uniform convergence rate of the average welfare improves or deteriorates depending on the richness of the class of candidate decision rules, the distribution of conditional treatment effects, and the lack of knowledge of the propensity score.
We provide an asymptotically valid inference procedure for the population welfare gain obtained by exercising the EWM rule. We offer easily implementable algorithms for computing the EWM rule and an application using experimental data from the National JTPA Study.

* We would like to thank Shin Kanaya, Charles Manski, Joerg Stoye, and seminar participants at Academia Sinica, NYU, and University of Virginia for beneficial comments. We would also like to thank the participants at the 2014 Bristol Econometrics Study Group, the 2014 SETA Conference, the Hakone Microeconometrics Conference, and ICEEE 2015 for beneficial comments and discussions. Financial support from the ESRC through the ESRC Centre for Microdata Methods and Practice (CeMMAP) (grant number RES-589-28-0001) is gratefully acknowledged.
† Cemmap/University College London, Department of Economics. Email: t.kitagawa@ucl.ac.uk
‡ Collegio Carlo Alberto. Email: aleksey.tetenov@carloalberto.org

1 Introduction

Treatment effects often vary with observable individual characteristics. An important objective of empirical analysis of experimental and quasi-experimental data is to determine which individuals should be treated based on their observable characteristics. Empirical researchers often use regression estimates of individual treatment effects to infer the set of individuals who benefit or do not benefit from the treatment and to suggest who should be targeted for treatment. This paper advocates the Empirical Welfare Maximization (EWM) method, which offers an alternative way to choose an optimal treatment assignment based on experimental or observational data from program evaluation studies. We study the frequentist properties of the EWM treatment choice rule and show its optimality in terms of the welfare convergence rate, which measures how quickly the average welfare attained by implementing the estimated treatment choice rule converges to the maximal welfare attainable with knowledge of the true data generating process.
We also argue that the EWM approach is well-suited for policy design problems, since it easily accommodates many practical policy concerns, including (i) feasible treatment assignment rules being constrained exogenously for ethical, legislative, or political reasons, (ii) the policy maker facing a budget or capacity constraint that limits the proportion of individuals who can receive one of the treatments, or (iii) the policy maker wanting a simple treatment assignment rule based on one or more indices (eligibility scores) to reduce the dimensionality of individual characteristics.

Let the data be a size-$n$ random sample of $Z_i = (Y_i, D_i, X_i)$, where $X_i \in \mathcal{X} \subset \mathbb{R}^{d_x}$ refers to observable pre-treatment covariates of individual $i$, $D_i \in \{0, 1\}$ is a binary indicator of the individual's treatment assignment, and $Y_i \in \mathbb{R}$ is her/his observed post-treatment outcome. The population from which the sample is drawn is characterized by $P$, a joint distribution of $(Y_{0,i}, Y_{1,i}, D_i, X_i)$, where $Y_{0,i}$ and $Y_{1,i}$ are the potential outcomes that would have been observed if $i$'s treatment status were $D_i = 0$ and $D_i = 1$, respectively. We assume unconfoundedness, meaning that in the data treatments are assigned independently of the potential outcomes $(Y_{0,i}, Y_{1,i})$ conditionally on the observable characteristics $X_i$.

Based on these data, the policy maker has to choose a conditional treatment rule that determines whether individuals with covariates $X$ in a target population will be assigned to treatment 0 or to treatment 1. We restrict our analysis to non-randomized treatment rules. The set of treatment rules can then be indexed by their decision sets $G \subset \mathcal{X}$ of covariate values, which determine the group of individuals $\{X \in G\}$ to whom treatment 1 is assigned. We denote the collection of candidate treatment rules by $\mathcal{G} = \{G \subset \mathcal{X}\}$. The goal of our analysis is to empirically select a treatment assignment rule that gives the highest welfare to the target population.
We assume that the joint distribution of $(Y_{0,i}, Y_{1,i}, X_i)$ in the target population is identical to that in the sampled population.[1] We consider the utilitarian welfare criterion defined by the average of the individual outcomes in the target population. When treatment rule $G$ is applied to the target population, the utilitarian welfare equals

\[
W(G) \equiv E_P\left[ Y_1 \cdot 1\{X \in G\} + Y_0 \cdot 1\{X \notin G\} \right], \tag{1.1}
\]

where $E_P(\cdot)$ is the expectation with respect to $P$. Denoting the conditional mean treatment response by $m_d(x) \equiv E[Y_d \mid X = x]$ and the conditional average treatment effect by $\tau(x) \equiv m_1(x) - m_0(x)$, we can also express the welfare criterion as

\[
W(G) = E_P\left[ m_0(X) \right] + E_P\left[ \tau(X) \cdot 1\{X \in G\} \right]. \tag{1.2}
\]

Assuming unconfoundedness, equivalence of the distributions of $(Y_{0,i}, Y_{1,i}, X_i)$ between the target and sampled populations, and the overlap condition for the propensity score $e(X) = E_P[D \mid X]$ in the sampled population, the welfare criterion (1.1) can be written equivalently as

\[
\begin{aligned}
W(G) &\equiv E_P\left[ \frac{YD}{e(X)} \cdot 1\{X \in G\} + \frac{Y(1-D)}{1-e(X)} \cdot 1\{X \notin G\} \right] \\
&= E_P(Y_0) + E_P\left[ \left( \frac{YD}{e(X)} - \frac{Y(1-D)}{1-e(X)} \right) \cdot 1\{X \in G\} \right].
\end{aligned} \tag{1.3}
\]

Hence, if the probability distribution of the observables $(Y, D, X)$ were fully known to the decision maker, an optimal treatment rule from the utilitarian perspective could be written as

\[
G^* \in \arg\max_{G \in \mathcal{G}} W(G), \tag{1.4}
\]

or, equivalently, as a maximizer of the welfare gain relative to $E_P(Y_0)$:

\[
G^* \in \arg\max_{G \in \mathcal{G}} E_P\left[ \tau(X) \cdot 1\{X \in G\} \right], \quad \text{or} \tag{1.5}
\]
\[
G^* \in \arg\max_{G \in \mathcal{G}} E_P\left[ \left( \frac{YD}{e(X)} - \frac{Y(1-D)}{1-e(X)} \right) \cdot 1\{X \in G\} \right]. \tag{1.6}
\]

The main idea of Empirical Welfare Maximization (EWM) is to solve a sample analog of the population maximization problem (1.4),

\[
\hat{G}_{EWM} \in \arg\max_{G \in \mathcal{G}} W_n(G), \quad \text{where } W_n(G) = E_n\left[ \frac{Y_i D_i}{e(X_i)} \cdot 1\{X_i \in G\} + \frac{Y_i(1-D_i)}{1-e(X_i)} \cdot 1\{X_i \notin G\} \right] \tag{1.7}
\]

and $E_n(\cdot)$ is the sample average.

[1] In Section 4.2, we consider a setting where the target and the sampled populations have identical conditional treatment effects, but different marginal distributions of $X$.
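The sample criterion (1.7) is straightforward to compute once the propensity score is known. The following is a minimal sketch, not the paper's implementation: the data generating process, the one-sided threshold rule class, and the brute-force maximizer are hypothetical stand-ins (Section 5 of the paper uses exact optimization over richer classes instead).

```python
import numpy as np

def empirical_welfare(y, d, x, e, in_G):
    """Sample analog W_n(G) of (1.7): inverse-propensity-weighted
    average outcome under candidate rule G, with known propensity e."""
    g = in_G(x).astype(float)  # indicator 1{X_i in G}
    return np.mean(y * d / e(x) * g + y * (1 - d) / (1 - e(x)) * (1 - g))

def ewm_rule(y, d, x, e, candidate_rules):
    """EWM: pick the rule maximizing W_n(G) over a finite class G."""
    welfares = [empirical_welfare(y, d, x, e, G) for G in candidate_rules]
    return candidate_rules[int(np.argmax(welfares))]

# Hypothetical RCT-like data: scalar X, tau(x) = x, e(x) = 0.5.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
d = rng.integers(0, 2, n)                  # randomized assignment
y = d * x + 0.1 * rng.standard_normal(n)   # treatment helps when x > 0
e = lambda xs: np.full_like(xs, 0.5)
# Candidate class: one-sided threshold rules {x : x >= t}.
thresholds = np.linspace(-1, 1, 41)
rules = [(lambda xs, t=t: xs >= t) for t in thresholds]
best = ewm_rule(y, d, x, e, rules)
```

With $\tau(x) = x$, the welfare-maximizing threshold in this class is 0, so the estimated threshold should land near 0 in moderately large samples.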
One notable feature of our framework is that the class of candidate treatment rules $\mathcal{G} = \{G \subset \mathcal{X}\}$ is not as rich as the class of all subsets of $\mathcal{X}$, and it may not include the first-best decision set,

\[
G^*_{FB} \equiv \{x \in \mathcal{X} : \tau(x) \geq 0\}, \tag{1.8}
\]

which maximizes the population welfare (1.1) when any assignment rule is feasible to implement. Our framework with a constrained class of feasible assignment rules allows us to incorporate several exogenous constraints that generally restrict the complexity of feasible treatment assignment rules. For instance, when assigning treatments to individuals in the target population, it may not be realistic to implement a complex treatment assignment rule due to legal or ethical restrictions or due to public accountability for the treatment eligibility criterion.

The largest welfare that can be obtained by any treatment rule in class $\mathcal{G}$ is

\[
W^*_{\mathcal{G}} \equiv \sup_{G \in \mathcal{G}} W(G). \tag{1.9}
\]

In line with Manski (2004) and the subsequent literature on statistical treatment rules, we evaluate the performance of estimated treatment rules $\hat{G} \in \mathcal{G}$ in terms of their average welfare loss (regret) relative to the maximum feasible welfare $W^*_{\mathcal{G}}$,

\[
W^*_{\mathcal{G}} - E_{P^n}\left[ W(\hat{G}) \right] = E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}) \right] \geq 0, \tag{1.10}
\]

where the expectation $E_{P^n}$ is taken over different realizations of the random sample. This criterion measures the average difference between the best attainable population welfare and the welfare attained by implementing the estimated policy $\hat{G}$. Since we assess the statistical performance of $\hat{G}$ by its welfare value $W(\hat{G})$, we do not require $\arg\max_{G \in \mathcal{G}} W(G)$ to be unique or $\hat{G}$ to converge to a specific set.

Assuming that the propensity score $e(X)$ is known and bounded away from zero and one, as is the case in randomized experiments, we derive a non-asymptotic distribution-free upper bound on $E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \right]$ as a function of the sample size $n$ and a measure of complexity of $\mathcal{G}$.
Based on this bound, we show that the average welfare of the EWM treatment rule converges to $W^*_{\mathcal{G}}$ at rate $O(n^{-1/2})$ uniformly over a minimally constrained class of probability distributions. We also show that this uniform convergence rate of $\hat{G}_{EWM}$ is optimal in the sense that no estimated treatment choice rule of any kind can attain a faster uniform convergence rate than the EWM rule, i.e., minimax rate optimality of $\hat{G}_{EWM}$. To refine this theoretical result further, we analyze how the uniform convergence rate improves if the first-best decision rule $G^*_{FB}$ is feasible, i.e., $G^*_{FB} \in \mathcal{G}$, and if the class of data generating processes is constrained by the margin assumption, which restricts the distribution of conditional treatment effects in a neighborhood of zero. We show that $\hat{G}_{EWM}$ remains minimax rate optimal under these additional restrictions.

When the data are from an observational study, the propensity score is usually unknown, so it is not feasible to implement the EWM rule (1.7). As feasible versions of the EWM rule, we consider hybrid EWM approaches that plug estimators of the regression equations or the propensity score into the sample analogs of (1.5) or (1.6). Specifically, with estimated regression functions $\hat{m}_d(x) = \hat{E}(Y_d \mid X = x) = \hat{E}(Y \mid X = x, D = d)$, we define the m-hybrid rule as

\[
\hat{G}_{m\text{-hybrid}} \in \arg\max_{G \in \mathcal{G}} E_n\left[ \hat{\tau}^m(X_i) \cdot 1\{X_i \in G\} \right], \tag{1.11}
\]

where $\hat{\tau}^m(X_i) \equiv \hat{m}_1(X_i) - \hat{m}_0(X_i)$. Similarly, with the estimated propensity score $\hat{e}(x)$, we define an e-hybrid rule as

\[
\hat{G}_{e\text{-hybrid}} \in \arg\max_{G \in \mathcal{G}} E_n\left[ \hat{\tau}^e_i \cdot 1\{X_i \in G\} \right], \tag{1.12}
\]

where $\hat{\tau}^e_i \equiv \left[ \dfrac{Y_i D_i}{\hat{e}(X_i)} - \dfrac{Y_i(1-D_i)}{1-\hat{e}(X_i)} \right] \cdot 1\{\varepsilon_n \leq \hat{e}(X_i) \leq 1-\varepsilon_n\}$ with a positive sequence $\varepsilon_n \to 0$ as $n \to \infty$. We investigate the performance of these hybrid approaches in terms of the uniform convergence rate of the welfare loss and clarify how this rate is affected by the estimation uncertainty in $\hat{m}_d(\cdot)$ and $\hat{e}(\cdot)$.
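As an illustration of the trimming in (1.12), the sketch below computes the e-hybrid scores and maximizes the sample analog of the welfare gain over a finite rule class. This is only a schematic: the data, the crude frequency estimator standing in for $\hat{e}(\cdot)$, and the candidate rules are hypothetical, and a real application would estimate the propensity score parametrically or nonparametrically.

```python
import numpy as np

def e_hybrid_scores(y, d, x, e_hat, eps_n):
    """Per-observation scores tau^e_i from (1.12): IPW welfare contrasts
    built from an estimated propensity score, with observations dropped
    when the estimated score is within eps_n of 0 or 1."""
    e = e_hat(x)
    keep = (e >= eps_n) & (e <= 1 - eps_n)
    e_safe = np.clip(e, eps_n, 1 - eps_n)  # avoid division blow-ups
    return (y * d / e_safe - y * (1 - d) / (1 - e_safe)) * keep

def e_hybrid_rule(y, d, x, e_hat, eps_n, candidate_rules):
    """Maximize the sample analog of E[tau^e(X) 1{X in G}] over G."""
    tau = e_hybrid_scores(y, d, x, e_hat, eps_n)
    gains = [np.mean(tau * G(x).astype(float)) for G in candidate_rules]
    return candidate_rules[int(np.argmax(gains))]

# Hypothetical usage: RCT-like data with tau(x) = x; e_hat is a crude
# frequency estimate standing in for a genuine propensity estimator.
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-1, 1, n)
d = rng.integers(0, 2, n)
y = d * x + 0.1 * rng.standard_normal(n)
e_hat = lambda xs: np.full_like(xs, d.mean())
rules = [lambda xs: xs >= 0.0, lambda xs: xs >= 0.8, lambda xs: xs > -2.0]
chosen = e_hybrid_rule(y, d, x, e_hat, eps_n=0.05, candidate_rules=rules)
```

Since $E[\tau(X)\,1\{X \geq 0\}] = 1/4$ exceeds the gains of the other two candidate rules here, the rule $\{x \geq 0\}$ should be selected with high probability at this sample size.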
When performing the treatment choice analysis, it may also be of interest to assess the sampling uncertainty of the estimated welfare gain from implementing the treatment rule $\hat{G}$. For this purpose, this paper proposes an inference procedure for $W(\hat{G}) - W(G_0)$, where $G_0$ is a benchmark treatment assignment rule, such as no treatment ($G_0 = \emptyset$) or non-individualized implementation of the treatment ($G_0 = \mathcal{X}$).

Since the welfare criterion function involves optimization over a class of sets, estimation of the EWM and hybrid treatment rules can present challenging computational problems when $\mathcal{G}$ is rich, similarly to maximum score estimation (Manski (1975), Manski and Thompson (1989)). We argue, however, that exact maximization of the EWM criterion is now practically feasible for many problems in economics using widely available optimization software and an approach proposed by Florios and Skouras (2008), which we extend and improve upon. To illustrate EWM in practice, we compare EWM and plug-in treatment rules computed from the experimental data of the National Job Training Partnership Act Study analyzed by Bloom et al. (1997).

1.1 Related Literature

Our paper contributes to a growing literature on statistical treatment rules in econometrics, including Manski (2004), Dehejia (2005), Hirano and Porter (2009), Stoye (2009, 2012), Chamberlain (2011), Bhattacharya and Dupas (2012), Tetenov (2012), and Kasy (2014). Manski (2004) proposes to assess the welfare properties of statistical treatment rules by their maximum regret and derives finite-sample bounds on the maximum regret of Conditional Empirical Success (CES) rules. CES rules take a finite partition of the covariate space and, separately for each set in this partition, assign the treatment that yields the highest sample average outcome. CES rules can be viewed as a type of EWM rule for which $\mathcal{G}$ consists of all unions of the sets in the partition and the empirical welfare criterion uses the sample propensity score.
Manski shows that, with the partition fixed, the welfare regret of CES rules converges to zero at least at rate $n^{-1/2}$. We show that this rate holds for a broader class of EWM rules and that it cannot be improved uniformly without additional restrictions on $P$. Stoye (2009) shows that in the absence of ex-ante restrictions on how outcome distributions vary with covariates, finite-sample minimax regret is attained by rules that take the finest partition of the covariate space and operate independently for each covariate value. This important result implies that with continuous covariates, minimax regret does not converge to zero with sample size, because the first-best treatment rule may be arbitrarily "wiggly" and difficult to approximate from countable data. Our approach does not give rise to Stoye's non-convergence result because we restrict the complexity of $\mathcal{G}$ and define regret relative to the maximum attainable welfare in $\mathcal{G}$ instead of the unconstrained first-best welfare.

The problem of conditional treatment choice has some similarities to the classification problem in machine learning and statistics, since it seeks a way to optimally "classify" individuals into those who benefit from the treatment and those who do not. This similarity allows us to draw on recent theoretical results for classification by Devroye et al. (1996), Tsybakov (2004), Massart and Nédélec (2006), Audibert and Tsybakov (2007), and Kerkyacharian et al. (2014), among others, and to adapt them to the treatment choice problem. The minimax rate optimality of the EWM treatment choice rule (proved in Theorems 2.1 and 2.2 below) is analogous to the minimax rate optimality of the Empirical Risk Minimization classifier in the classification problem shown by Devroye and Lugosi (1995).
There are, however, substantive differences between the treatment choice and classification problems: (1) the observed outcomes are real-valued rather than binary, (2) in treatment choice only one of the potential outcomes is observed for each individual, whereas in classification the correct choice is known for each training sample observation, (3) the EWM criterion depends on the propensity score, which may be unknown (as in observational studies), and (4) policy settings often impose constraints on practicable treatment rules or on the proportion of the population that can be treated. Accommodating these fundamental differences and establishing the convergence rate results for the welfare loss criterion constitute the main theoretical contributions of this paper.

Several works in econometrics consider the plug-in approach to treatment choice using estimated regression equations,

\[
\hat{G}_{plug\text{-}in} = \{x : \hat{\tau}^m(x) \geq 0\}, \quad \hat{\tau}^m(x) = \hat{m}_1(x) - \hat{m}_0(x), \tag{1.13}
\]

where $\hat{m}_d(x)$ is a parametric or nonparametric estimator of $E(Y_d \mid X = x)$. Hirano and Porter (2009) establish local asymptotic minimax optimality of plug-in rules for parametric and semiparametric models of treatment response. Bhattacharya and Dupas (2012) apply nonparametric plug-in rules with an aggregate budget constraint and derive some of their properties. Armstrong and Shen (2014) consider statistical inference for the first-best decision rule $G^*_{FB}$ from the perspective of inference for conditional moment inequalities.

In the empirical practice of program evaluation, researchers assess who should be treated by stratifying the population based on the predicted value of the individual outcome in the absence of treatment and estimating the average causal effects for each stratum using the experimental data. Abadie et al. (2014) point out that a naive implementation of this idea is subject to bias due to the endogenous stratification and provide a method to correct the bias.
To assess effect heterogeneity, estimation and inference for conditional treatment effects based on parametric or nonparametric regressions are often reported, but the stylized output of statistical inference (e.g., confidence intervals, p-values) fails to offer the policy maker direct guidance on what treatment rule to follow. In contrast, our EWM approach offers the policy maker a specific treatment assignment rule designed to maximize social welfare.

By formulating the treatment choice problem as a formal statistical decision problem, a treatment assignment rule can also be obtained by specifying a prior distribution for $P$ and solving for a Bayes decision rule (see Dehejia (2005), Chamberlain (2011), and Kasy (2014) for Bayesian approaches to the treatment choice problem). In contrast to the Bayesian approach, the EWM approach is purely data-driven and does not require a prior distribution over the data generating processes.

The analysis of optimal individualized treatment rules has also received considerable attention in biostatistics. Qian and Murphy (2011) propose a plug-in approach using the treatment response function $\hat{m}_D(X) \equiv D\hat{m}_1(X) + (1-D)\hat{m}_0(X)$ estimated by penalized least squares. If the square loss $E(Y - \hat{m}_D(X))^2$ converges to the minimum $E\left(Y - \left[D\, E(Y_1 \mid X) + (1-D)\, E(Y_0 \mid X)\right]\right)^2$, then the welfare of the plug-in treatment rule converges to the maximum $W(G^*_{FB})$. Assuming that the treatment response functions are linear in the chosen basis, they show a welfare convergence rate of $n^{-1/2}$ or better (with a margin condition). Zhao et al. (2012) propose another surrogate loss function approach that estimates the treatment rule using a Support Vector Machine with a growing basis. This yields welfare convergence rates that depend on the dimension of the covariates, similarly to nonparametric plug-in rules.
These approaches are computationally attractive but cannot be used to choose from a constrained set of treatment rules or under a capacity constraint. Elliott and Lieli (2013) and Lieli and White (2010) also proposed maximizing the sample analog of a utilitarian decision criterion similar to EWM. They consider the problem of forecasting binary outcomes based on observations of $(Y_i, X_i)$, where a forecast leads to a binary decision.

2 Theoretical Properties of EWM

2.1 Setup and Assumptions

Throughout our investigation of the theoretical properties of EWM, we maintain the following assumptions.

Assumption 2.1.
(UCF) Unconfoundedness: $(Y_1, Y_0) \perp D \mid X$.
(BO) Bounded Outcomes: There exists $M < \infty$ such that the support of the outcome variable $Y$ is contained in $[-M/2, M/2]$.
(SO) Strict Overlap: There exists $\kappa \in (0, 1/2)$ such that the propensity score satisfies $e(x) \in [\kappa, 1-\kappa]$ for all $x \in \mathcal{X}$.
(VC) VC-class: The class of decision sets $\mathcal{G}$ has a finite VC-dimension[2] $v < \infty$ and is countable.[3]

The assumption of unconfoundedness (selection on observables) holds if the data are obtained from an experimental study with randomized treatment assignment. In observational studies, unconfoundedness is a non-testable and often controversial assumption.

[2] The VC-dimension of $\mathcal{G}$ is defined as the maximal number of points in $\mathcal{X}$ that can be shattered by $\mathcal{G}$. The VC-dimension is commonly used to measure the complexity of a class of sets in the statistical learning literature (see Vapnik (1998), Dudley (1999, Chapter 4), and van der Vaart and Wellner (1996) for extensive discussions). Note that the VC-dimension is smaller by one than the VC-index used to measure the complexity of a class of sets in empirical process theory, e.g., van der Vaart and Wellner (1996).
[3] Countability of $\mathcal{G}$ is imposed in order to avoid measurability complications in proving our theoretical results.

Our analysis can be applied
The second assumption (BO) implies boundedness of the treatment effects, i.e., PX (|τ (X)| ≤ M ) = 1, where PX is the marginal distribution of X and τ (·) is the conditional treatment effect τ (X) = E (Y1 − Y0 |X). Since the implementation of EWM does not require knowledge of M and unbounded Y is rare in social science, this assumption is innocuous and imposed only for analytical convenience. The third assumption (SO) is a standard assumption in the treatment effect literature. It is satisfied in randomized controlled trials by design, but it may be violated in observational studies if almost all the individuals are in the same group (treatment or control) for some values of X. We let P(M, κ) denote the class of distributions of (Y0 , Y1 , D, X) that satisfy Assumption 2.1 (UCF), (BO), and (SO). The fourth assumption (VC) restricts the complexity of the class of candidate treatment rules G in terms of its VC-dimension. If X has a finite support, then the VC-dimension v of any class G does not exceed the number of support points. If some of X is continuously distributed, Assumption 2.1 (VC) requires G to be smaller than the Borel σ-algebra of X . The following examples illustrate several practically relevant classes of the feasible treatment rules satisfying Assumption 2.1 (VC). Example 2.1. (Linear Eligibility Score) Suppose that a feasible assignment rule is constrained to those that assign the treatment according to an eligibility score. By the eligibility score, we mean a scalar-valued function of the individual’s observed characteristics that determines whether one receives the treatment based on whether the eligibility score exceeds a certain threshold. The main objective of data analysis is therefore to construct an eligibility score that yields a welfaremaximizing treatment rule. Specifically, we assume that the eligibility score is constrained to being linear in a subvector of x ∈ Rdx , xsub ∈ Rdsub , dsub ≤ dx . 
The class of decision sets generated by Linear Eligibility Scores (LES) is defined as

\[
\mathcal{G}_{LES} \equiv \left\{ \left\{ x \in \mathbb{R}^{d_x} : \beta_0 + x_{sub}^T \beta_{sub} \geq 0 \right\} : \left( \beta_0, \beta_{sub}^T \right) \in \mathbb{R}^{d_{sub}+1} \right\}. \tag{2.1}
\]

We accordingly obtain an EWM assignment rule by maximizing

\[
W_n(\beta) \equiv E_n\left[ \frac{Y_i D_i}{e(X_i)} \cdot 1\left\{ \beta_0 + X_{sub,i}^T \beta_{sub} \geq 0 \right\} + \frac{Y_i(1-D_i)}{1-e(X_i)} \cdot 1\left\{ \beta_0 + X_{sub,i}^T \beta_{sub} < 0 \right\} \right]
\]

in $\beta = \left( \beta_0, \beta_{sub}^T \right) \in \mathbb{R}^{d_{sub}+1}$. It is well known that the class of half-spaces spanned by $\left( \beta_0, \beta_{sub}^T \right) \in \mathbb{R}^{d_{sub}+1}$ has VC-dimension $v = d_{sub} + 1$, so Assumption 2.1 (VC) holds. In Section 5, we discuss how to compute $\hat{G}_{EWM}$ when the class of decision sets is given by $\mathcal{G}_{LES}$. A plug-in rule based on a parametric linear regression also selects a treatment rule from $\mathcal{G}_{LES}$, but its welfare does not converge to the maximum welfare $W^*_{\mathcal{G}_{LES}}$ if the regression equations are misspecified, whereas the welfare of $\hat{G}_{EWM}$ always does (as shown in Theorem 2.1 below).

Example 2.2. (Generalized Eligibility Score) Let $f_j(\cdot)$, $j = 1, \ldots, m$, and $g(\cdot)$ be known functions of $x \in \mathbb{R}^{d_x}$. Consider the class of assignment rules generated by Generalized Eligibility Scores (GES),

\[
\mathcal{G}_{GES} \equiv \left\{ \left\{ x \in \mathbb{R}^{d_x} : \sum_{j=1}^{m} \beta_j f_j(x) \geq g(x) \right\} : (\beta_1, \ldots, \beta_m) \in \mathbb{R}^m \right\}.
\]

The class of decision sets $\mathcal{G}_{GES}$ generalizes the linear eligibility score rules (2.1), as it allows for eligibility scores that are nonlinear in $x$; i.e., $\mathcal{G}_{GES}$ can accommodate decision sets that partition the space of covariates by nonlinear boundaries. It can be shown that $\mathcal{G}_{GES}$ has VC-dimension $v = m + 1$ (Theorem 4.2.1 in Dudley (1999)).

Example 2.3. (Intersection Rule of Multiple Eligibility Scores) Consider a situation where there are $L \geq 2$ eligibility scores. Let $\mathcal{G}_{GES,l}$, $l = 1, \ldots, L$, be classes of decision sets such that each of them is generated by contour sets of the $l$-th eligibility score. Suppose that the feasible decision rules are constrained to those that assign the treatment if all $L$ of the individual's eligibility scores exceed their thresholds.
In this case, the class of decision sets is constructed by intersections,

\[
\mathcal{G} \equiv \left\{ \bigcap_{l=1}^{L} G_l : G_l \in \mathcal{G}_{GES,l},\ l = 1, \ldots, L \right\}.
\]

An intersection of a finite number of VC-classes is a VC-class with a finite VC-dimension (Theorem 4.5.4 in Dudley (1999)); thus, Assumption 2.1 (VC) holds for this $\mathcal{G}$. We can also consider a class of treatment rules that assigns the treatment if at least one of the $L$ eligibility scores exceeds its threshold. In this case, instead of intersections, the class of decision sets is formed by the unions of sets from $\{\mathcal{G}_{GES,l}, l = 1, \ldots, L\}$, which is also known to have a finite VC-dimension (Theorem 4.5.4 in Dudley (1999)).

2.2 Uniform Rate Optimality of EWM

To analyze the statistical performance of EWM rules, we focus on a non-asymptotic upper bound on the worst-case welfare loss $\sup_{P \in \mathcal{P}(M,\kappa)} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \right]$ and examine how it depends on the sample size $n$ and the VC-dimension $v$. This finite sample upper bound allows us to assess the uniform convergence rate of the welfare and to examine how the richness (complexity) of the class of candidate decision rules affects the worst-case performance of EWM. The main reason that we focus on the uniform convergence rate rather than a pointwise convergence rate is that the pointwise convergence rate of the welfare loss can vary depending on features of the data distribution and fails to provide a guaranteed learning rate for an optimal policy when no additional assumption, other than Assumption 2.1, is available.

For a heuristic illustration of the derivation of the uniform convergence rate, consider the following inequality, which holds for any $\tilde{G} \in \mathcal{G}$:

\[
\begin{aligned}
W(\tilde{G}) - W(\hat{G}_{EWM}) &= W(\tilde{G}) - W_n(\hat{G}_{EWM}) + W_n(\hat{G}_{EWM}) - W(\hat{G}_{EWM}) \\
&\leq W(\tilde{G}) - W_n(\tilde{G}) + \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right| \qquad (\because\ W_n(\hat{G}_{EWM}) \geq W_n(\tilde{G})) \\
&\leq 2 \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right|.
\end{aligned}
\]

Since this applies to $W(\tilde{G})$ for all $\tilde{G}$, it also applies to $W^*_{\mathcal{G}} = \sup_{\tilde{G} \in \mathcal{G}} W(\tilde{G})$:

\[
W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \leq 2 \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right|.
\]
(2.2)

Therefore, the expected welfare loss can be bounded uniformly in $P$ by a distribution-free upper bound on $E_{P^n}\left( \sup_{G \in \mathcal{G}} |W_n(G) - W(G)| \right)$. Since $W_n(G) - W(G)$ can be seen as a centered empirical process indexed by $G \in \mathcal{G}$, an application of an existing moment inequality for the supremum of centered empirical processes indexed by a VC-class yields the following distribution-free upper bound. A proof, which closely follows the proofs of Theorems 1.16 and 1.17 in Lugosi (2002) for the classification problem, is given in Appendix A.2.

Theorem 2.1. Under Assumption 2.1, we have

\[
\sup_{P \in \mathcal{P}(M,\kappa)} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \right] \leq C_1 \frac{M}{\kappa} \sqrt{\frac{v}{n}},
\]

where $C_1$ is a universal constant defined in Lemma A.4 in Appendix A.1.

This theorem shows that the convergence rate of the worst-case welfare loss of the EWM rule is no slower than $n^{-1/2}$. The upper bound is increasing in the VC-dimension of $\mathcal{G}$, implying that, as the candidate treatment assignment rules become more complex in terms of VC-dimension, $\hat{G}_{EWM}$ tends to overfit the data in the sense that the distribution of the regret $W^*_{\mathcal{G}} - W(\hat{G}_{EWM})$ becomes more and more dispersed, and, with $n$ fixed, this overfitting inflates the average welfare regret.[4]

The next theorem concerns a universal lower bound on the worst-case average welfare loss. It shows that no data-based treatment choice rule can have a uniform convergence rate faster than $n^{-1/2}$.

Theorem 2.2. Suppose that Assumption 2.1 holds and the VC-dimension of $\mathcal{G}$ is $v \geq 2$. Then, for any treatment choice rule $\hat{G}$, as a function of $(Z_1, \ldots, Z_n)$, it holds that

\[
\sup_{P \in \mathcal{P}(M,\kappa)} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}) \right] \geq 4^{-1} \exp\left\{-2\sqrt{2}\right\} M \sqrt{\frac{v-1}{n}} \quad \text{for all } n \geq 16(v-1).
\]

[4] Note that $W^*_{\mathcal{G}}$ weakly increases if a more complex class $\mathcal{G}$ is chosen. Our welfare loss criterion is defined for a specific class $\mathcal{G}$ and does not capture the potential gain in the maximal welfare from the choice of a more complex $\mathcal{G}$.
This theorem, combined with Theorem 2.1, implies that $\hat{G}_{EWM}$ is minimax rate optimal over the class of data generating processes $\mathcal{P}(M, \kappa)$, since the convergence rate of the upper bound on $\sup_{P \in \mathcal{P}(M,\kappa)} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \right]$ agrees with the convergence rate of the universal lower bound. Accordingly, we can conclude that no other data-driven procedure for obtaining a treatment choice rule can outperform $\hat{G}_{EWM}$ in terms of the uniform convergence rate over $\mathcal{P}(M, \kappa)$. It is worth noting that the rate lower bound is uniform in $P$ and does not apply pointwise. Theorem 2.3 shows that EWM rules have faster convergence rates for some distributions. It is also possible that $E_{P^n}\left[ W(\hat{G}) \right] > W^*_{\mathcal{G}}$ for some pairs of $\hat{G}$ and $P$, but this can never hold for all distributions in $\mathcal{P}(M, \kappa)$.[5]

[5] For example, if $\hat{G}$ is a nonparametric plug-in rule and the first-best decision rule $G^*_{FB}$ for distribution $P$ does not belong to $\mathcal{G}$, then the welfare of $\hat{G}$ will exceed $W^*_{\mathcal{G}}$ in sufficiently large samples. However, the uniform lower bound still applies because there exist other distributions for which $E_{P^n}\left[ W(\hat{G}) \right] \leq W^*_{\mathcal{G}} - (n^{-1/2}\text{ bound})$ for the same sample size.

2.3 Rate Improvement by the Margin Assumption

The welfare loss upper bound obtained in Theorem 2.1 can be tightened, and the uniform convergence rate can improve, as we further constrain the class of data generating processes. In this section, we investigate (i) which features of the data generating processes affect the upper bound on the welfare loss of the EWM rule, and (ii) whether or not the EWM rule remains minimax rate optimal under these additional constraints. To this end, we consider imposing the following two assumptions.

Assumption 2.2.
(FB) Correct Specification: The first-best treatment rule $G^*_{FB}$ defined in (1.8) belongs to the class of candidate treatment rules $\mathcal{G}$.
(MA) Margin Assumption: There exist constants $0 < \eta \leq M$ and $0 < \alpha < \infty$ such that

\[
P_X\left( |\tau(X)| \leq t \right) \leq \left( \frac{t}{\eta} \right)^{\alpha}, \quad \forall\, 0 \leq t \leq \eta,
\]

where $M < \infty$ is the constant defined in Assumption 2.1 (BO).

The assumption of correct specification means that the class of feasible policy rules specified by $\mathcal{G}$ contains the unconstrained first-best treatment rule $G^*_{FB}$. This assumption is plausible if, for instance, the policy maker's specification of $\mathcal{G}$ is based on a credible assumption about the shape of the contour set $\{x : \tau(x) \geq 0\}$. It can, on the other hand, be restrictive if the specification of $\mathcal{G}$ comes from exogenous constraints on the feasible policy rules, as in the case of Example 2.1.

The second assumption (MA) concerns the way in which the distribution of the conditional treatment effect $\tau(X)$ behaves in the neighborhood of $\tau(X) = 0$. A similar assumption has been considered in the literature on classification analysis (Mammen and Tsybakov (1999), Tsybakov (2004), among others), and we borrow the term "margin assumption" from Tsybakov (2004). The parameters $\eta$ and $\alpha$ characterize the size of the population with conditional treatment effects close to the margin $\tau(X) = 0$: smaller $\eta$ and $\alpha$ allow more individuals to concentrate in a neighborhood of $\tau(X) = 0$. The next examples illustrate this interpretation of $\eta$ and $\alpha$.

Example 2.4. Suppose that $X$ contains a continuously distributed covariate and that the conditional treatment effect $\tau(X)$ is continuously distributed. If the probability density function of $\tau(X)$ is bounded from above by $p_\tau < \infty$, then the margin assumption holds with $\alpha = 1$ and $\eta = (2p_\tau)^{-1}$.

Example 2.5. Suppose that $X$ is a scalar and follows the uniform distribution on $[-1/2, 1/2]$. Specify the conditional treatment effects to be $\tau(X) = (-X)^3$. In this specification, $\tau(X)$ flattens out at $\tau(X) = 0$, and accordingly, the density function of $\tau(X)$ is unbounded in the neighborhood of $\tau(X) = 0$.
This specification leads to $P_X(|\tau(X)| \leq t) = 2t^{1/3}$, so the margin assumption holds with $\alpha = 1/3$ and $\eta = 1/8$.

Example 2.6. Suppose that the distribution of $X$ is the same as in Example 2.5. Let $h > 0$ and specify $\tau(X)$ as

$$\tau(X) = \begin{cases} X - h & \text{for } X \leq 0, \\ X + h & \text{for } X > 0. \end{cases}$$

This $\tau(X)$ is discontinuous at $X = 0$, and the distribution of $\tau(X)$ has zero probability mass around the margin $\tau(X) = 0$. It holds that

$$P_X\left(|\tau(X)| \leq t\right) = \begin{cases} 0 & \text{for } t \leq h, \\ t - h & \text{for } h < t \leq \frac{1}{2} + h. \end{cases}$$

By setting $\eta = h$, the margin condition holds for arbitrarily large $\alpha$. In general, if the distribution of $\tau(X)$ has a gap around the margin $\tau(X) = 0$, the margin condition holds with arbitrarily large $\alpha$.

From now on, we denote the class of $P$ satisfying Assumptions 2.1 and 2.2 by $\mathcal{P}_{FB}(M,\kappa,\eta,\alpha)$.^6 The next theorem provides the upper bound on the welfare loss of the EWM rule when the class of data distributions is constrained to $\mathcal{P}_{FB}(M,\kappa,\eta,\alpha)$.

Theorem 2.3. Under Assumptions 2.1 and 2.2,

$$\sup_{P \in \mathcal{P}_{FB}(M,\kappa,\eta,\alpha)} E_{P^n}\left[W(G_{FB}^{*}) - W(\hat{G}_{EWM})\right] \leq c \left(\frac{v}{n}\right)^{\frac{1+\alpha}{2+\alpha}}$$

holds for all $n$, where $c$ is a positive constant that depends only on $M$, $\kappa$, $\eta$, and $\alpha$.

Similarly to Theorem 2.1, the presented welfare loss upper bound is non-asymptotic, and it is valid for every sample size. Our derivation of this theorem can be seen as an extension of the finite sample risk bound for the classification error shown in Theorem 2 of Massart and Nédélec (2006). Our rate upper bound is consistent with the uniform convergence rate of the classification risk of the empirical risk minimizing classifier shown in Theorem 1 of Tsybakov (2004).^7 This coincidence is somewhat expected, given that the empirical welfare criterion that the EWM rule maximizes resembles the empirical classification risk in the classification problem.
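The closed-form margin calculation in Example 2.5 can be verified numerically. The sketch below (simulation design taken from Example 2.5; the sample size and tolerances are arbitrary choices) checks that $P_X(|\tau(X)| \leq t) = 2t^{1/3}$ and that the margin bound with $\alpha = 1/3$ and $\eta = 1/8$ holds.

```python
import numpy as np

# Monte Carlo check of Example 2.5: X ~ Uniform[-1/2, 1/2], tau(X) = (-X)^3,
# for which P_X(|tau(X)| <= t) = 2 t^(1/3); the margin assumption then holds
# with alpha = 1/3 and eta = 1/8, since (t/eta)^alpha = (8t)^(1/3) = 2 t^(1/3).
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=1_000_000)
tau = (-X) ** 3

alpha, eta = 1 / 3, 1 / 8
for t in (0.01, 0.05, 0.1):
    empirical = np.mean(np.abs(tau) <= t)   # simulated P_X(|tau(X)| <= t)
    bound = (t / eta) ** alpha              # margin-assumption bound
    assert abs(empirical - 2 * t ** (1 / 3)) < 5e-3
    assert empirical <= bound + 5e-3
```

The same experiment with the gapped design of Example 2.6 would return an empirical probability of exactly zero for every $t \leq h$, which is why any $\alpha$ works there.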
The next theorem shows that the uniform convergence rate of $n^{-\frac{1+\alpha}{2+\alpha}}$ obtained in Theorem 2.3 attains the minimax rate lower bound, implying that no treatment choice rule $\hat{G}$ based on the data (including $\hat{G}_{EWM}$) can attain a uniform convergence rate faster than $n^{-\frac{1+\alpha}{2+\alpha}}$. This means that the EWM rule remains rate optimal even when the class of data generating processes is additionally constrained by Assumption 2.2.

^6 Note that $\mathcal{P}_{FB}(M,\kappa,\eta,\alpha)$ depends on the set of feasible treatment rules $\mathcal{G}$ via Assumption 2.2 (FB).
^7 Tsybakov (2004) defines the complexity of the decision sets $\mathcal{G}$ in terms of the growth coefficient $\rho$ of the bracketing number of $\mathcal{G}$. We control the complexity of $\mathcal{G}$ in terms of the VC-dimension, which corresponds to Tsybakov's growth coefficient $\rho$ being arbitrarily close to zero.

Theorem 2.4. Suppose Assumptions 2.1 and 2.2 hold. Assume that the VC-dimension of $\mathcal{G}$ satisfies $v \geq 2$. Then, for any treatment choice rule $\hat{G}$ as a function of $(Z_1, \dots, Z_n)$, it holds that

$$\sup_{P \in \mathcal{P}_{FB}(M,\kappa,\eta,\alpha)} E_{P^n}\left[W(G_{FB}^{*}) - W(\hat{G})\right] \geq 2^{-\frac{2(1+\alpha)}{2+\alpha}} \exp\left\{-2\sqrt{2}\right\} M \eta^{-\frac{\alpha}{2+\alpha}} \left(\frac{v-1}{n}\right)^{\frac{1+\alpha}{2+\alpha}}$$

for all $n \geq \max\left\{(M/\eta)^2,\ 4^{2+\alpha}(v-1)\right\}$.

The following remarks summarize some analytical insights associated with Theorems 2.1–2.4.

Remark 2.1. The convergence rates of the worst-case EWM welfare loss obtained in Theorems 2.1 and 2.3 highlight how the margin coefficient $\alpha$ influences the uniform performance of the EWM rule. A higher $\alpha$ improves the welfare loss convergence rate of EWM, and the convergence rate approaches $n^{-1}$ in the extreme case where the distribution of $\tau(X)$ has a gap around $\tau(X) = 0$. As fewer individuals are near the margin $\tau(X) = 0$, we can attain the maximal welfare more quickly. Conversely, as $\alpha$ approaches zero (more individuals near the margin), the welfare loss convergence rate of EWM approaches $n^{-1/2}$, which corresponds to the uniform convergence rate of Theorem 2.1.

Remark 2.2.
The upper bounds on the welfare loss convergence rate shown in Theorems 2.1 and 2.3 are increasing in the VC-dimension of $\mathcal{G}$. Since they are valid at every $n$, we can allow the VC-dimension of the candidate treatment rules to grow with the sample size. For instance, if we consider a sequence of candidate decision sets $\{\mathcal{G}_n : n = 1, 2, \dots\}$ whose VC-dimension grows with the sample size at rate $n^{\lambda}$, $0 < \lambda < 1$, then Theorems 2.1 and 2.3 imply that the welfare loss uniform convergence rate of the EWM rule slows down to $n^{-\frac{1-\lambda}{2}}$ for the case without Assumption 2.2 and to $n^{-\frac{(1-\lambda)(1+\alpha)}{2+\alpha}}$ for the case with Assumption 2.2. Note that the welfare loss lower bounds shown in Theorems 2.2 and 2.4 have VC-dimensions of the same order as the corresponding upper bounds, so we can conclude that the EWM rule is minimax rate optimal even in situations where the complexity of $\mathcal{G}$ grows with the sample size.

Remark 2.3. Note that the welfare loss lower bounds of Theorems 2.2 and 2.4 are valid for any estimated treatment choice rule $\hat{G}$, irrespective of whether $\hat{G}$ is constrained to $\mathcal{G}$ or not. Therefore, the nonparametric plug-in rule $\hat{G}_{plug\text{-}in}$ defined in (1.13) is subject to the same lower bounds. In Section 4.3, we further discuss the welfare loss uniform convergence rate of the nonparametric plug-in rule.

Remark 2.4. Let $\mathcal{P}_{FB}(M,\kappa)$ be the class of data generating processes that satisfy Assumption 2.1 and Assumption 2.2 (FB). A close inspection of the proofs of Theorems 2.1 and 2.2 given in Appendix A.2 shows that the same lower and upper bounds of Theorems 2.1 and 2.2 can be obtained even when $\mathcal{P}(M,\kappa)$ is replaced with $\mathcal{P}_{FB}(M,\kappa)$. In this sense, Assumption 2.2 (MA) plays the main role in improving the welfare loss convergence rate.

2.4 Unknown Propensity Score

We have so far considered situations where the true propensity score is known.
This would not be the case if the data were obtained from an observational study, in which the assignment of treatment is generally not under the control of the experimenter. To cope with an unknown propensity score, this section considers two hybrids of the EWM approach and the parametric/nonparametric plug-in approach: the m-hybrid rule defined in (1.11) and the e-hybrid rule defined in (1.12). The e-hybrid rule employs the trimming rule $1\{\varepsilon_n \leq \hat{e}(X_i) \leq 1-\varepsilon_n\}$ with a deterministic sequence $\{\varepsilon_n : n = 1, 2, \dots\}$, which we assume converges to zero at least at some polynomial rate, $\varepsilon_n \leq O(n^{-a})$, $a > 0$.^8

Let $W_n^{\tau}(G)$ be the sample analogue of the welfare criterion (1.2) that one would construct if the true regression equations were known,

$$W_n^{\tau}(G) \equiv E_n\left[m_0(X_i)\right] + E_n\left[\tau(X_i) \cdot 1\{X_i \in G\}\right],$$

and let $\hat{W}_n^{\tau}(G)$ be the empirical welfare with the conditional treatment effect estimator $\hat{\tau}^m(\cdot)$ plugged in,

$$\hat{W}_n^{\tau}(G) \equiv E_n\left[m_0(X_i) + \hat{\tau}^m(X_i)\, 1\{X_i \in G\}\right]. \tag{2.3}$$

Since the m-hybrid rule maximizes $\hat{W}_n^{\tau}(\cdot)$, it holds that $\hat{W}_n^{\tau}(\hat{G}_{m\text{-}hybrid}) - \hat{W}_n^{\tau}(\tilde{G}) \geq 0$ for any $\tilde{G} \in \mathcal{G}$. The following inequalities therefore follow:

$$\begin{aligned}
W(\tilde{G}) - W(\hat{G}_{m\text{-}hybrid})
&\leq \left[W_n^{\tau}(\tilde{G}) - \hat{W}_n^{\tau}(\tilde{G})\right] - \left[W_n^{\tau}(\hat{G}_{m\text{-}hybrid}) - \hat{W}_n^{\tau}(\hat{G}_{m\text{-}hybrid})\right] \\
&\quad + W(\tilde{G}) - W(\hat{G}_{m\text{-}hybrid}) - W_n^{\tau}(\tilde{G}) + W_n^{\tau}(\hat{G}_{m\text{-}hybrid}) \qquad (2.4) \\
&= \frac{1}{n}\sum_{i=1}^{n}\left[\tau(X_i) - \hat{\tau}^m(X_i)\right]\left[1\{X_i \in \tilde{G}\} - 1\{X_i \in \hat{G}_{m\text{-}hybrid}\}\right] \\
&\quad + W(\tilde{G}) - W_n^{\tau}(\tilde{G}) + W_n^{\tau}(\hat{G}_{m\text{-}hybrid}) - W(\hat{G}_{m\text{-}hybrid}) \\
&\leq \frac{1}{n}\sum_{i=1}^{n}\left|\hat{\tau}^m(X_i) - \tau(X_i)\right| + 2\sup_{G \in \mathcal{G}}\left|W_n^{\tau}(G) - W(G)\right|.
\end{aligned}$$

This implies that the average welfare loss of the m-hybrid rule can be bounded by

$$E_{P^n}\left[W_{\mathcal{G}}^{*} - W(\hat{G}_{m\text{-}hybrid})\right] \leq E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat{\tau}^m(X_i) - \tau(X_i)\right|\right] + 2E_{P^n}\left[\sup_{G \in \mathcal{G}}\left|W_n^{\tau}(G) - W(G)\right|\right]. \tag{2.5}$$

^8 The trimming sequence $\varepsilon_n$ is introduced only to simplify the derivation of the rate upper bound for the welfare loss. In practical terms, if the overlap condition is well satisfied in the given data, trimming is not necessary for computing the e-hybrid rule.
For the e-hybrid rule, replacing $W_n^{\tau}(\cdot)$ and $\hat{W}_n^{\tau}(\cdot)$ in (2.4) with the empirical welfare $W_n(\cdot)$ defined in (1.7) and

$$\hat{W}_n(G) \equiv E_n\left[\frac{Y_i(1-D_i)}{1-\hat{e}(X_i)} + \hat{\tau}_i^{e} \cdot 1\{X_i \in G\}\right],$$

respectively, yields a similar upper bound,

$$E_{P^n}\left[W_{\mathcal{G}}^{*} - W(\hat{G}_{e\text{-}hybrid})\right] \leq E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat{\tau}_i^{e} - \tau_i\right|\right] + 2E_{P^n}\left[\sup_{G \in \mathcal{G}}\left|W_n(G) - W(G)\right|\right], \tag{2.6}$$

where $\tau_i = \frac{Y_i D_i}{e(X_i)} - \frac{Y_i(1-D_i)}{1-e(X_i)}$. Since the uniform convergence rate of $E_{P^n}\big[\sup_{G \in \mathcal{G}}|W_n^{\tau}(G) - W(G)|\big]$ is the same as that of $E_{P^n}\big[\sup_{G \in \mathcal{G}}|W_n(G) - W(G)|\big]$,^9 these upper bounds imply that the lack of knowledge of the propensity score may harm the welfare loss convergence rate if the average estimation error of the conditional treatment effect converges more slowly than $E_{P^n}\big[\sup_{G \in \mathcal{G}}|W_n(G) - W(G)|\big]$ does. It is therefore convenient to first state a condition on the convergence rate of the average estimation error of the conditional treatment effect estimators.

Condition 2.1.
(m) (m-hybrid case): Let $\hat{\tau}^m(x) = \hat{m}_1(x) - \hat{m}_0(x)$ be an estimator of the conditional treatment effect $\tau(x) = m_1(x) - m_0(x)$. For a class of data generating processes $\mathcal{P}_m$, there exists a sequence $\psi_n \to \infty$ such that

$$\limsup_{n \to \infty}\ \sup_{P \in \mathcal{P}_m}\ \psi_n E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat{\tau}^m(X_i) - \tau(X_i)\right|\right] < \infty \tag{2.7}$$

holds.
(e) (e-hybrid case): Let

$$\hat{\tau}_i^{e} = \left[\frac{Y_i D_i}{\hat{e}(X_i)} - \frac{Y_i(1-D_i)}{1-\hat{e}(X_i)}\right] \cdot 1\{\varepsilon_n \leq \hat{e}(X_i) \leq 1-\varepsilon_n\}$$

be an estimator of $\tau_i = \frac{Y_i D_i}{e(X_i)} - \frac{Y_i(1-D_i)}{1-e(X_i)}$, where $\hat{e}(\cdot)$ is an estimated propensity score. For a class of data generating processes $\mathcal{P}_e$, there exists a sequence $\phi_n \to \infty$ such that

$$\limsup_{n \to \infty}\ \sup_{P \in \mathcal{P}_e}\ \phi_n E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat{\tau}_i^{e} - \tau_i\right|\right] < \infty. \tag{2.8}$$

In Appendix B, we show that the estimators $\hat{\tau}^m(\cdot)$ and $\hat{\tau}_i^{e}$ constructed via local polynomial regressions satisfy this condition for a certain class of data generating processes.

^9 This claim follows by applying the proof of Theorem 2.1 to the following class of functions: $\mathcal{F}^{\tau} \equiv \{f(X_i; G) \equiv m_0(X_i) + \tau(X_i) \cdot 1\{X_i \in G\} : G \in \mathcal{G}\}$.
$\mathcal{F}^{\tau}$ is a VC-subgraph class with VC-dimension at most $v$ by Lemma A.1 in Appendix A.1.

Theorems 2.5 and 2.6 below derive uniform convergence rate bounds for the hybrid rules in two different scenarios. In Theorem 2.5, we constrain the class of data generating processes only by Assumption 2.1 and Condition 2.1, and, importantly, we allow the class of decision rules $\mathcal{G}$ to exclude the first-best rule $G_{FB}^{*}$. Theorem 2.5 follows as a corollary of Theorem 2.1 and inequalities (2.5) and (2.6), so we omit its proof.

Theorem 2.5. Suppose Assumption 2.1 holds.
(m) (m-hybrid case): Given a class of data generating processes $\mathcal{P}_m$, if an estimator $\hat{\tau}^m(\cdot)$ of the conditional treatment effect satisfies Condition 2.1 (m), then

$$\sup_{P \in \mathcal{P}_m \cap \mathcal{P}(M,\kappa)} E_{P^n}\left[W_{\mathcal{G}}^{*} - W(\hat{G}_{m\text{-}hybrid})\right] \leq O\left(\psi_n^{-1} \vee n^{-1/2}\right).$$

(e) (e-hybrid case): Given a class of data generating processes $\mathcal{P}_e$, if an estimator $\hat{e}(\cdot)$ of the propensity score satisfies Condition 2.1 (e), then

$$\sup_{P \in \mathcal{P}_e \cap \mathcal{P}(M,\kappa)} E_{P^n}\left[W_{\mathcal{G}}^{*} - W(\hat{G}_{e\text{-}hybrid})\right] \leq O\left(\phi_n^{-1} \vee n^{-1/2}\right).$$

A comparison of Theorem 2.5 with Theorem 2.1 shows that the uniform rate upper bounds for the hybrid EWM rules are no faster than the welfare loss convergence rate of the EWM rule with known propensity score. Note that if a nonparametric estimator is used to estimate $\tau(\cdot)$ or $e(\cdot)$, the $\psi_n$ or $\phi_n$ specified in Condition 2.1 is generally slower than $n^{1/2}$. Hence, the welfare loss upper bounds of the hybrid rules are determined by the nonparametric rate $\psi_n^{-1}$ or $\phi_n^{-1}$. A special case in which the estimation of $\tau(\cdot)$ or $e(\cdot)$ does not affect the uniform convergence rate is when $\tau(\cdot)$ or $e(\cdot)$ is assumed to belong to a parametric family and is estimated parametrically, i.e., when $\psi_n$ or $\phi_n$ equals $n^{1/2}$.

In the second scenario, we consider the case where $\mathcal{G}$ contains the first-best decision rule $G_{FB}^{*}$ and the data generating processes are constrained further by the margin assumption (Assumption 2.2) with margin coefficient $\alpha \in (0, 1]$.

Theorem 2.6.
Suppose Assumptions 2.1 and 2.2 hold with a margin coefficient $\alpha \in (0, 1]$. Assume that a stronger version of Condition 2.1 holds, where (2.7) and (2.8) are replaced by

$$\limsup_{n \to \infty}\ \sup_{P \in \mathcal{P}_m} E_{P^n}\left[\tilde{\psi}_n \max_{1 \leq i \leq n}\left|\hat{\tau}^m(X_i) - \tau(X_i)\right|^2\right] < \infty \quad \text{and} \tag{2.9}$$

$$\limsup_{n \to \infty}\ \sup_{P \in \mathcal{P}_e} E_{P^n}\left[\tilde{\phi}_n \max_{1 \leq i \leq n}\left|\hat{\tau}_i^{e} - \tau_i\right|^2\right] < \infty, \tag{2.10}$$

for sequences $\tilde{\psi}_n \to \infty$ and $\tilde{\phi}_n \to \infty$, respectively. Then, we have

$$\sup_{P \in \mathcal{P}_m \cap \mathcal{P}_{FB}(M,\kappa,\alpha,\eta)} E_{P^n}\left[W(G_{FB}^{*}) - W(\hat{G}_{m\text{-}hybrid})\right] \leq O\left(\tilde{\psi}_n^{-(1+\alpha)} \vee n^{-\frac{1+\alpha}{2+\alpha}} \log \tilde{\psi}_n\right),$$

$$\sup_{P \in \mathcal{P}_e \cap \mathcal{P}_{FB}(M,\kappa,\alpha,\eta)} E_{P^n}\left[W(G_{FB}^{*}) - W(\hat{G}_{e\text{-}hybrid})\right] \leq O\left(\tilde{\phi}_n^{-(1+\alpha)} \vee n^{-\frac{1+\alpha}{2+\alpha}} \log \tilde{\phi}_n\right).$$

Theorem 2.6 shows that even when $\tau(\cdot)$ or $e(\cdot)$ has to be estimated, the margin coefficient $\alpha$ influences the rate upper bound on the welfare loss. A higher $\alpha$ leads to a faster rate of welfare loss convergence regardless of whether $\tau(\cdot)$ and $e(\cdot)$ are estimated parametrically or nonparametrically. In the situation where $\tau(\cdot)$ or $e(\cdot)$ is estimated parametrically (with a compact support of $X$), $\tilde{\psi}_n$ or $\tilde{\phi}_n$ equals $n^{1/2}$; thus, the uniform welfare loss convergence rate is given by the second argument in $O(\cdot)$, $n^{-\frac{1+\alpha}{2+\alpha}}$ up to a logarithmic factor. On the other hand, when $\tau(\cdot)$ or $e(\cdot)$ is estimated nonparametrically, which of the two terms in $O(\cdot)$ converges more slowly depends on the dimension of $X$ and the degree of smoothness of the underlying nonparametric function. See Corollaries 2.1 and 2.2 below for specific expressions of $\tilde{\psi}_n$ and $\tilde{\phi}_n$ when local polynomial regressions are used to estimate $\tau(\cdot)$ or $e(\cdot)$.

Note that Theorems 2.5 and 2.6 concern upper bounds on the convergence rate. We do not have universal rate lower bound results for these constrained classes of data generating processes. We leave the investigation of the sharp rate bounds for the hybrid-EWM welfare loss to future research.
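To fix ideas about the empirical welfare criterion that the EWM and hybrid rules maximize, the following minimal sketch computes the inverse-propensity-weighted welfare gain on simulated experimental data with a known propensity score $e(x) = 1/2$ and maximizes it over a hypothetical one-dimensional class of threshold rules $G = \{x \geq c\}$. The data generating process ($\tau(x) = x$) and the candidate class are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Simulated RCT: e(x) = 1/2, tau(x) = x (illustrative design)
rng = np.random.default_rng(1)
n = 20_000
X = rng.uniform(-1, 1, size=n)
e = 0.5
D = rng.binomial(1, e, size=n)
Y = 1.0 + D * X + rng.normal(0.0, 0.1, size=n)

# IPW welfare score: tau_i = Y_i D_i / e(X_i) - Y_i (1 - D_i) / (1 - e(X_i))
tau_score = Y * D / e - Y * (1 - D) / (1 - e)

def empirical_welfare_gain(c):
    """V_n(G) = W_n(G) - W_n(empty set) for the threshold rule G = {x >= c}."""
    return np.mean(tau_score * (X >= c))

thresholds = np.linspace(-1, 1, 201)
c_hat = thresholds[int(np.argmax([empirical_welfare_gain(c) for c in thresholds]))]
# Since tau(x) = x here, the first-best threshold is 0, and c_hat should be near it.
```

The m- and e-hybrid rules differ from this sketch only in that the score is built from estimated regression functions or an estimated (trimmed) propensity score rather than the known one.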
2.4.1 Hybrid EWM with Local Polynomial Estimators

In this subsection, we focus on local polynomial estimators for $\tau(x)$ and $e(x)$, and spell out classes of data generating processes $\mathcal{P}_m$ and $\mathcal{P}_e$, as well as $\psi_n$, $\tilde{\psi}_n$, $\phi_n$, and $\tilde{\phi}_n$, that satisfy Condition 2.1 and the assumption of Theorem 2.6.

Consider the m-hybrid approach in which leave-one-out local polynomial estimators are used to estimate $m_1(X_i)$ and $m_0(X_i)$, i.e., $\hat{m}_1(X_i)$ and $\hat{m}_0(X_i)$ are constructed by fitting local polynomials excluding the $i$-th observation.^10 For any multi-index $s = (s_1, \dots, s_{d_x}) \in \mathbb{N}^{d_x}$ and any $x = (x_1, \dots, x_{d_x}) \in \mathbb{R}^{d_x}$, we define $|s| \equiv \sum_{i=1}^{d_x} s_i$, $s! \equiv s_1! \cdots s_{d_x}!$, $x^s \equiv x_1^{s_1} \cdots x_{d_x}^{s_{d_x}}$, and $\|x\| \equiv (x_1^2 + \cdots + x_{d_x}^2)^{1/2}$. Let $K(\cdot) : \mathbb{R}^{d_x} \to \mathbb{R}$ be a kernel function and $h > 0$ a bandwidth. At each $X_i$, $i = 1, \dots, n$, we define the leave-one-out local polynomial coefficient estimators of degree $l \geq 0$ as

$$\hat{\theta}_1(X_i) = \arg\min_{\theta} \sum_{j \neq i,\, D_j = 1}\left[Y_j - \theta^T U\!\left(\frac{X_j - X_i}{h}\right)\right]^2 K\!\left(\frac{X_j - X_i}{h}\right),$$

$$\hat{\theta}_0(X_i) = \arg\min_{\theta} \sum_{j \neq i,\, D_j = 0}\left[Y_j - \theta^T U\!\left(\frac{X_j - X_i}{h}\right)\right]^2 K\!\left(\frac{X_j - X_i}{h}\right),$$

where $U\!\left(\frac{X_j - X_i}{h}\right)$ is the vector with elements indexed by the multi-index $s$, i.e., $U\!\left(\frac{X_j - X_i}{h}\right) \equiv \left(\left(\frac{X_j - X_i}{h}\right)^{s}\right)_{|s| \leq l}$.^11 With a slight abuse of notation, we define $U(0) = (1, 0, \dots, 0)^T$. Let $\lambda_{n,1}(X_i)$ be the smallest eigenvalue of

$$B_1(X_i) \equiv \left(nh^{d_x}\right)^{-1} \sum_{j \neq i,\, D_j = 1} U\!\left(\frac{X_j - X_i}{h}\right) U^T\!\left(\frac{X_j - X_i}{h}\right) K\!\left(\frac{X_i - X_j}{h}\right),$$

and let $\lambda_{n,0}(X_i)$ be the smallest eigenvalue of the matrix $B_0(X_i)$ defined analogously with the sum taken over $\{j : j \neq i,\, D_j = 0\}$.

^10 The reason to consider leave-one-out fitted values is to simplify the analytical verification of Condition 2.1. We believe that the welfare loss convergence rates of the hybrid approaches would not be affected even if the $i$-th observation were included in estimating $\hat{m}_1(X_i)$ and $\hat{m}_0(X_i)$.
Accordingly, we construct the leave-one-out local polynomial fits for $m_1(X_i)$ and $m_0(X_i)$ as

$$\hat{m}_1(X_i) = U^T(0)\hat{\theta}_1(X_i) \cdot 1\{\lambda_{n,1}(X_i) \geq t_n\}, \qquad \hat{m}_0(X_i) = U^T(0)\hat{\theta}_0(X_i) \cdot 1\{\lambda_{n,0}(X_i) \geq t_n\},$$

where $t_n$ is a positive sequence that converges slowly to zero, such as $t_n \propto (\log n)^{-1}$. These trimming rules regularize the regressor matrices of the local polynomial regressions and simplify the proof of the uniform consistency of the local polynomial estimators.

To characterize $\mathcal{P}_m$ in Condition 2.1, we impose the following restrictions.

Assumption 2.3.
(Smooth-m) Smoothness of the Regressions: The regression equations $m_1(\cdot)$ and $m_0(\cdot)$ belong to a Hölder class of functions with degree $\beta_m \geq 1$ and constant $L_m < \infty$.^12
(PX) Support and Density Restrictions on $P_X$: Let $\mathcal{X} \subset \mathbb{R}^{d_x}$ be the support of $P_X$, and let $\mathrm{Leb}(\cdot)$ be the Lebesgue measure on $\mathbb{R}^{d_x}$. There exist constants $c$ and $r_0$ such that

$$\mathrm{Leb}\left(\mathcal{X} \cap B(x, r)\right) \geq c\,\mathrm{Leb}\left(B(x, r)\right), \qquad \forall\, 0 < r \leq r_0,\ \forall x \in \mathcal{X}, \tag{2.11}$$

and $P_X$ has a density $\frac{dP_X}{dx}(\cdot)$ with respect to the Lebesgue measure on $\mathbb{R}^{d_x}$ that is bounded from above and bounded away from zero: $0 < \underline{p}_X \leq \frac{dP_X}{dx}(x) \leq \bar{p}_X < \infty$ for all $x \in \mathcal{X}$.

^11 We specify the same bandwidth for the two local polynomial regressions only to reduce the notational burden.
^12 Let $D^s$ denote the differential operator $D^s \equiv \frac{\partial^{s_1 + \cdots + s_{d_x}}}{\partial x_1^{s_1} \cdots \partial x_{d_x}^{s_{d_x}}}$. Let $\beta \geq 1$ be an integer. For any $x \in \mathbb{R}^{d_x}$ and any $(\beta - 1)$ times continuously differentiable function $f : \mathbb{R}^{d_x} \to \mathbb{R}$, we denote the Taylor polynomial of degree $(\beta - 1)$ at point $x$ by $f_x(x') \equiv \sum_{|s| \leq \beta - 1} \frac{(x' - x)^s}{s!} D^s f(x)$. Let $L > 0$. The Hölder class of functions on $\mathbb{R}^{d_x}$ with degree $\beta$ and constant $0 < L < \infty$ is defined as the set of functions $f : \mathbb{R}^{d_x} \to \mathbb{R}$ that are $(\beta - 1)$ times continuously differentiable and satisfy, for any $x$ and $x'$ in $\mathbb{R}^{d_x}$, the inequality $|f_x(x') - f(x')| \leq L\|x - x'\|^{\beta}$.
(Ker) Bounded Kernel with Compact Support: The kernel function $K(\cdot)$ has support $[-1, 1]^{d_x}$, satisfies $\int_{\mathbb{R}^{d_x}} K(u)\,du = 1$, and $\sup_u K(u) \leq K_{max} < \infty$.

Smoothness of the regression equations, Assumption 2.3 (Smooth-m), is a standard assumption in the context of nonparametric regression. Assumption 2.3 (PX) is borrowed from Audibert and Tsybakov (2007), and it provides regularity conditions on the marginal distribution of $X$. The inequality condition (2.11) constrains the shape of the support of $X$; it essentially rules out the case where $\mathcal{X}$ has "sharp" spikes, i.e., where $\mathcal{X} \cap B(x, r)$ has an empty interior or $\mathrm{Leb}(\mathcal{X} \cap B(x, r))$ converges to zero as $r \to 0$ faster than the rate of $r^2$ for some $x$ on the boundary of $\mathcal{X}$.

Lemma B.4 in Appendix B shows that when $\mathcal{P}_m$ consists of the data generating processes satisfying Assumption 2.3 (Smooth-m) and (PX), Condition 2.1 (m) holds with $\psi_n = n^{\frac{1}{2+d_x/\beta_m}}$, and (2.9) holds with $\tilde{\psi}_n = n^{\frac{1}{2+d_x/\beta_m}}(\log n)^{-\frac{1}{2+d_x/\beta_m}-2}$. The following corollary therefore follows.

Corollary 2.1. Let $\mathcal{P}_m$ consist of data generating processes that satisfy Assumption 2.3 (Smooth-m) and (PX). Let $\hat{m}_1(X_i)$ and $\hat{m}_0(X_i)$ be the leave-one-out local polynomial estimators of degree $l = (\beta_m - 1)$, whose kernels satisfy Assumption 2.3 (Ker).
(i) Suppose Assumption 2.1 holds. Then,

$$\sup_{P \in \mathcal{P}_m \cap \mathcal{P}(M,\kappa)} E_{P^n}\left[W_{\mathcal{G}}^{*} - W(\hat{G}_{m\text{-}hybrid})\right] \leq O\left(n^{-\frac{1}{2+d_x/\beta_m}}\right).$$

(ii) Suppose Assumptions 2.1 and 2.2 hold with margin coefficient $\alpha \in (0, 1]$. Then,

$$\sup_{P \in \mathcal{P}_m \cap \mathcal{P}_{FB}(M,\kappa,\alpha,\eta)} E_{P^n}\left[W(G_{FB}^{*}) - W(\hat{G}_{m\text{-}hybrid})\right] \leq O\left(n^{-\frac{1+\alpha}{2+d_x/\beta_m}}(\log n)^{\left(\frac{1}{2+d_x/\beta_m}+2\right)(1+\alpha)} \vee n^{-\frac{1+\alpha}{2+\alpha}}\log n\right).$$

Next, consider the e-hybrid approach. For each $i = 1, \dots, n$, define the leave-one-out local polynomial fit for the propensity score as

$$\hat{e}(X_i) = U^T(0)\hat{\theta}_e(X_i) \cdot 1\{\lambda_n(X_i) \geq t_n\},$$

where
$$\hat{\theta}_e(X_i) = \arg\min_{\theta} \sum_{j \neq i}\left[D_j - \theta^T U\!\left(\frac{X_j - X_i}{h}\right)\right]^2 K\!\left(\frac{X_j - X_i}{h}\right).$$

We then construct an estimate of the individual treatment effect as

$$\hat{\tau}_i^{e} = \left[\frac{Y_i D_i}{\hat{e}(X_i)} - \frac{Y_i(1-D_i)}{1-\hat{e}(X_i)}\right] \cdot 1\{\varepsilon_n \leq \hat{e}(X_i) \leq 1-\varepsilon_n\}, \qquad 0 < \varepsilon_n \leq O(n^{-a}),\ a > 0.$$

To ensure Condition 2.1 (e), we now assume smoothness of the propensity score function $e(\cdot)$.

Assumption 2.4. This assumption is the same as Assumption 2.3 except that (Smooth-m) is replaced by
(Smooth-e) Smoothness of the Propensity Score: The propensity score $e(\cdot)$ belongs to a Hölder class of functions with degree $\beta_e \geq 1$ and constant $L_e < \infty$.

Again, Lemma B.4 in Appendix B shows that for $\mathcal{P}_e$ formed by the data generating processes satisfying Assumption 2.4, Condition 2.1 (e) holds with $\phi_n = n^{\frac{1}{2+d_x/\beta_e}}$ and (2.10) holds with $\tilde{\phi}_n = n^{\frac{1}{2+d_x/\beta_e}}(\log n)^{-\frac{1}{2+d_x/\beta_e}-2}$.

Corollary 2.2. Let $\mathcal{P}_e$ consist of data generating processes that satisfy Assumption 2.4 (Smooth-e) and (PX). Let $\hat{e}(X_i)$ be the leave-one-out local polynomial estimator of degree $l = (\beta_e - 1)$, whose kernel satisfies Assumption 2.3 (Ker).
(i) Suppose Assumption 2.1 holds. Then,

$$\sup_{P \in \mathcal{P}_e \cap \mathcal{P}(M,\kappa)} E_{P^n}\left[W_{\mathcal{G}}^{*} - W(\hat{G}_{e\text{-}hybrid})\right] \leq O\left(n^{-\frac{1}{2+d_x/\beta_e}}\right).$$

(ii) Suppose Assumptions 2.1 and 2.2 hold with margin coefficient $\alpha \in (0, 1]$. Then,

$$\sup_{P \in \mathcal{P}_e \cap \mathcal{P}_{FB}(M,\kappa,\alpha,\eta)} E_{P^n}\left[W(G_{FB}^{*}) - W(\hat{G}_{e\text{-}hybrid})\right] \leq O\left(n^{-\frac{1+\alpha}{2+d_x/\beta_e}}(\log n)^{\left(\frac{1}{2+d_x/\beta_e}+2\right)(1+\alpha)} \vee n^{-\frac{1+\alpha}{2+\alpha}}\log n\right).$$

A comparison of Corollaries 2.1 and 2.2 shows that the rate upper bound on the welfare loss differs between the m-hybrid and e-hybrid EWM approaches when the degree of Hölder smoothness of the regression equations, $\beta_m$, differs from that of the propensity score, $\beta_e$. For instance, if the propensity score $e(\cdot)$ is smoother than the outcome regression equations $m_1(\cdot)$ and $m_0(\cdot)$, in the sense that $\beta_e > \beta_m$, and the degree of the local polynomial regressions is chosen accordingly, then the rate upper bound of the e-hybrid EWM rule converges faster than that of the m-hybrid EWM rule.
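As a concrete illustration of the leave-one-out fits behind Condition 2.1, the sketch below implements the degree-0 (local constant) special case with a uniform kernel on simulated data. The regression functions, bandwidth, and the omission of the eigenvalue trimming $1\{\lambda_{n,d}(X_i) \geq t_n\}$ are simplifying assumptions; for degree 0, a nonzero kernel weight in the window suffices.

```python
import numpy as np

# Illustrative design: m1(x) = 1 + x, m0(x) = x^2, randomized treatment
rng = np.random.default_rng(2)
n = 400
X = rng.uniform(0.0, 1.0, size=n)
D = rng.binomial(1, 0.5, size=n)
m1 = lambda x: 1.0 + x
m0 = lambda x: x ** 2
Y = np.where(D == 1, m1(X), m0(X)) + rng.normal(0.0, 0.05, size=n)
h = 0.15  # bandwidth (arbitrary choice)

def loo_fit(i, d):
    """Leave-one-out local constant (degree-0) fit of m_d at X_i, uniform kernel."""
    mask = (D == d) & (np.arange(n) != i) & (np.abs(X - X[i]) <= h)
    return Y[mask].mean() if mask.any() else 0.0

tau_hat = np.array([loo_fit(i, 1) - loo_fit(i, 0) for i in range(n)])
# The quantity controlled by Condition 2.1 (m): the average estimation error
avg_abs_err = np.mean(np.abs(tau_hat - (m1(X) - m0(X))))
```

Higher-degree fits would replace the window average by a weighted least-squares regression on the polynomial basis $U(\cdot)$, which is what the displayed $\hat{\theta}_d(X_i)$ solve.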
3 Inference for Welfare

In the proposed EWM procedure, the maximized empirical welfare $W_n(\hat{G}_{EWM})$ can be seen as an estimate of $W(\hat{G}_{EWM})$, the welfare level attained by implementing the estimated treatment rule.^13 In this section, we provide a procedure for constructing asymptotically valid confidence intervals for the population welfare gain of implementing the estimated rule.

Let $\hat{G} \in \mathcal{G}$ be an estimated treatment rule such as $\hat{G}_{EWM}$, $\hat{G}_{m\text{-}hybrid}$, or $\hat{G}_{e\text{-}hybrid}$. Define the welfare gain of implementing an estimated treatment rule $\hat{G} \in \mathcal{G}$ by $V(\hat{G}) = W(\hat{G}) - W(G_0)$, where $G_0$ is a benchmark treatment assignment rule with which the estimated treatment rule $\hat{G}$ is compared in terms of social welfare. For instance, if the estimated treatment rule $\hat{G}$ is compared with the "no treatment" case, $G_0$ is the empty set $\emptyset$. Alternatively, if the benchmark policy is the non-individualized uniform adoption of the treatment, $G_0$ is set to $G_0 = \mathcal{X}$, and $V(\hat{G})$ is interpreted as the welfare gain of implementing individualized treatment assignment instead of non-individualized implementation of the treatment.

The construction of confidence intervals for $V(\hat{G})$ proceeds as follows. Let $\nu_n(G) = \sqrt{n}\left(V_n(G) - V(G)\right)$, where $V_n(G) \equiv W_n(G) - W_n(G_0)$. If there is a random variable $\bar{\nu}_n$ such that $\nu_n(\hat{G}) \leq \bar{\nu}_n$ holds $P^n$-almost surely, and if $\bar{\nu}_n$ converges in distribution to a non-degenerate random variable $\bar{\nu}$, then, with $c_{1-\tilde{\alpha}}$ the $(1-\tilde{\alpha})$-th quantile of $\bar{\nu}$, it holds that

$$P^n\left(\nu_n(\hat{G}) \leq c_{1-\tilde{\alpha}}\right) \geq P^n\left(\bar{\nu}_n \leq c_{1-\tilde{\alpha}}\right) \to \Pr\left(\bar{\nu} \leq c_{1-\tilde{\alpha}}\right) = 1 - \tilde{\alpha}, \quad \text{as } n \to \infty.$$

Hence, if $\hat{c}_{1-\tilde{\alpha}}$, a consistent estimator of $c_{1-\tilde{\alpha}}$, is available, an asymptotically valid one-sided confidence interval for $V(\hat{G})$ with coverage probability $(1-\tilde{\alpha})$ is given by

$$\left[V_n(\hat{G}) - \frac{\hat{c}_{1-\tilde{\alpha}}}{\sqrt{n}},\ \infty\right). \tag{3.1}$$
In the algorithm summarized below, we specify $\bar{\nu}_n$ to be $\bar{\nu}_n = \sqrt{n}\sup_{G \in \mathcal{G}}\left(V_n(G) - V(G)\right)$ and estimate $\hat{c}_{1-\tilde{\alpha}}$ by bootstrapping the supremum of the centered empirical process.^14

^13 It is important to note that in finite samples, $W_n(\hat{G}_{EWM})$ estimates $W(\hat{G}_{EWM})$ with an upward bias. For fixed $n$, the size of the bias becomes larger as $\mathcal{G}$ becomes more complex.
^14 The current choice of $\bar{\nu}_n$ is likely to yield conservative confidence intervals. Keeping the same nominal coverage probability, it is feasible to tighten the confidence intervals with a more sophisticated choice of $\bar{\nu}_n$, such as $\bar{\nu}_n = \sqrt{n}\sup_{G \in \hat{\mathcal{G}}}\left(V_n(G) - V(G)\right)$, where $\hat{\mathcal{G}}$ is a data-dependent subclass of $\mathcal{G}$ that contains $\hat{G}$ with probability approaching one.

Algorithm 3.1.
1. Let $\hat{G} \in \mathcal{G}$ be an estimated treatment assignment rule (e.g., the EWM rule, the hybrid rules, etc.), and let $V_n(\cdot) = W_n(\cdot) - W_n(G_0)$ be the empirical welfare gain obtained from the original sample.
2. Resample $n$ observations of $Z_i = (Y_i, D_i, X_i)$ randomly with replacement from the original sample and construct the bootstrap analogue of the welfare gain, $V_n^{*}(\cdot) = W_n^{*}(\cdot) - W_n^{*}(G_0)$, where $W_n^{*}(\cdot)$ is the empirical welfare of the bootstrap sample.
3. Compute $\bar{\nu}_n^{*} = \sqrt{n}\sup_{G \in \mathcal{G}}\left(V_n^{*}(G) - V_n(G)\right)$.
4. Let $\tilde{\alpha} \in (0, 1/2)$. Repeat steps 2 and 3 many times and estimate $\hat{c}_{1-\tilde{\alpha}}$ by the empirical $(1-\tilde{\alpha})$-th quantile of the bootstrap realizations of $\bar{\nu}_n^{*}$.

Given Assumption 2.1, the uniform central limit theorem for empirical processes ensures that $\bar{\nu}_n$ converges in distribution to $\bar{\nu}$, the supremum of a mean-zero Brownian bridge process. Furthermore, by the well-known result on the asymptotic validity of bootstrapped empirical processes (see, e.g., Section 3.6 of van der Vaart and Wellner (1996)), the bootstrap critical value $\hat{c}_{1-\tilde{\alpha}}$ consistently estimates the corresponding quantile of $\bar{\nu}$. We can therefore conclude that the confidence interval constructed in (3.1) has the desired asymptotic coverage probability.
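Algorithm 3.1 can be sketched as follows on simulated data, assuming a known propensity score $e(x) = 1/2$, the hypothetical threshold class $G = \{x \geq c\}$, and the benchmark $G_0 = \emptyset$; the sample size and number of bootstrap draws are arbitrary choices.

```python
import numpy as np

# Simulated RCT with tau(x) = x (illustrative design)
rng = np.random.default_rng(3)
n = 2_000
X = rng.uniform(-1, 1, size=n)
e = 0.5
D = rng.binomial(1, e, size=n)
Y = 1.0 + D * X + rng.normal(0.0, 0.1, size=n)
score = Y * D / e - Y * (1 - D) / (1 - e)   # IPW welfare score
thresholds = np.linspace(-1, 1, 41)          # hypothetical class G = {x >= c}

def V_n(s, x):
    """Empirical welfare gain V_n(G) = W_n(G) - W_n(G0) for every rule in the class."""
    return np.array([np.mean(s * (x >= c)) for c in thresholds])

V_orig = V_n(score, X)                       # step 1
j_hat = int(np.argmax(V_orig))               # the EWM rule within the class

B = 200
sup_stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)         # step 2: resample with replacement
    sup_stats[b] = np.sqrt(n) * np.max(V_n(score[idx], X[idx]) - V_orig)  # step 3

c_hat = np.quantile(sup_stats, 0.95)         # step 4, with alpha-tilde = 0.05
ci_lower = V_orig[j_hat] - c_hat / np.sqrt(n)   # one-sided interval (3.1)
```

As the text notes, this interval is conservative because the supremum is taken over the whole class rather than a data-dependent neighborhood of the estimated rule.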
4 Extensions

4.1 Empirical Welfare Maximization with a Capacity Constraint

Empirical welfare maximization can also be used to select treatment rules when one of the treatments is scarce and the planner faces a capacity constraint on the proportion of the target population that can be assigned to it. Capacity constraints exist in various treatment choice problems. In medicine, stocks of new drugs and vaccines can be smaller than the number of patients who may benefit from them. In education, a limited number of slots is available in "magnet" schools. Training programs for the unemployed are sometimes capacity-constrained. The limited capacity of a prison system could make it infeasible to assign incarceration as a treatment to all convicted criminals. In these cases, treatment assignment rules that propose to assign too many individuals to the capacity-constrained treatment cannot be fully implemented. We assume that the availability of treatment 1 is constrained.

Assumption 4.1.
(CC) Capacity Constraint: The proportion of the target population that can receive treatment 1 cannot exceed $K \in (0, 1)$.

If the population distribution of covariates $P_X$ were known, maximization of the empirical welfare criterion could simply be restricted to the sets in class $\mathcal{G}$ that satisfy the capacity constraint, $\mathcal{G}_K \equiv \{G \in \mathcal{G} : P_X(G) \leq K\}$. Being a subset of $\mathcal{G}$, the class of sets $\mathcal{G}_K$ has the same complexity as $\mathcal{G}$ (or lower), and all previous results would apply simply by replacing $\mathcal{G}$ with $\mathcal{G}_K$. The population distribution of covariates, however, is often not known precisely. In these cases, it is impossible to guarantee with certainty that an estimated treatment rule $\hat{G}$ will satisfy the capacity constraint with probability one. To evaluate the welfare of any treatment assignment method in this setting, we first need to make an assumption about what happens when the estimated treatment set $\hat{G}$ violates the capacity constraint.
We assume that if the treatment rule $G$ violates the capacity constraint, i.e., $P_X(G) > K$, then the scarce treatment is randomly allocated ("rationed") to a fraction $\frac{K}{P_X(G)}$ of the assigned recipients with $X \in G$, independently of $(X, Y_0, Y_1)$. If $G$ does not violate the capacity constraint, then there is no rationing and all recipients with covariates $X \in G$ receive treatment 1. This assumption holds if individuals arrive sequentially for treatment assignment in an order that is independent of $(X, Y_0, Y_1)$ and receive treatment 1 on a first-come, first-served basis if $X \in G$. It also corresponds to settings in which treatment 1 is allocated by lottery to individuals with $X \in G$. This allows us to clearly define the capacity-constrained welfare of the treatment rule indexed by any subset $G \subset \mathcal{X}$ of the covariate space as

$$W_K(G) \equiv E_P\left[m_1(X)\cdot\min\left\{1, \frac{K}{P_X(G)}\right\}\cdot 1\{X \in G\} + m_0(X)\cdot\left(1 - \min\left\{1, \frac{K}{P_X(G)}\right\}\right)\cdot 1\{X \in G\} + m_0(X)\cdot 1\{X \notin G\}\right].$$

Then the capacity-constrained welfare gain of the treatment rule $G$ equals

$$V_K(G) \equiv W_K(G) - W_K(\emptyset) = E_P\left[\min\left\{1, \frac{K}{P_X(G)}\right\}\cdot \tau(X)\cdot 1\{X \in G\}\right] = \min\left\{1, \frac{K}{P_X(G)}\right\}\cdot V(G).$$

Rationing dilutes the effect of treatment rules that violate the capacity constraint, and we take this effect on welfare into account. In comparison, the nonparametric plug-in treatment rules proposed by Bhattacharya and Dupas (2012) are only required to satisfy the capacity constraint on average over repeated data samples.

The maximum welfare gain attainable by treatment rules in $\mathcal{G}$ in the presence of a capacity constraint equals $V_K^{*} \equiv \sup_{G \in \mathcal{G}} V_K(G)$. Even with full knowledge of the outcome distribution, it is possible that the optimal policy would assign the scarce treatment to a subset of the population with $P_X(G) > K$, requiring rationing within that subpopulation. This is unlikely to happen in practice when the distribution of covariates does not have atoms and the collection of treatment rules $\mathcal{G}$ is sufficiently rich.
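A minimal empirical analog of the capacity-constrained welfare gain $V_K(G)$ can be sketched as follows. The design ($\tau(x) = x$, $X \sim U[-1,1]$, known $e(x) = 1/2$, threshold rules) is an illustrative assumption, and the scaling factor $\min\{1, K/\hat{P}_X(G)\}$ uses the empirical covariate distribution.

```python
import numpy as np

# Illustrative design with a capacity constraint K = 0.25
rng = np.random.default_rng(4)
n = 20_000
X = rng.uniform(-1, 1, size=n)
e = 0.5
D = rng.binomial(1, e, size=n)
Y = 1.0 + D * X + rng.normal(0.0, 0.1, size=n)
score = Y * D / e - Y * (1 - D) / (1 - e)    # IPW welfare score
K = 0.25                                      # at most 25% can be treated

def V_Kn(c):
    """Empirical capacity-constrained welfare gain of G = {x >= c}."""
    G = X >= c
    share = G.mean()                          # empirical P_X(G)
    if share == 0:
        return 0.0
    return min(1.0, K / share) * np.mean(score * G)

thresholds = np.linspace(-1, 1, 201)
c_K = thresholds[int(np.argmax([V_Kn(c) for c in thresholds]))]
# With tau(x) = x and X ~ U[-1,1], the constrained optimum treats the top 25%,
# so c_K should be near the 0.75 quantile of X, i.e. near 0.5.
```

In this design the rationing multiplier makes over-broad rules unattractive, so the maximizer satisfies the constraint without rationing, consistent with the discussion above.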
The following condition, for example, guarantees that the optimal capacity-constrained policy does not require rationing.

Remark 4.1. If the collection of treatment rules $\mathcal{G}$ contains the upper contour sets of $\tau(x)$, $G_t \equiv \{x \in \mathcal{X} : \tau(x) \geq t\}$, $t \in \mathbb{R}$, and $P_X(G_t)$ is continuous in $t$, then the optimal policy $G_K^{*} = \arg\max_{G \in \mathcal{G}} V_K(G)$ belongs to the set $\{G_t, t \in \mathbb{R}\}$ and satisfies the capacity constraint without rationing: $P_X(G_K^{*}) \leq K$.

We propose a treatment rule that maximizes the empirical analog of the capacity-constrained welfare gain $V_K(G)$ (and, hence, welfare):

$$\hat{G}_K \equiv \arg\max_{G \in \mathcal{G}} V_{K,n}(G), \tag{4.1}$$

where

$$V_{K,n}(G) \equiv \min\left\{1, \frac{K}{P_{X,n}(G)}\right\} \cdot V_n(G) = \min\left\{1, \frac{K}{P_{X,n}(G)}\right\} \cdot E_n\left[\left(\frac{Y_i D_i}{e(X_i)} - \frac{Y_i(1-D_i)}{1-e(X_i)}\right) \cdot 1\{X_i \in G\}\right],$$

and $P_{X,n}$ is the empirical probability distribution of $(X_1, \dots, X_n)$.

The following theorem shows that the expected welfare of $\hat{G}_K$ converges to the maximum at least at the $n^{-1/2}$ rate. The result is analogous to Theorem 2.1, with an additional term corresponding to potential welfare losses due to misestimation of $P_X(G)$.

Theorem 4.1. Under Assumptions 2.1 and 4.1,

$$\sup_{P \in \mathcal{P}(M,\kappa)} E_{P^n}\left[\sup_{G \in \mathcal{G}} W_K(G) - W_K(\hat{G}_K)\right] \leq C_1 \frac{M}{\kappa}\sqrt{\frac{v}{n}} + C_1 \frac{M}{K}\sqrt{\frac{v}{n}},$$

where $C_1$ is the universal constant in Lemma A.4.

4.2 Target Population that Differs from the Sampled Population

The Empirical Welfare Maximization method can be adapted to select treatment rules for a target population that differs from the sampled population in the distribution of covariates $X$ but has the same conditional treatment effect function $\tau(x)$. As before, $P$ denotes the probability distribution of $(Y_{0,i}, Y_{1,i}, D_i, X_i)$ in the sampled population. We denote by $P^T$ the probability distribution of $(Y_{0,i}, Y_{1,i}, X_i)$ in the target population and by $E^T$ the expectation with respect to that distribution. The welfare of implementing treatment rule $G$ in the target population is $W^T(G)$. We assume that the sampled population has the same conditional treatment effect as the target population.

Assumption 4.2.
(ID) Identical Treatment Effects: $E^T(Y_1 - Y_0 \mid X) = E_P(Y_1 - Y_0 \mid X)$.
(BDR) Bounded Density Ratio: The probability distributions $P_X^T$ and $P_X$ have densities $p_X^T$ and $p_X$ with respect to a common dominating measure on $\mathcal{X}$, and for some $\rho(x) \leq \bar{\rho} < \infty$,

$$p_X^T(x) = \rho(x) \cdot p_X(x).$$

The welfare gain of treatment rule $G$ in the target population equals

$$V^T(G) \equiv \int_{\mathcal{X}} \tau(x) 1\{x \in G\}\, dP_X^T(x) = \int_{\mathcal{X}} \tau(x) 1\{x \in G\} \rho(x)\, dP_X(x).$$

Assumption 4.2 (ID) implies that the first-best treatment rule $G_{FB}^{*} = \{x : \tau(x) \geq 0\}$ is the same in the sampled and target populations. Therefore, when the first-best policy is feasible, i.e., $G_{FB}^{*} \in \mathcal{G}$, we can directly apply the EWM treatment rule $\hat{G}_{EWM}$ computed from the sampled population to the target population. Note that from the definition of $G_{FB}^{*}$, it follows that for any treatment rule $G$ and any $x$,

$$\tau(x) 1\{x \in G_{FB}^{*}\} - \tau(x) 1\{x \in G\} \geq 0.$$

In conjunction with Assumption 4.2 (BDR), this yields an upper bound on the welfare loss of applying any treatment rule in the target population, expressed in terms of the welfare loss of applying the same rule in the sampled population:

$$\begin{aligned}
W^T(G_{FB}^{*}) - W^T(G) &= V^T(G_{FB}^{*}) - V^T(G) \\
&= \int_{\mathcal{X}} \left[\tau(x)1\{x \in G_{FB}^{*}\} - \tau(x)1\{x \in G\}\right] \rho(x)\, dP_X(x) \\
&\leq \bar{\rho} \int_{\mathcal{X}} \left[\tau(x)1\{x \in G_{FB}^{*}\} - \tau(x)1\{x \in G\}\right] dP_X(x) \\
&= \bar{\rho}\left[W(G_{FB}^{*}) - W(G)\right].
\end{aligned}$$

All of the welfare convergence rate results derived for the EWM and hybrid treatment rules in the previous sections therefore also hold when these treatment rules are applied to the target population, as long as Assumptions 2.2 (FB) and 4.2 hold.

When the first-best policy is not feasible ($G_{FB}^{*} \notin \mathcal{G}$), the second-best policies for the sampled and target populations, $G^{*} \in \arg\max_{G \in \mathcal{G}} W(G)$ and $G^{T*} \in \arg\max_{G \in \mathcal{G}} W^T(G)$, are generally different. The welfare of the treatment rules proposed in the previous sections then does not generally converge to the second-best welfare in the target population, $\sup_{G \in \mathcal{G}} W^T(G)$.
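One way to recover the second-best policy of the target population is to reweight the empirical welfare criterion by the density ratio $\rho(X_i) = p_X^T(X_i)/p_X(X_i)$, which the text below formalizes in (4.2). A minimal sketch under an illustrative design (sampled $X \sim U[0,1]$, hypothetical target density $2x$, $\tau(x) = 0.5 - x$, known $e(x) = 1/2$):

```python
import numpy as np

# Sampled population: X ~ U[0,1]; hypothetical target density p^T(x) = 2x,
# so the density ratio is rho(x) = 2x. tau(x) = 0.5 - x in both populations.
rng = np.random.default_rng(5)
n = 50_000
X = rng.uniform(0.0, 1.0, size=n)
rho = 2.0 * X                                  # density ratio p_X^T / p_X
e = 0.5
D = rng.binomial(1, e, size=n)
Y = 1.0 + D * (0.5 - X) + rng.normal(0.0, 0.1, size=n)
score = Y * D / e - Y * (1 - D) / (1 - e)      # IPW welfare score

def V_T_n(c):
    """Reweighted empirical welfare gain of G = {x <= c}, as in (4.2)."""
    return np.mean(score * rho * (X <= c))

thresholds = np.linspace(0.0, 1.0, 101)
c_T = thresholds[int(np.argmax([V_T_n(c) for c in thresholds]))]
# tau(x) >= 0 iff x <= 0.5, so the maximizer should be near c = 0.5.
```

In this design the first-best rule is in the candidate class, so reweighting does not change the answer; the reweighting matters precisely when $\mathcal{G}$ excludes the first-best rule and the two populations weight the misclassified region differently.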
The second-best in the target population can be attained by reweighting the argument of the EWM problem by the density ratio ρ(X_i). This method works even when the first-best treatment rule is not feasible and can be combined with a treatment capacity constraint. The reweighted EWM problem is

Ĝ^T_{EWM} ∈ argmax_{G ∈ G} E_n[(Y_i D_i / e(X_i) − Y_i (1 − D_i) / (1 − e(X_i))) · ρ(X_i) · 1{X_i ∈ G}].   (4.2)

Applying the law of iterated expectations with respect to X, we obtain

E_P[(Y D / e(X) − Y (1 − D) / (1 − e(X))) · ρ(X) · 1{X ∈ G}] = ∫_X τ(x) 1{x ∈ G} ρ(x) dP_X(x) = V^T(G).   (4.3)

Theorem 4.2 shows that the welfare loss of the reweighted EWM treatment rule in the target population converges to zero at least at the n^{−1/2} rate.

Theorem 4.2. If the distribution P in the sampled population satisfies Assumption 2.1 and the distribution P^T in the target population satisfies Assumption 4.2, then

sup_{P ∈ P(M,κ)} E_{P^n}[ sup_{G ∈ G} W^T(G) − W^T(Ĝ^T_{EWM}) ] ≤ C₁ (M ρ̄ / κ) √(v/n),

where C₁ is the universal constant in Lemma A.4.

4.3 Comparison with the Nonparametric Plug-in Rule

The plug-in treatment choice rule (1.13) with parametrically or nonparametrically estimated m₁(x) and m₀(x) is intuitive and simple to implement. In situations where flexible treatment assignment rules are allowed and the dimension of the conditioning covariates is small, the nonparametric plug-in rule is a competing alternative to the EWM approach. In this section, we review the welfare loss convergence rate results for the nonparametric plug-in rule and discuss potential advantages and disadvantages of the two approaches. We denote the class of data generating processes that satisfy Assumptions 2.1 (UCF), (BO), (SO), Assumption 2.2 (MA), and Assumption 2.3 by P_smooth(M, κ, α, η, β_m). Given the smoothness assumption on the regression equations, we consider estimating m₁ and m₀ by local polynomial estimators of degree (β_m − 1).
The convergence rate results for nonparametric plug-in classifiers in Theorem 3.3 of Audibert and Tsybakov (2007) can be straightforwardly extended to the treatment choice context, yielding

sup_{P ∈ P_smooth(M,κ,α,η,β_m)} E_{P^n}[ W(G*_{FB}) − W(Ĝ_{plug-in}) ] ≤ O(n^{−(1+α)/(2 + d_x/β_m)}).   (4.4)

Furthermore, if αβ_m ≤ d_x, Theorem 3.5 of Audibert and Tsybakov (2007) applied to the current treatment choice setup shows that the nonparametric plug-in rule attains the rate lower bound, i.e., for any treatment rule Ĝ,

sup_{P ∈ P_smooth(M,κ,α,η,β_m)} E_{P^n}[ W(G*_{FB}) − W(Ĝ) ] ≥ O(n^{−(1+α)/(2 + d_x/β_m)})

holds. In practically relevant situations where αβ_m ≤ d_x,15 a naive comparison of the welfare loss convergence rate of the plug-in rule presented here with that of EWM (Theorems 2.3 and 2.4) would suggest that, in terms of the welfare loss convergence rate, the EWM rule outperforms the nonparametric plug-in rule. It is, however, important to note that the classes of data generating processes over which the uniform rates are ensured differ between the two cases. P_smooth(M, κ, α, η, β_m) is constrained to smooth regression equations and continuously distributed X, whereas P_FB(M, κ, α, η) considered in Theorems 2.3 and 2.4 allows for discontinuous regression equations and imposes no restriction on the marginal distribution of X.

15 By analogy with Proposition 3.4 of Audibert and Tsybakov (2007), when the class of data generating processes is assumed to have αβ_m > d_x, no data generating process in this class can have the conditional treatment effect τ(x) = 0 in the interior of the support of P_X. In the practice of causal inference, we would not a priori restrict the plausible data generating processes to these extreme cases; therefore, the class of data generating processes with αβ_m > d_x is less relevant in practice.
Assumption 2.2 (FB) on P_FB(M, κ, α, η) requires that {x : τ(x) ≥ 0} belongs to the pre-specified VC-class G, whereas P_smooth(M, κ, α, η, β_m) is free from such an assumption. This non-nested relationship between P_FB(M, κ, α, η) and P_smooth(M, κ, α, η, β_m) makes the naive rate comparison between (4.4) and Theorem 2.3 less meaningful, because a data generating process in P_smooth(M, κ, α, η, β_m) that yields the slowest convergence rate for the nonparametric plug-in rule is in fact excluded from P_FB(M, κ, α, η). Accordingly, unless we can assess which of P_smooth(M, κ, α, η, β_m) and P_FB(M, κ, α, η) is more likely to contain the true data generating process, these rate results offer only limited guidance on which procedure should be used in a given application.

In practical terms, we view these two distinct approaches as complementary, and the choice between them should be based on the available assumptions and the dimension of the covariates in a given application. A practical advantage of the EWM rule is that its welfare loss convergence rate does not directly depend on the dimension of X, so when a credible assumption on the level set {x : τ(x) ≥ 0} implies a class of decision sets with finite VC-dimension, the EWM approach offers a practical way around the curse of dimensionality in X. A potential drawback of the EWM rule is the risk of misspecification of G: if Assumption 2.2 (FB) is not valid, the EWM rule attains only the second-best welfare, whereas the nonparametric plug-in rule is guaranteed to yield the first-best welfare in the limit. Another aspect of the comparison is that the performance of the EWM rule is stable regardless of whether the underlying data generating processes, including the marginal distribution of X and the regression equations m₁(X) and m₀(X), are smooth.
Furthermore, in terms of implementation, the EWM approach does not require the user to specify smoothing parameters once the class of candidate decision sets G is given. In contrast, the nonparametric plug-in rule requires smoothing parameters. The statistical performance of the nonparametric plug-in rule can be sensitive to the choice of smoothing parameters, and the theoretical convergence rate results given in (4.4) assume the user's ability to choose the smoothing parameters properly.

5 Computing EWM Treatment Rules

The Empirical Welfare Maximization estimator Ĝ, as well as the hybrid estimators Ĝ_{m-hybrid} and Ĝ_{e-hybrid}, share the same structure

Ĝ ∈ argmax_{G ∈ G} Σ_{1≤i≤n} g_i · 1{X_i ∈ G},   (5.1)

where each g_i is a function of the data: for the EWM rule Ĝ_{EWM}, g_i = (1/n)(Y_i D_i / e(X_i) − Y_i (1 − D_i) / (1 − e(X_i))); for the e-hybrid rule Ĝ_{e-hybrid}, g_i = τ̂^e_i / n; and for the m-hybrid rule Ĝ_{m-hybrid}, g_i = τ̂^m(X_i) / n. The objective function in (5.1) is non-convex and discontinuous in G, so finding Ĝ can be computationally challenging. In this section, we propose a set of convenient tools that permit solving this optimization problem and performing inference using widely available software16 for practically important classes of sets G defined by linear eligibility scores.

5.1 Single Linear Index Rules

We start with the problem of computing optimal treatment rules that assign treatments based on a linear index (linear eligibility score; LES, see Examples 2.1 and 2.2). To reduce notational complexity, we include a constant in the covariate vector X throughout the exposition of this section. An LES rule can be expressed as 1{X^T β ≥ 0}. This type of treatment rule is commonly used in practice because it offers a simple way to reduce the dimension of observable characteristics. Furthermore, it is easy to enforce monotonicity of treatment assignment in specific covariates by imposing sign restrictions on the components of β.
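The shared structure (5.1) separates the data-dependent weights g_i from the search over G. As a minimal sketch (the function names and the representation of the candidate class as an explicit list of membership vectors are our own simplifications, not the paper's MILP implementation):

```python
def ewm_weights(y, d, e):
    """Observation weights for the EWM rule in (5.1):
    g_i = (1/n) * (Y_i D_i / e(X_i) - Y_i (1 - D_i) / (1 - e(X_i)))."""
    n = len(y)
    return [(yi * di / ei - yi * (1 - di) / (1 - ei)) / n
            for yi, di, ei in zip(y, d, e)]

def best_rule(g, memberships):
    """Maximize sum_i g_i * 1{X_i in G} over an explicit finite list of
    candidate sets, each encoded by its membership indicator vector."""
    return max(memberships, key=lambda m: sum(gi * mi for gi, mi in zip(g, m)))
```

The same two-step structure applies to the hybrid rules: only the weights g_i change, not the search.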
Let G_LES be the collection of half-spaces of the covariate space X that are the upper contour sets of linear functions:

G_LES = {G_β : β ∈ B ⊂ ℝ^{d_x+1}},   G_β = {x : x^T β ≥ 0}.

Then the optimization problem (5.1) becomes:

max_{β ∈ B} Σ_{1≤i≤n} g_i · 1{X_i^T β ≥ 0}.   (5.2)

This problem is similar to the maximum weighted score problem analyzed in Florios and Skouras (2008). They observe that the maximum score objective function can be rewritten as a Mixed Integer Linear Programming (MILP) problem with additional binary parameters (z₁, ..., z_n) that replace the indicator functions 1{X_i^T β ≥ 0}. The equality z_i = 1{X_i^T β ≥ 0} is imposed by a combination of linear inequality constraints and the restriction that the z_i's are binary. The advantage of a MILP representation is that it is a standard optimization problem that can be solved by multiple commercial and open-source solvers. The branch-and-cut algorithms implemented in these solvers are faster than brute-force combinatorial optimization. We propose replacing (5.2) by its equivalent problem:

max_{β ∈ B, z₁,...,z_n ∈ ℝ} Σ_{1≤i≤n} g_i · z_i   (5.3)

s.t.  X_i^T β / C_i < z_i ≤ 1 + X_i^T β / C_i  for i = 1, ..., n,   (5.4)
      z_i ∈ {0, 1},

where the constants C_i should satisfy C_i > sup_{β ∈ B} |X_i^T β|. The inequality constraints (5.4) and the restriction that the z_i's are binary then imply that z_i = 1 if and only if X_i^T β ≥ 0. It follows that the maximum value of (5.3) for each value of β is the same as the value of (5.2). The problem (5.3) is a linear optimization problem with linear inequality constraints and integer constraints on the z_i's; if the set B is defined by linear inequalities, it can be passed to any MILP solver. Florios and Skouras (2008) impose only one side of the inequality constraint (5.4) for each i.

16 For the empirical illustration we used IBM ILOG CPLEX Optimization Studio, which is available free for academic use through the IBM Academic Initiative.
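In the scalar-covariate case the MILP machinery is not needed: with index β₀ + β₁x every LES rule is a half-line, so (5.2) can be solved exactly by scanning thresholds at the observed covariate values. The sketch below (hypothetical names; our own simplification, not the paper's implementation) also shows how the bootstrap statistic in (5.5) below reuses the same maximization with recentered weights:

```python
import math
import random

def max_score_threshold(g, x):
    """Exact solution of (5.2) for a scalar covariate plus intercept:
    every LES rule is then 1{x >= t} or 1{x <= t}, so it suffices to scan
    thresholds at observed values; the empty rule attains 0."""
    best_val, best_rule = 0.0, ("empty", None)
    for t in sorted(set(x)):
        up = sum(gi for gi, xi in zip(g, x) if xi >= t)    # rule 1{x >= t}
        down = sum(gi for gi, xi in zip(g, x) if xi <= t)  # rule 1{x <= t}
        if up > best_val:
            best_val, best_rule = up, (">=", t)
        if down > best_val:
            best_val, best_rule = down, ("<=", t)
    return best_val, best_rule

def bootstrap_draw(g, x, rng=random):
    """One draw of the bootstrap statistic (5.5): multinomial weights w*_i
    with sum_i w*_i = n, then the same maximization with weights (w*_i - 1) g_i."""
    n = len(g)
    w = [0] * n
    for _ in range(n):
        w[rng.randrange(n)] += 1
    val, _ = max_score_threshold([(wi - 1) * gi for wi, gi in zip(w, g)], x)
    return math.sqrt(n) * val
```

In higher dimensions the number of distinct half-spaces grows combinatorially, which is where the MILP representation (5.3)-(5.4) and branch-and-cut solvers become essential.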
For g_i > 0, it is sufficient to impose only the upper bound on z_i, and for g_i < 0, only the lower bound. The other side of the bound is always satisfied at the solution due to the direction of the objective function. Our formulation nevertheless has significant advantages. Despite the larger number of inequalities, it reduces the computation time in our applications by a factor of 10-40. Furthermore, imposing only one side of the inequalities on the z_i's is not sufficient for the optimization with a capacity constraint considered further below.

Inference on the welfare gain V(Ĝ_{EWM}) of the empirical welfare maximizing policy requires computing ν̄*_n = sup_{G ∈ G} √n (V*_n(G) − V_n(G)) in each bootstrap sample. Denoting the bootstrap weights by {w*_i}, Σ_{i=1}^n w*_i = n, ν̄*_n can be expressed as

ν̄*_n = √n sup_{β ∈ B} Σ_{1≤i≤n} (w*_i − 1) g_i · 1{X_i^T β ≥ 0}.   (5.5)

The optimization problem for ν̄*_n is analogous to the optimization problem for Ĝ_{EWM}. Furthermore, solving it does not require knowledge of Ĝ_{EWM}; hence all bootstrap computations can be performed in parallel with the main EWM problem.

5.2 Multiple Linear Index Rules

We extend this method to compute treatment rules based on multiple linear scores. These rules construct J scores that are linear in covariates (or in functions of covariates) and assign an individual to treatment if each score exceeds a specific threshold. An example of a multiple index treatment rule with three indices is assigning an individual to a job training program if (25 ≤ age ≤ 35) AND (wage at the previous job < $15). The results are easily extended to treatment rules that assign treatment if any of the indices exceeds its threshold, for example, (age ≥ 40) OR (length of unemployment ≥ 2 years). Let the treatment assignment set G be defined as an intersection of upper contour sets of J linear functions:

G_{β¹,...,β^J} = {x : x^T β¹ ≥ 0, ..., x^T β^J ≥ 0},  β¹, ..., β^J ∈ B.
G = {G_{β¹,...,β^J} : β¹, ..., β^J ∈ B}. Then the optimization problem (5.1) becomes

max_{β¹,...,β^J ∈ B} Σ_{1≤i≤n} g_i · 1{X_i^T β¹ ≥ 0, ..., X_i^T β^J ≥ 0}.   (5.6)

We propose its equivalent formulation as a MILP problem with auxiliary binary variables (z_i¹, ..., z_i^J, z_i*), i = 1, ..., n:

max_{β¹,...,β^J ∈ B, z_i¹,...,z_i^J,z_i* ∈ ℝ} Σ_{1≤i≤n} g_i · z_i*   (5.7)

s.t.  X_i^T β^j / C_i < z_i^j ≤ 1 + X_i^T β^j / C_i  for 1 ≤ i ≤ n, 1 ≤ j ≤ J,   (5.8)
      1 − J + Σ_{1≤j≤J} z_i^j ≤ z_i* ≤ J^{−1} Σ_{1≤j≤J} z_i^j  for 1 ≤ i ≤ n,   (5.9)
      z_i¹, ..., z_i^J, z_i* ∈ {0, 1}  for 1 ≤ i ≤ n.

Similarly to the single index problem, the inequalities (5.8) and the constraint that the z_i^j's are binary together imply that z_i^j = 1{X_i^T β^j ≥ 0}. The linear inequalities (5.9) and the binary constraints together imply that z_i* = z_i¹ · ... · z_i^J = 1{X_i^T β¹ ≥ 0} · ... · 1{X_i^T β^J ≥ 0}.

The problem for a collection of sets defined by the union of linear inequalities,

G_{β¹,...,β^J} = {x : x^T β¹ ≥ 0 or ... or x^T β^J ≥ 0},

can also be written as a MILP problem, with the inequality constraint (5.9) replaced by

J^{−1} Σ_{1≤j≤J} z_i^j ≤ z_i* ≤ Σ_{1≤j≤J} z_i^j  for i = 1, ..., n.   (5.10)

5.3 Optimization with a Capacity Constraint

When there is a capacity constraint K on the proportion of the population that can be assigned to treatment, the Empirical Welfare Maximization problem (4.1) on a set G of half-spaces becomes

max_{β ∈ B} min{1, Kn / Σ_{i=1}^n 1{X_i^T β ≥ 0}} · Σ_{1≤i≤n} g_i · 1{X_i^T β ≥ 0}.   (5.11)

This problem cannot be rewritten as a linear optimization problem in the same way as (5.3) because the factor min{1, Kn / Σ_{i=1}^n 1{X_i^T β ≥ 0}} varies with β. This factor can take fewer than n different values, however, and the maximum of (5.11) can be obtained by solving a sequence of optimization problems, each of which holds this factor constant. For k = ⌊Kn⌋, ..., n:

max_{β ∈ B, z₁,...,z_n ∈ ℝ}  min{1, Kn/k} Σ_{1≤i≤n} g_i · z_i

s.t.  X_i^T β / C_i < z_i ≤ 1 + X_i^T β / C_i  for 1 ≤ i ≤ n,
      z_i ∈ {0, 1},
      Σ_{1≤i≤n} z_i ≤ k.

The capacity-constrained problem with multiple indexes can be solved similarly.
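The sequence-of-subproblems device of Section 5.3 can be mimicked directly when the candidate class is a finite explicit list. A minimal sketch (hypothetical names; in the paper the inner step is instead a MILP with the extra constraint Σ_i z_i ≤ k):

```python
import math

def capacity_constrained_max(g, memberships, K):
    """Maximize (5.11) by looping over the treated count k: for each k the
    factor min{1, Kn/k} is held fixed and the inner problem maximizes
    sum_i g_i z_i subject to sum_i z_i <= k, over explicit candidate sets."""
    n = len(g)
    best_val, best_m = 0.0, [0] * n  # empty treatment set as baseline
    for k in range(max(1, math.floor(K * n)), n + 1):
        factor = min(1.0, K * n / k)
        for m in memberships:  # inner problem restricted to treated count <= k
            if sum(m) <= k:
                value = factor * sum(gi * mi for gi, mi in zip(g, m))
                if value > best_val:
                    best_val, best_m = value, m
    return best_val, best_m
```

The outer loop is valid because for a candidate with treated count at most k the fixed factor min{1, Kn/k} understates the true rationing factor, so no subproblem can overshoot the capacity-constrained optimum, while the subproblem at the optimal count attains it.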
6 Empirical Application

We illustrate the Empirical Welfare Maximization method by applying it to experimental data from the National Job Training Partnership Act (JTPA) Study. A detailed description of the study and an assessment of average program effects for five large subgroups of the target population can be found in Bloom et al. (1997). The study randomized whether applicants would be eligible to receive a mix of training, job-search assistance, and other services provided by the JTPA for a period of 18 months. It collected background information on the applicants prior to random assignment, as well as administrative and survey data on the applicants' earnings in the 30-month period following the assignment. We use the same sample of 11,204 adults (22 years and older) used in the original evaluation of the program and in subsequent studies (Bloom et al., 1997, Heckman et al., 1997, Abadie et al., 2002). The probability of being assigned to the treatment was one third in this sample.

We use two simple welfare outcome measures for our illustration. The first is total individual earnings in the 30-month period following the program assignment. The second outcome measure is the 30-month earnings minus a $1,000 cost for assigning an individual to the treatment, which is close to the average cost (per assignee) of the additional services provided by the JTPA program, as estimated by Bloom et al. (1997). The first outcome measure reflects social preferences that put no weight on the costs of the program incurred by the government. The second measure weighs participants' gains and the government's losses equally. For all treatment rules, we report the estimated intention-to-treat effect of assigning all individuals with covariates X ∈ G to the treatment. The take-up of different program services varied across individuals.
We view the policy maker's problem as a choice of eligibility criteria for the program and not a choice of the take-up rate (which is decided by individuals); hence, we are not interested in the treatment effect on compliers. Since we have to compare the welfare effects of policies that assign different proportions of the population to the treatment, we report estimates of the average effect per population member, E[(Y₁ − Y₀) · 1{X ∈ G}], which is proportional to the total welfare effect of the treatment rule G.

The pre-treatment variables on which we consider conditioning the treatment assignment are the individual's years of education and earnings in the year prior to the assignment. Both variables may plausibly affect how much the individual benefits from the program services. We do not use race, gender, or age. Though treatment effects may vary with these characteristics, policy makers usually cannot use them to determine treatment assignment, since doing so may easily be perceived as discrimination. Education and earnings are generally verifiable characteristics. This is an important feature for implementing the proposed treatment assignment, because the empirical welfare estimates are inaccurate for the target population if individuals can manipulate their characteristics to obtain the desired treatment.

Table 1 reports the estimated welfare gains of alternative treatment rules. All of them are estimated by inverse probability weighting. The average effect of the program on 30-month earnings for the whole study population is $1,269. If the treatment cost of $1,000 per assignee is taken into account, the net average effect of assigning everyone to treatment is estimated to be $269.

We consider two candidate classes of treatment rules for EWM. The first is the class of quadrant treatment rules:

G_Q ≡ { {x : s₁(education − t₁) > 0 and s₂(prior earnings − t₂) > 0} :
s₁, s₂ ∈ {−1, 0, 1}, t₁, t₂ ∈ ℝ }.   (6.1)

This class of treatment eligibility rules is easily implementable and is often used in practice. To be assigned to treatment according to such a rule, an individual's education and pre-program earnings have to be above (or below) specific thresholds. Figure 1 displays the quadrant treatment rules selected by the EWM criterion. The entire shaded area covers individuals who would be assigned to treatment if it were costless. The dark shaded area shows the EWM treatment rule that takes into account the $1,000 treatment cost. The size of the black dots indicates the number of individuals with different covariate values. Both rules set a minimum threshold of ten years of education and a maximum threshold on pre-program earnings ($12,200 and $6,500, respectively). The estimated proportions of the population assigned to treatment under these rules are 83% and 73%.

Second, we consider the class of linear treatment rules:

G_LES ≡ { {x : β₀ + β₁ · education + β₂ · prior earnings > 0} : β₀, β₁, β₂ ∈ ℝ }.   (6.2)

Figure 2 displays the treatment rules from this class chosen according to the EWM criterion. They are nearly identical for no treatment cost and for a cost of $1,000, assigning 82% of the population to treatment. At higher treatment costs, EWM selects a much smaller subset of the population.

Linear treatment rules that maximize empirical welfare are markedly different from the plug-in rules derived from linear regressions, which are shown in Figure 3. Without treatment costs, linear regression predicts positive treatment effects for the entire range of feasible covariate values. With a cost of $1,000, the regression predicts a positive net treatment effect for about 82% of individuals. Noticeably, the direction of treatment assignment differs between the regression plug-in and linear EWM rules. The regression puts a positive coefficient on prior earnings, whereas the equation characterizing the linear EWM rule puts a negative coefficient on them.
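Because the quadrant class (6.1) is indexed by two signs and two thresholds, and only thresholds at observed covariate values matter, EWM over this class can be solved by exhaustive search. A minimal sketch (hypothetical names; we read s = 0 as leaving that covariate unrestricted, which is an assumption about the intended convention in (6.1)):

```python
def best_quadrant_rule(g, edu, earn):
    """Exhaustive EWM search over quadrant rules: treat iff
    s1*(education - t1) > 0 and s2*(prior earnings - t2) > 0.
    g: observation weights g_i as in (5.1); returns (welfare, rule)."""
    def side(s, v, t):
        # s = 0 is treated here as "no restriction on this covariate"
        return True if s == 0 else s * (v - t) > 0
    best = (0.0, None)  # the empty rule attains welfare gain 0
    for s1 in (-1, 0, 1):
        for s2 in (-1, 0, 1):
            for t1 in sorted(set(edu)):
                for t2 in sorted(set(earn)):
                    val = sum(gi for gi, e1, e2 in zip(g, edu, earn)
                              if side(s1, e1, t1) and side(s2, e2, t2))
                    if val > best[0]:
                        best = (val, (s1, t1, s2, t2))
    return best
```

For the linear class (6.2) the search is over arbitrary half-planes rather than axis-aligned thresholds, which is why Section 5 resorts to mixed integer linear programming.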
If the linear regression is correctly specified, the regression plug-in and EWM rules have identical large-sample limits. If the regression is misspecified, however, only the linear EWM treatment rules converge with sample size to the welfare-maximizing limit; the welfare yielded by the regression plug-in rules converges to a lower limit.

Figure 4 shows plug-in treatment rules based on kernel regressions of treatment and control outcomes on the covariates. The bandwidths were chosen by Silverman's rule of thumb. The class of nonparametric plug-in rules is richer than the quadrant or the linear class of treatment rules, and it may attain higher welfare in large samples. It is clear from the figure, however, that this class of patchy decision rules may be difficult to implement in public policy, where clear and transparent treatment rules are required.

7 Conclusion

The EWM approach proposed in this paper directly maximizes a sample analog of the welfare criterion of a utilitarian policy maker. This welfare-function-based statistical procedure for treatment choice differs fundamentally from parametric and nonparametric plug-in approaches, which do not integrate statistical inference with the decision problem at hand. We investigated the statistical performance of the EWM rule in terms of the uniform convergence rate of the welfare loss and demonstrated that the EWM rule attains minimax optimal rates over various classes of feasible data distributions. The EWM approach offers a useful framework for individualized policy assignment problems, as it can easily accommodate the constraints that policy makers commonly face in practice. We also presented methods to compute the EWM rule for many practically important classes of treatment assignment rules and demonstrated them using experimental data from the JTPA program.

Several extensions and open questions remain to be answered.
First, this paper assumed that the class of candidate policies G is given exogenously to the policy maker. We did not consider how to select the class G when the policy maker is free to do so. Second, we ruled out the case in which the data are subject to selection on unobservables. With self-selection into the treatment, the welfare criterion could be only set-identified, and it is not clear how to extend the EWM idea to this case. Third, we restricted our analysis to the utilitarian social welfare criterion, but in some contexts policy makers have a non-utilitarian social welfare criterion. We leave these issues for future research.

Table 1: Estimated welfare gain of alternative treatment assignment rules that condition on education and pre-program earnings.

Treatment rule                  Share of population    Empirical welfare gain   Lower 90% CI
                                assigned to treatment  per population member    (bootstrap)

Outcome variable: 30-month post-program earnings. No treatment cost.
Average treatment effect        1                      $1,269                   $719
EWM quadrant treatment rule     0.827                  $1,614                   $787
EWM linear treatment rule       0.820                  $1,657                   $765
Linear regression plug-in rule  1                      $1,269
Nonparametric plug-in rule      0.816                  $1,924

With $1,000 expected cost per treatment assignment.
Average treatment effect        1                      $269                     −$281
EWM quadrant treatment rule     0.732                  $797                     −$20
EWM linear treatment rule       0.819                  $837                     −$46
Linear regression plug-in rule  0.824                  $760
Nonparametric plug-in rule      0.675                  $1,164

Figure 1: Empirical Welfare-Maximizing treatment rules from the quadrant class conditioning on years of education and pre-program earnings (panels: rule with no cost and with $1,000 cost, plotted against years of education and pre-program annual earnings, with population density shown by dot size)

Figure 2: Empirical Welfare-Maximizing treatment rules from the
linear class conditioning on years of education and pre-program earnings (panels: rule with no cost and with $1,000 cost, with population density shown)

Figure 3: Parametric plug-in treatment rules based on the linear regressions of treatment outcomes on years of education and pre-program earnings (panels: OLS plug-in rule with no cost and with $1,000 cost)

Figure 4: Nonparametric plug-in treatment rules based on the kernel regressions of treatment outcomes on years of education and pre-program earnings (regions: kernel regression < $0, > $0, and > $1,000)

A Appendix: Lemmas and Proofs

A.1 Notations and Basic Lemmas

Let Z_i = (Y_i, D_i, X_i) ∈ Z. The subgraph of a real-valued function f : Z → ℝ is the set

SG(f) ≡ {(z, t) ∈ Z × ℝ : 0 ≤ t ≤ f(z) or f(z) ≤ t ≤ 0}.

The following lemma establishes a link between the VC-dimension of a class of subsets of the covariate space X and the VC-dimension of a class of subgraphs of functions on Z = ℝ × {0, 1} × X (their subgraphs are subsets of Z × ℝ).

Lemma A.1. Let G be a VC-class of subsets of X with VC-dimension v < ∞. Let g and h be two given functions from Z to ℝ. Then the set of functions from Z to ℝ,

F = { f_G(z) = g(z) · 1{x ∈ G} + h(z) · 1{x ∉ G} : G ∈ G },

is a VC-subgraph class of functions with VC-dimension less than or equal to v.

Proof. Let z_i = (y_i, d_i, x_i). By assumption, no set of (v + 1) points in X can be shattered by G. Take an arbitrary set of (v + 1) points in Z × ℝ, A = {(z₁, t₁), ..., (z_{v+1}, t_{v+1})}. Denote the collection of subgraphs of F by SG(F) ≡ {SG(f_G) : G ∈ G}. We want to show that SG(F) does not shatter A. If for some i ∈ {1, ..., (v+1)}, (z_i, t_i) ∈ SG(g) ∩ SG(h), then SG(F) cannot pick out all of the subsets of A because the i-th point is included in every S ∈ SG(F). Similarly, if for some i ∈ {1, . . .
, (v+1)}, (z_i, t_i) ∈ SG(g)^c ∩ SG(h)^c, then point i cannot be included in any S ∈ SG(F). The remaining case is that, for each i, either (z_i, t_i) ∈ SG(g) ∩ SG(h)^c or (z_i, t_i) ∈ SG(g)^c ∩ SG(h) holds. Indicate the former case by δ_i = 0 and the latter case by δ_i = 1. The points with δ_i = 0 can be picked out by SG(f_G) if and only if x_i ∉ G. The points with δ_i = 1 can be picked out if and only if x_i ∈ G. Given that G is a VC-class with VC-dimension v, there exists a subset X₀ of {x₁, ..., x_{v+1}} such that X₀ ≠ ({x₁, ..., x_{v+1}} ∩ G) for any G ∈ G. Then there can be no set S ∈ SG(F) that picks out the (possibly empty) set

{(z_i, t_i) : (x_i ∈ X₀ and δ_i = 1) or (x_i ∉ X₀ and δ_i = 0)},   (A.1)

because this set of points could only be picked out by SG(f_G) if ({x₁, ..., x_{v+1}} ∩ G) = X₀. Hence, F is a VC-subgraph class of functions with VC-dimension less than or equal to v.

In addition to the notations introduced in the main text, the following notations are used throughout the appendix. The empirical probability distribution based on an iid size-n sample of Z_i = (Y_i, D_i, X_i) is denoted by P_n. The L₂(P) metric for f is denoted by ||f||_{L₂(P)} = (∫_Z f² dP)^{1/2}, and the sup-metric of f is denoted by ||f||_∞. Positive constants that depend only on the class of data generating processes, not on the sample size or the VC-dimension, are denoted by c₁, c₂, c₃, .... Universal constants are denoted by the capital letters C₁, C₂, ....

In what follows, we present lemmas that will be used in the proofs of Theorems 2.1 and 2.3. Lemmas A.2 and A.3 are classical inequalities whose proofs can be found, for instance, in Lugosi (2002).

Lemma A.2. Hoeffding's Lemma: let X be a random variable with EX = 0, a ≤ X ≤ b. Then, for s > 0,

E[e^{sX}] ≤ e^{s²(b−a)²/8}.

Lemma A.3. Let λ > 0, n ≥ 2, and let Y₁, ..., Y_n be real-valued random variables such that for all s > 0 and 1 ≤ i ≤ n, E(e^{sY_i}) ≤ e^{s²λ²/2} holds.
Then,

(i) E[max_{i≤n} Y_i] ≤ λ √(2 ln n),
(ii) E[max_{i≤n} |Y_i|] ≤ λ √(2 ln(2n)).

The next two lemmas give maximal inequalities that bound the mean of the supremum of a centered empirical process indexed by a VC-subgraph class of functions. The first maximal inequality (Lemma A.4) is standard in the empirical process literature, and it yields our Theorem 2.1 as a corollary. Though its proof can be found elsewhere (e.g., Dudley (1999), van der Vaart and Wellner (1996)), we present it here for the sake of completeness and for later reference in the proof of Lemma A.5. The second maximal inequality (Lemma A.5) concerns a class of functions whose diameter is constrained in the L₂(P)-norm. Lemma A.5 will be used in the proof of Theorem 2.3. A lemma similar to our Lemma A.5 appears in Massart and Nédélec (2006, Lemma A.3).

Lemma A.4. Let F be a class of uniformly bounded functions, i.e., there exists F̄ < ∞ such that ||f||_∞ ≤ F̄ for all f ∈ F. Assume that F is a VC-subgraph class with VC-dimension v < ∞. Then, there is a universal constant C₁ such that

E_{P^n}[ sup_{f∈F} |E_n(f) − E_P(f)| ] ≤ C₁ F̄ √(v/n)

holds for all n ≥ 1.

Proof. Introduce (Z'₁, ..., Z'_n), an independent copy of (Z₁, ..., Z_n) ∼ P^n. We denote the probability law of (Z'₁, ..., Z'_n) by P^{n'}, its expectation by E_{P^{n'}}(·), and the sample average with respect to (Z'₁, ..., Z'_n) by E'_n(·). Define iid Rademacher variables σ₁, ..., σ_n such that Pr(σ₁ = −1) = Pr(σ₁ = 1) = 1/2 and they are independent of Z₁, Z'₁, ..., Z_n, Z'_n. Then,

E_{P^n}[ sup_{f∈F} |E_n(f) − E_P(f)| ]
 = E_{P^n}[ sup_{f∈F} |E_{P^{n'}}(E_n(f) − E'_n(f) | Z₁, ..., Z_n)| ]
 ≤ E_{P^n}[ sup_{f∈F} E_{P^{n'}}(|E_n(f) − E'_n(f)| | Z₁, ..., Z_n) ]   (∵ Jensen's inequality)
 ≤ E_{P^n,P^{n'}}[ sup_{f∈F} |E_n(f) − E'_n(f)| ]
 = (1/n) E_{P^n,P^{n'}}[ sup_{f∈F} |Σ_{i=1}^n (f(Z_i) − f(Z'_i))| ]
 = (1/n) E_{P^n,P^{n'},σ}[ sup_{f∈F} |Σ_{i=1}^n σ_i (f(Z_i) − f(Z'_i))| ]   (∵ f(Z_i) − f(Z'_i) ∼ σ_i (f(Z_i) − f(Z'_i)) for all i)
 ≤ (1/n) E_{P^n,P^{n'},σ}[ sup_{f∈F} |Σ_{i=1}^n σ_i f(Z_i)| + sup_{f∈F} |Σ_{i=1}^n σ_i f(Z'_i)| ]
 = (2/n) E_{P^n,σ}[ sup_{f∈F} |Σ_{i=1}^n σ_i f(Z_i)| ]
 = (2/n) E_{P^n}[ E_σ( sup_{f∈F} |Σ_{i=1}^n σ_i f(Z_i)| | Z₁, ..., Z_n ) ].   (A.2)

Fix Z₁, ..., Z_n, and define f ≡ (f(Z₁), ..., f(Z_n)) = (f₁, ..., f_n), the vector of length n recording the values of f ∈ F at each of (Z₁, ..., Z_n). Let F = {f : f ∈ F} ⊂ ℝⁿ, which is a bounded set in ℝⁿ with radius F̄, since F is a set of uniformly bounded functions with |f(·)| ≤ F̄. Introduce the Euclidean norm on F,

ρ(f, f') = ( (1/n) Σ_{i=1}^n (f_i − f'_i)² )^{1/2}.

Let f^(0) = (0, ..., 0), and let f* = (f₁*, ..., f_n*) be a random element of F maximizing |Σ_{i=1}^n σ_i f_i|. Let B₀ = {f^(0)} and construct B_k, k = 1, ..., K̄, a sequence of covers of F such that B_k ⊂ F is a minimal cover with radius 2^{−k} F̄ and B_{K̄} = F. Note that such K̄ < ∞ exists at given n and (Z₁, ..., Z_n). Define also f^(k) ∈ B_k, k = 1, ..., K̄, to be a random sequence such that f^(k) ∈ argmin_{f∈B_k} ρ(f, f*). Since B_k is a cover with radius 2^{−k} F̄, ρ(f^(k), f*) ≤ 2^{−k} F̄ holds. In addition, we have ρ(f^(k−1), f^(k)) ≤ ρ(f^(k), f*) + ρ(f^(k−1), f*) ≤ 3 · 2^{−k} F̄. By a telescoping sum,

Σ_{i=1}^n σ_i f_i* = Σ_{i=1}^n σ_i f_i^(0) + Σ_{k=1}^{K̄} Σ_{i=1}^n σ_i (f_i^(k) − f_i^(k−1)) = Σ_{k=1}^{K̄} Σ_{i=1}^n σ_i (f_i^(k) − f_i^(k−1)).

We hence obtain

E_σ |Σ_{i=1}^n σ_i f_i*| ≤ Σ_{k=1}^{K̄} E_σ |Σ_{i=1}^n σ_i (f_i^(k) − f_i^(k−1))|
 ≤ Σ_{k=1}^{K̄} E_σ [ max_{f∈B_k, g∈B_{k−1}: ρ(f,g) ≤ 3·2^{−k}F̄} |Σ_{i=1}^n σ_i (f_i − g_i)| ].   (A.3)

We apply Lemma A.2 to obtain

E_σ[ e^{s Σ_{i=1}^n σ_i (f_i − g_i)} ] = Π_{i=1}^n E_{σ_i}[ e^{s σ_i (f_i − g_i)} ] ≤ Π_{i=1}^n e^{s² (f_i − g_i)² / 2} = exp( s² n ρ²(f, g) / 2 ) ≤ exp( s² n (3 · 2^{−k} F̄)² / 2 ).
An application of Lemma A.3 (ii) with $\lambda=3\sqrt n\cdot2^{-k}\bar F$ and $N=|B_k||B_{k-1}|\le|B_k|^2$ then yields
\begin{align*}
E_\sigma\left[\max_{\mathbf f\in B_k,\,\mathbf g\in B_{k-1}:\,\rho(\mathbf f,\mathbf g)\le3\cdot2^{-k}\bar F}\left|\sum_{i=1}^n\sigma_i(f_i-g_i)\right|\right]
&\le3\sqrt n\cdot2^{-k}\bar F\sqrt{2\ln\big(2|B_k|^2\big)}\\
&=3\sqrt n\cdot2^{-k}\bar F\sqrt{2\ln\big(2N(2^{-k}\bar F,F,\rho)^2\big)}\\
&=6\sqrt n\cdot2^{-k}\bar F\sqrt{\ln\big(2^{1/2}N(2^{-k}\bar F,F,\rho)\big)},
\end{align*}
where $N(r,F,\rho)$ is the covering number of $F$ with radius $r$ in terms of the norm $\rho$. Accordingly,
\begin{align*}
E_\sigma\left|\sum_{i=1}^n\sigma_i f_i^*\right|&\le\sum_{k=1}^{\bar K}6\sqrt n\cdot2^{-k}\bar F\sqrt{\ln\big(2^{1/2}N(2^{-k}\bar F,F,\rho)\big)}\\
&\le12\sqrt n\sum_{k=1}^\infty2^{-(k+1)}\bar F\sqrt{\ln\big(2^{1/2}N(2^{-k}\bar F,F,\rho)\big)}\\
&\le12\sqrt n\int_0^1\bar F\sqrt{\ln\big(2^{1/2}N(\epsilon\bar F,F,\rho)\big)}\,d\epsilon,\tag{A.4}
\end{align*}
where the last line follows from the fact that $N(\epsilon\bar F,F,\rho)$ is decreasing in $\epsilon$. To bound (A.4) from above, we apply a uniform entropy bound for the covering number. In Theorem 2.6.7 of van der Vaart and Wellner (1996), by setting $r=2$ and $Q$ at the empirical probability measure of $(Z_1,\dots,Z_n)$, we have
\[
N(\epsilon\bar F,F,\rho)\le K(v+1)(16e)^{(v+1)}\left(\frac1\epsilon\right)^{2v},\tag{A.5}
\]
where $K>0$ is a universal constant. Plugging this into (A.4) leads to
\begin{align*}
E_\sigma\left|\sum_{i=1}^n\sigma_i f_i^*\right|&\le12\bar F\sqrt n\int_0^1\sqrt{\ln(2^{1/2}K)+\ln(v+1)+(v+1)\ln(16e)-2v\ln\epsilon}\,d\epsilon\\
&\le12\bar F\sqrt{nv}\int_0^1\sqrt{\ln(2^{1/2}K)+\ln2+2\ln(16e)-2\ln\epsilon}\,d\epsilon\\
&=C^\prime\bar F\sqrt{nv},\tag{A.6}
\end{align*}
where $C^\prime=12\int_0^1\sqrt{\ln(2^{1/2}K)+\ln2+2\ln(16e)-2\ln\epsilon}\,d\epsilon<\infty$. Combining (A.6) with (A.2) and setting $C_1=2C^\prime$ leads to the conclusion. $\square$

Lemma A.5. Let $\mathcal F$ be a class of uniformly bounded functions with $\|f\|_\infty\le\bar F<\infty$ for all $f\in\mathcal F$. Assume that $\mathcal F$ is a VC-subgraph class with VC-dimension $v<\infty$. Assume further that $\sup_{f\in\mathcal F}\|f\|_{L^2(P)}\le\delta$. Then, there exists a positive universal constant $C_2$ such that
\[
E_{P^n}\left[\sup_{f\in\mathcal F}\big(E_n(f)-E_P(f)\big)\right]\le C_2\,\delta\sqrt{\frac vn}
\]
holds for all $n\ge C_1\bar F^2v/\delta^2$, where $C_1$ is the universal constant defined in Lemma A.4.

Proof. By the same symmetrization argument and the same use of Rademacher variables as in the proof of Lemma A.4, we have
\[
E_{P^n}\left[\sup_{f\in\mathcal F}\big(E_n(f)-E_P(f)\big)\right]\le\frac2nE_{P^n}\left\{E_\sigma\left[\sup_{f\in\mathcal F}\sum_{i=1}^n\sigma_if(Z_i)\,\Big|\,Z_1,\dots,Z_n\right]\right\}.\tag{A.7}
\]
Fix the values of $Z_1,\dots,Z_n$, and define $\mathbf f$, $\mathbf f^{(0)}$, $F$, and the norm $\rho(\mathbf f,\mathbf f^\prime)$ as in the proof of Lemma A.4. Let $\mathbf f^*$ be a maximizer of $\left|\sum_{i=1}^n\sigma_if(Z_i)\right|$ in $F$ and let $\delta_n=\sup_{f\in\mathcal F}\rho(\mathbf f^{(0)},\mathbf f)\le\bar F$. Let $B_0=\{\mathbf f^{(0)}\}$ and construct $\{B_k:k=1,\dots,\bar K\}$, a sequence of covers of $F$, such that $B_k\subset F$ is a minimal cover with radius $2^{-k}\delta_n$ and $B_{\bar K}=F$. We define $\{\mathbf f^{(k)}\in B_k:k=1,\dots,\bar K\}$ to be a random sequence such that $\mathbf f^{(k)}\in\arg\min_{\mathbf f\in B_k}\rho(\mathbf f,\mathbf f^*)$. By applying the chaining argument in the proof of Lemma A.4, Lemma A.3 (i), and the uniform bound of the covering number (A.5), we obtain
\[
E_\sigma\left|\sum_{i=1}^n\sigma_if_i^*\right|\le12\sqrt n\int_0^1\delta_n\sqrt{\log N(\epsilon\delta_n,F,\rho)}\,d\epsilon\le C^\prime\delta_n\sqrt{nv},
\]
for the universal constant $C^\prime$ defined in the proof of Lemma A.4. Hence, from (A.7), we have
\[
E_{P^n}\left[\sup_{f\in\mathcal F}\big(E_n(f)-E_P(f)\big)\right]\le C_1E_{P^n}(\delta_n)\sqrt{\frac vn}
=C_1\sqrt{\frac vn}\,E_{P^n}\left[\left(\sup_{f\in\mathcal F}E_n(f^2)\right)^{1/2}\right]
\le C_1\sqrt{\frac vn}\left[E_{P^n}\left(\sup_{f\in\mathcal F}E_n(f^2)\right)\right]^{1/2}.\tag{A.8}
\]
Note that $E_n(f^2)$ is bounded by
\begin{align*}
E_n(f^2)&=E_n(f^2)-E_P(f^2)+E_P(f^2)\\
&=E_n\left[\big(f-\|f\|_{L^2(P)}\big)\big(f+\|f\|_{L^2(P)}\big)\right]+\|f\|_{L^2(P)}^2\\
&\le2\bar FE_n\left[f-\|f\|_{L^2(P)}\right]+\|f\|_{L^2(P)}^2\\
&\le2\bar FE_n\left[f-E_P(f)\right]+\|f\|_{L^2(P)}^2,\qquad\big(\because\ \|f\|_{L^2(P)}\ge E_P(f)\ \text{by the Cauchy--Schwarz inequality}\big)
\end{align*}
Combining this inequality with (A.8) yields
\[
E_{P^n}\left[\sup_{f\in\mathcal F}\big(E_n(f)-E_P(f)\big)\right]\le C_1\sqrt{\frac vn}\left\{2\bar FE_{P^n}\left[\sup_{f\in\mathcal F}\big(E_n(f)-E_P(f)\big)\right]+\delta^2\right\}^{1/2}.
\]
Solving this inequality for $E_{P^n}\left[\sup_{f\in\mathcal F}\big(E_n(f)-E_P(f)\big)\right]$ leads to
\[
E_{P^n}\left[\sup_{f\in\mathcal F}\big(E_n(f)-E_P(f)\big)\right]\le\bar FC_1\left[\frac vn+\sqrt{\frac vn}\sqrt{\frac vn+\frac{\delta^2}{\bar F^2C_1}}\right].
\]
For $\frac vn\le\frac{\delta^2}{\bar F^2C_1}$, that is, $n\ge\frac{C_1\bar F^2v}{\delta^2}$, the upper bound can be further bounded by $(1+\sqrt2)\sqrt{C_1}\,\delta\sqrt{v/n}$, so the conclusion of the lemma follows with $C_2=(1+\sqrt2)\sqrt{C_1}$. $\square$

The next lemma is a version of Bousquet's concentration inequality (Bousquet, 2002).

Lemma A.6. Let $\mathcal F$ be a countable family of measurable functions such that $\sup_{f\in\mathcal F}E_P(f^2)\le\delta^2$ and $\sup_{f\in\mathcal F}\|f\|_\infty\le\bar F$ for some constants $\delta^2$ and $\bar F$. Let $S=\sup_{f\in\mathcal F}\big(E_n(f)-E_P(f)\big)$. Then, for every positive $t$,
\[
P^n\left(S-E_{P^n}(S)\ge\sqrt{\frac{2t\left[\delta^2+4\bar FE_{P^n}(S)\right]}{n}}+\frac{2\bar Ft}{3n}\right)\le\exp(-t).
\]

A.2 Proofs of Theorems 2.1 and 2.2

Proof of Theorem 2.1. Define
\[
f(Z_i;G)=\frac{Y_iD_i}{e(X_i)}\cdot1\{X_i\in G\}+\frac{Y_i(1-D_i)}{1-e(X_i)}\cdot1\{X_i\notin G\},
\]
and the class of functions on $\mathcal Z$, $\mathcal F=\{f(\cdot;G):G\in\mathcal G\}$. With these notations, we can express inequality (2.2) as
\[
W_{\mathcal G}^*-W(\hat G_{EWM})\le2\sup_{f\in\mathcal F}|E_n(f)-E_P(f)|.\tag{A.9}
\]
Note that Assumption 2.1 (BO) and (SO) imply that $\mathcal F$ has uniform envelope $\bar F=M/\kappa$. Also, by Assumption 2.1 (VC) and Lemma A.1, $\mathcal F$ is a VC-subgraph class of functions with VC-dimension at most $v$. We apply Lemma A.4 to (A.9) to obtain
\[
E_{P^n}\left[W_{\mathcal G}^*-W(\hat G_{EWM})\right]\le C_1\frac M\kappa\sqrt{\frac vn}.
\]
Since this upper bound does not depend on $P\in\mathcal P(M,\kappa)$, the upper bound is uniform over $\mathcal P(M,\kappa)$. $\square$

Proof of Theorem 2.2. In obtaining the rate lower bound, we normalize the support of the outcomes to $Y_{1i},Y_{0i}\in\left[-\frac12,\frac12\right]$. That is, we focus on bounding $\sup_{P\in\mathcal P(1,\kappa)}E_{P^n}\left[W_{\mathcal G}^*-W(\hat G_n)\right]$. The lower bound for the original welfare loss $\sup_{P\in\mathcal P(M,\kappa)}E_{P^n}\left[W_{\mathcal G}^*-W(\hat G_n)\right]$ is obtained by multiplying the lower bound of $\sup_{P\in\mathcal P(1,\kappa)}E_{P^n}\left[W_{\mathcal G}^*-W(\hat G_n)\right]$ by $M$.

We consider a suitable subclass $\mathcal P^*\subset\mathcal P(1,\kappa)$, for which the worst-case welfare loss can be bounded from below by a distribution-free term that converges at rate $n^{-1/2}$. The construction of $\mathcal P^*$ proceeds as follows. First, let $x_1,\dots,x_v\in\mathcal X$ be $v$ points that are shattered by $\mathcal G$. We constrain the marginal distribution of $X$ to be supported only on $(x_1,\dots,x_v)$. We put mass $p$ at $x_j$, $j<v$, and mass $1-(v-1)p$ at $x_v$. The thus-constructed marginal distribution of $X$ is common across $\mathcal P^*$. Let the distribution of the treatment indicator $D$ be independent of $(Y_1,Y_0,X)$, and let $D$ follow the Bernoulli distribution with $\Pr(D=1)=1/2$. Let $b=(b_1,\dots,b_{v-1})\in\{0,1\}^{v-1}$ be a bit vector used to index a member of $\mathcal P^*$, i.e., $\mathcal P^*$ consists of a finite number of DGPs. For each $j=1,\dots,(v-1)$, and depending on $b$, construct the following conditional distribution of $Y_1$ given $X=x_j$: if $b_j=1$,
\[
Y_1=\begin{cases}\ \ \tfrac12&\text{with prob. }\tfrac12+\gamma,\\ -\tfrac12&\text{with prob. }\tfrac12-\gamma,\end{cases}\tag{A.10}
\]
and, if $b_j=0$,
\[
Y_1=\begin{cases}\ \ \tfrac12&\text{with prob. }\tfrac12-\gamma,\\ -\tfrac12&\text{with prob. }\tfrac12+\gamma,\end{cases}\tag{A.11}
\]
where $\gamma\in\left(0,\frac12\right)$ is chosen properly in a later step of the proof. For $j=v$, the conditional distribution of $Y_1$ given $X=x_v$ is degenerate at $Y_1=0$. As for $Y_0$'s conditional distribution, we consider the degenerate distribution at $Y_0=0$ at every $X=x_j$, $j=1,\dots,v$. That is, when $b_j=1$, $\tau(x_j)=\gamma$, and when $b_j=0$, $\tau(x_j)=-\gamma$. Each $b\in\{0,1\}^{v-1}$ induces a unique joint distribution of $(Y_1,Y_0,D,X)\sim P_b$ and, clearly, $P_b\in\mathcal P(1,\kappa)$. We accordingly define $\mathcal P^*=\left\{P_b:b\in\{0,1\}^{v-1}\right\}$.

With knowledge of $P_b\in\mathcal P^*$, the optimal treatment assignment rule is $G_b^*=\{x_j:j<v,\ b_j=1\}$, which is feasible, $G_b^*\in\mathcal G$, by the construction of the support points of $X$. The maximized social welfare is
\[
W(G_b^*)=p\gamma\sum_{j=1}^{v-1}b_j.
\]
Let $\hat G$ be an arbitrary treatment choice rule as a function of $(Z_1,\dots,Z_n)$, and let $\hat b\in\{0,1\}^{(v-1)}$ be a binary vector whose $j$-th element is $\hat b_j=1\{x_j\in\hat G\}$. Consider $\pi(b)$, a prior distribution for $b$ such that $b_1,\dots,b_{v-1}$ are iid and $b_1\sim\mathrm{Ber}(1/2)$. The welfare loss satisfies the following inequalities:
\begin{align*}
\sup_{P\in\mathcal P(1,\kappa)}E_{P^n}\left[W_{\mathcal G}^*-W(\hat G)\right]&\ge\sup_{P_b\in\mathcal P^*}E_{P_b^n}\left[W(G_b^*)-W(\hat G)\right]\\
&\ge\int_bE_{P_b^n}\left[W(G_b^*)-W(\hat G)\right]d\pi(b)\\
&=\gamma\int_bE_{P_b^n}\left[P_X\big(G_b^*\triangle\hat G\big)\right]d\pi(b)\\
&=\gamma\int_b\int_{Z_1,\dots,Z_n}P_X\left\{b(X)\ne\hat b(X)\right\}dP^n(Z_1,\dots,Z_n|b)\,d\pi(b)\\
&\ge\inf_{\hat G}\ \gamma\int_b\int_{Z_1,\dots,Z_n}P_X\left\{b(X)\ne\hat b(X)\right\}dP^n(Z_1,\dots,Z_n|b)\,d\pi(b),
\end{align*}
where $b(X)$ and $\hat b(X)$ are the elements of $b$ and $\hat b$ such that $b(x_j)=b_j$ and $\hat b(x_j)=\hat b_j$. We define $b(x_v)=\hat b(x_v)=0$. Note that the last expression can be seen as the minimized Bayes risk with the loss function corresponding to the classification error for predicting the binary unknown random variable $b(X)$. Hence, the minimum of the Bayes risk is attained by the Bayes classifier,
\[
\hat G^*=\left\{x_j:\pi(b_j=1|Z_1,\dots,Z_n)\ge\frac12,\ j<v\right\},
\]
where $\pi(b_j|Z_1,\dots,Z_n)$ is the posterior of $b_j$.
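As an illustration (ours, not the authors'), the Bayes classifier above thresholds the posterior of $b_j$ at $1/2$. Under the two-point outcome distributions (A.10)–(A.11) and the flat prior, this posterior depends on the data only through the counts of $+1/2$ and $-1/2$ outcomes among the treated observations at $x_j$; the following sketch computes it and confirms that the classifier reduces to a majority vote. The helper function and parameter values are illustrative.

```python
from math import isclose

def posterior_b1(kp, km, gamma):
    """Posterior Pr(b_j = 1 | data) under a Ber(1/2) prior, when Y1 = +1/2
    w.p. 1/2 + gamma if b_j = 1 and w.p. 1/2 - gamma if b_j = 0;
    kp/km = counts of +1/2 and -1/2 outcomes among treated observations at x_j."""
    like1 = (0.5 + gamma) ** kp * (0.5 - gamma) ** km  # likelihood under b_j = 1
    like0 = (0.5 - gamma) ** kp * (0.5 + gamma) ** km  # likelihood under b_j = 0
    return like1 / (like1 + like0)

gamma = 0.1
a = (1 + 2 * gamma) / (1 - 2 * gamma)
for kp, km in [(0, 0), (3, 1), (1, 4), (7, 7), (10, 2)]:
    p = posterior_b1(kp, km, gamma)
    # the Bayes classifier is a majority vote on the sign of k+ - k-
    assert (p >= 0.5) == (kp >= km)
    # the misclassification term has the closed form 1 / (1 + a^{|k+ - k-|})
    assert isclose(min(p, 1 - p), 1 / (1 + a ** abs(kp - km)))
print("posterior checks passed")
```

The closed form checked in the last assertion is the quantity that drives the lower-bound calculation in the next step of the proof.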
The minimized Bayes risk is given by
\begin{align*}
&\gamma\int_{Z_1,\dots,Z_n}E_X\left[\min\left\{\pi(b(X)=1|Z_1,\dots,Z_n),\,1-\pi(b(X)=1|Z_1,\dots,Z_n)\right\}\right]d\tilde P^n\\
&\quad=\gamma\sum_{j=1}^{v-1}p\int_{Z_1,\dots,Z_n}\left[\min\left\{\pi(b_j=1|Z_1,\dots,Z_n),\,1-\pi(b_j=1|Z_1,\dots,Z_n)\right\}\right]d\tilde P^n,\tag{A.12}
\end{align*}
where $\tilde P^n$ is the marginal likelihood of $\{(Y_{1,i},Y_{0,i},D_i,X_i):i=1,\dots,n\}$ corresponding to the prior $\pi(b)$. For each $j=1,\dots,(v-1)$, let
\begin{align*}
K_j&=\#\{i:X_i=x_j\},\\
k_j^+&=\#\left\{i:X_i=x_j,\ Y_iD_i=\tfrac12\right\},\\
k_j^-&=\#\left\{i:X_i=x_j,\ Y_iD_i=-\tfrac12\right\}.
\end{align*}
The posterior for $b_j=1$ can be written as
\[
\pi(b_j=1|Z_1,\dots,Z_n)=\begin{cases}\ \frac12&\text{if }\#\{i:X_i=x_j,\ D_i=1\}=0,\\[6pt]
\dfrac{\left(\frac12+\gamma\right)^{k_j^+}\left(\frac12-\gamma\right)^{k_j^-}}{\left(\frac12+\gamma\right)^{k_j^+}\left(\frac12-\gamma\right)^{k_j^-}+\left(\frac12+\gamma\right)^{k_j^-}\left(\frac12-\gamma\right)^{k_j^+}}&\text{otherwise.}\end{cases}
\]
Hence,
\begin{align*}
\min\left\{\pi(b_j=1|Z_1,\dots,Z_n),\,1-\pi(b_j=1|Z_1,\dots,Z_n)\right\}
&=\frac{\min\left\{1,\left(\frac{\frac12+\gamma}{\frac12-\gamma}\right)^{k_j^--k_j^+}\right\}}{1+\left(\frac{\frac12+\gamma}{\frac12-\gamma}\right)^{k_j^--k_j^+}}\\
&=\frac1{1+a^{\left|k_j^+-k_j^-\right|}},\qquad\text{where }a=\frac{1+2\gamma}{1-2\gamma}>1.\tag{A.13}
\end{align*}
Since $k_j^+-k_j^-=\sum_{i:X_i=x_j}2Y_iD_i$, plugging (A.13) into (A.12) yields
\begin{align*}
&\gamma\sum_{j=1}^{v-1}p\,E_{\tilde P^n}\left[\frac1{1+a^{\left|\sum_{i:X_i=x_j}2Y_iD_i\right|}}\right]\\
&\quad\ge\frac\gamma2\sum_{j=1}^{v-1}p\,E_{\tilde P^n}\left[a^{-\left|\sum_{i:X_i=x_j}2Y_iD_i\right|}\right]\\
&\quad\ge\frac\gamma2\sum_{j=1}^{v-1}p\,a^{-E_{\tilde P^n}\left|\sum_{i:X_i=x_j}2Y_iD_i\right|},
\end{align*}
where $E_{\tilde P^n}(\cdot)$ is the expectation with respect to the marginal likelihood of $\{(Y_{1,i},Y_{0,i},D_i,X_i):i=1,\dots,n\}$. The second line follows from $a>1$, and the third line follows by Jensen's inequality. Given our prior specification for $b$, the marginal distribution of $Y_{1,i}$ is $\Pr(Y_{1,i}=1/2)=\Pr(Y_{1,i}=-1/2)=1/2$, so
\[
E_{\tilde P^n}\left|\sum_{i:X_i=x_j}2Y_iD_i\right|=E_{\tilde P^n}\left|\sum_{i:X_i=x_j,\,D_i=1}2Y_{1,i}\right|=\sum_{k=0}^n\binom nk\left(\frac p2\right)^k\left(1-\frac p2\right)^{n-k}E\left|B\left(k,\tfrac12\right)-\frac k2\right|
\]
holds, where $B(k,\frac12)$ is a random variable following the binomial distribution with parameters $k$ and $\frac12$. By noting
\[
E\left|B\left(k,\tfrac12\right)-\frac k2\right|\le\sqrt{E\left[B\left(k,\tfrac12\right)-\frac k2\right]^2}=\sqrt{\frac k4},\qquad(\because\ \text{Cauchy--Schwarz inequality})
\]
we obtain
\[
E_{\tilde P^n}\left|\sum_{i:X_i=x_j}2Y_iD_i\right|\le\sum_{k=0}^n\binom nk\left(\frac p2\right)^k\left(1-\frac p2\right)^{n-k}\sqrt{\frac k4}=E\left[\sqrt{\frac{B(n,\frac p2)}4}\right]\le\sqrt{\frac{np}8}.\qquad(\because\ \text{Jensen's inequality})
\]
Hence, the Bayes risk is bounded from below by
\begin{align*}
\frac\gamma2p(v-1)a^{-\sqrt{np/8}}&\ge\frac\gamma2p(v-1)e^{-(a-1)\sqrt{np/8}}\qquad(\because\ 1+x\le e^x\ \forall x)\\
&=\frac{p\gamma}2(v-1)e^{-\frac{4\gamma}{1-2\gamma}\sqrt{np/8}}.\tag{A.14}
\end{align*}
This lower bound for the Bayes risk has the slowest convergence rate when $\gamma$ is set proportional to $n^{-1/2}$. Specifically, let $\gamma=\sqrt{\frac{v-1}n}$. Since $(v-1)^{-1}\ge p\ge v^{-1}$, we have
\begin{align*}
\frac{p\gamma}2(v-1)e^{-\frac{4\gamma}{1-2\gamma}\sqrt{np/8}}&\ge\frac12\sqrt{\frac{v-1}n}\left(1-\frac1v\right)\exp\left\{-\frac{\sqrt2}{1-2\gamma}\right\}\\
&\ge\frac14\sqrt{\frac{v-1}n}\exp\left\{-2\sqrt2\right\},\qquad\text{if }1-2\gamma\ge\frac12.
\end{align*}
The condition $1-2\gamma\ge\frac12$ is equivalent to $n\ge16(v-1)$. This completes the proof. $\square$

A.3 Proofs of Theorems 2.3 and 2.4

In proving Theorem 2.3, it is convenient to work with the normalized welfare difference,
\[
d(G,G^\prime)=\frac\kappa M\left[W(G)-W(G^\prime)\right],
\]
and its sample analogue
\[
d_n(G,G^\prime)=\frac\kappa M\left[W_n(G)-W_n(G^\prime)\right].\tag{A.15}
\]
By Assumption 2.1 (BO) and (SO), both $d(G,G^\prime)$ and $d_n(G,G^\prime)$ are bounded in $[-1,1]$, and the normalized welfare difference relates to the original welfare loss of decision set $G$ as
\[
d(G_{FB}^*,G)=\frac\kappa M\left[W(G_{FB}^*)-W(G)\right]\in[0,1].\tag{A.16}
\]
Hence, the welfare-loss upper bound for $\hat G_{EWM}$ can be obtained by multiplying the upper bound of $d(G_{FB}^*,\hat G_{EWM})$ by $M/\kappa$. Note that $d(G_{FB}^*,G)$ can be bounded from above by $P_X(G_{FB}^*\triangle G)$, since
\[
d(G_{FB}^*,G)=\frac\kappa M\int_{G_{FB}^*\triangle G}|\tau(X)|\,dP_X\le\kappa P_X(G_{FB}^*\triangle G)\le P_X(G_{FB}^*\triangle G).\tag{A.17}
\]
On the other hand, with Assumption 2.2 (MA) imposed, $P_X(G_{FB}^*\triangle G)$ can be bounded from above by a function of $d(G_{FB}^*,G)$, as the next lemma shows. We borrow this lemma from Tsybakov (2004).

Lemma A.7. Suppose Assumption 2.2 (MA) holds with margin coefficient $\alpha\in(0,\infty)$. Then
\[
P_X(G_{FB}^*\triangle G)\le c_1(M,\kappa,\eta,\alpha)\,d(G_{FB}^*,G)^{\frac\alpha{1+\alpha}}
\]
holds for all $G\in\mathcal G$, where $c_1(M,\kappa,\eta,\alpha)=\left(\frac M{\kappa\eta\alpha}\right)^{\frac\alpha{1+\alpha}}(1+\alpha)$.

Proof. Let $A=\{x:|\tau(x)|>t\}$ and consider the following inequalities:
\begin{align*}
W(G_{FB}^*)-W(G)&=\int_{G_{FB}^*\triangle G}|\tau(x)|\,dP_X\\
&\ge\int_{G_{FB}^*\triangle G}|\tau(x)|\,1\{x\in A\}\,dP_X\\
&\ge tP_X\big((G_{FB}^*\triangle G)\cap A\big)\\
&\ge t\left[P_X(G_{FB}^*\triangle G)-P_X(A^c)\right]\\
&\ge t\left[P_X(G_{FB}^*\triangle G)-\left(\frac t\eta\right)^\alpha\right],
\end{align*}
where the final line uses the margin condition.
The right-hand side is maximized at $t=\eta(1+\alpha)^{-\frac1\alpha}\left[P_X(G_{FB}^*\triangle G)\right]^{\frac1\alpha}\le\eta$, so it holds that
\[
W(G_{FB}^*)-W(G)\ge\frac\alpha{(1+\alpha)^{\frac{1+\alpha}\alpha}}\,\eta\left[P_X(G_{FB}^*\triangle G)\right]^{\frac{1+\alpha}\alpha}.
\]
This, in turn, implies
\[
P_X(G_{FB}^*\triangle G)\le\left(\frac M{\kappa\eta\alpha}\right)^{\frac\alpha{1+\alpha}}(1+\alpha)\,d(G_{FB}^*,G)^{\frac\alpha{1+\alpha}}.\ \square
\]

Proof of Theorem 2.3. Let $a=\sqrt{kt}\,\epsilon_n$ with $k\ge1$, $t\ge1$, and $\epsilon_n>0$, where $t\ge1$ is arbitrary, $k$ is a constant that we choose later, and $\epsilon_n$ is a sequence indexed by the sample size $n$ whose proper choice will be discussed in a later step. The normalized welfare loss can be bounded by
\[
d(G_{FB}^*,\hat G_{EWM})\le d(G_{FB}^*,\hat G_{EWM})-d_n\big(G_{FB}^*,\hat G_{EWM}\big),
\]
as $d_n\big(G_{FB}^*,\hat G_{EWM}\big)\le0$ by Assumption 2.2 (FB). Define a class of functions induced by $G\in\mathcal G$:
\begin{align*}
\mathcal H&\equiv\{h(Z_i;G):G\in\mathcal G\},\\
h(Z_i;G)&\equiv\frac\kappa M\left[\frac{Y_iD_i}{e(X_i)}-\frac{Y_i(1-D_i)}{1-e(X_i)}\right]\left[1\{X_i\in G\}-1\{X_i\in G_{FB}^*\}\right].
\end{align*}
By Assumption 2.1 (VC) and Lemma A.1, $\mathcal H$ is a VC-subgraph class with VC-dimension at most $v<\infty$ and envelope $\bar H=1$. Using $h(Z_i;G)$, we can write $d(G_{FB}^*,G)=-E_P(h(Z_i;G))$. Since $d(G_{FB}^*,G)\ge0$ for all $G\in\mathcal G$, it holds that $-E_P(h)\ge0$ for all $h\in\mathcal H$. Since we have
\[
d(G_{FB}^*,\hat G_{EWM})-d_n\big(G_{FB}^*,\hat G_{EWM}\big)=E_n\big(h(Z_i;\hat G_{EWM})\big)-E_P\big(h(Z_i;\hat G_{EWM})\big)
\]
and $d_n\big(G_{FB}^*,\hat G_{EWM}\big)\le0$, the normalized welfare loss can be bounded by
\begin{align*}
d(G_{FB}^*,\hat G_{EWM})&\le E_n\big(h(Z_i;\hat G_{EWM})\big)-E_P\big(h(Z_i;\hat G_{EWM})\big)\\
&\le V_a\left[d(G_{FB}^*,\hat G_{EWM})+a^2\right],
\end{align*}
where
\[
V_a=\sup_{h\in\mathcal H}\frac{E_n(h)-E_P(h)}{-E_P(h)+a^2}=\sup_{h\in\mathcal H}\left[E_n\left(\frac h{-E_P(h)+a^2}\right)-E_P\left(\frac h{-E_P(h)+a^2}\right)\right].
\]
On the event $\left\{V_a<\frac12\right\}$, $d(G_{FB}^*,\hat G_{EWM})\le a^2$ holds, so this implies
\[
P^n\left(d(G_{FB}^*,\hat G_{EWM})\ge a^2\right)\le P^n\left(V_a\ge\frac12\right).\tag{A.18}
\]
In what follows, our aim is to construct an exponential inequality for $P^n\left(V_a\ge\frac12\right)$ involving only $t$, and we make use of this exponential tail bound to bound $E_{P^n}\left[d(G_{FB}^*,\hat G_{EWM})\right]$. To apply Bousquet's inequality (Lemma A.6) to $V_a$, note first that
\begin{align*}
E_P\left[\left(\frac h{-E_P(h)+a^2}\right)^2\right]&\le\frac{P_X(G_{FB}^*\triangle G)}{\big(-E_P(h)+a^2\big)^2}\\
&\le\frac{c_1\left[-E_P(h)\right]^{\frac\alpha{1+\alpha}}}{\big(-E_P(h)+a^2\big)^2}\qquad\big(\because\ \text{Lemma A.7 and }d(G_{FB}^*,G)=-E_P(h(Z_i;G))\big)\\
&\le c_1\sup_{\epsilon\ge0}\frac{\epsilon^{\frac{2\alpha}{1+\alpha}}}{(\epsilon^2+a^2)^2}\\
&\le c_1\frac1{a^2}\sup_{\epsilon\ge0}\left(\frac\epsilon{\epsilon\vee a}\right)^{\frac{2\alpha}{1+\alpha}}(\epsilon\vee a)^{\frac{2\alpha}{1+\alpha}-2}\le c_1\,a^{\frac{2\alpha}{1+\alpha}-4},
\end{align*}
where $c_1$ is the constant depending only on $(M,\kappa,\eta,\alpha)$ defined in Lemma A.7. We, on the other hand, have
\[
\sup_{h\in\mathcal H}\left\|\frac h{-E_P(h)+a^2}\right\|_\infty\le\frac1{a^2}.
\]
Hence, Lemma A.6 gives, with probability larger than $1-\exp(-t)$,
\[
V_a\le E_{P^n}(V_a)+\sqrt{\frac{2t\left[c_1a^{\frac{2\alpha}{1+\alpha}-2}+4E_{P^n}(V_a)\right]}{na^2}}+\frac{2t}{3na^2}.\tag{A.19}
\]
Next, we derive an upper bound for $E_{P^n}(V_a)$ by applying the maximal inequality of Lemma A.5. Let $r>1$ be arbitrary and consider partitioning $\mathcal H$ into $\mathcal H_0,\mathcal H_1,\dots$, where $\mathcal H_0=\left\{h\in\mathcal H:-E_P(h)\le a^2\right\}$ and
\[
\mathcal H_j=\left\{h\in\mathcal H:r^{2(j-1)}a^2<-E_P(h)\le r^{2j}a^2\right\},\quad j=1,2,\dots.
\]
Then,
\begin{align*}
V_a&\le\sup_{h\in\mathcal H_0}\frac{E_n(h)-E_P(h)}{-E_P(h)+a^2}+\sum_{j\ge1}\sup_{h\in\mathcal H_j}\frac{E_n(h)-E_P(h)}{-E_P(h)+a^2}\\
&\le\frac1{a^2}\sup_{h\in\mathcal H_0}\big(E_n(h)-E_P(h)\big)+\sum_{j\ge1}\frac{\big(1+r^{2(j-1)}\big)^{-1}}{a^2}\sup_{h\in\mathcal H_j}\big(E_n(h)-E_P(h)\big)\\
&\le\frac1{a^2}\left[\sup_{-E_P(h)\le a^2}\big(E_n(h)-E_P(h)\big)+\sum_{j\ge1}\big(1+r^{2(j-1)}\big)^{-1}\sup_{-E_P(h)\le r^{2j}a^2}\big(E_n(h)-E_P(h)\big)\right].\tag{A.20}
\end{align*}
Since it holds that $\|h\|_{L^2(P)}^2\le P_X(G_{FB}^*\triangle G)\le c_1\left[-E_P(h)\right]^{\frac\alpha{1+\alpha}}$, where the latter inequality follows from Lemma A.7, $-E_P(h)\le r^{2j}a^2$ implies $\|h\|_{L^2(P)}\le c_1^{1/2}r^{\frac\alpha{1+\alpha}j}a^{\frac\alpha{1+\alpha}}$. Hence, (A.20) can be further bounded by
\[
V_a\le\frac1{a^2}\left[\sup_{\|h\|_{L^2(P)}\le c_1^{1/2}a^{\frac\alpha{1+\alpha}}}\big(E_n(h)-E_P(h)\big)+\sum_{j\ge1}\big(1+r^{2(j-1)}\big)^{-1}\sup_{\|h\|_{L^2(P)}\le c_1^{1/2}r^{\frac\alpha{1+\alpha}j}a^{\frac\alpha{1+\alpha}}}\big(E_n(h)-E_P(h)\big)\right].
\]
We apply Lemma A.5 to each supremum term and obtain
\begin{align*}
E_{P^n}(V_a)&\le C_2\frac{c_1^{1/2}}{a^2}\sqrt{\frac vn}\,a^{\frac\alpha{1+\alpha}}\sum_{j\ge0}\frac{r^{\frac\alpha{1+\alpha}j}}{1+r^{2(j-1)}}\\
&\le C_2c_1^{1/2}a^{\frac\alpha{1+\alpha}-2}\sqrt{\frac vn}\left(\frac{r^2}{1-r^{-\frac{2+\alpha}{1+\alpha}}}\right)\\
&\le c_2\,a^{\frac\alpha{1+\alpha}-2}\sqrt{\frac vn}
\end{align*}
for
\[
n\ge\frac{C_1v}{c_1a^{\frac{2\alpha}{1+\alpha}}}\quad\Longleftrightarrow\quad a\ge\left(\frac{C_1}{c_1}\right)^{\frac{1+\alpha}{2\alpha}}\left(\frac vn\right)^{\frac{1+\alpha}{2\alpha}},\tag{A.21}
\]
where $C_1$ and $C_2$ are the universal constants defined in Lemmas A.4 and A.5, and $c_2=\left[C_2c_1^{1/2}\left(\frac{r^2}{1-r^{-\frac{2+\alpha}{1+\alpha}}}\right)\right]\vee1$ is a constant greater than or equal to one that depends only on $(M,\kappa,\eta,\alpha)$, as $r>1$ is fixed.
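The elementary envelope bound $\sup_{\epsilon\ge0}\epsilon^{2\alpha/(1+\alpha)}/(\epsilon^2+a^2)^2\le a^{2\alpha/(1+\alpha)-4}$ used in the variance calculation above can be spot-checked numerically; the following sketch (ours, with arbitrary grids) evaluates the ratio on a grid and confirms it never exceeds the claimed envelope.

```python
import numpy as np

def sup_ratio(alpha, a, grid_max=1e3, grid_size=20000):
    """Grid approximation of sup over eps >= 0 of eps^q / (eps^2 + a^2)^2,
    where q = 2*alpha/(1+alpha) lies in (0, 2)."""
    q = 2 * alpha / (1 + alpha)
    eps = np.linspace(0.0, grid_max, grid_size)
    return np.max(eps ** q / (eps ** 2 + a ** 2) ** 2)

for alpha in [0.5, 1.0, 2.0, 5.0]:
    q = 2 * alpha / (1 + alpha)
    for a in [0.1, 0.5, 1.0, 3.0]:
        # the grid maximum is below the sup, which is below a^{q-4}
        assert sup_ratio(alpha, a) <= a ** (q - 4) + 1e-12
print("envelope bound verified on the grid")
```

The proof above only needs this bound up to the multiplicative constant $c_1$; the check shows the constant $1$ already suffices for the supremum itself.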
We plug this upper bound into (A.19) to obtain
\[
V_a\le c_2a^{\frac\alpha{1+\alpha}-2}\sqrt{\frac vn}+\sqrt{\frac{2t\left[c_1a^{\frac{2\alpha}{1+\alpha}-2}+4c_2a^{\frac\alpha{1+\alpha}-2}\sqrt{\frac vn}\right]}{na^2}}+\frac{2t}{3na^2}.\tag{A.22}
\]
Choose $\epsilon_n$ as the root of $c_2\sqrt{\frac vn}\,\epsilon^{\frac\alpha{1+\alpha}-2}=1$, i.e.,
\[
\epsilon_n=\left(c_2\sqrt{\frac vn}\right)^{\frac{1+\alpha}{2+\alpha}}.\tag{A.23}
\]
Note that the right-hand side of (A.22) is decreasing in $a$, and $a\ge\epsilon_n$ by construction. Hence, if $\epsilon_n$ satisfies inequality (A.21), i.e.,
\[
n\ge c_2^{-\alpha}\left(\frac{C_1}{c_1}\right)^{1+\frac\alpha2}v,
\]
which can be reduced to the innocuous restriction $n\ge1$ by inflating $c_1$, if necessary, we can substitute $\epsilon_n$ for $a$ to bound the right-hand side of (A.22). In particular, by noting
\[
c_2a^{\frac\alpha{1+\alpha}-2}\sqrt{\frac vn}\le\frac{\epsilon_n}a=\frac1{\sqrt{kt}}\le\frac1{\sqrt k}
\quad\text{and}\quad
c_1a^{\frac{2\alpha}{1+\alpha}-2}=c_1a^{2\left(\frac\alpha{1+\alpha}-2\right)}a^2\le c_1\epsilon_n^{2\left(\frac\alpha{1+\alpha}-2\right)}a^2=c_1c_2^{-2}v^{-1}n\,a^2,
\]
the right-hand side of (A.22) can be bounded by
\begin{align*}
V_a&\le\frac1{\sqrt k}+\sqrt{\frac{2\left[c_1c_2^{-2}v^{-1}n\epsilon_n^2+4\right]}{nk^2\epsilon_n^2}}+\frac2{3nk^2\epsilon_n^2}\\
&=\frac1{\sqrt k}+\sqrt{\frac{2c_1c_2^{-2}v^{-1}}{k^2}+\frac8{nk^2\epsilon_n^2}}+\frac2{3nk^2\epsilon_n^2}\\
&\le\frac1{\sqrt k}+\frac1k\sqrt{2c_1c_2^{-2}+8}+\frac2{3k}\qquad\text{for }n\epsilon_n^2\ge1.\tag{A.24}
\end{align*}
Note that the condition $n\epsilon_n^2\ge1$ used to derive the last line is valid for all $n$, since it is equivalent to $n\ge c_2^{-2(1+\alpha)}v^{-(1+\alpha)}$, which holds for all $n\ge1$ since $c_2\ge1$ and $v\ge1$. By choosing $k$ large enough that the right-hand side of (A.24) is less than $\frac12$, we can conclude
\[
P^n\left(V_a<\frac12\right)\ge1-\exp(-t).\tag{A.25}
\]
Hence, (A.18) yields
\[
P^n\left(d(G_{FB}^*,\hat G_{EWM})\ge kt\epsilon_n^2\right)\le\exp(-t)
\]
for all $t\ge1$. From this exponential bound, we obtain
\begin{align*}
E_{P^n}\left[d(G_{FB}^*,\hat G_{EWM})\right]&=\int_0^\infty P^n\left(d(G_{FB}^*,\hat G_{EWM})>t^\prime\right)dt^\prime\\
&\le\int_0^{k\epsilon_n^2}dt^\prime+\int_{k\epsilon_n^2}^\infty P^n\left(d(G_{FB}^*,\hat G_{EWM})\ge t^\prime\right)dt^\prime\\
&\le k\epsilon_n^2+k\epsilon_n^2e^{-1}\\
&=(1+e^{-1})k\,c_2^{\frac{2(1+\alpha)}{2+\alpha}}\left(\frac vn\right)^{\frac{1+\alpha}{2+\alpha}}.
\end{align*}
So, setting $c=\frac M\kappa(1+e^{-1})k\,c_2^{\frac{2(1+\alpha)}{2+\alpha}}$ leads to the conclusion. $\square$

Proof of Theorem 2.4. As in the proof of Theorem 2.2, we work with the normalized outcome support, $Y_{1i},Y_{0i}\in\left[-\frac12,\frac12\right]$. With the normalized outcomes, the constant $\eta$ of the margin assumption satisfies $\eta\le1$. Let $\alpha\in(0,\infty)$ and $\eta\in(0,1]$ be given. Similarly to the proof of Theorem 2.2, we construct a suitable subclass $\mathcal P^*\subset\mathcal P(1,\kappa,\eta,\alpha)$. Let $x_1,\dots,x_v\in\mathcal X$ be $v$ points that are shattered by $\mathcal G$, and let $\gamma$ be a positive number satisfying $\gamma\le\min\left\{\eta,\frac12\right\}$, whose proper choice will be given later. We fix the marginal distribution of $X$ at the one supported only on $(x_1,\dots,x_v)$ with probability mass function
\begin{align*}
P_X(X_i=x_j)&=\frac1{v-1}\left(\frac\gamma\eta\right)^\alpha\quad\text{for }j=1,\dots,(v-1),\quad\text{and}\\
P_X(X_i=x_v)&=1-\left(\frac\gamma\eta\right)^\alpha.
\end{align*}
The thus-constructed marginal distribution of $X$ is common across $\mathcal P^*$. As in the proof of Theorem 2.2, we specify $D$ to be independent of $(Y_1,Y_0,X)$ and to follow the Bernoulli distribution with $\Pr(D=1)=1/2$. Let $b=(b_1,\dots,b_{v-1})\in\{0,1\}^{v-1}$ be a binary vector that uniquely indexes a member of $\mathcal P^*$, and accordingly write $\mathcal P^*=\left\{P_b:b\in\{0,1\}^{v-1}\right\}$. For each $j=1,\dots,(v-1)$, we specify the conditional distribution of $Y_1$ given $X=x_j$ to be (A.10) if $b_j=1$ and (A.11) if $b_j=0$. For $j=v$, the conditional distribution of $Y_1$ given $X=x_v$ is degenerate at $Y_1=\frac12$. As for the conditional distribution of $Y_0$ given $X=x_j$, we consider the degenerate distribution at $Y_0=0$ for $j=1,\dots,(v-1)$, and the degenerate distribution at $Y_0=-\frac12$ for $X=x_v$. In this specification of $\mathcal P^*$, it holds that
\[
P_X(|\tau(X)|\le t)=\begin{cases}0&\text{for }t\in[0,\gamma),\\ \left(\frac\gamma\eta\right)^\alpha&\text{for }t\in[\gamma,\eta),\\ 1&\text{for }t\in[\eta,1],\end{cases}
\]
for every $P_b\in\mathcal P^*$. Furthermore, by the construction of the support points, for every $P_b\in\mathcal P^*$, the first-best decision rule is contained in $\mathcal G$. Hence, it holds that $\mathcal P^*\subset\mathcal P_{FB}(1,\kappa,\eta,\alpha)$.

Let $\pi(b)$ be a prior distribution for $b$ such that $b_1,\dots,b_{v-1}$ are iid and $b_1\sim\mathrm{Ber}(1/2)$. By following the same line of reasoning as used in obtaining (A.12), for an arbitrary estimated treatment choice rule $\hat G$, we obtain
\[
\sup_{P\in\mathcal P(1,\kappa,\eta,\alpha)}E_{P^n}\left[W(G^*)-W(\hat G)\right]\ge\gamma\cdot\frac1{v-1}\left(\frac\gamma\eta\right)^\alpha\sum_{j=1}^{v-1}\int_{Z_1,\dots,Z_n}\left[\min\left\{\pi(b_j=1|Z_1,\dots,Z_n),\,1-\pi(b_j=1|Z_1,\dots,Z_n)\right\}\right]d\tilde P^n.
\]
Furthermore, by repeating the same bounding arguments as in the proof of Theorem 2.2, this Bayes risk can be bounded from below by (A.14) with $p=\frac1{v-1}\left(\frac\gamma\eta\right)^\alpha$:
\[
\sup_{P\in\mathcal P(1,\kappa,\eta,\alpha)}E_{P^n}\left[W(G^*)-W(\hat G)\right]\ge\frac\gamma2\left(\frac\gamma\eta\right)^\alpha\exp\left\{-\frac{4\gamma}{1-2\gamma}\sqrt{\frac n{8(v-1)}\left(\frac\gamma\eta\right)^\alpha}\right\}.
\]
The slowest convergence rate of this lower bound is obtained by tuning $\gamma$ to converge at rate $n^{-\frac1{2+\alpha}}$. In particular, by choosing $\gamma=\eta^{\frac\alpha{2+\alpha}}\left(\frac{v-1}n\right)^{\frac1{2+\alpha}}$ and assuming $\gamma\le\frac14$, the exponential term can be bounded from below by $\exp\left\{-2\sqrt2\right\}$, so we obtain the following lower bound:
\[
\frac12\,\eta^{-\frac\alpha{2+\alpha}}\left(\frac{v-1}n\right)^{\frac{1+\alpha}{2+\alpha}}\exp\left\{-2\sqrt2\right\}.\tag{A.26}
\]
Recall that $\gamma$ is constrained to $\gamma\le\min\left\{\eta,\frac14\right\}$. This implies that the obtained bound is valid for $n\ge\left[\max\left\{\eta^{-1},4\right\}\right]^{2+\alpha}\eta^\alpha(v-1)$, whose stricter but simpler form is given by
\[
n\ge\max\left\{\eta^{-2},4^{2+\alpha}\right\}(v-1).\tag{A.27}
\]
The lower bound presented in the theorem follows by denormalizing the outcomes, i.e., multiplying (A.26) by $M$ and substituting $\eta/M$ for the $\eta$ appearing in (A.26) and (A.27). $\square$

A.4 Proof of Theorem 2.6

The next lemma gives a linearized solution to a certain polynomial inequality. We owe this lemma to Shin Kanaya (2014, personal communication). The technique of applying the mean value expansion to an implicit function defined as the root of a polynomial equation has been used in the context of bandwidth choice in Kanaya and Kristensen (2014).

Lemma A.8. Let $A\ge0$, $B\ge0$, and $X\ge0$. For any $\alpha\ge0$, $X\le AX^{\frac\alpha{1+\alpha}}+B$ implies $X\le A^{1+\alpha}+(1+\alpha)B$.

Proof. When $A=B=0$, the conclusion trivially holds. When $B>0$, $X=AX^{\frac\alpha{1+\alpha}}+B$ has a unique root, which we denote by $X^*=g(A,B)$. When $A>0$ and $B=0$, we mean by $g(A,0)$ the nonzero root of $X=AX^{\frac\alpha{1+\alpha}}$. Let $f(X,A,B)=X-AX^{\frac\alpha{1+\alpha}}-B$. By the form of the inequality, the original inequality can be equivalently written as $X\le X^*=g(A,B)$, so we aim to verify that $X^*$ is bounded from above by $A^{1+\alpha}+(1+\alpha)B$.
Consider the mean value expansion of $g(A,B)$ in $B$ at $B=0$:
\[
X^*=g(A,0)+\frac{\partial g}{\partial B}\big(A,\tilde B\big)\times B
\]
for some $0\le\tilde B\le B$. Note that $g(A,0)=A^{1+\alpha}$. In addition, by the implicit function theorem, we have, with $\tilde X=g(A,\tilde B)$,
\[
\frac{\partial g}{\partial B}\big(A,\tilde B\big)=-\frac{\frac{\partial f}{\partial B}(\tilde X,A,\tilde B)}{\frac{\partial f}{\partial X}(\tilde X,A,\tilde B)}
=\frac1{1-\frac\alpha{1+\alpha}A\tilde X^{-\frac1{1+\alpha}}}
=\frac{\tilde X}{\tilde X-\frac\alpha{1+\alpha}A\tilde X^{\frac\alpha{1+\alpha}}}
=\frac{\tilde X}{\frac{\tilde X}{1+\alpha}+\frac\alpha{1+\alpha}\tilde B}
\le1+\alpha,
\]
where the fourth equality uses $A\tilde X^{\frac\alpha{1+\alpha}}=\tilde X-\tilde B$. Hence, $X^*\le A^{1+\alpha}+(1+\alpha)B$ holds. $\square$

The next lemma provides an exponential tail probability bound for the supremum of centered empirical processes. This lemma follows from Theorem 2.14.9 in van der Vaart and Wellner (1996) combined with their Theorem 2.6.4.

Lemma A.9. Assume $\mathcal G$ is a VC-class of subsets of $\mathcal X$ with VC-dimension $v<\infty$. Let $P_{X,n}(\cdot)$ be the empirical probability distribution on $\mathcal X$ constructed from $(X_1,\dots,X_n)$ generated iid from $P_X(\cdot)$. Then,
\[
P^n\left(\sup_{G\in\mathcal G}\left|P_{X,n}(G)-P_X(G)\right|>t\right)\le C_4\left(\sqrt n\,t\right)^{2v}\exp\left(-2nt^2\right)
\]
holds for every $t>0$, where $C_4$ is a universal constant.

Proof of Theorem 2.6. We first consider the m-hybrid case. Set $\tilde G=G_{FB}^*$ in (2.4) and rewrite (2.4) in terms of the normalized welfare loss for $\hat G_{m\text{-}hybrid}$:
\begin{align*}
d(G_{FB}^*,\hat G_{m\text{-}hybrid})&\le\frac\kappa M\left[W_n^\tau(G_{FB}^*)-\hat W_n^\tau(G_{FB}^*)-W_n^\tau(\hat G_{m\text{-}hybrid})+\hat W_n^\tau\big(\hat G_{m\text{-}hybrid}\big)\right]\\
&\qquad+d(G_{FB}^*,\hat G_{m\text{-}hybrid})-d_n^\tau\big(G_{FB}^*,\hat G_{m\text{-}hybrid}\big)\\
&\le\frac\kappa M\left[\frac1n\sum_{i=1}^n\left[\tau(X_i)-\hat\tau_m(X_i)\right]\left[1\{X_i\in G_{FB}^*\}-1\big\{X_i\in\hat G_{m\text{-}hybrid}\big\}\right]\right]\\
&\qquad+d(G_{FB}^*,\hat G_{m\text{-}hybrid})-d_n^\tau\big(G_{FB}^*,\hat G_{m\text{-}hybrid}\big)\\
&\le\rho_n+d(G_{FB}^*,\hat G_{m\text{-}hybrid})-d_n^\tau\big(G_{FB}^*,\hat G_{m\text{-}hybrid}\big),\tag{A.28}
\end{align*}
where $d(G_{FB}^*,\hat G_{m\text{-}hybrid})$ is as defined in equation (A.16) in Appendix A.3,
\begin{align*}
d_n^\tau\big(G_{FB}^*,\hat G_{m\text{-}hybrid}\big)&=W_n^\tau(G_{FB}^*)-W_n^\tau(\hat G_{m\text{-}hybrid}),\\
\rho_n&\equiv\frac\kappa M\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|\cdot P_{X,n}\big(G_{FB}^*\triangle\hat G_{m\text{-}hybrid}\big),
\end{align*}
and $P_{X,n}$ is the empirical distribution on $\mathcal X$ constructed from $(X_1,\dots,X_n)$.
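As a quick numerical sanity check of Lemma A.8 (ours, not part of the proof), the unique positive root of $X=AX^{\alpha/(1+\alpha)}+B$ can be computed by fixed-point iteration and compared against the linearized bound; the parameter grid below is arbitrary.

```python
import itertools

def root_of_polynomial_eq(A, B, alpha, n_iter=500):
    """Solve X = A * X**(alpha/(1+alpha)) + B for its unique positive root.
    The map X -> A*X**c + B (c in (0,1)) is increasing and concave with a
    positive intercept, so fixed-point iteration converges monotonically."""
    c = alpha / (1 + alpha)
    X = max(B, 1.0)
    for _ in range(n_iter):
        X = A * X ** c + B
    return X

for A, B, alpha in itertools.product([0.5, 1.0, 2.0], [0.1, 1.0, 5.0], [0.5, 1.0, 2.0]):
    X_star = root_of_polynomial_eq(A, B, alpha)
    # X_star solves the equation ...
    assert abs(X_star - (A * X_star ** (alpha / (1 + alpha)) + B)) < 1e-8
    # ... and satisfies the bound X* <= A^{1+alpha} + (1+alpha)B of Lemma A.8
    assert X_star <= A ** (1 + alpha) + (1 + alpha) * B + 1e-8
print("Lemma A.8 bound verified on the grid")
```

For $A=2$, $B=0.1$, $\alpha=1$ the root is about $4.198$ against a bound of $4.2$, so the linearization is nearly tight for small $B$.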
Similarly to the proof of Theorem 2.3, define a class of functions generated by $G\in\mathcal G$:
\begin{align*}
\mathcal H^\tau&\equiv\{h(Z_i;G):G\in\mathcal G\},\\
h(Z_i;G)&\equiv\frac\kappa M\,\tau(X_i)\cdot\left[1\{X_i\in G\}-1\{X_i\in G_{FB}^*\}\right],
\end{align*}
which is a VC-subgraph class with VC-dimension at most $v$ and envelope $\bar H=1$ by Lemma A.1. Let $a=\sqrt{kt}\,\epsilon_n$ be as defined in the proof of Theorem 2.3 and $V_a^\tau\equiv\sup_{h\in\mathcal H^\tau}\left\{\frac{E_n(h)-E_P(h)}{-E_P(h)+a^2}\right\}$. By noting
\[
d(G_{FB}^*,\hat G_{m\text{-}hybrid})-d_n^\tau\big(G_{FB}^*,\hat G_{m\text{-}hybrid}\big)\le V_a^\tau\left[d(G_{FB}^*,\hat G_{m\text{-}hybrid})+a^2\right],
\]
inequality (A.28) implies
\[
d(G_{FB}^*,\hat G_{m\text{-}hybrid})\le\rho_n+V_a^\tau\left[d(G_{FB}^*,\hat G_{m\text{-}hybrid})+a^2\right].\tag{A.29}
\]
Denote the event $\left\{V_a^\tau<\frac12\right\}$ by $\Omega_t$; on $\Omega_t$, (A.29) implies $d(G_{FB}^*,\hat G_{m\text{-}hybrid})\le2\rho_n+k\epsilon_n^2t$. Along the same line of argument that leads to (A.25) in the proof of Theorem 2.3, we obtain, for $t\ge1$,
\[
P^n\left(d(G_{FB}^*,\hat G_{m\text{-}hybrid})\le2\rho_n+k\epsilon_n^2t\right)\ge P^n(\Omega_t)\ge1-\exp(-t),\tag{A.30}
\]
where $\epsilon_n$ is given in (A.23). We bound $\rho_n$ from above by
\[
\rho_n\le\frac\kappa M\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|\cdot P_X\big(G_{FB}^*\triangle\hat G_{m\text{-}hybrid}\big)+V_{0,n}\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|,
\]
where
\[
V_{0,n}=\sup_{G\in\mathcal G}\left|P_{X,n}(G_{FB}^*\triangle G)-P_X(G_{FB}^*\triangle G)\right|.
\]
Let $\lambda>0$ be a constant to be chosen later. Define the events
\begin{align*}
\Lambda_1&=\left\{V_{0,n}\le n^{-\lambda}\right\},\\
\Lambda_2&=\left\{P_X\big(G_{FB}^*\triangle\hat G_{m\text{-}hybrid}\big)\ge n^{-\lambda}\right\}.
\end{align*}
Then, on $\Lambda_1\cap\Lambda_2$, it holds that $V_{0,n}\le P_X\big(G_{FB}^*\triangle\hat G_{m\text{-}hybrid}\big)$. Therefore, on $\Lambda_1\cap\Lambda_2\cap\Omega_t$, $d(G_{FB}^*,\hat G_{m\text{-}hybrid})$ can be bounded by
\begin{align*}
d(G_{FB}^*,\hat G_{m\text{-}hybrid})&\le4\frac\kappa M\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|\cdot P_X\big(G_{FB}^*\triangle\hat G_{m\text{-}hybrid}\big)+k\epsilon_n^2t\\
&\le4c_1\frac\kappa M\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|\cdot d(G_{FB}^*,\hat G_{m\text{-}hybrid})^{\frac\alpha{1+\alpha}}+k\epsilon_n^2t,
\end{align*}
where the second line follows from Lemma A.7 with the same definition of $c_1$ given there. By Lemma A.8 and substituting (A.23) for $\epsilon_n$, we obtain, on the event $\Lambda_1\cap\Lambda_2\cap\Omega_t$,
\[
d(G_{FB}^*,\hat G_{m\text{-}hybrid})\le c_6\left[\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|\right]^{1+\alpha}+c_7\left(\frac vn\right)^{\frac{1+\alpha}{2+\alpha}}t,\tag{A.31}
\]
where the constants $c_6$ and $c_7$ depend only on $(M,\kappa,\eta,\alpha)$.

Using the upper bound derived in (A.31), we obtain, for $t\ge1$,
\begin{align*}
E_{P^n}\left[d(G_{FB}^*,\hat G_{m\text{-}hybrid})\right]&=E_{P^n}\left[d(G_{FB}^*,\hat G_{m\text{-}hybrid})1\{\Lambda_1\cap\Lambda_2\cap\Omega_t\}\right]+E_{P^n}\left[d(G_{FB}^*,\hat G_{m\text{-}hybrid})1\{\Lambda_1^c\cup\Lambda_2^c\cup\Omega_t^c\}\right]\\
&\le c_6E_{P^n}\left[\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|^{1+\alpha}\right]+c_7\left(\frac vn\right)^{\frac{1+\alpha}{2+\alpha}}t+P^n(\Lambda_1^c)+E_{P^n}\left[d(G_{FB}^*,\hat G_{m\text{-}hybrid})1\{\Lambda_2^c\}\right]+P^n(\Omega_t^c)\\
&\le\underbrace{c_6\psi_n^{-(1+\alpha)}E_{P^n}\left[\left(\psi_n\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|\right)^{1+\alpha}\right]}_{A_{1,n}}+\underbrace{c_7\left(\frac vn\right)^{\frac{1+\alpha}{2+\alpha}}t}_{A_{2,n}}\\
&\qquad+\underbrace{C_4n^{-2v\left(\lambda-\frac12\right)}\exp\left(-2n^{-2\left(\lambda-\frac12\right)}\right)}_{A_{3,n}}+\underbrace{n^{-\lambda}}_{A_{4,n}}+\underbrace{\exp(-t)}_{A_{5,n}},
\end{align*}
where $\psi_n$ is a sequence as specified in Condition 2.7. In these inequalities, the second line uses (A.31) and $d(G_{FB}^*,\hat G_{m\text{-}hybrid})\le1$. In the third line, $A_{3,n}$ follows from Lemma A.9, $A_{4,n}$ follows from $d(G_{FB}^*,\hat G_{m\text{-}hybrid})\le P_X\big(G_{FB}^*\triangle\hat G_{m\text{-}hybrid}\big)$ and $P_X\big(G_{FB}^*\triangle\hat G_{m\text{-}hybrid}\big)<n^{-\lambda}$ on $\Lambda_2^c$, and $A_{5,n}$ follows from (A.30).

We now discuss the convergence rates of $A_{j,n}$, $j=1,\dots,5$, individually, with suitable choices of $t$ and $\lambda$. Condition (2.9) assumed in this theorem implies
\[
\sup_{P\in\mathcal P_m}E_{P^n}\left[\left(\psi_n\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|\right)^{1+\alpha}\right]
\le\sup_{P\in\mathcal P_m}\left[E_{P^n}\left(\psi_n\max_{1\le i\le n}\left|\hat\tau_m(X_i)-\tau(X_i)\right|\right)^2\right]^{\frac{1+\alpha}2}=O(1),
\]
where the inequality follows from Jensen's inequality. Hence, $A_{1,n}$ satisfies $\sup_{P\in\mathcal P_m}A_{1,n}=O\big(\psi_n^{-(1+\alpha)}\big)$. By setting $t=(1+\alpha)\log\psi_n$, we can make the convergence rate of $A_{5,n}$ equal to that of $A_{1,n}$. At the same time, by choosing $\lambda>\frac{1+\alpha}{2+\alpha}\ge\frac12$, we can make $A_{3,n}$ and $A_{4,n}$ converge faster than $A_{2,n}$. Hence, the uniform convergence rate of $E_{P^n}\left[d(G_{FB}^*,\hat G_{m\text{-}hybrid})\right]$ over $P\in\mathcal P_m\cap\mathcal P_{FB}(M,\kappa,\eta,\alpha)$ is bounded by the convergence rates of $A_{1,n}$ and $A_{2,n}$:
\[
\sup_PA_{1,n}\vee\sup_PA_{2,n}=O\left(\left(\psi_n^{-(1+\alpha)}\vee n^{-\frac{1+\alpha}{2+\alpha}}\right)\log\psi_n\right).
\]
This completes the proof for the m-hybrid case.

A proof for the e-hybrid case follows almost identically to that for the m-hybrid case. The differences are that $\rho_n$ in inequality (A.28) is given by
\[
\rho_n=\frac\kappa M\max_{1\le i\le n}\left|\hat\tau_i^e-\tau_i\right|\cdot P_{X,n}\big(G_{FB}^*\triangle\hat G_{e\text{-}hybrid}\big),
\]
and that inequality (A.29) is replaced by
\[
d(G_{FB}^*,\hat G_{e\text{-}hybrid})\le\rho_n+V_a\left[d(G_{FB}^*,\hat G_{e\text{-}hybrid})+a^2\right],\tag{A.32}
\]
where $V_a$ is as defined in the proof of Theorem 2.3.
The rest of the proof goes similarly to that of the first claim, except that the rate $\phi_n$ given in Condition 2.1 replaces $\psi_n$. $\square$

A.5 Proofs of Theorems 4.1 and 4.2

Proof of Theorem 4.1. Since $W_K(G)-W_K(G^\prime)=V_K(G)-V_K(G^\prime)$ for all $G,G^\prime$,
\[
\sup_{P\in\mathcal P(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal G}W_K(G)-W_K(\hat G_K)\right]=\sup_{P\in\mathcal P(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal G}V_K(G)-V_K(\hat G_K)\right],
\]
and we focus on bounding the latter expression. Since $\hat G_K$ maximizes $V_{K,n}(G)$, $V_{K,n}(\tilde G)\le V_{K,n}(\hat G_K)$ for any $\tilde G\in\mathcal G$, and
\begin{align*}
V_K(\tilde G)&\le V_{K,n}(\tilde G)+\sup_{G\in\mathcal G}|V_{K,n}(G)-V_K(G)|\\
&\le V_{K,n}(\hat G_K)+\sup_{G\in\mathcal G}|V_{K,n}(G)-V_K(G)|\\
&\le V_K(\hat G_K)+2\sup_{G\in\mathcal G}|V_{K,n}(G)-V_K(G)|.
\end{align*}
Applying this inequality for all $\tilde G\in\mathcal G$, we obtain
\[
\sup_{G\in\mathcal G}V_K(G)-V_K(\hat G_K)\le2\sup_{G\in\mathcal G}|V_{K,n}(G)-V_K(G)|,\tag{A.33}
\]
which is also true in expectation over $P^n$.

The welfare-gain estimation error for any treatment rule $G$ can be bounded from above by
\begin{align*}
|V_{K,n}(G)-V_K(G)|&=\left|\frac K{\max\{K,P_{X,n}(G)\}}\cdot V_n(G)-\frac K{\max\{K,P_X(G)\}}\cdot V(G)\right|\\
&\le\frac K{\max\{K,P_{X,n}(G)\}}\cdot|V_n(G)-V(G)|+V(G)\cdot\left|\frac K{\max\{K,P_{X,n}(G)\}}-\frac K{\max\{K,P_X(G)\}}\right|\\
&\le|V_n(G)-V(G)|+\frac MK\cdot|P_{X,n}(G)-P_X(G)|.
\end{align*}
The second line comes from subtracting and adding $\frac K{\max\{K,P_{X,n}(G)\}}V(G)$ and then applying the triangle inequality. The third line uses the inequalities $\frac K{\max\{K,P_{X,n}(G)\}}\le1$ and $V(G)\le M$ (from Assumption 2.1 (BO)), and the observation that for any $a,b\in\mathbb R$ and $c>0$,
\[
\left|\frac c{\max\{c,a\}}-\frac c{\max\{c,b\}}\right|=\frac{c\left|\max\{c,b\}-\max\{c,a\}\right|}{\max\{c,a\}\cdot\max\{c,b\}}\le\frac{\left|\max\{c,b\}-\max\{c,a\}\right|}c\le\frac{|b-a|}c.
\]
Then
\begin{align*}
\sup_{P\in\mathcal P(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal G}V_K(G)-V_K(\hat G_K)\right]&\le2\sup_{P\in\mathcal P(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal G}|V_{K,n}(G)-V_K(G)|\right]\\
&\le2\sup_{P\in\mathcal P(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal G}|V_n(G)-V(G)|\right]+\frac{2M}K\sup_{P\in\mathcal P(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal G}|P_{X,n}(G)-P_X(G)|\right].
\end{align*}
Note that since the class $\mathcal G$ has VC-dimension $v<\infty$, the classes of functions
\begin{align*}
f_G(Y,D,X)&\equiv\left[\frac{YD}{e(X)}-\frac{Y(1-D)}{1-e(X)}\right]\cdot1\{X\in G\},\\
h_G(Y,D,X)&\equiv1\{X\in G\}-1/2,
\end{align*}
are VC-subgraph classes with VC-dimension no greater than $v$ by Lemma A.1. These classes of functions are uniformly bounded by $M/(2\kappa)$ and $1/2$, respectively.
Since $V_n(G)=E_n(f_G)$, $V(G)=E_P(f_G)$, $P_{X,n}(X_i\in G)=E_n(h_G)+1/2$, and $P_X(X\in G)=E_P(h_G)+1/2$, we can apply Lemma A.4 and obtain
\[
\sup_{P\in\mathcal P(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal G}V_K(G)-V_K(\hat G_K)\right]\le C_1\frac M\kappa\sqrt{\frac vn}+C_1\frac MK\sqrt{\frac vn}.
\]
The theorem's result follows from (A.33). $\square$

Proof of Theorem 4.2. Define the class of functions $\mathcal F=\{f(\cdot;G):G\in\mathcal G\}$ on $\mathcal Z$:
\[
f(Z_i;G)\equiv\left[\frac{Y_iD_i}{e(X_i)}-\frac{Y_i(1-D_i)}{1-e(X_i)}\right]\cdot\rho(X_i)\cdot1\{X_i\in G\}.
\]
Then $V^T(G)=E_P(f(\cdot;G))$ by (4.3) and $\hat G_{EWM}^T=\arg\max_{G\in\mathcal G}E_n(f(\cdot;G))$. By Assumptions 2.1 (BO), (SO) and 4.2 (BDR), these functions are uniformly bounded by $\bar F=\frac{M\bar\rho}{2\kappa}$. By Assumption 2.1 (VC) and Lemma A.1, $\mathcal F$ is a VC-subgraph class of functions with VC-dimension at most $v$. Applying the argument in inequality (2.2), we obtain
\[
\sup_{G\in\mathcal G}W^T(G)-W^T\big(\hat G_{EWM}^T\big)=\sup_{G\in\mathcal G}V^T(G)-V^T\big(\hat G_{EWM}^T\big)\le2\sup_{f\in\mathcal F}|E_n(f)-E_P(f)|.
\]
Now we take the expectation with respect to the sampling distribution $P^n$ and apply Lemma A.4:
\[
E_{P^n}\left[\sup_{G\in\mathcal G}W^T(G)-W^T\big(\hat G_{EWM}^T\big)\right]\le2E_{P^n}\left[\sup_{f\in\mathcal F}|E_n(f)-E_P(f)|\right]\le C_1\frac{M\bar\rho}\kappa\sqrt{\frac vn}.\ \square
\]

B Validating Condition 2.1 for Local Polynomial Estimators

This Appendix verifies that Condition 2.1 (m) and (e) hold for local polynomial estimators if the class of data generating processes $\mathcal P_m$ or $\mathcal P_e$ is constrained by Assumptions 2.3 and 2.4, respectively.

B.1 Notations and Basic Lemmas

In addition to the notations introduced in Section 2.4 of the main text, the following notations are used. Let $\mu:\mathbb R^{d_x}\to\mathbb R$ be a generic notation for a regression function of a vector of covariates $X\in\mathbb R^{d_x}$. In the case of the m-hybrid EWM rule, $\mu(\cdot)$ corresponds to either $m_1(\cdot)$ or $m_0(\cdot)$. In the case of the e-hybrid EWM rule, $\mu(\cdot)$ corresponds to the propensity score $e(\cdot)$. We use $n$ to denote the size of the entire sample indexed by $i=1,\dots,n$, and denote by $J_i\subset\{1,\dots,n\}$ a subsample with which $\mu(X_i)$ is estimated nonparametrically. Since we consider throughout the leave-one-out regression fits of $\mu(X_i)$, $J_i$ does not include the $i$-th observation.
In the case of the m-hybrid EWM rule, $J_i$ is either the leave-one-out treated sample $\{j\in\{1,\dots,n\}:D_j=1,\ j\ne i\}$ or the leave-one-out control sample $\{j\in\{1,\dots,n\}:D_j=0,\ j\ne i\}$, depending on whether $\mu(\cdot)$ corresponds to $m_1(\cdot)$ or $m_0(\cdot)$. Note that, in the m-hybrid case, $J_i$ is random, as it depends on a realization of $(D_1,\dots,D_n)$. When the e-hybrid EWM rule is considered, $J_i$ is non-stochastic and is given by $J_i=\{1,\dots,n\}\setminus\{i\}$. The size of $J_i$ is denoted by $n_{J_i}$, which is equal to $n_1-1$ or $n_0-1$ in the m-hybrid case, and is equal to $n-1$ in the e-hybrid case. With abuse of notation, we use $Y_i$, $i=1,\dots,n$, to denote observations of the regressand, and use $\xi_i$ to denote a regression residual, i.e.,
\[
Y_i=\mu(X_i)+\xi_i,\qquad E(\xi_i|X_i)=0,
\]
holds for all $i=1,\dots,n$. For the e-hybrid rule, $Y_i$ should be read as the treatment status indicator $D_i\in\{1,0\}$.

We assume that $\mu(\cdot)$ belongs to a Hölder class of functions with degree $\beta\ge1$ and constant $0<L<\infty$. Define the leave-one-out local polynomial regression fit of degree $l=\beta-1$ by
\[
\hat\mu_{-i}(X_i)=U^T(0)\,\hat\theta_{-i}(X_i)\cdot1\{\lambda(X_i)\ge t_n\},\qquad
\hat\theta_{-i}(X_i)=\arg\min_\theta\sum_{j\in J_i}\left[Y_j-\theta^TU\left(\frac{X_j-X_i}h\right)\right]^2K\left(\frac{X_j-X_i}h\right),\tag{B.1}
\]
where $U\left(\frac{X_j-X_i}h\right)\equiv\left(\left(\frac{X_j-X_i}h\right)^s\right)_{|s|\le l}$ is a regressor vector, $\lambda(X_i)$ is the smallest eigenvalue of
\[
B_{-i}(X_i)\equiv\frac1{nh^{d_x}}\sum_{j\in J_i}U\left(\frac{X_j-X_i}h\right)U^T\left(\frac{X_j-X_i}h\right)K\left(\frac{X_j-X_i}h\right),
\]
and $t_n$ is a sequence of trimming constants converging to zero, whose choice will be discussed later. The standard least squares calculus shows
\[
\hat\theta_{-i}(X_i)=B_{-i}(X_i)^{-1}\frac1{nh^{d_x}}\sum_{j\in J_i}Y_j\,U\left(\frac{X_j-X_i}h\right)K\left(\frac{X_j-X_i}h\right),
\]
so that $\hat\mu_{-i}(X_i)$ can be written as
\[
\hat\mu_{-i}(X_i)=\sum_{j\in J_i}Y_j\,\omega_j(X_i)\cdot1\{\lambda(X_i)\ge t_n\},\tag{B.2}
\]
where
\[
\omega_j(X_i)=\frac1{nh^{d_x}}U^T(0)\left[B_{-i}(X_i)\right]^{-1}U\left(\frac{X_j-X_i}h\right)K\left(\frac{X_j-X_i}h\right).
\]
We first present lemmas that will be used for proving Lemma B.4 below.

Lemma B.1. Suppose Assumptions 2.3 (PX) and (Ker) hold.
(i) Conditional on $(X_1,\dots,X_n)$ such that $\lambda(X_i)>0$,
\begin{align*}
\max_{j\ne i}|\omega_j(X_i)|&\le\frac{c_5}{nh^{d_x}\lambda(X_i)},\\
\sum_{j\in J_i}|\omega_j(X_i)|&\le\frac{c_5}{nh^{d_x}\lambda(X_i)}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\},
\end{align*}
where $c_5$ is a constant that depends only on $\beta$, $d_x$, and $K_{\max}$.
(ii) For any multi-index $s$ such that $1\le|s|\le(\beta-1)$,
\[
\sum_{j\in J_i}\omega_j(X_i)\left(\frac{X_j-X_i}h\right)^s=0.
\]
(iii) Let $\lambda(x)$ be the smallest eigenvalue of $B(x)\equiv\frac1{nh^{d_x}}\sum_{j=1}^nU\left(\frac{X_j-x}h\right)U^T\left(\frac{X_j-x}h\right)K\left(\frac{X_j-x}h\right)$. There exist positive constants $c_6$ and $c_7$ that depend only on $c$, $r_0$, $p_X$, and $K(\cdot)$ such that
\[
P^n\left(\{\lambda(x)\le c_6\}\right)\le2[\dim U]^2\exp\left(-c_7nh^{d_x}\right)
\]
holds for all $x$, $P_X$-almost surely, at every $n\ge1$.

Proof. (i) Since $\|U(0)\|=1$, it holds that
\begin{align*}
|\omega_j(X_i)|&\le\frac1{nh^{d_x}}\left\|\left[B_{-i}(X_i)\right]^{-1}U\left(\frac{X_j-X_i}h\right)K\left(\frac{X_j-X_i}h\right)\right\|\\
&\le\frac{K_{\max}}{nh^{d_x}\lambda(X_i)}\left\|U\left(\frac{X_j-X_i}h\right)\right\|1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\\
&\le\frac{K_{\max}\dim(U)^{1/2}}{nh^{d_x}\lambda(X_i)}\equiv\frac{c_5}{nh^{d_x}\lambda(X_i)}
\end{align*}
for every $1\le j\le n$. Similarly,
\[
\sum_{j\in J_i}|\omega_j(X_i)|\le\frac{K_{\max}\dim(U)^{1/2}}{nh^{d_x}\lambda(X_i)}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}=\frac{c_5}{nh^{d_x}\lambda(X_i)}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}.
\]
(ii) This claim follows from the first-order condition for $\theta$ in the least squares minimization problem in (B.1).
(iii) This claim is from Equation (6.3, p. 626) in the proof of Theorem 3.2 in Audibert and Tsybakov (2007), where a suitable choice of the constant $c_6$ is given in Equation (6.2, p. 625) of Audibert and Tsybakov (2007). $\square$

The next lemma provides an exponential tail bound for the local polynomial estimators. The first statement is borrowed from Theorem 3.2 in Audibert and Tsybakov (2007), and the second statement is its immediate extension.

Lemma B.2. (i) Suppose Assumption 2.3 (PX) and (Ker) hold, and $\mu(\cdot)$ belongs to a Hölder class of functions with degree $\beta\ge1$ and constant $0<L<\infty$. Assume $J_i$ is non-stochastic with $n_{J_i}=n-1$ (the e-hybrid case).
Then, there exist positive constants c_8, c_9, and c_10 that depend only on β, d_x, L, c, r_0, p_X, and p̄_X, such that, for any 0 < h < r_0/c, any δ > c_8 h^β, and any n ≥ 2,

P^{n−1}( |µ̂_{−n}(x) − µ(x)| > δ ) ≤ c_9 exp(−c_10 n h^{d_x} δ²)

holds for almost all x with respect to P_X, where P^{n−1}(·) is the distribution of {(Y_i, X_i)}_{i=1}^{n−1}.

(ii) Suppose Assumptions 2.1 (SO), 2.3 (PX), and (Ker) hold, and µ(·) belongs to a Hölder class of functions with degree β ≥ 1 and constant 0 < L < ∞. Assume J_i is stochastic (m-hybrid case) with J_i = {j ≠ i : D_j = d}, d ∈ {1, 0}. There exist positive constants c_11, c_12, and c_13 that depend only on κ, β, d_x, L, c, r_0, p_X, and p̄_X, such that for any 0 < h < r_0/c, any δ > c_11 h^β, and any n_{J_n} ≥ 1,

P^{n−1}( |µ̂_{−n}(x) − µ(x)| > δ | n_{J_n} ) ≤ c_12 exp(−c_13 n_{J_n} h^{d_x} δ²)

holds for almost all x with respect to P_X, where P^{n−1}(· | n_{J_n}) is the conditional distribution of {(Y_i, X_i)}_{i=1}^{n−1} given n_{J_n} = Σ_{j=1}^{n−1} 1{D_j = d}.

Proof. (i) See Theorem 3.2 in Audibert and Tsybakov (2007).

(ii) Under Assumption 2.1 (SO), the conditional distribution of covariates X given D = d, d ∈ {1, 0}, has the same support X as the unconditional distribution P_X and has bounded density on X, since

κ/(1 − κ) < (dP_{X|D=d}/dP_X)(x) < (1 − κ)/κ

holds for all x ∈ X. Therefore, when P_X satisfies Assumption 2.3 (PX), the conditional distributions P_{X|D=d}, d ∈ {1, 0}, also satisfy the support and density conditions analogous to Assumption 2.3 (PX). This implies that, even when we condition on n_{J_n} = Σ_{j=1}^{n−1} 1{D_j = d} ≥ 1, the exponential inequality of (i) in the current lemma is applicable with different constant terms.

The next lemma concerns an upper bound on the variance of the supremum of centered empirical processes indexed by a class of sets.

Lemma B.3. Let B be a countable class of sets in X, and let {P_{X,n}(B) : B ∈ B} be the empirical distribution based on iid observations (X_1, . . . , X_n), X_i ∼ P_X. Then,
Var( sup_{B∈B} {P_{X,n}(B) − P_X(B)} ) ≤ (2/n) E[ sup_{B∈B} {P_{X,n}(B) − P_X(B)} ] + 1/(4n).

Proof. In Theorem 11.10 of Boucheron et al. (2013), setting X_{i,s} to the centered indicator function 1{X_i ∈ B} − P_X(B) and dividing the resulting inequality by n² lead to

Var( sup_{B∈B} {P_{X,n}(B) − P_X(B)} ) ≤ (2/n) E[ sup_{B∈B} {P_{X,n}(B) − P_X(B)} ] + (1/n) sup_{B∈B} { P_X(B) [1 − P_X(B)] }
≤ (2/n) E[ sup_{B∈B} {P_{X,n}(B) − P_X(B)} ] + 1/(4n).

B.2 Main Lemmas and Proofs of Corollaries 2.1 and 2.2

The next lemma yields Corollaries 2.1 and 2.2.

Lemma B.4. Let P_µ be a class of joint distributions of (Y, X) such that µ(·) belongs to a Hölder class of functions with degree β ≥ 1 and constant 0 < L < ∞, and Assumption 2.3 (PX) holds. Let µ̂_{−i}(·) be the leave-one-out local polynomial fit for µ(X_i) defined in (B.1), whose kernel function satisfies Assumption 2.3 (Ker).

(i) Then,

sup_{P∈P_µ} E_{P^n}[ (1/n) Σ_{i=1}^n |µ̂_{−i}(X_i) − µ(X_i)| ] ≤ O(h^β) + O( 1/√(n h^{d_x}) )   (B.3)

holds. Hence, an optimal choice of bandwidth that leads to the fastest convergence rate of the uniform upper bound is h ∝ n^{−1/(2β+d_x)}, and the resulting uniform convergence rate is

sup_{P∈P_µ} E_{P^n}[ (1/n) Σ_{i=1}^n |µ̂_{−i}(X_i) − µ(X_i)| ] ≤ O( n^{−1/(2+d_x/β)} ).

(ii) Let t_n = (log n)^{−1}. Then,

sup_{P∈P_µ} E_{P^n}[ max_{1≤i≤n} |µ̂_{−i}(X_i) − µ(X_i)|² ] ≤ O( h^{2β}/t_n² ) + O( log n / (n h^{d_x} t_n²) )   (B.4)

holds. Hence, an optimal choice of bandwidth that leads to the fastest convergence rate of the uniform upper bound is h ∝ (log n / n)^{1/(2β+d_x)}, and the resulting uniform convergence rate is

sup_{P∈P_µ} E_{P^n}[ max_{1≤i≤n} |µ̂_{−i}(X_i) − µ(X_i)|² ] ≤ O( (t_n)^{−2} (log n / n)^{2/(2+d_x/β)} ).

Proof. (i) First, consider the non-stochastic J_i case with n_{J_i} = n − 1 (e-hybrid case).
Since observations are iid (hence exchangeable) and the probability law of µ̂_{−i}(·) does not depend on X_i, it holds that

E_{P^n}[ (1/n) Σ_{i=1}^n |µ̂_{−i}(X_i) − µ(X_i)| ] = E_{P^n}[ |µ̂_{−n}(X_n) − µ(X_n)| ]
= E_{P_X}[ E_{P^{n−1}}[ |µ̂_{−n}(X_n) − µ(X_n)| | X_n ] ]
= ∫_X E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| ] dP_X(x)
= ∫_X ∫_0^∞ P^{n−1}( |µ̂_{−n}(x) − µ(x)| > δ ) dδ dP_X(x),   (B.5)

where E_{P^{n−1}}[·] is the expectation with respect to the first (n − 1) observations of (Y_i, X_i). By Lemma B.2 (i), there exist positive constants c_8, c_9, and c_10 that depend only on β, d_x, L, c, r_0, p_X, and p̄_X such that, for any 0 < h < r_0/c, any δ > c_8 h^β, and any n ≥ 2,

P^{n−1}( |µ̂_{−n}(x) − µ(x)| > δ ) ≤ c_9 exp(−c_10 n h^{d_x} δ²)   (B.6)

holds for almost all x with respect to P_X. Hence,

∫_X ∫_0^∞ P^{n−1}( |µ̂_{−n}(x) − µ(x)| > δ ) dδ dP_X(x) ≤ c_8 h^β + c_9 ∫_0^∞ exp(−c_10 n h^{d_x} δ²) dδ
= c_8 h^β + c_14/√(n h^{d_x})
= O(h^β) + O( 1/√(n h^{d_x}) ),   (B.7)

where c_14 = c_9 (2 c_10)^{−1/2} ∫_0^∞ (δ′)^{−1/2} exp(−c_10 δ′) dδ′ < ∞. Since the upper bound (B.7) does not depend on P ∈ P_µ, it is uniform over P ∈ P_µ, and the conclusion holds.

Next, consider the stochastic J_i case with n_{J_i} = Σ_{j≠i} 1{D_j = d}, where d ∈ {1, 0}. We can interpret n_{J_i} as a binomial random variable with parameters (n − 1) and π, where π = P(D_i = 1) when µ(·) corresponds to m_1(·) and π = P(D_i = 0) when µ(·) corresponds to m_0(·). In either case, κ < π < 1 − κ by Assumption 2.1 (SO). Let n ≥ 1 + 2/π and define

Ω_{π,n} ≡ { |n_{J_n}/(n − 1) − π| ≤ π/2 } = { (n − 1)π/2 ≤ n_{J_n} ≤ 3(n − 1)π/2 }.

Consider

E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| · 1{Ω_{π,n}} ] = Σ_{n_{J_n}∈Ω_{π,n}} E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| | n_{J_n} ] P^{n−1}(n_{J_n})
≤ max_{n_{J_n}∈Ω_{π,n}} E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| | n_{J_n} ] · P^{n−1}(Ω_{π,n})
≤ max_{n_{J_n}∈Ω_{π,n}} E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| | n_{J_n} ].

Since n_{J_n} ≥ (n − 1)π/2 ≥ 1 on Ω_{π,n}, Lemma B.2 (ii) implies

E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| | n_{J_n} ] ≤ ∫_0^∞ P^{n−1}( |µ̂_{−n}(x) − µ(x)| > δ | n_{J_n} ) dδ
≤ c_11 h^β + c_15/√(n_{J_n} h^{d_x}),

where c_11 and c_15 are positive constants that depend only on κ, β, d_x, L, c, r_0, p_X, and p̄_X.
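The event Ω_{π,n} controls the binomial subsample size n_{J_n} through Hoeffding's inequality. As a quick numerical sanity check of this step (an illustrative sketch only; the function names and parameter values below are our own choices, not from the paper):

```python
import math
import random

def hoeffding_bound(n, pi):
    # Hoeffding: P(|n_J/(n-1) - pi| > pi/2) <= 2 exp(-2 (n-1) (pi/2)^2),
    # which is at most 2 exp(-(pi^2/4) n) for n >= 2.
    return 2.0 * math.exp(-2.0 * (n - 1) * (pi / 2.0) ** 2)

def simulated_tail(n, pi, reps=20000, seed=0):
    # Monte Carlo frequency of the complement event Omega^c_{pi,n}
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        n_j = sum(rng.random() < pi for _ in range(n - 1))  # Binomial(n-1, pi) draw
        if abs(n_j / (n - 1) - pi) > pi / 2.0:
            hits += 1
    return hits / reps

n, pi = 100, 0.3
freq, bound = simulated_tail(n, pi), hoeffding_bound(n, pi)
assert freq <= bound  # the exponential bound dominates the simulated frequency
```

Even at n = 100 the simulated frequency of Ω^c_{π,n} sits well below the exponential bound, which is why the M · P^{n−1}(Ω^c_{π,n}) remainder below is asymptotically negligible.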
Since n_{J_n} ≥ (n − 1)π/2 ≥ nπ/4 on Ω_{π,n} for n ≥ 2, it holds that

max_{n_{J_n}∈Ω_{π,n}} E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| | n_{J_n} ] ≤ c_11 h^β + 2 c_15/√(π n h^{d_x}).

Accordingly, combined with Hoeffding's inequality P^{n−1}(Ω^c_{π,n}) ≤ 2 exp(−(π²/4) n), we obtain

E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| ] ≤ E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| · 1{Ω_{π,n}} ] + M P^{n−1}(Ω^c_{π,n})
≤ c_11 h^β + 2 c_15/√(π n h^{d_x}) + 2M exp(−(π²/4) n).

The third term on the right hand side converges faster than the second term, so we have shown that

E_{P^n}[ (1/n) Σ_{i=1}^n |µ̂_{−i}(X_i) − µ(X_i)| ] = ∫_X E_{P^{n−1}}[ |µ̂_{−n}(x) − µ(x)| ] dP_X(x) ≤ O(h^β) + O( 1/√(n h^{d_x}) )

holds for the stochastic J_i case as well.

(ii) Let Ω_{λ,n} be the event {λ(X_i) ≥ t_n, ∀i = 1, . . . , n}. On Ω_{λ,n}, (B.2) implies

|µ̂_{−i}(X_i) − µ(X_i)|² ≤ | Σ_{j∈J_i} Y_j ω_j(X_i) − µ(X_i) |²
= | Σ_{j∈J_i} (µ(X_j) − µ(X_i)) ω_j(X_i) + Σ_{j∈J_i} ξ_j ω_j(X_i) |²
≤ 2 | Σ_{j∈J_i} (µ(X_j) − µ(X_i)) ω_j(X_i) |² + 2 | Σ_{j∈J_i} ξ_j ω_j(X_i) |²,   (B.8)

where the second line follows from Y_j = µ(X_j) + ξ_j and Σ_{j∈J_i} ω_j(X_i) = 1, as implied by Lemma B.1 (ii). Since µ(·) is assumed to belong to the Hölder class, Lemma B.1 (i) and Assumption 2.3 (Ker) imply

| Σ_{j∈J_i} (µ(X_j) − µ(X_i)) ω_j(X_i) |² ≤ ( Σ_{j∈J_i} L ‖X_j − X_i‖^β |ω_j(X_i)| )²
= ( Σ_{j∈J_i} L ‖X_j − X_i‖^β |ω_j(X_i)| · 1{(X_j − X_i) ∈ [−h, h]^{d_x}} )²
≤ L² d_x^{2β} h^{2β} ( Σ_{j∈J_i} |ω_j(X_i)| )²
≤ L² d_x^{2β} h^{2β} ( ( c_5/(λ(X_i) n h^{d_x}) ) Σ_{j∈J_i} 1{(X_j − X_i) ∈ [−h, h]^{d_x}} )²
≤ c_16 (h^{2β}/t_n²) ( (1/(n h^{d_x})) Σ_{j∈J_i} 1{(X_j − X_i) ∈ [−h, h]^{d_x}} )²,

where c_16 = L² d_x^{2β} c_5². Under Assumption 2.3 (PX) and conditional on Ω_{λ,n},

max_{1≤i≤n} | Σ_{j∈J_i} (µ(X_j) − µ(X_i)) ω_j(X_i) |² ≤ c_16 (h^{2β}/t_n²) [ (1/h^{d_x}) sup_{B∈B_h} P_{X,n}(B) ]²
≤ c_16 (h^{2β}/(t_n² h^{2d_x})) [ sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) + sup_{B∈B_h} P_X(B) ]²
≤ c_16 (h^{2β}/(t_n² h^{2d_x})) [ sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) + 2^{d_x} p̄_X h^{d_x} ]²
≤ c_16 (h^{2β}/(t_n² h^{2d_x})) { 2 [ sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) ]² + 2^{2d_x+1} p̄_X² h^{2d_x} },

where B_h is the class of hypercubes in R^{d_x},

B_h ≡ { ∏_{k=1}^{d_x} [x_k − h, x_k + h] : (x_1, . . . , x_{d_x}) ∈ X }.
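The supremum of the centered empirical process over the hypercube class B_h is the key random quantity in the argument that follows. It can be simulated directly; the sketch below (illustrative code with d_x = 1 and P_X uniform on [0, 1], so that B_h is a class of intervals; the names are ours) approximates sup_{B∈B_h}(P_{X,n}(B) − P_X(B)) by scanning interval endpoints at the data points:

```python
import random
from bisect import bisect_left, bisect_right

def sup_dev(xs, h):
    """Approximate sup over B = [c-h, c+h] of P_{X,n}(B) - P_X(B) for
    P_X = Uniform[0,1]. For the empirical part the supremum is attained
    with an interval endpoint at a data point, so scanning centers at
    x, x-h, x+h for each observation x is sufficient."""
    xs = sorted(xs)
    n = len(xs)
    centers = set()
    for x in xs:
        centers.update((x, x - h, x + h))
    best = 0.0
    for c in centers:
        lo, hi = c - h, c + h
        emp = (bisect_right(xs, hi) - bisect_left(xs, lo)) / n  # P_{X,n}(B)
        true = max(0.0, min(hi, 1.0) - max(lo, 0.0))  # Lebesgue measure of B on [0,1]
        best = max(best, emp - true)
    return best

rng = random.Random(1)
h = 0.05
devs = [sup_dev([rng.random() for _ in range(n)], h) for n in (100, 400, 1600)]
# devs shrinks with n, in line with the sqrt(v_{B_h} h^{d_x} / n) bound from Lemma A.5
```

B_h is a VC class (intervals, and hypercubes more generally, have finite VC dimension), which is what licenses the uniform bounds applied next.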
Accordingly,

E_{P^n}[ max_{1≤i≤n} | Σ_{j∈J_i} (µ(X_j) − µ(X_i)) ω_j(X_i) |² · 1{Ω_{λ,n}} ]
≤ c_17 h^{2β}/t_n² + 2 c_16 (h^{2β}/(t_n² h^{2d_x})) E_{P^n}[ ( sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) )² ]
= c_17 h^{2β}/t_n² + 2 c_16 (h^{2β}/(t_n² h^{2d_x})) { Var( sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) ) + [ E_{P^n} sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) ]² },

where c_17 = 2^{2d_x+1} c_16 p̄_X². In order to bound the variance and the squared mean terms in the curly brackets, we apply Lemma B.3 and Lemma A.5 with F̄ = 1 and δ = p̄_X (2h)^{d_x/2}. For all n satisfying n h^{d_x} ≥ C_1 v_{B_h}/(2^{d_x} p̄_X²), it holds that

Var( sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) ) ≤ (2/n) E_{P^n}[ sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) ] + 1/(4n)
≤ 2^{d_x/2+1} C_2 p̄_X √(v_{B_h} h^{d_x}) / n^{3/2} + 1/(4n),
[ E_{P^n} sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) ]² ≤ 2^{d_x} C_2² p̄_X² v_{B_h} h^{d_x} / n,

where v_{B_h} < ∞ is the VC-dimension of B_h, which depends only on d_x. As a result, there exist positive constants c_18 and c_19 that depend only on β, d_x, and p̄_X such that

E_{P^n}[ max_{1≤i≤n} | Σ_{j∈J_i} (µ(X_j) − µ(X_i)) ω_j(X_i) |² · 1{Ω_{λ,n}} ] ≤ c_17 h^{2β}/t_n² + c_18 h^{2β}/(t_n² (n h^{d_x})) + c_19 h^{2β}/(t_n² (n h^{d_x})^{3/2})

holds for all n satisfying n h^{d_x} ≥ C_1 v_{B_h}/(2^{d_x} p̄_X²). Since n h^{d_x} → ∞ by assumption, focusing on the leading terms yields

lim sup_{n→∞} sup_{P∈P_µ} E_{P^n}[ 2 max_{1≤i≤n} | Σ_{j∈J_i} (µ(X_j) − µ(X_i)) ω_j(X_i) |² · 1{Ω_{λ,n}} ] ≤ O( h^{2β}/t_n² ).   (B.9)

In order to bound the second term on the right hand side of (B.8), note first that

| Σ_{j∈J_i} ξ_j ω_j(X_i) |² ≤ (1/(n h^{d_x} λ²(X_i))) ‖ (1/√(n h^{d_x})) Σ_{j∈J_i} ξ_j U((X_j − X_i)/h) K((X_j − X_i)/h) ‖²
≤ ( K_max²/(n h^{d_x} t_n²) ) max_{1≤k≤dim(U)} η_{ik}²

holds conditional on Ω_{λ,n}, where η_{ik}, 1 ≤ k ≤ dim(U), is the k-th entry of the vector

(1/√(n h^{d_x})) Σ_{j∈J_i} ξ_j U((X_j − X_i)/h) 1{(X_j − X_i) ∈ [−h, h]^{d_x}}.

Therefore,

E_{P^n}[ max_{1≤i≤n} | Σ_{j∈J_i} ξ_j ω_j(X_i) |² · 1{Ω_{λ,n}} ] ≤ ( K_max²/(n h^{d_x} t_n²) ) E_{P^n}[ max_{1≤i≤n, 1≤k≤dim(U)} η_{ik}² ].   (B.10)

Conditional on (X_1, . . . , X_n), η_{ik} has mean zero and every summand in η_{ik} lies in the interval

[ −(M/√(n h^{d_x})) 1{(X_j − X_i) ∈ [−h, h]^{d_x}}, (M/√(n h^{d_x})) 1{(X_j − X_i) ∈ [−h, h]^{d_x}} ].
Hoeffding's inequality then implies that, for every 1 ≤ i ≤ n and 1 ≤ k ≤ dim(U),

P^n( |η_{ik}| ≥ t | X_1, . . . , X_n ) ≤ 2 exp( − n h^{d_x} t² / ( 2M² Σ_{j∈J_i} 1{(X_j − X_i) ∈ [−h, h]^{d_x}} ) )
≤ 2 exp( − n h^{d_x} t² / ( 2M² max_{1≤i≤n} Σ_{j∈J_i} 1{(X_j − X_i) ∈ [−h, h]^{d_x}} ) ), ∀t > 0.

Therefore,

E_{P^n}[ exp( n h^{d_x} η_{ik}² / ( 4M² max_{1≤i≤n} Σ_{j∈J_i} 1{(X_j − X_i) ∈ [−h, h]^{d_x}} ) ) | X_1, . . . , X_n ]
= 1 + ∫_1^∞ P^n( exp( n h^{d_x} η_{ik}² / ( 4M² max_{1≤i≤n} Σ_{j∈J_i} 1{(X_j − X_i) ∈ [−h, h]^{d_x}} ) ) ≥ t′ | X_1, . . . , X_n ) dt′
= 1 + ∫_1^∞ P^n( |η_{ik}| ≥ √( (4M²/(n h^{d_x})) max_{1≤i≤n} Σ_{j∈J_i} 1{(X_j − X_i) ∈ [−h, h]^{d_x}} · log t′ ) | X_1, . . . , X_n ) dt′
≤ 1 + 2 ∫_1^∞ exp(−2 log t′) dt′
= 1 + 2 ∫_1^∞ (t′)^{−2} dt′
= 3

for all 1 ≤ i ≤ n and 1 ≤ k ≤ dim(U). We can therefore apply Lemma 1.6 of Tsybakov (2009) to bound E_{P^n}[ max_{i,k} η_{ik}² | X_1, . . . , X_n ]:

E_{P^n}[ max_{1≤i≤n, 1≤k≤dim(U)} η_{ik}² | X_1, . . . , X_n ]
≤ 4M² max_{1≤i≤n} (1/(n h^{d_x})) Σ_{j∈J_i} 1{(X_j − X_i) ∈ [−h, h]^{d_x}} · log(3 dim(U) n)
≤ 4M² [ (1/h^{d_x}) sup_{B∈B_h} (P_{X,n}(B) − P_X(B)) + 2^{d_x} p̄_X ] log(3 dim(U) n).

By applying Lemma A.5 with F̄ = 1 and δ = p̄_X (2h)^{d_x/2}, the unconditional expectation of max_{i,k} η_{ik}² can be bounded as

E_{P^n}[ max_{1≤i≤n, 1≤k≤dim(U)} η_{ik}² ] ≤ 4M² ( C_2 2^{d_x/2} p̄_X √( v_{B_h}/(n h^{d_x}) ) + 2^{d_x} p̄_X ) log(3 dim(U) n)   (B.11)

for all n such that n h^{d_x} ≥ C_1 v_{B_h}/(2^{d_x} p̄_X²). Plugging (B.11) back into (B.10) and focusing on the leading term give

lim sup_{n→∞} sup_{P∈P_µ} E_{P^n}[ max_{1≤i≤n} | Σ_{j∈J_i} ξ_j ω_j(X_i) |² · 1{Ω_{λ,n}} ] ≤ O( log n / (n h^{d_x} t_n²) ).   (B.12)

Combining (B.8), (B.9), and (B.12), we obtain

E_{P^n}[ max_{1≤i≤n} |µ̂_{−i}(X_i) − µ(X_i)|² ]
≤ E_{P^n}[ max_{1≤i≤n} |µ̂_{−i}(X_i) − µ(X_i)|² · 1{Ω_{λ,n}} ] + M² P^n(Ω^c_{λ,n})
≤ 2 E_{P^n}[ max_{1≤i≤n} | Σ_{j∈J_i} (µ(X_j) − µ(X_i)) ω_j(X_i) |² · 1{Ω_{λ,n}} ] + 2 E_{P^n}[ max_{1≤i≤n} | Σ_{j∈J_i} ξ_j ω_j(X_i) |² · 1{Ω_{λ,n}} ] + M² P^n(Ω^c_{λ,n})
= O( h^{2β}/t_n² ) + O( log n / (n h^{d_x} t_n²) ) + M² P^n(Ω^c_{λ,n}),

so the desired conclusion is proven if P^n(Ω^c_{λ,n}) is shown to converge faster than the O( log n / (n h^{d_x} t_n²) ) term. To find the convergence rate of P^n(Ω^c_{λ,n}), consider first the case of non-stochastic J_i.
By applying Lemma B.1 (iii) with the sample size set at (n − 1), we have

P^n( {λ(X_i) ≤ c_6 for some 1 ≤ i ≤ n} ) ≤ n P^n( {λ(X_n) ≤ c_6} )
= n ∫_X P^n( λ(X_n) ≤ c_6 | X_n = x ) dP_X(x)
= n ∫_X P^{n−1}( λ(x) ≤ c_6 ) dP_X(x)   (B.13)
≤ 2n [dim U]² exp( −(c_7/2) n h^{d_x} ).

For the case of stochastic J_i, by viewing n_{J_i} as a binomial random variable with parameters (n − 1) and π with κ < π < 1 − κ, and recalling that, when P_X satisfies Assumption 2.3 (PX), the conditional distributions P_{X|D=d}, d ∈ {1, 0}, also satisfy the support and density conditions stated in Assumption 2.3 (PX), we can apply the exponential inequality of Lemma B.1 (iii) to bound P^{n−1}( λ(x) ≤ c_6 | n_{J_n} ). Hence, with Ω_{π,n} ≡ { |n_{J_n}/(n − 1) − π| ≤ π/2 } = { (n − 1)π/2 ≤ n_{J_n} ≤ 3(n − 1)π/2 } as used above, we have

P^{n−1}( λ(x) ≤ c_6 ) ≤ P^{n−1}( {λ(x) ≤ c_6} ∩ Ω_{π,n} ) + P^{n−1}( Ω^c_{π,n} )
≤ max_{n_{J_n}∈Ω_{π,n}} P^{n−1}( λ(x) ≤ c_6 | n_{J_n} ) + P^{n−1}( Ω^c_{π,n} )
≤ 2 [dim U]² exp( −(c_7 π/4) n h^{d_x} ) + 2 exp( −(π²/4) n ).

Plugging this upper bound into (B.13) and focusing on the leading term leads to

P^n( {λ(X_i) ≤ c_6 for some 1 ≤ i ≤ n} ) ≤ O( n exp( −(c_7 π/4) n h^{d_x} ) ).

Hence, in either the non-stochastic or the stochastic J_i case, since t_n ≤ c_6 holds for all large n and the obtained upper bounds are uniform over P ∈ P_µ, we conclude

lim sup_{n→∞} sup_{P∈P_µ} E_{P^n}[ max_{1≤i≤n} |µ̂_{−i}(X_i) − µ(X_i)|² ] ≤ O( h^{2β}/t_n² ) + O( log n / (n h^{d_x} t_n²) ) + O( n exp(−n h^{d_x}) ).

Since t_n = (log n)^{−1} by assumption, O( n exp(−n h^{d_x}) ) converges faster than O( log n / (n h^{d_x} t_n²) ), so the leading terms are the first two, O( h^{2β}/t_n² ) + O( log n / (n h^{d_x} t_n²) ).

Proof of Corollary 2.1. Noting the inequalities

E_{P^n}[ (1/n) Σ_{i=1}^n |τ̂^m(X_i) − τ(X_i)| ] ≤ E_{P^n}[ (1/n) Σ_{i=1}^n |m̂_1(X_i) − m_1(X_i)| ] + E_{P^n}[ (1/n) Σ_{i=1}^n |m̂_0(X_i) − m_0(X_i)| ],
E_{P^n}[ max_{1≤i≤n} (τ̂^m(X_i) − τ(X_i))² ] ≤ 2 E_{P^n}[ max_{1≤i≤n} (m̂_1(X_i) − m_1(X_i))² ] + 2 E_{P^n}[ max_{1≤i≤n} (m̂_0(X_i) − m_0(X_i))² ],

we obtain the current corollary by applying Lemma B.4. The resulting uniform convergence rate is given by ψ_n = n^{1/(2+d_x/β_m)}.
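To make the objects behind Corollary 2.1 concrete, the sketch below implements the simplest instance of the leave-one-out fit (B.1)–(B.2): β = 1, so the local polynomial degree is l = 0, U(u) = 1, B_{−i}(X_i) reduces to a scalar kernel average, and λ(X_i) is that scalar itself; the plug-in effect estimate is τ̂^m(X_i) = m̂_1(X_i) − m̂_0(X_i). The function names, the uniform kernel, and all parameter values are illustrative choices, not prescriptions from the paper:

```python
import random

def loo_local_constant(i, xs, ys, J_i, h, t_n):
    """Leave-one-out fit (B.1)-(B.2) with degree l = 0 (beta = 1):
    U(u) = 1, so B_{-i}(X_i) is the scalar kernel average lambda(X_i),
    and the fit is trimmed to 0 when lambda(X_i) < t_n."""
    K = lambda u: 0.5 if abs(u) <= 1.0 else 0.0  # uniform kernel, d_x = 1
    n = len(xs)
    lam = sum(K((xs[j] - xs[i]) / h) for j in J_i) / (n * h)
    if lam < t_n:
        return 0.0  # trimming indicator 1{lambda(X_i) >= t_n} = 0
    num = sum(ys[j] * K((xs[j] - xs[i]) / h) for j in J_i) / (n * h)
    return num / lam  # kernel-weighted average of the Y_j, j in J_i

# Toy design: m1(x) = x, m0(x) = 0, hence tau(x) = x.
rng = random.Random(0)
n, h, t_n = 400, 0.15, 0.05  # t_n stands in for the (log n)^{-1} trimming sequence
xs = [rng.random() for _ in range(n)]
ds = [rng.random() < 0.5 for _ in range(n)]
ys = [xs[i] if ds[i] else 0.0 for i in range(n)]

tau_hat = []
for i in range(n):
    J1 = [j for j in range(n) if ds[j] and j != i]      # leave-one-out treated sample
    J0 = [j for j in range(n) if not ds[j] and j != i]  # leave-one-out control sample
    m1 = loo_local_constant(i, xs, ys, J1, h, t_n)
    m0 = loo_local_constant(i, xs, ys, J0, h, t_n)
    tau_hat.append(m1 - m0)
# tau_hat[i] tracks tau(X_i) = X_i up to smoothing bias of order h
```

For β > 1 the same structure applies with U(u) collecting the monomials u^s, |s| ≤ β − 1, and λ(X_i) the smallest eigenvalue of the matrix B_{−i}(X_i).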
Under assumption (2.9) in Theorem 2.6, the corresponding rate is given by ψ̃_n = (log n)^{−1} (n/log n)^{1/(2+d_x/β_m)}.

Proof of Corollary 2.2. (i) Assume that n is large enough that ε_n ≤ κ/2 holds. Given ê(X_i) ∈ [ε_n, 1 − ε_n], τ̂^e_i − τ_i can be expressed as

τ̂^e_i − τ_i = ( Y_i D_i / e(X_i) ) · ( e(X_i) − ê(X_i) )/ê(X_i) + ( Y_i (1 − D_i)/(1 − e(X_i)) ) · ( e(X_i) − ê(X_i) )/(1 − ê(X_i)),

so

|τ̂^e_i − τ_i| ≤ (M/κ) · |ê(X_i) − e(X_i)| / ( ê(X_i)(1 − ê(X_i)) )

holds. On the other hand, when ê(X_i) ∉ [ε_n, 1 − ε_n], τ̂^e_i = 0 and |τ_i| ≤ M/κ imply |τ̂^e_i − τ_i| ≤ M/κ. Hence, the following bounds are valid:

|τ̂^e_i − τ_i| ≤ (M/κ) · ( 4/(κ(2 − κ)) ) · |ê(X_i) − e(X_i)|   if ê(X_i) ∈ [κ/2, 1 − κ/2],
|τ̂^e_i − τ_i| ≤ (M/κ) · 1/(ε_n(1 − ε_n))   if ê(X_i) ∉ [κ/2, 1 − κ/2].   (B.14)

Hence,

E_{P^n}[ (1/n) Σ_{i=1}^n |τ̂^e_i − τ_i| ] = E_{P^n}[ |τ̂^e_n − τ_n| ]
≤ (M/κ) · ( 4/(κ(2 − κ)) ) · E_{P^n}[ |ê(X_n) − e(X_n)| ] + (M/κ) · ( 1/(ε_n(1 − ε_n)) ) · P^n( ê(X_n) ∉ [κ/2, 1 − κ/2] ).

By Lemma B.4 (i), sup_{P∈P_e} E_{P^n}[ |ê(X_n) − e(X_n)| ] ≤ O( n^{−1/(2+d_x/β_e)} ), so the conclusion follows if ( 1/(ε_n(1 − ε_n)) ) P^n( ê(X_n) ∉ [κ/2, 1 − κ/2] ) is shown to converge faster than O( n^{−1/(2+d_x/β_e)} ). To see that this claim is true, note that

P^n( ê(X_n) ∉ [κ/2, 1 − κ/2] ) = ∫_X P^{n−1}( ê(x) ∉ [κ/2, 1 − κ/2] ) dP_X(x)
≤ ∫_X P^{n−1}( |ê(x) − e(x)| ≥ κ/2 ) dP_X(x)
≤ c_9 exp( −(c_10 κ²/4) n h^{d_x} )

holds for all n satisfying c_8 h^β < κ/2, where c_8, c_9, and c_10 are the constants defined in Lemma B.2 (i). Since ε_n is assumed to converge at a polynomial rate, ( 1/(ε_n(1 − ε_n)) ) P^n( ê(X_n) ∉ [κ/2, 1 − κ/2] ) converges faster than O( n^{−1/(2+d_x/β_e)} ).

(ii) By (B.14), we have

E_{P^n}[ max_{1≤i≤n} |τ̂^e_i − τ_i|² ] ≤ 2 ( 4M/(κ²(2 − κ)) )² E_{P^n}[ max_{1≤i≤n} |ê(X_i) − e(X_i)|² ]
+ 2 ( M/(κ ε_n(1 − ε_n)) )² P^n( ê(X_i) ∉ [κ/2, 1 − κ/2] for some 1 ≤ i ≤ n ).   (B.15)

By Lemma B.4 (ii), the first term in (B.15) converges at rate O( (log n)² (log n/n)^{2/(2+d_x/β_e)} ).
To find the convergence rate of the second term in (B.15), consider

P^n( ê(X_i) ∉ [κ/2, 1 − κ/2] for some 1 ≤ i ≤ n ) ≤ n P^n( ê(X_n) ∉ [κ/2, 1 − κ/2] )
≤ c_9 n exp( −(c_10 κ²/4) n h^{d_x} ),

where the last line follows from Lemma B.2 (i). Since ε_n converges at a polynomial rate, we conclude that the second term in (B.15) converges faster than the first term.

References

Abadie, A., J. Angrist, and G. Imbens (2002): “Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings,” Econometrica, 70, 91–117.

Abadie, A., M. M. Chingos, and M. R. West (2014): “Endogenous Stratification in Randomized Experiments,” unpublished manuscript.

Armstrong, T. B. and S. Shen (2014): “Inference on Optimal Treatment Assignments,” unpublished manuscript.

Audibert, J.-Y. and A. B. Tsybakov (2007): “Fast Learning Rates for Plug-in Classifiers,” The Annals of Statistics, 35, 608–633.

Bhattacharya, D. and P. Dupas (2012): “Inferring Welfare Maximizing Treatment Assignment under Budget Constraints,” Journal of Econometrics, 167, 168–196.

Bloom, H. S., L. L. Orr, S. H. Bell, G. Cave, F. Doolittle, W. Lin, and J. M. Bos (1997): “The Benefits and Costs of JTPA Title II-A Programs: Key Findings from the National Job Training Partnership Act Study,” Journal of Human Resources, 32, 549–576.

Boucheron, S., G. Lugosi, and P. Massart (2013): Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press.

Bousquet, O. (2002): “A Bennett Concentration Inequality and its Application to Suprema of Empirical Processes,” Comptes Rendus de l’Académie des Sciences - Series I, 334, 495–500.

Chamberlain, G. (2011): “Bayesian Aspects of Treatment Choice,” in The Oxford Handbook of Bayesian Econometrics, ed. by J. Geweke, G. Koop, and H. van Dijk, Oxford University Press, 11–39.

Dehejia, R. H. (2005): “Program Evaluation as a Decision Problem,” Journal of Econometrics, 125, 141–173.

Devroye, L., L. Györfi, and G.
Lugosi (1996): A Probabilistic Theory of Pattern Recognition, Springer.

Devroye, L. and G. Lugosi (1995): “Lower Bounds in Pattern Recognition and Learning,” Pattern Recognition, 28, 1011–1018.

Dudley, R. (1999): Uniform Central Limit Theorems, Cambridge University Press.

Elliott, G. and R. P. Lieli (2013): “Predicting Binary Outcomes,” Journal of Econometrics, 174, 15–26.

Florios, K. and S. Skouras (2008): “Exact Computation of Max Weighted Score Estimators,” Journal of Econometrics, 146, 86–91.

Heckman, J. J., H. Ichimura, and P. E. Todd (1997): “Matching As An Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme,” The Review of Economic Studies, 64, 605–654.

Hirano, K. and J. R. Porter (2009): “Asymptotics for Statistical Treatment Rules,” Econometrica, 77, 1683–1701.

Kanaya, S. and D. Kristensen (2014): “Optimal Sampling and Bandwidth Selection for Kernel Estimators of Diffusion Processes,” unpublished manuscript.

Kasy, M. (2014): “Using Data to Inform Policy,” unpublished manuscript.

Kerkyacharian, G., A. B. Tsybakov, V. Temlyakov, D. Picard, and V. Koltchinskii (2014): “Optimal Exponential Bounds on the Accuracy of Classification,” Constructive Approximation, 39, 421–444.

Lieli, R. P. and H. White (2010): “The Construction of Empirical Credit Scoring Rules Based on Maximization Principles,” Journal of Econometrics, 157, 110–119.

Lugosi, G. (2002): “Pattern Classification and Learning Theory,” in Principles of Nonparametric Learning, ed. by L. Györfi, Vienna: Springer, 1–56.

Mammen, E. and A. B. Tsybakov (1999): “Smooth Discrimination Analysis,” Annals of Statistics, 27, 1808–1829.

Manski, C. F. (1975): “Maximum Score Estimation of the Stochastic Utility Model of Choice,” Journal of Econometrics, 3, 205–228.

——— (2004): “Statistical Treatment Rules for Heterogeneous Populations,” Econometrica, 72, 1221–1246.

Manski, C. F. and T.
Thompson (1989): “Estimation of Best Predictors of Binary Response,” Journal of Econometrics, 40, 97–123.

Massart, P. and E. Nédélec (2006): “Risk Bounds for Statistical Learning,” The Annals of Statistics, 34, 2326–2366.

Qian, M. and S. A. Murphy (2011): “Performance Guarantees for Individualized Treatment Rules,” The Annals of Statistics, 39, 1180–1210.

Stoye, J. (2009): “Minimax Regret Treatment Choice with Finite Samples,” Journal of Econometrics, 151, 70–81.

——— (2012): “Minimax Regret Treatment Choice with Covariates or with Limited Validity of Experiments,” Journal of Econometrics, 166, 138–156.

Tetenov, A. (2012): “Statistical Treatment Choice Based on Asymmetric Minimax Regret Criteria,” Journal of Econometrics, 166, 157–165.

Tsybakov, A. B. (2004): “Optimal Aggregation of Classifiers in Statistical Learning,” The Annals of Statistics, 32, 135–166.

——— (2009): Introduction to Nonparametric Estimation, Springer.

van der Vaart, A. W. and J. A. Wellner (1996): Weak Convergence and Empirical Processes, Springer.

Vapnik, V. N. (1998): Statistical Learning Theory, John Wiley & Sons.

Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012): “Estimating Individualized Treatment Rules Using Outcome Weighted Learning,” Journal of the American Statistical Association, 107, 1106–1118.