Who should be Treated? Empirical Welfare Maximization Methods for Treatment Choice∗

Toru Kitagawa† and Aleksey Tetenov‡

The Institute for Fiscal Studies, Department of Economics, UCL
cemmap working paper CWP10/15

9 March 2015
Abstract
One of the main objectives of empirical analysis of experiments and quasi-experiments is to
inform policy decisions that determine the allocation of treatments to individuals with different observable covariates. We propose the Empirical Welfare Maximization (EWM) method,
which estimates a treatment assignment policy by maximizing the sample analog of average
social welfare over a class of candidate treatment policies. The EWM approach is attractive in
terms of both statistical performance and practical implementation in realistic settings of policy
design. Common features of these settings include: (i) feasible treatment assignment rules are
constrained exogenously for ethical, legislative, or political reasons, (ii) a policy maker wants a
simple treatment assignment rule based on one or more eligibility scores in order to reduce the
dimensionality of individual observable characteristics, and/or (iii) the proportion of individuals
who can receive the treatment is a priori limited due to a budget or a capacity constraint. We
show that when the propensity score is known, the average social welfare attained by EWM rules
converges at least at the n^{-1/2} rate to the maximum obtainable welfare uniformly over a minimally
constrained class of data distributions, and this uniform convergence rate is minimax optimal.
In comparison with this benchmark rate, we examine how the uniform convergence rate of the
average welfare improves or deteriorates depending on the richness of the class of candidate
decision rules, the distribution of conditional treatment effects, and the lack of knowledge of the
propensity score. We provide an asymptotically valid inference procedure for the population
welfare gain obtained by exercising the EWM rule. We offer easily implementable algorithms
for computing the EWM rule and an application using experimental data from the National
JTPA Study.
∗ We would like to thank Shin Kanaya, Charles Manski, Joerg Stoye, and seminar participants at Academia Sinica, NYU, and the University of Virginia for beneficial comments. We would also like to thank the participants at the 2014 Bristol Econometrics Study Group, the 2014 SETA Conference, the Hakone Microeconometrics Conference, and ICEEE 2015 for beneficial comments and discussions. Financial support from the ESRC through the ESRC Centre for Microdata Methods and Practice (CeMMAP) (grant number RES-589-28-0001) is gratefully acknowledged.
† Cemmap/University College London, Department of Economics. Email: t.kitagawa@ucl.ac.uk
‡ Collegio Carlo Alberto. Email: aleksey.tetenov@carloalberto.org
1 Introduction
Treatment effects often vary with observable individual characteristics. An important objective
of empirical analysis of experimental and quasi-experimental data is to determine the individuals
who should be treated based on their observable characteristics. Empirical researchers often use
regression estimates of individual treatment effects to infer the set of individuals who benefit or
do not benefit from the treatment and to suggest who should be targeted for treatment. This
paper advocates the Empirical Welfare Maximization (EWM) method, which offers an alternative
way to choose optimal treatment assignment based on experimental or observational data from
program evaluation studies. We study the frequentist properties of the EWM treatment choice
rule and show its optimality in terms of welfare convergence rate, which measures how quickly
the average welfare attained by practicing the estimated treatment choice rule converges to the
maximal welfare attainable with the knowledge of the true data generating process. We also argue
that the EWM approach is well-suited for policy design problems, since it easily accommodates
many practical policy concerns, including (i) feasible treatment assignment rules being constrained
exogenously for ethical, legislative, or political reasons, (ii) the policy maker facing a budget or
capacity constraint that limits the proportion of individuals who can receive one of the treatments,
or (iii) the policy maker wanting to have a simple treatment assignment rule based on one or more
indices (eligibility scores) to reduce the dimensionality of individual characteristics.
Let the data be a size-n random sample of Z_i = (Y_i, D_i, X_i), where X_i ∈ X ⊂ R^{d_x} refers to observable pre-treatment covariates of individual i, D_i ∈ {0, 1} is a binary indicator of the individual's treatment assignment, and Y_i ∈ R is her/his post-treatment observed outcome. The population
from which the sample is drawn is characterized by P , a joint distribution of (Y0,i , Y1,i , Di , Xi ),
where Y0,i and Y1,i are potential outcomes that would have been observed if i’s treatment status
were Di = 0 and Di = 1, respectively. We assume unconfoundedness, meaning that in the data
treatments are assigned independently of the potential outcomes (Y0,i , Y1,i ) conditionally on observable characteristics Xi . Based on this data, the policy-maker has to choose a conditional treatment
rule that determines whether individuals with covariates X in a target population will be assigned
to treatment 0 or to treatment 1. We restrict our analysis to non-randomized treatment rules.
The set of treatment rules could then be indexed by their decision sets G ⊂ X of covariate values,
which determine the group of individuals {X ∈ G} to whom treatment 1 is assigned. We denote
the collection of candidate treatment rules by G = {G ⊂ X }.
The goal of our analysis is to empirically select a treatment assignment rule that gives the
highest welfare to the target population. We assume that the joint distribution of (Y0,i , Y1,i , Xi ) of
the target population is identical to that of the sampled population.1 We consider the utilitarian
welfare criterion defined by the average of the individual outcomes in the target population. When
treatment rule G is applied to the target population, the utilitarian welfare equals
W(G) ≡ E_P[Y_1 · 1{X ∈ G} + Y_0 · 1{X ∉ G}],    (1.1)

where E_P(·) is the expectation with respect to P. Denoting the conditional mean treatment response by m_d(x) ≡ E[Y_d | X = x] and the conditional average treatment effect by τ(x) ≡ m_1(x) − m_0(x), we could also express the welfare criterion as

W(G) = E_P[m_0(X)] + E_P[τ(X) · 1{X ∈ G}].    (1.2)
Assuming unconfoundedness, equivalence of the distributions of (Y_{0,i}, Y_{1,i}, X_i) between the target and sampled populations, and the overlap condition for the propensity score e(X) = E_P[D|X] in the sampled population, the welfare criterion (1.1) can be written equivalently as

W(G) ≡ E_P[ (Y D / e(X)) · 1{X ∈ G} + (Y(1 − D) / (1 − e(X))) · 1{X ∉ G} ]
     = E_P(Y_0) + E_P[ (Y D / e(X) − Y(1 − D) / (1 − e(X))) · 1{X ∈ G} ].    (1.3)
Hence, if the probability distribution of observables (Y, D, X) were fully known to the decision-maker, an optimal treatment rule from the utilitarian perspective could be written as

G* ∈ arg max_{G ∈ G} W(G).    (1.4)

Or, equivalently, as a maximizer of the welfare gain relative to E_P(Y_0):

G* ∈ arg max_{G ∈ G} E_P[τ(X) · 1{X ∈ G}],    (1.5)

or

G* ∈ arg max_{G ∈ G} E_P[ (Y D / e(X) − Y(1 − D) / (1 − e(X))) · 1{X ∈ G} ].    (1.6)
The main idea of Empirical Welfare Maximization (EWM) is to solve a sample analog of the population maximization problem (1.4),

Ĝ_EWM ∈ arg max_{G ∈ G} W_n(G),  where W_n(G) = E_n[ (Y_i D_i / e(X_i)) · 1{X_i ∈ G} + (Y_i(1 − D_i) / (1 − e(X_i))) · 1{X_i ∉ G} ]    (1.7)

and E_n(·) is the sample average.

[Footnote 1: In Section 4.2, we consider a setting where the target and the sampled populations have identical conditional treatment effects, but different marginal distributions of X.]

One notable feature of our framework is that the class of candidate
treatment rules G = {G ⊂ X } is not as rich as the class of all subsets of X , and it may not include
the first-best decision set,
G*_FB ≡ {x ∈ X : τ(x) ≥ 0},    (1.8)
which maximizes the population welfare (1.1) if any assignment rule were feasible to implement.
Our framework with a constrained class of feasible assignment rules allows us to incorporate several
exogenous constraints that generally restrict the complexity of feasible treatment assignment rules.
For instance, when assigning treatments to individuals in the target population, it may not be
realistic to implement a complex treatment assignment rule due to legal or ethical restrictions or
due to public accountability for the treatment eligibility criterion.
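For concreteness, the sample-analog criterion in (1.7) is straightforward to compute once a candidate class is fixed. The sketch below evaluates the inverse-propensity-weighted welfare and maximizes it over simple threshold rules on a scalar covariate; the simulated design, the known propensity score of 0.5 (as in a randomized experiment), and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def empirical_welfare(y, d, in_g, e):
    """Sample analog W_n(G) in (1.7): IPW-weighted average outcome under rule G."""
    return np.mean(y * d / e * in_g + y * (1 - d) / (1 - e) * (1 - in_g))

def ewm_threshold(y, d, x, e):
    """EWM over threshold rules G = {x >= t}: try each observed value as t."""
    thresholds = np.concatenate(([-np.inf], np.sort(x)))
    welfares = [empirical_welfare(y, d, (x >= t).astype(float), e) for t in thresholds]
    k = int(np.argmax(welfares))
    return thresholds[k], welfares[k]

# Hypothetical randomized experiment: e(x) = 0.5 and tau(x) = x,
# so the first-best rule treats exactly when x >= 0.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
d = rng.integers(0, 2, n)
y = d * x + rng.normal(0, 0.1, n)
t_hat, w_hat = ewm_threshold(y, d, x, 0.5)
# t_hat should lie near 0 in this design
```

The maximizer is found by enumeration because, for threshold rules, the criterion can only change at observed covariate values; richer classes require the optimization methods discussed in Section 5.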
The largest welfare that could be obtained by any treatment rule in class G is
W*_G ≡ sup_{G ∈ G} W(G).    (1.9)
In line with Manski (2004) and the subsequent literature on statistical treatment rules, we evaluate
the performance of estimated treatment rules Ĝ ∈ G in terms of their average welfare loss (regret)
relative to the maximum feasible welfare WG∗
W*_G − E_{P^n}[W(Ĝ)] = E_{P^n}[W*_G − W(Ĝ)] ≥ 0,    (1.10)
where the expectation EP n is taken over different realizations of the random sample. This criterion
measures the average difference between the best attainable population welfare and the welfare
attained by implementing estimated policy Ĝ. Since we assess the statistical performance of Ĝ by
its welfare value W (Ĝ), we do not require arg maxG∈G W (G) to be unique or Ĝ to converge to a
specific set.
Assuming that the propensity score e(X) is known and bounded away from zero and one, as is
the case in randomized experiments, we derive a non-asymptotic distribution-free upper bound of E_{P^n}[W*_G − W(Ĝ_EWM)] as a function of sample size n and a measure of complexity of G. Based
on this bound, we show that the average welfare of the EWM treatment rule converges to WG∗
at rate O(n^{-1/2}) uniformly over a minimally constrained class of probability distributions. We
also show that this uniform convergence rate of ĜEW M is optimal in the sense that no estimated
treatment choice rule of any kind can attain a faster uniform convergence rate compared to the
EWM rule, i.e., minimax rate optimality of ĜEW M . For further refinement of this theoretical
result, we analyze how this uniform convergence rate improves if the first-best decision rule G∗F B
is feasible, i.e., G∗F B ∈ G, and if the class of data generating processes is constrained by the margin
assumption, which restricts the distribution of conditional treatment effects in a neighborhood of
zero. We show that ĜEW M remains minimax rate optimal with these additional restrictions.
When the data are from an observational study, the propensity score is usually unknown, so it is
not feasible to implement the EWM rule (1.7). As a feasible version of the EWM rule, we consider
hybrid EWM approaches that plug in estimators of the regression equations or the propensity
score in the sample analogs of (1.5) or (1.6). Specifically, with estimated regression functions
m̂d (x) = Ê(Yd |X = x) = Ê(Y |X = x, D = d), we define the m-hybrid rule as
Ĝ_m-hybrid ∈ arg max_{G ∈ G} E_n[τ̂_m(X_i) · 1{X_i ∈ G}],    (1.11)
where τ̂ m (Xi ) ≡ m̂1 (Xi ) − m̂0 (Xi ). Similarly, with the estimated propensity score ê(x), we define
an e-hybrid rule as

Ĝ_e-hybrid ∈ arg max_{G ∈ G} E_n[τ̂_{e,i} · 1{X_i ∈ G}],    (1.12)

where τ̂_{e,i} ≡ [ Y_i D_i / ê(X_i) − Y_i(1 − D_i) / (1 − ê(X_i)) ] · 1{ε_n ≤ ê(X_i) ≤ 1 − ε_n} with a positive sequence ε_n → 0
as n → ∞. We investigate the performance of these hybrid approaches in terms of the uniform
convergence rate of the welfare loss and clarify how this rate is affected by the estimation uncertainty
in m̂d (·) and ê(·).
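As an illustration of the trimmed score entering the e-hybrid rule (1.12), the sketch below estimates the propensity score by a crude covariate-binning estimator and applies the trimming indicator. The binning estimator, the trimming sequence ε_n = 0.1·n^{-1/4}, and the simulated design are all ad hoc assumptions made purely for illustration.

```python
import numpy as np

def trimmed_scores(y, d, e_hat, eps_n):
    """Trimmed IPW scores tau_hat_{e,i} from (1.12)."""
    keep = (e_hat >= eps_n) & (e_hat <= 1 - eps_n)
    return (y * d / e_hat - y * (1 - d) / (1 - e_hat)) * keep

# Hypothetical observational data: treatment probability rises with x.
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 1, n)
e_true = 0.2 + 0.6 * x
d = (rng.uniform(size=n) < e_true).astype(float)
y = d * 1.0 + rng.normal(0, 1, n)            # constant treatment effect of 1

# Crude propensity estimator: sample treatment frequency within 20 bins of x.
bins = np.minimum((x * 20).astype(int), 19)
e_hat = np.array([d[bins == b].mean() for b in range(20)])[bins]
eps_n = 0.1 * n ** (-0.25)                    # ad hoc trimming sequence
tau_e = trimmed_scores(y, d, e_hat, eps_n)
# the mean of tau_e estimates the average treatment effect (about 1 here)
```

In the e-hybrid rule these scores would then be averaged over {X_i ∈ G} and maximized over G, exactly as in (1.12).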
When performing the treatment choice analysis, it could also be of interest to assess the sampling
uncertainty of the estimated welfare gain from implementing the treatment rule Ĝ. For this purpose,
this paper proposes an inference procedure for W (Ĝ) − W (G0 ), where G0 is a benchmark treatment
assignment rule, such as no treatment (G0 = ∅) or the non-individualized implementation of the
treatment (G0 = X ).
Since the welfare criterion function involves optimization over a class of sets, estimation of the
EWM and hybrid treatment rules could present challenging computational problems when G is
rich, similarly to the maximum score estimation (Manski (1975), Manski and Thompson (1989)).
We argue, however, that exact maximization of the EWM criterion is now practically feasible for many
problems in economics using widely-available optimization software and an approach proposed by
Florios and Skouras (2008), which we extend and improve upon.
To illustrate EWM in practice, we compare EWM and plug-in treatment rules computed from
the experimental data of the National Job Training Partnership Act Study analyzed by Bloom
et al. (1997).
1.1 Related Literature
Our paper contributes to a growing literature on statistical treatment rules in econometrics, including Manski (2004), Dehejia (2005), Hirano and Porter (2009), Stoye (2009, 2012), Chamberlain
(2011), Bhattacharya and Dupas (2012), Tetenov (2012), and Kasy (2014). Manski (2004) proposes
to assess the welfare properties of statistical treatment rules by their maximum regret and derives
finite-sample bounds on the maximum regret of Conditional Empirical Success rules. CES rules
take a finite partition of the covariate space and, separately for each set in this partition, assign the
treatment that yields the highest sample average outcome. CES rules can be viewed as a type of
EWM rules for which G consists of all unions of the sets in the partition and the empirical welfare
criterion uses the sample propensity score. Manski shows that with the partition fixed, their welfare
regret converges to zero at least at the n^{-1/2} rate. We show that this rate holds for a broader class of
EWM rules and that it cannot be improved uniformly without additional restrictions on P .
Stoye (2009) shows that in the absence of ex-ante restrictions on how outcome distributions
vary with covariates, finite-sample minimax regret is attained by rules that take the finest partition
of the covariate space and operate independently for each covariate value. This important result
implies that with continuous covariates, minimax regret does not converge to zero with sample
size because the first-best treatment rule may be arbitrarily “wiggly” and difficult to approximate
from countable data. Our approach does not give rise to Stoye’s non-convergence result because
we restrict the complexity of G and define regret relative to the maximum attainable welfare in G
instead of the unconstrained first-best welfare.
The problem of conditional treatment choice has some similarities to the classification problem
in machine learning and statistics, since it seeks a way to optimally “classify” individuals into those
who benefit from the treatment and those who do not. This similarity allows us to draw on recent
theoretical results for classification by Devroye et al. (1996), Tsybakov (2004), Massart and Nédélec
(2006), Audibert and Tsybakov (2007), and Kerkyacharian et al. (2014), among others, and to adapt
them to the treatment choice problem. The minimax rate optimality of the EWM treatment choice
rule (proved in Theorems 2.1 and 2.2 below) is analogous to the minimax rate optimality of the
Empirical Risk Minimization classifier in the classification problem shown by Devroye and Lugosi
(1995). There are, however, substantive differences between treatment choice and classification
problems: (1) the observed outcomes are real-valued rather than binary, (2) in treatment choice
only one of the potential outcomes is observed for each individual, whereas in classification the
correct choice is known for each training sample observation, (3) the EWM criterion depends on
the propensity score, which may be unknown (as in observational studies), (4) policy settings often
impose constraints on practicable treatment rules or on the proportion of the population that could
be treated. Accommodating these fundamental differences and establishing the convergence rate
results for the welfare loss criterion constitute the main theoretical contributions of this paper.
Several works in econometrics consider the plug-in approach to treatment choice using estimated
regression equations,
Ĝ_plug-in = {x : τ̂_m(x) ≥ 0},  τ̂_m(x) = m̂_1(x) − m̂_0(x),    (1.13)
where m̂d (x) is a parametric or a nonparametric estimator of E(Yd |X = x). Hirano and Porter
(2009) establish local asymptotic minimax optimality of plug-in rules for parametric and semiparametric models of treatment response. Bhattacharya and Dupas (2012) apply nonparametric
plug-in rules with an aggregate budget constraint and derive some of their properties. Armstrong
and Shen (2014) consider statistical inference for the first-best decision rule G∗F B from the perspective of inference for conditional moment inequalities. In empirical practice of program evaluation,
researchers assess who should be treated by stratifying the population based on the predicted value
of the individual outcome in the absence of treatment and estimating the average causal effects for
each stratum using the experimental data. Abadie et al. (2014) point out that the naive implementation of this idea is subject to a bias due to endogenous stratification and provide a method
to correct the bias.
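A minimal version of the plug-in rule (1.13) can be sketched as follows, with m̂_1 and m̂_0 estimated by separate linear regressions on simulated data; the linear specification and the design are illustrative assumptions, not a prescription from the paper.

```python
import numpy as np

# Plug-in rule (1.13): estimate m_1 and m_0 by separate linear regressions
# on the treated and control subsamples, then treat whenever tau_hat_m(x) >= 0.
rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-1, 1, n)
d = rng.integers(0, 2, n)
y = 0.3 * x + d * (x - 0.2) + rng.normal(0, 0.2, n)   # true tau(x) = x - 0.2

X1 = np.column_stack([np.ones(n), x])
b1, *_ = np.linalg.lstsq(X1[d == 1], y[d == 1], rcond=None)
b0, *_ = np.linalg.lstsq(X1[d == 0], y[d == 0], rcond=None)

def tau_hat_m(xq):
    """Estimated conditional treatment effect m1_hat(x) - m0_hat(x)."""
    return np.column_stack([np.ones(len(xq)), xq]) @ (b1 - b0)

grid = np.linspace(-1, 1, 201)
plug_in_set = grid[tau_hat_m(grid) >= 0]   # approximately {x : x >= 0.2}
```

Unlike EWM, this rule inherits its decision boundary entirely from the regression specification, which is the source of the misspecification sensitivity discussed above.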
To assess the effect heterogeneity, estimation and inference for conditional treatment effects
based on parametric or nonparametric regressions are often reported, but the stylized output of
statistical inference (e.g., confidence intervals, p-values) fails to offer the policy maker a direct
guidance on what treatment rule to follow. In contrast, our EWM approach offers the policy maker
a specific treatment assignment rule designed to maximize the social welfare. By formulating the
treatment choice problem as a formal statistical decision problem, a certain treatment assignment
rule could be obtained by specifying a prior distribution for P and solving for a Bayes decision
rule (see Dehejia (2005), Chamberlain (2011), and Kasy (2014) for Bayesian approaches to the
treatment choice problem). In contrast to the Bayesian approach, the EWM approach is purely
data-driven and does not require a prior distribution over the data generating processes.
The analysis of optimal individualized treatment rules has also received considerable attention
in biostatistics. Qian and Murphy (2011) propose a plug-in approach using the treatment response
function m̂_D(X) ≡ D m̂_1(X) + (1 − D) m̂_0(X) estimated by penalized least squares. If the squared loss E(Y − m̂_D(X))² converges to the minimum E(Y − [D E(Y_1|X) + (1 − D) E(Y_0|X)])², then the welfare of the plug-in treatment rule converges to the maximum W(G*_FB). Assuming that
treatment response functions are linear in the chosen basis, they show welfare convergence rate of
n−1/2 or better (with a margin condition). Zhao et al. (2012) propose another surrogate loss function
approach that estimates the treatment rule using a Support Vector Machine with a growing basis.
This yields welfare convergence rates that depend on the dimension of the covariates, similarly to
nonparametric plug-in rules. These approaches are computationally attractive but cannot be used
to choose from a constrained set of treatment rules or under a capacity constraint.
Elliott and Lieli (2013) and Lieli and White (2010) also propose maximizing the sample analog
of a utilitarian decision criterion similar to EWM. They consider the problem of forecasting binary
outcomes based on observations of (Yi , Xi ), where a forecast leads to a binary decision.
2 Theoretical Properties of EWM

2.1 Setup and Assumptions
Throughout our investigation of theoretical properties of EWM, we maintain the following assumptions.
Assumption 2.1.
(UCF) Unconfoundedness: (Y1 , Y0 ) ⊥ D|X.
(BO) Bounded Outcomes: There exists M < ∞ such that the support of outcome variable Y is
contained in [−M/2, M/2].
(SO) Strict Overlap: There exists κ ∈ (0, 1/2) such that the propensity score satisfies e(x) ∈ [κ, 1 − κ] for all x ∈ X.

(VC) VC-class: The class of decision sets G has a finite VC-dimension2 v < ∞ and is countable.3
The assumption of unconfoundedness (selection on observables) holds if data are obtained from
an experimental study with a randomized treatment assignment. In observational studies, unconfoundedness is a non-testable and often controversial assumption. Our analysis could be applied
[Footnote 2: The VC-dimension of G is defined as the maximal number of points in X that can be shattered by G. The VC-dimension is commonly used to measure the complexity of a class of sets in the statistical learning literature (see Vapnik (1998), Dudley (1999, Chapter 4), and van der Vaart and Wellner (1996) for extensive discussions). Note that the VC-dimension is smaller by one than the VC-index used to measure the complexity of a class of sets in empirical process theory, e.g., van der Vaart and Wellner (1996).]

[Footnote 3: Countability of G is imposed in order to avoid measurability complications in proving our theoretical results.]
to observational studies in which unconfoundedness is credible. The second assumption (BO)
implies boundedness of the treatment effects, i.e.,
P_X(|τ(X)| ≤ M) = 1,
where PX is the marginal distribution of X and τ (·) is the conditional treatment effect τ (X) =
E (Y1 − Y0 |X). Since the implementation of EWM does not require knowledge of M and unbounded
Y is rare in social science, this assumption is innocuous and imposed only for analytical convenience.
The third assumption (SO) is a standard assumption in the treatment effect literature. It is satisfied
in randomized controlled trials by design, but it may be violated in observational studies if almost all
the individuals are in the same group (treatment or control) for some values of X. We let P(M, κ)
denote the class of distributions of (Y0 , Y1 , D, X) that satisfy Assumption 2.1 (UCF), (BO), and
(SO).
The fourth assumption (VC) restricts the complexity of the class of candidate treatment rules
G in terms of its VC-dimension. If X has a finite support, then the VC-dimension v of any class G
does not exceed the number of support points. If some components of X are continuously distributed, Assumption 2.1 (VC) requires G to be smaller than the Borel σ-algebra of X. The following examples illustrate
several practically relevant classes of the feasible treatment rules satisfying Assumption 2.1 (VC).
Example 2.1. (Linear Eligibility Score) Suppose that a feasible assignment rule is constrained
to those that assign the treatment according to an eligibility score. By the eligibility score, we
mean a scalar-valued function of the individual’s observed characteristics that determines whether
one receives the treatment based on whether the eligibility score exceeds a certain threshold. The
main objective of data analysis is therefore to construct an eligibility score that yields a welfaremaximizing treatment rule. Specifically, we assume that the eligibility score is constrained to being
linear in a subvector of x ∈ Rdx , xsub ∈ Rdsub , dsub ≤ dx . The class of decision sets generated by
Linear Eligibility Scores (LES) is defined as

G_LES ≡ { {x ∈ R^{d_x} : β_0 + x_sub^T β_sub ≥ 0} : (β_0, β_sub^T) ∈ R^{d_sub+1} }.    (2.1)

We accordingly obtain an EWM assignment rule by maximizing

W_n(β) ≡ E_n[ (Y_i D_i / e(X_i)) · 1{β_0 + X_{sub,i}^T β_sub ≥ 0} + (Y_i(1 − D_i) / (1 − e(X_i))) · 1{β_0 + X_{sub,i}^T β_sub < 0} ]

in β = (β_0, β_sub^T) ∈ R^{d_sub+1}. It is well known that the class of half-spaces spanned by (β_0, β_sub^T) ∈ R^{d_sub+1} has VC-dimension v = d_sub + 1, so Assumption 2.1 (VC) holds. In Section 5, we discuss
how to compute ĜEW M when the class of decision sets is given by GLES . A plug-in rule based on
a parametric linear regression also selects a treatment rule from G_LES, but its welfare does not converge to the maximum welfare W*_{G_LES} if the regression equations are misspecified, whereas the welfare of Ĝ_EWM always does (as shown in Theorem 2.1 below).
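To illustrate the objective W_n(β), the sketch below evaluates it on simulated data and maximizes it by a crude random search over coefficient vectors. This stand-in is not the paper's exact optimization approach (mixed integer optimization, discussed in Section 5), and the design, seed, and sample size are hypothetical.

```python
import numpy as np

def wn_linear(beta, y, d, xsub, e):
    """W_n(beta): empirical welfare of the half-space rule {b0 + x'b >= 0}."""
    in_g = (beta[0] + xsub @ beta[1:] >= 0).astype(float)
    return np.mean(y * d / e * in_g + y * (1 - d) / (1 - e) * (1 - in_g))

# Hypothetical experiment with e = 0.5 and tau(x) = x1 + x2, so the
# best linear rule treats roughly when x1 + x2 >= 0.
rng = np.random.default_rng(1)
n = 1000
xsub = rng.uniform(-1, 1, (n, 2))
d = rng.integers(0, 2, n)
y = d * (xsub[:, 0] + xsub[:, 1]) + rng.normal(0, 0.1, n)

betas = rng.normal(size=(3000, 3))          # random candidate coefficient vectors
welfare = [wn_linear(b, y, d, xsub, 0.5) for b in betas]
best = betas[int(np.argmax(welfare))]
```

Because the rule is invariant to positive rescaling of β, random directions cover the class adequately in low dimensions; exact methods replace this search in practice.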
Example 2.2. (Generalized Eligibility Score) Let fj (·), j = 1, . . . , m, and g(·) be known functions
of x ∈ Rdx . Consider a class of assignment rules generated by Generalized Eligibility Scores (GES),
G_GES ≡ { {x ∈ R^{d_x} : Σ_{j=1}^m β_j f_j(x) ≥ g(x)} : (β_1, ..., β_m) ∈ R^m }.
The class of decision sets GGES generalizes the linear eligibility score rules (2.1), as it allows for
eligibility scores that are nonlinear in x, i.e., GGES can accommodate decision sets that partition
the space of covariates by nonlinear boundaries. It can be shown that GGES has the VC-dimension
v = m + 1 (Theorem 4.2.1 in Dudley (1999)).
Example 2.3. (Intersection Rule of Multiple Eligibility Scores) Consider a situation where there
are L ≥ 2 eligibility scores. Let GGES,l , l = 1, . . . , L, be classes of decision sets such that each
of them is generated by contour sets of the l-th eligibility score. Suppose that a feasible decision
rule is constrained to those that assign the treatment if the individual has all the L eligibility scores
exceeding thresholds. In this case, the class of decision sets is constructed by the intersections,
G ≡ { ∩_{l=1}^L G_l : G_l ∈ G_GES,l, l = 1, . . . , L }.

An intersection of a finite number of VC-classes is a VC-class with a finite VC-dimension (Theorem 4.5.4 in Dudley (1999)); thus, Assumption 2.1 (VC) holds for this G. We can also consider a class of treatment rules that assigns a treatment if at least one of the L eligibility scores exceeds a threshold. In this case, instead of intersections, the class of decision sets is formed by the unions of {G_GES,l : l = 1, . . . , L}, which is also known to have a finite VC-dimension (Theorem 4.5.4 in Dudley (1999)).
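The VC-dimension claims in these examples can be checked by brute force in small cases. The sketch below considers scalar linear eligibility scores {x : β_0 + β_1 x ≥ 0} (Example 2.1 with d_sub = 1, so v = 2) and verifies that two points can be shattered while the in–out–in labeling of three ordered points is not achievable; the grid search over coefficients is an illustrative shortcut, sufficient here because such rules are half-lines.

```python
import numpy as np
from itertools import product

def achievable_labelings(points, n_grid=201):
    """Enumerate labelings of `points` realized by rules {x : b0 + b1*x >= 0},
    searching over a grid of (b0, b1) values."""
    labelings = set()
    grid = np.linspace(-2, 2, n_grid)
    pts = np.asarray(points)
    for b0, b1 in product(grid, grid):
        labelings.add(tuple((b0 + b1 * pts >= 0).astype(int)))
    return labelings

two = achievable_labelings([0.0, 1.0])
three = achievable_labelings([0.0, 1.0, 2.0])
# two points are shattered (all 4 labelings appear), consistent with
# v = d_sub + 1 = 2, while the labeling (1, 0, 1) of three points is not
# achievable by any half-line rule.
print(len(two) == 4, (1, 0, 1) in three)
```

The same enumeration idea extends to half-spaces in higher dimensions, though the search grows quickly and exact arguments (as in Dudley (1999)) are preferable.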
2.2 Uniform Rate Optimality of EWM
To analyze statistical performance of EWM rules, we focus on a non-asymptotic upper bound of the worst-case welfare loss sup_{P ∈ P(M,κ)} E_{P^n}[W*_G − W(Ĝ_EWM)] and examine how it depends
on sample size n and VC-dimension v. This finite sample upper bound allows us to assess the
uniform convergence rate of the welfare and to examine how richness (complexity) of the class
of candidate decision rules affects the worst-case performance of EWM. The main reason that we
focus on the uniform convergence rate rather than a pointwise convergence rate is that the pointwise
convergence rate of the welfare loss can vary depending on a feature of the data distribution and
fails to provide a guaranteed learning rate of an optimal policy when no additional assumption,
other than Assumption 2.1, is available.
For heuristic illustration of the derivation of the uniform convergence rate, consider the following
inequality, which holds for any G̃ ∈ G:
W(G̃) − W(Ĝ_EWM) = [W(G̃) − W_n(Ĝ_EWM)] + [W_n(Ĝ_EWM) − W(Ĝ_EWM)]
  ≤ [W(G̃) − W_n(G̃)] + sup_{G ∈ G} [W_n(G) − W(G)]    (since W_n(Ĝ_EWM) ≥ W_n(G̃))
  ≤ 2 sup_{G ∈ G} |W_n(G) − W(G)|.

Since it applies to W(G̃) for all G̃ ∈ G, it also applies to W*_G = sup_{G̃ ∈ G} W(G̃):

W*_G − W(Ĝ_EWM) ≤ 2 sup_{G ∈ G} |W_n(G) − W(G)|.    (2.2)
Therefore, the expected welfare loss can be bounded uniformly in P by a distribution-free upper
bound of EP n (supG∈G |Wn (G) − W (G)|). Since Wn (G)−W (G) can be seen as the centered empirical
process indexed by G ∈ G, an application of the existing moment inequality for the supremum of
centered empirical processes indexed by a VC-class yields the following distribution-free upper
bound. A proof, which closely follows the proofs of Theorems 1.16 and 1.17 in Lugosi (2002) in the
classification problem, is given in Appendix A.2.
Theorem 2.1. Under Assumption 2.1, we have

sup_{P ∈ P(M,κ)} E_{P^n}[W*_G − W(Ĝ_EWM)] ≤ C_1 (M/κ) √(v/n),

where C_1 is a universal constant defined in Lemma A.4 in Appendix A.1.
This theorem shows that the convergence rate of the worst-case welfare loss for the EWM rule
is no slower than n^{-1/2}. The upper bound is increasing in the VC-dimension of G, implying that, as the candidate treatment assignment rules become more complex in terms of VC-dimension, Ĝ_EWM tends to overfit the data in the sense that the distribution of the regret W*_G − W(Ĝ_EWM) is more and
more dispersed, and, with n fixed, this overfitting inflates the average welfare regret.4

[Footnote 4: Note that W*_G weakly increases if a more complex class G is chosen. Our welfare loss criterion is defined for a specific class G and does not capture the potential gain in the maximal welfare from the choice of a more complex G.]
The next theorem concerns a universal lower bound of the worst-case average welfare loss. It
shows that no data-based treatment choice rule can have a uniform convergence rate faster than
n^{-1/2}.
Theorem 2.2. Suppose that Assumption 2.1 holds and the VC-dimension of G is v ≥ 2. Then, for any treatment choice rule Ĝ, as a function of (Z_1, . . . , Z_n), it holds that

sup_{P ∈ P(M,κ)} E_{P^n}[W*_G − W(Ĝ)] ≥ 4^{-1} exp(−2√2) M √((v − 1)/n)  for all n ≥ 16(v − 1).
This theorem, combined with Theorem 2.1, implies that Ĝ_EWM is minimax rate optimal over the class of data generating processes P(M, κ), since the convergence rate of the upper bound of sup_{P ∈ P(M,κ)} E_{P^n}[W*_G − W(Ĝ_EWM)] agrees with the convergence rate of the universal lower bound. Accordingly, we can conclude that no other data-driven procedure for obtaining a treatment choice rule can outperform Ĝ_EWM in terms of the uniform convergence rate over P(M, κ). It is worth noting that the rate lower bound is uniform in P and does not apply pointwise. Theorem 2.3 shows that EWM rules have faster convergence rates for some distributions. It is also possible that E_{P^n}[W(Ĝ)] > W*_G for some pairs of Ĝ and P, but it can never hold for all distributions in P(M, κ).5
2.3 Rate Improvement by Margin Assumption
The welfare loss upper bound obtained in Theorem 2.1 can indeed be tightened, and the uniform convergence rate can improve, as we further constrain the class of data generating processes. In
this section, we investigate (i) what feature of data generating processes can affect the upper bound
on the welfare loss of the EWM rule, and (ii) whether or not the EWM rule remains minimax rate
optimal even under the additional constraints. For this goal, we consider imposing the following
two assumptions.
Assumption 2.2.

(FB) Correct Specification: The first-best treatment rule G*_FB defined in (1.8) belongs to the class of candidate treatment rules G.

(MA) Margin Assumption: There exist constants 0 < η ≤ M and 0 < α < ∞ such that

P_X(|τ(X)| ≤ t) ≤ (t/η)^α,  ∀ 0 ≤ t ≤ η,

where M < ∞ is the constant as defined in Assumption 2.1 (BO).

[Footnote 5: For example, if Ĝ is a nonparametric plug-in rule and the first-best decision rule G*_FB for distribution P does not belong to G, then the welfare of Ĝ will exceed W*_G in sufficiently large samples. However, the uniform lower bound still applies because there exist other distributions for which E_{P^n}[W(Ĝ)] ≤ W*_G − (n^{-1/2} bound) for the same sample size.]
The assumption of correct specification means that the class of the feasible policy rules specified
by G contains an unconstrained first-best treatment rule G∗F B . This assumption is plausible if, for
instance, the policy maker’s specification of G is based on a credible assumption about the shape
of the contour set {x : τ (x) ≥ 0}. This assumption can be, on the other hand, restrictive if the
specification of G comes from some exogenous constraints for feasible policy rules, as in the case of
Example 2.1.
The second assumption (MA) concerns the way in which the distribution of the conditional treatment effect τ(X) behaves in the neighborhood of τ(X) = 0. A similar assumption has been considered in the literature on classification analysis (Mammen and Tsybakov (1999), Tsybakov (2004),
among others), and we borrow the term “margin assumption” from Tsybakov (2004). The parameters η
and α characterize the size of the population with conditional treatment effects close to the margin
τ(X) = 0. Smaller η and α imply that more individuals can concentrate in a neighborhood of
τ(X) = 0. The next examples illustrate this interpretation of η and α.
Example 2.4. Suppose that X contains a continuously distributed covariate and that the conditional treatment effect τ (X) is continuously distributed. If the probability density function of τ (X)
is bounded from above by pτ < ∞, then the margin assumption holds with α = 1 and η = (2pτ )−1 .
Example 2.5. Suppose that X is a scalar and follows the uniform distribution on [−1/2, 1/2].
Specify the conditional treatment effect to be τ(X) = (−X)^3. In this specification, τ(X) “flattens
out” at τ(X) = 0, and accordingly, the density function of τ(X) is unbounded in the neighborhood
of τ(X) = 0. This specification leads to PX(|τ(X)| ≤ t) = 2t^{1/3}, so the margin assumption holds
with α = 1/3 and η = 1/8.
Example 2.6. Suppose that the distribution of X is the same as in Example 2.5. Let h > 0 and
specify τ(X) as

τ(X) = X − h for X ≤ 0,
τ(X) = X + h for X > 0.

This τ(X) is discontinuous at X = 0, and the distribution of τ(X) puts zero probability around the
margin τ(X) = 0. It holds that

PX(|τ(X)| ≤ t) = 0 for t ≤ h,
PX(|τ(X)| ≤ t) = 2(t − h) for h < t ≤ 1/2 + h.

By setting η = h, the margin condition holds for arbitrarily large α. In general, if the distribution
of τ(X) has a gap around the margin τ(X) = 0, the margin condition holds with arbitrarily large
α.
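The probabilities stated in Examples 2.5 and 2.6 are easy to check numerically. The following sketch (our illustration, not part of the paper) draws X uniformly on [−1/2, 1/2], verifies PX(|τ(X)| ≤ t) = 2t^{1/3} from Example 2.5 by Monte Carlo, and confirms the zero-probability gap of Example 2.6:

```python
import random

random.seed(0)
n = 200_000
xs = [random.uniform(-0.5, 0.5) for _ in range(n)]

# Example 2.5: tau(X) = (-X)^3, so P(|tau(X)| <= t) = 2 t^(1/3) for 0 <= t <= 1/8.
for t in (0.001, 0.01, 0.1):
    freq = sum(abs((-x) ** 3) <= t for x in xs) / n
    assert abs(freq - 2 * t ** (1 / 3)) < 0.01

# Example 2.6: tau(X) = X - h for X <= 0 and X + h for X > 0 leaves a gap of
# width 2h around tau = 0, so P(|tau(X)| <= t) = 0 for every t < h.
h = 0.05
tau = [x - h if x <= 0 else x + h for x in xs]
assert sum(abs(v) < h for v in tau) == 0
```

Since the margin assumption only restricts behavior on [0, η], the gap case passes it trivially for every α, which is why η = h works with α arbitrarily large.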
From now on, we denote the class of P satisfying Assumptions 2.1 and 2.2 by PFB(M, κ, η, α).6
The next theorem provides the upper bound on the welfare loss of the EWM rule when the class of
data distributions is constrained to PFB(M, κ, η, α).
Theorem 2.3. Under Assumptions 2.1 and 2.2,

sup_{P ∈ PFB(M,κ,η,α)} EP n [ W(G∗FB) − W(ĜEWM) ] ≤ c (v/n)^{(1+α)/(2+α)}

holds for all n, where c is a positive constant that depends only on M, κ, η, and α.
Similarly to Theorem 2.1, the presented welfare loss upper bound is non-asymptotic, and it is
valid for every sample size. Our derivation of this theorem can be seen as an extension of the finite
sample risk bound for the classification error shown in Theorem 2 of Massart and Nédélec (2006).
Our rate upper bound is consistent with the uniform convergence rate of the classification risk of
the empirical risk minimizing classifier shown in Theorem 1 of Tsybakov (2004).7 This coincidence
is somewhat expected, given that the empirical welfare criterion that the EWM rule maximizes
resembles the empirical classification risk in the classification problem.
The next theorem shows that the uniform convergence rate of n^{−(1+α)/(2+α)} obtained in Theorem 2.3
attains the minimax rate lower bound, implying that any treatment choice rule Ĝ based on data
(including ĜEWM) cannot attain a uniform convergence rate faster than n^{−(1+α)/(2+α)}. This means that
the EWM rule remains rate optimal even when the class of data generating processes is constrained
additionally by Assumption 2.2.
6 Note that PFB(M, κ, η, α) depends on the set of feasible treatment rules G via Assumption 2.2 (FB).
7 Tsybakov (2004) defines the complexity of the decision sets G in terms of the growth coefficient ρ of the bracketing
number of G. We control the complexity of G in terms of the VC-dimension, which corresponds to Tsybakov’s growth
coefficient ρ being arbitrarily close to zero.
Theorem 2.4. Suppose Assumptions 2.1 and 2.2 hold. Assume that the VC-dimension of G satisfies v ≥ 2. Then, for any treatment choice rule Ĝ as a function of (Z1, . . . , Zn), it holds that

sup_{P ∈ PFB(M,κ,η,α)} EP n [ W(G∗FB) − W(Ĝ) ] ≥ 2^{−1} exp(−2√2) M^{2(1+α)/(2+α)} η^{−α/(2+α)} ((v − 1)/n)^{(1+α)/(2+α)}

for all n ≥ max{(M/η)^2, 4^{2+α}} (v − 1).
The following remarks summarize some analytical insights associated with Theorems 2.1 - 2.4.
Remark 2.1. The convergence rates of the worst-case EWM welfare loss obtained in Theorems
2.1 and 2.3 highlight how the margin coefficient α influences the uniform performance of the EWM
rule. A higher α improves the welfare loss convergence rate of EWM, and the convergence rate
approaches n^{−1} in the extreme case where the distribution of τ(X) has a gap around τ(X) = 0. As
fewer individuals lie near the margin τ(X) = 0, the maximal welfare can be attained more quickly.
Conversely, as α approaches zero (more individuals near the margin), the welfare loss convergence
rate of EWM approaches n^{−1/2}, which corresponds to the uniform convergence rate of Theorem 2.1.
Remark 2.2. The upper bounds of the welfare loss convergence rate shown in Theorems 2.1 and 2.3
are increasing in the VC-dimension of G. Since they are valid at every n, we can allow the VC-dimension of the candidate treatment rules to grow with the sample size. For instance, if we consider
a sequence of candidate decision sets {Gn : n = 1, 2, . . . } whose VC-dimension grows with the
sample size at rate n^λ, 0 < λ < 1, Theorems 2.1 and 2.3 imply that the welfare loss uniform
convergence rate of the EWM rule slows down to n^{−(1−λ)/2} for the case without Assumption 2.2 and
to n^{−(1−λ)(1+α)/(2+α)} for the case with Assumption 2.2. Note that the welfare loss lower bounds shown
in Theorems 2.2 and 2.4 have the VC-dimensions of the same order as in the corresponding upper
bounds, so we can conclude that the EWM rule is also minimax rate optimal even in the situations
where the complexity of G grows with the sample size.
Remark 2.3. Note that the welfare loss lower bounds of Theorems 2.2 and 2.4 are valid for any
estimated treatment choice rule Ĝ irrespective of whether Ĝ is constrained to G or not. Therefore,
the nonparametric plug-in rule Ĝplug−in defined in (1.13) is subject to the same lower bound. In
Section 4.3, we further discuss the welfare loss uniform convergence rate of the nonparametric
plug-in rule.
Remark 2.4. Let PFB(M, κ) be the class of data generating processes that satisfy Assumption
2.1 and Assumption 2.2 (FB). A close inspection of the proofs of Theorems 2.1 and 2.2 given in
Appendix A.2 shows that the same lower and upper bounds of Theorems 2.1 and 2.2 can be obtained
even when P(M, κ) is replaced with PFB(M, κ). In this sense, Assumption 2.2 (MA) plays the
main role in improving the welfare loss convergence rate.
2.4 Unknown Propensity Score
We have so far considered situations where the true propensity score is known. This would not be
the case if the data were obtained from an observational study, in which the assignment of treatment
is generally not under the control of the experimenter. To cope with the unknown propensity score,
this section considers two hybrids of the EWM approach and the parametric/nonparametric plug-in approach: the m-hybrid rule defined in (1.11) and the e-hybrid rule defined in (1.12). The
e-hybrid rule employs the trimming rule 1{εn ≤ ê(Xi) ≤ 1 − εn} with a deterministic sequence
{εn : n = 1, 2, . . . }, which we assume to converge to zero faster than some polynomial rate, εn ≤
O(n^{−a}), a > 0.8
Let Wnτ (G) be the sample analogue of the welfare criterion (1.2) that one would construct if
the true regression equations were known, Wnτ (G) ≡ En (m0 (Xi )) + En (τ (Xi ) · 1{Xi ∈ G}), and
Ŵnτ (G) be the empirical welfare with the conditional treatment effect estimators τ̂ m (·) plugged in,
Ŵnτ (G) ≡ En [m0 (Xi ) + τ̂ m (Xi ) 1 {Xi ∈ G}] .
(2.3)
Since the m-hybrid rule maximizes Ŵnτ(·), it holds that Ŵnτ(Ĝm−hybrid) − Ŵnτ(G̃) ≥ 0 for any G̃ ∈ G.
The following inequalities therefore follow:

W(G̃) − W(Ĝm−hybrid) ≤ Wnτ(G̃) − Ŵnτ(G̃) − [Wnτ(Ĝm−hybrid) − Ŵnτ(Ĝm−hybrid)]
    + W(G̃) − W(Ĝm−hybrid) − Wnτ(G̃) + Wnτ(Ĝm−hybrid)    (2.4)
= (1/n) Σ_{i=1}^{n} [τ(Xi) − τ̂m(Xi)] · [1{Xi ∈ G̃} − 1{Xi ∈ Ĝm−hybrid}]
    + W(G̃) − Wnτ(G̃) + Wnτ(Ĝm−hybrid) − W(Ĝm−hybrid)
≤ (1/n) Σ_{i=1}^{n} |τ̂m(Xi) − τ(Xi)| + 2 sup_{G∈G} |Wnτ(G) − W(G)|.

This implies that the average welfare loss of the m-hybrid rule can be bounded by

EP n [ WG∗ − W(Ĝm−hybrid) ] ≤ EP n [ (1/n) Σ_{i=1}^{n} |τ̂m(Xi) − τ(Xi)| ] + 2 EP n [ sup_{G∈G} |Wnτ(G) − W(G)| ].    (2.5)

8 The trimming sequence εn is introduced only to simplify the derivation of the rate upper bound of the welfare
loss. In practical terms, if the overlap condition is well satisfied in the given data, trimming is not necessary for
computing the e-hybrid rule.
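To make the m-hybrid construction concrete, here is a minimal sketch (our illustration, under simplifying assumptions: a scalar covariate, linear least-squares estimates of m1 and m0, and a hypothetical class of one-sided threshold rules {x : x ≥ t}); the constant term En[m0(Xi)] in Ŵnτ is dropped since it does not affect the argmax:

```python
import random

random.seed(1)
n = 400

# Simulated experiment: tau(x) = x, e(x) = 1/2, m0(x) = 0, so the
# first-best rule is {x : x >= 0}.
X = [random.uniform(-1, 1) for _ in range(n)]
D = [random.randint(0, 1) for _ in range(n)]
Y = [x * d + random.gauss(0, 0.5) for x, d in zip(X, D)]

def ols(xs, ys):
    """Slope and intercept of a simple least-squares fit."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sum((u - mx) ** 2 for u in xs)
    return b, my - b * mx

# Step 1: estimate m1 and m0 by separate regressions on each treatment arm,
# and form tau_hat(X_i) = m1_hat(X_i) - m0_hat(X_i).
b1, a1 = ols([x for x, d in zip(X, D) if d == 1], [y for y, d in zip(Y, D) if d == 1])
b0, a0 = ols([x for x, d in zip(X, D) if d == 0], [y for y, d in zip(Y, D) if d == 0])
tau_hat = [(a1 + b1 * x) - (a0 + b0 * x) for x in X]

# Step 2: maximize the empirical welfare E_n[tau_hat(X_i) 1{X_i >= t}]
# over the threshold class; candidate cutoffs are the data points themselves.
def emp_welfare(t):
    return sum(th for x, th in zip(X, tau_hat) if x >= t) / n

t_hat = max(X + [-1.0, 1.0], key=emp_welfare)  # m-hybrid threshold estimate
```

With this design the estimated threshold should sit near the point where the estimated treatment effect crosses zero, i.e., near the first-best cutoff 0.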
For the e-hybrid rule, replacing Wnτ(·) and Ŵnτ(·) in (2.4) with the empirical welfare Wn(·) defined
in (1.7) and Ŵn(G) ≡ En [ Yi(1 − Di)/(1 − e(Xi)) + τ̂ei · 1{Xi ∈ G} ], respectively, yields a similar upper bound

EP n [ WG∗ − W(Ĝe−hybrid) ] ≤ EP n [ (1/n) Σ_{i=1}^{n} |τ̂ei − τi| ] + 2 EP n [ sup_{G∈G} |Wn(G) − W(G)| ],    (2.6)

where τi = YiDi/e(Xi) − Yi(1 − Di)/(1 − e(Xi)). Since the uniform convergence rate of EP n [sup_{G∈G} |Wnτ(G) − W(G)|]
is the same as that of EP n [sup_{G∈G} |Wn(G) − W(G)|],9 these upper bounds imply that the lack of
knowledge of the propensity score may harm the welfare loss convergence rate if the average estimation error of the conditional treatment effect converges more slowly than EP n [sup_{G∈G} |Wn(G) − W(G)|].
It is therefore convenient to first state the condition regarding the convergence rate of the average
estimation error of the conditional treatment effect estimators.
Condition 2.1.
(m) (m-hybrid case): Let τ̂m(x) = m̂1(x) − m̂0(x) be an estimator for the conditional treatment
effect τ(x) = m1(x) − m0(x). For a class of data generating processes Pm, there exists a sequence
ψn → ∞ such that

lim sup_{n→∞} sup_{P∈Pm} ψn EP n [ (1/n) Σ_{i=1}^{n} |τ̂m(Xi) − τ(Xi)| ] < ∞    (2.7)

holds.
(e) (e-hybrid case): Let τ̂ei = [ YiDi/ê(Xi) − Yi(1 − Di)/(1 − ê(Xi)) ] · 1{εn ≤ ê(Xi) ≤ 1 − εn} be an
estimator for τi = YiDi/e(Xi) − Yi(1 − Di)/(1 − e(Xi)), where ê(·) is an estimated propensity score.
For a class of data generating processes Pe, there exists a sequence φn → ∞ such that

lim sup_{n→∞} sup_{P∈Pe} φn EP n [ (1/n) Σ_{i=1}^{n} |τ̂ei − τi| ] < ∞.    (2.8)

In Appendix B, we show that the estimators τ̂m(·) and τ̂ei constructed via local polynomial
regressions satisfy this condition for a certain class of data generating processes.
9 This claim follows by applying the proof of Theorem 2.1 with the following class of functions:

Fτ ≡ { f(Xi; G) ≡ m0(Xi) + τ(Xi) · 1{Xi ∈ G} : G ∈ G }.

Fτ is a VC-subgraph class with VC-dimension at most v by Lemma A.1 in Appendix A.1.
Theorems 2.5 and 2.6 below derive the uniform convergence rate bounds of the hybrid rules in two different
scenarios. In Theorem 2.5, we constrain the class of data generating processes only by Assumption
2.1 and Condition 2.1, and, importantly, we allow the class of decision rules G to exclude the first-best rule G∗FB. Theorem 2.5 follows as a corollary of Theorem 2.1 and inequalities (2.5) and (2.6),
so we omit a proof.
Theorem 2.5. Suppose Assumption 2.1 holds.
(m) (m-hybrid case): Given a class of data generating processes Pm, if an estimator for the
conditional treatment effect τ̂m(·) satisfies Condition 2.1 (m), then

sup_{P ∈ Pm ∩ P(M,κ)} EP n [ WG∗ − W(Ĝm−hybrid) ] ≤ O( ψn^{−1} ∨ n^{−1/2} ).

(e) (e-hybrid case): Given a class of data generating processes Pe, if an estimator for the propensity
score ê(·) satisfies Condition 2.1 (e), then

sup_{P ∈ Pe ∩ P(M,κ)} EP n [ WG∗ − W(Ĝe−hybrid) ] ≤ O( φn^{−1} ∨ n^{−1/2} ).
A comparison of Theorem 2.5 with Theorem 2.1 shows that the uniform rate upper bounds for
the hybrid EWM rules are no faster than the welfare loss convergence rate of the EWM rule with known
propensity score. Note that if some nonparametric estimator is used to estimate τ(·) or e(·), the sequence ψn or
φn specified in Condition 2.1 is generally slower than n^{1/2}. Hence, the welfare loss upper bounds
of the hybrid rules are determined by the nonparametric rate ψn^{−1} or φn^{−1}. A special case where
the estimation of τ(·) or e(·) does not affect the uniform convergence rate is when τ(·) or e(·) is
assumed to belong to a parametric family and is estimated parametrically, i.e., ψn or φn is equal
to n^{1/2}.
In the second scenario, we consider the case where G contains the first-best decision rule G∗F B
and the data generating processes are constrained further by the margin assumption (Assumption
2.2) with margin coefficient α ∈ (0, 1].
Theorem 2.6. Suppose Assumptions 2.1 and 2.2 hold with a margin coefficient α ∈ (0, 1]. Assume
that a stronger version of Condition 2.1 holds, where (2.7) and (2.8) are replaced by

lim sup_{n→∞} sup_{P∈Pm} EP n [ ( ψ̃n max_{1≤i≤n} |τ̂m(Xi) − τ(Xi)| )^2 ] < ∞    (2.9)

and

lim sup_{n→∞} sup_{P∈Pe} EP n [ ( φ̃n max_{1≤i≤n} |τ̂ei − τi| )^2 ] < ∞,    (2.10)

for sequences ψ̃n → ∞ and φ̃n → ∞, respectively. Then, we have

sup_{P ∈ Pm ∩ PFB(M,κ,α,η)} EP n [ W(G∗FB) − W(Ĝm−hybrid) ] ≤ O( ψ̃n^{−(1+α)} ∨ n^{−(1+α)/(2+α)} log ψ̃n ),

sup_{P ∈ Pe ∩ PFB(M,κ,α,η)} EP n [ W(G∗FB) − W(Ĝe−hybrid) ] ≤ O( φ̃n^{−(1+α)} ∨ n^{−(1+α)/(2+α)} log φ̃n ).
Theorem 2.6 shows that even when τ(·) or e(·) has to be estimated, the margin coefficient α
influences the rate upper bound of the welfare loss. A higher α leads to a faster rate of welfare
loss convergence regardless of whether τ(·) and e(·) are estimated parametrically or nonparametrically. In the situation where τ(·) or e(·) is estimated parametrically (with a compact support of
X), ψ̃n or φ̃n is equal to n^{1/2}; thus, the uniform welfare loss convergence rate is given by the second
argument in O(·), n^{−(1+α)/(2+α)}. On the other hand, when τ(·) or e(·) is estimated nonparametrically,
which of the two terms in O(·) converges more slowly depends on the dimension of X and the degree of
smoothness of the underlying nonparametric function. See Corollaries 2.1 and 2.2 below for specific
expressions of ψ̃n and φ̃n when local polynomial regressions are used to estimate τ(·) or e(·).
Note that Theorems 2.5 and 2.6 concern the upper bound of the convergence rate. We do
not have the universal rate lower bound results for these constrained classes of data generating
processes. We leave the investigation of the sharp rate bound of the hybrid-EWM welfare loss for
future research.
2.4.1 Hybrid EWM with Local Polynomial Estimators
In this subsection, we focus on local polynomial estimators for τ (x) and e(x), and spell out classes
of data generating processes Pm and Pe as well as ψ n , ψ̃ n , φn , and φ̃n that satisfy Condition 2.1
and the assumption of Theorem 2.6.
Consider the m-hybrid approach in which leave-one-out local polynomial estimators are
used to estimate m1(Xi) and m0(Xi), i.e., m̂1(Xi) and m̂0(Xi) are constructed by fitting the
local polynomials excluding the i-th observation.10
10 The reason to consider the leave-one-out fitted values is to simplify the analytical verification of Condition 2.1. We
believe that the welfare loss convergence rates of the hybrid approaches will not be affected even when the i-th
observation is included in estimating m̂1(Xi) and m̂0(Xi).
For any multi-index s = (s1, . . . , sdx) ∈ N^dx
and any x = (x1, . . . , xdx) ∈ R^dx, we define |s| ≡ Σ_{i=1}^{dx} si, s! ≡ s1! · · · sdx!, x^s ≡ x1^{s1} · · · xdx^{sdx}, and
||x|| ≡ (x1^2 + · · · + xdx^2)^{1/2}. Let K(·) : R^dx → R be a kernel function and h > 0 be a bandwidth.
At each Xi, i = 1, . . . , n, we define the leave-one-out local polynomial coefficient estimators with
degree l ≥ 0 as

θ̂1(Xi) = arg min_θ Σ_{j≠i, Dj=1} [ Yj − θᵀU((Xj − Xi)/h) ]^2 K((Xj − Xi)/h),
θ̂0(Xi) = arg min_θ Σ_{j≠i, Dj=0} [ Yj − θᵀU((Xj − Xi)/h) ]^2 K((Xj − Xi)/h),

where U((Xj − Xi)/h) is the vector with elements indexed by the multi-index s, i.e.,
U((Xj − Xi)/h) ≡ ( ((Xj − Xi)/h)^s )_{|s|≤l}.11 With a slight abuse of notation, we define U(0) = (1, 0, . . . , 0)ᵀ. Let λn,1(Xi) be
the smallest eigenvalue of

B1(Xi) ≡ (nh^{dx})^{−1} Σ_{j≠i, Dj=1} U((Xj − Xi)/h) Uᵀ((Xj − Xi)/h) K((Xi − Xj)/h),

and λn,0(Xi) be the smallest eigenvalue of

B0(Xi) ≡ (nh^{dx})^{−1} Σ_{j≠i, Dj=0} U((Xj − Xi)/h) Uᵀ((Xj − Xi)/h) K((Xi − Xj)/h).

Accordingly, we construct the leave-one-out local polynomial fits for m1(Xi) and m0(Xi) by

m̂1(Xi) = Uᵀ(0)θ̂1(Xi) · 1{λn,1(Xi) ≥ tn},
m̂0(Xi) = Uᵀ(0)θ̂0(Xi) · 1{λn,0(Xi) ≥ tn},
where tn is a positive sequence that slowly converges to zero, such as tn ∝ (log n)−1 . These trimming
rules regularize the regressor matrices of the local polynomial regressions and simplify the proof of
the uniform consistency of the local polynomial estimators.
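For the special case of degree l = 0 (a local constant fit), U(·) is the scalar 1 and B1(Xi) reduces to a kernel average, whose "smallest eigenvalue" is the average itself. The following sketch (our illustration: scalar X, uniform kernel, treated observations only) computes the leave-one-out fit m̂1(Xi) with the trimming rule 1{λn,1(Xi) ≥ tn}:

```python
import math
import random

random.seed(2)
n = 300
h = 0.2
t_n = 1.0 / math.log(n)  # slowly vanishing trimming level, t_n ∝ (log n)^(-1)

# Simulated treated arm: m1(x) = x^2 with X uniform on [0, 1].
X = [random.random() for _ in range(n)]
Y = [x ** 2 + random.gauss(0, 0.1) for x in X]

def K(u):
    """Uniform kernel with support [-1, 1], integrating to one."""
    return 0.5 if abs(u) <= 1 else 0.0

def m1_hat(i):
    """Leave-one-out local constant fit at X_i with eigenvalue trimming."""
    num = den = 0.0
    for j in range(n):
        if j == i:                  # leave out the i-th observation
            continue
        w = K((X[j] - X[i]) / h)
        num += w * Y[j]
        den += w
    lam = den / (n * h)             # the 1x1 "matrix" B_1(X_i) itself
    if lam < t_n:                   # trimming rule 1{lambda_{n,1}(X_i) >= t_n}
        return 0.0
    return num / den
```

At interior points λn,1(Xi) concentrates near the design density, so the trimming rarely binds in this example; it protects against regions with too few nearby observations.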
To characterize Pm in Condition 2.1, we impose the following restrictions.
Assumption 2.3.
(Smooth-m) Smoothness of the Regressions: The regression equations m1(·) and m0(·) belong to a
Hölder class of functions with degree βm ≥ 1 and constant Lm < ∞.12
(PX) Support and Density Restrictions on PX: Let X ⊂ R^dx be the support of PX. Let Leb(·) be
the Lebesgue measure on R^dx. There exist constants c and r0 such that

Leb(X ∩ B(x, r)) ≥ c Leb(B(x, r)), ∀ 0 < r ≤ r0, ∀ x ∈ X,    (2.11)

and PX has a density function dPX/dx(·) with respect to the Lebesgue measure on R^dx that is bounded
from above and bounded away from zero, 0 < p_X ≤ dPX/dx(x) ≤ p̄X < ∞ for all x ∈ X.
11 We specify the same bandwidth for these two local polynomial regressions only to suppress the notational burden.
12 Let D^s denote the differential operator D^s ≡ ∂^{s1+···+sdx} / (∂x1^{s1} · · · ∂xdx^{sdx}). Let β ≥ 1 be an integer. For any x ∈ R^dx and any
(β − 1) times continuously differentiable function f : R^dx → R, we denote the Taylor expansion polynomial of degree
(β − 1) at point x by fx(x′) ≡ Σ_{|s|≤β−1} ((x′ − x)^s / s!) D^s f(x). Let L > 0. The Hölder class of functions in R^dx with
degree β and constant 0 < L < ∞ is defined as the set of functions f : R^dx → R that are (β − 1) times continuously
differentiable and satisfy, for any x and x′ ∈ R^dx, the inequality |fx(x′) − f(x′)| ≤ L ||x − x′||^β.
(Ker) Bounded Kernel with Compact Support: The kernel function K(·) has support [−1, 1]^dx,
∫_{R^dx} K(u) du = 1, and sup_u K(u) ≤ Kmax < ∞.
Smoothness of the regression equations, Assumption 2.3 (Smooth-m), is a standard assumption
in the context of nonparametric regressions. Assumption 2.3 (PX) is borrowed from Audibert and
Tsybakov (2007), and it provides regularity conditions on the marginal distribution of X. Inequality
condition (2.11) constrains the shape of the support of X, and it essentially rules out the case where
X has “sharp” spikes, i.e., X ∩ B(x, r) has an empty interior or Leb (X ∩ B(x, r)) converges to zero
as r → 0 faster than the rate of r2 for some x in the boundary of X .
Lemma B.4 in Appendix B shows that when Pm consists of the data generating processes
satisfying Assumption 2.3 (Smooth-m) and (PX), Condition 2.1 (m) holds with ψn = n^{1/(2+dx/βm)},
and (2.9) holds with ψ̃n = n^{1/(2+dx/βm)} (log n)^{−1/(2+dx/βm)−2}. The following corollary therefore follows.
Corollary 2.1. Let Pm consist of data generating processes that satisfy Assumption 2.3 (Smooth-m) and (PX). Let m̂1(Xi) and m̂0(Xi) be the leave-one-out local polynomial estimators with degree
l = (βm − 1), whose kernels satisfy Assumption 2.3 (Ker).
(i) Suppose Assumption 2.1 holds. Then, it holds that

sup_{P ∈ Pm ∩ P(M,κ)} EP n [ WG∗ − W(Ĝm−hybrid) ] ≤ O( n^{−1/(2+dx/βm)} ).

(ii) Suppose Assumptions 2.1 and 2.2 hold with margin coefficient α ∈ (0, 1]. Then, it holds that

sup_{P ∈ Pm ∩ PFB(M,κ,α,η)} EP n [ W(G∗FB) − W(Ĝm−hybrid) ]
≤ O( n^{−(1+α)/(2+dx/βm)} (log n)^{(1/(2+dx/βm)+2)(1+α)} ∨ n^{−(1+α)/(2+α)} log n ).
Next, consider the e-hybrid approach. For each i = 1, . . . , n, define a leave-one-out local polynomial fit for the propensity score as

ê(Xi) = Uᵀ(0)θ̂e(Xi) · 1{λn(Xi) ≥ tn},
θ̂e(Xi) = arg min_θ Σ_{j≠i} [ Dj − θᵀU((Xj − Xi)/h) ]^2 K((Xj − Xi)/h).

We then construct an estimate of the individual treatment effect as

τ̂ei = [ YiDi/ê(Xi) − Yi(1 − Di)/(1 − ê(Xi)) ] · 1{εn ≤ ê(Xi) ≤ 1 − εn}, with 0 < εn ≤ O(n^{−a}), a > 0.
To ensure Condition 2.1 (e), we now assume smoothness of the propensity score function e(·).
Assumption 2.4.
This assumption is the same as Assumption 2.3 except that 2.3 (Smooth-m)
is replaced by
(Smooth-e) Smoothness of the Propensity Score: The propensity score e(·) belongs to a Hölder class
of functions with degree β e ≥ 1 and constant Le < ∞.
Again, Lemma B.4 in Appendix B shows that when Pe consists of the data generating processes
satisfying Assumption 2.4, Condition 2.1 (e) holds with φn = n^{1/(2+dx/βe)} and (2.10) holds with
φ̃n = n^{1/(2+dx/βe)} (log n)^{−1/(2+dx/βe)−2}.
Corollary 2.2. Let Pe consist of data generating processes that satisfy Assumption 2.4 (Smooth-e)
and (PX). Let ê(Xi) be the leave-one-out local polynomial estimator with degree l = (βe − 1), whose
kernel satisfies Assumption 2.3 (Ker).
(i) Suppose Assumption 2.1 holds. Then, it holds that

sup_{P ∈ Pe ∩ P(M,κ)} EP n [ WG∗ − W(Ĝe−hybrid) ] ≤ O( n^{−1/(2+dx/βe)} ).

(ii) Suppose Assumptions 2.1 and 2.2 hold with margin coefficient α ∈ (0, 1]. Then, it holds that

sup_{P ∈ Pe ∩ PFB(M,κ,α,η)} EP n [ W(G∗FB) − W(Ĝe−hybrid) ]
≤ O( n^{−(1+α)/(2+dx/βe)} (log n)^{(1/(2+dx/βe)+2)(1+α)} ∨ n^{−(1+α)/(2+α)} log n ).
A comparison of Corollaries 2.1 and 2.2 shows that the rate upper bound of the welfare loss differs
between the m-hybrid EWM and the e-hybrid EWM approaches when the degree of Hölder smoothness of the regression equations, βm, and that of the propensity score, βe, are different. For instance,
if the propensity score e(·) is smoother than the outcome regression equations m1(·) and m0(·)
in the sense that βe > βm, and the degree of the local polynomial regressions is chosen accordingly, then
the rate upper bound of the e-hybrid EWM rule converges faster than that of the m-hybrid EWM
rule.
3 Inference for Welfare
In the proposed EWM procedure, the maximized empirical welfare Wn (ĜEW M ) can be seen as an
estimate of W (ĜEW M ), the welfare level attained by implementing the estimated treatment rule.13
In this section, we provide a procedure for constructing asymptotically valid confidence intervals
for the population welfare gain of implementing the estimated rule.
Let Ĝ ∈ G be an estimated treatment rule such as ĜEW M , Ĝm−hybrid , and Ĝe−hybrid . Define
the welfare gain of implementing an estimated treatment rule Ĝ ∈ G by
V (Ĝ) = W (Ĝ) − W (G0 ),
where G0 is a benchmark treatment assignment rule with which the estimated treatment rule Ĝ
is compared in terms of the social welfare. For instance, if the estimated treatment rule Ĝ is
compared with the “no treatment” case, G0 is the empty set ∅. Alternatively, if a benchmark
policy is the non-individualized uniform adoption of the treatment, G0 is set at G0 = X , and V (Ĝ)
is interpreted as the welfare gain of implementing individualized treatment assignment instead of
the non-individualized implementation of the treatment.
A construction of the confidence intervals for V(Ĝ) proceeds as follows. Let νn(G) = √n (Vn(G) − V(G)),
where Vn(G) ≡ Wn(G) − Wn(G0). If there is a random variable ν̄n such that νn(Ĝ) ≤ ν̄n holds
P n-almost surely, and if ν̄n converges in distribution to a non-degenerate random variable ν̄, then,
with c1−α̃ the (1 − α̃)-th quantile of ν̄, it holds that

P n( νn(Ĝ) ≤ c1−α̃ ) ≥ P n( ν̄n ≤ c1−α̃ ) → Pr( ν̄ ≤ c1−α̃ ) = 1 − α̃, as n → ∞.

Hence, if ĉ1−α̃, a consistent estimator of c1−α̃, is available, an asymptotically valid one-sided confidence interval for V(Ĝ) with coverage probability (1 − α̃) is given by

[ Vn(Ĝ) − ĉ1−α̃/√n , ∞ ).    (3.1)

In the algorithm summarized below, we specify ν̄n = √n sup_{G∈G} (Vn(G) − V(G)) and
estimate ĉ1−α̃ by bootstrapping the supremum of the centered empirical process.14
13 It is important to note that in finite samples, Wn(ĜEWM) estimates W(ĜEWM) with an upward bias. With
fixed n, the size of the bias becomes bigger as G becomes more complex.
14 The current choice of ν̄n is likely to yield conservative confidence intervals. Keeping the same nominal coverage
probability, it is feasible to tighten the confidence intervals with a more sophisticated choice of ν̄n, such as
ν̄n = √n sup_{G∈Ĝn} (Vn(G) − V(G)), where Ĝn is a data-dependent subclass of G that contains Ĝ with probability
approaching one.
Algorithm 3.1.
1. Let Ĝ ∈ G be an estimated treatment assignment rule (e.g., the EWM rule, the
hybrid rules, etc.), and let Vn(·) = Wn(·) − Wn(G0) be the empirical welfare gain obtained from
the original sample.
2. Resample n observations of Zi = (Yi, Di, Xi) randomly with replacement from the original
sample and construct the bootstrap analogue of the welfare gain, Vn∗(·) = Wn∗(·) − Wn∗(G0),
where Wn∗(·) is the empirical welfare of the bootstrap sample.
3. Compute ν̄n∗ = √n sup_{G∈G} (Vn∗(G) − Vn(G)).
4. Let α̃ ∈ (0, 1/2). Repeat steps 2 and 3 many times and estimate ĉ1−α̃ by the empirical (1 − α̃)-th quantile of the bootstrap realizations of ν̄n∗.
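Algorithm 3.1 can be sketched in a few lines (our illustration, with simplifying assumptions: known propensity score e(x) = 1/2, benchmark G0 = ∅, and a small hypothetical class of threshold rules Gt = {x : x ≥ t}):

```python
import random

random.seed(3)
n, B, alpha = 300, 200, 0.10

# Simulated experiment: tau(x) = x, e(x) = 1/2; Z_i = (Y_i, D_i, X_i).
Z = []
for _ in range(n):
    x, d = random.uniform(-1, 1), random.randint(0, 1)
    Z.append((x * d + random.gauss(0, 0.5), d, x))

thresholds = [-1.0, -0.5, 0.0, 0.5, 1.0]  # finite class G_t = {x : x >= t}

def V_n(sample, t):
    """Empirical welfare gain over G0 = empty set (known e(x) = 1/2)."""
    return sum((y * d / 0.5 - y * (1 - d) / 0.5) * (x >= t)
               for y, d, x in sample) / len(sample)

# Step 1: the estimated rule maximizes the empirical welfare gain.
base = {t: V_n(Z, t) for t in thresholds}
t_hat = max(thresholds, key=base.get)

# Steps 2-4: bootstrap the supremum of the centered process, take its quantile.
sup_stats = []
for _ in range(B):
    boot = [random.choice(Z) for _ in range(n)]
    sup_stats.append(n ** 0.5 * max(V_n(boot, t) - base[t] for t in thresholds))
c_hat = sorted(sup_stats)[int((1 - alpha) * B) - 1]

# One-sided confidence interval (3.1) for V(G_hat).
lower = base[t_hat] - c_hat / n ** 0.5
```

Because the supremum is taken over the whole class, the resulting interval is conservative, in line with footnote 14.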
Given Assumption 2.1, the uniform central limit theorem for empirical processes ensures that ν̄n
converges in distribution to ν̄, the supremum of a mean-zero Brownian bridge process. Furthermore,
by the well-known result on the asymptotic validity of bootstrapped empirical processes (see, e.g.,
Section 3.6 of van der Vaart and Wellner (1996)), the bootstrap critical value ĉ1−α̃ consistently
estimates the corresponding quantile of ν̄. We can therefore conclude that the confidence interval
constructed in (3.1) has the desired asymptotic coverage probability.
4 Extensions

4.1 Empirical Welfare Maximization with a Capacity Constraint
Empirical welfare maximization could also be used to select treatment rules when one of the treatments is scarce and the planner faces a capacity constraint on the proportion of the target population
that could be assigned to it. Capacity constraints exist in various treatment choice problems. In
medicine, stocks of new drugs and vaccines could be smaller than the number of patients who may
benefit from them. In education, a limited number of slots is available in “magnet” schools. Training
programs for the unemployed are sometimes capacity-constrained. The limited capacity of a prison
system could make it infeasible to assign incarceration as a treatment for all convicted criminals.
In these cases, treatment assignment rules that propose to assign too many individuals to the
capacity-constrained treatment cannot be fully implemented.
We assume that the availability of treatment 1 is constrained.
Assumption 4.1.
(CC) Capacity Constraint: The proportion of the target population that could receive treatment 1
cannot exceed K ∈ (0, 1).
If the population distribution of covariates PX were known, maximization of the empirical
welfare criterion could be simply restricted to sets in class G that satisfy the capacity constraint
GK ≡ {G ∈ G : PX (G) ≤ K}.
Being a subset of G, the class of sets GK has the same complexity as G (or lower), and all previous
results could be applied simply by replacing G with GK .
The population distribution of covariates is often not known precisely. In these cases, it is
impossible to guarantee with certainty that estimated treatment rule Ĝ will satisfy the capacity
constraint with probability one. To evaluate the welfare of any treatment assignment method in this
setting, we first need to make an assumption about what happens when the estimated treatment
set Ĝ violates the capacity constraint.
We assume that if the treatment rule G violates the capacity constraint, i.e., PX(G) > K,
then the scarce treatment is randomly allocated (“rationed”) to a fraction K/PX(G) of the assigned
recipients with X ∈ G, independently of (X, Y0, Y1). If G does not violate the capacity constraint,
then there is no rationing and all recipients with covariates X ∈ G receive treatment 1. This
assumption holds if individuals arrive sequentially for treatment assignment in an order that is
independent of (X, Y0, Y1) and receive treatment 1 on a first-come, first-served basis if X ∈ G. It
could also correspond to settings in which treatment 1 is allocated by lottery to individuals with
X ∈ G.
This allows us to clearly define the capacity-constrained welfare of the treatment rule indexed
by any subset G ⊂ X of the covariate space as

WK(G) ≡ EP [ ( m1(X) · min{1, K/PX(G)} + m0(X) · (1 − min{1, K/PX(G)}) ) · 1{X ∈ G} + m0(X) · 1{X ∉ G} ].

Then the capacity-constrained welfare gain of the treatment rule G equals

VK(G) ≡ WK(G) − WK(∅)
= EP [ min{1, K/PX(G)} · τ(X) · 1{X ∈ G} ]
= min{1, K/PX(G)} · V(G).
Rationing dilutes the effect of treatment rules that violate the capacity constraint, and we take
this effect on welfare into account. In comparison, the nonparametric plug-in treatment rules proposed by
Bhattacharya and Dupas (2012) are only required to satisfy the capacity constraint on average over
repeated data samples.
The maximum welfare gain attainable by treatment rules in G in the presence of a capacity
constraint equals V∗K ≡ sup_{G∈G} VK(G). Even with full knowledge of the outcome distribution, it
is possible that the optimal policy would assign the scarce treatment to a subset of the population
with PX(G) > K, requiring rationing within that subpopulation. This is unlikely to happen in
practice when the distribution of covariates does not have atoms and the collection of treatment
rules G is sufficiently rich. The following condition, for example, guarantees that the optimal
capacity-constrained policy does not require rationing.
Remark 4.1. If the collection of treatment rules G contains the upper contour sets of τ(x),

Gt ≡ {x ∈ X : τ(x) ≥ t}, t ∈ R,

and PX(Gt) is continuous in t, then the optimal policy GK = arg max_{G∈G} VK(G) belongs to the set
{Gt, t ∈ R} and satisfies the capacity constraint without rationing: PX(GK) ≤ K.
We propose a treatment rule that maximizes the empirical analog of the capacity-constrained
welfare gain VK(G) (and, hence, welfare):

ĜK ≡ arg max_{G∈G} VK,n(G),    (4.1)

where

VK,n(G) ≡ min{1, K/PX,n(G)} · Vn(G)
= min{1, K/PX,n(G)} · En [ ( YiDi/e(Xi) − Yi(1 − Di)/(1 − e(Xi)) ) · 1{Xi ∈ G} ],
and PX,n is the empirical probability distribution of (X1 , . . . , Xn ). The following theorem shows
that the expected welfare of ĜK converges to the maximum at least at n−1/2 rate. The result is
analogous to Theorem 2.1, with the additional term corresponding to potential welfare losses due
to misestimation of PX (G).
Theorem 4.1. Under Assumptions 2.1 and 4.1,

sup_{P ∈ P(M,κ)} EP n [ sup_{G∈G} WK(G) − WK(ĜK) ] ≤ C1 (M/κ) √(v/n) + C1 (M/K) √(v/n),

where C1 is the universal constant in Lemma A.4.
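The estimator (4.1) can be sketched as follows (our illustration, with simplifying assumptions: known e(x) = 1/2, capacity K = 0.3, and a hypothetical finite threshold class); the empirical welfare gain is scaled by the rationing factor min{1, K/PX,n(G)}:

```python
import random

random.seed(4)
n, K = 500, 0.3

# Simulated experiment: tau(x) = x on X uniform in [-1, 1], e(x) = 1/2.
X = [random.uniform(-1, 1) for _ in range(n)]
D = [random.randint(0, 1) for _ in range(n)]
Y = [x * d + random.gauss(0, 0.5) for x, d in zip(X, D)]

thresholds = [-0.5, 0.0, 0.4, 0.8]  # candidate rules G_t = {x : x >= t}

def V_Kn(t):
    """Empirical capacity-constrained welfare gain V_{K,n}(G_t)."""
    share = sum(x >= t for x in X) / n                   # P_{X,n}(G_t)
    if share == 0:
        return 0.0
    v_n = sum((y * d / 0.5 - y * (1 - d) / 0.5) * (x >= t)
              for x, d, y in zip(X, D, Y)) / n           # V_n(G_t)
    return min(1.0, K / share) * v_n                     # rationing factor

t_K = max(thresholds, key=V_Kn)  # estimated capacity-constrained rule
```

With K = 0.3 and X uniform on [−1, 1], rules with cutoffs below 0.4 are rationed (PX(Gt) > K), so the constrained optimum shifts the threshold upward relative to the unconstrained rule {x ≥ 0}.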
4.2 Target Population that Differs from the Sampled Population
The Empirical Welfare Maximization method can be adapted to select treatment rules for a target
population that differs from the sampled population in the distribution of covariates X, but has
the same conditional treatment effect function τ(x).
As before, P denotes the probability distribution of (Y0,i , Y1,i , Di , Xi ) in the sampled population.
We denote by P T the probability distribution of (Y0,i , Y1,i , Xi ) in the target population and by E T
the expectations with respect to that probability. The welfare of implementing treatment rule G in
the target population is W T (G). We assume that the sampled population has the same conditional
treatment effect as the target population.
Assumption 4.2.
(ID) Identical Treatment Effects: E T (Y1 − Y0 |X) = EP (Y1 − Y0 |X).
(BDR) Bounded Density Ratio: Probability distributions PXT and PX have densities pTX and pX
with respect to a common dominating measure on X and for some ρ(x) ≤ ρ̄ < ∞:
pTX (x) = ρ(x) · pX (x).
The welfare gain of treatment rule G in the target population equals

V^T(G) ≡ ∫_X τ(x) 1{x ∈ G} dP^T_X(x) = ∫_X τ(x) 1{x ∈ G} ρ(x) dPX(x).
Assumption 4.2 (ID) implies that the first-best treatment rule G∗F B = 1{x : τ (x) ≥ 0} is the
same in the sampled and the target populations. Therefore, when the first-best policy is feasible,
i.e. G∗F B ∈ G, we could directly apply the EWM treatment rule ĜEW M computed for the sampled
population to the target population. Note that from the definition of G∗F B , it follows that for any
treatment rule G and any x
τ (x)1{x ∈ G∗F B } − τ (x)1{x ∈ G} ≥ 0.
In conjunction with Assumption 4.2 (BDR), this yields an upper bound on the welfare loss of
applying any treatment rule in the target population expressed in terms of the welfare loss of
applying the same rule in the sampled population.
  W^T(G*_FB) − W^T(G) = V^T(G*_FB) − V^T(G)
    = ∫_X [ τ(x) 1{x ∈ G*_FB} − τ(x) 1{x ∈ G} ] ρ(x) dP_X(x)
    ≤ ρ̄ ∫_X [ τ(x) 1{x ∈ G*_FB} − τ(x) 1{x ∈ G} ] dP_X(x)
    = ρ̄ [ W(G*_FB) − W(G) ].
All of the welfare convergence rate results derived for EWM and hybrid treatment rules in previous
sections also hold when these treatment rules are applied to the target population, as long as
Assumptions 2.2 (FB) and 4.2 hold.
When the first-best policy is not feasible (G*_FB ∉ G), the second-best policies for the sampled and the target populations, G* ∈ arg max_{G∈G} W(G) and G^{T*} ∈ arg max_{G∈G} W^T(G), are generally different. The welfare of the treatment rules proposed in the previous sections does not generally converge to the second-best welfare in the target population, sup_{G∈G} W^T(G).
The second-best in the target population could be obtained by reweighting the argument of the
EWM problem by the density ratio ρ(Xi ). This method works even when the first-best treatment
rule is not feasible and could be combined with a treatment capacity constraint. The reweighted
EWM problem is
  Ĝ^T_EWM ∈ arg max_{G∈G} E_n [ ( Y_i D_i / e(X_i) − Y_i (1 − D_i) / (1 − e(X_i)) ) · ρ(X_i) · 1{X_i ∈ G} ].   (4.2)
Applying the law of iterated expectations with respect to X, we obtain

  E_P [ ( Y D / e(X) − Y (1 − D) / (1 − e(X)) ) · ρ(X) · 1{X ∈ G} ] = ∫_X τ(x) 1{x ∈ G} ρ(x) dP_X(x) = V^T(G).   (4.3)
Theorem 4.2 shows that the welfare loss of the reweighted EWM treatment rule in the target
population converges to zero at least at n−1/2 rate.
Theorem 4.2. If the distribution P in the sampled population satisfies Assumption 2.1 and the
distribution P T in the target population satisfies Assumption 4.2, then
  sup_{P∈P(M,κ)} E_{P^n} [ sup_{G∈G} W^T(G) − W^T(Ĝ^T_EWM) ] ≤ C1 (M ρ̄ / κ) √(v/n),

where C1 is the universal constant in Lemma A.4.
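To make the reweighting concrete, the sketch below evaluates the sample analog of the reweighted objective (4.2) over a simple one-dimensional threshold class. Everything here is illustrative: the data generating process, the function names, and the threshold grid are our own assumptions, with ρ(x) = 1 so the target and sampled populations coincide.

```python
import numpy as np

def reweighted_ewm(Y, D, X, e, rho, thresholds):
    """Maximize the sample analog of V^T(G) in (4.2) over the threshold
    class G = {x >= t : t in thresholds}.  The propensity score e(x) and
    the density ratio rho(x) are assumed known and passed as functions."""
    ex = e(X)
    # n * g_i from (4.2): inverse-propensity-weighted outcome times the density ratio
    score = (Y * D / ex - Y * (1 - D) / (1 - ex)) * rho(X)
    values = [np.mean(score * (X >= t)) for t in thresholds]
    best = int(np.argmax(values))
    return thresholds[best], values[best]

# Illustrative RCT: tau(x) = x, X ~ U(-1, 1), e(x) = 1/2, rho(x) = 1.
rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(-1.0, 1.0, n)
D = rng.integers(0, 2, n).astype(float)
Y = D * X                      # Y0 = 0, Y1 = tau(x) = x
t_hat, v_hat = reweighted_ewm(
    Y, D, X,
    e=lambda x: np.full_like(x, 0.5),
    rho=lambda x: np.ones_like(x),
    thresholds=np.linspace(-1.0, 1.0, 41),
)
# The population optimum treats {x >= 0} and attains V^T = E[x 1{x >= 0}] = 1/4.
```

Replacing rho with an estimated density ratio and the threshold class with a richer class such as G_LES recovers the general problem (4.2).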
4.3 Comparison with the Nonparametric Plug-in Rule
The plug-in treatment choice rule (1.13) with parametrically or nonparametrically estimated m1 (x)
and m0 (x) is intuitive and simple to implement. In situations where flexible treatment assignment
rules are allowed and the dimension of conditioning covariates is small, the nonparametric plug-in
rule would be a competing alternative to the EWM approach. In this section, we review the welfare
loss convergence rate results of the nonparametric plug-in rule and discuss potential advantages and
disadvantages of these two approaches.
We denote the class of data generating processes that satisfy Assumptions 2.1 (UCF), (BO), (SO), Assumption 2.2 (MA), and Assumption 2.3 by P_smooth(M, κ, α, η, β_m). Given the smoothness assumption on the regression equations, we consider estimating m1 and m0 by local polynomial estimators of degree (β − 1). The convergence rate results of the nonparametric plug-in classifiers shown in Theorem 3.3 of Audibert and Tsybakov (2007) can be straightforwardly extended to the treatment choice context, resulting in
  sup_{P∈P_smooth(M,κ,α,η,β_m)} E_{P^n} [ W(G*_FB) − W(Ĝ_plug-in) ] ≤ O( n^{−(1+α)/(2+d_x/β_m)} ).   (4.4)
Furthermore, if αβ ≤ d_x, Theorem 3.5 of Audibert and Tsybakov (2007) applied to the current treatment choice setup shows that the nonparametric plug-in rule attains the rate lower bound, i.e., for any treatment rule Ĝ,

  sup_{P∈P_smooth(M,κ,α,η,β_m)} E_{P^n} [ W(G*_FB) − W(Ĝ) ] ≥ O( n^{−(1+α)/(2+d_x/β_m)} )

holds.
In practically relevant situations where αβ ≤ d_x,15 a naive comparison of the welfare loss convergence rate of the plug-in rule presented here with that of EWM (Theorems 2.3 and 2.4) would suggest that, in terms of the welfare loss convergence rate, the EWM rule outperforms the nonparametric plug-in rule. It is, however, important to notice that the classes of data generating processes over which the uniform rates are ensured differ between the two cases. P_smooth(M, κ, α, η, β) is constrained to smooth regression equations and continuously distributed X, whereas P_FB(M, κ, α, η) considered in Theorems 2.3 and 2.4 allows for discontinuous regression equations and imposes no restriction on the marginal distribution of X. Assumption 2.2 (FB) on P_FB(M, κ, α, η) requires that {x : τ(x) ≥ 0} belongs to the pre-specified VC-class G, whereas P_smooth(M, κ, α, η, β) is free from such an assumption. This non-nested relationship between P_FB(M, κ, α, η) and P_smooth(M, κ, α, η, β) makes the naive rate comparison between (4.4) and Theorem 2.3 less meaningful, because a data generating process in P_smooth(M, κ, α, η, β) that yields the slowest convergence rate for the nonparametric plug-in rule is in fact excluded from P_FB(M, κ, α, η). Accordingly, unless we can assess which of P_smooth(M, κ, α, η, β) and P_FB(M, κ, α, η) is more likely to contain the true data generating process, these rate results offer limited guidance on which procedure should be used in a given application.

15 In analogy to Proposition 3.4 of Audibert and Tsybakov (2007), when the class of data generating processes is assumed to have αβ > d_x, no data generating process in this class can have the conditional treatment effect τ(x) = 0 in the interior of the support of P_X. In the practice of causal inference, we would not a priori restrict the plausible data generating processes to these extreme cases; the class of data generating processes with αβ > d_x is therefore less relevant in practice.
In practical terms, we consider these two distinct approaches as complementary, and our choice
between them should be based on available assumptions and the dimension of covariates in a given
application. A practical advantage of the EWM rule is that the welfare loss convergence rate does
not directly depend on the dimension of X, so when an available credible assumption on the level
set {x : τ (x) ≥ 0} implies a certain class of decision sets with a finite VC-dimension, the EWM
approach offers a practical solution to get around the curse of dimensionality of X. A potential
drawback of using the EWM rule is the risk of misspecification of G, i.e., if Assumption 2.2 (FB) is
not valid, the EWM rule only attains the second-best welfare, whereas the nonparametric plug-in
rule is guaranteed to yield the first-best welfare in the limit. Another aspect of comparison is that
the performance of the EWM rule is stable regardless of whether the underlying data generating
processes, including the marginal distribution of X and the regression equations m1 (X) and m0 (X),
are smooth. Furthermore, in terms of implementation, the EWM approach does not require the user
to specify smoothing parameters once the class of candidate decision sets G is given. In contrast,
the nonparametric plug-in rule requires smoothing parameters. The statistical performance of
the nonparametric plug-in rule can be sensitive to the choice of smoothing parameters, and the
theoretical results of the convergence rate given in (4.4) assume the user’s ability to choose the
smoothing parameter properly.
5 Computing EWM Treatment Rules
The Empirical Welfare Maximization estimator Ĝ, as well as the hybrid estimators Ĝ_m-hybrid and Ĝ_e-hybrid, share the same structure

  Ĝ ∈ arg max_{G∈G} Σ_{1≤i≤n} g_i · 1{X_i ∈ G},   (5.1)

where each g_i is a function of the data: for the EWM rule Ĝ_EWM, g_i = (1/n) ( Y_i D_i / e(X_i) − Y_i (1 − D_i) / (1 − e(X_i)) ); for the e-hybrid rule Ĝ_e-hybrid, g_i = τ̂^e_i / n; and for the m-hybrid rule Ĝ_m-hybrid, g_i = τ̂^m(X_i) / n.
The objective function in (5.1) is non-convex and discontinuous in G, thus finding Ĝ could be
computationally challenging. In this section, we propose a set of convenient tools that permit
solving this optimization problem and performing inference using widely available software16 for
practically important classes of sets G defined by linear eligibility scores.
5.1 Single Linear Index Rules
We start with the problem of computing optimal treatment rules that assign treatments based
on a linear index (linear eligibility score; LES, see Examples 2.1 and 2.2). To reduce notational
complexity, we include a constant in the covariate vector X throughout the exposition of this
section. An LES rule can be expressed as 1{X T β ≥ 0}. This type of treatment rule is commonly
used in practice because it offers a simple way to reduce the dimension of observable characteristics.
Furthermore, it is easy to enforce monotonicity of treatment assignment in specific covariates by
imposing sign restrictions on the components of β.
Let GLES be a collection of half-spaces of the covariate space X , which are the upper contour
sets of linear functions:
  G_LES = { G_β : β ∈ B ⊂ R^{d_x+1} },   G_β = { x : x^T β ≥ 0 }.
Then the optimization problem (5.1) becomes:

  max_{β∈B} Σ_{1≤i≤n} g_i · 1{ X_i^T β ≥ 0 }.   (5.2)
This problem is similar to the maximum weighted score problem analyzed in Florios and Skouras (2008). They observe that the maximum score objective function could be rewritten as a Mixed Integer Linear Programming (MILP) problem with additional binary parameters (z1, ..., zn) that replace the indicator functions 1{X_i^T β ≥ 0}. The equality z_i = 1{X_i^T β ≥ 0} is imposed by a combination of linear inequality constraints and the restriction that the z_i's are binary. The advantage of a MILP representation is that it is a standard optimization problem that could be solved by multiple commercial and open-source solvers. The branch-and-cut algorithms implemented in these solvers are faster than brute-force combinatorial optimization.

16 For the empirical illustration we used IBM ILOG CPLEX Optimization Studio, which is available free for academic use through the IBM Academic Initiative.
We propose replacing (5.2) by its equivalent problem:

  max_{β∈B, z1,...,zn∈R} Σ_{1≤i≤n} g_i · z_i   (5.3)

  s.t. X_i^T β / C_i < z_i ≤ 1 + X_i^T β / C_i for i = 1, ..., n,
       z_i ∈ {0, 1},   (5.4)

where the constants C_i should satisfy C_i > sup_{β∈B} |X_i^T β|. Then the inequality constraints (5.4) and the restriction that the z_i's are binary imply that z_i = 1 if and only if X_i^T β ≥ 0. It follows that the maximum value of (5.3) for each value of β is the same as the value of (5.2).
The problem (5.3) is a linear optimization problem with linear inequality constraints and integer constraints on the z_i's; if the set B is defined by linear inequalities, it could be passed to any MILP solver. Florios and Skouras (2008) impose only one side of the inequality constraint (5.4) for each
solver. Florios and Skouras (2008) impose only one side of the inequality constraint (5.4) for each
i. For gi > 0, it is sufficient to impose only the upper bound on zi and for gi < 0 only the lower
bound. The other side of the bound is always satisfied by the solution due to the direction of the
objective function.
Our formulation has significant advantages. Despite a larger number of inequalities, it reduces
the computation time in our applications by a factor of 10-40. Furthermore, it is not sufficient to
impose only one side of the inequalities on zi ’s for optimization with a capacity constraint considered
further below.
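As an illustration of the MILP formulation (5.3)-(5.4), the problem can be handed to an off-the-shelf solver. The sketch below uses scipy.optimize.milp (SciPy 1.9+) instead of CPLEX, normalizes B to the box [−1, 1]^{d_x+1}, and approximates the strict inequality in (5.4) with a small ε; all of these choices are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def ewm_linear_milp(X, g, eps=1e-6):
    """Solve max_{beta in B} sum_i g_i 1{X_i'beta >= 0} via (5.3)-(5.4).

    X : (n, d) covariates, first column a constant; g : (n,) weights.
    B is taken to be the box [-1, 1]^d, so C_i = ||X_i||_1 + 1 satisfies
    C_i > sup_{beta in B} |X_i'beta|."""
    n, d = X.shape
    C = np.abs(X).sum(axis=1) + 1.0
    # Decision vector (beta, z); maximize g'z  <=>  minimize -g'z.
    c = np.concatenate([np.zeros(d), -g])
    # eps <= z_i - X_i'beta / C_i <= 1 encodes z_i = 1{X_i'beta >= 0}.
    A = np.hstack([-X / C[:, None], np.eye(n)])
    constraints = LinearConstraint(A, eps, 1.0)
    integrality = np.concatenate([np.zeros(d), np.ones(n)])  # z binary, beta continuous
    bounds = Bounds(np.concatenate([-np.ones(d), np.zeros(n)]),
                    np.ones(d + n))
    res = milp(c=c, constraints=constraints, integrality=integrality, bounds=bounds)
    beta, z = res.x[:d], np.round(res.x[d:]).astype(int)
    return beta, z, -res.fun   # maximized empirical welfare
```

Imposing both sides of the inequality, as discussed above, is what makes the same encoding reusable under a capacity constraint.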
Inference on the welfare gain V(Ĝ_EWM) of the empirical welfare maximizing policy requires computing ν̄*_n = sup_{G∈G} √n (V*_n(G) − V_n(G)) in each bootstrap sample. Denoting the bootstrap weights by {w*_i}, Σ_{i=1}^n w*_i = n, ν̄*_n could be expressed as

  ν̄*_n = √n sup_{β∈B} Σ_{1≤i≤n} (w*_i − 1) g_i · 1{ X_i^T β ≥ 0 }.   (5.5)
The optimization problem for ν̄ ∗n is analogous to the optimization problem for ĜEW M . Furthermore,
solving it does not require the knowledge of ĜEW M , hence all bootstrap computations could be
performed in parallel with the main EWM problem.
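For intuition, the bootstrap statistic (5.5) can be computed by brute force when G is a one-dimensional threshold class (used here as an illustrative stand-in for G_LES); multinomial weights give Σ_i w*_i = n, and each draw is independent of the EWM solution:

```python
import numpy as np

def bootstrap_sup_stats(X, g, thresholds, n_boot=200, seed=0):
    """Compute nu*_n = sqrt(n) sup_G sum_i (w_i - 1) g_i 1{X_i in G}  (5.5)
    for each bootstrap draw, over the class G = {[t, inf) : t in thresholds}."""
    n = len(X)
    # (T, n) indicator matrix: one row per candidate rule
    ind = (X[None, :] >= np.asarray(thresholds)[:, None]).astype(float)
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.multinomial(n, np.full(n, 1.0 / n))   # bootstrap counts, sum to n
        stats[b] = np.sqrt(n) * (ind @ ((w - 1) * g)).max()
    return stats
```

Since no draw depends on Ĝ_EWM, the loop over b parallelizes trivially, mirroring the remark above.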
5.2 Multiple Linear Index Rules
We extend this method to compute treatment rules based on multiple linear scores. These rules
construct J scores that are linear in covariates (or in their functions) and assign an individual to
treatment if each score exceeds a specific threshold. An example of a multiple index treatment rule
with three indices is when an individual is assigned to a job training program if (25 ≤ age ≤ 35)
AND (wage at the previous job < $15). The results are easily extended to treatment rules that apply
if any of the indices exceeds its threshold, for example, (age ≥ 40) OR (length of unemployment ≥
2 years).
Let the treatment assignment set G be defined as an intersection of upper contour sets of J linear functions:

  G = { G_{β1,...,βJ} : β1, ..., βJ ∈ B },   G_{β1,...,βJ} = { x : x^T β1 ≥ 0, ..., x^T βJ ≥ 0 }.
Then the optimization problem (5.1) becomes

  max_{β1,...,βJ∈B} Σ_{1≤i≤n} g_i · 1{ X_i^T β1 ≥ 0, ..., X_i^T βJ ≥ 0 }.   (5.6)
We propose its equivalent formulation as a MILP problem with auxiliary binary variables (z_i^1, ..., z_i^J, z_i^*), i = 1, ..., n:

  max_{β1,...,βJ∈B, z_i^1,...,z_i^J,z_i^*∈R} Σ_{1≤i≤n} g_i · z_i^*   (5.7)

  s.t. X_i^T βj / C_i < z_i^j ≤ 1 + X_i^T βj / C_i for 1 ≤ i ≤ n, 1 ≤ j ≤ J,   (5.8)
       1 − J + Σ_{1≤j≤J} z_i^j ≤ z_i^* ≤ J^{−1} Σ_{1≤j≤J} z_i^j for 1 ≤ i ≤ n,   (5.9)
       z_i^1, ..., z_i^J, z_i^* ∈ {0, 1} for 1 ≤ i ≤ n.

Similarly to the single index problem, the inequalities (5.8) and the constraint that the z_i^j's are binary together imply that z_i^j = 1{X_i^T βj ≥ 0}. The linear inequalities (5.9) and the binary constraints together imply that

  z_i^* = z_i^1 · ... · z_i^J = 1{X_i^T β1 ≥ 0} · ... · 1{X_i^T βJ ≥ 0}.
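The pinching logic of (5.9) is easy to verify by enumeration: for any binary z_i^1, ..., z_i^J, the two linear bounds admit exactly one binary value of z_i^*, namely the product. A small self-contained check:

```python
from itertools import product

def admissible_zstar(zs):
    """Binary z* values satisfying (5.9): 1 - J + sum_j z^j <= z* <= (1/J) sum_j z^j."""
    J, s = len(zs), sum(zs)
    return [z for z in (0, 1) if 1 - J + s <= z <= s / J]

# For every binary combination, the only admissible z* is the product of the z^j's.
for J in (1, 2, 3):
    for zs in product((0, 1), repeat=J):
        assert admissible_zstar(zs) == [int(all(zs))]
```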
The problem for a collection of sets defined by the union of linear inequalities

  G_{β1,...,βJ} = { x : x^T β1 ≥ 0 or ... or x^T βJ ≥ 0 }

could also be written as a MILP problem with the inequality constraint (5.9) replaced by

  J^{−1} Σ_{1≤j≤J} z_i^j ≤ z_i^* ≤ Σ_{1≤j≤J} z_i^j for i = 1, ..., n.   (5.10)

5.3 Optimization with a Capacity Constraint
When there is a capacity constraint K on the proportion of the population that could be assigned to treatment 1, the Empirical Welfare Maximization problem (4.1) on a set G of half-spaces becomes

  max_{β∈B} min{ 1, Kn / Σ_{i=1}^n 1{X_i^T β ≥ 0} } Σ_{1≤i≤n} g_i · 1{ X_i^T β ≥ 0 }.   (5.11)

This problem cannot be rewritten as a linear optimization problem in the same way as (5.3) because the factor min{1, Kn / Σ_{i=1}^n 1{X_i^T β ≥ 0}} varies with β. This factor could take fewer than n different values, and the maximum of (5.11) could be obtained by solving a sequence of optimization problems, each of which holds this factor constant.
For k = ⌊Kn⌋, ..., n, solve

  max_{β∈B, z1,...,zn∈R} min{ 1, Kn / k } Σ_{1≤i≤n} g_i · z_i

  s.t. X_i^T β / C_i < z_i ≤ 1 + X_i^T β / C_i for 1 ≤ i ≤ n,
       z_i ∈ {0, 1},
       Σ_{1≤i≤n} z_i ≤ k.
The capacity constrained problem with multiple indexes could be solved similarly.
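The sequential scheme can be sketched for the one-dimensional threshold class, where treating {x ≥ t} means treating the k individuals with the largest covariate values and the factor min{1, Kn/k} is constant within each subproblem. This is a brute-force stand-in for the MILP sequence, with illustrative names:

```python
import numpy as np

def capacity_ewm_threshold(X, g, K):
    """Capacity-constrained EWM (5.11) over threshold rules {x >= t}:
    for each treated count k, the scaling factor min(1, K*n/k) is fixed,
    mirroring the sequence of fixed-factor problems above."""
    n = len(X)
    order = np.argsort(-X)          # treat the k largest covariate values
    cum = np.cumsum(g[order])       # welfare of treating the top-k
    best_v, best_k = 0.0, 0         # k = 0: treat no one
    for k in range(1, n + 1):
        v = min(1.0, K * n / k) * cum[k - 1]
        if v > best_v:
            best_v, best_k = v, k
    return best_v, best_k
```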
6 Empirical Application
We illustrate the Empirical Welfare Maximization method by applying it to experimental data from
the National Job Training Partnership Act (JTPA) Study. A detailed description of the study and an assessment of average program effects for five large subgroups of the target population can be found in Bloom et al. (1997). The study randomized whether applicants would be eligible to receive a
mix of training, job-search assistance, and other services provided by the JTPA for a period of
18 months. It collected background information on the applicants prior to random assignment, as
well as administrative and survey data on applicants’ earnings in the 30-month period following
the assignment. We use the same sample of 11,204 adults (22 years and older) used in the original
evaluation of the program and in the subsequent studies (Bloom et al., 1997, Heckman et al., 1997,
Abadie et al., 2002). The probability of being assigned to the treatment was one third in this
sample.
We use two simple welfare outcome measures for our illustration. The first is the total individual
earnings in the 30-month period following the program assignment. The second outcome measure
is the 30-month earnings minus $1,000 cost for assigning individuals to the treatment, which is
close to the average cost (per assignee) of the additional services provided by the JTPA program,
as estimated by Bloom et al. (1997). The first outcome measure reflects social preferences that put
no weight on the costs of the program incurred by the government. The second measure weighs
participants’ gains and the government’s losses equally.
For all treatment rules, we report the estimated intention-to-treat effect of assigning all individuals with covariates X ∈ G to the treatment. The take-up of different program services varied
across individuals. We view the policy maker’s problem as a choice of eligibility criteria for the
program and not a choice of the take-up rate (which is decided by individuals); hence, we are not
interested in the treatment effect on compliers. Since we have to compare welfare effects of policies
that assign different proportions of the population to the treatment, we report estimates of the
average effect per population member E[(Y1 − Y0 ) · 1{X ∈ G}], which is proportional to the total
welfare effect of the treatment rule G.
The pre-treatment variables on which we consider conditioning the treatment assignment are the individual's years of education and earnings in the year prior to the assignment. Both variables may plausibly affect how much the individual benefits from the program services. We
do not use race, gender, or age. Though treatment effects may vary with these characteristics,
policy makers usually cannot use them to determine treatment assignment, since this may be easily
perceived as discrimination. Education and earnings are generally verifiable characteristics. This
is an important feature for implementing the proposed treatment assignment, because the empirical welfare estimates are inaccurate for the target population if individuals can manipulate their characteristics to obtain the desired treatment.
Table 1 reports the estimated welfare gains of alternative treatment rules. All of them are
estimated by inverse probability weighting. The average effect of the program on 30-month earnings
for the whole study population is $1,269. If treatment cost of $1,000 per assignee is taken into
account, the net average effect of assigning everyone to treatment is estimated to be $269.
We consider two candidate classes of treatment rules for EWM. The first is the class of quadrant treatment rules:

  G_Q ≡ { {x : s1 (education − t1) > 0 and s2 (prior earnings − t2) > 0} : s1, s2 ∈ {−1, 0, 1}, t1, t2 ∈ R }.   (6.1)
This class of treatment eligibility rules is easily implementable and is often used in practice. To be
assigned to treatment according to such rules, an individual’s education and pre-program earnings
have to be above (or below) some specific thresholds.
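Because the quadrant class (6.1) is finite once thresholds are restricted to observed covariate values, the EWM rule can be found by exhaustive search. A minimal sketch (the variable names and the strict-inequality convention are our own illustrative choices):

```python
import numpy as np

def ewm_quadrant(educ, earn, g):
    """Exhaustive EWM search over the quadrant class (6.1):
    treat if s1*(educ - t1) > 0 AND s2*(earn - t2) > 0, where s = 0
    drops the corresponding restriction."""
    best_w, best_rule = 0.0, (1, np.inf, 0, 0.0)   # default: treat no one
    for s1 in (-1, 0, 1):
        for s2 in (-1, 0, 1):
            for t1 in (np.unique(educ) if s1 else [0.0]):
                for t2 in (np.unique(earn) if s2 else [0.0]):
                    treat = np.ones(len(g), dtype=bool)
                    if s1:
                        treat &= s1 * (educ - t1) > 0
                    if s2:
                        treat &= s2 * (earn - t2) > 0
                    w = g[treat].sum()
                    if w > best_w:
                        best_w, best_rule = w, (s1, t1, s2, t2)
    return best_w, best_rule
```

The search is O(#values²) per sign pair, which is trivial for the two verifiable covariates used here.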
Figure 1 demonstrates the quadrant treatment rules selected by the EWM criterion. The entire
shaded area covers individuals who would be assigned to treatment if it were costless. The dark
shaded area shows the EWM treatment rule that takes into account $1,000 treatment cost. The
size of black dots indicates the number of individuals with different covariate values. Both rules set
a minimum threshold of ten years of education and a maximum threshold on pre-program earnings
($12,200 and $6,500). The estimated proportions of population assigned to treatment under these
rules are 83% and 73%.
Second, we consider the class of linear treatment rules:

  G_LES ≡ { {x : β0 + β1 · education + β2 · prior earnings > 0} : β0, β1, β2 ∈ R }.   (6.2)
Figure 2 displays the treatment rules from this class chosen according to the EWM criterion. They
are nearly identical for no treatment cost and for a cost of $1,000, assigning 82% of the population
to treatment. At higher treatment costs, EWM selects a much smaller subset of the population.
Linear treatment rules that maximize empirical welfare are markedly different from the plug-in
rule derived from linear regressions, which are shown in Figure 3. Without treatment costs, linear
regression predicts positive treatment effects for the entire range of feasible covariate values. With
a cost of $1,000, the regression predicts positive net treatment effect for about 82% of individuals.
Notably, the direction of the treatment assignment differs between the regression plug-in and linear EWM rules. The regression puts a positive coefficient on prior earnings, whereas the equation characterizing the linear EWM rule puts a negative coefficient on them. If the linear regression is
correctly specified, the regression plug-in and EWM rules have identical large sample limits. If the
regression is misspecified, however, only linear EWM treatment rules converge with sample size to
the welfare-maximizing limit. The welfare yielded by regression plug-in rules converges to a lower
limit with sample size.
Figure 4 shows plug-in treatment rules based on Kernel regressions of treatment and control
outcomes on the covariates. The bandwidths were chosen by Silverman’s rule of thumb. The class
of nonparametric plug-in rules is richer than the quadrant or the linear class of treatment rules, and
it may obtain higher welfare in large samples. It is clear from the figure, however, that this class of
patchy decision rules may be difficult to implement in public policy, where clear and transparent
treatment rules are required.
7 Conclusion
The EWM approach proposed in this paper directly maximizes a sample analog of the welfare
criterion of a utilitarian policy maker. This welfare-function-based statistical procedure for treatment choice differs fundamentally from parametric and nonparametric plug-in approaches, which
do not integrate statistical inference and the decision problem at hand. We investigated the statistical performance of the EWM rule in terms of the uniform convergence rate of the welfare loss and demonstrated that the EWM rule attains minimax optimal rates over various classes of feasible data distributions. The EWM approach offers a useful framework for individualized policy assignment problems, as it can easily accommodate the constraints that
policy makers commonly face in reality. We also presented methods to compute the EWM rule for
many practically important classes of treatment assignment rules and demonstrated them using
experimental data from the JTPA program.
Several extensions and open questions remain to be answered. First, this paper assumed that
the class of candidate policies G is given exogenously to the policy maker. We did not consider
how to select the class G when the policy maker is free to do so. Second, we ruled out the case in
which the data are subject to selection on unobservables. With self-selection into the treatment,
the welfare criterion could be only set-identified, and it is not clear how to extend the EWM idea
to this case.
Third, we restricted our analysis to the utilitarian social welfare criterion, but in
some contexts, policy makers have a non-utilitarian social welfare criterion. We leave these issues
for future research.
Table 1: Estimated welfare gain of alternative treatment assignment rules that condition on education and pre-program earnings.

Columns: Treatment rule; Share of population assigned to treatment; Empirical welfare gain per population member; Lower 90% CI (bootstrap).

Outcome variable: 30-month post-program earnings. No treatment cost.
  Average treatment effect:        1      $1,269   $719
  EWM quadrant treatment rule:     0.827  $1,614   $787
  EWM linear treatment rule:       0.820  $1,657   $765
  Linear regression plug-in rule:  1      $1,269
  Nonparametric plug-in rule:      0.816  $1,924

With $1,000 expected cost per treatment assignment:
  Average treatment effect:        1      $269     −$281
  EWM quadrant treatment rule:     0.732  $797     −$20
  EWM linear treatment rule:       0.819  $837     −$46
  Linear regression plug-in rule:  0.824  $760
  Nonparametric plug-in rule:      0.675  $1,164
[Figure 1 plots pre-program annual earnings ($0–$20K) against years of education (6–18); the legend distinguishes the EWM quadrant treatment rule with no cost, the EWM quadrant treatment rule with $1000 cost, and the population density.]

Figure 1: Empirical Welfare-Maximizing treatment rules from the quadrant class conditioning on years of education and pre-program earnings
[Figure 2 plots pre-program annual earnings ($0–$20K) against years of education (6–18); the legend distinguishes the EWM linear treatment rule with no cost, the EWM linear treatment rule with $1000 cost, and the population density.]

Figure 2: Empirical Welfare-Maximizing treatment rules from the linear class conditioning on years of education and pre-program earnings
[Figure 3 plots pre-program annual earnings ($0–$20K) against years of education (6–18); the legend distinguishes the OLS plug-in treatment rule with no cost, the OLS plug-in treatment rule with $1000 cost, and the population density.]

Figure 3: Parametric plug-in treatment rules based on the linear regressions of treatment outcomes on years of education and pre-program earnings
[Figure 4 plots pre-program annual earnings ($0–$20K) against years of education (6–18); the legend distinguishes the regions where the kernel regression estimate of the treatment effect is < $0, > $0, and > $1000.]

Figure 4: Nonparametric plug-in treatment rules based on the kernel regressions of treatment outcomes on years of education and pre-program earnings
A Appendix: Lemmas and Proofs

A.1 Notations and Basic Lemmas
Let Zi = (Yi , Di , Xi ) ∈ Z. The subgraph of a real-valued function f : Z 7→ R is the set
SG(f ) ≡ {(z, t) ∈ Z × R : 0 ≤ t ≤ f (z) or f (z) ≤ t ≤ 0}.
The following lemma establishes a link between the VC-dimension of a class of subsets in the covariate space X and the VC-dimension of a class of subgraphs of functions on Z = R × {0, 1} × X (their subgraphs will be in Z × R).
Lemma A.1. Let G be a VC-class of subsets of X with VC-dimension v < ∞. Let g and h be two given functions from Z to R. Then the set of functions from Z to R

  F = { f_G(z) = g(z) · 1{x ∈ G} + h(z) · 1{x ∉ G} : G ∈ G }

is a VC-subgraph class of functions with VC-dimension less than or equal to v.
Proof. Let zi = (yi , di , xi ). By the assumption, no set of (v + 1) points in X could be shattered
by G. Take an arbitrary set of (v + 1) points in Z × R, A = {(z1 , t1 ), ..., (zv+1 , tv+1 )}. Denote the
collection of subgraphs of F by SG(F) ≡ {SG(fG ), G ∈ G}. We want to show that SG(F) doesn’t
shatter A.
If for some i ∈ {1, . . . , (v + 1)}, (zi , ti ) ∈ SG(g) ∩ SG(h) then SG(F) cannot pick out all of
the subsets of A because the i-th point is included in any S ∈ SG(F). Similarly, if for some
i ∈ {1, . . . , (v + 1)}, (zi , ti ) ∈ SG(g)c ∩ SG(h)c , then point i cannot be included in any S ∈ SG(F).
The remaining case is that, for each i, either (z_i, t_i) ∈ SG(g) ∩ SG(h)^c or (z_i, t_i) ∈ SG(g)^c ∩ SG(h) holds. Indicate the former case by δ_i = 0 and the latter case by δ_i = 1. The points with δ_i = 0 could be picked out by SG(f_G) if and only if x_i ∉ G. The points with δ_i = 1 could be picked out if and only if x_i ∈ G. Given that G is a VC-class with VC-dimension v, there exists a subset X_0 of {x_1, ..., x_{v+1}} such that X_0 ≠ ({x_1, ..., x_{v+1}} ∩ G) for any G ∈ G. Then there could be no set S ∈ SG(F) that picks out the (possibly empty) set

  { (z_i, t_i) : (x_i ∈ X_0 and δ_i = 1) or (x_i ∉ X_0 and δ_i = 0) },   (A.1)

because this set of points could only be picked out by SG(f_G) if ({x_1, ..., x_{v+1}} ∩ G) = X_0. Hence, F is a VC-subgraph class of functions with VC-dimension less than or equal to v.
In addition to the notations introduced in the main text, the following notations are used throughout the appendix. The empirical probability distribution based on an iid size-n sample of Z_i = (Y_i, D_i, X_i) is denoted by P_n. The L2(P) metric for f is denoted by ||f||_{L2(P)} = ( ∫_Z f² dP )^{1/2}, and the sup-metric of f is denoted by ||f||_∞. Positive constants that depend only on the class of data generating processes, not on the sample size or the VC-dimension, are denoted by c_1, c_2, c_3, .... Universal constants are denoted by the capital letters C_1, C_2, ....
In what follows, we present lemmas that will be used in the proofs of Theorems 2.1 and 2.3.
Lemmas A.2 and A.3 are classical inequalities whose proofs can be found, for instance, in Lugosi
(2002).
Lemma A.2. (Hoeffding's Lemma) Let X be a random variable with EX = 0 and a ≤ X ≤ b. Then, for s > 0,

  E[e^{sX}] ≤ e^{s²(b−a)²/8}.
Lemma A.3. Let λ > 0, n ≥ 2, and let Y_1, ..., Y_n be real-valued random variables such that E[e^{sY_i}] ≤ e^{s²λ²/2} holds for all s > 0 and 1 ≤ i ≤ n. Then,

  (i) E[ max_{i≤n} Y_i ] ≤ λ √(2 ln n),
  (ii) E[ max_{i≤n} |Y_i| ] ≤ λ √(2 ln(2n)).
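Lemma A.3 (i) is easy to illustrate numerically: standard normal variables satisfy E[e^{sY}] = e^{s²/2}, so λ = 1 and the mean maximum of n draws should stay below √(2 ln n). A quick Monte Carlo check (the sample sizes are arbitrary choices of ours):

```python
import numpy as np

# Standard normals are sub-Gaussian with lambda = 1, so by Lemma A.3(i)
# E[max_{i<=n} Y_i] <= sqrt(2 ln n).
rng = np.random.default_rng(0)
n, reps = 200, 5000
mean_max = rng.standard_normal((reps, n)).max(axis=1).mean()
bound = np.sqrt(2.0 * np.log(n))   # about 3.26 for n = 200
assert mean_max <= bound
```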
The next two lemmas give maximal inequalities that bound the mean of a supremum of centered
empirical processes indexed by a VC-subgraph class of functions. The first maximal inequality
(Lemma A.4) is standard in the empirical process literature, and it yields our Theorem 2.1 as a
corollary. Though its proof can be found elsewhere (e.g., Dudley (1999), van der Vaart and Wellner
(1996)), we present it here for the sake of completeness and for later reference in the proof of Lemma
A.5. The second maximal inequality (Lemma A.5) concerns the class of functions whose diameter
is constrained by the L2 (P )-norm. Lemma A.5 will be used in the proofs of Theorem 2.3. A lemma
similar to our Lemma A.5 appears in Massart and Nédélec (2006, Lemma A.3).
Lemma A.4. Let F be a class of uniformly bounded functions, i.e., there exists F̄ < ∞ such that ||f||_∞ ≤ F̄ for all f ∈ F. Assume that F is a VC-subgraph class with VC-dimension v < ∞. Then, there is a universal constant C1 such that

  E_{P^n} [ sup_{f∈F} |E_n(f) − E_P(f)| ] ≤ C1 F̄ √(v/n)

holds for all n ≥ 1.
Proof. Introduce (Z′_1, ..., Z′_n), an independent copy of (Z_1, ..., Z_n) ~ P^n. We denote the probability law of (Z′_1, ..., Z′_n) by P^{n′}, its expectation by E_{P^{n′}}(·), and the sample average with respect to (Z′_1, ..., Z′_n) by E′_n(·). Define iid Rademacher variables σ_1, ..., σ_n such that Pr(σ_1 = −1) = Pr(σ_1 = 1) = 1/2 and they are independent of Z_1, Z′_1, ..., Z_n, Z′_n. Then,

  E_{P^n} [ sup_{f∈F} |E_n(f) − E_P(f)| ]
    = E_{P^n} [ sup_{f∈F} | E_{P^{n′}}[ E_n(f) − E′_n(f) | Z_1, ..., Z_n ] | ]
    ≤ E_{P^n} [ sup_{f∈F} E_{P^{n′}}[ |E_n(f) − E′_n(f)| | Z_1, ..., Z_n ] ]   (by Jensen's inequality)
    ≤ E_{P^n, P^{n′}} [ sup_{f∈F} |E_n(f) − E′_n(f)| ]
    = (1/n) E_{P^n, P^{n′}} [ sup_{f∈F} | Σ_{i=1}^n ( f(Z_i) − f(Z′_i) ) | ]
    = (1/n) E_{P^n, P^{n′}, σ} [ sup_{f∈F} | Σ_{i=1}^n σ_i ( f(Z_i) − f(Z′_i) ) | ]   (since f(Z_i) − f(Z′_i) ~ σ_i ( f(Z_i) − f(Z′_i) ) for all i)
    ≤ (1/n) E_{P^n, P^{n′}, σ} [ sup_{f∈F} | Σ_{i=1}^n σ_i f(Z_i) | + sup_{f∈F} | Σ_{i=1}^n σ_i f(Z′_i) | ]
    = (2/n) E_{P^n, σ} [ sup_{f∈F} | Σ_{i=1}^n σ_i f(Z_i) | ]
    = (2/n) E_{P^n} [ E_σ [ sup_{f∈F} | Σ_{i=1}^n σ_i f(Z_i) | | Z_1, ..., Z_n ] ].   (A.2)
Fix Z_1, ..., Z_n, and define f ≡ (f(Z_1), ..., f(Z_n)) = (f_1, ..., f_n), a vector of length n corresponding to the value of f ∈ F evaluated at each of (Z_1, ..., Z_n). Let F = {f : f ∈ F} ⊂ R^n, which is a bounded set in R^n with radius F̄, since F is a set of uniformly bounded functions with |f(·)| ≤ F̄. Introduce the Euclidean norm on F,

  ρ(f, f′) = ( (1/n) Σ_{i=1}^n (f_i − f′_i)² )^{1/2}.

Let f^(0) = (0, ..., 0), and let f* = (f*_1, ..., f*_n) be a random element of F maximizing |Σ_{i=1}^n σ_i f_i|. Let B_0 = {f^(0)} and construct {B_k : k = 1, ..., K̄}, a sequence of covers of F, such that B_k ⊂ F is a minimal cover with radius 2^{−k} F̄ and B_K̄ = F. Note that such K̄ < ∞ exists at given n and (Z_1, ..., Z_n). Define also {f^(k) ∈ B_k : k = 1, ..., K̄} to be a random sequence such that f^(k) ∈ arg min_{f∈B_k} ρ(f, f*). Since B_k is a cover with radius 2^{−k} F̄, ρ(f^(k), f*) ≤ 2^{−k} F̄ holds. In addition, we have

  ρ(f^(k−1), f^(k)) ≤ ρ(f^(k), f*) + ρ(f^(k−1), f*) ≤ 3 · 2^{−k} F̄.

By a telescoping sum,

  Σ_{i=1}^n σ_i f*_i = Σ_{i=1}^n σ_i f^(0)_i + Σ_{k=1}^{K̄} Σ_{i=1}^n σ_i ( f^(k)_i − f^(k−1)_i ) = Σ_{k=1}^{K̄} Σ_{i=1}^n σ_i ( f^(k)_i − f^(k−1)_i ).
We hence obtain

  E_σ | Σ_{i=1}^n σ_i f*_i | ≤ Σ_{k=1}^{K̄} E_σ | Σ_{i=1}^n σ_i ( f^(k)_i − f^(k−1)_i ) |
    ≤ Σ_{k=1}^{K̄} E_σ [ max_{f∈B_k, g∈B_{k−1} : ρ(f,g) ≤ 3·2^{−k} F̄} | Σ_{i=1}^n σ_i (f_i − g_i) | ].   (A.3)

We apply Lemma A.2 to obtain

  E_σ [ e^{s Σ_{i=1}^n σ_i (f_i − g_i)} ] = Π_{i=1}^n E_{σ_i} [ e^{s σ_i (f_i − g_i)} ]
    ≤ Π_{i=1}^n e^{s² (f_i − g_i)² / 2}
    = exp( s² n ρ²(f, g) / 2 )
    ≤ exp( s² n (3 · 2^{−k} F̄)² / 2 ).

An application of Lemma A.3 (ii) with λ = 3√n · 2^{−k} F̄ and n = |B_k| |B_{k−1}| ≤ |B_k|² then yields

  E_σ [ max_{f∈B_k, g∈B_{k−1} : ρ(f,g) ≤ 3·2^{−k} F̄} | Σ_{i=1}^n σ_i (f_i − g_i) | ] ≤ 3√n · 2^{−k} F̄ √( 2 ln( 2 |B_k|² ) )
    = 3√n · 2^{−k} F̄ √( 2 ln( 2 N(2^{−k} F̄, F, ρ)² ) )
    = 6√n · 2^{−k} F̄ √( ln( 2^{1/2} N(2^{−k} F̄, F, ρ) ) ),
where N(r, F, ρ) is the covering number of F with radius r in terms of the norm ρ. Accordingly,

  E_σ | Σ_{i=1}^n σ_i f*_i | ≤ Σ_{k=1}^{K̄} 6√n · 2^{−k} F̄ √( ln( 2^{1/2} N(2^{−k} F̄, F, ρ) ) )
    ≤ 12√n Σ_{k=1}^∞ 2^{−(k+1)} F̄ √( ln( 2^{1/2} N(2^{−k} F̄, F, ρ) ) )
    ≤ 12√n ∫_0^1 F̄ √( ln( 2^{1/2} N(ε F̄, F, ρ) ) ) dε,   (A.4)

where the last line follows from the fact that N(ε F̄, F, ρ) is decreasing in ε.
To bound (A.4) from above, we apply a uniform entropy bound for the covering number. By Theorem 2.6.7 of van der Vaart and Wellner (1996), setting $r=2$ and $Q$ equal to the empirical probability measure of $(Z_1,\dots,Z_n)$, we have
$$N(\epsilon\bar{F},\mathbf{F},\rho) \le K(v+1)(16e)^{(v+1)}\left(\frac{1}{\epsilon}\right)^{2v}, \tag{A.5}$$
where $K>0$ is a universal constant. Plugging this into (A.4) leads to
$$E_\sigma\left|\sum_{i=1}^{n}\sigma_if_i^*\right| \le 12\bar{F}\sqrt{n}\int_0^1\sqrt{\ln(2^{1/2}K)+\ln(v+1)+(v+1)\ln(16e)-2v\ln\epsilon}\,d\epsilon \le 12\bar{F}\sqrt{nv}\int_0^1\sqrt{\ln(2^{1/2}K)+\ln2+2\ln(16e)-2\ln\epsilon}\,d\epsilon = C'\bar{F}\sqrt{nv}, \tag{A.6}$$
where $C'=12\int_0^1\sqrt{\ln(2^{1/2}K)+\ln2+2\ln(16e)-2\ln\epsilon}\,d\epsilon<\infty$. Combining (A.6) with (A.2) and setting $C_1=2C'$ leads to the conclusion.
Lemma A.5. Let $\mathcal{F}$ be a class of uniformly bounded functions with $\|f\|_\infty\le\bar{F}<\infty$ for all $f\in\mathcal{F}$. Assume that $\mathcal{F}$ is a VC-subgraph class with VC-dimension $v<\infty$. Assume further that $\sup_{f\in\mathcal{F}}\|f\|_{L_2(P)}\le\delta$. Then, there exists a positive universal constant $C_2$ such that
$$E_{P^n}\left[\sup_{f\in\mathcal{F}}\left(E_n(f)-E_P(f)\right)\right] \le C_2\,\delta\sqrt{\frac{v}{n}}$$
holds for all $n\ge C_1\bar{F}^2v/\delta^2$, where $C_1$ is the universal constant defined in Lemma A.4.
Proof. By the same symmetrization argument and the same use of Rademacher variables as in the proof of Lemma A.4, we have
$$E_{P^n}\left[\sup_{f\in\mathcal{F}}\left(E_n(f)-E_P(f)\right)\right] \le \frac{2}{n}E_{P^n}\left\{E_\sigma\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}\sigma_if(Z_i)\right|\,\Big|\,Z_1,\dots,Z_n\right]\right\}. \tag{A.7}$$
Fix the values of $Z_1,\dots,Z_n$, and define $\mathbf{f}$, $\mathbf{f}^{(0)}$, $\mathbf{F}$, and the norm $\rho(\mathbf{f},\mathbf{f}')$ as in the proof of Lemma A.4. Let $\mathbf{f}^*$ be a maximizer of $\left|\sum_{i=1}^{n}\sigma_if(Z_i)\right|$ in $\mathbf{F}$ and let $\delta_n=\sup_{\mathbf{f}\in\mathbf{F}}\rho(\mathbf{f}^{(0)},\mathbf{f})\le\bar{F}$. Let $B_0=\{\mathbf{f}^{(0)}\}$ and construct $\{B_k:k=1,\dots,\bar{K}\}$, a sequence of covers of $\mathbf{F}$, such that $B_k\subset\mathbf{F}$ is a minimal cover with radius $2^{-k}\delta_n$ and $B_{\bar{K}}=\mathbf{F}$. We define $\{\mathbf{f}^{(k)}\in B_k:k=1,\dots,\bar{K}\}$ to be a random sequence such that $\mathbf{f}^{(k)}\in\arg\min_{\mathbf{f}\in B_k}\rho(\mathbf{f},\mathbf{f}^*)$. By applying the chaining argument in the proof of Lemma A.4, Lemma A.3 (i), and the uniform bound of the covering number (A.5), we obtain
$$E_\sigma\left|\sum_{i=1}^{n}\sigma_if_i^*\right| \le 12\sqrt{n}\int_0^1\delta_n\sqrt{\log N(\epsilon\delta_n,\mathbf{F},\rho)}\,d\epsilon \le C'\delta_n\sqrt{nv},$$
for the universal constant $C'$ defined in the proof of Lemma A.4. Hence, from (A.7), we have
$$E_{P^n}\left[\sup_{f\in\mathcal{F}}\left(E_n(f)-E_P(f)\right)\right] \le C_1\sqrt{\frac{v}{n}}\,E_{P^n}(\delta_n) = C_1\sqrt{\frac{v}{n}}\,E_{P^n}\left[\left(\sup_{f\in\mathcal{F}}E_n\left(f^2\right)\right)^{1/2}\right] \le C_1\sqrt{\frac{v}{n}}\left[E_{P^n}\left(\sup_{f\in\mathcal{F}}E_n\left(f^2\right)\right)\right]^{1/2}. \tag{A.8}$$
Note that $E_n\left(f^2\right)$ is bounded by
$$E_n\left(f^2\right) = E_n\left(f^2\right)-E_P\left(f^2\right)+E_P\left(f^2\right) = E_n\left[\left(f-\|f\|_{L_2(P)}\right)\left(f+\|f\|_{L_2(P)}\right)\right]+\|f\|_{L_2(P)}^2 \le 2\bar{F}\,E_n\left[f-\|f\|_{L_2(P)}\right]+\|f\|_{L_2(P)}^2 \le 2\bar{F}\,E_n\left[f-E_P(f)\right]+\|f\|_{L_2(P)}^2,$$
where the last inequality uses $\|f\|_{L_2(P)}\ge E_P(f)$, which follows from the Cauchy-Schwarz inequality.
Combining this inequality with (A.8) yields
$$E_{P^n}\left[\sup_{f\in\mathcal{F}}\left(E_n(f)-E_P(f)\right)\right] \le C_1\sqrt{\frac{v}{n}}\sqrt{2\bar{F}\,E_{P^n}\left[\sup_{f\in\mathcal{F}}\left(E_n(f)-E_P(f)\right)\right]+\delta^2}.$$
Solving this inequality for $E_{P^n}\left[\sup_{f\in\mathcal{F}}\left(E_n(f)-E_P(f)\right)\right]$ leads to
$$E_{P^n}\left[\sup_{f\in\mathcal{F}}\left(E_n(f)-E_P(f)\right)\right] \le \bar{F}C_1\left[\frac{v}{n}+\sqrt{\left(\frac{v}{n}\right)^2+\frac{v}{n}\frac{\delta^2}{\bar{F}^2C_1}}\right].$$
For $\frac{v}{n}\le\frac{\delta^2}{\bar{F}^2C_1}$, that is, $n\ge\frac{C_1\bar{F}^2v}{\delta^2}$, the upper bound can be further bounded by $(1+\sqrt{2})\sqrt{C_1}\,\delta\sqrt{\frac{v}{n}}$, so the conclusion of the lemma follows with $C_2=(1+\sqrt{2})\sqrt{C_1}$.
The next lemma is a version of Bousquet’s concentration inequality (Bousquet, 2002).
Lemma A.6. Let $\mathcal{F}$ be a countable family of measurable functions such that $\sup_{f\in\mathcal{F}}E_P(f^2)\le\delta^2$ and $\sup_{f\in\mathcal{F}}\|f\|_\infty\le\bar{F}$ for some constants $\delta^2$ and $\bar{F}$. Let $S=\sup_{f\in\mathcal{F}}\left(E_n(f)-E_P(f)\right)$. Then, for every positive $t$,
$$P^n\left(S-E_{P^n}(S) \ge \sqrt{\frac{2t\left(\delta^2+4\bar{F}E_{P^n}(S)\right)}{n}}+\frac{2\bar{F}t}{3n}\right) \le \exp(-t).$$
A.2 Proofs of Theorems 2.1 and 2.2
Proof of Theorem 2.1. Define
$$f(Z_i;G) = \frac{Y_iD_i}{e(X_i)}\cdot1\{X_i\in G\}+\frac{Y_i(1-D_i)}{1-e(X_i)}\cdot1\{X_i\notin G\},$$
and the class of functions on $\mathcal{Z}$,
$$\mathcal{F} = \{f(\cdot;G):G\in\mathcal{G}\}.$$
With these notations, we can express inequality (2.2) as
$$W_{\mathcal{G}}^*-W(\hat{G}_{EWM}) \le 2\sup_{f\in\mathcal{F}}|E_n(f)-E_P(f)|. \tag{A.9}$$
Note that Assumption 2.1 (BO) and (SO) imply that F has uniform envelope F̄ = M/κ. Also, by
Assumption 2.1 (VC) and Lemma A.1, F is a VC-subgraph class of functions with VC-dimension
at most $v$. We apply Lemma A.4 to (A.9) to obtain
$$E_{P^n}\left[W_{\mathcal{G}}^*-W(\hat{G}_{EWM})\right] \le C_1\frac{M}{\kappa}\sqrt{\frac{v}{n}}.$$
Since this upper bound does not depend on $P\in\mathcal{P}(M,\kappa)$, the upper bound is uniform over $\mathcal{P}(M,\kappa)$.
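The empirical welfare criterion maximized by EWM in this proof can be illustrated numerically. The sketch below is not from the paper: the function names and the simulated data-generating process are our own, and it evaluates the sample analog $W_n(G)$ over a one-dimensional class of threshold rules with a known propensity score.

```python
import numpy as np

def empirical_welfare(y, d, x, e, in_g):
    """Sample analog of W(G): E_n[ YD/e(X) 1{X in G} + Y(1-D)/(1-e(X)) 1{X not in G} ]."""
    return np.mean(y * d / e * in_g + y * (1 - d) / (1 - e) * (1 - in_g))

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
e = np.full(n, 0.5)                      # known propensity score
d = rng.binomial(1, e)
tau = x                                   # treatment effect increasing in x
y = d * tau + rng.normal(0, 0.1, n)       # Y1 = tau + noise, Y0 = noise

# EWM over the class of threshold rules G = {x : x >= c}
cands = np.linspace(-1, 1, 201)
welfare = [empirical_welfare(y, d, x, e, (x >= c).astype(float)) for c in cands]
c_hat = cands[int(np.argmax(welfare))]
print(c_hat)  # should lie near 0, the first-best threshold for this DGP
```

In this design the first-best rule treats exactly those with $\tau(x)=x\ge0$, so the estimated threshold concentrates around zero as $n$ grows, consistent with the $n^{-1/2}$ regret bound of Theorem 2.1.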
Proof of Theorem 2.2. In obtaining the rate lower bound, we normalize the support of the outcomes to $Y_{1i},Y_{0i}\in\left[-\frac12,\frac12\right]$. That is, we focus on bounding $\sup_{P\in\mathcal{P}(1,\kappa)}E_{P^n}\left[W_{\mathcal{G}}^*-W(\hat{G}_n)\right]$. The lower bound for the original welfare loss $\sup_{P\in\mathcal{P}(M,\kappa)}E_{P^n}\left[W_{\mathcal{G}}^*-W(\hat{G}_n)\right]$ is obtained by multiplying the lower bound of $\sup_{P\in\mathcal{P}(1,\kappa)}E_{P^n}\left[W_{\mathcal{G}}^*-W(\hat{G}_n)\right]$ by $M$.

We consider a suitable subclass $\mathcal{P}^*\subset\mathcal{P}(1,\kappa)$ for which the worst-case welfare loss can be bounded from below by a distribution-free term that converges at rate $n^{-1/2}$. The construction of $\mathcal{P}^*$ proceeds as follows. First, let $x_1,\dots,x_v\in\mathcal{X}$ be $v$ points that are shattered by $\mathcal{G}$. We constrain the marginal distribution of $X$ to be supported only on $(x_1,\dots,x_v)$. We put mass $p$ at $x_j$, $j<v$, and mass $1-(v-1)p$ at $x_v$. The marginal distribution of $X$ thus constructed is common to all members of $\mathcal{P}^*$. Let the distribution of the treatment indicator $D$ be independent of $(Y_1,Y_0,X)$, with $D$ following the Bernoulli distribution with $\Pr(D=1)=1/2$. Let $b=(b_1,\dots,b_{v-1})\in\{0,1\}^{v-1}$ be a bit vector used to index a member of $\mathcal{P}^*$; i.e., $\mathcal{P}^*$ consists of a finite number of DGPs. For each $j=1,\dots,(v-1)$, and depending on $b$, construct the following conditional distribution of $Y_1$ given $X=x_j$: if $b_j=1$,
$$Y_1 = \begin{cases}\ \ \ \frac12 & \text{with prob. } \frac12+\gamma,\\ -\frac12 & \text{with prob. } \frac12-\gamma,\end{cases} \tag{A.10}$$
and, if $b_j=0$,
$$Y_1 = \begin{cases}\ \ \ \frac12 & \text{with prob. } \frac12-\gamma,\\ -\frac12 & \text{with prob. } \frac12+\gamma,\end{cases} \tag{A.11}$$
where $\gamma\in\left(0,\frac12\right)$ is chosen properly in a later step of the proof. For $j=v$, the conditional distribution of $Y_1$ given $X=x_v$ is degenerate at $Y_1=0$. As for $Y_0$'s conditional distribution, we consider the degenerate distribution at $Y_0=0$ at every $X=x_j$, $j=1,\dots,v$. That is, when $b_j=1$, $\tau(x_j)=\gamma$, and when $b_j=0$, $\tau(x_j)=-\gamma$. Each $b\in\{0,1\}^{v-1}$ induces a unique joint distribution of $(Y_1,Y_0,D,X)\sim P_b$ and, clearly, $P_b\in\mathcal{P}(1,\kappa)$. We accordingly define $\mathcal{P}^*=\left\{P_b:b\in\{0,1\}^{v-1}\right\}$.
With knowledge of $P_b\in\mathcal{P}^*$, the optimal treatment assignment rule is
$$G_b^* = \{x_j : j<v,\ b_j=1\},$$
which is feasible, $G_b^*\in\mathcal{G}$, by the construction of the support points of $X$. The maximized social welfare is
$$W(G_b^*) = p\gamma\left(\sum_{j=1}^{v-1}b_j\right).$$
Let $\hat{G}$ be an arbitrary treatment choice rule as a function of $(Z_1,\dots,Z_n)$, and let $\hat{b}\in\{0,1\}^{(v-1)}$ be a binary vector whose $j$-th element is $\hat{b}_j=1\left\{x_j\in\hat{G}\right\}$. Consider a prior distribution $\pi(b)$ for $b$ such that $b_1,\dots,b_{v-1}$ are iid and $b_1\sim\mathrm{Ber}(1/2)$. The welfare loss satisfies the following inequalities:
$$\begin{aligned}
\sup_{P\in\mathcal{P}(1,\kappa)}E_{P^n}\left[W_{\mathcal{G}}^*-W(\hat{G})\right] &\ge \sup_{P_b\in\mathcal{P}^*}E_{P_b^n}\left[W(G_b^*)-W(\hat{G})\right]\\
&\ge \int_bE_{P_b^n}\left[W(G_b^*)-W(\hat{G})\right]d\pi(b)\\
&= \gamma\int_bE_{P_b^n}\left[P_X\left(G_b^*\triangle\hat{G}\right)\right]d\pi(b)\\
&= \gamma\int_b\int_{Z_1,\dots,Z_n}P_X\left\{b(X)\ne\hat{b}(X)\right\}dP^n(Z_1,\dots,Z_n|b)\,d\pi(b)\\
&\ge \inf_{\hat{G}}\ \gamma\int_b\int_{Z_1,\dots,Z_n}P_X\left\{b(X)\ne\hat{b}(X)\right\}dP^n(Z_1,\dots,Z_n|b)\,d\pi(b),
\end{aligned}$$
where $b(X)$ and $\hat{b}(X)$ denote the elements of $b$ and $\hat{b}$ such that $b(x_j)=b_j$ and $\hat{b}(x_j)=\hat{b}_j$. We define $b(x_v)=\hat{b}(x_v)=0$. Note that the last expression can be seen as the minimized Bayes risk with the loss function corresponding to the classification error for predicting the binary unknown random variable $b(X)$. Hence, the minimum of the Bayes risk is attained by the Bayes classifier,
$$\hat{G}^* = \left\{x_j : \pi(b_j=1|Z_1,\dots,Z_n)\ge\frac12,\ j<v\right\},$$
where $\pi(b_j|Z_1,\dots,Z_n)$ is the posterior of $b_j$. The minimized Bayes risk is given by
$$\gamma\int_{Z_1,\dots,Z_n}E_X\left[\min\left\{\pi(b(X)=1|Z_1,\dots,Z_n),\,1-\pi(b(X)=1|Z_1,\dots,Z_n)\right\}\right]d\tilde{P}^n$$
$$= \gamma\int_{Z_1,\dots,Z_n}\sum_{j=1}^{v-1}p\left[\min\left\{\pi(b_j=1|Z_1,\dots,Z_n),\,1-\pi(b_j=1|Z_1,\dots,Z_n)\right\}\right]d\tilde{P}^n, \tag{A.12}$$
where $\tilde{P}^n$ is the marginal likelihood of $\{(Y_{1,i},Y_{0,i},D_i,X_i):i=1,\dots,n\}$ corresponding to the prior $\pi(b)$. For each $j=1,\dots,(v-1)$, let
$$K_j = \#\{i:X_i=x_j\},\qquad k_j^+ = \#\left\{i:X_i=x_j,\ Y_iD_i=\tfrac12\right\},\qquad k_j^- = \#\left\{i:X_i=x_j,\ Y_iD_i=-\tfrac12\right\}.$$
The posterior for $b_j=1$ can be written as
$$\pi(b_j=1|Z_1,\dots,Z_n) = \begin{cases}\frac12 & \text{if } \#\{i:X_i=x_j,\ D_i=1\}=0,\\[1ex] \dfrac{\left(\frac12+\gamma\right)^{k_j^+}\left(\frac12-\gamma\right)^{k_j^-}}{\left(\frac12+\gamma\right)^{k_j^+}\left(\frac12-\gamma\right)^{k_j^-}+\left(\frac12-\gamma\right)^{k_j^+}\left(\frac12+\gamma\right)^{k_j^-}} & \text{otherwise.}\end{cases}$$
Hence,
$$\begin{aligned}
\min\left\{\pi(b_j=1|Z_1,\dots,Z_n),\,1-\pi(b_j=1|Z_1,\dots,Z_n)\right\} &= \frac{\min\left\{\left(\frac12+\gamma\right)^{k_j^+}\left(\frac12-\gamma\right)^{k_j^-},\ \left(\frac12-\gamma\right)^{k_j^+}\left(\frac12+\gamma\right)^{k_j^-}\right\}}{\left(\frac12+\gamma\right)^{k_j^+}\left(\frac12-\gamma\right)^{k_j^-}+\left(\frac12-\gamma\right)^{k_j^+}\left(\frac12+\gamma\right)^{k_j^-}}\\[1ex]
&= \frac{\min\left\{1,\left(\frac{\frac12+\gamma}{\frac12-\gamma}\right)^{k_j^+-k_j^-}\right\}}{1+\left(\frac{\frac12+\gamma}{\frac12-\gamma}\right)^{k_j^+-k_j^-}} = \frac{1}{1+a^{\left|k_j^+-k_j^-\right|}},\qquad\text{where } a = \frac{1+2\gamma}{1-2\gamma}>1. \tag{A.13}
\end{aligned}$$
Since $k_j^+-k_j^-=\sum_{i:X_i=x_j}2Y_iD_i$, plugging (A.13) into (A.12) yields
$$\gamma\sum_{j=1}^{v-1}p\,E_{\tilde{P}^n}\left[\frac{1}{1+a^{\left|\sum_{i:X_i=x_j}2Y_iD_i\right|}}\right] \ge \frac{\gamma}{2}\sum_{j=1}^{v-1}p\,E_{\tilde{P}^n}\left[a^{-\left|\sum_{i:X_i=x_j}2Y_iD_i\right|}\right] \ge \frac{\gamma}{2}\sum_{j=1}^{v-1}p\,a^{-E_{\tilde{P}^n}\left|\sum_{i:X_i=x_j}2Y_iD_i\right|},$$
where $E_{\tilde{P}^n}(\cdot)$ is the expectation with respect to the marginal likelihood of $\{(Y_{1,i},Y_{0,i},D_i,X_i),\ i=1,\dots,n\}$. The second line follows from $a>1$, and the third line follows from Jensen's inequality. Given our prior specification for $b$, the marginal distribution of $Y_{1,i}$ is $\Pr(Y_{1,i}=1/2)=\Pr(Y_{1,i}=-1/2)=1/2$, so
$$E_{\tilde{P}^n}\left|\sum_{i:X_i=x_j}2Y_iD_i\right| = E_{\tilde{P}^n}\left|\sum_{i:X_i=x_j,\,D_i=1}2Y_{1,i}\right| = \sum_{k=0}^{n}\binom{n}{k}\left(\frac{p}{2}\right)^k\left(1-\frac{p}{2}\right)^{n-k}E\left|B\left(k,\tfrac12\right)-\frac{k}{2}\right|$$
holds, where $B(k,\frac12)$ is a random variable following the binomial distribution with parameters $k$ and $\frac12$. By noting
$$E\left|B\left(k,\tfrac12\right)-\frac{k}{2}\right| \le \sqrt{E\left(B\left(k,\tfrac12\right)-\frac{k}{2}\right)^2} = \sqrt{\frac{k}{4}}\quad\left(\because\text{ Cauchy-Schwarz inequality}\right),$$
we obtain
$$E_{\tilde{P}^n}\left|\sum_{i:X_i=x_j}2Y_iD_i\right| \le \sum_{k=0}^{n}\binom{n}{k}\left(\frac{p}{2}\right)^k\left(1-\frac{p}{2}\right)^{n-k}\sqrt{\frac{k}{4}} = E\left[\sqrt{\frac{B\left(n,\frac{p}{2}\right)}{4}}\right] \le \sqrt{\frac{np}{8}}\quad\left(\because\text{ Jensen's inequality}\right).$$
Hence, the Bayes risk is bounded from below by
$$\frac{\gamma}{2}p(v-1)a^{-\sqrt{\frac{np}{8}}} \ge \frac{\gamma}{2}p(v-1)e^{-(a-1)\sqrt{\frac{np}{8}}}\quad\left(\because\ 1+x\le e^x\ \forall x\right) = \frac{p\gamma}{2}(v-1)e^{-\frac{4\gamma}{1-2\gamma}\sqrt{\frac{np}{8}}}. \tag{A.14}$$
This lower bound of the Bayes risk has the slowest convergence rate when $\gamma$ is set to be proportional to $n^{-1/2}$. Specifically, let $\gamma=\sqrt{\frac{v-1}{n}}$. Since $(v-1)^{-1}\ge p\ge v^{-1}$, we have
$$\frac{p\gamma}{2}(v-1)e^{-\frac{4\gamma}{1-2\gamma}\sqrt{\frac{np}{8}}} \ge \frac12\sqrt{\frac{v-1}{n}}\left(1-\frac1v\right)\exp\left\{-\frac{\sqrt{2}}{1-2\gamma}\right\} \ge \frac14\sqrt{\frac{v-1}{n}}\exp\left\{-2\sqrt{2}\right\},\quad\text{if } 1-2\gamma\ge\frac12.$$
The condition $1-2\gamma\ge\frac12$ is equivalent to $n\ge16(v-1)$. This completes the proof.
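The binomial deviation bound $E\left|B(k,\frac12)-\frac{k}{2}\right|\le\sqrt{k/4}$ used via Cauchy-Schwarz in this proof can be checked exactly from the binomial pmf. The following small sketch is ours, not the paper's:

```python
from math import comb, sqrt

def mean_abs_dev(k):
    """E|B(k, 1/2) - k/2| computed exactly from the binomial pmf."""
    return sum(comb(k, j) * 0.5 ** k * abs(j - k / 2) for j in range(k + 1))

# Cauchy-Schwarz bound from the proof: E|B - k/2| <= sqrt(k/4)
for k in range(1, 40):
    assert mean_abs_dev(k) <= sqrt(k / 4) + 1e-12
print(mean_abs_dev(16), sqrt(16 / 4))
```

The exact mean absolute deviation is close to $\sqrt{2/\pi}\cdot\sqrt{k/4}$ for large $k$, so the Cauchy-Schwarz bound loses only a constant factor, which is immaterial for the rate.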
A.3 Proofs of Theorems 2.3 and 2.4
In proving Theorem 2.3, it is convenient to work with the normalized welfare difference,
$$d(G,G') = \frac{\kappa}{M}\left[W(G)-W(G')\right],$$
and its sample analogue,
$$d_n(G,G') = \frac{\kappa}{M}\left[W_n(G)-W_n(G')\right]. \tag{A.15}$$
By Assumption 2.1 (BO) and (SO), both $d(G,G')$ and $d_n(G,G')$ are bounded in $[-1,1]$, and the normalized welfare difference relates to the original welfare loss of decision set $G$ as
$$d(G_{FB}^*,G) = \frac{\kappa}{M}\left[W(G_{FB}^*)-W(G)\right]\in[0,1]. \tag{A.16}$$
Hence, the welfare loss upper bound of $\hat{G}_{EWM}$ can be obtained by multiplying the upper bound of $d(G_{FB}^*,\hat{G}_{EWM})$ by $M/\kappa$.
Note that $d(G_{FB}^*,G)$ can be bounded from above by $P_X(G_{FB}^*\triangle G)$, since
$$d(G_{FB}^*,G) = \frac{\kappa}{M}\int_{G_{FB}^*\triangle G}|\tau(X)|\,dP_X \le \kappa P_X(G_{FB}^*\triangle G) \le P_X(G_{FB}^*\triangle G). \tag{A.17}$$
On the other hand, with Assumption 2.2 (MA) imposed, $P_X(G_{FB}^*\triangle G)$ can be bounded from above by a function of $d(G_{FB}^*,G)$, as the next lemma shows. We borrow this lemma from Tsybakov (2004).

Lemma A.7. Suppose Assumption 2.2 (MA) holds with margin coefficient $\alpha\in(0,\infty)$. Then
$$P_X(G_{FB}^*\triangle G) \le c_1(M,\kappa,\eta,\alpha)\,d(G_{FB}^*,G)^{\frac{\alpha}{1+\alpha}}$$
holds for all $G\in\mathcal{G}$, where $c_1(M,\kappa,\eta,\alpha)=\left(\frac{M}{\kappa\eta\alpha}\right)^{\frac{\alpha}{1+\alpha}}(1+\alpha)$.
Proof. Let $A=\{x:|\tau(x)|>t\}$ and consider the following inequalities:
$$W(G_{FB}^*)-W(G) = \int_{G_{FB}^*\triangle G}|\tau(x)|\,dP_X \ge \int_{G_{FB}^*\triangle G}|\tau(x)|\,1\{x\in A\}\,dP_X \ge t\,P_X\left((G_{FB}^*\triangle G)\cap A\right) \ge t\left[P_X(G_{FB}^*\triangle G)-P_X(A^c)\right] \ge t\left[P_X(G_{FB}^*\triangle G)-\left(\frac{t}{\eta}\right)^\alpha\right],$$
where the final line uses the margin condition. The right-hand side is maximized at $t=\eta(1+\alpha)^{-\frac1\alpha}\left[P_X(G_{FB}^*\triangle G)\right]^{\frac1\alpha}\le\eta$, so it holds that
$$W(G_{FB}^*)-W(G) \ge \eta\alpha\left(\frac{1}{1+\alpha}\right)^{\frac{1+\alpha}{\alpha}}\left[P_X(G_{FB}^*\triangle G)\right]^{\frac{1+\alpha}{\alpha}}.$$
This, in turn, implies
$$P_X(G_{FB}^*\triangle G) \le \left(\frac{M}{\kappa\eta\alpha}\right)^{\frac{\alpha}{1+\alpha}}(1+\alpha)\,d(G_{FB}^*,G)^{\frac{\alpha}{1+\alpha}}.$$

Proof of Theorem 2.3. Let $a=\sqrt{kt}\,\epsilon_n$ with $k\ge1$, $t\ge1$, and $\epsilon_n>0$, where $t\ge1$ is arbitrary, $k$ is a constant that we choose later, and $\epsilon_n$ is a sequence indexed by the sample size $n$ whose proper choice will be discussed in a later step. The normalized welfare loss can be bounded by
$$d(G_{FB}^*,\hat{G}_{EWM}) \le d(G_{FB}^*,\hat{G}_{EWM})-d_n\left(G_{FB}^*,\hat{G}_{EWM}\right),$$
as $d_n\left(G_{FB}^*,\hat{G}_{EWM}\right)\le0$ by Assumption 2.2 (FB). Define a class of functions induced by $G\in\mathcal{G}$:
$$\mathcal{H} \equiv \{h(Z_i;G):G\in\mathcal{G}\},\qquad h(Z_i;G) \equiv \frac{\kappa}{M}\left[\frac{Y_iD_i}{e(X_i)}-\frac{Y_i(1-D_i)}{1-e(X_i)}\right]\left[1\{X_i\in G\}-1\{X_i\in G_{FB}^*\}\right].$$
By Assumption 2.1 (VC) and Lemma A.1, $\mathcal{H}$ is a VC-subgraph class with VC-dimension at most $v<\infty$ and envelope $\bar{H}=1$. Using $h(Z_i;G)$, we can write $d(G_{FB}^*,G)=-E_P(h(Z_i;G))$. Since $d(G_{FB}^*,G)\ge0$ for all $G\in\mathcal{G}$, it holds that $-E_P(h)\ge0$ for all $h\in\mathcal{H}$.
Since we have
$$d(G_{FB}^*,\hat{G}_{EWM})-d_n\left(G_{FB}^*,\hat{G}_{EWM}\right) = E_n\left(h(Z_i;\hat{G}_{EWM})\right)-E_P\left(h(Z_i;\hat{G}_{EWM})\right)$$
and $d_n\left(G_{FB}^*,\hat{G}_{EWM}\right)\le0$, the normalized welfare loss can be bounded by
$$d(G_{FB}^*,\hat{G}_{EWM}) \le E_n\left(h(Z_i;\hat{G}_{EWM})\right)-E_P\left(h(Z_i;\hat{G}_{EWM})\right) \le V_a\left[d(G_{FB}^*,\hat{G}_{EWM})+a^2\right],$$
where
$$V_a = \sup_{h\in\mathcal{H}}\frac{E_n(h)-E_P(h)}{-E_P(h)+a^2} = \sup_{h\in\mathcal{H}}\left[E_n\left(\frac{h}{-E_P(h)+a^2}\right)-E_P\left(\frac{h}{-E_P(h)+a^2}\right)\right].$$
On the event $\left\{V_a<\frac12\right\}$, $d(G_{FB}^*,\hat{G}_{EWM})\le a^2$ holds, so this implies
$$P^n\left(d(G_{FB}^*,\hat{G}_{EWM})\ge a^2\right) \le P^n\left(V_a\ge\frac12\right). \tag{A.18}$$
In what follows, our aim is to construct an exponential inequality for $P^n\left(V_a\ge\frac12\right)$ involving only $t$, and we make use of this exponential tail bound to bound $E_{P^n}\left[d(G_{FB}^*,\hat{G}_{EWM})\right]$.
To apply Bousquet's inequality (Lemma A.6) to $V_a$, note first that
$$\begin{aligned}
E_P\left[\left(\frac{h}{-E_P(h)+a^2}\right)^2\right] &\le \frac{P_X(G_{FB}^*\triangle G)}{\left(-E_P(h)+a^2\right)^2} \le c_1\frac{\left[-E_P(h)\right]^{\frac{\alpha}{1+\alpha}}}{\left(-E_P(h)+a^2\right)^2}\quad\left(\because\text{ Lemma A.7 and }d(G_{FB}^*,G)=-E_P(h(Z_i;G))\right)\\
&\le c_1\sup_{\epsilon\ge0}\frac{\epsilon^{\frac{2\alpha}{1+\alpha}}}{\left(\epsilon^2+a^2\right)^2} \le \frac{c_1}{a^2}\sup_{\epsilon\ge0}\frac{\epsilon^{\frac{2\alpha}{1+\alpha}}}{\epsilon^2+a^2} \le \frac{c_1}{a^2}\sup_{\epsilon\ge0}\left(\frac{1}{\epsilon\vee a}\right)^{2-\frac{2\alpha}{1+\alpha}} \le c_1\frac{a^{\frac{2\alpha}{1+\alpha}}}{a^4},
\end{aligned}$$
where $c_1$ is the constant, depending only on $(M,\kappa,\eta,\alpha)$, defined in Lemma A.7. We, on the other hand, have
$$\sup_{h\in\mathcal{H}}\sup_{Z}\left|\frac{h}{-E_P(h)+a^2}\right| \le \frac{1}{a^2}.$$
Hence, Lemma A.6 gives, with probability larger than $1-\exp(-t)$,
$$V_a \le E_{P^n}(V_a)+\sqrt{\frac{2t\left[c_1a^{\frac{2\alpha}{1+\alpha}-2}+4E_{P^n}(V_a)\right]}{na^2}}+\frac{2t}{3na^2}. \tag{A.19}$$
Next, we derive an upper bound on $E_{P^n}(V_a)$ by applying the maximal inequality of Lemma A.5. Let $r>1$ be arbitrary and consider partitioning $\mathcal{H}$ into $\mathcal{H}_0,\mathcal{H}_1,\dots$, where $\mathcal{H}_0=\left\{h\in\mathcal{H}:-E_P(h)\le a^2\right\}$ and $\mathcal{H}_j=\left\{h\in\mathcal{H}:r^{2(j-1)}a^2<-E_P(h)\le r^{2j}a^2\right\}$, $j=1,2,\dots$. Then,
$$\begin{aligned}
V_a &\le \sup_{h\in\mathcal{H}_0}\frac{E_n(h)-E_P(h)}{-E_P(h)+a^2}+\sum_{j\ge1}\sup_{h\in\mathcal{H}_j}\frac{E_n(h)-E_P(h)}{-E_P(h)+a^2}\\
&\le \frac{1}{a^2}\left[\sup_{h\in\mathcal{H}_0}\left(E_n(h)-E_P(h)\right)+\sum_{j\ge1}\left(1+r^{2(j-1)}\right)^{-1}\sup_{h\in\mathcal{H}_j}\left(E_n(h)-E_P(h)\right)\right]\\
&\le \frac{1}{a^2}\left[\sup_{-E_P(h)\le a^2}\left(E_n(h)-E_P(h)\right)+\sum_{j\ge1}\left(1+r^{2(j-1)}\right)^{-1}\sup_{-E_P(h)\le r^{2j}a^2}\left(E_n(h)-E_P(h)\right)\right]. \tag{A.20}
\end{aligned}$$
Since it holds that $\|h\|_{L_2(P)}^2\le P_X(G_{FB}^*\triangle G)\le c_1\left[-E_P(h)\right]^{\frac{\alpha}{1+\alpha}}$, where the latter inequality follows from Lemma A.7, $-E_P(h)\le r^{2j}a^2$ implies $\|h\|_{L_2(P)}\le c_1^{1/2}r^{\frac{\alpha}{1+\alpha}j}a^{\frac{\alpha}{1+\alpha}}$. Hence, (A.20) can be further bounded by
$$V_a \le \frac{1}{a^2}\left[\sup_{\|h\|_{L_2(P)}\le c_1^{1/2}a^{\frac{\alpha}{1+\alpha}}}\left(E_n(h)-E_P(h)\right)+\sum_{j\ge1}\left(1+r^{2(j-1)}\right)^{-1}\sup_{\|h\|_{L_2(P)}\le c_1^{1/2}r^{\frac{\alpha}{1+\alpha}j}a^{\frac{\alpha}{1+\alpha}}}\left(E_n(h)-E_P(h)\right)\right].$$
We apply Lemma A.5 to each supremum term and obtain
$$E_{P^n}(V_a) \le C_2\frac{c_1^{1/2}}{a^2}\sqrt{\frac{v}{n}}\,a^{\frac{\alpha}{1+\alpha}}\sum_{j\ge0}\frac{r^{\frac{\alpha}{1+\alpha}j}}{1+r^{2(j-1)}} \le c_2\sqrt{\frac{v}{n}}\,a^{\frac{\alpha}{1+\alpha}-2}$$
for
$$n \ge \frac{C_1v}{c_1a^{\frac{2\alpha}{1+\alpha}}} \iff a \ge \left(\frac{C_1}{c_1}\right)^{\frac{1+\alpha}{2\alpha}}\left(\frac{v}{n}\right)^{\frac{1+\alpha}{2\alpha}}, \tag{A.21}$$
where $C_1$ and $C_2$ are the universal constants defined in Lemmas A.4 and A.5, and $c_2=\left[C_2c_1^{1/2}\sum_{j\ge0}\frac{r^{\frac{\alpha}{1+\alpha}j}}{1+r^{2(j-1)}}\right]\vee1$; the series converges since $r>1$ and $\frac{\alpha}{1+\alpha}<2$, so $c_2$ is a constant greater than or equal to one that depends only on $(M,\kappa,\eta,\alpha)$, as $r>1$ is fixed. We
plug this upper bound into (A.19) to obtain
$$V_a \le c_2\sqrt{\frac{v}{n}}\,a^{\frac{\alpha}{1+\alpha}-2}+\sqrt{\frac{2t\left[c_1a^{\frac{2\alpha}{1+\alpha}-2}+4c_2\sqrt{\frac{v}{n}}\,a^{\frac{\alpha}{1+\alpha}-2}\right]}{na^2}}+\frac{2t}{3na^2}. \tag{A.22}$$
Choose $\epsilon_n$ as the root of $c_2\sqrt{\frac{v}{n}}\,\epsilon_n^{\frac{\alpha}{1+\alpha}-2}=1$, i.e.,
$$\epsilon_n = \left(c_2\sqrt{\frac{v}{n}}\right)^{\frac{1+\alpha}{2+\alpha}}. \tag{A.23}$$
Note that the right-hand side of (A.22) is decreasing in $a$, and $a\ge\epsilon_n$ by construction. Hence, if $\epsilon_n$ satisfies inequality (A.21), i.e.,
$$n \ge c_2^{-\alpha}\left(\frac{C_1}{c_1}\right)^{1+\frac{\alpha}{2}}v,$$
which can be reduced to an innocuous restriction $n\ge1$ by inflating $c_1$, if necessary, we can substitute $\epsilon_n$ for $a$ to bound the right-hand side of (A.22). In particular, by noting
$$c_2\sqrt{\frac{v}{n}}\,a^{\frac{\alpha}{1+\alpha}-2} = \left(\frac{\epsilon_n}{a}\right)^{\frac{2+\alpha}{1+\alpha}} \le \frac{\epsilon_n}{a} = \frac{1}{\sqrt{kt}} \le \frac{1}{\sqrt{k}}\qquad\text{and}\qquad a^{\frac{2\alpha}{1+\alpha}-2} \le \epsilon_n^{\frac{2\alpha}{1+\alpha}-4}\epsilon_n^2 = c_2^{-2}\left(\frac{v}{n}\right)^{-1}\epsilon_n^2,$$
the right-hand side of (A.22) can be bounded by
$$\begin{aligned}
V_a &\le \frac{1}{\sqrt{k}}+\sqrt{\frac{2\left(c_1c_2^{-2}v^{-1}n\epsilon_n^2+4\right)}{nk\epsilon_n^2}}+\frac{2}{3nk\epsilon_n^2}\\
&= \frac{1}{\sqrt{k}}+\sqrt{\frac{2c_1c_2^{-2}v^{-1}}{k}+\frac{8}{nk\epsilon_n^2}}+\frac{2}{3nk\epsilon_n^2}\\
&\le \frac{1}{\sqrt{k}}+\sqrt{\frac{2c_1c_2^{-2}+8}{k}}+\frac{2}{3k}\qquad\text{for } n\epsilon_n^2\ge1. \tag{A.24}
\end{aligned}$$
Note that the condition $n\epsilon_n^2\ge1$ used to derive the last line is valid for all $n$, since it is equivalent to $n\ge c_2^{-2(1+\alpha)}v^{-(1+\alpha)}$, which holds for all $n\ge1$ because $c_2\ge1$ and $v\ge1$. By choosing $k$ large enough so that the right-hand side of (A.24) is less than $\frac12$, we can conclude
$$P^n\left(V_a<\frac12\right) \ge 1-\exp(-t). \tag{A.25}$$
Hence, (A.18) yields
$$P^n\left(d(G_{FB}^*,\hat{G}_{EWM})\ge kt\epsilon_n^2\right) \le \exp(-t)$$
for all $t\ge1$. From this exponential bound, we obtain
$$\begin{aligned}
E_{P^n}\left[d(G_{FB}^*,\hat{G}_{EWM})\right] &= \int_0^\infty P^n\left(d(G_{FB}^*,\hat{G}_{EWM})>t'\right)dt'\\
&\le \int_0^{k\epsilon_n^2}P^n\left(d(G_{FB}^*,\hat{G}_{EWM})\ge t'\right)dt'+\int_{k\epsilon_n^2}^\infty P^n\left(d(G_{FB}^*,\hat{G}_{EWM})\ge t'\right)dt'\\
&\le k\epsilon_n^2+k\epsilon_n^2e^{-1} = (1+e^{-1})k\,c_2^{\frac{2(1+\alpha)}{2+\alpha}}\left(\frac{v}{n}\right)^{\frac{1+\alpha}{2+\alpha}}.
\end{aligned}$$
So, setting $c=\frac{M}{\kappa}(1+e^{-1})k\,c_2^{\frac{2(1+\alpha)}{2+\alpha}}$ leads to the conclusion.
Proof of Theorem 2.4. As in the proof of Theorem 2.2, we work with the normalized outcome support, $Y_{1i},Y_{0i}\in\left[-\frac12,\frac12\right]$. With the normalized outcome, the constant $\eta$ of the margin assumption satisfies $\eta\le1$. Let $\alpha\in(0,\infty)$ and $\eta\in(0,1]$ be given. Similarly to the proof of Theorem 2.2, we consider constructing a suitable subclass $\mathcal{P}^*\subset\mathcal{P}(1,\kappa,\eta,\alpha)$. Let $x_1,\dots,x_v\in\mathcal{X}$ be $v$ points that are shattered by $\mathcal{G}$, and let $\gamma$ be a positive number satisfying $\gamma\le\min\left\{\eta,\frac12\right\}$, whose proper choice will be given later. We fix the marginal distribution of $X$ at the one supported only on $(x_1,\dots,x_v)$ and having the probability mass function
$$P_X(X_i=x_j) = \frac{1}{v-1}\left(\frac{\gamma}{\eta}\right)^\alpha\ \text{ for } j=1,\dots,(v-1),\qquad P_X(X_i=x_v) = 1-\left(\frac{\gamma}{\eta}\right)^\alpha.$$
The marginal distribution of $X$ thus constructed is common to all members of $\mathcal{P}^*$. As in the proof of Theorem 2.2, we specify $D$ to be independent of $(Y_1,Y_0,X)$ and to follow the Bernoulli distribution with $\Pr(D=1)=1/2$. Let $b=(b_1,\dots,b_{v-1})\in\{0,1\}^{v-1}$ be a binary vector that uniquely indexes a member of $\mathcal{P}^*$ and, accordingly, write $\mathcal{P}^*=\left\{P_b:b\in\{0,1\}^{v-1}\right\}$. For each $j=1,\dots,(v-1)$, we specify the conditional distribution of $Y_1$ given $X=x_j$ to be (A.10) if $b_j=1$ and (A.11) if $b_j=0$. For $j=v$, the conditional distribution of $Y_1$ given $X=x_v$ is degenerate at $Y_1=\frac12$. As for the conditional distribution of $Y_0$ given $X=x_j$, we consider the degenerate distribution at $Y_0=0$ for $j=1,\dots,(v-1)$, and the degenerate distribution at $Y_0=-\frac12$ for $X=x_v$. In this specification of
$\mathcal{P}^*$, it holds that
$$P_X(|\tau(X)|\le t) = \begin{cases}0 & \text{for } t\in[0,\gamma),\\ \left(\frac{\gamma}{\eta}\right)^\alpha & \text{for } t\in[\gamma,\eta),\\ 1 & \text{for } t\in[\eta,1],\end{cases}$$
for every $P_b\in\mathcal{P}^*$. Furthermore, by the construction of the support points, for every $P_b\in\mathcal{P}^*$, the first-best decision rule is contained in $\mathcal{G}$. Hence, it holds that $\mathcal{P}^*\subset\mathcal{P}_{FB}(1,\kappa,\eta,\alpha)$.
Let $\pi(b)$ be a prior distribution for $b$ such that $b_1,\dots,b_{v-1}$ are iid and $b_1\sim\mathrm{Ber}(1/2)$. By following the same line of reasoning as used in obtaining (A.12), for an arbitrary estimated treatment choice rule $\hat{G}$, we obtain
$$\sup_{P\in\mathcal{P}(1,\kappa,\eta,\alpha)}E_{P^n}\left[W(G^*)-W(\hat{G})\right] \ge \frac{\gamma}{v-1}\left(\frac{\gamma}{\eta}\right)^\alpha\int_{Z_1,\dots,Z_n}\sum_{j=1}^{v-1}\left[\min\left\{\pi(b_j=1|Z_1,\dots,Z_n),\,1-\pi(b_j=1|Z_1,\dots,Z_n)\right\}\right]d\tilde{P}^n.$$
Furthermore, by repeating the same bounding arguments as in the proof of Theorem 2.2, this Bayes risk can be bounded from below by (A.14) with $p=\frac{1}{v-1}\left(\frac{\gamma}{\eta}\right)^\alpha$:
$$\sup_{P\in\mathcal{P}(1,\kappa,\eta,\alpha)}E_{P^n}\left[W(G^*)-W(\hat{G})\right] \ge \frac{\gamma}{2}\left(\frac{\gamma}{\eta}\right)^\alpha\exp\left\{-\frac{4\gamma}{1-2\gamma}\sqrt{\frac{n}{8(v-1)}\left(\frac{\gamma}{\eta}\right)^\alpha}\right\}.$$
The slowest convergence rate of this lower bound is obtained by tuning $\gamma$ to converge at the rate $n^{-\frac{1}{2+\alpha}}$. In particular, by choosing $\gamma=\eta^{\frac{\alpha}{2+\alpha}}\left(\frac{v-1}{n}\right)^{\frac{1}{2+\alpha}}$ and assuming $\gamma\le\frac14$, the exponential term can be bounded from below by $\exp\left\{-2\sqrt{2}\right\}$, so we obtain the following lower bound:
$$\frac12\,\eta^{-\frac{\alpha}{2+\alpha}}(v-1)^{\frac{1+\alpha}{2+\alpha}}n^{-\frac{1+\alpha}{2+\alpha}}\exp\left\{-2\sqrt{2}\right\}. \tag{A.26}$$
Recall that $\gamma$ is constrained to $\gamma\le\min\left\{\eta,\frac14\right\}$. This implies that the obtained bound is valid for
$$n \ge \max\left\{\eta^{-1},4\right\}^{2+\alpha}\eta^\alpha(v-1),$$
whose stricter but simpler form is given by
$$n \ge \max\left\{\eta^{-2},4^{2+\alpha}\right\}(v-1). \tag{A.27}$$
The lower bound presented in this theorem follows by denormalizing the outcomes, i.e., multiplying (A.26) by $M$ and substituting $\eta/M$ for $\eta$ in (A.26) and (A.27).
A.4 Proof of Theorem 2.6
The next lemma gives a linearized solution of a certain polynomial inequality. We owe this lemma to
Shin Kanaya (2014, personal communication). The technique of applying the mean value expansion
to an implicit function defined as the root of a polynomial equation has been used in the context
of bandwidth choice in Kanaya and Kristensen (2014).
Lemma A.8. Let $A\ge0$, $B\ge0$, and $X\ge0$. For any $\alpha\ge0$, $X\le AX^{\frac{\alpha}{1+\alpha}}+B$ implies $X\le A^{1+\alpha}+(1+\alpha)B$.
Proof. When $A=B=0$, the conclusion trivially holds. When $B>0$, $X=AX^{\frac{\alpha}{1+\alpha}}+B$ has a unique root, which we denote by $X^*=g(A,B)$. When $A>0$ and $B=0$, we mean by $g(A,0)$ the nonzero root of $X=AX^{\frac{\alpha}{1+\alpha}}$. Let $f(X,A,B)=X-AX^{\frac{\alpha}{1+\alpha}}-B$. By the form of the inequality, the original inequality can be equivalently written as $X\le X^*=g(A,B)$, so we aim to verify that $X^*$ is bounded from above by $A^{1+\alpha}+(1+\alpha)B$. Consider the mean value expansion of $g(A,B)$ in $B$ at $B=0$:
$$X^* = g(A,0)+\frac{\partial g\left(A,\tilde{B}\right)}{\partial B}\times B\quad\text{for some } 0\le\tilde{B}\le B.$$
Note that $g(A,0)=A^{1+\alpha}$. In addition, by the implicit function theorem, we have, with $\tilde{X}=g(A,\tilde{B})$,
$$\frac{\partial g\left(A,\tilde{B}\right)}{\partial B} = -\frac{\frac{\partial f}{\partial B}(\tilde{X},A,\tilde{B})}{\frac{\partial f}{\partial X}(\tilde{X},A,\tilde{B})} = \frac{1}{1-\frac{\alpha}{1+\alpha}A\tilde{X}^{-\frac{1}{1+\alpha}}} = \frac{\tilde{X}}{\tilde{X}-\frac{\alpha}{1+\alpha}A\tilde{X}^{\frac{\alpha}{1+\alpha}}} = \frac{\tilde{X}}{\frac{\tilde{X}}{1+\alpha}+\frac{\alpha}{1+\alpha}\tilde{B}} \le 1+\alpha.$$
Hence, $X^*\le A^{1+\alpha}+(1+\alpha)B$ holds.
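Lemma A.8 can also be sanity-checked numerically: for any $X\ge0$ satisfying the premise $X\le AX^{\alpha/(1+\alpha)}+B$, the linearized bound must hold. A brute-force grid check (illustrative code, not from the paper):

```python
import random

def check_lemma_a8(A, B, alpha, grid=2000):
    """For every X >= 0 on a grid satisfying X <= A*X**(alpha/(1+alpha)) + B,
    verify the linearized bound X <= A**(1+alpha) + (1+alpha)*B of Lemma A.8."""
    p = alpha / (1 + alpha)
    bound = A ** (1 + alpha) + (1 + alpha) * B
    hi = 10 * (bound + 1)
    for i in range(grid + 1):
        x = hi * i / grid
        if x <= A * x ** p + B:          # x satisfies the premise
            assert x <= bound + 1e-9
    return True

random.seed(1)
for _ in range(100):
    check_lemma_a8(random.uniform(0, 3), random.uniform(0, 3), random.uniform(0.1, 5))
print("Lemma A.8 bound verified on random instances")
```

For example, with $A=2$, $B=1$, $\alpha=1$, the premise set is $[0,3+2\sqrt{2}]$ while the bound is $A^2+2B=6$, so the linearization loses only a small slack.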
The next lemma provides an exponential tail probability bound of the supremum of the centered
empirical processes. This lemma follows from Theorem 2.14.9 in van der Vaart and Wellner (1996)
combined with their Theorem 2.6.4.
Lemma A.9. Assume $\mathcal{G}$ is a VC-class of subsets of $\mathcal{X}$ with VC-dimension $v<\infty$. Let $P_{X,n}(\cdot)$ be the empirical probability distribution on $\mathcal{X}$ constructed upon $(X_1,\dots,X_n)$ generated iid from $P_X(\cdot)$. Then,
$$P^n\left(\sup_{G\in\mathcal{G}}\left|P_{X,n}(G)-P_X(G)\right|>t\right) \le \left(\frac{C_4t\sqrt{n}}{\sqrt{2v}}\right)^{2v}\exp\left(-2nt^2\right)$$
holds for every $t>0$, where $C_4$ is a universal constant.
Proof of Theorem 2.6. We first consider the m-hybrid case. Set $\tilde{G}=G_{FB}^*$ in (2.4) and rewrite (2.4) in terms of the normalized welfare loss for $\hat{G}_{m\text{-}hybrid}$:
$$\begin{aligned}
d(G_{FB}^*,\hat{G}_{m\text{-}hybrid}) &\le \frac{\kappa}{M}\left[W_n^\tau(G_{FB}^*)-\hat{W}_n^\tau(G_{FB}^*)-W_n^\tau(\hat{G}_{m\text{-}hybrid})+\hat{W}_n^\tau\left(\hat{G}_{m\text{-}hybrid}\right)\right]+d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})-d_n^\tau\left(G_{FB}^*,\hat{G}_{m\text{-}hybrid}\right)\\
&\le \frac{1}{n}\sum_{i=1}^{n}\frac{\kappa}{M}\left[\tau(X_i)-\hat{\tau}^m(X_i)\right]\left[1\{X_i\in G_{FB}^*\}-1\left\{X_i\in\hat{G}_{m\text{-}hybrid}\right\}\right]+d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})-d_n^\tau\left(G_{FB}^*,\hat{G}_{m\text{-}hybrid}\right)\\
&\le \rho_n+d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})-d_n^\tau\left(G_{FB}^*,\hat{G}_{m\text{-}hybrid}\right), \tag{A.28}
\end{aligned}$$
where $d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})$ is as defined in equation (A.16) in Appendix A.3, $d_n^\tau\left(G_{FB}^*,\hat{G}_{m\text{-}hybrid}\right)=\frac{\kappa}{M}\left[W_n^\tau(G_{FB}^*)-W_n^\tau(\hat{G}_{m\text{-}hybrid})\right]$,
$$\rho_n \equiv \frac{\kappa}{M}\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\,P_{X,n}\left(G_{FB}^*\triangle\hat{G}_{m\text{-}hybrid}\right),$$
and $P_{X,n}$ is the empirical distribution on $\mathcal{X}$ constructed upon $(X_1,\dots,X_n)$. Similarly to the proof of Theorem 2.3, define a class of functions generated by $G\in\mathcal{G}$:
$$\mathcal{H}^\tau \equiv \{h(Z_i;G):G\in\mathcal{G}\},\qquad h(Z_i;G) \equiv \frac{\kappa}{M}\tau(X_i)\cdot\left[1\{X_i\in G\}-1\{X_i\in G_{FB}^*\}\right],$$
which is a VC-subgraph class with VC-dimension at most $v$ and envelope $\bar{H}=1$ by Lemma A.1. Let $a=\sqrt{kt}\,\epsilon_n$ be as defined in the proof of Theorem 2.3 and $V_a^\tau\equiv\sup_{h\in\mathcal{H}^\tau}\frac{E_n(h)-E_P(h)}{-E_P(h)+a^2}$.
By noting
$$d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})-d_n^\tau\left(G_{FB}^*,\hat{G}_{m\text{-}hybrid}\right) \le V_a^\tau\left[d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})+a^2\right],$$
inequality (A.28) implies
$$d(G_{FB}^*,\hat{G}_{m\text{-}hybrid}) \le \rho_n+V_a^\tau\left[d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})+a^2\right]. \tag{A.29}$$
Denote by $\Omega_t$ the event $\left\{V_a^\tau<\frac12\right\}$, which is equivalent to the event $\left\{d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})\le2\rho_n+k\epsilon_n^2t\right\}$. Along the same line of argument that leads to (A.25) in the proof of Theorem 2.3, we obtain, for $t\ge1$,
$$P^n(\Omega_t) = P^n\left(d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})\le2\rho_n+k\epsilon_n^2t\right) \ge 1-\exp(-t), \tag{A.30}$$
where $\epsilon_n$ is given in (A.23). We bound $\rho_n$ from above by
$$\rho_n \le \frac{\kappa}{M}\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\,P_X\left(G_{FB}^*\triangle\hat{G}_{m\text{-}hybrid}\right)+V_{0,n}\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|,$$
where
$$V_{0,n} = \sup_{G\in\mathcal{G}}\left|P_{X,n}(G_{FB}^*\triangle G)-P_X(G_{FB}^*\triangle G)\right|.$$
Let $\lambda>0$ be a constant to be chosen properly later. Define the events
$$\Lambda_1 = \left\{V_{0,n}\le n^{-\lambda}\right\},\qquad \Lambda_2 = \left\{P_X\left(G_{FB}^*\triangle\hat{G}_{m\text{-}hybrid}\right)\ge n^{-\lambda}\right\}.$$
Then, on $\Lambda_1\cap\Lambda_2$, it holds that $V_{0,n}\le P_X\left(G_{FB}^*\triangle\hat{G}_{m\text{-}hybrid}\right)$. Therefore, on $\Lambda_1\cap\Lambda_2\cap\Omega_t$, $d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})$ can be bounded by
$$\begin{aligned}
d(G_{FB}^*,\hat{G}_{m\text{-}hybrid}) &\le 4\frac{\kappa}{M}\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\,P_X\left(G_{FB}^*\triangle\hat{G}_{m\text{-}hybrid}\right)+k\epsilon_n^2t\\
&\le 4c_1\frac{\kappa}{M}\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\,d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})^{\frac{\alpha}{1+\alpha}}+k\epsilon_n^2t,
\end{aligned}$$
where the second line follows from Lemma A.7 with the same definition of $c_1$ given there. By Lemma A.8 and substituting (A.23) for $\epsilon_n$, we obtain, on the event $\Lambda_1\cap\Lambda_2\cap\Omega_t$,
$$d(G_{FB}^*,\hat{G}_{m\text{-}hybrid}) \le c_6\left(\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\right)^{1+\alpha}+c_7\left(\frac{v}{n}\right)^{\frac{1+\alpha}{2+\alpha}}t, \tag{A.31}$$
where the constants $c_6$ and $c_7$ depend only on $(M,\kappa,\eta,\alpha)$.
Using the upper bound derived in (A.31), we obtain, for $t\ge1$,
$$\begin{aligned}
E_{P^n}\left[d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})\right] &= E_{P^n}\left[d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})1\{\Lambda_1\cap\Lambda_2\cap\Omega_t\}\right]+E_{P^n}\left[d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})1\{\Lambda_1^c\cup\Lambda_2^c\cup\Omega_t^c\}\right]\\
&\le c_6E_{P^n}\left[\left(\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\right)^{1+\alpha}\right]+c_7\left(\frac{v}{n}\right)^{\frac{1+\alpha}{2+\alpha}}t+P^n(\Lambda_1^c)+E_{P^n}\left[d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})1\{\Lambda_2^c\}\right]+P^n(\Omega_t^c)\\
&\le \underbrace{c_6\psi_n^{-(1+\alpha)}E_{P^n}\left[\left(\psi_n\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\right)^{1+\alpha}\right]}_{A_{1,n}}+\underbrace{c_7\left(\frac{v}{n}\right)^{\frac{1+\alpha}{2+\alpha}}t}_{A_{2,n}}+\underbrace{\frac{C_4^{2v}}{(2v)^v}n^{-2v\left(\lambda-\frac12\right)}\exp\left(-2n^{1-2\lambda}\right)}_{A_{3,n}}+\underbrace{n^{-\lambda}}_{A_{4,n}}+\underbrace{\exp(-t)}_{A_{5,n}},
\end{aligned}$$
where $\psi_n$ is the sequence specified in Condition 2.7. In these inequalities, the second line uses (A.31) and $d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})\le1$. In the third line, $A_{3,n}$ follows from Lemma A.9, $A_{4,n}$ follows from $d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})\le P_X\left(G_{FB}^*\triangle\hat{G}_{m\text{-}hybrid}\right)$ and $P_X\left(G_{FB}^*\triangle\hat{G}_{m\text{-}hybrid}\right)<n^{-\lambda}$ on $\Lambda_2^c$, and $A_{5,n}$ follows from (A.30).
We now discuss the convergence rates of $A_{j,n}$, $j=1,\dots,5$, individually, with suitable choices of $t$ and $\lambda$. Condition (2.9) assumed in this theorem implies
$$\sup_{P\in\mathcal{P}_m}E_{P^n}\left[\left(\psi_n\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\right)^{1+\alpha}\right] \le \sup_{P\in\mathcal{P}_m}\left\{E_{P^n}\left[\left(\psi_n\max_{1\le i\le n}\left|\hat{\tau}^m(X_i)-\tau(X_i)\right|\right)^2\right]\right\}^{\frac{1+\alpha}{2}} = O(1),$$
where the inequality follows from Jensen's inequality. Hence, $A_{1,n}$ satisfies $\sup_{P\in\mathcal{P}_m}A_{1,n}=O\left(\psi_n^{-(1+\alpha)}\right)$. By setting $t=(1+\alpha)\log\psi_n$, we can make the convergence rate of $A_{5,n}$ equal to that of $A_{1,n}$. At the same time, by choosing $\lambda>\frac{1+\alpha}{2+\alpha}\ge\frac12$, we can make $A_{3,n}$ and $A_{4,n}$ converge faster than $A_{2,n}$. Hence, the uniform convergence rate of $E_{P^n}\left[d(G_{FB}^*,\hat{G}_{m\text{-}hybrid})\right]$ over $P\in\mathcal{P}_m\cap\mathcal{P}_{FB}(M,\kappa,\eta,\alpha)$
is bounded by the convergence rates of $A_{1,n}$ and $A_{2,n}$:
$$\sup_{P\in\mathcal{P}_m}A_{1,n}\ \vee \sup_{P\in\mathcal{P}_{FB}(M,\kappa,\eta,\alpha)}A_{2,n} = O\left(\left(\psi_n^{-(1+\alpha)}\vee\left(\frac{v}{n}\right)^{\frac{1+\alpha}{2+\alpha}}\right)\log\psi_n\right).$$
This completes the proof for the m-hybrid case.
A proof for the e-hybrid case follows almost identically to that of the m-hybrid case. The differences are that $\rho_n$ in inequality (A.28) is given by
$$\rho_n = \frac{\kappa}{M}\max_{1\le i\le n}\left|\hat{\tau}_i^e-\tau_i\right|\,P_{X,n}\left(G_{FB}^*\triangle\hat{G}_{e\text{-}hybrid}\right),$$
and that inequality (A.29) is replaced by
$$d(G_{FB}^*,\hat{G}_{e\text{-}hybrid}) \le \rho_n+V_a\left[d(G_{FB}^*,\hat{G}_{e\text{-}hybrid})+a^2\right], \tag{A.32}$$
where $V_a$ is as defined in the proof of Theorem 2.3. The rest of the proof proceeds as for the first claim, except that the rate $\phi_n$ given in Condition 2.1 replaces $\psi_n$.
A.5 Proofs of Theorems 4.1 and 4.2
Proof of Theorem 4.1. Since $W_K(G)-W_K(G')=V_K(G)-V_K(G')$ for all $G,G'$,
$$\sup_{P\in\mathcal{P}(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal{G}}W_K(G)-W_K(\hat{G}_K)\right] = \sup_{P\in\mathcal{P}(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal{G}}V_K(G)-V_K(\hat{G}_K)\right],$$
and we focus on bounding the latter expression. Since $\hat{G}_K$ maximizes $V_{K,n}(G)$, $V_{K,n}(\tilde{G})\le V_{K,n}(\hat{G}_K)$ for any $\tilde{G}\in\mathcal{G}$, and
$$V_K(\tilde{G}) \le V_{K,n}(\tilde{G})+\sup_{G\in\mathcal{G}}|V_{K,n}(G)-V_K(G)| \le V_{K,n}(\hat{G}_K)+\sup_{G\in\mathcal{G}}|V_{K,n}(G)-V_K(G)| \le V_K(\hat{G}_K)+2\sup_{G\in\mathcal{G}}|V_{K,n}(G)-V_K(G)|.$$
Applying the inequality for all $\tilde{G}\in\mathcal{G}$, we obtain
$$\sup_{G\in\mathcal{G}}V_K(G)-V_K(\hat{G}_K) \le 2\sup_{G\in\mathcal{G}}|V_{K,n}(G)-V_K(G)|, \tag{A.33}$$
which is also true in expectation over $P^n$.
The welfare gain estimation error for any treatment rule $G$ can be bounded from above by
$$\begin{aligned}
|V_{K,n}(G)-V_K(G)| &= \left|\frac{K}{\max\{K,P_{X,n}(G)\}}\cdot V_n(G)-\frac{K}{\max\{K,P_X(G)\}}\cdot V(G)\right|\\
&\le \frac{K}{\max\{K,P_{X,n}(G)\}}\cdot|V_n(G)-V(G)|+|V(G)|\cdot\left|\frac{K}{\max\{K,P_{X,n}(G)\}}-\frac{K}{\max\{K,P_X(G)\}}\right|\\
&\le |V_n(G)-V(G)|+\frac{M}{K}\cdot|P_{X,n}(G)-P_X(G)|.
\end{aligned}$$
The second line comes from subtracting and adding $\frac{K}{\max\{K,P_{X,n}(G)\}}V(G)$ and then applying the triangle inequality. The third line uses the inequalities $\frac{K}{\max\{K,P_{X,n}(G)\}}\le1$ and $|V(G)|\le M$ (from Assumption 2.1 (BO)), and the observation that for any $a,b\in\mathbb{R}$ and $c>0$,
$$\left|\frac{c}{\max\{c,a\}}-\frac{c}{\max\{c,b\}}\right| = \frac{c\left|\max\{c,b\}-\max\{c,a\}\right|}{\max\{c,a\}\cdot\max\{c,b\}} \le \frac{\left|\max\{c,b\}-\max\{c,a\}\right|}{c} \le \frac{|b-a|}{c}.$$
Then
$$\sup_{P\in\mathcal{P}(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal{G}}V_K(G)-V_K(\hat{G}_K)\right] \le 2\sup_{P\in\mathcal{P}(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal{G}}|V_{K,n}(G)-V_K(G)|\right] \le 2\sup_{P\in\mathcal{P}(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal{G}}|V_n(G)-V(G)|\right]+\frac{2M}{K}\sup_{P\in\mathcal{P}(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal{G}}|P_{X,n}(G)-P_X(G)|\right].$$
Note that since the class $\mathcal{G}$ has VC-dimension $v<\infty$, the classes of functions
$$f_G(Y,D,X) \equiv \left[\frac{YD}{e(X)}-\frac{Y(1-D)}{1-e(X)}\right]\cdot1\{X\in G\},\qquad h_G(Y,D,X) \equiv 1\{X\in G\}-1/2,$$
are VC-subgraph classes with VC-dimension no greater than $v$ by Lemma A.1. These classes of functions are uniformly bounded by $M/(2\kappa)$ and $1/2$, respectively. Since $V_n(G)=E_n(f_G)$, $V(G)=E(f_G)$, $P_n(X_i\in G)=E_n(h_G)+1/2$, and $P(X\in G)=E(h_G)+1/2$, we can apply Lemma A.4 and obtain
$$\sup_{P\in\mathcal{P}(M,\kappa)}E_{P^n}\left[\sup_{G\in\mathcal{G}}V_K(G)-V_K(\hat{G}_K)\right] \le C_1\frac{M}{\kappa}\sqrt{\frac{v}{n}}+C_1\frac{M}{K}\sqrt{\frac{v}{n}}.$$
The theorem's result follows from (A.33).
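The capped welfare-gain criterion $V_{K,n}(G)=\frac{K}{\max\{K,P_{X,n}(G)\}}V_n(G)$ manipulated in this proof can be sketched as follows. The names and the simulated DGP below are our own illustration, not code from the paper; note that when $P_{X,n}(G)=1$ the cap scales the estimated gain by exactly $K$.

```python
import numpy as np

def capped_welfare_gain(y, d, x, e, in_g, K):
    """Sample analog V_{K,n}(G) = K / max{K, P_{X,n}(G)} * V_n(G), where
    V_n(G) = E_n[(YD/e(X) - Y(1-D)/(1-e(X))) 1{X in G}] is the estimated welfare gain."""
    v_n = np.mean((y * d / e - y * (1 - d) / (1 - e)) * in_g)
    p_n = np.mean(in_g)
    return K / max(K, p_n) * v_n

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(-1, 1, n)
e = np.full(n, 0.5)
d = rng.binomial(1, e)
y = d * x + rng.normal(0, 0.1, n)
g_all = np.ones(n)                       # treat everyone: P_{X,n}(G) = 1
v_small_cap = capped_welfare_gain(y, d, x, e, g_all, K=0.25)
v_no_cap = capped_welfare_gain(y, d, x, e, g_all, K=1.0)
print(v_small_cap, v_no_cap)
```

With $P_{X,n}(G)=1$, the $K=0.25$ criterion equals exactly one quarter of the uncapped gain, illustrating how the capacity constraint penalizes rules that treat more than a fraction $K$ of the population.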
Proof of Theorem 4.2. Define the class of functions $\mathcal{F}=\{f(\cdot;G):G\in\mathcal{G}\}$ on $\mathcal{Z}$:
$$f(Z_i;G) \equiv \left[\frac{Y_iD_i}{e(X_i)}-\frac{Y_i(1-D_i)}{1-e(X_i)}\right]\cdot\rho(X_i)\cdot1\{X_i\in G\}.$$
Then $V^T(G)=E_P(f(\cdot;G))$ by (4.3) and $\hat{G}_{EWM}^T=\arg\max_{G\in\mathcal{G}}E_n(f(\cdot;G))$. By Assumptions 2.1 (BO), (SO) and 4.2 (BDR), these functions are uniformly bounded by $\bar{F}=\frac{M\bar{\rho}}{2\kappa}$. By Assumption 2.1 (VC) and Lemma A.1, $\mathcal{F}$ is a VC-subgraph class of functions with VC-dimension at most $v$. Applying the argument in inequality (2.2), we obtain
$$\sup_{G\in\mathcal{G}}W^T(G)-W^T\left(\hat{G}_{EWM}^T\right) = \sup_{G\in\mathcal{G}}V^T(G)-V^T\left(\hat{G}_{EWM}^T\right) \le 2\sup_{f\in\mathcal{F}}|E_n(f)-E_P(f)|.$$
Now we take the expectation with respect to the sampling distribution $P^n$ and apply Lemma A.4:
$$E_{P^n}\left[\sup_{G\in\mathcal{G}}W^T(G)-W^T(\hat{G}_{EWM}^T)\right] \le 2E_{P^n}\left[\sup_{f\in\mathcal{F}}|E_n(f)-E_P(f)|\right] \le C_1\frac{M\bar{\rho}}{\kappa}\sqrt{\frac{v}{n}}.$$
B Validating Condition 2.1 for Local Polynomial Estimators
This Appendix verifies that Condition 2.1 (m) and (e) hold for local polynomial estimators if the
class of data generating processes Pm or Pe is constrained by Assumptions 2.3 and 2.4, respectively.
B.1 Notations and Basic Lemmas
In addition to the notations introduced in Section 2.4 in the main text, the following notations are used. Let $\mu:\mathbb{R}^{d_x}\to\mathbb{R}$ be a generic notation for a regression of an outcome onto a vector of covariates $X\in\mathbb{R}^{d_x}$. In the case of m-hybrid EWM, $\mu(\cdot)$ corresponds to either $m_1(\cdot)$ or $m_0(\cdot)$. In the case of e-hybrid EWM, $\mu(\cdot)$ corresponds to the propensity score $e(\cdot)$. We use $n$ to denote the size of the entire sample indexed by $i=1,\dots,n$, and denote by $J_i\subset\{1,\dots,n\}$ the subsample with which $\mu(X_i)$ is estimated nonparametrically. Since we consider throughout the leave-one-out regression fits of $\mu(X_i)$, $J_i$ does not include the $i$-th observation. In the m-hybrid case, $J_i$ is either the leave-one-out treated sample $\{j\in\{1,\dots,n\}:D_j=1,\ j\ne i\}$ or the leave-one-out control sample $\{j\in\{1,\dots,n\}:D_j=0,\ j\ne i\}$, depending on whether $\mu(\cdot)$ corresponds to $m_1(\cdot)$ or $m_0(\cdot)$. Note that, in the m-hybrid case, $J_i$ is random, as it depends on a realization of $(D_1,\dots,D_n)$. When the e-hybrid EWM is considered, $J_i$ is non-stochastic and is given by $J_i=\{1,\dots,n\}\setminus\{i\}$. The size of $J_i$ is denoted by $n_{J_i}$, which equals $n_1-1$ or $n_0-1$ in the m-hybrid case and $n-1$ in the e-hybrid case. With an abuse of notation, we use $Y_i$, $i=1,\dots,n$, to denote observations of the regressand, and use $\xi_i$ to denote the regression residual; i.e., $Y_i=\mu(X_i)+\xi_i$, $E(\xi_i|X_i)=0$, holds for all $i=1,\dots,n$. For the e-hybrid rule, $Y_i$ should be read as the treatment status indicator $D_i\in\{1,0\}$.
We assume that µ (·) belongs to a Hölder class of functions with degree β ≥ 1 and constant
0 < L < ∞. Define the leave-one-out local polynomial regression with degree l = (β − 1) by
$$\hat\mu_{-i}(X_i)=U^T(0)\,\hat\theta_{-i}(X_i)\cdot 1\{\lambda(X_i)\ge t_n\},\qquad
\hat\theta_{-i}(X_i)=\arg\min_{\theta}\sum_{j\in J_i}\left[Y_j-\theta^T U\!\left(\frac{X_j-X_i}{h}\right)\right]^2 K\!\left(\frac{X_j-X_i}{h}\right),\tag{B.1}$$
where $U\!\left(\frac{X_j-X_i}{h}\right)\equiv\left(\left(\frac{X_j-X_i}{h}\right)^{s}\right)_{|s|\le l}$ is a regressor vector, $\lambda(X_i)$ is the smallest eigenvalue of
$$B_{-i}(X_i)\equiv\frac{1}{nh^{d_x}}\sum_{j\in J_i}U\!\left(\frac{X_j-X_i}{h}\right)U^T\!\left(\frac{X_j-X_i}{h}\right)K\!\left(\frac{X_j-X_i}{h}\right),$$
and $t_n$ is a sequence of trimming constants converging to zero, whose choice will be discussed later. The standard least squares calculus shows
$$\hat\theta_{-i}(X_i)=B_{-i}(X_i)^{-1}\left[\frac{1}{nh^{d_x}}\sum_{j\in J_i}Y_j\,U\!\left(\frac{X_j-X_i}{h}\right)K\!\left(\frac{X_j-X_i}{h}\right)\right],$$
so that $\hat\mu_{-i}(X_i)$ can be written as
$$\hat\mu_{-i}(X_i)=\left[\sum_{j\in J_i}Y_j\,\omega_j(X_i)\right]\cdot 1\{\lambda(X_i)\ge t_n\},\tag{B.2}$$
where $\omega_j(X_i)=\frac{1}{nh^{d_x}}\,U^T(0)\left[B_{-i}(X_i)\right]^{-1}U\!\left(\frac{X_j-X_i}{h}\right)K\!\left(\frac{X_j-X_i}{h}\right)$.
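As a concrete illustration of the estimator defined in (B.1)–(B.2), the following sketch implements the leave-one-out local polynomial fit with eigenvalue trimming for a scalar covariate (d_x = 1). It is a minimal sketch under stated assumptions, not the authors' code: the Epanechnikov kernel, the function name, and the default degree are illustrative choices.

```python
import numpy as np

def loo_local_poly(y, x, h, t_n, degree=1,
                   K=lambda u: 0.75 * np.maximum(1 - u**2, 0.0)):
    """Leave-one-out local polynomial fit mu_hat_{-i}(X_i) with eigenvalue
    trimming, in the spirit of (B.1)-(B.2), for scalar X (d_x = 1)."""
    n = len(y)
    fits = np.zeros(n)
    for i in range(n):
        J = np.delete(np.arange(n), i)                 # leave-one-out set J_i
        u = (x[J] - x[i]) / h                          # scaled deviations (X_j - X_i)/h
        U = np.vander(u, degree + 1, increasing=True)  # rows (1, u, ..., u^degree)
        w = K(u)                                       # kernel weights
        B = (U.T * w) @ U / (n * h)                    # B_{-i}(X_i)
        lam = np.linalg.eigvalsh(B)[0]                 # smallest eigenvalue lambda(X_i)
        if lam >= t_n:                                 # trimming indicator 1{lambda >= t_n}
            theta = np.linalg.solve(B, (U.T * w) @ y[J] / (n * h))
            fits[i] = theta[0]                         # U(0)^T theta is the intercept
        # otherwise fits[i] stays 0, as the trimming indicator dictates
    return fits
```

The trimming step mirrors the role of t_n in (B.2): when the local design matrix is nearly singular, the fit is set to zero rather than inverted unstably.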
We first present lemmas that will be used for proving Lemma B.4 below.
Lemma B.1. Suppose Assumptions 2.3 (PX) and (Ker) hold.
(i) Conditional on $(X_1,\ldots,X_n)$ such that $\lambda(X_i)>0$,
$$\max_{j\ne i}|\omega_j(X_i)|\le\frac{c_5}{nh^{d_x}\lambda(X_i)},\qquad
\sum_{j\in J_i}|\omega_j(X_i)|\le\frac{c_5}{nh^{d_x}\lambda(X_i)}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\},$$
where $c_5$ is a constant that depends only on $\beta$, $d_x$, and $K_{\max}$.
(ii) Conditional on $(X_1,\ldots,X_n)$ such that $\lambda(X_i)>0$, $\sum_{j\in J_i}\omega_j(X_i)=1$, and, for any multi-index $s$ such that $1\le|s|\le(\beta-1)$, $\sum_{j\in J_i}\left(\frac{X_j-X_i}{h}\right)^{s}\omega_j(X_i)=0$.
(iii) Let $\lambda(x)$ be the smallest eigenvalue of $B(x)\equiv\frac{1}{nh^{d_x}}\sum_{j=1}^{n}U\!\left(\frac{X_j-x}{h}\right)U^T\!\left(\frac{X_j-x}{h}\right)K\!\left(\frac{X_j-x}{h}\right)$. Then there exist positive constants $c_6$ and $c_7$ that depend only on $c$, $r_0$, $p_X$, and $K(\cdot)$ such that
$$P^n\left(\{\lambda(x)\le c_6\}\right)\le 2\left[\dim U\right]^2\exp\left(-c_7 nh^{d_x}\right)$$
holds for all $x$, $P_X$-almost surely, at every $n\ge 1$.
Proof. (i) Since $\|U(0)\|=1$, it holds that
$$\begin{aligned}
|\omega_j(X_i)|&\le\frac{1}{nh^{d_x}}\left\|\left[B_{-i}(X_i)\right]^{-1}U\!\left(\frac{X_j-X_i}{h}\right)\right\|K\!\left(\frac{X_j-X_i}{h}\right)\\
&\le\frac{K_{\max}}{nh^{d_x}\lambda(X_i)}\left\|U\!\left(\frac{X_j-X_i}{h}\right)\right\|1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\\
&\le\frac{K_{\max}\dim(U)^{1/2}}{nh^{d_x}\lambda(X_i)}\equiv\frac{c_5}{nh^{d_x}\lambda(X_i)},
\end{aligned}$$
for every $1\le j\le n$. Similarly,
$$\sum_{j\in J_i}|\omega_j(X_i)|\le\frac{K_{\max}}{nh^{d_x}\lambda(X_i)}\sum_{j\in J_i}\left\|U\!\left(\frac{X_j-X_i}{h}\right)\right\|1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\le\frac{c_5}{nh^{d_x}\lambda(X_i)}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}.$$
(ii) This claim follows from the first-order condition for θ in the least squares minimization problem in (B.1).
(iii) This claim follows from Equation (6.3), p. 626, in the proof of Theorem 3.2 in Audibert and Tsybakov (2007), where a suitable choice of the constant c_6 is given in Equation (6.2), p. 625, of Audibert and Tsybakov (2007).
The next lemma provides an exponential tail bound for the local polynomial estimators. The
first statement is borrowed from Theorem 3.2 in Audibert and Tsybakov (2007), and the second
statement is its immediate extension.
Lemma B.2. (i) Suppose Assumptions 2.3 (PX) and (Ker) hold, and µ (·) belongs to a Hölder
class of functions with degree β ≥ 1 and constant 0 < L < ∞. Assume Ji is non-stochastic with
nJi = n − 1 (e-hybrid case). Then, there exist positive constants c8 , c9 , and c10 that depend only
on β, dx , L, c, r0 , pX , and p̄X , such that, for any 0 < h < r0 /c, any c8 hβ < δ, and any n ≥ 2,
$$P^{n-1}\left(\left|\hat\mu_{-n}(x)-\mu(x)\right|>\delta\right)\le c_9\exp\left(-c_{10}nh^{d_x}\delta^2\right)$$
holds for almost all $x$ with respect to $P_X$, where $P^{n-1}(\cdot)$ is the distribution of $\left\{(Y_i,X_i)\right\}_{i=1}^{n-1}$.
(ii) Suppose Assumptions 2.1 (SO), 2.3 (PX), and (Ker) hold, and $\mu(\cdot)$ belongs to a Hölder class of functions with degree $\beta\ge 1$ and constant $0<L<\infty$. Assume $J_i$ is stochastic (m-hybrid case) with $J_i=\{j\ne i: D_j=d\}$, $d\in\{1,0\}$. There exist positive constants $c_{11}$, $c_{12}$, and $c_{13}$ that depend only on $\kappa$, $\beta$, $d_x$, $L$, $c$, $r_0$, $p_X$, and $\bar p_X$, such that for any $0<h<r_0/c$, any $c_{11}h^\beta<\delta$, and any $n_{J_n}\ge 1$,
$$P^{n-1}\left(\left|\hat\mu_{-n}(x)-\mu(x)\right|>\delta\,\middle|\,n_{J_n}\right)\le c_{12}\exp\left(-c_{13}n_{J_n}h^{d_x}\delta^2\right)$$
holds for almost all $x$ with respect to $P_X$, where $P^{n-1}(\cdot\,|\,n_{J_n})$ is the conditional distribution of $\left\{(Y_i,X_i)\right\}_{i=1}^{n-1}$ given $n_{J_n}=\sum_{j=1}^{n-1}1\{D_j=d\}$.
Proof. (i) See Theorem 3.2 in Audibert and Tsybakov (2007).
(ii) Under Assumption 2.1 (SO), the conditional distribution of the covariates X given D = d, d ∈ {1, 0}, has the same support X as the unconditional distribution PX and has a bounded density on X, since
$$\frac{\kappa}{1-\kappa}\,\frac{dP_X}{dx}<\frac{dP_{X|D=d}}{dx}<\frac{1-\kappa}{\kappa}\,\frac{dP_X}{dx}$$
holds for all $x\in\mathcal X$. Therefore, when $P_X$ satisfies Assumption 2.3 (PX), the conditional distributions $P_{X|D=d}$, $d\in\{1,0\}$, also satisfy the support and density conditions analogous to Assumption 2.3 (PX). This implies that, even when we condition on $n_{J_n}=\sum_{j=1}^{n-1}1\{D_j=d\}\ge 1$, the exponential inequality of (i) in the current lemma is applicable with different constant terms.
The next lemma concerns an upper bound on the variance of the supremum of centered empirical processes indexed by a class of sets.
Lemma B.3. Let $\mathcal B$ be a countable class of sets in $\mathcal X$, and let $\{P_{X,n}(B):B\in\mathcal B\}$ be the empirical distribution based on iid observations $(X_1,\ldots,X_n)$, $X_i\sim P_X$. Then
$$\mathrm{Var}\left(\sup_{B\in\mathcal B}\left\{P_{X,n}(B)-P_X(B)\right\}\right)\le\frac{2}{n}\,E\left[\sup_{B\in\mathcal B}\left\{P_{X,n}(B)-P_X(B)\right\}\right]+\frac{1}{4n}.$$
Proof. In Theorem 11.10 of Boucheron et al. (2013), setting $X_{i,s}$ at the centered indicator function $1\{X_i\in B\}-P_X(B)$ and dividing the inequality of Theorem 11.10 of Boucheron et al. (2013) by $n^2$ lead to
$$\begin{aligned}
\mathrm{Var}\left(\sup_{B\in\mathcal B}\left\{P_{X,n}(B)-P_X(B)\right\}\right)&\le\frac{2}{n}\,E\left[\sup_{B\in\mathcal B}\left\{P_{X,n}(B)-P_X(B)\right\}\right]+\frac{1}{n}\sup_{B\in\mathcal B}\left\{P_X(B)\left[1-P_X(B)\right]\right\}\\
&\le\frac{2}{n}\,E\left[\sup_{B\in\mathcal B}\left\{P_{X,n}(B)-P_X(B)\right\}\right]+\frac{1}{4n}.
\end{aligned}$$
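The bound of Lemma B.3 can also be checked numerically. The sketch below is a small Monte Carlo, assuming uniform X and a finite family of intervals standing in for the class B; the grid, sample size, and replication count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 400
grid = np.linspace(0.0, 1.0, 41)   # endpoints of a finite family of intervals [a, b]

sups = np.empty(reps)
for r in range(reps):
    x = rng.uniform(size=n)
    # empirical CDF of the sample evaluated on the grid
    emp_cdf = np.searchsorted(np.sort(x), grid, side="right") / n
    # sup over B = [a, b] of (P_{X,n}(B) - P_X(B)); P_X([a,b]) = b - a for uniform X
    diffs = (emp_cdf[None, :] - emp_cdf[:, None]) - (grid[None, :] - grid[:, None])
    sups[r] = diffs.max()

lhs = sups.var()                            # simulated Var(sup_B {...})
rhs = 2.0 / n * sups.mean() + 1.0 / (4 * n)  # bound of Lemma B.3
print(lhs, rhs)
```

With this seed the simulated variance sits comfortably below the bound, as the lemma requires.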
B.2 Main Lemmas and Proofs of Corollaries 2.1 and 2.2
The next lemma yields Corollaries 2.1 and 2.2.
Lemma B.4. Let Pµ be a class of joint distributions of (Y, X) such that µ (·) belongs to a Hölder
class of functions with degree β ≥ 1 and constant 0 < L < ∞, and Assumption 2.3 (PX) holds. Let
µ̂−i (·) be the leave-one-out local polynomial fit for µ (Xi ) defined in (B.1), whose kernel function
satisfies Assumption 2.3 (Ker).
(i) Then
$$\sup_{P\in\mathcal P_\mu}E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|\right]\le O(h^\beta)+O\left(\frac{1}{\sqrt{nh^{d_x}}}\right)\tag{B.3}$$
holds. Hence, an optimal choice of bandwidth that leads to the fastest convergence rate of the uniform upper bound is $h\propto n^{-\frac{1}{2\beta+d_x}}$, and the resulting uniform convergence rate is
$$\sup_{P\in\mathcal P_\mu}E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|\right]\le O\left(n^{-\frac{1}{2+d_x/\beta}}\right).$$
(ii) Let $t_n=(\log n)^{-1}$. Then
$$\sup_{P\in\mathcal P_\mu}E_{P^n}\left[\max_{1\le i\le n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|^2\right]\le O\left(\frac{h^{2\beta}}{t_n^2}\right)+O\left(\frac{\log n}{nh^{d_x}t_n^2}\right)\tag{B.4}$$
holds. Hence, an optimal choice of bandwidth that leads to the fastest convergence rate of the uniform upper bound is $h\propto\left(\frac{\log n}{n}\right)^{\frac{1}{2\beta+d_x}}$, and the resulting uniform convergence rate is
$$\sup_{P\in\mathcal P_\mu}E_{P^n}\left[\max_{1\le i\le n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|^2\right]\le O\left((t_n)^{-2}\left(\frac{\log n}{n}\right)^{\frac{2}{2+d_x/\beta}}\right).$$
Proof. (i) First, consider the non-stochastic Ji case with nJi = (n − 1) (e-hybrid case). Since
observations are iid (hence exchangeable) and the probability law of µ̂−i (·) does not depend on Xi ,
it holds
$$\begin{aligned}
E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|\right]&=E_{P^n}\left|\hat\mu_{-n}(X_n)-\mu(X_n)\right|\\
&=E_{P_X}\left[E_{P^{n-1}}\left[\left|\hat\mu_{-n}(X_n)-\mu(X_n)\right|\,\middle|\,X_n\right]\right]\\
&=\int_{\mathcal X}E_{P^{n-1}}\left|\hat\mu_{-n}(x)-\mu(x)\right|dP_X(x)\\
&=\int_{\mathcal X}\int_{0}^{\infty}P^{n-1}\left(\left|\hat\mu_{-n}(x)-\mu(x)\right|>\delta\right)d\delta\,dP_X(x),
\end{aligned}\tag{B.5}$$
where $E_{P^{n-1}}[\cdot]$ is the expectation with respect to the first $(n-1)$ observations of $(Y_i,X_i)$. By Lemma B.2 (i), there exist positive constants $c_8$, $c_9$, and $c_{10}$ that depend only on $\beta$, $d_x$, $L$, $c$, $r_0$, $p_X$, and $\bar p_X$ such that, for any $0<h<r_0/c$, any $c_8h^\beta<\delta$, and any $n\ge 2$,
$$P^{n-1}\left(\left|\hat\mu_{-n}(x)-\mu(x)\right|>\delta\right)\le c_9\exp\left(-c_{10}nh^{d_x}\delta^2\right)\tag{B.6}$$
holds for almost all $x$ with respect to $P_X$. Hence,
$$\begin{aligned}
\int_{\mathcal X}\int_{0}^{\infty}P^{n-1}\left(\left|\hat\mu_{-n}(x)-\mu(x)\right|>\delta\right)d\delta\,dP_X(x)&\le c_8h^\beta+c_9\int_{0}^{\infty}\exp\left(-c_{10}nh^{d_x}\delta^2\right)d\delta\\
&=c_8h^\beta+\frac{c_{14}}{\sqrt{nh^{d_x}}}=O(h^\beta)+O\left(\frac{1}{\sqrt{nh^{d_x}}}\right),
\end{aligned}\tag{B.7}$$
where $c_{14}=c_9(2c_{10})^{-1/2}\int_{0}^{\infty}(\delta')^{-1/2}\exp\left(-c_{10}\delta'\right)d\delta'<\infty$. Since the upper bound (B.7) does not depend upon $P\in\mathcal P_\mu$, this upper bound is uniform over $P\in\mathcal P_\mu$, so the conclusion holds.
Next, consider the stochastic $J_i$ case with $n_{J_i}=\sum_{j\ne i}1\{D_j=d\}$, where $d\in\{1,0\}$. We can interpret $n_{J_i}$ as a binomial random variable with parameters $(n-1)$ and $\pi$, where $\pi=P(D_i=1)$ when $\mu(\cdot)$ corresponds to $m_1(\cdot)$ and $\pi=P(D_i=0)$ when $\mu(\cdot)$ corresponds to $m_0(\cdot)$. In either case, $\kappa<\pi<1-\kappa$ by Assumption 2.1 (SO). Let $n\ge 1+\frac{2}{\pi}$ and $\Omega_{\pi,n}\equiv\left\{\left|\frac{n_{J_n}}{n-1}-\pi\right|\le\frac{1}{2}\pi\right\}=\left\{\frac{(n-1)\pi}{2}\le n_{J_n}\le\frac{3(n-1)\pi}{2}\right\}$. Consider
$$\begin{aligned}
E_{P^{n-1}}\left[\left|\hat\mu_{-n}(x)-\mu(x)\right|\cdot 1\{\Omega_{\pi,n}\}\right]&=\sum_{n_{J_n}\in\Omega_{\pi,n}}E_{P^{n-1}}\left[\left|\hat\mu_{-n}(x)-\mu(x)\right|\,\middle|\,n_{J_n}\right]P^{n-1}(n_{J_n})\\
&\le\max_{n_{J_n}\in\Omega_{\pi,n}}E_{P^{n-1}}\left[\left|\hat\mu_{-n}(x)-\mu(x)\right|\,\middle|\,n_{J_n}\right]P^{n-1}(\Omega_{\pi,n})\\
&\le\max_{n_{J_n}\in\Omega_{\pi,n}}E_{P^{n-1}}\left[\left|\hat\mu_{-n}(x)-\mu(x)\right|\,\middle|\,n_{J_n}\right].
\end{aligned}$$
Since $n_{J_n}\ge\frac{(n-1)\pi}{2}\ge 1$ on $\Omega_{\pi,n}$, Lemma B.2 (ii) implies
$$E_{P^{n-1}}\left[\left|\hat\mu_{-n}(x)-\mu(x)\right|\,\middle|\,n_{J_n}\right]\le\int_{0}^{\infty}P^{n-1}\left(\left|\hat\mu_{-n}(x)-\mu(x)\right|>\delta\,\middle|\,n_{J_n}\right)d\delta\le c_{11}h^\beta+\frac{c_{15}}{\sqrt{n_{J_n}h^{d_x}}},$$
where $c_{11}$ and $c_{15}$ are positive constants that depend only on $\kappa$, $\beta$, $d_x$, $L$, $c$, $r_0$, $p_X$, and $\bar p_X$. Since $n_{J_n}\ge\frac{(n-1)\pi}{2}\ge\frac{n\pi}{4}$ on $\Omega_{\pi,n}$ for $n\ge 2$, it holds that
$$\max_{n_{J_n}\in\Omega_{\pi,n}}E_{P^{n-1}}\left[\left|\hat\mu_{-n}(x)-\mu(x)\right|\,\middle|\,n_{J_n}\right]\le c_{11}h^\beta+\frac{2c_{15}}{\sqrt{\pi nh^{d_x}}}.$$
Accordingly, combined with the Hoeffding inequality $P^{n-1}\left(\Omega_{\pi,n}^c\right)\le 2\exp\left(-\frac{\pi^2}{4}n\right)$, we obtain
$$\begin{aligned}
E_{P^{n-1}}\left|\hat\mu_{-n}(x)-\mu(x)\right|&\le E_{P^{n-1}}\left[\left|\hat\mu_{-n}(x)-\mu(x)\right|\cdot 1\{\Omega_{\pi,n}\}\right]+MP^{n-1}\left(\Omega_{\pi,n}^c\right)\\
&\le c_{11}h^\beta+\frac{2c_{15}}{\sqrt{\pi nh^{d_x}}}+2M\exp\left(-\frac{\pi^2}{4}n\right).
\end{aligned}$$
The third term on the right hand side converges faster than the second term, so we have shown that
$$E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|\right]=\int_{\mathcal X}E_{P^{n-1}}\left|\hat\mu_{-n}(x)-\mu(x)\right|dP_X(x)\le O(h^\beta)+O\left(\frac{1}{\sqrt{nh^{d_x}}}\right)$$
holds for the stochastic $J_i$ case as well.
(ii) Let $\Omega_{\lambda,n}$ be the event $\{\lambda(X_i)\ge t_n,\ \forall i=1,\ldots,n\}$. On $\Omega_{\lambda,n}$, (B.2) implies
$$\begin{aligned}
\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|^2&\le\left|\sum_{j\in J_i}Y_j\,\omega_j(X_i)-\mu(X_i)\right|^2\\
&=\left|\sum_{j\in J_i}\left(\mu(X_j)-\mu(X_i)\right)\omega_j(X_i)+\sum_{j\in J_i}\xi_j\,\omega_j(X_i)\right|^2\\
&\le 2\left|\sum_{j\in J_i}\left(\mu(X_j)-\mu(X_i)\right)\omega_j(X_i)\right|^2+2\left|\sum_{j\in J_i}\xi_j\,\omega_j(X_i)\right|^2,
\end{aligned}\tag{B.8}$$
where the second line follows from $Y_j=\mu(X_j)+\xi_j$ and $\sum_{j\in J_i}\omega_j(X_i)=1$, as implied by Lemma B.1 (ii). Since $\mu(\cdot)$ is assumed to belong to the Hölder class, Lemma B.1 (i) and (ii) and Assumption 2.3 (Ker) imply
$$\begin{aligned}
\left|\sum_{j\in J_i}\left(\mu(X_j)-\mu(X_i)\right)\omega_j(X_i)\right|^2&\le\left(\sum_{j\in J_i}\left\|X_j-X_i\right\|^{\beta}\left|\omega_j(X_i)\right|\right)^2\\
&=\left(\sum_{j\in J_i}\left\|X_j-X_i\right\|^{\beta}\left|\omega_j(X_i)\right|\cdot 1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\right)^2\\
&\le d_x^{2\beta}h^{2\beta}\left(\sum_{j\in J_i}\left|\omega_j(X_i)\right|\right)^2\\
&\le d_x^{2\beta}h^{2\beta}\left(\frac{c_5}{nh^{d_x}\lambda(X_i)}\right)^2\left(\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\right)^2\\
&\le c_{16}\frac{h^{2\beta}}{t_n^2}\left(\frac{1}{nh^{d_x}}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\right)^2,
\end{aligned}$$
where $c_{16}=d_x^{2\beta}c_5^2$. Under Assumption 2.3 (PX) and conditional on $\Omega_{\lambda,n}$,
$$\begin{aligned}
\max_{1\le i\le n}\left|\sum_{j\in J_i}\left(\mu(X_j)-\mu(X_i)\right)\omega_j(X_i)\right|^2&\le c_{16}\frac{h^{2\beta}}{t_n^2}\left[\frac{1}{h^{d_x}}\sup_{B\in\mathcal B_h}P_{X,n}(B)\right]^2\\
&\le c_{16}\frac{h^{2\beta}}{t_n^2}\left[\frac{1}{h^{d_x}}\left(\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)+\sup_{B\in\mathcal B_h}P_X(B)\right)\right]^2\\
&\le c_{16}\frac{h^{2\beta}}{t_n^2}\left[\frac{1}{h^{d_x}}\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)+2^{d_x}\bar p_X\right]^2\\
&\le c_{16}\frac{h^{2\beta}}{t_n^2}\left\{\frac{2}{h^{2d_x}}\left[\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)\right]^2+2^{2d_x+1}\bar p_X^2\right\},
\end{aligned}$$
where $\mathcal B_h$ is the class of hypercubes in $\mathbb R^{d_x}$,
$$\mathcal B_h\equiv\left\{\prod_{k=1}^{d_x}\left[x_k-h,x_k+h\right]:(x_1,\ldots,x_{d_x})\in\mathcal X\right\}.$$
Accordingly,
$$\begin{aligned}
E_{P^n}&\left[\max_{1\le i\le n}\left|\sum_{j\in J_i}\left(\mu(X_j)-\mu(X_i)\right)\omega_j(X_i)\right|^2\cdot 1\{\Omega_{\lambda,n}\}\right]\\
&\le c_{17}\frac{h^{2\beta}}{t_n^2}+2c_{16}\frac{h^{2\beta}}{t_n^2h^{2d_x}}E_{P^n}\left[\left(\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)\right)^2\right]\\
&\le c_{17}\frac{h^{2\beta}}{t_n^2}+4c_{16}\frac{h^{2\beta}}{t_n^2h^{2d_x}}\left\{\mathrm{Var}\left(\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)\right)+\left[E_{P^n}\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)\right]^2\right\},
\end{aligned}$$
where $c_{17}=2^{2d_x+1}c_{16}\bar p_X^2$. In order to bound the variance and the squared mean terms in the curly brackets, we apply Lemma B.3 and Lemma A.5 with $\bar F=1$ and $\delta=\bar p_X(2h)^{d_x/2}$. For all $n$ satisfying $nh^{d_x}\ge\frac{C_1v_{\mathcal B_h}}{2^{d_x}\bar p_X^2}$, it holds that
$$\begin{aligned}
\mathrm{Var}\left(\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)\right)&\le\frac{2}{n}E_{P^n}\left[\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)\right]+\frac{1}{4n}\le 2^{d_x/2+1}C_2\bar p_X\frac{\sqrt{v_{\mathcal B_h}h^{d_x}}}{n^{3/2}}+\frac{1}{4n},\\
\left[E_{P^n}\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)\right]^2&\le 2^{d_x}C_2^2\bar p_X^2\frac{v_{\mathcal B_h}h^{d_x}}{n},
\end{aligned}$$
where $v_{\mathcal B_h}<\infty$ is the VC-dimension of $\mathcal B_h$, which depends only on $d_x$. As a result, there exist positive constants $c_{18}$ and $c_{19}$ that depend only on $\beta$, $d_x$, and $\bar p_X$ such that
$$E_{P^n}\left[\max_{1\le i\le n}\left|\sum_{j\in J_i}\left(\mu(X_j)-\mu(X_i)\right)\omega_j(X_i)\right|^2\cdot 1\{\Omega_{\lambda,n}\}\right]\le c_{17}\frac{h^{2\beta}}{t_n^2}+c_{18}\frac{h^{2\beta}}{t_n^2\left(nh^{d_x}\right)^{3/2}}+c_{19}\frac{h^{2\beta}}{t_n^2\left(nh^{d_x}\right)}$$
holds for all $n$ satisfying $nh^{d_x}\ge\frac{C_1v_{\mathcal B_h}}{2^{d_x}\bar p_X^2}$. Since $nh^{d_x}\to\infty$ by assumption, focusing on the leading terms yields
$$\limsup_{n\to\infty}\sup_{P\in\mathcal P_\mu}E_{P^n}\left[\max_{1\le i\le n}\left|\sum_{j\in J_i}\left(\mu(X_j)-\mu(X_i)\right)\omega_j(X_i)\right|^2\cdot 1\{\Omega_{\lambda,n}\}\right]\le O\left(\frac{h^{2\beta}}{t_n^2}\right).\tag{B.9}$$
In order to bound the second term on the right hand side of (B.8), note first that
$$\left|\sum_{j\in J_i}\xi_j\,\omega_j(X_i)\right|^2\le\frac{1}{nh^{d_x}\lambda^2(X_i)}\left\|\frac{1}{\sqrt{nh^{d_x}}}\sum_{j\in J_i}\xi_j\,U\!\left(\frac{X_j-X_i}{h}\right)K\!\left(\frac{X_j-X_i}{h}\right)\right\|^2\le\frac{K_{\max}^2}{nh^{d_x}t_n^2}\max_{1\le k\le\dim(U)}\eta_{ik}^2$$
holds conditional on $\Omega_{\lambda,n}$, where $\eta_{ik}$, $1\le k\le\dim(U)$, is the $k$-th entry of the vector
$$\frac{1}{\sqrt{nh^{d_x}}}\sum_{j\in J_i}\xi_j\,U\!\left(\frac{X_j-X_i}{h}\right)1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}.$$
Therefore,
$$E_{P^n}\left[\max_{1\le i\le n}\left|\sum_{j\in J_i}\xi_j\,\omega_j(X_i)\right|^2\cdot 1\{\Omega_{\lambda,n}\}\right]\le\frac{K_{\max}^2}{nh^{d_x}t_n^2}E_{P^n}\left[\max_{1\le i\le n,\,1\le k\le\dim(U)}\eta_{ik}^2\right].\tag{B.10}$$
Conditional on $(X_1,\ldots,X_n)$, $\eta_{ik}$ has mean zero and every summand in $\eta_{ik}$ lies in the interval
$$\left[-\frac{M}{\sqrt{nh^{d_x}}}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\},\ \frac{M}{\sqrt{nh^{d_x}}}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\right].$$
The Hoeffding inequality then implies that, for every $1\le i\le n$ and $1\le k\le\dim(U)$, it holds that
$$\begin{aligned}
P^n\left(\left|\eta_{ik}\right|\ge t\,\middle|\,X_1,\ldots,X_n\right)&\le 2\exp\left(-\frac{t^2}{\frac{2M^2}{nh^{d_x}}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}}\right)\\
&\le 2\exp\left(-\frac{t^2}{\frac{2M^2}{nh^{d_x}}\max_{1\le i\le n}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}}\right),\quad\forall t>0.
\end{aligned}$$
Therefore,
$$\begin{aligned}
E_{P^n}&\left[\exp\left(\frac{\eta_{ik}^2}{\frac{2M^2}{nh^{d_x}}\max_{1\le i\le n}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}}\right)\,\middle|\,X_1,\ldots,X_n\right]\\
&=1+\int_{1}^{\infty}P^n\left(\exp\left(\frac{\eta_{ik}^2}{\frac{2M^2}{nh^{d_x}}\max_{1\le i\le n}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}}\right)\ge t'\,\middle|\,X_1,\ldots,X_n\right)dt'\\
&=1+\int_{1}^{\infty}P^n\left(\left|\eta_{ik}\right|\ge\sqrt{\frac{2M^2}{nh^{d_x}}\max_{1\le i\le n}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\log t'}\,\middle|\,X_1,\ldots,X_n\right)dt'\\
&\le 1+2\int_{1}^{\infty}\exp\left(-2\log t'\right)dt'=1+2\int_{1}^{\infty}\left(t'\right)^{-2}dt'=3
\end{aligned}$$
for all $1\le i\le n$ and $1\le k\le\dim(U)$. We can therefore apply Lemma 1.6 of Tsybakov (2009) to bound $E_{P^n}\left[\max_{i,k}\eta_{ik}^2\,\middle|\,X_1,\ldots,X_n\right]$:
$$\begin{aligned}
E_{P^n}\left[\max_{1\le i\le n,\,1\le k\le\dim(U)}\eta_{ik}^2\,\middle|\,X_1,\ldots,X_n\right]&\le 2M^2\max_{1\le i\le n}\left[\frac{1}{nh^{d_x}}\sum_{j\in J_i}1\left\{(X_j-X_i)\in[-h,h]^{d_x}\right\}\right]\log\left(3\dim(U)n\right)\\
&\le 2M^2\left[\frac{1}{h^{d_x}}\sup_{B\in\mathcal B_h}\left(P_{X,n}(B)-P_X(B)\right)+2^{d_x}\bar p_X\right]\log\left(3\dim(U)n\right).
\end{aligned}$$
By applying Lemma A.5 with $\bar F=1$ and $\delta=\bar p_X(2h)^{d_x/2}$, the unconditional expectation of $\max_{i,k}\eta_{ik}^2$ can be bounded as
$$E_{P^n}\left[\max_{1\le i\le n,\,1\le k\le\dim(U)}\eta_{ik}^2\right]\le 2M^2\left[C_2\,2^{d_x/2}\,\bar p_X\sqrt{\frac{v_{\mathcal B_h}}{nh^{d_x}}}+2^{d_x}\bar p_X\right]\log\left(3\dim(U)n\right)\tag{B.11}$$
for all $n$ such that $nh^{d_x}\ge\frac{C_1v_{\mathcal B_h}}{2^{d_x}\bar p_X^2}$. Plugging (B.11) back into (B.10) and focusing on the leading term gives
$$\limsup_{n\to\infty}\sup_{P\in\mathcal P_\mu}E_{P^n}\left[\max_{1\le i\le n}\left|\sum_{j\in J_i}\xi_j\,\omega_j(X_i)\right|^2\cdot 1\{\Omega_{\lambda,n}\}\right]\le O\left(\frac{\log n}{nh^{d_x}t_n^2}\right).\tag{B.12}$$
Combining (B.8), (B.9), and (B.12), we obtain
$$\begin{aligned}
E_{P^n}\left[\max_{1\le i\le n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|^2\right]&\le E_{P^n}\left[\max_{1\le i\le n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|^2\cdot 1\{\Omega_{\lambda,n}\}\right]+M^2P^n\left(\Omega_{\lambda,n}^c\right)\\
&\le 2E_{P^n}\left[\max_{1\le i\le n}\left|\sum_{j\in J_i}\left(\mu(X_j)-\mu(X_i)\right)\omega_j(X_i)\right|^2\cdot 1\{\Omega_{\lambda,n}\}\right]\\
&\quad+2E_{P^n}\left[\max_{1\le i\le n}\left|\sum_{j\in J_i}\xi_j\,\omega_j(X_i)\right|^2\cdot 1\{\Omega_{\lambda,n}\}\right]+M^2P^n\left(\Omega_{\lambda,n}^c\right)\\
&=O\left(\frac{h^{2\beta}}{t_n^2}\right)+O\left(\frac{\log n}{nh^{d_x}t_n^2}\right)+M^2P^n\left(\Omega_{\lambda,n}^c\right),
\end{aligned}$$
so the desired conclusion is proven if $P^n\left(\Omega_{\lambda,n}^c\right)$ is shown to converge faster than the $O\left(\frac{\log n}{nh^{d_x}t_n^2}\right)$ term.
To find the convergence rate of $P^n\left(\Omega_{\lambda,n}^c\right)$, consider first the case of non-stochastic $J_i$. By applying Lemma B.1 (iii) with the sample size set at $(n-1)$, we have
$$\begin{aligned}
P^n\left(\left\{\lambda(X_i)\le c_6\ \text{for some}\ 1\le i\le n\right\}\right)&\le nP^n\left(\left\{\lambda(X_n)\le c_6\right\}\right)\\
&=n\int P^n\left(\lambda(X_n)\le c_6\,\middle|\,X_n\right)dP_X\\
&=n\int P^{n-1}\left(\lambda(x)\le c_6\right)dP_X(x)\\
&\le 2n\left[\dim U\right]^2\exp\left(-\frac{c_7}{2}nh^{d_x}\right).
\end{aligned}\tag{B.13}$$
For the case of stochastic $J_i$, by viewing $n_{J_i}$ as a binomial random variable with parameters $(n-1)$ and $\pi$ with $\kappa<\pi<1-\kappa$, and recalling that, when $P_X$ satisfies Assumption 2.3 (PX), the conditional distributions $P_{X|D=d}$, $d\in\{1,0\}$, also satisfy the support and density conditions stated in Assumption 2.3 (PX), we can apply the exponential inequality shown in Lemma B.1 (iii) to bound $P^{n-1}\left(\lambda(x)\le c_6\,\middle|\,n_{J_n}\right)$. Hence, with $\Omega_{\pi,n}\equiv\left\{\left|\frac{n_{J_n}}{n-1}-\pi\right|\le\frac{1}{2}\pi\right\}=\left\{\frac{(n-1)\pi}{2}\le n_{J_n}\le\frac{3(n-1)\pi}{2}\right\}$ as used above, we have
$$\begin{aligned}
P^{n-1}\left(\lambda(x)\le c_6\right)&\le P^{n-1}\left(\left\{\lambda(x)\le c_6\right\}\cap\Omega_{\pi,n}\right)+P^{n-1}\left(\Omega_{\pi,n}^c\right)\\
&\le\max_{n_{J_n}\in\Omega_{\pi,n}}P^{n-1}\left(\lambda(x)\le c_6\,\middle|\,n_{J_n}\right)+P^{n-1}\left(\Omega_{\pi,n}^c\right)\\
&\le 2\left[\dim U\right]^2\exp\left(-\frac{c_7\pi}{4}nh^{d_x}\right)+2\exp\left(-\frac{\pi^2}{4}n\right).
\end{aligned}$$
Plugging this upper bound into (B.13) and focusing on the leading term leads to
$$P^n\left(\left\{\lambda(X_i)\le c_6\ \text{for some}\ 1\le i\le n\right\}\right)\le O\left(n\exp\left(-\frac{c_7\pi}{4}nh^{d_x}\right)\right).$$
Hence, in either the non-stochastic or the stochastic $J_i$ case, since $t_n\le c_6$ holds for all large $n$ and the obtained upper bounds are uniform over $P\in\mathcal P_\mu$, we conclude
$$\limsup_{n\to\infty}\sup_{P\in\mathcal P_\mu}E_{P^n}\left[\max_{1\le i\le n}\left|\hat\mu_{-i}(X_i)-\mu(X_i)\right|^2\right]\le O\left(\frac{h^{2\beta}}{t_n^2}\right)+O\left(\frac{\log n}{nh^{d_x}t_n^2}\right)+O\left(n\exp\left(-nh^{d_x}\right)\right).$$
Since $t_n=(\log n)^{-1}$ by assumption, $O\left(n\exp\left(-nh^{d_x}\right)\right)$ converges faster than $O\left(\frac{\log n}{nh^{d_x}t_n^2}\right)$, so the leading terms are given by the first two terms, $O\left(\frac{h^{2\beta}}{t_n^2}\right)+O\left(\frac{\log n}{nh^{d_x}t_n^2}\right)$.
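As a quick numerical illustration of the two bandwidth rules derived above, the sketch below evaluates $h\propto n^{-1/(2\beta+d_x)}$ for part (i) and $h\propto(\log n/n)^{1/(2\beta+d_x)}$ with $t_n=(\log n)^{-1}$ for part (ii), together with the implied rates; the values β = 2 and d_x = 1 are arbitrary illustrative choices.

```python
import numpy as np

# Bandwidths and rates from Lemma B.4, for illustrative beta, d_x.
beta, d_x = 2.0, 1.0
for n in [10**3, 10**4, 10**5]:
    h_i = n ** (-1.0 / (2 * beta + d_x))                # part (i): h ∝ n^{-1/(2β+d_x)}
    rate_i = n ** (-1.0 / (2 + d_x / beta))             # => O(n^{-1/(2+d_x/β)})
    h_ii = (np.log(n) / n) ** (1.0 / (2 * beta + d_x))  # part (ii): h ∝ (log n / n)^{1/(2β+d_x)}
    t_n = 1.0 / np.log(n)                               # trimming t_n = (log n)^{-1}
    rate_ii = t_n ** (-2) * (np.log(n) / n) ** (2.0 / (2 + d_x / beta))
    print(f"n={n:>6}  h_i={h_i:.4f}  rate_i={rate_i:.2e}  "
          f"h_ii={h_ii:.4f}  rate_ii={rate_ii:.2e}")
```

Balancing the bias term h^β against the stochastic term (nh^{d_x})^{−1/2} is exactly what produces the part (i) exponent, as the printed table makes visible.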
Proof of Corollary 2.1. By noting the following inequalities,
$$\begin{aligned}
E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat\tau^m(X_i)-\tau(X_i)\right|\right]&\le E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat m_1(X_i)-m_1(X_i)\right|\right]+E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat m_0(X_i)-m_0(X_i)\right|\right],\\
E_{P^n}\left[\max_{1\le i\le n}\left(\hat\tau^m(X_i)-\tau(X_i)\right)^2\right]&\le 2E_{P^n}\left[\max_{1\le i\le n}\left(\hat m_1(X_i)-m_1(X_i)\right)^2\right]+2E_{P^n}\left[\max_{1\le i\le n}\left(\hat m_0(X_i)-m_0(X_i)\right)^2\right],
\end{aligned}$$
we obtain the current corollary by applying Lemma B.4. The resulting uniform convergence rate is given by $\psi_n=n^{-\frac{1}{2+d_x/\beta_m}}$. When the assumption (2.9) in Theorem 2.6 is concerned, the corresponding rate is given by $\tilde\psi_n=(\log n)^2\left(\frac{\log n}{n}\right)^{\frac{2}{2+d_x/\beta_m}}$.
Proof of Corollary 2.2. (i) Assume that $n$ is large enough so that $\varepsilon_n\le\kappa/2$ holds. Given $\hat e(X_i)\in[\varepsilon_n,1-\varepsilon_n]$, $\hat\tau_i^e-\tau_i$ can be expressed as
$$\hat\tau_i^e-\tau_i=\frac{Y_iD_i}{e(X_i)}\cdot\frac{e(X_i)-\hat e(X_i)}{\hat e(X_i)}+\frac{Y_i(1-D_i)}{1-e(X_i)}\cdot\frac{e(X_i)-\hat e(X_i)}{1-\hat e(X_i)},$$
so
$$\left|\hat\tau_i^e-\tau_i\right|\le\frac{M}{\kappa}\cdot\frac{1}{\hat e(X_i)\left(1-\hat e(X_i)\right)}\cdot\left|\hat e(X_i)-e(X_i)\right|$$
holds. On the other hand, when $\hat e(X_i)\notin[\varepsilon_n,1-\varepsilon_n]$, $\hat\tau_i^e=0$ and $|\tau_i|\le\frac{M}{\kappa}$ imply $\left|\hat\tau_i^e-\tau_i\right|\le\frac{M}{\kappa}$. Hence, the following bounds are valid:
$$\left|\hat\tau_i^e-\tau_i\right|\le\begin{cases}\dfrac{M}{\kappa}\cdot\dfrac{4}{\kappa(2-\kappa)}\cdot\left|\hat e(X_i)-e(X_i)\right|&\text{if }\hat e(X_i)\in\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right],\\[6pt]\dfrac{M}{\kappa}\cdot\dfrac{1}{\varepsilon_n(1-\varepsilon_n)}&\text{if }\hat e(X_i)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right].\end{cases}\tag{B.14}$$
Hence,
$$\begin{aligned}
E_{P^n}\left[\frac{1}{n}\sum_{i=1}^{n}\left|\hat\tau_i^e-\tau_i\right|\right]&=E_{P^n}\left[\left|\hat\tau_n^e-\tau_n\right|\right]\\
&\le\frac{M}{\kappa}\cdot\frac{4}{\kappa(2-\kappa)}\cdot E_{P^n}\left[\left|\hat e(X_n)-e(X_n)\right|\right]+\frac{M}{\kappa}\cdot\frac{1}{\varepsilon_n(1-\varepsilon_n)}\cdot P^n\left(\hat e(X_n)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right]\right).
\end{aligned}$$
By Lemma B.4 (i), $\sup_{P\in\mathcal P_e}E_{P^n}\left[\left|\hat e(X_n)-e(X_n)\right|\right]\le O\left(n^{-\frac{1}{2+d_x/\beta_e}}\right)$, so the conclusion follows if $P^n\left(\hat e(X_n)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right]\right)$ is shown to converge faster than $O\left(n^{-\frac{1}{2+d_x/\beta_e}}\right)$. To see that this claim is true, note that
$$\begin{aligned}
P^n\left(\hat e(X_n)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right]\right)&=\int_{\mathcal X}P^{n-1}\left(\hat e(x)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right]\right)dP_X(x)\\
&\le\int_{\mathcal X}P^{n-1}\left(\left|\hat e(x)-e(x)\right|\ge\frac{\kappa}{2}\right)dP_X(x)\\
&\le c_9\exp\left(-\frac{c_{10}\kappa^2}{4}nh^{d_x}\right)
\end{aligned}$$
holds for all $n$ satisfying $c_8h^\beta<\kappa/2$, where $c_8$, $c_9$, and $c_{10}$ are the constants defined in Lemma B.2 (i). Since $\varepsilon_n$ is assumed to converge at a polynomial rate, $\frac{1}{\varepsilon_n(1-\varepsilon_n)}P^n\left(\hat e(X_n)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right]\right)$ converges faster than $O\left(n^{-\frac{1}{2+d_x/\beta_e}}\right)$.
(ii) By (B.14), we have
$$\begin{aligned}
E_{P^n}\left[\max_{1\le i\le n}\left|\hat\tau_i^e-\tau_i\right|^2\right]&\le 2\left(\frac{4M}{\kappa^2(2-\kappa)}\right)^2E_{P^n}\left[\max_{1\le i\le n}\left|\hat e(X_i)-e(X_i)\right|^2\right]\\
&\quad+2\left(\frac{M}{\kappa\varepsilon_n(1-\varepsilon_n)}\right)^2P^n\left(\hat e(X_i)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right]\text{ for some }1\le i\le n\right).
\end{aligned}\tag{B.15}$$
By Lemma B.4 (ii), the first term in (B.15) converges at rate $O\left(n^{-\frac{2}{2+d_x/\beta_e}}(\log n)^{2+\frac{2}{2+d_x/\beta_e}}\right)$. To find the convergence rate of the second term in (B.15), consider
$$P^n\left(\hat e(X_i)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right]\text{ for some }1\le i\le n\right)\le nP^n\left(\hat e(X_n)\notin\left[\frac{\kappa}{2},1-\frac{\kappa}{2}\right]\right)\le c_9n\exp\left(-\frac{c_{10}\kappa^2}{4}nh^{d_x}\right),$$
where the last inequality follows from Lemma B.2 (i). Since $\varepsilon_n$ converges at a polynomial rate, we conclude that the second term in (B.15) converges faster than the first term.
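The trimmed plug-in rule analyzed in this proof can be written compactly in code. The sketch below is an illustrative implementation consistent with the expressions above: τ̂^e_i is the inverse-propensity-weighted contrast with ê(X_i) plugged in when ê(X_i) ∈ [ε_n, 1 − ε_n], and 0 otherwise. The function name is hypothetical, not from the paper.

```python
import numpy as np

def tau_e_hat(y, d, e_hat, eps_n):
    """Trimmed plug-in of estimated propensity scores into the IPW weights,
    as used in the e-hybrid rule:
      tau_e_i = Y_i D_i / e_hat_i - Y_i (1 - D_i) / (1 - e_hat_i)
    when e_hat_i lies in [eps_n, 1 - eps_n], and 0 otherwise."""
    keep = (e_hat >= eps_n) & (e_hat <= 1 - eps_n)   # trimming indicator
    tau = np.zeros_like(y, dtype=float)              # trimmed entries stay 0
    tau[keep] = (y[keep] * d[keep] / e_hat[keep]
                 - y[keep] * (1 - d[keep]) / (1 - e_hat[keep]))
    return tau
```

Trimming bounds each |τ̂^e_i| by M/ε_n when outcomes are bounded by M, which is what keeps the second term in (B.15) controllable as ε_n shrinks at a polynomial rate.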
References
Abadie, A., J. Angrist, and G. Imbens (2002): "Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings," Econometrica, 70, 91–117.
Abadie, A., M. M. Chingos, and M. R. West (2014): "Endogenous Stratification in Randomized Experiments," unpublished manuscript.
Armstrong, T. B. and S. Shen (2014): "Inference on Optimal Treatment Assignments," unpublished manuscript.
Audibert, J.-Y. and A. B. Tsybakov (2007): "Fast Learning Rates for Plug-in Classifiers," The Annals of Statistics, 35, 608–633.
Bhattacharya, D. and P. Dupas (2012): "Inferring Welfare Maximizing Treatment Assignment under Budget Constraints," Journal of Econometrics, 167, 168–196.
Bloom, H. S., L. L. Orr, S. H. Bell, G. Cave, F. Doolittle, W. Lin, and J. M. Bos (1997): "The Benefits and Costs of JTPA Title II-A Programs: Key Findings from the National Job Training Partnership Act Study," Journal of Human Resources, 32, 549–576.
Boucheron, S., G. Lugosi, and P. Massart (2013): Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press.
Bousquet, O. (2002): "A Bennett Concentration Inequality and its Application to Suprema of Empirical Processes," Comptes Rendus de l'Académie des Sciences - Series I, 334, 495–500.
Chamberlain, G. (2011): "Bayesian Aspects of Treatment Choice," in The Oxford Handbook of Bayesian Econometrics, ed. by J. Geweke, G. Koop, and H. van Dijk, Oxford University Press, 11–39.
Dehejia, R. H. (2005): "Program Evaluation as a Decision Problem," Journal of Econometrics, 125, 141–173.
Devroye, L., L. Györfi, and G. Lugosi (1996): A Probabilistic Theory of Pattern Recognition, Springer.
Devroye, L. and G. Lugosi (1995): "Lower Bounds in Pattern Recognition and Learning," Pattern Recognition, 28, 1011–1018.
Dudley, R. (1999): Uniform Central Limit Theorems, Cambridge University Press.
Elliott, G. and R. P. Lieli (2013): "Predicting Binary Outcomes," Journal of Econometrics, 174, 15–26.
Florios, K. and S. Skouras (2008): "Exact Computation of Max Weighted Score Estimators," Journal of Econometrics, 146, 86–91.
Heckman, J. J., H. Ichimura, and P. E. Todd (1997): "Matching As An Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme," The Review of Economic Studies, 64, 605–654.
Hirano, K. and J. R. Porter (2009): "Asymptotics for Statistical Treatment Rules," Econometrica, 77, 1683–1701.
Kanaya, S. and D. Kristensen (2014): "Optimal Sampling and Bandwidth Selection for Kernel Estimators of Diffusion Processes," unpublished manuscript.
Kasy, M. (2014): "Using Data to Inform Policy," unpublished manuscript.
Kerkyacharian, G., A. B. Tsybakov, V. Temlyakov, D. Picard, and V. Koltchinskii (2014): "Optimal Exponential Bounds on the Accuracy of Classification," Constructive Approximation, 39, 421–444.
Lieli, R. P. and H. White (2010): "The Construction of Empirical Credit Scoring Rules Based on Maximization Principles," Journal of Econometrics, 157, 110–119.
Lugosi, G. (2002): "Pattern Classification and Learning Theory," in Principles of Nonparametric Learning, ed. by L. Györfi, Vienna: Springer, 1–56.
Mammen, E. and A. B. Tsybakov (1999): "Smooth Discrimination Analysis," Annals of Statistics, 27, 1808–1829.
Manski, C. F. (1975): "Maximum Score Estimation of the Stochastic Utility Model of Choice," Journal of Econometrics, 3, 205–228.
——— (2004): "Statistical Treatment Rules for Heterogeneous Populations," Econometrica, 72, 1221–1246.
Manski, C. F. and T. Thompson (1989): "Estimation of Best Predictors of Binary Response," Journal of Econometrics, 40, 97–123.
Massart, P. and E. Nédélec (2006): "Risk Bounds for Statistical Learning," The Annals of Statistics, 34, 2326–2366.
Qian, M. and S. A. Murphy (2011): "Performance Guarantees for Individualized Treatment Rules," The Annals of Statistics, 39, 1180–1210.
Stoye, J. (2009): "Minimax Regret Treatment Choice with Finite Samples," Journal of Econometrics, 151, 70–81.
——— (2012): "Minimax Regret Treatment Choice with Covariates or with Limited Validity of Experiments," Journal of Econometrics, 166, 138–156.
Tetenov, A. (2012): "Statistical Treatment Choice Based on Asymmetric Minimax Regret Criteria," Journal of Econometrics, 166, 157–165.
Tsybakov, A. B. (2004): "Optimal Aggregation of Classifiers in Statistical Learning," The Annals of Statistics, 32, 135–166.
——— (2009): Introduction to Nonparametric Estimation, Springer.
van der Vaart, A. W. and J. A. Wellner (1996): Weak Convergence and Empirical Processes, Springer.
Vapnik, V. N. (1998): Statistical Learning Theory, John Wiley & Sons.
Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012): "Estimating Individualized Treatment Rules Using Outcome Weighted Learning," Journal of the American Statistical Association, 107, 1106–1118.