Doubly robust uniform confidence band for the conditional average treatment

advertisement
Doubly robust uniform
confidence band for the
conditional average treatment
effect function
Sokbae Lee
Ryo Okui
Yoon-Jae Wang
The Institute for Fiscal Studies
Department of Economics, UCL
cemmap working paper CWP03/16
DOUBLY ROBUST UNIFORM CONFIDENCE BAND FOR THE
CONDITIONAL AVERAGE TREATMENT EFFECT FUNCTION
SOKBAE LEE1,2 , RYO OKUI3,4 , AND YOON-JAE WHANG1
Abstract. In this paper, we propose a doubly robust method to present the heterogeneity of the average treatment effect with respect to observed covariates of
interest. We consider a situation where a large number of covariates are needed for
identifying the average treatment effect but the covariates of interest for analyzing
heterogeneity are of much lower dimension. Our proposed estimator is doubly robust and avoids the curse of dimensionality. We propose a uniform confidence band
that is easy to compute, and we illustrate its usefulness via Monte Carlo experiments
and an application to the effects of smoking on birth weights.
Keywords: average treatment effect conditional on covariates, uniform confidence
band, double robustness, Gaussian approximation.
JEL classification codes: C14, C21
1
Department of Economics, Seoul National University, 1 Gwanak-ro, Gwanak-gu,
Seoul, 151-742, Republic of Korea.
2
Centre for Microdata Methods and Practice, Institute for Fiscal Studies, 7 Ridgmount Street, London, WC1E 7AE, UK.
3
Institute of Economic Research, Kyoto University, Yoshida-Honmachi Sakyo, Kyoto, 606-8501, Japan.
4
Department of Econometrics & OR, VU University Amsterdam, De Boelelaan 1105,
1081 HV Amsterdam, The Netherlands.
E-mail addresses: sokbae@gmail.com, okui.ryo.3@gmail.com, whang@snu.ac.kr.
Date: This version: January 2016; the first draft: November 2014.
We would like to thank Yanchun Jin for capable research assistance and Takahiro Hoshino, YuChin Hsu, Kengo Kato, Edward Kennedy, Taisuke Otsu, and Dylan Small for helpful comments.
Lee’s work was supported by the European Research Council (ERC-2009-StG-240910-ROMETA).
Okui’s work was supported by the Japan Society of the Promotion of Science (KAKENHI 25285067,
25780151, 15H03329). Whang’s work was supported by the SNU College of Social Science Research
Grant.
1
2
LEE, OKUI, AND WHANG
1. Introduction
In this paper, we propose a doubly robust method to present the heterogeneity
of the average treatment effect with respect to observed covariates of interest. To
describe our methodology, we consider the potential outcome framework. Let Y1
and Y0 be potential individual outcomes in two states, with treatment and without
treatment, respectively. For each individual, the observed outcome Y is Y = DY1 +
(1 − D)Y0 , where D denotes an indicator variable for the treatment, with D = 0 if
an individual is not treated and D = 1 if an individual is treated. We assume that
independent and identically distributed observations {(Yi , Di , Zi ) : i = 1, . . . , n} of
(Y, D, Z) are available, where Z ∈ Rp denotes a p-dimensional vector of covariates.
Suppose that a researcher is interested in evaluating the average treatment effect
conditional on only a subset of covariates X, which is of a substantially lower dimension than Z, where Z ≡ (X> , V> )> ∈ Rd × Rm , p ≡ d + m. That is, we are interested
in a case where d p.
The main object of interest in this paper is the conditional average treatment effect
function (CATEF); namely:
(1.1)
g(x) ≡ E[Y1 − Y0 |X = x].
When d ≥ 3, it is difficult to plot g(x), not to mention low precision due to the curse
of dimensionality. Hence, for practical reasons, we focus on the case that d = 1 or
d = 2, while p is often of a much higher dimension.
To achieve identification of the CATEF, we assume that Y1 and Y0 are independent
of D conditional on Z (known as the unconfoundedness assumption):
(1.2)
(Y1 , Y0 ) ⊥ D|Z,
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
3
where ⊥ denotes the independence. For (1.2) to be plausible in applications, applied
researchers tend to consider a large number of covariates Z. Note that in our setup,
the treatment may be confounded in the sense that the treatment assignment may
not be independent of the potential outcome variables given X only. To satisfy the
unconfoundedness condition, a much larger set of conditioning variables Z needs to
be employed.
Different roles of covariates between X and V are noted in the recent literature.
For example, Ogburn, Rotnitzky, and Robins (2015) consider a similar issue in the
context of the local average treatment effect (LATE) of Imbens and Angrist (1994).
Ogburn, Rotnitzky, and Robins (2015) emphasize that conditioning on a large number of covariates Z may be required to make it plausible that the binary instrument is
valid. In their empirical example, Ogburn, Rotnitzky, and Robins (2015) revisit the
analyses of Poterba, Venti, and Wise (1995) and Abadie (2003) to examine whether
participation in 401(k) pension plans increases household savings. In their example,
the vector of covariates Z for the identifying assumption consists of income, age, marital status, and family size, whereas the variable of interest X is income. Abrevaya,
Hsu, and Lieli (2015) also consider the case of investigating the effect of smoking
during pregnancy on birth weights. They are interested in estimating (1.1) with X
being the age of mother; however, as noted in Abrevaya, Hsu, and Lieli (2015), it is
unlikely that conditioning only on the age of mother would achieve the unconfoundedness assumption with nonexperimental data. As a result, it is necessary to consider
a high-dimensional Z, including the age of the mother.
The fact that a high-dimensional Z needs to be employed for (1.2) to be plausible in
an application makes a fully nonparametric estimation approach impractical because
of the curse of dimensionality. For example, the propensity score is not nonparametrically estimable in moderately sized samples, if the dimension of Z is high. One
4
LEE, OKUI, AND WHANG
obvious alternative is to use a parametric model for the propensity score; however, it
may lead to misleading results if the parametric model is misspecified.
With the aim of providing a practical method and, at the same time, reducing
sensitivity to model misspecification, we propose to use a doubly robust method
based on parametric regression and propensity score models. Our estimator of the
CATEF is doubly robust in the sense that it is consistent when at least one of the
regression model and the propensity score model is correctly specified. Specifically, we
first estimate CATEF(Z) using a doubly robust procedure: we estimate a parametric
regression model of the outcome on Z for each treatment status and a parametric
model for the probability of selecting into the treatment given Z; we then combine
the parametric estimation results in a doubly robust fashion to construct an estimate
of CATEF(Z). We then obtain an estimate of CATEF(X) by adopting the local
linear smoothing of CATEF(Z). As a result, we avoid high-dimensional smoothing
with respect to Z but mitigate the problem of misspecification by both the doubly
robust estimation and low-dimensional smoothing with respect to X.
We emphasize that we are willing to assume parametric specifications for the
propensity score and regression models as functions of Z to avoid the curse of dimensionality, but not for CATEF(X). One may consider parametric estimation of
CATEF(X), as Ogburn, Rotnitzky, and Robins (2015) estimate their LATE parameter using least squares approximations. However, note that even if the parametric
specification of CATEF(Z) is correct, the resulting specification of CATEF(X) may
not be correctly specified since E[Z|X] is possibly highly nonlinear. To avoid this
misspecification, we estimate CATEF(X) nonparametrically.
Because the CATEF is a functional parameter, as a tool of inference, we propose to
use a uniform confidence band for the CATEF. Our construction of the uniform confidence band is based on some analytic approximation of the supremum of a Gaussian
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
5
process using arguments built on Piterbarg (1996), combined with a Gaussian approximation result of Chernozhukov, Chetverikov, and Kato (2014) and an empirical
process result of Ghosal, Sen, and van der Vaart (2000). Our method is simple to
implement and does not rely on resampling techniques.
This paper contributes to the literature on doubly robust estimation by demonstrating that the doubly robust procedures are useful for estimating the CATEF. In
this paper, we focus on the so-called augmented inverse probability weighting estimator that was originally proposed by Robins, Rotnitzky, and Zhao (1994) for the
estimation of the mean (see also Robins and Rotnitzky, 1995; Scharfstein, Rotnitzky,
and Robins, 1999). Their estimator appears to be the first estimator to be recognized
as being doubly robust. Since then, many other alternative doubly robust estimators
have been proposed in the literature. For example, the inverse probability weighting
regression adjustment estimator (Kang and Schafer, 2007; Wooldridge, 2007, 2010)
is widely known and has been implemented in statistical software packages. See the
introduction of Tan (2010) for a comprehensive summary of other doubly robust estimators. Doubly robust estimators have been advocated for use in many different
areas of application: See, for example, Lunceford and Davidian (2004) for medicine,
Glynn and Quinn (2010) for political science, Wooldridge (2010) for economics, and
Schafer and Kang (2008) for psychology. There are also doubly robust estimators
available for different settings including instrumental variables estimation (Tan, 2006;
Okui, Small, Tan, and Robins, 2012) and estimation under multivalued treatments
(Derya Uysal, 2015). It would not be difficult to extend our method to allow these
other doubly robust estimators and to consider different settings. However, to keep
the analysis simple, in this paper, we focus on the augmented inverse probability
weighting estimator of the CATEF.
6
LEE, OKUI, AND WHANG
The CATEF is mathematically equivalent to “V -adjusted variable importance” of
van der Laan (2006), who proposes it as a measure of variable importance in prediction. van der Laan (2006) proposes a doubly robust estimator of V -adjusted variable
importance. Contrary to ours, he considers the projection of the V -adjusted variable
importance on a parametric working model and does not consider a nonparametric
estimation. Moreover, a uniform confidence band is not examined in van der Laan
(2006).
In a recent paper, Abrevaya, Hsu, and Lieli (2015) consider the estimation of the
CATEF1; however, there are two main differences of this paper relative to Abrevaya,
Hsu, and Lieli (2015). First, we propose the doubly robust procedure to estimate the
CATEF. Abrevaya, Hsu, and Lieli (2015) consider the inverse probability weighting
estimator. The inverse probability weighting estimator suffers from model misspecification when the propensity score model is misspecified and from the curse of dimensionality when it is estimated nonparametrically. Second, we present a method to
construct a uniform confidence band, whereas Abrevaya, Hsu, and Lieli (2015) only
provide a pointwise confidence interval.
The remainder of the paper is organized as follows. Section 2 presents the doubly
robust estimation method, Section 3 gives an informal description of how to construct
a two-sided, symmetric uniform confidence band when the dimension of X is one, and
Section 4 deals with a general case and provides formal theoretical results. In Section
5, the results of Monte Carlo simulations demonstrate that in finite samples, our
doubly robust estimator works well, and the proposed confidence band has desirable
coverage properties. Section 6 gives an empirical application, and Section 7 concludes.
The proofs are contained in Appendix A.
1Our
paper is independent of Abrevaya, Hsu, and Lieli (2015) and it is started without knowing
their work.
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
7
2. Doubly Robust Estimation of the Average Treatment Effect
Conditional on Covariates of Interest
In this section, a doubly robust method for estimating the CATEF is proposed. We
first estimate the CATEF for all the covariates using a doubly robust method. We
then obtain the CATEF for the covariates of interest using a nonparametric approach.
Define:
π(z) ≡ E [D|Z = z] ,
µj (z) ≡ E [Y |Z = z, D = j] for j = 0, 1,
where π(z) is the propensity score and µj (z) for j = 0, 1 are called regression functions. Note that µj (z) = E(Yj |Z = z) for j = 0, 1 under unconfoundedness. Let
π(z, β) and µj (z, αj ) for j = 0, 1 denote parametric models of π(z) and µj (z), respectively. (µj (z, αj ) may also be called “marginal structural models” of Robins (2000).)
A doubly robust procedure requires that either π(z) or µj (z) for j = 0, 1 should
be correctly specified, thereby allowing for misspecification in π(z) or in µj (z). Let
>
>
θ0 ≡ (α10
, α00
, β0> )> denote the vector of true or pseudo-true parameter values that
optimize some criterion functions.
We consider the augmented inverse probability weighting approach. Let:
ψ1 (W, α1 , β) ≡
DY
D − π(Z, β)
−
µ1 (Z, α1 ),
π(Z, β)
π(Z, β)
ψ0 (W, α0 , β) ≡
(1 − D) Y
D − π(Z, β)
+
µ0 (Z, α0 ),
1 − π(Z, β)
1 − π(Z, β)
ψ(W, θ) ≡ ψ1 (W, α1 , β) − ψ0 (W, α0 , β),
where W ≡ (Y, Z> )> and θ ≡ (α1> , α0> , β > )> . The first terms in ψ1 (W, α1 , β) and
ψ0 (W, α0 , β) correspond to inverse probability weighting. The second terms are augmented terms that make the procedure doubly robust.
8
LEE, OKUI, AND WHANG
The following lemma gives regularity conditions under which g(x) is identified.
Lemma 1 (Identification of the CATEF). Assume that (1.2) holds and 0 < π(Z, β0 ) <
1 almost surely. Suppose that either there exists β0 such that E [D|Z] = π(Z, β0 ) almost surely or there exist α10 and α00 such that E [Y1 |Z] = µ1 (Z, α10 ) and E [Y0 |Z] =
µ0 (Z, α00 ) almost surely. Then:
g(x) = E [ψ(W, θ0 )|X = x] .
Lemma 1 suggests that one may estimate g(x) by running the nonparametric regression of ψ(W, θ̂) on Xi , where θ̂ is a consistent parametric estimator of θ0 . Moreover,
this lemma implies that the CATEF can be identified through ψ(W, θ0 ) if either the
regression models (µ1 (z, α1 ) and µ0 (z, α0 )) or the propensity score model (π(z, β)) is
correctly specified (or both). That is, even if µ1 (z, α1 ) and µ0 (z, α0 ) do not represent the true conditional expectation functions, provided that π(z, β) is correct, the
CATEF is identified. Similarly, even if π(z, β) is misspecified, provided that µ1 (z, α1 )
and µ0 (z, α0 ) are correct, the CATEF is identified.
2.1. Parametric Estimation of θ. For concreteness, we consider the following estimation procedure for θ0 . However, how θ0 is estimated does not alter our results
provided that the rate of convergence is sufficiently fast so that Assumption 1(7) given
below is satisfied. For each j = 0, 1, we estimate αj by least squares:
(2.1)
α̂j ≡ argmin
α
n
X
Dij (1 − Di )1−j (Yi − µj (Zi , α))2 .
i=1
We estimate β by maximum likelihood (e.g., probit or logit):
(2.2)
β̂ ≡ argmax
β
n
X
i=1
(Di log π(Zi , β) + (1 − Di ) log(1 − π(Zi , β))) .
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
9
Remark 1. When the dimension of Z is not too high, an alternative to parametric
estimation of ψ(W, θ0 ) is to estimate its nonparametric counterpart via local polynomial estimators as in Rothe and Firpo (2014). However, this would not work when the
dimension of Z is sufficiently high (see related remarks in Rothe and Firpo (2014)).
The latter is the case we focus on in the paper.
2.2. Local Linear Estimation of g. We consider a local linear estimator of g(x).2
Assume that g(x) is twice continuously differentiable. For each x = (x1 , . . . , xd ), the
local linear estimator of g(x) can be obtained by minimizing:
Sn (γ) ≡
n h
X
ψ(Wi , θ̂) − γ0 −
γ1>
i=1
i2 X − x i
(Xi − x) K
hn
with respect to γ ≡ (γ0 , γ1> )> ∈ Rd+1 , where K(·) is a kernel function on Rd and
hn is a sequence of bandwidths. More specifically, let ĝ(x) = e>
1 γ̂(x), where γ̂(x) ≡
arg minγ∈Rd+1 Sn (γ) and e1 is a column vector whose first entry is one, and the rest
are zero. Because θ̂ can be estimated at a rate of n−1/2 , there is no estimation effect
from the first stage, implying that we can carry out inference as if θ0 were known.
3. An Informal Description of a Uniform Confidence Band
In this section, we provide an informal description of how to construct a two-sided,
symmetric uniform confidence band. For simplicity, we focus on the leading case
where d = 1. Let I ≡ [a, b] denote an interval of interest for which we build a
uniform confidence band. Assume that I is a subset of the support of X. We use
nonbold x to mean that x is one-dimensional.
Algorithm. Carry out the following steps to construct a (1 − α) uniform confidence
band.
2In
this paper, we consider cases in which x is continuous. When x is discreet, the CATEF can be
estimated by the sample average of ψ(W, θ̂) using the sub-sample for each value of x. Moreover,
constructing a confidence band is standard when x takes a finite number of values.
10
LEE, OKUI, AND WHANG
(1) Obtain ĝ(x) using a local linear estimator with a bandwidth hn such that:
hn = b
h × n1/5 × n−2/7 ,
where b
h is a commonly used optimal bandwidth in the literature (for example,
the plug-in method of Ruppert, Sheather, and Wand (1995)).
(2) Obtain the pointwise standard error ŝ(x)/(nhn )1/2 of ĝ(x) by constructing a
feasible version of the asymptotic standard error formula:
ŝ(x)
≡
(nhn )1/2
−1
[nhn fX (x)]
Z
1/2
K (u)du σ (x)
,
2
2
where fX is the density of X and σ 2 (x) is the conditional variance function.
(3) To compute a critical value c(1 − α), define:
λ≡
Let:
−
R
K(u)K 00 (u)du
R
.
K 2 (u)du
1/2
λ1/2
−1
an ≡ an (I) = 2 log(hn (b − a)) + 2 log
.
2π
Now set the critical value by:
1/2
c(1 − α) ≡ a2n − 2 log{log[(1 − α)−1/2 ]}
.
(4) For each x ∈ I, we set the two-sided symmetric confidence band:
ŝ(x)
ŝ(x)
ĝ(x) − c(1 − α) √
≤ g(x) ≤ ĝ(x) + c(1 − α) √
.
nhn
nhn
We make some remarks on the proposed algorithm. In step (1), the factor n1/5 ×
n−2/7 is multiplied in the definition of hn to ensure that the bias is asymptotically negligible by undersmoothing. In step (2), one can estimate fX and σ 2 (x) ≡
Var [ψ(W, θ0 )|X = x] using the standard kernel density and regression estimators
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
11
with the same kernel function K(·) and the same bandwidth h and also with an estimator of θ0 . In step (3), we may restrict the bandwidth such that hn (b−a) (which
is satisfied asymptotically), thereby imposing the condition that log(h−1
n (b−a)) is positive. The critical value proposed in step (3) is strictly positive if α is not too close
to one or if n is large enough.
Remark 2. We may compare our proposal with the critical value based on the (1 − α)
quantile of the Gumbel distribution, which is given by:
− log{log[(1 − α)−1/2 ]}
c∞ (1 − α) ≡ an +
.
an
Note that:
− log{log[(1 − α)−1/2 ]}
c∞ (1 − α) − c(1 − α) =
an
2
,
which is strictly positive for small α but converges to zero as an diverges. Hence, we
expect that in finite samples, the confidence band based on c∞ (1 − α) is too wide
and has higher coverage probability than the nominal level. It is shown in the next
section that the critical value based on the Gumbel distribution is accurate only up
to the logarithmic rate, where our proposed critical value is precise in a polynomial
rate. This is because our proposal uses a higher-order expansion of Piterbarg (1996),
whose approximation error is of a polynomial rate. See Theorem 2 in Section 4 for
details.
Remark 3. Our construction of critical values is based on a simple analytic method
that is easy to compute. Alternatively, one may rely on bootstrap methods to compute
critical values for the uniform confidence band. For example, see Claeskens and Keilegom (2003) for smoothed bootstrap confidence bands and Chernozhukov, Lee, and
Rosen (2013) for multiplier bootstrap confidence bands. Chernozhukov, Chetverikov,
and Kato (2013) show that in general settings including high dimensional models,
12
LEE, OKUI, AND WHANG
Gaussian multiplier bootstrap methods yield critical values for which the approximation error decreases polynomially in the sample size. Roughly speaking, both
our simple analytic correction and multiplier bootstrap methods yield critical values
that are accurate at polynomial rates. A refined theoretical analysis is necessary to
determine which type of the critical value is better asymptotically.
Remark 4. The proposed confidence band can be used to test whether the CATEF
is constant. Suppose that our null hypothesis is that g(x) is constant in I. This null
hypothesis can be written as g(x) = gI , where gI = E[g(x)|x ∈ I]. Since gI can be
√
estimated at the parametric ( n ) rate and the estimator thus converges faster than
ĝ(x), we can ignore the estimation error in ĝ(x). We reject the constancy of g(x), if
the confidence band does not include the estimate of gI for some x ∈ I.
4. Asymptotic Theory
In this section, we establish asymptotic theory. Let U ≡ ψ(W, θ0 ) − g(X) and let
Ui ≡ ψ(Wi , θ0 ) − g(Xi ) for i = 1, . . . , n. Let ŝ2 (x) be the estimator of the asymptotic
variance of ĝ(x). Let s2n (x) denote the population version of the asymptotic variance
of the estimator:
s2n (x)
1
U2
X−x
2
K
≡ dE 2
.
hn
fX (x)
hn
Assume that the d-dimensional kernel function is the product of d univariate kernel
Q
functions. That is, K(s) = dj=1 K(sj ), where s ≡ (s1 , . . . , sd ) is a d-dimensional
Q
vector and K(·) is a kernel function on R. Let ρd (s) = dj=1 ρ(sj ), where:
R
(4.1)
ρ(sj ) ≡
K(u)K(u − sj )du
R
,
K 2 (u)du
for each j. We make the following assumptions.
Assumption 1. Let d < 4.
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
(1) I ≡
Qd
j=1 [aj , bj ],
13
where aj < bj , j = 1, . . . , d, and I is a strict subset of the
support of X.
(2) The distribution of X has a bounded Lebesgue density fX (·) on Rd . Furthermore, fX (·) is bounded below from zero with continuous derivatives on I.
(3) The density of U is bounded, E[U 2 |X = x] is continuous on I, and supx∈Rd E[U 4 |X =
x] < ∞.
(4) g(·) is twice continuously differentiable on I.
Q
(5) K(s) = dj=1 K(sj ), where K(·) is a kernel function on R that has finite
R1
R1
support on [−1, 1], −1 uK(u)du = 0, −1 K(u)du = 1, symmetric around
zero, and six times differentiable.
(6) hn = Cn−η , where C and η are positive constants such that η < 1/(2d) and
η > 1/(d + 4).
(7) inf n≥1 inf x∈I sn (x) > 0 and sn (x) is continuous for each n ≥ 1. Furthermore,
x 7→ E [U 2 |X = x] fX (x) is Lipschitz continuous.
(8) There exists an estimator ŝ2 (x) such that
sup ŝ2 (x) − s2n (x) = Op (n−c )
x∈I
for some constant c > 0.
n
o
(9) max (nhdn )1/2 |ψ(Wi , θ̂) − ψ(Wi , θ0 )| : i = 1, . . . , n = Op (n−c ) for some constant c > 0.
Most of the assumptions are standard. One of the bandwidth conditions in hn
(that is, η > 1/(d + 4)) imposes undersmoothing, so that we can ignore the bias
asymptotically. The rule-of-thumb bandwidth proposed in Section 3 satisfies the
required rate conditions.
Remark 5. Note that d < 4 is necessary to ensure that η < 1/(2d) and η > 1/(d + 4)
can hold jointly. It is possible to extend our asymptotic theory to the case that
14
LEE, OKUI, AND WHANG
d ≥ 4 using a higher-order local polynomial estimator under stronger smoothness
conditions. In this paper, we limit our attention to the local linear estimator since
we are mainly interested in low dimensional x.
Remark 6. An estimator of ŝ2 (x) is readily available. For example, we may set
n
X−x
1 X Ûi2
2
K
ŝ (x) = d
,
2
hn i=1 fˆX
hn
(x)
2
where Ûi ≡ ψ(Wi , θ̂) − ĝ(Xi ) and fˆX (·) is the kernel density estimator. Alternatively,
since
s2n (x)
σ 2 (x)
→
fX (x)
Z
K2 (u) du, as n → ∞,
where σ 2 (x) ≡ E [U 2 |X = x], we may use
σ̂ 2 (x)
ŝ (x) =
fˆX (x)
2
Z
K2 (u) du,
with a suitable nonparametric estimator σ̂ 2 (x) of σ 2 (x) such as the kernel regression
estimator using {(Ûi2 , Xi ) : i = 1, . . . , n}. For either estimator, it is straightforward to
verify condition (8) of Assumption 1 using the standard results in kernel estimation.
Remark 7. Condition (9) of Assumption 1 is satisfied, for example, if kθ̂ − θ0 k =
Op (n−1/2 ), functions β 7→ π(Z, β) and αj 7→ µj (Z, αj ), j = 0, 1, are Lipschitz continuous, π(Z, β0 ) is bounded between and 1 − for some constant > 0, provided that
some weak moment conditions on (Y, Z) hold.
Let an ≡ an (I) be the largest solution to the following equation:
(4.2)
mes(I)hn −d λd/2 (2π)−(d+1)/2 and−1 exp(−a2n /2) = 1,
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
where mes(I) is the Lebesgue measure of I; that is, mes(I) =
(4.3)
λ=
−
R
15
Qd
j=1 (bj
− aj ) and:
K(u)K 00 (u)du
R
.
K 2 (u)du
The following is the main theoretical result of our paper.
Theorem 2. Let Assumption 1 hold. Then there exists κ > 0 such that, uniformly
in t, on any finite interval:
ĝ(x) − g(x) P an max − an < t =
x∈I
ŝ(x)
(4.4)
d−2m−1
[(d−1)/2]
X
t
−t−t2 /2a2n
−2m
exp −2e
+ O(n−κ ),
hm,d−1 an
1+ 2
an
m=0
as n → ∞, where hm,d−1 ≡
(−1)m (d−1)!
m!2m (d−2m−1)!
and and [·] is the integer part of a number.
Notice that the approximation error is of a polynomial rate. As a result, a critical
value based on the leading term of the right-hand side of (4.4) provides a better
approximation than one based on the Gumbel approximation. The result in Theorem
2 may be of independent interest for constructing the uniform confidence band in
nonparametric regression beyond the scope of estimating the CATEF in our context.
Remark 8. In a setting different from here, Lee, Linton, and Whang (2009) propose
analytic critical values based on Piterbarg (1996) in order to test for stochastic monotonicity, compare its performance with the bootstrap critical values in their Monte
Carlo experiments, and find that both perform well in finite samples. However, the
discussions in Lee, Linton, and Whang (2009) are informal and rely on the results
of Monte Carlo experiments without the formal proof of establishing the polynomial
approximation error.
16
LEE, OKUI, AND WHANG
The conclusion of the theorem can be simplified for special cases. In particular, if
d = 1, then:
ĝ(x) − g(x) −t−t2 /2a2n
P an max + O(n−κ ),
−
a
<
t
=
exp
−2e
n
x∈I
ŝ(x)
where an is the largest solution to mes(I)hn −1 λ1/2 (2π)−1 exp(−a2n /2) = 1. Also, if
d = 2, then:
P an
ĝ(x) − g(x) t
−t−t2 /2a2n
max 1 + 2 + O(n−κ ),
− an < t = exp −2e
x∈I
ŝ(x)
an
where an is the largest solution to mes(I)hn −2 λ2 (2π)−3/2 an exp(−a2n /2) = 1.
4.1. Construction of critical values. We use the leading term on the right-hand
side of (4.4) as a distribution-like function to construct a uniform confidence band.
For example, if d = 1, we may construct a critical value c(1 − α) that satisfies:
Fn,1 (c) ≥ 1 − α,
where Fn,1 (t) ≡ exp −2e
−t−t2 /2a2n
. This yields the critical value presented in the
Algorithm of Section 3. Similarly, if d = 2, we can use:
Fn,2 (c) ≥ 1 − α,
−t−t2 /2a2n
where Fn,2 (t) ≡ exp −2e
1+
t
a2n
. In finite samples, it might be useful to
impose monotonicity of Fn,j (·) by rearrangement (see, e.g., Chernozhukov, FerndádezVal, and Galichon (2009)).
Remark 9. Theorem 2 implies that:
ĝ(x) − g(x) −t
lim P an max −
a
<
t
=
exp
−2e
.
n
n→∞
x∈I
ŝ(x)
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
17
Thus, one may construct analytical critical values based on the Gumbel distribution.
However, this approximation is accurate only up to the logarithmic rate in view of
Theorem 2.
5. Monte Carlo Experiments
In this section, we present the results of Monte Carlo experiments. These experiments are conducted to see the finite sample performances of the proposed doubly
robust estimator and the proposed uniform confidence band. The simulations are
conducted by R 3.1.2 with Windows 7. The number of replications is 5000.
5.1. Data generating process. The data generating process follows the potential
outcome framework, and we borrow the design used by Abrevaya, Hsu, and Lieli
(2015). The notations for the variables are the same as those used in the theoretical
part of the paper. Recall that p is the number of covariates (i.e., the dimension of
Z). We consider two cases: p = 2 and p = 4. We explain the data generating process
for each value of p.
The data generating process for p = 2 is the following. The vector of covariates
Z = (X1 , X2 )> is generated by:
X1 = 1 ,
X2 = (1 + 2X1 )2 (−1 + X1 )2 + 2 ,
where i ∼ U [−0.5, 0.5] for i = 1, 2 and 1 and 2 are independent. The potential
outcomes are generated by:
Y1 = X1 X2 + v,
Y0 = 0,
where v ∼ N (0, 0.252 ) and v is independent of (1 , 2 ). The treatment status D is
generated by:
D = 1{Λ(X1 + X2 ) > U },
18
LEE, OKUI, AND WHANG
where U ∼ U [0, 1], U is independent of (1 , 2 , v) and Λ is the logistic function. Thus,
the propensity score is π(Z) = Λ(X1 + X2 ). The observed outcome is Y = DY1 .
Next, we present the data generating process for p = 4. The covariates Z =
(X1 , X2 , X3 , X4 )> are generated by:
X1 = 1 ,
X2 = 1 + 2X1 + 2 ,
X3 = 1 + 2X1 + 3 ,
X4 = (−1 + X1 )2 + 4 ,
where i ∼ U [−0.5, 0.5] for i = 1, 2, 3, 4 and 1 , 2 , 3 and 4 are independent. The
potential outcomes are generated by:
Y1 = X1 X2 X3 X4 + v,
Y0 = 0,
where v ∼ N (0, 0.252 ) and v is independent of (1 , 2 , 3 , 4 ). The treatment status D
is generated by:
D = 1{Λ(0.5(X1 + X2 + X3 + X4 )) > U },
where U ∼ U [0, 1] and U is independent of (1 , 2 , 3 , 4 v). The propensity score is
π(Z) = Λ(0.5(X1 + X2 + X3 + X4 )). The observed outcome is Y = DY1 .
The parameter of interest is the CATEF for X = X1 . In our specification, the
CATEF is the same in both p = 2 and p = 4 cases. The CATEF can be written as:
CAT EF (x1 ) = x1 (1 + 2x1 )2 (−1 + x1 )2 .
We examine the performance of various statistical procedures regarding this CATEF.
5.2. Model specification. To estimate and conduct statistical inferences on CAT EF (x1 )
using our doubly robust procedure, we need to specify a model for the regression µj (z)
for j = 0, 1 and a model for the propensity score π(z). We consider two regression
models and two propensity score models. One of two models is correctly specified,
but the other model is misspecified. We note that our doubly robust procedure is
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
19
predicted to work well provided that at least one of the regression models and the
propensity score models is correctly specified.
We first discuss the model specifications for p = 2. The first regression model is:
µ1 (z, α1 ) = α10 + α11 X1 X2 ,
µ0 (z, α0 ) = α00 + α01 X1 X2 .
This model is correctly specified. The coefficients are estimated by OLS using X1 X2
as the explanatory variable. The second regression model, which is misspecified, is:
µ1 (z, α1 ) = α10 + α11 X1 + α12 X2 ,
µ0 (z, α0 ) = α00 + α01 X1 + α02 X2 .
The model is estimated by OLS using X1 and X2 as explanatory variables.
We also consider two models for propensity score. When p = 2, the first model is
correctly specified and it is:
π(z, β) = Λ(β0 + β1 X1 + β2 X2 ).
The model is estimated by maximum likelihood. The misspecified model is:
π(z, β) = (Λ(β0 + β1 X1 + β2 X2 ))2 .
This misspecified propensity score is computed by taking the square of the estimated
correctly specified propensity score.
The model specifications for p = 4 are similar to those for p = 2 but include more
covariates. The first regression model (correctly specified) is:
µ1 (z, α1 ) = α10 + α11 X1 X2 X3 X4 ,
µ0 (z, α0 ) = α00 + α01 X1 X2 X3 X4 .
20
LEE, OKUI, AND WHANG
The model is estimated by OLS using X1 X2 X3 X4 as the explanatory variable. The
misspecified version is:
µ1 (z, α1 ) = α10 + α11 X1 + α12 X2 + α13 X3 + α14 X4 ,
µ0 (z, α0 ) = α00 + α01 X1 + α02 X2 + α03 X3 + α04 X4 .
The model is estimated by OLS using X1 , X2 , X3 and X4 as explanatory variables.
For the propensity score, we first estimate the correctly specified logit model:
π(z, β) = Λ(β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 ).
We then compute a misspecified propensity score by taking the square of the estimated
propensity score based on the correctly specified model.
We estimate CAT EF (x1 ) for x1 ∈ {−0.4, −0.2, 0, 0.2, 0.4} and compute the mean
\
bias (“MEAN”), standard deviation (“SD”), the average of standard error for CAT
EF (x1 )
(“ASE”), and the mean squared error times 100 (“MSE×100”). The local linear regression is conducted with the Gaussian kernel, and the preliminary bandwidth (ĥ in
Algorithm (1)) is chosen by the method of Ruppert, Sheather, and Wand (1995). We
also compute the “BIAS”, “SE” and “MSE×100” of the corresponding inverse probability weighting estimators and the regression adjustment estimators. Note that the
difference between the proposed method and those alternative methods arises only in
the estimation of ψ(W, θ0 ) and the other steps are the same.
We examine the coverage probability of the uniform confidence band for CAT EF (x1 )
for the range −0.4 ≤ x1 ≤ 0.4. The nominal coverage probabilities that we consider
are 99%, 95% and 90%. We compute the empirical coverage (“CP”), the mean critical value (“Mcri”), and the standard deviation of critical value (“Sdcri”). We also
compute the coverage probabilities of the confidence band when the critical values
are computed by the Gumbel approximation (“GCP”).
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
21
5.3. Results. Tables 1 and 2 summarize the results on the properties of the estimators. The proposed doubly robust estimator of the CATEF works well in finite
samples. As the theory indicates, the proposed estimator exhibits small bias provided
that at least one of the regression models and the propensity score models is correctly
specified. Its standard deviation is typically smaller than that of the inverse probability weighting estimator, and when it is not, the difference is small. We find that
the regression adjustment estimator is very precise when the regression model is correctly specified. However, it is markedly vulnerable against model misspecification.
The inverse probability weighting estimator also suffers from model misspecification,
although it is arguably more robust than the regression adjustment estimator. When
both models are misspecified, the doubly robust estimator is biased; however, its
mean squared error is typically smaller than those of the alternatives, although we
notice that at some points (x = 0 for the inverse probability weighting estimator and
x = 0.2 for the regression adjustment estimator), the alternative estimators happen to
have small biases. All the estimators have large mean squared errors when x = −0.4.
This result is a consequence of the well-known result in nonparametric statistics that
it is difficult to estimate the value of a function near the end of the support. The
standard error for the proposed doubly robust estimator is, on average, close to the
standard deviation.
Tables 3 and 4 summarize the finite sample properties of the proposed uniform
confidence band. The results show that our uniform confidence band has a good
coverage property provided that one of the models is correctly specified. When both
models are misspecified, p = 4 and N = 1000, the size distortion is heavy. The average
values of the 95% critical values are around 3. Because the pointwise critical value is
1.96 and is much smaller than the uniformly valid critical value, it demonstrates the
importance of the uniform property of confidence band. The standard deviations of
22
LEE, OKUI, AND WHANG
the critical values are small because they change only if the bandwidth changes. The
confidence band based on the Gumbel approximation is very conservative.
The results of the Monte Carlo simulation confirm that the proposed doubly robust
estimator indeed works well in finite samples provided that one of the regression and
propensity score models is correctly specified. The proposed uniform confidence band
also has good coverage properties.
6. An Empirical Application
We apply our uniformly valid confidence band for the CATEF for the effect of
maternal smoking on birth weight where the argument of the CATEF is the mother’s
age. Our aim here is to illustrate our confidence band in comparison with alternative
confidence bands. We first discuss the background of this application and the dataset
used. We then compute various confidence bands for the CATEF and discuss the
results.
While the purpose of this application is to illustrate our uniformly valid confidence
band and not to present new insights on the effect of smoking, it is still informative
to discuss the background of this application. Many studies document that low birth
weight is associated with prolonged negative effects on health and educational or
labor market outcomes throughout life, although there has been a debate over its
magnitude. See, e.g., Almond and Currie (2011) for a review. Maternal smoking is
considered to be the most important preventable negative cause of low birth weight
(Kramer, 1987). There are many studies that evaluate the effect of maternal smoking
on low birth weight (Almond and Currie, 2011). The program evaluation approach
is employed by, for example, Almond, Chay, and Lee (2005), da Veiga and Wilder
(2008) and Walker, Tekin, and Wallace (2009), and panel data analysis is carried out
by Abrevaya (2006) and Abrevaya and Dahl (2008). Here, we are interested in how
the effect of smoking changes across different age groups of mothers. Walker, Tekin,
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
23
and Wallace (2009) examine whether the effect of smoking is larger for teen mothers
than for adult mothers and find mixed evidence. Abrevaya, Hsu, and Lieli (2015) also
consider this problem in their application.
Our dataset consists of observations from white mothers in Pennsylvania in the
USA. The dataset is an excerpt from Cattaneo (2010) and is obtained from the STATA
website (“http://www.stata-press.com/data/r13/cattaneo2.dta”). Note that
the dataset was originally used in Almond, Chay, and Lee (2005). We restrict our
sample to white and non-Hispanic mothers, and the sample size is 3754. The outcome
of interest (Y ) is infant birth weight measured in grams. The treatment variable (D)
is a binary variable that is equal to 1 if the mother smokes and 0 otherwise. The set of
covariates Z includes the mother’s age, an indicator variable for alcohol consumption
during pregnancy, an indicator for the first baby, the mother’s educational attainment,
an indicator for the first prenatal visit in the first trimester, the number of prenatal
care visits, and an indicator for whether there was a previous birth where the newborn
died. We are interested in how the effect of smoking varies across different values of
the mother’s age. Therefore, X is mother’s age in this application.
To estimate the CATEF, we use linear regression models for the regression part and
a logit model for propensity score. The explanatory variables used in the regression
models and the logit models consist of all the elements of Z, the square of the mother’s
age, and the interaction terms between the mother’s age and all other elements of Z.
We estimate the CATEF in the interval between ages 15 and 35.
We compute the following three 95% confidence bands for the CATEF. “Our CB”
is the confidence band proposed in this paper. Because X is univariate in this application, we follow the algorithm in Section 3. We use the Gaussian kernel. The
preliminary bandwidth (ĥ) is chosen by the method of Ruppert, Sheather, and Wand
(1995). “Gumbel CB” is the confidence band in which c(1 − α) in the algorithm is
replaced by that based on the Gumbel approximation (see Remark 1). “PW CB”
24
LEE, OKUI, AND WHANG
is a pointwise valid confidence band where we replace c(1 − α) in the algorithm by
the corresponding value from the standard normal distribution (i.e., 1.96). This provides a valid confidence interval for each point of the CATEF. However, its uniform
coverage rate would be smaller than 95%.
Figure 1 plots the estimated CATEF and the three 95% confidence bands for the
range between 15 and 35 years of age. The figure also contains the average treatment
effect estimate (AIPW estimate) for a reference.
The widths of the three confidence bands are substantially different. The confidence
band based on the Gumbel approximation provides the widest band and may not be
very informative. The confidence band that is valid only in a pointwise sense gives
the narrowest band. This band is not uniformly valid and so may provide misleading
information about the CATEF. On the other hand, this provides valuable information
if we are interested at a particular point of the CATEF. The confidence band we
propose lay between “Gumbel CB” and “PW CB”. While this band is wider than
“PW CB”, it is much narrower than “Gumbel CB” and is uniformly asymptotically
valid. We see from this figure that our confidence band is informative while being
uniformly valid.
The estimated CATEF is decreasing from 15 to around 25 years of age. It is
rather stable for the range above 25 years of age. All confidence bands indicate that
the CATEF is estimated imprecisely near the ends of the range. Nonetheless, the
estimated CATEF indicates that smoking may not have a strong impact when the
mother is young. The CATEF is estimated relatively precisely in the middle of the
range. For the range between 20 and 30 years of age, even the band based on the
Gumbel approximation, which is the widest, does not contain 0. This result provides
robust evidence that smoking has a negative impact on birth weight at least for
mothers who are 20 to 30 years old. In this particular dataset, the statistical evidence
against a constant smoking effect is somewhat weak. The confidence band that is valid
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
25
only in a pointwise sense may provide an impression that the smoking effect depends
on the mother’s age. However, the uniformly valid confidence band that we propose
marginally contains the straight line that is equal to the ATE estimate. This result
illustrates that there is a caveat when we use pointwise confidence intervals, as well
as the importance of using uniformly valid confidence bands.
7. Conclusion
In this paper, we propose a doubly robust method for estimating the CATEF. We
consider the situation where a high-dimensional vector of covariates is needed for
identifying the average treatment effect but the covariates of interest are of much
lower dimension. Our proposed estimator is doubly robust and does not suffer from
the curse of dimensionality. We propose a uniform confidence band that is easy
to compute, and we illustrate its usefulness via Monte Carlo experiments and an
application to the effects of smoking on birth weights.
There are a few topics to be explored in the future. First, it would be useful to
consider the issue of asymptotic biases of the proposed estimator without relying on
undersmoothing. For example, it might be possible to extend the approach of Hall
and Horowitz (2013) that avoids undersmoothing for our purposes. Second, it would
be an interesting exercise to develop a method for estimating the quantile treatment
effects conditional on covariates. Third, it is possible to extend our approach to the
local average treatment effect. As mentioned in the Introduction, Ogburn, Rotnitzky,
and Robins (2015) consider conditioning on Z to achieve identification, but they estimate the local average treatment effect, say LATE(X), as a function of X. However,
their specification of LATE(X) is parametric. Our approach can be adapted to specify LATE(X) nonparametrically and to develop a corresponding uniform confidence
band. Fourth, this paper does not cover marginal treatment effects that can be identified using the method of local instrumental variables developed by Heckman and
26
LEE, OKUI, AND WHANG
Vytlacil (1999, 2005). It would be interesting to develop a uniform confidence band
for the marginal treatment effects.
Appendix A. Proofs
Proof of Lemma 1. Because DY = DY1 and Y1 and D are independent of each other
conditional on Z, write:
(A.1)
E [D|Z] E [Y1 |Z] E [D|Z] − π(Z, β0 )
−
µ1 (Z, α10 )X = x .
E [ψ1 (W, α10 , β0 )|X = x] = E
π(Z, β0 )
π(Z, β0 )
Suppose that there exists β0 such that E [D|Z] = π(Z, β0 ) almost surely. Then the
right-hand side of (A.1) reduces to:
E [E [Y1 |Z] |X = x] = E [Y1 |X = x] .
Suppose now that there exists α10 such that E [Y1 |Z] = µ1 (Z, α10 ) almost surely. Then
the right-hand side of (A.1) again reduces to:
E [µ1 (Z, α10 )|X = x] = E [Y1 |X = x] .
Analogously, we have E [ψ0 (W, α00 , β0 )|X = x] = E [Y0 |X = x].
The remainder of the appendix gives the proof of Theorem 2. We first establish
the linear expansion of the local linear estimator.
Lemma 3.
sup
x∈I
p
n
ĝ(x) − g(x)
X
1
Ui
Xi − x nhdn − d
K
= Op n−c
ŝ(x)
nhn sn (x) i=1 fX (x)
hn
for some positive constant c > 0.
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
27
Proof of Lemma 3. Let Ψ̂ and Ψ0 denote the n-dimensional vectors such that Ψ̂ =
{ψ(Wi , θ̂)}ni=1 and Ψ0 = {ψ(Wi , θ0 )}ni=1 , respectively. Let Γ(x) be the n × (d + 1)
matrix whose i-th row is [1, (Xi − x)> ], Ω(x) the n-dimensional diagonal matrix
n
whose i-th element is h−1
n K [(Xi − x)/hn ], G := [g(Xi )]i=1 the n-dimensional vector
of regression functions evaluated at data points, and U := (Ui )ni=1 the n-dimensional
vector of regression errors. Let e1 denote the (d + 1) × 1 vector whose first element
is one and all others are zeros. Write
ĝ(x) − g(x) = Tn1 (x) + Tn2 (x) + Rn1 (x),
where
−1
>
Tn1 (x) = e>
Γ(x)> Ω(x)U ,
1 Γ(x) Ω(x)Γ(x)
−1
>
Tn2 (x) = e>
Γ(x)> Ω(x)G,
1 Γ(x) Ω(x)Γ(x)
−1
>
>
>
Rn (x) = e1 Γ(x) Ω(x)Γ(x)
Γ(x) Ω(x) Ψ̂ − Ψ0 .
We first consider the leading stochastic term Tn1 (x). As in equation (2.10) of
Ruppert and Wand (1994), we have that
(A.2)
n−1 Γ(x)> Ω(x)Γ(x)

Pn
Xi −x
1
K
i=1
h
nhdn

n
=
P
n
Xi −x
1
(Xi − x)
i=1 K
hn
nhd
n
1
nhdn
1
nhdn
Pn
Pn
i=1
i=1
K
K
Xi −x
hn
Xi −x
hn
>
(Xi − x)
(Xi − x)(Xi − x)>


.
By the standard empirical process results (see e.g., Pollard (1984) or van der Vaart and
Wellner (1996)) combined with the usual change of variables, we have that uniformly
28
LEE, OKUI, AND WHANG
in x ∈ I,
"
1/2 #
n
1 X
log n
Xi − x
= fX (x) + o(hn ) + Op
,
K
nhdn i=1
hn
nhdn
" 1/2 #
n
Xi − x
∂f
(x)
log
n
1 X
X
K
(Xi − x) = h2n
µ2 (K) + o(h2n ) + Op hn
,
nhdn i=1
hn
∂x
nhdn
" 1/2 #
n
Xi − x
log
n
1 X
K
(Xi − x)(Xi − x)> = h2n fX (x)µ2 (K) + o(h2n ) + Op h2n
,
nhdn i=1
hn
nhdn
where µ2 (K) :=
R
u2 K(u)du.
Throughout the remainder of the proof, we let rn (x) = Op (n−c ) , uniformly in x,
be a sequence that can be different in different places for some constant c > 0. Then
as in (2.11) of Ruppert and Wand (1994), we have that
(A.3)
−1
−1
n Γ(x)> Ω(x)Γ(x)


fX (x)−1 [1 + rn (x)]
−[∂fX (x)/∂x]> fX (x)−2 [1 + rn (x)]
,
=
−1
−2
2
−[∂fX (x)/∂x]fX (x) [1 + rn (x)]
[µ2 (K)fX (x)hn Id ] [1 + rn (x)]
where Id is the d-dimensional identity matrix. The little op (·) terms in equation (2.11)
of Ruppert and Wand (1994) are pointwise; however, (A.3) holds uniformly in x ∈ I
with polynomially decaying terms rn (x) under our assumptions.
Let Γi (x) := [1, (Xi − x)> ]> . Since
n
1 X
Xi − x
n Γ(x) Ω(x)U =
Ui K
Γi (x),
nhdn i=1
hn
−1
>
we have by (A.3) that Tn1 (x) = Tn11 (x) + Tn12 (x), where
n
X
1
Tn11 (x) =
Ui K
nhdn fX (x) i=1
Xi − x
hn
[1 + rn (x)],
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
29
>
n
X
1
∂fX (x)
Xi − x
Tn12 (x) = − d
Ui K
(Xi − x)[1 + rn (x)].
nhn [fX (x)]2 i=1
hn
∂x
Again using the standard empirical process result and the method of change of variables,
"
Tn11 (x) = Op
log n
nhdn
1/2 #
"
and Tn12 (x) = Op hn
log n
nhdn
1/2 #
uniformly in x ∈ I. Therefore, we have shown that
n
(A.4)
X
1
Ui K
Tn1 (x) =
nhdn fX (x) i=1
Xi − x
hn
[1 + rn (x)].
We now move on the other remainder terms. The proof of Theorem 2.1 (in particular, equation (2.3)) of Ruppert and Wand (1994) implies that Tn2 (x) = O(h2n )
→ 0 at a polynomial rate in n
uniformly in x ∈ I. The condition that nhd+4
n
corresponds to the undersmoothing condition. It is straightforward to show that
(nhdn )1/2 Rn (x) = O(n−c ) uniformly in x for some constant c > 0 due to Assumption
1(9) that
n
o
max (nhdn )1/2 |ψ(Wi , θ̂) − ψ(Wi , θ0 )|i = 1, . . . , n = Op (n−c )
for some constant c > 0
Note that by conditions (7) and (8) of Assumption 1, we have that inf n≥1 inf x∈I sn (x) >
0 and supx∈I |ŝ2 (x) − s2n (x)| = Op (n−c ). Hence, the lemma follows from (A.4) immediately.
Define:
−1/2
n
1 X
Xi − x
1
X−x
2 2
Tn (x) ≡
Ui K
and cn (x) ≡
E U K
.
nhdn i=1
hn
hdn
hn
30
LEE, OKUI, AND WHANG
Note that cn (x) = [fX (x)sn (x)]−1 . By Lemma 3:
p
ĝ(x) − g(x)
d
max nhn − cn (x)Tn (x) = Op n−c .
x∈I
ŝ(x)
We now use the result of Chernozhukov, Chetverikov, and Kato (2014) to obtain
Gaussian approximations. Define:
(A.5)
Wn ≡ sup cn (x)
x∈I
p
nhdn [Tn (x) − ETn (x)] .
Chernozhukov, Chetverikov, and Kato (2014) established an approximation of Wn by
a sequence of suprema of Gaussian processes. For each n ≥ 1, let B̃n,1 be a centered
Gaussian process indexed by I with covariance function:
(A.6)
E[B̃n,1 (x)B̃n,1 (x0 )] =
X−x
X − x0
−d
0
2
hn cn (x)cn (x )Cov U K
K
.
hn
hn
Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) establishes the following approximation result.
Lemma 4. Let Assumption 1 hold. Then for every n ≥ 1, there is a tight Gaussian
random variable B̃n,1 in `∞ (I) with mean zero and covariance function (A.6), and
there is a sequence W̃n,1 of random variables such that W̃n,1 =d supx∈I B̃n,1 (x) and as
n → ∞:
n
o
|Wn − W̃n,1 | = OP (nhdn )−1/6 log n + (nhdn )−1/4 log5/4 n + (n1/2 hdn )−1/2 log3/2 n .
Proof of Lemma 4. To apply Proposition 3.2 of Chernozhukov, Chetverikov, and Kato
(2014), we first note that Assumption 1 implies that all the regularity conditions for
Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) are satisfied. They
are:
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
31
(1) supx∈Rd E[U 4 |X = x] < ∞.
(2) K(·) is a bounded and continuous kernel function on Rd , and such that the
class of functions K ≡ {t 7→ K(ht + x) : h > 0, x ∈ Rd } is a VC type with
envelope kKk∞ .
(3) The distribution of X has a bounded Lebesgue density p(·) on Rd .
(4) hn → 0 and log(1/hn ) = O(log n) as n → ∞.
(5) CI ≡ supn≥1 supx∈I |cn (x)| < ∞. Moreover, for every fixed n ≥ 1 and for
every xm ∈ I → x ∈ I pointwise, cn (xm ) → cn (x).
Then the desired result is an immediate consequence of Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) with a singleton set G = {U } and with q = 4
(using their notation) in verifying condition (B1)’ of Chernozhukov, Chetverikov, and
Kato (2014).
We now show that the Gaussian field obtained in Lemma 4 can be further approximated by a stationary Gaussian field.
Lemma 5. Let Assumption 1 hold. Then for every n ≥ 1, there is a tight Gaussian
random variable B̃n,2 in `∞ (In ) with mean zero and covariance function:
E[B̃n,2 (s)B̃n,2 (s0 )] = ρd (s − s0 )
for s, s0 ∈ In ≡ h−1
n I, and there is a sequence of random variables such that W̃n,2 =d
supx∈I B̃n,2 (h−1
n x) and as n → ∞:
p
|W̃n,1 − W̃n,2 | = OP hn log h−d
.
n
Proof of Lemma 5. This lemma can be proved as in the proof of Lemma 3.4 of Ghosal,
Sen, and van der Vaart (2000). Let:
−1/2
X−x
Xi − x
2 2
φn,x (Ui , Xi ) := E U K
Ui K
,
hn
hn
32
LEE, OKUI, AND WHANG
ϕn,x (Ui , Xi ) :=
hdn E
2
Z
U |Xi fX (Xi )
2
−1/2
K (u)du
Ui K
Xi − x
hn
.
As in Remark 8.3 of Ghosal, Sen, and van der Vaart (2000) and in the proof of Lemma
3.4 of Ghosal, Sen, and van der Vaart (2000), we can regard Gaussian processes B̃n,1
and B̃n,2 as Brownian bridges Bn (φn,x ) and Bn (ϕn,x ), respectively, in the sense that
EBn (g) = 0 and E[Bn (g)Bn (g 0 )] = cov(g, g 0 ) for g = φn,x , g 0 = φn,x0 or g = ϕn,x , g =
ϕn,x0 .
Define δn (x) := Bn (φn,x ) − Bn (ϕn,x ). Note that δn (x) is also a mean zero Gaussian
process with:
E [δn (x)δn (x0 )] = E [{φn,x (U, X) − ϕn,x (U, X)}{φn,x0 (U, X) − ϕn,x0 (U, X)}] .
Note that:
E {δn (x)}
2
Z Z
=
−1/2
E U 2 |X = x + hu K2 (u)fX (x + hu)du
2
− E U |X = x + ht fX (x + ht)
Z
2
−1/2 2
K (u) du
× E U 2 |X = x + ht K2 (t) fX (x + hu)dt
= O(h2n ),
because x 7→ E [U 2 |X = x] fX (x) is Lipschitz continuous. Thus, the L2 -diameter of
δn (·) is O(hn ). In addition, we can show that there exists a constant C > 0 such that:
2
E {δn (x) − δn (x0 )}2 ≤ Ch−2 kx − x0 k .
Then arguments similar to those used in the proof of Lemma 3.4 of Ghosal, Sen, and
van der Vaart (2000) yield the desired result.
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
33
√
Proof of Theorem 2. First note that an = O( log n ) because hn = Cn−η . Lemmas 4
and 5 together imply that:
ĝ(x) − g(x)
−1
max − B̃n,2 (hn x) = op (an ) .
x∈I
ŝ(x)
Note that B̃n,2 , defined in Theorem 5, is a homogeneous Gaussian field with zero
mean and the covariance function ρd (s). Because of the assumption on K(·), the
covariance function ρd (s) has finite support and is six times differentiable. The latter
property implies that the Gaussian process B̃n,2 is three times differentiable in the
mean square sense (see, e.g., Chapter 4 of Rasmussen and Williams (2006)). Then by
Theorem 14.3 of Piterbarg (1996) and also by Theorem 3.2 of Konakov and Piterbarg
(1984), there exists κ > 0 such that uniformly in t, on any finite interval:
−1
P an max B̃n,2 (hn x) − an < t =
x∈I
−t−t2 /2a2n
exp −2e
[(d−1)/2]
X
m=0
hm,d−1 a−2m
n
t
1+ 2
an
d−2m−1
+ O(n−κ )
as n → ∞, where an is obtained as the largest solution to the equation:
√
mes(I)hn −d detΛ2 d−1 −a2n /2
an e
= 1,
(2π)(d+1)/2
Λ2 is the covariance matrix of the vector of the first derivative of the Gaussian field
B̃n,2 :
2
∂ r(0)
, i, j = 1, . . . , d ,
Λ2 ≡ cov grad B̃n,2 (t) = −
∂ti ∂tj
and [·] is the integer part of a number. Simple calculation yields
λ defined in (4.3).
√
detΛ2 = λd/2 with
34
LEE, OKUI, AND WHANG
References
Abadie, A. (2003): “Semiparametric instrumental variable estimation of treatment
response models,” Journal of Econometrics, 113(2), 231–263.
Abrevaya, J. (2006): “Estimating the Effect of Smoking on Birth Outcomes Using
a Matched Panel Data Approach,” Journal of Applied Econometrics, 21, 489–519.
Abrevaya, J., and C. M. Dahl (2008): “The effects of birth inputs on birthweight: evidence from quantile estimation on panel data,” Journal of Business and
Economic Statistics, 26, 379–397.
Abrevaya, J., Y.-C. Hsu, and R. P. Lieli (2015): “Estimating Conditional
Average Treatment Effects,” Journal of Business and Economic Statistics, 33(4),
485–505.
Almond, D., K. Y. Chay, and D. S. Lee (2005): “The Costs of Low Birth
Weight,” Quarterly Journal of Economics, 120, 1031–1083.
Almond, D., and J. Currie (2011): “Human Capital Development before Age
Five,” in Handbook of Labor Economics, ed. by D. Card, and O. Ashenfelter, vol. 4b,
chap. 15, pp. 1315–1486. Elsevier.
Cattaneo, M. D. (2010): “Efficient semiparametric estimation of multi-valued
treatment effects under ignorability,” Journal of Econometrics, 155, 138–154.
Chernozhukov, V., D. Chetverikov, and K. Kato (2013): “Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random
vectors,” Annals of Statistics, 41(6), 2786–2819.
(2014): “Gaussian approximation of suprema of empirical processes,” Annals
of Statistics, 42(4), 1564–1597.
Chernozhukov, V., I. Ferndádez-Val, and A. Galichon (2009): “Improving point and interval estimators of monotone functions by rearrangement,”
Biometrika, 96(3), 559–575.
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
35
Chernozhukov, V., S. Lee, and A. Rosen (2013): “Intersection Bounds: Estimation and Inference,” Econometrica, 81(2), 667–737.
Claeskens, G., and I. v. Keilegom (2003): “Bootstrap confidence bands for
regression curves and their derivatives,” Annals of Statistics, 31(6), 1852–1884.
da Veiga, P. V., and R. P. Wilder (2008): “Maternal Smoking During Pregnancy and Birthweight: A Propensity Score Matching Approach,” Maternal and
Child Health Journal, 12(2), 194–203.
Derya Uysal, S. (2015): “Doubly Robust Estimation of Causal Effects with Multivalued Treatments: An Application to the Returns to Schooling,” Journal of
Applied Econometrics, 30(5), 763–786.
Ghosal, S., A. Sen, and A. W. van der Vaart (2000): “Testing Monotonicity
of Regression,” Annals of Statistics, 28, 1054–1082.
Glynn, A. N., and K. M. Quinn (2010): “An Introduction to the Augmented
Inverse Propensity Weighted Estimator,” Political Analysis, 18, 36–56.
Hall, P., and J. Horowitz (2013): “A simple bootstrap method for constructing
nonparametric confidence bands for functions,” Annals of Statistics, 41(4), 1892–
1921.
Heckman, J. J., and E. J. Vytlacil (1999): “Local instrumental variable and
latent variable models for identifying and bounding treatment effects,” Proceedings
of the National Academy of Sciences, 96, 4730–4734.
(2005): “Structural equations, treatment, effects and econometric policy
evaluation,” Econometrica, 73(3), 669–738.
Imbens, G. W., and J. D. Angrist (1994): “Identification and estimation of local
average treatment effects,” Econometrica, 62(2), 467–475.
Kang, J. D. Y., and J. L. Schafer (2007): “Demystifying Double Robustness:
A comparison of Alternative Strategies for Estimating a Population Mean from
Incomplete Data,” Statistical Science, 22(4), 523–539.
36
LEE, OKUI, AND WHANG
Konakov, V., and V. Piterbarg (1984): “On the convergence rate of maximal deviation distribution for kernel regression estimates,” Journal of Multivariate
Analysis, 15(3), 279–294.
Kramer, M. S. (1987): “Intrauterine Growth and Gestational Duration Determinants,” Pediatrics, 80, 502–511.
Lee, S., O. Linton, and Y.-J. Whang (2009): “Testing for Stochastic Monotonicity,” Econometrica, 77(2), 585–602.
Lunceford, J. K., and M. Davidian (2004): “Stratification and weighting via the
propensity score in estimation of causal treatment effects: a comparative study,”
Statistics in Medicine, 23, 2937–2960.
Ogburn, E. L., A. Rotnitzky, and J. M. Robins (2015): “Doubly robust estimation of the local average treatment effect curve,” Journal of the Royal Statistical
Society, Series B, 77(Part 2), 373–396.
Okui, R., D. S. Small, Z. Tan, and J. M. Robins (2012): “Doubly Robust
Instrumental Variable Regression,” Statistica Sinica, 22, 173–205.
Piterbarg, V. I. (1996): Asymptotic Methods in the Theory of Gaussian Processes
and Fields. American Mathematical Society, Providence, RI.
Pollard, D. (1984): Convergence of Stochastic Processes. Springer, New York, NY.
Poterba, J. M., S. F. Venti, and D. A. Wise (1995): “Do 401(k) contributions
crowd out other personal saving?,” Journal of Public Economics, 58(1), 1–32.
Rasmussen, C. E., and C. K. I. Williams (2006): Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
Robins, J. M. (2000): “Marginal structural models versus structural nested models
as tools for causal inference,” in Statistical Models in Epidemiology, the Environment, and Clinical Trials, ed. by M. E. Halloran, and D. Berry, vol. 116 of The IMA
Volumes in Mathematics and its Applications, pp. 95–133. Springer, New York.
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
37
Robins, J. M., and A. Rotnitzky (1995): “Semiparametric Efficiency in Multivariate Regression Models with Missing Data,” Journal of the American Statistical
Association, 90(429), 122–129.
Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994): “Estimation of Regression Coefficients when Some Regressors are not always oberved,” Journal of the
American Statistical Association, 89, 846–866.
Rothe, C., and S. Firpo (2014): “Semiparametric Two-Step Estimation Using
Doubly Robust Moment Conditions,” working paper.
Ruppert, D., S. J. Sheather, and M. P. Wand (1995): “An effective bandwidth
selector for local least squares regression,” Journal of the American Statistical Association, 90(432), 1257–1270.
Ruppert, D., and M. P. Wand (1994): “Multivariate Locally Weighted Least
Squares Regression,” Annals of Statistics, 22(3), 1346–1370.
Schafer, J. L., and J. Kang (2008): “Average causal effects from nonrandomized
studies: A practical guide and simulated example,” Psychological Methods, 13(4),
279–313.
Scharfstein, D. O., A. Rotnitzky, and J. M. Robins (1999): “Adjusting for
Nonignorable Drop-Out Using Semiparametric Nonresponse Models,” Journal of
the Americal Statistical Association, 94(448), 1096–1120.
Tan, Z. (2006): “Regression and Weighting Methods for Causal Inference Using
Instrumental Variables,” Journal of the American Statistical Association, 101(476),
1607–1618.
(2010): “Bounded, efficient and doubly robust estimation with inverse
weighting,” Biometrika, 97(3), 661–682.
van der Laan, M. J. (2006): “Statistical Inference for Variable Importance,” The
International Journal of Biostatistics, 2(1), Article 2.
38
LEE, OKUI, AND WHANG
van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and
Empirical Processes. Springer-Verlag, New York, NY.
Walker, M., E. Tekin, and S. Wallace (2009): “Teen Smoking and Birth
Outcomes,” Southern Economic journal, 75, 892–907.
Wooldridge, J. M. (2007): “Inverse Probability Weighted Estimation for General
Missing Data Problems,” Journal of Econometrics, 141, 1281–1301.
(2010): Econometric analysis of cross section and panel data. The MIT
press, second edn.
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
39
Table 1. Monte Carlo results: CATEF estimates, p = 2
At x =
BIAS
SD
-0.4
-0.2
0
0.2
0.4
-0.001
0.005
0.004
0
-0.004
0.06
0.042
0.037
0.035
0.038
-0.4
-0.2
0
0.2
0.4
0.002
0.005
0.005
0
-0.005
0.134
0.069
0.048
0.041
0.042
-0.4
-0.2
0
0.2
0.4
0
0.004
0.004
0
-0.004
0.052
0.042
0.036
0.035
0.038
-0.4
-0.2
0
0.2
0.4
0.282
-0.023
-0.034
-0.005
-0.002
0.178
0.075
0.052
0.044
0.045
-0.4
-0.2
0
0.2
0.4
0
0.004
0.002
-0.001
-0.003
0.045
0.032
0.027
0.026
0.028
-0.4
-0.2
0
0.2
0.4
0.002
0.005
0.003
-0.001
-0.004
0.094
0.051
0.035
0.03
0.03
-0.4
-0.2
0
0.2
0.4
0
0.004
0.002
-0.001
-0.003
0.039
0.031
0.027
0.026
0.028
-0.4
-0.2
0
0.2
0.4
0.276
-0.026
-0.037
-0.006
0
0.127
0.057
0.039
0.033
0.033
DR
ASE
IPW
BIAS
SD
MSE×100 BIAS
N = 500
True propensity score model, False regression model
0.057
0.36
-0.001 0.055
0.3
-0.158
0.04
0.182
0.004 0.043
0.183
0.058
0.036
0.138
0.004 0.036
0.129
0.097
0.034
0.124
0
0.037
0.14
-0.002
0.036
0.149
-0.004 0.044
0.192
-0.053
False propensity score model, True regression model
0.117
1.784
0.006 0.147
2.171
-0.001
0.07
0.484
-0.06 0.077
0.955
0.002
0.048
0.23
0.004
0.05
0.254
0.001
0.041
0.169
0.056 0.047
0.543
-0.001
0.042
0.176
0.092 0.056
1.155
-0.002
True propensity score model, True regression model
0.048
0.268
-0.001 0.055
0.3
-0.001
0.039
0.175
0.004 0.043
0.183
0.002
0.035
0.132
0.004 0.036
0.129
0.001
0.034
0.119
0
0.037
0.14
-0.001
0.035
0.145
-0.004 0.044
0.192
-0.002
False propensity score model, False regression model
0.166
11.125
0.006 0.147
2.171
-0.158
0.076
0.615
-0.06 0.077
0.955
0.058
0.052
0.387
0.004
0.05
0.254
0.097
0.045
0.195
0.056 0.047
0.543
-0.002
0.046
0.203
0.092 0.056
1.155
-0.053
N = 1000
True propensity score model, False regression model
0.043
0.199
0
0.04
0.163
-0.156
0.03
0.103
0.004 0.032
0.103
0.06
0.027
0.076
0.002 0.026
0.071
0.098
0.026
0.068
-0.001 0.027
0.075
-0.002
0.027
0.082
-0.003 0.032
0.101
-0.053
False propensity score model, True regression model
0.087
0.881
0.005 0.105
1.106
0
0.052
0.265
-0.06 0.058
0.689
0.001
0.036
0.124
0.002 0.037
0.138
0.001
0.031
0.092
0.055 0.034
0.425
0
0.031
0.093
0.091
0.04
0.994
-0.001
True propensity score model, True regression model
0.036
0.148
0
0.04
0.163
0
0.03
0.1
0.004 0.032
0.103
0.001
0.026
0.072
0.002 0.026
0.071
0.001
0.026
0.067
-0.001 0.027
0.075
0
0.027
0.077
-0.003 0.032
0.101
-0.001
False propensity score model, False regression model
0.123
9.214
0.005 0.105
1.106
-0.156
0.057
0.396
-0.06 0.058
0.689
0.06
0.039
0.288
0.002 0.037
0.138
0.098
0.034
0.114
0.055 0.034
0.425
-0.002
0.035
0.112
0.091
0.04
0.994
-0.053
MSE×100
RA
SD
MSE×100
0.036
0.023
0.017
0.018
0.026
2.626
0.394
0.978
0.034
0.343
0.024
0.022
0.016
0.017
0.029
0.059
0.047
0.026
0.03
0.085
0.024
0.022
0.016
0.017
0.029
0.059
0.047
0.026
0.03
0.085
0.036
0.023
0.017
0.018
0.026
2.626
0.394
0.978
0.034
0.343
0.025
0.016
0.012
0.013
0.018
2.51
0.382
0.974
0.016
0.313
0.018
0.016
0.012
0.012
0.02
0.034
0.025
0.014
0.015
0.041
0.018
0.016
0.012
0.012
0.02
0.034
0.025
0.014
0.015
0.041
0.025
0.016
0.012
0.013
0.018
2.51
0.382
0.974
0.016
0.313
40
LEE, OKUI, AND WHANG
Table 2. Monte Carlo results: CATEF estimates, p = 4
At x =
BIAS
SD
-0.4
-0.2
0
0.2
0.4
-0.001
0.005
0.004
-0.002
-0.002
0.042
0.038
0.036
0.038
0.063
-0.4
-0.2
0
0.2
0.4
-0.001
0.005
0.003
-0.002
-0.002
0.052
0.045
0.041
0.041
0.063
-0.4
-0.2
0
0.2
0.4
-0.001
0.005
0.003
-0.002
-0.002
0.039
0.037
0.035
0.038
0.062
-0.4
-0.2
0
0.2
0.4
0.051
0
-0.009
-0.001
-0.003
0.057
0.049
0.043
0.042
0.066
-0.4
-0.2
0
0.2
0.4
-0.002
0.003
0.003
-0.001
-0.003
0.03
0.029
0.026
0.028
0.048
-0.4
-0.2
0
0.2
0.4
-0.002
0.003
0.003
-0.001
-0.003
0.038
0.034
0.03
0.03
0.047
-0.4
-0.2
0
0.2
0.4
-0.002
0.003
0.003
-0.001
-0.003
0.029
0.027
0.026
0.028
0.047
-0.4
-0.2
0
0.2
0.4
0.049
-0.003
-0.01
0
-0.004
0.041
0.036
0.032
0.031
0.05
DR
ASE
IPW
BIAS
SD
MSE×100 BIAS
N = 500
True propensity score model, False regression model
0.04
0.175
-0.001 0.039
0.153
-0.098
0.037
0.147
0.005 0.037
0.141
0.062
0.034
0.128
0.003 0.035
0.121
0.079
0.037
0.144
-0.002 0.04
0.158
-0.017
0.059
0.399
-0.001 0.065
0.429
-0.044
False propensity score model, True regression model
0.05
0.271
-0.009 0.055
0.314
-0.001
0.045
0.207
-0.02 0.049
0.28
0.003
0.04
0.17
0.003 0.043
0.185
0.003
0.041
0.171
0.033 0.046
0.324
-0.001
0.059
0.392
0.045 0.073
0.732
-0.001
True propensity score model, True regression model
0.037
0.153
-0.001 0.039
0.153
-0.001
0.035
0.138
0.005 0.037
0.141
0.003
0.033
0.122
0.003 0.035
0.121
0.003
0.037
0.144
-0.002 0.04
0.158
-0.001
0.058
0.387
-0.001 0.065
0.429
-0.001
False propensity score model, False regression model
0.057
0.582
-0.009 0.055
0.314
-0.098
0.048
0.24
-0.02 0.049
0.28
0.062
0.042
0.194
0.003 0.043
0.185
0.079
0.042
0.178
0.033 0.046
0.324
-0.017
0.064
0.432
0.045 0.073
0.732
-0.044
N = 500
True propensity score model, False regression model
0.03
0.092
-0.002 0.029
0.083
-0.098
0.028
0.083
0.003 0.028
0.079
0.062
0.026
0.069
0.003 0.025
0.066
0.079
0.028
0.078
-0.001 0.029
0.086
-0.017
0.044
0.228
-0.002 0.05
0.248
-0.045
False propensity score model, True regression model
0.037
0.143
-0.01
0.04
0.171
-0.001
0.033
0.115
-0.023 0.037
0.186
0.002
0.03
0.093
0.003 0.031
0.1
0.002
0.031
0.091
0.036 0.034
0.242
-0.001
0.044
0.222
0.043 0.055
0.493
-0.002
True propensity score model, True regression model
0.028
0.082
-0.002 0.029
0.083
-0.001
0.026
0.077
0.003 0.028
0.079
0.002
0.025
0.066
0.003 0.025
0.066
0.002
0.028
0.077
-0.001 0.029
0.086
-0.001
0.043
0.222
-0.002 0.05
0.248
-0.002
False propensity score model, False regression model
0.043
0.412
-0.01
0.04
0.171
-0.098
0.036
0.134
-0.023 0.037
0.186
0.062
0.031
0.11
0.003 0.031
0.1
0.079
0.032
0.095
0.036 0.034
0.242
-0.017
0.048
0.248
0.043 0.055
0.493
-0.045
MSE×100
RA
SD
MSE×100
0.03
0.022
0.02
0.025
0.04
1.055
0.43
0.671
0.089
0.351
0.019
0.019
0.014
0.024
0.06
0.037
0.037
0.02
0.056
0.365
0.019
0.019
0.014
0.024
0.06
0.037
0.037
0.02
0.056
0.365
0.03
0.022
0.02
0.025
0.04
1.055
0.43
0.671
0.089
0.351
0.021
0.016
0.015
0.018
0.029
1.015
0.407
0.646
0.062
0.286
0.014
0.014
0.01
0.017
0.046
0.019
0.02
0.01
0.03
0.21
0.014
0.014
0.01
0.017
0.046
0.019
0.02
0.01
0.03
0.21
0.021
0.016
0.015
0.018
0.029
1.015
0.407
0.646
0.062
0.286
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
Table 3. Monte Carlo results, CATEF confidence band, p = 2
Confidence level
CP
Mcri Sdcri
GCP
N = 500
True propensity score model, False regression model
99%
0.978 3.485 0.078
0.999
95%
0.917 2.98 0.091
0.994
90%
0.855 2.728 0.099
0.977
False propensity score model, True regression model
99%
0.99 3.468 0.106
1
95%
0.958 2.96 0.123
0.998
90%
0.912 2.705 0.134
0.99
True propensity score model, True regression model
99%
0.981 3.483 0.072
1
95%
0.929 2.978 0.084
0.995
0.864 2.726 0.092
0.98
90%
False propensity score model, False regression model
99%
0.983 3.505 0.098
1
95%
0.916 3.003 0.113
0.997
90%
0.823 2.753 0.123
0.982
N = 1000
True propensity score model, False regression model
99%
0.982 3.525 0.067
0.999
0.93 3.027 0.078
0.994
95%
90%
0.875 2.779 0.085
0.979
False propensity score model, True regression model
99%
0.993 3.5 0.093
1
95%
0.96 2.998 0.108
0.997
90%
0.919 2.747 0.117
0.994
True propensity score model, True regression model
99%
0.986 3.524 0.061
1
95%
0.936 3.026 0.071
0.995
90%
0.881 2.778 0.077
0.983
False propensity score model, False regression model
99%
0.948 3.545 0.088
1
95%
0.774 3.05 0.102
0.991
90%
0.614 2.804 0.111
0.947
41
42
LEE, OKUI, AND WHANG
Table 4. Monte Carlo results, CATEF confidence band, p = 4
Confidence level
CP
Mcri Sdcri
GCP
N = 500
True propensity score model, False regression model
99%
0.982 3.488 0.081
1
95%
0.926 2.984 0.095
0.994
90%
0.863 2.732 0.103
0.981
False propensity score model, True regression model
99%
0.987 3.478 0.082
1
95%
0.941 2.972 0.095
0.996
90%
0.886 2.718 0.104
0.985
True propensity score model, True regression model
99%
0.981 3.489 0.082
1
95%
0.924 2.985 0.095
0.996
0.861 2.733 0.103
0.981
90%
False propensity score model, False regression model
99%
0.986 3.486 0.079
1
95%
0.938 2.982 0.092
0.996
90%
0.879 2.73
0.1
0.983
N = 1000
True propensity score model, False regression model
99%
0.985 3.524 0.07
1
0.931 3.026 0.081
0.996
95%
90%
0.877 2.778 0.088
0.982
False propensity score model, True regression model
99%
0.988 3.513 0.07
1
95%
0.945 3.013 0.081
0.997
90%
0.893 2.764 0.088
0.986
True propensity score model, True regression model
99%
0.984 3.526 0.069
1
95%
0.933 3.029 0.08
0.995
90%
0.874 2.781 0.087
0.982
False propensity score model, False regression model
99%
0.986 3.523 0.066
1
95%
0.923 3.025 0.077
0.996
90%
0.855 2.777 0.084
0.98
DOUBLY ROBUST UNIFORM CONFIDENCE BAND
43
Figure 1. CATEF for the effect of smoking on birth weights, 95%
confidence bands
−600
−400
−200
0
200
400
CATEF
our CB
PW CB
Gumbel CB
ATE
15
20
25
30
35
mother’s age
Note: “CATEF” = the estimated CATEF; “our CB” = the uniformly valid confidence
band proposed in this paper; “PW CB” = the confidence band that is valid only in
a pointwise sense; “Gumbel CB” = the uniformly valid confidence band based on
the Gumbel approximation; “ATE” = the estimated value of the average treatment
effect.
Download