Doubly robust uniform confidence band for the conditional average treatment effect function Sokbae Lee Ryo Okui Yoon-Jae Wang The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP03/16 DOUBLY ROBUST UNIFORM CONFIDENCE BAND FOR THE CONDITIONAL AVERAGE TREATMENT EFFECT FUNCTION SOKBAE LEE1,2 , RYO OKUI3,4 , AND YOON-JAE WHANG1 Abstract. In this paper, we propose a doubly robust method to present the heterogeneity of the average treatment effect with respect to observed covariates of interest. We consider a situation where a large number of covariates are needed for identifying the average treatment effect but the covariates of interest for analyzing heterogeneity are of much lower dimension. Our proposed estimator is doubly robust and avoids the curse of dimensionality. We propose a uniform confidence band that is easy to compute, and we illustrate its usefulness via Monte Carlo experiments and an application to the effects of smoking on birth weights. Keywords: average treatment effect conditional on covariates, uniform confidence band, double robustness, Gaussian approximation. JEL classification codes: C14, C21 1 Department of Economics, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 151-742, Republic of Korea. 2 Centre for Microdata Methods and Practice, Institute for Fiscal Studies, 7 Ridgmount Street, London, WC1E 7AE, UK. 3 Institute of Economic Research, Kyoto University, Yoshida-Honmachi Sakyo, Kyoto, 606-8501, Japan. 4 Department of Econometrics & OR, VU University Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands. E-mail addresses: sokbae@gmail.com, okui.ryo.3@gmail.com, whang@snu.ac.kr. Date: This version: January 2016; the first draft: November 2014. We would like to thank Yanchun Jin for capable research assistance and Takahiro Hoshino, YuChin Hsu, Kengo Kato, Edward Kennedy, Taisuke Otsu, and Dylan Small for helpful comments. Lee’s work was supported by the European Research Council (ERC-2009-StG-240910-ROMETA). Okui’s work was supported by the Japan Society of the Promotion of Science (KAKENHI 25285067, 25780151, 15H03329). Whang’s work was supported by the SNU College of Social Science Research Grant. 1 2 LEE, OKUI, AND WHANG 1. Introduction In this paper, we propose a doubly robust method to present the heterogeneity of the average treatment effect with respect to observed covariates of interest. To describe our methodology, we consider the potential outcome framework. Let Y1 and Y0 be potential individual outcomes in two states, with treatment and without treatment, respectively. For each individual, the observed outcome Y is Y = DY1 + (1 − D)Y0 , where D denotes an indicator variable for the treatment, with D = 0 if an individual is not treated and D = 1 if an individual is treated. We assume that independent and identically distributed observations {(Yi , Di , Zi ) : i = 1, . . . , n} of (Y, D, Z) are available, where Z ∈ Rp denotes a p-dimensional vector of covariates. Suppose that a researcher is interested in evaluating the average treatment effect conditional on only a subset of covariates X, which is of a substantially lower dimension than Z, where Z ≡ (X> , V> )> ∈ Rd × Rm , p ≡ d + m. That is, we are interested in a case where d p. The main object of interest in this paper is the conditional average treatment effect function (CATEF); namely: (1.1) g(x) ≡ E[Y1 − Y0 |X = x]. When d ≥ 3, it is difficult to plot g(x), not to mention low precision due to the curse of dimensionality. Hence, for practical reasons, we focus on the case that d = 1 or d = 2, while p is often of a much higher dimension. To achieve identification of the CATEF, we assume that Y1 and Y0 are independent of D conditional on Z (known as the unconfoundedness assumption): (1.2) (Y1 , Y0 ) ⊥ D|Z, DOUBLY ROBUST UNIFORM CONFIDENCE BAND 3 where ⊥ denotes the independence. For (1.2) to be plausible in applications, applied researchers tend to consider a large number of covariates Z. Note that in our setup, the treatment may be confounded in the sense that the treatment assignment may not be independent of the potential outcome variables given X only. To satisfy the unconfoundedness condition, a much larger set of conditioning variables Z needs to be employed. Different roles of covariates between X and V are noted in the recent literature. For example, Ogburn, Rotnitzky, and Robins (2015) consider a similar issue in the context of the local average treatment effect (LATE) of Imbens and Angrist (1994). Ogburn, Rotnitzky, and Robins (2015) emphasize that conditioning on a large number of covariates Z may be required to make it plausible that the binary instrument is valid. In their empirical example, Ogburn, Rotnitzky, and Robins (2015) revisit the analyses of Poterba, Venti, and Wise (1995) and Abadie (2003) to examine whether participation in 401(k) pension plans increases household savings. In their example, the vector of covariates Z for the identifying assumption consists of income, age, marital status, and family size, whereas the variable of interest X is income. Abrevaya, Hsu, and Lieli (2015) also consider the case of investigating the effect of smoking during pregnancy on birth weights. They are interested in estimating (1.1) with X being the age of mother; however, as noted in Abrevaya, Hsu, and Lieli (2015), it is unlikely that conditioning only on the age of mother would achieve the unconfoundedness assumption with nonexperimental data. As a result, it is necessary to consider a high-dimensional Z, including the age of the mother. The fact that a high-dimensional Z needs to be employed for (1.2) to be plausible in an application makes a fully nonparametric estimation approach impractical because of the curse of dimensionality. For example, the propensity score is not nonparametrically estimable in moderately sized samples, if the dimension of Z is high. One 4 LEE, OKUI, AND WHANG obvious alternative is to use a parametric model for the propensity score; however, it may lead to misleading results if the parametric model is misspecified. With the aim of providing a practical method and, at the same time, reducing sensitivity to model misspecification, we propose to use a doubly robust method based on parametric regression and propensity score models. Our estimator of the CATEF is doubly robust in the sense that it is consistent when at least one of the regression model and the propensity score model is correctly specified. Specifically, we first estimate CATEF(Z) using a doubly robust procedure: we estimate a parametric regression model of the outcome on Z for each treatment status and a parametric model for the probability of selecting into the treatment given Z; we then combine the parametric estimation results in a doubly robust fashion to construct an estimate of CATEF(Z). We then obtain an estimate of CATEF(X) by adopting the local linear smoothing of CATEF(Z). As a result, we avoid high-dimensional smoothing with respect to Z but mitigate the problem of misspecification by both the doubly robust estimation and low-dimensional smoothing with respect to X. We emphasize that we are willing to assume parametric specifications for the propensity score and regression models as functions of Z to avoid the curse of dimensionality, but not for CATEF(X). One may consider parametric estimation of CATEF(X), as Ogburn, Rotnitzky, and Robins (2015) estimate their LATE parameter using least squares approximations. However, note that even if the parametric specification of CATEF(Z) is correct, the resulting specification of CATEF(X) may not be correctly specified since E[Z|X] is possibly highly nonlinear. To avoid this misspecification, we estimate CATEF(X) nonparametrically. Because the CATEF is a functional parameter, as a tool of inference, we propose to use a uniform confidence band for the CATEF. Our construction of the uniform confidence band is based on some analytic approximation of the supremum of a Gaussian DOUBLY ROBUST UNIFORM CONFIDENCE BAND 5 process using arguments built on Piterbarg (1996), combined with a Gaussian approximation result of Chernozhukov, Chetverikov, and Kato (2014) and an empirical process result of Ghosal, Sen, and van der Vaart (2000). Our method is simple to implement and does not rely on resampling techniques. This paper contributes to the literature on doubly robust estimation by demonstrating that the doubly robust procedures are useful for estimating the CATEF. In this paper, we focus on the so-called augmented inverse probability weighting estimator that was originally proposed by Robins, Rotnitzky, and Zhao (1994) for the estimation of the mean (see also Robins and Rotnitzky, 1995; Scharfstein, Rotnitzky, and Robins, 1999). Their estimator appears to be the first estimator to be recognized as being doubly robust. Since then, many other alternative doubly robust estimators have been proposed in the literature. For example, the inverse probability weighting regression adjustment estimator (Kang and Schafer, 2007; Wooldridge, 2007, 2010) is widely known and has been implemented in statistical software packages. See the introduction of Tan (2010) for a comprehensive summary of other doubly robust estimators. Doubly robust estimators have been advocated for use in many different areas of application: See, for example, Lunceford and Davidian (2004) for medicine, Glynn and Quinn (2010) for political science, Wooldridge (2010) for economics, and Schafer and Kang (2008) for psychology. There are also doubly robust estimators available for different settings including instrumental variables estimation (Tan, 2006; Okui, Small, Tan, and Robins, 2012) and estimation under multivalued treatments (Derya Uysal, 2015). It would not be difficult to extend our method to allow these other doubly robust estimators and to consider different settings. However, to keep the analysis simple, in this paper, we focus on the augmented inverse probability weighting estimator of the CATEF. 6 LEE, OKUI, AND WHANG The CATEF is mathematically equivalent to “V -adjusted variable importance” of van der Laan (2006), who proposes it as a measure of variable importance in prediction. van der Laan (2006) proposes a doubly robust estimator of V -adjusted variable importance. Contrary to ours, he considers the projection of the V -adjusted variable importance on a parametric working model and does not consider a nonparametric estimation. Moreover, a uniform confidence band is not examined in van der Laan (2006). In a recent paper, Abrevaya, Hsu, and Lieli (2015) consider the estimation of the CATEF1; however, there are two main differences of this paper relative to Abrevaya, Hsu, and Lieli (2015). First, we propose the doubly robust procedure to estimate the CATEF. Abrevaya, Hsu, and Lieli (2015) consider the inverse probability weighting estimator. The inverse probability weighting estimator suffers from model misspecification when the propensity score model is misspecified and from the curse of dimensionality when it is estimated nonparametrically. Second, we present a method to construct a uniform confidence band, whereas Abrevaya, Hsu, and Lieli (2015) only provide a pointwise confidence interval. The remainder of the paper is organized as follows. Section 2 presents the doubly robust estimation method, Section 3 gives an informal description of how to construct a two-sided, symmetric uniform confidence band when the dimension of X is one, and Section 4 deals with a general case and provides formal theoretical results. In Section 5, the results of Monte Carlo simulations demonstrate that in finite samples, our doubly robust estimator works well, and the proposed confidence band has desirable coverage properties. Section 6 gives an empirical application, and Section 7 concludes. The proofs are contained in Appendix A. 1Our paper is independent of Abrevaya, Hsu, and Lieli (2015) and it is started without knowing their work. DOUBLY ROBUST UNIFORM CONFIDENCE BAND 7 2. Doubly Robust Estimation of the Average Treatment Effect Conditional on Covariates of Interest In this section, a doubly robust method for estimating the CATEF is proposed. We first estimate the CATEF for all the covariates using a doubly robust method. We then obtain the CATEF for the covariates of interest using a nonparametric approach. Define: π(z) ≡ E [D|Z = z] , µj (z) ≡ E [Y |Z = z, D = j] for j = 0, 1, where π(z) is the propensity score and µj (z) for j = 0, 1 are called regression functions. Note that µj (z) = E(Yj |Z = z) for j = 0, 1 under unconfoundedness. Let π(z, β) and µj (z, αj ) for j = 0, 1 denote parametric models of π(z) and µj (z), respectively. (µj (z, αj ) may also be called “marginal structural models” of Robins (2000).) A doubly robust procedure requires that either π(z) or µj (z) for j = 0, 1 should be correctly specified, thereby allowing for misspecification in π(z) or in µj (z). Let > > θ0 ≡ (α10 , α00 , β0> )> denote the vector of true or pseudo-true parameter values that optimize some criterion functions. We consider the augmented inverse probability weighting approach. Let: ψ1 (W, α1 , β) ≡ DY D − π(Z, β) − µ1 (Z, α1 ), π(Z, β) π(Z, β) ψ0 (W, α0 , β) ≡ (1 − D) Y D − π(Z, β) + µ0 (Z, α0 ), 1 − π(Z, β) 1 − π(Z, β) ψ(W, θ) ≡ ψ1 (W, α1 , β) − ψ0 (W, α0 , β), where W ≡ (Y, Z> )> and θ ≡ (α1> , α0> , β > )> . The first terms in ψ1 (W, α1 , β) and ψ0 (W, α0 , β) correspond to inverse probability weighting. The second terms are augmented terms that make the procedure doubly robust. 8 LEE, OKUI, AND WHANG The following lemma gives regularity conditions under which g(x) is identified. Lemma 1 (Identification of the CATEF). Assume that (1.2) holds and 0 < π(Z, β0 ) < 1 almost surely. Suppose that either there exists β0 such that E [D|Z] = π(Z, β0 ) almost surely or there exist α10 and α00 such that E [Y1 |Z] = µ1 (Z, α10 ) and E [Y0 |Z] = µ0 (Z, α00 ) almost surely. Then: g(x) = E [ψ(W, θ0 )|X = x] . Lemma 1 suggests that one may estimate g(x) by running the nonparametric regression of ψ(W, θ̂) on Xi , where θ̂ is a consistent parametric estimator of θ0 . Moreover, this lemma implies that the CATEF can be identified through ψ(W, θ0 ) if either the regression models (µ1 (z, α1 ) and µ0 (z, α0 )) or the propensity score model (π(z, β)) is correctly specified (or both). That is, even if µ1 (z, α1 ) and µ0 (z, α0 ) do not represent the true conditional expectation functions, provided that π(z, β) is correct, the CATEF is identified. Similarly, even if π(z, β) is misspecified, provided that µ1 (z, α1 ) and µ0 (z, α0 ) are correct, the CATEF is identified. 2.1. Parametric Estimation of θ. For concreteness, we consider the following estimation procedure for θ0 . However, how θ0 is estimated does not alter our results provided that the rate of convergence is sufficiently fast so that Assumption 1(7) given below is satisfied. For each j = 0, 1, we estimate αj by least squares: (2.1) α̂j ≡ argmin α n X Dij (1 − Di )1−j (Yi − µj (Zi , α))2 . i=1 We estimate β by maximum likelihood (e.g., probit or logit): (2.2) β̂ ≡ argmax β n X i=1 (Di log π(Zi , β) + (1 − Di ) log(1 − π(Zi , β))) . DOUBLY ROBUST UNIFORM CONFIDENCE BAND 9 Remark 1. When the dimension of Z is not too high, an alternative to parametric estimation of ψ(W, θ0 ) is to estimate its nonparametric counterpart via local polynomial estimators as in Rothe and Firpo (2014). However, this would not work when the dimension of Z is sufficiently high (see related remarks in Rothe and Firpo (2014)). The latter is the case we focus on in the paper. 2.2. Local Linear Estimation of g. We consider a local linear estimator of g(x).2 Assume that g(x) is twice continuously differentiable. For each x = (x1 , . . . , xd ), the local linear estimator of g(x) can be obtained by minimizing: Sn (γ) ≡ n h X ψ(Wi , θ̂) − γ0 − γ1> i=1 i2 X − x i (Xi − x) K hn with respect to γ ≡ (γ0 , γ1> )> ∈ Rd+1 , where K(·) is a kernel function on Rd and hn is a sequence of bandwidths. More specifically, let ĝ(x) = e> 1 γ̂(x), where γ̂(x) ≡ arg minγ∈Rd+1 Sn (γ) and e1 is a column vector whose first entry is one, and the rest are zero. Because θ̂ can be estimated at a rate of n−1/2 , there is no estimation effect from the first stage, implying that we can carry out inference as if θ0 were known. 3. An Informal Description of a Uniform Confidence Band In this section, we provide an informal description of how to construct a two-sided, symmetric uniform confidence band. For simplicity, we focus on the leading case where d = 1. Let I ≡ [a, b] denote an interval of interest for which we build a uniform confidence band. Assume that I is a subset of the support of X. We use nonbold x to mean that x is one-dimensional. Algorithm. Carry out the following steps to construct a (1 − α) uniform confidence band. 2In this paper, we consider cases in which x is continuous. When x is discreet, the CATEF can be estimated by the sample average of ψ(W, θ̂) using the sub-sample for each value of x. Moreover, constructing a confidence band is standard when x takes a finite number of values. 10 LEE, OKUI, AND WHANG (1) Obtain ĝ(x) using a local linear estimator with a bandwidth hn such that: hn = b h × n1/5 × n−2/7 , where b h is a commonly used optimal bandwidth in the literature (for example, the plug-in method of Ruppert, Sheather, and Wand (1995)). (2) Obtain the pointwise standard error ŝ(x)/(nhn )1/2 of ĝ(x) by constructing a feasible version of the asymptotic standard error formula: ŝ(x) ≡ (nhn )1/2 −1 [nhn fX (x)] Z 1/2 K (u)du σ (x) , 2 2 where fX is the density of X and σ 2 (x) is the conditional variance function. (3) To compute a critical value c(1 − α), define: λ≡ Let: − R K(u)K 00 (u)du R . K 2 (u)du 1/2 λ1/2 −1 an ≡ an (I) = 2 log(hn (b − a)) + 2 log . 2π Now set the critical value by: 1/2 c(1 − α) ≡ a2n − 2 log{log[(1 − α)−1/2 ]} . (4) For each x ∈ I, we set the two-sided symmetric confidence band: ŝ(x) ŝ(x) ĝ(x) − c(1 − α) √ ≤ g(x) ≤ ĝ(x) + c(1 − α) √ . nhn nhn We make some remarks on the proposed algorithm. In step (1), the factor n1/5 × n−2/7 is multiplied in the definition of hn to ensure that the bias is asymptotically negligible by undersmoothing. In step (2), one can estimate fX and σ 2 (x) ≡ Var [ψ(W, θ0 )|X = x] using the standard kernel density and regression estimators DOUBLY ROBUST UNIFORM CONFIDENCE BAND 11 with the same kernel function K(·) and the same bandwidth h and also with an estimator of θ0 . In step (3), we may restrict the bandwidth such that hn (b−a) (which is satisfied asymptotically), thereby imposing the condition that log(h−1 n (b−a)) is positive. The critical value proposed in step (3) is strictly positive if α is not too close to one or if n is large enough. Remark 2. We may compare our proposal with the critical value based on the (1 − α) quantile of the Gumbel distribution, which is given by: − log{log[(1 − α)−1/2 ]} c∞ (1 − α) ≡ an + . an Note that: − log{log[(1 − α)−1/2 ]} c∞ (1 − α) − c(1 − α) = an 2 , which is strictly positive for small α but converges to zero as an diverges. Hence, we expect that in finite samples, the confidence band based on c∞ (1 − α) is too wide and has higher coverage probability than the nominal level. It is shown in the next section that the critical value based on the Gumbel distribution is accurate only up to the logarithmic rate, where our proposed critical value is precise in a polynomial rate. This is because our proposal uses a higher-order expansion of Piterbarg (1996), whose approximation error is of a polynomial rate. See Theorem 2 in Section 4 for details. Remark 3. Our construction of critical values is based on a simple analytic method that is easy to compute. Alternatively, one may rely on bootstrap methods to compute critical values for the uniform confidence band. For example, see Claeskens and Keilegom (2003) for smoothed bootstrap confidence bands and Chernozhukov, Lee, and Rosen (2013) for multiplier bootstrap confidence bands. Chernozhukov, Chetverikov, and Kato (2013) show that in general settings including high dimensional models, 12 LEE, OKUI, AND WHANG Gaussian multiplier bootstrap methods yield critical values for which the approximation error decreases polynomially in the sample size. Roughly speaking, both our simple analytic correction and multiplier bootstrap methods yield critical values that are accurate at polynomial rates. A refined theoretical analysis is necessary to determine which type of the critical value is better asymptotically. Remark 4. The proposed confidence band can be used to test whether the CATEF is constant. Suppose that our null hypothesis is that g(x) is constant in I. This null hypothesis can be written as g(x) = gI , where gI = E[g(x)|x ∈ I]. Since gI can be √ estimated at the parametric ( n ) rate and the estimator thus converges faster than ĝ(x), we can ignore the estimation error in ĝ(x). We reject the constancy of g(x), if the confidence band does not include the estimate of gI for some x ∈ I. 4. Asymptotic Theory In this section, we establish asymptotic theory. Let U ≡ ψ(W, θ0 ) − g(X) and let Ui ≡ ψ(Wi , θ0 ) − g(Xi ) for i = 1, . . . , n. Let ŝ2 (x) be the estimator of the asymptotic variance of ĝ(x). Let s2n (x) denote the population version of the asymptotic variance of the estimator: s2n (x) 1 U2 X−x 2 K ≡ dE 2 . hn fX (x) hn Assume that the d-dimensional kernel function is the product of d univariate kernel Q functions. That is, K(s) = dj=1 K(sj ), where s ≡ (s1 , . . . , sd ) is a d-dimensional Q vector and K(·) is a kernel function on R. Let ρd (s) = dj=1 ρ(sj ), where: R (4.1) ρ(sj ) ≡ K(u)K(u − sj )du R , K 2 (u)du for each j. We make the following assumptions. Assumption 1. Let d < 4. DOUBLY ROBUST UNIFORM CONFIDENCE BAND (1) I ≡ Qd j=1 [aj , bj ], 13 where aj < bj , j = 1, . . . , d, and I is a strict subset of the support of X. (2) The distribution of X has a bounded Lebesgue density fX (·) on Rd . Furthermore, fX (·) is bounded below from zero with continuous derivatives on I. (3) The density of U is bounded, E[U 2 |X = x] is continuous on I, and supx∈Rd E[U 4 |X = x] < ∞. (4) g(·) is twice continuously differentiable on I. Q (5) K(s) = dj=1 K(sj ), where K(·) is a kernel function on R that has finite R1 R1 support on [−1, 1], −1 uK(u)du = 0, −1 K(u)du = 1, symmetric around zero, and six times differentiable. (6) hn = Cn−η , where C and η are positive constants such that η < 1/(2d) and η > 1/(d + 4). (7) inf n≥1 inf x∈I sn (x) > 0 and sn (x) is continuous for each n ≥ 1. Furthermore, x 7→ E [U 2 |X = x] fX (x) is Lipschitz continuous. (8) There exists an estimator ŝ2 (x) such that sup ŝ2 (x) − s2n (x) = Op (n−c ) x∈I for some constant c > 0. n o (9) max (nhdn )1/2 |ψ(Wi , θ̂) − ψ(Wi , θ0 )| : i = 1, . . . , n = Op (n−c ) for some constant c > 0. Most of the assumptions are standard. One of the bandwidth conditions in hn (that is, η > 1/(d + 4)) imposes undersmoothing, so that we can ignore the bias asymptotically. The rule-of-thumb bandwidth proposed in Section 3 satisfies the required rate conditions. Remark 5. Note that d < 4 is necessary to ensure that η < 1/(2d) and η > 1/(d + 4) can hold jointly. It is possible to extend our asymptotic theory to the case that 14 LEE, OKUI, AND WHANG d ≥ 4 using a higher-order local polynomial estimator under stronger smoothness conditions. In this paper, we limit our attention to the local linear estimator since we are mainly interested in low dimensional x. Remark 6. An estimator of ŝ2 (x) is readily available. For example, we may set n X−x 1 X Ûi2 2 K ŝ (x) = d , 2 hn i=1 fˆX hn (x) 2 where Ûi ≡ ψ(Wi , θ̂) − ĝ(Xi ) and fˆX (·) is the kernel density estimator. Alternatively, since s2n (x) σ 2 (x) → fX (x) Z K2 (u) du, as n → ∞, where σ 2 (x) ≡ E [U 2 |X = x], we may use σ̂ 2 (x) ŝ (x) = fˆX (x) 2 Z K2 (u) du, with a suitable nonparametric estimator σ̂ 2 (x) of σ 2 (x) such as the kernel regression estimator using {(Ûi2 , Xi ) : i = 1, . . . , n}. For either estimator, it is straightforward to verify condition (8) of Assumption 1 using the standard results in kernel estimation. Remark 7. Condition (9) of Assumption 1 is satisfied, for example, if kθ̂ − θ0 k = Op (n−1/2 ), functions β 7→ π(Z, β) and αj 7→ µj (Z, αj ), j = 0, 1, are Lipschitz continuous, π(Z, β0 ) is bounded between and 1 − for some constant > 0, provided that some weak moment conditions on (Y, Z) hold. Let an ≡ an (I) be the largest solution to the following equation: (4.2) mes(I)hn −d λd/2 (2π)−(d+1)/2 and−1 exp(−a2n /2) = 1, DOUBLY ROBUST UNIFORM CONFIDENCE BAND where mes(I) is the Lebesgue measure of I; that is, mes(I) = (4.3) λ= − R 15 Qd j=1 (bj − aj ) and: K(u)K 00 (u)du R . K 2 (u)du The following is the main theoretical result of our paper. Theorem 2. Let Assumption 1 hold. Then there exists κ > 0 such that, uniformly in t, on any finite interval: ĝ(x) − g(x) P an max − an < t = x∈I ŝ(x) (4.4) d−2m−1 [(d−1)/2] X t −t−t2 /2a2n −2m exp −2e + O(n−κ ), hm,d−1 an 1+ 2 an m=0 as n → ∞, where hm,d−1 ≡ (−1)m (d−1)! m!2m (d−2m−1)! and and [·] is the integer part of a number. Notice that the approximation error is of a polynomial rate. As a result, a critical value based on the leading term of the right-hand side of (4.4) provides a better approximation than one based on the Gumbel approximation. The result in Theorem 2 may be of independent interest for constructing the uniform confidence band in nonparametric regression beyond the scope of estimating the CATEF in our context. Remark 8. In a setting different from here, Lee, Linton, and Whang (2009) propose analytic critical values based on Piterbarg (1996) in order to test for stochastic monotonicity, compare its performance with the bootstrap critical values in their Monte Carlo experiments, and find that both perform well in finite samples. However, the discussions in Lee, Linton, and Whang (2009) are informal and rely on the results of Monte Carlo experiments without the formal proof of establishing the polynomial approximation error. 16 LEE, OKUI, AND WHANG The conclusion of the theorem can be simplified for special cases. In particular, if d = 1, then: ĝ(x) − g(x) −t−t2 /2a2n P an max + O(n−κ ), − a < t = exp −2e n x∈I ŝ(x) where an is the largest solution to mes(I)hn −1 λ1/2 (2π)−1 exp(−a2n /2) = 1. Also, if d = 2, then: P an ĝ(x) − g(x) t −t−t2 /2a2n max 1 + 2 + O(n−κ ), − an < t = exp −2e x∈I ŝ(x) an where an is the largest solution to mes(I)hn −2 λ2 (2π)−3/2 an exp(−a2n /2) = 1. 4.1. Construction of critical values. We use the leading term on the right-hand side of (4.4) as a distribution-like function to construct a uniform confidence band. For example, if d = 1, we may construct a critical value c(1 − α) that satisfies: Fn,1 (c) ≥ 1 − α, where Fn,1 (t) ≡ exp −2e −t−t2 /2a2n . This yields the critical value presented in the Algorithm of Section 3. Similarly, if d = 2, we can use: Fn,2 (c) ≥ 1 − α, −t−t2 /2a2n where Fn,2 (t) ≡ exp −2e 1+ t a2n . In finite samples, it might be useful to impose monotonicity of Fn,j (·) by rearrangement (see, e.g., Chernozhukov, FerndádezVal, and Galichon (2009)). Remark 9. Theorem 2 implies that: ĝ(x) − g(x) −t lim P an max − a < t = exp −2e . n n→∞ x∈I ŝ(x) DOUBLY ROBUST UNIFORM CONFIDENCE BAND 17 Thus, one may construct analytical critical values based on the Gumbel distribution. However, this approximation is accurate only up to the logarithmic rate in view of Theorem 2. 5. Monte Carlo Experiments In this section, we present the results of Monte Carlo experiments. These experiments are conducted to see the finite sample performances of the proposed doubly robust estimator and the proposed uniform confidence band. The simulations are conducted by R 3.1.2 with Windows 7. The number of replications is 5000. 5.1. Data generating process. The data generating process follows the potential outcome framework, and we borrow the design used by Abrevaya, Hsu, and Lieli (2015). The notations for the variables are the same as those used in the theoretical part of the paper. Recall that p is the number of covariates (i.e., the dimension of Z). We consider two cases: p = 2 and p = 4. We explain the data generating process for each value of p. The data generating process for p = 2 is the following. The vector of covariates Z = (X1 , X2 )> is generated by: X1 = 1 , X2 = (1 + 2X1 )2 (−1 + X1 )2 + 2 , where i ∼ U [−0.5, 0.5] for i = 1, 2 and 1 and 2 are independent. The potential outcomes are generated by: Y1 = X1 X2 + v, Y0 = 0, where v ∼ N (0, 0.252 ) and v is independent of (1 , 2 ). The treatment status D is generated by: D = 1{Λ(X1 + X2 ) > U }, 18 LEE, OKUI, AND WHANG where U ∼ U [0, 1], U is independent of (1 , 2 , v) and Λ is the logistic function. Thus, the propensity score is π(Z) = Λ(X1 + X2 ). The observed outcome is Y = DY1 . Next, we present the data generating process for p = 4. The covariates Z = (X1 , X2 , X3 , X4 )> are generated by: X1 = 1 , X2 = 1 + 2X1 + 2 , X3 = 1 + 2X1 + 3 , X4 = (−1 + X1 )2 + 4 , where i ∼ U [−0.5, 0.5] for i = 1, 2, 3, 4 and 1 , 2 , 3 and 4 are independent. The potential outcomes are generated by: Y1 = X1 X2 X3 X4 + v, Y0 = 0, where v ∼ N (0, 0.252 ) and v is independent of (1 , 2 , 3 , 4 ). The treatment status D is generated by: D = 1{Λ(0.5(X1 + X2 + X3 + X4 )) > U }, where U ∼ U [0, 1] and U is independent of (1 , 2 , 3 , 4 v). The propensity score is π(Z) = Λ(0.5(X1 + X2 + X3 + X4 )). The observed outcome is Y = DY1 . The parameter of interest is the CATEF for X = X1 . In our specification, the CATEF is the same in both p = 2 and p = 4 cases. The CATEF can be written as: CAT EF (x1 ) = x1 (1 + 2x1 )2 (−1 + x1 )2 . We examine the performance of various statistical procedures regarding this CATEF. 5.2. Model specification. To estimate and conduct statistical inferences on CAT EF (x1 ) using our doubly robust procedure, we need to specify a model for the regression µj (z) for j = 0, 1 and a model for the propensity score π(z). We consider two regression models and two propensity score models. One of two models is correctly specified, but the other model is misspecified. We note that our doubly robust procedure is DOUBLY ROBUST UNIFORM CONFIDENCE BAND 19 predicted to work well provided that at least one of the regression models and the propensity score models is correctly specified. We first discuss the model specifications for p = 2. The first regression model is: µ1 (z, α1 ) = α10 + α11 X1 X2 , µ0 (z, α0 ) = α00 + α01 X1 X2 . This model is correctly specified. The coefficients are estimated by OLS using X1 X2 as the explanatory variable. The second regression model, which is misspecified, is: µ1 (z, α1 ) = α10 + α11 X1 + α12 X2 , µ0 (z, α0 ) = α00 + α01 X1 + α02 X2 . The model is estimated by OLS using X1 and X2 as explanatory variables. We also consider two models for propensity score. When p = 2, the first model is correctly specified and it is: π(z, β) = Λ(β0 + β1 X1 + β2 X2 ). The model is estimated by maximum likelihood. The misspecified model is: π(z, β) = (Λ(β0 + β1 X1 + β2 X2 ))2 . This misspecified propensity score is computed by taking the square of the estimated correctly specified propensity score. The model specifications for p = 4 are similar to those for p = 2 but include more covariates. The first regression model (correctly specified) is: µ1 (z, α1 ) = α10 + α11 X1 X2 X3 X4 , µ0 (z, α0 ) = α00 + α01 X1 X2 X3 X4 . 20 LEE, OKUI, AND WHANG The model is estimated by OLS using X1 X2 X3 X4 as the explanatory variable. The misspecified version is: µ1 (z, α1 ) = α10 + α11 X1 + α12 X2 + α13 X3 + α14 X4 , µ0 (z, α0 ) = α00 + α01 X1 + α02 X2 + α03 X3 + α04 X4 . The model is estimated by OLS using X1 , X2 , X3 and X4 as explanatory variables. For the propensity score, we first estimate the correctly specified logit model: π(z, β) = Λ(β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 ). We then compute a misspecified propensity score by taking the square of the estimated propensity score based on the correctly specified model. We estimate CAT EF (x1 ) for x1 ∈ {−0.4, −0.2, 0, 0.2, 0.4} and compute the mean \ bias (“MEAN”), standard deviation (“SD”), the average of standard error for CAT EF (x1 ) (“ASE”), and the mean squared error times 100 (“MSE×100”). The local linear regression is conducted with the Gaussian kernel, and the preliminary bandwidth (ĥ in Algorithm (1)) is chosen by the method of Ruppert, Sheather, and Wand (1995). We also compute the “BIAS”, “SE” and “MSE×100” of the corresponding inverse probability weighting estimators and the regression adjustment estimators. Note that the difference between the proposed method and those alternative methods arises only in the estimation of ψ(W, θ0 ) and the other steps are the same. We examine the coverage probability of the uniform confidence band for CAT EF (x1 ) for the range −0.4 ≤ x1 ≤ 0.4. The nominal coverage probabilities that we consider are 99%, 95% and 90%. We compute the empirical coverage (“CP”), the mean critical value (“Mcri”), and the standard deviation of critical value (“Sdcri”). We also compute the coverage probabilities of the confidence band when the critical values are computed by the Gumbel approximation (“GCP”). DOUBLY ROBUST UNIFORM CONFIDENCE BAND 21 5.3. Results. Tables 1 and 2 summarize the results on the properties of the estimators. The proposed doubly robust estimator of the CATEF works well in finite samples. As the theory indicates, the proposed estimator exhibits small bias provided that at least one of the regression models and the propensity score models is correctly specified. Its standard deviation is typically smaller than that of the inverse probability weighting estimator, and when it is not, the difference is small. We find that the regression adjustment estimator is very precise when the regression model is correctly specified. However, it is markedly vulnerable against model misspecification. The inverse probability weighting estimator also suffers from model misspecification, although it is arguably more robust than the regression adjustment estimator. When both models are misspecified, the doubly robust estimator is biased; however, its mean squared error is typically smaller than those of the alternatives, although we notice that at some points (x = 0 for the inverse probability weighting estimator and x = 0.2 for the regression adjustment estimator), the alternative estimators happen to have small biases. All the estimators have large mean squared errors when x = −0.4. This result is a consequence of the well-known result in nonparametric statistics that it is difficult to estimate the value of a function near the end of the support. The standard error for the proposed doubly robust estimator is, on average, close to the standard deviation. Tables 3 and 4 summarize the finite sample properties of the proposed uniform confidence band. The results show that our uniform confidence band has a good coverage property provided that one of the models is correctly specified. When both models are misspecified, p = 4 and N = 1000, the size distortion is heavy. The average values of the 95% critical values are around 3. Because the pointwise critical value is 1.96 and is much smaller than the uniformly valid critical value, it demonstrates the importance of the uniform property of confidence band. The standard deviations of 22 LEE, OKUI, AND WHANG the critical values are small because they change only if the bandwidth changes. The confidence band based on the Gumbel approximation is very conservative. The results of the Monte Carlo simulation confirm that the proposed doubly robust estimator indeed works well in finite samples provided that one of the regression and propensity score models is correctly specified. The proposed uniform confidence band also has good coverage properties. 6. An Empirical Application We apply our uniformly valid confidence band for the CATEF for the effect of maternal smoking on birth weight where the argument of the CATEF is the mother’s age. Our aim here is to illustrate our confidence band in comparison with alternative confidence bands. We first discuss the background of this application and the dataset used. We then compute various confidence bands for the CATEF and discuss the results. While the purpose of this application is to illustrate our uniformly valid confidence band and not to present new insights on the effect of smoking, it is still informative to discuss the background of this application. Many studies document that low birth weight is associated with prolonged negative effects on health and educational or labor market outcomes throughout life, although there has been a debate over its magnitude. See, e.g., Almond and Currie (2011) for a review. Maternal smoking is considered to be the most important preventable negative cause of low birth weight (Kramer, 1987). There are many studies that evaluate the effect of maternal smoking on low birth weight (Almond and Currie, 2011). The program evaluation approach is employed by, for example, Almond, Chay, and Lee (2005), da Veiga and Wilder (2008) and Walker, Tekin, and Wallace (2009), and panel data analysis is carried out by Abrevaya (2006) and Abrevaya and Dahl (2008). Here, we are interested in how the effect of smoking changes across different age groups of mothers. Walker, Tekin, DOUBLY ROBUST UNIFORM CONFIDENCE BAND 23 and Wallace (2009) examine whether the effect of smoking is larger for teen mothers than for adult mothers and find mixed evidence. Abrevaya, Hsu, and Lieli (2015) also consider this problem in their application. Our dataset consists of observations from white mothers in Pennsylvania in the USA. The dataset is an excerpt from Cattaneo (2010) and is obtained from the STATA website (“http://www.stata-press.com/data/r13/cattaneo2.dta”). Note that the dataset was originally used in Almond, Chay, and Lee (2005). We restrict our sample to white and non-Hispanic mothers, and the sample size is 3754. The outcome of interest (Y ) is infant birth weight measured in grams. The treatment variable (D) is a binary variable that is equal to 1 if the mother smokes and 0 otherwise. The set of covariates Z includes the mother’s age, an indicator variable for alcohol consumption during pregnancy, an indicator for the first baby, the mother’s educational attainment, an indicator for the first prenatal visit in the first trimester, the number of prenatal care visits, and an indicator for whether there was a previous birth where the newborn died. We are interested in how the effect of smoking varies across different values of the mother’s age. Therefore, X is mother’s age in this application. To estimate the CATEF, we use linear regression models for the regression part and a logit model for propensity score. The explanatory variables used in the regression models and the logit models consist of all the elements of Z, the square of the mother’s age, and the interaction terms between the mother’s age and all other elements of Z. We estimate the CATEF in the interval between ages 15 and 35. We compute the following three 95% confidence bands for the CATEF. “Our CB” is the confidence band proposed in this paper. Because X is univariate in this application, we follow the algorithm in Section 3. We use the Gaussian kernel. The preliminary bandwidth (ĥ) is chosen by the method of Ruppert, Sheather, and Wand (1995). “Gumbel CB” is the confidence band in which c(1 − α) in the algorithm is replaced by that based on the Gumbel approximation (see Remark 1). “PW CB” 24 LEE, OKUI, AND WHANG is a pointwise valid confidence band where we replace c(1 − α) in the algorithm by the corresponding value from the standard normal distribution (i.e., 1.96). This provides a valid confidence interval for each point of the CATEF. However, its uniform coverage rate would be smaller than 95%. Figure 1 plots the estimated CATEF and the three 95% confidence bands for the range between 15 and 35 years of age. The figure also contains the average treatment effect estimate (AIPW estimate) for a reference. The widths of the three confidence bands are substantially different. The confidence band based on the Gumbel approximation provides the widest band and may not be very informative. The confidence band that is valid only in a pointwise sense gives the narrowest band. This band is not uniformly valid and so may provide misleading information about the CATEF. On the other hand, this provides valuable information if we are interested at a particular point of the CATEF. The confidence band we propose lay between “Gumbel CB” and “PW CB”. While this band is wider than “PW CB”, it is much narrower than “Gumbel CB” and is uniformly asymptotically valid. We see from this figure that our confidence band is informative while being uniformly valid. The estimated CATEF is decreasing from 15 to around 25 years of age. It is rather stable for the range above 25 years of age. All confidence bands indicate that the CATEF is estimated imprecisely near the ends of the range. Nonetheless, the estimated CATEF indicates that smoking may not have a strong impact when the mother is young. The CATEF is estimated relatively precisely in the middle of the range. For the range between 20 and 30 years of age, even the band based on the Gumbel approximation, which is the widest, does not contain 0. This result provides robust evidence that smoking has a negative impact on birth weight at least for mothers who are 20 to 30 years old. In this particular dataset, the statistical evidence against a constant smoking effect is somewhat weak. The confidence band that is valid DOUBLY ROBUST UNIFORM CONFIDENCE BAND 25 only in a pointwise sense may provide an impression that the smoking effect depends on the mother’s age. However, the uniformly valid confidence band that we propose marginally contains the straight line that is equal to the ATE estimate. This result illustrates that there is a caveat when we use pointwise confidence intervals, as well as the importance of using uniformly valid confidence bands. 7. Conclusion In this paper, we propose a doubly robust method for estimating the CATEF. We consider the situation where a high-dimensional vector of covariates is needed for identifying the average treatment effect but the covariates of interest are of much lower dimension. Our proposed estimator is doubly robust and does not suffer from the curse of dimensionality. We propose a uniform confidence band that is easy to compute, and we illustrate its usefulness via Monte Carlo experiments and an application to the effects of smoking on birth weights. There are a few topics to be explored in the future. First, it would be useful to consider the issue of asymptotic biases of the proposed estimator without relying on undersmoothing. For example, it might be possible to extend the approach of Hall and Horowitz (2013) that avoids undersmoothing for our purposes. Second, it would be an interesting exercise to develop a method for estimating the quantile treatment effects conditional on covariates. Third, it is possible to extend our approach to the local average treatment effect. As mentioned in the Introduction, Ogburn, Rotnitzky, and Robins (2015) consider conditioning on Z to achieve identification, but they estimate the local average treatment effect, say LATE(X), as a function of X. However, their specification of LATE(X) is parametric. Our approach can be adapted to specify LATE(X) nonparametrically and to develop a corresponding uniform confidence band. Fourth, this paper does not cover marginal treatment effects that can be identified using the method of local instrumental variables developed by Heckman and 26 LEE, OKUI, AND WHANG Vytlacil (1999, 2005). It would be interesting to develop a uniform confidence band for the marginal treatment effects. Appendix A. Proofs Proof of Lemma 1. Because DY = DY1 and Y1 and D are independent of each other conditional on Z, write: (A.1) E [D|Z] E [Y1 |Z] E [D|Z] − π(Z, β0 ) − µ1 (Z, α10 )X = x . E [ψ1 (W, α10 , β0 )|X = x] = E π(Z, β0 ) π(Z, β0 ) Suppose that there exists β0 such that E [D|Z] = π(Z, β0 ) almost surely. Then the right-hand side of (A.1) reduces to: E [E [Y1 |Z] |X = x] = E [Y1 |X = x] . Suppose now that there exists α10 such that E [Y1 |Z] = µ1 (Z, α10 ) almost surely. Then the right-hand side of (A.1) again reduces to: E [µ1 (Z, α10 )|X = x] = E [Y1 |X = x] . Analogously, we have E [ψ0 (W, α00 , β0 )|X = x] = E [Y0 |X = x]. The remainder of the appendix gives the proof of Theorem 2. We first establish the linear expansion of the local linear estimator. Lemma 3. sup x∈I p n ĝ(x) − g(x) X 1 Ui Xi − x nhdn − d K = Op n−c ŝ(x) nhn sn (x) i=1 fX (x) hn for some positive constant c > 0. DOUBLY ROBUST UNIFORM CONFIDENCE BAND 27 Proof of Lemma 3. Let Ψ̂ and Ψ0 denote the n-dimensional vectors such that Ψ̂ = {ψ(Wi , θ̂)}ni=1 and Ψ0 = {ψ(Wi , θ0 )}ni=1 , respectively. Let Γ(x) be the n × (d + 1) matrix whose i-th row is [1, (Xi − x)> ], Ω(x) the n-dimensional diagonal matrix n whose i-th element is h−1 n K [(Xi − x)/hn ], G := [g(Xi )]i=1 the n-dimensional vector of regression functions evaluated at data points, and U := (Ui )ni=1 the n-dimensional vector of regression errors. Let e1 denote the (d + 1) × 1 vector whose first element is one and all others are zeros. Write ĝ(x) − g(x) = Tn1 (x) + Tn2 (x) + Rn1 (x), where −1 > Tn1 (x) = e> Γ(x)> Ω(x)U , 1 Γ(x) Ω(x)Γ(x) −1 > Tn2 (x) = e> Γ(x)> Ω(x)G, 1 Γ(x) Ω(x)Γ(x) −1 > > > Rn (x) = e1 Γ(x) Ω(x)Γ(x) Γ(x) Ω(x) Ψ̂ − Ψ0 . We first consider the leading stochastic term Tn1 (x). As in equation (2.10) of Ruppert and Wand (1994), we have that (A.2) n−1 Γ(x)> Ω(x)Γ(x) Pn Xi −x 1 K i=1 h nhdn n = P n Xi −x 1 (Xi − x) i=1 K hn nhd n 1 nhdn 1 nhdn Pn Pn i=1 i=1 K K Xi −x hn Xi −x hn > (Xi − x) (Xi − x)(Xi − x)> . By the standard empirical process results (see e.g., Pollard (1984) or van der Vaart and Wellner (1996)) combined with the usual change of variables, we have that uniformly 28 LEE, OKUI, AND WHANG in x ∈ I, " 1/2 # n 1 X log n Xi − x = fX (x) + o(hn ) + Op , K nhdn i=1 hn nhdn " 1/2 # n Xi − x ∂f (x) log n 1 X X K (Xi − x) = h2n µ2 (K) + o(h2n ) + Op hn , nhdn i=1 hn ∂x nhdn " 1/2 # n Xi − x log n 1 X K (Xi − x)(Xi − x)> = h2n fX (x)µ2 (K) + o(h2n ) + Op h2n , nhdn i=1 hn nhdn where µ2 (K) := R u2 K(u)du. Throughout the remainder of the proof, we let rn (x) = Op (n−c ) , uniformly in x, be a sequence that can be different in different places for some constant c > 0. Then as in (2.11) of Ruppert and Wand (1994), we have that (A.3) −1 −1 n Γ(x)> Ω(x)Γ(x) fX (x)−1 [1 + rn (x)] −[∂fX (x)/∂x]> fX (x)−2 [1 + rn (x)] , = −1 −2 2 −[∂fX (x)/∂x]fX (x) [1 + rn (x)] [µ2 (K)fX (x)hn Id ] [1 + rn (x)] where Id is the d-dimensional identity matrix. The little op (·) terms in equation (2.11) of Ruppert and Wand (1994) are pointwise; however, (A.3) holds uniformly in x ∈ I with polynomially decaying terms rn (x) under our assumptions. Let Γi (x) := [1, (Xi − x)> ]> . Since n 1 X Xi − x n Γ(x) Ω(x)U = Ui K Γi (x), nhdn i=1 hn −1 > we have by (A.3) that Tn1 (x) = Tn11 (x) + Tn12 (x), where n X 1 Tn11 (x) = Ui K nhdn fX (x) i=1 Xi − x hn [1 + rn (x)], DOUBLY ROBUST UNIFORM CONFIDENCE BAND 29 > n X 1 ∂fX (x) Xi − x Tn12 (x) = − d Ui K (Xi − x)[1 + rn (x)]. nhn [fX (x)]2 i=1 hn ∂x Again using the standard empirical process result and the method of change of variables, " Tn11 (x) = Op log n nhdn 1/2 # " and Tn12 (x) = Op hn log n nhdn 1/2 # uniformly in x ∈ I. Therefore, we have shown that n (A.4) X 1 Ui K Tn1 (x) = nhdn fX (x) i=1 Xi − x hn [1 + rn (x)]. We now move on the other remainder terms. The proof of Theorem 2.1 (in particular, equation (2.3)) of Ruppert and Wand (1994) implies that Tn2 (x) = O(h2n ) → 0 at a polynomial rate in n uniformly in x ∈ I. The condition that nhd+4 n corresponds to the undersmoothing condition. It is straightforward to show that (nhdn )1/2 Rn (x) = O(n−c ) uniformly in x for some constant c > 0 due to Assumption 1(9) that n o max (nhdn )1/2 |ψ(Wi , θ̂) − ψ(Wi , θ0 )|i = 1, . . . , n = Op (n−c ) for some constant c > 0 Note that by conditions (7) and (8) of Assumption 1, we have that inf n≥1 inf x∈I sn (x) > 0 and supx∈I |ŝ2 (x) − s2n (x)| = Op (n−c ). Hence, the lemma follows from (A.4) immediately. Define: −1/2 n 1 X Xi − x 1 X−x 2 2 Tn (x) ≡ Ui K and cn (x) ≡ E U K . nhdn i=1 hn hdn hn 30 LEE, OKUI, AND WHANG Note that cn (x) = [fX (x)sn (x)]−1 . By Lemma 3: p ĝ(x) − g(x) d max nhn − cn (x)Tn (x) = Op n−c . x∈I ŝ(x) We now use the result of Chernozhukov, Chetverikov, and Kato (2014) to obtain Gaussian approximations. Define: (A.5) Wn ≡ sup cn (x) x∈I p nhdn [Tn (x) − ETn (x)] . Chernozhukov, Chetverikov, and Kato (2014) established an approximation of Wn by a sequence of suprema of Gaussian processes. For each n ≥ 1, let B̃n,1 be a centered Gaussian process indexed by I with covariance function: (A.6) E[B̃n,1 (x)B̃n,1 (x0 )] = X−x X − x0 −d 0 2 hn cn (x)cn (x )Cov U K K . hn hn Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) establishes the following approximation result. Lemma 4. Let Assumption 1 hold. Then for every n ≥ 1, there is a tight Gaussian random variable B̃n,1 in `∞ (I) with mean zero and covariance function (A.6), and there is a sequence W̃n,1 of random variables such that W̃n,1 =d supx∈I B̃n,1 (x) and as n → ∞: n o |Wn − W̃n,1 | = OP (nhdn )−1/6 log n + (nhdn )−1/4 log5/4 n + (n1/2 hdn )−1/2 log3/2 n . Proof of Lemma 4. To apply Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014), we first note that Assumption 1 implies that all the regularity conditions for Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) are satisfied. They are: DOUBLY ROBUST UNIFORM CONFIDENCE BAND 31 (1) supx∈Rd E[U 4 |X = x] < ∞. (2) K(·) is a bounded and continuous kernel function on Rd , and such that the class of functions K ≡ {t 7→ K(ht + x) : h > 0, x ∈ Rd } is a VC type with envelope kKk∞ . (3) The distribution of X has a bounded Lebesgue density p(·) on Rd . (4) hn → 0 and log(1/hn ) = O(log n) as n → ∞. (5) CI ≡ supn≥1 supx∈I |cn (x)| < ∞. Moreover, for every fixed n ≥ 1 and for every xm ∈ I → x ∈ I pointwise, cn (xm ) → cn (x). Then the desired result is an immediate consequence of Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) with a singleton set G = {U } and with q = 4 (using their notation) in verifying condition (B1)’ of Chernozhukov, Chetverikov, and Kato (2014). We now show that the Gaussian field obtained in Lemma 4 can be further approximated by a stationary Gaussian field. Lemma 5. Let Assumption 1 hold. Then for every n ≥ 1, there is a tight Gaussian random variable B̃n,2 in `∞ (In ) with mean zero and covariance function: E[B̃n,2 (s)B̃n,2 (s0 )] = ρd (s − s0 ) for s, s0 ∈ In ≡ h−1 n I, and there is a sequence of random variables such that W̃n,2 =d supx∈I B̃n,2 (h−1 n x) and as n → ∞: p |W̃n,1 − W̃n,2 | = OP hn log h−d . n Proof of Lemma 5. This lemma can be proved as in the proof of Lemma 3.4 of Ghosal, Sen, and van der Vaart (2000). Let: −1/2 X−x Xi − x 2 2 φn,x (Ui , Xi ) := E U K Ui K , hn hn 32 LEE, OKUI, AND WHANG ϕn,x (Ui , Xi ) := hdn E 2 Z U |Xi fX (Xi ) 2 −1/2 K (u)du Ui K Xi − x hn . As in Remark 8.3 of Ghosal, Sen, and van der Vaart (2000) and in the proof of Lemma 3.4 of Ghosal, Sen, and van der Vaart (2000), we can regard Gaussian processes B̃n,1 and B̃n,2 as Brownian bridges Bn (φn,x ) and Bn (ϕn,x ), respectively, in the sense that EBn (g) = 0 and E[Bn (g)Bn (g 0 )] = cov(g, g 0 ) for g = φn,x , g 0 = φn,x0 or g = ϕn,x , g = ϕn,x0 . Define δn (x) := Bn (φn,x ) − Bn (ϕn,x ). Note that δn (x) is also a mean zero Gaussian process with: E [δn (x)δn (x0 )] = E [{φn,x (U, X) − ϕn,x (U, X)}{φn,x0 (U, X) − ϕn,x0 (U, X)}] . Note that: E {δn (x)} 2 Z Z = −1/2 E U 2 |X = x + hu K2 (u)fX (x + hu)du 2 − E U |X = x + ht fX (x + ht) Z 2 −1/2 2 K (u) du × E U 2 |X = x + ht K2 (t) fX (x + hu)dt = O(h2n ), because x 7→ E [U 2 |X = x] fX (x) is Lipschitz continuous. Thus, the L2 -diameter of δn (·) is O(hn ). In addition, we can show that there exists a constant C > 0 such that: 2 E {δn (x) − δn (x0 )}2 ≤ Ch−2 kx − x0 k . Then arguments similar to those used in the proof of Lemma 3.4 of Ghosal, Sen, and van der Vaart (2000) yield the desired result. DOUBLY ROBUST UNIFORM CONFIDENCE BAND 33 √ Proof of Theorem 2. First note that an = O( log n ) because hn = Cn−η . Lemmas 4 and 5 together imply that: ĝ(x) − g(x) −1 max − B̃n,2 (hn x) = op (an ) . x∈I ŝ(x) Note that B̃n,2 , defined in Theorem 5, is a homogeneous Gaussian field with zero mean and the covariance function ρd (s). Because of the assumption on K(·), the covariance function ρd (s) has finite support and is six times differentiable. The latter property implies that the Gaussian process B̃n,2 is three times differentiable in the mean square sense (see, e.g., Chapter 4 of Rasmussen and Williams (2006)). Then by Theorem 14.3 of Piterbarg (1996) and also by Theorem 3.2 of Konakov and Piterbarg (1984), there exists κ > 0 such that uniformly in t, on any finite interval: −1 P an max B̃n,2 (hn x) − an < t = x∈I −t−t2 /2a2n exp −2e [(d−1)/2] X m=0 hm,d−1 a−2m n t 1+ 2 an d−2m−1 + O(n−κ ) as n → ∞, where an is obtained as the largest solution to the equation: √ mes(I)hn −d detΛ2 d−1 −a2n /2 an e = 1, (2π)(d+1)/2 Λ2 is the covariance matrix of the vector of the first derivative of the Gaussian field B̃n,2 : 2 ∂ r(0) , i, j = 1, . . . , d , Λ2 ≡ cov grad B̃n,2 (t) = − ∂ti ∂tj and [·] is the integer part of a number. Simple calculation yields λ defined in (4.3). √ detΛ2 = λd/2 with 34 LEE, OKUI, AND WHANG References Abadie, A. (2003): “Semiparametric instrumental variable estimation of treatment response models,” Journal of Econometrics, 113(2), 231–263. Abrevaya, J. (2006): “Estimating the Effect of Smoking on Birth Outcomes Using a Matched Panel Data Approach,” Journal of Applied Econometrics, 21, 489–519. Abrevaya, J., and C. M. Dahl (2008): “The effects of birth inputs on birthweight: evidence from quantile estimation on panel data,” Journal of Business and Economic Statistics, 26, 379–397. Abrevaya, J., Y.-C. Hsu, and R. P. Lieli (2015): “Estimating Conditional Average Treatment Effects,” Journal of Business and Economic Statistics, 33(4), 485–505. Almond, D., K. Y. Chay, and D. S. Lee (2005): “The Costs of Low Birth Weight,” Quarterly Journal of Economics, 120, 1031–1083. Almond, D., and J. Currie (2011): “Human Capital Development before Age Five,” in Handbook of Labor Economics, ed. by D. Card, and O. Ashenfelter, vol. 4b, chap. 15, pp. 1315–1486. Elsevier. Cattaneo, M. D. (2010): “Efficient semiparametric estimation of multi-valued treatment effects under ignorability,” Journal of Econometrics, 155, 138–154. Chernozhukov, V., D. Chetverikov, and K. Kato (2013): “Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors,” Annals of Statistics, 41(6), 2786–2819. (2014): “Gaussian approximation of suprema of empirical processes,” Annals of Statistics, 42(4), 1564–1597. Chernozhukov, V., I. Ferndádez-Val, and A. Galichon (2009): “Improving point and interval estimators of monotone functions by rearrangement,” Biometrika, 96(3), 559–575. DOUBLY ROBUST UNIFORM CONFIDENCE BAND 35 Chernozhukov, V., S. Lee, and A. Rosen (2013): “Intersection Bounds: Estimation and Inference,” Econometrica, 81(2), 667–737. Claeskens, G., and I. v. Keilegom (2003): “Bootstrap confidence bands for regression curves and their derivatives,” Annals of Statistics, 31(6), 1852–1884. da Veiga, P. V., and R. P. Wilder (2008): “Maternal Smoking During Pregnancy and Birthweight: A Propensity Score Matching Approach,” Maternal and Child Health Journal, 12(2), 194–203. Derya Uysal, S. (2015): “Doubly Robust Estimation of Causal Effects with Multivalued Treatments: An Application to the Returns to Schooling,” Journal of Applied Econometrics, 30(5), 763–786. Ghosal, S., A. Sen, and A. W. van der Vaart (2000): “Testing Monotonicity of Regression,” Annals of Statistics, 28, 1054–1082. Glynn, A. N., and K. M. Quinn (2010): “An Introduction to the Augmented Inverse Propensity Weighted Estimator,” Political Analysis, 18, 36–56. Hall, P., and J. Horowitz (2013): “A simple bootstrap method for constructing nonparametric confidence bands for functions,” Annals of Statistics, 41(4), 1892– 1921. Heckman, J. J., and E. J. Vytlacil (1999): “Local instrumental variable and latent variable models for identifying and bounding treatment effects,” Proceedings of the National Academy of Sciences, 96, 4730–4734. (2005): “Structural equations, treatment, effects and econometric policy evaluation,” Econometrica, 73(3), 669–738. Imbens, G. W., and J. D. Angrist (1994): “Identification and estimation of local average treatment effects,” Econometrica, 62(2), 467–475. Kang, J. D. Y., and J. L. Schafer (2007): “Demystifying Double Robustness: A comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data,” Statistical Science, 22(4), 523–539. 36 LEE, OKUI, AND WHANG Konakov, V., and V. Piterbarg (1984): “On the convergence rate of maximal deviation distribution for kernel regression estimates,” Journal of Multivariate Analysis, 15(3), 279–294. Kramer, M. S. (1987): “Intrauterine Growth and Gestational Duration Determinants,” Pediatrics, 80, 502–511. Lee, S., O. Linton, and Y.-J. Whang (2009): “Testing for Stochastic Monotonicity,” Econometrica, 77(2), 585–602. Lunceford, J. K., and M. Davidian (2004): “Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study,” Statistics in Medicine, 23, 2937–2960. Ogburn, E. L., A. Rotnitzky, and J. M. Robins (2015): “Doubly robust estimation of the local average treatment effect curve,” Journal of the Royal Statistical Society, Series B, 77(Part 2), 373–396. Okui, R., D. S. Small, Z. Tan, and J. M. Robins (2012): “Doubly Robust Instrumental Variable Regression,” Statistica Sinica, 22, 173–205. Piterbarg, V. I. (1996): Asymptotic Methods in the Theory of Gaussian Processes and Fields. American Mathematical Society, Providence, RI. Pollard, D. (1984): Convergence of Stochastic Processes. Springer, New York, NY. Poterba, J. M., S. F. Venti, and D. A. Wise (1995): “Do 401(k) contributions crowd out other personal saving?,” Journal of Public Economics, 58(1), 1–32. Rasmussen, C. E., and C. K. I. Williams (2006): Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA. Robins, J. M. (2000): “Marginal structural models versus structural nested models as tools for causal inference,” in Statistical Models in Epidemiology, the Environment, and Clinical Trials, ed. by M. E. Halloran, and D. Berry, vol. 116 of The IMA Volumes in Mathematics and its Applications, pp. 95–133. Springer, New York. DOUBLY ROBUST UNIFORM CONFIDENCE BAND 37 Robins, J. M., and A. Rotnitzky (1995): “Semiparametric Efficiency in Multivariate Regression Models with Missing Data,” Journal of the American Statistical Association, 90(429), 122–129. Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994): “Estimation of Regression Coefficients when Some Regressors are not always oberved,” Journal of the American Statistical Association, 89, 846–866. Rothe, C., and S. Firpo (2014): “Semiparametric Two-Step Estimation Using Doubly Robust Moment Conditions,” working paper. Ruppert, D., S. J. Sheather, and M. P. Wand (1995): “An effective bandwidth selector for local least squares regression,” Journal of the American Statistical Association, 90(432), 1257–1270. Ruppert, D., and M. P. Wand (1994): “Multivariate Locally Weighted Least Squares Regression,” Annals of Statistics, 22(3), 1346–1370. Schafer, J. L., and J. Kang (2008): “Average causal effects from nonrandomized studies: A practical guide and simulated example,” Psychological Methods, 13(4), 279–313. Scharfstein, D. O., A. Rotnitzky, and J. M. Robins (1999): “Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models,” Journal of the Americal Statistical Association, 94(448), 1096–1120. Tan, Z. (2006): “Regression and Weighting Methods for Causal Inference Using Instrumental Variables,” Journal of the American Statistical Association, 101(476), 1607–1618. (2010): “Bounded, efficient and doubly robust estimation with inverse weighting,” Biometrika, 97(3), 661–682. van der Laan, M. J. (2006): “Statistical Inference for Variable Importance,” The International Journal of Biostatistics, 2(1), Article 2. 38 LEE, OKUI, AND WHANG van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes. Springer-Verlag, New York, NY. Walker, M., E. Tekin, and S. Wallace (2009): “Teen Smoking and Birth Outcomes,” Southern Economic journal, 75, 892–907. Wooldridge, J. M. (2007): “Inverse Probability Weighted Estimation for General Missing Data Problems,” Journal of Econometrics, 141, 1281–1301. (2010): Econometric analysis of cross section and panel data. The MIT press, second edn. DOUBLY ROBUST UNIFORM CONFIDENCE BAND 39 Table 1. Monte Carlo results: CATEF estimates, p = 2 At x = BIAS SD -0.4 -0.2 0 0.2 0.4 -0.001 0.005 0.004 0 -0.004 0.06 0.042 0.037 0.035 0.038 -0.4 -0.2 0 0.2 0.4 0.002 0.005 0.005 0 -0.005 0.134 0.069 0.048 0.041 0.042 -0.4 -0.2 0 0.2 0.4 0 0.004 0.004 0 -0.004 0.052 0.042 0.036 0.035 0.038 -0.4 -0.2 0 0.2 0.4 0.282 -0.023 -0.034 -0.005 -0.002 0.178 0.075 0.052 0.044 0.045 -0.4 -0.2 0 0.2 0.4 0 0.004 0.002 -0.001 -0.003 0.045 0.032 0.027 0.026 0.028 -0.4 -0.2 0 0.2 0.4 0.002 0.005 0.003 -0.001 -0.004 0.094 0.051 0.035 0.03 0.03 -0.4 -0.2 0 0.2 0.4 0 0.004 0.002 -0.001 -0.003 0.039 0.031 0.027 0.026 0.028 -0.4 -0.2 0 0.2 0.4 0.276 -0.026 -0.037 -0.006 0 0.127 0.057 0.039 0.033 0.033 DR ASE IPW BIAS SD MSE×100 BIAS N = 500 True propensity score model, False regression model 0.057 0.36 -0.001 0.055 0.3 -0.158 0.04 0.182 0.004 0.043 0.183 0.058 0.036 0.138 0.004 0.036 0.129 0.097 0.034 0.124 0 0.037 0.14 -0.002 0.036 0.149 -0.004 0.044 0.192 -0.053 False propensity score model, True regression model 0.117 1.784 0.006 0.147 2.171 -0.001 0.07 0.484 -0.06 0.077 0.955 0.002 0.048 0.23 0.004 0.05 0.254 0.001 0.041 0.169 0.056 0.047 0.543 -0.001 0.042 0.176 0.092 0.056 1.155 -0.002 True propensity score model, True regression model 0.048 0.268 -0.001 0.055 0.3 -0.001 0.039 0.175 0.004 0.043 0.183 0.002 0.035 0.132 0.004 0.036 0.129 0.001 0.034 0.119 0 0.037 0.14 -0.001 0.035 0.145 -0.004 0.044 0.192 -0.002 False propensity score model, False regression model 0.166 11.125 0.006 0.147 2.171 -0.158 0.076 0.615 -0.06 0.077 0.955 0.058 0.052 0.387 0.004 0.05 0.254 0.097 0.045 0.195 0.056 0.047 0.543 -0.002 0.046 0.203 0.092 0.056 1.155 -0.053 N = 1000 True propensity score model, False regression model 0.043 0.199 0 0.04 0.163 -0.156 0.03 0.103 0.004 0.032 0.103 0.06 0.027 0.076 0.002 0.026 0.071 0.098 0.026 0.068 -0.001 0.027 0.075 -0.002 0.027 0.082 -0.003 0.032 0.101 -0.053 False propensity score model, True regression model 0.087 0.881 0.005 0.105 1.106 0 0.052 0.265 -0.06 0.058 0.689 0.001 0.036 0.124 0.002 0.037 0.138 0.001 0.031 0.092 0.055 0.034 0.425 0 0.031 0.093 0.091 0.04 0.994 -0.001 True propensity score model, True regression model 0.036 0.148 0 0.04 0.163 0 0.03 0.1 0.004 0.032 0.103 0.001 0.026 0.072 0.002 0.026 0.071 0.001 0.026 0.067 -0.001 0.027 0.075 0 0.027 0.077 -0.003 0.032 0.101 -0.001 False propensity score model, False regression model 0.123 9.214 0.005 0.105 1.106 -0.156 0.057 0.396 -0.06 0.058 0.689 0.06 0.039 0.288 0.002 0.037 0.138 0.098 0.034 0.114 0.055 0.034 0.425 -0.002 0.035 0.112 0.091 0.04 0.994 -0.053 MSE×100 RA SD MSE×100 0.036 0.023 0.017 0.018 0.026 2.626 0.394 0.978 0.034 0.343 0.024 0.022 0.016 0.017 0.029 0.059 0.047 0.026 0.03 0.085 0.024 0.022 0.016 0.017 0.029 0.059 0.047 0.026 0.03 0.085 0.036 0.023 0.017 0.018 0.026 2.626 0.394 0.978 0.034 0.343 0.025 0.016 0.012 0.013 0.018 2.51 0.382 0.974 0.016 0.313 0.018 0.016 0.012 0.012 0.02 0.034 0.025 0.014 0.015 0.041 0.018 0.016 0.012 0.012 0.02 0.034 0.025 0.014 0.015 0.041 0.025 0.016 0.012 0.013 0.018 2.51 0.382 0.974 0.016 0.313 40 LEE, OKUI, AND WHANG Table 2. Monte Carlo results: CATEF estimates, p = 4 At x = BIAS SD -0.4 -0.2 0 0.2 0.4 -0.001 0.005 0.004 -0.002 -0.002 0.042 0.038 0.036 0.038 0.063 -0.4 -0.2 0 0.2 0.4 -0.001 0.005 0.003 -0.002 -0.002 0.052 0.045 0.041 0.041 0.063 -0.4 -0.2 0 0.2 0.4 -0.001 0.005 0.003 -0.002 -0.002 0.039 0.037 0.035 0.038 0.062 -0.4 -0.2 0 0.2 0.4 0.051 0 -0.009 -0.001 -0.003 0.057 0.049 0.043 0.042 0.066 -0.4 -0.2 0 0.2 0.4 -0.002 0.003 0.003 -0.001 -0.003 0.03 0.029 0.026 0.028 0.048 -0.4 -0.2 0 0.2 0.4 -0.002 0.003 0.003 -0.001 -0.003 0.038 0.034 0.03 0.03 0.047 -0.4 -0.2 0 0.2 0.4 -0.002 0.003 0.003 -0.001 -0.003 0.029 0.027 0.026 0.028 0.047 -0.4 -0.2 0 0.2 0.4 0.049 -0.003 -0.01 0 -0.004 0.041 0.036 0.032 0.031 0.05 DR ASE IPW BIAS SD MSE×100 BIAS N = 500 True propensity score model, False regression model 0.04 0.175 -0.001 0.039 0.153 -0.098 0.037 0.147 0.005 0.037 0.141 0.062 0.034 0.128 0.003 0.035 0.121 0.079 0.037 0.144 -0.002 0.04 0.158 -0.017 0.059 0.399 -0.001 0.065 0.429 -0.044 False propensity score model, True regression model 0.05 0.271 -0.009 0.055 0.314 -0.001 0.045 0.207 -0.02 0.049 0.28 0.003 0.04 0.17 0.003 0.043 0.185 0.003 0.041 0.171 0.033 0.046 0.324 -0.001 0.059 0.392 0.045 0.073 0.732 -0.001 True propensity score model, True regression model 0.037 0.153 -0.001 0.039 0.153 -0.001 0.035 0.138 0.005 0.037 0.141 0.003 0.033 0.122 0.003 0.035 0.121 0.003 0.037 0.144 -0.002 0.04 0.158 -0.001 0.058 0.387 -0.001 0.065 0.429 -0.001 False propensity score model, False regression model 0.057 0.582 -0.009 0.055 0.314 -0.098 0.048 0.24 -0.02 0.049 0.28 0.062 0.042 0.194 0.003 0.043 0.185 0.079 0.042 0.178 0.033 0.046 0.324 -0.017 0.064 0.432 0.045 0.073 0.732 -0.044 N = 500 True propensity score model, False regression model 0.03 0.092 -0.002 0.029 0.083 -0.098 0.028 0.083 0.003 0.028 0.079 0.062 0.026 0.069 0.003 0.025 0.066 0.079 0.028 0.078 -0.001 0.029 0.086 -0.017 0.044 0.228 -0.002 0.05 0.248 -0.045 False propensity score model, True regression model 0.037 0.143 -0.01 0.04 0.171 -0.001 0.033 0.115 -0.023 0.037 0.186 0.002 0.03 0.093 0.003 0.031 0.1 0.002 0.031 0.091 0.036 0.034 0.242 -0.001 0.044 0.222 0.043 0.055 0.493 -0.002 True propensity score model, True regression model 0.028 0.082 -0.002 0.029 0.083 -0.001 0.026 0.077 0.003 0.028 0.079 0.002 0.025 0.066 0.003 0.025 0.066 0.002 0.028 0.077 -0.001 0.029 0.086 -0.001 0.043 0.222 -0.002 0.05 0.248 -0.002 False propensity score model, False regression model 0.043 0.412 -0.01 0.04 0.171 -0.098 0.036 0.134 -0.023 0.037 0.186 0.062 0.031 0.11 0.003 0.031 0.1 0.079 0.032 0.095 0.036 0.034 0.242 -0.017 0.048 0.248 0.043 0.055 0.493 -0.045 MSE×100 RA SD MSE×100 0.03 0.022 0.02 0.025 0.04 1.055 0.43 0.671 0.089 0.351 0.019 0.019 0.014 0.024 0.06 0.037 0.037 0.02 0.056 0.365 0.019 0.019 0.014 0.024 0.06 0.037 0.037 0.02 0.056 0.365 0.03 0.022 0.02 0.025 0.04 1.055 0.43 0.671 0.089 0.351 0.021 0.016 0.015 0.018 0.029 1.015 0.407 0.646 0.062 0.286 0.014 0.014 0.01 0.017 0.046 0.019 0.02 0.01 0.03 0.21 0.014 0.014 0.01 0.017 0.046 0.019 0.02 0.01 0.03 0.21 0.021 0.016 0.015 0.018 0.029 1.015 0.407 0.646 0.062 0.286 DOUBLY ROBUST UNIFORM CONFIDENCE BAND Table 3. Monte Carlo results, CATEF confidence band, p = 2 Confidence level CP Mcri Sdcri GCP N = 500 True propensity score model, False regression model 99% 0.978 3.485 0.078 0.999 95% 0.917 2.98 0.091 0.994 90% 0.855 2.728 0.099 0.977 False propensity score model, True regression model 99% 0.99 3.468 0.106 1 95% 0.958 2.96 0.123 0.998 90% 0.912 2.705 0.134 0.99 True propensity score model, True regression model 99% 0.981 3.483 0.072 1 95% 0.929 2.978 0.084 0.995 0.864 2.726 0.092 0.98 90% False propensity score model, False regression model 99% 0.983 3.505 0.098 1 95% 0.916 3.003 0.113 0.997 90% 0.823 2.753 0.123 0.982 N = 1000 True propensity score model, False regression model 99% 0.982 3.525 0.067 0.999 0.93 3.027 0.078 0.994 95% 90% 0.875 2.779 0.085 0.979 False propensity score model, True regression model 99% 0.993 3.5 0.093 1 95% 0.96 2.998 0.108 0.997 90% 0.919 2.747 0.117 0.994 True propensity score model, True regression model 99% 0.986 3.524 0.061 1 95% 0.936 3.026 0.071 0.995 90% 0.881 2.778 0.077 0.983 False propensity score model, False regression model 99% 0.948 3.545 0.088 1 95% 0.774 3.05 0.102 0.991 90% 0.614 2.804 0.111 0.947 41 42 LEE, OKUI, AND WHANG Table 4. Monte Carlo results, CATEF confidence band, p = 4 Confidence level CP Mcri Sdcri GCP N = 500 True propensity score model, False regression model 99% 0.982 3.488 0.081 1 95% 0.926 2.984 0.095 0.994 90% 0.863 2.732 0.103 0.981 False propensity score model, True regression model 99% 0.987 3.478 0.082 1 95% 0.941 2.972 0.095 0.996 90% 0.886 2.718 0.104 0.985 True propensity score model, True regression model 99% 0.981 3.489 0.082 1 95% 0.924 2.985 0.095 0.996 0.861 2.733 0.103 0.981 90% False propensity score model, False regression model 99% 0.986 3.486 0.079 1 95% 0.938 2.982 0.092 0.996 90% 0.879 2.73 0.1 0.983 N = 1000 True propensity score model, False regression model 99% 0.985 3.524 0.07 1 0.931 3.026 0.081 0.996 95% 90% 0.877 2.778 0.088 0.982 False propensity score model, True regression model 99% 0.988 3.513 0.07 1 95% 0.945 3.013 0.081 0.997 90% 0.893 2.764 0.088 0.986 True propensity score model, True regression model 99% 0.984 3.526 0.069 1 95% 0.933 3.029 0.08 0.995 90% 0.874 2.781 0.087 0.982 False propensity score model, False regression model 99% 0.986 3.523 0.066 1 95% 0.923 3.025 0.077 0.996 90% 0.855 2.777 0.084 0.98 DOUBLY ROBUST UNIFORM CONFIDENCE BAND 43 Figure 1. CATEF for the effect of smoking on birth weights, 95% confidence bands −600 −400 −200 0 200 400 CATEF our CB PW CB Gumbel CB ATE 15 20 25 30 35 mother’s age Note: “CATEF” = the estimated CATEF; “our CB” = the uniformly valid confidence band proposed in this paper; “PW CB” = the confidence band that is valid only in a pointwise sense; “Gumbel CB” = the uniformly valid confidence band based on the Gumbel approximation; “ATE” = the estimated value of the average treatment effect.