Copula Regression Spline Models for Binary Outcomes With Application in Health Care Utilization∗ Rosalba Radice† Department of Economics, Mathematics and Statistics Birkbeck, London, U.K. Malet Street, London WC1E 7HX, U.K. Giampiero Marra and Małgorzata Wojtyś Department of Statistical Science University College London Gower Street, London WC1E 6BT, U.K. Abstract We introduce a framework for estimating the effect that a binary treatment has on a binary outcome in the presence of unobserved and residual confounding, and of non-linear dependence between treatment and outcome. The method is applied to a case study which uses data from the 2008 Medical Expenditure Panel Survey and whose aim is to estimate the effect of private health insurance on health care utilization. Unobserved confounding arises when variables which are associated with both treatment and outcome (hence confounders) are not available. Residual confounding occurs when observed confounders are insufficiently accounted for in the analysis. Treatment and outcome may also exhibit a dependence that cannot be modeled using a linear measure of association. The problem of unobserved confounding is addressed using a two-equation structural latent variable framework, where one equation describes a binary outcome as a function of a binary treatment whereas the other equation determines whether the treatment is received. The residual confounding issue is tackled by modeling flexibly covariate-response relationships using a spline approach. Non-linear dependence between treatment and outcome is dealt with by using copula functions. Related model fitting and inferential procedures are developed and asymptotic arguments presented. The findings of our empirical analysis suggest that the three issues tackled in this paper are present when estimating the impact of private health insurance on health care utilization, and that useful insights can be gained by using the proposed framework. Key Words: Binary outcomes; Copula; Penalized regression spline; Residual confounding; Simultaneous equation estimation; Unobserved confounding. ∗ † Research Report number 321, Department of Statistical Science, University College London. Date: August 2013 r.radice.bbk@ac.uk 1 1 Introduction Quantifying the effect of a treatment on an outcome is a challenging task in observational studies. This is because there can be confounders, that is variables associated with both treatment and outcome, which are not observable as they are either unknown or not readily quantifiable. This problem, known in econometrics as endogeneity of the treatment, constitutes a serious limitation to the use of classic methods since they may yield inconsistent estimates of the effect of interest. The matter is further complicated by the problem of residual confounding which arises when observed confounders are insufficiently accounted for in the analysis; unmodeled confounder-outcome effects enter the unstructured (error) term of the model causing the treatment to be associated with it and hence inducing bias in the treatment effect. An additional issue which warrants consideration is that treatment and outcome might be associated in a non-linear fashion; for instance, there might exist an asymmetric tail dependence that a linear measure of association would not be able to fully capture. In this article, we consider the case in which the researcher is interested in estimating the effect of a binary treatment on a binary outcome in the presence of unobserved and residual confounding, and of non-linear dependence between treatment and outcome. To clarify ideas, let us consider a case study which uses data from the 2008 Medical Expenditure Panel Survey (MEPS) and whose goal is to estimate the effect of having private health insurance on the probability of using health care services. Because this study is based on observational data obtained through a populationbased survey, insurance coverage is not randomly assigned as in a controlled trial but rather is the result of demand and supply variables, including individual preferences and health status. Even if these data are from a detailed survey, it is not possible to fully control for such variables. As a consequence, the effect of insurance coverage is likely to be biased because of the presence of unobserved confounders that are associated with insurance coverage and health care utilization. The problem of residual confounding arises because the effects of observed confounders such as age and education may be complex since they embody productivity and life-cycle effects that are likely to affect private health insurance and health care utilization non-linearly. If these relationships are mis-modeled then the unstructured term of the model and private health insurance will be associated hence causing the effect of insurance on the probability of using health care services to be biased. Finally, insurance and health care utilization might be non-linearly associated. In other words, there might be dependencies that appear in the tails of the distribution of both insurance and health care utilization which need to be modeled. The effect of unobserved confounding can be accounted for by using the recursive bivariate probit (e.g., Greene, 2012; Maddala, 1983). This model controls for unobserved confounding by using a two-equation structural latent variable framework, where one equation describes a binary outcome (health care utilization) as a function of a binary treatment (insurance coverage) whereas the other equation determines whether the treatment is received. The model is completed by assuming that the latent errors of the two equations follow a standard bivariate normal distribution with correlation θ; θ 6= 0 suggests that unobserved confounding is present, hence joint estimation of the two equations is required. Applications in economics and biostatistics include those by 2 Goldman et al. (2001), Jones et al. (2006), Gitto et al. (2006), Latif (2009), Kawatkar & Nichol (2009), Schokkaert et al. (2010) and Li & Jensen (2011). The limitations of this model are, however, the inability to deal with residual confounding and to model non-linear dependencies between the two equations. To deal with the problem of residual confounding, Marra & Radice (2011a) introduced a penalized likelihood estimation approach to estimate simultaneously the parameters of a system of two binary equations where covariate-response relationships are flexibly modeled using a spline approach. Alternative approaches are available in the literature (e.g., Chib & Greenberg, 2007), however the lack of user-friendly software makes them unfeasible for practitioners. To deal with the problem of non-linear dependence between treatment and outcome, Winkelmann (2012) discussed a modification of the recursive bivariate probit that maintains the normal assumption for the marginal distributions of the two equations while introducing nonnormal dependence between them using the Frank and Clayton copulas. The contribution of this article is twofold, one theoretical and the other practical. First, we extend the procedures discussed in Marra & Radice (2011a) and Winkelmann (2012) to deal simultaneously with unobserved and residual confounding, and non-linear dependencies between the model equations. In particular, we generalize the penalized likelihood estimation approach based on the assumption of bivariate normality presented in Marra & Radice (2011a) by allowing for non-normal dependencies; this is achieved by employing the Clayton, Frank, Gumbel, Joe and Student-t copulas and the rotated versions of Clayton, Gumbel and Joe. This generalization is of paramount importance as in many applied studies the kind of dependence is not known a priori and using copulas that can capture only positive or symmetric dependencies, as in Winkelmann (2012), might mean failing to uncover possibly interesting structures in the data. We also provide some theoretical argumentation related to the asymptotic behavior of the proposed estimator and the ensuing formula to calculate the treatment effect. Second, we implement the methods discussed in this article in the R package SemiParBIVProbit (Marra & Radice, 2013). This can be particularly attractive to practitioners who wish to fit such models. No other computational alternatives which consider copula regression spline models for binary outcomes are available in the literature. The rest of the paper is organized as follows. Section 2 discusses the model structure, parameter estimation procedure, copula model selection, confidence intervals and variable selection, and presents some asymptotic results for the introduced penalized maximum likelihood estimator and the ensuing formula used to calculate the impact of the treatment on the outcome. Section 3 applies the method to the MEPS data mentioned above, whereas Section 4 concludes and discusses some possible extensions of the work. 2 Methods 2.1 Model definition The focus is on a pair of random variables (y1i , y2i ), for i = 1, . . . , n, where yvi ∈ {0, 1}, for v = 1, 2, and n represents the sample size. Variable y1i refers to the treatment whereas y2i to 3 ∗ the outcome. The observed yvi can be viewed as determined by a latent continuous variable yvi , ∗ ∗ such that yvi = 1(yvi > 0), for which we assume that yvi ∼ N (ηvi , 1) where ηvi ∈ R is a linear predictor defined in the next section for v = 1, 2. Recall that 1 is the indicator function. The joint cumulative distribution function (cdf) of event (y1i = 1, y2i = 1) can be defined by using the copula representation (Sklar, 1959, 1973), i.e. P(y1i = 1, y2i = 1) = C(P(y1i = 1), P(y2i = 1); θ), where P(yvi = 1) = Φ(ηvi ), Φ(·) is the cdf of the standard univariate normal distribution, C is a two-place copula function and θ is an association parameter measuring the dependence between the two marginals P(y1i = 1) and P(y2i = 1). In other words, the joint distribution is expressed in terms of marginal distributions and a function C that binds them together. A substantial advantage of the copula approach is that the marginal distributions may come from different families. This construction allows researchers to consider marginal distributions and the dependence between them as two separate but related issues. The families considered are Clayton, Frank, Gaussian, Gumbel, Joe and Student-t, as well as rotated versions of the Clayton, Gumbel and Joe copulas. Rotation by 180 degrees leads to a survival copula (C180 ), while rotation by 90 (C90 ) and 270 degrees (C270 ) allows for the modeling of negative dependence which is not possible with the non-rotated and 180 degrees-rotated versions. The families considered here are defined in Table 1 and displayed in Figure 1. The rotated versions can be obtained using (e.g., Brechmann & Schepsmeier, 2013) C90 (ui , vi ) = vi − C(1 − ui , vi ), C180 (ui , vi ) = ui + vi − 1 − C(1 − ui , 1 − vi ), C270 (ui , vi ) = ui − C(ui , 1 − vi ), where ui = P(y1i = 1) and vi = P(y2i = 1); Figure 2 shows the Clayton copula rotated by 0, 90, 180 and 270 degrees. Note that the parameter ranges of the θ for the copulas rotated by 90 and 270 degrees are on a negative scale; e.g., for the Gumbel copula rotated by 90 and 270 degrees θ has to be smaller than −1. As can be seen from Table 1, θ is not especially meaningful, except for the Gaussian and Student-t copulas. However, as shown in Brechmann & Schepsmeier (2013), for each copula there exists a relation between θ and the Kendall’s τ coefficient (which a measure of association that lies in the customary range −1 to 1). For full details on copulas and their properties see Nelsen (2006). The log-likelihood function for the recursive bivariate probit model can be expressed as (Greene, 2012; Maddala, 1983) `= n X i=1 {y1i y2i log P(y1i = 1, y2i = 1) + y1i (1 − y2i ) log P(y1i = 1, y2i = 0) , +(1 − y1i )y2i log P(y1i = 0, y2i = 1) + (1 − y1i )(1 − y2i ) log P(y1i = 0, y2i = 0)} where P(y1i = 1, y2i = 0) = P(y1i = 1)−P(y1i = 1, y2i = 1), P(y1i = 0, y2i = 1) = P(y2i = 1)− 4 C(u, v; θ) −1/θ −θ u + v − 1 −θ−1 log 1 + (e−θu − 1)(e−θv − 1)/(e−θ − 1) −1 −1 n Φ2 (Φ (u), Φ (v); θ) o 1/θ exp − (− log u)θ + (− log v)θ 1/θ 1 − (1 − u)θ + (1 − v)θ − (1 − u)θ (1 − v)θ −1 t2,ν (t−1 ν (u), tν (v); θ) Copula Clayton Frank Gaussian −θ Gumbel Joe Student-t Parameter range θ ∈ (0, ∞) θ ∈ R\ {0} θ ∈ [−1, 1] θ ∈ [1, ∞) θ ∈ (1, ∞) θ ∈ [−1, 1], ν ∈ (2, ∞) Table 1: Definition of some classic copula functions. Φ2 (·, ·; θ) denotes the cumulative distribution function of a standard bivariate normal distribution with correlation coefficient θ and t2,ν (·, ·; θ) that of a standard bivariate Student-t distribution with correlation coefficient θ and ν degrees of freedom. t−1 ν denotes the inverse univariate Student-t distribution function with ν degrees of freedom. Recall that ν controls the heaviness of the tails. P(y1i = 1, y2i = 1) and P(y1i = 0, y2i = 0) = 1−[P(y1i = 1) + P(y2i = 1) − P(y1i = 1, y2i = 1)]. 3 Gaussian 3 Frank 3 Clayton 2 2 2 0.01 0.05 0.05 0.1 0.15 0.15 0.1 1 1 1 0.1 0.15 0.2 0 0 0 0.2 −1 −1 −2 0.05 −2 −1 −2 0.2 −3 −2 −1 0 1 2 3 −3 0.01 −3 −3 0.01 −3 −2 −1 0 2 3 −3 −2 −1 0 1 2 3 3 Student−t 3 Joe 3 Gumbel 1 2 2 2 0.01 0.05 1 1 1 0.1 0.2 0.2 0.1 0.05 −2 −2 0.05 0.01 −2 −1 0 1 2 3 −3 −3 0.01 −3 −3 0.15 −2 0.1 −1 0.15 −1 −1 0.15 0 0 0 0.2 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 Figure 1: Contour plots of the six copula functions defined in Table 1 with standard normal margins for data simulated using a Kendall’s τ coefficient of 0.5. The Gaussian, Student-t (here with ν = 3) and Frank copulas allow for equal degrees of positive and negative dependence. Gaussian and Frank show a weaker tail dependence as compared to Student-t, and Frank exhibits a slightly stronger dependence in the middle of the distribution. Clayton is asymmetric with a strong lower tail dependence but a weaker upper tail dependence. Vice versa for the Gumbel and Joe copulas. 5 3 90 degrees 3 0 degrees 0.01 2 2 0.01 0.05 0.05 0.1 1 1 0.1 0.15 0.15 0.2 0 −3 −2 −1 −3 −2 −1 0 0.2 −3 −2 −1 0 1 2 3 −3 −2 0 1 2 3 2 3 2 1 0 0 1 2 3 270 degrees 3 180 degrees −1 0.2 0.2 0.05 0.01 −3 −2 −1 0.15 −3 −2 −1 −3 −2 −1 0.15 0.1 0 1 2 0.1 0.05 0.01 3 −3 −2 −1 0 1 Figure 2: Contour plots of the Clayton copula with standard normal margins rotated by 0, 90, 180 and 270 degrees for data simulated using a Kendall’s τ coefficient of 0.5 for positive dependence and of −0.5 for negative dependence. The plots for the rotated Gumbel and Joe copulas follow the same rationale. 2.1.1 Linear predictor specification The linear predictor for the treatment can be written as η1i = uT1i α1 + K1 X s1k1 (z1k1 i ), (1) k1 =1 whereas that for the outcome as η2i = ψy1i + uT2i α2 + K2 X s2k2 (z2k2 i ), (2) k2 =1 where ψ is the effect of the treatment on the outcome on the scale of the linear predictor, uT1i = (1, u12i , . . . , u1P1 i ) is the ith row of U1 = (u11 , . . . , u1n )T , the n × P1 model matrix containing P1 parametric terms (e.g., intercept, dummy and categorical variables), α1 is a coefficient vector, and the s1k1 are unknown smooth functions of the K1 continuous covariates z1k1 i . Varying coefficient models can be obtained by multiplying one or more smooth terms by some predictor(s) (Hastie & Tibshirani, 1993), and smooth functions of two or more covariates can also be considered (Wood, 2006). Similarly, uT2i = (1, u22i , . . . , u2P2 i ) is the ith row vector of the n × P2 model matrix U2 = (u21 , . . . , u2n )T , α2 is a parameter vector, and the s2k2 are unknown smooth terms of the K2 continuous regressors z2k2 i . The smooth functions are subject to the centering P (identifiability) constraint ni=1 svkv (zvkv i ) = 0, v = 1, 2, kv = 1, . . . , Kv (Wood, 2006). The smooth functions are represented using the regression spline approach (e.g., Ruppert et al., 2003). Specifically, svkv (zvkv i ) is approximated by a linear combination of known spline basis 6 PJvkv βvkv j bvkv j (zvkv i ) = functions, bvkv j (zvkv i ), and regression parameters, βvkv j , i.e. svkv (zvkv i ) = j=1 Bvkv (zvkv i )T βvkv , where Jvkv is the number of spline bases used to represent svkv (·), Bvkv (zvkv i )T is the ith vector of dimension Jvkv containing the basis functions evaluated at the observation T zvkv i , i.e. Bvkv (zvkv i ) = bvkv 1 (zvkv i ), bvkv 2 (zvkv i ), . . . , bvkv Jvkv (zvkv i ) , and βvkv is the corresponding parameter vector. Evaluating Bvkv (zvkv i ) for each i yields Jvkv curves with different degrees of complexity which multiplied by some value for βvkv and then summed will give a (linear or non-linear) estimate for svkv (zvkv ); see Ruppert et al. (2003) for a detailed overview. Basis functions should be chosen to have convenient mathematical and numerical properties. We employ low rank thin plate regression splines (Wood, 2006), although B-splines and cubic regression splines are supported in our implementation. The cases of smooth terms multiplied by some covariate(s) and of smooths of more than one variable follow a similar construction; see Wood (2006, Chapter 4) for full details. Linear predictors (1) and (2) can, therefore, be written as η1i = uT1i α1 +BT1i β1 and η2i = ψy1i +uT2i α2 +BT2i β2 , where BTvi = Bv1 (zv1i )T , . . . , BvKv (zvKv i )T T T and βvT = (βv1 , . . . , βvK ). After defining X1i = (uT1i , BT1i )T and X2i = (y1i , uT2i , BT2i )T , we have v η1i = XT1i δ1 and η2i = XT2i δ2 where δ1T = (αT1 , β1T ) and δ2T = (ψ, αT2 , β2T ). To identify the parameters in η2i , it is typically assumed that the exclusion restriction on the exogenous variables holds: the regressors in the treatment equation should contain at least one or more covariates (usually referred to as instruments) not included in the outcome equation. However, as shown for instance in Han & Vytlacil (2013), Marra & Radice (2011a) and Wilde (2000), the presence of this restriction may not be necessary to obtain consistent estimates of the model parameters. 2.2 Sample average treatment effect The effect of the treatment y1i on the response probability P(y2i = 1|y1i ) is of primary interest. That is, the aim is to investigate how the treatment changes the expected outcome. Thus, the treatment effect is given by the difference between the expected outcome with treatment and the expected outcome without treatment. Different measures of treatment effect have been proposed in the literature. Here, we focus on the average treatment effect in the specific sample at hand (SATE; Imbens, 2004), which in our case can be defined as n 1X P(y2i = 1|y1i = 1) − P(y2i = 1|y1i = 0), SATE(δ, X) = n i=1 where P(y2i = 1|y1i = 1) = P(y2i = 1|y1i = 0) = (y1i =1) C Φ(η1i ), Φ(η2i ); θ , Φ(η1i ) (y =0) (y =0) Φ(η2i 1i ) − C Φ(η1i ), Φ(η2i 1i ); θ 1 − Φ(η1i ) (y =r) , represents the linear predictor the linear predictors are defined in the previous section, η2i 1i evaluated at r equal to 1 or 0, δ T = (δ1T , δ2T , θ∗ ), with θ∗ being a transformation of θ which is 7 useful in estimation (see the next section), and X = (x1 | . . . |xn )T with xi defined as (XT1i , XT2i )T . SATE(δ, X) can be estimated using SATE(δ̂, X), whereas a confidence interval for it can be obtained employing the delta method. Specifically, the appropriate estimator of the asymptotic variance of SATE(δ̂, X) is ∂SATE(δ, X) ∂δ T δ=δ̂ where Vδ is defined in Section 2.5 and ∂SATE(δ, X) V̂δ , ∂δ δ=δ̂ (3) #T " ∂SATE(δ, X) T ∂SATE(δ, X) T ∂SATE(δ, X) ∂SATE(δ, X) , , , = ∂δ ∂δ1 ∂δ2 ∂θ∗ with elements defined in Appendix A. Alternatively, Bayesian posterior simulation can be employed as explained in Section 2.5. 2.3 Estimation Let us denote the log-likelihood of a given copula function as `(δ), where δ is defined in the previous section. Note that, because θ is bounded in most cases (see Table 1), we use a proper transformation of it, θ∗ , to ensure that in optimization δ ∈ Rp , where p is the total number of parameters; see Table 2 for the list of transformations employed. Given the flexible linear predictor structure considered here, unpenalized estimation can result in smooth term estimates that are too rough to produce practically useful results (e.g., Ruppert et al., 2003). This issue is dealt with by 2 R 00 P P v svkv (zvkv ) dzvkv which measures the second-order using the penalty term 2v=1 K kv =1 λvkv roughness of the smooth terms in the model. The λvkv are smoothing parameters controlling the trade-off between fit and smoothness. Since regression splines are linear in their model parameters, P P v the overall penalty can be written as β T Sλ β where β T = (β1T , β2T ), Sλ = 2v=1 K kv =1 λvkv Svkv and the Svkv are positive semi-definite known square matrices. The formulas for the bvkv j (zvkv i ) and Svkv depend on the type of spline employed and we refer the reader to Ruppert et al. (2003) and Wood (2003, 2006) for these details. The function to maximize is 1 `p (δ) = `(δ) − β T Sλ β, 2 (4) where the penalty term can also be written as 1/2δ T S̃λ δ where S̃λ is an overall penalty matrix defined as diag(0TP1 , λ1k1 S1k1 , . . . , λ1K1 S1K1 , 0, 0TP2 , λ2k2 S2k2 , . . . , λ2K2 S2K2 , 0) with 0TPv = (0v1 , . . . , 0vPv ). Given a parameter vector value for λ̂ = (λ̂1k1 , . . . , λ̂1K1 , λ̂2k2 , . . . , λ̂2K2 ), we seek to maximize (4). In practice, this is achieved by using a trust region algorithm (Nocedal & Wright, 2006, [a] Chapter 4). Specifically, let us define the penalized gradient and Hessian at iteration a as gp = [a] [a] [a] g[a] − S̃λ̂ δ [a] and H[a] − S̃λ̂ , where g[a] is made up of g1 = ∂`(δ)/∂δ1 |δ1 =δ[a] , g2 = p = H 1 [a] ∂`(δ)/∂δ2 |δ2 =δ[a] and g3 = ∂`(δ)/∂θ∗ |θ∗ =θ[a] , and the Hessian matrix has a 3 × 3 matrix block 2 ∗ 8 Copula Clayton Frank Gaussian/Student-t Gumbel Joe θ∗ log(θ − ) θ− tanh−1 (θ) log(θ − 1) log(θ − 1 − ) Table 2: Transformations, θ∗ , of the dependence parameter, θ, used in optimization. Quantity is set to the machine smallest positive floating-point number multiplied by 106 and is used in some cases to ensure that the dependence parameters lie in the ranges reported in Table 1. [a] structure with (r, h)th element Hr,h = ∂ 2 `(δ)/∂δr ∂δhT |δr =δr[a] ,δ =δ[a] , r, h = 1, . . . , 3, where δ3 = h h θ∗ ; details on the structure of g and H can be found in Appendix B. Each iteration of the trust region algorithm solves the problem 1 T [a] [a] max `˘p (δ) = `p (δ [a] ) + δ T g[a] p + δ Hp δ so that kδk ≤ r , δ 2 δ [a+1] = arg max `˘p (δ), δ where k · k denotes the Euclidean norm and r[a] represents the radius of the trust region. At each iteration of the algorithm, the quadratic approximation of the objective function at δ [a] is maximized subject to the constraint that the solution falls within a trust region with radius r[a] . The proposed solution is then accepted or rejected and the trust region expanded or shrunken based on the ratio between the improvement in the objective function when going from δ [a] to δ [a+1] and that predicted by the quadratic approximation. The exact details of the implementation used here can be found in Geyer (2013) who also discusses numerical stability and termination criteria. Despite there is no guarantee that a trust region algorithm will always converge faster than standard approaches, it may work better for difficult optimization problems (Geyer, 2013; Nocedal & Wright, 2006). In our experience, this approach proved to be considerably faster and more stable than classic alternatives. In the Student-t case, we fix the degrees of freedom ν as the derivatives for this parameter are involved, and g and H would then be in four (rather than three) main dimensions, hence making the algorithm slower and less stable. As a solution, we propose fitting as many models as the number of a set of ν values and assessing the sensitivity of the results to this parameter. In our experience, this proved to be a practical and effective means of handling ν. 2.3.1 Smoothness selection If the model has more than one smooth term per equation, then smoothing parameter selection by direct grid search optimization of, for instance, a prediction error criterion can be computationally burdensome. It is therefore pivotal for practical modeling to estimate λ in an automatic way. There are many techniques for automatic multiple smoothing parameter selection within the penalized likelihood framework; see Ruppert et al. (2003) and Wood (2006) for detailed overviews. Note that joint estimation of δ and λ via maximization of (4) would clearly lead to over-fitting since the 9 highest value for `p (δ) would be obtained when λ = 0. In fact, λ should be selected so that the estimated smooth components are as close as possible to the true functions. We essentially adopt the approach described in Marra & Radice (2011a) which applies a modified version of the unbiased risk estimator (UBRE; Craven & Wahba, 1979; Gu, 2002) to each working linear model obtained from a trust region iteration. Specifically, the problem to solve is 1 √ [a] [a] 2 [a] k W (z − X̃δ)k2 − 1 + γtr(Aλ ), ň ň = arg min Vu (λ), min Vu (λ) = λ λ[a+1] (5) λ √ [a] where ň = 2n, W is the square root of a block diagonal matrix made up of 2 × 2 ma[a] [a] trices Wi with (r, h)th element given by −∂ 2 `(δ)i /∂ηri ∂ηhi |ηri =η[a] ,η =η[a] , r, h = 1, 2, zi is hi ri hi n oT −1,[a] [a] [a] [a] the 2-dimensional vector X̃i δ + Wi di , di = ∂`(δ)i /∂η1i |η1i =η[a] , ∂`(δ)i /∂η2i |η2i =η[a] , 1i T 2i T T [a] X̃i = diag X1i , X2i , with X1i and X2i defined in Section 2.1.1, X̃ = X̃1 | . . . |X̃n , Aλ = T T X̃(X̃ W[a] X̃ + Šλ )−1 X̃ W[a] is the hat matrix, [a] Šλ = diag(0TP1 , λ1k1 S1k1 , . . . , λ1K1 S1K1 , 0, 0TP2 , λ2k2 S2k2 , . . . , λ2K2 S2K2 ), tr(Aλ ) represents the estimated degrees of freedom (edf ) of the penalized model (e.g., Wood, 2006), and γ is a tuning parameter which can be increased from its usual value of 1 to obtain smoother models (Kim & Gu (2004) suggest using γ ≈ 1.4 to correct the tendency to over-fitting of prediction error criteria). In practice, this approach is implemented using the automatic method by Wood (2004), which is based on Newton’s method and can evaluate the approximate bivariate UBRE and its first and second derivatives with respect to log(λ), since the estimated smoothing parameters must be positive, in an efficient and stable way. Broadly speaking, this is achieved using a series of pivoted QR and [a] singular value decompositions, which make evaluation of tr(Aλ ) for new trial values of λ cheap and derivative calculations efficient and stable; see Wood (2004) for full details. The two steps, one for δ and the other for λ, are iterated until convergence. They can be summarized as follows: step 1 For a given parameter vector value δ [a] and holding the smoothing parameter vector fixed at λ[a] , find an estimate of δ: δ [a+1] = arg max `˘p (δ). δ step 2 Using δ [a+1] construct the working linear model quantities needed in (5) and find an estimate of λ: λ[a+1] = arg min Vu (λ). λ √ √ In step 2, quantities W−1 d, Wz and WX̃ are efficiently set up by exploiting the sparse structure of W. As opposed to the implementation discussed in Marra & Radice (2011a), where θ∗ plays a role in Vu (λ) and as a consequence ň = 3, for convenience the working linear model quantities in (5) are constructed for fixed values of θ∗ . This is because θ∗ is not either penalized or specified as a function of a linear predictor, such as η1i , whose components may need to be penalized. Hence, the algorithm presented here is more efficient than that in Marra & Radice (2011a). 10 2.4 Copula model selection Copula models with a single dependence parameter can be thought of as non-nested models (except, of course, for Gaussian and Student-t when ν → ∞). As suggested by Zimmer & Trivedi (2006) among others, one approach for choosing between copula models is to use either the Akaike or (Schwarz) Bayesian information criterion (AIC and BIC, respectively). In our case, AIC = −2`(δ̂) + 2edf and BIC = −2`(δ̂) + log(n)edf , where the log-likelihood is evaluated at the penalized parameter estimates and edf = tr(Âλ̂ ). As pointed out by Brechmann & Schepsmeier (2013), the tests proposed by Vuong (1989) and Clarke (2007) may be more reliable than information criteria when non-nested models are compared. These are essentially likelihood-ratio-based tests which measure the distance between two statistical models. In the Vuong test, the null hypothesis is that the two competing models are equally close to the actual model, whereas the alternative is that one model is closer. Let `C1 (δ̂C1 ) and `C2 (δ̂C2 ) denote the log-likelihoods of two bivariate copula models C1 and C2 evaluated at their respective parameter estimates, edfC1 and edfC2 be their estimated degrees of freedom and qi be equal to `i,C1 (δ̂C1 ) − `i,C2 (δ̂C2 ). In the current context, the Vuong test is n o edfC1 −edfC2 `C1 (δ̂C1 ) − `C2 (δ̂C2 ) − log(n) 2 d qP V = −→ N (0, 1), H0 n 2 i=1 (qi − q̄) where H0 : E(qi ) = 0 ∀i = 1, . . . , n. For a significance level ς, if V > Φ−1 (1 − ς/2) then model C1 is preferred over C2 and vice versa when V < Φ−1 (1 − ς/2). If |V | ≤ Φ−1 (1 − ς/2) then we cannot discriminate between the two models given the data and as a consequence H0 cannot be rejected. In the Clarke test, H0 : P(qi > 0) = 0.5 ∀i = 1, . . . , n. That is, if the two models are statistically equivalent then the log-likelihood ratios of the observations should be evenly distributed around zero and 50% of the ratios should be larger than zero. The Clarke test is n X edfC1 − edfC2 d 1 qi − B= log(n) −→ Bin(n, 0.5). H0 2n i=1 Critical values can be obtained as shown in Clarke (2007). Intuitively, model C1 is preferred over C2 if B is significantly larger than its expected value under the null hypothesis (n/2), and vice versa. If B is not significantly different from n/2 then C1 can be thought of as equivalent to C2 . 2.5 Confidence intervals and variable selection The parameter estimator for δ can be written in iteratively re-weighted least squares form as δ̂ = T T T −1 (X̃ WX̃ + S̃λ )−1 X̃ Wz, where X̃ WX̃ + S̃λ = −Hp . It then follows that Vδ̂ = −H−1 p HHp . As in Marra & Radice (2011a), the alternative Bayesian result Vδ = −H−1 p can be employed as well. As shown by Marra & Wood (2012), at finite sample sizes Vδ can produce intervals with close to nominal coverage probabilities. This is because the Bayesian covariance matrix includes both a bias and variance component in a frequentist sense, which is a feature that is not shared by 11 Vδ̂ . Note that for unpenalized model components Vδ and Vδ̂ are equivalent. Confidence intervals for linear functions of the model parameters can be easily constructed. For instance, intervals for ŝvkv (zvkv i ) can be based on N (svkv (zvkv i ), Bvkv (zvkv i )T Vδvkv Bvkv (zvkv i )), where Vδvkv is the sub-matrix of Vδ corresponding to the regression spline parameters associated with ŝvkv (zvkv i ). Intervals for non-linear functions of the model coefficients (e.g., θ, Kendall’s τ and SATE) can be conveniently obtained by simulation from the posterior distribution of δ as follows: step 1 Draw nsim random vectors from N (δ̂, V̂δ ). step 2 Calculate the nsim simulated realizations of the function of interest. For instance, for a Gaussim sian copula θ = tanh(θ∗ ), hence θ sim = (θ1sim , θ2sim , . . . , θnsim ) where θisim = tanh(θ∗,i ), sim i = 1, . . . , nsim . step 3 Using θ sim calculate the lower, (ς/2), and upper, 1 − ς/2, quantiles. Small values for nsim are typically tolerable; nsim is set to 1000 because the above procedure is relatively inexpensive to run. The significance level is usually set to 0.05. It may be argued that intervals for non-linear functions can be obtained using the delta method (see, e.g., Section 2.2). However, for functions like Kendall’s τ such a method may not be easy to apply. Strictly speaking, point-wise confidence intervals for smooth components are not adequate for variable selection, although they are often used in practice for this purpose (e.g., Ruppert et al., 2003). As shown by Wood (2013), in the context of extended generalized additive models, and discussed by Marra (2013), for semiparametric bivariate probit models, ŝvkv v̇N (svkv , Vsvkv ), where svkv = [svkv (zvkv 1 ), svkv (zvkv 2 ), . . . , svkv (zvkv n )]T , Vsvkv = Bvkv (zvkv )Vδvkv Bvkv (zvkv )T and Bvkv (zvkv ) denotes a full column rank matrix such that ŝvkv = Bvkv (zvkv )β̂vkv . It is then possible to obtain approximate p-values, for testing the hypothesis svkv = 0, based on vkv − ŝvkv v̇χ2rvkv , Trvkv = ŝTvkv Vsrvk v vkv − is the rank rvkv Moore-Penrose pseudo-inverse of Vsvkv , which can deal with rank where Vrsvk v deficiencies. Parameter rvkv is selected using the notion of edf used in (5). Because edf is not an integer, it can be rounded as follows (Wood, 2013) rvkv = ( floor(edfvkv ) if edfvkv < floor(edfvkv ) + 0.05 floor(edfvkv ) + 1 otherwise , which proved effective in semiparametric bivariate probit models (Marra, 2013). Alternatively, variable selection can be achieved adopting a single penalty shrinkage approach as described in Marra & Radice (2011a) and Marra & Wood (2011). 12 2.6 Asymptotic considerations In this section, we present some argumentation related to the asymptotic behavior of the penalized maximum likelihood estimator defined as δ̂ = arg max `p (δ), δ where `p (δ) is given in (4), and the behavior of the ensuing SATE estimator constructed in Section 2.2. We consider the situation in which the spline bases approximating the smooth components {bvkv j , j = 1, . . . , Jvkv , kv = 1, . . . , Kv , v = 1, 2} are of a fixed dimension, i.e. the Jvkv are fixed. Note that the unknown smooth functions {svkv , kv = 1, . . . , Kv , v = 1, 2} may not have an exact representation as linear combinations of given basis functions and consequently the unknown functions and parameters may not be asymptotically identified by their estimators as the sample size grows. However, the case of fixed basis dimensions is of relevance as in practice these have to be fixed. In this scenario, the method provides estimates which tend in probability to quantities best approximating the unknown functions and parameters in terms of Kullback-Leibler measure. (Recall that the Kullback-Leibler distance between two density functions f and g is defined as R∞ KL(f ||g) = −∞ f log(f /g) if f is absolutely continuous with respect to g and 0 otherwise.) Let Lt be the likelihood function for the true model which, in our case, contains the true smooth functions appearing in the linear predictors η1 and η2 given in (1) and (2) and true value of the dependence parameter θ∗ of a given copula, and let `t be the corresponding log-likelihood. Then the Kullback-Leibler distance between the likelihood Lt in the true model and the likelihood L(δ) PK2 P 1 s (z ) and in the model where K 1k 1k i 1 1 k2 =1 s1k2 (z2k2 i ) are replaced with their spline approxk1 =1 T T imations B1i β1 and B2i β2 is equal to KL(Lt ||L(δ)) = E `t − `(δ) , where the expectation is taken with respect to the true model distribution and δ = (δ1T , δ2T , θ∗ )T . T T Define parameter vector δ 0 = (δ10 , δ20 , θ∗0 )T as the minimizer of the above distance, that is δ 0 = arg min KL Lt ||L(δ) . δ It follows that δ 0 is the maximizer of the expected unpenalized log-likelihood `(·) and as a consequence Eg(δ 0 ) = 0. Remind that g(δ) and H(δ) denote the gradient vector and Hessian matrix of `(·) calculated at a point δ and let gp (δ) = g(δ) − S̃λ δ and Hp (δ) = H(δ) − S̃λ be the penalized versions of them. Below, we define classic conditions related to the score vector, Hessian and Fisher information matrix as well as the penalty matrix (see, e.g., Kauermann (2005) who used similar assumptions in the context of survival models). The assumptions are (A1) g(δ 0 ) = OP (n1/2 ), (A2) EH(δ 0 ) = O(n), (A3) H(δ 0 ) − EH(δ 0 ) = OP (n1/2 ), 13 (A4) S̃λ = o n1/2 , where S̃λ is defined in Section 2.3. Conditions (A1) and (A3) are the usual assumptions of n1/2 asymptotics (e.g., Barndorff-Nielsen & Cox, 1989). Note that, as the n observations are assumed to be independent, g(δ 0 ) and H(δ 0 ) are made up of sums of independent random variables. Assumptions (A1) and (A3) imply that, given the model, the average values n1 g(δ 0 ) and n1 H(δ 0 ) over the random sample converge in probability to their expected values at the rate n−1/2 . Condition (A4) can be equivalently formulated as λvkv = o n1/2 for kv = 1, . . . , Kv , v = 1, 2, assuming that the matrices Svkv are asymptotically bounded. Theorem 1. Under conditions (A1)-(A4) we have δ̂ − δ 0 = OP (n−1/2 ) as n → ∞. Let SATE0 be equal to n 1 X 0 P (y2i = 1|y1i = 1) − P0 (y2i = 1|y1i = 0) , SATE = n i=1 0 where P0 (y2i = 1|y1i = 1) = 0 P (y2i = 1|y1i = 0) = 0 (y1i =1) ),Φ(η 0 );θ 0 C Φ(η2i 1i , 0 ) Φ(η1i (y1i =0) 0 0 (y1i =0) ),Φ(η 0 );θ 0 Φ(η2i )−C Φ(η2i 1i 0 ) 1−Φ(η1i (6) , 0 0 η1i = XT1i δ10 and η2i = XT2i δ20 . In order to prove consistency for the estimator of the SATE we introduce the additional assumption that the (A5) probabilities (6) are differentiable as functions of δ and their gradients are bounded in the neighborhood of δ 0 , uniformly for all xi = (XT1i , XT2i )T . Theorem 2. If conditions (A1)-(A5) hold then \ − SATE0 = OP (n−1/2 ) as n → ∞, SATE \ = SATE(δ̂, X) as defined in Section 2.2. where SATE Proof of Theorem 1. We show that the following approximation holds 0 0 δ̂ − δ ≈ −EH(δ ) + S̃λ −1 (g(δ 0 ) − S̃λ δ 0 ), (7) which implies the asymptotic consistency of δ̂ at the rate n−1/2 . We adopt the argumentation used in the theory of maximum likelihood estimation (e.g., McCullagh, 1987) which involves a Taylor expansion of the score in the neighborhood of δ 0 . A similar approach was used by Kauermann (2005) and Kauermann et al. (2009) in the context of penalized spline smoothing. For simplicity of notation, we omit all terms of order higher than 1 and assume that higher order derivatives of 14 the log-likelihood behave in a similar manner as those defined in (A1)-(A3). The first-order Taylor expansion of gp (δ̂) around gp (δ 0 ) implies gp (δ̂) = gp (δ 0 ) + Hp (δ 0 )(δ̂ − δ 0 ) + (higher order terms), which, after using the fact that gp (δ̂) = 0 and inverting the above series (e.g., Barndorff-Nielsen & Cox, 1989), leads to δ̂ − δ 0 = −Hp (δ 0 )−1 (g(δ 0 ) − S̃λ δ 0 ) + . . . We then decompose Hp (δ 0 ) as 0 Hp (δ ) = H(δ ) − EH(δ ) + EH(δ ) − S̃λ = R − F(λ), 0 0 0 where R = H(δ 0 ) − EH(δ 0 ) represents a stochastic error and F(λ) = −EH(δ 0 ) + S̃λ is the penalized Fisher information matrix. Now, let f (·) = (· − F(λ))−1 be an auxiliary function of a matrix argument. Using a Taylor expansion of f (R) around f (0), we obtain the following representation of Hp (δ 0 )−1 Hp (δ 0 )−1 = −F(λ)−1 − F(λ)−1 R(F(λ)−1 )T + . . . Assumptions (A2)-(A4) imply Hp (δ 0 )−1 = −F(λ)−1 I − RF(λ)−1 + . . . = −F(λ)−1 I + OP (n−1/2 ) , where I is an identity matrix. Thus δ̂ − δ 0 = F(λ)−1 (g(δ 0 ) − S̃λ δ 0 ) (I + oP (1)) + . . . , (8) which proves (7) and hence δ̂ − δ 0 = OP (n−1/2 ) as n → ∞. (9) Remark 1. (a) From approximation (8), the asymptotic bias and covariance matrix of δ̂ can be derived. Specifically, bias(δ̂) = E δ̂ − δ 0 ≈ −F(λ)−1 S̃λ δ 0 , where the property Eg(δ 0 ) = 0 of δ 0 has been used, and Cov(δ̂) ≈ −F(λ)−1 EH(δ 0 )F(λ)−1 , (10) which follows from the fact that Cov(g(δ 0 )) = −EH(δ 0 ). In addition, conditions (A2) and (A4) imply that bias(δ̂) = o(n−1/2 ) and Cov(δ̂) = O(n−1 ). 15 (b) Assumption (A4) implies that √ nCov(δ̂) ≈ and √ nVδ ≈ 1 √ E −H(δ 0 ) n −1 1 − √ H(δ 0 ) n −1 where Vδ = −H−1 p is the Bayesian approximation of the covariance matrix of δ̂ mentioned in Section 2.5. Thus, the frequentist asymptotic approximation (10) and the Bayesian result become equivalent as the sample size n grows to ∞. d −1/2 (c) As g(δ 0 ) is a sum of i.i.d. components, it follows that (−EH(δ 0 )) g(δ 0 ) → N (0, I). Hence, approximation (8) also implies asymptotic normality of the normalized estimator δ̂. \ = SATE(δ̂, X) where SATE(δ, X) can be expressed as Proof of Theorem 2. Recall that SATE P n 1 i=1 sate(δ, xi ), with sate(δ, xi ) determined by n (y1i =1) C Φ(η1i ), Φ(η2i ); θ Φ(η1i ) − (y =0) Φ(η2i 1i ) −C (y =0) Φ(η1i ), Φ(η2i 1i ); θ 1 − Φ(η1i ) . The mean value theorem yields sate(δ̂, xi ) = sate(δ 0 , xi ) + ∂ sate(δ̃, xi )T (δ̂ − δ 0 ), ∂δ ∂ sate(δ̃, xi ) is the gradient vector of sate(·, xi ) for some δ̃ = (1 − c)δ 0 + cδ̂, c > 0, where ∂δ expressed as a function of δ calculated at a point δ = δ̃, for i = 1, . . . , n. Thus, n n X 1X ∂ \= 1 SATE sate(δ̃, xi )T (δ̂ − δ 0 ) sate(δ 0 , xi ) + n i=1 n i=1 ∂δ n 1X ∂ = SATE + sate(δ̃, xi )T (δ̂ − δ 0 ) n i=1 ∂δ 0 (11) As for the second term in (11), Schwarz inequality implies n n 1X ∂ 1 X ∂ T 0 sate(δ̃, xi ) (δ̂ − δ ) ≤ sate(δ̃, xi ) ||δ̂ − δ 0 ||. n i=1 ∂δ n i=1 ∂δ ∂ sate(·, xi ) is bounded in the neighborhood of δ 0 uniformly for all xi Given the assumption that ∂δ and that ||δ̂ − δ 0 || = OP (n−1/2 ) proved in (9), the assertion follows. Remark 2. Expression (10) for the asymptotic covariance matrix of δ̂ can be used to construct the 16 \ using the delta method, namely asymptotic variance of SATE T 0 0 \ ≈ − ∂SATE(δ , X) F(λ)−1 EH(δ 0 )F(λ)−1 ∂SATE(δ , X) , VarSATE ∂δ ∂δ which is equivalent in the limit to expression (3) given in Section 2.2, as motivated in Remark 1(b). Moreover, it follows from the delta method and Remark 1(c) that the normalized estimator \ is asymptotically normal. SATE 3 Analysis of health care utilization data The analysis presented in this section was performed in the R environment (R Development Core Team, 2013) using the package SemiParBIVProbit which implements the methods discussed in this article (Marra & Radice, 2013). 3.1 Data We used data from the 2008 Medical Expenditure Panel Survey (MEPS; http://www.meps.ahrq.gov/) which provides a nationally representative sample that includes information on demographics, individual health status, health care utilization and private health insurance coverage. We excluded individuals younger than 18 years old given their different overall health profiles and expected usage patterns as compared to older individuals. Those 65 and older were also excluded since the availability of Medicare obviates the primary insurance decision for almost all US citizens. Individuals that did not have a complete set of socioeconomic and demographic control variables were also excluded from the sample (e.g., missing values for education or income). After exclusions, the final dataset contains 13132 observations. Table 3 summarizes the variables used in the analysis. The choice of these variables was motivated largely by the findings reported in previous related studies (e.g., Shane & Trivedi, 2012, and references therein). Private health insurance status, which is an important determinant of the use of health care services, is a potentially endogenous variable. This is because unobserved variables, such as say allergy and risk aversiveness, are likely to influence both health service utilization and private insurance decision. Sometimes the effect of private health insurance can be interpreted as adverse selection or moral hazard (e.g., Buchmueller et al., 2005). Adverse selection occurs when individuals with a greater demand for medical care, because of poor health for instance, are expected to have a greater demand for insurance. Moral hazard refers to the tendency of people to be more inclined to seek health services, and doctors to be more inclined to recommend them when all costs are covered. The problem of residual confounding arises when mis-modeling the effect of observed confounders (e.g., Benedetti & Abrahamowicz, 2004). For instance, the impact of age and education might exhibit a non-linear behavior which, if not accounted for, will enter the unstructured term of the model and hence induce bias in the estimated effect of insurance coverage on health care utilization. The assumption of linear dependence between private health insurance and health care utilization may also be restrictive if there exists an asymmetric tail dependence 17 Variable Outcome visits.hosp Treatment private Demographic-socioeconomic age gender race education income region Health-related health bmi diabetes hypertension hyperlipidemia limitation Definition =1 at least one visit to hospital outpatient departments =1 private health insurance age in years =1 male =1 white, =2 black, =3 native American, =4 others years of education income (000’s) =1 northeast, =2 midwest, =3 south, =4 west =1 excellent, =2 very good, =3 good, =4 fair,=5 poor body mass index =1 diabetic =1 hypertensive on =1 hyperlipidemic =1 health limits physical activity Table 3: Description of outcome and treatment variables, and observed confounders. between them (Winkelmann, 2012). These issues are addressed using the method presented in Section 2. 3.2 Models Following previous work on the subject (e.g., Holly et al., 1998; Fabbri & Monfardini, 2003; Shane & Trivedi, 2012; Srivastava & Zhao, 2008), the equations for private health insurance and health care utilization were specified, in R notation, as treat.eq <- private ~ as.factor(health) + as.factor(race) + as.factor(region) + limitation + gender + diabetes + hypertension + hyperlipidemia + self + s(bmi) + s(income) + s(age) + s(education) out.eq <- visits.hosp ~ private + as.factor(health) + as.factor(race) + as.factor(region) + limitation + gender + diabetes + hypertension + hyperlipidemia + self + s(bmi) + s(income) + s(age) + s(education) where as.factor coerces its argument to a factor and the s() are the unknown smooth functions described in Section 2.1.1. The smooth components were represented using penalized thin plate regression splines with basis dimensions equal to 20 and penalties based on second order derivatives (Wood, 2006). In cross-sectional studies, 20 bases typically suffice to represent well smooth functions, although sensitivity analysis using more spline bases is advisable when the estimated degrees of freedom of the smooth components are close to the number of bases used. 18 The non-linear specification for bmi, income, age and education arises from the fact that these covariate embody productivity and life-cycle effects that are likely to affect P(private = 1) and P(visits.hosp = 1) non-linearly. In fact, in related studies, Holly et al. (1998) and Fabbri & Monfardini (2003) considered a model for health care utilization that contains linear and quadratic terms in bmi, income, age and education whereas Marra & Radice (2011b) specified a model containing smooth functions of them. Considering all copulas discussed in Section 2.1, including the case in which the outcome equation is estimated alone (this will be referred to as Independent) and Student-t with 3, 6, 9 and 12 degrees of freedom, we fitted 19 models. Based on the AIC the chosen models were those which capture strong upper tail dependences, i.e. Clayton180 , Gumbel0 and Joe0 . After applying the Vuong test to these three models, it emerged that it is not possible to discriminate among them. However, the Clarke test showed a preference for Joe0 . 3.3 Empirical results 3.3.1 Measure of dependence We start off by commenting on the results for the Kendall’s τ of all models fitted (see Table 4). This measure captures the association between the unobserved confounders present in the unstructured term of the model, after controlling for observed confounders. The models without AIC support which can account for asymmetric dependence indicate either an absence or a negative association between the two equations, whereas those which can account for symmetric dependence suggest a moderate association. Interestingly, the small yet significant τ , estimated using the preferred models (Joe0 , Gumbel0 and Clayton180 ), indicate that there exists some association between the unstructured terms of the model equations for private health insurance and hospital utilization which is most likely due to the presence of unobserved confounders. 3.3.2 SATE of private health insurance The estimated SATE (in %) and confidence interval (CI) for all fitted copula models are reported in Table 4. The Table also reports the estimated SATE for the case in which the unobserved confounding issue is not taken into account (Independent). Several points are worth noting. • The models which received the most support from the AIC (Joe0 , Gumbel0 and Clayton180 , which account for strong positive upper tail dependence) show similar point estimates with overlapping CIs. • The models without AIC support (which can account for negative tail dependence and positive lower tail dependence) exhibit point estimates which are systematically different from those obtained from the models with AIC support. Specifically, the models which suggest absence of association (τ̂ = 0) consistently show similar point estimates as that of Independent, whereas the models which indicate negative association show implausible negative estimated effects of the SATE. • If the presence of unobserved confounders is not accounted for then the point estimate is smaller than that obtained using the chosen models which can control for this issue. 19 Copula Independent Gaussian Student-t3 Student-t6 Student-t9 Student-t12 Frank Clayton0 Clayton90 Clayton180 Clayton270 Gumbel0 Gumbel90 Gumbel180 Gumbel270 Joe0 Joe90 Joe180 Joe270 \ (95% CIs) SATE 4.29 (2.98,5.60) 5.34 (3.85,6.83) 5.68 (4.24,7.13) 5.51 (4.01,7.01) 5.49 (3.99,6.99) 5.45 (3.94,6.95) 4.82 (3.39,6.25) 4.25 (2.89,5.61) 4.25 (2.89,5.61) 5.71 (4.24,7.18) -5.75 (-7.38,-4.13) 5.63 (4.15,7.12) 4.25 (2.89,5.61) 4.25 (2.89,5.61) 4.25 (2.89,5.61) 5.68 (4.21,7.15) -10.10 (-11.97,-8.23) 4.25 (2.89,5.61) 4.25 (2.89,5.61) τ̂ (95% CIs) 0.20 (0.07,0.33) 0.46 (0.37,0.54) 0.35 (0.24,0.45) 0.32 (0.20,0.42) 0.29 (0.17,0.40) 0.21 (0.04,0.35) 0.00 (0.00,0.00) 0.00 (-1.00,0.00) 0.13 (0.07,0.23) -0.14 (-0.45,-0.03) 0.19 (0.11,0.31) 0.00 (0.00,0.00) 0.00 (0.00,0.00) 0.00 (0.00,0.00) 0.14 (0.08,0.23) -0.15 (-0.45,-0.04) 0.00 (0.00,0.00) 0.00 (0.00,0.00) Table 4: Estimated SATE (in %) and τ obtained using different copula models for the 2008 MEPS data. 95% confidence intervals for the SATE have been obtained using the delta method detailed in Section 2.2, and those for τ using Bayesian posterior simulation as described in Section 2.5. • As the number of degrees of freedom for the Student-t copula increases, the estimated SATE obviously gets closer to that obtained with a Gaussian copula. • Using the Gaussian copula the estimated SATE is 5.3% whereas that obtained using Joe0 is 5.7%. This mild increase in the effect of private health insurance is difficult to judge as it may or may not have an impact on policy decisions regarding, for instance, recognizing the important role of private health insurance in financing hospitals. But such speculation is beyond the scope of this paper. The CIs for the SATE reported in Table 4 were obtained using the delta method (Section 2.2). Marginally wider intervals were obtained using the Bayesian posterior simulation approach described in Section 2.5. A small simulation study revealed that the simulation approach and the delta method can yield intervals with empirical coverage probabilities which are nearly identical and very close to nominal coverages. The study also revealed that intervals for Kendall’s τ obtained by posterior simulation have close to nominal coverages and that the true SATE can be recovered well. R code for the simulation experiment and results have not been reported here to save space and are available upon request. 3.3.3 Parametric components We report the estimated effects for the Joe0 copula model. Similar results where obtained using Gumbel0 and Clayton180 which also account for strong positive upper tail dependence. These are 20 Treatment Eq. Variable Parameter estimate Std. error gender -0.08 0.03 race=2 0.03 0.04 race=3 -0.17 0.13 race=4 0.10 0.05 region=2 0.24 0.05 region=3 0.06 0.04 region=4 0.01 0.04 health=2 -0.05 0.03 health=3 -0.11 0.04 health=4 -0.25 0.05 health=5 -0.20 0.11 diabetes 0.08 0.06 hypertension 0.12 0.04 hyperlipidemia 0.15 0.04 limitation -0.11 0.07 Outcome Eq. Parameter estimate Std. error -0.42 0.03 -0.10 0.04 -0.18 0.16 -0.15 0.06 0.22 0.05 -0.17 0.04 -0.25 0.05 0.13 0.04 0.17 0.04 0.27 0.06 0.50 0.11 0.13 0.06 0.09 0.04 0.24 0.04 -0.49 0.06 Table 5: Estimated coefficients and standard errors of the parametric components from the Joe0 copula model. available upon request. Most of these effects have the expected signs. Females are slightly more likely to purchase insurance and being hospitalized than males. This may be explained by a higher demand for medical services among women during their reproductive years (e.g., Sindelar, 1982). As for race, there is not a significant difference between whites and nonwhites in terms of purchasing private health insurance but there is some difference in terms of being hospitalized; nonwhites seem to be less likely to use health care services. This is consistent with the findings by Shane & Trivedi (2012). Regarding the region variable, residents of the Midwest are more likely to have a private insurance and to use health care services as compared to those of the Northeast. Individuals’ evaluation of their health states is a potential predictor of health care utilization. Those who are in good health are less likely to access health care services. In the same vein, those who expect themselves to be in good health have little to gain from insurance while those who are in poor health are more likely to purchase health insurance. The results for the hospital utilization equation support this hypothesis indicating that the less healthy individuals are, the more likely they are to be admitted into hospitals. The positive relationship between self-assessed health and insurance purchase is counter intuitive to the hypothesis of moral hazard and adverse selection. However, such finding is not unusual and has been obtained in several previous studies (see Srivastava & Zhao, 2008, and references therein). The more objective measures of health status (i.e., diabetes, hypertension and hyperlipidemia) suggest that medical need is an important determinant of hospital utilization and insurance purchase. 21 3.3.4 Non-parametric components Figures 3 and 4 report the smooth function estimates for the treatment and outcome equations (and associated Bayesian intervals) when applying the Joe0 copula model on the MEPS data. The estimated smooth functions obtained using the other copulas (not reported here but available upon request) were similar. The effects of bmi, income, age and education in the treatment and outcome equations show different degrees of nonlinearity. The point-wise confidence intervals of the smooth functions for bmi in the treatment equation and income in the outcome equation contain the zero line for most of the covariate value range. The intervals of the smooth for bmi in the outcome equation contain the zero line for the whole range. This suggests that bmi and income are weak predictors of private health insurance and health care utilization, and that bmi is not an important determinant of hospital utilization. Similar conclusions can be drawn by looking at the p-values for the smooth terms reported in the captions of Figures 3 and 4. As for the remaining variables, the estimated effects have the expected patterns. For example, age is a significant determinant in both equations. The probability of purchasing a private health insurance is found to increase with age with a slight drop-off for the 58+ age group. This is suggestive of a higher probability of private health insurance purchase as individuals become older and less likely to stay healthy (e.g., Hopkins & Kiddi, 1996). The probability of using health care services also increases with age. Insurance decision as well as health care utilization appear to be highly associated with education. Education is likely to increase individuals’ awareness of health care services and the benefits of purchasing a private health insurance. Higher household income is associated with an increased probability of purchasing a private health insurance. It is worth noting that the parametric and non-parametric estimated effects for the outcome equation reported here can be interpreted in a qualitative way only. The actual effects can be calculated using simulation or the formulas in Greene (2012) which account for the fact that confounders appearing in the treatment equation have an indirect effect (through the endogenous variable) on the outcome as well as a direct effect because they also appear in the outcome equation. 4 Discussion In summary, we have presented a model which can allow researchers to obtain consistent estimates of treatment effectiveness in the presence of unobserved and residual confounding, and of dependencies in the tails of the distribution of treatment and outcome. This model extends previous work by accommodating simultaneously flexible confounder effects as well as non-linear treatment-outcome dependence. In doing so, we have provided inferential tools for the proposed framework and presented some argumentation related to the asymptotic behavior of the introduced penalized maximum likelihood estimator and the ensuing sample average treatment effect. For implementation, we have developed the necessary computational procedures which are available in the R package SemiParBIVProbit. Using the proposed method, we have examined the effect of private health insurance on health 22 s(income,8.4) −0.5 0.5 1.5 −1.5 s(bmi,5.99) −1.5 −0.5 −2.5 30 40 bmi 50 60 70 0e+00 1e+05 2e+05 income 3e+05 −0.3 s(age,5.45) −0.1 0.1 s(education,5.91) −1.0 −0.5 0.0 0.5 0.3 20 20 30 40 age 50 60 0 5 10 education 15 −0.1 s(income,2.57) −0.2 0.0 0.2 s(bmi,1) 0.1 0.2 0.3 0.4 Figure 3: Smooth function estimates and associated 95% point-wise confidence intervals in the treatment equation obtained applying the Joe0 copula regression spline model on the 2008 MEPS data. The rug plot, at the bottom of each graph, shows the covariate values. P-values for the smooth terms of bmi, income, age and income are 0.0681, < 0.000, < 0.000 and < 0.000, respectively. 30 40 bmi 50 60 70 0e+00 1e+05 2e+05 income 3e+05 20 30 40 age 50 −0.6 −0.4 s(age,1.48) −0.2 0.0 0.2 s(education,4.84) −0.2 0.0 0.2 0.4 20 60 0 5 10 education 15 Figure 4: Smooth function estimates and associated 95% point-wise confidence intervals in the outcome equation obtained applying the Joe0 copula regression spline model on the 2008 MEPS data. P-values for the smooth terms of bmi, income, age and income are 0.132, 0.097, < 0.000 and < 0.000, respectively. 23 care utilization using the 2008 MEPS dataset. There is a generally accepted notion that private health coverage is affected by endogeneity as it is not randomly assigned as in a controlled trial but rather is the result of individual preferences and health status, such as allergy and risk aversiveness. Also, the impacts of continuous confounders such as age and education are likely to be complex since they embody productivity and life-cycle effects that are likely to influence nonlinearly private health insurance and health care utilization. Finally, there may exist dependencies that appear in the tails of the distribution of both insurance and health care utilization which need to be modeled. To our knowledge, no studies have examined the impact of private health coverage accounting for endogeneity, non-linear contributions of observed confounders and non-linear dependence between insurance and health care utilization, partly due to the lack of appropriate analytical tools to discern these simultaneous effects. By applying the newly developed statistical framework to the 2008 MEPS data we found that not accounting for the endogeneity problem underestimates the effect of private health insurance, that some of the observed confounder effects are non-linear, and that the models which received most support from the data are those which allow for strong positive upper tail dependence. It is worth stressing that the proposed approach may have a significant wider appeal. Indeed, many other situations exist in which it would be useful to estimate the effect of a treatment on an outcome using observational data. Examples include estimating the impact of diabetes on employment, physical activity on obesity, and insurance on mortality among HIV-infected individuals, to name a few. Future research directions include the use of various marginal distributions that can be bound by a copula (e.g., generalized extreme value distributions as in Wang & Dey (2010)), allowing for different kinds of dependencies in the upper and lower tails by using two-parameter bivariate linking copulas (e.g., Brechmann & Schepsmeier, 2013), and generalizing the proposed approach to trivariate system models which control for the endogeneity of the treatment and for non-random sample selection in the outcome (Srivastava & Zhao, 2008). Using more flexible link functions, two-parameter linking copulas and adopting a system of three equations is likely to make the model more accurate and the framework presented here can constitute a building block for these developments. Appendix Appendix A: Derivatives of SATE(δ, X) with respect to δ The components in ∂SATE(δ, X)/∂δ that are referred to in Section 2.2 are given below. n ∂SATE(δ, X) 1X = ∂δ1 n i=1 ∂ (y =1) C Φ(η1i ),Φ(η2i 1i );θ Φ(η1i ) (y − Φ(η2i 1i (y =0) )−C Φ(η1i ),Φ(η2i 1i );θ 1−Φ(η1i ) ∂δ1 24 =0) , (12) ∂ n ∂SATE(δ, X) 1X = ∂δ2 n i=1 (y =1) C Φ(η1i ),Φ(η2i 1i );θ (y =1) C Φ(η1i ),Φ(η2i 1i );θ Φ(η1i ) (y − Φ(η2i 1i =0) (y =0) )−C ,Φ(η1i ),Φ(η2i 1i );θ , (13) (y =0) )−C Φ(η1i ),Φ(η2i 1i );θ . (14) 1−Φ(η1i ) ∂δ2 and n 1X ∂SATE(δ, X) = ∂θ∗ n i=1 ∂ Φ(η1i ) (y − Φ(η2i 1i =0) 1−Φ(η1i ) ∂θ∗ The quantities inside the square brackets of (12), (13) and (14) can be written as ( h1 Φ(η1i ) − C(Φ(η1i ), Φ(η2i ); θ) Φ(η1i )2 y1i =1 h1 (1 − Φ(η1i )) − (Φ(η2i ) − C(Φ(η1i ), Φ(η2i ); θ)) + [1 − Φ(η )]2 1i and y1i =0 ) φ(η1i )X1i , 1 − h2 h2 φ(η2i )X2i − , φ(η )X 2i 2i Φ(η1i ) y1i =1 1 − Φ(η1i ) y1i =0 ∂C(Φ(η1i ),Φ(η2i );θ) ∂θ ∂θ ∂θ∗ Φ(η1i ) + y1i =1 ∂C(Φ(η1i ),Φ(η2i );θ) ∂θ ∂θ ∂θ∗ 1 − Φ(η1i ) , y1i =0 where hv = ∂C(Φ(η1i ), Φ(η2i ); θ)/∂Φ(ηvi ), for v = 1, 2, φ(·) is the density function of the standard univariate normal distribution, ∂θ/∂θ∗ can be obtained using the transformations in Table 2, and all the other quantities are defined in Section 2. Appendix B: Gradient and Hessian The quantities g and H that are referred to in Section 2.3 are given below. Gradient n ∂`(δ) X 1 ∂P(y1i = 1, y2i = 1) g1 = y1i y2i = ∂δ1 P(y1i = 1, y2i = 1) ∂δ1 i=1 1 ∂P(y1i = 1, y2i = 0) P(y1i = 1, y2i = 0) ∂δ1 1 ∂P(y1i = 0, y2i = 1) + (1 − y1i )y2i P(y1i = 0, y2i = 1) ∂δ1 1 ∂P(y1i = 0, y2i = 0) + (1 − y1i )(1 − y2i ) , P(y1i = 0, y2i = 0) ∂δ1 + y1i (1 − y2i ) 25 where ∂P(y1i = 1, y2i ∂δ1 ∂P(y1i = 1, y2i ∂δ1 ∂P(y1i = 0, y2i ∂δ1 ∂P(y1i = 0, y2i ∂δ1 = 1) = 0) = h1 φ(η1i )X1i , ∂P(y1i = 1, y2i = 1) , ∂δ1 ∂P(y1i = 1, y2i = 1) = 1) =− , ∂δ1 = 0) ∂P(y1i = 1, y2i = 0) =− . ∂δ1 = φ(η1i )X1i − n 1 ∂`(δ) X ∂P(y1i = 1, y2i = 1) y1i y2i = g2 = ∂δ2 P(y1i = 1, y2i = 1) ∂δ2 i=1 ∂P(y1i = 1, y2i = 0) 1 P(y1i = 1, y2i = 0) ∂δ2 ∂P(y1i = 0, y2i = 1) 1 + (1 − y1i )y2i P(y1i = 0, y2i = 1) ∂δ2 ∂P(y1i = 0, y2i = 0) 1 , + (1 − y1i )(1 − y2i ) P(y1i = 0, y2i = 0) ∂δ2 + y1i (1 − y2i ) where ∂P(y1i = 1, y2i ∂δ2 ∂P(y1i = 1, y2i ∂δ2 ∂P(y1i = 0, y2i ∂δ2 ∂P(y1i = 0, y2i ∂δ2 = 1) = h2 φ(η2i )X2i , ∂P(y1i = 1, y2i = 1) , ∂δ2 = 1) ∂P(y1i = 1, y2i = 1) = φ(η2i )X2i − , ∂δ2 ∂P(y1i = 0, y2i = 1) = 0) =− . ∂δ1 = 0) g3 = =− ∂`(δ) ∂`(δ) ∂θ = , ∂θ∗ ∂θ ∂θ∗ where n ∂`(δ) X ∂P(y1i = 1, y2i = 1) 1 = y1i y2i ∂θ P(y1i = 1, y2i = 1) ∂θ i=1 1 ∂P(y1i = 1, y2i = 0) P(y1i = 1, y2i = 0) ∂θ ∂P(y1i = 0, y2i = 1) 1 + (1 − y1i )y2i P(y1i = 0, y2i = 1) ∂θ 1 ∂P(y1i = 0, y2i = 0) + (1 − y1i )(1 − y2i ) , P(y1i = 0, y2i = 0) ∂θ + y1i (1 − y2i ) 26 and ∂P(y1i = 1, y2i ∂θ ∂P(y1i = 1, y2i ∂θ ∂P(y1i = 0, y2i ∂θ ∂P(y1i = 0, y2i ∂θ = 1) ∂C(Φ(η1i ), Φ(η2i ); θ) , ∂θ = 0) ∂P(y1i = 1, y2i = 1) =− , ∂θ ∂P(y1i = 1, y2i = 1) = 1) =− , ∂θ ∂P(y1i = 1, y2i = 1) = 0) = . ∂θ = Hessian 2 H1,1 = ∂ `(δ) = ∂δ1 ∂δ1T n X i=1 y1i y2i + y1i (1 − y2i ) + (1 − y1i )y2i ∂ 2 P(y1i =1,y2i =1) P(y1i ∂δ1 ∂δ1T = 1, y2i = 1) − h = 1, y2i = 0) − h ∂P(y1i =1,y2i =1) ∂δ1 [P(y1i = 1, y2i = 1)]2 ∂ 2 P(y1i =1,y2i =0) P(y1i ∂δ1 ∂δ1T ∂P(y1i =1,y2i =0) ∂δ1 i2 [P(y1i = 1, y2i = 0)]2 h i2 ∂P(y1i =0,y2i =1) ∂ 2 P(y1i =0,y2i =1) P(y1i = 0, y2i = 1) − ∂δ1 ∂δ1 ∂δ T 1 [P(y1i = 0, y2i = 1)]2 + (1 − y1i )(1 − y2i ) ∂ 2 P(y1i =0,y2i =0) P(y1i ∂δ1 ∂δ1T = 0, y2i = 0) − h ∂P(y1i =0,y2i =0) ∂δ1 [P(y1i = 0, y2i = 0)]2 where ∂ 2 P(y1i = 1, y2i ∂δ1 ∂δ1T ∂ 2 P(y1i = 1, y2i ∂δ1 ∂δ1T ∂ 2 P(y1i = 0, y2i ∂δ1 ∂δ1T ∂ 2 P(y1i = 0, y2i ∂δ1 ∂δ1T i2 = 1) ∂h1 φ(η1i )X1i − h1 φ(η1i )η1i XT1i X1i , ∂δ1 = 0) ∂ 2 P(y1i = 1, y2i = 1) = −φ(η1i )η1i XT1i X1i − , ∂δ1 ∂δ1T = 1) ∂ 2 P(y1i = 1, y2i = 1) =− , ∂δ1 ∂δ1T ∂ 2 P(y1i = 1, y2i = 0) = 0) =− . ∂δ1 ∂δ1T = 27 i2 , n X 2 H2,2 = ∂ `(δ) = ∂δ2 ∂δ2T i=1 ∂ 2 P(y1i =1,y2i =1) P(y1i ∂δ2 ∂δ2T y1i y2i + y1i (1 − y2i ) + (1 − y1i )y2i = 1, y2i = 1) − h = 1, y2i = 0) − h ∂P(y1i =1,y2i =1) ∂δ2 [P(y1i = 1, y2i = 1)]2 ∂ 2 P(y1i =1,y2i =0) P(y1i ∂δ2 ∂δ2T ∂P(y1i =1,y2i =0) ∂δ2 i2 i2 [P(y1i = 1, y2i = 0)]2 h i2 ∂P(y1i =0,y2i =1) ∂ 2 P(y1i =0,y2i =1) P(y = 0, y = 1) − 1i 2i T ∂δ2 ∂δ2 ∂δ 2 [P(y1i = 0, y2i = 1)]2 + (1 − y1i )(1 − y2i ) ∂ 2 P(y1i =0,y2i =0) P(y1i ∂δ2 ∂δ2T = 0, y2i = 0) − h ∂P(y1i =0,y2i =0) ∂δ2 [P(y1i = 0, y2i = 0)]2 i2 , where ∂ 2 P(y1i = 1, y2i ∂δ2 ∂δ2T ∂ 2 P(y1i = 1, y2i ∂δ2 ∂δ2T ∂ 2 P(y1i = 0, y2i ∂δ2 ∂δ2T ∂ 2 P(y1i = 0, y2i ∂δ2 ∂δ2T H3,3 ∂`(δ) ∂θ∗ ∂ ∂ 2 `(δ) = = ∂θ∗2 ∂θ∗ = 1) ∂h2 φ(η2i )X2i − h2 φ(η2i )η2i XT2i X2i , ∂δ2 ∂ 2 P(y1i = 1, y2i = 1) = 0) =− , ∂δ2 ∂δ2T = 1) ∂ 2 P(y1i = 1, y2i = 1) = −φ(η2i )η2i XT2i X2i − , ∂δ2 ∂δ2T ∂ 2 P(y1i = 0, y2i = 1) = 0) =− . ∂δ2 ∂δ2T ∂ = = ∂`(δ) ∂θ∗ ∂θ ∂ ∂θ = ∂θ∗ ∂`(δ) ∂θ ∂θ ∂θ∗ ∂θ ∂θ = ∂θ∗ ∂`(δ) ∂ 2 θ ∂`(δ)2 ∂θ + ∂θ2 ∂θ∗ ∂θ ∂θ∗2 where 2 ∂ `(δ) = ∂θ2 n X i=1 y1i y2i + y1i (1 − y2i ) + (1 − y1i )y2i ∂ 2 P(y1i =1,y2i =1) P(y1i ∂θ 2 = 1, y2i = 1) − h = 1, y2i = 0) − h ∂P(y1i =1,y2i =1) ∂θ [P(y1i = 1, y2i = 1)]2 ∂ 2 P(y1i =1,y2i =0) P(y1i ∂θ 2 ∂P(y1i =1,y2i =0) ∂θ i2 i2 [P(y1i = 1, y2i = 0)]2 i2 h ∂ 2 P(y1i =0,y2i =1) ∂P(y1i =0,y2i =1) P(y = 0, y = 1) − 1i 2i ∂θ 2 ∂θ + (1 − y1i )(1 − y2i ) [P(y1i = 0, y2i = 1)]2 ∂ 2 P(y1i =0,y2i =0) P(y1i ∂θ 2 = 0, y2i = 0) − h [P(y1i = 0, y2i = 0)]2 28 ∂P(y1i =0,y2i =0) ∂θ i2 , ∂θ , ∂θ∗ and ∂ 2 P(y1i = 1, y2i ∂θ2 2 ∂ P(y1i = 1, y2i ∂θ2 2 ∂ P(y1i = 0, y2i ∂θ2 2 ∂ P(y1i = 0, y2i ∂θ2 2 H1,2 ∂ `(δ) = = ∂δ1 ∂δ2T n X i=1 y1i y2i + y1i (1 − y2i ) + (1 − y1i )y2i ∂ 2 C(Φ(η1i ), Φ(η2i ); θ) , ∂θ2 = 0) ∂ 2 P(y1i = 1, y2i = 1) =− , ∂θ2 ∂ 2 P(y1i = 1, y2i = 1) = 1) =− , ∂θ2 ∂ 2 P(y1i = 1, y2i = 1) = 0) = . ∂θ2 = 1) = ∂P(y1i =1,y2i =1) P(y1i ∂δ1 ∂δ2T = 1, y2i = 1) − ∂P(y1i =1,y2i =1) ∂P(y1i =1,y2i =1) ∂δ1 ∂δ2 2 [P(y1i = 1, y2i = 1)] ∂P(y1i =1,y2i =0) P(y1i ∂δ1 ∂δ2T = 1, y2i = 0) − ∂P(y1i =1,y2i =0) ∂P(y1i =1,y2i =0) ∂δ1 ∂δ2 2 [P(y1i = 1, y2i = 0)] ∂P(y1i =0,y2i =1) P(y1i ∂δ1 ∂δ2T + (1 − y1i )(1 − y2i ) = 0, y2i = 1) − ∂P(y1i =0,y2i =1) ∂P(y1i =0,y2i =1) ∂δ1 ∂δ2 2 [P(y1i = 0, y2i = 1)] ∂P(y1i =0,y2i =0) P(y1i ∂δ1 ∂δ2T = 0, y2i = 0) − [P(y1i = 0, y2i = 0)] where ∂ 2 P(y1i = 1, y2i ∂δ1 ∂δ2T ∂ 2 P(y1i = 1, y2i ∂δ1 ∂δ2T ∂ 2 P(y1i = 0, y2i ∂δ1 ∂δ2T ∂ 2 P(y1i = 0, y2i ∂δ1 ∂δ2T ∂P(y1i =0,y2i =0) ∂P(y1i =0,y2i =0) ∂δ1 ∂δ2 2 = 1) ∂h1 φ(η1i )X1i , ∂δ2 ∂ 2 P(y1i = 1, y2i = 1) = 0) =− , ∂δ1 ∂δ2T = 1) ∂ 2 P(y1i = 1, y2i = 1) =− , ∂δ1 ∂δ2T ∂ 2 P(y1i = 1, y2i = 1) = 0) = . ∂δ1 ∂δ2T = 29 , H1,3 ( ∂P2 (y1i =1,y2i =1) n 2i =1) ∂P(y1i =1,y2i =1) X P(y1i = 1, y2i = 1) − ∂P(y1i =1,y ∂ 2 `(δ) ∂δ1 ∂θ∗ ∂δ1 ∂θ∗ y1i y2i = = 2 ∂δ1 ∂θ∗ [P(y1i = 1, y2i = 1)] i=1 + y1i (1 − y2i ) + (1 − y1i )y2i ∂P2 (y1i =1,y2i =0) P(y1i ∂δ1 ∂θ∗ = 1, y2i = 0) − ∂P(y1i =1,y2i =0) ∂P(y1i =1,y2i =0) ∂δ1 ∂θ∗ 2 = 0, y2i = 1) − ∂P(y1i =0,y2i =1) ∂P(y1i =0,y2i =1) ∂δ1 ∂θ∗ 2 [P(y1i = 1, y2i = 0)] ∂P2 (y1i =0,y2i =1) P(y1i ∂δ1 ∂θ∗ [P(y1i = 0, y2i = 1)] ∂P2 (y 1i =0,y2i =0) ∂δ1 ∂∂θ∗ + (1 − y1i )(1 − y2i ) P(y1i = 0, y2i = 0) − ∂P(y1i =0,y2i =0) ∂P(y1i =0,y2i =0) ∂δ1 ∂θ∗ 2 [P(y1i = 0, y2i = 0)] ) , ) , where ∂ 2 P(y1i = 1, y2i ∂δ1 ∂θ∗ 2 ∂ P(y1i = 1, y2i ∂δ1 ∂θ∗ 2 ∂ P(y1i = 0, y2i ∂δ1 ∂θ∗ 2 ∂ P(y1i = 0, y2i ∂δ1 ∂θ∗ H2,3 = 1) ∂h1 φ(η1i )X1i , ∂θ∗ = 0) ∂ 2 P(y1i = 1, y2i = 1) =− , ∂δ1 ∂θ∗ ∂ 2 P(y1i = 1, y2i = 1) = 1) =− , ∂δ1 ∂θ∗ ∂ 2 P(y1i = 1, y2i = 1) = 0) = . ∂δ1 ∂θ∗ = ( ∂P2 (y1i =1,y2i =1) n 2i =1) ∂P(y1i =1,y2i =1) X P(y1i = 1, y2i = 1) − ∂P(y1i =1,y ∂ 2 `(δ) ∂δ2 ∂θ∗ ∂δ2 ∂θ∗ = y1i y2i = 2 ∂δ2 ∂θ∗ [P(y = 1, y = 1)] 1i 2i i=1 + y1i (1 − y2i ) + (1 − y1i )y2i ∂P2 (y1i =1,y2i =0) P(y1i ∂δ2 ∂θ∗ = 1, y2i = 0) − ∂P(y1i =1,y2i =0) ∂P(y1i =1,y2i =0) ∂δ2 ∂θ∗ 2 = 0, y2i = 1) − ∂P(y1i =0,y2i =1) ∂P(y1i =0,y2i =1) ∂δ2 ∂θ∗ 2 [P(y1i = 1, y2i = 0)] ∂P2 (y1i =0,y2i =1) P(y1i ∂δ2 ∂θ∗ [P(y1i = 0, y2i = 1)] ∂P2 (y + (1 − y1i )(1 − y2i ) 1i =0,y2i =0) P(y1i ∂δ2 ∂∂θ∗ = 0, y2i = 0) − [P(y1i = 0, y2i = 0)] where ∂ 2 P(y1i = 1, y2i ∂δ2 ∂θ∗ ∂ 2 P(y1i = 1, y2i ∂δ2 ∂θ∗ ∂ 2 P(y1i = 0, y2i ∂δ2 ∂θ∗ 2 ∂ P(y1i = 0, y2i ∂δ2 ∂θ∗ ∂P(y1i =0,y2i =0) ∂P(y1i =0,y2i =0) ∂δ2 ∂θ∗ 2 ∂h2 φ(η2i )X2i , ∂θ∗ = 0) ∂ 2 P(y1i = 1, y2i = 1) =− , ∂δ2 ∂θ∗ = 1) ∂ 2 P(y1i = 1, y2i = 1) =− , ∂δ2 ∂θ∗ ∂ 2 P(y1i = 1, y2i = 1) = 0) = . ∂δ2 ∂θ∗ = 1) = 30 Expressions for hv , ∂h1 /∂δv , ∂hv /∂θ∗ , for v = 1, 2, ∂h2 /∂δ2 , ∂θ/∂θ∗ , ∂ 2 θ/θ∗2 , ∂C(Φ(η1i ), Φ(η2i ); θ)/∂θ and ∂ 2 C(Φ(η1i ), Φ(η2i ); θ)/∂θ2 for all the copulas in Table 1 are not reported here to save space and are implemented in the R package SemiParBIVProbit (Marra & Radice, 2013). These have been derived analytically and verified using numerical derivatives. References Barndorff-Nielsen, O. & Cox, D. (1989). Asymptotic Techniques for Use in Statistics. London: Chapman and Hall. Benedetti, A. & Abrahamowicz, M. (2004). Using generalized additive models to reduce residual confounding. Statisitcs in Medicine, 23, 3781–801. Brechmann, E. C. & Schepsmeier, U. (2013). Modeling dependence with c- and d-vine copulas: The R package CDVine. Journal of Statistical Software, 52(3), 1–27. Buchmueller, T. C., Grumbach, K., Kronick, R., & Kahn, J. G. (2005). Book review: The effect of health insurance on medical care utilization and implications for insurance expansion: A review of the literature. Medical Care Research and Review, 62, 3–30. Chib, S. & Greenberg, E. (2007). Semiparametric modeling and estimation of instrumental variable models. Journal of Computational and Graphical Statistics, 16, 86–114. Clarke, K. A. (2007). A simple distribution-free test for non-nested model selection. Political Analysis, 15, 347–363. Craven, P. & Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377–403. Fabbri, D. & Monfardini, C. (2003). Public vs. private health care services demand in italy. Giornale degli Economisti, 62(1), 93–123. Geyer, C. J. (2013). Trust regions. http://cran.r-project.org/web/packages/trust/vignet Gitto, L., Santoro, D., & Sobbrio, G. (2006). Choice of dialysis treatment and type of medical unit (private vs public), application of a recursive bivariate probit. Health Economics, 15, 1251– 1256. Goldman, D., Bhattacharya, J., McCaffrey, D., Duan, N., Leibowitz, A., Joyce, G., & Morton, S. (2001). Effect of insurance on mortality in an HIV-positive population in care. Journal of the American Statistical Association, 96, 883–894. Greene, W. H. (2012). Econometric Analysis. Prentice Hall, New York. Gu, C. (2002). Smoothing Spline ANOVA Models. Springer-Verlag, London. 31 Han, S. & Vytlacil, E. J. (2013). Identification in a generalization of bivariate probit models with endogenous regressors. Working paper. Hastie, T. & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society Series B, 55, 757–796. Holly, A., Gardiol, L., Domenighetti, G., & Brigitte, B. (1998). An econometric model of health care utilization and health insurance in switzerland. European Economic Review, 42(3-5), 513– 522. Hopkins, S. & Kiddi, M. (1996). The determinants of the demand for private health insurance under medicare. Applied Economicsi, 28, 1623–1632. Imbens, G. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86, 4–29. Jones, A. M., Koolman, X., & Doorslaer, E. V. (2006). The impact of having supplementary private health insurance on the uses of specialists. Annales d’Economie et de Statistique, (83/84), 251– 275. Kauermann, G. (2005). Penalized spline smoothing in multivariable survival models with varying coefficients. Computational Statistics and Data Analysis, 49, 169–186. Kauermann, G., Krivobokova, T., & Fahrmeir, L. (2009). Some asymptotics results on generalized penalized spline smoothing. Journal of Royal Statistical Society Series B, 71, 487–503. Kawatkar, A. A. & Nichol, M. B. (2009). Estimation of causal effects of physical activity on obesity by a recursive bivariate probit model. Value in Health, 12, A131–A132. Kim, Y. J. & Gu, C. (2004). Smoothing spline gaussian regression: More scalable computation via efficient approximation. Journal of the Royal Statistical Society Series B, 66, 337–356. Latif, E. (2009). The impact of diabetes on employment in Canada. Health Economics, 18, 577– 589. Li, Y. & Jensen, G. A. (2011). The impact of private long-term care insurance on the use of long-term care. Inquiry, 48(1), 34–50. Maddala, G. S. (1983). Limited Dependent and Qualitative Variables in Econometrics. Cambridge University Press, Cambridge. Marra, G. (2013). On p-values for semiparametric bivariate probit models. Statistical Methodology, 10, 23–28. Marra, G. & Radice, R. (2011a). Estimation of a semiparametric recursive bivariate probit model in the presence of endogeneity. Canadian Journal of Statistics, 39, 259–279. 32 Marra, G. & Radice, R. (2011b). A flexible instrumental variable approach. Statistical Modelling, 11, 581–603. Marra, G. & Radice, R. (2013). SemiParBIVProbit: Semiparametric Bivariate Probit Modelling. R package version 3.2-7. Marra, G. & Wood, S. (2012). Coverage properties of confidence intervals for generalized additive model components. Scandinavian Journal of Statistics, 39, 53–74. Marra, G. & Wood, S. N. (2011). Practical variable selection for generalized additive models. Computational Statistics and Data Analysis, 55, 2372–2387. McCullagh, P. (1987). Tensor Methods in Statistics. London: Chapman and Hall. Nelsen, R. (2006). An Introduction to Copulas. New York: Springer. Nocedal, J. & Wright, S. J. (2006). Numerical Optimization. New York: Springer-Verlag. R Development Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press, New York. Schokkaert, E., Ourti, T. V., Graeve, D. D., Lecluyse, A., & de Voorde, C. V. (2010). Supplemental health insurance and equality of access in belgium. Health Economics, 4, 37–395. Shane, D. & Trivedi, P. (2012). What Drives Differences in Health Care Demand? The Role of Health Insurance and Selection Bias. Health, Econometrics and Data Group (HEDG) Working Papers 12/09, HEDG, c/o Department of Economics, University of York. Sindelar, J. (1982). Differential use of medical care by sex. The Journal of Political Economy, 90, 1003–1019. Sklar, A. (1959). Fonctions de répartition é n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, 8, 229–231. Sklar, A. (1973). Random variables, joint distributions, and copulas. Kybernetica, 9, 449–460. Srivastava, P. & Zhao, X. (2008). Impact of Private Health Insurance on the Choice of Public versus Private Hospital Services. Health, Econometrics and Data Group (HEDG) Working Papers 08/17, HEDG, c/o Department of Economics, University of York. Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307–333. Wang, X. & Dey, D. K. (2010). Generalized extreme value regression for binary response data: An application to b2b electronic payments system adoption. The Annals of Applied Statistics, 4, 2000–2023. 33 Wilde, J. (2000). Identification of multiple equation probit models with endogenous dummy regressors. Economics Letters, 69, 309–312. Winkelmann, R. (2012). Copula bivariate probit models: with an application to medical expenditures. Health Economics, 21, 1444–1455. Wood, S. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99, 673–686. Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society Series B, 65, 95–114. Wood, S. N. (2006). Generalized Additive Models: An Introduction With R. Chapman & Hall/CRC, London. Wood, S. N. (2013). On p-values for smooth components of an extended generalized additive model. Biometrika, 100, 221–228. Zimmer, D. M. & Trivedi, P. K. (2006). Using trivariate copulas to model sample selection and treatment effects: application to family health care demand. Journal of Business & Economic Statistics, 24, 63–76. 34