Copula Regression Spline Models for Binary Outcomes ∗ Rosalba Radice

Copula Regression Spline Models for Binary Outcomes
With Application in Health Care Utilization∗
Rosalba Radice†
Department of Economics, Mathematics and Statistics
Birkbeck, London, U.K.
Malet Street, London WC1E 7HX, U.K.
Giampiero Marra and Małgorzata Wojtyś
Department of Statistical Science
University College London
Gower Street, London WC1E 6BT, U.K.
Abstract
We introduce a framework for estimating the effect that a binary treatment has on a binary outcome in the presence of unobserved and residual confounding, and of non-linear
dependence between treatment and outcome. The method is applied to a case study which
uses data from the 2008 Medical Expenditure Panel Survey and whose aim is to estimate the
effect of private health insurance on health care utilization. Unobserved confounding arises
when variables which are associated with both treatment and outcome (hence confounders) are
not available. Residual confounding occurs when observed confounders are insufficiently accounted for in the analysis. Treatment and outcome may also exhibit a dependence that cannot
be modeled using a linear measure of association. The problem of unobserved confounding
is addressed using a two-equation structural latent variable framework, where one equation
describes a binary outcome as a function of a binary treatment whereas the other equation
determines whether the treatment is received. The residual confounding issue is tackled by
modeling flexibly covariate-response relationships using a spline approach. Non-linear dependence between treatment and outcome is dealt with by using copula functions. Related
model fitting and inferential procedures are developed and asymptotic arguments presented.
The findings of our empirical analysis suggest that the three issues tackled in this paper are
present when estimating the impact of private health insurance on health care utilization, and
that useful insights can be gained by using the proposed framework.
Key Words: Binary outcomes; Copula; Penalized regression spline; Residual confounding;
Simultaneous equation estimation; Unobserved confounding.
∗ Research Report number 321, Department of Statistical Science, University College London. Date: August 2013
† r.radice.bbk@ac.uk
1 Introduction
Quantifying the effect of a treatment on an outcome is a challenging task in observational studies. This is because there can be confounders, that is variables associated with both treatment and
outcome, which are not observable as they are either unknown or not readily quantifiable. This
problem, known in econometrics as endogeneity of the treatment, constitutes a serious limitation to
the use of classic methods since they may yield inconsistent estimates of the effect of interest. The
matter is further complicated by the problem of residual confounding which arises when observed
confounders are insufficiently accounted for in the analysis; unmodeled confounder-outcome effects enter the unstructured (error) term of the model causing the treatment to be associated with it
and hence inducing bias in the treatment effect. An additional issue which warrants consideration
is that treatment and outcome might be associated in a non-linear fashion; for instance, there might
exist an asymmetric tail dependence that a linear measure of association would not be able to fully
capture.
In this article, we consider the case in which the researcher is interested in estimating the effect
of a binary treatment on a binary outcome in the presence of unobserved and residual confounding,
and of non-linear dependence between treatment and outcome. To clarify ideas, let us consider a
case study which uses data from the 2008 Medical Expenditure Panel Survey (MEPS) and whose
goal is to estimate the effect of having private health insurance on the probability of using health
care services. Because this study is based on observational data obtained through a population-based survey, insurance coverage is not randomly assigned as in a controlled trial but rather is the
result of demand and supply variables, including individual preferences and health status. Even
if these data are from a detailed survey, it is not possible to fully control for such variables. As
a consequence, the effect of insurance coverage is likely to be biased because of the presence of
unobserved confounders that are associated with insurance coverage and health care utilization.
The problem of residual confounding arises because the effects of observed confounders such as
age and education may be complex since they embody productivity and life-cycle effects that are
likely to affect private health insurance and health care utilization non-linearly. If these relationships are mis-modeled then the unstructured term of the model and private health insurance will be
associated hence causing the effect of insurance on the probability of using health care services to
be biased. Finally, insurance and health care utilization might be non-linearly associated. In other
words, there might be dependencies that appear in the tails of the distribution of both insurance
and health care utilization which need to be modeled.
The effect of unobserved confounding can be accounted for by using the recursive bivariate
probit (e.g., Greene, 2012; Maddala, 1983). This model controls for unobserved confounding by
using a two-equation structural latent variable framework, where one equation describes a binary
outcome (health care utilization) as a function of a binary treatment (insurance coverage) whereas
the other equation determines whether the treatment is received. The model is completed by assuming that the latent errors of the two equations follow a standard bivariate normal distribution
with correlation θ; θ ≠ 0 suggests that unobserved confounding is present, hence joint estimation
of the two equations is required. Applications in economics and biostatistics include those by
Goldman et al. (2001), Jones et al. (2006), Gitto et al. (2006), Latif (2009), Kawatkar & Nichol
(2009), Schokkaert et al. (2010) and Li & Jensen (2011). The limitations of this model are, however, the inability to deal with residual confounding and to model non-linear dependencies between the two equations. To deal with the problem of residual confounding, Marra & Radice
(2011a) introduced a penalized likelihood estimation approach to estimate simultaneously the parameters of a system of two binary equations where covariate-response relationships are flexibly modeled using a spline approach. Alternative approaches are available in the literature (e.g.,
Chib & Greenberg, 2007); however, the lack of user-friendly software makes them impractical for
practitioners. To deal with the problem of non-linear dependence between treatment and outcome,
Winkelmann (2012) discussed a modification of the recursive bivariate probit that maintains the
normal assumption for the marginal distributions of the two equations while introducing non-normal dependence between them using the Frank and Clayton copulas.
The contribution of this article is twofold, one theoretical and the other practical. First, we
extend the procedures discussed in Marra & Radice (2011a) and Winkelmann (2012) to deal simultaneously with unobserved and residual confounding, and non-linear dependencies between
the model equations. In particular, we generalize the penalized likelihood estimation approach
based on the assumption of bivariate normality presented in Marra & Radice (2011a) by allowing
for non-normal dependencies; this is achieved by employing the Clayton, Frank, Gumbel, Joe and
Student-t copulas and the rotated versions of Clayton, Gumbel and Joe. This generalization is of
paramount importance as in many applied studies the kind of dependence is not known a priori
and using copulas that can capture only positive or symmetric dependencies, as in Winkelmann
(2012), might mean failing to uncover possibly interesting structures in the data. We also provide
some theoretical argumentation related to the asymptotic behavior of the proposed estimator and
the ensuing formula to calculate the treatment effect. Second, we implement the methods discussed in this article in the R package SemiParBIVProbit (Marra & Radice, 2013). This can
be particularly attractive to practitioners who wish to fit such models. No other computational
alternatives which consider copula regression spline models for binary outcomes are available in
the literature.
The rest of the paper is organized as follows. Section 2 discusses the model structure, parameter estimation procedure, copula model selection, confidence intervals and variable selection, and
presents some asymptotic results for the introduced penalized maximum likelihood estimator and
the ensuing formula used to calculate the impact of the treatment on the outcome. Section 3 applies the method to the MEPS data mentioned above, whereas Section 4 concludes and discusses
some possible extensions of the work.
2 Methods
2.1 Model definition
The focus is on a pair of random variables (y1i , y2i ), for i = 1, . . . , n, where yvi ∈ {0, 1}, for
v = 1, 2, and n represents the sample size. Variable y1i refers to the treatment whereas y2i to
the outcome. The observed $y_{vi}$ can be viewed as determined by a latent continuous variable $y^*_{vi}$
such that $y_{vi} = 1(y^*_{vi} > 0)$, for which we assume that $y^*_{vi} \sim N(\eta_{vi}, 1)$, where $\eta_{vi} \in \mathbb{R}$ is a linear
predictor defined in the next section for v = 1, 2. Recall that 1 is the indicator function. The
joint cumulative distribution function (cdf) of event (y1i = 1, y2i = 1) can be defined by using the
copula representation (Sklar, 1959, 1973), i.e.
P(y1i = 1, y2i = 1) = C(P(y1i = 1), P(y2i = 1); θ),
where P(yvi = 1) = Φ(ηvi ), Φ(·) is the cdf of the standard univariate normal distribution, C
is a two-place copula function and θ is an association parameter measuring the dependence between the two marginals P(y1i = 1) and P(y2i = 1). In other words, the joint distribution is
expressed in terms of marginal distributions and a function C that binds them together. A substantial advantage of the copula approach is that the marginal distributions may come from different
families. This construction allows researchers to consider marginal distributions and the dependence between them as two separate but related issues. The families considered are Clayton,
Frank, Gaussian, Gumbel, Joe and Student-t, as well as rotated versions of the Clayton, Gumbel
and Joe copulas. Rotation by 180 degrees leads to a survival copula (C180 ), while rotation by
90 (C90 ) and 270 degrees (C270 ) allows for the modeling of negative dependence which is not
possible with the non-rotated and 180 degrees-rotated versions. The families considered here are
defined in Table 1 and displayed in Figure 1. The rotated versions can be obtained using (e.g.,
Brechmann & Schepsmeier, 2013)
C90 (ui , vi ) = vi − C(1 − ui , vi ),
C180 (ui , vi ) = ui + vi − 1 + C(1 − ui , 1 − vi ),
C270 (ui , vi ) = ui − C(ui , 1 − vi ),
where ui = P(y1i = 1) and vi = P(y2i = 1); Figure 2 shows the Clayton copula rotated by
0, 90, 180 and 270 degrees. Note that the parameter ranges of the θ for the copulas rotated by
90 and 270 degrees are on a negative scale; e.g., for the Gumbel copula rotated by 90 and 270
degrees θ has to be smaller than −1. As can be seen from Table 1, θ is not especially meaningful,
except for the Gaussian and Student-t copulas. However, as shown in Brechmann & Schepsmeier
(2013), for each copula there exists a relation between θ and the Kendall’s τ coefficient (which is a
measure of association that lies in the customary range −1 to 1). For full details on copulas and
their properties see Nelsen (2006).
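As an illustration only, the rotation formulas can be sketched in Python for the Clayton copula. The function names are ours and are not part of any package; the 180-degree case uses the standard survival-copula form, and the relation τ = θ/(θ + 2) used in the last function is the well-known Kendall's τ formula for the Clayton family.

```python
def clayton(u, v, theta):
    """Clayton copula C(u, v; theta) for theta > 0 (Table 1)."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def rotated_clayton(u, v, theta, degrees=0):
    """Clayton copula rotated by 0, 90, 180 or 270 degrees.

    For 90 and 270 degrees theta must be negative; the base copula is
    then evaluated at -theta, which is positive.
    """
    if degrees == 0:
        return clayton(u, v, theta)
    if degrees == 90:
        return v - clayton(1.0 - u, v, -theta)
    if degrees == 180:
        # Survival copula.
        return u + v - 1.0 + clayton(1.0 - u, 1.0 - v, theta)
    if degrees == 270:
        return u - clayton(u, 1.0 - v, -theta)
    raise ValueError("degrees must be 0, 90, 180 or 270")


def clayton_theta_from_tau(tau):
    """Invert Kendall's tau = theta / (theta + 2) for the Clayton copula."""
    return 2.0 * tau / (1.0 - tau)
```

With τ = 0.5, as in the figures, the last function gives θ = 2. Evaluating the 90-degree rotation at (0.5, 0.5) with θ = −2 returns a joint probability below 0.25, the value under independence, which is the negative dependence the rotation is meant to capture.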
The log-likelihood function for the recursive bivariate probit model can be expressed as (Greene,
2012; Maddala, 1983)
$$
\ell = \sum_{i=1}^{n} \{ y_{1i} y_{2i} \log P(y_{1i} = 1, y_{2i} = 1) + y_{1i}(1 - y_{2i}) \log P(y_{1i} = 1, y_{2i} = 0)
+ (1 - y_{1i}) y_{2i} \log P(y_{1i} = 0, y_{2i} = 1) + (1 - y_{1i})(1 - y_{2i}) \log P(y_{1i} = 0, y_{2i} = 0) \},
$$
where P(y1i = 1, y2i = 0) = P(y1i = 1) − P(y1i = 1, y2i = 1), P(y1i = 0, y2i = 1) = P(y2i = 1) −
P(y1i = 1, y2i = 1) and P(y1i = 0, y2i = 0) = 1 − [P(y1i = 1) + P(y2i = 1) − P(y1i = 1, y2i = 1)].

Copula      C(u, v; θ)                                                                              Parameter range
Clayton     $(u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$                                            θ ∈ (0, ∞)
Frank       $-\theta^{-1} \log\{1 + (e^{-\theta u} - 1)(e^{-\theta v} - 1)/(e^{-\theta} - 1)\}$      θ ∈ R\ {0}
Gaussian    $\Phi_2(\Phi^{-1}(u), \Phi^{-1}(v); \theta)$                                             θ ∈ [−1, 1]
Gumbel      $\exp[-\{(-\log u)^{\theta} + (-\log v)^{\theta}\}^{1/\theta}]$                          θ ∈ [1, ∞)
Joe         $1 - \{(1-u)^{\theta} + (1-v)^{\theta} - (1-u)^{\theta}(1-v)^{\theta}\}^{1/\theta}$      θ ∈ (1, ∞)
Student-t   $t_{2,\nu}(t_{\nu}^{-1}(u), t_{\nu}^{-1}(v); \theta)$                                    θ ∈ [−1, 1], ν ∈ (2, ∞)
Table 1: Definition of some classic copula functions. $\Phi_2(\cdot, \cdot; \theta)$ denotes the cumulative distribution function of
a standard bivariate normal distribution with correlation coefficient θ and $t_{2,\nu}(\cdot, \cdot; \theta)$ that of a standard bivariate
Student-t distribution with correlation coefficient θ and ν degrees of freedom. $t_{\nu}^{-1}$ denotes the inverse univariate
Student-t distribution function with ν degrees of freedom. Recall that ν controls the heaviness of the tails.
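The four cell probabilities entering the log-likelihood can be sketched as follows. This is a minimal illustration using the Frank copula from Table 1 and probit margins; the function names are hypothetical, and the linear predictor for the outcome is assumed to be supplied already evaluated at the observed treatment.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf


def frank(u, v, theta):
    """Frank copula C(u, v; theta), theta != 0 (Table 1)."""
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def cell_probabilities(eta1, eta2, theta, copula=frank):
    """Return P(y1 = a, y2 = b) for (a, b) in {0, 1}^2 as a dict."""
    p1, p2 = PHI(eta1), PHI(eta2)
    p11 = copula(p1, p2, theta)
    return {
        (1, 1): p11,
        (1, 0): p1 - p11,
        (0, 1): p2 - p11,
        (0, 0): 1.0 - (p1 + p2 - p11),
    }


def log_likelihood(y1, y2, eta1, eta2, theta, copula=frank):
    """Sum of log cell probabilities over observations.

    eta2[i] is assumed to include the treatment term for observation i.
    """
    total = 0.0
    for a, b, e1, e2 in zip(y1, y2, eta1, eta2):
        total += math.log(cell_probabilities(e1, e2, theta, copula)[(a, b)])
    return total
```

Because only one indicator product is non-zero for each observation, selecting the cell that matches the observed pair is equivalent to the four-term sum in the log-likelihood.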
[Figure 1 panels: Gaussian, Frank, Clayton, Student-t, Joe, Gumbel; both axes run from −3 to 3, contour levels from 0.01 to 0.2.]
Figure 1: Contour plots of the six copula functions defined in Table 1 with standard normal margins for data simulated
using a Kendall’s τ coefficient of 0.5. The Gaussian, Student-t (here with ν = 3) and Frank copulas allow for equal
degrees of positive and negative dependence. Gaussian and Frank show a weaker tail dependence as compared to
Student-t, and Frank exhibits a slightly stronger dependence in the middle of the distribution. Clayton is asymmetric
with a strong lower tail dependence but a weaker upper tail dependence. Vice versa for the Gumbel and Joe copulas.
[Figure 2 panels: 0 degrees, 90 degrees, 180 degrees, 270 degrees; both axes run from −3 to 3, contour levels from 0.01 to 0.2.]
Figure 2: Contour plots of the Clayton copula with standard normal margins rotated by 0, 90, 180 and 270 degrees for
data simulated using a Kendall’s τ coefficient of 0.5 for positive dependence and of −0.5 for negative dependence.
The plots for the rotated Gumbel and Joe copulas follow the same rationale.
2.1.1 Linear predictor specification
The linear predictor for the treatment can be written as
$$
\eta_{1i} = u_{1i}^T \alpha_1 + \sum_{k_1=1}^{K_1} s_{1k_1}(z_{1k_1 i}), \qquad (1)
$$
whereas that for the outcome as
$$
\eta_{2i} = \psi y_{1i} + u_{2i}^T \alpha_2 + \sum_{k_2=1}^{K_2} s_{2k_2}(z_{2k_2 i}), \qquad (2)
$$
where ψ is the effect of the treatment on the outcome on the scale of the linear predictor, $u_{1i}^T = (1, u_{12i}, \ldots, u_{1P_1 i})$
is the ith row of $U_1 = (u_{11}, \ldots, u_{1n})^T$, the $n \times P_1$ model matrix containing
$P_1$ parametric terms (e.g., intercept, dummy and categorical variables), $\alpha_1$ is a coefficient vector, and the $s_{1k_1}$ are unknown smooth functions of the $K_1$ continuous covariates $z_{1k_1 i}$. Varying
coefficient models can be obtained by multiplying one or more smooth terms by some predictor(s) (Hastie & Tibshirani, 1993), and smooth functions of two or more covariates can also be
considered (Wood, 2006). Similarly, $u_{2i}^T = (1, u_{22i}, \ldots, u_{2P_2 i})$ is the ith row vector of the $n \times P_2$
model matrix $U_2 = (u_{21}, \ldots, u_{2n})^T$, $\alpha_2$ is a parameter vector, and the $s_{2k_2}$ are unknown smooth
terms of the $K_2$ continuous regressors $z_{2k_2 i}$. The smooth functions are subject to the centering
(identifiability) constraint $\sum_{i=1}^{n} s_{vk_v}(z_{vk_v i}) = 0$, $v = 1, 2$, $k_v = 1, \ldots, K_v$ (Wood, 2006).
The smooth functions are represented using the regression spline approach (e.g., Ruppert et al.,
2003). Specifically, $s_{vk_v}(z_{vk_v i})$ is approximated by a linear combination of known spline basis
functions, $b_{vk_v j}(z_{vk_v i})$, and regression parameters, $\beta_{vk_v j}$, i.e.
$$
s_{vk_v}(z_{vk_v i}) = \sum_{j=1}^{J_{vk_v}} \beta_{vk_v j} b_{vk_v j}(z_{vk_v i}) = B_{vk_v}(z_{vk_v i})^T \beta_{vk_v},
$$
where $J_{vk_v}$ is the number of spline bases used to represent $s_{vk_v}(\cdot)$, $B_{vk_v}(z_{vk_v i})^T$
is the ith vector of dimension $J_{vk_v}$ containing the basis functions evaluated at the observation
$z_{vk_v i}$, i.e. $B_{vk_v}(z_{vk_v i}) = \{b_{vk_v 1}(z_{vk_v i}), b_{vk_v 2}(z_{vk_v i}), \ldots, b_{vk_v J_{vk_v}}(z_{vk_v i})\}^T$, and $\beta_{vk_v}$ is the corresponding parameter vector. Evaluating $B_{vk_v}(z_{vk_v i})$ for each i yields $J_{vk_v}$ curves with different
degrees of complexity which, multiplied by some value for $\beta_{vk_v}$ and then summed, give a
(linear or non-linear) estimate for $s_{vk_v}(z_{vk_v})$; see Ruppert et al. (2003) for a detailed overview.
Basis functions should be chosen to have convenient mathematical and numerical properties. We
employ low rank thin plate regression splines (Wood, 2006), although B-splines and cubic regression splines are supported in our implementation. The cases of smooth terms multiplied by
some covariate(s) and of smooths of more than one variable follow a similar construction; see
Wood (2006, Chapter 4) for full details. Linear predictors (1) and (2) can, therefore, be written as
$\eta_{1i} = u_{1i}^T \alpha_1 + B_{1i}^T \beta_1$ and $\eta_{2i} = \psi y_{1i} + u_{2i}^T \alpha_2 + B_{2i}^T \beta_2$, where $B_{vi}^T = \{B_{v1}(z_{v1i})^T, \ldots, B_{vK_v}(z_{vK_v i})^T\}$
and $\beta_v^T = (\beta_{v1}^T, \ldots, \beta_{vK_v}^T)$. After defining $X_{1i} = (u_{1i}^T, B_{1i}^T)^T$ and $X_{2i} = (y_{1i}, u_{2i}^T, B_{2i}^T)^T$, we have
$\eta_{1i} = X_{1i}^T \delta_1$ and $\eta_{2i} = X_{2i}^T \delta_2$ where $\delta_1^T = (\alpha_1^T, \beta_1^T)$ and $\delta_2^T = (\psi, \alpha_2^T, \beta_2^T)$.
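To make the basis representation concrete, the sketch below builds a small design matrix and evaluates one smooth term as a linear combination of basis functions. It deliberately uses a truncated linear basis of the kind discussed in Ruppert et al. (2003) rather than the thin plate regression splines employed in the paper, since the former can be written in a few lines; the knots, coefficients and function names are illustrative assumptions. Removing the column means imposes the centering constraint for any coefficient vector.

```python
def truncated_linear_basis(z, knots):
    """Row B(z)^T = [z, (z - k_1)_+, ..., (z - k_K)_+]."""
    return [z] + [max(z - k, 0.0) for k in knots]


def centred_design(zs, knots):
    """Basis matrix with column means removed, so that the smooth
    sums to zero over the sample whatever the coefficients are."""
    rows = [truncated_linear_basis(z, knots) for z in zs]
    n, J = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(J)]
    return [[r[j] - means[j] for j in range(J)] for r in rows]


def smooth_values(design, beta):
    """s(z_i) = B(z_i)^T beta for every observation i."""
    return [sum(b * c for b, c in zip(row, beta)) for row in design]


zs = [i / 10 for i in range(11)]             # covariate values on [0, 1]
knots = [0.25, 0.5, 0.75]                    # illustrative knot locations
B = centred_design(zs, knots)
s = smooth_values(B, [1.0, -2.0, 0.5, 3.0])  # arbitrary illustrative beta
```

Each row of `B` plays the role of $B_{vk_v}(z_{vk_v i})^T$, so stacking such blocks next to the parametric columns gives the rows $X_{1i}^T$ and $X_{2i}^T$ defined above.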
To identify the parameters in η2i , it is typically assumed that the exclusion restriction on the
exogenous variables holds: the regressors in the treatment equation should contain at least one
covariate (usually referred to as an instrument) not included in the outcome equation. However, as shown for instance in Han & Vytlacil (2013), Marra & Radice (2011a) and Wilde (2000),
the presence of this restriction may not be necessary to obtain consistent estimates of the model
parameters.
2.2 Sample average treatment effect
The effect of the treatment y1i on the response probability P(y2i = 1|y1i ) is of primary interest.
That is, the aim is to investigate how the treatment changes the expected outcome. Thus, the
treatment effect is given by the difference between the expected outcome with treatment and the
expected outcome without treatment. Different measures of treatment effect have been proposed
in the literature. Here, we focus on the average treatment effect in the specific sample at hand
(SATE; Imbens, 2004), which in our case can be defined as
$$
\mathrm{SATE}(\delta, X) = \frac{1}{n} \sum_{i=1}^{n} \{ P(y_{2i} = 1 | y_{1i} = 1) - P(y_{2i} = 1 | y_{1i} = 0) \},
$$
where
$$
P(y_{2i} = 1 | y_{1i} = 1) = \frac{C(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=1)}); \theta)}{\Phi(\eta_{1i})},
\qquad
P(y_{2i} = 1 | y_{1i} = 0) = \frac{\Phi(\eta_{2i}^{(y_{1i}=0)}) - C(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=0)}); \theta)}{1 - \Phi(\eta_{1i})},
$$
the linear predictors are defined in the previous section, $\eta_{2i}^{(y_{1i}=r)}$ represents the linear predictor
evaluated at r equal to 1 or 0, $\delta^T = (\delta_1^T, \delta_2^T, \theta^*)$, with $\theta^*$ being a transformation of θ which is
useful in estimation (see the next section), and $X = (x_1 | \ldots | x_n)^T$ with $x_i$ defined as $(X_{1i}^T, X_{2i}^T)^T$.
SATE(δ, X) can be estimated using SATE(δ̂, X), whereas a confidence interval for it can be
obtained employing the delta method. Specifically, the appropriate estimator of the asymptotic
variance of SATE(δ̂, X) is
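$$
\left. \frac{\partial \mathrm{SATE}(\delta, X)}{\partial \delta} \right|_{\delta = \hat{\delta}}^{T}
\hat{V}_{\delta}
\left. \frac{\partial \mathrm{SATE}(\delta, X)}{\partial \delta} \right|_{\delta = \hat{\delta}}, \qquad (3)
$$
where $V_{\delta}$ is defined in Section 2.5 and
$$
\frac{\partial \mathrm{SATE}(\delta, X)}{\partial \delta} =
\left[ \frac{\partial \mathrm{SATE}(\delta, X)}{\partial \delta_1}^{T}, \frac{\partial \mathrm{SATE}(\delta, X)}{\partial \delta_2}^{T}, \frac{\partial \mathrm{SATE}(\delta, X)}{\partial \theta^*} \right]^{T},
$$
with elements defined in Appendix A. Alternatively, Bayesian posterior simulation can be employed as explained in Section 2.5.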
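The point estimate of the SATE can be sketched in a few lines once the linear predictors are available. The code below is an illustration under stated assumptions, not the package implementation: it uses the Frank copula, takes the fitted linear predictors as given lists, and the names `sate`, `eta2_treated` and `eta2_untreated` are ours.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf


def frank(u, v, theta):
    """Frank copula C(u, v; theta), theta != 0 (Table 1)."""
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def sate(eta1, eta2_treated, eta2_untreated, theta, copula=frank):
    """Average of P(y2 = 1 | y1 = 1) - P(y2 = 1 | y1 = 0) over the sample.

    eta2_treated / eta2_untreated are the outcome linear predictors
    evaluated with the treatment set to 1 and to 0, respectively.
    """
    total = 0.0
    for e1, e2_1, e2_0 in zip(eta1, eta2_treated, eta2_untreated):
        p1 = PHI(e1)
        q1, q0 = PHI(e2_1), PHI(e2_0)
        p_given_treated = copula(p1, q1, theta) / p1
        p_given_untreated = (q0 - copula(p1, q0, theta)) / (1.0 - p1)
        total += p_given_treated - p_given_untreated
    return total / len(eta1)
```

For example, with `eta2_treated` equal to `eta2_untreated` plus a positive ψ and θ very close to zero (near independence for the Frank copula), the function returns approximately the average of Φ(η + ψ) − Φ(η).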
2.3 Estimation
Let us denote the log-likelihood of a given copula function as $\ell(\delta)$, where δ is defined in the
previous section. Note that, because θ is bounded in most cases (see Table 1), we use a proper
transformation of it, $\theta^*$, to ensure that in optimization $\delta \in \mathbb{R}^p$, where p is the total number of
parameters; see Table 2 for the list of transformations employed. Given the flexible linear predictor
structure considered here, unpenalized estimation can result in smooth term estimates that are too
rough to produce practically useful results (e.g., Ruppert et al., 2003). This issue is dealt with by
using the penalty term $\sum_{v=1}^{2} \sum_{k_v=1}^{K_v} \lambda_{vk_v} \int s''_{vk_v}(z_{vk_v})^2 \, dz_{vk_v}$, which measures the second-order
roughness of the smooth terms in the model. The $\lambda_{vk_v}$ are smoothing parameters controlling the
trade-off between fit and smoothness. Since regression splines are linear in their model parameters,
the overall penalty can be written as $\beta^T S_{\lambda} \beta$ where $\beta^T = (\beta_1^T, \beta_2^T)$, $S_{\lambda} = \sum_{v=1}^{2} \sum_{k_v=1}^{K_v} \lambda_{vk_v} S_{vk_v}$
and the $S_{vk_v}$ are positive semi-definite known square matrices. The formulas for the $b_{vk_v j}(z_{vk_v i})$
and $S_{vk_v}$ depend on the type of spline employed and we refer the reader to Ruppert et al. (2003)
and Wood (2003, 2006) for these details. The function to maximize is
$$
\ell_p(\delta) = \ell(\delta) - \frac{1}{2} \beta^T S_{\lambda} \beta, \qquad (4)
$$
where the penalty term can also be written as $\frac{1}{2} \delta^T \tilde{S}_{\lambda} \delta$, where $\tilde{S}_{\lambda}$ is an overall penalty matrix defined as $\mathrm{diag}(0_{P_1}^T, \lambda_{1k_1} S_{1k_1}, \ldots, \lambda_{1K_1} S_{1K_1}, 0, 0_{P_2}^T, \lambda_{2k_2} S_{2k_2}, \ldots, \lambda_{2K_2} S_{2K_2}, 0)$ with $0_{P_v}^T = (0_{v1}, \ldots, 0_{vP_v})$.
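The quadratic form of the penalty can be illustrated with a simple stand-in for $S_{vk_v}$. The sketch below assumes a second-order difference penalty on the coefficients (in the spirit of P-splines), which is not the thin plate spline penalty used in the paper but shares the key property of leaving linear trends unpenalized; all names and values are hypothetical.

```python
def second_difference_penalty(J):
    """S = D^T D, where D takes second differences of J coefficients."""
    D = [[0.0] * J for _ in range(J - 2)]
    for r in range(J - 2):
        D[r][r], D[r][r + 1], D[r][r + 2] = 1.0, -2.0, 1.0
    return [[sum(D[r][a] * D[r][b] for r in range(J - 2)) for b in range(J)]
            for a in range(J)]


def quadratic_form(beta, S):
    """beta^T S beta."""
    J = len(beta)
    return sum(beta[a] * S[a][b] * beta[b] for a in range(J) for b in range(J))


def penalized_loglik(loglik, beta, S, lam):
    """l_p = l - (1/2) * lam * beta^T S beta for a single smooth term."""
    return loglik - 0.5 * lam * quadratic_form(beta, S)


S = second_difference_penalty(5)
rough = quadratic_form([0.0, 0.0, 1.0, 0.0, 0.0], S)   # a single spike
linear = quadratic_form([1.0, 2.0, 3.0, 4.0, 5.0], S)  # a straight line
```

The spike is penalized while the straight line is not, which is how a larger smoothing parameter pulls the estimated smooth towards a linear fit.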
Copula               θ∗
Clayton              $\log(\theta - \epsilon)$
Frank                $\theta - \epsilon$
Gaussian/Student-t   $\tanh^{-1}(\theta)$
Gumbel               $\log(\theta - 1)$
Joe                  $\log(\theta - 1 - \epsilon)$
Table 2: Transformations, $\theta^*$, of the dependence parameter, θ, used in optimization. Quantity $\epsilon$ is set to the machine
smallest positive floating-point number multiplied by $10^6$ and is used in some cases to ensure that the dependence
parameters lie in the ranges reported in Table 1.

Given a parameter vector value for $\hat{\lambda} = (\hat{\lambda}_{1k_1}, \ldots, \hat{\lambda}_{1K_1}, \hat{\lambda}_{2k_2}, \ldots, \hat{\lambda}_{2K_2})$, we seek to maximize
(4). In practice, this is achieved by using a trust region algorithm (Nocedal & Wright, 2006,
Chapter 4). Specifically, let us define the penalized gradient and Hessian at iteration a as $g_p^{[a]} = g^{[a]} - \tilde{S}_{\hat{\lambda}} \delta^{[a]}$
and $H_p^{[a]} = H^{[a]} - \tilde{S}_{\hat{\lambda}}$, where $g^{[a]}$ is made up of $g_1^{[a]} = \partial \ell(\delta) / \partial \delta_1 |_{\delta_1 = \delta_1^{[a]}}$, $g_2^{[a]} = \partial \ell(\delta) / \partial \delta_2 |_{\delta_2 = \delta_2^{[a]}}$
and $g_3^{[a]} = \partial \ell(\delta) / \partial \theta^* |_{\theta^* = \theta^{*[a]}}$, and the Hessian matrix has a 3 × 3 matrix block
structure with (r, h)th element $H_{r,h}^{[a]} = \partial^2 \ell(\delta) / \partial \delta_r \partial \delta_h^T |_{\delta_r = \delta_r^{[a]}, \delta_h = \delta_h^{[a]}}$, r, h = 1, . . . , 3, where $\delta_3 = \theta^*$;
details on the structure of g and H can be found in Appendix B. Each iteration of the trust
region algorithm solves the problem
$$
\max_{\delta} \breve{\ell}_p(\delta) = \ell_p(\delta^{[a]}) + \delta^T g_p^{[a]} + \frac{1}{2} \delta^T H_p^{[a]} \delta \quad \text{so that } \|\delta\| \le r^{[a]},
\qquad
\delta^{[a+1]} = \arg\max_{\delta} \breve{\ell}_p(\delta),
$$
where $\| \cdot \|$ denotes the Euclidean norm and $r^{[a]}$ represents the radius of the trust region. At
each iteration of the algorithm, the quadratic approximation of the objective function at $\delta^{[a]}$ is
maximized subject to the constraint that the solution falls within a trust region with radius $r^{[a]}$.
The proposed solution is then accepted or rejected and the trust region expanded or shrunken
based on the ratio between the improvement in the objective function when going from $\delta^{[a]}$ to
$\delta^{[a+1]}$ and that predicted by the quadratic approximation. The exact details of the implementation
used here can be found in Geyer (2013) who also discusses numerical stability and termination
criteria. Although there is no guarantee that a trust region algorithm will always converge faster
than standard approaches, it may work better for difficult optimization problems (Geyer, 2013;
Nocedal & Wright, 2006). In our experience, this approach proved to be considerably faster and
more stable than classic alternatives.
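The accept/reject and radius-update logic described above can be illustrated on a deliberately simple problem. The sketch below maximizes an intercept-only probit log-likelihood in one dimension, where the trust region subproblem reduces to clipping the Newton step to the current radius. The thresholds 0.25, 0.75 and 0.1 and all names are illustrative assumptions; they are not taken from the implementation of Geyer (2013).

```python
import math
from statistics import NormalDist

N01 = NormalDist()


def loglik(delta, n1, n0):
    """Intercept-only probit log-likelihood with n1 ones and n0 zeros."""
    p = N01.cdf(delta)
    return n1 * math.log(p) + n0 * math.log(1.0 - p)


def grad_hess(delta, n1, n0):
    """Analytic first and second derivatives of loglik with respect to delta."""
    p, d = N01.cdf(delta), N01.pdf(delta)
    l1, l0 = d / p, d / (1.0 - p)
    g = n1 * l1 - n0 * l0
    h = -n1 * l1 * (delta + l1) - n0 * l0 * (l0 - delta)
    return g, h


def trust_region_maximize(n1, n0, delta=3.0, radius=1.0, max_radius=10.0,
                          tol=1e-6, max_iter=100):
    for _ in range(max_iter):
        g, h = grad_hess(delta, n1, n0)
        if abs(g) < tol:
            break
        # 1-D subproblem: Newton step clipped to the trust region.
        step = max(-radius, min(radius, -g / h))
        predicted = g * step + 0.5 * h * step * step
        actual = loglik(delta + step, n1, n0) - loglik(delta, n1, n0)
        rho = actual / predicted
        if rho < 0.25:                       # poor agreement: shrink the region
            radius *= 0.25
        elif rho > 0.75 and abs(step) == radius:
            radius = min(2.0 * radius, max_radius)   # good agreement: expand
        if rho > 0.1:                        # accept the proposed step
            delta += step
    return delta
```

For this toy model the maximizer is known in closed form, namely the standard normal quantile of the sample proportion of ones, which gives a convenient check on the iteration.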
In the Student-t case, we fix the degrees of freedom ν because the derivatives with respect to this parameter are
involved; moreover, g and H would then have four (rather than three) main blocks, making
the algorithm slower and less stable. As a solution, we propose fitting one model for each value in
a set of candidate ν values and assessing the sensitivity of the results to this parameter. In our
experience, this proved to be a practical and effective means of handling ν.
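Returning to the reparameterization of the dependence parameter in Table 2, the mappings and their inverses can be sketched as below. The value of `EPS` is an assumption: it is taken here to be machine epsilon times 10^6, although the wording in the caption of Table 2 could also be read as the smallest representable positive number. The dictionary names are ours.

```python
import math
import sys

# Assumption: machine epsilon times 10^6 (see the lead-in above).
EPS = sys.float_info.epsilon * 1e6

# theta -> theta*, an unconstrained value used in optimization.
TO_STAR = {
    "clayton": lambda t: math.log(t - EPS),
    "frank": lambda t: t - EPS,
    "gaussian": math.atanh,
    "gumbel": lambda t: math.log(t - 1.0),
    "joe": lambda t: math.log(t - 1.0 - EPS),
}

# theta* -> theta, mapping any real number back into the admissible range.
FROM_STAR = {
    "clayton": lambda s: math.exp(s) + EPS,
    "frank": lambda s: s + EPS,
    "gaussian": math.tanh,
    "gumbel": lambda s: math.exp(s) + 1.0,
    "joe": lambda s: math.exp(s) + 1.0 + EPS,
}
```

Any real value of θ∗ proposed by the optimizer is therefore mapped to an admissible θ, for instance `FROM_STAR["gaussian"](3.0)` lies strictly inside (−1, 1).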
2.3.1 Smoothness selection
If the model has more than one smooth term per equation, then smoothing parameter selection by
direct grid search optimization of, for instance, a prediction error criterion can be computationally
burdensome. It is therefore pivotal for practical modeling to estimate λ in an automatic way. There
are many techniques for automatic multiple smoothing parameter selection within the penalized
likelihood framework; see Ruppert et al. (2003) and Wood (2006) for detailed overviews. Note
that joint estimation of δ and λ via maximization of (4) would clearly lead to over-fitting since the
highest value for $\ell_p(\delta)$ would be obtained when λ = 0. In fact, λ should be selected so that the
estimated smooth components are as close as possible to the true functions.
We essentially adopt the approach described in Marra & Radice (2011a) which applies a modified version of the unbiased risk estimator (UBRE; Craven & Wahba, 1979; Gu, 2002) to each
working linear model obtained from a trust region iteration. Specifically, the problem to solve is
$$
\min_{\lambda} V_u(\lambda) = \frac{1}{\check{n}} \| \sqrt{W^{[a]}} (z^{[a]} - \tilde{X} \delta) \|^2 - 1 + \frac{2}{\check{n}} \gamma \, \mathrm{tr}(A_{\lambda}^{[a]}),
\qquad
\lambda^{[a+1]} = \arg\min_{\lambda} V_u(\lambda), \qquad (5)
$$
where $\check{n} = 2n$, $\sqrt{W^{[a]}}$ is the square root of a block diagonal matrix made up of 2 × 2 matrices
$W_i^{[a]}$ with (r, h)th element given by $-\partial^2 \ell(\delta)_i / \partial \eta_{ri} \partial \eta_{hi} |_{\eta_{ri} = \eta_{ri}^{[a]}, \eta_{hi} = \eta_{hi}^{[a]}}$, r, h = 1, 2, $z_i^{[a]}$ is
the 2-dimensional vector $\tilde{X}_i \delta^{[a]} + W_i^{-1,[a]} d_i^{[a]}$, $d_i^{[a]} = \{ \partial \ell(\delta)_i / \partial \eta_{1i} |_{\eta_{1i} = \eta_{1i}^{[a]}}, \partial \ell(\delta)_i / \partial \eta_{2i} |_{\eta_{2i} = \eta_{2i}^{[a]}} \}^T$,
$\tilde{X}_i = \mathrm{diag}(X_{1i}^T, X_{2i}^T)$, with $X_{1i}$ and $X_{2i}$ defined in Section 2.1.1, $\tilde{X} = (\tilde{X}_1 | \ldots | \tilde{X}_n)^T$,
$A_{\lambda}^{[a]} = \tilde{X} (\tilde{X}^T W^{[a]} \tilde{X} + \check{S}_{\lambda})^{-1} \tilde{X}^T W^{[a]}$ is the hat matrix,
$\check{S}_{\lambda} = \mathrm{diag}(0_{P_1}^T, \lambda_{1k_1} S_{1k_1}, \ldots, \lambda_{1K_1} S_{1K_1}, 0, 0_{P_2}^T, \lambda_{2k_2} S_{2k_2}, \ldots, \lambda_{2K_2} S_{2K_2})$, $\mathrm{tr}(A_{\lambda}^{[a]})$ represents the estimated degrees of freedom (edf) of the penalized model (e.g., Wood, 2006), and γ is a tuning
parameter which can be increased from its usual value of 1 to obtain smoother models (Kim & Gu
(2004) suggest using γ ≈ 1.4 to correct the tendency to over-fitting of prediction error criteria).
In practice, this approach is implemented using the automatic method by Wood (2004), which is
based on Newton’s method and can evaluate the approximate bivariate UBRE and its first and second derivatives with respect to log(λ), since the estimated smoothing parameters must be positive,
in an efficient and stable way. Broadly speaking, this is achieved using a series of pivoted QR and
singular value decompositions, which make evaluation of $\mathrm{tr}(A_{\lambda}^{[a]})$ for new trial values of λ cheap
and derivative calculations efficient and stable; see Wood (2004) for full details. The two steps,
one for δ and the other for λ, are iterated until convergence. They can be summarized as follows:
step 1 For a given parameter vector value $\delta^{[a]}$ and holding the smoothing parameter vector fixed at
$\lambda^{[a]}$, find an estimate of δ: $\delta^{[a+1]} = \arg\max_{\delta} \breve{\ell}_p(\delta)$.
step 2 Using $\delta^{[a+1]}$ construct the working linear model quantities needed in (5) and find an estimate
of λ: $\lambda^{[a+1]} = \arg\min_{\lambda} V_u(\lambda)$.
In step 2, quantities $W^{-1} d$, $\sqrt{W} z$ and $\sqrt{W} \tilde{X}$ are efficiently set up by exploiting the sparse structure
of W. As opposed to the implementation discussed in Marra & Radice (2011a), where $\theta^*$ plays a
role in $V_u(\lambda)$ and as a consequence $\check{n} = 3n$, for convenience the working linear model quantities in
(5) are constructed for fixed values of $\theta^*$. This is because $\theta^*$ is neither penalized nor specified as
a function of a linear predictor, such as $\eta_{1i}$, whose components may need to be penalized. Hence,
the algorithm presented here is more efficient than that in Marra & Radice (2011a).
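The trade-off that the UBRE score balances can be seen in a much simplified setting. The sketch below assumes a single smoothing parameter, unit weights and a design with orthonormal columns, so that the hat matrix is diagonal and both the residual sum of squares and the edf have closed forms; it also replaces the Newton iteration of Wood (2004) by a grid search over log(λ). All names and numbers are illustrative assumptions.

```python
import math


def ubre_score(lam, c, s, rss0, n, gamma=1.0):
    """UBRE-type score and edf for one smoothing parameter.

    c    : projections of the working response on orthonormal columns
    s    : penalty eigenvalues (0 marks an unpenalized coefficient)
    rss0 : residual sum of squares outside the column space
    """
    shrink = [1.0 / (1.0 + lam * sj) for sj in s]   # diagonal of the hat matrix
    rss = rss0 + sum((cj * (1.0 - a)) ** 2 for cj, a in zip(c, shrink))
    edf = sum(shrink)                               # tr(A_lambda)
    return rss / n - 1.0 + 2.0 * gamma * edf / n, edf


def select_lambda(c, s, rss0, n, gamma=1.0):
    """Grid search over log(lambda), which keeps lambda positive."""
    grid = [math.exp(-10.0 + 0.25 * k) for k in range(81)]
    return min(grid, key=lambda lam: ubre_score(lam, c, s, rss0, n, gamma)[0])


c = [5.0, 3.0, 0.5, 0.2, 0.1]      # illustrative values only
s = [0.0, 1.0, 4.0, 9.0, 16.0]
lam_1 = select_lambda(c, s, rss0=95.0, n=100, gamma=1.0)
lam_14 = select_lambda(c, s, rss0=95.0, n=100, gamma=1.4)
```

Raising γ from 1 to 1.4 puts more weight on the edf term, so the selected smoothing parameter cannot decrease, in line with the remark that larger γ yields smoother models.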
2.4 Copula model selection
Copula models with a single dependence parameter can be thought of as non-nested models (except, of course, for Gaussian and Student-t when ν → ∞). As suggested by Zimmer & Trivedi
(2006) among others, one approach for choosing between copula models is to use either the
Akaike or (Schwarz) Bayesian information criterion (AIC and BIC, respectively). In our case,
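$\mathrm{AIC} = -2 \ell(\hat{\delta}) + 2\,\mathrm{edf}$ and $\mathrm{BIC} = -2 \ell(\hat{\delta}) + \log(n)\,\mathrm{edf}$, where the log-likelihood is evaluated at
the penalized parameter estimates and $\mathrm{edf} = \mathrm{tr}(\hat{A}_{\hat{\lambda}})$.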
As pointed out by Brechmann & Schepsmeier (2013), the tests proposed by Vuong (1989)
and Clarke (2007) may be more reliable than information criteria when non-nested models are
compared. These are essentially likelihood-ratio-based tests which measure the distance between
two statistical models. In the Vuong test, the null hypothesis is that the two competing models are
equally close to the actual model, whereas the alternative is that one model is closer. Let $\ell_{C_1}(\hat{\delta}_{C_1})$
and $\ell_{C_2}(\hat{\delta}_{C_2})$ denote the log-likelihoods of two bivariate copula models $C_1$ and $C_2$ evaluated at
their respective parameter estimates, $\mathrm{edf}_{C_1}$ and $\mathrm{edf}_{C_2}$ be their estimated degrees of freedom and $q_i$
be equal to $\ell_{i,C_1}(\hat{\delta}_{C_1}) - \ell_{i,C_2}(\hat{\delta}_{C_2})$. In the current context, the Vuong test is
$$
V = \frac{ \{ \ell_{C_1}(\hat{\delta}_{C_1}) - \ell_{C_2}(\hat{\delta}_{C_2}) \} - \frac{\mathrm{edf}_{C_1} - \mathrm{edf}_{C_2}}{2} \log(n) }{ \sqrt{ \sum_{i=1}^{n} (q_i - \bar{q})^2 } }
\xrightarrow{d} N(0, 1) \quad \text{under } H_0,
$$
where $H_0 : E(q_i) = 0$ ∀i = 1, . . . , n. For a significance level ς, if $V > \Phi^{-1}(1 - \varsigma/2)$ then
model $C_1$ is preferred over $C_2$, and vice versa when $V < -\Phi^{-1}(1 - \varsigma/2)$. If $|V| \le \Phi^{-1}(1 - \varsigma/2)$
then we cannot discriminate between the two models given the data and as a consequence $H_0$
cannot be rejected. In the Clarke test, $H_0 : P(q_i > 0) = 0.5$ ∀i = 1, . . . , n. That is, if the
two models are statistically equivalent then the log-likelihood ratios of the observations should be
evenly distributed around zero and 50% of the ratios should be larger than zero. The Clarke test is
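$$
B = \sum_{i=1}^{n} 1 \left( q_i - \frac{\mathrm{edf}_{C_1} - \mathrm{edf}_{C_2}}{2n} \log(n) > 0 \right)
\xrightarrow{d} \mathrm{Bin}(n, 0.5) \quad \text{under } H_0.
$$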
Critical values can be obtained as shown in Clarke (2007). Intuitively, model C1 is preferred over
C2 if B is significantly larger than its expected value under the null hypothesis (n/2), and vice
versa. If B is not significantly different from n/2 then C1 can be thought of as equivalent to C2 .
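Both statistics are simple functions of the per-observation log-likelihood contributions, as the following sketch shows. It assumes these contributions and the edf of each fitted model are already available; the function names are ours, and the two-sided exact binomial p-value is one common way of calibrating the Clarke statistic rather than necessarily the procedure of Clarke (2007).

```python
import math
from statistics import NormalDist


def vuong_clarke(ll1, ll2, edf1, edf2):
    """Return (V, B) from per-observation log-likelihoods of C1 and C2."""
    n = len(ll1)
    q = [a - b for a, b in zip(ll1, ll2)]
    qbar = sum(q) / n
    correction = 0.5 * (edf1 - edf2) * math.log(n)
    V = (sum(q) - correction) / math.sqrt(sum((qi - qbar) ** 2 for qi in q))
    B = sum(1 for qi in q if qi - correction / n > 0)
    return V, B


def vuong_decision(V, level=0.05):
    """Preferred model according to the two-sided Vuong test."""
    crit = NormalDist().inv_cdf(1.0 - level / 2.0)
    if V > crit:
        return "C1"
    if V < -crit:
        return "C2"
    return "undecided"


def clarke_p_value(B, n):
    """Two-sided exact binomial p-value under Bin(n, 0.5)."""
    k = min(B, n - B)
    tail = sum(math.comb(n, j) for j in range(k + 1)) / 2.0 ** n
    return min(1.0, 2.0 * tail)
```

A value of B far above n/2 with a small p-value favours C1, a value far below n/2 favours C2, and a p-value near 1 indicates that the two models cannot be distinguished.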
2.5 Confidence intervals and variable selection
The parameter estimator for δ can be written in iteratively re-weighted least squares form as
$\hat{\delta} = (\tilde{X}^T W \tilde{X} + \tilde{S}_{\lambda})^{-1} \tilde{X}^T W z$, where $\tilde{X}^T W \tilde{X} + \tilde{S}_{\lambda} = -H_p$. It then follows that $V_{\hat{\delta}} = -H_p^{-1} H H_p^{-1}$.
As in Marra & Radice (2011a), the alternative Bayesian result $V_{\delta} = -H_p^{-1}$ can be employed as
well. As shown by Marra & Wood (2012), at finite sample sizes $V_{\delta}$ can produce intervals with
close to nominal coverage probabilities. This is because the Bayesian covariance matrix includes
both a bias and variance component in a frequentist sense, which is a feature that is not shared by
$V_{\hat{\delta}}$. Note that for unpenalized model components $V_{\delta}$ and $V_{\hat{\delta}}$ are equivalent.
Confidence intervals for linear functions of the model parameters can be easily constructed.
For instance, intervals for ŝvkv (zvkv i ) can be based on N (svkv (zvkv i ), Bvkv (zvkv i )T Vδvkv Bvkv (zvkv i )),
where Vδvkv is the sub-matrix of Vδ corresponding to the regression spline parameters associated
with ŝvkv (zvkv i ). Intervals for non-linear functions of the model coefficients (e.g., θ, Kendall’s
τ and SATE) can be conveniently obtained by simulation from the posterior distribution of δ as
follows:
step 1 Draw $n_{sim}$ random vectors from $N(\hat{\delta}, \hat{V}_{\delta})$.
step 2 Calculate the $n_{sim}$ simulated realizations of the function of interest. For instance, for a Gaussian
copula $\theta = \tanh(\theta^*)$, hence $\theta^{sim} = (\theta_1^{sim}, \theta_2^{sim}, \ldots, \theta_{n_{sim}}^{sim})$ where $\theta_i^{sim} = \tanh(\theta_i^{*,sim})$,
$i = 1, \ldots, n_{sim}$.
step 3 Using $\theta^{sim}$ calculate the lower, ς/2, and upper, 1 − ς/2, quantiles.
Small values for nsim are typically tolerable; nsim is set to 1000 because the above procedure is
relatively inexpensive to run. The significance level is usually set to 0.05. It may be argued that
intervals for non-linear functions can be obtained using the delta method (see, e.g., Section 2.2).
However, for functions like Kendall’s τ such a method may not be easy to apply.
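The three simulation steps can be sketched for the simplest case in which the function of interest depends on θ∗ only, so that a univariate normal draw suffices in place of the full multivariate draw for δ. The estimate 0.5 and standard error 0.2 are made-up inputs, the function names are ours, and the Kendall's τ expression (2/π) arcsin(θ) is the standard relation for the Gaussian copula.

```python
import math
import random


def simulated_interval(theta_star_hat, se, transform=math.tanh,
                       n_sim=1000, level=0.05, seed=1):
    """Quantile interval for transform(theta*) from normal draws of theta*."""
    rng = random.Random(seed)
    draws = sorted(transform(rng.gauss(theta_star_hat, se))
                   for _ in range(n_sim))
    lower = draws[math.floor(level / 2.0 * (n_sim - 1))]
    upper = draws[math.ceil((1.0 - level / 2.0) * (n_sim - 1))]
    return lower, upper


def tau_gaussian(theta_star):
    """Kendall's tau for the Gaussian copula: (2 / pi) * arcsin(theta)."""
    return 2.0 / math.pi * math.asin(math.tanh(theta_star))


theta_ci = simulated_interval(0.5, 0.2)                          # interval for theta
tau_ci = simulated_interval(0.5, 0.2, transform=tau_gaussian)    # interval for tau
```

Because the transformation is applied to each draw before the quantiles are taken, the resulting interval automatically respects the range of the quantity of interest, here (−1, 1).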
Strictly speaking, point-wise confidence intervals for smooth components are not adequate for
variable selection, although they are often used in practice for this purpose (e.g., Ruppert et al.,
2003). As shown by Wood (2013), in the context of extended generalized additive models,
and discussed by Marra (2013), for semiparametric bivariate probit models, $\hat{s}_{vk_v} \mathrel{\dot\sim} N(s_{vk_v}, V_{s_{vk_v}})$, where $s_{vk_v} = [s_{vk_v}(z_{vk_v 1}), s_{vk_v}(z_{vk_v 2}), \ldots, s_{vk_v}(z_{vk_v n})]^T$, $V_{s_{vk_v}} = B_{vk_v}(z_{vk_v}) V_{\delta_{vk_v}} B_{vk_v}(z_{vk_v})^T$ and $B_{vk_v}(z_{vk_v})$ denotes a full column rank matrix such that $\hat{s}_{vk_v} = B_{vk_v}(z_{vk_v}) \hat{\beta}_{vk_v}$. It is then possible to obtain approximate p-values, for testing the hypothesis $s_{vk_v} = 0$, based on
$$T_{r_{vk_v}} = \hat{s}_{vk_v}^T V_{s_{vk_v}}^{r_{vk_v}-} \hat{s}_{vk_v} \mathrel{\dot\sim} \chi^2_{r_{vk_v}},$$
where $V_{s_{vk_v}}^{r_{vk_v}-}$ is the rank $r_{vk_v}$ Moore-Penrose pseudo-inverse of $V_{s_{vk_v}}$, which can deal with rank deficiencies. Parameter $r_{vk_v}$ is selected using the notion of edf used in (5). Because edf is not an integer, it can be rounded as follows (Wood, 2013)
$$r_{vk_v} = \begin{cases} \mathrm{floor}(\mathrm{edf}_{vk_v}) & \text{if } \mathrm{edf}_{vk_v} < \mathrm{floor}(\mathrm{edf}_{vk_v}) + 0.05 \\ \mathrm{floor}(\mathrm{edf}_{vk_v}) + 1 & \text{otherwise} \end{cases},$$
which proved effective in semiparametric bivariate probit models (Marra, 2013). Alternatively,
variable selection can be achieved adopting a single penalty shrinkage approach as described in
Marra & Radice (2011a) and Marra & Wood (2011).
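The rounding rule and the rank-r Wald-type statistic described above can be sketched in base R. The sketch assumes that the rank-r pseudo-inverse is formed from the r leading eigenpairs of the covariance matrix; the basis matrix `B`, the coefficients `beta_hat` and their covariance `V_beta` are invented purely for illustration, and `round_edf` and `smooth_pvalue` are hypothetical helper names.

```r
# Rounding rule for the edf (Wood, 2013)
round_edf <- function(edf) {
  f <- floor(edf)
  if (edf < f + 0.05) f else f + 1
}

# Wald-type statistic based on a rank-r pseudo-inverse of V_s
smooth_pvalue <- function(s_hat, V_s, edf) {
  r <- round_edf(edf)
  eg <- eigen(V_s, symmetric = TRUE)
  U <- eg$vectors[, seq_len(r), drop = FALSE]   # r leading eigenvectors
  d <- eg$values[seq_len(r)]                    # r leading eigenvalues
  V_pinv <- U %*% (t(U) / d)                    # U diag(1/d) t(U)
  Tr <- drop(t(s_hat) %*% V_pinv %*% s_hat)
  c(statistic = Tr, rank = r,
    p.value = pchisq(Tr, df = r, lower.tail = FALSE))
}

# Hypothetical smooth: 4 basis functions evaluated at 50 covariate values
B <- cbind(1, poly(seq(0, 1, length.out = 50), 3))
beta_hat <- c(0.1, 0.8, -0.3, 0.05)
V_beta <- diag(0.01, 4)
s_hat <- drop(B %*% beta_hat)
V_s <- B %*% V_beta %*% t(B)
res <- smooth_pvalue(s_hat, V_s, edf = 3.2)
```

In this toy setting the basis has full column rank and r equals the number of coefficients, so the statistic reduces to the usual Wald statistic for the coefficient vector.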
2.6 Asymptotic considerations
In this section, we present some argumentation related to the asymptotic behavior of the penalized maximum likelihood estimator defined as
$$\hat{\delta} = \arg\max_{\delta} \ell_p(\delta),$$
where $\ell_p(\delta)$ is given in (4), and the behavior of the ensuing SATE estimator constructed in Section 2.2. We consider the situation in which the spline bases approximating the smooth components $\{b_{vk_v j},\ j = 1, \ldots, J_{vk_v},\ k_v = 1, \ldots, K_v,\ v = 1, 2\}$ are of a fixed dimension, i.e. the $J_{vk_v}$ are fixed.
Note that the unknown smooth functions {svkv , kv = 1, . . . , Kv , v = 1, 2} may not have an exact representation as linear combinations of given basis functions and consequently the unknown
functions and parameters may not be asymptotically identified by their estimators as the sample
size grows. However, the case of fixed basis dimensions is of relevance as in practice these have
to be fixed. In this scenario, the method provides estimates which tend in probability to quantities
best approximating the unknown functions and parameters in terms of Kullback-Leibler measure.
(Recall that the Kullback-Leibler distance between two density functions f and g is defined as $KL(f\|g) = \int_{-\infty}^{\infty} f \log(f/g)$ if f is absolutely continuous with respect to g and $\infty$ otherwise.)
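To make the notion of a best Kullback-Leibler approximation concrete, the following base R sketch evaluates the distance by numerical integration. The densities are chosen purely for illustration (they are not related to the model of this paper) and `kl_div` is a hypothetical helper. The second part approximates an Exp(1) density by a N(μ, 1) density: the minimizing μ coincides with the mean of the true distribution, which mirrors the role played by the minimizer of the distance in a misspecified model.

```r
# KL(f || g) by numerical integration; logf and logg are log-density functions
kl_div <- function(logf, logg, lower = -Inf, upper = Inf) {
  integrand <- function(x) exp(logf(x)) * (logf(x) - logg(x))
  integrate(integrand, lower, upper)$value
}

# Check against the closed form for N(0, 1) versus N(1, 2^2)
kl_normals <- kl_div(
  function(x) dnorm(x, 0, 1, log = TRUE),
  function(x) dnorm(x, 1, 2, log = TRUE)
)
closed_form <- log(2) + (1 + 1) / (2 * 4) - 0.5

# Best N(mu, 1) approximation to an Exp(1) density
kl_to_normal <- function(mu) {
  kl_div(
    function(x) dexp(x, rate = 1, log = TRUE),
    function(x) dnorm(x, mu, 1, log = TRUE),
    lower = 0
  )
}
best <- optimize(kl_to_normal, interval = c(-1, 3))
```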
Let $L_t$ be the likelihood function for the true model which, in our case, contains the true smooth functions appearing in the linear predictors $\eta_1$ and $\eta_2$ given in (1) and (2) and the true value of the dependence parameter $\theta^*$ of a given copula, and let $\ell_t$ be the corresponding log-likelihood. Then the Kullback-Leibler distance between the likelihood $L_t$ in the true model and the likelihood $L(\delta)$ in the model where $\sum_{k_1=1}^{K_1} s_{1k_1}(z_{1k_1 i})$ and $\sum_{k_2=1}^{K_2} s_{2k_2}(z_{2k_2 i})$ are replaced with their spline approximations $B_{1i}^T \beta_1$ and $B_{2i}^T \beta_2$ is equal to
$$KL(L_t \| L(\delta)) = E\big[\ell_t - \ell(\delta)\big],$$
where the expectation is taken with respect to the true model distribution and $\delta = (\delta_1^T, \delta_2^T, \theta^*)^T$.
Define parameter vector $\delta^0 = (\delta_1^{0T}, \delta_2^{0T}, \theta^{*0})^T$ as the minimizer of the above distance, that is
$$\delta^0 = \arg\min_{\delta} KL\big(L_t \| L(\delta)\big).$$
It follows that $\delta^0$ is the maximizer of the expected unpenalized log-likelihood $\ell(\cdot)$ and as a consequence $E g(\delta^0) = 0$. Recall that $g(\delta)$ and $H(\delta)$ denote the gradient vector and Hessian matrix of $\ell(\cdot)$ calculated at a point δ and let $g_p(\delta) = g(\delta) - \tilde{S}_\lambda \delta$ and $H_p(\delta) = H(\delta) - \tilde{S}_\lambda$ be their penalized versions. Below, we define classic conditions related to the score vector, Hessian and Fisher information matrix as well as the penalty matrix (see, e.g., Kauermann (2005) who used similar assumptions in the context of survival models). The assumptions are
(A1) $g(\delta^0) = O_P(n^{1/2})$,
(A2) $E H(\delta^0) = O(n)$,
(A3) $H(\delta^0) - E H(\delta^0) = O_P(n^{1/2})$,
(A4) $\tilde{S}_\lambda = o(n^{1/2})$, where $\tilde{S}_\lambda$ is defined in Section 2.3.
Conditions (A1) and (A3) are the usual assumptions of $n^{1/2}$ asymptotics (e.g., Barndorff-Nielsen & Cox, 1989). Note that, as the n observations are assumed to be independent, $g(\delta^0)$ and $H(\delta^0)$ are made up of sums of independent random variables. Assumptions (A1) and (A3) imply that, given the model, the average values $\frac{1}{n} g(\delta^0)$ and $\frac{1}{n} H(\delta^0)$ over the random sample converge in probability to their expected values at the rate $n^{-1/2}$. Condition (A4) can be equivalently formulated as $\lambda_{vk_v} = o(n^{1/2})$ for $k_v = 1, \ldots, K_v$, $v = 1, 2$, assuming that the matrices $S_{vk_v}$ are asymptotically bounded.
Theorem 1. Under conditions (A1)-(A4) we have
$$\hat{\delta} - \delta^0 = O_P(n^{-1/2}) \quad \text{as } n \to \infty.$$
Let $\mathrm{SATE}^0$ be equal to
$$\mathrm{SATE}^0 = \frac{1}{n} \sum_{i=1}^{n} \big[ P^0(y_{2i} = 1 | y_{1i} = 1) - P^0(y_{2i} = 1 | y_{1i} = 0) \big],$$
where
$$P^0(y_{2i} = 1 | y_{1i} = 1) = \frac{C\big(\Phi(\eta_{1i}^0), \Phi(\eta_{2i}^{0(y_{1i}=1)}); \theta^0\big)}{\Phi(\eta_{1i}^0)}, \qquad P^0(y_{2i} = 1 | y_{1i} = 0) = \frac{\Phi(\eta_{2i}^{0(y_{1i}=0)}) - C\big(\Phi(\eta_{1i}^0), \Phi(\eta_{2i}^{0(y_{1i}=0)}); \theta^0\big)}{1 - \Phi(\eta_{1i}^0)}, \tag{6}$$
$\eta_{1i}^0 = X_{1i}^T \delta_1^0$ and $\eta_{2i}^0 = X_{2i}^T \delta_2^0$. In order to prove consistency for the estimator of the SATE we introduce the additional assumption that
(A5) the probabilities (6) are differentiable as functions of δ and their gradients are bounded in the neighborhood of $\delta^0$, uniformly for all $x_i = (X_{1i}^T, X_{2i}^T)^T$.
Theorem 2. If conditions (A1)-(A5) hold then
$$\widehat{\mathrm{SATE}} - \mathrm{SATE}^0 = O_P(n^{-1/2}) \quad \text{as } n \to \infty,$$
where $\widehat{\mathrm{SATE}} = \mathrm{SATE}(\hat{\delta}, X)$ as defined in Section 2.2.
Proof of Theorem 1. We show that the following approximation holds
$$\hat{\delta} - \delta^0 \approx \big(-E H(\delta^0) + \tilde{S}_\lambda\big)^{-1} \big(g(\delta^0) - \tilde{S}_\lambda \delta^0\big), \tag{7}$$
which implies the asymptotic consistency of $\hat{\delta}$ at the rate $n^{-1/2}$. We adopt the argumentation used in the theory of maximum likelihood estimation (e.g., McCullagh, 1987) which involves a Taylor expansion of the score in the neighborhood of $\delta^0$. A similar approach was used by Kauermann (2005) and Kauermann et al. (2009) in the context of penalized spline smoothing. For simplicity of notation, we omit all terms of order higher than 1 and assume that higher order derivatives of the log-likelihood behave in a similar manner as those defined in (A1)-(A3).
The first-order Taylor expansion of $g_p(\hat{\delta})$ around $\delta^0$ implies
$$g_p(\hat{\delta}) = g_p(\delta^0) + H_p(\delta^0)(\hat{\delta} - \delta^0) + \text{(higher order terms)},$$
which, after using the fact that $g_p(\hat{\delta}) = 0$ and inverting the above series (e.g., Barndorff-Nielsen & Cox, 1989), leads to
$$\hat{\delta} - \delta^0 = -H_p(\delta^0)^{-1} \big(g(\delta^0) - \tilde{S}_\lambda \delta^0\big) + \ldots$$
We then decompose $H_p(\delta^0)$ as
$$H_p(\delta^0) = H(\delta^0) - E H(\delta^0) + E H(\delta^0) - \tilde{S}_\lambda = R - F(\lambda),$$
where $R = H(\delta^0) - E H(\delta^0)$ represents a stochastic error and $F(\lambda) = -E H(\delta^0) + \tilde{S}_\lambda$ is the penalized Fisher information matrix. Now, let $f(\cdot) = (\cdot - F(\lambda))^{-1}$ be an auxiliary function of a matrix argument. Using a Taylor expansion of $f(R)$ around $f(0)$, we obtain the following representation of $H_p(\delta^0)^{-1}$
$$H_p(\delta^0)^{-1} = -F(\lambda)^{-1} - F(\lambda)^{-1} R \big(F(\lambda)^{-1}\big)^T + \ldots$$
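Assumptions (A2)-(A4) imply
$$H_p(\delta^0)^{-1} = -F(\lambda)^{-1} \big(I + R F(\lambda)^{-1}\big) + \ldots = -F(\lambda)^{-1} \big(I + O_P(n^{-1/2})\big),$$
where I is an identity matrix. Thus
$$\hat{\delta} - \delta^0 = F(\lambda)^{-1} \big(g(\delta^0) - \tilde{S}_\lambda \delta^0\big) \big(I + o_P(1)\big) + \ldots, \tag{8}$$
which proves (7) and hence
$$\hat{\delta} - \delta^0 = O_P(n^{-1/2}) \quad \text{as } n \to \infty. \tag{9}$$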
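The order bookkeeping behind the last step can be spelled out as follows. This is a sketch that additionally assumes (an assumption not stated explicitly above) that $n^{-1} F(\lambda)$ has a nonsingular limit, so that $F(\lambda)^{-1} = O(n^{-1})$.

$$
\begin{aligned}
F(\lambda) &= -E H(\delta^0) + \tilde{S}_\lambda = O(n) + o(n^{1/2}) = O(n) && \text{by (A2), (A4)},\\
R\,F(\lambda)^{-1} &= O_P(n^{1/2})\, O(n^{-1}) = O_P(n^{-1/2}) && \text{by (A3)},\\
g(\delta^0) - \tilde{S}_\lambda \delta^0 &= O_P(n^{1/2}) + o(n^{1/2}) = O_P(n^{1/2}) && \text{by (A1), (A4)},\\
F(\lambda)^{-1}\big(g(\delta^0) - \tilde{S}_\lambda \delta^0\big) &= O(n^{-1})\, O_P(n^{1/2}) = O_P(n^{-1/2}).
\end{aligned}
$$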
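Remark 1. (a) From approximation (8), the asymptotic bias and covariance matrix of $\hat{\delta}$ can be derived. Specifically,
$$\mathrm{bias}(\hat{\delta}) = E\big(\hat{\delta} - \delta^0\big) \approx -F(\lambda)^{-1} \tilde{S}_\lambda \delta^0,$$
where the property $E g(\delta^0) = 0$ of $\delta^0$ has been used, and
$$\mathrm{Cov}(\hat{\delta}) \approx -F(\lambda)^{-1} E H(\delta^0) F(\lambda)^{-1}, \tag{10}$$
which follows from the fact that $\mathrm{Cov}(g(\delta^0)) = -E H(\delta^0)$. In addition, conditions (A2) and (A4) imply that
$$\mathrm{bias}(\hat{\delta}) = o(n^{-1/2}) \quad \text{and} \quad \mathrm{Cov}(\hat{\delta}) = O(n^{-1}).$$
(b) Assumption (A4) implies that
$$\sqrt{n}\, \mathrm{Cov}(\hat{\delta}) \approx \left( \frac{1}{\sqrt{n}} E\big[-H(\delta^0)\big] \right)^{-1} \quad \text{and} \quad \sqrt{n}\, V_\delta \approx \left( -\frac{1}{\sqrt{n}} H(\delta^0) \right)^{-1},$$
where $V_\delta = -H_p^{-1}$ is the Bayesian approximation of the covariance matrix of $\hat{\delta}$ mentioned in Section 2.5. Thus, the frequentist asymptotic approximation (10) and the Bayesian result become equivalent as the sample size n grows to ∞.
(c) As $g(\delta^0)$ is a sum of i.i.d. components, it follows that $\big(-E H(\delta^0)\big)^{-1/2} g(\delta^0) \xrightarrow{d} N(0, I)$. Hence, approximation (8) also implies asymptotic normality of the normalized estimator $\hat{\delta}$.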
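Proof of Theorem 2. Recall that $\widehat{\mathrm{SATE}} = \mathrm{SATE}(\hat{\delta}, X)$ where $\mathrm{SATE}(\delta, X)$ can be expressed as $\frac{1}{n} \sum_{i=1}^{n} \mathrm{sate}(\delta, x_i)$, with $\mathrm{sate}(\delta, x_i)$ determined by
$$\frac{C\big(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=1)}); \theta\big)}{\Phi(\eta_{1i})} - \frac{\Phi(\eta_{2i}^{(y_{1i}=0)}) - C\big(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=0)}); \theta\big)}{1 - \Phi(\eta_{1i})}.$$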
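The mean value theorem yields
$$\mathrm{sate}(\hat{\delta}, x_i) = \mathrm{sate}(\delta^0, x_i) + \frac{\partial}{\partial \delta} \mathrm{sate}(\tilde{\delta}, x_i)^T (\hat{\delta} - \delta^0),$$
for some $\tilde{\delta} = (1 - c)\delta^0 + c\hat{\delta}$, $c \in (0, 1)$, where $\frac{\partial}{\partial \delta} \mathrm{sate}(\tilde{\delta}, x_i)$ is the gradient vector of $\mathrm{sate}(\cdot, x_i)$ expressed as a function of δ calculated at a point $\delta = \tilde{\delta}$, for $i = 1, \ldots, n$. Thus,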
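$$\widehat{\mathrm{SATE}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sate}(\delta^0, x_i) + \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \delta} \mathrm{sate}(\tilde{\delta}, x_i)^T (\hat{\delta} - \delta^0) = \mathrm{SATE}^0 + \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \delta} \mathrm{sate}(\tilde{\delta}, x_i)^T (\hat{\delta} - \delta^0). \tag{11}$$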
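As for the second term in (11), the Schwarz inequality implies
$$\left| \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \delta} \mathrm{sate}(\tilde{\delta}, x_i)^T (\hat{\delta} - \delta^0) \right| \leq \frac{1}{n} \sum_{i=1}^{n} \left\| \frac{\partial}{\partial \delta} \mathrm{sate}(\tilde{\delta}, x_i) \right\| \, \|\hat{\delta} - \delta^0\|.$$
Given the assumption that $\frac{\partial}{\partial \delta} \mathrm{sate}(\cdot, x_i)$ is bounded in the neighborhood of $\delta^0$ uniformly for all $x_i$ and that $\|\hat{\delta} - \delta^0\| = O_P(n^{-1/2})$ proved in (9), the assertion follows.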
Remark 2. Expression (10) for the asymptotic covariance matrix of $\hat{\delta}$ can be used to construct the asymptotic variance of $\widehat{\mathrm{SATE}}$ using the delta method, namely
$$\mathrm{Var}\, \widehat{\mathrm{SATE}} \approx -\left( \frac{\partial \mathrm{SATE}(\delta^0, X)}{\partial \delta} \right)^T F(\lambda)^{-1} E H(\delta^0) F(\lambda)^{-1} \frac{\partial \mathrm{SATE}(\delta^0, X)}{\partial \delta},$$
which is equivalent in the limit to expression (3) given in Section 2.2, as motivated in Remark 1(b). Moreover, it follows from the delta method and Remark 1(c) that the normalized estimator $\widehat{\mathrm{SATE}}$ is asymptotically normal.
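A minimal base R sketch of the SATE and of a delta-method interval is given below. Several things are assumptions made only for illustration: the Joe copula is used because it has a closed form, the transformation θ = 1 + exp(θ∗) is assumed as one way of keeping θ above 1, the design (an intercept and one covariate `x` per equation), the values in `delta_hat` and the covariance `V_hat` are invented, and the gradient is obtained by central differences rather than from the analytical expressions in Appendix A.

```r
# Joe copula C(u, v; theta), theta >= 1 (theta = 1 gives independence)
joe_cop <- function(u, v, theta) {
  a <- (1 - u)^theta
  b <- (1 - v)^theta
  1 - (a + b - a * b)^(1 / theta)
}

# SATE for given linear predictors:
#   eta1   treatment-equation predictor
#   eta2_1 outcome-equation predictor with the treatment set to 1
#   eta2_0 outcome-equation predictor with the treatment set to 0
sate <- function(eta1, eta2_1, eta2_0, theta, cop = joe_cop) {
  p1 <- pnorm(eta1)
  pr_treated <- cop(p1, pnorm(eta2_1), theta) / p1
  pr_untreated <- (pnorm(eta2_0) - cop(p1, pnorm(eta2_0), theta)) / (1 - p1)
  mean(pr_treated - pr_untreated)
}

# Hypothetical design and parameter values
set.seed(3)
x <- rnorm(200)
sate_delta <- function(delta) {
  eta1 <- delta[1] + delta[2] * x
  eta2_0 <- delta[3] + delta[5] * x
  eta2_1 <- eta2_0 + delta[4]        # delta[4] is the treatment coefficient
  theta <- 1 + exp(delta[6])         # assumed transformation of theta*
  sate(eta1, eta2_1, eta2_0, theta)
}

# Central-difference gradient of a scalar function
num_grad <- function(f, d, h = 1e-6) {
  sapply(seq_along(d), function(j) {
    e <- replace(numeric(length(d)), j, h)
    (f(d + e) - f(d - e)) / (2 * h)
  })
}

delta_hat <- c(0.2, 0.5, -0.8, 0.3, 0.4, 0)
V_hat <- diag(0.01, 6)               # stand-in covariance matrix of delta_hat
grad <- num_grad(sate_delta, delta_hat)
est <- sate_delta(delta_hat)
se <- sqrt(drop(t(grad) %*% V_hat %*% grad))
ci <- est + c(-1, 1) * qnorm(0.975) * se
```

Under independence (θ = 1) the two conditional probabilities reduce to the marginal outcome probabilities, which gives a simple sanity check of `sate`.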
3 Analysis of health care utilization data
The analysis presented in this section was performed in the R environment (R Development Core Team,
2013) using the package SemiParBIVProbit which implements the methods discussed in this
article (Marra & Radice, 2013).
3.1 Data
We used data from the 2008 Medical Expenditure Panel Survey (MEPS; http://www.meps.ahrq.gov/)
which provides a nationally representative sample that includes information on demographics, individual health status, health care utilization and private health insurance coverage. We excluded
individuals younger than 18 years old given their different overall health profiles and expected
usage patterns as compared to older individuals. Those 65 and older were also excluded since
the availability of Medicare obviates the primary insurance decision for almost all US citizens.
Individuals that did not have a complete set of socioeconomic and demographic control variables
were also excluded from the sample (e.g., missing values for education or income). After exclusions, the final dataset contains 13132 observations. Table 3 summarizes the variables used in the
analysis. The choice of these variables was motivated largely by the findings reported in previous
related studies (e.g., Shane & Trivedi, 2012, and references therein).
Private health insurance status, which is an important determinant of the use of health care
services, is a potentially endogenous variable. This is because unobserved variables, such as
allergies and risk aversion, are likely to influence both health service utilization and the private insurance decision. Sometimes the effect of private health insurance can be interpreted as adverse
selection or moral hazard (e.g., Buchmueller et al., 2005). Adverse selection occurs when individuals with a greater demand for medical care, because of poor health for instance, are expected to
have a greater demand for insurance. Moral hazard refers to the tendency of people to be more
inclined to seek health services, and doctors to be more inclined to recommend them when all
costs are covered. The problem of residual confounding arises when mis-modeling the effect of
observed confounders (e.g., Benedetti & Abrahamowicz, 2004). For instance, the impact of age
and education might exhibit a non-linear behavior which, if not accounted for, will enter the unstructured term of the model and hence induce bias in the estimated effect of insurance coverage
on health care utilization. The assumption of linear dependence between private health insurance
and health care utilization may also be restrictive if there exists an asymmetric tail dependence
between them (Winkelmann, 2012). These issues are addressed using the method presented in
Section 2.

Variable                     Definition
Outcome
  visits.hosp                =1 at least one visit to hospital outpatient departments
Treatment
  private                    =1 private health insurance
Demographic-socioeconomic
  age                        age in years
  gender                     =1 male
  race                       =1 white, =2 black, =3 native American, =4 others
  education                  years of education
  income                     income (000's)
  region                     =1 northeast, =2 midwest, =3 south, =4 west
Health-related
  health                     =1 excellent, =2 very good, =3 good, =4 fair, =5 poor
  bmi                        body mass index
  diabetes                   =1 diabetic
  hypertension               =1 hypertensive
  hyperlipidemia             =1 hyperlipidemic
  limitation                 =1 health limits physical activity
Table 3: Description of outcome and treatment variables, and observed confounders.
3.2 Models
Following previous work on the subject (e.g., Holly et al., 1998; Fabbri & Monfardini, 2003; Shane & Trivedi,
2012; Srivastava & Zhao, 2008), the equations for private health insurance and health care utilization were specified, in R notation, as
    treat.eq <- private ~ as.factor(health) + as.factor(race) +
      as.factor(region) + limitation + gender + diabetes +
      hypertension + hyperlipidemia + self + s(bmi) + s(income) +
      s(age) + s(education)

    out.eq <- visits.hosp ~ private + as.factor(health) + as.factor(race) +
      as.factor(region) + limitation + gender + diabetes +
      hypertension + hyperlipidemia + self + s(bmi) + s(income) +
      s(age) + s(education)
where as.factor coerces its argument to a factor and the s() are the unknown smooth functions described in Section 2.1.1. The smooth components were represented using penalized thin
plate regression splines with basis dimensions equal to 20 and penalties based on second order
derivatives (Wood, 2006). In cross-sectional studies, 20 bases typically suffice to represent well
smooth functions, although sensitivity analysis using more spline bases is advisable when the estimated degrees of freedom of the smooth components are close to the number of bases used.
The non-linear specification for bmi, income, age and education arises from the fact that these
covariates embody productivity and life-cycle effects that are likely to affect P(private = 1)
and P(visits.hosp = 1) non-linearly. In fact, in related studies, Holly et al. (1998) and
Fabbri & Monfardini (2003) considered a model for health care utilization that contains linear and
quadratic terms in bmi, income, age and education whereas Marra & Radice (2011b) specified a
model containing smooth functions of them. Considering all copulas discussed in Section 2.1,
including the case in which the outcome equation is estimated alone (this will be referred to as Independent) and Student-t with 3, 6, 9 and 12 degrees of freedom, we fitted 19 models. Based on the
AIC the chosen models were those which capture strong upper tail dependences, i.e. Clayton180 ,
Gumbel0 and Joe0 . After applying the Vuong test to these three models, it emerged that it is not
possible to discriminate among them. However, the Clarke test showed a preference for Joe0 .
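The two non-nested tests mentioned above can be sketched in base R from per-observation log-likelihood contributions. This is a simplified illustration and not the implementation used for the analysis: it omits any correction for differing numbers of parameters, uses the sample standard deviation for the Vuong statistic, and is applied to simulated toy log-likelihoods (`ll1`, `ll2`) rather than to the copula models of this section. `vuong_clarke` is a hypothetical helper name.

```r
# Vuong (z-test on the mean log-likelihood difference) and
# Clarke (sign test on the individual differences) statistics
vuong_clarke <- function(ll1, ll2) {
  m <- ll1 - ll2                     # individual log-likelihood differences
  n <- length(m)
  vuong_z <- sqrt(n) * mean(m) / sd(m)
  b <- sum(m > 0)                    # observations favouring model 1
  list(
    vuong_z = vuong_z,
    vuong_p = 2 * pnorm(-abs(vuong_z)),
    clarke_b = b,
    clarke_p = binom.test(b, n, p = 0.5)$p.value
  )
}

# Toy example: data generated from model 1
set.seed(4)
y <- rnorm(500)
ll1 <- dnorm(y, mean = 0, sd = 1, log = TRUE)
ll2 <- dnorm(y, mean = 0.5, sd = 1, log = TRUE)
res <- vuong_clarke(ll1, ll2)
```

A positive `vuong_z` and a `clarke_b` above n/2 both point towards the first model; the two tests can disagree, as happened for the three preferred copulas here.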
3.3 Empirical results
3.3.1 Measure of dependence
We start off by commenting on the results for the Kendall’s τ of all models fitted (see Table 4). This
measure captures the association between the unobserved confounders present in the unstructured
term of the model, after controlling for observed confounders. The models without AIC support
which can account for asymmetric dependence indicate either an absence or a negative association
between the two equations, whereas those which can account for symmetric dependence suggest
a moderate association. Interestingly, the small yet significant τ , estimated using the preferred
models (Joe0 , Gumbel0 and Clayton180 ), indicate that there exists some association between the
unstructured terms of the model equations for private health insurance and hospital utilization
which is most likely due to the presence of unobserved confounders.
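For reference, Kendall's τ has a closed form in terms of the dependence parameter for several of the copulas considered. The base R sketch below covers only the unrotated Gaussian, Clayton and Gumbel cases using their standard formulas; `kendall_tau` is a hypothetical helper and not a function of the package used for the analysis.

```r
# Kendall's tau as a function of the copula parameter theta
kendall_tau <- function(theta, copula = c("gaussian", "clayton", "gumbel")) {
  copula <- match.arg(copula)
  switch(copula,
    gaussian = 2 / pi * asin(theta),   # theta in (-1, 1)
    clayton  = theta / (theta + 2),    # theta > 0
    gumbel   = 1 - 1 / theta           # theta >= 1
  )
}

# A Gaussian copula with tau = 0.20 corresponds to theta = sin(pi * 0.20 / 2)
theta_gauss <- sin(pi * 0.20 / 2)
tau_gauss <- kendall_tau(theta_gauss, "gaussian")
```

Because τ is a monotone function of θ in each case, simulated values of θ can be mapped to τ one by one when building the intervals described in Section 2.5.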
3.3.2 SATE of private health insurance
The estimated SATE (in %) and confidence interval (CI) for all fitted copula models are reported
in Table 4. The Table also reports the estimated SATE for the case in which the unobserved
confounding issue is not taken into account (Independent). Several points are worth noting.
• The models which received the most support from the AIC (Joe0 , Gumbel0 and Clayton180 ,
which account for strong positive upper tail dependence) show similar point estimates with
overlapping CIs.
• The models without AIC support (which can account for negative tail dependence and
positive lower tail dependence) exhibit point estimates which are systematically different
from those obtained from the models with AIC support. Specifically, the models which
suggest absence of association (τ̂ = 0) consistently show similar point estimates as that
of Independent, whereas the models which indicate negative association show implausible
negative estimated effects of the SATE.
• If the presence of unobserved confounders is not accounted for then the point estimate is
smaller than that obtained using the chosen models which can control for this issue.

Copula         Estimated SATE (95% CIs)    τ̂ (95% CIs)
Independent    4.29 (2.98,5.60)
Gaussian       5.34 (3.85,6.83)            0.20 (0.07,0.33)
Student-t3     5.68 (4.24,7.13)            0.46 (0.37,0.54)
Student-t6     5.51 (4.01,7.01)            0.35 (0.24,0.45)
Student-t9     5.49 (3.99,6.99)            0.32 (0.20,0.42)
Student-t12    5.45 (3.94,6.95)            0.29 (0.17,0.40)
Frank          4.82 (3.39,6.25)            0.21 (0.04,0.35)
Clayton0       4.25 (2.89,5.61)            0.00 (0.00,0.00)
Clayton90      4.25 (2.89,5.61)            0.00 (-1.00,0.00)
Clayton180     5.71 (4.24,7.18)            0.13 (0.07,0.23)
Clayton270     -5.75 (-7.38,-4.13)         -0.14 (-0.45,-0.03)
Gumbel0        5.63 (4.15,7.12)            0.19 (0.11,0.31)
Gumbel90       4.25 (2.89,5.61)            0.00 (0.00,0.00)
Gumbel180      4.25 (2.89,5.61)            0.00 (0.00,0.00)
Gumbel270      4.25 (2.89,5.61)            0.00 (0.00,0.00)
Joe0           5.68 (4.21,7.15)            0.14 (0.08,0.23)
Joe90          -10.10 (-11.97,-8.23)       -0.15 (-0.45,-0.04)
Joe180         4.25 (2.89,5.61)            0.00 (0.00,0.00)
Joe270         4.25 (2.89,5.61)            0.00 (0.00,0.00)
Table 4: Estimated SATE (in %) and τ obtained using different copula models for the 2008 MEPS data. 95% confidence intervals for the SATE have been obtained using the delta method detailed in Section 2.2, and those for τ using Bayesian posterior simulation as described in Section 2.5.

• As the number of degrees of freedom for the Student-t copula increases, the estimated SATE
obviously gets closer to that obtained with a Gaussian copula.
• Using the Gaussian copula the estimated SATE is 5.3% whereas that obtained using Joe0
is 5.7%. This mild increase in the effect of private health insurance is difficult to judge as
it may or may not have an impact on policy decisions regarding, for instance, recognizing
the important role of private health insurance in financing hospitals. But such speculation is
beyond the scope of this paper.
The CIs for the SATE reported in Table 4 were obtained using the delta method (Section 2.2).
Marginally wider intervals were obtained using the Bayesian posterior simulation approach described in Section 2.5. A small simulation study revealed that the simulation approach and the
delta method can yield intervals with empirical coverage probabilities which are nearly identical
and very close to nominal coverages. The study also revealed that intervals for Kendall’s τ obtained by posterior simulation have close to nominal coverages and that the true SATE can be
recovered well. R code for the simulation experiment and results have not been reported here to
save space and are available upon request.
3.3.3 Parametric components
We report the estimated effects for the Joe0 copula model. Similar results were obtained using
Gumbel0 and Clayton180 which also account for strong positive upper tail dependence. These are
available upon request.

                  Treatment Eq.                       Outcome Eq.
Variable          Parameter estimate   Std. error     Parameter estimate   Std. error
gender            -0.08                0.03           -0.42                0.03
race=2            0.03                 0.04           -0.10                0.04
race=3            -0.17                0.13           -0.18                0.16
race=4            0.10                 0.05           -0.15                0.06
region=2          0.24                 0.05           0.22                 0.05
region=3          0.06                 0.04           -0.17                0.04
region=4          0.01                 0.04           -0.25                0.05
health=2          -0.05                0.03           0.13                 0.04
health=3          -0.11                0.04           0.17                 0.04
health=4          -0.25                0.05           0.27                 0.06
health=5          -0.20                0.11           0.50                 0.11
diabetes          0.08                 0.06           0.13                 0.06
hypertension      0.12                 0.04           0.09                 0.04
hyperlipidemia    0.15                 0.04           0.24                 0.04
limitation        -0.11                0.07           -0.49                0.06
Table 5: Estimated coefficients and standard errors of the parametric components from the Joe0 copula model.

Most of these effects have the expected signs. Females are slightly more likely than males to purchase insurance and to be hospitalized. This may be explained by a higher demand for medical
services among women during their reproductive years (e.g., Sindelar, 1982). As for race, there
is not a significant difference between whites and nonwhites in terms of purchasing private health
insurance but there is some difference in terms of being hospitalized; nonwhites seem to be less
likely to use health care services. This is consistent with the findings by Shane & Trivedi (2012).
Regarding the region variable, residents of the Midwest are more likely to have a private insurance
and to use health care services as compared to those of the Northeast. Individuals’ evaluation of
their health states is a potential predictor of health care utilization. Those who are in good health
are less likely to access health care services. In the same vein, those who expect themselves to
be in good health have little to gain from insurance while those who are in poor health are more
likely to purchase health insurance. The results for the hospital utilization equation support this
hypothesis indicating that the less healthy individuals are, the more likely they are to be admitted
into hospitals. The positive relationship between self-assessed health and insurance purchase
runs counter to the hypothesis of moral hazard and adverse selection. However, such a finding
is not unusual and has been obtained in several previous studies (see Srivastava & Zhao, 2008, and
references therein). The more objective measures of health status (i.e., diabetes, hypertension and
hyperlipidemia) suggest that medical need is an important determinant of hospital utilization and
insurance purchase.
3.3.4 Non-parametric components
Figures 3 and 4 report the smooth function estimates for the treatment and outcome equations
(and associated Bayesian intervals) when applying the Joe0 copula model on the MEPS data. The
estimated smooth functions obtained using the other copulas (not reported here but available upon
request) were similar.
The effects of bmi, income, age and education in the treatment and outcome equations show
different degrees of nonlinearity. The point-wise confidence intervals of the smooth functions for
bmi in the treatment equation and income in the outcome equation contain the zero line for most
of the covariate value range. The intervals of the smooth for bmi in the outcome equation contain
the zero line for the whole range. This suggests that bmi and income are weak predictors of private health insurance and health care utilization, and that bmi is not an important determinant of
hospital utilization. Similar conclusions can be drawn by looking at the p-values for the smooth
terms reported in the captions of Figures 3 and 4. As for the remaining variables, the estimated
effects have the expected patterns. For example, age is a significant determinant in both equations.
The probability of purchasing a private health insurance is found to increase with age with a slight
drop-off for the 58+ age group. This is suggestive of a higher probability of private health insurance purchase as individuals become older and less likely to stay healthy (e.g., Hopkins & Kiddi,
1996). The probability of using health care services also increases with age. Insurance decision as
well as health care utilization appear to be highly associated with education. Education is likely
to increase individuals’ awareness of health care services and the benefits of purchasing a private health insurance. Higher household income is associated with an increased probability of
purchasing a private health insurance.
It is worth noting that the parametric and non-parametric estimated effects for the outcome
equation reported here can be interpreted in a qualitative way only. The actual effects can be
calculated using simulation or the formulas in Greene (2012) which account for the fact that confounders appearing in the treatment equation have an indirect effect (through the endogenous variable) on the outcome as well as a direct effect because they also appear in the outcome equation.
4 Discussion
In summary, we have presented a model which can allow researchers to obtain consistent estimates of treatment effectiveness in the presence of unobserved and residual confounding, and of
dependencies in the tails of the distribution of treatment and outcome. This model extends previous work by accommodating simultaneously flexible confounder effects as well as non-linear
treatment-outcome dependence. In doing so, we have provided inferential tools for the proposed
framework and presented some argumentation related to the asymptotic behavior of the introduced
penalized maximum likelihood estimator and the ensuing sample average treatment effect. For implementation, we have developed the necessary computational procedures which are available in
the R package SemiParBIVProbit.
[Figure 3 panels: s(bmi,5.99), s(income,8.4), s(age,5.45) and s(education,5.91), each plotted against the corresponding covariate.]
Figure 3: Smooth function estimates and associated 95% point-wise confidence intervals in the treatment equation
obtained applying the Joe0 copula regression spline model on the 2008 MEPS data. The rug plot, at the bottom of
each graph, shows the covariate values. P-values for the smooth terms of bmi, income, age and education are 0.0681,
< 0.000, < 0.000 and < 0.000, respectively.

[Figure 4 panels: s(bmi,1), s(income,2.57), s(age,1.48) and s(education,4.84), each plotted against the corresponding covariate.]
Figure 4: Smooth function estimates and associated 95% point-wise confidence intervals in the outcome equation
obtained applying the Joe0 copula regression spline model on the 2008 MEPS data. P-values for the smooth terms of
bmi, income, age and education are 0.132, 0.097, < 0.000 and < 0.000, respectively.

Using the proposed method, we have examined the effect of private health insurance on health
care utilization using the 2008 MEPS dataset. There is a generally accepted notion that private
health coverage is affected by endogeneity as it is not randomly assigned as in a controlled trial
but rather is the result of individual preferences and health status, such as allergies and risk aversion. Also, the impacts of continuous confounders such as age and education are likely to
be complex since they embody productivity and life-cycle effects that are likely to influence nonlinearly private health insurance and health care utilization. Finally, there may exist dependencies
that appear in the tails of the distribution of both insurance and health care utilization which need
to be modeled. To our knowledge, no studies have examined the impact of private health coverage accounting for endogeneity, non-linear contributions of observed confounders and non-linear
dependence between insurance and health care utilization, partly due to the lack of appropriate
analytical tools to discern these simultaneous effects. By applying the newly developed statistical
framework to the 2008 MEPS data we found that not accounting for the endogeneity problem underestimates the effect of private health insurance, that some of the observed confounder effects
are non-linear, and that the models which received most support from the data are those which
allow for strong positive upper tail dependence. It is worth stressing that the proposed approach
may have a significantly wider appeal. Indeed, many other situations exist in which it would be useful to estimate the effect of a treatment on an outcome using observational data. Examples include
estimating the impact of diabetes on employment, physical activity on obesity, and insurance on
mortality among HIV-infected individuals, to name a few.
Future research directions include the use of various marginal distributions that can be bound
by a copula (e.g., generalized extreme value distributions as in Wang & Dey (2010)), allowing
for different kinds of dependencies in the upper and lower tails by using two-parameter bivariate
linking copulas (e.g., Brechmann & Schepsmeier, 2013), and generalizing the proposed approach
to trivariate system models which control for the endogeneity of the treatment and for non-random
sample selection in the outcome (Srivastava & Zhao, 2008). Using more flexible link functions,
two-parameter linking copulas and adopting a system of three equations is likely to make the
model more accurate and the framework presented here can constitute a building block for these
developments.
Appendix
Appendix A: Derivatives of SATE(δ, X) with respect to δ
The components in ∂SATE(δ, X)/∂δ that are referred to in Section 2.2 are given below.
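$$\frac{\partial \mathrm{SATE}(\delta, X)}{\partial \delta_1} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial}{\partial \delta_1} \left\{ \frac{C\big(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=1)}); \theta\big)}{\Phi(\eta_{1i})} - \frac{\Phi(\eta_{2i}^{(y_{1i}=0)}) - C\big(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=0)}); \theta\big)}{1 - \Phi(\eta_{1i})} \right\} \right], \tag{12}$$
$$\frac{\partial \mathrm{SATE}(\delta, X)}{\partial \delta_2} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial}{\partial \delta_2} \left\{ \frac{C\big(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=1)}); \theta\big)}{\Phi(\eta_{1i})} - \frac{\Phi(\eta_{2i}^{(y_{1i}=0)}) - C\big(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=0)}); \theta\big)}{1 - \Phi(\eta_{1i})} \right\} \right], \tag{13}$$
and
$$\frac{\partial \mathrm{SATE}(\delta, X)}{\partial \theta^*} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial}{\partial \theta^*} \left\{ \frac{C\big(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=1)}); \theta\big)}{\Phi(\eta_{1i})} - \frac{\Phi(\eta_{2i}^{(y_{1i}=0)}) - C\big(\Phi(\eta_{1i}), \Phi(\eta_{2i}^{(y_{1i}=0)}); \theta\big)}{1 - \Phi(\eta_{1i})} \right\} \right]. \tag{14}$$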
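The quantities inside the square brackets of (12), (13) and (14) can be written as
$$\left\{ \left. \frac{h_1 \Phi(\eta_{1i}) - C(\Phi(\eta_{1i}), \Phi(\eta_{2i}); \theta)}{\Phi(\eta_{1i})^2} \right|_{y_{1i}=1} + \left. \frac{h_1 (1 - \Phi(\eta_{1i})) - \big(\Phi(\eta_{2i}) - C(\Phi(\eta_{1i}), \Phi(\eta_{2i}); \theta)\big)}{[1 - \Phi(\eta_{1i})]^2} \right|_{y_{1i}=0} \right\} \phi(\eta_{1i}) X_{1i},$$
$$\left. \frac{h_2 \phi(\eta_{2i}) X_{2i}}{\Phi(\eta_{1i})} \right|_{y_{1i}=1} - \left. \frac{(1 - h_2)\, \phi(\eta_{2i}) X_{2i}}{1 - \Phi(\eta_{1i})} \right|_{y_{1i}=0},$$
and
$$\left. \frac{\frac{\partial C(\Phi(\eta_{1i}), \Phi(\eta_{2i}); \theta)}{\partial \theta} \frac{\partial \theta}{\partial \theta^*}}{\Phi(\eta_{1i})} \right|_{y_{1i}=1} + \left. \frac{\frac{\partial C(\Phi(\eta_{1i}), \Phi(\eta_{2i}); \theta)}{\partial \theta} \frac{\partial \theta}{\partial \theta^*}}{1 - \Phi(\eta_{1i})} \right|_{y_{1i}=0},$$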
where hv = ∂C(Φ(η1i ), Φ(η2i ); θ)/∂Φ(ηvi ), for v = 1, 2, φ(·) is the density function of the standard univariate normal distribution, ∂θ/∂θ∗ can be obtained using the transformations in Table 2,
and all the other quantities are defined in Section 2.
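The first of these expressions can be checked numerically. The base R sketch below does so for a single observation, assuming a Clayton copula (chosen only because C and h1 have closed forms) and a treatment equation with a scalar linear predictor, so that X1i = 1; all numerical values and function names are hypothetical.

```r
# Clayton copula and its partial derivative h1 = dC/du
clayton <- function(u, v, theta) (u^(-theta) + v^(-theta) - 1)^(-1 / theta)
clayton_h1 <- function(u, v, theta) {
  u^(-theta - 1) * (u^(-theta) + v^(-theta) - 1)^(-1 / theta - 1)
}

# Contribution of one observation to the SATE
sate_i <- function(eta1, eta2_1, eta2_0, theta) {
  u <- pnorm(eta1); v1 <- pnorm(eta2_1); v0 <- pnorm(eta2_0)
  clayton(u, v1, theta) / u - (v0 - clayton(u, v0, theta)) / (1 - u)
}

# Analytical derivative with respect to eta1, following the first expression
dsate_deta1 <- function(eta1, eta2_1, eta2_0, theta) {
  u <- pnorm(eta1); v1 <- pnorm(eta2_1); v0 <- pnorm(eta2_0)
  term1 <- (clayton_h1(u, v1, theta) * u - clayton(u, v1, theta)) / u^2
  term0 <- (clayton_h1(u, v0, theta) * (1 - u) -
              (v0 - clayton(u, v0, theta))) / (1 - u)^2
  (term1 + term0) * dnorm(eta1)
}

eta1 <- 0.2; eta2_1 <- 0.5; eta2_0 <- -0.1; theta <- 1.5; h <- 1e-6
numeric_grad <- (sate_i(eta1 + h, eta2_1, eta2_0, theta) -
                   sate_i(eta1 - h, eta2_1, eta2_0, theta)) / (2 * h)
analytic_grad <- dsate_deta1(eta1, eta2_1, eta2_0, theta)
```

Multiplying `analytic_grad` by the covariate vector X1i gives the contribution of the observation to the derivative with respect to δ1.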
Appendix B: Gradient and Hessian
The quantities g and H that are referred to in Section 2.3 are given below.
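Gradient
$$
\begin{aligned}
g_1 = \frac{\partial \ell(\delta)}{\partial \delta_1} = \sum_{i=1}^{n} \Bigg\{ & y_{1i} y_{2i} \frac{1}{P(y_{1i}=1, y_{2i}=1)} \frac{\partial P(y_{1i}=1, y_{2i}=1)}{\partial \delta_1} \\
& + y_{1i}(1 - y_{2i}) \frac{1}{P(y_{1i}=1, y_{2i}=0)} \frac{\partial P(y_{1i}=1, y_{2i}=0)}{\partial \delta_1} \\
& + (1 - y_{1i}) y_{2i} \frac{1}{P(y_{1i}=0, y_{2i}=1)} \frac{\partial P(y_{1i}=0, y_{2i}=1)}{\partial \delta_1} \\
& + (1 - y_{1i})(1 - y_{2i}) \frac{1}{P(y_{1i}=0, y_{2i}=0)} \frac{\partial P(y_{1i}=0, y_{2i}=0)}{\partial \delta_1} \Bigg\},
\end{aligned}
$$
where
$$
\begin{aligned}
\frac{\partial P(y_{1i}=1, y_{2i}=1)}{\partial \delta_1} &= h_1 \phi(\eta_{1i}) X_{1i}, \\
\frac{\partial P(y_{1i}=1, y_{2i}=0)}{\partial \delta_1} &= \phi(\eta_{1i}) X_{1i} - \frac{\partial P(y_{1i}=1, y_{2i}=1)}{\partial \delta_1}, \\
\frac{\partial P(y_{1i}=0, y_{2i}=1)}{\partial \delta_1} &= -\frac{\partial P(y_{1i}=1, y_{2i}=1)}{\partial \delta_1}, \\
\frac{\partial P(y_{1i}=0, y_{2i}=0)}{\partial \delta_1} &= -\frac{\partial P(y_{1i}=1, y_{2i}=0)}{\partial \delta_1}.
\end{aligned}
$$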
n 1
∂`(δ) X
∂P(y1i = 1, y2i = 1)
y1i y2i
=
g2 =
∂δ2
P(y1i = 1, y2i = 1)
∂δ2
i=1
∂P(y1i = 1, y2i = 0)
1
P(y1i = 1, y2i = 0)
∂δ2
∂P(y1i = 0, y2i = 1)
1
+ (1 − y1i )y2i
P(y1i = 0, y2i = 1)
∂δ2
∂P(y1i = 0, y2i = 0)
1
,
+ (1 − y1i )(1 − y2i )
P(y1i = 0, y2i = 0)
∂δ2
+ y1i (1 − y2i )
where
∂P(y1i = 1, y2i
∂δ2
∂P(y1i = 1, y2i
∂δ2
∂P(y1i = 0, y2i
∂δ2
∂P(y1i = 0, y2i
∂δ2
= 1)
= h2 φ(η2i )X2i ,
∂P(y1i = 1, y2i = 1)
,
∂δ2
= 1)
∂P(y1i = 1, y2i = 1)
= φ(η2i )X2i −
,
∂δ2
∂P(y1i = 0, y2i = 1)
= 0)
=−
.
∂δ1
= 0)
g3 =
=−
∂`(δ)
∂`(δ) ∂θ
=
,
∂θ∗
∂θ ∂θ∗
where
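$$
\begin{aligned}
\frac{\partial \ell(\delta)}{\partial \theta} = \sum_{i=1}^{n} \Bigg\{ & y_{1i} y_{2i} \frac{1}{P(y_{1i} = 1, y_{2i} = 1)} \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta} \\
& + y_{1i} (1 - y_{2i}) \frac{1}{P(y_{1i} = 1, y_{2i} = 0)} \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \theta} \\
& + (1 - y_{1i}) y_{2i} \frac{1}{P(y_{1i} = 0, y_{2i} = 1)} \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \theta} \\
& + (1 - y_{1i})(1 - y_{2i}) \frac{1}{P(y_{1i} = 0, y_{2i} = 0)} \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \theta} \Bigg\},
\end{aligned}
$$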
and
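$$
\begin{aligned}
\frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta} &= \frac{\partial C(\Phi(\eta_{1i}), \Phi(\eta_{2i}); \theta)}{\partial \theta}, \\
\frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \theta} &= - \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta}, \\
\frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \theta} &= - \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta}, \\
\frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \theta} &= \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta}.
\end{aligned}
$$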
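The gradient component for δ1 can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a Clayton copula (chosen here for convenience, not necessarily one of the copulas the authors focus on), a small made-up data set, and hypothetical helper names (`joint_probs`, `g1_analytic`, `g1_numeric`). It builds the four joint probabilities from the copula, forms the δ1 gradient from the derivative relations listed above, and compares it with central finite differences of the log-likelihood.

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def clayton(u, v, theta):
    """Clayton copula C(u, v; theta), theta > 0 (illustrative assumption)."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def clayton_h1(u, v, theta):
    """h1 = dC/du for the Clayton copula."""
    a = u ** (-theta) + v ** (-theta) - 1.0
    return u ** (-theta - 1.0) * a ** (-1.0 / theta - 1.0)

def dot(x, d):
    return sum(xi * di for xi, di in zip(x, d))

def joint_probs(eta1, eta2, theta):
    """The four joint probabilities P(y1, y2) implied by the copula."""
    u, v = Phi(eta1), Phi(eta2)
    c = clayton(u, v, theta)
    return {(1, 1): c, (1, 0): u - c, (0, 1): v - c, (0, 0): 1.0 - u - v + c}

def loglik(delta1, delta2, theta, data):
    return sum(
        math.log(joint_probs(dot(x1, delta1), dot(x2, delta2), theta)[(y1, y2)])
        for y1, y2, x1, x2 in data
    )

def g1_analytic(delta1, delta2, theta, data):
    """Gradient with respect to delta1, using the derivative relations above."""
    g = [0.0] * len(delta1)
    for y1, y2, x1, x2 in data:
        eta1, eta2 = dot(x1, delta1), dot(x2, delta2)
        p = joint_probs(eta1, eta2, theta)
        d11 = clayton_h1(Phi(eta1), Phi(eta2), theta) * phi(eta1)  # multiplies X1i
        d10 = phi(eta1) - d11
        dp = {(1, 1): d11, (1, 0): d10, (0, 1): -d11, (0, 0): -d10}
        w = dp[(y1, y2)] / p[(y1, y2)]
        for j, xj in enumerate(x1):
            g[j] += w * xj
    return g

def g1_numeric(delta1, delta2, theta, data, eps=1e-6):
    """Central finite differences of the log-likelihood in delta1."""
    g = []
    for j in range(len(delta1)):
        up, dn = list(delta1), list(delta1)
        up[j] += eps
        dn[j] -= eps
        g.append((loglik(up, delta2, theta, data)
                  - loglik(dn, delta2, theta, data)) / (2.0 * eps))
    return g

# Made-up observations (y1, y2, X1i, X2i); X2i contains the treatment y1.
data = [
    (1, 1, (1.0, 0.3), (1.0, 1.0, -0.2)),
    (1, 0, (1.0, -1.1), (1.0, 1.0, 0.7)),
    (0, 1, (1.0, 0.8), (1.0, 0.0, 1.5)),
    (0, 0, (1.0, -0.4), (1.0, 0.0, -0.9)),
]
delta1, delta2, theta = [0.2, -0.5], [-0.1, 0.6, 0.4], 1.5

ga = g1_analytic(delta1, delta2, theta, data)
gn = g1_numeric(delta1, delta2, theta, data)
max_abs_diff = max(abs(a - b) for a, b in zip(ga, gn))
print(max_abs_diff)
```

The same pattern applies to δ2 by swapping the roles of the two margins, which is why only one component is shown.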
Hessian
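$$
\begin{aligned}
H_{1,1} = \frac{\partial^2 \ell(\delta)}{\partial \delta_1 \partial \delta_1^T} = \sum_{i=1}^{n} \Bigg\{ & y_{1i} y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_1^T} P(y_{1i} = 1, y_{2i} = 1) - \left[ \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1} \right]^2}{[P(y_{1i} = 1, y_{2i} = 1)]^2} \\
& + y_{1i} (1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1 \partial \delta_1^T} P(y_{1i} = 1, y_{2i} = 0) - \left[ \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1} \right]^2}{[P(y_{1i} = 1, y_{2i} = 0)]^2} \\
& + (1 - y_{1i}) y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1 \partial \delta_1^T} P(y_{1i} = 0, y_{2i} = 1) - \left[ \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1} \right]^2}{[P(y_{1i} = 0, y_{2i} = 1)]^2} \\
& + (1 - y_{1i})(1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1 \partial \delta_1^T} P(y_{1i} = 0, y_{2i} = 0) - \left[ \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1} \right]^2}{[P(y_{1i} = 0, y_{2i} = 0)]^2} \Bigg\},
\end{aligned}
$$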
where
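$$
\begin{aligned}
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_1^T} &= \frac{\partial h_1}{\partial \delta_1} \phi(\eta_{1i}) X_{1i} - h_1 \phi(\eta_{1i}) \eta_{1i} X_{1i}^T X_{1i}, \\
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1 \partial \delta_1^T} &= - \phi(\eta_{1i}) \eta_{1i} X_{1i}^T X_{1i} - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_1^T}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1 \partial \delta_1^T} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_1^T}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1 \partial \delta_1^T} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1 \partial \delta_1^T}.
\end{aligned}
$$

$$
\begin{aligned}
H_{2,2} = \frac{\partial^2 \ell(\delta)}{\partial \delta_2 \partial \delta_2^T} = \sum_{i=1}^{n} \Bigg\{ & y_{1i} y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \delta_2^T} P(y_{1i} = 1, y_{2i} = 1) - \left[ \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2} \right]^2}{[P(y_{1i} = 1, y_{2i} = 1)]^2} \\
& + y_{1i} (1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_2 \partial \delta_2^T} P(y_{1i} = 1, y_{2i} = 0) - \left[ \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_2} \right]^2}{[P(y_{1i} = 1, y_{2i} = 0)]^2} \\
& + (1 - y_{1i}) y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_2 \partial \delta_2^T} P(y_{1i} = 0, y_{2i} = 1) - \left[ \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_2} \right]^2}{[P(y_{1i} = 0, y_{2i} = 1)]^2} \\
& + (1 - y_{1i})(1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_2 \partial \delta_2^T} P(y_{1i} = 0, y_{2i} = 0) - \left[ \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_2} \right]^2}{[P(y_{1i} = 0, y_{2i} = 0)]^2} \Bigg\},
\end{aligned}
$$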
where
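$$
\begin{aligned}
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \delta_2^T} &= \frac{\partial h_2}{\partial \delta_2} \phi(\eta_{2i}) X_{2i} - h_2 \phi(\eta_{2i}) \eta_{2i} X_{2i}^T X_{2i}, \\
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_2 \partial \delta_2^T} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \delta_2^T}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_2 \partial \delta_2^T} &= - \phi(\eta_{2i}) \eta_{2i} X_{2i}^T X_{2i} - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \delta_2^T}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_2 \partial \delta_2^T} &= - \frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_2 \partial \delta_2^T}.
\end{aligned}
$$

$$
H_{3,3} = \frac{\partial^2 \ell(\delta)}{\partial \theta^{*2}} = \frac{\partial}{\partial \theta^*} \left( \frac{\partial \ell(\delta)}{\partial \theta^*} \right) = \frac{\partial}{\partial \theta} \left( \frac{\partial \ell(\delta)}{\partial \theta} \frac{\partial \theta}{\partial \theta^*} \right) \frac{\partial \theta}{\partial \theta^*} = \frac{\partial^2 \ell(\delta)}{\partial \theta^2} \left( \frac{\partial \theta}{\partial \theta^*} \right)^2 + \frac{\partial \ell(\delta)}{\partial \theta} \frac{\partial^2 \theta}{\partial \theta^{*2}},
$$

where

$$
\begin{aligned}
\frac{\partial^2 \ell(\delta)}{\partial \theta^2} = \sum_{i=1}^{n} \Bigg\{ & y_{1i} y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta^2} P(y_{1i} = 1, y_{2i} = 1) - \left[ \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta} \right]^2}{[P(y_{1i} = 1, y_{2i} = 1)]^2} \\
& + y_{1i} (1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \theta^2} P(y_{1i} = 1, y_{2i} = 0) - \left[ \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \theta} \right]^2}{[P(y_{1i} = 1, y_{2i} = 0)]^2} \\
& + (1 - y_{1i}) y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \theta^2} P(y_{1i} = 0, y_{2i} = 1) - \left[ \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \theta} \right]^2}{[P(y_{1i} = 0, y_{2i} = 1)]^2} \\
& + (1 - y_{1i})(1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \theta^2} P(y_{1i} = 0, y_{2i} = 0) - \left[ \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \theta} \right]^2}{[P(y_{1i} = 0, y_{2i} = 0)]^2} \Bigg\},
\end{aligned}
$$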
and
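$$
\begin{aligned}
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta^2} &= \frac{\partial^2 C(\Phi(\eta_{1i}), \Phi(\eta_{2i}); \theta)}{\partial \theta^2}, \\
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \theta^2} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta^2}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \theta^2} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta^2}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \theta^2} &= \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta^2}.
\end{aligned}
$$

$$
\begin{aligned}
H_{1,2} = \frac{\partial^2 \ell(\delta)}{\partial \delta_1 \partial \delta_2^T} = \sum_{i=1}^{n} \Bigg\{ & y_{1i} y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_2^T} P(y_{1i} = 1, y_{2i} = 1) - \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1} \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2}}{[P(y_{1i} = 1, y_{2i} = 1)]^2} \\
& + y_{1i} (1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1 \partial \delta_2^T} P(y_{1i} = 1, y_{2i} = 0) - \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1} \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_2}}{[P(y_{1i} = 1, y_{2i} = 0)]^2} \\
& + (1 - y_{1i}) y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1 \partial \delta_2^T} P(y_{1i} = 0, y_{2i} = 1) - \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1} \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_2}}{[P(y_{1i} = 0, y_{2i} = 1)]^2} \\
& + (1 - y_{1i})(1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1 \partial \delta_2^T} P(y_{1i} = 0, y_{2i} = 0) - \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1} \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_2}}{[P(y_{1i} = 0, y_{2i} = 0)]^2} \Bigg\},
\end{aligned}
$$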
where
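$$
\begin{aligned}
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_2^T} &= \frac{\partial h_1}{\partial \delta_2} \phi(\eta_{1i}) X_{1i}, \\
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1 \partial \delta_2^T} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_2^T}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1 \partial \delta_2^T} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_2^T}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1 \partial \delta_2^T} &= \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \delta_2^T}.
\end{aligned}
$$

$$
\begin{aligned}
H_{1,3} = \frac{\partial^2 \ell(\delta)}{\partial \delta_1 \partial \theta^*} = \sum_{i=1}^{n} \Bigg\{ & y_{1i} y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \theta^*} P(y_{1i} = 1, y_{2i} = 1) - \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1} \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta^*}}{[P(y_{1i} = 1, y_{2i} = 1)]^2} \\
& + y_{1i} (1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1 \partial \theta^*} P(y_{1i} = 1, y_{2i} = 0) - \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1} \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \theta^*}}{[P(y_{1i} = 1, y_{2i} = 0)]^2} \\
& + (1 - y_{1i}) y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1 \partial \theta^*} P(y_{1i} = 0, y_{2i} = 1) - \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1} \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \theta^*}}{[P(y_{1i} = 0, y_{2i} = 1)]^2} \\
& + (1 - y_{1i})(1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1 \partial \theta^*} P(y_{1i} = 0, y_{2i} = 0) - \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1} \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \theta^*}}{[P(y_{1i} = 0, y_{2i} = 0)]^2} \Bigg\},
\end{aligned}
$$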
where
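$$
\begin{aligned}
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \theta^*} &= \frac{\partial h_1}{\partial \theta^*} \phi(\eta_{1i}) X_{1i}, \\
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_1 \partial \theta^*} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \theta^*}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_1 \partial \theta^*} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \theta^*}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_1 \partial \theta^*} &= \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_1 \partial \theta^*}.
\end{aligned}
$$

$$
\begin{aligned}
H_{2,3} = \frac{\partial^2 \ell(\delta)}{\partial \delta_2 \partial \theta^*} = \sum_{i=1}^{n} \Bigg\{ & y_{1i} y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \theta^*} P(y_{1i} = 1, y_{2i} = 1) - \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2} \frac{\partial P(y_{1i} = 1, y_{2i} = 1)}{\partial \theta^*}}{[P(y_{1i} = 1, y_{2i} = 1)]^2} \\
& + y_{1i} (1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_2 \partial \theta^*} P(y_{1i} = 1, y_{2i} = 0) - \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_2} \frac{\partial P(y_{1i} = 1, y_{2i} = 0)}{\partial \theta^*}}{[P(y_{1i} = 1, y_{2i} = 0)]^2} \\
& + (1 - y_{1i}) y_{2i} \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_2 \partial \theta^*} P(y_{1i} = 0, y_{2i} = 1) - \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_2} \frac{\partial P(y_{1i} = 0, y_{2i} = 1)}{\partial \theta^*}}{[P(y_{1i} = 0, y_{2i} = 1)]^2} \\
& + (1 - y_{1i})(1 - y_{2i}) \frac{\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_2 \partial \theta^*} P(y_{1i} = 0, y_{2i} = 0) - \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_2} \frac{\partial P(y_{1i} = 0, y_{2i} = 0)}{\partial \theta^*}}{[P(y_{1i} = 0, y_{2i} = 0)]^2} \Bigg\},
\end{aligned}
$$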
where
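$$
\begin{aligned}
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \theta^*} &= \frac{\partial h_2}{\partial \theta^*} \phi(\eta_{2i}) X_{2i}, \\
\frac{\partial^2 P(y_{1i} = 1, y_{2i} = 0)}{\partial \delta_2 \partial \theta^*} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \theta^*}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 1)}{\partial \delta_2 \partial \theta^*} &= - \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \theta^*}, \\
\frac{\partial^2 P(y_{1i} = 0, y_{2i} = 0)}{\partial \delta_2 \partial \theta^*} &= \frac{\partial^2 P(y_{1i} = 1, y_{2i} = 1)}{\partial \delta_2 \partial \theta^*}.
\end{aligned}
$$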
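Expressions for $h_v$, $\partial h_1 / \partial \delta_v$ and $\partial h_v / \partial \theta^*$, for $v = 1, 2$, as well as $\partial h_2 / \partial \delta_2$, $\partial \theta / \partial \theta^*$, $\partial^2 \theta / \partial \theta^{*2}$, $\partial C(\Phi(\eta_{1i}), \Phi(\eta_{2i}); \theta) / \partial \theta$
and $\partial^2 C(\Phi(\eta_{1i}), \Phi(\eta_{2i}); \theta) / \partial \theta^2$, for all the copulas in Table 1 are not reported here to save space;
they are implemented in the R package SemiParBIVProbit (Marra & Radice, 2013). These expressions
have been derived analytically and verified using numerical derivatives.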
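A check of this kind can be sketched as follows. The example is an illustration only, not the package's code: it assumes a Clayton copula and the transformation θ = exp(θ*) (the actual transformations are those of Table 2, which is not reproduced here), and the helper names `dC_dtheta` and `central_diff` are hypothetical. It compares an analytic ∂C/∂θ with a central finite difference, then does the same for ∂C/∂θ* obtained through the chain rule.

```python
import math

def clayton(u, v, theta):
    """Clayton copula C(u, v; theta), theta > 0 (illustrative assumption)."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def dC_dtheta(u, v, theta):
    """Analytic dC/dtheta, from log C = -(1/theta) * log(a)."""
    a = u ** (-theta) + v ** (-theta) - 1.0
    da = -(u ** (-theta)) * math.log(u) - (v ** (-theta)) * math.log(v)
    return clayton(u, v, theta) * (math.log(a) / theta ** 2 - da / (theta * a))

def central_diff(f, x, eps=1e-6):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

u, v = 0.3, 0.6
theta_star = math.log(1.5)
theta = math.exp(theta_star)              # assumed mapping theta = exp(theta*)
dtheta_dtheta_star = math.exp(theta_star)

# Derivative with respect to theta: analytic versus numerical.
err_theta = abs(dC_dtheta(u, v, theta)
                - central_diff(lambda t: clayton(u, v, t), theta))

# Derivative with respect to theta*: chain rule versus numerical.
analytic_star = dC_dtheta(u, v, theta) * dtheta_dtheta_star
numeric_star = central_diff(lambda s: clayton(u, v, math.exp(s)), theta_star)
err_theta_star = abs(analytic_star - numeric_star)

print(err_theta, err_theta_star)
```

Both discrepancies should be at the level of finite-difference error; a large value would point to a mistake in the analytic expression or in the assumed transformation.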
References
Barndorff-Nielsen, O. & Cox, D. (1989). Asymptotic Techniques for Use in Statistics. London:
Chapman and Hall.
Benedetti, A. & Abrahamowicz, M. (2004). Using generalized additive models to reduce residual
confounding. Statistics in Medicine, 23, 3781–3801.
Brechmann, E. C. & Schepsmeier, U. (2013). Modeling dependence with C- and D-vine copulas:
The R package CDVine. Journal of Statistical Software, 52(3), 1–27.
Buchmueller, T. C., Grumbach, K., Kronick, R., & Kahn, J. G. (2005). Book review: The effect of
health insurance on medical care utilization and implications for insurance expansion: A review
of the literature. Medical Care Research and Review, 62, 3–30.
Chib, S. & Greenberg, E. (2007). Semiparametric modeling and estimation of instrumental variable models. Journal of Computational and Graphical Statistics, 16, 86–114.
Clarke, K. A. (2007). A simple distribution-free test for non-nested model selection. Political
Analysis, 15, 347–363.
Craven, P. & Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377–403.
Fabbri, D. & Monfardini, C. (2003). Public vs. private health care services demand in Italy.
Giornale degli Economisti, 62(1), 93–123.
Geyer, C. J. (2013). Trust regions. http://cran.r-project.org/web/packages/trust/vignet
Gitto, L., Santoro, D., & Sobbrio, G. (2006). Choice of dialysis treatment and type of medical unit
(private vs public), application of a recursive bivariate probit. Health Economics, 15, 1251–
1256.
Goldman, D., Bhattacharya, J., McCaffrey, D., Duan, N., Leibowitz, A., Joyce, G., & Morton, S.
(2001). Effect of insurance on mortality in an HIV-positive population in care. Journal of the
American Statistical Association, 96, 883–894.
Greene, W. H. (2012). Econometric Analysis. Prentice Hall, New York.
Gu, C. (2002). Smoothing Spline ANOVA Models. Springer-Verlag, London.
Han, S. & Vytlacil, E. J. (2013). Identification in a generalization of bivariate probit models with
endogenous regressors. Working paper.
Hastie, T. & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical
Society Series B, 55, 757–796.
Holly, A., Gardiol, L., Domenighetti, G., & Brigitte, B. (1998). An econometric model of health
care utilization and health insurance in Switzerland. European Economic Review, 42(3-5), 513–
522.
Hopkins, S. & Kidd, M. (1996). The determinants of the demand for private health insurance
under medicare. Applied Economics, 28, 1623–1632.
Imbens, G. (2004). Nonparametric estimation of average treatment effects under exogeneity: A
review. Review of Economics and Statistics, 86, 4–29.
Jones, A. M., Koolman, X., & Doorslaer, E. V. (2006). The impact of having supplementary private
health insurance on the uses of specialists. Annales d’Economie et de Statistique, (83/84), 251–
275.
Kauermann, G. (2005). Penalized spline smoothing in multivariable survival models with varying
coefficients. Computational Statistics and Data Analysis, 49, 169–186.
Kauermann, G., Krivobokova, T., & Fahrmeir, L. (2009). Some asymptotic results on generalized
penalized spline smoothing. Journal of the Royal Statistical Society Series B, 71, 487–503.
Kawatkar, A. A. & Nichol, M. B. (2009). Estimation of causal effects of physical activity on
obesity by a recursive bivariate probit model. Value in Health, 12, A131–A132.
Kim, Y. J. & Gu, C. (2004). Smoothing spline Gaussian regression: More scalable computation
via efficient approximation. Journal of the Royal Statistical Society Series B, 66, 337–356.
Latif, E. (2009). The impact of diabetes on employment in Canada. Health Economics, 18, 577–
589.
Li, Y. & Jensen, G. A. (2011). The impact of private long-term care insurance on the use of
long-term care. Inquiry, 48(1), 34–50.
Maddala, G. S. (1983). Limited Dependent and Qualitative Variables in Econometrics. Cambridge
University Press, Cambridge.
Marra, G. (2013). On p-values for semiparametric bivariate probit models. Statistical Methodology, 10, 23–28.
Marra, G. & Radice, R. (2011a). Estimation of a semiparametric recursive bivariate probit model
in the presence of endogeneity. Canadian Journal of Statistics, 39, 259–279.
Marra, G. & Radice, R. (2011b). A flexible instrumental variable approach. Statistical Modelling,
11, 581–603.
Marra, G. & Radice, R. (2013). SemiParBIVProbit: Semiparametric Bivariate Probit Modelling.
R package version 3.2-7.
Marra, G. & Wood, S. (2012). Coverage properties of confidence intervals for generalized additive
model components. Scandinavian Journal of Statistics, 39, 53–74.
Marra, G. & Wood, S. N. (2011). Practical variable selection for generalized additive models.
Computational Statistics and Data Analysis, 55, 2372–2387.
McCullagh, P. (1987). Tensor Methods in Statistics. London: Chapman and Hall.
Nelsen, R. (2006). An Introduction to Copulas. New York: Springer.
Nocedal, J. & Wright, S. J. (2006). Numerical Optimization. New York: Springer-Verlag.
R Development Core Team (2013). R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric Regression. Cambridge
University Press, New York.
Schokkaert, E., Ourti, T. V., Graeve, D. D., Lecluyse, A., & de Voorde, C. V. (2010). Supplemental
health insurance and equality of access in Belgium. Health Economics, 4, 37–395.
Shane, D. & Trivedi, P. (2012). What Drives Differences in Health Care Demand? The Role of
Health Insurance and Selection Bias. Health, Econometrics and Data Group (HEDG) Working
Papers 12/09, HEDG, c/o Department of Economics, University of York.
Sindelar, J. (1982). Differential use of medical care by sex. The Journal of Political Economy, 90,
1003–1019.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut
de Statistique de l’Université de Paris, 8, 229–231.
Sklar, A. (1973). Random variables, joint distributions, and copulas. Kybernetika, 9, 449–460.
Srivastava, P. & Zhao, X. (2008). Impact of Private Health Insurance on the Choice of Public
versus Private Hospital Services. Health, Econometrics and Data Group (HEDG) Working
Papers 08/17, HEDG, c/o Department of Economics, University of York.
Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307–333.
Wang, X. & Dey, D. K. (2010). Generalized extreme value regression for binary response data:
An application to B2B electronic payments system adoption. The Annals of Applied Statistics,
4, 2000–2023.
Wilde, J. (2000). Identification of multiple equation probit models with endogenous dummy regressors. Economics Letters, 69, 309–312.
Winkelmann, R. (2012). Copula bivariate probit models: with an application to medical expenditures. Health Economics, 21, 1444–1455.
Wood, S. (2004). Stable and efficient multiple smoothing parameter estimation for generalized
additive models. Journal of the American Statistical Association, 99, 673–686.
Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society Series
B, 65, 95–114.
Wood, S. N. (2006). Generalized Additive Models: An Introduction With R. Chapman &
Hall/CRC, London.
Wood, S. N. (2013). On p-values for smooth components of an extended generalized additive
model. Biometrika, 100, 221–228.
Zimmer, D. M. & Trivedi, P. K. (2006). Using trivariate copulas to model sample selection and
treatment effects: application to family health care demand. Journal of Business & Economic
Statistics, 24, 63–76.