A method for analyzing clustered interval-censored data based on Cox's model
Chew-Teng Kor, Kuang-Fu Cheng and Yi-Hau Chen
Methods for analyzing interval-censored data are well established. Unfortunately, these
methods are inappropriate for studies with correlated data. In this paper, we focus on
developing a method for analyzing clustered interval-censored data. Our method is based
on Cox's proportional hazards model with a piecewise-constant baseline hazard function. The
correlation structure of the data can be modeled by using Clayton's copula or an independence
model with proper adjustment in the covariance estimation. We establish estimating
equations for the regression parameters and baseline hazards (and a parameter in the copula)
simultaneously. Simulation results confirm that the point estimators approximately follow a
multivariate normal distribution, and our proposed variance estimates are reliable. In
particular, we found that the approach with the independence model worked well even when
the true correlation model was derived from Clayton's copula. We applied our method to a
family-based cohort study of pandemic H1N1 (pH1N1) influenza in Taiwan during 2009–2010.
Using the proposed method, we investigate the impact of vaccination and family contacts
on the incidence of pH1N1 influenza.
Keywords: cluster; copula model; Cox model; estimating equation; interval-censored
1. Introduction
Interval-censored data occur in many epidemiological, longitudinal, or biomedical
studies in which subjects are followed periodically for the event of interest. In
these studies, the event time T is not directly observable but may be detected in some
periodic examination interval, denoted as (L, R], where L is the left examination time and R
is the right examination time. For the special case in which subjects have only one
examination time at R, the data are called current status data or 'case 1' interval-censored data.
Statistical methods for analyzing interval-censored data have been widely studied.
For example, Peto [1], Turnbull [2], and Groeneboom and Wellner [3] proposed
nonparametric maximum likelihood methods for estimating the distribution function with
interval-censored data. Moreover, Groeneboom and Wellner [3] established asymptotic
properties for the nonparametric maximum likelihood estimator.
Finkelstein [4], on the other hand, developed a likelihood approach on the basis of the Cox
proportional hazards regression model using interval-censored data. Many authors extended
her likelihood approach to other situations; for example, Huang [5] studied current
status data and showed the maximum likelihood estimator of the regression parameter to be
consistent and to have an asymptotically normal distribution with $n^{1/2}$
convergence rate.
Extending to bivariate interval-censored data, Goggins and Finkelstein [6] and Kim and
Xue [7] both considered a marginal method with a working independence assumption for
estimating Cox's regression parameters. They followed the idea of Wei, Lin, and Weissfeld
[8] to propose a sandwich-type covariance estimate. However, their method requires a large
number of parameters to model the baseline survival function. To analyze clustered current
status data, Cook and Tolusso [9] considered the use of second-order generalized estimating
equations (GEE) and a copula model.
In this paper, we focus on analyzing clustered interval-censored data on the basis of
Cox's [10] proportional hazards model. Our goal is to estimate the baseline survival function
and Cox’s regression parameters. We consider a piecewise-constant hazard function for the
baseline hazard to simplify our analysis. We employ the GEE approach for estimating the
Cox regression parameters and a multinomial likelihood approach for estimating the
baseline hazard parameters. Either an independence model or a parametric covariance model
is used to model the correlations of the data within a cluster. If we apply the independence
model, we account for the correlation effect within a cluster through the use of a
sandwich-type covariance matrix estimate.
A family-based cohort study of pandemic H1N1 influenza in Taiwan during 2009–2010
is used to demonstrate the application of our method. The pH1N1 influenza study was
conducted by the Center for Infectious Disease Education and Research at China Medical
University and aimed to study household transmission and vaccine effectiveness
pertaining to seasonal influenza viruses. Family data from two areas of Taiwan,
Taichung city and Nantou county, were collected with written consent. All subjects within a
family were followed up for 1 year, and their blood samples were collected at two different
time points to determine whether the subjects were infected. The infection status was
determined by the level of the hemagglutination inhibition (HI) titer. Understanding the
impact of risk factors such as family contacts and vaccination on the infection outcome can
provide us valuable information for controlling the disease.
We organize the paper as follows. We present the notation and methods in Sections 2
and 3. Sections 4 and 5 contain simulation studies and data analysis, respectively. Section 6
gives our concluding remarks.
2. The model
2.1. Notation
Suppose there are M families with $n_i$ subjects in the ith family, $i = 1, \ldots, M$. We
denote $T_{ij}$ as the time to the occurrence of influenza for subject j in family i,
$j = 1, \ldots, n_i$, measured from the beginning of the study. However, the $T_{ij}$'s are not
directly observable. Instead, we have two examination-time points, denoted by $L_{ij}$ and
$R_{ij}$, to determine if the subject has been infected. As the event time may be left-censored,
interval-censored, or right-censored, we represent the event time information with two binary
variables $(\delta_{ij1}, \delta_{ij2})$, defined as

$\delta_{ij1} = I(T_{ij} \le L_{ij}), \quad \delta_{ij2} = I(L_{ij} < T_{ij} \le R_{ij})$,   (1)

where $I(\cdot)$ is the usual indicator function. There are only three possible outcomes for
$(\delta_{ij1}, \delta_{ij2})$: $(1, 0)$ if the subject was infected by influenza before the first
examination time; $(0, 1)$ if the subject was infected by influenza between
the first and second examination times; and $(0, 0)$ if the subject was not
infected by influenza before the second examination time. We can summarize the observed
data as $\{(L_{ij}, R_{ij}, \delta_{ij1}, \delta_{ij2}, X_{ij});\ i = 1, \ldots, M,\ j = 1, \ldots, n_i\}$,
where $X_{ij}$ represents a $p \times 1$ vector of covariates. Of note, the
examination times considered in this paper can vary across all subjects. In the case of
current status data, we have only one examination time, and $\delta_{ij2}$ is always zero.
On the other hand, if there are more than two examination-time points, our methods proposed
in this paper can also be straightforwardly extended. Please see the details in the Appendix.
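As a concrete illustration (not part of the original method, and with hypothetical numbers), the two censoring indicators can be computed directly from an event time and the two examination times:

```python
def censoring_indicators(T, L, R):
    """Return (delta1, delta2): delta1 = I(T <= L), delta2 = I(L < T <= R)."""
    d1 = 1 if T <= L else 0
    d2 = 1 if L < T <= R else 0
    return d1, d2

# The three possible outcomes:
assert censoring_indicators(T=0.3, L=1.0, R=2.0) == (1, 0)  # left-censored
assert censoring_indicators(T=1.5, L=1.0, R=2.0) == (0, 1)  # interval-censored
assert censoring_indicators(T=2.5, L=1.0, R=2.0) == (0, 0)  # right-censored
```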
2.2. Hazard regression model
To model the event time, we assume that T follows a proportional hazards model (Cox [10])
given by

$\lambda(t \mid X) = \lambda_0(t) \exp(\beta' X)$,

where $\lambda(t \mid X)$ is the hazard function of T, evaluated at time t and given covariates
X, $\lambda_0(t)$ is the baseline hazard function, and $\beta$ is a $p \times 1$ vector of
regression parameters.
We assume that the baseline hazard function is a piecewise-constant function with I jump
points $0 = t_0 < t_1 < \cdots < t_I < \infty$ on the nonnegative real line. We write
the piecewise-constant baseline hazard function as

$\lambda_0(t) = \sum_{l=1}^{I} \alpha_l I(t_{l-1} < t \le t_l)$,

where $\alpha_l \ge 0$ is the hazard rate on $(t_{l-1}, t_l]$. We respectively
write the corresponding baseline cumulative hazard and survival functions as
$\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ and $S_0(t) = \exp\{-\Lambda_0(t)\}$.
Under the proportional hazards regression model, we can write the conditional
expectations of $(\delta_{ij1}, \delta_{ij2})$ given $X_{ij}$ as

$\mu_{ij1} = E(\delta_{ij1} \mid X_{ij}) = 1 - S_0(L_{ij})^{\exp(\beta' X_{ij})}$,
$\mu_{ij2} = E(\delta_{ij2} \mid X_{ij}) = S_0(L_{ij})^{\exp(\beta' X_{ij})} - S_0(R_{ij})^{\exp(\beta' X_{ij})}$.

The conditional expectations play an important role in the estimation of parameters. In the
following, we will construct two estimating equations separately for
$\alpha = (\alpha_1, \ldots, \alpha_I)'$ and $\beta$.
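The piecewise-constant baseline hazard and the resulting conditional expectations can be sketched numerically as follows (a minimal sketch with hypothetical cut points, hazard rates, and covariate values; the function names are ours, not the paper's):

```python
import math

def cum_hazard(t, cuts, alpha):
    """Baseline cumulative hazard Lambda0(t) for a piecewise-constant hazard:
    rate alpha[l] on (cuts[l], cuts[l+1]], with cuts[0] = 0."""
    H = 0.0
    for l, a in enumerate(alpha):
        lo, hi = cuts[l], cuts[l + 1]
        H += a * max(0.0, min(t, hi) - lo)
    return H

def surv(t, cuts, alpha, beta, x):
    """S(t | x) = S0(t)^{exp(beta'x)} under the Cox model."""
    lin = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(-cum_hazard(t, cuts, alpha) * math.exp(lin))

def mu(L, R, cuts, alpha, beta, x):
    """Conditional expectations mu1 = E(delta1 | x), mu2 = E(delta2 | x)."""
    sL, sR = surv(L, cuts, alpha, beta, x), surv(R, cuts, alpha, beta, x)
    return 1.0 - sL, sL - sR

cuts = [0.0, 1.0, 2.0, float("inf")]   # I = 3 pieces (hypothetical)
alpha = [0.2, 0.5, 0.1]                # hypothetical hazard rates
m1, m2 = mu(0.8, 1.6, cuts, alpha, beta=[0.5], x=[1.0])
# mu1, mu2, and 1 - mu1 - mu2 = S(R | x) form valid multinomial probabilities
assert 0 < m1 < 1 and 0 < m2 < 1 and m1 + m2 < 1
```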
3. Estimation method
We follow two assumptions of Finkelstein [4]: (a) the censoring mechanism is independent
of both the failure time and covariates; and (b) all subjects will eventually fail unless
they are censored. Under these assumptions, we propose using a system of pseudo-likelihood score
equations for estimating baseline hazard parameters and a GEE approach for estimating
regression parameters. We consider two approaches to account for the correlation: (a) to
assume that data within cluster are independent and use a sandwich-type estimate for
covariance estimation or (b) to use Clayton’s copula function to model correlations.
Simulation results confirm that both approaches worked well under our simulation
conditions.
3.1. Estimating equations for the baseline hazards
We first assume that the regression parameters $\beta$ are known. On the basis of this, we
give the log of the pseudo-likelihood function of $\alpha = (\alpha_1, \ldots, \alpha_I)'$ by

$\ell(\alpha) = \sum_{i=1}^{M} \sum_{j=1}^{n_i} \left\{ \delta_{ij1} \log \mu_{ij1} + \delta_{ij2} \log \mu_{ij2} + (1 - \delta_{ij1} - \delta_{ij2}) \log (1 - \mu_{ij1} - \mu_{ij2}) \right\}$,

where $\mu_{ij1}$ and $\mu_{ij2}$ are the conditional expectations given in Section 2.2.
Taking the derivative with respect to each $\alpha_l$, we have the following score equations:

$\partial \ell(\alpha) / \partial \alpha_l = 0, \quad l = 1, \ldots, I$.   (8)
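For illustration (a sketch of ours, not the authors' code, with hypothetical probabilities), the contribution of one subject to the multinomial log pseudo-likelihood picks out exactly one of the three outcome probabilities:

```python
import math

def loglik_contrib(d1, d2, mu1, mu2):
    """One subject's log pseudo-likelihood contribution:
    d1*log(mu1) + d2*log(mu2) + (1 - d1 - d2)*log(1 - mu1 - mu2)."""
    p_right = 1.0 - mu1 - mu2   # P(T > R | x) = S(R | x)
    return (d1 * math.log(mu1) + d2 * math.log(mu2)
            + (1 - d1 - d2) * math.log(p_right))

# Each censoring outcome selects its own probability:
assert math.isclose(loglik_contrib(1, 0, 0.2, 0.3), math.log(0.2))  # left
assert math.isclose(loglik_contrib(0, 1, 0.2, 0.3), math.log(0.3))  # interval
assert math.isclose(loglik_contrib(0, 0, 0.2, 0.3), math.log(0.5))  # right
```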
3.2. Estimating equations for the regression parameters
We assume that the baseline hazards $\alpha = (\alpha_1, \ldots, \alpha_I)'$ are known. We
employ the GEE approach by Liang and Zeger [11] to construct estimating equations for
$\beta$. They are given by

$\sum_{i=1}^{M} D_i' V_i^{-1} (\delta_i - \mu_i) = 0$,   (9)

where $\delta_i = (\delta_{i11}, \delta_{i12}, \ldots, \delta_{i n_i 1}, \delta_{i n_i 2})'$,
$\mu_i = E(\delta_i \mid X_i)$, $D_i = \partial \mu_i / \partial \beta'$, and $V_i$ is a
working covariance matrix of $\delta_i$.
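The GEE estimating function has the generic form $\sum_i D_i' V_i^{-1}(\delta_i - \mu_i)$, which can be sketched directly (a minimal sketch with hypothetical numbers; the helper name is ours):

```python
import numpy as np

def gee_score(D_list, V_list, resid_list):
    """Evaluate U = sum_i D_i' V_i^{-1} (delta_i - mu_i).
    D_i: Jacobian of mu_i in beta; V_i: working covariance;
    resid_i: residual vector delta_i - mu_i for family i."""
    p = D_list[0].shape[1]
    U = np.zeros(p)
    for D, V, r in zip(D_list, V_list, resid_list):
        U += D.T @ np.linalg.solve(V, r)
    return U

# One family, two observations, one parameter (hypothetical numbers):
D = np.array([[0.5], [0.2]])
V = np.eye(2)                    # independence working covariance
r = np.array([0.1, -0.3])
U = gee_score([D], [V], [r])
assert np.allclose(U, [0.5 * 0.1 + 0.2 * (-0.3)])
```

Solving U = 0 in beta (e.g., by Newton–Raphson) then yields the GEE estimate.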
3.3. Correlation model and parameter estimates
There are two approaches that can be used to account for correlations.
(a) Independence model approach: we assume that the working covariance matrix $V_i$ is
block-diagonal, with each subject contributing the $2 \times 2$ multinomial covariance
matrix of $(\delta_{ij1}, \delta_{ij2})$. This leads the estimating equation for $\beta$ to be

$\sum_{i=1}^{M} D_i' V_i^{-1} (\delta_i - \mu_i) = 0$,   (10)

with $V_i$ the block-diagonal working covariance matrix. Define
$\theta = (\alpha', \beta')'$ and let $\hat{\theta}$ be the solution from Equations (8) and
(10). Under regularity conditions, we can prove that $\sqrt{M}(\hat{\theta} - \theta)$ is
asymptotically normal with mean zero and covariance matrix $\Sigma$, which can be
consistently estimated by the sandwich-type estimate

$\hat{\Sigma} = \hat{A}^{-1} \hat{B} (\hat{A}^{-1})'$,   (11)

where $\hat{A}$ is the empirical derivative matrix of the estimating functions and
$\hat{B}$ is the empirical covariance matrix of the estimating functions.
(b) Covariance model approach: following Cook and Tolusso [9], we use Clayton's copula
to model the covariance structure, where the Clayton model is

$P(T_{ij} > s, T_{ij'} > t \mid X_{ij}, X_{ij'}) = \left\{ S(s \mid X_{ij})^{-\phi} + S(t \mid X_{ij'})^{-\phi} - 1 \right\}^{-1/\phi}, \quad \phi > 0$,

and the parameter $\phi$ measures the association between the event times within a cluster.
Then each element of the covariance $V_i$ in (9) is given by

$\mathrm{cov}(\delta_{ijk}, \delta_{ij'k'}) = E(\delta_{ijk} \delta_{ij'k'}) - \mu_{ijk} \mu_{ij'k'}$.

The joint probability $E(\delta_{ijk} \delta_{ij'k'})$ is determined by Clayton's copula,
which depends on an additional parameter $\phi$.
Using the alternative expression of the covariance matrix $V_i$ as above, the estimating
equations for $\beta$ are

$\sum_{i=1}^{M} D_i' V_i(\phi)^{-1} (\delta_i - \mu_i) = 0$.

Let $Z_i$ be the vector of all pairwise products $\delta_{ijk} \delta_{ij'k'}$, $j < j'$.
The expectation of $Z_i$, defined as $\zeta_i = E(Z_i \mid X_i)$, is determined by Clayton's
copula and hence depends on $\phi$. Under the copula model, we consider an extra estimating
equation for $\phi$, constructed in the same manner as (9) from the residuals $Z_i - \zeta_i$.
We can obtain the estimate of the parameter $\phi$ by simultaneously solving the estimating
equations for $\alpha$, $\beta$, and $\phi$.
We use algorithms such as the Newton–Raphson algorithm or the bisection algorithm to solve
this system of equations. Define $\theta = (\alpha', \beta', \phi)'$ and let $\hat{\theta}$
be the solution. Under regularity conditions, we can prove that
$\sqrt{M}(\hat{\theta} - \theta)$ is asymptotically normal with mean zero and
covariance matrix $\Sigma$, which we can consistently estimate by the sandwich-type
estimate $\hat{\Sigma}$.
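Two properties of Clayton's copula used throughout Section 3 can be checked numerically (a sketch of ours; the Kendall's tau formula tau = phi/(phi + 2) is the standard one for this parameterization):

```python
def clayton_joint_surv(s, t, phi):
    """Clayton copula applied to marginal survival probabilities:
    P(T1 > a, T2 > b) = (s^-phi + t^-phi - 1)^(-1/phi), phi > 0."""
    return (s ** (-phi) + t ** (-phi) - 1.0) ** (-1.0 / phi)

def kendall_tau(phi):
    """Kendall's tau for the Clayton copula: tau = phi / (phi + 2)."""
    return phi / (phi + 2.0)

# As phi -> 0, the joint survival approaches independence (the product):
assert abs(clayton_joint_surv(0.7, 0.6, 1e-8) - 0.42) < 1e-4
# phi = 2 corresponds to Kendall's tau = 0.5, the value used in Section 4:
assert kendall_tau(2.0) == 0.5
# Positive phi induces positive dependence (joint survival above the product):
assert clayton_joint_surv(0.7, 0.6, 2.0) > 0.42
```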
4. Simulation studies
We conduct a simulation study to evaluate the finite-sample performance of the proposed
methods.
4.1. Data generation
We considered three independent covariates $X_1$, $X_2$, and $X_3$. Given the number of
families M and the family size $n_i$, we generated the multivariate failure times
$(T_{i1}, \ldots, T_{i n_i})$ of the subjects in the ith family from the joint distribution
given by Clayton's copula applied to the marginal survival functions. For each subject, we
generated the marginal failure time according to the Cox regression model with hazard
function $\lambda(t \mid X) = \lambda_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)$. We
generated the first examination time L from the exponential distribution exp(0.5) and
determined the second examination time by R = L + A, where A ~ Uniform(0, 1.5).
Under this setting, the right-, interval-, and left-censoring rates were about 29%, 40%, and
31%, respectively.
We selected $\phi = 2$, which corresponds to Kendall's tau equal to
0.5. The number of families was either 100 or 200, with family size equal to 1, 2, 3, or 5, or
randomly selected between 2 and 6. The number of cut points for the piecewise-constant
baseline hazard function was I = 5, and the cut points were the sextiles of the
sample distribution of the examination-time points. Finally, we based all
simulation results on 1000 replications.
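A minimal sketch of this data-generating scheme, using the standard gamma-frailty construction of Clayton's copula and, for simplicity, a constant baseline hazard with a single covariate (the numeric values for lam0, beta, and the covariates are ours; phi = 2 and the examination-time distributions follow the setting above):

```python
import math, random

def clayton_cluster_times(phi, lam0, beta, xs, rng):
    """Generate correlated event times whose marginals follow a Cox model
    with constant baseline hazard lam0 (a one-piece special case), with
    Clayton dependence of parameter phi > 0 (gamma-frailty construction)."""
    w = rng.gammavariate(1.0 / phi, 1.0)          # shared gamma frailty
    times = []
    for x in xs:
        u = rng.random()
        v = (1.0 - math.log(u) / w) ** (-1.0 / phi)  # Clayton-dependent uniform
        rate = lam0 * math.exp(beta * x)
        times.append(-math.log(v) / rate)            # solve S(T | x) = v
    return times

rng = random.Random(1)
# phi = 2 gives Kendall's tau = phi / (phi + 2) = 0.5, as in the simulations
fam = clayton_cluster_times(phi=2.0, lam0=0.5, beta=0.7, xs=[0, 1, 0], rng=rng)
assert len(fam) == 3 and all(t > 0 for t in fam)
# Examination interval: L ~ exp(0.5), R = L + A with A ~ Uniform(0, 1.5)
L = rng.expovariate(0.5)
R = L + rng.uniform(0.0, 1.5)
assert 0 < L < R
```

Applying the indicator construction of Section 2 to (T, L, R) then yields one simulated interval-censored observation per subject.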
4.2. Results
We show the results for the estimates of the Cox regression parameters in Table I. They
include the empirical bias (Bias), the standard error (SE) of the point estimator, the
average of the estimated standard errors (SSE) given by (11), and the empirical coverage
probability of the 95% Wald confidence intervals based on the estimated standard error
(CP*). We give the latter results to show the accuracy of the normal approximation.
We see from Table I that all parameter estimates have relatively small bias and variance
when using either the independence model or the covariance model. We can decrease the bias
or variance by increasing the number of families and/or the family size. Unreported
simulation results also indicated that the bias and variance of the regression estimates
could be reduced by increasing the number of examination times from two to three.
The variance estimate based on the independence model or the covariance model was also
very similar to the true variance. The former result showed that when the correlation
structure within a family is not known to the users, it is still possible to account for the
correlation effect by using a proper sandwich-type covariance estimate. The coverage
probabilities were also shown to be close to the nominal value,
indicating the validity of the normal approximation. Overall, the differences between using
the independence model and the covariance model were small. This confirms that the approach
based on the independence model is also reliable, and its efficiency loss is small.
Regarding the association parameter Kendall's tau in Clayton's copula, we note that
our estimate had simulated bias ranging from -0.035 to -0.027 and variance ranging from
0.1031 to 0.0723 when there were 100 families with size equal to 2, 3, or 5. When there
were 200 families, the ranges of the bias and variance became (-0.0301, -0.0134) and
(0.0773, 0.0416), respectively. This also shows that one can decrease the bias and
variance of the association parameter estimate by increasing the number of families and/or
the family size. Simulation results also indicated that the sandwich-type variance estimates
were very similar to the true variances. In the simulations, their differences were no more
than 0.52%.
5. Application to pH1N1 data in Taiwan
The pH1N1 dataset was obtained from an infectious disease study conducted in Central
Taiwan between 3/15/2009 and 12/1/2009. In this study, school children and their family
members were recruited with written consent. One hundred four households from Taichung
city and Nantou county were involved in the study, with 306 household members agreeing
to have their blood samples collected and to answer basic questionnaires. The first time
points for taking blood samples were between April and June, 2009 (after the 2008–2009
influenza season), and the second time points were between September and
October, 2009 (before vaccination based on the 2009–2010 seasonal and 2009 pandemic
influenza strains). The level of the HI titer was used to determine whether the subject was
infected by flu virus. Using the defined test of HI titer, we found that 66 subjects caught the
influenza virus before the first inspection times, 132 subjects caught the influenza virus
between their first and second inspection times, and 108 subjects did not catch any
influenza virus before their second inspection times. The focus of this study was to
investigate the antibody response against the influenza infection and factors impacting the
infection. Factors considered in this paper include age, gender, household size, vaccination
history, mother/father/grandmother contact level (low or high), urban or rural areas, and so
forth. We apply the method proposed in this paper to study the pH1N1 data. We fit the
Cox regression model to our data using a piecewise-constant baseline hazard function with
jump points at 1.06, 3.60, 6.20, 6.53, and 8.54 months, which correspond to the dates
4/15/2009, 7/1/2009, 9/20/2009, 10/1/2009, and 11/30/2009. Table II provides estimates of
the Cox regression parameters based on the independence and covariance models. In this
application, the estimate of the association parameter Kendall's tau in Clayton's copula is
0.426, and its 95% confidence interval is (0.279, 0.574).
The results from Table II indicate that the approaches based on the independence and
covariance models lead to similar conclusions. The analysis also reveals that subjects who
were vaccinated in 2008 and had a higher HI titer response (HI > 40) could be better
protected from catching the 2009 pandemic influenza. The hazard ratio under the independence
model was 0.6839 with p-value = 0.042, and that under the covariance model based on
Clayton's copula was 0.6776 with p-value = 0.026. We also detected a significant protective
effect for subjects with a high frequency of mother–child contact. The corresponding hazard
ratio was 0.6513 with p-value = 0.036 under the independence model and 0.6339 with
p-value = 0.022 under the covariance model.
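The hazard ratios and p-values above are Wald summaries computed from an estimated log hazard ratio and its standard error. A sketch of that computation (the standard error used here is hypothetical, chosen only for illustration, not taken from Table II):

```python
import math

def wald_summary(log_hr, se):
    """Hazard ratio, 95% CI, and two-sided Wald p-value from an estimated
    log hazard ratio and its standard error (normal approximation; the
    standard normal CDF is computed via the error function)."""
    z = log_hr / se
    std_norm_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    p = 2.0 * (1.0 - std_norm_cdf(abs(z)))
    ci = (math.exp(log_hr - 1.96 * se), math.exp(log_hr + 1.96 * se))
    return math.exp(log_hr), ci, p

# Hypothetical standard error; the hazard ratio matches the one reported above
hr, ci, p = wald_summary(log_hr=math.log(0.6839), se=0.19)
assert abs(hr - 0.6839) < 1e-9
assert ci[0] < hr < ci[1]
assert 0 < p < 1
```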
6. Discussion
In this paper, we have developed a method for analyzing clustered interval-censored
data. We used Cox's proportional hazards model as a basis for our analysis. Clearly, we can
also extend our method to analyze other models. We derived estimating equations for the
regression parameters, the parameters of the baseline hazard function, and an association
parameter for the covariance model. If the independence model is used,
the GEE method with sandwich-type variance estimation is suggested and has been shown
to have satisfactory performance in our simulations. The variance estimate based on the
correct covariance model is smaller than that based on the independence model. However,
their difference seems small and is negligible in some scenarios.
We have used a piecewise-constant hazard function to estimate the baseline hazard
function. This approach is simple in nature. However, its performance depends on a
judicious choice of cut points. We used five cut points in our simulations and in the
practical example, and it seemed that this choice worked well. Unreported simulation
results also showed that using six cut points might reduce the variance
of the regression estimate. However, the improvement was marginal. Clearly, this issue
deserves more study in the future.
Note that other copulas can also be applied instead of Clayton's copula for modeling the
covariance structure. According to Cook and Tolusso [9], in the analysis of clustered
current status data, the behaviors of the regression estimates based on various copula models
are similar, except for the estimates of the parameters pertaining to the copula models. We
expect that such a conclusion continues to hold in the analysis of clustered interval-censored
data.
It is also important to remark that there are other approaches to account for correlated
interval-censored data. For example, Hens et al. [12] studied the behavior of the
bivariate-correlated gamma frailty model for 'case 1' interval-censored data (current status
data). Scarlett et al. [13] discussed a parametric frailty model for the analysis of clustered
and interval-censored failure time data following a Weibull distribution. To use these
methods, however, it is necessary to provide a sensitivity analysis of the frailty model.
The present work is motivated by the real problem of analyzing correlated
interval-censored data. It is possible to extend the methodology to the problem of analyzing
correlated data with recurrent events. Sometimes, it is also likely that we observe
different covariate values at different examination times. Under this scenario, if it is
meaningful, we can also let the survival probability in different examination-time intervals
depend on the corresponding covariate values. However, the performance of the regression
estimates needs more careful study. These challenging but interesting topics will be studied
in the future.
Appendix
Here, we extend our results to multiple examination times. Let $L_{ijk}$ be the kth
examination time for the jth subject in the ith family, $k = 1, \ldots, K_{ij}$, and set
$L_{ij0} = 0$. Also, let $\delta_{ijk} = I(L_{ij,k-1} < T_{ij} \le L_{ijk})$. Define
$\mu_{ijk} = E(\delta_{ijk} \mid X_{ij}) = S(L_{ij,k-1} \mid X_{ij}) - S(L_{ijk} \mid X_{ij})$.
The pseudo-likelihood score equations for the baseline hazards $\alpha_1, \ldots, \alpha_I$
are obtained by differentiating the multinomial log pseudo-likelihood

$\sum_{i=1}^{M} \sum_{j=1}^{n_i} \left\{ \sum_{k=1}^{K_{ij}} \delta_{ijk} \log \mu_{ijk} + \left(1 - \sum_{k=1}^{K_{ij}} \delta_{ijk}\right) \log \left(1 - \sum_{k=1}^{K_{ij}} \mu_{ijk}\right) \right\}$

with respect to each $\alpha_l$. The estimating equations for $\beta$ using the GEE
approach are

$\sum_{i=1}^{M} D_i' V_i^{-1} (\delta_i - \mu_i) = 0$,

where $\delta_i$ and $\mu_i$ now stack the $\delta_{ijk}$ and $\mu_{ijk}$ over subjects and
examination times. The joint probability $E(\delta_{ijk} \delta_{ij'k'})$ also depends on
the association parameter $\phi$. To estimate the association parameter, we define
$Z_i^*$, a vector similar to $Z_i$, except that the pairwise products of the
two-examination indicators are replaced by the pairwise products of the multi-examination
indicators $\delta_{ijk}$, and solve the corresponding extra estimating equation for
$\phi$ simultaneously with those for $\alpha$ and $\beta$.
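The multi-examination indicators above generalize the two-indicator construction of Section 2; a small sketch with hypothetical examination times:

```python
def multi_indicators(T, exam_times):
    """delta_k = I(L_{k-1} < T <= L_k) for examination times
    0 = L_0 < L_1 < ... < L_K; returns a one-hot multinomial vector,
    all zeros when T exceeds the last examination time (right-censored)."""
    cuts = [0.0] + list(exam_times)
    return [1 if cuts[k - 1] < T <= cuts[k] else 0
            for k in range(1, len(cuts))]

assert multi_indicators(1.2, [1.0, 2.0, 3.0]) == [0, 1, 0]  # in (L1, L2]
assert multi_indicators(5.0, [1.0, 2.0, 3.0]) == [0, 0, 0]  # right-censored
assert sum(multi_indicators(0.4, [1.0, 2.0, 3.0])) == 1     # exactly one interval
```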
Acknowledgements
The research was supported in part by a grant from the National Science Council of Taiwan.
The authors thank one associate editor and reviewers for their comments.