A method for analyzing clustered interval-censored data based on Cox's model

Chew-Teng Kor, Kuang-Fu Cheng and Yi-Hau Chen

Methods for analyzing interval-censored data are well established. Unfortunately, these methods are inappropriate for studies with correlated data. In this paper, we focus on developing a method for analyzing clustered interval-censored data. Our method is based on Cox's proportional hazards model with a piecewise-constant baseline hazard function. The correlation structure of the data can be modeled by Clayton's copula, or by an independence model with a proper adjustment in the covariance estimation. We establish estimating equations for the regression parameters and the baseline hazards (and a parameter in the copula) simultaneously. Simulation results confirm that the point estimators approximately follow a multivariate normal distribution and that our proposed variance estimates are reliable. In particular, we found that the approach with the independence model worked well even when the true correlation model was derived from Clayton's copula. We applied our method to a family-based cohort study of pandemic H1N1 (pH1N1) influenza in Taiwan during 2009–2010. Using the proposed method, we investigate the impact of vaccination and family contacts on the incidence of pH1N1 influenza.

Keywords: cluster; copula model; Cox model; estimating equation; interval-censored

1. Introduction

Interval-censored data often occur in epidemiological, longitudinal, or biomedical studies in which subjects are followed periodically for the event of interest. In these studies, the event time T is not directly observable but may be detected within some periodic examination interval, denoted as (L, R], where L is the left examination time and R is the right examination time. For the special case in which subjects have only one examination time at R, the data are called current status data or 'case 1' interval-censored data.
Statistical methods for analyzing interval-censored data have been widely studied. For example, Peto [1], Turnbull [2], and Groeneboom and Wellner [3] proposed nonparametric maximum likelihood methods for estimating the distribution function with interval-censored data. Moreover, Groeneboom and Wellner [3] established asymptotic properties for the nonparametric maximum likelihood estimator. Finkelstein [4], on the other hand, developed a likelihood approach based on the Cox proportional hazards regression model for interval-censored data. Many authors extended her likelihood approach to other situations; for example, Huang [5] studied current status data and showed the maximum likelihood estimator of the regression parameter to be consistent and asymptotically normal with an appropriate convergence rate. Extending to bivariate interval-censored data, Goggins and Finkelstein [6] and Kim and Xue [7] both considered a marginal method with a working-independence assumption for estimating Cox's regression parameters. They followed the idea of Wei, Lin, and Weissfeld [8] to propose a sandwich-type covariance estimate. However, their method requires a large number of parameters to model the baseline survival function. To analyze clustered current status data, Cook and Tolusso [9] considered the use of second-order generalized estimating equations (GEE) and a copula model. In this paper, we focus on analyzing clustered interval-censored data on the basis of Cox's [10] proportional hazards model. Our goal is to estimate the baseline survival function and Cox's regression parameters. We consider a piecewise-constant hazard function for the baseline hazard to simplify our analysis. We employ the GEE approach for estimating the Cox regression parameters and a multinomial likelihood approach for estimating the baseline hazard parameters. Either an independence model or a parametric covariance model is used to model the within-cluster correlations.
If we apply the independence model, we account for the within-cluster correlation through the use of a sandwich-type covariance matrix estimate. A family-based cohort study of pandemic H1N1 influenza in Taiwan during 2009–2010 is used to demonstrate the application of our method. The pH1N1 influenza study was conducted by the Center for Infectious Disease Education and Research at China Medical University and aimed to study household transmission and vaccine effectiveness pertaining to seasonal influenza viruses. Family data from two cities of Taiwan, Taichung and Nantou, were collected with written consent. All subjects within each family were followed up for 1 year, and their blood samples were collected at two different time points to determine whether the subjects were infected. The infection status was determined by the level of the hemagglutination inhibition (HI) titer. Understanding the impact of risk factors such as family contacts and vaccination on the infection outcome can provide valuable information for controlling the disease.

We organize the paper as follows. We present the notation and methods in Sections 2 and 3. Sections 4 and 5 contain simulation studies and data analysis, respectively. Section 6 gives our concluding remarks.

2. The model

Suppose there are M families with ni subjects in the ith family. We denote Tij as the time to the occurrence of influenza for subject j in family i, measured from the beginning of the study. However, the Tij's are not directly observable. Instead, we have two examination-time points, denoted by Lij and Rij, to determine whether the subject has been infected. As the event time may be left-censored, interval-censored, or right-censored, we represent the event-time information with two binary variables (δ1ij, δ2ij), defined as

δ1ij = I(Tij ≤ Lij), δ2ij = I(Lij < Tij ≤ Rij), (1)

where I(·) is the usual indicator function.
There are only three possible outcomes for (δ1ij, δ2ij): (1, 0) if the subject was infected by influenza before the first examination time; (0, 1) if the subject was infected by influenza between the first and second examination times; and (0, 0) if the subject was not infected by influenza before the second examination time. We can summarize the observed data as {(Lij, Rij, δ1ij, δ2ij, Xij)}, where Xij represents a vector of covariates. Of note, the examination times considered in this paper can vary across subjects. In the case that we have only one examination time, Lij = Rij and δ2ij is always zero. On the other hand, if there are more than two examination-time points, our proposed methods can also be straightforwardly extended. Please see the details in the Appendix.

2.2. Hazard regression model

To model the event time, we assume that T follows a proportional hazards model (Cox [10]) given by

λ(t | X) = λ0(t) exp(β'X),

where λ(t | X) is the hazard function of T evaluated at time t given covariates X, λ0(t) is the baseline hazard function, and β is a set of regression parameters. We assume that the baseline hazard function is a piecewise-constant function with I jump points 0 = t0 < t1 < ... < tI on the nonnegative real line. We write the piecewise-constant baseline hazard function as

λ0(t) = αl for t ∈ (t(l-1), tl], l = 1, ..., I,

where αl is the hazard rate over (t(l-1), tl]. We respectively write the corresponding baseline cumulative hazard and survival functions as

Λ0(t) = Σl αl max{0, min(t, tl) − t(l-1)} and S0(t) = exp(−Λ0(t)).

Under the proportional hazards regression model, the survival function is S(t | X) = S0(t)^exp(β'X), and we can write the conditional expectations of (δ1ij, δ2ij) as

E(δ1ij | Xij) = 1 − S(Lij | Xij), E(δ2ij | Xij) = S(Lij | Xij) − S(Rij | Xij).

These conditional expectations play an important role in the estimation of the parameters. In the following, we construct two estimating equations separately for α = (α1, ..., αI) and β.

3. Estimation method

We follow two assumptions of Finkelstein [4]: (a) the censoring mechanism is independent of both the failure time and the covariates; and (b) all subjects will eventually fail unless censored. Under these assumptions, we propose using a system of pseudo-likelihood score equations for estimating the baseline hazard parameters and a GEE approach for estimating the regression parameters.
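The piecewise-constant baseline hazard of Section 2.2 and the conditional expectations it implies can be sketched in a few lines of code. This is a minimal illustration only, not the authors' implementation; the cut points, hazard rates, and covariate values below are hypothetical.

```python
import numpy as np

def cum_baseline_hazard(t, cuts, alpha):
    # Cumulative baseline hazard Lambda0(t) for a piecewise-constant hazard:
    # alpha[l] applies on the interval (cuts[l], cuts[l+1]].
    t = np.asarray(t, dtype=float)
    lo, hi = np.asarray(cuts[:-1]), np.asarray(cuts[1:])
    dur = np.clip(t[..., None], lo, hi) - lo   # time spent in each interval
    return (dur * np.asarray(alpha)).sum(axis=-1)

def cond_expectations(L, R, x, beta, cuts, alpha):
    # E[delta1 | x] = 1 - S(L | x) and E[delta2 | x] = S(L | x) - S(R | x)
    # under the Cox model S(t | x) = exp(-Lambda0(t) * exp(beta' x)).
    eta = np.exp(np.dot(beta, x))
    S_L = np.exp(-cum_baseline_hazard(L, cuts, alpha) * eta)
    S_R = np.exp(-cum_baseline_hazard(R, cuts, alpha) * eta)
    return 1.0 - S_L, S_L - S_R

# Hypothetical example: I = 2 interior cut points at 1 and 2.
cuts = [0.0, 1.0, 2.0, np.inf]
alpha = [0.1, 0.2, 0.3]
e1, e2 = cond_expectations(0.8, 1.9, x=[1.0, 0.0], beta=[0.5, -0.3],
                           cuts=cuts, alpha=alpha)
```

The three outcome probabilities for (δ1, δ2) then sum to one, since 1 − e1 − e2 equals the right-censoring probability S(R | x).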
We consider two approaches to account for the correlation: (a) assume that data within a cluster are independent and use a sandwich-type estimate for the covariance estimation, or (b) use Clayton's copula function to model the correlations. Simulation results confirm that both approaches worked well under our simulation conditions.

3.1. Estimating equations for the baseline hazards

We first assume that the regression parameters β are known. On this basis, the log pseudo-likelihood function of α is given by the multinomial form

ℓ(α) = Σi Σj {δ1ij log[1 − S(Lij | Xij)] + δ2ij log[S(Lij | Xij) − S(Rij | Xij)] + (1 − δ1ij − δ2ij) log S(Rij | Xij)},

where S(t | Xij) = S0(t)^exp(β'Xij). Taking the derivative with respect to each αl, we obtain the score equations ∂ℓ(α)/∂αl = 0, l = 1, ..., I.

3.2. Estimating equations for the regression parameters

We now assume that the baseline hazards are known. We employ the GEE approach of Liang and Zeger [11] to construct estimating equations for β. They are given by

U(β) = Σi Di' Vi^{-1} (δi − μi) = 0,

where δi is the vector collecting the indicators (δ1ij, δ2ij) over the subjects in family i, μi = E(δi | Xi) is its conditional expectation, Di = ∂μi/∂β, and Vi is a working covariance matrix of δi.

3.3. Correlation model and parameter estimates

Two approaches can be used to account for correlations.

(a) Independence model approach: we take the working covariance matrix Vi to be the covariance of δi obtained as if the subjects within a cluster were independent. This leads to the estimating equation for β given above with this working Vi. Let (α̂, β̂) be the solution of Equations (8) and (10). Under regularity conditions, we can prove that the estimator, centered at the true value, is asymptotically normal with mean zero and a covariance matrix that can be consistently estimated by a sandwich-type estimate.

(b) Covariance model approach: following Cook and Tolusso [9], we use Clayton's copula to model the covariance structure, where the Clayton model for a pair of event times within a cluster is

P(Tij > s, Tik > t) = [S(s | Xij)^{-θ} + S(t | Xik)^{-θ} − 1]^{-1/θ},

and the parameter θ measures the association between the event times within a cluster. Each element of the covariance matrix Vi in (9) is then given by the joint probability of the corresponding pair of indicators minus the product of their marginal expectations. The joint probability is determined by Clayton's copula and hence depends on the additional parameter θ. Using this expression of the covariance matrix Vi, the estimating equations for β take the form in (9). Let Zi be the vector of all pairwise products δaij δbik of the indicators within family i. The expectation of Zi, which also depends on θ, provides an extra estimating equation for the parameter vector.
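As a concrete illustration of the Clayton-copula covariance elements described in approach (b), the following sketch (hypothetical code, not the paper's implementation) computes the implied Kendall's tau and the covariance of a pair of infection indicators from their marginal probabilities.

```python
import numpy as np

def clayton_joint_survival(s_u, s_v, theta):
    # Joint survival P(T1 > s, T2 > t) under the Clayton copula, applied
    # to the marginal survival probabilities s_u = S1(s) and s_v = S2(t).
    return (s_u ** (-theta) + s_v ** (-theta) - 1.0) ** (-1.0 / theta)

def kendalls_tau(theta):
    # Kendall's tau implied by the Clayton copula: tau = theta / (theta + 2).
    return theta / (theta + 2.0)

def cov_indicator_pair(p_u, p_v, theta):
    # Covariance between I(T1 <= s) and I(T2 <= t): the joint probability
    # (from the joint survival, by inclusion-exclusion) minus the product
    # of the marginal probabilities.
    s_u, s_v = 1.0 - p_u, 1.0 - p_v
    joint_cdf = 1.0 - s_u - s_v + clayton_joint_survival(s_u, s_v, theta)
    return joint_cdf - p_u * p_v
```

For θ > 0 the covariance is positive, reflecting positive within-cluster association; as θ → 0 the copula approaches independence and the covariance vanishes.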
Under the copula model, we consider the augmented parameter vector (α, β, θ) and can obtain the parameter estimates by simultaneously solving the whole system of estimating equations. We use algorithms such as the Newton–Raphson algorithm or the bisection algorithm to solve this system of equations. Let (α̂, β̂, θ̂) be the solution. Under regularity conditions, we can prove that the estimator, centered at the true values, is asymptotically normal with mean zero and a covariance matrix that we can consistently estimate by the sandwich-type estimate.

4. Simulation studies

We conduct a simulation study to evaluate the finite-sample performance of the proposed methods.

4.1. Data generation

We considered three independent covariates. Given the number of families M and the family sizes ni, we generated the multivariate failure times of the subjects in the ith family from a joint distribution constructed with Clayton's copula. For each subject, we generated the marginal failure time according to the Cox regression model with the hazard function specified above. We generated the first examination time L from the exponential distribution exp(0.5) and determined the second examination time by R = L + A, where A ~ Uniform(0, 1.5). Under this setting, the right-, interval-, and left-censoring rates were about 29%, 40%, and 31%, respectively. We selected θ = 2, which corresponds to Kendall's tau equal to 0.5. The number of families was either 100 or 200, with family size equal to 1, 2, 3, or 5, or randomly selected between 2 and 6. The number of cut points for the piecewise-constant baseline hazard function was I = 5, and the cut points were placed at the sextiles of the examination-time points. Finally, we based all simulation results on 1000 replications.

4.2. Results

We show the results for the estimates of the Cox regression parameters in Table I.
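The data-generation scheme of Section 4.1 can be sketched as follows. This is a hypothetical reconstruction: the covariate distribution, the unit baseline hazard, and the reading of exp(0.5) as rate 0.5 are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def clayton_uniforms(n_clusters, size, theta, rng):
    # Marshall-Olkin construction: with W ~ Gamma(1/theta) and E ~ Exp(1),
    # U = (1 + E/W)^(-1/theta) has Uniform(0,1) margins and a Clayton copula.
    w = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n_clusters, 1))
    e = rng.exponential(size=(n_clusters, size))
    return (1.0 + e / w) ** (-1.0 / theta)

def generate_replicate(n_families, size, beta, theta, rng):
    # One replicate mimicking Section 4.1 with a single N(0,1) covariate
    # and unit baseline hazard (both assumptions for illustration).
    u = clayton_uniforms(n_families, size, theta, rng)   # S(T | x) values
    x = rng.standard_normal((n_families, size))
    t = -np.log(u) / np.exp(beta * x)   # invert S(t | x) = exp(-t * e^{beta x})
    L = rng.exponential(scale=2.0, size=t.shape)         # rate 0.5 -> scale 2
    R = L + rng.uniform(0.0, 1.5, size=t.shape)
    d1 = (t <= L).astype(int)
    d2 = ((t > L) & (t <= R)).astype(int)
    return t, L, R, x, d1, d2

rng = np.random.default_rng(1)
t, L, R, x, d1, d2 = generate_replicate(200, 3, beta=0.5, theta=2.0, rng=rng)
```

With θ = 2 the generated uniforms have Kendall's tau 0.5 within each cluster, matching the simulation setting described above.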
They include the empirical bias (Bias), the standard error (SE) of the point estimator, the average of the estimated standard errors (SSE) given by (11), and the empirical coverage probability of the 95% Wald confidence intervals based on the estimated standard error (CP*). We give the latter results to show the accuracy of the normal approximation. We see from Table I that all parameter estimates have relatively small bias and variance when using either the independence model or the covariance model. We can decrease the bias or variance by increasing the number of families and/or the family size. Unreported simulation results also indicated that the bias and variance of the regression estimates could be reduced by increasing the number of examination times from two to three. The variance estimate based on the independence model or the covariance model was also very similar to the true variance. The former result showed that, even when the correlation structure within a family is not known to the user, it is still possible to account for the correlation effect by using a proper sandwich-type covariance estimate. The coverage probabilities were also shown to be close to the nominal value, indicating the validity of the normal approximation. Overall, the differences between the independence model and the covariance model were small. This confirms that the approach based on the independence model is also reliable and that its efficiency loss is small. Regarding the association parameter (Kendall's tau) in Clayton's copula, we note that our estimate had simulated bias ranging from −0.035 to −0.027 and variance ranging from 0.1031 to 0.0723 when there were 100 families with size equal to 2, 3, or 5. When there were 200 families, the ranges of the bias and variance became (−0.0301, −0.0134) and (0.0773, 0.0416), respectively. This also showed that one can decrease the bias and variance of the association-parameter estimate by increasing the number of families and/or the family size.
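The sandwich-type covariance estimate whose performance is discussed above has the generic form A^{-1} B A^{-T}, where B stacks the per-family estimating-function contributions. A schematic version (hypothetical variable names, not the paper's notation):

```python
import numpy as np

def sandwich_covariance(scores, bread):
    # scores: (M, p) array; row i is family i's estimating-function
    # contribution U_i evaluated at the solution.
    # bread: (p, p) derivative matrix A = sum_i dU_i / dtheta.
    meat = scores.T @ scores            # B = sum_i U_i U_i'
    a_inv = np.linalg.inv(bread)
    return a_inv @ meat @ a_inv.T       # A^{-1} B A^{-T}

# Toy check: three clusters, two parameters.
scores = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
bread = 3.0 * np.eye(2)
cov = sandwich_covariance(scores, bread)
```

Because the meat term is built from cluster-level (not subject-level) contributions, the estimate remains valid under within-family correlation even when the working covariance is the independence model, which is the point the simulation results illustrate.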
Simulation results also indicated that the sandwich-type variance estimates were very similar to the true variances. In the simulations, their differences were no more than 0.52%.

5. Application to pH1N1 data in Taiwan

The pH1N1 dataset was obtained from an infectious disease study conducted in central Taiwan between 3/15/2009 and 12/1/2009. In this study, school children and their family members were recruited with written consent. One hundred four households from Taichung city and Nantou county were involved in the study, with 306 household members agreeing to have their blood samples collected and to answer basic questionnaires. The first time points for taking blood samples were between April and June 2009 (after the 2008–2009 influenza season), and the second time points were between September and October 2009 (before vaccination based on the 2009–2010 seasonal and 2009 pandemic influenza strains). The level of the HI titer was used to determine whether a subject was infected by the flu virus. Using the defined HI-titer test, we found that 66 subjects caught the influenza virus before their first inspection times, 132 subjects caught the influenza virus between their first and second inspection times, and 108 subjects did not catch any influenza virus before their second inspection times. The focus of this study was to investigate the antibody response against influenza infection and the factors affecting infection. Factors considered in this paper include age, gender, household size, vaccination history, mother/father/grandmother contact level (low or high), urban or rural area, and so forth. We apply the method proposed in this paper to the pH1N1 data. We fit the Cox regression model with a piecewise-constant baseline hazard function with jump points at 1.06, 3.60, 6.20, 6.53, and 8.54 months, which correspond to the dates 4/15/2009, 7/1/2009, 9/20/2009, 10/1/2009, and 11/30/2009.
Table II provides estimates of the Cox regression parameters based on the independence and covariance models. In this application, the estimate of the association parameter (Kendall's tau) in Clayton's copula is 0.426, and its 95% confidence interval is (0.279, 0.574). The results in Table II indicate that the approaches based on the independence and covariance models lead to similar conclusions. The analysis also reveals that subjects who were vaccinated in 2008 and had a higher HI titer response (HI > 40) could be better protected from catching the 2009 pandemic influenza. The hazard ratio under the independence model was 0.6839 with p-value = 0.042, and that under the covariance model based on Clayton's copula was 0.6776 with p-value = 0.026. We also detected a significant protective effect for subjects with a high frequency of mother–child contact. The corresponding hazard ratio was 0.6513 with p-value = 0.036 under the independence model and 0.6339 with p-value = 0.022 under the covariance model.

6. Discussion

In this paper, we have developed a method for analyzing clustered interval-censored data. We used Cox's proportional hazards model as the basis for our analysis. Clearly, we can also extend our method to other models. We derived estimating equations for the regression parameters, the parameters of the baseline hazard function, and an association parameter for the covariance model. If the independence model is used, the GEE method with sandwich-type variance estimation is suggested and has been shown to have satisfactory performance in our simulations. The variance estimate based on the correct covariance model is smaller than that based on the independence model. However, their difference seems small and is negligible in some scenarios. We have used a piecewise-constant hazard function to estimate the baseline hazard function. This approach is simple in nature. However, its performance depends on a judicious choice of cut points.
We used five cut points in our simulations and in the practical example, and this choice seemed to work well. Unreported simulation results also showed that using six cut points might reduce the variance of the regression estimates; however, the improvement was marginal. Clearly, this issue deserves more study in the future. Note that other copulas can also be applied instead of Clayton's copula for modeling the covariance structure. According to Cook and Tolusso [9], in the analysis of clustered current status data, the behavior of the regression estimates based on various copula models is similar, except for the estimates of the parameters pertaining to the copula models themselves. We expect that this conclusion continues to hold in the analysis of clustered interval-censored data. It is also important to remark that there are other approaches to accounting for correlated interval-censored data. For example, Hens et al. [12] studied the behavior of the bivariate-correlated gamma frailty model for 'case 1' interval-censored data (current status data). Scarlett et al. [13] discussed a parametric frailty model for the analysis of clustered and interval-censored failure time data following a Weibull distribution. To use these methods, however, it is necessary to provide a sensitivity analysis on the frailty model. The present work was motivated by the real problem of analyzing correlated interval-censored data. It is possible to extend the methodology to the analysis of correlated recurrent event data. Sometimes, we may also observe different covariate values at different examination times. Under this scenario, if it is meaningful, we can let the survival probability in each examination-time interval depend on the corresponding covariate value. However, the performance of the resulting regression estimates needs more careful study. These challenging but interesting topics will be studied in the future.
Appendix

Here, we extend our results to multiple examination times. Let Uijk be the kth examination time for the jth subject in the ith family, and define the corresponding indicator variables analogously to (1). The pseudo-likelihood score equations for the baseline hazards and the estimating equations for β using the GEE approach then take the same forms as in Section 3, with the conditional expectations determined by the survival function evaluated at the successive examination times. The joint probability of a pair of indicators again depends on the association parameter θ. To estimate the association parameter, we define a vector similar to Zi, except that the pairwise products of the two-examination indicators are replaced by those of the multiple-examination indicators.

Acknowledgements

The research was supported in part by a grant from the National Science Council of Taiwan. The authors thank an associate editor and the reviewers for their comments.