Supplementary On-line material for “Flexible modeling of the effects of continuous prognostic factors in relative survival”

Amel MAHBOUBI (1,2), Michal ABRAHAMOWICZ (1,2), Roch GIORGI (3), Christine BINQUET (4,5), Claire BONITHON-KOPP (4,5), Catherine QUANTIN (4,5)

1 Department of Epidemiology and Biostatistics, McGill University, Montreal, Que., Canada H3A 1A1
2 Division of Clinical Epidemiology, McGill University Health Centre, Montreal, Que., Canada H3A 1A1
3 LERTIM EA 3283, Faculté de Médecine, Université de la Méditerranée, Marseille, F-13005, France
4 Inserm, U866, Dijon, F-21079, France; Univ Bourgogne, Dijon, F-21079, France
5 Department of Biostatistics and Medical Informatics, Dijon University Hospital, France

The current document presents supplementary material that, for the sake of brevity, could not be included in our paper “Flexible modeling of the effects of continuous prognostic factors in relative survival”. Specifically, section 1 provides the formula for the full log likelihood of our model, and section 2 provides details of the iterative alternating conditional estimation algorithm outlined in section 2.3 of the manuscript. Section 3 then describes the procedures used to generate data for the simulation studies reported in section 3 of the manuscript. Finally, section 4 provides some additional results for the real-life colon cancer application discussed in section 4 of the manuscript.

1. Full log likelihood of our new model.

The general formula for the full log likelihood of the model of Estève et al [2] is:

$$\mathrm{LogL} = \sum_{i=1}^{n} \left\{ -\Lambda_c(t \mid z) + d_i \ln\left[ \lambda_e(t + x \mid z) + \lambda_c(t \mid z) \right] \right\}$$

where t, x and z are evaluated for the i-th subject, $d_i$ is the censoring indicator ($d_i = 1$ if the i-th observation is uncensored and $d_i = 0$ if it is censored), $\Lambda_c$ is the cumulative hazard corresponding to $\lambda_c$, and the values $\lambda_e(t + x \mid z)$ are fixed constants, corresponding to the expected hazard of ‘natural’ background mortality. Notice that accounting for time-varying background mortality modifies the likelihood, relative to crude survival analyses of all-cause observed mortality [2].
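As an illustration of how a log likelihood of this form can be evaluated, the following minimal Python sketch sums, over subjects, the negative cumulative disease-specific hazard and, for uncensored subjects, the log of the total hazard. It is not the authors' implementation: the function names, the tuple layout of `subjects`, and the trapezoidal integration of the disease-specific hazard are all assumptions made for illustration.

```python
import math

def cumulative_hazard(hazard, t, n_steps=1000):
    """Trapezoidal approximation of the integral of hazard(u) over [0, t]."""
    if t <= 0:
        return 0.0
    h = t / n_steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, n_steps):
        total += hazard(k * h)
    return total * h

def log_likelihood(subjects, disease_hazard):
    """Relative-survival log likelihood (sketch).

    subjects: iterable of (t, d, lam_e, z), where t is the follow-up time,
        d the censoring indicator (1 = death observed), lam_e the fixed
        expected (background) hazard at the end of follow-up, and z the
        covariate vector. This layout is hypothetical.
    disease_hazard(u, z): hypothetical callable returning the
        disease-specific hazard at time u for covariates z.
    """
    total = 0.0
    for t, d, lam_e, z in subjects:
        cum = cumulative_hazard(lambda u: disease_hazard(u, z), t)
        total += -cum + d * math.log(lam_e + disease_hazard(t, z))
    return total

# Toy check: constant disease-specific hazard of 0.1 per year.
toy_subjects = [(2.0, 1, 0.05, None), (5.0, 0, 0.02, None)]
toy_loglik = log_likelihood(toy_subjects, lambda u, z: 0.1)
```

For a censored subject only the cumulative hazard term contributes, which is why the background hazard of the second toy subject does not affect the result.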
To derive the full log likelihood of our flexible model (3), we need to replace $\lambda_c(t \mid z)$ by

$$\lambda_c(t \mid z) = \exp(\gamma(t)) \exp\Big( \sum_i \alpha_i(z_i)\, \beta_i(t) \Big),$$

where the respective functions are defined by equations (4a) to (4c) in the manuscript.

2. Details of the iterative alternating conditional estimation algorithm and initial values selection.

In our flexible relative survival model, the hazard function for disease-related mortality, $\lambda_c$, is modeled as:

$$\lambda_c(t \mid z) = \exp(\gamma(t)) \exp\Big( \sum_i \alpha_i(z_i)\, \beta_i(t) \Big)$$

We estimate this model through a 3-step iterative alternating conditional full maximum likelihood estimation (mle) algorithm (see section 2.3 of the manuscript). The 3 steps involve estimation of, respectively: (1) the baseline log hazard of disease-specific mortality for subjects with covariate vector z=0, i.e. $\gamma(t)$ defined in equation (4c) in the manuscript; (2) the non-linear functions $\alpha_i(z_i)$ defined in (4a); (3) the time-dependent functions $\beta_i(t)$ defined in (4b). At each step, the estimation of the coefficients that define the respective functions is conditional on the fixed values of all the coefficients of the two other functions, corresponding to their most recent estimates. In particular, in the main iteration v of the algorithm, (1) $\gamma(t)$ is estimated conditional on the values of $\alpha_i(z_i)$ and $\beta_i(t)$ estimated in the previous iteration v-1; (2) $\alpha_i(z_i)$ is estimated conditional on $\gamma(t)$ from iteration v and $\beta_i(t)$ from iteration v-1; whereas (3) $\beta_i(t)$ is estimated conditional on both $\gamma(t)$ and $\alpha_i(z_i)$ from steps (1) and (2) of the same iteration v. The iterations stop once the total reduction in the negative log likelihood (−LogL), relative to the previous main iteration, falls below 0.001. After step (2) of the v-th iteration, the estimated coefficients of $\alpha_i(z_i)$ are divided by a constant $D_i^v$ such that the unit range constraint is met.
Accordingly, at the beginning of step (3) of the same v-th iteration, the initial coefficients of $\beta_i(t)$ are multiplied by the same constant $D_i^v$, in order to preserve the value of the product of the two functions in (3).

The algorithm requires setting initial values for the coefficients estimated in each of the three steps. Below we explain how these initial values are obtained, for each step.

Step (1) of the 1st iteration: The coefficients of $\gamma(t)$ in (4c) are initially estimated assuming that none of the covariates has an effect. Accordingly, in this initial step $\gamma(t)$ is estimated as the unadjusted log hazard for the marginal distribution of time to disease-specific death, for the entire study population. To get the initial values of the coefficients in (4c), we first fit a simple univariate exponential survival model, with constant-over-time hazard $\hat{\lambda}$, to data on all subjects. We then set all initial values in (4c) to the estimated hazard rate: $\gamma_j = \hat{\lambda}$, j=1,…,6. Because the normalization property of B-splines ensures that $\sum_{j=1}^{k} B_j(t) = 1$ [24], these initial values imply $\gamma(t) = \hat{\lambda}$. We then obtain the mle estimates of the $\gamma_k$ coefficients in (4c), while fixing all coefficients for $\alpha_i(z_i)$ in (4a) and $\beta_i(t)$ in (4b) to 0, to estimate the unadjusted log hazard for disease-specific mortality in the entire study population.

Step (2) of the 1st iteration: The initial values for all coefficients of $\alpha_i(z_i)$ in (4a) are set to 0, which would imply that none of the covariates is associated with disease-specific mortality. Then, all the coefficients in (4a), for all p covariates, are simultaneously estimated, conditional on the mle $\hat{\gamma}(t)$ from step (1), while imposing the PH assumption on all the covariates. The latter assumption implies that, at step (2) of the 1st iteration, the time-dependent functions $\beta_i(t)$, for all covariates i=1,…,p, are replaced by a constant value of 1.
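The rescaling that links steps (2) and (3) can be illustrated with a small numerical sketch. It operates on function values rather than on spline coefficients, which is equivalent here because a spline is linear in its coefficients; the function name and the toy values are hypothetical and chosen only for illustration.

```python
def rescale_to_unit_range(alpha_values, beta_values):
    """Divide alpha by its empirical range and multiply beta by the same constant.

    alpha_values: values of the non-linear function over the sample range
    beta_values: values of the time-dependent function on a time grid
    Returns the rescaled values and the constant that was used.
    """
    div = max(alpha_values) - min(alpha_values)
    if div <= 0:
        raise ValueError("alpha must vary over the sample to be rescaled")
    alpha_star = [a / div for a in alpha_values]
    beta_star = [b * div for b in beta_values]
    return alpha_star, beta_star, div

# Hypothetical values, for illustration only.
alpha = [0.2, 1.0, 2.2]
beta = [1.5, 0.75]
alpha_star, beta_star, div = rescale_to_unit_range(alpha, beta)

# The product alpha(z) * beta(t) is unchanged for every (z, t) pair,
# while the range of the rescaled alpha values equals 1.
products_before = [a * b for a in alpha for b in beta]
products_after = [a * b for a in alpha_star for b in beta_star]
```

Because only the product of the two functions enters the hazard, this rescaling changes how the effect is apportioned between the two functions without changing the fitted model.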
In the 1st and all subsequent iterations, after the new values of $\hat{\alpha}_{i,l}$ have been estimated in step (2), they are transformed so as to respect the ‘scale constraint’ discussed above. Specifically, each of the coefficients $\hat{\alpha}_{i,l}$ estimated for a given covariate $z_i$ is divided by a constant $D_i^v$, specific to this covariate and to a given iteration v, calculated as the difference between the maximum and the minimum of the estimated values of $\alpha_i(z_i)$ across the sample range of $z_i$ values. Accordingly, the empirical range of the transformed values $\hat{\alpha}_i^*(z_i)$ equals 1.0.

Step (3) of the 1st iteration: The initial values for all coefficients of $\beta_i(t)$ in (4b) are set to the constant $D_i^v$ calculated in step (2) of the 1st iteration (see above). Given the aforementioned normalization of B-splines, this implies that $\beta_i(t) = D_i^v$ for all t. Thus, in the 1st iteration, the estimation of the time-dependent function starts with a constant log hazard ratio, which is consistent with the PH assumption. Then, all the coefficients in (4b), for all p covariates, are simultaneously estimated, conditional on the mle's of $\hat{\gamma}(t)$ from step (1) and of $\hat{\alpha}_i^*(z_i)$ from step (2).

In the subsequent iterations (v>1) of the algorithm, the initial values for the coefficients being estimated at each of the three steps correspond to their estimates, appropriately re-scaled for $\alpha_i(z_i)$ and $\beta_i(t)$, obtained at the corresponding step of the previous iteration v-1.

3. Details of data generation for simulation studies

This section provides details regarding the data generation procedures, and the underlying assumptions, that are outlined in section 3.1 of the manuscript.

3.1. Generation of prognostic factor vectors

We assumed that the study focuses on estimating the effect of the continuous covariate of primary interest, age at diagnosis, on the mortality hazard.
Age was generated from a normal distribution, conditional on the patient’s sex, with N[69.8;11.2] for men and N[71.4;12.6] for women, where the parameters were based on the actual colon cancer registry data. We also simulated three additional binary covariates for each individual. The three binary covariates were generated from the following binomial distributions: (a) sex with P(male)=P(female)=0.5, (b) cancer stage at diagnosis with P(stage II) = 0.61 vs P(stage III) = 0.39, (c) tumour location with P(left) = 0.62 vs P(right) = 0.38. Except for the dependence of age on sex, all other patient characteristics were assumed independent of each other. Once individual covariate vectors were generated for all N=2,000 study subjects, the resulting covariate matrix was kept fixed across all simulations, i.e. the same simulated population was used in all simulated scenarios.

3.2 Generation of times to cancer-related death

We assumed that the hazard of cancer-related death depended on all four aforementioned prognostic factors. When simulating the time to death due to cancer, the effects of the three binary covariates were assumed to be consistent with the conventional parametric PH hypothesis. In contrast, we assumed that age at diagnosis had, depending on the scenario, time-varying (TD) and/or non-linear (NL) effects on the log hazard. Yet, generating event times conditional on complex time-varying effects and/or covariates presents serious analytical challenges, and standard methods such as inversion usually do not apply in this context [20]. To avoid such difficulties, when generating time to cancer-related death, we relied on the ‘permutational algorithm’, proposed and validated in our previous studies [20, 31-32]. This algorithm permits generating event times, conditional on arbitrarily complex effects of covariates, while also controlling for the ‘marginal’ distribution of the event time in the entire study population.
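As a minimal sketch of the covariate generation described in section 3.1, the following Python function draws sex, age, stage and tumour location for N subjects. Several details are assumptions not stated in the text: the second parameter of N[·;·] is treated as a standard deviation, the 0/1 coding of each binary covariate is arbitrary, and the function name, dictionary keys and seed are hypothetical.

```python
import random

def generate_covariates(n=2000, seed=1):
    """Draw n covariate vectors; the 0/1 codings below are assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        male = rng.random() < 0.5                       # P(male) = 0.5
        # Age depends on sex; second parameter assumed to be the SD.
        mean, sd = (69.8, 11.2) if male else (71.4, 12.6)
        rows.append({
            "sex": int(male),                           # 1 = male (assumed coding)
            "age": rng.gauss(mean, sd),
            "stage": int(rng.random() < 0.39),          # 1 = stage III
            "location": int(rng.random() < 0.62),       # 1 = left colon
        })
    return rows

# Generated once and then kept fixed across all simulated samples.
covariates = generate_covariates()
```

Fixing the seed mirrors the design choice of reusing one covariate matrix in every simulated scenario.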
Our application of the permutational algorithm involved three steps [32]:

1) Generation of (un-censored) event times from the ‘marginal’ distribution, independent of covariates.

We first generated N expected times to cancer-related death, independently of covariates. The permutational algorithm allows the user to generate event times from an arbitrary marginal distribution, including fully non-parametric distributions [31-32]. Because, in prognostic studies of most cancers, the frequency of observed deaths gradually decreases during the follow-up, we generated the event times, for simplicity’s sake, assuming a linearly decreasing probability density function. Specifically, for each month (t) during the follow-up, the number of cancer-related deaths $N_p(t)$ was defined as: $N_p(t)$ = Floor[38-0.6*(t-1)], where t=1,…,60 and Floor[.] denotes the integer part of the expression, resulting in a total of $N_d$ = 1,200 generated expected times of cancer-related death. (Notice that this implied that, among the N=2,000 study subjects, up to 1,200 (60%) would have cancer-related deaths during the first 5 years after diagnosis, if there were no censoring and no earlier deaths due to other causes. Because we incorporated administrative censoring at 5 years of all subjects still ‘alive’, there was no need to generate event times beyond 5 years.) For each of the $N_p(t)$ events generated for a given month t, the exact time of cancer-related death (in days) was then simulated from the uniform distribution U[1,30].

2) Specifying the multivariable model for the hazard of cancer-related death.
We then assumed that the ‘true’ hazard of cancer-related death depends on the prognostic factors through the following general formula, which involved (i) the constant-over-time (PH) effects of all three binary covariates, as well as (ii) non-linear and time-dependent effects for age:

$$HR(t \mid z) = \exp\left\{ \ln(1.25)\,\mathrm{sex} + \ln(1.2)\,\mathrm{location} + \ln(3)\,\mathrm{stage} + \alpha_{age}(\mathrm{age})\,\beta_{age}(t) \right\} \qquad (1)$$

where HR(t|z) indicates the hazard ratio, at time t after diagnosis, for a subject with the covariate vector z, relative to the ‘reference’ population with z=0. Notice that, because of the time-varying effect of age, the individual hazard ratios also vary over time.

Three alternative scenarios for the ‘true’ functions $\alpha_{age}(\mathrm{age})$ and $\beta_{age}(t)$ were considered. (The ‘true’ functions are shown, by white circles, in the respective panels of Figure 2 in the manuscript.) Scenarios 1 and 2 have the same time-dependent function $\beta_{age}(t)$, and scenarios 1 and 3 have the same non-log-linear function $\alpha_{age}(\mathrm{age})$ for age at diagnosis. Specifically: 2 t 1.52 age 60 In scenario 1 : age ( z age ) and age (t ) exp 5 20 ln 0.5 age 60 In scenario 2 : age ( z age ) and age (t ) exp t 1 3 20 2 2 I age70 * age 70 15.21 100 t 1.52 ( t ) exp In scenario 3: age ( z age ) and age 5 4 where $I_{age \le 70}$ is the indicator function: $I_{age \le 70} = 1$ if age ≤ 70 and $I_{age \le 70} = 0$ if age > 70.

3) Matching times of cancer-related deaths with covariate vectors.

Finally, to ensure that the simulated mortality data reflected the ‘true’ model (1) for HR(t|z) specified in step 2) above, we matched each of the $N_d$ times generated in step 1) with one of the N covariate vectors generated in section 3.1 above. This matching was based on the probability, derived from the partial likelihood of the ‘true’ model (1), that the event at a given time t corresponded to a specific covariate vector [32]. Specifically, we first ordered the simulated event times from the earliest to the latest.
Next, for each subsequent event time $t_i$, i=1,…,$N_d$, simulated in step 1), we calculated, for each subject s among the $S_i$ subjects still at risk at time $t_i$, the following ratio:

$$p_{i,s} = \frac{HR_s(t_i \mid z_s)}{\sum_{p=1}^{S_i} HR_p(t_i \mid z_p)} \qquad (2)$$

where $HR_s(t_i \mid z_s)$ and $HR_p(t_i \mid z_p)$ are, respectively, the evaluations of the HR function specified in (1), at time $t_i$, for subjects s and p. Then, we sampled a single covariate vector, corresponding to subject $s_e$, from the $S_i$ subjects still at risk, using weighted random sampling with weights defined by $p_{i,s}$ in (2). This weighting ensured that the subject’s risk of an event at a given time was proportional to his/her true HR defined in (1) [31-32]. The covariate vector for the selected subject $s_e$ was then assigned the cancer-related death at the respective time $t_i$, and the subject was excluded from all subsequent risk sets, for t > $t_i$. The same process was continued until all $N_d$ events from step 1) were assigned to individual subjects.

3.3 Generation of times to natural death

Individual times of ‘natural’ death were generated, independently of times to cancer-related death, using the life tables for the Côte d’Or administrative region in France, stratified by sex and age. These tables allowed us to determine the probability that a given subject would die within the next year, conditional on sex and age, with age increased by 1 year for each subsequent year of follow-up. Once a natural death in a specific year was generated for a given subject, the exact time to death within this year (in days) was generated from the uniform U[0,365] distribution. As for cancer-related death, this process was continued only until the end of the fifth year of follow-up, i.e. the time of the administrative censoring at the end of the study.

3.4 Creating final data for analysis

The final step involved creating the final dataset, for the purpose of analyses.
This required determining, for each individual, whether a death was ‘observed’ during the follow-up and, if so, the time of death. All subjects for whom no time of death was assigned in either section 3.2 or 3.3 above were censored at 5 years. If, until the end of the 5-year follow-up, a subject was assigned only one of the two hypothetically possible causes of death, we recorded, in the final dataset, a death at the corresponding time. Finally, for those subjects who were assigned separate times to death due to (i) cancer and (ii) other causes (in sections 3.2 and 3.3, respectively), the death at the earlier of the two times was recorded. It is important to note that the final analyzable dataset did not discriminate between the two competing causes of death.

For each of the three scenarios, corresponding to different ‘true’ effects of age at diagnosis (see beginning of section 3.2 above), one hundred data sets were generated. Notice that both the covariate matrix Z and the number of times to cancer-related death, generated in section 3.2 above, were kept constant across all simulations. However, in each simulated sample, the matching of these two components, as well as the generation of times to natural death (section 3.3), was performed de novo, independently of any other simulated sample, resulting in random sample-to-sample variation in the ‘observed’ mortality patterns [31].

4. Selected empirical results for the colon cancer application.

Table 1 summarizes the distribution of the prognostic factors used in multivariable analyses of mortality, in the cohort of 813 stage I colon cancer patients described in section 4 of the manuscript. For each subgroup of patients, the last 2 columns of Table 1 show the number and proportion of patients who died of any cause during the first 5 years after colon cancer diagnosis.
Figure 1 shows the cubic spline estimate of the baseline hazard of colon cancer-specific mortality, estimated in the cohort of 813 stage I colon cancer patients discussed in the real-life application (see section 4 of the manuscript). The estimate shows a very high initial risk of death, right after diagnosis, reflecting high mortality due to post-surgery complications. The low hazard of mortality after the first few months of follow-up implies that the later (after 6 months) segments of the time-dependent estimates in Figure 2a were supported by only a few deaths, which explains the over-fit bias seen in this figure.

Figure 2 allows us to assess the robustness of the estimates, for the cohort of 813 stage I colon cancer patients discussed in the real-life application (see section 4 of the manuscript), with respect to the order of steps 2 and 3 of our alternating conditional algorithm (see end of section 2.3). For both continuous covariates, age (panels (a) and (b)) and calendar year of diagnosis (panels (c) and (d)), the estimates of both β(t) and α(z) are virtually identical when the order of the two steps is reversed. Furthermore, for calendar year of diagnosis, panels (d) and (c) show that, respectively, (i) cancer-related mortality decreased gradually, in an approximately linear fashion, between 1976 and 2000, and (ii) this mortality reduction applied to both early mortality, soon after diagnosis, and later mortality, up to 5 years after diagnosis, which is consistent with the PH assumption.

References (numbered as in the manuscript)

20. Abrahamowicz M, MacKenzie T, Esdaile JM. Time-dependent hazard ratio: modelling and hypothesis testing with application in lupus nephritis. Journal of the American Statistical Association 1996; 91: 1432-1439.
24. de Boor C. A Practical Guide to Splines. New York, 1978.
31. Sylvestre MP, Abrahamowicz M. Comparison of algorithms to generate event times conditional on time-dependent covariates. Stat Med 2008; 27: 2618-2634.
32. MacKenzie T, Abrahamowicz M. Marginal and hazard ratio specific random data generation: Applications to semi-parametric bootstrapping. Statistics and Computing 2002; 12: 245-252.

Table 1: Distributions of the colon cancer prognostic factors and the corresponding all-causes mortality

| Prognostic factor | Category | Number | (%)* | Deaths at 5 years | (%)† |
|---|---|---|---|---|---|
| Age | ≤ 64 | 240 | 30 | 30 | 12.5 |
| | 65 - 74 | 289 | 35 | 68 | 23.5 |
| | ≥ 75 | 284 | 35 | 121 | 42.6 |
| Gender | Woman | 330 | 41 | 67 | 20.3 |
| | Man | 483 | 59 | 152 | 31.5 |
| Tumour location | Right colon | 164 | 20 | 43 | 26.2 |
| | Left colon | 649 | 80 | 176 | 27.1 |
| Period of diagnosis | 1976 - 1980 | 98 | 12 | 34 | 34.7 |
| | 1980 - 1985 | 151 | 19 | 51 | 33.8 |
| | 1986 - 1990 | 183 | 22 | 45 | 24.6 |
| | 1991 - 1995 | 200 | 25 | 51 | 25.5 |
| | 1996 - 2000 | 181 | 22 | 38 | 21.0 |
| TOTAL | | 813 | | 219 | 26.9 |

* % of all 813 patients.
† % of patients in a given category who died within the first 5 years after diagnosis.
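For readers who wish to experiment with the data generation of section 3.2, steps 1) and 3) of the permutational algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: months are taken to be 30 days long, the day within a month is drawn as an integer, and the function names and the `hazard_ratio(t, z)` callable are hypothetical. Any hazard ratio function, including one of the form given in model (1), can be passed in.

```python
import math
import random

def monthly_death_counts(n_months=60):
    """Np(t) = Floor[38 - 0.6*(t-1)], computed in integer arithmetic."""
    return [(380 - 6 * (t - 1)) // 10 for t in range(1, n_months + 1)]

def marginal_event_times(rng, n_months=60):
    """Step 1): event times in days, drawn independently of covariates."""
    times = []
    for month, count in enumerate(monthly_death_counts(n_months), start=1):
        for _ in range(count):
            day = rng.randint(1, 30)              # integer day in U[1,30] (assumption)
            times.append(30 * (month - 1) + day)  # 30-day months (assumption)
    return sorted(times)

def assign_events(times, covariates, hazard_ratio, rng):
    """Step 3): match each ordered event time to one subject still at risk.

    At each event time a subject is sampled with probability proportional
    to hazard_ratio(t, z), as in the ratio (2); the selected subject then
    leaves all later risk sets.
    Returns a dict mapping subject index -> assigned event time.
    """
    at_risk = list(range(len(covariates)))
    assigned = {}
    for t in sorted(times):
        weights = [hazard_ratio(t, covariates[s]) for s in at_risk]
        pos = rng.choices(range(len(at_risk)), weights=weights, k=1)[0]
        assigned[at_risk.pop(pos)] = t
    return assigned

# Small illustrative run with a hypothetical PH-only hazard ratio.
rng = random.Random(2024)
toy_covariates = [{"stage": i % 2} for i in range(20)]
toy_times = [5, 12, 12, 40, 77]
toy_assigned = assign_events(
    toy_times, toy_covariates,
    lambda t, z: math.exp(math.log(3) * z["stage"]), rng)
```

Evaluated literally, the floor formula gives 38 deaths in the first month and a total of 1,194 event times over 60 months, close to the figure of 1,200 quoted in section 3.2. Resampling weights are recomputed at every event time because, with a time-dependent effect, each subject's hazard ratio changes over the follow-up.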