
Supplementary On-line material for “Flexible modeling of the effects of continuous
prognostic factors in relative survival”
Amel MAHBOUBI1,2, Michal ABRAHAMOWICZ1,2, Roch GIORGI3, Christine BINQUET4,5,
Claire BONITHON-KOPP4,5, Catherine QUANTIN4,5
1 Department of Epidemiology and Biostatistics, McGill University, Montreal, Que., Canada H3A 1A1
2 Division of Clinical Epidemiology, McGill University Health Centre, Montreal, Que., Canada H3A 1A1
3 LERTIM EA 3283, Faculté de Médecine, Université de la Méditerranée, Marseille, F-13005, France.
4 Inserm, U866, Dijon, F-21079, France; Univ Bourgogne, Dijon, F-21079, France.
5 Department of Biostatistics and Medical Informatics, Dijon University Hospital, France
The current document presents supplementary material that, for the sake of brevity, could not be included
in our paper Flexible modeling of the effects of continuous prognostic factors in relative survival.
Specifically, section 1 provides the formula for the full log likelihood of our model, and section 2 provides
details of the iterative alternating conditional estimation algorithm outlined in section 2.3 of the
manuscript. Section 3 then describes the procedures used to generate data for the simulation studies
reported in section 3 of the manuscript. Finally, section 4 provides some additional results for the
real-life colon cancer application discussed in section 4 of the manuscript.
1. Full log likelihood of our new model.
The general formula for the full log likelihood of the model of Estève et al [2] is:
\[ \mathrm{LogL} = \sum_{i=1}^{n} \left\{ -\Lambda_c(t_i \mid z_i) + d_i \ln\left[ \lambda_e(t_i + x_i \mid z_i) + \lambda_c(t_i \mid z_i) \right] \right\} \]
where $\Lambda_c(t \mid z) = \int_0^t \lambda_c(u \mid z)\,du$ is the cumulative disease-related hazard, $d_i$ is the
censoring indicator ($d_i=1$ if the i-th observation is uncensored and $d_i=0$ if the i-th observation is
censored), and $\lambda_e(t + x \mid z)$ are fixed constants corresponding to the expected hazard of
‘natural’ background mortality. Notice that accounting for time-varying background mortality modifies
the likelihood, relative to crude survival analyses of all-cause observed mortality [2].
To derive the full log likelihood of our flexible model (3), we need to replace $\lambda_c(t \mid z)$ by
\[ \lambda_c(t \mid z) = \exp(\gamma(t)) \exp\Big( \sum_i \alpha_i(z_i)\,\beta_i(t) \Big), \]
where the respective functions are defined by equations (4a) – (4c) in the manuscript.
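As an illustration only, the log likelihood above can be evaluated numerically as in the following minimal sketch. It assumes that the first term is the cumulative disease-related hazard, approximated here by the trapezoidal rule, and that the expected hazard of each subject is supplied as a fixed number. The function name `log_likelihood` and its arguments are hypothetical and are not taken from the manuscript.

```python
import math


def log_likelihood(times, events, expected_hazards, lambda_c, n_grid=200):
    """Sum over subjects of -Lambda_c(t_i) + d_i * ln[lambda_e,i + lambda_c(t_i)].

    times            : follow-up times t_i
    events           : d_i (1 = death observed, 0 = censored)
    expected_hazards : expected (background) hazard of each subject at t_i
    lambda_c         : callable (t, i) -> disease-related hazard of subject i at time t
    """
    total = 0.0
    for i, (t, d, lam_e) in enumerate(zip(times, events, expected_hazards)):
        # Cumulative disease-related hazard on [0, t] by the trapezoidal rule.
        step = t / n_grid
        grid = [lambda_c(k * step, i) for k in range(n_grid + 1)]
        cumulative = step * (sum(grid) - 0.5 * (grid[0] + grid[-1]))
        total -= cumulative
        if d == 1:
            # Only uncensored subjects contribute the log total-hazard term.
            total += math.log(lam_e + lambda_c(t, i))
    return total


# Toy usage with a constant disease-related hazard (illustrative numbers only).
example = log_likelihood([2.0, 3.0], [1, 0], [0.02, 0.03], lambda t, i: 0.1)
```

In the flexible model, `lambda_c` would return exp(γ(t)) multiplied by the exponential of the sum of the covariate terms for subject i.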
2. Details of the iterative alternating conditional estimation algorithm and initial values selection.
In our flexible relative survival model, the hazard function for disease-related mortality, $\lambda_c$, is
modeled as:
\[ \lambda_c(t \mid z) = \exp(\gamma(t)) \exp\Big( \sum_i \alpha_i(z_i)\,\beta_i(t) \Big) \]
We estimate this model through a 3-step iterative alternating conditional full maximum likelihood
estimation (mle) algorithm (see section 2.3 of the manuscript). The 3 steps involve estimation of,
respectively:
(1) baseline log hazard of disease-specific mortality, for subjects with covariate vector z=0, i.e. $\gamma(t)$
defined in equation (4c) in the manuscript;
(2) non-linear functions $\alpha_i(z_i)$ defined in (4a);
(3) time-dependent functions $\beta_i(t)$ defined in (4b).
At each step, the estimation of the coefficients that define respective functions is conditional on the
fixed values of all the coefficients of the two other functions, corresponding to their most recent
estimates. In particular, in the main iteration v of the algorithm, (1) $\gamma(t)$ is estimated conditional on the
values of $\alpha_i(z_i)$ and $\beta_i(t)$ estimated in the previous iteration v-1; (2) $\alpha_i(z_i)$ is estimated
conditional on $\gamma(t)$ from iteration v and $\beta_i(t)$ from iteration v-1; whereas (3) $\beta_i(t)$ is estimated
conditional on both $\gamma(t)$ and $\alpha_i(z_i)$ from steps (1) and (2) of the same iteration v. The iterations stop
once the total reduction in the negative log likelihood (LogL), relative to the previous main iteration, is below 0.001.
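The control flow of the three-step iteration and its stopping rule can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: the names `alternating_estimation`, `neg_log_lik` and `steps` are hypothetical, and each step is represented by a user-supplied function that re-estimates one block of coefficients while the others stay fixed. A toy quadratic objective stands in for the negative log likelihood.

```python
def alternating_estimation(neg_log_lik, params, steps, tol=0.001, max_iter=100):
    """Cycle through conditional updates until the objective stops improving.

    neg_log_lik : callable(params) -> value to be minimised
    params      : dict holding the current coefficient blocks
    steps       : callables; each returns params with ONE block re-estimated,
                  conditional on the most recent values of the other blocks
    """
    previous = neg_log_lik(params)
    current = previous
    iteration = 0
    for iteration in range(1, max_iter + 1):
        for step in steps:  # steps (1), (2), (3), always in this order
            params = step(params)
        current = neg_log_lik(params)
        if previous - current < tol:  # total reduction over one main iteration
            break
        previous = current
    return params, current, iteration


# Toy objective (illustrative only): each block has a closed-form conditional minimiser.
def toy_objective(p):
    return (p["g"] - 1) ** 2 + (p["a"] - 2) ** 2 + (p["b"] + 1) ** 2 + 0.5 * p["g"] * p["a"]


toy_steps = [
    lambda p: {**p, "g": 1 - 0.25 * p["a"]},  # minimise over g, given a and b
    lambda p: {**p, "a": 2 - 0.25 * p["g"]},  # minimise over a, given g and b
    lambda p: {**p, "b": -1.0},               # minimise over b, given g and a
]
start = {"g": 0.0, "a": 0.0, "b": 0.0}
fitted, final_value, n_iter = alternating_estimation(toy_objective, start, toy_steps)
```

Because every step can only lower the objective given the other blocks, the reduction per main iteration shrinks until it falls below the tolerance.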
After step (2) of the v-th iteration, the estimated coefficients of $\alpha_i(z_i)$ are divided by a constant $D_i^v$
such that the unit range constraint is met. Accordingly, at the beginning of step (3) of the same v-th
iteration, the initial coefficients of $\beta_i(t)$ are multiplied by the same constant $D_i^v$, in order to preserve
the value of the product of the two functions in (3).
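The rescaling described above can be illustrated with a short sketch. It assumes that both functions are linear in their coefficients (as for spline expansions), so that dividing or multiplying all coefficients by a constant rescales the whole function by that constant. The function name `rescale_to_unit_range` and the numbers are hypothetical.

```python
def rescale_to_unit_range(alpha_coefs, beta_coefs, alpha_values):
    """Enforce a unit range for the non-linear function of one covariate.

    alpha_coefs  : estimated coefficients of the non-linear function
    beta_coefs   : current coefficients of the time-dependent function
    alpha_values : the non-linear function evaluated at the sample values of the covariate
    """
    div = max(alpha_values) - min(alpha_values)  # empirical range of the estimate
    if div == 0:
        raise ValueError("the estimated function is constant; its range cannot be rescaled")
    new_alpha = [a / div for a in alpha_coefs]  # range becomes 1.0
    new_beta = [b * div for b in beta_coefs]    # keeps the product of the two functions unchanged
    return new_alpha, new_beta, div


# Illustrative numbers only.
new_alpha, new_beta, div = rescale_to_unit_range([2.0, 4.0], [0.5], [1.0, 5.0])
```

Here the estimated function ranges from 1.0 to 5.0, so the constant is 4.0: the first set of coefficients shrinks by a factor of 4 and the second grows by the same factor.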
The algorithm requires setting initial values for the coefficients estimated in each of the three steps.
Below we explain how these initial values are obtained, for each step.
Step (1) of the 1st iteration:
The coefficients of $\gamma(t)$ in (4c) are initially estimated assuming that all the covariates have no effect.
Accordingly, in this initial step $\gamma(t)$ is estimated as the un-adjusted log hazard for the marginal
distribution of time to disease-specific death, for the entire study population. To get the initial values of
the coefficients in (4c), we first fit a simple univariate exponential survival model with constant-over-time
hazard $\hat{\lambda}$ to data on all subjects. We then set all initial values in (4c) to the estimated hazard rate:
$\gamma_j = \hat{\lambda}$, j=1,…,6. Because of the normalization property of B-splines, which ensures that
\[ \sum_{j=1}^{k} B_j(t) = 1 \]
[24], these initial values imply $\gamma(t) = \hat{\lambda}$.
We then obtain the mle of the $\gamma_k$ coefficients in (4c), while fixing all coefficients for $\alpha_i(z_i)$ in
(4a) and $\beta_i(t)$ in (4b) to 0, to estimate the un-adjusted log hazard for disease-specific mortality in the
entire study population.
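The normalization property used above (the B-spline basis functions sum to 1, so equal coefficients give a constant function) can be checked numerically with the Cox-de Boor recursion. The sketch below assumes a cubic basis with six functions on a 0 to 5 year follow-up with two equally spaced interior knots; the degree, the knot positions and the starting value 0.2 are illustrative assumptions, not values taken from the manuscript.

```python
def bspline_basis(j, degree, knots, t):
    """Cox-de Boor recursion for the j-th B-spline basis function (0-indexed)."""
    if degree == 0:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    value = 0.0
    left_width = knots[j + degree] - knots[j]
    if left_width > 0:  # skip terms with zero-width support (repeated knots)
        value += (t - knots[j]) / left_width * bspline_basis(j, degree - 1, knots, t)
    right_width = knots[j + degree + 1] - knots[j + 1]
    if right_width > 0:
        value += ((knots[j + degree + 1] - t) / right_width
                  * bspline_basis(j + 1, degree - 1, knots, t))
    return value


def spline_value(coefs, degree, knots, t):
    """Linear combination of the basis functions at time t."""
    return sum(c * bspline_basis(j, degree, knots, t) for j, c in enumerate(coefs))


DEGREE = 3                                           # assumed cubic basis
KNOTS = [0.0] * 4 + [5 / 3, 10 / 3] + [5.0] * 4      # assumed knots, 6 basis functions
initial_value = 0.2                                  # hypothetical common starting value
coefs = [initial_value] * 6

basis_sums = [sum(bspline_basis(j, DEGREE, KNOTS, t) for j in range(6))
              for t in (0.0, 0.7, 2.5, 4.99)]
constant_check = [spline_value(coefs, DEGREE, KNOTS, t) for t in (0.0, 0.7, 2.5, 4.99)]
```

With this half-open convention for the degree-0 functions, the identity holds for times in [0, 5); the right end point would need separate handling.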
Step (2) of the 1st iteration:
The initial values for all coefficients of $\alpha_i(z_i)$ in (4a) are set to 0, which would imply that none of the
covariates has an association with disease-specific mortality. Then, all the coefficients in (4a), for all p
covariates, are simultaneously estimated, conditional on the mle $\hat{\gamma}(t)$ from step (1), while imposing
the PH assumption on all the covariates. The latter assumption implies that, at step (2) of the 1st
iteration, the time-dependent functions $\beta_i(t)$, for all covariates i=1,…,p, are replaced by a constant
value of 1.
In the 1st and all subsequent iterations, after the new values of $\hat{\alpha}_{i,l}$ have been estimated in step (2), they
are transformed so as to respect the ‘scale constraint’ discussed above. Specifically, each of the
coefficients $\hat{\alpha}_{i,l}$ estimated for a given covariate $z_i$ is divided by a constant $D_i^v$, specific to this
covariate and to a given iteration v, calculated as the difference between the maximum and the minimum
of the estimated values of $\alpha_i(z_i)$ across the sample range of $z_i$ values. Accordingly, the empirical
range of the transformed values $\hat{\alpha}_i^*(z_i)$ equals 1.0.
Step (3) of the 1st iteration :
The initial values for all coefficients of $\beta_i(t)$ in (4b) are set to the constant $D_i^v$, calculated in step (2) of
the 1st iteration (see above). Given the aforementioned normalization of B-splines, this implies that
$\beta_i(t) = D_i^v$ for all t. Thus, in the 1st iteration, the estimation of the time-dependent function starts with a
constant log hazard ratio, which is consistent with the PH assumption. Then, all the coefficients in (4b),
for all p covariates, are simultaneously estimated, conditional on the mle’s $\hat{\gamma}(t)$ from step (1) and
$\hat{\alpha}_i^*(z_i)$ from step (2).
In the subsequent iterations (v>1) of the algorithm, the initial values for the coefficients being
estimated at each of the three steps correspond to their estimates, appropriately re-scaled for $\alpha_i(z_i)$
and $\beta_i(t)$, obtained at the corresponding step of the previous iteration v-1.
3. Details of data generation for simulation studies
This section provides details regarding data generation procedures, and underlying assumptions, that
are outlined in section 3.1 of the manuscript.
3.1. Generation of prognostic factor vectors
We assumed the study focuses on estimating the effect of the continuous covariate of primary interest,
age at diagnosis, on the mortality hazard. Age was generated from a normal distribution, conditional on
the patient’s sex, with N[69.8;11.2] for men and N[71.4;12.6] for women, where the parameters were
based on the actual colon cancer registry data. We also simulated three additional binary covariates, for
each individual. The three binary covariates were generated from the following binomial distributions:
(a) sex with P(male)=P(female)=0.5, (b) cancer stage at diagnosis with P(stage II) = 0.61 vs P(stage III)
= 0.39, (c) tumour location with P(left) = 0.62 vs P(right)=0.38. Except for the dependence of the age
on the sex, all other patient characteristics were assumed independent of each other. Once individual
covariate vectors were generated for all N=2,000 study subjects, the resulting covariate matrix was kept
fixed across all simulations, i.e. the same simulated population was used in all simulated scenarios.
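A minimal sketch of this covariate generation is given below. It assumes that the second parameter of N[.;.] is a standard deviation (the text does not say whether it is a standard deviation or a variance) and uses a hypothetical 0/1 coding with 1 = male, 1 = stage III and 1 = right colon; neither the coding nor the function name `generate_covariates` comes from the manuscript.

```python
import random


def generate_covariates(n=2000, seed=1):
    """Generate one fixed covariate matrix as a list of dicts (one per subject)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        male = rng.random() < 0.5
        # Age depends on sex; the second parameter is ASSUMED to be a standard deviation.
        mean, sd = (69.8, 11.2) if male else (71.4, 12.6)
        rows.append({
            "sex": int(male),                       # assumed coding: 1 = male
            "age": rng.gauss(mean, sd),
            "stage": int(rng.random() < 0.39),      # assumed coding: 1 = stage III
            "location": int(rng.random() < 0.38),   # assumed coding: 1 = right colon
        })
    return rows


covariates = generate_covariates()
```

Fixing the seed mirrors the design choice of keeping one covariate matrix for all simulated scenarios.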
3.2 Generation of times to cancer-related death
We assumed that the hazard of cancer-related death depended on all four aforementioned prognostic
factors. When simulating the time to death due to cancer, the effects of the three binary covariates were
assumed to be consistent with the conventional parametric PH hypothesis. In contrast, we assumed that
age at diagnosis had, depending on the scenario, time-varying (TD) and/or non-linear (NL) effects on
the log hazard. Yet, generating event times conditional on complex time-varying effects and/or
covariates presents serious analytical challenges, and standard methods such as inversion usually do not
apply in this context [20].
To avoid such difficulties, when generating time to cancer-related death, we relied on the
‘permutational algorithm’, proposed and validated in our previous studies [20, 31-32]. This algorithm
permits generating event times, conditional on arbitrarily complex effects of covariates, while
controlling also for the ‘marginal’ distribution of the event time in the entire study population. Our
application of the permutational algorithm involved three steps [32]:
1) Generation of (un-censored) event times from the ‘marginal’ distribution, independent of covariates.
We first generated N expected times to cancer-related death, independently of covariates. The
permutational algorithm allows the user to generate event times from an arbitrary marginal distribution,
including fully non-parametric distributions [31-32]. Because, in prognostic studies of most cancers,
the frequency of observed deaths gradually decreases during the follow-up, for simplicity’s sake, we
have generated the event times assuming a linearly decreasing probability density function. Specifically,
for each month (t) during the follow up, the number of cancer-related deaths Np(t) was defined as:
Np(t)=Floor[38-0.6*(t-1)], where t=1,…,60, and Floor[.] means an integer part of the expression,
resulting in a total of Nd = 1,200 generated expected times of cancer-related death. (Notice that this
implied that, among N=2,000 study subjects, up to 1,200 (60%) would have cancer-related deaths
during the first 5 years after diagnosis, if there were no censoring and no earlier deaths due to other
causes. Because we incorporated administrative censoring at 5 years, of all subjects still ‘alive’, there
was no need to generate event times beyond 5 years). For each of the Np(t) events generated for a given
month t, the exact time of cancer-related death (in days) was then simulated from the uniform
distribution U[1,30].
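The monthly counts and the within-month times can be sketched as follows, assuming 30-day months and a continuous uniform draw for the day; both are assumptions made for illustration, as are the function names. Note that, under a literal reading of the Floor formula, the 60 monthly counts add up to 1,194, close to but not exactly the reported Nd = 1,200, so the exact rounding convention used by the authors may differ from the one assumed here.

```python
import random


def monthly_death_counts(n_months=60):
    """Floor[38 - 0.6*(t-1)] for t = 1..n_months, in exact integer arithmetic."""
    return [(380 - 6 * (t - 1)) // 10 for t in range(1, n_months + 1)]


def marginal_event_times(seed=1):
    """Event times in days since diagnosis, drawn independently of covariates."""
    rng = random.Random(seed)
    times = []
    for month, count in enumerate(monthly_death_counts(), start=1):
        for _ in range(count):
            day = rng.uniform(1, 30)                # day within the month, U[1,30]
            times.append((month - 1) * 30 + day)    # ASSUMES 30-day months
    return sorted(times)


counts = monthly_death_counts()
event_times = marginal_event_times()
```

Integer arithmetic is used for the floor so that values such as 38 - 0.6*5 are not affected by floating-point rounding.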
2) Specifying the multivariable model for the hazard of cancer-related death. We then assumed that the
‘true’ hazard of cancer-related death depends on the prognostic factors, through the following general
formula, which involved (i) the constant-over-time (PH) effects of all three binary covariates, as well as
(ii) non-linear and time-dependent effects for age:

\[ HR(t \mid z) = \exp\left[ \ln(1.25)\,sex + \ln(1.2)\,location + \ln(3)\,stage + \alpha_{age}(age)\,\beta_{age}(t) \right] \qquad (1) \]
where HR(t|z) indicates the hazard ratio, at time t after diagnosis, for a subject with the covariate vector
z, relative to the ‘reference’ population with z=0. Notice that, because of the time-varying effect of age,
the individual hazard ratios also vary over time.
Three alternative scenarios for the ‘true’ functions $\alpha_{age}(age)$ and $\beta_{age}(t)$ were considered. (The ‘true’
functions are shown, by white circles, in the respective panels of Figure 2 in the manuscript). Scenarios
1 and 2 have the same time-dependent function $\beta_{age}(t)$, and scenarios 1 and 3 have the same non-log-linear
function $\alpha_{age}(age)$ for age at diagnosis.
Specifically:
2
 t  1.52 
 age  60 
In scenario 1 :  age ( z age )  
 and  age (t )  exp 

5
 20 


  ln 0.5 
 age  60 
In scenario 2 :  age ( z age )  
 and  age (t )  exp  t 1 

 3 
 20 
 
2
2




 I age70 *  age  70   15.21 




100
 t  1.52 



(
t
)

exp
In scenario 3:  age ( z age )  
and



age
5
4








where $I_{age \le 70}$ is the indicator function: $I_{age \le 70}=1$ if age ≤ 70 and $I_{age \le 70}=0$ if age > 70.
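The hazard ratio in (1) can be written as a small function. The sketch below assumes that the terms inside the exponential are added and that the age effect enters as the product of the non-linear and time-dependent functions, as in model (3). The two age functions are passed in as arguments; the placeholder functions used in the example are purely illustrative and are not the ‘true’ functions of any scenario.

```python
import math


def hazard_ratio(t, z, alpha_age, beta_age):
    """HR(t | z) relative to the reference subject with z = 0.

    z         : dict with binary 'sex', 'location', 'stage' and continuous 'age'
    alpha_age : callable(age) -> non-linear effect of age
    beta_age  : callable(t)   -> time-dependent multiplier of the age effect
    """
    ph_part = (math.log(1.25) * z["sex"]
               + math.log(1.2) * z["location"]
               + math.log(3) * z["stage"])
    return math.exp(ph_part + alpha_age(z["age"]) * beta_age(t))


# Placeholder age functions (hypothetical): no age effect, constant over time.
no_age_effect = lambda age: 0.0
constant_in_time = lambda t: 1.0

subject = {"sex": 1, "location": 0, "stage": 1, "age": 72.0}
hr_example = hazard_ratio(2.0, subject, no_age_effect, constant_in_time)
```

With no age effect, the example reduces to the product of the hazard ratios of the binary covariates that equal 1, here 1.25 for sex and 3 for stage.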
3) Matching times of cancer-related deaths with covariate vectors. Finally, to ensure that the simulated
mortality data reflected the ‘true’ model (1) for HR(t|z) specified in step 2) above, we matched each of
the Nd times generated in step 1) with one of the N covariate vectors generated in section 3.1 above.
This matching was based on the probability, derived from the partial likelihood of the ‘true’ model (1)
that the event at a given time t corresponded to a specific covariate vector [32]. Specifically, we first
ordered simulated event times from the earliest to the latest. Next, for each subsequent event time $t_i$,
i=1,…,Nd, simulated in step 1), we calculated, for each subject s among the $S_i$ subjects still at risk at
time $t_i$, the following ratio:
\[ p_{i,s} = \frac{HR_s(t_i \mid z_s)}{\sum_{p=1}^{S_i} HR_p(t_i \mid z_p)} \qquad (2) \]
where $HR_s(t_i \mid z_s)$ and $HR_p(t_i \mid z_p)$ are, respectively, the evaluations of the HR function specified in (1),
at time $t_i$, for subjects s and p.
Then, we sampled a single covariate vector, corresponding to subject $s_e$, from the $S_i$ subjects still at
risk, using weighted random sampling with weights defined by $p_{i,s}$ in (2). This weighting ensured
that the subject’s risk of event at a given time was proportional to his/her true HR defined in (1) [31-32].
The covariate vector for the selected subject $s_e$ was then assigned the cancer-related death at the
respective time $t_i$, and the subject was excluded from all subsequent risk sets, for $t > t_i$. The same
process was continued until all Nd events from step 1) were assigned to individual subjects.
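The matching step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: the function name `assign_events` is hypothetical, the hazard ratio is supplied as a callable of time and covariate vector, and the risk set shrinks only through earlier assigned events (censoring and deaths from other causes are not handled here).

```python
import random


def assign_events(event_times, covariates, hazard_ratio, seed=1):
    """Match each ordered event time to one subject still at risk.

    event_times  : event times from the marginal distribution
    covariates   : list of covariate vectors, one per subject
    hazard_ratio : callable(t, z) -> HR of a subject with covariates z at time t
    Returns a dict {subject index: assigned event time}.
    """
    rng = random.Random(seed)
    at_risk = list(range(len(covariates)))
    assigned = {}
    for t in sorted(event_times):
        # Weights proportional to each at-risk subject's HR at time t;
        # random.choices normalises them, which gives the ratio in (2).
        weights = [hazard_ratio(t, covariates[s]) for s in at_risk]
        position = rng.choices(range(len(at_risk)), weights=weights, k=1)[0]
        subject = at_risk.pop(position)  # removed from all later risk sets
        assigned[subject] = t
    return assigned


# Toy usage: five subjects with a single hypothetical covariate and three events.
toy_covariates = [{"x": 0.0}, {"x": 0.5}, {"x": 1.0}, {"x": 1.5}, {"x": 2.0}]
toy_assignment = assign_events([30.0, 10.0, 20.0], toy_covariates,
                               lambda t, z: 2.0 ** z["x"])
```

The number of event times must not exceed the number of subjects, since each subject can receive at most one event.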
3.3 Generation of times to natural death
Individual times of ‘natural’ death were generated, independently of times to cancer-related death,
using the life tables for the Côte d’Or administrative region in France, stratified by sex and age. These
tables allowed us to determine the probability that a given subject would die within the next year,
conditional on sex and age, with age increased by 1 year for each subsequent year of follow-up.
Once a natural death in a specific year was generated for a given subject, the exact time of death
within this year (in days) was generated from the uniform U[0,365] distribution. As for cancer-related
death, this process was continued only until the end of the fifth year of follow-up, i.e. the time of
administrative censoring at the end of the study.
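A minimal sketch of this year-by-year generation follows. The life-table lookup `annual_death_prob` is a hypothetical stand-in for the Côte d’Or tables, which are not reproduced here, and the function name `natural_death_time` is likewise an assumption.

```python
import random


def natural_death_time(age, sex, annual_death_prob, rng, max_years=5):
    """Return the time of 'natural' death in days, or None if none occurs in follow-up.

    annual_death_prob : callable(sex, age_in_years) -> probability of dying within
                        the next year (hypothetical stand-in for a life table)
    """
    for year in range(max_years):
        current_age = int(age) + year  # age is increased by 1 for each year of follow-up
        if rng.random() < annual_death_prob(sex, current_age):
            return year * 365 + rng.uniform(0, 365)  # exact day within that year
    return None


rng = random.Random(1)
certain_death = natural_death_time(70.3, 1, lambda sex, age: 1.0, rng)
no_death = natural_death_time(70.3, 1, lambda sex, age: 0.0, rng)
```

The two calls use degenerate probabilities (1 and 0) only to show the two possible kinds of return value.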
3.4 Creating final data for analysis
The final step involved creating the final dataset, for the purpose of analyses. This required
determining, for each individual, whether a death was ‘observed’ during the follow-up and, if so, the
time of death. All subjects for whom no time of death was assigned in either section 3.2 or 3.3 above,
were censored at 5 years. If, until the end of the 5-year follow-up, a subject was assigned only one of
the two hypothetically possible causes of death, in the final dataset, we recorded a death at the
corresponding time. Finally, for those subjects who were assigned separate times to death due to (i)
cancer, and (ii) other causes (in sections 3.2 and 3.3, respectively), the death at the earlier of the two
times was recorded. It is important to note that the final analyzable dataset did not discriminate
between the two competing causes of death.
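The rules above amount to taking the earlier of the two assigned times, if any, and censoring otherwise. The sketch below assumes that times are measured in days and that administrative censoring occurs at 5 x 365 days; the function name `observed_outcome` is hypothetical.

```python
def observed_outcome(cancer_time, natural_time, censor_time=5 * 365):
    """Return (observed time, death indicator) for one subject.

    cancer_time, natural_time : assigned times of death in days, or None
    The cause of death is deliberately not recorded in the result.
    """
    candidates = [t for t in (cancer_time, natural_time)
                  if t is not None and t <= censor_time]
    if not candidates:
        return censor_time, 0   # alive at 5 years: administratively censored
    return min(candidates), 1   # death at the earlier of the assigned times


outcomes = [
    observed_outcome(None, None),    # no death assigned
    observed_outcome(100.0, None),   # cancer-related death only
    observed_outcome(400.0, 250.0),  # both assigned: the earlier one is observed
]
```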
For each of the three scenarios, corresponding to different ‘true’ effects of age at diagnosis (see
beginning of section 3.2 above), one hundred data sets were generated. Notice that both the covariate
matrix Z and the number of times to cancer-related death, generated in section 3.2 above, were kept
constant across all simulations. However, in each simulated sample, the matching of these two
components, as well as generation of times to natural death (section 3.3), was performed de novo,
independently of any other simulated sample, resulting in random sample-to-sample variation in the
‘observed’ mortality patterns [31].
4. Selected empirical results for the colon cancer application.
Table 1 summarizes the distribution of the prognostic factors used in multivariable analyses of
mortality, in the cohort of 813 stage I colon cancer patients, described in section 4 of the manuscript.
For each subgroup of patients, the last 2 columns of Table 1 show the number and proportion of
patients who died of any cause during the first 5 years after colon cancer diagnosis.
Figure 1 shows the cubic spline estimate of the baseline hazard of colon cancer-specific mortality,
estimated in the cohort of 813 stage I colon cancer patients discussed in the real-life application (see
section 4 of the manuscript). The estimate shows a very high initial risk of death, right after diagnosis,
reflecting high mortality due to post-surgery complications. The low hazard of mortality after the first
few months of follow-up implies that the later (after 6 months) segments of the time-dependent estimates
in Figure 2a were supported by only a few deaths, which explains the over-fit bias seen in this figure.
Figure 2 allows assessing the robustness of the estimates, for the cohort of 813 stage I colon cancer
patients discussed in the real-life application (see section 4 of the manuscript), with respect to the order
of steps 2 and 3 of our alternating conditional algorithm (see end of section 2.3). For both continuous
covariates, age (panels (a) and (b)) and calendar year of diagnosis (panels (c) and (d)), the estimates of
both β(t) and α(z) are virtually identical when the order of the two steps is reversed. Furthermore, for
calendar year of diagnosis, panels (d) and (c) show that, respectively, (i) cancer-related mortality
decreased gradually, in an approximately linear fashion, between 1976 and 2000, and (ii) this mortality
reduction applied to both early mortality, soon after diagnosis, and later mortality, up to 5 years after
diagnosis, which is consistent with the PH assumption.
References (numbered as in the manuscript)
20. Abrahamowicz M, MacKenzie T, Esdaile JM. Time-dependent hazard ratio: modelling and
hypothesis testing with application in lupus nephritis. Journal of the American Statistical Association
1996; 91: 1432-1439.
24. de Boor C. A Practical Guide to Splines. New York, 1978.
31. Sylvestre MP, Abrahamowicz M. Comparison of algorithms to generate event times conditional
on time-dependent covariates. Stat Med 2008; 27: 2618-2634.
32. MacKenzie T, Abrahamowicz M. Marginal and hazard ratio specific random data generation:
Applications to semi-parametric bootstrapping. Statistics and Computing 2002; 12: 245-252.
Table 1: Distributions of the colon cancer prognostic factors and the corresponding all-causes mortality

Prognostic factor      Category       Number   (%)*   Deaths at 5 years   (%)†
Age                    ≤ 64              240     30                  30   12.5
                       65 - 74           289     35                  68   23.5
                       ≥ 75              284     35                 121   42.6
Gender                 Woman             330     41                  67   20.3
                       Man               483     59                 152   31.5
Tumour location        Right colon       164     20                  43   26.2
                       Left colon        649     80                 176   27.1
Period of diagnosis    1976 - 1980        98     12                  34   34.7
                       1980 - 1985       151     19                  51   33.8
                       1986 - 1990       183     22                  45   24.6
                       1991 - 1995       200     25                  51   25.5
                       1996 - 2000       181     22                  38   21.0
TOTAL                                    813                        219   26.9

* % of all 813 patients.
† % of patients in a given category who died within the first 5 years after diagnosis.