Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics
Warwick, 27 - 29 July 2015

Missing data and net survival analysis
Bernard Rachet

General context
Population-based, routine data
  Cancer registry data
  Clinical data – tumour, treatment, comorbidity
Cancer survival and roles played by patient, tumour and healthcare factors
(very) large data sets, but incomplete information, which we have handled using multiple imputation procedure with Rubin’s rules
Preliminary results of on-going work

Multiple imputation procedure
Under Missing At Random (MAR) assumption
1. Impute the missing data from $f(Y_M \mid Y_O)$ to give K ‘complete’ data sets
2. Fit the substantive model to each of the K data sets, to obtain K estimates of the parameters and estimates of their variance
3. Combine them using Rubin’s rules

Multiple imputation steps
Incomplete data -> [Imputation] -> K completed data sets -> [Analysis] -> K analysis results -> [Pooling] -> Final results

Pooling K estimates – Rubin’s rules
Given K completed data sets, there are K estimates $\hat\beta_k$, $k = 1, \dots, K$, with variance $\hat\sigma_k^2$, $k = 1, \dots, K$
Pooled estimate
\[ \hat\beta_{MI} = \frac{1}{K} \sum_{k=1}^{K} \hat\beta_k \]
Total variance
\[ \hat V_{MI} = \hat W + \left(1 + \frac{1}{K}\right) \hat B \]
Within-imputation variance
\[ \hat W = \frac{1}{K} \sum_{k=1}^{K} \hat\sigma_k^2 \]
Between-imputation variance
\[ \hat B = \frac{1}{K-1} \sum_{k=1}^{K} \left(\hat\beta_k - \hat\beta_{MI}\right)^2 \]

Multiple imputation procedure
Congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model from $f(Y \mid X)$, $f(Y \mid X)\,g(X)$ is a congenial imputation model if both $f$ and $g$ are correctly specified
3.
Valid inference (under MAR) if $f(Y \mid X)\,g(X)$ (approximately) represents the data structure and the substantive model

Concepts and measures of interest
Aims
  Prognosis of a cancer and impact at population level
Concepts
  Excess hazard
  Excess hazard ratio
  Net survival
  Crude probabilities of death from cancer and other causes

Relative survival data setting
Population-based data
Expected mortality hazard from life tables
  By single year of age and sex, and calendar year, geography, deprivation

Nur et al, 2009 - Settings
Population-based cohort of colorectal cancer patients
Complete information on age, sex, follow-up time, vital status, deprivation, comorbidity, surgical treatment
Tumour stage, morphology and grade: 45% incomplete data

Relative survival data setting
\[ \lambda(x) = \lambda_P(x) + \exp(x\beta) \]
Substantive model: generalised linear model (Dickman et al, Stat Med 2005)
\[ \log(\mu_j - d^*_j) = \log(y_j) + x\beta \]
  Link function: $\log(\mu_j - d^*_j)$
  Offset: $\log(y_j)$
$d_j \sim \mathrm{Poisson}(\mu_j)$; $\mu_j = \lambda_j y_j$; $y_j$ person-time at risk
$d^*_j$ expected number of deaths – life tables
Excess hazard ratio (+ Ederer-2 relative survival)

Data description
Variable      Category               Patients
                                     No.        %
Total                                29 563     100.0
Stage         I                       2 193      12.3
              II                      7 326      41.0
              III                     7 726      43.2
              IV                        643       3.6
              Missing                11 684     (39.5)
Morphology    Adenocarcinoma         23 693      90.7
              Mucinous and serous     2 314       8.9
              Other                     128       0.5
              Neoplasm, NOS           3 428     (11.6)
Grade         I                       3 212      14.5
              II                     16 047      72.4
              III/IV                  2 907      13.1
              Missing                 7 397     (25.0)
Missing information associated with:
• Older ages
• More deprived categories
• Less treatment with curative intent
• Higher probability of death

Missing information in several variables
Multiple imputation using Full Conditional Specification (chained equations – van Buuren, 1999)
Same basic assumptions as in multiple imputation
Assumes a joint (multivariate) distribution exists without specifying its form
\[ f(Y_{i,1}, Y_{i,2}, \dots, Y_{i,p}) = f(Y_{i,p} \mid Y_{i,1}, \dots, Y_{i,p-1}) \times f(Y_{i,p-1} \mid Y_{i,1}, \dots, Y_{i,p-2}) \times \dots \times f(Y_{i,2} \mid Y_{i,1}) \times f(Y_{i,1}) \]
Imputation model (joint model for the data)
\[ Y \sim N(\beta, \Omega) \]
Gibbs sampler to:
1. Estimate the parameters in the joint imputation model
2. Impute the missing data
Multivariate problem split into a series of univariate problems

Imputation models
Outcomes
  Ordinal regression for stage and grade
  Polytomous regression for morphology
Covariables
  Other two covariables with incomplete information
  Sex, age, deprivation, comorbidity, treatment, cancer site
  Vital status
  Follow-up time (years): piecewise function (0, 0.5, 1, 2, 3, 4, 5, 5+)
  Time-dependent effects (categorical) for deprivation and age
Substantive (excess hazard) model includes all these variables, with (binary) time-dependent effects

Results
Variable      Category               Patients                 Data after imputation
                                     No.        %             %
Total                                29 563     100.0
Stage         I                       2 193      12.3         10.1
              II                      7 326      41.0         36.1
              III                     7 726      43.2         47.4
              IV                        643       3.6          6.2
              Missing                11 684     (39.5)
Morphology    Adenocarcinoma         23 693      90.7         90.5
              Mucinous and serous     2 314       8.9          8.9
              Other                     128       0.5          0.5
              Neoplasm, NOS           3 428     (11.6)
Grade         I                       3 212      14.5         13.6
              II                     16 047      72.4         72.0
              III/IV                  2 907      13.1         14.4
              Missing                 7 397     (25.0)

Results
Excess hazard ratios (EHR) with 95% CI, by period since diagnosis over which the EHR was estimated
                          Complete-case analysis (16 223 cases)    Multiple imputation (29 563 cases)
                          EHR     95% CI                           EHR     95% CI
Stage, five years
  I                       1.0                                      1.0
  II                      3.6     2.7 to 4.7                       2.6     2.2 to 3.0
  III                     10.2    7.7 to 13.5                      7.0     5.9 to 8.4
  IV                      26.4    19.6 to 35.5                     16.5    13.8 to 19.8
Age (years), first year
  15 to 44                1.0                                      1.0
  45 to 54                1.1     0.8 to 1.5                       1.3     1.0 to 1.6
  55 to 64                1.4     1.0 to 1.9                       1.7     1.4 to 2.1
  65 to 74                2.0     1.5 to 2.7                       2.4     2.0 to 2.9
  75 to 84                2.7     2.0 to 3.7                       3.6     2.9 to 4.3
  85 to 99                4.0     2.9 to 5.5                       5.4     4.4 to 6.6
Age (years), second to fifth years
  15 to 44                1.0                                      1.0
  45 to 54                1.3     1.0 to 1.6                       1.3     1.1 to 1.5
  55 to 64                1.2     1.0 to 1.5                       1.3     1.1 to 1.5
  65 to 74                1.2     1.0 to 1.5                       1.3     1.1 to 1.6
  75 to 84                1.1     0.9 to 1.4                       1.4     1.2 to 1.6
  85 to 99                0.9     0.7 to 1.3                       1.5     1.2 to 1.9

Other results – Indicator approach
• Systematically underestimates variance of EHRs
• Overestimates EHRs for tumour morphology
• Underestimates EHRs for age and deprivation
• Does not identify time-dependent effects

Stage-specific survival
[Figure: relative survival (%) against years since diagnosis (0 to 5), two panels: before imputation (stages I, II, III, IV and missing) and after imputation (stages I, II, III, IV)]

Limitations
Tutorial paper – no systematic evaluation
Relatively simple substantive model
  piecewise model
  categorical variables
Further recent methodological developments in:
  multiple imputation
  net survival, flexible modelling
More systematic evaluation – simulations

Concepts and measures of interest
Excess hazard
\[ \lambda_E(t) = \lambda_O(t) - \lambda_P(t) \]
[Equations: estimators of $\lambda_O(t)\,dt$ and $\lambda_P(t)\,dt$]
Net survival
\[ S_E(t) = \exp\left(-\int_0^t \lambda_E(u)\,du\right) \]
Crude mortality
\[ F_C(t) = \int_0^t S_O(u-)\,\lambda_E(u)\,du \]
Expected probability of surviving up to t

Modelling approach
Flexible multivariable excess hazard model
  Excess hazard
  Time-dependent and non-linear effects (splines)
  Variables affecting both mortality processes (cancer and other causes of death) included in the model
Net survival is the mean of individual net survival functions predicted by the model

Multiple imputation procedure
Congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model from $f(Y \mid X)$, $f(Y \mid X)\,g(X)$ is a congenial imputation model if both $f$ and $g$ are correctly specified
3. Valid inference (under MAR) if $f(Y \mid X)\,g(X)$ (approximately) represents the data structure and the substantive model
4. Problematic within net survival setting and with nonlinear and time-dependent effects

Falcaro et al, 2015 – Study settings
Data
  44,461 men diagnosed with a colorectal cancer in 1998-2006, followed up to 2009
  Age at diagnosis (continuous), tumour stage (4 categories), deprivation (5 categories)
Missing stage: 30%
  MCAR: $\mathrm{logit}\,\Pr(R_i = 1 \mid X_i) = \delta_0$
  MAR on X: $\mathrm{logit}\,\Pr(R_i = 1 \mid X_i) = \alpha_0 + \alpha_1(\text{age}_i - 60)$
  MAR: $\mathrm{logit}\,\Pr(R_i = 1 \mid X_i) = \gamma_0 + \gamma_1(\text{age}_i - 60) + \gamma_2 T_i + \gamma_3 D_i$
  $R = 1$ if stage missing
100 simulated data sets per scenario
Distribution on fully observed data and empirical expected distribution in remaining complete records

Substantive model
Flexible log cumulative excess hazard model
\[ \ln \Lambda_E(t \mid x_i) = s_1(\ln t; \gamma_1, k_1) + \beta' x_i + s_2(\text{age}_i; \gamma_2, k_2) \]
Flexible functions: restricted cubic splines
  Baseline excess hazard: 5 df, 4 internal knots and 2 boundary knots
  Age (continuous): 3 df, 2 internal knots
Covariables: deprivation and stage
Aims: estimate effect of stage (log EHR) and stage-specific net survival at 1, 5 and 10 years since diagnosis

Imputation models
Outcome (stage)
  Ordinal or multinomial logistic regression
Covariables
  Survival time
  and log(survival time), or Nelson-Aalen estimate of the cumulative hazard
  Event indicator
  Age – splines defined as in the substantive model
  Deprivation – dummy variables
30 imputations
Net survival: Rubin’s rules applied on $\log(-\log S_E(t))$ to obtain approximate normality, then back-transformed

Multiple imputation strategy
Multiple Imputation Strategy    Functional Form          How Survival Is Modeled in the Imputation
MI_ologit_surv                  Ordinal logistic         Survival time and log survival time
MI_ologit_na                    Ordinal logistic         Nelson-Aalen estimate of cumulative hazard
MI_mlogit_surv                  Multinomial logistic     Survival time and log survival time
MI_mlogit_na                    Multinomial logistic     Nelson-Aalen estimate of cumulative hazard

Results
Bias in log excess hazard ratio estimates for stage (reference stage 1), 100 replications
Poor results with ordered logit even under MCAR scenario
Stage-specific net survival at 1 year, 100 replications

Results
Bias in stage-specific net survival estimates at 1 year, 100 replications

Comments
Promising results, even though the parameter estimated in the substantive model (here excess hazard) does not correspond to the final outcome of interest (net survival)
Limitations
  No time-dependent effects of stage
  Which joint model?
  Which variables in the imputation models?
• Vital status
• Nelson-Aalen estimates of cumulative hazard
• Interactions with time since diagnosis (age at diagnosis, deprivation…)
• Other relevant interactions (tumour stage, region…)
• Other factors (treatment variables, co-morbidities, hospital volume, surgeon’s experience…)

Limitations and challenges: preliminary study
Simulated data set – colon cancer, 12,048 men followed up at least 5 years
Baseline excess hazard: 5 df, 4 internal knots
Covariables: stage, deprivation, age
Time-dependent effects of stage: 2 df, 1 internal knot for each higher stage
Non-linear effects of age: 3 df, 2 internal knots
Substantive model
\[ \ln \Lambda_E(t \mid x_i) = s_1(\ln t; \gamma_1, k_1) + \beta' x_i + s_2(\text{age}_i; \gamma_2, k_2) + s_{3j}(\text{stage}_j, t; \gamma_3, k_3) \]
Missing stage simulated as in previous example – 100 data sets per scenario, with 30% missing stage
Focus on MAR here

Limitations and challenges: preliminary study
                         Net survival function
Stage    Time (year)     Complete    MAR
1        1               0.95        0.99
         5               0.91        0.99
2        1               0.90        0.97
         5               0.78        0.90
3        1               0.77        0.86
         5               0.46        0.59
4        1               0.32        0.41
         5               0.06        0.09
Simulation of missingness mechanisms as in previous example
Same imputation model was applied (multinomial, Nelson-Aalen)

Results – Excess hazard ratios for stage
[Figure: Tumour stage 2 (reference stage 1); EHR against time since diagnosis (years, 0 to 5); curves: True EHR, Complete-case EHRs, Imputed EHRs]

Results – Excess hazard ratios for stage
[Figure: Tumour stage 3 (reference stage 1); EHR against time since diagnosis (years, 0 to 5); curves: True EHR, Complete-case EHRs, Imputed EHRs]

Results – Excess hazard ratios for stage
[Figure: Tumour stage 4 (reference stage 1); EHR against time since diagnosis (years, 0 to 5); curves: True EHR, Complete-case EHRs, Imputed EHRs]

Results – Stage-specific net survival
[Figure: Tumour stage 1; net survival against time since diagnosis (years, 0 to 5)]

Results – Stage-specific net survival
[Figure: Tumour stage 2; net survival against time since diagnosis (years, 0 to 5)]
Results – Stage-specific net survival
[Figure: Tumour stage 3; net survival against time since diagnosis (years, 0 to 5)]

Results – Stage-specific net survival
[Figure: Tumour stage 4; net survival against time since diagnosis (years, 0 to 5)]

Conclusion and development
Why MI?
  Strength: clear division between imputation and analysis stages
    both efficiency and MAR plausibility increased
  Challenge: incompatibility between imputation and substantive models
    asymptotically biased estimates
Define joint model for flexible excess hazard models
Multiple imputation by fully conditional specification with substantive model compatible algorithm (SMC-FCS)
  Bartlett JW et al. Statistical Methods in Medical Research 2015

References
Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: John Wiley & Sons; 1987.
Van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999; 18: 681-94.
White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med 2009; 28: 1982-98.
Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. Int J Epidemiol 2010; 39: 118-28.
Carpenter JR, Kenward MG. Multiple imputation and its application. Chichester: John Wiley & Sons; 2013.
Falcaro M, Nur U, Rachet B, Carpenter JR. Estimating excess hazard ratios and net survival when covariate data are missing: strategies for multiple imputation. Epidemiology 2015; 26: 421-8.
Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res 2015; 24: 462-97.
http://www.missingdata.org.uk/
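
Supplementary illustration: the pooling step described on the Rubin's rules slide (pooled estimate, within-imputation and between-imputation variance, total variance) can be sketched in Python. This is only a minimal sketch under stated assumptions, not code from the presentation: the function name `pool_rubin` and the three log excess hazard ratios with their variances are hypothetical values chosen for the example.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool K point estimates and their variances with Rubin's rules.

    Hypothetical helper for illustration; returns (pooled estimate, total variance).
    """
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need K >= 2 estimates with matching variances")
    pooled = mean(estimates)        # pooled estimate: mean of the K estimates
    within = mean(variances)        # W: within-imputation variance
    between = variance(estimates)   # B: sample variance of estimates (divisor K - 1)
    total = within + (1 + 1 / k) * between
    return pooled, total


# Hypothetical log excess hazard ratios and their variances from K = 3 imputed data sets
log_ehr = [0.95, 1.00, 1.05]
var_log_ehr = [0.010, 0.012, 0.011]
pooled, total_var = pool_rubin(log_ehr, var_log_ehr)
print(pooled, total_var)
```

With these made-up inputs the within-imputation variance is 0.011, the between-imputation variance is 0.0025, and the total variance is 0.011 + (1 + 1/3) × 0.0025, roughly 0.0143. Pooling is done on the log scale here, in the same spirit as the slides' use of a transformation before applying Rubin's rules.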
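
Supplementary illustration: two of the imputation strategies in the slides use the Nelson-Aalen estimate of the cumulative hazard as a covariate in the imputation model. A minimal sketch of how such a covariate might be computed is given below. It is an assumption-laden illustration, not the authors' implementation: the function names `nelson_aalen` and `cumulative_hazard_at` and the five follow-up records are hypothetical, and the estimator shown is the standard one (sum over event times of deaths divided by the number at risk).

```python
from collections import Counter


def nelson_aalen(times, events):
    """Return {event time: cumulative hazard} from follow-up times and event indicators."""
    deaths = Counter(t for t, d in zip(times, events) if d == 1)
    exits = Counter(times)      # each subject leaves the risk set at their own time
    at_risk = len(times)
    cumulative = 0.0
    estimate = {}
    for t in sorted(exits):
        if deaths[t]:
            cumulative += deaths[t] / at_risk   # increment d_j / n_j at each event time
            estimate[t] = cumulative
        at_risk -= exits[t]
    return estimate


def cumulative_hazard_at(estimate, t):
    """Evaluate the Nelson-Aalen step function at time t (0 before the first event)."""
    value = 0.0
    for event_time in sorted(estimate):
        if event_time <= t:
            value = estimate[event_time]
    return value


# Hypothetical data: follow-up time in years and event indicator (1 = death, 0 = censored)
times = [1.0, 2.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 0, 1]
H = nelson_aalen(times, events)
covariate = [cumulative_hazard_at(H, t) for t in times]
print(covariate)
```

Each subject's value of `covariate` would then enter the imputation model for stage alongside the event indicator, in place of survival time and log survival time.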