Workshop on Flexible Models for Longitudinal Biostatistics

Workshop on Flexible Models for Longitudinal
and Survival Data with Applications in
Biostatistics
Warwick, 27 - 29 July 2015
Missing data and net survival analysis
Bernard Rachet
General context
Population-based, routine data
Cancer registry data
Clinical data: tumour, treatment, comorbidity
Cancer survival and roles played by patient, tumour and healthcare factors
(Very) large data sets, but incomplete information, which we have
handled using multiple imputation with Rubin's rules
Preliminary results of on-going work
Multiple imputation procedure
Under Missing At Random (MAR) assumption
1. Impute the missing data from $f(\mathbf{Y}_M \mid \mathbf{Y}_O)$ to give K 'complete' data sets
2. Fit the substantive model to each of the K data sets, to obtain K estimates of the parameters and estimates of their variance
3. Combine them using Rubin's rules
Multiple imputation steps
Incomplete data -> [Imputation] -> K completed data sets -> [Analysis] -> K analysis results -> [Pooling] -> Final results
Pooling K estimates - Rubin's rules
Given K completed data sets, there are K estimates $\hat{\theta}_k$, $k = 1, \ldots, K$, with variance $\hat{\sigma}_k^2$, $k = 1, \ldots, K$
Pooled estimate
$$\hat{\theta}_{MI} = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}_k$$
Total variance
$$\hat{V}_{MI} = \hat{W} + \left(1 + \frac{1}{K}\right) \hat{B}$$
Within-imputation variance
$$\hat{W} = \frac{1}{K} \sum_{k=1}^{K} \hat{\sigma}_k^2$$
Between-imputation variance
$$\hat{B} = \frac{1}{K-1} \sum_{k=1}^{K} \left(\hat{\theta}_k - \hat{\theta}_{MI}\right)^2$$
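The pooling step can be illustrated with a minimal Python sketch. The function name `pool_rubin` and the five input estimates are hypothetical, chosen only to make the arithmetic easy to follow; they are not results from the studies presented here.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool K point estimates and their variances with Rubin's rules."""
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need K >= 2 estimates with matching variances")
    theta_mi = mean(estimates)      # pooled estimate
    w = mean(variances)             # within-imputation variance
    b = variance(estimates)         # between-imputation variance (divisor K - 1)
    v_mi = w + (1 + 1 / k) * b      # total variance
    return theta_mi, w, b, v_mi

# Hypothetical log excess hazard ratios and variances from K = 5 imputed data sets
theta_mi, w, b, v_mi = pool_rubin(
    [0.90, 1.00, 1.10, 0.95, 1.05],
    [0.040, 0.050, 0.045, 0.050, 0.040],
)
```

In practice the reference distribution for confidence intervals also depends on K and on the ratio of between- to within-imputation variance, which this sketch does not compute.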
Multiple imputation procedure
Congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model from $f(\mathbf{Y} \mid \mathbf{X})$, $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ is a congenial imputation model if both $f$ and $g$ are correctly specified
3. Valid inference (under MAR) if $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ (approximately) represents data structure and substantive model
Concepts and measures of interest
Aims
Prognosis of a cancer and impact at population level
Concepts
Excess hazard
Excess hazard ratio
Net survival
Crude probabilities of death from cancer and other causes
Relative survival data setting
Population-based data
Expected mortality hazard from life tables
By single year of age, sex, calendar year, geography and deprivation
Nur et al, 2009 - Settings
Population-based cohort of colorectal cancer patients
Complete information on age, sex, follow-up time, vital status, deprivation,
comorbidity, surgical treatment
Tumour stage, morphology and grade: 45% incomplete data
Relative survival data setting
$$\lambda(x) = \lambda_P(x) + \exp(x\beta)$$
Substantive model: generalised linear model (Dickman et al, Stat Med 2005)
$$\log(\mu_j - d_{Pj}) = \log(y_j) + x\beta$$
Link function: $\log(\mu_j - d_{Pj})$; offset: $\log(y_j)$
$d_j \sim \mathrm{Poisson}(\mu_j)$; $\mu_j = \lambda_j y_j$; $y_j$ person-time at risk
$d_{Pj}$ expected number of deaths, from life tables
Excess hazard ratio (+ Ederer-2 relative survival)
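As a rough illustration of this model, the Python sketch below writes the Poisson log-likelihood implied by $\mu_j = d_{Pj} + y_j \exp(x\beta)$ and evaluates it on two hypothetical groups. The counts, person-years and function names are invented for the example. With one parameter per group the model is saturated, so the estimates are available in closed form and the excess hazard ratio is simply the ratio of the two excess rates.

```python
import math

def excess_hazard_rate(deaths, expected_deaths, person_time):
    """Observed minus expected deaths, per unit of person-time."""
    return (deaths - expected_deaths) / person_time

def poisson_loglik(beta0, beta1, rows):
    """Poisson log-likelihood (up to a constant) with
    mu_j = d_Pj + y_j * exp(beta0 + beta1 * x_j)."""
    total = 0.0
    for d, d_p, y, x in rows:
        mu = d_p + y * math.exp(beta0 + beta1 * x)
        total += d * math.log(mu) - mu
    return total

# Hypothetical grouped data:
# (deaths d_j, expected deaths d_Pj, person-years y_j, binary covariate x_j)
rows = [(120, 20.0, 1000.0, 0), (260, 10.0, 500.0, 1)]

# Saturated two-group fit: closed-form maximum likelihood estimates
beta0_hat = math.log(excess_hazard_rate(120, 20.0, 1000.0))
beta1_hat = math.log(excess_hazard_rate(260, 10.0, 500.0)) - beta0_hat
ehr = math.exp(beta1_hat)  # excess hazard ratio, x = 1 versus x = 0
```

With more covariates than groups the estimates would have to be found iteratively, for instance with a generalised linear model routine that supports a user-defined link.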
Data description
Variable     Category               No.       %
Patients                            29 563    100.0
Stage        I                      2 193     12.3
             II                     7 326     41.0
             III                    7 726     43.2
             IV                     643       3.6
             Missing                11 684    (39.5)
Morphology   Adenocarcinoma         23 693    90.7
             Mucinous and serous    2 314     8.9
             Other                  128       0.5
             Neoplasm, NOS¹         3 428     (11.6)
Grade        I                      3 212     14.5
             II                     16 047    72.4
             III/IV                 2 907     13.1
             Missing                7 397     (25.0)
Missing information associated with:
• Older ages
• More deprived categories
• Less treatment with curative intent
• Higher probability of death
Missing information in several variables
Multiple imputation using Full Conditional Specification (chained equations; van Buuren, 1999)
Same basic assumptions as in multiple imputation
Assumes a joint (multivariate) distribution exists without specifying its form
$$f(Y_{i,1}, Y_{i,2}, \ldots, Y_{i,p}) = f(Y_{i,p} \mid Y_{i,1}, \ldots, Y_{i,p-1}) \times f(Y_{i,p-1} \mid Y_{i,1}, \ldots, Y_{i,p-2}) \times \cdots \times f(Y_{i,2} \mid Y_{i,1}) \times f(Y_{i,1})$$
Imputation model (joint model for the data)
$$Y \sim N(\beta, \Omega)$$
Gibbs sampler to:
1. Estimate the parameters in the joint imputation model
2. Impute the missing data
Multivariate problem split into a series of univariate problems
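The idea of splitting the multivariate problem into univariate ones can be sketched as follows. This is a deliberately simplified illustration and not the procedure used in the studies: it assumes two continuous variables and linear regressions, and it does not draw the regression parameters from their posterior distribution, so it is not a fully proper imputation. The function names and the toy data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / max(n - 2, 1)) ** 0.5

def chained_equations(rows, n_cycles=10, seed=1):
    """rows: list of [y1, y2] with None for missing. Returns one completed copy."""
    rng = random.Random(seed)
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]
    # Start from the observed mean of each variable
    for j in range(2):
        observed = [row[j] for row in rows if row[j] is not None]
        start = sum(observed) / len(observed)
        for row in data:
            if row[j] is None:
                row[j] = start
    # Cycle through the variables, imputing each from the other
    for _ in range(n_cycles):
        for j in range(2):
            other = 1 - j
            idx = [i for i in range(len(data)) if not missing[i][j]]
            a, b, sd = fit_line([data[i][other] for i in idx],
                                [data[i][j] for i in idx])
            for i in range(len(data)):
                if missing[i][j]:
                    data[i][j] = a + b * data[i][other] + rng.gauss(0.0, sd)
    return data

# Toy data with one missing value in each variable
rows = [[1.0, 2.1], [2.0, None], [3.0, 6.2], [None, 8.1], [5.0, 9.8], [6.0, 12.3]]
completed = chained_equations(rows, seed=1)
```

Calling the function K times with different seeds would give K completed data sets. For categorical variables such as stage, the linear regression would be replaced by an ordinal or polytomous regression.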
Imputation models
Outcomes
Ordinal regression for stage and grade
Polytomous regression for morphology
Covariables
Other two covariables with incomplete information
Sex, age, deprivation, comorbidity, treatment, cancer site
Vital status
Follow-up time (years): piecewise function (0, 0.5, 1, 2, 3, 4, 5, 5+)
Time-dependent effects (categorical) for deprivation and age
Substantive (excess hazard) model includes
all these variables
(binary) time-dependent effects
Results
Variable     Category               No.       %         Data after imputation (%)
Patients                            29 563    100.0
Stage        I                      2 193     12.3      10.1
             II                     7 326     41.0      36.1
             III                    7 726     43.2      47.4
             IV                     643       3.6       6.2
             Missing                11 684    (39.5)
Morphology   Adenocarcinoma         23 693    90.7      90.5
             Mucinous and serous    2 314     8.9       8.9
             Other                  128       0.5       0.5
             Neoplasm, NOS¹         3 428     (11.6)
Grade        I                      3 212     14.5      13.6
             II                     16 047    72.4      72.0
             III/IV                 2 907     13.1      14.4
             Missing                7 397     (25.0)
Missing information associated with:
• Older ages
• More deprived categories
• Less treatment with curative intent
• Higher probability of death
Results
EHR (95% CI), by period since diagnosis over which EHR was estimated

                         Complete-case analysis (16 223 cases)                          Multiple imputation (29 563 cases)
Variable     Category    Five years**       First year       Second to fifth years     Five years**       First year       Second to fifth years
Stage        I           1.0                                                           1.0
             II          3.6 (2.7-4.7)                                                 2.6 (2.2-3.0)
             III         10.2 (7.7-13.5)                                               7.0 (5.9-8.4)
             IV          26.4 (19.6-35.5)                                              16.5 (13.8-19.8)
             Missing
Age (years)  15 to 44                       1.0              1.0                                          1.0              1.0
             45 to 54                       1.1 (0.8-1.5)    1.3 (1.0-1.6)                                1.3 (1.0-1.6)    1.3 (1.1-1.5)
             55 to 64                       1.4 (1.0-1.9)    1.2 (1.0-1.5)                                1.7 (1.4-2.1)    1.3 (1.1-1.5)
             65 to 74                       2.0 (1.5-2.7)    1.2 (1.0-1.5)                                2.4 (2.0-2.9)    1.3 (1.1-1.6)
             75 to 84                       2.7 (2.0-3.7)    1.1 (0.9-1.4)                                3.6 (2.9-4.3)    1.4 (1.2-1.6)
             85 to 99                       4.0 (2.9-5.5)    0.9 (0.7-1.3)                                5.4 (4.4-6.6)    1.5 (1.2-1.9)

Other results - Indicator approach
• Systematically underestimates variance of EHRs
• Overestimates EHRs for tumour morphology
• Underestimates EHRs for age and deprivation
• Does not identify time-dependent effects
Stage-specific survival
[Figure: relative survival (%) by years since diagnosis (0 to 5), in two panels. Before imputation: stages I, II, III, IV and missing. After imputation: stages I, II, III and IV.]
Limitations
Tutorial paper – no systematic evaluation
Relatively simple substantive model
piecewise model
categorical variables
Further recent methodological developments in:
multiple imputation
net survival, flexible modelling
More systematic evaluation – simulations
Concepts and measures of interest
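Excess hazard
$$\lambda_E(t) = \lambda_O(t) - \lambda_P(t)$$
$$\lambda_O(t)\,dt = \frac{dN^W(t)}{Y^W(t)}; \qquad \lambda_P(t)\,dt = \frac{\sum_{i=1}^{n} Y_i^W(t)\, \lambda_{Pi}(t)\,dt}{Y^W(t)}$$
$$W_i(t) = \frac{1}{S_{Pi}(t)}$$
where $S_{Pi}(t)$ is the expected probability of surviving up to t
Net survival
$$S_E(t) = e^{-\int_0^t \lambda_E(u)\,du}$$
Crude mortality
$$F_C(t) = \int_0^t S_O(u-)\, \lambda_E(u)\,du$$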
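To make the difference between net survival and the crude probability of death from cancer concrete, here is a small numerical sketch in Python. It assumes the excess and expected hazards are known functions of time (constant in the example, with hypothetical values) and uses a simple trapezoidal rule. It illustrates the definitions only and is not an estimator for real data.

```python
import math

def cumulative(hazard, t, steps=400):
    """Trapezoidal approximation of the integral of hazard(u) over [0, t]."""
    if t == 0:
        return 0.0
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * h) for i in range(1, steps))
    return total * h

def net_survival(excess_hazard, t):
    """S_E(t) = exp(-integral of the excess hazard from 0 to t)."""
    return math.exp(-cumulative(excess_hazard, t))

def crude_cancer_death(excess_hazard, expected_hazard, t):
    """F_C(t) = integral over [0, t] of S_O(u) * lambda_E(u) du, where the
    overall hazard is the sum of the excess and expected hazards."""
    def overall_hazard(u):
        return excess_hazard(u) + expected_hazard(u)

    def integrand(u):
        s_o = math.exp(-cumulative(overall_hazard, u, steps=200))
        return s_o * excess_hazard(u)

    return cumulative(integrand, t)

# Hypothetical constant hazards (per year)
lam_e = lambda u: 0.20   # excess (cancer-related) hazard
lam_p = lambda u: 0.05   # expected (other causes) hazard

s_e_5 = net_survival(lam_e, 5.0)
f_c_5 = crude_cancer_death(lam_e, lam_p, 5.0)
```

With these values the probability of dying from cancer in a world without other causes, 1 - S_E(5), is larger than the crude probability F_C(5), because some patients die from other causes first.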
Modelling approach
Flexible multivariable excess hazard model
Excess hazard
Time-dependent and non-linear effects (splines)
Variables affecting both mortality processes (cancer and other
causes of death) included in the model
Net survival is the mean of individual net survival functions predicted
by the model
Multiple imputation procedure
Congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model from $f(\mathbf{Y} \mid \mathbf{X})$, $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ is a congenial imputation model if both $f$ and $g$ are correctly specified
3. Valid inference (under MAR) if $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ (approximately) represents data structure and substantive model
4. Problematic within net survival setting and with non-linear and time-dependent effects
Falcaro et al, 2015 - Study settings
Data
44,461 men diagnosed with a colorectal cancer in 1998-2006, followed up
to 2009
Age at diagnosis (continuous), tumour stage (4 categories), deprivation (5
categories)
Missing stage: 30%
MCAR: $\mathrm{logit}\,\Pr(R_i = 1 \mid \mathbf{Z}_i) = \delta_0$
MAR on X: $\mathrm{logit}\,\Pr(R_i = 1 \mid \mathbf{Z}_i) = \alpha_0 + \alpha_1(\mathrm{age}_i - 60)$
MAR: $\mathrm{logit}\,\Pr(R_i = 1 \mid \mathbf{Z}_i) = \gamma_0 + \gamma_1(\mathrm{age}_i - 60) + \gamma_2 T_i + \gamma_3 D_i$
$R = 1$ if stage missing
100 simulated data sets per scenario
Distribution on fully observed data and empirical
expected distribution in remaining complete records
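The three missingness mechanisms can be simulated along the following lines. This Python sketch makes several assumptions that are not stated on the slides: $T_i$ is read as survival time and $D_i$ as the event indicator, the coefficient values are invented (only $\delta_0 = \mathrm{logit}(0.3)$ is chosen to reproduce 30% missing stage under MCAR), and the patient characteristics are drawn from arbitrary distributions.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob_stage_missing(mechanism, age, time, died, coef):
    """Pr(R_i = 1 | Z_i) under one of the three mechanisms."""
    if mechanism == "MCAR":
        lp = coef["delta0"]
    elif mechanism == "MAR_X":
        lp = coef["alpha0"] + coef["alpha1"] * (age - 60)
    elif mechanism == "MAR":
        lp = (coef["gamma0"] + coef["gamma1"] * (age - 60)
              + coef["gamma2"] * time + coef["gamma3"] * died)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return expit(lp)

# Hypothetical coefficients; delta0 = logit(0.3) gives 30% missing under MCAR
coef = {"delta0": math.log(0.3 / 0.7),
        "alpha0": -0.9, "alpha1": 0.03,
        "gamma0": -1.2, "gamma1": 0.03, "gamma2": -0.1, "gamma3": 0.8}

rng = random.Random(2015)
# Hypothetical patients: (age in years, survival time in years, died yes/no)
patients = [(rng.uniform(40, 90), rng.uniform(0.1, 10.0), rng.random() < 0.6)
            for _ in range(5000)]

# Draw the missingness indicator R for each patient under MCAR
r_mcar = [rng.random() < prob_stage_missing("MCAR", a, t, d, coef)
          for a, t, d in patients]
frac_missing_mcar = sum(r_mcar) / len(r_mcar)
```

Under the MAR mechanism the same call with `"MAR"` makes the probability of missing stage depend on age, survival time and vital status, so the complete records are no longer a random subset of the cohort.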
Substantive model
Flexible log cumulative excess hazard model
$$\ln \Lambda_E(t \mid \mathbf{x}_i) = s_1(\ln t;\ \boldsymbol{\gamma}_1, \mathbf{k}_1) + \boldsymbol{\beta}' \mathbf{x}_i + s_2(\mathrm{age}_i;\ \boldsymbol{\gamma}_2, \mathbf{k}_2)$$
Flexible functions: restricted cubic splines
Baseline excess hazard: 5 df, 4 internal knots and 2 boundary knots
Age (continuous): 3 df, 2 internal knots
Covariables: deprivation and stage
Aims: estimate effect of stage (log EHR) and stage-specific net survival at 1, 5
and 10 years since diagnosis
Imputation models
Outcome (stage)
Ordinal or multinomial logistic regression
Covariables
Survival time and log(survival time) or Nelson-Aalen estimate of the
cumulative hazard
Event indicator
Age: splines defined as in the substantive model
Deprivation: dummy variables
30 imputations
Net survival: Rubin's rules applied on $\log(-\log S_E(t))$ to obtain approximate normality, then back-transformed
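Pooling net survival on the transformed scale could look like the Python sketch below. Two choices in it are assumptions made for the example rather than details given here: the standard errors are moved to the $\log(-\log)$ scale with the delta method, and the interval uses the normal quantile 1.96 instead of a t reference distribution. The function names and input values are hypothetical.

```python
import math
from statistics import mean, variance

def cloglog(s):
    return math.log(-math.log(s))

def inv_cloglog(z):
    return math.exp(-math.exp(z))

def pool_net_survival(surv, se):
    """surv: K net survival estimates at a given time; se: their standard
    errors on the survival scale. Returns (estimate, lower, upper)."""
    k = len(surv)
    z = [cloglog(s) for s in surv]
    # Delta method: SE of log(-log S) is SE(S) / (S * |log S|)
    var_z = [(e / (s * abs(math.log(s)))) ** 2 for s, e in zip(surv, se)]
    z_mi = mean(z)
    total_var = mean(var_z) + (1 + 1 / k) * variance(z)
    half = 1.96 * math.sqrt(total_var)
    # The back-transformation is decreasing, so the bounds swap
    return inv_cloglog(z_mi), inv_cloglog(z_mi + half), inv_cloglog(z_mi - half)

# Hypothetical 5-year net survival estimates from K = 4 imputed data sets
est, lower, upper = pool_net_survival([0.58, 0.60, 0.61, 0.59],
                                      [0.020, 0.020, 0.021, 0.019])
```

Working on this scale keeps the back-transformed interval inside (0, 1), which pooling directly on the survival scale does not guarantee.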
Multiple imputation strategy
Multiple Imputation Strategy   Functional Form         How Survival Is Modeled in the Imputation
MI_ologit_surv                 Ordinal logistic        Survival time and log survival time
MI_ologit_na                   Ordinal logistic        Nelson-Aalen estimate of cumulative hazard
MI_mlogit_surv                 Multinomial logistic    Survival time and log survival time
MI_mlogit_na                   Multinomial logistic    Nelson-Aalen estimate of cumulative hazard
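For the strategies that use the Nelson-Aalen estimate, the cumulative hazard evaluated at each patient's own follow-up time enters the imputation model as a covariate. A minimal Python sketch of that estimate follows; the function names and the five-patient data set are hypothetical.

```python
def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard at each distinct event time.
    times: follow-up times; events: 1 if death observed, 0 if censored."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    cumulative = 0.0
    estimate = {}
    for t in event_times:
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        at_risk = sum(1 for ti in times if ti >= t)
        cumulative += deaths / at_risk
        estimate[t] = cumulative
    return estimate

def cumulative_hazard_at(t, estimate):
    """Value of the step function H(t), for use as an imputation covariate."""
    value = 0.0
    for s in sorted(estimate):
        if s <= t:
            value = estimate[s]
        else:
            break
    return value

# Hypothetical follow-up data for five patients
times = [1.0, 2.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1, 0]
na = nelson_aalen(times, events)
covariate = [cumulative_hazard_at(t, na) for t in times]
```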
Results
Bias in log excess hazard ratio estimates for stage (reference stage 1), 100 replications
Poor results with ordered logit even under MCAR scenario
Stage-specific net survival at 1 year, 100 replications
Results
Bias in stage-specific net survival estimates at 1 year, 100 replications
Comments
Promising results, even though the parameter estimated in the substantive model
(here excess hazard) does not correspond to the final outcome of interest (net
survival)
Limitations
No time-dependent effects of stage
Which joint model?
Which variables in the imputation models?
• Vital status
• Nelson-Aalen estimates of cumulative hazard
• Interactions with time since diagnosis (age at diagnosis, deprivation…)
• Other relevant interactions (tumour stage, region…)
• Other factors (treatment variables, co-morbidities, hospital volume, surgeon's experience…)
Limitations and challenges: preliminary study
Simulated data set: colon cancer, 12,048 men followed up for at least 5 years
Baseline excess hazard: 5 df, 4 internal knots
Covariables: stage, deprivation, age
Time-dependent effects of stage: 2 df, 1 internal knot for each higher stage
Non-linear effects of age: 3 df, 2 internal knots
Substantive model
$$\ln \Lambda_E(t \mid \mathbf{x}_i) = s_1(\ln t;\ \boldsymbol{\gamma}_1, \mathbf{k}_1) + \boldsymbol{\beta}' \mathbf{x}_i + s_2(\mathrm{age}_i;\ \boldsymbol{\gamma}_2, \mathbf{k}_2) + s_{3j}(\mathrm{stage}_j\, t;\ \boldsymbol{\gamma}_3, \mathbf{k}_3)$$
Missing stage simulated as in previous example: 100 data sets per scenario, with 30% missing stage
Focus on MAR here
Limitations and challenges: preliminary study
Net survival function
Stage   Time (year)   Complete   MAR
1       1             0.95       0.99
1       5             0.91       0.99
2       1             0.90       0.97
2       5             0.78       0.90
3       1             0.77       0.86
3       5             0.46       0.59
4       1             0.32       0.41
4       5             0.06       0.09
Simulation of missingness mechanisms
as in previous example
Same imputation model was applied
(multinomial, Nelson-Aalen)
Results - Excess hazard ratios for stage
[Figures: excess hazard ratio by time since diagnosis (0 to 5 years), one panel each for tumour stage 2, 3 and 4 (reference stage 1). Each panel compares the true EHR, complete-case EHRs and imputed EHRs.]
Results - Stage-specific net survival
[Figures: net survival (0 to 1) by time since diagnosis (0 to 5 years), one panel each for tumour stage 1, 2, 3 and 4.]
Conclusion and development
Why MI?
Strength: clear division between imputation and analysis stages
both efficiency and MAR plausibility increased
Challenge: incompatibility between imputation and substantive models
asymptotically biased estimates
Define joint model for flexible excess hazard models
Multiple imputation by fully conditional specification with substantive
model compatible algorithm (SMC-FCS)
Bartlett JW et al. Statistical Methods in Medical Research 2015
References
Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: John Wiley & Sons; 1987.
Van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999; 18: 681-94.
White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med 2009; 28: 1982-98.
Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. Int J Epidemiol 2010; 39: 118-28.
Carpenter JR, Kenward MG. Multiple imputation and its application. Chichester: John Wiley & Sons; 2013.
Falcaro M, Nur U, Rachet B, Carpenter JR. Estimating excess hazard ratios and net survival when covariate data are missing: strategies for multiple imputation. Epidemiology 2015; 26: 421-8.
Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res 2015; 24: 462-97.
http://www.missingdata.org.uk/