UK11_welch


Testing the performance of the two-fold FCS algorithm for multiple imputation of longitudinal clinical records

Catherine Welch 1 , Irene Petersen 1 , Jonathan Bartlett 2 ,

Ian White 3 , Richard Morris 1 , Louise Marston 1 , Kate Walters 1 ,

Irwin Nazareth 1 and James Carpenter 2

1 Department of Primary Care and Population Health, UCL

2 Department of Medical Statistics, LSHTM

3 MRC Biostatistics, Cambridge

Funding: MRC

The Health Improvement Network (THIN) primary care database

• GP records

• 9 million patients over 15 years in 450 practices

• Powerful data source for research into coronary heart disease (CHD)

• Studies complicated by missing data

• Up to 38% of health indicator measurements are missing in newly registered patients 1

1 Marston et al , 2010 Pharmacoepidemiology and Drug Safety

Partially observed data in THIN

• Missing data never intended to be recorded

• Data recorded at irregular intervals

• Non-monotone missingness pattern

Multiple Imputation (MI) and THIN

• Most MI designed for cross-sectional data

• Impute both continuous and discrete variables at many time points

– Standard ICE using Stata struggles with this

• New method developed by Nevalainen et al

– Two-fold fully conditional specification (FCS) algorithm

– Imputes each time point separately

– Uses information recorded before and after time point

Nevalainen et al , 2009 Statistics in Medicine

A graphical illustration of the two-fold FCS algorithm

Among-time iteration

Within-time iteration

Nevalainen et al , 2009 Statistics in Medicine

$f(X_{ij}^{mis} \mid X_{i-1}, X_{i,-j}, X_{i+1}, Y_{ij})$
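The nesting of among-time and within-time iterations can be sketched as below. This is a minimal illustration, not the authors' Stata implementation: it assumes continuous variables only, uses ordinary least squares plus normal noise instead of proper Bayesian draws, and assumes every variable is observed for at least one patient at each time point. The function name `two_fold_fcs` and its arguments are hypothetical.

```python
import numpy as np

def two_fold_fcs(x, y, n_among=3, n_within=10, window=1, seed=0):
    """Sketch of two-fold FCS for x of shape (patients, times, variables).

    Missing values are NaN; y is a fully observed outcome per patient.
    Assumption: simplified linear imputation model, continuous variables only.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x, dtype=float)          # work on a copy
    y = np.asarray(y, dtype=float)
    n, n_times, n_vars = x.shape
    missing = np.isnan(x)

    # Starting values: observed mean of each variable at each time point.
    start = np.nanmean(x, axis=0)
    x[missing] = np.broadcast_to(start, x.shape)[missing]

    for _ in range(n_among):              # among-time iterations
        for t in range(n_times):          # visit each time point in turn
            lo, hi = max(0, t - window), min(n_times - 1, t + window)
            nbrs = [s for s in range(lo, hi + 1) if s != t]
            for _ in range(n_within):     # within-time iterations
                for j in range(n_vars):
                    m = missing[:, t, j]
                    if not m.any() or m.all():
                        continue
                    others = [k for k in range(n_vars) if k != j]
                    # Predictors: intercept, outcome, other variables at t,
                    # and all variables at time points inside the window.
                    z = np.column_stack([
                        np.ones(n), y,
                        x[:, t, others],
                        x[:, nbrs, :].reshape(n, len(nbrs) * n_vars),
                    ])
                    beta, *_ = np.linalg.lstsq(z[~m], x[~m, t, j], rcond=None)
                    resid = x[~m, t, j] - z[~m] @ beta
                    # Residual SD without a degrees-of-freedom correction.
                    x[m, t, j] = z[m] @ beta + rng.normal(0.0, resid.std(), m.sum())
    return x
```

Running the function several times with different seeds would give several imputed datasets. Observed values are never overwritten; only the entries that were NaN on input are redrawn.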

Algorithm validation

• Nevalainen et al

– Proposed the two-fold FCS approach

– Validated algorithm using data sampled from case-control

– 3 time points included with a linear substantive model

• Our previous work

• Imputed data had accurate coefficients and acceptable level of variation in these settings

Simulation

• Before we apply the algorithm to THIN we want to test it in a complex setting similar to THIN

• Test algorithm in simulation study:

– Create 1000 full datasets

– Remove values

– Apply two-fold FCS algorithm

– Fit regression model for risk of CHD

• Full data

• Complete case data

• Imputed data

– Compare results
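The "compare results" step can be sketched as a summary over the repeated datasets. The helper below is illustrative only: the name `summarise_estimates` is hypothetical, and it assumes the "true" value is the coefficient used to generate the data.

```python
import numpy as np

def summarise_estimates(estimates, true_value):
    """Summarise one coefficient across simulated datasets.

    estimates: one estimated log risk ratio per simulated dataset.
    true_value: the value used to generate the data (assumed known).
    """
    est = np.asarray(estimates, dtype=float)
    return {
        "mean": est.mean(),
        "bias": est.mean() - true_value,      # average estimate minus truth
        "empirical_se": est.std(ddof=1),      # spread of estimates across datasets
    }
```

The same summary would be computed separately for the full, complete case and imputed analyses so that bias and precision can be compared side by side.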

Advantages of using simulated data

• We know the original distributions so we can compare with distribution of imputed data and test for bias

• Create different scenarios to test the algorithm

• Design data so it is close to THIN data

Simple dataset

• 5000 men, 10 years of data

• CHD diagnosis from 2000 – yes/no

• Age – 5 year age bands

• Smoking status recorded in 2000

– smokers, ex- and non-smokers

• Anti-hypertensive drug prescription – yes/no

• Systolic blood pressure (mmHg)

• Weight (kg)

• Townsend score quintile – 1 (least) to 5 (most)

• Registration – indicate if patient registered in 1999

Results from exponential regression model

• Outcome : Time to CHD

• Exposures in year 2000: age, Townsend score quintile, weight, blood pressure, smoking status, anti-hypertensive drug treatment, registration in 1999

• Analysis of 1000 datasets
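The slides name the model without writing it out. Under the usual parameterisation of exponential regression (an assumption here, with $t_i$ the time to CHD, $x_i$ the exposures in 2000 and $d_i$ an event indicator that equals 1 when CHD is observed), the hazard is constant over time and log-linear in the exposures:

\[
h(t \mid x_i) = \lambda_i = \exp(\beta_0 + x_i^\top \beta), \qquad S(t \mid x_i) = \exp(-\lambda_i t)
\]

\[
\ell(\beta_0, \beta) = \sum_i \left[ d_i \,(\beta_0 + x_i^\top \beta) - t_i \exp(\beta_0 + x_i^\top \beta) \right]
\]

Each element of $\beta$ is then the log ratio of hazards for a one unit increase in the corresponding exposure, which is what the tables report as the log risk ratio.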

Generated data results

Results of fitting exponential regression model

| Variable | THIN data: log risk ratio | Full simulated data: log risk ratio | Full simulated data: SE |
|---|---|---|---|
| Anti-hypertensive drug treatment | 0.2935 | 0.2868 | 0.0957 |
| Systolic blood pressure (mmHg) | 0.0048 | 0.0049 | 0.0026 |
| Weight (kg) | 0.0019 | 0.0019 | 0.0032 |
| Smoking status: Nonsmoker | Reference | Reference | |
| Smoking status: Exsmoker | 0.0679 | 0.0692 | 0.1074 |
| Smoking status: Current smoker | 0.2386 | 0.2385 | 0.1143 |

Adjusted for age, registration in 1999 and Townsend score quintile

70% missing completely at random (MCAR) missingness mechanism

• Missing data on blood pressure, weight, smoking

• In THIN:

– 30-70% missing in any given year

– E.g. 70% missing is equivalent to a health indicator being recorded approximately every 3 years

– If one variable is missing other variables also more likely to be missing
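A minimal sketch of imposing 70% MCAR missingness on a patient-by-year array follows. It is an illustration under simplifying assumptions: each value is deleted independently, so it does not reproduce the dependence between variables noted above, and the helper name `make_mcar` and the constant weight array are hypothetical.

```python
import numpy as np

def make_mcar(values, prop_missing, seed=0):
    """Set each value to NaN independently with probability prop_missing."""
    rng = np.random.default_rng(seed)
    out = np.array(values, dtype=float)            # copy, keep input intact
    out[rng.random(out.shape) < prop_missing] = np.nan
    return out

# Hypothetical example: 5000 men, 10 yearly weight measurements.
weights = np.full((5000, 10), 80.0)
masked = make_mcar(weights, 0.70)

# With 30% of years observed, the average gap between records is 1 / 0.3 years.
years_between_records = 1 / (1 - 0.70)
print(round(years_between_records, 1))  # 3.3
```

Because the deletion ignores both observed and unobserved values, the mechanism is MCAR by construction.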

70% MCAR results

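| Variable | THIN data: log risk ratio | Full data: log risk ratio | Full data: SE | Complete case: log risk ratio | Complete case: SE |
|---|---|---|---|---|---|
| Anti-hypertensive drug treatment | 0.2935 | 0.2868 | 0.0957 | 0.2852 | 0.1931 |
| Systolic blood pressure (mmHg) | 0.0048 | 0.0049 | 0.0026 | 0.0051 | 0.0055 |
| Weight (kg) | 0.0019 | 0.0019 | 0.0032 | 0.0015 | 0.0062 |
| Smoking status: Nonsmoker | Reference | Reference | | Reference | |
| Smoking status: Exsmoker | 0.0679 | 0.0692 | 0.1074 | 0.0633 | 0.2151 |
| Smoking status: Current smoker | 0.2386 | 0.2385 | 0.1143 | 0.2307 | 0.2299 |

Full data and complete case columns are from the simulated data.

Adjusted for age, registration in 1999 and Townsend score quintile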

Two-fold FCS algorithm

• Stata ICE – series of chained equations

• 3 among-time iterations, 10 within-time iterations

• Produce 3 imputed datasets

• 1 year time window

[Figure: timeline of time points i-3, i-2, i-1, i, i+1, i+2, i+3 illustrating the 1 year time window]

Imputing time-independent variables

• Algorithm designed to impute time-dependent variables and does not account for imputing time-independent variables

• Smoking status in 2000 is a time-independent variable

• Need to extend algorithm for this

Imputing time-independent variables

• For each among-time iteration, time-independent variables imputed first

Impute time-independent variables

• Algorithm will then cycle through the time points with smoking status included as an auxiliary variable.
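The order of operations in the extended algorithm can be sketched as a schedule of imputation steps. This only illustrates the ordering described on this slide; the function name `imputation_schedule` and the variable labels are hypothetical.

```python
def imputation_schedule(n_among, time_points, time_independent, time_dependent):
    """List (among-time iteration, time point, variable) steps in order.

    Time-independent variables are imputed first in each among-time
    iteration (time point None), then each time point is visited in turn.
    """
    steps = []
    for iteration in range(1, n_among + 1):
        for var in time_independent:
            steps.append((iteration, None, var))
        for t in time_points:
            for var in time_dependent:
                steps.append((iteration, t, var))
    return steps

schedule = imputation_schedule(
    n_among=2,
    time_points=[2000, 2001],
    time_independent=["smoking"],
    time_dependent=["sbp", "weight"],
)
```

In a full implementation each time-dependent step would itself be repeated for the chosen number of within-time iterations, with the current smoking status values used as predictors.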

Results following imputation

• We would expect to see similar log risk ratios to the THIN data

• The standard errors for variables with no missing data will be close to those from the full data

• The standard errors for variables with missing data will be smaller than those from the complete case analysis but will not recover to the size of those from the full data
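The slides do not state how estimates from the imputed datasets were combined. The sketch below assumes Rubin's rules, the standard approach for multiple imputation, and the input numbers in the usage line are made up for illustration. The function name `pool_rubin` is hypothetical.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputed datasets with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2
    m = q.size
    pooled = q.mean()                    # pooled point estimate
    within = u.mean()                    # average within-imputation variance
    between = q.var(ddof=1)              # between-imputation variance
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)

# Hypothetical estimates and SEs from 3 imputed datasets.
pooled_est, pooled_se = pool_rubin([0.28, 0.29, 0.30], [0.10, 0.10, 0.10])
```

The between-imputation term is why the pooled standard error for a variable with missing data stays above the full data standard error: it carries the extra uncertainty about the imputed values.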

Results following imputation

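| Variable | THIN data: log risk ratio | Full data: log risk ratio | Full data: SE | Complete case: log risk ratio | Complete case: SE | Imputed data: log risk ratio | Imputed data: SE |
|---|---|---|---|---|---|---|---|
| Anti-hypertensive drug treatment | 0.2935 | 0.2868 | 0.0957 | 0.2852 | 0.1931 | 0.2848 | 0.1066 |
| Systolic blood pressure (mmHg) | 0.0048 | 0.0049 | 0.0026 | 0.0051 | 0.0055 | 0.0050 | 0.0052 |
| Weight (kg) | 0.0019 | 0.0019 | 0.0032 | 0.0015 | 0.0062 | 0.0023 | 0.0053 |
| Smoking status: Nonsmoker | Reference | Reference | | Reference | | Reference | |
| Smoking status: Exsmoker | 0.0679 | 0.0692 | 0.1074 | 0.0633 | 0.2151 | 0.0654 | 0.2288 |
| Smoking status: Current smoker | 0.2386 | 0.2385 | 0.1143 | 0.2307 | 0.2299 | 0.2409 | 0.2453 |

Full data, complete case and imputed data columns are from the simulated data.

Adjusted for age, registration in 1999 and Townsend score quintile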

Correlations

• Previous results imply accurate imputations for missing data in 2000

• Alternative method required to assess imputations at the other time points:

– Assess correlations between measurements recorded at different times

• We would like the correlation structure of the generated data to be maintained in the imputed data at all time points
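The correlation check can be sketched as follows, assuming the weight measurements are held in a patient-by-year array whose first column is the year 2000. The function name `corr_with_baseline` is hypothetical.

```python
import numpy as np

def corr_with_baseline(weight):
    """Correlation of each year's weight with the first year's weight.

    weight: array of shape (patients, years) with no missing values,
    e.g. the full simulated data or one imputed dataset.
    """
    weight = np.asarray(weight, dtype=float)
    baseline = weight[:, 0]
    return np.array([
        np.corrcoef(baseline, weight[:, t])[0, 1]
        for t in range(weight.shape[1])
    ])
```

Applying the function to the full data and to each imputed dataset, then comparing the two curves year by year, shows whether imputation has weakened the dependence between measurements taken further apart.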

Correlations

[Figure: correlation (0.0 to 1.0) between weight measured in 2000 and weight measured in each year from 2000 to 2009. Series: full simulated data, imputed simulated data]

Increase time window

• Increased the time window to 2 and 3 years

• This slightly improves the estimates of coefficients and SE

[Figure: timelines of time points i-3, i-2, i-1, i, i+1, i+2, i+3 illustrating the 2 year time window and the 3 year time window]
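Which time points feed the imputation model for a given window width can be sketched with a small helper. The name `window_time_points` and the zero-based indexing are assumptions made for illustration.

```python
def window_time_points(i, width, n_times):
    """Indices of the time points within `width` of time point i, excluding i.

    The window is truncated at the first and last available time points.
    """
    lo = max(0, i - width)
    hi = min(n_times - 1, i + width)
    return [t for t in range(lo, hi + 1) if t != i]

# Middle time point of seven, matching the i-3 ... i+3 timeline.
one_year = window_time_points(3, 1, 7)
three_year = window_time_points(3, 3, 7)
```

A wider window adds more neighbouring measurements as predictors, which is the mechanism by which longer-range correlations can be better preserved, at the cost of a larger imputation model.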

Increase time window

[Figure: correlation (0.0 to 1.0) between weight measured in 2000 and weight measured in each year from 2000 to 2009. Series: full simulated data, and imputed data with 1 year, 2 year and 3 year time windows]

In summary

• The two-fold FCS algorithm gives unbiased imputations with:

– 70% missing data

– Exponential regression model, and

– MCAR missingness mechanisms

• The correlation structure is maintained as the time window increases

Discussion

• Algorithm effective because there is at least one measurement during follow-up

• Same results with MAR

• Future work…

– Introduce censoring

– Change smoking status to be time-dependent

– Interactions
