Multiple Imputation of missing data in longitudinal electronic health records

Irene Petersen, PhD
Primary Care & Population Health
Introduction
• Senior Lecturer (Statistics and Epidemiology)
• Research team epidemiologists/statisticians/PhD students
• Primary care databases 50+ studies
• THIN and CPRD
• Research topics
– Prescribed medicine in pregnancy
– Mental health
– Cardiovascular diseases
– Infectious diseases
– Methodological questions
• Missing data
• Confounding (by indication)
http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub/
Or just google THIN UCL
Funding and Acknowledgement
• James Carpenter
• Jonathan Bartlett
• Sarah Hardoon
• Louise Marston
• Richard Morris
• Irwin Nazareth
• Kate Walters
• Catherine Welch
• Ian White
Funded by Medical Research Council (MRC), UK
Today
• Missing data
• Different methods to deal with missing data
• Multiple imputation (MI) of missing data
• Multiple imputation in longitudinal records
Primary Care in United Kingdom
• Health care is free in the UK
• The vast majority (>95%) are registered with a general
practice (family doctors + nurses)
• Primary care – General Practice
– General medical care
– Most prescriptions are issued in primary care
• Secondary care – Hospital
• Tertiary care – Specialist hospitals
The Health Improvement Network (THIN) (1)
• One of the UK’s largest primary care databases
• Anonymised records of 11 million patients in over 550
practices
• Medical diagnoses and symptoms, preventative
measures, test results and immunisations, prescriptions,
referrals to secondary care and free text information
• Demographic information e.g. year of birth, sex, social
deprivation (Townsend score)
The Health Improvement Network (THIN) (2)
• Broadly representative of the UK population (sex,
age, size of practice and geographic distribution)
• 77 million years of patient data
Missing data in primary care records
Health indicators
• Blood pressure
• Weight
• Height
• Smoking
• Alcohol
• Cholesterol
How much data is missing 1 year after
registration?
488 384 patients registered with General
Practitioner (GP) in 2004-06
• Missing data
– Smoking 22%
– Blood pressure 30%
– Weight 34%
– Alcohol 37%
– Height 38%
Marston et al. Pharmacoepidemiology and Drug Safety 2010; 19: 618–626
Recording of weight
[Figure: weight measurement recorded (y-axis, 0 to 60) by year measurement recorded, 1995 to 2011 (x-axis); one line each for patients registered in 1995, 2000, 2005 and 2010]
Recording of weight in diabetics and nondiabetics
[Figure: weight measurement recorded (y-axis, 0 to 80) by year measurement recorded, 1995 to 2011 (x-axis); lines for patients registered in 1995, 2000, 2005 and 2010; solid line - diabetes, dashed line - no diabetes]
Recording of weight by age and gender
[Figure: recording of weight (y-axis, 0 to 40) by age in years, 16 to 100 (x-axis); separate lines for males and females]
So how much is missing?
• It depends….
• About 2/3 have health indicators in first year after
registration
• Most patients have at least one record while they
are registered
• Differences by gender and age
Missing data mechanisms
• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Missing Not At Random (MNAR)
Missingness mechanisms
• ‘Missing completely at random’ (MCAR)
– the reasons for the missing data are not associated with the
observed or missing values (e.g. Not possible to measure blood
pressure due to equipment failure)
• ‘Missing at random’ (MAR)
– the reasons for the missing data are not associated with the values
of the missing data conditional on the observed data (e.g. once you
know someone’s age, their chance of having blood pressure
recorded is independent of their blood pressure level)
• ‘Missing not at random’ (MNAR)
– even given the observed data the reasons for the missing data are
associated with the missing values (e.g. patients with a high blood
pressure are more likely to have blood pressure measured)
Missing data mechanisms
• MCAR: Missingness of Y is independent of Y and X
• MAR: Missingness of Y is independent of Y given X
• MNAR: Missingness of Y depends on Y, even
after conditioning on X
• Usually we cannot test these assumptions, but we can
rule out MCAR if missingness depends on the observed data.
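The three mechanisms can be made concrete with a small simulation. This is a minimal sketch using only the Python standard library; the age range, the blood pressure model and the missingness probabilities are invented for illustration and are not taken from THIN. Here X is age (always observed) and Y is blood pressure (sometimes missing).

```python
import random
import statistics

random.seed(1)

def simulate(mechanism, n=20000):
    """Return (all true blood pressures, the observed ones) under one mechanism."""
    true_bp, observed_bp = [], []
    for _ in range(n):
        age = random.uniform(20, 80)                 # X: always observed
        bp = 100 + 0.5 * age + random.gauss(0, 10)   # Y: may be missing
        if mechanism == "MCAR":
            p_missing = 0.3                          # same chance for everyone
        elif mechanism == "MAR":
            p_missing = 0.6 if age < 50 else 0.1     # depends on observed age only
        else:  # "MNAR"
            p_missing = 0.6 if bp < 125 else 0.1     # depends on the unseen value itself
        true_bp.append(bp)
        if random.random() >= p_missing:
            observed_bp.append(bp)
    return true_bp, observed_bp

# Bias of the complete-case mean of Y under each mechanism
bias = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    true_bp, observed_bp = simulate(mechanism)
    bias[mechanism] = statistics.mean(observed_bp) - statistics.mean(true_bp)
    print(mechanism, round(bias[mechanism], 2))
```

Under MCAR the mean of the observed values stays close to the true mean. Under MAR and MNAR it drifts, but only under MAR can the drift be explained, and so corrected, by conditioning on the observed age.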
Recording of weight by age and gender
Missingness mechanism?
[Figure: recording of weight (y-axis, 0 to 40) by age in years, 16 to 100 (x-axis); separate lines for males and females]
What should we do with the missing data?
• Complete case analysis
• Exclude variables with incomplete records
• Create missing data category
• (Multiple) Imputation
Complete case analysis
• Only include patients with complete records
• In some situations OK but…
• Discards vast amounts of (probably) useful information in
the incomplete records
• Make assumptions
– Complete cases represent full dataset?
– MCAR
• Reduce sample size
• Reduce power of study
Some real data…
• Cardiovascular risk in people with mental illnesses
• Sample of 42,213 people
• Risk factors:
– Age
– Sex
– Smoking
– Diabetes
– Blood pressure
Risk of cardiovascular diseases in 42,213 people
Complete case analysis (N=3,736)

                                  Hazard ratio (95% CI)    P
Age, years (per unit increase)    1.05 (1.04 to 1.06)      <0.001
Sex: Females v males              1.19 (0.86 to 1.66)      0.3
SBP, mmHg (per unit increase)     1.94 (1.33 to 2.82)      0.001
Diabetes: Yes v no                1.19 (0.82 to 1.71)      0.4
Smoking: Never                    1
         Ex                       1.77 (1.23 to 2.53)      0.002
         Current                  1.56 (1.07 to 2.28)      0.02
Other methods to deal with missing data
• Exclude variables with incomplete records
– For example do the study without accounting for
smoking
– Study may be biased due to confounding 
• Create missing data category
– Mixed bag – results not meaningful 
– Severe bias can arise, in any direction
– Variable will not correctly adjust for confounding
Missing data category
[Figure: two panels plotting blood pressure (bp, axis 80 to 200) against x (axis 0 to 4)]
Mean imputation
• Impute average values for missing data
• For example replace all missing blood pressure
values with population average measure (130/80)
Issues with mean imputation
Systolic blood pressure, 10 000 observations
20% missing, replaced with 130 mmHg
[Figure: two histograms of frequency (0 to 3000) against blood pressure (50 to 200), labelled bp and bp2. Original data: mean = 130, variance = 319. After mean imputation: mean = 130, variance = 256]
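The drop in variance from 319 to 256 is what mean imputation produces when 20% of values are replaced by the mean. The following is a minimal sketch with simulated data; the normal distribution and the completely-at-random deletion are assumptions made for illustration only.

```python
import random
import statistics

random.seed(2)

# Hypothetical systolic blood pressures: mean 130 mmHg, variance about 319
n = 10000
bp = [random.gauss(130, 319 ** 0.5) for _ in range(n)]

# Remove 20% of the values completely at random
missing = set(random.sample(range(n), n // 5))
observed = [v for i, v in enumerate(bp) if i not in missing]

# Mean imputation: every missing value gets the same observed mean
fill = statistics.mean(observed)
bp_imputed = [fill if i in missing else v for i, v in enumerate(bp)]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(bp_imputed)
print(round(var_observed), round(var_imputed))
```

The mean is unchanged, but the variance of the filled-in data is about 0.8 times the variance of the observed values, which matches the ratio 256/319 on the slide.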
Regression Imputation
• Fit a regression model
• Use all information
available in existing data
• Provides a ‘best guess’
Health indicators
• Blood pressure
• Cholesterol
• Weight
• Height
• Smoking
• Alcohol
Predictors
• Age
• Gender
• Social deprivation
• Ethnicity
• Diseases/illness
• Medication
Issues with mean imputation and regression
imputation
• Just ONE estimate for each missing value
• Methods do not account for uncertainty of the
missing data
• Creates datasets with too little variation
– (too narrow confidence intervals)
• Biased results
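The same loss of variation can be seen for regression imputation. This is a minimal sketch on simulated data (the age and blood pressure model and the 40% missingness are invented): a 'best guess' from the regression line is compared with a version that adds a random residual to each imputed value.

```python
import random
import statistics

random.seed(3)

# Hypothetical data: age always observed, blood pressure missing for 40% (MCAR here)
n = 5000
age = [random.uniform(20, 80) for _ in range(n)]
bp = [100 + 0.5 * a + random.gauss(0, 12) for a in age]
is_missing = [random.random() < 0.4 for _ in range(n)]

# Fit bp = intercept + slope * age by least squares on the complete cases
obs_age = [a for a, m in zip(age, is_missing) if not m]
obs_bp = [y for y, m in zip(bp, is_missing) if not m]
mean_age, mean_bp = statistics.mean(obs_age), statistics.mean(obs_bp)
slope = (sum((a - mean_age) * (y - mean_bp) for a, y in zip(obs_age, obs_bp))
         / sum((a - mean_age) ** 2 for a in obs_age))
intercept = mean_bp - slope * mean_age
residual_sd = statistics.stdev(
    [y - (intercept + slope * a) for a, y in zip(obs_age, obs_bp)]
)

# 'Best guess' imputation puts every imputed value exactly on the regression line
best_guess = [intercept + slope * a if m else y
              for a, y, m in zip(age, bp, is_missing)]

# Adding a random residual restores the scatter around the line
with_noise = [intercept + slope * a + random.gauss(0, residual_sd) if m else y
              for a, y, m in zip(age, bp, is_missing)]

var_true = statistics.variance(bp)
var_best_guess = statistics.variance(best_guess)
var_with_noise = statistics.variance(with_noise)
print(round(var_true), round(var_best_guess), round(var_with_noise))
```

Adding noise repairs the variance, but this is still a single imputation: the analysis would treat the imputed values as if they had been measured, so the uncertainty about them is still ignored.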
Multiple imputation (MI) of missing data
• Builds on regression imputation – two stages
• Stage 1 Create multiple copies of datasets
– We will never know the true values of the missing data
– Set of values – not just a single value
• Stage 2 Analyses of multiple imputed data
– Estimates different in individual datasets
– Only useful when averaged together
• Implemented in SAS, Stata, R
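Stage 2, averaging the estimates together, is usually done with Rubin's rules. Below is a minimal sketch; the function name `pool_rubin` and the five log hazard ratios and standard errors are hypothetical, and the confidence interval uses a normal approximation (1.96) rather than the t reference distribution that statistical packages typically apply.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Combine one estimate and its variance (SE squared) per imputed dataset."""
    m = len(estimates)
    pooled = statistics.mean(estimates)          # average of the m estimates
    within = statistics.mean(variances)          # average sampling variance
    between = statistics.variance(estimates)     # spread of estimates across datasets
    total = within + (1 + 1 / m) * between       # Rubin's total variance
    return pooled, math.sqrt(total)

# Hypothetical log hazard ratios and standard errors from m = 5 imputed datasets
log_hr = [0.43, 0.47, 0.41, 0.46, 0.44]
se = [0.050, 0.052, 0.049, 0.051, 0.050]

pooled_log_hr, pooled_se = pool_rubin(log_hr, [s ** 2 for s in se])
hazard_ratio = math.exp(pooled_log_hr)
ci = (math.exp(pooled_log_hr - 1.96 * pooled_se),
      math.exp(pooled_log_hr + 1.96 * pooled_se))
print(round(hazard_ratio, 2), [round(x, 2) for x in ci])
```

The pooled standard error is larger than any of the five individual standard errors because the between-dataset spread, which reflects uncertainty about the missing values, is added to the usual sampling variance.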
MI – fully conditional specification (FCS)
Combine thousands of regression models…..
Y1, Y2, Y3, x1, x2
1) Initially, impute missing values in Y1, Y2 and Y3 by randomly sampling
from the observed values.
2) Impute missing values in Y1 depending on observed values in Y1 and
imputed and observed values of Y2, Y3 and x1 and x2
3) Impute missing values in Y2 depending on observed values in Y2 and
imputed and observed values of Y1, Y3 and x1 and x2 and so on…
Fully conditional specification (FCS) MI
• Breaks the problem down into individual
regression models
f(Y1|Y1(obs), Y2, Y3, x1, x2)
f(Y2|Y1, Y2 (obs), Y3, x1, x2)
f(Y3|Y1, Y2, Y3 (obs), x1, x2)
• Each is a model for a single variable
• Logistic, linear model
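The cycle of conditional models can be sketched in a few lines. This is a simplified illustration, not the implementation in SAS, Stata or R: it uses only two incomplete variables (so each conditional model has a single predictor), simulated data, and it adds residual noise without also drawing the regression parameters, which a full MI procedure would do. The names `fit_line` and `impute_once` are made up for this example.

```python
import random
import statistics

random.seed(4)

def fit_line(x, y):
    """Least-squares intercept, slope and residual SD for y regressed on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    sd = statistics.stdev([b - (intercept + slope * a) for a, b in zip(x, y)])
    return intercept, slope, sd

def impute_once(y1, y2, cycles=10):
    """One completed copy of (y1, y2); None marks a missing value."""
    n = len(y1)
    miss1 = [v is None for v in y1]
    miss2 = [v is None for v in y2]
    # Step 1: start by filling gaps with random draws from the observed values
    pool1 = [v for v in y1 if v is not None]
    pool2 = [v for v in y2 if v is not None]
    c1 = [random.choice(pool1) if m else v for v, m in zip(y1, miss1)]
    c2 = [random.choice(pool2) if m else v for v, m in zip(y2, miss2)]
    for _ in range(cycles):
        # Step 2: regress Y1 on Y2 where Y1 was observed, then re-impute Y1
        rows = [i for i in range(n) if not miss1[i]]
        a, b, sd = fit_line([c2[i] for i in rows], [c1[i] for i in rows])
        c1 = [a + b * c2[i] + random.gauss(0, sd) if miss1[i] else c1[i]
              for i in range(n)]
        # Step 3: regress Y2 on the updated Y1, then re-impute Y2
        rows = [i for i in range(n) if not miss2[i]]
        a, b, sd = fit_line([c1[i] for i in rows], [c2[i] for i in rows])
        c2 = [a + b * c1[i] + random.gauss(0, sd) if miss2[i] else c2[i]
              for i in range(n)]
    return c1, c2

# Hypothetical data: Y2 depends on Y1 with slope 0.7; 30% of each is missing
n = 2000
true_y1 = [random.gauss(0, 1) for _ in range(n)]
true_y2 = [0.7 * v + random.gauss(0, 0.7) for v in true_y1]
y1 = [None if random.random() < 0.3 else v for v in true_y1]
y2 = [None if random.random() < 0.3 else v for v in true_y2]

# Stage 1 of MI: several completed datasets, each from an independent run
completed = [impute_once(y1, y2) for _ in range(5)]
slopes = [fit_line(c1, c2)[1] for c1, c2 in completed]
print([round(s, 2) for s in slopes])
```

Observed values are never changed; only the gaps are redrawn in each cycle. The slope of Y2 on Y1 differs slightly between the five completed datasets, which is the variation that the pooling stage uses.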
FCS Multiple Imputation
• Builds on the Missing At Random assumption
• We need to think….
A few things to consider before doing MI
• Why are the data missing?
• What variables may explain missing data?
– Age, gender, deprivation, diseases, drug treatment
• Clear idea of your subsequent data analysis
– Outcome?
– Do you expect any interactions?
• Outcome and interactions need to be considered in the imputation
model!
Sterne et al. Multiple imputation for missing data in epidemiological
and clinical research: potential and pitfalls BMJ 2009; 338:b2393
Carpenter and Kenward Multiple Imputation and its Application 2013
Multiple imputation of longitudinal
data
So far we have considered multiple imputation in a
dataset without considering time
This may be fine and all you need to do
But what if we have longitudinal records?
Longitudinal health data
[Table: example longitudinal records for patients A, B and C, with one row each for Smoking, Weight, Height, SBP and D, and one column per year from 2000 to 2003; only some cells hold a value, the rest are empty]
Cohort study
Baseline
How should we deal with missing data at baseline?
Different options…
1. MI just at baseline
2. Develop a MI model with several time blocks
3. Do something else
Just use information from baseline year
• Many individuals don’t have information in that
year, but may have information in a later or earlier year
• Lose information
Develop a MI model with several time blocks
f(Y1|Y1(obs t1), Y1(t1 - 1), Y1(t1 + 1), Y2(t1), Y2(t1-1),Y2(t1+1)….)
[Figure: cohort study timeline over calendar time, 2000 to 2008]
Develop a MI model with several time blocks
• This may be a good idea if we just have a few
time blocks, but
• Model may break down due to collinearity
• Equal weight is given to measurements taken years apart
Do something else - Two fold FCS Multiple
Imputation
• Mix between a MI at baseline and MI including all
time blocks
Longitudinal multiple imputation – Twofold
FCS algorithm
• Impute data at a given time block
• Use information available +/- one time block
• Move on to next time block
• Repeat procedure x times
Nevalainen J, Kenward MG, Virtanen SM. Stat Med 2009; 28(29):3657-3669.
• Break the data into smaller (time) blocks (t)
• Calendar time or time since registration or time
since date of birth
• Select width of time blocks
– Year, month, data collection points….or
• Here we use calendar time and years as width of
our blocks
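The loop structure of the two-fold idea can be sketched as follows. This is a simplified illustration and not the Stata programme: it has a single variable (yearly weight) per time block, so the within-time step reduces to one regression on the neighbouring blocks, whereas the full algorithm cycles through several variables within each block. The data are simulated, and the names `ols` and `twofold` are made up for this example.

```python
import random
import statistics

random.seed(5)

def ols(rows, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * v for x, v in zip(X, y))]
         for i in range(k)]
    for c in range(k):                             # Gauss-Jordan elimination
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

def twofold(data, among_iterations=10):
    """Complete a patients-by-years table; None marks a missing value."""
    n, T = len(data), len(data[0])
    missing = [[v is None for v in row] for row in data]
    cur = [row[:] for row in data]
    # Starting values: random draws from values observed in the same year
    for t in range(T):
        pool = [data[i][t] for i in range(n) if not missing[i][t]]
        for i in range(n):
            if missing[i][t]:
                cur[i][t] = random.choice(pool)
    for _ in range(among_iterations):              # repeat the whole sweep
        for t in range(T):                         # one time block at a time
            window = [s for s in (t - 1, t + 1) if 0 <= s < T]  # +/- one block
            seen = [i for i in range(n) if not missing[i][t]]
            beta = ols([[cur[i][s] for s in window] for i in seen],
                       [cur[i][t] for i in seen])

            def predict(i):
                return beta[0] + sum(b * cur[i][s] for b, s in zip(beta[1:], window))

            sd = statistics.stdev([cur[i][t] - predict(i) for i in seen])
            for i in range(n):
                if missing[i][t]:
                    cur[i][t] = predict(i) + random.gauss(0, sd)
    return cur

# Hypothetical yearly weights (kg) for 400 patients over 6 years, 30% missing
n, T = 400, 6
observed = []
for _ in range(n):
    level = random.gauss(75, 12)
    observed.append([None if random.random() < 0.3 else level + random.gauss(0, 3)
                     for _ in range(T)])

completed = twofold(observed)
print(round(statistics.mean(v for row in completed for v in row), 1))
```

Each block borrows information only from the blocks immediately before and after it, which keeps every regression small. Repeating the sweep lets information from more distant years reach a block indirectly through its neighbours.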
[Figure sequence: cohort study timeline over calendar time, 2000 to 2008. The first panels label 2000, 2001 and 2002 as t–1, t and t+1 and mark the within-time imputation; the last panel marks the end of the first among-time iteration]
Two-fold FCS algorithm implemented in Stata
http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub/missing_data
Simple approach – but does it work?
• Simulation studies
– Created 1000 datasets with NO missing data
– Removed 70% of data in any given year
– Simple cohort study (time to event model)
• Risk of cardiovascular disease using baseline info
– Compared results of:
• complete case analysis
• MI at baseline
• Twofold FCS algorithm
Results of simulation studies
• Complete case analysis loses a lot of information
• MI at baseline recovers some information
• Twofold FCS algorithm gives most precise
estimates
Implications for research
• Twofold provides better use of the information
available in longitudinal datasets
• Simulation studies suggest the two-fold FCS algorithm
increases the precision of the estimates, equivalent to
roughly doubling the sample size in some situations
• New opportunities for research!
– Time dependent covariates
Back to some real data, before we finish
Risk of cardiovascular diseases in 42,213 people with
mental illnesses
Hazard ratio (95% CI)

                                  Complete case          After MI
                                  N=3,736                N=42,213
Age, years (per unit increase)    1.05 (1.04 to 1.06)    1.06 (1.06 to 1.06)
Sex: Females v males              1.19 (0.86 to 1.66)    0.74 (0.68 to 0.81)
SBP, mmHg (per unit increase)     1.94 (1.33 to 2.82)    1.87 (1.67 to 2.09)
Diabetes: Yes v no                1.19 (0.82 to 1.71)    1.60 (1.38 to 1.86)
Smoking: Never                    1                      1
         Ex                       1.77 (1.23 to 2.53)    1.36 (1.24 to 1.50)
         Current                  1.56 (1.07 to 2.28)    1.55 (1.40 to 1.71)
What is next?
• Further testing of Two-fold FCS algorithm
– Correlations between years
– Categorical variables like smoking, alcohol
• Short course on missing data 14 -15 November
2013, UCL London
• Stata programme twofold
http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub/missing_data
Further information:
http://missingdata.lshtm.ac.uk/
http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub/missing_data
i.petersen@ucl.ac.uk
Marston L. et al. Issues in multiple imputation of missing data for large general practice clinical databases. Pharmacoepidemiol Drug Saf 2010; 19(6):618-626.
Rubin DB. Inference and missing data. Biometrika 1976; 63:581-592.
Nevalainen J. et al. Missing values in longitudinal dietary data: a multiple imputation approach based on a fully conditional specification. Stat Med 2009; 28(29):3657-3669.
Sterne et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338:b2393.
van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res 2007; 16:219-242.
Carpenter and Kenward. Multiple Imputation and its Application. 2013.