Multiple Imputation of missing data in longitudinal electronic health records
Irene Petersen, PhD
Primary Care & Population Health

Introduction
• Senior Lecturer (Statistics and Epidemiology)
• Research team: epidemiologists, statisticians and PhD students
• Primary care databases
• 50+ studies, THIN and CPRD
• Research topics
  – Prescribed medicine in pregnancy
  – Mental health
  – Cardiovascular diseases
  – Infectious diseases
  – Methodological questions
    • Missing data
    • Confounding (by indication)
http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub/
Or just google THIN UCL

Funding and Acknowledgement
• James Carpenter
• Jonathan Bartlett
• Sarah Hardoon
• Louise Marston
• Richard Morris
• Irwin Nazareth
• Kate Walters
• Catherine Welch
• Ian White
Funded by Medical Research Council (MRC), UK

Today
• Missing data
• Different methods to deal with missing data
• Multiple imputation (MI) of missing data
• Multiple imputation in longitudinal records

Primary Care in United Kingdom
• Health care is free in the UK
• The vast majority (>95%) are registered with a general practice (family doctors + nurses)
• Primary care
  – General Practice
  – General medical care
  – Most prescriptions are issued in primary care
• Secondary care
  – Hospital
• Tertiary care
  – Specialist hospitals

The Health Improvement Network (THIN) (1)
• One of the UK’s largest primary care databases
• Anonymised records of 11 million patients in over 550 practices
• Medical diagnoses and symptoms, preventative measures, test results and immunisations, prescriptions, referrals to secondary care and free text information
• Demographic information e.g.
year of birth, sex, social deprivation (Townsend score)

The Health Improvement Network (THIN) (2)
• Broadly representative of the UK population (sex, age, size of practice and geographic distribution)
• 77 million years of patient data

Missing data in primary care records
Health indicators
• Blood pressure
• Weight
• Height
• Smoking
• Alcohol
• Cholesterol

How much data is missing 1 year after registration?
488 384 patients registered with a General Practitioner (GP) in 2004-06
• Missing data
  – Smoking 22%
  – Blood pressure 30%
  – Weight 34%
  – Alcohol 37%
  – Height 38%
Marston et al. Pharmacoepidemiology and Drug Safety 2010; 19: 618–626

Recording of weight
[Figure: weight measurement recorded (y-axis, 0 to 60) by year measurement recorded (x-axis, 1995 to 2011); one line per registration cohort: Registered 1995, 2000, 2005, 2010]

Recording of weight in diabetics and nondiabetics
[Figure: weight measurement recorded (y-axis, 0 to 80) by year measurement recorded (x-axis, 1995 to 2011); one line per registration cohort: Registered 1995, 2000, 2005, 2010; solid line - diabetes, dashed line - no diabetes]

Recording of weight by age and gender
[Figure: recording of weight (y-axis, 0 to 40) by age in years (x-axis, 16 to 100); separate lines for Male and Female]

So how much is missing?
• It depends…
• About 2/3 have health indicators in the first year after registration
• Most patients have at least one record while they are registered
• Differences by gender and age

Missing data mechanisms
• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Missing Not At Random (MNAR)

Missingness mechanisms
• ‘Missing completely at random’ (MCAR) – the reasons for the missing data are not associated with the observed or missing values (e.g.
not possible to measure blood pressure due to equipment failure)
• ‘Missing at random’ (MAR) – the reasons for the missing data are not associated with the values of the missing data conditional on the observed data (e.g. once you know someone’s age, their chance of having blood pressure recorded is independent of their blood pressure level)
• ‘Missing not at random’ (MNAR) – even given the observed data the reasons for the missing data are associated with the missing values (e.g. patients with a high blood pressure are more likely to have blood pressure measured)

Missing data mechanisms
• MCAR: Missingness of Y is independent of Y and X
• MAR: Missingness of Y is independent of Y given X
• MNAR: Missingness of Y depends on Y, even after conditioning on X
• Usually we cannot test these assumptions, but we can rule out MCAR when missingness is associated with observed variables

Recording of weight by age and gender
Missingness mechanism?
[Figure: recording of weight (y-axis, 0 to 40) by age in years (x-axis, 16 to 100); separate lines for Male and Female]

What should we do with the missing data?
• Complete case analysis
• Exclude variables with incomplete records
• Create missing data category
• (Multiple) Imputation

Complete case analysis
• Only include patients with complete records
• In some situations OK but…
• Discards vast amounts of (probably) useful information in the incomplete records
• Makes assumptions
  – Complete cases represent full dataset?
  – MCAR
• Reduces sample size
• Reduces power of study

Some real data…
• Cardiovascular risk in people with mental illnesses
• Sample of 42,213 people
• Risk factors:
  – Age
  – Sex
  – Smoking
  – Diabetes
  – Blood pressure

Risk of cardiovascular diseases in 42,213 people
Complete case analysis, N=3,736

                                  Hazard ratio (95% CI)    P
Age, years (per unit increase)    1.05 (1.04 to 1.06)      <0.001
Sex: Females v males              1.19 (0.86 to 1.66)      0.3
SBP, mmHg (per unit increase)     1.94 (1.33 to 2.82)      0.001
Diabetes: Yes v no                1.19 (0.82 to 1.71)      0.4
Smoking: Never                    1
         Ex                       1.77 (1.23 to 2.53)      0.002
         Current                  1.56 (1.07 to 2.28)      0.02

Other methods to deal with missing data
• Exclude variables with incomplete records
  – For example do the study without accounting for smoking
  – Study may be biased due to confounding
• Create missing data category
  – Mixed bag: results not meaningful
  – Severe bias can arise, in any direction
  – Variable will not correctly adjust for confounding

Missing data category
[Figure, shown on two slides: bp (y-axis, 80 to 200) plotted against x (x-axis, 0 to 4)]

Mean imputation
• Impute average values for missing data
• For example replace all missing blood pressure values with the population average measure (130/80)

Issues with mean imputation
Systolic blood pressure
• 10 000 observations
• 20% missing, set to 130 mmHg
[Figure: two histograms of frequency (y-axis, 0 to 3000) against blood pressure (x-axis, 50 to 200), labelled bp and bp2; panel annotations: Mean = 130, Variance = 256 and Mean = 130, Variance = 319]

Regression Imputation
• Fit a regression model
• Use all information available in existing data
• Provides a ‘best guess’

Health indicators
• Blood pressure
• Cholesterol
• Weight
• Height
• Smoking
• Alcohol

Predictors
• Age
• Gender
• Social deprivation
• Ethnicity
• Diseases/illness
• Medication

Issues with mean imputation and regression imputation
• Just ONE estimate for each missing value
• Methods do not account for uncertainty of the missing data
• Creates datasets with too little variation
  – (too narrow
confidence intervals)
• Biases results

Multiple imputation (MI) of missing data
• Builds on regression imputation, in two stages
• Stage 1: Create multiple copies of datasets
  – We will never know the true values of the missing data
  – Set of values, not just a single value
• Stage 2: Analyses of multiple imputed data
  – Estimates differ between individual datasets
  – Only useful when averaged together
• Implemented in SAS, Stata, R

MI – fully conditional specification (FCS)
Combine thousands of regression models…
Y1, Y2, Y3, x1, x2
1) Initially, impute missing values in Y1, Y2 and Y3 by randomly sampling from the observed values.
2) Impute missing values in Y1 depending on observed values in Y1 and imputed and observed values of Y2, Y3 and x1 and x2
3) Impute missing values in Y2 depending on observed values in Y2 and imputed and observed values of Y1, Y3 and x1 and x2
and so on…

Fully conditional specification (FCS) MI
• Breaks the problem down into individual regression models
  f(Y1 | Y1(obs), Y2, Y3, x1, x2)
  f(Y2 | Y1, Y2(obs), Y3, x1, x2)
  f(Y3 | Y1, Y2, Y3(obs), x1, x2)
• Each is a model for a single variable
• Logistic, linear model

FCS Multiple Imputation
• Builds on the Missing At Random assumption
• We need to think…

A few things to consider before doing MI
• Why are the data missing?
• What variables may explain missing data?
  – Age, gender, deprivation, diseases, drug treatment
• Clear idea of your subsequent data analysis
  – Outcome?
  – Do you expect any interactions?
• Outcome and interactions need to be considered in the imputation model!
Sterne et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338:b2393
Carpenter and Kenward. Multiple Imputation and its Application. 2013

Multiple imputation of longitudinal data
So far we have considered multiple imputation in a dataset without considering time
This may be fine and all you need to do
But what if we have longitudinal records?
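Before moving on to longitudinal records, the two MI stages and the FCS cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation found in SAS, Stata or R: the function names (`fit_line`, `fcs_impute`, `pool_rubin`) and the toy weight and blood pressure data are hypothetical, only two continuous variables are used, and each conditional model is a simple linear regression. A proper MI procedure would also draw the regression parameters from their posterior distribution; here only residual noise is added, so the uncertainty is somewhat understated. Stage 2 pools the per-dataset results with Rubin's rules (average the estimates; total variance = within-imputation variance + (1 + 1/m) × between-imputation variance).

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD of ys regressed on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5

def fcs_impute(y1, y2, cycles=10, rng=random):
    """Return one completed copy of (y1, y2); None marks a missing value."""
    cols = [list(y1), list(y2)]
    miss = [[v is None for v in col] for col in cols]
    # Step 1: fill the gaps by sampling from the observed values.
    for col in cols:
        observed = [v for v in col if v is not None]
        for i, v in enumerate(col):
            if v is None:
                col[i] = rng.choice(observed)
    # Steps 2, 3, ...: re-impute each variable in turn, given the other.
    for _ in range(cycles):
        for target, other in ((0, 1), (1, 0)):
            rows = [i for i, m in enumerate(miss[target]) if not m]
            a, b, sd = fit_line([cols[other][i] for i in rows],
                                [cols[target][i] for i in rows])
            for i, m in enumerate(miss[target]):
                if m:
                    cols[target][i] = a + b * cols[other][i] + rng.gauss(0, sd)
    return cols

def pool_rubin(estimates, variances):
    """Rubin's rules: pooled estimate and its standard error."""
    m = len(estimates)
    within = statistics.fmean(variances)       # average squared standard error
    between = statistics.variance(estimates)   # spread of estimates across copies
    total = within + (1 + 1 / m) * between
    return statistics.fmean(estimates), total ** 0.5

# Toy data (hypothetical): weight in kg and systolic blood pressure in mmHg,
# with about 30% of each variable removed completely at random.
rng = random.Random(1)
weight = [rng.gauss(75, 12) for _ in range(300)]
sbp = [100 + 0.4 * w + rng.gauss(0, 10) for w in weight]
weight_obs = [w if rng.random() > 0.3 else None for w in weight]
sbp_obs = [s if rng.random() > 0.3 else None for s in sbp]

# Stage 1: create m = 5 completed datasets.
copies = [fcs_impute(weight_obs, sbp_obs, rng=rng) for _ in range(5)]

# Stage 2: analyse each dataset (here simply the mean SBP) and pool the results.
means = [statistics.fmean(c[1]) for c in copies]
squared_ses = [statistics.variance(c[1]) / len(c[1]) for c in copies]
pooled_mean, pooled_se = pool_rubin(means, squared_ses)
print(round(pooled_mean, 1), round(pooled_se, 2))
```

The observed values are never altered; only the cells that were missing change between the five copies, and that between-copy variation is what Rubin's rules add to the standard error.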
Longitudinal health data
[Table: one row per variable (Smoking, Weight, Height, SBP, D) for each of patients A, B and C; one column per year, 2000 to 2003; sparsely filled, with many empty cells]
[Figure: cohort study timeline with baseline marked]

How should we deal with missing data at baseline?
Different options…
1. MI just at baseline
2. Develop a MI model with several time blocks
3. Do something else

Just use information from baseline year
• Many individuals don’t have information in that year, but may have info in a later or earlier year
• Lose information

Develop a MI model with several time blocks
f(Y1 | Y1(obs t1), Y1(t1-1), Y1(t1+1), Y2(t1), Y2(t1-1), Y2(t1+1), …)
[Figure: cohort study timeline, calendar time 2000 to 2008]

Develop a MI model with several time blocks
• This may be a good idea if we just have a few time blocks but
• Model may break down due to collinearity
• Equal weight to measurements taken years apart

Do something else - Two-fold FCS Multiple Imputation
• Mix between a MI at baseline and MI including all time blocks

Longitudinal multiple imputation – Twofold FCS algorithm
• Impute data at a given time block
• Use information available +/- one time block
• Move on to next time block
• Repeat procedure x times
Nevalainen J, Kenward MG, Virtanen SM. Stat Med 2009; 28(29):3657-3669.
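The four steps listed above can be sketched as nested loops: among-time iterations on the outside, a pass through the time blocks, and within-time FCS cycles at each block that only use values from blocks t-1, t and t+1. The Python below is a minimal sketch under stated assumptions and is not the Stata `twofold` programme: the names `ols`, `initialise` and `twofold`, the iteration counts and the toy data are hypothetical, both variables are continuous and modelled with linear regression, and regression parameters are not drawn from their posterior as a full implementation would do.

```python
import random

def ols(X, y):
    """Least-squares coefficients and residual SD (normal equations, Gauss-Jordan)."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * v for r, v in zip(X, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    rss = sum((v - sum(b * x for b, x in zip(beta, r))) ** 2 for r, v in zip(X, y))
    return beta, (rss / max(len(y) - p, 1)) ** 0.5

def initialise(data, rng=random):
    """data[var][person][t]; None is missing. Fill gaps by sampling observed values."""
    miss = [[[x is None for x in person] for person in var] for var in data]
    filled = [[list(person) for person in var] for var in data]
    for var in filled:
        for t in range(len(var[0])):
            observed = [p[t] for p in var if p[t] is not None]
            for p in var:
                if p[t] is None:
                    p[t] = rng.choice(observed)
    return filled, miss

def twofold(data, miss, among=3, within=3, rng=random):
    n_var, n, T = len(data), len(data[0]), len(data[0][0])
    for _ in range(among):                        # among-time iterations
        for t in range(T):                        # move on to the next time block
            window = [s for s in (t - 1, t, t + 1) if 0 <= s < T]
            for _ in range(within):               # within-time iterations
                for v in range(n_var):
                    # Predictors: every variable at t-1, t, t+1 except the target.
                    preds = [(u, s) for u in range(n_var) for s in window
                             if (u, s) != (v, t)]
                    X = [[1.0] + [data[u][i][s] for u, s in preds] for i in range(n)]
                    obs = [i for i in range(n) if not miss[v][i][t]]
                    beta, sd = ols([X[i] for i in obs], [data[v][i][t] for i in obs])
                    for i in range(n):
                        if miss[v][i][t]:
                            mu = sum(b * x for b, x in zip(beta, X[i]))
                            data[v][i][t] = mu + rng.gauss(0, sd)
    return data

# Toy data (hypothetical): weight and SBP for 200 people over 4 yearly blocks,
# with roughly half of the measurements missing in any given year.
rng = random.Random(2)
n, T = 200, 4
weight, sbp = [], []
for _ in range(n):
    w0 = rng.gauss(75, 12)
    s0 = 100 + 0.4 * w0 + rng.gauss(0, 8)
    weight.append([w0 + rng.gauss(0, 2) if rng.random() > 0.5 else None for _ in range(T)])
    sbp.append([s0 + rng.gauss(0, 5) if rng.random() > 0.5 else None for _ in range(T)])

filled, miss = initialise([weight, sbp], rng)
completed = twofold(filled, miss, rng=rng)   # one imputed dataset
```

Restricting the predictors to a window of one block either side keeps each regression small, which is the design choice that avoids the collinearity problem of a single model spanning all time blocks. Running the whole procedure m times would give the m datasets needed for multiple imputation.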
• Break the data into smaller (time) blocks (t)
• Calendar time or time since registration or time since date of birth
• Select width of time blocks
  – Year, month, data collection points… or
• Here we use calendar time and years as the width of our blocks

[Figure sequence: cohort study timeline, calendar time 2000 to 2008, with a window of blocks t–1, t, t+1 over 2000 to 2002; successive slides are labelled "Within time imputation" and "End of first Among time iteration"]

Two-fold FCS algorithm implemented in Stata
http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub/missing_data

Simple approach – but does it work?
• Simulation studies
  – Created 1000 datasets with NO missing data
  – Removed 70% of data in any given year
  – Simple cohort study (time to event model)
    • Risk of cardiovascular disease using baseline info
  – Compared results of:
    • complete case analysis
    • MI at baseline
    • Twofold FCS algorithm

Results of simulation studies
• Complete case analysis loses a lot of information
• MI at baseline recovers some information
• Twofold FCS algorithm gives the most precise estimates

Implications for research
• Twofold provides better use of the information available in longitudinal datasets
• Simulation studies suggest the two-fold FCS algorithm increases the precision of the estimates, equivalent to roughly doubling the sample size in some situations
• New opportunities for research!
  – Time dependent covariates

Back to some real data, before we finish
Risk of cardiovascular diseases in 42,213 people with mental illnesses

                                  Complete case, N=3,736    After MI, N=42,313
                                  Hazard ratio (95% CI)     Hazard ratio (95% CI)
Age, years (per unit increase)    1.05 (1.04 to 1.06)       1.06 (1.06 to 1.06)
Sex: Females v males              1.19 (0.86 to 1.66)       0.74 (0.68 to 0.81)
SBP, mmHg (per unit increase)     1.94 (1.33 to 2.82)       1.87 (1.67 to 2.09)
Diabetes: Yes v no                1.19 (0.82 to 1.71)       1.60 (1.38 to 1.86)
Smoking: Never                    1                         1
         Ex                       1.77 (1.23 to 2.53)       1.36 (1.24 to 1.50)
         Current                  1.56 (1.07 to 2.28)       1.55 (1.40 to 1.71)

What is next?
• Further testing of Two-fold FCS algorithm
  – Correlations between years
  – Categorical variables like smoking, alcohol
• Short course on missing data, 14-15 November 2013, UCL London
• Stata programme twofold
  http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub/missing_data

Further information:
http://missingdata.lshtm.ac.uk/
http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub/missing_data
i.petersen@ucl.ac.uk

Marston, L. et al. Issues in multiple imputation of missing data for large general practice clinical databases. Pharmacoepidemiol Drug Saf. 2010; 19(6):618-26.
Rubin, D. B. Inference and missing data. Biometrika 1976; 63:581–592.
Nevalainen, J. et al. Missing values in longitudinal dietary data: a multiple imputation approach based on a fully conditional specification. Stat Med. 2009; 28(29):3657-69.
Sterne et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338:b2393.
van Buuren, S.
Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 2007; 16:219–242.
Carpenter and Kenward. Multiple Imputation and its Application. 2013.