Missing Data: Where has my data gone?
Peter T. Donnan
Professor of Epidemiology and Biostatistics

The Tao of Missingness
“The inside and the outside are one” (Zen philosopher)
“Nothing is more real than nothing” (Samuel Beckett)

Overview
• Why missing data matters
• Some useful definitions
• Practical issues
• Methods for imputation

Missing data is inevitable!
• Trials or observational studies are set up to obtain complete data from everyone
• Multiple reminders for questionnaire data
• Important to distinguish valid unknown, not applicable, lost to follow-up, etc.
• It’s not missing, it’s unknown!
• Despite investigators’ best efforts missing data is inevitable
• The key is to minimise loss of data in the first place

Why does data go missing?
• Poor trial management, lack of follow-up
• Patients have Adverse Events (AE) and drop out
• Patients fail to attend clinic / fill in questionnaire
• Migrate with no information available (They don’t write, they don’t call!)
• Leave study for no apparent reason

Some real examples of reasons for missing data
• “Emergency Christmas shopping” (reason for missed visit, early November)
• “The drugs will interfere with my drinking” (reason for eligible pt saying No to trial)
• “No you can’t come and see me: I’m better” (pt dropping out at V3)
• Changed address and/or phone number rendered pts untraceable (more frequent in the West)
• Two pts co-operated but refused photographs, one on religious grounds (despite giving consent)

Does it matter?
• Missing data can seriously damage a study’s credibility
• Two main problems:
  1. May introduce bias
  2. Reduces power

Note that even worse in regression:
[Table: example dataset for 8 patients (ID 1-8) with columns BMI, HbA1c, LDL, Chol and HDL. Patient 4 has all five values (28.3, 11.3, 5.4, 6.1, 0.7); most other patients have at least one value missing.]
• Pairwise comparisons leave out 38%+
• So two-group comparisons not too bad
• Regression or any other multidimensional analysis leaves out 75% of data: COMPLETE-CASE ONLY ANALYSIS

Practical Tip 1
• Complete Case analysis is where the missing data problem is ignored
• Patients with missing data are excluded
• This will be obvious from the constructed tables
• The n in the tables reporting the analysis will be less than the N enrolled
• Even worse, the dataset used may differ by outcome as n may change

Practical Tip 1
• A useful and informative procedure is to create a table comparing the characteristics of the complete-case dataset and those missing, e.g.

Factor      Complete Cases   Missing at 8 weeks
Mean Age    32               50
Mean BMI    19               28
% Male      50%              65%

One Solution? Missing-indicator method
• Code all missing as unknown and include unknown category in regression model (Mea culpa!)
• Advantage that no subject excluded
• Difficult to interpret
• Does not deal with main issue of potential BIAS
• In fact, it will add bias
• Fudge rather than solution

Example: Unknown stage (n=40/476) in Cox PH model for colorectal cancer
Variables in the Equation

Variable    B       SE     Wald      df   Sig.   Exp(B)
age         .022    .007   11.000    1    .001   1.022
sexnum      .031    .122   .064      1    .800   1.031
dukes                      114.441   4    .000
dukes(1)    .183    .443   .170      1    .680   1.200
dukes(2)    .882    .423   4.344     1    .037   2.415
dukes(3)    1.961   .423   21.461    1    .000   7.106
dukes(4)    1.427   .448   10.144    1    .001   4.166
cscore      .033    .020   2.815     1    .093   1.034
hyperco     .369    .136   7.386     1    .007   1.446

N.b. Effects of known stages are now biased
Exp(B) is the hazard ratio (HR); the slide highlights the HR for Unknown Stage vs. Stage A

Imputation: Another Solution
• Impute missing values and then carry out analysis with complete dataset
• Advantage that no subject excluded
• Many methods of estimating the missing values:
  1. LVCF (LOCF): Last Value Carried Forward
  2. Mean or median value of measurements
  3. Expected value based on regression
  4. Expected value based on E-M algorithm

Some notation
• Yobs: observed data
• Ymiss: missing data
• R: missing data indicator; R = 1 indicates data observed, R = 0 missing
• Prob [R = 0 | Yobs]: probability of data being missing given values of observed data

Some very difficult, opaque, but essential definitions (1)
Missing Completely at Random (MCAR)
• Prob (Missing) is independent of both: 1) observed data and 2) unobserved data
• Essentially observed data is a random sample of full data
• MCAR is what everyone falsely assumes!
• If MCAR is assumed, observed-case or complete-case analysis is valid
• Observed-case analysis is software default!

Representation of R as a stratification factor for responses

Response Indicators   Response Vector
R1  R2  R3  R4        Y1  Y2  Y3  Y4
1   1   1   1         y   y   y   y
1   0   1   1         y   *   y   y
1   1   0   0         y   y   *   *
(* marks a missing response)

For MCAR: Prob [R = 0 | Yobs, Ymiss, X] = Prob [R = 0 | X]
Possible to test for MCAR

Park-Lee* test for MCAR
• Within framework of GEE (Liang and Zeger)
• Define indicator variables for each missing data pattern
• Fit model with indicators as covariates
• Test regression coefficients for indicators; if significant, missing data mechanism is not MCAR
*Park T and Lee S-Y. A test of missing completely at random for longitudinal data with missing observations.
Statist Med 1997; 16: 1859-1871

Example of Park-Lee* test for MCAR
Fit three indicator variables

Missing data pattern   Wave 1   Wave 2   Wave 3
0                      O        O        O
1                      O        M        O
2                      O        O        M
3                      O        M        M
(O = observed, M = missing)

Covariate   Est/SE
I1          0.65
I2          2.03*
I3          3.51*

Ik = 1 if missing pattern k, = 0 otherwise
For overall test p = 0.0023

Examples: MCAR
• Six Cities Air Pollution Study: children changed schools because of parents, so unrelated to health of children
• In a trial, a Practice changed computer system, so missing observations not related to previous observed or future values

Practical Tip 2
• Check data for MCAR (note SPSS carries out Little’s test)
• If assumption seems reasonable, analyse using complete-case only with impunity
• If missing data constitutes < 5%, probably reasonable to assume MCAR
• If not, complete-case analysis is likely to be biased
• N.b. MCAR not that common

Another essential definition
Missing At Random (MAR)
• Prob (Missing) is: 1) independent of unobserved data, given the observed data, but 2) dependent on observed data
• Essentially observed data is a random sample of full data in each stratum
• MAR is a weaker assumption than MCAR
• If MAR is assumed, many methods are possible to impute data using observed data

Missing At Random (MAR)
• Prob (missing) depends on Yobs but not on missing Ymiss
• Prob [R = 0 | Yobs, Ymiss, X] = Prob [R = 0 | Yobs, X]
• MCAR is a special case of MAR
• Use the fact that the missing Y for a person with a given age, gender, BP, chol, BMI, etc. will be similar to that of a person with the same characteristics who does have the outcome
• Allows imputation methods based on observed data, e.g. mean, regression

Examples: MAR
• Six Cities Air Pollution Study: children moved out of area because of non-respiratory problems (e.g.
type 1 diabetes)
• Men less likely to attend for follow-up visit, but not related to values of their likely outcomes
• Repeated measures where missingness is not related to the values that would have been obtained

Single Imputation
• Most common approach is to impute each missing value with the mean of the observed values
• Takes no account of differences related to other factors, e.g. HbA1c
• Takes no account of uncertainty in estimating missing value
• Makes clinicians uneasy!
[Table: the example dataset (ID 1-8; BMI, HbA1c, LDL, Chol, HDL) with the missing BMI values for patients 3 and 5 replaced by the mean of the observed BMI values, 31.2]

Single Imputation
• Common method in longitudinal data
• Last Value Carried Forward (LVCF or LOCF)
• Common in RCTs
• Some journals and even FDA endorse
• But statistically unsound unless strong and unrealistic assumptions met (see LSHTM website)

ID   Baseline   4 weeks   8 weeks
1    15         13        13
2    29         32        32
3    43         43        43
4    32         29        25
5    19         36        26
6    10         10        13
7    31         25        20
8    19         19        18

Examples: Single Imputation
• Last Value Carried Forward (LVCF or LOCF) very common in RCTs
• Adalimumab in severe Crohn’s disease: nearly 50% of patients were lost to follow-up at 52 weeks in one trial and LVCF used (but relapsing-remitting condition!)
• But legitimate use in Bell’s Palsy Trial!
• No disagreement among statisticians that method is unsound

Solution is Multiple Imputation!
1. Assumes data MAR
2. Missing data filled in m times
3. The m complete datasets are each analysed by using standard procedures
4. The results for the m complete datasets are combined for inference
[Figure: several overlapping copies of the longitudinal table (ID; Baseline, 4 weeks, 8 weeks), each completed with a different set of imputed values]

Multiple Imputation (MI)
• Process derived by Donald Rubin (1987)
• Replace each missing value with a set of plausible values that also represent the uncertainty about the correct value
• Requires MAR assumption but NOT MCAR
• Many methods of estimating imputed values: 1) regression, 2) propensity score, 3) MCMC

Step 1: Multiple Imputation Methods
1) Regression: missing values predicted by regression model of previous values and covariates
• Fit model Xβ using any variables available (previous values and covariates)
• Repeat if further follow-up results missing
• Extract predicted value and save new dataset with predicted value inserted

Missingness Model
• How do I choose what factors to use in predicting imputed values?
• All factors related to outcome (i.e. all Xs)
• Plus, importantly, the outcome
• Any other factors possibly related to the reason for being missing
• Better to be overly inclusive; statistical significance is not important

Multiple Imputation: A Cautionary Tale
• Hippisley-Cox et al, BMJ 2007 developed a risk algorithm for CVD called QRISK
• 70% of cholesterol values were missing and imputed using MI assuming data MAR
• Found NO association between CVD and cholesterol
• Investigation showed they had not used CVD outcome in the imputation model
• When rectified, ‘true’ association found!

Step 1: Multiple Imputation Methods
2) Propensity score
• Create indicator variable R=0 for missing
• Fit logistic model Xβ of propensity to be missing (R=0).
• Divide observations by quintiles of propensity score
• Allow random draws (~Bayesian bootstrap) of values from observed data in matching quintile to fill in missing data

Step 1: Multiple Imputation Methods
3) Markov chain Monte Carlo (MCMC)
• Imputation draws from conditional distribution of Ymiss | Yobs
• Posterior step simulates posterior mean and covariance matrix
• New estimates used iteratively in imputation step
• Process converges (hopefully)
• Incorporates EM algorithm

Step 1: Multiple Imputation
• All available in PROC MI in SAS software, which creates m datasets
• Now available in SPSS v. 17
• Note SPSS carries out Little’s test for MCAR
• S-plus: some functions
• Stata has full set of programs for MI

How many (m) datasets do I need?
Too many leads to data management problem
Relative efficiency of using finite m imputations is given by
$RE = (1 + \lambda/m)^{-1}$
where λ is the fraction of missing information

RE by number of imputations m (rows) and fraction of missing information λ (columns):
m     10%    20%    30%    50%    70%
3     0.97   0.94   0.91   0.86   0.81
5     0.98   0.96   0.94   0.91   0.88
10    0.99   0.98   0.97   0.95   0.93
20    0.99   0.99   0.98   0.98   0.97

Step 2: Multiple Imputation
• Analyse the now complete datasets in standard way
• T-test, Regression, Survival, Logistic, GLM, Mixed model, etc.
• Creates a set of parameter estimates for each of m datasets

Step 3: Multiple Imputation
• Combine results from m datasets
• Standard way is to calculate the mean and variance of the parameter estimate:
$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i$
Let $\bar{U}$ be the within-imputation variance and $B$ the between-imputation variance; then the total variance $T$ is
$T = \bar{U} + \left(1 + \frac{1}{m}\right) B$

Step 3: Multiple Imputation
• Relatively easy, and fortunately SAS has a procedure to implement this called PROC MIANALYZE
• Good documentation
• SPSS now does this step in v.17!
• MI now considered gold standard methodology for drawing valid inferences in the face of missing data (with MAR)
• Still many people wary

Alternative Solution: Weighting
• Weight observed data to take account of under-representation of certain response profiles
• Does not involve imputation but assumes MAR
• First proposed in sample survey literature
• Relatively easy as most standard programs allow addition of weighting factor
• Requires an estimated probability of being observed, wi, and then complete case analysis weighted by 1/wi

Alternative Solution: Weighting
• Estimate wi = Pr [R = 1 | Yobs, X], the probability that the data are observed
• Repeat for multiple time points
• Analyse complete cases weighted by 1/wi
• Example: GEE with MAR
• Intuitively appealing: observed people who resemble those with missing data are given more weight

Practical Tip 3
• If we assume MAR, method of MI provides means of valid inference
• Comprehensive software in SAS and now SPSS
• Other software incorporates it as standard (Stata)
• Consider weighting method as intuitively appealing

Another essential definition
Missing Not At Random (MNAR)
• Prob (Missing) is dependent on both: 1) unobserved data and 2) observed data
• Often referred to as nonignorable missing mechanism or informative missingness
• MNAR is completely unverifiable from the data
• Need to assess the sensitivity of results to different plausible explanations
• All standard methods are NOT valid
• Ongoing area of research in statistical methods

Examples: MNAR
• QOL missing in those with low quality of life, so missingness related to what QOL might have been
• Measurement of weight loss more likely to be missing if weight loss likely to be low

Missing Not At Random (MNAR)
One method uses Structural Equation Modelling (SEM)
• Requires specialist software

Summary
• Consider hierarchy of missing data: MCAR, MAR, MNAR
• Ideal is to use MI or weighting methods if MAR
• Tools now in SPSS
• Need to model missingness mechanism jointly with analysis of outcome if MNAR
• Complete case analysis needs to be justified!
• LVCF needs to be justified!

Summary
“…it is time to place CC analysis and simple imputation methods, in particular LOCF, in the Museum of Statistical Science..”
Geert Molenberghs, Editorial, JRSS A, 2007: 861-863

References
• LSHTM website on missing data, sponsored by ESRC (www.lshtm.ac.uk/missingdata/start.html)
• Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006; 59: 1087-91.
• Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338: b2393.
• Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ 2007; 335: 136.
• Little RJA and Rubin DB (1987). Statistical Analysis with Missing Data. John Wiley and Sons, New York.
• Dempster AP, Laird NM and Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B 1977; 39: 1-38.
• Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.
• Yuan YC. Multiple imputation for missing data: concepts and new development. SAS Institute Inc (P267-25).
• Software Documentation for SAS®, S-PLUS® and SPSS®.
• R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

The Tao of Missingness
“There are known knowns.
These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.” Donald Rumsfeld
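
The relative-efficiency formula from "How many (m) datasets do I need?" and the combining rules from Step 3 are simple enough to check by hand. The following is a minimal Python sketch of that arithmetic, not the SAS or SPSS implementation. The function names and the three example estimates are hypothetical, chosen only for illustration, and the between-imputation variance is assumed to use the usual m - 1 divisor.

```python
from statistics import mean, variance


def relative_efficiency(lam: float, m: int) -> float:
    """RE = (1 + lambda/m)^-1, with lam the fraction of missing information."""
    return 1.0 / (1.0 + lam / m)


def pool_rubin(estimates, within_variances):
    """Combine m estimates and their variances (the Step 3 combining rules).

    Returns (Q_bar, U_bar, B, T) where T = U_bar + (1 + 1/m) * B.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(within_variances)   # within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    total = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, u_bar, b, total


if __name__ == "__main__":
    # Hypothetical estimates from m = 3 imputed datasets (illustrative only).
    q, u, b, t = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
    print(f"pooled={q:.3f} within={u:.3f} between={b:.3f} total={t:.4f}")
    print(f"RE for lambda=0.5, m=5: {relative_efficiency(0.5, 5):.2f}")
```

With λ = 0.5 and m = 5 the function reproduces the 0.91 shown in the relative-efficiency table, which is one way to see why a handful of imputations is often treated as adequate.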
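
The weighting alternative can be illustrated in the same spirit. This sketch assumes the weight for each complete case is the inverse of the estimated probability of being observed, and it estimates that probability as the observed fraction within a stratum, a simple stand-in for the logistic model a real analysis would fit. The function name and the data are hypothetical.

```python
from collections import defaultdict


def ipw_mean(records):
    """Inverse-probability-weighted mean of an outcome.

    records: list of (stratum, outcome) pairs; outcome is None when missing.
    Pr[observed | stratum] is estimated by the observed fraction in the stratum.
    """
    enrolled = defaultdict(int)
    observed = defaultdict(int)
    for stratum, y in records:
        enrolled[stratum] += 1
        if y is not None:
            observed[stratum] += 1

    weighted_sum = 0.0
    weight_total = 0.0
    for stratum, y in records:
        if y is None:
            continue  # only complete cases contribute to the analysis
        p_obs = observed[stratum] / enrolled[stratum]
        weight = 1.0 / p_obs  # under-represented strata count for more
        weighted_sum += weight * y
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no observed outcomes to weight")
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Hypothetical data: half of the men are missing, all women are observed.
    data = [("M", 10.0), ("M", 12.0), ("M", None), ("M", None),
            ("F", 20.0), ("F", 22.0), ("F", 18.0), ("F", 20.0)]
    complete = [y for _, y in data if y is not None]
    print("complete-case mean:", sum(complete) / len(complete))
    print("weighted mean:     ", ipw_mean(data))
```

In this made-up example the men are under-represented among the complete cases, so each observed man is counted twice and the weighted mean moves towards the value expected had everyone responded.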
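
The objection that mean imputation "takes no account of uncertainty" can also be seen numerically. The sketch below uses the BMI column as read from the example table, on the assumption that patients 3 and 5 are the ones with missing BMI. The function name is hypothetical and this is an illustration rather than a recommended method.

```python
from statistics import mean, variance


def mean_impute(values):
    """Replace each None with the mean of the observed values (single imputation)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    # BMI for patients 1-8 as read from the example table (None = missing).
    bmi = [35.2, 26.3, None, 28.3, None, 40.7, 30.5, 26.1]
    observed = [v for v in bmi if v is not None]
    completed = mean_impute(bmi)
    print("imputed value:            ", round(mean(observed), 1))
    print("variance, observed only:  ", round(variance(observed), 2))
    print("variance, after imputing: ", round(variance(completed), 2))
```

Every imputed value sits exactly at the mean, so the completed column has a smaller sample variance than the observed values alone, and standard errors computed from it would be too small.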