BIOSTATISTICS 590
Multiple Imputation (MI) Technique Using a Sequence of Regression Models
OJOC Cohort 15
Veronika N. Stiles, BSDH
University of Michigan
September 2012

Basis for Presentation
• This presentation is based on an article by:
  • T.E. Raghunathan
  • J.M. Lepkowski
  • J. Van Hoewyk
  • P. Solenberger
“A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models”
Survey Methodology, June 2001, Vol. 27, No. 1, pp. 85-95

Rationale for Multiple Imputation
• Incomplete data are a common problem
• Once the missing values have been imputed, existing complete-data software can be used for the analysis

Basic Definitions
• “Imputation” is the placement of one or more estimated answers into a field of a data record that previously had NO data
• Imputed values are draws from a predictive distribution

Basic Strategy
• Create imputations by fitting a sequence of multiple regressions
• Each regression uses the variable with missing data as the outcome (Y) variable
• Regression models fit to the complete data are used to predict Y where Y is missing
• Imputed values are drawn from the resulting predictive distributions
• The process runs in a cyclical manner
• The type of regression model varies by imputed variable (an example follows in later slides)

Types of Regression Models Used
1. Linear
2. Logistic
3. Poisson
4. Generalized logit
5. Mixture of the above
Remember! The type of regression model depends on the type of imputed variable!
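As a rough illustration of the strategy above for a continuous variable, the following Python sketch fits a linear regression to the complete cases and fills each missing Y with the predicted mean plus a random draw from a normal error distribution. The names MODEL_FOR_TYPE and impute_linear, the single predictor, and the toy data are assumptions made for illustration; they are not taken from the article or from IVEware. The sketch also holds the fitted coefficients fixed, whereas a full implementation would draw them from their posterior distribution as well.

```python
import random
import statistics

# Assumed pairing of variable type and regression model (illustrative only).
MODEL_FOR_TYPE = {
    "continuous": "linear",
    "binary": "logistic",
    "count": "Poisson",
    "categorical": "generalized logit",
    "mixed": "mixture of the above",
}


def impute_linear(x, y, rng):
    """Fill missing entries of y (None) using a simple linear regression on x.

    The model is fit on the complete cases only. Each missing value is
    replaced by the predicted mean plus a normal draw whose standard
    deviation is the residual standard deviation of the fit.
    """
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs = [xi for xi, _ in observed]
    ys = [yi for _, yi in observed]
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)

    # Ordinary least squares with one predictor.
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in observed) / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard deviation (n - 2 degrees of freedom).
    residuals = [yi - (intercept + slope * xi) for xi, yi in observed]
    sigma = (sum(r * r for r in residuals) / (len(observed) - 2)) ** 0.5

    # Observed values are kept; missing values get mean + random error.
    return [
        yi if yi is not None else intercept + slope * xi + rng.gauss(0.0, sigma)
        for xi, yi in zip(x, y)
    ]


# Hypothetical toy data: x is fully observed, y has two missing values (None).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 12.1]
completed = impute_linear(x, y, random.Random(0))
```

Calling impute_linear again with a different random seed gives a different set of imputed values while leaving the observed values untouched, which is what makes repeated imputation possible.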
Assumptions in MI Technique
• Population is infinite
• Sample is a simple random sample (SRS)
• Each variable is one of the following types:
  • Continuous
  • Binary
  • Categorical
  • Count
  • Mixed

Advantages of Multiple Imputation
+ The method of imputation is known
+ Analyses are based on the same number of cases
+ All data provided are used in each analysis
+ Allows for multiple predictors
+ Valid point and interval estimates are obtained, under a general set of conditions, by repeatedly applying the complete-data software

Imputation Method
• Each imputation consists of “rounds”
• Start round 1 by fitting a regression for the variable with the fewest missing values (Y)
• Remember! Imputations for missing values in Y are draws from the predictive distribution (use the predicted mean of Y + a random draw from the normal error distribution)
• Then update X by replacing each missing Y with its imputed value
• X = full matrix with all variables (including Y)

Lesion Location   Etiology     Lesion Size   Chronicity
Temporal          Lobectomy    2.72          89.3
Occipital         Stroke       .             36.3
Temporal          Hemorrhage   .             55.3

Imputation Method
• Move on to the next Y with the fewest missing values
• Repeat the imputation step, using the updated X as predictors, until all variables have been imputed
• Run the process M times to yield M complete datasets
• Each dataset has a different set of imputed values but the same observed values

Example Time

Effect of Smoking on Primary Cardiac Arrest (CA)
• Case-control study
• Examines the relationship between smoking and CA

Means and Proportions of Key Variables and Percent Missing

                   Control (n = 551)              Cases (n = 347)
Variable           % Missing   Mean (SD)          % Missing   Mean (SD)
Age                0           58.4 (10.4)        0           59.4 (9.9)
BMI                8.2         25.8 (4.1)         2.6         26.4 (4.6)
Years Smoked       16.8        24.8 (14.7)        5.4         31.7 (13.8)

                   % Missing   Proportion (%)     % Missing   Proportion (%)
Female             0           23.2               0           19.9
>= High School     0           76.8               0           61.9
Smoking Status     0
  Never Smoked     0           47.2               0           27.3
  Former Smoker    0           42.1               0           38.2
  Current Smoker   0           10.7               0           34.5

Intuitively…
• What variables might predict missing data?
• Could age, education, and smoking status predict BMI?
• Could age predict years smoked?
• However, years smoked can only be imputed for current and former smokers!
• Some values may need to be fixed post-MI

Multiple Imputation Process in CA Study
• Log(BMI) has the fewest missing values
• Regress log(BMI) on age, female, education, Years_Smoked, smoking status, and cardiac arrest using a normal linear model
• Cardiac arrest IS included in the imputation model
• Imputed values of log(BMI), drawn from the predictive distribution, are saved to the dataset, replacing the missing values

Multiple Imputation Process in CA Study
• Next, Years Smoked was regressed on all of the variables above + log(BMI) (please note that this regression excludes never-smokers)
• Imputed values of Years Smoked are saved to the dataset, replacing the missing values
• M = 25 imputations (note: many researchers use M = 5 or 5 < M < 10)
• The original logistic regression model was fit to each imputed dataset

How were estimates of coefficients and covariance matrices obtained?
• IVEware software performs the calculations, using the estimates and covariance matrix from each imputed dataset
• Combines the results from 5-25 regressions
• Combines both within-regression and between-regression error
• IVEware: Imputation and Variance Estimation Software
  http://www.isr.umich.edu/src/smp/ive/
• Developed by our own Dr. Raghunathan and researchers at the Survey Methodology Program

Complete-Case Analysis vs MI

                              Complete Case      SRMI Method 1
                              (n = 795)          (n = 898)
Predictor Variables           Estimate (SE)      Estimate (SE)
Intercept                     -2.922 (0.791)     -2.61 (0.757)
Age                           0.015 (0.009)      0.015 (0.009)
Female                        -0.007 (0.203)     -0.115 (0.189)
Education                     -0.448 (0.173)     -0.467 (0.166)
BMI                           0.056 (0.018)      0.049 (0.013)
Current Smoker                1.693 (0.569)      2.001 (0.543)
Former Smoker                 0.003 (0.284)      -0.029 (0.262)
Current Smoker x Yrs Smoked   -0.003 (0.015)     -0.008 (0.013)
Former Smoker x Yrs Smoked    0.019 (0.009)      0.014 (0.009)

Results of the Multiple Imputations
• MI standard errors are smaller:
  • due to the additional subjects in the imputed data
• Modest changes in the relationship between smoking and CA
• Years Smoked in Former Smokers is a significant predictor of cardiac arrest in the complete-case analysis, but NOT in the MI analysis (!!!)

Additional Variables MI Approach
• Additional variables NOT in the substantive analysis can be used
• Prediction of missing values in each variable borrows strength from all other variables
• In our cardiac arrest example, imputing the dataset with 50 additional variables gives smaller SEs
• Improved efficiency compared with imputing from the model variables only

In Addition… IVEware performs…
1. Single or multiple imputations
2. Analyses accounting for:
  • Clustering
  • Stratification
  • Weighting
3. Combining information from multiple sources
• (+ some other functions beyond the scope of this presentation)

Critique
• This article might be too challenging and complicated as an entry-level description of multiple imputation
• Some of the foundational concepts from this article have not been covered thus far in the OJOC program
  • e.g., the nonignorable missing-data mechanism

RECOMMENDATION
Start with “Survey Methodology” (2nd edition) by R.M. Groves, F.J. Fowler, Jr., M.P. Couper, J.M. Lepkowski, E. Singer, and R. Tourangeau. Wiley Series in Survey Methodology, A John Wiley & Sons, Inc., Publication, 2009, p. 356.

Thank You for Your Attention!
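The combining step on the “How were estimates of coefficients and covariance matrices obtained?” slide is not spelled out in the presentation. A minimal sketch, assuming the standard multiple-imputation combining rules (Rubin's rules) applied to a single coefficient, is given below. The function name combine_rubin and the input numbers are hypothetical; they are not results from the cardiac arrest study and this is not IVEware's code.

```python
import statistics


def combine_rubin(estimates, variances):
    """Combine one coefficient across M imputed datasets (Rubin's rules).

    estimates: the M point estimates, one per imputed dataset
    variances: the M squared standard errors, one per imputed dataset
    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)       # average of the M estimates
    within = statistics.fmean(variances)       # within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor M - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total ** 0.5


# Hypothetical estimates and squared SEs from M = 5 imputed datasets.
example_estimates = [0.048, 0.051, 0.047, 0.050, 0.049]
example_variances = [0.013 ** 2] * 5
pooled_estimate, pooled_se = combine_rubin(example_estimates, example_variances)
```

The pooled standard error is never smaller than the average within-dataset standard error, because the between-imputation term adds the uncertainty caused by the missing values.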