Introduction to Multiple Imputation
CFDR Workshop Series
Spring 2008

Outline
• Missing data mechanisms
• What is Multiple Imputation?
• SAS Proc MI, Proc MIANALYZE
• Stata ICE, MICOMBINE
• SAS IVEware
• What’s the diff?
• Problems with categorical imputation

Missing data mechanisms
• Missing Completely At Random (MCAR)
  – The probability of missingness doesn't depend on anything.
• Missing At Random (MAR)
  – The probability of missingness does not depend on the unobserved value of the missing variable, but it can depend on any of the other variables in your dataset.
• Not Missing at Random (NMAR)
  – The probability of missingness depends on the unobserved value of the missing variable itself.

What is Multiple Imputation?
1. Imputation
   • Make M=3 to 10 copies of incomplete data set filling in with conditionally random values
2. Analyses
   • Of each data set separately
3. Pooling
   • Point estimates. Average across M analyses
   • Standard errors. Combine variances.

1. Imputation: Multiple Copies of Dataset

  Y       X1      X2     X3
  44.61   11.37   178    1
  54.3    8.65    156    0
  49.87   9.22    .      .
  .       11.95   176    1
  39.44   13.08   174    1
  50.54   .       .      1
  44.75   11.12   176    0
  51.86   10.33   166    0
  40.84   10.95   168    .
  46.77   10.25   .      .

  _I_   Y       X1      X2      X3
  1     44.61   11.37   178     1
  1     54.3    8.65    156     0
  1     49.87   9.22    181.2   0.23
  1     39.97   11.95   176     1
  1     39.44   13.08   174     1
  1     50.54   9.117   168.2   1
  1     44.75   11.12   176     0
  1     51.86   10.33   166     0
  1     40.84   10.95   168     0.756
  1     46.77   10.25   185.9   0.632

  _I_   Y        X1       X2       X3
  2     44.609   11.37    178      1
  2     54.297   8.65     156      0
  2     49.874   9.22     137.47   0.0666
  2     39.849   11.95    176      1
  2     39.442   13.08    174      1
  2     50.541   9.9192   162.67   1
  2     44.754   11.12    176      0
  2     51.855   10.33    166      0
  2     40.836   10.95    168      0.2288
  2     46.774   10.25    184.83   0.0998

Three steps
1. Imputation
   • Make M=2 to 10 copies of incomplete data set filling in with conditionally random values
2. Analyses
   • Of each data set separately
3. Pooling
   • Point estimates. Average across M analyses
   • Standard errors. Combine variances.

What is MI?
• Stata
  – based on each conditional density
  – chained equations
• SAS
  – joint distribution of all the variables
  – assumed multivariate normal distribution
• SAS IVEware
  – same as Stata, more options.

Stata Example
• ICE to impute
  – Regression commands may be logistic, mlogit, ologit, or regress.
• MICOMBINE to analyze and combine the results.
  – Supported regression cmds are clogit, cnreg, glm, logistic, logit, mlogit, ologit, oprobit, poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee.
• Easy to use, nice documentation

SAS example

  Oxygen   RunTime   RunPulse
  44.609   11.37     178
  54.297   8.65      156
  49.874   9.22      .
  .        11.95     176
  39.442   13.08     174
  50.541   .         .
  44.754   11.12     176
  51.855   10.33     166
  40.836   10.95     168
  46.774   10.25     .
  39.407   12.63     174
  45.441   9.63      164

Step 1: Proc MI
• Typical syntax:

  proc mi data=mi_example out=outmi seed=1234;
    var Oxygen RunTime RunPulse;
  run;

Step 2: Run Models

  proc reg data=outmi outest=outreg covout noprint;
    model Oxygen = RunTime RunPulse;
    by _Imputation_;
  run;

Note that the regression output is stored as dataset “outreg”.
Procs: Reg, Logistic, Genmod, Mixed, GLM

Parameter Estimates & Covariance Matrices

  proc print data=outreg(obs=8);
    var _Imputation_ _Type_ _Name_ Intercept RunTime RunPulse;
  run;

  Obs   _Imputation_   _TYPE_   _NAME_      Intercept   RunTime    RunPulse
  1     1              PARMS                82.9694     -2.44422   -0.06121
  2     1              COV      Intercept   65.1698     0.26463    -0.39518
  3     1              COV      RunTime     0.2646      0.14005    -0.0101
  4     1              COV      RunPulse    -0.3952     -0.0101    0.00293
  5     2              PARMS                85.1831     -3.0485    -0.03452
  6     2              COV      Intercept   85.3406     -0.44671   -0.46786
  7     2              COV      RunTime     -0.4467     0.13629    -0.00581
  8     2              COV      RunPulse    -0.4679     -0.00581   0.00308

Step 3. Proc MIANALYZE

  proc mianalyze data=outreg;
    modeleffects Intercept RunTime RunPulse;
  run;

  Multiple Imputation Parameter Estimates
  Parameter   Estimate    Std Error   95% Confidence Limits   DF       Minimum     Maximum      Pr > |t|
  Intercept   92.696519   12.780914   65.35758    120.0355    14.412   82.969385   101.288118   <.0001
  RunTime     -2.915452   0.48346     -3.90873    -1.9222     26.264   -3.146336   -2.444217    <.0001
  RunPulse    -0.086795   0.070425    -0.23209    0.0585      24.163   -0.13547    -0.034519    0.2296

Irritating Parameter Est. & Covariance Matrices
• Syntax depends on which procedure you used in the previous step:
• proc mianalyze data=parmcov; (or)
• proc mianalyze parms=parmsdat covb=covbdat; (or)
• proc mianalyze parms=parmsdat xpxi=xpxidat;
Procs: reg, genmod, logistic, mixed, glm.

SAS IVEware: 4 Components
1. IMPUTE -- nice options.
2. DESCRIBE estimates the population means, proportions, subgroup differences, contrasts and linear combinations of means and proportions. A Taylor Series approach is used to obtain variance estimates appropriate for a user-specified complex sample design.
3. REGRESS fits linear, logistic, polytomous, Poisson, Tobit and proportional hazard regression models for data resulting from a complex sample design.
4. SASMOD allows users to take into account complex sample design features when analyzing data with several SAS procedures. SAS procs that can be called: CALIS, CATMOD, GENMOD, LIFEREG, MIXED, NLIN, PHREG, and PROBIT.

IVEware Impute
IMPUTE assumes each variable in the data set is one of the following five types:
(1) continuous
(2) binary
(3) categorical (polytomous with more than two categories)
(4) counts
(5) mixed
The types of regression models used are linear, logistic, Poisson, generalized logit or mixed logistic/linear, depending on the type of variable being imputed.

A Few Issues
• Do I impute the dependent variable?
• Which model has more information? The imputation model or the analyst model?
• How many imputations do I need to do?
• Can I impute in one language and analyze in another?
• How do I get summary statistics such as R squared?
• Can I do this in SPSS?
• Where do I go with questions?

Thanks
Next up:
“COLLATERAL CONSEQUENCES OF VIOLENCE IN DISADVANTAGED NEIGHBORHOODS”
Dr. David Harding
Wednesday, February 13, Noon - 1:00 pm

Accessing and Analyzing Add Health Data
Instructor: Dr. Meredith Porter
Monday, February 25, 12:00-1:00 pm
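
Note on the pooling step
The pooling step ("Point estimates. Average across M analyses. Standard errors. Combine variances.") is usually carried out with the combining rules known as Rubin's rules. The sketch below writes them out for a single parameter. The symbols are not from the slides and are introduced here only for illustration: \hat{Q}_m is the estimate from imputed data set m, U_m is its squared standard error, and M is the number of imputations. The degrees-of-freedom formula is the classic large-sample version; software may apply a small-sample adjustment instead.

```latex
% Pooled point estimate: average of the M estimates
\bar{Q} = \frac{1}{M} \sum_{m=1}^{M} \hat{Q}_m

% Within-imputation variance: average of the M squared standard errors
\bar{U} = \frac{1}{M} \sum_{m=1}^{M} U_m

% Between-imputation variance: spread of the M estimates around their mean
B = \frac{1}{M-1} \sum_{m=1}^{M} \left( \hat{Q}_m - \bar{Q} \right)^2

% Total variance and pooled standard error
T = \bar{U} + \left( 1 + \frac{1}{M} \right) B ,
\qquad \mathrm{SE}(\bar{Q}) = \sqrt{T}

% Degrees of freedom for the t reference distribution
\nu = (M-1) \left[ 1 + \frac{\bar{U}}{\left( 1 + \frac{1}{M} \right) B} \right]^2
```

Read against the Proc MIANALYZE output, \bar{Q} corresponds to the Estimate column, \sqrt{T} to Std Error, and \nu to DF. The Minimum and Maximum columns are the smallest and largest \hat{Q}_m across the imputations.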