Jin is designed Dr. Huber by Texas A&M HSC

Korean Female Colon Cancer
Risk factor: smoking habits

Range                            Non-event n      %   Event n      %     HR   95% CI lower  95% CI upper       P
Missing                              1449400  79.57      4071  95.70      -        -             -              -
No smoking                            351896  19.32        93   2.19  1.000    1.000         1.000          1.000
Smoked before, but quit                 4611   0.25        21   0.49  1.174    1.058         1.303         0.0025
Currently, 1/2 pack                     8735   0.48        38   0.89  0.948    0.828         1.084         0.4339
Currently, 1/2 to one pack              5534   0.30        26   0.61  0.991    0.901         1.09          0.8457
Currently, more than one pack           1410   0.08         5   0.12  1.015    0.894         1.153         0.8162

Is smoking protective? Not sure, because of the huge amount of missing data!!

Missing data mechanisms differ by why the data are missing
1. Missing Completely At Random (MCAR): missingness depends on neither the observed nor the missing values
2. Missing At Random (MAR): missingness depends only on the observed values
3. Not Missing At Random (NMAR): missingness depends on both the observed and the missing values
The mechanism affects the effectiveness and bias of methods for handling missing data.

Methods for missing data
1. Complete Case Analysis (CCA)
2. Available Case Analysis (ACA)
3. Mean imputation
4. Expectation-Maximization (EM)
5. Multiple Imputation
Categories shown: older methods, single imputation, multiple imputation.
Only CCA and MI are used here.

1. Complete Case Analysis (CCA)

Y1    Y2    Y3
140   .     20
31    25    .
10    35    40
25    48    57
30    49    60
35    55    65
37    47    70
140   32    30
42    65    40
50    200   20

1. Delete all cases with missing values on Y1, Y2, Y3
2. Analyze the remaining cases

1. CCA = NOT using any method of handling missing data
2. By deleting cases, power will be decreased (because of the reduced sample size)

2. Multiple Imputation (MI)
MI has 3 steps:
(1) Imputation Step
(2) Analysis Step
(3) Combination Step

2. MI (1) Imputation Step
[Table: example dataset with 9 observations on Y, X1, X2; "." marks missing values in X1 and X2]
[Table: the imputed data stacked into 45 rows (9 observations x 5 imputations), indexed by imputation number 1 to 5, with each missing X1 or X2 value replaced by an imputed value: "5 complete datasets"]

2. MI (2) Analysis Step
* Standard statistical procedure
> Regression is run on each of the 5 complete datasets separately (analyzed 5 times)

Obs  Imputation  Model   Type   Name       Depvar   RMSE  Intercept      X1     X2    Y
1    1           MODEL1  PARMS             Y        9.49     417.91   -7.96  -1.64   -1
2    1           MODEL1  COV    Intercept  Y        9.49     722.00  -15.61  -3.26    .
3    1           MODEL1  COV    X1         Y        9.49     -15.61    0.34   0.07    .
4    1           MODEL1  COV    X2         Y        9.49      -3.26    0.07   0.02    .
5    2           MODEL1  PARMS             Y       11.80     405.16   -7.81  -1.53   -1
6    2           MODEL1  COV    Intercept  Y       11.80    1052.74  -23.16  -4.60    .
7    2           MODEL1  COV    X1         Y       11.80     -23.16    0.52   0.10    .
8    2           MODEL1  COV    X2         Y       11.80      -4.60    0.10   0.02    .
9    3           MODEL1  PARMS             Y        3.86     233.43   -4.31  -0.80   -1
10   3           MODEL1  COV    Intercept  Y        3.86      28.82   -0.66  -0.12    .
11   3           MODEL1  COV    X1         Y        3.86      -0.66    0.02   0.00    .
12   3           MODEL1  COV    X2         Y        3.86      -0.12    0.00   0.00    .
13   4           MODEL1  PARMS             Y        1.76     221.04   -4.17  -0.74   -1
14   4           MODEL1  COV    Intercept  Y        1.76       5.20   -0.12  -0.02    .
15   4           MODEL1  COV    X1         Y        1.76      -0.12    0.00   0.00    .
16   4           MODEL1  COV    X2         Y        1.76      -0.02    0.00   0.00    .
17   5           MODEL1  PARMS             Y        1.46     215.80   -4.08  -0.71   -1
18   5           MODEL1  COV    Intercept  Y        1.46       3.36   -0.08  -0.01    .
19   5           MODEL1  COV    X1         Y        1.46      -0.08    0.00   0.00    .
20   5           MODEL1  COV    X2         Y        1.46      -0.01    0.00   0.00    .

Column labels: Imputation = imputation number; Model = label of model; Type = type of statistics; Name = variable names for rows of the estimated COV; Depvar = dependent variable; RMSE = root mean squared error.
2. MI (3) Combination Step
> The results from the 5 datasets are combined into ONE result with the combination equations (Rubin's rules; $\hat{Q}_i$ and $\hat{W}_i$ are the estimate and its variance from imputed dataset $i$, $m$ = number of imputations).
1. Combined estimate: $\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i$
2. Total variance: $T = \bar{W} + \left(1 + \frac{1}{m}\right)B$
3. Within-imputation variance: $\bar{W} = \frac{1}{m}\sum_{i=1}^{m}\hat{W}_i$
4. Between-imputation variance: $B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2$
5. DF: $\nu = (m-1)\left[1 + \frac{\bar{W}}{(1 + 1/m)\,B}\right]^2$
6. Fraction of missing information: $\lambda = \frac{r + 2/(\nu + 3)}{r + 1}$, with $r = \frac{(1 + 1/m)\,B}{\bar{W}}$
7. Confidence interval: $\bar{Q} \pm t_{\nu,\,1-\alpha/2}\sqrt{T}$

* Comparison of methods to handle missing values (O = yes, X = no)

Criteria                                  CCA  ACA  Mean imputation  EM method  Multiple Imputation
Unbiased parameter estimation, MCAR        O    X         X              O               O
Unbiased parameter estimation, MAR         X    X         X              O               O
Unbiased parameter estimation, NMAR        X    X         X              X               X
Good estimates of variability              X    X         X              X               O
Best statistical power                     X    O         O              O               O

MI is the BEST!!
Excellent estimation of the variance among the M estimates, because the data are multiply imputed without deleting any cases.

(1) Imputation step of MI: imputation mechanisms for substituting missing values

Pattern                       Type        Normality  Imputation mechanism
Univariate, monotone          Continuous  O          Regression
Univariate, monotone          Continuous  X          Predictive Mean Matching (PMM)
Multivariate, not monotone    Continuous  -          MCMC

MCMC has NOT been tested on univariate missing data.

Simulated Data
* 3000 obs. are generated on Z1 and X1,…,X6 (all variables are continuous)
  (Xs: observed variables; Z: partly missing variable)
* Z1 and X1,…,X6 are drawn from a multivariate normal distribution with means = 0 and correlation =

      z1       x1       x2       x3       x4       x5       x6
z1  1.0000
x1  0.7655   1.0000
x2  0.2764   0.3233   1.0000
x3  0.0509   0.0351   0.5352   1.0000
x4  0.1612   0.1415  -0.0063  -0.0738   1.0000
x5  0.2924   0.3581   0.8062  -0.0640   0.0441   1.0000
x6  0.1052   0.1124  -0.0061  -0.0764   0.1157   0.0420   1.0000

Example Data ("A Predictive Study of Coronary Heart Disease")
* 3154 obs.
(all variables are continuous)
- Missing variable: Systolic Blood Pressure (mean: 128.63)
- Observed variables (means): DBP (82.02), height (69.78), weight (169.95), age (46.28), BMI (24.52), and cholesterol (226.37)
* Correlation =

         sbp      dbp      height   weight   age      bmi      chol
sbp     1.0000
dbp     0.7700   1.0000
height  0.0156   0.0070   1.0000
weight  0.2513   0.2940   0.5333   1.0000
age     0.1701   0.1440  -0.0919  -0.0331   1.0000
bmi     0.2878   0.3428  -0.0633   0.8079   0.0256   1.0000
chol    0.1231   0.1296  -0.0889   0.0085   0.0892   0.0706   1.0000

1. Missing Mechanisms
   1) MCAR: Z1 (SBP) deleted at random
   2) MAR: Z1 (SBP) deleted after sorting by one of the Xs (observed variables)
   3) NMAR: Z1 (SBP) deleted after sorting by Z1 (SBP)
   Deletion rates: 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%
2. Bias is mainly measured by RMSE (Root Mean Square Error) = Sqrt(Variance of Estimates + Bias^2), which captures both the accuracy and the variability of the estimates and compares them in the same units.
   * True value = mean of Z1 (SBP) at 0% missing
   * Estimate = mean of Z1 (SBP) at 10% to 80% missing after MI
   The smaller the RMSE, the better the estimation.
3. Methods for dealing with missing values (to measure the effectiveness of MI)
   Complete Case Analysis (CCA)
   Multiple Imputation (MI)
4. Numbers of imputations
   M = 10, 20, 30, 40, and 50
5. Imputation model
   (z1 = x1 x2 x3 x4 x5 x6): all variables
   (z1 = x1 x2 x5): variables highly correlated with z1. This is the best model because it has the smallest RMSE.
   (z1 = x3 x4 x6): variables weakly correlated with z1
6. Imputation Mechanisms
   Regression method
   PMM
   MCMC
7. 500 repetitions of each MI (to reduce the random variability of imputation)
   ex) M=10 * 500 reps. → average them → mean of estimates for M=10
       …
       M=50 * 500 reps. → average them → mean of estimates for M=50
8.
Statistical Software: STATA 11 (Multiple Imputation)

[Figure: RMSE of CCA and MI by proportion of missing data; panels MCAR, MAR, NMAR]

Under MCAR and MAR, both CCA and MI are good.
After changing the scale of the Y axis: under all missing mechanisms, MI is better than CCA.
As the percent of missing data increases, the RMSEs increase linearly and the difference in RMSE between CCA and MI increases.
> With a high amount of missing data, use Multiple Imputation.

[Figure: RMSE by proportion of missing data (10% to 80%) for 10, 20, 30, 40, and 50 imputations; panels MCAR, MAR, NMAR; curves annotated "Similar"]

Under NMAR, MI gives biased estimates at 80% missing (regardless of the number of imputations) because of the large RMSE ≒ 1 SD of the data (= 0.99).
Under MCAR and MAR, MI is good!
The 5 lines (M=10 to M=50) run together and look like 1 line.
> No difference among the numbers of imputations (m) = 10, 20, 30, 40, 50.

[Figure: RMSE by proportion of missing data for the reg, pmm, and mcmc imputation mechanisms; panels MCAR, MAR, NMAR; NMAR panel annotated "MCMC/ Reg."]

       Normality    Theory       Practically (MI)
MCAR   Normal       Regression   All imputation mechanisms
MAR    Normal       Regression   All imputation mechanisms (Reg. slightly better)
NMAR   Not normal   PMM          Regression, MCMC

* The normality assumption may not be important under NMAR.
* MCMC is good under all missing mechanisms. Thus, MCMC can be used for univariate, continuous missing data.

1. Under MCAR and MAR, the regression method should in theory be better because normality holds, but all methods are good.
However, the regression method is slightly better under MAR.
2. Under NMAR, even though normality is not met, the regression method is better than PMM.

[Figure: RMSE of CCA and MI by proportion of missing data (10% to 80%); panels MCAR, MAR, NMAR]

Under MCAR and MAR, both CCA and MI are good.
After changing the scale of the Y axis: under MCAR, MAR, and NMAR, MI produced significantly less biased values than CCA.
As the percent of missing data increases, the RMSEs increase linearly and the difference in RMSE between CCA and MI increases.
> With a high amount of missing data, Multiple Imputation is preferable.

[Figure: RMSE by proportion of missing data (10% to 80%) for 10, 20, 30, 40, and 50 imputations; panels MCAR, MAR, NMAR; curves annotated "Similar"]

Under NMAR, MI did not do well at 80% missing, because of the large RMSE ≒ 1 SD of the data (= 15.11).
Under MCAR and MAR, MI produces unbiased estimates (regardless of the number of imputations and the percent of missing data).
No difference among the increasing numbers of imputations 10, 20, 30, 40, 50.
> Increasing the number of imputations has no significant effect on correcting bias for data with these characteristics.

[Figure: RMSE by proportion of missing data (10% to 80%) for the reg, pmm, and mcmc imputation mechanisms; panels MCAR, MAR, NMAR; NMAR panel annotated "MCMC/ Reg."]
       Normality    Theory   Practically (MI)
MCAR   Not normal   PMM      All imputation mechanisms
MAR    Not normal   PMM      All imputation mechanisms (PMM slightly better)
NMAR   Not normal   PMM      Regression, MCMC

* The normality assumption may be important only under MAR.
* MCMC is good to use under MCAR, MAR, and NMAR. Thus, MCMC can be used not only for multivariate, continuous missing data but also for univariate, continuous missing data.

1. Under MCAR and MAR, PMM should in theory be better because the normality assumption is broken, but all methods are good. However, the PMM method is slightly better under MAR.
2. Under NMAR, even though normality is not met, the regression method has a lower RMSE than PMM.

Conclusion
1. Multiple Imputation (MI) is always better than Complete Case Analysis.
2. No significant difference among the numbers of imputations in my data.
3. Under MCAR and MAR, MI produces unbiased estimates even with a high amount of missing data.
4. However, under NMAR, the estimates from MI are also biased with a high amount of missing data.
5. MCMC is good for univariate, continuous missing data under MCAR, MAR, and NMAR.
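
The three MI steps (imputation, analysis, combination) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the Stata procedure used in the study: the function names `impute_once` and `pool` are hypothetical, the data are a two-variable stand-in (z1 and x1 with correlation 0.7655, taken from the simulated-data correlation table), 30% of z1 is deleted completely at random, and the imputation model is a Bayesian normal linear regression of z1 on x1.

```python
import numpy as np

def impute_once(z, x, rng):
    """Imputation step: return one completed copy of z.

    Assumption: normal linear regression of z on x, with the residual
    variance and coefficients drawn once per imputed dataset.
    """
    obs = ~np.isnan(z)
    X = np.column_stack([np.ones(len(z)), x])   # design matrix with intercept
    Xo, zo = X[obs], z[obs]
    xtx_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = xtx_inv @ Xo.T @ zo              # OLS fit on the observed cases
    resid = zo - Xo @ beta_hat
    df = obs.sum() - X.shape[1]
    sigma2 = resid @ resid / rng.chisquare(df)  # draw the residual variance
    beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)  # draw coefficients
    z_new = z.copy()
    n_mis = (~obs).sum()
    z_new[~obs] = X[~obs] @ beta + rng.normal(0.0, np.sqrt(sigma2), n_mis)
    return z_new

def pool(estimates, variances):
    """Combination step: combine m estimates and their variances (Rubin's rules)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()              # combined estimate
    w = u.mean()                  # within-imputation variance
    b = q.var(ddof=1)             # between-imputation variance
    t = w + (1 + 1 / m) * b       # total variance
    return q_bar, w, b, t

rng = np.random.default_rng(0)
n, rho, m = 3000, 0.7655, 10
x1 = rng.normal(size=n)
z1 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)

z_mis = z1.copy()
z_mis[rng.random(n) < 0.3] = np.nan             # MCAR deletion, about 30%

# (1) Imputation step: m completed datasets
completed = [impute_once(z_mis, x1, rng) for _ in range(m)]
# (2) Analysis step: estimate the mean of z1 and its variance in each dataset
est = [c.mean() for c in completed]
var = [c.var(ddof=1) / n for c in completed]
# (3) Combination step
q_bar, w, b, t = pool(est, var)
print(f"pooled mean = {q_bar:.4f}, total variance = {t:.6f}")
```

The analysis step here is just the sample mean, which matches the quantity the study evaluates (the mean of Z1); any other complete-data analysis could be substituted as long as it returns an estimate and its variance.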
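
The way missingness was generated (random deletion for MCAR, deletion after sorting by an observed X for MAR, deletion after sorting by Z1 itself for NMAR) and the RMSE criterion can be illustrated with a small simulation for CCA. The sketch below makes several assumptions that are not in the slides: the helper names `delete_z` and `cca_rmse` are hypothetical, only z1 and one covariate x1 (correlation 0.7655) are simulated, the largest sorted values are the ones deleted, fresh data are drawn in every repetition, and 200 repetitions are used instead of 500.

```python
import numpy as np

def delete_z(z, x, frac, mechanism, rng):
    """Return a copy of z with a fraction `frac` of values set to NaN."""
    n = len(z)
    k = int(round(frac * n))
    if mechanism == "MCAR":
        idx = rng.choice(n, size=k, replace=False)   # delete at random
    elif mechanism == "MAR":
        idx = np.argsort(x)[n - k:]                  # assumption: drop z where x is largest
    elif mechanism == "NMAR":
        idx = np.argsort(z)[n - k:]                  # assumption: drop the largest z values
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    z_mis = z.copy()
    z_mis[idx] = np.nan
    return z_mis

def cca_rmse(frac, mechanism, reps=200, n=3000, rho=0.7655, seed=0):
    """RMSE of the complete-case mean of z against the mean at 0% missing."""
    rng = np.random.default_rng(seed)
    errors = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        z = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
        z_mis = delete_z(z, x, frac, mechanism, rng)
        # CCA estimate (mean of the remaining cases) minus the "true" value
        errors[r] = np.nanmean(z_mis) - z.mean()
    bias = errors.mean()
    # RMSE = sqrt(variance of estimates + bias^2)
    return np.sqrt(errors.var() + bias**2)

rmse = {mech: cca_rmse(0.3, mech) for mech in ("MCAR", "MAR", "NMAR")}
for mech, value in rmse.items():
    print(f"{mech}: RMSE of CCA at 30% missing = {value:.4f}")
```

Because the sort-based deletion in this sketch uses a covariate that is strongly correlated with z1, the MAR bias of CCA is larger here than it would be with a weakly correlated sorting variable; the numbers are not meant to reproduce the figures in the slides.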