DEPARTMENT OF MATHEMATICS AND STATISTICS

SSC Case Study 2002
Handling Missing Data

Tao Sun, Lena Zhang, Yaqing Chen, Francisco Aguirre

SSC Conference, Hamilton, Ontario, May 2002

Presentation Outline

Objective: Compare different approaches to handling missing data from a practitioner's point of view

1. Preliminary analysis
   • Various plots
2. Assessing the missing pattern
   • Spearman rank correlation, logistic regression
3. Data analysis with missing data - Multiple Imputation
   • Random hot deck imputation with bootstrap
   • PROC MI and PROC MIANALYZE (SAS)
   • transcan function (Hmisc library in S-PLUS or R)
4. Conclusions
5. Further work

Preliminary analysis: response overview

[Figure: histogram of the observed responses DVHST94]

• Sample size: 2389
  – Males: 1097 (45.9%)
  – Females: 1292 (54.1%)
• Observed: 1691
• Missing: 698 (28.8%)
• Mean: 0.9129
• The response variable is highly skewed to the left.

Preliminary analysis

[Figure: plots of the response DVHST94 against the covariates]

• 8 covariates in total, first 4 shown here.
• There appears to be a pattern of two clusters in the response DVHST94 (below 0.5 and above 0.5).
• DVBMI94 appears to have some "wild" values (= 96): 43 observations, all males (3.9% of the male sample).
  – Wild values were replaced with the mean DVBMI94 of males.
  – DVBMI94 transformation: NEW.DVBMI94 = abs(DVBMI94 - 22)

Preliminary analysis

• There are no obvious linear patterns between the covariates and the response DVHST94.
• DVPP94 is recoded as the dichotomous NEW.DVPP94:
  – DVPP94 = 0 (91% of observations)
  – DVPP94 > 0 (9% of observations)
• The AGEGRP covariate is recoded to NEW.AGE:
  NEW.AGE = mid-range value(AGEGRP) - 20

Preliminary analysis

[Figure: mean DVHST94 at each covariate level; horizontal axis from 0.84 to 0.92]

N is the number of observed responses at each level.

Covariate      Level                 N
NEW.AGE        2                   309
               7                   283
               12                  383
               17                  296
               22                  173
               27                  132
               32                   61
               37                   29
               42                   25
SEX            Female              857
               Male                834
DVHHIN94       [1, 7)              635
               7                   259
               [8, 10)             435
               [10, 11]            362
DVSMKT94       1                   535
               2                    57
               3                    44
               4                   305
               5                   155
               6                   595
NEW.DVPP94     DVPP94 > 0          172
               DVPP94 = 0         1519
NUMCHRON       0                   814
               1                   491
               [2, 9]              386
VISITS         [0, 3)              487
               [3, 6)              433
               [6, 12)             382
               [12, 94]            389
NEW.WT6        [0.0547, 0.447)     424
               [0.4473, 0.824)     422
               [0.8239, 1.430)     420
               [1.4297, 7.445]     425
NEW.DVBMI94    [0.0, 1.6)          453
               [1.6, 3.1)          383
               [3.1, 6.1)          433
               [6.1, 18.0]         422
Overall                           1691

Preliminary analysis

[Figure: strength of the marginal relationships between the covariates and the response, measured by the generalized Spearman chi-square]

Assessing the missing pattern

• The missing pattern of the response does not appear to depend on the sampling weights.
• The missing values depend on age.

[Figure: missing response DVHST94 vs NEW.AGE (2 to 42); total sample size (0 to 500) and percentage of missing values (0% to 100%) for each age level]
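A quick way to check whether missingness depends on a covariate, as the age plot suggests, is to tabulate the fraction of missing responses at each level of that covariate. The following is a minimal sketch in Python rather than the S-PLUS or SAS used in the study; the function name and the toy data are hypothetical and are not taken from the survey.

```python
from collections import defaultdict


def missing_rate_by_group(groups, responses):
    """Return {group: (n_total, fraction_missing)}; None marks a missing response."""
    counts = defaultdict(lambda: [0, 0])  # group -> [total, missing]
    for group, response in zip(groups, responses):
        counts[group][0] += 1
        if response is None:
            counts[group][1] += 1
    return {g: (n, miss / n) for g, (n, miss) in sorted(counts.items())}


# Hypothetical toy data: a covariate level and a response with gaps.
age = [2, 2, 2, 2, 42, 42, 42, 42]
health = [0.9, 1.0, None, 0.8, None, None, 0.7, None]

rates = missing_rate_by_group(age, health)
for level, (n, frac) in rates.items():
    print(f"level={level}: n={n}, missing={frac:.0%}")
```

A fraction that changes markedly across levels is evidence against missingness that is completely at random; a formal check would model the missingness indicator, for example with logistic regression.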
Assessing the missing pattern: logistic regression

Logistic regression for the probability that the response DVHST94 is missing.

Coefficients:
                   Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)       -5.058793     0.367083   -13.781    < 2e-16   ***
NEW.AGE            0.181625     0.007524    24.140    < 2e-16   ***
SEXMale           -0.847947     0.131475    -6.450   1.12e-10   ***
DVHHIN94           0.047828     0.026768     1.787     0.0740   .
DVSMKT94          -0.015131     0.031662    -0.478     0.6327
NEW.DVPP94 = 0     0.233188     0.226732     1.028     0.3037
NUMCHRON          -0.087992     0.048783    -1.804     0.0713   .
VISITS             0.012483     0.006563     1.902     0.0572   .
NEW.WT6           -0.043935     0.077407    -0.568     0.5703
NEW.DVBMI94       -0.015622     0.017299    -0.903     0.3665

[Figure: missing response DVHST94 vs gender; missing and observed counts for Male, Female and Total]

• % missing for males: 24%
• % missing for females: 34%

Multiple imputation

Methods:
– Random hot deck MI with bootstrap
– SAS PROC MI and PROC MIANALYZE
– Function transcan in S-PLUS from the Hmisc library (Frank Harrell)

Multiple imputation

INCOMPLETE DATA -> (imputation) -> IMPUTED DATA -> (analysis) -> ANALYSIS RESULTS -> (pooling) -> FINAL RESULTS

• IMPUTATION: Impute the missing entries of the incomplete data set B times, resulting in B completed data sets.
• ANALYSIS: Analyze each of the B completed data sets using weighted least squares.
• POOLING: Integrate the B analysis results into a final result. Simple rules exist for combining the B analyses.
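The pooling step is commonly carried out with Rubin's rules: the pooled estimate is the mean of the B estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/B). The sketch below assumes a single scalar parameter; the function name and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool B completed-data analyses of one scalar parameter.

    estimates        : the B point estimates, one per completed data set
    within_variances : the B squared standard errors of those estimates
    Returns (pooled_estimate, within, between, total_variance).
    """
    b = len(estimates)
    if b < 2 or b != len(within_variances):
        raise ValueError("need at least two analyses and matching lengths")
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, divisor B - 1
    total = within + (1 + 1 / b) * between
    return pooled, within, between, total


# Hypothetical results from B = 5 completed data sets.
est = [0.0061, 0.0058, 0.0065, 0.0060, 0.0062]
var = [0.0012 ** 2] * 5
pooled, within, between, total = pool_rubin(est, var)
print(f"estimate={pooled:.4f}, standard error={total ** 0.5:.4f}")
```

For a vector of regression coefficients the same rules apply with covariance matrices in place of the scalar variances.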
Random hot-deck MI with bootstrap

• The observed responses, together with imputed values for the missing responses, form a complete data set.
• Imputed values are chosen randomly, with replacement, from the observed responses, with probability ~ weights.
• B = 1000 replicates: the same procedure is repeated for each replicate i = 1, ..., 1000, giving (U_i, R_i) (within variance, R-square).
• The estimate is the mean of the B replicate estimates $\hat{\theta}_i$:

  $$\bar{\theta}_B = \frac{1}{B} \sum_{i=1}^{B} \hat{\theta}_i$$

• Total variance:

  $$T = \bar{U}_B + \frac{B+1}{B} B_e$$

  where

  $$\bar{U}_B = \frac{1}{B} \sum_{i=1}^{B} U_i \quad \text{(within variance)}$$

  $$B_e = \frac{1}{B-1} \sum_{i=1}^{B} (\hat{\theta}_i - \bar{\theta}_B)(\hat{\theta}_i - \bar{\theta}_B)' \quad \text{(between variance)}$$

• Estimated R-square:

  $$\bar{R} = \frac{1}{B} \sum_{i=1}^{B} R_i$$

• Compute 95% CIs for judging the significance of predictors.

PROC MI & MIANALYZE

PROC MI
1. By default generates 5 imputed values for each missing value.
2. Imputation method: MCMC (Markov chain Monte Carlo).
   – The EM algorithm determines the initial values.
   – MCMC repeatedly simulates the distribution of interest, from which the imputed values are drawn.
3. Assumption: the data follow a multivariate normal distribution.

PROC REG
Fits five weighted linear regression models to the five completed data sets obtained from PROC MI (using the BY _Imputation_ statement).

PROC MIANALYZE
Reads the parameter estimates and associated covariance matrices from the analyses performed on the multiply imputed data sets and derives valid statistics for the parameters.

TRANSCAN (S-PLUS, Hmisc; Frank Harrell)

Transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables.

Advantages:
• It approximates the multiple imputation algorithm based on Rubin's Bayesian bootstrap.
• Does not need a normality assumption or symmetry of residuals.
• Draws a sample of size r, with replacement, from the r non-missing residuals.
• Chooses a sample of size m from this sample of size r, with replacement, where m is the number of missing values.

  $$(Y_{obs}, X_{obs}) \xrightarrow{\text{LS}} (\hat{\varepsilon}_1, \dots, \hat{\varepsilon}_r) \xrightarrow{\text{Bootstrap}} (\hat{\varepsilon}'_1, \dots, \hat{\varepsilon}'_r) \xrightarrow{\text{Bootstrap}} (\hat{\varepsilon}^*_1, \dots, \hat{\varepsilon}^*_m)$$

• Generates imputed values with the linear imputation model and the bootstrapped residuals.
• Applies shrinkage to avoid overfitting.

Disadvantage:
• "Freezes" the imputation model before drawing the multiple imputations.

This algorithm is repeated B times to obtain the multiply imputed data sets, which are analyzed using WLS with the function lm.

Comparing imputation methods

Entries are estimates with standard errors in parentheses; * marks a significant predictor.

                  S-PLUS TRANSCAN      SAS PROC MI          Bootstrap             Available data
                                                            (random hot deck)     only
(Intercept)        0.8495 (0.0135) *    0.9281 (0.01)   *    0.8711 (0.0128) *     0.861  (0.012)  *
NEW.AGE           -0.0039 (0.0004) *   -0.0016 (0.0004) *   -0.0006 (0.0002) *    -0.0013 (0.0003) *
SEX (Male=1)       0.0045 (0.0045)      0.0023 (0.0045)      0.0031 (0.0055)       0.0037 (0.0049)
DVHHIN94           0.0083 (0.0016) *    0.0061 (0.0012) *    0.0029 (0.0007) *     0.0051 (0.0011) *
NEW.DVBMI94       -0.0001 (0.0007)     -0.0005 (0.0008)     -0.0006 (0.0005)      -0.0007 (0.0007)
DVSMKT94           0.0012 (0.0014)      0.0009 (0.0013)      0.0019 (0.0008) *     0.0012 (0.0012)
NEW.DVPP94(=0)     0.0904 (0.0085) *    0.0717 (0.0092) *    0.0531 (0.0089) *     0.0686 (0.0081) *
NUMCHRON          -0.0174 (0.0022) *   -0.0123 (0.0023) *   -0.0079 (0.0013) *    -0.013  (0.0021) *
VISITS            -0.0026 (0.0003) *   -0.0023 (0.0003) *   -0.0017 (0.0002) *    -0.0023 (0.0003) *
Mean R-square      0.33                 0.193                0.093                 0.183

Ranking:
1. TRANSCAN (advantage: shrinkage correction to prevent overfitting)
2. PROC MI (drawback: normality assumption)
3. Bootstrap random hot deck (does not use the information in the covariates)

Significant variables

[Figure: six panels (Intercept, DVHHIN94, NEW.DVPP94(=0), VISITS, NEW.AGE, NUMCHRON) comparing the results from S-PLUS TRANSCAN, SAS PROC MI, random hot deck (bootstrap) and complete observations]

Conclusions about the missing pattern

• The missing values of the response variable DVHST94 are not missing completely at random (MCAR). The probability of a missing response depends primarily on the observed covariates age and sex, so the missingness is consistent with missing at random (MAR).
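To see why a MAR mechanism matters for an imputation method that ignores the covariates, the sketch below simulates a response whose missingness depends on an observed covariate and fills the gaps with a weighted random hot deck. Everything in it is an assumption made for illustration: the simulated model, the missingness probabilities, the equal weights and the function names are hypothetical and do not reproduce the survey data or the authors' code.

```python
import random


def hot_deck_impute(values, weights, rng):
    """Fill None entries with draws, with replacement, from the observed
    values, each donor chosen with probability proportional to its weight."""
    donors = [(v, w) for v, w in zip(values, weights) if v is not None]
    donor_values = [v for v, _ in donors]
    donor_weights = [w for _, w in donors]
    n_missing = sum(v is None for v in values)
    draws = iter(rng.choices(donor_values, weights=donor_weights, k=n_missing))
    return [next(draws) if v is None else v for v in values]


def simulate(n=20000, seed=2002):
    """Simulate MAR data and return (full-data mean, mean after hot deck)."""
    rng = random.Random(seed)
    ages = [rng.uniform(0, 40) for _ in range(n)]
    # Hypothetical model: the response declines slowly with age.
    full = [1.0 - 0.004 * a + rng.gauss(0, 0.05) for a in ages]
    # MAR: the chance of a missing response rises with the observed age.
    observed = [None if rng.random() < 0.1 + 0.6 * a / 40 else y
                for a, y in zip(ages, full)]
    weights = [1.0] * n  # equal sampling weights, for simplicity
    completed = hot_deck_impute(observed, weights, rng)
    return sum(full) / n, sum(completed) / n


true_mean, hot_deck_mean = simulate()
print(f"full-data mean: {true_mean:.4f}")
print(f"mean after hot-deck imputation: {hot_deck_mean:.4f}")
```

Because older units are missing more often and have lower responses, the donors over-represent younger units, and the completed-data mean stays above the full-data mean. An imputation model that conditions on age would not share this bias.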
Conclusions about multiple imputation

• The transcan function appeared to perform better than PROC MI for imputing and analyzing this data set, given the non-normality of the data.
• Random hot deck MI with bootstrap gave significantly biased results. This approach does not take into account the information provided by the covariates and is therefore not appropriate for data that are MAR.

Conclusions about the data analysis

• The health status of the population tends to decrease with age.
• People with higher income tend to have better health than people with lower income.
• People with lower health status demand more medical services (visits to a doctor).
• People who are prone to depression tend to have lower health status.
• Smoking does not appear to have a decisive influence on health status.

Future work

• A GLM, such as a multinomial logistic model, could be used to model the categorical response GQ.H1 and to impute its missing values.
• Interactions of the significant variables with the insignificant variables should be explored in order to further assess concomitant effects (e.g. smoking and depression).

Acknowledgements

Special thanks to Professor Peggy Ng and George Monette for their support.