How to Handle Missing Values in Multivariate Data
By Jeff McNeal & Marlen Roberts

The Missing Data Problem
• Problems with Statistical Inference
• Sample Size & Power
• Biased Results
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.

Real World Examples
• Respondents in a household survey refuse to report income
• Missing results of manufacturing experiment due to equipment failure
• Voters' inability to express preference for a political candidate in an opinion poll
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.

Outline
• Common Assumptions and Missing Data Patterns
• Taxonomy of Methods for Handling Missing Values
• Multiple Imputation
• Maximum Likelihood
• Simulation

Missing Data Patterns
• Not all missing data are created equal
• Missing due to a random process
• Missing due to a non-random process

A Simple Example: Income Survey
Westfall, P., & Henning, K. (2013). Understanding Advanced Statistical Methods (1st ed.). Boca Raton, Florida: CRC Press, Taylor & Francis Group.

Univariate Missing Data Process: MCAR
P.H. Westfall

Multivariate Missing Data Processes: MCAR and MAR
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf

Missing Data Processes: MNAR
http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Taxonomy of Missing-Data Methods
• Complete Case Analysis (Listwise Deletion)
• Available Case Analysis (Pairwise Deletion)
• Least Squares on Imputed Data
• Multiple Imputation
• Maximum Likelihood (and Bayes)
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 19-20). Hoboken, New Jersey: John Wiley & Sons.
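
The three missing data processes named in the slides above (MCAR, MAR, MNAR) can be illustrated with a small simulation in the spirit of the income survey example. This is only a sketch under stated assumptions: the age and income distributions, the logistic missingness probabilities, and all variable names are hypothetical choices made for illustration and are not taken from the cited sources. It compares the mean income among respondents (a complete case estimate) with the mean in the full data.

```python
import math
import random
import statistics

random.seed(0)

n = 20000
# Hypothetical survey: age is always observed, income (in $1000s) may be missing.
age = [random.uniform(20, 70) for _ in range(n)]
income = [20 + 0.8 * a + random.gauss(0, 10) for a in age]


def logistic(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def observed_mean(p_missing):
    """Mean income among respondents, given each person's missingness probability."""
    kept = [y for y, p in zip(income, p_missing) if random.random() >= p]
    return statistics.fmean(kept)


# MCAR: every income has the same chance of being missing.
p_mcar = [0.3] * n
# MAR: missingness depends only on an observed variable (age).
p_mar = [logistic((a - 45) / 10) for a in age]
# MNAR: missingness depends on the unobserved income itself.
p_mnar = [logistic((y - 56) / 10) for y in income]

full_mean = statistics.fmean(income)
mean_mcar = observed_mean(p_mcar)
mean_mar = observed_mean(p_mar)
mean_mnar = observed_mean(p_mnar)

print(f"full data mean:      {full_mean:.2f}")
print(f"respondents (MCAR):  {mean_mcar:.2f}")
print(f"respondents (MAR):   {mean_mar:.2f}")
print(f"respondents (MNAR):  {mean_mnar:.2f}")
```

In this setup the respondent mean should stay close to the full data mean under MCAR and drift downward under MAR and MNAR, because older or higher-income people are more likely to be missing. The difference is that under MAR the bias can in principle be removed by conditioning on age, while under MNAR it cannot be removed from the observed data alone.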

Complete Case Analysis (Listwise Deletion)
• Easy to implement
• Works well when MCAR assumption is met
• Wastes a lot of information (every case with any missing value is dropped)
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf

Available Case Analysis (Pairwise Deletion)
• Attempts to minimize the loss of data in listwise deletion
• Increases the power of your test relative to listwise deletion
• Is usually outperformed by Maximum Likelihood
• Caveat: Can result in non-positive definite covariance matrices
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf

Least Squares Imputation Methods
• Unconditional Mean Substitution
• Conditional Mean Imputation based on X
• Conditional Mean Imputation based on X and Y
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf

Unconditional Mean Substitution
• Just take the sample mean of the observed data and use it for the missing values
• Heavily biases the covariance matrix (variances and covariances shrink toward zero)
• Bias can be corrected but the inferences (confidence intervals, tests, etc.) are distorted and over-precise
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf

Conditional Mean Imputation
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf

Multiple Imputation
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 19-20). Hoboken, New Jersey: John Wiley & Sons.
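
Before moving on to multiple imputation, the claims about the single-imputation methods above can be checked with a short simulation. The sketch below is an illustration under assumptions, not the cited authors' code: it generates a hypothetical pair of variables x and y, deletes 40% of x completely at random, and compares unconditional mean substitution with conditional mean imputation (here, predicting x from the fully observed y using a regression fitted on the complete cases). All names and parameter values are invented for the example.

```python
import random
import statistics

random.seed(1)

n = 5000
# Hypothetical data: x is partially missing, y is fully observed.
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]
# MCAR: each x value is missing with probability 0.4.
missing = [random.random() < 0.4 for _ in range(n)]


def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Unconditional mean substitution: fill every gap with the observed mean of x.
x_bar = statistics.fmean(x_obs)
x_mean_imp = [x_bar if m else xi for xi, m in zip(x, missing)]

# Conditional mean imputation: regress x on y over the complete cases,
# then fill each gap with the fitted value.
slope = cov(x_obs, y_obs) / cov(y_obs, y_obs)
intercept = x_bar - slope * statistics.fmean(y_obs)
x_cond_imp = [intercept + slope * yi if m else xi
              for xi, yi, m in zip(x, y, missing)]

var_full = cov(x, x)
var_mean_imp = cov(x_mean_imp, x_mean_imp)
var_cond_imp = cov(x_cond_imp, x_cond_imp)
cov_full = cov(x, y)
cov_mean_imp = cov(x_mean_imp, y)
cov_cond_imp = cov(x_cond_imp, y)

print(f"var(x):   full {var_full:.2f}  mean-sub {var_mean_imp:.2f}  cond-mean {var_cond_imp:.2f}")
print(f"cov(x,y): full {cov_full:.2f}  mean-sub {cov_mean_imp:.2f}  cond-mean {cov_cond_imp:.2f}")
```

Under these assumptions, mean substitution should shrink both the variance of x and its covariance with y, while conditional mean imputation roughly preserves the covariance but still understates the variance, because every imputed value sits exactly on the regression line. That missing residual variability is what multiple imputation adds back.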

Steps Involved in Multiple Imputation
• Introduce random variation into the process of imputing missing values
• Generate several data sets, each with different imputed values
• Perform an analysis on each data set
• Combine the results into a single set of parameter estimates, standard errors, and test statistics
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf

Introducing Randomness into an M.I. Model
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf

Adding Variability to the Imputed Values
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf

Why Do We Want to Add Variability?
• This is the whole point of multiple imputation: the spread across the imputed data sets reflects our uncertainty about the missing values
http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Combining Inferences from Imputed Data
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf

Simplified Form using a Regression Example
http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Likelihood-Based Inference
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf

ML with Ignorable Missing Data
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf

ML with Ignorable Missing Data (continued)
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf

Comparison of Methods

Listwise
• Easiest to implement
• Has minimal effect if data are MCAR, or MAR for large sample sizes
• Tends to bias results otherwise

Pairwise
• Uses more information than listwise
• Increases statistical power relative to listwise
• Also easy to implement

Multiple Imputation
• Requires no special software once the imputed datasets are generated
• Requires specification of a model
• Requires more assumptions

Maximum Likelihood
• Requires specification of a model for each variable
• Asymptotically efficient
• Most complex
• You get model comparison statistics (AIC, BIC, etc.)
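
The four multiple imputation steps listed earlier can be sketched end to end on simulated data. This is a minimal illustration under assumptions, not the procedure from the cited SAS paper: the data, the MAR mechanism, the number of imputations M, and the helper fit_line are all hypothetical. Parameter uncertainty is approximated by refitting the imputation regression on a bootstrap resample of the complete cases, which stands in for drawing parameters from a posterior distribution. The final step pools the M results with Rubin's rules: the pooled estimate is the average of the estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance.

```python
import math
import random
import statistics

random.seed(2)

n, M = 1000, 20  # sample size and number of imputed data sets (illustrative)
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]  # true mean of y is 1.0
# MAR: y is more likely to be missing when the observed x is large.
missing = [random.random() < 1 / (1 + math.exp(-xi)) for xi in x]
complete = [(xi, yi) for xi, yi, m in zip(x, y, missing) if not m]


def fit_line(pairs):
    """Least squares fit of y on x; returns intercept, slope, residual SD."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in pairs) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in pairs)
    return intercept, slope, math.sqrt(rss / (len(pairs) - 2))


estimates, variances = [], []
for _ in range(M):
    # Step 1: introduce random variation. Refit on a bootstrap resample so the
    # imputation model's parameters differ from one data set to the next.
    b0, b1, sigma = fit_line(random.choices(complete, k=len(complete)))
    # Step 2: generate an imputed data set (prediction plus a random residual).
    y_imp = [b0 + b1 * xi + random.gauss(0, sigma) if m else yi
             for xi, yi, m in zip(x, y, missing)]
    # Step 3: analyze it. Here the analysis is just the mean of y.
    estimates.append(statistics.fmean(y_imp))
    variances.append(statistics.variance(y_imp) / n)

# Step 4: combine with Rubin's rules.
q_bar = statistics.fmean(estimates)      # pooled estimate
w = statistics.fmean(variances)          # within-imputation variance
b = statistics.variance(estimates)       # between-imputation variance
t = w + (1 + 1 / M) * b                  # total variance
se = math.sqrt(t)

complete_case_mean = statistics.fmean([yi for _, yi in complete])
print(f"complete case mean: {complete_case_mean:.2f}")
print(f"MI pooled mean:     {q_bar:.2f} (SE {se:.2f})")
```

With data generated this way, the complete case mean should fall well below the true value of 1.0, while the pooled estimate should land near it. The pooled standard error is larger than the naive within-imputation one because the between-imputation term accounts for not knowing the missing values.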