Multiple Imputation

How to Handle Missing Values in Multivariate Data
By Jeff McNeal & Marlen Roberts
The Missing Data Problem
• Problems with Statistical Inference
• Sample Size & Power
• Biased Results
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.
Real World Examples
• Respondents in a household survey refuse to report income
• Results of a manufacturing experiment are missing because of equipment failure
• Voters in an opinion poll are unable to express a preference for a political candidate
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.
Outline
• Common Assumptions and Missing Data Patterns
• Taxonomy of Methods for Handling Missing Values
• Multiple Imputation
• Maximum Likelihood
• Simulation
Missing Data Patterns
• Not all missing data are created equal
• Missing due to a random process
• Missing due to a non-random process
A Simple Example: Income Survey
Westfall, P., & Henning, K. (2013). Understanding Advanced Statistical Methods (1st ed.). Boca Raton, Florida: CRC Press, Taylor & Francis Group.
Univariate Missing Data Process: MCAR
P.H. Westfall
Multivariate Missing Data Processes: MCAR and MAR
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
Missing Data Processes: MNAR
http://www.stat.columbia.edu/~gelman/arm/missing.pdf
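The three mechanisms named on the preceding slides can be illustrated with a small simulation. This is a minimal sketch, not the presenters' example: the education and income variables, the missingness probabilities, and the function name observed_mean are all hypothetical. It uses the standard definitions: under MCAR the chance of being missing is the same for everyone, under MAR it depends only on observed variables, and under MNAR it depends on the unobserved value itself.

```python
import random
import statistics

random.seed(0)
n = 20000
# Hypothetical data: years of education drive income (in $1000s).
educ = [random.gauss(14, 2) for _ in range(n)]
income = [5 * e + random.gauss(0, 10) for e in educ]

def observed_mean(values, is_missing):
    """Mean of the values that were not flagged as missing."""
    return statistics.mean(v for v, m in zip(values, is_missing) if not m)

# MCAR: every income has the same 30% chance of being missing.
mcar = [random.random() < 0.3 for _ in range(n)]
# MAR: missingness depends only on the fully observed covariate (education).
mar = [random.random() < (0.5 if e > 14 else 0.1) for e in educ]
# MNAR: missingness depends on the income value that goes unreported.
mnar = [random.random() < (0.5 if y > 70 else 0.1) for y in income]

true_mean = statistics.mean(income)
mean_mcar = observed_mean(income, mcar)
mean_mar = observed_mean(income, mar)
mean_mnar = observed_mean(income, mnar)

print(f"full data: {true_mean:.2f}")
print(f"MCAR observed mean: {mean_mcar:.2f}")
print(f"MAR observed mean:  {mean_mar:.2f}")
print(f"MNAR observed mean: {mean_mnar:.2f}")
```

In this setup the observed-case mean stays close to the full-data mean only under MCAR. Under MAR the bias can in principle be removed by conditioning on education; under MNAR it cannot be removed using the observed data alone.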
Taxonomy of Missing-Data Methods
• Complete Case Analysis (Listwise Deletion)
• Available Case Analysis (Pairwise Deletion)
• Least Squares on Imputed Data
• Multiple Imputation
• Maximum Likelihood (and Bayes)
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 19-20). Hoboken, New Jersey: John Wiley & Sons.
Complete Case Analysis (Listwise Deletion)
• Easy to implement
• Works well when the MCAR assumption is met
• Discards every case that has any missing value, which wastes a lot of information
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
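How much information listwise deletion wastes can be seen with a quick calculation. The sketch below assumes, purely for illustration, that each variable is missing independently with the same probability; the function name and the 10% figure are hypothetical, not taken from the slides.

```python
def expected_complete_fraction(p_missing, n_variables):
    """Expected share of cases with no missing values at all,
    assuming each variable is missing independently (an assumption)."""
    return (1 - p_missing) ** n_variables

# Ten variables, each missing 10% of the time.
fraction = expected_complete_fraction(0.10, 10)
print(f"{fraction:.1%}")  # prints 34.9%
```

Under these assumptions only about a third of the cases survive, even though 90% of all individual values were observed.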
Available Case Analysis (Pairwise Deletion)
• Attempts to reduce the data loss of listwise deletion by using every case available for each pair of variables
• Can increase the power of your test relative to listwise deletion
• Is usually outperformed by Maximum Likelihood
• Caveat: Can result in non-positive definite covariance matrices
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
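The caveat about non-positive definite matrices can be reproduced with a deliberately extreme toy data set. This is a constructed illustration, not data from the cited source: each pair of variables is observed on a different block of rows, so the three pairwise correlations are mutually inconsistent. The helper pairwise_corr is a hypothetical name.

```python
import math

def pairwise_corr(a, b):
    """Pearson correlation using only rows where both values are present."""
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    mu = sum(u for u, _ in pairs) / len(pairs)
    mv = sum(v for _, v in pairs) / len(pairs)
    suv = sum((u - mu) * (v - mv) for u, v in pairs)
    suu = sum((u - mu) ** 2 for u, _ in pairs)
    svv = sum((v - mv) ** 2 for _, v in pairs)
    return suv / math.sqrt(suu * svv)

NA = None
# Rows 0-2 observe (x, y), rows 3-5 observe (y, z), rows 6-8 observe (x, z).
x = [1, 2, 3, NA, NA, NA, 1, 2, 3]
y = [1, 2, 3, 1, 2, 3, NA, NA, NA]
z = [NA, NA, NA, 1, 2, 3, 3, 2, 1]

r_xy = pairwise_corr(x, y)
r_yz = pairwise_corr(y, z)
r_xz = pairwise_corr(x, z)

# Determinant of the 3x3 correlation matrix with ones on the diagonal.
det = 1 + 2 * r_xy * r_yz * r_xz - r_xy**2 - r_yz**2 - r_xz**2
print(r_xy, r_yz, r_xz, det)
```

A valid correlation matrix cannot have a negative determinant, so a negative value here shows that the pairwise estimates do not form a positive semidefinite matrix. No complete data set could produce x and y perfectly correlated, y and z perfectly correlated, and x and z perfectly negatively correlated.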
Least Squares Imputation Methods
• Unconditional Mean Substitution
• Conditional Mean Imputation based on X
• Conditional Mean Imputation based on X and Y
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
Unconditional Mean Substitution
• Replace each missing value with the sample mean of the observed values for that variable
• Heavily biases the covariance matrix: variances and covariances shrink toward zero
• The bias can be corrected, but the inferences (confidence intervals, tests, etc.) remain distorted and over-precise
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
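The variance distortion from unconditional mean substitution can be checked directly. The sketch below is an illustration under assumed conditions (simulated data, roughly 40% of values deleted completely at random); the variable names are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(1)
n = 1000
y = [rng.gauss(0, 1) for _ in range(n)]
# Hypothetical MCAR mechanism: delete about 40% of y at random.
y_obs = [None if rng.random() < 0.4 else v for v in y]

observed = [v for v in y_obs if v is not None]
fill = statistics.mean(observed)
# Unconditional mean substitution: every gap gets the observed mean.
y_filled = [fill if v is None else v for v in y_obs]

var_obs = statistics.variance(observed)
var_filled = statistics.variance(y_filled)
# Imputed values add nothing to the sum of squares, so the sample
# variance shrinks by exactly this factor.
shrink = (len(observed) - 1) / (n - 1)

se_honest = math.sqrt(var_obs / len(observed))  # based on observed cases
se_naive = math.sqrt(var_filled / n)            # treats imputed values as real
print(var_obs, var_filled, shrink, se_honest, se_naive)
```

The naive standard error is smaller both because the variance is understated and because the sample size is overstated, which is the over-precision the slide warns about.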
Conditional Mean Imputation
http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
Multiple Imputation
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 19-20). Hoboken, New Jersey: John Wiley & Sons.
Steps Involved in Multiple Imputation
• Introduce random variation into the process of imputing missing
values
• Generate several data sets, each with different imputed values
• Perform an analysis on each data set
• Combine the results into a single set of parameter estimates,
standard errors, and test statistics
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
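The four steps above can be sketched end to end in a few lines. This is a simplified illustration under stated assumptions, not the procedure from the cited paper: the data are simulated, the imputation model is a simple linear regression of y on x, parameter uncertainty is approximated by refitting on a bootstrap resample, and the names fit_line, impute_once, and pool are hypothetical.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, math.sqrt(rss / (len(xs) - 2))

def impute_once(xs, ys, rng):
    """Return a completed copy of ys (None marks a missing value)."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # Refit on a bootstrap resample so parameter uncertainty is reflected.
    boot = [rng.choice(observed) for _ in observed]
    a, b, sd = fit_line([x for x, _ in boot], [y for _, y in boot])
    # Step 1: predicted value plus a random residual draw.
    return [y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)]

def pool(estimates, variances):
    """Combine per-data-set results: pooled estimate and total variance."""
    m = len(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    return statistics.mean(estimates), within + (1 + 1 / m) * between

rng = random.Random(42)
n = 500
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]
# Hypothetical MAR mechanism: y is missing more often when x is large.
ys_obs = [None if rng.random() < (0.6 if x > 0 else 0.1) else y
          for x, y in zip(xs, ys)]

estimates, variances = [], []
for _ in range(20):                                   # Step 2: several data sets
    completed = impute_once(xs, ys_obs, rng)
    estimates.append(statistics.mean(completed))      # Step 3: analyze each one
    variances.append(statistics.variance(completed) / n)
q_bar, total_var = pool(estimates, variances)         # Step 4: combine

cc_mean = statistics.mean(y for y in ys_obs if y is not None)
print(f"complete-case mean: {cc_mean:.2f}")
print(f"pooled MI mean: {q_bar:.2f} (SE {math.sqrt(total_var):.3f})")
```

The analysis here is just the mean of y; any complete-data analysis could take its place in step 3, as long as it returns an estimate and its variance for pooling.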
Introducing Randomness into a M.I. Model
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
Adding Variability to the Imputed Values
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
Why Do We Want to Add Variability?
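• This is the whole point of multiple imputation: the spread across the imputed data sets carries the uncertainty about the missing values into the final standard errors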
http://www.stat.columbia.edu/~gelman/arm/missing.pdf
Combining Inferences from Imputed Data
http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
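The formulas from this slide did not survive extraction. For reference, the standard combining rules (Rubin's rules) are given below, on the assumption that they match what the slide showed. The notation is chosen here for illustration: M imputed data sets, with estimate \hat{Q}_m and estimated variance U_m from data set m.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m
  && \text{(pooled estimate)} \\
W &= \frac{1}{M}\sum_{m=1}^{M} U_m
  && \text{(within-imputation variance)} \\
B &= \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^2
  && \text{(between-imputation variance)} \\
T &= W + \Bigl(1+\frac{1}{M}\Bigr)B
  && \text{(total variance)} \\
\nu &= (M-1)\Bigl[1+\frac{W}{(1+1/M)\,B}\Bigr]^{2}
  && \text{(degrees of freedom for the } t \text{ reference distribution)}
\end{align*}
```

The pooled standard error is \sqrt{T}, and tests and intervals compare (\bar{Q}-Q)/\sqrt{T} to a t distribution with \nu degrees of freedom.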
Simplified Form using a Regression Example
http://www.stat.columbia.edu/~gelman/arm/missing.pdf
Likelihood-Based Inference
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
ML with Ignorable Missing Data
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
ML with Ignorable Missing Data
https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
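The content of these two slides was not preserved. As a reference point, and as an assumption about what they covered, the usual formulation of maximum likelihood with ignorable missing data is stated below. Y_obs and Y_mis denote the observed and missing parts of the data; "ignorable" means the data are MAR and the parameters of the missingness mechanism are distinct from \theta.

```latex
\begin{align*}
L(\theta \mid Y_{\mathrm{obs}}) &\propto f(Y_{\mathrm{obs}} \mid \theta)
  = \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\, dY_{\mathrm{mis}}, \\
\hat{\theta}_{\mathrm{ML}} &= \arg\max_{\theta}\; L(\theta \mid Y_{\mathrm{obs}}).
\end{align*}
```

Under ignorability the missingness mechanism does not need to be modeled: the missing values are integrated out of the complete-data density and the resulting observed-data likelihood is maximized directly.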
Comparison of Methods
Listwise
• Easiest to implement
• Introduces little bias if data are MCAR, or MAR for large sample sizes
• Otherwise has a tendency to bias results
Multiple Imputation
• Requires no special software once the
imputed datasets are generated
• Requires specification of a model
• Requires more assumptions
Pairwise
• Uses more information than listwise
• Increases statistical power
• Also easy to implement
Maximum Likelihood
• Requires specification of a model for each
variable
• Most asymptotically efficient
• Most complex
• You get model comparison statistics (AIC,
BIC, etc.)