Roberta Harnett
MAR 550
October 30, 2007
When do we see missing data?
Types of missing data
Traditional approaches
Deletion
Substitution
Modern Approaches
Maximum likelihood and Bayes
Software
Medical studies, nonresponse in surveys or censuses, dropouts in clinical trials, censored data
Loss of information, power
Bias in results due to differences in missing and observed data
Complicated analysis with standard software
MCAR
MAR
MNAR
Missing Completely at Random
Probability that x_i is missing doesn’t depend on its value or on the values of other variables
Doesn’t matter if missingness on x_i is associated with “missingness” on other variables
Missing at Random
Missingness doesn’t depend on x_i after controlling for other variables
This is not great, but we can deal with it
Missing Not at Random
Not MCAR or MAR (anything else)
BAD!!
Model missingness
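The three mechanisms can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `x` and `y`, the missingness probabilities, and the helper names are all assumptions chosen for illustration. It compares the bias of the observed mean of `y` with and without conditioning on the always-observed `x`.

```python
import random
import statistics

random.seed(0)
n = 20000
# Hypothetical data: x is always observed, y is correlated with x.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

# Missingness indicators for y under each mechanism (True = missing).
masks = {
    # MCAR: every y has the same 30% chance of being missing.
    "MCAR": [random.random() < 0.3 for _ in range(n)],
    # MAR: missingness depends only on the observed x.
    "MAR": [random.random() < (0.6 if xi > 0 else 0.1) for xi in x],
    # MNAR: missingness depends on the unobserved value of y itself.
    "MNAR": [random.random() < (0.6 if yi > 0 else 0.1) for yi in y],
}

def observed_mean(values, missing):
    """Mean of the values that are not flagged as missing."""
    return statistics.fmean(v for v, m in zip(values, missing) if not m)

def stratified_mean(values, covariate, missing):
    """Observed means within x > 0 and x <= 0, weighted by stratum size."""
    total = 0.0
    for positive in (True, False):
        idx = [i for i, c in enumerate(covariate) if (c > 0) == positive]
        obs = [values[i] for i in idx if not missing[i]]
        total += len(idx) / len(values) * statistics.fmean(obs)
    return total

true_mean = statistics.fmean(y)
naive_bias = {k: observed_mean(y, m) - true_mean for k, m in masks.items()}
strat_bias = {k: stratified_mean(y, x, m) - true_mean for k, m in masks.items()}
for k in masks:
    print(f"{k}: naive bias {naive_bias[k]:+.3f}, "
          f"stratified bias {strat_bias[k]:+.3f}")
```

In this setup the naive mean is close to unbiased only under MCAR. Controlling for `x` removes the bias under MAR but not under MNAR, which is why MNAR requires a model for the missingness itself.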
Deletion
List-wise
Unbiased if data are MCAR, but loses power
Alternatives are really replacements for list-wise
Pairwise (also called “unwise”) deletion
Leads to different sample sizes for different parts of analysis
Can be a disaster
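A toy example shows how the two deletion schemes end up using different cases. The data set, column names, and helper functions below are hypothetical; `None` marks a missing value.

```python
import math

# Hypothetical toy data set; None marks a missing value.
rows = [
    {"a": 1.0, "b": 2.1, "c": 0.5},
    {"a": 2.0, "b": None, "c": 1.5},
    {"a": 3.0, "b": 3.9, "c": None},
    {"a": 4.0, "b": 8.2, "c": 3.5},
    {"a": None, "b": 9.8, "c": 4.0},
    {"a": 6.0, "b": 12.1, "c": 6.5},
]

def complete_cases(rows, cols):
    """Keep only the rows with no missing value in the given columns."""
    return [r for r in rows if all(r[c] is not None for c in cols)]

def pearson(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    ss_u = sum((a - mu) ** 2 for a in u)
    ss_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)

# List-wise deletion: one common sample for the whole analysis.
listwise = complete_cases(rows, ["a", "b", "c"])
print("list-wise n =", len(listwise))

# Pairwise deletion: each correlation uses whichever rows have both variables.
pairwise_n = {}
for u, v in [("a", "b"), ("a", "c"), ("b", "c")]:
    used = complete_cases(rows, [u, v])
    pairwise_n[(u, v)] = len(used)
    r = pearson([row[u] for row in used], [row[v] for row in used])
    print(f"pairwise {u},{v}: n = {len(used)}, r = {r:.3f}")
```

Each pairwise correlation here rests on a different subset of four rows, while list-wise deletion keeps the same three rows throughout. Because the subsets differ, the entries of a pairwise correlation matrix are not guaranteed to be mutually consistent.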
Single Imputation
Hot deck
Census Bureau
vs. Cold deck
Mean substitution
Regression substitution
Stochastic regression substitution
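The three substitution methods listed last can be compared on simulated data. This is a sketch under assumed conditions: a made-up linear model, MCAR missingness in `y`, and a fully observed predictor `x`. All names are illustrative.

```python
import math
import random
import statistics

random.seed(1)
n = 5000
# Hypothetical data: y depends linearly on x; 40% of y is missing (MCAR).
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]
missing = [random.random() < 0.4 for _ in range(n)]

xo = [xi for xi, m in zip(x, missing) if not m]
yo = [yi for yi, m in zip(y, missing) if not m]

# Least-squares regression of y on x, fitted to the observed cases.
mx, my = statistics.fmean(xo), statistics.fmean(yo)
slope = (sum((a - mx) * (b - my) for a, b in zip(xo, yo))
         / sum((a - mx) ** 2 for a in xo))
intercept = my - slope * mx
resid_sd = math.sqrt(
    sum((b - intercept - slope * a) ** 2 for a, b in zip(xo, yo))
    / (len(xo) - 2)
)

def impute(fill):
    """Replace each missing y with fill(x); keep observed values as they are."""
    return [fill(xi) if m else yi for xi, yi, m in zip(x, y, missing)]

mean_sub = impute(lambda xi: my)
reg_sub = impute(lambda xi: intercept + slope * xi)
stoch_sub = impute(
    lambda xi: intercept + slope * xi + random.gauss(0, resid_sd)
)

var = {
    "full data": statistics.variance(y),
    "mean": statistics.variance(mean_sub),
    "regression": statistics.variance(reg_sub),
    "stochastic regression": statistics.variance(stoch_sub),
}
for name, value in var.items():
    print(f"{name}: variance of y = {value:.2f}")
```

Mean substitution shrinks the variance the most, regression substitution shrinks it less, and adding a random residual restores roughly the full-data variance. A single imputed data set still understates the uncertainty about the imputed values.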
Maximum Likelihood
EM algorithm
Estimate parameters: start from listwise deletion, add some error
(E): Predict missing data
(M): Maximize likelihood. Repeat.
NORM (http://www.stat.psu.edu/~jls/misoftwa.html)
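The EM iteration can be sketched for a simple case. The assumptions here are not from the slides: a bivariate normal model, `x` fully observed, `y` missing at random given `x`, and made-up data. The "error" added in the E-step is the residual variance term that enters the expected value of y squared.

```python
import random
import statistics

random.seed(2)
n = 4000
# Hypothetical data: x fully observed; y is MAR (missing more often when x > 0).
x = [random.gauss(0, 1) for _ in range(n)]
y_full = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]
y = [None if random.random() < (0.8 if xi > 0 else 0.1) else yi
     for xi, yi in zip(x, y_full)]

mu_x = statistics.fmean(x)
s_xx = statistics.fmean((xi - mu_x) ** 2 for xi in x)

# Starting values from the complete cases (listwise deletion).
cc = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mu_y = statistics.fmean(yi for _, yi in cc)
s_yy = statistics.fmean((yi - mu_y) ** 2 for _, yi in cc)
s_xy = statistics.fmean((xi - mu_x) * (yi - mu_y) for xi, yi in cc)
cc_mean = mu_y

for iteration in range(500):
    # E-step: expected y and y^2 for each case given x and current parameters.
    beta = s_xy / s_xx
    resid_var = s_yy - s_xy ** 2 / s_xx
    ey, ey2 = [], []
    for xi, yi in zip(x, y):
        if yi is None:
            pred = mu_y + beta * (xi - mu_x)
            ey.append(pred)
            ey2.append(pred ** 2 + resid_var)  # prediction plus residual variance
        else:
            ey.append(yi)
            ey2.append(yi ** 2)
    # M-step: maximum-likelihood updates from the expected sufficient statistics.
    new_mu_y = statistics.fmean(ey)
    new_s_xy = statistics.fmean(xi * e for xi, e in zip(x, ey)) - mu_x * new_mu_y
    new_s_yy = statistics.fmean(ey2) - new_mu_y ** 2
    change = max(abs(new_mu_y - mu_y), abs(new_s_xy - s_xy), abs(new_s_yy - s_yy))
    mu_y, s_xy, s_yy = new_mu_y, new_s_xy, new_s_yy
    if change < 1e-9:
        break

print(f"true mean of y:     {statistics.fmean(y_full):.3f}")
print(f"complete-case mean: {cc_mean:.3f}")
print(f"EM estimate:        {mu_y:.3f}")
```

Because missingness depends on `x`, the complete-case mean is badly biased in this setup, while the EM estimate comes out close to the mean of the full data.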
Multiple Imputation
Simple and general – works for any type of analysis
Validity of method depends on how imputation is carried out
Should reasonably predict missing data, but should also reflect uncertainty in predictions
Using a “sensible” imputation model
Predict missing values, then add error component drawn randomly from residual distribution of the variable
Repeat several times to improve error estimates
Use Bayesian arguments to impute data:
Parametric model for data
Ignorable missing data
Non-ignorable missing data
Apply prior for unknown model parameters
Simulate m independent draws from the distribution of Y_mis given Y_obs
Calculate values explicitly or through MCMC
Simulate a random draw of unknown parameters from observed-data posterior
Simulate a random draw of missing values from conditional predictive distribution
Repeat, obtaining new parameter estimates from the “complete” data set, until the estimates stabilize
Do 3-5 times total (Rubin)
MCMC: data augmentation algorithm of Tanner and Wong (1987)
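The two simulation steps (draw parameters from the observed-data posterior, then draw missing values from the predictive distribution) can be sketched for the simplest case. The assumptions here are illustrative: a single normal variable, ignorable missingness, and the standard noninformative prior, so the posterior is available explicitly and no MCMC is needed. The data and function name are hypothetical.

```python
import math
import random
import statistics

def draw_imputation(observed, n_missing, rng):
    """One imputation: draw (mu, sigma^2) from the posterior, then draw Y_mis."""
    n = len(observed)
    ybar = statistics.fmean(observed)
    s2 = statistics.variance(observed)
    # Posterior under the prior p(mu, sigma^2) proportional to 1 / sigma^2:
    #   sigma^2 | Y_obs      ~ (n - 1) s^2 / chi-square(n - 1)
    #   mu | sigma^2, Y_obs  ~ Normal(ybar, sigma^2 / n)
    # A chi-square(k) draw is a gamma draw with shape k / 2 and scale 2.
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)
    sigma2 = (n - 1) * s2 / chi2
    mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
    # Predictive draw of the missing values given the drawn parameters.
    return [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n_missing)]

rng = random.Random(3)
observed = [rng.gauss(50, 10) for _ in range(30)]  # hypothetical observed values
n_missing = 10
m = 5  # number of imputations

completed = [observed + draw_imputation(observed, n_missing, rng)
             for _ in range(m)]
means = [statistics.fmean(d) for d in completed]
print([round(v, 2) for v in means])
```

Each completed data set uses a fresh parameter draw, so the spread of the m estimates reflects uncertainty about both the parameters and the missing values. With multivariate data and general missingness patterns the posterior is usually not available in closed form, which is where data augmentation comes in.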
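Calculate the parameter estimate $Q_l$ from each of the m data sets, $l = 1, \dots, m$
Estimate of Q is just the average of the m values: $\bar{Q} = \frac{1}{m} \sum_{l=1}^{m} Q_l$
Variance of the estimate is $T = (1 + m^{-1}) B + U$
where U is the mean within-imputation variance and B is the between-imputation variability, $B = \frac{1}{m-1} \sum_{l=1}^{m} (Q_l - \bar{Q})^2$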
As m → ∞, T → B + U and you don’t need to correct B for low numbers of imputations.
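The combining rules take only a few lines. This sketch assumes the m − 1 divisor for B, as in Rubin's rules. The function name and the numbers are made up for illustration.

```python
import statistics

def pool(estimates, variances):
    """Combine m complete-data estimates and their variances.

    Returns (Q_bar, T), where T = (1 + 1/m) * B + U.
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)    # average of the m estimates
    u_bar = statistics.fmean(variances)    # mean within-imputation variance U
    b = statistics.variance(estimates)     # between-imputation variance B (divisor m - 1)
    t = (1 + 1 / m) * b + u_bar
    return q_bar, t

# Hypothetical results from m = 5 imputed data sets.
estimates = [10.0, 12.0, 11.0, 13.0, 9.0]
variances = [1.0, 1.2, 0.8, 1.1, 0.9]
q_bar, t = pool(estimates, variances)
print(f"pooled estimate = {q_bar:.2f}, total variance = {t:.2f}")
```

Here B = 2.5 and U = 1.0, so T = 1.2 × 2.5 + 1.0 = 4.0. The (1 + 1/m) factor inflates B by 20% when m = 5.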
Imputation is computationally distinct from analysis
Problem if assumptions of imputation are not compatible with analysis assumptions
Loss of power if imputation makes fewer assumptions than analysis
“Superefficient” if imputation is based on more (valid) assumptions than analysis
Inconsistent if imputation makes invalid assumptions that are not included in analysis
Ex: interaction terms
Imputation needs to preserve features of data that will be included in analysis
Approximate Bayesian Bootstrap (Rubin, 1987)
Fancier version of Hot deck imputation
Removing entries with missing data vs. MI
Imputing once vs. MI
Number of imputations
Efficiency is $(1 + \lambda/m)^{-1}$, where λ is the fraction of missing information
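A quick calculation shows why a handful of imputations is usually enough. The function name and the grid of values are illustrative; `lam` stands for the fraction of missing information λ.

```python
def relative_efficiency(lam, m):
    """Efficiency of m imputations relative to infinitely many: 1 / (1 + lam / m)."""
    return 1 / (1 + lam / m)

# Efficiency for a few fractions of missing information and numbers of imputations.
for lam in (0.1, 0.3, 0.5):
    row = ", ".join(f"m={m}: {relative_efficiency(lam, m):.3f}"
                    for m in (3, 5, 10, 20))
    print(f"lambda={lam}: {row}")
```

Even with half of the information missing, m = 5 gives an efficiency of about 0.91. Returns beyond 10 to 20 imputations are small.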
MI vs. EM
Ignorable if data are MAR
MI can be used when there is nonignorable nonresponse
Missing-data mechanism
For S-PLUS: www.stat.psu.edu/~jls/misoftwa.html
For R:
Amelia (II) (surveys and time-series data)
Norm (for multivariate normal data)
SOLAS (tested by Allison, 2000)
For Windows
Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. J. Wiley & Sons, New York.
Schafer, J.L. (1999) Multiple imputation: a primer. Statistical Methods in Medical Research, 8, 3-15.
Barnard, J. and Meng, X. (1999) Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research, 8, 17-36.
http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html
http://www.stat.psu.edu/~jls/mifaq.html#em
Allison, P.D. (2000) Multiple Imputation for Missing Data: A Cautionary Tale. Sociological Methods and Research, 28 (3), 301-309.
AIDS survival time with reporting-delay
(1) Survival-time model
(2) Reporting-lag model using available information
(3) Multiply impute delayed cases using model from step 2
(4) Compute estimates of survival-time model parameters
(5) Combine estimates using repeated-imputation rules
Effects of school choice on achievement tests (public vs. private schools)
School vouchers to attend “choice” schools, participating private schools
Only households with income below 1.75 times the poverty line could participate
Randomized block design
Outcome variables were scores from ITBS
Maximum of 4 years observed (1990-1994)
Higher levels of missingness than in typical medical study
Pattern in missing data was not monotone