Statistical Methods for Missing Data


Roberta Harnett

MAR 550

October 30, 2007

Outline

 When do we see missing data?

 Types of missing data

 Traditional approaches

 Deletion

 Substitution

 Modern Approaches

 Maximum likelihood and Bayes

 Software

Missing Data

 Medical studies, nonresponse in surveys or censuses, dropouts in clinical trials, censored data

 Loss of information, power

 Bias in results due to differences in missing and observed data

 Complicated analysis with standard software

Types of missing data

 MCAR

 MAR

 MNAR

MCAR

 Missing Completely at Random

 Probability that x_i is missing doesn’t depend on its value or on the values of other variables

Doesn’t matter if it is associated with the “missingness” of other variables

MAR

 Missing at Random



Missingness doesn’t depend on x_i after controlling for other variables

 This is not great, but we can deal with it

MNAR

 Missing Not at Random

 Not MCAR or MAR (anything else)

 BAD!!

 Model missingness
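The three mechanisms can be made concrete with a small simulation. This is a minimal sketch, not part of the original slides: the variable names, the logistic missingness model, and all coefficients are assumptions chosen for illustration. It shows that the observed mean of x is roughly unbiased under MCAR and biased under MAR and MNAR, and that under MAR the relationship between x and the observed variable y is still recoverable from complete cases.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two correlated variables: y is always observed, x may be missing.
y = rng.normal(size=n)
x = 0.8 * y + rng.normal(scale=0.6, size=n)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every x_i has the same 30% chance of being missing.
miss_mcar = rng.random(n) < 0.3
# MAR: the chance that x_i is missing depends only on the observed y_i.
miss_mar = rng.random(n) < logistic(2.0 * y - 1.0)
# MNAR: the chance that x_i is missing depends on x_i itself.
miss_mnar = rng.random(n) < logistic(2.0 * x - 1.0)

mean_full = x.mean()
mean_mcar = x[~miss_mcar].mean()   # close to mean_full
mean_mar = x[~miss_mar].mean()     # biased downward
mean_mnar = x[~miss_mnar].mean()   # biased downward

# Under MAR the regression of x on y among complete cases is still valid,
# which is why the missing x values can be predicted from y.
slope_mar = np.polyfit(y[~miss_mar], x[~miss_mar], 1)[0]

print(mean_full, mean_mcar, mean_mar, mean_mnar, slope_mar)
```

Under MNAR no observed variable explains the missingness, so the same regression trick is not available and the missingness itself has to be modeled.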

Traditional Approaches

 Deletion

 List-wise



Unbiased if data are MCAR, but loses power



Alternatives are really replacements for list-wise

 Pairwise (also called “unwise”) deletion



Leads to different sample sizes for different parts of analysis



Can be a disaster
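The difference between the two deletion strategies is easy to see in code. The sketch below uses made-up data (three hypothetical variables with 20% of values removed at random) and counts how many cases each approach actually uses. One known way pairwise deletion can go wrong is that a correlation matrix assembled from different subsets of cases is not guaranteed to be positive definite.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
data = rng.normal(size=(n, 3))
# Remove 20% of the entries at random (MCAR, for illustration only).
data[rng.random((n, 3)) < 0.2] = np.nan

# List-wise deletion: keep only rows with no missing values at all.
complete_rows = ~np.isnan(data).any(axis=1)
n_listwise = int(complete_rows.sum())

# Pairwise deletion: each correlation uses every row where both
# variables of that pair are observed, so sample sizes differ by pair.
pair_n = {}
pair_corr = {}
for i in range(3):
    for j in range(i + 1, 3):
        both = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
        pair_n[(i, j)] = int(both.sum())
        pair_corr[(i, j)] = np.corrcoef(data[both, i], data[both, j])[0, 1]

print("list-wise n:", n_listwise)
print("pairwise n :", pair_n)
```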

Traditional cont…

 Single Imputation

 Hot deck



Census Bureau

 vs. Cold deck

 Mean substitution

 Regression substitution

 Stochastic regression substitution
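A rough comparison of the three substitution methods, as a sketch on simulated data (the model y = 1 + 2x + noise and the 40% MCAR missingness are assumptions for illustration). Mean substitution and plain regression substitution both shrink the variance of the filled-in variable; adding a random residual to each prediction restores most of it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)      # true variance of y is 5

missing = rng.random(n) < 0.4               # y is MCAR, x fully observed
obs = ~missing

# Mean substitution: every missing y gets the observed mean.
y_mean = y.copy()
y_mean[missing] = y[obs].mean()

# Regression substitution: every missing y gets its predicted value.
slope, intercept = np.polyfit(x[obs], y[obs], 1)
pred = intercept + slope * x
y_reg = y.copy()
y_reg[missing] = pred[missing]

# Stochastic regression substitution: prediction plus a random residual.
resid_sd = np.std(y[obs] - pred[obs], ddof=2)
y_stoch = y.copy()
y_stoch[missing] = pred[missing] + rng.normal(scale=resid_sd, size=missing.sum())

var_true = y.var(ddof=1)
var_mean = y_mean.var(ddof=1)
var_reg = y_reg.var(ddof=1)
var_stoch = y_stoch.var(ddof=1)
print(var_true, var_mean, var_reg, var_stoch)
```

Even the stochastic version is still a single imputation: the filled-in values are treated as if they were real data, so standard errors computed from the completed data set are too small.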

Modern Methods

 Maximum Likelihood

 EM algorithm



Estimate parameters

 Listwise deletion, add some error



(E): Predict missing data



(M): Maximize likelihood. Repeat.

 NORM (http://www.stat.psu.edu/~jls/misoftwa.html)
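The E and M steps can be sketched for the simplest useful case: a bivariate normal model in which x is fully observed and y is missing at random given x. Everything here (data, coefficients, the missingness model, the stopping rule) is an assumption for illustration and is not how NORM is implemented. Starting values come from list-wise deletion; in the E-step the expected value of y squared includes the residual variance, which plays the role of the "added error".

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)       # true mean of y is 1
# MAR: y is more likely to be missing when the observed x is large.
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))
y_obs = np.where(missing, np.nan, y)

# Starting values from list-wise deletion (complete cases only).
cc = ~missing
mu = np.array([x[cc].mean(), y[cc].mean()])
cov = np.cov(x[cc], y[cc])
mean_y_listwise = mu[1]

for _ in range(200):
    # E-step: expected y and y^2 for missing cases, given x and current parameters.
    beta = cov[0, 1] / cov[0, 0]
    resid_var = cov[1, 1] - beta * cov[0, 1]
    ey = np.where(missing, mu[1] + beta * (x - mu[0]), y_obs)
    ey2 = np.where(missing, ey**2 + resid_var, y_obs**2)

    # M-step: maximum likelihood estimates from the expected sufficient statistics.
    new_mu = np.array([x.mean(), ey.mean()])
    sxx = np.mean(x**2) - new_mu[0] ** 2
    sxy = np.mean(x * ey) - new_mu[0] * new_mu[1]
    syy = np.mean(ey2) - new_mu[1] ** 2
    new_cov = np.array([[sxx, sxy], [sxy, syy]])

    done = np.allclose(new_mu, mu, atol=1e-8) and np.allclose(new_cov, cov, atol=1e-8)
    mu, cov = new_mu, new_cov
    if done:
        break

mean_y_em = mu[1]
print("list-wise mean of y:", mean_y_listwise, " EM mean of y:", mean_y_em)
```

In this setup the list-wise mean of y is badly biased because cases with large x are underrepresented, while the EM estimate lands near the true value.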

Modern Methods

 Multiple Imputation

 Simple and general – works for any type of analysis

 Validity of method depends on how imputation is carried out



Should reasonably predict missing data, but should also reflect uncertainty in predictions



Using a “sensible” imputation model

“Random Imputation”

 Predict missing values, then add error component drawn randomly from residual distribution of the variable

 Repeat several times to improve error estimates

Multiple Imputation

 Use Bayesian arguments to impute data:

 Parametric model for data



Ignorable missing data



Non-ignorable missing data

 Apply prior for unknown model parameters

 Simulate m independent draws from distribution of Y_mis given Y_obs

 Calculate values explicitly or through MCMC

MI procedure

 Simulate a random draw of unknown parameters from observed-data posterior

 Simulate a random draw of missing values from conditional predictive distribution

 Repeat, obtaining new parameter estimates from the “complete” data set until the process stabilizes

 Do 3-5 times total (Rubin)

 MCMC: data augmentation algorithm of Tanner and Wong (1987)
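The first two steps can be written out explicitly for a toy case. The sketch below assumes a single normally distributed variable, values missing completely at random, and the standard noninformative prior proportional to 1/σ²; the function name impute_normal is hypothetical. In this model the observed-data posterior is available in closed form, so no iteration or MCMC is needed. Data augmentation is what takes over when it is not.

```python
import numpy as np

def impute_normal(y, m=5, seed=0):
    """Return m completed copies of y, where NaN marks a missing value.

    Assumes a univariate normal model, MCAR missingness, and the
    noninformative prior p(mu, sigma^2) proportional to 1/sigma^2.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    obs = y[~missing]
    n_obs = obs.size
    ybar, s2 = obs.mean(), obs.var(ddof=1)

    completed = []
    for _ in range(m):
        # Step 1: draw the parameters from their observed-data posterior.
        sigma2 = (n_obs - 1) * s2 / rng.chisquare(n_obs - 1)
        mu = rng.normal(ybar, np.sqrt(sigma2 / n_obs))
        # Step 2: draw the missing values from their predictive distribution.
        filled = y.copy()
        filled[missing] = rng.normal(mu, np.sqrt(sigma2), size=missing.sum())
        completed.append(filled)
    return completed

# Hypothetical data: seven measurements, two of them missing.
y = [2.1, np.nan, 3.4, 2.9, np.nan, 3.8, 3.1]
imputations = impute_normal(y, m=5)
for filled in imputations:
    print(np.round(filled, 2))
```

Drawing the parameters first, rather than plugging in point estimates, is what lets the imputations reflect uncertainty about the model as well as residual noise.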


Parameter Estimates

 Calculate parameter Q from m data sets

 Estimate of Q is just average of m values of Q

 Variance of Q is $T = (1 + m^{-1}) B + U$

where $U$ is the mean within-imputation variance and

$B = \frac{1}{m-1} \sum_{l=1}^{m} (Q_l - Q_{ave})^2$

is the between-imputation variability.

As $m \to \infty$, $T \to B + U$, so the $(1 + m^{-1})$ correction to $B$ only matters for low numbers of imputations.
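These combining rules fit in a few lines. A minimal sketch, with a hypothetical function name pool_estimates and made-up numbers; B is computed with the m − 1 divisor, as in Schafer (1999).

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Combine m complete-data estimates of Q and their variances.

    Returns (Q_ave, T, B, U) following the repeated-imputation rules.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_ave = q.mean()                   # pooled point estimate
    u_bar = u.mean()                   # U: mean within-imputation variance
    b = q.var(ddof=1)                  # B: between-imputation variance
    t = (1.0 + 1.0 / m) * b + u_bar    # T: total variance
    return q_ave, t, b, u_bar

# Hypothetical results from m = 3 imputed data sets.
q_ave, t, b, u_bar = pool_estimates([10.0, 12.0, 11.0], [4.0, 5.0, 3.0])
print(q_ave, t, b, u_bar)
```

For these made-up inputs the estimates average to 11, B is 1, U is 4, and T is 4/3 + 4.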

MI

 Imputation is computationally distinct from analysis

 Problem if assumptions of imputation are not compatible with analysis assumptions

 Loss of power if imputation makes fewer assumptions than analysis



“Superefficient” if imputation is based on more (valid) assumptions than analysis

MI

 Inconsistent if imputation makes invalid assumptions that are not included in analysis

 Ex: interaction terms

 Imputation needs to preserve features of data that will be included in analysis

ABB

 Approximate Bayesian Bootstrap (Rubin, 1987)

 Fancier version of Hot deck imputation
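What makes the ABB "fancier" than a plain hot deck is an extra resampling step. The sketch below treats all cases as one adjustment cell, which is a simplifying assumption; in practice the procedure is usually applied within groups of similar respondents. The function name abb_impute and the data are hypothetical.

```python
import numpy as np

def abb_impute(y, m=5, seed=0):
    """Return m completed copies of y (NaN = missing) using the ABB idea."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = y[~missing]

    completed = []
    for _ in range(m):
        # Bootstrap step: resample the observed values with replacement.
        donors = rng.choice(observed, size=observed.size, replace=True)
        # Hot-deck step: draw each missing value from the resampled donors.
        filled = y.copy()
        filled[missing] = rng.choice(donors, size=missing.sum(), replace=True)
        completed.append(filled)
    return completed

# Hypothetical data with three missing values.
y = [4.0, np.nan, 7.0, 5.0, np.nan, 6.0, np.nan, 5.5]
abb_sets = abb_impute(y, m=5)
for filled in abb_sets:
    print(filled)
```

A plain hot deck would draw directly from the observed values; resampling the donor pool first adds the between-imputation variability that multiple imputation needs.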

Comparison of Methods

 Removing entries with missing data vs. MI

 Imputing once vs. MI

 Number of imputations

 Efficiency is $(1 + \lambda/m)^{-1}$, where $\lambda$ is the rate of missing information

 MI vs. EM
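The efficiency expression listed above explains why a handful of imputations is usually enough. A small sketch that tabulates it for a few values of m, taking λ (the rate of missing information) to be 0.3 purely as an example:

```python
def relative_efficiency(lam, m):
    """Efficiency of m imputations relative to infinitely many: (1 + lam/m)^-1."""
    return 1.0 / (1.0 + lam / m)

# Assumed rate of missing information, for illustration only.
lam = 0.3
efficiency = {m: round(relative_efficiency(lam, m), 3) for m in (3, 5, 10, 20)}
for m, eff in efficiency.items():
    print(f"m = {m:2d}: efficiency = {eff}")
```

With 30% missing information, five imputations already reach about 94% efficiency, and going from 10 to 20 gains only about one percentage point.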

Nonignorable nonresponse

 Ignorable if data are MAR

 MI can be used when there is nonignorable nonresponse

 Missing-data mechanism

Programs

 For S-PLUS: www.stat.psu.edu/~jls/misoftwa.html

 For R:

 Amelia (II) (surveys and time-series data)

 Norm (for multivariate normal data)

 SOLAS (tested by Allison, 2000)

 For windows

References













Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. J. Wiley & Sons, New York.

Schafer, J.L. (1999) Multiple imputation: a primer. Statistical Methods in Medical Research, 8, 3-15.

Barnard, J. and Meng, X. (1999) Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research, 8, 17-36.

Allison, P.D. (2000) Multiple Imputation for Missing Data: A Cautionary Tale. Sociological Methods and Research, 28 (3), 301-309.

http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

http://www.stat.psu.edu/~jls/mifaq.html#em

MI Example (Tu et al, 1993)













AIDS survival time with reporting-delay

(1) Survival-time model

(2) Reporting-lag model using available information

(3) Multiply impute delayed cases using model from step 2

(4) Compute estimates of survival-time model parameters

(5) Combine estimates using repeated-imputation rules

Milwaukee Parental Choice Program (MPCP)

 Effects of school choice on achievement tests (public vs. private schools)

 School vouchers to attend “choice” schools, participating private schools

 Only households with less than 1.75 times poverty line could participate

Milwaukee Parental Choice Program (MPCP)

 Randomized block design

 Outcome variables were scores from ITBS

 Maximum of 4 years observed (1990-1994)

 Higher levels of missingness than in typical medical study

 Pattern in missing data was not monotone
