Statistical Methods for Missing Data

Roberta Harnett

MAR 550

October 30, 2007

Outline

 When do we see missing data?

 Types of missing data

 Traditional approaches

 Deletion

 Substitution

 Modern Approaches

 Maximum likelihood and Bayes

 Software

Missing Data

 Medical studies, nonresponse in surveys or censuses, dropouts in clinical trials, censored data

 Loss of information, power

 Bias in results due to differences between missing and observed data

 Complicated analysis with standard software

Types of missing data

 MCAR

 MAR

 MNAR

MCAR

 Missing Completely at Random

 Probability that x_i is missing doesn’t depend on its value or on the values of other variables

Doesn’t matter if missingness on x_i is associated with “missingness” on other variables

MAR

 Missing at Random

Missingness doesn’t depend on x_i after controlling for other variables

 This is not great, but we can deal with it

MNAR

 Missing Not at Random

 Not MCAR or MAR (anything else)

 BAD!!

 Model missingness
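The three mechanisms can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the variable names, the logistic missingness model, and its coefficients are all assumptions chosen for illustration. Here z is always observed and x is the variable that goes missing.

```python
import math
import random


def simulate_missingness(n=5000, seed=0):
    """Generate (z, x) pairs and one missingness mask per mechanism.

    All numeric settings below are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)              # covariate, always observed
        x = 0.8 * z + rng.gauss(0, 0.6)  # variable subject to missingness
        rows.append((z, x))

    def logistic(t):
        return 1 / (1 + math.exp(-t))

    masks = {
        # MCAR: same missingness probability for every case
        "MCAR": [rng.random() < 0.3 for _ in rows],
        # MAR: probability depends only on the observed covariate z
        "MAR": [rng.random() < logistic(-1 + 1.5 * z) for z, _ in rows],
        # MNAR: probability depends on the value of x itself
        "MNAR": [rng.random() < logistic(-1 + 1.5 * x) for _, x in rows],
    }
    return rows, masks


def observed_mean(rows, mask):
    """Mean of x over the cases where x is not missing."""
    obs = [x for (_, x), miss in zip(rows, mask) if not miss]
    return sum(obs) / len(obs)
```

Under these assumptions the observed mean of x stays close to the full-data mean for MCAR, but is pulled downward for MAR and MNAR because larger values are more likely to be missing. In the MAR case the bias can be corrected using z; in the MNAR case it cannot be corrected without modeling the missingness.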

Traditional Approaches

 Deletion

 List-wise

Unbiased if data are MCAR, but loses power

Alternatives are really replacements for list-wise

 Pairwise (also called “unwise”) deletion

Leads to different sample sizes for different parts of analysis

Can be a disaster
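The difference between the two deletion strategies can be sketched on a toy data set. The data values and the helper names listwise and pairwise_counts are made up for illustration.

```python
from itertools import combinations

# Toy data set; None marks a missing value (values are hypothetical).
data = [
    {"x": 1.0, "y": 2.0, "z": None},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 4.0, "z": 3.0},
    {"x": None, "y": 5.0, "z": 4.0},
    {"x": 5.0, "y": 6.0, "z": 5.0},
    {"x": 6.0, "y": 7.0, "z": None},
]


def listwise(rows):
    """Keep only the rows with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_counts(rows):
    """Number of rows usable for each pair of variables."""
    names = list(rows[0])
    return {
        (a, b): sum(r[a] is not None and r[b] is not None for r in rows)
        for a, b in combinations(names, 2)
    }
```

With this toy data, list-wise deletion keeps 2 of the 6 rows, while pairwise deletion uses 4 rows for the (x, y) pair and 3 rows for each of the other pairs, so each correlation would rest on a different subsample.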

Traditional cont…

 Single Imputation

 Hot deck

Census Bureau

 vs. Cold deck

 Mean substitution

 Regression substitution

 Stochastic regression substitution
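Three of the substitution methods listed above can be sketched for a single predictor x (fully observed) and an outcome y with missing values. This is an illustrative sketch; the function names and the simple linear model are assumptions, not something specified in the slides.

```python
import random


def fit_line(xs, ys):
    """Least-squares line; returns intercept, slope, residual SD."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / (n - 2)) ** 0.5


def impute(xs, ys, method, seed=0):
    """Fill None entries of ys by 'mean', 'regression' or 'stochastic'."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    ox, oy = zip(*obs)
    intercept, slope, resid_sd = fit_line(ox, oy)
    mean_y = sum(oy) / len(oy)
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)                      # observed: keep as is
        elif method == "mean":
            filled.append(mean_y)                 # mean substitution
        elif method == "regression":
            filled.append(intercept + slope * x)  # predicted value
        elif method == "stochastic":
            # predicted value plus a random normal residual
            filled.append(intercept + slope * x + rng.gauss(0, resid_sd))
        else:
            raise ValueError(f"unknown method: {method}")
    return filled
```

Mean and plain regression substitution put every imputed value exactly on the mean or the fitted line, which understates variability. The stochastic version restores some of that spread, but as a single imputation it still treats the imputed values as if they were observed.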

Modern Methods

 Maximum Likelihood

 EM algorithm

Estimate parameters

 Listwise deletion, add some error

(E): Predict missing data

(M): Maximize likelihood. Repeat.

 NORM (http://www.stat.psu.edu/~jls/misoftwa.html)
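A minimal sketch of the EM idea for one simple case, assuming a bivariate normal model in which x is fully observed and only y has missing values. The function name and the setup are illustrative assumptions and do not reproduce what NORM does internally. Starting values come from listwise deletion; the E step fills in expected values for missing y (adding the residual variance to the expected squares, which is the "add some error" part); the M step re-estimates the parameters.

```python
def em_bivariate_normal(xs, ys, n_iter=100):
    """EM estimates of means, variances and covariance; ys may hold None."""
    n = len(xs)
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    k = len(complete)

    # x is fully observed, so its mean and variance never change.
    mean_x = sum(xs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n

    # Starting values from the complete cases (listwise deletion).
    mean_y = sum(y for _, y in complete) / k
    var_y = sum((y - mean_y) ** 2 for _, y in complete) / k
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in complete) / k

    for _ in range(n_iter):
        # Regression of y on x implied by the current parameters.
        slope = cov_xy / var_x
        intercept = mean_y - slope * mean_x
        resid_var = var_y - slope * cov_xy

        # E step: expected sufficient statistics.
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                pred = intercept + slope * x
                sum_y += pred
                sum_yy += pred ** 2 + resid_var  # prediction plus error
                sum_xy += x * pred
            else:
                sum_y += y
                sum_yy += y * y
                sum_xy += x * y

        # M step: maximum likelihood updates from those statistics.
        mean_y = sum_y / n
        var_y = sum_yy / n - mean_y ** 2
        cov_xy = sum_xy / n - mean_x * mean_y

    return {"mean_x": mean_x, "mean_y": mean_y,
            "var_x": var_x, "var_y": var_y, "cov_xy": cov_xy}
```

When y is missing only for large x (a MAR pattern), the complete-case mean of y is biased, while the EM estimate of the mean uses the x values of the incomplete cases to correct it.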

Modern Methods

 Multiple Imputation

 Simple and general – works for any type of analysis

 Validity of method depends on how imputation is carried out

Should reasonably predict missing data, but should also reflect uncertainty in predictions

Using a “sensible” imputation model

“Random Imputation”

 Predict missing values, then add error component drawn randomly from residual distribution of the variable

 Repeat several times to improve error estimates

Multiple Imputation

 Use Bayesian arguments to impute data:

 Parametric model for data

Ignorable missing data

Non-ignorable missing data

 Apply prior for unknown model parameters

 Simulate m independent draws from distribution of $Y_{mis}$ given $Y_{obs}$

 Calculate values explicitly or through MCMC

MI procedure

 Simulate a random draw of unknown parameters from observed-data posterior

 Simulate a random draw of missing values from conditional predictive distribution

 Repeat, obtaining new parameter estimates from “complete” data set until stabilizes

 Do 3-5 times total (Rubin)

 MCMC: data augmentation algorithm of Tanner and Wong (1987)

T = 1 m
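The first two steps of the procedure can be sketched for a deliberately simple case: a normal linear regression of y on a fully observed x, with a standard noninformative prior. These modeling choices and the function name are assumptions made for illustration. With only one incomplete variable the observed-data posterior can be sampled directly, so no iteration is needed; the data augmentation algorithm is for more general missingness patterns.

```python
import random


def draw_imputations(xs, ys, m=5, seed=0):
    """Return m completed copies of ys (None entries are imputed)."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(obs)
    xbar = sum(x for x, _ in obs) / n
    ybar = sum(y for _, y in obs) / n
    sxx = sum((x - xbar) ** 2 for x, _ in obs)
    b_hat = sum((x - xbar) * (y - ybar) for x, y in obs) / sxx
    sse = sum((y - ybar - b_hat * (x - xbar)) ** 2 for x, y in obs)

    completed = []
    for _ in range(m):
        # Step 1: draw parameters from the observed-data posterior.
        # sigma^2 is SSE divided by a chi-square(n - 2) draw, generated
        # here as a gamma variate with shape (n - 2) / 2 and scale 2.
        sigma2 = sse / rng.gammavariate((n - 2) / 2, 2)
        # With x centered, intercept and slope are independent normals.
        a = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        b = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)

        # Step 2: draw missing values from the predictive distribution.
        sd = sigma2 ** 0.5
        completed.append([
            y if y is not None else a + b * (x - xbar) + rng.gauss(0, sd)
            for x, y in zip(xs, ys)
        ])
    return completed
```

Each completed data set is then analyzed with the usual complete-data method. Because the parameters are redrawn for every imputation, the spread of the imputed values across data sets reflects parameter uncertainty as well as residual noise.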

Parameter Estimates

 Calculate parameter Q from m data sets

 Estimate of Q is just average of m values of Q

 Variance of Q is $T = (1 + m^{-1}) B + U$

where $U$ is the mean within-imputation variance and $B$ is the between-imputation variability,

$B = \frac{1}{m-1} \sum_{l=1}^{m} (Q_l - Q_{ave})^2$

As $m \to \infty$, $T \to B + U$, so the $(1 + m^{-1})$ correction to $B$ for a low number of imputations is no longer needed.
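The combining rules can be sketched in a few lines. The function name pool is illustrative, and the between-imputation variance uses the m - 1 divisor, as in Schafer (1999).

```python
def pool(estimates, variances):
    """Combine m point estimates and their within-imputation variances.

    Returns (q_ave, u, b, t): pooled estimate, mean within-imputation
    variance, between-imputation variance, and total variance.
    """
    m = len(estimates)
    q_ave = sum(estimates) / m
    u = sum(variances) / m
    b = sum((q - q_ave) ** 2 for q in estimates) / (m - 1)
    t = (1 + 1 / m) * b + u
    return q_ave, u, b, t
```

For example, five estimates 10, 12, 11, 9, 13, each with within-imputation variance 1.0, give an average of 11, U = 1.0, B = 10/4 = 2.5, and T = 1.2 × 2.5 + 1.0 = 4.0.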

MI

 Imputation is computationally distinct from analysis

 Problem if assumptions of imputation are not compatible with analysis assumptions

 Loss of power if imputation makes fewer assumptions than analysis

“Superefficient” if imputation is based on more (valid) assumptions than analysis

MI

 Inconsistent if imputation makes invalid assumptions that are not included in analysis

 Ex: interaction terms

 Imputation needs to preserve features of data that will be included in analysis

ABB

 Approximate Bayesian Bootstrap (Rubin, 1987)

 Fancier version of Hot deck imputation
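A minimal sketch of the ABB for a single variable, assuming no covariates or imputation classes (in practice the draws are usually made within groups of similar cases). The function name is illustrative. Each imputation first resamples the observed values with replacement to form a donor pool, then draws the missing values with replacement from that pool; the extra resampling step is what separates it from a plain hot deck.

```python
import random


def abb_impute(values, m=5, seed=0):
    """Return m completed copies of values (None entries are imputed)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    completed = []
    for _ in range(m):
        # Bootstrap the observed values to form this imputation's donors.
        donors = rng.choices(observed, k=len(observed))
        # Fill each missing entry with a random draw from the donor pool.
        completed.append([
            v if v is not None else rng.choice(donors) for v in values
        ])
    return completed
```

Every imputed value is an actually observed value, as in a hot deck, but the donor pool changes from one imputation to the next.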

Comparison of Methods

 Removing entries with missing data vs. MI

 Imputing once vs. MI

 Number of imputations

 Efficiency is $(1 + \lambda/m)^{-1}$, where $\lambda$ is the rate of missing information

 MI vs. EM
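The efficiency formula shows why a small number of imputations is often considered enough. A short sketch, with an illustrative function name, where lam stands for λ:

```python
def relative_efficiency(lam, m):
    """Efficiency of m imputations relative to infinitely many,
    for a rate of missing information lam."""
    return 1 / (1 + lam / m)
```

For instance, with λ = 0.5, three imputations give an efficiency of 1 / (1 + 0.5/3), about 0.86, and five give 1 / 1.1, about 0.91, so the gain from adding further imputations shrinks quickly.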

Nonignorable nonresponse

 Ignorable if data are MAR

 MI can be used when there is nonignorable nonresponse

 Missing-data mechanism

Programs

 For S-PLUS: www.stat.psu.edu/~jls/misoftwa.html

 For R:

 Amelia (II) (surveys and time-series data)

 Norm (for multivariate normal data)

 SOLAS (tested by Allison, 2000)

 For Windows

References

Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. J. Wiley & Sons, New York.

Schafer, J.L. (1999) Multiple imputation: a primer. Statistical Methods in Medical Research, 8, 3-15.

Barnard, J. and Meng, X. (1999) Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research, 8, 17-36.

http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

http://www.stat.psu.edu/~jls/mifaq.html#em

Allison, P.D. (2000) Multiple Imputation for Missing Data: A Cautionary Tale. Sociological Methods and Research, 28(3), 301-309.

MI Example (Tu et al, 1993)

AIDS survival time with reporting-delay

(1) Survival-time model

(2) Reporting-lag model using available information

(3) Multiply impute delayed cases using model from step 2

(4) Compute estimates of survival-time model parameters

(5) Combine estimates using repeated-imputation rules

Milwaukee Parental Choice Program (MPCP)

 Effects of school choice on achievement tests (public vs. private schools)

 School vouchers to attend “choice” schools, participating private schools

 Only households with income below 1.75 times the poverty line could participate

Milwaukee Parental Choice Program (MPCP)

 Randomized block design

 Outcome variables were scores from ITBS

 Maximum of 4 years observed (1990-1994)

 Higher levels of missingness than in typical medical study

 Pattern in missing data was not monotone
