power point - Institute for Policy Research

advertisement
Missing Data in
Randomized Control Trials
John W. Graham
The Prevention Research Center
and
Department of Biobehavioral Health
Penn State University
jgraham@psu.edu
IES/NCER Summer Research
Training Institute, August 2, 2010
Sessions in Three Parts


(1) Introduction: Missing Data Theory
(2) Attrition: Bias and Lost Power
After the break ...
 (3) Hands-on with Multiple Imputation


Multiple Imputation with NORM
SPSS Automation Utility (New!)


SPSS Regression
HLM Automation Utility (New!)

2-Level Regression with HLM 6
Recent Papers


Graham,
(2009).
Missing data analysis: making
it work in the real world. Annual Review of Psychology, 60,
549-576.
J. W.,
Graham, J. W., Cumsille, P. E., & Elek-Fisk,
(2003). Methods for handling missing data. In J. A.
E.
Schinka & W. F. Velicer (Eds.). Research Methods in Psychology
(pp. 87_114). Volume 2 of Handbook of Psychology (I. B.
Weiner, Editor-in-Chief). New York: John Wiley & Sons.

Graham,
J. W.
(2010, forthcoming).
Missing Data:
Analysis and Design. New York: Springer.



Chapter 4: Multiple Imputation with Norm 2.03
Chapter 6: Multiple Imputation and Analysis with SPSS 17/18
Chapter 7: Multiple Imputation and Analysis with Multilevel
(Cluster) Data
Recent Papers

Collins,
L. M.,
Schafer,
J. L.,
& Kam,
C. M.
(2001).
A
comparison of inclusive and restrictive strategies in modern
missing data procedures. Psychological Methods, 6, 330-351.


Schafer, J. L., & Graham, J. W. (2002).
Missing data:
our view of the state of the art. Psychological Methods, 7, 147177.
Graham, J. W., Taylor, B. J., Olchowski, A. E., &
Cumsille, P. E. (2006). Planned missing data designs in
psychological research. Psychological Methods, 11, 323-343.
Part 1:
A Brief Introduction to
Analysis with Missing Data
Problem with Missing Data

Analysis procedures were
designed for complete data
...
Solution 1



Design new model-based procedures
Missing Data
+ Parameter Estimation
in One Step
Full Information Maximum Likelihood (FIML)
SEM and Other Latent Variable Programs
(Amos, LISREL, Mplus, Mx, LTA)
Solution 2

Data based procedures


e.g., Multiple Imputation (MI)
Two Steps

Step 1: Deal with the missing data



(e.g., replace missing values with plausible
values
Produce a product
Step 2: Analyze the product as if there
were no missing data
FAQ

Aren't you somehow
with imputation?
...
helping yourself
NO.
Missing data imputation . . .


does NOT give you
something for nothing
DOES let you
make use of all data you have
...
FAQ

Is the imputed value what the person
would have given?
NO.
When we impute a value . .


We do not impute for the sake of the
value itself
We impute to preserve important
characteristics of the whole data set
...
We want . . .

unbiased parameter estimation


Good estimate of variability


e.g., b-weights
e.g., standard errors
best statistical power
Causes of Missingness

Ignorable
MCAR: Missing Completely At Random
 MAR: Missing At Random


Non-Ignorable

MNAR: Missing Not At Random
MCAR
(Missing Completely At Random)



MCAR 1: Cause of missingness
completely random process (like coin flip)
MCAR 2: (essentially MCAR)
uncorrelated with variables of interest

Cause

Example: parents move
No bias if cause omitted
MAR (Missing At Random)


Missingness may be related to
measured variables
But no residual relationship with
unmeasured variables


Example: reading speed
No bias if you control for measured
variables
MNAR (Missing Not At Random)



Even after controlling for measured
variables ...
Residual relationship with unmeasured
variables
Example: drug use reason for absence
MNAR Causes



The recommended methods assume
missingness is MAR
But what if the cause of missingness is
not MAR?
Should these methods be used
when MAR assumptions not met?
YES! These Methods
Work!


Suggested methods work
better than “old” methods
Multiple causes of missingness


Only small part of missingness may be
MNAR
Suggested methods usually work very
well
Methods:
"Old" vs MAR vs MNAR

MAR methods (MI and ML)



are ALWAYS at least as good as,
usually better than "old" methods
(e.g., listwise deletion)
Methods designed to handle MNAR
missingness are
NOT always better than MAR methods
Analysis: Old and New
Old Procedures:
Analyze Complete Cases
(listwise deletion)

may produce bias

you always lose some power

(because you are throwing away data)

reasonable if you lose only 5% of cases

often lose substantial power
Analyze Complete Cases
(listwise deletion)
1
0
1
1
1



1
1
0
1
1
1
1
1
0
1
1
1
1
1
0
very common situation
only 20% (4 of 20) data points missing
but discard 80% of the cases
Other "Old" Procedures

Pairwise deletion


Mean substitution


May be of occasional use for preliminary
analyses
Never use it
Regression-based single imputation

generally not recommended ... except ...
Recommended
Model-Based Procedures


Multiple Group SEM
(Structural Equation Modeling)
Latent Transition Analysis (Collins et al.)

A latent class procedure
Recommended
Model-Based Procedures

Raw Data Maximum Likelihood SEM
aka Full Information Maximum Likelihood (FIML)

Amos (James Arbuckle)

LISREL 8.5+ (Jöreskog & Sörbom)

Mplus (Bengt Muthén)

Mx (Michael Neale)
Amos, Mx,
Mplus, LISREL 8.8

Structural Equation Modeling (SEM) Programs

In Single Analysis ...

Good Estimation

Reasonable standard errors

Windows Graphical Interface
Limitation with
Model-Based Procedures

That particular model must be what you want
Recommended
Data-Based Procedures
EM Algorithm (ML parameter estimation)
 Norm-Cat-Mix, EMcov, SAS, SPSS
Multiple Imputation
 NORM, Cat, Mix, Pan (Joe Schafer)
 SAS Proc MI
 SPSS 17/18 (not quite yet)
 LISREL 8.5+
 Amos
EM Algorithm
Expectation - Maximization
Alternate between

E-step: predict missing data
M-step: estimate parameters

Excellent (ML) parameter estimates

But no standard errors
must use bootstrap
 or multiple imputation

Multiple Imputation

Problem with Single Imputation:
Too Little Variability


Because of Error Variance
Because covariance matrix is only one
estimate
Too Little Error Variance

Imputed value
lies on regression line
Imputed Values on Regression
Line
Restore Error . . .

Add random normal residual
Regression Line only One
Estimate
Covariance Matrix (Regression
Line) only One Estimate



Obtain multiple plausible estimates of
the covariance matrix
ideally draw multiple covariance
matrices from population
Approximate this with
Bootstrap
 Data Augmentation (Norm)
 MCMC (SAS)

Data Augmentation

stochastic version of EM

EM
E (expectation) step: predict missing data
 M (maximization) step: estimate
parameters


Data Augmentation
I (imputation) step: simulate missing data
 P (posterior) step: simulate parameters

Data Augmentation

Parameters from consecutive steps ...
too related
 i.e., not enough variability


after 50 or 100 steps of DA ...
covariance matrices are like
random draws from the population
Multiple Imputation Allows:

Unbiased Estimation

Good standard errors
provided number of imputations (m) is
large enough
 too few imputations  reduced power with
small effect sizes

Power Falloff
FMI = .50, ρrho = .10
Percent Power Falloff
14
12
10
8
6
4
2
0
100 85
From Graham,
J.W.,
Olchowski,
A.E.,
70 55 40 25
m Imputations
& Gilreath,
T.D.
(2007).
needed? Some practical clarifications of multiple imputation theory.
10
How many imputations are really
Prevention Science, 8, 206-213.
Download