Missing Data Estimation in Longitudinal Research: It’s not Cheating! It’s Essential!

Todd D. Little
University of Kansas
Director, Quantitative Training Program
Director, Center for Research Methods and Data Analysis
Director, Undergraduate Social and Behavioral Sciences Methodology Minor
Member, Developmental Psychology Training Program
crmda.KU.edu
Workshop presented 03-29-2011 @
Society for Research in Child Development
Road Map
• Learn about the different types of missing data
• Learn about ways in which the missing data process can be recovered
• Understand why imputing missing data is not cheating
• Learn why NOT imputing missing data is more likely to lead to errors in generalization!
• Learn about intentionally missing designs
• Introduce a simple method for significance testing
• Discuss imputation with large longitudinal datasets
Key Considerations
• Recoverability
  • Is it possible to recover what the sufficient statistics would have been if there were no missing data? (sufficient statistics = means, variances, and covariances)
  • Is it possible to recover what the parameter estimates of a model would have been if there were no missing data?
• Bias
  • Are the sufficient statistics/parameter estimates systematically different from what they would have been had there not been any missing data?
• Power
  • Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data?
Types of Missing Data
• Missing Completely at Random (MCAR)
  • No association with unobserved variables (selective process) and no association with observed variables
• Missing at Random (MAR)
  • No association with unobserved variables, but may be related to observed variables
  • Random in the statistical sense of predictable
• Non-random (Selective) Missing (NMAR)
  • Some association with unobserved variables and possibly with observed variables
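The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration under assumed values: the variable names, the 0.7 slope, the 30% MCAR rate, and the 0.5 cutoffs are all hypothetical and not taken from the workshop. It compares the mean of y among observed cases with the complete-data mean when missingness is unrelated to anything (MCAR), depends on an observed x (MAR), or depends on y itself (NMAR).

```python
import random
import statistics


def simulate_missingness(n=20000, seed=1):
    """Return the complete-data mean of y and the observed-case means
    of y under three hypothetical deletion rules."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]           # always observed
    y = [0.7 * xi + rng.gauss(0, 1) for xi in x]      # subject to missingness

    # MCAR: every y has the same 30% chance of being missing.
    mcar = [yi for yi in y if rng.random() > 0.30]
    # MAR: y is missing whenever the observed x is above a cutoff.
    mar = [yi for xi, yi in zip(x, y) if xi <= 0.5]
    # NMAR: y is missing whenever y itself is above a cutoff.
    nmar = [yi for yi in y if yi <= 0.5]

    return {
        "complete": statistics.fmean(y),
        "mcar": statistics.fmean(mcar),
        "mar": statistics.fmean(mar),
        "nmar": statistics.fmean(nmar),
    }


means = simulate_missingness()
for label, value in means.items():
    print(f"{label:>8}: mean of observed y = {value:6.3f}")
```

In this setup the observed-case mean stays close to the complete-data mean only under MCAR. Under MAR the observed-case mean is biased, but the cause of missingness (x) is in the data, which is what makes recovery possible; under NMAR the cause is the unobserved value itself.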
Effects of imputing missing data
                                                      | No Association with Observed Variable(s)                  | An Association with Observed Variable(s)
No Association with Unobserved/Unmeasured Variable(s) | MCAR: fully recoverable; fully unbiased                   | MAR: partly to fully recoverable; less biased to unbiased
An Association with Unobserved/Unmeasured Variable(s) | NMAR: unrecoverable; biased (same bias as not estimating) | MAR/NMAR: partly recoverable; same to unbiased
Effects of imputing missing data
                                                      | No Association with ANY Observed Variable                 | An Association with Analyzed Variables                      | An Association with Unanalyzed Variables
No Association with Unobserved/Unmeasured Variable(s) | MCAR: fully recoverable; fully unbiased                   | MAR: partly to fully recoverable; less biased to unbiased   | MAR: partly to fully recoverable; less biased to unbiased
An Association with Unobserved/Unmeasured Variable(s) | NMAR: unrecoverable; biased (same bias as not estimating) | MAR/NMAR: partly to fully recoverable; same to unbiased     | MAR/NMAR: partly to fully recoverable; same to unbiased

Statistical Power: Will always be greater when missing data is imputed!
Modern Missing Data Analysis
MI or FIML
• In 1978, Rubin proposed Multiple Imputation (MI)
  • An approach especially well suited for use with large public-use databases.
  • First suggested in 1978 and developed more fully in 1987.
  • MI primarily uses the Expectation Maximization (EM) algorithm and/or the Markov Chain Monte Carlo (MCMC) algorithm.
• Beginning in the 1980s, likelihood approaches developed.
  • Multiple group SEM
  • Full Information Maximum Likelihood (FIML).
    • An approach well suited to more circumscribed models
Full Information Maximum Likelihood
• FIML works with the casewise log-likelihood of the available data, evaluating the model-implied mean vector and covariance matrix separately for every observation.
  • Since each observation's mean vector and covariance matrix are based on its own unique response pattern, there is no need to fill in the missing data.
• The individual likelihood functions are then summed to create a combined likelihood function for the whole data frame.
  • Individual likelihood functions with greater amounts of missing data are given less weight in the final combined likelihood function than those with a more complete response pattern, thus controlling for the loss of information.
• Formally, the function that FIML minimizes (which is equivalent to maximizing the combined likelihood) is

$$-2\log L_{\mathrm{com}} = \sum_{i=1}^{N}\left[K_i + \log\lvert\Sigma_i\rvert + (y_i-\mu_i)'\,\Sigma_i^{-1}\,(y_i-\mu_i)\right], \qquad \text{where } K_i = p_i\log(2\pi)$$

and $p_i$ is the number of variables observed for case $i$.
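To show how each case contributes through its own response pattern, here is a minimal sketch of the casewise sum for the special case of two variables. It is not how any SEM package implements FIML: the function name, the two-variable restriction, and the use of `None` for a missing value are all assumptions made to keep the example dependency-free. It only evaluates the function for a given mean vector and covariance matrix; an estimator would then search for the values that minimize it.

```python
import math


def neg2_loglik(data, mu, sigma):
    """Sum casewise -2 log-likelihood contributions for two variables.

    data  : list of [y1, y2] rows, with None marking a missing value
    mu    : model-implied means [mu1, mu2]
    sigma : model-implied symmetric 2x2 covariance matrix (nested lists)
    """
    total = 0.0
    for row in data:
        obs = [j for j, value in enumerate(row) if value is not None]
        if not obs:
            continue  # a case with nothing observed contributes nothing
        k_i = len(obs) * math.log(2 * math.pi)  # K_i = p_i * log(2*pi)
        if len(obs) == 1:
            # Only one variable observed: use its own mean and variance.
            j = obs[0]
            dev = row[j] - mu[j]
            log_det = math.log(sigma[j][j])
            quad = dev * dev / sigma[j][j]
        else:
            # Both observed: full 2x2 determinant and quadratic form.
            d1, d2 = row[0] - mu[0], row[1] - mu[1]
            det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
            log_det = math.log(det)
            quad = (sigma[1][1] * d1 * d1
                    - 2 * sigma[0][1] * d1 * d2
                    + sigma[0][0] * d2 * d2) / det
        total += k_i + log_det + quad
    return total


sample = [[1.0, 2.0], [0.5, None], [None, 1.5]]
print(neg2_loglik(sample, mu=[0.5, 1.5], sigma=[[1.0, 0.4], [0.4, 1.0]]))
```

Cases with a missing value simply use the sub-vector and sub-matrix for what they did observe, so nothing is filled in and nothing is thrown away.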
Multiple Imputation
• Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences.
  • By filling in m separate estimates for each missing value we can account for the uncertainty in that datum's true population value.
• Data sets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong's (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II.
  • SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., stationary distribution of EM estimates).
• After m data sets have been created and the analysis model has been run on each separately, the resulting estimates are commonly combined with Rubin's Rules (Rubin, 1987).
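As a rough illustration of the combining step, the sketch below applies Rubin's Rules to one parameter: the pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the five example estimates are made up for illustration; they are not results from the workshop.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed-data analyses.

    estimates : the m point estimates
    variances : the m squared standard errors
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between      # total variance
    return {"estimate": pooled, "within": within, "between": between,
            "total": total, "se": math.sqrt(total)}


# Hypothetical estimates of one regression weight from m = 5 imputations.
pooled = pool_rubin([0.50, 0.54, 0.46, 0.52, 0.48], [0.010] * 5)
print(pooled)
```

The pooled standard error is larger than any single imputed-data standard error whenever the estimates disagree across imputations, which is how the uncertainty about the missing values enters the inference.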
Fraction Missing
• Fraction Missing is a measure of efficiency lost due to missing data. It is the extent to which parameter estimates have greater standard errors than they would have had all data been observed.
• It is a ratio of variances:

$$\gamma_j = 1 - \frac{\text{estimated parameter variance in the complete data set}}{\text{total parameter variance taking into account missingness}}$$

• Estimated parameter variance in the complete data set:

$$\hat{s}_j^{2} = \frac{1}{M}\sum_{m=1}^{M}\hat{s}_m^{2}$$

• Between-imputation variance:

$$\hat{B}_j = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\theta}_m - \hat{\theta}_{MI,M}\right)^{2}$$

  where $\hat{\theta}_m$ is the estimate of parameter $j$ in imputed data set $m$ and $\hat{\theta}_{MI,M}$ is its average across the $M$ imputations.
• Fraction of Missing Information (asymptotic formula):

$$\hat{\gamma}_j = 1 - \frac{\hat{s}_j^{2}}{\hat{s}_j^{2} + \hat{B}_j}$$

• Varies by parameter in the model
• Is typically smaller for MCAR than MAR data
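A short sketch of the asymptotic fraction-missing formula for a single parameter follows. The function name and the example numbers are hypothetical; the calculation is the ratio described on the slide, using the average within-imputation variance and the between-imputation variance.

```python
import statistics


def fraction_missing(estimates, variances):
    """Asymptotic fraction of missing information for one parameter.

    estimates : point estimates from the M imputed data sets
    variances : squared standard errors from the M imputed data sets
    """
    s2 = statistics.fmean(variances)        # average within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance (divides by M - 1)
    return 1 - s2 / (s2 + b)


# Hypothetical results for one parameter across M = 5 imputations.
gamma = fraction_missing([0.50, 0.54, 0.46, 0.52, 0.48], [0.010] * 5)
print(f"fraction missing = {gamma:.3f}")
```

When the imputations agree perfectly the between-imputation variance is zero and the fraction is zero; the more the estimates move from one imputation to the next, the closer the fraction gets to one.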
Estimate Missing Data With SAS
Obs  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
  1     65     95     95    100     23     25     25     27
  2     10     10     40     25     25     27     28     27
  3     95    100    100    100     27     29     29     28
  4     90    100    100    100     30     30     27     29
  5     30     80     90    100     23     29     29     30
  6     40     50      .      .     28     27      3      3
  7     40     70    100     95     29     29     30     30
  8     95    100    100    100     28     30     29     30
  9     50     80     75     85     26     29     27     25
 10     55    100    100    100     30     30     30     30
 11     50    100    100    100     30     27     30     24
 12     70     95    100    100     28     28     28     29
 13    100    100    100    100     30     30     30     30
 14     75     90    100    100     30     30     29     30
 15      0      5     10      .      3      3      3      .
PROC MI
PROC MI data=sample out=outmi
        seed=37851 nimpute=100;           /* request 100 imputed datasets */
   EM maxiter=1000;                       /* EM estimates, up to 1000 iterations */
   MCMC initial=em(maxiter=1000);         /* start the MCMC chain from EM estimates */
   VAR BADL0 BADL1 BADL3 BADL6
       MMSE0 MMSE1 MMSE3 MMSE6;
RUN;

• out=
  • Designates output file for imputed data
• nimpute=
  • # of imputed datasets
  • Default is 5
• Var
  • Variables to use in imputation
PROC MI output: Imputed dataset
Obs  _Imputation_  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
  1             1     65     95     95    100     23     25     25     27
  2             1     10     10     40     25     25     27     28     27
  3             1     95    100    100    100     27     29     29     28
  4             1     90    100    100    100     30     30     27     29
  5             1     30     80     90    100     23     29     29     30
  6             1     40     50     21     12     28     27      3      3
  7             1     40     70    100     95     29     29     30     30
  8             1     95    100    100    100     28     30     29     30
  9             1     50     80     75     85     26     29     27     25
 10             1     55    100    100    100     30     30     30     30
 11             1     50    100    100    100     30     27     30     24
 12             1     70     95    100    100     28     28     28     29
 13             1    100    100    100    100     30     30     30     30
 14             1     75     90    100    100     30     30     29     30
 15             1      0      5     10      8      3      3      3      2
What to Say to Reviewers:
• I pity the fool who does not impute
  – Mr. T
• If you compute you must impute
  – Johnnie Cochran
• Go forth and impute with impunity
  – Todd Little
• If math is God's poetry, then statistics are God's elegantly reasoned prose
  – Bill Bukowski
Missing Data and Estimation:
Missingness by Design
• Assess all persons, but not all variables at each time of measurement
  • McArdle, Graham
  • Have core battery for all participants, but divide sample into groups and each group has additional measures
• Control entry into study, to estimate and control for retesting effects
  • Randomly assign participants to their entry into a longitudinal study and to the occasions of assessment
  • Likely to be key in providing unbiased estimates of growth or change
3-Form Intentionally Missing Design
Form | Common Variables | Variable Set A | Variable Set B | Variable Set C
1    | ¼ of Variables   | ¼ of Variables | ¼ of Variables | None
2    | ¼ of Variables   | ¼ of Variables | None           | ¼ of Variables
3    | ¼ of Variables   | None           | ¼ of Variables | ¼ of Variables
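The 3-form layout can be sketched as a small assignment routine. The set labels ("X" for the common variables), the seed, and the function names are hypothetical choices for illustration. The sketch randomly assigns participants to the three forms in near-equal numbers and checks the property that makes the design work: every pair of variable sets is administered together on at least one form, so every covariance can be estimated.

```python
import random
from itertools import combinations

# Which variable sets each form administers ("X" = common variables).
FORMS = {
    1: ("X", "A", "B"),
    2: ("X", "A", "C"),
    3: ("X", "B", "C"),
}


def assign_forms(participant_ids, seed=2011):
    """Randomly assign participants to forms 1-3 in near-equal numbers."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    # Dealing the shuffled ids out in turn keeps group sizes balanced.
    return {pid: (i % 3) + 1 for i, pid in enumerate(ids)}


def every_pair_covered(forms=FORMS):
    """True if each pair of variable sets appears together on some form."""
    sets = sorted({s for blocks in forms.values() for s in blocks})
    return all(
        any(a in blocks and b in blocks for blocks in forms.values())
        for a, b in combinations(sets, 2)
    )


assignment = assign_forms(range(1, 10))
for pid in sorted(assignment):
    print(pid, assignment[pid], FORMS[assignment[pid]])
print("every pair of sets co-occurs:", every_pair_covered())
```

Because assignment to forms is random, the resulting missingness is MCAR by design.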
3-Form Protocol II
Form | Common Variables | Variable Set A   | Variable Set B   | Variable Set C
1    | Marker Variables | 1/3 of Variables | 1/3 of Variables | None
2    | Marker Variables | 1/3 of Variables | None             | 1/3 of Variables
3    | Marker Variables | None             | 1/3 of Variables | 1/3 of Variables
Expansions of 3-Form Design
(Graham, Taylor, Olchowski, & Cumsille, 2006)
2-Method Planned Missing Design
Controlled Enrollment
Growth-Curve Design
Group | Time 1  | Time 2  | Time 3  | Time 4  | Time 5
1     | x       | x       | x       | x       | x
2     | x       | x       | x       | x       | missing
3     | x       | x       | x       | missing | x
4     | x       | x       | missing | x       | x
5     | x       | missing | x       | x       | x
6     | missing | x       | x       | x       | x
Growth Curve Design II
Group | Time 1  | Time 2  | Time 3  | Time 4  | Time 5
1     | x       | x       | x       | x       | x
2     | x       | x       | x       | missing | missing
3     | x       | x       | missing | x       | missing
4     | x       | missing | x       | x       | missing
5     | missing | x       | x       | x       | missing
6     | x       | x       | missing | missing | x
7     | x       | missing | x       | missing | x
8     | missing | x       | x       | missing | x
9     | x       | missing | missing | x       | x
10    | missing | x       | missing | x       | x
11    | missing | missing | x       | x       | x
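Both growth-curve layouts follow one rule: a complete-data group plus one group for every way of skipping a fixed number of the five occasions (one skipped occasion gives 6 groups, two skipped occasions give 11). A minimal sketch of that rule, with a hypothetical function name, is shown below; the order of the generated groups differs from the order on the slides.

```python
from itertools import combinations


def planned_missing_patterns(n_occasions=5, n_skipped=2):
    """Complete pattern plus every pattern that skips n_skipped occasions."""
    complete = tuple("x" for _ in range(n_occasions))
    patterns = [complete]
    for skipped in combinations(range(n_occasions), n_skipped):
        patterns.append(tuple(
            "missing" if t in skipped else "x" for t in range(n_occasions)
        ))
    return patterns


design_one = planned_missing_patterns(5, 1)
design_two = planned_missing_patterns(5, 2)
for group, pattern in enumerate(design_two, start=1):
    print(group, " ".join(pattern))
```

Enumerating the patterns this way also makes it easy to see how the number of groups grows when more occasions are added or more are skipped.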
Efficiency of Planned Missing Designs
Combined Elements
The Sequential Designs
Transforming to Accelerated Longitudinal
Transforming to Episodic Time
Simple Significance Testing with MI
• Generate multiply imputed datasets (m).
• Calculate a single covariance matrix on all N*m observations.
  • By combining information from all m datasets, this matrix should represent the best estimate of the population associations.
• Run the analysis model on this single covariance matrix and use the resulting estimates as the basis for inference and hypothesis testing.
  • The fit function from this approach should be the best basis for making inferences about model fit and significance.
• Using a Monte Carlo simulation, we test the hypothesis that this approach is reasonable.
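The stacking step can be sketched in a few lines. This is only an illustration with a hypothetical function name and toy data: it stacks the m imputed datasets and computes one covariance matrix over all N*m rows. The slide does not say which sample size to give the SEM software afterwards; reporting the original N rather than N*m is assumed here, since stacking does not add independent observations.

```python
def stacked_covariance(imputed_datasets):
    """Covariance matrix computed over all N*m stacked rows.

    imputed_datasets : list of m datasets, each a list of N rows,
                       each row a list of p numeric values
    """
    rows = [row for dataset in imputed_datasets for row in dataset]
    n, p = len(rows), len(rows[0])          # n is N*m here
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        for j in range(p):
            for k in range(p):
                cov[j][k] += (row[j] - means[j]) * (row[k] - means[k])
    return [[value / (n - 1) for value in line] for line in cov]


# Two toy "imputed" datasets that differ only in one filled-in value.
imputation_1 = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
imputation_2 = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0]]
super_matrix = stacked_covariance([imputation_1, imputation_2])
print(super_matrix)
```

The analysis model is then fit once to this single matrix, instead of m times followed by pooling.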
Population Model
[Figure: two-factor population model. Note: These are fully standardized parameter estimates.]
Factor variances fixed at 1*; correlation between Factor A and Factor B = .52
Factor A loadings (A1–A10): .75 .68 .76 .70 .72 .67 .69 .79 .72 .75
Factor A residual variances (A1–A10): .44 .53 .42 .51 .48 .55 .52 .38 .49 .43
Factor B loadings (B1–B10): .81 .72 .74 .70 .71 .79 .69 .81 .73 .78
Factor B residual variances (B1–B10): .35 .49 .45 .52 .50 .38 .53 .35 .47 .39
RMSEA = .047, CFI = .967, TLI = .962, SRMR = .021
Change in Chi-squared Test
Correlation Matrix Technique
Change in Chi Squared Across Replications

Condition   | PRB
10% Missing | -2.95%
30% Missing | 4.39%
50% Missing | 6.08%

[Figure: Change in Chi Squared (y-axis, 0 to 90) by Condition (Population, 10% Missing, 30% Missing, 50% Missing)]
www.Quant.KU.edu
Imputing with Large Datasets
• Create a BLOCK of variables that contains as much information about the dataset as possible and has no missing data
  • Reduce the data by creating scale averages
  • Reduce the data by estimating a set of principal components
  • Use both approaches
  • Impute missingness in the block.
  • Create product terms by key potential moderators and powered terms.
  • Reduce the data again
  • This block can be the auxiliary variables block in FIML estimation
• In a sequential set of steps impute the item-level data in groups of similar types of items
  • Use the BLOCK of variables in each set of multiple imputations.
  • Select the item-level data based on similarity of constructs.
  • Use as many items as possible.
  • Save, sort, and merge the imputed datasets.
  • Use the super matrix approach to analyze.
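The first reduction step, creating scale averages, might look like the sketch below. The function name, the `min_items` rule, and the toy responses are assumptions for illustration, not part of the workshop. Each scale score is the mean of whatever items a participant answered, so far fewer scale scores than items end up missing and the later imputation of the block has less to fill in.

```python
def scale_average(item_scores, min_items=1):
    """Mean of the observed items, or None if too few were answered.

    item_scores : one participant's item responses, None = missing
    min_items   : hypothetical minimum number of answered items required
    """
    observed = [score for score in item_scores if score is not None]
    if len(observed) < min_items:
        return None
    return sum(observed) / len(observed)


# Toy item-level data: three participants, four items on one scale.
items = [
    [4, 5, None, 3],
    [None, None, None, 2],
    [None, None, None, None],
]
block = [scale_average(row) for row in items]
print(block)
```

Only the third participant, who answered no items at all, still has a missing scale score to be imputed within the block.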
Thanks for your attention!
Questions?
Update
Dr. Todd Little is currently at
Texas Tech University
Director, Institute for Measurement, Methodology, Analysis and Policy (IMMAP)
Director, “Stats Camp”
Professor, Educational Psychology and Leadership
Email: yhat@ttu.edu
IMMAP (immap.educ.ttu.edu)
Stats Camp (Statscamp.org)