On the Merits of Planning and Planning for Missing Data*

*You’re a fool for not using planned
missing data design
Todd D. Little
University of Kansas
Director, Quantitative Training Program
Director, Center for Research Methods and Data Analysis
Director, Undergraduate Social and Behavioral Sciences Methodology Minor
Member, Developmental Psychology Training Program
crmda.KU.edu
Workshop presented 05-21-2012 @
Max Planck Institute for Human Development in Berlin, Germany
Very Special Thanks to: Mijke Rhemtulla & Wei Wu
Road Map
• Learn about the different types of missing data
• Learn about ways in which the missing data process
can be recovered
• Understand why imputing missing data is not cheating
• Learn why NOT imputing missing data is more likely to lead
to errors in generalization!
• Learn about intentionally missing designs
• Discuss imputation with large longitudinal datasets
• Introduce a simple method for significance testing
Key Considerations
• Recoverability
  • Is it possible to recover what the sufficient statistics would have been if there were no missing data?
    (sufficient statistics = means, variances, and covariances)
  • Is it possible to recover what the parameter estimates of a model would have been if there were no missing data?
• Bias
  • Are the sufficient statistics/parameter estimates systematically different from what they would have been had there not been any missing data?
• Power
  • Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data?

Types of Missing Data
• Missing Completely at Random (MCAR)
  • No association with unobserved variables (selective process) and no association with observed variables
• Missing at Random (MAR)
  • No association with unobserved variables, but may be related to observed variables
  • Random in the statistical sense of predictable
• Non-random (Selective) Missing (MNAR, also written NMAR)
  • Some association with unobserved variables and maybe with observed variables

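The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed values: the variable names, the 0.7 slope, and the missingness probabilities are hypothetical and do not come from the workshop. It compares the true mean of y, the complete-case mean, and a regression-adjusted mean that uses the always-observed x.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=1):
    """Return (true mean of y, complete-case mean, regression-adjusted mean)."""
    rng = random.Random(seed)
    xs, ys, obs_x, obs_y = [], [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)                    # always observed
        y = 0.7 * x + rng.gauss(0, 0.714)      # may be missing
        if mechanism == "MCAR":
            p_miss = 0.4                       # unrelated to x and y
        elif mechanism == "MAR":
            p_miss = 0.7 if x > 0 else 0.1     # depends on the observed x only
        elif mechanism == "MNAR":
            p_miss = 0.7 if y > 0 else 0.1     # depends on the unobserved y itself
        else:
            raise ValueError(mechanism)
        xs.append(x)
        ys.append(y)
        if rng.random() >= p_miss:
            obs_x.append(x)
            obs_y.append(y)
    # Regress y on x among complete cases, then predict at the full-sample mean of x.
    mx, my = statistics.fmean(obs_x), statistics.fmean(obs_y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(obs_x, obs_y))
    sxx = sum((a - mx) ** 2 for a in obs_x)
    slope = sxy / sxx
    intercept = my - slope * mx
    adjusted = intercept + slope * statistics.fmean(xs)
    return statistics.fmean(ys), my, adjusted

results = {m: simulate(m) for m in ("MCAR", "MAR", "MNAR")}
for name, (true_mean, cc_mean, adj_mean) in results.items():
    print(f"{name}: true={true_mean:.3f} complete-case={cc_mean:.3f} adjusted={adj_mean:.3f}")
```

Under these assumptions the complete-case mean should be close to the truth only for MCAR, the adjusted mean should also recover it under MAR, and neither should recover it under MNAR.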
Effects of imputing missing data

                                                      | No Association with Observed Variable(s)                  | An Association with Observed Variable(s)
No Association with Unobserved/Unmeasured Variable(s) | MCAR: fully recoverable; fully unbiased                   | MAR: partly to fully recoverable; less biased to unbiased
An Association with Unobserved/Unmeasured Variable(s) | NMAR: unrecoverable; biased (same bias as not estimating) | MAR/NMAR: partly recoverable; same to unbiased

Effects of imputing missing data

                                                      | No Association with ANY Observed Variable                 | An Association with Analyzed Variables                     | An Association with Unanalyzed Variables
No Association with Unobserved/Unmeasured Variable(s) | MCAR: fully recoverable; fully unbiased                   | MAR: partly to fully recoverable; less biased to unbiased  | MAR: partly to fully recoverable; less biased to unbiased
An Association with Unobserved/Unmeasured Variable(s) | NMAR: unrecoverable; biased (same bias as not estimating) | MAR/NMAR: partly to fully recoverable; same to unbiased    | MAR/NMAR: partly to fully recoverable; same to unbiased

Statistical Power: Will always be greater when missing data is imputed!

Bad Missing Data Corrections
• List-wise Deletion
  • If a single data point is missing, delete subject
  • N is uniform but small
  • Variances biased, means biased
  • Acceptable only if power is not an issue and the incomplete data is MCAR
• Pair-wise Deletion
  • If a data point is missing, delete paired data points when calculating the correlation
  • N varies per correlation
  • Variances biased, means biased
  • Matrix often non-positive definite
  • Acceptable only if power is not an issue and the incomplete data is MCAR

Bad Imputation Techniques
• Sample-wise Mean Substitution
  • Use the mean of the sample for any missing value of a given individual
  • Variances reduced
  • Correlations biased
• Subject-wise Mean Substitution
  • Use the mean score of other items for a given missing value
  • Depends on the homogeneity of the items used
  • Is like regression imputation with regression weights fixed at 1.0

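Why sample-wise mean substitution reduces variances and biases correlations can be seen in a short simulation. This is a sketch with made-up values (the 0.6 slope, 40% MCAR missingness, and all names are assumptions for illustration), not an analysis from the workshop.

```python
import random
import statistics

rng = random.Random(7)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + rng.gauss(0, 0.8) for xi in x]

# Make 40% of y missing completely at random, then fill with the observed mean.
missing = [rng.random() < 0.4 for _ in range(n)]
mean_obs = statistics.fmean(yi for yi, m in zip(y, missing) if not m)
y_filled = [mean_obs if m else yi for yi, m in zip(y, missing)]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5

var_true, var_filled = statistics.variance(y), statistics.variance(y_filled)
r_true, r_filled = corr(x, y), corr(x, y_filled)
print(f"variance: complete={var_true:.2f} mean-substituted={var_filled:.2f}")
print(f"correlation: complete={r_true:.2f} mean-substituted={r_filled:.2f}")
```

Because every filled-in value sits exactly at the mean, the filled variable has less spread and a weaker relation to x, even though the data here are MCAR.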
Questionable Imputation Techniques
• Regression Imputation – Focal Item Pool
  • Regress the variable with missing data onto other items selected for a given analysis
  • Variances reduced
  • Assumes MCAR or MAR
• Regression Imputation – Full Item Pool
  • Variances reduced
  • Attempts to account for NMAR inasmuch as items in the pool correlate with the unobserved variables responsible for the missingness

Modern Missing Data Analysis
MI or FIML
• In 1978, Rubin proposed Multiple Imputation (MI)
  • An approach especially well suited for use with large public-use databases.
  • First suggested in 1978 and developed more fully in 1987.
  • MI primarily uses the Expectation Maximization (EM) algorithm and/or the Markov Chain Monte Carlo (MCMC) algorithm.
• Beginning in the 1980s, likelihood approaches developed.
  • Multiple group SEM
  • Full Information Maximum Likelihood (FIML).
    • An approach well suited to more circumscribed models

Full Information Maximum Likelihood
• FIML works with the casewise -2 log-likelihood of the available data, computing an individual mean vector and covariance matrix for every observation.
• Each individual likelihood function is then summed to create a combined likelihood function for the whole data frame.
• Since each observation’s mean vector and covariance matrix is based on its own unique response pattern, there is no need to fill in the missing data.
• Individual likelihood functions with greater amounts of missing data are given less weight in the final combined likelihood function than those with a more complete response pattern, thus controlling for the loss of information.
• Formally, the function that FIML minimizes (which is equivalent to maximizing the likelihood) is
  $$-2\log L_{\mathrm{com}} = \sum_{i=1}^{N}\left[K_i + \log|\Sigma_i| + (y_i-\mu_i)^{\top}\Sigma_i^{-1}(y_i-\mu_i)\right],$$
  where $K_i = p_i\log(2\pi)$ and $p_i$ is the number of variables observed for case $i$.

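The casewise function can be sketched for the two-variable case. This is an illustrative sketch only: the function name `neg2_loglik` and the use of `None` for a missing value are assumptions, and a real FIML routine would also search over the mean vector and covariance matrix to minimize this quantity.

```python
import math

def neg2_loglik(data, mu, sigma):
    """Casewise -2 log-likelihood for bivariate-normal data.

    data  : list of (y1, y2) tuples, with None marking a missing value
    mu    : (mu1, mu2)
    sigma : ((s11, s12), (s12, s22))
    Each case uses only the part of mu and sigma matching what it observed.
    """
    total = 0.0
    for row in data:
        idx = [j for j, v in enumerate(row) if v is not None]
        if not idx:
            continue  # a case with nothing observed carries no information
        p_i = len(idx)
        k_i = p_i * math.log(2 * math.pi)
        dev = [row[j] - mu[j] for j in idx]
        if p_i == 1:
            var = sigma[idx[0]][idx[0]]
            log_det = math.log(var)
            quad = dev[0] ** 2 / var
        else:
            s11, s12, s22 = sigma[0][0], sigma[0][1], sigma[1][1]
            det = s11 * s22 - s12 ** 2
            log_det = math.log(det)
            # Quadratic form with the closed-form inverse of a 2 x 2 matrix.
            quad = (s22 * dev[0] ** 2 - 2 * s12 * dev[0] * dev[1] + s11 * dev[1] ** 2) / det
        total += k_i + log_det + quad
    return total

example = [(1.0, 0.5), (0.2, None), (None, -1.3)]
print(neg2_loglik(example, (0.0, 0.0), ((1.0, 0.3), (0.3, 1.0))))
```

Cases with fewer observed variables add fewer terms to the sum, which is how incomplete cases end up carrying less weight.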
Multiple Imputation
• Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences.
• Data sets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong’s (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II.
• By filling in m separate estimates for each missing value we can account for the uncertainty in that datum’s true population value.
• SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., stationary distribution of EM estimates).
• After m data sets have been created and the analysis model has been run on each separately, the resulting estimates are commonly combined with Rubin’s Rules (Rubin, 1987).

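The combining step can be sketched as follows. The function name `pool_rubin` and the example numbers are hypothetical; the arithmetic is the standard form of Rubin's Rules, in which the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets.

    estimates : the m point estimates
    variances : the m squared standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)  # sample variance, divides by m - 1
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical regression slopes and squared SEs from m = 5 imputed data sets.
est, se = pool_rubin([0.42, 0.47, 0.39, 0.45, 0.44],
                     [0.010, 0.011, 0.009, 0.010, 0.012])
print(f"pooled estimate = {est:.3f}, pooled SE = {se:.3f}")
```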
Good Data Imputation Techniques
• (But only if variables related to missingness are included in analysis, or missingness is MCAR)
• EM Imputation
  • Imputes the missing data values a number of times starting with the E step
  • The E(stimate)-step is a stochastic regression-based imputation
  • The M(aximize)-step is to calculate a complete covariance matrix based on the estimated values.
  • The E-step is repeated for each variable but the regression is now on the covariance matrix estimated from the first E-step.
  • The M-step is repeated until the imputed estimates don’t differ from one iteration to the other
• MCMC imputation is a more flexible (but computer-intensive) algorithm.

Good Data Imputation Techniques
• (But only if variables related to missingness are included in analysis, or missingness is MCAR)
• Multiple (EM or MCMC) Imputation
  • Impute N (say 20) datasets
  • Each data set is based on a resampling plan of the original sample
  • Mimics a random selection of another sample from the population
  • Run your analyses N times
  • Calculate the mean and standard deviation of the N analyses

Fraction Missing
• Fraction Missing is a measure of efficiency lost due to missing data. It is the extent to which parameter estimates have greater standard errors than they would have had all data been observed.
• It is a ratio of variances:
  $$\gamma_j = 1 - \frac{\text{estimated parameter variance in the complete data set}}{\text{total parameter variance taking into account missingness}}$$
• Estimated parameter variance in the complete data set:
  $$\hat{s}_j^2 = \frac{1}{M}\sum_{m=1}^{M}\hat{s}_m^2$$
• Between-imputation variance:
  $$\hat{B}_j = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\theta}_m - \hat{\theta}_{MI,M}\right)^2$$
  where $\hat{\theta}_m$ is the estimate from imputed data set $m$ and $\hat{\theta}_{MI,M}$ is the average across the $M$ imputations.

Fraction Missing
• Fraction of Missing Information (asymptotic formula)
  $$\hat{\gamma}_j = 1 - \frac{\hat{s}_j^2}{\hat{s}_j^2 + \hat{B}_j}$$
• Varies by parameter in the model
• Is typically smaller for MCAR than MAR data

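The asymptotic formula is easy to compute from the per-imputation results. A minimal sketch, assuming the estimates and squared standard errors for one parameter are already available; the function name and the numbers are hypothetical.

```python
import statistics

def fraction_missing(estimates, squared_ses):
    """Asymptotic fraction of missing information for one parameter.

    estimates   : the M point estimates, one per imputed data set
    squared_ses : the M squared standard errors
    """
    within = statistics.fmean(squared_ses)     # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (M - 1 divisor)
    return 1 - within / (within + between)

# Within variance 3 and between variance 1 give 1 - 3/4.
print(fraction_missing([1.0, 2.0, 3.0], [3.0, 3.0, 3.0]))
```

When the estimates barely change from one imputed data set to the next, the between-imputation variance is small and the fraction approaches zero.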
Estimate Missing Data With SAS
Obs  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
  1     65     95     95    100     23     25     25     27
  2     10     10     40     25     25     27     28     27
  3     95    100    100    100     27     29     29     28
  4     90    100    100    100     30     30     27     29
  5     30     80     90    100     23     29     29     30
  6     40     50      .      .     28     27      3      3
  7     40     70    100     95     29     29     30     30
  8     95    100    100    100     28     30     29     30
  9     50     80     75     85     26     29     27     25
 10     55    100    100    100     30     30     30     30
 11     50    100    100    100     30     27     30     24
 12     70     95    100    100     28     28     28     29
 13    100    100    100    100     30     30     30     30
 14     75     90    100    100     30     30     29     30
 15      0      5     10      .      3      3      3      .

PROC MI
PROC MI data=sample out=outmi seed=37851 nimpute=100;
   EM maxiter=1000;                  /* EM estimates, up to 1000 iterations */
   MCMC initial=em(maxiter=1000);    /* MCMC imputation started from EM estimates */
   VAR BADL0 BADL1 BADL3 BADL6
       MMSE0 MMSE1 MMSE3 MMSE6;      /* variables used in the imputation */
run;
• out=
  • Designates output file for imputed data
• nimpute=
  • # of imputed datasets
  • Default is 5
• Var
  • Variables to use in imputation

PROC MI output: Imputed dataset
Obs  _Imputation_  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
  1             1     65     95     95    100     23     25     25     27
  2             1     10     10     40     25     25     27     28     27
  3             1     95    100    100    100     27     29     29     28
  4             1     90    100    100    100     30     30     27     29
  5             1     30     80     90    100     23     29     29     30
  6             1     40     50     21     12     28     27      3      3
  7             1     40     70    100     95     29     29     30     30
  8             1     95    100    100    100     28     30     29     30
  9             1     50     80     75     85     26     29     27     25
 10             1     55    100    100    100     30     30     30     30
 11             1     50    100    100    100     30     27     30     24
 12             1     70     95    100    100     28     28     28     29
 13             1    100    100    100    100     30     30     30     30
 14             1     75     90    100    100     30     30     29     30
 15             1      0      5     10      8      3      3      3      2

What to Say to Reviewers:
• I pity the fool who does not impute – Mr. T
• If you compute you must impute – Johnnie Cochran
• Go forth and impute with impunity – Todd Little
• If math is God’s poetry, then statistics are God’s elegantly reasoned prose – Bill Bukowski

Planned missing data designs
• In planned missing data designs, participants are randomly assigned to conditions in which they do not respond to all items, all measures, and/or all measurement occasions
• Why would you want to do this?
  1. Long assessments can reduce data quality
  2. Repeated assessments can induce practice effects
  3. Collecting data can be time- and cost-intensive
  4. Less taxing assessments may reduce unplanned missingness

Planned missing data designs
• Cross-Sectional Designs
  – Matrix sampling (brief)
  – Three-Form Design (and Variations)
  – Two-Method Measurement (very cool)
• Longitudinal Designs
  – Developmental Time-Lag
  – Wave- to Age-based designs
  – Monotonic Sample Reduction
  – Growth-Curve Planned Missing

Multiple matrix sampling
[Figure, shown on three slides: a Participants (1 … N) by Test Items (1 … K) matrix of 0/1 item scores]
• Test a few participants on full item bank
• Or, randomly sample items and people…

Multiple matrix sampling
• Assumptions
– The K items are a random sample from a
population of items (just as N participants are a
random sample from a population)
• Limitations
– Properties of individual items or relations
between items are not of interest
• Not used much outside of ability testing
domain.
3-Form Intentionally Missing Design
Form | Common Variables | Variable Set A  | Variable Set B  | Variable Set C
1    | ¼ of Variables   | ¼ of Variables  | ¼ of Variables  | Planned Missing
2    | ¼ of Variables   | ¼ of Variables  | Planned Missing | ¼ of Variables
3    | ¼ of Variables   | Planned Missing | ¼ of Variables  | ¼ of Variables
• Graham, Taylor, Olchowski, & Cumsille (2006)
• Raghunathan & Grizzle (1995) “split questionnaire design”
• Wacholder et al. (1994) “partial questionnaire design”

3-form design
• What goes in the Common Set?
Form | Common Set X | Variable Set A | Variable Set B | Variable Set C
1    | ¼ of items   | ¼ of items     | ¼ of items     | missing
2    | ¼ of items   | ¼ of items     | missing        | ¼ of items
3    | ¼ of items   | missing        | ¼ of items     | ¼ of items

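Random assignment to the three forms can be sketched in a few lines. The names (`FORMS`, `assign_forms`) and the sample size of 300 are hypothetical. The sketch checks two properties of the design: every participant answers the common set X, and every pair of item sets is answered together by some participants, so every covariance has data behind it.

```python
import random

# Item sets included on each form: XAB, XAC, XBC.
FORMS = {1: ("X", "A", "B"), 2: ("X", "A", "C"), 3: ("X", "B", "C")}
SETS = ("X", "A", "B", "C")

def assign_forms(n_participants, seed=0):
    """Randomly assign participants to forms of (nearly) equal size."""
    rng = random.Random(seed)
    forms = [1 + i % 3 for i in range(n_participants)]
    rng.shuffle(forms)
    return forms

forms = assign_forms(300)

# Proportion of participants who answer each item set.
coverage = {s: sum(s in FORMS[f] for f in forms) / len(forms) for s in SETS}

# Proportion of participants who answer both sets in a pair.
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
pair_coverage = {p: sum(set(p) <= set(FORMS[f]) for f in forms) / len(forms) for p in pairs}

print(coverage)
print(pair_coverage)
```

With equal-sized forms, X is answered by everyone, each of A, B, and C by two thirds of the sample, and each pair among A, B, and C by one third.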
3-form design: Example
• 21 questions made up of 7 3-question subtests
Subtest           | Items
Demographics      | How old are you? / Are you male or female? / What is your occupation?
Musical Taste     | What is your favorite genre of music? / Do you like to listen to music while you work? / Do you prefer music played loud or softly?
Extraversion      | I start conversations. / I am the life of the party. / I am comfortable around people.
Neuroticism       | I get stressed out easily. / I get irritated easily. / I have frequent mood swings.
Conscientiousness | I am always prepared. / I like order. / I pay attention to details.
Agreeableness     | I am interested in people. / I have a soft heart. / I take time out for others.
Openness          | I have a rich vocabulary. / I have excellent ideas. / I have a vivid imagination.

3-form design: Example
• Common Set (X): the Demographics and Musical Taste items
• Set A: the first item of each personality subtest (I start conversations. I get stressed out easily. I am always prepared. I am interested in people. I have a rich vocabulary.)
• Set B: the second item of each personality subtest (I am the life of the party. I get irritated easily. I like order. I have a soft heart. I have excellent ideas.)
• Set C: the third item of each personality subtest (I am comfortable around people. I have frequent mood swings. I pay attention to details. I take time out for others. I have a vivid imagination.)

All three forms include the six Common Set (X) items: How old are you? Are you male or female? What is your occupation? What is your favorite genre of music? Do you like to listen to music while you work? Do you prefer music played loud or softly?
Form 1 (XAB)                 | Form 2 (XAC)                    | Form 3 (XBC)
I have a rich vocabulary.    | I have a rich vocabulary.       | I have excellent ideas.
I have excellent ideas.      | I have a vivid imagination.     | I have a vivid imagination.
I start conversations.       | I start conversations.          | I am the life of the party.
I am the life of the party.  | I am comfortable around people. | I am comfortable around people.
I get stressed out easily.   | I get stressed out easily.      | I get irritated easily.
I get irritated easily.      | I have frequent mood swings.    | I have frequent mood swings.
I am always prepared.        | I am always prepared.           | I like order.
I like order.                | I pay attention to details.     | I pay attention to details.
I am interested in people.   | I am interested in people.      | I have a soft heart.
I have a soft heart.         | I take time out for others.     | I take time out for others.

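Participant | Form | Age | Sex | Occupation   | Genre     | Work Music | Volume | Open1 | Open2 | Open3 | Extra1 | Extra2 | Extra3 | Neuro1 | Neuro2 | Neuro3 | Consc1 | Consc2 | Consc3 | Agree1 | Agree2 | Agree3
1           | 1    | 17  | F   | professor    | Classical | N          | loud   | 4     | 4     | --    | 1      | 5      | --     | 1      | 2      | --     | 4      | 2      | --     | 3      | 2      | --
2           | 1    | 12  | F   | musician     | Funk      | N          | soft   | 1     | 3     | --    | 2      | 2      | --     | 5      | 3      | --     | 4      | 1      | --     | 2      | 1      | --
3           | 1    | 17  | M   | student      | Jazz      | N          | soft   | 2     | 4     | --    | 5      | 5      | --     | 2      | 4      | --     | 5      | 1      | --     | 4      | 2      | --
4           | 1    | 29  | M   | server       | Metal     | N          | soft   | 1     | 3     | --    | 5      | 2      | --     | 2      | 1      | --     | 1      | 1      | --     | 4      | 2      | --
5           | 1    | 17  | M   | chef         | Rock      | N          | soft   | 1     | 4     | --    | 5      | 1      | --     | 2      | 2      | --     | 5      | 3      | --     | 2      | 2      | --
6           | 2    | 11  | F   | painter      | Pop       | Y          | loud   | 4     | --    | 4     | 2      | --     | 1      | 1      | --     | 5      | 1      | --     | 5      | 5      | --     | 3
7           | 2    | 19  | F   | librarian    | Alt       | N          | loud   | 1     | --    | 4     | 4      | --     | 3      | 4      | --     | 3      | 4      | --     | 2      | 4      | --     | 3
8           | 2    | 22  | F   | server       | Ska       | N          | soft   | 4     | --    | 2     | 3      | --     | 3      | 3      | --     | 3      | 1      | --     | 2      | 5      | --     | 5
9           | 2    | 18  | M   | doctor       | Punk      | N          | loud   | 1     | --    | 3     | 2      | --     | 2      | 2      | --     | 4      | 4      | --     | 1      | 3      | --     | 2
10          | 2    | 19  | F   | statistician | Pop       | N          | loud   | 4     | --    | 5     | 3      | --     | 4      | 5      | --     | 4      | 3      | --     | 2      | 3      | --     | 1
11          | 3    | 28  | F   | chef         | Rock      | Y          | loud   | --    | 3     | 3     | --     | 5      | 5      | --     | 5      | 4      | --     | 3      | 3      | --     | 2      | 5
12          | 3    | 25  | M   | nurse        | Rock      | N          | soft   | --    | 4     | 5     | --     | 2      | 2      | --     | 2      | 5      | --     | 4      | 5      | --     | 3      | 5
13          | 3    | 19  | M   | lawyer       | Jazz      | Y          | soft   | --    | 3     | 4     | --     | 3      | 2      | --     | 4      | 5      | --     | 4      | 5      | --     | 1      | 2
14          | 3    | 18  | F   | accountant   | Metal     | N          | soft   | --    | 3     | 1     | --     | 1      | 2      | --     | 3      | 3      | --     | 4      | 4      | --     | 5      | 4
15          | 3    | 21  | F   | secretary    | Alt       | N          | loud   | --    | 4     | 4     | --     | 1      | 2      | --     | 1      | 1      | --     | 5      | 3      | --     | 4      | 5
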
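Participant | Form | Age | Sex | Occupation   | Genre     | Work Music | Volume | Open1 | Open2 | Open3 | Extra1 | Extra2 | Extra3 | Neuro1 | Neuro2 | Neuro3 | Consc1 | Consc2 | Consc3 | Agree1 | Agree2 | Agree3
1           | 1    | 47  | F   | professor    | Classical | N          | loud   | 4     | 4     | --    | 1      | 5      | --     | 1      | 2      | --     | 4      | 2      | --     | 3      | 2      | --
2           | 1    | 42  | F   | musician     | Funk      | N          | soft   | 1     | 3     | --    | 2      | 2      | --     | 5      | 3      | --     | 4      | 1      | --     | 2      | 1      | --
3           | 1    | 27  | M   | student      | Jazz      | N          | soft   | 2     | 4     | --    | 5      | 5      | --     | 2      | 4      | --     | 5      | 1      | --     | 4      | 2      | --
4           | 1    | 29  | M   | server       | Metal     | N          | soft   | 1     | 3     | --    | 5      | 2      | --     | 2      | 1      | --     | 1      | 1      | --     | 4      | 2      | --
5           | 1    | 27  | M   | chef         | Rock      | N          | soft   | 1     | 4     | --    | 5      | 1      | --     | 2      | 2      | --     | 5      | 3      | --     | 2      | 2      | --
6           | 2    | 21  | F   | painter      | Pop       | Y          | loud   | 4     | --    | 4     | 2      | --     | 1      | 1      | --     | 5      | 1      | --     | 5      | 5      | --     | 3
7           | 2    | 39  | F   | librarian    | Alt       | N          | loud   | 1     | --    | 4     | 4      | --     | 3      | 4      | --     | 3      | 4      | --     | 2      | 4      | --     | 3
8           | 2    | 22  | F   | server       | Ska       | N          | soft   | 4     | --    | 2     | 3      | --     | 3      | 3      | --     | 3      | 1      | --     | 2      | 5      | --     | 5
9           | 2    | 38  | M   | doctor       | Punk      | N          | loud   | 1     | --    | 3     | 2      | --     | 2      | 2      | --     | 4      | 4      | --     | 1      | 3      | --     | 2
10          | 2    | 29  | F   | statistician | Pop       | N          | loud   | 4     | --    | 5     | 3      | --     | 4      | 5      | --     | 4      | 3      | --     | 2      | 3      | --     | 1
11          | 3    | 28  | F   | chef         | Rock      | Y          | loud   | --    | 3     | 3     | --     | 5      | 5      | --     | 5      | 4      | --     | 3      | 3      | --     | 2      | 5
12          | 3    | 25  | M   | nurse        | Rock      | N          | soft   | --    | 4     | 5     | --     | 2      | 2      | --     | 2      | 5      | --     | 4      | 5      | --     | 3      | 5
13          | 3    | 29  | M   | lawyer       | Jazz      | Y          | soft   | --    | 3     | 4     | --     | 3      | 2      | --     | 4      | 5      | --     | 4      | 5      | --     | 1      | 2
14          | 3    | 38  | F   | accountant   | Metal     | N          | soft   | --    | 3     | 1     | --     | 1      | 2      | --     | 3      | 3      | --     | 4      | 4      | --     | 5      | 4
15          | 3    | 21  | F   | secretary    | Alt       | N          | loud   | --    | 4     | 4     | --     | 1      | 2      | --     | 1      | 1      | --     | 5      | 3      | --     | 4      | 5
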
Expansions of 3-Form Design
Form | Item Set Order
1    | XAB
2    | XAC
3    | XBC
4    | AXB
5    | AXC
6    | BXC
7    | ABX
8    | ACX
9    | BCX
(Graham, Taylor, Olchowski, & Cumsille, 2006)

2-Method Planned Missing Design
2-Method Measurement
• Expensive Measure 1
  • Gold standard: highly valid (unbiased) measure of the construct under investigation
  • Problem: Measure 1 is time-consuming and/or costly to collect, so it is not feasible to collect from a large sample
• Inexpensive Measure 2
  • Practical: inexpensive and/or quick to collect on a large sample
  • Problem: Measure 2 is systematically biased so not ideal

2-Method Measurement
• e.g., measuring stress
  • Expensive Measure 1 = collect spit samples, measure cortisol
  • Inexpensive Measure 2 = survey querying stressful thoughts
• e.g., measuring intelligence
  • Expensive Measure 1 = WAIS IQ scale
  • Inexpensive Measure 2 = multiple choice IQ test
• e.g., measuring smoking
  • Expensive Measure 1 = carbon monoxide measure
  • Inexpensive Measure 2 = self-report
• e.g., Student Attention
  • Expensive Measure 1 = Classroom observations
  • Inexpensive Measure 2 = Teacher report

2-Method Measurement
• How it works
  • ALL participants receive Measure 2 (the cheap one)
  • A subset of participants also receive Measure 1 (the gold standard)
  • Using both measures (on a subset of participants) enables us to estimate and remove the bias from the inexpensive measure (for all participants) using a latent variable model

2-Method Planned Missing Design
[Figure: path diagram of a Smoking factor measured by Self-Report 1, Self-Report 2, CO, and Cotinine, with a Self-Report Bias factor.]

2-Method Measurement
• Example
  • Does child’s level of classroom attention in Grade 1 predict math ability in Grade 3?
  • Attention Measures
    • 1) Direct Classroom Assessment (2 items, N = 60)
    • 2) Teacher Report (2 items, N = 200)
  • Math Ability Measure, 1 item (test score, N = 200)

[Figure, repeated on six slides: path diagram relating Attention (Grade 1) to Math Score (Grade 3). Attention is measured by Teacher Report 1 and Teacher Report 2 (N = 200 each) and by Direct Assessment 1 and Direct Assessment 2 (N = 60 each); Math Score (Grade 3) has N = 200. A Teacher Bias factor is also shown.]

2-Method Planned Missing Design
• Assumptions:
  • expensive measure is unbiased (i.e., valid)
  • inexpensive measure is systematically biased
  • both measures assess the same construct
• Goals
  • Optimize cost
  • Optimize power

2-Method Planned Missing Design
• All participants get the inexpensive measure
• Only a subset get the expensive measure
Proportion of sample | MC test | WAIS
.36                  | yes     | yes
.64                  | yes     | no
• Cost:
  $$\$_{\text{total}} = \$_{\text{inexpensive}} \times N_{\text{total}} + \$_{\text{expensive}} \times N_{\text{expensive}}$$
  $$N_{\text{expensive}} = \frac{\$_{\text{total}} - \$_{\text{inexpensive}} \times N_{\text{total}}}{\$_{\text{expensive}}}$$
  $$N_{\text{total}} = \frac{\$_{\text{total}} - \$_{\text{expensive}} \times N_{\text{expensive}}}{\$_{\text{inexpensive}}}$$

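The cost equations translate directly into a budgeting helper. A minimal sketch: the function name, the budget of 10,000, and the per-person costs of 10 and 100 are hypothetical values chosen only to show the trade-off.

```python
def n_expensive(total_budget, cost_cheap, cost_expensive, n_total):
    """Number of participants who can also receive the expensive measure.

    Everyone receives the inexpensive measure; whatever budget remains
    pays for the expensive measure, rounded down to whole participants.
    """
    remaining = total_budget - cost_cheap * n_total
    if remaining < 0:
        raise ValueError("budget cannot cover the inexpensive measure for everyone")
    return int(remaining // cost_expensive)

# Holding cost constant, a larger total sample leaves fewer expensive assessments.
for n_total in (200, 400, 600, 800, 1000):
    print(n_total, n_expensive(10000, 10, 100, n_total))
```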
2-Method Planned Missing Design
• Holding cost constant, as Ntotal increases, Nexpensive decreases
• As Ntotal increases, SEs begin to decrease (power increases); as Ntotal continues to increase, SEs increase again, driving power back down

2-Method Planned Missing Design
• Goal: find the sweet spot!
true-score reliability (expensive) | true-score reliability (cheap) | bias
.25                                | .25                            | cheap only
.49                                | .25                            | cheap only
.25                                | .49                            | cheap only
.49                                | .49                            | cheap only
.49                                | .25                            | neither

Longitudinal Missing Designs
• Rather than specific items missing, longitudinal planned missing designs tend to focus on whole waves missing for individual participants
• Researchers have long turned complete data into planned missing data with more time points
  • e.g., data at 3 grades transformed into 8 ages

Developmental Time-Lag Model
• Use 2-time point data with variable time-lags to measure a growth trajectory + practice effects (McArdle & Woodcock, 1997)

Age at each testing occasion
student | T1   | T2
1       | 5;6  | 5;7
2       | 5;3  | 5;8
3       | 4;9  | 4;11
4       | 4;6  | 5;0
5       | 4;11 | 5;4
6       | 5;7  | 5;10
7       | 5;2  | 5;3
8       | 5;4  | 5;8
[Figure: Time axis running from 0 to 6]

[Figure, built up over five slides: latent growth model with indicators T0 through T6]
$$Y_t = 1 \cdot I + B_t G + A_t P$$
• Intercept (I): loadings fixed at 1 for T0 through T6
• Linear growth (G): loadings B_t = 0, 1, 2, 3, 4, 5, 6
• Constant Practice Effect (P): loadings A_t = 0, 1, 1, 1, 1, 1, 1
• Exponential Practice Decline (P): loadings A_t = 0, 1, .82, .67, .55, .45, .37

The Equations for Each Time Point
Constant Practice Effect | Declining Practice Effect
$Y_{T0} = I$             | $Y_{T0} = I$
$Y_{T1} = I + 1G + P$    | $Y_{T1} = I + 1G + 1.0P$
$Y_{T2} = I + 2G + P$    | $Y_{T2} = I + 2G + .82P$
$Y_{T3} = I + 3G + P$    | $Y_{T3} = I + 3G + .67P$
$Y_{T4} = I + 4G + P$    | $Y_{T4} = I + 4G + .55P$
$Y_{T5} = I + 5G + P$    | $Y_{T5} = I + 5G + .45P$
$Y_{T6} = I + 6G + P$    | $Y_{T6} = I + 6G + .37P$

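The two sets of equations can be generated from one function. This is a sketch, not the authors' specification: the function names are hypothetical, and the decay rate of 0.2 is an assumption that happens to reproduce the rounded declining loadings listed above (1.0, .82, .67, .55, .45, .37).

```python
import math

def practice_loadings(decay=None, n_times=7):
    """Practice-effect loadings A_t for T0 .. T(n_times - 1).

    decay=None gives the constant practice effect (0, 1, 1, ...).
    Otherwise the effect starts at 1.0 at T1 and declines exponentially.
    """
    loadings = [0.0]  # no practice effect at the first testing occasion
    for t in range(1, n_times):
        loadings.append(1.0 if decay is None else math.exp(-decay * (t - 1)))
    return loadings

def expected_scores(intercept, growth, practice, decay=None, n_times=7):
    """Y_t = 1*I + B_t*G + A_t*P with linear growth loadings B_t = t."""
    a = practice_loadings(decay, n_times)
    return [intercept + t * growth + a[t] * practice for t in range(n_times)]

print([round(v, 2) for v in practice_loadings(decay=0.2)])
print(expected_scores(intercept=10, growth=2, practice=3))             # constant
print([round(v, 2) for v in expected_scores(10, 2, 3, decay=0.2)])     # declining
```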
Developmental Time-Lag model
• Summary
  • 2 measured time points are formatted according to time-lag
  • This formatting allows a growth-curve to be fit, measuring growth and practice effects

Wave- to Age-based Data
• The idea of reformatting data to answer a different question is not limited to time-lag designs
• Wave-based data collection (e.g., data collected at Grade 1-3) can be transformed into age-based data with missingness

Age at each grade
student | K    | 1    | 2
1       | 5;6  | 6;7  | 7;3
2       | 5;3  | 6;0  | 7;4
3       | 4;9  | 5;11 | 6;10
4       | 4;6  | 5;5  | 6;4
5       | 4;11 | 5;9  | 6;10
6       | 5;7  | 6;7  | 7;5
7       | 5;2  | 6;1  | 7;3
8       | 5;4  | 6;5  | 7;6
Age bins: 4;6–4;11, 5;0–5;5, 5;6–5;11, 6;0–6;5, 6;6–6;11, 7;0–7;5, 7;6–7;11
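
• Out of 3 waves, we create 7 waves of data with high missingness
• Allows for more fine-tuned age-specific growth modeling
• Even high amounts of missing data are not typically a problem for estimation
student | 4;6–4;11 | 5;0–5;5 | 5;6–5;11 | 6;0–6;5 | 6;6–6;11 | 7;0–7;5 | 7;6–7;11
1       | --       | --      | 5;6      | --      | 6;7      | 7;3     | --
2       | --       | 5;3     | --       | 6;0     | --       | 7;4     | --
3       | 4;9      | --      | 5;11     | --      | 6;10     | --      | --
4       | 4;6      | 5;5     | --       | 6;4     | --       | --      | --
5       | 4;11     | --      | 5;9      | --      | 6;10     | --      | --
6       | --       | --      | 5;7      | --      | 6;7      | 7;5     | --
7       | --       | 5;2     | --       | 6;1     | --       | 7;3     | --
8       | --       | 5;4     | --       | 6;5     | --       | --      | 7;6
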
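Restructuring wave-based records into age bins is mechanical. A minimal sketch, assuming ages are written as "years;months" and six-month bins start at 4;6 as in the slides; the function names and the example scores are hypothetical.

```python
def parse_age(text):
    """Convert an age written as 'years;months' into total months."""
    years, months = text.split(";")
    return int(years) * 12 + int(months)

def age_bin(text, start="4;6", width=6):
    """Zero-based index of the six-month age bin that an age falls into."""
    return (parse_age(text) - parse_age(start)) // width

def to_age_based(wave_ages, scores, n_bins=7):
    """Place each wave's score in its age bin; untested bins stay None."""
    row = [None] * n_bins
    for age, score in zip(wave_ages, scores):
        row[age_bin(age)] = score
    return row

# Student 1 was tested at 5;6, 6;7, and 7;3 (the scores are made up).
print(to_age_based(["5;6", "6;7", "7;3"], [10, 12, 15]))
```

Each student contributes three observed bins and four missing ones, which is the high but planned missingness the slide describes.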
Monotonic Sample Reduction
• Advantages:
  • Cost reduction
  • A lot of power to estimate effects at earlier waves
• Disadvantages:
  • Very little power to estimate effects dependent on the last wave of data, e.g., growth curve models (may be missing 80% of data)
  • It is important to be able to estimate attrition rates before beginning data collection

Monotonic Sample Reduction
• Sometimes used in large datasets (e.g., Early Childhood Longitudinal Study) to reduce costs
• At each wave, a randomly-selected subgroup of the original sample is observed again
• The remainder of the original participants do not need to be kept track of, dramatically reducing costs
Group | Time 1 | Time 2 | Time 3 | Time 4 | Time 5
1     | x      | x      | x      | x      | x
2     | x      | x      | x      | x      | --
3     | x      | x      | x      | --     | --
4     | x      | x      | --     | --     | --
5     | x      | --     | --     | --     | --

Growth-Curve Planned Missing
• With a particular analysis in mind, missingness may be tailored to maximize power
  • In growth-curve designs, the most important parameters are the growth parameters (e.g., the steepness and shape of the curve)
  • Estimation precision depends heavily on the first and last time points
  • A planned missing design can take advantage of this by putting missingness in the middle
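One way to see why the end points matter is the textbook result for an ordinary least squares straight line: the sampling variance of the slope is the residual variance divided by the sum of squared deviations of the time points from their mean. The sketch below applies that formula to five equally spaced occasions coded 0 to 4. It is a deliberately simplified stand-in (a single straight line, not a full latent growth model), and `slope_variance` is a name invented for the example.

```python
def slope_variance(times, residual_variance=1.0):
    """Sampling variance of an OLS slope fitted to the given time points."""
    mean_time = sum(times) / len(times)
    sum_sq = sum((t - mean_time) ** 2 for t in times)
    return residual_variance / sum_sq


FULL = slope_variance([0, 1, 2, 3, 4])       # all five occasions
DROP_MIDDLE = slope_variance([0, 1, 3, 4])   # Time 3 missing
DROP_LAST = slope_variance([0, 1, 2, 3])     # Time 5 missing

if __name__ == "__main__":
    print(f"all occasions: {FULL:.2f}")
    print(f"middle occasion missing: {DROP_MIDDLE:.2f}")
    print(f"last occasion missing: {DROP_LAST:.2f}")
```

Under these assumptions, dropping the middle occasion leaves the slope variance unchanged, while dropping the last occasion doubles it. The middle occasions still carry the information about the shape of the curve, so this only speaks to the linear slope.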
Growth-Curve Design
Group   Time 1    Time 2    Time 3    Time 4    Time 5
1       x         x         x         x         x
2       x         x         x         x         missing
3       x         x         x         missing   x
4       x         x         missing   x         x
5       x         missing   x         x         x
6       missing   x         x         x         x
Growth Curve Design II
Group   Time 1    Time 2    Time 3    Time 4    Time 5
1       x         x         x         x         x
2       x         x         x         missing   missing
3       x         x         missing   x         missing
4       x         missing   x         x         missing
5       missing   x         x         x         missing
6       x         x         missing   missing   x
7       x         missing   x         missing   x
8       missing   x         x         missing   x
9       x         missing   missing   x         x
10      missing   x         missing   x         x
11      missing   missing   x         x         x
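Both growth-curve layouts follow a simple rule: one complete-data group plus one group for every way of omitting a fixed number of occasions (one occasion in the first design, two in Design II). A minimal Python sketch of that enumeration follows; the function name is hypothetical and the groups come out in a different order than on the slides.

```python
from itertools import combinations


def planned_missing_patterns(n_occasions, n_missing):
    """Complete-data group plus one group per set of n_missing skipped occasions.

    Each pattern is a tuple of booleans; True means the occasion is observed.
    """
    patterns = [tuple([True] * n_occasions)]
    for skipped in combinations(range(n_occasions), n_missing):
        patterns.append(tuple(t not in skipped for t in range(n_occasions)))
    return patterns


DESIGN_I = planned_missing_patterns(5, 1)   # 1 complete + 5 single-missing groups
DESIGN_II = planned_missing_patterns(5, 2)  # 1 complete + 10 double-missing groups

if __name__ == "__main__":
    for number, pattern in enumerate(DESIGN_II, start=1):
        cells = ["x" if observed else "missing" for observed in pattern]
        print(number, *cells, sep="\t")
```

With five occasions this gives 6 groups for the first design and 11 groups for Design II, matching the tables.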
Efficiency of Planned Missing Designs
Combined Elements
The Sequential Designs
Transforming to Accelerated Longitudinal
Transforming to Episodic Time
Planned Missing Designs: Summary
• Purposeful missing data can address several issues in study design
  • Cost of data collection
  • Participant burden/fatigue
  • Practice effects
  • Participant dropout
• Rearranging data can turn one complete design into a more nuanced missing data design
  • Developmental time-lag designs
  • Wave-missing into age-missing
The Impact of Auxiliary Variables
• Consider the following Monte Carlo simulation:
  • 60% missing data under MAR (missingness depends on Aux1)
  • 1,000 samples of N = 100
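A simulation of this kind can be sketched in plain Python. The sketch below is not the workshop's simulation: it assumes a correlation of .60 between Aux1 and Y, makes Y missing for the 60% of cases with the lowest Aux1 values (one possible MAR mechanism), and tracks the mean of Y rather than a model parameter. A regression of Y on Aux1 stands in for a missing data procedure that uses the auxiliary variable; all names are invented for the example.

```python
import random


def simulate_once(rng, n=100, rho=0.6, missing_rate=0.6):
    """Return (complete-case mean of Y, Aux1-adjusted mean of Y) for one sample."""
    aux = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in aux]

    # MAR: Y is missing for the cases with the lowest Aux1 values.
    n_missing = int(round(missing_rate * n))
    order = sorted(range(n), key=lambda i: aux[i])
    observed = order[n_missing:]
    a_obs = [aux[i] for i in observed]
    y_obs = [y[i] for i in observed]

    # Ignoring Aux1: mean of the observed Y values only.
    cc_mean = sum(y_obs) / len(y_obs)

    # Using Aux1: regress Y on Aux1 among observed cases, then predict
    # the mean of Y at the full-sample mean of Aux1.
    a_mean = sum(a_obs) / len(a_obs)
    sxx = sum((a - a_mean) ** 2 for a in a_obs)
    sxy = sum((a - a_mean) * (v - cc_mean) for a, v in zip(a_obs, y_obs))
    slope = sxy / sxx
    intercept = cc_mean - slope * a_mean
    adjusted_mean = intercept + slope * (sum(aux) / n)
    return cc_mean, adjusted_mean


def run_simulation(n_samples=1000, seed=2012):
    """Average bias of both estimators; the true mean of Y is 0."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(n_samples)]
    bias_without_aux = sum(r[0] for r in results) / n_samples
    bias_with_aux = sum(r[1] for r in results) / n_samples
    return bias_without_aux, bias_with_aux


BIAS_WITHOUT_AUX, BIAS_WITH_AUX = run_simulation()

if __name__ == "__main__":
    print(f"bias ignoring Aux1: {BIAS_WITHOUT_AUX:.3f}")
    print(f"bias using Aux1: {BIAS_WITH_AUX:.3f}")
```

Under these assumptions the estimate that ignores Aux1 is clearly biased upward, while the estimate that conditions on Aux1 is close to the true value on average.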
Excluding A Correlate of Missingness
Simulation Results Showing the Bias Associated with Omitting a
Correlate of Missingness.
MNAR improvements
Simulation Results Showing the Bias Reduction Associated with
Including Auxiliary Variables in a MNAR Situation.
Improvement in power relative to the power
of a model with no auxiliary variables.
Simulation results showing the relative power associated with
including auxiliary variables in a MCAR Situation.
PCA Auxiliary Variables
• Use PCA to reduce the dimensionality of the auxiliary variables in a data set.
• A new, smaller set of auxiliary variables (e.g., principal components) is created that contains all the useful information (both linear and non-linear) in the original data set.
• These principal component scores are then used to inform the missing data handling procedure (i.e., FIML, MI).
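The reduction step can be sketched without any external libraries by extracting the first principal component of the auxiliary variables with power iteration on their correlation matrix. Everything here is an assumption made for illustration: the eight simulated auxiliary variables, their loading of .70 on a common factor, and the function names. The sketch covers only linear information; a production analysis would normally rely on a numerical library rather than hand-written linear algebra.

```python
import random


def standardize(column):
    """Center a column and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in column]


def correlation(x, y):
    """Pearson correlation of two equally long lists."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def first_principal_component(columns, iterations=200):
    """Return (component scores, proportion of variance explained)."""
    z = [standardize(c) for c in columns]
    p = len(z)
    n = len(z[0])
    r = [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

    # Power iteration converges to the leading eigenvector of r.
    weights = [1 / p ** 0.5] * p
    for _ in range(iterations):
        product = [sum(r[i][j] * weights[j] for j in range(p)) for i in range(p)]
        norm = sum(v * v for v in product) ** 0.5
        weights = [v / norm for v in product]

    r_times_w = [sum(r[i][j] * weights[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(w * v for w, v in zip(weights, r_times_w))
    scores = [sum(w * col[k] for w, col in zip(weights, z)) for k in range(n)]
    return scores, eigenvalue / p


# Hypothetical data: 8 auxiliary variables that share one common factor.
rng = random.Random(96)
FACTOR = [rng.gauss(0, 1) for _ in range(500)]
AUXILIARIES = [
    [0.7 * f + (1 - 0.49) ** 0.5 * rng.gauss(0, 1) for f in FACTOR]
    for _ in range(8)
]
PC_SCORES, EXPLAINED = first_principal_component(AUXILIARIES)
PC_FACTOR_CORRELATION = correlation(PC_SCORES, FACTOR)

if __name__ == "__main__":
    print(f"variance explained by component 1: {EXPLAINED:.2f}")
    print(f"correlation with common factor: {PC_FACTOR_CORRELATION:.2f}")
```

The single component score column `PC_SCORES` would then take the place of the eight separate auxiliary variables in the imputation or FIML model.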
The Use of PCA Auxiliary Variables
• Consider a series of simulations:
• MCAR, MAR, MNAR (10-60%) missing data
• 1,000 samples of N = 50-1000
60% MAR correlation estimates with no
auxiliary variables
Simulation results showing XY correlation estimates (with 95 and
99% confidence intervals) associated with a 60% MAR Situation.
Bias – Linear MAR process
ρAux,Y = .60; 60% MAR
Non-Linear Missingness
Bias – Non-Linear MAR process
ρAux,Y = .60; 60% non-linear MAR
Bias
ρAux,Y = .60; 60% MAR
Bias
ρAux,Y = .60; 60% MAR
Bias
ρAux,Y = .60; 60% MAR
60% MAR correlation estimates with no
auxiliary variables
Simulation results showing XY correlation estimates (with 95 and 99% confidence
intervals) associated with a 60% MAR Situation.
60% MAR correlation estimates with all possible
auxiliary variables (r = .60)
Simulation results showing XY correlation estimates (with 95 and 99%
confidence intervals) associated with a 60% MAR Situation and 8 auxiliary
variables.
60% MAR correlation estimates with 1 PCA
auxiliary variable (r = .60)
Simulation results showing XY correlation estimates (with 95 and 99%
confidence intervals) associated with a 60% MAR Situation and 1 PCA auxiliary
variable.
Auxiliary Variable Power Comparison
[Figure legend: 1 PCA Auxiliary; All 8 Auxiliary Variables; 1 Auxiliary]
Faster and more reliable convergence
Summary
• Including principal component auxiliary variables in the imputation model improves parameter estimation
  • compared with using no auxiliary variables, and
  • in most cases beyond the improvement from typical auxiliary variables, particularly with non-linear MAR missingness.
• PCA auxiliary variables improve missing data handling procedures when the number of potential auxiliary variables is beyond a practical limit.
www.quant.ku.edu
Simple Significance Testing with MI
• Generate multiply imputed datasets (m).
• Calculate a single covariance matrix on all N*m observations.
  • By combining information from all m datasets, this matrix should represent the best estimate of the population associations.
• Run the Analysis model on this single covariance matrix and use the resulting estimates as the basis for inference and hypothesis testing.
  • The fit function from this approach should be the best basis for making inferences about model fit and significance.
• Using a Monte Carlo Simulation, we test the hypothesis that this approach is reasonable.
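The first two steps (impute m times, then compute one covariance matrix on the stacked N*m rows) can be sketched as follows. The imputation routine here is a crude stochastic regression imputation used only so the example runs end to end; it is an assumption of this sketch, not the procedure used in the workshop, and a proper multiple imputation routine would also reflect uncertainty in the regression parameters. The data (N = 200, two variables, about 30% of Y missing completely at random) and all function names are likewise hypothetical.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    return [
        [
            sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
            for cj, mj in zip(columns, means)
        ]
        for ci, mi in zip(columns, means)
    ]


def impute_once(rows, rng):
    """Fill missing Y (None) with a regression prediction from X plus noise."""
    complete = [(x, y) for x, y in rows if y is not None]
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in complete)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_sd = (
        sum((y - intercept - slope * x) ** 2 for x, y in complete) / (n - 2)
    ) ** 0.5
    return [
        (x, y if y is not None else intercept + slope * x + rng.gauss(0, residual_sd))
        for x, y in rows
    ]


def stacked_covariance(imputed_datasets):
    """Single covariance matrix computed on all N*m stacked observations."""
    stacked = [row for data in imputed_datasets for row in data]
    return covariance_matrix(stacked), len(stacked)


# Hypothetical incomplete data set.
rng = random.Random(112)
N, M = 200, 20
DATA = []
for _ in range(N):
    x = rng.gauss(0, 1)
    y = 0.5 * x + (1 - 0.25) ** 0.5 * rng.gauss(0, 1)
    DATA.append((x, None if rng.random() < 0.3 else y))

IMPUTED = [impute_once(DATA, rng) for _ in range(M)]
COV, N_STACKED = stacked_covariance(IMPUTED)

if __name__ == "__main__":
    print("stacked rows:", N_STACKED)
    for row in COV:
        print([round(v, 3) for v in row])
```

The analysis model would then be fitted to `COV` instead of being fitted separately to each of the m data sets.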
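Population Model
[Path diagram: two correlated factors, Factor A (indicators A1-A10) and Factor B (indicators B1-B10); factor variances fixed at 1*; factor correlation .52]
Standardized loadings and residual variances, in the order listed in the diagram:
loadings, first set:    .75 .68 .76 .70 .72 .67 .69 .79 .72 .75
residual variances:     .44 .53 .42 .51 .48 .55 .52 .38 .49 .43
loadings, second set:   .81 .72 .74 .70 .71 .79 .69 .81 .73 .78
residual variances:     .35 .49 .45 .52 .50 .38 .53 .35 .47 .39
RMSEA = .047, CFI = .967, TLI = .962, SRMR = .021
Note: These are fully standardized parameter estimates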
Change in Chi-squared Test
Correlation Matrix Technique
Condition      PRB
10% Missing    -2.95%
30% Missing    4.39%
50% Missing    6.08%
[Figure: Change in Chi Squared Across Replications; y-axis: Change in Chi Squared (0 to 90); x-axis: Condition (Population, 10% Missing, 30% Missing, 50% Missing)]
On the Merits of Planning and
Planning for Missing Data*
*You’re a fool for not using planned
missing data design
Thanks for your attention!
Questions?
Workshop presented 05-21-2012
Max Planck Institute for Human Development, Berlin, Germany
References
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Graham, J. W., Hofer, S. M., & Piccinin, A. M. (1994). Analysis with missing data in drug prevention research. In L. M. Collins & L. Seitz (Eds.), National Institute on Drug Abuse Research Monograph Series (pp. 13-62). Washington, DC: National Institute on Drug Abuse.
Graham, J. W., Hofer, S. M., & MacKinnon, D. P. (1996). Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research, 31, 197-218.
Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data designs in psychological research. Psychological Methods, 11, 323-343.
Graham, J. W., Taylor, B. J., & Cumsille, P. E. (2001). Planned missing data designs in the analysis of change. In L. M. Collins & A. G. Sayer (Eds.), New methods for the analysis of change (pp. 335-353). Washington, DC: American Psychological Association.
McArdle, J. J., & Woodcock, R. W. (1997). Expanding test-retest designs to include developmental time-lag components. Psychological Methods, 2, 403-435.
Raghunathan, T. E., & Grizzle, J. E. (1995). A split questionnaire survey design. Journal of the American Statistical Association, 90, 54-63.
Shoemaker, D. M. (1971). Principles and procedures of multiple matrix sampling. Southwest regional library technical report 34.
Wacholder, S., Carroll, R. J., Pee, D., & Gail, M. H. (1994). The partial questionnaire design for case-control studies. Statistics in Medicine, 13, 623-634.
Update
Dr. Todd Little is currently at
Texas Tech University
Director, Institute for Measurement, Methodology, Analysis and Policy (IMMAP)
Director, “Stats Camp”
Professor, Educational Psychology and Leadership
Email: yhat@ttu.edu
IMMAP (immap.educ.ttu.edu)
Stats Camp (Statscamp.org)
www.Quant.KU.edu