Multiple Imputation

Introduction to Multiple Imputation
CFDR Workshop Series
Spring 2008
Outline
• Missing data mechanisms
• What is Multiple Imputation?
• SAS Proc MI, Proc MIANALYZE
• Stata ICE, MICOMBINE
• SAS IVEware
• What’s the diff?
• Problems with categorical imputation
Missing data mechanisms
• Missing Completely At Random (MCAR)
– The probability of missingness does not depend on any variable, observed or unobserved.
• Missing At Random (MAR)
– The probability of missingness does not depend on the unobserved value of the missing variable, but it can depend on any of the other (observed) variables in your dataset.
• Not Missing At Random (NMAR)
– The probability of missingness depends on the unobserved value of the missing variable itself.
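The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables x and y, the missingness probabilities, and the function name simulate are assumptions made for this example, not part of the workshop material. Here x is always observed and y may go missing. Among cases with x > 0, the observed mean of y should stay close to the complete-data mean under MCAR and MAR, but not under NMAR.

```python
import random
from statistics import mean

def simulate(n=20000, seed=1234):
    """Mean of y among cases with x > 0: complete data vs. each mechanism."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)          # covariate, always observed
        y = x + rng.gauss(0, 1)      # variable that may go missing
        p_miss = {
            "MCAR": 0.3,                      # depends on nothing
            "MAR": 0.5 if x > 0 else 0.1,     # depends on observed x only
            "NMAR": 0.5 if y > 0 else 0.1,    # depends on y itself
        }
        observed = {name: rng.random() >= p for name, p in p_miss.items()}
        rows.append((x, y, observed))
    # Reference value: complete-data mean of y in the stratum x > 0
    results = {"complete": mean(y for x, y, _ in rows if x > 0)}
    for name in ("MCAR", "MAR", "NMAR"):
        results[name] = mean(y for x, y, obs in rows if x > 0 and obs[name])
    return results

for name, value in simulate().items():
    print(f"{name:8s} mean of y given x > 0: {value:.3f}")
```

Conditioning on x is what makes the MAR case recoverable: once x is taken into account, the observed cases are representative of the missing ones.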
What is Multiple Imputation?
1. Imputation
• Make M=3 to 10 copies of the incomplete data set, filling in the missing values with conditionally random values
2. Analyses
• Analyze each data set separately
3. Pooling
• Point estimates: average across the M analyses
• Standard errors: combine the within-imputation and between-imputation variances
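The pooling step can be sketched with the standard combining rules (Rubin's rules): the pooled estimate is the average of the M estimates, and the total variance is W + (1 + 1/M) B, where W is the average within-imputation variance and B is the between-imputation variance of the estimates. The function name pool is hypothetical. The inputs are the RunTime coefficients and variances for the two imputations that appear in the printed SAS output later in this deck. Because only two imputations are printed there, the result is an illustration and is not expected to match the Proc MIANALYZE table, which pools every imputation.

```python
from math import sqrt
from statistics import mean, variance

def pool(estimates, variances):
    """Combine M per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # W: average within-imputation variance
    between = variance(estimates)      # B: sample variance of the M estimates
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)

# RunTime coefficient and its variance from the two printed imputations
runtime_estimates = [-2.44422, -3.0485]
runtime_variances = [0.14005, 0.13629]

estimate, std_error = pool(runtime_estimates, runtime_variances)
print(f"pooled estimate = {estimate:.4f}, pooled std error = {std_error:.4f}")
```

The (1 + 1/M) factor inflates the between-imputation variance to account for using a finite number of imputations.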
1. Imputation: Multiple Copies of Dataset

Incomplete data (. = missing):
Y      X1     X2    X3
44.61  11.37  178   1
54.3   8.65   156   0
49.87  9.22   .     .
.      11.95  176   1
39.44  13.08  174   1
50.54  .      .     1
44.75  11.12  176   0
51.86  10.33  166   0
40.84  10.95  168   .
46.77  10.25  .     .

Imputed copy, _I_ = 1:
_I_  Y      X1     X2     X3
1    44.61  11.37  178    1
1    54.3   8.65   156    0
1    49.87  9.22   181.2  0.23
1    39.97  11.95  176    1
1    39.44  13.08  174    1
1    50.54  9.117  168.2  1
1    44.75  11.12  176    0
1    51.86  10.33  166    0
1    40.84  10.95  168    0.756
1    46.77  10.25  185.9  0.632

Imputed copy, _I_ = 2:
_I_  Y       X1      X2      X3
2    44.609  11.37   178     1
2    54.297  8.65    156     0
2    49.874  9.22    137.47  0.0666
2    39.849  11.95   176     1
2    39.442  13.08   174     1
2    50.541  9.9192  162.67  1
2    44.754  11.12   176     0
2    51.855  10.33   166     0
2    40.836  10.95   168     0.2288
2    46.774  10.25   184.83  0.0998
Three steps
1. Imputation
• Make M=3 to 10 copies of the incomplete data set, filling in the missing values with conditionally random values
2. Analyses
• Analyze each data set separately
3. Pooling
• Point estimates: average across the M analyses
• Standard errors: combine the within-imputation and between-imputation variances
What is MI?
• Stata
– imputes each variable from its conditional density given the other variables
– chained equations
• SAS
– imputes from the joint distribution of all the variables
– the joint distribution is assumed to be multivariate normal
• SAS IVEware
– same approach as Stata, with more options
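The chained-equations idea can be sketched in a few lines. This is a deliberately simplified illustration, not how ICE or IVEware are implemented: it handles exactly two continuous variables, uses plain least squares, and adds only residual noise, whereas the real tools also draw the regression parameters and choose a model type per variable. The names fit_line, chained_impute and runs are hypothetical. The values are RunTime and RunPulse from the SAS example data in this deck, with None marking a missing value.

```python
import random
from math import sqrt
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sqrt(rss / (len(xs) - 2))

def chained_impute(data, cycles=10, seed=1234):
    """Return a completed copy of a two-variable dict of lists (None = missing)."""
    rng = random.Random(seed)
    first, second = list(data)       # this sketch assumes exactly two variables
    missing = {v: [i for i, val in enumerate(data[v]) if val is None] for v in data}
    # Start by filling every gap with the observed mean of its variable
    filled = {}
    for v in data:
        start = mean(val for val in data[v] if val is not None)
        filled[v] = [start if val is None else val for val in data[v]]
    # Cycle through the variables, re-imputing each from the other
    for _ in range(cycles):
        for target, predictor in ((first, second), (second, first)):
            observed = [i for i, val in enumerate(data[target]) if val is not None]
            a, b, sd = fit_line([filled[predictor][i] for i in observed],
                                [filled[target][i] for i in observed])
            for i in missing[target]:
                # prediction plus a random residual: a "conditionally random" value
                filled[target][i] = a + b * filled[predictor][i] + rng.gauss(0, sd)
    return filled

runs = {
    "RunTime":  [11.37, 8.65, 9.22, 11.95, 13.08, None,
                 11.12, 10.33, 10.95, 10.25, 12.63, 9.63],
    "RunPulse": [178, 156, None, 176, 174, None,
                 176, 166, 168, None, 174, 164],
}
completed = chained_impute(runs)
print(completed["RunTime"][5], completed["RunPulse"][2])
```

Calling chained_impute with M different seeds would give M completed copies, one per imputation.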
Stata Example
• ICE to impute
– Regression commands may be logistic, mlogit, ologit, or regress.
• MICOMBINE to analyze and combine the results.
– Supported regression commands are clogit, cnreg, glm, logistic, logit, mlogit, ologit, oprobit, poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee.
• Easy to use, nice documentation
SAS example
Oxygen   RunTime  RunPulse
44.609   11.37    178
54.297   8.65     156
49.874   9.22     .
.        11.95    176
39.442   13.08    174
50.541   .        .
44.754   11.12    176
51.855   10.33    166
40.836   10.95    168
46.774   10.25    .
39.407   12.63    174
45.441   9.63     164
Step 1: Proc MI
• Typical syntax:
/* Impute the missing values; the imputed copies are stacked in outmi and indexed by _Imputation_ */
proc mi data=mi_example out=outmi seed=1234;
   var Oxygen RunTime RunPulse;
run;
Step 2: Run Models
proc reg data=outmi outest=outreg covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;
run;
Note that the regression output is stored as dataset “outreg”.
Supported procedures: REG, LOGISTIC, GENMOD, MIXED, GLM
Parameter Estimates & Covariance Matrices
proc print data=outreg(obs=8);
   var _Imputation_ _Type_ _Name_ Intercept RunTime RunPulse;
run;

Obs  _Imputation_  _TYPE_  _NAME_     Intercept  RunTime   RunPulse
1    1             PARMS              82.9694    -2.44422  -0.06121
2    1             COV     Intercept  65.1698    0.26463   -0.39518
3    1             COV     RunTime    0.2646     0.14005   -0.0101
4    1             COV     RunPulse   -0.3952    -0.0101   0.00293
5    2             PARMS              85.1831    -3.0485   -0.03452
6    2             COV     Intercept  85.3406    -0.44671  -0.46786
7    2             COV     RunTime    -0.4467    0.13629   -0.00581
8    2             COV     RunPulse   -0.4679    -0.00581  0.00308
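To make the layout of this dataset concrete, the sketch below reads the eight printed rows and pulls out, for each imputation, the coefficient estimates (the PARMS row) and the standard errors (square roots of the diagonal of the COV rows). These are the per-imputation quantities that the pooling step needs. The names rows, EFFECTS and estimates_and_std_errors are hypothetical, and the values are copied from the printout above.

```python
from math import sqrt

EFFECTS = ("Intercept", "RunTime", "RunPulse")

# (_Imputation_, _TYPE_, _NAME_, Intercept, RunTime, RunPulse)
rows = [
    (1, "PARMS", "",          82.9694, -2.44422, -0.06121),
    (1, "COV",   "Intercept", 65.1698,  0.26463, -0.39518),
    (1, "COV",   "RunTime",    0.2646,  0.14005, -0.0101),
    (1, "COV",   "RunPulse",  -0.3952, -0.0101,   0.00293),
    (2, "PARMS", "",          85.1831, -3.0485,  -0.03452),
    (2, "COV",   "Intercept", 85.3406, -0.44671, -0.46786),
    (2, "COV",   "RunTime",   -0.4467,  0.13629, -0.00581),
    (2, "COV",   "RunPulse",  -0.4679, -0.00581,  0.00308),
]

def estimates_and_std_errors(rows):
    """Group rows by imputation into coefficient estimates and standard errors."""
    out = {}
    for imputation, row_type, name, *values in rows:
        entry = out.setdefault(imputation, {"estimate": {}, "std_error": {}})
        if row_type == "PARMS":
            # PARMS row: one coefficient per model effect
            entry["estimate"] = dict(zip(EFFECTS, values))
        elif row_type == "COV":
            # COV row: the diagonal element is the variance of the effect in _NAME_
            entry["std_error"][name] = sqrt(values[EFFECTS.index(name)])
    return out

summary = estimates_and_std_errors(rows)
for imputation, entry in summary.items():
    print(imputation, entry["estimate"]["RunTime"], round(entry["std_error"]["RunTime"], 4))
```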
Step 3. Proc Mianalyze
proc mianalyze data=outreg;
   modeleffects Intercept RunTime RunPulse;
run;

Multiple Imputation Parameter Estimates
Parameter  Estimate   Std Error  95% Confidence Limits  DF
Intercept  92.696519  12.780914  65.35758   120.0355    14.412
RunTime    -2.915452  0.48346    -3.90873   -1.9222     26.264
RunPulse   -0.086795  0.070425   -0.23209   0.0585      24.163

Parameter  Minimum    Maximum     Pr > |t|
Intercept  82.969385  101.288118  <.0001
RunTime    -3.146336  -2.444217   <.0001
RunPulse   -0.13547   -0.034519   0.2296
Irritating Parameter Estimates & Covariance Matrices
• Syntax depends on which procedure you used in the previous step:
• proc mianalyze data=parmcov;
(or)
• proc mianalyze parms=parmsdat covb=covbdat;
(or)
• proc mianalyze parms=parmsdat xpxi=xpxidat;
Procedures: REG, GENMOD, LOGISTIC, MIXED, GLM.
SAS IVEware: 4 Components
1. IMPUTE performs the imputations -- nice options.
2. DESCRIBE estimates the population means, proportions, subgroup differences, contrasts and linear combinations of means and proportions. A Taylor Series approach is used to obtain variance estimates appropriate for a user-specified complex sample design.
3. REGRESS fits linear, logistic, polytomous, Poisson, Tobit and proportional hazards regression models for data resulting from a complex sample design.
4. SASMOD allows users to take into account complex sample design features when analyzing data with several SAS procedures. The SAS procedures that can be called are CALIS, CATMOD, GENMOD, LIFEREG, MIXED, NLIN, PHREG, and PROBIT.
IVEware Impute
IMPUTE assumes the variables in the data set are one of the following five types:
(1) continuous
(2) binary
(3) categorical (polytomous with more than two categories)
(4) counts
(5) mixed (a continuous variable with a point mass at zero)
The types of regression models used are linear, logistic, Poisson, generalized logit or mixed logistic/linear, depending on the type of variable being imputed.
A Few Issues
• Do I impute the dependent variable?
• Which model has more information? The imputation model or the analyst model?
• How many imputations do I need to do?
• Can I impute in one language and analyze in another?
• How do I get summary statistics such as R squared?
• Can I do this in SPSS?
• Where do I go with questions?
Thanks
Next up:
“COLLATERAL CONSEQUENCES OF VIOLENCE IN DISADVANTAGED NEIGHBORHOODS”
Dr. David Harding
Wednesday, February 13,
Noon - 1:00 pm
Accessing and Analyzing Add Health Data
Instructor: Dr. Meredith Porter
Monday, February 25, 12:00-1:00 pm