A new architecture for handling multiply imputed data in Stata JC Galati

A new architecture for handling
multiply imputed data in Stata
JC Galati (1), JB Carlin (1,2), P Royston (3)
(1) Murdoch Childrens Research Institute (MCRI), Melbourne
(2) The University of Melbourne
(3) MRC Clinical Trials Unit, London
Missing data
• Why do we need additional tools for analysing
datasets with missing values?
• Traditional methods work with complete datasets
• Statistical packages discard incomplete observations
when analysing an incomplete dataset
• i.e. a complete-case analysis is performed
• This can lead to loss of power, and possibly to biased
estimates, depending on why the data went missing
Multiple imputation (MI)
• Introduced by Donald Rubin (1987 book, Wiley)
• Based on Bayesian principles
• Both the data-generating mechanism and the
missingness mechanism are modelled
• Fairly broad assumptions about data-generating model
• Fairly restrictive assumptions about missingness mechanism
• Modelling assumptions apply at the imputation level
• Statistical modelling is general (once data is
imputed)
• Post estimation – some; more work needs to be done
• Diagnostics – theory and practice not yet worked out
• Model-building – in its infancy – work has started
MI data analysis
• Start with a dataset with some values missing
• Missing values are imputed multiple times
• Using a Bayesianly “proper” imputation method
• This creates m sets of completed data
• Each completed dataset is analysed separately:
• Standard complete-data estimation methods are used
• E.g. linear regression, logistic regression
Inference (estimation) using MI
• Coefficient estimates and variances (squared SEs) from
complete-data analyses are combined using Rubin’s
Rules:
• Parameter estimates:
• Average of the complete-data parameter estimates
• Variance is the sum of two components:
• Within-imputation variance (average of the complete-data
variances)
• Between-imputation variance (variance of the complete-data
parameter estimates across imputations, inflated by a factor of 1 + 1/m)
• Point estimates divided by their SEs have approximate t
distributions
• Estimate d.f. and use t-multipliers to get confidence intervals
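Written out, the combining rules above take the following standard form (Rubin, 1987). The symbols are notation introduced here for illustration and do not appear on the slides: \hat{Q}_j and U_j denote the point estimate and its estimated variance from the j-th completed dataset, m is the number of imputations, and the between-imputation component enters the total variance with a factor of 1 + 1/m.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, &
W &= \frac{1}{m}\sum_{j=1}^{m}U_j, \\
B &= \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^2, &
T &= W + \Bigl(1+\frac{1}{m}\Bigr)B, \\
\nu &= (m-1)\Bigl[1+\frac{W}{(1+1/m)\,B}\Bigr]^{2}, &
\bar{Q} &\pm t_{\nu,\,1-\alpha/2}\sqrt{T}.
\end{align*}
```

Here \bar{Q} is the pooled estimate, W and B are the within- and between-imputation variances, T is the total variance, and \nu is the approximate degrees of freedom used for the t-multiplier.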
Background (MI in Stata)
• What is available in Stata currently?
• “MI Tools,” Carlin et al., Stata J. 2003:
• Imputed datasets stored in separate dta files
myfile1.dta, ... , myfile`m'.dta
• Estimation:
• mifit with:
regress, logit, probit, clogit, glm,
logistic, poisson, svyreg, svylogit,
svyprobit, svypoisson, xtgee, xtreg
• Post estimation:
milincom, mitestparm
• Data manipulation
miset, miappend, mimerge, mido, misave
Background (MI in Stata)
• Main drawbacks of “MI Tools”:
• Loose association between original and imputed data
• Loose association between individual imputed datasets
• Limit to range of estimation commands supported (13)
• Choice of coding of some aspects resulted in slow
execution time in some cases
• No capacity to perform imputation
Background (MI in Stata)
• What is available in Stata currently? (cont.)
• ice, micombine, Royston Stata J. 2004/05:
• ice stores imputed datasets in a single dta file
• uses impid and obsid vars
• Estimation:
• micombine with
clogit, cnreg, glm, logistic, logit,
poisson, probit, qreg, regress, rreg,
xtgee, streg, stcox, ologit, oprobit, mlogit
• Post estimation:
• results returned in e(b) , e(V) etc.
• onus on user to know when post-estimation command
applied directly to combined estimates is valid
Background (MI in Stata)
• ice, micombine, Royston Stata J. 2004/05 (cont.):
• Data manipulation
• left to user, but stacked format facilitates simple
transformation of variables etc.
• mijoin, misplit (for conversion between formats)
• Main drawbacks
• Limit to range of estimation commands supported (16)
• Manipulation that changes the number of observations in
each dataset is not easily supported (e.g. reshape)
• Not clear when/if post-estimation is valid
mim: A new architecture
Main aims
• To unify two sets of tools into a single architecture
• To combine functionality of both sets of tools
• To simplify the command syntax
• To extend the range of estimation commands
supported
• Better post-estimation facilities
• testparm, lincom, predict
• Make it harder to do crazy things
• Add other post-estimation commands later
mim: A new architecture
Scope:
• Creation of imputations is NOT included
• But easy for users to put imputed datasets into mim
format
• Architecture covers analysis and manipulation of
existing imputed datasets
• Designed to handle:
• Estimation
• Data manipulation (reshape, append & merge)
• Post-estimation (lincom, testparm & predict)
• Replay (management of estimation results)
• Utility functions
mim: A new architecture
Storage of imputed datasets:
• Based on Royston’s stacked format
• Fixed names for impid and obsid vars
• _mj (impid) and _mi (obsid)
• no need for:
• dataset characteristics to record the names
• additional command options to specify the names
• dedicated set command to manage the characteristics
• stacking requires only generate, append and replace
• Original data stored in the stack
• _mj == 0
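A minimal sketch of building the stack by hand with generate, append and replace, using the toy values from the storage illustration. The tempfile names and the inline "imputation" are assumptions made for this example; in practice the completed data would come from an imputation tool such as ice. It is meant to be run as a do-file, since tempfiles and local macros are scoped to the do-file.

```stata
* Original (incomplete) data: values copied from the illustration slide
clear
input y x
1.1 105
9.2 106
1.1 .
2.3 .
7.5 108
7.9 .
end
generate _mj = 0          // 0 marks the original data
generate _mi = _n         // observation identifier
tempfile stack imp1       // hypothetical temporary file names
save `stack'

* One completed dataset (stand-in for the output of an imputation run)
clear
input y x
1.1 105
9.2 106
1.1 109.796
2.3 110.456
7.5 108
7.9 102.243
end
save `imp1'

* Stack: appended rows arrive with _mj and _mi missing, so fill them in
use `stack', clear
local n0 = _N
append using `imp1'
replace _mj = 1 if missing(_mj)
replace _mi = _n - `n0' if _mj == 1
sort _mj _mi
list, sepby(_mj)
```

Further imputations would be added the same way, with _mj set to 2, 3, and so on.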
mim: A new architecture
Storage of datasets: illustration
_mj   _mi     y         x
---------------------------------
  0     1   1.1       105
  0     2   9.2       106
  0     3   1.1         .
  0     4   2.3         .
  0     5   7.5       108
  0     6   7.9         .
  1     1   1.1       105
  1     2   9.2       106
  1     3   1.1   109.796
  1     4   2.3   110.456
  1     5   7.5       108
  1     6   7.9   102.243
  2     1   1.1       105
  2     2   9.2       106
  2     3   1.1   107.952
  2     4   2.3   115.968
  2     5   7.5       108
  2     6   7.9   114.479
mim: A new architecture
Command structure:
• A single command prefix called mim
• mim processes the multiply-imputed dataset
currently in memory
• Typical syntax
. mim: command
• E.g.
. use myImputedData, clear
. mim: regress y x1 x2 x3
. mim: predict yhat
. mim: lincom x1+x2+x3, or
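For intuition about the kind of work a prefix like mim takes off the user's hands, the sketch below fits a regression on each imputed dataset in a stacked file and pools one coefficient by hand using Rubin's Rules. It is an illustration under stated assumptions, not mim's actual implementation: the data are the toy values from the storage illustration, and the scalar names (Qbar, Wvar, Bvar, Tvar, nu) are invented for this example.

```stata
* Toy stacked data (_mj = 0 is the original data, _mj = 1, 2 are imputations)
clear
input _mj _mi y x
0 1 1.1 105
0 2 9.2 106
0 3 1.1 .
0 4 2.3 .
0 5 7.5 108
0 6 7.9 .
1 1 1.1 105
1 2 9.2 106
1 3 1.1 109.796
1 4 2.3 110.456
1 5 7.5 108
1 6 7.9 102.243
2 1 1.1 105
2 2 9.2 106
2 3 1.1 107.952
2 4 2.3 115.968
2 5 7.5 108
2 6 7.9 114.479
end

quietly summarize _mj
local m = r(max)                      // number of imputations

* Fit the model on each completed dataset; accumulate the averages
scalar Qbar = 0
scalar Wvar = 0
forvalues j = 1/`m' {
    quietly regress y x if _mj == `j'
    scalar q`j' = _b[x]               // complete-data estimate
    scalar Qbar = Qbar + _b[x]/`m'    // average of the estimates
    scalar Wvar = Wvar + _se[x]^2/`m' // within-imputation variance
}

* Between-imputation variance, total variance and degrees of freedom
scalar Bvar = 0
forvalues j = 1/`m' {
    scalar Bvar = Bvar + (q`j' - Qbar)^2/(`m' - 1)
}
scalar Tvar = Wvar + (1 + 1/`m')*Bvar
scalar nu   = (`m' - 1)*(1 + Wvar/((1 + 1/`m')*Bvar))^2

display "pooled coefficient on x: " Qbar
display "pooled standard error:   " sqrt(Tvar)
display "approximate d.f.:        " nu
```

With only two imputations the degrees of freedom will be small; the example is sized to match the slide, not to suggest a sensible number of imputations.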
mim: A new architecture
Commands (cont.):
• General syntax
• Default behaviour of mim may be modified through
mim options:
mim [, mim_options]: command
• mim_options depend on whether one wishes to do
• estimation
• data manipulation
• post-estimation
• replay
Using mim
Estimation:
• mim recognises 28 estimation commands
regress mean proportion ratio logistic
logit ologit mlogit probit oprobit poisson
glm binreg blogit clogit cnreg mvreg rreg
qreg iqreg sqreg bsqreg stcox streg
xtgee xtreg xtlogit xtmixed
• and 11 svy commands
svy:regress
svy:mean
...
svy:poisson
• Plus, in principle any Stata estimation command may
be used:
mim, category(fit): estimation_command
Using mim
Data manipulation:
• Stacked format allows simple manipulation using
existing Stata commands:
. generate, replace, label etc.
. by _mj: tabulate ...
• mim recognises 3 data manipulation commands
. mim: reshape cmdline
. mim: append using “another mim dataset”
. mim, sort(varlist) : merge using ...
• In principle, any Stata data manipulation command
may be used with mim:
mim, category(manip) sort(varlist): manip_command
Using mim
• mim recognises some post-estimation commands:
. mim: lincom cmdline
. mim: testparm cmdline
. mim: predict xbvar [, eq(name)]
. mim: predict sevar, stdp [eq(name)]
• Replay combined estimates:
. mim
• Replay individual estimates (#th imputed dataset):
. mim, j(#)
Using mim
• Interactive example in Stata
Final comments
• Difficulties faced:
• Simplicity of programming versus ease of use and flexibility
• Inconsistencies between commands resulted in more
tailoring than we’d hoped
• Progress
• Coding of version 1 complete
• Current version is 1.0.3
• Help file written
• Has been in beta-testing for several months
• Submitted for publication in Stata Journal