ice_help

advertisement
----------------------------------------------------------------------------------------------help for ice, uvis
Patrick Royston
----------------------------------------------------------------------------------------------Multiple imputation by the MICE system of chained equations
ice mainvarlist using filename[.dta] [if exp] [in range] [weight] , [ boot[(varlist)]
cc(varlist) cmd(cmdlist) cycles(#) dropmissing dryrun eq(eqlist)
genmiss(string) id(string) interval(intlist) m(#) match[(varlist)]
noconstant noshoweq on(varlist) orderasis passive(passivelist)
substitute(sublist) replace seed(#) trace(filename) ]
uvis regression_cmd {yvar|llvar ulvar} xvarlist [if exp] [in range] [weight] ,
gen(newvarname) [ boot match replace seed(#) ]
where
regression_cmd may be intreg, logistic, logit, mlogit, ologit, or regress. llvar
ulvar are required with intreg.
All weight types supported by regression_cmd are allowed; see weights.
Description
ice imputes missing values in mainvarlist by using switching regression, an iterative
multivariable regression technique. The abbreviation MICE means multiple imputation by
chained equations, and was apparently coined by Steff van Buuren. ice implements MICE for
Stata. Sets of imputed and non-imputed variables are stored to a new file called
filename. Any number of complete imputations may be created. The original data are stored
in filename as "imputation number 0" and the new variable _mj is set to 0 for these
observations.
uvis (univariate imputation sampling) imputes missing values in the single variable yvar
based on multiple regression on xvarlist. uvis is called repeatedly by ice in a
regression switching mode to perform multivariate imputation.
The missing observations are assumed to be "missing at random" (MAR) or "missing
completely at random" (MCAR), according to the jargon. See for example van Buuren et al
(1999) for an explanation of these concepts.
Please note that ice and uvis require Stata 8.0 or higher. There have been
incompatibility issues with Stata 7 or lower.
Options for ice
boot[(varlist)] instructs that each member of varlist, a subset of mainvarlist, be
imputed with the boot option of uvis activated. If (varlist) is omitted then all
members of mainvarlist with missing observations are imputed using the boot option of
uvis.
cc(varlist) prevents imputation of missing data in mainvarlist for cases in which any
member of varlist has a missing value. "cc" signifies "complete case". Note that
members of varlist are used for imputation if they appear in mainvarlist, but not
otherwise. Use of this option is equivalent to entering if ~missing(var1) &
~missing(var2) ..., where var1, var2, ... denote the members of varlist.
cmd(cmdlist) defines the regression commands to be used for each variable in mainvarlist,
when it becomes the dependent variable in the switching regression procedure used by
uvis (see Remarks). The first item in cmdlist may be a command such as regress or
may have the syntax varlist:cmd, specifying that command cmd applies to all the
variables in varlist. Subsequent items in cmdlist must follow the latter syntax, and
each item should be followed by a comma.
The default cmd for a variable is logit when there are two distinct values, mlogit
when there are 3-5 and regress otherwise.
Example: cmd(regress) specifies that all variables are to be imputed by regress,
over-riding the defaults
Example: cmd(x1 x2:logit, x3:regress) specifies that x1 and x2 are to be imputed by
logit, x3 by regress and all others by their default choices
cycles(#) determines the number of cycles of regression switching to be carried out.
Default # is 10.
dropmissing is a feature designed to save memory when using the file of imputed data
created by ice. It omits from filename all observations which are not in the
estimation sample, that is for which either (i) they are filtered out by if or in, or
a non-positive weight, or (ii) the values of all variables in mainvarlist are
missing. This option provides a "clean" analysis file of imputations, with no
missing values. Note that the observations not in the estimation sample are omitted
also from the original data, stored as imputation #0 in filename.
dryrun does a "dry run", that is, ice reports the prediction equations it has constructed
from the various inputs. No imputation is done and no files are created. It is not
mandatory to specify an output file with using for a dry run. Sometimes the
prediction equation set-up needs to be carefully checked before running what may be a
lengthy imputation process.
eq(eqlist) allows one to define customised prediction equations for any subset of
variables in mainvarlist. The option, particularly when used with passive(), allows
great flexibility in the possible imputation schemes. The syntax of eqlist is
varname1:varlist1 [,varname2:varlist2 ...] where each varname# (or varlist#) is a
member (or subset) of mainvarlist. It is your responsibility to ensure that each
equation is sensible. ice places no restrictions except to check that all variables
mentioned are indeed in mainvarlist, and that an equation is not defined for a
variable specified to be passively imputed (see the passive() option. Note that eq()
takes precedence over all default definitions and assumptions about the way a given
variable in mainvarlist will be imputed. The default, if the passive() and
substitute() options are not invoked, is that each variable in mainvarlist with any
missing data is imputed from all the other variables in mainvarlist.
genmiss(string) creates an indicator variable for the missingness of data in any variable
in mainvarlist for which at least one value has been imputed. The indicator variable
is set to missing for observations excluded by if, in, etc. The indicator variable
for xvar is named stringxvar. This option is left for backwards compatibility, but
now that the original data are stored in the output file, it is no longer really
needed. The information on missingness is implicit in the original data stored as
"imputation 0".
id(string) creates a variable called string containing the original sort order of the
data. Default string: _mi.
interval(intlist) imputes interval-censored variables. An interval-censored value is one
which is known to lie in an interval [a, b], where a may be finite or minus infinity,
b may be finite or plus infinity, and a <= b. When either a or b is infinite we have
left or right censoring, respectively. intlist has the syntax varname:llvar ulvar [,
varname:it:llvar ulvar ...], where each varname is an interval-censored variable,
each llvar contains the lower bound (a) for varname and each ulvar contains the upper
bound (b) for varname (or a missing value to represent plus or minus infinity). The
supplied values of varname are irrelevant since they will be replaced anyway; it is
only required that varname exist. Observations with llvar missing and ulvar present
are left-censored for varname. Observations with llvar present and ulvar missing are
right-censored for varname. Observations with llvar = ulvar are complete, and no
imputation is done for them. Observations with both llvar and ulvar missing are
imputed assuming an uncensored normal distribution. See Remarks for further
information.
m(#) defines # as the number of imputations required (minimum 1, no upper limit). The
default # is 1.
match[(varlist)] instructs that each member of varlist be imputed with the match option
of uvis. This provides prediction matching for each member of varlist. If (varlist)
is omitted then all relevant variables are imputed with the match option of uvis. The
default, if match() is not specified, is to draw from the posterior predictive
distribution of each variable requiring imputation.
noshoweq suppresses the presentation of the prediction equations.
noconstant suppresses the regression constant in all regressions.
on(varlist) changes the operation of ice in a major way. With this option, uvis imputes
each member of mainvarlist univariately on varlist. This provides a convenient way of
producing multiple imputations when imputation for each variable in mainvarlist is to
be done univariately on a set of complete predictors.
orderasis enters the variables in mainvarlist into the MICE algorithm in the order given.
The default is to order them according to the number of missing values: the variable
with least missingness gets imputed first, and so on.
passive(passivelist) allows the use of "passive" imputation of variables that depend on
other variables, some of which are imputed. The syntax of passivelist is varname:exp
[\varname:exp ...]. Notice the requirement to use "\" as a separator between items in
passivelist, rather than the usual comma; the reason is that a comma may be a valid
part of an expression. The option is most easily explained by example. Suppose x1 is
a categorical variable with 3 levels, and that two dummy variables x1a, x1b have been
created by the commands
. generate byte x1a=(x1==2)
. generate byte x1b=(x1==3)
Now suppose that x1 is to be imputed by the mlogit command, and is to be treated as
the two dummy variables x1a and x1b when predicting other variables. Use of mlogit
is achieved by the option cmd(x1:mlogit). When x1 is imputed, we want x1a and x1b to
be updated with new values which depend on the imputed values of x1. This may be
achieved by specifying passive(x1a:x1==2 \ x1b:x1==3). It is necessary also to remove
x1 from the list of predictors when variables other than x1 are being imputed, and
this is done by using the substitute() option; in the present example, you would
specify substitute(x1:x1a x1b).
Note that although in this example x1a will take the (possibly unintended) value of 0
when x1 is missing, ice is careful to ensure that x1a (and x1b) inherit the
missingness of x1, and are passively imputed following active imputation of missing
values of x1. If this were not done, incorrect results could occur. The
responsibility of the user is to create x1a and x1b before running ice such that
their missing values are identical to those of x1.
A second example is multiplicative interactions between variables, for example,
between x1 and x2 (e.g. x12=x1*x2); this could be entered as passive(x12:x1*x2). It
would cause the interaction term x12 to be omitted when either x1 or x2 was being
imputed, since it would make no sense to impute x1 from its interaction with x2.
substitute() is not needed here.
It should be stressed that variables to be imputed passively must already exist and
must be included in mainvarlist, otherwise they will not be recognised.
replace permits filename to be overwritten with new data. replace may not be
abbreviated.
seed(#) sets the random number seed to #. To reproduce a set of imputations, the same
random number seed should be used. Default #: 0, meaning no seed is set by the
program.
substitute(sublist) is typically used with the passive() option to represent multilevel
categorical variables as dummy variables in models for predicting other variables.
See passive() for more details. The syntax of sublist is varname:dummyvarlist
[,varname:dummyvarlist ...] where varname is the name of a variable to be substituted
and dummyvarlist is the list of dummy variables representing it.
Note, however, the following important convenience feature: substitute() may be used
without corresponding expressions in passive() to recreate dummy variables
automatically. If the values of variables in dummyvarlist are NOT defined through
expressions involving varname in the passive() option, then the variables in
dummyvarlist are calculated according to the actual range of values of varname. For
example, suppose the options passive(x1a:x1==2 \ x1b:x1==3) and
{cmd:substitute(x1:x1a x1b) were specified. Provided that all the non-missing values
of x1 were 2 when x1a==1 and all the non-missing values of x1 were 3 when x1b==1,
then passive(x1a:x1==2 \ x1b:x1==3) is implied by substitute(x1:x1a x1b) and can be
omitted. The rule applied by substitute(x:dummy1 [dummy2...]) for defining dummy
variables dummy1, dummy2, ... is as follows:
1. Determine the range of values [xmin, xmax] of x for which dummy1 > 0.
2a. If xmin < xmax, define dummy1 to be 1 if xmin <= x <= xmax and 0 otherwise.
2b. If xmin = xmax, define dummy1 to be 1 if x = xmin and 0 otherwise.
3. Repeat steps 1 and 2a,b for dummy2, dummy3, ... as necessary.
With many such categorical variables this feature can save a lot of typing.
trace(filename) monitors the convergence of the imputation algorithm. For each original
variable with missing values, the mean of the imputed values is stored as a variable
in filename, together with the cycle number at which that mean was calculated. The
results are stored only for the final imputation. For diagnostic purposes, it is
sensible to run trace() with m(1) and a large number of cycles, such as cycles(100).
When the run is complete, it is helpful to load filename into memory and plot the
mean for each imputed variable against the cycle number. If necessary, smoothing may
be applied to clarify any apparent pattern. Convergence is judged to have occurred
when the pattern of the imputed means is random. It is usually obvious from the
appearance of the plot how many cycles are needed for convergence.
Options for uvis
boot invokes a bootstrap method for creating imputed values (see Remarks).
gen(newvar) is not optional. newvar contains original (non-missing) and imputed
(originally missing) values of yvar.
match creates imputations by prediction matching. The default is to draw imputations at
random from the posterior distribution of the missing values of yvar, conditional on
the observed values and the members of xvarlist. See Remarks for further details.
noconstant suppresses the regression constant in all regressions.
replace permits newvar (see gen(newvar)) to be overwritten with new data. replace may not
be abbreviated.
seed(#) sets the random number seed to #. See Remarks for comments on how to ensure
reproducible imputations by using the seed() option. Default #: 0, meaning no seed
is set by the program.
Remarks
uvis imputes yvar from xvarlist according to the following algorithm (see van Buuren et
al (1999) section 3.2 for further technical details):
1. Estimate the vector of coefficients (beta) and the residual variance by regressing
the non-missing values of yvar on the current "completed" version of xvarlist.
Predict the fitted values etaobs at the non-missing observations of yvar.
2. Draw at random a value (sigma_star) from the posterior distribution of the
residual standard deviation.
3. Draw at random a value (beta_star) from the posterior distribution of beta,
allowing, through sigma_star, for uncertainty in beta.
4. Use beta_star to predict the fitted values etamis at the missing observations of
yvar.
5. The imputed values are predicted directly from beta_star, sigma_star and the
covariates. When imputation is by linear regression (regress command), this step
assumes that yvar is Normally distributed, given the covariates. For other types
of imputation, samples are drawn from the appropriate distribution.
With the match option, step 5 is replaced by the following. For each missing observation
of yvar with prediction etamis, find the non-missing observation of yvar whose prediction
(etaobs) on observed data is closest to etamis. This closest non-missing observation is
used to impute the missing value of yvar.
The default draw method is not robust to departures from Normality and may produce
implausible imputations. For example, if the original distribution is skew and
positive-valued, the imputed distribution will not necessarily have the appropriate
amount of skewness, nor will all the imputed values necessarily be positive. Log
transformation of positive variables may greatly improve the appropriateness of the
imputations.
The alternative match method is recommended only for continuous variables when the
Normality assumption is clearly untenable, even approximately. It is not necessary, nor
is it recommended, for binary, ordered categorical or nominal variables. match may work
well when the distribution of a continuous variable is very non-Normal, but it may
sometimes result in biased imputations.
With the boot option, steps 2-4 are replaced by a bootstrap estimation of beta_star;
beta_star is estimated by regressing yvar on xvarlist after taking a bootstrap sample of
the non-missing observations. This has the advantage of robustness since the distribution
of beta is no longer assumed to be multivariate normal.
Note that uvis will not impute observations for which a value of a variable in xvarlist
is missing. However, all original (missing or non-missing) observations of yvar will be
copied into newvarname in such cases. This is a change from the first release of uvis
(with mvis). Previously, newvarname would be set to missing whenever a value of a
variable in xvarlist was missing, irrespective of the value of yvar.
Missing data for ordered (or unordered) categorical covariates should be imputed by using
the ologit (or mlogit) command. In these cases, prediction matching is done on the scale
of the mean absolute difference in the predicted class probabilities, preceded by logit
transformation.
ice carries out multivariate imputation in mainvarlist using regression switching (van
Buuren et al 1999) as follows:
1. Ignore any observations for which mainvarlist has only missing values, or if the
ccvarlist(varlist) option has been specified, for which any member of varlist has
a missing value.
2. For each variable in mainvarlist with any missing data, randomly order that
variable and replicate the observed values across the missing cases. This step
initialises the iterative procedure by ensuing that no relevant values are
missing.
3. For each variable in mainvarlist in turn, impute missing values by applying uvis
with the remaining variables as covariates.
4. Repeat step 3 cycles() times, replacing the imputed values with updated values at
the end of each cycle.
A single imputation sample is created for each variable with any relevant missing values.
Van Buuren recommends cycles(20) but goes on to say that 10 or even 5 iterations are
probably sufficient. We have chosen a compromise default of 10.
"Multiple imputation" (MI) implies the creation and analysis of several imputed datasets.
To do this, one would run ice with m set to a suitable number, for example 5. To obtain
final estimates of the parameters of interest and their standard errors, one would fit a
model in each imputation and carry out the appropriate post-MI averaging procedure on the
results from the m separate imputations. A suitable estimation tool for this purpose is
micombine.
Handling categorical variables
Binary variables present no difficulty: by default, in the MICE procedure, when such a
variable is the response, it is predicted from other variables by using logistic
regression; when it is a covariate, it is modelled in the only way possible, effectively
as a single dummy variable. Categorical variables with 3 or more levels may in principle
be treated in different ways. By default, in ice variables with 3-5 levels are modelled
using multinomial logistic regression (mlogit command) when the response, and as a single
linear term when a covariate. The same behaviour occurs with the ordered logistic model
(ologit command), requested via the cmd() option. The use of dummy variables instead of a
single linear term may be imposed as described under the passive() option. The requisite
dummy variables must be created before ice is invoked. Variables with 6 or more levels
are treated as ordered and continuous, but again different choices may be imposed by use
of the cmd(), passive() and substitute() options.
You should be aware that unless the dataset is large, use of the mlogit command may
produce unstable estimates if the number of levels is too large, and may compromise the
accuracy of the imputations. It is hard to predict when this will occur.
Note that due to a peculiarity of the way the mlogit command works, variables with score
labels cause problems to ice and uvis when missing data are imputed using mlogit. Score
labels for such variables are removed in the file of imputed data. See also the related
comment on Post-estimation prediction in micombine.
Interval censoring
Values of a variable y that are interval censored are imputed under the assumption that y
is normally distributed with unknown mean and variance. The method, which is fast and
efficient, is essentially as described for right-censored variables in section 3.3 of
Royston (2001). A minor extension to allow left or interval censoring is employed. For
example, if A < y < B and A and B are both finite, the imputed value for y will follow a
truncated normal distribution with bounds A and B, variance parameter estimated from the
data and mean given by the linear predictor for the imputation model for y. Stata's
intreg command is used to estimate the mean and variance of y. When A and B are both
missing (infinite), imputation of y simply assumes the normal distribution just
mentioned, but without bounds.
If you wish to impose range limits on the imputed values, the lower and upper bound
variables may be set accordingly. For example, to impute right-censored (e.g. survival)
data, you would set llvar equal to all the observed times to event, whether censored or
not, and ulvar to all the uncensored event times and missing for the censored times.
This would cause the right-censored values to be imputed without restriction. If you
wanted to bound the imputed values above, say by 10, you would specify ulvar to be 10
(rather than missing) for all the censored observations.
Errors and diagnostics
ice may occasionally detect an anomaly when running uvis with a particular variable as
response and a particular regression command. ice will then stop and report the uvis
command it was running and the error number returned. Often the problem lies in a
regression of a binary or categorical variable where the estimation procedure fails to
converge; this is usually caused by sparse cell occupancy of the response variable. If
you obtain this error you should either omit the offending variable from the imputation,
or seek to combine a sparse category with another category.
Another possibility is that, again due to a defect in a particular regression command in
the chained equations structure, the number of values imputed for a particular variable
is less than expected. This is a serious error and again may arise from estimation
problems involving a binary or categorical variable. In this situation, ice saves in the
working directory a snapshot of the data it was using in the attempted estimation to a
file called _ice_dump.dta, while also reporting the uvis command it was executing. You
can then investigate what may have gone wrong with the command by loading the data in
_ice_dump.dta and re-running the offending regression command.
Further notes
ice determines the order of imputing variables in the round of chained equations
according to the amount of missing data. Variables with the least missingness are
imputed first.
An important application of MI is to investigate possible models, for example prognostic
models, in which selection of influential variables is required (Clark & Altman 2003).
For example, the stability of the final model across the imputation samples is of
interest. This area of enquiry is in its infancy.
In survival analysis, it is recommended to include the censoring indicator and the log of
the survival time in the variables to be used for imputation. Van Buuren et al (1999)
give a detailed discussion of the different types of covariate that can be included in
the imputation model and discuss the important issue of how to deal with variables which
are missing completely at random (MCAR), missing at random (MAR) and not missing at
random (NMAR).
See also Van Buuren's website http://www.multiple-imputation.com for further information
and software sources.
Examples
. uvis regress y x1 x2 x3, gen(ym)
. uvis intreg ll ul x1 x2 x3, gen(y)
. ice x1 x2 x3 using imputed, m(5)
. ice x1 x2 x3 using imputed, m(5) cycles(20) cc(x4 x5)
. ice x1-x5 using imputed, m(10) boot match(x1 x2 x3) cmd(x1 x2:mlogit, x3:ologit)
id(pid) seed(101) genmiss(m_)
. ice x1 x1a x1b x2 x3 x23 using imputed, m(5) cmd(x1:ologit) passive(x1a:x1==2
\x1b:x1==3 \x23=x2*x3) substitute(x1:x1a x1b)
. ice y1 y2 y3 x1 x2 x3 x4 using imputed, m(5) eq(y1:x1 x2 y2, y2:y1 x3 x4, y3:y1 y2)
match(y3)
. ice x1 x2 x3 using imputed, m(5) cmd(x1:ologit) match(x2) dropmissing
. ice x1 ll2 ul2 x2 ll3 ul3 x3 using imputed, m(5) interval(x2:ll2 ul2, x3:ll3 ul3)
Author
Patrick Royston, MRC Clinical Trials Unit, London.
pr@ctu.mrc.ac.uk
Further reading
van Buuren S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of missing
blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694.
Also see http://www.multiple-imputation.com.
Carlin J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple
imputed datasets. Stata Journal 3(3):226-244.
Clark T. G. and D. G. Altman. 2003. Developing a prognostic model in the presence of
missing data: an ovarian cancer case-study. Journal of Clinical Epidemiology
5628-37.
Royston P. 2001. The lognormal distribution as a model for survival time in cancer, with
an emphasis on prognostic factors. Statistica Neelandica 55:89-104.
Royston P. 2004. Multiple imputation of missing values. Stata Journal 4(3):227-241.
Royston P. 2005a. Multiple imputation of missing values: update. Stata Journal 5:
188-201.
Royston P. 2005b. Multiple imputation of missing values: update of ice. Stata Journal 5:
527-536.
Acknowledgement
I am grateful to Gillian Raab for pointing out certain issues with the prediction
matching approach, particularly that it is only useful with continuous variables. As a
result, the default imputation method has been changed from matching to drawing from the
predictive distribution. Gillian also suggested imputing the variables in reverse order
of the amount of missingness, and selecting the imputed value at random from the set
determined by the available matching predictions. Both suggestions have been implemented.
Also see
On-line: help for mijoin, micombine, mitools and related programs (if installed).
Download