----------------------------------------------------------------------------------------------help for ice, uvis Patrick Royston ----------------------------------------------------------------------------------------------Multiple imputation by the MICE system of chained equations ice mainvarlist using filename[.dta] [if exp] [in range] [weight] , [ boot[(varlist)] cc(varlist) cmd(cmdlist) cycles(#) dropmissing dryrun eq(eqlist) genmiss(string) id(string) interval(intlist) m(#) match[(varlist)] noconstant noshoweq on(varlist) orderasis passive(passivelist) substitute(sublist) replace seed(#) trace(filename) ] uvis regression_cmd {yvar|llvar ulvar} xvarlist [if exp] [in range] [weight] , gen(newvarname) [ boot match replace seed(#) ] where regression_cmd may be intreg, logistic, logit, mlogit, ologit, or regress. llvar ulvar are required with intreg. All weight types supported by regression_cmd are allowed; see weights. Description ice imputes missing values in mainvarlist by using switching regression, an iterative multivariable regression technique. The abbreviation MICE means multiple imputation by chained equations, and was apparently coined by Steff van Buuren. ice implements MICE for Stata. Sets of imputed and non-imputed variables are stored to a new file called filename. Any number of complete imputations may be created. The original data are stored in filename as "imputation number 0" and the new variable _mj is set to 0 for these observations. uvis (univariate imputation sampling) imputes missing values in the single variable yvar based on multiple regression on xvarlist. uvis is called repeatedly by ice in a regression switching mode to perform multivariate imputation. The missing observations are assumed to be "missing at random" (MAR) or "missing completely at random" (MCAR), according to the jargon. See for example van Buuren et al (1999) for an explanation of these concepts. Please note that ice and uvis require Stata 8.0 or higher. There have been incompatibility issues with Stata 7 or lower. Options for ice boot[(varlist)] instructs that each member of varlist, a subset of mainvarlist, be imputed with the boot option of uvis activated. If (varlist) is omitted then all members of mainvarlist with missing observations are imputed using the boot option of uvis. cc(varlist) prevents imputation of missing data in mainvarlist for cases in which any member of varlist has a missing value. "cc" signifies "complete case". Note that members of varlist are used for imputation if they appear in mainvarlist, but not otherwise. Use of this option is equivalent to entering if ~missing(var1) & ~missing(var2) ..., where var1, var2, ... denote the members of varlist. cmd(cmdlist) defines the regression commands to be used for each variable in mainvarlist, when it becomes the dependent variable in the switching regression procedure used by uvis (see Remarks). The first item in cmdlist may be a command such as regress or may have the syntax varlist:cmd, specifying that command cmd applies to all the variables in varlist. Subsequent items in cmdlist must follow the latter syntax, and each item should be followed by a comma. The default cmd for a variable is logit when there are two distinct values, mlogit when there are 3-5 and regress otherwise. Example: cmd(regress) specifies that all variables are to be imputed by regress, over-riding the defaults Example: cmd(x1 x2:logit, x3:regress) specifies that x1 and x2 are to be imputed by logit, x3 by regress and all others by their default choices cycles(#) determines the number of cycles of regression switching to be carried out. Default # is 10. dropmissing is a feature designed to save memory when using the file of imputed data created by ice. It omits from filename all observations which are not in the estimation sample, that is for which either (i) they are filtered out by if or in, or a non-positive weight, or (ii) the values of all variables in mainvarlist are missing. This option provides a "clean" analysis file of imputations, with no missing values. Note that the observations not in the estimation sample are omitted also from the original data, stored as imputation #0 in filename. dryrun does a "dry run", that is, ice reports the prediction equations it has constructed from the various inputs. No imputation is done and no files are created. It is not mandatory to specify an output file with using for a dry run. Sometimes the prediction equation set-up needs to be carefully checked before running what may be a lengthy imputation process. eq(eqlist) allows one to define customised prediction equations for any subset of variables in mainvarlist. The option, particularly when used with passive(), allows great flexibility in the possible imputation schemes. The syntax of eqlist is varname1:varlist1 [,varname2:varlist2 ...] where each varname# (or varlist#) is a member (or subset) of mainvarlist. It is your responsibility to ensure that each equation is sensible. ice places no restrictions except to check that all variables mentioned are indeed in mainvarlist, and that an equation is not defined for a variable specified to be passively imputed (see the passive() option. Note that eq() takes precedence over all default definitions and assumptions about the way a given variable in mainvarlist will be imputed. The default, if the passive() and substitute() options are not invoked, is that each variable in mainvarlist with any missing data is imputed from all the other variables in mainvarlist. genmiss(string) creates an indicator variable for the missingness of data in any variable in mainvarlist for which at least one value has been imputed. The indicator variable is set to missing for observations excluded by if, in, etc. The indicator variable for xvar is named stringxvar. This option is left for backwards compatibility, but now that the original data are stored in the output file, it is no longer really needed. The information on missingness is implicit in the original data stored as "imputation 0". id(string) creates a variable called string containing the original sort order of the data. Default string: _mi. interval(intlist) imputes interval-censored variables. An interval-censored value is one which is known to lie in an interval [a, b], where a may be finite or minus infinity, b may be finite or plus infinity, and a <= b. When either a or b is infinite we have left or right censoring, respectively. intlist has the syntax varname:llvar ulvar [, varname:it:llvar ulvar ...], where each varname is an interval-censored variable, each llvar contains the lower bound (a) for varname and each ulvar contains the upper bound (b) for varname (or a missing value to represent plus or minus infinity). The supplied values of varname are irrelevant since they will be replaced anyway; it is only required that varname exist. Observations with llvar missing and ulvar present are left-censored for varname. Observations with llvar present and ulvar missing are right-censored for varname. Observations with llvar = ulvar are complete, and no imputation is done for them. Observations with both llvar and ulvar missing are imputed assuming an uncensored normal distribution. See Remarks for further information. m(#) defines # as the number of imputations required (minimum 1, no upper limit). The default # is 1. match[(varlist)] instructs that each member of varlist be imputed with the match option of uvis. This provides prediction matching for each member of varlist. If (varlist) is omitted then all relevant variables are imputed with the match option of uvis. The default, if match() is not specified, is to draw from the posterior predictive distribution of each variable requiring imputation. noshoweq suppresses the presentation of the prediction equations. noconstant suppresses the regression constant in all regressions. on(varlist) changes the operation of ice in a major way. With this option, uvis imputes each member of mainvarlist univariately on varlist. This provides a convenient way of producing multiple imputations when imputation for each variable in mainvarlist is to be done univariately on a set of complete predictors. orderasis enters the variables in mainvarlist into the MICE algorithm in the order given. The default is to order them according to the number of missing values: the variable with least missingness gets imputed first, and so on. passive(passivelist) allows the use of "passive" imputation of variables that depend on other variables, some of which are imputed. The syntax of passivelist is varname:exp [\varname:exp ...]. Notice the requirement to use "\" as a separator between items in passivelist, rather than the usual comma; the reason is that a comma may be a valid part of an expression. The option is most easily explained by example. Suppose x1 is a categorical variable with 3 levels, and that two dummy variables x1a, x1b have been created by the commands . generate byte x1a=(x1==2) . generate byte x1b=(x1==3) Now suppose that x1 is to be imputed by the mlogit command, and is to be treated as the two dummy variables x1a and x1b when predicting other variables. Use of mlogit is achieved by the option cmd(x1:mlogit). When x1 is imputed, we want x1a and x1b to be updated with new values which depend on the imputed values of x1. This may be achieved by specifying passive(x1a:x1==2 \ x1b:x1==3). It is necessary also to remove x1 from the list of predictors when variables other than x1 are being imputed, and this is done by using the substitute() option; in the present example, you would specify substitute(x1:x1a x1b). Note that although in this example x1a will take the (possibly unintended) value of 0 when x1 is missing, ice is careful to ensure that x1a (and x1b) inherit the missingness of x1, and are passively imputed following active imputation of missing values of x1. If this were not done, incorrect results could occur. The responsibility of the user is to create x1a and x1b before running ice such that their missing values are identical to those of x1. A second example is multiplicative interactions between variables, for example, between x1 and x2 (e.g. x12=x1*x2); this could be entered as passive(x12:x1*x2). It would cause the interaction term x12 to be omitted when either x1 or x2 was being imputed, since it would make no sense to impute x1 from its interaction with x2. substitute() is not needed here. It should be stressed that variables to be imputed passively must already exist and must be included in mainvarlist, otherwise they will not be recognised. replace permits filename to be overwritten with new data. replace may not be abbreviated. seed(#) sets the random number seed to #. To reproduce a set of imputations, the same random number seed should be used. Default #: 0, meaning no seed is set by the program. substitute(sublist) is typically used with the passive() option to represent multilevel categorical variables as dummy variables in models for predicting other variables. See passive() for more details. The syntax of sublist is varname:dummyvarlist [,varname:dummyvarlist ...] where varname is the name of a variable to be substituted and dummyvarlist is the list of dummy variables representing it. Note, however, the following important convenience feature: substitute() may be used without corresponding expressions in passive() to recreate dummy variables automatically. If the values of variables in dummyvarlist are NOT defined through expressions involving varname in the passive() option, then the variables in dummyvarlist are calculated according to the actual range of values of varname. For example, suppose the options passive(x1a:x1==2 \ x1b:x1==3) and {cmd:substitute(x1:x1a x1b) were specified. Provided that all the non-missing values of x1 were 2 when x1a==1 and all the non-missing values of x1 were 3 when x1b==1, then passive(x1a:x1==2 \ x1b:x1==3) is implied by substitute(x1:x1a x1b) and can be omitted. The rule applied by substitute(x:dummy1 [dummy2...]) for defining dummy variables dummy1, dummy2, ... is as follows: 1. Determine the range of values [xmin, xmax] of x for which dummy1 > 0. 2a. If xmin < xmax, define dummy1 to be 1 if xmin <= x <= xmax and 0 otherwise. 2b. If xmin = xmax, define dummy1 to be 1 if x = xmin and 0 otherwise. 3. Repeat steps 1 and 2a,b for dummy2, dummy3, ... as necessary. With many such categorical variables this feature can save a lot of typing. trace(filename) monitors the convergence of the imputation algorithm. For each original variable with missing values, the mean of the imputed values is stored as a variable in filename, together with the cycle number at which that mean was calculated. The results are stored only for the final imputation. For diagnostic purposes, it is sensible to run trace() with m(1) and a large number of cycles, such as cycles(100). When the run is complete, it is helpful to load filename into memory and plot the mean for each imputed variable against the cycle number. If necessary, smoothing may be applied to clarify any apparent pattern. Convergence is judged to have occurred when the pattern of the imputed means is random. It is usually obvious from the appearance of the plot how many cycles are needed for convergence. Options for uvis boot invokes a bootstrap method for creating imputed values (see Remarks). gen(newvar) is not optional. newvar contains original (non-missing) and imputed (originally missing) values of yvar. match creates imputations by prediction matching. The default is to draw imputations at random from the posterior distribution of the missing values of yvar, conditional on the observed values and the members of xvarlist. See Remarks for further details. noconstant suppresses the regression constant in all regressions. replace permits newvar (see gen(newvar)) to be overwritten with new data. replace may not be abbreviated. seed(#) sets the random number seed to #. See Remarks for comments on how to ensure reproducible imputations by using the seed() option. Default #: 0, meaning no seed is set by the program. Remarks uvis imputes yvar from xvarlist according to the following algorithm (see van Buuren et al (1999) section 3.2 for further technical details): 1. Estimate the vector of coefficients (beta) and the residual variance by regressing the non-missing values of yvar on the current "completed" version of xvarlist. Predict the fitted values etaobs at the non-missing observations of yvar. 2. Draw at random a value (sigma_star) from the posterior distribution of the residual standard deviation. 3. Draw at random a value (beta_star) from the posterior distribution of beta, allowing, through sigma_star, for uncertainty in beta. 4. Use beta_star to predict the fitted values etamis at the missing observations of yvar. 5. The imputed values are predicted directly from beta_star, sigma_star and the covariates. When imputation is by linear regression (regress command), this step assumes that yvar is Normally distributed, given the covariates. For other types of imputation, samples are drawn from the appropriate distribution. With the match option, step 5 is replaced by the following. For each missing observation of yvar with prediction etamis, find the non-missing observation of yvar whose prediction (etaobs) on observed data is closest to etamis. This closest non-missing observation is used to impute the missing value of yvar. The default draw method is not robust to departures from Normality and may produce implausible imputations. For example, if the original distribution is skew and positive-valued, the imputed distribution will not necessarily have the appropriate amount of skewness, nor will all the imputed values necessarily be positive. Log transformation of positive variables may greatly improve the appropriateness of the imputations. The alternative match method is recommended only for continuous variables when the Normality assumption is clearly untenable, even approximately. It is not necessary, nor is it recommended, for binary, ordered categorical or nominal variables. match may work well when the distribution of a continuous variable is very non-Normal, but it may sometimes result in biased imputations. With the boot option, steps 2-4 are replaced by a bootstrap estimation of beta_star; beta_star is estimated by regressing yvar on xvarlist after taking a bootstrap sample of the non-missing observations. This has the advantage of robustness since the distribution of beta is no longer assumed to be multivariate normal. Note that uvis will not impute observations for which a value of a variable in xvarlist is missing. However, all original (missing or non-missing) observations of yvar will be copied into newvarname in such cases. This is a change from the first release of uvis (with mvis). Previously, newvarname would be set to missing whenever a value of a variable in xvarlist was missing, irrespective of the value of yvar. Missing data for ordered (or unordered) categorical covariates should be imputed by using the ologit (or mlogit) command. In these cases, prediction matching is done on the scale of the mean absolute difference in the predicted class probabilities, preceded by logit transformation. ice carries out multivariate imputation in mainvarlist using regression switching (van Buuren et al 1999) as follows: 1. Ignore any observations for which mainvarlist has only missing values, or if the ccvarlist(varlist) option has been specified, for which any member of varlist has a missing value. 2. For each variable in mainvarlist with any missing data, randomly order that variable and replicate the observed values across the missing cases. This step initialises the iterative procedure by ensuing that no relevant values are missing. 3. For each variable in mainvarlist in turn, impute missing values by applying uvis with the remaining variables as covariates. 4. Repeat step 3 cycles() times, replacing the imputed values with updated values at the end of each cycle. A single imputation sample is created for each variable with any relevant missing values. Van Buuren recommends cycles(20) but goes on to say that 10 or even 5 iterations are probably sufficient. We have chosen a compromise default of 10. "Multiple imputation" (MI) implies the creation and analysis of several imputed datasets. To do this, one would run ice with m set to a suitable number, for example 5. To obtain final estimates of the parameters of interest and their standard errors, one would fit a model in each imputation and carry out the appropriate post-MI averaging procedure on the results from the m separate imputations. A suitable estimation tool for this purpose is micombine. Handling categorical variables Binary variables present no difficulty: by default, in the MICE procedure, when such a variable is the response, it is predicted from other variables by using logistic regression; when it is a covariate, it is modelled in the only way possible, effectively as a single dummy variable. Categorical variables with 3 or more levels may in principle be treated in different ways. By default, in ice variables with 3-5 levels are modelled using multinomial logistic regression (mlogit command) when the response, and as a single linear term when a covariate. The same behaviour occurs with the ordered logistic model (ologit command), requested via the cmd() option. The use of dummy variables instead of a single linear term may be imposed as described under the passive() option. The requisite dummy variables must be created before ice is invoked. Variables with 6 or more levels are treated as ordered and continuous, but again different choices may be imposed by use of the cmd(), passive() and substitute() options. You should be aware that unless the dataset is large, use of the mlogit command may produce unstable estimates if the number of levels is too large, and may compromise the accuracy of the imputations. It is hard to predict when this will occur. Note that due to a peculiarity of the way the mlogit command works, variables with score labels cause problems to ice and uvis when missing data are imputed using mlogit. Score labels for such variables are removed in the file of imputed data. See also the related comment on Post-estimation prediction in micombine. Interval censoring Values of a variable y that are interval censored are imputed under the assumption that y is normally distributed with unknown mean and variance. The method, which is fast and efficient, is essentially as described for right-censored variables in section 3.3 of Royston (2001). A minor extension to allow left or interval censoring is employed. For example, if A < y < B and A and B are both finite, the imputed value for y will follow a truncated normal distribution with bounds A and B, variance parameter estimated from the data and mean given by the linear predictor for the imputation model for y. Stata's intreg command is used to estimate the mean and variance of y. When A and B are both missing (infinite), imputation of y simply assumes the normal distribution just mentioned, but without bounds. If you wish to impose range limits on the imputed values, the lower and upper bound variables may be set accordingly. For example, to impute right-censored (e.g. survival) data, you would set llvar equal to all the observed times to event, whether censored or not, and ulvar to all the uncensored event times and missing for the censored times. This would cause the right-censored values to be imputed without restriction. If you wanted to bound the imputed values above, say by 10, you would specify ulvar to be 10 (rather than missing) for all the censored observations. Errors and diagnostics ice may occasionally detect an anomaly when running uvis with a particular variable as response and a particular regression command. ice will then stop and report the uvis command it was running and the error number returned. Often the problem lies in a regression of a binary or categorical variable where the estimation procedure fails to converge; this is usually caused by sparse cell occupancy of the response variable. If you obtain this error you should either omit the offending variable from the imputation, or seek to combine a sparse category with another category. Another possibility is that, again due to a defect in a particular regression command in the chained equations structure, the number of values imputed for a particular variable is less than expected. This is a serious error and again may arise from estimation problems involving a binary or categorical variable. In this situation, ice saves in the working directory a snapshot of the data it was using in the attempted estimation to a file called _ice_dump.dta, while also reporting the uvis command it was executing. You can then investigate what may have gone wrong with the command by loading the data in _ice_dump.dta and re-running the offending regression command. Further notes ice determines the order of imputing variables in the round of chained equations according to the amount of missing data. Variables with the least missingness are imputed first. An important application of MI is to investigate possible models, for example prognostic models, in which selection of influential variables is required (Clark & Altman 2003). For example, the stability of the final model across the imputation samples is of interest. This area of enquiry is in its infancy. In survival analysis, it is recommended to include the censoring indicator and the log of the survival time in the variables to be used for imputation. Van Buuren et al (1999) give a detailed discussion of the different types of covariate that can be included in the imputation model and discuss the important issue of how to deal with variables which are missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). See also Van Buuren's website http://www.multiple-imputation.com for further information and software sources. Examples . uvis regress y x1 x2 x3, gen(ym) . uvis intreg ll ul x1 x2 x3, gen(y) . ice x1 x2 x3 using imputed, m(5) . ice x1 x2 x3 using imputed, m(5) cycles(20) cc(x4 x5) . ice x1-x5 using imputed, m(10) boot match(x1 x2 x3) cmd(x1 x2:mlogit, x3:ologit) id(pid) seed(101) genmiss(m_) . ice x1 x1a x1b x2 x3 x23 using imputed, m(5) cmd(x1:ologit) passive(x1a:x1==2 \x1b:x1==3 \x23=x2*x3) substitute(x1:x1a x1b) . ice y1 y2 y3 x1 x2 x3 x4 using imputed, m(5) eq(y1:x1 x2 y2, y2:y1 x3 x4, y3:y1 y2) match(y3) . ice x1 x2 x3 using imputed, m(5) cmd(x1:ologit) match(x2) dropmissing . ice x1 ll2 ul2 x2 ll3 ul3 x3 using imputed, m(5) interval(x2:ll2 ul2, x3:ll3 ul3) Author Patrick Royston, MRC Clinical Trials Unit, London. pr@ctu.mrc.ac.uk Further reading van Buuren S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694. Also see http://www.multiple-imputation.com. Carlin J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3):226-244. Clark T. G. and D. G. Altman. 2003. Developing a prognostic model in the presence of missing data: an ovarian cancer case-study. Journal of Clinical Epidemiology 5628-37. Royston P. 2001. The lognormal distribution as a model for survival time in cancer, with an emphasis on prognostic factors. Statistica Neelandica 55:89-104. Royston P. 2004. Multiple imputation of missing values. Stata Journal 4(3):227-241. Royston P. 2005a. Multiple imputation of missing values: update. Stata Journal 5: 188-201. Royston P. 2005b. Multiple imputation of missing values: update of ice. Stata Journal 5: 527-536. Acknowledgement I am grateful to Gillian Raab for pointing out certain issues with the prediction matching approach, particularly that it is only useful with continuous variables. As a result, the default imputation method has been changed from matching to drawing from the predictive distribution. Gillian also suggested imputing the variables in reverse order of the amount of missingness, and selecting the imputed value at random from the set determined by the available matching predictions. Both suggestions have been implemented. Also see On-line: help for mijoin, micombine, mitools and related programs (if installed).