Chapter 5-9. Missing Data Imputation

_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine. Chapter 5-9. (Accessed February 15, 2012, at http://www.ccts.utah.edu/biostats/?pageId=5385).

In this chapter we discuss what to do about missing data, and in particular the imputation schemes available in Stata. To offer a "precise" definition, missing data imputation is the substitution of values for missing values; it is not fabricating data, because the substitution is done statistically rather than arbitrarily. Although you rarely see authors describe how they handled missing data in their articles, it is becoming common to find entire chapters on this subject in statistical textbooks. For example, such chapters can be found in Harrell (2001, pp.41-52), Twisk (2003, pp.202-224), and Fleiss et al. (2003, pp.491-560).

Listwise Deletion

Stata, like other statistical software, uses listwise deletion of missing values, which can diminish the sample size very quickly in regression models. With listwise deletion, a subject is dropped from the analysis if it is missing at least one of the variables used in the model (y, x1, or x2 in the example below). Consider the following dataset.

. list

     +---------------------------------+
     | id    y   x1   x2   x3   s   x4 |
     |---------------------------------|
  1. |  1   11    1    2    3   a    . |
  2. |  2   10    .    5    .   b    5 |
  3. |  3    5    3    2    4        3 |
  4. |  4    9    .    .    5   c    1 |
  5. |  5   12    5    7    1   d    2 |
     |---------------------------------|
  6. |  6    7    6    3    2   e    5 |
     +---------------------------------+

Ignoring the obvious overfitting, for the sake of illustration, the following regression command loses two subjects due to listwise deletion of missing data.

. regress y x1 x2

      Source |       SS       df       MS              Number of obs =       4
-------------+------------------------------           F(  2,     1) =    1.16
       Model |  22.8676471     2  11.4338235           Prob > F      =  0.5493
    Residual |  9.88235294     1  9.88235294           R-squared     =  0.6982
-------------+------------------------------           Adj R-squared =  0.0947
       Total |        32.75     3  10.9166667           Root MSE      =  3.1436

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |         -1   .9701425    -1.03   0.490    -13.32683   11.32683
          x2 |   1.352941   .9036642     1.50   0.375     -10.1292   12.83508
       _cons |   7.764706   3.654641     2.12   0.280    -38.67191   54.20132
------------------------------------------------------------------------------

We just lost one third of our sample (n=4 instead of n=6). That can seriously bias the results, not to mention the loss of statistical power. Although it is what most researchers do, simply dropping subjects that have at least one missing value among the variables in the model (that is, listwise deletion of missing data) produces regression coefficients that can be terribly biased, imprecise, or both (Harrell, 2001, pp.43-44).
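Before fitting a model, you can check in advance how many subjects listwise deletion will cost you. The following is a minimal sketch using the egen rowmiss() function; the variable name nmiss is just an illustrative choice.

* count how many model variables are missing for each subject
capture drop nmiss
egen nmiss = rowmiss(y x1 x2)
count if nmiss > 0    // subjects that "regress y x1 x2" would drop
tab nmiss             // how many missing values each subject has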
Types of Missing Data

Harrell (2001, pp.41-52) discusses three types of missing data.

Missing completely at random (MCAR)

Data are missing for reasons unrelated to any characteristics or responses of the subject, including the value of the missing observation itself, were it to be known. An example is the accidental dropping of a test tube, resulting in a missing laboratory measurement. (Here, the best guess of the missing value is simply the sample median.)

Missing at random (MAR)

Data elements are not missing completely at random; the probability that a value is missing depends on the values of other variables that were actually measured, but not on the unobserved value itself. For example, suppose males are less likely to respond to the income question in general, but their likelihood of responding is independent of their actual income. In this case, unbiased sex-specific income estimates can be made as long as we have data on the sex variable (by replacing the missing value with the sex-specific median income, for example).

Informative missing (IM)

Data elements are more likely to be missing if the true values of the variable in question are systematically higher or lower. For example, this occurs if lower income subjects, or high income subjects, or both, are less likely to answer the income question in a survey. This is the most difficult type of missing data to handle, and in many cases there is no good value to substitute for the missing value. Furthermore, if you analyze your data by simply dropping these subjects, your results will be biased, so that does not work either.

Missing Comorbidities in the Patient Medical Record

A special case of missing data is a comorbidity that is not listed in the patient's medical record. For example, if no mention of diabetes was ever made and a diagnostic code for diabetes was never entered for any clinic visit, the fact that it is missing suggests that the patient does not have diabetes. Defining a coding rule to replace this missing value with 0, or absent, will most likely produce the least amount of misclassification error. Steyerberg (2009, pp.130-131) mentions this approach,

"An alternative in such a situation might be to change the definition of the predictor, i.e., by assuming that if no value is available from a patient chart, the characteristic is absent rather than missing."

Replacing Missing Values with Mean, Median, or Mode

Before the more sophisticated imputation schemes were developed, it was common practice to replace the missing value with a likely value, such as the mean, median, or mode. One criticism of this approach is that it artificially shrinks the variance, since so many observations will now have the average value. Royston (2004) makes this criticism,

"Old-fashioned imputation typically replaced missing values with the mean or mode of the nonmissing values for that variable. That approach is now regarded as inadequate. For subsequent statistical inference to be valid, it is essential to inject the correct degree of randomness into the imputations and to incorporate that uncertainty when computing standard errors and confidence intervals for parameters of interest."

One possible approach is to impute the missing value with a likely value, such as the median, and then add a random residual back to the imputed value so that the variance, and hence the standard error, is not artificially shrunk (Harrell, 2001, pp.45-46). Such a direct approach is not usually done, however, since the more widely accepted approaches accomplish the same thing. Furthermore, if there are a lot of missing data, imputing with a likely value might adversely affect the regression coefficient. The imputation methods of multiple imputation and maximum likelihood not only provide the best standard errors, but also the best regression coefficients.
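To make the "likely value plus a random residual" idea concrete, here is a minimal sketch for a continuous variable. It is shown only for illustration, not as a recommended procedure; the variable name x is hypothetical, and the residual here is simply drawn from a normal distribution with the observed standard deviation.

set seed 999                       // so the imputation is reproducible
quietly sum x, detail
scalar xmed = r(p50)               // observed median
scalar xsd  = r(sd)                // observed standard deviation
capture drop nmx
gen nmx = x
replace nmx = xmed + rnormal(0, xsd) if x == .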
What About Imputing the Outcome Variable?

It is common to discard subjects with a missing outcome variable, but imputing missing values of the outcome variable frequently leads to more efficient estimates of the regression coefficients when the imputation is based on the nonmissing predictor variables (Harrell, 2001, p.43).

Missing Value Indicator Approach

A historically popular approach in epidemiologic research was to use a missing value indicator, which has a value of 1 if the variable is missing and 0 otherwise. For example, given the following variable for gender,

     1. male      n = 50
     2. female    n = 40
     .  missing   n = 10

we would recode this to two indicators, male and malemissing:

     Original gender variable     Male indicator     Malemissing indicator
     1. male    (n=50)            1 (n=50)           0 (n=50)
     2. female  (n=40)            0 (n=40)           0 (n=40)
     .  missing (n=10)            0 (n=10)           1 (n=10)

and then include both indicator variables in the regression model. With this approach, the missing value indicator is not interpreted, or reported in an article, but simply acts as a placeholder so that subjects with missing values are not dropped from the analysis.

Greenland and Finkle (1995) suggest not using the missing value indicator approach,

"Epidemiologic studies often encounter missing covariate values. While simple methods such as stratification on missing-data status, conditional-mean imputation, and complete-subject analysis are commonly employed for handling this problem, several studies have shown that these methods can be biased under reasonable circumstances. The authors review these results in the context of logistic regression and present simulation experiments showing the limitations of the methods. The method based on missing-data indicators can exhibit severe bias even when the data are missing completely at random, and regression (conditional-mean) imputation can be inordinately sensitive to model misspecification. Even complete-subject analysis can outperform these methods. More sophisticated methods, such as maximum likelihood, multiple imputation, and weighted estimating equations, have been given extensive attention in the statistics literature. While these methods are superior to simple methods, they are not commonly used in epidemiology, no doubt due to their complexity and the lack of packaged software to apply these methods. The authors contrast the results of multiple imputation to simple methods in the analysis of a case-control study of endometrial cancer, and they find a meaningful difference in results for age at menarche. In general, the authors recommend that epidemiologists avoid using the missing-indicator method and use more sophisticated methods whenever a large proportion of data are missing."

Huberman and Langholz (1999) later proposed the missing value indicator approach for matched case-control studies, a setting not specifically discussed in the Greenland and Finkle paper. Li et al (2004) criticized the Huberman and Langholz proposal, stating that the approach does not perform as well as Huberman and Langholz suggested.

Steyerberg (2009, pp.130-131) likewise advises not using the missing value indicator approach,

"…such a procedure ignores correlation of the values of predictors among each other. Simulations have shown that the procedure may lead to severe bias in estimated regression coefficients [155, 295]. The missing indicator should hence generally not be used."
-----
155 Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analysis. Am J Epidemiol 1995;142(12):1255-64.
295 Moons KG, Grobbee DE. Diagnostic studies as multivariable, prediction research. J Epidemiol Community Health 2002;56(5):337-8.
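Although the missing value indicator method is not recommended, a short sketch of the recode may help make the construction above concrete. This assumes a hypothetical numeric variable gender coded 1 = male, 2 = female, with missing values stored as "." :

gen male        = cond(missing(gender), 0, gender == 1)   // 1 = male; 0 = female or missing
gen malemissing = missing(gender)                          // 1 = gender missing; 0 otherwise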
Hotdeck Imputation (command "hotdeck", contributed by Mander and Clayton)

With hotdeck imputation, each missing value is replaced with the value from the most similar case for which the variable is not missing. A hotdeck command is available for Stata (you must update your Stata to get it), but instead of replacing only the missing values, it replaces the entire observation with an observation that has no missing data. Thus, both missing and nonmissing variables for a subject are replaced, which does not seem very appealing. In addition, the replacement is not with the most similar observation, which is the most appealing feature of hotdeck imputation; instead, the replacement is with a random observation. Royston (2004) criticizes the method,

"Hotdeck imputation was implemented in Stata in 1999 by Mander and Clayton. However, this technique may perform poorly when many rows of data have at least one missing value."

Hotdeck Imputation (command "hotdeckvar", contributed by Schonlau)

A better version of hotdeck is available on the Internet. Schonlau developed a Stata procedure to perform a simple hotdeck imputation, in which missing values are replaced by random values drawn from the same variable. Although this is not the "most similar case" that true hotdeck imputation is supposed to use, it is intuitively more appealing than the Mander and Clayton procedure. Schonlau's version can be found on his website (http://www.schonlau.net/statasoftware.html). To install it, connect to the Internet and enter the following commands in the Stata Command window:

net from http://www.schonlau.net/stata
net install hotdeckvar

and then click on the "hotdeckvar" link. To get help for this in Stata, enter

help hotdeckvar

Article Statistical Methods Suggestion for Schonlau's Hotdeck Imputation

If you wanted to use hotdeck imputation, you could use the following in your Statistical Methods section:

Imputation for missing values was performed using the hotdeck procedure, where missing values were replaced by random values from the same variable, using the Schonlau implementation for the Stata software (Schonlau, 2006). Hotdeck imputation has the advantage of being simple to use, it preserves the distributional characteristics of the variable, and it performs nearly as well as the more sophisticated imputation approaches (Roth, 1994).

Multiple Imputation

An excellent website discussing multiple imputation is http://www.multiple-imputation.com/ (click on the "What is MI?" link). This website was created by S. van Buuren, one of the authors of the MICE method (multivariate imputation by chained equations). The multiple imputation routine available in Stata uses the MICE method. Multiple imputation was implemented in Stata by Royston (2004). It was updated by Royston (2005a), then updated again by Royston (2005b), again by Royston (2007), again by Carlin, Galati, and Royston (2008), and so on.

Royston (2004) describes the method:

"This article describes an implementation for Stata of the MICE method of multiple multivariate imputation described by van Buuren, Boshuizen, and Knook (1999). MICE stands for multivariate imputation by chained equations. The basic idea of data analysis with multiple imputation is to create a small number (e.g., 5–10) of copies of the data, each of which has the missing values suitably imputed, and analyze each complete dataset independently. Estimates of parameters of interest are averaged across the copies to give a single estimate. Standard errors are computed according to the "Rubin rules", devised to allow for the between- and within-imputation components of variation in the parameter estimates."
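The "Rubin rules" referred to here are standard combining formulas, not something specific to Stata. With m imputations, let $\hat{\theta}_j$ be the estimate of a parameter from the j-th completed dataset and $U_j$ its estimated variance (squared standard error). The pooled estimate and its components of variance are

$\bar{\theta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}_j, \qquad \bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j, \qquad B = \frac{1}{m-1}\sum_{j=1}^{m}(\hat{\theta}_j-\bar{\theta})^2,$

where $\bar{U}$ is the within-imputation variance and B the between-imputation variance. The reported standard error is $\sqrt{T}$, with total variance

$T = \bar{U} + \left(1+\frac{1}{m}\right)B.$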
This is the best imputation approach available in Stata, and so it is the recommended approach when the proportion of missing data is greater than 5% (see the Harrell guideline below). If you use it and want to say so in your article, you could report it as follows.

Article Statistical Methods Section Suggestion

If you wanted to use multiple imputation, you could use the following in your Statistical Methods section:

Missing data were imputed using the method of multiple multivariate imputation described by van Buuren et al (1999) as implemented in the Stata software (Royston, 2004).

We will practice with this method below. The Stata commands for multiple imputation assume data to be missing at random (MAR) or missing completely at random (MCAR).

Exercise: Look at the Vandenbroucke et al (2007) STROBE statement suggestion for reporting your missing value imputation approach. On page W-176, under the heading "12(c) Explain how missing data were addressed", they give an example description for your Statistical Methods section and an explanation, and in Box 6 on the next page, W-177, they give details about imputation in general, which are consistent with the presentation in this chapter. They give this example (a clinical paper, their reference 106) of explaining in your manuscript how imputation was done,

"Our missing data analysis procedures used missing at random (MAR) assumptions. We used the MICE (multivariate imputation by chained equations) method of multiple multivariate imputation in STATA. We independently analyzed 10 copies of the data, each with missing values suitably imputed, in the multivariate logistic regression analyses. We average estimates of the variables to give a single mean estimate and adjusted standard errors according to Rubin's rules" (106).
------
106. Chandola T, Brunner E, Marmot M. Chronic stress at work and the metabolic syndrome: prospective study. BMJ 2006;332:521-5. PMID: 16428252

This paragraph is very similar to the Royston (2004) description quoted earlier.

Some Crude Guidelines

Harrell provides these crude guidelines (2001, p.49):

"Proportion of missings < 0.05: It doesn't matter very much how you impute missings or whether you adjust variance of regression coefficient estimates for having imputed data in this case. For continuous variables imputing missings with the median nonmissing value is adequate; for categorical predictors the most frequent category can be used. Complete case analysis is an option here."

"Proportion of missings 0.05 to 0.15: If a predictor is unrelated to all of the other predictors, imputations can be done the same as the above (i.e., impute a reasonable constant value). If the predictor is correlated with other predictors, develop a customized model (or have the transcan function [available for S-Plus from Harrell's website] do it for you) to predict the predictor from all of the other predictors. Then impute missings with predicted values.
For categorical variables, classification trees are good methods for developing customized imputation models. For continuous variables, ordinary regression can be used if the variable in question does not require a nonmonotonic transformation to be predicted from the other variables. For either the related or unrelated predictor case, variances may need to adjusted for imputation. Single imputation is probably OK here, but multiple imputation doesn’t hurt.” “Proportion of missings > 0.15: This situation requires the same considerations as in the previous case, and adjusting variances for imputation is even more important. To estimate the strength of the effect of a predictor that is frequently missing, it may be necessary to refit the model on the subject of observations for which that predictor is not missing, if Y is not used for imputation. Multiple imputation is preferred for most models.” Chapter 5-9 (revision 15 Feb 2012) p. 8 Counting Number of Missing Values A quick way to see how many missing values you have in your variables is to use the nmissing or npresent commands. First, you have to update your Stata to add them, which you can do with the following command while connected to the Internet. findit nmissing SJ-5-4 dm67_3 . . . . . . . . . . Software update for nmissing and npresent (help nmissing if installed) . . . . . . . . . . . . . . . N. J. Cox Q4/05 SJ 5(4):607 now produces saved results Click on the dm67_3 link to install, which gives you: INSTALLATION FILES dm67_3/nmissing.ado dm67_3/nmissing.hlp dm67_3/npresent.ado dm67_3/npresent.hlp (click here to install) and then click on the “(click here to install)” link. Having done that, you can use the following to see the number of missing values for each variable: nmissing Alternatively, you can using the following to see the number of nonmissing vaues for each variable: npresent Chapter 5-9 (revision 15 Feb 2012) p. 9 Stata Practice Imputing with Likely Value Bringing in some data to practice with File Open Find the directory where you copied the CD Change to the subdirectory datasets & do-files Single click on births_with_missing.dta Open use "C:\Documents and Settings\u0032770.SRVR\Desktop\ Biostats & Epi With Stata\datasets & do-files\ births_with_missing.dta", clear * which must be all on one line, or use: cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "Biostats & Epi With Stata\datasets & do-files" use births_with_missing, clear Looking at the data sum Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------id | 500 250.5 144.4818 1 500 bweight | 478 3137.253 637.777 628 4553 lowbw | 478 .1213389 .3268628 0 1 gestwks | 447 38.79617 2.14174 26.95 43.16 preterm | 447 .1230425 .328854 0 1 -------------+-------------------------------------------------------matage | 485 34.05979 3.905724 23 43 hyp | 478 .1401674 .3475243 0 1 sex | 459 1.479303 .5001165 1 2 sexalph | 0 We have (500-478)/500, or 4.4% missing for that dichotomous variable hyp. Using Harrell’s guideline for proportion missing 5%, we can simply impute those missings with the modal value of the variable. Chapter 5-9 (revision 15 Feb 2012) p. 10 Here we might use a naming convention of “nm” for “no missing” prefixed onto the variable name, so that we can easily recognize the original variable and the imputed variable. capture drop nmhyp tab hyp // look at output to see that modal value is 0 gen nmhyp=hyp replace nmhyp=0 if hyp==. tab nmhyp hyp , missing . tab hyp hypertens | Freq. Percent Cum. 
------------+----------------------------------0 | 411 85.98 85.98 1 | 67 14.02 100.00 ------------+----------------------------------Total | 478 100.00 . tab nmhyp hyp , missing | hypertens nmhyp | 0 1 . | Total -----------+---------------------------------+---------0 | 411 0 22 | 433 1 | 0 67 0 | 67 -----------+---------------------------------+---------Total | 411 67 22 | 500 Alternatively, to make your do-file more automated, so that you don’t have to change these lines everytime you add new subjects to your dataset, you could use: capture drop nmhyp gen nmhyp=hyp count if hyp==0 // returns the #0's in r(N) scalar count0=r(N) // store #0’s in count0 count if hyp==1 // returns the #1's in r(N) scalar count1=r(N) // store #1’s in count1 replace nmhyp =0 if (hyp ==.) & (count0>=count1) replace nmhyp =1 if (hyp ==.) & (count0<count1) tab nmhyp hyp , missing Here, we used the count command to count the number of occurrences of 0 and 1 and stored these in scalars (variables with one element). Then we imputed with the most frequent category. . tab nmhyp hyp , missing | hypertens nmhyp | 0 1 . | Total -----------+---------------------------------+---------0 | 411 0 22 | 433 1 | 0 67 0 | 67 -----------+---------------------------------+---------Total | 411 67 22 | 500 For the continuous variable matage, we have (500-485)/500, or 3% missing. Using Harrell’s guideline for proportion missing 5%, we can simply impute those missings with the median of the variable. We can use either of the following two commands to discover the median. Chapter 5-9 (revision 15 Feb 2012) p. 11 centile matage, centile(50) sum matage, detail . centile matage, centile(50) -- Binom. Interp. -Variable | Obs Percentile Centile [95% Conf. Interval] -------------+------------------------------------------------------------matage | 485 50 34 34 35 . sum matage, detail maternal age ------------------------------------------------------------Percentiles Smallest 1% 25 23 5% 27 24 10% 29 25 Obs 485 25% 31 25 Sum of Wgt. 485 50% 75% 90% 95% 99% 34 37 39 40 42 Largest 42 43 43 43 Mean Std. Dev. 34.05979 3.905724 Variance Skewness Kurtosis 15.25468 -.241381 2.483957 To impute with the median of 34, we could use capture drop nmmatage gen nmmatage=matage replace nmmatage=34 if matage==. sum nmmatage Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------nmmatage | 500 34.058 3.846587 23 43 Alternatively, to make your do-file more automated, so that you don’t have to change these lines everytime you add new subjects to your dataset, you could use: capture drop nmmatage gen nmmatage=matage sum matage, detail *return list // to discover median is r(p50) replace nmmatage=r(p50) if matage==. sum nmmatage After running the sum command, followed by “return list” without the “*”, where it was discovered that the median is stored in the macro name r(p50), we can simply delete the return list line, or comment it out using the asterick. Chapter 5-9 (revision 15 Feb 2012) p. 12 Suppose the missing data are the Missing at random (MAR) case (see page 2). A better “likely value” can be obtained with regression. For the gestwks variable, there are (500-447)/500, or 10.6% missing. Although multiple imputation would be better, since it adjusts the variability, let’s try the regression approach just to see how it works. Later on, we will compare the result to multiple imputation. 
To find out what other predictors are correlated with gestwks, we can use corr bweight-sex | bweight lowbw gestwks preterm matage hyp sex -------------+--------------------------------------------------------------bweight | 1.0000 lowbw | -0.7083 1.0000 gestwks | 0.6969 -0.6045 1.0000 preterm | -0.5580 0.5653 -0.7374 1.0000 matage | 0.0260 -0.0268 0.0335 -0.0039 1.0000 hyp | -0.1923 0.1297 -0.1796 0.1297 -0.0527 1.0000 sex | -0.1654 0.0661 -0.0397 0.0500 -0.0351 -0.0771 1.0000 We see that lowbw, preterm, hyp, and sex are correlated with gestwks. Using Stata’s impute command, capture drop nmgestwks impute gestwks lowbw preterm hyp sex , gen(nmgestwks) sum gestwks nmgestwks We see that the imputed variable has a mean very similar to the original variable, but that the standard deviation is smaller. Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------gestwks | 447 38.79617 2.14174 26.95 43.16 nmgestwks | 500 38.78908 2.082616 26.95 43.16 This always happens, and is why we find the phrase “variances may need to adjusted for imputation” in Harrell’s “Proportion of missings 0.05 to 0.15” rule above. The diminished variance may lead to erroneous statistical significance. The impute command does not add back in random variability to the imputed values. Chapter 5-9 (revision 15 Feb 2012) p. 13 Stata Practice Imputing with Hotdeck Imputation First, we must update our Stata to include hotdeck imputation, as was done on Page 5, if we did not do this yet. net from http://www.schonlau.net/stata net install hotdeckvar We will use a small dataset so we can keep track of what is happening. use births_miss_small, clear list 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. +-----------------------------------------+ | id bweight lowbw gestwks matage | |-----------------------------------------| | 1 2620 0 38.15 35 | | 2 3751 0 39.8 31 | | 3 3200 0 38.89 33 | | 4 3673 0 . . | | 5 . . 38.97 35 | |-----------------------------------------| | 6 3001 0 41.02 38 | | 7 1203 1 . . | | 8 3652 0 . . | | 9 3279 0 39.35 30 | | 10 3007 0 . . | |-----------------------------------------| | 11 2887 0 38.9 28 | | 12 . . 40.03 27 | | 13 3375 0 . 36 | | 14 2002 1 36.48 37 | | 15 2213 1 37.68 39 | +-----------------------------------------+ To see the help for hotdeckvar, use help hotdeckvar Chapter 5-9 (revision 15 Feb 2012) p. 14 To impute all the missing values in this dataset, we use the following. We first set the seed to the random number generator so that we get the same imputed values each time we run the hotdeckvar command. set seed 999 // otherwise imputed variables change hotdeckvar bweight lowbw gestwks matage, suffix("_hot") list bweight_hot-matage_hot, abbrev(15) 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. +----------------------------------------------------+ | bweight_hot lowbw_hot gestwks_hot matage_hot | |----------------------------------------------------| | 2620 0 38.15 35 | | 3751 0 39.8 31 | | 3200 0 38.89 33 | | 3673 0 37.68 39 | | 2620 0 38.97 35 | |----------------------------------------------------| | 3001 0 41.02 38 | | 1203 1 39.8 31 | | 3652 0 38.89 33 | | 3279 0 39.35 30 | | 3007 0 41.02 38 | |----------------------------------------------------| | 2887 0 38.9 28 | | 3200 0 40.03 27 | | 3375 0 39.8 36 | | 2002 1 36.48 37 | | 2213 1 37.68 39 | +----------------------------------------------------+ For analysis, we simply use these new imputed variables. Chapter 5-9 (revision 15 Feb 2012) p. 
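For instance, a regression of birthweight on gestational age and maternal age would now be run on the imputed copies rather than the original variables. This is just a one-line sketch; the _hot variable names come from the suffix("_hot") option chosen above:

regress bweight_hot gestwks_hot matage_hot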
15 Stata Practice Imputing with Multiple Imputation A significant improvement was made to multiple imputation in Stata version 11. There is now a Stata manual, Multiple Imputation, available. To see it, click on Help on the Stata menu bar, then click on PDF Documentation. It is all done with the mi command. The Version 11 commands did not fill in the missing values as well, but the Version 10 approach obtained an estimate for each missing value. Version 10: Multiple Imputation If you have Stata Version 11, do not do this section. Just look at the process it went through, and then go on to the Version 11 section. First, we must update our Stata to include the multiple imputation commands: ice and micombine. Use findit ice which takes you to a help screen where several versions of these commands are: First, click on the st0067_3 link: SJ-7-4 st0067_3 . . . . Multiple imputation of missing values: Update of ice (help ice, ice_reformat, micombine, uvis if installed) . . P. Royston Q4/07 SJ 7(4):445--464 update of ice allowing imputation of left-, right-, or interval-censored observations which brings up the following: INSTALLATION FILES st0067_3/ice.ado st0067_3/ice.hlp st0067_3/ice_reformat.ado st0067_3/ice_reformat.hlp st0067_3/micombine.ado st0067_3/micombine.hlp st0067_3/uvis.ado st0067_3/uvis.hlp st0067_3/cmdchk.ado st0067_3/nscore.ado (click here to install) Second, click on the click here to install link: Chapter 5-9 (revision 15 Feb 2012) p. 16 Third, click on the st0067_4 link: SJ-9-3 st0067_4 . Mult. imp.: update of ice, with emphasis on cat. variables (help ice, uvis if installed) . . . . . . . . . . . . . . . P. Royston Q3/09 SJ 9(3):466--477 update of ice package with emphasis on categorical variables; clarifies relationship between ice and mi which brings up the following: INSTALLATION FILES (click here to install) st0067_4/ice.ado st0067_4/ice.hlp st0067_4/ice_.ado st0067_4/uvis.ado st0067_4/uvis.hlp followed by clicking on the “(click here to to install)” link. Fourth, click on the click here to install link: In this chapter, the ice and micombine commands are used. The multiple imputation procedure for Stata version 10 was later updated to include the mim and mimstack commands. The old commands seemed easier to use, and should give the same result, so this chapter still teaches the ice and micombine commands. Again, we will use a small dataset so we can keep track of what is happening. use births_miss_small, clear list 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. +-----------------------------------------+ | id bweight lowbw gestwks matage | |-----------------------------------------| | 1 2620 0 38.15 35 | | 2 3751 0 39.8 31 | | 3 3200 0 38.89 33 | | 4 3673 0 . . | | 5 . . 38.97 35 | |-----------------------------------------| | 6 3001 0 41.02 38 | | 7 1203 1 . . | | 8 3652 0 . . | | 9 3279 0 39.35 30 | | 10 3007 0 . . | |-----------------------------------------| | 11 2887 0 38.9 28 | | 12 . . 40.03 27 | | 13 3375 0 . 36 | | 14 2002 1 36.48 37 | | 15 2213 1 37.68 39 | +-----------------------------------------+ The stata commands we will use are explain fully in the Royston (2005a, 2005b, 2007) articles, which are updated versions of the routines explained fully in the Royston (2004) article. These articles, as with any Stata Journal article that is at least two years old, can be downloaded at no cost from the Stata Corporation website, if you want them. More recent articles must be paid for. 
http://www.stata-journal.com/archives.html Chapter 5-9 (revision 15 Feb 2012) p. 17 With listwise deletion of missing values, we get the following linear regression. regress bweight gestwks matage Source | SS df MS -------------+-----------------------------Model | 1880727.23 2 940363.614 Residual | 436631.648 5 87326.3296 -------------+-----------------------------Total | 2317358.88 7 331051.268 Number of obs F( 2, 5) Prob > F R-squared Adj R-squared Root MSE = = = = = = 8 10.77 0.0154 0.8116 0.7362 295.51 -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 274.9777 83.99418 3.27 0.022 59.06382 490.8916 matage | -66.55269 28.83879 -2.31 0.069 -140.6852 7.57979 _cons | -5541.07 3641.212 -1.52 0.189 -14901.1 3818.962 ------------------------------------------------------------------------------ van Buuren (1999, p.686) describes the multiple imputation method: “Multiple imputation will be applied to account for the non-response. The main tasks to be accomplished in multiple imputation are: 1. Specify the posterior predictive density p(Ymis|X,R), where X is a set of predictor variables, given the non-response mechanism p(R|Y,Z) and the complete data model p(Y,Z). 2. Draw imputations from this density to produce m complete data sets. 3. Perform m complete-data analyses (Cox regression in our case) on each completed data matrix. 4. Pool the m analyses results into final point and variance estimates. Simulation studies have shown that the required number of repeated imputations m can be as low as three for data with 20 per cent of missing entries.10 In the following we use m=5, which is a conservative choice.” Next we will obtain the multiple imputation solution. First we use the mvis command, which imputes missing values in the mainvarlist m times by using “switching regression”, an iterative multivariate regression technique. The imputed and non-imputed variables are stored in a new file called filename.dta. The syntax is: ice mainvarlist [if] [in] [weight] [, boot [(varlist)] cc(ccvarlist) cmd(cmdlist) cycles(#) dropmissing dryrun eq(eqlist) genmiss(string) id(string) m(#) interval(intlist) match[(varlist)] noconstant nopp noshoweq nowarning on(varlist) orderasis passive(passivelist) replace saving(filename [, replace]) seed(#) substitute(sublist) trace(filename) Chapter 5-9 (revision 15 Feb 2012) p. 18 We use ice bweight gestwks matage , m(5) genmiss(nm) /// saving(bwtimp, replace) seed(888) To see what this did, we can use use bwtimp, clear list 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. +----------------------------------------------------------------------------------------+ | id bweight lowbw gestwks matage _mi _mj nmbwei~t nmgest~s nmmatage | |----------------------------------------------------------------------------------------| | 1 2620 0 38.15 35 1 0 . . . | | 2 3751 0 39.8 31 2 0 . . . | | 3 3200 0 38.89 33 3 0 . . . | | 4 3673 0 . . 4 0 . . . | | 5 . . 38.97 35 5 0 . . . | |----------------------------------------------------------------------------------------| | 6 3001 0 41.02 38 6 0 . . . | | 7 1203 1 . . 7 0 . . . | | 8 3652 0 . . 8 0 . . . | | 9 3279 0 39.35 30 9 0 . . . | | 10 3007 0 . . 10 0 . . . 
| |----------------------------------------------------------------------------------------| | 11 2887 0 38.9 28 11 0 . . . | | 12 . . 40.03 27 12 0 . . . | | 13 3375 0 . 36 13 0 . . . | | 14 2002 1 36.48 37 14 0 . . . | | 15 2213 1 37.68 39 15 0 . . . | |----------------------------------------------------------------------------------------| | 1 2620 0 38.15 35 1 1 0 0 0 | | 2 3751 0 39.8 31 2 1 0 0 0 | | 3 3200 0 38.89 33 3 1 0 0 0 | | 4 3673 0 39.035 30.4696 4 1 0 1 1 | | 5 2573.08 . 38.97 35 5 1 1 0 0 | |----------------------------------------------------------------------------------------| | 6 3001 0 41.02 38 6 1 0 0 0 | | 7 1203 1 37.32458 38.3812 7 1 0 1 1 | | 8 3652 0 40.36074 30.7336 8 1 0 1 1 | | 9 3279 0 39.35 30 9 1 0 0 0 | | 10 3007 0 38.58094 35.3263 10 1 0 1 1 | |----------------------------------------------------------------------------------------| | 11 2887 0 38.9 28 11 1 0 0 0 | | 12 3678.1 . 40.03 27 12 1 1 0 0 | | 13 3375 0 40.13427 36 13 1 0 1 0 | | 14 2002 1 36.48 37 14 1 0 0 0 | | 15 2213 1 37.68 39 15 1 0 0 0 | |----------------------------------------------------------------------------------------| | 1 2620 0 38.15 35 1 2 0 0 0 | | 2 3751 0 39.8 31 2 2 0 0 0 | | 3 3200 0 38.89 33 3 2 0 0 0 | | 4 3673 0 39.02037 28.6181 4 2 0 1 1 | | 5 2663.96 . 38.97 35 5 2 1 0 0 | |----------------------------------------------------------------------------------------| | 6 3001 0 41.02 38 6 2 0 0 0 | | 7 1203 1 36.64371 39.5586 7 2 0 1 1 | | 8 3652 0 39.96529 32.5712 8 2 0 1 1 | | 9 3279 0 39.35 30 9 2 0 0 0 | | 10 3007 0 39.37796 31.3312 10 2 0 1 1 | |----------------------------------------------------------------------------------------| | 11 2887 0 38.9 28 11 2 0 0 0 | | 12 3949.46 . 40.03 27 12 2 1 0 0 | | 13 3375 0 40.42688 36 13 2 0 1 0 | | 14 2002 1 36.48 37 14 2 0 0 0 | | 15 2213 1 37.68 39 15 2 0 0 0 | |----------------------------------------------------------------------------------------| | 1 2620 0 38.15 35 1 3 0 0 0 | | 2 3751 0 39.8 31 2 3 0 0 0 | | 3 3200 0 38.89 33 3 3 0 0 0 | | 4 3673 0 40.09019 22.9542 4 3 0 1 1 | Chapter 5-9 (revision 15 Feb 2012) p. 19 50. | 5 2521.11 . 38.97 35 5 3 1 0 0 | |----------------------------------------------------------------------------------------| 51. | 6 3001 0 41.02 38 6 3 0 0 0 | 52. | 7 1203 1 35.87934 42.478 7 3 0 1 1 | 53. | 8 3652 0 41.6578 34.7143 8 3 0 1 1 | 54. | 9 3279 0 39.35 30 9 3 0 0 0 | 55. | 10 3007 0 41.21783 42.4263 10 3 0 1 1 | |----------------------------------------------------------------------------------------| 56. | 11 2887 0 38.9 28 11 3 0 0 0 | 57. | 12 3544.44 . 40.03 27 12 3 1 0 0 | 58. | 13 3375 0 41.06176 36 13 3 0 1 0 | 59. | 14 2002 1 36.48 37 14 3 0 0 0 | 60. | 15 2213 1 37.68 39 15 3 0 0 0 | |----------------------------------------------------------------------------------------| 61. | 1 2620 0 38.15 35 1 4 0 0 0 | 62. | 2 3751 0 39.8 31 2 4 0 0 0 | 63. | 3 3200 0 38.89 33 3 4 0 0 0 | 64. | 4 3673 0 39.85606 25.5008 4 4 0 1 1 | 65. | 5 3028.08 . 38.97 35 5 4 1 0 0 | |----------------------------------------------------------------------------------------| 66. | 6 3001 0 41.02 38 6 4 0 0 0 | 67. | 7 1203 1 33.76732 29.3472 7 4 0 1 1 | 68. | 8 3652 0 41.02277 31.6178 8 4 0 1 1 | 69. | 9 3279 0 39.35 30 9 4 0 0 0 | 70. | 10 3007 0 37.80869 17.7524 10 4 0 1 1 | |----------------------------------------------------------------------------------------| 71. | 11 2887 0 38.9 28 11 4 0 0 0 | 72. | 12 3761.25 . 40.03 27 12 4 1 0 0 | 73. | 13 3375 0 38.98916 36 13 4 0 1 0 | 74. 
| 14 2002 1 36.48 37 14 4 0 0 0 | 75. | 15 2213 1 37.68 39 15 4 0 0 0 | |----------------------------------------------------------------------------------------| 76. | 1 2620 0 38.15 35 1 5 0 0 0 | 77. | 2 3751 0 39.8 31 2 5 0 0 0 | 78. | 3 3200 0 38.89 33 3 5 0 0 0 | 79. | 4 3673 0 39.61145 24.598 4 5 0 1 1 | 80. | 5 2581.58 . 38.97 35 5 5 1 0 0 | |----------------------------------------------------------------------------------------| 81. | 6 3001 0 41.02 38 6 5 0 0 0 | 82. | 7 1203 1 35.71356 51.1944 7 5 0 1 1 | 83. | 8 3652 0 42.04614 37.6994 8 5 0 1 1 | 84. | 9 3279 0 39.35 30 9 5 0 0 0 | 85. | 10 3007 0 39.53621 32.1937 10 5 0 1 1 | |----------------------------------------------------------------------------------------| 86. | 11 2887 0 38.9 28 11 5 0 0 0 | 87. | 12 3228.02 . 40.03 27 12 5 1 0 0 | 88. | 13 3375 0 41.59284 36 13 5 0 1 0 | 89. | 14 2002 1 36.48 37 14 5 0 0 0 | 90. | 15 2213 1 37.68 39 15 5 0 0 0 | +----------------------------------------------------------------------------------------+ In this file, the original data are shown on the first 15 lines, followed by 75 lines (5 x 15) with imputed values. The ice command stored 5 sets of imputed data, as specified with the m(5) option. This generated missing value indicators, which begin with “nm” (abbrevation for “nonmissing”, but you can use anything you like) as we specified in the genmiss(nm) option. Notice that no imputation occurred for the variable lowbw, which was left out of the ice command. The imputed values are inserted in the original variable names in this file, but not in the data we have in Stata memory. You can spot the imputed values, because they have more decimal places. Chapter 5-9 (revision 15 Feb 2012) p. 20 Whereas van Buuren describes this “regression switching” as, “1. Specify the posterior predictive density p(Ymis|X,R), where X is a set of predictor variables, given the non-response mechanism p(R|Y,Z) and the complete data model p(Y,Z).” Royston (2004, p. 232) explains this regression switching as, “The algorithm is a type of Gibbs sampler in which the distribution of missing values of a covariate is sampled conditional on the distribution of the remaining covariates. Each variable in mainvarlist becomes in turn the response variable.” This format is suitable for the micombine command, which will be used next. The syntax for the micombine command is: micombine regression_cmd [yvar][covarlist] [if][in][weight] [, br detail eform(string) genxb(newvarname) impid(varname) lrr noconstant obsid(varname) svy[(svy_options)] regression_cmd_options] where regression cmd includes clogit, cnreg, glm, logistic, logit, mlogit, nbreg, ologit, oprobit, poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee. Other regression cmds will work but not all have been tested by the author. All weight types supported by regression cmd are allowed. First, the data file created with the ice command must be loaded into memory, if they are not already there. The regression_cmd portion of the micombine is the same regression command we used before we imputed the data. use bwtimp, clear micombine regress bweight gestwks matage Multiple imputation parameter estimates (5 imputations) -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------gestwks | 324.7973 72.26509 4.49 0.001 167.3452 482.2494 matage | -64.41396 30.7788 -2.09 0.058 -131.4752 2.647278 _cons | -7567.114 2975.775 -2.54 0.026 -14050.77 -1083.457 -----------------------------------------------------------------------------15 observations. This linear regression model represents the combined (a type of pooling) estimates from fitting the model to the five imputed datasets. Chapter 5-9 (revision 15 Feb 2012) p. 21 This model can be compared to the listwise deletion model we fitted above. Listwise Deletion Source | SS df MS -------------+-----------------------------Model | 1880727.23 2 940363.614 Residual | 436631.648 5 87326.3296 -------------+-----------------------------Total | 2317358.88 7 331051.268 Number of obs F( 2, 5) Prob > F R-squared Adj R-squared Root MSE = = = = = = 8 10.77 0.0154 0.8116 0.7362 295.51 -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 274.9777 83.99418 3.27 0.022 59.06382 490.8916 matage | -66.55269 28.83879 -2.31 0.069 -140.6852 7.57979 _cons | -5541.07 3641.212 -1.52 0.189 -14901.1 3818.962 ------------------------------------------------------------------------------ Multiple Imputation Multiple imputation parameter estimates (5 imputations) -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 324.7973 72.26509 4.49 0.001 167.3452 482.2494 matage | -64.41396 30.7788 -2.09 0.058 -131.4752 2.647278 _cons | -7567.114 2975.775 -2.54 0.026 -14050.77 -1083.457 -----------------------------------------------------------------------------15 observations. The large discrepancies are reflecting the differences that can exist in small sample sizes, where a single value can change model dramatically. Chapter 5-9 (revision 15 Feb 2012) p. 22 For a more fair comparison, we will use the original large dataset and try it again. use births_with_missing, clear regress bweight gestwks matage ice bweight gestwks matage , m(5) genmiss(nm) /// saving(bwtimp, replace) seed(888) use bwtimp, clear micombine regress bweight gestwks matage Listwise Deletion Source | SS df MS -------------+-----------------------------Model | 79309893.1 2 39654946.6 Residual | 81166170.7 422 192336.898 -------------+-----------------------------Total | 160476064 424 378481.283 Number of obs F( 2, 422) Prob > F R-squared Adj R-squared Root MSE = = = = = = 425 206.17 0.0000 0.4942 0.4918 438.56 -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 201.6962 9.945484 20.28 0.000 182.1473 221.245 matage | .5083184 5.513141 0.09 0.927 -10.32832 11.34496 _cons | -4692.07 422.0691 -11.12 0.000 -5521.689 -3862.45 ------------------------------------------------------------------------------ Multiple Imputation Multiple imputation parameter estimates (5 imputations) -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------gestwks | 208.7919 9.449999 22.09 0.000 190.2251 227.3588 matage | .1029839 5.825347 0.02 0.986 -11.34236 11.54833 _cons | -4964.06 402.0698 -12.35 0.000 -5754.026 -4174.094 -----------------------------------------------------------------------------500 observations (imputation 1). We see the models are much closer. In this case, we had 75 observations (15%) missing in the first model. Chapter 5-9 (revision 15 Feb 2012) p. 23 Just for fun, let’s compare these models to median and mean imputed models. * -- median imputed model use births_with_missing, clear * capture drop nmbweight gen nmbweight=bweight centile bweight, centile(50) replace nmbweight=r(c_1) if nmbweight==. // r(c_1) found using "return list" * capture drop nmgestwks gen nmgestwks=gestwks centile gestwks, centile(50) replace nmgestwks=r(c_1) if nmgestwks==. * capture drop nmmatage gen nmmatage=matage centile matage, centile(50) replace nmmatage=r(c_1) if nmmatage==. * regress nmbweight nmgestwks nmmatage * -- mean imputed model use births_with_missing, clear * capture drop nmbweight gen nmbweight=bweight sum bweight replace nmbweight=r(mean) if nmbweight==. * capture drop nmgestwks gen nmgestwks=gestwks sum gestwks replace nmgestwks=r(mean) if nmgestwks==. * capture drop nmmatage gen nmmatage=matage sum matage replace nmmatage=r(mean) if nmmatage==. * regress nmbweight nmgestwks nmmatage Chapter 5-9 (revision 15 Feb 2012) p. 24 1) Listwise deletion model Source | SS df MS -------------+-----------------------------Model | 79309893.1 2 39654946.6 Residual | 81166170.7 422 192336.898 -------------+-----------------------------Total | 160476064 424 378481.283 Number of obs F( 2, 422) Prob > F R-squared Adj R-squared Root MSE = = = = = = 425 206.17 0.0000 0.4942 0.4918 438.56 -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 201.6962 9.945484 20.28 0.000 182.1473 221.245 matage | .5083184 5.513141 0.09 0.927 -10.32832 11.34496 _cons | -4692.07 422.0691 -11.12 0.000 -5521.689 -3862.45 ------------------------------------------------------------------------------ 2) Multiple imputation model Multiple imputation parameter estimates (5 imputations) -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 208.7919 9.449999 22.09 0.000 190.2251 227.3588 matage | .1029839 5.825347 0.02 0.986 -11.34236 11.54833 _cons | -4964.06 402.0698 -12.35 0.000 -5754.026 -4174.094 -----------------------------------------------------------------------------500 observations (imputation 1). 3) Median imputed model Source | SS df MS -------------+-----------------------------Model | 74160695.6 2 37080347.8 Residual | 119927799 497 241303.418 -------------+-----------------------------Total | 194088495 499 388954.899 Number of obs F( 2, 497) Prob > F R-squared Adj R-squared Root MSE = = = = = = 500 153.67 0.0000 0.3821 0.3796 491.23 -----------------------------------------------------------------------------nmbweight | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------nmgestwks | 190.0368 10.8495 17.52 0.000 168.7203 211.3534 nmmatage | .2067422 5.721325 0.04 0.971 -11.03422 11.44771 _cons | -4247.992 457.7167 -9.28 0.000 -5147.29 -3348.694 ------------------------------------------------------------------------------ 4) Mean imputed model Source | SS df MS -------------+-----------------------------Model | 75577258.4 2 37788629.2 Residual | 118447042 497 238324.028 -------------+-----------------------------Total | 194024300 499 388826.253 Number of obs F( 2, 497) Prob > F R-squared Adj R-squared Root MSE = = = = = = 500 158.56 0.0000 0.3895 0.3871 488.18 -----------------------------------------------------------------------------nmbweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------nmgestwks | 192.1832 10.80209 17.79 0.000 170.9599 213.4066 nmmatage | .2557455 5.68614 0.04 0.964 -10.91609 11.42758 _cons | -4327.432 454.9994 -9.51 0.000 -5221.391 -3433.473 ------------------------------------------------------------------------------ They are all pretty close to each other. Chapter 5-9 (revision 15 Feb 2012) p. 25 Version 11: Multiple Imputation Again, we will use a small dataset so we can keep track of what is happening. use births_miss_small, clear list 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. +-----------------------------------------+ | id bweight lowbw gestwks matage | |-----------------------------------------| | 1 2620 0 38.15 35 | | 2 3751 0 39.8 31 | | 3 3200 0 38.89 33 | | 4 3673 0 . . | | 5 . . 38.97 35 | |-----------------------------------------| | 6 3001 0 41.02 38 | | 7 1203 1 . . | | 8 3652 0 . . | | 9 3279 0 39.35 30 | | 10 3007 0 . . | |-----------------------------------------| | 11 2887 0 38.9 28 | | 12 . . 40.03 27 | | 13 3375 0 . 36 | | 14 2002 1 36.48 37 | | 15 2213 1 37.68 39 | +-----------------------------------------+ With listwise deletion of missing values, we get the following linear regression. regress bweight gestwks matage Source | SS df MS -------------+-----------------------------Model | 1880727.23 2 940363.614 Residual | 436631.648 5 87326.3296 -------------+-----------------------------Total | 2317358.88 7 331051.268 Number of obs F( 2, 5) Prob > F R-squared Adj R-squared Root MSE = = = = = = 8 10.77 0.0154 0.8116 0.7362 295.51 -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 274.9777 83.99418 3.27 0.022 59.06382 490.8916 matage | -66.55269 28.83879 -2.31 0.069 -140.6852 7.57979 _cons | -5541.07 3641.212 -1.52 0.189 -14901.1 3818.962 ------------------------------------------------------------------------------ Chapter 5-9 (revision 15 Feb 2012) p. 26 First, we declare the data to be mi (multiple imputation) data, requesting the marginal long style (mlong) because it is the most memory-efficient of the available styles. mi set mlong Second, we register our variables, using “imputed” if to be imputed. mi register imputed bweight gestwks matage Third, we impute the missing values for bweight, using the linear regression approach (like Stata’s impute command), where the missing value is predicted by gestwks and matage. We set the random number seed so we can duplicate our work and asked for 5 imputations. 
mi impute regress bweight gestwks matage, rseed(888) add(5) note: variables gestwks matage registered as imputed and used to model variable bweight; this may cause some observations to be omitted from the estimation and may lead to missing imputed values Univariate imputation Linear regression Imputed: m=1 through m=5 Imputations = added = updated = 5 5 0 | Observations per m |---------------------------------------------Variable | complete incomplete imputed | total ---------------+-----------------------------------+---------bweight | 13 2 2 | 15 -------------------------------------------------------------(complete + incomplete = total; imputed is the minimum across m of the number of filled in observations.) Note: right-hand-side variables (or weights) have missing values; model parameters estimated using listwise deletion We were able to fill in the two missing values for bweight. Fourth, we impute the missing values for gestwks, using the linear regression approach predicting from matage. mi impute regress gestwks matage, rseed(999) add(5) gestwks: missing imputed values produced This may occur when imputation variables are used as independent variables or when independent variables contain missing values. You can specify option force if you wish to proceed anyway. r(498); The “force” option would keep from crashing, but nothing useful would come out of it. It appears we cannot impute gestwks due to missing values (no information) in matage. Fifth, we impute the missing values for matage, using the linear regression approach predicting from gestwks Chapter 5-9 (revision 15 Feb 2012) p. 27 mi impute regress matage gestwks, rseed(999) add(5) matage: missing imputed This may occur when or when independent option force if you r(498); values produced imputation variables are used as independent variables variables contain missing values. You can specify wish to proceed anyway. The “force” option would keep from crashing, but nothing useful would come out of it. It appears we cannot impute matage due to missing values (no information) in gestwks. Sixth, we check that the imputation did not create something crazy, but looking at the descriptive stats before imputation, after the first imputation, and at the fifth imputation. mi xeq 0 1 5: sum bweight gestwks matage m=0 data: -> sum bweight gestwks matage Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------bweight | 13 2912.538 741.7804 1203 3751 gestwks | 10 38.927 1.277533 36.48 41.02 matage | 11 33.54545 4.058661 27 39 m=1 data: -> sum bweight gestwks matage Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------bweight | 15 2944.241 712.5035 1203 3751 gestwks | 10 38.927 1.277533 36.48 41.02 matage | 11 33.54545 4.058661 27 39 m=5 data: -> sum bweight gestwks matage Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------bweight | 15 2897.641 700.9038 1203 3751 gestwks | 10 38.927 1.277533 36.48 41.02 matage | 11 33.54545 4.058661 27 39 We see the nonmissing sample size for matage increased, but nothing changed for the other two variables. It was a big change in the mean for bweight, which might make us wonder if the imputation worked well. We’ll see below it did not seem to cause a problem. Chapter 5-9 (revision 15 Feb 2012) p. 28 Finally, we run the linear regression model on the imputed variables. The dots options charts progress on the screen if it takes a long time. 
mi estimate, dots: regress bweight gestwks matage Imputations (5): ..... done Multiple-imputation estimates Linear regression DF adjustment: Model F test: Within VCE type: Small sample Equal FMI OLS Imputations Number of obs Average RVI Complete DF DF: min avg max F( 2, 4.7) Prob > F = = = = = = = = = 5 10 0.2082 7 5.01 5.33 5.56 11.71 0.0146 -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 264.7004 82.06318 3.23 0.021 58.51564 470.8851 matage | -59.11732 25.88137 -2.28 0.071 -125.622 7.387346 _cons | -5427.05 3541.906 -1.53 0.180 -14262.06 3407.96 ------------------------------------------------------------------------------ Comparing this to the original model that used listwise deletion of missing values, Source | SS df MS -------------+-----------------------------Model | 1880727.23 2 940363.614 Residual | 436631.648 5 87326.3296 -------------+-----------------------------Total | 2317358.88 7 331051.268 Number of obs F( 2, 5) Prob > F R-squared Adj R-squared Root MSE = = = = = = 8 10.77 0.0154 0.8116 0.7362 295.51 -----------------------------------------------------------------------------bweight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gestwks | 274.9777 83.99418 3.27 0.022 59.06382 490.8916 matage | -66.55269 28.83879 -2.31 0.069 -140.6852 7.57979 _cons | -5541.07 3641.212 -1.52 0.189 -14901.1 3818.962 ------------------------------------------------------------------------------ We see that the imputed modeled was based on n=2 more observations, which increased the statistical power slightly (smaller p value) for gestwks, while keeping the coefficient essentially the same. Chapter 5-9 (revision 15 Feb 2012) p. 29 Version 11: Multiple Imputation Using Imputaton Via Changed Equations (mi ice) This approach will fill in more missing data than the preceding approach. On the website, http://www.stata.com/support/faqs/stat/mi_ice.html, the way to get the original ice, which fills in more missing data, is described, “In Stata 11, you can use the user-written command mi ice to perform imputation via chained equations. mi ice is available from Patrick Royston’s web page (net from http://www.homepages.ucl.ac.uk/~ucakjpr/stata/) under the heading mi_ice. mi ice is a wrapper for ice that understands the official mi data format.” To add this to Stata-11, use net from http://www.homepages.ucl.ac.uk/~ucakjpr/stata/ and then click on, mi_ice Stata 11 version of -ice-; knows about the new MI format ------------------------------------------------------------------------------package mi_ice from http://www.homepages.ucl.ac.uk/~ucakjpr/stata ------------------------------------------------------------------------------TITLE mi ice. Stata 11-aware wrapper for the -ice- package DESCRIPTION/AUTHOR(S) Program by Yulia Marchenko and Patrick Royston Distribution-Date: 20101130 version: 1.0.2 Note: for this Stata 11 program to work, you must first install -ice-. Please direct queries to Patrick Royston (pr@ctu.mrc.ac.uk) INSTALLATION FILES (click here to install) mi_ice\mi_cmd_ice.ado mi_ice\mi_ice.sthlp ------------------------------------------------------------------------------(click here to return to the previous screen) Clicking on the “(click here to install)” link loads the software into Stata. Chapter 5-9 (revision 15 Feb 2012) p. 
Again, we will use a small dataset so we can keep track of what is happening.

use births_miss_small, clear

list

     +-----------------------------------------+
     | id   bweight   lowbw   gestwks   matage |
     |-----------------------------------------|
  1. |  1      2620       0     38.15       35 |
  2. |  2      3751       0      39.8       31 |
  3. |  3      3200       0     38.89       33 |
  4. |  4      3673       0         .        . |
  5. |  5         .       .     38.97       35 |
     |-----------------------------------------|
  6. |  6      3001       0     41.02       38 |
  7. |  7      1203       1         .        . |
  8. |  8      3652       0         .        . |
  9. |  9      3279       0     39.35       30 |
 10. | 10      3007       0         .        . |
     |-----------------------------------------|
 11. | 11      2887       0      38.9       28 |
 12. | 12         .       .     40.03       27 |
 13. | 13      3375       0         .       36 |
 14. | 14      2002       1     36.48       37 |
 15. | 15      2213       1     37.68       39 |
     +-----------------------------------------+

With listwise deletion of missing values, we get the following linear
regression.

regress bweight gestwks matage

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------

First, we declare the data to be mi (multiple imputation) data, requesting the
marginal long style (mlong) because it is the most memory-efficient of the
available styles.

mi set mlong

Second, we register the variables whose missing values are to be imputed, using
the “imputed” keyword.

mi register imputed bweight gestwks matage

Third, we impute the missing values for bweight, gestwks, and matage using the
method of chained equations. We set the random number seed so we can duplicate
our work, and ask for 5 imputations with the add(5) option.

mi ice bweight gestwks matage , add(5) seed(888)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          8       53.33       53.33
          1 |          3       20.00       73.33
          2 |          4       26.67      100.00
------------+-----------------------------------
      Total |         15      100.00

   Variable | Command | Prediction equation
------------+---------+------------------------------------------------
    bweight | regress | gestwks matage
    gestwks | regress | bweight matage
     matage | regress | bweight gestwks
-------------------------------------------------------------------------

Imputing 1..2..3..4..5..file
C:\DOCUME~1\U00327~1.SRV\LOCALS~1\Temp\ST_0000006s.tmp saved
(5 imputations added; M=5)
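The prediction-equation table above is the heart of the chained-equations
approach: each incomplete variable is imputed in turn from the current
filled-in values of the others, and the cycle is repeated several times before
a completed dataset is saved. As a conceptual sketch only (not Stata output),
one cycle for these three variables, with the superscript t indexing the cycle,
looks like:

\[
\text{bweight}^{(t+1)} \sim P\bigl(\text{bweight} \mid \text{gestwks}^{(t)},\ \text{matage}^{(t)}\bigr)
\]
\[
\text{gestwks}^{(t+1)} \sim P\bigl(\text{gestwks} \mid \text{bweight}^{(t+1)},\ \text{matage}^{(t)}\bigr)
\]
\[
\text{matage}^{(t+1)} \sim P\bigl(\text{matage} \mid \text{bweight}^{(t+1)},\ \text{gestwks}^{(t+1)}\bigr)
\]

Each conditional draw comes from the linear regression listed in the table,
with random variation added, which is why the chained approach succeeds where
the univariate mi impute regress commands above failed: every variable gets to
serve as both outcome and predictor within the cycle.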
Fourth, we check that the imputation did not create something crazy, by looking
at the descriptive stats before imputation, after the first imputation, and at
the fifth imputation.

mi xeq 0 1 5: sum bweight gestwks matage

m=0 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        13    2912.538    741.7804       1203       3751
     gestwks |        10      38.927    1.277533      36.48      41.02
      matage |        11    33.54545    4.058661         27         39

m=1 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2940.945     721.708       1203       3751
     gestwks |        15    38.98037    1.220608      36.48      41.02
      matage |        15    33.59405    3.860192         27         39

m=5 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2911.507     697.542       1203       3751
     gestwks |        15    39.18468    1.725737   35.71356   42.04614
      matage |        15    34.31236    6.361323   24.59798   51.19441

We don’t see anything strange. We also notice that the three variables are now
all nonmissing.

Finally, we run the linear regression model on the imputed variables. The dots
option charts progress on the screen if it takes a long time.

mi estimate, dots: regress bweight gestwks matage

Imputations (5): ..... done

Multiple-imputation estimates                   Imputations     =          5
Linear regression                               Number of obs   =         15
                                                Average RVI     =     1.2052
                                                Complete DF     =         12
DF adjustment:   Small sample                   DF:     min     =       2.62
                                                        avg     =       4.98
                                                        max     =       6.74
Model F test:       Equal FMI                   F(   2,    3.9) =      13.13
Within VCE type:          OLS                   Prob > F        =     0.0186

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   324.7973   72.26509     4.49   0.005     144.5517    505.0429
      matage |  -64.41396    30.7788    -2.09   0.140    -170.8037    41.97578
       _cons |  -7567.114   2975.775    -2.54   0.040    -14658.97   -475.2537
------------------------------------------------------------------------------

Comparing this to the original model that used listwise deletion of missing
values,

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------

We see that the imputed model was based on n=15 observations, which is the
actual sample size of our dataset.

Here are all of the commands that we just used,

   mi set mlong
   mi register imputed bweight gestwks matage
   mi ice bweight gestwks matage , add(5) seed(888)
   mi xeq 0 1 5: sum bweight gestwks matage
   mi estimate, dots: regress bweight gestwks matage

Version 12: Multiple Imputation Via Chained Equations (mi impute chained)

The method of imputation by chained equations was officially added to Stata in
version 12.
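For readers with both versions available, the two routes do essentially the
same job. The two commands below are simply the imputation calls from this
chapter's worked examples placed side by side; note the different seed options
(seed() for mi ice, rseed() for mi impute chained) and the different numbers of
imputations requested in the two examples.

   * Stata 11 route: user-written wrapper around -ice-
   mi ice bweight gestwks matage , add(5) seed(888)

   * Stata 12 route: official chained-equations command
   mi impute chained (regress) bweight gestwks matage , add(10) rseed(888)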
Again, we will use a small dataset so we can keep track of what is happening.

use births_miss_small, clear

list

     +-----------------------------------------+
     | id   bweight   lowbw   gestwks   matage |
     |-----------------------------------------|
  1. |  1      2620       0     38.15       35 |
  2. |  2      3751       0      39.8       31 |
  3. |  3      3200       0     38.89       33 |
  4. |  4      3673       0         .        . |
  5. |  5         .       .     38.97       35 |
     |-----------------------------------------|
  6. |  6      3001       0     41.02       38 |
  7. |  7      1203       1         .        . |
  8. |  8      3652       0         .        . |
  9. |  9      3279       0     39.35       30 |
 10. | 10      3007       0         .        . |
     |-----------------------------------------|
 11. | 11      2887       0      38.9       28 |
 12. | 12         .       .     40.03       27 |
 13. | 13      3375       0         .       36 |
 14. | 14      2002       1     36.48       37 |
 15. | 15      2213       1     37.68       39 |
     +-----------------------------------------+

With listwise deletion of missing values, we get the following linear
regression.

regress bweight gestwks matage

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------

To see which variables have missing values, we can use the nmissing command we
installed above, along with the ds command to get a list of all the variable
names.

. ds
id       bweight  lowbw    gestwks  matage

. nmissing
bweight         2
lowbw           2
gestwks         5
matage          4

We see that we have some missing data on all variables, except our subject ID
variable, which will not be included in the model anyway. We do not intend to
use low birth weight, lowbw, in our model to predict birth weight, so we can
ignore that variable.

First, we declare the data to be mi (multiple imputation) data, requesting the
marginal long style (mlong) because it is the most memory-efficient of the
available styles.

mi set mlong

Second, we register the variables whose missing values are to be imputed, using
the “imputed” keyword.

mi register imputed bweight gestwks matage

Third, we impute the missing values for bweight, gestwks, and matage using the
method of chained equations, requesting 10 imputed datasets with the add(10)
option. The (regress) part of the command instructs Stata to use linear
regression to predict the missing values from the other variables. The rseed()
option, where you select any number you want as the random number generator
seed, lets you reproduce the imputation.

mi impute chained (regress) bweight gestwks matage , add(10) rseed(888)

Conditional models:
           bweight: regress bweight matage gestwks
            matage: regress matage bweight gestwks
           gestwks: regress gestwks bweight matage

Performing chained iterations ...

Multivariate imputation                      Imputations =       10
Chained equations                                  added =        5
Imputed: m=6 through m=10                        updated =        0

Initialization: monotone                      Iterations =       50
                                                 burn-in =       10

           bweight: linear regression
           gestwks: linear regression
            matage: linear regression

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
           bweight |         13            2         2 |        15
           gestwks |         10            5         5 |        15
            matage |         11            4         4 |        15
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)
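Before summarizing the completed datasets, it can also help to inspect the mi
setup itself. This small check is not part of the chapter's original command
sequence; mi query and mi describe are official mi utilities that report the mi
style, the number of imputations M, and which variables are registered as
imputed.

   * Sketch: inspect the mi structure after imputation.
   mi query
   mi describe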
Fourth, we check that the imputation did not create something crazy, by looking
at the descriptive stats before imputation, after the first imputation, and at
the fifth imputation.

mi xeq 0 1 5: sum bweight gestwks matage

m=0 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        13    2912.538    741.7804       1203       3751
     gestwks |        10      38.927    1.277533      36.48      41.02
      matage |        11    33.54545    4.058661         27         39

m=1 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2906.259    711.6166       1203       3751
     gestwks |        15    38.85626    1.335499   35.94739      41.02
      matage |        15    33.56222    5.995813   26.79644   49.17725

m=5 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2954.346    714.6822       1203       3751
     gestwks |        15    38.80349     2.10584   32.45712      41.02
      matage |        15    33.27142    3.748245         27         39

We don’t see anything strange. The imputed datasets have descriptive statistics
very similar to the original non-imputed dataset. We also notice that the three
variables are now all nonmissing, as they should be.

Finally, we run the linear regression model on the imputed variables. The dots
option charts progress on the screen, which is helpful if it takes a long time.

mi estimate, dots: regress bweight gestwks matage

Imputations (10): .........10 done

Multiple-imputation estimates                   Imputations     =         10
Linear regression                               Number of obs   =         15
                                                Average RVI     =     0.7878
                                                Largest FMI     =     0.4927
                                                Complete DF     =         12
DF adjustment:   Small sample                   DF:     min     =       5.37
                                                        avg     =       5.71
                                                        max     =       6.16
Model F test:       Equal FMI                   F(   2,    6.4) =      13.60
Within VCE type:          OLS                   Prob > F        =     0.0049

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   310.4137   90.59601     3.43   0.013     90.13877    530.6885
      matage |  -67.25186   28.77103    -2.34   0.063    -139.7065    5.202746
       _cons |  -6879.842   4104.348    -1.68   0.148    -17097.55    3337.863
------------------------------------------------------------------------------

Notice the model is now based on a sample size of n=15. Compare this to the
original model, which had a sample size of only n=8 due to listwise deletion of
missing data:
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------

Here are all of the commands that we just used, so you have them in one place,

   mi set mlong
   mi register imputed bweight gestwks matage
   mi impute chained (regress) bweight gestwks matage , add(10) rseed(888)
   mi xeq 0 1 5: sum bweight gestwks matage
   mi estimate, dots: regress bweight gestwks matage

References

Chandola T, Brunner E, Marmot M. (2006). Chronic stress at work and the
metabolic syndrome: prospective study. BMJ 332:521-5. PMID:16428252

Greenland S, Finkle WD. (1995). A critical look at methods for handling missing
covariates in epidemiologic regression analysis. Am J Epidemiol
142(12):1255-64.

Fleiss JL, Levin B, Paik MC. (2003). Statistical Methods for Rates and
Proportions, 3rd ed. Hoboken, NJ, John Wiley & Sons.

Harrell Jr FE. (2001). Regression Modeling Strategies With Applications to
Linear Models, Logistic Regression, and Survival Analysis. New York,
Springer-Verlag.

Huberman M, Langholz B. (1999). Application of the missing-indicator method in
matched case-control studies with incomplete data. Am J Epidemiol
150(12):1340-5.

Li X, Song X, Gray RH. (2004). Comparison of the missing-indicator method and
conditional logistic regression in 1:m matched case-control studies with
missing exposure values. Am J Epidemiol 159(6):603-610.

Moons KG, Grobbee DE. (2002). Diagnostic studies as multivariable, prediction
research. J Epidemiol Community Health 56(5):337-8.

Roth P. (1994). Missing data: a conceptual review for applied psychologists.
Personnel Psychology 47:537-560.

Royston P. (2004). Multiple imputation of missing values. The Stata Journal
4(3):227-241.

Royston P. (2005a). Multiple imputation of missing values: update. The Stata
Journal 5(2):188-201.

Royston P. (2005b). Multiple imputation of missing values: update of ice. The
Stata Journal 5(4):527-536.

Royston P. (2007). Multiple imputation of missing values: further update of
ice, with an emphasis on interval censoring. The Stata Journal 7(4):445-464.

Schonlau M. (2006). Stata software package, hotdeckvar.pkg, for hotdeck
imputation. http://www.schonlau.net/stata/

Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to
Development, Validation, and Updating. New York, Springer.

Twisk JWR. (2003). Applied Longitudinal Data Analysis for Epidemiology: A
Practical Guide. Cambridge, Cambridge University Press.

van Buuren S, Boshuizen HC, Knook DL. (1999). Multiple imputation of missing
blood pressure covariates in survival analysis. Statistics in Medicine
18:681-694.

Vandenbroucke JP, von Elm E, Altman DG, et al. (2007). Strengthening the
Reporting of Observational Studies in Epidemiology (STROBE): explanation and
elaboration. Ann Intern Med 147(8):W-163 to W-194.