Chapter 5-9: Missing Data Imputation

In this chapter we will discuss what to do about missing data, and particularly the imputation
schemes available in Stata.
To offer a “precise” definition,
Missing data imputation is substituting values for missing values—but it is not
fabricating data because we do it statistically.
Although you rarely see authors describe how they handled missing data in their articles, it is
becoming common to find entire chapters on this subject in statistical textbooks. For example,
such chapters can be found in Harrell (2001, pp.41-52), Twisk (2003, pp.202-224), and Fleiss et
al. (2003, pp. 491-560).
Listwise Deletion
Stata, like other statistical software, uses listwise deletion of missing values, which can diminish
the sample size very quickly in regression models. With listwise deletion, a subject is dropped
from the analysis if it is missing any of the variables used in the model (y, x1, or x2 in the
example below).
Consider the following dataset.
. list

     +---------------------------------+
     | id    y   x1   x2   x3   s   x4 |
     |---------------------------------|
  1. |  1   11    1    2    3   a    . |
  2. |  2   10    .    5    .   b    5 |
  3. |  3    5    3    2    4        3 |
  4. |  4    9    .    .    5   c    1 |
  5. |  5   12    5    7    1   d    2 |
     |---------------------------------|
  6. |  6    7    6    3    2   e    5 |
     +---------------------------------+
Ignoring the obvious overfitting for the sake of illustration, the following regression command will
lose two subjects from the sample due to listwise deletion of missing data.
. regress y x1 x2

      Source |       SS       df       MS              Number of obs =       4
-------------+------------------------------           F(  2,     1) =    1.16
       Model |  22.8676471     2  11.4338235           Prob > F      =  0.5493
    Residual |  9.88235294     1  9.88235294           R-squared     =  0.6982
-------------+------------------------------           Adj R-squared =  0.0947
       Total |        32.75     3  10.9166667           Root MSE      =  3.1436

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |         -1   .9701425    -1.03   0.490    -13.32683    11.32683
          x2 |   1.352941   .9036642     1.50   0.375     -10.1292    12.83508
       _cons |   7.764706   3.654641     2.12   0.280    -38.67191    54.20132
------------------------------------------------------------------------------
We just lost 1/3 of our sample (n=4 instead of n=6). That may seriously bias the results, not to
mention the loss of statistical power.
Although it is what most researchers do, just dropping subjects from the analysis that have at
least one missing value for the variable in the model, or listwise deletion of missing data, results
in regression coefficients that can be terribly biased, imprecise, or both (Harrell, 2001, pp.43-44).
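If you want to see in advance how many subjects a model will lose, you can count the observations
that are missing any of the model variables. The following is a minimal sketch using the toy dataset
above (rowmiss() is the egen function that counts missing values across the listed variables):

egen nmiss = rowmiss(y x1 x2)
count if nmiss > 0    // subjects that listwise deletion would drop from "regress y x1 x2"
drop nmiss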
Types of Missing Data
Harrell (2001, pp.41-52) discusses three types of missing data.
Missing completely at random (MCAR) Data are missing for reasons unrelated to any
characteristics or responses of the subject, including the value of the missing item itself, were it to
be known. An example is the accidental dropping of a test tube, resulting in a missing laboratory
measurement. (Here, the best guess of the missing value is simply the sample median.)
Missing at random (MAR) Data elements are not missing completely at random; rather, the
probability that a value is missing depends on values of variables that were actually measured. For
example, suppose males are less likely to respond to the income question in general, but the
likelihood of responding is independent of their actual income. In this case, unbiased sex-specific
income estimates can be made if we have data on the sex variable (by replacing the missing value
with the sex-specific median income, for example).
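A minimal sketch of this idea in Stata, assuming hypothetical variables named income and sex (they
are not in the datasets used later in this chapter):

bysort sex: egen medinc = median(income)   // sex-specific median of the nonmissing incomes
gen nmincome = income
replace nmincome = medinc if income == .   // impute with the sex-specific median
drop medinc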
Informative missing (IM) Data elements are more likely to be missing if the true value of the
variable in question is systematically higher or lower. For example, this occurs if lower income
subjects, or high income subjects, or both, are less likely to answer the income question in a
survey. This is the most difficult type of missing data to handle, and in many cases there is no
good value to substitute for the missing value. Furthermore, if you analyze your data by just
dropping these subjects, your results will be biased, so that does not work either.
Missing Comorbidities In Patient Medical Record
A special case of missing data is a comorbidity that is not listed in the patient’s medical record. For
example, if no mention of diabetes was ever made and a diagnostic code for diabetes was never
entered for any clinic visit, the fact that it is missing suggests that the patient does not have
diabetes. Defining a coding rule to replace this missing value with 0, or absent, will most likely
produce the least amount of misclassification error.
Steyerberg (2009, p.130-131) mentions this approach, “An alternative in such a situation might
be to change the definition of the predictor, i.e., by assuming that if no value is available from a
patient chart, the characteristic is absent rather than missing.”
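A minimal sketch of such a coding rule, assuming a hypothetical 0/1 comorbidity variable named
diabetes:

gen nmdiabetes = diabetes
replace nmdiabetes = 0 if diabetes == .   // never mentioned in the chart, so code as absent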
Replacing Missing Values with Mean, Median, or Mode
Before the more sophisticated imputation schemes were developed, it was common practice to
replace the missing value with a likely value, being the mean, median, or mode.
One criticism of this approach is that you artificially shrink the variance, since so many
observations will now have the average value.
Royston (2004) makes this criticism,
“Old-fashioned imputation typically replaced missing values with the mean or mode of
the nonmissing values for that variable. That approach is now regarded as inadequate.
For subsequent statistical inference to be valid, it is essential to inject the correct degree
of randomness into the imputations and to incorporate that uncertainty when computing
standard errors and confidence intervals for parameters of interest.”
One possible approach is to impute the missing value with a likely value, such as the median, and
then add a random residual back to the imputed value to maintain the correct standard error
(Harrell, 2001, pp.45-46).
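A minimal sketch of this idea for a hypothetical continuous variable x (drawing the residual from a
normal distribution with the observed standard deviation is only one of several possible choices):

sum x, detail
scalar xmed = r(p50)                             // median of the nonmissing values
scalar xsd  = r(sd)                              // standard deviation of the nonmissing values
gen nmx = x
replace nmx = xmed + rnormal(0, xsd) if x == .   // median plus a random residual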
Such a direct approach is not usually done, however, since the more widely accepted approaches
accomplish the same thing. Furthermore, if there are a lot of missing data, imputing with a likely
value might adversely affect the regression coefficient. The imputation methods of multiple
imputation and maximum likelihood not only provide the best standard errors, but also the best
regression coefficients.
What About Imputing the Outcome Variable?
It is common to discard subjects with a missing outcome variable, but imputing missing values of
the outcome variable frequently leads to more efficient estimates of the regression coefficients
when the imputation is based on the nonmissing predictor variables (Harrell, 2001, p.43).
Missing Value Indicator Approach
A historically popular approach in epidemiologic research was to use a missing value indicator,
which has a value of 1 if the variable is missing and 0 otherwise.
For example, given the following variable for gender,
1. male n = 50
2. female n = 40
Missing n = 10
we would recode this to two indicators, male and malemissing:
   Original gender variable      Male indicator      Malemissing indicator
   1. male      (n=50)                1 (n=50)                   0 (n=50)
   2. female    (n=40)                0 (n=40)                   0 (n=40)
    . missing   (n=10)                0 (n=10)                   1 (n=10)
and then include both indicator variables in the regression model. With this approach, the
missing value indicator is not interpreted or reported in an article; it simply acts as a placeholder
so that subjects with missing values are not dropped from the analysis.
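A minimal sketch of this recoding, assuming a hypothetical variable gender coded 1 = male,
2 = female, with some values missing:

gen male        = (gender == 1)      // 0 for females and for subjects with missing gender
gen malemissing = missing(gender)    // 1 only when gender is missing
* both indicators are then entered into the regression model together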
Greenland and Finkle (1995) suggest not using the missing value indicator approach,
“Epidemiologic studies often encounter missing covariate values. While simple methods
such as stratification on missing-data status, conditional-mean imputation, and complete-subject analysis are commonly employed for handling this problem, several studies have
shown that these methods can be biased under reasonable circumstances. The authors
review these results in the context of logistic regression and present simulation
experiments showing the limitations of the methods. The method based on missing-data
indicators can exhibit severe bias even when the data are missing completely at random,
and regression (conditional-mean) imputation can be inordinately sensitive to model
misspecification. Even complete-subject analysis can outperform these methods. More
sophisticated methods, such as maximum likelihood, multiple imputation, and weighted
estimating equations, have been given extensive attention in the statistics literature. While
these methods are superior to simple methods, they are not commonly used in
epidemiology, no doubt due to their complexity and the lack of packaged software to
apply these methods. The authors contrast the results of multiple imputation to simple
methods in the analysis of a case-control study of endometrial cancer, and they find a
meaningful difference in results for age at menarche. In general, the authors recommend
that epidemiologists avoid using the missing-indicator method and use more sophisticated
methods whenever a large proportion of data are missing.”
Huberman and Langholz (1999) later proposed the missing value indicator approach for matched
case-control studies, which was not specifically discussed in the Greenland and Finkle paper. Li
et al (2004) criticized the Huberman and Langholz proposal, stating that the approach does not
perform as well as Huberman and Langholz suggested.
Steyerberg (2009, pp.130-131) likewise advises not using the missing value indicator approach,
“…such a procedure ignores correlation of the values of predictors among each other.
Simulations have shown that the procedure may lead to severe bias in estimated
regression coefficients.155,295. The missing indicator should hence generally not be used.”
-----
155. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in
epidemiologic regression analysis. Am J Epidemiol 1995;142(12):1255-64.
295. Moons KG, Grobbee DE. Diagnostic studies as multivariable, prediction research. J
Epidemiol Community Health 2002;56(5):337-8.
Hotdeck Imputation (command “hotdeck”, contributed by Mander and Clayton)
With hot deck imputation, each missing value is replaced with the value from the most similar
case for which the variable is not missing.
Hotdeck imputation is available in Stata (you must update your Stata to get the user-written
command hotdeck). Instead of replacing only the missing values, however, this implementation
replaces the entire observation with an observation that has no missing data. Thus, both missing
and nonmissing variables for a subject are replaced, which does not seem very appealing. In
addition, the replacement is not with the most similar observation, which is the most appealing
feature of hotdeck imputation; instead, the replacement is with a random observation.
Royston (2004) criticizes the method,
“Hotdeck imputation was implemented in Stata in 1999 by Mander and Clayton.
However, this technique may perform poorly when many rows of data have at least one
missing value.”
Hotdeck Imputation (command “hotdeckvar”, contributed by Schonlau)
A better version of hotdeck is available on the Internet. Schonlau developed a Stata procedure to
perform a simple hotdeck imputation. Missing values are replaced by random values from the
same variable. Although this is not the “most similar case”, which true hotdeck imputation is
supposed to achieve, it is intuitively more appealing than the Mander and Clayton procedure.
Schonlau’s version can be found on his website (http://www.schonlau.net/statasoftware.html).
To install it, connect to the Internet and enter the following command in the Stata Command
Window:

net from http://www.schonlau.net/stata

and then click on the “hotdeckvar” link, or simply type:

net install hotdeckvar
To get help for this in Stata enter
help hotdeckvar
Article Statistical Methods Suggestion For Schonlau’s Hotdeck Imputation
If you wanted to use hotdeck imputation, you could use the following in your Statistical Methods
section:
Imputation for missing values was performed using the hotdeck procedure, where missing
values were replaced by random values from the same variable, using the Schonlau
implementation for the Stata software (Schonlau, 2006). Hotdeck imputation has the
advantage of being simple to use, it preserves the distributional characteristics of the
variable, and performs nearly as well as the more sophisticated imputation approaches
(Roth, 1994).
Multiple Imputation
An excellent website discussing multiple imputation is: http://www.multiple-imputation.com/,
and then click on the What is MI? link. This website was created by S. van Buuren, one of the
authors of the MICE method (Multiple Imputation by Chained Equations). The multiple
imputation routine available in Stata uses the MICE method.
Multiple imputation was implemented in Stata by Royston (2004). It was updated by Royston
(2005a), then updated again by Royston (2005b), and again by Royston (2007), and again by
Carlin, Galati, and Royston (2008), and so on….
Royston (2004) describes the method:
“This article describes an implementation for Stata of the MICE method of multiple
multivariate imputation described by van Buuren, Boshuizen, and Knook (1999).
MICE stands for multivariate imputation by chained equations. The basic idea of
data analysis with multiple imputation is to create a small number (e.g., 5–10)
of copies of the data, each of which has the missing values suitably imputed, and
analyze each complete dataset independently. Estimates of parameters of interest
are averaged across the copies to give a single estimate. Standard errors are
computed according to the “Rubin rules”, devised to allow for the between- and
within-imputation components of variation in the parameter estimates.”
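For reference, the “Rubin rules” mentioned in this quotation are the standard pooling formulas
(stated here for convenience; they are not specific to Stata). If Q_i is the estimate of a parameter
from the i-th completed dataset and U_i is its squared standard error, then

\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} Q_i , \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i , \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m} (Q_i - \bar{Q})^2 , \qquad
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr) B ,

where Q-bar is the pooled estimate, U-bar the within-imputation variance, B the between-imputation
variance, and the square root of T the multiple-imputation standard error.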
This is the best imputation approach available in Stata, and so is the recommended approach if
the proportion of missing data is greater than 5% (see Harrell guideline below). If you use it and
want to say you did in your article, you would report it as:
Article Statistical Methods Section Suggestion
If you wanted to use multiple imputation, you could use the following in your Statistical Methods
section:
Missing data were imputed using the method of multiple multivariate imputation
described by van Buuren, et al (1999) as implemented in the STATA software (Royston,
2004).
We will practice with this method below.
The Stata commands for multiple imputation assume data to be missing at random (MAR) or
missing completely at random (MCAR).
Exercise: Look at the Vandenbroucke et al (2007) STROBE statement suggestion for reporting
your missing value imputation approach.
On page W-176 under the heading “12(c) Explain how missing data were addressed”, they give
an example description for your Statistical Methods section, an explanation, and in Box 6 on the
next page, W-177, they give details about imputation in general, which are consistent with the
presentation in this chapter.
They give this example (a clinical paper, their reference 106) for explaining in your manuscript
how imputation was done,
“Our missing data analysis procedures used missing at random (MAR) assumptions. We
used the MICE (multivariate imputation by chained equations) method of multiple
multivariate imputation in STATA. We independently analyzed 10 copies of the data,
each with missing values suitably imputed, in the multivariate logistic regression
analyses. We average estimates of the variables to give a single mean estimate and
adjusted standard errors according to Rubin’s rules” (106).
------
106. Chandola T, Brunner E, Marmot M. Chronic stress at work and the metabolic syndrome:
prospective study. BMJ 2006;332:521-5. PMID: 16428252
This paragraph is very similar to the Royston (2004) description on the previous page.
Some Crude Guidelines
Harrell provides these crude guidelines (2001, p.49):
“Proportion of missings ≤ 0.05: It doesn’t matter very much how you impute missings
or whether you adjust variance of regression coefficient estimates for having imputed data
in this case. For continuous variables imputing missings with the median nonmissing
value is adequate; for categorical predictors the most frequent category can be used.
Complete case analysis is an option here.”

“Proportion of missings 0.05 to 0.15: If a predictor is unrelated to all of the other
predictors, imputations can be done the same as the above (i.e., impute a reasonable
constant value). If the predictor is correlated with other predictors, develop a customized
model (or have the transcan function [available for S-Plus from Harrell’s website] do it for
you) to predict the predictor from all of the other predictors. Then impute missings with
predicted values. For categorical variables, classification trees are good methods for
developing customized imputation models. For continuous variables, ordinary regression
can be used if the variable in question does not require a nonmonotonic transformation to
be predicted from the other variables. For either the related or unrelated predictor case,
variances may need to be adjusted for imputation. Single imputation is probably OK here,
but multiple imputation doesn’t hurt.”

“Proportion of missings > 0.15: This situation requires the same considerations as in
the previous case, and adjusting variances for imputation is even more important. To
estimate the strength of the effect of a predictor that is frequently missing, it may be
necessary to refit the model on the subset of observations for which that predictor is not
missing, if Y is not used for imputation. Multiple imputation is preferred for most
models.”
Counting Number of Missing Values
A quick way to see how many missing values you have in your variables is to use the nmissing or
npresent commands.
First, you have to update your Stata to add them, which you can do with the following command
while connected to the Internet.
findit nmissing
SJ-5-4   dm67_3  . . . . . . . . . .  Software update for nmissing and npresent
         (help nmissing if installed) . . . . . . . . . . . . . . . . N. J. Cox
         Q4/05   SJ 5(4):607
         now produces saved results
Click on the dm67_3 link to install, which gives you:
INSTALLATION FILES
dm67_3/nmissing.ado
dm67_3/nmissing.hlp
dm67_3/npresent.ado
dm67_3/npresent.hlp
(click here to install)
and then click on the “(click here to install)” link.
Having done that, you can use the following to see the number of missing values for each
variable:
nmissing
Alternatively, you can use the following to see the number of nonmissing values for each
variable:
npresent
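If you would rather not install user-written commands, and you have Stata 11 or later (an
assumption about your version), the official misstable command gives similar information:

misstable summarize          // number of missing values for each variable
count if missing(varname)    // count missings for a single variable (replace varname with yours)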
Stata Practice Imputing with Likely Value
Bringing in some data to practice with
File
Open
Find the directory where you copied the CD
Change to the subdirectory datasets & do-files
Single click on births_with_missing.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
births_with_missing.dta", clear
(the use command must be entered all on one line), or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use births_with_missing, clear
Looking at the data
sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       500       250.5    144.4818          1        500
     bweight |       478    3137.253     637.777        628       4553
       lowbw |       478    .1213389    .3268628          0          1
     gestwks |       447    38.79617     2.14174      26.95      43.16
     preterm |       447    .1230425     .328854          0          1
-------------+--------------------------------------------------------
      matage |       485    34.05979    3.905724         23         43
         hyp |       478    .1401674    .3475243          0          1
         sex |       459    1.479303    .5001165          1          2
     sexalph |         0
We have (500-478)/500, or 4.4%, missing for the dichotomous variable hyp. Using Harrell’s
guideline for proportion missing ≤ 5%, we can simply impute those missings with the modal
value of the variable.
Here we might use a naming convention of “nm” for “no missing” prefixed onto the variable
name, so that we can easily recognize the original variable and the imputed variable.
capture drop nmhyp
tab hyp
// look at output to see that modal value is 0
gen nmhyp=hyp
replace nmhyp=0 if hyp==.
tab nmhyp hyp , missing
. tab hyp

  hypertens |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        411       85.98       85.98
          1 |         67       14.02      100.00
------------+-----------------------------------
      Total |        478      100.00

. tab nmhyp hyp , missing

           |           hypertens
     nmhyp |         0          1          . |     Total
-----------+---------------------------------+----------
         0 |       411          0         22 |       433
         1 |         0         67          0 |        67
-----------+---------------------------------+----------
     Total |       411         67         22 |       500
Alternatively, to make your do-file more automated, so that you don’t have to change these lines
every time you add new subjects to your dataset, you could use:
capture drop nmhyp
gen nmhyp=hyp
count if hyp==0 // returns the #0's in r(N)
scalar count0=r(N) // store #0’s in count0
count if hyp==1 // returns the #1's in r(N)
scalar count1=r(N) // store #1’s in count1
replace nmhyp =0 if (hyp ==.) & (count0>=count1)
replace nmhyp =1 if (hyp ==.) & (count0<count1)
tab nmhyp hyp , missing
Here, we used the count command to count the number of occurrences of 0 and 1 and stored
these in scalars (variables with one element). Then we imputed with the most frequent category.
. tab nmhyp hyp , missing

           |           hypertens
     nmhyp |         0          1          . |     Total
-----------+---------------------------------+----------
         0 |       411          0         22 |       433
         1 |         0         67          0 |        67
-----------+---------------------------------+----------
     Total |       411         67         22 |       500
For the continuous variable matage, we have (500-485)/500, or 3%, missing. Using Harrell’s
guideline for proportion missing ≤ 5%, we can simply impute those missings with the median of
the variable. We can use either of the following two commands to discover the median.
centile matage, centile(50)
sum matage, detail
. centile matage, centile(50)

                                                       -- Binom. Interp. --
    Variable |      Obs  Percentile      Centile       [95% Conf. Interval]
-------------+-------------------------------------------------------------
      matage |      485          50           34             34          35

. sum matage, detail

                         maternal age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           25             23
 5%           27             24
10%           29             25       Obs                 485
25%           31             25       Sum of Wgt.         485

50%           34                      Mean           34.05979
                        Largest       Std. Dev.      3.905724
75%           37             42
90%           39             43       Variance       15.25468
95%           40             43       Skewness       -.241381
99%           42             43       Kurtosis       2.483957
To impute with the median of 34, we could use
capture drop nmmatage
gen nmmatage=matage
replace nmmatage=34 if matage==.
sum nmmatage
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    nmmatage |       500      34.058    3.846587         23         43
Alternatively, to make your do-file more automated, so that you don’t have to change these lines
every time you add new subjects to your dataset, you could use:
capture drop nmmatage
gen nmmatage=matage
sum matage, detail
*return list // to discover median is r(p50)
replace nmmatage=r(p50) if matage==.
sum nmmatage
After running the sum command followed by “return list” (without the “*”), we discover that the
median is stored in the saved result r(p50). We can then simply delete the return list line, or
comment it out using the asterisk.
Suppose the missing data are the Missing at random (MAR) case (see page 2). A better “likely
value” can be obtained with regression.
For the gestwks variable, there are (500-447)/500, or 10.6% missing. Although multiple
imputation would be better, since it adjusts the variability, let’s try the regression approach just to
see how it works. Later on, we will compare the result to multiple imputation.
To find out what other predictors are correlated with gestwks, we can use
corr bweight-sex
             |  bweight    lowbw  gestwks  preterm   matage      hyp      sex
-------------+----------------------------------------------------------------
     bweight |   1.0000
       lowbw |  -0.7083   1.0000
     gestwks |   0.6969  -0.6045   1.0000
     preterm |  -0.5580   0.5653  -0.7374   1.0000
      matage |   0.0260  -0.0268   0.0335  -0.0039   1.0000
         hyp |  -0.1923   0.1297  -0.1796   0.1297  -0.0527   1.0000
         sex |  -0.1654   0.0661  -0.0397   0.0500  -0.0351  -0.0771   1.0000
We see that lowbw, preterm, hyp, and sex are correlated with gestwks. Using Stata’s impute
command,
capture drop nmgestwks
impute gestwks lowbw preterm hyp sex , gen(nmgestwks)
sum gestwks nmgestwks
We see that the imputed variable has a mean very similar to the original variable, but that the
standard deviation is smaller.
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     gestwks |       447    38.79617     2.14174      26.95      43.16
   nmgestwks |       500    38.78908    2.082616      26.95      43.16
This always happens, and it is why we find the phrase “variances may need to be adjusted for
imputation” in Harrell’s “Proportion of missings 0.05 to 0.15” rule above. The diminished
variance may lead to erroneous statistical significance. The impute command does not add back
in random variability to the imputed values.
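If you want a quick sense of what putting the variability back would involve, a rough sketch along
the lines Harrell describes is to add a random draw from the residual distribution to each
regression-imputed value. This is only an illustration, not a substitute for multiple imputation,
which handles the uncertainty properly; the variable names gesthat and nmgestwks2 are arbitrary:

regress gestwks lowbw preterm hyp sex
predict gesthat, xb                 // linear prediction (missing if any predictor is missing)
gen nmgestwks2 = gestwks
replace nmgestwks2 = gesthat + rnormal(0, e(rmse)) if gestwks == .
sum gestwks nmgestwks nmgestwks2
drop gesthat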
Stata Practice Imputing with Hotdeck Imputation
First, we must update our Stata to include hotdeck imputation, as described in the hotdeckvar
section above, if we have not already done so.
net from http://www.schonlau.net/stata
net install hotdeckvar
We will use a small dataset so we can keep track of what is happening.
use births_miss_small, clear
list
     +-----------------------------------------+
     | id   bweight   lowbw   gestwks   matage |
     |-----------------------------------------|
  1. |  1      2620       0     38.15       35 |
  2. |  2      3751       0      39.8       31 |
  3. |  3      3200       0     38.89       33 |
  4. |  4      3673       0         .        . |
  5. |  5         .       .     38.97       35 |
     |-----------------------------------------|
  6. |  6      3001       0     41.02       38 |
  7. |  7      1203       1         .        . |
  8. |  8      3652       0         .        . |
  9. |  9      3279       0     39.35       30 |
 10. | 10      3007       0         .        . |
     |-----------------------------------------|
 11. | 11      2887       0      38.9       28 |
 12. | 12         .       .     40.03       27 |
 13. | 13      3375       0         .       36 |
 14. | 14      2002       1     36.48       37 |
 15. | 15      2213       1     37.68       39 |
     +-----------------------------------------+
To see the help for hotdeckvar, use
help hotdeckvar
To impute all the missing values in this dataset, we use the following. We first set the seed to the
random number generator so that we get the same imputed values each time we run the
hotdeckvar command.
set seed 999 // otherwise imputed variables change
hotdeckvar bweight lowbw gestwks matage, suffix("_hot")
list bweight_hot-matage_hot, abbrev(15)
     +----------------------------------------------------+
     | bweight_hot   lowbw_hot   gestwks_hot   matage_hot |
     |----------------------------------------------------|
  1. |        2620           0         38.15           35 |
  2. |        3751           0          39.8           31 |
  3. |        3200           0         38.89           33 |
  4. |        3673           0         37.68           39 |
  5. |        2620           0         38.97           35 |
     |----------------------------------------------------|
  6. |        3001           0         41.02           38 |
  7. |        1203           1          39.8           31 |
  8. |        3652           0         38.89           33 |
  9. |        3279           0         39.35           30 |
 10. |        3007           0         41.02           38 |
     |----------------------------------------------------|
 11. |        2887           0          38.9           28 |
 12. |        3200           0         40.03           27 |
 13. |        3375           0          39.8           36 |
 14. |        2002           1         36.48           37 |
 15. |        2213           1         37.68           39 |
     +----------------------------------------------------+
For analysis, we simply use these new imputed variables.
Stata Practice Imputing with Multiple Imputation
A significant improvement was made to multiple imputation in Stata version 11. There is now a
Stata manual, Multiple Imputation, available. To see it, click on Help on the Stata menu bar,
then click on PDF Documentation. It is all done with the mi command.
As we will see below, the Version 11 commands do not fill in all of the missing values, whereas
the Version 10 approach obtains an estimate for each missing value.
Version 10: Multiple Imputation
If you have Stata Version 11, do not do this section. Just look at the process it went through, and
then go on to the Version 11 section.
First, we must update our Stata to include the multiple imputation commands: ice and
micombine. Use
findit ice
which takes you to a help screen where several versions of these commands are:
First, click on the st0067_3 link:
SJ-7-4   st0067_3  . . . .  Multiple imputation of missing values: Update of ice
         (help ice, ice_reformat, micombine, uvis if installed) . .  P. Royston
         Q4/07   SJ 7(4):445--464
         update of ice allowing imputation of left-, right-, or
         interval-censored observations
which brings up the following:
INSTALLATION FILES
st0067_3/ice.ado
st0067_3/ice.hlp
st0067_3/ice_reformat.ado
st0067_3/ice_reformat.hlp
st0067_3/micombine.ado
st0067_3/micombine.hlp
st0067_3/uvis.ado
st0067_3/uvis.hlp
st0067_3/cmdchk.ado
st0067_3/nscore.ado
(click here to install)
Second, click on the click here to install link:
Third, click on the st0067_4 link:
SJ-9-3   st0067_4  .  Mult. imp.: update of ice, with emphasis on cat. variables
         (help ice, uvis if installed) . . . . . . . . . . . . . . . P. Royston
         Q3/09   SJ 9(3):466--477
         update of ice package with emphasis on categorical
         variables; clarifies relationship between ice and mi
which brings up the following:
INSTALLATION FILES
(click here to install)
st0067_4/ice.ado
st0067_4/ice.hlp
st0067_4/ice_.ado
st0067_4/uvis.ado
st0067_4/uvis.hlp
followed by clicking on the “(click here to to install)” link.
Fourth, click on the click here to install link:
In this chapter, the ice and micombine commands are used. The multiple imputation procedure
for Stata version 10 was later updated to include the mim and mimstack commands. The old
commands seemed easier to use, and should give the same result, so this chapter still teaches the
ice and micombine commands.
Again, we will use a small dataset so we can keep track of what is happening.
use births_miss_small, clear
list
     +-----------------------------------------+
     | id   bweight   lowbw   gestwks   matage |
     |-----------------------------------------|
  1. |  1      2620       0     38.15       35 |
  2. |  2      3751       0      39.8       31 |
  3. |  3      3200       0     38.89       33 |
  4. |  4      3673       0         .        . |
  5. |  5         .       .     38.97       35 |
     |-----------------------------------------|
  6. |  6      3001       0     41.02       38 |
  7. |  7      1203       1         .        . |
  8. |  8      3652       0         .        . |
  9. |  9      3279       0     39.35       30 |
 10. | 10      3007       0         .        . |
     |-----------------------------------------|
 11. | 11      2887       0      38.9       28 |
 12. | 12         .       .     40.03       27 |
 13. | 13      3375       0         .       36 |
 14. | 14      2002       1     36.48       37 |
 15. | 15      2213       1     37.68       39 |
     +-----------------------------------------+
The Stata commands we will use are explained fully in the Royston (2005a, 2005b, 2007) articles,
which are updated versions of the routines explained fully in the Royston (2004) article. These
articles, as with any Stata Journal article that is at least two years old, can be downloaded at no
cost from the Stata Corporation website, if you want them. More recent articles must be paid for.
http://www.stata-journal.com/archives.html
With listwise deletion of missing values, we get the following linear regression.
regress bweight gestwks matage
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------
van Buuren (1999, p.686) describes the multiple imputation method:
“Multiple imputation will be applied to account for the non-response. The main tasks to
be accomplished in multiple imputation are:
1. Specify the posterior predictive density p(Ymis|X,R), where X is a set of predictor
variables, given the non-response mechanism p(R|Y,Z) and the complete data model
p(Y,Z).
2. Draw imputations from this density to produce m complete data sets.
3. Perform m complete-data analyses (Cox regression in our case) on each completed data
matrix.
4. Pool the m analyses results into final point and variance estimates.
Simulation studies have shown that the required number of repeated imputations m can be
as low as three for data with 20 per cent of missing entries.10 In the following we use
m=5, which is a conservative choice.”
Next we will obtain the multiple imputation solution. First we use the ice command, which
imputes missing values in the mainvarlist m times by using “switching regression”, an iterative
multivariate regression technique. The imputed and non-imputed variables are stored in a new
file called filename.dta.
The syntax is:
ice mainvarlist [if] [in] [weight] [, boot [(varlist)]
cc(ccvarlist) cmd(cmdlist) cycles(#) dropmissing dryrun eq(eqlist)
genmiss(string) id(string) m(#) interval(intlist) match[(varlist)]
noconstant nopp noshoweq nowarning on(varlist) orderasis
passive(passivelist) replace saving(filename [, replace]) seed(#)
substitute(sublist) trace(filename)
We use
ice bweight gestwks matage , m(5) genmiss(nm) ///
saving(bwtimp, replace) seed(888)
To see what this did, we can use
use bwtimp, clear
list
     +------------------------------------------------------------------------------------------+
     | id   bweight   lowbw    gestwks    matage   _mi   _mj   nmbwei~t   nmgest~s   nmmatage |
     |------------------------------------------------------------------------------------------|
  1. |  1      2620       0      38.15        35     1     0          .          .          . |
  2. |  2      3751       0       39.8        31     2     0          .          .          . |
  3. |  3      3200       0      38.89        33     3     0          .          .          . |
  4. |  4      3673       0          .         .     4     0          .          .          . |
  5. |  5         .       .      38.97        35     5     0          .          .          . |
     |------------------------------------------------------------------------------------------|
  6. |  6      3001       0      41.02        38     6     0          .          .          . |
  7. |  7      1203       1          .         .     7     0          .          .          . |
  8. |  8      3652       0          .         .     8     0          .          .          . |
  9. |  9      3279       0      39.35        30     9     0          .          .          . |
 10. | 10      3007       0          .         .    10     0          .          .          . |
     |------------------------------------------------------------------------------------------|
 11. | 11      2887       0       38.9        28    11     0          .          .          . |
 12. | 12         .       .      40.03        27    12     0          .          .          . |
 13. | 13      3375       0          .        36    13     0          .          .          . |
 14. | 14      2002       1      36.48        37    14     0          .          .          . |
 15. | 15      2213       1      37.68        39    15     0          .          .          . |
     |------------------------------------------------------------------------------------------|
 16. |  1      2620       0      38.15        35     1     1          0          0          0 |
 17. |  2      3751       0       39.8        31     2     1          0          0          0 |
 18. |  3      3200       0      38.89        33     3     1          0          0          0 |
 19. |  4      3673       0     39.035   30.4696     4     1          0          1          1 |
 20. |  5   2573.08       .      38.97        35     5     1          1          0          0 |
     |------------------------------------------------------------------------------------------|
 21. |  6      3001       0      41.02        38     6     1          0          0          0 |
 22. |  7      1203       1   37.32458   38.3812     7     1          0          1          1 |
 23. |  8      3652       0   40.36074   30.7336     8     1          0          1          1 |
 24. |  9      3279       0      39.35        30     9     1          0          0          0 |
 25. | 10      3007       0   38.58094   35.3263    10     1          0          1          1 |
     |------------------------------------------------------------------------------------------|
 26. | 11      2887       0       38.9        28    11     1          0          0          0 |
 27. | 12    3678.1       .      40.03        27    12     1          1          0          0 |
 28. | 13      3375       0   40.13427        36    13     1          0          1          0 |
 29. | 14      2002       1      36.48        37    14     1          0          0          0 |
 30. | 15      2213       1      37.68        39    15     1          0          0          0 |
     |------------------------------------------------------------------------------------------|
 31. |  1      2620       0      38.15        35     1     2          0          0          0 |
 32. |  2      3751       0       39.8        31     2     2          0          0          0 |
 33. |  3      3200       0      38.89        33     3     2          0          0          0 |
 34. |  4      3673       0   39.02037   28.6181     4     2          0          1          1 |
 35. |  5   2663.96       .      38.97        35     5     2          1          0          0 |
     |------------------------------------------------------------------------------------------|
 36. |  6      3001       0      41.02        38     6     2          0          0          0 |
 37. |  7      1203       1   36.64371   39.5586     7     2          0          1          1 |
 38. |  8      3652       0   39.96529   32.5712     8     2          0          1          1 |
 39. |  9      3279       0      39.35        30     9     2          0          0          0 |
 40. | 10      3007       0   39.37796   31.3312    10     2          0          1          1 |
     |------------------------------------------------------------------------------------------|
 41. | 11      2887       0       38.9        28    11     2          0          0          0 |
 42. | 12   3949.46       .      40.03        27    12     2          1          0          0 |
 43. | 13      3375       0   40.42688        36    13     2          0          1          0 |
 44. | 14      2002       1      36.48        37    14     2          0          0          0 |
 45. | 15      2213       1      37.68        39    15     2          0          0          0 |
     |------------------------------------------------------------------------------------------|
 46. |  1      2620       0      38.15        35     1     3          0          0          0 |
 47. |  2      3751       0       39.8        31     2     3          0          0          0 |
 48. |  3      3200       0      38.89        33     3     3          0          0          0 |
 49. |  4      3673       0   40.09019   22.9542     4     3          0          1          1 |
 50. |  5   2521.11       .      38.97        35     5     3          1          0          0 |
     |------------------------------------------------------------------------------------------|
 51. |  6      3001       0      41.02        38     6     3          0          0          0 |
 52. |  7      1203       1   35.87934    42.478     7     3          0          1          1 |
 53. |  8      3652       0    41.6578   34.7143     8     3          0          1          1 |
 54. |  9      3279       0      39.35        30     9     3          0          0          0 |
 55. | 10      3007       0   41.21783   42.4263    10     3          0          1          1 |
     |------------------------------------------------------------------------------------------|
 56. | 11      2887       0       38.9        28    11     3          0          0          0 |
 57. | 12   3544.44       .      40.03        27    12     3          1          0          0 |
 58. | 13      3375       0   41.06176        36    13     3          0          1          0 |
 59. | 14      2002       1      36.48        37    14     3          0          0          0 |
 60. | 15      2213       1      37.68        39    15     3          0          0          0 |
     |------------------------------------------------------------------------------------------|
 61. |  1      2620       0      38.15        35     1     4          0          0          0 |
 62. |  2      3751       0       39.8        31     2     4          0          0          0 |
 63. |  3      3200       0      38.89        33     3     4          0          0          0 |
 64. |  4      3673       0   39.85606   25.5008     4     4          0          1          1 |
 65. |  5   3028.08       .      38.97        35     5     4          1          0          0 |
     |------------------------------------------------------------------------------------------|
 66. |  6      3001       0      41.02        38     6     4          0          0          0 |
 67. |  7      1203       1   33.76732   29.3472     7     4          0          1          1 |
 68. |  8      3652       0   41.02277   31.6178     8     4          0          1          1 |
 69. |  9      3279       0      39.35        30     9     4          0          0          0 |
 70. | 10      3007       0   37.80869   17.7524    10     4          0          1          1 |
     |------------------------------------------------------------------------------------------|
 71. | 11      2887       0       38.9        28    11     4          0          0          0 |
 72. | 12   3761.25       .      40.03        27    12     4          1          0          0 |
 73. | 13      3375       0   38.98916        36    13     4          0          1          0 |
 74. | 14      2002       1      36.48        37    14     4          0          0          0 |
 75. | 15      2213       1      37.68        39    15     4          0          0          0 |
     |------------------------------------------------------------------------------------------|
 76. |  1      2620       0      38.15        35     1     5          0          0          0 |
 77. |  2      3751       0       39.8        31     2     5          0          0          0 |
 78. |  3      3200       0      38.89        33     3     5          0          0          0 |
 79. |  4      3673       0   39.61145    24.598     4     5          0          1          1 |
 80. |  5   2581.58       .      38.97        35     5     5          1          0          0 |
     |------------------------------------------------------------------------------------------|
 81. |  6      3001       0      41.02        38     6     5          0          0          0 |
 82. |  7      1203       1   35.71356   51.1944     7     5          0          1          1 |
 83. |  8      3652       0   42.04614   37.6994     8     5          0          1          1 |
 84. |  9      3279       0      39.35        30     9     5          0          0          0 |
 85. | 10      3007       0   39.53621   32.1937    10     5          0          1          1 |
     |------------------------------------------------------------------------------------------|
 86. | 11      2887       0       38.9        28    11     5          0          0          0 |
 87. | 12   3228.02       .      40.03        27    12     5          1          0          0 |
 88. | 13      3375       0   41.59284        36    13     5          0          1          0 |
 89. | 14      2002       1      36.48        37    14     5          0          0          0 |
 90. | 15      2213       1      37.68        39    15     5          0          0          0 |
     +------------------------------------------------------------------------------------------+
In this file, the original data are shown on the first 15 lines, followed by 75 lines (5 x 15) with
imputed values.
The ice command stored 5 sets of imputed data, as specified with the m(5) option. This
generated missing value indicators, which begin with “nm” (abbreviation for “nonmissing”, but
you can use anything you like) as we specified in the genmiss(nm) option. Notice that no
imputation occurred for the variable lowbw, which was left out of the ice command. The
imputed values are inserted in the original variable names in this file, but not in the data we have
in Stata memory.
You can spot the imputed values, because they have more decimal places.
Whereas van Buuren describes this “regression switching” as,
“1. Specify the posterior predictive density p(Ymis|X,R), where X is a set of predictor
variables, given the non-response mechanism p(R|Y,Z) and the complete data model
p(Y,Z).”
Royston (2004, p. 232) explains this regression switching as,
“The algorithm is a type of Gibbs sampler in which the distribution of missing values of a
covariate is sampled conditional on the distribution of the remaining covariates. Each
variable in mainvarlist becomes in turn the response variable.”
This format is suitable for the micombine command, which will be used next.
The syntax for the micombine command is:
micombine regression_cmd [yvar][covarlist] [if][in][weight]
[, br detail eform(string) genxb(newvarname) impid(varname) lrr
noconstant obsid(varname) svy[(svy_options)] regression_cmd_options]
where regression cmd includes clogit, cnreg, glm, logistic, logit, mlogit, nbreg, ologit, oprobit,
poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee. Other regression cmds will work but
not all have been tested by the author. All weight types supported by regression cmd are allowed.
First, the data file created with the ice command must be loaded into memory, if they are not
already there. The regression_cmd portion of the micombine is the same regression command
we used before we imputed the data.
use bwtimp, clear
micombine regress bweight gestwks matage
Multiple imputation parameter estimates (5 imputations)
------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   324.7973   72.26509     4.49   0.001     167.3452    482.2494
      matage |  -64.41396    30.7788    -2.09   0.058    -131.4752    2.647278
       _cons |  -7567.114   2975.775    -2.54   0.026    -14050.77   -1083.457
------------------------------------------------------------------------------
15 observations.
This linear regression model represents the combined (a type of pooling) estimates from fitting
the model to the five imputed datasets.
This model can be compared to the listwise deletion model we fitted above.
Listwise Deletion
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------
Multiple Imputation
Multiple imputation parameter estimates (5 imputations)
------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   324.7973   72.26509     4.49   0.001     167.3452    482.2494
      matage |  -64.41396    30.7788    -2.09   0.058    -131.4752    2.647278
       _cons |  -7567.114   2975.775    -2.54   0.026    -14050.77   -1083.457
------------------------------------------------------------------------------
15 observations.
The large discrepancies reflect the instability of small samples, where a single value can change the
model dramatically.
For a fairer comparison, we will use the original large dataset and try it again.
use births_with_missing, clear
regress bweight gestwks matage
ice bweight gestwks matage , m(5) genmiss(nm) ///
saving(bwtimp, replace) seed(888)
use bwtimp, clear
micombine regress bweight gestwks matage
Listwise Deletion
      Source |       SS       df       MS              Number of obs =     425
-------------+------------------------------           F(  2,   422) =  206.17
       Model |  79309893.1     2  39654946.6           Prob > F      =  0.0000
    Residual |  81166170.7   422  192336.898           R-squared     =  0.4942
-------------+------------------------------           Adj R-squared =  0.4918
       Total |   160476064   424  378481.283           Root MSE      =  438.56

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   201.6962   9.945484    20.28   0.000     182.1473     221.245
      matage |   .5083184   5.513141     0.09   0.927    -10.32832    11.34496
       _cons |   -4692.07   422.0691   -11.12   0.000    -5521.689     -3862.45
------------------------------------------------------------------------------
Multiple Imputation
Multiple imputation parameter estimates (5 imputations)
------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   208.7919   9.449999    22.09   0.000     190.2251    227.3588
      matage |   .1029839   5.825347     0.02   0.986    -11.34236    11.54833
       _cons |   -4964.06   402.0698   -12.35   0.000    -5754.026   -4174.094
------------------------------------------------------------------------------
500 observations (imputation 1).
We see the models are much closer. In this case, we had 75 observations (15%) missing in the
first model.
Just for fun, let’s compare these models to median and mean imputed models.
* -- median imputed model
use births_with_missing, clear
*
capture drop nmbweight
gen nmbweight=bweight
centile bweight, centile(50)
replace nmbweight=r(c_1) if nmbweight==.   // r(c_1) found using "return list"
*
capture drop nmgestwks
gen nmgestwks=gestwks
centile gestwks, centile(50)
replace nmgestwks=r(c_1) if nmgestwks==.
*
capture drop nmmatage
gen nmmatage=matage
centile matage, centile(50)
replace nmmatage=r(c_1) if nmmatage==.
*
regress nmbweight nmgestwks nmmatage
* -- mean imputed model
use births_with_missing, clear
*
capture drop nmbweight
gen nmbweight=bweight
sum bweight
replace nmbweight=r(mean) if nmbweight==.
*
capture drop nmgestwks
gen nmgestwks=gestwks
sum gestwks
replace nmgestwks=r(mean) if nmgestwks==.
*
capture drop nmmatage
gen nmmatage=matage
sum matage
replace nmmatage=r(mean) if nmmatage==.
*
regress nmbweight nmgestwks nmmatage
1) Listwise deletion model
      Source |       SS       df       MS              Number of obs =     425
-------------+------------------------------           F(  2,   422) =  206.17
       Model |  79309893.1     2  39654946.6           Prob > F      =  0.0000
    Residual |  81166170.7   422  192336.898           R-squared     =  0.4942
-------------+------------------------------           Adj R-squared =  0.4918
       Total |   160476064   424  378481.283           Root MSE      =  438.56

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   201.6962   9.945484    20.28   0.000     182.1473     221.245
      matage |   .5083184   5.513141     0.09   0.927    -10.32832    11.34496
       _cons |   -4692.07   422.0691   -11.12   0.000    -5521.689     -3862.45
------------------------------------------------------------------------------
2) Multiple imputation model
Multiple imputation parameter estimates (5 imputations)
------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   208.7919   9.449999    22.09   0.000     190.2251    227.3588
      matage |   .1029839   5.825347     0.02   0.986    -11.34236    11.54833
       _cons |   -4964.06   402.0698   -12.35   0.000    -5754.026   -4174.094
------------------------------------------------------------------------------
500 observations (imputation 1).
3) Median imputed model
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  2,   497) =  153.67
       Model |  74160695.6     2  37080347.8           Prob > F      =  0.0000
    Residual |   119927799   497  241303.418           R-squared     =  0.3821
-------------+------------------------------           Adj R-squared =  0.3796
       Total |   194088495   499  388954.899           Root MSE      =  491.23

------------------------------------------------------------------------------
   nmbweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   nmgestwks |   190.0368    10.8495    17.52   0.000     168.7203    211.3534
    nmmatage |   .2067422   5.721325     0.04   0.971    -11.03422    11.44771
       _cons |  -4247.992   457.7167    -9.28   0.000     -5147.29   -3348.694
------------------------------------------------------------------------------
4) Mean imputed model
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  2,   497) =  158.56
       Model |  75577258.4     2  37788629.2           Prob > F      =  0.0000
    Residual |   118447042   497  238324.028           R-squared     =  0.3895
-------------+------------------------------           Adj R-squared =  0.3871
       Total |   194024300   499  388826.253           Root MSE      =  488.18

------------------------------------------------------------------------------
   nmbweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   nmgestwks |   192.1832   10.80209    17.79   0.000     170.9599    213.4066
    nmmatage |   .2557455    5.68614     0.04   0.964    -10.91609    11.42758
       _cons |  -4327.432   454.9994    -9.51   0.000    -5221.391   -3433.473
------------------------------------------------------------------------------
They are all pretty close to each other.
Version 11: Multiple Imputation
Again, we will use a small dataset so we can keep track of what is happening.
use births_miss_small, clear
list
     +-----------------------------------------+
     | id   bweight   lowbw   gestwks   matage |
     |-----------------------------------------|
  1. |  1      2620       0     38.15       35 |
  2. |  2      3751       0      39.8       31 |
  3. |  3      3200       0     38.89       33 |
  4. |  4      3673       0         .        . |
  5. |  5         .       .     38.97       35 |
     |-----------------------------------------|
  6. |  6      3001       0     41.02       38 |
  7. |  7      1203       1         .        . |
  8. |  8      3652       0         .        . |
  9. |  9      3279       0     39.35       30 |
 10. | 10      3007       0         .        . |
     |-----------------------------------------|
 11. | 11      2887       0      38.9       28 |
 12. | 12         .       .     40.03       27 |
 13. | 13      3375       0         .       36 |
 14. | 14      2002       1     36.48       37 |
 15. | 15      2213       1     37.68       39 |
     +-----------------------------------------+
With listwise deletion of missing values, we get the following linear regression.
regress bweight gestwks matage
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------
First, we declare the data to be mi (multiple imputation) data, requesting the marginal long style
(mlong) because it is the most memory-efficient of the available styles.
mi set mlong
Second, we register our variables, using “imputed” if to be imputed.
mi register imputed bweight gestwks matage
Third, we impute the missing values for bweight, using the linear regression approach (like
Stata’s impute command), where the missing value is predicted by gestwks and matage. We set
the random number seed so we can duplicate our work and asked for 5 imputations.
mi impute regress bweight gestwks matage, rseed(888) add(5)
note: variables gestwks matage registered as imputed and used to model variable
      bweight; this may cause some observations to be omitted from the estimation
      and may lead to missing imputed values

Univariate imputation                       Imputations =        5
Linear regression                                 added =        5
Imputed: m=1 through m=5                        updated =        0

                   |               Observations per m
                   |----------------------------------------------
          Variable |   complete   incomplete   imputed |     total
-------------------+-----------------------------------+----------
           bweight |         13            2         2 |        15
-------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled in observations.)

Note: right-hand-side variables (or weights) have missing values;
      model parameters estimated using listwise deletion
We were able to fill in the two missing values for bweight.
Fourth, we impute the missing values for gestwks, using the linear regression approach predicting
from matage.
mi impute regress gestwks matage, rseed(999) add(5)
gestwks: missing imputed values produced
This may occur when imputation variables are used as independent variables
or when independent variables contain missing values. You can specify
option force if you wish to proceed anyway.
r(498);
The “force” option would keep the command from stopping with an error, but nothing useful
would come out of it. It appears we cannot impute gestwks due to the missing values (no
information) in matage.
Fifth, we impute the missing values for matage, using the linear regression approach predicting
from gestwks
mi impute regress matage gestwks, rseed(999) add(5)
matage: missing imputed values produced
    This may occur when imputation variables are used as independent variables
    or when independent variables contain missing values. You can specify
    option force if you wish to proceed anyway.
r(498);
The “force” option would keep the command from stopping with an error, but nothing useful
would come out of it. It appears we cannot impute matage due to the missing values (no
information) in gestwks.
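If you have Stata 12 or later (an assumption; this chapter was written around Stata versions 10
and 11), the official mi impute chained command cycles through the variables in the same spirit
as ice and could impute all three variables together, in place of the separate univariate steps
above. A minimal sketch, run on freshly mi set and mi registered data:

mi impute chained (regress) bweight gestwks matage, add(5) rseed(888)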
Sixth, we check that the imputation did not create anything strange by looking at the descriptive
statistics before imputation, after the first imputation, and after the fifth imputation.
mi xeq 0 1 5: sum bweight gestwks matage
m=0 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        13    2912.538    741.7804       1203       3751
     gestwks |        10      38.927    1.277533      36.48      41.02
      matage |        11    33.54545    4.058661         27         39

m=1 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2944.241    712.5035       1203       3751
     gestwks |        10      38.927    1.277533      36.48      41.02
      matage |        11    33.54545    4.058661         27         39

m=5 data:
-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2897.641    700.9038       1203       3751
     gestwks |        10      38.927    1.277533      36.48      41.02
      matage |        11    33.54545    4.058661         27         39
We see that the nonmissing sample size for bweight increased from 13 to 15, but nothing changed
for the other two variables. The mean for bweight also changed noticeably, which might make us
wonder whether the imputation worked well. We will see below that it did not seem to cause a
problem.
Finally, we run the linear regression model on the imputed data. The dots option charts progress
on the screen, which is useful if the estimation takes a long time.
mi estimate, dots: regress bweight gestwks matage
Imputations (5):
  ..... done

Multiple-imputation estimates                   Imputations       =          5
Linear regression                               Number of obs     =         10
                                                Average RVI       =     0.2082
                                                Complete DF       =          7
DF adjustment:   Small sample                   DF:     min       =       5.01
                                                        avg       =       5.33
                                                        max       =       5.56
Model F test:       Equal FMI                   F(   2,    4.7)   =      11.71
Within VCE type:          OLS                   Prob > F          =     0.0146

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   264.7004   82.06318     3.23   0.021     58.51564    470.8851
      matage |  -59.11732   25.88137    -2.28   0.071     -125.622    7.387346
       _cons |   -5427.05   3541.906    -1.53   0.180    -14262.06     3407.96
------------------------------------------------------------------------------
Comparing this to the original model that used listwise deletion of missing values,
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------
We see that the imputed model was based on n=2 more observations, which increased the
statistical power slightly (smaller p value) for gestwks, while keeping the coefficient essentially
the same.
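To see how much of the uncertainty comes from the imputations themselves, a hedged option (not used in the original) is to rerun the estimation with the vartable option of mi estimate, which reports the within- and between-imputation variances and the relative increase in variance for each coefficient:

* hedged sketch: same model, with the variance information table added
mi estimate, vartable dots: regress bweight gestwks matage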
Version 11: Multiple Imputation Using Imputation Via Chained Equations (mi ice)
This approach will fill in more of the missing data than the preceding approach.
The Stata FAQ page, http://www.stata.com/support/faqs/stat/mi_ice.html, describes how to obtain
the original ice command for use with Stata 11,
“In Stata 11, you can use the user-written command mi ice to perform imputation via
chained equations. mi ice is available from Patrick Royston’s web page (net from
http://www.homepages.ucl.ac.uk/~ucakjpr/stata/) under the heading mi_ice. mi ice is
a wrapper for ice that understands the official mi data format.”
To add this to Stata 11, use
net from http://www.homepages.ucl.ac.uk/~ucakjpr/stata/
and then click on,
mi_ice          Stata 11 version of -ice-; knows about the new MI format

------------------------------------------------------------------------------
package mi_ice from http://www.homepages.ucl.ac.uk/~ucakjpr/stata
------------------------------------------------------------------------------

TITLE
      mi ice. Stata 11-aware wrapper for the -ice- package

DESCRIPTION/AUTHOR(S)
      Program by Yulia Marchenko and Patrick Royston
      Distribution-Date: 20101130
      version: 1.0.2
      Note: for this Stata 11 program to work, you must first install -ice-.
      Please direct queries to Patrick Royston (pr@ctu.mrc.ac.uk)

INSTALLATION FILES                              (click here to install)
      mi_ice\mi_cmd_ice.ado
      mi_ice\mi_ice.sthlp
------------------------------------------------------------------------------
(click here to return to the previous screen)
Clicking on the “(click here to install)” link loads the software into Stata.
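Recall from the package note above that the underlying ice package must be installed first. Assuming ice is still distributed through the SSC archive (an assumption, not stated in the original), one way to install it is,

* hedged sketch: install Royston's ice package, required by mi ice
ssc install ice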
Again, we will use a small dataset so we can keep track of what is happening.
use births_miss_small, clear
list
     +-----------------------------------------+
     | id   bweight   lowbw   gestwks   matage |
     |-----------------------------------------|
  1. |  1      2620       0     38.15       35 |
  2. |  2      3751       0      39.8       31 |
  3. |  3      3200       0     38.89       33 |
  4. |  4      3673       0         .        . |
  5. |  5         .       .     38.97       35 |
     |-----------------------------------------|
  6. |  6      3001       0     41.02       38 |
  7. |  7      1203       1         .        . |
  8. |  8      3652       0         .        . |
  9. |  9      3279       0     39.35       30 |
 10. | 10      3007       0         .        . |
     |-----------------------------------------|
 11. | 11      2887       0      38.9       28 |
 12. | 12         .       .     40.03       27 |
 13. | 13      3375       0         .       36 |
 14. | 14      2002       1     36.48       37 |
 15. | 15      2213       1     37.68       39 |
     +-----------------------------------------+
With listwise deletion of missing values, we get the following linear regression.
regress bweight gestwks matage
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------
First, we declare the data to be mi (multiple imputation) data, requesting the marginal long style
(mlong) because it is the most memory-efficient of the available styles.
mi set mlong
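As an optional hedged check (not in the original), mi query reports whether the data are mi set, which style is in use, and how many imputations are stored:

* hedged sketch: should report style mlong and M = 0 at this point
mi query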
Second, we register the variables that are to be imputed, using the “imputed” keyword.
mi register imputed bweight gestwks matage
Third, we impute the missing values for bweight, gestwks, and matage using the method of chained
equations. We set the random number seed so we can duplicate our work and ask for 5
imputations.
mi ice bweight gestwks matage , add(5) seed(888)
   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          8       53.33       53.33
          1 |          3       20.00       73.33
          2 |          4       26.67      100.00
------------+-----------------------------------
      Total |         15      100.00

   Variable | Command | Prediction equation
------------+---------+------------------------------------------------
    bweight | regress | gestwks matage
    gestwks | regress | bweight matage
     matage | regress | bweight gestwks
------------------------------------------------------------------------
Imputing 1..2..3..4..5..file
C:\DOCUME~1\U00327~1.SRV\LOCALS~1\Temp\ST_0000006s.tmp saved
(5 imputations added; M=5)
Fourth, we check that the imputation did not create something crazy, by looking at the
descriptive stats before imputation, after the first imputation, and at the fifth imputation.
mi xeq 0 1 5: sum bweight gestwks matage
m=0 data:

-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        13    2912.538    741.7804       1203       3751
     gestwks |        10      38.927    1.277533      36.48      41.02
      matage |        11    33.54545    4.058661         27         39

m=1 data:

-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2940.945     721.708       1203       3751
     gestwks |        15    38.98037    1.220608      36.48      41.02
      matage |        15    33.59405    3.860192         27         39

m=5 data:

-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2911.507     697.542       1203       3751
     gestwks |        15    39.18468    1.725737   35.71356   42.04614
      matage |        15    34.31236    6.361323   24.59798   51.19441
We don’t see anything strange. We also notice that the three variables are now all nonmissing.
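As a further optional check (not in the original), mi describe summarizes the mi setup, confirming the style, the registered imputed variables, and that M=5 imputations are present:

* hedged sketch: display the mi settings and number of imputations stored
mi describe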
Finally, we run the linear regression model on the imputed data. The dots option charts
progress on the screen, which is helpful if the estimation takes a long time.
mi estimate, dots: regress bweight gestwks matage
Imputations (5):
  .....done

Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =         15
                                                  Average RVI     =     1.2052
                                                  Complete DF     =         12
DF adjustment:   Small sample                     DF:     min     =       2.62
                                                          avg     =       4.98
                                                          max     =       6.74
Model F test:       Equal FMI                     F(   2,    3.9) =      13.13
Within VCE type:          OLS                     Prob > F        =     0.0186

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   324.7973   72.26509     4.49   0.005     144.5517    505.0429
      matage |  -64.41396    30.7788    -2.09   0.140    -170.8037    41.97578
       _cons |  -7567.114   2975.775    -2.54   0.040    -14658.97   -475.2537
------------------------------------------------------------------------------
Comparing this to the original model that used listwise deletion of missing values,
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------
We see that the imputed model was based on n=15 observations, which is the actual sample
size of our dataset.
Here are all of the commands that we just used,
mi set mlong
mi register imputed bweight gestwks matage
mi ice bweight gestwks matage , add(5) seed(888)
mi xeq 0 1 5: sum bweight gestwks matage
mi estimate, dots: regress bweight gestwks matage
Version 12: Multiple Imputation Using Imputation Via Chained Equations (mi impute chained)
The method of imputation by chained equations was officially added to Stata in version 12.
Again, we will use a small dataset so we can keep track of what is happening.
use births_miss_small, clear
list
     +-----------------------------------------+
     | id   bweight   lowbw   gestwks   matage |
     |-----------------------------------------|
  1. |  1      2620       0     38.15       35 |
  2. |  2      3751       0      39.8       31 |
  3. |  3      3200       0     38.89       33 |
  4. |  4      3673       0         .        . |
  5. |  5         .       .     38.97       35 |
     |-----------------------------------------|
  6. |  6      3001       0     41.02       38 |
  7. |  7      1203       1         .        . |
  8. |  8      3652       0         .        . |
  9. |  9      3279       0     39.35       30 |
 10. | 10      3007       0         .        . |
     |-----------------------------------------|
 11. | 11      2887       0      38.9       28 |
 12. | 12         .       .     40.03       27 |
 13. | 13      3375       0         .       36 |
 14. | 14      2002       1     36.48       37 |
 15. | 15      2213       1     37.68       39 |
     +-----------------------------------------+
With listwise deletion of missing values, we get the following linear regression.
regress bweight gestwks matage
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------
To see which variables have missing values, we can use the nmissing command we installed
above, along with the ds command to get a list of all the variable names,
ds
nmissing
. ds
id       bweight  lowbw    gestwks  matage

. nmissing
bweight          2
lowbw            2
gestwks          5
matage           4
We see that we have some missing data on all variables, except our subject ID variable, which will
not be included in the model anyway. We do not intend to use low birth weight, lowbw, in our
model to predict birth weight, so we can ignore that variable.
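A hedged aside (not in the original): an official alternative to the user-written nmissing is misstable summarize, available since Stata 11, which reports the number of missing values for each variable listed:

* hedged sketch: official command giving counts of missing values
misstable summarize bweight lowbw gestwks matage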
First, we declare the data to be mi (multiple imputation) data, requesting the marginal long style
(mlong) because it is the most memory-efficient of the available styles.
mi set mlong
Second, we register the variables that are to be imputed, using the “imputed” keyword.
mi register imputed bweight gestwks matage
Third, we impute the missing values for bweight, gestwks, and matage using the method of chained
equations, requesting 10 imputed datasets with the add(10) option. The (regress) part of the
command instructs Stata to use linear regression to predict the missing values from the other
variables. The rseed option, for which you can select any number as the random-number-generator
seed, makes the imputation reproducible,
mi impute chained (regress) bweight gestwks matage , add(10) rseed(888)
Conditional models:
            bweight: regress bweight matage gestwks
             matage: regress matage bweight gestwks
            gestwks: regress gestwks bweight matage

Performing chained iterations ...

Multivariate imputation                     Imputations =       10
Chained equations                                 added =        5
Imputed: m=6 through m=10                       updated =        0

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

            bweight: linear regression
            gestwks: linear regression
             matage: linear regression
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
           bweight |         13            2         2 |        15
           gestwks |         10            5         5 |        15
            matage |         11            4         4 |        15
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)
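The command above used (regress) for all three variables. If we had also wanted to impute the binary variable lowbw, mi impute chained lets us mix methods; a hedged sketch (not part of the original analysis, and lowbw would first need to be registered with mi register imputed lowbw) would use a logit model for the binary variable:

* hedged sketch, for illustration only: linear regression for the continuous
* variables and logistic regression for the binary lowbw
mi impute chained (regress) bweight gestwks matage (logit) lowbw, add(10) rseed(888)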
Fourth, we check that the imputation did not create something crazy, by looking at the
descriptive stats before imputation, after the first imputation, and at the fifth imputation.
mi xeq 0 1 5: sum bweight gestwks matage
m=0 data:

-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        13    2912.538    741.7804       1203       3751
     gestwks |        10      38.927    1.277533      36.48      41.02
      matage |        11    33.54545    4.058661         27         39

m=1 data:

-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2906.259    711.6166       1203       3751
     gestwks |        15    38.85626    1.335499   35.94739      41.02
      matage |        15    33.56222    5.995813   26.79644   49.17725

m=5 data:

-> sum bweight gestwks matage

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     bweight |        15    2954.346    714.6822       1203       3751
     gestwks |        15    38.80349     2.10584   32.45712      41.02
      matage |        15    33.27142    3.748245         27         39
We don’t see anything strange. The imputed datasets have descriptive statistics very similar to
the original non-imputed dataset. We also notice that the three variables are now all nonmissing
as they should be.
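To verify this in every imputed dataset rather than just m=1 and m=5 (a hedged check, not in the original), we can count any remaining missing values across all ten imputations; every count should be 0:

* hedged sketch: in each imputation, count observations still missing any
* of the three imputed variables
mi xeq 1/10: count if missing(bweight, gestwks, matage)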
Finally, we run the linear regression model on the imputed data. The dots option charts
progress on the screen, which is helpful if the estimation takes a long time.
mi estimate, dots: regress bweight gestwks matage
Imputations (10):
  .........10 done

Multiple-imputation estimates                     Imputations     =         10
Linear regression                                 Number of obs   =         15
                                                  Average RVI     =     0.7878
                                                  Largest FMI     =     0.4927
                                                  Complete DF     =         12
DF adjustment:   Small sample                     DF:     min     =       5.37
                                                          avg     =       5.71
                                                          max     =       6.16
Model F test:       Equal FMI                     F(   2,    6.4) =      13.60
Within VCE type:          OLS                     Prob > F        =     0.0049

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   310.4137   90.59601     3.43   0.013     90.13877    530.6885
      matage |  -67.25186   28.77103    -2.34   0.063    -139.7065    5.202746
       _cons |  -6879.842   4104.348    -1.68   0.148    -17097.55    3337.863
------------------------------------------------------------------------------
Notice the model is now based on a sample size of n=15. Compare this to the original model,
which had a sample size of n=8 due to listwise deletion of missing values,
      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =   10.77
       Model |  1880727.23     2  940363.614           Prob > F      =  0.0154
    Residual |  436631.648     5  87326.3296           R-squared     =  0.8116
-------------+------------------------------           Adj R-squared =  0.7362
       Total |  2317358.88     7  331051.268           Root MSE      =  295.51

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestwks |   274.9777   83.99418     3.27   0.022     59.06382    490.8916
      matage |  -66.55269   28.83879    -2.31   0.069    -140.6852     7.57979
       _cons |   -5541.07   3641.212    -1.52   0.189     -14901.1    3818.962
------------------------------------------------------------------------------
Here are all of the commands that we just used, so you have them in one place,
mi set mlong
mi register imputed bweight gestwks matage
mi impute chained (regress) bweight gestwks matage , add(10) rseed(888)
mi xeq 0 1 5: sum bweight gestwks matage
mi estimate, dots: regress bweight gestwks matage
References
Chandola T, Brunner E, Marmot M. (2006). Chronic stress at work and the metabolic syndrome:
prospective study. BMJ 332:521-5. PMID:16428252
Greenland S, Finkle WD. (1995). A critical look at methods for handling missing covariates in
epidemiologic regression analysis. Am J Epidemiol 142(12):1255-64.
Fleiss JL, Levin B, Paik MC. (2003). Statistical Methods for Rates and Proportions, 3rd ed.
Hoboken NJ, John Wiley & Sons.
Harrell Jr FE. (2001). Regression Modeling Strategies With Applications to Linear Models,
Logistic Regression, and Survival Analysis. New York, Springer-Verlag.
Huberman M, Langholz B. (1999). Application of the missing-indicator method in matched
        case-control studies with incomplete data. Am J Epidemiol 150(12):1340-5.
Li X, Song X, Gray RH. (2004). Comparison of the missing-indicator method and conditional
logistic regression in 1:m matched case-control studies with missing exposure values. Am
J Epidemiol 159(6):603-610.
Moons KG, Grobbee DE. (2002). Diagnostic studies as multivariable, prediction research. J
Epidemiol Community Health 56(5):337-8.
Roth, P. (1994). Missing data: A conceptual review for applied psychologists. Personnel
Psychology 47:537-560.
Royston P. (2004). Multiple imputation of missing values. The Stata Journal 4(3):227-241.
Royston P. (2005a). Multiple imputation of missing values: update. The Stata Journal 5(2):188-201.
Royston P. (2005b). Multiple imputation of missing values: update of ice. The Stata Journal
5(4):527-536.
Royston P. (2007). Multiple imputation of missing values: further update of ice, with an
emphasis on interval censoring. The Stata Journal 7(4):445-464.
Schonlau M. (2006). Stata software package, hotdeckvar.pkg, for hotdeck imputation.
http://www.schonlau.net/stata/.
Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development,
Validation, and Updating. New York, Springer.
Twisk JWR. (2003). Applied Longitudinal Data Analysis for Epidemiology: A Practical
Guide. Cambridge, Cambridge University Press.
van Buuren S, Boshuizen HC, Knook DL. (1999). Multiple imputation of missing
blood pressure covariates in survival analysis. Statistics in Medicine 18: 681–694.
Vandenbroucke JP, von Elm E, Altman DG, et al. (2007). Strengthening the Reporting of
        Observational Studies in Epidemiology (STROBE): explanation and elaboration. Ann
        Intern Med 147(8):W-163 to W-194.