Handling Missing Data
Estie Hudes
Tor Neilands
UCSF Center for AIDS Prevention Studies
March 16, 2007
Presentation Overview
Overview of concepts and approaches to handling missing data
Missing data mechanisms - how data came to be missing
Problems with popular ad-hoc missing data handling methods
A more modern, better approach: Maximum likelihood (FIML/Direct ML)
More on modern approaches: the EM algorithm
Another modern approach: Multiple Imputation (MI)
Extensions and Conclusions
Types of Missing Data
Item-missing: respondent is retained in
the study, but does not answer all
questions
Wave-missing: respondent is observed at
intermittent waves
Drop-out: respondent ceases
participation and is never observed again
Combinations of the above
Methods of Handling Missing Data
First method: Prevention of missing cases
(e.g., loss to follow-up) and individual
item non-response
Second method: Ad-hoc approaches (e.g.,
listwise/casewise deletion)
Third method: Maximum likelihood-based
approaches (e.g., direct ML) and related
approaches (e.g., restricted ML)
Prevention of Missing Data
Minimize individual item non-response
CASI and A-CASI (computer-assisted self-interviewing and its audio variant) may prove helpful
Interviewer-administered surveys
Avoid self-administered surveys where possible
Minimize loss to follow-up in longitudinal
studies by incorporating good participant
tracking protocols, appropriate use of
incentives, and reducing respondent
burden
Ad-hoc Approaches to
Handling Missing Data
Listwise deletion (a.k.a. complete-case analysis)
Pairwise deletion (a.k.a. available-case analysis)
Dummy variable adjustment (Cohen & Cohen)
Single imputation
Replacement with variable or participant means
Regression
Hot deck
Modern Approaches to
Handling Missing Data
Maximum likelihood (FIML/direct ML)
EM algorithm
Multiple imputation (MI)
Selection models and pattern-mixture
models for non-ignorable data
Weighting
We will confine our discussion to Direct
ML, EM algorithm and Multiple Imputation
A Tour of Missing
Data Mechanisms
How did the data become incomplete or
missing?
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Not Missing at Random (NMAR; non-ignorable non-response; informative missingness)
Influential article: Rubin (1976) in
Biometrika
Missing Data Mechanisms:
Missing Completely at Random
 Pr(Y is missing|X,Y) = Pr(Y is missing)
 If incomplete data are MCAR, the cases with complete
data are then a random subset of the original sample.
 A good situation to be in if you have missing data
because listwise deletion of the cases with incomplete
data is generally justified.
 A downside is loss of statistical power, especially when the number of cases with complete data is a small fraction of the original number of cases.
Missing Data Mechanisms:
Missing at Random
Pr(Y is missing|X,Y) = Pr(Y missing|X)
Within each level of X, the probability that
Y is missing does not depend on the
numerical value of Y.
Data are MCAR within each level of X.
MAR is a much less restrictive assumption
than MCAR.
Missing Data Mechanisms:
Not Missing at Random
 If incomplete data are neither MCAR nor MAR, the data
are considered NMAR or non-ignorable.
 Missing data mechanism must be modeled to obtain
good parameter estimates.
 Heckman’s selection model is one example of NMAR
modeling. Pattern mixture models are another NMAR
approach.
 Disadvantages of NMAR modeling: Requires a high level of knowledge about the missingness mechanism; results are often highly sensitive to the choice of NMAR model.
Missing Data Mechanisms:
Examples (1)
 Scenario: Measuring systolic blood pressure (SBP) in
January and February (Schafer and Graham, 2002,
Psychological Methods, 7(2), 147-177)
MCAR: Data missing in February at random, unrelated to SBP level in January or February or any other variable in the study - missing cases are a random subset of the original sample’s cases.
MAR: Data missing in February because the January
measurement did not exceed 140 - cases are randomly missing
data within the two groups: SBP > 140 and SBP <= 140.
NMAR: Data missing in February because the February SBP
measurement did not exceed 140. (SBP taken, but not recorded
if it is <= 140.) Cases’ data are not missing at random.
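To make the three mechanisms concrete, the SBP scenario above can be simulated in a few lines. A minimal sketch (Python; the variable names and distribution parameters are illustrative assumptions, with the 140 mmHg cutoff taken from the example):

import numpy as np

rng = np.random.default_rng(42)
n = 10_000
jan = rng.normal(125, 15, n)                    # January SBP
feb = 0.8 * jan + 25 + rng.normal(0, 10, n)     # February SBP

# MCAR: February missing with the same probability for everyone
mcar = rng.random(n) < 0.3
# MAR: February missing whenever the January reading was <= 140
mar = jan <= 140
# NMAR: February missing whenever February itself was <= 140
nmar = feb <= 140

for name, miss in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(f"{name}: observed Feb mean = {feb[~miss].mean():.1f}, "
          f"true Feb mean = {feb.mean():.1f}")

The complete-case February mean is unbiased only under MCAR. Under MAR the distortion can be corrected by conditioning on the observed January values; under NMAR it cannot be corrected without a model for the missingness itself.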
Missing Data Mechanisms:
Examples (2)
 Scenario: Measuring Body Mass Index (BMI) of ambulance drivers in a
longitudinal context (Heitjan, 1997, AJPH, 87(4), 548-550).
MCAR: Data missing at follow-up because participants were out on call at time of scheduled measurement, i.e., reason for data missingness is unrelated to outcome or other measured variables - missing cases are a random subset of the population of all cases.
MAR: Data missing at follow-up because of high BMI and
embarrassment at initial visit, regardless of whether participant
gained or lost weight since baseline, i.e., reason for data
missingness is related to BMI, a measured variable in the study.
NMAR: Data missing at follow-up because of weight gain since last
visit (assuming weight gain is unrelated to other measured
variables in the study).
More on Missing
Data Mechanisms
 Ignorable data missingness - occurs when data are incomplete due
to MCAR or MAR process
 If incomplete data arise from an MCAR or MAR data missingness
mechanism, there is no need for the analyst to explicitly model the
missing data mechanism (in the likelihood function), as long as the
analyst uses software programs that take the missingness
mechanism into account internally (several of these will be
mentioned later)
 Even if data missingness is not fully MAR, methods that assume
MAR usually (though not always) offer lower expected parameter
estimate bias than methods that assume MCAR (Muthén, Kaplan, &
Hollis, Psychometrika, 1987).
Ad-hoc Methods Unraveled (1)
 Listwise deletion: delete all cases with missing value on
any of the variables in the analysis. Only use complete
cases.
 OK if missing data are MCAR
Parameter estimates unbiased
Standard errors appropriate
 But, can result in substantial loss of statistical power
 Biased parameter estimates if data are MAR
 Robust to NMAR missingness on predictor variables, provided missingness does not depend on Y
 In logistic regression, slope estimates (though not the intercept) are also robust when missingness depends on the outcome variable
Ad-hoc Methods Unraveled (2)
Pairwise deletion: use all available cases for
computation of any sample moment
For computation of means, use all available data for each
variable;
For computation of covariances, use all available data on pairs
of variables.
 Can lead to non-positive definite var-cov matrices
because it uses different pairs of cases for each entry.
 Can lead to biased standard errors under MAR.
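A minimal illustration of the non-positive-definiteness problem (Python; the correlation values are hypothetical): each pairwise correlation below could legitimately arise from its own subset of cases, but no single complete data set could produce all three at once, and the assembled matrix fails positive definiteness:

import numpy as np

# correlation matrix assembled from three different pairwise subsets
R = np.array([[ 1.0,  0.9,  0.9],
              [ 0.9,  1.0, -0.9],
              [ 0.9, -0.9,  1.0]])

print(np.linalg.eigvalsh(R))   # one eigenvalue is negative -> not positive definite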
Ad-hoc Methods Unraveled (3)
Dummy variable adjustment
Advocated by Cohen & Cohen (1985)
1. When X has missing values, create a dummy variable D to
indicate complete case versus case with missing data.
2. When X is missing, fill in a constant c
3. Regress Y on X and D (and other non-missing predictors).
Produces biased coefficient estimates (see Jones’
1996 JASA article)
Ad-hoc Methods Unraveled (4)
 Single imputation (of missing values)
Mean substitution - by variable or by observation
Regression imputation (i.e., replacement with conditional
means)
Hot deck: Pick “donor” cases within homogeneous strata of
observed data to provide data for cases with unobserved
values.
 These methods lead to biased parameter estimates (e.g., means, regression coefficients) and to variance and standard error estimates that are biased downwards. One exception: Rubin (1987) provides a
hot-deck based method of multiple imputation that may return
unbiased parameter estimates under MAR.
 Otherwise, these methods are not recommended.
Modern Methods:
Maximum Likelihood (1)
When there are no missing data:
 Uses the likelihood function to express the probability of
the observed data, given the parameters, as a function
of the unknown parameter values.
 Example: $L(\theta) = \prod_{i=1}^{n} p(x_i, y_i \mid \theta)$, where $p(x, y \mid \theta)$ is the (joint) probability of observing $(x, y)$ given a parameter $\theta$, for a sample of $n$ independent observations. The likelihood function is the product of the separate contributions to the likelihood from each observation.
 MLEs are the values of the parameters which maximize
the probability of the observed data (the likelihood).
Modern Methods:
Maximum Likelihood (2)
 Under ordinary conditions, ML estimates are:
consistent (approximately unbiased in large samples)
asymptotically efficient (have the smallest possible variance)
asymptotically normal (one can use normal theory to construct
confidence intervals and p-values).
 The ML approach can be easily extended to MAR situations. If the first $m$ observations are complete and observations $m+1, \ldots, n$ are missing $X$:
$L(\theta) = \prod_{i=1}^{m} p(x_i, y_i \mid \theta) \prod_{j=m+1}^{n} g(y_j \mid \theta)$
The contribution to the likelihood from an observation with $X$ missing is the marginal: $g(y_j \mid \theta) = \sum_{x} p(x, y_j \mid \theta)$ (an integral for continuous $X$)
 This likelihood may be maximized like any other
likelihood function. Often labeled FIML or direct ML.
Modern Methods:
Maximum Likelihood (3)
Available software to perform FIML estimation:
AMOS - Analysis of Moment Structures
Commercial program licensed as part of SPSS (CAPS has a
10-user license for this product)
Fits a wide variety of univariate and multivariate linear
regression, ANOVA, ANCOVA, and structural equation (SEM)
models.
http://www.smallwaters.com
Mx - Similar to AMOS in capabilities, less user-friendly
Freeware: http://views.vcu.edu/mx
LISREL - Similar to AMOS, more features, less user-friendly
Commercial program: http://www.ssicentral.com
Modern Methods:
Maximum Likelihood (4)
Available software:
 ℓEM - Loglinear & Event history analysis w/ Missing data (Jeroen Vermunt)
Freeware DOS program downloadable from the Internet
• http://www.uvt.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html
Fits log-linear, logit, latent class, and event history models
with categorical predictors.
Mplus
Similar capabilities to AMOS (commercial)
Less easy to use than AMOS, but more general modeling
features.
http://www.statmodel.com
Modern Methods:
Maximum Likelihood (5)
Longitudinal data analysis software options (not
discussed):
Normally distributed outcomes
SAS PROC MIXED
S-PLUS LME
Stata XTREG, XTREGAR, and XTMIXED
Poisson
Stata XTPOIS
Negative Binomial
Stata XTNBREG
Logistic
Stata XTLOGIT
Modern Methods:
Maximum Likelihood (6)
 Software for longitudinal analyses (continued)
General modeling of clustered and longitudinal data
Stata GLLAMM add-on command
SAS PROC NLMIXED
S-PLUS NLME
What about Generalized Estimating Equations (GEE) for analysis
of longitudinal or clustered data with missing observations?
Assumes incomplete data are MCAR. See Hedeker & Gibbons,
1997, Psychological Methods, p. 65. & Heitjan, AJPH, 1997, 87(4),
548-550.
Can be extended to accommodate the MAR assumption via a weighting approach developed by Robins, Rotnitzky, & Zhao (JASA, 1995), but it has limited applicability.
Maximum Likelihood Example (1)
2 x 2 Table with missing data

Sex (X=S)    Vote Yes   Vote No   (Complete)   Vote missing (.)   Cell probabilities
Male            28         45        (73)            10            p11, p12
Female          22         52        (74)            15            p21, p22
Total           50         97       (147)            25

Likelihood function:
$L(p_{11}, p_{12}, p_{21}, p_{22}) = p_{11}^{28}\, p_{12}^{45}\, p_{21}^{22}\, p_{22}^{52}\, (p_{11}+p_{12})^{10}\, (p_{21}+p_{22})^{15}$
Maximum Likelihood Example (2)
2 x 2 Table with missing data
28 73  10
11  (
)(
)  0.1851
73
172
p
45 73  10
12  (
)(
)  0.2975
73
172
p
p
22 74  15
21  (
)(
)  0.1538
74
172
p
52 74  15
22  (
)(
)  0.3636
74
172
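As a check, the same estimates can be obtained by maximizing the likelihood above numerically. A minimal sketch (Python with scipy; the softmax parameterization is just a convenient way to keep the four probabilities on the simplex):

import numpy as np
from scipy.optimize import minimize

counts = np.array([28, 45, 22, 52])   # complete cases: p11, p12, p21, p22 cells
miss = np.array([10, 15])             # vote missing: 10 males, 15 females

def negloglik(q):
    p = np.exp(q) / np.exp(q).sum()   # softmax keeps probabilities summing to 1
    return -(counts * np.log(p)).sum() \
           - miss[0] * np.log(p[0] + p[1]) \
           - miss[1] * np.log(p[2] + p[3])

fit = minimize(negloglik, np.zeros(4), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 5000})
p = np.exp(fit.x) / np.exp(fit.x).sum()
print(p.round(4))   # ≈ [0.1851 0.2975 0.1538 0.3636]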
Maximum Likelihood Example (3)
Using ℓEM for the 2 x 2 Table

Input (partial):

* R = response (NM) indicator
* S = sex; V = vote
man 2          * 2 manifest variables
res 1          * 1 response indicator
dim 2 2 2      * with two levels
lab R S V      * and labels R, S, V
sub SV S       * defines these two subgroups
mod SV         * model for complete data
dat [28 45 22 52    * subgroup SV
     10 15]         * subgroup S

Output (partial):

*** (CONDITIONAL) PROBABILITIES ***
* P(SV) *
          all data            complete data only
1 1       0.1851 (0.0311)     0.1905 (0.0324)
1 2       0.2975 (0.0361)     0.3061 (0.0380)
2 1       0.1538 (0.0297)     0.1497 (0.0294)
2 2       0.3636 (0.0384)     0.3537 (0.0394)

* P(R) *
1         0.8547
2         0.1453

(The all-data column reproduces the ML estimates computed by hand above; the complete-data-only column is what listwise deletion gives, e.g., 28/147 = 0.1905.)
Maximum Likelihood Example (1)
Continuous outcome & multiple predictors
Data on American colleges and universities from US News and World Report
N = 1302 colleges
Available from
http://lib.stat.cmu.edu/datasets/colleges
Described on p. 21 of Allison (2001)
Maximum Likelihood Example (2)
Continuous outcome & multiple predictors
 Outcome: gradrat - graduation rate (1,204 non-missing cases)
 Predictors
 csat - combined average scores on verbal and math SAT (779
non-missing cases)
 lenroll - natural log of the number of enrolling freshmen (1,297
non-missing cases)
 private - 1 = private; 0 = public (1,302 non-missing cases)
 stufac - ratio of students to faculty (x 100; 1,300 non-missing
cases)
 rmbrd - total annual cost of room and board (thousands of
dollars; 1,300 non-missing cases)
 act - Mean ACT scores (714 non-missing cases)
Maximum Likelihood Example (3)
Continuous outcome & multiple
predictors
Predict graduation rate from
Combined SAT
Number of enrolling freshmen on log scale
Student-faculty ratio
Private or public institution classification
Room and board costs
Use a linear regression model
ACT score included as an auxiliary variable
Use AMOS and Mplus to illustrate direct ML
Maximum Likelihood Example (4)
Continuous outcome & multiple predictors
AMOS: Two methods for model specification
Graphical user interface
AMOS BASIC programming language
Results (assuming joint MVN)
Regression Weights                 Estimate      S.E.      C.R.         P
GradRat <-- CSAT                     0.0669    0.0048   13.9488    0.0000
GradRat <-- LEnroll                  2.0832    0.5953    3.4995    0.0005
GradRat <-- StuFac                  -0.1814    0.0922   -1.9678    0.0491
GradRat <-- Private                 12.9144    1.2769   10.1142    0.0000
GradRat <-- RMBRD                    2.4040    0.5481    4.3856    0.0000
Maximum Likelihood Example (5)
Continuous outcome & multiple predictors
Mplus example (assuming joint MVN)
INPUT INSTRUCTIONS
TITLE: P. Allison 6/2002 Oakland, CA Missing Data Workshop non-normal example
DATA:
FILE IS D:\My Documents\Papers\Allison-Paul\usnews.txt;
VARIABLE: NAMES ARE csat act stufac gradrat rmbrd private lenroll;
USEVARIABLES ARE csat act stufac gradrat rmbrd private lenroll;
MISSING ARE ALL . ;
ANALYSIS: TYPE = general missing h1 ;
ESTIMATOR = ML ;
MODEL:
gradrat ON csat lenroll stufac private rmbrd ;
gradrat WITH act ;
csat WITH lenroll stufac private rmbrd act ;
lenroll WITH stufac private rmbrd act ;
stufac WITH private rmbrd act ;
private WITH rmbrd act ;
rmbrd WITH act ;
OUTPUT: patterns ;
Maximum Likelihood Example (6)
Continuous outcome & multiple predictors
Mplus results (assuming joint MVN)
MODEL RESULTS
                     Estimates      S.E.   Est./S.E.
 GRADRAT ON
    CSAT                 0.067     0.005      13.954
    LENROLL              2.083     0.595       3.501
    STUFAC              -0.181     0.092      -1.969
    PRIVATE             12.914     1.276      10.118
    RMBRD                2.404     0.548       4.387
Maximum Likelihood Example (7)
Continuous outcome & multiple
predictors
Mplus example for continuous, non-normal data
Uses sandwich estimator robust to non-normality
Specify MLR instead of ML as the estimator
Mplus MLR estimator assumes MCAR missingness and finite fourth-order moments (i.e., finite kurtosis); initial simulation studies show low bias with MAR data
                     Estimates      S.E.   Est./S.E.
 GRADRAT ON
    CSAT                 0.067     0.005      13.312
    LENROLL              2.083     0.676       3.083
    STUFAC              -0.181     0.093      -1.950
    PRIVATE             12.914     1.327       9.735
    RMBRD                2.404     0.570       4.215
Maximum Likelihood Summary
 ML advantages:
Provides a single, deterministic set of results appropriate under
MAR data missingness.
Well-accepted method for handling missing values (e.g., for
grant writing).
Generally fast and convenient.
 ML disadvantages:
Parametric: may not always be robust to violations of
distributional assumptions (e.g., multivariate normality).
Only available for some models via canned software (would
need to program other models).
Most readily available for continuous outcomes and ordered
categorical outcomes.
Available for Poisson or Cox regression with continuous predictors in Mplus, but requires numerical integration, which is time-consuming and can be challenging to use, especially with large numbers of variables.
Modern Methods:
EM Algorithm (1)
 EM algorithm proceeds in two steps to generate ML estimates for
incomplete data: Expectation and Maximization. The steps alternate
iteratively until convergence is attained.
 Seminal article by Dempster, Laird, & Rubin (1977), Journal of the
Royal Statistical Society, Series B, 39, 1-38. Early treatment by H.O.
Hartley (1958), Biometrics, 14(2), 174-194.
 Goal is to estimate sufficient statistics that can then be used for
substantive analyses. In normal theory applications these would be
the means, variances and covariances of the variables (first and
second moments of the normal distributions of the variables).
 Example from Allison, pp. 19-20: For a normal theory regression
scenario, consider four variables X1 - X4 that have some missing
data on X3 and X4.
Modern Methods:
EM Algorithm (2)
Starting Step (0):
Generate starting values for the means and
covariance matrix. Can use the usual formulas with
listwise or pairwise deletion.
Use these values to calculate the linear regression of
X3 on X1 and X2. Similarly for X4.
Expectation Step (1):
Use the linear regression coefficients and the
observed data for X1 and X2 to generate imputed
values of X3 and X4.
Modern Methods:
EM Algorithm (3)
Maximization Step (2):
Use the newly imputed data along with the original
data to compute new estimates of the sufficient
statistics (e.g., means, variances, and covariances)
Use the usual formula to compute the mean
Use modified formulas to compute variances and
covariances that correct for the usual underestimation of
variances that occurs in single imputation approaches.
 Cycle through the expectation and maximization steps until convergence is attained (i.e., until the sufficient statistic values change only negligibly from one iteration to the next).
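To see the two steps concretely, here is a minimal EM sketch (Python; the names are hypothetical) for the earlier sex-by-vote table. The E step allocates each respondent with a missing vote across Yes/No in proportion to the current conditional probabilities; the M step re-estimates the cell probabilities from the completed table:

import numpy as np

counts = np.array([[28.0, 45.0],    # complete cases: male   (Yes, No)
                   [22.0, 52.0]])   #                 female (Yes, No)
miss = np.array([10.0, 15.0])       # vote missing, by sex
n = counts.sum() + miss.sum()       # 172

p = np.full((2, 2), 0.25)           # starting values
for _ in range(200):
    # E step: distribute missing votes using current P(V | S)
    expected = counts + miss[:, None] * p / p.sum(axis=1, keepdims=True)
    # M step: re-estimate cell probabilities from the completed counts
    p_new = expected / n
    delta = np.abs(p_new - p).max()
    p = p_new
    if delta < 1e-12:               # convergence: estimates stop changing
        break

print(p.round(4))   # ≈ [[0.1851 0.2975] [0.1538 0.3636]]

The fixed point reproduces the closed-form ML estimates from the 2 x 2 example: EM performs the same maximization iteratively.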
Modern Methods:
EM Algorithm (4)
EM Advantages:
Only needs to assume incomplete data arise from
MAR process, not MCAR
Fast (relative to MCMC-based multiple imputation
approaches)
Applicable to a wide range of data analysis scenarios
Uses all available data to estimate sufficient statistics
Fairly robust to departures from joint multivariate normality
Provides a single, deterministic set of results
May be all that is needed for non-inferential analyses
(e.g., Cronbach’s alpha or exploratory factor analysis)
Lots of software (commercial and freeware)
Modern Methods:
EM Algorithm (5)
EM Disadvantage:
Produces correct parameter estimates, but standard errors for inferential analyses will be biased downward because analyses of EM-generated data assume all data arise from a complete data set without missing information. The analyses of the EM-based data do not properly account for the uncertainty inherent in imputing missing data.
Recent work by Meng provides a method by which
appropriate standard errors may be generated for EM-based
parameter estimates
Bootstrapping may also be used to overcome this limitation
Modern Methods:
Multiple Imputation (1)
 What is unique about MI: We impute multiple data sets to
analyze, not a single data set as in single imputation
approaches
Use the EM algorithm to obtain starting values for MI
The differences between the imputed data sets capture the
uncertainty due to imputing values
The actual values in the imputed data sets are less important than
analysis results combined across all data sets
 Several MI advantages:
MI yields consistent, asymptotically efficient, and asymptotically
normal estimators under MAR (same as direct ML)
MI-generated data sets may be used with any kind of software or
model
Modern Methods:
Multiple Imputation (2)
 The MI point estimate is the mean:
$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i$
 The MI variance estimate is the sum of Within and Between imputation variation:
$V = \bar{W} + (1 + m^{-1})\, B$
where
$\bar{W} = \frac{1}{m} \sum_{i=1}^{m} V_i$  and  $B = \frac{1}{m-1} \sum_{i=1}^{m} (Q_i - \bar{Q})^2$
 ($Q_i$ and $V_i$ are the parameter estimate and its variance in the ith imputed dataset)
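A minimal sketch of these combining rules (Python; the array names and the five example values are hypothetical):

import numpy as np

def combine(Q, V):
    """Rubin's rules. Q and V are length-m arrays holding the per-imputation
    parameter estimates and their squared standard errors (variances)."""
    m = len(Q)
    qbar = Q.mean()               # MI point estimate
    W = V.mean()                  # within-imputation variance
    B = Q.var(ddof=1)             # between-imputation variance
    T = W + (1 + 1 / m) * B       # total variance
    return qbar, np.sqrt(T)

# e.g., a slope estimated in each of m = 5 imputed data sets
est, se = combine(np.array([2.11, 2.03, 2.22, 2.08, 2.16]),
                  np.array([0.36, 0.35, 0.38, 0.36, 0.37]))
print(f"{est:.3f} ({se:.3f})")

Rubin (1987) also supplies the degrees of freedom for the resulting t reference distribution, $\nu = (m-1)\left(1 + \bar{W} / ((1 + m^{-1})B)\right)^2$.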
Modern Methods:
Multiple Imputation (3)
 Imputation model vs. analysis model
Imputation model should include any auxiliary variables (i.e.,
variables that are correlated with other variables that have
incomplete data; variables that predict data missingness)
Analysis model should contain a subset of the variables from the imputation model and address issues of categorical data and non-normal data
 Texts that discuss MI in detail:
Little & Rubin (2002, John Wiley and Sons): A seminal classic
Rubin (1987, John Wiley and Sons): Non-response in surveys
J. L. Schafer (1997, Chapman & Hall): Modern and updated
P. Allison (2001, Sage Publications series # 136): A readable and
practical overview of and introduction to MI and missing data
handling approaches
Modern Methods:
Multiple Imputation (4)
 Multivariate normal imputation approach
MI approaches exist for multivariate normal data, categorical
data, mixed categorical and normal variables, and
longitudinal/clustered/panel data.
The MV normal approach is most popular because it performs
well in most applications, even with somewhat non-normal input
variables (Schafer, 1997)
Variable transformations can further improve imputations
For each variable with missing data, estimate the linear
regression of that variable on all other variables in the data set.
Using a Bayesian prior distribution for the parameters, typically
noninformative, regression parameters are drawn from the
posterior Bayesian distribution. Estimated regression equations
are used to generate predicted values for missing data points.
Modern Methods:
Multiple Imputation (5)
 Multivariate normal imputation approach (continued)
Add to each predicted value a random draw from the residual
normal distribution to reflect uncertainty due to incomplete data.
Obtaining Bayesian posterior random draws is the most complex
part of the procedure. Two approaches:
Data augmentation - implemented in NORM and PROC MI
• Uses a Markov-Chain Monte Carlo (MCMC) approach to generate the
imputed values
A variant of Data augmentation - implemented in ice (and MICE)
• Uses a Gibbs sampler and switching regressions approach (Fully Conditional
Specification - FCS) to generate the imputed values (van Buuren)
Sampling Importance/Resampling (SIR) - implemented in Amelia and a
user-written macro in SAS (sirnorm.sas); claimed to be faster than data
augmentation-based approaches.
“The relative superiority of these methods is far from settled”
(Allison, 2001, p. 34)
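The core recipe from the two slides above - regress, draw parameters from their posterior, predict, then add residual noise - can be sketched for one incomplete variable as follows (Python; a simplified illustration under the normal model with a noninformative prior, not a reproduction of any particular package's algorithm):

import numpy as np

rng = np.random.default_rng(1)

def impute_once(X, y, miss):
    """One proper imputation of y[miss], given a full predictor matrix X
    (n x k, first column all 1s) and a boolean missingness indicator."""
    Xo, yo = X[~miss], y[~miss]
    n_obs, k = Xo.shape
    XtX_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = XtX_inv @ (Xo.T @ yo)
    resid = yo - Xo @ beta_hat
    # draw sigma^2 from its scaled inverse chi-square posterior
    sigma2 = (resid @ resid) / rng.chisquare(n_obs - k)
    # draw beta from N(beta_hat, sigma^2 (X'X)^-1)
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    # predicted values plus a residual draw for every missing case
    y_imp = y.copy()
    y_imp[miss] = X[miss] @ beta + rng.normal(0, np.sqrt(sigma2), miss.sum())
    return y_imp

Repeating this M times yields M imputed versions of y; the data augmentation and FCS machinery described above generalizes the same idea to arbitrary missingness patterns by iterating such draws.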
Modern Methods:
Multiple Imputation (6)
 Steps in using MI
 Select variables for the imputation model - use all variables in the
analysis model, including any dependent variable(s), and any variables
that are associated with variables that have missing data or the
probability of those variables having missing data (auxiliary variables),
in part or in whole.
 Transform non-normal continuous variables to attain normality (e.g.,
skewed variables)
 Select a random number seed for imputations (if possible)
 Choose number of imputations to generate
Typically 5 to 10: > 90% coverage & efficiency with 90% or less missing
information in large sample scenarios with M = 5 imputations (Rubin, 1987)
Sometimes, however, you may need more imputations (e.g., 20 or more for
some longitudinal scenarios).
You can compute the relative efficiency of parameter estimates as: relative efficiency = 1 / (1 + rate of missing information / number of imputations) × 100. Several MI software programs output the missing information rates for parameters, allowing the analyst to easily compute relative efficiencies
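For example, with a 30% rate of missing information and M = 5 imputations, relative efficiency = 1 / (1 + .30/5) × 100 ≈ 94.3%; increasing M to 20 raises it to 1 / (1 + .30/20) × 100 ≈ 98.5%.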
Modern Methods:
Multiple Imputation (7)
 Steps in using MI (continued):
 Produce the multiply imputed data sets
Estimated parameters must be independent of initial values
Assess independence via autocorrelation and time series plots (when using
MCMC-based MI programs)
 Back-transform any previously transformed variables and round
imputations for discrete variables.
 Analyze each imputed data set using standard statistical approaches. If
you generated M imputations (e.g., 5), you would perform M separate,
but identical analyses (e.g., 5).
 Combine results from the M multiply imputed analyses (using NORM,
SAS PROC MIANALYZE, or Stata miest or micombine) using Rubin’s
(1987) formulas to obtain a single set of parameter estimates and
standard errors. Both p-values and confidence intervals may be
generated.
Modern Methods:
Multiple Imputation (8)
 Steps in using MI (continued)
 Rules for combining parameter estimates and standard errors
A parameter estimate is the mean of the parameter estimates from
the multiple analyses you performed.
The standard error is computed as follows:
• Square the standard errors from the individual analyses and average them (the within-imputation variance).
• Calculate the variance of the parameter estimates across the M imputations (the between-imputation variance).
• Add the results of the previous two steps together, applying a small correction factor (1 + 1/M) to the between-imputation variance, and take the square root.
There is a separate F-statistic available for multiparameter inference
(i.e., multi-DF tests of several parameters at once).
It is also possible to combine chi-square tests from the analysis of
multiply imputed data sets.
Modern Methods:
Multiple Imputation (9)
Is it wrong to impute the DV?
Yes, if performing single, deterministic imputation
(methods historically used by econometricians)
No, if using the random draw approach of Rubin. In
fact, leaving out the DV will cause bias (it will bias
the coefficients towards zero).
Given that the goal of MI is to reproduce all the relationships in the data as closely as possible, this can only be accomplished if the dependent variable(s) are included in the imputation process.
Modern Methods:
Multiple Imputation (10)
 Available imputation software for data augmentation:
SAS: PROC MI and PROC MIANALYZE (demonstrated)
MI produces imputations
MIANALYZE combines results from analyses of imputed data into a
single set of hypothesis tests
NORM - for MV normal data (J. L. Schafer)
Windows freeware
S-Plus MISSING library
R (add-in file)
CAT, MIX, and PAN - for categorical data, mixed
categorical/normal data, and longitudinal or clustered panel
data respectively (J. L. Schafer)
S-Plus MISSING library
R (add-in file)
LISREL - http://www.ssicentral.com (Windows, commercial)
Modern Methods:
Multiple Imputation (11)
 Newly Available MI Software from Stata:
(Uses Gibbs sampler and switching regressions;
related to data augmentation)
 Can handle continuous, dichotomous, categorical and
ordinal data
 Can handle interactions
Stata: -ice- with -micombine-
http://www.stata.com/search.cgi?query=ice
http://www.ats.ucla.edu/stat/stata/library/ice.htm
From inside Stata: . findit multiple imputation
Modern Methods:
Multiple Imputation (12)
 Available Imputation Software for Sampling Importance/Resampling
(SIR):
AMELIA
Windows freeware version (NOT demonstrated)
Produces the multiply imputed MI data sets.
http://pantheon.yale.edu/~ks298/index_files/software.htm
http://gking.harvard.edu/amelia/
More complete Gauss version available
http://www.aptech.com/
STATA can be used on datasets from AMELIA (NOT demonstrated)
• MIEST - a user-written command to run and combine separate analyses into a
single model. http://gking.harvard.edu/amelia/amelia1/docs/mi.zip
• MIEST2 - modifies MIEST to output non-integer DF for hypothesis tests
 SIRNORM.SAS - SAS user-written macro
http://yates.coph.usf.edu/research/psmg/Sirnorm/sirnorm.html
Multiple Imputation Example (1)
[Same as ML Example]
Data on American colleges and
universities from US News and World
Report
N = 1302 colleges
Available from
http://lib.stat.cmu.edu/datasets/colleges
Described on p. 21 of Allison (2001)
Multiple Imputation Example (2)
 Outcome: gradrat - graduation rate (1,204 non-missing cases)
 Predictors
 csat - combined average scores on verbal and math SAT (779
non-missing cases)
 lenroll - natural log of the number of enrolling freshmen (1,297
non-missing cases)
 private - 1 = private; 0 = public (1,302 non-missing cases)
 stufac - ratio of students to faculty (x 100; 1,300 non-missing
cases)
 rmbrd - total annual cost of room and board (thousands of
dollars; 1,300 non-missing cases)
 Auxiliary Variable
 act - Mean ACT scores (714 non-missing cases)
MI SAS Example (1)
Using SAS to perform multiple imputation
Suggest running PROC UNIVARIATE or PROC FREQ prior to
running PROC MI in order to examine distributions of variables,
identify ranges, and integer precision of each variable.
Some variables will have predefined ranges that can be specified
in PROC MI. E.g., CSAT ranges 400 to 1600.
Ranges for other variables can be set to their empirical values.
SAS creates a single SAS data set containing the individual
imputed data sets stacked. Each imputed data set is denoted by the value of the SAS variable _IMPUTATION_. You can run substantive analyses on the imputed data sets by using a SAS BY statement (e.g., BY _IMPUTATION_ ;).
MI SAS Example (2)
PROC MI syntax for college graduation data set
example
PROC MI
   DATA = paul.usnews
   OUT = miout
   NIMPUTE = 10
   SEED = 12345678
   MINIMUM = 400 11 . 0 . 0 0
   MAXIMUM = 1600 31 100 100 . 1 .
   ROUND = 1 1 . 1 .001 1 . ;
MCMC
   CHAIN = MULTIPLE
   NBITER = 500 NITER = 250
   TIMEPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF)
   ACFPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF) ;
TITLE "Multiple Imputation procedure run on US News college data set" ;
VAR csat act stufac gradrat rmbrd private lenroll ;
RUN ;
MI SAS Example (3)
 PROC MI Statement
PROC MI
   DATA = paul.usnews
   OUT = miout
   NIMPUTE = 10
   SEED = 12345678
   MINIMUM = 400 11 . 0 . 0 0
   MAXIMUM = 1600 31 100 100 . 1 .
   ROUND = 1 1 . 1 .001 1 . ;
 NIMPUTE: the number of imputations (default = 5)
 SEED: use the same random number seed to replicate imputations over
multiple program runs
 MINIMUM, MAXIMUM, and ROUND
 Order of values corresponds to variables listed in the VAR statement
(i.e., csat act stufac gradrat rmbrd private lenroll)
 csat, stufac, and gradrat ranges set on basis of meaningful expectations; others
are set via empirical frequency data.
 specify minimum values, maximum values, and values to which imputations are
rounded. Useful for handling categorical and integer variables. Dots/Periods
represent no values specified. First variable cannot have a period placeholder. 57
MI SAS Example (4)
 MCMC Statement
MCMC
CHAIN = MULTIPLE
NBITER = 500 NITER = 250
TIMEPLOT (MEAN(csat rmbrd) COV (gradrat*rmbrd) WLF)
ACFPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF) ;
 CHAIN - selects single or multiple chain Markov-Chain Monte Carlo
data augmentation procedure. Multiple chain may be slightly
preferred (Allison, 2001, p. 38).
 NBITER - number of “burn in” iterations performed prior to imputed
data sets being created. Often set to twice the number of iterations
EM requires to converge (Schafer).
 NITER - number of iterations between creation of each imputed data
set.
 More iterations ensure independence between imputed data sets.
 You can diagnose non-independence with time series and
autocorrelation plots.
MI SAS Example (5)
MCMC Statement (continued)
TIMEPLOT - produces time series plot for the worst
linear function of variables containing the most
missing data (csat and rmbrd)
ACFPLOT - produces autocorrelation plot for the
worst linear functions of variables containing the
most missing data
TRANSFORM statement also available for variable
transformations
Example: TRANSFORM LOG(rmbrd/c=5)
C option adds a constant prior to transformation
Available transformations: Box-Cox, Exp, Logit, Log, Power
MI SAS Example (6)
Time Series Plot
[Time series plot: statistic value (y-axis) plotted against iteration number, -500 to 0]
MI SAS Example (7)
Autocorrelation Plot
[Autocorrelation plot: autocorrelation (-1.0 to 1.0) plotted against lag, 0 to 20]
MI SAS Example (8)
ML linear regression analysis of the data output
by PROC MI using PROC GENMOD
PROC GENMOD DATA = miout ;
TITLE "Illustration of GENMOD analysis of the college data set" ;
MODEL gradrat = csat lenroll stufac private rmbrd / COVB ;
BY _IMPUTATION_ ;
ODS OUTPUT PARAMETERESTIMATES=gmparms COVB=gmcovb ;
RUN ;
 BY statement repeats analysis for each imputed data set
 COVB option on MODEL statement displays the variance-covariance matrix
of the parameter estimates
 ODS OUTPUT statement outputs the parameter estimates and their
variance-covariance matrix to separate SAS data sets, gmparms and
gmcovb, respectively. These data sets are then combined by PROC
MIANALYZE to return a single set of results to the analyst.
MI SAS Example (9)
Combining GENMOD results with PROC
MIANALYZE: Single Parameter Inference
PROC MIANALYZE PARMS = gmparms COVB = gmcovb ;
TITLE "Single DF inferences of GENMOD analysis of US News
college data set" ;
VAR intercept csat lenroll stufac private rmbrd ;
RUN ;
 PARMS statement reads the parameter estimates; COVB
reads the variance-covariance matrix of parameter
estimates
 Note presence of INTERCEPT term on VAR statement - you will need to include it to obtain INTERCEPT results
MI SAS Example (10)
Combining GENMOD results with PROC
MIANALYZE: Multiparameter Inference
PROC MIANALYZE MULT PARMS = gmparms COVB = gmcovb ;
TITLE "Multivariate inference of MIXED analysis of US
News college data set" ;
VAR csat lenroll stufac private rmbrd ;
RUN ;
 MULT statement performs multivariate hypothesis
testing
 Note absence of intercept in the VAR statement - we do
not want it included as part of the list of variables tested
MI SAS Example (11)
Inference using other SAS procedures
REG, LOGISTIC, PROBIT, LIFEREG, and PHREG: use the OUTEST = and COVOUT options
MIXED, GLM, and CALIS: use ODS
MIXED
• request SOLUTION and COVB as MODEL statement options
• ODS OUTPUT SOLUTIONF = gmparms COVB = gmcovb ;
GENMOD for GEE: use ODS as shown in this example
substitute GEEempest and GEERCov ODS tables for the
parameter estimate and covariance matrix tables shown in
the above example.
MI Stata Example (1)
Using Stata to check the original data
* Read in original data , and save as *.dta
. insheet using usnewsN.txt, names delimit(" ") clear
. save usnews.dta, replace
* Obtain (available cases, single) estimates of means and variance
. summarize gradrat csat lenroll stufac private rmbrd act
* Obtain (available cases, pairwise) estimates of correlations
. pwcorr gradrat csat lenroll stufac private rmbrd act, obs
* Obtain (complete cases) estimates of correlations, means and variance
. corr gradrat csat lenroll stufac private rmbrd, obs
* Obtain (complete cases) estimates of regression coefficients
. regress gradrat csat lenroll stufac private rmbrd
* patterns of missingness
. mvpatterns gradrat csat lenroll stufac private rmbrd
MI Stata Example (2)
 Using Stata to create the multiply imputed datasets
(stacked together in a single dataset)
. use usnews, clear
. mvis csat act stufac gradrat rmbrd private lenroll using usnews_mvis10, m(10)
genmiss(m_) seed(12345678)
OR (better):
. ice csat act stufac gradrat rmbrd private lenroll using usnews_ice10, m(10)
seed(12345678)
Using Stata to analyze the multiply imputed datasets and
combine the results
* -micombine- to obtain MI estimates of regression coefficients
. use usnews_ice10, clear
. micombine regress gradrat csat lenroll stufac private rmbrd
. testparm csat lenroll stufac private rmbrd
Multiple Imputation Summary
 Multiple imputation is flexible: imputed datasets can be analyzed using parametric and
non-parametric techniques
 MI is available in SAS, and in S-PLUS MISSING library. Also free via NORM and AMELIA,
and in R.
 Some SAS procedures are easier to use with MI than others; SAS and NORM permit
user-specified random number seeds
 SAS and NORM permit testing multiparameter hypotheses
 Multiple imputation using Stata:
 You can use the Stata command ice to generate multiply imputed data sets and the
command micombine to combine the results from analyses of imputed data sets in
Stata. ice allows imputation of unordered or ordered categorical and continuous,
normally distributed variables. It also handles interactions properly.
 Alternatively, you can use AMELIA to generate multiply imputed data sets and feed
them into Stata for analyses. miest / miest2 can then combine the analysis results.
 All Stata estimation commands are equally easy to use with micombine, miest(2).
 ice permits user-specified random number seeds.
 micombine permits testing multiparameter hypotheses
 Multiple imputation is non-deterministic: you get a different result each time you generate
imputed data sets (unless the same random number seed is used each time)
 It is easy to include auxiliary variables into the imputation model to improve the quality of
imputations
 Compared with direct ML, large numbers of variables may be handled more easily.
Comparison of Regression
Example Results
            Listwise       Mplus          Mplus          SAS MI w/      Stata ice
            w/ SAS PROC    Direct ML      Robust ML      PROC GENMOD    w/ micombine
            GENMOD
CSAT        .067 (.006)    .067 (.005)    .067 (.005)    .067 (.005)    .066 (.005)
LEnroll     2.417 (.953)   2.083 (.595)   2.083 (.676)   2.185 (.575)   2.129 (.598)
StuFac      -.123 (.131)   -.181 (.092)   -.181 (.097)   -.184 (.097)   -.189 (.101)
            p = .348       p = .049       p = .051       p = .061       p = .066
Private     13.588 (1.933) 12.914 (1.276) 12.914 (1.327) 13.034 (1.270) 12.900 (1.374)
RmBrd       2.162 (.709)   2.404 (.548)   2.404 (.570)   2.468 (.491)   2.527 (.518)

Listwise N = 455; N = 1302 for all other analyses.
Extensions
 Multiple imputation under non-linearity and interaction - possible, but more complex than linear, main-effects-only imputation
 Multiple imputation for panel (longitudinal or clustered) data - only available off the shelf in S-PLUS (you can sometimes transform a “long” clustered data structure to a “wide” format in which multiple time points are expressed as multiple variables, perform MI, and retransform the imputed data sets into “long” form; a minimal reshape sketch follows this list).
 Weighting-based approaches to handle missing data - a promising
approach
 Non-ignorable situations - rely on a priori knowledge of missingness
mechanism
Pattern-mixture models
Selection models (e.g., Heckman’s model)
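A minimal sketch of the long-to-wide workaround mentioned above (Python/pandas purely for illustration, with hypothetical column names; the same reshape can be done with Stata's reshape or SAS PROC TRANSPOSE):

import pandas as pd

# hypothetical long-format panel: one row per subject per wave
long = pd.DataFrame({"id":   [1, 1, 2, 2],
                     "wave": [1, 2, 1, 2],
                     "y":    [3.1, None, 2.7, 2.9]})

# long -> wide: each wave's outcome becomes its own variable (y1, y2, ...)
wide = long.pivot(index="id", columns="wave", values="y")
wide.columns = [f"y{w}" for w in wide.columns]

# ... run the multivariate normal imputation on `wide` here ...

# wide -> long again after imputation
back = (wide.reset_index()
            .melt(id_vars="id", var_name="wave", value_name="y"))
back["wave"] = back["wave"].str.lstrip("y").astype(int)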
Conclusions
 Planning ahead can minimize missing cross-sectional
responses and longitudinal loss to follow-up
 Use of ad hoc methods can lead to biased results
 Modern methods are readily available for MAR data
FIML/Direct ML most convenient for models that are
supported by available software and when parametric
assumptions are met
Multiple Imputation available and effective for
remaining situations
 Imputation strategies for clustered data and non-linear
analyses available, but more complicated to implement
 Non-ignorable models are available, but still more
complicated and rest on tenuous assumptions