Handling Missing Data
Estie Hudes
Tor Neilands
UCSF Center for AIDS Prevention Studies
March 16, 2007

Presentation Overview
- Overview of concepts and approaches to handling missing data
- Missing data mechanisms: how data came to be missing
- Problems with popular ad-hoc missing data handling methods
- A more modern, better approach: maximum likelihood (FIML/direct ML)
- More on modern approaches: the EM algorithm
- Another modern approach: multiple imputation (MI)
- Extensions and conclusions

Types of Missing Data
- Item-missing: respondent is retained in the study, but does not answer all questions
- Wave-missing: respondent is observed at intermittent waves
- Drop-out: respondent ceases participation and is never observed again
- Combinations of the above

Methods of Handling Missing Data
- First method: prevention of missing cases (e.g., loss to follow-up) and individual item non-response
- Second method: ad-hoc approaches (e.g., listwise/casewise deletion)
- Third method: maximum likelihood-based approaches (e.g., direct ML) and related approaches (e.g., restricted ML)

Prevention of Missing Data
- Minimize individual item non-response
  - CASI and A-CASI may prove helpful
  - Interviewer-administered surveys
  - Avoid self-administered surveys where possible
- Minimize loss to follow-up in longitudinal studies by incorporating good participant tracking protocols, appropriate use of incentives, and reducing respondent burden

Ad-hoc Approaches to Handling Missing Data
- Listwise deletion (a.k.a. complete-case analysis)
- Pairwise deletion (a.k.a. available-case analysis)
- Dummy variable adjustment (Cohen & Cohen)
- Single imputation
  - Replacement with variable or participant means
  - Regression
  - Hot deck

Modern Approaches to Handling Missing Data
- Maximum likelihood (FIML/direct ML)
- EM algorithm
- Multiple imputation (MI)
- Selection models and pattern-mixture models for non-ignorable data
- Weighting
- We will confine our discussion to direct ML, the EM algorithm, and multiple imputation

A Tour of Missing Data Mechanisms
- How did the data become incomplete or missing?
  - Missing Completely at Random (MCAR)
  - Missing at Random (MAR)
  - Not Ignorable Non-Response (NMAR; nonignorable missingness; informative missingness)
- Influential article: Rubin (1976) in Biometrika

Missing Data Mechanisms: Missing Completely at Random
- Pr(Y is missing | X, Y) = Pr(Y is missing)
- If incomplete data are MCAR, the cases with complete data are a random subset of the original sample.
- A good situation to be in if you have missing data, because listwise deletion of the cases with incomplete data is generally justified.
- A downside is loss of statistical power, especially if the cases with complete data are a small fraction of the original number of cases.

Missing Data Mechanisms: Missing at Random
- Pr(Y is missing | X, Y) = Pr(Y is missing | X)
- Within each level of X, the probability that Y is missing does not depend on the numerical value of Y.
- Data are MCAR within each level of X.
- MAR is a much less restrictive assumption than MCAR.

Missing Data Mechanisms: Not Missing at Random
- If incomplete data are neither MCAR nor MAR, the data are considered NMAR or non-ignorable.
- The missing data mechanism must be modeled to obtain good parameter estimates.
- Heckman's selection model is one example of NMAR modeling. Pattern-mixture models are another NMAR approach.
- Disadvantages of NMAR modeling: requires a high level of knowledge about the missingness mechanism; results are often highly sensitive to the choice of NMAR model.

Missing Data Mechanisms: Examples (1)
- Scenario: measuring systolic blood pressure (SBP) in January and February (Schafer and Graham, 2002, Psychological Methods, 7(2), 147-177)
- MCAR: Data missing in February at random, unrelated to SBP level in January or February or any other variable in the study; missing cases are a random subset of the original sample's cases.
- MAR: Data missing in February because the January measurement did not exceed 140; cases are randomly missing data within the two groups: SBP > 140 and SBP <= 140.
- NMAR: Data missing in February because the February SBP measurement did not exceed 140. (SBP taken, but not recorded if it is <= 140.) Cases' data are not missing at random.

Missing Data Mechanisms: Examples (2)
- Scenario: measuring Body Mass Index (BMI) of ambulance drivers in a longitudinal context (Heitjan, 1997, AJPH, 87(4), 548-550).
- MCAR: Data missing at follow-up because participants were out on call at the time of the scheduled measurement, i.e., the reason for data missingness is unrelated to the outcome or other measured variables; missing cases are a random subset of the population of all cases.
- MAR: Data missing at follow-up because of high BMI and embarrassment at the initial visit, regardless of whether the participant gained or lost weight since baseline, i.e., the reason for data missingness is related to BMI, a measured variable in the study.
- NMAR: Data missing at follow-up because of weight gain since the last visit (assuming weight gain is unrelated to other measured variables in the study).
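
Illustration: the three mechanisms in the SBP scenario
The SBP scenario can be mimicked with a small simulation. The sketch below is only an illustration under stated assumptions: the SBP distribution (mean 125, SD 15), the January-February relationship, and the 50% MCAR deletion rate are hypothetical choices, not values from Schafer and Graham. It compares the complete-case mean of February SBP with the true mean, and also shows a regression-based estimate that uses the fully observed January value.

```python
import random

def simulate(mechanism, n=20000, seed=1):
    """Simulate January/February SBP and delete February values by `mechanism`.

    Returns (mean of all February values,
             mean of the February values that remain observed,
             regression-adjusted estimate of the February mean).
    """
    rng = random.Random(seed)
    jan_all, feb_all, pairs = [], [], []
    for _ in range(n):
        jan = rng.gauss(125, 15)                          # hypothetical SBP distribution
        feb = 125 + 0.7 * (jan - 125) + rng.gauss(0, 10)  # February correlated with January
        if mechanism == "MCAR":
            observed = rng.random() < 0.5   # unrelated to any SBP value
        elif mechanism == "MAR":
            observed = jan > 140            # depends only on the observed January value
        elif mechanism == "NMAR":
            observed = feb > 140            # depends on the value that goes missing
        else:
            raise ValueError(mechanism)
        jan_all.append(jan)
        feb_all.append(feb)
        if observed:
            pairs.append((jan, feb))

    def mean(values):
        return sum(values) / len(values)

    jan_obs = [j for j, _ in pairs]
    feb_obs = [f for _, f in pairs]
    jbar, fbar = mean(jan_obs), mean(feb_obs)
    # Regress February on January among observed cases, then predict at the
    # January mean of the full sample (appropriate when missingness depends
    # only on January, i.e., under MAR).
    slope = (sum((j - jbar) * (f - fbar) for j, f in pairs)
             / sum((j - jbar) ** 2 for j in jan_obs))
    adjusted = fbar + slope * (mean(jan_all) - jbar)
    return mean(feb_all), fbar, adjusted

results = {m: simulate(m) for m in ("MCAR", "MAR", "NMAR")}
for name, (truth, complete_case, adjusted) in results.items():
    print(f"{name}: true={truth:.1f}  complete-case={complete_case:.1f}  "
          f"adjusted={adjusted:.1f}")
```

In this setup the complete-case mean should be close to the truth only under MCAR; under MAR, conditioning on January recovers the February mean, while under NMAR neither estimate can be expected to do so.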

More on Missing Data Mechanisms
- Ignorable data missingness: occurs when data are incomplete due to an MCAR or MAR process
- If incomplete data arise from an MCAR or MAR data missingness mechanism, there is no need for the analyst to explicitly model the missing data mechanism (in the likelihood function), as long as the analyst uses software programs that take the missingness mechanism into account internally (several of these will be mentioned later)
- Even if data missingness is not fully MAR, methods that assume MAR usually (though not always) offer lower expected parameter estimate bias than methods that assume MCAR (Muthén, Kaplan, & Hollis, Psychometrika, 1987).

Ad-hoc Methods Unraveled (1)
- Listwise deletion: delete all cases with a missing value on any of the variables in the analysis. Only use complete cases.
  - OK if missing data are MCAR
    - Parameter estimates unbiased
    - Standard errors appropriate
  - But can result in substantial loss of statistical power
  - Biased parameter estimates if data are MAR
  - Robust to NMAR for predictor variables
  - Robust to NMAR for predictor variables OR outcome variable in logistic regression models (slopes only)

Ad-hoc Methods Unraveled (2)
- Pairwise deletion: use all available cases for computation of any sample moment
  - For computation of means, use all available data for each variable
  - For computation of covariances, use all available data on pairs of variables
- Can lead to non-positive definite variance-covariance matrices because it uses different pairs of cases for each entry.
- Can lead to biased standard errors under MAR.

Ad-hoc Methods Unraveled (3)
- Dummy variable adjustment, advocated by Cohen & Cohen (1985)
  1. When X has missing values, create a dummy variable D to indicate complete case versus case with missing data.
  2. When X is missing, fill in a constant c.
  3. Regress Y on X and D (and other non-missing predictors).
- Produces biased coefficient estimates (see Jones' 1996 JASA article)

Ad-hoc Methods Unraveled (4)
- Single imputation (of missing values)
  - Mean substitution: by variable or by observation
  - Regression imputation (i.e., replacement with conditional means)
  - Hot deck: pick "donor" cases within homogeneous strata of observed data to provide data for cases with unobserved values.
- These methods lead to biased parameter estimates (e.g., means, regression coefficients) and to variance and standard error estimates that are biased downwards.
  - One exception: Rubin (1987) provides a hot-deck based method of multiple imputation that may return unbiased parameter estimates under MAR.
- Otherwise, these methods are not recommended.

Modern Methods: Maximum Likelihood (1)
- When there are no missing data:
  - Uses the likelihood function to express the probability of the observed data, given the parameters, as a function of the unknown parameter values.
  - Example:
    $$L(\theta) = \prod_{i=1}^{n} p(x_i, y_i \mid \theta)$$
    where $p(x, y \mid \theta)$ is the (joint) probability of observing $(x, y)$ given a parameter $\theta$, for a sample of $n$ independent observations. The likelihood function is the product of the separate contributions to the likelihood from each observation.
  - MLEs are the values of the parameters which maximize the probability of the observed data (the likelihood).

Modern Methods: Maximum Likelihood (2)
- Under ordinary conditions, ML estimates are:
  - consistent (approximately unbiased in large samples)
  - asymptotically efficient (have the smallest possible variance)
  - asymptotically normal (one can use normal theory to construct confidence intervals and p-values).
- The ML approach can be easily extended to MAR situations:
    $$L(\theta) = \prod_{i=1}^{m} p(x_i, y_i \mid \theta) \prod_{j=m+1}^{n} g(y_j \mid \theta)$$
- The contribution to the likelihood from an observation with X missing is the marginal:
    $$g(y_j \mid \theta) = \sum_{x} p(x, y_j \mid \theta)$$
- This likelihood may be maximized like any other likelihood function. Often labeled FIML or direct ML.
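
Illustration: a direct ML likelihood with a marginal term
As a hedged sketch of the factored likelihood above, the code below writes out the observed-data log-likelihood for a 2 x 2 table in which one variable is missing for some cases, so that those cases contribute a marginal probability. The counts are those of the sex-by-vote example presented later in these slides (there it is the vote, Y, that is missing, so the marginal sums over vote). The function names are illustrative assumptions, and the closed-form maximizer is the standard result for this monotone pattern rather than output from any of the packages discussed.

```python
import math

# Counts from the 2 x 2 voting example later in these slides.
# Rows = sex (male, female); columns = vote (yes, no).
complete = [[28, 45], [22, 52]]
missing = [10, 15]          # vote not reported, by sex

def log_likelihood(p):
    """Observed-data log-likelihood for cell probabilities p[sex][vote]."""
    total = 0.0
    for s in range(2):
        for v in range(2):
            total += complete[s][v] * math.log(p[s][v])    # fully observed cases
        # Vote missing: the contribution is the marginal over vote.
        total += missing[s] * math.log(p[s][0] + p[s][1])
    return total

def closed_form_mle():
    """P(sex) from all cases times P(vote | sex) from the complete cases."""
    n = sum(map(sum, complete)) + sum(missing)
    p = []
    for s in range(2):
        row_complete = sum(complete[s])
        p_sex = (row_complete + missing[s]) / n
        p.append([complete[s][v] / row_complete * p_sex for v in range(2)])
    return p

mle = closed_form_mle()
complete_case = [[c / 147 for c in row] for row in complete]  # listwise-deletion proportions

print([[round(x, 4) for x in row] for row in mle])
# → [[0.1851, 0.2975], [0.1538, 0.3636]]
print(log_likelihood(mle) > log_likelihood(complete_case))
# → True
```

The direct ML estimates use the 25 partially observed cases to sharpen the estimate of the sex distribution, which is why they differ from the complete-case proportions.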

Modern Methods: Maximum Likelihood (3)
- Available software to perform FIML estimation:
  - AMOS (Analysis of Moment Structures)
    - Commercial program licensed as part of SPSS (CAPS has a 10-user license for this product)
    - Fits a wide variety of univariate and multivariate linear regression, ANOVA, ANCOVA, and structural equation (SEM) models.
    - http://www.smallwaters.com
  - Mx: similar to AMOS in capabilities, less user-friendly
    - Freeware: http://views.vcu.edu/mx
  - LISREL: similar to AMOS, more features, less user-friendly
    - Commercial program: http://www.ssicentral.com

Modern Methods: Maximum Likelihood (4)
- Available software:
  - lEM: Loglinear & Event history analysis with Missing data (Jeroen Vermunt)
    - Freeware DOS program downloadable from the Internet
      - http://www.uvt.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html
    - Fits log-linear, logit, latent class, and event history models with categorical predictors.
  - Mplus
    - Similar capabilities to AMOS (commercial)
    - Less easy to use than AMOS, but more general modeling features.
    - http://www.statmodel.com

Modern Methods: Maximum Likelihood (5)
- Longitudinal data analysis software options (not discussed):
  - Normally distributed outcomes
    - SAS PROC MIXED
    - S-PLUS LME
    - Stata XTREG, XTREGAR, and XTMIXED
  - Poisson
    - Stata XTPOIS
  - Negative binomial
    - Stata XTNBREG
  - Logistic
    - Stata XTLOGIT

Modern Methods: Maximum Likelihood (6)
- Software for longitudinal analyses (continued)
  - General modeling of clustered and longitudinal data
    - Stata GLLAMM add-on command
    - SAS PROC NLMIXED
    - S-PLUS NLME
- What about Generalized Estimating Equations (GEE) for analysis of longitudinal or clustered data with missing observations?
  - Assumes incomplete data are MCAR. See Hedeker & Gibbons, 1997, Psychological Methods, p. 65, and Heitjan, AJPH, 1997, 87(4), 548-550.
  - Can be extended to accommodate the MAR assumption via a weighting approach developed by Robins, Rotnitzky, & Zhao (JASA, 1995), but it has limited applicability.

Maximum Likelihood Example (1)
2 x 2 Table with missing data

                    Vote (Y = V)
    Sex (X = S)    Yes     No      .     (Complete)
    Male            28     45     10       (73)
    Female          22     52     15       (74)
    Total           50     97     25      (147)

Cell probabilities:

                     Y       N
    Male            p11     p12
    Female          p21     p22
    Total                            1

Likelihood function:
$$L(p_{11}, p_{12}, p_{21}, p_{22}) = p_{11}^{28}\, p_{12}^{45}\, p_{21}^{22}\, p_{22}^{52}\, (p_{11} + p_{12})^{10}\, (p_{21} + p_{22})^{15}$$

Maximum Likelihood Example (2)
2 x 2 Table with missing data
$$\hat{p}_{11} = \left(\frac{28}{73}\right)\left(\frac{73 + 10}{172}\right) = 0.1851 \qquad \hat{p}_{12} = \left(\frac{45}{73}\right)\left(\frac{73 + 10}{172}\right) = 0.2975$$
$$\hat{p}_{21} = \left(\frac{22}{74}\right)\left(\frac{74 + 15}{172}\right) = 0.1538 \qquad \hat{p}_{22} = \left(\frac{52}{74}\right)\left(\frac{74 + 15}{172}\right) = 0.3636$$

Maximum Likelihood Example (3)
Using lEM for 2 x 2 Table

Input (partial):

    * R = response (NM) indicator
    * S = sex; V = vote;
    man 2               * 2 manifest variables
    res 1               * 1 response indicator
    dim 2 2 2           * with two levels
    lab R S V           * and label R
    sub SV S            * defines these two subgroups
    mod SV              * model for complete data
    dat [28 45 22 52    * subgroup SV
         10 15]         * subgroup S

Output (partial):

    *** (CONDITIONAL) PROBABILITIES ***
    * P(SV) *                            complete data only
    1 1      0.1851 (0.0311)             0.1905 (0.0324)
    1 2      0.2975 (0.0361)             0.3061 (0.0380)
    2 1      0.1538 (0.0297)             0.1497 (0.0294)
    2 2      0.3636 (0.0384)             0.3537 (0.0394)
    * P(R) *
    1        0.8547
    2        0.1453

Maximum Likelihood Example (1)
Continuous outcome & multiple predictors
- Data on American colleges and universities through US News and World Report
- N = 1302 colleges
- Available from http://lib.stat.cmu.edu/datasets/colleges
- Described on p. 21 of Allison (2001)

Maximum Likelihood Example (2)
Continuous outcome & multiple predictors
- Outcome: gradrat, graduation rate (1,204 non-missing cases)
- Predictors
  - csat: combined average scores on verbal and math SAT (779 non-missing cases)
  - lenroll: natural log of the number of enrolling freshmen (1,297 non-missing cases)
  - private: 1 = private; 0 = public (1,302 non-missing cases)
  - stufac: ratio of students to faculty (x 100; 1,300 non-missing cases)
  - rmbrd: total annual cost of room and board (thousands of dollars; 1,300 non-missing cases)
  - act: mean ACT scores (714 non-missing cases)

Maximum Likelihood Example (3)
Continuous outcome & multiple predictors
- Predict graduation rate from
  - Combined SAT
  - Number of enrolling freshmen on log scale
  - Student-faculty ratio
  - Private or public institution classification
  - Room and board costs
- Use a linear regression model
- ACT score included as an auxiliary variable
- Use AMOS and Mplus to illustrate direct ML

Maximum Likelihood Example (4)
Continuous outcome & multiple predictors
- AMOS: two methods for model specification
  - Graphical user interface
  - AMOS BASIC programming language
- Results (assuming joint MVN)

    Regression Weights         Estimate     S.E.       C.R.        P
    GradRat <-- CSAT             0.0669    0.0048    13.9488    0.0000
    GradRat <-- LEnroll          2.0832    0.5953     3.4995    0.0005
    GradRat <-- StuFac          -0.1814    0.0922    -1.9678    0.0491
    GradRat <-- Private         12.9144    1.2769    10.1142    0.0000
    GradRat <-- RMBRD            2.4040    0.5481     4.3856    0.0000

Maximum Likelihood Example (5)
Continuous outcome & multiple predictors
Mplus example (assuming joint MVN)

    INPUT INSTRUCTIONS
    TITLE:    P. Allison 6/2002 Oakland, CA Missing Data Workshop
              non-normal example
    DATA:     FILE IS D:\My Documents\Papers\Allison-Paul\usnews.txt;
    VARIABLE: NAMES ARE csat act stufac gradrat rmbrd private lenroll;
              USEVARIABLES ARE csat act stufac gradrat rmbrd private lenroll;
              MISSING ARE ALL . ;
    ANALYSIS: TYPE = general missing h1 ;
              ESTIMATOR = ML ;
    MODEL:    gradrat ON csat lenroll stufac private rmbrd ;
              gradrat WITH act ;
              csat WITH lenroll stufac private rmbrd act ;
              lenroll WITH stufac private rmbrd act ;
              stufac WITH private rmbrd act ;
              private WITH rmbrd act ;
              rmbrd WITH act ;
    OUTPUT:   patterns ;

Maximum Likelihood Example (6)
Continuous outcome & multiple predictors
Mplus results (assuming joint MVN)

    MODEL RESULTS
                     Estimates     S.E.    Est./S.E.
    GRADRAT ON
      CSAT               0.067    0.005      13.954
      LENROLL            2.083    0.595       3.501
      STUFAC            -0.181    0.092      -1.969
      PRIVATE           12.914    1.276      10.118
      RMBRD              2.404    0.548       4.387

Maximum Likelihood Example (7)
Continuous outcome & multiple predictors
- Mplus example for continuous, non-normal data
  - Uses a sandwich estimator robust to non-normality
  - Specify MLR instead of ML as the estimator
  - The Mplus MLR estimator assumes MCAR missingness and finite fourth-order moments (i.e., finite kurtosis); initial simulation studies show low bias with MAR data

                     Estimates     S.E.    Est./S.E.
    GRADRAT ON
      CSAT               0.067    0.005      13.312
      LENROLL            2.083    0.676       3.083
      STUFAC            -0.181    0.093      -1.950
      PRIVATE           12.914    1.327       9.735
      RMBRD              2.404    0.570       4.215

Maximum Likelihood Summary
- ML advantages:
  - Provides a single, deterministic set of results appropriate under MAR data missingness.
  - Well-accepted method for handling missing values (e.g., for grant writing).
  - Generally fast and convenient.
- ML disadvantages:
  - Parametric: may not always be robust to violations of distributional assumptions (e.g., multivariate normality).
  - Only available for some models via canned software (would need to program other models).
  - Most readily available for continuous outcomes and ordered categorical outcomes.
  - Available for Poisson or Cox regression with continuous predictors in Mplus, but requires numerical integration, which is time-consuming and can be challenging to use, especially with large numbers of variables.

Modern Methods: EM Algorithm (1)
- The EM algorithm proceeds in two steps to generate ML estimates for incomplete data: Expectation and Maximization. The steps alternate iteratively until convergence is attained.
  - Seminal article by Dempster, Laird, & Rubin (1977), Journal of the Royal Statistical Society, Series B, 39, 1-38. Early treatment by H.O. Hartley (1958), Biometrics, 14(2), 174-194.
- The goal is to estimate sufficient statistics that can then be used for substantive analyses. In normal theory applications these would be the means, variances, and covariances of the variables (first and second moments of the normal distributions of the variables).
- Example from Allison, pp. 19-20: for a normal theory regression scenario, consider four variables X1 - X4 that have some missing data on X3 and X4.

Modern Methods: EM Algorithm (2)
- Starting Step (0):
  - Generate starting values for the means and covariance matrix. Can use the usual formulas with listwise or pairwise deletion.
  - Use these values to calculate the linear regression of X3 on X1 and X2. Similarly for X4.
- Expectation Step (1):
  - Use the linear regression coefficients and the observed data for X1 and X2 to generate imputed values of X3 and X4.

Modern Methods: EM Algorithm (3)
- Maximization Step (2):
  - Use the newly imputed data along with the original data to compute new estimates of the sufficient statistics (e.g., means, variances, and covariances)
    - Use the usual formula to compute the mean
    - Use modified formulas to compute variances and covariances that correct for the usual underestimation of variances that occurs in single imputation approaches.
- Cycle through the expectation and maximization steps until convergence is attained (the sufficient statistic values change only negligibly from one iteration to the next).
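
Illustration: EM for two variables
The steps above can be sketched for the simplest normal-theory case: two variables, with x fully observed and y missing for some cases. This is a minimal illustration under stated assumptions, not the four-variable example from Allison and not the code of any package mentioned here; the function name `em_bivariate`, the simulated data, and the missingness rule (y missing when x > 0.5) are hypothetical. The "modified formula" of the maximization step appears as the residual-variance term added to the expected squares.

```python
import random

def em_bivariate(x, y, iterations=200):
    """EM estimates of the means, variances, and covariance of (x, y).

    x is fully observed; entries of y equal to None are missing (assumed MAR).
    Returns (mean_x, mean_y, var_x, var_y, cov_xy), all with divisor n (ML).
    """
    n = len(x)
    # Step 0: starting values from the complete cases (listwise deletion).
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    k = len(obs)
    mean_x = sum(xi for xi, _ in obs) / k
    mean_y = sum(yi for _, yi in obs) / k
    var_x = sum((xi - mean_x) ** 2 for xi, _ in obs) / k
    var_y = sum((yi - mean_y) ** 2 for _, yi in obs) / k
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs) / k

    for _ in range(iterations):
        # E-step: regression of y on x implied by the current estimates.
        slope = cov_xy / var_x
        intercept = mean_y - slope * mean_x
        resid_var = var_y - slope * cov_xy
        sum_x = sum_xx = sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = intercept + slope * xi       # conditional mean (imputed value)
                eyy = ey ** 2 + resid_var         # adds back the residual variance
            else:
                ey, eyy = yi, yi ** 2
            sum_x += xi
            sum_xx += xi ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M-step: usual moment formulas applied to the expected sufficient statistics.
        mean_x, mean_y = sum_x / n, sum_y / n
        var_x = sum_xx / n - mean_x ** 2
        var_y = sum_yy / n - mean_y ** 2
        cov_xy = sum_xy / n - mean_x * mean_y
    return mean_x, mean_y, var_x, var_y, cov_xy

# Hypothetical data: y depends on x, and y is missing whenever x > 0.5 (MAR).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [2 + 1.5 * xi + rng.gauss(0, 1) if xi <= 0.5 else None for xi in x]

estimates = em_bivariate(x, y)
observed_y = [yi for yi in y if yi is not None]
print("complete-case mean of y:", round(sum(observed_y) / len(observed_y), 3))
print("EM mean of y:           ", round(estimates[1], 3))
```

Because the high-x cases are the ones missing y, the complete-case mean of y is pulled downward, while the EM estimate uses the observed x values of every case to correct for this.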

Modern Methods: EM Algorithm (4)
- EM advantages:
  - Only needs to assume incomplete data arise from a MAR process, not MCAR
  - Fast (relative to MCMC-based multiple imputation approaches)
  - Applicable to a wide range of data analysis scenarios
  - Uses all available data to estimate sufficient statistics
  - Fairly robust to non-joint MVN data
  - Provides a single, deterministic set of results
  - May be all that is needed for non-inferential analyses (e.g., Cronbach's alpha or exploratory factor analysis)
  - Lots of software (commercial and freeware)

Modern Methods: EM Algorithm (5)
- EM disadvantage:
  - Produces correct parameter estimates, but standard errors for inferential analyses will be biased downward because analyses of EM-generated data assume all data arise from a complete data set without missing information. The analyses of the EM-based data do not properly account for the uncertainty inherent in imputing missing data.
  - Recent work by Meng provides a method by which appropriate standard errors may be generated for EM-based parameter estimates
  - Bootstrapping may also be used to overcome this limitation

Modern Methods: Multiple Imputation (1)
- What is unique about MI:
  - We impute multiple data sets to analyze, not a single data set as in single imputation approaches
  - Use the EM algorithm to obtain starting values for MI
  - The differences between the imputed data sets capture the uncertainty due to imputing values
  - The actual values in the imputed data sets are less important than analysis results combined across all data sets
- Several MI advantages:
  - MI yields consistent, asymptotically efficient, and asymptotically normal estimators under MAR (same as direct ML)
  - MI-generated data sets may be used with any kind of software or model

Modern Methods: Multiple Imputation (2)
- The MI point estimate is the mean:
    $$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i$$
- The MI variance estimate is the sum of within- and between-imputation variation:
    $$V = W + \left(1 + \frac{1}{m}\right) B$$
  where
    $$W = \frac{1}{m} \sum_{i=1}^{m} V_i \qquad B = \frac{1}{m - 1} \sum_{i=1}^{m} (Q_i - \bar{Q})^2$$
  ($Q_i$ and $V_i$ are the parameter estimate and its variance in the $i$th imputed dataset)

Modern Methods: Multiple Imputation (3)
- Imputation model vs. analysis model
  - The imputation model should include any auxiliary variables (i.e., variables that are correlated with other variables that have incomplete data; variables that predict data missingness)
  - The analysis model should contain a subset of the variables from the imputation model and address issues of categorical data and non-normal data
- Texts that discuss MI in detail:
  - Little & Rubin (2002, John Wiley and Sons): a seminal classic
  - Rubin (1987, John Wiley and Sons): non-response in surveys
  - J. L. Schafer (1997, Chapman & Hall): modern and updated
  - P. Allison (2001, Sage Publications series # 136): a readable and practical overview of and introduction to MI and missing data handling approaches

Modern Methods: Multiple Imputation (4)
- Multivariate normal imputation approach
  - MI approaches exist for multivariate normal data, categorical data, mixed categorical and normal variables, and longitudinal/clustered/panel data.
  - The MV normal approach is most popular because it performs well in most applications, even with somewhat non-normal input variables (Schafer, 1997)
    - Variable transformations can further improve imputations
  - For each variable with missing data, estimate the linear regression of that variable on all other variables in the data set.
    - Using a Bayesian prior distribution for the parameters, typically noninformative, regression parameters are drawn from the posterior Bayesian distribution.
    - Estimated regression equations are used to generate predicted values for missing data points.

Modern Methods: Multiple Imputation (5)
- Multivariate normal imputation approach (continued)
  - Add to each predicted value a random draw from the residual normal distribution to reflect uncertainty due to incomplete data.
  - Obtaining Bayesian posterior random draws is the most complex part of the procedure.
  - Two approaches:
    - Data augmentation: implemented in NORM and PROC MI
      - Uses a Markov chain Monte Carlo (MCMC) approach to generate the imputed values
      - A variant of data augmentation: implemented in ice (and MICE). Uses a Gibbs sampler and switching regressions approach (Fully Conditional Specification, FCS) to generate the imputed values (van Buuren)
    - Sampling Importance/Resampling (SIR): implemented in Amelia and a user-written macro in SAS (sirnorm.sas); claimed to be faster than data augmentation-based approaches.
  - "The relative superiority of these methods is far from settled" (Allison, 2001, p. 34)

Modern Methods: Multiple Imputation (6)
- Steps in using MI
  - Select variables for the imputation model: use all variables in the analysis model, including any dependent variable(s), and any variables that are associated with variables that have missing data or with the probability of those variables having missing data (auxiliary variables), in part or in whole.
  - Transform non-normal continuous variables to attain normality (e.g., skewed variables)
  - Select a random number seed for imputations (if possible)
  - Choose the number of imputations to generate
    - Typically 5 to 10: > 90% coverage & efficiency with 90% or less missing information in large sample scenarios with M = 5 imputations (Rubin, 1987)
    - Sometimes, however, you may need more imputations (e.g., 20 or more for some longitudinal scenarios).
    - You can compute the relative efficiency of parameter estimates as: relative efficiency = (1 / (1 + rate of missing information / number of imputations)) x 100.
    - Several MI software programs output the missing information rates for parameters, allowing the analyst to easily compute relative efficiencies

Modern Methods: Multiple Imputation (7)
- Steps in using MI (continued):
  - Produce the multiply imputed data sets
    - Estimated parameters must be independent of initial values
    - Assess independence via autocorrelation and time series plots (when using MCMC-based MI programs)
  - Back-transform any previously transformed variables and round imputations for discrete variables.
  - Analyze each imputed data set using standard statistical approaches. If you generated M imputations (e.g., 5), you would perform M separate, but identical, analyses (e.g., 5).
  - Combine results from the M multiply imputed analyses (using NORM, SAS PROC MIANALYZE, or Stata miest or micombine) using Rubin's (1987) formulas to obtain a single set of parameter estimates and standard errors. Both p-values and confidence intervals may be generated.

Modern Methods: Multiple Imputation (8)
- Steps in using MI (continued)
  - Rules for combining parameter estimates and standard errors
    - A parameter estimate is the mean of the parameter estimates from the multiple analyses you performed.
    - The standard error is computed as follows:
      - Square the standard errors from the individual analyses and average them across the M imputations.
      - Calculate the variance of the parameter estimates across the M imputations.
      - Add the results of the previous two steps together, applying a small correction factor to the variance in the second step, and take the square root.
    - There is a separate F-statistic available for multiparameter inference (i.e., multi-DF tests of several parameters at once).
    - It is also possible to combine chi-square tests from the analysis of multiply imputed data sets.

Modern Methods: Multiple Imputation (9)
- Is it wrong to impute the DV?
  - Yes, if performing single, deterministic imputation (methods historically used by econometricians)
  - No, if using the random draw approach of Rubin.
  - In fact, leaving out the DV will cause bias (it will bias the coefficients towards zero).
  - Given that the goal of MI is to reproduce all the relationships in the data as closely as possible, this can only be accomplished if all dependent variables are included in the imputation process.

Modern Methods: Multiple Imputation (10)
- Available imputation software for data augmentation:
  - SAS: PROC MI and PROC MIANALYZE (demonstrated)
    - MI produces imputations
    - MIANALYZE combines results from analyses of imputed data into a single set of hypothesis tests
  - NORM: for MV normal data (J. L. Schafer)
    - Windows freeware
    - S-Plus MISSING library
    - R (add-in file)
  - CAT, MIX, and PAN: for categorical data, mixed categorical/normal data, and longitudinal or clustered panel data, respectively (J. L. Schafer)
    - S-Plus MISSING library
    - R (add-in file)
  - LISREL: http://www.ssicentral.com (Windows, commercial)

Modern Methods: Multiple Imputation (11)
- Newly available MI software for Stata (uses Gibbs sampler and switching regressions; related to data augmentation)
  - Can handle continuous, dichotomous, categorical, and ordinal data
  - Can handle interactions
- Stata: -ice- with -micombine-
  - http://www.stata.com/search.cgi?query=ice
  - http://www.ats.ucla.edu/stat/stata/library/ice.htm
  - From inside Stata: . findit multiple imputation

Modern Methods: Multiple Imputation (12)
- Available imputation software for Sampling Importance/Resampling (SIR):
  - AMELIA
    - Windows freeware version (NOT demonstrated)
      - Produces the multiply imputed MI data sets.
      - http://pantheon.yale.edu/~ks298/index_files/software.htm
      - http://gking.harvard.edu/amelia/
    - More complete Gauss version available: http://www.aptech.com/
    - STATA can be used on datasets from AMELIA (NOT demonstrated)
      - MIEST: a user-written command to run and combine separate analyses into a single model.
        http://gking.harvard.edu/amelia/amelia1/docs/mi.zip
      - MIEST2: modifies MIEST to output non-integer DF for hypothesis tests
  - SIRNORM.SAS: SAS user-written macro
    - http://yates.coph.usf.edu/research/psmg/Sirnorm/sirnorm.html

Multiple Imputation Example (1) [Same as ML Example]
- Data on American colleges and universities from US News and World Report
- N = 1302 colleges
- Available from http://lib.stat.cmu.edu/datasets/colleges
- Described on p. 21 of Allison (2001)

Multiple Imputation Example (2)
- Outcome: gradrat, graduation rate (1,204 non-missing cases)
- Predictors
  - csat: combined average scores on verbal and math SAT (779 non-missing cases)
  - lenroll: natural log of the number of enrolling freshmen (1,297 non-missing cases)
  - private: 1 = private; 0 = public (1,302 non-missing cases)
  - stufac: ratio of students to faculty (x 100; 1,300 non-missing cases)
  - rmbrd: total annual cost of room and board (thousands of dollars; 1,300 non-missing cases)
- Auxiliary variable
  - act: mean ACT scores (714 non-missing cases)

MI SAS Example (1)
- Using SAS to perform multiple imputation
  - We suggest running PROC UNIVARIATE or PROC FREQ prior to running PROC MI in order to examine the distributions of variables and to identify the range and integer precision of each variable.
    - Some variables will have predefined ranges that can be specified in PROC MI. E.g., CSAT ranges from 400 to 1600.
    - Ranges for other variables can be set to their empirical values.
  - SAS creates a single SAS data set containing the individual imputed data sets stacked. Each imputed data set is denoted by the value of the SAS variable _IMPUTATION_.
  - You can run substantive analyses on the imputed data sets by using a SAS BY statement (e.g., BY _IMPUTATION_ ; ).

MI SAS Example (2)
PROC MI syntax for college graduation data set example

    PROC MI DATA = paul.usnews OUT = miout NIMPUTE = 10 SEED = 12345678
            MINIMUM = 400  11 .   0   .    0 0
            MAXIMUM = 1600 31 100 100 .    1 .
            ROUND   = 1    1  .   1   .001 1 . ;
       MCMC CHAIN = MULTIPLE NBITER = 500 NITER = 250
            TIMEPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF)
            ACFPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF) ;
       TITLE "Multiple Imputation procedure run on US News college data set" ;
       VAR csat act stufac gradrat rmbrd private lenroll ;
    RUN ;

MI SAS Example (3)
PROC MI Statement

    PROC MI DATA = paul.usnews OUT = miout NIMPUTE = 10 SEED = 12345678
            MINIMUM = 400  11 .   0   .    0 0
            MAXIMUM = 1600 31 100 100 .    1 .
            ROUND   = 1    1  .   1   .001 1 . ;

- NIMPUTE: the number of imputations (default = 5)
- SEED: use the same random number seed to replicate imputations over multiple program runs
- MINIMUM, MAXIMUM, and ROUND
  - Order of values corresponds to the variables listed in the VAR statement (i.e., csat act stufac gradrat rmbrd private lenroll)
  - csat, stufac, and gradrat ranges are set on the basis of meaningful expectations; others are set via empirical frequency data.
  - Specify minimum values, maximum values, and values to which imputations are rounded. Useful for handling categorical and integer variables.
  - Dots/periods mean that no value is specified. The first variable cannot have a period placeholder.

MI SAS Example (4)
MCMC Statement

    MCMC CHAIN = MULTIPLE NBITER = 500 NITER = 250
         TIMEPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF)
         ACFPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF) ;

- CHAIN: selects the single or multiple chain Markov chain Monte Carlo data augmentation procedure. Multiple chain may be slightly preferred (Allison, 2001, p. 38).
- NBITER: number of "burn-in" iterations performed prior to imputed data sets being created. Often set to twice the number of iterations EM requires to converge (Schafer).
- NITER: number of iterations between creation of each imputed data set. More iterations help ensure independence between imputed data sets. You can diagnose non-independence with time series and autocorrelation plots.
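
Illustration: what an autocorrelation diagnostic measures
To make the role of NITER concrete, the sketch below computes the sample autocorrelation of a chain of draws at several lags. It is only an illustration under stated assumptions: the chain is a hypothetical first-order autoregressive series standing in for successive MCMC draws of a parameter, not output from PROC MI, and the function name is illustrative.

```python
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((value - mean) ** 2 for value in series)
    numer = sum((series[t] - mean) * (series[t + lag] - mean)
                for t in range(n - lag))
    return numer / denom

# Hypothetical stand-in for successive draws of one parameter:
# each draw depends strongly on the previous one.
rng = random.Random(42)
chain = [0.0]
for _ in range(19999):
    chain.append(0.9 * chain[-1] + rng.gauss(0, 1))

for lag in (1, 10, 50, 250):
    print(f"lag {lag:3d}: autocorrelation = {autocorrelation(chain, lag):.3f}")
```

Draws that are close together in the chain are strongly correlated, while draws separated by many iterations are nearly independent; spacing imputations far enough apart (the role NITER plays) is what keeps the imputed data sets from being near-copies of one another.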

MI SAS Example (5)
MCMC Statement (continued)
- TIMEPLOT: produces time series plots for the worst linear function and for the variables containing the most missing data (csat and rmbrd)
- ACFPLOT: produces autocorrelation plots for the worst linear function and for the variables containing the most missing data
- A TRANSFORM statement is also available for variable transformations
  - Example: TRANSFORM LOG(rmbrd/c=5)
  - The C option adds a constant prior to transformation
  - Available transformations: Box-Cox, Exp, Logit, Log, Power

MI SAS Example (6)
[Figure: time series plot; x-axis: Iteration]

MI SAS Example (7)
[Figure: autocorrelation plot; x-axis: Lag]

MI SAS Example (8)
ML linear regression analysis of the data output by PROC MI using PROC GENMOD

    PROC GENMOD DATA = miout ;
       TITLE "Illustration of GENMOD analysis of the college data set" ;
       MODEL gradrat = csat lenroll stufac private rmbrd / COVB ;
       BY _IMPUTATION_ ;
       ODS OUTPUT PARAMETERESTIMATES=gmparms COVB=gmcovb ;
    RUN ;

- The BY statement repeats the analysis for each imputed data set
- The COVB option on the MODEL statement displays the variance-covariance matrix of the parameter estimates
- The ODS OUTPUT statement outputs the parameter estimates and their variance-covariance matrix to separate SAS data sets, gmparms and gmcovb, respectively. These data sets are then combined by PROC MIANALYZE to return a single set of results to the analyst.
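
Illustration: combining results across imputations
The combining step can be sketched directly from the MI point-estimate and variance formulas given earlier (Rubin, 1987). This is a minimal illustration of those rules for a single parameter, not a description of how PROC MIANALYZE is implemented. The function names are illustrative, the five estimates and standard errors are hypothetical numbers, and the rate of missing information uses a simple large-sample approximation.

```python
import math

def pool(estimates, variances):
    """Combine one parameter across m imputed-data analyses.

    estimates: the m point estimates Q_i
    variances: the m squared standard errors V_i
    Returns (pooled estimate, pooled standard error,
             approximate rate of missing information).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # MI point estimate
    within = sum(variances) / m                                  # W
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # B
    total = within + (1 + 1 / m) * between                       # V = W + (1 + 1/m) B
    missing_info = (1 + 1 / m) * between / total                 # large-sample approximation
    return q_bar, math.sqrt(total), missing_info

def relative_efficiency(missing_info, m):
    """Efficiency (percent) of m imputations relative to infinitely many."""
    return 100 / (1 + missing_info / m)

# Hypothetical results for one coefficient from m = 5 imputed data sets.
estimates = [0.066, 0.068, 0.067, 0.065, 0.069]
std_errors = [0.0047, 0.0049, 0.0048, 0.0048, 0.0050]

q_bar, se, gamma = pool(estimates, [s ** 2 for s in std_errors])
print(f"pooled estimate = {q_bar:.4f}, pooled SE = {se:.4f}")
print(f"approx. rate of missing information = {gamma:.3f}")
print(f"relative efficiency with 5 imputations = {relative_efficiency(gamma, 5):.1f}%")
```

The pooled standard error is larger than any single-analysis standard error would suggest once the estimates vary across imputations, which is how MI carries the imputation uncertainty into the final inference.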
MI SAS Example (9)
Combining GENMOD results with PROC MIANALYZE: Single Parameter Inference

    PROC MIANALYZE PARMS = gmparms COVB = gmcovb ;
      TITLE "Single DF inferences of GENMOD analysis of US News college data set" ;
      VAR intercept csat lenroll stufac private rmbrd ;
    RUN ;

• The PARMS= option reads the parameter estimates; the COVB= option reads the variance-covariance matrix of the parameter estimates.
• Note the presence of the INTERCEPT term on the VAR statement: you need to include it to obtain INTERCEPT results.

MI SAS Example (10)
Combining GENMOD results with PROC MIANALYZE: Multiparameter Inference

    PROC MIANALYZE MULT PARMS = gmparms COVB = gmcovb ;
      TITLE "Multivariate inference of GENMOD analysis of US News college data set" ;
      VAR csat lenroll stufac private rmbrd ;
    RUN ;

• The MULT option performs multivariate hypothesis testing.
• Note the absence of the intercept on the VAR statement: we do not want it included in the list of variables tested.

MI SAS Example (11)
Inference using other SAS procedures

• REG, LOGISTIC, PROBIT, LIFEREG, and PHREG: use the OUTEST= and COVOUT options.
• MIXED, GLM, and CALIS: use ODS. For MIXED:
    • request SOLUTION and COVB as MODEL statement options
    • ODS OUTPUT SOLUTIONF = gmparms COVB = gmcovb ;
• GENMOD for GEE: use ODS as shown in this example, substituting the GEEEmpPEst and GEERCov ODS tables for the parameter estimate and covariance matrix tables used in the example above.

MI Stata Example (1)
Using Stata to check the original data

    * Read in original data, and save as *.dta
    . insheet using usnewsN.txt, names delimit (" ") clear
    . save usnews.dta, replace
    * Obtain (available cases, single) estimates of means and variance
    . summarize gradrat csat lenroll stufac private rmbrd act
    * Obtain (available cases, pairwise) estimates of correlations
    . pwcorr gradrat csat lenroll stufac private rmbrd act, obs
    * Obtain (complete cases) estimates of correlations, means and variance
    .
      corr gradrat csat lenroll stufac private rmbrd, means
    * Obtain (complete cases) estimates of regression coefficients
    . regress gradrat csat lenroll stufac private rmbrd
    * Display patterns of missingness
    . mvpatterns gradrat csat lenroll stufac private rmbrd

MI Stata Example (2)
Using Stata to create the multiply imputed datasets (stacked together in a single dataset)

    . use usnews, clear
    . mvis csat act stufac gradrat rmbrd private lenroll using usnews_mvis10, m(10) genmiss(m_) seed(12345678)
    OR (better):
    . ice csat act stufac gradrat rmbrd private lenroll using usnews_ice10, m(10) seed(12345678)

Using Stata to analyze the multiply imputed datasets and combine the results

    * Use micombine to obtain MI estimates of regression coefficients
    . use usnews_ice10, clear
    . micombine regress gradrat csat lenroll stufac private rmbrd
    . testparm csat lenroll stufac private rmbrd

Multiple Imputation Summary

• Multiple imputation is flexible: imputed datasets can be analyzed using parametric and non-parametric techniques.
• MI is available in SAS and in the S-PLUS MISSING library. It is also available free of charge via NORM and AMELIA, and in R.
    • Some SAS procedures are easier to use with MI than others.
    • SAS and NORM permit user-specified random number seeds.
    • SAS and NORM permit testing multiparameter hypotheses.
• Multiple imputation using Stata:
    • You can use the Stata command ice to generate multiply imputed data sets and the command micombine to combine the results from analyses of the imputed data sets in Stata. ice allows imputation of unordered or ordered categorical variables and of continuous, normally distributed variables. It also handles interactions properly.
    • Alternatively, you can use AMELIA to generate multiply imputed data sets and feed them into Stata for analysis. miest / miest2 can then combine the analysis results.
    • All Stata estimation commands are equally easy to use with micombine and miest / miest2.
    • ice permits user-specified random number seeds.
    • micombine permits testing multiparameter hypotheses.
• Multiple imputation is non-deterministic: you get a different result each time you generate imputed data sets (unless the same random number seed is used each time).
• It is easy to include auxiliary variables in the imputation model to improve the quality of the imputations.
• Compared with direct ML, large numbers of variables may be handled more easily.

Comparison of Regression Example Results
(estimates, with standard errors in parentheses)

            Listwise w/ SAS   Mplus             Mplus             SAS MI with       Stata ice with
            PROC GENMOD       Direct ML         Robust ML         PROC GENMOD       micombine
  CSAT      .067 (.006)       .067 (.005)       .067 (.005)       .067 (.005)       .066 (.005)
  LEnroll   2.417 (.953)      2.083 (.595)      2.083 (.676)      2.185 (.575)      2.129 (.598)
  StuFac    -.123 (.131)      -.181 (.092)      -.181 (.097)      -.184 (.097)      -.189 (.101)
            p = .348          p = .049          p = .051          p = .061          p = .066
  Private   13.588 (1.933)    12.914 (1.276)    12.914 (1.327)    13.034 (1.270)    12.900 (1.374)
  RmBrd     2.162 (.709)      2.404 (.548)      2.404 (.570)      2.468 (.491)      2.527 (.518)

Listwise N = 455; N = 1302 for all other analyses.

Extensions

• Multiple imputation under non-linearity and interaction - possible, but more complex than with linear main effects only.
• Multiple imputation for panel (longitudinal or clustered) data - only available off the shelf in S-PLUS (you can sometimes transform a "long" clustered data structure to a "wide" format in which multiple time points are expressed as multiple variables, perform MI, and retransform the imputed data sets into "long" form).
• Weighting-based approaches to handle missing data - a promising approach.
• Non-ignorable situations - rely on a priori knowledge of the missingness mechanism.
    • Pattern-mixture models
    • Selection models (e.g., Heckman's model)

Conclusions

• Planning ahead can minimize missing cross-sectional responses and longitudinal loss to follow-up.
• Use of ad hoc methods can lead to biased results.
• Modern methods are readily available for MAR data.
    • FIML/Direct ML is most convenient for models that are supported by available software and when parametric assumptions are met.
    • Multiple imputation is available and effective for the remaining situations.
• Imputation strategies for clustered data and non-linear analyses are available, but more complicated to implement.
• Non-ignorable models are available, but are still more complicated and rest on tenuous assumptions.