Imputation Methods Document
by Don Miller and Vivien Chen
Population Research Institute
Pennsylvania State University
Version 1.0 (Dec. 15, 2006)

TABLE OF CONTENTS

Imputation Overview

Imputation/Regression using SAS (PROC MI / MIANALYZE)
    Imputation
    Syntax example A (linear regression)
    Syntax example B (logistic regression)
    Imputation options
    ODS statements for other SAS regression procedures

Imputation/Regression using Stata (ICE)
    Syntax example A (linear regression)

Imputation/Regression using R (MICE)
    Imputation
    Syntax example A (linear regression)
    Syntax example B (logistic regression)

Imputation/Regression using R / S-Plus (NORM)
    Syntax example A (linear regression)

(Future development items:)

1. Imputation/Regression using R (Amelia)
   As of this writing, we had trouble getting Amelia to converge, even with rather simple datasets and models. Amelia would either stop with an error (saying there were not enough observations) or run indefinitely without converging (we let one run go for 5 days straight without any output).
2. Imputation/Regression using SPSS (MVA)
   SPSS's Missing Value Analysis has been reported as being biased and limited in the types of imputations that are available. We have not yet investigated whether the current version (MVA 15.0) has improved this situation.
3. Imputation/Regression using SAS (IVEware)
4. Imputation/Regression using Matlab (Imputation Module)
5. Comparisons of output between software packages
6. Syntax examples C, D
7. Common errors and imputation problems and how to resolve them

Imputation Overview

This document reviews the imputation methods used by several software packages and gives examples of their syntax. For all software packages, we will use a three-step procedure:

1. Multiple Imputation
2. Regression
3. Combination of results

For most packages, how the results are combined depends on what type of regression was used (linear, logistic, etc.). Unfortunately, the documentation for many of these packages covers steps 2 and 3 in different places (and with different examples), so it's difficult to see how to carry the process through from beginning to end. Therefore, we will show steps 2 and 3 together for each example.

We also use the same examples across packages. That is, "Example A" for SAS uses the same model and data as "Example A" for Stata. This should help you understand how to use a different package if you're already familiar with one of them.

Imputation/Regression using SAS (PROC MI / MIANALYZE)
---------- Step 1. Imputation ----------

This step is the same for all the examples. For this example, I took the CPS 2004 data and randomly dispersed missing values among its variables. Since some of the variables are categorical (e.g. A_HGA=education), I created binary dummies for them. New versions of SAS have a CLASS statement to handle this. However, it is currently listed as "experimental" and isn't available for SAS 8, so we'll use the older method.

    /* change the following to be the path of your data: */
    libname imp "~miller/imputation";

    data imp.cpsdat2;
      set imp.cpsdat;
      /* create dummies for A_HGA=education */
      A_HGA0=0; A_HGA1=0; A_HGA2=0; A_HGA3=0; A_HGA4=0; A_HGA5=0;
      if A_HGA=0 then A_HGA0=1;           /* children */
      if A_HGA in (31:38) then A_HGA1=1;  /* < HS diploma */
      if A_HGA=39 then A_HGA2=1;          /* = HS diploma */
      if A_HGA=40 then A_HGA3=1;          /* = some college */
      if A_HGA in (41:43) then A_HGA4=1;  /* associate or bachelors */
      if A_HGA=44 then A_HGA5=1;          /* post-bachelors */
      if A_HGA=. then do;
        A_HGA0=.; A_HGA1=.; A_HGA2=.; A_HGA3=.; A_HGA4=.; A_HGA5=.;
      end;
      /* I also created dummies for A_VET=veteran status and
         PRDTRACE=race, not given here to save space */
    run;

Next, we impute the data. The "seed" here is just some number you pick. If you keep it the same each time, you can reproduce your imputation on future runs. SAS will by default create 5 imputations. They will all be stacked in the same output dataset (specified by the "out=" option), which will have a new variable (called "_imputation_") with values of 1 to 5. Note we dropped the reference categories (e.g. A_HGA0).

    proc mi data=imp.cpsdat2 seed=3315 out=imp.cpsimp;
      var A_AGE A_HRSPAY A_USLHRS FTOTVAL
          A_HGA1 A_HGA2 A_HGA3 A_HGA4 A_HGA5
          A_VET1 PRDTRACE1 PRDTRACE2 PRDTRACE3 PRDTRACE4;
    run;

Afterwards, you may wish to compare means of the imputed data to the original data:

    proc means data=imp.cpsdat2;
      var _NUMERIC_;
    run;

    proc means data=imp.cpsimp;
      var _NUMERIC_;
    run;

Imputation/Regression using SAS (PROC MI / MIANALYZE)
---------- Example A (linear regression) ----------

I wanted to model A_HRSPAY (a continuous variable) by A_AGE (age, continuous) and A_HGA (education, categorical). Step 1 (Imputation) is shown in the preceding section.

Before doing the regression, you'll have to decide if you wish to round the imputed values off. You'll want to round off if (for example) you're doing a logistic regression or you need to use the imputed values to group the data somehow.
We are not doing either in this example, so we'll leave them unrounded. Also, I didn't like how large the A_AGE values were compared to the other variables, so I decided to divide them by 100 first:

    data new;
      set imp.cpsimp;
      a_age100=a_age/100;
    run;

---------- Step 2. Regression ----------

The parameter estimates dataset (outest=parmcov) and the covariance matrices (covout) are needed for step 3. If you want standardized beta coefficients, put a "/stb" at the end of the "model" line (before the semicolon).

    proc reg data=new outest=parmcov covout;
      model a_hrspay = a_age100 a_hga1 a_hga2 a_hga3 a_hga4 a_hga5;
      by _imputation_;
    run;

You'll get a large amount of output: one regression for each value of _imputation_.

---------- Step 3. Combination of Results ----------

Now we combine the results to get the correct standard errors and confidence intervals. Note you need to include "intercept" in the "modeleffects" line. Previous versions of SAS used "var" instead of "modeleffects".

    proc mianalyze data=parmcov;
      modeleffects intercept a_age100 a_hga1 a_hga2 a_hga3 a_hga4 a_hga5;
    run;

If your data is weighted, you can specify this by adding

    WEIGHT weightvariablename;

to the PROC REG procedure. Our data is not weighted so we did not include this.

Imputation/Regression using SAS (PROC MI / MIANALYZE)
---------- Example B (logistic regression) ----------

This is a logistic model with dependent variable A_VET1 (veteran status, a binary variable) and independent variables A_HGA (education, categorical) and PRDTRACE (race, categorical). It's not the best model, but it will serve to show the necessary syntax. Step 1 (Imputation) is shown in the SAS Imputation section above.

Since this is a logistic regression, I decided to round off my results. I could have done this in the PROC MI step (using the ROUND= option), but I prefer to round myself afterwards.

    data new;
      set imp.cpsimp;  /* the out= dataset from Step 1 */
      if a_vet1<0.5 then a_vet1r=0; else a_vet1r=1;
      if a_hga1<0.5 then a_hga1r=0; else a_hga1r=1;
      if a_hga2<0.5 then a_hga2r=0; else a_hga2r=1;
      /* ... and so on ... */
    run;

---------- Step 2. Regression ----------

To pass the logistic results to PROC MIANALYZE for analysis, we need to use an ODS statement. Using an OUT= or OUTEST= option as in Example A will not give the desired results.

    proc logistic data=new descending;
      model a_vet1r = a_hga1r a_hga2r a_hga3r a_hga4r a_hga5r
                      prdtrace1r prdtrace2r prdtrace3r prdtrace4r /covb;
      by _imputation_;
      ods output ParameterEstimates=parmsdat CovB=covbdat;
    run;

---------- Step 3. Combination of Results ----------

Note how the syntax differs here (when using ODS) from Example A.

    proc mianalyze parms=parmsdat covb=covbdat;
      modeleffects intercept a_hga1r a_hga2r a_hga3r a_hga4r a_hga5r
                   prdtrace1r prdtrace2r prdtrace3r prdtrace4r;
    run;

Imputation/Regression using SAS (PROC MI / MIANALYZE)
---------- Imputation options ----------

Here are some of the options available to the PROC MI procedure. The following options are put on the PROC MI line, so you would do something like this:

    proc mi data=imp.cpsdat2 seed=3315 out=imp.cpsimp nimpute=5;
    ...

Only the commonly used options are given below:

    nimpute=5
Sets the number of imputations; the default is 5. Setting it to 0 gives just the missing-data patterns (no imputation is run).

    minimum=0 0 . 0
    maximum=1 1 . .
Set minimum and maximum values for each variable. The positions correspond to the order in which the variables appear in the var statement. In the above example, the first two variables will have a min of 0 and a max of 1. The third variable has no min or max defined (SAS will not apply a min or max limit at all to this variable). The fourth will have a min of 0 but no maximum.

    round=1 1 1 0.01
Round off to the given precision after imputation.
I should note that combining the minimum, maximum, and round options sometimes has unexpected results. For example, a min of 0, a max of 1, and a round of 1 should only output values of 0 or 1, but I've seen values of 0.000000001 and 0.999999999. Presumably this is roundoff error, but it is troublesome nonetheless. As I mentioned before, I frequently ignore these options and round manually.

    alpha=0.05
Significance level for the confidence limits (0.05 gives 95% limits).

    mu0=0.5 0.5 0.5 25
Null-hypothesis means for the t-tests of μ=μ0, one value per variable in the order of the var statement.

The following two are statements rather than options on the PROC MI line. Each goes on its own line, after the semicolon that ends the PROC MI statement:

    em maxiter=200 out=emdata;
Set the maximum number of iterations. The default is 200, which is sometimes low.

    class sex race;
Specify categorical variables (don’t need to use dummies). (This is new / experimental.)

Imputation/Regression using SAS (PROC MI / MIANALYZE)
---------- ODS statements for other SAS regression procedures ----------

When using SAS regression procedures on imputed data, the ODS statements that output the parameter estimates and covariance matrix (needed to combine the results in PROC MIANALYZE) unfortunately vary between procedures. Here are some examples of other SAS regression procedures and how to get the output into PROC MIANALYZE. PROC REG and PROC LOGISTIC have already been shown in Examples A and B.

    proc mixed data=impdat;
      model drivesfast=sex black other age /solution covb;
      by _imputation_;
      ods output covparms=parmcov;
    run;

    proc genmod data=impdat;
      model drivesfast=sex black other age /covb;
      by _imputation_;
      ods output ParameterEstimates=parmsdat CovB=covbdat;
    run;

    proc glm data=impdat;
      model drivesfast=sex black other age /inverse;
      by _imputation_;
      ods output ParameterEstimates=parmsdat InvXPX=xpxidat;
    run;

How you read in the data in PROC MIANALYZE depends on the procedure used:

For PROC REG, MIXED:
    proc mianalyze data=parmcov;
For PROC LOGISTIC, GENMOD:
    proc mianalyze parms=parmsdat covb=covbdat;
For PROC GLM:
    proc mianalyze parms=parmsdat xpxi=xpxidat;

Note: The ODS syntax has fluctuated a bit over the last few releases of SAS, and the ODS OUTPUT statements may have to be altered somewhat depending on your version of SAS. You can use an

    ods trace on /listing;

statement before your regression procedure to figure out which items you need to include in the ODS OUTPUT statement, but the exact details of how to do this are not presented in this version of the paper.

Imputation/Regression using Stata (ICE)
---------- Example A (linear regression) ----------

Again, we took the same CPS 2004 data as we used in SAS (with missing values randomly dispersed among its variables). Again we model A_HRSPAY (a continuous variable) by A_AGE (age, continuous) and A_HGA (education, categorical). Here is the corresponding syntax in Stata:

    clear
    set more off
    set mem 40m
    log using cps04-impute1-stata.log,replace
    use cps2004small.dta,clear

This creates the altered age variable (with values divided by 100):

    gen a_age100=a_age/100

Here is the imputation procedure:

    ice a_age100 a_uslhrs a_hrspay a_hga1 a_hga2 a_hga3 a_hga4 a_hga5 ///
        a_vet1 prdtrace1 prdtrace2 prdtrace3 prdtrace4 ///
        using cps04imp_stata,replace m(5)

And finally the regression is run.
The regression and the combination of results are done in one line (unlike SAS, where two different procedures are used):

    use cps04imp_stata, clear
    micombine reg a_hrspay a_age100 a_hga1 a_hga2 a_hga3 a_hga4 a_hga5
    log close

Imputation/Regression using R (MICE)
---------- Step 1. Imputation ----------

This step is the same for all the examples. For this example, I took the CPS 2004 data and randomly dispersed missing values among its variables. By default, MICE will use all variables in the dataset in the imputation. We had an ID variable (PERIDNUM) which we made the row names (row.names="PERIDNUM" below).

    # Load the mice library for later use
    library(mice)
    # Read comma-delimited file into R, csv file is in parent directory
    cps2004 <- read.csv("../cps2004.csv",header=TRUE,row.names="PERIDNUM")
    # Make sure variable/column names were included
    names(cps2004)

Since some of the variables are categorical (e.g. A_HGA=education), I created binary dummies for them.

    # Create dummies
    cps2004$A_HGA0 <- cps2004$A_HGA==0
    cps2004$A_HGA1 <- cps2004$A_HGA>=31 & cps2004$A_HGA<=38
    cps2004$A_HGA2 <- cps2004$A_HGA==39
    cps2004$A_HGA3 <- cps2004$A_HGA==40
    cps2004$A_HGA4 <- cps2004$A_HGA>=41 & cps2004$A_HGA<=43
    cps2004$A_HGA5 <- cps2004$A_HGA==44
    # I also created dummies for A_VET=veteran status and
    # PRDTRACE=race, not given here to save space

Also, I didn't like how large the A_AGE values were compared to the other variables, so I decided to divide them by 100 first:

    # Divide age by 100 for regression
    cps2004$A_AGE <- cps2004$A_AGE / 100

Again, since MICE includes all variables by default, we need to remove the reference categories. This is done by the "[-c(5,11,13)]" construct in the mice statement, which removes columns 5, 11, and 13.

    # Create data object with imputed values, remove reference variables
    cps2004.imp <- mice(cps2004[-c(5,11,13)])
    # Look at imputed values of A_AGE
    cps2004.imp$imp$A_AGE

Imputation/Regression using R (MICE)
---------- Example A (linear regression) ----------

I wanted to model A_HRSPAY (a continuous variable) by A_AGE (age, continuous) and A_HGA (education, categorical). Step 1 (Imputation) is shown in the preceding section.

Before doing the regression, you'll have to decide if you wish to round the imputed values off. You'll want to round off if (for example) you're doing a logistic regression or you need to use the imputed values to group the data somehow. We are not doing either in this example, so we'll leave them unrounded.

---------- Step 2. Regression and Step 3. Combination of Results ----------

In R (MICE), the regression and the pooling of results occur in the same command line. The "lm.mids" is the regression, and the "pool" is the combination of imputation results.

    # Run the model and pool the results from each imputation
    cps2004lm.pool <- pool(lm.mids(A_HRSPAY ~ A_AGE+A_HGA1+A_HGA2+A_HGA3+A_HGA4+A_HGA5,cps2004.imp))
    # Use summary function to see additional information
    cps2004.lm.summary <- summary(cps2004lm.pool)

You can alternatively just type summary(cps2004lm.pool) to look at the results without storing them in an object.

    # Look at results
    cps2004.lm.summary

If your data is weighted, you can specify this by putting a

    ,weights=weightvariablename)

at the end of the lm.mids call. Our data is not weighted so we did not include this.

Imputation/Regression using R (MICE)
---------- Example B (logistic regression) ----------

This is a logistic model with dependent variable A_VET1 (veteran status, a binary variable) and independent variables A_HGA (education, categorical) and PRDTRACE (race, categorical). It's not the best model, but it will serve to show the necessary syntax. Step 1 (Imputation) is shown in the MICE Imputation section above.
MICE automatically rounds results, so we need not round here.

---------- Step 2. Regression and Step 3. Combination of Results ----------

    # Check values of A_VET1 prior to running logistic regression
    table(cps2004$A_VET1)

We ran into a problem with GLM in MICE that caused it to not pass the parameters correctly, so we repaired the glm.mids function. It is possible that we are not calling glm.mids in the proper way and that the authors intended a different usage, but at this point we haven't seen anything to prove this.

    # Modify "glm.mids" so it will run correctly
    fix(glm.mids)

The following two lines need to be adjusted (the corrections are the family arguments):

    function (formula = formula(data), family = binomial("logit"), data = sys.parent(),
    [further down in the code:]
    analyses[[i]] <- glm(formula, data = data.i, family=binomial("logit"), ...)

In R (MICE), the regression and the pooling of results occur in the same command line. The "glm.mids" is the regression, and the "pool" is the combination of imputation results.

    # Run logistic regression model
    cps2004glm.pool <- pool(glm.mids(A_VET1 ~ A_HGA1 + A_HGA2 + A_HGA3 + A_HGA4 + A_HGA5 +
                            PRDTRACE1 + PRDTRACE2 + PRDTRACE3 + PRDTRACE4,
                            data=cps2004.imp))
    # Summary to see results
    cps2004.glm.summary <- summary(cps2004glm.pool)
    cps2004.glm.summary
    # Write results to comma delimited file
    write.csv(cps2004.glm.summary,file="log-reg-results.csv")
    write.csv(cps2004.lm.summary,file="reg-results.csv")

Imputation/Regression using R (NORM)
---------- Example A (linear regression) ----------

We begin by defining two functions. The first function does the imputation and the regression, and does some housekeeping (like converting the data to the matrix that NORM expects). This code will not actually run in R until you call it (which we'll do later).

    mult.imp.norm.lm <- function(formula,...,matrix,imputations=5)
    {
      # Load library/package if not already done
      require(norm)
      estimates <- NULL
      standard.errors <- NULL
      for (i in 1:imputations)
      {
        # Set random seed
        rngseed(abs(round(rnorm(1)*100000000)))
        # run prelim.norm function, followed by em.norm on results
        first.step <- prelim.norm(matrix)
        second.step <- em.norm(first.step)
        # Impute the data (imp.norm also takes the original data matrix)
        imputed.data <- imp.norm(first.step,second.step,matrix)
        # Assign column(variable) names to variable for later use
        c.names <- colnames(matrix)
        # Convert matrix with imputed values to data.frame
        temp.frame <- data.frame(imputed.data)
        # Assign variable names to columns
        colnames(temp.frame) <- c.names
        # Round variables
        temp.frame$A_HGA1 <- as.integer(temp.frame$A_HGA1 >= 0.5)
        temp.frame$A_HGA2 <- as.integer(temp.frame$A_HGA2 >= 0.5)
        # (we rounded the other variables as well, omitted for space)
        # Divide age by 100 for analysis
        temp.frame$A_AGE <- temp.frame$A_AGE / 100
        # "mi.inference" function wants estimates and standard errors
        coef.set.i <- summary(lm(formula,data=temp.frame))$coefficients[,1:2]
        # Display estimates and standard errors on screen
        print(coef.set.i)
        # append coefficients to objects to be returned from function
        estimates <- cbind(estimates,coef.set.i[,1])
        standard.errors <- cbind(standard.errors,coef.set.i[,2])
        mi.results <- list(estimates,standard.errors)
      }
      return(mi.results)
    }

Imputation/Regression using R (NORM)
---------- Example A (linear regression) (continued) ----------

Here is the second function we defined. It combines the coefficients. Again, this code will not actually run in R until you call it (we'll do that later).

    combine <- function(x)
    {
      # Find number of rows(variables) to calculate
      rows <- nrow(x[[1]])
      # Initialize output data
      mi.results <- NULL
      for (i in 1:rows)
      {
        # Put row name into variable
        rowname <- dimnames(x[[1]])[[1]][i]
        # Extract estimate row from list
        this.est <- x[[1]][i,]
        # Extract standard error row from list
        this.se <- x[[2]][i,]
        # Call mi.inference.mod to do the calculations
        results <- mi.inference.mod(est=this.est,std.err=this.se)
        # "Attach" variable name to results
        results <- cbind(rowname,results)
        # Append combined results (row) to output data
        mi.results <- rbind(mi.results,results)
      }
      # Return results, exit function
      return(mi.results)
    }

Before we call these functions we need to do one more thing. We kept getting a syntax error in one of NORM's functions, which we repaired. As with MICE, it is possible that we are not calling the function in the proper way and that the authors intended a different usage.

    fix(mi.inference.mod)

This line needs to be commented out:

    # dimnames(u)[[1]] <- dimnames(qstar)[[1]]

These lines need to be added near the bottom of the function (add only the first three lines; the last two lines are part of the original function):

    # Convert list to data.frame, and add row names (variable names)
    result <- data.frame(result)
    row.names(result) <- row.names(est)[[1]]
    result
    }

Imputation/Regression using R (NORM)
---------- Example A (linear regression) (continued) ----------

Now we are ready to call the functions in R.

    # Read comma delimited file into R
    cps2004 <- read.csv("../cps2004.csv",header=TRUE,row.names="PERIDNUM")
    # Check variable names
    names(cps2004)
    # Convert file to matrix (needed for norm)
    cps2004.M <- as.matrix(cps2004[-c(5,11,13)])
    # Read in user-written and modified functions
    source("mult.imp.norm.lm")
    source("mi.inference.mod")
    source("combine")
    # Run multiple imputation, store estimates and standard errors
    imp.results <- mult.imp.norm.lm(formula= A_HRSPAY ~ A_AGE + A_HGA1 + A_HGA2 +
                                    A_HGA3 + A_HGA4 + A_HGA5,matrix=cps2004.M)
    # combine estimates and errors for each imputation
    final.results <- combine(imp.results)
    # check values
    final.results
    # write to external file
    write.csv(final.results,file="norm-lm-results.csv")
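
The "combination of results" step that every package in this document performs (PROC MIANALYZE in SAS, micombine in Stata, pool in MICE, mi.inference in NORM) is based on Rubin's rules. The sketch below is not taken from any of those packages. It is a minimal base-R illustration for a single coefficient, intended only to show what the combination step computes. The function name rubin.pool and the numbers passed to it are hypothetical.

```r
# Minimal sketch of Rubin's rules for ONE coefficient (hypothetical helper,
# not part of mice or norm). est and se hold that coefficient's estimate and
# standard error from each of the m imputed datasets.
rubin.pool <- function(est, se, confidence = 0.95) {
  m <- length(est)
  stopifnot(m >= 2, length(se) == m)
  qbar  <- mean(est)               # pooled point estimate
  ubar  <- mean(se^2)              # within-imputation variance
  b     <- var(est)                # between-imputation variance
  total <- ubar + (1 + 1/m) * b    # total variance
  r     <- (1 + 1/m) * b / ubar    # relative increase in variance
  df    <- (m - 1) * (1 + 1/r)^2   # Rubin's degrees of freedom
  half  <- qt(1 - (1 - confidence)/2, df) * sqrt(total)
  c(est = qbar, std.err = sqrt(total), df = df,
    lower = qbar - half, upper = qbar + half)
}

# Made-up values for illustration only (five imputations)
rubin.pool(est = c(1.10, 1.25, 0.98, 1.17, 1.05),
           se  = c(0.20, 0.22, 0.19, 0.21, 0.20))
```

The pooled standard error is larger than the average of the per-imputation standard errors whenever the estimates differ across imputations, which is why the regressions on the individual imputed datasets should not be reported on their own.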