Multiple Imputation : Handling Interactions
Michael Spratt

Introduction
• Missing data is a considerable problem
• Complete case analysis will generally lead to systematic bias
• Have to make some assumption about the missingness
  – Most commonly used is Missing at Random (MAR)
  – Missing Not at Random (MNAR) analyses use different assumptions
• In this talk we discuss analysis when the MAR assumption is made

Introduction : MAR
• Under MAR, the probability of a value being missing does not depend on the missing data itself, given the observed data and the model parameters
  – Unlike MNAR analysis, we do not have to model the missingness mechanism explicitly

Introduction : MAR and multiple imputation
• Most common approach : perform imputation of the missing data and save multiple imputed datasets
  – Each imputed dataset differs (slightly) due to the stochastic nature of imputation
• Then carry out the substantive analysis on each of the imputed datasets
• Then combine the individual results (using Rubin’s Rules) to obtain combined imputation estimates and standard errors

MICE/ICE for imputation assuming MAR
• MICE : Multiple Imputation using Chained Equations has been widely used for imputation
  – Sometimes called FCS (fully conditional specification)
  – For general missingness patterns (does not have to assume monotone missingness)
• Implemented in
  – MICE package in R (van Buuren et al.)
  – ICE command in Stata (Royston)
  – IVEWARE (Raghunathan et al.)
  – Potentially, task-specific versions can be written in other programs, e.g. WinBUGS
Ref : Multiple imputation of missing blood pressure covariates in survival analysis. Van Buuren, Boshuizen, Knook. Statistics in Medicine 1999; 18(6): 681–94.

MICE/ICE for imputation assuming MAR
• $X_1, X_2, X_3, \ldots, X_n$ partially observed
• $Z_{obs}$ represents the set of fully observed variables
Chained equations are :
• $X_1 \sim f(X_2, X_3, X_4, \ldots, X_n, Z_{obs})$
• $X_2 \sim f(X_1, X_3, X_4, \ldots, X_n, Z_{obs})$
• $X_3 \sim f(X_1, X_2, X_4, \ldots, X_n, Z_{obs})$
etc.
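The cycle through the chained equations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm used by MICE or ICE: all variables are assumed binary, each conditional f(·) is estimated from smoothed cell proportions instead of a fitted regression model, and parameter uncertainty is not drawn (so the imputation is "improper"). The function name `chained_imputation`, its arguments and the toy data are hypothetical.

```python
import random


def chained_imputation(rows, n_cycles=10, seed=0):
    """Return one completed copy of `rows` (lists of 0/1, with None for missing).

    Illustrative only: each conditional f(.) is estimated from cell proportions
    rather than from a fitted regression model.
    """
    rng = random.Random(seed)
    n_vars = len(rows[0])
    is_missing = [[value is None for value in row] for row in rows]
    data = [list(row) for row in rows]  # work on a copy; the input is not modified

    # Starting values: fill each gap with a random observed value of that variable.
    for j in range(n_vars):
        observed = [row[j] for row in rows if row[j] is not None]
        for i, row in enumerate(data):
            if is_missing[i][j]:
                row[j] = rng.choice(observed)

    for _ in range(n_cycles):
        # One conditional model per variable; complete variables are never redrawn.
        for j in range(n_vars):
            # "Fit" X_j given all other variables, using rows where X_j was observed.
            counts = {}
            for i, row in enumerate(data):
                if not is_missing[i][j]:
                    key = tuple(row[:j] + row[j + 1:])
                    ones, total = counts.get(key, (0, 0))
                    counts[key] = (ones + row[j], total + 1)
            # Redraw every missing X_j from the estimated conditional distribution.
            for i, row in enumerate(data):
                if is_missing[i][j]:
                    key = tuple(row[:j] + row[j + 1:])
                    ones, total = counts.get(key, (0, 0))
                    p_one = (ones + 0.5) / (total + 1.0)  # smoothed proportion
                    row[j] = 1 if rng.random() < p_one else 0
    return data


# Hypothetical toy data: columns are X1, X2 (partially observed) and Z (complete).
toy = [[1, 1, 1], [0, 0, 0], [None, 1, 1], [0, None, 0],
       [1, None, 1], [None, 0, 0], [1, 1, 0], [0, 0, 1]]

# Different seeds give different completed datasets, i.e. multiple imputations.
imputed_datasets = [chained_imputation(toy, seed=s) for s in range(5)]
```

Each call runs a short chain and returns the completed dataset at termination; observed values are never altered, only the missing entries are redrawn.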
• Comparable to a Gibbs sampler
• Much shorter chains, which on termination produce an imputed dataset

Interactions in the Analysis Model
• A useful practical guide to using imputation for analysis in the presence of missing data is
  – Multiple imputation: current perspectives (Kenward and Carpenter, Statistical Methods in Medical Research 16: 199–218)
• Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls (Sterne, White, Carpenter et al., BMJ 2009;338:b2393) also contains useful guidance
• The imputation model should be at least as rich as the substantive model
  – The imputation model should preserve the structure of the data

MICE/ICE for imputation
• For most datasets where distributional assumptions are met, MICE/ICE has been shown in practice to work well for MAR data
• More care is needed when models contain structures such as interactions, multi-level structure, non-linearity etc. In particular, the structure of the substantive model should be reflected in the imputation model
• This talk focuses on interactions

Why omitting interactions in the imputation may cause problems
• Take as an example 3 binary variables X, Y, Z
• In the presence of missing data, we are interested in the substantive analysis given by the logistic regression of Y on X and Z with an interaction
• $\mathrm{logit}\, P(Y = 1 \mid x, z) = b_0 + b_x x + b_z z + b_{xz} xz$
• We initially have a full [X,Y,Z] dataset, which then becomes subject to missingness (MAR mechanisms)
• We would like the parameter estimates obtained after MAR missingness followed by imputation to be the same as the full-data estimates

Why omitting interactions in the imputation may cause problems
• The coefficients are the same as the coefficients of the corresponding log-linear model
  – logistic : $\mathrm{logit}\, P(Y = 1 \mid x, z) = b_0 + b_x x + b_z z + b_{xz} xz$
  – log-linear : $\log(m_{xyz}) = m_0 + m_x x + m_y y + m_z z + m_{xy} xy + m_{xz} xz + m_{yz} yz + m_{xyz} xyz$ (on the left-hand side, $m_{xyz}$ is the expected cell count)
  – Examining the bias of $b_x$ is equivalent to examining the bias of $m_{xy}$; likewise for $b_z$ and $m_{yz}$, and for $b_{xz}$ and $m_{xyz}$

Why omitting interactions in the imputation may cause problems
• Omitting interaction terms from the full conditional models will lead to the interactions in the log-linear model being underestimated, and hence to P(X, Y, Z) being incorrectly estimated
• This can also be seen by counting the parameter estimates needed
  – If just X is subject to missingness, we need to be able to estimate P(X | Y, Z) P(Y, Z)
    • 4 parameter estimates are needed for P(X | Y, Z)
    • This cannot be done with a chained equation without interaction, $x = a + b_y y + b_z z$, as there are only 3 free parameters
  – If X and Y are subject to missingness, we need to be able to estimate P(X, Y | Z) P(Z)
    • 8 parameter estimates are in general needed for P(X, Y | Z)
    • This cannot be done with chained equations without interaction, $x = a_x + b_{xy} y + b_{xz} z$ and $y = a_y + b_{yx} x + b_{yz} z$, as there are only 6 free parameters

Passive Imputation
• Interactions are needed in the imputation models. Both the Stata program ICE and the R MICE package support passive imputation
• The interaction term is recalculated from the main effects after every MICE cycle and can then be used in the subsequent chained equations in the cycle to impute the other variable(s)
• Other possible approaches :
  – Von Hippel, “How to impute interactions, squares and other transformed variables”, Sociological Methodology 39:265–291, 2009, is a less established alternative to passive imputation
  – Where a categorical variable is fully observed, an alternative method is to split the data by the values of that variable and impute each subset separately

Simulation Structure
1. We created a [X,Z] dataset
2. We created Y stochastically given X and Z
3. We stochastically created missingness (MAR) in 1, 2 or 3 variables
4. Using a number of imputation models we did the imputation and performed the substantive analysis
• Steps 2-4 were repeated 100 times and the parameter estimates and standard errors were recorded
• We tabulated the median of the parameter estimates, the median of the confidence intervals, and the coverage (the proportion of confidence intervals that contained the original data-generating parameter)

Simulations
• We examined the effect of interactions on the analysis of imputed data in a series of simulation scenarios involving 3 variables
  – Regression with outcome Y and covariates X and Z
• The scenarios ranged from all 3 variables binary, through 2 binary and 1 normal, and 1 binary and 2 normal, to all 3 normal
  – In each case with varying combinations of complete/incomplete outcomes and covariates
• We present a subset of the simulation scenarios

All variables binary; X and Z incomplete
• Dataset : 20,000 observations, Y generated stochastically
  logit P(Y = 1) = 0.5 × X + 0.5 × Z + 0.6 × X × Z
• Data divided into 2 sections with a Bernoulli distribution (p = 0.5) [splitting allows missingness to be MAR]
• Two stratified MAR patterns :
  – logit P(Z is missing) = -2 + X + Y (in one section of data)
  – logit P(X is missing) = -2 + 1.3 × Z + 0.8 × Y (in the other section of data)
• Imputation then substantive analysis performed
• In a second simulation scenario there were 3 stratified MAR patterns :
  – P(Z missing | X, Y) (in section 1 of data)
  – P(X missing | Y, Z) (in section 2 of data)
  – P(X and Z jointly missing | Y) (in section 3 of data)

All variables binary; X and Z incomplete
Imputation models, no interaction : (Z ~ X + Y; X ~ Y + Z)

Var  % missing values   Full data         Complete case               Imputed, no interaction
     (95% range)        Median OR         Median OR         CI        Median OR         CI
                        (median CI)       (median CI)       coverage  (median CI)       coverage
2 stratified MAR patterns (interaction)
X    14.0 (13.6, 14.4)  0.49 (0.37,0.62)  0.32 (0.18,0.46)  0.27      0.58 (0.44,0.71)  0.81
Z    23.0 (22.7, 23.4)  0.49 (0.36,0.63)  0.54 (0.39,0.70)  0.90      0.59 (0.44,0.73)  0.82
XZ                      0.61 (0.45,0.76)  0.58 (0.40,0.76)  0.91      0.45 (0.28,0.62)  0.62
3 stratified MAR patterns
X    21.8 (21.4, 22.2)  0.49 (0.37,0.62)  0.45 (0.31,0.59)  0.82      0.59 (0.45,0.72)  0.74
Z    23.1 (22.6, 23.5)  0.49 (0.36,0.63)  0.43 (0.28,0.59)  0.86      0.61 (0.46,0.76)  0.69
XZ                      0.61 (0.45,0.76)  0.59 (0.41,0.77)  0.90      0.45 (0.28,0.62)  0.58

Var  Imputed, XY interaction     Imputed, YZ interaction     Imputed, XY, YZ interaction
     (Z ~ X + Y + XY; X ~ Y + Z) (Z ~ X + Y; X ~ Y + Z + YZ) (Z ~ X + Y + XY; X ~ Y + Z + YZ)
     Median OR         CI        Median OR         CI        Median OR         CI
     (median CI)       coverage  (median CI)       coverage  (median CI)       coverage
2 stratified MAR patterns (interaction)
X    0.54 (0.41,0.67)  0.90      0.55 (0.42,0.68)  0.85      0.50 (0.36,0.64)  0.93
Z    0.56 (0.41,0.70)  0.87      0.55 (0.40,0.69)  0.89      0.50 (0.35,0.65)  0.95
XZ   0.51 (0.33,0.69)  0.89      0.51 (0.34,0.69)  0.88      0.60 (0.42,0.77)  0.93
3 stratified MAR patterns
X    0.53 (0.39,0.67)  0.94      0.56 (0.43,0.70)  0.89      0.50 (0.36,0.64)  0.93
Z    0.55 (0.39,0.70)  0.86      0.56 (0.41,0.72)  0.85      0.49 (0.33,0.65)  0.94
XZ   0.53 (0.35,0.71)  0.87      0.50 (0.32,0.67)  0.81      0.61 (0.42,0.79)  0.93

All variables binary; X, Z and Y incomplete
• Data generated stochastically
  logit P(Y = 1) = 0.5 × X + 0.5 × Z + 0.6 × X × Z
• Data divided randomly into 3 sections with equal probability
• 3 stochastic stratified MAR patterns :
  – logit P(Z is missing) = -2 + X + Y (in section 1 of data)
  – logit P(X is missing) = -2 + 1.3 × Z + 0.8 × Y (in section 2 of data)
  – logit P(Y is missing) = -1.5 + 1.9 × Z + 0.6 × X (in section 3 of data)
• In a second simulation scenario there were 6 stratified MAR patterns :
  – P(Z missing | X, Y) (in section 1 of data)
  – P(X missing | Y, Z) (in section 2 of data)
  – P(Y missing | X, Z) (in section 3 of data)
  – P(X and Y jointly missing | Z) (in section 4 of data)
  – P(X and Z jointly missing | Y) (in section 5 of data)
  – P(Y and Z jointly missing | X) (in section 6 of data)

All variables binary; X, Z and Y incomplete

Var  % missing values   Full data         Complete case               Imputed, no interaction
     (95% range)        Median OR         Median OR         CI        Median OR         CI
                        (median CI)       (median CI)       coverage  (median CI)       coverage
3 stratified MAR patterns for Z, X and Y
X    11.9 (11.6, 12.2)  0.49 (0.37,0.62)  0.44 (0.31,0.58)  0.82      0.56 (0.43,0.70)  0.83
Z    13.0 (12.7, 13.3)  0.49 (0.36,0.63)  0.40 (0.25,0.56)  0.76      0.61 (0.46,0.76)  0.67
XZ                      0.61 (0.46,0.77)  0.56 (0.37,0.74)  0.91      0.43 (0.25,0.60)  0.50
Y    17.2 (16.9, 17.6)
6 stratified MAR patterns for Z, X and Y
X    15.2 (14.8, 15.6)  0.49 (0.37,0.62)  0.46 (0.32,0.59)  0.84      0.57 (0.44,0.71)  0.82
Z    20.9 (20.5, 21.2)  0.49 (0.36,0.63)  0.45 (0.30,0.60)  0.85      0.60 (0.45,0.75)  0.73
XZ                      0.61 (0.45,0.76)  0.58 (0.40,0.76)  0.92      0.44 (0.26,0.61)  0.57
Y    21.4 (21.0, 21.7)

Var  Imputed, XY, YZ             Imputed, YZ, XZ             Imputed, XY, XZ             Imputed, XY, YZ, XZ
     interactions                interactions                interactions                interactions
     Median OR         CI        Median OR         CI        Median OR         CI        Median OR         CI
     (median CI)       coverage  (median CI)       coverage  (median CI)       coverage  (median CI)       coverage
3 stratified MAR patterns for Z, X and Y
X    0.53 (0.39,0.66)  0.96      0.53 (0.40,0.66)  0.96      0.52 (0.39,0.66)  0.96      0.49 (0.36,0.63)  0.94
Z    0.55 (0.40,0.70)  0.87      0.53 (0.38,0.68)  0.91      0.54 (0.39,0.70)            0.49 (0.33,0.65)  0.95
XZ   0.52 (0.34,0.69)  0.87      0.55 (0.37,0.73)            0.54 (0.35,0.71)            0.61 (0.43,0.79)  0.93
6 stratified MAR patterns for Z, X and Y
X    0.54 (0.40,0.67)  0.93      0.53 (0.40,0.67)  0.94      0.51 (0.38,0.65)  0.93      0.50 (0.36,0.63)
Z    0.55 (0.40,0.71)            0.53 (0.38,0.69)            0.52 (0.37,0.67)            0.50 (0.34,0.65)
XZ   0.51 (0.33,0.68)            0.55 (0.37,0.72)            0.56 (0.39,0.74)            0.60 (0.42,0.78)
CI coverage values that cannot be matched to a blank cell from the source layout (in source order) :
  – 3 patterns : 0.89, 0.94, 0.90
  – 6 patterns : 0.89, 0.90, 0.94, 0.97, 0.85, 0.92, 0.96, 0.90, 0.92

Y continuous, X and Z binary; X, Z and Y incomplete
• Data generated stochastically, this time with Y continuous
  Y = 0.45 × X + 0.55 × Z + 0.6 × X × Z + e, with e ~ N(0, 1)
• Data divided randomly into 3 sections with equal probability
• 3 stochastic stratified MAR patterns :
  – logit P(Z is missing) = -2.5 + 1.5 × X + Y (in section 1 of data)
  – logit P(X is missing) = -3 + 2.5 × Z + 0.8 × Y (in section 2 of data)
  – logit P(Y is missing) = -4 + 2.5 × Z + 2.0 × X (in section 3 of data)

Y continuous, X and Z binary; X, Z and Y incomplete

Var  % missing values   Full data         Complete case               Imputed, no interaction
     (95% range)        Median coef       Median coef       CI        Median coef       CI
                        (median CI)       (median CI)       coverage  (median CI)       coverage
X    13.0 (12.7, 13.3)  0.46 (0.40,0.52)  0.40 (0.34,0.46)  0.58      0.50 (0.43,0.56)  0.74
Z    14.6 (14.2, 14.9)  0.55 (0.49,0.62)  0.47 (0.39,0.54)  0.34      0.62 (0.55,0.69)  0.49
XZ                      0.59 (0.52,0.67)  0.47 (0.39,0.55)  0.12      0.48 (0.40,0.56)  0.20
Y    11.2 (10.9, 11.5)

Var  Imputed, XY, YZ             Imputed, YZ, XZ             Imputed, XY, XZ             Imputed, XY, YZ, XZ
     interactions                interactions                interactions                interactions
     Median coef       CI        Median coef       CI        Median coef       CI        Median coef       CI
     (median CI)       coverage  (median CI)       coverage  (median CI)       coverage  (median CI)       coverage
X    0.46 (0.40,0.53)  0.87      0.48 (0.42,0.55)  0.83      0.47 (0.40,0.53)  0.87      0.46 (0.39,0.52)  0.90
Z    0.57 (0.50,0.64)  0.94      0.58 (0.51,0.65)  0.89      0.59 (0.52,0.67)            0.56 (0.48,0.63)
XZ   0.56 (0.48,0.65)  0.83      0.55 (0.47,0.63)  0.80      0.55 (0.46,0.63)            0.59 (0.51,0.67)
CI coverage values that cannot be matched to a blank cell from the source layout (in source order) : 0.81, 0.95, 0.94, 0.77

Further simulations
• In further simulations similar results were obtained where the distributional assumptions of the imputation models were adhered to
• In each case omitting an interaction in a chained equation produced biased results.
  All 2-way interactions had to be included
• Starting with a trivariate normal distribution and introducing a slight interaction (which results in slight non-normality) also gave imputed estimates closest to the full-data estimates when all the interactions were included in the imputation model

Conclusions
• In general the imputation models should reflect the structure of the substantive analysis, and should be at least as rich as the analysis model
• To reflect the structure of the substantive model, the imputation model should not exclude its interactions, and should also include any corresponding interactions involving the outcome variable

Acknowledgements
• This work was done in collaboration with Jonathan Sterne, Kate Tilling and James Carpenter
• Helpful comments and suggestions from Paul Clarke are gratefully acknowledged
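As a supplementary sketch, the combination step used throughout the talk (Rubin's Rules, applied to the estimates and standard errors from each imputed dataset) can be written as below. The function name `pool_rubin` and the example numbers are hypothetical, and the sketch stops at the pooled estimate and standard error; the degrees-of-freedom calculation needed for exact confidence intervals is left out.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine per-dataset estimates and standard errors with Rubin's Rules."""
    m = len(estimates)                            # number of imputed datasets (m >= 2)
    pooled_estimate = mean(estimates)
    within = mean(se ** 2 for se in std_errors)   # average within-imputation variance
    between = variance(estimates)                 # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between        # total variance of the pooled estimate
    return pooled_estimate, math.sqrt(total)


# Hypothetical coefficients and standard errors from m = 3 imputed datasets.
pooled, pooled_se = pool_rubin([0.48, 0.52, 0.50], [0.07, 0.07, 0.07])
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the extra uncertainty due to the missing data.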