Online Supplementary Material
Manuscript ID: LIDA1518
Subsample ignorable likelihood for accelerated failure time models with missing predictors

1. Introduction

This report is a supplement to the paper "Subsample ignorable likelihood for accelerated failure time models with missing predictors" and consists of two parts. (1) In Section 2, we conduct Monte Carlo studies to examine the performance of the proposed SILAFT method when the covariates (x and w) are missing at a higher rate (40% and 60%, respectively). (2) In Section 3, we provide the WinBUGS code for implementing the SILAFT model on the motivating example in the manuscript.

2. Simulation studies when covariates are missing at higher rates

To examine the performance of the proposed SILAFT model when covariates are missing at higher rates, we simulate data from the same log-normal model as in the manuscript, but with the covariates x and w missing at higher rates (40% and 60%, respectively). Missing values of W and X were then generated from the following two logistic models, in which $R_{w_i} = 0$ and $R_{x_i} = 0$ indicate that $w_i$ and $x_i$ are missing:

$$\operatorname{logit} P\big(R_{w_i} = 0 \mid z_i, w_i, x_i, (t_i, \delta_i)\big) = \alpha_0^{(w)} + \alpha_z^{(w)} z_i + \alpha_w^{(w)} w_i + \alpha_x^{(w)} x_i + \alpha_t^{(w)} t_i$$

$$\operatorname{logit} P\big(R_{x_i} = 0 \mid R_{w_i} = 1, z_i, w_i, x_i, (t_i, \delta_i)\big) = \alpha_0^{(x)} + \alpha_z^{(x)} z_i + \alpha_w^{(x)} w_i + \alpha_x^{(x)} x_i + \alpha_t^{(x)} t_i$$

with $x_i$ fully observed when $w_i$ is missing.

For the missing data generation schemes above, complete-case (CC) analysis is valid if both $\alpha_t^{(w)}$ and $\alpha_t^{(x)}$ are zero; ignorable likelihood (IL) is valid if $\alpha_w^{(w)}$, $\alpha_x^{(w)}$ and $\alpha_x^{(x)}$ are zero; SILAFT is valid if $\alpha_t^{(w)}$ and $\alpha_x^{(x)}$ are zero. Four missing data mechanisms were created using different sets of values for the regression coefficients such that in mechanism (I) all three methods (CC, IL and SILAFT) are consistent, while in each of mechanisms (II), (III) and (IV) just one of the three methods is valid. The simulation setup is summarized in Table A1. These missing data mechanisms all generate approximately 50% and 35% missing values in W and X, respectively.
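The two-stage missingness scheme above can be illustrated with a short Python sketch. It is not the authors' simulation code. The covariate and event-time distributions below are assumptions (the actual data-generating model is in the manuscript), and the helper names `expit`, `draw_indicators` and `missing_rates` are hypothetical. The coefficients are those listed for mechanism (I) in Table A1, where only z drives missingness.

```python
import math
import random


def expit(u):
    """Inverse logit, computed stably for large |u|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def draw_indicators(z, w, x, t, a_w, a_x, rng):
    """Return (r_w, r_x) for one subject, with 0 = missing and 1 = observed.

    a_w and a_x hold the (intercept, z, w, x, t) coefficients of the
    W-missingness and X-missingness logistic models, in that order.
    """
    covs = (1.0, z, w, x, t)
    p_w_missing = expit(sum(a * c for a, c in zip(a_w, covs)))
    if rng.random() < p_w_missing:
        return 0, 1  # x is fully observed whenever w is missing
    p_x_missing = expit(sum(a * c for a, c in zip(a_x, covs)))
    return 1, (0 if rng.random() < p_x_missing else 1)


def missing_rates(n, a_w, a_x, seed=0):
    """Simulate n subjects and return the fractions with w and x missing."""
    rng = random.Random(seed)
    miss_w = 0
    miss_x = 0
    for _ in range(n):
        # ASSUMPTION: standard normal covariates and a log-normal event time;
        # these are placeholders, not the manuscript's data-generating model.
        z = rng.gauss(0.0, 1.0)
        w = rng.gauss(0.0, 1.0)
        x = rng.gauss(0.0, 1.0)
        t = math.exp(rng.gauss(0.0, 1.0))
        r_w, r_x = draw_indicators(z, w, x, t, a_w, a_x, rng)
        miss_w += (r_w == 0)
        miss_x += (r_x == 0)
    return miss_w / n, miss_x / n


# Mechanism (I) coefficients (intercept, z, w, x, t) as listed in Table A1
A_W_I = (0.0, 1.0, 0.0, 0.0, 0.0)
A_X_I = (1.4, 1.0, 0.0, 0.0, 0.0)
```

Under the standard normal assumption for z, `missing_rates(20000, A_W_I, A_X_I)` should return rates close to 0.50 for w and 0.35 for x, in line with the approximate rates quoted above. Other mechanisms can be tried by changing the two coefficient tuples, but their rates depend on the assumed distributions of w, x and t.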
Table A2 summarizes the root mean squared errors (RMSEs) of the estimates of all the regression coefficients, and Table A3 reports the RMSE, empirical bias and coverage probability of the estimates of the individual regression coefficients. Results in bold type reflect situations where the method is consistent based on the theory of Section 4 in the manuscript, and hence should do well. The results are based on 1000 repetitions in each simulation.

Table A1: Missing data mechanisms generated in the simulations (50% missingness in w and 35% missingness in x)

| Mechanism | $\alpha_0^{(w)}$ | $\alpha_z^{(w)}$ | $\alpha_w^{(w)}$ | $\alpha_x^{(w)}$ | $\alpha_t^{(w)}$ | $\alpha_0^{(x)}$ | $\alpha_z^{(x)}$ | $\alpha_w^{(x)}$ | $\alpha_x^{(x)}$ | $\alpha_t^{(x)}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| I: All valid | 0 | 1 | 0 | 0 | 0 | 1.4 | 1 | 0 | 0 | 0 |
| II: CC valid | -0.5 | 1 | 1 | 1 | 0 | 1.4 | 1 | 1 | 1 | 0 |
| III: IL valid | -1.4 | 1 | 0 | 0 | 0.25 | 0.7 | 1 | 1 | 0 | 0.25 |
| IV: SILAFT valid | -0.5 | 1 | 1 | 0 | 0 | -0.3 | 1 | 1 | 0 | 0.25 |

Missing values of W and X are generated based on the following logistic models:

$$\operatorname{logit} P\big(R_{w_i} = 0 \mid z_i, w_i, x_i, (t_i, \delta_i)\big) = \alpha_0^{(w)} + \alpha_z^{(w)} z_i + \alpha_w^{(w)} w_i + \alpha_x^{(w)} x_i + \alpha_t^{(w)} t_i$$

$$\operatorname{logit} P\big(R_{x_i} = 0 \mid R_{w_i} = 1, z_i, w_i, x_i, (t_i, \delta_i)\big) = \alpha_0^{(x)} + \alpha_z^{(x)} z_i + \alpha_w^{(x)} w_i + \alpha_x^{(x)} x_i + \alpha_t^{(x)} t_i$$

In particular, for the four missing data mechanisms:
I: Missingness of W = f(Z), Missingness of X = f(Z | W observed), all three methods (CC, IL and SILAFT) are valid;
II: Missingness of W = f(Z, W, X), Missingness of X = f(Z, W, X | W observed), only CC valid;
III: Missingness of W = f(Z, (t, δ)), Missingness of X = f(Z, W, (t, δ) | W observed), only IL valid;
IV: Missingness of W = f(Z, W), Missingness of X = f(Z, W, (t, δ) | W observed), only SILAFT valid.

Table A2.
Summary RMSEs×1000 of Estimated Regression Coefficients for Before Deletion (BD), Complete Cases (CC), Ignorable Likelihood (IL) and Subsample Ignorable Likelihood AFT (SILAFT) Methods, under Four Missing Data Mechanisms*

| Method | I | II | III | IV |
|---|---|---|---|---|
| BD | 94 | 92 | 93 | 95 |
| CC | 261 | 268 | 707 | 591 |
| IL | 158 | 199 | 162 | 219 |
| SILAFT | 212 | 392 | 534 | 227 |

*Four missing data mechanisms:
I: Missingness of W = f(Z), Missingness of X = f(Z | W observed), all three methods (CC, IL and SILAFT) are valid;
II: Missingness of W = f(Z, W, X), Missingness of X = f(Z, W, X | W observed), only CC valid;
III: Missingness of W = f(Z, (t, δ)), Missingness of X = f(Z, W, (t, δ) | W observed), only IL valid;
IV: Missingness of W = f(Z, W), Missingness of X = f(Z, W, (t, δ) | W observed), only SILAFT valid.

RMSE estimates $1000 \times \sqrt{E\,\lVert \hat{\beta}_r - \beta_{\mathrm{TRUE}} \rVert^{2}}$, with $\hat{\beta}_r$ denoting the estimate of the coefficient vector from the $r$th repetition. Bold values are for methods consistent for the mechanism generating the data.

Table A3. RMSE, Empirical Bias, and 95% Confidence Coverage for Individual Regression Coefficients under Four Missing Data Mechanisms (1000 replications)

RMSE×1000

| Method | I: β0 | I: βz | I: βw | I: βx | II: β0 | II: βz | II: βw | II: βx | III: β0 | III: βz | III: βw | III: βx | IV: β0 | IV: βz | IV: βw | IV: βx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BD | 46 | 35 | 65 | 35 | 43 | 34 | 66 | 34 | 46 | 35 | 65 | 32 | 45 | 31 | 68 | 36 |
| CC | 145 | 102 | 172 | 85 | 143 | 100 | 183 | 89 | 553 | 269 | 279 | 210 | 438 | 219 | 285 | 171 |
| IL | 73 | 42 | 124 | 49 | 103 | 45 | 158 | 44 | 77 | 45 | 127 | 47 | 152 | 58 | 138 | 49 |
| SILAFT | 118 | 82 | 139 | 70 | 327 | 109 | 169 | 82 | 377 | 247 | 208 | 196 | 109 | 76 | 171 | 69 |

Bias×1000

| Method | I: β0 | I: βz | I: βw | I: βx | II: β0 | II: βz | II: βw | II: βx | III: β0 | III: βz | III: βw | III: βx | IV: β0 | IV: βz | IV: βw | IV: βx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BD | 4 | -4 | 1 | -5 | -2 | 7 | 3 | 0 | -4 | -4 | 0 | 4 | 4 | -2 | -4 | 0 |
| CC | 16 | 0 | -2 | -2 | -1 | 8 | 13 | 3 | -540 | -249 | -221 | -191 | -419 | -196 | -215 | -146 |
| IL | 10 | -4 | -3 | -4 | 81 | -18 | -98 | -11 | 8 | -2 | -8 | 2 | 135 | 44 | 5 | 8 |
| SILAFT | 7 | -2 | 6 | -10 | 308 | 74 | 87 | 28 | -366 | -236 | -168 | -183 | 21 | 14 | -3 | 4 |

95% Confidence coverage (%)

| Method | I: β0 | I: βz | I: βw | I: βx | II: β0 | II: βz | II: βw | II: βx | III: β0 | III: βz | III: βw | III: βx | IV: β0 | IV: βz | IV: βw | IV: βx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BD | 96.2 | 93.8 | 95.7 | 92.4 | 95.7 | 94.8 | 93.9 | 96.1 | 94.2 | 97.5 | 94.5 | 94.9 | 95.1 | 97.6 | 92.2 | 93.7 |
| CC | 95.2 | 95.2 | 92.9 | 95.2 | 94.4 | 95.2 | 93.9 | 97.0 | 2.2 | 29.8 | 75.6 | 35.3 | 10.7 | 50.0 | 79.6 | 55.8 |
| IL | 94.8 | 97.1 | 95.7 | 92.9 | 77.5 | 91.3 | 87.9 | 93.9 | 97.5 | 95.3 | 94.9 | 95.6 | 42.7 | 85.9 | 93.2 | 95.1 |
| SILAFT | 93.3 | 95.2 | 94.3 | 96.7 | 18.2 | 81.4 | 87.4 | 93.1 | 3.6 | 15.3 | 74.9 | 21.8 | 94.7 | 97.6 | 93.2 | 96.6 |

3. WinBUGS code of SILAFT model on the NLMS dataset

In this section, we provide the WinBUGS code for fitting the SILAFT model using the NLMS dataset. The follow-up time is modeled with a log-normal regression model. The missing binary covariates are imputed based on logistic regression models. We assume non-informative priors for all parameters involved.

    # followup: follow-up time
    # cens: censoring time
    # eduhs: high school education or above vs. less than HS
    # adjincome: adjusted income
    # raceb: Black race vs. White race
    # raceo: Other race vs. White race
    # sex: Female vs. Male
    # married: Married vs. not married
    # agebaseline: Age at baseline
    model {
      for (i in 1:N) {
        # log-normal AFT model; the second argument of dlnorm is a precision,
        # and I(cens[i],) restricts a censored follow-up time to lie above cens[i]
        followup[i] ~ dlnorm(mu[i], sigma) I(cens[i],)
        mu[i] <- beta[1] + beta[2] * eduhs[i] + beta[3] * adjincome[i]
                 + beta[4] * raceb[i] + beta[5] * raceo[i] + beta[6] * sex[i]
                 + beta[7] * married[i] + beta[8] * agebaseline[i]

        # married is modeled as a logistic model
        married[i] ~ dbern(pm[i])
        logit(pm[i]) <- bm[1] + bm[2] * sex[i] + bm[3] * agebaseline[i]

        # education is modeled as a logistic model
        eduhs[i] ~ dbern(pe[i])
        logit(pe[i]) <- be[1] + be[2] * sex[i] + be[3] * agebaseline[i]

        # race: black vs. white is modeled as a logistic model
        raceb[i] ~ dbern(prb[i])
        logit(prb[i]) <- brb[1] + brb[2] * sex[i] + brb[3] * agebaseline[i]

        # race: other vs. white is modeled as a logistic model
        raceo[i] ~ dbern(pro[i])
        logit(pro[i]) <- bro[1] + bro[2] * sex[i] + bro[3] * agebaseline[i]
      }

      # priors: one separate loop per coefficient vector, so that each
      # node is defined exactly once
      for (j in 1:8) { beta[j] ~ dnorm(0, .001) }
      for (k in 1:3) { bm[k] ~ dnorm(0, .001) }
      for (l in 1:3) { be[l] ~ dnorm(0, .001) }
      for (m in 1:3) { brb[m] ~ dnorm(0, .001) }
      for (n in 1:3) { bro[n] ~ dnorm(0, .001) }
      # sigma enters dlnorm as the precision of log follow-up time
      sigma ~ dgamma(0.001, 0.001)
    }
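To make explicit what the `dlnorm(mu[i], sigma) I(cens[i],)` line contributes to the likelihood, the following Python sketch evaluates the log-likelihood contribution of one subject under a right-censored log-normal AFT model. It is an illustration rather than part of the WinBUGS program; the function name `lognormal_aft_loglik` and its arguments are hypothetical, and `tau` plays the role of the precision that the model above calls `sigma`.

```python
import math


def lognormal_aft_loglik(time, censored, mu, tau):
    """Log-likelihood contribution of one subject.

    time:     event time if censored is False, otherwise the censoring time
    censored: True if the subject is right-censored at `time`
    mu:       linear predictor on the log-time scale
    tau:      precision (1 / variance) of log-time, as in BUGS dlnorm
    """
    sd = 1.0 / math.sqrt(tau)
    zscore = (math.log(time) - mu) / sd
    if censored:
        # log P(T > time) = log(1 - Phi(zscore)); erfc gives the upper tail
        return math.log(0.5 * math.erfc(zscore / math.sqrt(2.0)))
    # log-normal log density: normal log density of log(time), minus log(time)
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * zscore ** 2 - math.log(time))
```

A subject censored exactly at the median follow-up time, `time = exp(mu)`, contributes `log(0.5)`, since half of the log-normal mass lies above the median. Summing these contributions over subjects gives the outcome part of the likelihood; the logistic models for the missing binary covariates add their own Bernoulli terms.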