7. Multiple Imputation: Part 1

1 Introduction

• Assume a parametric model: $y \sim f(y \mid x; \theta)$.

• We are interested in making inference about $\theta$.

• In the Bayesian approach, we want to make inference about $\theta$ from
$$f(\theta \mid x, y) = \frac{\pi(\theta) f(y \mid x, \theta)}{\int \pi(\theta) f(y \mid x, \theta)\, d\theta},$$
where $\pi(\theta)$ is a prior distribution, assumed to be known for simplicity here.

• The point estimator is
$$\hat\theta_n = E\{\theta \mid x, y\} \quad (1)$$
and its variance estimator is
$$\hat V_n = V\{\theta \mid x, y\}. \quad (2)$$
We may express $\hat\theta_n = \hat\theta_n(x, y)$ and $\hat V_n = \hat V_n(x, y)$.

• Now consider the case when $x$ is always observed and $y$ is subject to missingness. Let $y = (y_{obs}, y_{mis})$ be the (observed, missing) part of the sample.

• We have two approaches to making inference about $\theta$ using the observed data $(x, y_{obs})$:

  1. Direct Bayesian approach: Consider
  $$f(\theta \mid x, y_{obs}) = \frac{\pi(\theta) f(y_{obs} \mid x, \theta)}{\int \pi(\theta) f(y_{obs} \mid x, \theta)\, d\theta}$$
  and use
  $$\hat\theta_r = E\{\theta \mid x, y_{obs}\} \quad (3)$$
  with variance estimator
  $$\hat V_r = V\{\theta \mid x, y_{obs}\}. \quad (4)$$

  2. Multiple imputation approach:
  (a) For each $k = 1, \ldots, M$, generate $y_{mis}^{*(k)}$ from $f(y_{mis} \mid x, y_{obs})$.
  (b) Apply the $k$-th imputed values to $\hat\theta_n$ in (1) to obtain $\hat\theta_n^{*(k)} = \hat\theta_n(x, y_{obs}, y_{mis}^{*(k)})$. Also, apply the $k$-th imputed values to $\hat V_n$ in (2) to obtain $\hat V_n^{*(k)} = \hat V_n(x, y_{obs}, y_{mis}^{*(k)})$.
  (c) Combine the $M$ point estimators to get
  $$\hat\theta_{MI} = \frac{1}{M} \sum_{k=1}^M \hat\theta_n^{*(k)}$$
  as a point estimator of $\theta$.
  (d) The variance estimator of $\hat\theta_{MI}$ is
  $$\hat V_{MI} = W_M + \left(1 + \frac{1}{M}\right) B_M,$$
  where
  $$W_M = \frac{1}{M} \sum_{k=1}^M \hat V_n^{*(k)}, \qquad B_M = \frac{1}{M-1} \sum_{k=1}^M \left(\hat\theta_n^{*(k)} - \hat\theta_{MI}\right)^2.$$

• Comparison:

                        Bayesian                      Frequentist
  Model                 Posterior distribution        Prediction model
                        f(latent, θ | data)           f(latent | data, θ)
  Computation           Data augmentation             EM algorithm
    Prediction          I-step                        E-step
    Parameter update    P-step                        M-step
  Parameter est'n       Posterior mode                ML estimation
  Imputation            Multiple imputation           Fractional imputation
  Variance estimation   Rubin's formula               Linearization or bootstrap

2 Main Result

1. Bayesian properties (Rubin, 1987): For sufficiently large $M$, we have
$$\hat\theta_{MI} = \frac{1}{M} \sum_{k=1}^M \hat\theta_n^{*(k)} = \frac{1}{M} \sum_{k=1}^M E\left(\theta \mid x, y_{obs}, y_{mis}^{*(k)}\right) \doteq E\{E(\theta \mid x, y_{obs}, Y_{mis}) \mid x, y_{obs}\} = E(\theta \mid x, y_{obs}),$$
which is equal to $\hat\theta_r$ in (3). Also, for sufficiently large $M$,
$$\hat V_{MI} = W_M + B_M = \frac{1}{M} \sum_{k=1}^M \hat V_n^{*(k)} + \frac{1}{M-1} \sum_{k=1}^M \left(\hat\theta_n^{*(k)} - \hat\theta_{MI}\right)^2 \doteq E\{V(\theta \mid x, y_{obs}, Y_{mis}) \mid x, y_{obs}\} + V\{E(\theta \mid x, y_{obs}, Y_{mis}) \mid x, y_{obs}\},$$
which is equal to $\hat V_r$ in (4).

2. Frequentist properties (Wang and Robins, 1998): Assume that, under complete data, $\hat\theta_n$ is the MLE of $\theta$ and $\hat V_n = \{I(\hat\theta_n)\}^{-1}$ is asymptotically unbiased for $V(\hat\theta_n)$. In the presence of missing data, $\hat\theta_{MI}$ is asymptotically equivalent to the MLE of $\theta$ and $\hat V_{MI}$ is approximately unbiased for $V(\hat\theta_{MI})$. That is,
$$\hat\theta_{MI} \cong \hat\theta_{MLE} \quad (5)$$
and
$$E\{\hat V_{MI}\} \cong V(\hat\theta_{MI}) \quad (6)$$
for sufficiently large $M$ and $n$. (See Appendix A for a sketched proof.)

3 Computation

• Gibbs sampling
  – Geman and Geman (1984): the "Gibbs sampler" for Bayesian image reconstruction
  – Tanner and Wong (1987): data augmentation for Bayesian inference in generic missing-data problems
  – Gelfand and Smith (1990): simulation of marginal distributions by repeated draws from conditionals

• Idea of Gibbs sampling: sample from the conditional distributions. Given $Z^{(t)} = \left(Z_1^{(t)}, Z_2^{(t)}, \ldots, Z_J^{(t)}\right)$, draw $Z^{(t+1)}$ by sampling from the full conditionals of $f$:
$$Z_1^{(t+1)} \sim P\left(Z_1 \mid Z_2^{(t)}, Z_3^{(t)}, \ldots, Z_J^{(t)}\right)$$
$$Z_2^{(t+1)} \sim P\left(Z_2 \mid Z_1^{(t+1)}, Z_3^{(t)}, \ldots, Z_J^{(t)}\right)$$
$$\vdots$$
$$Z_J^{(t+1)} \sim P\left(Z_J \mid Z_1^{(t+1)}, Z_2^{(t+1)}, \ldots, Z_{J-1}^{(t+1)}\right).$$
Under mild regularity conditions, $P(Z^{(t)}) \to f$ as $t \to \infty$. (A minimal code sketch of this scheme is given below.)
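To make the scheme concrete, here is a minimal sketch (not from the notes; the function name, the bivariate-normal target, and all settings are illustrative) of a two-component Gibbs sampler whose full conditionals are the univariate normals $Z_1 \mid Z_2 \sim N(\rho Z_2, 1-\rho^2)$ and $Z_2 \mid Z_1 \sim N(\rho Z_1, 1-\rho^2)$:

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, seed=1):
    """Gibbs sampler for (Z1, Z2) ~ N(0, [[1, rho], [rho, 1]]).

    Full conditionals (both univariate normal):
      Z1 | Z2 = z2 ~ N(rho * z2, 1 - rho^2)
      Z2 | Z1 = z1 ~ N(rho * z1, 1 - rho^2)
    """
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0                      # arbitrary starting values
    sd = np.sqrt(1.0 - rho ** 2)           # conditional standard deviation
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        z1 = rng.normal(rho * z2, sd)      # update Z1 given the current Z2
        z2 = rng.normal(rho * z1, sd)      # update Z2 given the new Z1
        draws[t] = (z1, z2)
    return draws

draws = gibbs_bivariate_normal()
print(np.corrcoef(draws[1000:].T))         # after burn-in, correlation is close to rho
```

After a burn-in period the dependent draws behave like draws from the joint distribution; the data augmentation algorithm below is exactly this two-block scheme with $Z_1 = z$ (missing data) and $Z_2 = \theta$ (parameters).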
• Data augmentation: application of Gibbs sampling to the missing data problem.
  $y$ = observed data, $z$ = missing data, $\theta$ = model parameters.
  Predictive distribution:
  $$P(z \mid y) = \int P(z \mid y, \theta)\, dP(\theta \mid y)$$
  Posterior distribution:
  $$P(\theta \mid y) = \int P(\theta \mid y, z)\, dP(z \mid y)$$

• Algorithm: iterative method of data augmentation
  I-step: Draw $z^{(t+1)} \sim P\left(z \mid y, \theta^{(t)}\right)$.
  P-step: Draw $\theta^{(t+1)} \sim P\left(\theta \mid y, z^{(t+1)}\right)$.

• Two uses of data augmentation
  – Parameter simulation: collect and summarize a sequence of dependent draws of $\theta$, $\theta^{(t+1)}, \theta^{(t+2)}, \ldots, \theta^{(t+N)}$, where $t$ is large enough to ensure stationarity.
  – Multiple imputation: collect independent draws of $z$, $z^{(t)}, z^{(2t)}, \ldots, z^{(Mt)}$.

• Parameter simulation (Bayesian approach)
  Let $\theta_1$ be a component or function of $\theta$ of interest. Collect iterates of $\theta_1$ from data augmentation,
  $$\theta_1^{(t+1)}, \theta_1^{(t+2)}, \ldots, \theta_1^{(t+N)},$$
  where $t$ is large enough to ensure stationarity and $N$ is the Monte Carlo sample size.
  – $\bar\theta_1 = N^{-1} \sum_{k=1}^N \theta_1^{(t+k)}$ estimates the posterior mean $E(\theta_1 \mid y)$.
  – $N^{-1} \sum_{k=1}^N \left(\theta_1^{(t+k)} - \bar\theta_1\right)^2$ estimates the posterior variance $V(\theta_1 \mid y)$.
  – The 2.5th and 97.5th percentiles of $\theta_1^{(t+1)}, \theta_1^{(t+2)}, \ldots, \theta_1^{(t+N)}$ estimate the endpoints of a 95% equal-tailed Bayesian interval for $\theta_1$.

4 Examples

4.1 Example 1 (Univariate normal distribution)

• Let $y_1, \ldots, y_n$ be IID observations from $N(\mu, \sigma^2)$, where only the first $r$ elements are observed and the remaining $n - r$ elements are missing. Assume that the response mechanism is ignorable.

• Bayesian imputation: the $j$-th posterior values of $(\mu, \sigma^2)$ are generated from
  $$\sigma^{*(j)2} \mid y_r \sim r \hat\sigma_r^2 / \chi^2_{r-1} \quad (7)$$
  and
  $$\mu^{*(j)} \mid \left(y_r, \sigma^{*(j)2}\right) \sim N\left(\bar y_r, r^{-1} \sigma^{*(j)2}\right), \quad (8)$$
  where $y_r = (y_1, \ldots, y_r)$, $\bar y_r = r^{-1} \sum_{i=1}^r y_i$, and $\hat\sigma_r^2 = r^{-1} \sum_{i=1}^r (y_i - \bar y_r)^2$. Given the posterior sample $(\mu^{*(j)}, \sigma^{*(j)2})$, the imputed values are generated from
  $$y_i^{*(j)} \mid \left(y_r, \mu^{*(j)}, \sigma^{*(j)2}\right) \sim N\left(\mu^{*(j)}, \sigma^{*(j)2}\right) \quad (9)$$
  independently for $i = r+1, \ldots, n$.

• Let $\theta = E(Y)$ be the parameter of interest. The MI estimator of $\theta$ can be expressed as
  $$\hat\theta_{MI} = \frac{1}{M} \sum_{j=1}^M \hat\theta_I^{(j)}, \qquad \hat\theta_I^{(j)} = \frac{1}{n} \left\{ \sum_{i=1}^r y_i + \sum_{i=r+1}^n y_i^{*(j)} \right\}.$$
  Then,
  $$\hat\theta_{MI} = \bar y_r + \frac{n-r}{nM} \sum_{j=1}^M \left(\mu^{*(j)} - \bar y_r\right) + \frac{1}{nM} \sum_{i=r+1}^n \sum_{j=1}^M \left(y_i^{*(j)} - \mu^{*(j)}\right). \quad (10)$$
  Asymptotically, the first term has mean $\mu$ and variance $r^{-1}\sigma^2$, the second term has mean zero and variance $(1 - r/n)^2 \sigma^2 / (Mr)$, the third term has mean zero and variance $\sigma^2 (n-r)/(n^2 M)$, and the three terms are mutually independent. Thus, the variance of $\hat\theta_{MI}$ is
  $$V\left(\hat\theta_{MI}\right) = \frac{1}{r}\sigma^2 + \frac{1}{M}\left(\frac{n-r}{n}\right)^2 \left\{ \frac{1}{r}\sigma^2 + \frac{1}{n-r}\sigma^2 \right\}. \quad (11)$$

• For variance estimation, let $\delta_i = 1$ if $y_i$ is observed and $\delta_i = 0$ otherwise, and note that
  $$V\left(y_i^{*(j)}\right) = V(\bar y_r) + V\left(\mu^{*(j)} - \bar y_r\right) + V\left(y_i^{*(j)} - \mu^{*(j)}\right) = \frac{1}{r}\sigma^2 + \frac{1}{r}\,\frac{r+1}{r-1}\sigma^2 + \frac{r+1}{r-1}\sigma^2 \cong \sigma^2.$$
  Writing
  $$\hat V_I^{(j)}(\hat\theta) = n^{-1}(n-1)^{-1} \sum_{i=1}^n \left( \tilde y_i^{*(j)} - \frac{1}{n} \sum_{k=1}^n \tilde y_k^{*(j)} \right)^2 = n^{-1}(n-1)^{-1} \left\{ \sum_{i=1}^n \left( \tilde y_i^{*(j)} - \mu \right)^2 - n \left( \frac{1}{n} \sum_{k=1}^n \tilde y_k^{*(j)} - \mu \right)^2 \right\},$$
  where $\tilde y_i^{*(j)} = \delta_i y_i + (1 - \delta_i) y_i^{*(j)}$, we have
  $$\begin{aligned}
  E\left\{\hat V_I^{(j)}(\hat\theta)\right\} &= n^{-1}(n-1)^{-1} \left[ \sum_{i=1}^n E\left\{ \left( \tilde y_i^{*(j)} - \mu \right)^2 \right\} - n V\left( \frac{1}{n} \sum_{k=1}^n \tilde y_k^{*(j)} \right) \right] \\
  &\cong n^{-1}(n-1)^{-1} \left[ n\sigma^2 - n \left\{ \frac{1}{r}\sigma^2 + \left(\frac{n-r}{n}\right)^2 \left( \frac{1}{r}\sigma^2 + \frac{1}{n-r}\sigma^2 \right) \right\} \right] \\
  &\cong n^{-1}\sigma^2,
  \end{aligned}$$
  which shows that $E(W_M) \cong V(\hat\theta_n)$. Also,
  $$E(B_M) = V\left(\hat\theta_I^{(1)}\right) - \mathrm{Cov}\left(\hat\theta_I^{(1)}, \hat\theta_I^{(2)}\right) = V\left\{ \frac{n-r}{n}\left(\mu^{*(1)} - \bar y_r\right) + \frac{1}{n} \sum_{i=r+1}^n \left(y_i^{*(1)} - \mu^{*(1)}\right) \right\} \cong \left(\frac{n-r}{n}\right)^2 \left( \frac{1}{r} + \frac{1}{n-r} \right)\sigma^2 = \left( \frac{1}{r} - \frac{1}{n} \right)\sigma^2.$$
  Thus, Rubin's variance estimator satisfies
  $$E\left\{\hat V_{MI}(\hat\theta_{MI})\right\} \cong \frac{1}{r}\sigma^2 + \frac{1}{M}\left(\frac{n-r}{n}\right)^2 \left\{ \frac{1}{r}\sigma^2 + \frac{1}{n-r}\sigma^2 \right\} \cong V\left(\hat\theta_{MI}\right),$$
  which is consistent with the general result in (6). (A code sketch of this example is given below.)
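As a concrete illustration, the sketch below (not part of the notes; the function name and the simulated data are illustrative) draws $(\sigma^{*2}, \mu^*)$ from (7)–(8), imputes the missing values from (9), and applies Rubin's combining rules $W_M$ and $B_M$:

```python
import numpy as np

def mi_normal_mean(y_obs, n, M=50, seed=2):
    """Multiple imputation for theta = E(Y) under y_i ~ N(mu, sigma^2),
    with the first r = len(y_obs) values observed and n - r missing (ignorable).
    Returns (theta_MI, V_MI) from Rubin's combining rules."""
    rng = np.random.default_rng(seed)
    r = len(y_obs)
    ybar_r = y_obs.mean()
    s2_r = np.sum((y_obs - ybar_r) ** 2) / r          # sigma_hat_r^2 in (7)
    theta_k, V_k = np.empty(M), np.empty(M)
    for k in range(M):
        sig2 = r * s2_r / rng.chisquare(r - 1)        # draw sigma^{*2}, eq. (7)
        mu = rng.normal(ybar_r, np.sqrt(sig2 / r))    # draw mu^{*}, eq. (8)
        y_imp = rng.normal(mu, np.sqrt(sig2), n - r)  # impute missing y, eq. (9)
        y_full = np.concatenate([y_obs, y_imp])
        theta_k[k] = y_full.mean()                    # complete-data point estimator
        V_k[k] = y_full.var(ddof=1) / n               # complete-data variance estimator
    W = V_k.mean()                                    # within-imputation variance
    B = theta_k.var(ddof=1)                           # between-imputation variance
    return theta_k.mean(), W + (1 + 1 / M) * B        # Rubin's formula

rng = np.random.default_rng(0)
theta_MI, V_MI = mi_normal_mean(rng.normal(0.0, 1.0, 60), n=100)
print(theta_MI, V_MI)   # V_MI is roughly sigma^2 / r = 1/60, as in (11)
```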
4.2 Example 2 (Censored regression model, or Tobit model)

• Model:
  $$z_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),$$
  $$y_i = \begin{cases} z_i & \text{if } z_i \ge c_i \\ c_i & \text{if } z_i < c_i, \end{cases}$$
  and let $\delta_i = 1$ if $z_i$ is uncensored ($z_i \ge c_i$) and $\delta_i = 0$ otherwise.

• Data augmentation (a code sketch of one cycle follows this example):
  1. I-step: Given $\theta^{(t)} = \left(\beta^{(t)}, \sigma^{(t)}\right)$, generate the imputed value (for $\delta_i = 0$) from
     $$z_i^{(t+1)} = x_i'\beta^{(t)} + \varepsilon_i^{(t)},$$
     where $\varepsilon_i^{(t)} = \sigma^{(t)} s_i^{(t)}$ and $s_i^{(t)}$ is drawn from the truncated standard normal density
     $$f(s) = \frac{\phi(s)}{\Phi\left[\left(c_i - x_i'\beta^{(t)}\right)/\sigma^{(t)}\right]}, \qquad s < \left(c_i - x_i'\beta^{(t)}\right)/\sigma^{(t)}.$$
     If $\delta_i = 1$, then $z_i^{(t+1)} = z_i$.
  2. P-step: Given $z^{(t+1)} = \left(z_1^{(t+1)}, z_2^{(t+1)}, \ldots, z_n^{(t+1)}\right)$, generate $\theta^{(t+1)}$ from
     $$\theta^{(t+1)} \sim P\left(\theta \mid z^{(t+1)}\right).$$
     That is, generate
     $$\sigma^{2(t+1)} \mid z^{(t+1)} \sim (n-p)\, \hat\sigma^{2(t+1)} / \chi^2_{n-p}$$
     and
     $$\beta^{(t+1)} \mid \left(z^{(t+1)}, \sigma^{2(t+1)}\right) \sim N\left( \hat\beta_n^{(t+1)}, \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sigma^{2(t+1)} \right),$$
     where
     $$\hat\beta_n^{(t+1)} = \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i z_i^{(t+1)}, \qquad \hat\sigma^{2(t+1)} = (n-p)^{-1}\, z^{(t+1)\prime} (I - P_x)\, z^{(t+1)},$$
     $p$ is the dimension of $x_i$, and $P_x = X(X'X)^{-1}X'$ is the projection onto the column space of $X$.
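A minimal sketch of one I-step/P-step cycle above, assuming for simplicity a common censoring point $c_i \equiv c$; the function name is illustrative, and the truncated-normal draw uses the inverse-CDF method rather than any particular routine from the notes:

```python
import numpy as np
from scipy.stats import norm

def da_tobit_step(y, X, c, delta, beta, sigma, rng):
    """One I-step/P-step cycle for the Tobit model:
    z_i = x_i' beta + eps_i; y_i = z_i if z_i >= c and y_i = c otherwise
    (delta_i = 1 indicates an uncensored observation)."""
    n, p = X.shape
    # I-step: for censored cases, draw z_i from N(x_i'beta, sigma^2) truncated to (-inf, c)
    mu = X @ beta
    z = y.astype(float)
    cens = (delta == 0)
    alpha = (c - mu[cens]) / sigma                 # standardized truncation point
    u = rng.uniform(0.0, norm.cdf(alpha))          # inverse-CDF draw of the truncated normal
    z[cens] = mu[cens] + sigma * norm.ppf(u)
    # P-step: draw (sigma^2, beta) from their posterior given the completed z
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ z)                 # least-squares fit on the completed data
    rss = np.sum((z - X @ beta_hat) ** 2)          # = (n - p) * sigma_hat^2
    sigma2 = rss / rng.chisquare(n - p)            # sigma^2 ~ (n-p) sigma_hat^2 / chi^2_{n-p}
    beta_new = rng.multivariate_normal(beta_hat, XtX_inv * sigma2)
    return z, beta_new, np.sqrt(sigma2)
```

Iterating this update and retaining every $t$-th completed vector $z$ (for a suitably large spacing $t$) yields the multiply-imputed data sets, exactly as in the data augmentation algorithm of Section 3.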
4.3 Example 3 (Bayesian bootstrap)

• Nonparametric approach to Bayesian imputation.

• First proposed by Rubin (1981).

• Assume that an element of the population takes one of the values $d_1, \ldots, d_K$ with probabilities $p_1, \ldots, p_K$, respectively. That is, we assume
  $$P(Y = d_k) = p_k, \qquad \sum_{k=1}^K p_k = 1. \quad (12)$$

• Let $y_1, \ldots, y_n$ be an IID sample from (12) and let $n_k$ be the number of $y_i$ equal to $d_k$. The parameter is the vector of probabilities $p = (p_1, \ldots, p_K)$, with $\sum_{k=1}^K p_k = 1$. In this case, the population mean $\theta = E(Y)$ can be expressed as $\theta = \sum_{k=1}^K p_k d_k$, so we only need to estimate $p$.

• If the improper Dirichlet prior with density proportional to $\prod_{k=1}^K p_k^{-1}$ is placed on the vector $p$, then the posterior distribution of $p$ is proportional to
  $$\prod_{k=1}^K p_k^{n_k - 1},$$
  which is a Dirichlet distribution with parameter $(n_1, \ldots, n_K)$. This posterior distribution can be simulated using $n - 1$ independent uniform random numbers. Let $u_1, \ldots, u_{n-1}$ be IID $U(0,1)$, and let $g_i = u_{(i)} - u_{(i-1)}$, $i = 1, 2, \ldots, n$, where $u_{(k)}$ is the $k$-th order statistic of $u_1, \ldots, u_{n-1}$, with $u_{(0)} = 0$ and $u_{(n)} = 1$. Partition $g_1, \ldots, g_n$ into $K$ collections, with the $k$-th one having $n_k$ elements, and let $p_k$ be the sum of the $g_i$ in the $k$-th collection. Then the realized value of $(p_1, \ldots, p_K)$ follows a $(K-1)$-variate Dirichlet distribution with parameter $(n_1, \ldots, n_K)$. In particular, if $K = n$, then $(g_1, \ldots, g_n)$ is the vector of probabilities to attach to the data values $y_1, \ldots, y_n$ in that Bayesian bootstrap replication.

• To apply Rubin's Bayesian bootstrap to multiple imputation, assume that the first $r$ elements are observed and the remaining $n - r$ elements are missing. The imputed values can be generated with the following steps:

  [Step 1] From $y_r = (y_1, \ldots, y_r)$, generate $p_r^* = (p_1^*, \ldots, p_r^*)$ from the posterior distribution using the Bayesian bootstrap as follows.
    1. Generate $u_1, \ldots, u_{r-1}$ independently from $U(0,1)$ and sort them to get $0 = u_{(0)} < u_{(1)} < \cdots < u_{(r-1)} < u_{(r)} = 1$.
    2. Compute $p_i^* = u_{(i)} - u_{(i-1)}$, $i = 1, 2, \ldots, r-1$, and $p_r^* = 1 - \sum_{i=1}^{r-1} p_i^*$.

  [Step 2] Select the imputed value of $y_i$ by
    $$y_i^* = \begin{cases} y_1 & \text{with probability } p_1^* \\ \;\vdots & \\ y_r & \text{with probability } p_r^* \end{cases}$$
    independently for each $i = r+1, \ldots, n$.

• Rubin and Schenker (1986) proposed an approximation of this Bayesian bootstrap method, called the approximate Bayesian bootstrap (ABB) method, which provides an alternative approach to generating imputed values from the empirical distribution. The ABB method can be described as follows (a code sketch of both schemes follows this example):

  [Step 1] From $y_r = (y_1, \ldots, y_r)$, generate a donor set $y_r^* = (y_1^*, \ldots, y_r^*)$ by bootstrapping. That is, select
    $$y_i^* = \begin{cases} y_1 & \text{with probability } 1/r \\ \;\vdots & \\ y_r & \text{with probability } 1/r \end{cases}$$
    independently for each $i = 1, \ldots, r$.

  [Step 2] From the donor set $y_r^* = (y_1^*, \ldots, y_r^*)$, select an imputed value of $y_i$ by
    $$y_i^{**} = \begin{cases} y_1^* & \text{with probability } 1/r \\ \;\vdots & \\ y_r^* & \text{with probability } 1/r \end{cases}$$
    independently for each $i = r+1, \ldots, n$.
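The sketch below (not from the notes; function names are illustrative) generates one set of imputed values under each scheme. The Bayesian bootstrap draws the donor probabilities $p^*$ as the spacings of $r-1$ uniform order statistics, exactly as in [Step 1] above (equivalently, a flat Dirichlet over the $r$ respondents), while the ABB first resamples a donor set with replacement and then imputes from it with equal probabilities:

```python
import numpy as np

def bayesian_bootstrap_impute(y_r, n_mis, rng):
    """One Bayesian bootstrap imputation: draw p* as the spacings of r-1 sorted
    U(0,1) draws (a flat Dirichlet over the r observed values), then sample the
    n_mis missing values from the observed values with probabilities p*."""
    r = len(y_r)
    gaps = np.diff(np.concatenate(([0.0], np.sort(rng.uniform(size=r - 1)), [1.0])))
    return rng.choice(y_r, size=n_mis, replace=True, p=gaps)

def abb_impute(y_r, n_mis, rng):
    """One approximate Bayesian bootstrap (ABB) imputation: resample a donor set
    of size r with replacement from the observed values (Step 1), then draw the
    n_mis imputed values from that donor set with equal probability (Step 2)."""
    donors = rng.choice(y_r, size=len(y_r), replace=True)   # Step 1
    return rng.choice(donors, size=n_mis, replace=True)     # Step 2

rng = np.random.default_rng(3)
y_r = rng.normal(size=30)                    # r = 30 observed values
print(bayesian_bootstrap_impute(y_r, n_mis=10, rng=rng))
print(abb_impute(y_r, n_mis=10, rng=rng))
```

Repeating either function $M$ times (with fresh random draws) produces the $M$ imputed data sets to which Rubin's combining rules are applied.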
References

Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.

Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9, 130–134.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.

Rubin, D. B. and Schenker, N. (1986). Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366–374.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–540.

Wang, N. and Robins, J. M. (1998). Large-sample theory for parametric multiple imputation procedures. Biometrika, 85, 935–948.

Appendix A. Proof of (5) and (6)

Let $S_{com}(\theta) = S(\theta; x, y)$ be the score function of $\theta$ under complete response. The MLE under complete response, denoted by $\hat\theta_n$, is asymptotically equivalent to
$$\hat\theta_n \cong \theta + I_{com}^{-1} S_{com}(\theta),$$
where $I_{com} = E\{-\partial S_{com}(\theta)/\partial \theta'\}$. Thus,
$$\begin{aligned}
\hat\theta_{MI} &= \frac{1}{M} \sum_{k=1}^M \hat\theta_n^{*(k)} \doteq \theta + I_{com}^{-1} \frac{1}{M} \sum_{k=1}^M S\left(\theta; x, y_{obs}, y_{mis}^{*(k)}\right) \\
&= \theta + I_{com}^{-1} \frac{1}{M} \sum_{k=1}^M E\{S(\theta; x, y_{obs}, Y_{mis}) \mid x, y_{obs}; \theta^{*(k)}\} \\
&\quad + I_{com}^{-1} \frac{1}{M} \sum_{k=1}^M \left[ S\left(\theta; x, y_{obs}, y_{mis}^{*(k)}\right) - E\{S(\theta; x, y_{obs}, Y_{mis}) \mid x, y_{obs}; \theta^{*(k)}\} \right],
\end{aligned}$$
where $y_{mis}^{*(k)} \sim f(y_{mis} \mid x, y_{obs}; \theta^{*(k)})$ and $\theta^{*(k)} \sim p(\theta \mid x, y_{obs})$. Under some regularity conditions, the posterior distribution converges to a normal distribution with mean $\hat\theta_{MLE}$ and variance $I_{obs}^{-1} = V(\hat\theta_{MLE})$. (This is often called the Bernstein–von Mises theorem.) Thus, we can apply Taylor linearization on $S(\theta; x, y_{obs}, y_{mis}^{*(k)})$ with respect to $\theta^{*(k)}$ around the true $\theta$ to get
$$E\{S(\theta; x, y_{obs}, Y_{mis}) \mid x, y_{obs}; \theta^{*(k)}\} \cong E\{S(\theta; x, y_{obs}, Y_{mis}) \mid x, y_{obs}; \theta\} + I_{mis}(\theta^{*(k)} - \theta) = S_{obs}(\theta) + I_{mis}(\hat\theta_{MLE} - \theta) + I_{mis}(\theta^{*(k)} - \hat\theta_{MLE}), \quad (13)$$
where $I_{mis}$ is the information matrix associated with $f(y_{mis} \mid x, y_{obs}; \theta)$. Since $\hat\theta_{MLE}$ is the solution to $S_{obs}(\theta) = 0$, we have
$$\hat\theta_{MLE} \cong \theta + I_{obs}^{-1} S_{obs}(\theta),$$
and (13) further simplifies to
$$E\{S(\theta; x, y_{obs}, Y_{mis}) \mid x, y_{obs}; \theta^{*(k)}\} = S_{obs}(\theta) + I_{mis} I_{obs}^{-1} S_{obs}(\theta) + I_{mis}(\theta^{*(k)} - \hat\theta_{MLE}) = I_{com} I_{obs}^{-1} S_{obs}(\theta) + I_{mis}(\theta^{*(k)} - \hat\theta_{MLE}). \quad (14)$$
Thus, combining all terms together, we have
$$\hat\theta_n^{*(k)} = \hat\theta_{MLE} + I_{com}^{-1} I_{mis} \left(\theta^{*(k)} - \hat\theta_{MLE}\right) + I_{com}^{-1} \left[ S^{*(k)} - E\left(S \mid x, y_{obs}; \theta^{*(k)}\right) \right], \quad (15)$$
where $S^{*(k)} = S(\theta; x, y_{obs}, y_{mis}^{*(k)})$. Therefore,
$$\begin{aligned}
\hat\theta_{MI} &= \theta + I_{obs}^{-1} S_{obs}(\theta) + I_{com}^{-1} I_{mis} \frac{1}{M} \sum_{k=1}^M \left(\theta^{*(k)} - \hat\theta_{MLE}\right) \\
&\quad + I_{com}^{-1} \frac{1}{M} \sum_{k=1}^M \left[ S\left(\theta; x, y_{obs}, y_{mis}^{*(k)}\right) - E\{S(\theta; x, y_{obs}, Y_{mis}) \mid x, y_{obs}; \theta^{*(k)}\} \right] \\
&= \hat\theta_{MLE} + I_{com}^{-1} I_{mis} \left\{ M^{-1} \sum_{k=1}^M \left(\theta^{*(k)} - \hat\theta_{MLE}\right) \right\} + I_{com}^{-1} M^{-1} \sum_{k=1}^M \left[ S^{*(k)} - E\left(S \mid x, y_{obs}; \theta^{*(k)}\right) \right].
\end{aligned}$$
Note that the second term reflects the variability due to generating $\theta^{*(k)}$ and the third term reflects the variability due to generating $y_{mis}^{*(k)}$ from $f(y_{mis} \mid x, y_{obs}; \theta^{*(k)})$. The three terms are independent, and the last two terms are negligible for large $M$, so that $\hat\theta_{MI} \cong \hat\theta_{MLE}$, which proves (5).

To prove (6), note first that $E(\hat V_n) = V(\hat\theta_n) = I_{com}^{-1}$, and so we have $E(W_M) \cong I_{com}^{-1}$. Now, inserting (15) into
$$B_M = \frac{1}{M-1} \sum_{k=1}^M \left( \hat\theta_n^{*(k)} - \hat\theta_{MI} \right)^{\otimes 2},$$
we have
$$E(B_M) = I_{com}^{-1} I_{mis} I_{obs}^{-1} I_{mis} I_{com}^{-1} + I_{com}^{-1} I_{mis} I_{com}^{-1} = I_{obs}^{-1} - I_{com}^{-1},$$
where the last equality follows from
$$(A + BCB')^{-1} = A^{-1} - A^{-1} B C B' A^{-1} + A^{-1} B C B' (A + BCB')^{-1} B C B' A^{-1}$$
with $A = I_{com}$, $B = I$, and $C = -I_{mis}$. Therefore,
$$E(\hat V_{MI}) \cong I_{obs}^{-1} = V(\hat\theta_{MLE}) \cong V(\hat\theta_{MI}).$$
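As a quick consistency check (not in the original notes), specialize the appendix formulas to Example 1 with $\theta = \mu$: the complete-data, observed-data, and missing information for $\mu$ are $I_{com} = n/\sigma^2$, $I_{obs} = r/\sigma^2$, and $I_{mis} = (n-r)/\sigma^2$, so that
$$E(W_M) \cong I_{com}^{-1} = \frac{\sigma^2}{n}, \qquad E(B_M) \cong I_{obs}^{-1} - I_{com}^{-1} = \frac{\sigma^2}{r} - \frac{\sigma^2}{n},$$
and hence, for large $M$,
$$E(\hat V_{MI}) \cong \frac{\sigma^2}{n} + \left( \frac{\sigma^2}{r} - \frac{\sigma^2}{n} \right) = \frac{\sigma^2}{r} \cong V(\hat\theta_{MI}),$$
which agrees with the direct calculations of $E(W_M)$ and $E(B_M)$ leading to (11) in Section 4.1.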