2. Fractional Imputation

1 Introduction

• Consider the setup of Example 2, the random effect model
$$y_{ij} = x_{ij}'\beta + a_i + e_{ij}, \quad i = 1, \cdots, n_1, \; j = 1, \cdots, n_2,$$
where $a_i \sim N(0, \sigma_a^2)$ and $e_{ij} \sim N(0, \sigma_e^2)$.

• Let $y_i = (y_{i1}, \cdots, y_{i n_2})'$ be observed, but $a_i$ is never observed.

• The joint density of $(y_i, a_i)$ is $f(y_i, a_i; \theta) = f_1(y_i \mid a_i; \beta, \sigma_e^2) f_2(a_i; \sigma_a^2)$, where
$$f_1(y_i \mid a_i; \beta, \sigma_e^2) = (2\pi\sigma_e^2)^{-n_2/2} \exp\left\{ -\frac{1}{2\sigma_e^2} \sum_j (y_{ij} - x_{ij}'\beta - a_i)^2 \right\}$$
$$f_2(a_i; \sigma_a^2) = (2\pi\sigma_a^2)^{-1/2} \exp\left( -\frac{1}{2\sigma_a^2} a_i^2 \right).$$

• The score function for $\theta_1 = (\beta, \sigma_e^2)$:
$$S_1(\theta_1) = \sum_{i=1}^{n_1} \frac{\partial}{\partial \theta_1} \log f_1(y_i \mid a_i; \theta_1) = \begin{pmatrix} \sigma_e^{-2} \sum_i \sum_j (y_{ij} - x_{ij}'\beta - a_i) x_{ij} \\ \sigma_e^{-4} \sum_i \sum_j \{ (y_{ij} - x_{ij}'\beta - a_i)^2 - \sigma_e^2 \} \end{pmatrix}.$$

• The score function for $\theta_2 = \sigma_a^2$:
$$S_2(\theta_2) = \sum_{i=1}^{n_1} \frac{\partial}{\partial \theta_2} \log f_2(a_i; \theta_2) = \sigma_a^{-4} \sum_i \left( a_i^2 - \sigma_a^2 \right).$$

• EM algorithm:
– E-step: Compute the conditional expectation of the score functions given the observed data: $E\{S(\theta) \mid y; \hat\theta^{(t)}\}$. To compute the conditional expectation, we need to derive $f(a_i \mid y_i)$ under the current parameter values:
$$f(a_i \mid y_i; \hat\theta^{(t)}) = \frac{f_1(y_i \mid a_i; \hat\theta_1^{(t)}) f_2(a_i; \hat\theta_2^{(t)})}{\int f_1(y_i \mid a_i; \hat\theta_1^{(t)}) f_2(a_i; \hat\theta_2^{(t)}) \, da_i}. \quad (1)$$
When both $f_1$ and $f_2$ are normal, the above conditional distribution is also normal, $a_i \mid y_i \sim N[E(a_i \mid y_i), V(a_i \mid y_i)]$, where
$$E(a_i \mid y_i) = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2/n_2} \left( \bar y_i - \bar x_i' \beta \right)$$
and
$$V(a_i \mid y_i) = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2/n_2} \cdot \frac{\sigma_e^2}{n_2}.$$
– M-step: Update the parameter by solving $E\{S(\theta) \mid y; \hat\theta^{(t)}\} = 0$ for $\theta$, where the conditional expectation is computed in the E-step.

• If normality does not hold in either $f_1$ or $f_2$, then (1) is not necessarily normal. In this case, the E-step may involve Monte Carlo approximation.

2 Monte Carlo EM algorithm

• $y$: observed data, $z$: latent variable.

• We are interested in computing $E\{S(\theta; y, Z) \mid y\}$.

• Wei and Tanner (1990) proposed the Monte Carlo EM algorithm: in the E-step, first draw $z^{*(1)}, \cdots, z^{*(m)} \sim f(z \mid y; \theta^{(t)})$ and approximate
$$E\{S(\theta; y, z) \mid y\} \cong \frac{1}{m} \sum_{j=1}^m S(\theta; y, z^{*(j)}).$$

Example 1

• Suppose that $y_i \sim f(y_i \mid x_i; \theta)$. Assume that $x_i$ is always observed but we observe $y_i$ only when $\delta_i = 1$, where $\delta_i \sim \text{Bernoulli}[\pi_i(\phi)]$ and
$$\pi_i(\phi) = \frac{\exp(\phi_0 + \phi_1 x_i + \phi_2 y_i)}{1 + \exp(\phi_0 + \phi_1 x_i + \phi_2 y_i)}.$$

• To implement the MCEM method, in the E-step we need to generate samples from
$$f(y_i \mid x_i, \delta_i = 0; \hat\theta, \hat\phi) = \frac{f(y_i \mid x_i; \hat\theta)\{1 - \pi_i(\hat\phi)\}}{\int f(y_i \mid x_i; \hat\theta)\{1 - \pi_i(\hat\phi)\} \, dy_i}.$$

• We can use the following rejection method to generate samples from $f(y_i \mid x_i, \delta_i = 0; \hat\theta, \hat\phi)$ (a code sketch follows this example):
1. Generate $y_i^*$ from $f(y_i \mid x_i; \hat\theta)$.
2. Using $y_i^*$, compute
$$\pi_i^*(\hat\phi) = \frac{\exp(\hat\phi_0 + \hat\phi_1 x_i + \hat\phi_2 y_i^*)}{1 + \exp(\hat\phi_0 + \hat\phi_1 x_i + \hat\phi_2 y_i^*)}.$$
Accept $y_i^*$ with probability $1 - \pi_i^*(\hat\phi)$.
3. If $y_i^*$ is not accepted, go to Step 1.

• Using the $m$ imputed values of $y_i$, denoted by $y_i^{*(1)}, \cdots, y_i^{*(m)}$, the M-step can be implemented by solving
$$\sum_{i=1}^n \sum_{j=1}^m S\left(\theta; x_i, y_i^{*(j)}\right) = 0$$
and
$$\sum_{i=1}^n \sum_{j=1}^m \left\{ \delta_i - \pi(\phi; x_i, y_i^{*(j)}) \right\} \left( 1, x_i, y_i^{*(j)} \right) = 0,$$
where $S(\theta; x_i, y_i) = \partial \log f(y_i \mid x_i; \theta)/\partial\theta$.
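To make the rejection method concrete, here is a minimal Python sketch. The outcome model $f(y \mid x; \theta)$ is left generic in the notes; for illustration it is taken to be $N(\theta_0 + \theta_1 x, 1)$, so the proposal draw and the function name below are assumptions, not part of the original algorithm.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def reject_sample_y(x_i, theta, phi, n_draws=1):
    """Draw from f(y | x, delta = 0) by rejection:
    propose y* ~ f(y | x; theta), accept with probability 1 - pi(x, y*; phi)."""
    draws = []
    while len(draws) < n_draws:
        # Step 1: propose from the outcome model (illustrative normal model)
        y_star = rng.normal(theta[0] + theta[1] * x_i, 1.0)
        # Step 2: response probability pi_i*(phi) under the logistic model
        eta = phi[0] + phi[1] * x_i + phi[2] * y_star
        pi_star = 1.0 / (1.0 + np.exp(-eta))
        if rng.uniform() < 1.0 - pi_star:   # accept w.p. 1 - pi*
            draws.append(y_star)
        # Step 3: otherwise the while-loop returns to Step 1
    return np.array(draws)

# m = 100 imputed values for a nonrespondent with x_i = 0.5 (illustrative values)
y_imp = reject_sample_y(0.5, theta=(0.0, 1.0), phi=(-0.5, 0.2, 0.3), n_draws=100)
```

The acceptance probability $1 - \pi_i^*(\hat\phi)$ is exactly the factor by which the target density differs from the proposal, which is why no envelope constant is needed here.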
Example 2 (Cont'd)

• Basic Setup: Let $y_{ij}$ be a binary random variable (taking values 0 or 1) with probability $p_{ij} = \Pr(y_{ij} = 1 \mid x_{ij}, a_i)$, and assume that
$$\text{logit}(p_{ij}) = x_{ij}'\beta + a_i,$$
where $x_{ij}$ is a $p$-dimensional covariate associated with the $j$-th repetition of unit $i$, $\beta$ is the parameter of interest that can represent the treatment effect due to $x$, and $a_i$ represents the random effect associated with unit $i$. We assume that the $a_i$ are iid $N(0, \sigma^2)$.

• Missing data: $a_i$.

• Observed likelihood:
$$L_{obs}(\beta, \sigma^2) = \prod_i \int \left\{ \prod_j p(x_{ij}, a_i; \beta)^{y_{ij}} \left[ 1 - p(x_{ij}, a_i; \beta) \right]^{1 - y_{ij}} \right\} \frac{1}{\sigma} \phi\left( \frac{a_i}{\sigma} \right) da_i,$$
where $\phi(\cdot)$ is the pdf of the standard normal distribution.

• MCEM approach: generate $a_i^*$ from
$$f(a_i \mid x_i, y_i; \hat\beta, \hat\sigma) \propto f_1(y_i \mid x_i, a_i; \hat\beta) f_2(a_i; \hat\sigma),$$
where
$$f_1(y_i \mid x_i, a_i; \hat\beta) = \prod_j p(x_{ij}, a_i; \hat\beta)^{y_{ij}} \left[ 1 - p(x_{ij}, a_i; \hat\beta) \right]^{1 - y_{ij}}$$
and $f_2(a_i; \hat\sigma) = \frac{1}{\hat\sigma} \phi(a_i/\hat\sigma)$.

• Metropolis-Hastings algorithm:
1. Generate $a_i^*$ from $f_2(a_i; \hat\sigma)$.
2. Set
$$a_i^{(t)} = \begin{cases} a_i^* & \text{w.p. } \rho(a_i^{(t-1)}, a_i^*) \\ a_i^{(t-1)} & \text{w.p. } 1 - \rho(a_i^{(t-1)}, a_i^*), \end{cases}$$
where
$$\rho(a_i^{(t-1)}, a_i^*) = \min\left\{ \frac{f_1(y_i \mid x_i, a_i^*; \hat\beta)}{f_1(y_i \mid x_i, a_i^{(t-1)}; \hat\beta)}, \, 1 \right\}.$$

Remark

• Monte Carlo EM can be used as a frequentist approach to imputation.

• Convergence is not guaranteed (for fixed $m$).

• The E-step can be computationally heavy. (One may use an MCMC method.)

3 Parametric fractional imputation

Motivation

• We are interested in computing $E\{S(\theta; y_i, Z_i) \mid y_i; \theta^{(t)}\}$.

• The conditional distribution is not of known form:
$$f(z_i \mid y_i; \hat\theta) = \frac{f_1(y_i \mid z_i; \hat\theta_1) f_2(z_i; \hat\theta_2)}{\int f_1(y_i \mid z_i; \hat\theta_1) f_2(z_i; \hat\theta_2) \, dz_i}.$$

• Approximate the conditional expectation by
$$E\{S(\theta; y_i, Z_i) \mid y_i; \hat\theta\} \cong \sum_{j=1}^m w_{ij}^* S(\theta; y_i, z_i^{*(j)}), \quad (2)$$
where $z_i^{*(1)}, \cdots, z_i^{*(m)}$ are generated from $f_2(z_i; \hat\theta_2)$ and
$$w_{ij}^* = \frac{f_1(y_i \mid z_i^{*(j)}; \hat\theta_1)}{\sum_{k=1}^m f_1(y_i \mid z_i^{*(k)}; \hat\theta_1)}.$$

• More generally, we may use (2) where $z_i^{*(1)}, \cdots, z_i^{*(m)}$ are generated from $h(z_i)$ and
$$w_{ij}^* \propto \frac{f_1(y_i \mid z_i^{*(j)}; \hat\theta_1) f_2(z_i^{*(j)}; \hat\theta_2)}{h(z_i^{*(j)})},$$
with $\sum_j w_{ij}^* = 1$.

Remark

• Importance sampling idea: for sufficiently large $m$,
$$\sum_{j=1}^m w_{ij}^* g(z_i^{*(j)}) \cong \frac{\int g(z_i) \frac{f(y_i, z_i; \hat\theta)}{h(z_i)} h(z_i) \, dz_i}{\int \frac{f(y_i, z_i; \hat\theta)}{h(z_i)} h(z_i) \, dz_i} = E\left\{ g(z_i) \mid y_i; \hat\theta \right\}$$
for any $g$ such that the expectation exists.

• In the importance sampling literature, $h(\cdot)$ is called the proposal distribution and $f(\cdot)$ is called the target distribution.

• The weights $w_{ij}^*$ are the normalized importance weights and can be called fractional weights.

• If $y_{mis,i}$ is categorical, then simply use all possible values of $y_{mis,i}$ as the imputed values and assign their conditional probabilities as the fractional weights.

Monte Carlo EM algorithm using PFI (Kim, 2011)

1. Imputation step: generate $z_i^{*(j)} \sim h(z)$, where $h(\cdot)$ does not depend on $\theta$.
2. Weighting step: compute
$$w_{ij(t)}^* \propto f(y_i, z_i^{*(j)}; \hat\theta^{(t)}) / h(z_i^{*(j)}),$$
where $\sum_{j=1}^m w_{ij(t)}^* = 1$.
3. M-step: update $\hat\theta^{(t+1)}$ as the solution to
$$\sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* S\left(\theta; y_i, z_i^{*(j)}\right) = 0.$$
4. Repeat Steps 2 and 3 until convergence.

• "Imputation step" + "Weighting step" = E-step.

• We may add an optional step that checks whether $w_{ij(t)}^*$ is too large for some $j$. In that case, $h(z)$ needs to be changed.

• The imputed values are not changed between EM iterations; only the fractional weights are changed.
1. Computationally efficient (because we use importance sampling only once).
2. Convergence is achieved (because the imputed values are not changed).

Return to Example 1

• Fractional imputation (a code sketch follows this example):
1. Imputation step: Generate $y_i^{*(1)}, \cdots, y_i^{*(m)}$ from $f(y_i \mid x_i; \hat\theta^{(0)})$.
2. Weighting step: Using the $m$ imputed values generated in Step 1, compute the fractional weights by
$$w_{ij(t)}^* \propto \frac{f(y_i^{*(j)} \mid x_i; \hat\theta^{(t)})}{f(y_i^{*(j)} \mid x_i; \hat\theta^{(0)})} \left\{ 1 - \pi(x_i, y_i^{*(j)}; \hat\phi^{(t)}) \right\},$$
where
$$\pi(x_i, y_i; \hat\phi) = \frac{\exp(\hat\phi_0 + \hat\phi_1 x_i + \hat\phi_2 y_i)}{1 + \exp(\hat\phi_0 + \hat\phi_1 x_i + \hat\phi_2 y_i)}.$$

• Using the imputed data and the fractional weights, the M-step can be implemented by solving
$$\sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* S\left(\theta; x_i, y_i^{*(j)}\right) = 0$$
and
$$\sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \left\{ \delta_i - \pi(\phi; x_i, y_i^{*(j)}) \right\} \left( 1, x_i, y_i^{*(j)} \right) = 0,$$
where $S(\theta; x_i, y_i) = \partial \log f(y_i \mid x_i; \theta)/\partial\theta$.
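Continuing the illustrative normal outcome model $y_i \mid x_i \sim N(\theta_0 + \theta_1 x_i, 1)$ from the earlier sketch (an assumption, since the notes keep $f$ generic), the weighting step for one nonrespondent might look as follows. Note that only the weights depend on the current iteration $t$; the imputed values stay fixed.

```python
import numpy as np
from scipy.stats import norm

def fractional_weights(y_imp, x_i, theta_t, theta_0, phi_t):
    """Weighting step of PFI for Example 1 (illustrative normal model).

    y_imp   : m imputed values drawn once from f(y | x; theta_0)
    theta_t : current parameter value; theta_0 : value used at imputation
    Returns the normalized fractional weights w*_{ij(t)}.
    """
    # Density ratio f(y* | x; theta_t) / f(y* | x; theta_0)
    num = norm.pdf(y_imp, loc=theta_t[0] + theta_t[1] * x_i, scale=1.0)
    den = norm.pdf(y_imp, loc=theta_0[0] + theta_0[1] * x_i, scale=1.0)
    # Nonresponse factor 1 - pi(x, y*; phi_t)
    eta = phi_t[0] + phi_t[1] * x_i + phi_t[2] * y_imp
    w = (num / den) * (1.0 - 1.0 / (1.0 + np.exp(-eta)))
    return w / w.sum()   # normalize so the fractional weights sum to one

# Usage with the y_imp drawn earlier (illustrative parameter values):
# w = fractional_weights(y_imp, 0.5, theta_t=(0.1, 0.9),
#                        theta_0=(0.0, 1.0), phi_t=(-0.5, 0.2, 0.3))
```

Because the imputed values never change, the density ratio `num / den` is the only piece that must be recomputed at each EM iteration, which is what makes each iteration cheap.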
4 Applications (Small Area Estimation)

• Hierarchical structural model:
1. Level one model: $y_{ij} \sim f_1(y_{ij} \mid x_{ij}, a_i; \theta_1)$
2. Level two model: $a_i \sim f_2(a_i; \theta_2)$

• Instead of observing $(x_{ij}, y_{ij})$, we observe $(x_{ij}, \hat y_{ij})$, where $\hat y_{ij} \mid y_{ij} \sim N(y_{ij}, v_{ij})$. You may think of $i$ as a state-level index and $j$ as a county-level index.

• Thus, we have two sources of missing data: $a_i$ and $y_i$.

• EM algorithm (a code sketch is given at the end of this section):
1. E-step: We are interested in generating
$$(y_i, a_i) \mid (x_i, \hat y_i) \sim \frac{f_1(y_i \mid x_i, a_i; \hat\theta_1^{(t)}) f_2(a_i; \hat\theta_2^{(t)}) g(\hat y_i \mid y_i)}{\int\int f_1(y_i \mid x_i, a_i; \hat\theta_1^{(t)}) f_2(a_i; \hat\theta_2^{(t)}) g(\hat y_i \mid y_i) \, dy_i \, da_i}. \quad (3)$$
One can use either an MCMC method or the PFI method to compute the conditional expectation. If PFI is used, then we may use the following steps:
(a) Generate $a_i^{*(1)}, \cdots, a_i^{*(m)}$ from $f_2(a_i; \hat\theta_2^{(t)})$.
(b) For each $a_i^{*(k)}$, generate $y_{ij}^{*(k)}$ from some proposal distribution $h(y_{ij} \mid x_{ij}, a_i^{*(k)}, \hat y_{ij})$. One can use a normal distribution with mean
$$E(y_{ij} \mid x_{ij}, a_i^{*(k)}, \hat y_{ij}) = \frac{v_{ij}}{\sigma_e^{2(t)} + v_{ij}} \left( x_{ij}'\hat\beta^{(t)} + a_i^{*(k)} \right) + \frac{\sigma_e^{2(t)}}{\sigma_e^{2(t)} + v_{ij}} \hat y_{ij}$$
and variance $\sigma_e^{2(t)} v_{ij} / (\sigma_e^{2(t)} + v_{ij})$.
(c) The fractional weight assigned to $(a_i^{*(k)}, y_i^{*(k)})$ is then
$$w_{ik(t)}^* \propto \frac{f_1(y_i^{*(k)} \mid x_i, a_i^{*(k)}; \hat\theta_1^{(t)}) \, g(\hat y_i \mid y_i^{*(k)})}{h(y_i^{*(k)} \mid x_i, a_i^{*(k)}, \hat y_i)},$$
with $\sum_k w_{ik(t)}^* = 1$.
2. M-step: the parameters are updated by solving
$$\sum_i \sum_k w_{ik(t)}^* S_1(\theta_1; x_i, y_i^{*(k)}, a_i^{*(k)}) = 0$$
and
$$\sum_i \sum_k w_{ik(t)}^* S_2(\theta_2; a_i^{*(k)}) = 0,$$
where $S_1(\theta_1; x_i, y_i, a_i) = \partial \log f_1(y_i \mid x_i, a_i; \theta_1)/\partial\theta_1$ and $S_2(\theta_2; a_i) = \partial \log f_2(a_i; \theta_2)/\partial\theta_2$.

• Suppose that the problem is a measurement error problem such that we observe $(\hat x_{ij}, \hat y_{ij})$ instead of $(x_{ij}, y_{ij})$, where $\hat x_{ij} \mid x_{ij} \sim g_1(\hat x_{ij} \mid x_{ij}; v_{ij1})$ and $\hat y_{ij} \mid y_{ij} \sim g_2(\hat y_{ij} \mid y_{ij}; v_{ij2})$. In this case, (3) is changed to
$$(x_i, y_i, a_i) \mid (\hat x_i, \hat y_i) \sim \frac{f_1(y_i \mid x_i, a_i; \hat\theta_1^{(t)}) f_2(a_i; \hat\theta_2^{(t)}) g_2(\hat y_i \mid y_i) \tilde g_1(x_i \mid \hat x_i)}{\int\int\int f_1(y_i \mid x_i, a_i; \hat\theta_1^{(t)}) f_2(a_i; \hat\theta_2^{(t)}) g_2(\hat y_i \mid y_i) \tilde g_1(x_i \mid \hat x_i) \, dx_i \, dy_i \, da_i}, \quad (4)$$
where $\tilde g_1(x_i \mid \hat x_i) \propto g_1(\hat x_i \mid x_i) g(x_i)$. The following PFI-EM algorithm can be used:
1. E-step:
(a) Generate $a_i^{*(1)}, \cdots, a_i^{*(m)}$ from $f_2(a_i; \hat\theta_2^{(t)})$.
(b) Generate $x_{ij}^{*(1)}, \cdots, x_{ij}^{*(m)}$ from $h_1(x_{ij} \mid \hat x_{ij})$, which can be $N(\hat x_{ij}, v_{ij1})$.
(c) For each $(a_i^{*(k)}, x_{ij}^{*(k)})$, generate $y_{ij}^{*(k)}$ from some proposal distribution $h(y_{ij} \mid x_{ij}^{*(k)}, a_i^{*(k)}, \hat y_{ij})$. One can use a normal distribution with mean
$$E(y_{ij} \mid x_{ij}^{*(k)}, a_i^{*(k)}, \hat y_{ij}) = \frac{v_{ij}}{\sigma_e^{2(t)} + v_{ij}} \left( x_{ij}^{*(k)\prime}\hat\beta^{(t)} + a_i^{*(k)} \right) + \frac{\sigma_e^{2(t)}}{\sigma_e^{2(t)} + v_{ij}} \hat y_{ij}$$
and variance $\sigma_e^{2(t)} v_{ij} / (\sigma_e^{2(t)} + v_{ij})$.
(d) The fractional weight assigned to $(a_i^{*(k)}, x_i^{*(k)}, y_i^{*(k)})$ is then
$$w_{ik(t)}^* \propto \frac{f_1(y_i^{*(k)} \mid x_i^{*(k)}, a_i^{*(k)}; \hat\theta_1^{(t)}) \, g_2(\hat y_i \mid y_i^{*(k)})}{h(y_i^{*(k)} \mid x_i^{*(k)}, a_i^{*(k)}, \hat y_i)} \times \frac{\tilde g_1(x_i^{*(k)} \mid \hat x_i)}{h_1(x_i^{*(k)} \mid \hat x_i)},$$
with $\sum_k w_{ik(t)}^* = 1$. If we use $h_1(\cdot) = \tilde g_1(\cdot)$, then the second term equals one.
2. M-step: the parameters are updated by solving
$$\sum_i \sum_k w_{ik(t)}^* S_1(\theta_1; x_i^{*(k)}, y_i^{*(k)}, a_i^{*(k)}) = 0$$
and
$$\sum_i \sum_k w_{ik(t)}^* S_2(\theta_2; a_i^{*(k)}) = 0.$$
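As a rough illustration of E-step (a)-(c) in the first setting (no measurement error in $x$), here is a Python sketch for a single area $i$. It assumes a linear level-one model $y_{ij} \mid x_{ij}, a_i \sim N(x_{ij}'\beta + a_i, \sigma_e^2)$ with $a_i \sim N(0, \sigma_a^2)$; the notes keep $f_1$ and $f_2$ generic, so this model choice and all function and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=2)

def pfi_e_step_area(x_i, yhat_i, v_i, beta, sig2_e, sig2_a, m=200):
    """One PFI E-step for area i under an illustrative normal model.

    x_i    : (n_i, p) covariates;  yhat_i : (n_i,) noisy observations
    v_i    : (n_i,) known sampling variances of yhat_ij
    Returns (a_star, y_star, w) with w the normalized fractional weights.
    """
    mu_ij = x_i @ beta                                  # x_ij' beta, shape (n_i,)
    # (a) draw a_i^{*(k)} from f2
    a_star = rng.normal(0.0, np.sqrt(sig2_a), size=m)
    # (b) draw y_ij^{*(k)} from the normal proposal h with the stated mean/variance
    shrink = v_i / (sig2_e + v_i)                       # weight on x'beta + a
    h_mean = shrink * (mu_ij + a_star[:, None]) + (1 - shrink) * yhat_i
    h_var = sig2_e * v_i / (sig2_e + v_i)
    y_star = rng.normal(h_mean, np.sqrt(h_var))         # shape (m, n_i)
    # (c) fractional weights: w*_k ∝ f1(y* | x, a*) g(yhat | y*) / h(y* | ...)
    log_f1 = norm.logpdf(y_star, mu_ij + a_star[:, None], np.sqrt(sig2_e)).sum(axis=1)
    log_g = norm.logpdf(yhat_i, y_star, np.sqrt(v_i)).sum(axis=1)
    log_h = norm.logpdf(y_star, h_mean, np.sqrt(h_var)).sum(axis=1)
    logw = log_f1 + log_g - log_h
    w = np.exp(logw - logw.max())                       # stabilize before normalizing
    return a_star, y_star, w / w.sum()
```

The $f_2$ terms cancel from the weights because the $a_i^{*(k)}$ are drawn from $f_2$ itself, matching the weight formula in step (c) above; the weighted score equations of the M-step can then be solved with these $(a^*, y^*, w)$ triples held fixed across iterations.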