3. Fractional Imputation (Part 2)

1 Review

• y observed, z latent.

• (y, z): joint density
$$f(y, z; \theta) = f_1(y \mid z; \theta_1)\, f_2(z; \theta_2).$$

• The MLE of $\theta$ is computed by solving the mean score equation
$$E\{S(\theta; y, Z) \mid y; \theta\} = 0. \qquad (1)$$

• Using the PFI method, the mean score equation (1) can be written as
$$\bar S_I(\theta) \equiv \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\theta)\, S(\theta; y_i, z_i^{*(j)}) = 0, \qquad (2)$$
where
$$w_{ij}^*(\theta) = \frac{f(y_i, z_i^{*(j)}; \theta)/h(z_i^{*(j)})}{\sum_k f(y_i, z_i^{*(k)}; \theta)/h(z_i^{*(k)})} = \frac{f_1(y_i \mid z_i^{*(j)}; \theta_1) f_2(z_i^{*(j)}; \theta_2)/h(z_i^{*(j)})}{\sum_k f_1(y_i \mid z_i^{*(k)}; \theta_1) f_2(z_i^{*(k)}; \theta_2)/h(z_i^{*(k)})}$$
and $z_i^{*(1)}, \cdots, z_i^{*(m)}$ are generated from $h(z)$.

• How to choose $h(\cdot)$?

– Method 1: For an initial guess $\theta_0$, use
$$h(z_i^{*(j)}) = f(y_i, z_i^{*(j)}; \theta_0).$$

– Method 2: Use a prior $\pi(\theta)$ to get
$$h(z_i^{*(j)}) = \int f(y_i, z_i^{*(j)}; \theta)\, \pi(\theta)\, d\theta.$$
We use the following steps:
1. Generate $\theta^{*(1)}, \cdots, \theta^{*(m)}$ from $\pi(\theta)$.
2. For each $\theta^{*(j)}$, generate one $z^{*(j)}$ from $f_2(z; \theta^{*(j)})$.
3. Use
$$h(z_i^{*(j)}) = \frac{1}{m} \sum_{k=1}^m f(y_i, z_i^{*(j)}; \theta^{*(k)}).$$
In Method 2, you may use the $m$ quantiles of $\pi(\cdot)$ in Step 1.

– Method 3: In some cases, especially when $f_2(z; \theta_2)$ is a normal distribution, we may use the $m$ quantiles of the standard normal distribution and choose $h(z^{*(j)}) = 1/m$.

Example (of Method 3): GLMM

• Model
$$y_{ij} \sim f_1(y_{ij} \mid x_{ij}, a_i; \theta_1), \qquad a_i \sim N(0, \sigma_a^2).$$
We are interested in computing Monte Carlo approximations of expectations with respect to the conditional distribution of $a_i$ given $(x_i, y_i)$. For example, the mean score function for $\theta_1$ is
$$E\{S(\theta_1; x_i, y_i, a_i) \mid x_i, y_i; \hat\theta\} = \frac{\int S(\theta_1; x_i, y_i, a_i)\, f_1(y_i \mid x_i, a_i; \hat\theta_1)\, \phi(a_i/\hat\sigma_a)\, d(a_i/\hat\sigma_a)}{\int f_1(y_i \mid x_i, a_i; \hat\theta_1)\, \phi(a_i/\hat\sigma_a)\, d(a_i/\hat\sigma_a)},$$
where $\phi(\cdot)$ is the pdf of the standard normal distribution. Thus, we can approximate the above term by
$$E\{S(\theta_1; x_i, y_i, a_i) \mid x_i, y_i; \hat\theta\} \cong \frac{\sum_{j=1}^m S(\theta_1; x_i, y_i, a_i^{*(j)})\, f_1(y_i \mid x_i, a_i^{*(j)}; \hat\theta_1)}{\sum_{j=1}^m f_1(y_i \mid x_i, a_i^{*(j)}; \hat\theta_1)},$$
where $a_i^{*(j)} = \hat\sigma_a \zeta_j$ and $\zeta_1, \cdots, \zeta_m$ are the $m$ quantiles of the standard normal distribution, that is,
$$\Pr(Z < \zeta_k) = \frac{k}{m+1}, \qquad k = 1, \cdots, m.$$
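The following is a minimal Python sketch of this quantile-based approximation for a hypothetical random-intercept logistic GLMM, $y_{ij} \mid x_{ij}, a_i \sim \mathrm{Bernoulli}\{\mathrm{expit}(x_{ij}'\beta + a_i)\}$. The model choice and all names (`beta_hat`, `sigma_a_hat`, `m`, etc.) are illustrative assumptions, not part of the notes.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def fractional_points(sigma_a_hat, m):
    """Support points a_i^{*(j)} = sigma_a_hat * zeta_j, where zeta_j are the
    m quantiles of N(0, 1): Pr(Z < zeta_j) = j / (m + 1)."""
    zeta = norm.ppf(np.arange(1, m + 1) / (m + 1))
    return sigma_a_hat * zeta                                  # shape (m,)

def mean_score_beta_i(y_i, X_i, beta_hat, sigma_a_hat, m=100):
    """Approximate E{S(beta; x_i, y_i, a_i) | x_i, y_i; theta_hat} for cluster i,
    with y_i: (n_i,) 0/1 responses and X_i: (n_i, p) covariates."""
    a_star = fractional_points(sigma_a_hat, m)                 # a_i^{*(j)}, j = 1..m
    # f1(y_i | x_i, a_i^{*(j)}; beta_hat): Bernoulli likelihood at each support point
    p = expit(X_i @ beta_hat[:, None] + a_star[None, :])       # (n_i, m)
    lik = np.prod(np.where(y_i[:, None] == 1, p, 1.0 - p), axis=0)   # (m,)
    w = lik / lik.sum()                                        # fractional weights w_{ij}^*
    # complete-data score for beta at each support point: X_i'(y_i - p_j)
    scores = X_i.T @ (y_i[:, None] - p)                        # (p, m)
    return scores @ w                                          # weighted average over j
```

Note that the weights depend only on the likelihood values, so the same weights can be reused for any other function of $a_i$ whose conditional expectation is needed.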
2 Variance estimation

• When $\hat\theta$ is the solution to $S(\theta) = 0$, the asymptotic variance of $\hat\theta$ is given by
$$V(\hat\theta) = \{I(\theta)\}^{-1}, \qquad (3)$$
where $I(\theta) = E\{-\partial S(\theta)/\partial\theta'\}$.

• To verify this result, we can apply a Taylor linearization to $S(\hat\theta) = 0$:
$$0 = S(\hat\theta) = S(\theta) + \frac{\partial S(\theta)}{\partial\theta'}(\hat\theta - \theta) \cong S(\theta) - \{I(\theta)\}(\hat\theta - \theta).$$
Thus,
$$\hat\theta - \theta \cong \{I(\theta)\}^{-1} S(\theta)$$
and
$$V(\hat\theta) \cong \{I(\theta)\}^{-1} V\{S(\theta)\} \{I(\theta)\}^{-1} = \{I(\theta)\}^{-1},$$
where the last equality follows from
$$E\{S(\theta)\} = 0 \qquad (4)$$
and
$$E\{-\partial S(\theta)/\partial\theta'\} = E\{S(\theta) S(\theta)'\}. \qquad (5)$$
Equality (5) is called the Bartlett identity.

• Under missing data, define $S_y(\theta) = E\{S_{(y,z)}(\theta) \mid y; \theta\}$, where $S_{(y,z)}(\theta) = S(\theta; y, z)$. Also, define $I_{(y,z)}(\theta) = E\{-\partial S_{(y,z)}(\theta)/\partial\theta'\}$. The MLE of $\theta$, which is the solution to (1), can be written as the solution to $S_y(\theta) = 0$. To emphasize that the MLE is the solution to $S_y(\theta) = 0$, denote the MLE by $\hat\theta_y$. Not surprisingly, we can establish that, similarly to (3),
$$V(\hat\theta_y) \doteq \{I_y(\theta)\}^{-1} = \{V(S_y)\}^{-1}, \qquad (6)$$
where $I_y(\theta) = E\{-\partial S_y(\theta)/\partial\theta'\}$.

[HW: Prove (6). You first need to prove (4) and (5) for $S_y(\theta)$.]

• If we define $S_{z|y}(\theta) = S_{(y,z)}(\theta) - S_y(\theta)$ and
$$I_{z|y}(\theta) = E\left\{-\frac{\partial}{\partial\theta'} S_{z|y}(\theta)\right\},$$
we can establish the following results:
1. $E\{S_{z|y}(\theta) \mid y\} = 0$.
2. $\mathrm{Cov}\{S_{z|y}(\theta), S_y(\theta)\} = 0$.
3. $V\{S_{(y,z)}(\theta)\} = V\{S_y(\theta)\} + V\{S_{z|y}(\theta)\}$, or equivalently
$$I_{(y,z)}(\theta) = I_y(\theta) + I_{z|y}(\theta). \qquad (7)$$
Equality (7) is sometimes called the missing information principle (Orchard and Woodbury, 1972).

• We are interested in estimating $I_y(\theta) = I_{(y,z)}(\theta) - I_{z|y}(\theta)$. To investigate the second term, $I_{z|y}(\theta)$, note that
$$I_{z|y}(\theta) = V\{S_{z|y}(\theta)\} = V\left[E\{S_{z|y}(\theta) \mid y\}\right] + E\left[V\{S_{z|y}(\theta) \mid y\}\right] = 0 + E\left[V\{S_{(y,z)}(\theta) - S_y(\theta) \mid y\}\right] = 0 + E\left[V\{S_{(y,z)}(\theta) \mid y\}\right].$$
Thus, $I_{z|y}(\theta)$ is estimated by $V\{S_{(y,z)}(\theta) \mid y\}$, which is often obtained by
$$B_m(S) = \frac{1}{m-1} \sum_{j=1}^m \left\{S(\theta; y, z^{*(j)}) - \bar S_I(\theta)\right\}^{\otimes 2},$$
where $z^{*(1)}, \cdots, z^{*(m)} \sim f(z \mid y; \theta)$, $B^{\otimes 2} = BB'$, and $\bar S_I(\theta) = m^{-1} \sum_{j=1}^m S(\theta; y, z^{*(j)})$.

• Thus, an (asymptotically) unbiased estimator of $I_y(\theta)$ is given by
$$\hat I_y(\theta) = E\{I_{(y,z)}(\theta) \mid y\} - B_m,$$
which was first derived by Louis (1982).

• Louis' formula can also be derived from the mean score equation (2) for the PFI method. Note that we may write $\bar S_I(\theta)$ as
$$\bar S_I(\theta \mid \theta) = \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\theta)\, S(\theta; y_i, z_i^{*(j)}) = \sum_{i=1}^n \bar S_i(\theta)$$
to reflect the dependence on $\theta$ in two places. Thus, we can use the chain rule to obtain
$$\frac{\partial}{\partial\theta'} \bar S_I(\theta \mid \theta) = \sum_{i=1}^n \sum_{j=1}^m \left\{\frac{\partial}{\partial\theta'} w_{ij}^*(\theta)\right\} S(\theta; y_i, z_i^{*(j)}) + \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\theta)\, \frac{\partial}{\partial\theta'} S(\theta; y_i, z_i^{*(j)}) = \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\theta) \left\{S(\theta; y_i, z_i^{*(j)}) - \bar S_i(\theta)\right\}^{\otimes 2} + \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\theta)\, \frac{\partial}{\partial\theta'} S(\theta; y_i, z_i^{*(j)}).$$
Thus, the first term estimates $V\{S_{(y,z)}(\theta) \mid y\}$ and the second term estimates $E\{-I_{(y,z)}(\theta) \mid y\}$. Therefore, the asymptotic variance of the MLE of $\theta$ is estimated by the inverse of
$$\hat I_y = \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\hat\theta)\, I(\hat\theta; y_i, z_i^{*(j)}) - \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\hat\theta) \left\{S(\hat\theta; y_i, z_i^{*(j)}) - \bar S_i(\hat\theta)\right\}^{\otimes 2},$$
where
$$I(\theta; y, z) = -\frac{\partial^2}{\partial\theta\,\partial\theta'} \log f(y, z; \theta).$$

• Yang and Kim (2014) further developed a likelihood-ratio test theory associated with the PFI likelihood
$$l_{PFI}(\theta) = -\sum_{i=1}^n \log\left\{\sum_{j=1}^m \frac{w_{ij}^*(\theta)}{f(y_i, z_i^{*(j)}; \theta)}\right\},$$
which turns out to be a PFI approximation to the observed likelihood
$$l_y(\theta) = \sum_{i=1}^n \log \int f(y_i, z_i; \theta)\, dz_i.$$
That is, they proved that
$$-2\left\{l_{PFI}(\theta) - l_{PFI}(\hat\theta)\right\} \sim \chi^2(p)$$
under some regularity conditions.

3 Prediction

• So far, we have discussed estimation of the parameter $\theta$.

• In some applications, prediction of $z$, or a function of $z$, given the observations is also very important.

• For example, in the case of predicting $z_i$, we may use
$$\hat z_i = E\{z_i \mid y_i; \hat\theta\}, \qquad (8)$$
which can be approximated by PFI:
$$\hat z_i^* = \sum_{j=1}^m w_{ij}^* z_i^{*(j)},$$
where $w_{ij}^* \propto f(y_i, z_i^{*(j)}; \hat\theta)/h(z_i^{*(j)})$ and $\sum_j w_{ij}^* = 1$.

• How do we estimate the mean squared prediction error
$$MSPE(\hat z_i) = E\{(\hat z_i - z_i)^2\}?$$

• Note that if we define $\tilde z_i = E\{z_i \mid y_i; \theta\}$, then
$$MSPE(\tilde z_i) = E\{(\tilde z_i - z_i)^2\} = E\{V(z_i \mid y_i; \theta)\},$$
which can be estimated easily using PFI by
$$\widehat{MSPE}(\tilde z_i) = \sum_{j=1}^m w_{ij}^* (z_i^{*(j)} - \hat z_i^*)^2.$$
However, $MSPE(\hat z_i)$ is generally different from $MSPE(\tilde z_i)$. In fact, it can be shown that
$$MSPE(\hat z_i) = MSPE(\tilde z_i) + E\{(\hat z_i - \tilde z_i)^2\}. \qquad (9)$$

• Two approaches to estimating $MSPE(\hat z_i)$:

1. Linearization method: estimate the second term of (9) by
$$\hat E\{(\hat z_i - \tilde z_i)^2\} = g_i(\hat\theta)\, \hat V(\hat\theta)\, g_i(\hat\theta)',$$
where $g_i(\theta) = \frac{\partial}{\partial\theta'} E\{z_i \mid y_i; \theta\}$.

2. Parametric bootstrap (a code sketch follows this list):

(a) Given $\hat\theta$, generate $y_i^{(b)}$ and $z_i^{(b)}$ from $f(y, z; \hat\theta)$. Let $(y_i^{(b)}, z_i^{(b)})$ be the $b$-th bootstrap sample.

(b) Using the bootstrap sample $y_i^{(b)}$ $(i = 1, \cdots, n)$, compute $\hat\theta^{(b)}$ using the PFI method.

(c) Compute $\hat z_i^{(b)} = E\{z_i \mid y_i^{(b)}; \hat\theta^{(b)}\}$ by applying the same formula used to obtain the original predictor $\hat z_i$ in (8).

(d) The bootstrap estimator of $MSPE(\hat z_i)$ is computed by
$$\frac{1}{B} \sum_{b=1}^B \left\{\hat z_i^{(b)} - z_i^{(b)}\right\}^2,$$
where $\hat z_i^{(b)}$ is computed in Step (c).
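Below is a minimal sketch of the bootstrap steps (a)–(d), written against a generic interface; `sample_model`, `fit_pfi`, and `predict_z` are hypothetical user-supplied functions standing in for the model-specific pieces, not functions defined in the notes.

```python
import numpy as np

def bootstrap_mspe(theta_hat, n, B, sample_model, fit_pfi, predict_z, seed=None):
    """Parametric bootstrap estimator of MSPE(z_hat_i), i = 1, ..., n.
    sample_model(theta, n, rng) -> (y, z): n draws from f(y, z; theta)   [step (a)]
    fit_pfi(y)                  -> theta estimate computed by PFI        [step (b)]
    predict_z(y, theta)         -> array of E{z_i | y_i; theta}          [step (c)]"""
    rng = np.random.default_rng(seed)
    sq_err = np.zeros(n)
    for _ in range(B):
        y_b, z_b = sample_model(theta_hat, n, rng)     # step (a): bootstrap data
        theta_b = fit_pfi(y_b)                         # step (b): re-estimate theta
        z_hat_b = predict_z(y_b, theta_b)              # step (c): re-predict z
        sq_err += (z_hat_b - z_b) ** 2                 # accumulate squared errors
    return sq_err / B                                  # step (d): average over B
```

Because each replicate refits $\theta$ in step (b), the resulting estimator also reflects the variability of $\hat z_i$ due to estimating $\theta$, which is the second term of (9).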
• References:

– Prasad, N.G.N. and Rao, J.N.K. (1990). The estimation of the mean squared error of small-area estimators. JASA, 85, 163–171.

– Chatterjee, S., Lahiri, P., and Li, H. (2008). Parametric bootstrap approximation to the distribution of EBLUP and related prediction intervals in linear mixed models. Annals of Statistics, 36, 1221–1245.

4 Application to SAE

• Hierarchical structural model:

1. Level-one model: $y_{ij} \sim f_1(y_{ij} \mid x_{ij}, a_i; \theta_1)$

2. Level-two model: $a_i \sim f_2(a_i; \theta_2)$

• Instead of observing $(x_{ij}, y_{ij})$, we observe $(x_{ij}, \hat y_{ij})$, where $\hat y_{ij} \mid y_{ij} \sim g(\hat y_{ij} \mid y_{ij})$. You may think of $i$ as a state-level index and $j$ as a county-level index.

• Thus, we have two sets of latent variables: $a_i$ and $y_i$.

• The EM algorithm can be used to estimate $\theta_1$ and $\theta_2$. (See the Week 2 material.)

• Prediction of $Y_i = \sum_j y_{ij}$:
$$\hat Y_i = E\left\{\sum_j y_{ij} \,\Big|\, x_i, \hat y_i; \hat\theta\right\},$$
which can be approximated by
$$\hat Y_i = \sum_{k=1}^m w_{ik}^* \left(\sum_j y_{ij}^{*(k)}\right),$$
where
$$w_{ik}^* \propto \frac{f_1(y_i^{*(k)} \mid x_i, a_i^{*(k)}; \hat\theta_1)\, g(\hat y_i \mid y_i^{*(k)})}{h_1(y_i^{*(k)} \mid x_i, a_i^{*(k)}, \hat y_i)} \times \frac{f_2(a_i^{*(k)}; \hat\theta_2)}{h_2(a_i^{*(k)} \mid x_i, \hat y_i)},$$
with $\sum_k w_{ik}^* = 1$. (A code sketch of this computation is given below.)

• Prediction error estimation: use the parametric bootstrap.
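As an illustration of the weight computation above, here is a minimal Python sketch for a single area $i$; the density and proposal functions (`f1_pdf`, `f2_pdf`, `g_pdf`, `h1_pdf`, `h2_pdf`) and the draws `a_star`, `y_star` are hypothetical placeholders to be supplied by a concrete model.

```python
import numpy as np

def predict_area_total(y_hat_i, x_i, a_star, y_star,
                       f1_pdf, f2_pdf, g_pdf, h1_pdf, h2_pdf):
    """PFI predictor of Y_i = sum_j y_ij for one area i.
    y_hat_i: (n_i,)   observed noisy responses;  x_i: (n_i, p) covariates
    a_star:  (m,)     draws a_i^{*(k)} from the proposal h2
    y_star:  (m, n_i) draws y_i^{*(k)} from the proposal h1"""
    m = a_star.shape[0]
    w = np.empty(m)
    for k in range(m):
        num = (f1_pdf(y_star[k], x_i, a_star[k])        # f1(y* | x_i, a*; theta1_hat)
               * g_pdf(y_hat_i, y_star[k])              # g(y_hat_i | y*)
               * f2_pdf(a_star[k]))                     # f2(a*; theta2_hat)
        den = (h1_pdf(y_star[k], x_i, a_star[k], y_hat_i)
               * h2_pdf(a_star[k], x_i, y_hat_i))
        w[k] = num / den                                # unnormalized w_{ik}^*
    w /= w.sum()                                        # enforce sum_k w_{ik}^* = 1
    return float(np.sum(w * y_star.sum(axis=1)))        # sum_k w_{ik}^* sum_j y_ij^{*(k)}
```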