Likelihood-based inference with missing data under missing-at-random

Jae-kwang Kim
Joint work with Shu Yang
Department of Statistics, Iowa State University
May 24, 2014

Outline
1. Introduction
2. Parametric fractional imputation (PFI)
3. PFI approximation of the observed likelihood
4. Main theory
5. Simulation study
6. Discussion

Introduction
• Let y = (y_1, ..., y_n) have a joint density f(y; θ).
• Instead of observing y, we observe y_obs, where y = (y_obs, y_mis).
• We are interested in making inference about θ in the presence of missing data.
• The maximum likelihood estimator θ̂ maximizes the observed log-likelihood

  l_obs(θ) = log ∫ f(y; θ) dy_mis.

Introduction
• Inference with missing data is usually based on Wald-type inference:

  θ̂ ∼ N(θ, Î_obs^{-1}),

  where Î_obs = −∂² l_obs(θ)/∂θ² evaluated at θ = θ̂.
• We are interested in inference based on Wilks' theorem:

  −2{ l_obs(θ) − l_obs(θ̂) } ∼ χ²_p,

  where l_obs(θ) is the observed log-likelihood and p is the dimension of θ.

Basic setup
• The observed likelihood involves integration:

  l_obs(θ) = log ∫ f(y; θ) dy_mis = log f_obs(y_obs; θ).

• A Monte Carlo (importance sampling) approximation is

  f_obs(y_obs; θ) = E{ f(y; θ) / f(y_mis | y_obs; θ̂) | y_obs }
                  ≅ (1/M) Σ_{j=1}^M f(y_obs, y_mis^{*(j)}; θ) / f(y_mis^{*(j)} | y_obs; θ̂),

  where y_mis^{*(j)} ∼ f(y_mis | y_obs; θ̂). The approximation is valid only in a neighborhood of θ = θ̂.

Basic setup
• Wilks inference is made using the likelihood ratio test. If l_obs(θ) were known, the Wilks confidence interval could be constructed as

  { θ ∈ Θ : −2{ l_obs(θ) − l_obs(θ̂) } ≤ χ²_p(1 − α) }.

• Computing the observed log-likelihood is challenging because the integration over the random variable y_{i,mis} is often intractable.
• Monte Carlo methods that approximate the observed-data likelihood using samples from a distribution near the MLE fail to compute l_obs(θ) correctly when θ is far from the MLE.

Parametric fractional imputation (PFI)
Main idea of the EM algorithm of PFI (Kim, 2011):
• PFI saves the computation associated with Monte Carlo EM (MCEM).
• It uses the importance sampling idea to compute the mean score function; the imputed values are generated only once, by importance sampling, at the beginning of the EM iteration.
• PFI does not change the imputed values; it only changes the fractional weights at each EM iteration.
• This largely reduces the computational burden and ensures the convergence of the EM sequence.

Parametric fractional imputation (PFI)
• (I-step) For missing y_{i,mis}, draw y_{i,mis}^{*(j)} ∼ h(y_{i,mis}) for j = 1, ..., m.
• (W-step) Given the current parameter value θ^{(t)}, compute the fractional weights

  w^*_{ij}(θ^{(t)}) = { f(y^*_{ij}; θ^{(t)}) / h(y_{i,mis}^{*(j)}) } / Σ_{k=1}^m { f(y^*_{ik}; θ^{(t)}) / h(y_{i,mis}^{*(k)}) },

  where y^*_{ij} = (y_{i,obs}, y_{i,mis}^{*(j)}).
• (M-step) Update the parameter θ̂^{(t+1)} by solving the mean score equation

  S̄^*(θ | θ^{(t)}) ≡ Σ_{i=1}^n Σ_{j=1}^m w^*_{ij}(θ^{(t)}) S(θ; y^*_{ij}) = 0,    (1)

  where S(θ; y) is the score function of θ.
• Repeat the (W-step) and (M-step) until convergence.

Remarks on the PFI method
• (I-step) + (W-step) = E-step.
• The key part is to compute the fractional weights

  w^*_{ij}(θ) = { f(y^*_{ij}; θ) / h(y_{i,mis}^{*(j)}) } / Σ_{k=1}^m { f(y^*_{ik}; θ) / h(y_{i,mis}^{*(k)}) }.

  Here h is often called the proposal distribution and f is called the target distribution.

Parametric fractional imputation (PFI)
• Wald inference can be constructed based on

  θ̂^* ∼ N(θ, Î_obs^{*-1}),

  where Î_obs^* is the approximate observed information matrix.
• Issue with Wald inference: Wald-type confidence intervals often have poor coverage when the sampling distribution of the MLE is skewed. In such cases, Wilks-type inference is preferred.
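To make the I-, W-, and M-steps concrete, below is a minimal runnable sketch (in Python) of the PFI EM loop for one special case: a bivariate normal sample with y_2 missing for part of the sample. The proposal h(y_2 | y_1) is built from a crude preliminary regression fit, and for the normal model the mean score equation (1) has a closed-form weighted-moment solution. All names and the particular choice of proposal are illustrative, not from Kim (2011).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def logf(y1, y2, th):
    """Complete-data log density of N((mu1, mu2), [[s11, s12], [s12, s22]])."""
    mu1, mu2, s11, s22, rho = th
    s12 = rho * np.sqrt(s11 * s22)
    det = s11 * s22 - s12 ** 2
    r1, r2 = y1 - mu1, y2 - mu2
    quad = (s22 * r1 ** 2 - 2 * s12 * r1 * r2 + s11 * r2 ** 2) / det
    return -np.log(2 * np.pi) - 0.5 * np.log(det) - 0.5 * quad

# toy data: y2 missing where delta == 0
n, m = 200, 50
y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
delta = rng.binomial(1, 0.6, n)                 # 1 = y2 observed
y1 = y[:, 0]
y2_obs = y[delta == 1, 1]
mis = np.where(delta == 0)[0]

# I-step: draw y2*(j) ~ h(y2 | y1) ONCE, from a crude preliminary fit
b1, b0 = np.polyfit(y1[delta == 1], y2_obs, 1)
sd = np.std(y2_obs - (b0 + b1 * y1[delta == 1]))
y2_imp = (b0 + b1 * y1[mis])[:, None] + sd * rng.standard_normal((len(mis), m))
logh = norm.logpdf(y2_imp, (b0 + b1 * y1[mis])[:, None], sd)

th = np.array([0.0, 0.0, 1.0, 1.0, 0.0])        # (mu1, mu2, s11, s22, rho)
for t in range(500):
    # W-step: fractional weights w*_ij(th) proportional to f(y*_ij; th)/h
    lw = logf(y1[mis][:, None], y2_imp, th) - logh
    w = np.exp(lw - lw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # M-step: weighted-moment solution of the mean score equation (1);
    # complete rows get weight 1, imputed rows get weights w
    Y1 = np.concatenate([y1[delta == 1], np.repeat(y1[mis], m)])
    Y2 = np.concatenate([y2_obs, y2_imp.ravel()])
    W = np.concatenate([np.ones((delta == 1).sum()), w.ravel()])
    W = W / W.sum()
    mu1, mu2 = W @ Y1, W @ Y2
    s11 = W @ (Y1 - mu1) ** 2
    s22 = W @ (Y2 - mu2) ** 2
    rho = (W @ ((Y1 - mu1) * (Y2 - mu2))) / np.sqrt(s11 * s22)
    new = np.array([mu1, mu2, s11, s22, rho])
    diff = np.max(np.abs(new - th))
    th = new
    if diff < 1e-8:
        break

print("PFI MLE (mu1, mu2, s11, s22, rho):", np.round(th, 3))
```

Note how the imputed values y2_imp and the proposal densities logh are computed once; only the fractional weights w are recomputed across iterations, which is the defining feature of PFI relative to MCEM.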
PFI approximation of the likelihood

Using the PFI data, we can express

  f_obs,i(y_{i,obs}; θ) ≅ Σ_{j=1}^m { f(y^*_{ij}; θ) / h(y_{i,mis}^{*(j)}) } / Σ_{j=1}^m { 1 / h(y_{i,mis}^{*(j)}) }
                        = 1 / Σ_{j=1}^m { w^*_{ij}(θ) / f(y^*_{ij}; θ) }.

The observed log-likelihood function l_obs(θ) can then be approximated by

  l^*_obs(θ) = − Σ_{i=1}^n log [ Σ_{j=1}^m { w^*_{ij}(θ) / f(y^*_{ij}; θ) } ].

Given the imputed values y^*_{ij}, we only need f(y^*_{ij}; θ) and h(y_{i,mis}^{*(j)}). Note that Σ_j 1/h(y_{i,mis}^{*(j)}) does not depend on θ, so l^*_obs(θ) equals Σ_i log Σ_j { f(y^*_{ij}; θ)/h(y_{i,mis}^{*(j)}) } up to an additive constant; this constant cancels in the likelihood ratio statistics below.

Main theory 1
Theorem 1. The imputed observed likelihood ratio statistic for testing H_0: θ = θ_0 is

  W_1 = −2{ l^*_obs(θ_0) − l^*_obs(θ̂) }.

Under the regularity conditions specified, under the null hypothesis H_0,

  W_1 → χ²(p)

as m → ∞ and n → ∞.

Main theory 2
Theorem 2. Let θ = (θ_1, θ_2), where θ_1 and θ_2 are q × 1 and (p − q) × 1 vectors, respectively. Under the same regularity conditions as Theorem 1, under H_0: θ_1 = θ_{1,0},

  W_2 = −2{ l^*_obs(θ̂^{(0)}) − l^*_obs(θ̂) } → χ²(q)

as m → ∞ and n → ∞, where θ̂^{(0)} = argmax_{H_0} l^*_obs(θ).

Remarks on Theorem 2
• There are two models involved in Theorem 2. One is the full model f and the other is the reduced model f_0, the model under the null hypothesis.
• Recall that

  l^*_obs(θ) = − Σ_{i=1}^n log [ Σ_{j=1}^m { w^*_{ij}(θ) / f(y^*_{ij}; θ) } ],

  with

  w^*_{ij}(θ) = { f(y^*_{ij}; θ) / h(y_{i,mis}^{*(j)}) } / Σ_{k=1}^m { f(y^*_{ik}; θ) / h(y_{i,mis}^{*(k)}) }.

• Thus h remains the same, but f needs to be changed to f_0 under the reduced model.
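The approximation can be checked numerically in a case where the observed likelihood is available in closed form: for a single unit with y_2 missing under a bivariate normal model, f_obs,i is the N(µ_1, σ_1²) marginal density of y_1. Because the Σ_j 1/h term is free of θ, the sketch below compares a difference l^*_obs(θ_a) − l^*_obs(θ_b) against the exact marginal log-likelihood difference. The proposal parameters and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
y1i, m = 0.7, 100_000                   # observed y1 of unit i; imputations

# proposal h(y2 | y1): a working conditional (illustrative parameters)
h_mu, h_sd = 0.4 * y1i, 1.1
y2_star = rng.normal(h_mu, h_sd, m)
logh = norm.logpdf(y2_star, h_mu, h_sd)

def pfi_loglik_i(mu, S):
    """-log sum_j w*_ij(th)/f(y*_ij; th) for this single unit."""
    logf = multivariate_normal(mu, S).logpdf(
        np.column_stack([np.full(m, y1i), y2_star]))
    lw = logf - logh                    # unnormalized log weights
    lw = lw - lw.max()
    w = np.exp(lw) / np.exp(lw).sum()   # fractional weights w*_ij
    return -np.log(np.sum(w / np.exp(logf)))

th_a = ([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])
th_b = ([0.3, 0.1], [[1.2, 0.4], [0.4, 0.9]])

pfi_diff = pfi_loglik_i(*th_a) - pfi_loglik_i(*th_b)
exact_diff = (norm.logpdf(y1i, 0.0, 1.0)             # N(mu1, s1) marginal
              - norm.logpdf(y1i, 0.3, np.sqrt(1.2)))
print(pfi_diff, exact_diff)             # the two should agree closely
```

With m this large the two differences agree to several decimals; the agreement degrades for θ far from the region where h puts its mass, which is the "valid only in a neighborhood of θ̂" caveat from the Basic setup slides.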
An example of likelihood ratio test

Consider the bivariate normal distribution

  (y_{1i}, y_{2i})' ∼ N( (µ_1, µ_2)', [ σ_1², ρσ_1σ_2 ; ρσ_1σ_2, σ_2² ] ),  i = 1, ..., n.

We are interested in testing H_0: µ_1 = µ_2.

Table: Data structure (X = observed)

  Set | y_1 | y_2 | Imputation                                          | Proposal h(·)
  H   |  X  |  X  | –                                                   | –
  K   |  X  |     | {(y_{2i}^{*(j)}, w^*_{ij})}_{j=1}^m                 | h_K(y_2 | y_1)
  L   |     |  X  | {(y_{1i}^{*(j)}, w^*_{ij})}_{j=1}^m                 | h_L(y_1 | y_2)
  M   |     |     | {(y_{1i}^{*(j)}, y_{2i}^{*(j)}, w^*_{ij})}_{j=1}^m  | h_M(y_1, y_2)

An example of likelihood ratio test
• Under the full model, for i ∈ K the fractional weights are given by

  w^*_{ij}(θ) = { f(y_i^{*(j)}; θ) / h_K(y_{2i}^{*(j)} | y_{1i}) } / Σ_{k=1}^m { f(y_i^{*(k)}; θ) / h_K(y_{2i}^{*(k)} | y_{1i}) },

  and similarly for L and M.
• The MLE under the full model is computed by solving

  Σ_{i=1}^n Σ_{j=1}^m w^*_{ij}(θ) S(θ; y_i^{*(j)}) = 0.

  We may use the EM algorithm to obtain the solution.
• The maximum of the observed likelihood under the full model is

  l^*_obs(θ̂) = − Σ_{i=1}^n log [ Σ_{j=1}^m { w^*_{ij}(θ̂) / f(y_i^{*(j)}; θ̂) } ].

An example of likelihood ratio test
• Let f_0 be the density for the reduced model under H_0: µ_1 = µ_2 = µ.
• For i ∈ K, the fractional weights are given by

  w^*_{0,ij}(θ) = { f_0(y_i^{*(j)}; θ) / h_K(y_{2i}^{*(j)} | y_{1i}) } / Σ_{k=1}^m { f_0(y_i^{*(k)}; θ) / h_K(y_{2i}^{*(k)} | y_{1i}) },

  and similarly for L and M.
• Note that we are using the same imputed values for this computation.
• The MLE of θ under the reduced model, denoted by θ̂_0, is obtained by solving

  Σ_{i=1}^n Σ_{j=1}^m w^*_{0,ij}(θ) S_0(θ; y_i^{*(j)}) = 0,

  where S_0(θ; y) is the score function derived from f_0.

An example of likelihood ratio test
• The maximum of the observed likelihood under the null model is given by

  l^*_{0,obs}(θ̂_0) = − Σ_{i=1}^n log [ Σ_{j=1}^m { w^*_{0,ij}(θ̂_0) / f_0(y_i^{*(j)}; θ̂_0) } ].

• The test statistic for testing H_0: µ_1 = µ_2 is computed from the PFI data as

  W_2 = −2{ l^*_{0,obs}(θ̂_0) − l^*_obs(θ̂) }.

• If W_2 > χ²_{1,1−α}, we reject the null model; a code sketch of this computation follows.
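Below is a runnable sketch of the full-versus-reduced computation, restricted for brevity to patterns H and K, and maximizing the PFI log-likelihood with a generic optimizer instead of the EM iteration described above (a swapped-in shortcut, not the slides' algorithm). It uses the Σ_i log Σ_j f/h form of l^*_obs, which differs from the weight form only by a θ-free constant that cancels in W_2; the (log σ, atanh ρ) parametrization and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2, norm

rng = np.random.default_rng(2)
n, m = 300, 100
Y = rng.multivariate_normal([0.2, 0.2], [[1.0, 0.5], [0.5, 1.0]], size=n)
delta = rng.binomial(1, 0.6, n)                 # 1 = y2 observed
H = Y[delta == 1]                               # complete rows (pattern H)
y1K = Y[delta == 0, 0]                          # y1 of rows with y2 missing

# I-step: y2*(j) ~ h_K(y2 | y1), from a crude preliminary regression fit
b1, b0 = np.polyfit(H[:, 0], H[:, 1], 1)
sd = np.std(H[:, 1] - (b0 + b1 * H[:, 0]))
y2_imp = (b0 + b1 * y1K)[:, None] + sd * rng.standard_normal((len(y1K), m))
logh = norm.logpdf(y2_imp, (b0 + b1 * y1K)[:, None], sd)

def logf(y1, y2, th):
    """Bivariate normal log density; th = (mu1, mu2, log s1, log s2, atanh rho)."""
    mu1, mu2, ls1, ls2, z = th
    s1, s2, r = np.exp(ls1), np.exp(ls2), np.tanh(z)
    q = (((y1 - mu1) / s1) ** 2
         - 2 * r * (y1 - mu1) * (y2 - mu2) / (s1 * s2)
         + ((y2 - mu2) / s2) ** 2) / (1 - r ** 2)
    return -np.log(2 * np.pi) - ls1 - ls2 - 0.5 * np.log(1 - r ** 2) - 0.5 * q

def pfi_loglik(th):
    """theta-dependent part of l*_obs: complete rows plus imputed K rows."""
    lH = logf(H[:, 0], H[:, 1], th).sum()
    lw = logf(y1K[:, None], y2_imp, th) - logh  # log f(y*_ij; th)/h per draw
    M = lw.max(axis=1, keepdims=True)
    lK = (np.log(np.exp(lw - M).mean(axis=1)) + M.ravel()).sum()
    return lH + lK

opt = dict(method="Nelder-Mead",
           options={"maxiter": 10000, "maxfev": 10000,
                    "xatol": 1e-8, "fatol": 1e-10})
full = minimize(lambda t: -pfi_loglik(t), np.zeros(5), **opt)
red = minimize(lambda t: -pfi_loglik(np.r_[t[0], t[0], t[1:]]),  # mu1 = mu2
               np.zeros(4), **opt)

W2 = 2 * (red.fun - full.fun)                   # = -2{l*_0,obs - l*_obs}
print(f"W2 = {W2:.3f}, reject H0: {W2 > chi2.ppf(0.95, df=1)}")
```

Since the data here are generated with µ_1 = µ_2, W_2 should exceed the χ²_1(0.95) cutoff in roughly 5% of runs.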
Simulation One
Profile likelihood confidence intervals
• y_i = 2 + x_i + e_i, with x_i ∼ N(1, 1) and e_i ∼ N(0, 1), where x_i is fully observed and y_i is subject to missingness (so the true values are β_1 = 1 and σ² = 1).
• δ_i ∼ Bernoulli(0.6), independently; y_i is observed if δ_i = 1 and missing if δ_i = 0.
• Monte Carlo samples were independently generated B = 2,000 times.
• We construct 95% confidence intervals for β_1 and σ² by two methods:
  • the Wald method, using asymptotic normality;
  • the Wilks method, using the result of Theorem 2.

Table: Monte Carlo length and coverage of the Wald and Wilks confidence intervals for β_1.

  sample size | Wald C.I. length | Wald C.I. coverage | Wilks C.I. length | Wilks C.I. coverage
  n = 20      | 0.895            | 0.895              | 0.900             | 0.910
  n = 50      | 0.719            | 0.928              | 0.740             | 0.934
  n = 100     | 0.504            | 0.932              | 0.511             | 0.936

Table: Monte Carlo length and coverage of the Wald and Wilks confidence intervals for σ².

  sample size | Wald C.I. length | Wald C.I. coverage | Wilks C.I. length | Wilks C.I. coverage
  n = 20      | 1.345            | 0.743              | 1.644             | 0.883
  n = 50      | 0.952            | 0.866              | 1.041             | 0.928
  n = 100     | 0.499            | 0.940              | 0.503             | 0.943

When n = 20, about 8% of the Wald confidence intervals for σ² extend below zero.

[Figure: Monte Carlo sampling distribution of β̂ for n = 20, 50, 100 (density plots).]

[Figure: Monte Carlo sampling distribution of σ̂² for n = 20, 50, 100 (density plots).]

Simulation Two
Likelihood ratio test
• Samples of size n = 100 and n = 200 are generated from

  y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + e_i,

  where x_i = (x_{1i}, x_{2i}) ∼ N( (0, 2)', [ 1, 0.1 ; 0.1, 2 ] ) and e_i ∼ N(0, 1).
• δ_i ∼ Bernoulli(0.6).
• (β_0, β_1) = (−2, 1); β_2 varies over 0, 0.1, 0.2, and 0.3.
• We are interested in testing the null hypothesis H_0: β_2 = 0 using
  • the likelihood ratio test (LRT) of fractional imputation (FI);
  • the LRT of multiple imputation (MI) [1].

[1] Meng, X. L. and Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79, 103–111.

Table: Monte Carlo power of the likelihood ratio test (LRT) of multiple imputation (MI) and fractional imputation (FI) for continuous data with sample size n = 100.

  Parameter value | LRT.MI, α = 0.05 | LRT.FI, α = 0.05 | LRT.MI, α = 0.1 | LRT.FI, α = 0.1
  β_2 = 0         | 0.038            | 0.060            | 0.088           | 0.113
  β_2 = 0.1       | 0.151            | 0.202            | 0.261           | 0.314
  β_2 = 0.2       | 0.474            | 0.601            | 0.634           | 0.713
  β_2 = 0.3       | 0.812            | 0.888            | 0.894           | 0.935

Table: Monte Carlo power of the LRT of MI and FI for continuous data with sample size n = 200.

  Parameter value | LRT.MI, α = 0.05 | LRT.FI, α = 0.05 | LRT.MI, α = 0.1 | LRT.FI, α = 0.1
  β_2 = 0         | 0.032            | 0.049            | 0.083           | 0.098
  β_2 = 0.1       | 0.267            | 0.335            | 0.406           | 0.449
  β_2 = 0.2       | 0.795            | 0.854            | 0.888           | 0.922
  β_2 = 0.3       | 0.988            | 0.995            | 0.996           | 0.997

Concluding remarks
• Parametric fractional imputation provides a completed data set with fractional weights, which makes it possible to compute the observed log-likelihood from h(y_{i,mis}^{*(j)}) and f(y^*_{ij}; θ).
• The LRT from PFI is more powerful than the Wald test based on the central limit theorem, and also more powerful than the LRT from MI proposed by Meng and Rubin (1992).
• Extensions will be a topic of future research: model selection criteria such as AIC or BIC can be developed, as considered by Ibrahim et al. (2008) and Garcia et al. (2010).

The end
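Appendix: a minimal sketch of the Simulation One comparison for σ². Under this design the missingness is independent of the data, so the MLE of (β, σ²) reduces to the complete-case fit and the profile log-likelihood of σ² is available in closed form; the sketch therefore compares Wald and Wilks (profile likelihood) intervals directly on the respondents, without the PFI machinery covered in the earlier sketches. The true value is σ² = 1; all names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, norm
from scipy.optimize import brentq

rng = np.random.default_rng(3)
B, n = 2000, 20
z = norm.ppf(0.975)
cut = chi2.ppf(0.95, df=1)
cover_wald = cover_wilks = done = 0

for _ in range(B):
    x = rng.normal(1.0, 1.0, n)
    y = 2.0 + x + rng.normal(0.0, 1.0, n)
    d = rng.binomial(1, 0.6, n).astype(bool)    # y observed iff d is True
    if d.sum() < 4:                             # skip degenerate samples
        continue
    done += 1
    r = d.sum()
    X = np.column_stack([np.ones(r), x[d]])
    beta = np.linalg.lstsq(X, y[d], rcond=None)[0]
    rss = np.sum((y[d] - X @ beta) ** 2)
    s2 = rss / r                                # MLE of sigma^2

    # Wald interval: s2 +/- z * sqrt(2 s2^2 / r)
    half = z * np.sqrt(2.0 * s2 ** 2 / r)
    cover_wald += (s2 - half <= 1.0 <= s2 + half)

    # Wilks interval: {v : 2[l_p(s2) - l_p(v)] <= chi^2_1(0.95)},
    # where l_p is the profile log-likelihood of sigma^2
    lp = lambda v: -0.5 * r * np.log(v) - rss / (2.0 * v)
    g = lambda v: 2.0 * (lp(s2) - lp(v)) - cut
    lo = brentq(g, 1e-10, s2)                   # g changes sign on each side
    hi = brentq(g, s2, 1e4)
    cover_wilks += (lo <= 1.0 <= hi)

print("coverage for sigma^2:  Wald", cover_wald / done,
      "  Wilks", cover_wilks / done)
```

At n = 20 the Wald coverage falls well below the nominal 95%, while the Wilks interval, which respects the positivity of σ², does markedly better, in line with the table above.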