Statistical Methods for Handling Missing Data Jae-Kwang Kim Department of Statistics, Iowa State University July 5th, 2014 Outline Textbook : “Statistical Methods for handling incomplete data” by Kim and Shao (2013) • Part 1: Basic Theory (Chapter 2-3) • Part 2: Imputation (Chapter 4) • Part 3: Propensity score approach (Chapter 5) • Part 4: Nonignorable missing (Chapter 6) Jae-Kwang Kim (ISU) July 5th, 2014 2 / 181 Statistical Methods for Handling Missing Data Part 1: Basic Theory Jae-Kwang Kim Department of Statistics, Iowa State University 1 Introduction Definitions for likelihood theory • The likelihood function of θ is defined as L(θ) = f (y; θ) where f (y; θ) is the joint pdf of y. • Let θ̂ be the maximum likelihood estimator (MLE) of θ0 if it satisfies L(θ̂) = max L(θ). θ∈Θ • A parametric family of densities, P = {f (y ; θ); θ ∈ Θ}, is identifiable if for all y, f (y ; θ1 ) 6= f (y ; θ2 ) Jae-Kwang Kim (ISU) Part 1 for every θ1 6= θ2 . 4 / 181 1 Introduction - Fisher information Definition 1 Score function: S(θ) = ∂ log L(θ) ∂θ 2 Fisher information = curvature of the log-likelihood: I(θ) = − ∂ ∂2 log L(θ) = − T S(θ) ∂θ∂θT ∂θ 3 Observed (Fisher) information: I(θ̂n ) where θ̂n is the MLE. 4 Expected (Fisher) information: I(θ) = Eθ {I(θ)} • The observed information is always positive. The observed information applies to a single dataset. • The expected information is meaningful as a function of θ across the admissible values of θ. The expected information is an average quantity over all possible datasets. • I(θ̂) = I(θ̂) for exponential family. Jae-Kwang Kim (ISU) Part 1 5 / 181 1 Introduction - Fisher information Lemma 1. Properties of score functions [Theorem 2.3 of KS] Under regularity conditions allowing the exchange of the order of integration and differentiation, Eθ {S(θ)} = 0 and Vθ {S(θ)} = I(θ). Jae-Kwang Kim (ISU) Part 1 6 / 181 1 Introduction - Fisher information Remark • Under some regularity conditions, the MLE θ̂ converges in probability to the true parameter θ0 . • Thus, we can apply a Taylor linearization on S(θ̂) = 0 to get −1 θ̂ − θ0 ∼ = {I(θ0 )} S(θ0 ). Here, we use the fact that I (θ) = −∂S(θ)/∂θT converges in probability to I(θ). • Thus, the (asymptotic) variance of MLE is V (θ̂) . = {I(θ0 )}−1 V {S(θ0 )} {I(θ0 )}−1 = {I(θ0 )}−1 , where the last equality follows from Lemma 1. Jae-Kwang Kim (ISU) Part 1 7 / 181 2 Observed Likelihood Basic Setup • Let y = (y1 , . . . , yp ) be a p-dimensional random vector with probability density function f (y; θ) whose dominating measure is µ. • Let δij be the response indicator function of yij with δij = 1 0 if yij is observed otherwise. • δ i = (δi1 , · · · , δip ): p-dimensional random vector with density P(δ | y) assuming P(δ|y) = P(δ|y; φ) for some φ. • Let (yi,obs , yi,mis ) be the observed part and missing part of yi , respectively. • Let R(yobs , δ) = {y; yobs (yi , δ i ) = yi,obs , i = 1, . . . , n} be the set of all possible values of y with the same realized value of yobs , for given δ, where yobs (yi , δ i ) is a function that gives the value of yij for δij = 1. Jae-Kwang Kim (ISU) Part 1 8 / 181 2 Observed Likelihood Definition: Observed likelihood Under the above setup, the observed Z likelihood of (θ, φ) is Lobs (θ, φ) = f (y; θ)P(δ|y; φ)dµ(y). R(yobs ,δ ) Under IID setup: The observed likelihood is n Z Y Lobs (θ, φ) = f (yi ; θ)P(δ i |yi ; φ)dµ(yi,mis ) , i=1 where it is understood that, if yi = yi,obs and yi,mis is empty then there is nothing to integrate out. 
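To make this definition concrete before the scalar special case below, here is a minimal numerical sketch (not from the text): a scalar outcome y ~ N(θ, 1) with a hypothetical logistic response model π(y; φ) = P(δ = 1 | y), with φ treated as known purely for illustration, and each missing yi integrated out by quadrature as in the definition of Lobs.

# Minimal sketch: evaluate the observed log-likelihood log Lobs(theta, phi)
# for scalar y_i ~ N(theta, 1) with a hypothetical known logistic response
# model pi(y; phi).  Missing y_i are integrated out numerically.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad
from scipy.special import expit

def log_Lobs(theta, phi, y, delta):
    ll = 0.0
    for yi, di in zip(y, delta):
        if di == 1:   # observed unit: f(y; theta) * pi(y; phi)
            ll += norm.logpdf(yi, theta, 1) + np.log(expit(phi[0] + phi[1] * yi))
        else:         # missing unit: integrate f(y; theta) * {1 - pi(y; phi)} dy
            val, _ = quad(lambda u: norm.pdf(u, theta, 1)
                                    * (1 - expit(phi[0] + phi[1] * u)),
                          -np.inf, np.inf)
            ll += np.log(val)
    return ll

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=50)
delta = rng.binomial(1, expit(0.5 + 0.5 * y))      # response may depend on y
y_obs = np.where(delta == 1, y, np.nan)            # missing y_i are not used
print(log_Lobs(1.0, (0.5, 0.5), y_obs, delta))

In practice (θ, φ) would be estimated jointly from log Lobs; the sketch only evaluates the integral that defines it.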
• In the special case of scalar y , the observed likelihood is Y Y Z f (y ; θ) {1 − π(y ; φ)} dy , [f (yi ; θ)π(yi ; φ)] × Lobs (θ, φ) = δi =1 δi =0 where π(y ; φ) = P(δ = 1|y ; φ). Jae-Kwang Kim (ISU) Part 1 9 / 181 2 Observed Likelihood Example 1 [Example 2.3 of KS] Let t1 , t2 , · · · , tn be an IID sample from a distribution with density fθ (t) = θe −θt I (t > 0). Instead of observing ti , we observe (yi , δi ) where ti if δi = 1 yi = c if δi = 0 and δi = if ti ≤ c if ti > c, 1 0 where c is a known censoring time. The observed likelihood for θ can be derived as Lobs (θ) = n h i Y {fθ (ti )}δi {P (ti > c)}1−δi i=1 = θ Pn i=1 δi exp(−θ n X yi ). i=1 Jae-Kwang Kim (ISU) Part 1 10 / 181 2 Observed Likelihood Definition: Missing At Random (MAR) P(δ|y) is the density of the conditional distribution of δ given y. Let yobs = yobs (y, δ) where yi if δi = 1 yi,obs = ∗ if δi = 0. The response mechanism is MAR if P (δ|y1 ) = P (δ|y2 ) { or P(δ|y) = P(δ|yobs )} for all y1 and y2 satisfying yobs (y1 , δ) = yobs (y2 , δ). • MAR: the response mechanism P(δ|y) depends on y only through yobs . • Let y = (yobs , ymis ). By Bayes theorem, P (ymis |yobs , δ) = P(δ|ymis , yobs ) P (ymis |yobs ) . P(δ|yobs ) • MAR: P(ymis |yobs , δ) = P(ymis |yobs ). That is, ymis ⊥ δ | yobs . • MAR: the conditional independence of δ and ymis given yobs . Jae-Kwang Kim (ISU) Part 1 11 / 181 2 Observed Likelihood Remark • MCAR (Missing Completely at random): P(δ | y) does not depend on y. • MAR (Missing at random): P(δ | y) = P(δ | yobs ) • NMAR (Not Missing at random): P(δ | y) 6= P(δ | yobs ) • Thus, MCAR is a special case of MAR. Jae-Kwang Kim (ISU) Part 1 12 / 181 2 Observed Likelihood Theorem 1: Likelihood factorization (Rubin, 1976) [Theorem 2.4 of KS] Pφ (δ|y) is the joint density of δ given y and fθ (y) is the joint density of y. Under conditions 1 the parameters θ and φ are distinct and 2 MAR condition holds, the observed likelihood can be written as Lobs (θ, φ) = L1 (θ)L2 (φ), and the MLE of θ can be obtained by maximizing L1 (θ). Thus, we do not have to specify the model for response mechanism. The response mechanism is called ignorable if the above likelihood factorization holds. Jae-Kwang Kim (ISU) Part 1 13 / 181 2 Observed Likelihood Example 2 [Example 2.4 of KS] • Bivariate data (xi , yi ) with pdf f (x, y ) = f1 (y | x)f2 (x). • xi is always observed and yi is subject to missingness. • Assume that the response status variable δi of yi satisfies P (δi = 1 | xi , yi ) = Λ1 (φ0 + φ1 xi + φ2 yi ) for some function Λ1 (·) of known form. • Let θ be the parameter of interest in the regression model f1 (y | x; θ). Let α be the parameter in the marginal distribution of x, denoted by f2 (xi ; α). Define Λ0 (x) = 1 − Λ1 (x). • Three parameters • θ: parameter of interest • α and φ: nuisance parameter Jae-Kwang Kim (ISU) Part 1 14 / 181 2 Observed Likelihood Example 2 (Cont’d) • Observed likelihood Lobs (θ, α, φ) = Y f1 (yi | xi ; θ) f2 (xi ; α) Λ1 (φ0 + φ1 xi + φ2 yi ) δi =1 × YZ f1 (y | xi ; θ) f2 (xi ; α) Λ0 (φ0 + φ1 xi + φ2 y ) dy δi =0 = where L2 (α) = Qn i=1 f2 L1 (θ, φ) × L2 (α) (xi ; α) . • Thus, we can safely ignore the marginal distribution of x if x is completely observed. Jae-Kwang Kim (ISU) Part 1 15 / 181 2 Observed Likelihood Example 2 (Cont’d) • If φ2 = 0, then MAR holds and L1 (θ, φ) = L1a (θ) × L1b (φ) where L1a (θ) = Y f1 (yi | xi ; θ) δi =1 and L1b (φ) = Y Λ1 (φ0 + φ1 xi ) × δi =1 Y Λ0 (φ0 + φ1 xi ) . 
δi =0 • Thus, under MAR, the MLE of θ can be obtained by maximizing L1a (θ), which is obtained by ignoring the missing part of the data. Jae-Kwang Kim (ISU) Part 1 16 / 181 2 Observed Likelihood Example 2 (Cont’d) • Instead of yi subject to missingness, if xi is subject to missingness, then the observed likelihood becomes Y Lobs (θ, φ, α) = f1 (yi | xi ; θ) f2 (xi ; α) Λ1 (φ0 + φ1 xi + φ2 yi ) δi =1 × YZ f1 (yi | x; θ) f2 (x; α) Λ0 (φ0 + φ1 x + φ2 yi ) dx δi =0 6= L1 (θ, φ) × L2 (α) . • If φ1 = 0 then Lobs (θ, α, φ) = L1 (θ, α) × L2 (φ) and MAR holds. Although we are not interested in the marginal distribution of x, we still need to specify the model for the marginal distribution of x. Jae-Kwang Kim (ISU) Part 1 17 / 181 3 Mean Score Approach • The observed likelihood is the marginal density of (yobs , δ). • The observed likelihood is Z Lobs (η) = Z f (y; θ)P(δ|y; φ)dµ(y) = R(yobs ,δ ) f (y; θ)P(δ|y; φ)dµ(ymis ) where ymis is the missing part of y and η = (θ, φ). • Observed score equation: Sobs (η) ≡ ∂ log Lobs (η) = 0 ∂η • Computing the observed score function can be computationally challenging because the observed likelihood is an integral form. Jae-Kwang Kim (ISU) Part 1 18 / 181 3 Mean Score Approach Theorem 2: Mean Score Theorem (Fisher, 1922) [Theorem 2.5 of KS] Under some regularity conditions, the observed score function equals to the mean score function. That is, Sobs (η) = S̄(η) where S̄(η) = Scom (η) = f (y, δ; η) = E{Scom (η)|yobs , δ} ∂ log f (y, δ; η), ∂η f (y; θ)P(δ|y; φ). • The mean score function is computed by taking the conditional expectation of the complete-sample score function given the observation. • The mean score function is easier to compute than the observed score function. Jae-Kwang Kim (ISU) Part 1 19 / 181 3 Mean Score Approach Proof of Theorem 2 Since Lobs (η) = f (y, δ; η) /f (y, δ | yobs , δ; η), we have ∂ ∂ ∂ ln Lobs (η) = ln f (y, δ; η) − ln f (y, δ | yobs , δ; η) , ∂η ∂η ∂η taking a conditional expectation of the above equation over the conditional distribution of (y, δ) given (yobs , δ), we have ∂ ∂ ln Lobs (η) = E ln Lobs (η) | yobs , δ ∂η ∂η ∂ = E {Scom (η) | yobs , δ} − E ln f (y, δ | yobs , δ; η) | yobs , δ . ∂η Here, the first equality holds because Lobs (η) is a function of (yobs , δ) only. The last term is equal to zero by Lemma 1, which states that the expected value of the score function is zero and the reference distribution in this case is the conditional distribution of (y, δ) given (yobs , δ). Jae-Kwang Kim (ISU) Part 1 20 / 181 3 Mean Score Approach Example 3 [Example 2.6 of KS] 1 Suppose that the study variable y follows from a normal distribution with mean x0 β and variance σ 2 . The score equations for β and σ 2 under complete response are S1 (β, σ 2 ) = n X yi − x0i β xi /σ 2 = 0 i=1 and S2 (β, σ 2 ) = −n/(2σ 2 ) + n X yi − x0i β 2 /(2σ 4 ) = 0. i=1 2 Assume that yi are observed only for the first r elements and the MAR assumption holds. In this case, the mean score function reduces to S̄1 (β, σ 2 ) = r X yi − x0i β xi /σ 2 i=1 and S̄2 (β, σ 2 ) = −n/(2σ 2 ) + r X yi − x0i β 2 /(2σ 4 ) + (n − r )/(2σ 2 ). i=1 Jae-Kwang Kim (ISU) Part 1 21 / 181 3 Mean Score Approach Example 3 (Cont’d) 3 The maximum likelihood estimator obtained by solving the mean score equations is β̂ = r X !−1 xi x0i i=1 and σ̂ 2 = r X xi yi i=1 r 2 1 X yi − x0i β̂ . r i=1 Thus, the resulting estimators can be also obtained by simply ignoring the missing part of the sample, which is consistent with the result in Example 2 (for φ2 = 0). 
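A minimal simulation sketch of Example 3 (all data-generating values are hypothetical): under MAR, the mean score equations above are solved by the complete-case least squares and residual-variance estimates, i.e., by ignoring the missing part of the sample.

# Sketch of Example 3: under MAR, solving the mean score equations reduces
# to ordinary least squares / MLE computed on the respondents only.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n)])    # design matrix with intercept
beta_true, sigma2_true = np.array([1.0, 2.0]), 0.25
y = x @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=n)
p_resp = 1 / (1 + np.exp(-(0.5 + x[:, 1])))              # MAR: depends on x only
delta = rng.binomial(1, p_resp).astype(bool)

xr, yr = x[delta], y[delta]                               # respondents
beta_hat = np.linalg.solve(xr.T @ xr, xr.T @ yr)          # solves mean score for beta
sigma2_hat = np.mean((yr - xr @ beta_hat) ** 2)           # solves mean score for sigma^2
print(beta_hat, sigma2_hat)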
Jae-Kwang Kim (ISU) Part 1 22 / 181 3 Mean Score Approach Discussion of Example 3 • We are interested in estimating θ for the conditional density f (y | x; θ). • Under MAR, the observed likelihood for θ is Lobs (θ) = r Y f (yi | xi ; θ) × i=1 n Z Y f (y | xi ; θ)dµ(y ) = i=r +1 r Y f (yi | xi ; θ). i=1 • The same conclusion can follow from the mean score theorem. Under MAR, the mean score function is S̄(θ) = r X S(θ; xi , yi ) + i=1 = r X n X E {S(θ; xi , Y ) | xi } i=r +1 S(θ; xi , yi ) i=1 where S(θ; x, y ) is the score function for θ and the second equality follows from Lemma 1. Jae-Kwang Kim (ISU) Part 1 23 / 181 3 Mean Score Approach Example 4 [Example 2.5 of KS] 1 Suppose that the study variable y is randomly distributed with Bernoulli distribution with probability of success pi , where pi = pi (β) = exp (x0i β) 1 + exp (x0i β) for some unknown parameter β and xi is a vector of the covariates in the logistic regression model for yi . We assume that 1 is in the column space of xi . 2 Under complete response, the score function for β is S1 (β) = n X (yi − pi (β)) xi . i=1 Jae-Kwang Kim (ISU) Part 1 24 / 181 3 Mean Score Approach Example 4 (Cont’d) 3 Let δi be the response indicator function for yi with distribution Bernoulli(πi ) where πi = exp (x0i φ0 + yi φ1 ) . 1 + exp (x0i φ0 + yi φ1 ) We assume that xi is always observed, but yi is missing if δi = 0. 4 Under missing data, the mean score function for β is S̄1 (β, φ) = X {yi − pi (β)} xi + 1 XX wi (y ; β, φ) {y − pi (β)} xi , δi =0 y =0 δi =1 where wi (y ; β, φ) is the conditional probability of yi = y given xi and δi = 0: wi (y ; β, φ) = Pβ (yi = y | xi ) Pφ (δi = 0 | yi = y , xi ) P1 z=0 Pβ (yi = z | xi ) Pφ (δi = 0 | yi = z, xi ) Thus, S̄1 (β, φ) is also a function of φ. Jae-Kwang Kim (ISU) Part 1 25 / 181 3 Mean Score Approach Example 4 (Cont’d) 5 If the response mechanism is MAR so that φ1 = 0, then wi (y ; β, φ) Pβ (yi = y | xi ) = Pβ (yi = y | xi ) P1 z=0 Pβ (yi = z | xi ) = and so S̄1 (β, φ) = X {yi − pi (β)} xi = S̄1 (β) . δi =1 6 If MAR does not hold, then (β̂, φ̂) can be obtained by solving S̄1 (β, φ) = 0 and S̄2 (β, φ) = 0 jointly, where X S̄2 (β, φ) = {δi − π (φ; xi , yi )} (xi , yi ) δi =1 + 1 XX wi (y ; β, φ) {δi − πi (φ; xi , y )} (xi , y ) . δi =0 y =0 Jae-Kwang Kim (ISU) Part 1 26 / 181 3 Mean Score Approach Discussion of Example 4 • We maynot have a uniquesolution to S̄(η) = 0, where S̄(η) = S̄1 (β, φ) , S̄2 (β, φ) when MAR does not hold, because of the non-identifiability problem associated with non-ignorable missing. • To avoid this problem, often a reduced model is used for the response model. Pr (δ = 1 | x, y ) = Pr (δ = 1 | u, y ) where x = (u, z). The reduced response model introduces a smaller set of parameters and the over-identified situation can be resolved. (More discussion will be made in Part 4 lecture.) • Computing the solution to S̄(η) is also difficult. EM algorithm, which will be presented soon, is a useful computational tool. Jae-Kwang Kim (ISU) Part 1 27 / 181 4 Observed information Definition 1 Observed score function: Sobs (η) = ∂ ∂η log Lobs (η) 2 ∂ 2 Fisher information from observed likelihood: Iobs (η) = − ∂η∂η T log Lobs (η) 3 Expected (Fisher) information from observed likelihood: Iobs (η) = Eη {Iobs (η)}. Lemma 2 [Theorem 2.6 of KS] Under regularity conditions, E{Sobs (η)} = 0, and V{Sobs (η)} = Iobs (η), where Iobs (η) = Eη {Iobs (η)} is the expected information from the observed likelihood. 
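As a quick numerical check of Lemma 2, the following sketch uses the censored exponential model of Example 1, for which Sobs(θ) = Σ δi/θ − Σ yi and Iobs(θ) = n(1 − e^(−θc))/θ² (these expressions appear in the "Return to Example 1" slide below); the censoring time and sample sizes are hypothetical.

# Sketch: verify E{S_obs(theta0)} = 0 and V{S_obs(theta0)} = I_obs(theta0)
# by simulation for the censored exponential model of Example 1.
import numpy as np

rng = np.random.default_rng(2)
theta0, c, n, n_sim = 2.0, 0.5, 200, 5000
scores = []
for _ in range(n_sim):
    t = rng.exponential(1 / theta0, size=n)
    y = np.minimum(t, c)                        # observe y = min(t, c)
    delta = (t <= c).astype(float)              # delta = 1 if uncensored
    scores.append(delta.sum() / theta0 - y.sum())   # S_obs(theta0)
scores = np.array(scores)
I_obs_expected = n * (1 - np.exp(-theta0 * c)) / theta0 ** 2
print(scores.mean(), scores.var(), I_obs_expected)   # mean ~ 0, variance ~ I_obs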
Jae-Kwang Kim (ISU) Part 1 28 / 181 4 Observed information • Under missing data, the MLE η̂ is the solution to S̄(η) = 0. • Under some regularity conditions, η̂ converges in probability to η0 and has the asymptotic variance {Iobs (η0 )}−1 with n o n o ∂ ⊗2 Iobs (η) = E − T Sobs (η) = E Sobs (η) = E S̄ ⊗2 (η) ∂η & B ⊗2 = BB T . • For variance estimation of η̂, may use {Iobs (η̂)}−1 . • IID setup: The empirical (Fisher) information for the variance of η̂ is h i−1 Ĥ(η̂) = ( n X )−1 S̄i⊗2 (η̂) i=1 where S̄i (η) = E{Si (η)|yi,obs , δ i } (Redner and Walker, 1984). • In general, Iobs (η̂) is preferred to Iobs (η̂) for variance estimation of η̂. Jae-Kwang Kim (ISU) Part 1 29 / 181 4 Observed information Return to Example 1 • Observed score function Sobs (θ) = n X δi log(θ) − θ i=1 • MLE for θ: • Fisher information: Iobs (θ) = n X yi i=1 Pn yi θ̂ = Pi=1 n i=1 δi Pn δi /θ2 P • Expected information: Iobs (θ) = ni=1 (1 − e −θc )/θ2 = n(1 − e −θc )/θ2 . i=1 Which one do you prefer ? Jae-Kwang Kim (ISU) Part 1 30 / 181 4 Observed information Motivation • Lcom (η) = f (y, δ; η): complete-sample likelihood with no missing data • Fisher information associated with Lcom (η): Icom (η) = − ∂ ∂2 Scom (η) = − log Lcom (η) T ∂η ∂η∂η T • Lobs (η): the observed likelihood • Fisher information associated with Lobs (η): Iobs (η) = − ∂2 ∂ Sobs (η) = − log Lobs (η) T ∂η ∂η∂η T • How to express Iobs (η) in terms of Icom (η) and Scom (η) ? Jae-Kwang Kim (ISU) Part 1 31 / 181 4 Observed information Theorem 3 (Louis, 1982; Oakes, 1999) [Theorem 2.7 of KS] Under regularity conditions allowing the exchange of the order of integration and differentiation, h i ⊗2 Iobs (η) = E{Icom (η)|yobs , δ} − E{Scom (η)|yobs , δ} − S̄(η)⊗2 = E{Icom (η)|yobs , δ} − V {Scom (η)|yobs , δ}, where S̄(η) = E{Scom (η)|yobs , δ}. Jae-Kwang Kim (ISU) Part 1 32 / 181 Proof of Theorem 3 By Theorem 2, the observed information associated with Lobs (η) can be expressed as Iobs (η) = − ∂ S̄ (η) ∂η 0 where S̄ (η) = E {Scom (η) | yobs , δ; η}. Thus, we have Z ∂ ∂ S̄ (η) = Scom (η; y)f (y, δ | yobs , δ; η) dµ(y) ∂η 0 ∂η 0 Z ∂ S (η; y) f (y, δ | yobs , δ; η) dµ(y) = com ∂η 0 Z ∂ + Scom (η; y) f (y, δ | y , δ; η) dµ(y) obs ∂η 0 = E ∂Scom (η)/∂η 0 | yobs , δ Z ∂ + Scom (η; y) log f (y, δ | y , δ; η) f (y, δ | yobs , δ; η) dµ(y). obs ∂η 0 The first term is equal to −E {Icom (η) | yobs , δ} and the second term is equal to E Scom (η)Smis (η)0 | yobs , δ = E S̄(η) + Smis (η) Smis (η)0 | yobs , δ = E Smis (η)Smis (η)0 | yobs , δ because E S̄(η)Smis (η)0 | yobs , δ = 0. Jae-Kwang Kim (ISU) Part 1 33 / 181 4. Observed information • Smis (η): the score function with the conditional density f (y, δ|yobs , δ). n o • Expected missing information: Imis (η) = E − ∂η∂T Smis (η) satisfying Imis (η) = E Smis (η)⊗2 . • Missing information principle (Orchard and Woodbury, 1972): Imis (η) = Icom (η) − Iobs (η), where Icom (η) = E −∂Scom (η)/∂η T is the expected information with complete-sample likelihood . • An alternative expression of the missing information principle is V{Smis (η)} = V{Scom (η)} − V{S̄(η)}. Note that V{Scom (η)} = Icom (η) and V{Sobs (η)} = Iobs (η). Jae-Kwang Kim (ISU) Part 1 34 / 181 4. Observed information Example 5 1 Consider the following bivariate normal distribution: y1i y2i ∼N µ1 µ2 σ11 , σ12 σ12 σ22 , for i = 1, 2, · · · , n. Assume for simplicity that σ11 , σ12 and σ22 are known constants and µ = (µ1 , µ2 )0 be the parameter of interest. 
2 The complete sample score function for µ is Scom (µ) = n X i=1 (i) Scom n X σ11 (µ) = σ12 i=1 σ12 σ22 −1 y1i − µ1 y2i − µ2 . The information matrix of µ based on the complete sample is −1 σ11 σ12 Icom (µ) = n . σ12 σ22 Jae-Kwang Kim (ISU) Part 1 35 / 181 4. Observed information Example 5 (Cont’d) 3 Suppose that there are some missing values in y1i and y2i and the original sample is partitioned into four sets: H = both y1 and y2 respond K = only y1 is observed L = only y2 is observed M = both y1 and y2 are missing. Let nH , nK , nL , nM represent the size of H, K , L, M, respectively. 4 Assume that the response mechanism does not depend on the value of (y1 , y2 ) and so it is MAR. In this case, the observed observation in set K is n o (i) E Scom (µ) | y1i , i ∈ K = = Jae-Kwang Kim (ISU) score function of µ based on a single σ11 σ12 σ12 σ22 −1 −1 σ11 (y1i − µ1 ) 0 Part 1 y1i − µ1 E (y2i | y1i ) − µ2 . 36 / 181 4. Observed information Example 5 (Cont’d) 5 Similarly, we have n o (i) E Scom (µ) | y2i , i ∈ L = 0 −1 (y2i − µ2 ) σ22 . 6 Therefore, and the observed information matrix of µ is Iobs (µ) = nH σ11 σ12 σ12 σ22 −1 + nK −1 σ11 0 0 0 + nL 0 0 0 −1 σ22 and the asymptotic variance of the MLE of µ can be obtained by the inverse of Iobs (µ). Jae-Kwang Kim (ISU) Part 1 37 / 181 5. EM algorithm • Interested in finding η̂ that maximizes Lobs (η). The MLE can be obtained by solving Sobs (η) = 0, which is equivalent to solving S̄(η) = 0 by Theorem 2. • Computing the solution S̄(η) = 0 can be challenging because it often involves computing Iobs (η) = −∂ S̄(η)/∂η 0 in order to apply Newton method: n o−1 η̂ (t+1) = η̂ (t) + Iobs (η̂ (t) ) S̄(η̂ (t) ). We may rely on Louis formula (Theorem 3) to compute Iobs (η). • EM algorithm provides an alternative method of solving S̄(η) = 0 by writing S̄(η) = E {Scom (η) | yobs , δ; η} and using the following iterative method: n o η̂ (t+1) ← solve E Scom (η) | yobs , δ; η̂ (t) = 0. Jae-Kwang Kim (ISU) Part 1 38 / 181 5. EM algorithm Definition Let η (t) be the current value of the parameter estimate of η. The EM algorithm can be defined as iteratively carrying out the following E-step and M-steps: • E-step: Compute n o Q η | η (t) = E ln f (y, δ; η) | yobs , δ, η (t) • M-step: Find η (t+1) that maximizes Q(η | η (t) ) w.r.t. η. Theorem 4 (Dempster et al., 1977) [Theorem 3.2] Let Lobs (η) = Q(η (t+1) (t) R f (y, δ; η) dµ(y) be the observed likelihood of η. If R(yobs ,δ) (t) | η ) ≥ Q(η Jae-Kwang Kim (ISU) | η (t) ), then Lobs (η (t+1) ) ≥ Lobs (η (t) ). Part 1 39 / 181 5. EM algorithm Remark 1 Convergence of EM algorithm is linear. It can be shown that (t) (t−1) η (t+1) − η (t) ∼ = Jmis η − η −1 where Jmis = Icom Imis is called the fraction of missing information. 2 Under MAR and for the exponential family of the distribution of the form f (y; θ) = b (y) exp θ0 T (y) − A (θ) . The M-step computes θ(t+1) by the solution to n o E T (y) | yobs , θ(t) = E {T (y) | θ} . Jae-Kwang Kim (ISU) Part 1 40 / 181 h1(θ) 0.5 1.0 1.5 2.0 2.5 5. EM algorithm h2(θ) 0.2 0.4 0.6 0.8 1.0 θ Figure : Illustration of EM algorithm for exponential family (h1 (θ) = E {T (y) | yobs , θ}, h2 (θ) = E {T (y) | θ}) Jae-Kwang Kim (ISU) Part 1 41 / 181 5. 
EM algorithm Return to Example 4 • E-step: S̄1 β | β (t) , φ(t) = X {yi − pi (β)} xi + 1 XX wij(t) {j − pi (β)} xi , δi =0 j=0 δi =1 where wij(t) = Pr (Yi = j | xi , δi = 0; β (t) , φ(t) ) = P1 Pr (Yi = j | xi ; β (t) )Pr (δi = 0 | xi , j; φ(t) ) (t) (t) y =0 Pr (Yi = y | xi ; β )Pr (δi = 0 | xi , y ; φ ) and S̄2 φ | β (t) , φ(t) = X {δi − π (xi , yi ; φ)} x0i , yi 0 δi =1 + 1 XX wij(t) {δi − πi (xi , j; φ)} x0i , j 0 . δi =0 j=0 Jae-Kwang Kim (ISU) Part 1 42 / 181 5. EM algorithm Return to Example 4 (Cont’d) • M-step: The parameter estimates are updated by solving h i S̄1 β | β (t) , φ(t) , S̄2 φ | β (t) , φ(t) = (0, 0) for β and φ. • For categorical missing data, the conditional expectation in the E-step can be computed using the weighted mean with weights wij(t) . Ibrahim (1990) called this method EM by weighting. • Observed information matrix can also be obtained by the Louis formula (in Theorem 3) using the weighted mean in the E-step. Jae-Kwang Kim (ISU) Part 1 43 / 181 5. EM algorithm Example 6. Mixture model [Example 3.8 of KS] • Observation Yi = (1 − Wi ) Z1i + Wi Z2i , i = 1, 2, · · · , n where • Parameter of interest Jae-Kwang Kim (ISU) Z1i ∼ Z2i ∼ N µ1 , σ12 N µ2 , σ22 Wi ∼ Bernoulli (π) . θ = µ1 , µ2 , σ12 , σ22 , π Part 1 44 / 181 5. EM algorithm Example 6 (Cont’d) • Observed likelihood Lobs (θ) = n Y pdf (yi | θ) i=1 where pdf (y | θ) = (1 − π) φ y | µ1 , σ12 + πφ y | µ2 , σ22 and Jae-Kwang Kim (ISU) (y − µ)2 1 φ y | µ, σ 2 = √ exp − . 2σ 2 2πσ Part 1 45 / 181 5. EM algorithm Example 6 (Cont’d) • Full sample likelihood Lfull (θ) = n Y pdf (yi , wi | θ) i=1 where h i1−w h iw pdf (y , w | θ) = φ y | µ1 , σ12 φ y | µ2 , σ22 π w (1 − π)1−w . Thus, ln Lfull (θ) = n h X i (1 − wi ) ln φ yi | µ1 , σ12 + wi ln φ yi | µ2 , σ22 i=1 + n X {wi ln (π) + (1 − wi ) ln (1 − π)} i=1 Jae-Kwang Kim (ISU) Part 1 46 / 181 5. EM algorithm Example 6 (Cont’d) • E-M algorithm [E-step] Q θ | θ(t) = n h X (t) 1 − ri i (t) ln φ yi | µ1 , σ12 + ri ln φ yi | µ2 , σ22 i=1 + n n o X (t) (t) ln (1 − π) ri ln (π) + 1 − ri i=1 (t) where ri = E wi | yi , θ(t) E (wi | yi , θ) = with πφ yi | µ2 , σ22 (1 − π) φ (yi | µ1 , σ12 ) + πφ (yi | µ2 , σ22 ) [M-step] ∂ Q θ | θ(t) = 0. ∂θ Jae-Kwang Kim (ISU) Part 1 47 / 181 5. EM algorithm Example 7. (Robust regression) [Example 3.12 of KS] • Model: yi = β0 + β1 xi + σei with ei ∼ t(ν), ν: known. • Missing data setup: √ ei = ui / wi where ui ∼ N(0, 1), wi ∼ χ2 /ν. • (xi , yi , wi ): complete data (xi , yi always observed, wi always missing) yi | (xi , wi ) ∼ N β0 + β1 xi , σ 2 /wi and xi is fixed (thus, independent of wi ). • EM algorithm can be used to estimate θ = (β0 , β1 , σ 2 ) Jae-Kwang Kim (ISU) Part 1 48 / 181 5. EM algorithm Example 7 (Cont’d) • E-step: Find the conditional distribution of wi given xi . By Bayes theorem, f (wi | xi , yi ) ∝ f (wi )f (yi | wi , xi ) ( 2 )−1 ν + 1 y − β − β x 0 1 i i ∼ Gamma ,2 ν + 2 σ Thus, E (wi | xi , yi , θ(t) ) = (t) where di (t) ν+1 , (t) 2 ν + di (t) = (yi − β0 − β1 xi )/σ (t) . Jae-Kwang Kim (ISU) July 5th, 2014 49 / 181 5. EM algorithm Example 7 (Cont’d) • M-step: (t) (t) µx , µy = (t+1) β0 = (t+1) = β1 σ 2(t+1) = (t) where wi n X (t) wi n X (t) (xi , yi ) /( wi ) i=1 (t) µy i=1 (t+1) (t) − β1 µx Pn (t) (t) (t) i=1 wi (xi − µx )(yi − µy ) Pn (t) (t) 2 i=1 wi (xi − µx ) n 2 1 X (t) (t) (t) wi yi − β0 − β1 xi n i=1 = E (wi | xi , yi , θ(t) ). Jae-Kwang Kim (ISU) July 5th, 2014 50 / 181 6. Summary • Interested in finding the MLE that maximizes the observed likelihood function. 
• Under MAR, the model specification of the response mechanism is not necessary. • Mean score equation can be used to compute the MLE. • EM algorithm is a useful computational tool for solving the mean score equation. • The E-step of the EM algorithm may require some computational tool (see Part 2). • The asymptotic variance of the MLE can be computed by the inverse of the observed information matrix, which can be computed using Louis formula. Jae-Kwang Kim (ISU) July 5th, 2014 51 / 181 REFERENCES Cheng, P. E. (1994), ‘Nonparametric estimation of mean functionals with data missing at random’, Journal of the American Statistical Association 89, 81–87. Dempster, A. P., N. M. Laird and D. B. Rubin (1977), ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society: Series B 39, 1–37. Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’, Philosophical Transactions of the Royal Society of London A 222, 309–368. Fuller, W. A., M. M. Loughin and H. D. Baker (1994), ‘Regression weighting in the presence of nonresponse with application to the 1987-1988 Nationwide Food Consumption Survey’, Survey Methodology 20, 75–85. Hirano, K., G. Imbens and G. Ridder (2003), ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica 71, 1161–1189. Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’, Journal of the American Statistical Association 85, 765–769. Jae-Kwang Kim (ISU) July 5th, 2014 51 / 181 Kim, J. K. (2011), ‘Parametric fractional imputation for missing data analysis’, Biometrika 98, 119–132. Kim, J. K. and C. L. Yu (2011), ‘A semi-parametric estimation of mean functionals with non-ignorable missing data’, Journal of the American Statistical Association 106, 157–165. Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), ‘On the bias of the multiple imputation variance estimator in survey sampling’, Journal of the Royal Statistical Society: Series B 68, 509–521. Kim, J. K. and M. K. Riddles (2012), ‘Some theory for propensity-score-adjustment estimators in survey sampling’, Survey Methodology 38, 157–165. Kott, P. S. and T. Chang (2010), ‘Using calibration weighting to adjust for nonignorable unit nonresponse’, Journal of the American Statistical Association 105, 1265–1275. Louis, T. A. (1982), ‘Finding the observed information matrix when using the EM algorithm’, Journal of the Royal Statistical Society: Series B 44, 226–233. Meng, X. L. (1994), ‘Multiple-imputation inferences with uncongenial sources of input (with discussion)’, Statistical Science 9, 538–573. Jae-Kwang Kim (ISU) July 5th, 2014 51 / 181 Oakes, D. (1999), ‘Direct calculation of the information matrix via the em algorithm’, Journal of the Royal Statistical Society: Series B 61, 479–482. Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and applications, in ‘Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability’, Vol. 1, University of California Press, Berkeley, California, pp. 695–715. Redner, R. A. and H. F. Walker (1984), ‘Mixture densities, maximum likelihood and the EM algorithm’, SIAM Review 26, 195–239. Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), ‘Estimation of regression coefficients when some regressors are not always observed’, Journal of the American Statistical Association 89, 846–866. Robins, J. M. and N. Wang (2000), ‘Inference for imputation estimators’, Biometrika 87, 113–124. Rubin, D. B. 
(1976), ‘Inference and missing data’, Biometrika 63, 581–590. Tanner, M. A. and W. H. Wong (1987), ‘The calculation of posterior distribution by data augmentation’, Journal of the American Statistical Association 82, 528–540. Jae-Kwang Kim (ISU) July 5th, 2014 51 / 181 Wang, N. and J. M. Robins (1998), ‘Large-sample theory for parametric multiple imputation procedures’, Biometrika 85, 935–948. Wang, S., J. Shao and J. K. Kim (2014), ‘Identifiability and estimation in problems with nonignorable nonresponse’, Statistica Sinica 24, 1097 – 1116. Wei, G. C. and M. A. Tanner (1990), ‘A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms’, Journal of the American Statistical Association 85, 699–704. Zhou, M. and J. K. Kim (2012), ‘An efficient method of estimation for longitudinal surveys with monotone missing data’, Biometrika 99, 631–648. Statistical Methods for Handling Missing Data Part 2: Imputation Jae-Kwang Kim Department of Statistics, Iowa State University Jae-Kwang Kim (ISU) Part 2 52 / 181 Introduction Basic setup • Y: a vector of random variables with distribution F (y; θ). • y1 , · · · , yn are n independent realizations of Y. • We are interested in estimating ψ which is implicitly defined by E {U(ψ; Y)} = 0. • Under complete observation, a consistent estimator ψ̂n of ψ can be obtained by solving estimating equation for ψ: n X U(ψ; yi ) = 0. i=1 • A special case of estimating function is the score function. In this case, ψ = θ. • Sandwich variance estimator is often used to estimate the variance of ψ̂n : V̂ (ψ̂n ) = τ̂u−1 V̂ (U)τ̂u−1 0 where τu = E {∂U(ψ; y)/∂ψ 0 }. Jae-Kwang Kim (ISU) Part 2 53 / 181 1. Introduction Missing data setup • Suppose that yi is not fully observed. • yi = (yobs,i , ymis,i ): (observed, missing) part of yi • δ i : response indicator functions for yi . • Under the existence of missing data, we can use the following estimators: ψ̂: solution to n X E {U (ψ; yi ) | yobs,i , δ i } = 0. (1) i=1 • The equation in (1) is often called expected estimating equation. Jae-Kwang Kim (ISU) Part 2 54 / 181 1. Introduction Motivation (for imputation) Computing the conditional expectation in (1) can be a challenging problem. 1 The conditional expectation depends on unknown parameter values. That is, E {U (ψ; yi ) | yobs,i , δ i } = E {U (ψ; yi ) | yobs,i , δ i ; θ, φ} , where θ is the parameter in f (y; θ) and φ is the parameter in p(δ | y; φ). 2 Even if we know η = (θ, φ), computing the conditional expectation is numerically difficult. Jae-Kwang Kim (ISU) Part 2 55 / 181 1. Introduction Imputation • Imputation: Monte Carlo approximation of the conditional expectation (given the observed data). m 1 X ∗(j) E {U (ψ; yi ) | yobs,i , δ i } ∼ U ψ; yobs,i , ymis,i = m j=1 1 ∗ Bayesian approach: generate ymis,i from Z f (ymis,i | yobs , δ) = 2 f (ymis,i | yobs , δ; η) p(η | yobs , δ)dη ∗ Frequentist approach: generate ymis,i from f (ymis,i | yobs,i , δ; η̂) , where η̂ is a consistent estimator. Jae-Kwang Kim (ISU) Part 2 56 / 181 1. Introduction Imputation Questions 1 How to generate the Monte Carlo samples (or the imputed values) ? 2 What is the asymptotic distribution of ψ̂I∗ which solves m 1 X ∗(j) U ψ; yobs,i , ymis,i = 0, m j=1 ∗(j) where ymis,i ∼ f (ymis,i | yobs,i , δ; η̂p ) for some η̂p ? 3 How to estimate the variance of ψ̂I∗ ? Jae-Kwang Kim (ISU) Part 2 57 / 181 2. 
Basic Theory for Imputation Basic Setup (for Case 1: ψ = η) • y = (y1 , · · · , yn ) ∼ f (y; θ) • δ = (δ1 , · · · , δn ) ∼ P(δ|y; φ) • y = (yobs , ymis ): (observed, missing) part of y. ∗(1) ∗(m) • ymis , · · · , ymis : m imputed values of ymis generated from f (ymis | yobs , δ; η̂p ) = R f (y; θ̂p )P(δ | y; φ̂p ) , f (y; θ̂p )P(δ | y; φ̂p )dµ(ymis ) where η̂p = (θ̂p , φ̂p ) is a preliminary estimator of η = (θ, φ). • Using m imputed values, imputed score function is computed as ∗ S̄imp,m (η | η̂p ) ≡ m 1 X ∗(j) Scom η; yobs , ymis , δ m j=1 where Scom (η; y) is the score function of η = (θ, φ) under complete response. Jae-Kwang Kim (ISU) Part 2 58 / 181 2. Basic Theory for Imputation Lemma 1 (Asymptotic results for m = ∞) Assume that η̂p converges in probability to η. Let η̂I∗,m be the solution to m 1 X ∗(j) Scom η; yobs , ymis , δ = 0, m j=1 ∗(1) ∗(m) where ymis , · · · , ymis are the imputed values generated from f (ymis | yobs , δ; η̂p ). Then, under some regularity conditions, for m → ∞, ∗ ∼ η̂imp,∞ = η̂MLE + Jmis (η̂p − η̂MLE ) and (2) . −1 ∗ 0 = Iobs + Jmis {V (η̂p ) − V (η̂MLE )} Jmis , V η̂imp,∞ −1 where Jmis = Icom Imis is the fraction of missing information. Jae-Kwang Kim (ISU) Part 2 59 / 181 2. Basic Theory for Imputation Remark • For m = ∞, the imputed score equation becomes the mean score equation. • Equation (2) means that ∗ η̂imp,∞ = (I − Jmis ) η̂MLE + Jmis η̂p . (3) ∗ That is, η̂imp,∞ is a convex combination of η̂MLE and η̂p . ∗ • Note that η̂imp,∞ is one-step EM update with initial estimate η̂p . Let η̂ (t) be the t-th EM update of η that is computed by solving S̄ η | η̂ (t−1) = 0 with η̂ (0) = η̂p . Equation (3) implies that η̂ (t) = (I − Jmis ) η̂MLE + Jmis η̂ (t−1) . Thus, we can obtain η̂ (t) = η̂MLE + (Jmis )t−1 η̂ (0) − η̂MLE , which justifies limt→∞ η̂ (t) = η̂MLE . Jae-Kwang Kim (ISU) Part 2 60 / 181 2. Basic Theory for Imputation Theorem 1 (Asymptotic results for m < ∞) [Theorem 4.1 of KS] √ Let η̂p be a preliminary n-consistent estimator of η with variance Vp . Under some ∗ regularity conditions, the solution η̂imp,m to S̄m∗ (η | η̂p ) ≡ m 1 X ∗(j) Scom η; yobs , ymis , δ = 0 m j=1 has mean η0 and asymptotic variance equal to n o . −1 ∗ −1 0 −1 −1 V η̂imp,m = Iobs + Jmis Vp − Iobs Jmis + m−1 Icom Imis Icom (4) −1 where Jmis = Icom Imis . This theorem was originally presented by Wang and Robins (1998). Jae-Kwang Kim (ISU) Part 2 61 / 181 2. Basic Theory for Imputation Remark • If we use η̂p = η̂MLE , then the asymptotic variance in (4) is . −1 ∗ −1 −1 V η̂imp,m = Iobs + m−1 Icom Imis Icom . • In Bayesian imputation (or multiple imputation), the posterior values of η are −1 independently generated from η ∼ N(η̂MLE , Iobs ), which implies that we can use −1 −1 −1 Vp = Iobs + m Iobs . Thus, the asymptotic variance in (4) for multiple imputation is . −1 0 ∗ −1 −1 −1 −1 V η̂imp,m = Iobs + m−1 Jmis Iobs Jmis + m−1 Icom Imis Icom . The second term is the additional price we pay when generating the posterior values, rather than using η̂MLE directly. Jae-Kwang Kim (ISU) Part 2 62 / 181 2. Basic Theory for Imputation Basic Setup (for Case 2: ψ 6= η) • Parameter ψ defined by E {U(ψ; y)} = 0. • Under complete response, a consistent estimator of ψ can be obtained by solving U (ψ; y) = 0. • Assume that some part of y, denoted by ymis , is not observed and m imputed ∗(1) ∗(m) values, say ymis , · · · , ymis , are generated from f (ymis | yobs , δ; η̂MLE ), where η̂MLE is the MLE of η0 . 
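Before the imputed estimating function is formalized below, here is a minimal sketch of this Case 2 setup for ψ = E(Y), i.e., U(ψ; y) = y − ψ, with a normal regression imputation model fitted by maximum likelihood on the respondents; the data-generating values mirror Simulation 1 of Section 6 and are otherwise hypothetical.

# Sketch: imputation for psi = E(Y) (U(psi; y) = y - psi) when y is missing
# under an ignorable mechanism.  Imputed values are drawn from the fitted
# conditional model f(y | x; eta_hat_MLE), and psi solves the imputed
# estimating equation.
import numpy as np

rng = np.random.default_rng(3)
n, m = 400, 50
x = rng.normal(3, 1, size=n)
y = -2 + x + rng.normal(size=n)
delta = rng.binomial(1, 0.6, size=n).astype(bool)        # MCAR, as in Simulation 1

# MLE of the imputation model from the respondents (valid under MAR)
X = np.column_stack([np.ones(delta.sum()), x[delta]])
beta = np.linalg.solve(X.T @ X, X.T @ y[delta])
sigma = np.sqrt(np.mean((y[delta] - X @ beta) ** 2))

# Draw m imputed values per nonrespondent and solve (1/m) sum_j U(psi; y*(j)) = 0
y_imp = np.where(delta[:, None], y[:, None],
                 (beta[0] + beta[1] * x)[:, None] + sigma * rng.normal(size=(n, m)))
psi_hat = y_imp.mean()     # solves sum_i sum_j (y*_ij - psi) / m = 0
print(psi_hat)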
• The imputed estimating function using m imputed values is computed as ∗ Ūimp,m (ψ | η̂MLE ) = m 1 X U(ψ; y∗(j) ), m j=1 (5) ∗(j) where y∗(j) = (yobs , ymis ). ∗ ∗ • Let ψ̂imp,m be the solution to Ūimp,m (ψ | η̂MLE ) = 0. We are interested in the ∗ asymptotic properties of ψ̂imp,m . Jae-Kwang Kim (ISU) Part 2 63 / 181 2. Basic Theory for Imputation Theorem 2 [Theorem 4.2 of KS] Suppose that the parameter of interest ψ0 is estimated by solving U (ψ) = 0 under complete response. Then, under some regularity conditions, the solution to E {U (ψ) | yobs , δ; η̂MLE } = 0 (6) 0 has mean ψ0 and the asymptotic variance τ −1 Ωτ −1 , where τ = −E ∂U (ψ0 ) /∂ψ 0 Ω = V Ū (ψ0 | η0 ) + κSobs (η0 ) and −1 . κ = E {U (ψ0 ) Smis (η0 )} Iobs This theorem was originally presented by Robins and Wang (2000). Jae-Kwang Kim (ISU) Part 2 64 / 181 2. Basic Theory for Imputation Remark • Writing Ū(ψ | η) ≡ E {U(ψ) | yobs , δ; η}, the solution to (6) can be treated as the solution to the joint estimating equation U1 (ψ, η) U (ψ, η) ≡ = 0, U2 (η) where U1 (ψ, η) = Ū (ψ | η) and U2 (η) = Sobs (η). • We can apply the Taylor expansion to get where ψ̂ η̂ B11 B21 ∼ = B12 B22 ψ0 η0 − = B11 B21 B12 B22 E (∂U1 /∂ψ 0 ) E (∂U2 /∂ψ 0 ) −1 U1 (ψ0 , η0 ) U2 (η0 ) E (∂U1 /∂η 0 ) E (∂U2 /∂η 0 ) . Thus, as B21 = 0, n o −1 −1 ψ̂ ∼ = ψ0 − B11 U1 (ψ0 , η0 ) − B12 B22 U2 (η0 ) . In Theorem 2, B11 = τ , B12 = E (USmis ), and B22 = −Iobs . Jae-Kwang Kim (ISU) Part 2 65 / 181 3. Monte Carlo EM Motivation: Monte Carlo samples in the EM algorithm can be used as imputed values. Monte Carlo EM 1 In the EM algorithm defined by • [E-step] Compute n o Q η | η (t) = E ln f (y, δ; η) | yobs , δ; η (t) • [M-step] Find η (t+1) that maximizes Q η | η (t) , E-step is computationally cumbersome because it involves integral. 2 Wei and Tanner (1990): In the E-step, first draw ∗(1) ∗(m) ymis , · · · , ymis ∼ f ymis | yobs , δ; η (t) and approximate m 1 X ∗(j) Q η | η (t) ∼ ln f yobs , ymis , δ; η . = m j=1 Jae-Kwang Kim (ISU) Part 2 66 / 181 3. Monte Carlo EM Example 1 [Example 3.15 of KS] • Suppose that yi ∼ f (yi | xi ; θ) Assume that xi is always observed but we observe yi only when δi = 1 where δi ∼ Bernoulli [πi (φ)] and πi (φ) = exp (φ0 + φ1 xi + φ2 yi ) . 1 + exp (φ0 + φ1 xi + φ2 yi ) • To implement the MCEM method, in the E-step, we need to generate samples from f (yi | xi , δi = 0; θ̂, φ̂) = R Jae-Kwang Kim (ISU) Part 2 f (yi | xi ; θ̂){1 − πi (φ̂)} . f (yi | xi ; θ̂){1 − πi (φ̂)}dyi 67 / 181 3. Monte Carlo EM Example 1 (Cont’d) • We can use the following rejection method to generate samples from f (yi | xi , δi = 0; θ̂, φ̂): 1 2 Generate yi∗ from f (yi | xi ; θ̂). Using yi∗ , compute πi∗ (φ̂) = exp(φ̂0 + φ̂1 xi + φ̂2 yi∗ ) 1 + exp(φ̂0 + φ̂1 xi + φ̂2 yi∗ ) . Accept yi∗ with probability 1 − πi∗ (φ̂). 3 If yi∗ is not accepted, then goto Step 1. Jae-Kwang Kim (ISU) Part 2 68 / 181 3. Monte Carlo EM Example 1 (Cont’d) • Using the m imputed values of yi , denoted by yi∗(1) , · · · , yi∗(m) , and the M-step can be implemented by solving n X m X ∗(j) =0 S θ; xi , yi i=1 j=1 and n X m n o X ∗(j) ∗(j) δi − π(φ; xi , yi ) 1, xi , yi = 0, i=1 j=1 where S (θ; xi , yi ) = ∂ log f (yi | xi ; θ)/∂θ. Jae-Kwang Kim (ISU) Part 2 69 / 181 3. 
Monte Carlo EM Example 2 (GLMM) [Example 3.18] • Basic Setup: Let yij be a binary random variable (that takes 0 or 1) with probability pij = Pr (yij = 1 | xij , ai ) and we assume that logit (pij ) = x0ij β + ai where xij is a p-dimensional covariate associate with j-th repetition of unit i, β is the parameter of interest that can represent the treatment effect due to x, and ai represents the random effect associate with unit i. We assume that ai are iid with N 0, σ 2 . • Missing data : ai • Observed likelihood: YZ Lobs β, σ 2 = i ( Y yij p (xij , ai ; β) [1 − p (xij , ai ; β)] j 1−yij ) 1 ai φ dai σ σ where φ (·) is the pdf of the standard normal distribution. Jae-Kwang Kim (ISU) Part 2 70 / 181 3. Monte Carlo EM Example 2 (Cont’d) • MCEM approach: generate ai∗ from f (ai | xi , yi ; β̂, σ̂) ∝ f1 (yi | xi , ai ; β̂)f2 (ai ; σ̂). • Metropolis-Hastings algorithm: 1 2 Generate ai∗ from f2 (ai ; σ̂). Set ( ai∗ (t) ai = (t−1) ai where (t−1) ρ ai Jae-Kwang Kim (ISU) , ai∗ (t−1) w.p. ρ(ai , ai∗ ) (t−1) ∗ w.p. 1 − ρ(ai , ai ) f1 yi | xi , ai∗ ; β̂ , 1 . = min f y | x , a(t−1) ; β̂ 1 i i i Part 2 71 / 181 3. Monte Carlo EM Remark • Monte Carlo EM can be used as a frequentist approach to imputation. • Convergence is not guaranteed (for fixed m). • E-step can be computationally heavy. (May use MCMC method). Jae-Kwang Kim (ISU) Part 2 72 / 181 4. Parametric Fractional Imputation Parametric fractional imputation ∗(1) ∗(m) 1 More than one (say m) imputed values of ymis,i : ymis,i , · · · , ymis,i from some (initial) density h (ymis,i ). 2 Create weighted data set where Pm j=1 wij∗ , yij∗ ; j = 1, 2, · · · , m; i = 1, 2 · · · , n ∗(j) wij∗ = 1, yij∗ = (yobs,i , ymis,i ) ∗(j) wij∗ ∝ f (yij∗ , δ i ; η̂)/h(ymis,i ), η̂ is the maximum likelihood estimator of η, and f (y, δ; η) is the joint density of (y, δ). 3 The weight wij∗ are the normalized importance weights and can be called fractional weights. If ymis,i is categorical, then simply use all possible values of ymis,i as the imputed values and then assign their conditional probabilities as the fractional weights. Jae-Kwang Kim (ISU) Part 2 73 / 181 4. Parametric Fractional Imputation Remark • Importance sampling idea: For sufficiently large m, m X R wij∗ g yij∗ j=1 ∼ = (yi ,δi ;η̂) h(ymis,i )dymis,i g (yi ) fh(y mis,i ) = E {g (yi ) | yobs,i , δi ; η̂} R f (yi ,δi ;η̂) h(ymis,i )dymis,i h(ymis,i ) for any g such that the expectation exists. • In the importance sampling literature, h(·) is called proposal distribution and f (·) is called target distribution. • Do not need to compute the conditional distribution f (ymis,i | yobs,i , δi ; η). Only the joint distribution f (yobs,i , ymis,i , δi ; η) is needed because ∗(j) ∗(j) ∗(j) ∗(j) f (yobs,i , ymis,i , δi ; η̂)/h(yi,mis ) f (ymis,i | yobs,i , δi ; η̂)/h(yi,mis ) = Pm . Pm ∗(k) ∗(k) ∗(k) ∗(k) k=1 f (yobs,i , ymis,i , δi ; η̂)/h(yi,mis ) k=1 f (ymis,i | yobs,i , δi ; η̂)/h(yi,mis ) Jae-Kwang Kim (ISU) Part 2 74 / 181 4. Parametric Fractional Imputation EM algorithm by fractional imputation ∗(j) 1 Imputation-step: generate ymis,i ∼ h (yi,mis ). 2 Weighting-step: compute ∗(j) ∗ wij(t) ∝ f (yij∗ , δi ; η̂(t) )/h(yi,mis ) where Pm j=1 ∗ wij(t) = 1. 3 M-step: update η̂ (t+1) : solution to n X m X ∗ wij(t) S η; yij∗ , δi = 0. i=1 j=1 4 Repeat Step2 and Step 3 until convergence. • “Imputation Step” + “Weighting Step” = E-step. ∗ • We may add an optional step that checks if wij(t) is too large for some j. In this case, h(yi,mis ) needs to be changed. Jae-Kwang Kim (ISU) Part 2 75 / 181 4. 
Parametric Fractional imputation • The imputed values are not changed for each EM iteration. Only the fractional weights are changed. Computationally efficient (because we use importance sampling only once). 2 Convergence is achieved (because the imputed values are not changed). 1 • For sufficiently large t, η̂ (t) −→ η̂ ∗ . Also, for sufficiently large m, η̂ ∗ −→ η̂MLE . • For estimation of ψ in E {U(ψ; Y )} = 0, simply use n m 1 XX ∗ wij U(ψ; yij∗ ) = 0. n i=1 j=1 • Linearization variance estimation (using the result of Theorem 2) is discussed in Kim (2011). Jae-Kwang Kim (ISU) Part 2 76 / 181 4. Parametric Fractional imputation Return to Example 1 • Fractional imputation ∗(m) ∗(1) from f yi | xi ; θ̂(0) . Imputation Step: Generate yi , · · · , yi 2 Weighting Step: Using the m imputed values generated from Step 1, compute the fractional weights by ∗(j) o f yi | xi ; θ̂(t) n ∗ 1 − π(xi , yi∗(j) ; φ̂(t) ) wij(t) ∝ ∗(j) f yi | xi ; θ̂(0) 1 where Jae-Kwang Kim (ISU) exp φ̂0 + φ̂1 xi + φ̂2 yi . π(xi , yi ; φ̂) = 1 + exp φ̂0 + φ̂1 xi + φ̂2 yi Part 2 77 / 181 4. Parametric Fractional imputation Return to Example 1 (Cont’d) • Using the imputed data and the fractional weights, the M-step can be implemented by solving n X m X ∗(j) ∗ =0 wij(t) S θ; xi , yi i=1 j=1 and n X m X n o ∗(j) ∗(j) ∗ δi − π(φ; xi , yi ) 1, xi , yi = 0, wij(t) (7) i=1 j=1 where S (θ; xi , yi ) = ∂ log f (yi | xi ; θ)/∂θ. Jae-Kwang Kim (ISU) Part 2 78 / 181 4. Parametric Fractional imputation Example 3: Categorical missing data Original data i 1 2 3 4 5 6 Jae-Kwang Kim (ISU) Y1i 1 1 ? 0 0 ? Y2i 1 ? 0 1 0 ? Part 2 Xi 3 4 5 6 7 8 79 / 181 4. Parametric Fractional imputation Example 3 (Con’d) • Y1 and Y2 are dichotomous and X is continuous. • Model Pr (Y1 = 1 | X ) = Λ (α0 + α1 X ) Pr (Y2 = 1 | X , Y1 ) = Λ (β0 + β1 X + β2 Y1 ) where Λ (x) = {1 + exp (−x)} −1 • Assume MAR Jae-Kwang Kim (ISU) Part 2 80 / 181 4. Parametric Fractional imputation Example 3 (Con’d) Imputed data i 1 2 3 4 5 6 Jae-Kwang Kim (ISU) Fractional Weight 1 Pr (Y2 = 0 | Y1 = 1, X = 4) Pr (Y2 = 1 | Y1 = 1, X = 4) Pr (Y1 = 0 | Y2 = 0, X = 5) Pr (Y1 = 1 | Y2 = 0, X = 5) 1 1 Pr (Y1 = 0, Y1 = 0 | X = 8) Pr (Y1 = 0, Y1 = 1 | X = 8) Pr (Y1 = 1, Y1 = 0 | X = 8) Pr (Y1 = 1, Y1 = 1 | X = 8) Part 2 Y1i 1 1 1 0 1 0 0 0 0 1 1 Y2i 1 0 1 0 0 1 0 0 1 0 1 Xi 3 4 4 5 5 6 7 8 8 8 8 81 / 181 4. Parametric Fractional imputation Example 3 (Con’d) • Implementation of EM algorithm using fractional imputation • E-step: compute the mean score functions using the fractional weights. • M-step: solve the mean score function. • Because we have a completed data with weights, we can estimate other parameters such as θ = Pr (Y2 = 1 | X > 5). Jae-Kwang Kim (ISU) Part 2 82 / 181 4. Parametric Fractional imputation Example 4: Measurement error model [Example 4.13 of KS] • Interested in estimating θ in f (y | x; θ). • Instead of observing x, we observe z which can be highly correlated with x. • Thus, z is an instrumental variable for x: f (y | x, z) = f (y | x) and f (y | z = a) 6= f (y | z = b) for a 6= b. • In addition to original sample, we have a separate calibration sample that observes (xi , zi ). Jae-Kwang Kim (ISU) Part 2 83 / 181 4. Parametric Fractional imputation Table : Data Structure Calibration Sample Original Sample Jae-Kwang Kim (ISU) Part 2 Z o o X o Y o 84 / 181 4. 
Parametric Fractional imputation Example 4 (Cont’d) • The goal is to generate x in the original sample from f (xi | zi , yi ) ∝ f (xi | zi ) f (yi | xi , zi ) = f (xi | zi ) f (yi | xi ) • Obtain a consistent estimator fˆ(x | z) from calibration sample. • E-step ∗(1) ∗(m) Generate xi , · · · , xi from fˆ(xi | zi ). ∗(j) 2 Compute the fractional weights associated with xi by 1 ∗(j) wij∗ ∝ f (yi | xi ; θ̂) • M-step: Solve the weighted score equation for θ. Jae-Kwang Kim (ISU) Part 2 85 / 181 5. Multiple imputation Features 1 Imputed values are generated from ∗(j) yi,mis ∼ f (yi,mis | yi,obs , δ i ; η ∗ ) where η ∗ is generated from the posterior distribution π (η | yobs ). 2 Variance estimation formula is simple (Rubin’s formula). 1 )Bm m 2 P −1 Pm where WM = m−1 m , k=1 V̂I (k) , Bm = (m − 1) k=1 ψ̂(k) − ψ̄m −1 Pm ψ̄m = m k=1 ψ̂(k) is the average of m imputed estimators, and V̂I (k) is the imputed version of the variance estimator of ψ̂ under complete response. V̂MI (ψ̄m ) = Wm + (1 + Jae-Kwang Kim (ISU) Part 2 86 / 181 5. Multiple imputation • The computation for Bayesian imputation can be implemented by the data augmentation (Tanner and Wong, 1987) technique, which is a special application of the Gibb’s sampling method: ∗ ∼ f (ymis | yobs , δ; η ∗ ) I-step: Generate ymis ∗ ∗ 2 P-step: Generate η ∼ π (η | yobs , ymis , δ) 1 • Needs some tools for checking the convergence to a stable distribution. • Consistency of variance estimator is questionable (Kim et al., 2006). Jae-Kwang Kim (ISU) Part 2 87 / 181 6. Simulation Study Simulation 1 • Bivariate data (xi , yi ) of size n = 200 with xi ∼ N(3, 1) yi ∼ N(−2 + xi , 1) • xi always observed, yi subject to missingness. • MCAR (δ ∼ Bernoulli(0.6)) • Parameters of interest 1 2 θ1 = E (Y ) θ2 = Pr (Y < 1) • Multiple imputation (MI) and fractional imputation (FI) are applied with m = 50. • For estimation of θ2 , the following method-of-moment estimator is used. θ̂2,MME = n−1 n X I (yi < 1) i=1 Jae-Kwang Kim (ISU) Part 2 88 / 181 6. Simulation Study Table 1 Monte Carlo bias and variance of the point estimators. Parameter θ1 θ2 Estimator Complete sample MI FI Complete sample MI FI Bias 0.00 0.00 0.00 0.00 0.00 0.00 Variance 0.0100 0.0134 0.0133 0.00129 0.00137 0.00137 Std Var 100 134 133 100 106 106 Table 2 Monte Carlo relative bias of the variance estimator. Parameter V (θ̂1 ) V (θ̂2 ) Jae-Kwang Kim (ISU) Imputation MI FI MI FI Part 2 Relative bias (%) -0.24 1.21 23.08 2.05 89 / 181 6. Simulation study • Rubin’s formula is based on the following decomposition: V (θ̂MI ) = V (θ̂n ) + V (θ̂MI − θ̂n ) where θ̂n is the complete-sample estimator of θ. Basically, Wm term estimates V (θ̂n ) and (1 + m−1 )Bm term estimates V (θ̂MI − θ̂n ). • For general case, we have V (θ̂MI ) = V (θ̂n ) + V (θ̂MI − θ̂n ) + 2Cov (θ̂MI − θ̂n , θ̂n ) and Rubin’s variance estimator ignores the covariance term. Thus, a sufficient condition for the validity of unbiased variance estimator is Cov (θ̂MI − θ̂n , θ̂n ) = 0. • Meng (1994) called the condition congeniality of θ̂n . • Congeniality holds when θ̂n is the MLE of θ. Jae-Kwang Kim (ISU) Part 2 90 / 181 6. Simulation study • For example, there are two estimators of θ = P(Y < 1) when Y follows from N(µ, σ 2 ). R1 Maximum likelihood method: θ̂MLE = −∞ φ(z; µ̂, σ̂ 2 )dz Pn 2 Method of moments: θ̂MME = n−1 i=1 I (yi < 1) 1 • In the simulation setup, the imputed estimator of θ2 can be expressed as θ̂2,I = n−1 n X [δi I (yi < 1) + (1 − δi )E {I (yi < 1) | xi ; µ̂, σ̂}] . 
i=1 Thus, imputed estimator of θ2 “borrows strength” by making use of extra information associated with f (y | x). • Thus, when the congeniality conditions does not hold, the imputed estimator improves the efficiency (due to the imputation model that uses extra information) but the variance estimator does not recognize this improvement. Jae-Kwang Kim (ISU) Part 2 91 / 181 6. Simulation Study Simulation 2 • Bivariate data (xi , yi ) of size n = 100 with Yi = β0 + β1 xi + β2 xi2 − 1 + ei (8) where (β0 , β1 , β2 ) = (0, 0.9, 0.06), xi ∼ N (0, 1), ei ∼ N (0, 0.16), and xi and ei are independent. The variable xi is always observed but the probability that yi responds is 0.5. • In MI, the imputer’s model is Yi = β0 + β1 xi + ei . That is, imputer’s model uses extra information of β2 = 0. • From the imputed data, we fit model (8) and computed power of a test H0 : β2 = 0 with 0.05 significant level. • In addition, we also considered the Complete-Case (CC) method that simply uses the complete cases only for the regression analysis Jae-Kwang Kim (ISU) Part 2 92 / 181 6. Simulation Study Table 3 Simulation results for the Monte Carlo experiment based on 10,000 Monte Carlo samples. Method MI FI CC E (θ̂) 0.028 0.046 0.060 V (θ̂) 0.00056 0.00146 0.00234 R.B. (V̂ ) 1.81 0.02 -0.01 Power 0.044 0.314 0.285 Table 3 shows that MI provides efficient point estimator than CC method but variance estimation is very conservative (more than 100% overestimation). Because of the serious positive bias of MI variance estimator, the statistical power of the test based on MI is actually lower than the CC method. Jae-Kwang Kim (ISU) Part 2 93 / 181 7. Summary • Imputation can be viewed as a Monte Carlo tool for computing the conditional expectation. • Monte Carlo EM is very popular but the E-step can be computationally heavy. • Parametric fractional imputation is a useful tool for frequentist imputation. • Multiple imputation is motivated from a Bayesian framework. The frequentist validity of multiple imputation requires the condition of congeniality. • Uncongeniality may lead to overestimation of variance which can seriously increase type-2 errors. Jae-Kwang Kim (ISU) Part 2 94 / 181 REFERENCES Cheng, P. E. (1994), ‘Nonparametric estimation of mean functionals with data missing at random’, Journal of the American Statistical Association 89, 81–87. Dempster, A. P., N. M. Laird and D. B. Rubin (1977), ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society: Series B 39, 1–37. Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’, Philosophical Transactions of the Royal Society of London A 222, 309–368. Fuller, W. A., M. M. Loughin and H. D. Baker (1994), ‘Regression weighting in the presence of nonresponse with application to the 1987-1988 Nationwide Food Consumption Survey’, Survey Methodology 20, 75–85. Hirano, K., G. Imbens and G. Ridder (2003), ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica 71, 1161–1189. Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’, Journal of the American Statistical Association 85, 765–769. Kim, J. K. (2011), ‘Parametric fractional imputation for missing data analysis’, Biometrika 98, 119–132. Kim, J. K. and C. L. Yu (2011), ‘A semi-parametric estimation of mean functionals with non-ignorable missing data’, Journal of the American Statistical Association 106, 157–165. Jae-Kwang Kim (ISU) Part 2 94 / 181 Kim, J. K., M. J. Brick, W. A. 
Fuller and G. Kalton (2006), ‘On the bias of the multiple imputation variance estimator in survey sampling’, Journal of the Royal Statistical Society: Series B 68, 509–521. Kim, J. K. and M. K. Riddles (2012), ‘Some theory for propensity-score-adjustment estimators in survey sampling’, Survey Methodology 38, 157–165. Kott, P. S. and T. Chang (2010), ‘Using calibration weighting to adjust for nonignorable unit nonresponse’, Journal of the American Statistical Association 105, 1265–1275. Louis, T. A. (1982), ‘Finding the observed information matrix when using the EM algorithm’, Journal of the Royal Statistical Society: Series B 44, 226–233. Meng, X. L. (1994), ‘Multiple-imputation inferences with uncongenial sources of input (with discussion)’, Statistical Science 9, 538–573. Oakes, D. (1999), ‘Direct calculation of the information matrix via the em algorithm’, Journal of the Royal Statistical Society: Series B 61, 479–482. Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and applications, in ‘Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability’, Vol. 1, University of California Press, Berkeley, California, pp. 695–715. Redner, R. A. and H. F. Walker (1984), ‘Mixture densities, maximum likelihood and the EM algorithm’, SIAM Review 26, 195–239. Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), ‘Estimation of regression coefficients when some regressors are not always observed’, Journal of the American Statistical Association 89, 846–866. Jae-Kwang Kim (ISU) Part 2 94 / 181 Robins, J. M. and N. Wang (2000), ‘Inference for imputation estimators’, Biometrika 87, 113–124. Rubin, D. B. (1976), ‘Inference and missing data’, Biometrika 63, 581–590. Tanner, M. A. and W. H. Wong (1987), ‘The calculation of posterior distribution by data augmentation’, Journal of the American Statistical Association 82, 528–540. Wang, N. and J. M. Robins (1998), ‘Large-sample theory for parametric multiple imputation procedures’, Biometrika 85, 935–948. Wang, S., J. Shao and J. K. Kim (2014), ‘Identifiability and estimation in problems with nonignorable nonresponse’, Statistica Sinica 24, 1097 – 1116. Wei, G. C. and M. A. Tanner (1990), ‘A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms’, Journal of the American Statistical Association 85, 699–704. Zhou, M. and J. K. Kim (2012), ‘An efficient method of estimation for longitudinal surveys with monotone missing data’, Biometrika 99, 631–648. Statistical Methods for Handling Missing Data Part 3: Propensity score approach Jae-Kwang Kim Department of Statistics, Iowa State University Jae-Kwang Kim (ISU) Part 3 95 / 181 1. Introduction Basic Setup • (X , Y ): random variable • θ: Defined by solving E {U(θ; X , Y )} = 0. • yi is subject to missingness δi = 1 0 if yi responds if yi is missing. • Want to find wi such that the solution θ̂w to n X δi wi U(θ; xi , yi ) = 0 i=1 is consistent for θ. Jae-Kwang Kim (ISU) Part 3 96 / 181 Basic Setup Complete-Case (CC) method • Solve n X δi U(θ; xi , yi ) = 0 i=1 • Biased unless Pr (δ = 1 | X , Y ) does not depend on (X , Y ), i.e. biased unless the set of the respondents is a simple random sample from the original data. Jae-Kwang Kim (ISU) Part 3 97 / 181 Basic Setup Weighted Complete-Case (WCC) method • Solve ÛW (θ) ≡ n X δi wi U(θ; xi , yi ) = 0 i=1 for some weights wi . The weight is often called the propensity scores (or propensity weights). 
• The choice of wi = 1 Pr (δi = 1 | xi , yi ) will make the resulting estimator consistent. • Requires some assumption about Pr (δi = 1 | xi , yi ). Jae-Kwang Kim (ISU) Part 3 98 / 181 Basic Setup Justification for using wi = 1/Pr (δi = 1 | xi , yi ) • Note that n o E ÛW (θ) | x1 , · · · , xn , y1 , · · · , yn = Ûn (θ) where the expectation is taken with respect to f (δ | x, y ). • Thus, the probability limit of the solution to ÛW (θ) = 0 is equal to the probability limit of the solution to Ûn (θ) = 0. • No distributional assumptions made about (X , Y ). Jae-Kwang Kim (ISU) Part 3 99 / 181 Regression weighting method Motivation • Assume that xi are observed throughout the sample. • Intercept term is included in xi . • Study variable yi observed only when δi = 1. • Parameter of interest: θ = E (Y ). • Regression estimator of θ is defined by θ̂reg = n X δi wi yi = x̄0n β̂, i=1 where β̂ = 0 −1 i=1 δi xi xi Pn Pn wi = Jae-Kwang Kim (ISU) i=1 δi xi yi and n 1X xi n i=1 !0 Part 3 n X !−1 δi xi x0i xi . i=1 100 / 181 Regression weighting method Theorem 1 (Fuller et al., 1994) Let πi be the true response probability of unit i. If the response probability satisfies 1 = x0i λ πi (9) for some λ for al unit i in the sample, then the regression estimator is asymptotically unbiased with asymptotic variance ) ( n 1 X 1 2 − 1 (yi − xi β) , V (θ̂reg ) = V (θ̂n ) + E n2 i=1 πi where θ̂n = n−1 Pn i=1 Jae-Kwang Kim (ISU) yi and β is the probability limit of β̂. Part 3 101 / 181 Sketched Proof Since πi−1 is in the column space of xi , we have n X δi yi − x0i β̂ = 0 πi i=1 and θ̂reg can be written as θ̂reg = x̄0n β̂ + n 1 X δi yi − x0i β̂ . n i=1 πi (10) Now, writing θ̂reg = θ̂reg (β̂), we have E ∂ θ̂reg (β) ∂β ( =E x̄0n n 1 X δi 0 − xi n i=1 πi ) = 0. Thus, we can safely ignore the sampling variability of β̂ in (10). Jae-Kwang Kim (ISU) Part 3 102 / 181 Regression weighting method Example 1 [Example 5.2 of KS] Assume that the sample is partitioned into G exhaustive and mutually exclusive groups, denoted by A1 , · · · , AG , where |Ag | = ng with g being the group indicator. Assume a uniform response mechanism for each group. Thus, we assume that πi = pg for some pg ∈ (0, 1] if i ∈ Ag . Let 1 if i ∈ Ag xig = 0 otherwise. Then, xi = (xi1 , · · · , xiG ) satisfies (9). The regression estimator of θ = E (Y ) can be written as G G 1X 1 X ng X δi yi = θ̂reg = ng ȳRg , n g =1 rg n g =1 i∈Ag P where rg = i∈Ag δi is the realized size of respondents in group g and P −1 P ȳRg = i∈Ag δi i∈Ag δi yi . Jae-Kwang Kim (ISU) Part 3 103 / 181 Regression weighting method Example 1 (Cont’d) Because the covariate satisfies (9), the regression estimator is asymptotically unbiased and the asymptotic variance of θ̂reg is X G X ng −2 2 (yi − ȳng ) . V θ̂reg = V θ̂n + E n −1 rg g =1 i∈Ag Variance estimation of θ̂reg can be implemented by using a standard variance estimation formula applied to d̂i = ȳRg + (ng /rg ) (yi − ȳRg ). Jae-Kwang Kim (ISU) Part 3 104 / 181 Propensity score method Idea • For simplicity, assume that Pr (δi = 1 | xi , yi ) takes a parametric form. Pr (δi = 1 | xi , yi ) = π(xi , yi ; φ∗ ) for some unknown φ∗ . The functional form of π(·) is known. For example, π(x, y ; φ∗ ) = exp(φ∗0 + φ∗1 x + φ∗2 y ) 1 + exp(φ∗0 + φ∗1 x + φ∗2 y ) • Propensity score approach to missing data: obtain θ̂PS which solves ÛPS (θ) ≡ n X i=1 δi 1 U(θ; xi , yi ) = 0 π(xi , yi ; φ̂) for some φ̂ which converges to φ∗ in probability. 
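A minimal sketch of this estimator under MAR (so that π depends on x only, as in Example 2 below): φ is estimated by maximum likelihood from S(φ) = Σ {δi − πi(φ)}(1, xi) = 0 via Newton-Raphson, and θ = E(Y) is then estimated by solving the weighted complete-case estimating equation; all data-generating values are hypothetical.

# Sketch of the propensity score estimator of theta = E(Y) under MAR:
# fit a logistic response model pi(x; phi) by maximum likelihood, then
# solve sum_i delta_i {pi_hat_i}^{-1} (y_i - theta) = 0.
import numpy as np

def expit(u):
    return 1 / (1 + np.exp(-u))

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
y = 1 + x + rng.normal(size=n)
delta = rng.binomial(1, expit(0.3 + 0.8 * x))            # MAR: depends on x only

# Maximum likelihood for phi: solve S(phi) = sum_i {delta_i - pi_i(phi)} (1, x_i) = 0
X = np.column_stack([np.ones(n), x])
phi = np.zeros(2)
for _ in range(25):                                      # Newton-Raphson iterations
    p = expit(X @ phi)
    score = X.T @ (delta - p)
    hessian = -(X * (p * (1 - p))[:, None]).T @ X
    phi -= np.linalg.solve(hessian, score)

pi_hat = expit(X @ phi)
w = delta / pi_hat
theta_ps = np.sum(w * y) / np.sum(w)    # solves sum_i delta_i (y_i - theta) / pi_hat_i = 0
print(theta_ps, y.mean())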
Jae-Kwang Kim (ISU) Part 3 105 / 181 Propensity score method Issues • Identifiability: Model parameters may not be fully identifiable from the observed sample. • May assume Pr (δi = 1 | xi , yi ) = Pr (δi = 1 | xi ) . This condition is often called MAR (Missing at random). • For longitudinal data with monotone missing pattern, the MAR condition means Pr (δi,t = 1 | xi , yi1 , · · · , yiT ) = Pr (δi,t = 1 | xi , yi1 , · · · , yi,t−1 ) . That is, the response probability at time t may depend on the value of y observed up to time t. Jae-Kwang Kim (ISU) Part 3 106 / 181 Propensity score method Issues • Estimation of φ∗ • Maximum likelihood method: Solve S(φ) ≡ n X {δi − πi (φ)} qi (φ) = 0 i=1 where qi = ∂logit{πi (φ)}/∂φ. • Maximum likelihood method does not always lead to efficient estimation (see Example 2 next). • Inference using θ̂PS : Note that θ̂PS = θ̂PS (φ̂). We need to incorporate the sampling variability of φ̂ in making inference about θ using θ̂PS . Jae-Kwang Kim (ISU) Part 3 107 / 181 3. Propensity score method Example 2 • Response model exp(φ∗0 + φ∗1 xi ) 1 + exp(φ∗0 + φ∗1 xi ) πi (φ∗ ) = • Parameter of interest: θ = E (Y ). • PS estimator of θ: Solve jointly U(θ, φ) = n X δi {πi (φ)}−1 (yi − θ) = 0 i=1 S(φ) = n X {δi − πi (φ)}(1, xi ) = (0, 0) i=1 Jae-Kwang Kim (ISU) Part 3 108 / 181 3. Propensity score method Example 2 (Cont’d) • Taylor linearization ÛPS (θ, φ̂) ∼ = −1 ∂U ∂S ÛPS (θ, φ∗ ) − E E S(φ∗ ) ∂φ ∂φ = ÛPS (θ, φ∗ ) − {Cov (U, S)} {V (S)}−1 S(φ∗ ), (11) by the property of zero-mean function. (i.e. If E (U) = 0, then E (∂U/∂φ) = −Cov (U, S).) • So, we have ∗ ⊥ ∗ V {ÛPS (θ, φ̂)} ∼ = V {ÛPS (θ, φ ) | S } ≤ V {ÛPS (θ, φ )}, where V {Û | S ⊥ } = V (Û) − Cov (Û, S) {V (S)}−1 Cov (S, Û). Jae-Kwang Kim (ISU) Part 3 109 / 181 3. Propensity score method Example 2 (Cont’d) • Sandwich variance formula 0 −1 −1 V {θ̂PS (φ̂)} ∼ = τPS V {ÛPS (θ; φ̂)}τPS where τPS = E ∂ ÛPS (θ, φ̂) . ∂θ • Note that, by (11), E ∂ ÛPS (θ, φ̂) ∂θ =E ∂ ÛPS (θ, φ∗ ) . ∂θ • So, we have ∗ ⊥ ∗ V {θ̂PS (φ̂)} ∼ = V {θ̂PS (φ ) | S } ≤ V {θ̂PS (φ )}, where V {θ̂ | S ⊥ } = V (θ̂) − Cov (θ̂, S) {V (S)}−1 Cov (S, θ̂). Jae-Kwang Kim (ISU) Part 3 110 / 181 3. Propensity score method Augmented propensity score method • To improve the efficiency of the PS estimator, one can consider solving n X i=1 n δi X 1 {U(θ; xi , yi ) − b(θ; xi )} + b(θ; xi ) = 0, π̂i i=1 (12) where b(θ; xi ) is to be determined. • Assume that the estimated response probability is computed by π̂i = π(xi ; φ̂), where φ̂ is computed by n X i=1 δi − 1 hi (φ) = 0 π(xi ; φ) for some hi (φ) = h(xi ; φ). • Note that (12) forms a class of estimators, indexed by b, and the solution to (12) is asymptotically unbiased regardless of the choice of b(θ; xi ). Jae-Kwang Kim (ISU) Part 3 111 / 181 3. Propensity score method Augmented propensity score method Theorem 2 [Theorem 5.1 of KS] Assume that the response probability Pr (δ = 1 | x, y ) = π(x) does not depend on the value of y . Let θ̂b be the solution to (12) for given b(θ; xi ). Under some regularity conditions, θ̂b is consistent and its asymptotic variance satisfies h n oi V θ̂b ≥ n−1 τ −1 V {E (U | X )} + E π −1 V (U | X ) (τ −1 )0 , (13) where τ = E (∂U/∂θ0 ) and the equality holds when b(θ; xi ) = E {U(θ; xi , yi ) | xi }. Originally proved by Robins et al. (1994). Jae-Kwang Kim (ISU) Part 3 112 / 181 3. 
Propensity score method Augmented propensity score method Remark • The lower bound of the variance is achieved for b∗ (θ; xi ) = E {U(θ; xi , yi ) | xi }, which requires assumptions about the law of y given x (called outcome regression model). • Thus, under the joint distribution of the response model and outcome regression model, the particular choice of b ∗ (θ; xi ) = E {U(θ; xi , yi ) | xi } satisfies 1 2 Achieves the lower bound of the asymptotic variance The choice of φ̂ in π̂i = π(xi , φ̂) does not make any difference in the asymptotic sense. • Also, it is doubly robust in the sense that it remains consistent if either model (outcome regression model or response model) is true. Jae-Kwang Kim (ISU) Part 3 113 / 181 4. GLS method Motivation • The propensity score method is used to reduce the bias, rather than to reduce the variance. • In the previous example, the PS estimator for θx = E (X ) is Pn δi π̂i−1 xi θ̂x,PS = Pi=1 n −1 i=1 δi π̂i where π̂i = πi (φ̂). • Note that θ̂x,PS is not necessarily equal to x̄n = n−1 • How to incorporate the extra information of x̄n ? Jae-Kwang Kim (ISU) Part 3 Pn i=1 xi . 114 / 181 4. GLS method GLS (or GMM) approach • Let θ = (θx , θy ). We have three estimators for two parameters. • Find θ that minimizes −1 0 x̄n − θx x̄n − θx x̄n − θx θ̂x,PS − θx QPS (θ) = θ̂x,PS − θx V̂ θ̂x,PS − θx θ̂y ,PS − θy θ̂y ,PS − θy θ̂y ,PS − θy (14) where θ̂PS = θ̂PS (φ̂). • Computation for V̂ is somewhat cumbersome. Jae-Kwang Kim (ISU) Part 3 115 / 181 4. GLS method Alternative GLS (or GMM) approach (Zhou and Kim, 2012) • Find (θ, φ) that minimizes 0 x̄n − θx θ̂x,PS (φ) − θx V̂ θ̂y ,PS (φ) − θy S(φ) −1 x̄n − θx θ̂x,PS (φ) − θx θ̂y ,PS (φ) − θy S(φ) x̄n − θx θ̂x,PS (φ) − θx . θ̂y ,PS (φ) − θy S(φ) • Computation for V̂ is easier since we can treat φ as if known. • Let Q ∗ (θ, φ) be the above objective function. It can be shown that Q ∗ (θ, φ̂) = QPS (θ) in (14) and so minimizing Q ∗ (θ, φ̂) is equivalent to minimizing QPS (θ). Jae-Kwang Kim (ISU) Part 3 116 / 181 4. GLS method Justification for the equivalence • May write Q ∗ (θ, φ) = = ÛPS (θ, φ) S(φ) 0 V11 V21 V12 V22 −1 ÛPS (θ, φ) S(φ) Q1 (θ | φ) + Q2 (φ) where 0 −1 −1 −1 ÛPS − V12 V22 S V UPS | S ⊥ ÛPS − V12 V22 S n o−1 Q2 (φ) = S(φ)0 V̂ (S) S(φ) Q1 (θ | φ) = • For the MLE φ̂, we have Q2 (φ̂) = 0 and Q1 (θ | φ̂) = QPS (θ). Jae-Kwang Kim (ISU) Part 3 117 / 181 4. GLS method Example 3 (Example 5.5 of KS) • Response model: same as Example 2 πi (φ∗ ) = exp(φ∗0 + φ∗1 xi ) 1 + exp(φ∗0 + φ∗1 xi ) • Three direct PS estimators of (1, θx , θy ): (θ̂1,PS , θ̂x,PS , θ̂y ,PS ) = n−1 n X δi π̂i−1 (1, xi , yi ) . i=1 • x̄n = n−1 ni=1 xi available. • What is the optimal estimator of θy ? P Jae-Kwang Kim (ISU) Part 3 118 / 181 4. GLS method Example 3 (Cont’d) • Minimize x̄n − θx θ̂1,PS (φ) − 1 θ̂x,PS (φ) − θx θ̂y ,PS (φ) − θy S(φ) 0 V̂ x̄n θ̂1,PS (φ) θ̂x,PS (φ) θ̂y ,PS (φ) S(φ) −1 x̄n − θx θ̂1,PS (φ) − 1 θ̂x,PS (φ) − θx θ̂y ,PS (φ) − θy S(φ) with respect to (θx , θy , φ), where S(φ) = n X δi − 1 hi (φ) = 0 πi (φ) i=1 with hi (φ) = πi (φ)(1, xi )0 . Jae-Kwang Kim (ISU) Part 3 119 / 181 4. GLS method Example 3 (Cont’d) • Equivalently, minimize 0 θ̂y ,PS (φ) − θy θ̂1,PS (φ) − 1 V̂ θ̂x,PS (φ) − x̄n S(φ) −1 θ̂y ,PS (φ) θ̂ (φ) 1,PS θ̂x,PS (φ) − x̄n S(φ) θ̂y ,PS (φ) − θy θ̂1,PS (φ) − 1 θ̂x,PS (φ) − x̄n S(φ) with respect to (θy , φ), since the optimal estimator of θx is x̄n . Jae-Kwang Kim (ISU) Part 3 120 / 181 4. 
GLS method Example 3 (Cont’d) • The solution can be written as n o θ̂y ,opt = θ̂y ,PS + 1 − θ̂1,PS B̂0 + x̄n − θ̂1,PS B̂1 + 0 − S(φ̂) Ĉ where 0 −1 B̂0 n n 1 1 1 X X B̂1 = δi bi xi xi δi bi xi yi i=1 i=1 hi hi hi Ĉ and bi = π̂i−2 (1 − π̂i ). • Note that the last term {0 − S(φ̂)}Ĉ , which is equal to zero, does not contribute to the point estimation. But, it is used for variance estimation. Jae-Kwang Kim (ISU) Part 3 121 / 181 4. GLS method Example 3 (Cont’d) • That is, for variance estimation, we simply express θ̂y ,opt = n−1 n X η̂i i=1 where δi yi − B̂0 − xi B̂1 − h0i Ĉ π̂i and apply the standard variance formula to η̂i . η̂i = B̂0 + xi B̂1 + h0i Ĉ + • This idea can be extended to the survey sampling setup. Jae-Kwang Kim (ISU) Part 3 122 / 181 4. GLS method Example 3 (Cont’d) • The optimal estimator is linear in y . That is, we can write θ̂y ,opt = n X 1 X δi gi yi = wi yi n i=1 π̂i δi =1 where gi satisfies n n X X δi gi (1, xi , h0i ) = (1, xi , h0i ). π̂ i i=1 i=1 • Thus, it is doubly robust under Eζ (y | x) = β0 + β1 x in the sense that θy ,opt is consistent when either the response model or the outcome regression model holds. Jae-Kwang Kim (ISU) Part 3 123 / 181 5. Doubly robust method • Two models • Response Probability (RP) model: model about δ Pr (δ = 1 | x, y ) = π(x; φ) • Outcome Regression (OR) model: model about y E (y | x) = m(xi ; β) • Doubly robust (DR) estimation aims to achieve (asymptotic) unbiasedness under either RP model or OR model. • For estimation of θ = E (Y ), a doubly robust estimator is θ̂DR = n 1X δi ŷi + (yi − ŷi ) n i=1 π̂i where ŷi = m(xi ; β̂) and π̂i = π(xi ; φ̂). Jae-Kwang Kim (ISU) Part 3 124 / 181 5. Doubly robust method • Note that θ̂DR − θ̂n = n−1 n X δi i=1 π̂i − 1 (yi − ŷi ) . (15) Taking an expectation of the above, we note that the first term has approximate zero expectation if the RP model is true. The second term has approximate zero expectation if the OR model is true. Thus, θ̂DR is approximately unbiased when either RP model or OR model is true. • When both models are true, then the choice of β̂ and φ̂ does not make any difference in the asymptotic sense. Robins et al (1994) called the property local efficiency of the DR estimator. Jae-Kwang Kim (ISU) Part 3 125 / 181 5. Doubly robust method • Kim and Riddles (2012) considered an augmented propensity model of the form π̂i∗ = πi∗ (φ̂, λ̂) = πi (φ̂) , πi (φ̂) + {1 − πi (φ̂)} exp(λ̂0 + λ̂1 m̂i ) (16) where πi (φ̂) is the estimated response probability under the response probability model and (λ̂0 , λ̂1 ) satisfies n X i=1 n X δi (1, m̂i ) (1, m̂i ) = ∗ πi (φ̂, λ̂) i=1 (17) with m̂i = m(xi ; β̂). Jae-Kwang Kim (ISU) Part 3 126 / 181 5. Doubly robust method ∗ • The augmented PSA estimator, defined by θ̂PSA = n−1 Pn ∗ i=1 δi yi /π̂i , based on the augmented propensity in (16) satisfies, under the assumed response probability model, n 1X δi ∗ ∼ b̂ + b̂ m̂ + y − b̂ − b̂ m̂ , (18) θ̂PSA = 0 1 i 0 1 i i n i=1 π̂i where b̂0 b̂1 = ( n X i=1 Jae-Kwang Kim (ISU) δi 1 −1 π̂i 1 m̂i Part 3 1 m̂i 0 )−1 X n i=1 δi 1 −1 π̂i 1 m̂i yi . 127 / 181 5. Doubly robust method • The augmented PSA estimator using π̂i∗ = πi∗ (φ̂, λ̂) = π̂i , π̂i + {1 − π̂i } exp(λ̂0 /π̂i + λ̂1 xi /π̂i ) with (λ̂0 , λ̂1 ) satisfying n X i=1 n X δi (1, xi ) (1, x i) = πi∗ (φ̂, λ̂) i=1 is asymptotically equivalent to the optimal regression PSA estimator discussed in Example 3. Jae-Kwang Kim (ISU) Part 3 128 / 181 6. Nonparametric method Motivation • So far, we have assumed a parametric model for π(x) = Pr (δ = 1 | x). 
• Using the nonparametric regression technique, we can use a nonparametric estimator of π(x) given by a nonparametric regression estimator of π(x) = E (δ | x) can be obtained by Pn δi Kh (xi , x) π̂h (x) = Pi=1 , n i=1 Kh (xi , x) (19) where Kh is the kernel function which satisfies certain regularity conditions and h is the bandwidth. • Once a nonparametric estimator of π(x) is obtained, the nonparametric PSA estimator θ̂NPS of θ0 = E (Y ) is given by θ̂NPS = Jae-Kwang Kim (ISU) n 1 X δi yi . n i=1 π̂h (xi ) Part 3 (20) 129 / 181 6. Nonparametric method Theorem 3 [Theorem 5.2 of KS] Under some regularity conditions, we have n 1X δi θ̂NPS = m(xi ) + {yi − m(xi )} + op (n−1/2 ), n i=1 π(xi ) (21) where m(x) = E (Y | x) and π(x) = P(δ = 1 | x). Furthermore, we have √ n θ̂NPS − θ → N 0, σ12 , where σ12 = V {m (X )} + E {π (X )}−1 V (Y | X ) . Originally proved by Hirano et al. (2003). Jae-Kwang Kim (ISU) Part 3 130 / 181 6. Nonparametric method Remark • Unlike the usual asymptotic for nonparametric regression, √ n-consistency was established. • The nonparametric PSA estimator achieves the lower bound of the variance in (13) that was discussed in Theorem 2. • Instead of nonparametric PSA method, we can use the same Kernel regression technique to obtain a nonparametric imputation estimator given by θ̂NPI = where n 1X {δi yi + (1 − δi )m̂h (xi )} n i=1 (22) Pn δi Kh (xi , x)yi m̂h (x) = Pi=1 . n i=1 δi Kh (xi , x) Cheng (1994) proves that θ̂NPI has the same asymptotic variance in Theorem 3. Jae-Kwang Kim (ISU) Part 3 131 / 181 Application to longitudinal missing Basic Setup • Xi is always observed and remains unchanged for t = 0, 1, . . . , T . • Yit is the response for subject i at time t. • δit : The response indicator for subject i at time t. • Assuming no missing in the baseline year, Y0 can be absorbed into X . • Monotone missing pattern δit = 0 ⇒ δi,t+1 = 0, ∀t = 1, . . . , T − 1. (Xi0 , Yi1 , . . . , Yi,t )0 • Li,t = : Measurement up to t. • Parameter of interest θ is estimated by solving n X U(θ; Li,T ) = 0 i=1 for θ, under complete response. Jae-Kwang Kim (ISU) Part 3 132 / 181 Application to longitudinal missing Missing mechanism (under monotone missing pattern) • Missing completely at random (MCAR) : P(δit=1 |δi,t−1 = 1, Li,T ) = P(δit=1 |δi,t−1 = 1). • Covariate-dependent missing (CDM) : P(δit = 1|δi,t−1 = 1, Li,T ) = P(δit = 1|δi,t−1 = 1, Xi ). • Missing at random (MAR) : P(δit = 1|δi,t−1 = 1, Li,T ) = P(δit = 1|δi,t−1 = 1, Li,t−1 ). • Missing not at random (MNAR) : Missing at random does not hold. Jae-Kwang Kim (ISU) Part 3 133 / 181 Application to longitudinal missing Motivation • Panel attrition is frequently encountered in panel surveys, while classical methods often assume covariate-dependent missing, which can be unrealistic. We want to develop a PS method under MAR. • Want to make full use of available information. Jae-Kwang Kim (ISU) Part 3 134 / 181 Application to longitudinal missing Idea • Under MAR, in the longitudinal data case, we would consider the conditional probabilities: pit := P(δit = 1|δi,t−1 = 1, Li,t−1 ), t = 1, . . . , T . Then πit = t Y pij . j=1 πt then can be modeled through modeling pt with pt (Lt−1 ; φt ). Q • Once we obtain π̂iT = Tt=1 p̂it is obtained, we can use n X δiT U(θ; Li,T ) = 0 π̂ iT i=1 to obtain a consistent estimator of θ. Jae-Kwang Kim (ISU) Part 3 135 / 181 Application to longitudinal missing Score Function for Longitudinal Data Under parametric models for pt ’s, the partial likelihood for φ1 , . . . , φT is L(φ1 , . . . 
, φT ) = n Y T h Y δ piti,t (1 − pit )1−δi,t iδi,t−1 , i=1 t=1 and the corresponding score function is (S1 (φ1 ), . . . , ST (φT )), where St (φt ) = n X δi,t−1 {δit − pit (φt )} qit (φt ) = 0 i=1 where qit (φt ) = ∂logit{pit (φt )}/∂φt . Under logistic regression model such that pt = 1/{1 + exp(−φ0t Lt−1 )}, we have qit (φt ) = Lt−1 . Jae-Kwang Kim (ISU) Part 3 136 / 181 Application to longitudinal missing Remark • Zhou and Kim (2012) proposed an optimal estimator of µt = E (Yt ) incorporating all available information. • The idea can be extended to non-monotone missing data by re-defining πit = P (δi1 = · · · = δit = 1 | Lit ) = t Y pij j=1 where pit := P(δit = 1|δi1 = · · · = δi,t−1 = 1, Li,t−1 ). • The score equation for φt in pit = p(Li,t−1 ; φt ) is then St (φt ) = n X ∗ δi,t−1 {δit − pit (φt )} qit (φt ) = 0 i=1 where ∗ δi,t−1 = Jae-Kwang Kim (ISU) Qt−1 j=1 δij and qit (φt ) = ∂logit{pit (φt )}/∂φt . Part 3 137 / 181 8. Concluding remarks • Uses a model for the response probability. • Parameter estimation for response model can be implemented using the idea of maximum likelihood method. • GLS method can be used to incorporate the auxiliary information. • DR procedure offers some protection against misspecification of one model or the other. • Can be extended to nonignorable missing when the parameters are identifiable (Part 4). Jae-Kwang Kim (ISU) Part 3 138 / 181 REFERENCES Cheng, P. E. (1994), ‘Nonparametric estimation of mean functionals with data missing at random’, Journal of the American Statistical Association 89, 81–87. Dempster, A. P., N. M. Laird and D. B. Rubin (1977), ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society: Series B 39, 1–37. Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’, Philosophical Transactions of the Royal Society of London A 222, 309–368. Fuller, W. A., M. M. Loughin and H. D. Baker (1994), ‘Regression weighting in the presence of nonresponse with application to the 1987-1988 Nationwide Food Consumption Survey’, Survey Methodology 20, 75–85. Hirano, K., G. Imbens and G. Ridder (2003), ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica 71, 1161–1189. Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’, Journal of the American Statistical Association 85, 765–769. Kim, J. K. (2011), ‘Parametric fractional imputation for missing data analysis’, Biometrika 98, 119–132. Kim, J. K. and C. L. Yu (2011), ‘A semi-parametric estimation of mean functionals with non-ignorable missing data’, Journal of the American Statistical Association 106, 157–165. Jae-Kwang Kim (ISU) Part 3 138 / 181 Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), ‘On the bias of the multiple imputation variance estimator in survey sampling’, Journal of the Royal Statistical Society: Series B 68, 509–521. Kim, J. K. and M. K. Riddles (2012), ‘Some theory for propensity-score-adjustment estimators in survey sampling’, Survey Methodology 38, 157–165. Kott, P. S. and T. Chang (2010), ‘Using calibration weighting to adjust for nonignorable unit nonresponse’, Journal of the American Statistical Association 105, 1265–1275. Louis, T. A. (1982), ‘Finding the observed information matrix when using the EM algorithm’, Journal of the Royal Statistical Society: Series B 44, 226–233. Meng, X. L. (1994), ‘Multiple-imputation inferences with uncongenial sources of input (with discussion)’, Statistical Science 9, 538–573. Oakes, D. 
(1999), ‘Direct calculation of the information matrix via the em algorithm’, Journal of the Royal Statistical Society: Series B 61, 479–482. Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and applications, in ‘Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability’, Vol. 1, University of California Press, Berkeley, California, pp. 695–715. Redner, R. A. and H. F. Walker (1984), ‘Mixture densities, maximum likelihood and the EM algorithm’, SIAM Review 26, 195–239. Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), ‘Estimation of regression coefficients when some regressors are not always observed’, Journal of the American Statistical Association 89, 846–866. Jae-Kwang Kim (ISU) Part 3 138 / 181 Robins, J. M. and N. Wang (2000), ‘Inference for imputation estimators’, Biometrika 87, 113–124. Rubin, D. B. (1976), ‘Inference and missing data’, Biometrika 63, 581–590. Tanner, M. A. and W. H. Wong (1987), ‘The calculation of posterior distribution by data augmentation’, Journal of the American Statistical Association 82, 528–540. Wang, N. and J. M. Robins (1998), ‘Large-sample theory for parametric multiple imputation procedures’, Biometrika 85, 935–948. Wang, S., J. Shao and J. K. Kim (2014), ‘Identifiability and estimation in problems with nonignorable nonresponse’, Statistica Sinica 24, 1097 – 1116. Wei, G. C. and M. A. Tanner (1990), ‘A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms’, Journal of the American Statistical Association 85, 699–704. Zhou, M. and J. K. Kim (2012), ‘An efficient method of estimation for longitudinal surveys with monotone missing data’, Biometrika 99, 631–648. Statistical Methods for Handling Missing Data Part 4: Nonignorable missing Jae-Kwang Kim Department of Statistics, Iowa State University Jae-Kwang Kim (ISU) Part 4 139 / 181 Observed likelihood • (X , Y ): random variable, y is subject to missingness • f (y | x; θ): model of y on x • g (δ | x, y ; φ): model of δ on (x, y ) • Observed likelihood Lobs (θ, φ) = Y f (yi | xi ; θ) g (δi | xi , yi ; φ) δi =1 × YZ f (yi | xi ; θ) g (δi | xi , yi ; φ) dyi δi =0 • Under what conditions the parameters are identifiable ? Jae-Kwang Kim (ISU) Part 4 140 / 181 Lemma Suppose that we can decompose the covariate vector x into two parts, u and z, such that g (δ|y , x) = g (δ|y , u) (23) and, for any given u, there exist zu,1 and zu,2 such that f (y |u, z = zu,1 ) 6= f (y |u, z = zu,2 ). (24) Under some other minor conditions, all the parameters in f and g are identifiable. Jae-Kwang Kim (ISU) Part 4 141 / 181 Remark • Condition (23) means δ ⊥ z | y , u. • That is, given (y , u), z does not help in explaining δ. • Thus, z plays the role of instrumental variable in econometrics: f (y ∗ | x ∗ , z ∗ ) = f (y ∗ | x ∗ ), Cov (z ∗ , x ∗ ) 6= 0. Here, y ∗ = δ, x ∗ = (y , u), and z ∗ = z. • We may call z the nonresponse instrument variable. • Rigorous theory developed by Wang et al. (2014). 
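Before turning to specific estimation strategies, it may help to see how the observed likelihood above can be evaluated numerically. The sketch below (Python, assuming numpy and scipy are available) computes the observed log-likelihood for a normal outcome model f(y | u, z; θ) and a logistic response model that depends on (u, y) but not on z, so z plays the role of the nonresponse instrument in the sense of condition (23); the integral over the missing y is approximated by Gauss-Hermite quadrature. All variable and function names here are illustrative assumptions, not notation from the notes.

import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.optimize import minimize
from scipy.special import expit

nodes, wq = hermgauss(30)                        # Gauss-Hermite nodes/weights for the missing-y integral

def neg_obs_loglik(par, u, z, y, delta):
    # par = (b0, b1, b2, log_sigma, phi0, phi1, phi2)
    b0, b1, b2, log_s, p0, p1, p2 = par
    s = np.exp(log_s)
    mu = b0 + b1 * u + b2 * z                    # outcome model: y | u, z ~ N(mu, s^2)
    y_fill = np.where(delta == 1, y, 0.0)        # y enters only through the respondent term
    eta = p0 + p1 * u + p2 * y_fill              # response model: logit Pr(delta = 1 | u, y), no z
    ll_resp = (-0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((y_fill - mu) / s) ** 2
               - np.logaddexp(0.0, -eta))        # log f(y | x; theta) + log pi(u, y; phi)
    yq = mu[:, None] + np.sqrt(2.0) * s * nodes[None, :]
    pi_q = expit(p0 + p1 * u[:, None] + p2 * yq)
    ll_miss = np.log((wq[None, :] * (1.0 - pi_q)).sum(axis=1) / np.sqrt(np.pi))
    return -np.sum(np.where(delta == 1, ll_resp, ll_miss))

# illustrative data: z shifts y but does not enter the response model
rng = np.random.default_rng(1)
n = 500
u, z = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 1.0 * u + 1.0 * z + rng.normal(size=n)
delta = rng.binomial(1, expit(0.5 + 0.5 * u - 0.7 * y))   # nonignorable: response depends on y
y = np.where(delta == 1, y, np.nan)
res = minimize(neg_obs_loglik, np.zeros(7), args=(u, z, y, delta), method="BFGS")
print(res.x)

This direct-quadrature evaluation is only one way to work with Lobs(θ, φ); the fractional-imputation EM algorithm sketched in the slides that follow (Example 6.2 of KS) avoids repeated numerical integration inside the optimizer.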
Jae-Kwang Kim (ISU) Part 4 142 / 181 Parameter estimation under the existence of nonresponse instrument variable • Full likelihood-based ML estimation • Generalized method of moment (GMM) approach (Section 6.3 of KS) • Conditional likelihood approach (Section 6.2 of KS) • Pseudo likelihood approach (Section 6.4 of KS) • Exponential tilting method (Section 6.5 of KS) • Latent variable approach (Section 6.6 of KS) Jae-Kwang Kim (ISU) Part 4 143 / 181 Full likelihood-based ML estimation Example (Example 6.2 of KS) • θ: parameter of interest in f (y | x; θ). • x is always observed and y is subject to missingness. • Response probability is nonignorable πi = π(xi , yi ; φ) with logit(πi ) = φ01 xi + φ02 yi . • To guarantee identifiability, we may need Pr (δ = 1 | x, y ) = Pr (δ = 1 | u, y ), where x = (u, z). Jae-Kwang Kim (ISU) Part 4 144 / 181 Full likelihood-based ML estimation Example (Cont’d) • Once the model is identified, we can use the following EM algorithm by fractional imputation 1 2 ∗(1) ∗(m) from h (yi | xi ). Generate yi , · · · , yi Using the m imputed values generated from Step 1, compute the fractional weights by ∗(j) o f yi | xi ; θ̂(t) n ∗(j) ∗ 1 − π(xi , yi ; φ̂(t) ) (25) wij(t) ∝ ∗(j) h yi | xi where π(xi , yi ; φ̂) is the estimated response probability evaluated at φ̂. Jae-Kwang Kim (ISU) Part 4 145 / 181 Full likelihood-based ML estimation Example (Cont’d) 3 Using the imputed data and the fractional weights, the M-step can be implemented by solving n X m X ∗(j) ∗ =0 wij(t) S θ; xi , yi (26) n o ∗(j) ∗(j) ∗ δi − π(φ; xi , yi ) x0i , yi = 0, wij(t) (27) i=1 j=1 and m n X X i=1 j=1 where S (θ; xi , yi ) = ∂ log f (yi | xi ; θ)/∂θ. 4 Set t = t + 1 and go to Step 2. Continue until convergence. Jae-Kwang Kim (ISU) Part 4 146 / 181 Basic setup • (X , Y ): random variable • θ: Defined by solving E {U(θ; X , Y )} = 0. • yi is subject to missingness δi = 1 0 if yi responds if yi is missing. • Want to find wi such that the solution θ̂w to n X δi wi U(θ; xi , yi ) = 0 i=1 is consistent for θ. Jae-Kwang Kim (ISU) Part 4 147 / 181 Basic Setup • Result 1: The choice of wi = 1 E (δi | xi , yi ) (28) makes the resulting estimator θ̂w consistent. • Result 2: If δi ∼ Bernoulli(πi ), then using wi = 1/πi also makes the resulting estimator consistent, but it is less efficient than θ̂w using wi in (28). Jae-Kwang Kim (ISU) Part 4 148 / 181 Parameter estimation : GMM method • Because z is a nonresponse instrumental variable, we may assume P(δ = 1 | x, y ) = π(φ0 + φ1 u + φ2 y ) for some (φ0 , φ1 , φ2 ). • Kott and Chang (2010): Construct a set of estimating equations such as n X i=1 δi − 1 (1, ui , zi ) = 0 π(φ0 + φ1 ui + φ2 yi ) that are unbiased to zero. • May have overidentified situation: Use the generalized method of moments (GMM). Jae-Kwang Kim (ISU) Part 4 149 / 181 Example • Suppose that we are interested in estimating the parameters in the regression model yi = β0 + β1 x1i + β2 x2i + ei (29) where E (ei | xi ) = 0. • Assume that yi is subject to missingness and assume that P(δi = 1 | x1i , xi2 , yi ) = exp(φ0 + φ1 x1i + φ2 yi ) . 1 + exp(φ0 + φ1 x1i + φ2 yi ) Thus, x2i is the nonresponse instrument variable in this setup. Jae-Kwang Kim (ISU) Part 4 150 / 181 Example (Cont’d) • A consistent estimator of φ can be obtained by solving Û2 (φ) ≡ n X i=1 δ − 1 (1, x1i , x2i ) = (0, 0, 0). π(φ; x1i , yi ) (30) Roughly speaking, the solution to (30) exists almost surely if E {∂ Û2 (φ)/∂φ} is of full rank in the neighborhood of the true value of φ. 
If x2 is vector, then (30) is overidentified and the solution to (30) does not exist. In the case, the GMM algorithm can be used. • Finding the solution to Û2 (φ) = 0 can be obtained by finding the minimizer of Q(φ) = Û2 (φ)0 Û2 (φ) or QW (φ) = Û2 (φ)0 W Û2 (φ) where W = {V (Û2 )}−1 . Jae-Kwang Kim (ISU) Part 4 151 / 181 Example (Cont’d) • Once the solution φ̂ to (30) is obtained, then a consistent estimator of β = (β0 , β1 , β2 ) can be obtained by solving Û1 (β, φ̂) ≡ n X δi {yi − β0 − β1 x1i − β2 x2i } (1, x1i , x2i ) = (0, 0, 0) π̂ i i=1 (31) for β. Jae-Kwang Kim (ISU) Part 4 152 / 181 Asymptotic Properties • The asymptotic variance of the GMM estimator φ̂ that minimizes Û2 (φ)0 Σ̂−1 Û2 (φ) is −1 0 −1 V (φ̂) ∼ = ΓΣ Γ where Γ = E {∂ Û2 (φ)/∂φ} Σ = V (Û2 ). • The variance is estimated by (Γ̂0 Σ̂−1 Γ̂)−1 , where Γ̂ = ∂ Û/∂φ evaluated at φ̂ and Σ̂ is an estimated variance-covariance matrix of Û2 (φ) evaluated at φ̂. Jae-Kwang Kim (ISU) Part 4 153 / 181 Asymptotic Properties • The asymptotic variance of β̂ obtained from (31) with φ̂ computed from the GMM can be obtained by −1 0 −1 V (θ̂) ∼ = Γa Σa Γa where E {∂ Û(θ)/∂θ} Γa = Σa = V (Û) Û = (Û10 , Û20 )0 and θ = (β, φ). Jae-Kwang Kim (ISU) Part 4 154 / 181 Likelihood-based approach • A classical way of likelihood-based approach for parameter estimation under non ignorable nonresponse is to maximize Lobs (θ, φ) with respect to (θ, φ), where Y Lobs (θ, φ) = f (yi | xi ; θ) g (δi | xi , yi ; φ) δi =1 × YZ f (yi | xi ; θ) g (δi | xi , yi ; φ) dyi δi =0 • Such approach can be called full likelihood-based approach because it uses full information available in the observed data. • However, it is well known that such full likelihood-based approach is quite sensitive to the failure of the assumed model. • On the other hand, partial likelihood-based approach (or conditional likelihood approach) uses a subset of the sample. Jae-Kwang Kim (ISU) Part 4 155 / 181 Conditional Likelihood approach Idea • Since f (y | x)g (δ | x, y ) = f1 (y | x, δ)g1 (δ | x), for some f1 and g1 , we can write Y Lobs (θ) = f1 (yi | xi , δi = 1) g1 (δi | xi ) δi =1 × YZ f1 (yi | xi , δi = 0) g1 (δi | xi ) dyi δi =0 = Y f1 (yi | xi , δi = 1) × n Y g1 (δi | xi ) . i=1 δi =1 • The conditional likelihood is defined to be the first component: Lc (θ) = Y f1 (yi | xi , δi = 1) = δi =1 Y R δi =1 f (yi | xi ; θ)π(xi , yi ) , f (y | xi ; θ)π(xi , y )dy where π(x, yi ) = Pr (δi = 1 | xi , yi ). Jae-Kwang Kim (ISU) Part 4 156 / 181 Conditional Likelihood approach Example • Assume that the original sample is a random sample from an exponential distribution with mean µ = 1/θ. That is, the probability density function of y is f (y ; θ) = θ exp(−θy )I (y > 0). • Suppose that we observe yi only when yi > K for a known K > 0. • Thus, the response indicator function is defined by δi = 1 if yi > K and δi = 0 otherwise. Jae-Kwang Kim (ISU) Part 4 157 / 181 Conditional Likelihood approach Example • To compute the maximum likelihood estimator from the observed likelihood, note that Sobs (θ) = X 1 δi =1 θ − yi + X 1 δi =0 θ − E (yi | δi = 0) . • Since K exp(−θK ) 1 − , θ 1 − exp(−θK ) the maximum likelihood estimator of θ can be obtained by the following iteration equation: ( ) n o−1 K exp(−K θ̂(t) ) n−r (t+1) = ȳr − , (32) θ̂ r 1 − exp(−K θ̂(t) ) P P where r = ni=1 δi and ȳr = r −1 ni=1 δi yi . 
E (Y | y > K ) = Jae-Kwang Kim (ISU) Part 4 158 / 181 Conditional Likelihood approach Example • Since πi = Pr (δi = 1 | yi ) = I (yi > K ) and E (πi ) = E {I (yi > K )} = exp(−K θ), the conditional likelihood reduces to Y θ exp{−θ(yi − K )}. δi =1 The maximum conditional likelihood estimator of θ is θ̂c = 1 . ȳr − K Since E (y | y > K ) = µ + K , the maximum conditional likelihood estimator of µ, which is µ̂c = 1/θ̂c , is unbiased for µ. Jae-Kwang Kim (ISU) Part 4 159 / 181 Conditional Likelihood approach Remark • Under some regularity conditions, the solution θ̂c that maximizes Lc (θ) satisfies L Ic1/2 (θ̂c − θ) −→ N(0, I ) where Ic (θ) = −E ∂ Sc (θ) | xi ; θ ∂θ0 = n X {E (Si πi | xi ; θ)}⊗2 E Si Si0 πi | xi ; θ − , E (πi | xi ; θ) i=1 Sc (θ) = ∂ ln Lc (θ)/∂θ, and Si (θ) = ∂ ln f (yi | xi ; θ) /∂θ. • Works only when π(x, y ) is a known function. • Does not require nonresponse instrumental variable assumption. • Popular for biased sampling problem. Jae-Kwang Kim (ISU) Part 4 160 / 181 Pseudo Likelihood approach Idea • Consider bivariate (xi , yi ) with density f (y | x; θ)h(x) where yi are subject to missingness. • We are interested in estimating θ. • Suppose that Pr (δ = 1 | x, y ) depends only on y . (i.e. x is nonresponse instrument) • Note that f (x | y , δ) = f (x | y ). • Thus, we can consider the following conditional likelihood Lc (θ) = Y f (xi | yi , δi = 1) = δi =1 Y f (xi | yi ). δi =1 • We can consider maximizing the pseudo likelihood Lp (θ) = Y R δi =1 f (yi | xi ; θ)ĥ(xi ) , f (yi | x; θ)ĥ(x)dx where ĥ(x) is a consistent estimator of the marginal density of x. Jae-Kwang Kim (ISU) Part 4 161 / 181 Pseudo Likelihood approach Idea • We may use the empirical density in ĥ(x). That is, ĥ(x) = 1/n if x = xi . In this case, Lc (θ) = Y δi =1 f (y | x ; θ) Pn i i . k=1 f (yi | xk ; θ) • We can extend the idea to the case of x = (u, z) where z is a nonresponse instrument. In this case, the conditional likelihood becomes Y i:δi =1 Jae-Kwang Kim (ISU) p(zi | yi , ui ) = Y R i:δi =1 Part 4 f (yi | ui , zi ; θ)p(zi |ui ) . f (yi | ui , z; θ)p(z|ui )dz (33) 162 / 181 Pseudo Likelihood approach • Let p̂(z|u) be an estimated conditional probability density of z given u. Substituting this estimate into the likelihood in (33), we obtain the following pseudo likelihood: Y R i:δi =1 f (yi | ui , zi ; θ)p̂(zi |ui ) . f (yi | ui , z; θ)p̂(z|ui )dz (34) • The pseudo maximum likelihood estimator (PMLE) of θ, denoted by θ̂p , can be obtained by solving Sp (θ; α̂) ≡ X [S(θ; xi , yi ) − E {S(θ; ui , z, yi ) | yi , ui ; θ, α̂}] = 0 δi =1 for θ, where S(θ; x, y ) = S(θ; u, z, y ) = ∂ log f (y | x; θ)/∂θ and R S(θ; ui , z, yi )f (yi | ui , z; θ)p(z | ui ; α̂)dz R E {S(θ; ui , z, yi ) | yi , ui ; θ, α̂} = . f (yi | ui , z; θ)p(z | ui ; α̂)dz Jae-Kwang Kim (ISU) Part 4 163 / 181 Pseudo Likelihood approach • The Fisher-scoring method for obtaining the PMLE is given by n o−1 θ̂p(t+1) = θ̂p(t) + Ip θ̂(t) , α̂ Sp (θ̂(t) , α̂) where Ip (θ, α̂) = Xh i E {S(θ; ui , z, yi )⊗2 | yi , ui ; θ, α̂} − E {S(θ; ui , z, yi ) | yi , ui ; θ, α̂}⊗2 . δi =1 • Variance estimation is very complicated. Jackknife or bootstrap can be used. Jae-Kwang Kim (ISU) Part 4 164 / 181 Exponential tilting method Motivation • Observed likelihood function can be written Lobs (φ) = n Y {πi (φ)}δi 1−δi Z {1 − πi (φ)} f (y |xi )dy , i=1 where f (y |x) is the true conditional distribution of y given x. 
• To find the MLE of φ, we solve the mean score equation S̄(φ) = 0, where S̄(φ) = n X [δi Si (φ) + (1 − δi )E {Si (φ)|xi , δi = 0}] , (35) i=1 where Si (φ) = {δi − πi (φ)}(∂logitπi (φ)/∂φ) is the score function of φ for the density g (δ | x, y ; φ) = π δ (1 − π)1−δ with π = π(x, y ; φ). Jae-Kwang Kim (ISU) Part 4 165 / 181 Motivation • The conditional expectation in (35) can be evaluated by using f (y |x, δ = 0) = f (y |x) P(δ = 0|x, y ) E {P(δ = 0|x, y )|x} (36) Two problems occur: Requires correct specification of f (y | x; θ). Known to be sensitive to the choice of f (y | x; θ). 2 Computationally heavy: Often uses Monte Carlo computation. 1 Jae-Kwang Kim (ISU) Part 4 166 / 181 Exponential tilting method Remedy (for Problem One) Idea Instead of specifying a parametric model for f (y | x), consider specifying a parametric model for f (y | x, δ = 1), denoted by f1 (y | x). In this case, R Si (φ)f1 (y | xi )O(xi , y ; φ)dy R E {Si (φ) | xi , δi = 0} = f1 (y | xi )O(xi , y ; φ)dy where O(x, y ; φ) = Jae-Kwang Kim (ISU) 1 − π(φ; x, y ) . π(φ; x, y ) Part 4 167 / 181 Remark • Based on the following identity f0 (yi | xi ) = f1 (yi | xi ) × O (xi , yi ) , E {O (xi , Yi ) | xi , δi = 1} (37) where fδ (yi | xi ) = f (yi | xi , δi = δ) and O (xi , yi ) = Pr (δi = 0 | xi , yi ) Pr (δi = 1 | xi , yi ) (38) is the conditional odds of nonresponse. • Kim and Yu (2011) considered a Kernel-based nonparametric regression method of estimating f (y | x, δ = 1) to obtain E (Y | x, δ = 0). Jae-Kwang Kim (ISU) Part 4 168 / 181 • If the response probability follows from a logistic regression model π(ui , yi ) ≡ Pr (δi = 1 | ui , yi ) = exp (φ0 + φ1 ui + φ2 yi ) , 1 + exp (φ0 + φ1 ui + φ2 yi ) (39) the expression (37) can be simplified to f0 (yi | xi ) = f1 (yi | xi ) × exp (γyi ) , E {exp (γY ) | xi , δi = 1} (40) where γ = −φ2 and f1 (y | x) is the conditional density of y given x and δ = 1. • Model (40) states that the density for the nonrespondents is an exponential tilting of the density for the respondents. The parameter γ is the tilting parameter that determines the amount of departure from the ignorability of the response mechanism. If γ = 0, the the response mechanism is ignorable and f0 (y |x) = f1 (y |x). Jae-Kwang Kim (ISU) Part 4 169 / 181 Exponential tilting method Problem Two How to compute R E {Si (φ) | xi , δi = 0} = Si (φ)O(xi , y ; φ)f1 (y | xi )dy R O(xi , y ; φ)f1 (y | xi )dy without relying on Monte Carlo computation ? Jae-Kwang Kim (ISU) Part 4 170 / 181 • Computation for Z E1 {Q(xi , Y ) | xi } = Q(xi , y )f1 (y | xi )dy . • If xi were null, then we would approximate the integration by the empirical distribution among δ = 1. • Use Z Z Q(xi , y )f1 (y | xi )dy = ∝ f1 (y | xi ) f1 (y )dy f1 (y ) X f1 (yj | xi ) Q(xi , yj ) f1 (yj ) Q(xi , y ) δj =1 where Z f1 (y ) = ∝ f1 (y | x)f (x | δ = 1)dx X f1 (y | xi ). δi =1 Jae-Kwang Kim (ISU) Part 4 171 / 181 Exponential tilting method • In practice, f1 (y | x) is unknown and is estimated by fˆ1 (y | x) = f1 (y | x; γ̂). • Thus, given γ̂, a fully efficient estimator of φ can be obtained by solving S2 (φ, γ̂) ≡ n X δi S(φ; xi , yi ) + (1 − δi )S̄0 (φ | xi ; γ̂, φ) = 0, (41) i=1 where P S̄0 (φ | xi ; γ̂, φ) = S(φ; xi , yj )f1 (yj | xi ; γ̂)O(φ; xi , yj )/fˆ1 (yj ) P ˆ δj =1 f1 (yj | xi ; γ̂)O(φ; xi , yj )/f1 (yi ) δj =1 and fˆ1 (y ) = nR−1 n X δi f1 (y | xi ; γ̂). i=1 • May use EM algorithm to solve (41) for φ. 
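To make the respondent-reweighting used in the computation of (41) concrete, the sketch below (Python, assuming numpy and scipy are available) evaluates E{q(x, Y) | x, δ = 0} for nonrespondents by tilting the respondents' values, in the spirit of the weights f1(yj | xi; γ̂) O(φ; xi, yj)/f̂1(yj). It assumes, purely for illustration, a normal respondent model f1(y | x; γ) and a logistic response model, and it uses a single covariate x in place of (u, z); the function and argument names (tilted_conditional_mean, gamma_hat, phi_hat) are hypothetical.

import numpy as np
from scipy.stats import norm

def tilted_conditional_mean(q, x_mis, x_resp, y_resp, gamma_hat, phi_hat):
    # Approximate E{ q(x, Y) | x = x_mis[i], delta = 0 } for each nonrespondent i by
    # reweighting the respondents' y-values with f1(y_j | x_i) O(x_i, y_j) / f1_hat(y_j).
    # Assumed models (illustrative): f1(y | x) = N(gamma_hat[0] + gamma_hat[1]*x, gamma_hat[2]^2)
    # and a logistic response, so O(x, y) = exp(-(phi_hat[0] + phi_hat[1]*x + phi_hat[2]*y)).
    mu_i = gamma_hat[0] + gamma_hat[1] * x_mis
    f1_ij = norm.pdf(y_resp[None, :], loc=mu_i[:, None], scale=gamma_hat[2])
    mu_k = gamma_hat[0] + gamma_hat[1] * x_resp
    f1_j = norm.pdf(y_resp[None, :], loc=mu_k[:, None], scale=gamma_hat[2]).mean(axis=0)
    odds_ij = np.exp(-(phi_hat[0] + phi_hat[1] * x_mis[:, None] + phi_hat[2] * y_resp[None, :]))
    w = f1_ij * odds_ij / f1_j[None, :]
    w /= w.sum(axis=1, keepdims=True)            # normalized fractional weights over respondent donors
    return (w * q(x_mis[:, None], y_resp[None, :])).sum(axis=1)

# e.g. predicted mean of a missing y:  tilted_conditional_mean(lambda x, y: y, x_mis, x_resp, y_resp, g, p)

Taking q(x, y) = S(φ; x, y) reproduces, under the assumed models, the conditional expectation S̄0(φ | xi; γ̂, φ) appearing in (41); taking q(x, y) = y gives a tilted prediction of a missing outcome in the sense of Kim and Yu (2011), who used a kernel estimator of f1(y | x) rather than the parametric form assumed here.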
Jae-Kwang Kim (ISU) Part 4 172 / 181
Exponential tilting method • Step 1: Using the responding part of (xi, yi), obtain γ̂ in the model f1(y | x; γ) by solving S1(γ) ≡ Σ_{δi=1} S1(γ; xi, yi) = 0. (42) • Step 2: Given γ̂ from Step 1, obtain φ̂ by solving (41): S2(φ, γ̂) = 0. • Step 3: Using φ̂ computed from Step 2, the PSA estimator of θ can be obtained by solving Σ_{i=1}^n (δi/π̂i) U(θ; xi, yi) = 0, (43) where π̂i = πi(φ̂). Jae-Kwang Kim (ISU) Part 4 173 / 181
Exponential tilting method Remark • In many cases, x is categorical and f1(y | x) can be fully nonparametric. • If x has a continuous part, nonparametric kernel smoothing can be used. • The proposed method seems to be robust against failure of the assumed model for f1(y | x; γ). • Asymptotic normality of the PSA estimator can be established, and the linearization method can be used for variance estimation (details skipped). • By augmenting the estimating function, we can also impose a calibration constraint such as Σ_{i=1}^n (δi/π̂i) xi = Σ_{i=1}^n xi. Jae-Kwang Kim (ISU) Part 4 174 / 181
Exponential tilting method Example (Example 6.5 of KS) • Assume that both xi = (zi, ui) and yi are categorical, with categories Sz × Su and Sy, respectively. • We are interested in estimating θk = Pr(Y = k) for k ∈ Sy. • Now we have nonresponse in y; let δi be the response indicator function for yi. We assume that the response probability satisfies Pr(δ = 1 | x, y) = π(u, y; φ). • To estimate φ, we first compute the observed conditional probability of y among the respondents: p̂1(y | xi) = Σ_{δj=1} I(xj = xi, yj = y) / Σ_{δj=1} I(xj = xi). Jae-Kwang Kim (ISU) Part 4 175 / 181
Exponential tilting method Example (Cont'd) • The EM algorithm can be implemented by (41) with S̄0(φ | xi; φ) = Σ_{δj=1} S(φ; δi, ui, yj) p̂1(yj | xi) O(φ; ui, yj)/p̂1(yj) / Σ_{δj=1} p̂1(yj | xi) O(φ; ui, yj)/p̂1(yj), where O(φ; u, y) = {1 − π(u, y; φ)}/π(u, y; φ) and p̂1(y) = nR⁻¹ Σ_{i=1}^n δi p̂1(y | xi), with nR = Σ_{i=1}^n δi. • Alternatively, we can use S̄0(φ | xi; φ) = Σ_{y∈Sy} S(φ; δi, ui, y) p̂1(y | xi) O(φ; ui, y) / Σ_{y∈Sy} p̂1(y | xi) O(φ; ui, y). (44) Jae-Kwang Kim (ISU) Part 4 176 / 181
Exponential tilting method Example (Cont'd) • Once π̂(u, y) = π(u, y; φ̂) is computed, we can use θ̂k,ET = n⁻¹ { Σ_{δi=1} I(yi = k) + Σ_{δi=0} Σ_{y∈Sy} wiy∗ I(y = k) }, where the fractional weights wiy∗ are computed by wiy∗ = {π̂(ui, y)⁻¹ − 1} p̂1(y | xi) / Σ_{y∈Sy} {π̂(ui, y)⁻¹ − 1} p̂1(y | xi). Jae-Kwang Kim (ISU) Part 4 177 / 181
Real Data Example Exit Poll: The Assembly election (2012, Gang-dong district in Seoul)

Gender  Age     Party A   Party B   Other   Refusal    Total
Male    20-29        93       115       4        28      240
Male    30-39       104       233       8        82      427
Male    40-49       146       295       5        49      495
Male    50-         560       350       3       174    1,087
Female  20-29       106       159       8        62      335
Female  30-39       129       242       5        70      446
Female  40-49       170       262       5        69      506
Female  50-         501       218       7       211      937
Total             1,809     1,874      45       745    4,473
Truth            62,489    57,909   1,624              122,022

Jae-Kwang Kim (ISU) Part 4 178 / 181
Comparison of the methods (%) Table: Analysis result, Gang-dong district in Seoul

Method                     Party A   Party B   Other
No adjustment                 48.5      50.3     1.2
Adjustment (Age * Sex)        49.0      49.8     1.2
New Method                    51.0      47.7     1.2
Truth                         51.2      47.5     1.3

Jae-Kwang Kim (ISU) Part 4 179 / 181
Analysis result in Seoul (48 Seats) Table: Analysis result, 48 seats in Seoul

Method                     Party A   Party B   Other
No adjustment                   10        36       2
Adjustment (Age * Sex)          10        36       2
New Method                      15        29       4
Truth                           16        30       2

Jae-Kwang Kim (ISU) Part 4 180 / 181
6. Concluding remarks • Uses a model for the response probability.
• Parameter estimation for response model can be implemented using the idea of maximum likelihood method. • Instrumental variable needed for identifiability of the response model. • Likelihood-based approach vs GMM approach • Less tools for model diagnostics or model validation • Promising areas of research Jae-Kwang Kim (ISU) Part 4 181 / 181 REFERENCES Cheng, P. E. (1994), ‘Nonparametric estimation of mean functionals with data missing at random’, Journal of the American Statistical Association 89, 81–87. Dempster, A. P., N. M. Laird and D. B. Rubin (1977), ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society: Series B 39, 1–37. Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’, Philosophical Transactions of the Royal Society of London A 222, 309–368. Fuller, W. A., M. M. Loughin and H. D. Baker (1994), ‘Regression weighting in the presence of nonresponse with application to the 1987-1988 Nationwide Food Consumption Survey’, Survey Methodology 20, 75–85. Hirano, K., G. Imbens and G. Ridder (2003), ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica 71, 1161–1189. Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’, Journal of the American Statistical Association 85, 765–769. Kim, J. K. (2011), ‘Parametric fractional imputation for missing data analysis’, Biometrika 98, 119–132. Kim, J. K. and C. L. Yu (2011), ‘A semi-parametric estimation of mean functionals with non-ignorable missing data’, Journal of the American Statistical Association 106, 157–165. Jae-Kwang Kim (ISU) Part 4 181 / 181 Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), ‘On the bias of the multiple imputation variance estimator in survey sampling’, Journal of the Royal Statistical Society: Series B 68, 509–521. Kim, J. K. and M. K. Riddles (2012), ‘Some theory for propensity-score-adjustment estimators in survey sampling’, Survey Methodology 38, 157–165. Kott, P. S. and T. Chang (2010), ‘Using calibration weighting to adjust for nonignorable unit nonresponse’, Journal of the American Statistical Association 105, 1265–1275. Louis, T. A. (1982), ‘Finding the observed information matrix when using the EM algorithm’, Journal of the Royal Statistical Society: Series B 44, 226–233. Meng, X. L. (1994), ‘Multiple-imputation inferences with uncongenial sources of input (with discussion)’, Statistical Science 9, 538–573. Oakes, D. (1999), ‘Direct calculation of the information matrix via the em algorithm’, Journal of the Royal Statistical Society: Series B 61, 479–482. Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and applications, in ‘Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability’, Vol. 1, University of California Press, Berkeley, California, pp. 695–715. Redner, R. A. and H. F. Walker (1984), ‘Mixture densities, maximum likelihood and the EM algorithm’, SIAM Review 26, 195–239. Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), ‘Estimation of regression coefficients when some regressors are not always observed’, Journal of the American Statistical Association 89, 846–866. Jae-Kwang Kim (ISU) Part 4 181 / 181 Robins, J. M. and N. Wang (2000), ‘Inference for imputation estimators’, Biometrika 87, 113–124. Rubin, D. B. (1976), ‘Inference and missing data’, Biometrika 63, 581–590. Tanner, M. A. and W. H. 
Wong (1987), ‘The calculation of posterior distribution by data augmentation’, Journal of the American Statistical Association 82, 528–540. Wang, N. and J. M. Robins (1998), ‘Large-sample theory for parametric multiple imputation procedures’, Biometrika 85, 935–948. Wang, S., J. Shao and J. K. Kim (2014), ‘Identifiability and estimation in problems with nonignorable nonresponse’, Statistica Sinica 24, 1097 – 1116. Wei, G. C. and M. A. Tanner (1990), ‘A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms’, Journal of the American Statistical Association 85, 699–704. Zhou, M. and J. K. Kim (2012), ‘An efficient method of estimation for longitudinal surveys with monotone missing data’, Biometrika 99, 631–648. Jae-Kwang Kim (ISU) Part 4 181 / 181