1. Introduction

1 Motivating examples

Example 1

• Suppose you are interested in building a linear regression of y on x:

  y_i = x_i β + e_i,   e_i ~ iid (0, σ_e²),   (1)

  where β and σ_e² are unknown parameters.

• Instead of observing (x_i, y_i) throughout the sample of size n, suppose that you first observe a response indicator variable δ_i, and then observe (x_i, y_i) if δ_i = 1 and x_i only if δ_i = 0. In this case, how do we estimate the regression parameters?

• CC (Complete-Case) method: estimate the regression parameters (by OLS) using the complete cases only (the cases with δ_i = 1). Many statistical software packages include the CC method as the default option.

• When is the CC method justified? If it is, is it the best?

• Can we draw the same conclusion if the parameter of interest is θ in the conditional distribution f(y | x; θ)?

• What if x is subject to missingness instead of y when δ_i = 0?

Example 2

• Suppose that you now fully observe (x_i, y_i), but you cannot believe that the errors in (1) are independent.

• One way to handle correlated errors is to use a random effect model:

  y_ij = x_ij β + a_i + e_ij,   (2)

  where a_i ~ (0, σ_a²) and e_ij ~ (0, σ_e²). The parameters to estimate are β, σ_a², and σ_e².

• Marginal model approach: another way of expressing (2) is to compute the marginal distribution of y given the covariate x,

  f(y | x) = ∫ f(y | x, a) g(a | x) da.

  In this case, model (2) is equivalent to E(y_ij | x_ij) = x_ij β and

  Cov(y_ij, y_i'j' | x) = σ_a² + σ_e²   if i = i' and j = j'
                        = σ_a²          if i = i' and j ≠ j'
                        = 0             otherwise.

  Here, ρ = σ_a²/(σ_e² + σ_a²) is the within-cluster correlation (intra-cluster correlation) of y after adjusting for the effect of x.

• More generally, model (2) can take the form

  y_ij ~ f_1(y | x_ij, a_i; θ_1)   and   a_i ~ f_2(a | z_i; θ_2),

  where z_i is a cluster-specific covariate. This is often called a multi-level model: f_1 is the level-one model and f_2 is the level-two model. The cluster effect a_i is sometimes called a latent variable in the sense that it is never observed. It is also called a random effect (in contrast to a fixed effect) in the sense that the cluster effect is not fixed (i.e. we are not sampling the same clusters over repeated sampling).

• There is a huge literature on estimating the parameters and predicting the random effects.

Example 3

• Now, suppose that your sample consists of G groups but you do not observe the indicator variables for group identity.

• You can view this problem as a missing data problem with x_i missing in (1), where x_i = (x_i1, ..., x_iG) and

  x_ig = 1 if i ∈ group g, and x_ig = 0 otherwise.

• The density of y (which is a vector) can be written as

  f(y) = Σ_{g=1}^G π_g f_g(y),

  where π_g = Pr(x_ig = 1) is the proportion of group g and f_g(y) is the conditional density of y given group g. We may build a parametric model for f_g(y) such as N(μ_g, Σ_g). In this case, the problem is called model-based clustering.

• In the model-based clustering problem, we may need to predict x_i by E(x_i | y_i), which involves unknown parameters. Thus, the EM algorithm can be implemented very naturally (a minimal numerical sketch is given below). The choice of G can be addressed as a model selection problem with missing data.
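Anticipating the EM algorithm of Section 3, here is a minimal Python sketch (not from the original notes) of model-based clustering with G = 2 univariate normal components. The simulated true values, sample size, and variable names are illustrative assumptions; the E-step computes the posterior group probabilities E(x_ig | y_i) and the M-step updates (π_g, μ_g, σ_g²) by weighted means.

import numpy as np

rng = np.random.default_rng(0)

# Simulate y from a G = 2 normal mixture; the latent group indicator is never used by EM.
n = 500
z = rng.binomial(1, 0.4, size=n)                       # latent group membership (unobserved)
y = np.where(z == 1, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

def normal_pdf(y, mu, sigma2):
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

# Starting values
pi_g = np.array([0.5, 0.5])
mu_g = np.array([y.min(), y.max()])
sigma2_g = np.array([y.var(), y.var()])

for t in range(200):
    # E-step: posterior group probabilities w_ig = E(x_ig | y_i) at the current parameters
    dens = np.column_stack([pi_g[g] * normal_pdf(y, mu_g[g], sigma2_g[g]) for g in range(2)])
    w = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted updates of the group proportions, means, and variances
    pi_g = w.mean(axis=0)
    mu_g = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
    sigma2_g = (w * (y[:, None] - mu_g) ** 2).sum(axis=0) / w.sum(axis=0)

print("pi:", pi_g, "mu:", mu_g, "sigma2:", sigma2_g)

The same weighted-update structure reappears later in Example 6 under the name "EM by weighting".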
Example 4

• Suppose that we are interested in estimating the effect of x on y using a parametric model f(y | x; θ). Instead of observing (x_i, y_i) in the sample, suppose that we observe (x̃_i, ỹ_i), proxy measurements of (x_i, y_i), such that x̃_i ~ N(x_i, σ_1i²) and ỹ_i ~ N(y_i, σ_2i²) with known σ_1i² and σ_2i².

• Such a problem is called a measurement error model problem. If f(y | x) is a normal model, Fuller (1987) provides a comprehensive overview of the solutions; Carroll et al. (2006) contains extensions to nonlinear models.

2 Basic Concepts

• In Example 1, the validity of the CC method is justified when

  f(y | x, δ = 1) = f(y | x).   (3)

  If only y is subject to missingness, condition (3) is often called missing at random (MAR). It means conditional independence between y and δ given x:

  y ⊥ δ | x.

• A sufficient condition for (3) is

  Pr(δ = 1 | x, y) = Pr(δ = 1 | x).   (4)

• Under MAR, we do not have to specify a model for the response mechanism, because the likelihood function factorizes into two parts: one is a function of θ in f(y | x; θ) and the other is a function of φ in Pr(δ = 1 | x; φ).

• If x is subject to missingness instead of y, then condition (4) means that MAR does not hold (the data are not missing at random), but the CC method still provides a valid estimator of the parameter in the conditional model.

Example 5

• Bivariate data (x_i, y_i) with pdf f(x, y) = f_1(y | x) f_2(x).

• x_i is always observed and y_i is subject to missingness.

• Assume that the response status variable δ_i of y_i satisfies

  P(δ_i = 1 | x_i, y_i) = Λ_1(φ_0 + φ_1 x_i + φ_2 y_i)

  for Λ_1(x) = 1 − {1 + exp(x)}^{-1}.

• Let θ be the parameter of interest in the regression model f_1(y | x; θ). Let α be the parameter in the marginal distribution of x, denoted by f_2(x_i; α). Define Λ_0(x) = 1 − Λ_1(x).

• Three parameters:

  – θ: parameter of interest
  – α and φ: nuisance parameters

• Observed likelihood:

  L_obs(θ, α, φ) = [ ∏_{δ_i=1} f_1(y_i | x_i; θ) f_2(x_i; α) Λ_1(φ_0 + φ_1 x_i + φ_2 y_i) ]
                   × [ ∏_{δ_i=0} ∫ f_1(y_i | x_i; θ) f_2(x_i; α) Λ_0(φ_0 + φ_1 x_i + φ_2 y_i) dy_i ]
                 = L_1(θ, φ) × L_2(α),

  where L_2(α) = ∏_{i=1}^n f_2(x_i; α).

• Thus, we can safely ignore the marginal distribution of x if x is completely observed.

• If φ_2 = 0, then MAR holds and L_1(θ, φ) = L_1a(θ) × L_1b(φ), where

  L_1a(θ) = ∏_{δ_i=1} f_1(y_i | x_i; θ)

  and

  L_1b(φ) = ∏_{δ_i=1} Λ_1(φ_0 + φ_1 x_i) × ∏_{δ_i=0} Λ_0(φ_0 + φ_1 x_i).

• Thus, under MAR, the MLE of θ can be obtained by maximizing L_1a(θ), which is obtained by ignoring the missing part of the data.

• If x_i is subject to missingness instead of y_i, then the observed likelihood becomes

  L_obs(θ, φ, α) = [ ∏_{δ_i=1} f_1(y_i | x_i; θ) f_2(x_i; α) Λ_1(φ_0 + φ_1 x_i + φ_2 y_i) ]
                   × [ ∏_{δ_i=0} ∫ f_1(y_i | x_i; θ) f_2(x_i; α) Λ_0(φ_0 + φ_1 x_i + φ_2 y_i) dx_i ]
                 ≠ L_1(θ, φ) × L_2(α).

• If φ_1 = 0, then L_obs(θ, α, φ) = L_1(θ, α) × L_2(φ) and MAR holds. Although we are not interested in the marginal distribution of x, we now have to specify a model for it.

• However, the CC method can still be used whenever (3) holds; in this example that is the case when φ_2 = 0, i.e. when the response probability depends only on x.

3 EM algorithm

• Examples 2 - 4 involve latent variables. Let z denote the latent variable and y the observable variable. The joint density is

  f(y, z; θ) = f_1(y | z; θ_1) f_2(z; θ_2).

• We are interested in estimating θ from the observed data (i.e. using the y_i's only). Thus, the observed likelihood often takes the form

  L_obs(θ) = ∫ f(y, z; θ) dz.

• Note that

  ∂/∂θ log L_obs(θ) = {∂/∂θ ∫ f(y, z; θ) dz} / ∫ f(y, z; θ) dz
                    = ∫ {∂/∂θ f(y, z; θ)} dz / ∫ f(y, z; θ) dz
                    = ∫ {∂/∂θ log f(y, z; θ)} f(y, z; θ) dz / ∫ f(y, z; θ) dz
                    = E{S(θ; y, Z) | y},

  where S(θ; y, Z) = ∂ log f(y, Z; θ)/∂θ. Thus, the MLE of θ can often be obtained by solving

  E{S(θ; y, Z) | y} = 0.   (5)

• Equation (5) is sometimes called the mean score equation. In computing the conditional expectation in (5), we need to know the parameter values. Thus, the following iterative algorithm can be used:

  θ̂^(t+1) ← solve E{S(θ; y, Z) | y; θ̂^(t)} = 0.

  Computing the conditional expectation at the current parameter value is called the E-step, and solving the mean score equation to update the parameter is called the M-step. This is called the EM algorithm. (A toy numerical illustration follows.)
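As a toy numerical illustration (a sketch under stated assumptions, not part of the original notes), consider Example 1 with normal errors and y_i missing for some units, where the response probability depends only on x_i, so MAR holds. The E-step replaces the missing y_i and y_i² by their conditional expectations x_i β̂^(t) and (x_i β̂^(t))² + σ̂²^(t); the M-step solves the resulting mean score equations in closed form. The simulated parameter values and sample size below are arbitrary choices; consistent with Section 2, the EM limit of β̂ coincides with the complete-case OLS estimate.

import numpy as np

rng = np.random.default_rng(1)

# Simulate Example 1 with normal errors; the true values below are arbitrary illustrations.
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true, sigma2_true = np.array([1.0, 2.0]), 1.5
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=n)

# MAR response mechanism: Pr(delta_i = 1 | x_i, y_i) depends on x_i only.
delta = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + x))))
obs, mis = delta == 1, delta == 0

beta, sigma2 = np.zeros(2), 1.0                        # starting values
for t in range(100):
    # E-step: impute E(y_i | x_i) = x_i beta^(t) for the nonrespondents (y[mis] is never used)
    y_imp = y.copy()
    y_imp[mis] = X[mis] @ beta
    # M-step for beta: solve the mean score equation (least squares on the imputed data)
    beta_new = np.linalg.solve(X.T @ X, X.T @ y_imp)
    # M-step for sigma2: E{(y_i - x_i beta)^2 | x_i} = (x_i beta^(t) - x_i beta)^2 + sigma2^(t) if missing
    resid = y_imp - X @ beta_new
    sigma2 = (np.sum(resid[obs] ** 2) + np.sum(resid[mis] ** 2) + mis.sum() * sigma2) / n
    beta = beta_new

# Under MAR the EM limit of beta agrees with the complete-case OLS estimate.
beta_cc = np.linalg.solve(X[obs].T @ X[obs], X[obs].T @ y[obs])
print("EM beta:", beta, "CC beta:", beta_cc, "EM sigma2:", sigma2)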
Example 6

• Suppose that the study variable y_i follows a Bernoulli distribution with probability of success p_i, where

  p_i = p_i(β) = exp(x_i'β) / {1 + exp(x_i'β)}

  for some unknown parameter β, and x_i is the vector of covariates in the logistic regression model for y_i. We assume that 1 is in the column space of x_i.

• Under complete response, the score function for β is

  S_1(β) = Σ_{i=1}^n {y_i − p_i(β)} x_i.

• Let δ_i be the response indicator for y_i, with distribution Bernoulli(π_i), where

  π_i = exp(x_i'φ_0 + y_i φ_1) / {1 + exp(x_i'φ_0 + y_i φ_1)}.

  We assume that x_i is always observed, but y_i is missing if δ_i = 0.

• Under missing data, the mean score function for β is

  S̄_1(β, φ) = Σ_{δ_i=1} {y_i − p_i(β)} x_i + Σ_{δ_i=0} Σ_{y=0}^{1} w_i(y; β, φ) {y − p_i(β)} x_i,

  where w_i(y; β, φ) is the conditional probability of y_i = y given x_i and δ_i = 0:

  w_i(y; β, φ) = P_β(y_i = y | x_i) P_φ(δ_i = 0 | y_i = y, x_i) / Σ_{z=0}^{1} P_β(y_i = z | x_i) P_φ(δ_i = 0 | y_i = z, x_i).

  Thus, S̄_1(β, φ) is also a function of φ.

• If the response mechanism is MAR so that φ_1 = 0, then

  w_i(y; β, φ) = P_β(y_i = y | x_i) / Σ_{z=0}^{1} P_β(y_i = z | x_i) = P_β(y_i = y | x_i)

  and so

  S̄_1(β, φ) = Σ_{δ_i=1} {y_i − p_i(β)} x_i = S̄_1(β).

• If MAR does not hold, then (β̂, φ̂) can be obtained by solving S̄_1(β, φ) = 0 and S̄_2(β, φ) = 0 jointly, where

  S̄_2(β, φ) = Σ_{δ_i=1} {δ_i − π_i(φ; x_i, y_i)} (x_i', y_i)' + Σ_{δ_i=0} Σ_{y=0}^{1} w_i(y; β, φ) {δ_i − π_i(φ; x_i, y)} (x_i', y)'.

• The EM algorithm can be implemented as follows:

  – E-step: compute

    S̄_1(β | β^(t), φ^(t)) = Σ_{δ_i=1} {y_i − p_i(β)} x_i + Σ_{δ_i=0} Σ_{j=0}^{1} w_ij^(t) {j − p_i(β)} x_i,

    where

    w_ij^(t) = Pr(Y_i = j | x_i, δ_i = 0; β^(t), φ^(t))
             = Pr(Y_i = j | x_i; β^(t)) Pr(δ_i = 0 | x_i, j; φ^(t)) / Σ_{y=0}^{1} Pr(Y_i = y | x_i; β^(t)) Pr(δ_i = 0 | x_i, y; φ^(t)),

    and

    S̄_2(φ | β^(t), φ^(t)) = Σ_{δ_i=1} {δ_i − π_i(φ; x_i, y_i)} (x_i', y_i)' + Σ_{δ_i=0} Σ_{j=0}^{1} w_ij^(t) {δ_i − π_i(φ; x_i, j)} (x_i', j)'.

  – M-step: update the parameter estimates by solving

    {S̄_1(β | β^(t), φ^(t)), S̄_2(φ | β^(t), φ^(t))} = (0, 0)

    for β and φ.

• For categorical missing data, the conditional expectation in the E-step can be computed as a weighted mean with weights w_ij^(t). Ibrahim (1990) called this method EM by weighting.

4 Take-home messages

• The CC method is justified if the response probability depends only on the covariates in the regression model.

• MAR enables the likelihood factorization and simplifies the computation of the MLE.

• Some parameters may not be identifiable from the observed data.

• The EM algorithm is an iterative method for finding the MLE without evaluating the observed likelihood function.

• Computational tools (such as Monte Carlo methods) may be needed to implement the EM algorithm (to be discussed next week).

• Inference tools (such as variance estimation) are available (not discussed here).

• Fewer tools have been developed for model diagnostics and model selection with missing data.

• Try to develop the EM algorithm for the examples in Example 2 - Example 4. (As a starting point, a sketch of the EM-by-weighting iteration for Example 6 is given below.)
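The following is a minimal Python sketch of the EM-by-weighting iteration of Example 6 under a nonignorable response mechanism (φ_1 ≠ 0). It is not from the original notes: the simulated true values, the sample size, and the small Newton-Raphson helper wlogit are illustrative assumptions. Each M-step is a weighted logistic regression on an augmented data set in which every nonrespondent contributes two pseudo-observations (y = 0 and y = 1) with weights w_i0^(t) and w_i1^(t). As the take-home messages warn, identifiability under nonignorable nonresponse is delicate, so the output should be read as an illustration of the mechanics only.

import numpy as np

rng = np.random.default_rng(2)
expit = lambda t: 1.0 / (1.0 + np.exp(-np.clip(t, -30, 30)))

def wlogit(X, y, w, iters=25):
    # Weighted logistic regression by Newton-Raphson: solves sum_i w_i {y_i - p_i(b)} x_i = 0.
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ b)
        H = X.T @ ((w * p * (1 - p))[:, None] * X)
        b = b + np.linalg.solve(H, X.T @ (w * (y - p)))
    return b

# Simulated data for Example 6; the true values and sample size are illustrative assumptions.
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                   # 1 is in the column space of x_i
y = rng.binomial(1, expit(X @ np.array([-0.5, 1.0])))  # outcome model p_i(beta)
delta = rng.binomial(1, expit(np.column_stack([X, y]) @ np.array([0.3, 0.7, -1.0])))  # phi_1 = -1 (not MAR)
obs, mis = delta == 1, delta == 0
m = int(mis.sum())

beta, phi = np.zeros(2), np.zeros(3)
for t in range(200):
    # E-step: w_i1^(t) = Pr(Y_i = 1 | x_i, delta_i = 0) at the current (beta, phi)
    p1 = expit(X[mis] @ beta)                                      # Pr(Y_i = 1 | x_i; beta)
    q0 = 1 - expit(np.column_stack([X[mis], np.zeros(m)]) @ phi)   # Pr(delta_i = 0 | x_i, y = 0; phi)
    q1 = 1 - expit(np.column_stack([X[mis], np.ones(m)]) @ phi)    # Pr(delta_i = 0 | x_i, y = 1; phi)
    w1 = p1 * q1 / (p1 * q1 + (1 - p1) * q0)
    # "EM by weighting": respondents keep weight 1; each nonrespondent contributes two
    # pseudo-observations, y = 0 with weight 1 - w_i1 and y = 1 with weight w_i1.
    Xa = np.vstack([X[obs], X[mis], X[mis]])
    ya = np.concatenate([y[obs], np.zeros(m), np.ones(m)])
    wa = np.concatenate([np.ones(int(obs.sum())), 1 - w1, w1])
    da = np.concatenate([delta[obs], np.zeros(m), np.zeros(m)])
    # M-step: solve the weighted score equations S1(beta) = 0 and S2(phi) = 0.
    beta = wlogit(Xa, ya, wa)
    phi = wlogit(np.column_stack([Xa, ya]), da, wa)

print("beta_hat:", beta, "phi_hat:", phi)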