1. Introduction
1 Motivating examples
Example 1
• Suppose you are interested in building a linear regression of y on x:
$$ y_i = x_i \beta + e_i, \qquad e_i \overset{\text{i.i.d.}}{\sim} (0, \sigma_e^2), \tag{1} $$
where β and σe² are unknown parameters.
• Instead of observing (xi, yi) for the whole sample of size n, suppose that you first observe a response indicator variable δi, and then observe (xi, yi) when δi = 1 but only xi when δi = 0. In this case, how can we estimate the regression parameters?
• CC (Complete-Case) method: estimate the regression parameters (by OLS) using only the complete cases (the cases with δi = 1). Many statistical software packages use the CC method as the default option; a minimal sketch is given after this list.
• When is the CC method justified? If it is, is it the best?
• Can we reach the same conclusion if the parameter of interest is θ in the conditional distribution f(y | x; θ)?
• What if x, rather than y, is subject to missingness when δi = 0?
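A minimal sketch of the CC estimator for model (1), using simulated data (the variable names and the MAR-type response mechanism below are illustrative, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)   # model (1) with an intercept
# response indicator: here missingness in y depends on x only (an MAR-type mechanism)
p_obs = 1 / (1 + np.exp(-(0.5 + x)))
delta = rng.binomial(1, p_obs)

# CC (complete-case) OLS: keep only the rows with delta == 1
X = np.column_stack([np.ones(n), x])
Xc, yc = X[delta == 1], y[delta == 1]
beta_cc = np.linalg.lstsq(Xc, yc, rcond=None)[0]
sigma2_cc = np.mean((yc - Xc @ beta_cc) ** 2)
print(beta_cc, sigma2_cc)
```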
Example 2
• Suppose that you now have full observation of (xi , yi ), but you cannot believe
that the errors are independent in (1).
• One way to handle correlated errors is to use a random effect model:
$$ y_{ij} = x_{ij} \beta + a_i + e_{ij}, \tag{2} $$
where ai ∼ (0, σa²) and eij ∼ (0, σe²). The parameters to estimate are β, σa², and σe².
• Marginal model approach: Another way of expressing (2) is to compute the marginal distribution of y given the covariate x:
$$ f(y \mid x) = \int f(y \mid x, a)\, g(a \mid x)\, da. $$
In this case, model (2) is equivalent to
$$ E(y_{ij} \mid x_{ij}) = x_{ij} \beta $$
and
$$ \mathrm{Cov}(y_{ij}, y_{i'j'} \mid x) =
\begin{cases}
\sigma_a^2 + \sigma_e^2 & \text{if } i = i' \text{ and } j = j', \\
\sigma_a^2 & \text{if } i = i' \text{ and } j \neq j', \\
0 & \text{otherwise.}
\end{cases} $$
Here, ρ = σa²/(σe² + σa²) is the within-cluster correlation (intra-cluster correlation) of y after adjusting for the effect of x.
• More generally, model (2) can take the form
$$ y_{ij} \sim f_1(y \mid x_{ij}, a_i; \theta_1) \quad \text{and} \quad a_i \sim f_2(a \mid z_i; \theta_2), $$
where zi is a cluster-specific covariate. This is often called a multi-level model: f1 is the level-one model and f2 is the level-two model. The cluster effect ai is sometimes called a latent variable in the sense that it is never observed. It is also called a random effect (in contrast to a fixed effect) in the sense that the cluster effect is not fixed (i.e., we are not sampling the same clusters over repeated sampling).
• There is a huge literature on estimating the parameters and predicting the random effects; a small sketch of fitting (2) is given below.
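A minimal sketch of fitting the random intercept model (2) with statsmodels (simulated data; the variable and column names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
G, m = 50, 10                              # 50 clusters, 10 observations each
cluster = np.repeat(np.arange(G), m)
a = rng.normal(scale=1.0, size=G)          # cluster effects a_i ~ (0, sigma_a^2)
x = rng.normal(size=G * m)
y = 2.0 * x + a[cluster] + rng.normal(scale=0.5, size=G * m)

df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})
fit = smf.mixedlm("y ~ x", df, groups=df["cluster"]).fit()
sigma_a2 = float(fit.cov_re.iloc[0, 0])    # estimated sigma_a^2
sigma_e2 = fit.scale                       # estimated sigma_e^2
rho = sigma_a2 / (sigma_a2 + sigma_e2)     # intra-cluster correlation
print(fit.params, rho)
```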
Example 3
• Now, suppose that your sample consists of G groups but you do not observe the
indicator variables for the group identity.
• You can view this problem as a missing data problem with xi missing in (1), where xi = (xi1, · · · , xiG) and
$$ x_{ig} = \begin{cases} 1 & \text{if } i \in \text{group } g, \\ 0 & \text{otherwise.} \end{cases} $$
• The density of y (which is a vector) can be written as
$$ f(y) = \sum_{g=1}^{G} \pi_g f_g(y), $$
where πg = Pr(xig = 1) is the proportion of group g and fg(y) is the conditional density of y given group g. We may build a parametric model for fg(y), such as N(µg, Σg). In this case, the problem is called model-based clustering.
• In the model-based clustering problem, we may need to predict xi by E(xi | yi), which involves unknown parameters. Thus, the EM algorithm can be implemented nicely; a sketch using a Gaussian mixture is given below. The choice of G can be addressed as a model selection problem with missing data.
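A minimal sketch of model-based clustering with Gaussian components, using scikit-learn's EM-based GaussianMixture (the simulated data and the choice G = 3 are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# simulate three groups with different means; the group labels are then discarded
y = np.vstack([rng.normal(loc=mu, size=(100, 2)) for mu in (-3.0, 0.0, 3.0)])

gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(y)
pi_hat = gm.weights_        # estimated group proportions pi_g
post = gm.predict_proba(y)  # E(x_i | y_i): posterior group-membership probabilities
labels = gm.predict(y)      # hard assignment to the most likely group
print(pi_hat, gm.means_)

# choice of G as model selection: compare BIC over a grid of candidate values
bics = [GaussianMixture(n_components=g, random_state=0).fit(y).bic(y) for g in range(1, 6)]
```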
Example 4
• Suppose that we are interested in estimating the effect of x on y using a parametric model f(y | x; θ). Instead of observing (xi, yi) in the sample, suppose that we observe (x̃i, ỹi), proxy measurements of (xi, yi), such that x̃i ∼ N(xi, σ1i²) and ỹi ∼ N(yi, σ2i²) with known σ1i² and σ2i².
• Such a model is called a measurement error model. If f(y | x) is a normal model, Fuller (1987) provides a comprehensive overview of the solutions; Carroll et al. (2006) contains extensions to nonlinear models. A small simulation illustrating the bias caused by measurement error is given below.
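A minimal simulation of the classical measurement error setup (a linear f(y | x; θ) with error in x only, for simplicity): the naive OLS slope computed from the proxy x̃ is attenuated, while a standard method-of-moments correction that uses the known error variance recovers the slope. The correction formula is a textbook device, not something stated in the notes.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 2000, 2.0
x = rng.normal(size=n)
y = beta * x + rng.normal(scale=0.5, size=n)
sigma1 = 0.8                                    # known measurement error s.d. for x
x_tilde = x + rng.normal(scale=sigma1, size=n)  # observed proxy of x

# naive slope using the proxy: attenuated by roughly var(x) / (var(x) + sigma1^2)
b_naive = np.cov(x_tilde, y)[0, 1] / np.var(x_tilde, ddof=1)
# method-of-moments correction exploiting the known error variance sigma1^2
b_corrected = np.cov(x_tilde, y)[0, 1] / (np.var(x_tilde, ddof=1) - sigma1**2)
print(b_naive, b_corrected)
```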
2 Basic Concepts
• In Example 1, the validity of the CC method is justified when
$$ f(y \mid x, \delta = 1) = f(y \mid x). \tag{3} $$
If only y is subject to missingness, condition (3) is often called missing at random (MAR). It means conditional independence between y and δ given x:
$$ y \perp \delta \mid x. $$
• A sufficient condition for (3) is
$$ \Pr(\delta = 1 \mid x, y) = \Pr(\delta = 1 \mid x). \tag{4} $$
• Under MAR, we do not have to specify a model for the response mechanism, because the likelihood function factorizes into two parts: one is a function of θ in f(y | x; θ) and the other is a function of φ in Pr(δ = 1 | x; φ).
• If x is subject to missingness instead of y, then condition (4) means that MAR does not hold (the data are not missing at random), but the CC method still provides a valid estimate of the parameter in the conditional model.
Example 5
• Bivariate data (xi , yi ) with pdf f (x, y) = f1 (y | x)f2 (x)
• xi is always observed and yi is subject to missingness
• Assume that the response status variable δi of yi satisfies
$$ P(\delta_i = 1 \mid x_i, y_i) = \Lambda_1(\phi_0 + \phi_1 x_i + \phi_2 y_i) $$
for Λ1(x) = 1 − {1 + exp(x)}⁻¹.
• Let θ be the parameter of interest in the regression model f1 (y | x; θ). Let α be
the parameter in the marginal distribution of x, denoted by f2 (xi ; α). Define
Λ0 (x) = 1 − Λ1 (x).
• Three parameters:
– θ: parameter of interest
– α and φ: nuisance parameters
• Observed likelihood:
$$
\begin{aligned}
L_{\mathrm{obs}}(\theta, \alpha, \phi)
&= \left[\prod_{\delta_i=1} f_1(y_i \mid x_i; \theta)\, f_2(x_i; \alpha)\, \Lambda_1(\phi_0 + \phi_1 x_i + \phi_2 y_i)\right] \\
&\quad \times \left[\prod_{\delta_i=0} \int f_1(y_i \mid x_i; \theta)\, f_2(x_i; \alpha)\, \Lambda_0(\phi_0 + \phi_1 x_i + \phi_2 y_i)\, dy_i\right] \\
&= L_1(\theta, \phi) \times L_2(\alpha),
\end{aligned}
$$
where $L_2(\alpha) = \prod_{i=1}^{n} f_2(x_i; \alpha)$.
• Thus, we can safely ignore the marginal distribution of x if x is completely
observed.
• If φ2 = 0, then MAR holds and
$$ L_1(\theta, \phi) = L_{1a}(\theta) \times L_{1b}(\phi), $$
where
$$ L_{1a}(\theta) = \prod_{\delta_i=1} f_1(y_i \mid x_i; \theta) $$
and
$$ L_{1b}(\phi) = \prod_{\delta_i=1} \Lambda_1(\phi_0 + \phi_1 x_i) \times \prod_{\delta_i=0} \Lambda_0(\phi_0 + \phi_1 x_i). $$
• Thus, under MAR, the MLE of θ can be obtained by maximizing L1a (θ), which
is obtained by ignoring the missing part of the data.
• If xi, rather than yi, is subject to missingness, then the observed likelihood becomes
$$
\begin{aligned}
L_{\mathrm{obs}}(\theta, \phi, \alpha)
&= \left[\prod_{\delta_i=1} f_1(y_i \mid x_i; \theta)\, f_2(x_i; \alpha)\, \Lambda_1(\phi_0 + \phi_1 x_i + \phi_2 y_i)\right] \\
&\quad \times \left[\prod_{\delta_i=0} \int f_1(y_i \mid x_i; \theta)\, f_2(x_i; \alpha)\, \Lambda_0(\phi_0 + \phi_1 x_i + \phi_2 y_i)\, dx_i\right] \\
&\neq L_1(\theta, \phi) \times L_2(\alpha).
\end{aligned}
$$
• If φ1 = 0, then
$$ L_{\mathrm{obs}}(\theta, \alpha, \phi) = L_1(\theta, \alpha) \times L_2(\phi) $$
and MAR holds. Although we are not interested in the marginal distribution of x, we have to specify a model for the marginal distribution of x; the factorization is written out below.
• However, we can still use the CC method because (3) holds.
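To see why a model for x is needed even though x is not of interest, it helps to write out the factorization asserted above (this step is implicit in the notes): when φ1 = 0 the response factor no longer involves the missing x and pulls out of the integral, but the integral over x still mixes f1 and f2,
$$
\begin{aligned}
L_{\mathrm{obs}}(\theta, \alpha, \phi)
&= \underbrace{\left[\prod_{\delta_i=1} f_1(y_i \mid x_i; \theta)\, f_2(x_i; \alpha)\right]
\left[\prod_{\delta_i=0} \int f_1(y_i \mid x; \theta)\, f_2(x; \alpha)\, dx\right]}_{L_1(\theta,\,\alpha)} \\
&\quad \times \underbrace{\left[\prod_{\delta_i=1} \Lambda_1(\phi_0 + \phi_2 y_i)\right]
\left[\prod_{\delta_i=0} \Lambda_0(\phi_0 + \phi_2 y_i)\right]}_{L_2(\phi)} .
\end{aligned}
$$
Maximizing L1(θ, α) requires evaluating the integrals for the δi = 0 cases, which is why f2(x; α) must be specified.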
3 EM algorithm
• Examples 2–4 involve latent variables. Let z denote the latent variable and y denote the observable variable. The joint density is
$$ f(y, z; \theta) = f_1(y \mid z; \theta_1)\, f(z; \theta_2). $$
• We are interested in estimating θ from the observed data (i.e., using the yi's only). Thus, the observed likelihood often takes the form
$$ L_{\mathrm{obs}}(\theta) = \int f(y, z; \theta)\, dz. $$
• Note that
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log L_{\mathrm{obs}}(\theta)
&= \frac{\frac{\partial}{\partial \theta} \int f(y, z; \theta)\, dz}{\int f(y, z; \theta)\, dz}
= \frac{\int \frac{\partial}{\partial \theta} f(y, z; \theta)\, dz}{\int f(y, z; \theta)\, dz} \\
&= \frac{\int \left\{\frac{\partial}{\partial \theta} \log f(y, z; \theta)\right\} f(y, z; \theta)\, dz}{\int f(y, z; \theta)\, dz}
= E\{S(\theta; y, Z) \mid y\},
\end{aligned}
$$
where S(θ; y, z) = ∂ log f(y, z; θ)/∂θ. Thus, the MLE of θ can often be obtained by solving
$$ E\{S(\theta; y, Z) \mid y\} = 0. \tag{5} $$
• Equation (5) is sometimes called the mean score equation. Computing the conditional expectation in (5) requires the parameter values, so the following iterative algorithm can be used:
$$ \hat\theta^{(t+1)} \leftarrow \text{solution of } E\{S(\theta; y, Z) \mid y; \hat\theta^{(t)}\} = 0. $$
Computing the conditional expectation at the current parameter value is called the E-step, and solving the mean score equation to update the parameter is called the M-step. This is called the EM algorithm; a generic sketch of the iteration is given below.
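A generic skeleton of this iteration (the functions `e_step` and `m_step` are placeholders to be supplied for a particular model; this only sketches the control flow):

```python
import numpy as np

def em(theta0, e_step, m_step, tol=1e-8, max_iter=500):
    """Generic EM iteration for solving the mean score equation.

    e_step(theta) returns the conditional expectations (e.g. weights)
    needed at the current parameter value; m_step(expectations) returns
    the updated parameter solving the imputed/weighted score equation.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        expectations = e_step(theta)      # E-step: E{S(theta; y, Z) | y; theta^(t)}
        theta_new = m_step(expectations)  # M-step: solve the mean score equation
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```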
Example 6
• Suppose that the study variable y follows a Bernoulli distribution with probability of success pi, where
$$ p_i = p_i(\beta) = \frac{\exp(x_i' \beta)}{1 + \exp(x_i' \beta)} $$
for some unknown parameter β, and xi is the vector of covariates in the logistic regression model for yi. We assume that 1 is in the column space of the xi's.
• Under complete response, the score function for β is
$$ S_1(\beta) = \sum_{i=1}^{n} \{y_i - p_i(\beta)\}\, x_i. $$
• Let δi be the response indicator for yi, with distribution Bernoulli(πi) where
$$ \pi_i = \frac{\exp(x_i' \phi_0 + y_i \phi_1)}{1 + \exp(x_i' \phi_0 + y_i \phi_1)}. $$
We assume that xi is always observed, but yi is missing if δi = 0.
• Under missing data, the mean score function for β is
$$ \bar S_1(\beta, \phi) = \sum_{\delta_i=1} \{y_i - p_i(\beta)\}\, x_i + \sum_{\delta_i=0} \sum_{y=0}^{1} w_i(y; \beta, \phi)\, \{y - p_i(\beta)\}\, x_i, $$
where wi(y; β, φ) is the conditional probability of yi = y given xi and δi = 0:
$$ w_i(y; \beta, \phi) = \frac{P_\beta(y_i = y \mid x_i)\, P_\phi(\delta_i = 0 \mid y_i = y, x_i)}{\sum_{z=0}^{1} P_\beta(y_i = z \mid x_i)\, P_\phi(\delta_i = 0 \mid y_i = z, x_i)}. $$
Thus, S̄1(β, φ) is also a function of φ.
• If the response mechanism is MAR so that φ1 = 0, then
$$ w_i(y; \beta, \phi) = \frac{P_\beta(y_i = y \mid x_i)}{\sum_{z=0}^{1} P_\beta(y_i = z \mid x_i)} = P_\beta(y_i = y \mid x_i) $$
and so
$$ \bar S_1(\beta, \phi) = \sum_{\delta_i=1} \{y_i - p_i(\beta)\}\, x_i = \bar S_1(\beta). $$
• If MAR does not hold, then (β̂, φ̂) can be obtained by solving S̄1(β, φ) = 0 and S̄2(β, φ) = 0 jointly, where
$$ \bar S_2(\beta, \phi) = \sum_{\delta_i=1} \{\delta_i - \pi_i(\phi; x_i, y_i)\}\, (x_i', y_i)' + \sum_{\delta_i=0} \sum_{y=0}^{1} w_i(y; \beta, \phi)\, \{\delta_i - \pi_i(\phi; x_i, y)\}\, (x_i', y)'. $$
• The EM algorithm can be implemented as follows:
– E-step: compute
$$ \bar S_1(\beta \mid \beta^{(t)}, \phi^{(t)}) = \sum_{\delta_i=1} \{y_i - p_i(\beta)\}\, x_i + \sum_{\delta_i=0} \sum_{j=0}^{1} w_{ij}^{(t)}\, \{j - p_i(\beta)\}\, x_i, $$
where
$$ w_{ij}^{(t)} = \Pr(Y_i = j \mid x_i, \delta_i = 0; \beta^{(t)}, \phi^{(t)}) = \frac{\Pr(Y_i = j \mid x_i; \beta^{(t)})\, \Pr(\delta_i = 0 \mid x_i, j; \phi^{(t)})}{\sum_{y=0}^{1} \Pr(Y_i = y \mid x_i; \beta^{(t)})\, \Pr(\delta_i = 0 \mid x_i, y; \phi^{(t)})}, $$
and
$$ \bar S_2(\phi \mid \beta^{(t)}, \phi^{(t)}) = \sum_{\delta_i=1} \{\delta_i - \pi_i(x_i, y_i; \phi)\}\, (x_i', y_i)' + \sum_{\delta_i=0} \sum_{j=0}^{1} w_{ij}^{(t)}\, \{\delta_i - \pi_i(x_i, j; \phi)\}\, (x_i', j)'. $$
– M-step: update the parameter estimates by solving
$$ \bar S_1(\beta \mid \beta^{(t)}, \phi^{(t)}) = 0, \qquad \bar S_2(\phi \mid \beta^{(t)}, \phi^{(t)}) = 0 $$
for β and φ.
• For categorical missing data, the conditional expectation in the E-step can be computed as a weighted mean with weights w_{ij}^{(t)}. Ibrahim (1990) called this method EM by weighting; a minimal sketch for this example is given below.
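The following is a minimal sketch of EM by weighting for this example on simulated data (only the formulas above come from the notes; the data generation and the numerical M-step are illustrative). Each missing yi contributes two weighted rows (j = 0, 1), and the M-step solves the two weighted logistic score equations. Identifiability of φ under a nonignorable mechanism can be delicate in practice (see the take-home messages); the sketch only illustrates the mechanics.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def wlogit(X, y, w, init):
    """Weighted logistic regression MLE (used for both M-step updates)."""
    def negloglik(b):
        eta = X @ b
        return -np.sum(w * (y * eta - np.logaddexp(0.0, eta)))
    return minimize(negloglik, init, method="BFGS").x

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])           # covariates including an intercept
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, sigmoid(X @ beta_true))
phi_true = np.array([0.5, 0.5, -1.0])          # response model in (intercept, x, y)
delta = rng.binomial(1, sigmoid(np.column_stack([X, y]) @ phi_true))
y_obs = np.where(delta == 1, y, -1)            # -1 marks a missing y

beta, phi = np.zeros(2), np.zeros(3)
for _ in range(100):
    # E-step: w_ij = Pr(y_i = j | x_i, delta_i = 0) for the missing cases
    mis = delta == 0
    p1 = sigmoid(X[mis] @ beta)                # Pr(y = 1 | x; beta)
    q = [sigmoid(np.column_stack([X[mis], np.full(mis.sum(), j)]) @ phi) for j in (0, 1)]
    num0 = (1 - p1) * (1 - q[0])               # Pr(y = 0 | x) Pr(delta = 0 | x, 0)
    num1 = p1 * (1 - q[1])                     # Pr(y = 1 | x) Pr(delta = 0 | x, 1)
    w1 = num1 / (num0 + num1)

    # augmented data: observed rows (weight 1) plus two weighted rows per missing case
    Xa = np.vstack([X[~mis], X[mis], X[mis]])
    ya = np.concatenate([y_obs[~mis], np.zeros(mis.sum()), np.ones(mis.sum())])
    wa = np.concatenate([np.ones((~mis).sum()), 1 - w1, w1])
    da = np.concatenate([np.ones((~mis).sum()), np.zeros(2 * mis.sum())])

    # M-step: weighted logistic regressions for beta (y on x) and phi (delta on (x, y))
    beta_new = wlogit(Xa, ya, wa, beta)
    phi_new = wlogit(np.column_stack([Xa, ya]), da, wa, phi)
    done = max(np.max(np.abs(beta_new - beta)), np.max(np.abs(phi_new - phi))) < 1e-6
    beta, phi = beta_new, phi_new
    if done:
        break

print(beta, phi)
```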
4 Take-home messages
• The CC method is justified if the response probability depends only on the covariates in the regression model.
• MAR enables the likelihood factorization and simplifies the computation for
MLE.
• Some parameters may not be identifiable from the observed data.
• EM algorithm is an iterative method of finding the MLE without evaluating the
observed likelihood function.
• Computational tools (such as Monte Carlo methods) may be needed to implement the EM algorithm (to be discussed next week).
• Inference tools (such as variance estimation) are available. (Not discussed here.)
• Fewer tools have been developed for model diagnostics or model selection with missing data.
• Try to develop an EM algorithm for each of Examples 2–4.