Statistical Methods for Handling Missing Data
Jae-Kwang Kim
Department of Statistics, Iowa State University
July 5th, 2014
Outline
Textbook: “Statistical Methods for Handling Incomplete Data” by Kim and Shao (2013)
• Part 1: Basic Theory (Chapter 2-3)
• Part 2: Imputation (Chapter 4)
• Part 3: Propensity score approach (Chapter 5)
• Part 4: Nonignorable missing (Chapter 6)
Statistical Methods for Handling Missing Data
Part 1: Basic Theory
Jae-Kwang Kim
Department of Statistics, Iowa State University
1 Introduction
Definitions for likelihood theory
• The likelihood function of θ is defined as
L(θ) = f (y; θ)
where f (y; θ) is the joint pdf of y.
• θ̂ is the maximum likelihood estimator (MLE) of θ if it satisfies
  L(θ̂) = max_{θ∈Θ} L(θ).
• A parametric family of densities, P = {f(y; θ); θ ∈ Θ}, is identifiable if, for all y, f(y; θ1) ≠ f(y; θ2) for every θ1 ≠ θ2.
1 Introduction - Fisher information
Definition
1. Score function: S(θ) = ∂ log L(θ)/∂θ.
2. Fisher information (curvature of the log-likelihood):
   I(θ) = −∂² log L(θ)/∂θ∂θᵀ = −∂S(θ)/∂θᵀ.
3. Observed (Fisher) information: I(θ̂n), where θ̂n is the MLE.
4. Expected (Fisher) information: ℐ(θ) = Eθ{I(θ)}.
• The observed information is always positive. The observed information applies to a
single dataset.
• The expected information is meaningful as a function of θ across the admissible
values of θ. The expected information is an average quantity over all possible
datasets.
• I(θ̂) = ℐ(θ̂) for the exponential family.
1 Introduction - Fisher information
Lemma 1. Properties of score functions [Theorem 2.3 of KS]
Under regularity conditions allowing the exchange of the order of integration and
differentiation,
Eθ{S(θ)} = 0 and Vθ{S(θ)} = ℐ(θ).
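A small numerical check of Lemma 1 (not part of the lecture material): it simulates from an assumed exponential model and verifies that the score has mean approximately zero and variance close to the expected information.

```python
# Sketch (assumption: exponential model f(y; theta) = theta * exp(-theta * y)):
# verify E{S(theta)} = 0 and V{S(theta)} = I(theta) by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
theta0, n, n_rep = 2.0, 50, 20000

# Score of one sample: S(theta) = n/theta - sum(y); expected info: n/theta^2.
scores = np.empty(n_rep)
for r in range(n_rep):
    y = rng.exponential(scale=1.0 / theta0, size=n)
    scores[r] = n / theta0 - y.sum()

print("mean of S(theta0):", scores.mean())      # approx 0
print("var of S(theta0): ", scores.var())       # approx n / theta0^2
print("expected info:    ", n / theta0**2)
```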
1 Introduction - Fisher information
Remark
• Under some regularity conditions, the MLE θ̂ converges in probability to the true parameter θ0.
• Thus, we can apply a Taylor linearization to S(θ̂) = 0 to get
  θ̂ − θ0 ≅ {ℐ(θ0)}⁻¹ S(θ0).
Here, we use the fact that I(θ) = −∂S(θ)/∂θᵀ converges in probability to ℐ(θ).
• Thus, the (asymptotic) variance of the MLE is
  V(θ̂) ≐ {ℐ(θ0)}⁻¹ V{S(θ0)} {ℐ(θ0)}⁻¹ = {ℐ(θ0)}⁻¹,
where the last equality follows from Lemma 1.
2 Observed Likelihood
Basic Setup
• Let y = (y1, . . . , yp) be a p-dimensional random vector with probability density function f(y; θ), with dominating measure µ.
• Let δij be the response indicator of yij, with δij = 1 if yij is observed and δij = 0 otherwise.
• δi = (δi1, · · · , δip): p-dimensional random vector with density P(δ | y), assuming P(δ | y) = P(δ | y; φ) for some φ.
• Let (yi,obs, yi,mis) be the observed part and missing part of yi, respectively.
• Let R(yobs, δ) = {y; yobs(yi, δi) = yi,obs, i = 1, . . . , n} be the set of all possible values of y with the same realized value of yobs, for given δ, where yobs(yi, δi) is the function that returns the values yij with δij = 1.
2 Observed Likelihood
Definition: Observed likelihood
Under the above setup, the observed likelihood of (θ, φ) is
  Lobs(θ, φ) = ∫_{R(yobs,δ)} f(y; θ) P(δ | y; φ) dµ(y).
Under the IID setup, the observed likelihood is
  Lobs(θ, φ) = ∏_{i=1}^n ∫ f(yi; θ) P(δi | yi; φ) dµ(yi,mis),
where it is understood that, if yi = yi,obs and yi,mis is empty, there is nothing to integrate out.
• In the special case of scalar y, the observed likelihood is
  Lobs(θ, φ) = ∏_{δi=1} [f(yi; θ) π(yi; φ)] × ∏_{δi=0} ∫ f(y; θ) {1 − π(y; φ)} dy,
where π(y; φ) = P(δ = 1 | y; φ).
2 Observed Likelihood
Example 1 [Example 2.3 of KS]
Let t1, t2, · · · , tn be an IID sample from a distribution with density fθ(t) = θ e^{−θt} I(t > 0). Instead of observing ti, we observe (yi, δi), where
  yi = ti if δi = 1,  yi = c if δi = 0,
and
  δi = 1 if ti ≤ c,  δi = 0 if ti > c,
where c is a known censoring time. The observed likelihood for θ can be derived as
  Lobs(θ) = ∏_{i=1}^n [{fθ(ti)}^{δi} {P(ti > c)}^{1−δi}] = θ^{Σ_{i=1}^n δi} exp(−θ Σ_{i=1}^n yi).
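A short sketch (not from the lecture; the parameter values are assumptions) that simulates censored exponential data and checks the closed-form MLE θ̂ = Σδi / Σyi against direct numerical maximization of the observed log-likelihood above.

```python
# Sketch for Example 1: observed log-likelihood of censored exponential data,
# log L_obs(theta) = (sum delta) * log(theta) - theta * sum(y).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta0, c, n = 1.5, 1.0, 500            # assumed true rate and censoring time
t = rng.exponential(scale=1.0 / theta0, size=n)
delta = (t <= c).astype(float)
y = np.where(delta == 1, t, c)

def neg_loglik(theta):
    return -(delta.sum() * np.log(theta) - theta * y.sum())

theta_closed = delta.sum() / y.sum()                           # closed-form MLE
theta_num = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded").x
print(theta_closed, theta_num)                                 # should agree
```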
2 Observed Likelihood
Definition: Missing At Random (MAR)
P(δ | y) is the density of the conditional distribution of δ given y. Let yobs = yobs(y, δ), where
  yi,obs = yi if δi = 1, and yi,obs = * if δi = 0.
The response mechanism is MAR if P(δ | y1) = P(δ | y2) {or, equivalently, P(δ | y) = P(δ | yobs)} for all y1 and y2 satisfying yobs(y1, δ) = yobs(y2, δ).
• MAR: the response mechanism P(δ | y) depends on y only through yobs.
• Let y = (yobs, ymis). By Bayes' theorem,
  P(ymis | yobs, δ) = {P(δ | ymis, yobs) / P(δ | yobs)} P(ymis | yobs).
• MAR: P(ymis | yobs, δ) = P(ymis | yobs). That is, ymis ⊥ δ | yobs.
• MAR: the conditional independence of δ and ymis given yobs.
2 Observed Likelihood
Remark
• MCAR (Missing completely at random): P(δ | y) does not depend on y.
• MAR (Missing at random): P(δ | y) = P(δ | yobs).
• NMAR (Not missing at random): P(δ | y) ≠ P(δ | yobs).
• Thus, MCAR is a special case of MAR.
2 Observed Likelihood
Theorem 1: Likelihood factorization (Rubin, 1976) [Theorem 2.4 of
KS]
Let Pφ(δ | y) be the conditional density of δ given y and fθ(y) the joint density of y. Under the conditions that
1. the parameters θ and φ are distinct, and
2. the MAR condition holds,
the observed likelihood can be written as
Lobs (θ, φ) = L1 (θ)L2 (φ),
and the MLE of θ can be obtained by maximizing L1 (θ).
Thus, we do not have to specify the model for response mechanism. The response
mechanism is called ignorable if the above likelihood factorization holds.
2 Observed Likelihood
Example 2 [Example 2.4 of KS]
• Bivariate data (xi , yi ) with pdf f (x, y ) = f1 (y | x)f2 (x).
• xi is always observed and yi is subject to missingness.
• Assume that the response status variable δi of yi satisfies
P (δi = 1 | xi , yi ) = Λ1 (φ0 + φ1 xi + φ2 yi )
for some function Λ1 (·) of known form.
• Let θ be the parameter of interest in the regression model f1 (y | x; θ). Let α be
the parameter in the marginal distribution of x, denoted by f2 (xi ; α). Define
Λ0 (x) = 1 − Λ1 (x).
• Three parameters
• θ: parameter of interest
• α and φ: nuisance parameters
2 Observed Likelihood
Example 2 (Cont’d)
• Observed likelihood
  Lobs(θ, α, φ) = [ ∏_{δi=1} f1(yi | xi; θ) f2(xi; α) Λ1(φ0 + φ1 xi + φ2 yi) ]
                × [ ∏_{δi=0} ∫ f1(y | xi; θ) f2(xi; α) Λ0(φ0 + φ1 xi + φ2 y) dy ]
                = L1(θ, φ) × L2(α),
where L2(α) = ∏_{i=1}^n f2(xi; α).
• Thus, we can safely ignore the marginal distribution of x if x is completely
observed.
2 Observed Likelihood
Example 2 (Cont’d)
• If φ2 = 0, then MAR holds and
  L1(θ, φ) = L1a(θ) × L1b(φ),
where
  L1a(θ) = ∏_{δi=1} f1(yi | xi; θ)
and
  L1b(φ) = ∏_{δi=1} Λ1(φ0 + φ1 xi) × ∏_{δi=0} Λ0(φ0 + φ1 xi).
• Thus, under MAR, the MLE of θ can be obtained by maximizing L1a (θ), which is
obtained by ignoring the missing part of the data.
2 Observed Likelihood
Example 2 (Cont’d)
• If xi, rather than yi, is subject to missingness, then the observed likelihood becomes
  Lobs(θ, φ, α) = [ ∏_{δi=1} f1(yi | xi; θ) f2(xi; α) Λ1(φ0 + φ1 xi + φ2 yi) ]
                × [ ∏_{δi=0} ∫ f1(yi | x; θ) f2(x; α) Λ0(φ0 + φ1 x + φ2 yi) dx ]
                ≠ L1(θ, φ) × L2(α).
• If φ1 = 0, then
  Lobs(θ, α, φ) = L1(θ, α) × L2(φ)
and MAR holds. Although we are not interested in the marginal distribution of x, we still need to specify a model for the marginal distribution of x.
3 Mean Score Approach
• The observed likelihood is the marginal density of (yobs, δ):
  Lobs(η) = ∫_{R(yobs,δ)} f(y; θ) P(δ | y; φ) dµ(y) = ∫ f(y; θ) P(δ | y; φ) dµ(ymis),
where ymis is the missing part of y and η = (θ, φ).
• Observed score equation:
  Sobs(η) ≡ ∂ log Lobs(η)/∂η = 0.
• Computing the observed score function can be computationally challenging because the observed likelihood involves an integral.
3 Mean Score Approach
Theorem 2: Mean Score Theorem (Fisher, 1922) [Theorem 2.5 of KS]
Under some regularity conditions, the observed score function equals the mean score function. That is,
  Sobs(η) = S̄(η),
where
  S̄(η) = E{Scom(η) | yobs, δ},
  Scom(η) = ∂ log f(y, δ; η)/∂η,
  f(y, δ; η) = f(y; θ) P(δ | y; φ).
• The mean score function is computed by taking the conditional expectation of the complete-sample score function given the observed data.
• The mean score function is easier to compute than the observed score function.
3 Mean Score Approach
Proof of Theorem 2
Since Lobs(η) = f(y, δ; η) / f(y, δ | yobs, δ; η), we have
  ∂/∂η ln Lobs(η) = ∂/∂η ln f(y, δ; η) − ∂/∂η ln f(y, δ | yobs, δ; η).
Taking the conditional expectation of the above equation over the conditional distribution of (y, δ) given (yobs, δ), we have
  ∂/∂η ln Lobs(η) = E{∂/∂η ln Lobs(η) | yobs, δ}
                  = E{Scom(η) | yobs, δ} − E{∂/∂η ln f(y, δ | yobs, δ; η) | yobs, δ}.
Here, the first equality holds because Lobs(η) is a function of (yobs, δ) only. The last term is equal to zero by Lemma 1, which states that the expected value of the score function is zero; the reference distribution in this case is the conditional distribution of (y, δ) given (yobs, δ).
3 Mean Score Approach
Example 3 [Example 2.6 of KS]
1. Suppose that the study variable y follows a normal distribution with mean x′β and variance σ². The score equations for β and σ² under complete response are
   S1(β, σ²) = Σ_{i=1}^n (yi − x′i β) xi / σ² = 0
and
   S2(β, σ²) = −n/(2σ²) + Σ_{i=1}^n (yi − x′i β)² / (2σ⁴) = 0.
2. Assume that yi is observed only for the first r elements and that the MAR assumption holds. In this case, the mean score functions reduce to
   S̄1(β, σ²) = Σ_{i=1}^r (yi − x′i β) xi / σ²
and
   S̄2(β, σ²) = −n/(2σ²) + Σ_{i=1}^r (yi − x′i β)² / (2σ⁴) + (n − r)/(2σ²).
3 Mean Score Approach
Example 3 (Cont’d)
3. The maximum likelihood estimator obtained by solving the mean score equations is
   β̂ = (Σ_{i=1}^r xi x′i)⁻¹ Σ_{i=1}^r xi yi
and
   σ̂² = (1/r) Σ_{i=1}^r (yi − x′i β̂)².
Thus, the resulting estimators can also be obtained by simply ignoring the missing part of the sample, which is consistent with the result in Example 2 (for φ2 = 0).
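A brief sketch illustrating Example 3 (the data-generating values and the MCAR response mechanism are assumptions): under MAR, the MLE of (β, σ²) is the respondent-only least-squares fit with divisor r.

```python
# Sketch for Example 3: under MAR, the MLE of (beta, sigma^2) uses respondents only.
import numpy as np

rng = np.random.default_rng(2)
n, beta, sigma = 300, np.array([1.0, 0.5]), 0.7              # assumed values
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = x @ beta + sigma * rng.normal(size=n)
delta = rng.binomial(1, 0.7, size=n)                          # MCAR response (a special case of MAR)

r = delta == 1
beta_hat = np.linalg.solve(x[r].T @ x[r], x[r].T @ y[r])      # respondent-only OLS
sigma2_hat = np.mean((y[r] - x[r] @ beta_hat) ** 2)           # divisor r, as in the slides
print(beta_hat, sigma2_hat)
```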
3 Mean Score Approach
Discussion of Example 3
• We are interested in estimating θ for the conditional density f(y | x; θ).
• Under MAR, the observed likelihood for θ is
  Lobs(θ) = ∏_{i=1}^r f(yi | xi; θ) × ∏_{i=r+1}^n ∫ f(y | xi; θ) dµ(y) = ∏_{i=1}^r f(yi | xi; θ).
• The same conclusion follows from the mean score theorem. Under MAR, the mean score function is
  S̄(θ) = Σ_{i=1}^r S(θ; xi, yi) + Σ_{i=r+1}^n E{S(θ; xi, Y) | xi} = Σ_{i=1}^r S(θ; xi, yi),
where S(θ; x, y) is the score function for θ and the second equality follows from Lemma 1.
3 Mean Score Approach
Example 4 [Example 2.5 of KS]
1. Suppose that the study variable y has a Bernoulli distribution with success probability pi, where
   pi = pi(β) = exp(x′i β) / {1 + exp(x′i β)}
for some unknown parameter β, and xi is the vector of covariates in the logistic regression model for yi. We assume that 1 is in the column space of xi.
2. Under complete response, the score function for β is
   S1(β) = Σ_{i=1}^n {yi − pi(β)} xi.
3 Mean Score Approach
Example 4 (Cont’d)
3. Let δi be the response indicator for yi, with δi ∼ Bernoulli(πi), where
   πi = exp(x′i φ0 + yi φ1) / {1 + exp(x′i φ0 + yi φ1)}.
We assume that xi is always observed, but yi is missing if δi = 0.
4. Under missing data, the mean score function for β is
   S̄1(β, φ) = Σ_{δi=1} {yi − pi(β)} xi + Σ_{δi=0} Σ_{y=0}^1 wi(y; β, φ) {y − pi(β)} xi,
where wi(y; β, φ) is the conditional probability of yi = y given xi and δi = 0:
   wi(y; β, φ) = Pβ(yi = y | xi) Pφ(δi = 0 | yi = y, xi) / Σ_{z=0}^1 Pβ(yi = z | xi) Pφ(δi = 0 | yi = z, xi).
Thus, S̄1(β, φ) is also a function of φ.
3 Mean Score Approach
Example 4 (Cont’d)
5. If the response mechanism is MAR so that φ1 = 0, then
   wi(y; β, φ) = Pβ(yi = y | xi) / Σ_{z=0}^1 Pβ(yi = z | xi) = Pβ(yi = y | xi),
and so
   S̄1(β, φ) = Σ_{δi=1} {yi − pi(β)} xi = S̄1(β).
6. If MAR does not hold, then (β̂, φ̂) can be obtained by solving S̄1(β, φ) = 0 and S̄2(β, φ) = 0 jointly, where
   S̄2(β, φ) = Σ_{δi=1} {δi − π(φ; xi, yi)} (xi, yi) + Σ_{δi=0} Σ_{y=0}^1 wi(y; β, φ) {δi − πi(φ; xi, y)} (xi, y).
3 Mean Score Approach
Discussion of Example 4
• We may not have a unique solution to S̄(η) = 0, where S̄(η) = (S̄1(β, φ), S̄2(β, φ)), when MAR does not hold, because of the non-identifiability problem associated with nonignorable missingness.
• To avoid this problem, a reduced model is often used for the response mechanism:
  Pr(δ = 1 | x, y) = Pr(δ = 1 | u, y),
where x = (u, z). The reduced response model introduces a smaller set of parameters, so the identification problem can be resolved. (More discussion will be given in the Part 4 lecture.)
• Computing the solution to S̄(η) = 0 is also difficult. The EM algorithm, which will be presented soon, is a useful computational tool.
4 Observed information
Definition
1. Observed score function: Sobs(η) = ∂ log Lobs(η)/∂η.
2. Fisher information from the observed likelihood: Iobs(η) = −∂² log Lobs(η)/∂η∂ηᵀ.
3. Expected (Fisher) information from the observed likelihood: ℐobs(η) = Eη{Iobs(η)}.
Lemma 2 [Theorem 2.6 of KS]
Under regularity conditions,
  E{Sobs(η)} = 0 and V{Sobs(η)} = ℐobs(η),
where ℐobs(η) = Eη{Iobs(η)} is the expected information from the observed likelihood.
4 Observed information
• Under missing data, the MLE η̂ is the solution to S̄(η) = 0.
• Under some regularity conditions, η̂ converges in probability to η0 and has asymptotic variance {ℐobs(η0)}⁻¹, with
  ℐobs(η) = E{−∂Sobs(η)/∂ηᵀ} = E{Sobs(η)^{⊗2}} = E{S̄(η)^{⊗2}},
where B^{⊗2} = BBᵀ.
• For variance estimation of η̂, one may use {ℐobs(η̂)}⁻¹.
• IID setup: the empirical (Fisher) information is Ĥ(η̂) = Σ_{i=1}^n S̄i^{⊗2}(η̂), and {Ĥ(η̂)}⁻¹ can be used to estimate the variance of η̂, where S̄i(η) = E{Si(η) | yi,obs, δi} (Redner and Walker, 1984).
• In general, the observed information Iobs(η̂) is preferred to the expected information ℐobs(η̂) for variance estimation of η̂.
4 Observed information
Return to Example 1
• Observed score function:
  Sobs(θ) = ∂/∂θ {(Σ_{i=1}^n δi) log θ − θ Σ_{i=1}^n yi} = Σ_{i=1}^n δi / θ − Σ_{i=1}^n yi.
• MLE for θ:
  θ̂ = Σ_{i=1}^n δi / Σ_{i=1}^n yi.
• Fisher information: Iobs(θ) = Σ_{i=1}^n δi / θ².
• Expected information: ℐobs(θ) = Σ_{i=1}^n (1 − e^{−θc}) / θ² = n(1 − e^{−θc}) / θ².
Which one do you prefer?
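A sketch comparing variance estimates based on the observed and the expected information for this censored exponential example (the simulated data and parameter values are assumptions).

```python
# Sketch: observed vs expected information for the censored exponential example.
import numpy as np

rng = np.random.default_rng(3)
theta0, c, n = 1.5, 1.0, 500
t = rng.exponential(scale=1.0 / theta0, size=n)
delta = (t <= c).astype(float)
y = np.where(delta == 1, t, c)

theta_hat = delta.sum() / y.sum()
I_obs = delta.sum() / theta_hat**2                              # observed information
I_exp = n * (1 - np.exp(-theta_hat * c)) / theta_hat**2         # expected information
print("var from I_obs:", 1 / I_obs, "  var from expected info:", 1 / I_exp)
```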
4 Observed information
Motivation
• Lcom(η) = f(y, δ; η): complete-sample likelihood with no missing data.
• Fisher information associated with Lcom(η):
  Icom(η) = −∂Scom(η)/∂ηᵀ = −∂² log Lcom(η)/∂η∂ηᵀ.
• Lobs(η): the observed likelihood.
• Fisher information associated with Lobs(η):
  Iobs(η) = −∂Sobs(η)/∂ηᵀ = −∂² log Lobs(η)/∂η∂ηᵀ.
• How can we express Iobs(η) in terms of Icom(η) and Scom(η)?
4 Observed information
Theorem 3 (Louis, 1982; Oakes, 1999) [Theorem 2.7 of KS]
Under regularity conditions allowing the exchange of the order of integration and differentiation,
  Iobs(η) = E{Icom(η) | yobs, δ} − [E{Scom(η)^{⊗2} | yobs, δ} − S̄(η)^{⊗2}]
          = E{Icom(η) | yobs, δ} − V{Scom(η) | yobs, δ},
where S̄(η) = E{Scom(η) | yobs, δ}.
Proof of Theorem 3
By Theorem 2, the observed information associated with Lobs(η) can be expressed as
  Iobs(η) = −∂S̄(η)/∂ηᵀ,
where S̄(η) = E{Scom(η) | yobs, δ; η}. Thus, we have
  ∂S̄(η)/∂ηᵀ = ∂/∂ηᵀ ∫ Scom(η; y) f(y, δ | yobs, δ; η) dµ(y)
            = ∫ {∂Scom(η; y)/∂ηᵀ} f(y, δ | yobs, δ; η) dµ(y)
              + ∫ Scom(η; y) {∂f(y, δ | yobs, δ; η)/∂ηᵀ} dµ(y)
            = E{∂Scom(η)/∂ηᵀ | yobs, δ}
              + ∫ Scom(η; y) {∂ log f(y, δ | yobs, δ; η)/∂ηᵀ} f(y, δ | yobs, δ; η) dµ(y).
The first term is equal to −E{Icom(η) | yobs, δ} and the second term is equal to
  E{Scom(η) Smis(η)ᵀ | yobs, δ} = E{(S̄(η) + Smis(η)) Smis(η)ᵀ | yobs, δ} = E{Smis(η) Smis(η)ᵀ | yobs, δ},
because E{S̄(η) Smis(η)ᵀ | yobs, δ} = 0.
4. Observed information
• Smis(η): the score function associated with the conditional density f(y, δ | yobs, δ).
• Expected missing information: ℐmis(η) = E{−∂Smis(η)/∂ηᵀ}, satisfying ℐmis(η) = E{Smis(η)^{⊗2}}.
• Missing information principle (Orchard and Woodbury, 1972):
  ℐmis(η) = ℐcom(η) − ℐobs(η),
where ℐcom(η) = E{−∂Scom(η)/∂ηᵀ} is the expected information from the complete-sample likelihood.
• An alternative expression of the missing information principle is
  V{Smis(η)} = V{Scom(η)} − V{S̄(η)}.
Note that V{Scom(η)} = ℐcom(η) and V{Sobs(η)} = ℐobs(η).
4. Observed information
Example 5
1. Consider the following bivariate normal distribution:
   (y1i, y2i)ᵀ ∼ N( (µ1, µ2)ᵀ, [σ11 σ12; σ12 σ22] ),   i = 1, 2, · · · , n.
Assume for simplicity that σ11, σ12 and σ22 are known constants, and let µ = (µ1, µ2)ᵀ be the parameter of interest.
2. The complete-sample score function for µ is
   Scom(µ) = Σ_{i=1}^n Scom^(i)(µ) = Σ_{i=1}^n [σ11 σ12; σ12 σ22]⁻¹ (y1i − µ1, y2i − µ2)ᵀ.
The information matrix of µ based on the complete sample is
   Icom(µ) = n [σ11 σ12; σ12 σ22]⁻¹.
4. Observed information
Example 5 (Cont'd)
3. Suppose that there are some missing values in y1i and y2i and the original sample is partitioned into four sets:
   H = {both y1 and y2 respond}
   K = {only y1 is observed}
   L = {only y2 is observed}
   M = {both y1 and y2 are missing}.
Let nH, nK, nL, nM denote the sizes of H, K, L, M, respectively.
4. Assume that the response mechanism does not depend on the value of (y1, y2), so it is MAR. In this case, the observed score function of µ based on a single observation in set K is
   E{Scom^(i)(µ) | y1i, i ∈ K} = [σ11 σ12; σ12 σ22]⁻¹ (y1i − µ1, E(y2i | y1i) − µ2)ᵀ
                               = (σ11⁻¹ (y1i − µ1), 0)ᵀ.
4. Observed information
Example 5 (Cont'd)
5. Similarly, we have
   E{Scom^(i)(µ) | y2i, i ∈ L} = (0, σ22⁻¹ (y2i − µ2))ᵀ.
6. Therefore, the observed information matrix of µ is
   Iobs(µ) = nH [σ11 σ12; σ12 σ22]⁻¹ + nK [σ11⁻¹ 0; 0 0] + nL [0 0; 0 σ22⁻¹],
and the asymptotic variance of the MLE of µ can be obtained as the inverse of Iobs(µ).
5. EM algorithm
• We are interested in finding η̂ that maximizes Lobs(η). The MLE can be obtained by solving Sobs(η) = 0, which is equivalent to solving S̄(η) = 0 by Theorem 2.
• Computing the solution of S̄(η) = 0 can be challenging because it often involves computing Iobs(η) = −∂S̄(η)/∂ηᵀ in order to apply the Newton method:
  η̂^(t+1) = η̂^(t) + {Iobs(η̂^(t))}⁻¹ S̄(η̂^(t)).
We may rely on the Louis formula (Theorem 3) to compute Iobs(η).
• The EM algorithm provides an alternative method of solving S̄(η) = 0 by writing
  S̄(η) = E{Scom(η) | yobs, δ; η}
and using the following iterative method:
  η̂^(t+1) ← solve E{Scom(η) | yobs, δ; η̂^(t)} = 0.
5. EM algorithm
Definition
Let η^(t) be the current value of the parameter estimate of η. The EM algorithm iterates the following E-step and M-step:
• E-step: Compute
  Q(η | η^(t)) = E{ln f(y, δ; η) | yobs, δ, η^(t)}.
• M-step: Find η^(t+1) that maximizes Q(η | η^(t)) with respect to η.
Theorem 4 (Dempster et al., 1977) [Theorem 3.2 of KS]
Let Lobs(η) = ∫_{R(yobs,δ)} f(y, δ; η) dµ(y) be the observed likelihood of η. If Q(η^(t+1) | η^(t)) ≥ Q(η^(t) | η^(t)), then Lobs(η^(t+1)) ≥ Lobs(η^(t)).
5. EM algorithm
Remark
1. The convergence of the EM algorithm is linear. It can be shown that
   η^(t+1) − η^(t) ≅ Jmis (η^(t) − η^(t−1)),
where Jmis = Icom⁻¹ Imis is called the fraction of missing information.
2. Under MAR and for an exponential family of the form
   f(y; θ) = b(y) exp{θᵀ T(y) − A(θ)},
the M-step computes θ^(t+1) as the solution to
   E{T(y) | yobs, θ^(t)} = E{T(y) | θ}.
5. EM algorithm
[Figure: Illustration of the EM algorithm for the exponential family, plotting h1(θ) = E{T(y) | yobs, θ} and h2(θ) = E{T(y) | θ} against θ.]
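To make the exponential-family fixed point E{T(y) | yobs, θ} = E{T(y) | θ} concrete, here is a sketch of the EM iteration for the censored exponential model of Example 1, where T(y) = Σ ti and E(ti | ti > c, θ) = c + 1/θ; the simulated data and the starting value are assumptions.

```python
# Sketch: EM for the censored exponential model of Example 1.
# Complete-data sufficient statistic: T = sum(t_i), with E(t_i | t_i > c, theta) = c + 1/theta.
import numpy as np

rng = np.random.default_rng(4)
theta0, c, n = 1.5, 1.0, 500
t = rng.exponential(scale=1.0 / theta0, size=n)
delta = t <= c
y = np.where(delta, t, c)

theta = 1.0                                             # assumed starting value
for _ in range(100):
    t_imputed = np.where(delta, y, c + 1.0 / theta)     # E-step: conditional expectation of t_i
    theta_new = n / t_imputed.sum()                     # M-step: solve n/theta = sum of imputed t_i
    if abs(theta_new - theta) < 1e-10:
        theta = theta_new
        break
    theta = theta_new

print(theta, delta.sum() / y.sum())                     # EM limit equals the closed-form MLE
```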
5. EM algorithm
Return to Example 4
• E-step:
  S̄1(β | β^(t), φ^(t)) = Σ_{δi=1} {yi − pi(β)} xi + Σ_{δi=0} Σ_{j=0}^1 wij^(t) {j − pi(β)} xi,
where
  wij^(t) = Pr(Yi = j | xi, δi = 0; β^(t), φ^(t))
          = Pr(Yi = j | xi; β^(t)) Pr(δi = 0 | xi, j; φ^(t)) / Σ_{y=0}^1 Pr(Yi = y | xi; β^(t)) Pr(δi = 0 | xi, y; φ^(t)),
and
  S̄2(φ | β^(t), φ^(t)) = Σ_{δi=1} {δi − π(xi, yi; φ)} (x′i, yi)′ + Σ_{δi=0} Σ_{j=0}^1 wij^(t) {δi − πi(xi, j; φ)} (x′i, j)′.
5. EM algorithm
Return to Example 4 (Cont’d)
• M-step: The parameter estimates are updated by solving
  [S̄1(β | β^(t), φ^(t)), S̄2(φ | β^(t), φ^(t))] = (0, 0)
for β and φ.
• For categorical missing data, the conditional expectation in the E-step can be computed as a weighted mean with weights wij^(t). Ibrahim (1990) called this method EM by weighting.
• The observed information matrix can also be obtained by the Louis formula (Theorem 3), using the weighted mean in the E-step.
5. EM algorithm
Example 6. Mixture model [Example 3.8 of KS]
• Observation
  Yi = (1 − Wi) Z1i + Wi Z2i,   i = 1, 2, · · · , n,
where
  Z1i ∼ N(µ1, σ1²),  Z2i ∼ N(µ2, σ2²),  Wi ∼ Bernoulli(π).
• Parameter of interest: θ = (µ1, µ2, σ1², σ2², π).
5. EM algorithm
Example 6 (Cont’d)
• Observed likelihood
  Lobs(θ) = ∏_{i=1}^n pdf(yi | θ),
where
  pdf(y | θ) = (1 − π) φ(y | µ1, σ1²) + π φ(y | µ2, σ2²)
and
  φ(y | µ, σ²) = (2πσ²)^{−1/2} exp{−(y − µ)²/(2σ²)}.
5. EM algorithm
Example 6 (Cont’d)
• Full-sample likelihood
  Lfull(θ) = ∏_{i=1}^n pdf(yi, wi | θ),
where
  pdf(y, w | θ) = {φ(y | µ1, σ1²)}^{1−w} {φ(y | µ2, σ2²)}^{w} π^{w} (1 − π)^{1−w}.
Thus,
  ln Lfull(θ) = Σ_{i=1}^n [(1 − wi) ln φ(yi | µ1, σ1²) + wi ln φ(yi | µ2, σ2²)]
              + Σ_{i=1}^n [wi ln(π) + (1 − wi) ln(1 − π)].
5. EM algorithm
Example 6 (Cont'd)
• EM algorithm
[E-step]
  Q(θ | θ^(t)) = Σ_{i=1}^n [(1 − ri^(t)) ln φ(yi | µ1, σ1²) + ri^(t) ln φ(yi | µ2, σ2²)]
               + Σ_{i=1}^n [ri^(t) ln(π) + (1 − ri^(t)) ln(1 − π)],
where ri^(t) = E(wi | yi, θ^(t)), with
  E(wi | yi, θ) = π φ(yi | µ2, σ2²) / {(1 − π) φ(yi | µ1, σ1²) + π φ(yi | µ2, σ2²)}.
[M-step]
  Solve ∂Q(θ | θ^(t))/∂θ = 0.
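A compact sketch of the E- and M-steps of Example 6 for a two-component normal mixture (the simulated data, starting values, and iteration count are assumptions).

```python
# Sketch: EM for the two-component normal mixture of Example 6.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 1000
w = rng.binomial(1, 0.3, size=n)                      # latent labels (the missing data)
y = np.where(w == 0, rng.normal(0, 1, n), rng.normal(3, 1.5, n))

mu1, mu2, s1, s2, pi = -0.5, 2.0, 1.0, 1.0, 0.5        # assumed starting values
for _ in range(200):
    # E-step: r_i = E(w_i | y_i, theta)
    d1 = (1 - pi) * norm.pdf(y, mu1, s1)
    d2 = pi * norm.pdf(y, mu2, s2)
    r = d2 / (d1 + d2)
    # M-step: weighted means, variances, and mixing proportion
    pi = r.mean()
    mu1 = np.sum((1 - r) * y) / np.sum(1 - r)
    mu2 = np.sum(r * y) / np.sum(r)
    s1 = np.sqrt(np.sum((1 - r) * (y - mu1) ** 2) / np.sum(1 - r))
    s2 = np.sqrt(np.sum(r * (y - mu2) ** 2) / np.sum(r))

print(mu1, mu2, s1, s2, pi)
```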
5. EM algorithm
Example 7. (Robust regression) [Example 3.12 of KS]
• Model: yi = β0 + β1 xi + σ ei with ei ∼ t(ν), ν known.
• Missing data setup:
  ei = ui / √wi,
where ui ∼ N(0, 1) and wi ∼ χ²ν / ν.
• (xi, yi, wi): complete data (xi, yi always observed, wi always missing), with
  yi | (xi, wi) ∼ N(β0 + β1 xi, σ²/wi),
and xi is fixed (thus, independent of wi).
• The EM algorithm can be used to estimate θ = (β0, β1, σ²).
5. EM algorithm
Example 7 (Cont'd)
• E-step: Find the conditional distribution of wi given (xi, yi). By Bayes' theorem,
  f(wi | xi, yi) ∝ f(wi) f(yi | wi, xi) ∼ Gamma( (ν + 1)/2, 2[ν + {(yi − β0 − β1 xi)/σ}²]⁻¹ ).
Thus,
  E(wi | xi, yi, θ^(t)) = (ν + 1) / (ν + di^(t)²),
where di^(t) = (yi − β0^(t) − β1^(t) xi)/σ^(t).
5. EM algorithm
Example 7 (Cont'd)
• M-step:
  (µx^(t), µy^(t)) = Σ_{i=1}^n wi^(t) (xi, yi) / Σ_{i=1}^n wi^(t),
  β1^(t+1) = Σ_{i=1}^n wi^(t) (xi − µx^(t))(yi − µy^(t)) / Σ_{i=1}^n wi^(t) (xi − µx^(t))²,
  β0^(t+1) = µy^(t) − β1^(t+1) µx^(t),
  σ²^(t+1) = (1/n) Σ_{i=1}^n wi^(t) (yi − β0^(t) − β1^(t) xi)²,
where wi^(t) = E(wi | xi, yi, θ^(t)).
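A sketch of the EM iteration of Example 7 for t-distributed errors (the simulated data and starting values are assumptions).

```python
# Sketch: EM for the robust (t-error) regression of Example 7.
import numpy as np

rng = np.random.default_rng(6)
n, nu = 500, 5                                         # assumed sample size and known df
x = rng.normal(size=n)
e = rng.standard_t(nu, size=n)
y = 1.0 + 2.0 * x + 0.5 * e                            # assumed true (beta0, beta1, sigma)

b0, b1, s2 = 0.0, 0.0, 1.0                             # assumed starting values
for _ in range(200):
    d2 = (y - b0 - b1 * x) ** 2 / s2                   # squared standardized residuals
    w = (nu + 1) / (nu + d2)                           # E-step: E(w_i | x_i, y_i, theta)
    mx = np.sum(w * x) / np.sum(w)                     # M-step: weighted least squares
    my = np.sum(w * y) / np.sum(w)
    b1 = np.sum(w * (x - mx) * (y - my)) / np.sum(w * (x - mx) ** 2)
    b0 = my - b1 * mx
    s2 = np.mean(w * (y - b0 - b1 * x) ** 2)

print(b0, b1, s2)
```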
6. Summary
• Interested in finding the MLE that maximizes the observed likelihood
function.
• Under MAR, the model specification of the response mechanism is
not necessary.
• Mean score equation can be used to compute the MLE.
• EM algorithm is a useful computational tool for solving the mean
score equation.
• The E-step of the EM algorithm may require some computational tool
(see Part 2).
• The asymptotic variance of the MLE can be computed by the inverse
of the observed information matrix, which can be computed using
Louis formula.
REFERENCES
Cheng, P. E. (1994), ‘Nonparametric estimation of mean functionals with
data missing at random’, Journal of the American Statistical
Association 89, 81–87.
Dempster, A. P., N. M. Laird and D. B. Rubin (1977), ‘Maximum
likelihood from incomplete data via the EM algorithm’, Journal of the
Royal Statistical Society: Series B 39, 1–37.
Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical
statistics’, Philosophical Transactions of the Royal Society of London A
222, 309–368.
Fuller, W. A., M. M. Loughin and H. D. Baker (1994), ‘Regression
weighting in the presence of nonresponse with application to the
1987-1988 Nationwide Food Consumption Survey’, Survey Methodology
20, 75–85.
Hirano, K., G. Imbens and G. Ridder (2003), ‘Efficient estimation of
average treatment effects using the estimated propensity score’,
Econometrica 71, 1161–1189.
Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’,
Journal of the American Statistical Association 85, 765–769.
Kim, J. K. (2011), ‘Parametric fractional imputation for missing data
analysis’, Biometrika 98, 119–132.
Kim, J. K. and C. L. Yu (2011), ‘A semi-parametric estimation of mean
functionals with non-ignorable missing data’, Journal of the American
Statistical Association 106, 157–165.
Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), ‘On the bias
of the multiple imputation variance estimator in survey sampling’,
Journal of the Royal Statistical Society: Series B 68, 509–521.
Kim, J. K. and M. K. Riddles (2012), ‘Some theory for
propensity-score-adjustment estimators in survey sampling’, Survey
Methodology 38, 157–165.
Kott, P. S. and T. Chang (2010), ‘Using calibration weighting to adjust for
nonignorable unit nonresponse’, Journal of the American Statistical
Association 105, 1265–1275.
Louis, T. A. (1982), ‘Finding the observed information matrix when using
the EM algorithm’, Journal of the Royal Statistical Society: Series B
44, 226–233.
Meng, X. L. (1994), ‘Multiple-imputation inferences with uncongenial
sources of input (with discussion)’, Statistical Science 9, 538–573.
Oakes, D. (1999), ‘Direct calculation of the information matrix via the em
algorithm’, Journal of the Royal Statistical Society: Series B
61, 479–482.
Orchard, T. and M.A. Woodbury (1972), A missing information principle:
theory and applications, in ‘Proceedings of the 6th Berkeley Symposium
on Mathematical Statistics and Probability’, Vol. 1, University of
California Press, Berkeley, California, pp. 695–715.
Redner, R. A. and H. F. Walker (1984), ‘Mixture densities, maximum
likelihood and the EM algorithm’, SIAM Review 26, 195–239.
Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), ‘Estimation of
regression coefficients when some regressors are not always observed’,
Journal of the American Statistical Association 89, 846–866.
Robins, J. M. and N. Wang (2000), ‘Inference for imputation estimators’,
Biometrika 87, 113–124.
Rubin, D. B. (1976), ‘Inference and missing data’, Biometrika
63, 581–590.
Tanner, M. A. and W. H. Wong (1987), ‘The calculation of posterior
distribution by data augmentation’, Journal of the American Statistical
Association 82, 528–540.
Wang, N. and J. M. Robins (1998), ‘Large-sample theory for parametric
multiple imputation procedures’, Biometrika 85, 935–948.
Wang, S., J. Shao and J. K. Kim (2014), ‘Identifiability and estimation in
problems with nonignorable nonresponse’, Statistica Sinica 24, 1097 –
1116.
Wei, G. C. and M. A. Tanner (1990), ‘A Monte Carlo implementation of
the EM algorithm and the poor man’s data augmentation algorithms’,
Journal of the American Statistical Association 85, 699–704.
Zhou, M. and J. K. Kim (2012), ‘An efficient method of estimation for
longitudinal surveys with monotone missing data’, Biometrika
99, 631–648.
Statistical Methods for Handling Missing Data
Part 2: Imputation
Jae-Kwang Kim
Department of Statistics, Iowa State University
Introduction
Basic setup
• Y: a vector of random variables with distribution F(y; θ).
• y1, · · · , yn are n independent realizations of Y.
• We are interested in estimating ψ, which is implicitly defined by E{U(ψ; Y)} = 0.
• Under complete observation, a consistent estimator ψ̂n of ψ can be obtained by solving the estimating equation
  Σ_{i=1}^n U(ψ; yi) = 0.
• A special case of an estimating function is the score function. In this case, ψ = θ.
• The sandwich variance estimator is often used to estimate the variance of ψ̂n:
  V̂(ψ̂n) = τ̂u⁻¹ V̂(U) (τ̂u⁻¹)′,
where τu = E{∂U(ψ; y)/∂ψ′}.
1. Introduction
Missing data setup
• Suppose that yi is not fully observed.
• yi = (yobs,i, ymis,i): (observed, missing) parts of yi.
• δi: response indicator functions for yi.
• Under the existence of missing data, we can use the following estimator:
  ψ̂: the solution to Σ_{i=1}^n E{U(ψ; yi) | yobs,i, δi} = 0.   (1)
• Equation (1) is often called the expected estimating equation.
1. Introduction
Motivation (for imputation)
Computing the conditional expectation in (1) can be a challenging problem.
1 The conditional expectation depends on unknown parameter values. That is,
E {U (ψ; yi ) | yobs,i , δ i } = E {U (ψ; yi ) | yobs,i , δ i ; θ, φ} ,
where θ is the parameter in f (y; θ) and φ is the parameter in p(δ | y; φ).
2 Even if we know η = (θ, φ), computing the conditional expectation is numerically
difficult.
1. Introduction
Imputation
• Imputation: a Monte Carlo approximation of the conditional expectation (given the observed data):
  E{U(ψ; yi) | yobs,i, δi} ≅ (1/m) Σ_{j=1}^m U(ψ; yobs,i, ymis,i^{*(j)}).
1. Bayesian approach: generate ymis,i^* from
   f(ymis,i | yobs, δ) = ∫ f(ymis,i | yobs, δ; η) p(η | yobs, δ) dη.
2. Frequentist approach: generate ymis,i^* from f(ymis,i | yobs,i, δ; η̂), where η̂ is a consistent estimator.
1. Introduction
Imputation
Questions
1. How do we generate the Monte Carlo samples (the imputed values)?
2. What is the asymptotic distribution of ψ̂I^*, which solves
   (1/m) Σ_{j=1}^m U(ψ; yobs,i, ymis,i^{*(j)}) = 0,
where ymis,i^{*(j)} ∼ f(ymis,i | yobs,i, δ; η̂p) for some η̂p?
3. How do we estimate the variance of ψ̂I^*?
2. Basic Theory for Imputation
Basic Setup (for Case 1: ψ = η)
• y = (y1, · · · , yn) ∼ f(y; θ)
• δ = (δ1, · · · , δn) ∼ P(δ | y; φ)
• y = (yobs, ymis): (observed, missing) parts of y.
• ymis^{*(1)}, · · · , ymis^{*(m)}: m imputed values of ymis generated from
  f(ymis | yobs, δ; η̂p) = f(y; θ̂p) P(δ | y; φ̂p) / ∫ f(y; θ̂p) P(δ | y; φ̂p) dµ(ymis),
where η̂p = (θ̂p, φ̂p) is a preliminary estimator of η = (θ, φ).
• Using the m imputed values, the imputed score function is computed as
  S̄imp,m^*(η | η̂p) ≡ (1/m) Σ_{j=1}^m Scom(η; yobs, ymis^{*(j)}, δ),
where Scom(η; y) is the score function of η = (θ, φ) under complete response.
2. Basic Theory for Imputation
Lemma 1 (Asymptotic results for m = ∞)
Assume that η̂p converges in probability to η. Let η̂I,m^* be the solution to
  (1/m) Σ_{j=1}^m Scom(η; yobs, ymis^{*(j)}, δ) = 0,
where ymis^{*(1)}, · · · , ymis^{*(m)} are the imputed values generated from f(ymis | yobs, δ; η̂p). Then, under some regularity conditions, for m → ∞,
  η̂imp,∞^* ≅ η̂MLE + Jmis (η̂p − η̂MLE)   (2)
and
  V(η̂imp,∞^*) ≐ Iobs⁻¹ + Jmis {V(η̂p) − V(η̂MLE)} Jmis′,
where Jmis = Icom⁻¹ Imis is the fraction of missing information.
2. Basic Theory for Imputation
Remark
• For m = ∞, the imputed score equation becomes the mean score equation.
• Equation (2) means that
  η̂imp,∞^* = (I − Jmis) η̂MLE + Jmis η̂p.   (3)
That is, η̂imp,∞^* is a convex combination of η̂MLE and η̂p.
• Note that η̂imp,∞^* is a one-step EM update with initial estimate η̂p. Let η̂^(t) be the t-th EM update of η, computed by solving
  S̄(η | η̂^(t−1)) = 0
with η̂^(0) = η̂p. Equation (3) implies that
  η̂^(t) = (I − Jmis) η̂MLE + Jmis η̂^(t−1).
Thus, we obtain
  η̂^(t) = η̂MLE + (Jmis)^t (η̂^(0) − η̂MLE),
which justifies lim_{t→∞} η̂^(t) = η̂MLE.
2. Basic Theory for Imputation
Theorem 1 (Asymptotic results for m < ∞) [Theorem 4.1 of KS]
Let η̂p be a preliminary √n-consistent estimator of η with variance Vp. Under some regularity conditions, the solution η̂imp,m^* to
  S̄m^*(η | η̂p) ≡ (1/m) Σ_{j=1}^m Scom(η; yobs, ymis^{*(j)}, δ) = 0
has mean η0 and asymptotic variance equal to
  V(η̂imp,m^*) ≐ Iobs⁻¹ + Jmis (Vp − Iobs⁻¹) Jmis′ + m⁻¹ Icom⁻¹ Imis Icom⁻¹,   (4)
where Jmis = Icom⁻¹ Imis.
This theorem was originally presented by Wang and Robins (1998).
2. Basic Theory for Imputation
Remark
• If we use η̂p = η̂MLE, then the asymptotic variance in (4) is
  V(η̂imp,m^*) ≐ Iobs⁻¹ + m⁻¹ Icom⁻¹ Imis Icom⁻¹.
• In Bayesian imputation (or multiple imputation), the posterior values of η are independently generated from η ∼ N(η̂MLE, Iobs⁻¹), which implies that we can use Vp = Iobs⁻¹ + m⁻¹ Iobs⁻¹. Thus, the asymptotic variance in (4) for multiple imputation is
  V(η̂imp,m^*) ≐ Iobs⁻¹ + m⁻¹ Jmis Iobs⁻¹ Jmis′ + m⁻¹ Icom⁻¹ Imis Icom⁻¹.
The second term is the additional price we pay for generating posterior values rather than using η̂MLE directly.
2. Basic Theory for Imputation
Basic Setup (for Case 2: ψ ≠ η)
• Parameter ψ is defined by E{U(ψ; y)} = 0.
• Under complete response, a consistent estimator of ψ can be obtained by solving U(ψ; y) = 0.
• Assume that some part of y, denoted by ymis, is not observed, and that m imputed values, say ymis^{*(1)}, · · · , ymis^{*(m)}, are generated from f(ymis | yobs, δ; η̂MLE), where η̂MLE is the MLE of η0.
• The imputed estimating function using the m imputed values is computed as
  Ūimp,m^*(ψ | η̂MLE) = (1/m) Σ_{j=1}^m U(ψ; y^{*(j)}),   (5)
where y^{*(j)} = (yobs, ymis^{*(j)}).
• Let ψ̂imp,m^* be the solution to Ūimp,m^*(ψ | η̂MLE) = 0. We are interested in the asymptotic properties of ψ̂imp,m^*.
2. Basic Theory for Imputation
Theorem 2 [Theorem 4.2 of KS]
Suppose that the parameter of interest ψ0 is estimated by solving U(ψ) = 0 under complete response. Then, under some regularity conditions, the solution to
  E{U(ψ) | yobs, δ; η̂MLE} = 0   (6)
has mean ψ0 and asymptotic variance τ⁻¹ Ω (τ⁻¹)′, where
  τ = −E{∂U(ψ0)/∂ψ′},
  Ω = V{Ū(ψ0 | η0) + κ Sobs(η0)},
and
  κ = E{U(ψ0) Smis(η0)′} Iobs⁻¹.
This theorem was originally presented by Robins and Wang (2000).
2. Basic Theory for Imputation
Remark
• Writing Ū(ψ | η) ≡ E{U(ψ) | yobs, δ; η}, the solution to (6) can be treated as the solution to the joint estimating equation
  U(ψ, η) ≡ (U1(ψ, η), U2(η)) = 0,
where U1(ψ, η) = Ū(ψ | η) and U2(η) = Sobs(η).
• We can apply a Taylor expansion to get
  (ψ̂, η̂) ≅ (ψ0, η0) − [B11 B12; B21 B22]⁻¹ (U1(ψ0, η0), U2(η0)),
where
  [B11 B12; B21 B22] = [E(∂U1/∂ψ′) E(∂U1/∂η′); E(∂U2/∂ψ′) E(∂U2/∂η′)].
Thus, since B21 = 0,
  ψ̂ ≅ ψ0 − B11⁻¹ {U1(ψ0, η0) − B12 B22⁻¹ U2(η0)}.
In Theorem 2, B11 = τ, B12 = E(U Smis), and B22 = −Iobs.
3. Monte Carlo EM
Motivation: Monte Carlo samples in the EM algorithm can be used as imputed values.
Monte Carlo EM
1. In the EM algorithm defined by
   • [E-step] Compute Q(η | η^(t)) = E{ln f(y, δ; η) | yobs, δ; η^(t)},
   • [M-step] Find η^(t+1) that maximizes Q(η | η^(t)),
   the E-step is computationally cumbersome because it involves an integral.
2. Wei and Tanner (1990): In the E-step, first draw
   ymis^{*(1)}, · · · , ymis^{*(m)} ∼ f(ymis | yobs, δ; η^(t))
   and approximate
   Q(η | η^(t)) ≅ (1/m) Σ_{j=1}^m ln f(yobs, ymis^{*(j)}, δ; η).
3. Monte Carlo EM
Example 1 [Example 3.15 of KS]
• Suppose that yi ∼ f(yi | xi; θ). Assume that xi is always observed but we observe yi only when δi = 1, where δi ∼ Bernoulli[πi(φ)] and
  πi(φ) = exp(φ0 + φ1 xi + φ2 yi) / {1 + exp(φ0 + φ1 xi + φ2 yi)}.
• To implement the MCEM method, in the E-step we need to generate samples from
  f(yi | xi, δi = 0; θ̂, φ̂) = f(yi | xi; θ̂){1 − πi(φ̂)} / ∫ f(yi | xi; θ̂){1 − πi(φ̂)} dyi.
3. Monte Carlo EM
Example 1 (Cont’d)
• We can use the following rejection method to generate samples from f(yi | xi, δi = 0; θ̂, φ̂):
  1. Generate yi* from f(yi | xi; θ̂).
  2. Using yi*, compute
     πi*(φ̂) = exp(φ̂0 + φ̂1 xi + φ̂2 yi*) / {1 + exp(φ̂0 + φ̂1 xi + φ̂2 yi*)}.
     Accept yi* with probability 1 − πi*(φ̂).
  3. If yi* is not accepted, go to Step 1.
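A sketch of this rejection sampler, assuming a normal outcome model f(y | x; θ) = N(θ0 + θ1 x, σ²); the parameter values are hypothetical.

```python
# Sketch: rejection sampler for f(y_i | x_i, delta_i = 0) in the MCEM E-step of Example 1,
# under an assumed normal outcome model and the logistic response model of the slides.
import numpy as np

rng = np.random.default_rng(7)

def draw_missing_y(x_i, theta, phi, sigma=1.0, max_tries=10000):
    t0, t1 = theta
    for _ in range(max_tries):
        y_star = rng.normal(t0 + t1 * x_i, sigma)              # step 1: draw from f(y | x; theta_hat)
        pi_star = 1.0 / (1.0 + np.exp(-(phi[0] + phi[1] * x_i + phi[2] * y_star)))
        if rng.uniform() < 1.0 - pi_star:                       # step 2: accept w.p. 1 - pi*(phi_hat)
            return y_star
    raise RuntimeError("acceptance probability too small")

# hypothetical current parameter values
print(draw_missing_y(x_i=0.3, theta=(0.0, 1.0), phi=(0.5, -0.2, 0.8)))
```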
3. Monte Carlo EM
Example 1 (Cont’d)
• Using the m imputed values of yi, denoted by yi^{*(1)}, · · · , yi^{*(m)}, the M-step can be implemented by solving
  Σ_{i=1}^n Σ_{j=1}^m S(θ; xi, yi^{*(j)}) = 0
and
  Σ_{i=1}^n Σ_{j=1}^m {δi − π(φ; xi, yi^{*(j)})} (1, xi, yi^{*(j)}) = 0,
where S(θ; xi, yi) = ∂ log f(yi | xi; θ)/∂θ.
3. Monte Carlo EM
Example 2 (GLMM) [Example 3.18]
• Basic setup: Let yij be a binary random variable (taking values 0 or 1) with probability pij = Pr(yij = 1 | xij, ai), and assume that
  logit(pij) = x′ij β + ai,
where xij is a p-dimensional covariate associated with the j-th repetition of unit i, β is the parameter of interest that can represent the treatment effect due to x, and ai is the random effect associated with unit i. We assume that the ai are iid N(0, σ²).
• Missing data: ai.
• Observed likelihood:
  Lobs(β, σ²) = ∏_i ∫ { ∏_j p(xij, ai; β)^{yij} [1 − p(xij, ai; β)]^{1−yij} } (1/σ) φ(ai/σ) dai,
where φ(·) is the pdf of the standard normal distribution.
3. Monte Carlo EM
Example 2 (Cont’d)
• MCEM approach: generate ai* from
  f(ai | xi, yi; β̂, σ̂) ∝ f1(yi | xi, ai; β̂) f2(ai; σ̂).
• Metropolis-Hastings algorithm:
  1. Generate ai* from f2(ai; σ̂).
  2. Set
     ai^(t) = ai*        with probability ρ(ai^(t−1), ai*),
     ai^(t) = ai^(t−1)   with probability 1 − ρ(ai^(t−1), ai*),
where
     ρ(ai^(t−1), ai*) = min{ f1(yi | xi, ai*; β̂) / f1(yi | xi, ai^(t−1); β̂), 1 }.
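A sketch of this independence Metropolis-Hastings update for a single cluster, assuming a logistic f1 and a normal f2 as above; the data and parameter values are hypothetical.

```python
# Sketch: independence Metropolis-Hastings update for a_i in Example 2 (GLMM),
# with proposal f2(a; sigma) = N(0, sigma^2) and acceptance ratio based on f1(y_i | x_i, a; beta).
import numpy as np

rng = np.random.default_rng(8)

def log_f1(y_i, x_i, a, beta):
    eta = x_i @ beta + a
    p = 1.0 / (1.0 + np.exp(-eta))
    return np.sum(y_i * np.log(p) + (1 - y_i) * np.log(1 - p))

def mh_update(a_prev, y_i, x_i, beta, sigma):
    a_star = rng.normal(0.0, sigma)                     # proposal from f2(a; sigma)
    log_rho = log_f1(y_i, x_i, a_star, beta) - log_f1(y_i, x_i, a_prev, beta)
    return a_star if np.log(rng.uniform()) < min(log_rho, 0.0) else a_prev

# hypothetical data for one cluster with 5 repeated binary measurements
x_i = np.column_stack([np.ones(5), rng.normal(size=5)])
y_i = rng.binomial(1, 0.5, size=5)
print(mh_update(a_prev=0.0, y_i=y_i, x_i=x_i, beta=np.array([0.2, 1.0]), sigma=0.8))
```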
3. Monte Carlo EM
Remark
• Monte Carlo EM can be used as a frequentist approach to imputation.
• Convergence is not guaranteed (for fixed m).
• E-step can be computationally heavy. (May use MCMC method).
4. Parametric Fractional Imputation
Parametric fractional imputation
1. Generate more than one (say m) imputed values of ymis,i: ymis,i^{*(1)}, · · · , ymis,i^{*(m)} from some (initial) density h(ymis,i).
2. Create a weighted data set
   {(wij*, yij*); j = 1, 2, · · · , m; i = 1, 2, · · · , n},
where Σ_{j=1}^m wij* = 1, yij* = (yobs,i, ymis,i^{*(j)}),
   wij* ∝ f(yij*, δi; η̂)/h(ymis,i^{*(j)}),
η̂ is the maximum likelihood estimator of η, and f(y, δ; η) is the joint density of (y, δ).
3. The weights wij* are the normalized importance weights and can be called fractional weights.
If ymis,i is categorical, simply use all possible values of ymis,i as the imputed values and assign their conditional probabilities as the fractional weights.
4. Parametric Fractional Imputation
Remark
• Importance sampling idea: for sufficiently large m,
  Σ_{j=1}^m wij* g(yij*) ≅ ∫ g(yi) {f(yi, δi; η̂)/h(ymis,i)} h(ymis,i) dymis,i / ∫ {f(yi, δi; η̂)/h(ymis,i)} h(ymis,i) dymis,i = E{g(yi) | yobs,i, δi; η̂}
for any g such that the expectation exists.
• In the importance sampling literature, h(·) is called the proposal distribution and f(·) is called the target distribution.
• We do not need to compute the conditional distribution f(ymis,i | yobs,i, δi; η); only the joint distribution f(yobs,i, ymis,i, δi; η) is needed, because
  f(ymis,i^{*(j)} | yobs,i, δi; η̂)/h(ymis,i^{*(j)}) / Σ_{k=1}^m f(ymis,i^{*(k)} | yobs,i, δi; η̂)/h(ymis,i^{*(k)})
  = f(yobs,i, ymis,i^{*(j)}, δi; η̂)/h(ymis,i^{*(j)}) / Σ_{k=1}^m f(yobs,i, ymis,i^{*(k)}, δi; η̂)/h(ymis,i^{*(k)}).
4. Parametric Fractional Imputation
EM algorithm by fractional imputation
1. Imputation step: generate ymis,i^{*(j)} ∼ h(ymis,i).
2. Weighting step: compute
   wij(t)* ∝ f(yij*, δi; η̂^(t))/h(ymis,i^{*(j)}),
subject to Σ_{j=1}^m wij(t)* = 1.
3. M-step: update η̂^(t+1) as the solution to
   Σ_{i=1}^n Σ_{j=1}^m wij(t)* S(η; yij*, δi) = 0.
4. Repeat Steps 2 and 3 until convergence.
• "Imputation step" + "Weighting step" = E-step.
• We may add an optional step that checks whether wij(t)* is too large for some j. In that case, h(ymis,i) needs to be changed.
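A sketch of the weighting step for one unit, assuming a normal outcome model, a logistic response model, and the outcome model at an initial estimate as the proposal h (as in the "Return to Example 1" slide below); all numerical values are hypothetical.

```python
# Sketch: normalized fractional weights for one unit i, assuming
# f(y | x; theta) = N(theta0 + theta1*x, sigma^2) and a logistic P(delta = 0 | x, y; phi),
# with proposal h(y) = f(y | x; theta_initial).
import numpy as np
from scipy.stats import norm

def fractional_weights(y_star, x_i, theta_t, theta_0, phi_t, sigma=1.0):
    # target/proposal ratio times P(delta_i = 0 | x_i, y*; phi_t), then normalize
    f_t = norm.pdf(y_star, theta_t[0] + theta_t[1] * x_i, sigma)
    f_0 = norm.pdf(y_star, theta_0[0] + theta_0[1] * x_i, sigma)
    pi = 1.0 / (1.0 + np.exp(-(phi_t[0] + phi_t[1] * x_i + phi_t[2] * y_star)))
    w = (f_t / f_0) * (1.0 - pi)
    return w / w.sum()

y_star = np.array([-0.5, 0.1, 0.9, 1.7])       # hypothetical imputed values for unit i
print(fractional_weights(y_star, x_i=0.3, theta_t=(0.1, 1.1),
                         theta_0=(0.0, 1.0), phi_t=(0.5, -0.2, 0.8)))
```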
4. Parametric Fractional imputation
• The imputed values are not changed across the EM iterations; only the fractional weights are changed.
  1. Computationally efficient (because we use importance sampling only once).
  2. Convergence is achieved (because the imputed values are not changed).
• For sufficiently large t, η̂^(t) → η̂*. Also, for sufficiently large m, η̂* → η̂MLE.
• For estimation of ψ in E{U(ψ; Y)} = 0, simply solve
  (1/n) Σ_{i=1}^n Σ_{j=1}^m wij* U(ψ; yij*) = 0.
• Linearization variance estimation (using the result of Theorem 2) is discussed in Kim (2011).
4. Parametric Fractional imputation
Return to Example 1
• Fractional imputation
  1. Imputation step: Generate yi^{*(1)}, · · · , yi^{*(m)} from f(yi | xi; θ̂^(0)).
  2. Weighting step: Using the m imputed values generated in Step 1, compute the fractional weights
     wij(t)* ∝ {f(yi^{*(j)} | xi; θ̂^(t)) / f(yi^{*(j)} | xi; θ̂^(0))} {1 − π(xi, yi^{*(j)}; φ̂^(t))},
where
     π(xi, yi; φ̂) = exp(φ̂0 + φ̂1 xi + φ̂2 yi) / {1 + exp(φ̂0 + φ̂1 xi + φ̂2 yi)}.
4. Parametric Fractional imputation
Return to Example 1 (Cont’d)
• Using the imputed data and the fractional weights, the M-step can be implemented by solving
  Σ_{i=1}^n Σ_{j=1}^m wij(t)* S(θ; xi, yi^{*(j)}) = 0
and
  Σ_{i=1}^n Σ_{j=1}^m wij(t)* {δi − π(φ; xi, yi^{*(j)})} (1, xi, yi^{*(j)}) = 0,   (7)
where S(θ; xi, yi) = ∂ log f(yi | xi; θ)/∂θ.
4. Parametric Fractional imputation
Example 3: Categorical missing data
Original data
  i   Y1i   Y2i   Xi
  1   1     1     3
  2   1     ?     4
  3   ?     0     5
  4   0     1     6
  5   0     0     7
  6   ?     ?     8
4. Parametric Fractional imputation
Example 3 (Cont'd)
• Y1 and Y2 are dichotomous and X is continuous.
• Model:
  Pr(Y1 = 1 | X) = Λ(α0 + α1 X),
  Pr(Y2 = 1 | X, Y1) = Λ(β0 + β1 X + β2 Y1),
where Λ(x) = {1 + exp(−x)}⁻¹.
• Assume MAR.
4. Parametric Fractional imputation
Example 3 (Cont'd)
Imputed data
  i   Fractional Weight               Y1i   Y2i   Xi
  1   1                               1     1     3
  2   Pr(Y2 = 0 | Y1 = 1, X = 4)      1     0     4
  2   Pr(Y2 = 1 | Y1 = 1, X = 4)      1     1     4
  3   Pr(Y1 = 0 | Y2 = 0, X = 5)      0     0     5
  3   Pr(Y1 = 1 | Y2 = 0, X = 5)      1     0     5
  4   1                               0     1     6
  5   1                               0     0     7
  6   Pr(Y1 = 0, Y2 = 0 | X = 8)      0     0     8
  6   Pr(Y1 = 0, Y2 = 1 | X = 8)      0     1     8
  6   Pr(Y1 = 1, Y2 = 0 | X = 8)      1     0     8
  6   Pr(Y1 = 1, Y2 = 1 | X = 8)      1     1     8
4. Parametric Fractional imputation
Example 3 (Cont'd)
• Implementation of the EM algorithm using fractional imputation:
  • E-step: compute the mean score functions using the fractional weights.
  • M-step: solve the mean score equations.
• Because we have a completed (weighted) data set, we can also estimate other parameters such as θ = Pr(Y2 = 1 | X > 5).
4. Parametric Fractional imputation
Example 4: Measurement error model [Example 4.13 of KS]
• Interested in estimating θ in f (y | x; θ).
• Instead of observing x, we observe z which can be highly correlated with x.
• Thus, z is an instrumental variable for x:
  f(y | x, z) = f(y | x)  and  f(y | z = a) ≠ f(y | z = b) for a ≠ b.
• In addition to original sample, we have a separate calibration sample that observes
(xi , zi ).
4. Parametric Fractional imputation
Table: Data structure
                        Z   X   Y
  Calibration Sample    o   o
  Original Sample       o       o
4. Parametric Fractional imputation
Example 4 (Cont’d)
• The goal is to generate x in the original sample from
  f(xi | zi, yi) ∝ f(xi | zi) f(yi | xi, zi) = f(xi | zi) f(yi | xi).
• Obtain a consistent estimator f̂(x | z) from the calibration sample.
• E-step:
  1. Generate xi^{*(1)}, · · · , xi^{*(m)} from f̂(xi | zi).
  2. Compute the fractional weights associated with xi^{*(j)}:
     wij* ∝ f(yi | xi^{*(j)}; θ̂).
• M-step: Solve the weighted score equation for θ.
5. Multiple imputation
Features
1. Imputed values are generated from
   yi,mis^{*(j)} ∼ f(yi,mis | yi,obs, δi; η*),
where η* is generated from the posterior distribution π(η | yobs).
2. The variance estimation formula is simple (Rubin's formula):
   V̂MI(ψ̄m) = Wm + (1 + 1/m) Bm,
where Wm = m⁻¹ Σ_{k=1}^m V̂I^(k), Bm = (m − 1)⁻¹ Σ_{k=1}^m (ψ̂^(k) − ψ̄m)², ψ̄m = m⁻¹ Σ_{k=1}^m ψ̂^(k) is the average of the m imputed estimators, and V̂I^(k) is the imputed version of the variance estimator of ψ̂ under complete response.
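A minimal sketch of Rubin's combining rules; the input estimates are hypothetical.

```python
# Sketch: Rubin's combining rules for m imputed point estimates and their
# complete-data variance estimates.
import numpy as np

def rubin_combine(point_estimates, variance_estimates):
    psi = np.asarray(point_estimates, dtype=float)
    v = np.asarray(variance_estimates, dtype=float)
    m = len(psi)
    psi_bar = psi.mean()                    # combined point estimate
    w_m = v.mean()                          # within-imputation variance W_m
    b_m = psi.var(ddof=1)                   # between-imputation variance B_m
    total_var = w_m + (1 + 1 / m) * b_m     # Rubin's variance formula
    return psi_bar, total_var

# hypothetical estimates from m = 5 imputed data sets
print(rubin_combine([1.02, 0.97, 1.05, 1.00, 0.99],
                    [0.010, 0.011, 0.009, 0.010, 0.012]))
```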
5. Multiple imputation
• The computation for Bayesian imputation can be implemented by the data augmentation technique (Tanner and Wong, 1987), which is a special application of Gibbs sampling:
  1. I-step: Generate ymis^* ∼ f(ymis | yobs, δ; η*).
  2. P-step: Generate η* ∼ π(η | yobs, ymis^*, δ).
• Needs some tools for checking convergence to a stable distribution.
• Consistency of the variance estimator is questionable (Kim et al., 2006).
6. Simulation Study
Simulation 1
• Bivariate data (xi, yi) of size n = 200 with
  xi ∼ N(3, 1),  yi ∼ N(−2 + xi, 1).
• xi is always observed; yi is subject to missingness.
• MCAR (δ ∼ Bernoulli(0.6)).
• Parameters of interest:
  1. θ1 = E(Y)
  2. θ2 = Pr(Y < 1)
• Multiple imputation (MI) and fractional imputation (FI) are applied with m = 50.
• For estimation of θ2, the following method-of-moments estimator is used:
  θ̂2,MME = n⁻¹ Σ_{i=1}^n I(yi < 1).
6. Simulation Study
Table 1: Monte Carlo bias and variance of the point estimators.

  Parameter   Estimator         Bias   Variance   Std Var
  θ1          Complete sample   0.00   0.0100     100
  θ1          MI                0.00   0.0134     134
  θ1          FI                0.00   0.0133     133
  θ2          Complete sample   0.00   0.00129    100
  θ2          MI                0.00   0.00137    106
  θ2          FI                0.00   0.00137    106

Table 2: Monte Carlo relative bias of the variance estimator.

  Parameter   Imputation   Relative bias (%)
  V(θ̂1)       MI           -0.24
  V(θ̂1)       FI           1.21
  V(θ̂2)       MI           23.08
  V(θ̂2)       FI           2.05
6. Simulation study
• Rubin's formula is based on the following decomposition:
  V(θ̂MI) = V(θ̂n) + V(θ̂MI − θ̂n),
where θ̂n is the complete-sample estimator of θ. Basically, the Wm term estimates V(θ̂n) and the (1 + m⁻¹)Bm term estimates V(θ̂MI − θ̂n).
• In general, we have
  V(θ̂MI) = V(θ̂n) + V(θ̂MI − θ̂n) + 2 Cov(θ̂MI − θ̂n, θ̂n),
and Rubin's variance estimator ignores the covariance term. Thus, a sufficient condition for unbiased variance estimation is
  Cov(θ̂MI − θ̂n, θ̂n) = 0.
• Meng (1994) called this condition congeniality (of θ̂n).
• Congeniality holds when θ̂n is the MLE of θ.
6. Simulation study
• For example, there are two estimators of θ = Pr(Y < 1) when Y follows N(µ, σ²):
  1. Maximum likelihood method: θ̂MLE = ∫_{−∞}^1 φ(z; µ̂, σ̂²) dz.
  2. Method of moments: θ̂MME = n⁻¹ Σ_{i=1}^n I(yi < 1).
• In the simulation setup, the imputed estimator of θ2 can be expressed as
  θ̂2,I = n⁻¹ Σ_{i=1}^n [δi I(yi < 1) + (1 − δi) E{I(yi < 1) | xi; µ̂, σ̂}].
Thus, the imputed estimator of θ2 "borrows strength" by making use of extra information associated with f(y | x).
• Thus, when the congeniality condition does not hold, the imputed estimator improves efficiency (due to the imputation model that uses extra information), but the variance estimator does not recognize this improvement.
6. Simulation Study
Simulation 2
• Bivariate data (xi, yi) of size n = 100 with
  yi = β0 + β1 xi + β2 (xi² − 1) + ei,   (8)
where (β0, β1, β2) = (0, 0.9, 0.06), xi ∼ N(0, 1), ei ∼ N(0, 0.16), and xi and ei are independent. The variable xi is always observed, but the probability that yi responds is 0.5.
• In MI, the imputer's model is
  yi = β0 + β1 xi + ei.
That is, the imputer's model uses the extra information that β2 = 0.
• From the imputed data, we fit model (8) and compute the power of the test of H0: β2 = 0 at the 0.05 significance level.
• In addition, we also consider the complete-case (CC) method that simply uses the complete cases for the regression analysis.
6. Simulation Study
Table 3: Simulation results for the Monte Carlo experiment based on 10,000 Monte Carlo samples.

  Method   E(θ̂)    V(θ̂)      R.B. (V̂)   Power
  MI       0.028   0.00056   1.81       0.044
  FI       0.046   0.00146   0.02       0.314
  CC       0.060   0.00234   -0.01      0.285

Table 3 shows that MI provides a more efficient point estimator than the CC method, but its variance estimation is very conservative (more than 100% overestimation). Because of the serious positive bias of the MI variance estimator, the statistical power of the test based on MI is actually lower than that of the CC method.
7. Summary
• Imputation can be viewed as a Monte Carlo tool for computing the conditional
expectation.
• Monte Carlo EM is very popular but the E-step can be computationally heavy.
• Parametric fractional imputation is a useful tool for frequentist imputation.
• Multiple imputation is motivated from a Bayesian framework. The frequentist
validity of multiple imputation requires the condition of congeniality.
• Uncongeniality may lead to overestimation of variance which can seriously increase
type-2 errors.
REFERENCES
Cheng, P. E. (1994), ‘Nonparametric estimation of mean functionals with data missing
at random’, Journal of the American Statistical Association 89, 81–87.
Dempster, A. P., N. M. Laird and D. B. Rubin (1977), ‘Maximum likelihood from
incomplete data via the EM algorithm’, Journal of the Royal Statistical Society:
Series B 39, 1–37.
Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’,
Philosophical Transactions of the Royal Society of London A 222, 309–368.
Fuller, W. A., M. M. Loughin and H. D. Baker (1994), ‘Regression weighting in the
presence of nonresponse with application to the 1987-1988 Nationwide Food
Consumption Survey’, Survey Methodology 20, 75–85.
Hirano, K., G. Imbens and G. Ridder (2003), ‘Efficient estimation of average treatment
effects using the estimated propensity score’, Econometrica 71, 1161–1189.
Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’, Journal of the
American Statistical Association 85, 765–769.
Kim, J. K. (2011), ‘Parametric fractional imputation for missing data analysis’,
Biometrika 98, 119–132.
Kim, J. K. and C. L. Yu (2011), ‘A semi-parametric estimation of mean functionals with
non-ignorable missing data’, Journal of the American Statistical Association
106, 157–165.
Jae-Kwang Kim (ISU)
Part 2
94 / 181
Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), ‘On the bias of the multiple
imputation variance estimator in survey sampling’, Journal of the Royal Statistical
Society: Series B 68, 509–521.
Kim, J. K. and M. K. Riddles (2012), ‘Some theory for propensity-score-adjustment
estimators in survey sampling’, Survey Methodology 38, 157–165.
Kott, P. S. and T. Chang (2010), ‘Using calibration weighting to adjust for nonignorable
unit nonresponse’, Journal of the American Statistical Association 105, 1265–1275.
Louis, T. A. (1982), ‘Finding the observed information matrix when using the EM
algorithm’, Journal of the Royal Statistical Society: Series B 44, 226–233.
Meng, X. L. (1994), ‘Multiple-imputation inferences with uncongenial sources of input
(with discussion)’, Statistical Science 9, 538–573.
Oakes, D. (1999), ‘Direct calculation of the information matrix via the em algorithm’,
Journal of the Royal Statistical Society: Series B 61, 479–482.
Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and
applications, in ‘Proceedings of the 6th Berkeley Symposium on Mathematical
Statistics and Probability’, Vol. 1, University of California Press, Berkeley, California,
pp. 695–715.
Redner, R. A. and H. F. Walker (1984), ‘Mixture densities, maximum likelihood and the
EM algorithm’, SIAM Review 26, 195–239.
Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), ‘Estimation of regression coefficients
when some regressors are not always observed’, Journal of the American Statistical
Association 89, 846–866.
Jae-Kwang Kim (ISU)
Part 2
94 / 181
Robins, J. M. and N. Wang (2000), ‘Inference for imputation estimators’, Biometrika
87, 113–124.
Rubin, D. B. (1976), ‘Inference and missing data’, Biometrika 63, 581–590.
Tanner, M. A. and W. H. Wong (1987), ‘The calculation of posterior distribution by
data augmentation’, Journal of the American Statistical Association 82, 528–540.
Wang, N. and J. M. Robins (1998), ‘Large-sample theory for parametric multiple
imputation procedures’, Biometrika 85, 935–948.
Wang, S., J. Shao and J. K. Kim (2014), ‘Identifiability and estimation in problems with
nonignorable nonresponse’, Statistica Sinica 24, 1097 – 1116.
Wei, G. C. and M. A. Tanner (1990), ‘A Monte Carlo implementation of the EM
algorithm and the poor man’s data augmentation algorithms’, Journal of the
American Statistical Association 85, 699–704.
Zhou, M. and J. K. Kim (2012), ‘An efficient method of estimation for longitudinal
surveys with monotone missing data’, Biometrika 99, 631–648.
Statistical Methods for Handling Missing Data
Part 3: Propensity score approach
Jae-Kwang Kim
Department of Statistics, Iowa State University
1. Introduction
Basic Setup
• (X, Y): random variables.
• θ: defined by solving E{U(θ; X, Y)} = 0.
• yi is subject to missingness:
  δi = 1 if yi responds,  δi = 0 if yi is missing.
• We want to find wi such that the solution θ̂w to
  Σ_{i=1}^n δi wi U(θ; xi, yi) = 0
is consistent for θ.
Basic Setup
Complete-Case (CC) method
• Solve
  Σ_{i=1}^n δi U(θ; xi, yi) = 0.
• Biased unless Pr(δ = 1 | X, Y) does not depend on (X, Y); i.e., biased unless the set of respondents is a simple random sample from the original sample.
Basic Setup
Weighted Complete-Case (WCC) method
• Solve
  ÛW(θ) ≡ Σ_{i=1}^n δi wi U(θ; xi, yi) = 0
for some weights wi. The weights are often called propensity scores (or propensity weights).
• The choice
  wi = 1/Pr(δi = 1 | xi, yi)
will make the resulting estimator consistent.
• Requires some assumption about Pr(δi = 1 | xi, yi).
Basic Setup
Justification for using wi = 1/Pr (δi = 1 | xi , yi )
• Note that
  E{ÛW(θ) | x1, · · · , xn, y1, · · · , yn} = Ûn(θ),
where the expectation is taken with respect to f(δ | x, y) and Ûn(θ) = Σ_{i=1}^n U(θ; xi, yi) is the complete-sample estimating function.
• Thus, the probability limit of the solution to ÛW(θ) = 0 is equal to the probability limit of the solution to Ûn(θ) = 0.
• No distributional assumptions are made about (X, Y).
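A sketch of propensity-score weighting for θ = E(Y) under MAR, with a logistic response model estimated by maximum likelihood; the data-generating design is an assumption.

```python
# Sketch: propensity-score-weighted estimation of theta = E(Y) under MAR,
# with a logistic response model pi(x; phi) fitted by a few Newton-Raphson steps.
import numpy as np

rng = np.random.default_rng(9)
n = 2000
x = rng.normal(size=n)
y = 1.0 + x + rng.normal(size=n)
pi_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))        # response depends on x only (MAR)
delta = rng.binomial(1, pi_true)

# maximum likelihood for phi: Newton-Raphson on the logistic score
X = np.column_stack([np.ones(n), x])
phi = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ phi))
    score = X.T @ (delta - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    phi = phi + np.linalg.solve(hess, score)

pi_hat = 1.0 / (1.0 + np.exp(-X @ phi))
theta_ps = np.sum(delta / pi_hat * y) / np.sum(delta / pi_hat)  # solves sum delta_i/pi_i (y_i - theta) = 0
theta_cc = y[delta == 1].mean()                                  # unweighted complete-case mean (biased here)
print(theta_ps, theta_cc, y.mean())
```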
Regression weighting method
Motivation
• Assume that the xi are observed throughout the sample.
• An intercept term is included in xi.
• The study variable yi is observed only when δi = 1.
• Parameter of interest: θ = E(Y).
• The regression estimator of θ is defined by
  θ̂reg = Σ_{i=1}^n δi wi yi = x̄n′ β̂,
where β̂ = (Σ_{i=1}^n δi xi x′i)⁻¹ Σ_{i=1}^n δi xi yi and
  wi = (n⁻¹ Σ_{i=1}^n xi)′ (Σ_{i=1}^n δi xi x′i)⁻¹ xi.
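A sketch verifying that the regression estimator x̄n′β̂ equals the weighted form Σ δi wi yi with the weights defined above; the data-generating design is an assumption.

```python
# Sketch: the regression (weighting) estimator theta_reg = xbar' beta_hat,
# which equals sum_i delta_i w_i y_i with w_i = xbar' (sum_j delta_j x_j x_j')^{-1} x_i.
import numpy as np

rng = np.random.default_rng(10)
n = 2000
z = rng.normal(size=n)
x = np.column_stack([np.ones(n), z])                    # intercept included in x
y = 1.0 + 2.0 * z + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-(0.2 + 0.7 * z)))             # response depends on x only
delta = rng.binomial(1, pi).astype(bool)

beta_hat = np.linalg.solve(x[delta].T @ x[delta], x[delta].T @ y[delta])
xbar = x.mean(axis=0)
theta_reg = xbar @ beta_hat                             # regression estimator of E(Y)

w = x @ np.linalg.solve(x[delta].T @ x[delta], xbar)    # w_i = xbar' (sum delta x x')^{-1} x_i
print(theta_reg, np.sum(w[delta] * y[delta]))           # the two forms agree
```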
Regression weighting method
Theorem 1 (Fuller et al., 1994)
Let πi be the true response probability of unit i. If the response probability satisfies
1
= x0i λ
πi
(9)
for some λ for al unit i in the sample, then the regression estimator is asymptotically
unbiased with asymptotic variance
)
(
n 1 X 1
2
− 1 (yi − xi β) ,
V (θ̂reg ) = V (θ̂n ) + E
n2 i=1 πi
where θ̂n = n−1
Pn
i=1
Jae-Kwang Kim (ISU)
yi and β is the probability limit of β̂.
Part 3
101 / 181
Sketched Proof
Since πi⁻¹ is in the column space of xi, we have
  Σ_{i=1}^n (δi/πi)(yi − x′i β̂) = 0,
and θ̂reg can be written as
  θ̂reg = x̄n′ β̂ + n⁻¹ Σ_{i=1}^n (δi/πi)(yi − x′i β̂).   (10)
Now, writing θ̂reg = θ̂reg(β̂), we have
  E{∂θ̂reg(β)/∂β} = E{ x̄n′ − n⁻¹ Σ_{i=1}^n (δi/πi) x′i } = 0.
Thus, we can safely ignore the sampling variability of β̂ in (10).
Jae-Kwang Kim (ISU)
Part 3
102 / 181
Regression weighting method
Example 1 [Example 5.2 of KS]
Assume that the sample is partitioned into G exhaustive and mutually exclusive groups,
denoted by A1 , · · · , AG , where |Ag | = ng with g being the group indicator. Assume a
uniform response mechanism for each group. Thus, we assume that πi = pg for some
pg ∈ (0, 1] if i ∈ Ag . Let
  xig = 1 if i ∈ Ag , and 0 otherwise.
Then xi = (xi1 , · · · , xiG ) satisfies (9). The regression estimator of θ = E (Y ) can be written as
  θ̂reg = n^{-1} ∑_{g=1}^G (ng /rg ) ∑_{i∈Ag} δi yi = n^{-1} ∑_{g=1}^G ng ȳRg ,
where rg = ∑_{i∈Ag} δi is the realized number of respondents in group g and
  ȳRg = ( ∑_{i∈Ag} δi )^{-1} ∑_{i∈Ag} δi yi .
Jae-Kwang Kim (ISU)
Part 3
103 / 181
Regression weighting method
Example 1 (Cont’d)
Because the covariate satisfies (9), the regression estimator is asymptotically unbiased
and the asymptotic variance of θ̂reg is
  V (θ̂reg ) = V (θ̂n ) + E{ n^{-2} ∑_{g=1}^G (ng /rg − 1) ∑_{i∈Ag} (yi − ȳng )^2 }.
Variance estimation of θ̂reg can be implemented by using a standard variance estimation
formula applied to d̂i = ȳRg + (ng /rg ) (yi − ȳRg ).
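A small numerical sketch (not from the text) of the weighting-class form of the regression estimator in Example 1, assuming a uniform response probability within each group; the group labels, response rates, and simulated outcome model are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
n, G = 3000, 3
group = rng.integers(0, G, size=n)                     # group indicator in {0, ..., G-1}
y = 2.0 * group + rng.normal(size=n)                   # outcome differs by group
p_g = np.array([0.9, 0.6, 0.4])                        # within-group response rates
delta = rng.binomial(1, p_g[group])                    # uniform response within each group

# theta_reg = n^{-1} sum_g n_g * ybar_Rg, with ybar_Rg the respondent mean in group g
n_g = np.bincount(group, minlength=G)
r_g = np.bincount(group, weights=delta.astype(float), minlength=G)
ybar_Rg = np.bincount(group, weights=delta * y, minlength=G) / r_g
theta_reg = np.sum(n_g * ybar_Rg) / n
print(theta_reg, y.mean())                             # should both be near E(Y)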
Jae-Kwang Kim (ISU)
Part 3
104 / 181
Propensity score method
Idea
• For simplicity, assume that Pr (δi = 1 | xi , yi ) takes a parametric form.
Pr (δi = 1 | xi , yi ) = π(xi , yi ; φ∗ )
for some unknown φ∗ . The functional form of π(·) is known. For example,
  π(x, y ; φ∗ ) = exp(φ∗0 + φ∗1 x + φ∗2 y ) / {1 + exp(φ∗0 + φ∗1 x + φ∗2 y )}.
• Propensity score (PS) approach to missing data: obtain θ̂PS which solves
  ÛPS (θ) ≡ ∑_{i=1}^n δi {π(xi , yi ; φ̂)}^{-1} U(θ; xi , yi ) = 0
  for some φ̂ which converges to φ∗ in probability.
Jae-Kwang Kim (ISU)
Part 3
105 / 181
Propensity score method
Issues
• Identifiability: Model parameters may not be fully identifiable from the observed
sample.
• May assume
Pr (δi = 1 | xi , yi ) = Pr (δi = 1 | xi ) .
This condition is often called MAR (Missing at random).
• For longitudinal data with a monotone missing pattern, the MAR condition means
  Pr (δi,t = 1 | xi , yi1 , · · · , yiT ) = Pr (δi,t = 1 | xi , yi1 , · · · , yi,t−1 ).
  That is, the response probability at time t may depend only on the values of y observed before time t.
Jae-Kwang Kim (ISU)
Part 3
106 / 181
Propensity score method
Issues
• Estimation of φ∗
• Maximum likelihood method: Solve
  S(φ) ≡ ∑_{i=1}^n {δi − πi (φ)} qi (φ) = 0,
  where qi (φ) = ∂ logit{πi (φ)}/∂φ.
• Maximum likelihood method does not always lead to efficient
estimation (see Example 2 next).
• Inference using θ̂PS : Note that θ̂PS = θ̂PS (φ̂). We need to incorporate the sampling
variability of φ̂ in making inference about θ using θ̂PS .
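A rough sketch (not from the slides) of the two estimation steps just described, assuming a logistic response model that depends only on x (MAR): φ is estimated by Newton–Raphson on the score S(φ) = ∑ {δi − πi(φ)}(1, xi), and θ = E(Y) is then estimated by the PS estimating equation. All data and names are simulated and hypothetical.

import numpy as np

rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=n)
y = 0.5 + x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                   # covariates of the response model
delta = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.2, 1.0])))))

# Solve S(phi) = sum_i {delta_i - pi_i(phi)} (1, x_i) = 0 by Newton-Raphson
phi = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ phi)))
    phi = phi + np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (delta - p))

pi_hat = 1 / (1 + np.exp(-(X @ phi)))
# PS estimator of theta = E(Y): solve sum_i delta_i pi_hat_i^{-1} (y_i - theta) = 0
theta_ps = np.sum(delta / pi_hat * y) / np.sum(delta / pi_hat)
print(phi, theta_ps)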
Jae-Kwang Kim (ISU)
Part 3
107 / 181
3. Propensity score method
Example 2
• Response model:
  πi (φ∗ ) = exp(φ∗0 + φ∗1 xi ) / {1 + exp(φ∗0 + φ∗1 xi )}
• Parameter of interest: θ = E (Y ).
• PS estimator of θ: Solve jointly
  U(θ, φ) = ∑_{i=1}^n δi {πi (φ)}^{-1} (yi − θ) = 0
  S(φ) = ∑_{i=1}^n {δi − πi (φ)} (1, xi ) = (0, 0).
Jae-Kwang Kim (ISU)
Part 3
108 / 181
3. Propensity score method
Example 2 (Cont’d)
• Taylor linearization:
  ÛPS (θ, φ̂) ≅ ÛPS (θ, φ∗ ) − E(∂U/∂φ) {E(∂S/∂φ)}^{-1} S(φ∗ )
             = ÛPS (θ, φ∗ ) − Cov (U, S) {V (S)}^{-1} S(φ∗ ),   (11)
  by the property of zero-mean estimating functions
  (i.e. if E (U) = 0, then E (∂U/∂φ) = −Cov (U, S)).
• So, we have
  V {ÛPS (θ, φ̂)} ≅ V {ÛPS (θ, φ∗ ) | S ⊥ } ≤ V {ÛPS (θ, φ∗ )},
  where
  V {Û | S ⊥ } = V (Û) − Cov (Û, S) {V (S)}^{-1} Cov (S, Û).
Jae-Kwang Kim (ISU)
Part 3
109 / 181
3. Propensity score method
Example 2 (Cont’d)
• Sandwich variance formula:
  V {θ̂PS (φ̂)} ≅ τPS^{-1} V {ÛPS (θ; φ̂)} (τPS^{-1})',
  where τPS = E{ ∂ ÛPS (θ, φ̂)/∂θ }.
• Note that, by (11),
  E{ ∂ ÛPS (θ, φ̂)/∂θ } = E{ ∂ ÛPS (θ, φ∗ )/∂θ }.
• So, we have
  V {θ̂PS (φ̂)} ≅ V {θ̂PS (φ∗ ) | S ⊥ } ≤ V {θ̂PS (φ∗ )},
  where
  V {θ̂ | S ⊥ } = V (θ̂) − Cov (θ̂, S) {V (S)}^{-1} Cov (S, θ̂).
Jae-Kwang Kim (ISU)
Part 3
110 / 181
3. Propensity score method
Augmented propensity score method
• To improve the efficiency of the PS estimator, one can consider solving
  ∑_{i=1}^n (δi /π̂i ) {U(θ; xi , yi ) − b(θ; xi )} + ∑_{i=1}^n b(θ; xi ) = 0,   (12)
  where b(θ; xi ) is to be determined.
• Assume that the estimated response probability is computed by π̂i = π(xi ; φ̂), where φ̂ is computed by
  ∑_{i=1}^n {δi /π(xi ; φ) − 1} hi (φ) = 0
  for some hi (φ) = h(xi ; φ).
• Note that (12) forms a class of estimators, indexed by b, and the solution to (12) is asymptotically unbiased regardless of the choice of b(θ; xi ).
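A minimal sketch (not from the text) of the augmented equation (12) for θ = E(Y) with U(θ; x, y) = y − θ: taking b(θ; x) = m(x) − θ for a working mean function m, the solution has the closed form θ̂b = n^{-1} ∑ { m(xi) + (δi/π̂i)(yi − m(xi)) }; m ≡ 0 recovers the plain PS estimator, while m(x) = Ê(Y | x) corresponds to the optimal choice in Theorem 2 below. The simulated model and variable names are hypothetical.

import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
pi = 1 / (1 + np.exp(-(0.2 + 0.9 * x)))                # response probabilities, treated as known
delta = rng.binomial(1, pi)

def theta_b(m_hat):
    # closed form of (12) when U(theta; x, y) = y - theta and b(theta; x) = m(x) - theta
    return np.mean(m_hat + delta / pi * (y - m_hat))

theta_ps = theta_b(np.zeros(n))                        # b = 0: plain PS estimator

# b close to E(U | x): working linear regression of y on x fitted to respondents
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X[delta == 1], y[delta == 1], rcond=None)[0]
theta_opt = theta_b(X @ beta)
print(theta_ps, theta_opt)                             # both near 1; the second is more stable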
Jae-Kwang Kim (ISU)
Part 3
111 / 181
3. Propensity score method
Augmented propensity score method
Theorem 2 [Theorem 5.1 of KS]
Assume that the response probability Pr (δ = 1 | x, y ) = π(x) does not depend on the
value of y . Let θ̂b be the solution to (12) for given b(θ; xi ). Under some regularity
conditions, θ̂b is consistent and its asymptotic variance satisfies
  V (θ̂b ) ≥ n^{-1} τ^{-1} [ V {E (U | X )} + E{ π^{-1} V (U | X ) } ] (τ^{-1})',   (13)
where τ = E (∂U/∂θ') and the equality holds when b(θ; xi ) = E {U(θ; xi , yi ) | xi }.
Originally proved by Robins et al. (1994).
Jae-Kwang Kim (ISU)
Part 3
112 / 181
3. Propensity score method
Augmented propensity score method
Remark
• The lower bound of the variance is achieved for b∗ (θ; xi ) = E {U(θ; xi , yi ) | xi },
which requires assumptions about the law of y given x (called outcome regression
model).
• Thus, under the joint distribution of the response model and the outcome regression model, the particular choice b∗ (θ; xi ) = E {U(θ; xi , yi ) | xi }
  1  achieves the lower bound of the asymptotic variance, and
  2  makes the choice of φ̂ in π̂i = π(xi ; φ̂) asymptotically irrelevant.
• Also, it is doubly robust in the sense that it remains consistent if either model
(outcome regression model or response model) is true.
Jae-Kwang Kim (ISU)
Part 3
113 / 181
4. GLS method
Motivation
• The propensity score method is used to reduce the bias, rather than to reduce the
variance.
• In the previous example, the PS estimator of θx = E (X ) is
  θ̂x,PS = ∑_{i=1}^n δi π̂i^{-1} xi / ∑_{i=1}^n δi π̂i^{-1} ,
  where π̂i = πi (φ̂).
• Note that θ̂x,PS is not necessarily equal to x̄n = n^{-1} ∑_{i=1}^n xi .
• How to incorporate the extra information in x̄n ?
Jae-Kwang Kim (ISU)
Part 3
114 / 181
4. GLS method
GLS (or GMM) approach
• Let θ = (θx , θy ). We have three estimators for two parameters.
• Find θ that minimizes
  QPS (θ) = q(θ)' V̂^{-1} q(θ), with q(θ) = ( x̄n − θx , θ̂x,PS − θx , θ̂y,PS − θy )',   (14)
  where V̂ estimates the variance of ( x̄n , θ̂x,PS , θ̂y,PS )' and θ̂PS = θ̂PS (φ̂).
• Computation for V̂ is somewhat cumbersome.
Jae-Kwang Kim (ISU)
Part 3
115 / 181
4. GLS method
Alternative GLS (or GMM) approach (Zhou and Kim, 2012)
• Find (θ, φ) that minimizes
  Q∗ (θ, φ) = q∗ (θ, φ)' V̂^{-1} q∗ (θ, φ),
  with q∗ (θ, φ) = ( x̄n − θx , θ̂x,PS (φ) − θx , θ̂y,PS (φ) − θy , S(φ)' )'.
• Computation for V̂ is easier since we can treat φ as if known.
• Let Q ∗ (θ, φ) be the above objective function. It can be shown that
Q ∗ (θ, φ̂) = QPS (θ) in (14) and so minimizing Q ∗ (θ, φ̂) is equivalent to minimizing
QPS (θ).
Jae-Kwang Kim (ISU)
Part 3
116 / 181
4. GLS method
Justification for the equivalence
• May write
  Q∗ (θ, φ) = ( ÛPS (θ, φ)', S(φ)' ) [ V11  V12 ; V21  V22 ]^{-1} ( ÛPS (θ, φ)', S(φ)' )'
            = Q1 (θ | φ) + Q2 (φ),
  where
  Q1 (θ | φ) = ( ÛPS − V12 V22^{-1} S )' { V (ÛPS | S ⊥ ) }^{-1} ( ÛPS − V12 V22^{-1} S ),
  Q2 (φ) = S(φ)' { V̂ (S) }^{-1} S(φ).
• For the MLE φ̂, we have Q2 (φ̂) = 0 and Q1 (θ | φ̂) = QPS (θ).
Jae-Kwang Kim (ISU)
Part 3
117 / 181
4. GLS method
Example 3 (Example 5.5 of KS)
• Response model: same as Example 2,
  πi (φ∗ ) = exp(φ∗0 + φ∗1 xi ) / {1 + exp(φ∗0 + φ∗1 xi )}.
• Three direct PS estimators of (1, θx , θy ):
  (θ̂1,PS , θ̂x,PS , θ̂y,PS ) = n^{-1} ∑_{i=1}^n δi π̂i^{-1} (1, xi , yi ).
• x̄n = n^{-1} ∑_{i=1}^n xi is available.
• What is the optimal estimator of θy ?
Jae-Kwang Kim (ISU)
Part 3
118 / 181
4. GLS method
Example 3 (Cont’d)
• Minimize
  q(θ, φ)' V̂^{-1} q(θ, φ), with q(θ, φ) = ( x̄n − θx , θ̂1,PS (φ) − 1, θ̂x,PS (φ) − θx , θ̂y,PS (φ) − θy , S(φ)' )',
  with respect to (θx , θy , φ), where V̂ estimates the variance of ( x̄n , θ̂1,PS (φ), θ̂x,PS (φ), θ̂y,PS (φ), S(φ)' )',
  S(φ) = ∑_{i=1}^n {δi /πi (φ) − 1} hi (φ),
  and hi (φ) = πi (φ) (1, xi )'.
Jae-Kwang Kim (ISU)
Part 3
119 / 181
4. GLS method
Example 3 (Cont’d)
• Equivalently, minimize
  q̃(θy , φ)' V̂^{-1} q̃(θy , φ), with q̃(θy , φ) = ( θ̂y,PS (φ) − θy , θ̂1,PS (φ) − 1, θ̂x,PS (φ) − x̄n , S(φ)' )',
  with respect to (θy , φ), since the optimal estimator of θx is x̄n .
Jae-Kwang Kim (ISU)
Part 3
120 / 181
4. GLS method
Example 3 (Cont’d)
• The solution can be written as
  θ̂y,opt = θ̂y,PS + (1 − θ̂1,PS ) B̂0 + (x̄n − θ̂x,PS ) B̂1 + {0 − S(φ̂)}' Ĉ ,
  where
  (B̂0 , B̂1 , Ĉ ')' = { ∑_{i=1}^n δi bi (1, xi , hi ')' (1, xi , hi ') }^{-1} ∑_{i=1}^n δi bi (1, xi , hi ')' yi
  and bi = π̂i^{-2} (1 − π̂i ).
• Note that the last term {0 − S(φ̂)}Ĉ , which is equal to zero, does not contribute
to the point estimation. But, it is used for variance estimation.
Jae-Kwang Kim (ISU)
Part 3
121 / 181
4. GLS method
Example 3 (Cont’d)
• That is, for variance estimation, we simply express
  θ̂y,opt = n^{-1} ∑_{i=1}^n η̂i ,
  where
  η̂i = B̂0 + xi B̂1 + hi ' Ĉ + (δi /π̂i ) (yi − B̂0 − xi B̂1 − hi ' Ĉ ),
  and apply the standard variance formula to η̂i (a short sketch follows after this slide).
• This idea can be extended to the survey sampling setup.
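As a sketch (under an iid setup, with the η̂i already computed as above), "the standard variance formula" is just the usual variance estimate for a sample mean applied to the linearized values; the function name below is hypothetical.

import numpy as np

def linearized_variance(eta_hat):
    # V_hat(theta_hat) = n^{-1} x (sample variance of the linearized values eta_hat_i)
    return np.var(eta_hat, ddof=1) / len(eta_hat)

# usage: eta_hat[i] = B0 + x_i*B1 + h_i'C + (delta_i/pi_hat_i)*(y_i - B0 - x_i*B1 - h_i'C),
# computed from the fitted quantities above; the square root gives a standard error.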
Jae-Kwang Kim (ISU)
Part 3
122 / 181
4. GLS method
Example 3 (Cont’d)
• The optimal estimator is linear in y . That is, we can write
  θ̂y,opt = n^{-1} ∑_{i=1}^n (δi /π̂i ) gi yi = ∑_{δi =1} wi yi ,
  where gi satisfies
  ∑_{i=1}^n (δi /π̂i ) gi (1, xi , hi ') = ∑_{i=1}^n (1, xi , hi ').
• Thus, it is doubly robust under Eζ (y | x) = β0 + β1 x in the sense that θ̂y,opt is consistent when either the response model or the outcome regression model holds.
Jae-Kwang Kim (ISU)
Part 3
123 / 181
5. Doubly robust method
• Two models:
  • Response Probability (RP) model: model for δ,
    Pr (δ = 1 | x, y ) = π(x; φ)
  • Outcome Regression (OR) model: model for y ,
    E (y | x) = m(x; β)
• Doubly robust (DR) estimation aims to achieve (asymptotic) unbiasedness under either the RP model or the OR model.
• For estimation of θ = E (Y ), a doubly robust estimator is
  θ̂DR = n^{-1} ∑_{i=1}^n { ŷi + (δi /π̂i ) (yi − ŷi ) },
  where ŷi = m(xi ; β̂) and π̂i = π(xi ; φ̂).
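A small sketch (not from the text) of θ̂DR, illustrating double robustness by deliberately misspecifying one of the two working models; the simulated data and the particular model choices are hypothetical.

import numpy as np

rng = np.random.default_rng(4)
n = 20000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)                 # true theta = E(Y) = 1
delta = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.0 * x))))
X = np.column_stack([np.ones(n), x])

# OR model (correct here): linear regression fitted to respondents
beta = np.linalg.lstsq(X[delta == 1], y[delta == 1], rcond=None)[0]
y_hat = X @ beta

# RP model (deliberately misspecified): constant response probability
pi_hat = np.full(n, delta.mean())

theta_dr = np.mean(y_hat + delta / pi_hat * (y - y_hat))
theta_ipw = np.sum(delta / pi_hat * y) / np.sum(delta / pi_hat)
print(theta_dr, theta_ipw, y.mean())
# theta_dr stays near 1 because the OR model is correct even though pi_hat is wrong;
# theta_ipw, which relies on the RP model alone, is biased here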
Jae-Kwang Kim (ISU)
Part 3
124 / 181
5. Doubly robust method
• Note that
  θ̂DR − θ̂n = n^{-1} ∑_{i=1}^n (δi /π̂i − 1) (yi − ŷi ).   (15)
  Taking an expectation of the above, the factor (δi /π̂i − 1) has approximately zero (conditional) expectation if the RP model is true, and the factor (yi − ŷi ) has approximately zero (conditional) expectation if the OR model is true. Thus, θ̂DR is approximately unbiased when either the RP model or the OR model is true.
• When both models are true, the choice of β̂ and φ̂ does not make any difference in the asymptotic sense. Robins et al. (1994) called this property the local efficiency of the DR estimator.
Jae-Kwang Kim (ISU)
Part 3
125 / 181
5. Doubly robust method
• Kim and Riddles (2012) considered an augmented propensity model of the form
  π̂i∗ = πi∗ (φ̂, λ̂) = πi (φ̂) / [ πi (φ̂) + {1 − πi (φ̂)} exp(λ̂0 + λ̂1 m̂i ) ],   (16)
  where πi (φ̂) is the estimated response probability under the response probability model and (λ̂0 , λ̂1 ) satisfies
  ∑_{i=1}^n {δi /πi∗ (φ̂, λ̂)} (1, m̂i ) = ∑_{i=1}^n (1, m̂i ),   (17)
  with m̂i = m(xi ; β̂).
Jae-Kwang Kim (ISU)
Part 3
126 / 181
5. Doubly robust method
• The augmented PSA estimator, defined by θ̂∗PSA = n^{-1} ∑_{i=1}^n δi yi /π̂i∗ , based on the augmented propensity in (16), satisfies, under the assumed response probability model,
  θ̂∗PSA ≅ n^{-1} ∑_{i=1}^n [ b̂0 + b̂1 m̂i + (δi /π̂i ) (yi − b̂0 − b̂1 m̂i ) ],   (18)
  where
  (b̂0 , b̂1 )' = { ∑_{i=1}^n δi (1/π̂i − 1) (1, m̂i )' (1, m̂i ) }^{-1} ∑_{i=1}^n δi (1/π̂i − 1) (1, m̂i )' yi .
Jae-Kwang Kim (ISU)
Part 3
127 / 181
5. Doubly robust method
• The augmented PSA estimator using
  π̂i∗ = πi∗ (φ̂, λ̂) = π̂i / [ π̂i + {1 − π̂i } exp(λ̂0 /π̂i + λ̂1 xi /π̂i ) ],
  with (λ̂0 , λ̂1 ) satisfying
  ∑_{i=1}^n {δi /πi∗ (φ̂, λ̂)} (1, xi ) = ∑_{i=1}^n (1, xi ),
  is asymptotically equivalent to the optimal regression PSA estimator discussed in Example 3.
Jae-Kwang Kim (ISU)
Part 3
128 / 181
6. Nonparametric method
Motivation
• So far, we have assumed a parametric model for π(x) = Pr (δ = 1 | x).
• Using nonparametric regression, an estimator of π(x) = E (δ | x) can be obtained as
  π̂h (x) = ∑_{i=1}^n δi Kh (xi , x) / ∑_{i=1}^n Kh (xi , x),   (19)
  where Kh is a kernel function satisfying certain regularity conditions and h is the bandwidth.
• Once a nonparametric estimator of π(x) is obtained, the nonparametric PSA estimator θ̂NPS of θ0 = E (Y ) is given by
  θ̂NPS = n^{-1} ∑_{i=1}^n {δi /π̂h (xi )} yi .   (20)
Jae-Kwang Kim (ISU)
Part 3
129 / 181
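A minimal sketch (not from the slides) of (19)–(20) with a Gaussian kernel and a fixed bandwidth; the bandwidth value, kernel choice, and simulated model are hypothetical and would need tuning in practice.

import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(-2, 2, size=n)
y = 1.0 + np.sin(x) + rng.normal(scale=0.5, size=n)
delta = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + x))))

h = 0.3                                                      # bandwidth, fixed for simplicity
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)      # Gaussian kernel K_h(x_i, x_j)

# (19): pi_hat(x_j) = sum_i delta_i K_h(x_i, x_j) / sum_i K_h(x_i, x_j)
pi_hat = (delta @ K) / K.sum(axis=0)

# (20): nonparametric PSA estimator of E(Y)
theta_nps = np.mean(delta / pi_hat * y)
print(theta_nps, y.mean())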
6. Nonparametric method
Theorem 3 [Theorem 5.2 of KS]
Under some regularity conditions, we have
  θ̂NPS = n^{-1} ∑_{i=1}^n [ m(xi ) + {δi /π(xi )} {yi − m(xi )} ] + op (n^{-1/2}),   (21)
where m(x) = E (Y | x) and π(x) = P(δ = 1 | x). Furthermore, we have
  √n ( θ̂NPS − θ ) → N(0, σ1^2 ),
where σ1^2 = V {m(X )} + E[ {π(X )}^{-1} V (Y | X ) ].
Originally proved by Hirano et al. (2003).
Jae-Kwang Kim (ISU)
Part 3
130 / 181
6. Nonparametric method
Remark
• Unlike the usual asymptotics for nonparametric regression, √n-consistency is established.
• The nonparametric PSA estimator achieves the lower bound of the variance in (13) that was discussed in Theorem 2.
• Instead of the nonparametric PSA method, we can use the same kernel regression technique to obtain a nonparametric imputation estimator given by
  θ̂NPI = n^{-1} ∑_{i=1}^n {δi yi + (1 − δi ) m̂h (xi )},   (22)
  where
  m̂h (x) = ∑_{i=1}^n δi Kh (xi , x) yi / ∑_{i=1}^n δi Kh (xi , x).
  Cheng (1994) proves that θ̂NPI has the same asymptotic variance as in Theorem 3.
Jae-Kwang Kim (ISU)
Part 3
131 / 181
Application to longitudinal missing
Basic Setup
• Xi is always observed and remains unchanged for t = 0, 1, . . . , T .
• Yit is the response for subject i at time t.
• δit : The response indicator for subject i at time t.
• Assuming no missing in the baseline year, Y0 can be absorbed into X .
• Monotone missing pattern:
  δit = 0 ⇒ δi,t+1 = 0, ∀t = 1, . . . , T − 1.
• Li,t = (Xi ', Yi1 , . . . , Yi,t )': measurements up to time t.
• Parameter of interest θ is estimated by solving
  ∑_{i=1}^n U(θ; Li,T ) = 0
  for θ, under complete response.
Jae-Kwang Kim (ISU)
Part 3
132 / 181
Application to longitudinal missing
Missing mechanism (under monotone missing pattern)
• Missing completely at random (MCAR) :
  P(δit = 1 | δi,t−1 = 1, Li,T ) = P(δit = 1 | δi,t−1 = 1).
• Covariate-dependent missing (CDM) :
P(δit = 1|δi,t−1 = 1, Li,T ) = P(δit = 1|δi,t−1 = 1, Xi ).
• Missing at random (MAR) :
P(δit = 1|δi,t−1 = 1, Li,T ) = P(δit = 1|δi,t−1 = 1, Li,t−1 ).
• Missing not at random (MNAR) : Missing at random does not hold.
Jae-Kwang Kim (ISU)
Part 3
133 / 181
Application to longitudinal missing
Motivation
• Panel attrition is frequently encountered in panel surveys, while classical methods
often assume covariate-dependent missing, which can be unrealistic. We want to
develop a PS method under MAR.
• Want to make full use of available information.
Jae-Kwang Kim (ISU)
Part 3
134 / 181
Application to longitudinal missing
Idea
• Under MAR, in the longitudinal data case, we would consider the conditional
probabilities:
pit := P(δit = 1|δi,t−1 = 1, Li,t−1 ), t = 1, . . . , T .
Then
πit =
t
Y
pij .
j=1
πt then can be modeled through modeling pt with pt (Lt−1 ; φt ).
Q
• Once we obtain π̂iT = Tt=1 p̂it is obtained, we can use
n
X
δiT
U(θ; Li,T ) = 0
π̂
iT
i=1
to obtain a consistent estimator of θ.
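A rough sketch (not from the slides) of this two-step idea for T = 2 waves, assuming each pt follows a logistic regression on Lt−1 fitted among the wave-(t−1) respondents, and then weighting by the cumulative product π̂i2 = p̂i1 p̂i2 to estimate E(Y2); all variable names and the simulated model are hypothetical.

import numpy as np

def fit_logistic(X, d, iters=25):
    # Newton-Raphson for a logistic regression of d on X; returns coefficients
    phi = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ phi)))
        phi = phi + np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (d - p))
    return phi

rng = np.random.default_rng(6)
n = 5000
x = rng.normal(size=n)
y1 = x + rng.normal(size=n)
y2 = 0.5 * x + 0.8 * y1 + rng.normal(size=n)           # parameter of interest: E(Y2)

d1 = rng.binomial(1, 1 / (1 + np.exp(-(1.0 + 0.5 * x))))
p2 = 1 / (1 + np.exp(-(0.5 + 0.5 * y1)))               # depends on observed y1 (MAR)
d2 = d1 * rng.binomial(1, p2)                          # monotone: d2 = 1 only if d1 = 1

# Sequential models: p1 given x; p2 given (x, y1) among wave-1 respondents
L0 = np.column_stack([np.ones(n), x])
L1 = np.column_stack([np.ones(n), x, y1])
p1_hat = 1 / (1 + np.exp(-(L0 @ fit_logistic(L0, d1))))
p2_hat = 1 / (1 + np.exp(-(L1 @ fit_logistic(L1[d1 == 1], d2[d1 == 1]))))

pi2_hat = p1_hat * p2_hat                              # pi_{i2} = p_{i1} * p_{i2}
mu2_hat = np.sum(d2 / pi2_hat * y2) / np.sum(d2 / pi2_hat)
print(mu2_hat, y2.mean())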
Jae-Kwang Kim (ISU)
Part 3
135 / 181
Application to longitudinal missing
Score Function for Longitudinal Data
Under parametric models for the pt 's, the partial likelihood for φ1 , . . . , φT is
  L(φ1 , . . . , φT ) = ∏_{i=1}^n ∏_{t=1}^T [ pit^{δi,t} (1 − pit )^{1−δi,t} ]^{δi,t−1} ,
and the corresponding score functions are (S1 (φ1 ), . . . , ST (φT )), where
  St (φt ) = ∑_{i=1}^n δi,t−1 {δit − pit (φt )} qit (φt )
and qit (φt ) = ∂ logit{pit (φt )}/∂φt . Under a logistic regression model such that pit = 1/{1 + exp(−φt ' Li,t−1 )}, we have qit (φt ) = Li,t−1 .
Jae-Kwang Kim (ISU)
Part 3
136 / 181
Application to longitudinal missing
Remark
• Zhou and Kim (2012) proposed an optimal estimator of µt = E (Yt ) incorporating
all available information.
• The idea can be extended to non-monotone missing data by re-defining
  πit = P (δi1 = · · · = δit = 1 | Lit ) = ∏_{j=1}^t pij ,
  where
  pit := P(δit = 1 | δi1 = · · · = δi,t−1 = 1, Li,t−1 ).
• The score equation for φt in pit = p(Li,t−1 ; φt ) is then
  St (φt ) = ∑_{i=1}^n δ∗i,t−1 {δit − pit (φt )} qit (φt ) = 0,
  where δ∗i,t−1 = ∏_{j=1}^{t−1} δij and qit (φt ) = ∂ logit{pit (φt )}/∂φt .
Jae-Kwang Kim (ISU)
Part 3
137 / 181
8. Concluding remarks
• Uses a model for the response probability.
• Parameter estimation for the response model can be implemented using the maximum likelihood method.
• The GLS method can be used to incorporate auxiliary information.
• DR procedure offers some protection against misspecification of one model or the
other.
• Can be extended to nonignorable missing when the parameters are identifiable
(Part 4).
Jae-Kwang Kim (ISU)
Part 3
138 / 181
REFERENCES
Cheng, P. E. (1994), ‘Nonparametric estimation of mean functionals with data missing
at random’, Journal of the American Statistical Association 89, 81–87.
Dempster, A. P., N. M. Laird and D. B. Rubin (1977), ‘Maximum likelihood from
incomplete data via the EM algorithm’, Journal of the Royal Statistical Society:
Series B 39, 1–37.
Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’,
Philosophical Transactions of the Royal Society of London A 222, 309–368.
Fuller, W. A., M. M. Loughin and H. D. Baker (1994), ‘Regression weighting in the
presence of nonresponse with application to the 1987-1988 Nationwide Food
Consumption Survey’, Survey Methodology 20, 75–85.
Hirano, K., G. Imbens and G. Ridder (2003), ‘Efficient estimation of average treatment
effects using the estimated propensity score’, Econometrica 71, 1161–1189.
Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’, Journal of the
American Statistical Association 85, 765–769.
Kim, J. K. (2011), ‘Parametric fractional imputation for missing data analysis’,
Biometrika 98, 119–132.
Kim, J. K. and C. L. Yu (2011), ‘A semi-parametric estimation of mean functionals with
non-ignorable missing data’, Journal of the American Statistical Association
106, 157–165.
Jae-Kwang Kim (ISU)
Part 3
138 / 181
Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), ‘On the bias of the multiple
imputation variance estimator in survey sampling’, Journal of the Royal Statistical
Society: Series B 68, 509–521.
Kim, J. K. and M. K. Riddles (2012), ‘Some theory for propensity-score-adjustment
estimators in survey sampling’, Survey Methodology 38, 157–165.
Kott, P. S. and T. Chang (2010), ‘Using calibration weighting to adjust for nonignorable
unit nonresponse’, Journal of the American Statistical Association 105, 1265–1275.
Louis, T. A. (1982), ‘Finding the observed information matrix when using the EM
algorithm’, Journal of the Royal Statistical Society: Series B 44, 226–233.
Meng, X. L. (1994), ‘Multiple-imputation inferences with uncongenial sources of input
(with discussion)’, Statistical Science 9, 538–573.
Oakes, D. (1999), ‘Direct calculation of the information matrix via the EM algorithm’,
Journal of the Royal Statistical Society: Series B 61, 479–482.
Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and
applications, in ‘Proceedings of the 6th Berkeley Symposium on Mathematical
Statistics and Probability’, Vol. 1, University of California Press, Berkeley, California,
pp. 695–715.
Redner, R. A. and H. F. Walker (1984), ‘Mixture densities, maximum likelihood and the
EM algorithm’, SIAM Review 26, 195–239.
Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), ‘Estimation of regression coefficients
when some regressors are not always observed’, Journal of the American Statistical
Association 89, 846–866.
Jae-Kwang Kim (ISU)
Part 3
138 / 181
Robins, J. M. and N. Wang (2000), ‘Inference for imputation estimators’, Biometrika
87, 113–124.
Rubin, D. B. (1976), ‘Inference and missing data’, Biometrika 63, 581–590.
Tanner, M. A. and W. H. Wong (1987), ‘The calculation of posterior distribution by
data augmentation’, Journal of the American Statistical Association 82, 528–540.
Wang, N. and J. M. Robins (1998), ‘Large-sample theory for parametric multiple
imputation procedures’, Biometrika 85, 935–948.
Wang, S., J. Shao and J. K. Kim (2014), ‘Identifiability and estimation in problems with
nonignorable nonresponse’, Statistica Sinica 24, 1097 – 1116.
Wei, G. C. and M. A. Tanner (1990), ‘A Monte Carlo implementation of the EM
algorithm and the poor man’s data augmentation algorithms’, Journal of the
American Statistical Association 85, 699–704.
Zhou, M. and J. K. Kim (2012), ‘An efficient method of estimation for longitudinal
surveys with monotone missing data’, Biometrika 99, 631–648.
Statistical Methods for Handling Missing Data
Part 4: Nonignorable missing
Jae-Kwang Kim
Department of Statistics, Iowa State University
Jae-Kwang Kim (ISU)
Part 4
139 / 181
Observed likelihood
• (X , Y ): random variable, y is subject to missingness
• f (y | x; θ): model of y on x
• g (δ | x, y ; φ): model of δ on (x, y )
• Observed likelihood:
  Lobs (θ, φ) = ∏_{δi =1} f (yi | xi ; θ) g (δi | xi , yi ; φ) × ∏_{δi =0} ∫ f (yi | xi ; θ) g (δi | xi , yi ; φ) dyi
• Under what conditions are the parameters identifiable?
Jae-Kwang Kim (ISU)
Part 4
140 / 181
Lemma
Suppose that we can decompose the covariate vector x into two parts, u and z, such that
g (δ|y , x) = g (δ|y , u)
(23)
and, for any given u, there exist zu,1 and zu,2 such that
f (y |u, z = zu,1 ) ≠ f (y |u, z = zu,2 ).
(24)
Under some other minor conditions, all the parameters in f and g are identifiable.
Jae-Kwang Kim (ISU)
Part 4
141 / 181
Remark
• Condition (23) means
δ ⊥ z | y , u.
• That is, given (y , u), z does not help in explaining δ.
• Thus, z plays the role of an instrumental variable in econometrics:
  f (y ∗ | x ∗ , z ∗ ) = f (y ∗ | x ∗ ), Cov (z ∗ , x ∗ ) ≠ 0.
Here, y ∗ = δ, x ∗ = (y , u), and z ∗ = z.
• We may call z the nonresponse instrument variable.
• Rigorous theory developed by Wang et al. (2014).
Jae-Kwang Kim (ISU)
Part 4
142 / 181
Parameter estimation under the existence of nonresponse
instrument variable
• Full likelihood-based ML estimation
• Generalized method of moment (GMM) approach (Section 6.3 of KS)
• Conditional likelihood approach (Section 6.2 of KS)
• Pseudo likelihood approach (Section 6.4 of KS)
• Exponential tilting method (Section 6.5 of KS)
• Latent variable approach (Section 6.6 of KS)
Jae-Kwang Kim (ISU)
Part 4
143 / 181
Full likelihood-based ML estimation
Example (Example 6.2 of KS)
• θ: parameter of interest in f (y | x; θ).
• x is always observed and y is subject to missingness.
• Response probability is nonignorable: πi = π(xi , yi ; φ) with logit(πi ) = φ1 ' xi + φ2 ' yi .
• To guarantee identifiability, we may need
Pr (δ = 1 | x, y ) = Pr (δ = 1 | u, y ),
where x = (u, z).
Jae-Kwang Kim (ISU)
Part 4
144 / 181
Full likelihood-based ML estimation
Example (Cont’d)
• Once the model is identified, we can use the following EM algorithm by fractional imputation:
  1  Generate yi∗(1) , · · · , yi∗(m) from h(yi | xi ).
  2  Using the m imputed values generated from Step 1, compute the fractional weights by
     wij∗(t) ∝ { f (yi∗(j) | xi ; θ̂(t) ) / h(yi∗(j) | xi ) } { 1 − π(xi , yi∗(j) ; φ̂(t) ) },   (25)
     where π(xi , yi ; φ̂) is the estimated response probability evaluated at φ̂.
Jae-Kwang Kim (ISU)
Part 4
145 / 181
Full likelihood-based ML estimation
Example (Cont’d)
  3  Using the imputed data and the fractional weights, the M-step can be implemented by solving
     ∑_{i=1}^n ∑_{j=1}^m wij∗(t) S(θ; xi , yi∗(j) ) = 0   (26)
     and
     ∑_{i=1}^n ∑_{j=1}^m wij∗(t) { δi − π(φ; xi , yi∗(j) ) } (xi ', yi∗(j) ) = 0,   (27)
     where S(θ; xi , yi ) = ∂ log f (yi | xi ; θ)/∂θ.
  4  Set t = t + 1 and go to Step 2. Continue until convergence.
Jae-Kwang Kim (ISU)
Part 4
146 / 181
Basic setup
• (X , Y ): random variable
• θ: defined by solving E {U(θ; X , Y )} = 0.
• yi is subject to missingness:
  δi = 1 if yi responds, and 0 if yi is missing.
• Want to find weights wi such that the solution θ̂w to
  ∑_{i=1}^n δi wi U(θ; xi , yi ) = 0
  is consistent for θ.
Jae-Kwang Kim (ISU)
Part 4
147 / 181
Basic Setup
• Result 1: The choice of
  wi = 1/E (δi | xi , yi )   (28)
  makes the resulting estimator θ̂w consistent.
• Result 2: If δi ∼ Bernoulli(πi ), then using wi = 1/πi also makes the resulting
estimator consistent, but it is less efficient than θ̂w using wi in (28).
Jae-Kwang Kim (ISU)
Part 4
148 / 181
Parameter estimation : GMM method
• Because z is a nonresponse instrumental variable, we may assume
P(δ = 1 | x, y ) = π(φ0 + φ1 u + φ2 y )
for some (φ0 , φ1 , φ2 ).
• Kott and Chang (2010): Construct a set of estimating equations such as
  ∑_{i=1}^n { δi /π(φ0 + φ1 ui + φ2 yi ) − 1 } (1, ui , zi ) = 0
  that are unbiased to zero.
• May have overidentified situation: Use the generalized method of moments
(GMM).
Jae-Kwang Kim (ISU)
Part 4
149 / 181
Example
• Suppose that we are interested in estimating the parameters in the regression
model
  yi = β0 + β1 x1i + β2 x2i + ei ,   (29)
  where E (ei | xi ) = 0.
• Assume that yi is subject to missingness and that
  P(δi = 1 | x1i , x2i , yi ) = exp(φ0 + φ1 x1i + φ2 yi ) / {1 + exp(φ0 + φ1 x1i + φ2 yi )}.
Thus, x2i is the nonresponse instrument variable in this setup.
Jae-Kwang Kim (ISU)
Part 4
150 / 181
Example (Cont’d)
• A consistent estimator of φ can be obtained by solving
  Û2 (φ) ≡ ∑_{i=1}^n { δi /π(φ; x1i , yi ) − 1 } (1, x1i , x2i ) = (0, 0, 0).   (30)
  Roughly speaking, the solution to (30) exists almost surely if E {∂ Û2 (φ)/∂φ} is of full rank in a neighborhood of the true value of φ. If x2 is a vector, then (30) is overidentified and an exact solution to (30) may not exist. In that case, the GMM algorithm can be used.
• The solution to Û2 (φ) = 0 can be obtained by finding the minimizer of Q(φ) = Û2 (φ)' Û2 (φ) or QW (φ) = Û2 (φ)' W Û2 (φ), where W = {V (Û2 )}^{-1}.
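A rough sketch (not from the text) of estimating φ by minimizing Q(φ) = Û2(φ)'Û2(φ) for the logistic response model of this example, using scipy's general-purpose optimizer; the simulated data, starting values, and optimizer settings are hypothetical, and one could instead solve Û2(φ) = 0 directly with a root finder in the exactly identified case.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 10000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                                 # nonresponse instrument
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)
phi_true = np.array([0.3, 0.5, -0.5])
pi_true = 1 / (1 + np.exp(-(phi_true[0] + phi_true[1] * x1 + phi_true[2] * y)))
delta = rng.binomial(1, pi_true)
Z = np.column_stack([np.ones(n), x1, x2])               # instruments (1, x1, x2)

def U2(phi):
    # U2(phi) = n^{-1} sum_i {delta_i / pi(phi; x1i, yi) - 1} (1, x1i, x2i);
    # y enters only through respondents, since delta_i / pi - 1 = -1 whenever delta_i = 0
    pi = 1 / (1 + np.exp(-(phi[0] + phi[1] * x1 + phi[2] * y)))
    return ((delta / pi - 1)[:, None] * Z).mean(axis=0)

def Q(phi):
    u = U2(phi)
    return u @ u                                        # Q(phi) = U2(phi)' U2(phi)

phi_hat = minimize(Q, x0=np.zeros(3), method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-12}).x
print(phi_hat)                                          # should be near (0.3, 0.5, -0.5)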
Jae-Kwang Kim (ISU)
Part 4
151 / 181
Example (Cont’d)
• Once the solution φ̂ to (30) is obtained, then a consistent estimator of
β = (β0 , β1 , β2 ) can be obtained by solving
Û1 (β, φ̂) ≡
n
X
δi
{yi − β0 − β1 x1i − β2 x2i } (1, x1i , x2i ) = (0, 0, 0)
π̂
i
i=1
(31)
for β.
Jae-Kwang Kim (ISU)
Part 4
152 / 181
Asymptotic Properties
• The asymptotic variance of the GMM estimator φ̂ that minimizes Û2 (φ)' Σ̂^{-1} Û2 (φ) is
  V (φ̂) ≅ (Γ' Σ^{-1} Γ)^{-1},
  where Γ = E {∂ Û2 (φ)/∂φ} and Σ = V (Û2 ).
• The variance is estimated by
  (Γ̂' Σ̂^{-1} Γ̂)^{-1},
  where Γ̂ = ∂ Û2 /∂φ evaluated at φ̂ and Σ̂ is an estimated variance–covariance matrix of Û2 (φ) evaluated at φ̂.
Jae-Kwang Kim (ISU)
Part 4
153 / 181
Asymptotic Properties
• The asymptotic variance of β̂ obtained from (31), with φ̂ computed from the GMM, can be obtained from
  V (θ̂) ≅ (Γa ' Σa^{-1} Γa )^{-1},
  where Γa = E {∂ Û(θ)/∂θ}, Σa = V (Û), Û = (Û1 ', Û2 ')', and θ = (β, φ).
Jae-Kwang Kim (ISU)
Part 4
154 / 181
Likelihood-based approach
• A classical likelihood-based approach for parameter estimation under nonignorable nonresponse is to maximize Lobs (θ, φ) with respect to (θ, φ), where
  Lobs (θ, φ) = ∏_{δi =1} f (yi | xi ; θ) g (δi | xi , yi ; φ) × ∏_{δi =0} ∫ f (yi | xi ; θ) g (δi | xi , yi ; φ) dyi .
• Such an approach can be called a full likelihood-based approach because it uses the full information available in the observed data.
• However, it is well known that such a full likelihood-based approach is quite sensitive to failure of the assumed model.
• On the other hand, a partial likelihood-based approach (or conditional likelihood approach) uses a subset of the sample.
Jae-Kwang Kim (ISU)
Part 4
155 / 181
Conditional Likelihood approach
Idea
• Since
  f (y | x) g (δ | x, y ) = f1 (y | x, δ) g1 (δ | x)
  for some f1 and g1 , we can write
  Lobs (θ) = ∏_{δi =1} f1 (yi | xi , δi = 1) g1 (δi | xi ) × ∏_{δi =0} ∫ f1 (yi | xi , δi = 0) g1 (δi | xi ) dyi
           = ∏_{δi =1} f1 (yi | xi , δi = 1) × ∏_{i=1}^n g1 (δi | xi ).
• The conditional likelihood is defined to be the first component:
  Lc (θ) = ∏_{δi =1} f1 (yi | xi , δi = 1) = ∏_{δi =1} f (yi | xi ; θ) π(xi , yi ) / ∫ f (y | xi ; θ) π(xi , y ) dy ,
  where π(xi , yi ) = Pr (δi = 1 | xi , yi ).
Jae-Kwang Kim (ISU)
Part 4
156 / 181
Conditional Likelihood approach
Example
• Assume that the original sample is a random sample from an exponential
distribution with mean µ = 1/θ. That is, the probability density function of y is
f (y ; θ) = θ exp(−θy )I (y > 0).
• Suppose that we observe yi only when yi > K for a known K > 0.
• Thus, the response indicator function is defined by δi = 1 if yi > K and δi = 0
otherwise.
Jae-Kwang Kim (ISU)
Part 4
157 / 181
Conditional Likelihood approach
Example
• To compute the maximum likelihood estimator from the observed likelihood, note that
  Sobs (θ) = ∑_{δi =1} (1/θ − yi ) + ∑_{δi =0} {1/θ − E (yi | δi = 0)}.
• Since
  E (yi | δi = 0) = E (Y | Y ≤ K ) = 1/θ − K exp(−θK )/{1 − exp(−θK )},
  the maximum likelihood estimator of θ can be obtained by the following iteration equation:
  θ̂(t+1) = [ ȳr − {(n − r )/r } K exp(−K θ̂(t) )/{1 − exp(−K θ̂(t) )} ]^{-1},   (32)
  where r = ∑_{i=1}^n δi and ȳr = r^{-1} ∑_{i=1}^n δi yi .
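A minimal sketch (not from the text) implementing the iteration (32) on simulated exponential data, together with the conditional-likelihood estimator θ̂c = 1/(ȳr − K) that appears on the next slide; the seed, sample size, and parameter values are hypothetical.

import numpy as np

rng = np.random.default_rng(8)
theta_true, K, n = 0.5, 1.0, 100000
y = rng.exponential(scale=1 / theta_true, size=n)
delta = (y > K).astype(int)                    # y is observed only when y > K
r, ybar_r = delta.sum(), y[delta == 1].mean()

# Iteration (32) for the observed-likelihood MLE of theta
theta = 1 / ybar_r                             # starting value
for _ in range(200):
    adj = K * np.exp(-K * theta) / (1 - np.exp(-K * theta))
    theta = 1 / (ybar_r - (n - r) / r * adj)

theta_cond = 1 / (ybar_r - K)                  # maximum conditional likelihood estimator
print(theta, theta_cond)                       # both close to theta_true = 0.5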
Jae-Kwang Kim (ISU)
Part 4
158 / 181
Conditional Likelihood approach
Example
• Since πi = Pr (δi = 1 | yi ) = I (yi > K ) and E (πi ) = E {I (yi > K )} = exp(−K θ), the conditional likelihood reduces to
  ∏_{δi =1} θ exp{−θ(yi − K )}.
  The maximum conditional likelihood estimator of θ is
  θ̂c = 1/(ȳr − K ).
  Since E (y | y > K ) = µ + K , the maximum conditional likelihood estimator of µ, which is µ̂c = 1/θ̂c , is unbiased for µ.
Jae-Kwang Kim (ISU)
Part 4
159 / 181
Conditional Likelihood approach
Remark
• Under some regularity conditions, the solution θ̂c that maximizes Lc (θ) satisfies
  Ic^{1/2} (θ̂c − θ) → N(0, I ) in distribution,
  where
  Ic (θ) = −E{ ∂Sc (θ)/∂θ' } = ∑_{i=1}^n [ E (Si Si ' πi | xi ; θ) − {E (Si πi | xi ; θ)}⊗2 / E (πi | xi ; θ) ],
  Sc (θ) = ∂ ln Lc (θ)/∂θ, and Si (θ) = ∂ ln f (yi | xi ; θ)/∂θ.
• Works only when π(x, y ) is a known function.
• Does not require nonresponse instrumental variable assumption.
• Popular for biased sampling problem.
Jae-Kwang Kim (ISU)
Part 4
160 / 181
Pseudo Likelihood approach
Idea
• Consider bivariate (xi , yi ) with density f (y | x; θ)h(x) where yi are subject to
missingness.
• We are interested in estimating θ.
• Suppose that Pr (δ = 1 | x, y ) depends only on y . (i.e. x is nonresponse
instrument)
• Note that f (x | y , δ) = f (x | y ).
• Thus, we can consider the following conditional likelihood:
  Lc (θ) = ∏_{δi =1} f (xi | yi , δi = 1) = ∏_{δi =1} f (xi | yi ).
• We can consider maximizing the pseudo likelihood
  Lp (θ) = ∏_{δi =1} f (yi | xi ; θ) ĥ(xi ) / ∫ f (yi | x; θ) ĥ(x) dx ,
  where ĥ(x) is a consistent estimator of the marginal density of x.
Jae-Kwang Kim (ISU)
Part 4
161 / 181
Pseudo Likelihood approach
Idea
• We may use the empirical density for ĥ(x); that is, ĥ(x) puts mass 1/n at each xi . In this case,
  Lp (θ) = ∏_{δi =1} f (yi | xi ; θ) / { ∑_{k=1}^n f (yi | xk ; θ) }.
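A rough sketch (not from the slides) of maximizing this empirical-density pseudo likelihood for a hypothetical normal linear model f(y | x; θ) = N(β0 + β1 x, σ²), with the response probability depending only on y (so x acts as the instrument); the optimizer, starting values, and simulated data are all assumptions.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 1.0 * x + rng.normal(size=n)
delta = rng.binomial(1, 1 / (1 + np.exp(-(0.5 - 1.0 * y))))   # response depends only on y

def norm_pdf(v, mean, sigma):
    return np.exp(-0.5 * ((v - mean) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def neg_log_pseudo_lik(par):
    b0, b1, log_sig = par
    sig = np.exp(log_sig)
    yr, xr = y[delta == 1], x[delta == 1]
    # numerator: f(y_i | x_i; theta); denominator: n^{-1} sum_k f(y_i | x_k; theta)
    num = norm_pdf(yr, b0 + b1 * xr, sig)
    den = norm_pdf(yr[:, None], b0 + b1 * x[None, :], sig).mean(axis=1)
    return -np.sum(np.log(num / den))

# start away from b1 = 0, where the pseudo likelihood is flat in the other parameters
theta_hat = minimize(neg_log_pseudo_lik, x0=np.array([0.0, 0.5, 0.0]),
                     method="Nelder-Mead").x
print(theta_hat[:2], np.exp(theta_hat[2]))   # estimates of (beta0, beta1, sigma)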
• We can extend the idea to the case of x = (u, z), where z is a nonresponse instrument. In this case, the conditional likelihood becomes
  ∏_{i:δi =1} p(zi | yi , ui ) = ∏_{i:δi =1} f (yi | ui , zi ; θ) p(zi | ui ) / ∫ f (yi | ui , z; θ) p(z | ui ) dz .   (33)
Jae-Kwang Kim (ISU)
Part 4
162 / 181
Pseudo Likelihood approach
• Let p̂(z|u) be an estimated conditional probability density of z given u. Substituting
this estimate into the likelihood in (33), we obtain the following pseudo likelihood:
  ∏_{i:δi =1} f (yi | ui , zi ; θ) p̂(zi | ui ) / ∫ f (yi | ui , z; θ) p̂(z | ui ) dz .   (34)
• The pseudo maximum likelihood estimator (PMLE) of θ, denoted by θ̂p , can be obtained by solving
  Sp (θ; α̂) ≡ ∑_{δi =1} [ S(θ; xi , yi ) − E {S(θ; ui , z, yi ) | yi , ui ; θ, α̂} ] = 0
  for θ, where S(θ; x, y ) = S(θ; u, z, y ) = ∂ log f (y | x; θ)/∂θ and
  E {S(θ; ui , z, yi ) | yi , ui ; θ, α̂} = ∫ S(θ; ui , z, yi ) f (yi | ui , z; θ) p(z | ui ; α̂) dz / ∫ f (yi | ui , z; θ) p(z | ui ; α̂) dz .
Jae-Kwang Kim (ISU)
Part 4
163 / 181
Pseudo Likelihood approach
• The Fisher-scoring method for obtaining the PMLE is given by
  θ̂p^(t+1) = θ̂p^(t) + { Ip (θ̂(t) , α̂) }^{-1} Sp (θ̂(t) , α̂),
  where
  Ip (θ, α̂) = ∑_{δi =1} [ E {S(θ; ui , z, yi )⊗2 | yi , ui ; θ, α̂} − E {S(θ; ui , z, yi ) | yi , ui ; θ, α̂}⊗2 ].
• Variance estimation is very complicated. Jackknife or bootstrap can be used.
Jae-Kwang Kim (ISU)
Part 4
164 / 181
Exponential tilting method
Motivation
• The observed likelihood function can be written as
  Lobs (φ) = ∏_{i=1}^n {πi (φ)}^{δi} [ ∫ {1 − π(xi , y ; φ)} f (y | xi ) dy ]^{1−δi} ,
  where f (y | x) is the true conditional distribution of y given x.
• To find the MLE of φ, we solve the mean score equation S̄(φ) = 0, where
  S̄(φ) = ∑_{i=1}^n [ δi Si (φ) + (1 − δi ) E {Si (φ) | xi , δi = 0} ],   (35)
  where Si (φ) = {δi − πi (φ)} {∂ logit πi (φ)/∂φ} is the score function of φ for the density g (δ | x, y ; φ) = π^δ (1 − π)^{1−δ} with π = π(x, y ; φ).
Jae-Kwang Kim (ISU)
Part 4
165 / 181
Motivation
• The conditional expectation in (35) can be evaluated by using
  f (y | x, δ = 0) = f (y | x) P(δ = 0 | x, y ) / E {P(δ = 0 | x, y ) | x}.   (36)
  Two problems occur:
  1  Requires correct specification of f (y | x; θ). Known to be sensitive to the choice of f (y | x; θ).
  2  Computationally heavy: often requires Monte Carlo computation.
Jae-Kwang Kim (ISU)
Part 4
166 / 181
Exponential tilting method
Remedy (for Problem One)
Idea
Instead of specifying a parametric model for f (y | x), consider specifying a parametric
model for f (y | x, δ = 1), denoted by f1 (y | x). In this case,
  E {Si (φ) | xi , δi = 0} = ∫ Si (φ) f1 (y | xi ) O(xi , y ; φ) dy / ∫ f1 (y | xi ) O(xi , y ; φ) dy ,
where
  O(x, y ; φ) = {1 − π(φ; x, y )} / π(φ; x, y ).
Jae-Kwang Kim (ISU)
Part 4
167 / 181
Remark
• The method is based on the following identity:
  f0 (yi | xi ) = f1 (yi | xi ) × O(xi , yi ) / E {O(xi , Yi ) | xi , δi = 1},   (37)
  where fδ (yi | xi ) = f (yi | xi , δi = δ) and
  O(xi , yi ) = Pr (δi = 0 | xi , yi ) / Pr (δi = 1 | xi , yi )   (38)
  is the conditional odds of nonresponse.
• Kim and Yu (2011) considered a Kernel-based nonparametric regression method of
estimating f (y | x, δ = 1) to obtain E (Y | x, δ = 0).
Jae-Kwang Kim (ISU)
Part 4
168 / 181
• If the response probability follows a logistic regression model,
  π(ui , yi ) ≡ Pr (δi = 1 | ui , yi ) = exp(φ0 + φ1 ui + φ2 yi ) / {1 + exp(φ0 + φ1 ui + φ2 yi )},   (39)
  the expression (37) simplifies to
  f0 (yi | xi ) = f1 (yi | xi ) × exp(γ yi ) / E {exp(γ Y ) | xi , δi = 1},   (40)
  where γ = −φ2 and f1 (y | x) is the conditional density of y given x and δ = 1.
• Model (40) states that the density for the nonrespondents is an exponential tilting of the density for the respondents. The parameter γ is the tilting parameter that determines the amount of departure from ignorability of the response mechanism. If γ = 0, the response mechanism is ignorable and f0 (y | x) = f1 (y | x).
Jae-Kwang Kim (ISU)
Part 4
169 / 181
Exponential tilting method
Problem Two
How to compute
  E {Si (φ) | xi , δi = 0} = ∫ Si (φ) O(xi , y ; φ) f1 (y | xi ) dy / ∫ O(xi , y ; φ) f1 (y | xi ) dy
without relying on Monte Carlo computation?
Jae-Kwang Kim (ISU)
Part 4
170 / 181
• Computation of
  E1 {Q(xi , Y ) | xi } = ∫ Q(xi , y ) f1 (y | xi ) dy .
• If xi were null, we would approximate the integral by the empirical distribution among the δ = 1 units.
• Use
  ∫ Q(xi , y ) f1 (y | xi ) dy = ∫ Q(xi , y ) {f1 (y | xi )/f1 (y )} f1 (y ) dy ∝ ∑_{δj =1} Q(xi , yj ) f1 (yj | xi )/f1 (yj ),
  where
  f1 (y ) = ∫ f1 (y | x) f (x | δ = 1) dx ∝ ∑_{δi =1} f1 (y | xi ).
Jae-Kwang Kim (ISU)
Part 4
171 / 181
Exponential tilting method
• In practice, f1 (y | x) is unknown and is estimated by fˆ1 (y | x) = f1 (y | x; γ̂).
• Thus, given γ̂, a fully efficient estimator of φ can be obtained by solving
  S2 (φ, γ̂) ≡ ∑_{i=1}^n [ δi S(φ; xi , yi ) + (1 − δi ) S̄0 (φ | xi ; γ̂, φ) ] = 0,   (41)
  where
  S̄0 (φ | xi ; γ̂, φ) = [ ∑_{δj =1} S(φ; xi , yj ) f1 (yj | xi ; γ̂) O(φ; xi , yj )/f̂1 (yj ) ] / [ ∑_{δj =1} f1 (yj | xi ; γ̂) O(φ; xi , yj )/f̂1 (yj ) ]
  and
  f̂1 (y ) = nR^{-1} ∑_{i=1}^n δi f1 (y | xi ; γ̂).
• May use EM algorithm to solve (41) for φ.
Jae-Kwang Kim (ISU)
Part 4
172 / 181
Exponential tilting method
• Step 1: Using the responding part of (xi , yi ), obtain γ̂ in the model f1 (y | x; γ) by solving
  S1 (γ) ≡ ∑_{δi =1} S1 (γ; xi , yi ) = 0.   (42)
• Step 2: Given γ̂ from Step 1, obtain φ̂ by solving (41): S2 (φ, γ̂) = 0.
• Step 3: Using φ̂ computed from Step 2, the PSA estimator of θ can be obtained by solving
  ∑_{i=1}^n (δi /π̂i ) U(θ; xi , yi ) = 0,   (43)
  where π̂i = πi (φ̂).
Jae-Kwang Kim (ISU)
Part 4
173 / 181
Exponential tilting method
Remark
• In many cases, x is categorical and f1 (y | x) can be fully nonparametric.
• If x has a continuous part, nonparametric Kernel smoothing can be used.
• The proposed method seems to be robust against failure of the assumed model for f1 (y | x; γ).
• Asymptotic normality of the PSA estimator can be established, and the linearization method can be used for variance estimation (details skipped).
• By augmenting the estimating function, we can also impose a calibration constraint such as
  ∑_{i=1}^n (δi /π̂i ) xi = ∑_{i=1}^n xi .
Jae-Kwang Kim (ISU)
Part 4
174 / 181
Exponential tilting method
Example (Example 6.5 of KS)
• Assume that both xi = (zi , ui ) and yi are categorical, taking values in Sz × Su and Sy , respectively.
• We are interested in estimating θk = Pr (Y = k), for k ∈ Sy .
• Now, we have nonresponse in y ; let δi be the response indicator function for yi . We assume that the response probability satisfies
  Pr (δ = 1 | x, y ) = π(u, y ; φ).
• To estimate φ, we first compute the observed conditional probability of y among the respondents:
  p̂1 (y | xi ) = ∑_{δj =1} I (xj = xi , yj = y ) / ∑_{δj =1} I (xj = xi ).
Jae-Kwang Kim (ISU)
Part 4
175 / 181
Exponential tilting method
Example (Cont’d)
• The EM algorithm can be implemented by (41) with
  S̄0 (φ | xi ; φ) = [ ∑_{δj =1} S(φ; δi , ui , yj ) p̂1 (yj | xi ) O(φ; ui , yj )/p̂1 (yj ) ] / [ ∑_{δj =1} p̂1 (yj | xi ) O(φ; ui , yj )/p̂1 (yj ) ],
  where O(φ; u, y ) = {1 − π(u, y ; φ)}/π(u, y ; φ) and
  p̂1 (y ) = nR^{-1} ∑_{i=1}^n δi p̂1 (y | xi ).
• Alternatively, we can use
  S̄0 (φ | xi ; φ) = [ ∑_{y ∈Sy} S(φ; δi , ui , y ) p̂1 (y | xi ) O(φ; ui , y ) ] / [ ∑_{y ∈Sy} p̂1 (y | xi ) O(φ; ui , y ) ].   (44)
Jae-Kwang Kim (ISU)
Part 4
176 / 181
Exponential tilting method
Example (Cont’d)
• Once π̂(u, y ) = π(u, y ; φ̂) is computed, we can use
  θ̂k,ET = n^{-1} [ ∑_{δi =1} I (yi = k) + ∑_{δi =0} ∑_{y ∈Sy} wiy∗ I (y = k) ],
  where wiy∗ are the fractional weights computed by
  wiy∗ = [ {π̂(ui , y )}^{-1} − 1 ] p̂1 (y | xi ) / ∑_{y ∈Sy} [ {π̂(ui , y )}^{-1} − 1 ] p̂1 (y | xi ).
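A toy sketch (not from the text) of computing the fractional weights wiy* and θ̂k,ET for a small categorical example, assuming π̂(u, y) and p̂1(y | x) have already been obtained; here p̂1(y | x) is taken to depend on x only through u, and all numbers are hypothetical.

import numpy as np

# Hypothetical categorical setup: y takes values in {0, 1}, u in {0, 1}.
pi_hat = np.array([[0.8, 0.5],     # estimated pi_hat(u, y), row u = 0
                   [0.7, 0.4]])    # row u = 1
p1_hat = np.array([[0.6, 0.4],     # p1_hat(y | u = 0)
                   [0.3, 0.7]])    # p1_hat(y | u = 1)

# Observed data (u_i, y_i, delta_i); y_i is meaningful only when delta_i = 1
u = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 1, 1, 0, 1, 0])
delta = np.array([1, 1, 1, 0, 0, 0])
n, Sy = len(u), np.array([0, 1])

theta_hat = np.zeros(2)            # theta_k = Pr(Y = k), k in {0, 1}
for k in Sy:
    total = float(np.sum((delta == 1) & (y == k)))
    for i in np.where(delta == 0)[0]:
        # w*_{iy} proportional to {pi_hat(u_i, y)^{-1} - 1} * p1_hat(y | x_i), normalized over y
        w = (1 / pi_hat[u[i], Sy] - 1) * p1_hat[u[i], Sy]
        w = w / w.sum()
        total += w[k]
    theta_hat[k] = total / n

print(theta_hat, theta_hat.sum())  # estimated category probabilities; they sum to 1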
Jae-Kwang Kim (ISU)
Part 4
177 / 181
Real Data Example
Exit Poll: The Assembly election (2012 Gang-dong district in Seoul)
Gender   Age      Party A   Party B   Other   Refusal     Total
Male     20-29         93       115       4        28       240
Male     30-39        104       233       8        82       427
Male     40-49        146       295       5        49       495
Male     50-          560       350       3       174     1,087
Female   20-29        106       159       8        62       335
Female   30-39        129       242       5        70       446
Female   40-49        170       262       5        69       506
Female   50-          501       218       7       211       937
Total               1,809     1,874      45       745     4,473
Truth              62,489    57,909   1,624              122,022
Jae-Kwang Kim (ISU)
Part 4
178 / 181
Comparison of the methods (%)
Table : Analysis result : Gang-dong district in Seoul
Method                   Party A   Party B   Other
No adjustment               48.5      50.3     1.2
Adjustment (Age × Sex)      49.0      49.8     1.2
New Method                  51.0      47.7     1.2
Truth                       51.2      47.5     1.3
Jae-Kwang Kim (ISU)
Part 4
179 / 181
Analysis result in Seoul (48 Seats)
Table : Analysis result : 48 seats in Seoul
Method                   Party A   Party B   Other
No adjustment                 10        36       2
Adjustment (Age × Sex)        10        36       2
New Method                    15        29       4
Truth                         16        30       2
Jae-Kwang Kim (ISU)
Part 4
180 / 181
6. Concluding remarks
• Uses a model for the response probability.
• Parameter estimation for the response model can be implemented using the maximum likelihood method.
• An instrumental variable is needed for identifiability of the response model.
• Likelihood-based approach vs. GMM approach.
• Fewer tools are available for model diagnostics or model validation.
• Promising areas of research.
Jae-Kwang Kim (ISU)
Part 4
181 / 181
REFERENCES
Cheng, P. E. (1994), ‘Nonparametric estimation of mean functionals with data missing
at random’, Journal of the American Statistical Association 89, 81–87.
Dempster, A. P., N. M. Laird and D. B. Rubin (1977), ‘Maximum likelihood from
incomplete data via the EM algorithm’, Journal of the Royal Statistical Society:
Series B 39, 1–37.
Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’,
Philosophical Transactions of the Royal Society of London A 222, 309–368.
Fuller, W. A., M. M. Loughin and H. D. Baker (1994), ‘Regression weighting in the
presence of nonresponse with application to the 1987-1988 Nationwide Food
Consumption Survey’, Survey Methodology 20, 75–85.
Hirano, K., G. Imbens and G. Ridder (2003), ‘Efficient estimation of average treatment
effects using the estimated propensity score’, Econometrica 71, 1161–1189.
Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’, Journal of the
American Statistical Association 85, 765–769.
Kim, J. K. (2011), ‘Parametric fractional imputation for missing data analysis’,
Biometrika 98, 119–132.
Kim, J. K. and C. L. Yu (2011), ‘A semi-parametric estimation of mean functionals with
non-ignorable missing data’, Journal of the American Statistical Association
106, 157–165.
Jae-Kwang Kim (ISU)
Part 4
181 / 181
Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), ‘On the bias of the multiple
imputation variance estimator in survey sampling’, Journal of the Royal Statistical
Society: Series B 68, 509–521.
Kim, J. K. and M. K. Riddles (2012), ‘Some theory for propensity-score-adjustment
estimators in survey sampling’, Survey Methodology 38, 157–165.
Kott, P. S. and T. Chang (2010), ‘Using calibration weighting to adjust for nonignorable
unit nonresponse’, Journal of the American Statistical Association 105, 1265–1275.
Louis, T. A. (1982), ‘Finding the observed information matrix when using the EM
algorithm’, Journal of the Royal Statistical Society: Series B 44, 226–233.
Meng, X. L. (1994), ‘Multiple-imputation inferences with uncongenial sources of input
(with discussion)’, Statistical Science 9, 538–573.
Oakes, D. (1999), ‘Direct calculation of the information matrix via the EM algorithm’,
Journal of the Royal Statistical Society: Series B 61, 479–482.
Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and
applications, in ‘Proceedings of the 6th Berkeley Symposium on Mathematical
Statistics and Probability’, Vol. 1, University of California Press, Berkeley, California,
pp. 695–715.
Redner, R. A. and H. F. Walker (1984), ‘Mixture densities, maximum likelihood and the
EM algorithm’, SIAM Review 26, 195–239.
Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), ‘Estimation of regression coefficients
when some regressors are not always observed’, Journal of the American Statistical
Association 89, 846–866.
Jae-Kwang Kim (ISU)
Part 4
181 / 181
Robins, J. M. and N. Wang (2000), ‘Inference for imputation estimators’, Biometrika
87, 113–124.
Rubin, D. B. (1976), ‘Inference and missing data’, Biometrika 63, 581–590.
Tanner, M. A. and W. H. Wong (1987), ‘The calculation of posterior distribution by
data augmentation’, Journal of the American Statistical Association 82, 528–540.
Wang, N. and J. M. Robins (1998), ‘Large-sample theory for parametric multiple
imputation procedures’, Biometrika 85, 935–948.
Wang, S., J. Shao and J. K. Kim (2014), ‘Identifiability and estimation in problems with
nonignorable nonresponse’, Statistica Sinica 24, 1097 – 1116.
Wei, G. C. and M. A. Tanner (1990), ‘A Monte Carlo implementation of the EM
algorithm and the poor man’s data augmentation algorithms’, Journal of the
American Statistical Association 85, 699–704.
Zhou, M. and J. K. Kim (2012), ‘An efficient method of estimation for longitudinal
surveys with monotone missing data’, Biometrika 99, 631–648.
Jae-Kwang Kim (ISU)
Part 4
181 / 181