7. Multiple Imputation: Part 1

1 Introduction

• Assume a parametric model: y ∼ f (y | x; θ)
• We are interested in making inference about θ.
• In the Bayesian approach, we want to make inference about θ from

      f(θ | x, y) = π(θ) f(y | x, θ) / ∫ π(θ) f(y | x, θ) dθ,

  where π(θ) is a prior distribution, assumed known for simplicity here.
• The point estimator is

      θ̂_n = E{θ | x, y}                                        (1)

  and its variance estimator is

      V̂_n = V{θ | x, y}.                                       (2)

  We may express θ̂_n = θ̂_n(x, y) and V̂_n = V̂_n(x, y).
• Now consider the case where x is always observed and y is subject to missingness. Let y = (y_obs, y_mis) be the (observed, missing) part of the sample.
• We have two approaches to making inference about θ using the observed data (x, y_obs):
  1. Direct Bayesian approach: Consider

         f(θ | x, y_obs) = π(θ) f(y_obs | x, θ) / ∫ π(θ) f(y_obs | x, θ) dθ

     and use

         θ̂_r = E{θ | x, y_obs}                                 (3)

     and its variance estimator

         V̂_r = V{θ | x, y_obs}.                                (4)
2. Multiple imputation approach:
     (a) For each k (k = 1, ..., M), generate y_mis^{*(k)} from f(y_mis | x, y_obs).
     (b) Apply the k-th imputed values to θ̂_n in (1) to obtain θ̂_n^{*(k)} = θ̂_n(x, y_obs, y_mis^{*(k)}). Also, apply the k-th imputed values to V̂_n in (2) to obtain V̂_n^{*(k)} = V̂_n(x, y_obs, y_mis^{*(k)}).
     (c) Combine the M point estimators to get

             θ̂_MI = (1/M) Σ_{k=1}^M θ̂_n^{*(k)}

         as a point estimator of θ.
     (d) The variance estimator of θ̂_MI is

             V̂_MI = W_M + (1 + 1/M) B_M,

         where

             W_M = (1/M) Σ_{k=1}^M V̂_n^{*(k)}   and   B_M = (1/(M − 1)) Σ_{k=1}^M ( θ̂_n^{*(k)} − θ̂_MI )².

         A short Python sketch of the combining rules in steps (c) and (d) is given below.
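The following is a minimal Python sketch of the combining rules in steps (c) and (d), assuming the M completed-data point estimates and variance estimates are already stored in arrays; the function name rubin_combine is illustrative and not part of the notes.

```python
import numpy as np

def rubin_combine(point_estimates, variance_estimates):
    """Combine M completed-data analyses with Rubin's rules (scalar parameter)."""
    est = np.asarray(point_estimates, dtype=float)     # theta-hat^{*(k)}, k = 1, ..., M
    var = np.asarray(variance_estimates, dtype=float)  # V-hat^{*(k)},     k = 1, ..., M
    M = est.size
    theta_MI = est.mean()                  # point estimator (step (c))
    W_M = var.mean()                       # within-imputation variance
    B_M = est.var(ddof=1)                  # between-imputation variance
    V_MI = W_M + (1.0 + 1.0 / M) * B_M     # Rubin's variance formula (step (d))
    return theta_MI, V_MI

# toy usage with made-up numbers
print(rubin_combine([1.02, 0.97, 1.05, 1.01, 0.99],
                    [0.010, 0.011, 0.009, 0.010, 0.012]))
```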
Comparison of the Bayesian and frequentist approaches:

                            Bayesian                      Frequentist
    Model                   Posterior distribution        Prediction model
                            f(latent, θ | data)           f(latent | data, θ)
    Computation             Data augmentation             EM algorithm
      Prediction              I-step                        E-step
      Parameter update        P-step                        M-step
    Parameter estimation    Posterior mode                ML estimation
    Imputation              Multiple imputation           Fractional imputation
    Variance estimation     Rubin's formula               Linearization or Bootstrap

2 Main Result

1. Bayesian Properties (Rubin, 1987): For sufficiently large M, we have

       θ̂_MI = (1/M) Σ_{k=1}^M θ̂_n^{*(k)}
            = (1/M) Σ_{k=1}^M E(θ | x, y_obs, y_mis^{*(k)})
            ≈ E{ E(θ | x, y_obs, Y_mis) | x, y_obs }
            = E(θ | x, y_obs),

   which is equal to θ̂_r in (3). Also, for sufficiently large M,

       V̂_MI ≈ W_M + B_M
            = (1/M) Σ_{k=1}^M V̂_n^{*(k)} + (1/(M − 1)) Σ_{k=1}^M ( θ̂_n^{*(k)} − θ̂_MI )²
            ≈ E{ V(θ | x, y_obs, Y_mis) | x, y_obs } + V{ E(θ | x, y_obs, Y_mis) | x, y_obs },

   which is equal to V̂_r in (4).
2. Frequentist Properties (Wang and Robins, 1998):
   Assume that, under complete data, θ̂_n is the MLE of θ and V̂_n = {I(θ̂_n)}⁻¹ is asymptotically unbiased for V(θ̂_n). In the presence of missing data, θ̂_MI is asymptotically equivalent to the MLE of θ and V̂_MI is approximately unbiased for V(θ̂_MI). That is,

       θ̂_MI ≅ θ̂_MLE                                           (5)

   and

       E{V̂_MI} ≅ V(θ̂_MI)                                      (6)

   for sufficiently large M and n. (See Appendix A for a sketched proof.)

3 Computation

• Gibbs sampling
– Geman and Geman (1984): the “Gibbs sampler” for Bayesian image reconstruction
– Tanner and Wong (1987): data augmentation for Bayesian inference in
generic missing-data problems
– Gelfand and Smith (1990): simulation of marginal distributions by repeated
draws from conditionals
• Idea for Gibbs sampling: sample from the full conditional distributions. Given Z^(t) = (Z_1^(t), Z_2^(t), ..., Z_J^(t)), draw Z^(t+1) by sampling from the full conditionals of f:

      Z_1^(t+1) ∼ P( Z_1 | Z_2^(t), Z_3^(t), ..., Z_J^(t) )
      Z_2^(t+1) ∼ P( Z_2 | Z_1^(t+1), Z_3^(t), ..., Z_J^(t) )
        ...
      Z_J^(t+1) ∼ P( Z_J | Z_1^(t+1), Z_2^(t+1), ..., Z_{J−1}^(t+1) ).

  Under mild regularity conditions, P(Z^(t)) → f as t → ∞.
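A small concrete instance may help fix ideas: the sketch below runs a two-component Gibbs sampler for a bivariate normal target with correlation ρ, whose full conditionals are N(ρZ_2, 1 − ρ²) and N(ρZ_1, 1 − ρ²). The example target and the function name are illustrative choices, not taken from the notes.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, seed=1):
    """Systematic-scan Gibbs sampler for (Z1, Z2) ~ N(0, [[1, rho], [rho, 1]])."""
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0                      # arbitrary starting values
    sd = np.sqrt(1.0 - rho ** 2)           # conditional standard deviation
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        z1 = rng.normal(rho * z2, sd)      # draw Z1 | Z2 = z2
        z2 = rng.normal(rho * z1, sd)      # draw Z2 | Z1 = z1 (freshly updated value)
        draws[t] = (z1, z2)
    return draws

draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws[1000:].T))         # close to 0.8 after burn-in
```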
• Data augmentation: application of Gibbs sampling to the missing data problem. Write

      y = observed data,   z = missing data,   θ = model parameters.

  Predictive distribution:

      P(z | y) = ∫ P(z | y, θ) dP(θ | y)

  Posterior distribution:

      P(θ | y) = ∫ P(θ | y, z) dP(z | y)
• Algorithm: Iterative method of data augmentation.

      I-step: Draw z^(t+1) ∼ P( z | y, θ^(t) ).
      P-step: Draw θ^(t+1) ∼ P( θ | y, z^(t+1) ).
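In code, the iteration is a simple loop; the sketch below is a generic template in which draw_missing and draw_parameter stand for the problem-specific draws from P(z | y, θ) and P(θ | y, z). These function names are placeholders introduced here for illustration only.

```python
import numpy as np

def data_augmentation(y, theta_init, draw_missing, draw_parameter,
                      n_iter=1000, seed=0):
    """Run the data augmentation algorithm; return the chains of theta and z.

    draw_missing(y, theta, rng)  -> z ~ P(z | y, theta)      (I-step)
    draw_parameter(y, z, rng)    -> theta ~ P(theta | y, z)  (P-step)
    """
    rng = np.random.default_rng(seed)
    theta = theta_init
    theta_chain, z_chain = [], []
    for _ in range(n_iter):
        z = draw_missing(y, theta, rng)       # I-step
        theta = draw_parameter(y, z, rng)     # P-step
        theta_chain.append(theta)
        z_chain.append(z)
    return theta_chain, z_chain
```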
• Two uses of data augmentation
  – Parameter simulation: collect and summarize a sequence of dependent draws of θ,

        θ^(t+1), θ^(t+2), ..., θ^(t+N),

    where t is large enough to ensure stationarity.
  – Multiple imputation: collect independent draws of z,

        z^(t), z^(2t), ..., z^(mt).
• Parameter simulation (Bayesian approach)
  Let θ_1 be a component or function of θ of interest. Collect iterates of θ_1 from data augmentation,

      θ_1^(t+1), θ_1^(t+2), ..., θ_1^(t+N),

  where t is large enough to ensure stationarity and N is the Monte Carlo sample size.
  – θ̄_1 = N⁻¹ Σ_{k=1}^N θ_1^(t+k) estimates the posterior mean E(θ_1 | y).
  – N⁻¹ Σ_{k=1}^N ( θ_1^(t+k) − θ̄_1 )² estimates the posterior variance V(θ_1 | y).
  – The 2.5th and 97.5th percentiles of θ_1^(t+1), θ_1^(t+2), ..., θ_1^(t+N) estimate the endpoints of a 95% equal-tailed Bayesian interval for θ_1.
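These summaries are immediate to compute once the post-burn-in iterates are stored; the sketch below assumes they are held in a NumPy array named chain, an illustrative stand-in generated here from a normal distribution since the notes supply no data.

```python
import numpy as np

rng = np.random.default_rng(0)
chain = rng.normal(loc=2.0, scale=0.5, size=5000)    # stand-in for theta_1 iterates after burn-in

post_mean = chain.mean()                             # estimates E(theta_1 | y)
post_var = chain.var()                               # estimates V(theta_1 | y), divisor N
ci_low, ci_high = np.percentile(chain, [2.5, 97.5])  # 95% equal-tailed credible interval
print(post_mean, post_var, (ci_low, ci_high))
```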

4 Examples

4.1 Example 1 (Univariate Normal distribution)

• Let y_1, ..., y_n be IID observations from N(μ, σ²), where only the first r elements are observed and the remaining n − r elements are missing. Assume that the response mechanism is ignorable.

• Bayesian imputation: the j-th posterior values of (μ, σ²) are generated from

      σ*(j)² | y_r ∼ r σ̂_r² / χ²_{r−1}                          (7)

  and

      μ*(j) | (y_r, σ*(j)²) ∼ N( ȳ_r, r⁻¹ σ*(j)² ),              (8)

  where y_r = (y_1, ..., y_r), ȳ_r = r⁻¹ Σ_{i=1}^r y_i, and σ̂_r² = r⁻¹ Σ_{i=1}^r (y_i − ȳ_r)². Given the posterior sample (μ*(j), σ*(j)²), the imputed values are generated from

      y_i^{*(j)} | y_r, μ*(j), σ*(j)² ∼ N( μ*(j), σ*(j)² )        (9)

  independently for i = r + 1, ..., n.
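The draws in (7)-(9) translate directly into code. The following minimal sketch produces one imputed data set for this example; the function name and the simulated y_obs are illustrative.

```python
import numpy as np

def normal_bayes_impute(y_obs, n_mis, rng):
    """One Bayesian imputation for the univariate normal model, following (7)-(9)."""
    r = y_obs.size
    ybar_r = y_obs.mean()
    sigma2_hat = np.mean((y_obs - ybar_r) ** 2)              # sigma-hat_r^2 (divisor r)
    sigma2_star = r * sigma2_hat / rng.chisquare(r - 1)      # (7): posterior draw of sigma^2
    mu_star = rng.normal(ybar_r, np.sqrt(sigma2_star / r))   # (8): posterior draw of mu
    y_imp = rng.normal(mu_star, np.sqrt(sigma2_star), size=n_mis)  # (9): imputed values
    return mu_star, sigma2_star, y_imp

rng = np.random.default_rng(123)
y_obs = rng.normal(10.0, 2.0, size=60)                       # toy observed data
mu_star, sigma2_star, y_imp = normal_bayes_impute(y_obs, n_mis=40, rng=rng)
```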
• Let θ = E(Y) be the parameter of interest. The MI estimator of θ can be expressed as

      θ̂_MI = (1/M) Σ_{j=1}^M θ̂_I^(j),

  where

      θ̂_I^(j) = n⁻¹ { Σ_{i=1}^r y_i + Σ_{i=r+1}^n y_i^{*(j)} }.

  Then,

      θ̂_MI = ȳ_r + ((n − r)/(nM)) Σ_{j=1}^M ( μ*(j) − ȳ_r )
                  + (nM)⁻¹ Σ_{i=r+1}^n Σ_{j=1}^M ( y_i^{*(j)} − μ*(j) ).        (10)
  Asymptotically, the first term has mean μ and variance r⁻¹σ², the second term has mean zero and variance (1 − r/n)² σ²/(Mr), the third term has mean zero and variance σ²(n − r)/(n²M), and the three terms are mutually independent. Thus, the variance of θ̂_MI is

      V( θ̂_MI ) = (1/r) σ² + (1/M) ((n − r)/n)² { 1/r + 1/(n − r) } σ².        (11)
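As a quick numerical illustration of (11) (with arbitrary values, not from the notes): for σ² = 1, n = 100, r = 60, and M = 10, the formula gives 1/60 + (1/10)(40/100)²(1/60 + 1/40) ≈ 0.0173, against 1/60 ≈ 0.0167 in the limit M → ∞.

```python
def var_theta_MI(sigma2, n, r, M):
    """Evaluate formula (11) for the variance of the MI estimator."""
    return (sigma2 / r
            + (1.0 / M) * ((n - r) / n) ** 2 * (1.0 / r + 1.0 / (n - r)) * sigma2)

print(var_theta_MI(sigma2=1.0, n=100, r=60, M=10))     # ~ 0.0173
print(var_theta_MI(sigma2=1.0, n=100, r=60, M=10**6))  # ~ 0.0167 = 1/60
```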
• For variance estimation, note that

      V( y_i^{*(j)} ) = V( ȳ_r ) + V( μ*(j) − ȳ_r ) + V( y_i^{*(j)} − μ*(j) )
                      = (1/r) σ² + (1/r) ((r + 1)/(r − 1)) σ² + ((r + 1)/(r − 1)) σ²
                      ≅ σ².
  Writing

      V̂_I^(j)(θ̂) = n⁻¹ (n − 1)⁻¹ Σ_{i=1}^n { ỹ_i^{*(j)} − n⁻¹ Σ_{k=1}^n ỹ_k^{*(j)} }²
                  = n⁻¹ (n − 1)⁻¹ [ Σ_{i=1}^n ( ỹ_i^{*(j)} − μ )² − n ( n⁻¹ Σ_{k=1}^n ỹ_k^{*(j)} − μ )² ],

  where ỹ_i^{*(j)} = δ_i y_i + (1 − δ_i) y_i^{*(j)} and δ_i is the response indicator (δ_i = 1 if y_i is observed), we have

      E{ V̂_I^(j)(θ̂) } = n⁻¹ (n − 1)⁻¹ [ Σ_{i=1}^n E{ ( ỹ_i^{*(j)} − μ )² } − n V( n⁻¹ Σ_{k=1}^n ỹ_k^{*(j)} ) ]
                       ≅ n⁻¹ (n − 1)⁻¹ [ n σ² − n { (1/r) σ² + ((n − r)/n)² ( 1/r + 1/(n − r) ) σ² } ]
                       ≅ n⁻¹ σ²,

  which shows that E(W_M) ≅ V(θ̂_n). Also,
      E(B_M) = V( θ̂_I^(1) ) − Cov( θ̂_I^(1), θ̂_I^(2) )
             = V{ ((n − r)/n) ( μ*(1) − ȳ_r ) + n⁻¹ Σ_{i=r+1}^n ( y_i^{*(1)} − μ*(1) ) }
             ≅ ((n − r)/n)² { 1/r + 1/(n − r) } σ²
             = ( 1/r − 1/n ) σ².
  Thus, Rubin's variance estimator satisfies

      E{ V̂_MI(θ̂_MI) } ≅ (1/r) σ² + (1/M) ((n − r)/n)² { 1/r + 1/(n − r) } σ² ≅ V( θ̂_MI ),

  which is consistent with the general result in (6).
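This agreement can also be checked by a small Monte Carlo experiment: repeat the imputation of this example many times, form θ̂_MI and Rubin's V̂_MI in each replicate, and compare the average of V̂_MI with the empirical variance of θ̂_MI. The sketch below reuses the normal_bayes_impute and rubin_combine helpers defined in the earlier sketches; all parameter values are arbitrary.

```python
import numpy as np

def one_replicate(n, r, mu, sigma, M, rng):
    """Generate one incomplete sample, multiply impute it, and apply Rubin's rules."""
    y = rng.normal(mu, sigma, size=n)
    y_obs = y[:r]                                   # first r values observed
    ests, vars_ = [], []
    for _ in range(M):
        _, _, y_imp = normal_bayes_impute(y_obs, n - r, rng)
        y_full = np.concatenate([y_obs, y_imp])
        ests.append(y_full.mean())                  # completed-data point estimate
        vars_.append(y_full.var(ddof=1) / n)        # completed-data variance estimate
    return rubin_combine(ests, vars_)

rng = np.random.default_rng(7)
results = np.array([one_replicate(100, 60, 0.0, 1.0, 10, rng) for _ in range(2000)])
print(results[:, 0].var())     # empirical V(theta_MI)
print(results[:, 1].mean())    # mean of Rubin's variance estimates; close to the line above
```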
4.2 Example 2 (Censored regression model, or Tobit model)

• Model:

      z_i = x_i′β + ε_i,   ε_i ∼ N(0, σ²),
      y_i = z_i   if z_i ≥ c_i,
          = c_i   if z_i < c_i.
• Data augmentation

  1. I-step: Given θ^(t) = (β^(t), σ^(t)), generate the imputed value (for δ_i = 0, i.e., censored cases) from

         z_i^(t+1) = x_i′β^(t) + ε_i^(t),

     where ε_i^(t) is drawn from the N(0, σ^{(t)2}) distribution truncated above at c_i − x_i′β^(t), with density

         f(s) = φ( s/σ^(t) ) / { σ^(t) Φ[ (c_i − x_i′β^(t)) / σ^(t) ] },   s < c_i − x_i′β^(t).

     If δ_i = 1, then z_i^(t+1) = z_i.

  2. P-step: Given z^(t+1) = (z_1^(t+1), z_2^(t+1), ..., z_n^(t+1)), generate θ^(t+1) from

         θ^(t+1) ∼ P( θ | z^(t+1) ).

     That is, generate

         σ^{2(t+1)} | z^(t+1) ∼ (n − p) σ̂^{2(t+1)} / χ²_{n−p}

     and

         β^(t+1) | z^(t+1), σ^{2(t+1)} ∼ N( β̂_n^(t+1), ( Σ_{i=1}^n x_i x_i′ )⁻¹ σ^{2(t+1)} ),

     where

         β̂_n^(t+1) = ( Σ_{i=1}^n x_i x_i′ )⁻¹ Σ_{i=1}^n x_i z_i^(t+1),
         σ̂^{2(t+1)} = (n − p)⁻¹ z^(t+1)′ ( I − P_x ) z^(t+1),

     p is the dimension of β, and P_x = X(X′X)⁻¹X′ is the projection matrix onto the column space of the design matrix X.
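The two steps can be coded compactly; the sketch below performs one data augmentation cycle for this model, drawing the censored z_i from the truncated normal by the inverse-CDF method in the I-step and using the usual normal linear-model posterior in the P-step. The function name and the use of scipy.stats are implementation choices, not taken from the notes, and the inputs are assumed to be NumPy arrays.

```python
import numpy as np
from scipy import stats

def tobit_da_step(y, X, c, delta, beta, sigma, rng):
    """One I-step/P-step cycle for the Tobit model.

    delta[i] = 1 if y[i] is uncensored (so z_i = y_i), 0 if censored at c[i].
    """
    n, p = X.shape
    # I-step: for censored cases, draw z_i from N(x_i'beta, sigma^2) truncated above at c_i
    z = y.astype(float).copy()
    cens = (delta == 0)
    mean_cens = X[cens] @ beta
    u = rng.uniform(size=cens.sum())
    upper = stats.norm.cdf((c[cens] - mean_cens) / sigma)
    z[cens] = mean_cens + sigma * stats.norm.ppf(u * upper)   # inverse-CDF draw below c_i
    # P-step: draw sigma^2 and beta from the normal linear-model posterior
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ z
    resid = z - X @ beta_hat
    sigma2_new = resid @ resid / rng.chisquare(n - p)         # (n-p) sigma-hat^2 / chi^2_{n-p}
    beta_new = rng.multivariate_normal(beta_hat, XtX_inv * sigma2_new)
    return z, beta_new, np.sqrt(sigma2_new)
```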
4.3 Example 3 (Bayesian bootstrap)

• Nonparametric approach to Bayesian imputation
• First proposed by Rubin (1981).
• Assume that an element of the population takes one of the values d_1, ..., d_K with probabilities p_1, ..., p_K, respectively. That is, we assume

      P(Y = d_k) = p_k,   Σ_{k=1}^K p_k = 1.                    (12)
• Let y_1, ..., y_n be an IID sample from (12) and let n_k be the number of y_i equal to d_k. The parameter is the vector of probabilities p = (p_1, ..., p_K), with Σ_{k=1}^K p_k = 1. In this case, the population mean θ = E(Y) can be expressed as θ = Σ_{k=1}^K p_k d_k, so we only need to estimate p.
• If the improper Dirichlet prior with density proportional to Π_{k=1}^K p_k⁻¹ is placed on the vector p, then the posterior distribution of p is proportional to

      Π_{k=1}^K p_k^{n_k − 1},

  which is a Dirichlet distribution with parameter (n_1, ..., n_K). This posterior distribution can be simulated using n − 1 independent uniform random numbers. Let u_1, ..., u_{n−1} be IID U(0, 1), and let g_i = u_(i) − u_(i−1), i = 1, 2, ..., n, where u_(k) is the k-th order statistic of u_1, ..., u_{n−1} with u_(0) = 0 and u_(n) = 1. Partition g_1, ..., g_n into K collections, with the k-th one having n_k elements, and let p_k be the sum of the g_i in the k-th collection. Then the realized value of (p_1, ..., p_K) follows a (K − 1)-variate Dirichlet distribution with parameter (n_1, ..., n_K). In particular, if K = n, then (g_1, ..., g_n) is the vector of probabilities to attach to the data values y_1, ..., y_n in that Bayesian bootstrap replication.
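The gap construction is easy to verify in code; the sketch below draws one Bayesian bootstrap weight vector for the special case K = n (one weight per data value), which is the case used for imputation below. The function name is illustrative.

```python
import numpy as np

def bayesian_bootstrap_weights(n, rng):
    """Draw (g_1, ..., g_n) ~ Dirichlet(1, ..., 1) as gaps between uniform order statistics."""
    u = np.sort(rng.uniform(size=n - 1))
    u = np.concatenate(([0.0], u, [1.0]))   # u_(0) = 0, u_(n) = 1
    return np.diff(u)                       # g_i = u_(i) - u_(i-1); nonnegative, sums to 1

rng = np.random.default_rng(42)
g = bayesian_bootstrap_weights(10, rng)
print(g, g.sum())                           # the weights sum to 1
```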
• To apply Rubin's Bayesian bootstrap to multiple imputation, assume that the first r elements are observed and the remaining n − r elements are missing. The imputed values can be generated with the following steps.

  [Step 1] From y_r = (y_1, ..., y_r), generate p*_r = (p*_1, ..., p*_r) from the posterior distribution using the Bayesian bootstrap as follows.
    1. Generate u_1, ..., u_{r−1} independently from U(0, 1) and sort them to get 0 = u_(0) < u_(1) < ... < u_(r−1) < u_(r) = 1.
    2. Compute p*_i = u_(i) − u_(i−1), i = 1, 2, ..., r − 1, and p*_r = 1 − Σ_{i=1}^{r−1} p*_i.

  [Step 2] Select the imputed value of y_i by

      y_i* = y_1 with probability p*_1,
             ...
             y_r with probability p*_r,

  independently for each i = r + 1, ..., n.
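In code, these two steps amount to drawing the weight vector and then resampling the respondents with those weights; the sketch below reuses bayesian_bootstrap_weights from the previous block, and the function name bb_impute is illustrative.

```python
import numpy as np

def bb_impute(y_obs, n_mis, rng):
    """One Bayesian-bootstrap imputation of n_mis missing values from the r respondents."""
    r = y_obs.size
    p_star = bayesian_bootstrap_weights(r, rng)       # Step 1: posterior weights on y_1, ..., y_r
    return rng.choice(y_obs, size=n_mis, p=p_star)    # Step 2: draw imputed values with those weights

rng = np.random.default_rng(0)
y_obs = rng.normal(size=60)                           # toy respondent values
y_imp = bb_impute(y_obs, n_mis=40, rng=rng)
```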
• Rubin and Schenker (1986) proposed an approximation of this Bayesian bootstrap method, called the approximate Bayesian bootstrap (ABB) method, which provides an alternative way of generating imputed values from the empirical distribution. The ABB method can be described as follows.

  [Step 1] From y_r = (y_1, ..., y_r), generate a donor set y*_r = (y*_1, ..., y*_r) by bootstrapping. That is, select

      y_i* = y_1 with probability 1/r,
             ...
             y_r with probability 1/r,

  independently for each i = 1, ..., r.

  [Step 2] From the donor set y*_r = (y*_1, ..., y*_r), select an imputed value of y_i by

      y_i** = y*_1 with probability 1/r,
              ...
              y*_r with probability 1/r,

  independently for each i = r + 1, ..., n.
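A minimal sketch of the ABB method, with an illustrative function name:

```python
import numpy as np

def abb_impute(y_obs, n_mis, rng):
    """Approximate Bayesian bootstrap: bootstrap a donor set, then impute from it."""
    r = y_obs.size
    donors = rng.choice(y_obs, size=r, replace=True)      # Step 1: bootstrap the respondents
    return rng.choice(donors, size=n_mis, replace=True)   # Step 2: draw imputed values from donors

rng = np.random.default_rng(1)
y_imp = abb_impute(rng.normal(size=60), n_mis=40, rng=rng)
```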
References

Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9, 130-134.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.

Rubin, D. B. and Schenker, N. (1986). Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366-374.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528-540.

Wang, N. and Robins, J. M. (1998). Large-sample theory for parametric multiple imputation procedures. Biometrika, 85, 935-948.
Appendix
A. Proof of (5) and (6)
Let S_com(θ) = S(θ; x, y) be the score function of θ under complete response. The MLE under complete response, denoted by θ̂_n, is asymptotically equivalent to

    θ̂_n ≅ θ + I_com⁻¹ S_com(θ),

where I_com = E{ −∂S_com(θ)/∂θ′ }. Thus,

    θ̂_MI = (1/M) Σ_{k=1}^M θ̂_n^{*(k)}
         ≅ θ + I_com⁻¹ (1/M) Σ_{k=1}^M S(θ; x, y_obs, y_mis^{*(k)})
         = θ + I_com⁻¹ (1/M) Σ_{k=1}^M E{ S(θ; x, y_obs, Y_mis) | x, y_obs; θ^{*(k)} }
             + I_com⁻¹ (1/M) Σ_{k=1}^M [ S(θ; x, y_obs, y_mis^{*(k)}) − E{ S(θ; x, y_obs, Y_mis) | x, y_obs; θ^{*(k)} } ],

where y_mis^{*(k)} ∼ f(y_mis | x, y_obs; θ^{*(k)}) and θ^{*(k)} ∼ p(θ | x, y_obs). Under some regularity conditions, the posterior distribution converges to a normal distribution with mean θ̂_MLE and variance I_obs⁻¹ = V(θ̂_MLE). (This is often called the Bernstein–von Mises theorem.) Thus, we can apply Taylor linearization to S(θ; x, y_obs, y_mis^{*(k)}) with respect to θ^{*(k)} around the true θ to get

    E{ S(θ; x, y_obs, Y_mis) | x, y_obs; θ^{*(k)} } ≅ E{ S(θ; x, y_obs, Y_mis) | x, y_obs; θ } + I_mis ( θ^{*(k)} − θ )
                                                    = S_obs(θ) + I_mis ( θ̂_MLE − θ ) + I_mis ( θ^{*(k)} − θ̂_MLE ),      (13)

where I_mis is the information matrix associated with f(y_mis | x, y_obs; θ). Since θ̂_MLE is the solution to S_obs(θ) = 0, we have

    θ̂_MLE ≅ θ + I_obs⁻¹ S_obs(θ),

and (13) further simplifies to

    E{ S(θ; x, y_obs, Y_mis) | x, y_obs; θ^{*(k)} } ≅ S_obs(θ) + I_mis I_obs⁻¹ S_obs(θ) + I_mis ( θ^{*(k)} − θ̂_MLE )
                                                    = I_com I_obs⁻¹ S_obs(θ) + I_mis ( θ^{*(k)} − θ̂_MLE ).               (14)
Thus, combining all terms together, we have

    θ̂_n^{*(k)} = θ̂_MLE + I_com⁻¹ I_mis ( θ^{*(k)} − θ̂_MLE ) + I_com⁻¹ [ S^{*(k)} − E( S | x, y_obs; θ^{*(k)} ) ],        (15)

where S^{*(k)} = S(θ; x, y_obs, y_mis^{*(k)}). Therefore,

    θ̂_MI = θ + I_obs⁻¹ S_obs(θ) + I_com⁻¹ I_mis (1/M) Σ_{k=1}^M ( θ^{*(k)} − θ̂_MLE )
              + I_com⁻¹ (1/M) Σ_{k=1}^M [ S(θ; x, y_obs, y_mis^{*(k)}) − E{ S(θ; x, y_obs, Y_mis) | x, y_obs; θ^{*(k)} } ]
          = θ̂_MLE + I_com⁻¹ I_mis { M⁻¹ Σ_{k=1}^M ( θ^{*(k)} − θ̂_MLE ) }
              + I_com⁻¹ M⁻¹ Σ_{k=1}^M { S^{*(k)}(θ) − E( S | x, y_obs; θ^{*(k)} ) }.

Note that the second term reflects the variability due to generating θ^{*(k)}, and the third term reflects the variability due to generating y_mis^{*(k)} from f(y_mis | x, y_obs; θ^{*(k)}). The three terms are independent, and the last two terms are negligible for large M.
To prove (6), note first that E(V̂_n) = V(θ̂_n) = I_com⁻¹ and so we have E(W_M) ≅ I_com⁻¹. Now, inserting (15) into

    B_M = (1/(M − 1)) Σ_{k=1}^M ( θ̂_n^{*(k)} − θ̂_MI )^{⊗2},

we have

    E(B_M) ≅ I_com⁻¹ I_mis I_obs⁻¹ I_mis I_com⁻¹ + I_com⁻¹ I_mis I_com⁻¹
           = I_obs⁻¹ − I_com⁻¹,

where the last equality follows from

    (A + BCB′)⁻¹ = A⁻¹ − A⁻¹ B C B′ A⁻¹ + A⁻¹ B C B′ (A + BCB′)⁻¹ B C B′ A⁻¹

with A = I_com, B = I, and C = −I_mis. Therefore, E(V̂_MI) ≅ I_obs⁻¹ = V(θ̂_MLE) ≅ V(θ̂_MI).
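The final identity, E(B_M) ≅ I_com⁻¹ I_mis I_obs⁻¹ I_mis I_com⁻¹ + I_com⁻¹ I_mis I_com⁻¹ = I_obs⁻¹ − I_com⁻¹, is easy to confirm numerically for any positive definite I_obs and I_mis with I_com = I_obs + I_mis; the sketch below uses arbitrary 2 × 2 matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(2, 2)); I_obs = A @ A.T + np.eye(2)   # arbitrary positive definite matrix
B = rng.normal(size=(2, 2)); I_mis = B @ B.T + np.eye(2)
I_com = I_obs + I_mis                                       # observed + missing information

Ic, Io = np.linalg.inv(I_com), np.linalg.inv(I_obs)
lhs = Io - Ic
rhs = Ic @ I_mis @ Io @ I_mis @ Ic + Ic @ I_mis @ Ic
print(np.allclose(lhs, rhs))                                # True
```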