Likelihood-based inference with missing data
under missing-at-random
Jae-kwang Kim
Joint work with Shu Yang
Department of Statistics, Iowa State University
May 24, 2014
Outline
1. Introduction
2. Parametric fractional imputation (PFI)
3. PFI approximation of the observed likelihood
4. Main theory
5. Simulation study
6. Discussion
Introduction
• Let $y = (y_1, \dots, y_n)$ have a joint density $f(y; \theta)$.
• Instead of observing $y$, we observe $y_{\text{obs}}$, where $y = (y_{\text{obs}}, y_{\text{mis}})$.
• We are interested in making inference about $\theta$ in the presence of missing data.
• The maximum likelihood estimator $\hat\theta$ maximizes the observed log-likelihood
$$l_{\text{obs}}(\theta) = \log \int f(y; \theta)\, dy_{\text{mis}}.$$
Introduction
• Inference with missing data is usually based on Wald-type inference:
$$\hat\theta \sim N\left(\theta, \hat I_{\text{obs}}^{-1}\right),$$
where $\hat I_{\text{obs}} = -\partial^2 l_{\text{obs}}(\theta)/\partial\theta^2$ evaluated at $\theta = \hat\theta$.
• We are interested in inference based on Wilks' theorem:
$$-2\left\{ l_{\text{obs}}(\theta) - l_{\text{obs}}(\hat\theta) \right\} \sim \chi^2_p,$$
where $l_{\text{obs}}(\theta)$ is the observed log-likelihood and $p$ is the dimension of $\theta$.
Basic setup
• The observed likelihood involves integration:
$$l_{\text{obs}}(\theta) = \log \int f(y; \theta)\, dy_{\text{mis}} = \log f_{\text{obs}}(y_{\text{obs}}; \theta).$$
• A Monte Carlo approximation is valid in the neighborhood of $\theta = \hat\theta$:
$$f_{\text{obs}}(y_{\text{obs}}; \theta) = E\left\{ \frac{f(y; \theta)}{f(y_{\text{mis}} \mid y_{\text{obs}}; \hat\theta)} \,\Big|\, y_{\text{obs}} \right\} \cong \frac{1}{M} \sum_{j=1}^{M} \frac{f(y_{\text{obs}}, y_{\text{mis}}^{*(j)}; \theta)}{f(y_{\text{mis}}^{*(j)} \mid y_{\text{obs}}; \hat\theta)},$$
where $y_{\text{mis}}^{*(j)} \sim f(y_{\text{mis}} \mid y_{\text{obs}}; \hat\theta)$.
Basic setup
Wilks inference is based on the likelihood ratio test. If $l_{\text{obs}}(\theta)$ is known, the Wilks confidence interval can be constructed as
$$\left\{ \theta \in \Theta : -2\{ l_{\text{obs}}(\theta) - l_{\text{obs}}(\hat\theta) \} \le \chi^2_p(1-\alpha) \right\}.$$
Computing the observed log-likelihood is challenging because the integration over the missing values $y_{i,\text{mis}}$ is often intractable. Moreover, Monte Carlo methods that approximate the observed-data likelihood using samples from a distribution near the MLE fail to compute $l_{\text{obs}}(\theta)$ accurately when $\theta$ is far from the MLE.
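To make this concrete, here is a minimal Python sketch of the importance-sampling approximation above under an assumed toy model ($y_1 \sim N(0,1)$, $y_2 \mid y_1 \sim N(\beta_0 + \beta_1 y_1, 1)$, with $y_2$ missing for a single unit); the model and all names are illustrative, not the slides' example. The estimate is accurate near $\hat\theta$ and noisy far from it.

```python
import numpy as np
from scipy.stats import norm

# Toy model (illustrative assumption): y1 ~ N(0, 1) and
# y2 | y1 ~ N(beta0 + beta1*y1, 1), with y2 missing for one unit.
# The observed-data likelihood for that unit integrates y2 out; we
# approximate it by importance sampling from f(y2 | y1; theta_hat).
rng = np.random.default_rng(0)
y1_obs = 0.7                    # y1 observed, y2 missing for this unit
theta_hat = (2.0, 1.0)          # (beta0, beta1) at (or near) the MLE
M = 10_000

def log_f_joint(y2, beta0, beta1):
    # log f(y1, y2; theta) = log f(y1) + log f(y2 | y1; theta)
    return norm.logpdf(y1_obs) + norm.logpdf(y2, beta0 + beta1 * y1_obs)

# Draw y2*(j) once from the conditional density at theta_hat.
loc_hat = theta_hat[0] + theta_hat[1] * y1_obs
y2_star = rng.normal(loc_hat, 1.0, size=M)
log_h = norm.logpdf(y2_star, loc_hat)

def fobs_mc(beta0, beta1):
    # (1/M) * sum_j f(y_obs, y2*(j); theta) / f(y2*(j) | y_obs; theta_hat)
    return np.exp(log_f_joint(y2_star, beta0, beta1) - log_h).mean()

# In this toy model the exact value is f(y1), since integrating y2 out
# removes beta entirely; the Monte Carlo estimate still targets it.
print(fobs_mc(*theta_hat), norm.pdf(y1_obs))   # accurate near theta_hat
print(fobs_mc(5.0, -2.0), norm.pdf(y1_obs))    # noisy far from theta_hat
```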
Parametric fractional imputation (PFI)
Main idea of EM algorithm of PFI (Kim, 2011)
• PFI reduces the computation associated with Monte Carlo EM (MCEM).
• It uses the importance sampling idea to compute the mean score function, where the imputed values are generated only once, by importance sampling, at the beginning of the EM iteration.
• PFI does not change the imputed values between EM iterations; it only updates the fractional weights.
• This greatly reduces the computational burden and ensures the convergence of the EM sequence.
Parametric fractional imputation (PFI)
• (I-step) For each missing $y_{i,\text{mis}}$, generate $y_{i,\text{mis}}^{*(j)} \sim h(y_{i,\text{mis}})$, for $j = 1, \dots, m$.
• (W-step) Given the current parameter value $\theta^{(t)}$, compute the fractional weights
$$w_{ij}^*(\theta^{(t)}) = \frac{ f(y_{ij}^*; \theta^{(t)}) / h(y_{i,\text{mis}}^{*(j)}) }{ \sum_{k=1}^{m} \{ f(y_{ik}^*; \theta^{(t)}) / h(y_{i,\text{mis}}^{*(k)}) \} },$$
where $y_{ij}^* = (y_{i,\text{obs}}, y_{i,\text{mis}}^{*(j)})$, and form the mean score equation
$$\bar S^*(\theta) \equiv \sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij}^*(\theta^{(t)}) S(\theta; y_{ij}^*) = 0, \qquad (1)$$
where $S(\theta; y)$ is the score function of $\theta$.
• (M-step) Update $\hat\theta^{(t+1)}$ by solving (1).
• Repeat the W-step and M-step until convergence. A toy sketch of this iteration is given below.
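Below is a minimal sketch of the I/W/M-steps for an assumed toy regression model ($y_2 \mid y_1 \sim N(\beta_0 + \beta_1 y_1, \sigma^2)$ with $y_2$ missing at random); the model and all names are illustrative assumptions, not the slides' example. Note that the imputed values are drawn once and only the fractional weights change across iterations.

```python
import numpy as np
from scipy.stats import norm

# Toy model (illustrative assumption): y2_i | y1_i ~ N(b0 + b1*y1_i, s2),
# with y2 missing at random; h is the conditional density at an initial
# estimate theta^(0).
rng = np.random.default_rng(1)
n, m = 200, 50
y1 = rng.normal(1, 1, n)
y2 = 2 + y1 + rng.normal(size=n)
delta = rng.random(n) < 0.6                  # True means y2 is observed
y2[~delta] = np.nan
mis = ~delta

# Initial estimate from complete cases
b0, b1 = np.polynomial.polynomial.polyfit(y1[delta], y2[delta], 1)
s2 = np.mean((y2[delta] - b0 - b1 * y1[delta]) ** 2)

# I-step: impute once from h = f(y2 | y1; theta^(0)); never redrawn.
y2_imp = rng.normal(b0 + b1 * y1[mis, None], np.sqrt(s2), (mis.sum(), m))
log_h = norm.logpdf(y2_imp, b0 + b1 * y1[mis, None], np.sqrt(s2))

X = np.column_stack([np.ones(n), y1])
for t in range(100):
    # W-step: w*_ij proportional to f(y*_ij; theta^(t)) / h(y*_ij),
    # normalized over j.  (The y1 marginal has no theta, so it cancels.)
    log_f = norm.logpdf(y2_imp, b0 + b1 * y1[mis, None], np.sqrt(s2))
    w = np.exp(log_f - log_h)
    w /= w.sum(axis=1, keepdims=True)

    # M-step: solve the weighted mean score equation (1), which here is
    # weighted least squares over observed rows (weight 1) and imputed rows.
    Xw = np.vstack([X[delta], np.repeat(X[mis], m, axis=0)])
    yw = np.concatenate([y2[delta], y2_imp.ravel()])
    ww = np.concatenate([np.ones(delta.sum()), w.ravel()])
    sw = np.sqrt(ww)[:, None]
    beta, *_ = np.linalg.lstsq(Xw * sw, yw * sw[:, 0], rcond=None)
    b0, b1 = beta
    s2 = np.sum(ww * (yw - Xw @ beta) ** 2) / n
print(b0, b1, s2)   # approximates the observed-data MLE for large m
```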
Remarks on PFI method
• (I-step) + (W-step) = E-step.
• The key part is computing the fractional weights
$$w_{ij}^*(\theta) = \frac{ f(y_{ij}^*; \theta) / h(y_{i,\text{mis}}^{*(j)}) }{ \sum_{k=1}^{m} \{ f(y_{ik}^*; \theta) / h(y_{i,\text{mis}}^{*(k)}) \} }.$$
Here, $h$ is often called the proposal distribution and $f$ the target distribution.
Parametric fractional imputation (PFI)
Wald inference can be constructed based on
$$\hat\theta^* \sim N\left(\theta, \hat I_{\text{obs}}^{*\,-1}\right),$$
where $\hat I_{\text{obs}}^*$ is the approximate observed information matrix.
Issue with Wald inference
Wald-type confidence intervals often have poor coverage when the sampling distribution of the MLE is skewed. In such cases, Wilks-type inference is preferred.
PFI approximation of likelihood
Using the PFI data, we can express
$$f_{\text{obs},i}(y_{i,\text{obs}}; \theta) \cong \frac{ \sum_{j=1}^{m} \{ f(y_{ij}^*; \theta)/h(y_{i,\text{mis}}^{*(j)}) \} }{ \sum_{j=1}^{m} \{ 1/h(y_{i,\text{mis}}^{*(j)}) \} } = \frac{1}{ \sum_{j=1}^{m} \{ w_{ij}^*(\theta)/f(y_{ij}^*; \theta) \} }.$$
The observed log-likelihood function $l_{\text{obs}}(\theta)$ can then be approximated by
$$l_{\text{obs}}^*(\theta) = -\sum_{i=1}^{n} \log\left\{ \sum_{j=1}^{m} w_{ij}^*(\theta)/f(y_{ij}^*; \theta) \right\}.$$
Given the imputed values $y_{ij}^*$, we only need $f(y_{ij}^*; \theta)$ and $h(y_{i,\text{mis}}^{*(j)})$.
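Continuing the toy sketch above, $l_{\text{obs}}^*(\theta)$ can be computed from the fixed imputed values and their log-$h$ values alone. This is a hypothetical implementation under the same assumed model; the ratio identity above is evaluated in log form for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# Same toy model and objects (y2_imp, log_h) as the sketch above.
def l_obs_star(b0, b1, s2, y1, y2, delta, y2_imp, log_h):
    mis = ~delta
    # Fully observed units contribute log f(y2_i | y1_i; theta) directly.
    ll = norm.logpdf(y2[delta], b0 + b1 * y1[delta], np.sqrt(s2)).sum()
    # Units with y2 missing:
    #   log f_obs,i = log{sum_j f_ij/h_j} - log{sum_j 1/h_j},
    # which equals -log{sum_j w*_ij(theta)/f(y*_ij; theta)} on the slide.
    log_f = norm.logpdf(y2_imp, b0 + b1 * y1[mis, None], np.sqrt(s2))
    ll += (logsumexp(log_f - log_h, axis=1)
           - logsumexp(-log_h, axis=1)).sum()
    return ll

# e.g. the Wilks statistic of Theorem 1 (below) for H0: theta = theta0:
# W1 = -2 * (l_obs_star(*theta0, ...) - l_obs_star(*theta_hat, ...))
```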
Main Theory 1
Theorem
The imputed observed likelihood ratio statistic for testing $H_0: \theta = \theta_0$ is
$$W_1 = -2\left\{ l_{\text{obs}}^*(\theta_0) - l_{\text{obs}}^*(\hat\theta) \right\}.$$
Under the regularity conditions specified, and under the null hypothesis $H_0$,
$$W_1 \to \chi^2(p)$$
as $m \to \infty$ and $n \to \infty$.
Main Theory 2
Theorem
Let $\theta = (\theta_1, \theta_2)$, where $\theta_1$ and $\theta_2$ are $q \times 1$ and $(p-q) \times 1$ vectors, respectively. Under the same regularity conditions as Theorem 1, and under $H_0: \theta_1 = \theta_{1,0}$,
$$W_2 = -2\left\{ l_{\text{obs}}^*(\hat\theta^{(0)}) - l_{\text{obs}}^*(\hat\theta) \right\} \to \chi^2(q)$$
as $m \to \infty$ and $n \to \infty$, where $\hat\theta^{(0)} = \arg\max_{H_0} l_{\text{obs}}^*(\theta)$.
Remarks on Theorem 2
• There are two models involved in Theorem 2. One is the full model $f$ and the other is the reduced model $f_0$, the model under the null hypothesis.
• Recall that
$$l_{\text{obs}}^*(\theta) = -\sum_{i=1}^{n} \log\left\{ \sum_{j=1}^{m} w_{ij}^*(\theta)/f(y_{ij}^*; \theta) \right\},$$
with
$$w_{ij}^*(\theta) = \frac{ f(y_{ij}^*; \theta)/h(y_{i,\text{mis}}^{*(j)}) }{ \sum_{k=1}^{m} \{ f(y_{ik}^*; \theta)/h(y_{i,\text{mis}}^{*(k)}) \} }.$$
• Thus, $h$ remains the same, but $f$ needs to be changed to $f_0$ under the reduced model; a short sketch of this substitution follows.
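In the toy model of the earlier sketches, take $H_0: \beta_1 = 0$ (an assumed null, for illustration only), so $f_0$ is $y_2 \mid y_1 \sim N(\beta_0, \sigma^2)$. Only the density in the weights changes; the imputed values and $h$ are reused:

```python
import numpy as np
from scipy.stats import norm

# Same imputed values y2_imp and log_h as before; under H0: b1 = 0 the
# reduced model f0 drops the slope, so only f changes to f0.
def weights_reduced(b0, s2, y2_imp, log_h):
    log_f0 = norm.logpdf(y2_imp, b0, np.sqrt(s2))   # f0: slope removed
    w0 = np.exp(log_f0 - log_h)
    return w0 / w0.sum(axis=1, keepdims=True)
```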
An example of likelihood ratio test
Consider the following bivariate normal distribution:
$$\begin{pmatrix} y_{1i} \\ y_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \right)$$
for $i = 1, \dots, n$. We are interested in testing $H_0: \mu_1 = \mu_2$.

Table: Data structure (X = observed)

Cell   $y_1$   $y_2$   Imputation                                               Proposal $h(\cdot)$
H      X       X       –                                                        –
K      X               $\{(y_{2i}^{*(j)}, w_{ij}^*)\}_{j=1}^m$                  $h_K(y_2 \mid y_1)$
L              X       $\{(y_{1i}^{*(j)}, w_{ij}^*)\}_{j=1}^m$                  $h_L(y_1 \mid y_2)$
M                      $\{(y_{1i}^{*(j)}, y_{2i}^{*(j)}, w_{ij}^*)\}_{j=1}^m$   $h_M(y_1, y_2)$
An example of likelihood ratio test
• Under the full model, for $i \in K$, the fractional weights are given by
$$w_{ij}^*(\theta) = \frac{ f(y_i^{*(j)}; \theta)/h_K(y_{2i}^{*(j)} \mid y_{1i}) }{ \sum_{k=1}^{m} \{ f(y_i^{*(k)}; \theta)/h_K(y_{2i}^{*(k)} \mid y_{1i}) \} },$$
where $y_i^{*(j)} = (y_{1i}, y_{2i}^{*(j)})$; the weights for $L$ and $M$ are similar.
• The MLE under the full model is computed by solving
$$\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij}^*(\theta) S(\theta; y_i^{*(j)}) = 0.$$
We may use the EM algorithm to obtain the solution.
• The maximum of the observed likelihood under the full model is
$$l_{\text{obs}}^*(\hat\theta) = -\sum_{i=1}^{n} \log\left\{ \sum_{j=1}^{m} w_{ij}^*(\hat\theta)/f(y_i^{*(j)}; \hat\theta) \right\}.$$
An example of likelihood ratio test
• Let $f_0$ be the density for the reduced model under $H_0: \mu_1 = \mu_2 = \mu$.
• For $i \in K$, the fractional weights are given by
$$w_{0,ij}^*(\theta) = \frac{ f_0(y_i^{*(j)}; \theta)/h_K(y_{2i}^{*(j)} \mid y_{1i}) }{ \sum_{k=1}^{m} \{ f_0(y_i^{*(k)}; \theta)/h_K(y_{2i}^{*(k)} \mid y_{1i}) \} };$$
the weights for $L$ and $M$ are similar.
• Note that we are using the same imputed values for this computation.
• The MLE of $\theta$ under the reduced model, denoted by $\hat\theta_0$, is obtained by solving
$$\sum_{i=1}^{n} \sum_{j=1}^{m} w_{0,ij}^*(\theta) S_0(\theta; y_i^{*(j)}) = 0,$$
where $S_0(\theta; y_i)$ is the score function derived from $f_0$.
An example of likelihood ratio test
• The maximum of the observed likelihood under the null model is given by
$$l_{0,\text{obs}}^*(\hat\theta_0) = -\sum_{i=1}^{n} \log\left\{ \sum_{j=1}^{m} w_{0,ij}^*(\hat\theta_0)/f_0(y_i^{*(j)}; \hat\theta_0) \right\}.$$
• The test statistic for testing $H_0: \mu_1 = \mu_2$ is computed from the PFI data as
$$W_2 = -2\left\{ l_{0,\text{obs}}^*(\hat\theta_0) - l_{\text{obs}}^*(\hat\theta) \right\}.$$
If $W_2 > \chi^2_{1, 1-\alpha}$, then we reject the null hypothesis. A sketch of the whole computation follows.
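Here is a Python sketch of the $W_2$ computation, under simplifying assumptions not made on the slides: only pattern K occurs (so $y_2$ is the only missing variable) and $\sigma_1 = \sigma_2 = 1$, $\rho = 0.5$ are treated as known, so $\theta = (\mu_1, \mu_2)$. The proposal $h_K$ is the conditional density at a preliminary estimate, and both maximizations reuse the same imputed values.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, norm

# Simplified version of the slides' example: pattern K only, known
# sigma1 = sigma2 = 1 and rho = 0.5, so theta = (mu1, mu2).
rng = np.random.default_rng(2)
n, m, rho = 200, 100, 0.5
cov = np.array([[1, rho], [rho, 1]])
y = rng.multivariate_normal([0.0, 0.0], cov, n)
y1, delta = y[:, 0], rng.random(n) < 0.6        # delta: y2 observed
y2 = np.where(delta, y[:, 1], np.nan)
csd = np.sqrt(1 - rho**2)

# I-step: draw once from hK(y2 | y1) at a preliminary estimate.
mu0 = np.array([y1.mean(), y2[delta].mean()])
cmean = mu0[1] + rho * (y1[~delta, None] - mu0[0])
y2_imp = rng.normal(cmean, csd, ((~delta).sum(), m))
log_h = norm.logpdf(y2_imp, cmean, csd)

def l_obs_star(mu, reduced=False):
    # Under the reduced model H0: mu1 = mu2, only f changes; h is reused.
    mu1, mu2 = (mu[0], mu[0]) if reduced else (mu[0], mu[1])
    pairs = np.column_stack([y1[delta], y2[delta]])
    ll = multivariate_normal([mu1, mu2], cov).logpdf(pairs).sum()
    # Pattern K: f(y1, y2*; mu) = f(y1; mu1) * f(y2* | y1; mu)
    log_f = (norm.logpdf(y1[~delta, None], mu1, 1.0)
             + norm.logpdf(y2_imp, mu2 + rho * (y1[~delta, None] - mu1), csd))
    ll += (logsumexp(log_f - log_h, axis=1)
           - logsumexp(-log_h, axis=1)).sum()
    return ll

full = minimize(lambda mu: -l_obs_star(mu), mu0)
red = minimize(lambda mu: -l_obs_star(mu, reduced=True), [mu0.mean()])
W2 = 2.0 * (red.fun - full.fun)   # = -2 * {l*_0,obs(th0_hat) - l*_obs(th_hat)}
print(W2, "reject H0" if W2 > 3.84 else "fail to reject")  # chi2_{1,0.95}
```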
Simulation One
Profile likelihood confidence interval
• $y_i = 2 + x_i + e_i$, $x_i \sim N(1, 1)$, $e_i \sim N(0, 1)$, where $x_i$ is fully observed and $y_i$ is subject to missingness.
• $\delta_i \overset{iid}{\sim}$ Bernoulli(0.6). Variable $y_i$ is observed if $\delta_i = 1$ and missing if $\delta_i = 0$.
• Monte Carlo samples were independently generated $B = 2{,}000$ times.
• We construct 95% confidence intervals for $\beta_1$ and $\sigma^2$ using two methods:
  • the Wald method, using asymptotic normality;
  • the Wilks method, using the result of Theorem 2 (see the sketch after this list).
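A sketch of how the Wilks interval can be obtained by profiling $l_{\text{obs}}^*$ over the nuisance parameters. It assumes a log-likelihood function shaped like the earlier sketches, with the data arguments already bound (e.g. via functools.partial), and that the interval endpoints lie within $\pm 2$ of $\hat\beta_1$; both are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq, minimize
from scipy.stats import chi2

# Assumes loglik(b0, b1, s2) is an l*_obs like the earlier sketches with
# the data arguments bound, and theta_hat = (b0_hat, b1_hat, s2_hat).
def profile(b1, loglik, theta_hat):
    # maximize over the nuisance parameters (b0, log s2) with b1 fixed
    res = minimize(lambda t: -loglik(t[0], b1, np.exp(t[1])),
                   [theta_hat[0], np.log(theta_hat[2])])
    return -res.fun

def wilks_ci(loglik, theta_hat, width=2.0, alpha=0.05):
    lmax = loglik(*theta_hat)
    cut = chi2.ppf(1 - alpha, 1)                  # Theorem 2 with q = 1
    g = lambda b1: -2 * (profile(b1, loglik, theta_hat) - lmax) - cut
    b1_hat = theta_hat[1]
    # brentq assumes g changes sign within +/- width of b1_hat
    return brentq(g, b1_hat - width, b1_hat), brentq(g, b1_hat, b1_hat + width)
```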
Simulation One
Table: Monte Carlo length and coverage of the Wald and Wilks confidence intervals for $\beta_1$.

              Wald C.I.            Wilks C.I.
sample size   length   coverage   length   coverage
n = 20        0.895    0.895      0.900    0.910
n = 50        0.719    0.928      0.740    0.934
n = 100       0.504    0.932      0.511    0.936
Simulation One
Table: Monte Carlo length and coverage of the Wald and Wilks confidence intervals for $\sigma^2$.

              Wald C.I.            Wilks C.I.
sample size   length   coverage   length   coverage
n = 20        1.345    0.743      1.644    0.883
n = 50        0.952    0.866      1.041    0.928
n = 100       0.499    0.940      0.503    0.943

When n = 20, about 8% of the Monte Carlo samples produced Wald confidence intervals containing negative values for $\sigma^2$.
Simulation One
Sampling distribution of β̂
[Figure: density plots of the sampling distribution of β̂ for n = 20, 50, and 100.]
Simulation One
Sampling distribution of σ̂²
[Figure: density plots of the sampling distribution of σ̂² for n = 20, 50, and 100.]
Simulation Two
Likelihood ratio test
• Samples of size n = 100 and n = 200 are generated from $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i$, where
$$x_i = (x_{1i}, x_{2i}) \sim N\left( \begin{pmatrix} 0 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 & 0.1 \\ 0.1 & 2 \end{pmatrix} \right), \quad e_i \sim N(0, 1).$$
• $\delta_i \sim$ Bernoulli(0.6).
• $(\beta_0, \beta_1) = (-2, 1)$; $\beta_2$ varies over 0, 0.1, 0.2, and 0.3.
• We are interested in testing the null hypothesis $H_0: \beta_2 = 0$ using
  • the likelihood ratio test (LRT) of fractional imputation (FI);
  • the LRT of multiple imputation (MI).¹

¹ Meng, X. L. and Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79, 103–111.
Simulation Two
Table: Monte Carlo power of the likelihood ratio test (LRT) of multiple imputation (MI) and fractional imputation (FI) for continuous data with sample size n = 100.

                  α = 0.05             α = 0.1
Parameter value   LRT.MI   LRT.FI     LRT.MI   LRT.FI
β2 = 0            0.038    0.060      0.088    0.113
β2 = 0.1          0.151    0.202      0.261    0.314
β2 = 0.2          0.474    0.601      0.634    0.713
β2 = 0.3          0.812    0.888      0.894    0.935
Simulation Two
Table: Monte Carlo power of the likelihood ratio test (LRT) of multiple imputation (MI) and fractional imputation (FI) for continuous data with sample size n = 200.

                  α = 0.05             α = 0.1
Parameter value   LRT.MI   LRT.FI     LRT.MI   LRT.FI
β2 = 0            0.032    0.049      0.083    0.098
β2 = 0.1          0.267    0.335      0.406    0.449
β2 = 0.2          0.795    0.854      0.888    0.922
β2 = 0.3          0.988    0.995      0.996    0.997
Concluding remarks
Parametric fractional imputation provides a completed data set with fractional weights, which makes it possible to compute the observed log-likelihood function from $h(y_{i,\text{mis}}^{*(j)})$ and $f(y_{ij}^*; \theta)$.
The LRT from PFI is more powerful than the Wald test based on the central limit theorem, and also more powerful than the LRT from MI proposed by Meng and Rubin (1992).
Extensions are a topic of future research: model selection criteria such as AIC or BIC can be developed, as considered by Ibrahim et al. (2008) and Garcia et al. (2010).
The end