3. Fractional Imputation (Part 2)

1 Review
• y observed, z latent.
• (y, z): joint density f(y, z; θ) = f1(y | z; θ1)f2(z; θ2).
• The MLE of θ is computed by solving
$$E\{S(\theta; y, Z) \mid y; \theta\} = 0. \tag{1}$$
• Using the PFI method, the mean score equation (1) can be written as
$$\bar S_I(\theta) \equiv \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}^{*}(\theta)\, S(\theta; y_i, z_i^{*(j)}) = 0, \tag{2}$$
where
$$w_{ij}^{*}(\theta) = \frac{f(y_i, z_i^{*(j)}; \theta)/h(z_i^{*(j)})}{\sum_{k} f(y_i, z_i^{*(k)}; \theta)/h(z_i^{*(k)})} = \frac{f_1(y_i \mid z_i^{*(j)}; \theta_1)\, f_2(z_i^{*(j)}; \theta_2)/h(z_i^{*(j)})}{\sum_{k} f_1(y_i \mid z_i^{*(k)}; \theta_1)\, f_2(z_i^{*(k)}; \theta_2)/h(z_i^{*(k)})}$$
and $z_i^{*(1)}, \dots, z_i^{*(m)}$ are generated from h(z).
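The weight computation in (2) is mechanical once the densities are available. Below is a minimal Python/NumPy sketch (not from the original notes); `f_joint`, `h`, and `score` are hypothetical user-supplied functions for f(y, z; θ), the proposal density, and the complete-data score.

```python
import numpy as np

def fractional_weights(f_joint, h, y_i, z_draws, theta):
    """Normalized fractional weights w*_ij(theta) for one unit i, as in (2).

    f_joint(y, z, theta): joint density f(y, z; theta)
    h(z): proposal density used to generate the imputed values
    z_draws: the m imputed values z_i^{*(1)}, ..., z_i^{*(m)} drawn from h
    """
    # Importance ratios f(y_i, z_i^{*(j)}; theta) / h(z_i^{*(j)})
    ratios = np.array([f_joint(y_i, z, theta) / h(z) for z in z_draws])
    return ratios / ratios.sum()  # weights sum to one within unit i

def unit_mean_score(score, f_joint, h, y_i, z_draws, theta):
    """PFI approximation to E{S(theta; y_i, Z) | y_i; theta} for one unit i."""
    w = fractional_weights(f_joint, h, y_i, z_draws, theta)
    return sum(w_j * np.asarray(score(theta, y_i, z)) for w_j, z in zip(w, z_draws))
```

Summing `unit_mean_score` over i and solving the resulting equation in θ (for example with a root finder) gives the PFI estimate.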
• How to choose h(·)?
– Method 1: For an initial guess θ0, use
$$h(z_i^{*(j)}) = f(y_i, z_i^{*(j)}; \theta_0).$$
– Method 2: Use a prior π(θ) to get
$$h(z_i^{*(j)}) = \int f(y_i, z_i^{*(j)}; \theta)\, \pi(\theta)\, d\theta.$$
We use the following steps (a sketch follows this list):
1. Generate θ*(1), · · · , θ*(m) from π(θ).
2. For each θ*(j), generate one z*(j) from f2(z; θ*(j)).
3. Use
$$h(z_i^{*(j)}) = \frac{1}{m}\sum_{k=1}^{m} f(y_i, z_i^{*(j)}; \theta^{*(k)}).$$
In Method 2, you may use the m quantiles of π(·) in Step 1.
– Method 3: In some cases, especially when f2(z; θ2) is a normal distribution, we may use the m quantiles of the standard normal distribution and choose h(z*(j)) = 1/m.
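As referenced in the Method 2 steps above, here is a small Python sketch of that construction; `sample_prior`, `sample_f2`, and `f_joint` are hypothetical stand-ins for π(θ), f2(z; θ), and f(y, z; θ).

```python
import numpy as np

def method2_proposal(y_i, m, sample_prior, sample_f2, f_joint, rng=None):
    """Method 2: generate imputed values z_i^{*(j)} and evaluate h(z_i^{*(j)})."""
    rng = rng if rng is not None else np.random.default_rng()
    thetas = [sample_prior(rng) for _ in range(m)]    # Step 1: theta^{*(j)} ~ pi(theta)
    z_draws = [sample_f2(th, rng) for th in thetas]   # Step 2: z^{*(j)} ~ f2(z; theta^{*(j)})
    # Step 3: h(z^{*(j)}) = m^{-1} sum_k f(y_i, z^{*(j)}; theta^{*(k)})
    h_vals = [np.mean([f_joint(y_i, z, th) for th in thetas]) for z in z_draws]
    return np.array(z_draws), np.array(h_vals)
```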
Example (of Method 3): GLMM
• Model:
$$y_{ij} \sim f_1(y_{ij} \mid x_{ij}, a_i; \theta_1), \qquad a_i \sim N(0, \sigma_a^2).$$
We are interested in computing the Monte Carlo approximation associated with the conditional distribution of ai given (xi, yi). For example, the mean score function for θ1 is
$$E\{S(\theta_1; x_i, y_i, a_i) \mid x_i, y_i; \hat\theta\} = \frac{\int S(\theta_1; x_i, y_i, a_i)\, f_1(y_i \mid x_i, a_i; \hat\theta_1)\, \phi(a_i/\hat\sigma_a)\, d(a_i/\hat\sigma_a)}{\int f_1(y_i \mid x_i, a_i; \hat\theta_1)\, \phi(a_i/\hat\sigma_a)\, d(a_i/\hat\sigma_a)}$$
where φ(·) is the pdf of the standard normal distribution. Thus, we can approximate the above term by
$$E\{S(\theta_1; x_i, y_i, a_i) \mid x_i, y_i; \hat\theta\} \cong \frac{\sum_{j=1}^{m} S(\theta_1; x_i, y_i, a_i^{*(j)})\, f_1(y_i \mid x_i, a_i^{*(j)}; \hat\theta_1)}{\sum_{j=1}^{m} f_1(y_i \mid x_i, a_i^{*(j)}; \hat\theta_1)}$$
where $a_i^{*(j)} = \hat\sigma_a \zeta_j$ and $\zeta_1, \dots, \zeta_m$ are the m quantiles from the standard normal distribution, that is,
$$\Pr(Z < \zeta_k) = k/(m+1), \qquad k = 1, \dots, m.$$
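A sketch of this Method 3 computation is given below, using SciPy's normal quantile function. The function names and the idea of passing the conditional density f1 as a callable are illustrative assumptions; g(·) can be the score for θ1 or any other function of ai whose conditional mean is needed.

```python
import numpy as np
from scipy.stats import norm

def glmm_conditional_mean(g, f1_cond, y_i, x_i, theta1_hat, sigma_a_hat, m=20):
    """Method 3 approximation of E{g(a_i) | x_i, y_i; theta-hat} in the GLMM.

    f1_cond(y_i, x_i, a, theta1): conditional density f1(y_i | x_i, a; theta1)
    g(a): function of the random effect (e.g. a itself, or the score for theta1)
    """
    # a_i^{*(j)} = sigma_a_hat * zeta_j with Pr(Z < zeta_j) = j/(m+1)
    zeta = norm.ppf(np.arange(1, m + 1) / (m + 1))
    a_draws = sigma_a_hat * zeta
    # With h(a^{*(j)}) = 1/m, the weights are proportional to f1(y_i | x_i, a^{*(j)}; theta1)
    w = np.array([f1_cond(y_i, x_i, a, theta1_hat) for a in a_draws])
    w = w / w.sum()
    return sum(w_j * np.asarray(g(a)) for w_j, a in zip(w, a_draws))
```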
2 Variance estimation
• When θ̂ is the solution to S(θ) = 0, the asymptotic variance of θ̂ is given by
$$V(\hat\theta) = \{I(\theta)\}^{-1} \tag{3}$$
where $I(\theta) = E\{-\partial S(\theta)/\partial\theta'\}$.
• To verify this result, we can apply a Taylor linearization to S(θ̂) = 0:
$$0 = S(\hat\theta) = S(\theta) + \frac{\partial}{\partial\theta'} S(\theta)\,(\hat\theta - \theta) \cong S(\theta) - \{I(\theta)\}(\hat\theta - \theta).$$
Thus,
$$\hat\theta - \theta \cong \{I(\theta)\}^{-1} S(\theta)$$
and
$$V(\hat\theta) \cong \{I(\theta)\}^{-1}\, V\{S(\theta)\}\, \{I(\theta)\}^{-1} = \{I(\theta)\}^{-1},$$
where the last equality follows from
$$E\{S(\theta)\} = 0 \tag{4}$$
and
$$E\{-\partial S(\theta)/\partial\theta'\} = E\{S(\theta)S(\theta)'\}. \tag{5}$$
Equality (5) is called the Bartlett identity.
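The Bartlett identity (5) is easy to check numerically. The following small Monte Carlo sketch (my own illustration, not from the notes) uses y1, ..., yn ~ N(θ, 1), for which S(θ) = Σi (yi − θ) and −∂S(θ)/∂θ = n.

```python
import numpy as np

# Monte Carlo check of (4) and (5) for y_1,...,y_n ~ N(theta, 1):
# S(theta) = sum_i (y_i - theta), so -dS/dtheta = n and E{S(theta)^2} should also equal n.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 20000
scores = np.array([np.sum(rng.normal(theta, 1.0, n) - theta) for _ in range(reps)])
print("E{S(theta)}   ~", scores.mean())        # close to 0, as in (4)
print("E{S(theta)^2} ~", np.mean(scores**2))   # close to n = 50, as in (5)
print("E{-dS/dtheta} =", n)
```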
• Under missing data, define $S_y(\theta) = E\{S_{(y,z)}(\theta) \mid y; \theta\}$, where $S_{(y,z)}(\theta) = S(\theta; y, z)$. Also, define $I_{(y,z)}(\theta) = E\{-\partial S_{(y,z)}(\theta)/\partial\theta'\}$. The MLE of θ, which is the solution to (1), can be written as the solution to Sy(θ) = 0. To emphasize that the MLE is the solution to Sy(θ) = 0, denote the MLE by θ̂y. Not surprisingly, we can establish that, similarly to (3),
$$V(\hat\theta_y) \cong \{I_y(\theta)\}^{-1} = \{V(S_y)\}^{-1} \tag{6}$$
where $I_y(\theta) = E\{-\partial S_y(\theta)/\partial\theta'\}$. [HW: Prove (6). You first need to prove (4) and (5) for Sy(θ).]
• If we define
$$S_{z \mid y}(\theta) = S_{(y,z)}(\theta) - S_y(\theta)$$
and
$$I_{z \mid y}(\theta) = E\left\{-\frac{\partial}{\partial\theta'} S_{z \mid y}(\theta)\right\},$$
we can establish the following results:
1. $E\{S_{z \mid y}(\theta) \mid y\} = 0$.
2. $\mathrm{Cov}\{S_{z \mid y}(\theta), S_y(\theta)\} = 0$.
3. $V\{S_{(y,z)}(\theta)\} = V\{S_y(\theta)\} + V\{S_{z \mid y}(\theta)\}$, or
$$I_{(y,z)}(\theta) = I_y(\theta) + I_{z \mid y}(\theta). \tag{7}$$
Equality (7) is sometimes called the missing information principle (Orchard and Woodbury, 1972).
• We are interested in estimating $I_y(\theta) = I_{(y,z)}(\theta) - I_{z \mid y}(\theta)$. To investigate the second term, $I_{z \mid y}(\theta)$, note that
$$
\begin{aligned}
I_{z \mid y}(\theta) &= V\{S_{z \mid y}(\theta)\} \\
&= V\left[E\{S_{z \mid y}(\theta) \mid y\}\right] + E\left[V\{S_{z \mid y}(\theta) \mid y\}\right] \\
&= 0 + E\left[V\{S_{(y,z)}(\theta) - S_y(\theta) \mid y\}\right] \\
&= 0 + E\left[V\{S_{(y,z)}(\theta) \mid y\}\right].
\end{aligned}
$$
Thus, $I_{z \mid y}(\theta)$ is estimated by $V\{S_{(y,z)}(\theta) \mid y\}$, which is often obtained by
$$B_m(S) = \frac{1}{m-1} \sum_{j=1}^{m} \left\{S(\theta; y, z^{*(j)}) - \bar S_I(\theta)\right\}^{\otimes 2}$$
where $z^{*(1)}, \dots, z^{*(m)} \sim f(z \mid y; \theta)$, $B^{\otimes 2} = BB'$, and $\bar S_I(\theta) = m^{-1}\sum_{j=1}^{m} S(\theta; y, z^{*(j)})$.
• Thus, an (asymptotically) unbiased estimator of $I_y(\theta)$ is given by
$$\hat I_y(\theta) = E\{I_{(y,z)}(\theta) \mid y\} - B_m,$$
which was first derived by Louis (1982).
• The Louis formula can also be derived from the mean score equation (2) for the PFI method. Note that we may write $\bar S_I(\theta)$ as
$$\bar S_I(\theta \mid \theta) = \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}^{*}(\theta)\, S(\theta; y_i, z_i^{*(j)}) = \sum_{i=1}^{n} \bar S_i(\theta)$$
to reflect the dependence on θ in two places. Thus, we can use the chain rule to obtain
$$
\begin{aligned}
\frac{\partial}{\partial\theta} \bar S_I(\theta \mid \theta)
&= \sum_{i=1}^{n}\sum_{j=1}^{m} \left\{\frac{\partial}{\partial\theta} w_{ij}^{*}(\theta)\right\} S(\theta; y_i, z_i^{*(j)})
 + \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}^{*}(\theta)\, \frac{\partial}{\partial\theta} S(\theta; y_i, z_i^{*(j)}) \\
&= \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}^{*}(\theta) \left\{S(\theta; y_i, z_i^{*(j)}) - \bar S_i(\theta)\right\}^{\otimes 2}
 + \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}^{*}(\theta)\, \frac{\partial}{\partial\theta} S(\theta; y_i, z_i^{*(j)}).
\end{aligned}
$$
Thus, the first term estimates $V\{S_{(y,z)}(\theta) \mid y\}$ and the second term estimates $E\{-I_{(y,z)}(\theta) \mid y\}$. Therefore, the asymptotic variance of the MLE of θ is estimated by the inverse of
$$\hat I_y = \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}^{*}(\hat\theta)\, I(\hat\theta; y_i, z_i^{*(j)})
 - \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}^{*}(\hat\theta) \left\{S(\hat\theta; y_i, z_i^{*(j)}) - \bar S_i(\hat\theta)\right\}^{\otimes 2}$$
where
$$I(\theta; y, z) = -\frac{\partial^2}{\partial\theta\,\partial\theta'} \log f(y, z; \theta).$$
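A compact NumPy sketch of this variance estimator is given below. It assumes the user has already produced the fractional weights, the complete-data scores, and the complete-data information matrices evaluated at θ̂; the array names are mine, not from the notes.

```python
import numpy as np

def pfi_information(weights, scores, infos):
    """Louis-type information estimate built from PFI output, as in the display above.

    weights: (n, m) fractional weights w*_ij(theta_hat), each row summing to one
    scores:  (n, m, p) complete-data scores S(theta_hat; y_i, z_i^{*(j)})
    infos:   (n, m, p, p) complete-data information I(theta_hat; y_i, z_i^{*(j)})
    Returns the p x p estimate I_y_hat; its inverse estimates V(theta_hat).
    """
    n, m, p = scores.shape
    I_hat = np.zeros((p, p))
    for i in range(n):
        w = weights[i]                                   # (m,)
        s_bar = w @ scores[i]                            # S-bar_i(theta_hat), shape (p,)
        I_hat += np.einsum("j,jkl->kl", w, infos[i])     # weighted complete-data information
        d = scores[i] - s_bar                            # centered scores, shape (m, p)
        I_hat -= np.einsum("j,jk,jl->kl", w, d, d)       # weighted outer products of scores
    return I_hat
```

Inverting the returned matrix gives the estimated asymptotic variance of θ̂.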
• Yang and Kim (2014) further developed a likelihood-ratio test theory associated with the PFI likelihood
$$l_{PFI}(\theta) = -\sum_{i=1}^{n} \log\left\{\sum_{j=1}^{m} \frac{w_{ij}^{*}(\theta)}{f(y_i, z_i^{*(j)}; \theta)}\right\},$$
which turns out to be a PFI approximation to the observed log-likelihood
$$l_y(\theta) = \sum_{i=1}^{n} \log \int f(y_i, z_i; \theta)\, dz_i.$$
That is, they proved that
$$-2\left\{l_{PFI}(\theta) - l_{PFI}(\hat\theta)\right\} \sim \chi^2(p)$$
under some regularity conditions.
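For completeness, here is a sketch of how the PFI log-likelihood and the likelihood-ratio statistic above could be evaluated from stored weights and joint densities; the array layout is an assumption.

```python
import numpy as np

def pfi_loglik(weights, joint_dens):
    """PFI log-likelihood l_PFI(theta).

    weights:    (n, m) fractional weights w*_ij(theta)
    joint_dens: (n, m) joint densities f(y_i, z_i^{*(j)}; theta)
    """
    return -np.sum(np.log(np.sum(weights / joint_dens, axis=1)))

# Likelihood-ratio statistic for H0: theta = theta0, with both arrays evaluated
# once at theta0 and once at the PFI estimate theta_hat:
# lrt = -2.0 * (pfi_loglik(w0, f0) - pfi_loglik(w_hat, f_hat))   # compare with chi^2(p)
```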
3 Prediction
• So far, we have discussed estimation of the parameter θ.
• In some applications, prediction of z, or of a function of z, given the observed data is also very important.
• For example, in the case of predicting zi, we may use
$$\hat z_i = E\{z_i \mid y_i; \hat\theta\} \tag{8}$$
which can be approximated by PFI:
$$\hat z_i^{*} = \sum_{j=1}^{m} w_{ij}^{*}\, z_i^{*(j)},$$
where $w_{ij}^{*} \propto f(y_i, z_i^{*(j)}; \hat\theta)/h(z_i^{*(j)})$ and $\sum_j w_{ij}^{*} = 1$.
• How do we estimate the mean squared prediction error
$$MSPE(\hat z_i) = E\{(\hat z_i - z_i)^2\}\,?$$
• Note that, if we define $\tilde z_i = E\{z_i \mid y_i; \theta\}$, then
$$MSPE(\tilde z_i) = E\{(\tilde z_i - z_i)^2\} = E\{V(z_i \mid y_i; \theta)\},$$
which can be estimated easily using PFI by
$$\widehat{MSPE}(\tilde z_i) = \sum_{j=1}^{m} w_{ij}^{*}\, (z_i^{*(j)} - \hat z_i^{*})^2.$$
However, $MSPE(\hat z_i)$ is generally different from $MSPE(\tilde z_i)$. In fact, it can be shown that
$$MSPE(\hat z_i) = MSPE(\tilde z_i) + E\{(\hat z_i - \tilde z_i)^2\}. \tag{9}$$
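Both quantities above are immediate from the PFI output; a minimal Python sketch for one unit follows (array names are illustrative).

```python
import numpy as np

def pfi_predict(weights_i, z_draws_i):
    """PFI predictor z_i-hat* = sum_j w*_ij z_i^{*(j)}, approximating (8)."""
    return np.dot(weights_i, z_draws_i)

def mspe_tilde_hat(weights_i, z_draws_i):
    """PFI estimate of MSPE(z~_i), the first term of the decomposition (9)."""
    z_hat = pfi_predict(weights_i, z_draws_i)
    return np.dot(weights_i, (np.asarray(z_draws_i) - z_hat) ** 2)
```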
• Two approaches to estimating $MSPE(\hat z_i)$:
1. Linearization method: estimate the second term of (9) by
$$\hat E\{(\hat z_i - \tilde z_i)^2\} = g_i(\hat\theta)\, \hat V(\hat\theta)\, g_i(\hat\theta)'$$
where
$$g_i(\theta) = \frac{\partial}{\partial\theta'} E\{z_i \mid y_i; \theta\}.$$
2. Parametric bootstrap (a minimal sketch follows this list):
(a) Given θ̂, generate yi and zi from f(y, z; θ̂). Let $(y_i^{(b)}, z_i^{(b)})$ be the b-th bootstrap sample.
(b) Using the bootstrap sample $y_i^{(b)}$ (i = 1, · · · , n), compute $\hat\theta^{(b)}$ using the PFI method.
(c) Compute $\hat z_i^{(b)} = E\{z_i \mid y_i^{(b)}; \hat\theta^{(b)}\}$ by applying the same formula used for the original predictor ẑi in (8).
(d) The bootstrap estimator of $MSPE(\hat z_i)$ is computed by
$$\frac{1}{B}\sum_{b=1}^{B} \left\{\hat z_i^{(b)} - z_i^{(b)}\right\}^2,$$
where $\hat z_i^{(b)}$ is computed in Step (c).
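Here is the minimal sketch of steps (a)-(d) referenced above; `simulate`, `fit_pfi`, and `predict` are hypothetical user-supplied functions for drawing from f(y, z; θ̂), recomputing the PFI estimate, and applying the predictor (8).

```python
import numpy as np

def bootstrap_mspe(theta_hat, simulate, fit_pfi, predict, n, B=200, rng=None):
    """Parametric-bootstrap MSPE estimate for the predictors z_i-hat, steps (a)-(d).

    simulate(theta, n, rng) -> (y, z): draws a full sample from f(y, z; theta)
    fit_pfi(y) -> theta_b:             recomputes the PFI estimate from bootstrap data
    predict(y, theta) -> z_hat:        applies the predictor (8) to every unit
    Returns a length-n array with the MSPE estimate for each unit.
    """
    rng = rng if rng is not None else np.random.default_rng()
    sq_err = np.zeros(n)
    for _ in range(B):
        y_b, z_b = simulate(theta_hat, n, rng)   # step (a)
        theta_b = fit_pfi(y_b)                   # step (b)
        z_hat_b = predict(y_b, theta_b)          # step (c)
        sq_err += (np.asarray(z_hat_b) - np.asarray(z_b)) ** 2
    return sq_err / B                            # step (d): average over the B replicates
```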
• References:
– Prasad, N. G. N. and Rao, J. N. K. (1990). The estimation of the mean squared error of small-area estimators. JASA, 85, 163-171.
– Chatterjee, Lahiri, and Li (2008). "Parametric bootstrap approximation to the distribution of EBLUP, and related prediction intervals in linear mixed models." Annals of Statistics, 36, 1221-1245.
4 Application to SAE
• Hierarchical structural model:
1. Level-one model: $y_{ij} \sim f_1(y_{ij} \mid x_{ij}, a_i; \theta_1)$
2. Level-two model: $a_i \sim f_2(a_i; \theta_2)$
• Instead of observing (xij, yij), we observe (xij, ŷij), where
$$\hat y_{ij} \mid y_{ij} \sim g(\hat y_{ij} \mid y_{ij}).$$
You may think that i is a state-level index and j is a county-level index.
• Thus, we have two sources of missing data: ai and yi.
• The EM algorithm can be used to estimate θ1 and θ2. (See the Week 2 material.)
• Prediction of $Y_i = \sum_j y_{ij}$:
$$\hat Y_i = E\left\{\sum_j y_{ij} \,\Big|\, x_i, \hat y_i; \hat\theta\right\},$$
which can be approximated by
$$\hat Y_i = \sum_{k=1}^{m} w_{ik}^{*} \left(\sum_j y_{ij}^{*(k)}\right),$$
where
$$w_{ik}^{*} \propto \frac{f_1(y_i^{*(k)} \mid x_i, a_i^{*(k)}; \hat\theta_1)\, g(\hat y_i \mid y_i^{*(k)})}{h_1(y_i^{*(k)} \mid x_i, a_i^{*(k)}, \hat y_i)} \times \frac{f_2(a_i^{*(k)}; \hat\theta_2)}{h_2(a_i^{*(k)} \mid \hat x_i)},$$
with $\sum_k w_{ik}^{*} = 1$.
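Below is a sketch of how this area-level prediction could be computed from draws $(y_i^{*(k)}, a_i^{*(k)})$ generated from proposal densities h1 and h2; all function and array names are placeholders for the model components above.

```python
import numpy as np

def sae_area_total(y_draws, a_draws, h1_vals, h2_vals, f1, f2, g, x_i, yhat_i,
                   theta1_hat, theta2_hat):
    """PFI approximation of Y_i-hat = E{ sum_j y_ij | x_i, yhat_i; theta-hat }.

    y_draws: (m, n_i) imputed county values y_i^{*(k)}; a_draws: (m,) imputed a_i^{*(k)}
    h1_vals, h2_vals: (m,) proposal densities evaluated at the draws
    f1(y, x, a, theta1), f2(a, theta2), g(yhat, y): level-one, level-two and
    measurement densities from the hierarchical model above
    """
    y_draws = np.asarray(y_draws)
    num = np.array([f1(y, x_i, a, theta1_hat) * g(yhat_i, y) * f2(a, theta2_hat)
                    for y, a in zip(y_draws, a_draws)])
    w = num / (np.asarray(h1_vals) * np.asarray(h2_vals))
    w = w / w.sum()                                 # fractional weights w*_ik
    return float(np.sum(w * y_draws.sum(axis=1)))   # sum_k w*_ik (sum_j y_ij^{*(k)})
```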
• Prediction error estimation: Use parametric bootstrap.