4. Fractional Imputation (Part 3)

1 Semiparametric fractional imputation for the missing covariate problem
• Suppose that we are interested in estimating θ in f(y | x; θ).
• Now, y is always observed and x is subject to missingness.
• Assume MAR in the sense that Pr(δ = 1 | x, y) does not depend on x.
• In this case, the mean score equation for θ is
$$\sum_{i=1}^{n} \left[ \delta_i S(\theta; x_i, y_i) + (1 - \delta_i) E\{S(\theta; X, y_i) \mid y_i\} \right] = 0,$$
where
$$E\{S(\theta; X, y_i) \mid y_i\} = \frac{\int S(\theta; x, y_i)\, f(y_i \mid x; \theta)\, g(x)\, dx}{\int f(y_i \mid x; \theta)\, g(x)\, dx}$$
and g(·) is the density of the marginal distribution of x.
• If g(x) = g(x; α) for some α, the parametric FI for this case can be used to compute
$$E\{S(\theta; X, y_i) \mid y_i\} \cong \sum_{j=1}^{m} w_{ij}^{*}(\theta, \alpha)\, S(\theta; x_i^{*(j)}, y_i),$$
where
$$w_{ij}^{*}(\theta, \alpha) \propto \frac{f(y_i \mid x_i^{*(j)}; \theta)\, g(x_i^{*(j)}; \alpha)}{h(x_i^{*(j)})}$$
and h(·) is the proposal density for x.
• Parameter α is a nuisance parameter in the sense that we are not interested in
estimating α but its estimation is needed to estimate the parameter of interest.
• Semiparametric approach: use a nonparametric model for g(·) but still use a parametric model for f(·).
• Without loss of generality, assume that the first r units respond in both x and y and the remaining n − r units have missing x.
• A nonparametric model for g(x) is that it belongs to the class
$$\mathcal{G} = \left\{ g(x) = \sum_{i=1}^{r} \alpha_i I(x = x_i) \;;\; \sum_{i=1}^{r} \alpha_i = 1,\; \alpha_i \ge 0 \right\}.$$
Note that the dimension of the parameter α is r, which can go to infinity asymptotically.
• EM algorithm using semiparametric fractional imputation:
Step 0. For each unit with δi = 0, assign r imputed values of x by $x_i^{*(j)} = x_j$, j = 1, . . . , r. Let $\alpha_k^{(0)} = 1/r$.
Step 1. At the t-th EM iteration, compute the fractional weights
$$w_{ij(t)}^{*} = \frac{f(y_i \mid x_i^{*(j)}; \theta^{(t)})\, \alpha_j^{(t)}}{\sum_{k=1}^{r} f(y_i \mid x_i^{*(k)}; \theta^{(t)})\, \alpha_k^{(t)}},$$
where θ^(0) is the MLE of θ using only the full respondents.
Step 2. Using $w_{ij(t)}^{*}$ and $(x_i^{*(j)}, y_i)$, update the parameters by solving the imputed score equation
$$\sum_{i=1}^{r} S(\theta; x_i, y_i) + \sum_{i=r+1}^{n} \sum_{j=1}^{r} w_{ij(t)}^{*}\, S(\theta; x_i^{*(j)}, y_i) = 0 \tag{1}$$
and
$$\hat{\alpha}_j^{(t+1)} = \frac{1}{n} \left\{ 1 + \sum_{i=r+1}^{n} w_{ij(t)}^{*} \right\}. \tag{2}$$
Step 2 uses (1) and (2) to update the parameters; these are the estimating equations obtained by differentiating the log-likelihood with respect to θ and the nuisance parameters α = (α_1, . . . , α_r)^T. A code sketch of this EM algorithm is given at the end of this section.
• Yang and Kim (2014) established the √n-consistency and asymptotic normality of the SFI estimator of θ using von Mises calculus and V-statistic theory.
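
A minimal code sketch of this EM algorithm (Steps 0-2), assuming a normal linear regression model for f(y | x; θ) with θ = (b0, b1, σ²); the data layout (x, y, δ) and the model choice are illustrative assumptions, not part of the notes:

```python
# Semiparametric FI EM sketch for a missing covariate x, assuming y | x ~ N(b0 + b1*x, sig2).
import numpy as np
from scipy.stats import norm

def sfi_em(x, y, delta, n_iter=50):
    resp = delta == 1
    xr, yr = x[resp], y[resp]              # respondents (i = 1, ..., r)
    ym = y[~resp]                          # y observed, x missing
    r, nm = xr.size, ym.size
    n = r + nm

    # Step 0: donor pool is the r respondent x-values; alpha_j = 1/r;
    # theta^(0) from the complete cases only.
    alpha = np.full(r, 1.0 / r)
    b1, b0 = np.polyfit(xr, yr, 1)
    sig2 = np.mean((yr - b0 - b1 * xr) ** 2)

    for _ in range(n_iter):
        # Step 1: fractional weights w*_ij proportional to f(y_i | x_j; theta^(t)) * alpha_j^(t).
        dens = norm.pdf(ym[:, None], loc=b0 + b1 * xr[None, :], scale=np.sqrt(sig2))
        w = dens * alpha[None, :]
        w /= w.sum(axis=1, keepdims=True)  # rows sum to one

        # Step 2: solve the imputed score equation (1) via weighted least squares,
        # stacking complete cases (weight 1) and fractionally imputed cases.
        x_all = np.concatenate([xr, np.tile(xr, nm)])
        y_all = np.concatenate([yr, np.repeat(ym, r)])
        w_all = np.concatenate([np.ones(r), w.ravel()])
        X = np.column_stack([np.ones_like(x_all), x_all])
        beta = np.linalg.solve(X.T @ (w_all[:, None] * X), X.T @ (w_all * y_all))
        b0, b1 = beta
        sig2 = np.sum(w_all * (y_all - X @ beta) ** 2) / n

        # Update the nonparametric point masses alpha via (2).
        alpha = (1.0 + w.sum(axis=0)) / n

    return b0, b1, sig2, alpha
```

The weighted least squares step is simply the imputed score equation (1) written out for the assumed normal model; any other parametric f(y | x; θ) would replace that step with its own weighted MLE.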
2 Nonparametric fractional imputation
• Bivariate data: (x_i, y_i).
• x_i are completely observed but y_i is subject to missingness.
• The joint distribution of (x, y) is completely unspecified.
• Assume MAR in the sense that Pr(δ = 1 | x, y) does not depend on y.
• Without loss of generality, assume that δ_i = 1 for i = 1, . . . , r and δ_i = 0 for i = r + 1, . . . , n.
• We are only interested in estimating θ = E(Y).
• Let K_h(x_i, x_j) = K((x_i − x_j)/h) be the kernel function with bandwidth h such that K(x) ≥ 0 and
$$\int K(x)\, dx = 1, \qquad \int x K(x)\, dx = 0, \qquad \sigma_K^2 \equiv \int x^2 K(x)\, dx > 0.$$
Examples include the following (coded in the sketch below):
– Boxcar kernel: $K(x) = \frac{1}{2} I(x)$
– Gaussian kernel: $K(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right)$
– Epanechnikov kernel: $K(x) = \frac{3}{4}(1 - x^2)\, I(x)$
– Tricube kernel: $K(x) = \frac{70}{81}(1 - |x|^3)^3\, I(x)$
where
$$I(x) = \begin{cases} 1 & \text{if } |x| \le 1 \\ 0 & \text{if } |x| > 1. \end{cases}$$
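
A quick sketch of these four kernels in Python; the function names are illustrative:

```python
# Standard kernel functions used in kernel smoothing.
import numpy as np

def indicator(x):
    """I(x) = 1 if |x| <= 1, 0 otherwise."""
    return (np.abs(x) <= 1).astype(float)

def boxcar(x):
    return 0.5 * indicator(x)

def gaussian(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def epanechnikov(x):
    return 0.75 * (1 - x ** 2) * indicator(x)

def tricube(x):
    return (70 / 81) * (1 - np.abs(x) ** 3) ** 3 * indicator(x)
```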
• Nonparametric regression estimator of m(x) = E(Y | x):
$$\hat{m}(x) = \sum_{i=1}^{r} l_i(x)\, y_i, \tag{3}$$
where
$$l_i(x) = \frac{K\!\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{r} K\!\left(\frac{x - x_j}{h}\right)}.$$
The estimator in (3) is often called the Nadaraya–Watson kernel estimator.
• Under some regularity conditions and under the optimal choice of h (with h* = O(n^{−1/5})), it can be shown that
$$E\{\hat{m}(x) - m(x)\}^2 = O(n^{-4/5}).$$
Thus, its convergence rate is slower than that of a parametric estimator.
• However, the imputed estimator of θ using (3) can achieve √n-consistency. That is,
$$\hat{\theta}_{NP} = \frac{1}{n} \left\{ \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \hat{m}(x_i) \right\} \tag{4}$$
achieves
$$\sqrt{n}\left( \hat{\theta}_{NP} - \theta \right) \longrightarrow N(0, \sigma^2), \tag{5}$$
where σ² = E{v(x)/π(x)} + V{m(x)}, m(x) = E(y | x), v(x) = V(y | x) and π(x) = E(δ | x). A sketched proof for (5) is given in Appendix A.
• We can express θ̂_NP as a nonparametric fractional imputation (NFI) estimator of the form
$$\hat{\theta}_{NFI} = \frac{1}{n} \left\{ \sum_{i=1}^{r} y_i + \sum_{j=r+1}^{n} \sum_{i=1}^{r} w_{ij}^{*}\, y_i^{*(j)} \right\},$$
where $w_{ij}^{*} = l_i(x_j)$, which is defined after (3), and $y_i^{*(j)} = y_i$. One may consider a sampling of size m from the set of respondents using the fractional weights to reduce the imputation size; further research is needed. A code sketch of the NFI estimator is given at the end of this section.
• References
– Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association 89, 81–87.
– Wang, D. and Chen, S. X. (2009). Empirical likelihood for estimating equations with missing values. The Annals of Statistics 37, 490–517.
– Kim, J. K. and Yu, C. Y. (2011). A semi-parametric estimation of mean functionals with non-ignorable missing data. Journal of the American Statistical Association 106, 157–165.
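
A minimal sketch of the NFI estimator, assuming a Gaussian kernel; the rule-of-thumb bandwidth and variable names are assumptions for illustration:

```python
# NFI estimator: fractional weights w*_ij = l_i(x_j) from the Nadaraya-Watson weights in (3).
import numpy as np

def nfi_estimate(x, y, delta, h=None):
    resp = delta == 1
    xr, yr = x[resp], y[resp]          # respondents (i = 1, ..., r)
    xm = x[~resp]                      # x observed for nonrespondents
    if h is None:
        h = 1.06 * np.std(xr) * xr.size ** (-1 / 5)   # rule-of-thumb bandwidth (assumption)

    # Gaussian kernel weights of donor i evaluated at each nonrespondent x_j, normalized per row.
    k = np.exp(-0.5 * ((xm[:, None] - xr[None, :]) / h) ** 2)
    w = k / k.sum(axis=1, keepdims=True)

    # hat{m}(x_j) = sum_i l_i(x_j) y_i; theta_NFI averages observed and fractionally imputed y.
    m_hat = w @ yr
    return (yr.sum() + m_hat.sum()) / x.size
```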
3 Nearest neighbor imputation
• Consider the same setup as for nonparametric fractional imputation.
• Note that the nonparametric estimator m̂(x) in (3) with the boxcar kernel is the local sample average of the y_i whose x_i fall within distance h of x. In the extreme case, we can let the choice of h vary with x so that it includes only the nearest neighbor.
• That is, for the missing yj , we use the observed covariate xj to identify the
nearest neighbor of yj .
• Let j(1) be the index of the nearest neighbor of j such that
$$d\left(x_{j(1)}, x_j\right) \le d\left(x_k, x_j\right), \qquad k = 1, \ldots, r,$$
where d(x_i, x_j) is the distance function between x_i and x_j.
• Similarly, the second nearest neighbor of y_j, indexed by j(2), satisfies
$$d\left(x_{j(1)}, x_j\right) \le d\left(x_{j(2)}, x_j\right) \le d\left(x_k, x_j\right), \qquad k \in \{1, \ldots, r\} \setminus \{j(1)\}.$$
• Let $y_{j(1)}^{*}$ and $y_{j(2)}^{*}$ be the first nearest neighbor and the second nearest neighbor of y_j, respectively. Then, under some regularity conditions, we can show that
$$\max_j \left| E(y_j \mid x_j) - E\left(y_{j(1)}^{*} \mid x_{j(1)}^{*}\right) \right| = O_p\left(n^{-1+\alpha}\right) \tag{6}$$
and
$$\max_j \left| E(y_j \mid x_j) - E\left(y_{j(2)}^{*} \mid x_{j(2)}^{*}\right) \right| = O_p\left(n^{-1+\alpha}\right) \tag{7}$$
for any α > 0. A sketched proof for (6) and (7) is given in Appendix B.
• In fact, we can obtain the same result for the m nearest neighbors and then use
$$\hat{\theta}_{NNFI} = \frac{1}{n} \left\{ \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \sum_{j=1}^{m} w_{ij}^{*}\, y_i^{*(j)} \right\}, \tag{8}$$
where $y_i^{*(j)} = y_{i(j)}$ is the j-th nearest neighbor of y_i and $w_{ij}^{*}$ is the fractional weight assigned to $y_i^{*(j)}$. For d(x_i, x_j) = (x_i − x_j)², we may use a Gaussian kernel to get $w_{ij}^{*} \propto \exp\{-d(x_i, x_{i(j)})\}$ with $\sum_{j=1}^{m} w_{ij}^{*} = 1$. A code sketch of this estimator is given at the end of this section.
• Furthermore, we may consider a nearest neighbor based on predictive mean matching:
1. Assume E(y | x) = m(x; β) for some β. Fit a working regression model to get ŷ_i = m(x_i; β̂) for all i = 1, 2, . . . , n.
2. Identify the nearest neighbor using ŷ_i. That is, find j(1) that satisfies
$$d\left(\hat{y}_{j(1)}, \hat{y}_j\right) \le d\left(\hat{y}_k, \hat{y}_j\right), \qquad k = 1, \ldots, r.$$
Similarly, we can identify the m nearest neighbors in terms of ŷ_i.
3. Apply the m nearest neighbors in Step 2 to obtain the nearest neighbor imputation estimator of the form (8).
• Note that we can express d(ŷ_k, ŷ_j) = d_{β̂}(x_k, x_j) for some distance function d_β(a, b) that also depends on β = β̂. In this case, proving the consistency of the resulting estimator is more complicated; further research is needed.
• Predictive mean matching removes the curse of dimensionality in identifying the nearest neighbor. It is very popular in practice, but its theory is underdeveloped.
• Also, real observed values of y_i are used as the imputed values. Such imputation is called hot deck imputation. (Will be covered in Week 5.)
• References
– Abadie, A. and Imbens, G. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267.
– Chen, J. and Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics 16, 113–132.
– Kim, J. K., Fuller, W. A., and Bell, W. R. (2011). Variance estimation for nearest neighbor imputation for U.S. Census long form data. Annals of Applied Statistics 5, 824–842.
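
A minimal sketch of the m-nearest-neighbor fractional imputation estimator in (8), assuming squared distance in x and Gaussian-kernel fractional weights over the m nearest donors; m and the variable names are illustrative assumptions:

```python
# NNFI estimator (8): m nearest donors in x, fractional weights proportional to exp{-d(x_i, x_i(j))}.
import numpy as np

def nnfi_estimate(x, y, delta, m=5):
    resp = delta == 1
    xr, yr = x[resp], y[resp]          # respondents (donor pool)
    xm = x[~resp]                      # nonrespondents: y missing, x observed

    d = (xm[:, None] - xr[None, :]) ** 2          # d(x_i, x_k) = (x_i - x_k)^2
    nn = np.argsort(d, axis=1)[:, :m]             # indices of the m nearest donors per unit
    d_nn = np.take_along_axis(d, nn, axis=1)

    # Fractional weights w*_ij ∝ exp{-d}; subtracting the row minimum only stabilizes the
    # exponentials and leaves the normalized weights unchanged.
    w = np.exp(-(d_nn - d_nn.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)

    y_imp = (w * yr[nn]).sum(axis=1)              # fractionally imputed values
    return (yr.sum() + y_imp.sum()) / x.size
```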
4 Application to measurement error models
• Bivariate data (x, y)
• We are interested in estimating θ in f (y | x; θ) with θ ∈ Ω.
• Instead of observing (x, y), we observe (w, y) where w = x + u and u is the
measurement error.
• Suppose that we have a separate sample, called a calibration sample or validation sample V, in which we observe both x and w.
• In V, we obtain a nonparametric estimator of g(x | w) using a kernel regression method. Modeling g(x | w) directly is called the Berkson error model in the measurement error literature.
• We also assume that
$$f(y \mid x, w) = f(y \mid x). \tag{9}$$
If condition (9) is satisfied, then the measurement error u = w − x is called non-differential and the variable w is said to be a surrogate for x.
• In the original sample, the imputed values of x are obtained from
$$f(x \mid w, y) \propto g(x \mid w)\, f(y \mid x, w),$$
which reduces, under assumption (9), to
$$f(x \mid w, y) \propto g(x \mid w)\, f(y \mid x). \tag{10}$$
The first component is estimated from the validation sample using nonparametric regression. The second component is computed under a parametric model assumption, f(y | x) = f(y | x; θ) for some θ. Thus, the overall estimation problem becomes a semiparametric estimation problem.
• Specifically, the following semiparametric fractional imputation can be used:
1. From the validation sample, use a nonparametric regression technique to obtain $\hat{E}(x \mid w) = \sum_{i \in V} l_i(w)\, x_i$, where
$$l_i(w) = \frac{K_h(w_i, w)}{\sum_{j \in V} K_h(w_j, w)}.$$
2. For each unit i in the original sample, use m = n_V imputed values of x_i by taking all the elements x_j in V. That is, use $x_i^{*(j)} = x_j$.
3. Compute the fractional weight associated with $x_i^{*(j)}$ by
$$w_{ij(t)}^{*} = \frac{f(y_i \mid x_i^{*(j)}; \hat{\theta}_{(t)})\, K_h(w_i, w_j)}{\sum_{k=1}^{m} f(y_i \mid x_i^{*(k)}; \hat{\theta}_{(t)})\, K_h(w_i, w_k)}.$$
4. Update the parameter value by maximizing
$$l(\theta) = \sum_i \sum_j w_{ij(t)}^{*} \log f(y_i \mid x_i^{*(j)}; \theta)$$
with respect to θ ∈ Ω to get θ̂_(t+1).
5. Go to Step 3 and iterate until convergence. A code sketch of this algorithm is given at the end of this section.
• The resulting SMLE (semiparametric maximum likelihood estimator) of θ is the solution to
$$E\{S(\theta; X, y) \mid y, w\} = 0,$$
which is actually the solution to
$$E\{S(\theta; X, y) \mid y, w; \theta, g\} = 0,$$
where g is an infinite-dimensional nuisance parameter.
• The √n-consistency of the SMLE of θ can be established. (Further research is needed.)
• References
– Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155, 138–154.
– Schafer, D. W. (2001). Semiparametric maximum likelihood for measurement error model regression. Biometrics 57, 53–61.
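
A minimal sketch of this fractional-imputation algorithm for the measurement error setup, assuming a normal linear model for f(y | x; θ) and a Gaussian kernel in w; the data names, bandwidth h, and model choice are illustrative assumptions:

```python
# Semiparametric FI for measurement error: donors x_j come from the validation sample V,
# weights follow Step 3, and Step 4 is the weighted MLE for the assumed normal model.
import numpy as np
from scipy.stats import norm

def me_sfi(y, w, x_val, w_val, h, n_iter=50):
    """y, w: original sample; x_val, w_val: validation sample V."""
    m = x_val.size                                     # m = n_V imputed values per unit
    kern = np.exp(-0.5 * ((w[:, None] - w_val[None, :]) / h) ** 2)   # K_h(w_i, w_j)

    # Starting value: naive regression of y on w (illustrative choice).
    b1, b0 = np.polyfit(w, y, 1)
    sig2 = np.mean((y - b0 - b1 * w) ** 2)

    for _ in range(n_iter):
        # Step 3: fractional weights proportional to f(y_i | x_j; theta) K_h(w_i, w_j).
        dens = norm.pdf(y[:, None], loc=b0 + b1 * x_val[None, :], scale=np.sqrt(sig2))
        wgt = dens * kern
        wgt /= wgt.sum(axis=1, keepdims=True)

        # Step 4: maximize the weighted log-likelihood (weighted least squares here).
        X = np.column_stack([np.ones(m), x_val])       # design over the donor pool
        sw = wgt.sum(axis=0)                           # total weight on each donor
        xtx = X.T @ (sw[:, None] * X)
        xty = X.T @ (wgt.T @ y)
        b0, b1 = np.linalg.solve(xtx, xty)
        resid2 = (y[:, None] - (b0 + b1 * x_val[None, :])) ** 2
        sig2 = np.sum(wgt * resid2) / y.size           # Step 5: iterate until convergence
    return b0, b1, sig2
```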
Appendix A: Proof for (5)

Regularity conditions:

(C1) Moment conditions: E(|m(x)|^α) < ∞ and E(|y|^α) < ∞ for some α > 2.

(C2) Boundedness condition: 1 > π(x) > d > 0 almost surely.

(C3) Smoothness conditions: f(x) and π(x) have bounded partial derivatives with respect to x up to order q almost surely, with q ≥ 2 and 2q > d_x, where d_x is the dimension of x.

(C4) Kernel function conditions:
1. K is bounded and has compact support.
2. $\int K(s_1, \ldots, s_{d_x})\, ds_1 \cdots ds_{d_x} = 1$.
3. $\int s_i^{l} K(s_1, \ldots, s_{d_x})\, ds_1 \cdots ds_{d_x} = 0$ for any i = 1, . . . , d_x and 1 ≤ l < q.
4. $\int s_i^{q} K(s_1, \ldots, s_{d_x})\, ds_1 \cdots ds_{d_x} \ne 0$.

(C5) Bandwidth conditions: $nh^{2d_x} \to \infty$ and $\sqrt{n}\, h^{q} \to 0$ as n → ∞.
Proof

For simplicity, we only consider the case where d_x = 1. By using the standard argument in kernel smoothing, it can be shown that
$$E\left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\!\left(\frac{x - x_j}{h}\right) y_j \right\} = \pi(x) f(x) m(x) + O(h^2) \tag{11}$$
and
$$E\left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\!\left(\frac{x - x_j}{h}\right) \right\} = \pi(x) f(x) + O(h^2). \tag{12}$$
According to (11), (12) and by using Taylor linearization, we have
$$\begin{aligned}
\hat{m}(x) &= \frac{(nh)^{-1} \sum_{j=1}^{n} \delta_j K((x - x_j)/h)\, y_j}{(nh)^{-1} \sum_{j=1}^{n} \delta_j K((x - x_j)/h)} \\
&= m(x) + \frac{1}{\pi(x) f(x)} \left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\!\left(\frac{x - x_j}{h}\right) y_j - \pi(x) f(x) m(x) \right\} \\
&\quad - \frac{m(x)}{\pi(x) f(x)} \left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\!\left(\frac{x - x_j}{h}\right) - \pi(x) f(x) \right\} + O(h^2).
\end{aligned} \tag{13}$$
Therefore, we have
$$\begin{aligned}
\hat{\theta}_{NP} &= \frac{1}{n} \sum_{i=1}^{n} \left\{ \delta_i y_i + (1 - \delta_i)\hat{m}(x_i) \right\} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left\{ \delta_i y_i + (1 - \delta_i) m(x_i) \right\}
+ \frac{1}{n^2} \sum_{i \ne j} \frac{(1 - \delta_i)\,\delta_j\, h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)} + O(h^2) \\
&\approx \frac{1}{n} \sum_{i=1}^{n} \left\{ \delta_i y_i + (1 - \delta_i) m(x_i) \right\} \\
&\quad + \frac{2}{n(n-1)} \sum_{i < j} \frac{1}{2}\left[ \frac{(1 - \delta_i)\,\delta_j\, h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)}
+ \frac{(1 - \delta_j)\,\delta_i\, h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \right] + O(h^2).
\end{aligned} \tag{14}$$
Define
$$\zeta_{ij} = \frac{(1 - \delta_i)\,\delta_j\, h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)}, \qquad
\zeta_{ji} = \frac{(1 - \delta_j)\,\delta_i\, h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)},$$
and $h(z_i, z_j) = (\zeta_{ij} + \zeta_{ji})/2$ with $z_i = (x_i, y_i, \delta_i)$; then $2 \sum_{i \ne j} h(z_i, z_j)/\{n(n-1)\}$ is a U-statistic. According to U-statistic theory (e.g. Serfling, 1980, Ch. 5), we have
$$\frac{2}{n(n-1)} \sum_{i < j} h(z_i, z_j) = \frac{2}{n} \sum_{i=1}^{n} E\{h(z_i, z_j) \mid z_i\} + o_p(n^{-1/2}). \tag{15}$$
Let s = (x_j − x_i)/h. By nh² → ∞, nh⁴ → 0, and Taylor linearization, we have
$$\begin{aligned}
E(\zeta_{ij} \mid z_i) &= E\left[ \frac{(1 - \delta_i)\,\delta_j\, h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)} \,\Big|\, z_i \right] \\
&= \frac{1 - \delta_i}{\pi(x_i) f(x_i)}\, E\left[ \pi(x_j)\, \frac{1}{h} K\!\left(\frac{x_j - x_i}{h}\right) \{m(x_j) - m(x_i)\} \,\Big|\, z_i \right] \\
&= \frac{1 - \delta_i}{\pi(x_i) f(x_i)} \int \pi(x_i + hs)\, K(s) \{m(x_i + hs) - m(x_i)\}\, f(x_i + hs)\, ds \\
&= O(h^2)
\end{aligned} \tag{16}$$
and
$$\begin{aligned}
E(\zeta_{ji} \mid z_i) &= E\left[ \frac{(1 - \delta_j)\,\delta_i\, h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \,\Big|\, z_i \right] \\
&= \delta_i\, E\left[ \frac{\{1 - \pi(x_j)\}\, h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \,\Big|\, z_i \right] \\
&= \delta_i \int \frac{1 - \pi(x_i + hs)}{f(x_i + hs)\, \pi(x_i + hs)}\, K(s) \{y_i - m(x_i)\}\, f(x_i)\, ds + O(h^2) \\
&= \delta_i\, \frac{1 - \pi(x_i)}{\pi(x_i)} \{y_i - m(x_i)\} + O(h^2).
\end{aligned} \tag{17}$$
According to (14)–(17), we have
$$\begin{aligned}
\hat{\theta}_{NP} &= \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) m(x_i)\} + \frac{2}{n(n-1)} \sum_{i < j} h(z_i, z_j) + o_p(n^{-1/2}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) m(x_i)\} + \frac{1}{n} \sum_{i=1}^{n} \delta_i\, \frac{1 - \pi(x_i)}{\pi(x_i)} \{y_i - m(x_i)\} + o_p(n^{-1/2}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\delta_i}{\pi(x_i)}\, y_i - \left\{ \frac{\delta_i}{\pi(x_i)} - 1 \right\} m(x_i) \right] + o_p(n^{-1/2}).
\end{aligned}$$
Hence, we have (5) with σ² = E{σ²(x)/π(x)} + V{m(x)}, where σ²(x) = V(y | x).
Appendix B: Proof for (6) and (7)

We assume the following conditions.

(C1) The function m(x) = E(y | x) satisfies
$$|m(x_1) - m(x_2)| < C_1\, |d(x_1, x_2)|$$
for some constant C_1 and for all x_1 and x_2.

(C2) The function m_2(x) = E(y² | x) satisfies
$$|m_2(x_1) - m_2(x_2)| < C_2\, |d(x_1, x_2)|$$
for some constant C_2 and for all x_1 and x_2.

(C3) For each i ∈ U, the cumulative distribution function of d(x) = d(x_i, x), denoted by F(d), satisfies
$$\lim_{d \to 0} F(d)/d > 0,$$
where it is understood that F(0) > 0 satisfies the condition.
Proof. Define d_ij = d(x_i, x_j) for i ≠ j and let d_{j(1)} be the smallest value of d_ij among i ∈ A ∩ {j}^c. By definition, $d_{j(1)} = d(x_j, x_{j(1)})$. Similarly, we can define $d_{j(2)} = d(x_j, x_{j(2)})$. Note that, for $a_n = n^{1-\alpha}$,
$$\begin{aligned}
Pr\left( \max_j a_n d_{j(1)} > M \right) &\le \sum_{j=1}^{n} Pr\left( d_{j(1)} > M/a_n \right) \\
&= n\left[1 - F(M/a_n)\right]^{n} \\
&= n \exp\left[ a_n n^{\alpha} \log\{1 - F(M/a_n)\} \right] \\
&\le n \exp\left[ -a_n n^{\alpha} F(M/a_n) \right],
\end{aligned}$$
since log(1 − x) ≤ −x for x ∈ [0, 1). Thus, writing t = M/a_n, we have
$$Pr\left( \max_j a_n d_{j(1)} > M \right) \le n \exp\left[ -M n^{\alpha} F(t)/t \right],$$
which goes to zero as n → ∞, since, by (C3), F(t)/t is bounded below by some constant greater than zero. Thus, we have $\max_j a_n d_{j(1)} = O_p(1)$ and, by (C1), we have (6).

To prove (7), we use
$$\begin{aligned}
Pr\left( \max_j a_n d_{j(2)} > M \right) &\le \sum_{j=1}^{n} Pr\left( d_{j(2)} > M/a_n \right) \\
&= n\{1 - F(M/a_n)\}^{n} + n^2\{1 - F(M/a_n)\}^{n-1} F(M/a_n) \\
&= n\{1 - F(M/a_n)\}^{n-1}\{1 + (n-1)F(M/a_n)\} \\
&= n \exp\left[ (a_n n^{\alpha} - 1) \log\{1 - F(M/a_n)\} \right]\{1 + (n-1)F(M/a_n)\} \\
&\le n \exp(1) \exp\left[ -a_n n^{\alpha} F(M/a_n) \right]\{1 + nF(M/a_n)\} \\
&\le K n^2 \exp\left[ -M n^{\alpha} F(t)/t \right]
\end{aligned}$$
for some constant K, where t = M/a_n. Thus, for any ε > 0, we have $Pr\left( \max_j a_n d_{j(2)} > M \right) \le \varepsilon$ for sufficiently large n. Therefore, we have $\max_j a_n d_{j(2)} = O_p(1)$ and, by (C1), (7) is proved.