4. Fractional Imputation (Part 3)

1 Semiparametric fractional imputation for the missing covariate problem
• Suppose that we are interested in estimating θ in f (y | x; θ).
• Now, y is always observed and x is subject to missingness.
• Assume MAR in the sense that Pr(δ = 1 | x, y) does not depend on x.
• In this case, the mean score equation for θ is
$$\sum_{i=1}^{n} \left[ \delta_i S(\theta; x_i, y_i) + (1 - \delta_i) E\{S(\theta; X, y_i) \mid y_i\} \right] = 0,$$
where
$$E\{S(\theta; X, y_i) \mid y_i\} = \frac{\int S(\theta; x, y_i) f(y_i \mid x; \theta) g(x) \, dx}{\int f(y_i \mid x; \theta) g(x) \, dx}$$
and g(·) is the density for the marginal distribution of x.
• If g(x) = g(x; α) for some α, the parametric FI for this case can be used to compute
$$E\{S(\theta; X, y_i) \mid y_i\} \cong \sum_{j=1}^{m} w_{ij}^{*}(\theta, \alpha) \, S(\theta; x_i^{*(j)}, y_i)$$
where
$$w_{ij}^{*}(\theta, \alpha) \propto \frac{f(y_i \mid x_i^{*(j)}; \theta) \, g(x_i^{*(j)}; \alpha)}{h(x_i^{*(j)})}$$
and h(·) is the proposal density from which the imputed values x_i^{*(j)} are generated. (A small numerical sketch of this weight computation follows.)
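To make the weight computation concrete, here is a minimal numpy sketch assuming a normal outcome model f(y | x; θ) with θ = (β₀, β₁, σ) and a normal covariate model g(x; α) with α = (μ, τ); the normal proposal h and all function and variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def fi_weights(y_i, x_imp, theta, alpha, h_mean=0.0, h_sd=3.0):
    """w*_ij ∝ f(y_i | x*_i(j); theta) g(x*_i(j); alpha) / h(x*_i(j)),
    normalized to sum to one over the m imputed values of unit i."""
    b0, b1, sigma = theta                 # outcome model: y | x ~ N(b0 + b1*x, sigma^2)
    mu, tau = alpha                       # covariate model: x ~ N(mu, tau^2)
    f = normal_pdf(y_i, b0 + b1 * x_imp, sigma)
    g = normal_pdf(x_imp, mu, tau)
    h = normal_pdf(x_imp, h_mean, h_sd)   # proposal density the x* were drawn from
    w = f * g / h
    return w / w.sum()

rng = np.random.default_rng(0)
x_imp = rng.normal(0.0, 3.0, size=50)     # m = 50 draws from the proposal h
w = fi_weights(1.2, x_imp, theta=(0.0, 1.0, 1.0), alpha=(0.0, 1.0))
print(w.sum())                            # weights sum to 1
```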
• The parameter α is a nuisance parameter in the sense that we are not interested in estimating α, but its estimation is needed to estimate the parameter of interest.
• Semiparametric approach: use a nonparametric model for g(·) but still use a parametric model for f(·).
• Without loss of generality, assume that the first r units are responding in both
x and y and the remaining n − r units are missing in x.
• A nonparametric model for g(x) is that it belongs to the class
$$\mathcal{G} = \left\{ g(x) = \sum_{i=1}^{r} \alpha_i I(x = x_i) \; ; \; \sum_{i=1}^{r} \alpha_i = 1, \ \alpha_i \ge 0 \right\}.$$
Note that the dimension of the parameter α is r, which can go to infinity asymptotically.
• EM algorithm using semiparametric fractional imputation:
Step 0. For each unit with δ_i = 0, r imputed values of x are assigned with x_i^{*(j)} = x_j. Let α_k^{(0)} = 1/r.
Step 1. At the t-th EM iteration, compute the fractional weight
$$w_{ij(t)}^{*} = \frac{f(y_i \mid x_i^{*(j)}; \theta^{(t)}) \, \alpha_j^{(t)}}{\sum_{k=1}^{r} f(y_i \mid x_i^{*(k)}; \theta^{(t)}) \, \alpha_k^{(t)}},$$
where θ^{(0)} is the MLE of θ using only full respondents.
Step 2. Using w_{ij(t)}^{*} and (x_i^{*(j)}, y_i), update the parameters by solving the imputed score equation
$$\sum_{i=1}^{r} S(\theta; x_i, y_i) + \sum_{i=r+1}^{n} \sum_{j=1}^{r} w_{ij(t)}^{*} \, S(\theta; x_i^{*(j)}, y_i) = 0 \qquad (1)$$
and computing
$$\hat{\alpha}_j^{(t+1)} = \frac{1}{n} \left\{ 1 + \sum_{i=r+1}^{n} w_{ij(t)}^{*} \right\}. \qquad (2)$$
Step 2 uses (1) and (2) to update the parameters; these are the estimating equations obtained by differentiating the log-likelihood with respect to θ and the nuisance parameters α = (α_1, . . . , α_r)^T. A minimal numerical sketch of this EM algorithm is given below.
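A minimal sketch of the SFI EM algorithm, assuming a normal linear working model y | x ~ N(β₀ + β₁x, σ²). For this model, the imputed score equation (1) is solved in closed form by weighted least squares, and the α update is exactly (2); the function and variable names are illustrative, not from the notes.

```python
import numpy as np

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def sfi_em(x_obs, y_obs, y_mis, n_iter=50):
    """SFI EM for a missing covariate: y | x ~ N(b0 + b1*x, sigma^2) is the
    parametric part; g(x) is nonparametric (point masses alpha on the r
    observed x values)."""
    r, n_mis = len(x_obs), len(y_mis)
    n = r + n_mis
    alpha = np.full(r, 1.0 / r)                        # Step 0
    X = np.column_stack([np.ones(r), x_obs])           # theta^(0): complete-case MLE
    beta = np.linalg.lstsq(X, y_obs, rcond=None)[0]
    sigma = np.sqrt(np.mean((y_obs - X @ beta) ** 2))
    # augmented design: complete cases first, then all (y_i, x_j) donor pairs
    Xa = np.column_stack([np.ones(r + n_mis * r),
                          np.concatenate([x_obs, np.tile(x_obs, n_mis)])])
    ya = np.concatenate([y_obs, np.repeat(y_mis, r)])
    for _ in range(n_iter):
        # Step 1: fractional weights w[i, j] ∝ f(y_i | x_j; theta) * alpha_j
        f = normal_pdf(y_mis[:, None], beta[0] + beta[1] * x_obs[None, :], sigma)
        w = f * alpha[None, :]
        w /= w.sum(axis=1, keepdims=True)
        # Step 2: weighted least squares solves the imputed score equation (1)
        wa = np.concatenate([np.ones(r), w.ravel()])
        WX = Xa * wa[:, None]
        beta = np.linalg.solve(Xa.T @ WX, WX.T @ ya)
        sigma = np.sqrt(np.sum(wa * (ya - Xa @ beta) ** 2) / n)
        alpha = (1.0 + w.sum(axis=0)) / n              # update (2)
    return beta, sigma, alpha
```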
• Yang and Kim (2014) established the √n-consistency and asymptotic normality of the SFI estimator of θ using von Mises calculus and V-statistics theory.
2 Nonparametric fractional imputation
• Bivariate data: (xi , yi )
• xi are completely observed but yi is subject to missingness.
• Joint distribution of (x, y) completely unspecified.
• Assume MAR in the sense that Pr(δ = 1 | x, y) does not depend on y.
• Without loss of generality, assume that δi = 1 for i = 1, · · · , r and δi = 0 for
i = r + 1, · · · , n.
• We are only interested in estimating θ = E(Y ).
• Let K_h(x_i, x_j) = K((x_i − x_j)/h) be the kernel function with bandwidth h such that K(x) ≥ 0 and
$$\int K(x) \, dx = 1, \qquad \int x K(x) \, dx = 0, \qquad \sigma_K^2 \equiv \int x^2 K(x) \, dx > 0.$$
Examples include the following (a numerical check of these kernels follows the list):
– Boxcar kernel: K(x) = (1/2) I(x)
– Gaussian kernel: K(x) = (2π)^{−1/2} exp(−x²/2)
– Epanechnikov kernel: K(x) = (3/4)(1 − x²) I(x)
– Tricube kernel: K(x) = (70/81)(1 − |x|³)³ I(x)
where
$$I(x) = \begin{cases} 1 & \text{if } |x| \le 1 \\ 0 & \text{if } |x| > 1. \end{cases}$$
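A quick, purely illustrative numerical check that the listed kernels satisfy the moment conditions above:

```python
import numpy as np

def boxcar(x):
    return 0.5 * (np.abs(x) <= 1)

def gaussian(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def epanechnikov(x):
    return 0.75 * (1 - x ** 2) * (np.abs(x) <= 1)

def tricube(x):
    return (70.0 / 81.0) * (1 - np.abs(x) ** 3) ** 3 * (np.abs(x) <= 1)

# Riemann-sum check: integral 1, zero first moment, positive second moment
s = np.linspace(-8, 8, 400001)
ds = s[1] - s[0]
for K in (boxcar, gaussian, epanechnikov, tricube):
    m0 = (K(s) * ds).sum()
    m1 = (s * K(s) * ds).sum()
    m2 = (s ** 2 * K(s) * ds).sum()
    print(f"{K.__name__}: m0={m0:.4f} m1={m1:.4f} m2={m2:.4f}")
```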
• Nonparametric regression estimator of m(x) = E(Y | x):
$$\hat{m}(x) = \sum_{i=1}^{r} l_i(x) \, y_i \qquad (3)$$
where
$$l_i(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{r} K\left(\frac{x - x_j}{h}\right)}.$$
The estimator in (3) is often called the Nadaraya-Watson kernel estimator.
• Under some regularity conditions and under the optimal choice of bandwidth, h* = O(n^{−1/5}), it can be shown that
$$E\{\hat{m}(x) - m(x)\}^2 = O(n^{-4/5}).$$
Thus, its convergence rate is slower than the O(n^{−1}) rate of a correctly specified parametric estimator.
• However, the imputed estimator of θ using (3) can achieve √n-consistency. That is,
$$\hat{\theta}_{NP} = \frac{1}{n} \left\{ \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \hat{m}(x_i) \right\} \qquad (4)$$
achieves
$$\sqrt{n} \left( \hat{\theta}_{NP} - \theta \right) \longrightarrow N(0, \sigma^2) \qquad (5)$$
where σ² = E{v(x)/π(x)} + V{m(x)}, m(x) = E(y | x), v(x) = V(y | x), and π(x) = E(δ | x). A sketched proof of (5) is given in Appendix A. A short numerical sketch of (4) follows.
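A short sketch of (3) and (4) on simulated data; the Gaussian kernel, the sine regression function, and the logistic response model π(x) are illustrative choices, not from the notes.

```python
import numpy as np

def nw_mhat(x_eval, x_resp, y_resp, h, K=lambda u: np.exp(-0.5 * u ** 2)):
    """Nadaraya-Watson estimator (3) built from the r respondents."""
    Kmat = K((x_eval[:, None] - x_resp[None, :]) / h)   # (n_eval, r)
    L = Kmat / Kmat.sum(axis=1, keepdims=True)          # rows are l_i(x)
    return L @ y_resp

def theta_np(x, y, delta, h):
    """theta_hat_NP in (4): observed y for respondents, m_hat(x) otherwise."""
    resp = delta == 1
    m_hat = nw_mhat(x[~resp], x[resp], y[resp], h)
    return (y[resp].sum() + m_hat.sum()) / len(y)

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)
delta = rng.random(n) < 1.0 / (1.0 + np.exp(-x))        # pi(x): MAR in x only
print(theta_np(x, y, delta, h=n ** (-1 / 5)))           # E(Y) = E{sin(X)} = 0
```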
• We can express θ̂_NP as a nonparametric fractional imputation (NFI) estimator of the form
$$\hat{\theta}_{NFI} = \frac{1}{n} \left\{ \sum_{i=1}^{r} y_i + \sum_{j=r+1}^{n} \sum_{i=1}^{r} w_{ij}^{*} \, y_i^{*(j)} \right\}$$
where w_{ij}^{*} = l_i(x_j), with l_i(·) as defined after (3), and y_i^{*(j)} = y_i. One may consider sampling m donors from the set of respondents using the fractional weights to reduce the imputation size. Further research is needed. (A sketch of the fractional-weight representation follows.)
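A minimal sketch of the fractional-weight representation; since each row of W reproduces m̂(x_j), this returns the same value as θ̂_NP above. Names are illustrative.

```python
import numpy as np

def theta_nfi(x, y, delta, h, K=lambda u: np.exp(-0.5 * u ** 2)):
    """theta_hat_NFI: each respondent i donates y_i to every nonrespondent j
    with fractional weight w*_ij = l_i(x_j); row j of W reproduces m_hat(x_j),
    so this equals theta_hat_NP algebraically."""
    resp = delta == 1
    Kmat = K((x[~resp][:, None] - x[resp][None, :]) / h)  # rows: missing units
    W = Kmat / Kmat.sum(axis=1, keepdims=True)            # each row sums to 1
    return (y[resp].sum() + (W @ y[resp]).sum()) / len(y)
```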
• References
– Cheng, P.E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association 89, 81-87.
– Wang, D. and Chen, S.X. (2009). Empirical likelihood for estimating equations with missing values. The Annals of Statistics 37, 490-517.
– Kim, J.K. and Yu, C.Y. (2011). A semiparametric estimation of mean functionals with non-ignorable missing data. Journal of the American Statistical Association 106, 157-165.
3 Nearest neighbor imputation
• Consider the same setup for nonparametric fractional imputation.
• Note that the nonparametric estimator m̂(x) in (3) with the boxcar kernel is the local sample average of the y_i whose x_i fall within range h of x. In the extreme case, we can let the choice of h vary with x so that the neighborhood includes only the nearest neighbor.
• That is, for the missing yj , we use the observed covariate xj to identify the
nearest neighbor of yj .
• Let j(1) be the index of the nearest neighbor of j, that is,
$$d\left(x_{j(1)}, x_j\right) \le d\left(x_k, x_j\right), \qquad k = 1, \ldots, r,$$
where d(x_i, x_j) is the distance function between x_i and x_j.
• Similarly, the second nearest neighbor of y_j, indexed by j(2), satisfies
$$d\left(x_{j(1)}, x_j\right) \le d\left(x_{j(2)}, x_j\right) \le d\left(x_k, x_j\right), \qquad k \in \{1, \ldots, r\} \setminus \{j(1)\}.$$
• Let y_{j(1)}^{*} and y_{j(2)}^{*} be the first and second nearest neighbors of y_j, respectively. Then, under some regularity conditions, we can show that
$$\max_j \left| E(y_j \mid x_j) - E\left(y_{j(1)}^{*} \mid x_{j(1)}^{*}\right) \right| = O_p\left(n^{-1+\alpha}\right) \qquad (6)$$
and
$$\max_j \left| E(y_j \mid x_j) - E\left(y_{j(2)}^{*} \mid x_{j(2)}^{*}\right) \right| = O_p\left(n^{-1+\alpha}\right) \qquad (7)$$
for any α > 0. A sketched proof of (6) and (7) is given in Appendix B.
• In fact, we can obtain the same result for the m nearest neighbors and then use
$$\hat{\theta}_{NNFI} = \frac{1}{n} \left\{ \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \sum_{j=1}^{m} w_{ij}^{*} \, y_i^{*(j)} \right\} \qquad (8)$$
where y_i^{*(j)} = y_{i(j)} is the j-th nearest neighbor of i and w_{ij}^{*} is the fractional weight assigned to y_i^{*(j)}. For d(x_i, x_j) = (x_i − x_j)², we may use a Gaussian kernel to get w_{ij}^{*} ∝ exp{−d(x_i, x_{i(j)})} with Σ_{j=1}^{m} w_{ij}^{*} = 1. (A sketch of (8) follows.)
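A minimal sketch of (8) with squared-distance matching and the Gaussian-type fractional weights mentioned above; the toy data are illustrative.

```python
import numpy as np

def theta_nnfi(x, y, delta, m=5):
    """m-nearest-neighbor fractional imputation (8): d(a, b) = (a - b)^2 and
    Gaussian-type weights w*_ij ∝ exp{-d(x_i, x_i(j))}, summing to 1 over j."""
    resp = np.flatnonzero(delta == 1)
    mis = np.flatnonzero(delta == 0)
    total = y[resp].sum()
    for i in mis:
        d = (x[resp] - x[i]) ** 2               # squared distances to donors
        nn = resp[np.argsort(d)[:m]]            # indices of the m nearest
        w = np.exp(-(x[nn] - x[i]) ** 2)
        w /= w.sum()                            # fractional weights
        total += w @ y[nn]
    return total / len(y)

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = x + rng.normal(size=500)
delta = rng.random(500) < 0.7
print(theta_nnfi(x, y, delta, m=5))             # compare with E(Y) = 0
```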
• Furthermore, we may consider a nearest neighbor search based on predictive mean matching (a sketch follows this list):
1. Assume E(y | x) = m(x; β) for some β. Fit a working regression model to get ŷ_i = m(x_i; β̂) for all i = 1, 2, . . . , n.
2. Identify the nearest neighbor using ŷ_i. That is, find j(1) that satisfies
$$d\left(\hat{y}_{j(1)}, \hat{y}_j\right) \le d\left(\hat{y}_k, \hat{y}_j\right), \qquad k = 1, \ldots, r.$$
Similarly, we can identify the m nearest neighbors in terms of ŷ_i.
3. Apply the m nearest neighbors from Step 2 to obtain the nearest neighbor imputation estimator of the form (8).
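A minimal sketch of predictive mean matching with a linear working model; matching is on the scalar ŷ even though x is multivariate. All names and the toy data are illustrative assumptions.

```python
import numpy as np

def pmm_nnfi(X, y, delta, m=5):
    """Predictive mean matching: fit a linear working model on respondents,
    match on the predicted values y_hat, then m-NN fractional imputation (8)."""
    resp, mis = delta == 1, delta == 0
    Xd = np.column_stack([np.ones(len(y)), X])       # design with intercept
    beta = np.linalg.lstsq(Xd[resp], y[resp], rcond=None)[0]
    yhat = Xd @ beta                                 # y_hat for all units
    r_idx = np.flatnonzero(resp)
    total = y[resp].sum()
    for i in np.flatnonzero(mis):
        d = (yhat[r_idx] - yhat[i]) ** 2             # distance on the scalar y_hat
        nn = r_idx[np.argsort(d)[:m]]
        w = np.exp(-(yhat[nn] - yhat[i]) ** 2)
        w /= w.sum()
        total += w @ y[nn]
    return total / len(y)

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 4))                          # matching on 4-dim x would be hard
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=n)
delta = rng.random(n) < 0.7
print(pmm_nnfi(X, y, delta, m=5))                    # compare with E(Y) = 0
```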
• Note that we can express d(ŷ_k, ŷ_j) = d_{β̂}(x_k, x_j) for some distance function d_β(a, b) indexed by β and evaluated at β = β̂. In this case, proving the consistency of the resulting estimator is more complicated. Further research is needed.
• Predictive mean matching removes the curse of dimensionality in identifying the nearest neighbor, since matching is done on the scalar ŷ rather than on the full covariate vector. It is very popular in practice, but its theory is underdeveloped.
• Also, we use the realized values of y_i in the imputation. Such imputation is called hot deck imputation. (It will be covered in Week 5.)
• References
– Abadie, A. and Imbens, G. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235-267.
– Chen, J. and Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics 16, 113-132.
– Kim, J.K., Fuller, W.A., and Bell, W.R. (2011). Variance estimation for nearest neighbor imputation for U.S. Census long form data. Annals of Applied Statistics 5, 824-842.
4 Application to measurement error models
• Bivariate data (x, y)
• We are interested in estimating θ in f (y | x; θ) with θ ∈ Ω.
• Instead of observing (x, y), we observe (w, y) where w = x + u and u is the
measurement error.
• Suppose that we have a separate sample, called a calibration sample or validation sample V, in which we observe both x and w.
• In V, we obtain a nonparametric estimator of g(x | w) using a kernel regression method. Specifying g(x | w) is called the Berkson error model in the measurement error literature.
• We also assume that
$$f(y \mid x, w) = f(y \mid x). \qquad (9)$$
If condition (9) is satisfied, then the measurement error u = w − x is called non-differential and the variable w is said to be a surrogate for x.
• In the original sample, the imputed values of x are obtained from
$$f(x \mid w, y) \propto g(x \mid w) f(y \mid x, w),$$
which reduces, under assumption (9), to
$$f(x \mid w, y) \propto g(x \mid w) f(y \mid x). \qquad (10)$$
The first component is computed from the validation sample using a nonparametric regression. The second component is computed using a parametric model assumption, f(y | x) = f(y | x; θ) for some θ. Thus, the overall estimation problem becomes a semiparametric estimation problem.
• Specifically, the following semiparametric fractional imputation algorithm can be used (a numerical sketch follows the list):
1. From the validation sample, use a nonparametric regression technique to obtain Ê(x | w) = Σ_{i∈V} l_i(w) x_i, where
$$l_i(w) = \frac{K_h(w_i, w)}{\sum_{j \in V} K_h(w_j, w)}.$$
2. For each i in the original sample, use m = n_V imputed values of x_i by taking all the elements x_j in V. That is, use x_i^{*(j)} = x_j.
3. Compute the fractional weight associated with x_i^{*(j)} by
$$w_{ij(t)}^{*} = \frac{f(y_i \mid x_i^{*(j)}; \hat{\theta}_{(t)}) \, K_h(w_i, w_j)}{\sum_{k=1}^{m} f(y_i \mid x_i^{*(k)}; \hat{\theta}_{(t)}) \, K_h(w_i, w_k)}.$$
4. Update the parameter value by maximizing
$$l(\theta) = \sum_i \sum_j w_{ij(t)}^{*} \log f(y_i \mid x_i^{*(j)}; \theta)$$
with respect to θ ∈ Ω to get θ̂_{(t+1)}.
5. Go to Step 3 until convergence.
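A minimal sketch of Steps 2-5, assuming a normal linear outcome model f(y | x; θ) = N(β₀ + β₁x, σ²), for which the M-step maximization has a weighted least-squares closed form. The 1/h factor in K_h is omitted since it cancels in the normalized weights; all names and the toy setup are illustrative.

```python
import numpy as np

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def me_sfi(w, y, x_val, w_val, h, n_iter=50):
    """SFI for a measurement error model with y | x ~ N(b0 + b1*x, sigma^2).
    Imputed values x*_i(j) are the validation-sample x_j; the kernel factor
    K_h(w_i, w_j) carries the nonparametric g(x | w) part."""
    n, m = len(y), len(x_val)
    Kmat = normal_pdf((w[:, None] - w_val[None, :]) / h, 0.0, 1.0)  # K_h(w_i, w_j)
    Z = np.column_stack([np.ones(n), w])      # naive start: regress y on w
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    sigma = np.sqrt(np.mean((y - Z @ beta) ** 2))
    Xa = np.column_stack([np.ones(n * m), np.tile(x_val, n)])  # (y_i, x*_i(j)) pairs
    ya = np.repeat(y, m)
    for _ in range(n_iter):
        # Step 3: fractional weights
        f = normal_pdf(y[:, None], beta[0] + beta[1] * x_val[None, :], sigma)
        wgt = f * Kmat
        wgt /= wgt.sum(axis=1, keepdims=True)
        # Step 4: weighted MLE update of theta (closed form for this model)
        wa = wgt.ravel()
        WX = Xa * wa[:, None]
        beta = np.linalg.solve(Xa.T @ WX, WX.T @ ya)
        sigma = np.sqrt(np.sum(wa * (ya - Xa @ beta) ** 2) / n)
    return beta, sigma

rng = np.random.default_rng(4)
n, nv = 1000, 300
x = rng.normal(size=n)
wobs = x + rng.normal(scale=0.5, size=n)                  # w = x + u
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)
x_val = rng.normal(size=nv)
w_val = x_val + rng.normal(scale=0.5, size=nv)            # validation sample V
print(me_sfi(wobs, y, x_val, w_val, h=0.3))               # beta should approach (1, 2)
```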
• The resulting SMLE (semiparametric maximum likelihood estimator) of θ is the solution to
$$E\{S(\theta; X, y) \mid y, w\} = 0,$$
which is actually the solution to
$$E\{S(\theta; X, y) \mid y, w; \theta, g\} = 0,$$
where g is an infinite-dimensional nuisance parameter.
• The √n-consistency of the SMLE of θ can be established. (Further research is needed.)
• References
– Cattaneo, M.D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155, 138-154.
– Schafer, D.W. (2001). Semiparametric maximum likelihood for measurement error model regression. Biometrics 57, 53-61.
Appendix A: Proof for (5)
Regularity conditions:
(C1) Moment conditions: E(|m(x)|^α) < ∞ and E(|y|^α) < ∞ for some α > 2.
(C2) Boundedness conditions: 1 > π(x) > d > 0 almost surely.
(C3) Smoothness conditions: f(x) and π(x) have bounded partial derivatives with respect to x up to order q, with q ≥ 2 and 2q > d_x, where d_x is the dimension of x.
(C4) Kernel function conditions:
1. K is bounded and has compact support.
2. ∫ K(s_1, . . . , s_{d_x}) ds_1 · · · ds_{d_x} = 1.
3. ∫ s_i^l K(s_1, . . . , s_{d_x}) ds_1 · · · ds_{d_x} = 0 for any i = 1, . . . , d_x and 1 ≤ l < q.
4. ∫ s_i^q K(s_1, . . . , s_{d_x}) ds_1 · · · ds_{d_x} ≠ 0.
(C5) Bandwidth conditions: nh^{2d_x} → ∞ and √n h^q → 0 as n → ∞.
Proof

For simplicity, we only consider the case where d_x = 1. By a standard argument in kernel smoothing, it can be shown that
$$E\left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\left(\frac{x - x_j}{h}\right) y_j \right\} = \pi(x) f(x) m(x) + O(h^2) \qquad (11)$$
and
$$E\left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\left(\frac{x - x_j}{h}\right) \right\} = \pi(x) f(x) + O(h^2). \qquad (12)$$
According to (11) and (12), and by Taylor linearization, we have
$$\begin{aligned}
\hat{m}(x) &= \frac{(nh)^{-1} \sum_{j=1}^{n} \delta_j K((x - x_j)/h) y_j}{(nh)^{-1} \sum_{j=1}^{n} \delta_j K((x - x_j)/h)} \\
&= m(x) + \frac{1}{\pi(x) f(x)} \left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\left(\frac{x - x_j}{h}\right) y_j - \pi(x) f(x) m(x) \right\} \\
&\quad - \frac{m(x)}{\pi(x) f(x)} \left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\left(\frac{x - x_j}{h}\right) - \pi(x) f(x) \right\} + O(h^2). \qquad (13)
\end{aligned}$$
Therefore, we have
$$\begin{aligned}
\hat{\theta}_{NP} &= \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) \hat{m}(x_i)\} \\
&= \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) m(x_i)\} + \frac{1}{n^2} \sum_{i \neq j} \frac{(1 - \delta_i) \delta_j h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)} + O(h^2) \\
&\approx \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) m(x_i)\} \\
&\quad + \frac{2}{n(n-1)} \sum_{i<j} \frac{1}{2} \left[ \frac{(1 - \delta_i) \delta_j h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)} + \frac{(1 - \delta_j) \delta_i h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \right] + O(h^2). \qquad (14)
\end{aligned}$$
Define
$$\zeta_{ij} = \frac{(1 - \delta_i) \delta_j h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)}, \qquad
\zeta_{ji} = \frac{(1 - \delta_j) \delta_i h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)}$$
and h(z_i, z_j) = (ζ_{ij} + ζ_{ji})/2 with z_i = (x_i, y_i, δ_i); then 2 Σ_{i<j} h(z_i, z_j)/{n(n − 1)} is a U-statistic. According to U-statistic theory (e.g., Serfling, 1980, Ch. 5), we have
$$\frac{2}{n(n-1)} \sum_{i<j} h(z_i, z_j) = \frac{2}{n} \sum_{i=1}^{n} E\{h(z_i, z_j) \mid z_i\} + o_p(n^{-1/2}). \qquad (15)$$
Let s = (x_j − x_i)/h. By nh² → ∞ and nh⁴ → 0, and by Taylor linearization, we have
$$\begin{aligned}
E(\zeta_{ij} \mid z_i) &= E\left[ \frac{(1 - \delta_i) \delta_j h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)} \,\Big|\, z_i \right] \\
&= \frac{1 - \delta_i}{\pi(x_i) f(x_i)} E\left[ \pi(x_j) \frac{1}{h} K\left(\frac{x_j - x_i}{h}\right) \{m(x_j) - m(x_i)\} \,\Big|\, z_i \right] \\
&= \frac{1 - \delta_i}{\pi(x_i) f(x_i)} \int \pi(x_i + hs) K(s) \{m(x_i + hs) - m(x_i)\} f(x_i + hs) \, ds \\
&= O(h^2) \qquad (16)
\end{aligned}$$
and
$$\begin{aligned}
E(\zeta_{ji} \mid z_i) &= E\left[ \frac{(1 - \delta_j) \delta_i h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \,\Big|\, z_i \right] \\
&= \delta_i E\left[ \frac{\{1 - \pi(x_j)\} h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \,\Big|\, z_i \right] \\
&= \delta_i \int \frac{1 - \pi(x_i + hs)}{\pi(x_i + hs)} K(s) \{y_i - m(x_i)\} \, ds + O(h^2) \\
&= \delta_i \frac{1 - \pi(x_i)}{\pi(x_i)} \{y_i - m(x_i)\} + O(h^2), \qquad (17)
\end{aligned}$$
where the density f(x_j) arising from the conditional expectation cancels the f(x_j) in the denominator.
According to (14)-(17), we have
$$\begin{aligned}
\hat{\theta}_{NP} &= \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) m(x_i)\} + \frac{2}{n(n-1)} \sum_{i<j} h(z_i, z_j) + o_p(n^{-1/2}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) m(x_i)\} + \frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{1 - \pi(x_i)}{\pi(x_i)} \{y_i - m(x_i)\} + o_p(n^{-1/2}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\delta_i}{\pi(x_i)} y_i - \left\{ \frac{\delta_i}{\pi(x_i)} - 1 \right\} m(x_i) \right] + o_p(n^{-1/2}).
\end{aligned}$$
Hence, we have (5) with σ² = E{σ²(x)/π(x)} + var{m(x)}, where σ²(x) = var(y | x).
Appendix B: Proof for (6) and (7)
We assume the following conditions.
(C1) The function m(x) = E(y | x) satisfies
$$|m(x_1) - m(x_2)| < C_1 \, d(x_1, x_2)$$
for some constant C_1 and all x_1, x_2.
(C2) The function m_2(x) = E(y² | x) satisfies
$$|m_2(x_1) - m_2(x_2)| < C_2 \, d(x_1, x_2)$$
for some constant C_2 and all x_1, x_2.
(C3) For each i ∈ U, the cumulative distribution function of d(x) = d(x_i, x), denoted by F(d), satisfies
$$\lim_{d \to 0} F(d)/d > 0,$$
where it is understood that F(0) > 0 satisfies the condition.
Proof. Define d_{ij} = d(x_i, x_j) for i ≠ j and let d_{j(1)} be the smallest value of d_{ij} among i ∈ A ∩ {j}^c. By definition, d_{j(1)} = d(x_j, x_{j(1)}). Similarly, we can define d_{j(2)} = d(x_j, x_{j(2)}). Note that, for a_n = n^{1−α},
$$\begin{aligned}
Pr\left( \max_j a_n d_{j(1)} > M \right) &\le \sum_{j=1}^{n} Pr\left( d_{j(1)} > M/a_n \right) \\
&= n \left[ 1 - F(M/a_n) \right]^n \\
&= n \exp\left[ a_n n^{\alpha} \log\{1 - F(M/a_n)\} \right] \\
&\le n \exp\left[ -a_n n^{\alpha} F(M/a_n) \right],
\end{aligned}$$
since log(1 − x) ≤ −x for x ∈ [0, 1). Thus, writing t = M/a_n, we have
$$Pr\left( \max_j a_n d_{j(1)} > M \right) \le n \exp\left[ -M n^{\alpha} F(t)/t \right],$$
which goes to zero as n → ∞, since, by (C3), F(t)/t is bounded below by some constant greater than zero. Thus, we have max_j a_n d_{j(1)} = O_p(1) and, by (C1), we have (6).
To prove (7), we use
$$\begin{aligned}
Pr\left( \max_j a_n d_{j(2)} > M \right) &\le \sum_{j=1}^{n} Pr\left( d_{j(2)} > M/a_n \right) \\
&= n \left[ \{1 - F(M/a_n)\}^n + n \{1 - F(M/a_n)\}^{n-1} F(M/a_n) \right] \\
&= n \{1 - F(M/a_n)\}^{n-1} \{1 + (n - 1) F(M/a_n)\} \\
&= n \exp\left[ (a_n n^{\alpha} - 1) \log\{1 - F(M/a_n)\} \right] \{1 + (n - 1) F(M/a_n)\} \\
&\le n \exp(1) \exp\left[ -a_n n^{\alpha} F(M/a_n) \right] \{1 + n F(M/a_n)\} \\
&\le K n^2 \exp\left[ -M n^{\alpha} F(t)/t \right]
\end{aligned}$$
for some constant K, where t = M/a_n. Thus, for any ε > 0, we have Pr(max_j a_n d_{j(2)} > M) ≤ ε for sufficiently large n. Therefore, max_j a_n d_{j(2)} = O_p(1) and, by (C1), (7) is proved.