4. Fractional Imputation (Part 3)

1 Semiparametric fractional imputation for the missing covariate problem

• Suppose that we are interested in estimating $\theta$ in $f(y \mid x; \theta)$.

• Now, $y$ is always observed and $x$ is subject to missingness.

• Assume MAR in the sense that $Pr(\delta = 1 \mid x, y)$ does not depend on $x$.

• In this case, the mean score equation for $\theta$ is
$$
\sum_{i=1}^{n} \left[ \delta_i S(\theta; x_i, y_i) + (1 - \delta_i) E\{S(\theta; X, y_i) \mid y_i\} \right] = 0,
$$
where
$$
E\{S(\theta; X, y_i) \mid y_i\} = \frac{\int S(\theta; x, y_i)\, f(y_i \mid x; \theta)\, g(x)\,dx}{\int f(y_i \mid x; \theta)\, g(x)\,dx}
$$
and $g(\cdot)$ is the density of the marginal distribution of $x$.

• If $g(x) = g(x; \alpha)$ for some $\alpha$, parametric FI can be used to compute
$$
E\{S(\theta; X, y_i) \mid y_i\} \cong \sum_{j=1}^{m} w_{ij}^{*}(\theta, \alpha)\, S(\theta; x_i^{*(j)}, y_i),
$$
where
$$
w_{ij}^{*}(\theta, \alpha) \propto \frac{f(y_i \mid x_i^{*(j)}; \theta)\, g(x_i^{*(j)}; \alpha)}{h(x_i^{*(j)})}
$$
and $h(\cdot)$ is the proposal density used to generate the imputed values $x_i^{*(j)}$.

• The parameter $\alpha$ is a nuisance parameter in the sense that we are not interested in estimating $\alpha$, but its estimation is needed to estimate the parameter of interest.

• Semiparametric approach: use a nonparametric model for $g(\cdot)$ but still use a parametric model for $f(\cdot)$.

• Without loss of generality, assume that the first $r$ units respond in both $x$ and $y$ and the remaining $n - r$ units are missing in $x$.

• A nonparametric model for $g(x)$ is that it belongs to the class
$$
\mathcal{G} = \left\{ g(x) = \sum_{i=1}^{r} \alpha_i I(x = x_i);\ \sum_{i=1}^{r} \alpha_i = 1,\ \alpha_i \ge 0 \right\}.
$$
Note that the dimension of the parameter $\alpha$ is $r$, which can go to infinity asymptotically.

• EM algorithm using semiparametric fractional imputation (a code sketch follows this section):

Step 0. For each unit with $\delta_i = 0$, assign $r$ imputed values of $x$, namely $x_i^{*(j)} = x_j$ for $j = 1, \ldots, r$. Let $\alpha_k^{(0)} = 1/r$.

Step 1. At the $t$-th EM iteration, compute the fractional weights
$$
w_{ij(t)}^{*} = \frac{f(y_i \mid x_i^{*(j)}; \theta^{(t)})\, \alpha_j^{(t)}}{\sum_{k=1}^{r} f(y_i \mid x_i^{*(k)}; \theta^{(t)})\, \alpha_k^{(t)}},
$$
where $\theta^{(0)}$ is the MLE of $\theta$ using only the full respondents.

Step 2. Using $w_{ij(t)}^{*}$ and $(x_i^{*(j)}, y_i)$, update the parameters by solving the imputed score equation
$$
\sum_{i=1}^{r} S(\theta; x_i, y_i) + \sum_{i=r+1}^{n} \sum_{j=1}^{r} w_{ij(t)}^{*} S(\theta; x_i^{*(j)}, y_i) = 0 \tag{1}
$$
and setting
$$
\hat{\alpha}_j^{(t+1)} = \frac{1}{n} \left( 1 + \sum_{i=r+1}^{n} w_{ij(t)}^{*} \right). \tag{2}
$$
The updating equations (1) and (2) are obtained by differentiating the log-likelihood with respect to $\theta$ and the nuisance parameters $\alpha = (\alpha_1, \ldots, \alpha_r)^{T}$, respectively.

• Yang and Kim (2014) established the $\sqrt{n}$-consistency and asymptotic normality of the SFI estimator of $\theta$ using von Mises calculus and V-statistics theory.
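To make Steps 0-2 concrete, here is a minimal sketch in Python/NumPy. The normal outcome model $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$ and all function and variable names (`sfi_em`, `b0`, `b1`, `sig2`) are assumptions made for this illustration only; the notes leave $f(y \mid x; \theta)$ generic.

```python
import numpy as np

def sfi_em(x, y, delta, n_iter=50):
    """Semiparametric FI EM for a missing covariate x (Steps 0-2).

    Illustrative outcome model: y | x ~ N(b0 + b1*x, sig2); delta[i] = 1
    when x[i] is observed.  The r observed x's serve as the donor pool."""
    resp = delta == 1
    xr, yr = x[resp], y[resp]             # complete cases (the r donors)
    r, n = len(xr), len(y)
    ym = y[~resp]                         # outcomes of units missing x
    # theta^(0): MLE from the full respondents only
    b1, b0 = np.polyfit(xr, yr, 1)
    sig2 = np.mean((yr - b0 - b1 * xr) ** 2)
    alpha = np.full(r, 1.0 / r)           # Step 0: alpha_k^(0) = 1/r
    for _ in range(n_iter):
        # Step 1: w*_ij proportional to f(y_i | x_j; theta) * alpha_j;
        # the Gaussian normalizing constant cancels in the normalization
        mu = b0 + b1 * xr
        dens = np.exp(-(ym[:, None] - mu[None, :]) ** 2 / (2 * sig2))
        w = dens * alpha[None, :]
        w /= w.sum(axis=1, keepdims=True)
        # Step 2a: solve the imputed score equation (1) -- weighted LS here
        X = np.concatenate([xr, np.tile(xr, len(ym))])
        Y = np.concatenate([yr, np.repeat(ym, r)])
        W = np.concatenate([np.ones(r), w.ravel()])
        b1, b0 = np.polyfit(X, Y, 1, w=np.sqrt(W))
        sig2 = np.sum(W * (Y - b0 - b1 * X) ** 2) / np.sum(W)
        # Step 2b: update alpha_j by equation (2)
        alpha = (1 + w.sum(axis=0)) / n
    return b0, b1, sig2, alpha
```

Note that `np.polyfit` minimizes $\sum_i \{w_i (y_i - \hat{y}_i)\}^2$, so the square roots of the fractional weights are passed as its `w` argument.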
2 Nonparametric fractional imputation

• Bivariate data: $(x_i, y_i)$.

• The $x_i$ are completely observed, but $y_i$ is subject to missingness.

• The joint distribution of $(x, y)$ is completely unspecified.

• Assume MAR in the sense that $P(\delta = 1 \mid x, y)$ does not depend on $y$.

• Without loss of generality, assume that $\delta_i = 1$ for $i = 1, \ldots, r$ and $\delta_i = 0$ for $i = r+1, \ldots, n$.

• We are only interested in estimating $\theta = E(Y)$.

• Let $K_h(x_i, x_j) = K((x_i - x_j)/h)$ be the kernel function with bandwidth $h$ such that $K(x) \ge 0$ and
$$
\int K(x)\,dx = 1, \qquad \int x K(x)\,dx = 0, \qquad \sigma_K^2 \equiv \int x^2 K(x)\,dx > 0.
$$
Examples include the following:
  – Boxcar kernel: $K(x) = \frac{1}{2} I(x)$
  – Gaussian kernel: $K(x) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2} x^2)$
  – Epanechnikov kernel: $K(x) = \frac{3}{4}(1 - x^2) I(x)$
  – Tricube kernel: $K(x) = \frac{70}{81}(1 - |x|^3)^3 I(x)$
where $I(x) = 1$ if $|x| \le 1$ and $I(x) = 0$ if $|x| > 1$.

• Nonparametric regression estimator of $m(x) = E(Y \mid x)$:
$$
\hat{m}(x) = \sum_{i=1}^{r} l_i(x)\, y_i, \qquad l_i(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{r} K\left(\frac{x - x_j}{h}\right)}. \tag{3}
$$
The estimator in (3) is often called the Nadaraya-Watson kernel estimator.

• Under some regularity conditions and under the optimal choice of $h$ (with $h^* = O(n^{-1/5})$), it can be shown that
$$
E\{\hat{m}(x) - m(x)\}^2 = O(n^{-4/5}).
$$
Thus, its convergence rate is slower than that of a parametric estimator.

• However, the imputed estimator of $\theta$ using (3) can achieve $\sqrt{n}$-consistency. That is,
$$
\hat{\theta}_{NP} = \frac{1}{n} \left( \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \hat{m}(x_i) \right) \tag{4}
$$
achieves
$$
\sqrt{n}\left(\hat{\theta}_{NP} - \theta\right) \longrightarrow N(0, \sigma^2), \tag{5}
$$
where $\sigma^2 = E\{v(x)/\pi(x)\} + V\{m(x)\}$, $m(x) = E(y \mid x)$, $v(x) = V(y \mid x)$, and $\pi(x) = E(\delta \mid x)$. A sketched proof for (5) is given in Appendix A.

• We can express $\hat{\theta}_{NP}$ as a nonparametric fractional imputation (NFI) estimator of the form
$$
\hat{\theta}_{NFI} = \frac{1}{n} \left( \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \sum_{j=1}^{r} w_{ij}^{*}\, y_i^{*(j)} \right),
$$
where $w_{ij}^{*} = l_j(x_i)$, with $l_j(\cdot)$ defined after (3), and $y_i^{*(j)} = y_j$. One may consider sampling $m$ donors from the set of respondents using the fractional weights to reduce the imputation size; further research is needed. (A code sketch of $\hat{\theta}_{NFI}$ follows this section.)

• References
  – Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association 89, 81–87.
  – Wang, D. and Chen, S. X. (2009). Empirical likelihood for estimating equations with missing values. The Annals of Statistics 37, 490–517.
  – Kim, J. K. and Yu, C. Y. (2011). A semi-parametric estimation of mean functionals with non-ignorable missing data. Journal of the American Statistical Association 106, 157–165.
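The estimator $\hat{\theta}_{NFI}$ above is only a few lines of code. A minimal sketch, assuming a Gaussian kernel and the bandwidth rule $h = O(n^{-1/5})$ from the notes; the function name and interface are illustrative.

```python
import numpy as np

def theta_nfi(x, y, delta, h=None):
    """Nonparametric FI estimator of theta = E(Y), equations (3)-(4).

    y[i] is observed when delta[i] = 1; Gaussian kernel with bandwidth h."""
    n = len(x)
    resp = delta == 1
    xr, yr = x[resp], y[resp]
    if h is None:
        h = np.std(x) * n ** (-0.2)       # h* = O(n^{-1/5})
    xm = x[~resp]                         # covariates of the nonrespondents
    # Nadaraya-Watson weights l_j(x_i), i.e. fractional weights w*_ij
    K = np.exp(-0.5 * ((xm[:, None] - xr[None, :]) / h) ** 2)
    w = K / K.sum(axis=1, keepdims=True)
    m_hat = w @ yr                        # \hat m(x_i) for each missing unit
    return (yr.sum() + m_hat.sum()) / n
```

Every respondent acts as a donor for every nonrespondent, with fractional weight $w_{ij}^{*} = l_j(x_i)$; the imputed values are the observed $y_j$ themselves.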
3 Nearest neighbor imputation

• Consider the same setup as in nonparametric fractional imputation.

• Note that the nonparametric estimator $\hat{m}(x)$ in (3) corresponding to the boxcar kernel is the local sample average of the $y_i$ over the neighbors within range $h$ in terms of the $x$'s. In the extreme case, we can let $h$ vary with $x$ so that the neighborhood includes only the nearest neighbor.

• That is, for the missing $y_j$, we use the observed covariate $x_j$ to identify the nearest neighbor of $y_j$.

• Let $j(1)$ be the index of the nearest neighbor of $j$, which satisfies
$$
d\left(x_{j(1)}, x_j\right) \le d\left(x_k, x_j\right), \quad k = 1, \ldots, r,
$$
where $d(x_i, x_j)$ is a distance function between $x_i$ and $x_j$.

• Similarly, the second nearest neighbor of $y_j$, indexed by $j(2)$, satisfies
$$
d\left(x_{j(1)}, x_j\right) \le d\left(x_{j(2)}, x_j\right) \le d\left(x_k, x_j\right), \quad k \in \{1, \ldots, r\} \setminus \{j(1)\}.
$$

• Let $y_{j(1)}^{*}$ and $y_{j(2)}^{*}$ be the first and the second nearest-neighbor values of $y_j$, respectively. Then, under some regularity conditions, we can show that
$$
\max_j \left| E(y_j \mid x_j) - E\left(y_{j(1)}^{*} \mid x_{j(1)}^{*}\right) \right| = O_p\left(n^{-1+\alpha}\right) \tag{6}
$$
and
$$
\max_j \left| E(y_j \mid x_j) - E\left(y_{j(2)}^{*} \mid x_{j(2)}^{*}\right) \right| = O_p\left(n^{-1+\alpha}\right) \tag{7}
$$
for any $\alpha > 0$. A sketched proof for (6) and (7) is given in Appendix B.

• In fact, we can obtain the same result for the $m$ nearest neighbors and then use
$$
\hat{\theta}_{NNFI} = \frac{1}{n} \left\{ \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \sum_{j=1}^{m} w_{ij}^{*}\, y_i^{*(j)} \right\}, \tag{8}
$$
where $y_i^{*(j)} = y_{i(j)}$ is the $j$-th nearest neighbor of unit $i$ and $w_{ij}^{*}$ is the fractional weight assigned to $y_i^{*(j)}$. For $d(x_i, x_j) = (x_i - x_j)^2$, we may use a Gaussian kernel to get $w_{ij}^{*} \propto \exp\{-d(x_i, x_{i(j)})\}$ with $\sum_{j=1}^{m} w_{ij}^{*} = 1$. (A code sketch of (8) follows this section.)

• Furthermore, we may consider a nearest neighbor based on predictive mean matching:

1. Assume $E(y \mid x) = m(x; \beta)$ for some $\beta$. Fit this working regression model to get $\hat{y}_i = m(x_i; \hat{\beta})$ for all $i = 1, 2, \ldots, n$.

2. Identify the nearest neighbor using $\hat{y}_i$. That is, find $j(1)$ satisfying
$$
d\left(\hat{y}_{j(1)}, \hat{y}_j\right) \le d\left(\hat{y}_k, \hat{y}_j\right), \quad k = 1, \ldots, r.
$$
Similarly, we can identify the $m$ nearest neighbors in terms of $\hat{y}_i$.

3. Apply the $m$ nearest neighbors from Step 2 to obtain the nearest neighbor imputation estimator of the form (8).

• Note that we can express $d(\hat{y}_k, \hat{y}_j) = d_{\hat{\beta}}(x_k, x_j)$ for some distance function $d_{\beta}(a, b)$ that also depends on $\beta = \hat{\beta}$. In this case, proving the consistency of the resulting estimator is more complicated; further research is needed.

• Matching on the predictive mean removes the curse of dimensionality in identifying the nearest neighbor. It is very popular in practice, but its theory is under-developed.

• Also, the real (observed) values of $y_i$ are used in the imputation. Such imputation is called hot deck imputation. (It will be covered in Week 5.)

• References
  – Abadie, A. and Imbens, G. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267.
  – Chen, J. and Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics 16, 113–132.
  – Kim, J. K., Fuller, W. A., and Bell, W. R. (2011). Variance estimation for nearest neighbor imputation for U.S. Census long form data. Annals of Applied Statistics 5, 824–842.
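A minimal sketch of the $m$-nearest-neighbor estimator (8) with the squared-distance, Gaussian-kernel weights suggested above; the function name and defaults are illustrative assumptions.

```python
import numpy as np

def theta_nnfi(x, y, delta, m=5):
    """m-nearest-neighbor fractional imputation, equation (8).

    Uses d(x_i, x_j) = (x_i - x_j)^2 and w*_ij proportional to
    exp{-d(x_i, x_{i(j)})}, normalized so that sum_j w*_ij = 1."""
    n = len(x)
    resp = delta == 1
    xr, yr = x[resp], y[resp]
    xm = x[~resp]
    d = (xm[:, None] - xr[None, :]) ** 2       # distances to all donors
    nn = np.argsort(d, axis=1)[:, :m]          # indices of m nearest donors
    d_nn = np.take_along_axis(d, nn, axis=1)
    w = np.exp(-d_nn)
    w /= w.sum(axis=1, keepdims=True)          # fractional weights
    y_imp = (w * yr[nn]).sum(axis=1)           # fractionally imputed values
    return (yr.sum() + y_imp.sum()) / n
```

For predictive mean matching, one would simply replace `x` by the fitted values $\hat{y}_i = m(x_i; \hat{\beta})$ before computing the distances.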
4 Application to measurement error models

• Bivariate data $(x, y)$.

• We are interested in estimating $\theta$ in $f(y \mid x; \theta)$ with $\theta \in \Omega$.

• Instead of observing $(x, y)$, we observe $(w, y)$, where $w = x + u$ and $u$ is a measurement error.

• Suppose that we have a separate sample, called the calibration sample or validation sample $V$, in which we observe both $x$ and $w$.

• In $V$, we obtain a nonparametric estimator of $g(x \mid w)$ using a kernel regression method. Specifying $g(x \mid w)$ is called the Berkson error model in the measurement error literature.

• We also assume that
$$
f(y \mid x, w) = f(y \mid x). \tag{9}
$$
If condition (9) is satisfied, the measurement error $u = w - x$ is called non-differential and the variable $w$ is said to be a surrogate for $x$.

• In the original sample, the imputed values of $x$ are obtained from $f(x \mid w, y) \propto g(x \mid w) f(y \mid x, w)$, which reduces, under assumption (9), to
$$
f(x \mid w, y) \propto g(x \mid w)\, f(y \mid x). \tag{10}
$$
The first component is estimated from the validation sample using nonparametric regression. The second component is computed under a parametric model assumption, $f(y \mid x) = f(y \mid x; \theta)$ for some $\theta$. Thus, the overall estimation problem becomes a semiparametric estimation problem.

• Specifically, the following semiparametric fractional imputation algorithm can be used (a code sketch follows this list):

1. From the validation sample, use a nonparametric regression technique to obtain $\hat{E}(x \mid w) = \sum_{i \in V} l_i(w)\, x_i$, where
$$
l_i(w) = \frac{K_h(w_i, w)}{\sum_{j \in V} K_h(w_j, w)}.
$$

2. For each unit $i$ in the original sample, use $m = n_V$ imputed values of $x_i$ by taking all the elements $x_j$ in $V$. That is, $x_i^{*(j)} = x_j$.

3. Compute the fractional weight associated with $x_i^{*(j)}$ by
$$
w_{ij(t)}^{*} = \frac{f(y_i \mid x_i^{*(j)}; \hat{\theta}^{(t)})\, K_h(w_i, w_j)}{\sum_{k=1}^{m} f(y_i \mid x_i^{*(k)}; \hat{\theta}^{(t)})\, K_h(w_i, w_k)}.
$$

4. Update the parameter value by maximizing
$$
l(\theta) = \sum_i \sum_j w_{ij(t)}^{*} \log f(y_i \mid x_i^{*(j)}; \theta)
$$
with respect to $\theta \in \Omega$ to get $\hat{\theta}^{(t+1)}$.

5. Go to Step 3 and repeat until convergence.

• The resulting semiparametric maximum likelihood estimator (SMLE) of $\theta$ is the solution to $E\{S(\theta; X, y) \mid y, w\} = 0$, which is actually the solution to $E\{S(\theta; X, y) \mid y, w; \theta, g\} = 0$, where $g$ is an infinite-dimensional nuisance parameter.

• The $\sqrt{n}$-consistency of the SMLE of $\theta$ can be established. (Further research is needed.)

• References
  – Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155, 138–154.
  – Schafer, D. W. (2001). Semiparametric maximum likelihood for measurement error model regression. Biometrics 57, 53–61.
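A minimal sketch of Steps 1-5, again under an assumed normal outcome model $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$ (the notes leave $f(y \mid x; \theta)$ generic); the function and variable names are illustrative.

```python
import numpy as np

def sfi_measurement_error(w_obs, y, x_val, w_val, n_iter=50, h=None):
    """Fractional imputation for covariate measurement error (Steps 1-5).

    (x_val, w_val) is the validation sample V; (w_obs, y) is the original
    sample.  Every x_j in V serves as an imputed value x_i^{*(j)}."""
    if h is None:
        h = np.std(w_val) * len(w_val) ** (-0.2)
    # Kernel weights K_h(w_i, w_j) linking original units to V (Step 2)
    K = np.exp(-0.5 * ((w_obs[:, None] - w_val[None, :]) / h) ** 2)
    # Step 1: \hat E(x | w_i) from the validation sample, used to initialize
    xhat = (K * x_val[None, :]).sum(axis=1) / K.sum(axis=1)
    b1, b0 = np.polyfit(xhat, y, 1)
    sig2 = np.mean((y - b0 - b1 * xhat) ** 2)
    for _ in range(n_iter):                   # Steps 3-5: iterate
        # Step 3: w*_ij proportional to f(y_i | x_j; theta) K_h(w_i, w_j)
        mu = b0 + b1 * x_val
        dens = np.exp(-(y[:, None] - mu[None, :]) ** 2 / (2 * sig2))
        wgt = dens * K
        wgt /= wgt.sum(axis=1, keepdims=True)
        # Step 4: maximize sum_ij w*_ij log f(y_i | x_j; theta) -- weighted LS
        X = np.tile(x_val, len(y))
        Y = np.repeat(y, len(x_val))
        W = wgt.ravel()
        b1, b0 = np.polyfit(X, Y, 1, w=np.sqrt(W))
        sig2 = np.sum(W * (Y - b0 - b1 * X) ** 2) / np.sum(W)
    return b0, b1, sig2
```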
Appendix A: Proof for (5)

Regularity conditions:

(C1) Moment conditions: $E(|m(x)|^{\alpha}) < \infty$ and $E(|y|^{\alpha}) < \infty$ for some $\alpha > 2$.

(C2) Boundedness condition: $1 > \pi(x) > d > 0$ almost surely.

(C3) Smoothness conditions: $f(x)$ and $\pi(x)$ have bounded partial derivatives with respect to $x$ up to order $q$, with $q \ge 2$ and $2q > d_x$, where $d_x$ is the dimension of $x$.

(C4) Kernel function conditions:
1. $K$ is bounded and has compact support.
2. $\int K(s_1, \ldots, s_{d_x})\, ds_1 \cdots ds_{d_x} = 1$.
3. $\int s_i^{l} K(s_1, \ldots, s_{d_x})\, ds_1 \cdots ds_{d_x} = 0$ for any $i = 1, \ldots, d_x$ and $1 \le l < q$.
4. $\int s_i^{q} K(s_1, \ldots, s_{d_x})\, ds_1 \cdots ds_{d_x} \ne 0$.

(C5) Bandwidth conditions: $n h^{2 d_x} \to \infty$ and $\sqrt{n}\, h^{q} \to 0$ as $n \to \infty$.

Proof. For simplicity, we only consider the case $d_x = 1$. By a standard argument in kernel smoothing, it can be shown that
$$
E\left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\left(\frac{x - x_j}{h}\right) y_j \right\} = \pi(x) f(x) m(x) + O(h^2) \tag{11}
$$
and
$$
E\left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\left(\frac{x - x_j}{h}\right) \right\} = \pi(x) f(x) + O(h^2). \tag{12}
$$
According to (11), (12) and a Taylor linearization, we have
$$
\begin{aligned}
\hat{m}(x) &= \frac{(nh)^{-1} \sum_{j=1}^{n} \delta_j K((x - x_j)/h)\, y_j}{(nh)^{-1} \sum_{j=1}^{n} \delta_j K((x - x_j)/h)} \\
&= m(x) + \frac{1}{\pi(x) f(x)} \left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\left(\frac{x - x_j}{h}\right) y_j - \pi(x) f(x) m(x) \right\} \\
&\quad - \frac{m(x)}{\pi(x) f(x)} \left\{ \frac{1}{nh} \sum_{j=1}^{n} \delta_j K\left(\frac{x - x_j}{h}\right) - \pi(x) f(x) \right\} + O(h^2). \tag{13}
\end{aligned}
$$
Therefore, we have
$$
\begin{aligned}
\hat{\theta}_{NP} &= \frac{1}{n} \sum_{i=1}^{n} \left\{ \delta_i y_i + (1 - \delta_i) \hat{m}(x_i) \right\} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left\{ \delta_i y_i + (1 - \delta_i) m(x_i) \right\} + \frac{1}{n^2} \sum_{i \ne j} \frac{(1 - \delta_i) \delta_j h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)} + O(h^2) \\
&\approx \frac{1}{n} \sum_{i=1}^{n} \left\{ \delta_i y_i + (1 - \delta_i) m(x_i) \right\} \\
&\quad + \frac{2}{n(n-1)} \sum_{i<j} \frac{1}{2} \left[ \frac{(1 - \delta_i) \delta_j h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)} + \frac{(1 - \delta_j) \delta_i h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \right] \\
&\quad + O(h^2). \tag{14}
\end{aligned}
$$
Define
$$
\zeta_{ij} = \frac{(1 - \delta_i) \delta_j h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)}, \qquad
\zeta_{ji} = \frac{(1 - \delta_j) \delta_i h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)},
$$
and let $h(z_i, z_j) = (\zeta_{ij} + \zeta_{ji})/2$ with $z_i = (x_i, y_i, \delta_i)$; then $2 \sum_{i<j} h(z_i, z_j)/\{n(n-1)\}$ is a U-statistic. According to U-statistic theory (e.g., Serfling, 1980, Ch. 5), we have
$$
\frac{2}{n(n-1)} \sum_{i<j} h(z_i, z_j) = \frac{2}{n} \sum_{i=1}^{n} E\{h(z_i, z_j) \mid z_i\} + o_p(n^{-1/2}). \tag{15}
$$
Let $s = (x_j - x_i)/h$. By $nh^2 \to \infty$, $nh^4 \to 0$ and a Taylor linearization, we have
$$
\begin{aligned}
E(\zeta_{ij} \mid z_i) &= E\left[ \frac{(1 - \delta_i) \delta_j h^{-1} K((x_j - x_i)/h) \{y_j - m(x_i)\}}{\pi(x_i) f(x_i)} \,\Big|\, z_i \right] \\
&= \frac{1 - \delta_i}{\pi(x_i) f(x_i)}\, E\left[ \pi(x_j)\, \frac{1}{h} K\left(\frac{x_j - x_i}{h}\right) \{m(x_j) - m(x_i)\} \,\Big|\, z_i \right] \\
&= \frac{1 - \delta_i}{\pi(x_i) f(x_i)} \int \pi(x_i + hs)\, K(s)\, \{m(x_i + hs) - m(x_i)\}\, f(x_i + hs)\, ds \\
&= O(h^2) \tag{16}
\end{aligned}
$$
and
$$
\begin{aligned}
E(\zeta_{ji} \mid z_i) &= E\left[ \frac{(1 - \delta_j) \delta_i h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \,\Big|\, z_i \right] \\
&= \delta_i\, E\left[ \frac{\{1 - \pi(x_j)\}\, h^{-1} K((x_i - x_j)/h) \{y_i - m(x_j)\}}{\pi(x_j) f(x_j)} \,\Big|\, z_i \right] \\
&= \delta_i \int \frac{1 - \pi(x_i + hs)}{\pi(x_i + hs)}\, K(s)\, \{y_i - m(x_i + hs)\}\, ds + O(h^2) \\
&= \delta_i\, \frac{1 - \pi(x_i)}{\pi(x_i)}\, \{y_i - m(x_i)\} + O(h^2). \tag{17}
\end{aligned}
$$
According to (14)-(17), we have
$$
\begin{aligned}
\hat{\theta}_{NP} &= \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) m(x_i)\} + \frac{2}{n(n-1)} \sum_{i<j} h(z_i, z_j) + o_p(n^{-1/2}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) m(x_i)\} + \frac{1}{n} \sum_{i=1}^{n} \delta_i\, \frac{1 - \pi(x_i)}{\pi(x_i)}\, \{y_i - m(x_i)\} + o_p(n^{-1/2}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\delta_i}{\pi(x_i)}\, y_i - \left\{ \frac{\delta_i}{\pi(x_i)} - 1 \right\} m(x_i) \right] + o_p(n^{-1/2}).
\end{aligned}
$$
Hence, we have (5) with $\sigma^2 = E\{v(x)/\pi(x)\} + V\{m(x)\}$, where $v(x) = V(y \mid x)$.

Appendix B: Proof for (6) and (7)

We assume the following conditions.

(C1) The function $m(x) = E(y \mid x)$ satisfies $|m(x_1) - m(x_2)| < C_1\, d(x_1, x_2)$ for some constant $C_1$ and for all $x_1$ and $x_2$.

(C2) The function $m_2(x) = E(y^2 \mid x)$ satisfies $|m_2(x_1) - m_2(x_2)| < C_2\, d(x_1, x_2)$ for some constant $C_2$ and for all $x_1$ and $x_2$.

(C3) For each $i \in U$, the cumulative distribution function of $d(x) = d(x_i, x)$, denoted by $F(d)$, satisfies
$$
\lim_{d \to 0} F(d)/d > 0,
$$
where it is understood that $F(0) > 0$ satisfies the condition.

Proof. Define $d_{ij} = d(x_i, x_j)$ for $i \ne j$ and let $d_{j(1)}$ be the smallest value of $d_{ij}$ among $i \in A \cap \{j\}^c$. By definition, $d_{j(1)} = d(x_j, x_{j(1)})$. Similarly, we can define $d_{j(2)} = d(x_j, x_{j(2)})$. Note that, for $a_n = n^{1-\alpha}$,
$$
\begin{aligned}
Pr\left( \max_j a_n d_{j(1)} > M \right) &\le \sum_{j=1}^{n} Pr\left( d_{j(1)} > M/a_n \right) \\
&= n \left[ 1 - F(M/a_n) \right]^{n} \\
&= n \exp\left[ a_n n^{\alpha} \log\{1 - F(M/a_n)\} \right] \\
&\le n \exp\left[ -a_n n^{\alpha} F(M/a_n) \right],
\end{aligned}
$$
since $\log(1 - x) \le -x$ for $x \in [0, 1)$. Thus, writing $t = M/a_n$, we have
$$
Pr\left( \max_j a_n d_{j(1)} > M \right) \le n \exp\left[ -M n^{\alpha} F(t)/t \right],
$$
which goes to zero as $n \to \infty$, since, by (C3), $F(t)/t$ is bounded below by some constant greater than zero. Thus, we have $\max_j a_n d_{j(1)} = O_p(1)$ and, by (C1), we have (6).

To prove (7), we use
$$
\begin{aligned}
Pr\left( \max_j a_n d_{j(2)} > M \right) &\le \sum_{j=1}^{n} Pr\left( d_{j(2)} > M/a_n \right) \\
&= n \left[ \{1 - F(M/a_n)\}^{n} + n \{1 - F(M/a_n)\}^{n-1} F(M/a_n) \right] \\
&= n \{1 - F(M/a_n)\}^{n-1} \{1 + (n-1) F(M/a_n)\} \\
&= n \exp\left[ (a_n n^{\alpha} - 1) \log\{1 - F(M/a_n)\} \right] \{1 + (n-1) F(M/a_n)\} \\
&\le n \exp(1) \exp\left[ -a_n n^{\alpha} F(M/a_n) \right] \{1 + n F(M/a_n)\} \\
&= K n^2 \exp\left[ -M n^{\alpha} F(t)/t \right] \ \text{for some } K,
\end{aligned}
$$
where $t = M/a_n$. Thus, for any $\epsilon > 0$, we have $Pr(\max_j a_n d_{j(2)} > M) \le \epsilon$ for sufficiently large $n$. Therefore, we have $\max_j a_n d_{j(2)} = O_p(1)$ and, by (C1), (7) is proved.
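As a quick numerical illustration of the Appendix B bound (not part of the original notes): with $x$ uniform on $[0, 1]$ and $d(x_i, x_j) = |x_i - x_j|$, condition (C3) holds, and $a_n \max_j d_{j(1)}$ should remain stochastically bounded. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                                   # any alpha > 0 works in (6)
for n in [10**3, 10**4, 10**5]:
    x = np.sort(rng.uniform(size=n))
    gaps = np.diff(x)
    # nearest-neighbor distance of each point = min of its adjacent gaps
    d1 = np.minimum(np.append(gaps, np.inf), np.append([np.inf], gaps))
    print(n, n ** (1 - alpha) * d1.max())     # a_n * max_j d_{j(1)}: bounded
```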