Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Jae-Kwang Kim (joint work with Shu Yang)
Iowa State University
June 26, 2013

1 Introduction
2 Review
3 Fractional Hot deck imputation
4 Simulation Study
5 Conclusion

Kim (ISU) Fractional Hot Deck Imputation June 26, 2013

1 Introduction

Basic Setup

- $U = \{1, 2, \ldots, N\}$: index set of the finite population.
- $(x_i, y_i)$: study variables for unit $i$ in the population.
- $\eta$: parameter of interest, defined as the solution to
  \[ \sum_{i=1}^{N} U(\eta; x_i, y_i) = 0. \]
- Examples:
  1. Population mean: $U(\eta; x, y) = y - \eta$
  2. Population proportion of $Y$ less than $q$: $U(\eta; x, y) = I(y < q) - \eta$
  3. Population $p$-th quantile: $U(\eta; x, y) = I(y < \eta) - p$
  4. Population regression coefficient: $U(\eta; x, y) = (y - x\eta)x'$
  5. Domain mean: $U(\eta; x, y) = (y - \eta)D(x)$

Basic Setup (Cont'd)

- $A$: index set of the sample ($A \subset U$) obtained from a probability sampling design, with $\pi_i$ being the first-order inclusion probability of unit $i$.
- From the sample, we collect measurements of $(x_i, y_i)$.
- Under complete response, a consistent estimator of $\eta$ can be obtained by solving
  \[ \sum_{i \in A} w_i U(\eta; x_i, y_i) = 0 \qquad (1) \]
  for $\eta$, where $w_i = \pi_i^{-1}$.
- Under some regularity conditions, the solution to (1) is consistent and asymptotically normally distributed (Binder and Patak, 1994).

Basic Setup (Cont'd)

- Assume that $x_i$ are always observed and $y_i$ are subject to nonresponse. Define
  \[ \delta_i = \begin{cases} 1 & \text{if } y_i \text{ is observed} \\ 0 & \text{otherwise.} \end{cases} \]
- A consistent estimator of $\eta$ is then obtained by taking the conditional expectation and solving $\bar{U}(\eta) = 0$ for $\eta$, where
  \[ \bar{U}(\eta) = \sum_{i \in A} w_i \left[ \delta_i U(\eta; x_i, y_i) + (1 - \delta_i) E\{ U(\eta; x_i, Y) \mid x_i, \delta_i = 0 \} \right]. \qquad (2) \]

How to compute the conditional expectation in (2)?

1. Often, start by assuming missing-at-random (MAR). That is,
   \[ f(y \mid x, \delta) = f(y \mid x). \]
2. Build a (parametric) model for $f(y \mid x)$. That is,
   \[ f(y \mid x) = f(y \mid x; \theta) \]
   for some $\theta$.
3. Obtain a consistent estimator $\hat\theta$ of $\theta$ from the set of respondents. That is, solve
   \[ \sum_{i \in A} w_i \delta_i S(\theta; x_i, y_i) = 0 \]
   for $\theta$, where $S(\theta; x, y)$ is the score function of $\theta$.
4. Compute the conditional expectation by a Monte Carlo approximation using samples from $f(y \mid x; \hat\theta)$:
   \[ E\{ U(\eta; x_i, Y) \mid x_i \} \cong \frac{1}{M} \sum_{j=1}^{M} U(\eta; x_i, y_i^{*(j)}), \quad \text{where } y_i^{*(j)} \sim f(y \mid x_i; \hat\theta). \]

Imputation

- Imputation: Monte Carlo approximation of the conditional expectation (given the observed data),
  \[ E\{ U(\eta; x_i, Y) \mid x_i \} \cong \frac{1}{M} \sum_{j=1}^{M} U(\eta; x_i, y_i^{*(j)}). \]
  1. Bayesian approach: generate $y_i^*$ from $f(y_i \mid x_i, \theta^*)$, where $\theta^*$ is generated from $p(\theta \mid x, y)$.
  2. Frequentist approach: generate $y_i^*$ from $f(y_i \mid x_i; \hat\theta)$, where $\hat\theta$ is a consistent estimator.
- Once the conditional expectation is computed (approximately), we can obtain $\hat\eta$ by solving the imputed estimating equation.

Imputation: Remark

- Imputation can be applied even when the parameter of interest $\eta$ is not known at the time of imputation. Thus, it is a useful tool for general-purpose estimation.
- It works even when $M = 1$ (single imputation).
- To reduce the variance and to enable variance estimation, $M > 1$ is often used.
  - Bayesian approach: multiple imputation of Rubin (1987).
  - Frequentist approach: parametric fractional imputation of Kim (2011).

2 Review

Multiple imputation

- Generate $M$ imputed values (with equal weights).
- Features:
  1. Imputed values are generated from the posterior predictive distribution, which is the average of $f(y_i \mid x_i; \theta)$ over the posterior distribution $\pi(\theta \mid x, y_{obs})$.
  2. The variance estimation formula is simple (Rubin's formula):
     \[ \hat{V}_{MI}(\bar\eta_M) = W_M + \left(1 + \frac{1}{M}\right) B_M, \]
     where $W_M = M^{-1} \sum_{m=1}^{M} \hat{V}_{I(m)}$, $B_M = (M - 1)^{-1} \sum_{m=1}^{M} (\hat\eta_{(m)} - \bar\eta_M)^2$, $\bar\eta_M = M^{-1} \sum_{m=1}^{M} \hat\eta_{(m)}$ is the average of the $M$ imputed estimators of $\eta$, and $\hat{V}_{I(m)}$ is the imputed version of the variance estimator of $\hat\eta$ under complete response.

Multiple imputation: Remark

- The sampling design is incorporated by including $w_i$ among the covariates in order to make the sampling design non-informative. Thus, the imputed values are generated from the sample model, not from the population model:
  \[ y_i^* \sim f(y \mid x_i, I_i = 1), \]
  where $I_i$ is the indicator function for sample inclusion.
- MAR is assumed at the sample level,
  \[ f(y \mid x, I = 1, \delta = 0) = f(y \mid x, I = 1, \delta = 1), \]
  which is different from MAR at the population level,
  \[ f(y \mid x, \delta = 0) = f(y \mid x, \delta = 1). \]

Multiple imputation: Remark (Cont'd)

- If the sampling design is non-informative, then the sample model and the population model are equivalent, and the sample MAR and the population MAR are equivalent.
- Variance estimation (using Rubin's formula) does not work when the sampling design is informative.
- Even when the sampling design is non-informative, consistency of the variance estimator is questionable (Kim et al., 2006).

Multiple imputation: Variance estimation

- Rubin's formula is based on the following decomposition:
  \[ V(\hat\eta_{MI}) = V(\hat\eta_n) + V(\hat\eta_{MI} - \hat\eta_n). \]
  Basically, the $W_M$ term estimates $V(\hat\eta_n)$ and the $(1 + M^{-1}) B_M$ term estimates $V(\hat\eta_{MI} - \hat\eta_n)$.
- In general, we have
  \[ V(\hat\eta_{MI}) = V(\hat\eta_n) + V(\hat\eta_{MI} - \hat\eta_n) + 2\,\mathrm{Cov}(\hat\eta_{MI} - \hat\eta_n, \hat\eta_n), \]
  and the covariance term can be non-negligible.
- The condition of zero covariance is called congeniality by Meng (1994). Congeniality holds when $\hat\eta_{MI}$ is a smooth function of the MLE of $\theta$ in $f(y \mid x; \theta)$.
  Otherwise, Rubin's variance estimator can be biased, which will be discussed in the simulation section.

Parametric Fractional Imputation

Parametric fractional imputation of Kim (2011):

1. More than one (say $M$) imputed values of $y_i$, namely $y_i^{*(1)}, \ldots, y_i^{*(M)}$, are generated from some (initial) density $h(y_i \mid x_i)$.
2. Create the weighted data set
   \[ \left\{ (w_i w_{ij}^*, x_i, y_i^{*(j)});\ j = 1, 2, \ldots, M;\ i \in A \right\}, \]
   where
   \[ w_{ij}^* \propto f(y_i^{*(j)} \mid x_i; \hat\theta) / h(y_i^{*(j)} \mid x_i) \]
   and $\hat\theta$ is the (pseudo) maximum likelihood estimator of $\theta$.
3. The weights $w_{ij}^*$ are the normalized importance weights and are called fractional weights.

Parametric Fractional Imputation (Cont'd)

- Product: fractionally imputed data set of size $nM$,
  \[ \left\{ (w_i w_{ij}^*, x_i, y_i^{*(j)});\ j = 1, 2, \ldots, M;\ i \in A \right\}. \]
- Property: for sufficiently large $M$,
  \[ \sum_{j=1}^{M} w_{ij}^* g(x_i, y_i^{*(j)}) \cong \frac{\int g(x_i, y) \frac{f(y \mid x_i; \hat\theta)}{h(y \mid x_i)} h(y \mid x_i)\,dy}{\int \frac{f(y \mid x_i; \hat\theta)}{h(y \mid x_i)} h(y \mid x_i)\,dy} = E\{ g(x_i, Y) \mid x_i; \hat\theta \} \]
  for any $g$ such that the expectation exists.
- Can handle an informative sampling design by incorporating the sampling weights into the score equation. That is, solve
  \[ \sum_{i \in A} w_i \delta_i S(\theta; x_i, y_i) = 0, \qquad (3) \]
  where $S(\theta; x, y) = \partial \log f(y \mid x; \theta) / \partial \theta$ is the score function of $\theta$.

Parametric Fractional Imputation (Cont'd): Remark

- Imputed values are generated from the population model, not from the sample model:
  \[ y_i^* \sim f(y \mid x_i) \neq f(y \mid x_i, I_i = 1). \]
- Thus, we assume population MAR, not sample MAR.
- For variance estimation, either the linearization method or the replication method can be used.
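The parametric fractional imputation steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the implementation of Kim (2011): it assumes a normal linear model $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$, solves the weighted score equation (3) by weighted least squares, and uses a proposal density $h$ with the same mean but an inflated standard deviation. The function names `pfi_normal` and `fi_mean`, the parameter `h_scale`, and the toy data (equal weights, completely random response) are hypothetical choices made for the example.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def pfi_normal(x, y, delta, w, M=50, h_scale=1.5, seed=0):
    """Fractional imputation under an assumed model y | x ~ N(b0 + b1*x, s^2).

    Returns ((beta, sd), y_star, w_star); rows of y_star and w_star follow
    the nonrespondents in their original order.
    """
    rng = np.random.default_rng(seed)
    resp = delta == 1
    # Weighted score equation (3): for the normal model this is weighted least squares.
    X = np.column_stack([np.ones(resp.sum()), x[resp]])
    wr = w[resp]
    beta = np.linalg.solve(X.T @ (wr[:, None] * X), X.T @ (wr * y[resp]))
    resid = y[resp] - X @ beta
    sd = np.sqrt(np.sum(wr * resid ** 2) / np.sum(wr))
    # Initial density h(y | x): same mean, inflated spread (an arbitrary choice).
    mu_mis = beta[0] + beta[1] * x[~resp]
    y_star = rng.normal(mu_mis[:, None], h_scale * sd, size=(mu_mis.size, M))
    # Fractional weights: importance ratios f / h, normalized within each unit.
    ratio = (normal_pdf(y_star, mu_mis[:, None], sd)
             / normal_pdf(y_star, mu_mis[:, None], h_scale * sd))
    w_star = ratio / ratio.sum(axis=1, keepdims=True)
    return (beta, sd), y_star, w_star

def fi_mean(y, delta, w, y_star, w_star):
    """Fractionally imputed estimator of the population mean."""
    cond_mean = np.zeros(len(y))
    cond_mean[delta == 0] = np.sum(w_star * y_star, axis=1)
    filled = np.where(delta == 1, y, cond_mean)
    return np.sum(w * filled) / np.sum(w)

# Toy data (hypothetical): equal weights and completely random response.
rng = np.random.default_rng(1)
n = 200
x = rng.exponential(1.0, n)
y = 0.5 * x + rng.normal(0.0, 1.0, n)
delta = (rng.random(n) < 0.65).astype(int)
w = np.ones(n)

theta_hat, y_star, w_star = pfi_normal(x, y, delta, w)
eta_hat = fi_mean(y, delta, w, y_star, w_star)
print(eta_hat)
```

Setting `h_scale = 1.0` makes $h$ equal to the fitted model, in which case every fractional weight reduces to $1/M$.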

3 Fractional Hot deck imputation

Motivation

- Hot deck imputation:
  - Imputed values are real observations.
  - Very popular in household surveys.
- Want to implement a hot deck version of fractional imputation.
- Kim (2004) and Fuller and Kim (2005) already considered fractional hot deck imputation: $x$ is categorical in $f(y \mid x)$.
- Kim, Fuller and Bell (2011) extended the method of Kim (2004) to nearest neighbor imputation.
- We now want to extend it to the case when $x$ has continuous components.

Proposed method: Three steps

1. Fully efficient fractional imputation (FEFI) by choosing all the respondents as donors. That is, we use $M = n_R$ imputed values for each missing unit, where $n_R$ is the number of respondents in the sample.
2. Use systematic PPS sampling to select $m$ ($\ll n_R$) donors from the FEFI.
3. Use a calibration weighting technique to compute the final fractional weights (which lead to the same estimates as FEFI for some items).

Step 1: FEFI step

- Want to find the fractional weights $w_{ij}^*$ when the $j$-th imputed value $y_i^{*(j)}$ is taken from the $j$-th value in the set of respondents.
- Without loss of generality, we assume that the first $n_R$ elements respond and write $y_i^{*(j)} = y_j$.
- Recall that
  \[ w_{ij}^* \propto f(y_i^{*(j)} \mid x_i; \hat\theta) / h(y_i^{*(j)} \mid x_i) \]
  when $y_i^{*(j)}$ are generated from $h(y \mid x_i)$.
- We have only to find $h(y_i^{*(j)} \mid x_i)$ when we use $y_i^{*(j)} = y_j$.

Step 1: FEFI step (Cont'd)

- We can treat $\{y_i;\ \delta_i = 1\}$ as a realization from $f(y \mid \delta = 1)$, the marginal distribution of $y$ among respondents.
- Now, we can write
  \[
  \begin{aligned}
  f(y_j \mid \delta_j = 1) &= \int f(y_j \mid x, \delta_j = 1)\, f(x \mid \delta_j = 1)\,dx \\
  &= \int f(y_j \mid x)\, f(x \mid \delta_j = 1)\,dx \\
  &\cong \frac{1}{N_R} \sum_{k=1}^{N} \delta_k f(y_j \mid x_k),
  \end{aligned}
  \]
  where $N_R = \sum_{i=1}^{N} \delta_i$ is the population size of (potential) respondents.

Step 1: FEFI step (Cont'd)

- Using the survey weights, we can approximate
  \[ f(y_j \mid \delta_j = 1) \cong \frac{\sum_{k \in A_R} w_k f(y_j \mid x_k)}{\sum_{k \in A_R} w_k}, \]
  and the fractional weight for $y_i^{*(j)} = y_j$ becomes
  \[ w_{ij}^* \propto \frac{f(y_j \mid x_i; \hat\theta)}{\sum_{k \in A_R} w_k f(y_j \mid x_k; \hat\theta)} \qquad (4) \]
  with $\sum_{j \in A_R} w_{ij}^* = 1$, where $A_R = \{i \in A;\ \delta_i = 1\}$ and $\hat\theta$ is computed from the weighted score equation in (3).

Step 2: Sampling Step

- FEFI uses all the elements in $A_R$ as donors for each missing $i$.
- Want to reduce the number of donors to, say, $m = 10$.
- For each $i$, we can treat the FEFI donor set as a weighted population and apply a sampling method to select a smaller set of donors.
- The fractional weights (4) for FEFI can be used as the selection probabilities for PPS sampling.
- That is, our goal is to obtain a (systematic) PPS sample $D_i$ of size $m$ from the FEFI donor set of size $M = n_R$, using $w_{ij}^*$ as the selection probability assigned to the $j$-th element in $A_R$. (Note that $w_{ij}^*$ satisfies $\sum_{j=1}^{M} w_{ij}^* = 1$ and $w_{ij}^* > 0$.)

Step 3: Weighting Step

- After we select $D_i$ from the complete set of respondents, the selected donors in $D_i$ are assigned the initial fractional weights $w_{ij0}^* = 1/m$.
- The fractional weights are further adjusted to satisfy
  \[ \sum_{i \in A} w_i \Big\{ (1 - \delta_i) \sum_{j \in D_i} w_{ij,c}^*\, q(x_i, y_j) \Big\} = \sum_{i \in A} w_i \Big\{ (1 - \delta_i) \sum_{j \in A_R} w_{ij}^*\, q(x_i, y_j) \Big\} \qquad (5) \]
  for some $q(x_i, y_j)$, and $\sum_{j \in D_i} w_{ij,c}^* = 1$ for all $i$ with $\delta_i = 0$, where $w_{ij}^*$ is the fractional weight for the FEFI method, as defined in (4).
- Regarding the choice of the control function $q(x, y)$ in (5), we can use $q(x, y) = (y, y^2)'$, which will lead to fully efficient estimates for the mean and the variance of $y$.

Remark

- For variance estimation, a replication method can be used. The imputed values are not changed; only the fractional weights are changed for each replication. (Details skipped.)
- The proposed fractional hot deck imputation is less sensitive to misspecification of the model $f(y \mid x; \theta)$. (Details skipped.)
- The proposed method can be extended to the non-ignorable missing case under a parametric model assumption on the response mechanism. (Details skipped.)

4 Simulation Study

Simulation Study - Study One

Factors considered:

- Correct vs incorrect imputation model: to see the effect of model misspecification of $f(y \mid x)$.
- Imputation methods: MI, PFI, FHDI.
- Parameters of interest: mean, proportion.

Simulation Study - Study One: Simulation Setup

- Two sets of models:
  1. Model A: $y_i = 0.5 x_i + e_i$, where $x_i \sim \exp(1)$ and $e_i \sim N(0, 1)$.
  2. Model B: same as Model A except for $e_i \sim \{\chi^2(2) - 2\}/2$.
- Response mechanism: $y_i$ is observed only when $\delta_i = 1$, where
  \[ \delta_i \sim \mathrm{Bernoulli}(\pi_i), \quad \pi_i = \{1 + \exp(-0.2 - x_i)\}^{-1}. \]
  Thus, we have MAR with 65% overall response in both models.
- $B = 5{,}000$ Monte Carlo samples of size $n = 200$.
- We used $y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$ as the imputation model under both cases. (Thus, the imputation model is misspecified under Model B.)

Simulation Study - Study One: Simulation Setup (Cont'd)

- Two parameters considered:
  1. $\eta_1 = E(Y)$: the population mean of $y$.
  2. $\eta_2 = \Pr(Y < 1)$: the proportion of $Y$ less than one.
- Four estimators computed:
  1. Full sample estimator (FULL), computed using the full sample.
  2. Multiple imputation (MI) estimator with imputation size $m = 10$.
  3. Parametric fractional imputation (PFI) with imputation size $m = 10$.
  4. Fractional hot deck imputation (FHDI) with imputation size $m = 10$.

Simulation Study - Study One: Simulation Results under Model A

Table: Point estimation

  Parameter               Method   Mean   Var       Std Var
  $\eta_1 = \mu_y$        Full     .50    .00625    100
                          MI       .50    .00955    153
                          PFI      .50    .00907    145
                          FHDI     .50    .00926    148
  $\eta_2 = \Pr(Y < 1)$   Full     .68    .00107    100
                          MI       .68    .00130    126
                          PFI      .68    .00129    121
                          FHDI     .68    .00158    153

Table: Variance estimation

  Parameter            Method   R.B. (%)   t-statistic
  $V(\hat\eta_1)$      MI        0.66      0.32
                       PFI       2.18      1.11
                       FHDI      0.44      0.22
  $V(\hat\eta_2)$      MI       19.35      9.39
                       PFI       0.99      0.50
                       FHDI      5.19      2.56

Simulation Study - Study One: Discussion for Model A results

- Point estimation is unbiased for both parameters under the correct model.
- For $\eta_1 = E(Y)$, imputation increases the variance by roughly 45-53%:
  \[
  \begin{aligned}
  V(\hat\eta_{1,imp}) &\doteq \frac{1}{n} \sigma_y^2 + \left( \frac{1}{n_R} - \frac{1}{n} \right) \sigma_e^2 \\
  &\cong \frac{1.25}{200} + \frac{1}{200} \left( \frac{1}{0.65} - 1 \right) \\
  &= 0.00625 + 0.0027 = 0.00895,
  \end{aligned}
  \]
  and $0.00895 / 0.00625 = 1.43$.
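The arithmetic in the approximation above can be reproduced directly. The short check below assumes, as the slide does, that $n_R$ is replaced by its expected value $0.65\,n$, and it takes $\sigma_y^2 = 0.5^2 \cdot \mathrm{Var}(x) + \sigma_e^2 = 1.25$ from Model A with $\mathrm{Var}(x) = 1$ for $x \sim \exp(1)$. The variable names are chosen only for this illustration.

```python
n = 200                # sample size in Study One
response_rate = 0.65   # overall response rate stated in the setup
var_x = 1.0            # Var(x) for x ~ exp(1)
var_e = 1.0            # Var(e) for e ~ N(0, 1)

# Var(y) under Model A: y = 0.5 * x + e with x and e independent.
var_y = 0.5 ** 2 * var_x + var_e

# Expected number of respondents, used in place of n_R.
n_r = response_rate * n

var_full = var_y / n                        # variance of the full-sample mean
var_extra = (1.0 / n_r - 1.0 / n) * var_e   # added variance due to imputation
var_imp = var_full + var_extra
ratio = var_imp / var_full

print(f"full-sample variance : {var_full:.5f}")
print(f"imputation component : {var_extra:.5f}")
print(f"imputed variance     : {var_imp:.5f}")
print(f"relative variance    : {ratio:.2f}")
```

The unrounded total is about 0.00894; the 0.00895 on the slide comes from rounding the second term to 0.0027 before adding. Either way the ratio is about 1.43, somewhat below the simulated standardized variances of 145 to 153 in the point estimation table.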

Simulation Study - Study One: Discussion for Model A results (Cont'd)

- For $\eta_2 = \Pr(Y < 1)$, imputation increases the variance by roughly 25% for MI and PFI. Note that
  \[ \hat\eta_{2,imp} \cong \frac{1}{n} \sum_{i=1}^{n} \left[ \delta_i I(y_i < 1) + (1 - \delta_i) E\{ I(Y < 1) \mid x_i \} \right], \]
  where we used the imputation model in computing the conditional expectation. Thus, it "borrows strength" by making use of the normality assumption at the time of imputation.
- In some sense, the above imputation estimator can be viewed as a composite estimator, where a "composite" estimator is a weighted average of a "direct" estimator and a "synthetic" estimator.

Simulation Study - Study One: Discussion for Model A results (Cont'd)

- In fact, under full response, there are two estimators of $\eta_2 = \Pr(Y < 1)$:
  \[ \hat\eta_{2,MME} = n^{-1} \sum_{i=1}^{n} I(y_i < 1), \]
  \[ \hat\eta_{2,MLE} = \int_{-\infty}^{1} \frac{1}{\hat\sigma}\, \phi\!\left( \frac{y - \hat\mu}{\hat\sigma} \right) dy. \]
- The MLE is more efficient than the MME, but it is less robust.
- The congeniality condition holds when the MLE is used, but not when the MME is used.
- Rubin's variance estimator for MI requires the congeniality condition. FI does not require congeniality.

Simulation Study - Study One: Simulation Results under Model B

Table: Point estimation

  Parameter               Method   Mean    Var       Std Var
  $\eta_1 = \mu_y$        Full     .502    .00619    100
                          MI       .499    .00952    155
                          PFI      .501    .00917    148
                          FHDI     .500    .00911    148
  $\eta_2 = \Pr(Y < 1)$   Full     .748    .00093    100
                          MI       .729    .00149    159
                          PFI      .730    .00144    155
                          FHDI     .751    .00147    157

Table: Variance estimation ($m = 10$)

  Parameter            Method   R.B. (%)   t-statistic
  $V(\hat\eta_1)$      MI        1.43       0.71
                       PFI       1.15       0.57
                       FHDI      1.00       0.51
  $V(\hat\eta_2)$      MI       -3.08      -1.52
                       PFI       3.26       1.62
                       FHDI      4.50       2.22

Simulation Study - Study One: Discussion for Model B results

- Point estimation is unbiased for $\eta_1 = E(Y)$ even when the imputation model is incorrect. Note that, for $m \to \infty$, the imputed estimator of $\eta_1$ can be written as
  \[ \hat\eta_{1,imp} = \frac{1}{n} \sum_{i=1}^{n} \{ \delta_i y_i + (1 - \delta_i) \hat{y}_i \} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i, \]
  which is called the projection estimator. The second equality holds because the respondents' residuals $y_i - \hat{y}_i$ sum to zero when the regression with an intercept is fitted to the respondents. Kim and Rao (2012) showed design-consistency of the projection estimator.

Simulation Study - Study One: Discussion for Model B results (Cont'd)

- However, all imputed estimators are biased for $\eta_2 = \Pr(Y < 1)$. The biases are much larger for MI and PFI than for FHDI, with corresponding z-statistics of -34.8, -33.5, and 5.5 for MI, PFI, and FHDI, respectively.
- Note that the true error distribution is $e_i \sim \{\chi^2(2) - 2\}/2$, while the imputation model errors are generated from $e_i^* \sim N(0, \hat\sigma_e^2)$. (See the picture on the next page.)
- In FHDI, the donors are still generated from the true distribution; only the fractional weights are computed from the wrong model. Thus, the effect of model misspecification is less severe than for the other imputation methods, which create synthetic values from the wrong model.

[Figure: densities of the true error model and of the imputation model; horizontal axis x (from -1 to 5), vertical axis Density (from 0.0 to 0.8).]

Simulation Study - Study Two

- Bivariate data $(x_i, y_i)$ of size $n = 100$ with
  \[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i^2 - 1) + e_i, \qquad (6) \]
  where $(\beta_0, \beta_1, \beta_2) = (0, 0.9, 0.06)$, $x_i \sim N(0, 1)$, $e_i \sim N(0, 0.16)$, and $x_i$ and $e_i$ are independent.
- The variable $x_i$ is always observed, but the probability that $y_i$ responds is 0.5.
- The imputation model is
  \[ Y_i = \beta_0 + \beta_1 x_i + e_i. \]
  That is, the imputer's model uses the extra information $\beta_2 = 0$.
- From the imputed data, we fit model (6) and computed the power of a test of $H_0: \beta_2 = 0$ at the 0.05 significance level.
- In addition, we also considered the Complete-Case (CC) method, which uses only the complete cases for the regression analysis.

Simulation Study - Study Two

Table 5: Simulation results for the Monte Carlo experiment based on 10,000 Monte Carlo samples.

  Method   $E(\hat\theta)$   $V(\hat\theta)$   R.B. ($\hat{V}$)   Power
  MI       0.028             0.00056            1.81               0.044
  FI       0.046             0.00146            0.02               0.314
  CC       0.060             0.00234           -0.01               0.285

- Table 5 shows that MI provides a more efficient point estimator than the CC method, but its variance estimation is very conservative (more than 100% overestimation).
- Because of the serious positive bias of the MI variance estimator, the statistical power of the test based on MI is actually lower than that of the CC method.

5 Conclusion

Concluding Remarks

- Advantages:
  1. Hot deck imputation: uses real observations for imputed values.
  2. Robust against model misspecification.
  3. Applicable even when the sampling design is informative.
  4. Does not require the congeniality condition for valid variance estimation.
- Disadvantage: may have a higher imputation variance than imputation methods using synthetic values.

Future work

- Extension to single imputation ($m = 1$). The imputation variance component needs to be estimated.
- Instead of the calibration weighting step (Step 3), we may consider using balanced imputation (Chauvet et al., 2011).
- FHDI for multivariate missing data.
- To be presented at the ISI meeting in Hong Kong.
- To be implemented in SAS (in Proc Surveyimpute).

References

Binder, D. and Z. Patak (1994), 'Use of estimating functions for estimation from complex surveys', Journal of the American Statistical Association 89, 1035–1043.
Chauvet, G., J.-C. Deville and D. Haziza (2011), 'On balanced random imputation in surveys', Biometrika 98, 459–471.
Fuller, W. A. and J. K. Kim (2005), 'Hot deck imputation for the response model', Survey Methodology 31, 139–149.
Kim, J. K. (2004), 'Finite sample properties of multiple imputation estimators', The Annals of Statistics 32, 766–783.
Kim, J. K. (2011), 'Parametric fractional imputation for missing data analysis', Biometrika 98, 119–132.
Kim, J. K. and J. N. K. Rao (2012), 'Combining data from two independent surveys: a model-assisted approach', Biometrika 99, 85–100.
Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), 'On the bias of the multiple imputation variance estimator in survey sampling', Journal of the Royal Statistical Society: Series B 68, 509–521.
Kim, J. K., W. A. Fuller and W. R. Bell (2011), 'Variance estimation for nearest neighbor imputation for U.S. Census long form data', Annals of Applied Statistics 5, 824–842.
Meng, X. L. (1994), 'Multiple-imputation inferences with uncongenial sources of input (with discussion)', Statistical Science 9, 538–573.
Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, Wiley, New York.