Two-phase Fractional Hot Deck Imputation
Jongho Im, Jae-kwang Kim, Wayne A. Fuller
Iowa State University
2015 Joint Statistical Meetings, Seattle, August 2015

Outline
1. Missing data problem
2. Estimation
3. Imputed data set
4. Variance estimation
5. Simulation

Fuller (ISU) Fractional Hot Deck Imputation 2015 JSM

Data
$A$: index set of the sample
$(x_i, y_i, z_i, \delta_i)$, $i \in A$
$x_i$: discrete, $\{1, 2, \ldots, G\}$
$y_i$: variable of interest
$z_i$: discrete version of $y_i$, $\{1, 2, \ldots, H\}$
$\delta_i$: response indicator function of $y_i$
\[
\delta_i = \begin{cases} 1 & \text{if } y_i \text{ is observed} \\ 0 & \text{if } y_i \text{ is missing} \end{cases}
\]
$w_i$: sampling weight

Discrete Variables
$x_i = g$, $z_i = h$ defines cell $gh$.
Assume missing at random:
\[
\pi_{h|g} \equiv P\{z = h \mid x = g\} = P\{z = h \mid x = g, \delta = 1\}
\]

Parameter Estimation
\[
\hat\pi_{h|g} = \Big( \sum_{i \in A} w_i \delta_i I(x_i = g) \Big)^{-1} \sum_{i \in A} w_i \delta_i I(x_i = g, z_i = h)
\]
Note:
1. $z_i$ is observed only for $\delta_i = 1$.
2. $\sum_{h=1}^{H} \hat\pi_{h|g} = 1$.

Discrete Variable Fractional Imputation
Fully efficient fractional imputation (FEFI)
$z_i$: some missing
$x_i$: all observed
$H$ imputed "observations" for each $i$ with $\delta_i = 0$
Weight of an imputed observation: $w_j w_{hj}^*$, where $w_{hj}^* = \hat\pi_{h|g}$ for $x_j = g$
Imputed data set: $n_{obs}$ observed records plus $n_{mis} H$ imputed records

Continuous y-variable
Efficient estimator
\[
\hat\mu_{y,FE} = \sum_{g} \sum_{h} \hat\pi_{gh} \bar y_{gh}
\]
$\hat\pi_{gh}$ = estimated fraction for cell $gh$
$\bar y_{gh}$ = sample cell mean
\[
\bar y_{gh} = \frac{\sum_{i \in A} w_i \delta_i I(x_i = g, z_i = h)\, y_i}{\sum_{i \in A} w_i \delta_i I(x_i = g, z_i = h)}
\]

Continuous y-variable (Cont'd)
Expression using fractional imputation
\[
\hat\mu_{y,FEFI} = \sum_{g=1}^{G} \sum_{j \in A} w_j I(x_j = g) \Big\{ \delta_j y_j + (1 - \delta_j) \sum_{i} w_{ij}^* y_i \Big\}
\]
where, for a donor $i$ in cell $gh$ and a recipient $j$ with $x_j = g$,
\[
w_{ij}^* = \hat\pi_{h|g} \times \frac{w_i \delta_i I(x_i = g, z_i = h)}{\sum_{k \in A} w_k \delta_k I(x_k = g, z_k = h)}
\]

Missing Data Illustration

Observation   Weight   x_i   y_i
1             0.2      1     1
2             0.2      2     1
3             0.2      2     1
4             0.2      2     2
5             0.2      2     Miss

Missing Data Imputed

Observation   Weight   Fractional Wgt   Final Wgt   x_i   y_i
1             0.2      1.00             0.200       1     1
2             0.2      1.00             0.200       2     1
3             0.2      1.00             0.200       2     1
4             0.2      1.00             0.200       2     2
5             0.2      0.67             0.133       2     1
5             0.2      0.33             0.067       2     2

Final Wgt = Weight × Fractional Wgt, so the two imputed rows for observation 5 sum to its original weight of 0.2.

Continuous y-variable (Cont'd)
Sample y-values: two-phase sampling approach
1. Compute $w_{ij}^*$ for fully efficient fractional imputation. (Define $z_i$ and compute $\hat\pi_{h|g}$ for each $g$ and $h$.)
2. Select $M$ imputed values randomly, with selection probability proportional to $w_{ij}^*$.
Select $M$ cells with replacement, with probability proportional to $\hat\pi_{h|g}$, and then select an element within each selected cell with probability proportional to $w_i$.
If a cell is selected twice, we select two donors from the cell.
The fractional weights are then $w_{ij}^* = 1/M$.

Continuous y-variable (Cont'd)
Random FI: two-phase sampling approach
Define $d_{ij} = 1$ if unit $i$ is selected as a donor for unit $j$, and $d_{ij} = 0$ otherwise. Let $x_j = g$. Note that we have
\[
d_{ij} = \sum_{h=1}^{H} d_{h|j}^{(1)} d_{i|hg}^{(2)}
\]
where, for $x_j = g$,
\[
E_I\big(d_{h|j}^{(1)}\big) = \hat\pi_{h|g}
\]
and
\[
E_I\big(d_{i|hg}^{(2)}\big) = \frac{w_i \delta_i I(x_i = g, z_i = h)}{\sum_{k \in A} w_k \delta_k I(x_k = g, z_k = h)}
\]

Replication variance estimation
Replicates for FEFI
\[
\hat\mu_{y,FEFI}^{(k)} = \sum_{g} \sum_{h} \hat\pi_{gh}^{(k)} \bar y_{gh}^{(k)}
\]
$\hat\pi_{gh}^{(k)}$ = replicated cell fraction for $gh$ using $w_i^{(k)}$
$\bar y_{gh}^{(k)}$ = replicated cell mean
\[
\bar y_{gh}^{(k)} = \frac{\sum_{i \in A} w_i^{(k)} \delta_i I(x_i = g, z_i = h)\, y_i}{\sum_{i \in A} w_i^{(k)} \delta_i I(x_i = g, z_i = h)}
\]

Replication variance estimation
Replicates for random FI
The fractional weights $w_{ij}^* = 1/M$ are replicated to reflect the variation due to sampling of donors.
Imputed values remain the same.
\[
\hat\mu_{y,FI}^{(k)} = \sum_{g=1}^{G} \sum_{j \in A} w_j^{(k)} I(x_j = g) \Big\{ \delta_j y_j + (1 - \delta_j) \sum_{i} w_{ij}^{*(k)} y_i \Big\}
\]

Extension to Multivariate Case
Define categories for each variable.
Estimate cell proportions $\pi_{ght}$.
Select donors using the fractional weights for the FEFI method.

Simulation
Trivariate data:
\[
Y_1 \sim U(0, 2), \qquad Y_2 = 1 + Y_1 + e_2, \quad e_2 \sim N(0, 1/4), \qquad Y_3 = 2 + Y_1 + 0.5 Y_2 + e_3, \quad e_3 \sim N(0, 1)
\]
Response indicators are Bernoulli with $p = (0.5, 0.7, 0.9)$ for $(Y_1, Y_2, Y_3)$, respectively.
A categorical transformation (basically with 3 categories) was applied to each of $Y_1$, $Y_2$, and $Y_3$.
$B = 2{,}000$ simulation samples of size $n = 300$.

Monte Carlo Results (2,000 samples)

Parameter                        Estimator   Std Variance   Rel Bias of V̂(µ̂)
Mean Y1                          FEFI        1.59           -2.9
Mean Y1                          FI          1.63           -2.8
Mean Y3                          FEFI        1.08            0.3
Mean Y3                          FI          1.09            0.2
Proportion P(Y1 < 1, Y2 < 2)     FEFI        1.48            5.1
Proportion P(Y1 < 1, Y2 < 2)     FI          1.51            5.7

Std Variance: relative to full sample.

Summary
Imputation procedure approximates the conditional distribution.
Nearly unbiased under the cell mean model.
Replication variance estimator performed well.

The end
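
Worked sketch of the missing data illustration

The five-observation illustration and the two-phase selection steps can be checked with a short Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `cell_probs`, `fefi`, `random_fi`, and `weighted_mean` are hypothetical, the toy data take z = y because y has only the values 1 and 2, and donors within a cell are drawn independently with replacement (the slides do not say which within-cell scheme is used).

```python
from collections import defaultdict
import random

# Toy sample from the illustration: (weight, x, y); y = None marks a missing value.
# Assumption: y takes only the values 1 and 2 here, so z = y is the discrete version.
sample = [(0.2, 1, 1), (0.2, 2, 1), (0.2, 2, 1), (0.2, 2, 2), (0.2, 2, None)]


def cell_probs(data):
    """pi_hat[g][h]: weighted share of respondents with x = g that fall in cell h."""
    totals = defaultdict(float)
    cells = defaultdict(lambda: defaultdict(float))
    for w, x, y in data:
        if y is not None:
            totals[x] += w
            cells[x][y] += w
    return {g: {h: v / totals[g] for h, v in hs.items()} for g, hs in cells.items()}


def fefi(data):
    """Fully efficient FI: one imputed row per cell h for every nonrespondent.

    Returns rows (unit, x, y, final_weight) with final_weight = w_j * pi_hat[h|g].
    """
    pi = cell_probs(data)
    rows = []
    for j, (w, x, y) in enumerate(data, start=1):
        if y is not None:
            rows.append((j, x, y, w))
        else:
            for h, p in sorted(pi[x].items()):
                rows.append((j, x, h, w * p))
    return rows


def random_fi(data, M, rng):
    """Two-phase selection of M donors per nonrespondent, fractional weight 1/M each."""
    pi = cell_probs(data)
    rows = []
    for j, (w, x, y) in enumerate(data, start=1):
        if y is not None:
            rows.append((j, x, y, w))
            continue
        hs = sorted(pi[x])
        # Phase 1: draw M cells with replacement, probability proportional to pi_hat[h|g].
        picked = rng.choices(hs, weights=[pi[x][h] for h in hs], k=M)
        for h in picked:
            # Phase 2: draw one donor inside cell (g, h), probability proportional to w_i.
            donors = [(wi, yi) for wi, xi, yi in data if xi == x and yi == h]
            _, y_star = rng.choices(donors, weights=[d[0] for d in donors], k=1)[0]
            rows.append((j, x, y_star, w / M))
    return rows


def weighted_mean(rows):
    """Weighted mean of y over an imputed data set."""
    total = sum(r[3] for r in rows)
    return sum(r[3] * r[2] for r in rows) / total


for row in fefi(sample):
    print(row)
print(weighted_mean(fefi(sample)))
print(weighted_mean(random_fi(sample, M=4, rng=random.Random(0))))
```

For observation 5, the respondents with x = 2 have y = 1, 1, 2, so the fractional weights are 2/3 and 1/3 and the final weights are 0.2 × 2/3 ≈ 0.133 and 0.2 × 1/3 ≈ 0.067. In both functions the imputed rows for a nonrespondent add up to its original sampling weight, which is why the total weight of the imputed data set is unchanged.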