Statistical Matching Using Fractional Imputation
Jae-Kwang Kim, Iowa State University
(Joint work with Emily Berg and Taesung Park)

Outline
1. Introduction
2. Classical Approaches
3. Proposed Method
4. Application: Measurement Error Models
5. Simulation Study
6. Conclusion

Introduction: Motivation
- Goal: combine information from several surveys.
- Example (two surveys): Survey A observes X and Y1; Survey B observes X and Y2.
- We want to create a data file containing X, Y1, and Y2.
- If the Survey B sample is a subset of the Survey A sample, record linkage techniques can be used to obtain the Y1 value for each Survey B record.
- What if the two samples are independent?

Table: A simple data structure for matching

            X    Y1    Y2
Sample A    o    o
Sample B    o          o

Table: Data after statistical matching

            X    Y1    Y2
Sample A    o    o     o
Sample B    o    o     o

Statistical matching is also called data fusion or data combination.

Example 1: Split questionnaire design
- Split the original sample into two groups.
- Group 1 is asked (x, y1); group 2 is asked (x, y2).
- Often used to reduce response burden (and to improve the quality of the survey responses).

Example 2: Combining two surveys
- Survey A: a health survey; Survey B: a socio-economic survey.
- x: demographic variables; y1: health status variables; y2: socio-economic variables.
- We are interested in fitting a regression of y1 (e.g., obesity) on x and y2 using the two surveys.
- The two samples should be drawn from the same finite population.

Introduction: Idea
- We want to create Y1 for each element in Sample B by finding a "statistical twin" in Sample A.
- Matching is often based on the assumption that Y1 and Y2 are conditionally independent given X.
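As a concrete illustration of the "statistical twin" idea, the sketch below implements its simplest version: 1-nearest-neighbor matching on a single matching variable x. All names and the toy data are hypothetical, not from the talk.

```python
import numpy as np

def match_statistical_twins(x_a, y1_a, x_b):
    """For each record in Sample B, find the Sample A record whose x is
    closest (a 1-nearest-neighbor 'statistical twin') and donate its y1."""
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    # index of the nearest Sample A donor for each Sample B record
    donor = np.abs(x_b[:, None] - x_a[None, :]).argmin(axis=1)
    return np.asarray(y1_a)[donor]

# Sample A observes (x, y1); Sample B observes (x, y2).
x_a  = [1.0, 2.0, 3.0, 4.0]
y1_a = [10., 20., 30., 40.]
x_b  = [1.1, 3.9]
print(match_statistical_twins(x_a, y1_a, x_b))  # -> [10. 40.]
```

This is the distance hot-deck flavor of matching; the rank and random hot-deck variants differ only in how the donor is chosen.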
- In symbols: Y1 ⊥ Y2 | X.
- Under the conditional independence (CI) assumption, f(y1 | x, y2) = f(y1 | x), so the "statistical twin" is determined solely by how close two records are in terms of x.

Remark
- If (X, Y1, Y2) is multivariate normal, the CI assumption implies σ12 = σ1x σ2x / σxx, i.e., ρ12 = ρ1x ρ2x.
- That is, σ12 is determined by the other parameters rather than estimated from the realized samples.

Existing Methods: Methods under the CI assumption
- Synthetic data imputation:
  1. Estimate f(y1 | x) from Sample A; denote the estimate by f̂a(y1 | x).
  2. For each element in Sample B, use its xi value to create imputed value(s) from f̂a(y1 | x).
- Matching (a two-step method): instead of using the synthetic values directly as imputations, use them to identify statistical twins in Sample A; the identified twin's value serves as the imputed value.

Existing Methods: Some popular methods under the CI assumption
- Parametric approach: often based on a parametric regression model, ŷ1i = β̂0 + β̂1 xi.
- Nonparametric approaches: random hot deck, rank hot deck, and distance hot deck.
- Reference: D'Orazio, Di Zio, and Scanu (2006). Statistical Matching: Theory and Practice. Wiley.

New Approach: Motivation
- Under CI-based matching, the regression of Y1 on X and Y2 yields an insignificant coefficient on Y2; that is, the p-value for β̂2 in ŷ1 = β̂0 + β̂1 x + β̂2 y2 will be large.
- The CI assumption is often unrealistic! For example:
  1. X is a demographic variable,
  2. Y1 is a social-behavior (or public health) variable,
  3. Y2 is an economic variable (e.g.
household income).
- In such cases, we may have Corr(Y1, Y2 | X) ≠ 0.

New Approach: Alternative interpretation
- We can view the problem as an omitted-variable regression problem:
    y1 = β0(1) + β1(1) x + β2(1) z + e1
    y2 = β0(2) + β1(2) x + β2(2) z + e2
  where z, e1, and e2 are never observed, and e1 and e2 are independent.
- z is an unobservable confounding factor that explains Cov(y1, y2 | x) ≠ 0.
- Thus, if we fit a regression of (y1, y2) on x alone, the error terms are still correlated.

New Approach: Instrumental variable
- Under the CI assumption, imputed values are generated from f(y1 | x), which completely ignores the observed information in y2.
- Instead, we would like to generate imputed values from f(y1 | x, y2).
- However, the parameters in f(y1 | x, y2) cannot be estimated from the two samples directly; we use an instrumental variable assumption to identify the model.

New Approach: Idea
- Decompose X = (X1, X2) such that
  (i) f(y1 | x1, x2, y2) = f(y1 | x1, y2), and
  (ii) f(y1 | x1, x2 = a) ≠ f(y1 | x1, x2 = b) for some a ≠ b.
- X2 is often called an instrumental variable (IV) for Y2.

New Approach: Proposed method
- Under the IV assumption, f(y1 | x, y2) ∝ f(y1 | x) f(y2 | x1, y1).
- The second term can be ignored under the CI assumption; it is this term that incorporates the observed information in y2 from Sample B.
- An EM algorithm can be used to perform parameter estimation and prediction simultaneously, but the E-step can be computationally heavy (Markov chain Monte Carlo). A Metropolis-Hastings implementation:
  1. Generate y1* from f̂a(y1 | x).
  2. Accept y1* when f(y2 | x1, y1*; θ̂) is large at the current parameter value θ̂.

New Approach: Proposed method (continued)
- Parametric fractional imputation (PFI; Kim, 2011) is an alternative computational tool that avoids MCMC while still implementing the EM algorithm with an intractable E-step.
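The Metropolis-Hastings E-step sketched above can be made concrete. With an independence proposal drawn from f̂a(y1 | x), the acceptance ratio reduces to a ratio of f(y2 | x1, y1) values, which is exactly the "accept y1* when f(y2 | x1, y1*) is large" rule. The working densities below are toy normal models chosen purely for illustration, not the models of the talk:

```python
import math
import random

def mh_estep_draws(x, x1, y2, f2_loglik, fa_sampler, n_draws, burn=200):
    """Independence Metropolis-Hastings for the E-step: propose y1* from the
    Sample-A model fa(y1 | x).  Because the target is proportional to
    fa(y1 | x) f(y2 | x1, y1), the acceptance ratio reduces to the ratio of
    f(y2 | x1, y1) at the proposed and current values of y1."""
    y1 = fa_sampler(x)                    # initialize from the proposal
    draws = []
    for t in range(burn + n_draws):
        prop = fa_sampler(x)
        log_ratio = f2_loglik(y2, x1, prop) - f2_loglik(y2, x1, y1)
        if math.log(random.random()) < log_ratio:
            y1 = prop                     # accept: f(y2 | x1, y1*) is "large"
        if t >= burn:
            draws.append(y1)
    return draws

# Toy working models (illustrative assumptions): fa(y1 | x) = N(x, 1) and
# f(y2 | x1, y1) = N(x1 + y1, 1).
random.seed(1)
fa = lambda x: random.gauss(x, 1.0)
ll2 = lambda y2, x1, y1: -0.5 * (y2 - x1 - y1) ** 2
draws = mh_estep_draws(x=0.0, x1=0.0, y2=2.0,
                       f2_loglik=ll2, fa_sampler=fa, n_draws=500)
# For this toy model the target is N(1, 0.5), so the draws center near 1.
```

PFI replaces this chain with a fixed set of proposals and importance weights, which is what the next steps describe.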
- PFI uses importance sampling: when the target distribution is f(y1 | x, y2) ∝ f(y1 | x) f(y2 | x1, y1), first generate m values y1* ~ f(y1 | x), then use a normalized version of f(y2 | x1, y1*) as the weight assigned to each y1*.
- In the M-step, solve the weighted score equation to update the parameters.

Proposed method: Parametric fractional imputation
1. For each i ∈ B, generate m imputed values of y1, denoted y1i*(1), ..., y1i*(m), from f̂a(y1 | xi).
2. Let θ̂t be the current value of θ in f(y2 | x1, y1). To the j-th imputed value y1i*(j), assign the fractional weight
     wij* ∝ f(y2i | x1i, y1i*(j); θ̂t),  normalized so that Σ_{j=1}^m wij* = 1.
3. Solve the fractionally imputed score equation for θ,
     Σ_{i∈B} wib Σ_{j=1}^m wij* S(θ; x1i, y1i*(j), y2i) = 0,
   to obtain θ̂t+1, where S(θ; x1, y1, y2) = ∂ log f(y2 | x1, y1; θ) / ∂θ and wib is the sampling weight of unit i in Sample B.
4. Go to step 2 and continue until convergence.

Remark
- Fractional imputation can be understood as a tool for computing a Monte Carlo approximation of the conditional expectation given the observed data.
- The fractionally imputed data file can be used to estimate many different parameters. If a parameter η is defined as the solution to E{U(η; X, Y1, Y2)} = 0, then a consistent estimator of η is obtained as the solution to
     Σ_{i∈B} wib Σ_{j=1}^m wij* U(η; xi, y1i*(j), y2i) = 0.
- This estimating equation is a Monte Carlo approximation to
     Σ_{i∈B} wib E{U(η; xi, Y1i, y2i) | xi, y2i} = 0.
- For variance estimation, a linearization method can be used (skipped here).

Application to Measurement Error Models
- We are interested in estimating θ in f(y | x; θ).
- Instead of observing x, we observe z, which can be highly correlated with x.
- Thus z serves as an instrumental variable for x: f(y | x, z) = f(y | x) and f(y | z = a) ≠ f(y | z = b) for a ≠ b.
- In addition to the original sample, we have a separate calibration sample in which (xi, zi) is observed.

Example: Measurement error model

Table: External calibration study

            Z    X    Y
Sample A    o    o
Sample B    o         o

Table: Internal calibration study

                           Z    X    Y
Validation subsample       o    o    o
Non-validation subsample   o         o

Remark
- An internal calibration study has a two-phase sampling structure:
  Phase one: observe (z, y).
  Phase two (the validation subsample): observe x in addition to (z, y).
- Imputation approach for two-phase sampling: estimate f(x | z, y) from the second-phase sample; for the elements in the phase-one sample, generate x ~ f̂(x | z, y).
- For an external calibration study, we use the proposed statistical matching technique under the assumption that f(y | x, z) = f(y | x).

Proposed method: Idea
- In Sample B, x is a latent variable (a variable that is always missing).
- The goal is to generate x in Sample B from
     f(xi | zi, yi) ∝ f(xi | zi) f(yi | xi, zi) = f(xi | zi) f(yi | xi).
- Obtain a consistent estimator f̂a(x | z) from Sample A.
- A Monte Carlo EM algorithm may be used:
  E-step: generate xi*(1), ..., xi*(m) from f(xi | zi, yi; θ̂(t)) ∝ f̂a(xi | zi) f(yi | xi; θ̂(t)).
  M-step: solve the imputed score equation for θ.

Fractional imputation for the EM algorithm
- The E-step above may be computationally challenging (it often relies on an MCMC method).
- Parametric fractional imputation makes the computation easy.
- E-step:
  1. Generate xi*(1), ..., xi*(m) from f̂a(xi | zi) for each i ∈ B.
  2. Compute the fractional weight associated with xi*(j) as wij* ∝ f(yi | xi*(j); θ̂(t)), normalized so that Σ_j wij* = 1.
- M-step: solve the weighted score equation for θ.
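The E- and M-steps above can be sketched in code. This is a minimal, assumption-laden illustration for the external-calibration setting, not the authors' implementation: f̂a(x | z) is taken to be a known normal model (in practice it is estimated from Sample A), f(y | x; θ) is a logistic model as in the simulation section, and the M-step uses a few Newton steps on the weighted logistic score. Function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def pfi_measurement_error(z_b, y_b, xhat_mean, xhat_sd, m=50, n_iter=20):
    """PFI sketch: draw x* once from fa(x | z); at each EM iteration,
    recompute fractional weights w*_ij proportional to f(y_i | x*_ij; theta)
    (E-step) and solve the weighted logistic score equation for
    theta = (gamma0, gamma_x) by Newton's method (M-step)."""
    y = np.asarray(y_b, dtype=float)
    xs = xhat_mean(np.asarray(z_b))[:, None] + xhat_sd * rng.standard_normal((len(y), m))
    theta = np.zeros(2)
    for _ in range(n_iter):
        # E-step: fractional weights proportional to the logistic likelihood
        p = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * xs)))
        lik = np.where(y[:, None] == 1.0, p, 1.0 - p)
        w = lik / lik.sum(axis=1, keepdims=True)
        # M-step: Newton steps on the weighted score (weights held fixed)
        for _ in range(10):
            p = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * xs)))
            resid = w * (y[:, None] - p)
            score = np.array([resid.sum(), (resid * xs).sum()])
            wpq = w * p * (1.0 - p)
            hess = np.array([[wpq.sum(), (wpq * xs).sum()],
                             [(wpq * xs).sum(), (wpq * xs * xs).sum()]])
            theta = theta + np.linalg.solve(hess, score)
    return theta

# Toy data loosely following the simulation design (gamma0 = gamma_x = 1):
n = 800
x = rng.standard_normal(n)
z = x + 0.5 * rng.standard_normal(n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(1.0 + x)))).astype(float)
# For this toy model, E(x | z) = 0.8 z and Var(x | z) = 0.2 (known here;
# in practice fa(x | z) is estimated from the calibration Sample A).
gamma_hat = pfi_measurement_error(z, y, lambda zz: 0.8 * zz, 0.2 ** 0.5)
print(gamma_hat)  # roughly (1, 1) up to sampling error
```

Note the PFI hallmark: the imputed values xs are drawn once and kept fixed, and only their fractional weights are updated across EM iterations.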
Simulation Setup
- Measurement error model:
     yi ~ Bernoulli(pi),  logit(pi) = γ0 + γx xi,
     zi = β0 + β1 xi + ui,  ui ~ N(0, σ² xi^{2α}),  xi ~ N(μx, σx²).
- We observe (xi, zi), i = 1, ..., nA, in Sample A; in Sample B, instead of observing (xi, yi), we observe (zi, yi).
- Simulation parameters: nA = nB = 800, γ0 = 1, γx = 1, β0 = 0, β1 = 1, σ² = 0.25, α = 0.4, μx = 0, and σx² = 1.

Methods
1. Parametric fractional imputation (PFI).
2. Hot-deck fractional imputation (HDFI).
3. Naive: the estimator obtained from the logistic regression of yi on zi for i ∈ B.
4. Bayes: the method proposed by Guo and Little (2011). Gibbs sampling is implemented with JAGS; we used 1000 iterations of a single chain for inference, after discarding the first 500 as burn-in. Diffuse proper priors are specified: letting θ1 = (log(σx²), log(σ²), μx, β0, β1, γ0, γx), we assume a priori that θ1 ~ N(0, 10^{-6} I7), where I7 is the 7 × 7 identity matrix, and the prior for the power α is uniform on the interval [−5, 5].
5. Weighted regression calibration (WRC): a regression calibration method that incorporates the unequal variances in the measurement error model (also considered in Guo and Little, 2011).

Table: Monte Carlo (MC) bias, variance, and mean squared error (MSE) of the point estimators of γx

Method    MC Bias    MC Variance    MC MSE
PFI        0.0239      0.0386       0.0392
HDFI       0.0246      0.0387       0.0393
Naive     -0.2241      0.0239       0.0742
Bayes      0.0406      0.0415       0.0432
WRC        0.1120      0.0499       0.0625

Concluding Remarks
- Statistical matching is a tool for survey data integration.
- Current practice in statistical matching is based on the conditional independence assumption, which may not be realistic in practice.
- A new approach based on an instrumental variable is proposed.
- The proposed method provides statistically valid regression coefficients from the matched data even when the CI assumption does not hold.
- Variance estimation is possible (not covered here).
- The approach applies directly to measurement error model problems and split questionnaire designs.

Future Research
- Semiparametric inference: make f̂a(y1 | x) nonparametric in f(y1 | x, y2) ∝ f(y1 | x) f(y2 | x1, y1).
- Application to causal inference: estimating the average treatment effect from observational studies, where the counterfactual outcomes are never observed.
- Combining two data sources: one from a probability sample and the other from a non-probability sample.

The End