Combining data from two independent surveys: a model-assisted approach
Jae Kwang Kim, Iowa State University
January 20, 2012
Joint work with J. N. K. Rao, Carleton University

Reference: Kim, J. K. and Rao, J. N. K. (2012). "Combining data from two independent surveys: a model-assisted approach," Biometrika, in press. (Available online via Advance Access, doi:10.1093/biomet/asr063.)

Outline
1 Introduction
2 Projection estimation
3 Replication variance estimation
4 Efficient estimation: full information
5 Simulation study
6 Concluding remarks & discussion

1. Introduction

Two-phase sampling (classical)
- A_1: first-phase sample of size n_1
- A_2: second-phase sample of size n_2 (A_2 ⊂ A_1)
- x observed in phase 1; both y and x observed in phase 2.
- Assume that 1 is an element of x_i.
- Literature: Neyman (1934), Hansen & Hurwitz (1946), Rao (1973), Kott & Stukel (1997), Binder et al. (2000), Kim et al. (2006), Hidiroglou et al. (2009).

GREG estimator of Y = Σ_{i=1}^N y_i:

$$\hat{Y}_G = \hat{X}_1' \hat{\beta}_2, \qquad \hat{X}_1 = \sum_{i \in A_1} w_{1i} x_i, \qquad \hat{\beta}_2 = \Big( \sum_{i \in A_2} w_{2i} x_i x_i' \Big)^{-1} \sum_{i \in A_2} w_{2i} x_i y_i.$$

Two ways of implementing the GREG estimator:
- Calibration (create a data file for A_2):

$$\hat{Y}_G = \sum_{i \in A_2} w_{2G,i}\, y_i, \qquad w_{2G,i} = \hat{X}_1' \Big( \sum_{i \in A_2} w_{2i} x_i x_i' \Big)^{-1} w_{2i} x_i.$$

- Projection estimation (create a data file for A_1):

$$\hat{Y}_G = \sum_{i \in A_1} w_{1i} \tilde{y}_i, \qquad \tilde{y}_i = x_i' \hat{\beta}_2.$$

Domain projection estimators
Calibration estimator of the domain total Y_d = Σ_{i=1}^N δ_i(d) y_i:

$$\hat{Y}_{Cal,d} = \sum_{i \in A_2} w_{2G,i}\, \delta_i(d)\, y_i,$$

where δ_i(d) = 1 if i belongs to domain d and δ_i(d) = 0 otherwise.
Note: Ŷ_{Cal,d} is based only on the domain sample belonging to A_2, so it can have a large variance if the domain sample in A_2 is very small.

Domain projection estimator (Fuller, 2003):

$$\hat{Y}_{p,d} = \sum_{i \in A_1} w_{i1}\, \delta_i(d)\, \tilde{y}_i.$$

Note: Ŷ_{p,d} is based on the much larger domain sample in A_1, whereas Ŷ_{Cal,d} uses only the domain sample in A_2. Hence Ŷ_{p,d} can be significantly more efficient if its relative bias is small. Under the model y_i = x_i'β + e_i with E(e_i) = 0, Ŷ_{p,d} is model unbiased for Y_d. But "it is possible to construct populations for which Ŷ_{p,d} is very design biased" (Fuller, 2003).

Combining two independent surveys
- Large sample A_1 collecting only x, with weights {w_{i1}, i ∈ A_1}.
- Much smaller sample A_2, drawn independently, collecting both x and y, with weights {w_{i2}, i ∈ A_2}.

Example 1 (Hidiroglou, 2001): Canadian Survey of Employment, Payrolls and Hours
- A_1: large sample drawn from a Canada Customs and Revenue Agency administrative data file; auxiliary variables x observed.
- A_2: small sample from the Statistics Canada Business Register; study variables y (number of hours worked by employees and summarized earnings) observed.

Example 2 (Reiter, 2008)
- A_2: both self-reported health measurements x and clinical measurements from physical examinations y observed.
- A_1: only x reported.

Synthetic values ỹ_i, i ∈ A_1, are created by first fitting a working model E(y) = m(x, β) relating y to x to the data {(y_i, x_i), i ∈ A_2} and then predicting the y_i associated with x_i, i ∈ A_1. Only the synthetic values ỹ_i = m(x_i, β̂), i ∈ A_1, and the associated weights w_{i1}, i ∈ A_1, are released to the public. Our focus is on producing estimators of totals and domain totals from the synthetic data file {(ỹ_i, w_{i1}), i ∈ A_1}.
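To make the projection form concrete, here is a minimal NumPy sketch (not from the paper): it fits the weighted linear working model on survey 2 and sums the synthetic predictions over survey 1 with the survey 1 weights. The toy arrays x1, w1, x2, w2, y2 and the function name projection_estimate are illustrative, assuming simple random samples from a population of size 10,000 and an intercept column in x.

```python
import numpy as np

def projection_estimate(x1, w1, x2, w2, y2):
    """Projection form of the GREG estimator of the total of y:
    fit the weighted linear working model on survey 2, then sum the
    synthetic values over survey 1 with the survey-1 weights."""
    # beta_2 = (sum_{A2} w2 x x')^{-1} sum_{A2} w2 x y
    XtWX = (x2 * w2[:, None]).T @ x2
    XtWy = (x2 * w2[:, None]).T @ y2
    beta2 = np.linalg.solve(XtWX, XtWy)
    # synthetic values for every survey-1 unit and the projection estimator
    y_tilde = x1 @ beta2
    return np.sum(w1 * y_tilde), beta2

# hypothetical toy data: survey 1 of size 500, survey 2 of size 100,
# x includes an intercept column
rng = np.random.default_rng(0)
x1 = np.column_stack([np.ones(500), rng.chisquare(2, 500)])
w1 = np.full(500, 10_000 / 500)   # N / n1 under simple random sampling
x2 = np.column_stack([np.ones(100), rng.chisquare(2, 100)])
y2 = 1 + 0.7 * x2[:, 1] + rng.normal(0, np.sqrt(2), 100)
w2 = np.full(100, 10_000 / 100)   # N / n2
Y_hat, beta2 = projection_estimate(x1, w1, x2, w2, y2)
```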
2. Projection estimation

Estimation of Y
Projection estimator of Y:

$$\hat{Y}_p = \sum_{i \in A_1} w_{i1} \tilde{y}_i.$$

Ŷ_p is asymptotically design-unbiased if β̂ satisfies

$$\sum_{i \in A_2} w_{i2} \{ y_i - m(x_i, \hat{\beta}) \} = 0. \qquad (*)$$

Note: under condition (*),

$$\hat{Y}_p = \sum_{i \in A_1} w_{i1} \tilde{y}_i + \sum_{i \in A_2} w_{i2} \{ y_i - \tilde{y}_i \} = \text{“prediction”} + \text{“bias correction”}.$$

Theorem 1: Under some regularity conditions, if β̂ satisfies condition (*), we can write

$$\hat{Y}_p \simeq \sum_{i \in A_1} w_{i1}\, m_0(x_i) + \sum_{i \in A_2} w_{i2} \{ y_i - m_0(x_i) \} = \hat{P}_1 + \hat{Q}_2,$$

where m_0(x_i) = m(x_i, β_0) and β_0 = plim β̂ with respect to survey 2. Thus

$$E(\hat{Y}_p) \simeq \sum_{i=1}^{N} m_0(x_i) + \sum_{i=1}^{N} \{ y_i - m_0(x_i) \} = \sum_{i=1}^{N} y_i,$$

and V(Ŷ_p) ≃ V(P̂_1) + V(Q̂_2).

Model-assisted approach: the asymptotic unbiasedness of Ŷ_p does not depend on the validity of the working model, but the efficiency is affected.
Note: in the variance decomposition V(Ŷ_p) ≃ V(P̂_1) + V(Q̂_2) = V_1 + V_2,
- V_1 is based on the n_1 sample elements and V_2 is based on the n_2 sample elements.
- If n_2 << n_1, then V_1 << V_2.
- If the working model is good, then the squared error terms e_i^2 = {y_i − m_0(x_i)}^2 are small and V_2 will also be small.

When is condition (*) satisfied?
- If 1 is an element of x_i, the condition is satisfied for linear regression m(x_i, β) = x_i'β and for logistic regression logit{m(x_i, β)} = x_i'β when β̂ is obtained from the estimating equation

$$\sum_{i \in A_2} w_{i2}\, x_i (y_i - m_i) = 0$$

for the linear and logistic regression working models.
- For the ratio model, β̂ is the solution of

$$\sum_{i \in A_2} w_{i2} (y_i - m_i) = 0.$$

Linearization variance estimation
Let e_i = y_i − ỹ_i; then a variance estimator of Ŷ_p is

$$v_L(\hat{Y}_p) = v_1(\tilde{y}_i) + v_2(\hat{e}_i),$$

where v_1(z_i) = v(Ẑ_1) is the variance estimator for survey 1, v_2(z_i) = v(Ẑ_2) is the variance estimator for survey 2, Ẑ_1 = Σ_{i∈A_1} w_{i1} z_i, and Ẑ_2 = Σ_{i∈A_2} w_{i2} z_i. Note that v_L(Ŷ_p) requires access to data from both surveys.

Estimation of the domain total Y_d
Projection domain estimator:

$$\hat{Y}_{p,d} = \sum_{i \in A_1} w_{i1}\, \delta_i(d)\, \tilde{y}_i.$$

Ŷ_{p,d} is asymptotically unbiased if either

Case (i): $\sum_{i \in A_2} w_{i2}\, \delta_i(d) (y_i - \tilde{y}_i) = 0$, or
Case (ii): $\mathrm{Cov}\{ \delta_i(d),\, y_i - m(x_i, \beta_0) \} = 0$.

Case (i):
- For linear or logistic regression models, (i) is satisfied if δ_i(d) is an element of x_i.
- For planned domains specified in advance, augmented working models can be used.
- The survey 1 data file should provide the planned domain indicators.
Case (ii):
- If the working model is good, the relative bias of Ŷ_{p,d} would be small.
- Ŷ_{p,d} is asymptotically model unbiased if the model is correct.
- Ŷ_{p,d} can be significantly design biased for some populations.

3. Replication variance estimation

Replication variance estimation for Ŷ_p
Replication variance estimator for survey 1:

$$v_{1,\mathrm{rep}}(\hat{Z}) = \sum_{k=1}^{L_1} c_k \big( \hat{Z}_1^{(k)} - \hat{Z}_1 \big)^2,$$

where Ẑ_1^{(k)} = Σ_{i∈A_1} w_{i1}^{(k)} z_i and {w_{i1}^{(k)}, i ∈ A_1}, k = 1, ..., L_1, are the replication weights for survey 1.
Replication variance estimator for Ŷ_p:

$$v_{1,\mathrm{rep}}(\hat{Y}_p) = \sum_{k=1}^{L_1} c_k \big( \hat{Y}_p^{(k)} - \hat{Y}_p \big)^2,$$

where Ŷ_p^{(k)} = Σ_{i∈A_1} w_{i1}^{(k)} ỹ_i^{(k)} and {ỹ_i^{(k)}, i ∈ A_1} are the synthetic values for replicate k.

How to create the replicated synthetic data {ỹ_i^{(k)}, i ∈ A_1}?
1. Create replicate weights {w_{i2}^{(k)}, k = 1, ..., L_1; i ∈ A_2} such that

$$\sum_{k=1}^{L_1} c_k \big( \hat{Y}_2^{(k)} - \hat{Y}_2 \big)^2 = v_2(\hat{Y}_2).$$

2. Compute β̂^{(k)} and ỹ_i^{(k)} = m(x_i, β̂^{(k)}) by solving

$$\sum_{i \in A_2} w_{i2}^{(k)} \{ y_i - m(x_i, \beta) \}\, x_i = 0$$

for β̂^{(k)} (linear or logistic regression working models).
Then v_{1,rep}(Ŷ_p) is asymptotically unbiased.
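The sketch below illustrates the two-step construction just described, under simplifying assumptions that are not part of the paper: both surveys are simple random samples, the working model is linear, and L_1 = n_2 replicates are used (a delete-a-group jackknife for survey 1 paired with a delete-one-unit jackknife for survey 2, so the same factor c_k = (L_1 − 1)/L_1 applies to both). The function name and the replicate pairing are illustrative only.

```python
import numpy as np

def jackknife_projection_variance(x1, w1, x2, w2, y2, seed=None):
    """Sketch of v_{1,rep}(Y_hat_p): replicate working-model fits from
    survey-2 replicate weights, combined with survey-1 replicate weights."""
    rng = np.random.default_rng(seed)
    n1, n2 = len(w1), len(y2)
    L1 = n2                      # number of replicates, taken equal to n2
    c_k = (L1 - 1) / L1          # common jackknife factor

    def beta_hat(xs, ws, ys):
        XtWX = (xs * ws[:, None]).T @ xs
        return np.linalg.solve(XtWX, (xs * ws[:, None]).T @ ys)

    # full-sample projection estimate
    Yp = np.sum(w1 * (x1 @ beta_hat(x2, w2, y2)))

    # randomly partition survey 1 into L1 groups (delete-a-group jackknife)
    group = rng.permutation(np.arange(n1) % L1)

    Yp_reps = np.empty(L1)
    for k in range(L1):
        # survey-1 replicate weights: drop group k, rescale the rest
        w1_k = np.where(group == k, 0.0, w1 * L1 / (L1 - 1))
        # survey-2 replicate weights: drop unit k, rescale the rest
        w2_k = np.where(np.arange(n2) == k, 0.0, w2 * n2 / (n2 - 1))
        # replicate synthetic values use the replicate-k working-model fit
        y_tilde_k = x1 @ beta_hat(x2, w2_k, y2)
        Yp_reps[k] = np.sum(w1_k * y_tilde_k)

    return np.sum(c_k * (Yp_reps - Yp) ** 2)
```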
The data file for sample A_1 should contain additional columns of {ỹ_i^{(k)}, i ∈ A_1} and the associated replication weights {w_{i1}^{(k)}, i ∈ A_1}, k = 1, 2, ..., L_1.

Replication variance estimation for Ŷ_{p,d}
Let Ŷ_{p,d}^{(k)} = Σ_{i∈A_1} w_{i1}^{(k)} δ_i(d) ỹ_i^{(k)}; then

$$v_{1,\mathrm{rep}}(\hat{Y}_{p,d}) = \sum_{k=1}^{L_1} c_k \big( \hat{Y}_{p,d}^{(k)} - \hat{Y}_{p,d} \big)^2$$

is asymptotically unbiased under either case (i) or case (ii).

4. Optimal estimator: full information

Estimation of the total Y
Three estimators for two parameters:
- Survey 1: X̂_1 for X
- Survey 2: (X̂_2, Ŷ_2) for (X, Y)
Combine the information using generalized least squares: minimize

$$Q(X, Y) = \begin{pmatrix} \hat{X}_1 - X \\ \hat{X}_2 - X \\ \hat{Y}_2 - Y \end{pmatrix}' V^{-1} \begin{pmatrix} \hat{X}_1 - X \\ \hat{X}_2 - X \\ \hat{Y}_2 - Y \end{pmatrix}$$

with respect to (X, Y), where V is the variance–covariance matrix of (X̂_1, X̂_2, Ŷ_2)'.

The best linear unbiased estimator based on X̂_2, Ŷ_2 and X̂_1 is

$$\tilde{Y}_{\mathrm{opt}} = \hat{Y}_2 + B_{y \cdot x2} \big( \tilde{X}_{\mathrm{opt}} - \hat{X}_2 \big), \qquad \tilde{X}_{\mathrm{opt}} = \frac{ V_{xx2} \hat{X}_1 + V_{xx1} \hat{X}_2 }{ V_{xx1} + V_{xx2} },$$

where B_{y·x2} = V_{yx2}/V_{xx2}, V_{xx1} = V(X̂_1), V_{xx2} = V(X̂_2), and V_{yx2} = Cov(Ŷ_2, X̂_2). Replacing the variances in Ỹ_opt by estimated variances gives Ŷ_opt and X̂_opt.

Ŷ_opt can be expressed as

$$\hat{Y}_{\mathrm{opt}} = \sum_{i \in A_2} w_{i2}^{*}\, y_i,$$

where {w*_{i2}, i ∈ A_2} are calibration weights satisfying Σ_{i∈A_2} w*_{i2} x_i = X̂_opt. Thus Ŷ_opt can be computed from a data file for A_2 providing the weights {w*_{i2}, i ∈ A_2}.

Example (simple random samples A_1 and A_2):

$$w_{i2}^{*} = \frac{N}{n_2} + \frac{ x_i - \bar{x}_2 }{ \sum_{i \in A_2} (x_i - \bar{x}_2)^2 } \big( \hat{X}_{\mathrm{opt}} - \hat{X}_2 \big),$$

where x̄_2 is the mean of x for A_2.

Domain estimation
Calibration estimator:

$$\hat{Y}_d^{*} = \sum_{i \in A_2} w_{i2}^{*}\, \delta_i(d)\, y_i,$$

computed from the data file for A_2 only.
Projection estimator:

$$\hat{Y}_{p,d} = \sum_{i \in A_1} w_{i1}\, \delta_i(d)\, \tilde{y}_i,$$

computed from the data file for A_1.
Both Ŷ*_d and Ŷ_{p,d} satisfy the internal consistency property:

$$\sum_d \hat{Y}_d^{*} = \hat{Y}_{\mathrm{opt}}, \qquad \sum_d \hat{Y}_{p,d} = \hat{Y}_p.$$

- Ŷ*_d is asymptotically design unbiased but can have a large variance if the domain contains few sample A_2 units.
- An optimal estimator Ŷ_{opt,d} based on domain-specific variances does not satisfy internal consistency, may not be stable for small domain sample sizes, and cannot be implemented from the A_2 data file.

5. Simulation study

Simulation setup
Two artificial populations A and B of size N = 10,000: {(y_i, x_i, z_i); i = 1, ..., N}.
Population A (regression model): x_i ∼ χ²(2), y_i = 1 + 0.7 x_i + e_i, e_i ∼ N(0, 2), z_i ∼ Unif(0, 1), with z_i independent of (x_i, y_i).
Population B (ratio model): same (x_i, z_i), but y_i = 0.7 x_i + u_i with u_i ∼ N(0, x_i).
The correlation between y and x is 0.71 for both populations.
Domain d: δ_i(d) = 1 if z_i < 0.3; δ_i(d) = 0 otherwise.

- Two independent simple random samples: n_1 = 500, n_2 = 100.
- Working models: linear regression, ratio, augmented linear regression, augmented ratio.
- Relative bias: RB(Ŷ) = {E(Ŷ) − Y}/Y.
- Relative efficiency: RE(Ŷ) = mse(Ŷ_opt)/mse(Ŷ).

Simulation results

Table 1: Simulation results (point estimation)

Parameter  Estimator               Pop. A          Pop. B
                                   RB     RE       RB     RE
Total      Regression projection   0.00   0.98     0.00   0.97
           Ratio projection        0.00   0.58     0.00   0.99
           Aug. reg. projection    0.00   0.97     0.00   0.97
           Aug. ratio projection   0.01   0.55     0.00   0.98
           Optimal                 0.00   1.00     0.00   1.00
Domain     Regression projection   0.00   1.96     0.01   2.01
           Ratio projection        0.01   1.22     0.01   2.05
           Aug. reg. projection    0.00   1.05     0.00   0.98
           Aug. ratio projection   0.00   0.64     0.00   0.96
           Optimal                -0.01   1.00    -0.02   1.00
           Calibration             0.00   0.45     0.00   0.53
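Before turning to the conclusions from Table 1, here is a rough sketch of Population A and the regression projection estimators under the setup above. It is a reconstruction, not the authors' simulation code; the seed and variable names are arbitrary, and e_i ∼ N(0, 2) is read as variance 2.

```python
import numpy as np

rng = np.random.default_rng(2012)

# Population A (regression model), as in the simulation setup above
N = 10_000
x = rng.chisquare(2, N)
y = 1 + 0.7 * x + rng.normal(0, np.sqrt(2), N)    # e_i with variance 2
z = rng.uniform(0, 1, N)
delta = (z < 0.3).astype(float)                   # domain indicator

# two independent simple random samples
n1, n2 = 500, 100
A1 = rng.choice(N, n1, replace=False)
A2 = rng.choice(N, n2, replace=False)
w1, w2 = N / n1, N / n2

# linear regression working model fitted to survey 2 (intercept + x)
X2 = np.column_stack([np.ones(n2), x[A2]])
beta = np.linalg.solve(X2.T @ X2, X2.T @ y[A2])   # equal weights cancel under SRS

# synthetic values for survey 1 and the projection estimators
y_tilde = np.column_stack([np.ones(n1), x[A1]]) @ beta
Y_proj  = np.sum(w1 * y_tilde)                    # total
Yd_proj = np.sum(w1 * delta[A1] * y_tilde)        # domain total

print(Y_proj, y.sum())                # estimate vs true total
print(Yd_proj, (delta * y).sum())     # estimate vs true domain total
```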
Conclusions from Table 1: estimation of the total Y
1. The RB of all estimators is negligible: less than 2%.
2. The regression projection estimator is almost as efficient as Ŷ_opt even when the true model is the ratio model. The ratio projection estimator is considerably less efficient when the true model has a substantial intercept term: use model diagnostics to identify a good working model.
3. The augmented projection estimators are similar to the corresponding projection estimators in terms of RB and RE.

Conclusions from Table 1: domain estimation
1. The RB of all estimators is less than 5%: the simulation setup ensures that δ_i(d) is unrelated to r_i = y_i − m(x_i; β).
2. The regression projection estimator is considerably more efficient than the calibration estimator or the optimal estimator: the projection estimator is based on a larger sample size.
3. The ratio projection estimator is considerably less efficient when the model has a substantial intercept term.

Jackknife variance estimation
L_1 = n_2 = 100 pseudo-replicates formed by the random group jackknife.

Table 2: Simulation results (relative biases of the variance estimators)

Point estimator          Parameter   Pop. A    Pop. B
Regression projection    Total       -0.013     0.024
                         Domain      -0.030     0.006
Ratio projection         Total        0.032     0.000
                         Domain      -0.001    -0.017
Aug. reg. projection     Total        0.033     0.040
                         Domain       0.022     0.050
Aug. ratio projection    Total        0.059     0.030
                         Domain       0.064     0.061

The |RB| of the jackknife variance estimators is small: at most about 6%.

6. Discussion

Some alternative approaches
The proposed method does not lead to the optimal estimator

$$\hat{Y}_{\mathrm{opt}} = \hat{Y}_2 + \hat{B}_{y \cdot x2} \big( \tilde{X}_{\mathrm{opt}} - \hat{X}_2 \big), \qquad \tilde{X}_{\mathrm{opt}} = \frac{ V_{xx2} \hat{X}_1 + V_{xx1} \hat{X}_2 }{ V_{xx1} + V_{xx2} }.$$

To implement the optimal estimator using synthetic data, we may express

$$\hat{Y}_{\mathrm{opt}} = \sum_{i \in A_1^{*}} w_{i3}\, \tilde{y}_i + \sum_{i \in A_2} w_{i2} ( y_i - \tilde{y}_i ),$$

where ỹ_i = x_i' B̂_{y·x2}, A*_1 = A_1 ∪ A_2, and w_{i3} is the sampling weight for A*_1 satisfying

$$\sum_{i \in A_1^{*}} w_{i3}\, x_i = \tilde{X}_{\mathrm{opt}}.$$

If Σ_{i∈A_2} w_{i2} = Σ_{i∈A*_1} w_{i3}, then we can further express

$$\hat{Y}_{\mathrm{opt}} = \sum_{i \in A_1^{*}} \sum_{j \in A_2} w_{i3}\, \tilde{w}_{ij}^{*} ( \hat{y}_i + \hat{e}_j ),$$

where w̃*_{ij} = w_{j2}/(Σ_{i∈A_2} w_{i2}) and ê_j = y_j − ŷ_j. It now takes the form of the fractional imputation considered in Fuller & Kim (2005). To reduce the size of the data set, we may consider random selection of M residuals to get ê*_j and

$$\hat{Y}_{FI} = \sum_{i \in A_1^{*}} \sum_{j=1}^{M} w_{i3}\, w_{ij}^{*} ( \hat{y}_i + \hat{e}_j^{*} ),$$

where w*_{ij} satisfies Σ_{j=1}^M w*_{ij}(1, ê*_j) = Σ_{j∈A_2} w̃*_{ij}(1, ê_j).

Some alternative approaches (continued)
- Nested two-phase sampling: A_2 ⊂ A_1.
- Non-nested two-phase sampling: A_1 and A_2 independent.
- We can convert non-nested two-phase sampling into nested two-phase sampling A_2 ⊂ A*_1, where A*_1 = A_1 ∪ A_2.
- Synthetic data can be released for A*_1.

Parametric multiple imputation
- Assume that f(y_i | x_i, θ) is known for fixed θ and that A_1 and A_2 are simple random samples.
- Obtain the posterior distribution of θ, p(θ | y_2, x_2), assuming a diffuse prior on θ, where (y_2, x_2) denotes the data from A_2.
- Draw M values θ^{(1)}, ..., θ^{(M)} from the posterior distribution.
- Draw y_i^{(l)} from f(y_i | x_i, θ^{(l)}) for i ∈ A_1 and l = 1, ..., M.
- Synthetic data sets: {y_i^{(l)}, i ∈ A_1}, l = 1, ..., M.
- Standard multiple imputation variance estimators do not work here. Reiter (2008) proposed a two-stage imputation procedure requiring T synthetic data sets {y_{it}^{(l)} : i ∈ A_1, t = 1, ..., T} to be generated for each θ^{(l)}. In all, TM synthetic data sets are generated.

Conclusion
- The proposed method is based on deterministic imputation to generate the synthetic values.
- Synthetic data along with the replicates are created for survey 1, and only the survey 1 data are released.
- Significant efficiency gains are achieved for domain estimation.
- A stochastic imputation approach is under study.

REFERENCES

Binder, D. A., Babyak, C., Brodeur, M., Hidiroglou, M. & Jocelyn, W. (2000). Variance estimation for two-phase stratified sampling. Canadian Journal of Statistics 28, 751–764.
Fuller, W. A. (2003). Estimation for multiple phase samples. In Analysis of Survey Data, R. L. Chambers & C. J. Skinner, eds. Chichester: Wiley.
Fuller, W. A. & Kim, J. K. (2005). Hot deck imputation for the response model. Survey Methodology 31, 139–149.
Hansen, M. & Hurwitz, W. (1946). The problem of non-response in sample surveys. Journal of the American Statistical Association 41, 517–529.
Hidiroglou, M. (2001). Double sampling. Survey Methodology 27, 143–154.
Hidiroglou, M. A., Rao, J. N. K. & Haziza, D. (2009). Variance estimation in two-phase sampling. Australian and New Zealand Journal of Statistics 51, 127–141.
Kim, J. K., Navarro, A. & Fuller, W. A. (2006). Replicate variance estimation after multi-phase stratified sampling. Journal of the American Statistical Association 101, 312–320.
Kott, P. & Stukel, D. (1997). Can the jackknife be used with a two-phase sample? Survey Methodology 23, 81–89.
Neyman, J. (1934). On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society 97, 558–606.
Rao, J. N. K. (1973). On double sampling for stratification and analytical surveys. Biometrika 60, 125–133.
Reiter, J. (2008). Multiple imputation when records used for imputation are not used or disseminated for analysis. Biometrika 95, 933–946.