Small Area Estimation Combining Information from Several Sources
Seunghwan Park and Jae-kwang Kim
Jan 27, 2012

Outline
• Introduction
• Basic Theory
• Application to Korea LFS
• Discussion

Seunghwan Park and Jae-kwang Kim () Survey Sampling Jan 27, 2012 2 / 28

Introduction
• Small area estimation: we want to provide reliable estimates for areas with insufficient sample sizes.
• The sample is not planned to give accurate direct estimators for the domains: some domains have few or no sample observations.
• Idea: a model can be used to borrow strength from other sources of information.

Introduction
• Motivation: we want to combine several sources of information to get improved small area estimates.
• How to improve the direct estimators using auxiliary variables
  • from other independent survey data
  • from census data or administrative data.
• In our study, we use
  • an area-level model approach,
  • several sources of auxiliary information,
  • a measurement error model,
  • a generalized least squares (GLS) method.

Introduction
• General setup
  • Variable of interest: $Y_i$
  • Survey A: directly compute $\hat{Y}_i$, subject to sampling error.
  • Survey B: compute $\hat{X}_{i1}$, subject to sampling error.
  • Census: measures $\hat{X}_{i2}$.
• $E_A(\hat{Y}_i) \neq E_B(\hat{X}_{i1})$ due to the structural differences between the surveys.
• Structural differences (or systematic differences) arise from
  • different survey modes
  • time differences
  • frame differences
• Goal: improve estimation of $Y_i$ by incorporating various types of auxiliary information.

Basic Theory
• Two error models (for area $i$)
  • Sampling error model
$$\hat{Y}_{i,a} = Y_i + a_i, \qquad \hat{X}_{i,b} = X_i + b_i,$$
    where $(a_i, b_i)$ represents the sampling errors such that
$$\begin{pmatrix} a_i \\ b_i \end{pmatrix} \sim \left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} V(a_i) & \mathrm{Cov}(a_i, b_i) \\ \mathrm{Cov}(a_i, b_i) & V(b_i) \end{pmatrix} \right]$$
  • Structural error model
$$X_i = \beta_0 + \beta_1 Y_i + e_i, \qquad e_i \sim (0, \sigma_{ei}^2)$$

Basic Theory
• The structural error model describes the relationship between the two survey measurements up to sampling error.
  • $Y$: target measurement item (variable of primary interest)
  • $X$: inaccurate measurement of $Y$ with possible systematic bias.
• If both $X$ and $Y$ measure the same item (with different survey modes), the structural error model is essentially a measurement error model. ($\beta_0 = 0$, $\beta_1 = 1$ means no measurement bias.)
• Why consider $X_i = \beta_0 + \beta_1 Y_i + e_i$ instead of $Y_i = \beta_0 + \beta_1 X_i + e_i$? We want to treat $Y_i$ as fixed rather than treating $X_i$ as fixed.

Basic Theory
• If the parameters in the structural error model are known,
$$\hat{Y}_{i,b} \equiv \beta_1^{-1}(\hat{X}_{i,b} - \beta_0)$$
  is also an unbiased estimator of $Y_i$, computed from survey B. The estimator $\hat{Y}_{i,b}$ computed with consistent $(\hat{\beta}_0, \hat{\beta}_1)$ is often called a synthetic estimator.
• Two main issues:
  • Prediction of $Y_i$: GLS (or GMM) approach.
  • Parameter estimation: use the theory of measurement error models.

Basic Theory
Prediction
• Recall the GLS method:
$$y = Z\theta + e, \quad e \sim (0, V), \qquad \hat{\theta}_{GLS} = (Z' V^{-1} Z)^{-1} Z' V^{-1} y$$
• GLS approach to combine the two error models:
$$y = Z\theta + e, \quad e \sim (0, V) \iff \begin{pmatrix} \hat{Y}_{i,a} \\ \beta_1^{-1}(\hat{X}_{i,b} - \beta_0) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} Y_i + \begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix}$$
  where $u_{1i} = a_i$ and $u_{2i} = \beta_1^{-1}(b_i + e_i)$. Thus,
$$\begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix} \sim \left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} V(a_i) & \beta_1^{-1}\mathrm{Cov}(a_i, b_i) \\ \beta_1^{-1}\mathrm{Cov}(a_i, b_i) & \beta_1^{-2}\{V(b_i) + \sigma_{ei}^2\} \end{pmatrix} \right]$$

Basic Theory
Prediction
• GLS estimator: the best linear unbiased estimator of $Y_i$ based on linear combinations of $\hat{Y}_{i,a}$ and $\hat{Y}_{i,b} = \beta_1^{-1}(\hat{X}_{i,b} - \beta_0)$.
• Under the current setup,
$$\hat{Y}_i^* = w_i \hat{Y}_{i,a} + (1 - w_i)\hat{Y}_{i,b}$$
  where
$$w_i = \frac{\sigma_{ei}^2 + V(b_i) - \beta_1 \mathrm{Cov}(a_i, b_i)}{\sigma_{ei}^2 + \beta_1^2 V(a_i) + V(b_i) - 2\beta_1 \mathrm{Cov}(a_i, b_i)}$$
• The GLS estimator is sometimes called a composite estimator. In practice we need to use $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}_{ei}^2$.

Basic Theory
Parameter estimation
• Parameter estimation for the structural model
  • Case 1: Matching measurement $X$ and measurement $Y$ is possible (e.g., two-phase sampling, where the survey A sample is a subset of the survey B sample).
  • Case 2: Matching is not possible.
• In case 1, we can easily obtain a consistent estimator of the model parameters from the set of units that have both $X$ and $Y$ observed (unit-level modeling).
• In case 2, we may use an area-level model to link $\hat{X}_i$ (from survey B) and $\hat{Y}_i$ (from survey A) at the area level.

Basic Theory
Parameter estimation
• The area-level model takes the form of a measurement error model (Fuller, 1987):
$$\hat{X}_i = \beta_0 + \beta_1 Y_i + e_i + b_i$$
$$\hat{Y}_i = Y_i + a_i$$
• Parameter estimation can be performed using the estimation methods for measurement error models.

Korea LFS Application
• Labor Force Survey: a very important economic survey used to estimate unemployment rates.
• Several sources of information on unemployment in Korea:
  1 Korean Labor Force Survey (KLF) data: 7K sample households (monthly)
  2 Local Area Labor Force Survey (LALF) data: 200K sample households (quarterly)
  3 Census data (10% of the population)
• The KLF sample is nested within the LALF sample.
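
As a rough illustration of the composite (GLS) estimator from the Prediction slides, the following Python sketch computes the weight $w_i$ in closed form and checks it against the matrix GLS formula. The function names and the numerical inputs are hypothetical and chosen only for illustration; they are not taken from the KLF or LALF data, and the model parameters are treated as known.

```python
import numpy as np


def composite_weight(var_a, var_b, cov_ab, beta1, sigma2_e):
    """Weight w_i on the direct estimator, following the closed form on the slide."""
    num = sigma2_e + var_b - beta1 * cov_ab
    den = sigma2_e + beta1**2 * var_a + var_b - 2.0 * beta1 * cov_ab
    return num / den


def composite_estimate(y_a, x_b, var_a, var_b, cov_ab, beta0, beta1, sigma2_e):
    """Composite estimator w_i * Y_a + (1 - w_i) * Y_b with known parameters."""
    y_b = (x_b - beta0) / beta1  # synthetic estimator from survey B
    w = composite_weight(var_a, var_b, cov_ab, beta1, sigma2_e)
    return w * y_a + (1.0 - w) * y_b


def gls_estimate(y_a, x_b, var_a, var_b, cov_ab, beta0, beta1, sigma2_e):
    """Same estimator via (Z'V^{-1}Z)^{-1} Z'V^{-1} y with Z = (1, 1)'."""
    y = np.array([y_a, (x_b - beta0) / beta1])
    V = np.array([
        [var_a, cov_ab / beta1],
        [cov_ab / beta1, (var_b + sigma2_e) / beta1**2],
    ])
    z = np.ones(2)
    v_inv_z = np.linalg.solve(V, z)  # V^{-1} Z without forming the inverse
    return (v_inv_z @ y) / (v_inv_z @ z)


if __name__ == "__main__":
    # Illustrative inputs only (assumed values, not survey results).
    args = dict(y_a=0.04, x_b=0.05, var_a=4e-4, var_b=1e-4,
                cov_ab=5e-5, beta0=0.005, beta1=1.1, sigma2_e=2e-5)
    print(composite_estimate(**args), gls_estimate(**args))
```

The two functions should agree up to floating-point error, which is one way to see that the composite form is just the GLS solution written out for two sources.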

Korea LFS Application
• The unemployment rate for small areas is the parameter of interest.
• Several sources of information on unemployment for analysis district area $i$:
  • $\hat{Y}_i$: estimates from KLF data
  • $\hat{X}_{1i}$: estimates from LALF data
  • $\hat{X}_{2i}$: estimates from census data
• KLF: sampling error ↑, measurement error ↓.
• LALF: sampling error ↓, measurement error ↑.
• Census data: sampling error ↓, measurement error ↑ (no updated information).

Korea LFS Application
• First, we can construct the structural error model for area $i$ in terms of population means:
$$\bar{X}_{1i} = \beta_0 + \beta_1 \bar{Y}_i + \bar{e}_{1i}, \qquad (1)$$
  where $(\bar{X}_{1i}, \bar{Y}_i, \bar{e}_{1i}) = N_i^{-1} \sum_{j \in U_i} (x_{1j}, y_j, e_{1j})$ and $\bar{e}_{1i} \sim (0, \sigma_{e1}^2 / N_i)$.
• Consider the nested error model
$$e_{1i} = \epsilon_i + u_i, \qquad \epsilon_i \sim (0, \sigma_{e1}^2), \qquad u_i \sim (0, \sigma_u^2);$$
  then $\bar{e}_{1i} \sim (0, \sigma_{e1}^2 + \sigma_u^2 / N_i)$.
• Since $N_i$ is often quite large, we can assume $\bar{e}_{1i} \sim (0, \sigma_{e1}^2)$.

Korea LFS Application
• Sampling error model
$$\begin{pmatrix} \hat{Y}_i \\ \hat{X}_{1i} \end{pmatrix} = \begin{pmatrix} Y_i \\ X_{1i} \end{pmatrix} + \begin{pmatrix} N_i a_i \\ N_i b_i \end{pmatrix} \qquad (2)$$
• Combining (1) and (2),
$$\begin{pmatrix} \bar{y}_i \\ \bar{x}_{1i} - \beta_0 \end{pmatrix} = \begin{pmatrix} 1 \\ \beta_1 \end{pmatrix} \bar{Y}_i + \begin{pmatrix} a_i \\ b_i + \bar{e}_{1i} \end{pmatrix} \qquad (3)$$
  where $(\bar{x}_{1i}, \bar{y}_i) = N_i^{-1}(\hat{X}_{1i}, \hat{Y}_i)$.
• The variance-covariance matrix of $(a_i, b_i + \bar{e}_{1i})'$ is
$$V_i = \begin{pmatrix} V(a_i) & \mathrm{Cov}(a_i, b_i) \\ \mathrm{Cov}(a_i, b_i) & V(b_i) + \sigma_e^2 \end{pmatrix}$$

Korea LFS Application
• GLS estimator
$$\hat{Y}_i^{GLS} = \{(1, \beta_1) V_i^{-1} (1, \beta_1)'\}^{-1} (1, \beta_1) V_i^{-1} (\bar{y}_i, \bar{x}_{1i} - \beta_0)'$$
  where the vectors are ordered as in (3) and $V_i$ is the variance-covariance matrix of $(a_i, b_i + \bar{e}_{1i})'$.
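• The GLS estimator can be expressed in the composite estimator form
$$\hat{Y}_i^{comp} = \alpha_i \bar{y}_i + (1 - \alpha_i)\tilde{y}_i$$
  where $\tilde{y}_i = \beta_1^{-1}(\bar{x}_{1i} - \beta_0)$ is called the synthetic estimator and
$$\alpha_i = \frac{V(\tilde{y}_i) - \mathrm{Cov}(\bar{y}_i, \tilde{y}_i)}{V(\bar{y}_i) + V(\tilde{y}_i) - 2\mathrm{Cov}(\bar{y}_i, \tilde{y}_i)}$$
• Ignoring the effect of estimating $\beta$,
$$V(\hat{Y}_i^{comp} - \bar{Y}_i) = \alpha_i V(\bar{y}_i) + (1 - \alpha_i)\mathrm{Cov}(\bar{y}_i, \tilde{y}_i)$$

Korea LFS Application
• We can also consider the census data. Then (3) changes to
$$\begin{pmatrix} \bar{y}_i \\ \bar{x}_{1i} - \beta_0 \\ \bar{x}_{2i} - \gamma_0 \end{pmatrix} = \begin{pmatrix} 1 \\ \beta_1 \\ \gamma_1 \end{pmatrix} \bar{Y}_i + \begin{pmatrix} a_i \\ b_i + \bar{e}_{1i} \\ \bar{e}_{2i} \end{pmatrix}$$
• The whole process is similar to the case of combining two surveys.

Korea LFS Application
Parameter estimation
• A consistent estimator of $(\beta_0, \beta_1)$: minimize
$$Q(\beta_0, \beta_1) = \sum_{i=1}^{H} \frac{(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1)^2}{V(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1)}$$
  where $V(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1) = \sigma_e^2 + \beta_1^2 V(a_i) - 2\beta_1 \mathrm{Cov}(a_i, b_i) + V(b_i)$.
• Let $w_i(\beta_1) = \{\sigma_e^2 + \beta_1^2 V(a_i) - 2\beta_1 \mathrm{Cov}(a_i, b_i) + V(b_i)\}^{-1}$. Then
$$\hat{\beta}_0 = \bar{x}_w - \hat{\beta}_1 \bar{y}_w \qquad (4)$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{H} w_i(\hat{\beta}_1)\{(\bar{y}_i - \bar{y}_w)(\bar{x}_{1i} - \bar{x}_w) - \mathrm{Cov}(a_i, b_i)\}}{\sum_{i=1}^{H} w_i(\hat{\beta}_1)\{(\bar{y}_i - \bar{y}_w)^2 - V(a_i)\}} \qquad (5)$$
  where $(\bar{x}_w, \bar{y}_w) = \{\sum_{i=1}^{H} w_i(\hat{\beta}_1)\}^{-1} \sum_{i=1}^{H} w_i(\hat{\beta}_1)(\bar{x}_{1i}, \bar{y}_i)$.
• Because $w_i(\hat{\beta}_1)$ depends on $\hat{\beta}_1$, this solution is obtained by an iterative algorithm.

Korea LFS Application
Parameter estimation
• Consider the method-of-moments estimator:
$$E\{(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1)^2 - \beta_1^2 V(a_i) + 2\beta_1 \mathrm{Cov}(a_i, b_i) - V(b_i)\} = \sigma_{e1i}^2$$
• Under the nested error model,
$$E\{(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1)^2 - \beta_1^2 V(a_i) + 2\beta_1 \mathrm{Cov}(a_i, b_i) - V(b_i)\} = \sigma_{e1}^2$$
• Following Fuller (2009),
$$\hat{\sigma}_{e1}^2 = \sum_{i=1}^{H} \kappa_i \left\{ (\bar{x}_{1i} - \hat{\beta}_0 - \bar{y}_i \hat{\beta}_1)^2 - \hat{\beta}_1^2 \hat{V}(a_i) + 2\hat{\beta}_1 \widehat{\mathrm{Cov}}(a_i, b_i) - \hat{V}(b_i) \right\} \qquad (6)$$
  where $\kappa_i \propto \{\sigma_{e1}^2 + \hat{\beta}_1^2 \hat{V}(a_i) - 2\hat{\beta}_1 \widehat{\mathrm{Cov}}(a_i, b_i) + \hat{V}(b_i)\}^{-1}$ and $\sum_{i=1}^{H} \kappa_i = 1$.
• We can also consider $\bar{e}_i \sim (0, \bar{Y}_i \sigma_{e1}^2)$.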
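
The estimating equations (4), (5), and (6) can be solved by alternating updates. The Python sketch below is one possible arrangement, not the authors' implementation: the function names, the starting value $\beta_1 = 1$, the fixed number of inner iterations, the convergence tolerance, and the truncation of a negative variance estimate at zero are all assumptions made for illustration.

```python
import numpy as np


def inv_var_weights(beta1, sigma2_e, var_a, var_b, cov_ab):
    """w_i(beta1): inverse variance of x_bar_1i - beta0 - beta1 * y_bar_i."""
    return 1.0 / (sigma2_e + beta1**2 * var_a - 2.0 * beta1 * cov_ab + var_b)


def update_beta(x_bar, y_bar, var_a, var_b, cov_ab, sigma2_e, beta1, n_inner=50):
    """Fixed-point iteration on (5), then (4). No guard for a non-positive
    denominator in (5), which can occur with few areas (assumption: it is positive)."""
    for _ in range(n_inner):
        w = inv_var_weights(beta1, sigma2_e, var_a, var_b, cov_ab)
        xw = np.sum(w * x_bar) / np.sum(w)
        yw = np.sum(w * y_bar) / np.sum(w)
        num = np.sum(w * ((y_bar - yw) * (x_bar - xw) - cov_ab))
        den = np.sum(w * ((y_bar - yw) ** 2 - var_a))
        beta1 = num / den
    # Recompute the weighted means at the final beta1 before applying (4).
    w = inv_var_weights(beta1, sigma2_e, var_a, var_b, cov_ab)
    xw = np.sum(w * x_bar) / np.sum(w)
    yw = np.sum(w * y_bar) / np.sum(w)
    return xw - beta1 * yw, beta1


def update_sigma2(x_bar, y_bar, var_a, var_b, cov_ab, beta0, beta1, sigma2_e):
    """Moment estimator (6) with kappa_i normalized to sum to one."""
    kappa = inv_var_weights(beta1, sigma2_e, var_a, var_b, cov_ab)
    kappa = kappa / np.sum(kappa)
    resid2 = (x_bar - beta0 - beta1 * y_bar) ** 2
    est = np.sum(kappa * (resid2 - beta1**2 * var_a
                          + 2.0 * beta1 * cov_ab - var_b))
    return max(float(est), 0.0)  # assumption: truncate negative estimates at 0


def fit_structural_model(x_bar, y_bar, var_a, var_b, cov_ab,
                         tol=1e-10, max_iter=200):
    """Alternate between (4)-(5) and (6), starting from sigma2_e = 0."""
    sigma2_e = 0.0
    beta0, beta1 = update_beta(x_bar, y_bar, var_a, var_b, cov_ab,
                               sigma2_e, beta1=1.0)
    for _ in range(max_iter):
        new_s2 = update_sigma2(x_bar, y_bar, var_a, var_b, cov_ab,
                               beta0, beta1, sigma2_e)
        new_b0, new_b1 = update_beta(x_bar, y_bar, var_a, var_b, cov_ab,
                                     new_s2, beta1)
        change = max(abs(new_s2 - sigma2_e), abs(new_b0 - beta0),
                     abs(new_b1 - beta1))
        beta0, beta1, sigma2_e = new_b0, new_b1, new_s2
        if change < tol:
            break
    return beta0, beta1, sigma2_e
```

All inputs are NumPy arrays of length $H$ (one entry per area); the sampling variances and covariances are taken as given, as on the slides.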

Korea LFS Application
Parameter estimation
• Iterative algorithm for parameter estimation:
  1 Compute the initial estimator of $(\beta_0, \beta_1)$ by setting $\hat{\sigma}_{e1}^2 = 0$ and using (4), (5).
  2 Using the current value of $(\hat{\beta}_0, \hat{\beta}_1)$, compute $\hat{\sigma}_{e1}^2$ using (6).
  3 Using the current value of $\hat{\sigma}_{e1}^2$, compute the updated estimator of $(\beta_0, \beta_1)$ using (4), (5).
  4 Repeat steps 2 and 3 until convergence.

Korea LFS Application
MSE estimation
• The actual prediction for $\bar{Y}_i$ is computed by $\hat{\bar{Y}}_{ei} = \hat{\bar{Y}}_i(\hat{\theta})$, where $\theta = (\beta_0, \beta_1, \sigma_{e1}^2)$.
$$MSE(\hat{\bar{Y}}_{ei}) = MSE(\hat{\bar{Y}}_i) + E\{(\hat{\bar{Y}}_{ei} - \hat{\bar{Y}}_i)^2\} = M_{1i} + M_{2i}$$
• Consider a jackknife approach:
$$\hat{M}_{2i} = \frac{H-1}{H} \sum_{k=1}^{H} (\hat{\bar{Y}}_i^{(-k)} - \hat{\bar{Y}}_i)^2$$
$$\hat{M}_{1i} = \hat{\alpha}_i^{(JK)} \hat{V}(a_i) + (1 - \hat{\alpha}_i^{(JK)}) \widehat{\mathrm{Cov}}(a_i, b_i)$$
  where $\hat{\alpha}_i^{(JK)} = \hat{\alpha}_i - \frac{H-1}{H} \sum_{k=1}^{H} (\hat{\alpha}_i^{(-k)} - \hat{\alpha}_i)$.

Korea LFS Application
Data analysis result
• Consider four estimates
  • KLF: only KLF
  • LALF: only LALF
  • GLS 1: combine KLF and LALF
  • GLS 2: combine KLF, LALF, and census data
• MSE

MSE      1st Q       Median      Mean        3rd Q
KLF      0.0000630   0.0001210   0.0002476   0.0002395
LALF     0.0001123   0.0001330   0.0001482   0.0001695
GLS 1    0.0000444   0.0000738   0.0000893   0.0001210
GLS 2    0.0000405   0.0000543   0.0000575   0.0000721

Discussion
Modeling
• Model specification was very difficult.
• We build models separately for urban and rural areas, which are assigned based on the proportion of households engaged in agricultural business.
• In the KLF survey, 25% of all areas have a 0 unemployment rate because the sample sizes of individual areas are very small.
• Areas with a 0 unemployment rate are excluded when the parameters are estimated.
• We have considered a structural model with a 0 intercept.
$$\bar{X}_{1i} = \beta_1 \bar{Y}_i + e_i$$
• A mixture model or a zero-inflated regression model can be considered.

Discussion
Estimation
• In the real data set, there is no estimate of the covariance term $\widehat{\mathrm{Cov}}(a_i, b_i)$, even though it is not 0.
• After calculating the covariance term, a problem remains: the covariance matrix for some areas is not positive definite.
• Thus a procedure for smoothing the covariance matrix is needed.
• Consider a reverse two-phase sampling design:
  • From the finite population $U$, we select the first-phase sample $A_1$ of size $n_1$.
  • We select the second-phase sample $A_2$ of size $n_2$ from $U - A_1$.
  • The final sample is $A = A_1 \cup A_2$, with size $n = n_1 + n_2$.

Discussion
Estimation
• In fact, the LALF survey sample is the KLF survey sample augmented by an additional sampling procedure.
• Using the properties of the reverse two-phase sampling design,
$$V(a_i) = \left(\frac{1}{n_a} - \frac{1}{N}\right) S_y^2 \cong \frac{1}{n_a} S_y^2$$
$$V(b_i) = \left(\frac{1}{n_b} - \frac{1}{N}\right) S_y^2 \cong \frac{1}{n_b} S_y^2$$
$$\mathrm{Cov}(a_i, b_i) = \left(\frac{1}{n_b} - \frac{1}{N}\right) S_y^2 \cong \frac{1}{n_b} S_y^2$$
• Sampling error variance
$$\begin{pmatrix} \hat{V}(a_i) & \widehat{\mathrm{Cov}}(a_i, b_i) \\ \widehat{\mathrm{Cov}}(a_i, b_i) & \hat{V}(b_i) \end{pmatrix} \cong \hat{V}(a_i) \begin{pmatrix} 1 & n_{ai}/n_{bi} \\ n_{ai}/n_{bi} & n_{ai}/n_{bi} \end{pmatrix}$$

Discussion
Future work
• The current MSE estimation formula does not account for the variance matrix smoothing procedure.
• To improve the approximation to asymptotic normality, we can consider a transformation of $\hat{X}_i$, $\hat{Y}_i$.
• A new MSE estimation formula for the transformed case is under investigation.

Discussion
Thank You!