2014 Korean local election prediction using small area estimation techniques and mixed-mode survey sampling

Jae Kwang Kim (1)
Department of Statistics, Iowa State University
July 14, 2014
(1) Joint work with Jongho Im and Joohan Kim

Motivation
- The 6th local election in Korea: June 4th, 2014.
- A broadcasting company, JTBC, wants to broadcast prediction results for this election right after the election is over.
- In addition, several election poll results were needed before the election for broadcast on the main news.
- In the last local election in Seoul, prediction errors for election polls were around 8%-12%.
- Very low response rate for telephone surveys (around 10%).
- The response mechanism is not missing at random (even after controlling for age and gender, there is a substantial difference between the respondents and the nonrespondents).

Motivation
- One alternative to telephone survey prediction is the exit poll:
  - Very accurate (less than 3% error)
  - Very expensive (2.6 million USD for a one-time survey)
- We propose another alternative approach:
  - More accurate than telephone surveys, but possibly less accurate than exit polls.
  - Cost effective (about the same cost as a telephone survey)
- Only 4 areas were considered: Seoul, Busan, Inchon, Gyeonggi.
- The total cost was 0.14 million USD for five surveys.
  - 1st survey: 5/9-5/12
  - 2nd survey: 5/14-5/17
  - 3rd survey: 5/22-5/23
  - 4th survey: 5/27-5/28
  - Prediction survey: 6/2-6/3

Map of South Korea
[Figure: map of South Korea]

SMART survey
- Semantic network, Missing data Analysis Research Team (SMART)
- Consortium of three companies:
  - Treum Institute
  - Hyundai Research
  - ID-INCU (Open Survey)
- A public opinion survey targets the whole population, but a prediction survey targets those who actually vote (the vote participation rate is usually about 60%).

Features
1. Mixed-mode survey: smart-phone app survey for the young generation, classical phone survey for the old generation.
2. Sampling: For the smart-phone app surveys, we used stratified propensity score sampling from a self-selected online panel sample. The propensity score sampling is based on propensity scores that incorporate the official demographic information.
3. Prediction: Small area estimation technique using empirical Bayes.

Mixed-mode survey
- Mixed-mode survey
  - Partition the population into two groups (below age 50, over age 50) to reflect the coverage of smart-phone usage.
  - For the middle age group (41-50), both modes are used.
- Advantage
  - Overcomes a drawback of telephone surveys: it is well known that young generations (age 19-39) are hard to reach with classical telephone surveys.
  - We can use extra profile information (e.g., household size, household monthly income).
  - Cost effective (a smart-phone app survey is cheaper than a telephone survey because it does not involve the cost of interviewers).

Smart-Phone App Survey
- Very convenient.
- Cost efficient.
- No bias associated with interviewer effect.
- Known to have less order effect.
- Quick response.

Smart-Phone App Survey
[Figure: Open Survey]

Sampling
Objective: Improve the representativeness of the sample
1. Raw data are collected from a volunteer (self-selected) sample.
2. Use the propensity score method to reduce the selection bias in the volunteer sample.
3. Collect information for nonresponse adjustment at the sample recruitment stage.

New sample design: Three-phase sampling
- Phase 1: Voluntary sample with sample recruitment
  - Collect information at the time of registration.
  - Compute propensity scores for the representativeness of the sample.
- Phase 2: Using the propensity scores and other demographic information, select a sample from the Phase 1 sample.
  - Self-weighting design
- Phase 3: Nonresponse adjustment
  - Monetary incentives to all respondents
  - Nonresponse propensity score adjustment

Propensity score sampling (for phase 2 sample)
1. Collect information at the time of sample registration.
   - To compute the propensity weights (known population totals).
   - To use for nonresponse adjustment.
2. Basic stratification based on demographic variables.
3. Select a sample (for actual surveys) using stratified systematic PPS sampling, where the measure of size is computed from the propensity weights.

Propensity weights
- Probability sampling
  - The first-order inclusion probabilities $\pi_i$ are known.
  - $\pi_i > 0$ for all elements in the population.
- Sampling for Internet surveys is usually non-probability sampling.
  - The first-order inclusion probabilities $\pi_i$ are unknown.
  - $\pi_i = 0$ for some elements in the population.
- Propensity weights: $\hat{\pi}_i$, estimated first-order inclusion probabilities

Weighting techniques
- Post-stratification:
  1. Partition the sample into several categories whose population sizes are known.
  2. The final weight satisfies $w_i \propto N_g/n_g$ if unit $i$ belongs to group $g$, where $N_g$ is the population size and $n_g$ is the sample size in group $g$.
- Raking-ratio estimation (or raking method): popular for handling multivariate categories with known marginal totals.

Auxiliary information
- Demographic variables (region/age/gender)
- Household size
- Years of education
- Occupation
- Household income
- House type

Auxiliary information
- Two types of auxiliary information
  - Household level information: household size, household income, house type
  - Individual level information: age, gender, years of education, occupation
- Two main problems
  - For household level information, known totals are summarized at the household level, whereas the sample itself provides individual level summaries. That is, "distribution for individual level ≠ distribution for household level".
  - Also, there is some item nonresponse among the auxiliary variables.
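The household-versus-individual mismatch described above can be illustrated with a short sketch. It assumes one common conversion, which the slides do not confirm as the method actually used: a household of size k contributes k individuals, so the individual-level share of a size category is proportional to (mean size in the category) × (household-level share). The function name, the input shares, and the mean size assigned to any open-ended top category are all hypothetical.

```python
from typing import Dict


def household_to_individual(
    hh_share: Dict[str, float], mean_size: Dict[str, float]
) -> Dict[str, float]:
    """Convert household-level shares of size categories to individual-level shares.

    Assumption (not from the slides): each household in category k holds
    mean_size[k] individuals, so the individual-level share is proportional
    to mean_size[k] * hh_share[k].
    """
    weighted = {k: hh_share[k] * mean_size[k] for k in hh_share}
    total = sum(weighted.values())
    return {k: v / total for k, v in weighted.items()}


if __name__ == "__main__":
    # Hypothetical household-level distribution, for illustration only.
    hh_share = {"1": 0.5, "2": 0.5}
    mean_size = {"1": 1.0, "2": 2.0}
    print(household_to_individual(hh_share, mean_size))
```

With equal household shares for one-person and two-person households, two thirds of individuals live in two-person households, which is why household-level control totals cannot be applied directly to an individual-level sample. A conversion like this needs only the size distribution; for other household variables, such as income, it would also need their joint distribution with household size.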
Auxiliary information
Example 1: Distribution of the household size

Size   Household Level (%)   Individual Level (%)
1      20                    8
2      22                    18
3      21                    22
4      27                    33
5      8                     13
6+     3                     6
       (Available)           (Derived)

We need to convert “household level” information to “individual level” information.

Auxiliary information
Example 2: Household income

Income range   Household Level (%)   Individual Level (%)
0-200          22
200-300        21
300-400        21
400-600        23
600+           12

Unless we know the joint distribution of the household income and the household size, we cannot convert “household level” information to “individual level” information.

Propensity Weighting
Household (HH) weight vs Individual weight
- Household weight: The amount that the sample HH represents the HH population.
- Individual weight: The amount that the sample individual represents the population of the individuals.
- We do not need household weights for estimation, but the HH weights may be needed to account for the HH population level information.
- From the HH weights ($\tilde{w}_h$) that incorporate the HH population level information, we can construct individual weights by
  $w_i = \tilde{w}_h \times M_h/m_h$,
  where $h$ is the index of the household that unit $i$ belongs to, $\tilde{w}_h$ is the household weight, $M_h$ is the HH size for household $h$, and $m_h$ is the number of individuals of the sample in household $h$.

Propensity Weighting
1. Compute the HH base weights ($d_i$): $d_i = N_g/n_g$, where groups are formed by age/gender/region.
2. Raking for household level information:
   $w_i^{(H)} = c_h \times (d_i/c_h) \times \dfrac{N_g}{\sum_{i \in A_g} d_i/c_h}$, where $c_h = M_h/m_h$.
   1. Compute the base HH weights from the base individual weights.
   2. Apply raking procedure for HH weights.
   3. Compute the individual weights from the HH weights.
3. Raking for individual level information

Propensity Weighting
Item nonresponse for household income

Income range   Pop'n (%)   Sample (%)   Modified Pop'n (%)
Don't Know                 19           19
0-200          22          19           22*0.81
200-300        21          20           21*0.81
300-400        21          17           21*0.81
400-600        23          18           23*0.81
600+           12          7            12*0.81

- If a sample unit is missing in HH income, then skip this raking step and keep the weight in the current step.
- If a sample unit responds in HH income, then apply the raking step with the modified population control total (last column) that accounts for the current missing rate.

Propensity Weighting for Nonresponse adjustment
- About 60% response rate
- Auxiliary information can be used to compute the propensity scores.
  - Auxiliary variables used for sampling: population totals should be known.
  - Other auxiliary variables can be obtained at the time of phase 1 sampling (sample registration).
- For election prediction, nonresponse adjustment is not used (Nonresponse = Not interested).

Final Prediction
Notation
- $h$: district
- $Y_h$: (true) vote outcome for party A
- $X_h$: vote outcome for party A in the latest local election
- $\hat{Y}_h$: direct estimate of $Y_h$
Prediction model
- Sampling Error Model: $\hat{Y}_h = Y_h + u_h$, $u_h \sim N(0, 0.25/n_h)$
- Structural Error Model: $Y_h = \gamma X_h + e_h$, $e_h \sim N(0, X_h \sigma_e^2)$

Final Prediction (Cont’d)
- Combining two models by Bayes theorem:
  $Y_h \mid (X_h, \hat{Y}_h) \sim N\left(\alpha_h \hat{Y}_h + (1 - \alpha_h)\gamma X_h,\ (1 - \alpha_h) X_h \sigma_e^2\right)$,
  where $\alpha_h = \dfrac{X_h \sigma_e^2}{X_h \sigma_e^2 + 0.25/n_h}$.
- Prediction incorporating the survey result and previous election result
- Final estimate:
  $\hat{Y}^* = \dfrac{\sum_h w_h \hat{Y}_h^*}{\sum_h w_h}$

Final Prediction (Cont’d)
Model parameter ($\gamma$, $\sigma_e^2$) estimation: solve $\bar{S}(\gamma, \sigma_e^2) = 0$, where
$$\bar{S}(\gamma, \sigma_e^2) = E\left[ S(\gamma, \sigma_e^2; X_h, Y_h) \mid X_h, \hat{Y}_h, \gamma, \sigma_e^2 \right] = \sum_h w_h \begin{pmatrix} (\hat{Y}_h^* - \gamma X_h)/\sigma_e^2 \\ -\dfrac{1}{2\sigma_e^2} + \dfrac{(\hat{Y}_h^* - \gamma X_h)^2 + (1 - \alpha_h) X_h \sigma_e^2}{2 X_h \sigma_e^4} \end{pmatrix}$$
and $\hat{Y}_h^* = \hat{\alpha}_h \hat{Y}_h + (1 - \hat{\alpha}_h)\hat{\gamma} X_h$.
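The composite predictor above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the values of $\gamma$ and $\sigma_e^2$ are supplied by hand rather than estimated, and the function names and example numbers are hypothetical.

```python
from typing import Sequence, Tuple


def shrinkage_weight(x_h: float, n_h: int, sigma2_e: float) -> float:
    """alpha_h = X_h*sigma_e^2 / (X_h*sigma_e^2 + 0.25/n_h)."""
    model_var = x_h * sigma2_e      # variance of the structural error e_h
    sampling_var = 0.25 / n_h       # variance of the sampling error u_h
    return model_var / (model_var + sampling_var)


def composite_prediction(
    y_hat: float, x_h: float, n_h: int, gamma: float, sigma2_e: float
) -> Tuple[float, float]:
    """Mean and variance of Y_h given (X_h, Y_hat_h) under the two models."""
    alpha = shrinkage_weight(x_h, n_h, sigma2_e)
    mean = alpha * y_hat + (1.0 - alpha) * gamma * x_h
    var = (1.0 - alpha) * x_h * sigma2_e
    return mean, var


def pooled_estimate(preds: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of district predictions: sum(w_h * Y*_h) / sum(w_h)."""
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical district: direct estimate 0.40, previous result 0.50,
    # sample size 100, gamma = 1.0 and sigma_e^2 = 0.005 (assumed values).
    mean, var = composite_prediction(0.40, 0.50, 100, 1.0, 0.005)
    print(mean, var)
```

In this hypothetical district the structural and sampling variances are both 0.0025, so $\alpha_h = 0.5$ and the prediction (about 0.45) lies halfway between the direct estimate and the previous-election value. A larger $n_h$ pushes $\alpha_h$ toward 1 and the prediction toward the direct estimate.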
The EM algorithm was used to estimate the parameters.

Final Prediction (Cont’d)
Table 1. Final prediction (%)

Area       Party   Prediction   True Outcome
Seoul      A       43.2         43.0
           B       55.6         56.1
Busan      A       46.3         50.7
           B       53.7         49.3
Inchon     A       46.4         50.0
           B       52.2         48.2
Gyeonggi   A       52.5         50.4
           B       47.5         49.6

Prediction results analysis
Table 2. Average Error Size (%)

Area       $\hat{Y}_h$   $\hat{Y}_h^*$
Seoul      5.5           4.1
Busan      7.3           6.0
Inchon     8.7           5.6
Gyeonggi   7.2           2.6

Average Error Size: $\frac{1}{H}\sum_{h=1}^{H} |\hat{Y}_h - Y_h|$ for the direct estimate and $\frac{1}{H}\sum_{h=1}^{H} |\hat{Y}_h^* - Y_h|$ for the composite prediction.

Prediction Results Analysis: Seoul
[Figure: Boxplot of error sizes in Seoul]

Prediction Results Analysis: Seoul (Cont’d)
[Figure: Map of error sizes in Seoul (Left: $\hat{Y}_h$, Right: $\hat{Y}_h^*$)]

Prediction Results Analysis: Gyeonggi
[Figure: Boxplot of error sizes in Gyeonggi]

Prediction Results Analysis: Gyeonggi (Cont’d)
[Figure: Map of error sizes in Gyeonggi (Left: $\hat{Y}_h$, Right: $\hat{Y}_h^*$)]

Conclusion
- Used smart-phone app surveys for the younger generation to obtain a better response rate and reduce nonresponse error.
- Used propensity score sampling to improve the representativeness of the self-selected panel sample.
  - Seoul, Gyeonggi: Large panel sample size
  - Inchon, Busan: Small panel sample size
- Used a small area estimation technique to incorporate the auxiliary information from the previous election.
- Vote participation behavior is difficult to predict.

Future Research
- Small area estimates may be quite efficient for predicting small areas, but the sum of the small area estimates does not necessarily lead to efficient estimation.
- May involve random effect models (or hierarchical models) to borrow strength for large domains.
- Extension to non-normal models: Parametric fractional imputation of Kim (2011) can be used to compute the conditional expectation involved in the BLUP.
- MSE estimation.

The end