2014 Korean local election prediction using small area
estimation techniques and mixed-mode survey sampling.
Jae Kwang Kim
Department of Statistics
Iowa State University
July 14, 2014
Joint work with Jongho Im and Joohan Kim
Motivation
The 6th local election in Korea: June 4th, 2014
JTBC, a broadcasting company, wanted to broadcast prediction results
for this election as soon as voting ended.
In addition, several pre-election poll results were needed for
broadcast on the main news.
In the last local election in Seoul, prediction errors for election polls
were around 8%-12%.
Very low response rate for telephone surveys (around 10%).
The response mechanism is not missing at random (even after
controlling for age and gender, there is a substantial difference between
the respondents and the nonrespondents).
Motivation
One alternative to telephone survey prediction is an exit poll:
Very accurate (less than 3% error)
Very expensive (2.6 million USD for a one-time survey)
We propose another alternative approach:
More accurate than telephone surveys, but possibly less accurate
than exit polls.
Cost effective (about the same cost as a telephone survey)
Only 4 areas were considered: Seoul, Busan, Inchon, Gyeonggi.
The total cost was 0.14 million USD for five surveys.
1st survey: 5/9-5/12
2nd survey: 5/14-5/17
3rd survey: 5/22-5/23
4th survey: 5/27-5/28
Prediction survey: 6/2-6/3
Map of South Korea
SMART survey
Semantic network, Missing data Analysis Research Team (SMART)
Consortium of three companies
Treum Institute
Hyundai Research
IDINCU (Open Survey)
A public opinion survey targets the whole population, whereas a prediction
survey targets actual voters (the vote participation rate is usually
about 60%).
Features
1. Mixed-mode survey: Smart-phone app survey for young generation,
classical phone survey for old generation.
2. Sampling: For the smart-phone app surveys, we used a stratified
propensity score sampling from an online panel sample which was
self-selected. The propensity score sampling is based on the
propensity scores incorporating the official demographic information.
3. Prediction: Small area estimation technique using empirical Bayes.
Mixed-mode survey
Partition the population into two groups (below age 50, over age 50)
to reflect the coverage of smart phone usage.
For the middle age group (41-50), both modes are used.
Advantage
Overcome drawback of telephone surveys: It is well known that the young
generation (age 19-39) is hard to reach through classical telephone
surveys.
We can use extra profile information (e.g., household size, household
monthly income).
Cost effective (a smart-phone app survey is cheaper than a telephone
survey because it involves no interviewer cost).
Smart-Phone App Survey
Very convenient. Cost efficient.
No bias associated with interviewer effect.
Known to have less order effect.
Quick response
Smart-Phone App Survey
Open Survey
Sampling
Objective: Improve the representativeness of the sample
1. Raw data collected from volunteer sample (self-selected)
2. Use propensity score method to reduce the selection bias in the
volunteer sample
3. Collect information for nonresponse adjustment in the sample
recruitment stage.
New sample design: Three-phase sampling
Phase 1: Voluntary sample with sample recruitment
collect information at the time of registration
compute propensity scores for the representativeness of the sample
Phase 2: Using the propensity scores and other demographic
information, select a sample from the Phase 1 sample.
self-weighting design
Phase 3: Nonresponse adjustment
Monetary Incentives to all respondents
Nonresponse propensity score adjustment
Propensity score sampling (for phase 2 sample)
1. Collect information at the time of sample registration.
   To compute the propensity weights. (known population totals)
   To use for nonresponse adjustment.
2. Basic stratification based on demographic variables
3. Select a sample (for actual surveys) using stratified systematic PPS
sampling where the measure of size is computed from the propensity
weights.
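The selection step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the slides do not say how the measure of size is derived from the propensity weights, so the sketch assumes it is simply the propensity weight 1/π̂i, and the function names and panel values are hypothetical.

```python
import random


def systematic_pps(sizes, n, rng):
    """Systematic PPS draw: return n unit indices, chance proportional to size."""
    total = sum(sizes)
    step = total / n                      # sampling interval
    start = rng.uniform(0, step)          # random start within the first interval
    points = [start + k * step for k in range(n)]
    selected, cum, k = [], 0.0, 0
    for i, size in enumerate(sizes):
        cum += size                       # unit i covers [cum - size, cum)
        while k < n and points[k] < cum:
            selected.append(i)            # a unit larger than step can repeat
            k += 1
    # Guard against floating-point round-off at the upper end.
    selected.extend([len(sizes) - 1] * (n - k))
    return selected


def stratified_pps(strata, n_per_stratum, seed=0):
    """strata: {name: list of sizes}; returns {name: selected indices}."""
    rng = random.Random(seed)
    return {name: systematic_pps(sizes, n_per_stratum[name], rng)
            for name, sizes in strata.items()}


# Hypothetical panel: measure of size = propensity weight 1 / pi_hat (assumed).
pi_hat = [0.5, 0.5, 0.5, 0.5, 1 / 12]
sizes = [1 / p for p in pi_hat]           # roughly [2, 2, 2, 2, 12]
sample = systematic_pps(sizes, 2, random.Random(1))
```

Panel members who are under-represented in the volunteer panel (small π̂i) get a large measure of size and are therefore more likely to be drawn, which is what pushes the phase 2 sample toward the population distribution.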
Propensity weights
Probability sampling
The first-order inclusion probabilities πi are known
πi > 0 for all elements in the population
Sampling for Internet survey is usually non-probability sampling.
The first-order inclusion probabilities πi are unknown
πi = 0 for some elements in the population
Propensity weights: π̂i , estimated first-order inclusion probabilities
Weighting techniques
Post-stratification:
1. Partition the sample into several categories with known population
sizes.
2. The final weight is proportional to
wi ∝ Ng/ng
if i belongs to group g, where Ng is the population size and ng is the
sample size in group g.
Raking-ratio estimation (or raking method): popular for handling
multivariate categories with known marginal totals
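Both weighting techniques above are short to write down. The following is a minimal sketch under stated assumptions: the sample, the category labels, and the marginal totals are made up for illustration, and every category is assumed to appear in the sample at least once.

```python
from collections import defaultdict


def poststratify(groups, pop_sizes):
    """Weight w_i = N_g / n_g for a unit in group g."""
    n = defaultdict(int)
    for g in groups:
        n[g] += 1
    return [pop_sizes[g] / n[g] for g in groups]


def rake(units, margins, max_iter=500, tol=1e-12):
    """Raking: cycle over variables, scaling weights to each known margin.

    units   : list of dicts, e.g. {"age": "19-39", "gender": "F"}
    margins : {variable: {category: known population total}}
    """
    weights = [1.0] * len(units)
    for _ in range(max_iter):
        worst = 0.0
        for var, targets in margins.items():
            current = defaultdict(float)
            for u, w in zip(units, weights):
                current[u[var]] += w
            for k, u in enumerate(units):
                factor = targets[u[var]] / current[u[var]]
                weights[k] *= factor
                worst = max(worst, abs(factor - 1.0))
        if worst < tol:                    # all margins already matched
            break
    return weights


# Hypothetical sample and margins, for illustration only.
units = [{"age": "19-39", "gender": "F"}, {"age": "19-39", "gender": "M"},
         {"age": "40+", "gender": "F"}, {"age": "40+", "gender": "M"},
         {"age": "19-39", "gender": "F"}]
margins = {"age": {"19-39": 60.0, "40+": 40.0},
           "gender": {"F": 50.0, "M": 50.0}}
w = rake(units, margins)
ps = poststratify(["a", "a", "b"], {"a": 100.0, "b": 50.0})
```

Post-stratification needs the full cross-classified population counts, while raking only needs each margin separately, which is why it suits the multivariate case.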
Auxiliary information
Demographic variables (region/age/gender)
Household size
Years of education
Occupation
Household income
House type
Auxiliary information
Two types of auxiliary information
Household level information: Household size, Household income, House
type
Individual level information: Age, Gender, years of education,
occupation
Two main problems
For household level information, known totals are summarized in the
household level. However, the sample itself provides individual level
summary. That is, the “distribution for individual level ≠ distribution
for household level”.
Also, there is some level of item nonresponse among auxiliary variables.
Auxiliary information
Example 1: Distribution of the household size
Size   Household Level (%)   Individual Level (%)
1      20                    8
2      22                    18
3      21                    22
4      27                    33
5      8                     13
6+     3                     6
       (Available)           (Derived)
We need to convert “household level” information to “individual
level” information.
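One natural way to do this conversion is to weight each size class by its household size, since a household of size s contributes s individuals. The sketch below assumes that approach; the slides do not state how the derived column was obtained, so it is not claimed to reproduce those figures. The function name is hypothetical, and a top-coded class such as 6+ needs an assumed mean size.

```python
def household_to_individual(hh_share, mean_size):
    """Convert household-level shares to individual-level shares.

    A household of size s contributes s individuals, so the individual
    share of a size class is proportional to (mean size) x (household share).
    """
    mass = {k: mean_size[k] * share for k, share in hh_share.items()}
    total = sum(mass.values())
    return {k: m / total for k, m in mass.items()}


# Toy input (not the slide's data): half the households have 1 person, half 2.
toy = household_to_individual({"1": 0.5, "2": 0.5}, {"1": 1, "2": 2})
```

In the toy case one third of individuals live alone even though half of all households are single-person, which is the gap between the two columns.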
Auxiliary information
Example 2: Household income
Income range   Household Level (%)   Individual Level (%)
0-200          22                    -
200-300        21                    -
300-400        21                    -
400-600        23                    -
600+           12                    -
Unless we know the joint distribution of the household income and
the household size, we cannot convert “household level” information
to “individual level” information.
Propensity Weighting
Household (HH) weight vs Individual weight
Household weight: The number of households in the HH population that a
sample HH represents.
Individual weight: The number of individuals in the population that a
sample individual represents.
We do not need household weights for estimation, but the HH weights
may be needed to account for the HH population level information.
From the HH weights (w̃h ) that incorporate the HH population level
information, we can construct individual weights by
wi = w̃h × Mh /mh
where h is the index of household that unit i belongs to, w̃h is the
household weight, Mh is the HH size for house h, and mh is the
number of individuals of the sample in household h.
Propensity Weighting
1. Compute the HH base weights ($d_i$):
$$d_i = \frac{N_g}{n_g}$$
where groups are formed by age/gender/region.
2. Raking for household level information:
$$w_i^{(H)} = c_h \times (d_i / c_h) \times \frac{N_g}{\sum_{i \in A_g} d_i / c_h}$$
where $c_h = M_h / m_h$.
   1. Compute the base HH weights from the base individual weights
   2. Apply raking procedure for HH weights
   3. Compute the individual weights from the HH weights
3. Raking for individual level information
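The three sub-steps of the household-level adjustment can be sketched for a single household-level margin as follows. This is an illustration only: the function name, the group labels, and the totals are hypothetical, and the sum in the denominator runs over sampled individuals in group g.

```python
from collections import defaultdict


def household_adjusted_weights(d, M, m, group, N):
    """Individual weights after one household-level ratio adjustment.

    d     : base weights d_i, one per sampled individual
    M, m  : household size M_h and sampled members m_h of each individual's household
    group : household-level group g of each individual (e.g. a size class)
    N     : {g: known household-level population total N_g}
    """
    c = [Mi / mi for Mi, mi in zip(M, m)]            # c_h = M_h / m_h
    hh_base = [di / ci for di, ci in zip(d, c)]      # step 1: HH base weights
    denom = defaultdict(float)
    for g, wh in zip(group, hh_base):
        denom[g] += wh
    # Step 2: scale HH weights so each group reproduces its known total.
    hh_w = [wh * N[g] / denom[g] for g, wh in zip(group, hh_base)]
    return [ci * wh for ci, wh in zip(c, hh_w)]      # step 3: back to individuals


# Hypothetical inputs for illustration.
w = household_adjusted_weights(
    d=[10.0, 10.0, 10.0], M=[1, 1, 4], m=[1, 1, 1],
    group=["size1", "size1", "size4"], N={"size1": 30.0, "size4": 5.0})
```

With several household-level variables, step 2 would be repeated across the margins in a raking loop before returning to individual weights.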
Propensity Weighting
Item nonresponse for household income
Income range   Pop’n (%)   Sample (%)   Modified Pop’n (%)
Don’t Know     -           19           19
0-200          22          19           22*0.81
200-300        21          20           21*0.81
300-400        21          17           21*0.81
400-600        23          18           23*0.81
600+           12          7            12*0.81
If a sample unit is missing in HH income, then skip this raking step
and keep the weight in the current step.
If a sample unit responds in HH income, then apply the raking step
with modified population control total (last column) that accounts for
the current missing rate.
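A minimal sketch of this raking step, under one assumption that the slides leave open: the missing rate is taken as the weighted share of nonrespondents under the current weights (the table uses the sample percentage, 19%). The function name and data are hypothetical.

```python
def rake_income_with_missing(weights, income, pop_share):
    """One raking step on household income that skips item nonrespondents.

    weights   : current weights
    income    : income category per unit, or None if missing ("Don't Know")
    pop_share : {category: population share}, shares summing to 1
    """
    total = sum(weights)
    missing = sum(w for w, c in zip(weights, income) if c is None)
    respond_rate = 1.0 - missing / total             # 0.81 in the table above
    # Modified control totals: population share scaled by the response rate.
    target = {c: s * respond_rate * total for c, s in pop_share.items()}
    current = {c: 0.0 for c in pop_share}
    for w, c in zip(weights, income):
        if c is not None:
            current[c] += w
    # Units with missing income keep their current weight.
    return [w if c is None else w * target[c] / current[c]
            for w, c in zip(weights, income)]


# Hypothetical data: five units, one with missing income.
new_w = rake_income_with_missing(
    [1.0] * 5, ["low", "low", "low", "high", None], {"low": 0.5, "high": 0.5})
```

Scaling the control totals by the response rate keeps the overall weight total unchanged, so the nonrespondents are neither dropped nor inflated by this step.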
Propensity Weighting for Nonresponse adjustment
About 60% Response rate
Auxiliary information can be used to compute the propensity scores
Auxiliary variables used for sampling: Population totals should be
known.
Other auxiliary variables can be obtained at the time of phase 1
sampling. (sample registration)
For election prediction, nonresponse adjustment is not used
(Nonresponse = Not interested )
Final Prediction
Notation
h: district
Yh : (true) vote outcome for party A
Xh : vote outcome for party A in the latest local election.
Ŷh : direct estimate of Yh
Prediction model
Sampling Error Model
$$\hat{Y}_h = Y_h + u_h, \quad u_h \sim N(0, 0.25/n_h)$$
Structural Error Model
$$Y_h = \gamma X_h + e_h, \quad e_h \sim N(0, X_h \sigma_e^2)$$
Final Prediction (Cont’d)
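Combining two models by Bayes theorem:
$$Y_h \mid (X_h, \hat{Y}_h) \sim N\left(\alpha_h \hat{Y}_h + (1 - \alpha_h)\gamma X_h,\ (1 - \alpha_h) X_h \sigma_e^2\right)$$
where
$$\alpha_h = \frac{X_h \sigma_e^2}{X_h \sigma_e^2 + 0.25/n_h}$$
Prediction incorporating the survey result and previous election result
Final estimate
$$\hat{Y}^* = \frac{\sum_h w_h \hat{Y}_h^*}{\sum_h w_h}$$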
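The shrinkage predictor and the weighted final estimate can be written directly from the two models. The sketch below treats γ and σe² as given; the function names and the district values are hypothetical, and the district weights wh are left as an input because the slides do not define them.

```python
def eb_predict(y_hat, x, n, gamma, sigma2):
    """Empirical Bayes prediction for one district.

    Returns (alpha_h, posterior mean Y*_h, posterior variance).
    """
    prior_var = x * sigma2            # Var(e_h) = X_h * sigma_e^2
    samp_var = 0.25 / n               # Var(u_h); p(1-p)/n_h evaluated at p = 0.5
    alpha = prior_var / (prior_var + samp_var)
    mean = alpha * y_hat + (1.0 - alpha) * gamma * x
    var = (1.0 - alpha) * prior_var
    return alpha, mean, var


def combine(predictions, w):
    """Weighted average sum_h w_h Y*_h / sum_h w_h over districts."""
    return sum(wi * p for wi, p in zip(w, predictions)) / sum(w)


# Hypothetical district: direct estimate 0.60 from n = 100, previous share 0.50.
alpha, y_star, post_var = eb_predict(y_hat=0.60, x=0.50, n=100,
                                     gamma=1.0, sigma2=0.005)
area_estimate = combine([0.4, 0.6], [1.0, 3.0])
```

In this example the two variances are equal, so αh is 0.5 and the prediction lands halfway between the survey estimate and the regression on the previous election. A larger district sample nh moves αh toward 1 and the prediction toward the direct estimate.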
Final Prediction (Cont’d)
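Model parameter ($\gamma, \sigma_e^2$) estimation: $\bar{S}(\gamma, \sigma_e^2) = 0$
$$\bar{S}(\gamma, \sigma_e^2) = E\left[ S(\gamma, \sigma_e^2; X_h, Y_h) \mid X_h, \hat{Y}_h, \gamma, \sigma_e^2 \right]$$
$$= \sum_h w_h \begin{pmatrix} (\hat{Y}_h^* - \gamma X_h)/\sigma_e^2 \\ -\dfrac{1}{2\sigma_e^2} + \dfrac{(\hat{Y}_h^* - \gamma X_h)^2 + (1 - \alpha_h) X_h \sigma_e^2}{2 X_h \sigma_e^4} \end{pmatrix}$$
where $\hat{Y}_h^* = \hat{\alpha}_h \hat{Y}_h + (1 - \hat{\alpha}_h)\hat{\gamma} X_h$.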
Used EM algorithm to estimate the parameters.
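A minimal sketch of such an EM iteration, assuming closed-form M-step updates obtained by setting the two components of the expected score to zero. The starting values, the stopping rule, and the simulated districts are illustrative assumptions, not the authors' settings or data.

```python
import random


def em_fit(y_hat, x, n, w, gamma=1.0, sigma2=0.05, max_iter=5000, tol=1e-12):
    """EM for (gamma, sigma_e^2) in the sampling + structural error model."""
    for _ in range(max_iter):
        # E-step: conditional mean and variance of Y_h given current parameters.
        alpha = [xi * sigma2 / (xi * sigma2 + 0.25 / ni) for xi, ni in zip(x, n)]
        y_star = [a * yh + (1 - a) * gamma * xi
                  for a, yh, xi in zip(alpha, y_hat, x)]
        cond_var = [(1 - a) * xi * sigma2 for a, xi in zip(alpha, x)]
        # M-step: solve the expected score equations in closed form.
        new_gamma = (sum(wi * ys for wi, ys in zip(w, y_star))
                     / sum(wi * xi for wi, xi in zip(w, x)))
        new_sigma2 = sum(wi * ((ys - new_gamma * xi) ** 2 + v) / xi
                         for wi, ys, xi, v in zip(w, y_star, x, cond_var)) / sum(w)
        done = abs(new_gamma - gamma) + abs(new_sigma2 - sigma2) < tol
        gamma, sigma2 = new_gamma, new_sigma2
        if done:
            break
    return gamma, sigma2


# Simulated districts (hypothetical values, not the election data).
rng = random.Random(2014)
H, true_gamma, true_sigma2 = 300, 0.9, 0.01
x = [rng.uniform(0.3, 0.7) for _ in range(H)]
n = [100] * H
y = [true_gamma * xi + rng.gauss(0, (xi * true_sigma2) ** 0.5) for xi in x]
y_hat = [yi + rng.gauss(0, (0.25 / ni) ** 0.5) for yi, ni in zip(y, n)]
gamma_hat, sigma2_hat = em_fit(y_hat, x, n, w=[1.0] * H)
```

The conditional variance term in the σe² update is what separates EM from simply plugging the predicted values into the complete-data estimator; dropping it would bias σe² downward.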
Final Prediction (Cont’d)
Table 1. Final prediction (%)
Area       Party   Prediction   True Outcome
Seoul      A       43.2         43.0
           B       55.6         56.1
Busan      A       46.3         50.7
           B       53.7         49.3
Inchon     A       46.4         50.0
           B       52.2         48.2
Gyeonggi   A       52.5         50.4
           B       47.5         49.6
Prediction results analysis
Table 2. Average Error Size (%)
Area       Ŷh    Ŷh∗
Seoul      5.5   4.1
Busan      7.3   6.0
Inchon     8.7   5.6
Gyeonggi   7.2   2.6
Average Error Size
$$\frac{1}{H}\sum_{h=1}^{H} |\hat{Y}_h - Y_h|, \qquad \frac{1}{H}\sum_{h=1}^{H} |\hat{Y}_h^* - Y_h|$$
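The error measure in Table 2 is a mean absolute error over the H districts of an area. A short sketch, with a hypothetical function name and made-up district values:

```python
def average_error_size(estimates, outcomes):
    """(1/H) * sum_h |estimate_h - outcome_h| over H districts."""
    return sum(abs(e - y) for e, y in zip(estimates, outcomes)) / len(outcomes)


# Hypothetical district-level values in percent, for illustration only.
err = average_error_size([50.0, 40.0], [45.0, 50.0])
```

Applying the same function to the direct estimates and to the empirical Bayes predictions gives the two columns being compared.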
Prediction Results Analysis: Seoul
Boxplot of error sizes in Seoul
Prediction Results Analysis: Seoul (Cont’d)
Map of error sizes in Seoul (Left: Ŷh , Right: Ŷh∗ )
Prediction Results Analysis: Gyeonggi
Boxplot of error sizes in Gyeonggi
Prediction Results Analysis: Gyeonggi (Cont’d)
Map of error sizes in Gyeonggi (Left: Ŷh , Right: Ŷh∗ )
Conclusion
Used smart-phone app surveys for the younger generation to obtain a better
response rate and reduce nonresponse error.
Used propensity score sampling to improve the representativeness of
the self-selected panel sample.
Seoul, Gyeonggi: Large panel sample size
Inchon, Busan: Small panel sample size
Used a small area estimation technique to incorporate auxiliary
information from the previous election.
Voter turnout behavior is difficult to predict.
Future Research
Small area estimates may be quite efficient for predicting small areas,
but the sum of the small area estimates does not necessarily lead to
efficient estimation. This may involve random effect models (or
hierarchical models) to borrow strength for large domains.
Extension to non-normal models: Parametric fractional imputation of
Kim (2011) can be used to compute the conditional expectation
involved in the BLUP.
MSE estimation.
The end