Small Area Estimation Combining Information from Several Sources Jan 27, 2012

Small Area Estimation
Combining Information from Several Sources
Seunghwan Park and Jae-kwang Kim
Jan 27, 2012
Outline
• Introduction
• Basic Theory
• Application to Korea LFS
• Discussion
Seunghwan Park and Jae-kwang Kim ()
Survey Sampling
Jan 27, 2012
Introduction
• Small area estimation: the goal is to provide reliable estimates for areas with insufficient sample sizes.
• The sample is not designed to give accurate direct estimators for the domains: some domains have few or no sample observations.
• Idea: a model can be used to borrow strength from other sources of information.
Introduction
• Motivation: we want to combine several sources of information to obtain improved small area estimates.
• How can the direct estimators be improved using auxiliary variables
• from other independent survey data?
• from census data or administrative data?
• In our study,
• Area-level model approach,
• Several sources of auxiliary information,
• A measurement error model,
• A Generalized Least Squares (GLS) method.
Introduction
• General Setup
• Variable of interest: $Y_i$
• Survey A: directly compute $\hat{Y}_i$, subject to sampling error.
• Survey B: compute $\hat{X}_{i1}$, subject to sampling error.
• Census: measures $\hat{X}_{i2}$.
• $E_A(\hat{Y}_i) \neq E_B(\hat{X}_{i1})$ due to the structural differences between the surveys.
• Structural differences (or systematic difference)
• due to different mode of survey
• due to time difference
• due to frame difference
• Goal: Improve estimation of Yi by incorporating various types of auxiliary
information.
Basic Theory
• Two error models (for area i)
• Sampling error model
$$\hat{Y}_{i,a} = Y_i + a_i, \qquad \hat{X}_{i,b} = X_i + b_i,$$
where $(a_i, b_i)$ represents the sampling error such that
$$\begin{pmatrix} a_i \\ b_i \end{pmatrix} \sim \left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} V(a_i) & \mathrm{Cov}(a_i, b_i) \\ \mathrm{Cov}(a_i, b_i) & V(b_i) \end{pmatrix} \right]$$
• Structural error model
$$X_i = \beta_0 + \beta_1 Y_i + e_i, \qquad e_i \sim (0, \sigma_{ei}^2)$$
Basic Theory
• The structural error model describes the relationship between the two survey measurements up to sampling error.
• Y : target measurement item (variable of primary interest)
• X : inaccurate measurement of Y with possible systematic bias.
• If both X and Y measure the same item (with different survey modes), structural
error model is essentially a measurement error model. (β0 = 0, β1 = 1 means no
measurement bias.)
• Why consider Xi = β0 + β1 Yi + ei instead of Yi = β0 + β1 Xi + ei ? : We want to
treat Yi fixed rather than treating Xi fixed.
Basic Theory
• If the parameters in the structural error model are known, $\hat{Y}_{i,b} \equiv \beta_1^{-1}(\hat{X}_{i,b} - \beta_0)$ is also an unbiased estimator of $Y_i$, computed from survey B. The estimator $\hat{Y}_{i,b}$ using a consistent $(\hat\beta_0, \hat\beta_1)$ is often called the synthetic estimator.
• Two main issues:
• Prediction of $Y_i$: GLS (or GMM) approach.
• Parameter estimation: use the theory of measurement error models.
Basic Theory
Prediction
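• Recall the GLS method:
$$y = Z\theta + e, \qquad e \sim (0, V)$$
$$\hat\theta_{GLS} = (Z' V^{-1} Z)^{-1} Z' V^{-1} y$$
• GLS approach to combine the two error models:
$$\begin{pmatrix} \hat{Y}_{i,a} \\ \beta_1^{-1}(\hat{X}_{i,b} - \beta_0) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} Y_i + \begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix}$$
where $u_{1i} = a_i$ and $u_{2i} = \beta_1^{-1}(b_i + e_i)$. Thus,
$$\begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix} \sim \left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} V(a_i) & \beta_1^{-1}\mathrm{Cov}(a_i, b_i) \\ \beta_1^{-1}\mathrm{Cov}(a_i, b_i) & \beta_1^{-2}(V(b_i) + \sigma_{ei}^2) \end{pmatrix} \right]$$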
Basic Theory
Prediction
• GLS estimator: the best linear unbiased estimator of $Y_i$ based on a linear combination of $\hat{Y}_{i,a}$ and $\hat{Y}_{i,b} = \beta_1^{-1}(\hat{X}_{i,b} - \beta_0)$.
• Under the current setup,
$$\hat{Y}_i^* = w_i \hat{Y}_{i,a} + (1 - w_i)\hat{Y}_{i,b}$$
where
$$w_i = \frac{\sigma_{ei}^2 + V(b_i) - \beta_1 \mathrm{Cov}(a_i, b_i)}{\sigma_{ei}^2 + \beta_1^2 V(a_i) + V(b_i) - 2\beta_1 \mathrm{Cov}(a_i, b_i)}$$
• The GLS estimator is sometimes called the composite estimator. In practice, we need to use $\hat\beta_0$, $\hat\beta_1$, and $\hat\sigma_{ei}^2$.
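As an illustration of the composite form, the following is a minimal sketch (not the authors' code) that computes the weight w_i and checks that the weighted combination agrees with the matrix GLS formula with Z = (1, 1)'. The function names and all numerical inputs are hypothetical, and the structural error e_i is assumed to be uncorrelated with the sampling errors.

```python
import numpy as np

def composite_weight(beta1, sigma2_e, v_a, v_b, cov_ab):
    """Weight w_i placed on the direct estimator in the composite estimator."""
    num = sigma2_e + v_b - beta1 * cov_ab
    den = sigma2_e + beta1**2 * v_a + v_b - 2.0 * beta1 * cov_ab
    return num / den

def composite_estimate(y_a, x_b, beta0, beta1, sigma2_e, v_a, v_b, cov_ab):
    """w_i * Y_hat_{i,a} + (1 - w_i) * Y_hat_{i,b}."""
    w = composite_weight(beta1, sigma2_e, v_a, v_b, cov_ab)
    y_b = (x_b - beta0) / beta1          # synthetic estimator from survey B
    return w * y_a + (1.0 - w) * y_b

def gls_estimate(y_a, x_b, beta0, beta1, sigma2_e, v_a, v_b, cov_ab):
    """Same quantity from (Z'V^{-1}Z)^{-1} Z'V^{-1} y with Z = (1, 1)'."""
    z = np.ones(2)
    y = np.array([y_a, (x_b - beta0) / beta1])
    V = np.array([[v_a, cov_ab / beta1],
                  [cov_ab / beta1, (v_b + sigma2_e) / beta1**2]])
    vinv_z = np.linalg.solve(V, z)       # V^{-1} Z (V is symmetric)
    return float(vinv_z @ y / (vinv_z @ z))

# Hypothetical inputs for one area.
args = dict(y_a=0.04, x_b=0.05, beta0=0.01, beta1=1.2,
            sigma2_e=1e-4, v_a=4e-4, v_b=1e-4, cov_ab=5e-5)
print(composite_estimate(**args), gls_estimate(**args))
```

The two printed values should coincide up to floating-point error, since the weight formula is the closed form of the 2x2 GLS solution.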
Basic Theory
Parameter estimation
• Parameter estimation for the structural model
• Case 1: Matching measurement X and measurement Y is possible (e.g., two-phase sampling, where the Survey A sample is a subset of the Survey B sample).
• Case 2: Matching is not possible.
• In case 1, we can easily obtain a consistent estimator of the model parameters from the set of units that have both X and Y observed. (Unit-level modeling)
• In case 2, we may use an area-level model to link $\hat{X}_i$ (from survey B) and $\hat{Y}_i$ (from survey A) at the area level.
Basic Theory
Parameter estimation
• The area-level model takes the form of a measurement error model (Fuller, 1987):
$$\hat{X}_i = \beta_0 + \beta_1 Y_i + e_i + b_i, \qquad \hat{Y}_i = Y_i + a_i$$
• Parameter estimation can be performed using estimation methods for measurement error models.
Korea LFS Application
• Labor Force Survey: a very important economic survey used to estimate unemployment rates.
• Several sources of information on unemployment in Korea
1 Korean Labor Force Survey (KLF) data - 7K sample households (monthly)
2 Local Area Labor Force Survey (LALF) data - 200K sample households (quarterly)
3 Census data (10% of the population)
• KLF sample is nested within LALF sample.
Korea LFS Application
• The unemployment rate for each small area is the parameter of interest.
• Several sources of information on unemployment for analysis district (area) i:
• $\hat{Y}_i$: estimates from KLF data
• $\hat{X}_{1i}$: estimates from LALF data
• $\hat{X}_{2i}$: estimates from census data
• KLF: sampling error ↑, measurement error ↓.
• LALF: sampling error ↓, measurement error ↑.
• Census data: sampling error ↓, measurement error ↑ (no updated information).
Korea LFS Application
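• First, we can construct a structural error model for area i in terms of population means:
$$\bar{X}_{1i} = \beta_0 + \beta_1 \bar{Y}_i + \bar{e}_{1i}, \qquad (1)$$
where $(\bar{X}_{1i}, \bar{Y}_i, \bar{e}_{1i}) = N_i^{-1} \sum_{j \in U_i} (x_{1j}, y_j, e_{1j})$ and $\bar{e}_{1i} \sim (0, \sigma_{e1}^2 / N_i)$.
• Consider a nested error model: for unit j in area i,
$$e_{1j} = \epsilon_i + u_j, \qquad \epsilon_i \sim (0, \sigma_{e1}^2), \quad u_j \sim (0, \sigma_u^2),$$
then $\bar{e}_{1i} \sim (0, \sigma_{e1}^2 + \sigma_u^2 / N_i)$.
• Since $N_i$ is often quite large, we can assume $\bar{e}_{1i} \sim (0, \sigma_{e1}^2)$.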
Korea LFS Application
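• Sampling error model
$$\begin{pmatrix} \hat{Y}_i \\ \hat{X}_{1i} \end{pmatrix} = \begin{pmatrix} Y_i \\ X_{1i} \end{pmatrix} + \begin{pmatrix} N_i a_i \\ N_i b_i \end{pmatrix} \qquad (2)$$
• Combining (1) and (2),
$$\begin{pmatrix} \bar{y}_i \\ \bar{x}_{1i} - \beta_0 \end{pmatrix} = \begin{pmatrix} 1 \\ \beta_1 \end{pmatrix} \bar{Y}_i + \begin{pmatrix} a_i \\ b_i + \bar{e}_{1i} \end{pmatrix} \qquad (3)$$
where $(\bar{x}_{1i}, \bar{y}_i) = N_i^{-1}(\hat{X}_{1i}, \hat{Y}_i)$.
• The variance-covariance matrix $V_i$ of $(a_i, b_i + \bar{e}_{1i})'$ is
$$V_i = \begin{pmatrix} V(a_i) & \mathrm{Cov}(a_i, b_i) \\ \mathrm{Cov}(a_i, b_i) & V(b_i) + \sigma_{e1}^2 \end{pmatrix}$$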
Korea LFS Application
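• GLS estimator
$$\hat{Y}_i^{GLS} = \{(1, \beta_1) V_i^{-1} (1, \beta_1)'\}^{-1} (1, \beta_1) V_i^{-1} (\bar{y}_i, \bar{x}_{1i} - \beta_0)'$$
where $V_i$ is the variance-covariance matrix of $(a_i, b_i + \bar{e}_{1i})'$.
• The GLS estimator can be expressed in composite estimator form
$$\hat{Y}_i^{comp} = \alpha_i \bar{y}_i + (1 - \alpha_i)\tilde{y}_i$$
where $\tilde{y}_i = \beta_1^{-1}(\bar{x}_{1i} - \beta_0)$ is called the synthetic estimator and
$$\alpha_i = \frac{V(\tilde{y}_i) - \mathrm{Cov}(\bar{y}_i, \tilde{y}_i)}{V(\bar{y}_i) + V(\tilde{y}_i) - 2\mathrm{Cov}(\bar{y}_i, \tilde{y}_i)}$$
• Ignoring the effect of estimating $\beta$,
$$V(\hat{Y}_i^{comp} - \bar{Y}_i) = \alpha_i V(\bar{y}_i) + (1 - \alpha_i)\mathrm{Cov}(\bar{y}_i, \tilde{y}_i)$$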
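The variance expression for the composite estimator can be checked with a short derivation. This is a sketch that reads $V(\cdot)$ and $\mathrm{Cov}(\cdot,\cdot)$ as moments of the prediction errors $\bar{y}_i - \bar{Y}_i$ and $\tilde{y}_i - \bar{Y}_i$; the shorthand $v_1$, $v_2$, $c$ is introduced here for illustration only.

```latex
% Shorthand (not in the slides):
%   v_1 = V(\bar{y}_i), \quad v_2 = V(\tilde{y}_i), \quad c = \mathrm{Cov}(\bar{y}_i, \tilde{y}_i).
\begin{align*}
\hat{Y}_i^{comp} - \bar{Y}_i
  &= \alpha_i(\bar{y}_i - \bar{Y}_i) + (1-\alpha_i)(\tilde{y}_i - \bar{Y}_i), \\
V(\hat{Y}_i^{comp} - \bar{Y}_i)
  &= \alpha_i\{\alpha_i v_1 + (1-\alpha_i)c\}
   + (1-\alpha_i)\{\alpha_i c + (1-\alpha_i)v_2\}. \\
\intertext{The weight $\alpha_i = (v_2 - c)/(v_1 + v_2 - 2c)$ solves
$\alpha_i(v_1 - c) = (1-\alpha_i)(v_2 - c)$, so the two braces are equal:}
\alpha_i v_1 + (1-\alpha_i)c &= \alpha_i c + (1-\alpha_i)v_2, \\
V(\hat{Y}_i^{comp} - \bar{Y}_i)
  &= \alpha_i v_1 + (1-\alpha_i)c
   = \frac{v_1 v_2 - c^2}{v_1 + v_2 - 2c}.
\end{align*}
```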
Korea LFS Application
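• We can also incorporate the census data. Then (3) becomes
$$\begin{pmatrix} \bar{y}_i \\ \bar{x}_{1i} - \beta_0 \\ \bar{x}_{2i} - \gamma_0 \end{pmatrix} = \begin{pmatrix} 1 \\ \beta_1 \\ \gamma_1 \end{pmatrix} \bar{Y}_i + \begin{pmatrix} a_i \\ b_i + \bar{e}_{1i} \\ \bar{e}_{2i} \end{pmatrix}$$
• The whole process is similar to the case of combining two surveys.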
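The three-source case can be sketched with a generic GLS routine for a scalar target. This is an illustrative sketch, not the authors' implementation: the function name `gls_combine`, the numerical values, and the assumption that the census error is uncorrelated with the sampling errors (zero off-diagonal entries in the third row and column) are all hypothetical.

```python
import numpy as np

def gls_combine(obs, z, V):
    """GLS estimate of a scalar Ybar from obs = z * Ybar + error, error ~ (0, V).

    Returns the estimate and its variance {z' V^{-1} z}^{-1}.
    """
    obs = np.asarray(obs, dtype=float)
    z = np.asarray(z, dtype=float)
    V = np.asarray(V, dtype=float)
    vinv_z = np.linalg.solve(V, z)   # V^{-1} z, valid because V is symmetric
    info = vinv_z @ z                # z' V^{-1} z
    return float(vinv_z @ obs / info), float(1.0 / info)

# Hypothetical values for one area.
beta0, beta1 = 0.005, 1.2            # survey B structural parameters
gamma0, gamma1 = 0.002, 0.9          # census structural parameters
ybar, xbar1, xbar2 = 0.040, 0.055, 0.036
V = np.array([[4.0e-4, 5.0e-5, 0.0],       # V(a_i), Cov(a_i, b_i), 0
              [5.0e-5, 2.0e-4, 0.0],       # Cov(a_i, b_i), V(b_i) + sigma_e1^2, 0
              [0.0,    0.0,    1.5e-4]])   # variance of e2bar_i (assumed)

estimate, variance = gls_combine(
    [ybar, xbar1 - beta0, xbar2 - gamma0], [1.0, beta1, gamma1], V)
print(estimate, variance)
```

Because the direct estimator is itself a linear unbiased combination of the stacked observations, the GLS variance returned here cannot exceed V(a_i) when the model holds.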
Korea LFS Application
Parameter estimation
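• A consistent estimator of $(\beta_0, \beta_1)$: minimize
$$Q(\beta_0, \beta_1) = \sum_{i=1}^{H} \frac{(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1)^2}{V(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1)}$$
where $V(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1) = \sigma_{e1}^2 + \beta_1^2 V(a_i) - 2\beta_1 \mathrm{Cov}(a_i, b_i) + V(b_i)$.
• Let $w_i(\beta_1) = \{\sigma_{e1}^2 + \beta_1^2 V(a_i) - 2\beta_1 \mathrm{Cov}(a_i, b_i) + V(b_i)\}^{-1}$. Then
$$\hat\beta_0 = \bar{x}_w - \hat\beta_1 \bar{y}_w \qquad (4)$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{H} w_i(\hat\beta_1)\{(\bar{y}_i - \bar{y}_w)(\bar{x}_{1i} - \bar{x}_w) - \mathrm{Cov}(a_i, b_i)\}}{\sum_{i=1}^{H} w_i(\hat\beta_1)\{(\bar{y}_i - \bar{y}_w)^2 - V(a_i)\}} \qquad (5)$$
where $(\bar{x}_w, \bar{y}_w) = \{\sum_{i=1}^{H} w_i(\hat\beta_1)\}^{-1} \sum_{i=1}^{H} w_i(\hat\beta_1)(\bar{x}_{1i}, \bar{y}_i)$.
• This solution can be obtained by an iterative algorithm.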
Korea LFS Application
Parameter estimation
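• Consider the method-of-moments estimator
$$E\{(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1)^2 - \beta_1^2 V(a_i) + 2\beta_1 \mathrm{Cov}(a_i, b_i) - V(b_i)\} = \sigma_{e1i}^2$$
• Under the nested error model
$$E\{(\bar{x}_{1i} - \beta_0 - \bar{y}_i \beta_1)^2 - \beta_1^2 V(a_i) + 2\beta_1 \mathrm{Cov}(a_i, b_i) - V(b_i)\} = \sigma_{e1}^2$$
• Following Fuller (2009),
$$\hat\sigma_{e1}^2 = \sum_{i=1}^{H} \kappa_i \left\{ (\bar{x}_{1i} - \hat\beta_0 - \bar{y}_i \hat\beta_1)^2 - \hat\beta_1^2 \hat{V}(a_i) + 2\hat\beta_1 \widehat{\mathrm{Cov}}(a_i, b_i) - \hat{V}(b_i) \right\} \qquad (6)$$
where $\kappa_i \propto \{\sigma_{e1}^2 + \hat\beta_1^2 \hat{V}(a_i) - 2\hat\beta_1 \widehat{\mathrm{Cov}}(a_i, b_i) + \hat{V}(b_i)\}^{-1}$ and $\sum_{i=1}^{H} \kappa_i = 1$.
• We can also consider $\bar{e}_{1i} \sim (0, \bar{Y}_i \sigma_{e1}^2)$.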
Korea LFS Application
Parameter estimation
• Iterative algorithm for parameter estimation.
1 Compute the initial estimator of $(\beta_0, \beta_1)$ by setting $\hat\sigma_{e1}^2 = 0$, using (4) and (5).
2 Using the current value of $(\hat\beta_0, \hat\beta_1)$, compute $\hat\sigma_{e1}^2$ using (6).
3 Using the current value of $\hat\sigma_{e1}^2$, compute the updated estimator of $(\beta_0, \beta_1)$ using (4) and (5).
4 Repeat steps 2 and 3 until convergence.
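The iteration over (4), (5), and (6) can be sketched as a single fixed-point loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fit_structural_model` is hypothetical, the weights in (5) are evaluated at the previous value of the slope, the first pass uses equal weights to obtain a starting slope, and the variance estimate is truncated at zero.

```python
import numpy as np

def fit_structural_model(xbar, ybar, v_a, v_b, cov_ab, max_iter=200, tol=1e-12):
    """Iterate (4)-(6); returns (beta0_hat, beta1_hat, sigma2_e1_hat)."""
    xbar, ybar, v_a, v_b, cov_ab = (
        np.asarray(t, dtype=float) for t in (xbar, ybar, v_a, v_b, cov_ab))

    def update_beta(weights):
        # Equations (4) and (5) for a given set of weights w_i.
        w = weights / weights.sum()
        xw, yw = w @ xbar, w @ ybar
        num = w @ ((ybar - yw) * (xbar - xw) - cov_ab)
        den = w @ ((ybar - yw) ** 2 - v_a)
        beta1 = num / den
        return xw - beta1 * yw, beta1

    def sampling_part(beta1):
        # beta1^2 V(a_i) - 2 beta1 Cov(a_i, b_i) + V(b_i)
        return beta1**2 * v_a - 2.0 * beta1 * cov_ab + v_b

    sigma2 = 0.0                                      # step 1: sigma2_e1 = 0
    beta0, beta1 = update_beta(np.ones_like(xbar))    # equal-weight start (assumption)
    for _ in range(max_iter):
        # Steps 1 and 3: update (beta0, beta1) with weights at current values.
        beta0_new, beta1_new = update_beta(1.0 / (sigma2 + sampling_part(beta1)))
        # Step 2: equation (6), with kappa_i normalized to sum to one.
        samp = sampling_part(beta1_new)
        kappa = 1.0 / (sigma2 + samp)
        kappa /= kappa.sum()
        resid2 = (xbar - beta0_new - ybar * beta1_new) ** 2
        sigma2_new = max(0.0, float(kappa @ (resid2 - samp)))  # truncation is an assumption
        change = max(abs(beta0_new - beta0), abs(beta1_new - beta1),
                     abs(sigma2_new - sigma2))
        beta0, beta1, sigma2 = beta0_new, beta1_new, sigma2_new
        if change < tol:                              # step 4: stop at convergence
            break
    return float(beta0), float(beta1), float(sigma2)
```

The sketch assumes strictly positive sampling variances, so that the weights are finite when the variance component is zero in the first pass.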
Korea LFS Application
MSE estimation
2
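• The actual prediction for $\bar{Y}_i$ is computed by $\hat{\bar{Y}}_{ei} = \hat{\bar{Y}}_i(\hat\theta)$, where $\theta = (\beta_0, \beta_1, \sigma_{e1}^2)$.
$$MSE(\hat{\bar{Y}}_{ei}) = MSE(\hat{\bar{Y}}_i) + E\{(\hat{\bar{Y}}_{ei} - \hat{\bar{Y}}_i)^2\} = M_{1i} + M_{2i}$$
• Consider a jackknife approach,
$$\hat{M}_{2i} = \frac{H-1}{H} \sum_{k=1}^{H} (\hat{\bar{Y}}_i^{(-k)} - \hat{\bar{Y}}_i)^2$$
$$\hat{M}_{1i} = \hat\alpha_i^{(JK)} \hat{V}(a_i) + (1 - \hat\alpha_i^{(JK)}) \widehat{\mathrm{Cov}}(a_i, b_i)$$
where $\hat\alpha_i^{(JK)} = \hat\alpha_i - \frac{H-1}{H} \sum_{k=1}^{H} (\hat\alpha_i^{(-k)} - \hat\alpha_i)$.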
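The jackknife quantities above can be sketched generically. This illustration assumes that the superscript (-k) denotes re-estimation of the parameters with area k deleted; the callable `predict_all`, which refits on the areas in `fit_idx` and returns predictions for all H areas, is a hypothetical interface and not part of the original slides.

```python
import numpy as np

def jackknife_m2(predict_all, H):
    """M2_hat_i = (H-1)/H * sum_k (pred_i^{(-k)} - pred_i)^2 for every area i."""
    full = np.asarray(predict_all(np.arange(H)), dtype=float)
    total = np.zeros(H)
    for k in range(H):
        keep = np.delete(np.arange(H), k)          # drop area k when fitting
        total += (np.asarray(predict_all(keep), dtype=float) - full) ** 2
    return (H - 1) / H * total

def jackknife_bias_corrected(alpha_full, alpha_loo):
    """alpha_i^{(JK)} = alpha_i - (H-1)/H * sum_k (alpha_i^{(-k)} - alpha_i).

    alpha_loo has shape (H, H): row k holds the weights computed without area k.
    """
    alpha_full = np.asarray(alpha_full, dtype=float)
    alpha_loo = np.asarray(alpha_loo, dtype=float)
    H = alpha_loo.shape[0]
    return alpha_full - (H - 1) / H * (alpha_loo - alpha_full).sum(axis=0)

# Toy check: predicting every area by the mean of the fitted areas
# reproduces the classical jackknife variance of a sample mean.
y = np.array([1.0, 2.0, 3.0, 4.0])
m2 = jackknife_m2(lambda idx: np.full(y.size, y[idx].mean()), y.size)
print(m2, y.var(ddof=1) / y.size)
```

In an application, `predict_all` would wrap the parameter-estimation step followed by the composite prediction, so each of the H refits repeats the iterative algorithm.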
Korea LFS Application
Data analysis results
• Consider four estimates
• KLF : Only KLF
• LALF : Only LALF
• GLS 1 : Combine KLF and LALF
• GLS 2 : Combine KLF, LALF, and census data
• MSE
MSE      1st Q       Median      Mean        3rd Q
KLF      0.0000630   0.0001210   0.0002476   0.0002395
LALF     0.0001123   0.0001330   0.0001482   0.0001695
GLS 1    0.0000444   0.0000738   0.0000893   0.0001210
GLS 2    0.0000405   0.0000543   0.0000575   0.0000721
Discussion
Modeling
• Model specification was very difficult.
• We build models separately for urban and rural areas, which are assigned based on the proportion of households engaged in agricultural business.
• In the KLF survey, 25% of all areas have an estimated unemployment rate of 0 because of the very small sample sizes in individual areas.
• The areas with an unemployment rate of 0 are excluded when the parameters are estimated.
• We have considered a structural model with a zero intercept:
$$\bar{X}_{1i} = \beta_1 \bar{Y}_i + \bar{e}_{1i}$$
• A mixture model or a zero-inflated regression model can also be considered.
Discussion
Estimation
• In the real data set, no estimate of the covariance term $\widehat{\mathrm{Cov}}(a_i, b_i)$ is available, even though the covariance is not 0.
• After calculating the covariance term, a problem remains: the covariance matrix for some areas is not positive definite.
• Thus a procedure for smoothing the covariance matrix is essential.
• Consider a reverse two-phase sampling design
• From the finite population U, we select the first-phase sample $A_1$ of size $n_1$.
• We select the second-phase sample $A_2$ of size $n_2$ from $U - A_1$.
• The final sample is $A = A_1 \cup A_2$, and its size is $n = n_1 + n_2$.
Discussion
Estimation
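• In fact, the LALF survey sample is obtained by augmenting the KLF survey sample through an additional sampling procedure.
• Using the properties of the reverse two-phase sampling design,
$$V(a_i) = \left(\frac{1}{n_a} - \frac{1}{N}\right) S_y^2 \cong \frac{1}{n_a} S_y^2$$
$$V(b_i) = \left(\frac{1}{n_b} - \frac{1}{N}\right) S_y^2 \cong \frac{1}{n_b} S_y^2$$
$$\mathrm{Cov}(a_i, b_i) = \left(\frac{1}{n_b} - \frac{1}{N}\right) S_y^2 \cong \frac{1}{n_b} S_y^2$$
• Sampling error variance
$$\begin{pmatrix} \hat{V}(a_i) & \widehat{\mathrm{Cov}}(a_i, b_i) \\ \widehat{\mathrm{Cov}}(a_i, b_i) & \hat{V}(b_i) \end{pmatrix} \cong \hat{V}(a_i) \begin{pmatrix} 1 & n_{ai}/n_{bi} \\ n_{ai}/n_{bi} & n_{ai}/n_{bi} \end{pmatrix}$$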
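The approximate sampling error matrix above can be built directly from the estimated variance of a_i and the two sample sizes. The sketch below uses hypothetical function names and inputs; it also checks positive definiteness, which holds for this form whenever 0 < n_ai < n_bi because the determinant is proportional to r(1 - r) with r = n_ai / n_bi.

```python
import numpy as np

def nested_sampling_cov(v_a, n_a, n_b):
    """Approximate covariance matrix of (a_i, b_i): V(a_i) * [[1, r], [r, r]]."""
    r = n_a / n_b
    return v_a * np.array([[1.0, r],
                           [r, r]])

def is_positive_definite(m):
    """True if every eigenvalue of the symmetric matrix m is strictly positive."""
    return bool(np.all(np.linalg.eigvalsh(m) > 0))

# Hypothetical area: 20 KLF households nested within 500 LALF households.
V_hat = nested_sampling_cov(v_a=4e-4, n_a=20, n_b=500)
print(V_hat)
print(is_positive_definite(V_hat))
```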
Discussion
Future work
• The current MSE estimation formula does not account for the covariance-matrix smoothing procedure.
• To improve the approximation to asymptotic normality, we can consider a transformation of $\hat{X}_i$ and $\hat{Y}_i$.
• A new MSE estimation formula for the transformed case is under investigation.
Discussion
Thank You !