Combining data from two independent
surveys: model-assisted approach
Jae Kwang Kim
Iowa State University
January 20, 2012
Joint work with J.N.K. Rao, Carleton University
Reference
Kim, J.K. and Rao, J.N.K. (2012). “Combining data from two independent surveys: a model-assisted approach,” Biometrika, in press. (Available online via Advance Access, doi:10.1093/biomet/asr063.)
Outline
1. Introduction
2. Projection estimation
3. Replication variance estimation
4. Efficient estimation: Full information
5. Simulation study
6. Concluding remarks & Discussion
1. Introduction
Two-phase sampling
(Classical) Two-phase sampling
A1 : first-phase sample of size n1
A2 : second-phase sample of size n2 (A2 ⊂ A1 )
x observed in phase 1 and both y and x observed in phase 2.
Assume that 1 is an element of xi .
Neyman (1934), Hansen & Hurwitz (1946), Rao (1973), Kott
& Stukel (1997), Binder et al. (2000), Kim et al. (2006),
Hidiroglou et al. (2009).
1. Introduction
Two-phase sampling
GREG estimator of Y = \sum_{i=1}^N y_i:

    \hat{Y}_G = \hat{X}_1' \hat{\beta}_2,
    \hat{X}_1 = \sum_{i \in A_1} w_{1i} x_i,
    \hat{\beta}_2 = \Big( \sum_{i \in A_2} w_{2i} x_i x_i' \Big)^{-1} \sum_{i \in A_2} w_{2i} x_i y_i

Two ways of implementing the GREG estimator

Calibration: create data file for A2

    \hat{Y}_G = \sum_{i \in A_2} w_{2G,i} y_i,
    w_{2G,i} = \hat{X}_1' \Big( \sum_{i \in A_2} w_{2i} x_i x_i' \Big)^{-1} w_{2i} x_i

Projection estimation: create data file for A1

    \hat{Y}_G = \sum_{i \in A_1} w_{1i} \tilde{y}_i,
    \tilde{y}_i = x_i' \hat{\beta}_2
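As an illustration, the sketch below (hypothetical data and plain expansion weights, numpy only; the array names x1, w1, x2, w2, y2 are illustrative and not from the paper) computes the GREG estimator both ways and confirms that the calibration and projection forms give the same total.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-phase sample: A2 is a subsample of A1.
n1, n2 = 500, 100
x1 = np.column_stack([np.ones(n1), rng.chisquare(2, n1)])   # x for A1 (1 is an element of x_i)
w1 = np.full(n1, 20.0)                                       # illustrative weights for A1
idx2 = rng.choice(n1, n2, replace=False)                     # A2 subset of A1
x2, w2 = x1[idx2], np.full(n2, 100.0)                        # illustrative weights for A2
y2 = 1 + 0.7 * x2[:, 1] + rng.normal(0, 1, n2)               # y observed only in A2

# beta_hat_2 = (sum_{A2} w2 x x')^{-1} sum_{A2} w2 x y
XtWX = (x2 * w2[:, None]).T @ x2
XtWy = (x2 * w2[:, None]).T @ y2
beta2 = np.linalg.solve(XtWX, XtWy)

# Projection form: sum_{A1} w_{1i} * x_i' beta_2
Y_proj = w1 @ (x1 @ beta2)

# Calibration form: sum_{A2} w_{2G,i} y_i with w_{2G,i} = X1hat'(sum w2 x x')^{-1} w2 x
X1hat = w1 @ x1                                              # sum_{A1} w_{1i} x_i
w2G = np.linalg.solve(XtWX, X1hat) @ (x2 * w2[:, None]).T    # length-n2 calibration weights
Y_cal = w2G @ y2

print(Y_proj, Y_cal)   # identical up to floating-point error
```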
1. Introduction
Domain projection estimators
Calibration estimator of domain total Y_d = \sum_{i=1}^N \delta_i(d) y_i:

    \hat{Y}_{Cal,d} = \sum_{i \in A_2} w_{2G,i} \delta_i(d) y_i
δi (d) = 1 if i belongs to domain d, δi (d) = 0 otherwise.
Note: ŶCal,d is based only on the domain sample belonging to A2, so it can have a large variance if the domain sample in A2 is very small.
1. Introduction
Domain projection estimators
Domain projection estimator (Fuller, 2003)
    \hat{Y}_{p,d} = \sum_{i \in A_1} w_{i1} \delta_i(d) \tilde{y}_i
Note: Ŷp,d is based on the much larger domain sample in A1, whereas ŶCal,d uses only the domain sample in A2. Hence Ŷp,d can be substantially more efficient if its relative bias is small.
Under the model yi = x_i'β + ei with E(ei) = 0, Ŷp,d is model unbiased for Yd. But "it is possible to construct populations for which Ŷp,d is very design biased" (Fuller, 2003).
1. Introduction
Combining two independent surveys
Large sample A1 collecting only x, with weights {wi1, i ∈ A1}.
Much smaller sample A2, drawn independently, collecting both x and y, with weights {wi2, i ∈ A2}.
Example 1 (Hidiroglou, 2001): Canadian Survey of
Employment, Payrolls and Hours
A1 : Large sample drawn from a Canadian Customs and
Revenue Agency administrative data file and auxiliary variables
x observed.
A2 : Small sample from Statistics Canada Business Register
and study variables y , number of hours worked by employees
and summarized earnings, observed.
1. Introduction
Combining two independent surveys
Example 2 (Reiter, 2008)
A2 : Both self-reported health measurements, x, and clinical
measurements from physical examinations, y , observed
A1 : Only x reported
Synthetic values ỹi , i ∈ A1 are created by first fitting a
working model E (y ) = m(x, β) relating y to x to data
{(yi , xi ), i ∈ A2 } and then predicting yi associated with xi ,
i ∈ A1 . Only synthetic values ỹi = m(xi , β̂), i ∈ A1 and
associated weights wi1 , i ∈ A1 are released to the public.
Our focus is on producing estimators of totals and domain
totals from the synthetic data file {(ỹi , wi1 ), i ∈ A1 }.
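A minimal sketch of how such a synthetic data file could be produced, assuming a linear working model and hypothetical simple-random-sample weights; the names x1, w1, x2, w2, y2 are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical independent samples: A1 (large, x only) and A2 (small, x and y).
n1, n2, N = 500, 100, 10_000
x1 = np.column_stack([np.ones(n1), rng.chisquare(2, n1)])
w1 = np.full(n1, N / n1)
x2 = np.column_stack([np.ones(n2), rng.chisquare(2, n2)])
w2 = np.full(n2, N / n2)
y2 = 1 + 0.7 * x2[:, 1] + rng.normal(0, np.sqrt(2), n2)

# Fit the working model E(y | x) = x'beta to survey 2 data (weighted least squares).
beta_hat = np.linalg.solve((x2 * w2[:, None]).T @ x2, (x2 * w2[:, None]).T @ y2)

# Synthetic values for every A1 unit; only (y_tilde, w1) need to be released.
y_tilde = x1 @ beta_hat
synthetic_file = np.column_stack([y_tilde, w1])

# Projection estimator of the total Y computed from the synthetic file alone.
Y_p = synthetic_file[:, 1] @ synthetic_file[:, 0]
print(Y_p)
```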
2. Projection estimation
Estimation of Y
Projection estimator of Y:

    \hat{Y}_p = \sum_{i \in A_1} w_{i1} \tilde{y}_i

Ŷp is asymptotically design unbiased if β̂ satisfies

    \sum_{i \in A_2} w_{i2} \{ y_i - m(x_i, \hat{\beta}) \} = 0.    (*)

Note: Under condition (*),

    \hat{Y}_p = \sum_{i \in A_1} w_{i1} \tilde{y}_i + \sum_{i \in A_2} w_{i2} ( y_i - \tilde{y}_i )
              = "prediction" + "bias correction"
2. Projection estimation
Estimation of Y
Theorem 1: Under some regularity conditions, if β̂ satisfies
condition (*), we can write
    \hat{Y}_p \simeq \sum_{i \in A_1} w_{i1} m_0(x_i) + \sum_{i \in A_2} w_{i2} \{ y_i - m_0(x_i) \} = \hat{P}_1 + \hat{Q}_2,

where m_0(x_i) = m(x_i, β_0) and β_0 = plim β̂ with respect to survey 2. Thus,

    E( \hat{Y}_p ) \simeq \sum_{i=1}^N m_0(x_i) + \sum_{i=1}^N \{ y_i - m_0(x_i) \} = \sum_{i=1}^N y_i

and

    V( \hat{Y}_p ) \simeq V( \hat{P}_1 ) + V( \hat{Q}_2 ).
2. Projection estimation
Model-assisted approach: Asymptotic unbiasedness of Ŷp
does not depend on the validity of the working model but
efficiency is affected.
Note: In the variance decomposition

    V( \hat{Y}_p ) \simeq V( \hat{P}_1 ) + V( \hat{Q}_2 ) = V_1 + V_2,

V1 is based on the n1 sample elements and V2 is based on the n2 sample elements.
If n2 << n1, then V1 << V2.
If the working model is good, then the squared error terms e_i^2 = \{ y_i - m_0(x_i) \}^2 are small and V2 will also be small.
2. Projection Estimation
When is condition (*) satisfied?
If 1 is an element of x_i, condition (*) is satisfied for the linear regression working model m(x_i, β) = x_i'β and the logistic regression working model logit{m(x_i, β)} = x_i'β when β̂ is obtained from the estimating equation

    \sum_{i \in A_2} w_{i2} x_i ( y_i - m_i ) = 0.

For the ratio working model, β̂ is the solution of

    \sum_{i \in A_2} w_{i2} ( y_i - m_i ) = 0.
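A small numerical check, with simulated data and equal weights assumed for illustration, that the weighted-least-squares fit of the linear working model and the closed-form fit of the ratio working model both satisfy condition (*).

```python
import numpy as np

rng = np.random.default_rng(2)
n2 = 100
x = rng.chisquare(2, n2)
w = np.full(n2, 100.0)                  # illustrative survey-2 weights
y = 1 + 0.7 * x + rng.normal(0, 1, n2)

# Linear working model m_i = beta0 + beta1*x_i: solve sum w x (y - m) = 0 (weighted LS).
X = np.column_stack([np.ones(n2), x])
beta = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y)
m_lin = X @ beta
print(np.sum(w * (y - m_lin)))          # ~0: condition (*) holds because 1 is in x_i

# Ratio working model m_i = beta*x_i: solve sum w (y - beta*x) = 0.
beta_ratio = np.sum(w * y) / np.sum(w * x)
m_rat = beta_ratio * x
print(np.sum(w * (y - m_rat)))          # ~0 by construction
```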
2. Projection Estimation
Linearization variance estimation
Let ê_i = y_i − ỹ_i. The linearization variance estimator of Ŷp is

    v_L( \hat{Y}_p ) = v_1( \tilde{y}_i ) + v_2( \hat{e}_i ),

where v_1(z_i) = v( \hat{Z}_1 ) is the variance estimator for survey 1, v_2(z_i) = v( \hat{Z}_2 ) is the variance estimator for survey 2, and \hat{Z}_1 = \sum_{i \in A_1} w_{i1} z_i, \hat{Z}_2 = \sum_{i \in A_2} w_{i2} z_i.

Note: v_L( \hat{Y}_p ) requires access to data from both surveys.
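A sketch of the linearization variance estimator assuming, for illustration only, that both samples are simple random samples so that v1 and v2 take the familiar form N^2(1 − n/N)s_z^2/n; the helper v_srs and all data are hypothetical.

```python
import numpy as np

def v_srs(z, N):
    """SRS-without-replacement variance estimator for the expanded total N * mean(z)."""
    n = len(z)
    return N**2 * (1 - n / N) * np.var(z, ddof=1) / n

rng = np.random.default_rng(3)
N, n1, n2 = 10_000, 500, 100
x1 = rng.chisquare(2, n1)
x2 = rng.chisquare(2, n2)
y2 = 1 + 0.7 * x2 + rng.normal(0, 1, n2)

# Working-model fit on survey 2, synthetic values for A1 and residuals for A2.
X2 = np.column_stack([np.ones(n2), x2])
beta = np.linalg.lstsq(X2, y2, rcond=None)[0]          # unweighted = weighted under SRS
y_tilde1 = np.column_stack([np.ones(n1), x1]) @ beta   # tilde y_i, i in A1
e2 = y2 - X2 @ beta                                    # hat e_i = y_i - tilde y_i, i in A2

# v_L(Y_p) = v_1(tilde y) + v_2(hat e)
vL = v_srs(y_tilde1, N) + v_srs(e2, N)
print(vL)
```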
2. Projection Estimation
Estimation of domain total Yd
Projection domain estimator:

    \hat{Y}_{d,p} = \sum_{i \in A_1} w_{i1} \delta_i(d) \tilde{y}_i

Ŷd,p is asymptotically unbiased if either

    Case (i):  \sum_{i \in A_2} w_{i2} \delta_i(d) ( y_i - \tilde{y}_i ) = 0,  or
    Case (ii): Cov\{ \delta_i(d), y_i - m(x_i, \beta_0) \} = 0.
2. Projection Estimation
Estimation of domain total Yd
Case (i): For linear or logistic regression working models, (i) is satisfied if δi(d) is an element of xi. For planned domains specified in advance, augmented working models can be used; the survey 1 data file should then provide the planned domain indicators.
Case (ii): If the working model is good, the relative bias of Ŷd,p will be small; Ŷd,p is asymptotically model unbiased if the model is correct.
Ŷd,p can be significantly design biased for some populations.
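An illustrative sketch of an augmented working model for a planned domain: the domain indicator is appended to x_i, so Case (i) holds by construction and the domain projection estimator can be computed from the A1 file. Data, weights and the domain definition are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n1, n2 = 10_000, 500, 100
x1, z1 = rng.chisquare(2, n1), rng.uniform(0, 1, n1)
x2, z2 = rng.chisquare(2, n2), rng.uniform(0, 1, n2)
y2 = 1 + 0.7 * x2 + rng.normal(0, 1, n2)
d1, d2 = (z1 < 0.3).astype(float), (z2 < 0.3).astype(float)   # planned domain indicator
w1, w2 = np.full(n1, N / n1), np.full(n2, N / n2)

# Augmented working model: include the domain indicator as an element of x_i.
X2 = np.column_stack([np.ones(n2), x2, d2])
X1 = np.column_stack([np.ones(n1), x1, d1])
beta = np.linalg.solve((X2 * w2[:, None]).T @ X2, (X2 * w2[:, None]).T @ y2)

# Case (i) check: sum_{A2} w_i2 * delta_i(d) * (y_i - tilde y_i) = 0 by the normal equations.
print(np.sum(w2 * d2 * (y2 - X2 @ beta)))     # ~0

# Domain projection estimator computed from the A1 file.
Y_dp = np.sum(w1 * d1 * (X1 @ beta))
print(Y_dp)
```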
3. Replication variance estimation
Replication variance estimation for Ŷp
Replication variance estimator for survey 1:

    v_{1,rep}( \hat{Z} ) = \sum_{k=1}^{L_1} c_k ( \hat{Z}_1^{(k)} - \hat{Z}_1 )^2,

where \hat{Z}_1^{(k)} = \sum_{i \in A_1} w_{i1}^{(k)} z_i and \{ w_{i1}^{(k)}, i \in A_1 \}, k = 1, ..., L_1, are the replication weights for survey 1.

Replication variance estimator for Ŷp:

    v_{1,rep}( \hat{Y}_p ) = \sum_{k=1}^{L_1} c_k ( \hat{Y}_p^{(k)} - \hat{Y}_p )^2,

where \hat{Y}_p^{(k)} = \sum_{i \in A_1} w_{i1}^{(k)} \tilde{y}_i^{(k)} and \{ \tilde{y}_i^{(k)}, i \in A_1 \} are the synthetic values for replicate k.
3. Replication variance estimation
Replication variance estimation for Ŷp
(k)
How to create replicated synthetic data {ỹi
1
Create
(k)
{wi2 , k
, i ∈ A1 } ?
= 1, · · · , L1 ; i ∈ A2 } such that
L1
X
2
(k)
ck Ŷ2 − Ŷ2 = v2 (Ŷ2 )
k=1
2
Compute β̂
(k)
(k)
(k)
and ỹi = m(xi , β̂ ) by solving
X (k)
wi2 {yi − m(xi , β)}xi = 0
i∈A2
for β̂
(k)
(linear or logistic linear regression)
v1,rep (Ŷp ) is asymptotically unbiased.
Data file for sample A1 should contain additional columns of
(k)
(k)
{ỹi , i ∈ A1 } and associated {wi1 , i ∈ A1 }, k = 1, 2, · · · , L1 .
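A simplified sketch of this replication scheme, assuming simple random samples, a delete-one jackknife for survey 2 and a random-group jackknife with L1 = n2 groups for survey 1, so that the replicate factors c_k = (L1 − 1)/L1 match across surveys. In practice the replication designs would come from each survey; everything here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
N, n1, n2 = 10_000, 500, 100
L1 = n2                                   # number of replicates (as in the simulation study)
ck = (L1 - 1) / L1                        # usual jackknife factor (an assumption)

x1 = np.column_stack([np.ones(n1), rng.chisquare(2, n1)])
x2 = np.column_stack([np.ones(n2), rng.chisquare(2, n2)])
y2 = 1 + 0.7 * x2[:, 1] + rng.normal(0, 1, n2)
w1, w2 = np.full(n1, N / n1), np.full(n2, N / n2)

def wls(X, y, w):
    return np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y)

# Full-sample fit and projection estimator.
beta = wls(x2, y2, w2)
Y_p = w1 @ (x1 @ beta)

# Survey-1 replicate weights: random-group jackknife with L1 groups of size n1/L1.
groups = rng.permutation(n1) % L1
Y_p_rep = np.empty(L1)
for k in range(L1):
    # Step 1: delete-one jackknife replicate weights for survey 2.
    w2_k = np.where(np.arange(n2) == k, 0.0, w2 * n2 / (n2 - 1))
    # Step 2: replicate fit and replicate synthetic values for A1.
    beta_k = wls(x2, y2, w2_k)
    y_tilde_k = x1 @ beta_k
    w1_k = np.where(groups == k, 0.0, w1 * L1 / (L1 - 1))
    Y_p_rep[k] = w1_k @ y_tilde_k

v_rep = ck * np.sum((Y_p_rep - Y_p) ** 2)
print(Y_p, v_rep)
```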
3. Replication variance estimation
Replication variance estimation for Ŷd,p
(k)
Let Ŷd,p =
P
i∈A1
(k)
(k)
wi1 δi (d)ỹi
v1,rep (Ŷd,p ) =
L1
X
, then
2
(k)
ck Ŷd,p − Ŷd,p
k=1
Asymptotically unbiased under either case (i) or case (ii).
4. Optimal estimator: Full information
Estimation of total Y
Three estimators for two parameters:
Survey 1: X̂1 for X
Survey 2: (X̂2, Ŷ2) for (X, Y)

Combine the information using generalized least squares: minimize

    Q(X, Y) = ( \hat{X}_1 - X, \hat{X}_2 - X, \hat{Y}_2 - Y ) \, V^{-1} \, ( \hat{X}_1 - X, \hat{X}_2 - X, \hat{Y}_2 - Y )'

with respect to (X, Y), where V is the variance-covariance matrix of ( \hat{X}_1, \hat{X}_2, \hat{Y}_2 )'.
4. Optimal estimator: Full information
Estimation of total Y
Best linear unbiased estimator based on X̂2, Ŷ2 and X̂1:

    \tilde{Y}_{opt} = \hat{Y}_2 + B_{y \cdot x2} ( \tilde{X}_{opt} - \hat{X}_2 ),
    \tilde{X}_{opt} = \frac{ V_{xx2} \hat{X}_1 + V_{xx1} \hat{X}_2 }{ V_{xx1} + V_{xx2} },

where B_{y·x2} = V_{yx2} / V_{xx2}, V_{xx1} = V(X̂1), V_{xx2} = V(X̂2), and V_{yx2} = Cov(Ŷ2, X̂2).

Replace the variances in Ỹopt by estimated variances to get Ŷopt and X̂opt.
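A sketch of this combination for a scalar x, assuming for illustration that both samples are simple random samples so that the variances and covariance can be estimated by the usual SRS formulas; all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(6)
N, n1, n2 = 10_000, 500, 100
x1 = rng.chisquare(2, n1)
x2 = rng.chisquare(2, n2)
y2 = 1 + 0.7 * x2 + rng.normal(0, 1, n2)

# Expanded totals and their estimated variances/covariance (SRS formulas).
X1_hat, X2_hat, Y2_hat = N * x1.mean(), N * x2.mean(), N * y2.mean()
f1, f2 = 1 - n1 / N, 1 - n2 / N
Vxx1 = N**2 * f1 * np.var(x1, ddof=1) / n1
Vxx2 = N**2 * f2 * np.var(x2, ddof=1) / n2
Vyx2 = N**2 * f2 * np.cov(y2, x2)[0, 1] / n2

# GLS combination of X1_hat and X2_hat, then the regression adjustment for Y.
X_opt = (Vxx2 * X1_hat + Vxx1 * X2_hat) / (Vxx1 + Vxx2)
B_yx2 = Vyx2 / Vxx2
Y_opt = Y2_hat + B_yx2 * (X_opt - X2_hat)
print(X_opt, Y_opt)
```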
4. Optimal estimator: Full information
Estimation of total Y
Ŷopt can be expressed as

    \hat{Y}_{opt} = \sum_{i \in A_2} w_{i2}^* y_i,

where \{ w_{i2}^*, i \in A_2 \} are calibration weights: \sum_{i \in A_2} w_{i2}^* x_i = \hat{X}_{opt}.

Ŷopt can be computed from a data file for A2 providing the weights \{ w_{i2}^*, i \in A_2 \}.

Example: simple random samples A1 and A2:

    w_{i2}^* = \frac{N}{n_2} + ( \hat{X}_{opt} - \hat{X}_2 ) \frac{ x_i - \bar{x}_2 }{ \sum_{i \in A_2} ( x_i - \bar{x}_2 )^2 },

where \bar{x}_2 is the mean of x for A2.
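Continuing the same hypothetical simple-random-sampling setup as in the previous sketch, the code below builds the calibration weights w*_{i2} for the A2 file and checks that they reproduce X̂opt and Ŷopt.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n1, n2 = 10_000, 500, 100
x1 = rng.chisquare(2, n1)
x2 = rng.chisquare(2, n2)
y2 = 1 + 0.7 * x2 + rng.normal(0, 1, n2)

# Optimal estimator under SRS, as in the previous sketch.
X1_hat, X2_hat, Y2_hat = N * x1.mean(), N * x2.mean(), N * y2.mean()
Vxx1 = N**2 * (1 - n1 / N) * np.var(x1, ddof=1) / n1
Vxx2 = N**2 * (1 - n2 / N) * np.var(x2, ddof=1) / n2
Vyx2 = N**2 * (1 - n2 / N) * np.cov(y2, x2)[0, 1] / n2
X_opt = (Vxx2 * X1_hat + Vxx1 * X2_hat) / (Vxx1 + Vxx2)
Y_opt = Y2_hat + (Vyx2 / Vxx2) * (X_opt - X2_hat)

# Calibration weights for the A2 file under simple random sampling.
x_bar2 = x2.mean()
w_star = N / n2 + (X_opt - X2_hat) * (x2 - x_bar2) / np.sum((x2 - x_bar2) ** 2)

print(np.sum(w_star * x2), X_opt)    # calibration constraint: these agree
print(np.sum(w_star * y2), Y_opt)    # the weighted total reproduces the optimal estimator
```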
4. Optimal estimator: Full information
Domain estimation
Calibration estimator:

    \hat{Y}_d^* = \sum_{i \in A_2} w_{i2}^* \delta_i(d) y_i,

computed from the data file for A2 only.

Projection estimator:

    \hat{Y}_{d,p} = \sum_{i \in A_1} w_{i1} \delta_i(d) \tilde{y}_i,

computed from the data file for A1.

Both Ŷd* and Ŷd,p satisfy the internal consistency property:

    \sum_d \hat{Y}_d^* = \hat{Y}_{opt},   \sum_d \hat{Y}_{d,p} = \hat{Y}_p.
4. Optimal estimator: Full information
Domain estimation
Ŷd* is asymptotically design unbiased but can have a large variance if the domain contains few A2 sample units.
The optimal estimator Ŷopt,d based on domain-specific variances does not satisfy internal consistency, may not be stable for small domain sample sizes, and cannot be implemented from the A2 data file alone.
5. Simulation Study
Simulation Setup
Two artificial populations A and B of size N = 10,000: {(yi, xi, zi); i = 1, ..., N}.
Population A (regression model): xi ~ χ²(2), yi = 1 + 0.7 xi + ei, ei ~ N(0, 2), zi ~ Unif(0, 1), with zi independent of (xi, yi).
Population B (ratio model): same (xi, zi), but yi = 0.7 xi + ui with ui ~ N(0, xi).
corr(y, x) ≈ 0.71 for both populations.
Domain d: δi(d) = 1 if zi < 0.3; δi(d) = 0 otherwise.
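A sketch of how the two artificial populations could be generated, interpreting N(0, 2) and N(0, xi) as specifying variances (an assumption about the slide notation).

```python
import numpy as np

rng = np.random.default_rng(8)
N = 10_000

# Shared covariates: x ~ chi-square(2), z ~ Unif(0, 1), z independent of (x, y).
x = rng.chisquare(2, N)
z = rng.uniform(0, 1, N)

# Population A: regression model with intercept, error variance 2 (assumed).
y_A = 1 + 0.7 * x + rng.normal(0, np.sqrt(2), N)

# Population B: ratio model with heteroscedastic errors, variance x_i (assumed).
y_B = 0.7 * x + rng.normal(0, np.sqrt(x), N)

# Domain indicator.
delta = (z < 0.3).astype(int)

# Sample correlations come out near 0.7 in both populations.
print(np.corrcoef(y_A, x)[0, 1], np.corrcoef(y_B, x)[0, 1], delta.mean())
```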
5. Simulation Study
Simulation Setup
Two independent simple random samples: n1 = 500, n2 = 100
Working models: linear regression, ratio, augmented linear
regression, augmented ratio
Relative bias: RB(Ŷ ) = {E (Ŷ ) − Y }/Y
Relative efficiency: RE (Ŷ ) = mse(Ŷopt )/mse(Ŷ )
5. Simulation Study
Simulation Results
Table 1: Simulation results (point estimation)

Parameter   Estimator                Pop. A RB   Pop. A RE   Pop. B RB   Pop. B RE
Total       Regression projection       0.00        0.98        0.00        0.97
Total       Ratio projection            0.00        0.58        0.00        0.99
Total       Aug. Reg. projection        0.00        0.97        0.00        0.97
Total       Aug. Rat. projection        0.01        0.55        0.00        0.98
Total       Optimal                     0.00        1.00        0.00        1.00
Domain      Regression projection       0.00        1.96        0.01        2.01
Domain      Ratio projection            0.01        1.22        0.01        2.05
Domain      Aug. Reg. projection        0.00        1.05        0.00        0.98
Domain      Aug. Rat. projection        0.00        0.64        0.00        0.96
Domain      Optimal                    -0.01        1.00       -0.02        1.00
Domain      Calibration                 0.00        0.45        0.00        0.53
5. Simulation Study
Conclusions from Table 1
Estimation of total Y
1. RB of all estimators is negligible: less than 2%.
2. The regression projection estimator is almost as efficient as Ŷopt even when the true model is the ratio model; the ratio projection estimator is considerably less efficient when the true model has a substantial intercept term, so model diagnostics are needed to identify a good working model.
3. Augmented projection estimators are similar to the corresponding projection estimators in terms of RB and RE.
5. Simulation Study
Conclusions from Table 1
Domain estimation
1. RB of all estimators is less than 5%: the simulation setup ensures that δi(d) is unrelated to ri = yi − m(xi; β).
2. The regression projection estimator is considerably more efficient than the calibration estimator or the optimal estimator: the projection estimator is based on a larger sample size.
3. The ratio projection estimator is considerably less efficient when the model has a substantial intercept term.
5. Simulation Study
Jackknife variance estimation
L1 = n2 = 100 pseudo replicates by random group jackknife
Table 2: Simulation results (relative biases of variance estimators)

Point estimator           Parameter   Pop. A    Pop. B
Regression projection     Total       -0.013     0.024
Regression projection     Domain      -0.030     0.006
Ratio projection          Total        0.032     0.000
Ratio projection          Domain      -0.001    -0.017
Aug. Reg. projection      Total        0.033     0.040
Aug. Reg. projection      Domain       0.022     0.050
Aug. Rat. projection      Total        0.059     0.030
Aug. Rat. projection      Domain       0.064     0.061
|RB| of the jackknife variance estimators is small: at most 0.064.
6. Discussion
Some alternative approaches
The proposed method does not lead to the optimal estimator

    \hat{Y}_{opt} = \hat{Y}_2 + \hat{B}_{y \cdot x2} ( \tilde{X}_{opt} - \hat{X}_2 ),
    \tilde{X}_{opt} = \frac{ V_{xx2} \hat{X}_1 + V_{xx1} \hat{X}_2 }{ V_{xx1} + V_{xx2} }.

To implement the optimal estimator using synthetic data, we may express

    \hat{Y}_{opt} = \sum_{i \in A_1^*} w_{i3} \tilde{y}_i + \sum_{i \in A_2} w_{i2} ( y_i - \tilde{y}_i ),

where \tilde{y}_i = x_i' \hat{B}_{y \cdot x2}, A_1^* = A_1 \cup A_2, and w_{i3} is the sampling weight for A_1^* satisfying

    \sum_{i \in A_1^*} w_{i3} x_i = \tilde{X}_{opt}.
6. Discussion
Some alternative approaches
If \sum_{i \in A_2} w_{i2} = \sum_{i \in A_1^*} w_{i3}, then we can further express

    \hat{Y}_{opt} = \sum_{i \in A_1^*} \sum_{j \in A_2} w_{i3} \tilde{w}_{ij}^* ( \hat{y}_i + \hat{e}_j ),

where \tilde{w}_{ij}^* = w_{j2} / \sum_{i \in A_2} w_{i2} and \hat{e}_j = y_j - \hat{y}_j.

It now takes the form of the fractional imputation considered in Fuller & Kim (2005).

To reduce the size of the data set, we may consider random selection of M residuals to get \hat{e}_j^* and

    \hat{Y}_{FI} = \sum_{i \in A_1^*} \sum_{j=1}^{M} w_{i3} w_{ij}^* ( \hat{y}_i + \hat{e}_j^* ),

where w_{ij}^* satisfies \sum_{j=1}^{M} w_{ij}^* ( 1, \hat{e}_j^* ) = \sum_{j \in A_2} \tilde{w}_{ij}^* ( 1, \hat{e}_j ).
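A sketch of the algebraic identity behind the fractional-imputation form, under simplifying assumptions: A1 and A2 are treated as disjoint, plain expansion weights stand in for w_{i3} (the calibration constraint on w_{i3} is not imposed), and the predictions ŷ_i come from an ordinary least-squares fit. It only illustrates that the double-sum form reproduces the "prediction + bias correction" form when the weight totals agree.

```python
import numpy as np

rng = np.random.default_rng(9)
N, n1, n2 = 10_000, 500, 100
n_star = n1 + n2                              # A1* = A1 U A2 (assumed disjoint here)

x_star = rng.chisquare(2, n_star)             # x for all units in A1*
w3 = np.full(n_star, N / n_star)              # illustrative weights for A1*; sum(w3) = N
x2 = x_star[:n2]                              # pretend the first n2 units form A2
w2 = np.full(n2, N / n2)                      # weights for A2; sum(w2) = N = sum(w3)
y2 = 1 + 0.7 * x2 + rng.normal(0, 1, n2)

# Working predictions and residuals (ordinary least squares with an intercept).
X2 = np.column_stack([np.ones(n2), x2])
b = np.linalg.lstsq(X2, y2, rcond=None)[0]
y_hat_star = np.column_stack([np.ones(n_star), x_star]) @ b
e_hat = y2 - X2 @ b

# "Prediction + bias correction" form: sum_{A1*} w3 y_hat + sum_{A2} w2 e_hat.
Y_a = w3 @ y_hat_star + w2 @ e_hat

# Fractional-imputation form: each donor j in A2 carries fraction w_j2 / sum(w2).
frac = w2 / w2.sum()
Y_b = sum(w3[i] * np.sum(frac * (y_hat_star[i] + e_hat)) for i in range(n_star))

print(Y_a, Y_b)                               # equal because sum(w2) == sum(w3)
```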
6. Discussion
Some alternative approaches
Nested two-phase sampling: A2 ⊂ A1.
Non-nested two-phase sampling: A1, A2 independent.
We can convert non-nested two-phase sampling into nested two-phase sampling with A2 ⊂ A1*, where A1* = A1 ∪ A2.
Synthetic data can be released for A1*.
6. Discussion
Parametric multiple imputation
Assume that f(yi | xi, θ) is known for fixed θ and that A1 and A2 are simple random samples.
Obtain the posterior distribution p(θ | y2, x2), assuming a diffuse prior on θ, where (y2, x2) denotes the data from A2.
Draw M values θ^(1), ..., θ^(M) from the posterior distribution.
Draw yi^(l) from f(yi | xi, θ^(l)) for i ∈ A1 and l = 1, ..., M.
Synthetic data sets: {yi^(l), i ∈ A1}, l = 1, ..., M.
Standard multiple imputation variance estimators do not work here. Reiter (2008) proposed a two-stage imputation procedure requiring T synthetic data sets {yit^(l) : i ∈ A1}, t = 1, ..., T, to be generated for each θ^(l). In all, TM synthetic data sets are generated.
6. Discussion
Conclusion
The proposed method is based on deterministic imputation to generate the synthetic values.
Synthetic data, along with the replicates, are created for survey 1, and only the survey 1 data file is released.
Significant efficiency gains are achieved for domain estimation.
A stochastic imputation approach is under study.
REFERENCES
Binder, D.A., Babyak, C., Brodeur, M., Hidiroglou, M. & Jocelyn, W. (2000). Variance estimation for two-phase stratified sampling. Canadian Journal of Statistics 28, 751–764.
Fuller, W.A. (2003). Estimation for multiple phase samples. In Analysis of Survey Data, R.L. Chambers & C.J. Skinner, eds. Chichester: Wiley.
Fuller, W.A. & Kim, J.K. (2005). Hot deck imputation for the response model. Survey Methodology 31, 139–149.
Hansen, M. & Hurwitz, W. (1946). The problem of non-response in sample surveys. Journal of the American Statistical Association 41, 517–529.
Hidiroglou, M. (2001). Double sampling. Survey Methodology 27, 143–154.
Hidiroglou, M.A., Rao, J.N.K. & Haziza, D. (2009). Variance estimation in two-phase sampling. Australian and New Zealand Journal of Statistics 51, 127–141.
Kim, J.K., Navarro, A. & Fuller, W.A. (2006). Replicate variance estimation after multi-phase stratified sampling. Journal of the American Statistical Association 101, 312–320.
Kott, P. & Stukel, D. (1997). Can the jackknife be used with a two-phase sample? Survey Methodology 23, 81–89.
Neyman, J. (1934). On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society 97, 558–606.
Rao, J.N.K. (1973). On double sampling for stratification and analytical surveys. Biometrika 60, 125–133.
Reiter, J. (2008). Multiple imputation when records used for imputation are not used or disseminated for analysis. Biometrika 95, 933–946.