Statistical Matching using Fractional Imputation

Jae-Kwang Kim, Iowa State University
Joint work with Emily Berg and Taesung Park
Outline

1. Introduction
2. Classical Approaches
3. Proposed method
4. Application: Measurement error models
5. Simulation Study
6. Conclusion
Introduction

Motivation

Combine information from several surveys.
Example: Two surveys
  1. Survey A: Observe X and Y1
  2. Survey B: Observe X and Y2
We want to create a data file with X, Y1, Y2.
If the Survey B sample is a subset of the Survey A sample, then we may use record linkage techniques to obtain the Y1 value for the Survey B sample.
What if the two samples are independent?
Introduction

Table: A simple data structure for matching

            X    Y1   Y2
Sample A    o    o
Sample B    o         o
Introduction

Table: Data after statistical matching

            X    Y1   Y2
Sample A    o    o    o
Sample B    o    o    o

Statistical matching is also called data fusion or data combination.
Introduction

Example 1: Split questionnaire design

Split the original sample into two groups:
  In group 1, ask (x, y1).
  In group 2, ask (x, y2).
Often used to reduce the response burden (and improve the quality of the survey responses).
Introduction

Example 2: Combining two surveys

Survey A: Health-related survey.
Survey B: Socio-economic survey.
x: demographic variables; y1: health status variables; y2: socio-economic variables.
We are interested in fitting a regression of y1 (e.g., obesity) on x and y2 using the two surveys.
The two samples should be obtained from the same finite population.
Introduction

Idea

We want to create Y1 for each element in sample B by finding a "statistical twin" in sample A.
Matching is often based on the assumption that Y1 and Y2 are conditionally independent given X. That is,
  Y1 ⊥ Y2 | X.
Under the CI (conditional independence) assumption, we have
  f(y1 | x, y2) = f(y1 | x),
and the statistical twin is determined solely by how close the elements are in terms of the x's.
Introduction

Remark

Under the assumption that (X, Y1, Y2) is multivariate normal, the CI assumption means that
  σ12 = σ1x σ2x / σxx
and
  ρ12 = ρ1x ρ2x.
That is, σ12 is determined from the other parameters, rather than estimated from the realized samples.
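As a quick numerical sanity check of this identity, the snippet below (illustrative data, not from the talk) simulates (X, Y1, Y2) with Y1 and Y2 conditionally independent given X and verifies ρ12 ≈ ρ1x ρ2x empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y1 = 2.0 * x + rng.normal(size=n)   # Y1 | X; error independent of everything else
y2 = -1.0 * x + rng.normal(size=n)  # Y2 | X; error independent of Y1's error

r = np.corrcoef(np.column_stack([x, y1, y2]), rowvar=False)
rho_1x, rho_2x, rho_12 = r[0, 1], r[0, 2], r[1, 2]
# Under CI, the two quantities below agree up to sampling error.
print(rho_12, rho_1x * rho_2x)
```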
Existing Methods

Methods under the CI assumption

Synthetic data imputation:
  1. Estimate f(y1 | x) from sample A, denoted by f̂a(y1 | x).
  2. For each element in sample B, use the xi value to create imputed value(s) from f̂a(y1 | x).
Matching: a two-step method. Instead of using the synthetic values directly for imputation, the synthetic values are used to identify the statistical twins in sample A. The identified twin in sample A is used as the imputed value.
Existing Methods

Some popular methods under the CI assumption

Parametric approach: often based on a parametric regression model,
  ŷ1i = β̂0 + β̂1 xi.
Nonparametric approaches:
  Random hot deck
  Rank hot deck
  Distance hot deck
Reference: D'Orazio, Di Zio, and Scanu (2006). Statistical Matching: Theory and Practice, Wiley.
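A minimal sketch of distance hot deck matching: for each record in sample B, take the y1 value of the sample-A record whose x is closest (its "statistical twin"). The data and variable names are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
xA = rng.normal(size=1000)
y1A = 1.0 + xA + rng.normal(scale=0.5, size=1000)  # donors: (x, y1) in sample A
xB = rng.normal(size=5)                            # recipients: x observed in sample B

order = np.argsort(xA)                   # sort donors by x for fast lookup
pos = np.searchsorted(xA[order], xB)     # insertion point of each xB among sorted donors
pos = np.clip(pos, 1, len(xA) - 1)
left, right = order[pos - 1], order[pos] # nearest donor is one of the two neighbors
nearest = np.where(np.abs(xA[left] - xB) <= np.abs(xA[right] - xB), left, right)
y1_matched = y1A[nearest]
print(y1_matched)  # imputed y1 for the 5 sample-B records
```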
New Approach

Motivation

For data matched under the CI assumption, the regression of Y1 on X and Y2 will produce an insignificant regression coefficient on Y2. That is, the p-value for β̂2 will be large in
  ŷ1 = β̂0 + β̂1 x + β̂2 y2.
The CI assumption is often unrealistic! For example:
  1. X is often a demographic variable,
  2. Y1 is a social-behavior (or public health) variable,
  3. Y2 is an economic variable (e.g., household income).
In this case, we may have
  Corr(Y1, Y2 | X) ≠ 0.
New Approach

Alternative interpretation

We can view the problem as an omitted-variable regression problem:
  y1 = β0^(1) + β1^(1) x + β2^(1) z + e1
  y2 = β0^(2) + β1^(2) x + β2^(2) z + e2,
where z, e1, e2 are never observed, and e1 and e2 are independent.
Here z is an unobservable confounding factor that explains Cov(y1, y2 | x) ≠ 0.
Thus, if we fit a regression of (y1, y2) on x, then the error terms are still correlated.
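The omitted-variable view can be checked numerically. In this sketch (illustrative coefficients), z is an unobserved confounder; after regressing y1 and y2 separately on x, the residuals remain clearly correlated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
z = rng.normal(size=n)                        # never observed
y1 = 1.0 + x + 0.8 * z + rng.normal(size=n)   # e1 independent of e2
y2 = -0.5 + 2.0 * x + 0.6 * z + rng.normal(size=n)

# Regress each outcome on x alone and look at the residual correlation.
X = np.column_stack([np.ones(n), x])
r1 = y1 - X @ np.linalg.lstsq(X, y1, rcond=None)[0]
r2 = y2 - X @ np.linalg.lstsq(X, y2, rcond=None)[0]
c = np.corrcoef(r1, r2)[0, 1]
print(c)  # clearly nonzero: Corr(Y1, Y2 | X) != 0
```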
New Approach

Instrumental variable

Under the CI assumption, imputed values are generated from f(y1 | x), which completely ignores the observed information in y2.
We would instead like to generate imputed values from f(y1 | x, y2).
However, we cannot estimate the parameters in f(y1 | x, y2), since y1 and y2 are never jointly observed.
We use an instrumental variable assumption for identification of the model.
New Approach

Idea

Decompose X = (X1, X2) such that
  (i) f(y1 | x1, x2, y2) = f(y1 | x1, y2);
  (ii) f(y1 | x1, x2 = a) ≠ f(y1 | x1, x2 = b) for some a ≠ b.
X2 is often called an instrumental variable (IV) for Y2.
New Approach

Proposed method

Under the IV assumption,
  f(y1 | x, y2) ∝ f(y1 | x) f(y2 | x1, y1).
The second term can be ignored under the CI assumption; it is this term that incorporates the observed information in y2 in sample B.
The EM algorithm can be used to perform parameter estimation and prediction simultaneously.
The E-step can be computationally heavy (Markov chain Monte Carlo).
Metropolis-Hastings algorithm:
  1. Generate y1* from f̂a(y1 | x).
  2. Accept y1* if f(y2 | x1, y1*; θ̂) is large at the current parameter value θ̂.
New Approach

Proposed method

Parametric fractional imputation (PFI) of Kim (2011) is an alternative computational tool that does not involve MCMC computation but still implements the EM algorithm with an intractable E-step.
PFI uses importance sampling: when the target distribution is
  f(y1 | x, y2) ∝ f(y1 | x) f(y2 | x1, y1),
first generate m values y1* ~ f(y1 | x) and then use a normalized version of f(y2 | x1, y1*) as the weight assigned to y1*. Solve the weighted score equation to update the parameters in the M-step.
New Approach

Proposed method: Parametric fractional imputation

1. For each i ∈ B, generate m imputed values of y1, denoted by y1i*(1), ..., y1i*(m), from f̂a(y1 | xi).
2. Let θ̂t be the current parameter value of θ in f(y2 | x1, y1). For the j-th imputed value y1i*(j), assign the fractional weight
     wij* ∝ f(y2i | x1i, y1i*(j); θ̂t),
   where Σ_{j=1}^m wij* = 1.
3. Solve the fractionally imputed score equation for θ,
     Σ_{i∈B} wib Σ_{j=1}^m wij* S(θ; x1i, y1i*(j), y2i) = 0,
   to update θ̂_{t+1}, where S(θ; x1, y1, y2) = ∂ log f(y2 | x1, y1; θ)/∂θ.
4. Go to step 2 and continue until convergence.
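The four steps above can be sketched in numpy under assumed normal working models (this is an illustrative toy, not the authors' implementation): f(y1 | x1, x2) = N(a0 + a1 x1 + a2 x2, σa²) fitted on sample A, and f(y2 | x1, y1; θ) = N(θ0 + θ1 x1 + θ2 y1, σ²), with x2 serving as the IV. All data, coefficients, and sampling weights (taken equal to 1) are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
nA = nB = 2000
# Sample A: observe (x1, x2, y1). Sample B: observe (x1, x2, y2); y1 is latent.
x1A, x2A = rng.normal(size=nA), rng.normal(size=nA)
y1A = 1.0 + x1A + x2A + rng.normal(size=nA)
x1B, x2B = rng.normal(size=nB), rng.normal(size=nB)
y1B = 1.0 + x1B + x2B + rng.normal(size=nB)              # never observed
y2B = 0.5 + 0.5 * x1B + 0.8 * y1B + rng.normal(size=nB)  # observed

# Fit f(y1 | x) on sample A by OLS.
XA = np.column_stack([np.ones(nA), x1A, x2A])
a = np.linalg.lstsq(XA, y1A, rcond=None)[0]
sig_a = np.sqrt(np.mean((y1A - XA @ a) ** 2))

# Step 1: m imputed y1 values per sample-B record, drawn from f^a(y1 | xi).
m = 50
y1_imp = (a[0] + a[1] * x1B + a[2] * x2B)[:, None] + sig_a * rng.normal(size=(nB, m))

theta, sig2 = np.zeros(3), 1.0
for _ in range(50):
    # Step 2: fractional weights w*_ij ∝ f(y2i | x1i, y1ij*; theta_t), normalized over j.
    mu = theta[0] + theta[1] * x1B[:, None] + theta[2] * y1_imp
    logw = -0.5 * (y2B[:, None] - mu) ** 2 / sig2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Step 3: the weighted score equation is weighted least squares here.
    Z = np.column_stack([np.ones(nB * m), np.repeat(x1B, m), y1_imp.ravel()])
    wts, y = w.ravel(), np.repeat(y2B, m)
    WZ = Z * wts[:, None]
    theta = np.linalg.solve(Z.T @ WZ, WZ.T @ y)
    sig2 = np.sum(wts * (y - Z @ theta) ** 2) / nB

print(theta)  # estimates of (th0, th1, th2); data were generated with (0.5, 0.5, 0.8)
```

Note that without x2 in the y1 model (the relevance condition), θ2 would not be identified from (x, y2) alone; the exclusion of x2 from f(y2 | x1, y1) is what makes the weighting informative.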
Remark

Fractional imputation can be understood as a tool for computing a Monte Carlo approximation of the conditional expectation given the observations.
The fractionally imputed data file can be used to estimate many different parameters. That is, if a parameter η is defined as the solution to E{U(η; x, y1, y2)} = 0, then a consistent estimator of η can be obtained as the solution to
  Σ_{i∈B} wib Σ_{j=1}^m wij* U(η; xi, y1i*(j), y2i) = 0.
Note that the above estimating equation is a Monte Carlo approximation to the estimating equation
  Σ_{i∈B} wib E{U(η; xi, Y1i, y2i) | xi, y2i} = 0.
For variance estimation, a linearization method can be used (skipped here).
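As an illustration of reusing the fractionally imputed file for a different parameter, the sketch below (made-up imputed values and weights) solves the weighted estimating equation for η = E(Y1), taking U(η; y1) = y1 − η and unit sampling weights:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 500, 10
# Hypothetical fractionally imputed file: m imputed y1 values per record,
# with fractional weights w[i, j] normalized to sum to 1 within each row.
y1_imp = rng.normal(loc=2.0, scale=1.0, size=(n, m))
w = rng.random((n, m))
w /= w.sum(axis=1, keepdims=True)

# Solving sum_i sum_j w_ij (y1_ij* - eta) = 0 gives the weighted mean.
eta_hat = np.sum(w * y1_imp) / n
print(eta_hat)  # FI estimate of E(Y1); the data were generated with mean 2.0
```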
Application to Measurement error models

We are interested in estimating θ in f(y | x; θ).
Instead of observing x, we observe z, which can be highly correlated with x.
Thus, z is an instrumental variable for x:
  f(y | x, z) = f(y | x)
and
  f(y | z = a) ≠ f(y | z = b) for a ≠ b.
In addition to the original sample, we have a separate calibration sample in which (xi, zi) is observed.
Example: Measurement error model

Table: External calibration study

            Z    X    Y
Sample A    o    o
Sample B    o         o

Table: Internal calibration study

                            Z    X    Y
Validation subsample        o    o    o
Non-validation subsample    o         o
Remark

Internal calibration study: a two-phase sampling structure.
  Phase one: observe (z, y).
  Phase two (validation subsample): observe x in addition to (z, y).
Imputation approach for two-phase sampling:
  Estimate f(x | z, y) from the second-phase sample.
  For the elements in the phase-one sample, generate x ~ f̂(x | z, y).
For the external calibration study, we use the proposed statistical matching technique under the assumption that f(y | x, z) = f(y | x).
Proposed method: Idea

In sample B, x is a latent variable (a variable that is always missing).
The goal is to generate x in sample B from
  f(xi | zi, yi) ∝ f(xi | zi) f(yi | xi, zi) = f(xi | zi) f(yi | xi).
Obtain a consistent estimator f̂a(x | z) from sample A.
We may use a Monte Carlo EM algorithm:
  E-step: Generate xi*(1), ..., xi*(m) from
    f(xi | zi, yi; θ̂(t)) ∝ f̂a(xi | zi) f(yi | xi; θ̂(t)).
  M-step: Solve the imputed score equation for θ.
Fractional imputation for the EM algorithm

The above E-step may be computationally challenging (it often relies on an MCMC method).
Parametric fractional imputation can be used for easy computation.
E-step:
  1. Generate xi*(1), ..., xi*(m) from f̂a(xi | zi) for i ∈ B.
  2. Compute the fractional weights associated with xi*(j) by
       wij* ∝ f(yi | xi*(j); θ̂(t)),
     where Σj wij* = 1.
M-step: Solve the weighted score equation for θ.
Simulation Setup

Measurement error model setup:
  yi ~ Bernoulli(pi), logit(pi) = γ0 + γx xi
  zi = β0 + β1 xi + ui
  ui ~ N(0, σ² xi^{2α}) and xi ~ N(µx, σx²).
We observe (xi, zi), i = 1, ..., nA, in sample A. In sample B, instead of observing (xi, yi), we observe (zi, yi).
For the simulation, nA = nB = 800, γ0 = 1, γx = 1, β0 = 0, β1 = 1, σ² = 0.25, α = 0.4, µx = 0, and σx² = 1.
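A PFI sketch for this measurement error setup can be written in a few dozen lines of numpy. For simplicity the sketch takes α = 0 (homoscedastic measurement error), fits f(x | z) by regressing x on z in sample A, and runs the fractionally weighted EM with one Newton step of weighted logistic regression per iteration; it is an illustration under these assumptions, not the study code.

```python
import numpy as np

rng = np.random.default_rng(5)
nA = nB = 800
g0, gx, b0, b1, s2 = 1.0, 1.0, 0.0, 1.0, 0.25  # truth; alpha = 0 for simplicity
xA = rng.normal(size=nA); zA = b0 + b1 * xA + np.sqrt(s2) * rng.normal(size=nA)
xB = rng.normal(size=nB); zB = b0 + b1 * xB + np.sqrt(s2) * rng.normal(size=nB)
yB = rng.binomial(1, 1.0 / (1.0 + np.exp(-(g0 + gx * xB))))

# Estimate f(x | z) from sample A (normal working model: regress x on z).
ZA = np.column_stack([np.ones(nA), zA])
a = np.linalg.lstsq(ZA, xA, rcond=None)[0]
sig = np.sqrt(np.mean((xA - ZA @ a) ** 2))

# Generate m imputed x values per sample-B record from f^a(x | z).
m = 50
x_imp = (a[0] + a[1] * zB)[:, None] + sig * rng.normal(size=(nB, m))

gamma = np.zeros(2)
for _ in range(100):
    # E-step: fractional weights w_ij ∝ f(y_i | x_ij*; gamma_t) (Bernoulli-logit).
    p = 1.0 / (1.0 + np.exp(-(gamma[0] + gamma[1] * x_imp)))
    w = np.where(yB[:, None] == 1, p, 1.0 - p)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: one Newton step of weighted logistic regression of y on (1, x*).
    X = np.column_stack([np.ones(nB * m), x_imp.ravel()])
    wts, yrep = w.ravel(), np.repeat(yB, m)
    pr = 1.0 / (1.0 + np.exp(-(X @ gamma)))
    grad = X.T @ (wts * (yrep - pr))
    H = (X * (wts * pr * (1 - pr))[:, None]).T @ X
    gamma = gamma + np.linalg.solve(H, grad)

print(gamma)  # estimates of (gamma0, gamma_x); the data were generated with (1, 1)
```

In contrast, the naive logistic regression of y on z in sample B attenuates γx toward zero, which is the bias the table on the next slide quantifies.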
Methods

1. Parametric fractional imputation (PFI).
2. Hot deck fractional imputation (HDFI).
3. Naive: naive estimator obtained from the logistic regression of yi on zi for i ∈ B.
4. Bayes: proposed by Guo and Little (2011). Gibbs sampling is implemented with JAGS. We used 1000 iterations of a single chain for inference, after discarding the first 500 for burn-in. We specify diffuse proper prior distributions for the Bayes estimators. Letting θ1 = (log(σx²), log(σ²), µx, β0, β1, γ0, γx), we assume a priori that θ1 ~ N(0, 10⁶ I7), where I7 is the 7 × 7 identity matrix. The prior distribution for the power α is uniform on the interval [−5, 5].
5. Weighted regression calibration (WRC): regression calibration incorporating the unequal variances in the measurement error model (also considered in Guo and Little, 2011).
Simulation result

Table: Monte Carlo (MC) bias, variance, and mean squared error (MSE) of point estimators of γx

Method    MC Bias    MC Variance    MC MSE
PFI        0.0239     0.0386        0.0392
HDFI       0.0246     0.0387        0.0393
Naive     -0.2241     0.0239        0.0742
Bayes      0.0406     0.0415        0.0432
WRC        0.1120     0.0499        0.0625
Concluding Remark

Statistical matching is a tool for survey data integration.
The current practice of statistical matching is based on the conditional independence assumption, which may not be realistic in practice.
A new approach based on instrumental variables is proposed.
The proposed method provides statistically valid regression coefficients for the matched data even when the CI assumption does not hold.
Variance estimation is possible (not covered here).
The method is directly applicable to measurement error models and split questionnaire designs.
Future research

Semiparametric inference, making f̂a(y1 | x) nonparametric in
  f(y1 | x, y2) ∝ f(y1 | x) f(y2 | x1, y1).
Application to causal inference: estimation of the average treatment effect from observational studies, where we cannot observe the counterfactual outcomes.
Combination of two data sources: one from a probability sample and the other from a non-probability sample.
The end