Weighting in survey analysis under informative sampling Jae-Kwang Kim August 22, 2013

Jae-Kwang Kim, Iowa State University
Joint work with Chris Skinner at London School of Economics and Political Science
Reference
Kim, J.K. and Skinner, C.J. (2013). “Weighting in survey analysis under
informative sampling,” Biometrika 100, 385-398.
Kim (ISU)
Informative sampling
August 22, 2013
Parameters
Two types of parameters
Descriptive parameter: “How many people in the United States were
unemployed on August 15, 2013?”
Analytic parameter: “If personal income (in the United States) increases 2%, how much will the consumption of beef increase?”
Basic approach to estimating analytic parameters
1. Specify a model that describes the relationship among the variables (often called a superpopulation model).
2. Estimate the parameters in the model using the realized sample.
Basic Setup
$U = \{1, \ldots, N\}$: index set of the finite population (of size $N$)
$F_N = \{(x_i, y_i);\ i \in U\}$: realized values in the finite population
We may assume that $F_N$ is a realization from a model with density $f(y \mid x; \theta)$.
$I_j = 1$ if element $j$ is sampled and $I_j = 0$ otherwise.
We are interested in estimating θ from the sample.
A Regression Model
The finite population is a realization from the model
$$y_i = x_i \beta + e_i \qquad (1)$$
where the $e_i$ are independent $(0, \sigma^2)$ random variables, independent of $x_j$ for all $i$ and $j$.
We are interested in estimating $\beta$ from the sample.
First-order inclusion probabilities $\pi_i$ are available.
Estimation of Regression Coefficients
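OLS estimator
$$\hat\beta_{ols} = \left( \sum_{i=1}^{N} I_i x_i' x_i \right)^{-1} \sum_{i=1}^{N} I_i x_i' y_i$$
Probability weighted estimator
$$\hat\beta_{\pi} = \left( \sum_{i=1}^{N} I_i d_i x_i' x_i \right)^{-1} \sum_{i=1}^{N} I_i d_i x_i' y_i$$
where $d_i = 1/\pi_i$.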
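The two estimators above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk: the function names `ols_estimator` and `pw_estimator` and the four-unit data set are hypothetical, and the arrays are assumed to hold only the sampled units (those with $I_i = 1$), with each $x_i$ stored as a row of `X`.

```python
import numpy as np

def ols_estimator(X, y):
    """Unweighted least squares on the sampled rows: (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def pw_estimator(X, y, pi):
    """Probability-weighted least squares with design weights d_i = 1 / pi_i."""
    d = 1.0 / pi
    Xd = X * d[:, None]              # scale each row x_i by its weight d_i
    return np.linalg.solve(Xd.T @ X, Xd.T @ y)

# Hypothetical sample of four units: intercept plus one covariate.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])       # lies exactly on y = 1 + 2x
pi = np.array([0.1, 0.2, 0.4, 0.8])      # assumed inclusion probabilities

beta_ols = ols_estimator(X, y)
beta_pw = pw_estimator(X, y, pi)
print(beta_ols, beta_pw)
```

Because the toy responses lie exactly on a line, both estimators recover the same coefficients here; they differ only when the residuals are non-zero and related to the weights.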
Informative Sampling
A non-informative sampling design (with respect to the superpopulation model) satisfies
$$P(y_i \in B \mid x_i, I_i = 1) = P(y_i \in B \mid x_i) \qquad (2)$$
for any measurable set $B$. The left side is the sample model and the right side is the population model.
Informative sampling design: Equality (2) does not hold.
Non-informative sampling for regression implies
$$E(x_i' e_i \mid I_i = 1) = 0. \qquad (3)$$
If condition (3) is satisfied, $\hat\beta_{ols}$ is unbiased.
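A small simulation can make the failure of condition (3) concrete. The sketch below is an assumption-laden illustration rather than anything from the talk: the population model $y = 1 + 2x + e$, the Poisson sampling design, and the particular form of $\pi_i$ (increasing in $e_i$) are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2013)

# Hypothetical finite population generated from model (1): y = 1 + 2x + e.
N = 50_000
x = rng.uniform(0.0, 2.0, size=N)
e = rng.normal(0.0, 1.0, size=N)
y = 1.0 + 2.0 * x + e

# Assumed informative design: Poisson sampling with pi_i increasing in e_i,
# so units with large positive errors are over-represented in the sample.
pi = 0.05 + 0.4 / (1.0 + np.exp(-2.0 * e))
sampled = rng.uniform(size=N) < pi

X = np.column_stack([np.ones(sampled.sum()), x[sampled]])
ys = y[sampled]
d = 1.0 / pi[sampled]

beta_ols = np.linalg.solve(X.T @ X, X.T @ ys)
Xd = X * d[:, None]
beta_pw = np.linalg.solve(Xd.T @ X, Xd.T @ ys)

# Sample analogue of condition (3): average of x_i' e_i over sampled units.
moment = X.T @ e[sampled] / sampled.sum()

print("OLS:", beta_ols)
print("PW: ", beta_pw)
print("sample mean of x_i' e_i:", moment)
```

In this design the sampled errors have a positive mean, so the unweighted intercept is pushed upward, while the probability weighted estimator stays close to the generating values.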
Hypothesis Testing: DuMouchel and Duncan (1983), Fuller (1984)
Thus, one may want to test (3), or test directly
$$H_0 : E\{\hat\beta_{ols}\} = E\{\hat\beta_{\pi}\} \qquad (4)$$
1. From the sample, fit a regression of $y_i$ on $(x_i, z_i)$, where $z_i = \pi_i^{-1} x_i$.
2. Perform a test for $\gamma = 0$ under the expanded model
$$y = X\beta + Z\gamma + a$$
where $a$ is the error term satisfying $E(a \mid X, Z) = 0$.
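The two steps can be sketched as follows. This is a simplified illustration: the function name `dumouchel_duncan_F` is hypothetical, the simulated data are invented, and the statistic is the classical F statistic based on residual sums of squares under homoscedastic errors, not a design-consistent version. The statistic would be referred to an F distribution with `df1` and `df2` degrees of freedom.

```python
import numpy as np

def dumouchel_duncan_F(X, y, d):
    """F statistic for gamma = 0 in y = X beta + Z gamma + a, with Z = d * X."""
    n, p = X.shape
    W = np.hstack([X, X * d[:, None]])       # expanded design matrix (X, Z)

    def rss(A):
        coef = np.linalg.lstsq(A, y, rcond=None)[0]
        resid = y - A @ coef
        return resid @ resid

    rss0, rss1 = rss(X), rss(W)              # reduced and expanded fits
    df1, df2 = p, n - 2 * p
    F = ((rss0 - rss1) / df1) / (rss1 / df2)
    return F, df1, df2

# Hypothetical informative sample: pi_i depends on the model error e_i.
rng = np.random.default_rng(1983)
N = 20_000
x = rng.uniform(0.0, 2.0, size=N)
e = rng.normal(size=N)
y = 1.0 + 2.0 * x + e
pi = 0.05 + 0.4 / (1.0 + np.exp(-2.0 * e))
s = rng.uniform(size=N) < pi

X = np.column_stack([np.ones(s.sum()), x[s]])
F, df1, df2 = dumouchel_duncan_F(X, y[s], 1.0 / pi[s])
print(F, df1, df2)
```

A large value of `F` points toward rejecting $\gamma = 0$, that is, toward an informative design under the current model.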
Remarks on Testing
When testing the hypothesis, a design-consistent variance estimator is preferable.
Rejecting the null hypothesis means that we cannot directly use the OLS estimator under the current model. Two options are then available:
1. Include more $x$'s until the sampling design is non-informative under the expanded model.
2. Use the probability weighted estimator or another consistent estimator.
Modifying weights: retaining the benefits of weights while mitigating their disadvantages
Pros and Cons of weighting
Pros
Avoids bias from informative sampling when the inclusion probabilities $\pi_j$ are unequal (note: other approaches can also do this, e.g. the sample likelihood; Pfeffermann, 2011)
Protects against model misspecification
Makes efficient use of population-level information
Cons
Variance inflation from unequal inclusion probabilities
Weighted Estimators Under Informative Sampling
Magee (1998): The solution to
$$\sum_{i=1}^{N} I_i d_i (y_i - x_i'\beta) x_i q(x_i) = 0$$
is consistent for $\beta$, for any function $q(x_i)$ of $x_i$.
Estimators Under Informative Sampling
Proof of Magee’s result: Let
$$U_q(\beta) = \sum_{i=1}^{N} I_i d_i (y_i - x_i'\beta) x_i q_i.$$
Note that
$$E\{U_q(\beta) \mid F_N\} = \sum_{i=1}^{N} (y_i - x_i'\beta) x_i q_i.$$
Under the regression model, the model expectation of the above term is zero as long as $q_i$ is a function of $x_i$ only.
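The two expectations in the proof can be written out step by step. The sketch below assumes that $E(I_i \mid F_N) = \pi_i$, that $e_i = y_i - x_i'\beta$ at the true $\beta$, and that $E_m(e_i \mid x_i) = 0$ under the regression model; the symbol $E_m$ for the model expectation is borrowed from the later slides.

```latex
% Design expectation: E(I_i | F_N) = pi_i and d_i = 1/pi_i, so E(I_i d_i | F_N) = 1.
E\{U_q(\beta) \mid F_N\}
  = \sum_{i=1}^{N} E(I_i \mid F_N)\, d_i\, (y_i - x_i'\beta)\, x_i q_i
  = \sum_{i=1}^{N} (y_i - x_i'\beta)\, x_i q_i .

% Model expectation: q_i = q(x_i) is fixed given x_i, so it factors out.
E_m\{(y_i - x_i'\beta)\, x_i q_i \mid x_i\}
  = x_i\, q(x_i)\, E_m(e_i \mid x_i) = 0 .
```

If $q_i$ also depended on $y_i$, it could not be taken outside the conditional expectation in the second line, which is why the result is restricted to functions of $x_i$ only.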
Optimization Problem
Idea: Find a function $q(\cdot)$ which minimizes $V(\hat\beta_q)$ with respect to the joint distribution of the sampling mechanism and the model.
Class: $\hat\beta_q$ solves
$$\sum_{j=1}^{N} I_j d_j q_j u_j(\beta) = 0$$
and satisfies $E_m\{u_j(\beta)\} = 0$.
Objective function: design-model variance
$$V(\hat\beta_q) = J(\beta)^{-1}\, V\left\{ \sum_{j=1}^{N} I_j d_j q_j u_j(\beta) \right\} J(\beta)^{-1}$$
where
$$J = E\left\{ \sum_{j=1}^{N} I_j d_j q_j \frac{\partial u_j(\beta)}{\partial \beta} \right\}.$$
Approximations / Assumptions
Observations for different units are approximately independent.
Generalized linear model, so that $u_j(\beta) = e_j x_j$.
$$V(\hat\beta_q) \cong \left\{ \sum_{j=1}^{N} q_j E_m(e_j^2) x_j x_j' \right\}^{-1} \left\{ \sum_{j=1}^{N} q_j^2 E_m(d_j e_j^2) x_j x_j' \right\} \left\{ \sum_{j=1}^{N} q_j E_m(e_j^2) x_j x_j' \right\}^{-1}$$
(Approximately) Optimal Solution
$$q_j^* \propto E_m(e_j^2 \mid x_j) / E_m(d_j e_j^2 \mid x_j)$$
Requires fitting a model to $E_m(d_j e_j^2 \mid x_j)$.
Equivalent to Fuller (2009, Sect. 6.3.2) for the linear regression model.
Different from Pfeffermann and Sverchkov (1999, 2003):
$$q_j \propto 1/E_m(d_j \mid I_j = 1, x_j)$$
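One way to see where the form of $q_j^*$ comes from is the scalar-covariate case, where the sandwich variance reduces to a ratio of sums. The sketch below is a simplified argument under stated assumptions, not the derivation in Kim and Skinner (2013): $x_j$ is scalar, the moments are taken conditionally on $x_j$, and the shorthand $a_j$, $b_j$ is introduced only for this illustration, with $a_j > 0$.

```latex
% Shorthand (hypothetical notation):
%   a_j = E_m(d_j e_j^2 | x_j) x_j^2 ,   b_j = E_m(e_j^2 | x_j) x_j^2 .
V(\hat\beta_q) \cong \frac{\sum_{j} q_j^2 a_j}{\left( \sum_{j} q_j b_j \right)^{2}} .

% Cauchy-Schwarz applied to (q_j \sqrt{a_j}) and (b_j / \sqrt{a_j}):
\left( \sum_{j} q_j b_j \right)^{2}
  \le \left( \sum_{j} q_j^2 a_j \right) \left( \sum_{j} \frac{b_j^2}{a_j} \right)
\quad\Longrightarrow\quad
V(\hat\beta_q) \ge \left( \sum_{j} \frac{b_j^2}{a_j} \right)^{-1} .

% Equality holds when q_j \sqrt{a_j} is proportional to b_j / \sqrt{a_j}, i.e.
q_j \propto \frac{b_j}{a_j}
  = \frac{E_m(e_j^2 \mid x_j)}{E_m(d_j e_j^2 \mid x_j)} .
```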
Fuller’s estimator: Under $E_m(e_j^2 \mid x_j) = \sigma^2$
(Estimated) GLS estimator: Minimize
$$Q(\beta) = \sum_{i=1}^{N} I_i d_i (y_i - x_i \beta)^2 / v_i^2 \qquad (5)$$
where $v_i^2 = E\{ d_i (y_i - x_i \beta)^2 \mid x_i \}$.
1. Obtain $\hat\beta_{\pi}$ and compute $\hat e_i = y_i - x_i \hat\beta_{\pi}$.
2. Fit a (nonlinear) regression model of $a_i^2 = d_i \hat e_i^2$ on $x_i$,
$$a_i^2 = q_a(x_i; \gamma_a) + r_{ai},$$
to get $\hat v_i^2 = q_a(x_i; \hat\gamma_a)$ and insert $\hat v_i^2$ in (5).
For variance estimation, the variability of $\hat\gamma_a$ can be safely ignored.
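The two-step procedure can be sketched as below. Several choices here are assumptions made for the illustration, not part of the talk: the function names `wls` and `fuller_egls`, the simulated population, and in particular the working model for $q_a$, which is taken to be log-linear in $x_i$ and fitted by ordinary least squares on the log scale so that $\hat v_i^2$ is always positive. A log-scale fit recovers $E(a_i^2 \mid x_i)$ only up to a factor, which leaves the minimizer of (5) unchanged when that factor is constant.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: solve sum_i w_i x_i' (y_i - x_i beta) = 0."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def fuller_egls(X, y, d, floor=1e-12):
    """Two-step estimated GLS sketch for criterion (5)."""
    beta_pi = wls(X, y, d)                       # step 1: probability weighted fit
    a2 = d * (y - X @ beta_pi) ** 2              # a_i^2 = d_i * ehat_i^2
    # Step 2 (assumed working model): log a_i^2 is linear in x_i.
    log_a2 = np.log(np.maximum(a2, floor))       # floor guards against log(0)
    gamma = np.linalg.lstsq(X, log_a2, rcond=None)[0]
    v2 = np.exp(X @ gamma)                       # vhat_i^2 > 0 by construction
    return wls(X, y, d / v2), beta_pi, v2        # minimizer of (5) with vhat_i^2

# Hypothetical population; pi_i depends on both x_i and the error e_i.
rng = np.random.default_rng(2009)
N = 50_000
x = rng.uniform(0.0, 2.0, size=N)
e = rng.normal(size=N)
y = 1.0 + 2.0 * x + e
pi = (0.1 + 0.3 / (1.0 + np.exp(-e))) * (0.3 + 0.35 * x)
s = rng.uniform(size=N) < pi

X = np.column_stack([np.ones(s.sum()), x[s]])
beta_egls, beta_pi, v2 = fuller_egls(X, y[s], 1.0 / pi[s])
print(beta_pi, beta_egls)
```

Since $\hat v_i^2$ depends on the data only through $x_i$ and the fitted $\hat\gamma_a$, the final weights have the form $d_i q(x_i)$ covered by Magee's result.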
Practical Solution
As a first approximation, we may suppose that $d_j$ and $e_j^2$ are uncorrelated (conditionally on $x_j$) and set $q_j = 1/E_m(d_j \mid x_j)$.
Thus, use $w_j = d_j q_j = d_j / E_m(d_j \mid x_j)$.
If $d_j$ is well explained by $x_j$, then the variability of $w_j$ will be reduced.
The resulting weight is the design weight standardized for its dependence on $x_j$.
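A minimal sketch of the standardized weight $w_j = d_j / E_m(d_j \mid x_j)$ follows. The talk does not say how $E_m(d_j \mid x_j)$ should be estimated; here it is assumed, purely for illustration, to be linear in $x_j$ and fitted by a design-weighted least squares regression of $d_j$ on $x_j$, which targets the population-level regression. The names `standardized_weights` and `cv` and the simulated design are hypothetical.

```python
import numpy as np

def standardized_weights(Z, d):
    """w_j = d_j / Ehat(d_j | x_j), using an assumed linear model for E(d | x)."""
    Zd = Z * d[:, None]
    gamma = np.linalg.solve(Zd.T @ Z, Zd.T @ d)   # d-weighted LS of d on Z
    fitted = Z @ gamma
    if np.any(fitted <= 0):
        raise ValueError("working model for E(d | x) gave non-positive fits")
    return d / fitted

def cv(w):
    """Coefficient of variation, a simple summary of weight variability."""
    return np.std(w) / np.mean(w)

# Hypothetical design: d_j is the product of a part explained by x_j
# and a part (1 + u_j) that x_j does not explain.
rng = np.random.default_rng(18)
N = 20_000
x = rng.uniform(0.0, 2.0, size=N)
u = rng.uniform(0.0, 1.0, size=N)
d_pop = (2.0 + 4.0 * x) * (1.0 + u)
s = rng.uniform(size=N) < 1.0 / d_pop             # Poisson sampling, pi = 1/d

Z = np.column_stack([np.ones(s.sum()), x[s]])
d = d_pop[s]
w = standardized_weights(Z, d)
print("CV of d:", cv(d), "CV of w:", cv(w))
```

In this example most of the spread in $d_j$ comes from $x_j$, so dividing by the fitted conditional mean leaves weights that vary much less.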
Extension to GLM
Generalized linear models
$$E_m(y_i \mid x_i) = \mu_i, \qquad V_m(y_i \mid x_i) = v(\mu_i)\tau^2$$
and $g(\mu_i) = x_i'\beta$. The link function $g(\cdot)$ and the variance function $v(\cdot)$ are of known forms.
We may write $\mu_i(\beta) = g^{-1}(x_i'\beta)$ and $v_i(\beta) = v\{g^{-1}(x_i'\beta)\}$.
Under simple random sampling, the optimal estimator $\hat\beta$ can be obtained by minimizing
$$Q(\beta) = \sum_{i=1}^{N} I_i \{y_i - \mu_i(\beta)\}^2 / v_i(\beta).$$
Extension to GLM (Cont’d)
Under an informative sampling design, the optimal estimator $\hat\beta$ can be obtained by minimizing
$$Q^*(\beta) = \sum_{i=1}^{N} I_i d_i \{y_i - \mu_i(\beta)\}^2 / v_i^*(\beta)$$
where
$$v_i^*(\beta) = E_m\left[ d_i \{y_i - \mu_i(\beta)\}^2 \mid x_i \right].$$
A version of the EGLS method can be developed.
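As a rough illustration for one GLM, the sketch below fits a logistic regression by solving the weighted estimating equation $\sum_i I_i w_i \{y_i - \mu_i(\beta)\} x_i = 0$ with Newton's method. This is the estimating-equation form with $u_i(\beta) = e_i x_i$ rather than a direct minimization of $Q^*(\beta)$, and it uses the plain design weights $w_i = d_i$; an EGLS version would replace them by $d_i q_i$ with an estimated $q_i$. The function name `weighted_logit` and the outcome-dependent design are assumptions for the example.

```python
import numpy as np

def weighted_logit(X, y, w, n_iter=25, tol=1e-10):
    """Solve sum_i w_i {y_i - mu_i(beta)} x_i = 0 for a logistic model (Newton)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (w * (y - mu))
        hess = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        step = np.linalg.solve(hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical population with logit P(y = 1 | x) = -1 + x.
rng = np.random.default_rng(20)
N = 50_000
x = rng.normal(size=N)
p = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * x)))
y = (rng.uniform(size=N) < p).astype(float)

# Assumed informative design: inclusion probability depends on y itself.
pi = np.where(y == 1.0, 0.3, 0.1)
s = rng.uniform(size=N) < pi

X = np.column_stack([np.ones(s.sum()), x[s]])
beta_unweighted = weighted_logit(X, y[s], np.ones(s.sum()))
beta_weighted = weighted_logit(X, y[s], 1.0 / pi[s])
print(beta_unweighted, beta_weighted)
```

With this outcome-dependent design the unweighted intercept is shifted upward, while the weighted fit stays near the generating coefficients.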
Widening Class of Weights Further
Idea: Instead of using $w_j = d_j q(x_j)$, consider a wider class of weights $w_j$ such that solving
$$\sum_{j=1}^{N} I_j w_j (y_j - x_j'\beta) x_j = 0$$
leads to a consistent estimator.
Widening Class of Weights Further (Cont’d)
Consistency is unaffected if $E(w_j \mid x_j, y_j, I_j = 1) = E(d_j \mid x_j, y_j, I_j = 1)$.
Variance is minimized if $V(w_j \mid x_j, y_j, I_j = 1) = 0$.
Achieved by setting
$$w_j = E(d_j \mid x_j, y_j, I_j = 1)\, q(x_j) = \tilde d_j\, q(x_j).$$
Closely related to weight smoothing by Beaumont (2008).
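The smoothed weight $\tilde d_j = E(d_j \mid x_j, y_j, I_j = 1)$ is a conditional mean within the sample, so one simple way to approximate it is to regress the design weights on $(x_j, y_j)$ using the sampled units only. The sketch below assumes a linear working model fitted by unweighted least squares; this choice, the function name `smoothed_weights`, and the simulated sample are all hypothetical. A linear fit does not guarantee positive smoothed weights in general.

```python
import numpy as np

def smoothed_weights(x, y, d):
    """Approximate E(d | x, y, I = 1) by an unweighted LS fit within the sample."""
    A = np.column_stack([np.ones_like(d), x, y])   # assumed linear working model
    coef = np.linalg.lstsq(A, d, rcond=None)[0]
    return A @ coef

# Hypothetical sample: d_j depends on (x_j, y_j) plus noise unrelated to them.
rng = np.random.default_rng(2008)
n = 2_000
x = rng.uniform(0.0, 2.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
d = 5.0 + 1.5 * x + 0.5 * y + rng.gamma(shape=2.0, scale=1.0, size=n)

d_tilde = smoothed_weights(x, y, d)
print("variance of d:", np.var(d), "variance of smoothed d:", np.var(d_tilde))
```

The fitted values keep the same sample mean as the original weights but drop the part of their variation that is unrelated to $(x_j, y_j)$.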
Widening Class of Weights Further (Cont’d)
Optimal $q^*$ in the class of weights of the form $w_j = \tilde d_j\, q(x_j)$:
$$q_j^* \propto E_m(e_j^2 \mid x_j) / E_m(\tilde d_j e_j^2 \mid x_j)$$
Need a model for the smoothed weights $\tilde d_j = E(d_j \mid x_j, y_j, I_j = 1)$.
An EM-type algorithm is developed in Kim and Skinner (2013); details are skipped here.
Conclusion
Fitting models to complex survey data
Always test for an informative design.
If the hypothesis of a noninformative design is rejected:
1. Examine the model.
2. Use the HT estimator or a more complex design-consistent estimator.
Variance estimation for cluster and two-stage designs must recognize the clusters.
The end