Weighting in survey analysis under informative sampling

Jae-Kwang Kim¹
Iowa State University
August 22, 2013

¹ Joint work with Chris Skinner at the London School of Economics and Political Science

Reference

Kim, J.K. and Skinner, C.J. (2013). "Weighting in survey analysis under informative sampling," Biometrika 100, 385-398.

Kim (ISU), Informative sampling, August 22, 2013

Parameters

Two types of parameters
- Descriptive parameter: "How many people in the United States were unemployed on August 15, 2013?"
- Analytic parameter: "If personal income (in the United States) increases by 2%, how much will the consumption of beef increase?"

Basic approach to estimating analytic parameters
1. Specify a model that describes the relationship among the variables (often called the superpopulation model).
2. Estimate the parameters in the model using the realized sample.

Basic Setup

- $U = \{1, \cdots, N\}$: index set of the finite population (of size $N$)
- $F_N = \{(x_i, y_i);\ i \in U\}$: realized values in the finite population
- May assume that $F_N$ is a realization from a model with density $f(y \mid x; \theta)$.
- $I_j = 1$ if element $j$ is sampled and $I_j = 0$ otherwise.
- We are interested in estimating $\theta$ from the sample.

A Regression Model

The finite population is a realization from the model

$$y_i = x_i \beta + e_i \qquad (1)$$

where the $e_i$ are independent $(0, \sigma^2)$ random variables, independent of $x_j$ for all $i$ and $j$.
- We are interested in estimating $\beta$ from the sample.
- First-order inclusion probabilities $\pi_i$ are available.

Estimation of Regression Coefficients

OLS estimator

$$\hat\beta_{ols} = \left( \sum_{i=1}^{N} I_i x_i' x_i \right)^{-1} \sum_{i=1}^{N} I_i x_i' y_i$$

Probability weighted estimator

$$\hat\beta_{\pi} = \left( \sum_{i=1}^{N} I_i d_i x_i' x_i \right)^{-1} \sum_{i=1}^{N} I_i d_i x_i' y_i$$

where $d_i = 1/\pi_i$.
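To make the contrast between the two estimators concrete, here is a minimal sketch in Python with NumPy. Everything in the simulation is an illustrative assumption rather than part of the talk: the population size, the true coefficients (1, 2), the Poisson sampling scheme, and the logistic form that makes $\pi_i$ depend on the error $e_i$. The helper names `ols_estimator`, `weighted_estimator`, and `simulate_sample` are hypothetical.

```python
import numpy as np

def ols_estimator(X, y):
    """Unweighted OLS using the sampled rows only."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def weighted_estimator(X, y, d):
    """Probability-weighted estimator with design weights d_i = 1 / pi_i."""
    Xd = X * d[:, None]                      # each row x_i scaled by d_i
    return np.linalg.solve(Xd.T @ X, Xd.T @ y)

def simulate_sample(N=100_000, beta=(1.0, 2.0), seed=0):
    """Hypothetical population; Poisson sampling with pi_i depending on e_i."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=N)
    e = rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])     # intercept and one covariate
    y = X @ np.asarray(beta) + e
    # Assumed inclusion probabilities in (0.02, 0.18), increasing in e_i,
    # so units with large errors are over-represented (informative design).
    pi = 0.02 + 0.16 / (1.0 + np.exp(-e))
    sampled = rng.random(N) < pi             # indicators I_i
    return X[sampled], y[sampled], 1.0 / pi[sampled]

X_s, y_s, d_s = simulate_sample()
beta_ols = ols_estimator(X_s, y_s)
beta_pi = weighted_estimator(X_s, y_s, d_s)
print("OLS:     ", beta_ols)
print("weighted:", beta_pi)
```

Under this assumed design the sampled errors have a positive mean, so the unweighted intercept drifts upward while the weighted estimator stays close to the true value. The slope is unaffected here because selection does not depend on $x_i$.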
Informative Sampling

- A non-informative sampling design (with respect to the superpopulation model) satisfies

$$P(y_i \in B \mid x_i, I_i = 1) = P(y_i \in B \mid x_i) \qquad (2)$$

  for any measurable set $B$. The left side is the sample model and the right side is the population model.
- Informative sampling design: equality (2) does not hold.
- Non-informative sampling for regression implies

$$E(x_i' e_i \mid I_i = 1) = 0. \qquad (3)$$

- If condition (3) is satisfied, $\hat\beta_{ols}$ is unbiased.

Hypothesis Testing: DuMouchel and Duncan (1983), Fuller (1984)

Thus, one may want to test (3), or test directly

$$H_0: E\{\hat\beta_{ols}\} = E\{\hat\beta_{\pi}\} \qquad (4)$$

1. From the sample, fit a regression of $y_i$ on $(x_i, z_i)$, where $z_i = \pi_i^{-1} x_i$.
2. Perform a test for $\gamma = 0$ under the expanded model
   $$y = X\beta + Z\gamma + a,$$
   where $a$ is the error term satisfying $E(a \mid X, Z) = 0$.

Remarks on Testing

- When performing the hypothesis test, a design-consistent variance estimator is preferable.
- Rejecting the null hypothesis means that we cannot directly use the OLS estimator under the current model. Two options:
  - Include more $x$'s until the sampling design is non-informative under the expanded model.
  - Use the probability weighted estimator or another consistent estimator.

Modifying weights to retain benefits of weights, while mitigating disadvantages

Pros and Cons of weighting

Pros
- to avoid bias from informative sampling, when the inclusion probabilities $\pi_j$ are unequal (note: other approaches can also do this, e.g.
sample likelihood, Pfeffermann, 2011)
- to protect against model misspecification
- to make efficient use of population-level information

Cons
- variance inflation from unequal inclusion probabilities

Weighted Estimators Under Informative Sampling

Magee (1998): Any estimator obtained as the solution to

$$\sum_{i=1}^{N} I_i d_i \left( y_i - x_i'\beta \right) x_i\, q(x_i) = 0$$

is consistent for $\beta$, where $q(x_i)$ is a function of $x_i$.

Estimators Under Informative Sampling

Proof of Magee's result: Let

$$U_q(\beta) = \sum_{i=1}^{N} I_i d_i (y_i - x_i'\beta) x_i q_i.$$

Note that

$$E\{U_q(\beta) \mid F_N\} = \sum_{i=1}^{N} (y_i - x_i'\beta) x_i q_i.$$

Under the regression model, the model expectation of the above term is zero as long as $q_i$ is a function of $x_i$ only.

Optimization Problem

- Idea: Find a function $q(\cdot)$ which minimizes $V(\hat\beta_q)$ with respect to the joint distribution of the sampling mechanism and the model.
- Class: $\hat\beta_q$ solves

$$\sum_{j=1}^{N} I_j d_j q_j u_j(\beta) = 0$$

  where $u_j(\beta)$ satisfies $E_m\{u_j(\beta)\} = 0$.
- Objective function: the design-model variance

$$V(\hat\beta_q) = J(\beta)^{-1}\, V\Big\{ \sum_{j=1}^{N} I_j d_j q_j u_j(\beta) \Big\}\, J(\beta)^{-1}$$

  where

$$J(\beta) = E\Big\{ \sum_{j=1}^{N} I_j d_j q_j \frac{\partial u_j(\beta)}{\partial \beta} \Big\}.$$

Approximations / Assumptions

- Observations for different units are approximately independent.
- Generalized linear model, so that $u_j(\beta) = e_j x_j$.

$$V(\hat\beta_q) \cong \Big\{ \sum_{j=1}^{N} q_j E_m(e_j^2) x_j x_j' \Big\}^{-1} \sum_{j=1}^{N} q_j^2 E_m(d_j e_j^2) x_j x_j' \Big\{ \sum_{j=1}^{N} q_j E_m(e_j^2) x_j x_j' \Big\}^{-1}$$

(Approximately) Optimal Solution

$$q_j^* \propto E_m(e_j^2 \mid x_j) / E_m(d_j e_j^2 \mid x_j)$$

- Requires fitting a model to $E_m(d_j e_j^2 \mid x_j)$.
- Equivalent to Fuller (2009, Sect. 6.3.2) for the linear regression model.
- Different from Pfeffermann and Sverchkov (1999, 2003):

$$q_j \propto 1 / E_m(d_j \mid I_j = 1, x_j)$$

Fuller's estimator: Under $E_m(e_j^2 \mid x_j) = \sigma^2$

(Estimated) GLS estimator: Minimize

$$Q(\beta) = \sum_{i=1}^{N} I_i d_i (y_i - x_i\beta)^2 / v_i^2 \qquad (5)$$

where $v_i^2 = E\{ d_i (y_i - x_i\beta)^2 \mid x_i \}$.

1. Obtain $\hat\beta_\pi$ and compute $\hat e_i = y_i - x_i \hat\beta_\pi$.
2. Fit a (nonlinear) regression model of $a_i^2 = d_i \hat e_i^2$ on $x_i$,
   $$a_i^2 = q_a(x_i; \gamma_a) + r_{ai},$$
   to get $\hat v_i^2 = q_a(x_i; \hat\gamma_a)$, and insert $\hat v_i^2$ in (5).

For variance estimation, the variability of $\hat\gamma_a$ can be safely ignored.

Practical Solution

- As a first approximation, we may suppose that $d_j$ and $e_j^2$ are uncorrelated and set $q_j = 1/E_m(d_j \mid x_j)$.
- Thus, use $w_j = d_j q_j = d_j / E_m(d_j \mid x_j)$.
- If $d_j$ is well explained by $x_j$, then the variability of $w_j$ will be reduced.
- The result is a design weight standardized for its dependence on $x_j$.

Extension to GLM

- Generalized linear models:

$$E_m(y_i \mid x_i) = \mu_i, \qquad V_m(y_i \mid x_i) = v(\mu_i)\tau^2, \qquad g(\mu_i) = x_i'\beta.$$

  The link function $g(\cdot)$ and the variance function $v(\cdot)$ are of known forms.
- May write $\mu_i(\beta) = g^{-1}(x_i'\beta)$ and $v_i(\beta) = v\{g^{-1}(x_i'\beta)\}$.
- Under simple random sampling, the optimal estimator $\hat\beta$ can be obtained by minimizing

$$Q(\beta) = \sum_{i=1}^{N} I_i \{y_i - \mu_i(\beta)\}^2 / v_i(\beta).$$

Extension to GLM (Cont'd)

- Under an informative sampling design, the optimal estimator $\hat\beta$ can be obtained by minimizing

$$Q^*(\beta) = \sum_{i=1}^{N} I_i d_i \{y_i - \mu_i(\beta)\}^2 / v_i^*(\beta)$$

  where $v_i^*(\beta) = E_m\left[ d_i \{y_i - \mu_i(\beta)\}^2 \mid x_i \right]$.
- A version of the EGLS method can be developed.

Widening Class of Weights Further

Idea: Instead of using $w_j = d_j q(x_j)$, consider a wider class of estimators with weights $w_j$ such that

$$\sum_{j=1}^{N} I_j w_j (y_j - x_j'\beta) x_j = 0$$

leads to a consistent estimator.

Widening Class of Weights Further (Cont'd)

- Consistency is unaffected if
  $$E(w_j \mid x_j, y_j, I_j = 1) = E(d_j \mid x_j, y_j, I_j = 1).$$
- Variance is minimized if $V(w_j \mid x_j, y_j, I_j = 1) = 0$.
- Both are achieved by setting
  $$w_j = E(d_j \mid x_j, y_j, I_j = 1)\, q(x_j) = \tilde d_j\, q(x_j).$$
- Closely related to weight smoothing by Beaumont (2008).

Widening Class of Weights Further (Cont'd)

- Optimal $q^*$ in the class of weights of the form $w_j = \tilde d_j q(x_j)$:

$$q_j^* \propto E_m(e_j^2 \mid x_j) / E_m(\tilde d_j e_j^2 \mid x_j)$$

- Need a model for the smoothed weights $\tilde d_j = E(d_j \mid x_j, y_j, I_j = 1)$.
- An EM-type algorithm is developed in Kim and Skinner (2013). Details skipped.

Conclusion

Fitting models to complex survey data
- Always test for an informative design.
- If the hypothesis of a noninformative design is rejected:
  - Examine the model.
  - Use the HT estimator or a more complex design-consistent estimator.
- Variance estimation for clusters and two-stage designs must recognize clusters.
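The advice to always test for an informative design can be illustrated with the expanded-regression test from the hypothesis testing slide: regress $y$ on $(x, z)$ with $z_i = d_i x_i$ and test $\gamma = 0$. The sketch below is a simplified illustration, not the procedure from the talk or the paper. It assumes independent (Poisson) sampling and uses a heteroskedasticity-robust sandwich covariance as a stand-in for a design-consistent variance estimator. The function name `informativeness_wald`, the simulated population, and the logistic inclusion probabilities are all hypothetical.

```python
import numpy as np

def informativeness_wald(X, y, d):
    """Wald statistic for gamma = 0 in y = X beta + Z gamma + a, with z_i = d_i x_i."""
    n, p = X.shape
    Z = X * d[:, None]                       # z_i = x_i / pi_i
    W = np.hstack([X, Z])                    # expanded design matrix
    bread = np.linalg.inv(W.T @ W)
    coef = bread @ (W.T @ y)
    resid = y - W @ coef
    meat = (W * resid[:, None] ** 2).T @ W   # sum_i r_i^2 w_i' w_i
    cov = bread @ meat @ bread               # sandwich covariance (assumed stand-in)
    gamma = coef[p:]
    wald = float(gamma @ np.linalg.solve(cov[p:, p:], gamma))
    return wald, p                           # compare with chi-square(p) quantile

def incl_prob(v):
    """Assumed inclusion probabilities in (0.02, 0.18)."""
    return 0.02 + 0.16 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
N = 50_000
x = rng.normal(size=N)
e = rng.normal(size=N)
u = rng.normal(size=N)                       # unrelated to y given x
X = np.column_stack([np.ones(N), x])
y = X @ np.array([1.0, 2.0]) + e

pi_inf = incl_prob(e)                        # informative: depends on the error
pi_non = incl_prob(u)                        # non-informative: depends on u only
s_inf = rng.random(N) < pi_inf
s_non = rng.random(N) < pi_non

wald_inf, df = informativeness_wald(X[s_inf], y[s_inf], 1.0 / pi_inf[s_inf])
wald_non, _ = informativeness_wald(X[s_non], y[s_non], 1.0 / pi_non[s_non])

CRIT_95_DF2 = 5.991                          # 0.95 quantile of chi-square, 2 df
print("informative design:     Wald =", wald_inf, "reject:", wald_inf > CRIT_95_DF2)
print("non-informative design: Wald =", wald_non, "reject:", wald_non > CRIT_95_DF2)
```

With clustered or two-stage designs the sandwich covariance used here would need to be replaced by a variance estimator that recognizes the clusters, as the conclusion notes.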