Introduction Fully Efficient Replication Weights Sparse and Efficient Replication Weights Balanced Sampling Simulation Studies Concluding

Sparse and Efficient Replication Variance Estimation for Complex Surveys

Jae Kwang Kim (joint work with Changbao Wu at University of Waterloo)
Iowa State University
April 4th, 2012
Westat

1 Introduction
2 Fully Efficient Replication Weights
3 Sparse and Efficient Replication Weights
4 Balanced Sampling
5 Simulation Studies
6 Concluding Remarks


1 Introduction

Variance Estimation

Construct confidence intervals
Conduct hypothesis tests
Provide descriptive measures of the accuracy of estimates
Measure the efficiency of the sampling design

Traditional Approaches

Inclusion probabilities: $\pi_i = P(i \in S)$ and $\pi_{ij} = P(i, j \in S)$
Basic design weights: $w_i = 1/\pi_i$
The Horvitz-Thompson estimator of $Y = \sum_{i=1}^{N} y_i$:
\[ \hat{Y} = \sum_{i \in S} w_i y_i \qquad (1) \]
Variance estimation:
\[ \hat{V} = \sum_{i \in S} \sum_{j \in S} \Omega_{ij} y_i y_j, \qquad (2) \]
where $\Omega_{ij} = (\pi_{ij} - \pi_i \pi_j)/(\pi_{ij} \pi_i \pi_j)$.
Nonlinear statistics: linearization (first-order Taylor series approximation)

Replication Methods

Use jackknife, bootstrap, or BRR
Enlarge the data set with additional columns of replication weights
Work the same way for linear and nonlinear statistics
Preserve confidentiality (by not releasing certain design information)

Issues with Replication Weights

Validity
– Asymptotic unbiasedness
– The jackknife method is valid only for restricted scenarios
– The bootstrap method needs to be modified to fit special designs
– No replication method is valid for arbitrary sampling designs
Efficiency
– Small variance
– Replication variance estimators are often approximations to the fully efficient $\hat{V}$
– Efficiency often depends on sample sizes, the number of replication weights, and design features (such as sampling fractions)
Sparsity
– Want to reduce the number of replicates
– The number of jackknife replication weights depends on the sample size
– Bootstrap methods typically require at least 1000 sets of weights (StatCan uses 500!)

An Algebraic Construction of Replication Weights

Two research problems in this talk:
1. How to construct a valid replication variance estimator?
2. How to reduce the replication size without sacrificing efficiency (for some key items)?
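The estimators (1) and (2) on the Traditional Approaches slide can be sketched numerically as follows. This is only an illustration, not code from the talk: the function names, the toy data, and the choice of simple random sampling without replacement (where $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/\{N(N-1)\}$) are assumptions made here so that the result can be compared with the textbook formula $N^2 (1 - n/N) s^2 / n$.

```python
import numpy as np

def ht_estimate(y, pi):
    """Horvitz-Thompson estimator (1): sum of w_i * y_i with w_i = 1 / pi_i."""
    return float(np.sum(y / pi))

def omega_matrix(pi, pi2):
    """Omega_ij = (pi_ij - pi_i pi_j) / (pi_ij pi_i pi_j); pi2[i, i] must equal pi[i]."""
    outer = np.outer(pi, pi)
    return (pi2 - outer) / (pi2 * outer)

def ht_variance(y, pi, pi2):
    """Variance estimator (2) written as the quadratic form y' Delta y."""
    return float(y @ omega_matrix(pi, pi2) @ y)

# Hypothetical toy sample: n = 5 units drawn by SRSWOR from N = 100.
N, n = 100, 5
y = np.array([3.0, 7.0, 1.0, 9.0, 4.0])
pi = np.full(n, n / N)
pi2 = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi2, pi)          # pi_ii = pi_i

y_hat = ht_estimate(y, pi)
v_hat = ht_variance(y, pi, pi2)
v_srs = N**2 * (1 - n / N) * np.var(y, ddof=1) / n   # textbook SRSWOR formula
```

Under these assumptions `v_hat` and `v_srs` agree up to rounding, which is a quick sanity check that the matrix of $\Omega_{ij}$ values has been set up correctly.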


2 Fully Efficient Replication Weights

Basic Settings

Consider an arbitrary sampling design with first and second order inclusion probabilities $\pi_i$ and $\pi_{ij}$ and basic design weights $w_i$.
Develop a variance estimator $v(\hat{Y})$ for $\hat{Y} = \sum_{i \in S} w_i y_i$.
The analytic variance estimator
\[ \hat{V} = \sum_{i \in S} \sum_{j \in S} \Omega_{ij} y_i y_j, \]
where $\Omega_{ij} = (\pi_{ij} - \pi_i \pi_j)/(\pi_{ij} \pi_i \pi_j)$, is viewed as fully efficient.
Write $\hat{V} = y' \Delta y$, where $\Delta = (\Omega_{ij})$ is an $n \times n$ matrix and $y = (y_1, y_2, \cdots, y_n)'$.

An Algebraic Construction of Replication Weights

Replication weights: $w^{(k)} = (w_1^{(k)}, \cdots, w_n^{(k)})'$, $k = 1, \cdots, L$
Replicates: $\hat{Y}^{(k)} = \sum_{i \in S} w_i^{(k)} y_i$
Replication variance estimator
\[ \hat{V}_R = \sum_{k=1}^{L} h_k \left( \hat{Y}^{(k)} - \hat{Y} \right)^2 \qquad (3) \]
for some known constants $h_k$.
Want to construct $w^{(k)}$ so that $\hat{V}_R = \hat{V}$ for any y-variable.

Construction of Weights (Cont'd)

Approach 1: Fay's Idea
Decomposition of $\Delta$:
\[ \Delta = \sum_{k=1}^{p} \lambda_k \delta_k \delta_k' \qquad (4) \]
where $p = \mathrm{rank}(\Delta) \le n$.
$\hat{V} = \hat{V}_R$ for any y-variable if $L = p$ and
\[ w^{(k)} = w + (\lambda_k / h_k)^{1/2} \delta_k, \]
where $w = (w_1, \cdots, w_n)'$ is the set of basic design weights.
The two are algebraically equivalent because
\[ h_k (\hat{Y}^{(k)} - \hat{Y})^2 = y' \{ h_k (w^{(k)} - w)(w^{(k)} - w)' \} y = y' \{ \lambda_k \delta_k \delta_k' \} y. \]

Construction of Weights (Cont'd)

Approach 2: Direct construction for some designs
In many single-stage designs, we can write $\hat{V} = y' \Delta y$ where
\[ \Delta = \Delta_0 - \Delta_0 Z \left( Z' \Delta_0 Z \right)^{-1} Z' \Delta_0 \qquad (5) \]
where $\Delta_0 = \mathrm{diag}\{\lambda_1, \cdots, \lambda_n\}$, $\lambda_i \ge 0$ for all $i = 1, 2, \cdots, n$, and $Z' = (z_1, \cdots, z_n)$ is a $q \times n$ matrix with full row rank.
After some matrix algebra, we get
\[ y' \Delta y = (y - Z \hat{\beta})' \Delta_0 (y - Z \hat{\beta}) = \sum_{k=1}^{n} \lambda_k (y_k - z_k' \hat{\beta})^2 \]
where $\hat{\beta} = (Z' \Delta_0 Z)^{-1} Z' \Delta_0 y$.

Construction of Weights (Cont'd)

Deville's (1999) approximation for variance estimation under fixed-size unequal probability sampling:
\[ \hat{V} = c \sum_{i \in S} (1 - \pi_i) \left( \frac{y_i}{\pi_i} - \tilde{Y} \right)^2 \]
where $c = \left\{ 1 - \sum_{i \in S} b_i^2 \right\}^{-1}$, $b_i = (1 - \pi_i) / \sum_{k \in S} (1 - \pi_k)$, and $\tilde{Y} = \sum_{k \in S} b_k (y_k / \pi_k)$.
Deville's approximation uses $\lambda_i = c \pi_i^{-2} (1 - \pi_i)$ and $z_i = \pi_i$.

Construction of Weights (Cont'd)

Breidt and Chauvet (2011) approximation for balanced sampling with constraint $\sum_{i \in S} x_i / \pi_i = \sum_{i \in U} x_i$:
\[ \hat{V} = \frac{n}{n - q} \sum_{i \in S} (1 - \pi_i) \pi_i^{-2} (y_i - \hat{y}_i)^2 \]
where $\hat{y}_i = x_i' \hat{B}$ and
\[ \hat{B} = \left\{ \sum_{k \in S} (1 - \pi_k) \pi_k^{-2} x_k x_k' \right\}^{-1} \sum_{k \in S} (1 - \pi_k) \pi_k^{-2} x_k y_k. \]
The Breidt and Chauvet (2011) approximation uses $\lambda_i = c \pi_i^{-2} (1 - \pi_i)$ and $z_i = x_i$, where $c = n/(n - q)$.

Jackknife for the designs with $\Delta$ in (5)

1. Recall that we can write $\hat{V} = \sum_{k=1}^{n} \lambda_k (y_k - z_k' \hat{\beta})^2$.
2. Thus, we want to construct $\hat{Y}^{(k)} = \sum_{i=1}^{n} w_i^{(k)} y_i$ such that
\[ h_k \left( \hat{Y}^{(k)} - \hat{Y} \right)^2 = \lambda_k \left( y_k - z_k' \hat{\beta} \right)^2, \]
which is achieved when
\[ \sum_{i=1}^{n} w_i^{(k)} y_i = \sum_{i=1}^{n} w_i y_i - \left( \frac{\lambda_k}{h_k} \right)^{1/2} \left( y_k - z_k' \hat{\beta} \right). \]
3. Jackknife-type replication weights:
\[ w_i^{(k)} = \begin{cases} w_i + (\lambda_k / h_k)^{1/2} \left\{ -1 + z_k' (Z' \Delta_0 Z)^{-1} z_i \lambda_i \right\} & \text{if } i = k \\ w_i + (\lambda_k / h_k)^{1/2} z_k' (Z' \Delta_0 Z)^{-1} z_i \lambda_i & \text{otherwise.} \end{cases} \]

Delete-a-Group Jackknife construction

1. Partition $S$ into $L$ groups: $S = S_1 \cup S_2 \cup \cdots \cup S_L$.
2. We can write
\[ \hat{V} = \sum_{j \in S} \lambda_j \left( y_j - z_j' \hat{\beta} \right)^2 = \sum_{k=1}^{L} \sum_{j \in S_k} \lambda_j \left( y_j - z_j' \hat{\beta} \right)^2. \]
3. Thus, we want to construct $\hat{Y}^{(k)} = \sum_{i=1}^{n} w_i^{(k)} y_i$ such that
\[ h_k \left( \hat{Y}^{(k)} - \hat{Y} \right)^2 = \sum_{j \in S_k} \lambda_j \left( y_j - z_j' \hat{\beta} \right)^2, \]
which is not always possible.
4. Instead, construct $\hat{Y}^{(k)} = \sum_{i=1}^{n} w_i^{(k)} y_i$ such that
\[ h_k \left( \hat{Y}^{(k)} - \hat{Y} \right)^2 = \left\{ \sum_{j \in S_k} \lambda_j^{1/2} (2 \delta_j^{(k)} - 1) \left( y_j - z_j' \hat{\beta} \right) \right\}^2 \]
where the $\delta_j^{(k)}$ are IID from Bernoulli(0.5).

Delete-a-Group Jackknife construction (Cont'd)

Note that
\[ E_* \left\{ \sum_{j \in S_k} \lambda_j^{1/2} (2 \delta_j^{(k)} - 1)(y_j - z_j' \hat{\beta}) \right\}^2 = \sum_{j \in S_k} \lambda_j \left( y_j - z_j' \hat{\beta} \right)^2 \]
where the expectation is with respect to the distribution of $\delta_j^{(k)}$.
Delete-a-group jackknife replication weights:
\[ w_i^{(k)} = \begin{cases} w_i + \left\{ -q_{ik} + \sum_{j \in S_k} q_{jk} z_j' (Z' \Delta_0 Z)^{-1} z_i \lambda_i \right\} & \text{if } i \in S_k \\ w_i + \sum_{j \in S_k} q_{jk} z_j' (Z' \Delta_0 Z)^{-1} z_i \lambda_i & \text{otherwise,} \end{cases} \]
where $q_{jk} = (\lambda_j / h_k)^{1/2} (2 \delta_j^{(k)} - 1)$.
Thus, it is essentially the same as applying BRR within each random group.
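A small numerical check of the two constructions in this section may help. The sketch below builds $\Delta$ from (5) with the Deville-type choice $\lambda_i = c \pi_i^{-2}(1 - \pi_i)$, $z_i = \pi_i$, then forms replicates both from the spectral decomposition (4) and from the jackknife-type weights, and compares each replication variance with $y' \Delta y$. The function names, the simulated $\pi_i$ and $y_i$, and the choice $h_k = 1$ are illustrative assumptions, not part of the talk.

```python
import numpy as np

def delta_from_lambda_z(lam, Z):
    """Delta = D0 - D0 Z (Z' D0 Z)^{-1} Z' D0 as in (5), with D0 = diag(lam)."""
    D0 = np.diag(lam)
    M = np.linalg.inv(Z.T @ D0 @ Z)
    return D0 - D0 @ Z @ M @ Z.T @ D0, M

def fay_weights(w, Delta, h=1.0, tol=1e-10):
    """Rows are w + sqrt(eig_k / h) * delta_k, from the decomposition (4)."""
    eigval, eigvec = np.linalg.eigh(Delta)      # Delta is symmetric
    keep = eigval > tol * eigval.max()          # drop the null space
    return w + (np.sqrt(eigval[keep] / h) * eigvec[:, keep]).T

def jackknife_type_weights(w, lam, Z, M, h=1.0):
    """Row k, column i: w_i + sqrt(lam_k / h) * (z_k' M z_i lam_i - 1{i = k})."""
    n = w.size
    G = (Z @ M @ Z.T) * lam[None, :]
    return w[None, :] + np.sqrt(lam / h)[:, None] * (G - np.eye(n))

def replication_variance(W, w, y, h=1.0):
    """Estimator (3) with a common constant h_k = h."""
    return float(h * np.sum((W @ y - w @ y) ** 2))

# Hypothetical data: n = 8 sampled units with unequal inclusion probabilities.
rng = np.random.default_rng(2012)
n = 8
pi = rng.uniform(0.1, 0.6, size=n)
y = rng.normal(10.0, 3.0, size=n)
w = 1.0 / pi

b = (1 - pi) / np.sum(1 - pi)
c = 1.0 / (1.0 - np.sum(b**2))
lam = c * (1 - pi) / pi**2          # Deville-type lambda_i
Z = pi.reshape(-1, 1)               # z_i = pi_i, so q = 1

Delta, M = delta_from_lambda_z(lam, Z)
v_direct = float(y @ Delta @ y)
v_fay = replication_variance(fay_weights(w, Delta), w, y)
v_jk = replication_variance(jackknife_type_weights(w, lam, Z, M), w, y)
v_deville = float(c * np.sum((1 - pi) * (y / pi - np.sum(b * y / pi)) ** 2))
```

With these inputs all four quantities coincide up to floating-point error. Note that Fay's construction yields $p = n - q$ replicates while the jackknife-type construction yields $n$; neither is guaranteed to produce non-negative replicate weights.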


3 Sparse and Efficient Replication Weights

Efficiency and Sparsity

An initial $L$ sets of replication weights $w^{(k)}$
– Weights from bootstrap, jackknife or BRR
– Weights from the algebraic construction
– $\hat{V}_R$ is (assumed to be) fully efficient
– $L$ is very large
Goal: Reduce $L$ to a much smaller number $L_0$ without major loss of efficiency
Strategy: Random sampling (random grouping) plus weight calibration

Random Sampling of Weights

View $\hat{V}_R = \sum_{k=1}^{L} h_k \left( \hat{Y}^{(k)} - \hat{Y} \right)^2$ as a population total.
Select $w^{(k_j)}$, $j = 1, \cdots, L_0$, from $w^{(k)}$, $k = 1, \cdots, L$, by simple random sampling without replacement.
Use
\[ \hat{V}_R^{(1)} = \sum_{j=1}^{L_0} h_{k_j} (L / L_0) \left( \hat{Y}^{(k_j)} - \hat{Y} \right)^2. \]
$\hat{V}_R^{(1)}$ is still a valid variance estimator: $E_*(\hat{V}_R^{(1)}) = \hat{V}_R$.

Random Sampling of Weights (Cont'd)

Under the algebraic construction of replication weights:
\[ \hat{V}_R = \sum_{k=1}^{L} \lambda_k (\delta_k' y)^2 \]
$\hat{V}_R$ is a population total and $\lambda_k$ can be viewed as a size variable.
Select $w^{(k)}$ with inclusion probability $\eta_k$ and use
\[ \hat{V}_R^{(2)} = \sum_{j=1}^{L_0} \frac{h_{k_j}}{\eta_{k_j}} \left( \hat{Y}^{(k_j)} - \hat{Y} \right)^2. \]
$\hat{V}_R^{(2)}$ is still valid and is more efficient than $\hat{V}_R^{(1)}$.

Random Sampling of Weights (Cont'd)

Random selection of $L_0$ sets of weights ensures validity of the replication variance estimator that uses the smaller number of sets of weights.
Neither $\hat{V}_R^{(1)}$ nor $\hat{V}_R^{(2)}$ is as efficient as $\hat{V}_R$.
Calibration over the selected smaller number of sets of weights can improve efficiency.

Weight Calibration

Step 1
Collect all key variables $z = (z_1, \cdots, z_m)'$ from the survey.
Compute the fully efficient variance-covariance matrix
\[ \hat{V}_1(\hat{Z}) = \sum_{k=1}^{L} h_{k1} \left( \hat{Z}^{(k)} - \hat{Z} \right) \left( \hat{Z}^{(k)} - \hat{Z} \right)' \]
using the initial large number $L$ of sets of replication weights, where $\hat{Z}^{(k)} = \sum_{i \in S} w_i^{(k)} z_i$.

Weight Calibration (Cont'd)

Step 2
Construct an initial set of $L_0$ replication weights such that the resulting variance estimator is also asymptotically unbiased.
Let $w_0^{(k)} = (w_{10}^{(k)}, \cdots, w_{n0}^{(k)})'$, $k = 1, \cdots, L_0$.
Less efficient variance estimator for $\hat{Z}$:
\[ \hat{V}_0(\hat{Z}) = \sum_{k=1}^{L_0} h_{k0} (\hat{Z}_0^{(k)} - \hat{Z})(\hat{Z}_0^{(k)} - \hat{Z})' \]
where $\hat{Z}_0^{(k)} = \sum_{i \in S} w_{i0}^{(k)} z_i$.

Weight Calibration (Cont'd)

Step 3
Find $\hat{T}^{(k)} = \sum_{i \in S} w_{i0}^{*(k)} z_i$ such that
\[ \sum_{k=1}^{L_0} h_{k0} (\hat{T}^{(k)} - \hat{Z})(\hat{T}^{(k)} - \hat{Z})' = \hat{V}_1(\hat{Z}). \qquad (6) \]
To find $\hat{T}^{(k)}$ satisfying (6), we use the following steps:
1. Decompose $\hat{V}_1(\hat{Z})$ into $\hat{V}_1(\hat{Z}) = \sum_{k=1}^{p_0} \alpha_k q_k q_k'$ for some $\alpha_k > 0$ and $m$-dimensional vectors $q_k$, $k = 1, \cdots, p_0$. ($p_0 < m$)
2. Construct
\[ \hat{T}^{(k)} = \begin{cases} \hat{Z} + (\alpha_k / h_{k0})^{1/2} q_k, & \text{if } k = 1, \cdots, p_0 \\ \hat{Z}, & \text{if } k = p_0 + 1, \cdots, L_0. \end{cases} \]

Weight Calibration (Cont'd)

Step 4
Find $w_0^{*(k)} = (w_{10}^{*(k)}, \cdots, w_{n0}^{*(k)})'$, $k = 1, \cdots, L_0$, that minimizes
\[ D\left( w_0^{(k)}, w_0^{*(k)} \right) = \sum_{i \in S} \tau_i \left( w_{i0}^{*(k)} - w_{i0}^{(k)} \right)^2 / w_{i0}^{(k)} \]
subject to
\[ \sum_{i \in S} w_{i0}^{*(k)} z_i = \hat{T}^{(k)}. \]
The resulting replicate $\hat{Y}_0^{*(k)} = \sum_{i \in S} w_{i0}^{*(k)} y_i$ can take the form
\[ \hat{Y}_0^{*(k)} = \hat{Y}_0^{(k)} + \left( \hat{T}^{(k)} - \hat{Z}_0^{(k)} \right)' \hat{\beta}^{(k)} \]
where
\[ \hat{\beta}^{(k)} = \left\{ \sum_{i \in S} w_{i0}^{(k)} z_i z_i' / \tau_i \right\}^{-1} \sum_{i \in S} w_{i0}^{(k)} z_i y_i / \tau_i. \]

Weight Calibration (Cont'd)

Step 4 (Cont'd)
The choice of $\tau_i = w_i^{-1}$ is proposed (to make $\mathrm{Cov}(\hat{e}, \hat{Z}) = 0$).
In this case,
\[ \hat{V}_0^* = \sum_{k=1}^{L_0} h_{k0} \left( \hat{Y}_0^{*(k)} - \hat{Y} \right)^2 \doteq \sum_{k=1}^{L_0} h_{k0} \left\{ \hat{e}_0^{(k)} - \hat{e} + \left( \hat{T}^{(k)} - \hat{Z} \right)' \hat{\beta} \right\}^2 \doteq \hat{V}_0(\hat{e}) + \hat{\beta}' \hat{V}_1(\hat{Z}) \hat{\beta}. \qquad (7) \]
The first component is not fully efficient but the second component is fully efficient.
In practice, negative replication weights need to be avoided. This requires changing the objective function $D(w_0^{(k)}, w_0^{*(k)})$ in the calibration weighting for replication.

Properties of the Calibrated Weights

$\hat{V}_0(\hat{Y})$: Less efficient variance estimator based on $L_0$ replicates.
$\hat{V}_1(\hat{Y})$: Fully efficient variance estimator based on $L$ replicates. ($L \gg L_0$)
$\hat{V}_0^*(\hat{Y})$: Proposed calibration variance estimator based on $L_0$ replicates.
Results
1. $V\{\hat{V}_1(\hat{Y})\} \le V\{\hat{V}_0^*(\hat{Y})\} \le V\{\hat{V}_0(\hat{Y})\}$.
2. If $\hat{Y} = \sum_{i=1}^{m} a_i \hat{Z}_i$, then $V\{\hat{V}_0^*(\hat{Y})\} \doteq V\{\hat{V}_1(\hat{Y})\}$.
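Steps 1 to 4 of the weight calibration can be sketched as below. Everything specific in the example is an assumption made for illustration: the full set of replicates is a delete-one jackknife for an equal-probability sample (ignoring the finite population correction), the $L_0$ initial replicates are a simple random subsample with $h_{k0} = h_k L / L_0$, the eigendecomposition is one possible way to obtain $\alpha_k$ and $q_k$, and Step 4 uses the closed-form Lagrangian solution of the chi-square-type distance. The function names are hypothetical.

```python
import numpy as np

def calibration_targets(V1, z_hat, h0, tol=1e-10):
    """Step 3: T^(k) = Z_hat + sqrt(alpha_k / h_k0) q_k for k <= p0, else Z_hat."""
    alpha, Q = np.linalg.eigh(V1)
    keep = alpha > tol * alpha.max()
    alpha, Q = alpha[keep], Q[:, keep]
    p0, L0 = alpha.size, h0.size
    if p0 > L0:
        raise ValueError("need at least rank(V1) replicates")
    T = np.tile(z_hat, (L0, 1))
    T[:p0] += (np.sqrt(alpha / h0[:p0]) * Q).T
    return T

def calibrate_replicate(w0, Zv, target, tau):
    """Step 4: minimize sum tau_i (w* - w0)^2 / w0 subject to sum w*_i z_i = target."""
    a = w0 / tau
    A = (Zv * a[:, None]).T @ Zv                  # sum_i w0_i z_i z_i' / tau_i
    gamma = np.linalg.solve(A, target - w0 @ Zv)
    return w0 + a * (Zv @ gamma)

# Hypothetical setting: n = 12 units, equal weights, m = 2 key variables.
rng = np.random.default_rng(42)
N, n, L0 = 120, 12, 5
w = np.full(n, N / n)
Zv = rng.normal(size=(n, 2)) + np.array([5.0, 2.0])
z_hat = w @ Zv

# Step 1: fully efficient V1 from L = n delete-one jackknife replicates.
W = w * n / (n - 1) * (1 - np.eye(n))
h = (n - 1) / n
D = W @ Zv - z_hat
V1 = h * D.T @ D

# Step 2: random subsample of L0 replicates, rescaled by L / L0.
idx = rng.choice(n, size=L0, replace=False)
W0 = W[idx]
h0 = np.full(L0, h * n / L0)

# Steps 3 and 4: targets, then calibrated replicate weights with tau_i = 1 / w_i.
T = calibration_targets(V1, z_hat, h0)
tau = 1.0 / w
W_star = np.array([calibrate_replicate(W0[k], Zv, T[k], tau) for k in range(L0)])

D_star = W_star @ Zv - z_hat
V_star = (D_star.T * h0) @ D_star     # left-hand side of (6)
```

In this sketch `V_star` reproduces `V1` with only `L0 = 5` replicates, which is the content of (6); consequently any linear combination $\sum_i a_i \hat{Z}_i$ of the key variables gets the fully efficient variance estimate. A deleted unit has $w_{i0}^{(k)} = 0$, for which the distance is undefined, but the closed form simply keeps that weight at zero. The sketch does not guard against negative calibrated weights.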


4 Balanced Sampling

Balanced sampling

Sampling design satisfying
\[ \sum_{i \in S} \frac{1}{\pi_i} x_i = \sum_{i \in U} x_i. \qquad (8) \]
We are interested in creating a replication variance method for $\hat{Y}_{HT} = \sum_{i \in S} \pi_i^{-1} y_i$.
Write $\hat{Y}_{HT} = \sum_{i \in S} \pi_i^{-1} y_i$ as
\[ \hat{Y}_{reg} = \hat{Y}_{HT} + \left( X - \hat{X}_{HT} \right)' \hat{B} \]
for some $\hat{B}$. Note that $\hat{Y}_{reg} = \hat{Y}_{HT}$, regardless of $\hat{B}$, under (8).

Balanced sampling

The choice of
\[ \hat{B} = \left( \sum_{i \in S} \pi_i^{-2} (1 - \pi_i) x_i x_i' \right)^{-1} \sum_{i \in S} \pi_i^{-2} (1 - \pi_i) x_i y_i \]
leads to
\[ V\left( \hat{Y}_{reg} \right) \doteq E\left\{ \sum_{i \in S} \pi_i^{-2} (1 - \pi_i) (y_i - \hat{y}_i)^2 \right\} \]
where $\hat{y}_i = x_i' \hat{B}$.
Breidt and Chauvet (2011) approximation:
\[ \hat{V} = \frac{n}{n - q} \sum_{i \in S} (1 - \pi_i) \pi_i^{-2} (y_i - \hat{y}_i)^2. \]

Balanced sampling

Jackknife variance estimator
\[ \hat{V}_{JK} = \sum_{k=1}^{n} h_k \left( \hat{Y}_{HT}^{*(k)} - \hat{Y}_{HT} \right)^2, \qquad (9) \]
where $\hat{Y}_{HT}^{*(k)} = \hat{Y}_{HT}^{(k)} + (\hat{X}_{HT} - \hat{X}_{HT}^{(k)})' \hat{B}$, $(\hat{X}_{HT}^{(k)\prime}, \hat{Y}_{HT}^{(k)}) = \sum_{i \in S^{(k)}} \pi_i^{-1} (x_i', y_i)$, $h_k = (1 - \pi_k) n / (n - q)$, and $S^{(k)} = S \cap \{k\}^c$.
Note that
\[ \hat{Y}_{HT}^{*(k)} - \hat{Y}_{HT} = \hat{Y}_{HT}^{(k)} - \hat{Y}_{HT} + \left( \hat{X}_{HT} - \hat{X}_{HT}^{(k)} \right)' \hat{B} = -\pi_k^{-1} \left( y_k - x_k' \hat{B} \right). \]
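The identity above implies that the jackknife estimator (9) reproduces the Breidt and Chauvet (2011) approximation term by term. The following sketch checks this numerically. The data are simulated and are not drawn by a balanced design, so only the algebraic identity is being verified; the function names are hypothetical, and $\hat{B}$ is held fixed across replicates as in (9).

```python
import numpy as np

def breidt_chauvet_variance(y, X, pi):
    """n/(n-q) * sum (1 - pi_i) pi_i^{-2} (y_i - x_i' B)^2, with B as on the slide."""
    n, q = X.shape
    c = (1 - pi) / pi**2
    Xc = X * c[:, None]
    B = np.linalg.solve(Xc.T @ X, Xc.T @ y)
    resid = y - X @ B
    return float(n / (n - q) * np.sum(c * resid**2)), B

def balanced_jackknife_variance(y, X, pi):
    """Estimator (9): delete unit k, adjust by (X_HT - X_HT^(k))' B, weight by h_k."""
    n, q = X.shape
    _, B = breidt_chauvet_variance(y, X, pi)
    w = 1.0 / pi
    y_ht, x_ht = w @ y, w @ X
    total = 0.0
    for k in range(n):
        wk = w.copy()
        wk[k] = 0.0                                # S^(k) = S without unit k
        y_star = wk @ y + (x_ht - wk @ X) @ B
        h_k = (1 - pi[k]) * n / (n - q)
        total += h_k * (y_star - y_ht) ** 2
    return float(total)

# Hypothetical sample: n = 10 units, q = 2 balancing variables.
rng = np.random.default_rng(7)
n = 10
pi = rng.uniform(0.05, 0.5, size=n)
X = np.column_stack([pi, rng.normal(3.0, 1.0, size=n)])
y = 2.0 + X @ np.array([4.0, 1.5]) + rng.normal(size=n)

v_bc, B_hat = breidt_chauvet_variance(y, X, pi)
v_jk = balanced_jackknife_variance(y, X, pi)
```

Here `v_jk` and `v_bc` agree up to rounding, since each jackknife deviation equals $-\pi_k^{-1}(y_k - x_k' \hat{B})$.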


5 Simulation Studies

Settings

Family Expenditure Survey; N = 2246; treated as a population
$x_1$: # children; $x_2$: # youth; $x_3$: # people; $x_4$: annual income; $y_1$: total annual expenditure; $y_2$: expenditure on clothing; $y_3$: expenditure on furnishings and equipment
Rao-Sampford PPS sampling method; known $\pi_i$ and $\pi_{ij}$
$\theta_1 = Y_1 = \sum_{i=1}^{N} y_{i1}$ and $\theta_2 = \bar{Y}_2 / \bar{Y}_3 = \left( N^{-1} \sum_{i=1}^{N} y_{i2} \right) / \left( N^{-1} \sum_{i=1}^{N} y_{i3} \right)$
$z_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}, y_{i2}, y_{i3})'$
Relative bias and relative efficiency, based on B = 2000 runs:
\[ RB = \frac{1}{B} \sum_{b=1}^{B} \frac{\hat{V}^{(b)} - V}{V} \quad \text{and} \quad RE = \frac{\{MSE(\hat{V}_L)\}^{1/2}}{\{MSE(\hat{V})\}^{1/2}} \]

Relative Bias (%) of Replication Variance Estimators for θ = Y1

L0    n     V̂L      V̂R(1)   V̂R(2)   V̂R(3)   V̂R(4)   V̂R(C)
25    50    −7.39   −7.74   −7.58   −6.48   −6.00    0.92
25    100    2.49    1.39   −0.63    1.80    2.95    2.13
25    150   −1.16   −1.09   −1.65   −1.96    2.96    0.72
50    100    1.73   −1.39   −2.02    0.58    2.91   −0.07
50    150   −1.16    1.62    0.04   −1.93   −0.74   −1.63

Relative Bias (%) of Replication Variance Estimators for θ = Ȳ2/Ȳ3

L0    n     V̂L      V̂R(1)   V̂R(2)   V̂R(3)   V̂R(4)   V̂R(C)
25    50    −4.01   −3.45   −1.74   −5.37   −5.56   −2.20
25    100   −1.06   −0.39    3.37    0.39   −0.96    0.00
25    150   −0.12   −3.08    0.83   −7.16   −5.59    0.48
50    100   −1.18   −0.83    0.60   −2.12   −1.02   −1.31
50    150   −0.12    0.19    1.79    1.53    0.31    1.03
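The two performance measures defined on the Settings slide can be computed from Monte Carlo output as in the sketch below. The function names and the toy numbers are assumptions for illustration only; they are not taken from the simulation reported in the tables, and the tables report RB multiplied by 100.

```python
import numpy as np

def relative_bias(v_hats, v_true):
    """RB = (1/B) * sum_b (V_hat^(b) - V) / V."""
    v_hats = np.asarray(v_hats, dtype=float)
    return float(np.mean((v_hats - v_true) / v_true))

def relative_efficiency(v_ref, v_hats, v_true):
    """RE = sqrt(MSE(V_hat_L)) / sqrt(MSE(V_hat)); above 1 favors V_hat over the reference."""
    v_ref = np.asarray(v_ref, dtype=float)
    v_hats = np.asarray(v_hats, dtype=float)
    mse_ref = np.mean((v_ref - v_true) ** 2)
    mse = np.mean((v_hats - v_true) ** 2)
    return float(np.sqrt(mse_ref / mse))

# Toy Monte Carlo output with B = 4 runs and true variance V = 10.
v_true = 10.0
v_hats = [9.0, 11.0, 10.0, 12.0]     # estimator under study
v_ref = [10.0, 10.0, 11.0, 9.0]      # reference estimator V_hat_L

rb = relative_bias(v_hats, v_true)
re = relative_efficiency(v_ref, v_hats, v_true)
```

For these toy inputs the relative bias is 0.05 (5%) and the relative efficiency is about 0.577, meaning the estimator under study has a larger root MSE than the reference.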

Relative Efficiency of Replication Variance Estimators for θ = Y1

L0    n     V̂L      V̂R(1)   V̂R(2)   V̂R(3)   V̂R(4)   V̂R(C)
25    50    1.00    0.84    1.05    0.89    0.93    1.34
25    100   1.00    0.55    0.80    0.92    0.93    1.34
25    150   1.00    0.47    0.70    0.83    0.76    1.00
50    100   1.00    0.83    1.00    0.97    0.98    1.61
50    150   1.00    0.57    0.72    0.95    0.96    1.28

Relative Efficiency of Replication Variance Estimators for θ = Ȳ2/Ȳ3

L0    n     V̂L      V̂R(1)   V̂R(2)   V̂R(3)   V̂R(4)   V̂R(C)
25    50    1.00    0.78    0.80    0.69    0.74    0.98
25    100   1.00    0.54    0.57    0.35    0.49    1.00
25    150   1.00    0.50    0.52    0.42    0.55    0.99
50    100   1.00    0.74    0.77    0.60    0.64    0.99
50    150   1.00    0.60    0.61    0.35    0.50    0.99


6 Concluding Remarks

Remarks

Construction of sparse and efficient replication weights is a task at the data file preparation stage.
Issues with computational implementation need to be dealt with by those who produce public-use data files, not the data users.
The algebraic construction of fully efficient replication weights can serve as a starting point.
The proposed strategy for producing sparse and efficient replication weights, i.e. random selection plus weight calibration, can be applied to any initial weights from jackknife, bootstrap or BRR methods.
Identify and include all key variables at the calibration stage.
Producing a small number (say 30-50) of sets of replication weights that provide valid and efficient variance estimation is an important research problem with both theoretical and practical significance.

Dedication

This work is dedicated to the memory of Professor Randy Sitter.