Calculating Estimates for Multiple-frame Surveys with More than Two Frames 1 Introduction

Calculating Estimates for Multiple-frame Surveys with More than Two Frames Amber Tomas and Qiannan Yin April 13, 2016 1 Introduction There are now a number of different estimators for calculating estimates from multiple-frame surveys. However, most available documentation and code only applies to the case of two frames. In this document we describe how to compute estimates for the more general Q-frame case, and provide pseudo-code. This pseudo-code forms the basis for an R package currently in development by the authors. 2 Notation and Assumptions Our notation largely follows that used in Lohr and Rao [2006]. Suppose there are Q frames, denoted A1 , . . . , AQ . Let F denote the index set of frames, so F = 1, 2, . . . , Q. Assume that the union of the frames covers the population of interest. The population domains are defined by the subsets of F. For k ⊆ F, the domain defined by Dk is     \ \ Dk =  Aq  ∩  Acq  , q∈Dk q ∈D / k where c denotes complement. We use d to denote the number of non-empty domains, and q ∈ Dk to denote the set of frames that cover domain Dk . Let N (q) be the number of population units on frame Aq , and let Nk be the number of population units Pd in domain Dk . Then k=1 Nk = N , the population size. We use Sq to denote the set of units sampled from frame Aq , Y (q) to denote the population total of units on frame q, and Yk to denote the population total of domain Dk . The algorithms presented in this paper assume that only a small amount of information is available. In particular, we assume: • the domain membership is known for all sampled units; • the sampling weights from each frame are available; • the value of the response variable is known for all sampled units from each frame. (q) (q) We use wi to denote the weight of the ith unit on frame Aq , and yi to denote the value of the response variable for this unit. Often there may be additional information available, such as replicate weights and auxilliary variables. However, the methods presented in this paper can be used without this additional information. This is useful for practitioners who want to calculate estimates from publicly available data which often comes only with the values of the respose variables and their sampling weights. Unless otherwise stated, we assume that the Horvitz-Thompson estimator is used to estimate population totals. Hence for k = 1, . . . , d, the estimate of Yk using the sample from frame Aq is X (q) (q) Ŷk = wi δi (k)yi i∈Sq 1 where if i ∈ Dk . otherwise 1 0 δi (k) = Similarly, the estimate of the population size in domain k, Nk , based on the sample from frame q is X (q) (q) N̂k = wi δi (k). i∈Sq The remaining sections give pseudo-code that can be used to calculate the following estimators: • the Hartley estimator • the Fuller-Burmeister estimator • the Pseudo Maximum Likelihood estimator • the Pseudo Empirical Likelihood estimator (not incorporating auxilliary information) • the Unique Linkage estimator • the Multiplicity estimator • the Single-frame estimator. We use these estimators as they have been most referenced in the literature or used in practice. Our pseudocode assumes that the necessary variances can be calculated, and the final section discusses how these variances can be estimated if no additional design information is available. 3 Hartley Estimator The optimal linear estimator proposed by H19 [1962] and generalized to Q-frames by Lohr and Rao [2006] seeks to find the linear combination of separate frame estimates in each domain which will minimize the variance of the total population estimator. It requires calculating the covariance matrix of the domain estimates from each frame, and we assume that the practitioner can do this using the best variance approximation methods available to them (for example, using replicate weights). Methods for calculating the estimated (q) covariance matrix if no information is available other than the wi and yi are given in Section 10. Below is pseudo-code that can be used to calculate the Hartley estimator. 1. Calculate the d-vectors (q) (q) Ŷq = (Ŷ1 , . . . , Ŷd )T , and (q) (q) N̂q = (N̂1 , . . . , N̂d )T , q = 1, . . . , Q. Note that for q 6∈ Dk , i.e. if frame q does not cover domain Dk , then (q) Ŷk (q) = N̂k = 0. 2. Let Cq denote the estimated covariance matrix of Ŷq . (a) Construct Cq , q = 1, . . . , Q using one of the methods in Section 10 or an alternative. (b) Define the d × d diagonal matrix Γq to have kth diagonal entry 1 if q ∈ k and 0 otherwise, and let 1d be a d-vector of 1s. Calculate Eq = Γq Cq Γq , q = 1, . . . , Q. 2 (c) Calculate the (d(Q + 1) × dQ block matrix Ai,j  P −1 PQ Q  I − q=1 Γq Ei  q=1 Γq  Γi P −1 = Q Γ Γ Ej  i q q=1   Γj if i = j if i 6= j, i ≤ Q if i = Q + 1 , for i = 1, . . . , Q + 1, j = 1, . . . , Q. (d) Calculate the dQ × 1 vector T T θ = (θ1T , . . . , θQ ) which is the solution to Aθ = [0TdQ 1Td ]T (1) The set of linear equations (1) has d(Q + 1) equations in dQ unknowns, but is of rank dQ. We solve this system of linear equations using the generalized inverse of A. (e) The final estimate is then given by ŶH (θ) = Q X θqT Γq ŶqT . q=1 Equivalently, so-called “modified” weights can be calculated. These are the weights that can be used for calculating estimates if the q samples are concatenated together and treated as a single sample. Then the modified weights can be calculated as follows: (q) w̃i (q) = wi θq [k], q = 1, . . . , Q, i ∈ Sq , where θq [k] is the k-th element of θq . Then the final estimate can be calculated as ŶH (θ) = Q X X (q) (q) w̃i yi . q=1 i∈Sq 4 Fuller-Burmeister estimator The Fuller-Burmeister estimator [FB1, 1972] is similar to the Hartley estimator but incorporates information about the estimates N̂ (q) as well as Ŷ (q) to find an optimal linear combination of domain estimates. It was also generalized to Q-frames by Lohr and Rao [2006], and here we present pseudo-code that can be used to calculate it. 1. Estimate the 2d × 2d covariance matrix Dq = cov Ŷq N̂q = Cq T Dq12 Dq12 Dq22 2. Let Gq be the 2d × 2d matrix diag(Γq , Γq ), where Γq is as defined in Section 3. Calculate Eq = Gq Dq Gq , q = 1, . . . , Q 3. Construct the 2d(Q + 1) × 2dQ block diagonal matrix Bi,j  P −1 PQ Q  G G I − G Ei  i q q q=1 q=1  P −1 = Q Gi Ej  q=1 Gq   Gj if i = j if i 6= j, i ≤ Q if i = Q + 1 3 , for i = 1, . . . , Q + 1, j = 1 . . . , Q. T T 4. Compute the 2dQ-vector β = (β1T , . . . , βQ ) , as the solution to Bβ = [0T2dQ 1Td 0Td ]T (2) The set of linear equations (2) has 2d(Q + 1) equations in 2dQ unknowns, but is of rank 2dQ. We solve this system of linear equations using the generalized inverse of B. 5. Then ŶF B (β) = Q X βqT Gq [ŶqT N̂qT ]T . q=1 5 Pseudo Maximum Likelihood Estimator Pseudo maximum likelihood estimation for dual frame surveys using simple random sampling was first proposed by Skinner [1991]. This was extended to Q-frame complex surveys by Lohr and Rao [2006]. 1. Let f (q) = n(q) N (q) d(q) , q = 1, . . . , Q, where d(q) is the design effect for estimating Y from frame q. For simple random sampling the design effect is equal to 1, whereas for complex samples typically d(q) is greater than 1. Lohr and Rao [2006] discuss methods for how to choose a design effect if if it not approximately the same for all variables. If the frame sizes are not known, we use f (q) = 2. Calculate n(q) N̂ (q) d(q) , q = 1, . . . , Q. P (q) (q) Ŷk q∈Dk f ˆ Ȳk = P , k = 1, . . . , d. (q) N̂ (q) q∈Dk f k 3. Define M to be the d × Q matrix with k, qth entry 1 if q ∈ Dk M̂k,q = , for q = 1, . . . , Q, k = 1, . . . , d. 0 otherwise. 4. Let Ĥ be the d × Q matrix with k, qth entry (q) N̂k if q ∈ Dk Ĥk,q = , for q = 1, . . . , Q, k = 1, . . . , d. 0 otherwise. 5. The aim of this step is to find the d-vector of domain population sizes N̂ , which is the solution to (I − M M + )(diagN̂ )−1 Ĥf = 0Q+d , (3) M T N̂ − h where M + is the Moore-Penrose generalized inverse of M , and h = (N (1) , N (2) , ..., N (Q) )T . M is nonsingular only if we know the true values of N , in which case this step should be skipped and the true values used in place of the estimated frame sizes. If N is not known, the following method can be used to find an approximate solution to (3) [Lohr and Rao, 2006]: This relationship can be used to iteratively calculate N̂ using the following method: i) Let Ñ0 be a vector of initial estimates for N . Two reasonable options are (q) (a) If frame q has the largest sample size in domain Dk , then let Ñk0 = N̂k ; 4 (q) (b) Let Ñk0 equal the average of N̂k , q = 1, . . . , Q. ii) For l = 1, . . . calculate Ñl+1 , which is the solution to the linear set of equations (I − M M + )(diagÑl )−1 (diagM f ) Ñl+1 MT (I − M M + )(diagÑl )−1 Ĥf = . h iii) If the difference between Nl+1 and Nl is small enough, let N̂ = Nl+1 and stop iterating. In our studies we used option (b) for Ñk0 , and a stopping distance of 0.001. 6. For unit i on frame Aq in domain Dk , calculate the adjusted weight (q) (q) w̃i =P wi N̂k f (q) (p) p∈Dk 7. Calculate the final estimate ŶP M L = f (p) N̂k Q X X , q = 1, . . . , Q. (q) (q) w̃i yi . q=1 i∈Sq 6 Pseudo-Empirical Likelihood Estimator Pseudo-empirical likelihood (PEL) estimation for multiple frame surveys was introduced by Rao and Wu [2010]. They suggested two possible approaches: 1. PEL methods for dual-frame surveys based on post-stratified samples; 2. A multiplicity-based PEL approach which requires knowledge only of the multiplicity of each sampled observation and not the domain membership information, and which is easily extended to more than two frames. In the absence of auxiliary variables, the multiplicity approach simplifies to that of Mecatti [2007], which we describe in Section 8. Since we do not assume the presence of any auxiliary variables for this paper, we only present the PEL post-stratified estimator. Rao and Wu [2010] presented the post-stratified estimator only for the dual frame case, and noted that it could be extended to the more general Q-frame case but that the general case was not presented due to notational difficulties. We agree that the notational complexities are real, and for the same reason here we present pseudo-code that can be used to calculate the PEL estimator for both the 2- and 3-frame case, and indicate how it can be extended further if required. (q) (q) 1. Construct the vectors of normalized weights1 d˜k : a vector of length nk that has ith element (q) (q) d˜ki = P wi (q) i∈Sq ∩Dk wi , ∀ i ∈ Sq ∩ Dk . There should be d × Q such vectors: one for every frame-domain combination. 2. The next step is to determine values for the constants ηkq , k = 1, . . . , d, q = 1, . . . , Q. The constant ηkq can be thought of as the “weight” given to the estimate from frame q in domain k. Rao and Wu [2010] propose two possibilities: 1 Rao and Wu [2010] suggest using the inverse selection probabilities in place of weights when available. 5 A. Calculate η̂k,q = P (q) Vb (Ȳˆk ) , k = 1, . . . , d, q = 1, . . . , Q, (l) Vb (Ȳˆ ) l∈Dk k (q) where Ȳˆk is the ratio estimator (q) Ȳˆk P = (q) i∈Sq ∩Dk wi yi (q) P i∈Sq ∩Dk = wi (q) d˜ki yi . X i∈Sq ∩Dk B. Calculate2 (q) η̂k = P (q) Vb (N̂k ) , k = 1, . . . , d, q = 1, . . . , Q. (l) Vb (N̂ ) l∈Dk k 3. Calculate the “weights” (q) (q) Wk = N̂k (q) N̂ (q) × η̂k , k = 1, . . . , d, q = 1, . . . , Q. 4. We break this step up and present the pseudo-code separately for Q = 2 and then Q = 3. At the end of this section an explanation is given for how to extend the method to more than three frames. A. Suppose there are only two frames, A and B. Let domain Da include units only on frame A, domain Dab include units on both frame A and frame B, and domain Db include units only on frame B. Construct the following matrices and vectors: n x(A) a (A) xab (B) xab (B) xb = (A) (B) (B) (na(A) , nab , nab , nb )T , (4 × 1 vector) = 0n(A) ×1 a yi (A) = , i ∈ SA ∩ Dab , (nab × 1 vector) (A) Wab −yi (B) = , i ∈ SB ∩ Dab , (nab × 1 vector) (B) Wab = 0n(B) ×1 x = d = (A) (B) (B) (x(A) a , xab , xab , xb ), ˜(A) ˜(B) ˜(B) (d˜(A) a , dab , dab , db ), X = 0, scalar W = (Wa(A) , Wab , Wab , Wb b (A) (B) (n × 1 vector) (n × 1 vector) (B) ), (4 × 1 vector) B. Suppose there are only three frames, A, B, and C. Let Dab denote the domain where only frames A and B overlap, and Dabc denote the domain where all three frames overlap. Other domains are denoted similarly. Construct the following matrices and vectors: 2 See equation (2.8) in Rao and Wu [2010] for an adjustment they suggest in the 2-frame case. 6 (nA , nA , nB , nB , nA , nC , nC , nB , nC , nA , nB , nC )T , (12 × 1 vector)  a yab ab b ac ac c bc bc abc abc abc i   W(abc)a , i ∈ SA ∩ Dabc −yi = , i ∈ SB ∩ Dabc W   (abc)b 0 for all other i and where W(abc)a or W(abc)b is 0  yi   W(abc)a , i ∈ SA ∩ Dabc −yi = , i ∈ SC ∩ Dabc W   (abc)c 0 for all other i and where W(abc)a or W(abc)c is 0  y i   W(ab)a , i ∈ SA ∩ Dab −yi = , i ∈ SB ∩ Dab W   (ab)b 0 for all other i and where W(ab)a or W(ab)b is 0  y i   W(ac)a , i ∈ SA ∩ Dac −yi = , i ∈ SC ∩ Dac W   (ac)c 0 for all other i and where W(ac)a or W(ac)c is 0  y i   W(bc)b , i ∈ SB ∩ Dbc −yi = , i ∈ SC ∩ Dbc W   (bc)c 0 for all other i and where W(bc)b or W(bc)c is 0 n = x1h(abc) x2h(abc) xh(ab) xh(ac) xh(bc) x = (x1h(abc) , x2h(abc) , xh(ab) , xh(ac) , xh(bc) ), (5 × n matrix) d = Ã ˜B ˜B Ã ˜C ˜C ˜B ˜C Ã ˜B ˜C T (dÃ a , dab , dab , db , dac , dac , dc , dbc , dbc , dabc , dabc , dabc ) , (1 × n vector) X = 05×1 , (scalar) W = (Wa(A) , Wab , Wab , Wb (A) (B) (B) (B) (C) (A) (B) (C) (A) (C) , Wac , Wac , Wc(C) , Wbc , Wbc , Wabc , Wabc , Wabc ), (12 × 1 vector) Note: for an internally consistent solution based on ensuring equality of frame estimates of domain population sizes rather than domain population totals, set yi = 1 in the above for all i. 5. Use code in Appendix A3. of Wu [2005] to solve for p̂, a vector of length n. 6. Calculate the estimate ȲˆPEL = X X k q∈Dk (q) X Wk (q) p̂ki yi . i∈Sq ∩Dk Explanation of Calculations for PEL Estimator in the case of more than two frames Rao and Wu [2010] mention that the multiple-frame PEL estimator can be considered as a case of the PEL estimator under stratified sampling, where each domain/frame combination can be considered as a separate stratum. The calculation of the PEL estimate for the case of stratified sampling is given in Wu [2005]. In this paper the problem is formulated as follows: Maximize H X X l(p1 , . . . , pH ) = n∗ Wh d∗hi log(phi ), h=1 i∈sh subject to the constraints X phi > 0, phi = 1, h = 1, . . . , H i∈sh and X h Wh X phi xhi = X, i∈sh 7 (4) where here xhi is a vector of auxilliary variables, X the vector of known population totals for the auxilliary variables, and sh is the set of sampled units from stratum h. In Rao and Wu (2010) the constrained maximization problem is presented as: Maximize X X l(pa , pab , pba , pb ) = n∗ Wh d∗hi log(phi ), h ∈ {a, ab, ba, b}, i∈Sh h subject to the constraints X phi = 1, h = a, ab, ba, b, i∈Sh and X pabi yi = i∈Sab X pbaj yj . (5) j∈Sba Note that they do not specify that the phi must be greater than zero, although we assume that this constraint is intended. The constraint (5) can be reexpressed as X X pabi yi − pbaj yj = 0, or equivalently i∈Sab Wab X i∈Sab j∈Sba X X X −yi yi + Wba + Wa pbai pai .0 + Wb pbi .0 = 0, pabi Wab Wba i∈Sba i∈Sa which can be seen to be equivalent to (4) when  yi  Wab , −yi xhi = ,  Wba 0 i∈Sb i ∈ Sab , i ∈ Sba for all other i (6) and X = 0. The case for more than two frames can be expressed similarly, with k − 1 “auxilliary” variables for each overlap domain with k frames. For example, with three frames the constraint for the domain abc can be expressed as X X X pA pB pC (abc)i yi = (abc)i yi = (abc)i yi , A i∈Sabc B i∈Sabc C i∈Sabc or equivalently as X A i∈Sabc pA (abc)i yi = X X pB (abc)i yi , B i∈Sabc pA (abc)i yi = A i∈Sabc X pC (abc)i yi . (7) C i∈Sabc This can be formulated similarly to (6) but using one auxilliary variable for each equality in (7). Once the problem from Rao and Wu (2010) has been formulated in the same way as Wu (2005), the code presented in Appendix A3. of that paper can be used to solve for p̂. 7 Unique Linkage Estimator The Unique Linkage (UL) Estimator has been used in multiple-frame surveys such as SESTAT [NCSES, b,a]. The idea is that each sampled unit is linked to just one frame, and information on that nuit from the other frames is ignored for estimation purposes. 1. Determine an order for the Q frames. Denote the rank of frame q in this order by R(q). One method for determining the order is as follows: (a) Calculate Vb (Ŷ (q) ), for q = 1, . . . , Q. (b) Order the surveys by increasing value of Vb (Ŷ (q) ). 8 2. Let M [i] denote the set of frames which contain unit i. Calculate weight modification factors 1 if R(q) = mink∈M [i] R(k) (q) mi = , q = 1 . . . , Q, i ∈ Sq 0 otherwise 3. Calculate the modified weights (q) w̃i (q) (q) = mi wi , q = 1, . . . , Q, i ∈ Sq . 4. The Unique Linkage estimator is given by ŶU L = Q X X (q) w̃i yi . q=1 i∈Sq 8 The Multiplicity Estimator Compared to previous estimators we’ve considered, the multiplicity estimator requires one additional assumption: that the number of frames to which every sampled unit belongs is known. The number of frames on which a unit is listed is referred to as the multiplicity of that unit. The multiplicity estimator for multipleframe surveys was first proposed by Mecatti [2007]. 1. Let mi be the multiplicity of unit i, i.e. the number of frames that include unit i. 2. Calculate the modified weights of the sampled units (q) w̃i (q) = wi m−1 i , q = 1, . . . , Q; i ∈ Sq 3. Then the Mecatti estimator is given by ŶM = Q X X (q) (q) w̃i yi . q=1 i∈Sq 9 The Single-frame Estimator This estimation method requires the following additional assumption: that the records of every sampled unit can be identified on all the frames that include it. This is stronger than the assumption required for the Mecatti estimator. The single-frame estimator was first proposed for dual-frame surveys by Kalton and Anderson [1986] and then extended to the Q-frame case by Lohr and Rao [2006]. (q) 1. Let w[i] be the weight on frame Aq of population unit i. Calculate the modified weights of the sampled units  −1 Q X 1  (q) w̃[i] =  , q = 1, . . . , Q; i ∈ Sq . (j) j=1 w[i] 2. Then the single-frame estimator is ŶSF = Q X X q=1 i∈Sq 9 (q) (q) w̃i yi . 10 Variance Estimation As stated in the introduction, we assume only that the frame and domain membership is known, along with the frame sampling weights and value of the response variable. Most of the estimators presented in previous sections require estimates of the variance of estimates of the population mean or total. Typically this would be done using replicate weights or Jacknife methods, both of which assume knowledge of the sampling design and may require knowledge of design variables not on the available data set. Therefore, in this section we present several options for how the variances can be estimated in the absence of additional information. 10.1 Variance Estimation for the Pseudo-Empirical Likelihood Estimator Step 2. of the Pseudo-Empirical Likelihood Estimator presented in section 6 uses variance estimates of the estimated domain means from each frame. The variance of the Hajek estimators of the domain mean can be written as follows: " (q) # Ŷk (q) ˆ V (ȲkH ) = V (q) N̂k P  (q) yi (q) wi i∈Sk  = V P (q) (q) wi i∈Sk "P # (q) i∈S (q) Iki wi yi = V , (8) P (q) i∈S (q) Iki wi where Iki = 1 0 if unit i is in domain Dk . otherwise The required variances are therefore all variances of some ratio estimator, for some frame q = 1, . . . , Q and domain Dk , k = 1, . . . , d. If SRS is used to select the sample from frame q, then the weights cancel out and the ratio estimator from (8) can be written as P i∈S Iki yi (q) B̂k = P q . i∈Sq Iki (q) The variance of B̂k can be estimated by (see for example Lohr [2010, p. 126]) X 1 1 n(q) (q) (q) (yi Iki − B̂k Iki )2 V̂ (B̂k ) = 1 − (q) (q) (q) (q) N n − 1 i∈S n I¯k q X n(q) 1 (q) (q) = 1 − (q) (yi − B̂k )2 /nk . (q) N n − 1 i∈S ∩D q If SRS is not used, we use the following weighted variance formula: X n(q) 1 (q) (q) (q) V̂ (B̂k ) = 1 − (q) w (yi − B̂k )2 P (q) N n − 1 i∈S ∩D i q k (9) k 1 i∈Sq ∩Dk (q) , wi which simplifies to (9) when the weights are equal for all units sampled from frame q in domain k. If the effective sample size for frame q is known, this should be used in place of n(q) . 10.2 Variance and Covariance Estimation for the Hartley and Fuller-Burmeister Estimators The generalized Hartley estimator and Fuller-Burmeister estimator are functions of Cq , the covariance matrix (q) of Ŷd , for all domains d covered by frame q, for q = 1, . . . , Q. Firstly we derive formulas for the variances in this matrix, and then the covariances. 10 10.2.1 Variances Let i ∈ Uq ∩ Dk1 otherwise, yi 0 ui = where Uq ∩ Dk1 is the set of all population units from frame q in domain k1. Then X (q) Yk1 = ui . i∈Uq (q) Assuming that Yk1 is estimated by the Horvitz-Thompson estimator (q) −1(q) X Ŷk1 = πi ui , (10) i∈S (q) the variance of this estimator is given by (q) V (q) (Ŷk1 ) = X 1 − πi i∈Uq πi (q) ! u2i (q) + (q) (q) XX πij − πi πj i∈Uq j6=i πi πj (q) (q) ! ui uj . (11) −1(q) (q) We assume that only the sampling weights are known, and that wi ≈ πi for all units i on frame q, q = 1, . . . , Q. We consider the following three approximations to the variance (11): 1. 2 n(q) su (q) (q) 2 b V (Ŷk1 ) = (N ) 1 − (q) , N n(q) where s2u = X 1 (ui − ū)2 . − 1 i∈S n(q) q This is the unbiased estimator of (10) in the case that SRSWOR was used to select the sample from frame q. 2. (q) Vb (Ŷk1 ) = X (q) (q) wi (wi − 1)yi2 . i∈Sq This is obtained by assuming that πij ≈ πi πj ∀ i, j, i.e. that units are sampled independently, in which case we can write (11) as ! X 1 − π (q) (q) i b V (Ŷk1 ) ≈ yi2 . (q) π i∈Uq i An unbiased estimate of this is given by (q) (q) Vb (Ŷk1 ) = X 1 − πi −1(q) πi (q) πi i∈Sq = X (q) ! (q) wi (wi yi2 − 1)yi2 . i∈Sq We would expect this estimator to do less well when the sampling fraction is large, in which case the assumption that the second term in the variance is close to zero is unlikely to hold. 11 3. (q) Vb (Ŷk1 ) = n(q) X (q) −1(q) 2 (1 − πi )(ui πi − A(q) s ) , n(q) − 1 i∈S (12) q where (q) 1 − πi X A(q) s = i∈Sq P j∈Sq (1 − −1(q) ui πi . (q) πj ) Henderson [2006] compared nine estimators of the variance of the Horvitz-Thompson estimator (11). The estimator (12), first derived by Hajek (1964), seemed to perform well in many cases. However, it does take significantly longer to compute than estimates 1. and 2. 10.2.2 Covariances (q1) (q2) We assume that samples from different frames are independent, in which case Cov(Ŷk1 , Ŷk2 ) = 0 for all q1 6= q2. Therefore, we only need to consider the covariance between domain estimates for estimates calculated from the same frame. Let yi i ∈ Uq ∩ Dk2 vi = 0 otherwise, An unbiased estimator of the covariance between two Horvitz-Thompson estimators X X (q) (q) Ŷk1 = wi ui , and Ŷk2 = wi v i , i∈Sq i∈Sq is [Särndal et al., 1992] d Ŷ (q) , Ŷ (q) ) Cov( k1 k2 = (q) (q) (q) X X πij − πi πj ui vj (q) (q) πij (q) π π i∈Sq j∈Sq = i (q) X (1 − πi ) i∈Sq ui vi (q) πi (q) πi + (13) j (q) (q) (q) X X πij − πi πj ui i∈Sq j6=i (q) πij (q) πi vj (q) . πj Since it is assumed that the sampling weights of only the sampled units are known, we need to approximate (13). We again consider three possibilities, based on different assumptions: 1. (q) (q) (q) X X X (N (q) )2 N (q) N (q) − 1 d Ŷ (q) , Ŷ (q) ) = N (N − n ) Cov( − ui vj . u v + i i k1 k2 (n(q) )2 (n(q) )2 n(q) n(q) − 1 i∈S i∈S j6=i q q This is obtained from (13) under the assumption that SRSWOR was used, in which case we know (q) (q) πi = n(q) /N (q) and πij = n(q) /N (q) (n(q) − 1)/(N (q) − 1), i 6= j. 2. d Ŷ (q) , Ŷ (q) ) = Cov( k1 k2 X (q) (q) wi (wi − 1)ui vi . i∈Sq (q) (q) (q) This is obtained from (13) under the assumption that πij ≈ πi πj term is approximately equal to zero. ∀ i, j. In this case the second 3. d Ŷ (q) , Ŷ (q) ) = Cov( k1 k2 X i∈Sq (q) (1 − πi ) ui (q) πi vi (q) πi   −1  (q) (q) X X 1 − (1 − πi )(1 − πj ) ui vj   + 1 −   (q) (q) . P (q) πi πj i∈Sq j6=i l∈Sq (1 − πl ) 12 This is obtained based on substituting the following approximation [Hájek, 1964] for the joint selection probabilities:   (q) (q) (q) (q) (q)  1 − (1 − πi )(1 − πj )  πij = πi πj P (q) (q) l∈Uq πl (1 − πk ) and approximating the denominator of this term by X (q) (1 − πl ). l∈Sq Similar to the methods for estimating variances, method 2. is the most computationally efficient closely followed by method 1. The third method requires significantly more computation time. We’d expect method 1 to be most accurate if there is small weight variation between units on the same frame. If this is not the case then method 2. may give more accurate results, particularly if the sampling fraction is small, in which case the independence assumptions are more reasonable. References W.A. Fuller and L.F. Burmeister. Estimators for samples selected from two overlapping frames. In Proceedings of the Social Statistics Section, pages 245–249. American Statistical Association, 1972. Jaroslav Hájek. Asymptotic theory of rejective sampling with varying probabilities from a finite population. Annals of Mathematical Statistics, 35(4):1491–1523, 1964. H. O. Hartley. Multiple frame surveys. In Proceedings of the Social Statistics Section, pages 203–206. American Statistical Association, 1962. Tamie Henderson. Estimating the variance of the horvitz-thompson estimator. Master’s thesis, School of Finance and Applied Statistics, The Australian National University, October 2006. G. Kalton and D.W. Anderson. Sampling rare populations. Journal of the Royal Statistical Society, 149: 65–82, 1986. Sharon L. Lohr. Sampling: Design and Analysis. Duxbury Press, 2 edition, 2010. Sharon L. Lohr and J. N. K. Rao. Estimation in multiple-frame surveys. Journal of the American Statistical Association, 101(475):1019–1030, 2006. doi: 10.1198/016214506000000195. Fulvia Mecatti. A single frame multiplicity estimator for multiple frame surveys. Survey methodology, 33(2): 151–157, 2007. National Science Federation NCSES. Scientists and engineers statistical data system (sestat), 2010: Technical notes. ncsesdata.nsf.gov/us-workforce/2010/ses_2010_tech_notes.pdf, a. Accessed: 01-04-2016. National Science Federation NCSES. Current design of the sestat surveys. www.nsf.gov/statistics/ srs07202/content.cfm?pub_id=1715&id=3, b. Accessed: 01-04-2016. J. N. K. Rao and Changbao Wu. Pseudo–empirical likelihood inference for multiple frame surveys. Journal of the American Statistical Association, 105(492):1494–1503, 2010. C.E. Särndal, B. Swensson, and J.H. Wretman. Model Assisted Survey Sampling. Springer series in statistics. Springer-Verlag, 1992. C.J. Skinner. On the efficiency of raking ratio estimation for multiple frame surveys. Journal of the American Statistical Association, 86:779–784, 1991. Changbao Wu. Algorithms and R codes for the pseudo empirical likelihood method in survey sampling. Survey Methodology, 31(2):239, 2005. 13

Calculating Estimates for Multiple-frame Surveys with More than Two Frames 1 Introduction

Related documents

Products

Support

Calculating Estimates for Multiple-frame Surveys with More than Two Frames 1 Introduction

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib