Calculating Estimates for Multiple-frame Surveys with More than Two Frames 1 Introduction

advertisement
Calculating Estimates for Multiple-frame Surveys with More than
Two Frames
Amber Tomas and Qiannan Yin
April 13, 2016
1
Introduction
There are now a number of different estimators for calculating estimates from multiple-frame surveys. However, most available documentation and code only applies to the case of two frames. In this document
we describe how to compute estimates for the more general Q-frame case, and provide pseudo-code. This
pseudo-code forms the basis for an R package currently in development by the authors.
2
Notation and Assumptions
Our notation largely follows that used in Lohr and Rao [2006]. Suppose there are Q frames, denoted
A1 , . . . , AQ . Let F denote the index set of frames, so F = 1, 2, . . . , Q. Assume that the union of the
frames covers the population of interest. The population domains are defined by the subsets of F. For
k ⊆ F, the domain defined by Dk is

 

\
\
Dk = 
Aq  ∩ 
Acq  ,
q∈Dk
q ∈D
/ k
where c denotes complement. We use d to denote the number of non-empty domains, and q ∈ Dk to denote
the set of frames that cover domain Dk .
Let N (q) be the number of population units on frame Aq , and let Nk be the number of population units
Pd
in domain Dk . Then k=1 Nk = N , the population size. We use Sq to denote the set of units sampled from
frame Aq , Y (q) to denote the population total of units on frame q, and Yk to denote the population total of
domain Dk .
The algorithms presented in this paper assume that only a small amount of information is available. In
particular, we assume:
• the domain membership is known for all sampled units;
• the sampling weights from each frame are available;
• the value of the response variable is known for all sampled units from each frame.
(q)
(q)
We use wi to denote the weight of the ith unit on frame Aq , and yi to denote the value of the response
variable for this unit. Often there may be additional information available, such as replicate weights and
auxilliary variables. However, the methods presented in this paper can be used without this additional
information. This is useful for practitioners who want to calculate estimates from publicly available data
which often comes only with the values of the respose variables and their sampling weights.
Unless otherwise stated, we assume that the Horvitz-Thompson estimator is used to estimate population
totals. Hence for k = 1, . . . , d, the estimate of Yk using the sample from frame Aq is
X (q)
(q)
Ŷk =
wi δi (k)yi
i∈Sq
1
where
if i ∈ Dk
.
otherwise
1
0
δi (k) =
Similarly, the estimate of the population size in domain k, Nk , based on the sample from frame q is
X (q)
(q)
N̂k =
wi δi (k).
i∈Sq
The remaining sections give pseudo-code that can be used to calculate the following estimators:
• the Hartley estimator
• the Fuller-Burmeister estimator
• the Pseudo Maximum Likelihood estimator
• the Pseudo Empirical Likelihood estimator (not incorporating auxilliary information)
• the Unique Linkage estimator
• the Multiplicity estimator
• the Single-frame estimator.
We use these estimators as they have been most referenced in the literature or used in practice. Our pseudocode assumes that the necessary variances can be calculated, and the final section discusses how these
variances can be estimated if no additional design information is available.
3
Hartley Estimator
The optimal linear estimator proposed by H19 [1962] and generalized to Q-frames by Lohr and Rao [2006]
seeks to find the linear combination of separate frame estimates in each domain which will minimize the variance of the total population estimator. It requires calculating the covariance matrix of the domain estimates
from each frame, and we assume that the practitioner can do this using the best variance approximation
methods available to them (for example, using replicate weights). Methods for calculating the estimated
(q)
covariance matrix if no information is available other than the wi and yi are given in Section 10.
Below is pseudo-code that can be used to calculate the Hartley estimator.
1. Calculate the d-vectors
(q)
(q)
Ŷq = (Ŷ1 , . . . , Ŷd )T ,
and
(q)
(q)
N̂q = (N̂1 , . . . , N̂d )T , q = 1, . . . , Q.
Note that for q 6∈ Dk , i.e. if frame q does not cover domain Dk , then
(q)
Ŷk
(q)
= N̂k
= 0.
2. Let Cq denote the estimated covariance matrix of Ŷq .
(a) Construct Cq , q = 1, . . . , Q using one of the methods in Section 10 or an alternative.
(b) Define the d × d diagonal matrix Γq to have kth diagonal entry 1 if q ∈ k and 0 otherwise, and let
1d be a d-vector of 1s.
Calculate
Eq = Γq Cq Γq , q = 1, . . . , Q.
2
(c) Calculate the (d(Q + 1) × dQ block matrix
Ai,j

P
−1 PQ
Q

I − q=1 Γq Ei

q=1 Γq
 Γi
P
−1
=
Q
Γ
Γ
Ej

i
q
q=1


Γj
if i = j
if i 6= j, i ≤ Q
if i = Q + 1
, for i = 1, . . . , Q + 1, j = 1, . . . , Q.
(d) Calculate the dQ × 1 vector
T T
θ = (θ1T , . . . , θQ
)
which is the solution to
Aθ = [0TdQ 1Td ]T
(1)
The set of linear equations (1) has d(Q + 1) equations in dQ unknowns, but is of rank dQ. We
solve this system of linear equations using the generalized inverse of A.
(e) The final estimate is then given by
ŶH (θ) =
Q
X
θqT Γq ŶqT .
q=1
Equivalently, so-called “modified” weights can be calculated. These are the weights that can be
used for calculating estimates if the q samples are concatenated together and treated as a single
sample. Then the modified weights can be calculated as follows:
(q)
w̃i
(q)
= wi θq [k], q = 1, . . . , Q, i ∈ Sq ,
where θq [k] is the k-th element of θq . Then the final estimate can be calculated as
ŶH (θ) =
Q X
X
(q) (q)
w̃i yi .
q=1 i∈Sq
4
Fuller-Burmeister estimator
The Fuller-Burmeister estimator [FB1, 1972] is similar to the Hartley estimator but incorporates information
about the estimates N̂ (q) as well as Ŷ (q) to find an optimal linear combination of domain estimates. It was
also generalized to Q-frames by Lohr and Rao [2006], and here we present pseudo-code that can be used to
calculate it.
1. Estimate the 2d × 2d covariance matrix
Dq = cov
Ŷq
N̂q
=
Cq
T
Dq12
Dq12
Dq22
2. Let Gq be the 2d × 2d matrix diag(Γq , Γq ), where Γq is as defined in Section 3. Calculate
Eq = Gq Dq Gq , q = 1, . . . , Q
3. Construct the 2d(Q + 1) × 2dQ block diagonal matrix
Bi,j

P
−1 PQ
Q

G
G
I
−
G
Ei

i
q
q
q=1
q=1

P
−1
=
Q
Gi
Ej

q=1 Gq


Gj
if i = j
if i 6= j, i ≤ Q
if i = Q + 1
3
, for i = 1, . . . , Q + 1, j = 1 . . . , Q.
T T
4. Compute the 2dQ-vector β = (β1T , . . . , βQ
) , as the solution to
Bβ = [0T2dQ 1Td 0Td ]T
(2)
The set of linear equations (2) has 2d(Q + 1) equations in 2dQ unknowns, but is of rank 2dQ. We solve
this system of linear equations using the generalized inverse of B.
5. Then
ŶF B (β) =
Q
X
βqT Gq [ŶqT N̂qT ]T .
q=1
5
Pseudo Maximum Likelihood Estimator
Pseudo maximum likelihood estimation for dual frame surveys using simple random sampling was first proposed by Skinner [1991]. This was extended to Q-frame complex surveys by Lohr and Rao [2006].
1. Let
f (q) =
n(q)
N (q) d(q)
, q = 1, . . . , Q,
where d(q) is the design effect for estimating Y from frame q. For simple random sampling the design
effect is equal to 1, whereas for complex samples typically d(q) is greater than 1. Lohr and Rao [2006]
discuss methods for how to choose a design effect if if it not approximately the same for all variables.
If the frame sizes are not known, we use
f (q) =
2. Calculate
n(q)
N̂ (q) d(q)
, q = 1, . . . , Q.
P
(q) (q)
Ŷk
q∈Dk f
ˆ
Ȳk = P
, k = 1, . . . , d.
(q) N̂ (q)
q∈Dk f
k
3. Define M to be the d × Q matrix with k, qth entry
1 if q ∈ Dk
M̂k,q =
, for q = 1, . . . , Q, k = 1, . . . , d.
0 otherwise.
4. Let Ĥ be the d × Q matrix with k, qth entry
(q)
N̂k
if q ∈ Dk
Ĥk,q =
, for q = 1, . . . , Q, k = 1, . . . , d.
0
otherwise.
5. The aim of this step is to find the d-vector of domain population sizes N̂ , which is the solution to
(I − M M + )(diagN̂ )−1 Ĥf
= 0Q+d ,
(3)
M T N̂ − h
where M + is the Moore-Penrose generalized inverse of M , and h = (N (1) , N (2) , ..., N (Q) )T . M is
nonsingular only if we know the true values of N , in which case this step should be skipped and the
true values used in place of the estimated frame sizes.
If N is not known, the following method can be used to find an approximate solution to (3) [Lohr and
Rao, 2006]: This relationship can be used to iteratively calculate N̂ using the following method:
i) Let Ñ0 be a vector of initial estimates for N . Two reasonable options are
(q)
(a) If frame q has the largest sample size in domain Dk , then let Ñk0 = N̂k ;
4
(q)
(b) Let Ñk0 equal the average of N̂k , q = 1, . . . , Q.
ii) For l = 1, . . . calculate Ñl+1 , which is the solution to the linear set of equations
(I − M M + )(diagÑl )−1 (diagM f )
Ñl+1
MT
(I − M M + )(diagÑl )−1 Ĥf
=
.
h
iii) If the difference between Nl+1 and Nl is small enough, let N̂ = Nl+1 and stop iterating.
In our studies we used option (b) for Ñk0 , and a stopping distance of 0.001.
6. For unit i on frame Aq in domain Dk , calculate the adjusted weight
(q)
(q)
w̃i
=P
wi N̂k f (q)
(p)
p∈Dk
7. Calculate the final estimate
ŶP M L =
f (p) N̂k
Q X
X
, q = 1, . . . , Q.
(q) (q)
w̃i yi .
q=1 i∈Sq
6
Pseudo-Empirical Likelihood Estimator
Pseudo-empirical likelihood (PEL) estimation for multiple frame surveys was introduced by Rao and Wu
[2010]. They suggested two possible approaches:
1. PEL methods for dual-frame surveys based on post-stratified samples;
2. A multiplicity-based PEL approach which requires knowledge only of the multiplicity of each sampled
observation and not the domain membership information, and which is easily extended to more than
two frames.
In the absence of auxiliary variables, the multiplicity approach simplifies to that of Mecatti [2007], which we
describe in Section 8. Since we do not assume the presence of any auxiliary variables for this paper, we only
present the PEL post-stratified estimator. Rao and Wu [2010] presented the post-stratified estimator only
for the dual frame case, and noted that it could be extended to the more general Q-frame case but that the
general case was not presented due to notational difficulties. We agree that the notational complexities are
real, and for the same reason here we present pseudo-code that can be used to calculate the PEL estimator
for both the 2- and 3-frame case, and indicate how it can be extended further if required.
(q)
(q)
1. Construct the vectors of normalized weights1 d˜k : a vector of length nk that has ith element
(q)
(q)
d˜ki = P
wi
(q)
i∈Sq ∩Dk
wi
, ∀ i ∈ Sq ∩ Dk .
There should be d × Q such vectors: one for every frame-domain combination.
2. The next step is to determine values for the constants ηkq , k = 1, . . . , d, q = 1, . . . , Q. The constant
ηkq can be thought of as the “weight” given to the estimate from frame q in domain k. Rao and Wu
[2010] propose two possibilities:
1 Rao
and Wu [2010] suggest using the inverse selection probabilities in place of weights when available.
5
A. Calculate
η̂k,q = P
(q)
Vb (Ȳˆk )
, k = 1, . . . , d, q = 1, . . . , Q,
(l)
Vb (Ȳˆ )
l∈Dk
k
(q)
where Ȳˆk is the ratio estimator
(q)
Ȳˆk
P
=
(q)
i∈Sq ∩Dk
wi yi
(q)
P
i∈Sq ∩Dk
=
wi
(q)
d˜ki yi .
X
i∈Sq ∩Dk
B. Calculate2
(q)
η̂k = P
(q)
Vb (N̂k )
, k = 1, . . . , d, q = 1, . . . , Q.
(l)
Vb (N̂ )
l∈Dk
k
3. Calculate the “weights”
(q)
(q)
Wk
=
N̂k
(q)
N̂ (q)
× η̂k , k = 1, . . . , d, q = 1, . . . , Q.
4. We break this step up and present the pseudo-code separately for Q = 2 and then Q = 3. At the end
of this section an explanation is given for how to extend the method to more than three frames.
A. Suppose there are only two frames, A and B. Let domain Da include units only on frame A, domain
Dab include units on both frame A and frame B, and domain Db include units only on frame B.
Construct the following matrices and vectors:
n
x(A)
a
(A)
xab
(B)
xab
(B)
xb
=
(A)
(B)
(B)
(na(A) , nab , nab , nb )T , (4 × 1 vector)
= 0n(A) ×1
a
yi
(A)
=
, i ∈ SA ∩ Dab , (nab × 1 vector)
(A)
Wab
−yi
(B)
=
, i ∈ SB ∩ Dab , (nab × 1 vector)
(B)
Wab
=
0n(B) ×1
x
=
d
=
(A)
(B)
(B)
(x(A)
a , xab , xab , xb ),
˜(A) ˜(B) ˜(B)
(d˜(A)
a , dab , dab , db ),
X
=
0, scalar
W
=
(Wa(A) , Wab , Wab , Wb
b
(A)
(B)
(n × 1 vector)
(n × 1 vector)
(B)
), (4 × 1 vector)
B. Suppose there are only three frames, A, B, and C. Let Dab denote the domain where only frames
A and B overlap, and Dabc denote the domain where all three frames overlap. Other domains are
denoted similarly. Construct the following matrices and vectors:
2 See
equation (2.8) in Rao and Wu [2010] for an adjustment they suggest in the 2-frame case.
6
(nA
, nA , nB , nB , nA , nC , nC , nB , nC , nA , nB , nC )T , (12 × 1 vector)
 a yab ab b ac ac c bc bc abc abc abc
i

 W(abc)a , i ∈ SA ∩ Dabc
−yi
=
, i ∈ SB ∩ Dabc
W

 (abc)b
0
for all other i and where W(abc)a or W(abc)b is 0

yi

 W(abc)a , i ∈ SA ∩ Dabc
−yi
=
, i ∈ SC ∩ Dabc
W

 (abc)c
0
for all other i and where W(abc)a or W(abc)c is 0
 y
i

 W(ab)a , i ∈ SA ∩ Dab
−yi
=
, i ∈ SB ∩ Dab
W

 (ab)b
0
for all other i and where W(ab)a or W(ab)b is 0
 y
i

 W(ac)a , i ∈ SA ∩ Dac
−yi
=
, i ∈ SC ∩ Dac
W

 (ac)c
0
for all other i and where W(ac)a or W(ac)c is 0
 y
i

 W(bc)b , i ∈ SB ∩ Dbc
−yi
=
, i ∈ SC ∩ Dbc
W

 (bc)c
0
for all other i and where W(bc)b or W(bc)c is 0
n =
x1h(abc)
x2h(abc)
xh(ab)
xh(ac)
xh(bc)
x
=
(x1h(abc) , x2h(abc) , xh(ab) , xh(ac) , xh(bc) ), (5 × n matrix)
d
=
˜A ˜B ˜B ˜A ˜C ˜C ˜B ˜C ˜A ˜B ˜C T
(d˜A
a , dab , dab , db , dac , dac , dc , dbc , dbc , dabc , dabc , dabc ) , (1 × n vector)
X
=
05×1 , (scalar)
W
=
(Wa(A) , Wab , Wab , Wb
(A)
(B)
(B)
(B)
(C)
(A)
(B)
(C)
(A)
(C)
, Wac
, Wac
, Wc(C) , Wbc , Wbc , Wabc , Wabc , Wabc ),
(12 × 1 vector)
Note: for an internally consistent solution based on ensuring equality of frame estimates of domain
population sizes rather than domain population totals, set yi = 1 in the above for all i.
5. Use code in Appendix A3. of Wu [2005] to solve for p̂, a vector of length n.
6. Calculate the estimate
ȲˆPEL =
X X
k q∈Dk
(q)
X
Wk
(q)
p̂ki yi .
i∈Sq ∩Dk
Explanation of Calculations for PEL Estimator in the case of more than two
frames
Rao and Wu [2010] mention that the multiple-frame PEL estimator can be considered as a case of the PEL
estimator under stratified sampling, where each domain/frame combination can be considered as a separate
stratum. The calculation of the PEL estimate for the case of stratified sampling is given in Wu [2005]. In
this paper the problem is formulated as follows:
Maximize
H
X
X
l(p1 , . . . , pH ) = n∗
Wh
d∗hi log(phi ),
h=1
i∈sh
subject to the constraints
X
phi > 0,
phi = 1, h = 1, . . . , H
i∈sh
and
X
h
Wh
X
phi xhi = X,
i∈sh
7
(4)
where here xhi is a vector of auxilliary variables, X the vector of known population totals for the auxilliary
variables, and sh is the set of sampled units from stratum h.
In Rao and Wu (2010) the constrained maximization problem is presented as:
Maximize
X
X
l(pa , pab , pba , pb ) = n∗
Wh
d∗hi log(phi ), h ∈ {a, ab, ba, b},
i∈Sh
h
subject to the constraints
X
phi = 1, h = a, ab, ba, b,
i∈Sh
and
X
pabi yi =
i∈Sab
X
pbaj yj .
(5)
j∈Sba
Note that they do not specify that the phi must be greater than zero, although we assume that this constraint
is intended. The constraint (5) can be reexpressed as
X
X
pabi yi −
pbaj yj = 0, or equivalently
i∈Sab
Wab
X
i∈Sab
j∈Sba
X
X
X
−yi
yi
+ Wba
+ Wa
pbai
pai .0 + Wb
pbi .0 = 0,
pabi
Wab
Wba
i∈Sba
i∈Sa
which can be seen to be equivalent to (4) when
 yi
 Wab ,
−yi
xhi =
,
 Wba
0
i∈Sb
i ∈ Sab
,
i ∈ Sba
for all other i
(6)
and X = 0.
The case for more than two frames can be expressed similarly, with k − 1 “auxilliary” variables for each
overlap domain with k frames. For example, with three frames the constraint for the domain abc can be
expressed as
X
X
X
pA
pB
pC
(abc)i yi =
(abc)i yi =
(abc)i yi ,
A
i∈Sabc
B
i∈Sabc
C
i∈Sabc
or equivalently as
X
A
i∈Sabc
pA
(abc)i yi =
X
X
pB
(abc)i yi ,
B
i∈Sabc
pA
(abc)i yi =
A
i∈Sabc
X
pC
(abc)i yi .
(7)
C
i∈Sabc
This can be formulated similarly to (6) but using one auxilliary variable for each equality in (7).
Once the problem from Rao and Wu (2010) has been formulated in the same way as Wu (2005), the code
presented in Appendix A3. of that paper can be used to solve for p̂.
7
Unique Linkage Estimator
The Unique Linkage (UL) Estimator has been used in multiple-frame surveys such as SESTAT [NCSES, b,a].
The idea is that each sampled unit is linked to just one frame, and information on that nuit from the other
frames is ignored for estimation purposes.
1. Determine an order for the Q frames. Denote the rank of frame q in this order by R(q).
One method for determining the order is as follows:
(a) Calculate Vb (Ŷ (q) ), for q = 1, . . . , Q.
(b) Order the surveys by increasing value of Vb (Ŷ (q) ).
8
2. Let M [i] denote the set of frames which contain unit i. Calculate weight modification factors
1 if R(q) = mink∈M [i] R(k)
(q)
mi =
, q = 1 . . . , Q, i ∈ Sq
0 otherwise
3. Calculate the modified weights
(q)
w̃i
(q)
(q)
= mi wi , q = 1, . . . , Q, i ∈ Sq .
4. The Unique Linkage estimator is given by
ŶU L =
Q X
X
(q)
w̃i yi .
q=1 i∈Sq
8
The Multiplicity Estimator
Compared to previous estimators we’ve considered, the multiplicity estimator requires one additional assumption: that the number of frames to which every sampled unit belongs is known. The number of frames
on which a unit is listed is referred to as the multiplicity of that unit. The multiplicity estimator for multipleframe surveys was first proposed by Mecatti [2007].
1. Let mi be the multiplicity of unit i, i.e. the number of frames that include unit i.
2. Calculate the modified weights of the sampled units
(q)
w̃i
(q)
= wi m−1
i , q = 1, . . . , Q; i ∈ Sq
3. Then the Mecatti estimator is given by
ŶM =
Q X
X
(q) (q)
w̃i yi .
q=1 i∈Sq
9
The Single-frame Estimator
This estimation method requires the following additional assumption: that the records of every sampled
unit can be identified on all the frames that include it. This is stronger than the assumption required for
the Mecatti estimator. The single-frame estimator was first proposed for dual-frame surveys by Kalton and
Anderson [1986] and then extended to the Q-frame case by Lohr and Rao [2006].
(q)
1. Let w[i] be the weight on frame Aq of population unit i. Calculate the modified weights of the sampled
units

−1
Q
X
1 
(q)
w̃[i] = 
, q = 1, . . . , Q; i ∈ Sq .
(j)
j=1 w[i]
2. Then the single-frame estimator is
ŶSF =
Q X
X
q=1 i∈Sq
9
(q) (q)
w̃i yi .
10
Variance Estimation
As stated in the introduction, we assume only that the frame and domain membership is known, along with
the frame sampling weights and value of the response variable. Most of the estimators presented in previous
sections require estimates of the variance of estimates of the population mean or total. Typically this would
be done using replicate weights or Jacknife methods, both of which assume knowledge of the sampling design
and may require knowledge of design variables not on the available data set. Therefore, in this section we
present several options for how the variances can be estimated in the absence of additional information.
10.1
Variance Estimation for the Pseudo-Empirical Likelihood Estimator
Step 2. of the Pseudo-Empirical Likelihood Estimator presented in section 6 uses variance estimates of the
estimated domain means from each frame. The variance of the Hajek estimators of the domain mean can be
written as follows:
" (q) #
Ŷk
(q)
ˆ
V (ȲkH ) = V
(q)
N̂k
P

(q)
yi
(q) wi
i∈Sk

= V P
(q)
(q) wi
i∈Sk
"P
#
(q)
i∈S (q) Iki wi yi
= V
,
(8)
P
(q)
i∈S (q) Iki wi
where
Iki =
1
0
if unit i is in domain Dk
.
otherwise
The required variances are therefore all variances of some ratio estimator, for some frame q = 1, . . . , Q and
domain Dk , k = 1, . . . , d.
If SRS is used to select the sample from frame q, then the weights cancel out and the ratio estimator from
(8) can be written as
P
i∈S Iki yi
(q)
B̂k = P q
.
i∈Sq Iki
(q)
The variance of B̂k
can be estimated by (see for example Lohr [2010, p. 126])
X
1
1
n(q)
(q)
(q)
(yi Iki − B̂k Iki )2
V̂ (B̂k ) =
1 − (q)
(q)
(q)
(q)
N
n − 1 i∈S
n I¯k
q
X
n(q)
1
(q)
(q)
=
1 − (q)
(yi − B̂k )2 /nk .
(q)
N
n − 1 i∈S ∩D
q
If SRS is not used, we use the following weighted variance formula:
X
n(q)
1
(q)
(q)
(q)
V̂ (B̂k ) = 1 − (q)
w (yi − B̂k )2 P
(q)
N
n − 1 i∈S ∩D i
q
k
(9)
k
1
i∈Sq ∩Dk
(q)
,
wi
which simplifies to (9) when the weights are equal for all units sampled from frame q in domain k. If the
effective sample size for frame q is known, this should be used in place of n(q) .
10.2
Variance and Covariance Estimation for the Hartley and Fuller-Burmeister
Estimators
The generalized Hartley estimator and Fuller-Burmeister estimator are functions of Cq , the covariance matrix
(q)
of Ŷd , for all domains d covered by frame q, for q = 1, . . . , Q. Firstly we derive formulas for the variances
in this matrix, and then the covariances.
10
10.2.1
Variances
Let
i ∈ Uq ∩ Dk1
otherwise,
yi
0
ui =
where Uq ∩ Dk1 is the set of all population units from frame q in domain k1. Then
X
(q)
Yk1 =
ui .
i∈Uq
(q)
Assuming that Yk1 is estimated by the Horvitz-Thompson estimator
(q)
−1(q)
X
Ŷk1 =
πi
ui ,
(10)
i∈S (q)
the variance of this estimator is given by
(q)
V
(q)
(Ŷk1 )
=
X
1 − πi
i∈Uq
πi
(q)
!
u2i
(q)
+
(q) (q)
XX
πij − πi πj
i∈Uq j6=i
πi πj
(q) (q)
!
ui uj .
(11)
−1(q)
(q)
We assume that only the sampling weights are known, and that wi ≈ πi
for all units i on frame
q, q = 1, . . . , Q. We consider the following three approximations to the variance (11):
1.
2
n(q)
su
(q)
(q) 2
b
V (Ŷk1 ) = (N ) 1 − (q)
,
N
n(q)
where
s2u =
X
1
(ui − ū)2 .
− 1 i∈S
n(q)
q
This is the unbiased estimator of (10) in the case that SRSWOR was used to select the sample from
frame q.
2.
(q)
Vb (Ŷk1 ) =
X
(q)
(q)
wi (wi
− 1)yi2 .
i∈Sq
This is obtained by assuming that πij ≈ πi πj ∀ i, j, i.e. that units are sampled independently, in which
case we can write (11) as
!
X 1 − π (q)
(q)
i
b
V (Ŷk1 ) ≈
yi2 .
(q)
π
i∈Uq
i
An unbiased estimate of this is given by
(q)
(q)
Vb (Ŷk1 )
=
X
1 − πi
−1(q)
πi
(q)
πi
i∈Sq
=
X
(q)
!
(q)
wi (wi
yi2
− 1)yi2 .
i∈Sq
We would expect this estimator to do less well when the sampling fraction is large, in which case the
assumption that the second term in the variance is close to zero is unlikely to hold.
11
3.
(q)
Vb (Ŷk1 ) =
n(q) X
(q)
−1(q)
2
(1 − πi )(ui πi
− A(q)
s ) ,
n(q) − 1 i∈S
(12)
q
where
(q)
1 − πi
X
A(q)
s =
i∈Sq
P
j∈Sq (1
−
−1(q)
ui πi
.
(q)
πj )
Henderson [2006] compared nine estimators of the variance of the Horvitz-Thompson estimator (11).
The estimator (12), first derived by Hajek (1964), seemed to perform well in many cases. However, it
does take significantly longer to compute than estimates 1. and 2.
10.2.2
Covariances
(q1)
(q2)
We assume that samples from different frames are independent, in which case Cov(Ŷk1 , Ŷk2 ) = 0 for
all q1 6= q2. Therefore, we only need to consider the covariance between domain estimates for estimates
calculated from the same frame.
Let
yi i ∈ Uq ∩ Dk2
vi =
0 otherwise,
An unbiased estimator of the covariance between two Horvitz-Thompson estimators
X
X
(q)
(q)
Ŷk1 =
wi ui , and Ŷk2 =
wi v i ,
i∈Sq
i∈Sq
is [Särndal et al., 1992]
d Ŷ (q) , Ŷ (q) )
Cov(
k1
k2
=
(q)
(q) (q)
X X πij
− πi πj ui vj
(q) (q)
πij (q)
π π
i∈Sq j∈Sq
=
i
(q)
X
(1 − πi )
i∈Sq
ui
vi
(q)
πi
(q)
πi
+
(13)
j
(q)
(q) (q)
X X πij
− πi πj ui
i∈Sq j6=i
(q)
πij
(q)
πi
vj
(q)
.
πj
Since it is assumed that the sampling weights of only the sampled units are known, we need to approximate
(13). We again consider three possibilities, based on different assumptions:
1.
(q)
(q)
(q) X
X X (N (q) )2
N (q) N (q) − 1
d Ŷ (q) , Ŷ (q) ) = N (N − n )
Cov(
−
ui vj .
u
v
+
i i
k1
k2
(n(q) )2
(n(q) )2
n(q) n(q) − 1
i∈S
i∈S j6=i
q
q
This is obtained from (13) under the assumption that SRSWOR was used, in which case we know
(q)
(q)
πi = n(q) /N (q) and πij = n(q) /N (q) (n(q) − 1)/(N (q) − 1), i 6= j.
2.
d Ŷ (q) , Ŷ (q) ) =
Cov(
k1
k2
X
(q)
(q)
wi (wi
− 1)ui vi .
i∈Sq
(q)
(q) (q)
This is obtained from (13) under the assumption that πij ≈ πi πj
term is approximately equal to zero.
∀ i, j. In this case the second
3.
d Ŷ (q) , Ŷ (q) ) =
Cov(
k1
k2
X
i∈Sq
(q)
(1 − πi )
ui
(q)
πi
vi
(q)
πi


−1 
(q)
(q)
X X
1 − (1 − πi )(1 − πj )
ui vj
 
+
1 − 
 (q) (q) .
P
(q)
πi πj
i∈Sq j6=i
l∈Sq (1 − πl )
12
This is obtained based on substituting the following approximation [Hájek, 1964] for the joint selection
probabilities:


(q)
(q)
(q)
(q) (q)  1 − (1 − πi )(1 − πj ) 
πij = πi πj
P
(q)
(q)
l∈Uq πl (1 − πk )
and approximating the denominator of this term by
X
(q)
(1 − πl ).
l∈Sq
Similar to the methods for estimating variances, method 2. is the most computationally efficient closely
followed by method 1. The third method requires significantly more computation time. We’d expect method
1 to be most accurate if there is small weight variation between units on the same frame. If this is not the
case then method 2. may give more accurate results, particularly if the sampling fraction is small, in which
case the independence assumptions are more reasonable.
References
W.A. Fuller and L.F. Burmeister. Estimators for samples selected from two overlapping frames. In Proceedings
of the Social Statistics Section, pages 245–249. American Statistical Association, 1972.
Jaroslav Hájek. Asymptotic theory of rejective sampling with varying probabilities from a finite population.
Annals of Mathematical Statistics, 35(4):1491–1523, 1964.
H. O. Hartley. Multiple frame surveys. In Proceedings of the Social Statistics Section, pages 203–206.
American Statistical Association, 1962.
Tamie Henderson. Estimating the variance of the horvitz-thompson estimator. Master’s thesis, School of
Finance and Applied Statistics, The Australian National University, October 2006.
G. Kalton and D.W. Anderson. Sampling rare populations. Journal of the Royal Statistical Society, 149:
65–82, 1986.
Sharon L. Lohr. Sampling: Design and Analysis. Duxbury Press, 2 edition, 2010.
Sharon L. Lohr and J. N. K. Rao. Estimation in multiple-frame surveys. Journal of the American Statistical
Association, 101(475):1019–1030, 2006. doi: 10.1198/016214506000000195.
Fulvia Mecatti. A single frame multiplicity estimator for multiple frame surveys. Survey methodology, 33(2):
151–157, 2007.
National Science Federation NCSES. Scientists and engineers statistical data system (sestat), 2010: Technical
notes. ncsesdata.nsf.gov/us-workforce/2010/ses_2010_tech_notes.pdf, a. Accessed: 01-04-2016.
National Science Federation NCSES. Current design of the sestat surveys. www.nsf.gov/statistics/
srs07202/content.cfm?pub_id=1715&id=3, b. Accessed: 01-04-2016.
J. N. K. Rao and Changbao Wu. Pseudo–empirical likelihood inference for multiple frame surveys. Journal
of the American Statistical Association, 105(492):1494–1503, 2010.
C.E. Särndal, B. Swensson, and J.H. Wretman. Model Assisted Survey Sampling. Springer series in statistics.
Springer-Verlag, 1992.
C.J. Skinner. On the efficiency of raking ratio estimation for multiple frame surveys. Journal of the American
Statistical Association, 86:779–784, 1991.
Changbao Wu. Algorithms and R codes for the pseudo empirical likelihood method in survey sampling.
Survey Methodology, 31(2):239, 2005.
13
Download