Double Sampling (Chapter 14)

To this point, we have considered a number of sampling or estimation methods (e.g., ratio and
regression estimation, stratified sampling) whereby auxiliary information was used to
estimate a population mean or total. In all of these cases, it was assumed that we knew
population information on the auxiliary variable.
Double sampling is a sampling method which makes use of auxiliary data where the auxiliary
information is obtained through sampling. More precisely, we first take a sample of units
strictly to obtain auxiliary information, and then take a second sample where the variable(s)
of interest are observed. It will often be the case that this second sample is a subsample
of the preliminary sample used to acquire auxiliary information. Two common situations
where double sampling is employed to use auxiliary information to improve the estimate of
some response variable are outlined below.
1. If the variable of interest is “expensive” to measure, but a related variable is much
“cheaper,” we might first sample many of the sampling units and measure the “cheaper”
(auxiliary) variable, and only measure the response variable on a subsample (or smaller
sample).
• A common example of this scenario is any case where a visual estimate of some
response variable can be made much more quickly (cheaply) than measuring the
response outright. For example, if we want to estimate the number of leaves in
some area, it could be quite time-consuming to count the number of leaves in a
number of 18x18 inch quadrats, whereas a visual estimate of the number of leaves
in such an area is relatively simple to make. If a definite relationship between the
visual and actual number of leaves in a quadrat can be established, we could make
efficient use of the visual estimates (even if they are highly biased) to improve an
estimate of the total number of leaves through double sampling.
2. A common problem in many surveys (as discussed in this class) is that of potential
bias from nonresponse. Double sampling can be used with stratification principles to
adjust for nonresponse in surveys by taking a second sample of the nonrespondents.
To illustrate the use of double sampling, consider the following example.
Example: Suppose it is desired to estimate the total biomass of vegetation and average
biomass (gm/m²) on a 1000 m² area. A systematic sample of twenty 1 m² plots is selected
and a visual estimate of the total grams of biomass is recorded. Five of the twenty plots are
then randomly selected and the total biomass is carefully determined on these 5 plots. The
visual and actual measurements of total biomass are given below.
Visual Only (15 plots)
60  60  200  0  100  80  20  150  100  80  60  60  20  150  60

Visual and Actual (5 subsampled plots)
Visual:  20  80  150  40  80
Actual:  14  62  155  36  71
Ratio Estimation in Double Sampling: Let
$n'$ = the total # of units observed (the initial sample),
$n$ = the # of units in the second sample, where both x and y are observed.
In the biomass example, $n' = 20$ and $n = 5$.
Suppose we want to estimate $\tau = \sum_{i=1}^{N} y_i$ (total biomass in the 1000 m² area).
• First, compute the sample ratio for the second (smaller) sample:
$$r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}.$$
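• Recall that with ratio estimation, we assumed we knew the population total for X,
namely $\tau_X$, and we then estimated the total for Y by: $\hat\tau_r = r\tau_X$. But here, we need to
estimate $\tau_X$ (since we only have a sample of the x-values). How? Expand the total of the
x-values over the $n'$ units of the initial sample to the population: $\hat\tau_X = \frac{N}{n'}\sum_{i=1}^{n'} x_i$.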
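• With this estimate for $\tau_X$ then, a ratio estimate of $\tau$ is given by:
$\hat\tau_r = r \cdot \hat\tau_X$, where via the Delta Method, the variance of $\hat\tau_r$ is:
$$\mathrm{Var}(\hat\tau_r) = N(N - n')\frac{\sigma^2}{n'} + N^2\left(\frac{n' - n}{n'}\right)\frac{\sigma_r^2}{n},$$
where:
$$\sigma_r^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - Rx_i)^2, \qquad \sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \mu)^2.$$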
• With ratio estimation as before, we had $n' = N$, so that the 1st term of the variance
expression above was just zero. And then:
$$\mathrm{Var}(\hat\tau_r) = N^2\left(\frac{N - n}{N}\right)\frac{\sigma_r^2}{n}$$
as before.
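• The estimated variance is given by:
$$\widehat{\mathrm{Var}}(\hat\tau_r) = N(N - n')\frac{s^2}{n'} + N^2\left(\frac{n' - n}{n'}\right)\frac{s_r^2}{n},$$
where:
$$s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - rx_i)^2, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 \ \text{(usual sample variance)}.$$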
• The value of $s_r^2$ will be small (as before) if the relationship between the visual
estimates and the actual measurements is linear and goes through the origin.
• The ratio estimate of the mean response and corresponding standard error are given
by:
$$\hat\mu_r = \frac{\hat\tau_r}{N}, \qquad SE(\hat\mu_r) = \frac{SE(\hat\tau_r)}{N}.$$
Back to the Biomass Example: To estimate the total biomass in the 1000 m2 area or the
mean biomass, the following R code was used.
> x <- c(20,80,150,40,80,60,60,200,0,100,80,20,150,100,80,60,
+        60,20,150,60)
> y <- c(14,62,155,36,71)
> N <- 1000                      # Population size
> np <- 20                       # Initial sample size: n'
> n <- 5                         # Subsample size
> x1 <- x[1:5]                   # X-values for the subsample
> r <- sum(y)/sum(x1)            # Estimate of actual / visual:
> r                              #   r = sum(y_i) / sum(x_i) over the n subsampled plots
[1] 0.9135135
> tau.hat.x <- (N/np)*sum(x)     # Estimate of tau.x for all 1000 plots:
> tau.hat.x                      #   tau.hat.x = (N/n') * sum(x_i) over the n' plots
[1] 78500
> tau.hat.r <- r*tau.hat.x       # Estimate of total biomass:
> tau.hat.r                      #   tau.hat.r = r * tau.hat.x
[1] 71710.81
> sr2 <- (1/(n-1))*sum((y-r*x1)^2)     # s_r^2 = (1/(n-1)) * sum((y_i - r*x_i)^2)
> # Estimated variance:
> #   Var.hat(tau.hat.r) = N(N - n') s^2/n' + N^2 ((n' - n)/n') s_r^2/n
> var.tau.hat.r <- (N*(N-np)*var(y))/np + N^2*((np-n)/np)*sr2/n
> sqrt(var.tau.hat.r)            # SE of tau.hat.r = sqrt(Var.hat(tau.hat.r))
[1] 12613.57
> mu.hat.r <- tau.hat.r/N        # Estimate of biomass per plot (gm/m sq):
> mu.hat.r                       #   mu.hat.r = tau.hat.r / N
[1] 71.71081
> sqrt(var.tau.hat.r)/N          # SE of mu.hat.r = sqrt(Var.hat(tau.hat.r)) / N
[1] 12.61357
To investigate how much improvement these estimators with double sampling give, consider
obtaining estimates of total biomass and biomass per plot based just on the y-values (i.e.:
an SRS of size n = 5).
# Unbiased Estimates
# =============
> N*mean(y)                  # Estimate of total biomass (SRS): tau.hat = N * ybar
[1] 67600
> mean(y)                    # Estimate of mean biomass (SRS): mu.hat = ybar
[1] 67.6
> N*sqrt((1/n)*var(y))       # SE of estimated total biomass (SRS),
[1] 24034.56                 #   ignoring the fpc: SE(tau.hat) = N * sqrt(s^2/n)
> sqrt((1/n)*var(y))         # SE of estimated mean biomass (SRS),
[1] 24.03456                 #   ignoring the fpc: SE(mu.hat) = sqrt(s^2/n)
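The size of the improvement can be summarised with one number. The following is a minimal sketch that recomputes both standard errors from the biomass data and forms their ratio; the object names `se.double`, `se.srs` and `se.ratio` are illustrative choices and do not appear in the original session.

```r
# Biomass data: visual estimates on all 20 plots, actual biomass on the first 5
x <- c(20,80,150,40,80,60,60,200,0,100,80,20,150,100,80,60,
       60,20,150,60)
y <- c(14,62,155,36,71)
N <- 1000; np <- 20; n <- 5
x1 <- x[1:n]

# Double-sampling ratio estimator: estimated SE of the total
r   <- sum(y)/sum(x1)
sr2 <- sum((y - r*x1)^2)/(n - 1)
se.double <- sqrt(N*(N - np)*var(y)/np + N^2*((np - n)/np)*sr2/n)

# SRS estimator from the 5 y-values alone (fpc ignored, as in the notes)
se.srs <- N*sqrt(var(y)/n)

# Ratio of standard errors; values below 1 favour double sampling
se.ratio <- se.double/se.srs
round(c(se.double = se.double, se.srs = se.srs, se.ratio = se.ratio), 3)
```

With these data the double-sampling standard error (12613.57) is a little over half the SRS standard error (24034.56), even though only 5 plots were measured carefully in both cases.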
Allocation in Double Sampling for Ratio Estimation
It was mentioned at the outset that one major reason for using double sampling is because
auxiliary information may be cheaper to measure than the main variable of interest. Hence,
the cost of sampling at the first stage (auxiliary info) as compared to the second stage (both
auxiliary and response info) is very important in deciding how to allocate sample units at
the two stages.
Suppose the total cost of sampling is fixed at C, and let $C = c'n' + cn$, where:
$c'$ = cost of observing x on one unit (visual estimate),
$c$ = cost of observing y on one unit (actual estimate).
For a fixed cost ($C$, $c'$, $c$), we want to find the optimal values of $n$ and $n'$, that is, those values
that minimize the variance of the mean or total estimator. These optimal values can be
shown to satisfy:
$$\frac{n}{n'} = \sqrt{\frac{c'}{c}\left(\frac{\sigma_r^2}{\sigma^2 - \sigma_r^2}\right)}.$$
• $s_r^2$ is approximately unbiased for $\sigma_r^2$ and $s^2$ is unbiased for $\sigma^2$. So we could use $s_r^2$ and $s^2$
from a preliminary or prior study as “guesses” of these variances to answer
the allocation question.
• In the biomass example, we found $s_r^2 = 117.2$ and $s^2 = 2888.3$. Hence, there was much
more variation among the y’s than in the y’s about the ratio line $rx$. This in turn makes
$\sigma_r^2/(\sigma^2 - \sigma_r^2)$ small. Is this typical? When?
• Using $s_r^2/(s^2 - s_r^2) = 0.0423$ as an estimate of the ratio $\sigma_r^2/(\sigma^2 - \sigma_r^2)$, we can compute allocations
for a variety of cost ratios:
c'/c    n/n'
0.1     .065
0.2     .091
0.9     .195
1.0     .206
• So, for example, if the cost ratio is $c'/c = 0.1$ (the response is 10 times more expensive to measure),
then the subsample (of the y’s) should be about 6.5% the size of the visual
sample (of the x’s) (i.e., we would measure y on roughly every 15th unit).
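The allocation rule can be turned into sample sizes once a budget is fixed. The sketch below evaluates $n/n' = \sqrt{(c'/c)\, s_r^2/(s^2 - s_r^2)}$ for the cost ratios considered here and then solves $C = c'n' + cn$ for $n'$ and $n$. The budget of 500 cost units and the unit cost $c' = 1$ are hypothetical values chosen only for illustration, and the function name `alloc.frac` is likewise an assumption.

```r
# Variance estimates from the biomass example
sr2 <- 117.2
s2  <- 2888.3
var.ratio <- sr2/(s2 - sr2)      # estimate of sigma_r^2 / (sigma^2 - sigma_r^2)

# Optimal subsampling fraction n/n' for a given cost ratio c'/c
alloc.frac <- function(cost.ratio, var.ratio) sqrt(cost.ratio*var.ratio)

cost.ratio <- c(0.1, 0.2, 0.9, 1.0)
frac <- alloc.frac(cost.ratio, var.ratio)

# Hypothetical budget and unit costs (illustrative only):
# cp = c' = cost of one visual estimate, cc = c = cost of one actual measurement
budget <- 500
cp <- 1
cc <- cp/cost.ratio

# Solve budget = c' n' + c n with n = frac * n'
np.opt <- budget/(cp + cc*frac)
n.opt  <- frac*np.opt

data.frame(cost.ratio, frac = round(frac, 3),
           np = floor(np.opt), n = floor(n.opt))
```

Rounding both sample sizes down keeps the design within budget; in practice one would also check that the resulting n is large enough for the variance estimates to be usable.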
Regression Estimation for Double Sampling: Suppose there is an underlying relationship
(linear or otherwise) between y and x which does not go through the origin. Then regression estimation will be more appropriate than ratio estimation.
Consider the simple linear regression model: y = A + Bx on our sample of size n.
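• The regression estimate of the total with $\tau_x$ known is $\hat\tau_L = N(a + b\mu_x) = Na + b\tau_x$, where $a$ and $b$ are the least-squares intercept and slope.
With double sampling, however, $\tau_x$ is unknown and must be estimated from the initial sample, as in the ratio case: $\hat\tau_x = \frac{N}{n'}\sum_{i=1}^{n'} x_i$.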
• The resulting estimated variance of $\hat\tau_L$ is:
$$\widehat{\mathrm{Var}}(\hat\tau_L) = N(N - n')\frac{s^2}{n'} + N^2\left(\frac{n' - n}{n'}\right)\frac{1}{n(n-2)}\sum_{i=1}^{n}(y_i - a - bx_i)^2.$$
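As an illustration, the regression estimator can be applied to the biomass data used for ratio estimation. This is a sketch under an assumption not spelled out in the notes: the estimated total is taken to be $N(a + b\bar{x}')$, where $a$ and $b$ are the least-squares intercept and slope from the $n = 5$ subsampled plots and $\bar{x}'$ is the mean visual estimate over all $n' = 20$ plots. The object names are illustrative.

```r
# Biomass data: visual estimates on all 20 plots, actual biomass on the first 5
x <- c(20,80,150,40,80,60,60,200,0,100,80,20,150,100,80,60,
       60,20,150,60)
y <- c(14,62,155,36,71)
N <- 1000; np <- 20; n <- 5
x1 <- x[1:n]

# Least-squares fit of actual on visual biomass in the subsample
fit <- lm(y ~ x1)
a <- unname(coef(fit)[1])    # intercept
b <- unname(coef(fit)[2])    # slope

# Regression estimates of the mean and total (assumed form, see lead-in)
mu.hat.L  <- a + b*mean(x)   # mean(x) uses all n' = 20 visual estimates
tau.hat.L <- N*mu.hat.L

# Estimated variance from the formula above
sse <- sum(resid(fit)^2)     # sum of (y_i - a - b x_i)^2
var.tau.hat.L <- N*(N - np)*var(y)/np +
                 N^2*((np - n)/np)*sse/(n*(n - 2))
se.tau.hat.L <- sqrt(var.tau.hat.L)

c(tau.hat.L = tau.hat.L, se = se.tau.hat.L)
```

With only five subsampled plots the slope and intercept are estimated from very little data, so this is best read as a check on the arithmetic rather than as evidence for preferring regression over ratio estimation here.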
Double Sampling for Stratification: Consider now the use of auxiliary information for the
purpose of stratifying the original population into groups which are homogeneous within
and heterogeneous between for the variable of interest.
Recall that to take a stratified random sample, we need to know which strata the individual sampling units belong to before the sampling is done, as it is with this information
that the sampling is actually performed. In cases where the stratum identifications were not
known, but the relative population stratum sizes (Nh /N ) could be assumed known, poststratification techniques can be applied to obtain desired estimators of the population mean
or total.
If, however, these relative population stratum sizes are not known, then they need to be
estimated. One way to accomplish this is to take an initial sample whose sole purpose is
to estimate these relative stratum sizes, and then take a second subsample from this first
sample where we stratify the initial sample according to the estimated relative stratum sizes.
Before describing how double sampling is used here, recall the following notation for stratified
random sampling. Let:
N = population size,
Nh = the size of stratum h in the population,
Wh = Nh /N = the proportion of the population in stratum h.
Double sampling in this stratified framework is conducted in two steps:
1. Select an SRS of size $n'$. Here, we find $n'_h$, the number of observations from stratum h in this “easy-to-take” sample, h = 1, . . . , L.
• Let $w_h = n'_h / n'$ = the proportion of the sample in stratum h.
2. Take a stratified random sample from the $n'$ units in the initial sample.
• Let $n_h$ = the number of units sampled from stratum h.
• Normally, the stratified random sample estimates the population mean $\mu$ using
$$\bar{y} = \sum_{h=1}^{L}\left(\frac{N_h}{N}\right)\bar{y}_h.$$
The difference here is that we are estimating $N_h/N$ with $n'_h/n'$ based on the initial sample. This gives the double sampling stratified estimator of the mean:
$$\bar{y}_d = \sum_{h=1}^{L}\left(\frac{n'_h}{n'}\right)\bar{y}_h, \quad \text{where } \bar{y}_h = \text{the sample mean in the } h\text{th stratum.}$$
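• The variance and estimated variance of $\bar{y}_d$ are given by:
$$\mathrm{Var}(\bar{y}_d) = \left(\frac{N - n'}{N}\right)\frac{\sigma^2}{n'} + \underbrace{\sum_{h=1}^{L}\frac{W_h\sigma_h^2}{n'}\left(\frac{n'_h}{n_h} - 1\right)}_{\text{extra source of variability from the secondary sample}},$$
estimated by:
$$\widehat{\mathrm{Var}}(\bar{y}_d) = \frac{N-1}{N}\sum_{h=1}^{L}\left(\frac{n'_h - 1}{n' - 1} - \frac{n_h - 1}{N - 1}\right)\frac{w_h s_h^2}{n_h} + \frac{N - n'}{N(n' - 1)}\sum_{h=1}^{L} w_h(\bar{y}_h - \bar{y}_d)^2.$$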
Notes:
• Suppose we take an SRS of 100 plots, where we plan to subsample 10 of these plots
to measure the variable of interest. Normally, we would sample the “subsample plots”
during the course of sampling the original 100 plots for efficiency reasons (i.e.: the
sampling would be done all at once). However, for a stratified random sample in this
setting, we don’t know the strata identifications until we sample the 100 plots.
• An alternative to this double sampling for stratification procedure might be to take
a systematic sample at the first stage and at the second stage. This might be better at guaranteeing coverage of the area of interest. At the second stage (to mimic
stratified random sampling), we could systematically select sites in each of the strata
proportionally.
• In general, we may want to sample more units than dictated by the proportional stratum sizes to guarantee that at least 2 samples from each stratum are taken. Why?
Example (A Way to Handle Nonresponse): Consider a population of N = 400 individuals
on which we want to conduct a mail survey. The parameter of interest in this study is the
proportion of people who respond with a “Yes” to a particular question. Call this population
proportion p.
• Suppose an initial SRS of size $n' = 120$ people is taken where 30 people respond and
the other 90 do not. On the basis of this sample, we will view the respondents as
one stratum and the nonrespondents as a second stratum (call them strata 1 & 2
respectively). We view them as separate strata to allow for the (likely) possibility that
the nonrespondents might have responded differently than the respondents, had they
responded. So L = 2 here.
• In terms of the stratification notation, we have:
$n'_1 = 30$ respondents, and $n'_2 = 90$ nonrespondents.
We then guess that $w_1 = n'_1/n' = 0.25$ and $w_2 = n'_2/n' = 0.75$.
• Suppose 20 of the 30 respondents answered “Yes” to the question. Our initial estimate
of the proportion who responded “Yes” is thus: $\bar{y}_1 = 20/30 = 0.667$.
• To try to obtain more information about the population of nonrespondents, suppose
we now take an SRS of 25 nonrespondents from the 90 in our original sample. As the
preliminary survey was a mail survey (which often has a low response rate), we might
try a phone survey or even a face-to-face interview at this second stage. Suppose in
this intensive follow-up, 20 of the 25 nonrespondents sampled respond and 4 answer
“Yes” to the original question.
• Viewing this two-stage sample as a double sample, we have the following:
Stratum              Initial Sample Size   Secondary Sample Size   Stratum Mean
Respondents (1)      n'_1 = 30             n_1 = 30                ybar_1 = 0.667
Nonrespondents (2)   n'_2 = 90             n_2 = 25                ybar_2 = 4/20 = 0.2
• The estimated stratum mean for the nonrespondents stratum, namely $\bar{y}_2 = 0.2$, assumes
that the 20 people who actually responded to the follow-up are representative of the 25 we attempted to sample.
• The resulting estimate of the proportion answering “Yes” to the question is:
$$\hat{p}_d = \bar{y}_d = \sum_{h=1}^{2} w_h\bar{y}_h = (.25)(.667) + (.75)(.2) = 0.317,$$
a substantial decrease from the estimate of 0.667 which ignored the nonrespondents.
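The point estimate, together with a standard error from the estimated-variance formula for double sampling with stratification, can be computed as in the following sketch. The function `double.strat` and its argument names are hypothetical helpers written for this illustration. One assumption is made loudly: for the nonrespondent stratum, the secondary sample size is taken to be the 20 follow-up responses actually obtained rather than the 25 people contacted.

```r
# Double-sampling stratified estimator of a mean and its estimated SE.
#   N     : population size
#   nph   : first-phase stratum counts n'_h
#   ylist : list with one vector of second-phase observations per stratum
double.strat <- function(N, nph, ylist) {
  np  <- sum(nph)                   # n'
  wh  <- nph/np                     # estimated stratum weights w_h
  nh  <- sapply(ylist, length)      # second-phase sample sizes n_h
  ybh <- sapply(ylist, mean)        # stratum means
  s2h <- sapply(ylist, var)         # stratum sample variances
  yd  <- sum(wh*ybh)                # estimator of the mean
  v1  <- ((N - 1)/N)*
         sum(((nph - 1)/(np - 1) - (nh - 1)/(N - 1))*wh*s2h/nh)
  v2  <- ((N - np)/(N*(np - 1)))*sum(wh*(ybh - yd)^2)
  list(estimate = yd, se = sqrt(v1 + v2))
}

# Nonresponse example, coded 1 = "Yes", 0 = "No"
resp    <- rep(c(1, 0), times = c(20, 10))   # 20 of 30 respondents said "Yes"
# Assumption: use the 20 follow-up responses actually obtained (4 "Yes")
nonresp <- rep(c(1, 0), times = c(4, 16))

out <- double.strat(N = 400, nph = c(30, 90), ylist = list(resp, nonresp))
out$estimate
out$se
```

The estimate agrees with the hand calculation above up to rounding of $\bar{y}_1$, and the standard error is about 0.07. Treating the 20 responders as the stratum 2 sample ignores any difference between them and the 5 who still did not respond.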