Cluster Sampling & Systematic Sampling

Recall that cluster sampling is where we first divide the population into “clusters,” then
select a simple random sample (SRS) of these clusters, and sample every unit within the
selected clusters. This is a two-stage sampling plan, where we employ an SRS of clusters at
the first stage and a census within these selected clusters at the second stage.
Systematic sampling is where we first select a random starting point in the population,
and then sample every mth unit beginning at that starting point. This also is a two-stage
sampling plan, where we employ an SRS of size 1 from the list of potential starting points,
and then census the sampling units at multiples of m units from the initial unit.
• It turns out that cluster sampling and systematic random sampling, as defined, share
identical sampling structures as two-stage sampling plans. To see this, consider the
following general situation. Suppose the population is first partitioned into mutually
exclusive groups, known as primary units, each of which contains a number of sampling
units, known as secondary units. Selection occurs on the primary units, and then every
secondary unit within a selected primary unit is sampled.
• This should be clear for cluster sampling. With systematic sampling, the primary units
are groups of observations m units apart in the population list. For example, if we want
every 5th member of a population, then there are 5 primary units, and we take an SRS
of 1 of these units and then examine every secondary unit in the primary unit selected.
Main Idea: The important point for these two sampling plans is that whenever a primary
unit is selected, all secondary units within are sampled. In truth then, the primary units
are the sampling units in a cluster or systematic sample, even though measurements or
observations are actually made on the secondary units.
Thompson lists three special considerations for these two sampling plans which warrant
further discussion and separate consideration (pages 129 & 131).
1. In cluster sampling, the size of the cluster may serve as auxiliary information that may
be used either in selecting clusters with unequal probabilities (PPS Sampling) or in
forming ratio estimators.
2. The size and shape of clusters may affect efficiency.
3. In systematic sampling, it is not uncommon to have a sample size of one; that is, a
single primary unit.
After defining the relevant notation for these sampling plans, each of the special considerations above will be addressed by way of examples.
Notation: Consider taking a cluster sample from some population with response variable y.
We let:
N = the number of primary units (clusters) in the population,
n = the number of primary units (clusters) in the sample,
$M_i$ = the number of secondary units in the $i$th primary unit,
$M = \sum_{i=1}^{N} M_i$ = the total number of secondary units in the population,
$y_{ij}$ = the y-value of the $j$th secondary unit in the $i$th primary unit,
$y_i = \sum_{j=1}^{M_i} y_{ij}$ = the total of the y's in the $i$th primary unit (cluster totals),
$\tau = \sum_{i=1}^{N} y_i$ = the total of the y's in all the units,
$\mu = \tau / M$ = the mean of the y's per secondary unit,
$\mu_1 = \tau / N$ = the mean of the y's per primary unit.
Example: A sociologist wants to estimate the average per capita income in a certain small
city. As no list of resident adults is available, she decides that each of the city blocks will
be considered one cluster. The clusters are numbered on a city map from 1 to 415, and the
experimenter decides she has enough time and money to sample n = 25 clusters where every
household will be interviewed within the clusters (blocks) chosen. The data below give the
number of residents and the total income for each of the 25 blocks sampled. [Problem taken
from Scheaffer, Mendenhall, & Ott, Elementary Survey Sampling, page 248.]
Cluster   Number of Residents, M_i   Total Income per Cluster, y_i
   1                 8                       $192,000
   2                12                       $242,000
   3                 4                        $84,000
   4                 5                       $130,000
   5                 6                       $104,000
   6                 6                        $80,000
   7                 7                       $150,000
   8                 5                       $130,000
   9                 8                        $90,000
  10                 3                       $100,000
  11                 2                       $170,000
  12                 6                        $86,000
  13                 5                       $108,000
  14                10                        $98,000
  15                 9                       $106,000
  16                 3                       $100,000
  17                 6                        $64,000
  18                 5                        $44,000
  19                 5                        $90,000
  20                 4                        $74,000
  21                 6                       $102,000
  22                 8                        $60,000
  23                 7                        $78,000
  24                 3                        $94,000
  25                 8                        $82,000

$\sum_{i=1}^{25} M_i = 151$,   $\sum_{i=1}^{25} y_i = \$2{,}658{,}000$
Given that M = 2500 residents, use these data to estimate the average per capita income in
the city. Notation:
N = 415 blocks, n = 25 blocks, M = 2500 residents,
$M_i$ = the number of residents in the $i$th block,
$M$ = the total number of residents in all 415 blocks,
$y_{ij}$ = the income of the $j$th resident in the $i$th block,
$y_i$ = the total income of all residents on the $i$th block,
$\tau$ = the total income of all residents in the city,
$\mu$ = the mean income per resident,
$\mu_1$ = the mean income per block.
Aside: If the goal of a cluster sampling problem is to estimate $\tau$ or $\mu_1$, the population total
or mean, then we can treat this situation as an SRS of n clusters (or blocks in the example),
and estimate the mean and total as with an SRS:
• Estimate $\mu_1$ with
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{25}[2658000] = \$106{,}320,$$
with standard error:
$$SE(\bar{y}) = \sqrt{\left(\frac{N-n}{N}\right)\frac{s_u^2}{n}} = \sqrt{\left(\frac{415-25}{415}\right)\frac{1898226667}{25}} = \$8{,}447,$$
where: $s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ = the sample variance of the cluster totals ($y_i$'s).
• Estimate $\tau$ with $N\bar{y} = 415(106320) = \$44{,}122{,}800$, with standard error:
$$SE(\hat{\tau}) = \sqrt{N(N-n)\frac{s_u^2}{n}} = \sqrt{415(415-25)\frac{1898226667}{25}} = \$3{,}505{,}584.$$
• Note that all of the above computations are done strictly on the observations on the
primary units (or clusters). Considering only the cluster totals, we cannot estimate µ,
the average income per resident.
• However, here, we have an auxiliary variable (the number of residents on a block), so
we should take advantage of this to improve the SRS-based estimates of τ , µ1 .
Ratio Estimator with Cluster Sampling: In cluster sampling, when the primary unit total
$y_i$ is correlated with the cluster size $M_i$ (such that we expect $y_i = 0$ when $M_i = 0$), a ratio
estimator for $\tau$ & $\mu_1$ may be employed, where here the sample ratio is:
$$r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i} = \frac{\text{Sample total income}}{\text{Sample total \# of residents}},$$
so that the ratio estimator of the population total is: $\hat{\tau}_r = Mr$.
• As with earlier ratio estimators, this estimator is biased. The bias is typically negligible,
and so we look at the estimated variance as before.
• The approximate variance (Delta method) of $\hat{\tau}_r$ and corresponding estimated variance
are given by:
$$\mathrm{Var}(\hat{\tau}_r) \approx N(N-n)\cdot\frac{\sigma_r^2}{n} = \frac{N(N-n)}{n(N-1)}\sum_{i=1}^{N}(y_i - Rx_i)^2 = \frac{N(N-n)}{n(N-1)}\sum_{i=1}^{N}(y_i - M_i\mu)^2,$$
$$\widehat{\mathrm{Var}}(\hat{\tau}_r) = \frac{N(N-n)}{n(n-1)}\sum_{i=1}^{n}(y_i - rM_i)^2.$$
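As a quick numerical check, the following sketch evaluates the estimated variance of the ratio total directly from the formula above for the income data, and compares it with the equivalent route M · SE(r) used later in the example (with µx = M/N). The object names (`var.tauhat.r`, `se.tauhat.r`) are illustrative choices, not part of the original notes.

```r
# Data from the income example (25 sampled blocks out of N = 415).
N <- 415; n <- 25; M <- 2500
Mi <- c(8,12,4,5,6,6,7,5,8,3,2,6,5,10,9,3,6,5,5,4,6,8,7,3,8)
y  <- 1000 * c(192,242,84,130,104,80,150,130,90,100,170,86,108,
               98,106,100,64,44,90,74,102,60,78,94,82)

r <- sum(y) / sum(Mi)                  # sample ratio (income per resident)

# Estimated variance of the ratio total, written as in the formula above.
var.tauhat.r <- N * (N - n) / (n * (n - 1)) * sum((y - r * Mi)^2)
se.tauhat.r  <- sqrt(var.tauhat.r)

# Same quantity via M * SE(r), with mu_x = M / N.
sr2  <- sum((y - r * Mi)^2) / (n - 1)
se.r <- sqrt(((N - n) / N) * sr2 / (n * (M / N)^2))

all.equal(se.tauhat.r, M * se.r)       # the two routes should agree
se.tauhat.r
```

The agreement follows algebraically: multiplying SE(r)² by M² and substituting µx = M/N leaves N(N − n) s_r²/n.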
Back to the Example: In the income example, the sample ratio is:
$$r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i} = \frac{\text{sample total income}}{\text{sample \# residents}} = \frac{2658000}{151} = \$17{,}603 \text{ per resident}.$$
• This is a biased (ratio) estimate of $\mu$, the true average income per resident (secondary
unit), with standard error:
$$SE(r) = \sqrt{\left(\frac{N-n}{N\mu_x^2}\right)\frac{s_r^2}{n}}, \quad\text{where: } s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - rx_i)^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - rM_i)^2.$$
• Plugging in the true mean number of residents per block $\mu_x = M/N = 2500/415 \approx 6.02$ gives: SE(r) = $1621.
• To estimate the total income (using the unrounded values r = 17602.65 and SE(r) = 1621.409):
$\hat{\tau}_r = Mr = (2500)(17602.65) = \$44006.6K$, with standard error:
$SE(\hat{\tau}_r) = M \cdot SE(r) = (2500)(1621.409) = \$4053.5K$.
[Figure: scatterplot "Income vs. # Residents"; x-axis: # of Residents (Mi), from 0 to 12; y-axis: Income (yi), from 0 to 250K.]
• Note that this standard error ($4053.5K) is larger
than that found with an SRS earlier ($3505.6K).
Why do you think this happened?
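• If we know M (= 2500 here), we can estimate $\mu$ (average income per resident) by:
$$\hat{\mu} = \frac{\hat{\tau}}{M} = \frac{N}{M}\bar{y} = \frac{44122800}{2500} = \$17649 \quad (\text{close to } r = \$17603),$$
$$SE(\hat{\mu}) = SE\left(\frac{\hat{\tau}}{M}\right) = \frac{1}{M}SE(\hat{\tau}) = \frac{1}{2500}(3505584) = \$1402 \quad (\text{smaller than } SE(r) = \$1621).$$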
• The upshot of all this is that for cluster sampling, we have two basic options:
1. Work only with the primary units (clusters).
2. Use ratio estimation to make use of the relationship between primary and secondary units (if such a relationship exists).
• In either case, the sample size that drives the variation in the estimators is n (the number
of clusters sampled), not the number of secondary units observed. So we should use the method
which gives the smaller estimated variance in a particular problem.
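The comparison between the two options can be sketched in R for the income example. The code below puts both estimates of the mean income per resident side by side, then looks at the two residual variances and the correlation between block size and block income, which is one way to explore why the ratio estimator did not help here. The object names are illustrative, and M = 2500 is taken as known, as in the example.

```r
# Data from the income example (25 sampled blocks out of N = 415).
N <- 415; n <- 25; M <- 2500
Mi <- c(8,12,4,5,6,6,7,5,8,3,2,6,5,10,9,3,6,5,5,4,6,8,7,3,8)
y  <- 1000 * c(192,242,84,130,104,80,150,130,90,100,170,86,108,
               98,106,100,64,44,90,74,102,60,78,94,82)
fpc <- (N - n) / N
mux <- M / N                           # true mean number of residents per block

# Option 1: SRS of clusters, rescaled to a per-resident mean (tau-hat / M).
su2    <- var(y)                       # variance of the cluster totals
mu.srs <- N * mean(y) / M
se.srs <- (N / M) * sqrt(fpc * su2 / n)

# Option 2: ratio estimator r.
r        <- sum(y) / sum(Mi)
sr2      <- sum((y - r * Mi)^2) / (n - 1)   # variance about the line y = r * Mi
se.ratio <- sqrt(fpc * sr2 / (n * mux^2))

rbind(srs   = c(estimate = mu.srs, se = se.srs),
      ratio = c(estimate = r,      se = se.ratio))

# Diagnostics: the ratio estimator pays off only when sr2 < su2, i.e. when
# block totals are roughly proportional to block size.
c(su2 = su2, sr2 = sr2)
cor(Mi, y)
```

Both standard errors share the factor N(N − n)/n once scaled to the total, so the choice between the two methods comes down to whether s_r² or s_u² is smaller for the data at hand.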
Systematic Sampling: Recall that systematic sampling is the special case of cluster sampling
where each cluster is determined through some random starting point, and a single cluster
is chosen.
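Example: Suppose we are sampling days where we can afford to sample every fifth day.
This gives the following five clusters:
Cluster 1: days 1, 6, 11, 16, ...
Cluster 2: days 2, 7, 12, 17, ...
Cluster 3: days 3, 8, 13, 18, ...
Cluster 4: days 4, 9, 14, 19, ...
Cluster 5: days 5, 10, 15, 20, ...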
• Every sampling unit is in exactly one cluster.
• In systematic sampling, experimenters generally select ONE cluster from this group.
This gives a sample size of 1, which prohibits any type of inference. Why?
• There are two basic outlooks one takes toward this “problem”:
1. Take more than 1 systematic sample (replication).
2. Assume the one systematic sample taken is representative of the population (i.e.:
assume it is an SRS).
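The every-fifth-day example can be sketched in R as follows. The 30-day study period and the seed are assumptions made only for illustration; the point is that the five primary units partition the list of days, and a systematic sample is an SRS of size 1 from them.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
days <- 1:30                     # hypothetical 30-day study period
m    <- 5                        # sample every 5th day

# Primary unit k contains days k, k + m, k + 2m, ...
clusters <- split(days, (days - 1) %% m + 1)
clusters

# A systematic sample: SRS of size 1 from the m possible starting points.
start      <- sample(m, 1)
sys.sample <- clusters[[start]]
sys.sample

# Replication: drawing 2 starting points without replacement gives an SRS
# of 2 primary units, so a between-cluster variance can be estimated.
starts <- sample(m, 2)
clusters[starts]
```

With a single starting point there is only one primary unit in the sample, so there is no between-cluster variation to estimate from; the replicated draw is one of the two outlooks listed above.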
Review: To this point, we have considered two ways to estimate the population total (or
mean) in a cluster sample. These two ways are reviewed below.
1. SRS of Clusters:
$\hat{\tau} = N\bar{y}$, where $\bar{y}$ is the cluster mean,
$$\mathrm{Var}(\hat{\tau}) = N(N-n)\frac{\sigma_u^2}{n}, \quad\text{where } \sigma_u^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \mu_1)^2$$
is the population variance of the cluster totals, and $\mu_1$ is the average cluster total.
2. Ratio Estimation:
$\hat{\tau}_r = Mr$, where M = total # of secondary units in the population,
and r = the average response per secondary unit in the sample,
$$\mathrm{Var}(\hat{\tau}_r) \approx \frac{N(N-n)}{n}\sigma_r^2, \quad\text{where } \sigma_r^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - M_i\mu)^2$$
is the population variance of the cluster totals about $M_i\mu$ (the total expected from the
cluster's size), and $\mu$ is the average response per secondary unit.
• Under these two cluster sampling estimation methods:
$\sigma_u^2$ = the variability between clusters, and
$\sigma_r^2$ = the variability between clusters, accounting for cluster size.
PPS Sampling: Suppose in cluster sampling that the primary units are drawn with replacement where the selection probabilities are proportional to the sizes of the primary units (i.e.:
larger clusters are more likely to be selected than smaller clusters).
Recall that in the situation of PPS sampling with replacement, an unbiased estimator of
the population total is given by the Hansen-Hurwitz estimator:
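$$\hat{\tau}_p = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i} = \frac{M}{n}\sum_{i=1}^{n}\frac{y_i}{M_i}, \quad\text{where: } p_i = \frac{M_i}{M},$$
with variance given by:
$$\mathrm{Var}(\hat{\tau}_p) = \frac{1}{n}\sum_{i=1}^{N} p_i\left(\frac{y_i}{p_i} - \tau\right)^2 = \frac{1}{n}\sum_{i=1}^{N}\frac{M_i}{M}\left(\frac{y_i}{M_i/M} - \tau\right)^2$$
$$= \frac{1}{n}\sum_{i=1}^{N}\frac{M_i}{M}\left(M\left(\frac{y_i}{M_i} - \mu\right)\right)^2 = \frac{M}{n}\sum_{i=1}^{N} M_i(\bar{y}_i - \mu)^2$$
(where: $\bar{y}_i = y_i/M_i$, the average income per resident in cluster i)
$$= \frac{M}{n}\sum_{i=1}^{N} M_i\left(\frac{y_i}{M_i} - \frac{M_i\mu}{M_i}\right)^2 = \frac{M}{n}\sum_{i=1}^{N}\frac{1}{M_i}(y_i - M_i\mu)^2.$$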
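The chain of equalities in this derivation can be checked numerically. The sketch below uses a made-up toy population of N = 4 clusters (the sizes, totals, and n are hypothetical, chosen only for illustration) and evaluates the first, middle, and final forms of the variance.

```r
# Hypothetical toy population of N = 4 clusters (values made up).
Mi <- c(2, 5, 3, 10)           # cluster sizes M_i
yi <- c(30, 60, 45, 160)       # cluster totals y_i
n  <- 2                        # number of PPS draws (with replacement)

M   <- sum(Mi)                 # total number of secondary units
tau <- sum(yi)                 # population total
mu  <- tau / M                 # mean per secondary unit
p   <- Mi / M                  # selection probabilities p_i

v1 <- sum(p * (yi / p - tau)^2) / n              # first form
v2 <- (M / n) * sum(Mi * (yi / Mi - mu)^2)       # in terms of cluster means
v3 <- (M / n) * sum((yi - Mi * mu)^2 / Mi)       # final form

c(v1 = v1, v2 = v2, v3 = v3)   # all three should coincide
```

The middle form makes the dependence explicit: only differences between the cluster means and µ contribute, weighted by cluster size.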
• The squared difference $(y_i - M_i\mu)^2$ in this final piece represents the variation in the
cluster totals from the average per resident times the number of residents. Hence, the
greater the differences between the average incomes for the clusters, the larger this
variance will be. Does this make sense for cluster sampling?
• An unbiased estimator of $\mathrm{Var}(\hat{\tau}_p)$ given above is:
$$\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{M^2}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i - \hat{\mu}_p)^2, \quad\text{where: } \hat{\mu}_p = \hat{\tau}_p/M.$$
• This Hansen-Hurwitz estimator will be good if there is a lot of variability in the cluster sizes. Otherwise, one of the two cluster-based estimators of the population total
(presented earlier) should be used.
• The major advantage of the Hansen-Hurwitz estimator is its unbiasedness and low
variability when the cluster sizes are very different.
• A Horvitz-Thompson estimator based on the selection probabilities pi = Mi /M can
also be developed, as indicated on page 134 of the book.
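A minimal sketch of the Hansen-Hurwitz computation with p_i = M_i/M is given below. The function name `hh.total` and the sample values are hypothetical (the draw is made up, with one cluster selected twice, as can happen when sampling with replacement); the formulas are the estimator and variance estimator stated above.

```r
# Hansen-Hurwitz estimate of the total under PPS sampling with replacement,
# where selection probabilities are proportional to cluster size (p_i = M_i/M).
hh.total <- function(yi, Mi, M) {
  n      <- length(yi)
  ybar.i <- yi / Mi                       # mean per secondary unit in each draw
  mu.p   <- mean(ybar.i)                  # estimate of mu
  tau.p  <- M * mu.p                      # (M/n) * sum(yi / Mi)
  var.tau.p <- M^2 * sum((ybar.i - mu.p)^2) / (n * (n - 1))
  list(tau = tau.p, mu = mu.p, se.tau = sqrt(var.tau.p))
}

# Hypothetical PPS draw of n = 4 clusters from a population with M = 20
# secondary units; the cluster of size 10 happens to be drawn twice.
est <- hh.total(yi = c(160, 60, 160, 45), Mi = c(10, 5, 10, 3), M = 20)
est$tau       # estimated total
est$se.tau    # estimated standard error of the total
```

A cluster drawn more than once enters the sums once per draw, which is why the repeated values are kept in the input vectors.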
Basic Principle of Cluster and Systematic Sampling: Because a census is taken on all secondary
units within a primary unit, the within-primary-unit variance plays no role in the
variances of population means or totals. This explains why the ideal case for cluster or
systematic sampling is when there is large variability within primary units relative to
the variability between primary units: this large within-primary-unit variability will
have no effect on the variance of estimators!
• Hence, we want the primary units to be “mini-populations”; that is, we want them to
be representative of the population as a whole. This will minimize differences between
primary units while maximizing differences within.
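The principle can be illustrated with a small sketch. The population of 12 values and the two ways of grouping it into N = 4 clusters are hypothetical; the function evaluates the SRS-of-clusters variance N(N − n)σ_u²/n from the review above for each grouping.

```r
y <- 1:12                          # hypothetical population values
N <- 4; n <- 2                     # 4 clusters of size 3; sample 2 of them

# Variance of tau-hat = N * ybar under an SRS of n clusters.
var.tauhat <- function(groups) {
  totals <- sapply(split(y, groups), sum)   # cluster totals y_i
  N * (N - n) * var(totals) / n             # var() uses the N - 1 divisor
}

homog  <- rep(1:4, each  = 3)      # {1,2,3}, {4,5,6}, ...: similar units together
hetero <- rep(1:4, times = 3)      # {1,5,9}, {2,6,10}, ...: each spans the range

var.tauhat(homog)                  # large between-cluster variability
var.tauhat(hetero)                 # "mini-populations": much smaller variance
```

The same 12 values are sampled either way; only the grouping changes. Putting the variability inside the clusters, where it is censused, leaves little variability between cluster totals.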
Example (Systematic Sampling): Consider sampling
via line transects from an irregularly-shaped area.
Interest is in estimating the proportion of the area
which has some attribute (say: bare ground).
• Suppose we take a systematic random sample
of 10 such transects, where the initial starting
point is randomly chosen, and the resulting 10 transects are evenly spaced across the
region.
• Because the area is irregular in shape, the transects will be of different lengths.
• Although we really have a cluster sample of size 1, we will assume that the 10 transects
are representative of the area and treat them as an SRS of size 10.
• So, the sampling unit here is a transect. What is the population?
How might we estimate the proportion of the area (using these 10 transects) with the desired
attribute (bare ground)? Three ways are considered:
1. Simple Average of Transect Proportions: We could measure the proportion of each
transect with the attribute, and average the 10 resulting proportions. Problem?
2. Ratio Estimator: Suppose we let:
$y_i$ = the length of transect i with the attribute,
$x_i$ = the length of transect i.
Then a ratio estimate of the proportion of the area with the attribute is:
$$r = \frac{\sum_{i=1}^{10} y_i}{\sum_{i=1}^{10} x_i}, \quad\text{with variance } \mathrm{Var}(r) = \underbrace{\left(\frac{N-n}{N}\right)}_{=1\text{ here}}\frac{1}{\mu_x^2}\,\frac{s_r^2}{n}, \quad\text{where: } s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - rx_i)^2.$$
• As a side note, we might estimate µx with x (the average length of the 10 transects),
but we could actually determine µx exactly using integral calculus!
• This is a perfectly valid way of estimating the proportion of the area with the
attribute, although, as with all ratio estimators, there is some bias in the estimator.
3. Unbiased Way?: Consider only the lengths ($y_i$'s), and not the proportions.
• We can determine the true mean length of a transect $\mu_x$ via calculus (since we
know the total acreage and length across the base of the area). If we then take:
$$\frac{\bar{y}}{\mu_x} = \frac{\text{average length of the attribute among the transects}}{\text{true mean length of a transect}},$$
this estimator will be unbiased, as it is no longer a ratio estimator: the denominator is a
known constant rather than a sample quantity.
• If there are large differences in transect lengths, the ratio estimation method is
better because it accounts for cases of many short or long transects. The variance
of the ratio estimator will be small when the $(y_i/x_i)$'s (the proportions with the
attribute) are similar across transects, since the residuals $y_i - rx_i$ are then small.
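The three approaches can be compared in a short R sketch. Everything about the region here is an assumption made for illustration: its height profile `h`, the base width `b`, and the per-transect bare-ground proportions are all made up. The true mean transect length µx is obtained as (area)/(base width) by numerical integration, standing in for the calculus step mentioned above.

```r
set.seed(42)                             # arbitrary seed for reproducibility
b <- 200                                 # hypothetical base width of the region
h <- function(t) 100 * sin(pi * t / b)   # hypothetical transect length at position t

# True mean transect length: area of the region divided by its base width.
mux <- integrate(h, 0, b)$value / b

# Systematic sample of 10 evenly spaced transects with a random start.
k   <- b / 10
pos <- runif(1, 0, k) + k * (0:9)
x   <- h(pos)                            # transect lengths x_i
y   <- runif(10, 0.2, 0.4) * x           # made-up bare-ground lengths y_i
n   <- length(x)

p1 <- mean(y / x)                        # 1. simple average of proportions
r  <- sum(y) / sum(x)                    # 2. ratio estimator
p3 <- mean(y) / mux                      # 3. mean attribute length / known mux

# Standard errors treating the transects as an SRS, with fpc = 1.
sr2   <- sum((y - r * x)^2) / (n - 1)
se.r  <- sqrt(sr2 / (n * mux^2))
se.p3 <- sqrt(var(y) / n) / mux

c(p1 = p1, r = r, p3 = p3, se.r = se.r, se.p3 = se.p3)
```

Method 1 weights short and long transects equally, while methods 2 and 3 weight by length. Comparing `se.r` with `se.p3` on real data indicates whether the attribute length is close enough to proportional to transect length for the ratio estimator to pay off.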
R Code for Cluster Sampling Income Example
N <- 415; n <- 25; fpc <- (N-n)/N; M <- 2500
Mi <- c(8,12,4,5,6,6,7,5,8,3,2,6,5,10,9,3,6,5,5,4,6,8,7,3,8)
y <- 1000*c(192,242,84,130,104,80,150,130,90,100,170,86,
            108,98,106,100,64,44,90,74,102,60,78,94,82)

# SRS of Clusters
# ===============
ybar <- mean(y)
ybar                                 # Average income per block
# [1] 106320
su2 <- var(y)
se.ybar <- sqrt(fpc*su2/n)
se.ybar                              # SE of average income per block
# [1] 8447.19
tauhat <- N*ybar
tauhat                               # Estimated total income in city
# [1] 44122800
se.tauhat <- N*se.ybar
se.tauhat                            # SE of estimated total income
# [1] 3505584

# Ratio Estimator with Cluster Sampling
# =====================================
r <- sum(y)/sum(Mi)
r                                    # Sample ratio = average income per resident
# [1] 17602.65
mux <- M/N
sr2 <- (1/(n-1))*sum((y-r*Mi)^2)
se.r <- sqrt(fpc*sr2/(n*mux^2))
se.r                                 # SE of average income per resident
# [1] 1621.409                       #   (assuming M = 2500 is KNOWN)
xbar <- mean(Mi)
se.r2 <- sqrt(fpc*sr2/(n*xbar^2))
se.r2                                # SE of average income per resident
# [1] 1617.14                        #   (assuming M is UNKNOWN)
tauhat <- M*r
tauhat                               # Estimated total income in city
# [1] 44006623
se.tauhat <- M*se.r
se.tauhat                            # SE of estimated total income
# [1] 4053522