Lecture notes

Module H6 Sessions 13&14
Cluster and Multi-Stage Sampling Procedures
1. Cluster Sampling
For administrative convenience and to reduce costs such as transport, interviewer time and
supervision, a random selection is often made from clusters of units rather than from the
total population of units themselves. For example, in a household survey, rather than
selecting a simple random sample of households from the whole country, one could select
a sample of villages. This has the added advantage that we don't need a list (sampling
frame) of every household in the country from which to draw our sample, just a list of all
the villages.
If we then interview every household in the selected villages, we have a one–stage cluster
sample. If we just interview a sample of households in each selected village we have a
two–stage cluster sample. To take a random sample at this second stage, we need of
course a sampling frame of households in the chosen villages.
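The difference between the two schemes can be sketched in a few lines of Python. This is only an illustration: the village names, household IDs and function names below are made up for the example, and SRS is assumed at both stages.

```python
import random

def one_stage_cluster_sample(villages, m, rng):
    """Select m villages by SRS and interview every household in them."""
    chosen = rng.sample(sorted(villages), m)
    return {v: list(villages[v]) for v in chosen}

def two_stage_cluster_sample(villages, m, n_per_village, rng):
    """Select m villages by SRS, then an SRS of households within each one."""
    chosen = rng.sample(sorted(villages), m)
    return {
        v: rng.sample(villages[v], min(n_per_village, len(villages[v])))
        for v in chosen
    }

# Hypothetical frame: 6 villages, village k having k + 4 households.
villages = {f"V{k}": [f"V{k}-H{h}" for h in range(1, 5 + k)] for k in range(1, 7)}

rng = random.Random(1)  # fixed seed so the sketch is reproducible
one_stage = one_stage_cluster_sample(villages, 2, rng)
two_stage = two_stage_cluster_sample(villages, 2, 3, rng)
```

Note that the two-stage function needs the household list only for the villages actually chosen, which is the sampling-frame point made above.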
Advantages:
(i) The major advantages of a cluster sampling scheme are lower costs and
administrative convenience.
(ii) It is also advantageous when no satisfactory sampling frame is available for the whole
population, but a listing of the clusters into which the population is divided is
available.
NOTE:
(i) The selection of m clusters from a population made up of M clusters may be made
by simple random sampling (SRS), or stratified random sampling, or systematic
sampling, by treating the clusters themselves as units.
(ii) Whether or not a particular group of units should be called a cluster depends on
the circumstances, e.g. if units are households, villages may be clusters;
if units are family members, households may be clusters.
SADC Course in Statistics
(iii) Clusters need not necessarily be natural aggregates. For example, in area sampling,
imposing grids onto maps produces artificial clusters.
(iv) In any one sample design, several levels of clusters may be used (e.g. district, village,
household). This leads to a multi-stage sampling procedure. The units at the first
(topmost) stage of selection are known as primary sampling units (PSUs).
Clusters do not usually contain equal numbers of units.
Cluster sampling will usually give less precise estimates than a simple random sample of the
same size, but this loss in precision is usually far outweighed by the reduced cost, which enables
a larger sample to be taken. Also, better administrative control can give better quality
data. In a large survey some clustering will probably be inevitable anyway because of the
lack of a suitable sampling frame.
2. Selection of Clusters in One–Stage Cluster Sampling
Clusters could be sampled with
(a) equal probability (SRS), using weights in estimation (an appreciation of the meaning
and use of sampling weights will be given in Session 19); or
(b) probability proportional to size (PPS).
Example: Suppose we want to select 1 cluster from a population made up of 10 clusters.
Let Pi be the probability of selecting the ith cluster.
Let Pi(a) and Pi(b) correspond to sampling with SRS and with PPS respectively (see Table
below).
We need to consider whether sampling is with or without replacement.
In (a), sampling with or without replacement leads to the same probabilities Pi(a).
The column Pi(b), however, gives with-replacement probabilities.
Sampling of clusters under scheme (b) is usually done with replacement.
Cluster No. (i)   District size (pop.)   Pi(a)   Pi(b)
1                 1000                   1/10    1000/6700
2                 400                    1/10    400/6700
3                 200                    1/10    200/6700
4                 300                    1/10    300/6700
5                 1200                   1/10    1200/6700
6                 1000                   1/10    1000/6700
7                 1600                   1/10    1600/6700
8                 200                    1/10    200/6700
9                 350                    1/10    350/6700
10                450                    1/10    450/6700
Total             6700                   1       1
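As a quick check on the two columns of probabilities, the following sketch recomputes Pi(a) and Pi(b) from the district sizes using exact fractions. The variable names are chosen for the illustration only.

```python
from fractions import Fraction

# District sizes for clusters 1 to 10, as in the table.
sizes = [1000, 400, 200, 300, 1200, 1000, 1600, 200, 350, 450]

M = len(sizes)       # number of clusters in the population
total = sum(sizes)   # total population size

p_srs = [Fraction(1, M) for _ in sizes]      # Pi(a): equal probability
p_pps = [Fraction(s, total) for s in sizes]  # Pi(b): proportional to size

for i, (a, b) in enumerate(zip(p_srs, p_pps), start=1):
    print(f"cluster {i:2d}: Pi(a) = {a}, Pi(b) = {b}")
```

Fractions are printed in lowest terms, so for example 1000/6700 appears as 10/67. Both columns sum to 1, as in the Total row.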
Other advantages of with-replacement selection are:
(i) in multistage sampling, variance estimates are easier to compute;
(ii) selected clusters are independent.
On the other hand, we may get the same cluster twice!
If we have within-cluster sampling (two-stage) this may not be a problem, especially if the within-cluster sampling fraction is low.
In practice, if the sampling fraction $f = n/N$ is small, we can sample without replacement
but treat the sample as if it had been drawn with replacement.
3. Method of Drawing a PPS Sample
Make a cumulative list of cluster sizes.
Example:
Cluster No   Size    Cumulative Size
1            1000    1000
2            400     1400
3            200     1600
4            300     1900
5            1200    3100
6            1000    4100
7            1600    5700
8            200     5900
9            350     6250
10           450     6700
Method (i).
Draw a systematic sample based on cumulative size, assuming clusters are arranged in
random order.
e.g. To select m = 3 clusters, look at $6700/3 \approx 2233$.
Draw a random number between 1 and 2233.
Say we get 1814.
1814 ⇒ Select District 4
1814 + 2233 = 4047 ⇒ Select District 6
4047 + 2233 = 6280 ⇒ Select District 10
Unless one cluster is very large, this will be without replacement.
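Method (i) can be sketched in Python as follows. The function name and arguments are chosen for this illustration; the interval is taken as the integer part of (total size)/m, as in the example, and a selection point picks the first cluster whose cumulative size reaches it.

```python
from bisect import bisect_left
from itertools import accumulate
import random

def pps_systematic(sizes, m, start=None, rng=random):
    """Return the 1-based numbers of m clusters chosen by systematic PPS."""
    cum = list(accumulate(sizes))     # cumulative sizes: 1000, 1400, 1600, ...
    step = cum[-1] // m               # sampling interval, here 6700 // 3 = 2233
    if start is None:
        start = rng.randint(1, step)  # random start between 1 and the interval
    points = [start + k * step for k in range(m)]
    # A point p selects the first cluster whose cumulative size is >= p.
    return [bisect_left(cum, p) + 1 for p in points]

sizes = [1000, 400, 200, 300, 1200, 1000, 1600, 200, 350, 450]
selected = pps_systematic(sizes, m=3, start=1814)  # random start used in the example
print(selected)
```

With the random start 1814 this reproduces the selection of Districts 4, 6 and 10. A cluster larger than the interval could be hit by two consecutive points, which is the exception noted above.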
Method (ii).
Draw a random sample based on cumulative size.
e.g. To select m = 3 clusters, take 3 random numbers from 1 to 6700.
Say we get 5814, 832, 3534.
5814 ⇒ Select District 8
832 ⇒ Select District 1
3534 ⇒ Select District 6
This is effectively sampling with replacement.
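A corresponding sketch for Method (ii) is given below, again with illustrative function names. Each random number is looked up independently in the cumulative list, so nothing prevents the same cluster from being selected more than once.

```python
from bisect import bisect_left
from itertools import accumulate
import random

def clusters_for(numbers, sizes):
    """Map each random number to the 1-based cluster whose cumulative range contains it."""
    cum = list(accumulate(sizes))
    return [bisect_left(cum, r) + 1 for r in numbers]

def pps_random(sizes, m, rng=random):
    """Draw m clusters with probability proportional to size, with replacement."""
    total = sum(sizes)
    numbers = [rng.randint(1, total) for _ in range(m)]  # independent draws from 1..total
    return clusters_for(numbers, sizes)

sizes = [1000, 400, 200, 300, 1200, 1000, 1600, 200, 350, 450]
example = clusters_for([5814, 832, 3534], sizes)  # the three numbers used above
print(example)
```

The three numbers from the example map to Districts 8, 1 and 6.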
4. Estimation in two-stage cluster sampling
Assume clusters are sampled with replacement, and xij represents the measurement made
on the jth unit in the ith cluster.
Then each cluster i provides an independent unbiased estimate $x_i^*$ of the population total
$X_T$ (of some measurement variable X). This is given by
$$x_i^* = m \sum_j \frac{x_{ij}}{p_{ij}} \qquad (1)$$
where
$m$ is the number of clusters sampled;
$x_{ij}$ is the measurement made on the jth unit in the ith PSU (cluster);
$p_{ij}$ is the probability of selecting the jth unit in the ith cluster.
The summation takes place over all $x_{ij}$'s in the ith cluster (NOTE: In multi-stage samples,
the summation extends over all draws and all stages of selection).
Then we can estimate the overall total $X_T$ by
$$x^* = \frac{1}{m} \sum_{i=1}^{m} x_i^* \qquad (2)$$
and its variance by
$$\mathrm{var}(x^*) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \left( x_i^* - x^* \right)^2 . \qquad (3)$$
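Expressions (1) to (3) translate directly into code. The sketch below assumes each sampled cluster is supplied as a list of (measurement, selection probability) pairs. The function name and the small data set are invented for the illustration, with numbers picked so the arithmetic is easy to follow by hand.

```python
def estimate_total(clusters):
    """Apply expressions (1)-(3) to a list of sampled clusters.

    Each cluster is a list of (x_ij, p_ij) pairs. At least two clusters are
    needed, because the variance formula divides by m(m - 1).
    """
    m = len(clusters)
    # Expression (1): one estimate of the population total per cluster.
    x_i_star = [m * sum(x / p for x, p in cluster) for cluster in clusters]
    # Expression (2): average the per-cluster estimates.
    x_star = sum(x_i_star) / m
    # Expression (3): variance of the overall estimate.
    var = sum((xi - x_star) ** 2 for xi in x_i_star) / (m * (m - 1))
    return x_i_star, x_star, var

# Hypothetical data: two clusters with made-up measurements and probabilities.
clusters = [
    [(1.0, 0.5), (2.0, 0.5)],  # sum of x/p = 6,  so x_1* = 2 * 6  = 12
    [(3.0, 0.25)],             # sum of x/p = 12, so x_2* = 2 * 12 = 24
]
x_i_star, x_star, var = estimate_total(clusters)
print(x_i_star, x_star, var)
```

Here the two per-cluster estimates are 12 and 24, giving an overall estimate of 18 with estimated variance ((12 - 18)^2 + (24 - 18)^2) / (2 × 1) = 36.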
Consider the previous example again with a slight change to cluster sizes.
Cluster No   Cluster Size   pij for SRS   pij for PPS
1            100            3/10          100/670
2            40             3/10          40/670
3            20             3/10          20/670
4            30             3/10          30/670
5            120            3/10          120/670
6            100            3/10          100/670
7            160            3/10          160/670
8            20             3/10          20/670
9            35             3/10          35/670
10           45             3/10          45/670
Total        670
Consider a two-stage selection, taking 3 clusters at stage 1 and then 8 units at stage 2 (from each selected cluster).
Consider SRS with replacement at stage 1 and without replacement at stage 2.
Suppose the clusters selected at stage 1 are 1, 7 and 8.
Then,
Prob. of selecting the jth unit from cluster 1 = $\frac{3}{10} \times \frac{8}{100} = p_{1j}$
Prob. of selecting the jth unit from cluster 7 = $\frac{3}{10} \times \frac{8}{160} = p_{7j}$
Prob. of selecting the jth unit from cluster 8 = $\frac{3}{10} \times \frac{8}{20} = p_{8j}$
Then, from the cluster 1 data, the population total $X_T$ is estimated by
$$x_1^* = 3 \sum_{j=1}^{8} \frac{x_{1j}}{p_{1j}},$$
and similarly for $x_7^*$ and $x_8^*$.
The population total can then be estimated from all the data as
$$x^* = \frac{x_1^* + x_7^* + x_8^*}{3} = \sum_{j=1}^{8} \frac{x_{1j}}{p_{1j}} + \sum_{j=1}^{8} \frac{x_{7j}}{p_{7j}} + \sum_{j=1}^{8} \frac{x_{8j}}{p_{8j}},$$
with the variance of $x^*$ given by expression (3).
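Notice that the first term above is
$$\sum_{j=1}^{8} \frac{x_{1j}}{p_{1j}} = \sum_{j=1}^{8} \frac{x_{1j}}{\frac{3}{10} \times \frac{8}{100}} = \frac{10}{3} \times 100 \times \frac{\sum_{j=1}^{8} x_{1j}}{8} = \frac{10}{3} \times (\text{total for cluster 1}),$$
where $\sum_{j=1}^{8} x_{1j} / 8$ is the sample mean for cluster 1, and 100 times this mean is the estimated total for cluster 1.
Writing the 2nd and 3rd terms similarly, we get
$$x^* = \left[ (\text{total for cluster 1}) + (\text{total for cluster 7}) + (\text{total for cluster 8}) \right] \times \frac{10}{3},$$
i.e.
x* = (mean per cluster) × 10.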
Computations above correspond to the way in which we would intuitively calculate the
population total XT from data of just one cluster, and then average results across several
clusters to get a better estimate of XT, together with its variance.
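The identity x* = (mean per cluster) × 10 can be checked numerically. The sketch below uses the cluster sizes from the example (100, 160 and 20 for clusters 1, 7 and 8) together with invented measurements on the 8 sampled units in each cluster. Exact fractions are used so that the two routes can be compared without rounding.

```python
from fractions import Fraction

M, m, n = 10, 3, 8                       # clusters in population, clusters sampled, units per cluster
cluster_size = {1: 100, 7: 160, 8: 20}   # sizes of the selected clusters

# Hypothetical measurements on the 8 sampled units of each cluster.
data = {
    1: [4, 6, 5, 7, 3, 5, 6, 4],
    7: [2, 3, 2, 4, 3, 3, 2, 5],
    8: [8, 6, 7, 9, 8, 6, 7, 5],
}

# Overall selection probability of a unit: (3/10) x (8 / cluster size).
p = {i: Fraction(m, M) * Fraction(n, cluster_size[i]) for i in data}

# Route 1: expressions (1) and (2).
x_i_star = {i: m * sum(Fraction(x) / p[i] for x in xs) for i, xs in data.items()}
x_star = sum(x_i_star.values()) / m

# Route 2: estimated cluster totals, averaged and multiplied by 10.
est_total = {i: cluster_size[i] * Fraction(sum(xs), n) for i, xs in data.items()}
intuitive = M * sum(est_total.values()) / m

print(x_star, intuitive)
```

Both routes give the same value, and each per-cluster estimate is simply 10 times that cluster's estimated total.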
5. Multistage Sampling
We saw earlier that a two–stage sample results when a random sample of clusters is
selected, and these are then further subsampled (rather than completely enumerated) to
provide the units making up the final sample. The first stage units are called Primary
Sampling Units (PSUs), and the second stage units are called Secondary Sampling Units
(SSUs). If the second stage units are further sub–sampled, a three–stage sampling
procedure results.
In general, a population sub–sampled at several stages leads to a MULTISTAGE
SAMPLING procedure. For example:
Districts → Towns → Hospitals → Wards → Patients.
NOTE: Most large scale survey investigations are multistage. They are cost–effective and
administratively convenient. They are also useful in the absence of an adequate sampling
frame.
Sample Selection
Suppose a population of size 240000 is divided into 800 PSUs, each of size 300 (or
240 PSUs of size 1000, etc.).
Suppose a sampling fraction of 1 in 100 is to be taken, i.e. a total sample of 2400 units.
Allocation   No. of PSUs   No. of SSUs taken per PSU
A            800           3
B            240           10
C            120           20
D            60            40
E            40            60
F            20            120
G            8             300
In the above example, SRS at each stage will produce unbiased estimates of the population
mean and allow s.e.'s to be found.
In general, however, the sizes of PSUs vary!
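Each allocation in the table yields the same overall sample. A short check, with the allocations typed in from the table and illustrative variable names:

```python
population = 240_000
target = population // 100   # 1-in-100 sampling fraction

# Allocation: (number of PSUs, number of SSUs taken per PSU)
allocations = {
    "A": (800, 3),
    "B": (240, 10),
    "C": (120, 20),
    "D": (60, 40),
    "E": (40, 60),
    "F": (20, 120),
    "G": (8, 300),
}

totals = {name: n_psu * n_ssu for name, (n_psu, n_ssu) in allocations.items()}
for name, total in totals.items():
    print(f"Allocation {name}: {total} units")
```

Every allocation gives 2400 units, so the choice among A to G is a trade-off between the number of PSUs to visit and the number of units taken within each, not a change in overall sample size.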
What size of samples should then be taken at each stage?
Some possibilities are given below.
1. Take an SRS of PSUs, and sample each selected PSU by SRS, stratified or systematic
sampling.
Suppose we decide on SRS of SSUs. Then we need to decide on n1, n2, ..., nm, i.e. the sample
sizes from the m selected PSUs. For example:
(a) Take all ni of equal size.
Disadvantage: The sample might not be representative if some large PSUs are not selected at
the first stage.
(b) Take ni proportional to the size of the ith PSU.
Disadvantage: If a large PSU is selected, the final sample is heavily influenced by units in this
PSU.
2. Select PSUs with probability proportional to size (PPS), and take equal sized
samples from the selected PSUs.
This gives greater precision than case 1, with a constant sampling fraction over the two
stages of selection.
Advantages of 2 are:
• Equal ni ⇒ a large PSU cannot exert too great an effect;
• Administratively convenient;
• If each PSU is to be studied separately, this is an efficient allocation.
Limitation: The sizes of the PSUs must be known.
An Example
A population has individuals belonging to one of three districts A, B, C.
District   District Size (Ni)
A          20000
B          2000
C          8000
Total      30000
Consider selecting one PSU, and then sub–sampling from it.
(a) Consider SRS of one PSU and SRS of ni = 100 SSUs.
If A is the selected PSU,
Prob. any particular individual is selected = $\frac{1}{3} \times \frac{100}{20000} = \frac{1}{600}$
If B is the selected PSU,
Prob. any particular individual is selected = $\frac{1}{3} \times \frac{100}{2000} = \frac{1}{60}$
If C is the selected PSU,
Prob. any particular individual is selected = $\frac{1}{3} \times \frac{100}{8000} = \frac{1}{240}$
So this scheme gives different individuals a widely different chance of entering the final
sample.
(b) Select one district using SRS and ni proportional to Ni SSUs.
(e.g. Take a 5% sample at stage 2).
If A is selected,
Prob. any particular individual is selected = $\frac{1}{3} \times \frac{1}{20} = \frac{1}{60}$
If B is selected,
Prob. any particular individual is selected = $\frac{1}{3} \times \frac{1}{20} = \frac{1}{60}$
If C is selected,
Prob. any particular individual is selected = $\frac{1}{3} \times \frac{1}{20} = \frac{1}{60}$
(c) Select one district with PPS, and take equal ni = 100.
If A is selected,
Prob. any particular individual is selected = $\frac{20000}{30000} \times \frac{100}{20000} = \frac{1}{300}$
If B is selected,
Prob. any particular individual is selected = $\frac{2000}{30000} \times \frac{100}{2000} = \frac{1}{300}$
If C is selected,
Prob. any particular individual is selected = $\frac{8000}{30000} \times \frac{100}{8000} = \frac{1}{300}$
NOTE: In (b) and (c), the probability that any particular individual is selected is constant. Such a
design is called a self–weighting design.
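The three schemes can be compared side by side by multiplying the stage 1 and stage 2 probabilities for each district. The sketch below uses exact fractions. The dictionary and function names are illustrative only.

```python
from fractions import Fraction

sizes = {"A": 20000, "B": 2000, "C": 8000}
N = sum(sizes.values())   # 30000
k = len(sizes)            # 3 districts

# (a) SRS of one district, then a fixed take of 100 individuals.
scheme_a = {d: Fraction(1, k) * Fraction(100, Ni) for d, Ni in sizes.items()}
# (b) SRS of one district, then a 5% (1 in 20) sample within it.
scheme_b = {d: Fraction(1, k) * Fraction(1, 20) for d in sizes}
# (c) PPS selection of one district, then a fixed take of 100 individuals.
scheme_c = {d: Fraction(Ni, N) * Fraction(100, Ni) for d, Ni in sizes.items()}

def is_self_weighting(probs):
    """True if every individual has the same overall selection probability."""
    return len(set(probs.values())) == 1

for name, probs in [("a", scheme_a), ("b", scheme_b), ("c", scheme_c)]:
    shown = {d: str(p) for d, p in probs.items()}
    print(name, shown, "self-weighting:", is_self_weighting(probs))
```

In scheme (c) the district size cancels between the two stages, which is why PPS selection combined with an equal take gives a constant overall probability.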
6. Self–Weighting Designs
A self–weighting design is one in which each unit at the final level (e.g. plot, person,
household) has the same probability of being selected. This probability is the product of
the probabilities of selection (over all draws) at each stage of the design.
For example, suppose we are planning a two–stage survey from a population of villages of
different sizes. We will take a sample of villages, and within each selected village take a
subsample of households. We can make the design self–weighting by either:
(a) Choosing villages with probability proportional to size, and then choosing the same
number of households in each selected village;
or
(b) Choosing a simple random sample of villages, and then selecting a number of
households in each selected village which is proportional to the size of that village.
The advantage of a self–weighting design is that the straightforward sample mean
$\bar{x} = \frac{1}{n} \sum x_i$ is an unbiased estimator of the true population mean.
This is not so for a non–self–weighting design. For such a design, you need to take
account of the different probabilities of selection. The formulae given under expressions
(1), (2) and (3) in Section 4 again apply, BUT with pij defined as the overall probability
of selection of the jth unit from the ith cluster covering all draws and all stages of selection.
Note that the above method of estimation is appropriate only in instances where PSUs are
sampled with replacement.
7. Design Effects (brief overview)
The design effect (deff) is the ratio of the correct variance of an estimator (say
$\hat{\theta}$) in the
given design to the variance calculated as if it were a simple random sample of the same
size. Thus
$$\text{deff} = \frac{\text{variance}_{\text{correct}}(\hat{\theta})}{\text{variance}_{\text{SRS}}(\hat{\theta})}$$
[Note: Some also consider deft = $\sqrt{\text{deff}}$.]
Deff is a comprehensive measure which attempts to summarise the effect of various
complexities in the design, especially those of clustering and stratification.
In general, deff will be different for different groups of variables measured in the same
survey, but may be quite similar across groups of variables of the same type (e.g. socioeconomic variables, health knowledge variables, etc).
It may be similar in value for the same variable using the same design across different
surveys in time and location.
Uses of deff:
(a) To give an indication of the precision lost (or gained) due to a complex survey design,
and to enable different designs to be compared.
(b) To enable easy estimation of sample size for future surveys, e.g. if we know deff for a
crucial variable from a previous survey (of the same design), we can take the sample size n as
$$n = \text{deff} \times \frac{S^2}{V},$$
where V is the desired variance for the real survey and $S^2$ is the variance between individual
units, so that $S^2 / V$ is the sample size a simple random sample would need (the finite
population correction is ignored here).
(c) To enable simple (rough!) calculation of standard errors. For example, if the design is
self-weighting, the SRS mean will be correct. For the correct variance, the SRS variance
estimates should be multiplied by an estimate of deff from either (i) an earlier survey of
the same design; or (ii) calculating the correct multi-stage sampling variance for one or two
variables in the group (e.g. demographic variables) and using deff for these as an estimate
of deff for other variables in the group.
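Uses (b) and (c) amount to two one-line calculations, sketched below. The function names and all numerical values are invented for the illustration. In practice deff, S² and V would come from an earlier survey and from the precision required.

```python
import math

def sample_size_with_deff(deff, s2, v):
    """Sample size n = deff * S^2 / V, rounded up (finite population correction ignored)."""
    return math.ceil(deff * s2 / v)

def adjusted_standard_error(srs_se, deff):
    """Rough standard error: the SRS standard error multiplied by deft = sqrt(deff)."""
    return srs_se * math.sqrt(deff)

# Hypothetical inputs: deff = 2 from an earlier survey, unit variance S^2 = 400,
# and a desired variance V = 4 for the estimate.
n = sample_size_with_deff(deff=2.0, s2=400.0, v=4.0)

# Hypothetical SRS standard error of 1.5 for a variable with deff = 4.
se = adjusted_standard_error(srs_se=1.5, deff=4.0)
print(n, se)
```

With these made-up numbers a simple random sample would need 400/4 = 100 units, so a design with deff = 2 needs 200. Multiplying the SRS variance by deff is the same as multiplying the SRS standard error by deft.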
Further information about deff can be found in
Pettersson H. and Silva,P.L.D.N. (2004) Analysis of design effects for surveys in developing
countries. Chapter VII, pp.123-143, of the UN Publication An Analysis of Operating
Characteristics of Household Surveys in Developing and Transition Countries: Survey Costs, Design
Effects and Non-Sampling Errors. Available at
http://unstats.un.org/unsd/hhsurveys/index.htm. (accessed 10th September 2007)
Other references related to design effects can also be found under Section B of the above
publication.
References for the bulk of the contents of this document appear below.
General References related to multi-stage samples and practical applications
Barnett, V. (1974) Elements of Sampling Theory. Edward Arnold. ISBN 0 340 17387 4
Cochran, W.G. (1977) Sampling Techniques (3rd edition). Wiley & Sons.
Levy, P.S. and Lemeshow, S. (1999) Sampling and Populations: Methods and Applications (3rd
edition) Wiley, New York. ISBN 0-471-15575-6
Lohr, S.L. (1999) Sampling: Design and Analysis. International Thomson Publishing. ISBN 0-534-35361-4
Rao, P.S.R.S. (2000) Sampling Methodologies: with applications. Chapman and Hall, London.
Scheaffer, R.L., Mendenhall, W., Ott, L. (1990) Elementary survey sampling, (4th Edition).
PWS-Kent Publishing Company, pp. 390.
Wilson I. M., Huttly, S.R.A. and Fenn, B. (2006) A Case Study of Sample Design for
Longitudinal Research : Young Lives. International Journal of Social Research Methodology:
Theory and Practice. 9, no.5, pp.351-365.