Cluster Sampling - The Scottish Government

Methodology Glossary Tier 2
Cluster Sampling
Cluster sampling is a method used to select a sample. The main reason to
cluster sample is to increase the efficiency of survey administration by
reducing costs (particularly travel costs). A sample derived through simple
random sampling can result in sample units which are widely dispersed
geographically, meaning that interviewers must travel great distances to
conduct a survey. This means that expensive travel costs may be incurred
and it may take longer to complete all the interviews required.
When cluster sampling, the population of interest is first split into geographic
areas (‘clusters’) or some other natural cluster (such as industries, schools or
hospitals). Some of these clusters are then randomly selected to form a
sampling frame from which a sample is chosen.
One-stage Cluster Sampling
This involves splitting the population into clusters, then randomly selecting a
proportion of these clusters. All units within the selected clusters are chosen
to participate in the survey.
Example
Glasgow City Council wishes to find out information about the diets of primary
one pupils. It is clearly not feasible to survey every primary one pupil in
Glasgow City, as it would be expensive and time consuming. We may
therefore decide that each school in Glasgow represents a cluster and
randomly select 20 schools. In one-stage cluster sampling we would visit
each of these schools and interview all primary one pupils in each of the 20
schools.
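The one-stage procedure can be sketched in a few lines of Python. The school names and pupil counts below are made up for illustration; only the design (20 randomly chosen schools, every pupil in each) follows the example.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, rng):
    """Pick n_clusters clusters at random and keep every unit inside them.

    clusters maps a cluster name to the list of units it contains.
    """
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical sampling frame: 140 schools, each with 30 primary one pupils.
schools = {
    f"school_{s:03d}": [f"pupil_{s:03d}_{p:02d}" for p in range(30)]
    for s in range(140)
}

rng = random.Random(1)  # fixed seed so the sketch is repeatable
sample = one_stage_cluster_sample(schools, n_clusters=20, rng=rng)
n_pupils = sum(len(pupils) for pupils in sample.values())
print(len(sample), "schools,", n_pupils, "pupils")
```

Every pupil in a chosen school is interviewed, so the final sample size depends entirely on which schools are drawn.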
Two-stage Cluster Sampling
This involves splitting the population into clusters and selecting a proportion of
these clusters to form a sampling frame. In two-stage cluster sampling we
then randomly select a proportion of individuals within each chosen cluster to
participate in the survey.
Example
The Scottish Government wishes to find out about the sleep patterns of all
primary school pupils. The survey will require one to one interviews with
school pupils and their parents/guardians. Obviously it is not feasible to
survey pupils in all schools or even to study all the students in a sample of
schools. Instead, we would randomly select a number of schools and then
randomly select a proportion of pupils from the chosen schools to participate
in the study.
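A minimal Python sketch of the two-stage procedure follows. The number of schools, their sizes, and the 10% within-school sampling fraction are assumptions chosen for illustration, not figures from the example.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, fraction, rng):
    """Stage 1: pick clusters at random. Stage 2: pick a fraction of units in each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        units = clusters[name]
        k = max(1, round(fraction * len(units)))  # at least one unit per cluster
        sample[name] = rng.sample(units, k)
    return sample

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 200 schools with between 100 and 400 pupils each.
schools = {
    f"school_{s:03d}": [f"pupil_{s:03d}_{p:03d}" for p in range(rng.randint(100, 400))]
    for s in range(200)
}

sample = two_stage_cluster_sample(schools, n_clusters=25, fraction=0.1, rng=rng)
print(len(sample), "schools,", sum(len(v) for v in sample.values()), "pupils")
```

Sampling a fixed fraction within each school is only one option; sampling a fixed number of pupils per school is another.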
http://www.scotland.gov.uk/Topics/Statistics/About/Methodology/Glossary
Selecting a Cluster Sample
Say we wanted to use cluster sampling to draw a sample of 1,000 people in
Scotland. We decide to use Scottish Parliamentary Constituencies as our
clusters. We intend to randomly choose 10 out of the 73 constituencies and
sample 100 people within each constituency. One way to select
constituencies would be to list them all alphabetically, select a random starting
point on the list and pick every 7th constituency.
This presents a problem – the population of each constituency is different so
people living in constituencies with smaller populations (e.g. Shetland) would
have a higher chance of being in our sample than those living in
constituencies with larger populations (e.g. Dundee West). This could lead to
bias in our estimates and we would have to weight our results to reflect the
different selection probabilities of people living in different constituencies.
This can complicate the analysis of results.
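The size of this imbalance is easy to quantify. The sketch below assumes every constituency has the same 10-in-73 chance of selection and uses two invented population sizes (20,000 and 60,000), which are illustrative figures rather than real constituency populations.

```python
def inclusion_probability(total_clusters, clusters_selected, per_cluster, cluster_size):
    """Chance that one person is sampled when clusters are chosen with equal probability."""
    p_cluster = clusters_selected / total_clusters  # chance the person's cluster is chosen
    p_within = per_cluster / cluster_size           # chance of selection inside that cluster
    return p_cluster * p_within

# Hypothetical constituency sizes, for illustration only.
small = inclusion_probability(73, 10, 100, 20_000)
large = inclusion_probability(73, 10, 100, 60_000)

print(f"small constituency: {small:.6f}")
print(f"large constituency: {large:.6f}")
print(f"ratio: {small / large:.1f}")
```

With these made-up sizes a person in the smaller constituency is three times as likely to be sampled as a person in the larger one, which is the imbalance that weighting would have to correct.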
One way to overcome this problem is to set the probability of each cluster
being selected for the sample to be proportional to its overall size. This
means the larger a cluster the more chance it has of being selected.
The following shows how to determine what probability should be assigned to
each cluster. Note, this method only works when the same number of people
will be sampled in each cluster.
Say we have:
a population of size N,
a target sample of size n,
the population split into clusters, indexed by j,
a clusters selected for the sample, and
b individuals selected within each chosen cluster.
To ensure everyone in the population has an equal chance of being included
in the sample we need:
Probability an individual is in the sample = $\frac{n}{N}$
We can say that:
$$n = a \times b$$
If we say that $N_j$ is the total number of individuals in cluster j then:
$$\frac{n}{N} = \frac{a \times b}{N} = \frac{a \times b \times N_j}{N \times N_j} = \left(\frac{a \times N_j}{N}\right) \times \left(\frac{b}{N_j}\right)$$
The first part, $\frac{a \times N_j}{N}$, gives the probability that should be assigned to cluster j.
The second part, $\frac{b}{N_j}$, is the resulting probability an individual within cluster
j has of being included in the sample if cluster j is included in the sample.
Example
Say we have a population of 10,000 patients that are in 12 hospitals and we
want to sample 500 patients. We intend to sample 100 patients each in five of
the hospitals.
The table below shows how many patients are in each hospital and the
probability we should assign to each hospital when selecting clusters. It also
shows the probability an individual within each hospital is included in the
sample if their hospital is selected and how the overall probability of selection
is equal for all patients.

| Hospital | Patient population | Probability hospital is selected, (a × Nj) / N | Probability a patient is included in the sample if their hospital is chosen, b / Nj | Overall probability of patient being included in the sample |
|---|---|---|---|---|
| 1 | 874 | (5 × 874) / 10,000 = 43.70% | 100 / 874 = 11.44% | 43.70% × 11.44% = 5% |
| 2 | 789 | (5 × 789) / 10,000 = 39.45% | 100 / 789 = 12.67% | 39.45% × 12.67% = 5% |
| 3 | 695 | (5 × 695) / 10,000 = 34.75% | 100 / 695 = 14.39% | 34.75% × 14.39% = 5% |
| 4 | 884 | (5 × 884) / 10,000 = 44.20% | 100 / 884 = 11.31% | 44.20% × 11.31% = 5% |
| 5 | 704 | (5 × 704) / 10,000 = 35.20% | 100 / 704 = 14.20% | 35.20% × 14.20% = 5% |
| 6 | 924 | (5 × 924) / 10,000 = 46.20% | 100 / 924 = 10.82% | 46.20% × 10.82% = 5% |
| 7 | 712 | (5 × 712) / 10,000 = 35.60% | 100 / 712 = 14.04% | 35.60% × 14.04% = 5% |
| 8 | 877 | (5 × 877) / 10,000 = 43.85% | 100 / 877 = 11.40% | 43.85% × 11.40% = 5% |
| 9 | 801 | (5 × 801) / 10,000 = 40.05% | 100 / 801 = 12.48% | 40.05% × 12.48% = 5% |
| 10 | 999 | (5 × 999) / 10,000 = 49.95% | 100 / 999 = 10.01% | 49.95% × 10.01% = 5% |
| 11 | 950 | (5 × 950) / 10,000 = 47.50% | 100 / 950 = 10.53% | 47.50% × 10.53% = 5% |
| 12 | 791 | (5 × 791) / 10,000 = 39.55% | 100 / 791 = 12.64% | 39.55% × 12.64% = 5% |
| Total | 10,000 | | | |
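The figures for the twelve hospitals can be recomputed with a short Python sketch. The patient counts and the values a = 5 and b = 100 come from the example; the variable names are our own.

```python
patients = [874, 789, 695, 884, 704, 924, 712, 877, 801, 999, 950, 791]
a = 5            # number of hospitals to select
b = 100          # patients sampled in each selected hospital
N = sum(patients)

rows = []
for hospital, n_j in enumerate(patients, start=1):
    p_hospital = a * n_j / N   # probability the hospital is selected
    p_within = b / n_j         # probability a patient is sampled if the hospital is selected
    rows.append((hospital, n_j, p_hospital, p_within, p_hospital * p_within))

for hospital, n_j, p_hospital, p_within, p_overall in rows:
    print(f"{hospital:>2} {n_j:>4} {p_hospital:7.2%} {p_within:7.2%} {p_overall:6.2%}")
```

The selection probabilities sum to 5, the number of hospitals we intend to select, and the overall probability is n / N = 500 / 10,000 = 5% in every row.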
To select 5 hospitals with these probabilities we can use systematic sampling.
This can be done as follows:
1. List the hospitals in a spreadsheet
2. Order the hospitals by a random number
3. For each hospital insert as many records as there are patients in that
hospital
4. Calculate a sampling interval by dividing the total number of patients by
the number of hospitals we want to select (in this case 10,000 divided
by 5 = 2,000)
5. Draw a systematic sample by selecting a random start between 1 and
2,000 and counting down the list of records, selecting the hospital to
which every 2,000th record belongs.
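The five steps can be sketched in Python. Rather than physically inserting one record per patient, this version works with running totals of patients, which picks out the same hospitals. The function name and the fixed random seed are our own choices for illustration.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_systematic_sample(sizes, n_select, rng):
    """Select n_select clusters with probability proportional to size.

    sizes maps a cluster label to its population. The interval is rounded
    down when the total does not divide evenly (an assumption of this sketch).
    """
    labels = list(sizes)
    rng.shuffle(labels)                                   # step 2: random order
    running = list(accumulate(sizes[k] for k in labels))  # step 3: record ranges
    interval = running[-1] // n_select                    # step 4: sampling interval
    start = rng.randint(1, interval)                      # step 5: random start
    hits = [start + i * interval for i in range(n_select)]
    # Record number r belongs to the first cluster whose running total reaches r.
    return [labels[bisect_left(running, r)] for r in hits]

patients = {1: 874, 2: 789, 3: 695, 4: 884, 5: 704, 6: 924,
            7: 712, 8: 877, 9: 801, 10: 999, 11: 950, 12: 791}

rng = random.Random(2024)  # fixed seed so the sketch is repeatable
selected = pps_systematic_sample(patients, n_select=5, rng=rng)
print(sorted(selected))
```

Because no hospital has more than 2,000 patients, no hospital can be picked twice here. A cluster larger than the sampling interval could be hit more than once and would need separate handling.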
Quality Implications of Cluster Sampling
We have established that the main advantage of cluster sampling is time and
cost savings. However, there are disadvantages to cluster sampling which
mean the technique should be treated with caution. For some variables,
individuals within the same cluster are more likely to be similar to each other
than individuals in different clusters.
Examples
1. Income levels are likely to be very different across geographic areas.
Therefore clustering may not be appropriate for a survey to estimate
average income in Scotland as it would severely reduce the quality of
any results.
2. The proportion of the population that is female is likely to be similar
across geographic areas. Therefore clustering is unlikely to affect
survey estimates of the proportion of the population that is female.
Clustering, therefore, makes us less certain our results reflect the whole
population and produces a larger standard error than the equivalent simple
random sample would.
Despite these disadvantages, clustering usually enables the selection of
larger samples than simple/stratified random sampling for the same cost.
Consequently, it is often possible to choose a sample large enough to offset
the loss of precision.
The selection process effectively comes down to a trade off between two
factors – precision and cost. If you require a very accurate and precise
sample, then simple/stratified random sampling is more appropriate whereas,
if cost and time are the more important factors in your considerations, then
cluster sampling is likely to be more suitable, so long as the sample is large
enough to offset the loss of precision.
Measuring the Impact of Clustering on Final Estimates
It is possible to evaluate the effect clustering has on each variable we collect
by estimating the design effect due to clustering. The larger the design effect
due to clustering the less precise our estimates are. The design effect due to
clustering is defined as:
Design Effect Due to Clustering = DEFFCL = $1 + \rho (b - 1)$
Where:
ρ is the intra-cluster correlation (see below for definition)
b is the average cluster sample size
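Reading the definition as DEFFCL = 1 + ρ(b − 1), a small Python sketch shows how quickly the design effect grows with cluster size. The values of ρ and b in the loop are arbitrary illustrations, apart from ρ = 0.27 and b = 4, which match the worked example later in this paper.

```python
def design_effect(rho, b):
    """Design effect due to clustering: 1 + rho * (b - 1)."""
    return 1 + rho * (b - 1)

for rho in (0.0, 0.05, 0.27):
    for b in (4, 20, 100):
        print(f"rho={rho:.2f}  b={b:>3}  DEFFCL={design_effect(rho, b):.2f}")
```

Even a small intra-cluster correlation produces a large design effect when many individuals are sampled from each cluster, which is one reason to prefer more clusters with fewer individuals in each.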
Intra-cluster Correlation
The intra-cluster correlation (ρ in the above equation) is a measure of how
strongly individuals in the same cluster resemble each other.
A worked example using these formulae is provided later in the paper.
If we define each observation as:
$$Y_{ji} = \mu + u_j + \varepsilon_{ji}$$
Where
$Y_{ji}$ is the i'th observation in cluster j,
μ is the overall mean,
$u_j$ is a random effect shared by all observations in cluster j,
$\varepsilon_{ji}$ is a random effect for the i'th observation in cluster j.
Then the intra-cluster correlation (ρ) is:
$$\rho = \frac{\mathrm{Variance}(u_j)}{\mathrm{Variance}(u_j) + \mathrm{Variance}(\varepsilon_{ji})}$$
This can be thought of as how much of the total variance in our sample is
explained by clustering and it can be calculated using the following formula:
$$\rho = \frac{\alpha - \beta}{\alpha + (b - 1)\beta}$$
Where:
α is the mean square between clusters, defined as
$$\alpha = \frac{\sum_j n_j (\bar{Y}_j - \bar{Y})^2}{z - 1}$$
Where $n_j$ = sample size of cluster j
$\bar{Y}_j$ = sample mean for cluster j
$\bar{Y}$ = overall mean for entire sample
z = number of clusters
β is the mean square within clusters, defined as
$$\beta = \frac{\sum_j \sum_i (Y_{ji} - \bar{Y}_j)^2}{n - z}$$
Where $Y_{ji}$ = observation i in cluster j
$\bar{Y}_j$ = sample mean for cluster j
n = total sample size
z = number of clusters
b is the average cluster sample size
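To see that the mean-square formula recovers the variance ratio, the sketch below simulates data in which each observation is an overall mean plus a cluster effect plus an individual effect, then applies the formula. The normal distributions, the standard deviations (1 and 2, giving a true ρ of 0.2), and the 2,000 clusters of 5 are all assumptions made for the illustration.

```python
import random

def mean_squares(clusters):
    """Return (alpha, beta): the mean squares between and within clusters."""
    n = sum(len(c) for c in clusters)
    z = len(clusters)
    overall_mean = sum(sum(c) for c in clusters) / n
    between = 0.0
    within = 0.0
    for c in clusters:
        cluster_mean = sum(c) / len(c)
        between += len(c) * (cluster_mean - overall_mean) ** 2
        within += sum((y - cluster_mean) ** 2 for y in c)
    return between / (z - 1), within / (n - z)

def intra_cluster_correlation(clusters):
    """Estimate rho as (alpha - beta) / (alpha + (b - 1) * beta)."""
    alpha, beta = mean_squares(clusters)
    b = sum(len(c) for c in clusters) / len(clusters)  # average cluster sample size
    return (alpha - beta) / (alpha + (b - 1) * beta)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
mu, sd_cluster, sd_individual = 50.0, 1.0, 2.0  # true rho = 1 / (1 + 4) = 0.2

clusters = []
for _ in range(2000):
    cluster_effect = rng.gauss(0, sd_cluster)
    clusters.append([mu + cluster_effect + rng.gauss(0, sd_individual) for _ in range(5)])

rho_hat = intra_cluster_correlation(clusters)
print(round(rho_hat, 3))
```

With this many clusters the estimate should land close to the true value of 0.2; with only a handful of clusters, as in most real surveys, it can be much noisier.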
The method used to calculate design effects is designed for continuous
variables. When used for proportions the analysis tends to give biased results
for the intra-cluster correlation coefficient. However, this bias tends to be
small for proportions between 0.2 and 0.8.
Effective Sample Size
The design effect due to clustering can be used to calculate the effective
sample size.
Effective sample size = actual sample size / DEFFCL
The effective sample size is the simple random sample equivalent of the
clustered sample – i.e. how large would a simple random sample have to be
to give the same statistical power as the clustered sample.
If we have prior knowledge of what the design effect due to clustering is likely
to be we can use this to calculate how many individuals need to be included in
our sample.
Example
We have conducted a survey to estimate what proportion of people in
Scotland have ever travelled abroad. The sample was clustered and included
2,500 individuals and gave us an estimate of 67%. The design effect due to
clustering was 1.8.
The effective sample size is:
2,500 / 1.8 = 1,388.89
So the clustered sample of 2,500 people has the same precision, and the
same statistical power, as a simple random sample of about 1,389 people.
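The calculation also runs in reverse for planning purposes. The sketch below reproduces the figure from the example and then shows how many people a clustered sample would need in order to match a simple random sample of a given size. The target of 1,000 is a hypothetical figure, and the function names are our own.

```python
import math

def effective_sample_size(actual_n, deff):
    """Simple random sample size with the same precision as the clustered sample."""
    return actual_n / deff

def required_clustered_sample(target_effective_n, deff):
    """Clustered sample size needed to match a simple random sample of the target size."""
    # Rounding first guards against floating-point noise before rounding up.
    return math.ceil(round(target_effective_n * deff, 9))

ess = effective_sample_size(2500, 1.8)
needed = required_clustered_sample(1000, 1.8)  # hypothetical target of 1,000
print(round(ess, 2))
print(needed)
```

This assumes the design effect observed in an earlier survey carries over to the new one, which holds only if the cluster sample size and the variable of interest are similar.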
Complex Standard Errors
The design effect due to clustering can also be used to calculate a complex
standard error. To do this we must calculate the design factor due to
clustering.
Design factor due to clustering = DEFTCL = $\sqrt{\mathrm{DEFFCL}}$
The complex standard error is:
Complex standard error = standard error × DEFTCL
Details on calculating standard errors are available in Tier 2 - Confidence
Intervals.
The complex standard error can be used to produce confidence intervals
which take into account the design effect due to clustering.
95% Confidence Interval = estimate ± 1.96 × complex standard error
DEFTCL can be thought of as the factor by which confidence intervals increase
because the sample was clustered.
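As an illustration, the sketch below applies these steps to the travel survey from the Effective Sample Size example (an estimate of 67%, 2,500 respondents, DEFFCL of 1.8). The simple random sample standard error for a proportion, sqrt(p(1 − p) / n), is an assumption brought in for the illustration; this paper defers the details of standard errors to Tier 2 - Confidence Intervals.

```python
import math

def clustered_confidence_interval(estimate, standard_error, deff, z=1.96):
    """Return (lower, upper, complex_se) allowing for the design effect due to clustering."""
    deft = math.sqrt(deff)                # design factor due to clustering
    complex_se = standard_error * deft
    return estimate - z * complex_se, estimate + z * complex_se, complex_se

p, n, deff = 0.67, 2500, 1.8
srs_se = math.sqrt(p * (1 - p) / n)       # assumed SRS standard error for a proportion

low, high, complex_se = clustered_confidence_interval(p, srs_se, deff)
print(f"simple random sample SE: {srs_se:.4f}")
print(f"complex SE:              {complex_se:.4f}")
print(f"95% CI:                  {low:.3f} to {high:.3f}")
```

The interval is wider than the one a simple random sample of 2,500 would give by a factor of DEFTCL, the square root of 1.8.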
Example
A restaurant chain has twenty outlets in Scotland. They intend to buy new
aprons for all staff. If they buy all the aprons in the same size they will receive
a significant discount. They therefore want to know the average height of
their staff before placing the apron order.
They decide to use a cluster sample and randomly select three outlets to
include in the sample. They will measure the height of four employees in
each of the outlets.
The probability of each outlet being included in the sample will be proportional
to size.
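| Outlet | Number of staff | Outlet | Number of staff |
|---|---|---|---|
| 1 | 41 | 11 | 25 |
| 2 | 32 | 12 | 30 |
| 3 | 23 | 13 | 24 |
| 4 | 26 | 14 | 41 |
| 5 | 30 | 15 | 28 |
| 6 | 37 | 16 | 37 |
| 7 | 35 | 17 | 30 |
| 8 | 25 | 18 | 41 |
| 9 | 33 | 19 | 28 |
| 10 | 38 | 20 | 26 |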
The restaurant owner uses the method described in ‘Selecting a Cluster
Sample’ to pick three outlets. They randomly select outlets 1, 8 and 10.
| Outlet | Number of staff |
|---|---|
| 1 | 41 |
| 8 | 25 |
| 10 | 38 |

Within each restaurant they randomly choose 4 staff members and measure
their height. They get the following results:

| Outlet | Observation number | Height (cm) |
|---|---|---|
| 1 | 1 | 172 |
| 1 | 2 | 165 |
| 1 | 3 | 157 |
| 1 | 4 | 166 |
| 8 | 5 | 184 |
| 8 | 6 | 179 |
| 8 | 7 | 164 |
| 8 | 8 | 177 |
| 10 | 9 | 161 |
| 10 | 10 | 168 |
| 10 | 11 | 176 |
| 10 | 12 | 171 |
The restaurant owner then calculates the mean and the 95% confidence
intervals around the mean.
Mean = (172 + 165 + 157 + 166 + 184 + … + 171) / 12 = 170
Confidence intervals
= 170 ± standard error × 1.96
= 170 ± 2.3 × 1.96
= 170 ± 4.5
= (165.5, 174.5)
The restaurant owner now wants to assess what impact the clustering has on
the estimates. They therefore want to calculate the design effect due to
clustering.
Firstly they must calculate the intra-cluster correlation. To do this they must
calculate the mean square within clusters and the mean square between
clusters.
Mean square within clusters:
The formula is
$$\beta = \frac{\sum_j \sum_i (Y_{ji} - \bar{Y}_j)^2}{n - z}$$
so:

| Outlet (j) | Observation number (i) | Height, $Y_{ji}$ | Mean height for outlet, $\bar{Y}_j$ | Difference between observation and mean height for outlet, $Y_{ji} - \bar{Y}_j$ | Square of difference between observation and mean height for outlet, $(Y_{ji} - \bar{Y}_j)^2$ |
|---|---|---|---|---|---|
| 1 | 1 | 172 | 165 | 7 | 49 |
| 1 | 2 | 165 | 165 | 0 | 0 |
| 1 | 3 | 157 | 165 | -8 | 64 |
| 1 | 4 | 166 | 165 | 1 | 1 |
| 8 | 5 | 184 | 176 | 8 | 64 |
| 8 | 6 | 179 | 176 | 3 | 9 |
| 8 | 7 | 164 | 176 | -12 | 144 |
| 8 | 8 | 177 | 176 | 1 | 1 |
| 10 | 9 | 161 | 169 | -8 | 64 |
| 10 | 10 | 168 | 169 | -1 | 1 |
| 10 | 11 | 176 | 169 | 7 | 49 |
| 10 | 12 | 171 | 169 | 2 | 4 |

There are 12 observations and 3 groups so n - z = 12 - 3 = 9
The mean square within clusters is:
$$\frac{49 + 0 + 64 + 1 + 64 + 9 + 144 + 1 + 64 + 1 + 49 + 4}{9} = \frac{450}{9} = 50$$
Mean square between clusters:
The formula is
$$\alpha = \frac{\sum_j n_j (\bar{Y}_j - \bar{Y})^2}{z - 1}$$
so:

| Outlet (j) | Mean height for each outlet, $\bar{Y}_j$ | Overall mean height, $\bar{Y}$ | Difference between mean height for each outlet and overall mean height, $\bar{Y}_j - \bar{Y}$ | Square of difference between mean height for each outlet and overall mean height, $(\bar{Y}_j - \bar{Y})^2$ | Outlet sample size, $n_j$ |
|---|---|---|---|---|---|
| 1 | 165 | 170 | -5 | 25 | 4 |
| 8 | 176 | 170 | 6 | 36 | 4 |
| 10 | 169 | 170 | -1 | 1 | 4 |

There are three clusters so z = 3.
The mean square between clusters is:
$$\frac{25 \times 4 + 36 \times 4 + 1 \times 4}{3 - 1} = \frac{248}{2} = 124$$
The intra-cluster correlation is therefore:
$$\rho = \frac{\alpha - \beta}{\alpha + (b - 1)\beta} = \frac{124 - 50}{124 + (4 - 1) \times 50} = \frac{74}{274} = 0.27$$
The design effect due to clustering (DEFFCL) is therefore:
$$1 + \rho (b - 1) = 1 + 0.27 \times (4 - 1) = 1 + 0.81 = 1.81$$
The effective sample size is $12 / 1.81 \approx 6.6$. This means the clustered sample the
restaurant chain took has the same power as a simple random sample of
only 6 or 7 people.
The complex standard error is:
Complex standard error
= standard error × $\sqrt{\mathrm{DEFFCL}}$
= 2.3 × $\sqrt{1.81}$
= 3.09
Our 95% confidence intervals become:
Confidence interval = 170 ± complex standard error × 1.96
= 170 ± 3.09 × 1.96
= 170 ± 6.06
= (163.9, 176.1)
This means that the clustering has added 1.6 cm onto our confidence limits on
either side.
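The whole worked example can be checked with a short Python sketch using only the twelve heights. The code carries full precision through each step, whereas the text rounds at each stage, so it agrees with the figures above once rounded. The variable names are our own.

```python
import math
import statistics

heights_by_outlet = {
    1: [172, 165, 157, 166],
    8: [184, 179, 164, 177],
    10: [161, 168, 176, 171],
}

all_heights = [h for hs in heights_by_outlet.values() for h in hs]
n = len(all_heights)          # total sample size
z = len(heights_by_outlet)    # number of clusters
b = n / z                     # average cluster sample size
overall_mean = statistics.mean(all_heights)

# Mean square between clusters (alpha) and within clusters (beta).
alpha = sum(
    len(hs) * (statistics.mean(hs) - overall_mean) ** 2
    for hs in heights_by_outlet.values()
) / (z - 1)
beta = sum(
    (h - statistics.mean(hs)) ** 2
    for hs in heights_by_outlet.values()
    for h in hs
) / (n - z)

rho = (alpha - beta) / (alpha + (b - 1) * beta)
deff = 1 + rho * (b - 1)
srs_se = statistics.stdev(all_heights) / math.sqrt(n)  # standard error ignoring clustering
complex_se = srs_se * math.sqrt(deff)
half_width = 1.96 * complex_se

print(f"alpha={alpha:.0f} beta={beta:.0f} rho={rho:.2f} DEFFCL={deff:.2f}")
print(f"effective sample size={n / deff:.1f}")
print(f"complex SE={complex_se:.2f}")
print(f"95% CI=({overall_mean - half_width:.1f}, {overall_mean + half_width:.1f})")
```

With only three clusters the intra-cluster correlation is itself estimated very imprecisely, so the design effect here is best read as an illustration of the arithmetic.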
Notes:
The design effect due to clustering is different for every estimate in a cluster
sample. It is often unfeasible to calculate the design effect due to clustering
for all variables measured. Therefore, it is appropriate to calculate the design
effect due to clustering for the most important variables collected in the
sample and the variables likely to have the greatest geographic differences.
There are other design effects which impact on the size of the confidence
interval and effective sample size. For example, the design effect of
proportionate stratification will be less than one and will therefore reduce the
size of confidence intervals and increase the effective sample size.
The design effect due to clustering is greater than 1 whenever the intra-cluster
correlation is positive, which is almost always the case in practice. This means
the precision of our estimates is very rarely improved by clustering. However, if
the design effect due to clustering is very close to 1, the effect of clustering on
the precision of our estimates is very small.
Further Information
Tier 1 Sampling | Social Survey Design
Tier 2 Confidence Intervals | Simple Random Sampling | Stratified Sampling