First half of the cluster/systematic sampling lectures for the ASTI and

SECTION 6
CLUSTER SAMPLING AND SYSTEMATIC LIST SAMPLING
6.1 What is Cluster Sampling?
6.1.1 Definition
Cluster Sampling: Probability sampling in which
sampling units at some point in the selection process are
collections, or clusters, of population elements.
6.1.2 Selecting a Simple One-Stage Cluster Sample
A. Specify appropriate clusters. In most instances
clusters would consist of population elements
which are physically close together and therefore
relatively similar to each other (i.e., with high
intra-cluster homogeneity). Clusters almost
always have varying numbers of population
members.
B. Select a simple random sample of clusters from a
complete list of clusters.
C. Collect survey data from all population elements
falling in the selected clusters.
Note:
We have considered here a special type
of cluster sampling in which the
probability sampling method used to
select clusters is simple random
sampling. However, any form of
probability sampling as well as
stratification can be used to choose the
clusters.
6.1.3 Some Examples
A. In a household survey for a small city, a
probability sample of blocks is selected. Each
block in this case represents a cluster of
households. Households may or may not be
subsampled within selected blocks.
B. In a national sample of inpatient hospital visits for
individuals with multiple sclerosis during some
calendar year, a probability sample of hospitals is
chosen. Each hospital in this instance represents a
cluster of visits by patients with multiple sclerosis
who were discharged during that year.
C. In a survey of first graders in the schools of a
state, a probability sample of schools is selected.
All first graders in a school would represent a
cluster in this design.
6.1.4 Some Implications of Using Cluster Sampling
A. If we collect survey data on all members of each
selected cluster (i.e., we have a "one-stage" sample
of members), the probability of choosing each
element in this cluster is the same as the
probability of choosing the cluster regardless of
the number of elements in the cluster. For
example, if we have 100 clusters of unequal size
and we randomly choose 10 of them, the
probability of selection for each element in the 10
selected clusters is 0.1 regardless of the sizes of
the selected clusters.
B. Cluster sampling generally yields estimates with
relatively larger variances (i.e., lower precision)
than samples of the same size which are chosen by
element (i.e., non-cluster) sampling. The amount
of the increase in variance is directly related to the
average sample cluster size.
C. Because members of clusters are often close in
geographic proximity, the average cost per sample
element can be reduced substantially over element
sampling if cluster sampling is used. The amount
of the reduction in costs is directly related to the
average size of the clusters that are used.
D. Since elements in clusters are usually similar (i.e.,
clusters are internally homogeneous), the amount
of information gathered by the survey may not be
increased substantially as new measurements are
taken within clusters. This tells us that sample
cluster sizes should not be too large. As a general
rule, the number of clusters in the population
should be large which means that the average size
of clusters should be kept as small as possible.
E. The survey statistician
frequently has some choice in the size of clusters
that are used in a survey. In making this choice,
the cost advantages of large (sample) clusters
must be properly weighed against the statistical
advantages of smaller (sample) clusters.
F. Analysis of data from cluster samples is frequently
somewhat more complex than analysis of data
from element samples.
G. Cluster sampling eliminates the need for a
sampling frame consisting of a list of all elements
in the population. Since clusters are the units
being sampled, a listing of all clusters in the
population constitutes an appropriate frame.
H. V(n) > 0 is possible if clusters vary in size; that is,
the total sample size n can differ from one sample of
clusters to the next.
(Skip to Section 6.5)
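Implications A and H above can be checked with a short simulation. The sketch below is illustrative only: the cluster sizes, the seed, and the function name `one_stage_cluster_sample` are assumptions and do not come from the lectures. It draws repeated simple random samples of 10 clusters out of 100 and records the resulting sample size n.

```python
import random
import statistics


def one_stage_cluster_sample(cluster_sizes, a, rng):
    """Choose a clusters by simple random sampling and take every element in them.

    Returns the chosen cluster indices and the resulting element sample size n.
    """
    chosen = rng.sample(range(len(cluster_sizes)), a)
    n = sum(cluster_sizes[j] for j in chosen)
    return chosen, n


rng = random.Random(2024)          # fixed seed, chosen arbitrarily
A, a = 100, 10

# Hypothetical populations: clusters of sizes 1..100 versus clusters of size 20.
unequal_sizes = list(range(1, A + 1))
equal_sizes = [20] * A

# Implication A: each element is selected exactly when its cluster is selected,
# so its selection probability is a/A whatever the cluster size.
element_probability = a / A

# Implication H: n varies across samples only when cluster sizes vary.
n_unequal = [one_stage_cluster_sample(unequal_sizes, a, rng)[1] for _ in range(500)]
n_equal = [one_stage_cluster_sample(equal_sizes, a, rng)[1] for _ in range(500)]

var_n_unequal = statistics.pvariance(n_unequal)
var_n_equal = statistics.pvariance(n_equal)

print("selection probability per element:", element_probability)
print("V(n), unequal cluster sizes:", var_n_unequal)
print("V(n), equal cluster sizes:", var_n_equal)
```

With equal-sized clusters every sample contains the same number of elements, so the simulated V(n) is zero. With unequal sizes it is positive.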
6.2 Estimating a Population Mean from a Simple
Cluster Sample
6.2.1 Setting
A. A population of elements is divided into clusters
so that each element is a member of only one
cluster.
B. We wish to estimate the mean per element for
some characteristic in the population (denoted by
the symbol $\bar{Y}$).
C. The sampling method we decide to use in selecting
our sample of clusters is simple random sampling.
Other more complex sampling methods involving
stratification might have also been used.
D. Data are collected from all elements in the sample
of clusters.
6.2.2 Some Definitions
N = Number of elements in the population
A = Number of clusters in the population
$\bar{B} = N/A$ = Average number of elements per cluster in
the population
a = Number of clusters chosen in the sample
$n_j$ = Number of population elements which fall in the
j-th cluster
$n = \sum_{j=1}^{a} n_j$ = Total number of elements in the clusters
chosen in the sample
$\bar{b} = n/a$ = Average number of elements per cluster in the
sample
$y_j$ = Total of all observations in the j-th cluster
6.2.3 Estimator of $\bar{Y}$
$$ r = \frac{\sum_{j=1}^{a} y_j}{\sum_{j=1}^{a} n_j} = \frac{\sum_{j=1}^{a} y_j}{n} \tag{6.1} $$
Note: The symbol "r" is used since the estimator in
formula (6.1) is usually called a ratio mean.
6.2.4 Estimated Variance of r
A. If the average cluster size in the population ($\bar{B}$) is
known:
$$ v(r) = \left[\frac{A - a}{A\,a\,\bar{B}^2}\right] s_c^2 \tag{6.2} $$
where $s_c^2$ is a measure of variability among clusters
calculated by the formula
$$ s_c^2 = \frac{\sum_{j=1}^{a} (y_j - r n_j)^2}{a - 1} = \frac{\sum_{j=1}^{a} y_j^2 + r^2 \sum_{j=1}^{a} n_j^2 - 2r \sum_{j=1}^{a} y_j n_j}{a - 1} \tag{6.3} $$
B. If the average cluster size in the population ($\bar{B}$) is
not known:
$$ v(r) = \left[\frac{(A - a)\,a}{A\,n^2}\right] s_c^2 \tag{6.4} $$
where $s_c^2$ is defined in formula (6.3).
6.2.5 Standard Error of r
$$ se(r) = \sqrt{v(r)} \tag{6.5} $$
where the value of $v(r)$ would depend on whether
formula (6.2) or (6.4) was used.
6.2.6 Confidence Intervals for $\bar{Y}$ (n>50)
$$ \text{Lower Boundary: } r - t\,se(r) \tag{6.6} $$
$$ \text{Upper Boundary: } r + t\,se(r) \tag{6.7} $$
The value of t depends on the confidence level we
wish for the confidence interval (see Section 2.2.7).
Interpretation (t=1.96): We are 95 percent sure that $\bar{Y}$ is
covered by the interval whose boundaries are computed by
formulas (6.6) and (6.7).
A. Formulas in Section 6.2 can be used to estimate a
population proportion (P) with some attribute.
This is done by letting $y_j$ represent the number of
elements possessing the attribute in the j-th cluster
in the sample.
B. r is not an unbiased estimator of $\bar{Y}$ (or P for
proportions) whenever we sample clusters of
unequal size. The bias emerges since different
cluster samples are likely to yield different values
of n (i.e., n is a random variable). The amount of
bias is directly related to the variability in size
among clusters in the population. The survey
statistician must therefore find some ways to
reduce this variability.
C. Some common methods for reducing variability in
size among clusters:
(1) Stratify clusters by size (i.e., by $n_j$).
(2) Divide large clusters into smaller ones.
(3) Combine smaller clusters into larger ones.
(4) Select clusters with probabilities proportional
to size (i.e., to $n_j$) and subsample within
clusters. This is the so-called method of
multi-stage sampling which is discussed
briefly below. For additional reading see
Chapter 7 of Kish [2].
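The bias described in note B can be seen exactly in a very small population by enumerating every possible sample. The sketch below is only an illustration: the two four-cluster populations and the function names are hypothetical, not taken from the lectures. Each cluster is written as a pair (n_j, y_j), and r from formula (6.1) is averaged over all simple random samples of a = 2 clusters.

```python
from fractions import Fraction
from itertools import combinations


def expected_ratio_mean(clusters, a):
    """Average of r = sum(y_j) / sum(n_j) over every possible SRS of a clusters."""
    samples = list(combinations(clusters, a))
    total = sum(
        Fraction(sum(y for _, y in sample), sum(n for n, _ in sample))
        for sample in samples
    )
    return total / len(samples)


def population_mean(clusters):
    """True mean per element: all observations divided by all elements."""
    return Fraction(sum(y for _, y in clusters), sum(n for n, _ in clusters))


# Hypothetical populations of A = 4 clusters, each given as (n_j, y_j).
unequal = [(1, 1), (2, 6), (3, 3), (4, 20)]
equal = [(2, 1), (2, 6), (2, 3), (2, 20)]

bias_unequal = expected_ratio_mean(unequal, 2) - population_mean(unequal)
bias_equal = expected_ratio_mean(equal, 2) - population_mean(equal)

print("bias with unequal cluster sizes:", bias_unequal)
print("bias with equal cluster sizes:", bias_equal)
```

Exact fractions are used so that a zero bias shows up as exactly zero rather than as floating-point noise. In this toy population the bias vanishes when all clusters have the same size, because n is then the same in every sample.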
6.3 Estimating a Population Total from a Simple Cluster
Sample
6.3.1 Setting
A. A population of elements is divided into clusters.
B. We wish to estimate the aggregate total of some
characteristic in the population (denoted by the
symbol $\overset{o}{Y}$).
C. A simple random sample of clusters is selected.
D. Data are collected from all elements in the sample
of clusters.
6.3.2 Some Definitions
We use the same symbols for formulas to estimate
totals as we did in Section 6.2.2.
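6.3.3 Estimator of $\overset{o}{Y}$
A. If N is known:
$$ \overset{o}{r}_1 = \left[\frac{N}{n}\right] \sum_{j=1}^{a} y_j = N r \tag{6.8} $$
B. If N is not known:
$$ \overset{o}{r}_2 = \left[\frac{A}{a}\right] \sum_{j=1}^{a} y_j = A\,\bar{y}_t \tag{6.9} $$
where $\bar{y}_t$ is the average total among clusters in the
sample which is calculated with the formula
$$ \bar{y}_t = \frac{\sum_{j=1}^{a} y_j}{a} \tag{6.10} $$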
6.3.4 Estimated Variance of $\overset{o}{r}$
A. If N is known:
$$ v(\overset{o}{r}_1) = N^2\,v(r) = \left[\frac{A(A - a)}{a}\right] s_c^2 \tag{6.11} $$
where $s_c^2$ is defined in formula (6.3).
B. If N is not known:
$$ v(\overset{o}{r}_2) = A^2\,v(\bar{y}_t) = \left[\frac{A(A - a)}{a}\right] s_t^2 \tag{6.12} $$
where $s_t^2$ measures the variability among cluster
totals calculated by the formula:
$$ s_t^2 = \frac{\sum_{j=1}^{a} (y_j - \bar{y}_t)^2}{a - 1} = \frac{a \sum_{j=1}^{a} y_j^2 - \left[\sum_{j=1}^{a} y_j\right]^2}{a(a - 1)} \tag{6.13} $$
6.3.5 Standard Error of $\overset{o}{r}$
$$ se(\overset{o}{r}) = \sqrt{v(\overset{o}{r})} \tag{6.14} $$
where the value of $v(\overset{o}{r})$ would depend on whether
formula (6.11) or (6.12) was used.
6.3.6 Confidence Intervals for $\overset{o}{Y}$
$$ \text{Lower Boundary: } \overset{o}{r} - t\left[se(\overset{o}{r})\right] \tag{6.15} $$
$$ \text{Upper Boundary: } \overset{o}{r} + t\left[se(\overset{o}{r})\right] \tag{6.16} $$
The value of t depends on the confidence level we wish
for the confidence interval (see Section 2.2.7).
Interpretation (t=1.96): We are 95 percent sure that $\overset{o}{Y}$ is
covered by the interval whose boundaries are computed by formulas
(6.15) and (6.16).
6.4 Illustrative Example of Cluster Sampling
6.4.1 Setting
A. A household survey on health care utilization is
done in a small city. Specifically, the objective is
to estimate the following:
(1) $\bar{Y}$: The average number of health care visits
per household in the city
(2) $\overset{o}{Y}$: The total number of health care visits in
the city.
B. The city consists of 415 blocks, which are used as
clusters for the survey. A list of the city's blocks is
prepared.
C. The decision is made to select a simple random
sample of 25 blocks and to measure the number of
health care visits by each household in the
sampled blocks.
D. Results of the survey are summarized in Table 6.1.
Table 6.1 Summary of Results from Health Care Utilization Survey

Cluster   Number of     Number of Visits by    |  Cluster   Number of     Number of Visits by
   j      Households    Cluster Members        |     j      Households    Cluster Members
          ($n_j$)       ($y_j$)                |            ($n_j$)       ($y_j$)
   1          8              96                |    14         10              49
   2         12             121                |    15          9              53
   3          4              42                |    16          3              50
   4          5              65                |    17          6              32
   5          6              52                |    18          5              22
   6          6              40                |    19          5              45
   7          7              75                |    20          4              37
   8          5              65                |    21          6              51
   9          8              45                |    22          8              30
  10          3              50                |    23          7              39
  11          2              85                |    24          3              47
  12          6              43                |    25          8              41
  13          5              54                |

Summary Statistics:
a = 25
A = 415
$n = \sum_{j=1}^{a} n_j = 151$
$\sum_{j=1}^{a} y_j = 1,329$
$\sum_{j=1}^{a} y_j^2 = 82,039$
$\sum_{j=1}^{a} n_j^2 = 1,047$
$\sum_{j=1}^{a} y_j n_j = 8,403$
6.4.2 Analysis (N and $\bar{B}$ are unknown)
A. Estimate of $\bar{Y}$:
$$ r = \frac{\sum_{j=1}^{a} y_j}{n} = \frac{1329}{151} = 8.801 $$
$$ s_c^2 = \frac{\sum_{j=1}^{a} y_j^2 + r^2 \sum_{j=1}^{a} n_j^2 - 2r \sum_{j=1}^{a} y_j n_j}{a - 1} = \frac{82,039 + (8.801)^2 (1047) - 2(8.801)(8403)}{24} = 634.479 $$
$$ v(r) = \left[\frac{(A - a)\,a}{A\,n^2}\right] s_c^2 = \left[\frac{(415 - 25)(25)}{(415)(151)^2}\right] (634.479) = 0.6538 $$
$$ se(r) = \sqrt{v(r)} = \sqrt{0.6538} = 0.8086 $$
95 percent confidence interval for $\bar{Y}$:
Lower Boundary: 8.801 - (1.96)(0.8086) = 7.216
Upper Boundary: 8.801 + (1.96)(0.8086) = 10.386
Interpretation: We are 95 percent sure that $\bar{Y}$ lies
somewhere between 7.216 and 10.386 visits.
B. Estimate of $\overset{o}{Y}$:
$$ \overset{o}{r}_2 = \left[\frac{A}{a}\right] \sum_{j=1}^{a} y_j = \left[\frac{415}{25}\right] (1329) = 22,061 $$
$$ s_t^2 = \frac{a \sum_{j=1}^{a} y_j^2 - \left[\sum_{j=1}^{a} y_j\right]^2}{a(a - 1)} = \frac{(25)(82,039) - (1329)^2}{(25)(24)} = 474.557 $$
$$ v(\overset{o}{r}_2) = \left[\frac{A(A - a)}{a}\right] s_t^2 = \left[\frac{(415)(415 - 25)}{25}\right] (474.557) = 3,072,282 $$
$$ se(\overset{o}{r}_2) = \sqrt{v(\overset{o}{r}_2)} = \sqrt{3,072,282} = 1,753 $$
95 percent confidence interval for $\overset{o}{Y}$:
Lower Boundary: 22,061 - (1.96)(1753) = 18,625
Upper Boundary: 22,061 + (1.96)(1753) = 25,497
Interpretation: We are 95 percent sure that $\overset{o}{Y}$ lies
somewhere between 18,625 and 25,497
visits.
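The hand calculations of Section 6.4.2 can be reproduced with a few lines of Python. This is a sketch rather than part of the original lectures. The data are those of Table 6.1, while the variable names are chosen here for illustration. Because the code carries r at full precision instead of rounding it to 8.801, intermediate values such as $s_c^2$ and the interval boundaries can differ from the hand-worked figures in the last digit or two.

```python
import math

# Table 6.1: households (n_j) and visits (y_j) for the a = 25 sampled blocks.
n_j = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
       10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
y_j = [96, 121, 42, 65, 52, 40, 75, 65, 45, 50, 85, 43, 54,
       49, 53, 50, 32, 22, 45, 37, 51, 30, 39, 47, 41]

A = 415            # clusters (blocks) in the population
a = len(n_j)       # clusters in the sample
n = sum(n_j)       # households in the sampled clusters
t = 1.96           # multiplier for a 95 percent interval

# Mean per household: ratio mean (6.1), s_c^2 (6.3), variance (6.4), se (6.5).
r = sum(y_j) / n
s2_c = sum((y - r * m) ** 2 for y, m in zip(y_j, n_j)) / (a - 1)
v_r = (A - a) * a / (A * n ** 2) * s2_c
se_r = math.sqrt(v_r)
ci_mean = (r - t * se_r, r + t * se_r)

# Total visits with N unknown: (6.9), (6.10), (6.12), (6.13), (6.14).
y_bar_t = sum(y_j) / a
total_hat = A * y_bar_t
s2_t = sum((y - y_bar_t) ** 2 for y in y_j) / (a - 1)
v_total = A * (A - a) / a * s2_t
se_total = math.sqrt(v_total)
ci_total = (total_hat - t * se_total, total_hat + t * se_total)

print(f"r = {r:.3f}, se(r) = {se_r:.4f}")
print(f"95% CI for the mean: {ci_mean[0]:.3f} to {ci_mean[1]:.3f}")
print(f"total = {total_hat:.0f}, se(total) = {se_total:.0f}")
print(f"95% CI for the total: {ci_total[0]:.0f} to {ci_total[1]:.0f}")
```

The deviation form of $s_c^2$ and $s_t^2$ is used here instead of the computational forms in (6.3) and (6.13); the two are algebraically identical.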
6.5 Multi-Stage Cluster Sampling
6.5.1 Definition
Multi-Stage Cluster Sampling: A method of probability sampling
in which the sample of elements is chosen in two or more stages.
Second stage sampling units are chosen from the sampling units
selected in the first stage. Third stage units are chosen from
second stage sampling units; and so forth.
6.5.2 Selecting a Two-Stage Sample
A. A list of first stage sampling units is prepared.
Sampling units in the first stage of a multi-stage
sample are called primary sampling units, or
PSUs.
B. A probability sample of first stage units is chosen.
C. In each of the selected first stage units a list of
population elements is prepared.
D. A separate probability sample of elements is
selected in each of the selected first stage units.
6.5.3 Some Examples
A. Household sample of the non-institutionalized
population in Virginia
Primary Sampling Unit: Minor Civil Divisions
Secondary Sampling Units: Small Groups of Blocks
Tertiary Sampling Units: Households
B. National Sample of Hospital Discharges
PSU: Small Groups of Counties
SSU: Hospitals
TSU: Patient Medical Records
6.5.4 Some Implications of Multi-Stage Cluster
Sampling
A. Most of the cost savings can be retained while
gaining back some of the statistical losses (i.e.,
larger variances) of cluster sampling.
B. Analysis of data collected in multi-stage designs
can become increasingly more complex as the
number of stages increases.
6.5.5 Analysis of Data from Multi-Stage Cluster
Samples
Analysis of data from multi-stage cluster samples is
somewhat beyond the scope of these lectures.
However, it is important to note that features of the
design used to choose the sample do affect the
approach and results of statistical inference. Further
discussion of this topic can be found in the
following references:
[1] Wolter, K., Introduction to Variance Estimation,
Springer-Verlag, New York, 1985.
[2] Kish, L., Survey Sampling, Wiley and Sons, New
York, 1965, Chapters 6 and 7.
[3] Cochran, W.G., Sampling Techniques, Third
Edition, Wiley and Sons, 1977, Chapters 8, 10, and
11.
6.6 PPS Cluster Sampling
6.6.1 General:
A. Probability Proportional to Size.
B. Used in conjunction with cluster samples selected
in two or more stages.
C. "Size" = Some measure which is indicative (often
crudely) of the number of analysis units in a PSU
(e.g., in the 1980 Somalia survey, "size" was a recent
administrative count of the number of households
in the PSU, divided by 40 and rounded).
D. PPS selection is invoked in selecting clusters (i.e.,
usually used in all but the final stage of sampling).
E. Both with and without replacement methods
available; latter generally more complicated.
6.6.2 Usefulness of PPS Sampling
A. Creates equal-probability samples in which number
of respondents per cluster is nearly equal; this
simplifies data collection.
B. Reduces the amount of variation in the overall
sample size ($n_o$), which improves the statistical
quality of survey estimates.
Example: Unstratified Two-Stage Sample of 500 Dwellings in
Atlanta
DESIGN ALTERNATIVE 1 --- TWO-STAGE PPS:
PSU = Census block (select with PPS)
SSU = Dwelling (select with SRS using sampling
rate determined by PSU "size")
Two-Stage PPS Sample Selection Steps:
(1) List all A blocks in Atlanta (to create the first stage
sampling frame), noting $B_\alpha$ (the number of
dwellings in the $\alpha$-th block), or some reasonable
measure of $B_\alpha$.
(2) Select 100 blocks with probability proportional to
size (PPS); thus,
$$ \pi_\alpha = \Pr\{\text{Select } \alpha\text{-th block in Atlanta}\} = \frac{100\,B_\alpha}{B} $$
where
$$ B = \sum_{\alpha=1}^{A} B_\alpha = \text{Number of dwellings in Atlanta} $$
(3) Within each selected block select a simple random
sample (SRS) of exactly 5 dwellings; thus,
$$ \pi_{\beta|\alpha} = \Pr\{\text{Select a dwelling in } \alpha\text{-th block} \mid \text{select } \alpha\text{-th block}\} = \frac{5}{B_\alpha} $$
(4) Thus, the probability of selecting any household in
Atlanta by this PPS design is
$$ \pi_{\alpha\beta} = \pi_\alpha\,\pi_{\beta|\alpha} = \left[\frac{100\,B_\alpha}{B}\right] \left[\frac{5}{B_\alpha}\right] = \frac{500}{B} $$
Implications of Two-Stage PPS Design:
+ Constant sample size (5) in each sample
block
+ Constant total sample size (500) among
all possible two-stage samples; i.e.,
$V(n_o) = 0$
+ Equal selection probabilities for
dwellings in Atlanta
DESIGN ALTERNATIVE 2 --- TWO-STAGE SRS:
PSU = Census block (select with SRS)
SSU = Dwelling (select with SRS using sampling
rate determined by PSU "size")
Two-Stage SRS Sample Selection Steps:
(1) List all A blocks in Atlanta (to create the first stage
sampling frame)
(2) Select 100 blocks with SRS; thus,
$$ \pi_\alpha = \Pr\{\text{Select } \alpha\text{-th block in Atlanta}\} = \frac{100}{A} $$
(3) Within each selected block select a simple random
sample (SRS) of exactly 5 dwellings; thus,
$$ \pi_{\beta|\alpha} = \Pr\{\text{Select a dwelling in } \alpha\text{-th block} \mid \text{select } \alpha\text{-th block}\} = \frac{5}{B_\alpha} $$
(4) Thus, the probability of selecting any household in
Atlanta by this 2-stage SRS design is
$$ \pi_{\alpha\beta} = \pi_\alpha\,\pi_{\beta|\alpha} = \left[\frac{100}{A}\right] \left[\frac{5}{B_\alpha}\right] = \frac{500}{A\,B_\alpha} $$
Implications of Two-Stage SRS Design:
+ Constant sample size (5) in each sample
block
+ Constant total sample size (500) among
all possible two-stage samples; i.e.,
$V(n_o) = 0$
- Unequal selection probabilities for
dwellings in Atlanta (implying the need
for sample weights for analysis and
precision loss due to variable weights)
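The contrast between the two design alternatives can be made concrete by evaluating the selection probabilities for a made-up frame. In the sketch below the block sizes, the number of blocks, and the function names are all hypothetical; only the formulas 100·B_α/B, 100/A, and 5/B_α come from the steps above.

```python
from fractions import Fraction


def two_stage_pps_probability(block_size, total_dwellings,
                              blocks_selected=100, per_block=5):
    """Overall dwelling selection probability when blocks are chosen with PPS."""
    first_stage = Fraction(blocks_selected * block_size, total_dwellings)
    second_stage = Fraction(per_block, block_size)
    return first_stage * second_stage


def two_stage_srs_probability(block_size, number_of_blocks,
                              blocks_selected=100, per_block=5):
    """Overall dwelling selection probability when blocks are chosen with SRS."""
    first_stage = Fraction(blocks_selected, number_of_blocks)
    second_stage = Fraction(per_block, block_size)
    return first_stage * second_stage


# Hypothetical frame: 2,000 blocks with 20, 35, 50, or 80 dwellings each.
block_sizes = [20, 35, 50, 80] * 500
A = len(block_sizes)      # number of blocks
B = sum(block_sizes)      # number of dwellings

pps = {size: two_stage_pps_probability(size, B) for size in sorted(set(block_sizes))}
srs = {size: two_stage_srs_probability(size, A) for size in sorted(set(block_sizes))}

for size in pps:
    print(f"block of {size:2d} dwellings: PPS {pps[size]}, "
          f"SRS {srs[size]}, SRS weight {1 / srs[size]}")
```

Under PPS the block size cancels, so every dwelling has probability 500/B. Under SRS the probability is 500/(A·B_α), so dwellings in large blocks need larger analysis weights (the reciprocal of the probability).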
6.7 Systematic (List) Sampling
6.7.1 Definition:
Systematic Sampling: A method of probability sampling
in which elements on an ordered list of population members are
chosen by applying an interval of constant length after a random
start.
6.7.2 Selecting a Systematic Sample
A. The sampling frame consists of a list numbered
sequentially from 1 to the number of sampling units
in the list.
B. A sampling interval (denoted by the symbol, k) is
chosen. If a sample of about n out of N elements is
desired, k is usually the ratio, N/n, rounded to the
nearest integer.
C. A random number between 1 and k is chosen. This
number is called the random start and will be
denoted by the symbol, g.
D. Elements selected in the sample are those numbered
g and every k-th element thereafter for the remainder of the
list; i.e., g, g+k, g+2k, etc.
6.7.3 Example of the Systematic Selection Process
We wish to select a systematic 1-in-5 sample from a
population of 1000 discharges in a hospital. The list
of discharges is represented in Figure 6.1. The
random start is 3.
Discharge list:      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 996, 997, 998, 999, 1000
Discharges sampled:  3, 8, ..., 998
Figure 6.1 Representation of the Process of Selecting a Systematic List
Sample of Patient Discharges
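The selection steps of Section 6.7.2 and the 1-in-5 example above translate directly into code. The following is a minimal sketch; the function name `systematic_sample` and its arguments are chosen here for illustration and are not part of the lectures.

```python
import random


def systematic_sample(N, k, g=None, rng=random):
    """Return the list positions g, g+k, g+2k, ... that do not exceed N.

    N is the length of the list, k the sampling interval, and g the random
    start. If g is omitted, it is drawn at random from 1..k.
    """
    if g is None:
        g = rng.randint(1, k)
    if not 1 <= g <= k:
        raise ValueError("the random start g must lie between 1 and k")
    return list(range(g, N + 1, k))


# The example of Section 6.7.3: 1,000 discharges, interval 5, random start 3.
sample = systematic_sample(N=1000, k=5, g=3)
print(sample[:3], "...", sample[-1])
print("sample size:", len(sample))
```

When N/k is not an integer, the realized sample size depends on the random start, which is one reason the notes speak of a sample of "about" n elements.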
6.7.4 Advantages of Systematic Sampling
A. Systematic sampling can be quickly and easily
applied.
B. With proper ordering of the population list,
systematic sampling can be used as a method of
selecting a sample which is similar to a
proportionate stratified sample.
6.7.5 Estimating a Population Mean $\bar{Y}$
If n represents the number of elements in the
systematic sample, the estimator of $\bar{Y}$ is:
$$ \bar{y}_{sys} = \frac{\sum_{i=1}^{n} y_i}{n} \tag{6.17} $$
6.7.6 Estimated Variance of $\bar{y}_{sys}$
A. If we can assume that the list is "thoroughly
shuffled" (i.e., essentially randomly ordered),
$$ v(\bar{y}_{sys}) = \left[\frac{1 - f}{n}\right] s^2 \tag{6.18} $$
where
$$ s^2 = \frac{n \sum_{i=1}^{n} y_i^2 - \left[\sum_{i=1}^{n} y_i\right]^2}{n(n - 1)} \tag{6.19} $$
B. If we cannot assume that the list is "thoroughly
shuffled,"
$$ v(\bar{y}_{sys}) = \left[\frac{1 - f}{n^2}\right] \sum_{h=1}^{n/2} (y_{h1} - y_{h2})^2 \tag{6.20} $$
Formula (6.20) means that we multiply the factor
$(1 - f)/n^2$ by the sum of squares of successive paired
differences.
6.7.7 Standard Error of $\bar{y}_{sys}$
$$ se(\bar{y}_{sys}) = \sqrt{v(\bar{y}_{sys})} \tag{6.21} $$
6.7.8 Confidence Intervals for $\bar{Y}$ (n>50)
$$ \text{Lower Boundary: } \bar{y}_{sys} - t\left[se(\bar{y}_{sys})\right] \tag{6.22} $$
$$ \text{Upper Boundary: } \bar{y}_{sys} + t\left[se(\bar{y}_{sys})\right] \tag{6.23} $$
The value of t depends on the confidence level we
wish for the confidence interval (see Section 2.2.7).
Interpretation (t=1.96): We are 95 percent sure that $\bar{Y}$ is
covered by the interval whose boundaries are calculated by formulas
(6.22) and (6.23).
6.7.9 Illustrative Example of Analysis from a
Systematic Sample
A. Data on the length of stay from a systematic sample
of 20 hospital discharges from a population of 400
discharges are shown in Table 6.2. The list of
sample elements is presented in the order in which
they were selected from the sampling frame.
Table 6.2 Results of a Systematic Sample of Hospital Discharges

Sample         Length of      |  Sample         Length of
Discharge i    Stay ($y_i$)   |  Discharge i    Stay ($y_i$)
     1              6         |      11              9
     2              2         |      12              6
     3              1         |      13              3
     4              3         |      14              1
     5              3         |      15              1
     6              4         |      16             19
     7             12         |      17              3
     8              2         |      18              6
     9              7         |      19              9
    10              4         |      20              2

Summary Statistics:
n = 20
N = 400
f = n/N = 0.05
$\sum_{i=1}^{20} y_i = 103$
$\sum_{i=1}^{20} y_i^2 = 907$
B. Estimator of $\bar{Y}$:
$$ \bar{y}_{sys} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{103}{20} = 5.15 $$
C. If we can assume that the sampling frame is
thoroughly shuffled:
$$ s^2 = \frac{n \sum_{i=1}^{n} y_i^2 - \left[\sum_{i=1}^{n} y_i\right]^2}{n(n - 1)} = \frac{20(907) - (103)^2}{20(19)} = 19.818 $$
$$ v(\bar{y}_{sys}) = \left[\frac{1 - f}{n}\right] s^2 = \left[\frac{1 - 0.05}{20}\right] (19.818) = 0.941 $$
D. If we cannot assume that the sampling frame is
thoroughly shuffled:
$$ v(\bar{y}_{sys}) = \left[\frac{1 - f}{n^2}\right] \sum_{h=1}^{n/2} (y_{h1} - y_{h2})^2 = \left[\frac{1 - 0.05}{20^2}\right] \left[(6 - 2)^2 + (1 - 3)^2 + \dots + (9 - 2)^2\right] = 1.247 $$
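Both variance estimates for the Table 6.2 data can be checked numerically. The sketch below is an illustration, not part of the original notes. The data are those of Table 6.2 in selection order, the variable names are chosen here, and the pairing takes discharges (1, 2), (3, 4), ..., (19, 20) as the successive pairs of formula (6.20).

```python
import math

# Table 6.2: length of stay for the 20 sampled discharges, in selection order.
y = [6, 2, 1, 3, 3, 4, 12, 2, 7, 4, 9, 6, 3, 1, 1, 19, 3, 6, 9, 2]
N = 400
n = len(y)
f = n / N                      # sampling fraction

y_sys = sum(y) / n             # formula (6.17)

# List assumed thoroughly shuffled: formulas (6.18) and (6.19).
s2 = (n * sum(v * v for v in y) - sum(y) ** 2) / (n * (n - 1))
v_shuffled = (1 - f) / n * s2

# List not assumed shuffled: formula (6.20), non-overlapping successive pairs.
paired_ss = sum((y[i] - y[i + 1]) ** 2 for i in range(0, n - 1, 2))
v_paired = (1 - f) / n ** 2 * paired_ss

print(f"mean = {y_sys:.2f}")
print(f"shuffled list:  v = {v_shuffled:.3f}, se = {math.sqrt(v_shuffled):.3f}")
print(f"paired formula: v = {v_paired:.3f}, se = {math.sqrt(v_paired):.3f}")
```

The step of 2 in the `range` call is what makes the pairs non-overlapping; an odd n would leave the last observation unpaired, a case the formula in the notes does not address.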
6.7.10 Some Statistical Notes on Systematic Sampling
A. $\bar{y}_{sys}$ is an unbiased estimator if N/n is an integer
and very nearly unbiased if it is not an integer.
B. If the sampling frame is not randomly ordered,
$v(\bar{y}_{sys})$ is not unbiased and tends to slightly
overestimate the true variance of $\bar{y}_{sys}$.
C. Systematic sampling is a special form of cluster
sampling. Specifically, the population can be
viewed as consisting of k clusters each of which
is a possible systematic sample which can be
chosen. By choosing a random start and applying
a fixed interval in selecting the sample,
we are effectively randomly choosing one of the
k possible clusters.
D. If we can assume that the population list is
"thoroughly shuffled," the formulas of Sections
2.3 and 2.4 can be used to estimate a total ($\overset{o}{Y}$) or a
proportion (p).
E. Some precautionary notes concerning use of
systematic sampling:
(1) Periodicity in the population list
(a) Problem: Variability of the measurement
(i.e., $y_i$) in the list such that comparable
measurements appear at fixed intervals.
Particular problems arise when the periodic
intervals of similarity are integral multiples
of the sampling interval k.
(b) Remedy: Change the random start several
times as one moves through the list during
the selection process.
(2) Monotonicity in the population list
(a) Problem: Measurement values tend to
increase or decrease as one moves through
the list. The value of the random start
therefore strongly influences the value of
the estimate resulting from the sample.
(b) Remedy: Randomly order the list, if
possible.
Supplementary Reading
[1] Mendenhall, W., Ott, L., and Scheaffer, R.L.,
Elementary Survey Sampling, Duxbury Press,
Belmont, 1971, Chapters 7 and 8.
[2] Kish, L., Survey Sampling, Wiley and Sons, New
York, 1965, Chapters 4, 5, and 6.
[3] Cochran, W.G., Sampling Techniques, Third Edition,
Wiley and Sons, New York, 1977, Chapters 8, 9, and
9A.