stratified sampling lecture

SECTION 4
STRATIFIED RANDOM SAMPLING
4.1 What is Stratification?
4.1.1 Definition
Stratification: The process of dividing a population
of elements into distinct subpopulations called strata.
Strata are formed so that each population element is
assigned to only one stratum.
4.1.2 Some Examples
A. We wish to stratify a list of employees of a large
company by age. Since the age of all employees is
available on our sampling frame, we form the
following strata:
 h   Stratum Composition
---  -------------------------
 1   Less than 25 years old
 2   25 to 34 years old
 3   35 to 44 years old
 4   45 to 54 years old
 5   55 years of age or older
B. To draw a sample of United States counties we
wish to stratify by region. To this end we form the
following strata:
 h   Stratum Composition
     (All Counties Located in the)
---  -----------------------------
 1   Northeast census region
 2   South census region
 3   North Central census region
 4   West census region
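As a minimal sketch of Example A, the age strata can be applied to a sampling frame with a simple lookup. The cutpoints come from the stratum table in Example A; the function name `age_stratum` and the employee records are hypothetical and exist only for illustration.

```python
from bisect import bisect_right

# Lower age bounds of strata 2..5 in Example A (stratum 1 is "less than 25").
AGE_CUTPOINTS = [25, 35, 45, 55]

def age_stratum(age):
    """Return the stratum index h (1..5) for an employee's age in years."""
    return bisect_right(AGE_CUTPOINTS, age) + 1

# Hypothetical sampling frame: (employee id, age) pairs invented for illustration.
frame = [("E01", 23), ("E02", 25), ("E03", 34), ("E04", 44), ("E05", 61)]

# Each element lands in exactly one stratum, as the definition requires.
strata = {}
for emp_id, age in frame:
    strata.setdefault(age_stratum(age), []).append(emp_id)
```

Because every age maps to exactly one index, no employee can fall into two strata or into none.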
4.1.3 How is Stratification Used in Sample Surveys?
A. The population is divided into strata so that each
population element is a member of only one
stratum. We use the letter $H$ to represent the
number of strata that are formed and $N_h$ to denote
the number of population elements which fall in
the h-th stratum. The total number of elements in
the population is therefore
$$N = N_1 + N_2 + \cdots + N_H = \sum_{h=1}^{H} N_h \tag{4.1}$$
B. A sample size is chosen for each stratum. We
denote the sample size in the h-th stratum by the
symbol $n_h$. The total sample size over all strata is
then
$$n = n_1 + n_2 + \cdots + n_H = \sum_{h=1}^{H} n_h \tag{4.2}$$
The corresponding sampling rate for the h-th
stratum would be
$$f_h = \frac{n_h}{N_h} \tag{4.3}$$
with the overall sampling rate being
$$f = \frac{n}{N} \tag{4.4}$$
C. A probability sample is separately chosen in each
stratum so that the choice of elements in one
stratum does not depend upon choices made in the
other strata. Selection procedures among strata are
often the same although different selection
methods can be used, if needed.
D. The population value to be estimated (e.g., mean,
proportion, etc.) is estimated separately for each
stratum.
E. An estimate for the entire population is produced
by appropriately combining the individual stratum
estimates.
(Skip to Section 4.7)
4.2 What is Stratified Simple Random Sampling?
4.2.1 Definition
Stratified Simple Random Sampling:
Stratified sampling in which simple random
sampling is used in each stratum.
NOTE: Stratified simple random sampling is a simple
form of stratified sampling. There are many
other types of stratified sampling, however.
4.2.2 How to Select a Stratified Random Sample
A. Divide the population into strata.
B. Determine sample sizes for each stratum.
C. Select a separate simple random sample in each
stratum.
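Steps A to C can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed procedure: the function name `select_stratified_srs`, the element labels, and the stratum sample sizes are all hypothetical, and `random.sample` (sampling without replacement) stands in for the simple random sampling step.

```python
import random

def select_stratified_srs(strata, sample_sizes, seed=None):
    """Draw a separate simple random sample (without replacement) in each stratum.

    strata       -- dict mapping stratum label h to the list of its population elements
    sample_sizes -- dict mapping stratum label h to the stratum sample size n_h
    """
    rng = random.Random(seed)
    sample = {}
    for h, elements in strata.items():
        # The draw in one stratum does not depend on the elements chosen elsewhere.
        sample[h] = rng.sample(elements, sample_sizes[h])
    return sample

# Step A: hypothetical population already divided into two strata.
strata = {
    1: [f"A{i}" for i in range(1, 11)],  # N_1 = 10
    2: [f"B{i}" for i in range(1, 7)],   # N_2 = 6
}
# Step B: hypothetical stratum sample sizes n_1 = 3 and n_2 = 2.
# Step C: select the samples.
sample = select_stratified_srs(strata, {1: 3, 2: 2}, seed=42)
```

Fixing `seed` makes the selection reproducible, which is convenient for documenting how a sample was drawn.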
4.3 Estimating a Population Mean From a Stratified
Random Sample
4.3.1 Setting
A. A stratified random sample has been selected.
B. Data from each element in the sample have been
collected.
C. We wish to estimate the population mean per
element for some characteristic. This mean can be
expressed as
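$$\bar{Y} = \frac{\mathring{Y}}{N} = \frac{N_1\bar{Y}_1 + N_2\bar{Y}_2 + \cdots + N_H\bar{Y}_H}{N} = W_1\bar{Y}_1 + W_2\bar{Y}_2 + \cdots + W_H\bar{Y}_H = \sum_{h=1}^{H} W_h\bar{Y}_h \tag{4.5}$$
where $\bar{Y}_h$ is the population mean in the h-th stratum,
$\mathring{Y}$ is the population total (see Section 4.4), and
$$W_h = \frac{N_h}{N} \tag{4.6}$$
is the proportion of all population elements which
fall in the h-th stratum.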
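4.3.2 Estimator of $\bar{Y}$
$$\bar{y}_{st} = \frac{1}{N}\left(N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H\right) = \frac{1}{N}\sum_{h=1}^{H} N_h\bar{y}_h = W_1\bar{y}_1 + W_2\bar{y}_2 + \cdots + W_H\bar{y}_H = \sum_{h=1}^{H} W_h\bar{y}_h \tag{4.7}$$
where $\bar{y}_h$ is the estimator of $\bar{Y}_h$ for the h-th stratum
calculated as
$$\bar{y}_h = \frac{\sum_{i=1}^{n_h} y_{hi}}{n_h} \tag{4.8}$$
and $y_{hi}$ is the value of the observation made on the
i-th sample element in the h-th stratum.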
4.3.3 Some Statistical Notes About $\bar{y}_{st}$
A. $\bar{y}_{st}$ is a random variable since different stratified
random samples of the same size from the same
population are likely to lead to different values of
$\bar{y}_{st}$.
B. $\bar{y}_{st}$ is an unbiased estimator of $\bar{Y}$.
C. With sufficiently large stratified random samples
($n > 30$), the sampling distribution of $\bar{y}_{st}$ will closely
resemble the normal distribution.
4.3.4 Estimated Variance of $\bar{y}_{st}$
$$v(\bar{y}_{st}) = W_1^2 v(\bar{y}_1) + W_2^2 v(\bar{y}_2) + \cdots + W_H^2 v(\bar{y}_H) = \sum_{h=1}^{H} W_h^2\left[\frac{1-f_h}{n_h}\right]s_h^2 \tag{4.9}$$
where $s_h^2$ is the estimated element variance for the h-th
stratum, which can be calculated by using the same
formula as used to estimate element variance from a
simple random sample, only applied for a specific
stratum.
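Formula (4.9) can be motivated in two steps. The sketch below assumes the standard simple random sampling result for the variance of a sample mean, applied within each stratum, and writes $S_h^2$ for the true element variance in the h-th stratum (the quantity that $s_h^2$ estimates); neither the symbol $\mathrm{Var}$ nor this derivation appears in the notes themselves.

```latex
\begin{aligned}
\mathrm{Var}(\bar{y}_{st})
  &= \mathrm{Var}\Big(\sum_{h=1}^{H} W_h \bar{y}_h\Big)
   = \sum_{h=1}^{H} W_h^2\,\mathrm{Var}(\bar{y}_h)
  && \text{(samples are chosen independently across strata)} \\
\mathrm{Var}(\bar{y}_h)
  &= \left[\frac{1-f_h}{n_h}\right] S_h^2
  && \text{(simple random sampling within stratum } h\text{)}
\end{aligned}
```

Substituting the sample quantity $s_h^2$ for $S_h^2$ in the second line and inserting it into the first gives the estimator in (4.9).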
4.3.5 Some Statistical Notes About $v(\bar{y}_{st})$
A. $v(\bar{y}_{st})$ is a random variable (see Section 4.3.3).
B. $s_h^2$ is an unbiased estimator of the true element
variance in the h-th stratum.
C. $v(\bar{y}_{st})$ is an unbiased estimator of the true variance
of $\bar{y}_{st}$.
4.3.6 Standard Error of $\bar{y}_{st}$
$$se(\bar{y}_{st}) = \sqrt{v(\bar{y}_{st})} = \sqrt{\sum_{h=1}^{H} W_h^2\left[\frac{1-f_h}{n_h}\right]s_h^2} \tag{4.11}$$
4.3.7 Confidence Intervals for $\bar{Y}$ ($n > 30$)
Lower Boundary:
$$\bar{y}_{st} - t\,se(\bar{y}_{st}) \tag{4.12}$$
Upper Boundary:
$$\bar{y}_{st} + t\,se(\bar{y}_{st}) \tag{4.13}$$
The value of t which we use depends on the confidence
level for our interval (see Section 2.2.7 for some
commonly used values of t).
Interpretation (t=1.96): We are 95 percent sure that $\bar{Y}$
is covered by the interval whose boundaries are
defined by formulas (4.12) and (4.13).
4.4 Estimating a Population Total from a Stratified
Random Sample
4.4.1 Setting
A. A stratified random sample has been selected.
B. Data from each element in the sample have been
collected.
C. We wish to estimate the aggregate total of some
characteristic for the population. This total can be
expressed as
$$\mathring{Y} = N_1\bar{Y}_1 + N_2\bar{Y}_2 + \cdots + N_H\bar{Y}_H = \sum_{h=1}^{H} N_h\bar{Y}_h \tag{4.14}$$
4.4.2 Estimator of $\mathring{Y}$
$$\mathring{y}_{st} = N\bar{y}_{st} = \sum_{h=1}^{H} N_h\bar{y}_h \tag{4.15}$$
4.4.3 Some Statistical Notes About $\mathring{y}_{st}$
A. $\mathring{y}_{st}$ is a random variable.
B. $\mathring{y}_{st}$ is an unbiased estimator of $\mathring{Y}$.
C. When we have large sample sizes, the sampling
distribution of $\mathring{y}_{st}$ will closely resemble the normal
distribution.
4.4.4 Estimated Variance of $\mathring{y}_{st}$
$$v(\mathring{y}_{st}) = N^2 v(\bar{y}_{st}) = \sum_{h=1}^{H} N_h^2\left[\frac{1-f_h}{n_h}\right]s_h^2 \tag{4.16}$$
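The second equality in (4.16) follows from (4.9) together with the definition $W_h = N_h/N$ in (4.6). As a short worked step (added here for illustration, not taken from the notes):

```latex
N^2\, v(\bar{y}_{st})
  = N^2 \sum_{h=1}^{H} \Big(\frac{N_h}{N}\Big)^{2} \left[\frac{1-f_h}{n_h}\right] s_h^2
  = \sum_{h=1}^{H} N_h^2 \left[\frac{1-f_h}{n_h}\right] s_h^2
```

The factor $N^2$ cancels the $1/N^2$ carried by each $W_h^2$, so the variance of the total can be computed from the stratum sizes $N_h$ directly.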
4.4.5 Some Statistical Notes About $v(\mathring{y}_{st})$
A. $v(\mathring{y}_{st})$ is a random variable.
B. $v(\mathring{y}_{st})$ is an unbiased estimator of the true variance
of $\mathring{y}_{st}$.
4.4.6 Standard Error of $\mathring{y}_{st}$
$$se(\mathring{y}_{st}) = \sqrt{v(\mathring{y}_{st})} = \sqrt{\sum_{h=1}^{H} N_h^2\left[\frac{1-f_h}{n_h}\right]s_h^2} \tag{4.17}$$
4.4.7 Confidence Intervals for $\mathring{Y}$ ($n > 30$)
Lower Boundary:
$$\mathring{y}_{st} - t\,se(\mathring{y}_{st}) \tag{4.18}$$
Upper Boundary:
$$\mathring{y}_{st} + t\,se(\mathring{y}_{st}) \tag{4.19}$$
The value of t which we use depends on the confidence
level for our interval (see Section 2.2.7).
Interpretation (t=1.96): We are 95 percent sure that $\mathring{Y}$ is
covered by the interval whose boundaries
are computed by formulas (4.18) and
(4.19).
4.5 Estimating a Population Proportion from a
Stratified Simple Random Sample
4.5.1 Setting
A. A stratified random sample has been selected.
B. Data from each element in the sample have been
collected.
C. We wish to estimate the proportion (fraction) of
elements in the population which possess a certain
attribute. This proportion can be denoted by the
symbol, P.
D. Example: Proportion of individuals with limitation
of activities in 1978 due to chronic illness.
4.5.2 Estimator of $P$
$$p_{st} = \frac{1}{N}\left(N_1 p_1 + N_2 p_2 + \cdots + N_H p_H\right) = \frac{1}{N}\sum_{h=1}^{H} N_h p_h = W_1 p_1 + W_2 p_2 + \cdots + W_H p_H = \sum_{h=1}^{H} W_h p_h \tag{4.20}$$
where $p_h$ is the estimator of $P_h$, the proportion of
elements in the h-th stratum which possess the
attribute. Once again we may view $p_h$ as a special
sample mean where $y_{hi} = 0$ if the i-th sample
observation in the h-th stratum does not possess the
attribute and $y_{hi} = 1$ if it does. Therefore, $p_h$ can be
calculated as
$$p_h = \frac{\sum_{i=1}^{n_h} y_{hi}}{n_h} = \text{stratum sample proportion with attribute}$$
4.5.3 Some Statistical Notes About $p_{st}$
A. $p_{st}$ is a special case of $\bar{y}_{st}$; namely, where the
characteristic of interest is the dichotomous, 0-or-1
version of $y_{hi}$ defined above.
B. $p_{st}$ is a random variable.
C. $p_{st}$ is an unbiased estimator of $P$.
D. The sampling distribution of $p_{st}$ will closely
resemble the normal distribution provided that $n$ is
sufficiently large (e.g., $n p_{st}$ and $n(1 - p_{st})$ are
greater than 10).
4.5.4 Estimated Variance of $p_{st}$
$$v(p_{st}) = W_1^2 v(p_1) + W_2^2 v(p_2) + \cdots + W_H^2 v(p_H) = \sum_{h=1}^{H} W_h^2 v(p_h) = \sum_{h=1}^{H} W_h^2\left[\frac{1-f_h}{n_h-1}\right]p_h(1-p_h) \tag{4.22}$$
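The divisor $n_h - 1$ in (4.22) arises when (4.9) is applied to 0-or-1 data. The worked step below assumes the usual sample element variance with divisor $n_h - 1$ (taken here to be the formula that Section 4.3.4 refers to) and uses the fact that $y_{hi}^2 = y_{hi}$ for 0-or-1 values.

```latex
\begin{aligned}
s_h^2 &= \frac{1}{n_h-1}\sum_{i=1}^{n_h}(y_{hi}-p_h)^2
       = \frac{1}{n_h-1}\Big(\sum_{i=1}^{n_h} y_{hi} - n_h p_h^2\Big)
       = \frac{n_h}{n_h-1}\,p_h(1-p_h) \\
v(p_h) &= \left[\frac{1-f_h}{n_h}\right]s_h^2
        = \left[\frac{1-f_h}{n_h-1}\right]p_h(1-p_h)
\end{aligned}
```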
4.5.5 Some Statistical Notes About $v(p_{st})$
A. $v(p_{st})$ is a random variable.
B. $v(p_{st})$ is an unbiased estimator of the true variance
of $p_{st}$.
4.5.6 Standard Error of $p_{st}$
$$se(p_{st}) = \sqrt{v(p_{st})} = \sqrt{\sum_{h=1}^{H} W_h^2\left[\frac{1-f_h}{n_h-1}\right]p_h(1-p_h)} \tag{4.23}$$
4.5.7 Confidence Intervals for $P$ ($n p_{st} > 10$ and
$n(1 - p_{st}) > 10$)
Lower Boundary:
$$p_{st} - t\,se(p_{st}) \tag{4.24}$$
Upper Boundary:
$$p_{st} + t\,se(p_{st}) \tag{4.25}$$
The value of t will depend on the confidence level that we
choose (see Section 2.2.7).
Interpretation (t=1.96): We are 95 percent sure that $P$ is
covered by the interval whose boundaries are
defined by formulas (4.24) and (4.25).
(Skip to Section 4.8)
4.6 Illustrative Example of Analysis from Stratified
Random Samples
4.6.1 Setting
A. A county with two relatively large communities
wants to do a survey on certification of the
emergency medical technicians (EMTs) who work
in the county and are required to take special
training for periodic certification by passing a
competency exam.
B. Most EMTs work in "City A," which is relatively
large and is located in the main urban area of the
county. "City B" is smaller and has fewer EMTs
who work there. The rest of the county's EMTs
work in smaller towns and in rural areas.
C. Because of suspected similarities in certification
patterns among EMTs in City A and comparable
similarities in City B, we decide to divide the
county into three strata. The "other" stratum
includes all EMTs not working in either
city. Proportionately allocated sample sizes and
other information for the three strata are presented
below:
 h   Stratum Composition   N_h   n_h
---  -------------------   ---   ---
 1   City A                155    20
 2   City B                 62     8
 3   Rural Area             93    12
     TOTAL COUNTY          310    40
D. We wish to estimate the following:
(1) $\bar{Y}$: The average number of hours of certification
training in the year prior to the last
certification.
(2) $\bar{Y}_1$: The average number of hours of certification
training in the year prior to the last
certification in City A.
(3) $\mathring{Y}$: The total number of certification hours for
EMTs in the county for the year prior to the
last certification.
(4) $P$: The proportion of EMTs who passed their last
periodic certification exam on the first try.
4.6.2 Sample Selection and Survey Results
A. Proportionately allocated simple random samples
are selected in the three strata.
B. Data on certification are collected from each of the
40 sample EMTs. Results are summarized in
Table 4.1.
Table 4.1 Summary of EMT Certification Survey

Value of $y_{hi}$: "Hours" is the number of training hours;
"Passed?" is whether the EMT passed the certification exam
(1 = yes, 0 = no).

        Stratum 1: City A     Stratum 2: City B     Stratum 3: Other
  i     Hours    Passed?      Hours    Passed?      Hours    Passed?
 ---    -----    -------      -----    -------      -----    -------
   1      35        1           27        1            8        1
   2      28        1            4        0           15        0
   3      26        1           49        0           21        1
   4      41        1           10        1            7        0
   5      43        1           15        0           14        1
   6      29        0           41        0           30        1
   7      32        1           25        0           20        0
   8      37        1           30        0           11        0
   9      36        1                                 12        1
  10      25        1                                 32        0
  11      29        0                                 34        0
  12      31        1                                 24        1
  13      39        1
  14      38        0
  15      40        0
  16      45        1
  17      28        1
  18      27        1
  19      35        1
  20      34        1
 Sum     678       16          201        2          228        6

Summary:

               Stratum 1   Stratum 2   Stratum 3
 $n_h$              20           8          12
 $N_h$             155          62          93
 $W_h$             0.5         0.2         0.3
 $f_h$           0.129       0.129       0.129
 $\bar{y}_h$    33.900      25.125      19.000
 $p_h$            0.80        0.25        0.50
 $s_h^2$        35.358     232.411      87.636
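The summary figures can be checked directly from the raw observations in Table 4.1. The sketch below transcribes the table into Python lists and recomputes each stratum's summary with the standard library; the variable names are illustrative, and `statistics.variance` is assumed to correspond to the element variance formula of Section 4.3.4 (divisor $n_h - 1$).

```python
from statistics import mean, variance

# Training hours transcribed from Table 4.1, keyed by stratum h.
hours = {
    1: [35, 28, 26, 41, 43, 29, 32, 37, 36, 25,
        29, 31, 39, 38, 40, 45, 28, 27, 35, 34],
    2: [27, 4, 49, 10, 15, 41, 25, 30],
    3: [8, 15, 21, 7, 14, 30, 20, 11, 12, 32, 34, 24],
}
# Passed-exam indicators (1 = yes, 0 = no) in the same order.
passed = {
    1: [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1],
    2: [1, 0, 0, 1, 0, 0, 0, 0],
    3: [1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1],
}
N = {1: 155, 2: 62, 3: 93}      # stratum population sizes N_h
N_total = sum(N.values())       # N = 310

summary = {}
for h in hours:
    n_h = len(hours[h])
    summary[h] = {
        "n": n_h,
        "W": N[h] / N_total,          # stratum weight W_h
        "f": n_h / N[h],              # stratum sampling rate f_h
        "ybar": mean(hours[h]),       # stratum sample mean
        "p": mean(passed[h]),         # stratum sample proportion
        "s2": variance(hours[h]),     # sample variance, divisor n_h - 1
    }
```

Rounding the entries of `summary` to three decimals gives the values listed in the Summary block of Table 4.1.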
4.6.3 Analysis
A. For $\bar{Y}$:
(1) Estimate:
$$\bar{y}_{st} = \sum_{h=1}^{3} W_h\bar{y}_h = (0.5)(33.900) + (0.2)(25.125) + (0.3)(19.000) = 27.675 \text{ hours}$$
(2) Variance:
$$\begin{aligned}
v(\bar{y}_{st}) &= \sum_{h=1}^{3} W_h^2\left[\frac{1-f_h}{n_h}\right]s_h^2 \\
&= (0.5)^2\left[\frac{1-0.129}{20}\right](35.358) + (0.2)^2\left[\frac{1-0.129}{8}\right](232.411) + (0.3)^2\left[\frac{1-0.129}{12}\right](87.636) \\
&= 1.97
\end{aligned}$$
(3) Standard Error:
$$se(\bar{y}_{st}) = \sqrt{v(\bar{y}_{st})} = \sqrt{1.97} = 1.40$$
(4) 95 Percent Confidence Interval: (t=1.96)
Lower Boundary:
$$\bar{y}_{st} - 1.96\,se(\bar{y}_{st}) = 27.675 - (1.96)(1.40) = 24.931$$
Upper Boundary:
$$\bar{y}_{st} + 1.96\,se(\bar{y}_{st}) = 27.675 + (1.96)(1.40) = 30.419$$
Interpretation: We are 95 percent sure that $\bar{Y}$
lies somewhere between 24.931
and 30.419 hours.
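The calculation in part A can be reproduced with a short function that applies formulas (4.7), (4.9), (4.11), (4.12) and (4.13) to the stratum summaries of Table 4.1. This is a sketch with illustrative names (`stratified_mean` is not from the notes). It uses the exact sampling rates $n_h/N_h$ and carries no intermediate rounding, so the interval boundaries differ from the hand calculation in the third decimal place.

```python
import math

# Stratum summaries from Table 4.1: (N_h, n_h, ybar_h, s_h^2).
strata = [
    (155, 20, 33.900, 35.358),   # City A
    (62, 8, 25.125, 232.411),    # City B
    (93, 12, 19.000, 87.636),    # Other
]

def stratified_mean(strata, t=1.96):
    """Return (estimate, variance, standard error, (lower, upper))."""
    N = sum(N_h for N_h, _, _, _ in strata)
    # (4.7): weighted sum of stratum means with W_h = N_h / N
    est = sum((N_h / N) * ybar_h for N_h, _, ybar_h, _ in strata)
    # (4.9): sum of W_h^2 * (1 - f_h) / n_h * s_h^2 with f_h = n_h / N_h
    var = sum((N_h / N) ** 2 * (1 - n_h / N_h) / n_h * s2_h
              for N_h, n_h, _, s2_h in strata)
    se = math.sqrt(var)                           # (4.11)
    return est, var, se, (est - t * se, est + t * se)  # (4.12), (4.13)

est, var, se, (lower, upper) = stratified_mean(strata)
print(round(est, 3), round(var, 2), round(se, 2))  # 27.675 1.97 1.4
```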
B. For $\bar{Y}_1$:
(1) Estimate:
$$\bar{y}_1 = \frac{\sum_{i=1}^{n_1} y_{1i}}{n_1} = \frac{678}{20} = 33.900$$
(2) Variance:
$$v(\bar{y}_1) = \left[\frac{1-f_1}{n_1}\right]s_1^2 = \left[\frac{1-0.129}{20}\right](35.358) = 1.54$$
(3) Standard Error:
$$se(\bar{y}_1) = \sqrt{v(\bar{y}_1)} = \sqrt{1.54} = 1.24$$
(4) 95 Percent Confidence Interval:
Because of the small sample size
($n_1 = 20$), we would be reluctant to
produce a confidence interval for $\bar{Y}_1$,
unless special tables of the "Student's" t
distribution were used to determine t.
Most statistics textbooks contain these
tables and an explanation of how to use
them.
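As an illustration of the Student's t approach mentioned in step (4), the sketch below builds a 95 percent interval for the City A mean using $n_1 - 1 = 19$ degrees of freedom. The critical value 2.093 is the standard two-sided 95 percent table value for 19 degrees of freedom; it is hard-coded here as an assumption rather than computed, and the choice of $n_1 - 1$ degrees of freedom is the usual convention, not something the notes specify.

```python
import math

# City A summaries from Table 4.1.
N1, n1 = 155, 20
ybar1, s2_1 = 33.900, 35.358

f1 = n1 / N1                          # stratum sampling rate
v1 = (1 - f1) / n1 * s2_1             # variance of the stratum mean
se1 = math.sqrt(v1)

# Assumed table value: two-sided 95% Student's t with 19 degrees of freedom.
T_95_DF19 = 2.093

lower = ybar1 - T_95_DF19 * se1
upper = ybar1 + T_95_DF19 * se1
print(round(lower, 1), round(upper, 1))  # 31.3 36.5
```

The interval is wider than the one that t = 1.96 would give, which reflects the extra uncertainty of a variance estimated from only 20 observations.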
C. For $\mathring{Y}$:
(1) Estimate:
$$\mathring{y}_{st} = N\bar{y}_{st} = (310)(27.675) = 8579.25$$
(2) Variance:
$$v(\mathring{y}_{st}) = N^2 v(\bar{y}_{st}) = (310)^2(1.9696) \approx 189{,}278$$
where 1.9696 is the value of $v(\bar{y}_{st})$ before rounding to 1.97.
(3) Standard Error:
$$se(\mathring{y}_{st}) = \sqrt{v(\mathring{y}_{st})} = \sqrt{189{,}278} = 435.06$$
(4) 95 Percent Confidence Interval:
Lower Boundary:
$$\mathring{y}_{st} - 1.96\,se(\mathring{y}_{st}) = 8579.25 - (1.96)(435.06) = 7727$$
Upper Boundary:
$$\mathring{y}_{st} + 1.96\,se(\mathring{y}_{st}) = 8579.25 + (1.96)(435.06) = 9432$$
Interpretation: We are 95 percent sure that $\mathring{Y}$ lies
somewhere between 7727 and 9432
hours.
D. For $P$:
(1) Estimate:
$$p_{st} = \sum_{h=1}^{3} W_h p_h = (0.5)(0.80) + (0.2)(0.25) + (0.3)(0.50) = 0.60$$
(2) Variance:
$$\begin{aligned}
v(p_{st}) &= \sum_{h=1}^{3} W_h^2\left[\frac{1-f_h}{n_h-1}\right]p_h(1-p_h) \\
&= (0.5)^2\left[\frac{1-0.129}{19}\right](0.80)(0.20) + (0.2)^2\left[\frac{1-0.129}{7}\right](0.25)(0.75) + (0.3)^2\left[\frac{1-0.129}{11}\right](0.50)(0.50) \\
&= 0.0045
\end{aligned}$$
(3) Standard Error:
$$se(p_{st}) = \sqrt{v(p_{st})} = \sqrt{0.0045} = 0.0671$$
(4) 95 Percent Confidence Interval:
Lower Boundary:
$$p_{st} - 1.96\,se(p_{st}) = 0.60 - (1.96)(0.0671) = 0.47$$
Upper Boundary:
$$p_{st} + 1.96\,se(p_{st}) = 0.60 + (1.96)(0.0671) = 0.73$$
Interpretation:
We are 95 percent sure that $P$ lies
somewhere between 0.47 and 0.73 (or,
alternatively, between 47 and 73 percent).
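The calculation in part D can be reproduced in the same way from formulas (4.20), (4.22) and (4.23). The function name `stratified_proportion` is illustrative. Because the sketch uses exact sampling rates and does not round the variance before taking the square root, its standard error (about 0.0674) differs slightly from the hand-calculated 0.0671, while the interval boundaries agree to two decimals.

```python
import math

# Stratum summaries from Table 4.1: (N_h, n_h, p_h).
strata = [(155, 20, 0.80), (62, 8, 0.25), (93, 12, 0.50)]

def stratified_proportion(strata, t=1.96):
    """Return (estimate, variance, standard error, (lower, upper))."""
    N = sum(N_h for N_h, _, _ in strata)
    # (4.20): weighted sum of stratum proportions
    est = sum((N_h / N) * p_h for N_h, _, p_h in strata)
    # (4.22): note the divisor n_h - 1 rather than n_h
    var = sum((N_h / N) ** 2 * (1 - n_h / N_h) / (n_h - 1) * p_h * (1 - p_h)
              for N_h, n_h, p_h in strata)
    se = math.sqrt(var)                                 # (4.23)
    return est, var, se, (est - t * se, est + t * se)   # (4.24), (4.25)

est, var, se, (lower, upper) = stratified_proportion(strata)
print(round(est, 2), round(var, 4), round(lower, 2), round(upper, 2))
# 0.6 0.0045 0.47 0.73
```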
4.7 Allocation of the Sample Among Strata
Up to this point we have said little about how to
choose the sample size for each stratum (i.e., $n_h$ for the
h-th stratum). Having reviewed available resources and/or
precision requirements to determine what the total sample
size ($n$) will be for the survey, the statistician must give
careful thought to his or her choice of stratum sample
sizes ($n_h$) so that
$$n = \sum_{h=1}^{H} n_h$$
The remainder of this chapter deals with two of the four
strategies for allocating the sample among strata. One is
called Proportionate Stratified Sampling and the other is
called Optimum Allocation. Both strategies are special
types of stratified random sampling designs so we
continue to assume that simple random sampling is the
selection method in each stratum.
Four strategies for allocating the sample among strata:
(1) Proportionate: Same sampling rates
(2) Optimum: Most cost-efficient sampling rates
(3) Balanced: Equal sample sizes
(4) Disproportionate: Unequal sampling rates (to
"oversample" important domains).
(Skip to Section 4.2)
4.8 Proportionate Stratified Sampling
4.8.1 Stratum Allocation
Choose the same sampling rate ($f_h$) for all strata. In
other words,
$$f_h = \frac{n_h}{N_h} = \frac{n}{N} = f \tag{4.26}$$
This is the same as saying that
$$W_h = \frac{N_h}{N} = \frac{n_h}{n} = w_h \tag{4.27}$$
Formula (4.27) means that the proportion of the
sample chosen from any given stratum will be the
same as the proportion of the population in that same
stratum.
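Formula (4.27) translates directly into an allocation rule, $n_h = n N_h / N$. The sketch below applies it to the EMT example of Section 4.6 ($N_h$ = 155, 62, 93 and $n$ = 40); the function name `proportionate_allocation` is illustrative.

```python
def proportionate_allocation(N_h, n):
    """Return stratum sample sizes n_h = n * N_h / N (not yet rounded)."""
    N = sum(N_h)
    return [n * size / N for size in N_h]

N_h = [155, 62, 93]                        # stratum sizes from Section 4.6
n_h = proportionate_allocation(N_h, 40)
print(n_h)                                  # [20.0, 8.0, 12.0]

# Every stratum has the same sampling rate f_h = n / N, as (4.26) requires.
rates = [n / N for n, N in zip(n_h, N_h)]
```

In this example the allocation happens to come out in whole numbers. In general the $n_h$ must be rounded, so the stratum sampling rates are only approximately equal.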
4.8.2 Estimation of Means, Totals and Proportions
from Proportionate Stratified Samples
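To estimate $\bar{Y}$:
Estimator:
$$\bar{y}_{prop} = \frac{\sum_{i=1}^{n} y_i}{n}$$
Variance estimator:
$$v(\bar{y}_{prop}) = \left[\frac{1-f}{n}\right]\sum_{h=1}^{H} w_h s_h^2$$
To estimate $\mathring{Y}$:
Estimator:
$$\mathring{y}_{prop} = N\bar{y}_{prop}$$
Variance estimator:
$$v(\mathring{y}_{prop}) = \left[\frac{1-f}{f^2}\right]\sum_{h=1}^{H} n_h s_h^2$$
To estimate $P$:
Estimator:
$$p_{prop} = \frac{\sum_{i=1}^{n} y_i}{n}$$
Variance estimator:
$$v(p_{prop}) = \frac{1-f}{n^2}\sum_{h=1}^{H}\left[\frac{n_h^2}{n_h-1}\right]p_h(1-p_h)$$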
4.8.3 Some Statistical Notes on Proportionate
Stratified Sampling
A. When proportionate stratified sampling is
used, each element in the population has the
same probability of selection. This type of
design is called a self-weighting design since
sample estimates of means and proportions for
the total population are simple arithmetic
means.
B. All estimators and variance estimates
presented in Section 4.8.2 are unbiased.
C. The true variance of estimates from
proportionate stratified samples is essentially
never larger than that from simple random samples
of the same size ($n$), provided the stratum sizes
$N_h$ are reasonably large. Gains over simple random
sampling are greatest when strata are internally
homogeneous.
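Note C can be made concrete with a standard approximation from the sampling literature, valid when terms of order $1/N_h$ are negligible. The symbols $V_{srs}$ and $V_{prop}$ are introduced here only for this illustration and denote the true variances of the estimated mean under simple random sampling and under proportionate stratified sampling with the same $n$.

```latex
V_{srs} \approx V_{prop} + \frac{1-f}{n}\sum_{h=1}^{H} W_h\,(\bar{Y}_h - \bar{Y})^2
```

The added term is never negative, and it is large when the stratum means $\bar{Y}_h$ differ widely from one another. That is the situation in which most of the population's variability lies between strata, leaving each stratum internally homogeneous.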
4.9 Optimum Allocation
4.9.1 Definition
Optimum Allocation - A method of stratum
allocation of a stratified random sample in which
stratum sampling rates are directly proportional
to the stratum element standard deviation ($S_h$)
and inversely proportional to the square root of
the average unit cost of data collection in the
stratum ($c_h$). This optimum stratum allocation
yields estimates with the smallest possible
variance for fixed total survey cost (i.e., sample
size).
4.9.2 Setting
A. Definitions of strata have been established.
B. Average unit costs for collecting data in each
stratum ($c_h$) are known.
C. Some (possibly crude) measure of the element
standard deviation ($S_h$) is available for each
stratum.
D. $W_h$ is known exactly or approximately for
each stratum.
E. Sample size (n) has been specified.
4.9.3 Allocation Giving Minimum $V(\bar{y}_{st})$ for Fixed
Cost
A. For estimating $\bar{Y}$ or $\mathring{Y}$:
$$n_h = n\left[\frac{N_h S_h/\sqrt{c_h}}{\sum_{h=1}^{H} N_h S_h/\sqrt{c_h}}\right] \tag{4.28}$$
B. For estimating $P$:
$$n_h = n\left[\frac{N_h\sqrt{P_h(1-P_h)/c_h}}{\sum_{h=1}^{H} N_h\sqrt{P_h(1-P_h)/c_h}}\right] \tag{4.29}$$
4.9.4 Estimators and Estimated Variances
A. General Rule: Apply the appropriate formula for
stratified random sampling using
the $n_h$ established by either
formula (4.28) or (4.29).
B. Estimators: $\bar{y}_{st}$, $\mathring{y}_{st}$, or $p_{st}$.
C. Estimated Variances: $v(\bar{y}_{st})$, $v(\mathring{y}_{st})$, or $v(p_{st})$.
4.9.5 Some Statistical Notes on Optimum Allocation
A. Optimum Allocation of $n_h$ to the h-th stratum
depends on:
(1) $S_h$: Variability among population
elements in the h-th stratum
(2) $c_h$: Average cost per sample element in
the h-th stratum
(3) $N_h$: Total number of elements in the h-th
stratum
B. $S_h$ is never known exactly but it can often be
approximated:
(1) When estimating $\bar{Y}$ or $\mathring{Y}$ use:
(a) Data from previous or related
surveys
(b) Expert opinion
(c) Data from a pretest or pilot study
(d) Intelligent guess by survey statistician
(e.g., from knowledge of the statistical
range of values in the population)
(2) When estimating $P$:
$$S_h \approx \sqrt{P_h(1-P_h)}$$
By knowing approximate values for $P_h$,
we can estimate $S_h$.
C. Reasonably good approximations for $S_h$ are
likely to yield estimates whose variances are
very close to the minimum possible variance.
D. Unless unit costs differ widely among strata,
proportionate stratified sampling is almost
always preferred (i.e., more practical;
precision almost the same) when estimating
proportions.
E. Specifically, variances of estimates (of $P$) from
optimum allocation are rarely much smaller
than variances from proportionate stratified
samples. Stratum proportions ($P_h$) must be
widely different before substantial precision
gains will result.
4.10 Illustrative Example of Optimum Allocation
4.10.1 Setting
A. Suppose that we wish to determine an
optimum allocation for the county-wide EMT
certification survey presented earlier in
Section 4.6.
B. We prespecify n=40.
C. We wish to determine that stratum allocation
which will give us close to minimum variance
for $\bar{y}_{st}$, the estimator of the mean number of
certification hours per EMT in the county.
4.10.2 Preliminary Information Available to the
Survey Statistician
 h   Stratum Composition   S_h   c_h   N_h   N_h S_h / sqrt(c_h)
---  -------------------   ---   ---   ---   -------------------
 1   City A                  5     9   155         258.33
 2   City B                 15     9    62         310.00
 3   Rural Area             10    16    93         232.50
     TOTAL COUNTY                                  800.83
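4.10.3 Resulting Stratum Allocation
$$n_1 = n\left[\frac{N_1 S_1/\sqrt{c_1}}{\sum_{h=1}^{3} N_h S_h/\sqrt{c_h}}\right] = 40\left[\frac{258.33}{800.83}\right] = 12.9 \approx 13$$
$$n_2 = 40\left[\frac{310.00}{800.83}\right] = 15.5 \approx 15$$
$$n_3 = 40\left[\frac{232.50}{800.83}\right] = 11.6 \approx 12$$
(The value for $n_2$ is 15.48 before rounding to one decimal,
so it rounds down to 15, and the three stratum sample sizes
sum to $n = 40$.)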
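The allocation above can be reproduced with a direct implementation of formula (4.28). The sketch uses the preliminary values of $S_h$, $c_h$ and $N_h$ from Section 4.10.2; the function name `optimum_allocation` is illustrative.

```python
import math

def optimum_allocation(N_h, S_h, c_h, n):
    """Formula (4.28): n_h is proportional to N_h * S_h / sqrt(c_h)."""
    weights = [N * S / math.sqrt(c) for N, S, c in zip(N_h, S_h, c_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

alloc = optimum_allocation(
    N_h=[155, 62, 93],   # stratum sizes
    S_h=[5, 15, 10],     # approximate element standard deviations
    c_h=[9, 9, 16],      # average unit costs
    n=40,
)
rounded = [round(a) for a in alloc]
print([round(a, 1) for a in alloc], rounded)  # [12.9, 15.5, 11.6] [13, 15, 12]
```

Compared with the proportionate allocation (20, 8, 12), City B receives almost twice as many sample EMTs because its assumed standard deviation is the largest. Simple rounding happens to preserve the total of 40 here, but it is not guaranteed to do so in general, so the rounded sizes should be checked against $n$.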
4.11 Some Concluding Statistical Notes on Stratified
Random Sampling
A. General stratification rules:
(1) The stratification variable should be
highly correlated with the principal
characteristic being measured in the
survey (e.g., age would be a good
stratification variable if we were doing a
survey on limitation due to chronic
illness).
(2) Strata should be internally homogeneous
(and mutually heterogeneous among
strata).
(3) The variance of a population estimate will
be smallest (for fixed cost) when each
stratum sampling rate is directly related to
the variability of elements within the
stratum and inversely related to the unit
cost of data collection in the stratum.
(4) Proportionate stratified sampling is
always "safe" in that precision will never
be worse than in simple random sampling.
B. Some advantages of stratified sampling:
(1) Improved precision of estimates (i.e.,
smaller variances) which leads to
narrower confidence intervals.
(2) Better control of sample sizes for
subpopulations which can be defined by
strata and for which separate estimates
may be sought.
(3) Sampling designs can be made more
flexible. For example, special strata may
be established to handle more difficult
segments of the population (e.g., transient
population in household surveys).
C. Several stratum allocations yield precision very
close to that of the optimum allocation. Excessive
attempts to determine the actual optimum allocation
are almost never cost-effective.
D. Stratification and sophisticated allocation
schemes add to survey costs. Before
embarking upon attempts at these techniques,
be convinced that likely gains in precision will
merit the cost of realizing those gains.
E. Discussion of stratification in this chapter has
largely assumed that simple random sampling
is used in each stratum. It should be noted that
any probability sampling can be used in each
stratum. In addition, probability sampling
methods may differ widely among strata.
Supplementary Reading
[1] Kish, L., Survey Sampling, Wiley and Sons, New York,
1965, Sections 3.1-3.6 and 4.6.
[2] Cochran, W. G., Sampling Techniques, Third Edition,
Wiley and Sons, New York, Sections 5.1-5.7, 5.10-5.11,
5A.1-5A.4, 5A.7-5A.9, and 8.1-8.8.