SECTION 4
STRATIFIED RANDOM SAMPLING

4.1 What is Stratification?

4.1.1 Definition

Stratification: The process of dividing a population of elements into distinct subpopulations called strata. Strata are formed so that each population element is assigned to only one stratum.

4.1.2 Some Examples

A. We wish to stratify a list of employees of a large company by age. Since the age of all employees is available on our sampling frame, we form the following strata:

    h    Stratum Composition
    ---  ------------------------
    1    Less than 25 years old
    2    25 to 34 years old
    3    35 to 44 years old
    4    45 to 54 years old
    5    55 years of age or older

B. To draw a sample of United States counties we wish to stratify by region. To this end we form the following strata:

    h    Stratum Composition: All Counties Located in the
    ---  ------------------------------------------------
    1    Northeast census region
    2    South census region
    3    North Central census region
    4    West census region

4.1.3 How is Stratification Used in Sample Surveys?

A. The population is divided into strata so that each population element is a member of only one stratum. We use the letter $H$ to represent the number of strata that are formed and $N_h$ to denote the number of population elements which fall in the h-th stratum. The total number of elements in the population is therefore

\[ N = N_1 + N_2 + \cdots + N_H = \sum_{h=1}^{H} N_h \tag{4.1} \]

B. A sample size is chosen for each stratum. We denote the sample size in the h-th stratum by the symbol $n_h$. The total sample size over all strata is then

\[ n = n_1 + n_2 + \cdots + n_H = \sum_{h=1}^{H} n_h \tag{4.2} \]

The corresponding sampling rate for the h-th stratum would be

\[ f_h = \frac{n_h}{N_h} \tag{4.3} \]

with the overall sampling rate being

\[ f = \frac{n}{N} \tag{4.4} \]

C. A probability sample is separately chosen in each stratum so that the choice of elements in one stratum does not depend upon choices made in the other strata. Selection procedures among strata are often the same, although different selection methods can be used if needed.

D.
The population value to be estimated (e.g., mean, proportion, etc.) is estimated separately for each stratum.

E. An estimate for the entire population is produced by appropriately combining the individual stratum estimates.

(Skip to Section 4.7.)

4.2 What is Stratified Simple Random Sampling?

4.2.1 Definition

Stratified Simple Random Sampling: Stratified sampling in which simple random sampling is used in each stratum.

NOTE: Stratified simple random sampling is a simple form of stratified sampling. There are many other types of stratified sampling, however.

4.2.2 How to Select a Stratified Random Sample

A. Divide the population into strata.

B. Determine sample sizes for each stratum.

C. Select a separate simple random sample in each stratum.

4.3 Estimating a Population Mean from a Stratified Random Sample

4.3.1 Setting

A. A stratified random sample has been selected.

B. Data from each element in the sample have been collected.

C. We wish to estimate the population mean per element for some characteristic. This mean can be expressed as

\[ \bar{Y} = \frac{\mathring{Y}}{N} = \frac{N_1 \bar{Y}_1 + N_2 \bar{Y}_2 + \cdots + N_H \bar{Y}_H}{N} = W_1 \bar{Y}_1 + W_2 \bar{Y}_2 + \cdots + W_H \bar{Y}_H = \sum_{h=1}^{H} W_h \bar{Y}_h \tag{4.5} \]

where $\mathring{Y}$ is the population total, $\bar{Y}_h$ is the population mean in the h-th stratum, and

\[ W_h = \frac{N_h}{N} \tag{4.6} \]

is the proportion of all population elements which fall in the h-th stratum.

4.3.2 Estimator of $\bar{Y}$

\[ \bar{y}_{st} = \frac{1}{N}\left( N_1 \bar{y}_1 + N_2 \bar{y}_2 + \cdots + N_H \bar{y}_H \right) = \frac{1}{N} \sum_{h=1}^{H} N_h \bar{y}_h = W_1 \bar{y}_1 + W_2 \bar{y}_2 + \cdots + W_H \bar{y}_H = \sum_{h=1}^{H} W_h \bar{y}_h \tag{4.7} \]

where $\bar{y}_h$ is the estimator of $\bar{Y}_h$ for the h-th stratum, calculated as

\[ \bar{y}_h = \frac{\sum_{i=1}^{n_h} y_{hi}}{n_h} \tag{4.8} \]

and $y_{hi}$ is the value of the observation made on the i-th sample element in the h-th stratum.

4.3.3 Some Statistical Notes About $\bar{y}_{st}$

A. $\bar{y}_{st}$ is a random variable, since different stratified random samples of the same size from the same population are likely to lead to different values of $\bar{y}_{st}$.

B. $\bar{y}_{st}$ is an unbiased estimator of $\bar{Y}$.

C. With sufficiently large stratified random samples ($n > 30$), the sampling distribution of $\bar{y}_{st}$ will closely resemble the normal distribution.

4.3.4 Estimated Variance of $\bar{y}_{st}$

\[ v(\bar{y}_{st}) = W_1^2 v(\bar{y}_1) + W_2^2 v(\bar{y}_2) + \cdots + W_H^2 v(\bar{y}_H) = \sum_{h=1}^{H} W_h^2 \left[ \frac{1 - f_h}{n_h} \right] s_h^2 \tag{4.9} \]

where $s_h^2$ is the estimated element variance for the h-th stratum. It is calculated with the same formula used to estimate element variance from a simple random sample, applied within the stratum:

\[ s_h^2 = \frac{\sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2}{n_h - 1} \]

4.3.5 Some Statistical Notes About $v(\bar{y}_{st})$

A. $v(\bar{y}_{st})$ is a random variable (see Section 4.3.3).

B. $s_h^2$ is an unbiased estimator of the true element variance in the h-th stratum.

C. $v(\bar{y}_{st})$ is an unbiased estimator of the true variance of $\bar{y}_{st}$.

4.3.6 Standard Error of $\bar{y}_{st}$

\[ se(\bar{y}_{st}) = \sqrt{v(\bar{y}_{st})} = \sqrt{ \sum_{h=1}^{H} W_h^2 \left[ \frac{1 - f_h}{n_h} \right] s_h^2 } \tag{4.11} \]

4.3.7 Confidence Intervals for $\bar{Y}$ ($n > 30$)

\[ \text{Lower Boundary: } \bar{y}_{st} - \{t\}\{se(\bar{y}_{st})\} \tag{4.12} \]

\[ \text{Upper Boundary: } \bar{y}_{st} + \{t\}\{se(\bar{y}_{st})\} \tag{4.13} \]

The value of t which we use depends on the confidence level for our interval (see Section 2.2.7 for some commonly used values of t).

Interpretation: With t = 1.96, we are 95 percent sure that $\bar{Y}$ is covered by the interval whose boundaries are defined by formulas (4.12) and (4.13).

4.4 Estimating a Population Total from a Stratified Random Sample

4.4.1 Setting

A. A stratified random sample has been selected.

B. Data from each element in the sample have been collected.

C. We wish to estimate the aggregate total of some characteristic for the population. This total can be expressed as

\[ \mathring{Y} = N_1 \bar{Y}_1 + N_2 \bar{Y}_2 + \cdots + N_H \bar{Y}_H = \sum_{h=1}^{H} N_h \bar{Y}_h \tag{4.14} \]

4.4.2 Estimator of $\mathring{Y}$

\[ \mathring{y}_{st} = N \bar{y}_{st} = \sum_{h=1}^{H} N_h \bar{y}_h \tag{4.15} \]

4.4.3 Some Statistical Notes About $\mathring{y}_{st}$

A. $\mathring{y}_{st}$ is a random variable.

B. $\mathring{y}_{st}$ is an unbiased estimator of $\mathring{Y}$.

C. When we have large sample sizes, the sampling distribution of $\mathring{y}_{st}$ will closely resemble the normal distribution.

4.4.4 Estimated Variance of $\mathring{y}_{st}$

\[ v(\mathring{y}_{st}) = N^2 v(\bar{y}_{st}) = \sum_{h=1}^{H} N_h^2 \left[ \frac{1 - f_h}{n_h} \right] s_h^2 \tag{4.16} \]

4.4.5 Some Statistical Notes About $v(\mathring{y}_{st})$

A. $v(\mathring{y}_{st})$ is a random variable.

B. $v(\mathring{y}_{st})$ is an unbiased estimator of the true variance of $\mathring{y}_{st}$.
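The estimators and variance formulas (4.7), (4.9), (4.15), and (4.16) can be sketched in Python as follows. This is only an illustrative sketch: the function names `stratified_mean` and `stratified_total` are hypothetical, and the inputs are assumed to be per-stratum summaries (population sizes, sample sizes, sample means, and sample element variances). The numbers used are the stratum summaries reported for the EMT example in Section 4.6.

```python
import math


def stratified_mean(N_h, n_h, ybar_h, s2_h, t=1.96):
    """Estimate a population mean from stratum summaries.

    Returns (estimate, variance, standard error, (lower, upper)).
    """
    N = sum(N_h)
    W_h = [Nh / N for Nh in N_h]                      # stratum weights, formula (4.6)
    est = sum(W * yb for W, yb in zip(W_h, ybar_h))   # formula (4.7)
    var = sum(W ** 2 * (1 - nh / Nh) / nh * s2        # formula (4.9), with f_h = n_h / N_h
              for W, nh, Nh, s2 in zip(W_h, n_h, N_h, s2_h))
    se = math.sqrt(var)
    return est, var, se, (est - t * se, est + t * se)


def stratified_total(N_h, n_h, ybar_h, s2_h, t=1.96):
    """Estimate a population total from stratum summaries."""
    N = sum(N_h)
    mean, var, _, _ = stratified_mean(N_h, n_h, ybar_h, s2_h, t)
    est = N * mean             # formula (4.15)
    tot_var = N ** 2 * var     # formula (4.16)
    se = math.sqrt(tot_var)
    return est, tot_var, se, (est - t * se, est + t * se)


# Stratum summaries taken from the EMT example in Section 4.6
N_h = [155, 62, 93]
n_h = [20, 8, 12]
ybar_h = [33.900, 25.125, 19.000]
s2_h = [35.358, 232.411, 87.636]

mean_est, mean_var, mean_se, mean_ci = stratified_mean(N_h, n_h, ybar_h, s2_h)
total_est, total_var, total_se, total_ci = stratified_total(N_h, n_h, ybar_h, s2_h)
print(f"mean:  {mean_est:.3f} (se {mean_se:.2f})")
print(f"total: {total_est:.2f} (se {total_se:.2f})")
```

Because the sketch uses the unrounded sampling rate $f_h = n_h / N_h$ rather than 0.129, its results may differ from hand calculations in the last reported digit.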
4.4.6 Standard Error of $\mathring{y}_{st}$

\[ se(\mathring{y}_{st}) = \sqrt{v(\mathring{y}_{st})} = \sqrt{ \sum_{h=1}^{H} N_h^2 \left[ \frac{1 - f_h}{n_h} \right] s_h^2 } \tag{4.17} \]

4.4.7 Confidence Intervals for $\mathring{Y}$ ($n > 30$)

\[ \text{Lower Boundary: } \mathring{y}_{st} - \{t\}\{se(\mathring{y}_{st})\} \tag{4.18} \]

\[ \text{Upper Boundary: } \mathring{y}_{st} + \{t\}\{se(\mathring{y}_{st})\} \tag{4.19} \]

The value of t which we use depends on the confidence level for our interval (see Section 2.2.7).

Interpretation: With t = 1.96, we are 95 percent sure that $\mathring{Y}$ is covered by the interval whose boundaries are computed by formulas (4.18) and (4.19).

4.5 Estimating a Population Proportion from a Stratified Simple Random Sample

4.5.1 Setting

A. A stratified random sample has been selected.

B. Data from each element in the sample have been collected.

C. We wish to estimate the proportion (fraction) of elements in the population which possess a certain attribute. This proportion is denoted by the symbol $P$.

D. Example: Proportion of individuals with limitation of activities in 1978 due to chronic illness.

4.5.2 Estimator of $P$

\[ p_{st} = \frac{1}{N}\left( N_1 p_1 + N_2 p_2 + \cdots + N_H p_H \right) = \frac{1}{N} \sum_{h=1}^{H} N_h p_h = W_1 p_1 + W_2 p_2 + \cdots + W_H p_H = \sum_{h=1}^{H} W_h p_h \tag{4.20} \]

where $p_h$ is the estimator of $P_h$, the proportion of elements in the h-th stratum which possess the attribute. Once again we may view $p_h$ as a special sample mean where $y_{hi} = 0$ if the i-th sample observation in the h-th stratum does not possess the attribute and $y_{hi} = 1$ if it does. Therefore, $p_h$ can be calculated as

\[ p_h = \frac{\sum_{i=1}^{n_h} y_{hi}}{n_h} = \text{stratum sample proportion with attribute} \]

4.5.3 Some Statistical Notes About $p_{st}$

A. $p_{st}$ is a special case of $\bar{y}_{st}$; namely, where the characteristic of interest is the dichotomous, 0-or-1 version of $y_{hi}$ defined above.

B. $p_{st}$ is a random variable.

C. $p_{st}$ is an unbiased estimator of $P$.

D. The sampling distribution of $p_{st}$ will closely resemble the normal distribution provided that $n$ is sufficiently large (e.g., $n p_{st}$ and $n(1 - p_{st})$ are both greater than 10).

4.5.4 Estimated Variance of $p_{st}$

\[ v(p_{st}) = W_1^2 v(p_1) + W_2^2 v(p_2) + \cdots + W_H^2 v(p_H) = \sum_{h=1}^{H} W_h^2 v(p_h) = \sum_{h=1}^{H} W_h^2 \left[ \frac{1 - f_h}{n_h - 1} \right] p_h (1 - p_h) \tag{4.22} \]

4.5.5 Some Statistical Notes About $v(p_{st})$

A. $v(p_{st})$ is a random variable.

B. $v(p_{st})$ is an unbiased estimator of the true variance of $p_{st}$.

4.5.6 Standard Error of $p_{st}$

\[ se(p_{st}) = \sqrt{v(p_{st})} = \sqrt{ \sum_{h=1}^{H} W_h^2 \left[ \frac{1 - f_h}{n_h - 1} \right] p_h (1 - p_h) } \tag{4.23} \]

4.5.7 Confidence Intervals for $P$ ($n p_{st} > 10$ and $n(1 - p_{st}) > 10$)

\[ \text{Lower Boundary: } p_{st} - \{t\}\{se(p_{st})\} \tag{4.24} \]

\[ \text{Upper Boundary: } p_{st} + \{t\}\{se(p_{st})\} \tag{4.25} \]

The value of t will depend on the confidence level that we choose (see Section 2.2.7).

Interpretation: With t = 1.96, we are 95 percent sure that $P$ is covered by the interval whose boundaries are defined by formulas (4.24) and (4.25).

(Skip to Section 4.8.)

4.6 Illustrative Example of Analysis from Stratified Random Samples

4.6.1 Setting

A. A county with two relatively large communities wants to do a survey on certification of the emergency medical technicians (EMTs) who work in the county. EMTs are required to take special training for periodic certification, which they obtain by passing a competency exam.

B. Most EMTs work in "City A," which is relatively large and is located in the main urban area of the county. "City B" is smaller and has fewer EMTs. The rest of the county's EMTs work in smaller towns and in rural areas.

C. Because of suspected similarities in certification patterns among EMTs in City A and comparable similarities in City B, we decide to divide the county into three strata. The "other" stratum includes all EMTs not working in either city. Proportionately allocated sample sizes and other information for the three strata are presented below:

    h    Stratum Composition    N_h    n_h
    ---  -------------------    ---    ---
    1    City A                 155     20
    2    City B                  62      8
    3    Rural Area              93     12
         TOTAL COUNTY           310     40

D. We wish to estimate the following:

(1) $\bar{Y}$: The average number of hours of certification training in the year prior to the last certification.
(2) $\bar{Y}_1$: The average number of hours of certification training in the year prior to the last certification in City A.

(3) $\mathring{Y}$: The total number of certification hours for EMTs in the county for the year prior to the last certification.

(4) $P$: The proportion of EMTs who passed their last periodic certification exam on the first try.

4.6.2 Sample Selection and Survey Results

A. Proportionately allocated simple random samples are selected in the three strata.

B. Data on certification are collected from each of the 40 sample EMTs. Results are summarized in Table 4.1.

Table 4.1 Summary of EMT Certification Survey

Values of $y_{hi}$: number of training hours, and whether the EMT passed the certification exam (1 = yes, 0 = no).

           Stratum 1 (City A)     Stratum 2 (City B)     Stratum 3 (Other)
    i      Hours    Passed        Hours    Passed        Hours    Passed
    ---    -----    ------        -----    ------        -----    ------
    1        35       1             27       1              8       1
    2        28       1              4       0             15       0
    3        26       1             49       0             21       1
    4        41       1             10       1              7       0
    5        43       1             15       0             14       1
    6        29       0             41       0             30       1
    7        32       1             25       0             20       0
    8        37       1             30       0             11       0
    9        36       1                                    12       1
    10       25       1                                    32       0
    11       29       0                                    34       0
    12       31       1                                    24       1
    13       39       1
    14       38       0
    15       40       0
    16       45       1
    17       28       1
    18       27       1
    19       35       1
    20       34       1
    ---    -----    ------        -----    ------        -----    ------
    Sum     678      16            201       2            228       6

Summary:

                 Stratum 1    Stratum 2    Stratum 3
    n_h             20            8           12
    N_h            155           62           93
    W_h              0.5          0.2          0.3
    f_h              0.129        0.129        0.129
    ybar_h          33.900       25.125       19.000
    p_h              0.80         0.25         0.50
    s_h^2           35.358      232.411       87.636

4.6.3 Analysis

A. For $\bar{Y}$:

(1) Estimate:

\[ \bar{y}_{st} = \sum_{h=1}^{3} W_h \bar{y}_h = (0.5)(33.900) + (0.2)(25.125) + (0.3)(19.000) = 27.675 \text{ hours} \]

(2) Variance:

\[ v(\bar{y}_{st}) = \sum_{h=1}^{3} W_h^2 \left[ \frac{1 - f_h}{n_h} \right] s_h^2 = (0.5)^2 \left[ \frac{1 - 0.129}{20} \right] (35.358) + (0.2)^2 \left[ \frac{1 - 0.129}{8} \right] (232.411) + (0.3)^2 \left[ \frac{1 - 0.129}{12} \right] (87.636) = 1.97 \]

(3) Standard Error:

\[ se(\bar{y}_{st}) = \sqrt{v(\bar{y}_{st})} = \sqrt{1.97} = 1.40 \]

(4) 95 Percent Confidence Interval (t = 1.96):

\[ \text{Lower Boundary: } \bar{y}_{st} - 1.96\, se(\bar{y}_{st}) = 27.675 - (1.96)(1.40) = 24.931 \]

\[ \text{Upper Boundary: } \bar{y}_{st} + 1.96\, se(\bar{y}_{st}) = 27.675 + (1.96)(1.40) = 30.419 \]

Interpretation: We are 95 percent sure that $\bar{Y}$ lies somewhere between 24.931 and 30.419 hours.

B. For $\bar{Y}_1$:

(1) Estimate:

\[ \bar{y}_1 = \frac{\sum_{i=1}^{n_1} y_{1i}}{n_1} = \frac{678}{20} = 33.900 \]

(2) Variance:

\[ v(\bar{y}_1) = \left[ \frac{1 - f_1}{n_1} \right] s_1^2 = \left[ \frac{1 - 0.129}{20} \right] (35.358) = 1.54 \]

(3) Standard Error:

\[ se(\bar{y}_1) = \sqrt{v(\bar{y}_1)} = \sqrt{1.54} = 1.24 \]

(4) 95 Percent Confidence Interval: Because of the small sample size ($n_1 = 20$), we would be reluctant to produce a confidence interval for $\bar{Y}_1$ unless special tables of the "Student's" t distribution were used to determine t. Most statistics textbooks contain these tables and an explanation of how to use them.

C. For $\mathring{Y}$:

(1) Estimate:

\[ \mathring{y}_{st} = N \bar{y}_{st} = (310)(27.675) = 8579.25 \]

(2) Variance (computed with the unrounded value $v(\bar{y}_{st}) = 1.9696$, shown above as 1.97):

\[ v(\mathring{y}_{st}) = N^2 v(\bar{y}_{st}) = (310)^2 (1.9696) = 189{,}278 \]

(3) Standard Error:

\[ se(\mathring{y}_{st}) = \sqrt{v(\mathring{y}_{st})} = \sqrt{189{,}278} = 435.06 \]

(4) 95 Percent Confidence Interval:

\[ \text{Lower Boundary: } \mathring{y}_{st} - 1.96\, se(\mathring{y}_{st}) = 8579.25 - (1.96)(435.06) = 7727 \]

\[ \text{Upper Boundary: } \mathring{y}_{st} + 1.96\, se(\mathring{y}_{st}) = 8579.25 + (1.96)(435.06) = 9432 \]

Interpretation: We are 95 percent sure that $\mathring{Y}$ lies somewhere between 7727 and 9432 hours.

D. For $P$:

(1) Estimate:

\[ p_{st} = \sum_{h=1}^{3} W_h p_h = (0.5)(0.80) + (0.2)(0.25) + (0.3)(0.50) = 0.60 \]

(2) Variance:

\[ v(p_{st}) = \sum_{h=1}^{3} W_h^2 \left[ \frac{1 - f_h}{n_h - 1} \right] p_h (1 - p_h) = (0.5)^2 \left[ \frac{1 - 0.129}{19} \right] (0.80)(0.20) + (0.2)^2 \left[ \frac{1 - 0.129}{7} \right] (0.25)(0.75) + (0.3)^2 \left[ \frac{1 - 0.129}{11} \right] (0.50)(0.50) = 0.0045 \]

(3) Standard Error:

\[ se(p_{st}) = \sqrt{v(p_{st})} = \sqrt{0.0045} = 0.0671 \]

(4) 95 Percent Confidence Interval:

\[ \text{Lower Boundary: } p_{st} - 1.96\, se(p_{st}) = 0.60 - (1.96)(0.0671) = 0.47 \]

\[ \text{Upper Boundary: } p_{st} + 1.96\, se(p_{st}) = 0.60 + (1.96)(0.0671) = 0.73 \]

Interpretation: We are 95 percent sure that $P$ lies somewhere between 0.47 and 0.73 (or, alternatively, between 47 and 73 percent).

4.7 Allocation of the Sample Among Strata

Up to this point we have said little about how to choose the sample size for each stratum (i.e., $n_h$ for the h-th stratum).
Having reviewed available resources and/or precision requirements to determine what the total sample size ($n$) will be for the survey, the statistician must give careful thought to his or her choice of stratum sample sizes ($n_h$) so that

\[ n = \sum_{h=1}^{H} n_h \]

The remainder of this chapter deals with two of the four strategies for allocating the sample among strata. One is called Proportionate Stratified Sampling and the other is called Optimum Allocation. Both strategies are special types of stratified random sampling designs, so we continue to assume that simple random sampling is the selection method in each stratum.

Four strategies for allocating the sample among strata:

(1) Proportionate: Same sampling rates

(2) Optimum: Most cost-efficient sampling rates

(3) Balanced: Equal sample sizes

(4) Disproportionate: Unequal sampling rates (to "oversample" important domains)

(Skip to Section 4.2.)

4.8 Proportionate Stratified Sampling

4.8.1 Stratum Allocation

Choose the same sampling rate ($f_h$) for all strata. In other words,

\[ f_h = \frac{n_h}{N_h} = \frac{n}{N} = f \tag{4.26} \]

This is the same as saying that

\[ W_h = \frac{N_h}{N} = \frac{n_h}{n} = w_h \tag{4.27} \]

Formula (4.27) means that the proportion of the sample chosen from any given stratum will be the same as the proportion of the population in that same stratum.

4.8.2 Estimation of Means, Totals and Proportions from Proportionate Stratified Samples

To estimate $\bar{Y}$:

\[ \text{Estimator: } \bar{y}_{prop} = \frac{\sum_{i=1}^{n} y_i}{n} \qquad \text{Variance estimator: } v(\bar{y}_{prop}) = \frac{1 - f}{n} \sum_{h=1}^{H} w_h s_h^2 \]

To estimate $\mathring{Y}$:

\[ \text{Estimator: } \mathring{y}_{prop} = N \bar{y}_{prop} \qquad \text{Variance estimator: } v(\mathring{y}_{prop}) = \frac{1 - f}{f^2} \sum_{h=1}^{H} n_h s_h^2 \]

To estimate $P$:

\[ \text{Estimator: } p_{prop} = \frac{\sum_{i=1}^{n} y_i}{n} \qquad \text{Variance estimator: } v(p_{prop}) = \frac{1 - f}{n^2} \sum_{h=1}^{H} \frac{n_h^2}{n_h - 1}\, p_h (1 - p_h) \]

4.8.3 Some Statistical Notes on Proportionate Stratified Sampling

A. When proportionate stratified sampling is used, each element in the population has the same probability of selection. This type of design is called a self-weighting design, since sample estimates of means and proportions for the total population are simple arithmetic means.

B.
All estimators and variance estimates presented in Section 4.8.2 are unbiased.

C. Ignoring terms of order $1/N_h$, the true variance of estimates from proportionate stratified samples is never higher, and is usually lower, than that from simple random samples of the same size ($n$). Gains over simple random sampling are greatest when strata are internally homogeneous.

4.9 Optimum Allocation

4.9.1 Definition

Optimum Allocation: A method of stratum allocation of a stratified random sample in which stratum sampling rates are directly proportional to the stratum element standard deviation ($S_h$) and inversely proportional to the square root of $c_h$, the average unit cost of data collection in the stratum. This optimum stratum allocation yields estimates with the smallest possible variance for fixed total survey cost (i.e., sample size).

4.9.2 Setting

A. Definitions of strata have been established.

B. Average unit costs for collecting data in each stratum ($c_h$) are known.

C. Some (possibly crude) measure of the element standard deviation ($S_h$) is available for each stratum.

D. $W_h$ is known exactly or approximately for each stratum.

E. Sample size ($n$) has been specified.

4.9.3 Allocation Giving Minimum $V(\bar{y}_{st})$ for Fixed Cost

A. For estimating $\bar{Y}$ or $\mathring{Y}$:

\[ n_h = n \left[ \frac{N_h S_h / \sqrt{c_h}}{\sum_{h=1}^{H} N_h S_h / \sqrt{c_h}} \right] \tag{4.28} \]

B. For estimating $P$:

\[ n_h = n \left[ \frac{N_h \sqrt{P_h (1 - P_h) / c_h}}{\sum_{h=1}^{H} N_h \sqrt{P_h (1 - P_h) / c_h}} \right] \tag{4.29} \]

4.9.4 Estimators and Estimated Variances

A. General Rule: Apply the appropriate formula for stratified random sampling using the $n_h$ established by either formula (4.28) or (4.29).

B. Estimators: $\bar{y}_{st}$, $\mathring{y}_{st}$, or $p_{st}$.

C. Estimated Variances: $v(\bar{y}_{st})$, $v(\mathring{y}_{st})$, or $v(p_{st})$.

4.9.5 Some Statistical Notes on Optimum Allocation

A. Optimum allocation of $n_h$ to the h-th stratum depends on:

(1) $S_h$: Variability among population elements in the h-th stratum

(2) $c_h$: Average cost per sample element in the h-th stratum

(3) $N_h$: Total number of elements in the h-th stratum

B.
$S_h$ is never known exactly, but it can often be approximated:

(1) When estimating $\bar{Y}$ or $\mathring{Y}$, use:

(a) Data from previous or related surveys

(b) Expert opinion

(c) Data from a pretest or pilot study

(d) Intelligent guess by the survey statistician (e.g., from knowledge of the statistical range of values in the population)

(2) When estimating $P$:

\[ S_h \approx \sqrt{P_h (1 - P_h)} \]

By knowing approximate values for $P_h$, we can estimate $S_h$.

C. Reasonably good approximations for $S_h$ are likely to yield estimates whose variances are very close to the minimum possible variance.

D. Unless unit costs differ widely among strata, proportionate stratified sampling is almost always preferred when estimating proportions (i.e., it is more practical and its precision is almost the same).

E. Specifically, variances of estimates of $P$ from optimum allocation are rarely much smaller than variances from proportionate stratified samples. Stratum proportions ($P_h$) must be widely different before substantial precision gains will result.

4.10 Illustrative Example of Optimum Allocation

4.10.1 Setting

A. Suppose that we wish to determine an optimum allocation for the county-wide EMT certification survey presented earlier in Section 4.6.

B. We prespecify $n = 40$.

C. We wish to determine the stratum allocation which will give us close to minimum variance for $\bar{y}_{st}$, the estimator of the mean number of certification hours per EMT in the county.

4.10.2 Preliminary Information Available to the Survey Statistician

    h    Stratum Composition    S_h    c_h    N_h    N_h S_h / sqrt(c_h)
    ---  -------------------    ---    ---    ---    -------------------
    1    City A                   5      9    155          258.33
    2    City B                  15      9     62          310.00
    3    Rural Area              10     16     93          232.50
         TOTAL COUNTY                                      800.83

4.10.3 Resulting Stratum Allocation

\[ n_1 = n \left[ \frac{N_1 S_1 / \sqrt{c_1}}{\sum_{h=1}^{3} N_h S_h / \sqrt{c_h}} \right] = 40 \left[ \frac{258.33}{800.83} \right] = 12.90 \approx 13 \]

\[ n_2 = 40 \left[ \frac{310.00}{800.83} \right] = 15.48 \approx 15 \]

\[ n_3 = 40 \left[ \frac{232.50}{800.83} \right] = 11.61 \approx 12 \]

4.11 Some Concluding Statistical Notes on Stratified Random Sampling

A.
General stratification rules:

(1) The stratification variable should be highly correlated with the principal characteristic being measured in the survey (e.g., age would be a good stratification variable if we were doing a survey on limitation due to chronic illness).

(2) Strata should be internally homogeneous (and mutually heterogeneous).

(3) The variance of a population estimate will be smallest (for fixed cost) when each stratum sampling rate is directly related to the variability of elements within the stratum and inversely related to the unit cost of data collection in the stratum.

(4) Proportionate stratified sampling is always "safe" in that precision will never be worse than in simple random sampling.

B. Some advantages of stratified sampling:

(1) Improved precision of estimates (i.e., smaller variances), which leads to narrower confidence intervals.

(2) Better control of sample sizes for subpopulations which can be defined by strata and for which separate estimates may be sought.

(3) Sampling designs can be made more flexible. For example, special strata may be established to handle more difficult segments of the population (e.g., the transient population in household surveys).

C. Several stratum allocations yield precision very close to that of the optimum allocation. Excessive attempts to determine the actual optimum allocation are almost never cost-effective.

D. Stratification and sophisticated allocation schemes add to survey costs. Before embarking upon these techniques, be convinced that the likely gains in precision will merit the cost of realizing those gains.

E. Discussion of stratification in this chapter has largely assumed that simple random sampling is used in each stratum. It should be noted that any probability sampling method can be used in each stratum. In addition, probability sampling methods may differ widely among strata.

Supplementary Reading

[1] Kish, L., Survey Sampling, Wiley and Sons, New York, 1965, Sections 3.1-3.6 and 4.6.
[2] Cochran, W. G., Sampling Techniques, Third Edition, Wiley and Sons, New York, Sections 5.1-5.7, 5.10-5.11, 5A.1-5A.4, 5A.7-5A.9, and 8.1-8.8.
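
Computational note on Sections 4.8 and 4.10: the allocation rules of formulas (4.26) and (4.28) can be sketched in Python as shown here. This is a minimal sketch, not a definitive implementation. The function names `optimum_allocation` and `proportionate_allocation` are hypothetical, and the inputs are the preliminary values listed in Section 4.10.2. Rounding each stratum size to the nearest integer is an assumption made for illustration; in general, simple rounding need not reproduce the target total $n$ exactly.

```python
import math


def optimum_allocation(n, N_h, S_h, c_h):
    """Unrounded stratum sample sizes under formula (4.28)."""
    # Each stratum's share is proportional to N_h * S_h / sqrt(c_h)
    terms = [N * S / math.sqrt(c) for N, S, c in zip(N_h, S_h, c_h)]
    total = sum(terms)
    return [n * term / total for term in terms]


def proportionate_allocation(n, N_h):
    """Unrounded stratum sample sizes under formula (4.26): n_h = n * N_h / N."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]


# Preliminary information from Section 4.10.2
N_h = [155, 62, 93]   # stratum population sizes
S_h = [5, 15, 10]     # approximate element standard deviations
c_h = [9, 9, 16]      # average unit costs

raw_opt = optimum_allocation(40, N_h, S_h, c_h)
n_opt = [round(x) for x in raw_opt]                      # nearest-integer rounding (illustrative)
n_prop = [round(x) for x in proportionate_allocation(40, N_h)]

print("optimum (unrounded):", [f"{x:.2f}" for x in raw_opt])
print("optimum (rounded):  ", n_opt)
print("proportionate:      ", n_prop)
```

Compared with the proportionate allocation used in Section 4.6, the optimum allocation shifts sample from City A, where the assumed $S_h$ is smallest, to City B, where it is largest.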