92 SECTION 6 CLUSTER SAMPLING AND SYSTEMATIC LIST SAMPLING 6.1 What is Cluster Sampling? 6.1.1 Definition Cluster Sampling --- Probability sampling in which sampling units at some point in the selection process are collections, or clusters, of population elements 6.1.2 Selecting a Simple One-Stage Cluster Sample A. Specify appropriate clusters. In most instances clusters would consist of population elements which are physically close together and therefore relatively similar to each other (i.e., with high intra-cluster homogeneity). Clusters almost always have varying numbers of population members. 93 B. Select a simple random sample of clusters from a complete list of clusters. C. Collect survey data from all population elements falling in the selected clusters. Note: We have considered here a special type of cluster sampling in which the probability sampling method used to select clusters is simple random sampling. However, any form of probability sampling as well as stratification can be used to choose the clusters. 6.1.3 Some Examples A. In a household survey for a small city, a probability sample of blocks is selected. Each block in this case represents a cluster of households. Households may or may not be subsampled within selected blocks. 94 B. In a national sample of inpatient hospital visits for individuals with multiple sclerosis during some calendar year, a probability sample of hospitals is chosen. Each hospital in this instance represents a cluster of visits by patients with multiple sclerosis who were discharged during that year. C. In a survey of first graders in the schools of a state, a probability sample of schools is selected. All first graders in a school would represent a cluster in this design. 6.1.4 Some Implications of Using Cluster Sampling A. If we collect survey data on all members of each selected cluster (i.e., we have a "one-stage" sample of members), the probability of choosing each element in this cluster is the same as the probability of choosing the cluster regardless of the number of elements in the cluster. For example, if we have 100 clusters of unequal size and we randomly choose 10 of them, the probability of selection for each element in the 10 selected clusters is 0.1 regardless of the sizes of the selected clusters. 95 B. Cluster sampling generally yields estimates with relatively larger variances (i.e., lower precision) than samples of the same size which are chosen by element (i.e., non cluster) sampling. The amount of the increase in variance is directly related to the average sample cluster size. C. Because members of clusters are often close in geographic proximity, the average cost per sample element can be reduced substantially over element sampling if cluster sampling is used. The amount of the reduction in costs is directly related to the average size of the clusters that are used. D. Since elements in clusters are usually similar (i.e., clusters are internally homogeneous), the amount of information gathered by the survey may not be increased substantially as new measurements are taken within clusters. This tells us that sample cluster sizes should not be too large. As a general rule, the number of clusters in the population should be large which means that the average size of clusters should be kept as small as possible. 96 E. NOTE RE-WORDING: The survey statistician frequently has some choice in the size of clusters that are used in a survey. In making this choice, the cost advantages of large (sample) clusters must be properly weighed against the statistical advantages of smaller (sample) clusters. F. Analysis of data from cluster samples is frequently somewhat more complex than analysis of data from element samples. G. Cluster sampling eliminates the need for a sampling frame consisting of a list of all elements in the population. Since clusters are the units being sampled, a listing of all clusters in the population constitutes an appropriate frame. H. V(n)>0 is possible if clusters vary in size. (Skip to Section 6.5 on p. 111) 6.2 Estimating a Population Mean from a Simple Cluster Sample 6.2.1 Setting A. A population of elements is divided into clusters so that each element is a member of only one cluster. 97 B. We wish to estimate the mean per element for some characteristic in the population (denoted by the symbol, Y ). C. The sampling method we decide to use in selecting our sample of clusters is simple random sampling. Other more complex sampling methods involving stratification might have also been used. D. Data are collected from all elements in the sample of clusters. 6.2.2 Some Definitions N = Number of elements in the population A = Number of clusters in the population N B = Average number of elements per cluster in A the population a = Number of clusters chosen in the sample n j = Number of population elements which fall in the j-th cluster 98 a n = n j = Total number of elements in the clusters j1 chosen in the sample b n = Average number of elements per cluster in the a sample y j = Total of all observations in the j-th cluster 6.2.3 Estimator of Y a r yj j1 a nj a yj j1 n (6.1) j1 Note: The symbol " r " is used since the estimator in formula (6.1) is usually called a ratio mean. 99 6.2.4 Estimated Variance of r bg A. If the average cluster size in the population B is known; Aa 2 v( r ) s 2 c AaB (6.2) where s2c is a measure of variability among clusters calculated by the formula a s2c ( y j rn j ) a 2 j1 a 1 a a y r n 2r y jn j j1 2 j 2 j1 2 j j1 a 1 (6.3) bg B. If the average cluster size in the population B is not known: (A a)a 2 v( r ) sc 2 An where s2c is defined in formula (6.3). (6.4) 100 6.2.5 Standard Error of r se( r ) v( r ) (6.5) where the value of v( r ) would depend on whether formula (6.2) or (6.4) was used. 6.2.6 Confidence Intervals for Y (n>50) Lower Boundary: r t se( r ) (6.6) Upper Boundary: r t se( r ) (6.7) The value of t depends on the confidence level we wish for the confidence interval (see Section 2.2.7). Interpretation: We are 95 percent sure that Y is covered by the interval whose (t=1.96) boundaries are computed by formulas (6.6) and (6.7). A. Formulas in Section 6.2 can be used to estimate a population proportion (P) with some attribute. 101 This is done by letting y j represent the number of elements possessing the attribute in the j-th cluster in the sample. B. r is not an unbiased estimator of Y (or P for proportions ) whenever we sample clusters of unequal size. The bias emerges since different cluster samples are likely to yield different values of n (i.e., n is a random variable). The amount of bias is directly related to the variability in size among clusters in the population. The survey statistician must therefore find some ways to reduce this variability. C. Some common methods for reducing variability in size among clusters: (1) Stratify clusters by sizes (i.e., by n j ). (2) Divide large clusters into smaller ones. (3) Combine smaller clusters into larger ones. (4) Select clusters with probabilities proportional to size (i.e., to n j ) and subsample within clusters. This is the so-called method of multi-stage sampling which is discussed briefly below. For additional reading see Chapter 7 of Kish [2]. 102 6.3 Estimating a Population Total from a Simple Cluster Sample 6.3.1 Setting A. A population of elements is divided into clusters. B. We wish to estimate the aggregate total of some characteristic in the population (denoted by the o symbol, Y ). C. A simple random sample of clusters is selected. D. Data are collected from all elements in the sample of clusters. 6.3.2 Some Definitions We use the same symbols for formulas to estimate totals as we did in Section 6.2.2. o 6.3.3 Estimator of Y A. If N is known: o N a r y j Nr j n 1 L O M NP Q (6.8) 103 B. If N is not known: o A a r2 y j Ay t a j1 L O M NP Q (6.9) where y t is the average total among clusters in the sample which is calculated with the formula a yt yj j1 (6.10) a o 6.3.4 Estimated Variance of r A. If N is known: A(A a) 2 v(r1 ) N 2 V( r ) sc a o where s2c is defined in formula (6.3). (6.11) 104 B. If N is not known: A(A a) 2 v(r2 ) A 2 v(y t ) st a o (6.12) where s2t measures the variability among cluster totals calculated by the formula: a a s 2 t (y j yt )2 j1 a 1 ay j1 2 j L y O M NP Q 2 a j1 a (a 1) j (6.13) o 6.3.5 Standard Error of r o o se(r) v(r) (6.14) o where the value of var( r ) would depend on whether formula (6.11) or (6.12) was used. 105 o 6.3.6 Confidence Intervals for Y Lower Boundary: r t L se( r ) O M N P Q Upper Boundary: r t L se( r ) O M N P Q o o o (6.15) o (6.16) The value of t depends on the confidence level we wish for the confidence interval (see Section 2.2.7). o Interpretation: We are 95 percent sure that Y is covered by the interval whose (t=1.96) boundaries are computed by formulas (6.15) and (6.16). 106 6.4 Illustrative Example of Cluster Sampling 6.4.1 Setting A. A household survey on health care utilization is done in a small city. Specifically, the objective is to estimate the following: (1) Y : The average number of health care visits per household in the city o (2) Y : The total number of health care visits in the city. B. The city consists of 415 blocks, which are used as clusters for the survey. A list of the city's blocks is prepared. C. The decision is made to select a simple random sample of 25 blocks and to measure the number of health care visits by each household in the sampled blocks. D. Results of the survey are summarized in Table 6.1. 107 Table 6.1 Summary of Results from Health Care Utilization Survey Cluster j Number of Households Number of Visits by Cluster Members nj yj 1 2 3 4 5 6 7 8 9 10 11 12 13 8 12 4 5 6 6 7 5 8 3 2 6 5 96 121 42 65 52 40 75 65 45 50 85 43 54 Summary Statistics: Cluster j j1 n n j 151 j1 a y j1 j 1,329 yj 10 9 3 6 5 5 4 6 8 7 3 8 A 415 y a nj 14 15 16 17 18 19 20 21 22 23 24 25 a a 25 Number of Households Number of Visits by Cluster Members 2 j 82,039 2 j 1,047 a n j1 a y n j1 j j 8, 403 49 53 50 32 22 45 37 51 30 39 47 41 108 6.4.2 Analysis (N and B are unknown) A. Estimate of Y : a r yj j1 n a s2c 1329 8.801 151 a a y r n 2r y jn j j1 2 j 2 j1 2 j j1 a 1 82,039 (8.801) 2 (1047) 2(8.801)(8403) = 634.479 24 (A a)a 2 (415 25)(25) v( r ) sc 634.479 2 2 An (415)(151) = 0.6538 se(r) v( r ) 0.6538 0.8086 109 95 percent confidence interval for Y : Lower Boundary: 8.801 - (1.96)(0.8086) = 7.216 Upper Boundary: 8.801 + (1.96)(0.8086) = 10.386 Interpretation: We are 95 percent sure that Y lies somewhere between 7.216 and 10.386 visits. o B. Estimate of Y : o r2 LA O L415O MP 1329 y M P Na Q N25 Q a j1 a s 2 t a y 2j j1 j 22,061 L y O M NP Q (25)(82,039) (1329) j1 a (a 1) = 474.557 2 a j (25)(24) 2 110 A(A a) 2 (415)(415 25) v(r2 ) st 474.557 a 25 o = 3,072,282 o o se( r 2 ) var( r 2 ) 3,072,282 1,753 o 95 percent confidence interval for Y : Lower Boundary: 22,061 - (1.96)(1753) = 18,625 Upper Boundary: 22,061 + (1.96)(1753) = 25,497 o Interpretation: We are 95 percent sure that Y lies somewhere between 18,625 and 25,497 visits. 111 6.5 Multi-Stage Cluster Sampling 6.5.1 Definition Multi-Stage Cluster Sampling --- A method of probability sampling in which the sample of elements is chosen in two or more stages. Second stage sampling units are chosen from the sampling units selected in the first stage. Third stage units are chosen from second stage sampling units; and so forth. 6.5.2 Selecting a Two-Stage Sample A. A list of first stage sampling units is prepared. Sampling units in the first stage of a multi-stage sample are called primary sampling units, or PSUs. B. A probability sample of first stage units is chosen. C. In each of the selected first stage units a list of population elements is prepared. D. A separate probability sample of elements is selected in each of the selected first stage units. 112 6.5.3 Some Examples A. Household sample of the non-institutionalized population in Virginia Primary Sampling Unit: Minor Civil Divisions Secondary Sampling Units: Small Groups of Blocks Tertiary Sampling Units: Households B. National Sample of Hospital Discharges PSU: Small Groups of Counties SSU: Hospitals TSU: Patient Medical Records 6.5.4 Some Implications of Multi-Stage Cluster Sampling A. Most of the cost savings can be retained while gaining back some of the statistical losses (i.e., larger variances) of cluster sampling. 113 B. Analysis of data collected in multi-stage designs can become increasingly more complex as the number of stages increases. 6.5.5 Analysis of Data from Multi-Stage Cluster Samples Analysis of data from multi-stage cluster samples is somewhat beyond the scope of these lectures. However, it is important to note that features of the design used to choose the sample do affect the approach and results of statistical inference. Further discussion of this topic can be found in the following references: [1] Wolter, K Introduction to Variance Estimation, Springer-Verlag, New York, 1985. [2] Kish, L., Survey Sampling, Wiley and Sons, New York, 1965, Chapters 6 and 7. [3] Cochran, W.G., Sampling Techniques, Third Edition, Wiley and Sons, 1977, Chapters 8. 10, 11. 114 6.6 PPS Cluster Sampling 6.6.1 General: A. Probability Proportional to Size. B. Used in conjunction with cluster samples selected in two or more stages. C. "Size" = Some measure which is indicative (often crudely) of the number of analysis units in a PSU (e.g., in the 1980 Somalia survey was a recent administrative count of the number of households in the PSU divided by 40, and rounded). D. PPS selection is invoked in selecting clusters (i.e., usually used in all but the final stage of sampling). E. Both with and without replacement methods available; latter generally more complicated. 115 6.6.2 Usefulness of PPS Sampling A. Creates equal-probability samples in which number of respondents per cluster is nearly equal; this simplifies data collection. B. Reduces the amount of variation in the overall sample size ( no ) which improves the statistical quality of survey estimates. Example: Unstratified Two-Stage Sample of 500 Dwellings in Atlanta DESIGN ALTERNATIVE 1 --- TWO-STAGE PPS: PSU = Census block (select with PPS) SSU = Dwelling (select with SRS using sampling rate determined by PSU "size") 116 Two-Stage PPS Sample Selection Steps: (1) List all A blocks in Atlanta (to create the first stage sampling frame), noting B (the number of dwellings in the -th block), or some reasonable measure of B . (2) Select 100 blocks with probability proportional to size (PPS); thus, 100B = Pr{Select -th block in Atlanta}= B where A B 1 B = Number of dwellings in Atlanta L O M N P Q (3) Within each selected block select a simple random sample (SRS) of exactly 5 dwellings; thus, | = Pr{Select a dwelling in -th block | select -th block} 5 B 117 (4) Thus, the probability of selecting any household in Atlanta by this PPS design is 100B 5 | 500 B B B L O L O M M N P Q NP Q Implications of Two-Stage PPS Design: + Constant sample size (5) in each sample block + Constant total sample size (500) among all possible two-stage samples; i.e., V(n o ) 0 + Equal selection probabilities for dwellings in Atlanta DESIGN ALTERNATIVE 2 --- TWO-STAGE SRS: PSU = Census block (select with SRS) SSU = Dwelling (select with SRS using sampling rate determined by PSU "size") 118 Two-Stage SRS Sample Selection Steps: (1) List all A blocks in Atlanta (to create the first stage sampling frame) (2) Select 100 blocks with SRS; thus, L O M NP Q = Pr{Select -th block in Atlanta}= 100 A (3) Within each selected block select a simple random sample (SRS) of exactly 5 dwellings; thus, | = Pr{Select a dwelling in -th block | select -th block} 5 B 119 (4) Thus, the probability of selecting any household in Atlanta by this 2-stage SRS design is L O L O MQ PM N NP Q | 100 5 500 A B AB Implications of Two-Stage SRS Design: + Constant sample size (5) in each sample block + Constant total sample size (500) among all possible two-stage samples; i.e., V(n o ) 0 - Unequal selection probabilities for dwellings in Atlanta (implying the need for sample weights for analysis and precision loss due to variable weights) 120 6.7 Systematic (List) Sampling 6.7.1 Definition: Systematic Sampling --- A method of probability sampling in which elements on an ordered list of population members are chosen by applying an interval of constant length after a random start. 6.7.2 Selecting a Systematic Sample A. The sampling frame consists of a list numbered sequentially from 1 to the number of sampling units in the list. B. A sampling interval (denoted by the symbol, k) is chosen. If a sample of about n out of N elements is desired, k is usually the ratio, N/n, rounded to the nearest integer. C. A random number between 1 and k is chosen. This number is called the random start and will be denoted by the symbol, g. D. Elements selected in the sample are those number g and every k-th element for the remainder of the list; i.e., g, g+k, g+2k, etc. 121 6.7.3 Example of the Systematic Selection Process We wish to select a systematic 1-in-5 sample from a population of 1000 discharges in a hospital. The list of discharges is represented in Figure 6.1. The random start is 3. Discharge 1 2 3 4 5 6 7 8 9 10 . . . 996 997 998 999 1000 Discharge Sampled 3 8 . . . . . . 998 Figure 6.1 Representation of the Process of Selecting a Systematic List Sample of Patient Discharges 122 6.7.4 Advantages of Systematic Sampling A. Systematic sampling can be quickly and easily applied. B. With proper ordering of the population list, systematic sampling can be used as a method of selecting a sample which is similar to a proportionate stratified sample. 6.7.5 Estimating a Population Mean Y If n represents the number of elements in the systematic sample, the estimator of Y is: n yi y sys i 1 n (6.17) 123 6.7.6 Estimated Variance of ysys A. If we can assume that the list is "thoroughly shuffled" (i.e., essentially randomly ordered), 1 f 2 v(ysys ) s n (6.18) where n s2 ny i 1 2 i n 2 yi i 1 n( n 1) (6.19) B. If we cannot assume that the list is "thoroughly shuffled," 1 f n / 2 v(ysys ) 2 (y h1 y h 2 ) 2 n h 1 (6.20) Formula (6.20) means that we multiply the factor 1 f by the sum of squares of successive paired n2 differences. L O M NP Q 124 6.7.7 Standard Error of ysys se(ysys ) v(ysys ) (6.21) 6.7.8 Confidence Intervals for Y (n>50) ch t[sec y h ] Lower Boundary: ysys t[se ysys ] (6.22) Upper Boundary: ysys (6.23) sys The value of t depends on the confidence level we wish for the confidence interval (see Section 2.2.7). Interpretation: We are 95 percent sure that Y is covered by the interval whose (t=1.96) boundaries are calculated by formulas (6.22) and (6.23). 125 6.7.9 Illustrative Example of Analysis from a Systematic Sample A. Data on the length of stay from a systematic sample of 20 hospital discharges from a population of 400 discharges are shown in Table 6.2. The list of sample elements is presented in the order in which they were selected from the sampling frame. Table 6.2 Results of a Systematic Sample of Hospital Discharges Sample Discharge i 1 2 3 4 5 6 7 8 9 10 Length of Stay yi 6 2 1 3 3 4 12 2 7 4 Sample Discharge i 11 12 13 14 15 16 17 18 19 20 20 y i = 103 n = 20 N = 400 i1 20 y 2i = 907 f = n/N = 0.05 i 1 Length of Stay yi 9 6 3 1 1 19 3 6 9 2 126 B. Estimator of Y : n y sys yi 103 515 . 20 n i 1 C. If we can assume that the sampling frame is thoroughly shuffled: n s2 n n y [ y i ]2 20 907f a 103f afa = = 19.818 20 19 afaf na n 1f 1 f O 1 005 . O 19.818 = 0.941 s L L M P M P n 20 N QN Q i 1 v ysys 2 i 2 i 1 2 D. If we cannot assume that the sampling frame is thoroughly shuffled: 1 f n / 2 v ysys 2 (y h1 y h 2 ) 2 n i 1 127 L O a f a f a f M P N Q 2 2 2 . 1 005 6 2 1 3 ... 9 2 202 1247 . 6.7.10 Some Statistical Notes on Systematic Sampling A. y sys is an unbiased estimator if N/n is an integer and very nearly unbiased if it is not an integer. B. If the sampling frame is not randomly ordered, v ysys is not unbiased and tends to slightly overestimate the true variance of v ysys . C. Systematic sampling is a special form of cluster sampling. Specifically, the population can be viewed as consisting of k clusters each of which is a possible systematic sample which can be chosen. By choosing a random start and applying a fixed interval in selecting the sample, 128 we are effectively randomly choosing one of the k possible clusters. D. If we can assume that the population list is "thoroughly shuffled," the formulas of Sections o 2.3 and 2.4 can be used to estimate a total (Y) or a proportion (p). E. Some precautionary notes concerning use of systematic sampling: (1) Periodicity in the population list (a) Problem: Variability of the measurement (i.e., yi ) in the list such that comparable measurements appear at fixed intervals. Particular problems arise when the periodic intervals of similarity are integral multiples of the sampling interval k. (b) Remedy: Change the random start several times as one moves through the list during the selection process. (2) Monotonicity in the population list 129 (a) Problem: Measurement values tend to increase or decrease as one moves through the list. The value of the random start therefore strongly influences the value of the estimate resulting from the sample. (b) Remedy: Randomly order the list, if possible. Supplementary Reading [1] Mendenhall, W., Ott, L., and Scheaffer, R.L., Elementary Survey Sampling, Duxbury Press, Belmont, 1971, Chapters 7 and 8. [2] Kish, L., Survey Sampling, Wiley and Sons, New York, 1965, Chapters 4, 5, and 6. [3] Cochran, W.G., Sampling Techniques, Third Edition, Wiley and Sons, New York, 1977, Chapters 8, 9, and 9A.