Cluster Sampling & Systematic Sampling

Recall that cluster sampling is where we first divide the population into “clusters,” then select a simple random sample (SRS) of these clusters, and sample every unit within the selected clusters. This is a two-stage sampling plan, where we employ an SRS of clusters at the first stage and a census within these selected clusters at the second stage.

Systematic sampling is where we first select a random starting point in the population, and then sample every mth unit beginning at that starting point. This is also a two-stage sampling plan, where we employ an SRS of size 1 from the list of potential starting points, and then census the sampling units at multiples of m units from the initial unit.

• It turns out that cluster sampling and systematic random sampling, as defined, share identical sampling structures as two-stage sampling plans. To see this, consider the following general situation. Suppose the population is first partitioned into mutually exclusive groups, known as primary units, each of which contains a number of sampling units, known as secondary units. Selection occurs on the primary units, and then every secondary unit within a selected primary unit is sampled.

• This should be clear for cluster sampling. With systematic sampling, the primary units are groups of observations m units apart in the population list. For example, if we want every 5th member of a population, then there are 5 primary units, and we take an SRS of 1 of these units and then examine every secondary unit in the primary unit selected.

Main Idea: The important point for these two sampling plans is that whenever a primary unit is selected, all secondary units within it are sampled. In truth, then, the primary units are the sampling units in a cluster or systematic sample, even though measurements or observations are actually made on the secondary units.
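The primary-unit view of systematic sampling can be sketched in R. This is only an illustration under assumed values: the population of 23 labelled units and the names pop, m, primary, start, and sys_sample are hypothetical and do not come from the notes.

```r
# Hypothetical population list of 23 units (labels only), sampled every m = 5th unit.
pop <- 1:23
m <- 5L

# Primary unit k holds the units at list positions k, k + m, k + 2m, ...
primary <- split(pop, (pop - 1) %% m + 1)

# Stage 1: an SRS of size 1 from the m possible starting points.
set.seed(1)   # arbitrary seed, used only so the sketch is reproducible
start <- sample(m, 1)

# Stage 2: a census of every secondary unit in the selected primary unit.
sys_sample <- primary[[as.character(start)]]

# The same sample written the usual "every m-th unit" way.
every_mth <- seq(start, length(pop), by = m)

print(lengths(primary))   # primary units need not all have the same size
print(sys_sample)
```

Because 23 is not a multiple of 5, the five primary units have unequal sizes here, which is the same situation as clusters of unequal size in cluster sampling.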
Thompson lists three special considerations for these two sampling plans which warrant further discussion and separate consideration (pages 129 & 131).

1. In cluster sampling, the size of the cluster may serve as auxiliary information that may be used either in selecting clusters with unequal probabilities (PPS sampling) or in forming ratio estimators.
2. The size and shape of clusters may affect efficiency.
3. In systematic sampling, it is not uncommon to have a sample size of one; that is, a single primary unit.

After defining the relevant notation for these sampling plans, each of the special considerations above will be addressed by way of examples.

Notation: Consider taking a cluster sample from some population with response variable y. We let:

\(N\) = the number of primary units (clusters) in the population,
\(n\) = the number of primary units (clusters) in the sample,
\(M_i\) = the number of secondary units in the ith primary unit,
\(M = \sum_{i=1}^{N} M_i\) = the total number of secondary units in the population,
\(y_{ij}\) = the y-value of the jth secondary unit in the ith primary unit,
\(y_i = \sum_{j=1}^{M_i} y_{ij}\) = the total of the y's in the ith primary unit (cluster totals),
\(\tau = \sum_{i=1}^{N} y_i\) = the total of the y's in all the units,
\(\mu = \tau / M\) = the mean of the y's per secondary unit,
\(\mu_1 = \tau / N\) = the mean of the y's per primary unit.

Example: A sociologist wants to estimate the average per capita income in a certain small city. As no list of resident adults is available, she decides that each of the city blocks will be considered one cluster. The clusters are numbered on a city map from 1 to 415, and the experimenter decides she has enough time and money to sample n = 25 clusters, where every household will be interviewed within the clusters (blocks) chosen. The data below give the number of residents and the total income for each of the 25 blocks sampled. [Problem taken from Scheaffer, Mendenhall, & Ott, Elementary Survey Sampling, page 248.]
Cluster   Number of Residents   Total Income per Cluster
   1               8                   $192,000
   2              12                   $242,000
   3               4                    $84,000
   4               5                   $130,000
   5               6                   $104,000
   6               6                    $80,000
   7               7                   $150,000
   8               5                   $130,000
   9               8                    $90,000
  10               3                   $100,000
  11               2                   $170,000
  12               6                    $86,000
  13               5                   $108,000
  14              10                    $98,000
  15               9                   $106,000
  16               3                   $100,000
  17               6                    $64,000
  18               5                    $44,000
  19               5                    $90,000
  20               4                    $74,000
  21               6                   $102,000
  22               8                    $60,000
  23               7                    $78,000
  24               3                    $94,000
  25               8                    $82,000

Sample totals: \(\sum_{i=1}^{25} M_i = 151\) residents and \(\sum_{i=1}^{25} y_i\) = $2,658,000.

Given that M = 2500 residents, use these data to estimate the average per capita income in the city.

Notation:
N = 415 blocks, n = 25 blocks, M = 2500 residents,
\(M_i\) = the number of residents in the ith block,
\(M\) = the total number of residents in all 415 blocks,
\(y_{ij}\) = the income of the jth resident in the ith block,
\(y_i\) = the total income of all residents on the ith block,
\(\tau\) = the total income of all residents in the city,
\(\mu\) = the mean income per resident,
\(\mu_1\) = the mean income per block.

Aside: If the goal of a cluster sampling problem is to estimate \(\tau\) or \(\mu_1\), the population total or mean, then we can treat this situation as an SRS of n clusters (or blocks in the example), and estimate the mean and total as with an SRS:

• Estimate \(\mu_1\) with
\[
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{25}[2658000] = \$106{,}320,
\]
with standard error:
\[
SE(\bar{y}) = \sqrt{\left(\frac{N-n}{N}\right)\frac{s_u^2}{n}} = \sqrt{\left(\frac{415-25}{415}\right)\frac{1898226667}{25}} = \$8{,}447,
\]
where \(s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2\) = the sample variance of the cluster totals (\(y_i\)'s).

• Estimate \(\tau\) with \(N\bar{y} = 415(106320)\) = $44,122,800, with standard error:
\[
SE(\hat{\tau}) = \sqrt{N(N-n)\frac{s_u^2}{n}} = \sqrt{415(415-25)\frac{1898226667}{25}} = \$3{,}505{,}584.
\]

• Note that all of the above computations are done strictly on the observations on the primary units (or clusters). Considering only the cluster totals, we cannot estimate \(\mu\), the average income per resident.

• However, here we have an auxiliary variable (the number of residents on a block), so we should take advantage of this to improve the SRS-based estimates of \(\tau\) and \(\mu_1\).
Ratio Estimator with Cluster Sampling: In cluster sampling, when the primary unit total \(y_i\) is correlated with the cluster size \(M_i\) (such that we expect \(y_i = 0\) when \(M_i = 0\)), a ratio estimator for \(\tau\) & \(\mu_1\) may be employed, where here the sample ratio is:
\[
r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i} = \frac{\text{Sample total income}}{\text{Sample total \# of residents}},
\]
so that the ratio estimator of the population total is \(\hat{\tau}_r = M r\).

• As with earlier ratio estimators, this estimator is biased. The bias is typically negligible, and so we look at the estimated variance as before.

• The approximate variance (Delta method) of \(\hat{\tau}_r\) and the corresponding estimated variance are given by:
\[
\mathrm{Var}(\hat{\tau}_r) \approx N(N-n)\cdot\frac{\sigma_r^2}{n} = \frac{N(N-n)}{n(N-1)}\sum_{i=1}^{N}(y_i - R x_i)^2 = \frac{N(N-n)}{n(N-1)}\sum_{i=1}^{N}(y_i - M_i\mu)^2,
\]
\[
\widehat{\mathrm{Var}}(\hat{\tau}_r) = \frac{N(N-n)}{n(n-1)}\sum_{i=1}^{n}(y_i - r M_i)^2.
\]

Back to the Example: In the income example, the sample ratio is:
\[
r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i} = \frac{\text{sample total income}}{\text{sample \# residents}} = \frac{2658000}{151} = \$17{,}603 \text{ per resident.}
\]

• This is a biased (ratio) estimate of \(\mu\), the true average income per resident (secondary unit), with standard error:
\[
SE(r) = \sqrt{\left(\frac{N-n}{N}\right)\frac{s_r^2}{n\,\mu_x^2}}, \quad\text{where: } s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - r x_i)^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - r M_i)^2.
\]

• Plugging in the true mean number of residents per block, \(\mu_x = M/N = 2500/415 \approx 6.02\), gives SE(r) = $1621.

• To estimate the total income: \(\hat{\tau}_r = M r = (2500)(17603)\) = $44006.6K, with standard error \(SE(\hat{\tau}_r) = M \cdot SE(r) = (2500)(1621)\) = $4053.5K.

• Note that this standard error ($4053.5K) is larger than that found with an SRS earlier ($3505.6K). Why do you think this happened?

[Figure: "Income vs. # Residents", plotting Income (\(y_i\)) against # of Residents (\(M_i\)).]

• If we know M (= 2500 here), we can estimate \(\mu\) (average income per resident) by:
\[
\hat{\mu} = \frac{\hat{\tau}}{M} = \frac{N}{M}\bar{y} = \frac{44122800}{2500} = \$17649 \quad (\text{close to } r = \$17603),
\]
\[
SE(\hat{\mu}) = SE\!\left(\frac{\hat{\tau}}{M}\right) = \frac{1}{M}SE(\hat{\tau}) = \frac{1}{2500}(3505584) = \$1402 \quad (\text{smaller than } SE(r) = \$1621).
\]

• The upshot of all this is that for cluster sampling, we have two basic options:
1. Work only with the primary units (clusters).
2. Use ratio estimation to make use of the relationship between primary and secondary units (if such a relationship exists).

• In either case, the variation in the estimators depends on the sample size only through n (the number of clusters sampled). So we should use the method which gives the smaller estimated variance in a particular problem.

Systematic Sampling: Recall that systematic sampling is the special case of cluster sampling where each cluster is determined through some random starting point, and a single cluster is chosen.

Example: Suppose we are sampling days where we can afford to sample every fifth day. This gives the following five clusters:

Cluster 1: days 1, 6, 11, 16, ...
Cluster 2: days 2, 7, 12, 17, ...
Cluster 3: days 3, 8, 13, 18, ...
Cluster 4: days 4, 9, 14, 19, ...
Cluster 5: days 5, 10, 15, 20, ...

• Every sampling unit is in exactly one cluster.

• In systematic sampling, experimenters generally select ONE cluster from this group. This gives a sample size of 1, which prohibits any type of inference. Why?

• There are two basic outlooks one takes toward this “problem”:
1. Take more than 1 systematic sample (replication).
2. Assume the one systematic sample taken is representative of the population (i.e., assume it is an SRS).

Review: To this point, we have considered two ways to estimate the population total (or mean) in a cluster sample. These two ways are reviewed below.

1. SRS of Clusters: \(\hat{\tau} = N\bar{y}\), where \(\bar{y}\) is the sample mean of the cluster totals, and
\[
\mathrm{Var}(\hat{\tau}) = N(N-n)\frac{\sigma_u^2}{n}, \quad\text{where } \sigma_u^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \mu_1)^2
\]
is the population variance of the cluster totals, and \(\mu_1\) is the average cluster total.

2. Ratio Estimation: \(\hat{\tau}_r = M r\), where M = the total # of secondary units in the population, r = the average response per secondary unit in the sample, and
\[
\mathrm{Var}(\hat{\tau}_r) = \frac{N(N-n)}{n}\sigma_r^2, \quad\text{where } \sigma_r^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - M_i\mu)^2
\]
is the population variance of the cluster totals about their size-adjusted values \(M_i\mu\), and \(\mu\) is the average response per secondary unit.
• Under these two cluster sampling estimation methods: \(\sigma_u^2\) = the variability between clusters, and \(\sigma_r^2\) = the variability between clusters, accounting for cluster size.

PPS Sampling: Suppose in cluster sampling that the primary units are drawn with replacement, where the selection probabilities are proportional to the sizes of the primary units (i.e., larger clusters are more likely to be selected than smaller clusters). Recall that in the situation of PPS sampling with replacement, an unbiased estimator of the population total is given by the Hansen-Hurwitz estimator:
\[
\hat{\tau}_p = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i} = \frac{M}{n}\sum_{i=1}^{n}\frac{y_i}{M_i}, \quad\text{where: } p_i = \frac{M_i}{M},
\]
with variance given by:
\[
\begin{aligned}
\mathrm{Var}(\hat{\tau}_p) &= \frac{1}{n}\sum_{i=1}^{N} p_i\left(\frac{y_i}{p_i}-\tau\right)^2 = \frac{1}{n}\sum_{i=1}^{N}\frac{M_i}{M}\left(\frac{y_i}{M_i/M}-\tau\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{N}\frac{M_i}{M}\left(M\left(\frac{y_i}{M_i}-\mu\right)\right)^2 = \frac{M}{n}\sum_{i=1}^{N} M_i(\bar{y}_i-\mu)^2 \\
&= \frac{M}{n}\sum_{i=1}^{N} M_i\left(\frac{y_i}{M_i}-\frac{M_i\mu}{M_i}\right)^2 = \frac{M}{n}\sum_{i=1}^{N}\frac{1}{M_i}(y_i-M_i\mu)^2,
\end{aligned}
\]
where \(\bar{y}_i = y_i/M_i\) is the average income per resident in cluster i.

• The squared difference \((y_i - M_i\mu)^2\) in this final piece represents the variation in the cluster totals from the average per resident times the number of residents. Hence, the greater the differences between the average incomes for the clusters, the larger this variance will be. Does this make sense for cluster sampling?

• An unbiased estimator of \(\mathrm{Var}(\hat{\tau}_p)\) given above is:
\[
\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{M^2}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i - \hat{\mu}_p)^2, \quad\text{where: } \hat{\mu}_p = \hat{\tau}_p/M.
\]

• This Hansen-Hurwitz estimator will be good if there is a lot of variability in the cluster sizes. Otherwise, one of the two cluster-based estimators of the population total (presented earlier) should be used.

• The major advantage of the Hansen-Hurwitz estimator is its unbiasedness and low variability when the cluster sizes are very different.

• A Horvitz-Thompson estimator based on the selection probabilities \(p_i = M_i/M\) can also be developed, as indicated on page 134 of the book.
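The Hansen-Hurwitz formulas above can be checked numerically with a short R sketch. Everything in it is an assumption made for illustration: the six-cluster population (Mi_pop, y_pop), the sample size n = 3, and the seed are invented and are not part of the income example.

```r
# Hypothetical population of N = 6 clusters (values invented for illustration).
Mi_pop <- c(2, 4, 5, 8, 10, 11)          # cluster sizes M_i
y_pop  <- c(30, 70, 80, 150, 170, 200)   # cluster totals y_i
M   <- sum(Mi_pop)
tau <- sum(y_pop)
mu  <- tau / M
p   <- Mi_pop / M                        # PPS selection probabilities p_i
n   <- 3

# First and last forms of Var(tau_hat_p) in the derivation; they should agree.
var_form1 <- (1 / n) * sum(p * (y_pop / p - tau)^2)
var_form2 <- (M / n) * sum((y_pop - Mi_pop * mu)^2 / Mi_pop)

# Draw n clusters with replacement, with probability proportional to size.
set.seed(2)   # arbitrary seed, used only so the sketch is reproducible
s <- sample(length(Mi_pop), n, replace = TRUE, prob = p)

ybar_i  <- y_pop[s] / Mi_pop[s]          # per-unit mean in each drawn cluster
tau_hat <- M * mean(ybar_i)              # Hansen-Hurwitz estimate of tau
mu_hat  <- tau_hat / M
var_hat <- M^2 / (n * (n - 1)) * sum((ybar_i - mu_hat)^2)

print(c(tau = tau, tau_hat = tau_hat,
        se_hat = sqrt(var_hat), true_se = sqrt(var_form1)))
```

The estimate from any single draw will differ from \(\tau\); the agreement of var_form1 and var_form2 is the part of the sketch that mirrors the algebra in the derivation.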
Basic Principle of Cluster and Systematic Sampling: Because a census is taken on all secondary units within a primary unit, the within-primary-unit variance plays no role in the variances of estimators of population means or totals. This explains why the ideal case for cluster or systematic sampling is when there is large variability within primary units relative to the variability between primary units: this large within-primary-unit variability has no effect on the variance of the estimators!

• Hence, we want the primary units to be “mini-populations”; that is, we want them to be representative of the population as a whole. This will minimize differences between primary units while maximizing differences within.

Example (Systematic Sampling): Consider sampling via line transects from an irregularly-shaped area. Interest is in estimating the proportion of the area which has some attribute (say: bare ground).

• Suppose we take a systematic random sample of 10 such transects, where the initial starting point is randomly chosen, and the resulting 10 transects are evenly spaced across the region.

• Because the area is irregular in shape, the transects will be of different lengths.

• Although we really have a cluster sample of size 1, we will assume that the 10 transects are representative of the area and treat them as an SRS of size 10.

• So, the sampling unit here is a transect. What is the population?

How might we estimate the proportion of the area (using these 10 transects) with the desired attribute (bare ground)? Three ways are considered:

1. Simple Average of Transect Proportions: We could measure the proportion of each transect with the attribute, and average the 10 resulting proportions. Problem?

2. Ratio Estimator: Suppose we let:
\(y_i\) = the length of transect i with the attribute,
\(x_i\) = the length of transect i.
Then a ratio estimate of the proportion of the area with the attribute is:
\[
r = \frac{\sum_{i=1}^{10} y_i}{\sum_{i=1}^{10} x_i}, \quad\text{with variance } \mathrm{Var}(r) = \underbrace{\left(\frac{N-n}{N}\right)}_{=\,1 \text{ here}} \frac{1}{\mu_x^2}\,\frac{s_r^2}{n}, \quad\text{where: } s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - r x_i)^2 .
\]

• As a side note, we might estimate \(\mu_x\) with \(\bar{x}\) (the average length of the 10 transects), but we could actually determine \(\mu_x\) exactly using integral calculus!

• This is a perfectly valid way of estimating the proportion of the area with the attribute, although, as with all ratio estimators, there is some bias in the estimator.

3. Unbiased Way?: Consider only the lengths (\(y_i\)'s), and not the proportions.

• We can determine the true mean length of a transect \(\mu_x\) via calculus (since we know the total acreage and length across the base of the area). If we then take:
\[
\frac{\bar{y}}{\mu_x} = \frac{\text{average length of the attribute among the transects}}{\text{true mean length of a transect}},
\]
this estimator will be unbiased, as it is no longer a ratio estimator: the denominator is a known constant rather than a sample quantity.

• If there are large differences in transect lengths, the ratio estimation method is better because it accounts for cases of many short or long transects. The variance of the ratio estimator will be smaller if there are large differences in the (\(y_i/x_i\))'s (the proportions with the attribute) in the transects.
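The three approaches to the transect problem can be compared side by side in R. This is a sketch on invented numbers: the transect lengths x, the bare-ground lengths y, and the "known" mean transect length mu_x = 165 are all hypothetical, and the standard errors treat the 10 transects as an SRS, as assumed in the example.

```r
# Hypothetical measurements on n = 10 transects (invented for illustration).
x <- c(120, 150, 180, 200, 210, 205, 190, 160, 130, 90)  # transect lengths
y <- c( 30,  45,  40,  70,  60,  75,  50,  40,  35, 20)  # length with bare ground
n <- length(x)
mu_x <- 165   # assumed known true mean transect length (hypothetical value)

# 1. Simple average of transect proportions (weights every transect equally).
p_avg <- mean(y / x)

# 2. Ratio estimator, with the finite population correction taken as 1.
r    <- sum(y) / sum(x)
sr2  <- sum((y - r * x)^2) / (n - 1)
se_r <- sqrt(sr2 / (n * mu_x^2))

# 3. Mean attribute length divided by the known mean transect length.
p_known  <- mean(y) / mu_x
se_known <- sqrt(var(y) / n) / mu_x

print(round(c(average = p_avg, ratio = r, known_mu_x = p_known), 3))
print(round(c(se_ratio = se_r, se_known_mu_x = se_known), 4))
```

Method 1 gives a short transect the same weight as a long one, which is one way to read the "Problem?" prompt; methods 2 and 3 differ only in whether the denominator is the sample quantity \(\bar{x}\) or the known constant \(\mu_x\).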
R Code for Cluster Sampling Income Example

> N <- 415; n <- 25; fpc <- (N-n)/N; M <- 2500
> Mi <- c(8,12,4,5,6,6,7,5,8,3,2,6,5,10,9,3,6,5,5,4,6,8,7,3,8)
> y <- 1000*c(192,242,84,130,104,80,150,130,90,100,170,86,
              108,98,106,100,64,44,90,74,102,60,78,94,82)

# SRS of Clusters
# ===============
> ybar <- mean(y)
> ybar
[1] 106320                      # Average income per block
> su2 <- var(y)
> se.ybar <- sqrt(fpc*su2/n)
> se.ybar
[1] 8447.19                     # SE of average income per block
> tauhat <- N*ybar
> tauhat
[1] 44122800                    # Estimated total income in city
> se.tauhat <- N*se.ybar
> se.tauhat
[1] 3505584                     # SE of estimated total income

# Ratio Estimator with Cluster Sampling
# =====================================
> r <- sum(y)/sum(Mi)
> r
[1] 17602.65                    # Sample ratio = average income per resident
> mux <- M/N
> sr2 <- (1/(n-1))*sum((y-r*Mi)^2)
> se.r <- sqrt(fpc*sr2/(n*mux^2))
> se.r                          # SE of average income per resident
[1] 1621.409                    # (assuming M = 2500 is KNOWN)
> xbar <- mean(Mi)
> se.r2 <- sqrt(fpc*sr2/(n*xbar^2))
> se.r2                         # SE of average income per resident
[1] 1617.14                     # (assuming M is UNKNOWN)
> tauhat <- M*r
> tauhat
[1] 44006623                    # Estimated total income in city
> se.tauhat <- M*se.r
> se.tauhat
[1] 4053522                     # SE of estimated total income
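One way to explore why the ratio-based standard error of the total came out larger than the SRS-based one is to compare the two residual variances directly. The sketch below reuses the example data; the names se_srs, se_ratio, and cor_y_Mi are introduced here for illustration and are not part of the original code. It is a diagnostic aid under the example's assumptions (M = 2500 known), not a full answer to the question.

```r
# Data from the income example.
N <- 415; n <- 25; M <- 2500
Mi <- c(8,12,4,5,6,6,7,5,8,3,2,6,5,10,9,3,6,5,5,4,6,8,7,3,8)
y  <- 1000 * c(192,242,84,130,104,80,150,130,90,100,170,86,108,
               98,106,100,64,44,90,74,102,60,78,94,82)
fpc <- (N - n) / N

su2 <- var(y)                          # spread of cluster totals about ybar
r   <- sum(y) / sum(Mi)
sr2 <- sum((y - r * Mi)^2) / (n - 1)   # spread about the ratio line y = r * Mi

# With mu_x = M/N, M * SE(r) simplifies to N * sqrt(fpc * sr2 / n), so the
# two standard errors of the total differ only through su2 versus sr2.
se_srs   <- N * sqrt(fpc * su2 / n)
se_ratio <- M * sqrt(fpc * sr2 / (n * (M / N)^2))

cor_y_Mi <- cor(y, Mi)                 # strength of the linear relationship
print(c(su2 = su2, sr2 = sr2, se_srs = se_srs,
        se_ratio = se_ratio, cor = cor_y_Mi))
```

If the points do not scatter tightly around a line through the origin, the residuals about \(r M_i\) can be larger than the deviations about \(\bar{y}\), and the ratio estimator then loses to the plain SRS estimator.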
