Statistics 475 Notes 13 Reading: Lohr, Chapters 5.3-5.4 I. Two stage cluster sampling In one-stage cluster sampling, we examine all secondary sampling units (ssu’s) within the selected clusters (primary sampling units). In many situations, though, the units in a cluster may be so similar that examining all units within a cluster wastes resources; alternatively, it may be expensive to measure units in a cluster relative to the cost of sampling clusters. In these situations, taking a subsample within each cluster may be cheaper and more efficient. In two-stage cluster sampling, we 1. Select a simple random sample (SRS) S of n clusters from a population of N clusters. 2. Select a SRS of ssu’s from each selected cluster. The SRS of mi from the ith cluster is denoted by Si . Examples: (1) The Gallup poll for presidential elections samples approximately 300 election districts from around the U.S. At the second stage, this poll randomly selects approximately five households per district. (2) Sampling for quality control purposes often involves two stages of sampling. For example, when an investigator samples packaged products, such as frozen foods, he or she commonly samples 1 cartons and then samples packages from within cartons. When sampling requires the detailed investigation of components of products, such as measuring plate thickness in automobile batteries, a quite natural procedure is to sample some of the products (batteries) and then sample components (plates) within these products. Data example: A garment manufacturer has 90 plants located throughout the U.S. and wants to estimate the average number of hours that the sewing machines were down for repairs in the past three months. Because the plants are widely scattered, the manufacturer’s statistician decides to use cluster sampling, specifying each plant as a cluster of machines. Each plant contains many machines, and checking the repair record for each machine would be time consuming. Therefore, she uses two stage sampling. Enough time and money are available to sample n 10 plants and approximately 20% of the machines in each plant. The manufacturer knows she has a combined total of 4500 machines in all plants. Plant Mi mi 1 50 10 2 65 13 3 45 9 4 48 10 Downtime (hours) 5,7,9,0,11, 2,8,4,3,5 4,3,7,2,11,0, 1,,9,4,3,2,1,5 5,6,4,11,12, 0,1,8,4 6,4,0,1,0,9, 2 yi si2 5.40 11.38 4.00 10.67 5.67 16.75 4.80 13.29 5 52 10 6 58 12 7 42 8 8 66 13 9 40 8 10 56 11 8,4,6,10 11,4,3,1,0,2, 8,6,5,3 12,11,3,4,2, 0,0,1,4,3,2,4 3,7,6,7,8, 4,3,2 3,6,4,3,2,2, 8,4,0,4,5,6,3 6,4,7,3, 9,4,1,5 6,7,5,10,11, 2,1,4,0,5,4 4.30 11.12 3.83 14.88 5.00 5.14 3.85 4.31 4.88 6.13 5.00 11.80 Unbiased estimates of the population total and population mean: In one-stage cluster sampling, we could estimate the N population total by n ti ; the cluster totals ti were iS known because we sampled every secondary sampling unit in the cluster. In two-stage cluster sampling, however, since we do not observe every ssu in the sampled cluster, we need to estimate individual cluster totals by M tˆi i yij M i yi jSi mi and an unbiased estimator of the population total is 3 N tˆunb n N ˆ t i n iS M y . iS i i In two-stage sampling, the tˆi ’s are random variables. Consequently, the variance of tˆ has two components: (1) the variability between clusters; and (2) the variability of the ssu’s within clusters. We do not have to worry about component (2) in one-stage cluster sampling. The variance of tˆunb equals the variance of tˆunb from onestage cluster sampling plus an extra term because the tˆi ’s estimate the cluster the totals. For two-stage cluster sampling, 2 N mi 2 Si2 S n N 2 t Var (tˆunb ) N 1 1 Mi mi (1.1) N n n i 1 M i 2 where S t is the population variance of the cluster totals and S i2 is the population variance among elements within cluster i. The first term in (1.1) is the variance from onestage cluster sampling and the second term is the additional variance due to subsampling. 2 2 To estimate Var (tˆunb ) , we substitute estimates st and si 2 2 for S t and S i respectively. 2 ˆ tˆunb ti N , st2 iS n 1 si2 yij yi jSi mi 1 4 2 2 2 si2 s m n N t i ˆ (tˆunb ) N 1 1 Var Mi N n n M mi iS i The standard error SE (tˆunb ) is of course the square root of ˆ (tˆunb ) . Var 2 If we know the number of units in the population K , we can estimate the population mean by tˆ yˆunb unb K with standard error SE (tˆunb ) ˆ SE ( yunb ) K . R code: Mi=c(50,65,45,48,52,58,42,66,40,56); mi=c(10,13,9,10,10,12,8,13,8,11); yibar=c(5.4,4,5.67,4.8,4.3,3.83,5,3.85,4.88,5); sisq=c(11.38,10.67,16.75,13.29,11.12,14.88,5.14,4.31,6.13,11.8); N=90; n=10; K=4500; that.i=Mi*yibar; that.unb=(N/n)*sum(Mi*yibar); stsq=sum((that.i-that.unb/N)^2/(n-1)); varhat.that.unb=N^2*(1-n/N)*stsq/n+(N/n)*sum((1mi/Mi)*Mi^2*(sisq/mi)); se.that.unb=sqrt(varhat.that.unb); yhat.unb=that.unb/K; se.yhat.unb=se.that.unb/K; yhat.unb [1] 4.80118 se.yhat.unb 5 [1] 0.192594 Thus, we estimate that the average sewing machine was down for repair for 4.8 hours in the past three months and a 95% confidence interval for this average is approximately 4.8 1.96*0.19 (4.43, 5.17) . Ratio Estimation: As with one-stage cluster sampling, we can also use a ratio estimator for estimating the population mean where the auxiliary variables are the cluster sizes. N 1 N ti ti N i 1 y Ni 1 N 1 M Mi i N i 1 i 1 The ratio estimates of the population mean and population total are: tˆi yˆ r iS Mi iS tˆr Kyˆ r The difference between one-stage and two-stage cluster sampling is that we need to estimate the cluster totals ti in two-stage cluster sampling. Using a Taylor series approximation, the variance of the ratio estimator is estimated by 6 1 ˆ ( yˆ r ) 2 Var M n sr2 1 1 N n nN where s 2 r iS M i yi M i yˆ r mi M 1 iS Mi 2 i si2 mi 2 n 1 and M is the average cluster size – either the population average or sample average can be used in the estimate of the variance. # Ratio estimate ybarhat.r=sum(that.i)/sum(Mi); Mbar=mean(Mi); srsq=sum((Mi*yibar-Mi*ybarhat.r)^2)/(n-1); varhat.ybarhat.r=(1/Mbar^2)*((1-n/N)*srsq/n+(1/(n*N))*sum(Mi^2*(1mi/Mi)*sisq/mi)); se.ybarhat.r=sqrt(varhat.ybarhat.r); ybarhat.r [1] 4.598831 se.ybarhat.r [1] 0.2220061 In this example, the ratio estimator has a higher standard error than the unbiased estimator. However, if the population size K were unknown, then we would have to use the ratio estimator. Furthermore, the ratio estimator will be better than the unbiased estimator when the cluster sizes vary considerably and the cluster totals ti are roughly proportional to the cluster sizes M i . 7 When the cluster sizes are all the same, the ratio estimator and the unbiased estimator are the same. II. Sampling weights for cluster samples Recall from Chapter 4.3 that the sampling weight for an observation in a sample is the inverse of the probability that an observation would be selected in a sample and is the number of units in the population that the observation represents. For cluster sampling, P( jth ssu in ith cluster is selected) P(ith cluster selected) P( jth ssu selected|ith cluster selected) n mi N Mi Thus, the sampling weight for the jth unit in the ith cluster is NM i wij nmi We have N tˆunb tˆi wij yij and n iS iS jSi tˆi tˆunb yˆ r iS M i wij iS iS jSi In two stage cluster sampling, for each ssu in the sample to represent the same number of ssu’s in the population, mi 8 needs to be proportional to M i so mi / M i is constant. This is approximately true in the above example on downtime of sewing machines. III. Analysis of cluster samples using the survey package in R For using the survey package in R to analyze cluster samples, we need to create a variable that specifies which cluster each observation in the sample belongs to, a variable which gives a unique number to each ssu, a variable that gives the sampling weight of each observation and variables that give the overall number of clusters in the population and the cluster size of each cluster in the sample. # Responses downtime=c(5,7,9,0,11,2,8,4,3,5,4,3,7,2,11,0,1,9,4,3,2,1,5,5,6,4,11,12,0,1,8, 4,6,4,0,1,0,9,8,4,6,10,11,4,3,1,0,2,8,6,5,3,12,11,3,4,2,0,0,1,4,3,2,4,3,7,6,7,8, 4,3,2,3,6,4,3,2,2,8,4,0,4,5,6,3,6,4,7,3,9,1,4,5,6,7,5,10,11,2,1,4,0,5,4) # Cluster sizes plant=c(rep(1,10),rep(2,13),rep(3,9),rep(4,10),rep(5,10),rep(6,12),rep(7,8),re p(8,13),rep(9,8),rep(10,11)); # Number of clusters fpc1=rep(90,length(plant)); # Size of cluster for each ssu fpc2=c(rep(50,10),rep(65,13),rep(45,9),rep(48,10),rep(52,10),rep(58,12),rep (42,8),rep(66,13),rep(40,8),rep(56,11)); # Sampling weights wght=(fpc1*fpc2)/(10*c(rep(10,10),rep(13,13),rep(9,9),rep(10,10),rep(10,10 ),rep(12,12),rep(8,8),rep(13,13),rep(8,8),rep(11,11))); # Unique id for each ssu machine.id=seq(1,length(downtime),1); 9 # Data frame that contains the information for the survey design garment.dataframe=data.frame(machine.id,plant,fpc1,fpc2,wght); # Specify the sample design garment.design=svydesign(id=~plant+machine.id,weights=~wght,data=gar ment.dataframe,fpc=~fpc1+fpc2); # Estimate the population mean of downtime using the ratio estimator svymean(downtime,design=garment.design); mean SE [1,] 4.598 0.2219 The survey package automatically uses the ratio estimator for estimating the population mean for cluster samples. The survey packages uses tˆunb for estimating the population total: > svytotal(downtime,design=garment.design) total SE [1,] 21602 866.64 IV. Designing a cluster sample When designing a cluster sample, you need to decide four major issues: (1) What overall precision (standard error) is needed? (2) What size should the clusters be? (3) How many ssu’s should be sampled in each cluster selected for the sample (4) How many clusters should be sampled? Issue (1) is faced in any survey design. For issue (2), there are often natural clusters. We will focus on issues (3)-(4) in the case of equal cluster sizes (in which case yˆunb yˆ r ). Let M be the cluster sizes 10 and assume we will sample the same number m of ssu’s from each cluster. The ANOVA decomposition can be used to write n MSB m MSW Var ( yˆunb ) 1 1 N nM M nm where MSB and MSW are the between and within mean squares from the ANOVA table (see Notes 12; Lohr, pg. 138). Consider the simple cost function total cost C c1n c2 nm where c1 is the fixed cost of sampling each cluster (not including the cost of measuring ssu’s) and c2 is the additional cost of measuring each ssu. Using calculus, one can easily determine that the values C n c1 c2 m m c1M ( MSW ) c2 ( MSB MSW ) minimize Var ( yˆunb ) for fixed total cost C. Example: An inspector samples cans from a truckload of canned creamed corn to estimate the average number of worm fragments per can. The truck has 580 cases; each case contains 24 cans. The inspector samples 12 cases at random and subsamples 3 cans randomly from each selected case. 11 case=rep(seq(1,12,1),each=3); can=seq(1,36,1); frag=c(1,5,7,4,2,4,0,1,2,3,6,6,4,9,8,0,7,3,5,5,1,3,0,2,7,3,5,3,1,4,4,7,9,0,0,0); (a) Find an approximate 95% confidence interval for the average number of worm fragments per can # Cluster sizes clustersize=rep(24,36); noclusters=rep(580,36); # Sampling weights N=580; n=12; mi=rep(3,36); sampwght=(N*clustersize)/(n*mi); corn.design.dataframe=data.frame(case,can,clustersize,noclusters,sampwght) ; corn.design=svydesign(id=~case+can,weights=sampwght,data=corn.design. dataframe,fpc=~noclusters+clustersize); svymean(frag,design=corn.design); mean SE [1,] 3.6389 0.6102 Thus, an approximate 95% confidence interval for the average number of worm fragments per can is 3.64 1.96*.61 (2.44, 4.84) (b) Suppose a new truckload is to be inspected and it is thought to be similar to this one. It takes 20 minutes to locate and open a case, and 8 minutse to locate and examine each specified can within a case. How many cans should be examined per case? How many cases? Assume your budget is 120 minutes. 12 To find the MSB and MSW, we can use the one-way analysis of variance. > aov.frag=aov(frag~as.factor(case)); # as.factor(case) makes case into a categorical variable with categories 1,...,12 > summary(aov.frag) Df Sum Sq Mean Sq F value Pr(>F) as.factor(case) 11 149.639 13.604 3.0045 0.01172 * Residuals 24 108.667 4.528 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Thus, we estimate MSB=13.60 and MSW=4.53. Then, c1M ( MSW ) 20*24*4.53 5.47 c2 ( MSB MSW ) 8*(13.6 4.53) Round up to 6 cans per case. m Then, C 120 1.76 c1 c2 m 20 8*6 Round up and sample 2 cases. The total cost would be 2*(20+8*6)=136, which is overbudget. We can use 5 cans per case and remain within budget, 2*(20+8*5)=120. n 13