Example of Two-Stage Cluster Sampling

A garment manufacturer has $N = 90$ plants located throughout the United States and wants to estimate the average number of hours that the sewing machines were down for repairs in the past months. Because the plants are widely scattered, she decides to use cluster sampling, specifying each plant as a cluster of machines. Each plant contains many machines, and checking the repair record for each machine would be time-consuming. Therefore she uses two-stage cluster sampling. Enough time and money are available to sample $n = 10$ plants and approximately 20% of the machines in each plant. The resulting data are given in the table below.

Plant   $M_i$   $m_i$   Downtime (in hours)                         $\bar{y}_i$   $s_i^2$
1       50      10      5, 7, 9, 0, 11, 2, 8, 4, 3, 5               5.40          11.38
2       65      13      4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5      4.00          10.67
3       45      9       5, 6, 4, 11, 12, 0, 1, 8, 4                 5.67          16.75
4       48      10      6, 4, 0, 1, 0, 9, 8, 4, 6, 10               4.80          13.29
5       52      10      11, 4, 3, 1, 0, 2, 8, 6, 5, 3               4.30          11.12
6       58      12      12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4        3.83          14.88
7       42      8       3, 7, 6, 7, 8, 4, 3, 2                      5.00          5.14
8       66      13      3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3       3.85          4.31
9       40      8       6, 4, 7, 3, 9, 1, 4, 5                      4.88          6.13
10      56      11      6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4           5.00          11.80

Here $M_i$ is the number of machines in plant $i$, $m_i$ is the number of machines sampled, and $\bar{y}_i$ and $s_i^2$ are the sample mean and sample variance of the downtimes in plant $i$.

We want to estimate the average downtime per machine, and we know that the total number of machines in all plants is $K = 4500$. The ANOVA table is given below:

Source   Df    Sum of Squares   Mean Square   F      p
Model    9     40.704487        4.522721      0.43   0.9177
Error    94    996.333974       10.599298
Total    103   1037.038462      10.068335

We have
\[ R_a^2 = 1 - \frac{MSW}{S^2} = 1 - \frac{10.599298}{10.068335} = -0.05274, \]
indicating that each cluster is relatively heterogeneous; thus cluster sampling is at least as efficient as simple random sampling.

Unbiased estimation:

An unbiased estimate of the population mean is
\[ \hat{\bar{y}}_{unb} = \frac{N}{nK} \sum_{i=1}^{10} M_i \bar{y}_i = \frac{90}{10(4500)} (2400.59) = 4.80118 \text{ hours.} \]
Its variance estimate has two terms, which will be calculated separately.
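As a check on the summary columns, the adjusted $R^2$, and the point estimate, the following Python sketch recomputes them from the raw downtimes in the table. It uses only the standard library; the variable names (`plants`, `ybar`, `s2`, `Ra2`, `y_unb`) are illustrative assumptions rather than anything defined in the example.

```python
from statistics import mean, variance

N, n, K = 90, 10, 4500  # plants in population, plants sampled, machines in population

# (M_i, sampled downtimes in hours) for each sampled plant, copied from the table
plants = [
    (50, [5, 7, 9, 0, 11, 2, 8, 4, 3, 5]),
    (65, [4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5]),
    (45, [5, 6, 4, 11, 12, 0, 1, 8, 4]),
    (48, [6, 4, 0, 1, 0, 9, 8, 4, 6, 10]),
    (52, [11, 4, 3, 1, 0, 2, 8, 6, 5, 3]),
    (58, [12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4]),
    (42, [3, 7, 6, 7, 8, 4, 3, 2]),
    (66, [3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3]),
    (40, [6, 4, 7, 3, 9, 1, 4, 5]),
    (56, [6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4]),
]

M = [size for size, _ in plants]
ybar = [mean(y) for _, y in plants]      # per-plant sample means
s2 = [variance(y) for _, y in plants]    # per-plant sample variances (divisor m_i - 1)

# Adjusted R^2 from the one-way ANOVA: 1 - MSW / S^2
all_y = [v for _, y in plants for v in y]
ssw = sum((len(y) - 1) * variance(y) for _, y in plants)  # within-plant sum of squares
msw = ssw / (len(all_y) - n)                              # error mean square
S2 = variance(all_y)                                      # total mean square
Ra2 = 1 - msw / S2

# Unbiased estimate of the mean downtime per machine
y_unb = N / (n * K) * sum(Mi * yb for Mi, yb in zip(M, ybar))

print(f"R_a^2 = {Ra2:.5f}, y_unb = {y_unb:.5f}")
```

Because this sketch works from unrounded plant means, its point estimate comes out near 4.800 rather than exactly 4.80118; the figure in the text is obtained from the plant means rounded to two decimals.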
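The estimated variance is given by
\[ \hat{V}(\hat{\bar{y}}_{unb}) = \frac{N^2}{K^2}\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n} + \frac{N}{nK^2}\sum_{i=1}^{n}\left(1 - \frac{m_i}{M_i}\right)M_i^2\,\frac{s_i^2}{m_i}. \]

For the second term, we have
\[ \frac{N}{nK^2}\sum_{i=1}^{n}\left(1 - \frac{m_i}{M_i}\right)M_i^2\,\frac{s_i^2}{m_i} = \frac{90}{10(4500)^2}\left[\left(1 - \frac{10}{50}\right)(2500)\frac{11.38}{10} + \left(1 - \frac{13}{65}\right)(4225)\frac{10.67}{13} + \cdots + \left(1 - \frac{11}{56}\right)(3136)\frac{11.80}{11}\right] = 0.0097722267. \]

The mean cluster size is estimated to be $\bar{M}_S = 52.2$, so that
\[ s_t^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(M_i\bar{y}_i - \bar{M}_S\,\hat{\bar{y}}_{unb}\right)^2 = \frac{1}{9}\left[\left(50(5.40) - 52.2(4.80118)\right)^2 + \cdots + \left(56(5.00) - 52.2(4.80118)\right)^2\right] = 892.3468481, \]
and
\[ \frac{N^2}{K^2}\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n} = \left(\frac{90}{4500}\right)^2\left(1 - \frac{10}{90}\right)\frac{892.3468481}{10} = 0.0317278879. \]

Then the estimated variance is $\hat{V}(\hat{\bar{y}}_{unb}) = 0.0415001146$, and the standard error is $SE(\hat{\bar{y}}_{unb}) = 0.2037157692$. A 95% C.I. estimate of the average downtime is
\[ \hat{\bar{y}}_{unb} \pm 1.96\,SE(\hat{\bar{y}}_{unb}) = (4.4019 \text{ hrs.},\ 5.2005 \text{ hrs.}). \]

Ratio estimation:

The ratio estimator of the population mean is
\[ \hat{\bar{y}}_r = \frac{\sum_{i=1}^{10} M_i\bar{y}_i}{\sum_{i=1}^{10} M_i} = \frac{2400.59}{522} = 4.598831418 \text{ hours.} \]

Then the estimated variance of the estimator is
\[ \hat{V}(\hat{\bar{y}}_r) = \frac{1}{\bar{M}_S^2}\left[\left(1 - \frac{n}{N}\right)\frac{s_r^2}{n} + \frac{1}{nN}\sum_{i=1}^{n}M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{s_i^2}{m_i}\right], \]
where
\[ s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}M_i^2\left(\bar{y}_i - \hat{\bar{y}}_r\right)^2 = \frac{1}{9}\left[50^2(5.40 - 4.598831418)^2 + \cdots + 56^2(5.00 - 4.598831418)^2\right] = 1236.013278. \]

Then the estimated variance is 0.1299795637, and a 95% C.I. estimate of the average downtime is
\[ \hat{\bar{y}}_r \pm 1.96\,SE(\hat{\bar{y}}_r) = 4.598831418 \pm 1.96(0.3605267864) = (3.8922 \text{ hrs.},\ 5.3055 \text{ hrs.}). \]

In this case, ratio estimation gives less precision than unbiased estimation. Let's examine the reason. The scatterplot of cluster totals v. cluster sizes is shown below. It is clear that the line of best fit to the data does not pass through the origin. In addition, the regression of cluster totals v. cluster sizes shows a rather weak correlation, 0.5488. There is obviously not a strong relationship between cluster totals and cluster sizes.

[Figure: Plot of Cluster Totals v. Cluster Sizes (horizontal axis: Cluster Size; vertical axis: Cluster Total)]

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.548818092
R Square             0.301201298
Adjusted R Square    0.213851461
Standard Error       5.01216653
Observations         10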
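The variance calculations for both estimators can be retraced with a short Python sketch that starts from the rounded summary values in the table (plant sizes, sample sizes, means, and variances). All variable names are illustrative assumptions. The sketch follows the formulas as stated in the text, including centering $s_t^2$ at $\bar{M}_S\,\hat{\bar{y}}_{unb}$. One caveat: evaluating the ratio-variance formula with the factor $1/(nN)$ on the within-plant term gives roughly 0.0493, whereas the reported 0.1299795637 is reproduced when that term is divided by $N$ alone, so both versions are computed here for comparison.

```python
from math import sqrt

N, n, K = 90, 10, 4500  # plants in population, plants sampled, machines in population

# Summary columns as printed in the table (means and variances are rounded)
M = [50, 65, 45, 48, 52, 58, 42, 66, 40, 56]
m = [10, 13, 9, 10, 10, 12, 8, 13, 8, 11]
ybar = [5.40, 4.00, 5.67, 4.80, 4.30, 3.83, 5.00, 3.85, 4.88, 5.00]
s2 = [11.38, 10.67, 16.75, 13.29, 11.12, 14.88, 5.14, 4.31, 6.13, 11.80]

totals = [Mi * yi for Mi, yi in zip(M, ybar)]  # estimated plant totals M_i * ybar_i
M_bar = sum(M) / n                             # estimated mean cluster size (52.2)

# Within-plant sum shared by both variance formulas
within = sum((1 - mi / Mi) * Mi**2 * si2 / mi for Mi, mi, si2 in zip(M, m, s2))

# Unbiased estimator, with s_t^2 centered as written in the text
y_unb = N / (n * K) * sum(totals)
st2 = sum((t - M_bar * y_unb) ** 2 for t in totals) / (n - 1)
v_unb = (N / K) ** 2 * (1 - n / N) * st2 / n + N / (n * K**2) * within
se_unb = sqrt(v_unb)

# Ratio estimator
y_r = sum(totals) / sum(M)
sr2 = sum(Mi**2 * (yi - y_r) ** 2 for Mi, yi in zip(M, ybar)) / (n - 1)
between_r = (1 - n / N) * sr2 / n
v_r_formula = (between_r + within / (n * N)) / M_bar**2  # formula as written, 1/(nN)
v_r_reported = (between_r + within / N) / M_bar**2       # matches the reported value

print(f"unbiased: {y_unb:.5f}, variance {v_unb:.7f}, SE {se_unb:.7f}")
print(f"ratio:    {y_r:.5f}, variance {v_r_formula:.7f} (1/(nN)) or {v_r_reported:.7f} (1/N)")
```

Under either version of the within-plant factor, the ratio estimator's estimated variance exceeds that of the unbiased estimator for these data, which is consistent with the comparison drawn in the text.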