Example of Two-Stage Cluster Sampling

A garment manufacturer has N = 90 plants located throughout the United States and wants to estimate the
average number of hours that the sewing machines were down for repairs in the past months. Because the
plants are widely scattered, she decides to use cluster sampling, specifying each plant as a cluster of
machines. Each plant contains many machines, and checking the repair record for each machine would be
time-consuming. Therefore she uses two-stage cluster sampling. Enough time and money are available to
sample n = 10 plants and approximately 20% of the machines in each plant. The resulting data are given in
the table below.
Plant   M_i   m_i   Downtime (in hours)                        Mean (ybar_i)   s_i^2
1       50    10    5, 7, 9, 0, 11, 2, 8, 4, 3, 5              5.40            11.38
2       65    13    4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5     4.00            10.67
3       45    9     5, 6, 4, 11, 12, 0, 1, 8, 4                5.67            16.75
4       48    10    6, 4, 0, 1, 0, 9, 8, 4, 6, 10              4.80            13.29
5       52    10    11, 4, 3, 1, 0, 2, 8, 6, 5, 3              4.30            11.12
6       58    12    12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4       3.83            14.88
7       42    8     3, 7, 6, 7, 8, 4, 3, 2                     5.00            5.14
8       66    13    3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3      3.85            4.31
9       40    8     6, 4, 7, 3, 9, 1, 4, 5                     4.88            6.13
10      56    11    6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4          5.00            11.80
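The per-plant summaries in the table can be checked directly from the raw downtimes. The following is a minimal sketch using only the Python standard library; the names `downtime` and `summary` are illustrative and not part of the original example.

```python
from statistics import mean, variance

# Sampled downtimes (hours) per plant, copied from the data table.
downtime = {
    1: [5, 7, 9, 0, 11, 2, 8, 4, 3, 5],
    2: [4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5],
    3: [5, 6, 4, 11, 12, 0, 1, 8, 4],
    4: [6, 4, 0, 1, 0, 9, 8, 4, 6, 10],
    5: [11, 4, 3, 1, 0, 2, 8, 6, 5, 3],
    6: [12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4],
    7: [3, 7, 6, 7, 8, 4, 3, 2],
    8: [3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3],
    9: [6, 4, 7, 3, 9, 1, 4, 5],
    10: [6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4],
}

# For each plant: (m_i, sample mean, sample variance with divisor m_i - 1).
summary = {p: (len(y), mean(y), variance(y)) for p, y in downtime.items()}

for p, (m_i, ybar_i, s2_i) in summary.items():
    print(f"plant {p:2d}: m_i={m_i:2d}  mean={ybar_i:.4f}  s_i^2={s2_i:.4f}")
```

The printed values agree with the table once rounded to two decimals; the later calculations in this example use the rounded table values.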
We want to estimate the average downtime per machine, and we know that the total number of machines in
all plants is K = 4500.
The ANOVA table is given below:
Source   Df    Sum of Squares   Mean Square   F      p
Model    9     40.704487        4.522721      0.43   0.9177
Error    94    996.333974       10.599298
Total    103   1037.038462      10.068335
We have $R_a^2 = 1 - \dfrac{\text{MSW}}{S^2} = 1 - \dfrac{10.599298}{10.068335} = -0.05274$, indicating that each cluster is relatively heterogeneous; thus cluster sampling is at least as efficient as simple random sampling.
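As a check on the ANOVA quantities, the within-plant and total sums of squares can be recomputed from the raw downtimes, and the adjusted R-squared follows as 1 - MSW/S^2. This is a sketch only; the variable names are illustrative.

```python
# Sampled downtimes (hours) for the 10 plants, copied from the data table.
samples = [
    [5, 7, 9, 0, 11, 2, 8, 4, 3, 5],
    [4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5],
    [5, 6, 4, 11, 12, 0, 1, 8, 4],
    [6, 4, 0, 1, 0, 9, 8, 4, 6, 10],
    [11, 4, 3, 1, 0, 2, 8, 6, 5, 3],
    [12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4],
    [3, 7, 6, 7, 8, 4, 3, 2],
    [3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3],
    [6, 4, 7, 3, 9, 1, 4, 5],
    [6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4],
]

all_obs = [y for plant in samples for y in plant]
grand_mean = sum(all_obs) / len(all_obs)

# Within-plant (error) and total sums of squares.
ssw = sum((y - sum(p) / len(p)) ** 2 for p in samples for y in p)
ssto = sum((y - grand_mean) ** 2 for y in all_obs)

df_error = len(all_obs) - len(samples)  # 104 - 10 = 94
df_total = len(all_obs) - 1             # 103

msw = ssw / df_error      # mean square within
s2 = ssto / df_total      # total mean square
r2_adj = 1 - msw / s2     # adjusted R-squared

print(f"SSW={ssw:.6f}  MSW={msw:.6f}  S^2={s2:.6f}  R_a^2={r2_adj:.5f}")
```

A negative adjusted R-squared means the within-plant mean square exceeds the overall mean square, which is the sense in which the plants are internally heterogeneous.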
Unbiased estimation:
An unbiased estimate of the population mean is
$$\hat{\bar{y}}_{\text{unb}} = \frac{N}{nK}\sum_{i=1}^{10} M_i \bar{y}_i = \frac{90}{(10)(4500)}(2400.59) = 4.80118 \text{ hours.}$$
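The arithmetic of the unbiased estimate can be reproduced in a few lines. This sketch uses the rounded plant means from the table, as the worked example does; the variable names are illustrative.

```python
N, n, K = 90, 10, 4500  # plants in population, plants sampled, total machines

M = [50, 65, 45, 48, 52, 58, 42, 66, 40, 56]                           # M_i
ybar = [5.40, 4.00, 5.67, 4.80, 4.30, 3.83, 5.00, 3.85, 4.88, 5.00]    # plant means

# Sum of estimated plant totals M_i * ybar_i.
weighted_sum = sum(M_i * y_i for M_i, y_i in zip(M, ybar))

# Unbiased estimate of mean downtime per machine.
y_unb = N / (n * K) * weighted_sum

print(f"sum of M_i*ybar_i = {weighted_sum:.2f}")
print(f"unbiased estimate = {y_unb:.5f} hours")
```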
The estimated variance is given by
$$\hat{V}(\hat{\bar{y}}_{\text{unb}}) = \left(\frac{N}{K}\right)^2\left(1-\frac{n}{N}\right)\frac{s_t^2}{n} + \frac{N}{nK^2}\sum_{i=1}^{n}\left(1-\frac{m_i}{M_i}\right)M_i^2\,\frac{s_i^2}{m_i}.$$
The two terms in the variance estimate will be calculated separately. We have
$$\frac{N}{nK^2}\sum_{i=1}^{n}\left(1-\frac{m_i}{M_i}\right)M_i^2\,\frac{s_i^2}{m_i} = \frac{90}{(10)(4500)^2}\left[\left(1-\frac{10}{50}\right)(2500)\frac{11.38}{10} + \left(1-\frac{13}{65}\right)(4225)\frac{10.67}{13} + \cdots + \left(1-\frac{11}{56}\right)(3136)\frac{11.80}{11}\right] = 0.0097722267.$$
The mean cluster size is estimated to be $\bar{M}_S = 52.2$, so that
$$s_t^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(M_i\bar{y}_i - \bar{M}_S\,\hat{\bar{y}}_{\text{unb}}\right)^2 = \frac{1}{9}\left[\bigl(50(5.40) - 52.2(4.80118)\bigr)^2 + \cdots + \bigl(56(5.00) - 52.2(4.80118)\bigr)^2\right] = 892.3468481,$$
and
$$\left(\frac{N}{K}\right)^2\left(1-\frac{n}{N}\right)\frac{s_t^2}{n} = \left(\frac{90}{4500}\right)^2\left(1-\frac{10}{90}\right)\frac{892.3468481}{10} = 0.0317278879.$$
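Then the estimated variance is $\hat{V}(\hat{\bar{y}}_{\text{unb}}) = 0.0415001146$, and the standard error is $SE(\hat{\bar{y}}_{\text{unb}}) = 0.2037157692$.
A 95% C.I. estimate of the average downtime is $\hat{\bar{y}}_{\text{unb}} \pm 1.96\,SE(\hat{\bar{y}}_{\text{unb}}) = (4.4019 \text{ hrs.},\ 5.2005 \text{ hrs.})$.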
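The two variance components, the standard error, and the confidence interval can be recomputed as follows. This is a sketch that follows the formulas of this example as stated (including centering the estimated plant totals at the mean cluster size times the unbiased estimate) and uses the rounded table values; variable names are illustrative.

```python
from math import sqrt

N, n, K = 90, 10, 4500
M = [50, 65, 45, 48, 52, 58, 42, 66, 40, 56]                           # M_i
m = [10, 13, 9, 10, 10, 12, 8, 13, 8, 11]                              # m_i
ybar = [5.40, 4.00, 5.67, 4.80, 4.30, 3.83, 5.00, 3.85, 4.88, 5.00]    # plant means
s2 = [11.38, 10.67, 16.75, 13.29, 11.12, 14.88, 5.14, 4.31, 6.13, 11.80]

y_unb = N / (n * K) * sum(Mi * yi for Mi, yi in zip(M, ybar))
M_bar = sum(M) / n  # estimated mean cluster size, 52.2

# Between-plant component.
s_t2 = sum((Mi * yi - M_bar * y_unb) ** 2 for Mi, yi in zip(M, ybar)) / (n - 1)
between = (N / K) ** 2 * (1 - n / N) * s_t2 / n

# Within-plant component.
within = N / (n * K ** 2) * sum(
    (1 - mi / Mi) * Mi ** 2 * si2 / mi for Mi, mi, si2 in zip(M, m, s2)
)

var_unb = between + within
se_unb = sqrt(var_unb)
ci = (y_unb - 1.96 * se_unb, y_unb + 1.96 * se_unb)

print(f"s_t^2={s_t2:.4f}  between={between:.7f}  within={within:.7f}")
print(f"variance={var_unb:.7f}  SE={se_unb:.5f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

The between-plant term accounts for roughly three quarters of the estimated variance here, so sampling more plants would reduce the variance more than sampling more machines per plant.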
 
Ratio estimation:
The ratio estimator of the population mean is
$$\hat{\bar{y}}_r = \frac{\sum_{i=1}^{10} M_i\bar{y}_i}{\sum_{i=1}^{10} M_i} = \frac{2400.59}{522} = 4.598831418 \text{ hours.}$$
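The ratio estimate divides the sum of estimated plant totals by the sum of the sampled plant sizes, so it does not use K at all. A minimal sketch with the rounded table means (variable names are illustrative):

```python
M = [50, 65, 45, 48, 52, 58, 42, 66, 40, 56]                           # M_i
ybar = [5.40, 4.00, 5.67, 4.80, 4.30, 3.83, 5.00, 3.85, 4.88, 5.00]    # plant means

numerator = sum(Mi * yi for Mi, yi in zip(M, ybar))  # sum of M_i * ybar_i
denominator = sum(M)                                 # sum of M_i

y_ratio = numerator / denominator

print(f"{numerator:.2f} / {denominator} = {y_ratio:.9f} hours")
```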
Then the estimated variance of the estimator is
$$\hat{V}(\hat{\bar{y}}_r) = \frac{1}{\bar{M}_S^2}\left[\left(1-\frac{n}{N}\right)\frac{s_r^2}{n} + \frac{1}{nN}\sum_{i=1}^{n} M_i^2\left(1-\frac{m_i}{M_i}\right)\frac{s_i^2}{m_i}\right],$$
where
$$s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n} M_i^2\left(\bar{y}_i - \hat{\bar{y}}_r\right)^2 = \frac{1}{9}\left[50^2(5.40-4.598831418)^2 + \cdots + 56^2(5.00-4.598831418)^2\right] = 1236.013278.$$
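Then the estimated variance is approximately 0.0493, with standard error $SE(\hat{\bar{y}}_r) \approx 0.2220$,
and a 95% C.I. estimate of the average downtime is
$$\hat{\bar{y}}_r \pm 1.96\,SE(\hat{\bar{y}}_r) = 4.598831418 \pm 1.96(0.2220) = (4.1637 \text{ hrs.},\ 5.0340 \text{ hrs.}).$$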
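The ratio-estimator variance can be evaluated numerically from the formula exactly as written, with the factor 1/(nN) on the within-plant sum. Under that reading the sketch below gives a variance of about 0.0493 and a standard error of about 0.2220; replacing 1/(nN) with 1/N on that sum would instead give about 0.1300. Variable names are illustrative, and the rounded table values are used.

```python
from math import sqrt

N, n = 90, 10
M = [50, 65, 45, 48, 52, 58, 42, 66, 40, 56]                           # M_i
m = [10, 13, 9, 10, 10, 12, 8, 13, 8, 11]                              # m_i
ybar = [5.40, 4.00, 5.67, 4.80, 4.30, 3.83, 5.00, 3.85, 4.88, 5.00]    # plant means
s2 = [11.38, 10.67, 16.75, 13.29, 11.12, 14.88, 5.14, 4.31, 6.13, 11.80]

y_ratio = sum(Mi * yi for Mi, yi in zip(M, ybar)) / sum(M)
M_bar = sum(M) / n  # estimated mean cluster size, 52.2

# Between-plant variability about the ratio estimate.
s_r2 = sum(Mi ** 2 * (yi - y_ratio) ** 2 for Mi, yi in zip(M, ybar)) / (n - 1)

# Within-plant sum shared with the unbiased-estimator variance.
within_sum = sum(Mi ** 2 * (1 - mi / Mi) * si2 / mi for Mi, mi, si2 in zip(M, m, s2))

var_ratio = ((1 - n / N) * s_r2 / n + within_sum / (n * N)) / M_bar ** 2
se_ratio = sqrt(var_ratio)
ci = (y_ratio - 1.96 * se_ratio, y_ratio + 1.96 * se_ratio)

print(f"s_r^2={s_r2:.4f}  variance={var_ratio:.6f}  SE={se_ratio:.4f}")
print(f"95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```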
In this case, ratio estimation gives less precision than unbiased estimation. Let's examine the reason. The
scatterplot of cluster totals versus cluster sizes is shown below. The line of best fit to the data clearly does
not pass through the origin. In addition, the regression of cluster totals on cluster sizes shows a rather weak
correlation, 0.5488. There is no strong relationship between cluster totals and cluster sizes.
[Figure: Plot of Cluster Totals v. Cluster Sizes (horizontal axis: Cluster Size; vertical axis: Cluster Total)]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.548818092
R Square            0.301201298
Adjusted R Square   0.213851461
Standard Error      5.01216653
Observations        10
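The regression statistics can be reproduced with a simple least-squares fit. One assumption is needed: the reported values are matched when "cluster total" is taken to be the sum of the sampled downtimes in each plant (not the estimated total M_i times the plant mean). That reading is inferred from the numbers and the plot's axis range, not stated in the example. Variable names are illustrative.

```python
from math import sqrt

# Cluster sizes M_i and, as an assumption, sample sums of downtime per plant.
sizes = [50, 65, 45, 48, 52, 58, 42, 66, 40, 56]
totals = [54, 52, 51, 48, 43, 46, 40, 50, 39, 55]

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(totals) / n

sxx = sum((x - x_bar) ** 2 for x in sizes)
syy = sum((y - y_bar) ** 2 for y in totals)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, totals))

r = sxy / sqrt(sxx * syy)            # correlation ("Multiple R")
slope = sxy / sxx
intercept = y_bar - slope * x_bar    # fitted value at cluster size 0
resid_se = sqrt((syy - slope * sxy) / (n - 2))  # regression standard error

print(f"r={r:.6f}  R^2={r * r:.6f}  standard error={resid_se:.5f}")
print(f"fitted line: total = {intercept:.2f} + {slope:.4f} * size")
```

The fitted intercept is far from zero, which illustrates why a ratio estimator, which implicitly assumes a line through the origin, is not well suited to these data.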