Notes 13 - Wharton Statistics Department

Statistics 475 Notes 13
Reading: Lohr, Chapters 5.3-5.4
I. Two stage cluster sampling
In one-stage cluster sampling, we examine all secondary
sampling units (ssu’s) within the selected clusters (primary
sampling units). In many situations, though, the units in a
cluster may be so similar that examining all units within a
cluster wastes resources; alternatively, it may be expensive
to measure units in a cluster relative to the cost of sampling
clusters. In these situations, taking a subsample within
each cluster may be cheaper and more efficient.
In two-stage cluster sampling, we
1. Select a simple random sample (SRS) S of n clusters
from a population of N clusters.
2. Select an SRS of ssu's from each selected cluster. The
SRS of $m_i$ ssu's taken from the ith cluster is denoted by $S_i$.
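A minimal R sketch (not part of the notes) of carrying out these two stages is given below; it assumes a hypothetical population data frame pop with one row per ssu and a column cluster giving the psu that each ssu belongs to.
# Sketch of drawing a two-stage cluster sample from a data frame 'pop'
# ('pop', 'cluster', and 'frac' are illustrative names, not from the notes)
two.stage.srs=function(pop,n,frac){
  psus=sample(unique(pop$cluster),n);        # stage 1: SRS of n clusters
  samp=NULL;
  for(i in psus){
    rows=which(pop$cluster==i);              # ssu's in the selected cluster
    mi=ceiling(frac*length(rows));           # subsample roughly a fraction frac of them
    samp=rbind(samp,pop[sample(rows,mi),]);  # stage 2: SRS of mi ssu's
  }
  samp
}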
Examples:
(1) The Gallup poll for presidential elections samples
approximately 300 election districts from around
the U.S. At the second stage, this poll randomly
selects approximately five households per district.
(2) Sampling for quality control purposes often
involves two stages of sampling. For example,
when an investigator samples packaged products,
such as frozen foods, he or she commonly samples
cartons and then samples packages from within
cartons. When sampling requires the detailed
investigation of components of products, such as
measuring plate thickness in automobile batteries, a
quite natural procedure is to sample some of the
products (batteries) and then sample components
(plates) within these products.
Data example: A garment manufacturer has 90 plants
located throughout the U.S. and wants to estimate the
average number of hours that the sewing machines were
down for repairs in the past three months. Because the
plants are widely scattered, the manufacturer’s statistician
decides to use cluster sampling, specifying each plant as a
cluster of machines. Each plant contains many machines,
and checking the repair record for each machine would be
time consuming. Therefore, she uses two stage sampling.
Enough time and money are available to sample
n  10 plants and approximately 20% of the machines in
each plant. The manufacturer knows she has a combined
total of 4500 machines in all plants.
Plant   Mi   mi   Downtime (hours)              ybar_i   s_i^2
  1     50   10   5,7,9,0,11,2,8,4,3,5            5.40   11.38
  2     65   13   4,3,7,2,11,0,1,9,4,3,2,1,5      4.00   10.67
  3     45    9   5,6,4,11,12,0,1,8,4             5.67   16.75
  4     48   10   6,4,0,1,0,9,8,4,6,10            4.80   13.29
  5     52   10   11,4,3,1,0,2,8,6,5,3            4.30   11.12
  6     58   12   12,11,3,4,2,0,0,1,4,3,2,4       3.83   14.88
  7     42    8   3,7,6,7,8,4,3,2                 5.00    5.14
  8     66   13   3,6,4,3,2,2,8,4,0,4,5,6,3       3.85    4.31
  9     40    8   6,4,7,3,9,4,1,5                 4.88    6.13
 10     56   11   6,7,5,10,11,2,1,4,0,5,4         5.00   11.80
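As a quick check, the per-plant means and variances in the table can be reproduced in R from the listed downtimes; for example, for plant 1:
# Reproducing the table entries for plant 1 from its listed downtimes
y1=c(5,7,9,0,11,2,8,4,3,5);
mean(y1)   # 5.40, the value of ybar_1
var(y1)    # 11.38 (to two decimals), the value of s_1^2 (sample variance with divisor m_1 - 1)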
Unbiased estimates of the population total and population
mean:
In one-stage cluster sampling, we could estimate the population total by $\frac{N}{n}\sum_{i \in S} t_i$; the cluster totals $t_i$ were known because we sampled every secondary sampling unit in the cluster. In two-stage cluster sampling, however, since we do not observe every ssu in the sampled clusters, we need to estimate the individual cluster totals by
$$\hat{t}_i = \sum_{j \in S_i} \frac{M_i}{m_i} y_{ij} = M_i \bar{y}_i$$
and an unbiased estimator of the population total is
$$\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} \hat{t}_i = \frac{N}{n} \sum_{i \in S} M_i \bar{y}_i .$$
In two-stage sampling, the $\hat{t}_i$'s are random variables. Consequently, the variance of $\hat{t}_{unb}$ has two components: (1) the variability between clusters; and (2) the variability of the ssu's within clusters. We do not have to worry about component (2) in one-stage cluster sampling.
The variance of $\hat{t}_{unb}$ equals the variance of $\hat{t}_{unb}$ from one-stage cluster sampling plus an extra term because the $\hat{t}_i$'s estimate the cluster totals. For two-stage cluster sampling,
$$\mathrm{Var}(\hat{t}_{unb}) = N^2\left(1-\frac{n}{N}\right)\frac{S_t^2}{n} + \frac{N}{n}\sum_{i=1}^{N}\left(1-\frac{m_i}{M_i}\right)M_i^2\,\frac{S_i^2}{m_i} \qquad (1.1)$$
where $S_t^2$ is the population variance of the cluster totals and $S_i^2$ is the population variance among elements within cluster $i$. The first term in (1.1) is the variance from one-stage cluster sampling and the second term is the additional variance due to subsampling.
To estimate $\mathrm{Var}(\hat{t}_{unb})$, we substitute estimates $s_t^2$ and $s_i^2$ for $S_t^2$ and $S_i^2$ respectively:
$$s_t^2 = \frac{\sum_{i \in S}\left(\hat{t}_i - \frac{\hat{t}_{unb}}{N}\right)^2}{n-1}, \qquad s_i^2 = \frac{\sum_{j \in S_i}\left(y_{ij} - \bar{y}_i\right)^2}{m_i - 1}$$
$$\widehat{\mathrm{Var}}(\hat{t}_{unb}) = N^2\left(1-\frac{n}{N}\right)\frac{s_t^2}{n} + \frac{N}{n}\sum_{i \in S}\left(1-\frac{m_i}{M_i}\right)M_i^2\,\frac{s_i^2}{m_i}$$
The standard error $SE(\hat{t}_{unb})$ is of course the square root of $\widehat{\mathrm{Var}}(\hat{t}_{unb})$.
If we know the number of units in the population $K$, we can estimate the population mean by
$$\hat{\bar{y}}_{unb} = \frac{\hat{t}_{unb}}{K}$$
with standard error
$$SE(\hat{\bar{y}}_{unb}) = \frac{SE(\hat{t}_{unb})}{K}.$$
R code:
Mi=c(50,65,45,48,52,58,42,66,40,56);
mi=c(10,13,9,10,10,12,8,13,8,11);
yibar=c(5.4,4,5.67,4.8,4.3,3.83,5,3.85,4.88,5);
sisq=c(11.38,10.67,16.75,13.29,11.12,14.88,5.14,4.31,6.13,11.8);
N=90;
n=10;
K=4500;
# Estimated cluster totals and unbiased estimate of the population total
that.i=Mi*yibar;
that.unb=(N/n)*sum(Mi*yibar);
# Estimated variance, standard error, and estimated population mean
stsq=sum((that.i-that.unb/N)^2)/(n-1);
varhat.that.unb=N^2*(1-n/N)*stsq/n+(N/n)*sum((1-mi/Mi)*Mi^2*(sisq/mi));
se.that.unb=sqrt(varhat.that.unb);
yhat.unb=that.unb/K;
se.yhat.unb=se.that.unb/K;
yhat.unb
[1] 4.80118
se.yhat.unb
[1] 0.192594
Thus, we estimate that the average sewing machine was
down for repair for 4.8 hours in the past three months and a
95% confidence interval for this average is approximately
4.8  1.96*0.19  (4.43, 5.17) .
Ratio Estimation: As with one-stage cluster sampling, we
can also use a ratio estimator for estimating the population
mean where the auxiliary variables are the cluster sizes.
The population mean can be written as
$$\bar{y}_U = \frac{\sum_{i=1}^{N} t_i}{\sum_{i=1}^{N} M_i} = \frac{\frac{1}{N}\sum_{i=1}^{N} t_i}{\frac{1}{N}\sum_{i=1}^{N} M_i}$$
The ratio estimates of the population mean and population total are:
$$\hat{\bar{y}}_r = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i}, \qquad \hat{t}_r = K \hat{\bar{y}}_r$$
The difference between one-stage and two-stage cluster
sampling is that we need to estimate the cluster totals ti in
two-stage cluster sampling.
Using a Taylor series approximation, the variance of the
ratio estimator is estimated by
$$\widehat{\mathrm{Var}}(\hat{\bar{y}}_r) = \frac{1}{\bar{M}^2}\left[\left(1-\frac{n}{N}\right)\frac{s_r^2}{n} + \frac{1}{nN}\sum_{i \in S} M_i^2\left(1-\frac{m_i}{M_i}\right)\frac{s_i^2}{m_i}\right]$$
where
$$s_r^2 = \frac{\sum_{i \in S}\left(M_i \bar{y}_i - M_i \hat{\bar{y}}_r\right)^2}{n-1}$$
and $\bar{M}$ is the average cluster size; either the population average or the sample average can be used in the estimate of the variance.
# Ratio estimate
ybarhat.r=sum(that.i)/sum(Mi);
Mbar=mean(Mi);
srsq=sum((Mi*yibar-Mi*ybarhat.r)^2)/(n-1);
varhat.ybarhat.r=(1/Mbar^2)*((1-n/N)*srsq/n+(1/(n*N))*sum(Mi^2*(1-mi/Mi)*sisq/mi));
se.ybarhat.r=sqrt(varhat.ybarhat.r);
ybarhat.r
[1] 4.598831
se.ybarhat.r
[1] 0.2220061
In this example, the ratio estimator has a higher standard
error than the unbiased estimator. However, if the
population size K were unknown, then we would have to
use the ratio estimator. Furthermore, the ratio estimator
will be better than the unbiased estimator when the cluster
sizes vary considerably and the cluster totals ti are roughly
proportional to the cluster sizes M i .
When the cluster sizes are all the same, the ratio estimator
and the unbiased estimator are the same.
II. Sampling weights for cluster samples
Recall from Chapter 4.3 that the sampling weight for an
observation in a sample is the inverse of the probability that
an observation would be selected in a sample and is the
number of units in the population that the observation
represents.
For cluster sampling,
$$P(j\text{th ssu in } i\text{th cluster is selected}) = P(i\text{th cluster selected}) \times P(j\text{th ssu selected} \mid i\text{th cluster selected}) = \frac{n}{N} \cdot \frac{m_i}{M_i}$$
Thus, the sampling weight for the jth unit in the ith cluster is
$$w_{ij} = \frac{N M_i}{n m_i}$$
We have
$$\hat{t}_{unb} = \frac{N}{n}\sum_{i \in S} \hat{t}_i = \sum_{i \in S}\sum_{j \in S_i} w_{ij} y_{ij} \qquad \text{and} \qquad \hat{\bar{y}}_r = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i} = \frac{\hat{t}_{unb}}{\sum_{i \in S}\sum_{j \in S_i} w_{ij}}$$
In two-stage cluster sampling, for each ssu in the sample to represent the same number of ssu's in the population, $m_i$ needs to be proportional to $M_i$ so that $m_i/M_i$ is constant. This is approximately true in the above example on downtime of sewing machines.
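We can verify this with the Mi and mi vectors from the R code above: the within-plant sampling fractions are all close to 20%, so each sampled machine represents roughly the same number of machines in the population.
# Within-plant sampling fractions mi/Mi; all are roughly 0.19 to 0.21
round(mi/Mi,3)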
III. Analysis of cluster samples using the survey package
in R
To use the survey package in R to analyze cluster samples, we need to create a variable that specifies which cluster each observation in the sample belongs to, a variable that gives a unique number to each ssu, a variable that gives the sampling weight of each observation, and variables that give the overall number of clusters in the population and the cluster size of each cluster in the sample.
library(survey);
# Responses
downtime=c(5,7,9,0,11,2,8,4,3,5,4,3,7,2,11,0,1,9,4,3,2,1,5,5,6,4,11,12,0,1,8,4,6,4,0,1,0,9,8,4,6,10,11,4,3,1,0,2,8,6,5,3,12,11,3,4,2,0,0,1,4,3,2,4,3,7,6,7,8,4,3,2,3,6,4,3,2,2,8,4,0,4,5,6,3,6,4,7,3,9,1,4,5,6,7,5,10,11,2,1,4,0,5,4)
# Cluster membership (which plant each sampled machine belongs to)
plant=c(rep(1,10),rep(2,13),rep(3,9),rep(4,10),rep(5,10),rep(6,12),rep(7,8),rep(8,13),rep(9,8),rep(10,11));
# Number of clusters in the population
fpc1=rep(90,length(plant));
# Size of cluster for each ssu
fpc2=c(rep(50,10),rep(65,13),rep(45,9),rep(48,10),rep(52,10),rep(58,12),rep(42,8),rep(66,13),rep(40,8),rep(56,11));
# Sampling weights
wght=(fpc1*fpc2)/(10*c(rep(10,10),rep(13,13),rep(9,9),rep(10,10),rep(10,10),rep(12,12),rep(8,8),rep(13,13),rep(8,8),rep(11,11)));
# Unique id for each ssu
machine.id=seq(1,length(downtime),1);
# Data frame that contains the information for the survey design
garment.dataframe=data.frame(machine.id,plant,fpc1,fpc2,wght);
# Specify the sample design
garment.design=svydesign(id=~plant+machine.id,weights=~wght,data=garment.dataframe,fpc=~fpc1+fpc2);
# Estimate the population mean of downtime using the ratio estimator
svymean(downtime,design=garment.design);
mean SE
[1,] 4.598 0.2219
The survey package automatically uses the ratio estimator for estimating the population mean for cluster samples. The survey package uses $\hat{t}_{unb}$ for estimating the population total:
> svytotal(downtime,design=garment.design)
total SE
[1,] 21602 866.64
IV. Designing a cluster sample
When designing a cluster sample, you need to address four major issues:
(1) What overall precision (standard error) is needed?
(2) What size should the clusters be?
(3) How many ssu's should be sampled in each cluster selected for the sample?
(4) How many clusters should be sampled?
Issue (1) is faced in any survey design. For issue (2), there
are often natural clusters.
We will focus on issues (3)-(4) in the case of equal cluster sizes (in which case $\hat{\bar{y}}_{unb} = \hat{\bar{y}}_r$). Let $M$ be the common cluster size and assume we will sample the same number $m$ of ssu's from each cluster.
The ANOVA decomposition can be used to write
$$\mathrm{Var}(\hat{\bar{y}}_{unb}) = \left(1-\frac{n}{N}\right)\frac{MSB}{nM} + \left(1-\frac{m}{M}\right)\frac{MSW}{nm}$$
where MSB and MSW are the between and within mean squares from the ANOVA table (see Notes 12; Lohr, pg. 138).
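For reference, this variance is easy to evaluate in R. The function below is a small sketch (not part of the survey package); its arguments simply mirror the symbols in the formula.
# Variance of the estimated mean under two-stage sampling with equal cluster
# sizes M and a common subsample size m; MSB and MSW are the ANOVA mean squares
var.ybar.unb=function(N,n,M,m,MSB,MSW){
  (1-n/N)*MSB/(n*M)+(1-m/M)*MSW/(n*m)
}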
Consider the simple cost function
$$\text{total cost} = C = c_1 n + c_2 n m$$
where $c_1$ is the fixed cost of sampling each cluster (not including the cost of measuring ssu's) and $c_2$ is the additional cost of measuring each ssu. Using calculus, one can easily determine that the values
$$n = \frac{C}{c_1 + c_2 m}, \qquad m = \sqrt{\frac{c_1 M (MSW)}{c_2 (MSB - MSW)}}$$
minimize $\mathrm{Var}(\hat{\bar{y}}_{unb})$ for fixed total cost $C$.
Example: An inspector samples cans from a truckload of
canned creamed corn to estimate the average number of
worm fragments per can. The truck has 580 cases; each
case contains 24 cans. The inspector samples 12 cases at
random and subsamples 3 cans randomly from each
selected case.
# Case index, can index, and worm fragment count for each of the 36 sampled cans
case=rep(seq(1,12,1),each=3);
can=seq(1,36,1);
frag=c(1,5,7,4,2,4,0,1,2,3,6,6,4,9,8,0,7,3,5,5,1,3,0,2,7,3,5,3,1,4,4,7,9,0,0,0);
(a) Find an approximate 95% confidence interval for the average number of worm fragments per can.
# Cluster sizes
clustersize=rep(24,36);
noclusters=rep(580,36);
# Sampling weights
N=580;
n=12;
mi=rep(3,36);
sampwght=(N*clustersize)/(n*mi);
corn.design.dataframe=data.frame(case,can,clustersize,noclusters,sampwght);
corn.design=svydesign(id=~case+can,weights=sampwght,data=corn.design.dataframe,fpc=~noclusters+clustersize);
svymean(frag,design=corn.design);
mean SE
[1,] 3.6389 0.6102
Thus, an approximate 95% confidence interval for the
average number of worm fragments per can is
3.64  1.96*.61  (2.44, 4.84)
(b) Suppose a new truckload is to be inspected and it is thought to be similar to this one. It takes 20 minutes to locate and open a case, and 8 minutes to locate and examine each specified can within a case. How many cans should be examined per case? How many cases? Assume your budget is 120 minutes.
To find the MSB and MSW, we can use the one-way
analysis of variance.
> aov.frag=aov(frag~as.factor(case)); # as.factor(case) makes case into a categorical variable with categories 1,...,12
> summary(aov.frag)
                Df  Sum Sq Mean Sq F value  Pr(>F)
as.factor(case) 11 149.639  13.604  3.0045 0.01172 *
Residuals       24 108.667   4.528
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus, we estimate MSB=13.60 and MSW=4.53.
Then,
$$m = \sqrt{\frac{c_1 M (MSW)}{c_2 (MSB - MSW)}} = \sqrt{\frac{20 \times 24 \times 4.53}{8 \times (13.6 - 4.53)}} = 5.47$$
Round up to 6 cans per case.
Then,
$$n = \frac{C}{c_1 + c_2 m} = \frac{120}{20 + 8 \times 6} = 1.76$$
Round up and sample 2 cases.
The total cost would be 2*(20+8*6) = 136 minutes, which is over budget. We can use 5 cans per case and remain within budget: 2*(20+8*5) = 120 minutes.
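The same calculation can be carried out in R (a small sketch using the mean squares estimated from the ANOVA table above):
# Optimal number of cans per case and number of cases for the new truckload
MSB=13.60; MSW=4.53;
c1=20; c2=8; C=120; M=24;
m.opt=sqrt(c1*M*MSW/(c2*(MSB-MSW)));   # 5.47; round up to 6 cans per case
n.opt=C/(c1+c2*6);                     # 1.76; round up to 2 cases
2*(c1+c2*6)                            # 136 minutes, over the 120-minute budget
2*(c1+c2*5)                            # 120 minutes, within budget with 5 cans per case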