Notes 15 - Wharton Statistics Department

Statistics 475 Notes 15
Reading: Lohr, Chapters 6.1-6.2
Ruben’s question from last class: In planning a two stage
cluster design, what happens if the cluster size M is large
but unknown?
Suppose the cost is $c_1$ for sampling a cluster and $c_2$ for sampling a unit within a cluster. The optimal sample size $m$ in each cluster can be written as
$$
m = \sqrt{\frac{c_1 M (\text{MSW})}{c_2(\text{MSB}-\text{MSW})}} = \sqrt{\frac{c_1 M(N-1)(1-R_a^2)}{c_2(NM-1)R_a^2}},
$$
where $R_a^2 = 1 - \text{MSW}/S^2$. If $M$ is large, then $M(N-1)/(NM-1)\approx 1-1/N$, so the optimal $m$ is approximately
$$
m \approx \sqrt{\frac{c_1}{c_2}\left(1-\frac{1}{N}\right)\left(\frac{1}{R_a^2}-1\right)},
$$
which does not involve $M$. So when $M$ is large, not knowing $M$ is not a problem: the optimal subsample size is essentially unaffected by it.
I. Sampling clusters with unequal probabilities: motivating
example
O’Brien et al. (1995, Journal of the American Medical
Association) sought to estimate the preference of nursing
home residents in the Philadelphia’s for life sustaining
treatments. Do they wish to have cardiopulmonary
resuscitation (CPR) if the heart stops beating, or to be
transferred to a hospital if a serious illness develops, or to
be fed through an enteral tube if no longer able to eat? The
target population was all residents of licensed nursing
homes in the Philadelphia area. There were 294 such
homes with a total of 37,652 beds (before sampling, they
only knew the number of beds, not the total number of
residents).
Because the survey was to be done in person, cluster
sampling was essential for keeping survey costs
manageable. Consider a two stage cluster sample in which
each nursing home has the same probability of being
selected in the sample and then the subsample size for each
home is proportional to the number of beds in the home.
This is a self-weighting sample, meaning that the weights
for each bed in the sample are the same: each bed in
the population has the same probability of being sampled,
and each sampled bed represents the same number of beds
in the population. However, this design has the following
drawbacks:
(1) We would expect that the total number of patients in a
home who desire CPR ($t_i$) is proportional to the number of
beds in the home ($M_i$), so the unbiased estimator of the
population total would have large variance. Using a ratio
estimator would help to alleviate this concern.
(2) A self-weighting equal probability sample may be
cumbersome to administer. It may require driving out to a
nursing home just to interview one or two residents, and
equalizing workloads of interviewers may be difficult.
(3) The cost of the sample is unknown in advance – a
random sample of 40 homes may consist primarily of large
nursing homes, which would lead to greater expense than
anticipated.
Instead of taking a cluster sample of homes with equal
probabilities, the investigators randomly drew a sample of
57 nursing homes with probabilities proportional to the
number of beds. This is called probability proportional to
size sampling of clusters. They then took a simple random
sample of 30 beds (and their occupants) from a list of all
beds within the nursing home. If the number of residents
equals the number of beds and if a home has the same
number of beds when visited as are listed in the sampling
frame, then the sampling design results in every resident
having the same probability of being included in the
sample. The cost is known before selecting the sample, the
same number of interviews are taken at each home, and the
estimator of a population total will likely have a smaller
variance than if we had sampled the nursing homes with
equal probabilities.
II. Unequal Probability Sampling with Replacement
We first consider how to sample clusters proportional to
size with replacement. Sampling with replacement means
that the selection probabilities do not change after we have
drawn the first unit. Although sampling without
replacement is more efficient, sampling with replacement is
often used because of the ease in selecting and analyzing
samples. Let
$$
\psi_i = P(\text{select unit } i \text{ on first draw}).
$$
For sampling with replacement, $\psi_i$ is also the probability
that unit $i$ is selected on the second draw, or the third draw,
or any other given draw. The overall probability that unit $i$
is selected in the sample at least once is
$$
\pi_i = 1 - P(\text{unit } i \text{ not in sample}) = 1 - (1-\psi_i)^n.
$$
Example 1: Consider the following population of
introductory statistics classes at a college. The college has
15 such classes; class $i$ has $M_i$ students, for a total of 647
students. We decide to sample 5 classes with replacement,
with probability proportional to $M_i$, and then collect a
questionnaire from each student in the sampled classes.
For this example, $\psi_i = M_i / 647$.
Class     M_i    psi_i
1         44     0.068006
2         33     0.051005
3         26     0.040185
4         22     0.034003
5         76     0.117465
6         63     0.097372
7         20     0.030912
8         44     0.068006
9         54     0.083462
10        34     0.052550
11        46     0.071097
12        24     0.037094
13        46     0.071097
14        100    0.154560
15        15     0.023184
Total     647    1
One way to sample the clusters with probabilities $\psi_i$ is the
cumulative-size method. Calculate the cumulative totals of $\psi_i$:

Class     psi_i      Cumulative total of psi_i
1         0.068006   0.068006
2         0.051005   0.119011
3         0.040185   0.159196
4         0.034003   0.193199
5         0.117465   0.310665
6         0.097372   0.408037
7         0.030912   0.438949
8         0.068006   0.506955
9         0.083462   0.590417
10        0.052550   0.642967
11        0.071097   0.714065
12        0.037094   0.751159
13        0.071097   0.822257
14        0.154560   0.976816
15        0.023184   1.000000
Total     1
Sample a random uniform number between 0 and 1.
> runif(1)
[1] 0.6633096
Find the first cluster $i$ whose cumulative total of $\psi$ equals or
exceeds the sampled random uniform number; this is the
sampled cluster. For the above random uniform number, the
sampled cluster is 11.
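The cumulative-size method is easy to automate. The sketch below (function name mine) stands in for the R call, using Python's `bisect` to locate the selected cluster:

```python
import random
from bisect import bisect_left
from itertools import accumulate

def cumulative_size_sample(psi, n_draws, rng=random):
    """Sample clusters with replacement by the cumulative-size method:
    each draw picks the first cluster whose cumulative total of psi
    reaches a Uniform(0,1) number. Returns 1-based cluster labels."""
    cum = list(accumulate(psi))
    draws = []
    for _ in range(n_draws):
        u = rng.random()
        # min() guards against u landing above the last cumulative
        # total due to floating-point rounding of sum(psi)
        draws.append(min(bisect_left(cum, u), len(cum) - 1) + 1)
    return draws

# Class sizes from Example 1, so psi_i = M_i / 647:
M = [44, 33, 26, 22, 76, 63, 20, 44, 54, 34, 46, 24, 46, 100, 15]
psi = [m / 647 for m in M]

# The uniform number 0.6633096 from the notes selects cluster 11:
cum = list(accumulate(psi))
print(bisect_left(cum, 0.6633096) + 1)  # 11
```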
Another method for sampling clusters with probabilities $\psi_i$
proportional to size is Lahiri's (1951) method:
1. Draw a random integer between 1 and N (the number
of clusters). This indicates which cluster you are
considering.
2. Draw a random integer between 1 and $\max(M_i)$: if this
number is less than or equal to $M_i$ for the cluster under
consideration, include cluster $i$ in the sample; otherwise,
discard the pair of numbers and go back to step 1.
3. Repeat until the desired sample size is obtained.
Lahiri's method applied to Example 1: The largest class has
$\max(M_i) = 100$ students, so we generate pairs of random
integers, the first between 1 and 15 and the second between 1
and 100, until the sample contains five clusters.
First random number     M_i    Second random    Action
(cluster to consider)          number
12                      24     6                6 <= 24; include cluster 12 in sample
14                      100    24               Include in sample
1                       44     65               65 > 44; discard pair of numbers and try again
7                       20     84               84 > 20; try again
10                      34     49               Try again
14                      100    47               Include
15                      15     43               Try again
5                       76     24               Include
11                      46     87               Try again
1                       44     36               Include
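Lahiri's method can be sketched in a few lines (the function name is mine). The empirical frequency check at the end illustrates that selection is indeed proportional to class size:

```python
import random

def lahiri_sample(M, n_draws, rng):
    """Lahiri's rejection method: pick a cluster uniformly at random,
    then accept it when a second uniform integer in 1..max(M) is at
    most M_i; repeat until n_draws clusters are accepted."""
    N, M_max = len(M), max(M)
    sampled = []
    while len(sampled) < n_draws:
        i = rng.randint(1, N)       # step 1: cluster to consider
        r = rng.randint(1, M_max)   # step 2: accept if r <= M_i
        if r <= M[i - 1]:
            sampled.append(i)
    return sampled

M = [44, 33, 26, 22, 76, 63, 20, 44, 54, 34, 46, 24, 46, 100, 15]
draws = lahiri_sample(M, 100_000, random.Random(0))
# Empirically, cluster 14 (M=100) should appear about 100/15 = 6.7
# times as often as cluster 15 (M=15):
print(draws.count(14) / draws.count(15))
```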
Proof that Lahiri's method produces a probability
proportional to size sample:
Lahiri's method is an example of rejection sampling.
Consider selecting one cluster with Lahiri's method. Let
$V_j = 1$ if the cluster is selected using the $j$th pair of random
numbers and 0 otherwise. We have
$$
P(i\text{th cluster is selected} \mid V_j = 1, V_1 = 0, \ldots, V_{j-1} = 0)
= \frac{\dfrac{1}{N}\cdot\dfrac{M_i}{\max(M_i)}}{\displaystyle\sum_{k=1}^{N}\frac{1}{N}\cdot\frac{M_k}{\max(M_i)}}
= \frac{M_i}{\sum_{k=1}^{N} M_k}.
$$
Since this holds for all $V_j$, the draw on which the cluster is
selected is independent of the cluster selected, and
$$
P(i\text{th cluster is selected}) = \frac{M_i}{\sum_{k=1}^{N} M_k}.
$$
III. Estimation Using Unequal Probability Sampling
With Replacement
Because we are sampling with replacement, the sample
may contain the same unit more than once. To allow us to
keep track of which clusters occur multiple times in the
sample, define the random variables $Q_i$ by
$$
Q_i = \text{number of times unit } i \text{ occurs in the sample}.
$$
An unbiased estimate of the population total is
$$
\hat t_\psi = \frac{1}{n}\sum_{i=1}^{N} Q_i \frac{t_i}{\psi_i}. \qquad (1.1)
$$
Note that $\sum_{i=1}^{N} Q_i = n$ and $E[Q_i] = n\psi_i$, so that
$$
E[\hat t_\psi] = \frac{1}{n}\sum_{i=1}^{N} E[Q_i]\frac{t_i}{\psi_i}
= \frac{1}{n}\sum_{i=1}^{N} n\psi_i\frac{t_i}{\psi_i}
= \sum_{i=1}^{N} t_i,
$$
so that $\hat t_\psi$ is an unbiased estimator of the population total. Estimator (1.1)
can be motivated as follows. Suppose we just sample one
cluster, $n = 1$. Then the cluster $i$ chosen represents the
proportion $\psi_i$ of the units in the population, and so a
natural estimator of the total is $t_i/\psi_i$. Estimator (1.1)
averages this estimate over the $n$ clusters in the sample.
Variance of $\hat t_\psi$: If $n = 1$, we have
$$
Var[\hat t_\psi] = E[(\hat t_\psi - t)^2]
= \sum_{\text{possible samples } S} P(S)\,(\hat t_S - t)^2
= \sum_{i=1}^{N}\psi_i\left(\frac{t_i}{\psi_i} - t\right)^2.
$$
Then the estimator (1.1) is just the average of $n$
observations, each with variance $\sum_{i=1}^{N}\psi_i\left(t_i/\psi_i - t\right)^2$, so
$$
Var[\hat t_\psi] = \frac{1}{n}\sum_{i=1}^{N}\psi_i\left(\frac{t_i}{\psi_i} - t\right)^2. \qquad (1.2)
$$
To estimate Var[tˆ ] from a sample, we might think that we
could use a formula of the same form as (1.2) but this will
not work. Equation (1.2) involves a weighted average of
2
the (ti / i  t ) , weighted by the unequal probabilities of
selection. But in taking the sample, we have already used
the unequal probabilities – they appear in the random
variables Qi . If we included the  i ’s again as multipliers
in estimating the sample variance, we would be using the
unequal probabilities twice. Instead, to estimate the
variance, use
$$
\widehat{Var}[\hat t_\psi] = \frac{1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{N} Q_i\left(\frac{t_i}{\psi_i} - \hat t_\psi\right)^2.
$$
$\widehat{Var}[\hat t_\psi]$ is just the sample variance of the $t_i/\psi_i$'s divided
by the sample size $n$.
$\widehat{Var}[\hat t_\psi]$ is an unbiased estimate of the variance because
$$
E\left(\widehat{Var}[\hat t_\psi]\right)
= \frac{1}{n(n-1)} E\left[\sum_{i=1}^{N} Q_i\left(\frac{t_i}{\psi_i} - \hat t_\psi\right)^2\right]
= \frac{1}{n(n-1)} E\left[\sum_{i=1}^{N} Q_i\left(\frac{t_i}{\psi_i} - t\right)^2 - n(\hat t_\psi - t)^2\right]
$$
$$
= \frac{1}{n(n-1)}\left[\sum_{i=1}^{N} n\psi_i\left(\frac{t_i}{\psi_i} - t\right)^2 - n\,Var(\hat t_\psi)\right]
= \frac{1}{n(n-1)}\left[n^2\,Var(\hat t_\psi) - n\,Var(\hat t_\psi)\right]
= Var(\hat t_\psi),
$$
where the last line uses (1.2), i.e., $\sum_{i=1}^{N}\psi_i(t_i/\psi_i - t)^2 = n\,Var(\hat t_\psi)$.
Example 1 continued: For the situation in Example 1 and
the sample {12, 14, 14, 5, 1} we selected using Lahiri’s
method, we have the following data:
Class   psi_i     t_i    t_i / psi_i
12      24/647    75     2021.875
14      100/647   203    1313.410
14      100/647   203    1313.410
5       76/647    191    1626.013
1       44/647    168    2470.364
The numbers in the last column of the table are the
estimates of $t$ that would be obtained if that cluster were
the one selected in a sample of size 1. The population total
is estimated by averaging the five values of $t_i/\psi_i$:
$$
\hat t_\psi = \frac{2021.875 + 1313.410 + 1313.410 + 1626.013 + 2470.364}{5} = 1749.014.
$$
The standard error of $\hat t_\psi$ is simply $s/\sqrt{n}$, where $s$ is the
sample standard deviation of the $t_i/\psi_i$'s:
$$
SE(\hat t_\psi) = \frac{1}{\sqrt{5}}\sqrt{\frac{(2021.875 - 1749.014)^2 + \cdots + (2470.364 - 1749.014)^2}{4}} = 222.42.
$$
Then, we estimate the average amount of time a student
spent studying statistics by $\hat t_\psi$ divided by the population
size:
$$
\hat{\bar y}_\psi = \frac{1749.014}{647} = 2.70 \text{ hours},
$$
with $SE(\hat{\bar y}_\psi) = SE(\hat t_\psi)/647 = 222.42/647 = 0.34$ hours.
The formulas for estimating the population total, population
mean, and their standard errors are valid for any choice of
sampling probabilities $\psi_i$, regardless of whether the $\psi_i$'s
are proportional to the size of the cluster. For example, if
we are interested in obtaining an accurate estimate of the
mean amount of time spent studying for the subpopulation
of students on college sports teams, we might want to
sample classes with more students on sports teams with
higher probability.
IV. Two Stage Sampling With Replacement
If we sample clusters with unequal probabilities and then
use simple random sampling within clusters, the estimators
from one-stage unequal sampling are modified slightly to
allow for different subsamples in clusters that are selected
more than once:
$$
\hat t_\psi = \frac{1}{n}\sum_{i=1}^{N}\sum_{j=1}^{Q_i}\frac{\hat t_{ij}}{\psi_i},
\qquad
\widehat{Var}(\hat t_\psi) = \frac{1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{N}\sum_{j=1}^{Q_i}\left(\frac{\hat t_{ij}}{\psi_i} - \hat t_\psi\right)^2,
$$
where $\hat t_{ij}$ is the estimated total for cluster $i$ from the $j$th
subsample taken in that cluster.
Returning to Example 1, suppose we subsample five
students in each class rather than observing $t_i$:

Class  M_i   psi_i     y_ij                 ybar_i   t_hat_ij   t_hat_ij / psi_i
12     24    24/647    2, 3, 2.5, 3, 1.5    2.4      57.6       1552.8
14     100   100/647   2.5, 2, 3, 0, 0.5    1.6      160.0      1035.2
14     100   100/647   3, 0.5, 1.5, 2, 3    2.0      200.0      1294.0
5      76    76/647    1, 2.5, 3, 5, 2.5    2.8      212.8      1811.6
1      44    44/647    4, 4.5, 3, 2, 5      3.7      162.8      2393.9
                                            average:            1617.5
                                            std. dev.:          521.628

Thus, $\hat t_\psi = 1617.5$ and $SE(\hat t_\psi) = 521.628/\sqrt{5} = 233.28$.
Then we estimate the average amount of time spent
studying as
$$
\hat{\bar y}_\psi = \frac{\hat t_\psi}{K} = \frac{1617.5}{647} = 2.5 \text{ hours}
$$
(where $K = 647$ is the population size), with
$$
SE(\hat{\bar y}_\psi) = \frac{SE(\hat t_\psi)}{K} = \frac{233.28}{647} = 0.36.
$$