Notes 14

Statistics 475 Notes 14
Reading: Lohr, Chapters 5.5-5.6
I. Designing a cluster sample
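Consider equal sized clusters.

$$\operatorname{Var}(\hat{\bar{y}}_{\text{unb}}) = \left(1 - \frac{n}{N}\right)\frac{MSB}{nM} + \left(1 - \frac{m}{M}\right)\frac{MSW}{nm}$$

[Note: For equal sized clusters, $\operatorname{Var}(\hat{\bar{y}}_{\text{unb}}) = \operatorname{Var}(\hat{\bar{y}}_{r})$.]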
Consider the following cost function for sampling
n clusters and m units per cluster:

$$\text{total cost} = C = c_1 n + c_2 n m,$$

where $c_1$ is the fixed cost of sampling each cluster (not
including the cost of measuring ssu's) and $c_2$ is the
additional cost of measuring each ssu. Assuming that N is
large so that the finite population correction for sampling
clusters can be ignored, one can use calculus to show that
the values

$$n = \frac{C}{c_1 + c_2 m}, \qquad m = \sqrt{\frac{c_1 M (MSW)}{c_2 (MSB - MSW)}}$$

minimize $\operatorname{Var}(\hat{\bar{y}}_{\text{unb}})$ for fixed total cost C, where

$$MSB = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} (\bar{y}_{iU} - \bar{y}_{U})^2}{N - 1}, \qquad MSW = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar{y}_{iU})^2}{N(M - 1)}.$$
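The calculus step can be sketched as follows. This is only an outline, assuming $n/N \approx 0$, treating m as continuous, and assuming $MSB > MSW$ so that the square root is defined.

```latex
\begin{align*}
\operatorname{Var}(\hat{\bar{y}}_{\text{unb}})
  &\approx \frac{MSB}{nM} + \left(1 - \frac{m}{M}\right)\frac{MSW}{nm}
   = \frac{1}{n}\left[\frac{MSB - MSW}{M} + \frac{MSW}{m}\right] \\
  &= \frac{c_1 + c_2 m}{C}\left[\frac{MSB - MSW}{M} + \frac{MSW}{m}\right]
   \quad \text{(substituting } n = C/(c_1 + c_2 m)\text{)} \\
\frac{d}{dm}\operatorname{Var}(\hat{\bar{y}}_{\text{unb}})
  &= \frac{1}{C}\left[\frac{c_2\,(MSB - MSW)}{M} - \frac{c_1\,MSW}{m^2}\right] = 0
  \;\Longrightarrow\;
  m^2 = \frac{c_1 M\,(MSW)}{c_2\,(MSB - MSW)}.
\end{align*}
```

The second derivative, $2 c_1 MSW / (C m^3)$, is positive, so this stationary point is a minimum. If $MSB \le MSW$, the derivative is negative for every m, so the variance keeps decreasing as m grows and the largest possible value, m = M, is best.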
Example: One of the key quality assurance measurements
in the manufacturing of automobile batteries is the
thickness of lead plates. Positive plates are manufactured
to be thicker than negative plates, so the two must be
treated separately. It is desired to set up a sampling plan to
sample n batteries per day and make m negative plate
thickness measurements per battery, so that the standard
error of the estimated mean plate thickness is 0.3.
(Measurements of thickness are in thousandths of an inch).
The cost of cutting a battery open is six times the cost of
measuring a plate. There are M  9 plates per battery. A
preliminary study of four batteries, with nine plate
thickness measurements per battery gave the following
data:
battery1=c(97,101,97,97,99,100,96,100,100)
battery2=c(95,96,96,99,96,97,95,96,100);
battery3=c(99,96,97,97,96,98,99,98,100);
battery4=c(94,95,97,98,97,97,97,95,96);
We first estimate MSB and MSW by performing a one-way
ANOVA on the preliminary study data.
plate.thickness=c(battery1,battery2,battery3,battery4);
batterynumber=c(rep(1,9),rep(2,9),rep(3,9),rep(4,9));
aov.battery=aov(plate.thickness~as.factor(batterynumber));
summary(aov.battery);
                         Df Sum Sq Mean Sq F value  Pr(>F)
as.factor(batterynumber)  3 30.306  10.102  4.0747 0.01470 *
Residuals                32 79.333   2.479
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Thus, we estimate MSB=10.10 and MSW=2.48.
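Rather than reading the mean squares off the printed table, they can be pulled out of the fitted object. The sketch below re-enters the preliminary data so that it runs on its own; the names `mean.sq`, `MSB.hat` and `MSW.hat` are illustrative and not part of the original notes.

```r
# Preliminary data from the notes: four batteries, nine plates each
battery1 <- c(97, 101, 97, 97, 99, 100, 96, 100, 100)
battery2 <- c(95, 96, 96, 99, 96, 97, 95, 96, 100)
battery3 <- c(99, 96, 97, 97, 96, 98, 99, 98, 100)
battery4 <- c(94, 95, 97, 98, 97, 97, 97, 95, 96)

plate.thickness <- c(battery1, battery2, battery3, battery4)
batterynumber <- factor(rep(1:4, each = 9))

aov.battery <- aov(plate.thickness ~ batterynumber)

# summary() returns a list whose first element is the ANOVA table;
# its "Mean Sq" column holds the between-battery mean square (row 1)
# and the residual, i.e. within-battery, mean square (row 2)
mean.sq <- summary(aov.battery)[[1]][["Mean Sq"]]
MSB.hat <- mean.sq[1]
MSW.hat <- mean.sq[2]
round(c(MSB = MSB.hat, MSW = MSW.hat), 2)
```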
The cost of cutting open a battery is six times that of
measuring a plate, so that $c_1 / c_2 = 6$ and the optimal m is

$$m = \sqrt{\frac{c_1 M (MSW)}{c_2 (MSB - MSW)}} = \sqrt{\frac{6 \cdot 9 \cdot (2.48)}{10.10 - 2.48}} = 4.19.$$
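The same arithmetic can be checked in R. This is a minimal sketch; `optimal.m` is a hypothetical helper name, and it takes the cost ratio $c_1/c_2$ rather than the two costs separately.

```r
# Optimal number of ssu's per cluster, ignoring the fpc for clusters.
# cost.ratio is c1 / c2; optimal.m is an illustrative helper, not from the notes.
optimal.m <- function(cost.ratio, M, MSB, MSW) {
  stopifnot(MSB > MSW)  # the square-root formula requires MSB > MSW
  sqrt(cost.ratio * M * MSW / (MSB - MSW))
}

m.opt <- optimal.m(cost.ratio = 6, M = 9, MSB = 10.10, MSW = 2.48)
round(m.opt, 2)
```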
We can round and sample $m = 4$ plates per battery. To
achieve a SE of 0.3, we want to find n such that

$$SE(\hat{\bar{y}}_{\text{unb}}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{MSB}{nM} + \left(1 - \frac{m}{M}\right)\frac{MSW}{nm}} = 0.3,$$

which is equivalent to, assuming that $n/N \approx 0$,

$$SE(\hat{\bar{y}}_{\text{unb}}) = \sqrt{\frac{10.10}{n(9)} + \left(1 - \frac{4}{9}\right)\frac{2.48}{n(4)}} = 0.3,$$

which gives $n = 16.30$. We can take $n = 17$ to have a
standard error < 0.3.
Thus, the quality assurance plan should call for
sampling $n = 17$ batteries and $m = 4$ plates per battery to
achieve a standard error < 0.3 at close to minimum cost.
Because the data show little variability within clusters, it
would be wasteful to use one-stage cluster sampling, i.e.,
$m = 9$. For one-stage cluster sampling we would need to
choose n so that

$$SE(\hat{\bar{y}}_{\text{unb}}) = \sqrt{\frac{10.10}{n(9)} + \left(1 - \frac{9}{9}\right)\frac{2.48}{n(9)}} = 0.3,$$

which gives $n = 12.47$, so we take n to be 13.
The cost of the optimal plan in terms of the cost $c_2$ of
measuring one plate is $17(6 c_2) + 17(4) c_2 = 170 c_2$,
whereas the cost of the one-stage cluster sample is
$13(6 c_2) + 13(9) c_2 = 195 c_2$.
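The sample-size and cost comparison can be reproduced with a short R sketch. It uses the variance formula from the start of Section I, with the within-cluster term divided by nm and with n/N taken as 0; `n.required` and `plan.cost` are hypothetical helper names, and costs are expressed in units of $c_2$.

```r
# Smallest n whose standard error does not exceed target.se, using
# Var = MSB/(n*M) + (1 - m/M) * MSW/(n*m)  (fpc for clusters ignored)
n.required <- function(target.se, m, M, MSB, MSW) {
  var.times.n <- MSB / M + (1 - m / M) * MSW / m
  ceiling(var.times.n / target.se^2)
}

# Total cost in units of c2: n clusters at cost.ratio each, plus n*m measurements
plan.cost <- function(n, m, cost.ratio) n * cost.ratio + n * m

n.two.stage <- n.required(0.3, m = 4, M = 9, MSB = 10.10, MSW = 2.48)
n.one.stage <- n.required(0.3, m = 9, M = 9, MSB = 10.10, MSW = 2.48)

c(n.two.stage = n.two.stage,
  cost.two.stage = plan.cost(n.two.stage, m = 4, cost.ratio = 6),
  n.one.stage = n.one.stage,
  cost.one.stage = plan.cost(n.one.stage, m = 9, cost.ratio = 6))
```

Because n and m must be whole numbers, it can be worth evaluating `plan.cost` for the integers on either side of the unrounded optimum before settling on a plan.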
Although we have discussed only designs where all the
cluster sizes are equal, we can use these methods with
unequal cluster sizes $M_i$ as well: just substitute the
average cluster size $\bar{M}$ for M in the preceding work and
decide the average subsample size $\bar{m}$ to take. Then either
take $\bar{m}$ observations in every cluster or allocate
observations so that

$$\frac{m_i}{M_i} = \text{constant}.$$

As long as the $M_i$'s do not vary too much, this should
produce a reasonable design. If the $M_i$'s are widely
variable and the cluster totals $t_i$ are correlated with the
$M_i$'s, a cluster sample with equal probabilities of
selecting each cluster is not necessarily very efficient; we
will discuss an alternative design when we cover Chapter 6.
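As an illustration of the constant-fraction allocation, the sketch below uses made-up cluster sizes (the vector `M.i` is hypothetical, not data from the notes) and rounds each $m_i$ to a whole number of units.

```r
# Hypothetical sizes of the sampled clusters (illustration only)
M.i <- c(8, 12, 10, 15, 5)
m.bar <- 4                     # planned average subsample size

# Common sampling fraction m_i / M_i implied by m.bar and the average cluster size
frac <- m.bar / mean(M.i)

# Round to whole units, taking at least 1 and at most M_i units per cluster
m.i <- pmin(M.i, pmax(1, round(frac * M.i)))

data.frame(M.i, m.i, fraction = round(m.i / M.i, 2))
```

After rounding, the fractions are only approximately constant, which is usually acceptable when the $M_i$'s are not too different.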
II. Systematic Sampling (Chapter 5.6, Lohr)
A sample obtained by randomly selecting one element from
the first k elements in the sampling frame and then every
kth element thereafter is called a 1-in-k systematic sample
with a random start.
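In R, the selection step might look like the following sketch. The helper name `systematic.sample` and the values of N and k are illustrative; the function returns positions in the sampling frame.

```r
# Positions selected by a 1-in-k systematic sample from a frame of N elements.
# systematic.sample is an illustrative helper, not a function from the notes.
systematic.sample <- function(N, k) {
  stopifnot(N >= k)
  start <- sample.int(k, 1)           # random start among the first k elements
  seq(from = start, to = N, by = k)   # then every kth element thereafter
}

set.seed(475)
s <- systematic.sample(N = 1000, k = 100)
s
```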
Systematic sampling is easier to perform in the field than
simple random sampling or stratified random sampling and
hence is less subject to selection errors by field workers.
Systematic sampling is particularly useful when the
population size N is not known in advance.
Example: Suppose we would like to take a random sample
of customers who shop in the Penn bookstore in a given
day. We do not know N or the sampling frame in advance
so we could not take a simple random sample. In contrast,
we could take a systematic sample (say, 1 in 20 shoppers).
In addition to being easier to perform and less subject to
interviewer error, systematic sampling sometimes provides
more information per unit cost than does simple random
sampling. A systematic sample is generally spread more
uniformly over the entire population and thus may provide
more information about the population than an equivalent
amount of data contained in a simple random sample.
Consider the following illustration. We wish to estimate
the proportion of travel vouchers that are filed incorrectly
from a stack of $N = 1000$ based on a sample of size $n = 10$.
Consider a 1-in-100 systematic sample of travel vouchers: a
voucher is drawn at random from the first 100 vouchers
(for example, number 3) and every 100th voucher thereafter
is included in the sample. Suppose that most of the first
500 vouchers have been correctly filed, but that because of
a change in clerks, the second 500 have all been incorrectly
filed. Simple random sampling could accidentally select a
disproportionately large number of the 10 vouchers from
either the first or the second 500 vouchers and hence yield
a poor estimate of the proportion of incorrectly filed
vouchers. In contrast, systematic sampling would select an
equal number of vouchers from each of the two groups and
would give a very accurate estimate of the proportion of
vouchers incorrectly filed.
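A small simulation makes the comparison concrete. As a simplifying assumption it treats all of the first 500 vouchers as correctly filed (the text says only "most"), so the true proportion is exactly 0.5; the object names are illustrative.

```r
# Assumption for illustration: vouchers 1-500 are all filed correctly (0)
# and vouchers 501-1000 are all filed incorrectly (1)
y <- rep(c(0, 1), each = 500)
N <- length(y)
n <- 10
k <- N / n

# Estimated proportion from each of the k possible systematic samples
p.sys <- sapply(1:k, function(start) mean(y[seq(start, N, by = k)]))

# Estimated proportions from simulated simple random samples of the same size
set.seed(475)
p.srs <- replicate(5000, mean(sample(y, n)))

# Root mean squared deviation of each estimator from the true proportion
c(sys = sqrt(mean((p.sys - mean(y))^2)),
  srs = sqrt(mean((p.srs - mean(y))^2)))
```

Under this assumption every systematic sample contains five vouchers from each half, so the systematic estimate never misses, while the simple random sample estimate varies from draw to draw.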
Other common uses of systematic samples:
(1) Industrial quality control sampling plans are most often
systematic in structure:
- An inspection plan for manufactured items moving
  along an assembly line may call for inspecting every
  50th item.
- The time of day is often important in assessing quality
  of worker performance and so an inspection plan may
  call for sampling the output of a workstation at
  systematically selected times of the day.
(2) Auditors are frequently confronted with the problem of
sampling a list of accounts to check compliance with
accounting procedures or to verify dollar amounts. The
most natural way to sample these lists is to choose accounts
systematically. If the accounts are ordered from largest to
smallest amount, then the systematic random sample does a
good job of sampling uniformly from accounts of different
types.
(3) Market researchers and opinion pollsters who sample
people on the move very often employ a systematic design.
Every 20th customer at a checkout counter may be asked his
or her opinion on the taste, color or texture of a food
product. Every 10th person boarding a bus may be asked to
fill out a questionnaire on bus service. Every 100th car
entering an amusement park may be stopped and the driver
questioned on various advertising policies of the park.
Estimation from systematic samples
Systematic sampling is really a special case of one-stage
cluster sampling where we draw one cluster. Suppose we
want to take a 1-in-4 systematic sample from a population
that has 12 units:
1 2 3 4 5 6 7 8 9 10 11 12
To take the systematic sample, we choose a number
randomly between 1 and 4, draw that unit and then every
fourth unit thereafter.
The population consists of four clusters
{1,5,9}, {2,6,10}, {3,7,11}, {4,8,12}
and we are taking a simple random sample of one cluster.
Our estimate of the population mean is the sample mean of
the chosen cluster i, $\bar{y}_i = \hat{\bar{y}}_{\text{sys}}$.
This is the unbiased estimate of the population mean for a
one-stage cluster sample.
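The cluster structure of the 12-unit example can be generated directly in R. This is a small sketch with illustrative object names.

```r
k <- 4
units <- 1:12

# Units that share the same position within each block of k
# form one possible systematic sample, i.e. one cluster
clusters <- split(units, (units - 1) %% k + 1)
clusters

# Taking the systematic sample amounts to choosing one cluster at random
set.seed(475)
chosen <- clusters[[sample.int(k, 1)]]
chosen
```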
To express the variance of $\hat{\bar{y}}_{\text{sys}}$ using our cluster sampling
variance formulas, let the population be of size NM, where
the number of clusters N is equal to k (for a 1-in-k
systematic sample) and the cluster sizes are equal to M.
Recall that for a one-stage cluster sample with equal cluster sizes,

$$\operatorname{Var}(\hat{\bar{y}}_{\text{unb}}) = \left(1 - \frac{n}{N}\right)\frac{S_t^2}{n M^2},$$

where

$$S_t^2 = \frac{\sum_{i=1}^{N} (t_i - \bar{t}_U)^2}{N - 1} = M (MSB).$$

For a systematic sample, the number of clusters sampled is
$n = 1$, so that

$$\operatorname{Var}(\hat{\bar{y}}_{\text{sys}}) = \left(1 - \frac{1}{N}\right)\frac{S_t^2}{M^2} = \left(1 - \frac{1}{N}\right)\frac{MSB}{M}.$$

A simple random sample of size M (the size of the
systematic sample) from a population of size NM has
variance

$$\operatorname{Var}(\hat{\bar{y}}_{\text{srs}}) = \left(1 - \frac{M}{NM}\right)\frac{S^2}{M} = \left(1 - \frac{1}{N}\right)\frac{S^2}{M}.$$

Thus, the systematic sample has smaller variance than a
simple random sample if $MSB < S^2$.
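To see the comparison numerically, the sketch below evaluates both variances for a hypothetical population of 12 units whose values increase along the list (the values 1 to 12 are invented for illustration).

```r
# Hypothetical population: 12 units listed in increasing order of y
y <- 1:12
k <- 4                  # number of clusters N for a 1-in-k systematic sample
M <- length(y) / k      # cluster size
cluster <- (seq_along(y) - 1) %% k + 1

cluster.means <- tapply(y, cluster, mean)
MSB <- M * sum((cluster.means - mean(y))^2) / (k - 1)
S2 <- var(y)            # population variance S^2 (divisor NM - 1)

var.sys <- (1 - 1 / k) * MSB / M
var.srs <- (1 - 1 / k) * S2 / M
c(MSB = MSB, S2 = S2, var.sys = var.sys, var.srs = var.srs)
```

Here MSB is smaller than $S^2$, so the systematic sample has the smaller variance. `var.sys` also equals the average squared deviation of the k possible sample means from the population mean, which is a useful check on the formula.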
A problem with systematic sampling is that because we
only sample one cluster, we have no way to estimate $S_t^2$
and no way to estimate $\operatorname{Var}(\hat{\bar{y}}_{\text{sys}})$ from the sample
itself. We need to know something about the structure of the
population to estimate the variance.
Let’s look at three different population structures.
1. The list is in random order. In many situations, the
ordering of the population is unrelated to the characteristics
of interest, as when the list of persons in the sampling
frame is in alphabetic order. In this situation, we can use
the simple random sampling formula to estimate $\operatorname{Var}(\hat{\bar{y}}_{\text{sys}})$.
2. The sampling frame is in increasing or decreasing
order. Financial records may be listed with the largest
amounts first and the smallest last. Such a population is
said to have positive autocorrelation: Adjacent elements
tend to be more similar than elements that are farther apart.
A systematic sample forces the sample values to be spread
out, making a systematic sample more efficient than the
same sized simple random sample. When the frame is in
increasing or decreasing order, you may use the simple
random sampling formula for standard error, but it will
likely be an overestimate.
3. The sampling frame has a periodic pattern. A population
is periodic if the elements of a population have values that
tend to cycle upward and downward in a regular pattern
when listed. For example, the daily sales volume for a
grocery store may be cyclical within weeks. For a periodic
population, the effectiveness of a 1-in-k systematic sample
depends on the value we choose for k. If we sample daily
sales every Wednesday, we will probably underestimate the
true average daily sales volume. If we sample every
Friday, we will probably overestimate the true average
daily sales volume. If we sample every ninth day, then
we'll hit both the peaks and valleys of the cyclical trend, and
the systematic sample will behave like a simple random
sample.
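The effect of k on a periodic population can be illustrated with an invented weekly sales pattern. The values in `pattern` are hypothetical, chosen so that Wednesday is low and Friday is high, and no random noise is added.

```r
# Hypothetical weekly sales pattern (Monday to Sunday), repeated for 630 days
pattern <- c(80, 90, 70, 100, 130, 150, 110)
y <- rep(pattern, length.out = 630)

# Mean of each of the k possible 1-in-k systematic samples
sys.means <- function(y, k) {
  sapply(1:k, function(start) mean(y[seq(start, length(y), by = k)]))
}

means.k7 <- sys.means(y, 7)  # every 7th day: always the same weekday
means.k9 <- sys.means(y, 9)  # every 9th day: cycles through all weekdays

rbind(k7 = range(means.k7),
      k9 = range(means.k9),
      population = rep(mean(y), 2))
```

With k = 7 the estimate depends entirely on which weekday the random start lands on. With k = 9 every possible sample visits each weekday equally often in this noise-free setup, so all of the sample means coincide with the population mean; with real data the agreement would be approximate rather than exact.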
If periodicity in a population is a concern, one solution is to
use interpenetrating systematic samples. Instead of taking
one systematic sample, take several systematic samples
from the population. Then you can use the formulas for
cluster samples to estimate variances; each systematic
sample acts as one cluster.
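A minimal sketch of interpenetrating systematic samples follows. It uses a simulated population with a weekly cycle (all values hypothetical) and deliberately picks k as a multiple of 7, the situation in which a single systematic sample would be most misleading. The variance estimate is the usual one-stage cluster formula with each systematic sample treated as one sampled cluster.

```r
set.seed(475)
# Hypothetical population: 630 daily sales values, weekly cycle plus noise
pattern <- c(80, 90, 70, 100, 130, 150, 110)
y <- rep(pattern, length.out = 630) + rnorm(630, sd = 10)

k <- 70          # there are k possible 1-in-70 samples, each of M = 9 days
n.samples <- 5   # number of interpenetrating systematic samples drawn

# Distinct random starts: a simple random sample of n.samples of the k clusters
starts <- sample.int(k, n.samples)
sample.means <- sapply(starts, function(s) mean(y[seq(s, length(y), by = k)]))

# Estimated population mean: average of the systematic sample means
y.hat <- mean(sample.means)

# (1 - n/N) * s_t^2 / (n * M^2) with equal cluster sizes reduces to
# (1 - n/N) * var(cluster means) / n, since each total is M times its mean
se.hat <- sqrt((1 - n.samples / k) * var(sample.means) / n.samples)

c(estimate = y.hat, se = se.hat)
```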
Unless the frame is ordered or periodic in the variable of
interest, systematic sampling is likely to produce a sample
that behaves like a simple random sample.