Statistics 475 Notes 14
Reading: Lohr, Chapters 5.5-5.6

I. Designing a cluster sample

Consider equal sized clusters. The variance of the unbiased estimator of the population mean is

$$\mathrm{Var}(\hat{\bar{y}}_{unb}) = \left(1 - \frac{n}{N}\right)\frac{MSB}{nM} + \left(1 - \frac{m}{M}\right)\frac{MSW}{nm}.$$

[Note: For equal sized clusters, $\mathrm{Var}(\hat{\bar{y}}_{unb}) = \mathrm{Var}(\hat{\bar{y}}_{r})$.]

Consider the following cost function for sampling n clusters and m units per cluster:

$$\text{total cost} = C = c_1 n + c_2 n m,$$

where $c_1$ is the fixed cost of sampling each cluster (not including the cost of measuring ssu's) and $c_2$ is the additional cost of measuring each ssu. Assuming that N is large, so that the finite population correction for sampling clusters can be ignored, and using calculus, one can determine that the values

$$n = \frac{C}{c_1 + c_2 m}, \qquad m = \sqrt{\frac{c_1 M (MSW)}{c_2 (MSB - MSW)}}$$

minimize $\mathrm{Var}(\hat{\bar{y}}_{unb})$ for fixed total cost C, where

$$MSB = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} (\bar{y}_{iU} - \bar{y}_U)^2}{N - 1}, \qquad MSW = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar{y}_{iU})^2}{N(M - 1)}.$$

Example: One of the key quality assurance measurements in the manufacturing of automobile batteries is the thickness of lead plates. Positive plates are manufactured to be thicker than negative plates, so the two must be treated separately. It is desired to set up a sampling plan to sample n batteries per day and make m negative plate thickness measurements per battery, so that the standard error of the estimated mean plate thickness is 0.3. (Measurements of thickness are in thousandths of an inch.) The cost of cutting a battery open is six times the cost of measuring a plate. There are M = 9 plates per battery. A preliminary study of four batteries, with nine plate thickness measurements per battery, gave the following data:

    battery1=c(97,101,97,97,99,100,96,100,100);
    battery2=c(95,96,96,99,96,97,95,96,100);
    battery3=c(99,96,97,97,96,98,99,98,100);
    battery4=c(94,95,97,98,97,97,97,95,96);

We first estimate MSB and MSW by performing a one-way ANOVA on the preliminary study data.
    plate.thickness=c(battery1,battery2,battery3,battery4);
    batterynumber=c(rep(1,9),rep(2,9),rep(3,9),rep(4,9));
    aov.battery=aov(plate.thickness~as.factor(batterynumber));
    summary(aov.battery);

                             Df Sum Sq Mean Sq F value  Pr(>F)
    as.factor(batterynumber)  3 30.306  10.102  4.0747 0.01470 *
    Residuals                32 79.333   2.479
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Thus, we estimate MSB = 10.10 and MSW = 2.48. The cost of cutting open a battery is six times that of measuring a plate, so that $c_1 / c_2 = 6$ and the optimal m is

$$m = \sqrt{\frac{c_1 M (MSW)}{c_2 (MSB - MSW)}} = \sqrt{\frac{6 \cdot 9 \cdot (2.48)}{10.10 - 2.48}} = 4.19.$$

We can round and sample m = 4 plates per battery. To achieve a SE of 0.3, we want to find n such that

$$SE(\hat{\bar{y}}_{unb}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{MSB}{nM} + \left(1 - \frac{m}{M}\right)\frac{MSW}{nm}} = 0.3,$$

which is equivalent to, assuming that $n/N \approx 0$,

$$SE(\hat{\bar{y}}_{unb}) = \sqrt{\frac{10.10}{n(9)} + \left(1 - \frac{4}{9}\right)\frac{2.48}{n(4)}} = \sqrt{\frac{1.4667}{n}} = 0.3,$$

which gives n = 16.30. We can take n = 17 to have a standard error < 0.3. Thus, the quality assurance plan should call for sampling n = 17 batteries and m = 4 plates per battery to achieve a standard error < 0.3 at approximately minimum cost.

Because the data show little variability within clusters, it would be wasteful to use one-stage cluster sampling, i.e., m = 9. For one-stage cluster sampling we would need to choose n so that

$$SE(\hat{\bar{y}}_{unb}) = \sqrt{\frac{10.10}{n(9)} + \left(1 - \frac{9}{9}\right)\frac{2.48}{n(9)}} = 0.3,$$

which gives n = 12.47, so we take n to be 13. The cost of the two-stage plan in terms of the cost $c_2$ of measuring one plate is $17(6 c_2) + 17 \cdot 4 \cdot c_2 = 170 c_2$, whereas the cost of the one-stage cluster sample is $13(6 c_2) + 13 \cdot 9 \cdot c_2 = 195 c_2$.

Although we have discussed only designs where all the cluster sizes are equal, we can use these methods with unequal cluster sizes $M_i$ as well: just substitute the average cluster size $\bar{M}$ for M in the preceding work and decide the average subsample size $\bar{m}$ to take. Then either take $\bar{m}$ observations in every cluster or allocate observations so that $m_i / M_i$ is constant. As long as the $M_i$'s do not vary too much, this should produce a reasonable design.
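The design calculation for the battery example can be sketched in R. This is only an illustrative sketch: it plugs the estimates MSB = 10.10 and MSW = 2.48, M = 9 and the cost ratio c1/c2 = 6 into the variance formula from the start of this section, with n/N taken to be 0. The helper names `se_mean`, `smallest_n` and `plan_cost` are hypothetical and are not part of the notes.

```r
# Estimates and constants taken from the battery example.
msb <- 10.10        # between-battery mean square from the ANOVA
msw <- 2.48         # within-battery mean square from the ANOVA
M <- 9              # plates per battery
cost_ratio <- 6     # c1 / c2

# Optimal subsample size m = sqrt(c1 * M * MSW / (c2 * (MSB - MSW))).
m_opt <- sqrt(cost_ratio * M * msw / (msb - msw))
m <- round(m_opt)

# Standard error of the estimated mean, ignoring the fpc for clusters (n/N ~ 0).
se_mean <- function(n, m) sqrt(msb / (n * M) + (1 - m / M) * msw / (n * m))

# Smallest number of batteries whose standard error falls below the target.
smallest_n <- function(m, target = 0.3) {
  n <- 1
  while (se_mean(n, m) >= target) n <- n + 1
  n
}

# Cost in units of c2: each battery costs cost_ratio, each plate costs 1.
plan_cost <- function(n, m) n * cost_ratio + n * m

n_two_stage <- smallest_n(m)   # two-stage plan with m plates per battery
n_one_stage <- smallest_n(M)   # one-stage plan measuring all M plates

print(c(m = m, n = n_two_stage, cost = plan_cost(n_two_stage, m)))
print(c(m = M, n = n_one_stage, cost = plan_cost(n_one_stage, M)))
```

Calling `smallest_n` with other values of m shows how the required number of batteries and the total cost change with the subsample size.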
If the $M_i$'s are widely variable and the $t_i$'s are correlated with the $M_i$'s, a cluster sample with equal probabilities of selecting each cluster is not necessarily very efficient; we will discuss an alternative design when we cover Chapter 6.

II. Systematic Sampling (Chapter 5.6, Lohr)

A sample obtained by randomly selecting one element from the first k elements in the sampling frame and then every kth element thereafter is called a 1-in-k systematic sample with a random start.

Systematic sampling is easier to perform in the field than simple random sampling or stratified random sampling and hence is less subject to selection errors by field workers.

Systematic sampling is particularly useful when the population size N is not known in advance.

Example: Suppose we would like to take a random sample of customers who shop in the Penn bookstore on a given day. We do not know N or the sampling frame in advance, so we could not take a simple random sample. In contrast, we could take a systematic sample (say, 1 in 20 shoppers).

In addition to being easier to perform and less subject to interviewer error, systematic sampling sometimes provides more information per unit cost than does simple random sampling. A systematic sample is generally spread more uniformly over the entire population and thus may provide more information about the population than an equivalent amount of data contained in a simple random sample.

Consider the following illustration. We wish to estimate the proportion of travel vouchers that are filed incorrectly from a stack of N = 1000 based on a sample of size n = 10. Consider a 1-in-100 systematic sample of travel vouchers: a voucher is drawn at random from the first 100 vouchers (for example, number 3) and every 100th voucher thereafter is included in the sample. Suppose that most of the first 500 vouchers have been correctly filed, but that because of a change in clerks, the second 500 have all been incorrectly filed.
Simple random sampling could accidentally select a larger number of the 10 vouchers from either the first or the second 500 vouchers and hence yield a poor estimate of the proportion of incorrectly filed vouchers. In contrast, systematic sampling would select an equal number of vouchers from each of the two groups and would give a very accurate estimate of the proportion of vouchers incorrectly filed.

Other common uses of systematic samples:

(1) Industrial quality control sampling plans are most often systematic in structure: An inspection plan for manufactured items moving along an assembly line may call for inspecting every 50th item. The time of day is often important in assessing quality of worker performance, and so an inspection plan may call for sampling the output of a workstation at systematically selected times of the day.

(2) Auditors are frequently confronted with the problem of sampling a list of accounts to check compliance with accounting procedures or to verify dollar amounts. The most natural way to sample these lists is to choose accounts systematically. If the accounts are ordered from largest to smallest amount, then the systematic random sample does a good job of sampling uniformly from accounts of different types.

(3) Market researchers and opinion pollsters who sample people on the move very often employ a systematic design. Every 20th customer at a checkout counter may be asked his or her opinion on the taste, color or texture of a food product. Every 10th person boarding a bus may be asked to fill out a questionnaire on bus service. Every 100th car entering an amusement park may be stopped and the driver questioned on various advertising policies of the park.

Estimation from systematic samples

Systematic sampling is really a special case of one-stage cluster sampling where we draw one cluster.
Suppose we want to take a 1-in-4 systematic sample from a population that has 12 units:

    1 2 3 4 5 6 7 8 9 10 11 12

To take the systematic sample, we choose a number randomly between 1 and 4, draw that unit and then every fourth unit thereafter. The population consists of four clusters, {1,5,9}, {2,6,10}, {3,7,11} and {4,8,12}, and we are taking a simple random sample of one cluster.

Our estimate of the population mean is the sample mean of the chosen cluster i,

$$\hat{\bar{y}}_{sys} = \bar{y}_i.$$

This is the unbiased estimate of the population mean for a one-stage cluster sample.

To express the variance of $\hat{\bar{y}}_{sys}$ using our cluster sampling variance formulas, let the population be of size NM, where the number of clusters N is equal to k (for a 1-in-k systematic sample) and the cluster sizes are equal to M.

Recall that for a cluster sample with equal cluster sizes,

$$\mathrm{Var}(\hat{\bar{y}}_{unb}) = \left(1 - \frac{n}{N}\right)\frac{S_t^2}{nM^2}, \qquad \text{where } S_t^2 = \frac{\sum_{i=1}^{N}(t_i - \bar{t})^2}{N - 1} = M(MSB).$$

For a systematic sample, the number of clusters sampled is n = 1, so that

$$\mathrm{Var}(\hat{\bar{y}}_{sys}) = \left(1 - \frac{1}{N}\right)\frac{S_t^2}{M^2} = \left(1 - \frac{1}{N}\right)\frac{MSB}{M}.$$

A simple random sample of size M (the size of the systematic sample) from a population of size NM has variance

$$\mathrm{Var}(\hat{\bar{y}}_{srs}) = \left(1 - \frac{1}{N}\right)\frac{S^2}{M}.$$

Thus, the systematic sample has smaller variance than a simple random sample if $MSB < S^2$.

A problem with systematic sampling is that because we only sample one cluster, we have no way to estimate $S_t^2$ and no way to estimate $\mathrm{Var}(\hat{\bar{y}}_{sys})$. We need to know something about the structure of the population to estimate the variance. Let's look at three different population structures.

1. The list is in random order. In many situations, the ordering of the population is unrelated to the characteristics of interest, as when the list of persons in the sampling frame is in alphabetic order. In this situation, we can use the simple random sampling formula to estimate $\mathrm{Var}(\hat{\bar{y}}_{sys})$.

2. The sampling frame is in increasing or decreasing order. Financial records may be listed with the largest amounts first and the smallest last.
Such a population is said to have positive autocorrelation: Adjacent elements tend to be more similar than elements that are farther apart. A systematic sample forces the sample values to be spread out, making a systematic sample more efficient than a simple random sample of the same size. When the frame is in increasing or decreasing order, you may use the simple random sampling formula for the standard error, but it will likely be an overestimate.

3. The sampling frame has a periodic pattern. A population is periodic if the elements of the population have values that tend to cycle upward and downward in a regular pattern when listed. For example, the daily sales volume for a grocery store may be cyclical within weeks. For a periodic population, the effectiveness of a 1-in-k systematic sample depends on the value we choose for k. If we sample daily sales every Wednesday, we will probably underestimate the true average daily sales volume. If we sample every Friday, we will probably overestimate the true average daily sales volume. If we sample every ninth day, then we'll hit both the peaks and valleys of the cyclical trend, and the systematic sample will behave like a simple random sample.

If periodicity in a population is a concern, one solution is to use interpenetrating systematic samples. Instead of taking one systematic sample, take several systematic samples from the population. Then you can use the formulas for cluster samples to estimate variances; each systematic sample acts as one cluster.

Systematic sampling is likely to produce a sample that behaves like a simple random sample.
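The interpenetrating idea can be illustrated with a small R sketch. Everything in it is assumed for illustration and does not come from the notes: the simulated year of daily sales, the weekday effects, the interval k = 14 and the choice of four systematic samples. Each systematic sample is treated as one cluster in a one-stage cluster sample of 4 clusters out of k, and the equal-cluster-size variance formula is applied to the cluster means.

```r
set.seed(475)  # arbitrary seed so the sketch is reproducible

# Hypothetical periodic population: 52 weeks of daily sales with a weekly cycle.
weekday_effect <- c(-20, -10, -15, 0, 25, 30, -10)
sales <- 100 + rep(weekday_effect, times = 52) + rnorm(364, sd = 5)

k <- 14          # sampling interval; the frame splits into k possible systematic samples
n_samples <- 4   # number of interpenetrating systematic samples to draw

# Distinct random starts: a simple random sample of n_samples clusters out of k.
starts <- sample(k, n_samples)

# Mean of each systematic sample; every sample contains 364 / k = 26 days.
sample_means <- sapply(starts, function(s) mean(sales[seq(s, length(sales), by = k)]))

# One-stage cluster estimate with equal cluster sizes: average the cluster means,
# and estimate the variance from their spread, with an fpc for n_samples out of k.
est_mean <- mean(sample_means)
est_se <- sqrt((1 - n_samples / k) * var(sample_means) / n_samples)

print(c(estimate = est_mean, se = est_se, true_mean = mean(sales)))
```

Because k = 14 is a multiple of the 7-day period, each individual systematic sample sees only one weekday, so the spread between the sample means is what carries the information about the periodic pattern.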