Statistics 475 Notes 15
Reading: Lohr, Chapters 6.1-6.2

Ruben's question from last class: In planning a two-stage cluster design, what happens if the cluster size $M$ is large but unknown? Suppose the cost is $c_1$ for sampling a cluster and $c_2$ for sampling a unit within a cluster. The optimal subsample size $m$ in each cluster can be written as
$$ m_{opt} = \sqrt{\frac{c_1 M \,\mathrm{MSW}}{c_2 (\mathrm{MSB} - \mathrm{MSW})}} = \sqrt{\frac{c_1}{c_2} \cdot \frac{M(N-1)}{NM-1}\left(\frac{1}{R_a^2} - 1\right)}, $$
where $R_a^2 = 1 - \mathrm{MSW}/S^2$. If $M$ is large, then the optimal $m$ is approximately
$$ m_{opt} \approx \sqrt{\frac{c_1}{c_2}\left(1 - \frac{1}{N}\right)\left(\frac{1}{R_a^2} - 1\right)}, $$
which does not involve $M$. Thus, when the clusters are large, we can choose the subsample size without knowing $M$ precisely.

I. Sampling clusters with unequal probabilities: motivating example

O'Brien et al. (1995, Journal of the American Medical Association) sought to estimate the preferences of nursing home residents in the Philadelphia area for life-sustaining treatments: Do they wish to have cardiopulmonary resuscitation (CPR) if the heart stops beating, to be transferred to a hospital if a serious illness develops, or to be fed through an enteral tube if no longer able to eat?

The target population was all residents of licensed nursing homes in the Philadelphia area. There were 294 such homes with a total of 37,652 beds (before sampling, the investigators knew only the number of beds, not the total number of residents). Because the survey was to be done in person, cluster sampling was essential for keeping survey costs manageable.

Consider a two-stage cluster sample in which each nursing home has the same probability of being selected and the subsample size in each sampled home is proportional to the number of beds in the home. This is a self-weighting sample, meaning that the weights for each bed in the sample are the same: each bed in the population has the same probability of being sampled, and each sampled bed represents the same number of beds in the population. However, this design has the following drawbacks:

(1) We would expect the total number of patients in a home who desire CPR ($t_i$) to be roughly proportional to the number of beds in the home ($M_i$), so the unbiased estimator of the population mean would have large variance. Using a ratio estimator would help alleviate this concern.

(2) A self-weighting equal-probability sample may be cumbersome to administer. It may require driving out to a nursing home just to interview one or two residents, and equalizing the workloads of interviewers may be difficult.

(3) The cost of the sample is unknown in advance: a random sample of 40 homes may consist primarily of large nursing homes, which would lead to greater expense than anticipated.

Instead of taking a cluster sample of homes with equal probabilities, the investigators randomly drew a sample of 57 nursing homes with probabilities proportional to the number of beds. This is called probability proportional to size (pps) sampling of clusters. They then took a simple random sample of 30 beds (and their occupants) from a list of all beds within each sampled nursing home. If the number of residents equals the number of beds, and if a home has the same number of beds when visited as are listed in the sampling frame, then this design results in every resident having the same probability of being included in the sample. The cost is known before the sample is selected, the same number of interviews is taken at each home, and the estimator of a population total will likely have a smaller variance than if we had sampled the nursing homes with equal probabilities.
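A rough check of the self-weighting claim: a bed in home $i$ is interviewed only if home $i$ is one of the 57 sampled homes and that bed is one of the 30 beds subsampled within it. The R sketch below illustrates the calculation, treating the first-stage probability for home $i$ as approximately $57 M_i / 37{,}652$ (ignoring the distinction between with- and without-replacement selection of homes); the individual bed counts used are hypothetical, for illustration only.

# Rough self-weighting check for the pps design described above.
# The counts 294, 37652, 57, and 30 come from the study description;
# the bed counts M_i below are hypothetical.
n_homes       <- 57      # homes sampled with probability proportional to beds
total_beds    <- 37652   # total beds across the 294 licensed homes
beds_per_home <- 30      # beds taken by SRS within each sampled home

M_i <- c(50, 120, 300, 475)               # hypothetical bed counts for four homes

p_home <- n_homes * M_i / total_beds      # approx. P(home i is in the sample)
p_bed  <- p_home * (beds_per_home / M_i)  # P(a given bed in home i is interviewed)

p_bed   # equals 57*30/37652 = 0.0454 for every home, regardless of M_i

Because the home-level probability is proportional to $M_i$ and the within-home probability is proportional to $1/M_i$, the product is the same constant for every bed, which is exactly the self-weighting property described above.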
II. Unequal Probability Sampling With Replacement

We first consider how to sample clusters proportional to size with replacement. Sampling with replacement means that the selection probabilities do not change after we have drawn the first unit. Although sampling without replacement is more efficient, sampling with replacement is often used because of the ease of selecting and analyzing the samples.

Let $\psi_i = P(\text{select unit } i \text{ on the first draw})$.

For sampling with replacement, $\psi_i$ is also the probability that unit $i$ is selected on the second draw, or the third draw, or any other given draw. The overall probability that unit $i$ is selected in the sample at least once is
$$ \pi_i = 1 - P(\text{unit } i \text{ not in sample}) = 1 - (1 - \psi_i)^n. $$

Example 1: Consider the following population of introductory statistics classes at a college. The college has 15 such classes; class $i$ has $M_i$ students, for a total of 647 students. We decide to sample 5 classes with replacement, with probability proportional to $M_i$, and then collect a questionnaire from each student in the sampled classes. For this example, $\psi_i = M_i / 647$.

Class Number    M_i     psi_i
 1               44     0.068006
 2               33     0.051005
 3               26     0.040185
 4               22     0.034003
 5               76     0.117465
 6               63     0.097372
 7               20     0.030912
 8               44     0.068006
 9               54     0.083462
10               34     0.052550
11               46     0.071097
12               24     0.037094
13               46     0.071097
14              100     0.154560
15               15     0.023184
Total           647     1

One way to sample the clusters with probabilities $\psi_i$ is the cumulative-size method. Calculate the cumulative totals of the $\psi_i$:

Class Number    psi_i       Cumulative total of psi_i
 1              0.068006    0.068006
 2              0.051005    0.119011
 3              0.040185    0.159196
 4              0.034003    0.193199
 5              0.117465    0.310665
 6              0.097372    0.408037
 7              0.030912    0.438949
 8              0.068006    0.506955
 9              0.083462    0.590417
10              0.052550    0.642967
11              0.071097    0.714065
12              0.037094    0.751159
13              0.071097    0.822257
14              0.154560    0.976816
15              0.023184    1.000000
Total           1

Sample a random uniform number between 0 and 1:

> runif(1)
[1] 0.6633096

Find the smallest cluster $i$ whose cumulative total of $\psi_i$ is above the sampled random uniform number. This is the sampled cluster. For the above random uniform number, the sampled cluster is 11.

Another method for sampling clusters with probability proportional to $\psi_i$ is Lahiri's (1951) method:
1. Draw a random number between 1 and $N$ (the number of clusters). This indicates which cluster $i$ you are considering.
2. Draw a random number between 1 and $\max(M_i)$. If the random number is less than or equal to $M_i$, include cluster $i$ in the sample; otherwise, go back to step 1.
3. Repeat until the desired sample size is obtained.

Lahiri's method for Example 1: The largest class has $\max(M_i) = 100$ students, so we generate pairs of random integers, the first between 1 and 15 and the second between 1 and 100, until the sample contains five clusters.

First random number       M_i    Second random    Action
(cluster to consider)            number
12                         24     6               6 < 24; include cluster 12 in sample
14                        100    24               Include in sample
 1                         44    65               65 > 44; discard pair of numbers and try again
 7                         20    84               84 > 20; try again
10                         34    49               Try again
14                        100    47               Include in sample
15                         15    43               Try again
 5                         76    24               Include in sample
11                         46    87               Try again
 1                         44    36               Include in sample

Proof that Lahiri's method produces a probability-proportional-to-size sample: Lahiri's method is an example of rejection sampling. Consider selecting one cluster with Lahiri's method. Let $V_j = 1$ if a cluster is selected using the $j$th pair of random numbers and $V_j = 0$ otherwise. On any given pair of random numbers, cluster $i$ is both considered and accepted with probability $\frac{1}{N}\cdot\frac{M_i}{\max_k M_k}$, so
$$ P(\text{cluster } i \text{ is selected} \mid V_j = 1, V_1 = 0, \ldots, V_{j-1} = 0)
 = \frac{\dfrac{1}{N}\cdot\dfrac{M_i}{\max_k M_k}}{\displaystyle\sum_{k=1}^{N} \frac{1}{N}\cdot\frac{M_k}{\max_k M_k}}
 = \frac{M_i}{\displaystyle\sum_{k=1}^{N} M_k} = \psi_i. $$
Since this holds for all $j$, the draw on which a cluster is selected is independent of which cluster is selected, and
$$ P(\text{cluster } i \text{ is selected}) = \frac{M_i}{\displaystyle\sum_{k=1}^{N} M_k} = \psi_i. $$
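Both selection methods are easy to program. The R sketch below uses the class sizes from Example 1 and draws $n = 5$ clusters with replacement, first by the cumulative-size method and then by Lahiri's method; the particular clusters drawn will of course differ from run to run.

# Class sizes from Example 1
M   <- c(44, 33, 26, 22, 76, 63, 20, 44, 54, 34, 46, 24, 46, 100, 15)
N   <- length(M)
psi <- M / sum(M)      # selection probabilities psi_i = M_i / 647
n   <- 5

# Cumulative-size method: find the smallest cluster whose cumulative psi
# exceeds a Uniform(0,1) draw; repeat n times (with replacement).
cum_psi <- cumsum(psi)
cum_size_sample <- sapply(runif(n), function(u) which(cum_psi > u)[1])

# Lahiri's method (rejection sampling): draw a candidate cluster and a
# random integer between 1 and max(M); keep the cluster if the integer
# is less than or equal to M_i.
lahiri_sample <- integer(0)
while (length(lahiri_sample) < n) {
  i <- sample(1:N, 1)                  # candidate cluster
  if (sample(1:max(M), 1) <= M[i]) {   # accept with probability M_i / max(M)
    lahiri_sample <- c(lahiri_sample, i)
  }
}

cum_size_sample
lahiri_sample
# sample(1:N, n, replace = TRUE, prob = psi) draws from the same design directly.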
III. Estimation Using Unequal Probability Sampling With Replacement

Because we are sampling with replacement, the sample may contain the same unit more than once. To keep track of which clusters occur multiple times in the sample, define the random variables $Q_i$ by
$$ Q_i = \text{number of times unit } i \text{ occurs in the sample}. $$

An unbiased estimator of the population total is
$$ \hat{t}_\psi = \frac{1}{n} \sum_{i=1}^{N} Q_i \frac{t_i}{\psi_i}. \qquad (1.1) $$

Note that $\sum_{i=1}^{N} Q_i = n$ and $E[Q_i] = n\psi_i$, so that
$$ E[\hat{t}_\psi] = \frac{1}{n} \sum_{i=1}^{N} \frac{t_i}{\psi_i} E[Q_i] = \frac{1}{n} \sum_{i=1}^{N} \frac{t_i}{\psi_i}\, n\psi_i = \sum_{i=1}^{N} t_i = t, $$
so $\hat{t}_\psi$ is an unbiased estimator of the population total.

Estimator (1.1) can be motivated as follows. Suppose we sample just one cluster, $n = 1$. Then the chosen cluster $i$ represents the proportion $\psi_i$ of the units in the population, and so a natural estimator of the total is $t_i / \psi_i$. Estimator (1.1) averages this estimate over the $n$ clusters in the sample.

Variance of $\hat{t}_\psi$: If $n = 1$, we have
$$ \mathrm{Var}[\hat{t}_\psi] = E[(\hat{t}_\psi - t)^2] = \sum_{\text{possible samples } S} P(S)\,(\hat{t}_S - t)^2 = \sum_{i=1}^{N} \psi_i \left( \frac{t_i}{\psi_i} - t \right)^2. $$
For general $n$, the estimator (1.1) is just the average of $n$ independent observations, each with variance $\sum_{i=1}^{N} \psi_i (t_i/\psi_i - t)^2$, so
$$ \mathrm{Var}[\hat{t}_\psi] = \frac{1}{n} \sum_{i=1}^{N} \psi_i \left( \frac{t_i}{\psi_i} - t \right)^2. \qquad (1.2) $$

To estimate $\mathrm{Var}[\hat{t}_\psi]$ from a sample, we might think we could use a formula of the same form as (1.2), but this will not work. Equation (1.2) involves a weighted average of the $(t_i/\psi_i - t)^2$, weighted by the unequal probabilities of selection. But in taking the sample, we have already used the unequal probabilities: they appear in the random variables $Q_i$. If we included the $\psi_i$'s again as multipliers in estimating the sample variance, we would be using the unequal probabilities twice. Instead, to estimate the variance, use
$$ \widehat{\mathrm{Var}}[\hat{t}_\psi] = \frac{1}{n} \sum_{i=1}^{N} Q_i \frac{\left( \dfrac{t_i}{\psi_i} - \hat{t}_\psi \right)^2}{n-1}. $$
$\widehat{\mathrm{Var}}[\hat{t}_\psi]$ is just the sample variance of the $t_i/\psi_i$'s divided by the sample size $n$.

$\widehat{\mathrm{Var}}[\hat{t}_\psi]$ is an unbiased estimate of the variance because
$$ E\!\left[\widehat{\mathrm{Var}}(\hat{t}_\psi)\right]
 = \frac{1}{n(n-1)} E\!\left[ \sum_{i=1}^{N} Q_i \left( \frac{t_i}{\psi_i} - \hat{t}_\psi \right)^2 \right]
 = \frac{1}{n(n-1)} E\!\left[ \sum_{i=1}^{N} Q_i \left( \frac{t_i}{\psi_i} - t \right)^2 - n(\hat{t}_\psi - t)^2 \right] $$
$$ = \frac{1}{n(n-1)} \left[ \sum_{i=1}^{N} n\psi_i \left( \frac{t_i}{\psi_i} - t \right)^2 - n\,\mathrm{Var}(\hat{t}_\psi) \right]
 = \frac{1}{n(n-1)} \left[ n^2\,\mathrm{Var}(\hat{t}_\psi) - n\,\mathrm{Var}(\hat{t}_\psi) \right]
 = \mathrm{Var}(\hat{t}_\psi). $$

Example 1 continued: For the situation in Example 1 and the sample {12, 14, 14, 5, 1} we selected using Lahiri's method, we have the following data, where $t_i$ is the total amount of time (in hours) that the students in class $i$ spent studying statistics:

Class    psi_i       t_i     t_i / psi_i
12        24/647      75     2021.875
14       100/647     203     1313.410
14       100/647     203     1313.410
 5        76/647     191     1626.013
 1        44/647     168     2470.364

The numbers in the last column of the table are the estimates of $t$ that would be obtained if that cluster were the one selected in a sample of size 1. The population total is estimated by averaging the five values of $t_i/\psi_i$:
$$ \hat{t}_\psi = \frac{2021.875 + 1313.410 + 1313.410 + 1626.013 + 2470.364}{5} = 1749.014. $$

The standard error of $\hat{t}_\psi$ is simply $s/\sqrt{n}$, where $s$ is the sample standard deviation of the $t_i/\psi_i$'s:
$$ SE(\hat{t}_\psi) = \sqrt{\frac{1}{5} \cdot \frac{(2021.875 - 1749.014)^2 + \cdots + (2470.364 - 1749.014)^2}{4}} = 222.42. $$

Then we estimate the average amount of time a student spent studying statistics by $\hat{t}_\psi$ divided by the population size:
$$ \hat{\bar{y}}_\psi = \frac{1749.014}{647} = 2.70 \text{ hours}, $$
with $SE(\hat{\bar{y}}_\psi) = SE(\hat{t}_\psi)/647 = 222.42/647 = 0.34$ hours.

The formulas for estimating the population total and mean and their standard errors are valid for any choice of sampling probabilities $\psi_i$, regardless of whether the $\psi_i$'s are proportional to the sizes of the clusters. For example, if we were interested in obtaining an accurate estimate of the mean amount of time spent studying for the subpopulation of students on college sports teams, we might want to sample classes with more students on sports teams with higher probability.
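A short R sketch reproducing the calculations above, using the sample {12, 14, 14, 5, 1} and the class totals $t_i$ from the table:

K   <- 647                                # total number of students (population size)
psi <- c(24, 100, 100, 76, 44) / K        # psi_i for sampled classes 12, 14, 14, 5, 1
t_i <- c(75, 203, 203, 191, 168)          # class totals t_i (hours studied)
n   <- length(t_i)

est   <- t_i / psi                        # t_i / psi_i: 2021.875, 1313.410, ..., 2470.364
t_hat <- mean(est)                        # 1749.014
se_t  <- sd(est) / sqrt(n)                # 222.42

c(t_hat = t_hat, se_t = se_t,
  ybar_hat = t_hat / K,                   # 2.70 hours
  se_ybar  = se_t / K)                    # 0.34 hours

Note that the estimator and its standard error are just the sample mean and the standard error of the mean applied to the values $t_i/\psi_i$, which is why mean() and sd() suffice.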
IV. Two-Stage Sampling With Replacement

If we sample clusters with unequal probabilities and then use simple random sampling within the clusters, the estimators from one-stage unequal-probability sampling are modified slightly to allow for different subsamples in clusters that are selected more than once:
$$ \hat{t}_\psi = \frac{1}{n} \sum_{i=1}^{N} \sum_{j=1}^{Q_i} \frac{\hat{t}_{ij}}{\psi_i}, \qquad
   \widehat{\mathrm{Var}}(\hat{t}_\psi) = \frac{1}{n} \sum_{i=1}^{N} \sum_{j=1}^{Q_i} \frac{\left( \dfrac{\hat{t}_{ij}}{\psi_i} - \hat{t}_\psi \right)^2}{n-1}, $$
where $\hat{t}_{ij}$ is the estimated total for cluster $i$ based on the $j$th subsample taken in that cluster.

Returning to Example 1, suppose we subsample five students in each sampled class rather than observing $t_i$.

Class   M_i    psi_i      y_ij (hours)         ybar_ij   t-hat_ij   t-hat_ij / psi_i
12       24     24/647    2, 3, 2.5, 3, 1.5      2.4        57.6     1552.8
14      100    100/647    2.5, 2, 3, 0, 0.5      1.6       160.0     1035.2
14      100    100/647    3, 0.5, 1.5, 2, 3      2.0       200.0     1294.0
 5       76     76/647    1, 2.5, 3, 5, 2.5      2.8       212.8     1811.6
 1       44     44/647    4, 4.5, 3, 2, 5        3.7       162.8     2393.9
                                                 average:            1617.5
                                                 std. dev.:           521.628

Thus $\hat{t}_\psi = 1617.5$ and $SE(\hat{t}_\psi) = 521.628/\sqrt{5} = 233.28$. Then we estimate the average amount of time spent studying as
$$ \hat{\bar{y}}_\psi = \frac{\hat{t}_\psi}{K} = \frac{1617.5}{647} = 2.5 \text{ hours} $$
(where $K = 647$ is the population size), with $SE(\hat{\bar{y}}_\psi) = SE(\hat{t}_\psi)/K = 233.28/647 = 0.36$ hours.
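The two-stage calculation can be reproduced in R in the same way as the one-stage example, replacing $t_i$ with the estimated cluster totals $\hat{t}_{ij} = M_i \bar{y}_{ij}$ computed from the subsamples in the table:

K   <- 647
M_i <- c(24, 100, 100, 76, 44)            # class sizes for sampled classes 12, 14, 14, 5, 1
psi <- M_i / K
y   <- list(c(2, 3, 2.5, 3, 1.5),         # hours studied for the 5 subsampled students
            c(2.5, 2, 3, 0, 0.5),
            c(3, 0.5, 1.5, 2, 3),
            c(1, 2.5, 3, 5, 2.5),
            c(4, 4.5, 3, 2, 5))
n   <- length(y)

t_hat_ij <- M_i * sapply(y, mean)         # estimated class totals: 57.6, 160, 200, 212.8, 162.8
est      <- t_hat_ij / psi                # 1552.8, 1035.2, 1294.0, 1811.6, 2393.9

t_hat <- mean(est)                        # 1617.5
se_t  <- sd(est) / sqrt(n)                # 233.28

c(t_hat = t_hat, se_t = se_t,
  ybar_hat = t_hat / K,                   # 2.5 hours
  se_ybar  = se_t / K)                    # 0.36 hours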