SADC Course in Statistics
Module H6 Sessions 13&14
Cluster and Multi-Stage Sampling Procedures

1. Cluster Sampling

For administrative convenience and to reduce costs such as transport, interviewer time and supervision, a random selection is often made from clusters of units rather than from the total population of units themselves.

For example, in a household survey, rather than selecting a simple random sample of households from the whole country, one could select a sample of villages. This has the added advantage that we do not need a list (sampling frame) of every household in the country from which to draw our sample, just a list of all the villages.

If we then interview every household in the selected villages, we have a one-stage cluster sample. If we interview only a sample of households in each selected village, we have a two-stage cluster sample. To take a random sample at this second stage we need, of course, a sampling frame of households in the chosen villages.

Advantages:
(i) The major advantages of a cluster sampling scheme are lower cost and administrative convenience.
(ii) It is also useful when no satisfactory sampling frame is available for the whole population, but a listing of the clusters into which the population is divided is available.

NOTE:
(i) The selection of m clusters from a population made up of M clusters may be made by simple random sampling (SRS), stratified random sampling or systematic sampling, treating the clusters themselves as units.
(ii) Whether or not a particular group of units should be called a cluster depends on the circumstances. For example, if units are households, villages may be clusters; if units are family members, households may be clusters.
(iii) Clusters need not be natural aggregates. For example, in area sampling, imposing a grid on a map produces artificial clusters.
(iv) In any one sample design, several levels of clusters may be used (e.g. district, village, household). This leads to a multi-stage sampling procedure. The units at the first (topmost) stage of selection are known as primary sampling units (PSUs).

In general, clusters do not contain equal numbers of units.

Cluster sampling will give less precise estimates than a simple random sample of the same size, but this loss in precision is usually far outweighed by the reduced cost, which enables a larger sample to be taken. Better administrative control also tends to give better quality data. In a large survey some clustering will probably be inevitable anyway because of the lack of a suitable sampling frame.

2. Selection of Clusters in One-Stage Cluster Sampling

Clusters could be sampled either
(a) with equal probability (SRS), using weights in estimation (an appreciation of the meaning and use of sampling weights will be given in Session 19); or
(b) with probability proportional to size (PPS).

Example: Suppose we want to select 1 cluster from a population made up of 10 clusters. Let $P_i$ be the probability of selecting the ith cluster, and let $P_i(a)$ and $P_i(b)$ be its values under SRS and under PPS respectively (see Table below).

We need to consider whether sampling is with or without replacement. In (a), sampling with or without replacement leads to the same probabilities $P_i(a)$. The column $P_i(b)$, however, gives with-replacement probabilities. Sampling of clusters under scheme (b) is usually done with replacement.
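The two selection schemes can be illustrated with a minimal Python sketch. It uses the ten district sizes of this example; the variable names, the helper `draw_one_cluster` and the random seed are illustrative assumptions, not part of the course material.

```python
from fractions import Fraction
import random

# District sizes for clusters 1..10, as given in the example.
sizes = [1000, 400, 200, 300, 1200, 1000, 1600, 200, 350, 450]
total = sum(sizes)  # 6700

# Scheme (a): every cluster has the same selection probability (SRS).
p_srs = [Fraction(1, len(sizes)) for _ in sizes]
# Scheme (b): selection probability proportional to cluster size (PPS).
p_pps = [Fraction(size, total) for size in sizes]

def draw_one_cluster(weights, rng):
    """Draw one cluster number (1-based) with probability proportional to weights."""
    cluster_numbers = range(1, len(weights) + 1)
    return rng.choices(cluster_numbers, weights=weights, k=1)[0]

rng = random.Random(2007)  # arbitrary seed, only to make the draw repeatable
drawn_srs = draw_one_cluster([1] * len(sizes), rng)  # equal weights
drawn_pps = draw_one_cluster(sizes, rng)             # weights = sizes

for number, (pa, pb) in enumerate(zip(p_srs, p_pps), start=1):
    print(f"cluster {number:2d}: P(a) = {pa}, P(b) = {pb}")
print("SRS draw:", drawn_srs, " PPS draw:", drawn_pps)
```

Both sets of probabilities sum to 1. Under PPS the largest district (cluster 7) is eight times as likely to be drawn as the smallest ones (clusters 3 and 8).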
Cluster No. (i)   District size (pop)   P_i(a)   P_i(b)
 1                1000                  1/10     1000/6700
 2                 400                  1/10      400/6700
 3                 200                  1/10      200/6700
 4                 300                  1/10      300/6700
 5                1200                  1/10     1200/6700
 6                1000                  1/10     1000/6700
 7                1600                  1/10     1600/6700
 8                 200                  1/10      200/6700
 9                 350                  1/10      350/6700
10                 450                  1/10      450/6700
Total             6700                  1        1

Other advantages of with-replacement selection are:
(i) in multistage sampling, variance estimates are easier;
(ii) selected clusters are independent.
On the other hand, we may get the same cluster twice! If we have within-cluster sampling (two-stage) this may not be a problem, especially if the within-cluster sampling fraction is low.

In practice, if the sampling fraction $f = n/N$ is small, we can sample without replacement, but pretend we didn't.

3. Method of Drawing a PPS Sample

Make a cumulative list of cluster sizes.

Example:

Cluster No.   Size   Cumulative Size
 1            1000   1000
 2             400   1400
 3             200   1600
 4             300   1900
 5            1200   3100
 6            1000   4100
 7            1600   5700
 8             200   5900
 9             350   6250
10             450   6700

In both methods below, each number drawn selects the first cluster whose cumulative size is at least as large as that number.

Method (i). Draw a systematic sample based on cumulative size, assuming clusters are arranged in random order.

e.g. To select m = 3 clusters, use the sampling interval $6700/3 \approx 2233$. Draw a random number between 1 and 2233. Say we get 1814.

1814                  Select District 4
1814 + 2233 = 4047    Select District 6
4047 + 2233 = 6280    Select District 10

Unless one cluster is very large, this will be without replacement.

Method (ii). Draw a random sample based on cumulative size.

e.g. To select m = 3 clusters, take 3 random numbers from 1 to 6700. Say we get 5814, 832, 3534.

5814    Select District 8
 832    Select District 1
3534    Select District 6

This is effectively sampling with replacement.

4. Estimation in Two-Stage Cluster Sampling

Assume clusters are sampled with replacement, and $x_{ij}$ represents the measurement made on the jth unit in the ith cluster. Then each cluster i provides an independent unbiased estimate $x_i^*$ of the population total $X_T$ (of some measurement variable X). This is given by

\[ x_i^* = m \sum_j \frac{x_{ij}}{p_{ij}} \tag{1} \]

where
m is the number of clusters sampled;
$x_{ij}$ is the measurement made on the jth unit in the ith PSU (cluster);
$p_{ij}$ is the probability of selecting the jth unit in the ith cluster.

The summation takes place over all $x_{ij}$'s in the ith cluster (NOTE: In multi-stage samples, the summation extends over all draws and all stages of selection).

Then we can estimate the overall total $X_T$ by

\[ x^* = \frac{1}{m} \sum_{i=1}^{m} x_i^* \tag{2} \]

and its variance by

\[ \operatorname{var}(x^*) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \left( x_i^* - x^* \right)^2 . \tag{3} \]

Consider the previous example again with a slight change to cluster sizes.

Cluster No.   Cluster Size   p_ij for SRS   p_ij for PPS
 1            100            3/10           100/670
 2             40            3/10            40/670
 3             20            3/10            20/670
 4             30            3/10            30/670
 5            120            3/10           120/670
 6            100            3/10           100/670
 7            160            3/10           160/670
 8             20            3/10            20/670
 9             35            3/10            35/670
10             45            3/10            45/670
Total         670

Consider a two-stage selection, taking 3 clusters at stage 1, and then 8 units at stage 2 (from each selected cluster). Consider SRS with replacement at stage 1 and without replacement at stage 2.

Suppose the clusters selected at stage 1 are 1, 7 and 8. Then,

Prob. of selecting the jth unit from cluster 1: $p_{1j} = \frac{3}{10} \times \frac{8}{100}$
Prob. of selecting the jth unit from cluster 7: $p_{7j} = \frac{3}{10} \times \frac{8}{160}$
Prob. of selecting the jth unit from cluster 8: $p_{8j} = \frac{3}{10} \times \frac{8}{20}$

Then from cluster 1 data, the population total $X_T$ is estimated by

\[ x_1^* = 3 \sum_{j=1}^{8} \frac{x_{1j}}{p_{1j}} , \]

and similarly for $x_7^*$ and $x_8^*$.

The population total can then be estimated from all the data as

\[ x^* = \frac{x_1^* + x_7^* + x_8^*}{3} = \sum_{j=1}^{8} \frac{x_{1j}}{p_{1j}} + \sum_{j=1}^{8} \frac{x_{7j}}{p_{7j}} + \sum_{j=1}^{8} \frac{x_{8j}}{p_{8j}} , \]

with the variance of $x^*$ given by expression (3).

Notice that the first term above is

\[ \sum_{j=1}^{8} \frac{x_{1j}}{p_{1j}} = \sum_{j=1}^{8} \frac{x_{1j}}{\frac{3}{10} \times \frac{8}{100}} = \frac{10}{3} \times 100 \times \frac{\sum_{j=1}^{8} x_{1j}}{8} = \frac{10}{3} \times 100 \times (\text{mean for cluster 1}) = \frac{10}{3} \times (\text{total for cluster 1}) . \]

Writing the 2nd and 3rd terms similarly, we get

\[ x^* = \frac{10}{3} \left( \text{total for cluster 1} + \text{total for cluster 7} + \text{total for cluster 8} \right) , \]

i.e. $x^*$ = (mean of the estimated totals per sampled cluster) × 10.

The computations above correspond to the way in which we would intuitively calculate the population total $X_T$ from the data of just one cluster, and then average the results across several clusters to get a better estimate of $X_T$, together with its variance.

5. Multistage Sampling

We saw earlier that a two-stage sample results when a random sample of clusters is selected, and these are then further subsampled (rather than completely enumerated) to provide the units making up the final sample.

The first stage units are called Primary Sampling Units (PSUs), and the second stage units are called Secondary Sampling Units (SSUs). If the second stage units are further sub-sampled, a three-stage sampling procedure results.

In general, a population sub-sampled at several stages leads to a MULTISTAGE SAMPLING procedure. For example:

Districts → Towns → Hospitals → Wards → Patients.

NOTE: Most large scale survey investigations are multistage. They are cost-effective and administratively convenient. They are also useful in the absence of an adequate sampling frame.

Sample Selection

Suppose a population of size 240000 is divided into 800 PSUs, each of size 300, or 240 PSUs of size 1000, etc. Suppose a sampling fraction of 1 in 100 is to be taken.

Allocation                    A     B     C     D     E     F     G
No. of PSUs                   800   240   120   60    40    20    8
No. of SSUs taken per PSU     3     10    20    40    60    120   300

In the above example, SRS at each stage will produce unbiased estimates of the population mean and allow standard errors to be found. In general, however, the sizes of PSUs vary!
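The allocations A to G in the table above can be checked with a few lines of Python. This is only a sketch: a sampling fraction of 1 in 100 from 240000 units means 2400 units in total, so each allocation should satisfy (number of PSUs) × (SSUs per PSU) = 2400. The variable names are illustrative.

```python
population_size = 240_000
fraction_denominator = 100  # sampling fraction of 1 in 100
target_sample_size = population_size // fraction_denominator

# (number of PSUs, number of SSUs taken per PSU) for each allocation in the table.
allocations = {
    "A": (800, 3),
    "B": (240, 10),
    "C": (120, 20),
    "D": (60, 40),
    "E": (40, 60),
    "F": (20, 120),
    "G": (8, 300),
}

# Total number of final-stage units implied by each allocation.
sample_sizes = {
    label: n_psu * n_ssu for label, (n_psu, n_ssu) in allocations.items()
}

for label, size in sample_sizes.items():
    print(f"Allocation {label}: {size} units")
print("Target sample size:", target_sample_size)
```

Every allocation yields the same overall sample size. They differ only in how strongly the sample is clustered, from 3 units in each of 800 PSUs to 300 units in each of 8 PSUs.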
What size of samples should then be taken at each stage? Some possibilities are given below.

1. Take a SRS of PSUs, and sample each selected PSU by SRS or stratified or systematic sampling.

Suppose we decide on SRS of SSUs. Then we need to decide on $n_1, n_2, \ldots, n_m$, i.e. the sample sizes from the m selected PSUs. For example,

a) Take all $n_i$ of equal size.
Disadvantage: The sample might not be representative if some large PSUs are not selected at the first stage.

b) Take $n_i$ proportional to the size of the ith PSU.
Disadvantage: If a large PSU is selected, the final sample is heavily biased by units in this PSU.

2. Select PSUs with probability proportional to size (PPS), and take equal sized samples from the selected PSUs. This gives greater precision than case 1, with a constant sampling fraction over the two stages of selection.

Advantages of 2 are:
   With equal $n_i$, a large PSU cannot exert too great an effect;
   It is administratively convenient;
   If each PSU is to be studied separately, this is an efficient allocation.

Limitation: The sizes of the PSUs must be known.

An Example

A population has individuals belonging to one of three districts A, B, C.

District                A       B      C      Total
District Size (N_i)     20000   2000   8000   30000

Consider selecting one PSU, and then sub-sampling from it.

(a) Consider SRS of one PSU and SRS of $n_i = 100$ SSUs.

If A is the selected PSU, Prob(any particular individual is selected) $= \frac{1}{3} \times \frac{100}{20000} = \frac{1}{600}$
If B is the selected PSU, Prob(any particular individual is selected) $= \frac{1}{3} \times \frac{100}{2000} = \frac{1}{60}$
If C is the selected PSU, Prob(any particular individual is selected) $= \frac{1}{3} \times \frac{100}{8000} = \frac{1}{240}$

So this scheme gives different individuals widely different chances of entering the final sample.

(b) Select one district using SRS and take $n_i$ proportional to $N_i$ SSUs (e.g. take a 5% sample at stage 2).

If A is selected, Prob(any particular individual is selected) $= \frac{1}{3} \times \frac{1}{20} = \frac{1}{60}$
If B is selected, Prob(any particular individual is selected) $= \frac{1}{3} \times \frac{1}{20} = \frac{1}{60}$
If C is selected, Prob(any particular individual is selected) $= \frac{1}{3} \times \frac{1}{20} = \frac{1}{60}$

(c) Select one district with PPS, and take equal $n_i = 100$.

If A is selected, Prob(any particular individual is selected) $= \frac{20000}{30000} \times \frac{100}{20000} = \frac{1}{300}$
If B is selected, Prob(any particular individual is selected) $= \frac{2000}{30000} \times \frac{100}{2000} = \frac{1}{300}$
If C is selected, Prob(any particular individual is selected) $= \frac{8000}{30000} \times \frac{100}{8000} = \frac{1}{300}$

NOTE: In (b) and (c), the probability that any particular individual is selected is constant. Such a design is called a self-weighting design.

6. Self-Weighting Designs

A self-weighting design is one in which each unit at the final level (e.g. plot, person, household) has the same probability of being selected. This probability is the product of the probabilities of selection (over all draws) at each stage of the design.

For example, suppose we are planning a two-stage survey from a population of villages of different sizes. We will take a sample of villages, and within each selected village take a subsample of households. We can make the design self-weighting by either:

(a) Choosing villages with probability proportional to size, and then choosing the same number of households in each selected village;
or
(b) Choosing a simple random sample of villages, and then selecting a number of households in each selected village which is proportional to the size of that village.

The advantage of a self-weighting design is that the straightforward sample mean

\[ \bar{x} = \frac{1}{n} \sum x_i \]

is an unbiased estimator of the true population mean.

This is not so for a non-self-weighting design. For such a design, you need to take account of the different probabilities of selection.
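The selection probabilities in the district example (schemes (a), (b) and (c)) can be reproduced with exact fractions. The sketch below uses only the district sizes given in the example; the function name `is_self_weighting` and the dictionary layout are illustrative choices.

```python
from fractions import Fraction

# District sizes N_i from the example.
district_sizes = {"A": 20000, "B": 2000, "C": 8000}
total_size = sum(district_sizes.values())  # 30000
n_districts = len(district_sizes)

# Overall probability = P(district selected) x P(individual selected within it).

# (a) SRS of one district, then SRS of 100 individuals.
scheme_a = {
    d: Fraction(1, n_districts) * Fraction(100, size)
    for d, size in district_sizes.items()
}
# (b) SRS of one district, then a 5% sample (n_i proportional to N_i).
scheme_b = {
    d: Fraction(1, n_districts) * Fraction(5, 100) for d in district_sizes
}
# (c) PPS selection of one district, then SRS of 100 individuals.
scheme_c = {
    d: Fraction(size, total_size) * Fraction(100, size)
    for d, size in district_sizes.items()
}

def is_self_weighting(probabilities):
    """True if every final-stage unit has the same selection probability."""
    return len(set(probabilities.values())) == 1

for name, scheme in [("(a)", scheme_a), ("(b)", scheme_b), ("(c)", scheme_c)]:
    shown = {d: str(p) for d, p in scheme.items()}
    print(name, shown, "self-weighting:", is_self_weighting(scheme))
```

Scheme (a) gives three different probabilities, so the units would need different weights in estimation. Schemes (b) and (c) each give one common probability.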
The formulae given in expressions (1), (2) and (3) of Section 4 again apply, BUT with $p_{ij}$ defined as the overall probability of selection of the jth unit from the ith cluster, covering all draws and all stages of selection.

Note that the above method of estimation is appropriate only where PSUs are sampled with replacement.

7. Design Effects (brief overview)

The design effect (deff) is the ratio of the correct variance of an estimator (say $\hat{\theta}$) in the given design to the variance calculated as if it were a simple random sample of the same size. Thus

\[ \text{deff} = \frac{\text{variance}_{\text{correct}}(\hat{\theta})}{\text{variance}_{\text{SRS}}(\hat{\theta})} \]

[Note: Some also consider deft $= \sqrt{\text{deff}}$.]

Deff is a comprehensive measure which attempts to summarise the effect of various complexities in the design, especially those of clustering and stratification.

In general, deff will be different for different groups of variables measured in the same survey, but may be quite similar across groups of variables of the same type (e.g. socioeconomic variables, health knowledge variables, etc). It may also be similar in value for the same variable using the same design across different surveys in time and location.

Uses of deff:

(a) To give an indication of the precision lost (or gained) due to a complex survey design, and to enable different designs to be compared.

(b) To enable easy estimation of sample size for future surveys. For example, if we know deff for a crucial variable from a previous survey (of the same design), we can take the sample size n as

\[ n = \text{deff} \times \frac{S^2}{V} \]

where V is the desired variance for the real survey and $S^2$ is the variance between units, as would be estimated from a simple random sample (the finite population correction is ignored here).

(c) To enable simple (rough!) calculation of standard errors. For example, if the design is self-weighting, the SRS mean will be correct. For the correct variance, the SRS variance estimates should be multiplied by an estimate of deff from either
(i) an earlier survey of the same design; or
(ii) calculating the correct multi-stage sampling variance for one or two variables in the group (e.g. demographic variables) and using deff for these as an estimate of deff for other variables in the group.

Further information about deff can be found in Pettersson, H. and Silva, P.L.D.N. (2004) Analysis of design effects for surveys in developing countries. Chapter VII, pp. 123-143, of the UN Publication An Analysis of Operating Characteristics of Household Surveys in Developing and Transition Countries: Survey Costs, Design Effects and Non-Sampling Errors. Available at http://unstats.un.org/unsd/hhsurveys/index.htm (accessed 10th September 2007).

Other references related to design effects can also be found under Section B of the above publication. References for the bulk of the contents of this document appear below.

General References related to multi-stage samples and practical applications

Barnett, V. (1974) Elements of Sampling Theory. Edward Arnold. ISBN 0 340 17387 4

Cochran, W.G. (1977) Sampling Techniques (3rd edition). Wiley & Sons.

Levy, P.S. and Lemeshow, S. (1999) Sampling of Populations: Methods and Applications (3rd edition). Wiley, New York. ISBN 0-471-15575-6

Lohr, S.L. (1999) Sampling: Design and Analysis. International Thomson Publishing. ISBN 0-534-35361-4

Rao, P.S.R.S. (2000) Sampling Methodologies: with applications. Chapman and Hall, London.

Scheaffer, R.L., Mendenhall, W. and Ott, L. (1990) Elementary Survey Sampling (4th edition). PWS-Kent Publishing Company, pp. 390.

Wilson, I.M., Huttly, S.R.A. and Fenn, B. (2006) A Case Study of Sample Design for Longitudinal Research: Young Lives. International Journal of Social Research Methodology: Theory and Practice, 9, no. 5, pp. 351-365.