Methodology Glossary Tier 2

Cluster Sampling

Cluster sampling is a method used to select a sample. The main reason to cluster sample is to make survey administration more efficient by reducing costs, particularly travel costs. A sample derived through simple random sampling can result in sample units that are widely dispersed geographically, so interviewers must travel great distances to conduct a survey. Travel then becomes expensive and it may take longer to complete all the interviews required.

When cluster sampling, the population of interest is first split into geographic areas (‘clusters’) or some other natural cluster (such as industries, schools or hospitals). Some of these clusters are then randomly selected to form a sampling frame from which a sample is chosen.

One-stage Cluster Sampling

This involves splitting the population into clusters, then randomly selecting a proportion of these clusters. All units within the selected clusters are chosen to participate in the survey.

Example

Glasgow City Council wishes to find out information about the diets of primary one pupils. It is not feasible to survey every primary one pupil in Glasgow City, as it would be expensive and time consuming. We may therefore decide that each school in Glasgow represents a cluster and randomly select 20 schools. In one-stage cluster sampling we would visit each of these schools and interview all primary one pupils in each of the 20 schools.

Two-stage Cluster Sampling

This involves splitting the population into clusters and selecting a proportion of these clusters to form a sampling frame. In two-stage cluster sampling we then randomly select a proportion of individuals within each chosen cluster to participate in the survey.

Example

The Scottish Government wishes to find out about the sleep patterns of all primary school pupils. The survey will require one-to-one interviews with school pupils and their parents/guardians.
It is not feasible to survey pupils in all schools, or even all the pupils in a sample of schools. Instead, we would randomly select a number of schools and then randomly select a proportion of pupils from the chosen schools to participate in the study.

http://www.scotland.gov.uk/Topics/Statistics/About/Methodology/Glossary

Selecting a Cluster Sample

Say we wanted to use cluster sampling to draw a sample of 1,000 people in Scotland. We decide to use Scottish Parliamentary Constituencies as our clusters. We intend to randomly choose 10 out of the 73 constituencies and sample 100 people within each constituency.

One way to select constituencies would be to list them all alphabetically, select a random starting point on the list and pick every 7th constituency. This presents a problem: the population of each constituency is different, so people living in constituencies with smaller populations (e.g. Shetland) would have a higher chance of being in our sample than those living in constituencies with larger populations (e.g. Dundee West). This could lead to bias in our estimates, and we would have to weight our results to reflect the different selection probabilities of people living in different constituencies. This can complicate the analysis of results.

One way to overcome this problem is to set the probability of each cluster being selected for the sample to be proportional to its overall size. This means the larger a cluster, the more chance it has of being selected. The derivation that follows shows how to determine what probability should be assigned to each cluster. Note that this method only works when the same number of people will be sampled in each cluster.
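To make the problem concrete before the derivation, the Python sketch below compares an individual's chance of selection when every cluster is equally likely to be chosen with their chance when clusters are chosen with probability proportional to size. The cluster names and population figures are invented for illustration (they are not real constituency populations), and `inclusion_probabilities` is a hypothetical helper, not part of any survey package.

```python
from fractions import Fraction


def inclusion_probabilities(cluster_sizes, a, b):
    """Chance that one individual in each cluster ends up in the sample.

    cluster_sizes: dict mapping cluster name -> population of that cluster
    a: number of clusters selected
    b: number of individuals sampled in each selected cluster
    Returns {name: (equal_chance_probability, proportional_to_size_probability)}.
    """
    total = sum(cluster_sizes.values())  # overall population size
    k = len(cluster_sizes)               # number of clusters
    result = {}
    for name, size in cluster_sizes.items():
        within = Fraction(b, size)                 # chance once the cluster is chosen
        equal = Fraction(a, k) * within            # every cluster equally likely
        pps = Fraction(a * size, total) * within   # cluster chance proportional to size
        result[name] = (equal, pps)
    return result


# Hypothetical populations, chosen only to show the effect.
sizes = {"small": 2_000, "medium": 5_000, "large": 13_000}
probs = inclusion_probabilities(sizes, a=1, b=100)
for name, (equal, pps) in probs.items():
    print(f"{name:>6}: equal-chance {float(equal):.4f}  proportional-to-size {float(pps):.4f}")
```

Under these assumed figures, a person in the small cluster is several times more likely to be sampled than a person in the large cluster when clusters are equally likely to be chosen, while the proportional-to-size column is the same for everyone.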
Say we have a population of size $N$ from which we want to select a sample of size $n$. We split the population into clusters (indexed by $j$), select $a$ clusters, and select $b$ individuals within each selected cluster for our sample.

To ensure everyone in the population has an equal chance of being included in the sample we need:

$$\text{Probability an individual is in the sample} = \frac{n}{N}$$

We can say that:

$$n = a \times b$$

If we say that $N_j$ is the total number of individuals in cluster $j$ then:

$$\frac{n}{N} = \frac{a \times b}{N} = \frac{a \times b}{N} \times \frac{N_j}{N_j} = \frac{a \times N_j}{N} \times \frac{b}{N_j}$$

The first part, $\frac{a \times N_j}{N}$, gives the probability that should be assigned to cluster $j$. The second part, $\frac{b}{N_j}$, is the resulting probability an individual within cluster $j$ has of being included in the sample if cluster $j$ is included in the sample.

Example

Say we have a population of 10,000 patients in 12 hospitals and we want to sample 500 patients. We intend to sample 100 patients in each of five hospitals.

The table below shows how many patients are in each hospital and the probability we should assign to each hospital when selecting clusters. It also shows the probability an individual within each hospital is included in the sample if their hospital is selected, and how the overall probability of selection is equal for all patients.

| Hospital | Patient population ($N_j$) | Probability hospital is selected, $(a \times N_j) / N$ | Probability a patient is included in the sample if their hospital is chosen, $b / N_j$ | Overall probability of patient being included in the sample |
|---|---|---|---|---|
| 1 | 874 | (5 × 874) / 10,000 = 43.70% | 100 / 874 = 11.44% | 43.70% × 11.44% = 5% |
| 2 | 789 | (5 × 789) / 10,000 = 39.45% | 100 / 789 = 12.67% | 39.45% × 12.67% = 5% |
| 3 | 695 | (5 × 695) / 10,000 = 34.75% | 100 / 695 = 14.39% | 34.75% × 14.39% = 5% |
| 4 | 884 | (5 × 884) / 10,000 = 44.20% | 100 / 884 = 11.31% | 44.20% × 11.31% = 5% |
| 5 | 704 | (5 × 704) / 10,000 = 35.20% | 100 / 704 = 14.20% | 35.20% × 14.20% = 5% |
| 6 | 924 | (5 × 924) / 10,000 = 46.20% | 100 / 924 = 10.82% | 46.20% × 10.82% = 5% |
| 7 | 712 | (5 × 712) / 10,000 = 35.60% | 100 / 712 = 14.04% | 35.60% × 14.04% = 5% |
| 8 | 877 | (5 × 877) / 10,000 = 43.85% | 100 / 877 = 11.40% | 43.85% × 11.40% = 5% |
| 9 | 801 | (5 × 801) / 10,000 = 40.05% | 100 / 801 = 12.48% | 40.05% × 12.48% = 5% |
| 10 | 999 | (5 × 999) / 10,000 = 49.95% | 100 / 999 = 10.01% | 49.95% × 10.01% = 5% |
| 11 | 950 | (5 × 950) / 10,000 = 47.50% | 100 / 950 = 10.53% | 47.50% × 10.53% = 5% |
| 12 | 791 | (5 × 791) / 10,000 = 39.55% | 100 / 791 = 12.64% | 39.55% × 12.64% = 5% |
| Total | 10,000 | | | |

To select 5 hospitals we must use systematic sampling. This can be done as follows:

1. List the hospitals in a spreadsheet
2. Order the hospitals by a random number
3. For each hospital insert as many records as there are patients in that hospital
4. Calculate a sampling interval by dividing the total number of patients by the number of hospitals we want to select (in this case 10,000 divided by 5 = 2,000)
5. Draw a systematic sample by selecting a random start between 1 and 2,000 and counting down the list of records, selecting the hospital to which every 2,000th record belongs.

Quality Implications of Cluster Sampling

We have established that the main advantage of cluster sampling is time and cost savings. However, there are disadvantages to cluster sampling which mean the technique should be treated with caution. For some variables, individuals within the same cluster are more likely to be similar to each other than individuals in different clusters.

Examples

1. Income levels are likely to be very different across geographic areas. Therefore clustering may not be appropriate for a survey to estimate average income in Scotland as it would severely reduce the quality of any results.
2. The proportion of the population that is female is likely to be similar across geographic areas. Therefore clustering is unlikely to affect survey estimates of the proportion of the population that is female.

In general, clustering makes us less certain that our results reflect the whole population and produces a larger standard error than the equivalent simple random sample would.

Despite these disadvantages, clustering usually enables the selection of larger samples than simple/stratified random sampling. Consequently, it is often possible to target a sample large enough to offset the loss of precision.

The selection process effectively comes down to a trade-off between two factors: precision and cost. If you require a very accurate and precise sample, then simple/stratified random sampling is more appropriate. If cost and time are the more important factors in your considerations, then cluster sampling is likely to be more suitable, so long as the sample is large enough to offset the loss of precision.

Measuring the Impact of Clustering on Final Estimates

It is possible to evaluate the effect clustering has on each variable we collect by estimating the design effect due to clustering. The larger the design effect due to clustering, the less precise our estimates are. This is defined as:

$$\text{Design Effect Due to Clustering} = \text{DEFFCL} = 1 + \rho\,(b - 1)$$

Where:
$\rho$ is the intra-cluster correlation (see below for definition)
$b$ is the average cluster sample size

Intra-cluster Correlation

The intra-cluster correlation ($\rho$ in the above equation) is a measure of how strongly individuals in the same cluster resemble each other. A worked example using these formulae is provided later in the paper.
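As a rough illustration of the design effect formula above, the Python sketch below evaluates 1 + ρ(b − 1) for several cluster sample sizes. The function name `design_effect` and the value ρ = 0.05 are assumptions chosen purely for illustration; they are not taken from a real survey.

```python
def design_effect(rho, b):
    """Design effect due to clustering: 1 + rho * (b - 1).

    rho: intra-cluster correlation
    b: average cluster sample size
    """
    return 1 + rho * (b - 1)


# Illustrative values only: a modest intra-cluster correlation of 0.05.
for b in (1, 10, 50, 100):
    print(f"b = {b:>3}: DEFFCL = {design_effect(0.05, b):.2f}")
```

With this assumed ρ, sampling 100 people per cluster gives a design effect of about 5.95, which shows how even a small intra-cluster correlation can matter when many individuals are taken from each cluster.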
If we define each observation as:

$$Y_{ji} = \mu + u_j + \varepsilon_{ji}$$

Where $Y_{ji}$ is the $i$'th observation in cluster $j$, $\mu$ is the overall mean, $u_j$ is a random effect shared by all observations in cluster $j$, and $\varepsilon_{ji}$ is a random effect for the $i$'th observation in cluster $j$.

Then the intra-cluster correlation ($\rho$) is:

$$\rho = \frac{\text{Variance}(u_j)}{\text{Variance}(u_j) + \text{Variance}(\varepsilon_{ji})}$$

This can be thought of as how much of the total variance in our sample is explained by clustering, and it can be calculated using the following formula:

$$\rho = \frac{\alpha - \beta}{\alpha + (b - 1)\,\beta}$$

Where:

$\alpha$ is the mean square between clusters, defined as

$$\alpha = \frac{\sum_j n_j \left(\bar{Y}_j - \bar{Y}\right)^2}{z - 1}$$

Where
$n_j$ = sample size of cluster $j$
$\bar{Y}_j$ = sample mean for cluster $j$
$\bar{Y}$ = overall mean for entire sample
$z$ = number of clusters

$\beta$ is the mean square within clusters, defined as

$$\beta = \frac{\sum_j \sum_i \left(Y_{ji} - \bar{Y}_j\right)^2}{n - z}$$

Where
$Y_{ji}$ = observation $i$ in cluster $j$
$\bar{Y}_j$ = sample mean for cluster $j$
$n$ = total sample size
$z$ = number of clusters

$b$ is the average cluster sample size

The method used to calculate design effects is designed for continuous variables. When used for proportions the analysis tends to give biased results for the intra-cluster correlation coefficient. However, this bias tends to be small for proportions between 0.2 and 0.8.

Effective Sample Size

The design effect due to clustering can be used to calculate the effective sample size.

$$\text{Effective sample size} = \frac{\text{actual sample size}}{\text{DEFFCL}}$$

The effective sample size is the simple random sample equivalent of the clustered sample, that is, the size a simple random sample would need to be to give the same statistical power as the clustered sample. If we have prior knowledge of what the design effect due to clustering is likely to be, we can use this to calculate how many individuals need to be included in our sample.

Example

We have conducted a survey to estimate what proportion of people in Scotland have ever travelled abroad.
The sample was clustered and included 2,500 individuals and gave us an estimate of 67%. The design effect due to clustering was 1.8. The effective sample size is:

$$\frac{2{,}500}{1.8} = 1{,}388.89$$

So the clustered sample of 2,500 people has the same precision, and the same statistical power, as a simple random sample of 1,389 people.

Complex Standard Errors

The design effect due to clustering can also be used to calculate a complex standard error. To do this we must calculate the design factor due to clustering.

$$\text{Design factor due to clustering} = \text{DEFTCL} = \sqrt{\text{DEFFCL}}$$

The complex standard error is:

$$\text{Complex standard error} = \text{standard error} \times \text{DEFTCL}$$

Details on calculating standard errors are available in Tier 2 - Confidence Intervals.

The complex standard error can be used to produce confidence intervals which take into account the design effect due to clustering.

$$\text{95\% Confidence Interval} = \text{estimate} \pm 1.96 \times \text{complex standard error}$$

DEFTCL can be thought of as the factor by which confidence intervals increase because the sample was clustered.

Example

A restaurant chain has twenty outlets in Scotland. They intend to buy new aprons for all staff. If they buy all the aprons in the same size they will receive a significant discount. They therefore want to know the average height of their staff before placing the apron order.

They decide to use a cluster sample and randomly select three outlets to include in the sample. They will measure the height of four employees in each of the outlets. The probability of each outlet being included in the sample will be proportional to size.
| Outlet | Number of staff | Outlet | Number of staff |
|---|---|---|---|
| 1 | 41 | 11 | 25 |
| 2 | 32 | 12 | 30 |
| 3 | 23 | 13 | 24 |
| 4 | 26 | 14 | 41 |
| 5 | 30 | 15 | 28 |
| 6 | 37 | 16 | 37 |
| 7 | 35 | 17 | 30 |
| 8 | 25 | 18 | 41 |
| 9 | 33 | 19 | 28 |
| 10 | 38 | 20 | 26 |

The restaurant owner uses the method described in ‘Selecting a Cluster Sample’ to pick three outlets. They randomly select outlets 1, 8 and 10.

| Outlet | Number of staff |
|---|---|
| 1 | 41 |
| 8 | 25 |
| 10 | 38 |

Within each outlet they randomly choose 4 staff members and measure their height. They get the following results:

| Outlet | Observation number | Height (cm) |
|---|---|---|
| 1 | 1 | 172 |
| 1 | 2 | 165 |
| 1 | 3 | 157 |
| 1 | 4 | 166 |
| 8 | 5 | 184 |
| 8 | 6 | 179 |
| 8 | 7 | 164 |
| 8 | 8 | 177 |
| 10 | 9 | 161 |
| 10 | 10 | 168 |
| 10 | 11 | 176 |
| 10 | 12 | 171 |

The restaurant owner then calculates the mean and the 95% confidence interval around the mean.

$$\text{Mean} = \frac{172 + 165 + 157 + 166 + 184 + \dots + 171}{12} = 170$$

$$\text{Confidence interval} = 170 \pm \text{standard error} \times 1.96 = 170 \pm 2.3 \times 1.96 = 170 \pm 4.5 = (165.5,\ 174.5)$$

The restaurant owner now wants to assess what impact the clustering has on the estimates. They therefore want to calculate the design effect due to clustering. Firstly they must calculate the intra-cluster correlation. To do this they must calculate the mean square within clusters and the mean square between clusters.
Mean square within clusters:

The formula is

$$\beta = \frac{\sum_j \sum_i \left(Y_{ji} - \bar{Y}_j\right)^2}{n - z}$$

so:

| Outlet ($j$) | Observation number ($i$) | Height ($Y_{ji}$) | Mean height for outlet ($\bar{Y}_j$) | Difference between observation and mean height for outlet ($Y_{ji} - \bar{Y}_j$) | Square of difference between observation and mean height for outlet |
|---|---|---|---|---|---|
| 1 | 1 | 172 | 165 | 7 | 49 |
| 1 | 2 | 165 | 165 | 0 | 0 |
| 1 | 3 | 157 | 165 | -8 | 64 |
| 1 | 4 | 166 | 165 | 1 | 1 |
| 8 | 5 | 184 | 176 | 8 | 64 |
| 8 | 6 | 179 | 176 | 3 | 9 |
| 8 | 7 | 164 | 176 | -12 | 144 |
| 8 | 8 | 177 | 176 | 1 | 1 |
| 10 | 9 | 161 | 169 | -8 | 64 |
| 10 | 10 | 168 | 169 | -1 | 1 |
| 10 | 11 | 176 | 169 | 7 | 49 |
| 10 | 12 | 171 | 169 | 2 | 4 |

There are 12 observations and 3 clusters so $n - z = 12 - 3 = 9$.

The mean square within clusters is:

$$\beta = \frac{49 + 0 + 64 + 1 + 64 + 9 + 144 + 1 + 64 + 1 + 49 + 4}{9} = \frac{450}{9} = 50$$

Mean square between clusters:

The formula is

$$\alpha = \frac{\sum_j n_j \left(\bar{Y}_j - \bar{Y}\right)^2}{z - 1}$$

so:

| Outlet ($j$) | Outlet sample size ($n_j$) | Mean height for each outlet ($\bar{Y}_j$) | Overall mean height ($\bar{Y}$) | Difference between mean height for each outlet and overall mean height ($\bar{Y}_j - \bar{Y}$) | Square of difference between mean height for each outlet and overall mean height |
|---|---|---|---|---|---|
| 1 | 4 | 165 | 170 | -5 | 25 |
| 8 | 4 | 176 | 170 | 6 | 36 |
| 10 | 4 | 169 | 170 | -1 | 1 |

There are three clusters so $z = 3$ and $z - 1 = 2$.

The mean square between clusters is:

$$\alpha = \frac{(25 \times 4) + (36 \times 4) + (1 \times 4)}{2} = \frac{248}{2} = 124$$

The intra-cluster correlation is therefore:

$$\rho = \frac{\alpha - \beta}{\alpha + (b - 1)\,\beta} = \frac{124 - 50}{124 + (4 - 1) \times 50} = \frac{74}{274} = 0.27$$

The design effect due to clustering (DEFFCL) is therefore:

$$\text{DEFFCL} = 1 + \rho\,(b - 1) = 1 + 0.27 \times (4 - 1) = 1 + 0.81 = 1.81$$

The effective sample size is

$$\frac{12}{1.81} = 6.6$$

This means the clustered sample of 12 that the restaurant chain took has the same power as a simple random sample of between 6 and 7 people.

The complex standard error uses the design factor, DEFTCL, which is the square root of DEFFCL:

$$\text{Complex standard error} = \text{standard error} \times \text{DEFTCL} = 2.3 \times \sqrt{1.81} = 2.3 \times 1.345 = 3.09$$

Our 95% confidence interval becomes:

$$\text{Confidence interval} = 170 \pm \text{complex standard error} \times 1.96 = 170 \pm 3.09 \times 1.96 = 170 \pm 6.06 = (163.9,\ 176.1)$$

This means that the clustering has added 1.6 cm onto our confidence limits on either side.
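The calculations in this worked example can be checked with a short Python sketch. The heights are the twelve measurements from the example. The function name `clustering_summary` is hypothetical, and the sketch assumes the simple standard error is the sample standard deviation divided by the square root of the sample size, which reproduces the 2.3 used in the text.

```python
from math import sqrt
from statistics import mean, stdev

# Heights (cm) from the worked example, keyed by outlet number.
heights = {
    1: [172, 165, 157, 166],
    8: [184, 179, 164, 177],
    10: [161, 168, 176, 171],
}


def clustering_summary(clusters):
    """Return the mean squares, intra-cluster correlation and related figures."""
    values = [y for ys in clusters.values() for y in ys]
    n, z = len(values), len(clusters)
    b = n / z  # average cluster sample size
    overall = mean(values)
    # Mean square between clusters (alpha) and within clusters (beta).
    alpha = sum(len(ys) * (mean(ys) - overall) ** 2 for ys in clusters.values()) / (z - 1)
    beta = sum((y - mean(ys)) ** 2 for ys in clusters.values() for y in ys) / (n - z)
    rho = (alpha - beta) / (alpha + (b - 1) * beta)
    deff = 1 + rho * (b - 1)
    se = stdev(values) / sqrt(n)  # assumed simple random sampling standard error
    return {
        "mean": overall,
        "alpha": alpha,
        "beta": beta,
        "rho": rho,
        "deff": deff,
        "effective_n": n / deff,
        "complex_se": se * sqrt(deff),  # standard error x DEFTCL
    }


result = clustering_summary(heights)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Because the sketch carries full precision through every step, its later decimal places can differ slightly from the hand calculation, which rounds ρ to 0.27 and the standard error to 2.3 before continuing.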
Notes:

The design effect due to clustering is different for every estimate in a cluster sample. It is often unfeasible to calculate the design effect due to clustering for all variables measured. Therefore, it is appropriate to calculate the design effect due to clustering for the most important variables collected in the sample and the variables likely to have the greatest geographic differences.

There are other design effects which impact on the size of the confidence interval and effective sample size. For example, the design effect of proportionate stratification will be less than one and will therefore reduce the size of confidence intervals and increase the effective sample size.

In practice the design effect due to clustering is almost always greater than 1, because individuals in the same cluster tend to resemble each other. This means clustering will not normally improve the precision of our estimates. However, if the design effect due to clustering is very close to 1, the effect of clustering on the precision of our estimates is very small.

Further Information

Tier 1: Sampling | Social Survey Design
Tier 2: Confidence Intervals | Simple Random Sampling | Stratified Sampling
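As a supplement to the five selection steps listed under ‘Selecting a Cluster Sample’, the following Python sketch shows one way those steps might be carried out without a spreadsheet. It uses the hospital patient counts from that example. The function name `pps_systematic_sample` is hypothetical, and the sketch assumes the total divides evenly by the number of clusters wanted (as 10,000 / 5 does); otherwise the interval is rounded down.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Patient counts per hospital, from the hospital example.
hospital_patients = {1: 874, 2: 789, 3: 695, 4: 884, 5: 704, 6: 924,
                     7: 712, 8: 877, 9: 801, 10: 999, 11: 950, 12: 791}


def pps_systematic_sample(sizes, a, rng=random):
    """Select `a` clusters with probability proportional to size.

    sizes: dict mapping cluster label -> number of individuals in the cluster
    rng: the random module or a random.Random instance
    """
    labels = list(sizes)
    rng.shuffle(labels)                                   # step 2: random order
    # Step 3: instead of writing one record per person, keep the running
    # total, which marks the last record belonging to each cluster.
    cumulative = list(accumulate(sizes[label] for label in labels))
    interval = cumulative[-1] // a                        # step 4: sampling interval
    start = rng.randint(1, interval)                      # step 5: random start
    positions = [start + k * interval for k in range(a)]
    # Record p belongs to the first cluster whose running total reaches p.
    return [labels[bisect_left(cumulative, p)] for p in positions]


chosen = pps_systematic_sample(hospital_patients, a=5, rng=random.Random(2024))
print(sorted(chosen))
```

No hospital here has more patients than the sampling interval of 2,000, so no hospital can be picked twice. A cluster larger than the interval could be, and that case would need separate handling.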