=======================================================================
SAMPLING: AN INTRODUCTION
NUMERICAL PRACTICE EXERCISES
(notes to the attached Excel tables)
=======================================================================

Vijay Verma
University of Siena
Siena, October 2014


Numerical practice (1)

NOTES

This classroom exercise concerns numerical illustrations of simple random, clustered (multi-stage) and stratified sampling. It is recommended that the students work in pairs.

To begin with, from a given population we will select simple random samples of elements of various sizes to illustrate how the distribution of sample means varies with sample size. We will also illustrate how the variability among elements in any given sample estimates the population variance.

1 The population

(1) In order to facilitate quick selection of many random samples, we will employ a simple well-mixed population with known characteristics, in which units with different values appear in an entirely random order, so that any arbitrary set of units can be regarded as a random sample. Our example is artificial in two respects:

- We actually know the entire population (the $Y_j$ values vary in the range 0 to 9).
- The population is thoroughly mixed, at least in relation to values of the variable of interest, i.e. the $Y_j$ values appear in an entirely random order.

(2) Theoretically, the population parameters are:

\[ \bar{Y} = \frac{1}{10}\sum_{j=1}^{10} Y_j = 4.500; \qquad \sigma^2 = \frac{1}{10}\sum_{j=1}^{10} (Y_j - 4.5)^2 = 8.25; \qquad \sigma = 2.87 \]

Our illustrative population is shown in tab “40x40 random digits 0-9”. Let us assume that these digits represent values of some variable Y, the average value of which we are interested in estimating. Our table of 1,600 digits is of limited size and not perfect.
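The theoretical parameters quoted above can be checked directly. The following is a minimal sketch, assuming an idealised population in which each digit 0 to 9 occurs equally often; the variable names are illustrative only.

```python
# Idealised population: the ten digits 0..9, each equally frequent.
values = list(range(10))

# Population mean, variance (divisor N) and standard deviation.
mean = sum(values) / len(values)
variance = sum((y - mean) ** 2 for y in values) / len(values)
std_dev = variance ** 0.5

print(f"mean     = {mean:.3f}")      # 4.500
print(f"variance = {variance:.3f}")  # 8.250
print(f"std dev  = {std_dev:.2f}")   # 2.87
```

A finite table of 1,600 random digits will generally deviate a little from these values.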
The table compares as follows with the theoretical values above (see tab “frequency distribution”):

Comparison of frequencies in the 1600 digit table with the expected frequencies

Frequency distribution of digits

digit          0    1    2    3    4    5    6    7    8    9    mean   sigma^2    S^2
theoretical  160  160  160  160  160  160  160  160  160  160   4.500    8.250   8.250
actual       151  162  161  144  176  169  169  170  154  144   4.498    7.940   7.945

2 Simple random sampling

For illustrative purposes, we construct six sets of simple random samples:

Sample “design”                         Sample size            Number of samples
(1r) one quarter of each row            n = (1x10) = 10        160
(1c) one quarter of each column         n = (10x1) = 10        160
(2)  each (4x4) square                  n = (4x4) = 16         100
(3r) each whole row                     n = (1x40) = 40         40
(3c) each whole column                  n = (40x1) = 40         40
(4)  each (10x10) square                n = (10x10) = 100       16

3 Illustrations of simple random sampling

In each case, (1r)-(4), the above-mentioned samples amount to a very small subset of all possible samples of a given size that can be drawn from the population. Many more can be easily created.

The task is to select the different types of samples as specified above, and comment on the basic characteristics (mean, variance, standard deviation, minimum and maximum values) of the sampling distributions obtained. Since in each example only a very small set of all possible samples is considered, the results are subject to some random variability.


Numerical practice (2)

NOTES

This classroom exercise concerns numerical illustrations of clustered (multi-stage) sampling. It is recommended that the students work in pairs.

1 The population

In Numerical practice (1), from a given population we selected simple random samples of elements of various sizes to illustrate how the distribution of sample means varies with sample size. As a reminder, our illustrative population is merely a table of 40x40 random digits 0-9. Hence any arbitrary set of units can be regarded as a random sample.
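As a reminder of how this property was used in Numerical practice (1), the sketch below tiles a 40x40 table with the six block “designs” and compares the spread of the block means with the simple random sampling value (1-f)S²/n. It is only a sketch under stated assumptions: the table is simulated with a seeded pseudo-random generator (the actual Excel table is not reproduced here), and `block_means` and `designs` are hypothetical names introduced for illustration.

```python
import random
import statistics

random.seed(2014)  # arbitrary seed, for reproducibility only (assumption)
ROWS = COLS = 40
# Simulated stand-in for the tab "40x40 random digits 0-9".
table = [[random.randint(0, 9) for _ in range(COLS)] for _ in range(ROWS)]

def block_means(table, height, width):
    """Means of the non-overlapping (height x width) blocks tiling the table."""
    means = []
    for top in range(0, ROWS, height):
        for left in range(0, COLS, width):
            block = [table[r][c]
                     for r in range(top, top + height)
                     for c in range(left, left + width)]
            means.append(sum(block) / len(block))
    return means

population = [y for row in table for y in row]
N = len(population)
S2 = statistics.variance(population)  # divisor N - 1

# Block shape (rows, columns) for each sample "design".
designs = {"1r": (1, 10), "1c": (10, 1), "2": (4, 4),
           "3r": (1, 40), "3c": (40, 1), "4": (10, 10)}

for name, (h, w) in designs.items():
    means = block_means(table, h, w)
    n = h * w
    expected = (1 - n / N) * S2 / n         # Var(ybar) under SRS
    observed = statistics.pvariance(means)  # spread among the block means
    print(f"({name}) n={n:3d} samples={len(means):3d} "
          f"observed={observed:.4f} expected={expected:.4f}")
```

Because each design uses only a small set of all possible samples, the observed and expected values should agree only roughly, and more closely for designs with many samples.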
We assume that these digits represent values of some variable Y, the average value of which we are interested in estimating. The population distribution of Y values and their mean and variance is as follows.

Comparison of frequencies in the 1600 digit table with the expected frequencies

Frequency distribution of digits

digit          0    1    2    3    4    5    6    7    8    9    mean   sigma^2    S^2
theoretical  160  160  160  160  160  160  160  160  160  160   4.500    8.250   8.250
actual       151  162  161  144  176  169  169  170  154  144   4.498    7.940   7.945

2 Clusters with different degrees of homogeneity

Here we have considered the 100 square clusters of size (4x4). Clusters with five different degrees of homogeneity have been formed:

(1) Entirely random clusters.
(2) For each set of 4 columns of digits, the first column is sorted by increasing value of Y. Such sorting is applied to the left half of the table of (40x40) digits. The (4x4) clusters are formed in the normal way, but using the set of digits sorted as above.
(3) As above, except that the ‘1 column in 4’ sorting is applied to the whole table.
(4) The sorting is applied to the first two columns of each set of 4 columns of digits in the left half of the table.
(5) The above ‘2 column in 4’ sorting is applied to the whole table.

If a simple random sample of $a$ clusters (each of size $n$) were selected from a population of $A$ clusters, the variance of the mean for this clustered sample would be

\[ \mathrm{Var}(\bar{y}_A) = (1-f)\,\frac{S_A^2}{a} = \frac{A-a}{A}\cdot\frac{S_A^2}{a}, \quad\text{with}\quad S_A^2 = \frac{\sum_k (\bar{Y}_k - \bar{Y})^2}{A-1}, \quad f = \frac{a}{A}. \]

When the clusters are formed by entirely random grouping as in (1), the variance would be identical to that for a simple random sample of elements of the same size, i.e. of SRS with $(a \cdot n)$ elements: $\mathrm{Var}_1(\bar{y}_A) = \mathrm{Var}_0(\bar{y})$, giving

\[ \mathrm{Var}_0(\bar{y}) = (1-f)\,\frac{S^2}{a\,n} = \frac{N - a\,n}{N}\cdot\frac{S^2}{a\,n} = \frac{A-a}{A}\cdot\frac{S^2/n}{a}, \quad\text{hence}\quad S_{1A}^2 = \frac{S^2}{n}. \]

In schemes (2)-(5), clusters correspond to increasingly homogeneous groupings of elements. Greater homogeneity within clusters implies greater variability between clusters.
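The effect of the sorting schemes on the between-cluster variance $S_A^2$ can be sketched as follows. This is an illustration under assumptions rather than a reproduction of the Excel tables: the 40x40 table is simulated with a seeded generator, and the helper names `sort_columns` and `cluster_means` are hypothetical.

```python
import random
import statistics

random.seed(2014)  # arbitrary seed; the real Excel table is not used (assumption)
SIZE, BLOCK = 40, 4
table = [[random.randint(0, 9) for _ in range(SIZE)] for _ in range(SIZE)]

def sort_columns(table, cols_per_set, last_col):
    """Copy of the table in which the first `cols_per_set` columns of every
    set of 4 columns, among columns 0..last_col-1, are sorted by increasing Y."""
    new = [row[:] for row in table]
    for c in range(last_col):
        if c % BLOCK < cols_per_set:
            column = sorted(row[c] for row in table)
            for r in range(SIZE):
                new[r][c] = column[r]
    return new

def cluster_means(table):
    """Means of the 100 non-overlapping (4x4) clusters."""
    means = []
    for top in range(0, SIZE, BLOCK):
        for left in range(0, SIZE, BLOCK):
            cells = [table[r][c]
                     for r in range(top, top + BLOCK)
                     for c in range(left, left + BLOCK)]
            means.append(sum(cells) / len(cells))
    return means

# scheme -> (columns sorted per set of 4, number of leading table columns affected)
schemes = {1: (0, 0), 2: (1, 20), 3: (1, 40), 4: (2, 20), 5: (2, 40)}

n = BLOCK * BLOCK
S2 = statistics.variance([y for row in table for y in row])  # unchanged by sorting

results = {}
for k, (cols, last) in schemes.items():
    # S_A^2: variance of cluster means with divisor A - 1
    results[k] = statistics.variance(cluster_means(sort_columns(table, cols, last)))
    print(f"scheme ({k}): S_A^2 = {results[k]:.3f}   S^2/n = {S2 / n:.3f}")
```

For the random scheme (1) the two printed quantities should be close to each other; for the sorted schemes $S_A^2$ should be visibly larger.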
Hence the cluster means in populations (2)-(5) are increasingly more diverse compared to cluster means in (1):

\[ S_{5A}^2 > S_{4A}^2 > S_{3A}^2 > S_{2A}^2 > S_{1A}^2 = \frac{S^2}{n}. \]

Generally, with $n \gg 1$, $S_{iA}^2 > S^2/n$, except in the ‘artificial’ case (1).

With $\mathrm{Var}_k(\bar{y}_A) = \frac{A-a}{A}\cdot\frac{S_{kA}^2}{a}$, the above gives

\[ \mathrm{Var}_k(\bar{y}_A) > \mathrm{Var}_{k-1}(\bar{y}_A), \qquad k = 2, \dots, 5, \]

with $\mathrm{Var}_1(\bar{y}_A) = \mathrm{Var}_0(\bar{y})$ for the random clusters of case (1).

3 Design effects

A most useful concept for the computation, analysis and interpretation of sampling errors is the design effect. The design effect is the ratio of the variance, $\mathrm{Var}_k(\bar{y}_A)$, under the given sample design, to the variance, $\mathrm{Var}_0(\bar{y})$, under a simple random sample of the same size:

\[ d^2 = \frac{\mathrm{Var}_k(\bar{y}_A)}{\mathrm{Var}_0(\bar{y})}. \]

Computing design effects requires the additional step of estimating the variance or sampling error under simple random sampling, apart from its estimate under the actual design.

Proceeding from standard errors to design effects is essential for understanding the patterns of variation and determinants of the magnitude of the error, for smoothing and extrapolating the results for diverse statistics and population subclasses, and for evaluating the performance of the sampling design. Analysing design effects into components helps to better understand where inefficiencies of the sample arise, and to identify patterns of variation.

In the current illustration of random sampling of clusters, we have

\[ \mathit{deft}_k^2 = \frac{\mathrm{Var}_k(\bar{y}_A)}{\mathrm{Var}_0(\bar{y})} = \frac{S_{kA}^2}{S^2/n}, \qquad k = 1, \dots, 5, \]

\[ \mathit{deft}_{5A}^2 > \mathit{deft}_{4A}^2 > \mathit{deft}_{3A}^2 > \mathit{deft}_{2A}^2 > \mathit{deft}_{1A}^2 = 1. \]


Numerical practice (3)

NOTES

This classroom exercise concerns numerical illustrations of stratified sampling, with simple random sampling within strata. It is recommended that the students work in pairs.

1 The population

In Numerical practice (1), from a given population we selected simple random samples of elements of various sizes to illustrate how the distribution of sample means varies with sample size. As a reminder, our illustrative population is merely a table of 40x40 random digits 0-9. Hence any arbitrary set of units can be regarded as a random sample.
We assume that these digits represent values of some variable Y, the average value of which we are interested in estimating. The population distribution of Y values and their mean and variance is as follows.

Comparison of frequencies in the 1600 digit table with the expected frequencies

Frequency distribution of digits

digit          0    1    2    3    4    5    6    7    8    9    mean   sigma^2    S^2
theoretical  160  160  160  160  160  160  160  160  160  160   4.500    8.250   8.250
actual       151  162  161  144  176  169  169  170  154  144   4.498    7.940   7.945

2 Stratification

If sample selection and estimation are done separately within each stratum, the basic expression

\[ \mathrm{Var}(\bar{y}) = (1-f)\,\frac{S^2}{n} \quad\text{with}\quad S^2 = \frac{\sum_j (Y_j - \bar{Y})^2}{N-1} \]

applies to each stratum. Using subscript $h$ to refer to a particular stratum, we have with SRS within strata:

\[ \mathrm{Var}(\bar{y}_h) = (1-f_h)\,\frac{S_h^2}{n_h} \quad\text{with}\quad S_h^2 = \frac{\sum_j (Y_{hj} - \bar{Y}_h)^2}{N_h-1}, \]

summed over the $N_h$ units in stratum $h$.

In putting together the results from different strata, we often do that in proportion to stratum size, e.g. with weights $W_h = N_h/N$. For the total population

\[ \bar{Y} = \sum_h W_h \bar{Y}_h, \]

and if the $W_h$ are known,

\[ \bar{y} = \sum_h W_h \bar{y}_h \quad\text{and}\quad \mathrm{Var}(\bar{y}) = \sum_h W_h^2\,\mathrm{Var}(\bar{y}_h). \]

For simplicity in our illustrations, we consider the population as divided into $H$ strata equal in population as well as in sample size ($W_h = N_h/N = n_h/n = 1/H$). With this, it follows from the above expression that with SRS within each stratum, the variance can be written as

\[ \mathrm{Var}(\bar{y}) = \frac{1-f}{n}\cdot\frac{\sum_h S_h^2}{H}. \]

Comparing this with unstratified SRS, the effect of stratification is in proportion to the ratio of the average within-stratum variance $\bar{S}^2 = \sum_h S_h^2 / H$ to the unstratified value $S^2$.

We may decompose the total variance into variation within strata and variation among the strata means:

\[ \sum_h \sum_j (Y_{hj} - \bar{Y})^2 = \sum_h \sum_j (Y_{hj} - \bar{Y}_h)^2 + \sum_h \sum_j (\bar{Y}_h - \bar{Y})^2, \]

or dividing by $N$, we may write the above as

\[ \sigma^2 = \bar{\sigma}^2 + \bar{\Delta}^2, \]

where the first term on the right is the within-stratum and the second term the between-strata component. $\bar{\Delta}^2$ is the mean squared deviation of the stratum means.

The proportionate reduction in variance from stratification is approximately

\[ \frac{\bar{\Delta}^2}{\sigma^2} = \frac{\sum_h \sum_j (\bar{Y}_h - \bar{Y})^2}{\sum_h \sum_j (Y_{hj} - \bar{Y})^2} = \frac{\sum_h N_h (\bar{Y}_h - \bar{Y})^2}{\sum_h \sum_j (Y_{hj} - \bar{Y}_h)^2 + \sum_h N_h (\bar{Y}_h - \bar{Y})^2}. \]

The actual gain is slightly smaller due to the (generally minor) difference in the definition of $\sigma$ and $S$. With $H$ strata of equal size $N/H$, it is seen to be

\[ \frac{S^2 - \bar{S}^2}{S^2} = \frac{\bar{\Delta}^2}{\sigma^2} - \frac{H-1}{N-H}\cdot\frac{\bar{\sigma}^2}{\sigma^2}. \]

3 Stratification in multistage sampling

An important point to note is that exactly the same idea applies when we are dealing with clusters rather than elements as the sampling units in a stratified design. The above quantities then refer to the variance of cluster means.

With a given stratification, the deviation between strata means, $\bar{\Delta}^2$, is the same whether element or cluster sampling is used within strata. By contrast, $S^2$ or the within-stratum term, $\bar{S}^2$, is usually much smaller for cluster means than that for individual elements (as noted earlier). Hence with cluster sampling, the relative gain from stratification is usually much more appreciable.

4 Numerical exercise

In the example, the set of 40x40 random digits has been divided into two equal strata of 20 rows each. This is done for each of the five sets constructed by sorting some of the columns as described in Numerical practice (2). Populations with five different degrees of ‘gradients’ in Y values have been formed:

(1) Entirely random arrangement of elements.
(2) For each set of 4 columns of digits, the first column is sorted by increasing value of Y. Such sorting is applied to the left half of the table of (40x40) digits.
(3) As above, except that the ‘1 column in 4’ sorting is applied to the whole table.
(4) The sorting is applied to the first two columns of each set of 4 columns of digits in the left half of the table.
(5) The above ‘2 column in 4’ sorting is applied to the whole table.

In (1), the two strata have the same expected mean. Generally, the mean in the lower stratum is larger than that in the upper stratum; the difference between them increases as we go from (2) to (5).

First we examine the effect of stratification assuming a SRS of elements within each stratum.
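One way to set up this comparison is sketched below for scheme (5) only. It is a sketch under assumptions, not the Excel calculation: the table is simulated with a seeded generator, the two strata are the upper and lower 20 rows, and all variable names are illustrative.

```python
import random
import statistics

random.seed(2014)  # arbitrary seed; simulated stand-in for the Excel table
SIZE = 40
table = [[random.randint(0, 9) for _ in range(SIZE)] for _ in range(SIZE)]

# Scheme (5): sort the first two columns of every set of 4 columns, whole table.
for c in range(SIZE):
    if c % 4 < 2:
        column = sorted(table[r][c] for r in range(SIZE))
        for r in range(SIZE):
            table[r][c] = column[r]

strata = [
    [y for row in table[:20] for y in row],  # upper stratum: first 20 rows
    [y for row in table[20:] for y in row],  # lower stratum: last 20 rows
]
population = strata[0] + strata[1]
N, H = len(population), len(strata)

# Decomposition sigma^2 = within + between (all with divisor N).
sigma2 = statistics.pvariance(population)
within = sum(len(s) * statistics.pvariance(s) for s in strata) / N
grand_mean = statistics.mean(population)
between = sum(len(s) * (statistics.mean(s) - grand_mean) ** 2 for s in strata) / N

# Exact comparison in terms of S^2 (divisor N - 1) and the mean of the S_h^2.
S2 = statistics.variance(population)
S2_bar = sum(statistics.variance(s) for s in strata) / H

approx_gain = between / sigma2
exact_gain = (S2 - S2_bar) / S2

print(f"sigma^2 = {sigma2:.3f} = within {within:.3f} + between {between:.3f}")
print(f"approximate proportionate reduction = {approx_gain:.4f}")
print(f"exact proportionate reduction       = {exact_gain:.4f}")
```

The same calculation applied to the 100 cluster means in place of the 1,600 elements gives the corresponding figures for cluster sampling within strata.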
What is the proportionate reduction in variance in each case?

Then we consider it to be a population of 100 square clusters of size (4x4) = 16 elements each. We examine the effect of stratification assuming a SRS of clusters within each stratum. What is the proportionate reduction in variance in each case, and how does it compare with the reduction in the samples of elements?


Numerical practice (4)

NOTES

This classroom exercise concerns numerical illustrations of how the variability among elements in any given sample estimates the population variance. The procedure illustrated applies only in the case of simple random sampling. However, results of this type also apply if the sample is more complex: in those situations as well, variability among elements in a sample estimates the population variance, though not necessarily the sampling error under the actual complex design.

1 Estimating variance from the sample

In our illustrations, the average of the sample means $\bar{y} = \sum_i y_i / n$ is equal to the population mean $\bar{Y}$. We say that the expected value of the former equals the latter: $E[\bar{y}] = \bar{Y}$; i.e. $\bar{y}$ provides an unbiased estimator of $\bar{Y}$.

Furthermore, the variability among elements in any particular sample provides a measure of that variability in the population, i.e.

\[ \frac{\sum_i (y_i - \bar{y})^2}{n} \approx \frac{\sum_j (Y_j - \bar{Y})^2}{N} = \sigma^2, \]

where the summation on the left is over the $n$ elements of the sample and that on the right is over the $N$ population elements.

Actually, for an SRS the exact relationship happens to be $E[s^2] = S^2$, where

\[ s^2 = \frac{\sum_i (y_i - \bar{y})^2}{n-1} \quad\text{and}\quad S^2 = \frac{\sum_j (Y_j - \bar{Y})^2}{N-1}. \]

Hence for a simple random sample $E[\mathrm{var}(\bar{y})] = \mathrm{Var}(\bar{y})$, where $\mathrm{var}(\bar{y}) = (1-f)\,s^2/n$ is estimated from the sample and $\mathrm{Var}(\bar{y}) = (1-f)\,S^2/n$ is its population value.

This is the basis on which we can estimate the variance (a measure of variability among different samples) from the results of the single sample that is available.

2 Wider significance of the results

It is important to note that the variance computed above provides a valid estimate only for simple random sampling. For more complex designs, estimating variance will involve more complex formulae taking into account the complexity of the design.

But interestingly, an important result of sampling theory is that for many complex designs, the relationship $E[s^2] \approx S^2$ still holds approximately. Hence we can use the data from a sample of any complexity to produce an estimate of what the error would be in a simple random sample of the same size. There is an approximation involved, but it is small, generally of the order of $(1/n)$. This is an important statistic since it forms the denominator of the ‘design effect’.


Numerical practice (5). Selecting households in two stages

PANEL (1)

The population

Our sampling frame consists of a list of 30 population census areas (col. 1 of the panel), with the given number of households in each area at the time of the census (col. 2). These are the ‘area measures of size’ (MoS) available in the sampling frame, i.e. for all the areas in the population prior to any sample selection.

Col. (6) gives the actual number of households in each area at the time of the survey. Generally these latter numbers are not known, except in the areas actually selected and enumerated during the survey.

1. First stage selection probability of areas

It is decided to select 6 of these areas (as PSUs), with probability proportional to size (PPS). The measure of size ($S_i$) is the number of households in col. (2).

Complete col. (3): the probability of selection each area will receive. (Complete col. (3) in Panel (1); note that col. (3) in Panel (2) is then completed automatically as well.)

2. Sample take (number of households selected) per area in self-weighting design

Design A consists of two stages: selection of areas, followed by the selection of households in the selected areas. The first stage is as above.
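The first-stage probabilities of col. (3) can be sketched as follows. The size measures below are hypothetical stand-ins for col. (2) of the panel (the actual frame values are in the Excel table and are not reproduced here), and the sketch assumes that no area is so large that its probability would exceed 1.

```python
# Hypothetical census household counts for the 30 areas (assumption: these are
# illustrative values only, not the figures in the Excel panel).
sizes = [120, 85, 60, 150, 95, 70, 110, 55, 130, 90,
         75, 100, 65, 140, 80, 105, 50, 125, 115, 60,
         95, 70, 135, 85, 100, 55, 145, 90, 75, 110]

a = 6  # number of areas (PSUs) to be selected with PPS

total_size = sum(sizes)

# PPS first-stage probability of area i: a * S_i / (sum of all S_i).
first_stage_prob = [a * s / total_size for s in sizes]

for area, (s, p) in enumerate(zip(sizes, first_stage_prob), start=1):
    print(f"area {area:2d}: S_i = {s:3d}  P(select) = {p:.4f}")

# The probabilities add up to the number of areas to be selected.
print(f"sum of probabilities = {sum(first_stage_prob):.1f}")
```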
At the second stage, the target sample size is 10 households per selected area. Within each selected area, households are selected with a UNIFORM PROBABILITY = $10/S_i$. The number of households in the area at the time of the survey ($H_i$) is given in col. (6).

Complete in Panel (1):
Col (4): the sampling rate which would be applied for selecting households within each area, if the area had been selected at the first stage.
Col (5): the OVERALL (after the 2 stages) selection probability of a household in each area.
Col (7): the number of households which would be selected from the area, given col. (6).

PANEL (2)

3. Fixed-take design

Design B also consists of two stages, the first stage exactly as above. At the second stage, the target sample size is FIXED at 10 households per selected area.

Within each selected area, complete in Panel (2):
Col (4): the sampling rate which would be applied for selecting households within each area, if the area had been selected at the first stage.
Col (5): the OVERALL (after the 2 stages) selection probability of a household in each area.
(As before, col. (6) gives the number of households in the area at the time of the survey.)

PANEL (3)

4. Selecting a sample of areas

Select a systematic PPS sample of a = 6 areas, given the frame in cols (1)-(2). For this purpose,
- complete col (8), giving the cumulation of the size measures given in col (2);
- compute the sampling interval to be applied;
- take a random start, select a systematic sample with the interval computed above, and mark with “x” in col (9) the 6 areas selected in your particular sample.

5. Analysis of the selected sample

From the sample of 6 areas you have selected, complete for each SELECTED area:
Col (10): the sample weight to be applied to the area to compute its contribution towards estimating a population total.
Col (11): the contribution of the area towards estimating the total number of households in the population at the time of the survey, given the number of households found in the area during the survey (col 6).

What estimate do you get from the sample for the total number of households in the population?


Numerical practice (6). Further on two-stage sampling

This exercise explains how to treat area units which are either ‘too large’ or ‘too small’ for the purpose of applying the PPS selection scheme in two stages, involving the selection of areas followed by the selection of households within selected areas.

1. The frame

The frame covers a population of small localities. There are 482 localities, containing 3,419 households. There are 10 large localities, 72 of medium size, and many (400) of small size. The details are as follows.

              Count    total MoS     Mean
TOTAL           482        3,419     7.09
Stratum 1        10          303    30.30
Stratum 2        72          919    12.76
Stratum 3       400        2,197     5.49
SUM             482        3,419     7.09

The full list frame is given in sheet “frame”.

The number of households in a locality is used as the measure of size (MoS) for the first stage sampling of a certain specified number (= a) of localities. It is assumed in this example that the actual number of households in the locality at the time of the survey is exactly the same as this number in the frame.

2. First-stage sampling

The objective is to select a self-weighting sample of households in two stages: PPS selection of localities, and then selection of households in each selected locality with inverse-PPS. Since the MoS are accurate, this also gives a constant number of sample households from each selected locality.

With this scheme the sample allocation to a stratum, and hence the number of localities to be selected from it, is in proportion to the total MoS in the stratum.
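The basic proportional allocation can be sketched with the stratum totals from the frame summary above. This is only an illustrative calculation (the function name `proportional_allocation` is hypothetical, and the sheet “first stage sampling” may organise the computation differently); it also flags the cases in which the allocation would exceed the number of localities in a stratum.

```python
# Stratum counts and total measures of size, taken from the frame summary.
strata = {
    "Stratum 1": {"count": 10, "mos": 303},
    "Stratum 2": {"count": 72, "mos": 919},
    "Stratum 3": {"count": 400, "mos": 2197},
}
total_mos = sum(s["mos"] for s in strata.values())  # 3,419 households

def proportional_allocation(a):
    """Number of PSUs per stratum, strictly in proportion to the stratum MoS."""
    return {name: a * s["mos"] / total_mos for name, s in strata.items()}

for a in (50, 100, 150, 200):
    print(f"a = {a}")
    for name, share in proportional_allocation(a).items():
        available = strata[name]["count"]
        note = "  <-- exceeds localities available" if share > available else ""
        print(f"  {name}: allocation {share:6.2f} of {available} localities{note}")
```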
This basic allocation has to be adjusted to take into account the presence of ‘very large’ localities, which limits the total number available for selection (this occurs mainly in Stratum 1, with a few cases in Stratum 2), and the presence of ‘very small’ localities, which occur in Stratum 3.

Sheet “first stage sampling” shows the procedure for different values of the number of PSUs to be selected: a = 50, 100, 150, and 200. Repeat this table for other values of this number, such as a = 175, a = 225, etc.

With a = 50 or a = 100, allocation of the sample areas to be selected in a stratum can be made strictly in proportion to the total measure of size in the stratum. This is because the allocation so determined does not exceed the number of areas available in any of the strata.

However, with increasing sample size (a = the number of areas to be selected), the allocation to the larger strata (i.e. strata containing large localities) can exceed the total number of localities existing in the stratum. The latter quantity is the maximum number of units which can be selected.

We begin with the stratum containing the largest units, and determine the allocation for each subsequent smaller stratum on the basis of what measure of size and what number of sample areas are still available for allocation after having dealt with the larger strata.

3. Two-stage sampling

The sum of size measures (MoS) for the whole population is available from the frame. Let $N$ be this sum, and $N_i$ the measure of size of area (PSU) $i$.

According to the required design, let $a$ be the number of PSUs to be selected; then the PPS sampling interval for the selection of PSUs is $I = N/a$.

The other design parameter is to fix the target sample size as $n$ households, or the overall sampling rate for a self-weighting sample $f$, the two being related as $f = n/N$.

The target sample size per sample PSU is $b = n/a = f \cdot I$.
The first-stage sampling rate is $f_1 = N_i/I$; and assuming perfect size measures (i.e., exactly equalling the actual unit sizes), the second-stage sampling rate is $f_2 = b/N_i$, giving, as expected, the overall sampling rate $f = f_1 f_2 = b/I$.

In actual application, neither of the probabilities $f_1$ or $f_2$ can exceed 1. Hence we have three cases:

(1) ‘Very large’ units, meaning with size $N_i \ge I$. For these we have $f_1 = 1$; $f_2 = f$ (meaning that the area is automatically included in the sample).
(2) ‘Normal’ units with $f_1 = N_i/I$; $f_2 = b/N_i$.
(3) ‘Very small’ units, meaning with size $N_i \le b$. For these we have $f_1 = f$; $f_2 = 1$ (meaning that all households in a selected area are taken into the sample).

Since with the above adjustments to the selection probabilities the overall selection probability of any unit, $f$, remains unchanged, we still obtain the target sample size $n = f \cdot N$. However, the number of PSUs coming into the sample may be disturbed, i.e. may come out to be $a' \ne a$, the target number.

For a given $f$ and the condition that no probability can exceed 1, we cannot adjust (1) or (3). We can adjust the two probabilities in (2) to obtain the required number, $a$, of selections of PSUs, while still keeping the overall unit selection probability $f$ unchanged:

\[ f_1 = \frac{N_i}{kI}; \qquad f_2 = \frac{kb}{N_i}. \]

Factor $k$ is given by:

\[ k = \frac{a_2'}{a_2} = \frac{a' - (a_1 + a_3)}{a - (a_1 + a_3)}, \]

where $(a_1, a_2, a_3)$ are, respectively, the units to be selected from parts (1)-(3).


(7) Sampling establishments with PPS

1 Sampling with probability proportional to employment in the establishment

Attached is a frame consisting of 100 establishments from six different sectors of activity, with information on employment, output, input and investment in each establishment.

Suppose that a sample of 20 establishments is to be selected, with employment as the measure of size. Total employment in the 100 establishments is 18,884. This gives a PPS sampling interval of I = 18,884/20 ≈ 944.
Three of the establishments have more than this number of employees. These are taken into the sample with certainty. Dividing the remaining MoS (employment) by 17, the number of units still to be selected, gives selection interval I = 668. One additional establishment has MoS greater than this new sampling interval, so that is also taken into the sample with certainty. The final sampling interval is I = 667 for selecting a further 16 units from the remaining 96 units, all of which have MoS smaller than this interval.

2 Sampling with probability proportional to a composite measure

If any of the establishment characteristics other than employment are used as MoS, some of the establishments will have no chance of being included in the sample because the value of the characteristic concerned is reported to be zero.

As an alternative, it may be desired to select establishments with probability proportional to a composite measure which takes into account all four establishment characteristics available in the frame: employment, output, input and investment. For instance, one may take the ratio of each of these measures to its mean value per establishment, and then take the average of the four ratios so computed to obtain a composite measure of establishment size. This gives, by definition, the total MoS as 100, and hence the sampling interval as I = 100/20 = 5.0.

Suppose that the largest 5 establishments in terms of this composite MoS are taken into the sample with certainty. The remaining measure of size is 53.9, giving a sampling interval of 3.60 for selecting 15 more establishments with PPS.

Similarly, if the largest 10 establishments in terms of this composite MoS are taken into the sample with certainty, the remaining measure of size is 34.3, giving a sampling interval of 3.43 for selecting 10 more establishments with PPS.

3 Using rescaled size measures

For a given situation, the scaling of the size measures can be arbitrary.
We can multiply the size measures of all establishments by any constant factor, say $k$, and also multiply the sampling interval by the same factor. The resulting sample design and size will remain unaffected.

This feature can be used for convenience when different sampling intervals have to be used for different parts of the sample. Suppose for instance that the number of units to be selected has been fixed separately for each stratum. In general this would give a different value of the sampling interval for different strata: $I_h = M_h/n_h$, where $M_h$ is the total MoS for stratum $h$ and $n_h$ is the number of units to be selected from it.

Suppose that we rescale the unit size measures in each stratum as $M_h' = k_h M_h$ with $k_h = n_h/M_h$. This gives an identical sampling interval

\[ I_h = \frac{M_h'}{n_h} = \frac{n_h}{M_h}\cdot\frac{M_h}{n_h} = 1 \]

in all strata. The sample can be selected by putting together all the strata and using the same sampling interval (= 1) throughout. The procedure would automatically give the required number of selections in each stratum.

This procedure can be very convenient in selecting stratified samples of establishments from many different sectors.


(8) Sample rotation patterns

Attached are illustrations of four sample rotation patterns over time, such as between monthly samples in a labour force survey. The rotation patterns and the overlaps are as follows.

1. “5-consecutive”

Overlap with current sample

period                                       t-7   t-6   t-5   t-4   t-3   t-2   t-1     t
% overlap                                                 0%   20%   40%   60%   80%
Subsample nos. at period t which overlap
with the sample at period (t-i)                                  5   4-5   3-5   2-5

2. “3(1)2”

period                                       t-7   t-6   t-5   t-4   t-3   t-2   t-1     t
% overlap                                           0%   20%   40%   40%   40%   60%
Subsample nos. at period t which overlap
with the sample at period (t-i)                            6   5,6   5,6   3,5  2,3,6

3. “3(2)2”

Overlap with current sample

period                                       t-7   t-6   t-5   t-4   t-3   t-2   t-1     t
% overlap                                     0%   20%   40%   40%   20%   20%   60%
Subsample nos. at period t which overlap
with the sample at period (t-i)                      7   6,7   6,7     6     3  2,3,7

4. “4(8)4”

Percentage overlap with sample at time t:

period      t-16  t-15  t-14  t-13  t-12  t-11  t-10   t-9   t-8   t-7   t-6   t-5   t-4   t-3   t-2   t-1     t
% overlap      0  12.5    25  37.5    50  37.5    25  12.5     0     0     0     0     0    25    50    75   100

Reference spans marked on the pattern: “two quarters one year apart” and “two months one quarter apart”.

Comment on the variation over time in the overlap achieved in each pattern. Construct the last pattern “4(8)4” by showing all the (16) subsamples it involves.


(9) Illustration of computing design weights, and their use in estimating from the sample

The table below provides a simple illustration of estimation from the sample with units selected with PPS. The column headings and explanation are provided at the bottom of the table.

1 A PPS sample

Column (3) gives the cumulation of the size measures, and an example of the sample selected. We begin with a random number from 1 to $I$ (= 100); let this number be $r = 54$. The first unit selected is the one for which the cumulated size measure equals or exceeds $r = 54$ for the first time; the next unit selected is the one for which the cumulated size measure equals or exceeds $r + I = 154$; the third unit is the one for which the cumulated size measure equals or exceeds $r + 2I = 254$; and the fourth unit is the one for which the cumulated size measure equals or exceeds 354.

2 Design weights

The design weights are introduced to compensate for differences in the probabilities of selection into the sample. Each ultimate unit in the sample is weighted in inverse proportion to the probability with which it was selected. If $p_j$ is the overall sampling probability of unit $j$, and $n$ the number of units successfully enumerated in the sample, the design weights $w_j^{(A)}$ are:

\[ w_j^{(A)} = \left( \frac{n}{\sum 1/p_j} \right) \cdot \frac{1}{p_j}, \]

where the sum $\sum$ is over the $n$ units in the sample. The factor in parentheses is simply a constant, chosen to scale the average value of the weight per unit enumerated in the survey to be 1.0, since $\sum w_j^{(A)} = n$. Such scaling can be convenient in practice.
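The selection rule and the weight formula can be traced in a short sketch. It uses the unit size measures and survey values listed in columns (1) and (7) of the illustration table in these notes, with $I = 100$ and $r = 54$; the function name `systematic_pps` is hypothetical, and because the table entries are rounded, results may differ from the printed columns in the last digit.

```python
# Unit size measures, column (1) of the illustration table (units 1..10).
sizes = [27.1, 41.3, 51.6, 33.5, 46.5, 39.4, 45.0, 31.2, 38.5, 45.9]
# Values of the survey variable, column (7) of the same table.
values = [5.2, 6.7, 15.2, 5.9, 2.3, 9.4, 8.3, 3.4, 5.6, 4.4]

interval, start = 100, 54  # sampling interval I and random start r

def systematic_pps(sizes, interval, start):
    """Indices (0-based) of units whose cumulated size first reaches
    start, start + interval, start + 2*interval, ..."""
    selected, target, cumulated = [], start, 0.0
    for index, size in enumerate(sizes):
        cumulated += size
        while cumulated >= target:  # a unit larger than I could be hit twice
            selected.append(index)
            target += interval
    return selected

selected = systematic_pps(sizes, interval, start)
n = len(selected)

probabilities = [sizes[i] / interval for i in selected]  # p = s / I
weights = [1 / p for p in probabilities]                 # w = 1 / p
scale = n / sum(weights)                                 # constant factor
scaled_weights = [scale * w for w in weights]            # average weight 1.0

for i, p, w, ws in zip(selected, probabilities, weights, scaled_weights):
    print(f"unit {i + 1}: p = {p:.3f}  w = {w:.2f}  scaled w = {ws:.3f}  "
          f"contribution to total = {w * values[i]:.1f}")
```

With these inputs the rule picks units 2, 5, 7 and 9, as marked in the table.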
Superscript (A) has been used to indicate that the reference is to design weights (as distinct from weights arising from other sources, such as non-response and calibration). Note also that the weights in the above equation are not affected by the scaling of the $p_j$ values: the $p_j$ values need to be only proportional to (within some arbitrary constant), rather than equal to, the unit selection probabilities.

3 A simple illustration of calculating design weights

Let us assume that the survey units are households, with the size measures in column (1) of the table being some function of household size and composition. For the sample of $n = 4$ units selected, the table illustrates the procedure for computing design weights.

Illustration of computing design weights, and their use in estimating from the sample

                 (1)     (2)      (3)   Selected      (4)     (5)     (6)     (7)     (8)
                                        r=54, I=100
Stratum (1)            200.0
  unit 1        27.1             27.1                                         5.2
  unit 2        41.3             68.4      X        0.413    2.42   1.031     6.7    16.2
  unit 3        51.6            120.0                                        15.2
  unit 4        33.5            153.5                                         5.9
  unit 5        46.5            200.0      X        0.465    2.15   0.917     2.3     5.0
Stratum (2)            200.0                                                 31.1    33.0
  unit 6        39.4            239.4                                         9.4
  unit 7        45.0            284.4      X        0.450    2.22   0.947     8.3    18.5
  unit 8        31.2            315.6                                         3.4
  unit 9        38.5            354.1      X        0.385    2.60   1.105     5.6    14.5
  unit 10       45.9            400.0                                         4.4
Total                  400.0                        1.712    9.39   4.000

Sampling interval for systematic PPS selection: I = 100, with desired number of selections a = 4

Notes to columns
(1) unit size measures (scaled to obtain 2 selections in each stratum)
(2) size measures, total by stratum
(3) cumulation of (1), and indicator of whether a unit is selected (interval I = 100; assumed random number between 1-100, r = 54)
(4) selection probability (p = s/I, where s is the unit size measure in (1))
(5) sample weight (w = 1/p)
(6) rescaled sample weight to average 1.0 per household
(7) value of the variable to be estimated in the frame (normally not available)
(8) estimate of the value from the units in the sample


(10) Poisson and Sequential sampling

1. Poisson sampling

Poisson sampling is a simple and flexible procedure which we can emulate in choosing an appropriate sampling method, keeping its merits to the extent possible, and minimising its limitations. The procedure is as follows.

The method subjects each unit $i$ independently to selection with its assigned probability $P_i$. This can be achieved by assigning to each unit in the frame a random number $r_i$ from the uniform distribution on [0,1). The unit is included in the sample if $r_i \le P_i$, and not included otherwise.

2. Sequential sampling

This is a variation on the Poisson sampling procedure aimed at controlling sample size with variable selection probabilities.

Consider the basic Poisson sampling procedure in which a unit is in the sample if the random number selected for it is at or below its probability of selection, $r_i \le P_i$. In terms of a transformed variable $x_i = r_i/P_i$, the same result is obtained by taking a unit into the sample if $x_i \le 1$. Hence a straightforward method is to arrange the units in order of increasing $x_i$ and take a certain number of units from the top of the ordered list into the sample. The specified selection probabilities are applied if we select up to the last unit for which $x_i \le 1$.

But instead, in Sequential Poisson sampling, exactly $n$ units from the top of the ordered list are taken into the sample, where $n$ is the fixed required sample size. That is, the sample is defined by $\mathrm{rank}(x_i) \le n$.

This has to be done separately within each stratum, or within groups of strata where it suffices to control the sample size within groups only. In the presence of many small strata, appropriate grouping of the strata may in fact be the only practical way of keeping the system manageable.

The attached table is based on the table in Note (7). It illustrates the application of the above two procedures.

Poisson sampling controls the unit selection probabilities, but the number of units selected can vary.
Sequential sampling controls the number of units selected in each stratum, but the selection probabilities depart from strict PPS sampling.

Repeat the illustrative selection by using a new set of random numbers.
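The two procedures can be sketched side by side on a small frame. This is a minimal sketch under assumptions: the size measures are hypothetical (they are not the establishment data of Note (7)), the selection probabilities are taken as proportional to size with none exceeding 1, a single stratum is used, and the random numbers come from a seeded generator so that a new seed gives a new illustrative selection.

```python
import random

random.seed(7)  # arbitrary seed; change it to repeat with new random numbers

# Hypothetical size measures for a single stratum of 20 units (assumption).
sizes = [12, 45, 7, 30, 22, 60, 15, 9, 38, 27,
         5, 50, 18, 33, 11, 41, 25, 8, 36, 20]
n_target = 5  # required (expected) sample size

total = sum(sizes)
# Selection probabilities proportional to size; assumes every P_i <= 1.
probabilities = [n_target * s / total for s in sizes]

# One uniform random number on [0, 1) per unit.
randoms = [random.random() for _ in sizes]

# Poisson sampling: unit i is selected if r_i <= P_i (sample size varies).
poisson_sample = [i for i, (r, p) in enumerate(zip(randoms, probabilities))
                  if r <= p]

# Sequential Poisson sampling: order by x_i = r_i / P_i and take exactly
# the n_target units with the smallest x_i.
x = [r / p for r, p in zip(randoms, probabilities)]
order = sorted(range(len(sizes)), key=lambda i: x[i])
sequential_sample = sorted(order[:n_target])

print("Poisson sample   :", [i + 1 for i in poisson_sample],
      f"(size {len(poisson_sample)})")
print("Sequential sample:", [i + 1 for i in sequential_sample],
      f"(size {len(sequential_sample)})")
```

The sequential sample always has exactly `n_target` units, while the Poisson sample size changes from one set of random numbers to the next.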