Methods of Large-Scale Sample Surveys

Dr. Md. Anwar Hossain Talukder
Ex-faculty, RU and DU

Abstract

For multi-stage sampling with unequal clusters in large-scale sample surveys, the two main methods of selecting clusters are: (i) equal probability selection of clusters without replacement and (ii) selection of clusters by probability proportional to size (PPS) systematic sampling without replacement. The first method was described by Talukder (1984); the second, which is more useful in practice, is discussed here along with simple methods of estimation.

It sometimes becomes necessary to collect nationwide or region-wide information on some important matter. For this, large-scale sample surveys are needed. These surveys are usually complex, but simple methods of estimation are obtained under a few simplifying assumptions.

1. Multi-stage sampling

For a large population, a sampling frame of the ultimate sampling units (elements) for which data are to be collected is not usually available. But lists of groups of elements, called clusters, are usually available from the reports of the human population census, agriculture census, industry census, etc. In multi-stage sampling, clusters are selected at different stages with the help of such lists of clusters. Elements are usually selected at the last stage. Sometimes clusters are also selected at the last stage.

One advantage of multi-stage sampling is that the cost of the survey is reduced because sampling units are grouped in smaller areas (clusters), but standard errors of estimates increase for the same reason. In practice, 2-stage or 3-stage sampling, with or without stratification, is usually enough for such surveys.

2. Methods of selection of clusters in multi-stage sampling

For large populations, natural clusters are usually of unequal size, where size is the number of elements in the cluster. The following are a few methods of selecting unequal clusters in a multi-stage sample.
(i) Selection of unequal clusters at all stages with equal probability without replacement

This is a simple method in which clusters at all stages and elements at the last stage are selected by simple random sampling or, equivalently, by simple systematic sampling assuming a random order of clusters or elements. Because of variation in the sizes of clusters, this method usually produces larger standard errors. To reduce wide variations in size, larger clusters are split and smaller ones combined so that standard errors become smaller. Simple estimates and their simple estimated standard errors were obtained by Talukder (1984) under a few simple conditions.

(ii) With-replacement selection of unequal clusters at the first stage

For this procedure, primary sampling units (PSUs) are selected at the first stage with replacement with probability proportional to size (PPS), along with any method of selection of clusters and elements at the subsequent stages. This method provides a single-component expression for an unbiased estimate of the variance of an estimate. In this method, any PSU may be selected more than once, resulting in a loss of precision of the estimates.

(iii) Selection of unequal clusters with PPS systematic sampling method

This is a practically more useful method of selecting a multi-stage sample. In this case, clusters at each stage are selected by PPS systematic sampling without replacement. For a small sampling fraction at the first stage, standard errors are computed by the with-replacement formula. This method usually produces smaller standard errors and gives firm control over sample sizes.

3. Estimates for PPS systematic selection of unequal clusters in multi-stage samples

Selection of clusters at every stage except the last one is made here with the help of cumulative measures of size or estimated measures of size.
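
As an illustration of selection with cumulative measures of size, the following is a minimal Python sketch of one common form of PPS systematic sampling: a random start within the sampling interval, then equally spaced selection points located on the cumulative size scale. The function name `pps_systematic`, its parameters and the example sizes are hypothetical and do not come from the paper; the sketch also assumes that no cluster is larger than the sampling interval, so that no cluster is hit twice.

```python
import random
from itertools import accumulate


def pps_systematic(sizes, n, rng=random):
    """Select n clusters by PPS systematic sampling (illustrative sketch).

    sizes : measures of size of the clusters, in list order
    n     : number of clusters to select
    rng   : object with a random() method returning a value in [0, 1)

    Returns the indices of the selected clusters in list order.
    Assumption: every size is smaller than the sampling interval,
    otherwise the same cluster can be selected more than once.
    """
    total = sum(sizes)
    interval = total / n                 # sampling interval on the size scale
    start = rng.random() * interval      # random start in [0, interval)
    cumulative = list(accumulate(sizes))  # cumulative measures of size

    selected = []
    idx = 0
    for h in range(n):
        point = start + h * interval     # h-th selection point
        # Move to the cluster whose cumulative range contains the point.
        while idx < len(cumulative) - 1 and cumulative[idx] <= point:
            idx += 1
        selected.append(idx)
    return selected


if __name__ == "__main__":
    # Hypothetical cluster sizes (e.g. households per Mouza).
    sizes = [120, 80, 150, 60, 200, 90, 110, 70, 130, 100]
    print(pps_systematic(sizes, n=3, rng=random.Random(2024)))
```

Under the stated assumption, a cluster of size M is selected with probability nM divided by the total size, which is the PPS property used in the estimates that follow.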
At the last stage, elements are usually selected by simple systematic sampling without replacement, after making a list of the elements of the clusters selected at the last-but-one stage of sampling.

(i) Estimate of population total

For k-stage sampling, let $y_{ij\ldots tu}$ be the observation of variable y for a sample element of a cluster at the last stage of sampling, and let $Y_{ij\ldots tu}$ be the measure of variable y for an element in the population.

Let $n_1, n_{2i}, n_{3ij}, \ldots, n_{k-1,\,ij\ldots l}$ be the sample sizes of clusters at the 1st, 2nd, 3rd, ..., (k-1)th stages respectively.

Let $M_{2i}, M_{3ij}, \ldots, M_{kij\ldots t}$ be the population sizes (numbers of elements) of clusters at the 1st, 2nd, ..., (k-1)th stages respectively.

Let

$$p_{2i} = \frac{M_{2i}}{M_0}, \quad p_{3ij} = \frac{M_{3ij}}{M_{2i}}, \quad \ldots, \quad p_{kij\ldots lt} = \frac{M_{kij\ldots lt}}{M_{k-1,\,ij\ldots l}}$$

be the probabilities of selection of clusters at the different stages (first to (k-1)th), with $M_0 = \sum\sum\cdots\sum M_{kij\ldots lt}$ as the total size (number of elements) of the population.

Also let

$$\hat{Y}_{pps} = \frac{1}{n_1}\sum_i \frac{1}{n_{2i}\,p_{2i}} \sum_j \frac{1}{n_{3ij}\,p_{3ij}} \cdots \sum_t \frac{M_{kij\ldots t}}{n_{kij\ldots t}\,p_{kij\ldots t}} \sum_u y_{ij\ldots tu} \qquad (1)$$

be an estimate of the population total $Y = \sum\sum\cdots\sum Y_{ij\ldots tu}$ of variable y, with $n_{kij\ldots t}$ as the sample size of elements at the kth stage of sampling in the (k-1)th-stage clusters.

For any stage of selection of clusters, the samples of the previous stages are assumed here to be fixed. Then, by taking k conditional expectations under this condition, it can be shown that $\hat{Y}_{pps}$ is an unbiased estimate of the population total, Y, of the measures of variable y in the population.

By the same principle,

$$\hat{Y}_i = \frac{1}{n_{2i}} \sum_j \frac{1}{n_{3ij}\,p_{3ij}} \cdots \sum_t \frac{M_{kij\ldots t}}{n_{kij\ldots t}\,p_{kij\ldots t}} \sum_u y_{ij\ldots tu}$$

is an unbiased estimate of the total for the ith PSU, $Y_i = \sum\sum\cdots\sum Y_{ij\ldots tu}$, in the population. Also, $\hat{Y}_i / p_{2i}$ is an unbiased estimate of the population total, Y, by the same principle. Then

$$\hat{Y}_{pps} = \frac{1}{n_1} \sum_i \frac{\hat{Y}_i}{p_{2i}}$$

is the mean of the $\hat{Y}_i / p_{2i}$ and is thus unbiased for Y.

For a small sampling fraction (≤ 0.1) at the first stage, a PPS systematic sample of PSUs may be treated approximately as a PPS sample with replacement.
Hence, by the with-replacement formula, the estimated variance of $\hat{Y}_{pps}$ is given by

$$v(\hat{Y}_{pps}) = \frac{\sum_i \left( \dfrac{\hat{Y}_i}{p_{2i}} - \hat{Y}_{pps} \right)^2}{n_1 (n_1 - 1)} \qquad (2)$$

Cochran (1977) has stated this estimator for 2-stage sampling. For an adequate and stable estimate, the sample size $n_1$ of PSUs at the first stage should not be small, that is, $n_1 \ge 10$. The two mild conditions, (i) a small sampling fraction (≤ 0.1) at the 1st stage and (ii) $n_1 \ge 10$, are likely to be satisfied in large-scale sample surveys of large populations.

(ii) Estimate of population mean

The estimate of the population mean is

$$\hat{\bar{Y}}_{pps} = \frac{\hat{Y}_{pps}}{M_0}$$

with $M_0$ as the total population size, and its estimated variance can be obtained from that of the estimate $\hat{Y}_{pps}$ of the population total, Y, as

$$v(\hat{\bar{Y}}_{pps}) = \frac{v(\hat{Y}_{pps})}{M_0^2}$$

4. Simple method of estimation

For PPS systematic sampling, the probabilities of selection of clusters are $p_{2i} = M_{2i}/M_0$, $p_{3ij} = M_{3ij}/M_{2i}$ and so on.

(i) Estimate of population total

Using these expressions for the probabilities of selection at the different stages and simplifying, the estimated total at (1) becomes

$$\hat{Y}_{pps} = \frac{M_0}{n_1} \sum_i \frac{1}{n_{2i}} \sum_j \frac{1}{n_{3ij}} \cdots \sum_t \frac{1}{n_{kij\ldots t}} \sum_u y_{ij\ldots tu}$$

If the sample sizes at each stage are kept constant, that is, $n_{2i} = n_2$, $n_{3ij} = n_3$, ..., $n_{kij\ldots t} = n_k$ for all i, j, ..., t, then the estimated total becomes

$$\hat{Y}_{pps} = \frac{M_0}{n_1 n_2 \cdots n_k} \sum\sum\cdots\sum y_{ij\ldots tu} = \frac{y}{f} \qquad (3)$$

where $f = (n_1 n_2 \cdots n_k)/M_0 = n/M_0$ is the overall sampling fraction, $y = \sum\sum\cdots\sum y_{ij\ldots tu}$ is the total of all observations in the sample and $n = n_1 n_2 \cdots n_k$ is the total sample size (total number of elements in the sample).

(ii) Estimated variance of estimated total

For equal sample sizes at all stages, the estimate of the ith PSU total divided by $p_{2i}$ becomes

$$\frac{\hat{Y}_i}{p_{2i}} = \frac{M_0}{n_2 n_3 \cdots n_k} \sum\sum\cdots\sum y_{ij\ldots tu} = \frac{n_1 y_i}{f}$$

with $y_i = \sum\sum\cdots\sum y_{ij\ldots tu}$ as the total of all observations in the ith PSU.
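
The simplified total at (3) and the per-PSU values $n_1 y_i / f$ can be checked numerically. The sketch below is only an illustration under the paper's simplifying assumptions (constant sample sizes at every stage, small first-stage sampling fraction, $n_1 \ge 10$). It computes the total from (3) and then applies the with-replacement formula (2) directly to the values $n_1 y_i / f$. The function name `simple_pps_estimates`, its argument names and the data are hypothetical, not taken from any survey described in the paper.

```python
def simple_pps_estimates(psu_totals, elements_per_psu, M0):
    """Simple PPS estimates under constant sample sizes (illustrative sketch).

    psu_totals       : y_i, the sum of the sample observations in each selected PSU
    elements_per_psu : n2 * n3 * ... * nk, the constant number of sample elements per PSU
    M0               : total number of elements in the population

    Returns (estimated total, its estimated variance,
             estimated mean, its estimated variance).
    """
    n1 = len(psu_totals)                # number of PSUs in the sample
    n = n1 * elements_per_psu           # total sample size
    f = n / M0                          # overall sampling fraction
    y = sum(psu_totals)                 # total of all sample observations

    total = y / f                       # equation (3)

    # Per-PSU estimates of the population total: Y_i_hat / p_2i = n1 * y_i / f
    z = [n1 * yi / f for yi in psu_totals]

    # With-replacement variance formula (2) applied to the z values
    var_total = sum((zi - total) ** 2 for zi in z) / (n1 * (n1 - 1))

    mean = total / M0                   # estimate of the population mean
    var_mean = var_total / M0 ** 2      # its estimated variance
    return total, var_total, mean, var_mean


if __name__ == "__main__":
    # Hypothetical data: 10 PSUs, 5 sample elements in each, population of 10,000 elements.
    y_i = [12, 15, 9, 14, 11, 13, 10, 16, 12, 8]
    total, var_total, mean, var_mean = simple_pps_estimates(y_i, 5, 10_000)
    print(total, var_total ** 0.5)      # estimated total and its standard error
    print(mean, var_mean ** 0.5)        # estimated mean and its standard error
```

Only the PSU totals of the sample observations, the constant number of sample elements per PSU and $M_0$ are needed as inputs; no cluster sizes enter the computation once the sample has been drawn.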
Hence, the simple estimated variance of $\hat{Y}_{pps} = \frac{1}{n_1} \sum_i \hat{Y}_i / p_{2i}$ from (2) is

$$v(\hat{Y}_{pps}) = \frac{\sum_i \left( \dfrac{n_1 y_i}{f} - \dfrac{y}{f} \right)^2}{n_1 (n_1 - 1)} = \frac{n_1 \sum_i y_i^2 - y^2}{f^2 (n_1 - 1)} \qquad (4)$$

Formulae (3) and (4) are quite simple under the assumption that sample sizes are constant at the different stages, although the sampling design is quite complex. Also, for $n_1 \ge 10$, the estimated variance is likely to be adequate and stable. A simple estimate of the population mean and its estimated variance can be obtained from (3) and (4) respectively.

5. Estimates when last-stage sampling units are clusters

Sometimes the last-stage sampling units are clusters of elements. As in section 3, clusters are selected here at all (k-1) stages of sampling, and all elements of the (k-1)th-stage clusters are included in the sample for investigation. So there are (k-1) stages of sampling, of clusters only.

Let $Y_{ij\ldots ltu}$ be the measures of the elements in the (k-1)th-stage clusters and let $Y_{ij\ldots lt} = \sum_u Y_{ij\ldots ltu}$ be the total of the measures of all elements in such a cluster. Then the estimated total for Y is given by

$$\hat{Y}_{pps} = \sum_i \frac{1}{n_1 p_{2i}} \sum_j \frac{1}{n_{2i}\,p_{3ij}} \cdots \sum_t \frac{1}{n_{k-1,\,ij\ldots l}\;p_{kij\ldots lt}} \sum_u Y_{ij\ldots ltu}$$

$$= \frac{1}{n_1} \sum_i \frac{1}{n_{2i}\,p_{2i}} \sum_j \frac{1}{n_{3ij}\,p_{3ij}} \cdots \sum_l \frac{1}{n_{k-1,\,ij\ldots l}\;p_{k-1,\,ij\ldots l}} \sum_t \frac{Y_{ij\ldots lt}}{p_{kij\ldots lt}}$$

By taking (k-1) conditional expectations as before, it can be shown that this $\hat{Y}_{pps}$ is an unbiased estimate of the population total, Y, of variable y.

Substituting the expressions for $p_{2i}$, $p_{3ij}$ and so on in terms of the cluster sizes $M_{2i}$, $M_{3ij}$ etc., the estimated total becomes, on simplification,

$$\hat{Y}_{pps} = \frac{M_0}{n_1} \sum_i \frac{1}{n_{2i}} \sum_j \frac{1}{n_{3ij}} \cdots \sum_l \frac{1}{n_{k-1,\,ij\ldots l}} \sum_t \frac{Y_{ij\ldots lt}}{M_{kij\ldots lt}} = \frac{M_0}{n_1 n_2 \cdots n_{k-1}} \sum\sum\cdots\sum \bar{Y}_{ij\ldots lt}$$

assuming that the sample sizes at each stage are constant, that is, $n_{2i} = n_2$, $n_{3ij} = n_3$, ..., $n_{k-1,\,ij\ldots l} = n_{k-1}$ for all i, j, ..., l, with $\bar{Y}_{ij\ldots lt} = Y_{ij\ldots lt} / M_{kij\ldots lt}$ as the mean of the measures of all elements in a (k-1)th-stage cluster.
Then

$$\hat{Y}_{pps} = \frac{\sum_i y_{im}}{f'} = \frac{y_m}{f'}$$

where $y_{im} = \sum\sum\cdots\sum \bar{Y}_{ij\ldots lt}$ is the sum of the means of the measures of elements in the (k-1)th-stage clusters in the ith PSU,

$$y_m = \sum_i y_{im}, \quad n' = n_1 n_2 \cdots n_{k-1} \quad \text{and} \quad f' = \frac{n_1 n_2 \cdots n_{k-1}}{M_0} = \frac{n'}{M_0}$$

is the ratio of the total number of (k-1)th-stage clusters in the sample to the population size, $M_0$.

The simple estimated variance of $\hat{Y}_{pps}$ is then given by

$$v(\hat{Y}_{pps}) = \frac{n_1 \sum_i y_{im}^2 - y_m^2}{f'^2 (n_1 - 1)}$$

The estimate of the population mean, $\bar{Y}$, is

$$\hat{\bar{Y}}_{pps} = \frac{\sum_i y_{im}}{n'} = \frac{y_m}{n'}$$

Its estimated variance is

$$v(\hat{\bar{Y}}_{pps}) = \frac{n_1 \sum_i y_{im}^2 - y_m^2}{n'^2 (n_1 - 1)}$$

6. Applications

(i) Estimation of area and production of Aman paddy in 2 districts

This sample survey was conducted in Satkhira and Rajshahi districts to estimate the area and production of Aman paddy in 2002. The districts were fixed beforehand. In each district, some Mouzas were selected by PPS systematic selection at the first stage of sampling, with their total areas as sizes. At the second stage, 2 plots of cultivated land were selected randomly in each selected Mouza and located in the field with the help of Mouza maps. An area of about 5 acres of cultivated land was demarcated around each selected plot for investigation. Thus the sampling design is a 2-stage cluster sample in each district.

The area of Aman paddy in each selected cluster was obtained by observing the plots of the selected clusters on the spot while the crop was growing. For estimation of the yield rate, crop-cutting experiments were conducted during harvest time on a few small plots in the selected Mouzas with the help of the cultivators of the plots. Estimates were obtained from the data on the plots of land in the sample clusters using the formulae of section 5.

(ii) Survey of life style of aged people in 4 districts

A survey on the life style of aged people (over 60 years of age) was undertaken in 2004 in the rural areas of 4 districts, Dhaka, Comilla, Bogra and Jessore, and in Dhaka city as the urban area.
From the lists of Mouzas/Mohallas in the reports of the 2001 population census, some Mouzas in the rural areas of the 4 districts and some Mohallas in Dhaka city were selected by PPS systematic sampling, with the number of households in a Mouza/Mohalla as its size. To select aged people in each selected Mouza/Mohalla, the Enumeration Areas (EAs) established in connection with the 2001 population census in each district were considered. One EA, having 100 households on average, was selected at random in each Mouza/Mohalla. A list of aged people was prepared on the spot for each selected EA, and all such aged people were included in the sample for interview. Thus a two-stage cluster sample was drawn in the rural areas of each of the 4 districts and also in the urban area of Dhaka city. Estimates were obtained from the data on the aged people in the sample using the formulae provided in section 5.

(iii) Bangladesh agriculture sample survey of farm holdings

This is a large-scale sample survey of farmers for the whole country, conducted in 2005 by the Bangladesh Bureau of Statistics. Since district estimates were needed, districts were considered as strata. In each district, 12% of the Mouzas were selected by PPS systematic sampling at the first stage, with the number of farmers in a Mouza as its size. For each selected Mouza, a list of farm holdings was prepared on the spot and a systematic sample of 30 farmers was selected at the second stage of sampling. So this is a two-stage stratified cluster sample. Using the data collected from the sample farmers, estimates for each district were obtained with the help of the formulae given in section 5 or by a package program. Combined estimates for the whole country were also computed from the district estimates.

References

Cochran, W.G. (1977). Sampling Techniques. Wiley & Sons.

Talukder, M.A.H. (1984). A simple method of estimating standard errors for multi-stage sampling with unequal cluster (equal probability selection). Biometrie-Praximetrie, vol. 24, pp. 127-140.