Methods of Large-Scale Sample Surveys
Dr. Md. Anwar Hossain Talukder
Ex-faculty, RU and DU
Abstract
For multi-stage sampling with unequal clusters in large-scale sample surveys,
two main methods of selection of clusters are: (i) equal probability selection of
clusters without replacement and (ii) selection of clusters with probability
proportional to size (PPS) systematic sampling without replacement. The first
method was described by Talukder (1984) and the second practically more
useful procedure is discussed here along with simple methods of estimation.
It sometimes becomes necessary to collect nationwide or region-wide
information on some important matter. For this, large-scale sample surveys are
needed. These surveys are usually complex but simple methods of estimation
are obtained under a few simplifying assumptions.
1. Multi-stage sampling
For a large population, a sampling frame of the ultimate sampling units (elements),
for which data are to be collected, is not usually available. But lists of groups of
elements, called clusters, are usually available from reports of the population
census, agriculture census, industry census, etc. In multi-stage sampling,
clusters are selected at different stages with the help of such lists of clusters.
Elements are usually selected at the last stage. Sometimes clusters are also
selected at the last stage. One advantage of multi-stage sampling is that the cost
of the survey is reduced because sampling units are grouped in smaller areas
(clusters), but the standard errors of estimates increase for the same
reason. In practice, 2-stage or 3-stage sampling, with or without stratification, is
usually enough for such surveys.
2. Methods of selection of clusters in multi-stage sampling
For large populations, natural clusters are usually of unequal size, where size is
the number of elements in the cluster. The following are a few methods of selecting
unequal clusters in a multi-stage sample.
(i) Selection of unequal clusters at all stages with equal probability without
replacement
This is a simple method where clusters at all stages and elements at the last
stage are selected on simple random sampling basis or equivalently on simple
systematic sampling basis assuming random order of clusters or elements.
Because of variation in sizes of clusters, this method usually produces larger
standard errors. To reduce wide variations in sizes, larger clusters are split
and smaller ones combined so that standard errors become smaller. Simple
estimates and their simple estimated standard errors are obtained by Talukder
(1984) under a few simple conditions.
(ii) With replacement selection of unequal clusters at the first stage
For this procedure, primary sampling units (PSUs) are selected at the first stage
with replacement and with probability proportional to size (PPS), together with any
method of selection of clusters and elements at the subsequent stages. This
method gives a one-component expression for the unbiased estimate of the variance
of an estimate. However, any PSU may be selected more than once,
which results in loss of precision of the estimates.
(iii) Selection of unequal clusters with PPS systematic sampling method
This is a practically more useful method of selecting a multi-stage sample. Here,
clusters at each stage are selected by the PPS systematic sampling
method without replacement. For a small sampling fraction at the first stage,
standard errors are computed by the with-replacement formula. This method
usually produces smaller standard errors and gives firm control over sample sizes.
3. Estimates for PPS systematic selection of unequal clusters in multi-stage
samples
Selection of clusters at every stage except the last one, is made here with the
help of cumulative measures of size or estimated measures of size. At the last
stage, elements are selected usually by simple systematic sampling method
without replacement after making a list of elements of selected clusters of the
last-but-one stage of sampling.
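The selection with cumulative measures of size can be sketched in Python as follows. This is a minimal illustration, not the procedure used in any of the surveys described here: the function name `pps_systematic`, its parameters, and the example sizes are hypothetical, and a fractional sampling interval with a random start is assumed. Selection is without replacement only when no cluster is larger than the sampling interval.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def pps_systematic(sizes, n, start=None, rng=None):
    """Select n clusters by PPS systematic sampling (illustrative sketch).

    sizes : measures of size of the clusters, in list order
    n     : number of clusters to select
    start : optional fixed start in [0, interval); random if omitted
    Returns the list positions (0-based) of the selected clusters.
    """
    rng = rng or random.Random()
    cum = list(accumulate(sizes))      # cumulative measures of size
    total = cum[-1]
    interval = total / n               # sampling interval
    if start is None:
        start = rng.random() * interval
    selected = []
    for h in range(n):
        point = start + h * interval
        # cluster i is hit when cum[i-1] <= point < cum[i]
        idx = bisect_right(cum, point)
        selected.append(min(idx, len(cum) - 1))  # guard against float rounding
    return selected


# Hypothetical sizes of 10 clusters; total 1200, so the interval for n = 4 is 300.
sizes = [120, 80, 200, 150, 60, 90, 110, 140, 70, 180]
print(pps_systematic(sizes, 4, start=150))
```

If a cluster is larger than the interval it can be hit more than once; in practice such clusters are usually split or taken with certainty before selection.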
(i) Estimate of population total
For k-stage sampling, let $y_{ij\dots tu}$ be the observation of variable y for element u of a cluster at the
last stage of sampling, and let $Y_{ij\dots tu}$ be the measure of variable y for an element in the population.
Let $n_1, n_{2i}, n_{3ij}, \dots, n_{k-1,ij\dots l}$ be the sample sizes of clusters at the 1st, 2nd, 3rd, ..., (k-1)th stages respectively.
Let $M_{2i}, M_{3ij}, \dots, M_{kij\dots t}$ be the population sizes (numbers of elements) of clusters
at the 1st, 2nd, ..., (k-1)th stages respectively.
Let $p_{2i} = M_{2i}/M_0$, $p_{3ij} = M_{3ij}/M_{2i}$, ..., $p_{kij\dots lt} = M_{kij\dots lt}/M_{k-1,ij\dots l}$ be the
probabilities of selection of clusters at the different stages (first to (k-1)th),
with $M_0 = \sum\sum\dots\sum M_{kij\dots lt}$ as the total size (number of elements) of the population. Also let

$$
\hat{Y}_{pps} = \frac{1}{n_1}\sum_i \frac{1}{n_{2i}\,p_{2i}} \sum_j \frac{1}{n_{3ij}\,p_{3ij}} \cdots \sum_t \frac{M_{kij\dots t}}{n_{kij\dots t}\,p_{kij\dots t}} \sum_u y_{ij\dots tu} \tag{1}
$$

be an estimate of the population total $Y = \sum\sum\dots\sum Y_{ij\dots tu}$ of variable y, with $n_{kij\dots t}$
as the sample size of elements at the kth stage of sampling in (k-1)th stage clusters.
For any stage of selection of clusters, the samples of the previous stages are assumed
here to be fixed. Then, by taking k conditional expectations under this condition, it
can be shown that $\hat{Y}_{pps}$ is an unbiased estimate of the population total Y of
variable y.
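By the same principle,

$$
\hat{Y}_i = \frac{1}{n_{2i}} \sum_j \frac{1}{n_{3ij}\,p_{3ij}} \cdots \sum_t \frac{M_{kij\dots t}}{n_{kij\dots t}\,p_{kij\dots t}} \sum_u y_{ij\dots tu}
$$

is an unbiased estimate of the total for the ith PSU, $Y_i = \sum\sum\dots\sum Y_{ij\dots tu}$, in the
population.
Also $\hat{Y}_i/p_{2i}$ is an unbiased estimate of the population total Y by the same
principle. Then

$$
\hat{Y}_{pps} = \frac{1}{n_1}\sum_i \frac{\hat{Y}_i}{p_{2i}}
$$

is the mean of the $\hat{Y}_i/p_{2i}$ and is thus unbiased for Y.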
For small sampling fraction (≤0.1) at the first stage, PPS systematic sample of
PSUs may be approximately considered as PPS sampling with replacement.
Hence, by the with-replacement formula, the estimated variance of $\hat{Y}_{pps}$ is given by

$$
v(\hat{Y}_{pps}) = \frac{\sum_i \left(\dfrac{\hat{Y}_i}{p_{2i}} - \hat{Y}_{pps}\right)^2}{n_1(n_1 - 1)} \tag{2}
$$
Cochran (1977) has stated this estimator for 2-stage sampling.
For an adequate and stable estimate, the sample size n1 of PSUs at the first stage
should not be small, that is, n1 ≥ 10. The two mild conditions,
(i) a small sampling fraction (≤ 0.1) at the first stage and (ii) n1 ≥ 10, are likely to be
satisfied in large-scale sample surveys of large populations.
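As an illustration of formula (2), the following Python sketch computes the estimated total and its with-replacement variance from the estimated PSU totals and their selection probabilities. The function name and the numbers in the example are hypothetical; they are not taken from any survey reported here.

```python
def pps_total_and_variance(psu_totals, psu_probs):
    """Estimated total and its variance by the with-replacement formula (2).

    psu_totals : estimated totals Y_hat_i of the sampled PSUs
    psu_probs  : their selection probabilities p_2i
    """
    n1 = len(psu_totals)
    if n1 < 2:
        raise ValueError("at least two PSUs are needed to estimate the variance")
    # Each Y_hat_i / p_2i is itself an estimate of the population total.
    z = [y / p for y, p in zip(psu_totals, psu_probs)]
    y_pps = sum(z) / n1
    variance = sum((zi - y_pps) ** 2 for zi in z) / (n1 * (n1 - 1))
    return y_pps, variance


# Hypothetical example with three PSUs (far fewer than the n1 >= 10 advised above).
total, var = pps_total_and_variance([10, 20, 60], [0.1, 0.2, 0.5])
print(total, var, var ** 0.5)   # estimate, variance, standard error
```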
(ii) Estimate of population mean
The estimate of the population mean is

$$
\hat{\bar{Y}}_{pps} = \frac{\hat{Y}_{pps}}{M_0}
$$

with $M_0$ as the total population size, and its estimated variance can be obtained from that of the estimate $\hat{Y}_{pps}$ of
the population total Y as

$$
v(\hat{\bar{Y}}_{pps}) = \frac{v(\hat{Y}_{pps})}{M_0^2}
$$
4. Simple method of estimation
For PPS systematic sampling, probabilities of selection of clusters are
p2i = M2i/M0, p3ij = M3ij/M2i and so on.
(i) Estimate of population total
Using these expressions for the probabilities of selection at different stages and
simplifying, the estimated total at (1) becomes

$$
\hat{Y}_{pps} = \frac{M_0}{n_1}\sum_i \frac{1}{n_{2i}} \sum_j \frac{1}{n_{3ij}} \cdots \sum_t \frac{1}{n_{kij\dots t}} \sum_u y_{ij\dots tu}
$$

If sample sizes at each stage are kept constant, that is,
$n_{2i} = n_2$, $n_{3ij} = n_3$, ..., $n_{kij\dots t} = n_k$
for all i, j, ..., t, then the estimated total becomes

$$
\hat{Y}_{pps} = \frac{M_0}{n_1 n_2 \cdots n_k} \sum\sum\cdots\sum y_{ij\dots tu} = \frac{y}{f} \tag{3}
$$

where $f = n_1 n_2 \cdots n_k / M_0 = n/M_0$ is the overall sampling fraction, $y = \sum\sum\cdots\sum y_{ij\dots tu}$
is the total of all observations in the sample, and $n = n_1 n_2 \cdots n_k$
is the total sample size (total number of elements in the sample).
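Formula (3) can be sketched in Python as below. The function name `simple_total`, the stage sizes and the population size are assumptions made only for illustration; the total is computed as y·M0/n, which is the same quantity as y/f.

```python
from math import prod


def simple_total(sample_values, stage_sizes, M0):
    """Estimated population total y / f of formula (3).

    sample_values : observations on all sampled elements
    stage_sizes   : constant sample sizes (n1, n2, ..., nk)
    M0            : total number of elements in the population
    Returns the estimated total and the overall sampling fraction f.
    """
    n = prod(stage_sizes)              # total sample size n = n1 * n2 * ... * nk
    if len(sample_values) != n:
        raise ValueError("number of observations must equal n1*n2*...*nk")
    y = sum(sample_values)             # sample total
    f = n / M0                         # overall sampling fraction
    return y * M0 / n, f               # y * M0 / n equals y / f


# Hypothetical 3-stage sample: 10 PSUs, 3 clusters per PSU, 5 elements per cluster.
total, f = simple_total([2.0] * 150, (10, 3, 5), 30000)
print(total, f)
```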
(ii) Estimated variance of estimated total
For equal sample sizes at all stages, the estimate of the ith PSU total divided by $p_{2i}$
becomes

$$
\frac{\hat{Y}_i}{p_{2i}} = \frac{M_0}{n_2 n_3 \cdots n_k} \sum\sum\cdots\sum y_{ij\dots tu} = \frac{n_1 y_i}{f}
$$

with $y_i = \sum\sum\cdots\sum y_{ij\dots tu}$ as the total of all observations in the ith PSU.
Hence, the simple estimated variance of $\hat{Y}_{pps} = \frac{1}{n_1}\sum_i \hat{Y}_i/p_{2i}$ from (2) is

$$
v(\hat{Y}_{pps}) = \frac{\sum_i \left(\dfrac{n_1 y_i}{f} - \dfrac{y}{f}\right)^2}{n_1(n_1 - 1)} = \frac{n_1 \sum_i y_i^2 - y^2}{f^2 (n_1 - 1)} \tag{4}
$$
Formulae at (3) and (4) are quite simple under the assumption that sample
sizes are constant at different stages although the sampling design is quite
complex. Also for n1 ≥ 10, the estimated variance is likely to be adequate and
stable. Simple estimate of population mean and its estimated variance can be
obtained from (3) and (4) respectively.
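A minimal Python sketch of formula (4) follows. Only the PSU sample totals and the overall sampling fraction are needed; the function name and the example figures are hypothetical.

```python
from math import sqrt


def simple_variance(psu_sample_totals, f):
    """Estimated variance of the estimated total by formula (4).

    psu_sample_totals : sample totals y_i, one per sampled PSU
    f                 : overall sampling fraction n / M0
    Returns the estimated variance and the standard error.
    """
    n1 = len(psu_sample_totals)
    if n1 < 2:
        raise ValueError("at least two PSUs are needed")
    y = sum(psu_sample_totals)                        # sample total
    sum_sq = sum(yi * yi for yi in psu_sample_totals)
    variance = (n1 * sum_sq - y * y) / (f * f * (n1 - 1))
    return variance, sqrt(variance)


# Hypothetical sample totals of 4 PSUs with overall sampling fraction 0.01.
v, se = simple_variance([12, 15, 9, 14], 0.01)
print(v, se)
```

With these figures, the longer form in (2), applied to the values n1·yi/f, gives the same variance, which is a convenient check on a hand calculation.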
5. Estimates when last-stage sampling units are clusters
Sometimes last-stage sampling units are clusters of elements. As in section 3,
clusters are selected here at all (k-1) stages of sampling and all elements of (k-1)th stage clusters are included in the sample for investigation. So there are (k-1) stages of sampling of clusters only.
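Let $Y_{ij\dots ltu}$ be the measures of elements in (k-1)th stage clusters and let $Y_{ij\dots lt} = \sum_u Y_{ij\dots ltu}$
be the total of measures of all elements in such a cluster. Then the estimated
total for Y is given by

$$
\hat{Y}_{pps} = \sum_i \frac{1}{n_1\,p_{2i}} \sum_j \frac{1}{n_{2i}\,p_{3ij}} \cdots \sum_t \frac{1}{n_{k-1,ij\dots l}\,p_{kij\dots lt}} \sum_u Y_{ij\dots ltu}
$$

$$
= \frac{1}{n_1} \sum_i \frac{1}{n_{2i}\,p_{2i}} \sum_j \frac{1}{n_{3ij}\,p_{3ij}} \cdots \sum_l \frac{1}{n_{k-1,ij\dots l}\,p_{k-1,ij\dots l}} \sum_t \frac{Y_{ij\dots lt}}{p_{kij\dots lt}}
$$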
By taking (k-1) conditional expectations as before, it can be shown that this
𝑌̂𝑝𝑝𝑠 is an unbiased estimate of population total Y, of variable y.
Putting the expressions of $p_{2i}$, $p_{3ij}$ and so on in terms of cluster sizes $M_{2i}$, $M_{3ij}$ etc.,
the estimated total becomes, on simplification,

$$
\hat{Y}_{pps} = \frac{M_0}{n_1} \sum_i \frac{1}{n_{2i}} \sum_j \frac{1}{n_{3ij}} \cdots \sum_l \frac{1}{n_{k-1,ij\dots l}} \sum_t \frac{Y_{ij\dots lt}}{M_{kij\dots lt}}
= \frac{M_0}{n_1 n_2 \cdots n_{k-1}} \sum\sum\cdots\sum \bar{Y}_{ij\dots lt}
$$

assuming sample sizes at each stage are constant, that is,
$n_{2i} = n_2$, $n_{3ij} = n_3$, ..., $n_{k-1,ij\dots l} = n_{k-1}$ for all i, j, ..., l,
with $\bar{Y}_{ij\dots lt} = Y_{ij\dots lt}/M_{kij\dots lt}$ as the mean of measures of all elements in a (k-1)th stage
cluster. Then

$$
\hat{Y}_{pps} = \frac{\sum_i y_{im}}{f'} = \frac{y_m}{f'}
$$

where $y_{im} = \sum\sum\cdots\sum \bar{Y}_{ij\dots lt}$ is the sum of means of measures of
elements in (k-1)th stage clusters in the ith PSU,

$$
y_m = \sum_i y_{im}, \qquad n' = n_1 n_2 \cdots n_{k-1}, \qquad f' = \frac{n_1 n_2 \cdots n_{k-1}}{M_0} = \frac{n'}{M_0}
$$

and $f'$ is the ratio of the total number of (k-1)th stage clusters in the sample to the population size $M_0$. The
simple estimated variance of $\hat{Y}_{pps}$ is then given by

$$
v(\hat{Y}_{pps}) = \frac{n_1 \sum_i y_{im}^2 - y_m^2}{f'^2 (n_1 - 1)}
$$

The estimate of the population mean, $\bar{Y}$, is

$$
\hat{\bar{Y}}_{pps} = \frac{\sum_i y_{im}}{n'} = \frac{y_m}{n'}
$$

Its estimated variance is

$$
v(\hat{\bar{Y}}_{pps}) = \frac{n_1 \sum_i y_{im}^2 - y_m^2}{n'^2 (n_1 - 1)}
$$
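The estimates of this section can be sketched in Python as follows, starting from the sums of cluster means in each PSU. The function name `cluster_estimates` and the example values are hypothetical and serve only to show how the four quantities relate.

```python
def cluster_estimates(psu_mean_sums, n_prime, M0):
    """Total, mean and their estimated variances when last-stage units are clusters.

    psu_mean_sums : y_im, the sum of last-stage cluster means in each sampled PSU
    n_prime       : n' = n1 * n2 * ... * n(k-1), number of last-stage clusters sampled
    M0            : total number of elements in the population
    """
    n1 = len(psu_mean_sums)
    if n1 < 2:
        raise ValueError("at least two PSUs are needed")
    y_m = sum(psu_mean_sums)
    f_prime = n_prime / M0
    # Common numerator of both variance formulas, divided by (n1 - 1).
    core = (n1 * sum(v * v for v in psu_mean_sums) - y_m ** 2) / (n1 - 1)
    total = y_m / f_prime
    mean = y_m / n_prime
    return {
        "total": total,
        "var_total": core / f_prime ** 2,
        "mean": mean,
        "var_mean": core / n_prime ** 2,
    }


# Hypothetical 2-stage sample: 4 PSUs, 2 clusters per PSU, so n' = 8.
est = cluster_estimates([4.0, 6.0, 5.0, 5.0], n_prime=8, M0=8000)
print(est)
```

The variance of the total is M0² times the variance of the mean, as the two formulas above require.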
6. Applications
(i) Estimation of area and production of Aman paddy in 2 districts
This sample survey was conducted in Satkhira and Rajshahi districts to estimate
area and production of Aman paddy in 2002. The districts were fixed
beforehand.
In each district, some Mouzas were selected with PPS systematic selection at
the first stage of sampling with total areas as their sizes. In each selected
Mouza, 2 plots of cultivated land were selected randomly and located in the
field with the help of Mouza maps at the second stage of sampling. An area of
about 5 acres of cultivated land was demarcated around each selected plot for
investigation. Thus sampling design is a 2-stage cluster sample in each district.
Area of Aman paddy in each selected cluster is obtained by observation of
plots of selected clusters on the spot at the time of growth of the crop. For
estimation of yield rate, crop-cutting experiments were conducted on a few
small plots in the selected Mouzas with the help of the cultivators of plots
during harvest time.
Estimates were obtained by using the data of the plots of land in the sample
clusters by considering the formulae of section 5.
(ii) Survey of lifestyle of aged people in 4 districts
A survey on the lifestyle of aged people (over 60 years of age) was undertaken in
2004 in the rural areas of 4 districts (Dhaka, Comilla, Bogra and Jessore) and in
Dhaka city as the urban area. From the lists of Mouzas/Mohallas in the reports of the 2001
population census, some Mouzas in the rural areas of the 4 districts and some Mohallas in
Dhaka city were selected by PPS systematic sampling, with the number of
households in a Mouza/Mohalla as its size.
To select aged people in each selected Mouza/Mohalla, Enumeration Areas
(EA) established in connection with 2001 population census in each district
were considered. One EA having 100 households on the average, was selected
in each Mouza/Mohalla at random. A list of aged people was prepared on the
spot for each selected EA and all such aged people were included in the sample
for interview.
Thus a two-stage cluster sample was drawn in the rural areas of each of 4
districts and also in urban area of Dhaka city. Estimates were obtained from
data of the aged people in the sample by using the formulae provided in
section 5.
(iii) Bangladesh agriculture sample survey of farm holdings
This is a large-scale sample survey of farmers for the whole country conducted
in 2005 by Bangladesh Bureau of Statistics. Since district estimates were
needed, districts were considered as strata.
In each district, 12% of Mouza were selected with PPS systematic sampling at
the first stage with number of farmers of a Mouza as its size.
For each selected Mouza, a list of farm holdings was prepared on the spot and
a systematic sample of 30 farmers was selected at the second stage of
sampling. So, this is a two-stage stratified cluster sample. Using the data
collected from the sample farmers, estimates for each district were obtained
with the help of the formulae given in section 5 or by package program.
Combined estimates for the whole country were also computed from the
district estimates.
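How the district estimates were combined is not spelled out here. A common approach in stratified sampling, assumed in the sketch below, is to add the district totals and, because districts are sampled independently, to add their estimated variances. The function name and the figures are hypothetical.

```python
from math import sqrt


def combine_strata(district_estimates):
    """Combine stratum (district) totals into a national total.

    district_estimates : list of (estimated_total, estimated_variance) pairs
    Assumes independent sampling in each district, so variances add.
    Returns the combined total, its variance and its standard error.
    """
    total = sum(t for t, _ in district_estimates)
    variance = sum(v for _, v in district_estimates)
    return total, variance, sqrt(variance)


# Two hypothetical districts.
print(combine_strata([(1000.0, 400.0), (2000.0, 900.0)]))
```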
References
Cochran, W.G. (1977) - Sampling Techniques, Wiley & Sons.
Talukder, M.A.H. (1984) - A simple method of estimating standard errors for
multi-stage sampling with unequal cluster (equal probability selection),
Biometrie-Praximetrie, vol. 24, pp. 127-140.