SAMPLING

=======================================================================
SAMPLING: AN INTRODUCTION
NUMERICAL PRACTICE EXERCISES
(notes to the attached Excel tables)
=======================================================================
Vijay Verma
University of Siena
Siena, October 2014
Numerical practice (1)
NOTES
This classroom exercise concerns numerical illustrations of simple random, clustered (multi-stage)
and stratified sampling. It is recommended that the students work in pairs.
To begin with, from a given population we will select simple random samples of elements of
various sizes to illustrate how the distribution of sample means varies with sample size. We will
also illustrate how the variability among elements in any given sample estimates the population
variance.
1 The population
(1) In order to facilitate quick selection of many random samples, we will employ a simple
well-mixed population with known characteristics, in which units with different values appear in an
entirely random order, so that any arbitrary set of units can be regarded as a random sample.
Our example is artificial in two respects:
- We actually know the entire population (the Yj values vary in the range 0 to 9).
- The population is thoroughly mixed, at least in relation to values of the variable of interest,
  i.e. the Yj values appear in an entirely random order.
(2) Theoretically, the population parameters are:
$$\bar{Y} = \frac{1}{10}\sum_{j=1}^{10} Y_j = 4.500; \qquad \sigma^2 = \frac{1}{10}\sum_{j=1}^{10} (Y_j - 4.5)^2 = 8.25; \qquad \sigma = 2.87$$
Our illustrative population is shown in tab “40x40 random digits 0-9”. Let us assume that these
digits represent values of some variable Y, the average value of which we are interested in
estimating. Our table of 1,600 digits is of limited size and not perfect. It compares as follows with
the theoretical values above (see tab “frequency distribution”):
Comparison of frequencies in the 1,600-digit table with the expected frequencies

Frequency distribution of digits
digit           0    1    2    3    4    5    6    7    8    9    mean     σ²      S²
theoretical   160  160  160  160  160  160  160  160  160  160   4.500   8.250   8.250
actual        151  162  161  144  176  169  169  170  154  144   4.498   7.940   7.945
2 Simple random sampling
For illustrative purposes, we construct six sets of simple random samples:
Sample “design”                        Sample size           Number of samples
(1r)  one quarter of each row          n = (1x10) = 10       160
(1c)  one quarter of each column       n = (10x1) = 10       160
(2)   each (4x4) square                n = (4x4) = 16        100
(3r)  each whole row                   n = (1x40) = 40        40
(3c)  each whole column                n = (40x1) = 40        40
(4)   each (10x10) square              n = (10x10) = 100      16
3 Illustrations of simple random sampling
In each case, (1r)-(4), the above-mentioned samples amount to a very small subset of all possible
samples of a given size that can be drawn from the population. Many more can easily be created.
The task is to select the different types of samples as specified above, and comment on the basic
characteristics (mean, variance, standard deviation, minimum and maximum values) of the
sampling distributions obtained. Since in each example only a very small set of all possible samples
is considered, the results are subject to some random variability.
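The task above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 40x40 table is generated here with a seeded random number generator as a stand-in for the Excel tab “40x40 random digits 0-9”, and the names `make_population` and `block_means` are hypothetical helpers, not part of the attached tables.

```python
import random
import statistics

def make_population(rows=40, cols=40, seed=2014):
    """Build a rows x cols table of random digits 0-9 (stand-in for the Excel tab)."""
    rng = random.Random(seed)
    return [[rng.randrange(10) for _ in range(cols)] for _ in range(rows)]

def block_means(table, height, width):
    """Mean of every non-overlapping (height x width) block, each block being one sample."""
    means = []
    for top in range(0, len(table), height):
        for left in range(0, len(table[0]), width):
            block = [table[r][c]
                     for r in range(top, top + height)
                     for c in range(left, left + width)]
            means.append(statistics.fmean(block))
    return means

table = make_population()
# Block shapes (rows, columns) for the six sample "designs" (1r)-(4).
designs = {"(1r)": (1, 10), "(1c)": (10, 1), "(2)": (4, 4),
           "(3r)": (1, 40), "(3c)": (40, 1), "(4)": (10, 10)}
summary = {}
for label, (h, w) in designs.items():
    m = block_means(table, h, w)
    # number of samples, mean, variance, minimum and maximum of the sample means
    summary[label] = (len(m), statistics.fmean(m), statistics.pvariance(m), min(m), max(m))
    print(label, summary[label])
```

With a well-mixed table, the variance of the sample means should shrink roughly in proportion to 1/n as the sample size grows from 10 to 100, subject to the random variability noted above.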
Numerical practice (2)
NOTES
This classroom exercise concerns numerical illustrations of clustered (multi-stage) sampling. It is
recommended that the students work in pairs.
1 The population
In Numerical practice (1), from a given population we selected simple random samples of elements
of various sizes to illustrate how the distribution of sample means varies with sample size.
As a reminder, our illustrative population is merely a table of 40x40 random digits 0-9. Hence any
arbitrary set of units can be regarded as a random sample. We assume that these digits represent
values of some variable Y, the average value of which we are interested in estimating. The
population distribution of Y values, with their mean and variance, is as follows.
Comparison of frequencies in the 1,600-digit table with the expected frequencies

Frequency distribution of digits
digit           0    1    2    3    4    5    6    7    8    9    mean     σ²      S²
theoretical   160  160  160  160  160  160  160  160  160  160   4.500   8.250   8.250
actual        151  162  161  144  176  169  169  170  154  144   4.498   7.940   7.945
2 Clusters with different degrees of homogeneity
Here we have considered the 100 square clusters of size (4x4).
Clusters with five different degrees of homogeneity have been formed:
(1) Entirely random clusters
(2) For each set of 4 columns of digits, the first column is sorted by increasing value of Y. Such
sorting is applied to the left half of the table of (40x40) digits. The (4x4) clusters are formed in the
normal way, but using the set of digits sorted as above.
(3) As above, except that the ‘1 column in 4’ sorting is applied to the whole table.
(4) The sorting is applied to the first two columns of each set of 4 columns of digits in the left half
of the table.
(5) The above ‘2 column in 4’ sorting is applied to the whole table.
If a simple random sample of $a$ clusters (each of size $n$) were selected from a population of $A$
clusters, the variance of the mean for this clustered sample would be
$$\mathrm{Var}(\bar{y}_A) = \left(\frac{A-a}{A}\right)\frac{S_A^2}{a} = (1-f)\,\frac{S_A^2}{a}, \quad\text{with}\quad S_A^2 = \frac{\sum_k (\bar{Y}_k - \bar{Y})^2}{A-1},\quad f = \frac{a}{A}.$$
When the clusters are formed by entirely random grouping as in (1), the variance would be identical
to that for a simple random sample of elements of the same size, i.e. of SRS with (a.n) elements:
$\mathrm{Var}_1(\bar{y}_A) = \mathrm{Var}_0(\bar{y})$, giving
$$\left(\frac{A-a}{A}\right)\frac{S_{1A}^2}{a} = \mathrm{Var}_0(\bar{y}) = \left(\frac{N-a.n}{N}\right)\frac{S^2}{a.n} = (1-f)\,\frac{S^2}{a.n}, \quad\text{hence}\quad S_{1A}^2 = \frac{S^2}{n}.$$
In schemes (2)-(5), clusters correspond to increasingly homogeneous groupings of elements.
Greater homogeneity within clusters implies greater variability between clusters. Hence the cluster
means in populations (2)-(5) are increasingly more diverse compared to cluster means in (1).
$$S_{5A}^2 > S_{4A}^2 > S_{3A}^2 > S_{2A}^2 > S_{1A}^2 = \frac{S^2}{n}.$$
Generally, with n>>1, Si A  S 2 , except in the ‘artificial’ case (1).
With $\mathrm{Var}_k(\bar{y}_A) = \left(\frac{A-a}{A}\right)\frac{S_{kA}^2}{a}$, the above gives
$\mathrm{Var}_k(\bar{y}_A) > \mathrm{Var}_{k-1}(\bar{y}_A)$, $k = 2, \ldots, 5$.
3 Design effects
A most useful concept for the computation, analysis and interpretation of sampling errors concerns
the design effect. Design effect is the ratio of the variance, $\mathrm{Var}_k(\bar{y}_A)$, under the given sample
design, to the variance, $\mathrm{Var}_0(\bar{y})$, under a simple random sample of the same size:
$$\mathrm{deft}^2 = \frac{\mathrm{Var}_k(\bar{y}_A)}{\mathrm{Var}_0(\bar{y})}.$$
Computing design effects requires the additional step of estimating variance or sampling error
under simple random sampling apart from its estimate under the actual design. Proceeding from
standard errors to design effects is essential for understanding the patterns of variation and
determinants of the magnitude of the error, for smoothing and extrapolating the results for diverse
statistics and population subclasses, and for evaluating the performance of the sampling design.
Analysing design effects into components helps to understand where the inefficiencies of
the sample arise and to identify patterns of variation.
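In the current illustration of random sampling of clusters, we have
$$\mathrm{deft}_k^2 = \frac{\mathrm{Var}_k(\bar{y}_A)}{\mathrm{Var}_0(\bar{y})} = \frac{S_{kA}^2}{S^2/n}, \quad k = 1, \ldots, 5,$$
$$\mathrm{deft}_{5A}^2 > \mathrm{deft}_{4A}^2 > \mathrm{deft}_{3A}^2 > \mathrm{deft}_{2A}^2 > \mathrm{deft}_{1A}^2 = 1.$$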
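As a small numerical sketch of the design effect for a simple random sample of equal-sized clusters, the ratio of the variance of cluster means to S²/n can be computed directly. The two sets of four clusters below are hypothetical toy data (not the 100 clusters of the exercise), and `design_effect` is an illustrative name; both sets contain the same 16 values, grouped differently.

```python
import statistics

def design_effect(clusters):
    """deft^2 = S_A^2 / (S^2 / n) for equal-sized clusters selected by SRS."""
    n = len(clusters[0])
    if any(len(c) != n for c in clusters):
        raise ValueError("all clusters must have the same size")
    cluster_means = [statistics.fmean(c) for c in clusters]
    elements = [y for c in clusters for y in c]
    s2_a = statistics.variance(cluster_means)   # S_A^2, divisor A - 1
    s2 = statistics.variance(elements)          # S^2, divisor N - 1
    return s2_a / (s2 / n)

# Hypothetical data: heterogeneous clusters versus homogeneous (sorted) clusters.
mixed = [[0, 9, 1, 7], [2, 8, 3, 6], [4, 5, 9, 0], [1, 8, 2, 7]]
homogeneous = [[0, 0, 1, 1], [2, 2, 3, 4], [5, 6, 7, 7], [8, 8, 9, 9]]

print(design_effect(mixed), design_effect(homogeneous))
```

The homogeneous grouping gives a design effect well above 1, which is the pattern expected when moving from scheme (1) towards scheme (5).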
Numerical practice (3)
NOTES
This classroom exercise concerns numerical illustrations of stratified sampling, with simple random
sampling within strata. It is recommended that the students work in pairs.
1 The population
In Numerical practice (1), from a given population we selected simple random samples of elements
of various sizes to illustrate how the distribution of sample means varies with sample size.
As a reminder, our illustrative population is merely a table of 40x40 random digits 0-9. Hence any
arbitrary set of units can be regarded as a random sample. We assume that these digits represent
values of some variable Y, the average value of which we are interested in estimating. The
population distribution of Y values, with their mean and variance, is as follows.
Comparison of frequencies in the 1,600-digit table with the expected frequencies

Frequency distribution of digits
digit           0    1    2    3    4    5    6    7    8    9    mean     σ²      S²
theoretical   160  160  160  160  160  160  160  160  160  160   4.500   8.250   8.250
actual        151  162  161  144  176  169  169  170  154  144   4.498   7.940   7.945
2 Stratification
If sample selection and estimation is done separately within each stratum, the basic expression
$$\mathrm{Var}(\bar{y}) = (1-f)\,\frac{S^2}{n} \quad\text{with}\quad S^2 = \frac{\sum_j (Y_j - \bar{Y})^2}{N-1}$$
applies to each stratum. Using subscript $h$ to refer to a particular stratum, we have with SRS within
strata:
$$\mathrm{Var}(\bar{y}_h) = (1-f_h)\,\frac{S_h^2}{n_h} \quad\text{with}\quad S_h^2 = \frac{\sum_j (Y_{hj} - \bar{Y}_h)^2}{N_h - 1},$$
summed over the $N_h$ units in stratum $h$.
In putting together the results from different strata, we often do that in proportion to stratum size,
e.g. with weights $W_h = N_h/N$. For the total population $\bar{Y} = \sum_h W_h \bar{Y}_h$, and if the $W_h$ are known,
$$\bar{y} = \sum_h W_h\,\bar{y}_h \quad\text{and}\quad \mathrm{Var}(\bar{y}) = \sum_h W_h^2\,\mathrm{Var}(\bar{y}_h).$$
For simplicity in our illustrations, we consider the population as divided into $H$ strata equal in
population as well as in sample size ($W_h = N_h/N = n_h/n = 1/H$). With this, it follows from the above
expression that with SRS within each stratum, variance can be written as
$$\mathrm{Var}(\bar{y}) = \frac{1-f}{n}\cdot\frac{\sum_h S_h^2}{H}.$$
Comparing this with unstratified SRS, the effect of stratification is in proportion to the ratio of the
average within-stratum variance $\bar{S}^2 = \sum_h S_h^2 / H$ to the unstratified value $S^2$.
We may decompose the total variance into variation within strata and variation among the strata
means:
$$\sum_h \sum_j (Y_{hj} - \bar{Y})^2 = \sum_h \sum_j (Y_{hj} - \bar{Y}_h)^2 + \sum_h \sum_j (\bar{Y}_h - \bar{Y})^2,$$
or dividing by $N$, we may write the above as $\sigma^2 = \bar{\sigma}^2 + \bar{\Delta}^2$, where the first term on the right is the
within-stratum and the second term the between-strata component. $\bar{\Delta}^2$ is the mean squared deviation
of the stratum means. The proportionate reduction in variance from stratification is approximately
$$\frac{\bar{\Delta}^2}{\sigma^2} = \frac{\sum_h \sum_j (\bar{Y}_h - \bar{Y})^2}{\sum_h \sum_j (Y_{hj} - \bar{Y})^2} = \frac{\sum_h N_h (\bar{Y}_h - \bar{Y})^2}{\sum_h \sum_j (Y_{hj} - \bar{Y}_h)^2 + \sum_h N_h (\bar{Y}_h - \bar{Y})^2}.$$
The actual gain is slightly smaller due to the (generally minor) difference in the definition of $\sigma$ and
$S$. With $H$ strata of equal size $N/H$, it is seen to be
$$\frac{S^2 - \bar{S}^2}{S^2} = \frac{\bar{\Delta}^2}{\sigma^2} - \frac{H-1}{N-H}\cdot\frac{\bar{\sigma}^2}{\sigma^2}.$$
3 Stratification in multistage sampling
An important point to note is that exactly the same idea applies when we are dealing with clusters
rather than elements as the sampling units in a stratified design. The above quantities then refer to
the variance of cluster means. With a given stratification, the deviation between strata means, $\bar{\Delta}^2$, is
the same whether element or cluster sampling is used within strata. By contrast, $S^2$ or the
within-stratum term, $\bar{S}^2$, is usually much smaller for cluster means than that for individual elements
(as noted earlier).
Hence with cluster sampling, the relative gain from stratification is usually much more appreciable.
4 Numerical exercise
In the example, the set of 40x40 random digits has been divided into two equal strata of 20 rows
each. This is done for each of the five sets constructed by sorting some of the columns as described
in Exercise 2. Populations with five different degrees of ‘gradients’ in Y values have been formed:
(1) Entirely random arrangement of elements.
(2) For each set of 4 columns of digits, the first column is sorted by increasing value of Y. Such
sorting is applied to the left half of the table of (40x40) digits.
(3) As above, except that the ‘1 column in 4’ sorting is applied to the whole table.
(4) The sorting is applied to the first two columns of each set of 4 columns of digits in the left half
of the table.
(5) The above ‘2 column in 4’ sorting is applied to the whole table.
In (1), the two strata have the same expected mean. Generally, the mean in the lower stratum is
larger than that in the upper stratum, and the difference between them increases as we go from (2) to (5).
First we examine the effect of stratification assuming a SRS of elements within each stratum. What
is the proportionate reduction in variance in each case?
Then we consider it to be a population of 100 square clusters of size (4x4)=16 elements each. We
examine the effect of stratification assuming a SRS of clusters within each stratum. What is the
proportionate reduction in variance in each case; how does it compare with the reduction in the
samples of elements?
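The proportionate reduction in variance can be computed as in the following sketch, which assumes equal-sized strata with SRS within each stratum, as in the exercise. The two strata of six values are hypothetical toy data, and `stratification_gain` is an illustrative name. The same function can be applied to cluster means instead of element values.

```python
import statistics

def stratification_gain(strata):
    """Return (S^2, average within-stratum S_h^2, proportionate reduction in variance)."""
    size = len(strata[0])
    if any(len(s) != size for s in strata):
        raise ValueError("this sketch assumes strata of equal size")
    elements = [y for stratum in strata for y in stratum]
    s2 = statistics.variance(elements)                                   # unstratified S^2
    s2_within = statistics.fmean([statistics.variance(s) for s in strata])
    return s2, s2_within, (s2 - s2_within) / s2

# Hypothetical population with a 'gradient': the lower stratum has larger values.
strata = [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
s2, s2_within, gain = stratification_gain(strata)
print(round(s2, 3), round(s2_within, 3), round(gain, 3))
```

The stronger the difference between the stratum means relative to the variation within strata, the closer the reduction comes to 1.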
Numerical practice (4)
NOTES
This classroom exercise concerns numerical illustrations of how the variability among elements in
any given sample estimates the population variance. The procedure illustrated applies only in the
case of simple random sampling. However, the results of this type also apply if the sample is more
complex: in those situations as well, variability among elements in a sample estimates the
population variance – though not necessarily the sampling error under the actual complex design.
1 Estimating variance from the sample
In our illustrations, the average of the sample means $\bar{y} = \sum y_i / n$ is equal to the population mean $\bar{Y}$. We
say that the expected value of the former equals the latter: $E[\bar{y}] = \bar{Y}$; i.e. $\bar{y}$ provides an unbiased
estimator of $\bar{Y}$.
Furthermore, the variability among elements in any particular sample provides a measure of that
variability in the population, i.e.
$$\frac{\sum_i (y_i - \bar{y})^2}{n} \approx \frac{\sum_j (Y_j - \bar{Y})^2}{N} = \sigma^2,$$
where the summation on the left is over the $n$ elements of the sample and that on the right is over
the $N$ population elements.
Actually, for an SRS the exact relationship happens to be $E[s^2] = S^2$, where
$$s^2 = \frac{\sum_i (y_i - \bar{y})^2}{n-1} \quad\text{and}\quad S^2 = \frac{\sum_j (Y_j - \bar{Y})^2}{N-1}.$$
Hence for a simple random sample
$$E[\mathrm{var}(\bar{y})] = \mathrm{Var}[\bar{y}],$$
where $\mathrm{var}(\bar{y}) = (1-f)\,s^2/n$ is estimated from the sample and $\mathrm{Var}(\bar{y}) = (1-f)\,S^2/n$ is its population value.
This is the basis on which we can estimate the variance (a measure of variability among different
samples) from the results of a single sample that is available.
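The relationship E[s²] = S² can be checked exactly on a very small population by enumerating every possible simple random sample without replacement. The sketch below assumes a toy population of the ten digits 0 to 9 and samples of size 3; `average_s2_over_all_samples` is an illustrative name.

```python
from itertools import combinations
import statistics

def average_s2_over_all_samples(population, n):
    """Average of s^2 (divisor n - 1) over every possible SRS of size n without replacement."""
    s2_values = [statistics.variance(sample) for sample in combinations(population, n)]
    return statistics.fmean(s2_values)

population = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]      # toy population, N = 10
big_s2 = statistics.variance(population)          # S^2 with divisor N - 1
expected_s2 = average_s2_over_all_samples(population, 3)
print(big_s2, expected_s2)
```

The two printed values agree up to floating-point rounding, whereas the average of the divisor-n version of the sample variance would fall below σ².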
2. Wider significance of the results
It is important to note that the variance computed above provides a valid estimate only for simple
random sampling. For more complex designs, estimating variance involves more complex
formulae that take into account the complexity of the design. But interestingly, an important result of
sampling theory is that for many complex designs, the relationship E[s²] ≈ S² still holds
approximately.
Hence we can use the data from a sample of any complexity to produce an estimate of what the
error would be in a simple random sample of the same size – there is an approximation involved but
it is small, generally of the order of (1/n). This is an important statistic since it forms the
denominator of the ‘design effect’.
Numerical practice (5). Selecting households in two stages
PANEL (1)
The population
Our sampling frame consists of a list of 30 population census areas (col. 1 of the panel), with given
number of households in each area at the time of the census (col. 2). These are the ‘area measures of
size’ (MoS) available in the sampling frame, i.e. for all the areas in the population prior to any
sample selection. Col. (6) gives the actual number of households in each area at the time of the
survey. Generally these latter numbers are not known, except in the areas actually selected and
enumerated during the survey.
1. First stage selection probability of areas
It is decided to select 6 of these areas (as PSUs), with probability proportional to size (PPS). The
measure of size (Si) is the number of households in col. (2).
Complete col. (3): the probability of selection that each area will receive.
(Complete col. (3) in Panel (1); note that col. (3) in Panel (2) is then completed automatically as
well.)
2. Sample take (number of households selected) per area in self-weighting design
Design A consists of two stages: selection of areas, followed by the selection of households in the
selected areas. The first stage is as above. At the second stage, the target sample size is 10
households per selected area. Within each selected area, households are selected with a UNIFORM
PROBABILITY = 10/Si.
The number of households in the area at the time of the survey (Hi) is given in col (6). Complete in
Panel (1):
Col (4): the sampling rate which would be applied for selecting households within each area, if
the area had been selected at the first stage.
Col (5): the OVERALL (after the 2 stages) selection probability of a household in each area.
Col (7): the number of households which would be selected from the area, given col. (6).
PANEL (2)
3. Fixed-take design
Design B also consists of two stages, the first stage exactly as above. At the second stage, the target
sample size is FIXED at 10 households per selected area. Within each selected area, complete in
Panel (2):
Col (4): the sampling rate which would be applied for selecting households within each area, if
the area had been selected at the first stage.
Col (5): the OVERALL (after the 2 stages) selection probability of a household in each area.
(As before, col (6) gives the number of households in the area at the time of the survey.)
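The columns of Panels (1) and (2) can be sketched as follows. The four-area frame, the values a = 2 and b = 10, and the function name `two_stage_probabilities` are hypothetical, chosen only to keep the example short; the sketch also assumes that no probability exceeds 1.

```python
def two_stage_probabilities(frame_sizes, survey_sizes, a=6, b=10):
    """Selection probabilities per area under Design A (self-weighting) and Design B (fixed take)."""
    total = sum(frame_sizes)
    rows = []
    for s_i, h_i in zip(frame_sizes, survey_sizes):
        f1 = a * s_i / total          # col (3): PPS first-stage probability
        rate_a = b / s_i              # Design A, col (4): rate fixed from the frame size S_i
        rate_b = b / h_i              # Design B, col (4): rate giving exactly b households
        rows.append({
            "f1": f1,
            "A_rate": rate_a, "A_overall": f1 * rate_a, "A_take": h_i * rate_a,
            "B_rate": rate_b, "B_overall": f1 * rate_b, "B_take": b,
        })
    return rows

# Hypothetical frame: census sizes S_i (col 2) and sizes at the time of the survey H_i (col 6).
rows = two_stage_probabilities([120, 80, 200, 100], [150, 80, 180, 90], a=2, b=10)
for row in rows:
    print({key: round(value, 4) for key, value in row.items()})
```

Under Design A the overall probability is the same for every household while the take varies with H_i/S_i; under Design B the take is fixed while the overall probability varies.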
PANEL (3)
4. Selecting a sample of areas
Select a systematic PPS sample of a = 6 areas, given the frame in cols (1)-(2).
For this purpose,
- complete col (8) giving cumulation of size measures given in col (2);
- compute the sampling interval to be applied,
- take a random start, and select a systematic sample with the interval computed above,
and mark with “x” in col (9) the 6 areas selected in your particular sample.
5. Analysis of the selected sample
From the sample of 6 areas you have selected, complete for each SELECTED area:
Col (10): the sample weight to be applied to the area to compute its contribution towards
estimating a population total.
Col (11): contribution of the area towards estimating the total number of households in the
population at the time of the survey, given the number of households found in the area during the
survey (col 6).
What estimate do you get from the sample for the total number of households in the population?
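Steps 4 and 5 can be sketched as below: cumulate the size measures, apply the interval from a random start, and weight each selected area by the inverse of its first-stage probability. The six-area frame, the start value 54 and the function names are hypothetical; the sketch assumes that no area is larger than the sampling interval.

```python
from itertools import accumulate

def systematic_pps(sizes, a, start):
    """Select a units by systematic PPS; start is a random number in (0, I]."""
    interval = sum(sizes) / a
    cumulated = list(accumulate(sizes))                  # col (8)
    selected, unit = [], 0
    for k in range(a):
        point = start + k * interval
        # Move to the first unit whose cumulated size equals or exceeds the selection point.
        while unit < len(sizes) - 1 and cumulated[unit] < point:
            unit += 1
        selected.append(unit)
    return interval, selected

def estimate_total(sizes, survey_counts, a, selected):
    """Sum of weighted survey counts over the selected areas."""
    total = sum(sizes)
    estimate = 0.0
    for i in selected:
        weight = total / (a * sizes[i])                  # col (10): 1 / (a * S_i / total)
        estimate += weight * survey_counts[i]            # col (11)
    return estimate

sizes = [50, 30, 70, 40, 60, 50]                         # hypothetical frame MoS
counts = [55, 33, 70, 36, 66, 50]                        # hypothetical survey counts
interval, selected = systematic_pps(sizes, a=3, start=54)
print(interval, selected, estimate_total(sizes, counts, 3, selected))
```

If the survey counts equalled the frame sizes exactly, the estimate would reproduce the frame total whichever areas were selected.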
Numerical practice (6). Further on two-stage sampling
This exercise explains how to treat area units which are either ‘too large’ or ‘too small’ for the
purpose of applying the PPS selection scheme in two stages, involving the selection of areas
followed by the selection of households within selected areas.
1. The frame
The frame covers a population of small localities. There are 482 localities, containing 3,419
households. There are 10 large, 72 of medium size, and many (400) of small size. The details are as
follows.
               Count    total MoS    Mean
TOTAL            482       3,419      7.09
Stratum 1         10         303     30.30
Stratum 2         72         919     12.76
Stratum 3        400       2,197      5.49
SUM              482       3,419      7.09
Full list frame is given in sheet “frame”.
The number of households in a locality is used as the measure of size (MoS) for the first stage
sampling of a certain specified number (=a) of localities. It is assumed in this example that the
actual number of households in the locality at the time of the survey is exactly the same as this
number in the frame.
2. First-stage sampling
The objective is to select a self-weighting sample of households in two stages: PPS selection of
localities; and then selection of households in each selected locality with inverse-PPS. Since the
MoS are accurate, this also gives a constant number of sample households from each selected
locality.
With this scheme the sample allocation to a stratum, and hence the number of localities to be
selected from it, is in proportion to the total MoS in the stratum.
This basic allocation has to be adjusted to take into account the presence of ‘very large’ localities
which limits the total number available for selection (this occurs mainly in Stratum 1, with a few
cases in Stratum 2), and the presence of ‘very small’ localities which occur in Stratum 3.
Sheet “first stage sampling” shows the procedure for different values of the number of PSUs to be
selected: a=50, 100, 150, and 200.
Repeat this table for other values of this number, such as a=175, a=225, etc.
With a=50 or a=100, allocation of the sample areas to be selected in a stratum can be made strictly
in proportion to the total measure of size in the stratum. This is because the allocation so determined
does not exceed the number of areas available in any of the strata. However, with increasing sample
size (a = the number of areas to be selected), the allocation to the larger strata (i.e. in strata
containing large localities) can exceed the total number of localities existing in the stratum. The
latter quantity is the limit of the maximum number of units which can be selected.
We begin with the stratum containing the largest units, and determine the allocation for each
subsequent smaller stratum on the basis of what measure of size and what number of sample areas
are still available for allocation after having dealt with the larger strata.
3. Two-stage sampling
The sum of size measures (MoS) for the whole population is available from the frame. Let N be this
sum, and 𝑁𝑖 the measure of size of area (PSU) 𝑖. According to the required design, let 𝑎 be the
number of PSUs to be selected, then the PPS sampling interval for the selection of PSUs is 𝐼 =
𝑁⁄𝑎. The other design parameter is to fix the target sample size as 𝑛 households, or the overall
sampling rate for a self-weighting sample 𝑓, the two being related as 𝑓 = 𝑛⁄𝑁 . The target sample
size per sample PSU is 𝑏 = 𝑛⁄𝑎 = 𝑓. 𝐼 .
The first-stage sampling rate is 𝑓1 = 𝑁𝑖 ⁄𝐼 ; and assuming perfect size measures (i.e., exactly
equalling the actual unit sizes), the second-stage sampling rate is 𝑓2 = 𝑏⁄𝑁𝑖 , giving, as expected,
the overall sampling rate 𝑓 = 𝑓1 𝑓2 = 𝑏⁄𝐼 .
In actual application, neither of the probabilities 𝑓1 or 𝑓2 can exceed 1. Hence we have three cases:
(1) ‘Very large’ units, meaning with size 𝑁𝑖 ≥ 𝐼. For these we have
𝑓1 = 1; 𝑓2 = 𝑓 (meaning that the area is automatically included in the sample).
(2) ‘Normal’ units with
𝑓1 = 𝑁𝑖 ⁄𝐼 ; 𝑓2 = 𝑏⁄𝑁𝑖 .
(3) ‘Very small’ units, meaning with size 𝑁𝑖 ≤ 𝑏. For these we have
𝑓1 = 𝑓; 𝑓2 = 1 (meaning that all households in a selected area are taken into the sample).
Since with the above adjustments to the selection probabilities, the overall selection probability of
any unit, 𝑓, remains unchanged, we still obtain the target sample size 𝑛 = 𝑓. 𝑁. However, the
number of PSUs coming into the sample may be disturbed – i.e. may come out to be 𝑎′ ≠ 𝑎, the
target number.
For a given 𝑓 and the condition that no probability can exceed 1, we cannot adjust (1) or (3). We
can adjust the two probabilities in (2) to obtain the required number, 𝑎, of selections of PSUs, while
still keeping the overall unit selection probability 𝑓 unchanged:
𝑓1 = 𝑁𝑖 ⁄(𝑘𝐼); 𝑓2 = (𝑘𝑏)⁄𝑁𝑖 .
Factor 𝑘 is given by:
$$k = \frac{a_2'}{a_2} = \frac{a' - (a_1 + a_3)}{a - (a_1 + a_3)},$$
where $(a_1, a_2, a_3)$ are, respectively, the numbers of units to be selected from parts (1)-(3).
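The three cases can be written as a small function, sketched here before any adjustment by the factor k. The function name `stage_rates` and the example values I = 100 and b = 10 are assumptions made for illustration.

```python
def stage_rates(n_i, interval, b):
    """Return (f1, f2) for a PSU of size n_i, so that neither probability exceeds 1."""
    f = b / interval                       # overall sampling rate f = b / I
    if n_i >= interval:                    # case (1): 'very large' unit, taken with certainty
        return 1.0, f
    if n_i <= b:                           # case (3): 'very small' unit, all households taken
        return f, 1.0
    return n_i / interval, b / n_i         # case (2): 'normal' unit

for size in (3, 10, 25, 100, 250):         # hypothetical PSU sizes
    f1, f2 = stage_rates(size, interval=100, b=10)
    print(size, f1, f2, round(f1 * f2, 6))
```

In every case the product of the two rates equals f, which is why the target sample size is preserved while the number of PSUs selected may change.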
(7) Sampling establishments with PPS
1 Sampling with probability proportional to employment in the establishment
Attached is a frame consisting of 100 establishments from six different sectors of activity, with
information on employment, output, input and investment in each establishment.
Suppose that a sample of 20 establishments is to be selected, with employment as the measure of
size. Total employment in the 100 establishments is 18,884. This gives a PPS sampling interval of
I = 18,884/20 = 994.
Three of the establishments have more than this number of employees. These are taken into the sample
with certainty. Dividing the remaining MoS (employment) by 17, the number of units still to be
selected, gives a selection interval of I = 668. One additional establishment has a MoS greater than this
new sampling interval, so it is also taken into the sample with certainty. The final sampling
interval is I = 667 for selecting the further 16 units from the remaining 96 units, all of which have a MoS
smaller than this interval.
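The repeated recomputation of the interval can be sketched as a loop that stops once no remaining unit exceeds the current interval. The nine size measures below are hypothetical (they are not the attached frame of 100 establishments), and `certainty_units` is an illustrative name.

```python
def certainty_units(sizes, n_select):
    """Iteratively set aside units whose MoS exceeds the current PPS sampling interval."""
    certain = set()
    while len(certain) < n_select:
        remaining = [s for i, s in enumerate(sizes) if i not in certain]
        interval = sum(remaining) / (n_select - len(certain))
        new = {i for i, s in enumerate(sizes) if i not in certain and s > interval}
        if not new:
            # No remaining unit is larger than the interval: it is the final one.
            return sorted(certain), interval
        certain |= new
    return sorted(certain), None           # every selection is a certainty unit

sizes = [500, 200, 90, 60, 50, 40, 30, 20, 10]    # hypothetical employment figures
print(certainty_units(sizes, n_select=4))
```

In this toy frame the first pass (interval 250) removes one unit, the second pass (interval about 166.7) removes another, and the final interval for the two remaining selections is 150.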
2 Sampling with probability proportional to a composite measure
If any of the establishment characteristics other than employment are used as MoS, some of the
establishments will have no chance of being included in the sample because the value of the
characteristic concerned is reported to be zero.
As an alternative, it may be desired to select establishments with probability proportional to a
composite measure which takes into account all the four establishment characteristics available in
the frame: employment, output, input and investment.
For instance, one may take the ratio of each of these measures to its mean value per establishment,
and then take the average of the four ratios so computed to obtain a composite measure of
establishment size. By definition, this gives a total MoS of 100, and hence a sampling interval of I
= 100/20 = 5.0.
Suppose that the largest 5 establishments in terms of this composite MoS are taken into the sample
with certainty. The remaining measure of size is 53.9, giving a sampling interval of 3.60 for
selecting 15 more establishments with PPS. Similarly, if the largest 10 establishments in terms of
this composite MoS are taken into the sample with certainty, the remaining measure of size is 34.3,
giving a sampling interval of 3.43 for selecting 10 more establishments with PPS.
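A minimal sketch of such a composite measure follows, assuming each characteristic has a non-zero mean. The four records (employment, output, input, investment) are hypothetical, and `composite_mos` is an illustrative name.

```python
def composite_mos(records):
    """Average, over characteristics, of each value divided by that characteristic's mean."""
    n_units = len(records)
    n_chars = len(records[0])
    means = [sum(r[c] for r in records) / n_units for c in range(n_chars)]
    return [sum(r[c] / means[c] for c in range(n_chars)) / n_chars for r in records]

# Hypothetical establishments: (employment, output, input, investment).
records = [(10, 100, 50, 0), (30, 200, 150, 20), (20, 300, 100, 40), (40, 0, 100, 20)]
mos = composite_mos(records)
print([round(m, 3) for m in mos], round(sum(mos), 3))
```

The composite measures sum to the number of establishments (here 4, and 100 in the attached frame), and an establishment reporting zero on one characteristic still receives a positive measure as long as it is non-zero on another.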
3 Using rescaled size measures
For a given situation, the scaling of the size measures can be arbitrary. We can multiply size
measures of all establishments by any constant factor, say k, and also multiply the sampling interval
by the same factor. The resulting sample design and size will remain unaffected.
This feature can be used for convenience when different sampling intervals have to be used for
different parts of the sample.
Suppose for instance that the number of units to be selected has been fixed separately for each
stratum. In general this would give a different value of the sampling interval for different strata:
𝐼ℎ = 𝑀ℎ ⁄𝑛ℎ
where 𝑀ℎ is the total MoS for stratum ℎ and 𝑛ℎ is the number of units to be selected from it.
Suppose that we rescale the unit size measures in each stratum as
𝑀ℎ′ = 𝑘ℎ 𝑀ℎ with 𝑘ℎ = (𝑛ℎ ⁄𝑀ℎ ).
This gives an identical sampling interval 𝐼ℎ = 𝑀ℎ′ ⁄𝑛ℎ = (𝑛ℎ ⁄𝑀ℎ ) 𝑀ℎ ⁄𝑛ℎ = 1 in all strata. The
sample can be selected by putting together all the strata and using the same sampling interval (=1)
throughout. The procedure would automatically give the required number of selections in each
stratum.
This procedure can be very convenient in selecting stratified samples of establishments from many
different sectors.
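The rescaling can be illustrated with two hypothetical strata from which 2 and 3 units are to be selected. The function names and the random start 0.54 are assumptions; the sketch only counts how many selection points fall in each stratum when a single interval of 1 is applied through the combined list.

```python
def rescale_to_unit_interval(strata_sizes, strata_n):
    """Rescale unit MoS by k_h = n_h / M_h so that every stratum has sampling interval 1."""
    rescaled = []
    for sizes, n_h in zip(strata_sizes, strata_n):
        k_h = n_h / sum(sizes)
        rescaled.append([k_h * s for s in sizes])
    return rescaled

def systematic_counts(rescaled, start):
    """Number of selections per stratum for one systematic pass with interval 1."""
    counts, cumulated, point = [], 0.0, start
    for sizes in rescaled:
        hits = 0
        for s in sizes:
            cumulated += s
            while point <= cumulated:      # a selection point falls in this unit
                hits += 1
                point += 1.0
        counts.append(hits)
    return counts

strata_sizes = [[30, 45, 25], [250, 150, 300, 200, 100]]   # hypothetical MoS by stratum
rescaled = rescale_to_unit_interval(strata_sizes, strata_n=[2, 3])
print(systematic_counts(rescaled, start=0.54))
```

Because the rescaled measures in each stratum sum to the required number of selections, the pass yields exactly that number in each stratum for any start between 0 and 1.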
(8) Sample rotation patterns
Attached are illustrations of four sample rotation patterns over time, such as between monthly
samples in a labour force survey.
The rotation patterns and the overlaps are as follows.
1. “5-consecutive”
Overlap with current sample
period                                   t-7   t-6   t-5   t-4   t-3   t-2   t-1   t
Subsample nos. at period t which
overlap with the sample at period (t-i)                    5     4-5   3-5   2-5
% overlap                                            0%    20%   40%   60%   80%
2. “3(1)2”
period                                   t-7   t-6   t-5   t-4   t-3   t-2   t-1     t
Subsample nos. at period t which
overlap with the sample at period (t-i)              6     5,6   5,6   3,5   2,3,6
% overlap                                      0%    20%   40%   40%   40%   60%
3. “3(2)2”
Overlap with current sample
period                                   t-7   t-6   t-5   t-4   t-3   t-2   t-1     t
Subsample nos. at period t which
overlap with the sample at period (t-i)        7     6,7   6,7   6     3     2,3,7
% overlap                                0%    20%   40%   40%   20%   20%   60%
4. “4(8)4”
Percentage overlap with sample at time t:
period    % overlap
t-16        0
t-15       25
t-14       50
t-13       37.5
t-12       50
t-11       37.5
t-10       25
t-9        12.5
t-8         0
t-7         0
t-6         0
t-5         0
t-4         0
t-3        25
t-2        50
t-1        75
t         100
two quarters one year apart
two months one quarter apart
Comment on the variation over time in the overlap achieved in each pattern.
Construct the last pattern “4(8)4” by showing all the (16) subsamples it involves.
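The overlaps can also be computed rather than read off a diagram. The sketch below assumes that one new subsample enters at every period and that each subsample follows the same pattern “in for x periods, out for y, in again for z”; `overlap_by_lag` and its parameter names are illustrative. Its output can be compared with the percentages tabulated above.

```python
def overlap_by_lag(first_in, out, second_in, max_lag):
    """Percentage of the sample at period t that is also in the sample at period t - lag."""
    # Periods (relative to entry) at which a subsample is in the sample,
    # e.g. 4(8)4 -> periods 0-3 and 12-15.
    visits = set(range(first_in)) | {first_in + out + j for j in range(second_in)}
    size = len(visits)
    return {lag: 100.0 * sum((v + lag) in visits for v in visits) / size
            for lag in range(max_lag + 1)}

print(overlap_by_lag(5, 0, 0, max_lag=5))     # "5-consecutive"
print(overlap_by_lag(3, 1, 2, max_lag=6))     # "3(1)2"
print(overlap_by_lag(3, 2, 2, max_lag=7))     # "3(2)2"
print(overlap_by_lag(4, 8, 4, max_lag=16))    # "4(8)4"
```

A subsample in its v-th period at time t − lag is still present at time t only if v + lag is also one of its visit periods, which is what the sum counts.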
(9) Illustration of computing design weights, and their use in estimating from the sample
The table below provides a simple illustration of estimation from the sample with units selected
with PPS. The column headings and explanation are provided at the bottom of the table.
1. A PPS sample
Column (3) gives cumulation of the size measures, and an example of the sample selected. We
begin with a random number from 1 to 𝐼 (=100); let this number be 𝑟 = 54. The first unit selected
is the one for which the cumulated size measure equals or exceeds 𝑟 = 54 for the first time; the
next unit selected is the one for which the cumulated size measure equals or exceeds 𝑟 + 𝐼 =
154; the third unit is the one in which the cumulated size measure equals or exceeds 𝑟 + 2𝐼 = 254,
and the fourth unit is the one in which the cumulated size measure equals or exceeds 354.
2 Design weights
The design weights are introduced to compensate for differences in the probabilities of selection
into the sample. Each ultimate unit in the sample is weighted in inverse proportion to the probability
with which it was selected. If 𝑝𝑗 is the overall sampling probability of unit 𝑗, and 𝑛 the number of
units successfully enumerated in the sample, the design weights $w_j^{(A)}$ are:
$$w_j^{(A)} = \left(\frac{n}{\sum 1/p_j}\right)\cdot\frac{1}{p_j},$$
where the sum ∑ is over the 𝑛 units in the sample. The factor in parentheses is simply a constant,
chosen to scale the average value of the weight per unit enumerated in the survey to be 1.0, since
∑ 𝑤𝑗(𝐴) = 𝑛. Such scaling can be convenient in practice. Superscript (A) has been used to indicate
that the reference is to design weights (as distinct from weights arising from other sources, such as
non-response and calibration).
Note also that the weights in the above equation are not affected by the scaling of 𝑝𝑗 values: the 𝑝𝑗
values need to be only proportional to (within some arbitrary constant), rather than equal to, the unit
selection probabilities.
3 A simple illustration of calculating design weights
Let us assume that the survey units are households, with the size measures in column (2) of the
table being some function of household size and composition. For the sample of 𝑛=4 units selected,
the table illustrates the procedure for computing design weights.
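The weight calculation can be sketched as below, using the size measures of the four selected units (41.3, 46.5, 45.0, 38.5) and the interval I = 100 from the table; the function name `design_weights` is an assumption.

```python
def design_weights(probabilities):
    """Weights proportional to 1/p_j, rescaled to sum to n (average 1.0 per unit)."""
    inverse = [1.0 / p for p in probabilities]
    scale = len(inverse) / sum(inverse)        # the constant n / sum(1/p_j)
    return [scale * w for w in inverse]

sizes = [41.3, 46.5, 45.0, 38.5]               # MoS of the selected units, col (1)
interval = 100
p = [s / interval for s in sizes]              # col (4): p = s / I
raw = [1.0 / q for q in p]                     # col (5): w = 1 / p
weights = design_weights(p)                    # col (6): rescaled weights
print([round(w, 2) for w in raw], [round(w, 3) for w in weights])
```

The unscaled weights round to 2.42, 2.15, 2.22 and 2.60 as in col (5), and the rescaled weights agree with col (6) to within about 0.001. Passing the size measures themselves instead of the probabilities gives the same rescaled weights, which illustrates that the 𝑝𝑗 values need only be proportional to the selection probabilities.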
Illustration of computing design weights, and their use in estimating from the sample

               (1)     (2)     (3)    Selected       (4)     (5)     (6)     (7)     (8)
                                      r=54, I=100
Stratum (1)           200.0
  unit 1      27.1             27.1                                           5.2
  unit 2      41.3             68.4      X          0.413    2.42   1.031     6.7    16.2
  unit 3      51.6            120.0                                          15.2
  unit 4      33.5            153.5                                           5.9
  unit 5      46.5            200.0      X          0.465    2.15   0.917     2.3     5.0
Stratum (2)           200.0
  unit 6      39.4            239.4                                           9.4
  unit 7      45.0            284.4      X          0.450    2.22   0.947     8.3    18.5
  unit 8      31.2            315.6                                           3.4
  unit 9      38.5            354.1      X          0.385    2.60   1.105     5.6    14.5
  unit 10     45.9            400.0                                           4.4
Total                 400.0                         1.712    9.39   4.000    31.1    33.0

Sampling interval for systematic PPS selection: I = 100, with desired number of selections a = 4
Notes to columns
(1) unit size measures (scaled to obtain 2 selections in each stratum)
(2) size measures, total by stratum
(3) cumulation of (1), and indicator of whether a unit is selected
    (interval I=100; assumed random number between 1-100, r=54)
(4) selection probability (p=s/I)
(5) sample weight (w=1/p)
(6) rescaled sample weight to average 1.0 per household
(7) value of the variable to be estimated in the frame (normally not available)
(8) estimate of the value from the units in the sample
(10) Poisson and Sequential sampling
1. Poisson sampling
Poisson sampling is a simple and flexible procedure which we can emulate in choosing an
appropriate sampling method, keeping its merits to the extent possible, and minimising its
limitations. The procedure is as follows.
The method subjects each unit 𝑖 independently to selection with its assigned probability 𝑃𝑖 . This
can be achieved by assigning to each unit in the frame a random number 𝑟𝑖 from uniform
distribution [0,1). The unit is included in the sample if 𝑟𝑖 ≤ 𝑃𝑖 , and not included otherwise.
2. Sequential sampling
This is a variation on the Poisson sampling procedure aimed at controlling sample size with variable
selection probabilities.
Consider the basic Poisson sampling procedure in which a unit is in the sample if the random
number selected for it is at or below its probability of selection, 𝑟𝑖 ≤ 𝑃𝑖 . In terms of a transformed
variable 𝑥𝑖 = (𝑟𝑖 ⁄𝑃𝑖 ), the same result is obtained by taking a unit into the sample if 𝑥𝑖 ≤ 1.
Hence a straightforward method is to arrange the units in order of increasing 𝑥𝑖 and take a certain
number of units from the top of the ordered list into the sample. The specified selection
probabilities are applied if we select up to the last unit for which 𝑥𝑖 ≤ 1.
But instead, in Sequential Poisson sampling, exactly 𝑛 units from the top of the ordered list are
taken into the sample, where 𝑛 is the fixed required sample size. That is, the sample is defined by
𝑟𝑎𝑛𝑘(𝑥𝑖 ) ≤ 𝑛.
This has to be done separately within each stratum, or within groups of strata where it suffices to
control the sample size within groups only. In the presence of many small strata, appropriate
grouping of the strata may in fact be the only practical way of keeping the system manageable.
The attached table is based on Table in Note (7). It illustrates the application of the above two
procedures.
Poisson sampling controls the unit selection probabilities, but the number of units selected can vary.
Sequential sampling controls the number of units selected in each stratum, but the selection
probabilities depart from strict PPS sampling.
Repeat the illustrative selection by using a new set of random numbers.
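The two procedures can be compared on the same set of random numbers, as in the sketch below. The six selection probabilities and the six “random” numbers are hypothetical values fixed by hand so that the result is reproducible; the function names are illustrative.

```python
def poisson_sample(probabilities, randoms):
    """Units whose random number r_i does not exceed their selection probability P_i."""
    return [i for i, (p, r) in enumerate(zip(probabilities, randoms)) if r <= p]

def sequential_poisson_sample(probabilities, randoms, n):
    """The n units with the smallest transformed values x_i = r_i / P_i."""
    x = [r / p for p, r in zip(probabilities, randoms)]
    ranked = sorted(range(len(x)), key=lambda i: x[i])
    return sorted(ranked[:n])

probabilities = [0.8, 0.5, 0.3, 0.6, 0.2, 0.6]      # hypothetical P_i, summing to 3
randoms = [0.35, 0.72, 0.10, 0.90, 0.15, 0.41]      # hypothetical r_i in [0, 1)

print(poisson_sample(probabilities, randoms))                  # sample size is random
print(sequential_poisson_sample(probabilities, randoms, n=3))  # sample size fixed at 3
```

Here Poisson sampling happens to select four units although the probabilities sum to 3, while the sequential procedure keeps only the three units with the smallest 𝑥𝑖.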