Aspetti metodologici

advertisement
Metodi di Campionamento per la stima di indicatori economici per
sottopopolazioni di imprese
VERSIONE PROVVISORIA
Piero Demetrio Falorsi, Stefano Falorsi, Paolo Righi1
ABSTRACT
In the present work balanced sampling and coordinated sampling, useful in order to
obtain planned sample size for domains belonging to different partitions of the
population, have been examined. In particular, these sampling methods may be applied
to deal with small area problems in the phase of the sampling design. The main
advantages of the two techniques are the computational feasibility which permits to
easily implement an overall strategy considering jointly the design and estimation phase
and improving the efficiency of the estimators. An empirical simulation on real
population data and different domain estimators shows the empirical properties of the
examined sample designs.
Key words: Planning Sample Size of Small Domains, Controlled Selection, Sample
Coordination, Balanced Sampling.
1. Introduction
The small area problem is usually considered to be treated via estimation. However, if
the domain indicator variables are available for each unit in the population there are
opportunities to be exploited at the survey design stage. As noted by Singh et al. (1994),
there is a need to develop an overall strategy that deals with small area problems,
involving both planning sample design and estimation aspects. In this framework it is
crucial to control the sample size for each domain of interest, so that each domain is
treated as a planned domain, at design stage, for which it is possible to produce direct
estimates with a prefixed level of precision. In general, with a design-based approach to
the inference, the presence of sample units in each domain allows to compute domain
estimates although not always reliable. Furthermore, in the model-based or modelassisted approach, the presence of sample units in each estimation domain permits to
utilise models with specific small area effects, allowing more accurate estimates of the
parameters of interest at small area level (Lehtonen, et al., 2003). For small domains it
will often be needed to use indirect estimators (Rao, 2003), such as synthetic estimators,
for producing reliable estimates; nevertheless, synthetic estimators are very sensitive to
the model assumption on which they rely on providing a good description of the data
structure. For a true model there will be no design bias but if the model does not fit well
for the whole domain structure there will be some design bias, at least in a part of the
1
Italian National Institute of Statistics - ISTAT, Via Magenta 4, 00185 Roma, parighi@istat.it.
data, inflating the mean square error of the estimates. Of course, there will never be a
true model available in practice. In order to overcome this problem it is decisive to base
the domain estimation on the sample units belonging to the small domain. The sampling
design techniques considered in this paper allow controlling the sample sizes for
domains of interest which are defined by different partitions of the reference population.
Such techniques are useful when the overall sample size is relatively small and,
therefore, in some of the partitions there are small domains. In fact, when the aim of the
survey is to produce estimates for two or more partitions of the population, a standard
solution to obtain planned sample sizes for the domains of interest is to use a stratified
sample with the strata given by cross-classification of variables defining the different
partitions. In the following, this design will be denoted as cross-classification design. In
many practical situations, the cross-classification design is often unfeasible since it
needs the selection of at least a number of sampling units as large as the product of the
number of categories of the stratification variables.
In order to overcome some problems of cross-classification designs, an easy strategy is
to drop one or more stratifying variables or to group some of the categories.
Nevertheless, some planned domains become unplanned and some of them can have
small or null sample size.
Many methods have been proposed in the literature to keep under control the sample
size in all the categories of the stratifying variables without using cross-classification
design. These methods, are generally referred as multi-way stratification techniques,
and have been developed under two main approaches:
– methods based on selection according to Latin Squares or Latin Lattices schemes
(Bryant et al., 1960; Jessen, 1970);
– methods based on controlled rounding problems via linear programming (Causey et
al., 1985; Sitter and Skinner, 1994).
The seminal paper of Bryant et al. (1960) suggests to allocate the units in the sample by
means of a two-way Latin Square table randomly selected and two estimators of the
parameter of interest are proposed. The method implies that expected sample counts in
each stratum display independence between the rows and columns of the two-way table;
in this way it can be used without any sort of limitations only when the stratifying
variables are also independent in the population. Another drawback is that it is not
possible to implement the procedure when there is no population in one or more crossclassification strata. In order to solve these problems, Jessen (1970) proposes two
approaches, both fairly complicated to implement and not always leading to a solution
(Causey et al., 1985).
As far as concerns the methods based on linear programming, Causey et al. (1985)
consider the controlled multi-way stratification as a rounding problem solved by means
of transportation theory. The method may not have a solution in case of three or more
stratification variables. Following the linear programming approach to controlled
sampling designs proposed by Rao and Nigam (1990, 1992), Sitter and Skinner (1994)
suggest a method based on linear programming more flexible to different situations than
the method proposed by Causey et al. (1985) and some further computational
simplification of Sitter and Skinner method have been suggested by Lu and Sitter
(2002). Nevertheless, the main weakness of the linear programming approach is the
computational complexity.
As a consequence, the drawbacks of both approaches have limited the use of multi-way
stratification techniques as a standard solution for planning the survey sampling
designs. This paper illustrates as two well known sampling techniques may be useful
for controlling the sample size in all the domains of interest without suffering from the
disadvantages of the above mentioned methods.
The first technique faces the problem by means of the selection of a balanced sample
(Deville, Tillé, 2004) based on domain membership indicator variables. In this context
the balancing equations assure that the prefixed marginal sample sizes are satisfied.
The second technique deals with the problem by means of procedures developed to
guarantee the sample coordination. The sampling units are selected from separate
stratified sampling designs, each one defined by a single auxiliary variable. The samples
are selected using permanent random number techniques (Ohlsson, 1995) assuring the
maximum overlap of the samples.
The paper is organised as follows. Section 2 introduces the essential notation. Section 3
is dedicated to the balanced sampling, while section 4 focuses on coordination
techniques. Section 5 deals with the estimation topics, and section 6 shows the main
results of a simulation study conducted on a real population of Italian enterprises.
Finally some brief conclusions are underlined in section 7.
2. Notation and parameters of interest
In order to simplify the notation, the paper will be restricted to the case of two
stratifying variables. However, the generalisation to the multi-way case is
straightforward.
Let U define the population of interest of size N. Suppose that two auxiliary variables
V1 and V2 are known for each unit k (k=1, …, N) of the population with, respectively, R
and C modalities. In particular, the generic value or class of values assumed by V1 and
V2 is denoted respectively with v1r and v 2c . Let U rc  U r.  U .c specify the
subpopulation of size N rc  0 , being U r . and U .c the subpopulations of size
N r .  c N rc and N .c 
r N rc
for which V1  v1r and V2  v2c . Let yk define the
value of the target variable in the unit k ( k U ) . There are R  C  1 parameters of
interest defined by,
Y
 yk ,
k U
Yr. 
 yk , (r  1,, R) ,
k U r.
Y.c 
 yk , (c  1,, C ) .
k U .c
In order to obtain the sample estimates of the parameters of interest, the standard
cross-classification design selects a stratified sample s of size n from the population U
with probability p(s), where the strata are obtained combining the categories of the
variables V1 and V2 . Let s rc (r  1, , R, c  1, , C ) be the subsample of s of size nrc ,
where min b, N rc  n rc  N rc , selected from population U rc ; let b denote the


minimum sample size fixed in advance, where b is equal to 1 for producing unbiased
estimates of the parameters of interest or equal to 2 for being able to compute unbiased
variance estimates. Furthermore, n  nrc  is the set of the strata frequencies, while
n1  (n1. , , nr. ,, nR. )' and n 2  (n.1 , , n.c ,,n.C )' are the marginal distributions
of n with respect to V1 and V2 , being nr. 
c nrc
and n.c 
r nrc . We will call
marginal stratification with respect to V1 the partition of U in U1. , , U r. ,,U R. , so
C
that nr. is the sample size of subsample s r .  c 1 s rc of the generic marginal stratum
r; while we will call marginal stratification with respect to V2 the partition of U in
R
U.1, , U.c ,,U.C , so that n.c is the sample size of subsample s.c  r 1 s rc of the
generic marginal strata c.
3. Methods based on balanced sampling
Multi-way stratification designs can be treated by means of balanced sampling.
The definition of a balanced sample depends on the assumed inferential framework. In
the model based approach, a sample is defined as balanced on a set of auxiliary
variables if there is the equality between the sample and the known population means
of the auxiliary variables (Royall and Herson, 1973; Valliant et al., 2000). Following
the design based approach considered in this paper, a sample is balanced when the
Horvitz-Thompson estimates of the auxiliary variables totals are equal to their known
population totals (Deville and Tillé, 2004).
Balanced sampling, in this last case, represents a generalisation of probability sampling
designs exploiting auxiliary information known for each unit of the population.
Standard designs like simple design with fixed size, stratified sampling and unequal
probability sampling represent some specific applications of balanced designs. The
multi-way stratification designs belong to this class of designs too.
3.1. General introduction to balanced sampling
For a formal description of balanced sampling, we introduce a general definition of
sampling design. Sampling design is a probability distribution p(.), on the set S of all the
p( s)  1 , where p(s) is the probability of
subset s of the population U such that
sS
the sample s to be drawn. The set S may be represented by the random variable  that
takes the value λ  (1 , ..., k , ...,  N )' , with k  1 if k s and k  0 otherwise. Let
π  ( 1 ,...,  k ,...,  N ) be the vector of inclusion probabilities. We consider the
sampling design p(.) with inclusion probabilities π which assigns a probability p(s) to
each sample s such that E (λ )  sS p( s )λ  π . Let x k  ( x1k ,..., xhk ,..., x Hk )' be a
vector of H auxiliary variables available for each population unit. The sampling design
p(s) is said to be balanced with respect to the H auxiliary variables if and only if it
satisfies the balancing equations given by
ˆ
X
dir


xk

k U
k
k 
X

 xk ,
k U
(3.1)
for all s S such that p(s)>0, where dir X̂ is the Horvitz-Thompson estimator of the
known population total X of the auxiliary variable.
Two simple examples of balanced sampling designs are illustrated in the following.
Example 1. The design of fixed sample size n is balanced on the auxiliary variable
xk   k
x
  kk k  1    k  n ,
kU
ks
kU
for all s S.
Example 2. The stratified design, where for each stratum U h of size N h a simple
random sample of size nh is drawn, is balanced on H auxiliary variables such that the
generic h-th variable assumes the values in the k-th unit
1 if k U h
.
xhk  
0 if k U h
The balancing equations become:
x hk
nh
N
  k k   nhh   xhk  N h ,
ks
k 1
kU
for all s S and for h=1, …, H.
One of the main problems of balanced sampling has always been implementing a
general procedure which gives a multivariate balanced random sample (see Valliant et
al., 2000). Recently, Deville and Tillé (2004) proposed the cube method that allows the
selection of balanced (or approximately balanced) samples for a large set of auxiliary
variables and with respect to different vectors of inclusion probabilities.
The method is based on at most two phases. In the first phase, called flight phase, the
cube method searches for a sampling design satisfying equations (3.1). The sample
selection is performed by means of rounding off randomly to 1 or 0 almost all the
elements of the vector π .
At the end of the flight phase the elements of π equal to 1 or 0 indicate respectively
that the corresponding unit is included or not in the sample. However, the balancing
equations could be exactly satisfied with some fractional element of π (the number of
these elements are at most equal to the number of balancing variables), and therefore the
flight phase does not determine whether the corresponding units have to be drawn or not
in the sample. In these cases the cube method searches for an approximately balanced
sample solution applying the landing phase; this solution relaxes some constraints,
minimising the variance of dir X̂ . The landing phase is set up as a rounding problem
and it is solved via linear programming techniques. A simple example where the
landing phase occurs it is the design of fixed sample size when the sum of inclusion
probabilities is not an integer. In example 1, the balancing equation is never satisfied
with the flight phase if known total n is not an integer.
As described below, in the multi-way stratification design, the cube method is able to
select a balanced sample only via flight phase (Deville and Tillé, 2000). In Chauvet and
Tillé (2006) is presented a faster algorithm for executing the flight phase implemented
by means of a SAS-IML macro. The procedure is computationally feasible for large
data sets with a lot of balancing variables as it has been tested by the authors.
3.2. Balanced sampling applied for marginal stratification
Let us suppose that a vector of inclusion probabilities π , consistent with the marginal
distribution n1 and n 2 , is available, that is
nr.  kU  k
n.c  kU  k
( r  1,...R)
r.
( c  1,...C ) . (3.2)
.c
If π is not available, the well known Iterative Proportional Fitting (IPF, see Bishop et
al., 1975) or the Generalised Iterative Proportional Fitting (GIPF, Dykstra, 1985;
Dykstra and Wollan, 1987) procedure can be used to define it. Starting from an initial
vector * π  (* 1 ,...,* k ,...,* N )' of inclusion probabilities such that kU *  k  n ,
IPF or GIPF adjust the initial probabilities computing the final vector π satisfying the
marginal sample sizes constraints (3.2). The main useful difference between IPF and
GIPF is that the second procedure avoids to define inclusion probabilities greater than 1.
Therefore, it could be better to adopt GIPF to satisfy equations (3.2). However, GIPF
and IPF produce the same results when the latter procedure does not yield a solution
with elements of π greater than 1. A third more empirical procedure to calculate π ,
satisfying the constraints (3.2), is to use IPF procedure iteratively in the following way:
starting from initial vector of inclusion probabilities * π , at the end of the procedure the
 k ’s greater than 1 are set equal to 1. Then the IPF procedure is reiterated with the
same starting inclusion probabilities, removing the units with inclusion probabilities
equal to 1 and updating the marginal constraints so that, if the generic unit k belongs to
the subpopulation U rc , the new marginal sample sizes became nr . 1 and n.c 1 .
The selection by means of a balanced design, taking into account planned small sample
size for two marginal distributions, is related to the choice of the auxiliary variables.
Then we use the vector of auxiliary variable

x k  x11k ,..., x1rk ,..., x1Rk , x12k ...., x2ck ,...., x2Ck

where
 k if k  U r.
x1rk  
0 if k  U r.
 k if k  U .c
.
x2ck  
0
if
k

U
.c

The flight phase of cube method selects a random sample which satisfies the following
constraints:
ˆ

X
dir
X 


 

r

x1k
k  1   x1rk    k  nr.
 
, (r=1,…,R ; c=1,…,C ) (3.3)
ksr .
kU
kU r .
kU  k

x 2ck

c
   k  1   x 2k    k  n.c
ks. c
kU
kU. c
kU k
Equations (3.3) imply that the sample sizes of each subsample s r . and s.c must be
equal to the fixed marginal frequencies nr. and n.c respectively. Deville and Tillé
(2000), show that equations (3.3) can be exactly satisfied by means of the cube method
via the flight phase.
Finally we underline some interesting aspects of balanced sampling.
- The cube method is quite general and it can be easily applied to more than two
partitions in the domains of interest.
- Additional balancing equations may be defined involving other variables correlated
with study ones so as to improve the precision of estimators of totals.
- As far as concerns the estimation phase, the balanced sampling can be used jointly
with an estimator exploiting auxiliary information. This sampling strategy allows to
bound the variability of the sampling weights (for instance calibrated weights) that
causes problems, especially in small domains of estimate. For instance if known
population counts are used in the calibration phase at domain level, it should be
useful to utilize the same auxiliary information as constraints in the design phase, in
addition to the auxiliary variables which allow to respect the marginal sample sizes.
4. Methods based on coordinated sampling
The problem under study may be considered as a particular case of sample coordination.
Generally, coordinated sampling is adopted in order to guarantee an expected
predefined overlap rate between two or more surveys. Minimum and maximum overlap
is indicated respectively as negative and positive coordination. In our case, the sampling
units are selected from two separate sampling designs on the same target population
using respectively V1 and V2 as stratification variables. The two samples are maximum
positively coordinated and the resulting final sample is composed by the union of the
two samples.
In this context the techniques can be classified into:
- methods based on the Permanent Random Numbers (PRN);
- methods implementing linear programming algorithm (Ernst and Paben, 2002).
Within the first techniques (Ohlsson, 1995), a random number (PRN), drawn
independently from the uniform distribution on the interval [0,1], is assigned to each
unit in the frame list. The PRN techniques allow to implement different sampling
designs such as: simple random sampling without replacement, Bernoulli, Poisson and
other probability proportional to size sampling procedures. Then, using the assigned
PRNs, two maximum positively coordinated samples are selected from the two
sampling designs. The second type of methods, that will not be discussed in this paper,
are procedures maximizing the overlap among samples and are based on sequential
application of transportation problems. The algorithms seem to be rather complex
compared with permanent random numbers based procedures.
4.1. General introduction to coordinated sampling using permanent random
numbers
In the PRN context, the sampling units are selected from two separate sampling designs,
say p (1) () and p ( 2) () , on the same target population using V1 and V2 as stratification
variables. Let s (i ) and S (i ) denote, respectively, the generic sample and the set of all the
subset s (i ) generated by the design p (i ) () (i=1,2). The inclusion probabilities of unit k
in the marginal design p (i ) () are indicated with  (i ) k  s
( i )S( i )
p(i ) ( s (i ) ) (i ) k being
(i) k a dummy variable that equals to 1 if k  s(i ) , and equals to 0 otherwise. Two
positively coordinated samples, say s (1) and s ( 2 ) , are selected respectively according to
p (1) () and p ( 2) () designs and the resulting final sample s is composed by the union of
the two samples. Let us denote with zk the PRN assigned to the unit k (k=1, …, N) and
n1  (n1. , , nr. ,, nR. )'
for each above considered design let
and
n 2  (n.1 , , n.c ,,n.C )' indicate the sample sizes, being
kU
r.
 (1) k  nr. and
kU
.c
r 1 nr.  c 1 n.c  n ,
R
C
 ( 2) k  n.c .
The final sample s, being s  s (1)  s ( 2) , is selected with the joint design p(s) and the
corresponding inclusion probabilities are given by  k  Pr k  s   sS p( s ) k , with
k =1 if either (1) k or ( 2) k equals to 1. With these techniques, the realised sizes of
the final sample are random outcomes with expected values larger than the constraints
n, n1 and n 2 . Empirical results on Italian business surveys has shown that the overall
realised sample size is generally larger than the prefixed n between 10% and 30%. In
order to partially overcome this problem, in section 4.3 is shown an algorithm which,
starting from initial *  (1) k and *  ( 2) k probabilities, calculates modified inclusion
probabilities,  (1) k and  ( 2) k . These guarantee that the expected marginal sample sizes
in the joint sample design are roughly equal to the constrains n1  (n1. , , nr. ,, nR. )'
and n2  (n.1 ,  , n.c , ,n.C )' .
In the following section we will illustrate how coordination may be realised with some
important designs.
4.2. Coordination with some sampling designs
Design I: simple random sampling without replacement in each marginal stratum
With this design we have  (1) k  nr . N r . for k  U r . (r=1,…,R) and  ( 2) k  n.c N .c
for k  U .c (c=1,…,C).
The selection schema is very simple: for each design, the frame units are ordered by
stratum and, in each stratum, by ascending order of zk . In the stratum U r . of the design
p (1) () , the sample is composed of the first nr . units in the ordered list. In stratum U .c
of the design p ( 2) () , the sample is composed of the first n.c units in the ordered list.
Ohlsson (1992) presents a formal proof that this technique realises a sample in which
the inclusion probabilities in s (1) and s ( 2 ) samples agree with the prefixed values, being
Pr (k  s(1) )   (1) k  nr . N r .
for k  U r. and Pr (k  s( 2) )   ( 2) k  n.c N .c for
k  U .c . The unit k is included in the overall sample s, if it is included in s(1) with
probability  (1) k or if it is included in s ( 2 ) with probability  ( 2) k . There is no formal
proof of the values assumed by the inclusion probabilities  k . An intuitive reasoning
suggests that  k  max  (1) k ,  ( 2) k . Empirical simulations of coordinated sample


selection with design I has shown that the  k values are only slightly larger than
max  (1) k ,  ( 2) k .


Design II: Poisson sampling
With this design we have
 (1)k  nr. g k
 gk
for k  U r . (r=1,…,R)
(4.1)
 g k for k U.c (c=1,…,C),
(4.2)
kU r .
and
 (2)k  n.c g k
kU.c
where g k is a non-negative measure of size for the unit k. The unit k is included in the
sample s (i ) (i=1 or 2) if z k   (i ) k . The inclusion probabilities of the final sample are,


thus,  k  max  (1) k ,  ( 2) k .
Design III: probability proportional to size without replacement
This design may be only approximated with PRN techniques. The two approximated
solutions proposed are order sampling and collocated sampling. The joint design of
both techniques may be summarised as follows: (i) the expected sample sizes are larger
than the constraints n, n1 and n 2 ; (ii) there is no formal proof of the values assumed by
the inclusion probabilities  k , but in practical situations they may be well
approximated by  k  max  (1) k ,  ( 2) k .
The order sampling (Rosen 1997a, 1997b) represents an interesting way to avoid the
perceived drawback given by random sample size of Poisson sampling and still comes
very close to achieve the desired inclusion probabilities (4.1) and (4.2). A particular
case of order sampling is Pareto sampling, carried out as follows:
(i) for every k U , compute  (i ) k  z k (1   (i ) k ) (1  z k )  (i ) k (i=1,2);


(ii) for the sample design p (i ) () (i=1,2), the frame units are ordered by stratum and in
each stratum by ascending order of  (i) k . In stratum U r. of the design p (1) () , the
sample s (1) is composed of the first nr . units in the ordered list and in stratum U .c
of the design p ( 2) () , the sample s ( 2 ) is composed of the first n.c units in the
ordered list.
Another case of order sampling is the sequential Poisson sampling. The sampling
selection is carried out with steps, (i) and (ii) above illustrated, but the  (i) k values are
defined by  (i ) k   k  z k / N g k . Both the above order sampling schemas realise the
marginal fixed sample sizes n1 and n 2 for a given marginal design s (i ) (i=1, or 2);
but the inclusion probabilities for p (1) () and p ( 2) () are slightly different from the
planned probabilities  (1) k and  ( 2) k expressed respectively by (4.1) and (4.2).
With collocated sampling, the frame units are randomly ordered by stratum and in each
stratum by ascending order of zk , giving unit k ( k  U rc ) the ranks Lr ., k and L.c ,k in
the marginal strata U r . and U .c of p (1) () and p ( 2) () designs. Then the new variables
r(1) k 
Lr .,k  
N r.
,
r( 2) k 
L.c,k  
N .c
are computed where  is a random number drawn independently from the uniform
distribution on the interval [0, 1]. The unit k is included in the sample s (i ) (i=1 or 2) if
r(i ) k   (i ) k . For each marginal design the collocated sampling is strictly pps with
inclusion probabilities, expressed respectively by (4.1) and (4.2) , for p (1) () and
p ( 2) () designs; but the realised sample has a variability lower than the one assured by
the design II.
As a concluding remark we note that the methods based on PRN techniques are very
easy to implement and for some designs unbiased estimators of the variance may be
computed, but designs based on probability proportional to size without replacement
may be only approximated, in the sense that in order sampling the theoretical inclusion
probabilities  (i) k (i=1 or 2) may be only approximated and the collocated sampling
does not assure to attain the fixed sample sizes n1 and n 2 .
4.3. An algorithm for controlling the expected sample sizes
Let * π (1)  (* (1)1 ,...,* (1) k ,...,* (1) N )' and * π ( 2)  (* ( 2)1 ,...,* ( 2) k ,...,* ( 2) N )' denote
the initial vectors of inclusion probabilities of designs p (1) () and p ( 2) () , being
kU
r.
*  (1) k nr. and
kU
.c
*  ( 2) k n.c .
Starting from initial *  (1) k and *  ( 2) k probabilities, the algorithm calculates modified
inclusion probabilities,  (1) k and  ( 2) k . These may be used as inclusion probabilities
in the marginal designs and they guarantee that the expected marginal sample sizes of
the joint sample design are equal to constraints n1  (n1. , , nr. ,, nR. )' and
n2  (n.1 ,  , n.c , ,n.C )' .
The algorithm makes use of the condition  k  max  (1) k ,  ( 2) k that is true for
Poisson sampling and only approximated by other designs. The algorithm is articulated
in the following steps.
Step 0. Initialization. Denote with t the iteration index (t=0,1,…). Let us pose at t=0:
0
 (i ) k  * (1) k (i=1 or 2).


Step 1. Calculus of updated modified probabilities. The updated modified inclusion
probabilities t  (i ) k (i=1 or 2), at iteration t (t=1,2,…), are obtained by solving
the following minimum constraints problem
2



 tMin kU  G ( t  (i ) k ; t 1 (i ) k )
  (i ) k 
i 1


t
t
t
t
kU x k  (1) k  (1  x k )  ( 2) k n
t

x t
 (1  t x k ) t  ( 2) k  nr . ; r  1,..., R  1
kU r . k (1) k
t

x k t  (1) k  (1  t x k ) t  ( 2) k n.c ; c  1,..., C  1
k

U
.c

in
which:
G ( t  (i ) k ; t 1 (i ) k )  t  (i ) k ln( t  (i ) k /
(4.3)
t 1
 (i ) k )  t  (i ) k  t 1 (i ) k
(i=1, 2) is the logarithmic distance function between t 1  (i ) k and t  (i ) k ; t x k
is an indicator variable that equals 1 if
t 1
 (1)k 
t 1
 (2)k and equals 0
otherwise.
Step 2. Iteration. With updated values
values
t 1
t
 (i ) k (i=1 or 2), let us calculate the updated
xk and then compute
t
t
nr.  kU
n.c  kU
t 1
r.
.c
t 1
xk t  (1)k  (1 
t 1
xk t  (1)k  (1 
t 1
xk ) t  (2)k (r=1,…, R)
xk ) t  (2)k (c=1,…, C).
For a fixed close to zero small positive quantity,  , if the following condition
holds
C
 R t


  | nr.  nr. |   | t n.c  n.c |    
 r 1

c 1

(4.4)
then the algorithm stops and the unit k is selected with the modified
probabilities  (1) k  t  (1) k and  ( 2) k  t  ( 2) k , defined as solution of (4.3) at
iteration t; otherwise step 1 is iterated at t=t+1 until condition (4.4) is
respected.
Note that (4.3) represents a calibration process, that may be solved by the well known
IPF (Bishop et al., 1975) or GIPF (Dykstra, 1985; Dykstra and Wollan, 1987)
procedures. The logarithmic distance function avoids to define t  (i ) k lower than 0,
while GIPF prevents to obtain t  (i ) k values larger than 1. A third more empirical
procedure to obtain  (i) k values simultaneously satisfying the (4.3) and lower than 1 is
the following: solve (4.3) with IPF procedure and set t  (i ) k  t 1 (i ) k if t  (i ) k  1 .
5. Estimation of domain totals
In this section some considerations on the estimation are presented.
In order to simplify the notation of the estimators, we denote the populations U, U r . and
U .c generically as U d (d=1,…,D=R+C+1). Furthermore with analogous symbolism, we
indicate with s d and Yd , respectively, the sample and the total of interest of the
population U d .
The direct estimator of the parameter of interest is given by
ˆ 
dir Yd
 y k ak
(5.1)
ksd
where a k are the base weights. If the probabilities  k are known (as in the case of
balanced sampling and Poisson sampling), the base weight is a k  1 /  k . Otherwise, if
the  k values are unknown (as in the case of Pareto sampling and sequential Poisson
sampling), we may approximate each of them as  k  max  (1) k ,  ( 2) k . Another

solution may be to set the base weights a k equal to

1 1
2 
 (1) k
 1  1
1
ak   


 2   (1) k  ( 2) k
1 1

 2  ( 2) k
if k  s(12 )

 if k  s(12)  s(1)  s( 2)


(5.2)
if k  s( 1 2)
where s(12 ) and s ( 1 2) denote respectively the set of units included only in sample s (1)
and the set of units included only in sample s ( 2 ) . Adopting the weights defined in (5.2),
the estimator (5.1) is unbiased as below illustrated:
Ep
dir Yˆd   12 E p ks

yk
(12 )
 (1) k
 dk  ks
(12)
 ks
(12)

yk
 dk 
yk
 dk  ks
 (1) k
 ( 2) k
( 1 2)
yk
 ( 2) k

 dk  



 1
yk
yk
1  
 dk   E p  ks
 dk   Yd  Yd   Yd ,
 E p ks


 2
(1) 
( 2) 
2  
(1) k
( 2) k




being  dk  1 if unit k belongs to U d and  dk = 0 otherwise and E p denotes the
expectation over repeated sampling.
Domain estimates could be improved using auxiliary information and a superpopulation
model Em ( y k )  f (x k ; β) that, for each unit k, links the model expected value,
E ( y ) , to the auxiliary vector x , where f(.; β ) is a functional form depending by the
m
k
k
vector β of unknown parameters. After obtaining the sample estimates β̂ of β , based
on y , x ; k  s, the predicted value yˆ  f (x ; βˆ ) for each unit k U are
k
k
k
k
computed, assuming that x k is known for each unit k in the population.
Two different estimator types, the Synthetic and the Generalised Regression, are
considered. Following Lehtonen et al. (2003), the above estimators are expressed by:
ˆ   yˆ
k
U
synYd
(5.3)
d
ˆ   yˆ   a ( y  yˆ ) .
k
k
k
U
s k
gregYd
d
(5.4)
d
An another estimator, often used in small area context, is the composite estimator
expressed by compYˆd  U yˆ k  γˆd s ak ( y k  yˆ k ) , where γ̂ d is a domain-specific
d
d
weight appropriately constructed to have certain optimality properties; it approaches
unit for increasingly large domain sample size, and it tends to zero for decreasingly
sample size.
The predictions ŷ k ; k  U  differ from one model specification to another, depending
on the functional form and from the choice of the auxiliary variables. As illustrated in
Särndal et al., (1992, p. 399), adopting a linear model, Em ( y k )  x' k β , if each
prediction is obtained by means of a submodel defined on the population U d , the
estimators (5.3) and (5.4) agree. Moreover, linear model allows to define the above
estimators knowing only the domain totals of the auxiliary information and the x k
values for the sampling units. However, knowing x k values for every k U it is
possible to construct an estimators with more efficient predictions ŷ k obtained by
generalised linear models (Lehtonen and Veijanen, 1998) or non parametric regression
techniques (Montanari and Ranalli, 2003).
Finally, let us point out that the synthetic estimator relies on the truth of the model and
it is usually biased; nevertheless, as noted in Lehtonen et al. (2003), if the model
incorporates a specific domain effect (fixed or random) the bias is strongly reduced and
with the incorporation of domain-specific terms in the model it is good to control the
sample size at domain level.
6. Monte Carlo experiment to evaluate the sampling strategies
6.1. Background information
The simulation is carried out on real enterprises data. The analysis has been focused to
the 1999 population of the enterprises from 1 to 99 employers belonging to the
Computer and related economic activities (2-digits of the NACE rev.1 classification). In
order to simplify the empirical analysis some units with outlier values have been
deleted. At the end of the cleaning procedure, the data base used for the simulation
study has N=10,392 enterprises. The value added and labour cost are the variables of
interest chosen in the simulation. The variable values are available for each unit in the
population by an administrative data source. According to the EU Council Regulation
n°58/97 on Structural Business Statistics the estimation domains are defined as different
partition subsets of the population. In particular, we consider two partitions: (1)
geographical region with 20 marginal domains, hereafter referred as DOM1; (2)
Economic activity group (3-digits of the NACE rev.1 classification with 6 different
groups) by Size class (defined in terms of number of persons employed: 1=1-4; 2=5-9;
3=10-19; 4=20-99) with 24 marginal domains, hereafter referred as DOM2. Therefore,
the overall number of marginal domains is 44, while the number of the
cross-classification strata is 480 but only 360 strata have one or more population units.
The experiment is based on an overall sample size of 360 units, planning the sample
size of each domain of DOM1 and DOM2 such that the coefficient of variation of the
domain estimates for the variable number of employers (supposed known at sampling
stage) is boundend to be lower than 34.5% for the DOM1 domains and to 8.7% for the
DOM2 domains. This sample allocation represents a compromise between the
Allocation Proportional to Population (APP) size and an allocation uniform for each
domain of each partition. In the following we refer to the domains with the planned
sample size greater than the APP sample size as oversized domains. These domains
need to have a sample size larger than the APP sample size to bound the sampling
errors; roughly speaking these domains may be classified as small domains. The
analysis of the empirical results will be focused on these domains.
Table 1 reports the planned sample sizes and the APP sample sizes for domains of the
DOM1 and DOM2 partitions. The small domains are reported in grey cells.
Table 1. Population and sample sizes for the domains of interest
DOM1
Population
Geographical
size
Region
Piemonte
DOM2
APP
Population
sample 3-digit NACE Rev. 1;
size
size
size class
Planned
sample size
APP
sample
size
Planned
sample size
912
18
32 721;1
57
21
2
21
7
1 721;2
29
6
1
2,903
19
101 721;3
21
3
1
179
17
6 721;4
21
10
1
1,064
17
37 722;1
1,916
34
66
Friuli V. Giulia
246
18
9 722;2
874
7
30
Liguria
280
11
10 722;3
604
6
21
Emilia R.
898
19
31 722;4
521
30
18
Toscana
693
19
24 723;1
2,970
32
103
Umbria
110
20
4 723;2
1,006
6
35
Marche
203
21
7 723;3
498
6
17
1,239
20
43 723;4
283
28
10
114
19
4 724;1
46
21
2
34
17
1 724;2
14
5
0
Campania
445
15
15 724;3
11
4
0
Puglia
372
17
13 724;4
13
10
0
47
21
2 725;1
246
27
9
Calabria
103
21
4 725;2
98
6
3
Sicilia
371
23
13 725;3
61
5
2
Sardegna
158
21
5 725;4
30
14
1
10,392
360
360 726;1
643
37
22
726;2
204
7
7
726;3
138
7
5
726;4
88
28
3
Total
10,392
360
360
Valle d'Aosta
Lombardia
Trentino A. A.
Veneto
Lazio
Abruzzo
Molise
Basilicata
Total
A grey cell denotes a small domain
In the simulation we have considered seven sampling designs as reported in table 2.
Five hundred Monte Carlo samples have been selected for each sampling design.
Table 2. Sampling Design used in the simulation study
Sampling Design
Abbreviation
SRSWOR*
Stratified on DOM1 with
in each stratum
Stratified on DOM2 with SRSWOR* in each stratum
Balanced sampling on the marginal sample sizes and on population sizes
Balanced sampling on the marginal sample sizes
Coordinated Pareto sampling
Coordinated Poisson sampling
Coordinated Sequential Poisson sampling
STDOM1
STDOM2
BALPOP
BAL
CPAR
CPOI
CSPOI
*SRSWOR: Simple Random Sampling Without Replacement
In the experiment STDOM1 and STDOM2 have planned sampling sizes for a single
partition respectively DOM1 and DOM2. BALPOP is based on (3.3) adding the
balancing equations
ks
d
 dk  k  N d (d=1, …, 44), while in BAL only the
equations (3.3) are considered. Starting from initial probabilities *  k  n N , the
inclusion probabilities for BALPOP and BAL have been attained applying the IPF
procedure iteratively (four times) as described in section 3.2. In the coordinated designs
CPAR, CPOI and CSPOI, the marginal sample sizes are satisfied only as expectation
over repeated sampling. Starting from initial probabilities *  (1) k  nr . N r . for k  U r.
and *  ( 2) k  n.c N .c for k  U .c , the final inclusion probabilities  (1) k and  ( 2) k have
been obtained by means of the algorithm described in section 4.2; the final solution has
been found with eight iterations.
For each sample the estimates of the domain totals have been computed by the direct,
generalised regression and synthetic estimators. In particular, for the three coordinated
sampling we have examined two direct estimators. The first one adopting as direct
weight the value a k  max(  (1) k ,  ( 2) k ) 1 , while the ak weights in the second direct
estimator are defined by the expression (5.2). We show the results only for the first
direct estimator because the simulation study has stressed that the second estimator has
much higher variability than the first estimator.
As far as concern the estimators using auxiliary information, two simple
homoschedastic linear models have been implemented: the model (1) uses 10 auxiliary
variables, six of them are the economic activity group membership indicators, and the
remaining four are the size class membership indicators; the model (2) uses the 44
domain membership indicator variables.
The linear model (1) is expressed by

Em ( y k )   a   b

for k  U a U b ,
where U a is the population of enterprises of a-th (a=1, …, 6) economic activity group
and U b is the population of enterprises of b-th (b=1, …, 4) size class of the number of
employers and  a and  b are the fixed effects of the a-th economic activity group and
the b-th size class.
The linear model (2) is
Em ( y k )   r   c
for k  U r. U .c ,
where the subscripts r (r=1,…,20) denotes the generic domain belonging to the DOM1
partition and c (c=1,…,24) denotes the generic domain belonging to the DOM2 partition
and  r and  c are the separate domain-specific effects.
We point out that the main aim of the experiment is to compare different sampling
designs using the same estimator. In this context the choice of the best model does not
represent a central issue. Hence, we have considered two quite general feasible models
that can be implemented in all situations of planned domains. The model (1) is
somewhat more reliable since the estimates of the regression parameters are based on
large sample sizes; while in model (2) it is possible to evaluate the effect of planning the
domain sample sizes, although the estimates of each regression parameter are based on
small sample sizes. Obviously, more flexible model formulations could be possible as
described for instance in Lehtonen et al. (2005).
As pointed out in section 5, using the model (2) the synthetic and the generalised
regression estimators give identical results. In the following each sampling strategy is
indicated in short by the couple (dis, est), where dis indicates one of the 7 sample
designs referred in table 2 and est assumes the categories dir, syn, and greg respectively
for the estimators (5.1), (5.3) and (5.4).
We have computed two quality measures: the average Absolute Relative Bias (ARB)
and the average Relative Mean Square Error (RMSE ) expressed by
ARB F (dis , est ) 
1
1 500
 
card ( F ) dF 500 i 1
RMSE F (dis , est ) 

ˆi
estYd ( dis )  Yd
 1 500
1
 
card ( F ) dF  500 i 1


Yd  100 ,

2
ˆi
estYd ( dis )  Yd

Yd2   100

denoting with: F a specific subset of the marginal domains; card(F) the cardinality of F;
Yˆ i (dis ) the i-th Monte Carlo sample estimate (i=1,…, 500) of the total Y in the
est d
d
strategy (dis, est). In particular, F represents alternatively the subset of small domains of
DOM1, DOM2 or the overall set of small domains (of both DOM1 and DOM2).
6.2. Simulation results
The Monte Carlo simulation study highlights that the multi-way stratification
techniques proposed in this paper are able to take bias and variability under control with
respect to two benchmark strategies ,collapsing one of the two stratification variables.
The main results of the experiment referred to the small domains set are showed in table
3. The table is organised in four blocks: the first one illustrates the quality measures of
the direct estimator; the second and third block are dedicated respectively to the syn and
greg estimators based on 10 auxiliary variables (model (1)); the forth block presents the
results of syn or greg estimators based on the 44 domain membership indicator
variables. We restrict the comments only on the value added variable but similar
consideration could be expressed for the labour cost variable. In general, the comments
are referred to the overall set of small domains.
Table 3. Comparison of small domains sampling strategies: Average Absolute
Relative Bias ( ARB ) and Relative Mean Square Error ( RMSE )
Value Added
Sampling Design
DOM1
ARB
RMSE
DOM2
ARB
RMSE
Labour Cost
Overall
ARB
RMSE
DOM1
ARB
RMSE
DOM2
ARB
RMSE
Overall
ARB
RMSE
Direct estimator (block 1)
STDOM1
1.79
8.18 148.28
5.41 102.74
1.72
6.86 155.87
4.63 106.88
STDOM2
3.42 107.49
0.47
15.26
1.75
55.23
3.32 105.66
0.46
12.66
1.70
52.96
BALPOP
0.77
24.86
1.29
38.49
1.06
32.58
0.74
23.60
1.20
34.26
1.00
29.64
BAL
0.84
25.43
1.45
40.61
1.19
34.03
0.79
24.22
1.57
35.80
1.23
30.78
CPAR
1.35
32.52
2.81
53.85
2.18
44.60
1.44
31.68
2.62
51.44
2.11
42.88
CPOI
1.50
38.64
3.01
63.40
2.36
52.67
1.42
37.64
3.11
61.92
2.38
51.40
CSPOI
1.61
32.90
2.73
56.28
2.24
46.15
1.68
31.76
2.99
54.28
2.42
44.52
43.19
42.82
Synthetic estimator with 10 auxiliary variables (block 2)
STDOM1
14.22
18.88
13.81 100.55
13.99
65.16
12.29
18.40
9.25
95.03
10.57
61.83
STDOM2
24.82
33.96
14.48
15.96
20.34
26.16
13.13
14.79
12.46
23.11
12.75
19.51
BALPOP
13.68
17.51
24.98
43.98
20.09
32.51
11.89
15.60
12.35
33.08
12.15
25.50
BAL
14.92
18.46
21.82
41.66
18.83
31.61
13.37
16.91
10.41
32.64
11.69
25.82
CPAR
13.68
17.83
23.45
44.63
19.22
33.02
11.82
16.13
11.69
34.93
11.75
26.78
CPOI
13.44
17.55
23.17
47.18
18.95
34.34
11.36
15.73
12.04
37.55
11.75
28.10
13.78
17.75
25.11
45.07
20.20
33.23
12.21
16.52
13.06
36.17
12.69
27.66
11.79 119.23
CSPOI
Generalised regression estimator with 10 auxiliary variables (block 3)
STDOM1
2.35
30.13
7.40
81.03
1.86
29.28
7.49
80.25
STDOM2
3.98
58.62
0.95
15.26
2.26
34.05
2.90
52.66
0.93
12.66
1.78
29.99
BALPOP
1.11
19.41
2.20
25.80
1.73
23.03
1.01
16.42
1.99
21.73
1.57
19.43
BAL
1.63
19.41
1.76
26.11
1.70
23.21
1.21
16.72
2.08
21.96
1.70
19.69
CPAR
1.04
21.27
1.63
29.30
1.37
25.82
1.03
18.27
1.11
24.60
1.08
21.86
CPOI
1.67
21.99
1.92
32.53
1.81
27.96
1.83
18.89
2.22
27.47
2.05
23.76
CSPOI
1.32
21.68
1.17
29.31
1.23
26.00
1.30
18.11
1.07
25.39
1.17
22.23
11.26 119.95
Synthetic or Gen. regression estimator with 44 auxiliary variables (block 4)
STDOM1
STDOM2
3.39
31.30
27.48
63.22
17.04
49.39
17.24 102.24
2.76
30.80
28.67
63.05
17.44
49.08
23.00 102.64
1.37
20.65
8.25
56.00
1.42
19.10
10.77
55.30
BALPOP
1.07
20.71
1.97
26.98
1.58
24.26
1.08
17.62
1.93
24.07
1.56
21.27
BAL
1.47
20.36
2.13
28.46
1.84
24.95
1.41
17.66
2.02
25.10
1.75
21.88
CPAR
1.79
23.38
2.22
32.39
2.03
28.48
1.65
20.73
2.08
30.39
1.90
26.21
CPOI
3.94
30.21
3.06
33.60
3.44
32.13
3.35
25.28
3.13
31.68
3.23
28.91
CSPOI
1.71
24.29
2.52
31.84
2.17
28.57
2.13
21.28
2.48
30.16
2.32
26.31
Examining firstly direct estimates, we observe the following.
- The two benchmark designs (STDOM1 and STDOM2) have an RMSE value for
the unplanned domains equal to 148.28% and 107.49% respectively. These values
cause the large RMSE values computed for the overall set of small domains and
respectively equal to 102.74% and 55.23%.
- The STDOM2 shows better results than those attained by STDOM1. This finding is
explained by the fact that the STDOM2 stratification criterion is correlated with the
variables of interest and takes under control a larger number of small domain than
the STDOM1 stratification.
- As far as concerns the overall set of small domains, the BALPOP is the more
efficient design, both in terms of ARB (1.06%) and RMSE (32.58%), even if BAL
is only slightly worse.
- The strategies adopting the coordinated sampling show worse values with respect to
balanced sampling but they perform better in terms of RMSE than benchmark
strategies. In particular, CPAR design has the smallest RMSE (44,60%) followed
by CSPOI (46,15%) and CPOI (52,67%). Nevertheless, the ARB values are quite
high and larger than that attained by STDOM2 design. These findings depend on
DOM2 domains where the ARB values are greater than 2%.
Considering the synthetic estimator based on 10 auxiliary variables, some issues may be
pointed out.
- All designs are characterized by a large bias. The STDOM1 has an ARB equal to
13.99% (although it has an unacceptable RMSE that is equal to 65,16%). The rest
of the designs have the ARB values higher than 18%. This evidence gives a
warning against the use of synthetic estimator.
- The STDOM2 design has the lowest RMSE (26,16%), because of a strong
reduction of the DOM1 variance. However, the ARB value (20.34%) is the largest
of all designs.
- The behaviour of balanced and coordinated designs in terms of bias and variance
are more or less equal. The BAL has the lowest ARB (18.33%) and RMSE
(31.61%) values.
The experimental results of the greg estimator in third block of the table 3 suggest some
considerations.
- All the designs show strong improvements of the quality measures. In general,
the ARB measure has a remarkable reduction with respect to the same indicator
computed on the synthetic estimator. Only the STDOM1 still presents a high ARB
value (7.40%).
- In the STDOM2, the reduction of the bias is more than compensated from the
increase of the variability. This produces an RMSE equal to 34.05%.
- Both the balanced and the coordinated designs have good performances, though the
balanced designs are slightly better being the RMSE roughly equal to the 23%.
- The CPAR design is the best coordinated sampling in terms of RMSE (25.82%)
with a small ARB value (1.37%). The CSPOI design has the lowest ARB value
(1.26%) and bounded RMSE value (26.00%). Therefore, these results highlight the
equivalence of CPAR and CSPOI designs.
- Finally, the syn or greg estimator based on 44 auxiliary variables show analogous
results to the greg estimator based on 10 auxiliary variables. The balanced designs
are the best with slight preference for the BALPOP sampling. The CPAR and
CSPOI designs are enough close to the balanced designs even if the bias increases.
As general findings, the balanced designs seem to guarantee a best strategies to take
under control bias and variance of the overall set of the small domains.
Like for balanced designs, the coordinated sampling have good performances with
respect to the benchmark designs. In particular, CPAR and CSPOI are preferred to
CPOI and could be a valid alternative to balanced sampling when using the syn and
greg estimators.
The conclusion is that for all blocks, BALPOP shows best overall performance with
respect to bias and accuracy and in BALPOP, the strategy with greg estimation with the
ten auxiliary variables (block 3) is a safe choice for both value added and labour cost.
Furthermore, the results show that the synthetic estimator of block 2 must be considered
carefully because the bias can be unexpectedly large and the squared bias would be the
dominating part in the RMSE .
7. Conclusions
In the present work balanced sampling and coordinated sampling, useful in order to
obtain planned sample size for domains belonging to different population partitions,
have been examined. In particular, these sampling methods may be applied to deal with
small area problems, in planning the sampling design so as to have planned size for
each domain. These methods are useful for defining an overall strategy considering
jointly the design and estimation phase. That causes some advantages in the different
approaches to the finite population inference: in the design-based framework it is
possible to compute direct domain estimators with prefixed level of precision; while in
the model assisted or model-based framework, the presence of sample units in each
estimation domains permits to utilise models with specific small area effects allowing a
more accurate predictions of the parameters of interest at small area level.
A straightforward method to obtain planned domains is to stratify combining all the
variables, which define the domains of interest. This method has some problems, first
of all it could produce a large overall sample size depending on the number of the strata.
To solve this problem some techniques, generally referred as multi-way stratification
designs, have been proposed. Nevertheless, the main weakness of these techniques is
the computational complexity.
This paper studies in detail the use of the two sampling techniques, the balanced and
coordinated sampling designs, to control the sample size in all the domains of interest
without suffering from the computational complexity.
The experimental results conducted on real business population data, show the optimal
behaviour of balanced sampling with different direct and indirect estimators.
The Pareto sampling seems to be the best design among the coordinated ones; when it is
used jointly with estimators involving auxiliary information, it is slightly worse than the
estimates computed by means of balanced designs. Furthermore, the strategies based on
Pareto sampling may be useful if the sample survey needs to be coordinated over time.
This may be an effective sampling strategy for many business surveys.
Finally, we point out that the paper has been focused on the feasibility of the sampling
selection techniques. The topics connected to the estimation of variance have been
neglected and will be treated in depth in future researches.
REFERENCES
Bishop Y., Fienberg S., Holland P. (1975) Discrete Multivariate Analysis, MIT Press,
Cambridge, MA.
Bryant E. C., Hartley H. O., Jessen R. J. (1960) Design and Estimation in Two-Way
Stratification, Journal of the American Statistical Association, 55: 105-124.
Causey B. D., Cox L. H., Ernst L. R. (1985) Applications Transportation Theory to
Statistical Problem, Journal American Statistical Society, 80: 903-909.
Chauvet G., Tillé Y. (2006) A Fast Algorithm of Balanced Sampling, to appear in Journal
of Computational Statistics.
Deville J.-C., Tillé, Y. (2000) Selection a several unequal probability samples from the
same population, Journal of Statistical Planning and Inference, 86: 89-101.
Deville J.-C., Tillé Y. (2004) Efficient Balanced Sampling: the Cube Method, Biometrika,
91: 893-912.
Dykstra R. (1985) An Iterative Procedure for Obtaining I-Projections onto the Intersection
of Convex Sets, Annals of Probability, 13, 975-984.
Dykstra R., Wollan P. (1987) Finding I-Projections Subject to a Finite Set of Linear
Inequality Constraints, Applied Statistics, 36, 377-383.
Ernst L. R., Paben S. P. (2002) Maximizing and Minimizing Overlap When Selecting Any
Number of Units per Stratum Simultaneously for Two Designs with Different
Stratifications, Journal of Official Statistics, 18: 185-202.
Jessen R. J. (1970) Probability Sampling with Marginal Constraints, Journal American
Statistical Society, 65: 776-795.
Lehtonen R., Veijanen A. (1998) Logistic Generalized Regression Estimators, Survey
Methodology, 24: 51-55.
Lehtonen R., Särndal C. E, Veijanen A. (2003) The effect of Model Choice in Estimation
for Domains, Including Small Domains, Survey Methodology, 1: 33-44.
Lehtonen R., Särndal C. E., Veijanen A. (2005) Does the Model Matter? Comparing
Model-Assisted and Model-Dependent Estimators of Class Frequencies for
Domains, Statistics In Transition, 7: 649-673.
Lu W., Sitter R. R. (2002) Multi-way Stratification by Linear Programming Made
Practical, Survey Methodology, 2: 199-207.
Montanari G. E., Ranalli M.G. (2003) Nonparametric Methods in Survey Sampling, in:
New Developments in Classification and Data Analysis, Vinci M., Monari P.,
Mignani S., Montanari A. (eds), Springer , Berlin.
Ohlsson E. (1995) Coordination of Samples using Permanent Random Numbers, in:
Business Survey Methods, Cox B. G., Binder D. A., Chinnappa B. N., Chirstianson
A., Colledge M. J., Kott, P.S. (eds), Wiley, New York, 153-169.
Rao J. N. K. (2003) Small Area Estimation, Wiley, New York.
Rao J. N. K., Nigam A. K. (1990) Optimal Controlled Sampling Design, Biometrika, 77:
807-814.
Rao J. N. K., Nigam A. K. (1992) Optimal Controlled Sampling: a Unifying Approach,
International Statistical Review, 60: 89-98.
Rosén B. (1997a) Asymptotic Theory for Order Sampling, Journal of Statistical Planning
and Inference, 62, 135-158.
Rosén B. (1997b) On Sampling with Probability Proportional to Size, Journal of
Statistical Planning and Inference, 62, 159-191.
Royall R., Herson J. (1973) Robust Estimation in Finite Population, Journal American
Statistical Society, 68: 880-889.
Särndal C. E., Swensson B., Wretman, J. (1992) Model Assisted Survey Sampling,
Springer-Verlag, New York.
Singh M. P., Gambino J., Mantel H. J. (1994) Issues and Strategies for Small Area Data,
Survey Methodology, 20: 3-22.
Sitter R. R., Skinner C. J. (1994) Multi-way Stratification by Linear Programming, Survey
Methodology, 20: 65-73.
Valliant R., Dorfman A. H., Royall R. M. (2000) Finite Population Sampling and
Inference: A Prediction Approach, Wiley, New York.
Download