Metodi di Campionamento per la stima di indicatori economici per sottopopolazioni di imprese VERSIONE PROVVISORIA Piero Demetrio Falorsi, Stefano Falorsi, Paolo Righi1 ABSTRACT In the present work balanced sampling and coordinated sampling, useful in order to obtain planned sample size for domains belonging to different partitions of the population, have been examined. In particular, these sampling methods may be applied to deal with small area problems in the phase of the sampling design. The main advantages of the two techniques are the computational feasibility which permits to easily implement an overall strategy considering jointly the design and estimation phase and improving the efficiency of the estimators. An empirical simulation on real population data and different domain estimators shows the empirical properties of the examined sample designs. Key words: Planning Sample Size of Small Domains, Controlled Selection, Sample Coordination, Balanced Sampling. 1. Introduction The small area problem is usually considered to be treated via estimation. However, if the domain indicator variables are available for each unit in the population there are opportunities to be exploited at the survey design stage. As noted by Singh et al. (1994), there is a need to develop an overall strategy that deals with small area problems, involving both planning sample design and estimation aspects. In this framework it is crucial to control the sample size for each domain of interest, so that each domain is treated as a planned domain, at design stage, for which it is possible to produce direct estimates with a prefixed level of precision. In general, with a design-based approach to the inference, the presence of sample units in each domain allows to compute domain estimates although not always reliable. Furthermore, in the model-based or modelassisted approach, the presence of sample units in each estimation domain permits to utilise models with specific small area effects, allowing more accurate estimates of the parameters of interest at small area level (Lehtonen, et al., 2003). For small domains it will often be needed to use indirect estimators (Rao, 2003), such as synthetic estimators, for producing reliable estimates; nevertheless, synthetic estimators are very sensitive to the model assumption on which they rely on providing a good description of the data structure. For a true model there will be no design bias but if the model does not fit well for the whole domain structure there will be some design bias, at least in a part of the 1 Italian National Institute of Statistics - ISTAT, Via Magenta 4, 00185 Roma, parighi@istat.it. data, inflating the mean square error of the estimates. Of course, there will never be a true model available in practice. In order to overcome this problem it is decisive to base the domain estimation on the sample units belonging to the small domain. The sampling design techniques considered in this paper allow controlling the sample sizes for domains of interest which are defined by different partitions of the reference population. Such techniques are useful when the overall sample size is relatively small and, therefore, in some of the partitions there are small domains. In fact, when the aim of the survey is to produce estimates for two or more partitions of the population, a standard solution to obtain planned sample sizes for the domains of interest is to use a stratified sample with the strata given by cross-classification of variables defining the different partitions. In the following, this design will be denoted as cross-classification design. In many practical situations, the cross-classification design is often unfeasible since it needs the selection of at least a number of sampling units as large as the product of the number of categories of the stratification variables. In order to overcome some problems of cross-classification designs, an easy strategy is to drop one or more stratifying variables or to group some of the categories. Nevertheless, some planned domains become unplanned and some of them can have small or null sample size. Many methods have been proposed in the literature to keep under control the sample size in all the categories of the stratifying variables without using cross-classification design. These methods, are generally referred as multi-way stratification techniques, and have been developed under two main approaches: – methods based on selection according to Latin Squares or Latin Lattices schemes (Bryant et al., 1960; Jessen, 1970); – methods based on controlled rounding problems via linear programming (Causey et al., 1985; Sitter and Skinner, 1994). The seminal paper of Bryant et al. (1960) suggests to allocate the units in the sample by means of a two-way Latin Square table randomly selected and two estimators of the parameter of interest are proposed. The method implies that expected sample counts in each stratum display independence between the rows and columns of the two-way table; in this way it can be used without any sort of limitations only when the stratifying variables are also independent in the population. Another drawback is that it is not possible to implement the procedure when there is no population in one or more crossclassification strata. In order to solve these problems, Jessen (1970) proposes two approaches, both fairly complicated to implement and not always leading to a solution (Causey et al., 1985). As far as concerns the methods based on linear programming, Causey et al. (1985) consider the controlled multi-way stratification as a rounding problem solved by means of transportation theory. The method may not have a solution in case of three or more stratification variables. Following the linear programming approach to controlled sampling designs proposed by Rao and Nigam (1990, 1992), Sitter and Skinner (1994) suggest a method based on linear programming more flexible to different situations than the method proposed by Causey et al. (1985) and some further computational simplification of Sitter and Skinner method have been suggested by Lu and Sitter (2002). Nevertheless, the main weakness of the linear programming approach is the computational complexity. As a consequence, the drawbacks of both approaches have limited the use of multi-way stratification techniques as a standard solution for planning the survey sampling designs. This paper illustrates as two well known sampling techniques may be useful for controlling the sample size in all the domains of interest without suffering from the disadvantages of the above mentioned methods. The first technique faces the problem by means of the selection of a balanced sample (Deville, Tillé, 2004) based on domain membership indicator variables. In this context the balancing equations assure that the prefixed marginal sample sizes are satisfied. The second technique deals with the problem by means of procedures developed to guarantee the sample coordination. The sampling units are selected from separate stratified sampling designs, each one defined by a single auxiliary variable. The samples are selected using permanent random number techniques (Ohlsson, 1995) assuring the maximum overlap of the samples. The paper is organised as follows. Section 2 introduces the essential notation. Section 3 is dedicated to the balanced sampling, while section 4 focuses on coordination techniques. Section 5 deals with the estimation topics, and section 6 shows the main results of a simulation study conducted on a real population of Italian enterprises. Finally some brief conclusions are underlined in section 7. 2. Notation and parameters of interest In order to simplify the notation, the paper will be restricted to the case of two stratifying variables. However, the generalisation to the multi-way case is straightforward. Let U define the population of interest of size N. Suppose that two auxiliary variables V1 and V2 are known for each unit k (k=1, …, N) of the population with, respectively, R and C modalities. In particular, the generic value or class of values assumed by V1 and V2 is denoted respectively with v1r and v 2c . Let U rc U r. U .c specify the subpopulation of size N rc 0 , being U r . and U .c the subpopulations of size N r . c N rc and N .c r N rc for which V1 v1r and V2 v2c . Let yk define the value of the target variable in the unit k ( k U ) . There are R C 1 parameters of interest defined by, Y yk , k U Yr. yk , (r 1,, R) , k U r. Y.c yk , (c 1,, C ) . k U .c In order to obtain the sample estimates of the parameters of interest, the standard cross-classification design selects a stratified sample s of size n from the population U with probability p(s), where the strata are obtained combining the categories of the variables V1 and V2 . Let s rc (r 1, , R, c 1, , C ) be the subsample of s of size nrc , where min b, N rc n rc N rc , selected from population U rc ; let b denote the minimum sample size fixed in advance, where b is equal to 1 for producing unbiased estimates of the parameters of interest or equal to 2 for being able to compute unbiased variance estimates. Furthermore, n nrc is the set of the strata frequencies, while n1 (n1. , , nr. ,, nR. )' and n 2 (n.1 , , n.c ,,n.C )' are the marginal distributions of n with respect to V1 and V2 , being nr. c nrc and n.c r nrc . We will call marginal stratification with respect to V1 the partition of U in U1. , , U r. ,,U R. , so C that nr. is the sample size of subsample s r . c 1 s rc of the generic marginal stratum r; while we will call marginal stratification with respect to V2 the partition of U in R U.1, , U.c ,,U.C , so that n.c is the sample size of subsample s.c r 1 s rc of the generic marginal strata c. 3. Methods based on balanced sampling Multi-way stratification designs can be treated by means of balanced sampling. The definition of a balanced sample depends on the assumed inferential framework. In the model based approach, a sample is defined as balanced on a set of auxiliary variables if there is the equality between the sample and the known population means of the auxiliary variables (Royall and Herson, 1973; Valliant et al., 2000). Following the design based approach considered in this paper, a sample is balanced when the Horvitz-Thompson estimates of the auxiliary variables totals are equal to their known population totals (Deville and Tillé, 2004). Balanced sampling, in this last case, represents a generalisation of probability sampling designs exploiting auxiliary information known for each unit of the population. Standard designs like simple design with fixed size, stratified sampling and unequal probability sampling represent some specific applications of balanced designs. The multi-way stratification designs belong to this class of designs too. 3.1. General introduction to balanced sampling For a formal description of balanced sampling, we introduce a general definition of sampling design. Sampling design is a probability distribution p(.), on the set S of all the p( s) 1 , where p(s) is the probability of subset s of the population U such that sS the sample s to be drawn. The set S may be represented by the random variable that takes the value λ (1 , ..., k , ..., N )' , with k 1 if k s and k 0 otherwise. Let π ( 1 ,..., k ,..., N ) be the vector of inclusion probabilities. We consider the sampling design p(.) with inclusion probabilities π which assigns a probability p(s) to each sample s such that E (λ ) sS p( s )λ π . Let x k ( x1k ,..., xhk ,..., x Hk )' be a vector of H auxiliary variables available for each population unit. The sampling design p(s) is said to be balanced with respect to the H auxiliary variables if and only if it satisfies the balancing equations given by ˆ X dir xk k U k k X xk , k U (3.1) for all s S such that p(s)>0, where dir X̂ is the Horvitz-Thompson estimator of the known population total X of the auxiliary variable. Two simple examples of balanced sampling designs are illustrated in the following. Example 1. The design of fixed sample size n is balanced on the auxiliary variable xk k x kk k 1 k n , kU ks kU for all s S. Example 2. The stratified design, where for each stratum U h of size N h a simple random sample of size nh is drawn, is balanced on H auxiliary variables such that the generic h-th variable assumes the values in the k-th unit 1 if k U h . xhk 0 if k U h The balancing equations become: x hk nh N k k nhh xhk N h , ks k 1 kU for all s S and for h=1, …, H. One of the main problems of balanced sampling has always been implementing a general procedure which gives a multivariate balanced random sample (see Valliant et al., 2000). Recently, Deville and Tillé (2004) proposed the cube method that allows the selection of balanced (or approximately balanced) samples for a large set of auxiliary variables and with respect to different vectors of inclusion probabilities. The method is based on at most two phases. In the first phase, called flight phase, the cube method searches for a sampling design satisfying equations (3.1). The sample selection is performed by means of rounding off randomly to 1 or 0 almost all the elements of the vector π . At the end of the flight phase the elements of π equal to 1 or 0 indicate respectively that the corresponding unit is included or not in the sample. However, the balancing equations could be exactly satisfied with some fractional element of π (the number of these elements are at most equal to the number of balancing variables), and therefore the flight phase does not determine whether the corresponding units have to be drawn or not in the sample. In these cases the cube method searches for an approximately balanced sample solution applying the landing phase; this solution relaxes some constraints, minimising the variance of dir X̂ . The landing phase is set up as a rounding problem and it is solved via linear programming techniques. A simple example where the landing phase occurs it is the design of fixed sample size when the sum of inclusion probabilities is not an integer. In example 1, the balancing equation is never satisfied with the flight phase if known total n is not an integer. As described below, in the multi-way stratification design, the cube method is able to select a balanced sample only via flight phase (Deville and Tillé, 2000). In Chauvet and Tillé (2006) is presented a faster algorithm for executing the flight phase implemented by means of a SAS-IML macro. The procedure is computationally feasible for large data sets with a lot of balancing variables as it has been tested by the authors. 3.2. Balanced sampling applied for marginal stratification Let us suppose that a vector of inclusion probabilities π , consistent with the marginal distribution n1 and n 2 , is available, that is nr. kU k n.c kU k ( r 1,...R) r. ( c 1,...C ) . (3.2) .c If π is not available, the well known Iterative Proportional Fitting (IPF, see Bishop et al., 1975) or the Generalised Iterative Proportional Fitting (GIPF, Dykstra, 1985; Dykstra and Wollan, 1987) procedure can be used to define it. Starting from an initial vector * π (* 1 ,...,* k ,...,* N )' of inclusion probabilities such that kU * k n , IPF or GIPF adjust the initial probabilities computing the final vector π satisfying the marginal sample sizes constraints (3.2). The main useful difference between IPF and GIPF is that the second procedure avoids to define inclusion probabilities greater than 1. Therefore, it could be better to adopt GIPF to satisfy equations (3.2). However, GIPF and IPF produce the same results when the latter procedure does not yield a solution with elements of π greater than 1. A third more empirical procedure to calculate π , satisfying the constraints (3.2), is to use IPF procedure iteratively in the following way: starting from initial vector of inclusion probabilities * π , at the end of the procedure the k ’s greater than 1 are set equal to 1. Then the IPF procedure is reiterated with the same starting inclusion probabilities, removing the units with inclusion probabilities equal to 1 and updating the marginal constraints so that, if the generic unit k belongs to the subpopulation U rc , the new marginal sample sizes became nr . 1 and n.c 1 . The selection by means of a balanced design, taking into account planned small sample size for two marginal distributions, is related to the choice of the auxiliary variables. Then we use the vector of auxiliary variable x k x11k ,..., x1rk ,..., x1Rk , x12k ...., x2ck ,...., x2Ck where k if k U r. x1rk 0 if k U r. k if k U .c . x2ck 0 if k U .c The flight phase of cube method selects a random sample which satisfies the following constraints: ˆ X dir X r x1k k 1 x1rk k nr. , (r=1,…,R ; c=1,…,C ) (3.3) ksr . kU kU r . kU k x 2ck c k 1 x 2k k n.c ks. c kU kU. c kU k Equations (3.3) imply that the sample sizes of each subsample s r . and s.c must be equal to the fixed marginal frequencies nr. and n.c respectively. Deville and Tillé (2000), show that equations (3.3) can be exactly satisfied by means of the cube method via the flight phase. Finally we underline some interesting aspects of balanced sampling. - The cube method is quite general and it can be easily applied to more than two partitions in the domains of interest. - Additional balancing equations may be defined involving other variables correlated with study ones so as to improve the precision of estimators of totals. - As far as concerns the estimation phase, the balanced sampling can be used jointly with an estimator exploiting auxiliary information. This sampling strategy allows to bound the variability of the sampling weights (for instance calibrated weights) that causes problems, especially in small domains of estimate. For instance if known population counts are used in the calibration phase at domain level, it should be useful to utilize the same auxiliary information as constraints in the design phase, in addition to the auxiliary variables which allow to respect the marginal sample sizes. 4. Methods based on coordinated sampling The problem under study may be considered as a particular case of sample coordination. Generally, coordinated sampling is adopted in order to guarantee an expected predefined overlap rate between two or more surveys. Minimum and maximum overlap is indicated respectively as negative and positive coordination. In our case, the sampling units are selected from two separate sampling designs on the same target population using respectively V1 and V2 as stratification variables. The two samples are maximum positively coordinated and the resulting final sample is composed by the union of the two samples. In this context the techniques can be classified into: - methods based on the Permanent Random Numbers (PRN); - methods implementing linear programming algorithm (Ernst and Paben, 2002). Within the first techniques (Ohlsson, 1995), a random number (PRN), drawn independently from the uniform distribution on the interval [0,1], is assigned to each unit in the frame list. The PRN techniques allow to implement different sampling designs such as: simple random sampling without replacement, Bernoulli, Poisson and other probability proportional to size sampling procedures. Then, using the assigned PRNs, two maximum positively coordinated samples are selected from the two sampling designs. The second type of methods, that will not be discussed in this paper, are procedures maximizing the overlap among samples and are based on sequential application of transportation problems. The algorithms seem to be rather complex compared with permanent random numbers based procedures. 4.1. General introduction to coordinated sampling using permanent random numbers In the PRN context, the sampling units are selected from two separate sampling designs, say p (1) () and p ( 2) () , on the same target population using V1 and V2 as stratification variables. Let s (i ) and S (i ) denote, respectively, the generic sample and the set of all the subset s (i ) generated by the design p (i ) () (i=1,2). The inclusion probabilities of unit k in the marginal design p (i ) () are indicated with (i ) k s ( i )S( i ) p(i ) ( s (i ) ) (i ) k being (i) k a dummy variable that equals to 1 if k s(i ) , and equals to 0 otherwise. Two positively coordinated samples, say s (1) and s ( 2 ) , are selected respectively according to p (1) () and p ( 2) () designs and the resulting final sample s is composed by the union of the two samples. Let us denote with zk the PRN assigned to the unit k (k=1, …, N) and n1 (n1. , , nr. ,, nR. )' for each above considered design let and n 2 (n.1 , , n.c ,,n.C )' indicate the sample sizes, being kU r. (1) k nr. and kU .c r 1 nr. c 1 n.c n , R C ( 2) k n.c . The final sample s, being s s (1) s ( 2) , is selected with the joint design p(s) and the corresponding inclusion probabilities are given by k Pr k s sS p( s ) k , with k =1 if either (1) k or ( 2) k equals to 1. With these techniques, the realised sizes of the final sample are random outcomes with expected values larger than the constraints n, n1 and n 2 . Empirical results on Italian business surveys has shown that the overall realised sample size is generally larger than the prefixed n between 10% and 30%. In order to partially overcome this problem, in section 4.3 is shown an algorithm which, starting from initial * (1) k and * ( 2) k probabilities, calculates modified inclusion probabilities, (1) k and ( 2) k . These guarantee that the expected marginal sample sizes in the joint sample design are roughly equal to the constrains n1 (n1. , , nr. ,, nR. )' and n2 (n.1 , , n.c , ,n.C )' . In the following section we will illustrate how coordination may be realised with some important designs. 4.2. Coordination with some sampling designs Design I: simple random sampling without replacement in each marginal stratum With this design we have (1) k nr . N r . for k U r . (r=1,…,R) and ( 2) k n.c N .c for k U .c (c=1,…,C). The selection schema is very simple: for each design, the frame units are ordered by stratum and, in each stratum, by ascending order of zk . In the stratum U r . of the design p (1) () , the sample is composed of the first nr . units in the ordered list. In stratum U .c of the design p ( 2) () , the sample is composed of the first n.c units in the ordered list. Ohlsson (1992) presents a formal proof that this technique realises a sample in which the inclusion probabilities in s (1) and s ( 2 ) samples agree with the prefixed values, being Pr (k s(1) ) (1) k nr . N r . for k U r. and Pr (k s( 2) ) ( 2) k n.c N .c for k U .c . The unit k is included in the overall sample s, if it is included in s(1) with probability (1) k or if it is included in s ( 2 ) with probability ( 2) k . There is no formal proof of the values assumed by the inclusion probabilities k . An intuitive reasoning suggests that k max (1) k , ( 2) k . Empirical simulations of coordinated sample selection with design I has shown that the k values are only slightly larger than max (1) k , ( 2) k . Design II: Poisson sampling With this design we have (1)k nr. g k gk for k U r . (r=1,…,R) (4.1) g k for k U.c (c=1,…,C), (4.2) kU r . and (2)k n.c g k kU.c where g k is a non-negative measure of size for the unit k. The unit k is included in the sample s (i ) (i=1 or 2) if z k (i ) k . The inclusion probabilities of the final sample are, thus, k max (1) k , ( 2) k . Design III: probability proportional to size without replacement This design may be only approximated with PRN techniques. The two approximated solutions proposed are order sampling and collocated sampling. The joint design of both techniques may be summarised as follows: (i) the expected sample sizes are larger than the constraints n, n1 and n 2 ; (ii) there is no formal proof of the values assumed by the inclusion probabilities k , but in practical situations they may be well approximated by k max (1) k , ( 2) k . The order sampling (Rosen 1997a, 1997b) represents an interesting way to avoid the perceived drawback given by random sample size of Poisson sampling and still comes very close to achieve the desired inclusion probabilities (4.1) and (4.2). A particular case of order sampling is Pareto sampling, carried out as follows: (i) for every k U , compute (i ) k z k (1 (i ) k ) (1 z k ) (i ) k (i=1,2); (ii) for the sample design p (i ) () (i=1,2), the frame units are ordered by stratum and in each stratum by ascending order of (i) k . In stratum U r. of the design p (1) () , the sample s (1) is composed of the first nr . units in the ordered list and in stratum U .c of the design p ( 2) () , the sample s ( 2 ) is composed of the first n.c units in the ordered list. Another case of order sampling is the sequential Poisson sampling. The sampling selection is carried out with steps, (i) and (ii) above illustrated, but the (i) k values are defined by (i ) k k z k / N g k . Both the above order sampling schemas realise the marginal fixed sample sizes n1 and n 2 for a given marginal design s (i ) (i=1, or 2); but the inclusion probabilities for p (1) () and p ( 2) () are slightly different from the planned probabilities (1) k and ( 2) k expressed respectively by (4.1) and (4.2). With collocated sampling, the frame units are randomly ordered by stratum and in each stratum by ascending order of zk , giving unit k ( k U rc ) the ranks Lr ., k and L.c ,k in the marginal strata U r . and U .c of p (1) () and p ( 2) () designs. Then the new variables r(1) k Lr .,k N r. , r( 2) k L.c,k N .c are computed where is a random number drawn independently from the uniform distribution on the interval [0, 1]. The unit k is included in the sample s (i ) (i=1 or 2) if r(i ) k (i ) k . For each marginal design the collocated sampling is strictly pps with inclusion probabilities, expressed respectively by (4.1) and (4.2) , for p (1) () and p ( 2) () designs; but the realised sample has a variability lower than the one assured by the design II. As a concluding remark we note that the methods based on PRN techniques are very easy to implement and for some designs unbiased estimators of the variance may be computed, but designs based on probability proportional to size without replacement may be only approximated, in the sense that in order sampling the theoretical inclusion probabilities (i) k (i=1 or 2) may be only approximated and the collocated sampling does not assure to attain the fixed sample sizes n1 and n 2 . 4.3. An algorithm for controlling the expected sample sizes Let * π (1) (* (1)1 ,...,* (1) k ,...,* (1) N )' and * π ( 2) (* ( 2)1 ,...,* ( 2) k ,...,* ( 2) N )' denote the initial vectors of inclusion probabilities of designs p (1) () and p ( 2) () , being kU r. * (1) k nr. and kU .c * ( 2) k n.c . Starting from initial * (1) k and * ( 2) k probabilities, the algorithm calculates modified inclusion probabilities, (1) k and ( 2) k . These may be used as inclusion probabilities in the marginal designs and they guarantee that the expected marginal sample sizes of the joint sample design are equal to constraints n1 (n1. , , nr. ,, nR. )' and n2 (n.1 , , n.c , ,n.C )' . The algorithm makes use of the condition k max (1) k , ( 2) k that is true for Poisson sampling and only approximated by other designs. The algorithm is articulated in the following steps. Step 0. Initialization. Denote with t the iteration index (t=0,1,…). Let us pose at t=0: 0 (i ) k * (1) k (i=1 or 2). Step 1. Calculus of updated modified probabilities. The updated modified inclusion probabilities t (i ) k (i=1 or 2), at iteration t (t=1,2,…), are obtained by solving the following minimum constraints problem 2 tMin kU G ( t (i ) k ; t 1 (i ) k ) (i ) k i 1 t t t t kU x k (1) k (1 x k ) ( 2) k n t x t (1 t x k ) t ( 2) k nr . ; r 1,..., R 1 kU r . k (1) k t x k t (1) k (1 t x k ) t ( 2) k n.c ; c 1,..., C 1 k U .c in which: G ( t (i ) k ; t 1 (i ) k ) t (i ) k ln( t (i ) k / (4.3) t 1 (i ) k ) t (i ) k t 1 (i ) k (i=1, 2) is the logarithmic distance function between t 1 (i ) k and t (i ) k ; t x k is an indicator variable that equals 1 if t 1 (1)k t 1 (2)k and equals 0 otherwise. Step 2. Iteration. With updated values values t 1 t (i ) k (i=1 or 2), let us calculate the updated xk and then compute t t nr. kU n.c kU t 1 r. .c t 1 xk t (1)k (1 t 1 xk t (1)k (1 t 1 xk ) t (2)k (r=1,…, R) xk ) t (2)k (c=1,…, C). For a fixed close to zero small positive quantity, , if the following condition holds C R t | nr. nr. | | t n.c n.c | r 1 c 1 (4.4) then the algorithm stops and the unit k is selected with the modified probabilities (1) k t (1) k and ( 2) k t ( 2) k , defined as solution of (4.3) at iteration t; otherwise step 1 is iterated at t=t+1 until condition (4.4) is respected. Note that (4.3) represents a calibration process, that may be solved by the well known IPF (Bishop et al., 1975) or GIPF (Dykstra, 1985; Dykstra and Wollan, 1987) procedures. The logarithmic distance function avoids to define t (i ) k lower than 0, while GIPF prevents to obtain t (i ) k values larger than 1. A third more empirical procedure to obtain (i) k values simultaneously satisfying the (4.3) and lower than 1 is the following: solve (4.3) with IPF procedure and set t (i ) k t 1 (i ) k if t (i ) k 1 . 5. Estimation of domain totals In this section some considerations on the estimation are presented. In order to simplify the notation of the estimators, we denote the populations U, U r . and U .c generically as U d (d=1,…,D=R+C+1). Furthermore with analogous symbolism, we indicate with s d and Yd , respectively, the sample and the total of interest of the population U d . The direct estimator of the parameter of interest is given by ˆ dir Yd y k ak (5.1) ksd where a k are the base weights. If the probabilities k are known (as in the case of balanced sampling and Poisson sampling), the base weight is a k 1 / k . Otherwise, if the k values are unknown (as in the case of Pareto sampling and sequential Poisson sampling), we may approximate each of them as k max (1) k , ( 2) k . Another solution may be to set the base weights a k equal to 1 1 2 (1) k 1 1 1 ak 2 (1) k ( 2) k 1 1 2 ( 2) k if k s(12 ) if k s(12) s(1) s( 2) (5.2) if k s( 1 2) where s(12 ) and s ( 1 2) denote respectively the set of units included only in sample s (1) and the set of units included only in sample s ( 2 ) . Adopting the weights defined in (5.2), the estimator (5.1) is unbiased as below illustrated: Ep dir Yˆd 12 E p ks yk (12 ) (1) k dk ks (12) ks (12) yk dk yk dk ks (1) k ( 2) k ( 1 2) yk ( 2) k dk 1 yk yk 1 dk E p ks dk Yd Yd Yd , E p ks 2 (1) ( 2) 2 (1) k ( 2) k being dk 1 if unit k belongs to U d and dk = 0 otherwise and E p denotes the expectation over repeated sampling. Domain estimates could be improved using auxiliary information and a superpopulation model Em ( y k ) f (x k ; β) that, for each unit k, links the model expected value, E ( y ) , to the auxiliary vector x , where f(.; β ) is a functional form depending by the m k k vector β of unknown parameters. After obtaining the sample estimates β̂ of β , based on y , x ; k s, the predicted value yˆ f (x ; βˆ ) for each unit k U are k k k k computed, assuming that x k is known for each unit k in the population. Two different estimator types, the Synthetic and the Generalised Regression, are considered. Following Lehtonen et al. (2003), the above estimators are expressed by: ˆ yˆ k U synYd (5.3) d ˆ yˆ a ( y yˆ ) . k k k U s k gregYd d (5.4) d An another estimator, often used in small area context, is the composite estimator expressed by compYˆd U yˆ k γˆd s ak ( y k yˆ k ) , where γ̂ d is a domain-specific d d weight appropriately constructed to have certain optimality properties; it approaches unit for increasingly large domain sample size, and it tends to zero for decreasingly sample size. The predictions ŷ k ; k U differ from one model specification to another, depending on the functional form and from the choice of the auxiliary variables. As illustrated in Särndal et al., (1992, p. 399), adopting a linear model, Em ( y k ) x' k β , if each prediction is obtained by means of a submodel defined on the population U d , the estimators (5.3) and (5.4) agree. Moreover, linear model allows to define the above estimators knowing only the domain totals of the auxiliary information and the x k values for the sampling units. However, knowing x k values for every k U it is possible to construct an estimators with more efficient predictions ŷ k obtained by generalised linear models (Lehtonen and Veijanen, 1998) or non parametric regression techniques (Montanari and Ranalli, 2003). Finally, let us point out that the synthetic estimator relies on the truth of the model and it is usually biased; nevertheless, as noted in Lehtonen et al. (2003), if the model incorporates a specific domain effect (fixed or random) the bias is strongly reduced and with the incorporation of domain-specific terms in the model it is good to control the sample size at domain level. 6. Monte Carlo experiment to evaluate the sampling strategies 6.1. Background information The simulation is carried out on real enterprises data. The analysis has been focused to the 1999 population of the enterprises from 1 to 99 employers belonging to the Computer and related economic activities (2-digits of the NACE rev.1 classification). In order to simplify the empirical analysis some units with outlier values have been deleted. At the end of the cleaning procedure, the data base used for the simulation study has N=10,392 enterprises. The value added and labour cost are the variables of interest chosen in the simulation. The variable values are available for each unit in the population by an administrative data source. According to the EU Council Regulation n°58/97 on Structural Business Statistics the estimation domains are defined as different partition subsets of the population. In particular, we consider two partitions: (1) geographical region with 20 marginal domains, hereafter referred as DOM1; (2) Economic activity group (3-digits of the NACE rev.1 classification with 6 different groups) by Size class (defined in terms of number of persons employed: 1=1-4; 2=5-9; 3=10-19; 4=20-99) with 24 marginal domains, hereafter referred as DOM2. Therefore, the overall number of marginal domains is 44, while the number of the cross-classification strata is 480 but only 360 strata have one or more population units. The experiment is based on an overall sample size of 360 units, planning the sample size of each domain of DOM1 and DOM2 such that the coefficient of variation of the domain estimates for the variable number of employers (supposed known at sampling stage) is boundend to be lower than 34.5% for the DOM1 domains and to 8.7% for the DOM2 domains. This sample allocation represents a compromise between the Allocation Proportional to Population (APP) size and an allocation uniform for each domain of each partition. In the following we refer to the domains with the planned sample size greater than the APP sample size as oversized domains. These domains need to have a sample size larger than the APP sample size to bound the sampling errors; roughly speaking these domains may be classified as small domains. The analysis of the empirical results will be focused on these domains. Table 1 reports the planned sample sizes and the APP sample sizes for domains of the DOM1 and DOM2 partitions. The small domains are reported in grey cells. Table 1. Population and sample sizes for the domains of interest DOM1 Population Geographical size Region Piemonte DOM2 APP Population sample 3-digit NACE Rev. 1; size size size class Planned sample size APP sample size Planned sample size 912 18 32 721;1 57 21 2 21 7 1 721;2 29 6 1 2,903 19 101 721;3 21 3 1 179 17 6 721;4 21 10 1 1,064 17 37 722;1 1,916 34 66 Friuli V. Giulia 246 18 9 722;2 874 7 30 Liguria 280 11 10 722;3 604 6 21 Emilia R. 898 19 31 722;4 521 30 18 Toscana 693 19 24 723;1 2,970 32 103 Umbria 110 20 4 723;2 1,006 6 35 Marche 203 21 7 723;3 498 6 17 1,239 20 43 723;4 283 28 10 114 19 4 724;1 46 21 2 34 17 1 724;2 14 5 0 Campania 445 15 15 724;3 11 4 0 Puglia 372 17 13 724;4 13 10 0 47 21 2 725;1 246 27 9 Calabria 103 21 4 725;2 98 6 3 Sicilia 371 23 13 725;3 61 5 2 Sardegna 158 21 5 725;4 30 14 1 10,392 360 360 726;1 643 37 22 726;2 204 7 7 726;3 138 7 5 726;4 88 28 3 Total 10,392 360 360 Valle d'Aosta Lombardia Trentino A. A. Veneto Lazio Abruzzo Molise Basilicata Total A grey cell denotes a small domain In the simulation we have considered seven sampling designs as reported in table 2. Five hundred Monte Carlo samples have been selected for each sampling design. Table 2. Sampling Design used in the simulation study Sampling Design Abbreviation SRSWOR* Stratified on DOM1 with in each stratum Stratified on DOM2 with SRSWOR* in each stratum Balanced sampling on the marginal sample sizes and on population sizes Balanced sampling on the marginal sample sizes Coordinated Pareto sampling Coordinated Poisson sampling Coordinated Sequential Poisson sampling STDOM1 STDOM2 BALPOP BAL CPAR CPOI CSPOI *SRSWOR: Simple Random Sampling Without Replacement In the experiment STDOM1 and STDOM2 have planned sampling sizes for a single partition respectively DOM1 and DOM2. BALPOP is based on (3.3) adding the balancing equations ks d dk k N d (d=1, …, 44), while in BAL only the equations (3.3) are considered. Starting from initial probabilities * k n N , the inclusion probabilities for BALPOP and BAL have been attained applying the IPF procedure iteratively (four times) as described in section 3.2. In the coordinated designs CPAR, CPOI and CSPOI, the marginal sample sizes are satisfied only as expectation over repeated sampling. Starting from initial probabilities * (1) k nr . N r . for k U r. and * ( 2) k n.c N .c for k U .c , the final inclusion probabilities (1) k and ( 2) k have been obtained by means of the algorithm described in section 4.2; the final solution has been found with eight iterations. For each sample the estimates of the domain totals have been computed by the direct, generalised regression and synthetic estimators. In particular, for the three coordinated sampling we have examined two direct estimators. The first one adopting as direct weight the value a k max( (1) k , ( 2) k ) 1 , while the ak weights in the second direct estimator are defined by the expression (5.2). We show the results only for the first direct estimator because the simulation study has stressed that the second estimator has much higher variability than the first estimator. As far as concern the estimators using auxiliary information, two simple homoschedastic linear models have been implemented: the model (1) uses 10 auxiliary variables, six of them are the economic activity group membership indicators, and the remaining four are the size class membership indicators; the model (2) uses the 44 domain membership indicator variables. The linear model (1) is expressed by Em ( y k ) a b for k U a U b , where U a is the population of enterprises of a-th (a=1, …, 6) economic activity group and U b is the population of enterprises of b-th (b=1, …, 4) size class of the number of employers and a and b are the fixed effects of the a-th economic activity group and the b-th size class. The linear model (2) is Em ( y k ) r c for k U r. U .c , where the subscripts r (r=1,…,20) denotes the generic domain belonging to the DOM1 partition and c (c=1,…,24) denotes the generic domain belonging to the DOM2 partition and r and c are the separate domain-specific effects. We point out that the main aim of the experiment is to compare different sampling designs using the same estimator. In this context the choice of the best model does not represent a central issue. Hence, we have considered two quite general feasible models that can be implemented in all situations of planned domains. The model (1) is somewhat more reliable since the estimates of the regression parameters are based on large sample sizes; while in model (2) it is possible to evaluate the effect of planning the domain sample sizes, although the estimates of each regression parameter are based on small sample sizes. Obviously, more flexible model formulations could be possible as described for instance in Lehtonen et al. (2005). As pointed out in section 5, using the model (2) the synthetic and the generalised regression estimators give identical results. In the following each sampling strategy is indicated in short by the couple (dis, est), where dis indicates one of the 7 sample designs referred in table 2 and est assumes the categories dir, syn, and greg respectively for the estimators (5.1), (5.3) and (5.4). We have computed two quality measures: the average Absolute Relative Bias (ARB) and the average Relative Mean Square Error (RMSE ) expressed by ARB F (dis , est ) 1 1 500 card ( F ) dF 500 i 1 RMSE F (dis , est ) ˆi estYd ( dis ) Yd 1 500 1 card ( F ) dF 500 i 1 Yd 100 , 2 ˆi estYd ( dis ) Yd Yd2 100 denoting with: F a specific subset of the marginal domains; card(F) the cardinality of F; Yˆ i (dis ) the i-th Monte Carlo sample estimate (i=1,…, 500) of the total Y in the est d d strategy (dis, est). In particular, F represents alternatively the subset of small domains of DOM1, DOM2 or the overall set of small domains (of both DOM1 and DOM2). 6.2. Simulation results The Monte Carlo simulation study highlights that the multi-way stratification techniques proposed in this paper are able to take bias and variability under control with respect to two benchmark strategies ,collapsing one of the two stratification variables. The main results of the experiment referred to the small domains set are showed in table 3. The table is organised in four blocks: the first one illustrates the quality measures of the direct estimator; the second and third block are dedicated respectively to the syn and greg estimators based on 10 auxiliary variables (model (1)); the forth block presents the results of syn or greg estimators based on the 44 domain membership indicator variables. We restrict the comments only on the value added variable but similar consideration could be expressed for the labour cost variable. In general, the comments are referred to the overall set of small domains. Table 3. Comparison of small domains sampling strategies: Average Absolute Relative Bias ( ARB ) and Relative Mean Square Error ( RMSE ) Value Added Sampling Design DOM1 ARB RMSE DOM2 ARB RMSE Labour Cost Overall ARB RMSE DOM1 ARB RMSE DOM2 ARB RMSE Overall ARB RMSE Direct estimator (block 1) STDOM1 1.79 8.18 148.28 5.41 102.74 1.72 6.86 155.87 4.63 106.88 STDOM2 3.42 107.49 0.47 15.26 1.75 55.23 3.32 105.66 0.46 12.66 1.70 52.96 BALPOP 0.77 24.86 1.29 38.49 1.06 32.58 0.74 23.60 1.20 34.26 1.00 29.64 BAL 0.84 25.43 1.45 40.61 1.19 34.03 0.79 24.22 1.57 35.80 1.23 30.78 CPAR 1.35 32.52 2.81 53.85 2.18 44.60 1.44 31.68 2.62 51.44 2.11 42.88 CPOI 1.50 38.64 3.01 63.40 2.36 52.67 1.42 37.64 3.11 61.92 2.38 51.40 CSPOI 1.61 32.90 2.73 56.28 2.24 46.15 1.68 31.76 2.99 54.28 2.42 44.52 43.19 42.82 Synthetic estimator with 10 auxiliary variables (block 2) STDOM1 14.22 18.88 13.81 100.55 13.99 65.16 12.29 18.40 9.25 95.03 10.57 61.83 STDOM2 24.82 33.96 14.48 15.96 20.34 26.16 13.13 14.79 12.46 23.11 12.75 19.51 BALPOP 13.68 17.51 24.98 43.98 20.09 32.51 11.89 15.60 12.35 33.08 12.15 25.50 BAL 14.92 18.46 21.82 41.66 18.83 31.61 13.37 16.91 10.41 32.64 11.69 25.82 CPAR 13.68 17.83 23.45 44.63 19.22 33.02 11.82 16.13 11.69 34.93 11.75 26.78 CPOI 13.44 17.55 23.17 47.18 18.95 34.34 11.36 15.73 12.04 37.55 11.75 28.10 13.78 17.75 25.11 45.07 20.20 33.23 12.21 16.52 13.06 36.17 12.69 27.66 11.79 119.23 CSPOI Generalised regression estimator with 10 auxiliary variables (block 3) STDOM1 2.35 30.13 7.40 81.03 1.86 29.28 7.49 80.25 STDOM2 3.98 58.62 0.95 15.26 2.26 34.05 2.90 52.66 0.93 12.66 1.78 29.99 BALPOP 1.11 19.41 2.20 25.80 1.73 23.03 1.01 16.42 1.99 21.73 1.57 19.43 BAL 1.63 19.41 1.76 26.11 1.70 23.21 1.21 16.72 2.08 21.96 1.70 19.69 CPAR 1.04 21.27 1.63 29.30 1.37 25.82 1.03 18.27 1.11 24.60 1.08 21.86 CPOI 1.67 21.99 1.92 32.53 1.81 27.96 1.83 18.89 2.22 27.47 2.05 23.76 CSPOI 1.32 21.68 1.17 29.31 1.23 26.00 1.30 18.11 1.07 25.39 1.17 22.23 11.26 119.95 Synthetic or Gen. regression estimator with 44 auxiliary variables (block 4) STDOM1 STDOM2 3.39 31.30 27.48 63.22 17.04 49.39 17.24 102.24 2.76 30.80 28.67 63.05 17.44 49.08 23.00 102.64 1.37 20.65 8.25 56.00 1.42 19.10 10.77 55.30 BALPOP 1.07 20.71 1.97 26.98 1.58 24.26 1.08 17.62 1.93 24.07 1.56 21.27 BAL 1.47 20.36 2.13 28.46 1.84 24.95 1.41 17.66 2.02 25.10 1.75 21.88 CPAR 1.79 23.38 2.22 32.39 2.03 28.48 1.65 20.73 2.08 30.39 1.90 26.21 CPOI 3.94 30.21 3.06 33.60 3.44 32.13 3.35 25.28 3.13 31.68 3.23 28.91 CSPOI 1.71 24.29 2.52 31.84 2.17 28.57 2.13 21.28 2.48 30.16 2.32 26.31 Examining firstly direct estimates, we observe the following. - The two benchmark designs (STDOM1 and STDOM2) have an RMSE value for the unplanned domains equal to 148.28% and 107.49% respectively. These values cause the large RMSE values computed for the overall set of small domains and respectively equal to 102.74% and 55.23%. - The STDOM2 shows better results than those attained by STDOM1. This finding is explained by the fact that the STDOM2 stratification criterion is correlated with the variables of interest and takes under control a larger number of small domain than the STDOM1 stratification. - As far as concerns the overall set of small domains, the BALPOP is the more efficient design, both in terms of ARB (1.06%) and RMSE (32.58%), even if BAL is only slightly worse. - The strategies adopting the coordinated sampling show worse values with respect to balanced sampling but they perform better in terms of RMSE than benchmark strategies. In particular, CPAR design has the smallest RMSE (44,60%) followed by CSPOI (46,15%) and CPOI (52,67%). Nevertheless, the ARB values are quite high and larger than that attained by STDOM2 design. These findings depend on DOM2 domains where the ARB values are greater than 2%. Considering the synthetic estimator based on 10 auxiliary variables, some issues may be pointed out. - All designs are characterized by a large bias. The STDOM1 has an ARB equal to 13.99% (although it has an unacceptable RMSE that is equal to 65,16%). The rest of the designs have the ARB values higher than 18%. This evidence gives a warning against the use of synthetic estimator. - The STDOM2 design has the lowest RMSE (26,16%), because of a strong reduction of the DOM1 variance. However, the ARB value (20.34%) is the largest of all designs. - The behaviour of balanced and coordinated designs in terms of bias and variance are more or less equal. The BAL has the lowest ARB (18.33%) and RMSE (31.61%) values. The experimental results of the greg estimator in third block of the table 3 suggest some considerations. - All the designs show strong improvements of the quality measures. In general, the ARB measure has a remarkable reduction with respect to the same indicator computed on the synthetic estimator. Only the STDOM1 still presents a high ARB value (7.40%). - In the STDOM2, the reduction of the bias is more than compensated from the increase of the variability. This produces an RMSE equal to 34.05%. - Both the balanced and the coordinated designs have good performances, though the balanced designs are slightly better being the RMSE roughly equal to the 23%. - The CPAR design is the best coordinated sampling in terms of RMSE (25.82%) with a small ARB value (1.37%). The CSPOI design has the lowest ARB value (1.26%) and bounded RMSE value (26.00%). Therefore, these results highlight the equivalence of CPAR and CSPOI designs. - Finally, the syn or greg estimator based on 44 auxiliary variables show analogous results to the greg estimator based on 10 auxiliary variables. The balanced designs are the best with slight preference for the BALPOP sampling. The CPAR and CSPOI designs are enough close to the balanced designs even if the bias increases. As general findings, the balanced designs seem to guarantee a best strategies to take under control bias and variance of the overall set of the small domains. Like for balanced designs, the coordinated sampling have good performances with respect to the benchmark designs. In particular, CPAR and CSPOI are preferred to CPOI and could be a valid alternative to balanced sampling when using the syn and greg estimators. The conclusion is that for all blocks, BALPOP shows best overall performance with respect to bias and accuracy and in BALPOP, the strategy with greg estimation with the ten auxiliary variables (block 3) is a safe choice for both value added and labour cost. Furthermore, the results show that the synthetic estimator of block 2 must be considered carefully because the bias can be unexpectedly large and the squared bias would be the dominating part in the RMSE . 7. Conclusions In the present work balanced sampling and coordinated sampling, useful in order to obtain planned sample size for domains belonging to different population partitions, have been examined. In particular, these sampling methods may be applied to deal with small area problems, in planning the sampling design so as to have planned size for each domain. These methods are useful for defining an overall strategy considering jointly the design and estimation phase. That causes some advantages in the different approaches to the finite population inference: in the design-based framework it is possible to compute direct domain estimators with prefixed level of precision; while in the model assisted or model-based framework, the presence of sample units in each estimation domains permits to utilise models with specific small area effects allowing a more accurate predictions of the parameters of interest at small area level. A straightforward method to obtain planned domains is to stratify combining all the variables, which define the domains of interest. This method has some problems, first of all it could produce a large overall sample size depending on the number of the strata. To solve this problem some techniques, generally referred as multi-way stratification designs, have been proposed. Nevertheless, the main weakness of these techniques is the computational complexity. This paper studies in detail the use of the two sampling techniques, the balanced and coordinated sampling designs, to control the sample size in all the domains of interest without suffering from the computational complexity. The experimental results conducted on real business population data, show the optimal behaviour of balanced sampling with different direct and indirect estimators. The Pareto sampling seems to be the best design among the coordinated ones; when it is used jointly with estimators involving auxiliary information, it is slightly worse than the estimates computed by means of balanced designs. Furthermore, the strategies based on Pareto sampling may be useful if the sample survey needs to be coordinated over time. This may be an effective sampling strategy for many business surveys. Finally, we point out that the paper has been focused on the feasibility of the sampling selection techniques. The topics connected to the estimation of variance have been neglected and will be treated in depth in future researches. REFERENCES Bishop Y., Fienberg S., Holland P. (1975) Discrete Multivariate Analysis, MIT Press, Cambridge, MA. Bryant E. C., Hartley H. O., Jessen R. J. (1960) Design and Estimation in Two-Way Stratification, Journal of the American Statistical Association, 55: 105-124. Causey B. D., Cox L. H., Ernst L. R. (1985) Applications Transportation Theory to Statistical Problem, Journal American Statistical Society, 80: 903-909. Chauvet G., Tillé Y. (2006) A Fast Algorithm of Balanced Sampling, to appear in Journal of Computational Statistics. Deville J.-C., Tillé, Y. (2000) Selection a several unequal probability samples from the same population, Journal of Statistical Planning and Inference, 86: 89-101. Deville J.-C., Tillé Y. (2004) Efficient Balanced Sampling: the Cube Method, Biometrika, 91: 893-912. Dykstra R. (1985) An Iterative Procedure for Obtaining I-Projections onto the Intersection of Convex Sets, Annals of Probability, 13, 975-984. Dykstra R., Wollan P. (1987) Finding I-Projections Subject to a Finite Set of Linear Inequality Constraints, Applied Statistics, 36, 377-383. Ernst L. R., Paben S. P. (2002) Maximizing and Minimizing Overlap When Selecting Any Number of Units per Stratum Simultaneously for Two Designs with Different Stratifications, Journal of Official Statistics, 18: 185-202. Jessen R. J. (1970) Probability Sampling with Marginal Constraints, Journal American Statistical Society, 65: 776-795. Lehtonen R., Veijanen A. (1998) Logistic Generalized Regression Estimators, Survey Methodology, 24: 51-55. Lehtonen R., Särndal C. E, Veijanen A. (2003) The effect of Model Choice in Estimation for Domains, Including Small Domains, Survey Methodology, 1: 33-44. Lehtonen R., Särndal C. E., Veijanen A. (2005) Does the Model Matter? Comparing Model-Assisted and Model-Dependent Estimators of Class Frequencies for Domains, Statistics In Transition, 7: 649-673. Lu W., Sitter R. R. (2002) Multi-way Stratification by Linear Programming Made Practical, Survey Methodology, 2: 199-207. Montanari G. E., Ranalli M.G. (2003) Nonparametric Methods in Survey Sampling, in: New Developments in Classification and Data Analysis, Vinci M., Monari P., Mignani S., Montanari A. (eds), Springer , Berlin. Ohlsson E. (1995) Coordination of Samples using Permanent Random Numbers, in: Business Survey Methods, Cox B. G., Binder D. A., Chinnappa B. N., Chirstianson A., Colledge M. J., Kott, P.S. (eds), Wiley, New York, 153-169. Rao J. N. K. (2003) Small Area Estimation, Wiley, New York. Rao J. N. K., Nigam A. K. (1990) Optimal Controlled Sampling Design, Biometrika, 77: 807-814. Rao J. N. K., Nigam A. K. (1992) Optimal Controlled Sampling: a Unifying Approach, International Statistical Review, 60: 89-98. Rosén B. (1997a) Asymptotic Theory for Order Sampling, Journal of Statistical Planning and Inference, 62, 135-158. Rosén B. (1997b) On Sampling with Probability Proportional to Size, Journal of Statistical Planning and Inference, 62, 159-191. Royall R., Herson J. (1973) Robust Estimation in Finite Population, Journal American Statistical Society, 68: 880-889. Särndal C. E., Swensson B., Wretman, J. (1992) Model Assisted Survey Sampling, Springer-Verlag, New York. Singh M. P., Gambino J., Mantel H. J. (1994) Issues and Strategies for Small Area Data, Survey Methodology, 20: 3-22. Sitter R. R., Skinner C. J. (1994) Multi-way Stratification by Linear Programming, Survey Methodology, 20: 65-73. Valliant R., Dorfman A. H., Royall R. M. (2000) Finite Population Sampling and Inference: A Prediction Approach, Wiley, New York.