The R package sampling, a software tool for training in official statistics and survey sampling

Yves Tillé^1 and Alina Matei^2

^1 Institute of Statistics, University of Neuchâtel, Switzerland, yves.tille@unine.ch
^2 Institute of Statistics, University of Neuchâtel, Switzerland, alina.matei@unine.ch

Summary. The R package sampling is a software tool for training in official statistics and survey sampling. It is a collection of tools for selecting and weighting samples. Equal and unequal probability sampling, balanced sampling, and calibration methods are implemented. A large number of examples are available in the software manual.

Key words: survey sampling, equal/unequal probability sampling, balanced sampling, calibration

1 Introduction

Training programmes for official statisticians vary across countries and institutions, and training is a permanent concern at official statistical agencies. The R package sampling [TM06] is a software tool for training in official statistics and survey sampling. It was developed for the training course Advanced methods of survey sampling^3, organized by the Swiss Federal Statistical Office in the framework of the European Statistical Training Programme. This paper is a general description of the package and an introduction for new users.

The sampling package is an R package containing a collection of tools related to survey sampling theory. The implemented functions concern the selection and weighting of samples. Several procedures select samples with equal or unequal probabilities. Balanced sampling is also available: the function for selecting a balanced sample uses the cube method [DT04]. Two calibration methods are also implemented: the regression estimator, which uses a chi-square distance, and the raking ratio estimator, which uses the Kullback-Leibler divergence. Moreover, the package contains three datasets and a set of tools for computing inclusion probabilities and for rearranging strata.
A large number of examples are available in the software manual [TM06]. For a description of the sampling algorithms used, see [Til06]. A brief description of the package is given in Section 2. Function names are given in verbatim style, e.g. srswor. We give three examples, involving methods for selecting and weighting samples (cf. Section 3). Conclusions are drawn in Section 4.

^3 In April 2005, at Neuchâtel, Switzerland.

2 Package description

2.1 Some notations and basic concepts

Let $U = \{1, \dots, k, \dots, N\}$ be the finite population. The unit $k$ is the reference unit. A sample $s$ is a subset of $U$. It can be represented as a $\{0, 1\}$ vector $s = (s_k)_{k \in U}$, where

$$ s_k = \begin{cases} 1 & \text{if unit } k \in s, \\ 0 & \text{if not.} \end{cases} \qquad (1) $$

Let $\mathcal{S}$ be the sample support, which is the set of all possible samples drawn from $U$. Thus $\mathcal{S}$ is the set of the $2^N$ subsets of $U$. A couple $(\mathcal{S}, p)$ is called a sampling design, where $p$ is a probability distribution on $\mathcal{S}$. For a given $p(\cdot)$, any $s \in \mathcal{S}$ is viewed as a realization of a random variable $S$, such that $\Pr(S = s) = p(s)$. The random event "$S \ni k$" is the event "a sample containing $k$ is realized" [SSW92]. The cardinality of the set $s$ is the sample size of $s$, and we denote it by $n$.

Given $p(\cdot)$, the inclusion probability of a unit $k$ is the probability that unit $k$ will be in a sample. It is defined by

$$ \pi_k = \Pr(k \in S) = \sum_{\substack{s \in \mathcal{S} \\ s \ni k}} p(s). $$

The quantities $\pi_k$, $k \in U$, are called the first-order inclusion probabilities. Similarly, the second-order inclusion probabilities, or joint inclusion probabilities, are defined as

$$ \pi_{k\ell} = \Pr(k \in S, \ell \in S) = \sum_{\substack{s \in \mathcal{S} \\ s \ni k, \ell}} p(s). $$

When the sample size is fixed to $n$, the inclusion probabilities satisfy the conditions

$$ \sum_{k \in U} \pi_k = n, \qquad \sum_{\substack{\ell \in U \\ \ell \neq k}} \pi_{k\ell} = (n - 1)\pi_k. $$

Let $y$ be the variable of interest. The Horvitz-Thompson estimator [HT52] of the population total $t_y = \sum_{k \in U} y_k$ is $\hat{t}_\pi = \sum_{k \in s} y_k / \pi_k$.
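The definitions of this subsection can be checked numerically on a very small population by listing every possible sample. The following base-R sketch does not use the sampling package; the population values `y` and the design probabilities `p` are hypothetical, chosen only so that the design has fixed size $n = 2$.

```r
# Toy population (hypothetical values, not taken from the package datasets)
N <- 4
n <- 2
y <- c(10, 20, 30, 40)

# All samples of fixed size n, one per row, coded as 0/1 vectors as in (1)
idx <- combn(N, n)                                   # n x choose(N, n)
S <- t(apply(idx, 2, function(s) as.numeric(seq_len(N) %in% s)))

# An arbitrary fixed-size design p(s): one probability per row of S
p <- c(0.30, 0.20, 0.10, 0.10, 0.10, 0.20)

# First-order inclusion probabilities: sum of p(s) over samples containing k
pik <- colSums(S * p)

# Joint inclusion probabilities: sum of p(s) over samples containing k and l
pikl <- t(S) %*% (S * p)

# Horvitz-Thompson estimate for each possible sample, and its expectation
ht <- as.vector(S %*% (y / pik))
expected_ht <- sum(p * ht)

print(pik)          # first-order inclusion probabilities
print(sum(pik))     # equals n for a fixed-size design
print(expected_ht)  # equals sum(y): the estimator is unbiased
```

Because every unit has a strictly positive inclusion probability, the expectation of the estimator over the design equals the population total, and the two fixed-size conditions on $\pi_k$ and $\pi_{k\ell}$ hold.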
2.2 Sampling with equal or unequal probabilities

The package contains functions for drawing samples with or without replacement, and with equal or unequal probabilities. The implemented unequal probability sampling designs are: Brewer sampling [Bre63], maximum entropy sampling (or Conditional Poisson sampling) [Háj64,Háj81], Midzuno sampling [Mid52], minimal support sampling [DT98], multinomial sampling [HH43], pivotal sampling [DT98], Poisson sampling [Háj58], systematic sampling [Mad49], Sampford sampling [Sam67] and Tillé sampling [Til96]. The first-order and second-order inclusion probabilities are computed for the following sampling designs: maximum entropy, Midzuno, systematic, and Tillé. The functions that implement unequal probability sampling designs carry the prefix UP (Unequal Probability) in their names, e.g. UPpoisson.

2.3 Balanced sampling

A balanced sampling design is defined by the property that the Horvitz-Thompson estimators of the population totals of a set of auxiliary variables equal the known totals of these variables:

$$ \sum_{k \in s} \frac{x_{jk}}{\pi_k} = \sum_{k \in U} x_{jk}, \qquad (2) $$

for all $s \in \mathcal{S}$ such that $p(s) > 0$, where $j \in \{1, \dots, J\}$, and $x_k = (x_{1k}, \dots, x_{Jk})$ is a row vector of auxiliary variables. The cube method [DT04] is a general method for selecting approximately balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables. The main function for selecting a balanced sample by means of the cube method is samplecube. The two phases of the cube method, the flight phase (fastflightcube) and the landing phase (landingcube), can be run separately. Additional procedures can be used to select a balanced stratified sample (balancedstratification), a balanced cluster sample (balancedcluster), and a balanced two-stage sample (balancedtwostage).
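The balancing equations (2) can be checked for any given sample without the package. The base-R sketch below is only an illustration: the helper `balance_gap` and the auxiliary variable `x2` are hypothetical, and the cube method itself is not implemented here. It reports, for each auxiliary variable, the relative difference between the Horvitz-Thompson estimate and the known total.

```r
# Hypothetical helper: relative deviation from the balancing equations (2)
# X   : N x J matrix of auxiliary variables (one row per population unit)
# pik : first-order inclusion probabilities
# s   : 0/1 vector of length N coding the sample as in (1)
balance_gap <- function(X, pik, s) {
  X <- as.matrix(X)
  ht <- colSums(X[s == 1, , drop = FALSE] / pik[s == 1])  # HT estimates
  tx <- colSums(X)                                        # known totals
  (ht - tx) / tx
}

pik <- c(0.2, 0.7, 0.8, 0.5, 0.4, 0.4)
x2 <- c(2, 7, 9, 5, 4, 3)          # hypothetical auxiliary variable
X <- cbind(pik = pik, x2 = x2)

s <- c(1, 0, 1, 1, 0, 0)           # the sample {1, 3, 4}
gap <- balance_gap(X, pik, s)
print(gap)
```

Using the inclusion probabilities themselves as an auxiliary variable makes equation (2) read $\sum_{k \in s} 1 = \sum_{k \in U} \pi_k$, so any sample whose size equals the sum of the $\pi_k$ is exactly balanced on that variable, while a variable such as `x2` is in general only approximately balanced.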
2.4 Calibration

The calibration estimator [DS92] is defined as

$$ \hat{t}_{CAL} = \sum_{k \in s} w_k y_k, $$

where

$$ \sum_{k \in s} w_k x_k = \sum_{k \in U} x_k = t_x, \qquad (3) $$

for a row vector of auxiliary variables $x_k = (x_{1k}, \dots, x_{Jk})$, for which $t_x$ is known. Equation (3) is called the calibration equation. Let $d_k$ be the initial weights, usually equal to $1/\pi_k$. Deville and Särndal require that the difference between the sampling design weights $d_k$ and the weights $w_k$, $k \in s$, satisfying equation (3), minimizes some function. The function to minimize is

$$ \sum_{k \in s} d_k q_k G_k(w_k / d_k) - \lambda \left( \sum_{k \in s} w_k x_k - t_x \right), $$

where $\lambda$ is the vector of Lagrange multipliers. Minimization leads to the calibration weights

$$ w_k = d_k F_k(x_k^{\prime} \lambda / q_k), $$

where $q_k$ is a weight associated with unit $k$, unrelated to $d_k$, that accounts for heteroscedastic residuals from fitting $y$ on $x = (x_k)$, and $F_k$ is the inverse of the function $dG_k(u)/du$, with the properties $F_k(0) = 1$ and $F_k^{\prime}(0) = q_k > 0$.

Two methods of calibration are implemented: the regression estimator (regressionestimator), which uses a chi-square distance, and the raking ratio estimator (rakingratio), which uses the Kullback-Leibler divergence [DS92]. The g-weights (equal to $w_k / d_k$) can be bounded for both methods by means of two additional procedures, boundedregressionestimator and boundedrakingratio. Since the calibration estimator does not always exist, the function checkcalibration can check the existence of the solution.

2.5 Additional functions and datasets

The package contains additional facilities such as: computing the inclusion probabilities for a πps sampling design (inclusionprobabilities); computing the inclusion probabilities for a stratified design (inclusionprobastrata); listing all possible samples of a fixed size (writesample); renumbering the strata of a stratification variable and suppressing the empty ones (cleanstrata); and creating a disjunctive coding of a stratification or factor variable (disjunctive).
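To make the first of these facilities concrete, the base-R sketch below shows one common way of obtaining inclusion probabilities proportional to a positive size variable for a fixed sample size: probabilities that would exceed 1 are set to 1 and the remaining ones are recomputed. This is an assumption about the usual πps computation, not a copy of the package code; the function name `pips_sketch` and the size variable `x` are hypothetical.

```r
# Hypothetical sketch: inclusion probabilities proportional to size x,
# for a fixed sample size n, with probabilities capped at 1
pips_sketch <- function(x, n) {
  pik <- n * x / sum(x)
  repeat {
    if (!any(pik > 1)) break
    capped <- pik >= 1            # units selected with certainty
    pik[capped] <- 1
    free <- !capped
    # redistribute the remaining sample size over the other units
    pik[free] <- (n - sum(capped)) * x[free] / sum(x[free])
  }
  pik
}

x <- c(1, 2, 3, 4, 40)            # one very large unit
pik <- pips_sketch(x, 3)
print(pik)
print(sum(pik))                   # the probabilities sum to n
```

In this toy case the last unit would receive a probability of 2.4 under plain proportionality, so it is selected with certainty and the other four units share the remaining sample size of 2.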
Three datasets are supplied with the package: the MU284 dataset [SSW92], the Belgian municipalities dataset, and the Swiss municipalities dataset.

3 Demonstration

3.1 A small overview

R is an environment for statistical computing and graphics based on the S programming language from Bell Labs. The software provides a wide variety of statistical and graphical techniques. R can be freely downloaded from http://www.r-project.org/

Regardless of the operating system, the package sampling can be installed by typing the following command at the R prompt:

    install.packages("sampling")

This installs the latest version from the Comprehensive R Archive Network, http://CRAN.R-project.org/

A few examples of the package's capabilities are shown in the following transcript of an R session. A more extensive demonstration is available after loading the package with

    library(sampling)

The list of all functions is obtained by typing

    help(package=sampling)

3.2 Examples

Example 1

A simple example is given below. The vector of the first-order inclusion probabilities is defined and denoted by pik. A sample of fixed size 3 is selected by systematic sampling with unequal probabilities. The sample is represented as in expression (1). The Horvitz-Thompson estimator of $t_y$ is computed.

    #define the first-order inclusion probabilities
    pik=c(0.2,0.7,0.8,0.5,0.4,0.4)
    #the population size
    N=length(pik)
    #define the variable of interest
    y=c(23.4,5.64,31.45)
    #select a sample
    s=UPsystematic(pik)
    #the selected sample is
    (1:N)[s==1]
    #The Horvitz-Thompson estimator of the total is
    c((1/pik[s==1]) %*% y)

If the selected sample is {1, 3, 4}, the Horvitz-Thompson estimator is 186.95.

Example 2

A more complex example, given below, involves the selection of samples of fixed size or expected size equal to 200, with equal or unequal probabilities. The population is the Belgian municipalities dataset.
The first-order inclusion probabilities are computed using auxiliary information (the variable total 2004, Tot04). The following 9 sampling designs are considered: Poisson sampling, systematic sampling with random order of units in the population (denoted in Fig. 1 as rsystematic), pivotal sampling with random order of units in the population (denoted in Fig. 1 as rpivotal), Tillé sampling, Midzuno sampling, systematic sampling, pivotal sampling, multinomial sampling, and simple random sampling without replacement. The Horvitz-Thompson estimator of the total $t_y$ (where $y$ is the variable taxable income, TaxableIncome) is computed. Monte-Carlo simulations are run in order to compare the accuracy of the Horvitz-Thompson estimator for these different sampling designs. The number of simulations (given by the variable sim) is fixed to 1000. The simulation results can be interpreted via boxplots (see Fig. 1). Simple random sampling, multinomial sampling, and Poisson sampling are noticeably less accurate than the other designs. All the other methods of unequal probability sampling seem to have the same accuracy, except for random systematic sampling and random pivotal sampling, whose variances depend on the order of the units in the file.
    data(belgianmunicipalities)
    attach(belgianmunicipalities)
    #compute the inclusion probabilities pik
    pik=inclusionprobabilities(Tot04,200)
    #the population size
    N=length(pik)
    #the sample size
    n=sum(pik)
    #number of simulations
    sim=1000
    ss=array(0,c(sim,9))
    # the variable of interest
    y=TaxableIncome
    #simulations and computation of the Horvitz-Thompson estimator
    for(i in 1:sim)
    {
    #progress message, one line per simulation step
    cat("Step ",i,"\n")
    ss[i,]=ss[i,]+c(
    HTestimator(y,pik,UPpoisson(pik)),
    HTestimator(y,pik,UPrandomsystematic(pik)),
    HTestimator(y,pik,UPrandompivotal(pik)),
    HTestimator(y,pik,UPtille(pik)),
    HTestimator(y,pik,UPmidzuno(pik)),
    HTestimator(y,pik,UPsystematic(pik)),
    HTestimator(y,pik,UPpivotal(pik)),
    HTestimator(y,pik,UPmultinomial(pik)),
    HTestimator(y,rep(n/N,N),srswor(n,N)))
    }
    # boxplots of the estimators
    colnames(ss) <- c("poisson","rsystematic","rpivotal","tille","midzuno",
    "systematic","pivotal","multinom","srswor")
    boxplot(data.frame(ss), las=3)

[Figure: boxplots of the simulated Horvitz-Thompson estimates for the nine designs (srswor, multinom, pivotal, systematic, midzuno, tille, rpivotal, rsystematic, poisson); the value axis runs from about 1.0e+11 to 1.6e+11.]

Fig. 1. Accuracy of the Horvitz-Thompson estimator

Example 3

The third example computes the g-weights for the regression estimator. There are 3 auxiliary variables and 10 population units. The first two auxiliary variables are categorical, and the last one is numerical. The first-order inclusion probabilities are equal to 0.2. The known population totals for the auxiliary variables are 24, 26 and 280. A simple random sample without replacement of size 4 is drawn. The calibration estimator of $t_y$ is computed.
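Before the package call, the arithmetic behind the g-weights of this example can be sketched in base R. The sketch assumes the standard linear (chi-square distance) calibration solution with $q_k = 1$, namely $g_k = 1 + x_k \lambda$ with $\lambda$ solving $\left(\sum_k d_k x_k^{\prime} x_k\right) \lambda = \left(t_x - \sum_k d_k x_k\right)^{\prime}$; it is not the package's own code, and the names `d`, `total`, `lambda` and `w` are introduced only for illustration.

```r
# Auxiliary variables, inclusion probabilities and known totals of Example 3
Xs <- cbind(c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
            c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
            1:10)
piks <- rep(0.2, times = 10)
total <- c(24, 26, 280)

# Initial weights d_k = 1 / pi_k
d <- 1 / piks

# Lagrange multipliers of the linear calibration problem (assumes q_k = 1)
lambda <- solve(t(Xs * d) %*% Xs, total - colSums(Xs * d))

# g-weights and calibrated weights w_k = d_k * g_k
g <- as.vector(1 + Xs %*% lambda)
w <- d * g

print(g)
# the calibration equation (3): these values reproduce the known totals
print(as.vector(t(Xs) %*% w))

# calibration estimator for the sample {3, 6, 8, 10} used in the text
y <- c(23.4, 5.64, 31.45, 10.23)
print(sum(w[c(3, 6, 8, 10)] * y))
```

Here the initial weights overshoot the first total (25 against 24) and undershoot the second (25 against 26), so the g-weights shrink the first group and inflate the second, while the third variable receives a zero multiplier because the adjusted weights already reproduce its total.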
    # matrix of auxiliary variables defined by columns
    Xs=cbind(c(1,1,1,1,1,0,0,0,0,0),c(0,0,0,0,0,1,1,1,1,1),
    c(1,2,3,4,5,6,7,8,9,10))
    # the inclusion probabilities
    piks=rep(0.2,times=10)
    # the vector of totals
    t=c(24,26,280)
    # the g-weights
    g=regressionestimator(Xs,piks,t)
    # verify the calibration
    # in the affirmative case, the printed values are equal to t
    if(checkcalibration(Xs,piks,t,g)) c((g/piks) %*% Xs)
    #draw a srswor of size 4 from a population of size 10
    s=srswor(4,10)
    #the sample is
    (1:10)[s==1]
    #define the variable of interest
    y=c(23.4,5.64,31.45,10.23)
    # the calibration estimator is
    crossprod((g/piks)[s==1],y)

The resulting g-weights are 0.96 0.96 0.96 0.96 0.96 1.04 1.04 1.04 1.04 1.04, and the calibration is possible. For the selected sample {3, 6, 8, 10}, the calibration estimator is equal to 358.384.

4 Conclusions

The R sampling package is both a training and a teaching tool. It can be used in official statistics and survey sampling, as well as in biostatistics. There are functions for selecting and weighting samples. Illustrative examples can be found for each function of the package. Functions for variance estimation are forthcoming. The latest version of the package and its manual can be freely downloaded from http://cran.r-project.org/src/contrib/Descriptions/sampling.html

References

[Bre63] K. R. W. Brewer. A model of systematic sampling with unequal probabilities. Australian Journal of Statistics, 5:5–13, 1963.
[DS92] J.-C. Deville and C.-E. Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87:376–382, 1992.
[DT98] J.-C. Deville and Y. Tillé. Unequal probability sampling without replacement through a splitting method. Biometrika, 85:89–101, 1998.
[DT04] J.-C. Deville and Y. Tillé. Efficient balanced sampling: the cube method. Biometrika, 91:893–912, 2004.
[Háj58] J. Hájek. Some contributions to the theory of probability sampling.
In ISI, editor, Bulletin of the International Statistical Institute: Proceedings of the 30th session (Stockholm), volume 36, book 3, pages 127–134, The Hague, 1958.
[Háj64] J. Hájek. Asymptotic theory of rejective sampling with varying probabilities from a finite population. Annals of Mathematical Statistics, 35:1491–1523, 1964.
[Háj81] J. Hájek. Sampling from a Finite Population. Marcel Dekker, New York, 1981.
[HH43] M.H. Hansen and W.N. Hurwitz. On the theory of sampling from finite populations. Annals of Mathematical Statistics, 14:333–362, 1943.
[HT52] D.G. Horvitz and D.J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663–685, 1952.
[Mad49] W.G. Madow. On the theory of systematic sampling, II. Annals of Mathematical Statistics, 20:333–354, 1949.
[Mid52] H. Midzuno. On the sampling system with probability proportional to sum of size. Annals of the Institute of Statistical Mathematics, 3:99–107, 1952.
[Sam67] M.R. Sampford. On sampling without replacement with unequal probabilities of selection. Biometrika, 54:499–513, 1967.
[SSW92] C.-E. Särndal, B. Swensson, and J.H. Wretman. Model Assisted Survey Sampling. Springer Verlag, New York, 1992.
[Til96] Y. Tillé. An elimination procedure of unequal probability sampling without replacement. Biometrika, 83:238–241, 1996.
[Til06] Y. Tillé. Sampling Algorithms. Springer, 2006.
[TM06] Y. Tillé and A. Matei. The sampling package. Software manual, CRAN, http://cran.r-project.org/src/contrib/Descriptions/sampling.html, 2006.