The R package sampling, a software tool for
training in official statistics and survey
sampling
Yves Tillé¹ and Alina Matei²
¹ Institute of Statistics, University of Neuchâtel, Switzerland, yves.tille@unine.ch
² Institute of Statistics, University of Neuchâtel, Switzerland, alina.matei@unine.ch
Summary. The R package sampling is a software tool for training in official statistics and survey sampling. It is a collection of tools for selecting and weighting samples. Equal and unequal probability sampling, balanced sampling, and calibration methods are implemented. A large number of examples is available in the software manual.
Key words: survey sampling, equal/unequal probability sampling, balanced sampling, calibration
1 Introduction
Training programmes for official statisticians vary across countries and institutions, and training is a permanent concern at official statistical agencies. The R package sampling [TM06] is a software tool for training in official statistics and survey sampling. It was developed for the training course Advanced methods of survey sampling³, organized by the Swiss Federal Statistical Office in the framework of the European Statistical Training Programme. This paper is a general description of the package and an introduction for new users.
The sampling package is an R package containing a collection of tools related to survey sampling theory. The implemented functions deal with selecting and weighting samples. Several procedures select samples with equal or unequal probabilities. Balanced sampling is also available: the function for selecting a balanced sample uses the cube method [DT04]. Two calibration methods are implemented: the regression estimator, which uses a chi-square distance, and the raking ratio estimator, which uses the Kullback-Leibler divergence. Moreover, the package contains three datasets and a set of tools for computing inclusion probabilities and for rearranging strata. A large number of examples is available in the software manual [TM06]. For a description of the sampling algorithms used, see [Til06].
³ In April 2005, at Neuchâtel, Switzerland.
A brief description of the package is given in Section 2. The function names are given in verbatim style, e.g. srswor. We give three examples, involving methods for selecting and weighting samples (cf. Section 3). Conclusions are drawn in Section 4.
2 Package description
2.1 Some notations and basic concepts
Let $U = \{1, \ldots, k, \ldots, N\}$ be the finite population. The unit $k$ is the reference unit. A sample $s$ is a subset of $U$. It can be represented as a $\{0, 1\}$ vector $s = (s_k)_{k \in U}$, where
$$s_k = \begin{cases} 1 & \text{if unit } k \in s, \\ 0 & \text{if not.} \end{cases} \qquad (1)$$
Let $\mathcal{S}$ be the sample support, which is the set of all possible samples drawn from $U$. Thus $\mathcal{S}$ is the set of the $2^N$ subsets of $U$. The pair $(\mathcal{S}, p)$ is called a sampling design, where $p$ is a probability distribution on $\mathcal{S}$. For a given $p(\cdot)$, any $s \in \mathcal{S}$ is viewed as a realization of a random variable $S$, such that
$$\Pr(S = s) = p(s).$$
Let $k \in U$. The random event "$S \ni k$" is the event "a sample containing $k$ is realized" [SSW92]. The cardinality of the set $s$ is the sample size of $s$, and we shall denote it by $n$. Given $p(\cdot)$, the inclusion probability of a unit $k$ is the probability that unit $k$ will be in a sample. It is defined by
$$\pi_k = \Pr(k \in S) = \sum_{\substack{s \ni k \\ s \in \mathcal{S}}} p(s).$$
The quantities $\pi_k$, $k \in U$, are called the first-order inclusion probabilities. Similarly, the second-order inclusion probabilities, or joint inclusion probabilities, are defined as
$$\pi_{k\ell} = \Pr(k \in S, \ell \in S) = \sum_{\substack{s \ni k, \ell \\ s \in \mathcal{S}}} p(s).$$
When the sample size is fixed to $n$, the inclusion probabilities satisfy the conditions
$$\sum_{k \in U} \pi_k = n, \qquad \sum_{\substack{\ell \in U \\ \ell \neq k}} \pi_{k\ell} = (n - 1)\pi_k.$$
Let $y$ be the variable of interest. The Horvitz-Thompson estimator [HT52] of the population total $t_y = \sum_{k \in U} y_k$ is $\hat{t}_\pi = \sum_{k \in s} y_k / \pi_k$.
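The definitions above can be checked numerically on a design small enough to enumerate. The following base R sketch assumes a toy design that is not taken from the paper (simple random sampling without replacement of n = 2 units from N = 4) and made-up values of y. It computes the first-order and joint inclusion probabilities directly from p(s), verifies the two fixed-size identities, and evaluates the Horvitz-Thompson estimator for one sample.

```r
# Toy design (an assumption for illustration): simple random sampling
# without replacement of n = 2 units from N = 4, so that each of the
# choose(4, 2) = 6 samples has probability 1/6.
N <- 4
n <- 2

# One row per sample, coded as a 0/1 vector as in expression (1)
samples <- t(combn(N, n, FUN = function(idx) as.integer(seq_len(N) %in% idx)))
p <- rep(1 / nrow(samples), nrow(samples))

# First-order inclusion probabilities: sum of p(s) over the samples containing k
pik <- colSums(samples * p)

# Joint inclusion probabilities: sum of p(s) over the samples containing k and l
# (the diagonal holds the first-order probabilities)
pikl <- t(samples) %*% (samples * p)

# Fixed-size identities
sum(pik)                       # equals n
rowSums(pikl) - diag(pikl)     # equals (n - 1) * pik for every k

# Horvitz-Thompson estimator for the first enumerated sample (hypothetical y)
y <- c(10, 20, 30, 40)
s <- samples[1, ]
t_hat <- sum(y[s == 1] / pik[s == 1])
t_hat
```

Enumeration is only feasible for very small populations, since the support has 2^N elements; it is used here purely to make the definitions concrete.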
2.2 Sampling with equal or unequal probabilities
The package contains functions for drawing samples with or without replacement, and with equal or unequal probabilities. The implemented unequal probability sampling designs are: Brewer sampling [Bre63], maximum entropy sampling (or conditional Poisson sampling) [Háj64,Háj81], Midzuno sampling [Mid52], minimal support sampling [DT98], multinomial sampling [HH43], pivotal sampling [DT98], Poisson sampling [Háj58], systematic sampling [Mad49], Sampford sampling [Sam67] and Tillé sampling [Til96]. The first-order and second-order inclusion probabilities are computed for the following sampling designs: maximum entropy, Midzuno, systematic, and Tillé. The names of the functions that implement unequal probability sampling designs start with the prefix UP (Unequal Probability), e.g. UPpoisson.
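Two of the designs listed above are simple enough to sketch in a few lines of base R. The helpers poisson_sample and systematic_sample below are hypothetical names written for this illustration; they are not the package's UPpoisson and UPsystematic, whose implementations may differ. Both return a sample in the 0/1 notation of expression (1).

```r
# Poisson sampling: each unit is selected independently with probability pik[k],
# so the sample size is random with expectation sum(pik).
poisson_sample <- function(pik) {
  as.integer(runif(length(pik)) < pik)
}

# Systematic sampling with unequal probabilities: draw one uniform start u and
# select unit k when one of the points u, u + 1, u + 2, ... falls between the
# cumulated probabilities V[k-1] and V[k].
systematic_sample <- function(pik) {
  u <- runif(1)
  v <- c(0, cumsum(pik)) - u
  # ceiling(b) - ceiling(a) counts the integers in [a, b)
  as.integer(diff(ceiling(v)) > 0)
}

set.seed(1)
pik <- c(0.2, 0.7, 0.8, 0.5, 0.4, 0.4)
s_poisson <- poisson_sample(pik)        # random size
s_systematic <- systematic_sample(pik)  # fixed size sum(pik) = 3
(1:length(pik))[s_systematic == 1]
```

The sketch assumes that all inclusion probabilities are strictly between 0 and 1 and, for the systematic design, that they sum to an integer.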
2.3 Balanced sampling
A balanced sampling design is defined by the property that the Horvitz-Thompson estimators of the population totals of a set of auxiliary variables equal the known totals of these variables:
$$\sum_{k \in s} \frac{x_{jk}}{\pi_k} = \sum_{k \in U} x_{jk}, \qquad (2)$$
for all $s \in \mathcal{S}$ such that $p(s) > 0$, where $j \in \{1, \ldots, J\}$, and $x_k = (x_{1k}, \ldots, x_{Jk})$ is a row vector of auxiliary variables. The cube method [DT04] is a general method for selecting approximately balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables. The main function for selecting a balanced sample by means of the cube method is samplecube. The two phases of the cube method, the flight phase (fastflightcube) and the landing phase (landingcube), can be run separately. Additional procedures can be used to select a balanced stratified sample (balancedstratification), a balanced cluster sample (balancedcluster), and a balanced two-stage sample (balancedtwostage).
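Whether a given sample satisfies the balancing equations (2) can be checked directly. The base R sketch below uses a hypothetical helper, balance_gap, and a made-up population of four units; it does not select a balanced sample (that is the job of the cube method), it only measures how far a sample is from balance.

```r
# Difference between the Horvitz-Thompson estimates of the auxiliary totals
# and the true totals; a sample is balanced when every entry is zero.
balance_gap <- function(X, pik, s) {
  X <- as.matrix(X)
  ht_totals <- colSums(X[s == 1, , drop = FALSE] / pik[s == 1])
  ht_totals - colSums(X)
}

# Toy population (an assumption for illustration)
pik <- c(0.5, 0.5, 0.5, 0.5)
X <- cbind(pik = pik, x = c(1, 2, 3, 4))
s <- c(1, 0, 1, 0)

# The first column is balanced because the sample has fixed size sum(pik) = 2;
# the second column is not balanced for this particular sample.
gap <- balance_gap(X, pik, s)
gap
```

Using the inclusion probabilities themselves as an auxiliary variable, as in the first column, is the usual way to express a fixed sample size as a balancing constraint.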
2.4 Calibration
The calibration estimator [DS92] is defined as
$$\hat{t}_{CAL} = \sum_{k \in s} w_k y_k,$$
where the weights $w_k$ satisfy
$$\sum_{k \in s} w_k x_k = \sum_{k \in U} x_k = t_x, \qquad (3)$$
for a row vector of auxiliary variables $x_k = (x_{1k}, \ldots, x_{Jk})$, for which $t_x$ is known. Equation (3) is called the calibration equation. Let $d_k$ be the initial weights, usually equal to $1/\pi_k$. Deville and Särndal required that the difference between the set of sampling design weights $d_k$ and the weights $w_k$, $k \in s$, satisfying equation (3), minimizes some function. The function to minimize is
$$\sum_{k \in s} d_k q_k G_k(w_k / d_k) - \lambda \left( \sum_{k \in s} w_k x_k - t_x \right),$$
where $\lambda$ is the vector of the Lagrange multipliers. Minimization leads to the calibration weights $w_k = d_k F_k(x_k' \lambda / q_k)$, where $q_k$ is a weight associated with unit $k$, unrelated to $d_k$, that accounts for heteroscedastic residuals from fitting $y$ on $x = (x_k)$, and $F_k$ is the inverse of the function $dG_k(u)/du$, with the property that $F_k(0) = 1$, $F_k'(0) = q_k > 0$.
Two methods of calibration are implemented: the regression estimator (regressionestimator), which uses a chi-square distance, and the raking ratio estimator (rakingratio), which uses the Kullback-Leibler divergence [DS92]. The g-weights (equal to $w_k/d_k$) can be bounded for both methods by means of two additional procedures, boundedregressionestimator and boundedrakingratio. Since the calibration estimator does not always exist, the function checkcalibration can be used to check the existence of a solution.
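For the chi-square distance the calibration equation has a closed-form solution, which makes the regression estimator easy to sketch. The base R function below is a minimal illustration that assumes $q_k = 1$ for all units, in which case the g-weights are $g_k = 1 + x_k \lambda$ with $\lambda = (\sum_{k \in s} d_k x_k' x_k)^{-1} (t_x - \sum_{k \in s} d_k x_k)'$. The name linear_gweights and the toy data are hypothetical; this is not the package's regressionestimator.

```r
# g-weights of the regression (linear) calibration estimator, assuming q_k = 1.
# X:     matrix of auxiliary variables, one row per sampled unit
# d:     initial weights, usually 1 / pik
# total: vector of known population totals t_x
linear_gweights <- function(X, d, total) {
  X <- as.matrix(X)
  A <- crossprod(X * d, X)                    # sum over the sample of d_k x_k' x_k
  lambda <- solve(A, total - colSums(X * d))  # Lagrange multipliers
  drop(1 + X %*% lambda)
}

# Toy sample of 4 units with equal initial weights (an assumption for illustration)
X <- cbind(1, c(1, 2, 3, 4))
d <- rep(2.5, 4)
total <- c(10, 27)

g <- linear_gweights(X, d, total)
w <- d * g

# The calibrated weights reproduce the known totals, as required by equation (3)
colSums(X * w)
```

Nothing in this solution keeps the weights positive, which is one reason why bounded versions of the calibration methods are useful.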
2.5 Additional functions and datasets
The package contains additional facilities: computation of inclusion probabilities for a πps sampling design (inclusionprobabilities), computation of inclusion probabilities for a stratified design (inclusionprobastrata), listing of all possible samples of a fixed size (writesample), renumbering of the strata and removal of the empty strata of a stratification variable (cleanstrata), and disjunctive coding of a stratification or factor variable (disjunctive).
Three datasets are supplied with the package: the MU284 dataset [SSW92], the Belgian municipalities dataset, and the Swiss municipalities dataset.
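For a πps design, the inclusion probabilities are proportional to a positive size variable x and sum to n. A common way to compute them, sketched below in base R, is to set $\pi_k = n x_k / \sum_{\ell \in U} x_\ell$, fix at 1 any unit whose value reaches or exceeds 1, and repeat the computation on the remaining units. The helper pips is a hypothetical name and a simplified sketch under these assumptions; the package function inclusionprobabilities may handle more cases.

```r
# Inclusion probabilities proportional to size x for a sample of size n.
# Units that would receive a probability of 1 or more are taken with
# certainty, and the computation is repeated on the other units.
pips <- function(x, n) {
  pik <- numeric(length(x))
  take_all <- rep(FALSE, length(x))
  repeat {
    rest <- !take_all
    pik[rest] <- (n - sum(take_all)) * x[rest] / sum(x[rest])
    over <- rest & pik >= 1
    if (!any(over)) break
    take_all <- take_all | over
    pik[take_all] <- 1
  }
  pik
}

# One very large unit (made-up sizes): it is selected with certainty and the
# remaining probabilities are rescaled so that the total is still n = 2.
x <- c(1, 2, 3, 4, 100)
pik <- pips(x, 2)
pik
sum(pik)
```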
3 Demonstration
3.1 A small overview
R is an environment for statistical computing and graphics based on the S programming language from Bell Labs. The software provides a wide variety of statistical and graphical techniques. R can be freely downloaded from http://www.r-project.org/
Regardless of the operating system, the package sampling can be installed by typing the following command at the R prompt:
install.packages("sampling")
This installs the latest version from the Comprehensive R Archive Network, http://CRAN.R-project.org/
A few examples of the capabilities of sampling are shown in the following transcript of an R session. A more extensive demonstration can be seen after loading the package with
library(sampling)
The list of all functions is given by typing
help(package=sampling)
3.2 Examples
Example 1
A simple example is given below. The vector of the first-order inclusion probabilities is defined and denoted by pik. A sample of fixed size equal to 3 is selected by using systematic sampling with unequal probabilities. The sample is represented as in expression (1). The Horvitz-Thompson estimator of $t_y$ is computed.
# define the first-order inclusion probabilities
pik=c(0.2,0.7,0.8,0.5,0.4,0.4)
# the population size
N=length(pik)
# define the variable of interest (values observed on the sampled units)
y=c(23.4,5.64,31.45)
# select a sample
s=UPsystematic(pik)
# the selected sample is
(1:N)[s==1]
# the Horvitz-Thompson estimator of the total is
c((1/pik[s==1]) %*% y)
If the selected sample is {1, 3, 4}, the Horvitz-Thompson estimator is 186.95.
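The value quoted for the sample {1, 3, 4} can be reproduced without the package by fixing the sample by hand instead of drawing it. The snippet below is only an arithmetic check in base R: the three values of y are divided by the inclusion probabilities of units 1, 3 and 4, and the results are summed.

```r
pik <- c(0.2, 0.7, 0.8, 0.5, 0.4, 0.4)
y <- c(23.4, 5.64, 31.45)     # values observed on the three sampled units
s <- c(1, 0, 1, 1, 0, 0)      # the sample {1, 3, 4} in the notation of expression (1)

# 23.4/0.2 + 5.64/0.8 + 31.45/0.5 = 117 + 7.05 + 62.9
ht <- sum(y / pik[s == 1])
ht                            # 186.95
```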
Example 2
A more complex example, given below, involves the selection of samples of fixed size or expected size equal to 200, with equal or unequal probabilities. The population is the Belgian municipalities dataset. The first-order inclusion probabilities are computed using auxiliary information (the variable total 2004, Tot04). The following nine sampling designs are considered: Poisson sampling, systematic sampling with random order of the units in the population (denoted in Fig. 1 as rsystematic), pivotal sampling with random order of the units in the population (denoted in Fig. 1 as rpivotal), Tillé sampling, Midzuno sampling, systematic sampling, pivotal sampling, multinomial sampling, and simple random sampling without replacement. The Horvitz-Thompson estimator of the total $t_y$ (y is the variable taxable income, TaxableIncome) is computed. Monte Carlo simulations are run in order to compare the accuracy of the Horvitz-Thompson estimator under these sampling designs. The number of simulations (given by the variable sim) is fixed to 1000. The simulation results can be interpreted via boxplots (see Fig. 1). Simple random sampling, multinomial sampling, and Poisson sampling are the least accurate. All the other methods of unequal probability sampling seem to have the same accuracy, except for random systematic sampling and random pivotal sampling that have variances which depend on the order of the units in the file.
data(belgianmunicipalities)
attach(belgianmunicipalities)
# compute the inclusion probabilities
pik=inclusionprobabilities(Tot04,200)
# the population size
N=length(pik)
# the sample size
n=sum(pik)
# number of simulations
sim=1000
ss=array(0,c(sim,9))
# the variable of interest
y=TaxableIncome
# simulations and computation of the Horvitz-Thompson estimator
for(i in 1:sim)
{
cat("Step ",i,"\n")
ss[i,]=ss[i,]+c(
HTestimator(y,pik,UPpoisson(pik)),
HTestimator(y,pik,UPrandomsystematic(pik)),
HTestimator(y,pik,UPrandompivotal(pik)),
HTestimator(y,pik,UPtille(pik)),
HTestimator(y,pik,UPmidzuno(pik)),
HTestimator(y,pik,UPsystematic(pik)),
HTestimator(y,pik,UPpivotal(pik)),
HTestimator(y,pik,UPmultinomial(pik)),
HTestimator(y,rep(n/N,N),srswor(n,N)))
}
# boxplots of the estimators
colnames(ss) <- c("poisson","rsystematic","rpivotal","tille","midzuno",
"systematic","pivotal","multinom","srswor")
boxplot(data.frame(ss), las=3)
[Figure: boxplots of the simulated Horvitz-Thompson estimates, one per design (poisson, rsystematic, rpivotal, tille, midzuno, systematic, pivotal, multinom, srswor); the axis of estimates runs from 1.0e+11 to 1.6e+11.]
Fig. 1. Accuracy of the Horvitz-Thompson estimator
Example 3
The third example computes the g-weights for the regression estimator. There are 3 auxiliary variables and 10 population units. The first two auxiliary variables are categorical, and the last one is numerical. The first-order inclusion probabilities are equal to 0.2. The known population totals for the auxiliary variables are 24, 26 and 280. A simple random sample without replacement of size 4 is drawn. The calibration estimator of $t_y$ is computed.
# matrix of auxiliary variables defined by columns
Xs=cbind(c(1,1,1,1,1,0,0,0,0,0),c(0,0,0,0,0,1,1,1,1,1),
c(1,2,3,4,5,6,7,8,9,10))
# the inclusion probabilities
piks=rep(0.2,times=10)
# the vector of totals
t=c(24,26,280)
# the g-weights
g=regressionestimator(Xs,piks,t)
# verify the calibration
# in the affirmative case, the printed values are equal to t
if(checkcalibration(Xs,piks,t,g)) c((g/piks) %*% Xs)
# draw a srswor of size 4 from a population of size 10
s=srswor(4,10)
# the sample is
(1:10)[s==1]
# define the variable of interest
y=c(23.4,5.64,31.45,10.23)
# the calibration estimator is
crossprod((g/piks)[s==1],y)
The resulting g-weights are 0.96 0.96 0.96 0.96 0.96 1.04 1.04 1.04 1.04 1.04, and the calibration is possible. For the selected sample {3, 6, 8, 10}, the calibration estimator is equal to 358.384.
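The two numerical results quoted for this example can be verified in base R without drawing a random sample. The check below takes the reported g-weights as given, confirms that the weights g/piks reproduce the totals 24, 26 and 280, and recomputes the calibration estimator for the fixed sample {3, 6, 8, 10}.

```r
Xs <- cbind(c(1,1,1,1,1,0,0,0,0,0), c(0,0,0,0,0,1,1,1,1,1), 1:10)
piks <- rep(0.2, 10)
g <- c(rep(0.96, 5), rep(1.04, 5))        # g-weights reported in the text

# Calibration check: the weighted column sums equal the known totals
totals <- colSums(Xs * (g / piks))
totals                                    # 24 26 280

# Calibration estimator for the sample {3, 6, 8, 10}
s <- as.integer(1:10 %in% c(3, 6, 8, 10))
y <- c(23.4, 5.64, 31.45, 10.23)
t_cal <- sum((g / piks)[s == 1] * y)      # 4.8 * 23.4 + 5.2 * (5.64 + 31.45 + 10.23)
t_cal                                     # 358.384
```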
4 Conclusions
The R sampling package is both a training and a teaching tool. It can be used in official statistics and survey sampling, as well as in biostatistics. There are functions for selecting and weighting samples. Illustrative examples can be found for each function of the package. Functions for variance estimation are forthcoming. The latest version of the package and its manual can be freely downloaded from http://cran.r-project.org/src/contrib/Descriptions/sampling.html
References
[Bre63] K. R. W. Brewer. A model of systematic sampling with unequal probabilities. Australian Journal of Statistics, 5:5–13, 1963.
[DS92] J.-C. Deville and C.-E. Särndal. Calibration estimators in survey sampling.
Journal of the American Statistical Association, 87:376–382, 1992.
[DT98] J.-C. Deville and Y. Tillé. Unequal probability sampling without replacement through a splitting method. Biometrika, 85:89–101, 1998.
[DT04] J.-C. Deville and Y. Tillé. Efficient balanced sampling: the cube method.
Biometrika, 91:893–912, 2004.
[Háj58] J. Hájek. Some contributions to the theory of probability sampling. In ISI,
editor, Bulletin of the International Statistical Institute: Proceedings of the
30th session (Stockholm), volume 36, book 3, pages 127–134, The Hague,
1958.
[Háj64] J. Hájek. Asymptotic theory of rejective sampling with varying probabilities from a finite population. Annals of Mathematical Statistics, 35:1491–1523, 1964.
[Háj81] J. Hájek. Sampling from a Finite Population. Marcel Dekker, New York,
1981.
[HH43] M.H. Hansen and W.N. Hurwitz. On the theory of sampling from finite
populations. Annals of Mathematical Statistics, 14:333–362, 1943.
[HT52] D.G. Horvitz and D.J. Thompson. A generalization of sampling without
replacement from a finite universe. Journal of the American Statistical
Association, 47:663–685, 1952.
[Mad49] W.G. Madow. On the theory of systematic sampling, II. Annals of Mathematical Statistics, 20:333–354, 1949.
[Mid52] H. Midzuno. On the sampling system with probability proportional to sum
of size. Annals of the Institute of Statistical Mathematics, 3:99–107, 1952.
[Sam67] M.R. Sampford. On sampling without replacement with unequal probabilities of selection. Biometrika, 54:499–513, 1967.
[SSW92] C.-E. Särndal, B. Swensson, and J.H. Wretman. Model Assisted Survey
Sampling. Springer Verlag, New York, 1992.
[Til96] Y. Tillé. An elimination procedure of unequal probability sampling without
replacement. Biometrika, 83:238–241, 1996.
[Til06] Y. Tillé. Sampling Algorithms. Springer, 2006.
[TM06] Y. Tillé and A. Matei. The sampling package. Software manual, CRAN,
http://cran.r-project.org/src/contrib/Descriptions/sampling.html, 2006.