Imbi Traat
University of Tartu, Estonia e-mail: imbi.traat@ut.ee
Abstract
In this paper estimators for quantiles and for some of their functions are considered. These estimators are frequently used as various poverty measures. The corresponding variance estimators are derived for the design often used in household surveys (households are selected through people in the Population Register). A jackknife variance estimator for this design is also given. An illustration by simulation is presented.
This research was carried out for the Estonian Statistical Office in 2005 ( http://www.stat.ee/169971 ).
Estonia had to run the EU-SILC survey (Survey on Income and Living Conditions) and measure various poverty and income indicators called the Laeken Indicators. The survey design was the same as for the Household Budget Survey: a stratified unequal probability design. The required estimators, however, were different. Instead of ordinary means, totals and proportions, quantiles and their functions were needed (median, poverty threshold, at-risk-of-poverty rate, etc.). Variance estimators for the Estonian EU-SILC survey did not exist, and the task was to derive them. The results are briefly presented here. The variable used for quantile calculations was equivalized disposable income.
The survey design is stratified unequal probability sampling of households. Sampling is carried out among the records of the population register, and the sampling frame consists of people 14 years old and older (14+). Strata are formed geographically by grouping Estonian counties (and the capital city Tallinn) into three strata. Within each stratum a systematic sampling procedure of persons is used, with different sampling fractions in the defined strata. Each selected person brings his or her household (hh) into the sample. All members 16+ of that hh are questioned.
Probabilistic description.
Let $I_{hi}$ be the sampling indicator of hh $i$ in stratum $h$ (it shows how many times the hh is sampled). The expected sampling count of that hh is
$$E(I_{hi}) = n p_i, \qquad p_i = m^{14}_{hi} / M^{14}_h,$$
where $n$ is the sample size in households, $m^{14}_{hi}$ is the number of 14+ people in hh $i$ of stratum $h$, and $M^{14}_h = \sum_i m^{14}_{hi}$ is the total number of people in the frame (population register with 14+ persons). Note that here the index $i$ refers to the hh. The expected sampling counts are proportional to the 14+ size of the households. Large hh's are sampled more frequently and are therefore over-represented. This is compensated by down-weighting with the sampling weights
$$w_{hi} = k_{hi} / E(I_{hi}), \qquad h = 1, 2, 3,$$
where $k_{hi}$ is the realized value of the sampling indicator $I_{hi}$, usually equal to 1.
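As a rough illustration of the weighting step, the following Python sketch computes the selection probabilities $p_i$, the expected sampling counts $n p_i$ and the resulting weights for the sampled households of one stratum. The function name `household_weights` and its argument names are illustrative assumptions, not part of any production system.

```python
def household_weights(m14_sampled, M14, n, k=None):
    """Sampling weights w_i = k_i / E(I_i) for the sampled households of one stratum.

    m14_sampled : number of 14+ persons in each sampled household
    M14         : total number of 14+ persons in the stratum frame
    n           : sample size in households
    k           : realized sampling counts (assumed to be 1 if not given)
    """
    if k is None:
        k = [1] * len(m14_sampled)
    weights = []
    for m_i, k_i in zip(m14_sampled, k):
        p_i = m_i / M14            # selection probability, proportional to 14+ size
        expected_count = n * p_i   # E(I_i) = n * p_i
        weights.append(k_i / expected_count)
    return weights


# Hypothetical stratum: 1000 frame persons, 10 households sampled,
# three of them shown here with 1, 2 and 4 members aged 14+.
example_weights = household_weights([1, 2, 4], M14=1000, n=10)
```

In this toy input the household with four 14+ members gets a quarter of the weight of the single-person household, which is the down-weighting of large households described above.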
The joint distribution of the sampling indicators $I_{hi}$ in stratum $h$ is a multivariate hypergeometric distribution. This distribution is well studied, and the variances and covariances of the sampling indicators are well known (Johnson et al., 1997). However, this theoretical framework is not commonly used in the sampling literature. Sampling designs as multivariate distributions are considered in Traat et al. (2004).
Denote by $y_i$ a study variable and by $w_i$ a sampling weight (possibly adjusted). The index here refers to a person. The sampling weight is the same for the persons in the same household. The sample of persons is denoted by $s$ and the sample of households by $s_{hh}$.
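Let the $y_i$ be sorted into ascending order. Then the estimated $\alpha$-quantile of the variable $y$ is
$$q_\alpha = \begin{cases} y_j, & \text{if } \sum_{i=1}^{j-1} w_i < \alpha \hat N < \sum_{i=1}^{j} w_i, \\ (y_j + y_{j+1})/2, & \text{if } \sum_{i=1}^{j} w_i = \alpha \hat N, \end{cases} \qquad (1)$$
where $\hat N = \sum_s w_i$ is the estimated number of people. The estimated median is obtained for $\alpha = 0.5$ and the estimated quintiles for $\alpha = 0.2, 0.4, 0.6, 0.8$.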
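The weighted quantile rule can be sketched in Python as follows. This is a minimal illustration, assuming person-level lists of values and weights; the function name `weighted_quantile` and the floating-point tolerance used to detect the equality case are assumptions of the sketch.

```python
import math


def weighted_quantile(y, w, alpha):
    """Estimated alpha-quantile from person-level values y and weights w."""
    pairs = sorted(zip(y, w))               # sort persons by y, ascending
    n_hat = sum(w_i for _, w_i in pairs)    # estimated number of people
    target = alpha * n_hat
    cum = 0.0
    for j, (y_j, w_j) in enumerate(pairs):
        cum += w_j
        # Equality case: cumulative weight hits alpha * N_hat exactly,
        # so the quantile is the midpoint of y_j and y_{j+1}.
        if math.isclose(cum, target, rel_tol=1e-12) and j + 1 < len(pairs):
            return (y_j + pairs[j + 1][0]) / 2
        # Strict case: alpha * N_hat falls inside the weight of person j.
        if cum > target:
            return y_j
    return pairs[-1][0]


# Hypothetical data: four persons with equal weights, then one heavier weight.
median_equal = weighted_quantile([10, 20, 30, 40], [1, 1, 1, 1], 0.5)
median_unequal = weighted_quantile([10, 20, 30, 40], [1, 1, 1, 2], 0.5)
```

With equal weights the cumulative weight reaches exactly half of the total after the second person, so the midpoint branch applies. With the heavier last weight the target falls strictly inside the third person's weight and that person's value is returned.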
Several indicators are calculated as functions of quantile estimators. For example, the at-risk-of-poverty threshold ($I_{1e}$) is defined as 60% of the median, so its estimator is
$$\hat I_{1e} = 0.6\, q_{0.5}.$$
The income quintile share ratio ($I_2$) is estimated as
$$\hat I_2 = q_{0.8} / q_{0.2}.$$
The quantiles in the formulae are based on the variable eqinc .
There are several points that need special attention when developing variance formulae for the Laeken indicators:
- the sampling unit is the household, but the estimators are formed with person-level data;
- the sampling design is complex: unequal probabilities for households and persons;
- the estimators are non-linear;
- the domain variable in the estimators has a random threshold;
- calibration and weight adjustments may have introduced unequal weights for persons of the same household.
We use the inverse distribution function method for the variance of quantile estimators. The general results and some special cases are given in Särndal et al. (1992). The hypergeometric design is not considered there, so it is treated here. We omit the stratum index and present the following formulae for the hypergeometric design in one stratum.
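First we estimate the variance of the distribution function at a sample quantile:
$$\hat V \equiv \hat V(\hat F) = \frac{M^{14} - n}{M^{14}\, n (n-1)} \cdot \frac{1}{\hat N^2} \sum_{s_{hh}} \left( \frac{m_i}{p_i} \right)^2 (\delta_i - \alpha)^2,$$
where
$$\delta_i = \begin{cases} 1, & \text{if } y_{ij} \le q_\alpha \ \ \forall j, \\ 0, & \text{otherwise} \end{cases}$$
($j$ refers to the member of hh $i$), and $m_i$ is the number of eligible (questioned) members in hh $i$.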
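The sketch below evaluates this variance estimator for one stratum, reading it as the factor $(M^{14}-n)/(M^{14} n (n-1))$ times $\hat N^{-2} \sum_{s_{hh}} (m_i/p_i)^2 (\delta_i - \alpha)^2$. That reading of the formula, the function name `var_dist_function`, and the assumption that every member of a household shares one income value $y_i$ are all assumptions of this illustration.

```python
def var_dist_function(households, q_alpha, alpha, n, M14, n_hat):
    """Estimated variance of the distribution function at a sample quantile.

    households : list of (m_i, p_i, y_i) per sampled household, where
                 m_i = number of questioned members,
                 p_i = selection probability of the household,
                 y_i = income value shared by all members (assumption)
    q_alpha    : the sample alpha-quantile
    n          : sample size in households
    M14        : number of 14+ persons in the frame
    n_hat      : estimated number of people (sum of person weights)
    """
    factor = (M14 - n) / (M14 * n * (n - 1))
    total = 0.0
    for m_i, p_i, y_i in households:
        delta_i = 1.0 if y_i <= q_alpha else 0.0   # household below the quantile?
        total += (m_i / p_i) ** 2 * (delta_i - alpha) ** 2
    return factor * total / n_hat ** 2


# Hypothetical stratum: 8 frame persons, 4 sampled households of size 2.
hh = [(2, 0.25, 1.0), (2, 0.25, 2.0), (2, 0.25, 3.0), (2, 0.25, 4.0)]
n_hat = sum(m / (4 * p) for m, p, _ in hh)   # person weights are 1 / (n * p_i)
v_hat = var_dist_function(hh, q_alpha=2.0, alpha=0.5, n=4, M14=8, n_hat=n_hat)
```

For identical households the result reduces to $\alpha(1-\alpha)/(n-1)$ times the finite population factor $(M^{14}-n)/M^{14}$, which gives a quick sanity check on the implementation.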
Now, if $\hat F$ at a sample quantile is approximately normally distributed around $\alpha$ (which is the case for big samples), we can say that $(c_1, c_2)$, where
$$c_1 = \alpha - 1.96\sqrt{\hat V}, \qquad c_2 = \alpha + 1.96\sqrt{\hat V},$$
is an approximate 95% confidence interval for $\alpha$. Inverting the points $c_1$, $c_2$ with $\hat F^{-1}$, which means that we calculate $q_{c_1}$, $q_{c_2}$ from (1), we get that $(q_{c_1}, q_{c_2})$ is an approximate 95% confidence interval for the true quantile $Q_\alpha$. From the last interval one can also estimate the variance of the quantile estimator (assuming normality):
$$\hat V(q_\alpha) = \left[ \frac{q_{c_2} - q_{c_1}}{2 \cdot 1.96} \right]^2. \qquad (2)$$
The simplest function of quantiles is the poverty threshold $\hat I_{1e} = 0.6\, q_{0.5}$. Its variance follows in a straightforward way:
$$\hat V(\hat I_{1e}) = 0.6^2\, \hat V(q_{0.5}),$$
where $\hat V(q_{0.5})$ is calculated from (2) with $\alpha = 0.5$.
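The arithmetic of the inversion method can be sketched in Python as below. The sketch assumes that the variance of the estimated distribution function and the two inverted quantiles $q_{c_1}$, $q_{c_2}$ are already available. The function names are hypothetical, and clipping $c_1$, $c_2$ to $[0, 1]$ is a safeguard added for the sketch, not a step taken from the text.

```python
import math

Z_95 = 1.96  # normal quantile for an approximate 95% interval


def alpha_interval(alpha, var_f):
    """Points c1, c2 around alpha on the distribution-function scale."""
    half_width = Z_95 * math.sqrt(var_f)
    c1 = max(0.0, alpha - half_width)   # clipping is an added safeguard
    c2 = min(1.0, alpha + half_width)
    return c1, c2


def quantile_variance(q_c1, q_c2):
    """Variance of the quantile estimator from the inverted interval, as in (2)."""
    return ((q_c2 - q_c1) / (2 * Z_95)) ** 2


def threshold_variance(var_median):
    """Variance of the poverty threshold 0.6 * median."""
    return 0.6 ** 2 * var_median


# Hypothetical numbers: V(F) = 0.0025 at the median; the quantiles at c1 and c2
# are assumed to have been computed from the weighted quantile rule (1).
c1, c2 = alpha_interval(0.5, 0.0025)
var_med = quantile_variance(2600.0, 3384.0)
var_threshold = threshold_variance(var_med)
```

The interval on the quantile scale need not be symmetric around the point estimate. Formula (2) only uses its total length, which is why the normality assumption is needed for the variance but not for the interval itself.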
Resampling methods are appealing because they are simple to apply. In this work we concentrate on the jackknife method. The statistic considered is the design-weighted sample sum. A two-phase sampling framework is assumed, with the hypergeometric design of sample size $n_a$ in the first phase and SI-sampling of size $n$ from the already selected hh's (multiples included) in the second phase. The first phase estimator is
$$\hat t_a = \sum_i I_{ai}\, y_i / (n_a p_i)$$
and the second phase one is
$$\hat t = \sum_i (I_i \mid I_{ai})\, y_i / (n p_i).$$
The sampling indicators $I_{ai}$ and $I_i \mid I_{ai}$ describe selections of hh $i$ into the first phase and second phase samples, respectively. The following formula for the first phase variance can be derived:
$$V_a(\hat t_a) = c_a\, \frac{n}{n_a - n}\, V(\hat t \mid I_a),$$
where $c_a = (M^{14} - n_a)/(M^{14} - 1)$ comes from the hypergeometric sampling, and "given $I_a$" means given the first phase sample. If we choose the subsampling size $n = n_a - 1$ and use $c_a \approx 1$, which is usually true in practice, the formula simplifies to
$$V_a(\hat t_a) = (n_a - 1)\, V(\hat t \mid I_a). \qquad (3)$$
The important thing here is the fact that $V(\hat t \mid I_a)$ can be estimated as the variance of $\hat t$ over second phase samples. The result (3), which is valid under the EU-SILC design for sample sums, was applied to the Gini coefficient and it performed very well.
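A minimal Python sketch of result (3) for the design-weighted sample sum follows. It takes the $n_a$ delete-one subsamples as the second phase samples (subsampling size $n = n_a - 1$) and uses $c_a \approx 1$. The function name `jackknife_variance_of_sum` and the toy data are assumptions for illustration only.

```python
def jackknife_variance_of_sum(y, p):
    """Delete-one jackknife variance of t_a = sum(y_i / (n_a * p_i)).

    y : household totals of the study variable in the first phase sample
    p : selection probabilities p_i of those households
    """
    n_a = len(y)
    z = [y_i / p_i for y_i, p_i in zip(y, p)]
    z_sum = sum(z)
    # Second phase estimator for each delete-one subsample of size n_a - 1.
    t_deleted = [(z_sum - z_i) / (n_a - 1) for z_i in z]
    t_mean = sum(t_deleted) / n_a
    # Variance of t over the n_a equally likely second phase samples.
    var_given_first_phase = sum((t - t_mean) ** 2 for t in t_deleted) / n_a
    # Result (3): first phase variance = (n_a - 1) * V(t | I_a), with c_a ~ 1.
    return (n_a - 1) * var_given_first_phase


# Hypothetical first phase sample of five households with equal p_i.
v_jack = jackknife_variance_of_sum([1.0, 2.0, 3.0, 4.0, 5.0], [1.0] * 5)
```

For a sample sum this reproduces the familiar estimator $S^2/n_a$, where $S^2$ is the sample variance of the values $y_i/p_i$. For the toy data $S^2 = 2.5$ and $n_a = 5$, so the value is 0.5 up to rounding.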
The population was formed with the Estonian HBS 2003 data and made similar to the true Estonian population by its hh characteristics (Leiten and Traat, 2005):
Total no. of people 2595; total no. of frame people $M^{14} = 2332$; total no. of hh's 1263;
Median of eqinc 3000; Poverty threshold $I_{1e} = 1800$;
At-risk-of-poverty rate $I_1 = 15.34\%$; Income quintile share ratio $I_2 = 2.58$.
In each repetition 100 persons were SI-selected from the frame and their hh's were included in the sample. The sample quantities were calculated, and their means and variances over 4000 repetitions were obtained. Some tables are below.
Table 1. Estimated median and related quantities

            med      $\sqrt{\hat V(med)}$   $\sqrt{\hat V(\hat F)}$
Mean        3036     216.6                  0.05
Std Dev     216.5    52.3                   0.0006
The first column is for the sample median, the others are for the derived variance estimators over 4000 simulations. We see that the median as calculated by (1) on average slightly overestimates the true median 3000. The overestimation is small: $36 / 3000 \approx 1\%$. The true sampling variability of the median is not big, c.v. $= 216.5 / 3036 \approx 7\%$. The second column shows that the variance formula derived here performs very well: its mean, 216.6, almost coincides with the true value 216.5 observed over the simulations, so the variance estimator is almost unbiased, and it is also quite stable (standard deviation 52.3). The third column characterizes the variability of the estimated distribution function at the estimated median. This is the basic component when finding confidence intervals and the variance of the median with the inversion method. The stability of $\hat V(\hat F)$ guarantees good performance of the method.
It appeared that the confidence intervals of the median worked very well. The coverage rate was 95.2% instead of 95%.
The poverty threshold $\hat I_{1e}$ is a median-based quantity, and its performance is largely determined by that of the median.
Table 2. Poverty threshold $\hat I_{1e}$ and related quantities

            $\hat I_{1e}$   $\sqrt{\hat V(\hat I_{1e})}$
Mean        1821.8          130.0
Std Dev     129.9           31.4
References
Johnson, N.L., Kotz, S., Balakrishnan, N. (1997) Discrete Multivariate Distributions. New York: Wiley.
Leiten, E., Traat, I. (2005) Variance of Laeken Indicators in complex surveys. The Statistical Office of Estonia: http://www.stat.ee/169971
Särndal, C.-E., Swensson, B., Wretman, J. (1992) Model Assisted Survey Sampling. New York: Springer-Verlag.
Traat, I., Bondesson, L., Meister, K. (2004) Sampling design and sample selection through distribution theory. Journal of Statistical Planning and Inference, vol. 123, 395-413.