Estimating quantiles under sampling in two
occasions with an effective use of auxiliary
information
M. Rueda1, J.F. Muñoz1, S. González2, I. Sánchez1, S. Martínez3 and A. Arcos1
1 Department of Statistics and Operational Research, University of Granada (Spain). mrueda@ugr.es, jfmunoz@ugr.es, ismasb@ugr.es, arcos@ugr.es
2 Department of Statistics and Operational Research, University of Jaén (Spain). sgonza@ujaen.es
3 Department of Statistics and Applied Mathematics, University of Almería (Spain). spuertas@ual.es
1 Introduction
Successive sampling is a well-known technique that can be used to estimate population parameters with a high gain in precision relative to other methods. In
practice, successive sampling has been used extensively in the applied and social
sciences to estimate measures of level and change of a linear parameter such
as a mean or total. For the case of estimating the population mean, an extensive
bibliography can be found in Jessen (1942), Patterson (1950), Narain (1953), Adhvaryu (1978), Eckler (1955), Gordon (1983),
Arnab and Okafor (1992), Singh and Srivastava (1973), Singh et al. (1992)
and Singh (2003).
For the problem of estimating a population quantile the situation is quite
different, and the problem has been discussed only recently. Most of the
studies related to medians have been developed by assuming simple random
sampling or stratified sampling (Gross, 1980; Sedransk and Meyer, 1978; Sedransk and Smith, 1988) and consider only the variable of interest, without
making explicit use of auxiliary variables in the construction of the estimators. The literature on the estimation of medians and other quantiles
using an auxiliary variable is considerably less extensive than in the case of
means and totals. Relevant references are Chambers and Dunstan (1986), Kuk
and Mak (1989), Rao et al. (1990), Mak and Kuk (1993), Rueda et al. (1998),
Allen et al. (2002) and Singh et al. (2001). These studies use an auxiliary variable, so if other variables are available in successive sampling, they
cannot be used. This suggests that a more effective use of auxiliary information
is possible.
Assuming sampling on two occasions, we propose a class of quantile estimators based on several auxiliary variables obtained from the previous occasion.
The class is composed of a multivariate ratio estimator from the matched
sample and an estimator based on the unmatched sample of the current occasion. The optimum estimator, in the sense of minimizing the asymptotic
variance within the class, is also defined. The accuracy of the proposed estimator
is evaluated in a simulation study, which shows a gain in efficiency relative
to other estimators.
2 The proposed class of estimators
Consider a finite population $U$ of size $N$, which is assumed to retain its
composition over two (or more) time periods. Assume that a sample $s'$ of size
$n'$ is drawn on the previous occasion. On the current occasion a subsample $s_m$ (the matched sample) of size $m$ is taken from the previously
selected $n'$ units, and $u = n - m$ units are replaced by new units selected independently of the matched portion. This last sample is called the
unmatched sample $s_u$. Let $\chi = m/n$ be the matched fraction and $s = s_m \cup s_u$.
Let us assume that all the samples and subsamples are selected under a simple random sampling design, and that only two time periods are considered,
the sizes of the samples being different on each occasion.
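The selection scheme described above can be sketched as follows. This is only an illustration, not the authors' R code: the function name `draw_two_occasion_sample` and the argument `n_prev` (the first-occasion sample size) are hypothetical, units are labelled 0 to N-1, and the sketch assumes that the new units are taken from the units not selected on the first occasion.

```python
import random

def draw_two_occasion_sample(N, n_prev, n, m, seed=None):
    """Return (first-occasion sample, matched sample, unmatched sample)."""
    if not (0 <= m <= min(n, n_prev) and n - m <= N - n_prev):
        raise ValueError("incompatible sample sizes")
    rng = random.Random(seed)
    # First occasion: simple random sample without replacement of size n_prev.
    s_prev = rng.sample(range(N), n_prev)
    # Matched sample: m units retained from the first-occasion sample.
    s_m = rng.sample(s_prev, m)
    # Unmatched sample: u = n - m new units (assumption: drawn from the units
    # that were not selected on the first occasion).
    selected = set(s_prev)
    remaining = [i for i in range(N) if i not in selected]
    s_u = rng.sample(remaining, n - m)
    return s_prev, s_m, s_u

s_prev, s_m, s_u = draw_two_occasion_sample(N=304, n_prev=100, n=50, m=20, seed=1)
chi = len(s_m) / (len(s_m) + len(s_u))  # matched fraction m / n
```

The current-occasion sample is then the union of `s_m` and `s_u`, of size n = 50 in this example.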
Let $y$ denote the survey variable, with values $y_1, \ldots, y_N$ for the $N$ population elements. On the previous occasion there exist $p$ auxiliary variables, $x_1, x_2, \ldots, x_p$,
which are used to build the proposed multivariate ratio estimator.
The finite population distribution function of $y$ is given by
$$F_Y(t) = N^{-1} \sum_{i \in U} \Delta(t - y_i),$$
where $\Delta(a) = 1$ if $a \ge 0$ and $\Delta(a) = 0$ otherwise.
The corresponding finite population $\beta$-quantile of $y$ is
$$Q_Y(\beta) = F_Y^{-1}(\beta) = \inf\{t : F_Y(t) \ge \beta\},$$
where $F_Y^{-1}$ is the inverse function and $0 < \beta < 1$.
The general procedure to estimate quantiles is formulated as follows: first,
an estimator of the distribution function based on sample data, $\hat{F}_Y(t)$, is
obtained, and then the $\beta$-quantile is estimated by taking the inverse, i.e.,
$$\hat{Q}_Y(\beta) = \hat{F}_Y^{-1}(\beta) = \inf\{t : \hat{F}_Y(t) \ge \beta\}.$$
The natural candidate to estimate the $\beta$-quantile $Q_Y(\beta)$ based on the
sample $s$ is
$$\hat{Q}_{Yn}(\beta) = \hat{F}_{Yn}^{-1}(\beta), \tag{1}$$
where $\hat{F}_{Yn}(t) = n^{-1} \sum_{i \in s} \Delta(t - y_i)$ is the sample estimator of the finite
population distribution function, which coincides with the Horvitz-Thompson
type estimator under simple random sampling.
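As an illustration of the inversion step, the following sketch evaluates the empirical distribution function and the corresponding quantile, inf{t : F(t) >= beta}, for a list of observed values. The function names and the data are hypothetical, and a small tolerance is subtracted before rounding up to guard against floating-point error in beta * n.

```python
import math

def ecdf(values, t):
    """Proportion of values y_i with y_i <= t, i.e. the mean of Delta(t - y_i)."""
    return sum(1 for y in values if y <= t) / len(values)

def quantile(values, beta):
    """Smallest observed t whose empirical distribution function is >= beta."""
    if not 0 < beta < 1:
        raise ValueError("beta must lie strictly between 0 and 1")
    ordered = sorted(values)
    # Smallest k with k / n >= beta (tolerance guards against rounding error).
    k = math.ceil(beta * len(ordered) - 1e-12)
    return ordered[k - 1]

y = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical sample values
median = quantile(y, 0.5)
```

Applied to all N population values the same function gives the finite population quantile; applied to the sampled values it gives the standard estimator in (1).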
Let $\hat{Q}_{X_i}(\beta)$ ($i = 1, \ldots, p$) be the sample quantiles of order $\beta$ based on
$s'$, that is, the first occasion. Denote by $\hat{Q}_{X_im}(\beta)$ and $\hat{Q}_{Ym}(\beta)$ the sample
quantiles of the matched sample based on the auxiliary and study variables,
and by $\hat{Q}_{Yu}(\beta)$ the sample quantile based on the unmatched sample of the current
occasion.
Following the idea of Olkin (1958), we propose a multivariate double sampling
ratio estimator of $Q_Y(\beta)$ based on the matched portion as
$$\hat{Q}^{MR}_{Ym}(\beta) = \sum_{1 \le i \le p} w_i \hat{Q}_{Yr_im}(\beta),$$
where
$$\hat{Q}_{Yr_im}(\beta) = \frac{\hat{Q}_{Ym}(\beta)}{\hat{Q}_{X_im}(\beta)} \hat{Q}_{X_i}(\beta)$$
and the weights $w_i$ ($\sum_{1 \le i \le p} w_i = 1$) are to be determined so as to maximize the
precision of $\hat{Q}^{MR}_{Ym}(\beta)$.
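A minimal sketch of this estimator, assuming the sample quantiles have already been computed. The function names and the numerical values are hypothetical and only illustrate the arithmetic.

```python
def ratio_components(q_ym, q_xm, q_x):
    """One ratio-type estimate per auxiliary variable: Q_Ym / Q_Xim * Q_Xi."""
    return [q_ym / q_matched * q_first for q_matched, q_first in zip(q_xm, q_x)]

def multivariate_ratio(q_ym, q_xm, q_x, weights):
    """Weighted combination of the ratio-type estimates; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    components = ratio_components(q_ym, q_xm, q_x)
    return sum(w * c for w, c in zip(weights, components))

# Hypothetical quantiles: y on the matched sample, x1 and x2 on the matched
# sample, and x1 and x2 on the whole first-occasion sample.
q_ym = 120.0
q_xm = [100.0, 30.0]
q_x = [110.0, 33.0]
estimate = multivariate_ratio(q_ym, q_xm, q_x, weights=[0.5, 0.5])
```

Each component scales the matched-sample quantile of y by how much the first-occasion quantile of an auxiliary variable differs from its matched-sample counterpart.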
The variance of this estimator is
$$V(\hat{Q}^{MR}_{Ym}(\beta)) = \sum_{1 \le i \le p} w_i^2 V(\hat{Q}_{Yr_im}(\beta)) + 2 \sum_{i<j} w_i w_j \mathrm{Cov}(\hat{Q}_{Yr_im}(\beta), \hat{Q}_{Yr_jm}(\beta)).$$
This equation can be written as $V(\hat{Q}^{MR}_{Ym}(\beta)) = w'Bw$, where $w = (w_1, \ldots, w_p)'$, $B = (b_{ij})$ and $b_{ij} = \mathrm{Cov}(\hat{Q}_{Yr_im}(\beta), \hat{Q}_{Yr_jm}(\beta))$ for $i, j = 1, \ldots, p$.
To obtain the extremum, we make use of the generalized Cauchy-Schwarz
inequality and, since $B$ is positive semidefinite, it follows that the optimum
$w$ is given by
$$w_{opt} = \frac{B^{-1} e}{e' B^{-1} e},$$
where $e = (1, \ldots, 1)'$. In this way, the optimum multivariate ratio estimator
is given by $\hat{Q}^{MR}_{Ymopt}(\beta) = w_{opt}' \hat{Q}_{Yrm}(\beta)$, where
$\hat{Q}_{Yrm}(\beta) = (\hat{Q}_{Yr_1m}(\beta), \ldots, \hat{Q}_{Yr_pm}(\beta))'$.
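The optimum weights can be computed by solving the linear system B z = e and normalising z to sum to one. The sketch below does this in plain Python with a small Gaussian elimination routine. The function names and the 2 x 2 matrix are hypothetical, and the sketch assumes that B is nonsingular.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def optimal_weights(B):
    """w_opt = B^{-1} e / (e' B^{-1} e), with e a vector of ones."""
    z = solve(B, [1.0] * len(B))
    total = sum(z)
    return [zi / total for zi in z]

def quadratic_form(w, B):
    """w' B w, the variance of the weighted combination."""
    p = len(w)
    return sum(w[i] * B[i][j] * w[j] for i in range(p) for j in range(p))

B = [[4.0, 1.0], [1.0, 2.0]]   # hypothetical covariance matrix, p = 2
w_opt = optimal_weights(B)
v_opt = quadratic_form(w_opt, B)
```

For this matrix the less variable component receives the larger weight, and the resulting variance equals 1 / (e' B^{-1} e).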
Let us consider a composite estimator that combines the $\hat{Q}^{MR}_{Ymopt}(\beta)$ estimator based on the matched sample with the unmatched sample estimator:
$$\hat{Q}_Y(\beta) = W \hat{Q}^{MR}_{Ymopt}(\beta) + (1 - W) \hat{Q}_{Yu}(\beta), \tag{2}$$
where $W$ is a constant ($0 < W < 1$) that is chosen to achieve the minimum
variance of the estimator. A simple calculation shows that
$$W_{opt} = \frac{V(\hat{Q}_{Yu}(\beta))}{V(\hat{Q}_{Yu}(\beta)) + V(\hat{Q}^{MR}_{Ymopt}(\beta))}. \tag{3}$$
Therefore, the optimum estimator within the proposed class is
given by
$$\hat{Q}_{Yopt}(\beta) = W_{opt} \hat{Q}^{MR}_{Ymopt}(\beta) + (1 - W_{opt}) \hat{Q}_{Yu}(\beta). \tag{4}$$
The variance of this optimum estimator is given by
$$V(\hat{Q}_{Yopt}(\beta)) = W_{opt}^2 V(\hat{Q}^{MR}_{Ymopt}(\beta)) + (1 - W_{opt})^2 V(\hat{Q}_{Yu}(\beta)), \tag{5}$$
which can be written as
$$V(\hat{Q}_{Yopt}(\beta)) = \frac{V(\hat{Q}_{Yu}(\beta)) \, V(\hat{Q}^{MR}_{Ymopt}(\beta))}{V(\hat{Q}_{Yu}(\beta)) + V(\hat{Q}^{MR}_{Ymopt}(\beta))}. \tag{6}$$
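The following sketch illustrates the composite step numerically, treating the matched and unmatched estimators as uncorrelated, which is the condition under which the variance of the combination takes the form in (5). The function names and the two variances are hypothetical.

```python
def optimal_W(v_matched, v_unmatched):
    """Weight on the matched-sample estimator that minimises the variance."""
    return v_unmatched / (v_unmatched + v_matched)

def composite(q_matched, q_unmatched, W):
    """Convex combination of the matched and unmatched estimates."""
    return W * q_matched + (1 - W) * q_unmatched

def composite_variance(v_matched, v_unmatched, W):
    """Variance of the combination when the two parts are uncorrelated."""
    return W ** 2 * v_matched + (1 - W) ** 2 * v_unmatched

v_m, v_u = 2.0, 6.0                          # hypothetical variances
W_opt = optimal_W(v_m, v_u)                  # 6 / (6 + 2)
v_opt = composite_variance(v_m, v_u, W_opt)
v_closed = v_u * v_m / (v_u + v_m)           # closed form of the same variance
```

With these values the more precise component receives weight 0.75, and the composite variance is smaller than either of the two input variances.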
Thus, to calculate this variance, $V(\hat{Q}_{Yu}(\beta))$ and $V(\hat{Q}^{MR}_{Ymopt}(\beta))$ must be
determined. Gross (1980) showed that an asymptotic expression for the
variance of the estimator $\hat{Q}_{Yu}(\beta)$ is given by
$$V(\hat{Q}_{Yu}(\beta)) = \frac{N - u}{N} \, \beta(1 - \beta) \, u^{-1} \{f_Y(Q_Y(\beta))\}^{-2}. \tag{7}$$
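Expression (7) is straightforward to evaluate once a value for the density of y at the quantile is available. In the sketch below that value is simply passed in as `f_at_quantile`; how it is estimated is outside the scope of the example, and both the function name and the numbers are hypothetical.

```python
def gross_variance(N, u, beta, f_at_quantile):
    """Asymptotic variance of the sample quantile from a sample of size u."""
    fpc = (N - u) / N   # finite population correction
    return fpc * beta * (1 - beta) / u / f_at_quantile ** 2

# Hypothetical inputs: N = 304 units, u = 30 unmatched units, the median,
# and an assumed density value of 0.01 at the population median.
v_u = gross_variance(N=304, u=30, beta=0.5, f_at_quantile=0.01)
```

The variance shrinks as u grows and vanishes when u = N, since the whole population is then observed.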
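$V(\hat{Q}^{MR}_{Ymopt}(\beta))$ depends on $V(\hat{Q}_{Yr_im}(\beta))$ and $\mathrm{Cov}(\hat{Q}_{Yr_im}(\beta), \hat{Q}_{Yr_jm}(\beta))$. Asymptotic expressions of these quantities can be obtained by Taylor's series expansion. Thus,
$$V(\hat{Q}_{Yr_im}(\beta)) = \frac{\beta(1-\beta)}{f_Y(Q_Y(\beta))^2} \left[ \left(\frac{1}{m} - \frac{1}{N}\right) + \left(\frac{1}{m} - \frac{1}{n'}\right) R_i \frac{f_Y(Q_Y(\beta))}{f_{X_i}(Q_{X_i}(\beta))} \times \left( R_i \frac{f_Y(Q_Y(\beta))}{f_{X_i}(Q_{X_i}(\beta))} + 2\left(1 - \frac{P_{11}(y, x_i)}{\beta(1-\beta)}\right) \right) \right]$$
and
$$\mathrm{Cov}(\hat{Q}_{Yr_im}(\beta), \hat{Q}_{Yr_jm}(\beta)) = \frac{\beta(1-\beta)}{f_Y(Q_Y(\beta))^2} \left[ \left(\frac{1}{m} - \frac{1}{N}\right) + \left(\frac{1}{n'} - \frac{1}{m}\right) R_i \frac{f_Y(Q_Y(\beta))}{f_{X_i}(Q_{X_i}(\beta))} \left(\frac{P_{11}(y, x_i)}{\beta(1-\beta)} - 1\right) + \left(\frac{1}{n'} - \frac{1}{m}\right) R_j \frac{f_Y(Q_Y(\beta))}{f_{X_j}(Q_{X_j}(\beta))} \left(\frac{P_{11}(y, x_j)}{\beta(1-\beta)} - 1\right) - \left(\frac{1}{n'} - \frac{1}{m}\right) R_j R_i \frac{f_Y(Q_Y(\beta))^2}{f_{X_j}(Q_{X_j}(\beta)) f_{X_i}(Q_{X_i}(\beta))} \left(\frac{P_{11}(x_i, x_j)}{\beta(1-\beta)} - 1\right) \right],$$
where $P_{11}(y, x_i)$ denotes the proportion of values in the population for which
$y \le Q_Y(\beta)$ and $x_i \le Q_{X_i}(\beta)$, and $R_i = Q_Y(\beta)/Q_{X_i}(\beta)$.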
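The sketch below evaluates these two asymptotic expressions as read here: a common factor beta(1 - beta) / f_Y^2 multiplying a bracket whose first term is (1/m - 1/N), with `n_prev` standing for the first-occasion sample size. All function and argument names are hypothetical, the density values are assumed to be supplied by the user, and the numbers are purely illustrative.

```python
def p11(a, b, qa, qb):
    """Proportion of units with a_k <= qa and b_k <= qb."""
    return sum(1 for ak, bk in zip(a, b) if ak <= qa and bk <= qb) / len(a)

def var_ratio(beta, N, n_prev, m, fy, fx, R, p_yx):
    """Asymptotic variance of one ratio-type quantile estimator."""
    c = beta * (1 - beta)
    g = R * fy / fx
    bracket = (1 / m - 1 / N) + (1 / m - 1 / n_prev) * g * (g + 2 * (1 - p_yx / c))
    return c / fy ** 2 * bracket

def cov_ratio(beta, N, n_prev, m, fy, fxi, fxj, Ri, Rj, p_yi, p_yj, p_ij):
    """Asymptotic covariance between two ratio-type quantile estimators."""
    c = beta * (1 - beta)
    d = 1 / n_prev - 1 / m
    gi, gj = Ri * fy / fxi, Rj * fy / fxj
    bracket = ((1 / m - 1 / N)
               + d * gi * (p_yi / c - 1)
               + d * gj * (p_yj / c - 1)
               - d * gi * gj * (p_ij / c - 1))
    return c / fy ** 2 * bracket

# Illustrative values for the median (beta = 0.5).
args = dict(beta=0.5, N=304, n_prev=100, m=40)
v_i = var_ratio(fy=2e-5, fx=2.5e-5, R=1.1, p_yx=0.45, **args)
# Consistency check: with i = j and P11(x_i, x_i) = beta, the covariance
# expression should reduce to the variance expression at the median.
c_ii = cov_ratio(fy=2e-5, fxi=2.5e-5, fxj=2.5e-5, Ri=1.1, Rj=1.1,
                 p_yi=0.45, p_yj=0.45, p_ij=0.5, **args)
```

When the first-occasion sample is no larger than the matched sample (`n_prev == m`), the auxiliary terms vanish and only the (1/m - 1/N) term remains.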
3 Monte Carlo simulations
In the previous section we defined an optimum estimator within the class
(2), and the variance of this estimator is also available. The next step is to measure
the accuracy of this estimator via an empirical study. For this purpose, we
use the Counties population, which is composed of the natural populations
Counties60 and Counties70. These populations have been widely used in finite
population sampling (for example, Royall and Cumberland, 1981, and Valliant
et al., 2000). They consist of $N = 304$ counties in North and
South Carolina and Georgia (U.S.). The variable of interest, $y$, is the population
in 1970 (Counties70 population), whereas the auxiliary variables are
$x_1$, the population in 1960 (Counties60 population), and $x_2$, the number
of households in 1960 (Counties60 population). In this way, we use $p = 2$
auxiliary variables.
The precision of the $\hat{Q}_{Yopt}(\beta)$ estimator is compared with the standard estimator $\hat{Q}_{Yn}(\beta)$ given in (1) and with the $\hat{Q}_{Yopt}(\beta)$ estimator when $p = 1$ auxiliary
variable is used. Therefore, the gain in precision can be observed when the
number of auxiliary variables increases in the estimation stage.
We generated $B = 1000$ independent samples under sampling on two occasions. All the samples were obtained under simple random sampling without
replacement. The performance of the estimators was evaluated for $\beta = 0.5$ in
terms of Relative Bias (RB) and Relative Efficiency (RE), with
$$RB = \frac{1}{B} \sum_{b=1}^{B} \frac{\hat{Q}_{Yopt}(\beta)_b - Q_Y(\beta)}{Q_Y(\beta)}; \qquad RE = \frac{MSE[\hat{Q}_{Yopt}(\beta)]}{MSE[\hat{Q}_{Yn}(\beta)]},$$
where $b$ indexes the $b$th simulation run, the empirical Mean Square Error is given by $MSE[\hat{Q}_{Yopt}(\beta)] = B^{-1} \sum_{b=1}^{B} [\hat{Q}_{Yopt}(\beta)_b - Q_Y(\beta)]^2$, and
$MSE[\hat{Q}_{Yn}(\beta)]$ is similarly defined for $\hat{Q}_{Yn}(\beta)$. Different values of $n'$, $n$ and
$m$ are considered. The random generation, calculations and all the estimators were obtained using the R programme. Programming details are available
from the authors.
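The two performance measures can be sketched as follows, assuming that the reference value in both RB and the empirical MSE is the population quantile. This is not the authors' R code: the function names are hypothetical and the four "simulated" estimates per estimator are made-up numbers that only show the arithmetic.

```python
def relative_bias(estimates, q_true):
    """Average of (estimate - true value) / true value over the runs."""
    return sum((q - q_true) / q_true for q in estimates) / len(estimates)

def mse(estimates, q_true):
    """Empirical mean square error around the true quantile."""
    return sum((q - q_true) ** 2 for q in estimates) / len(estimates)

def relative_efficiency(proposed, standard, q_true):
    """Ratio of empirical MSEs; values below 1 favour the proposed estimator."""
    return mse(proposed, q_true) / mse(standard, q_true)

q_true = 10.0                          # hypothetical population median
proposed = [9.0, 11.0, 10.0, 10.0]     # hypothetical estimates, one per run
standard = [8.0, 12.0, 10.0, 10.0]
rb = relative_bias(proposed, q_true)
re = relative_efficiency(proposed, standard, q_true)
```

In a full simulation each list would hold one estimate per generated sample, 1000 values in the study described here.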
Figure 1 shows the RE for the Counties population, $\beta = 0.5$ and
different values of $n'$, $n$ and $m$. In this figure we observe that the proposed
optimum estimator (when $p = 2$ auxiliary variables are used) is more efficient
in terms of RE than the standard estimator and than the proposed optimum estimator with $p = 1$ auxiliary variable; that is, better estimates are
obtained when all the auxiliary variables are used in the estimation stage.
As far as the RB values are concerned, we observe that these are all within a
reasonable range and so they are omitted.
In short, the proposed estimator shows good empirical properties owing to
an efficient use of auxiliary information. This suggests that the proposed class
is an attractive alternative for use in successive sampling. Finally, note that
the proposed estimator is simple to compute and that an asymptotic
expression for its variance is also available.
Fig. 1. RE for Counties70 population. β = 0.5.
[Figure: six panels, (n'=100, n=50), (n'=100, n=100), (n'=75, n=75), (n'=75, n=25), (n'=50, n=50) and (n'=50, n=25), each plotting RE (vertical axis) against χ (horizontal axis) for four estimators: standard estimator; proposed optimum estimator, p=2; proposed optimum estimator, p=1, x2; proposed optimum estimator, p=1, x1.]
References
1. Adhvaryu, D.: Successive sampling using multi-auxiliary information. Sankhya,
40, 167-173 (1978).
2. Allen, J., Singh, H.P., Singh, S., Smarandache, F.: A general class of estimators of population median using two auxiliary variables in double sampling.
INTERSTAT (2002).
3. Arnab, R., Okafor, F.C.: A note on double sampling over two occasions. Pakistan
Journal of Statistics 8, 9–18 (1992).
4. Chambers, R.L., Dunstan R.: Estimating distribution functions from survey
data. Biometrika 73, 597–604 (1986).
5. Eckler, A.R.: Rotation Sampling. The Annals of Mathematical Statistics, 26,
664–685 (1955).
6. Gordon, L.: Successive sampling in finite populations. The Annals of Statistics,
11, 702–706 (1983).
7. Gross, S.T.: Median estimation in sample survey. Proc. Surv. Res. Meth. Sect.
Amer. Statist. Ass., 181–184 (1980).
8. Jessen, R.J.: Statistical investigation of a sample survey for obtaining farm facts.
Iowa Agricultural Experiment Statistical Research Bulletin, 304, (1942).
9. Kuk, A.Y.C., Mak, T.K.: Median estimation in the presence of auxiliary information. Journal of the Royal Statistical Society B, 1, 261–269 (1989).
10. Mak, T.K., Kuk, A.Y.C.: A new method for estimating finite–population quantiles using auxiliary information. The Canadian Journal of Statistics, 25, 29–38
(1993).
11. Narain, R.D.: On the recurrence formula in sampling on successive occasions.
Journal of the Indian Society of Agricultural Statistics, 5, 96–99 (1953).
12. Olkin, I.: Multivariate ratio estimation for finite population. Biometrika, 45,
154–165 (1958).
13. Patterson, H.D.: Sampling on successive occasions with partial replacement of
units. Journal of the Royal Statistical Society B, 12, 241–255 (1950).
14. Royall, R.M., Cumberland, W.G.: An empirical study of the ratio estimator and
estimators of its variance. Journal of the American Statistical Association, 76,
66–77 (1981).
15. Rao, J.N.K., Kovar, J.G., Mantel, H.J.: On estimating distribution functions
and quantiles from survey data using auxiliary information. Biometrika, 77,
365–375 (1990).
16. Rueda, M., Arcos, A., Artés, E.: Quantile interval estimation in finite population
using a multivariate ratio estimator. Metrika, 47, 203–213, (1998).
17. Sedransk, J., Meyer, J.: Confidence intervals for the quantiles of a finite population: simple random and stratified simple random sampling. Journal of the
Royal Statistical Society B, 40, 239–252 (1978).
18. Sedransk, J., Smith, P.J.: Inference for finite population quantiles. In: Krishnaiah, P.R. and Rao, C. R. (eds.) Handbook of Statistics, Vol 6, Chap. 11, 267-289.
North-Holland. (1988).
19. Singh, S., Joarder, A.H., Tracy, D.S.: Median estimation using double sampling.
Aust. N. Z. J. Stat., 43, 33–46 (2001).
20. Singh, H.P., Singh, H.P., Singh, V.P.: A generalized efficient class of estimators
of population mean in two phase and successive sampling. Inter. J. Mgmt. Syst.,
8, 173–183 (1992).
21. Singh, S., Srivastava, A.K.: Use of auxiliary information in two stage successive
sampling. Journal of Indian Society of Agricultural Statistics, 25, 101–104 (1973).
22. Singh, S.: Advanced sampling theory with applications: How Michael "selected"
Amy, pp. 1–1247. Kluwer Academic Publishers. The Netherlands. (2003).
23. Valliant, R., Dorfman, A.H., Royall, R.M.: Finite population sampling and inference: A prediction approach. Wiley Series in Probability and Statistics, Survey
Methodology Section. New York. John Wiley and Sons, Inc. (2000)