Estimating quantiles under sampling in two occasions with an effective use of auxiliary information

M. Rueda1, J.F. Muñoz1, S. González2, I. Sánchez1, S. Martínez3 and A. Arcos1

1 Department of Statistics and Operational Research. University of Granada (Spain). mrueda@ugr.es, jfmunoz@ugr.es, ismasb@ugr.es, arcos@ugr.es
2 Department of Statistics and Operational Research. University of Jaén (Spain). sgonza@ujaen.es
3 Department of Statistics and Applied Mathematics. University of Almería (Spain). spuertas@ual.es

1 Introduction

Successive sampling is a well-known technique for estimating population parameters with a high gain in precision with regard to other methods. In practice, successive sampling has been used extensively in the applied and social sciences to estimate measures of level and change of a linear parameter such as a mean or total. For the case of estimating the population mean, an extensive bibliography can be found in Jessen (1942), Patterson (1950), Narain (1953), Adhvaryu (1978), Eckler (1955), Gordon (1983), Arnab and Okafor (1992), Singh and Srivastava (1973), Singh et al. (1992) and Singh (2003).

For the problem of estimating a population quantile the situation is quite different, and the problem has been discussed only recently. Most of the studies related to medians have been developed by assuming simple random sampling or stratified sampling (Gross, 1980; Sedransk and Meyer, 1978; Sedransk and Smith, 1988) and by considering only the variable of interest, without making explicit use of auxiliary variables in the construction of the estimators.

The literature on the estimation of medians and other quantiles using an auxiliary variable is considerably less extensive than in the case of means and totals. Relevant references are Chambers and Dunstan (1986), Kuk and Mak (1989), Rao et al. (1990), Mak and Kuk (1993), Rueda et al. (1998), Allen et al. (2002) and Singh et al. (2001).
These studies use an auxiliary variable. Consequently, if other variables are available in successive sampling, they cannot be used. This suggests that a more effective use of auxiliary information is possible.

Assuming sampling on two occasions, we propose a class of quantile estimators based on several auxiliary variables obtained from the previous occasion. The class is composed of a multivariate ratio estimator from the matched sample and an estimator based on the unmatched sample of the current occasion. The optimum estimator, in the sense of minimizing the asymptotic variance within the class, is also defined. The accuracy of the proposed estimator is evaluated in a simulation study, which shows a gain in efficiency with regard to other estimators.

2 The proposed class of estimators

Consider a finite population $U$ of size $N$, which is assumed to retain its composition over two (or more) time periods. Assume that a sample $s'$ of size $n'$ is drawn on the previous occasion. On the current occasion a subsample $s_m$ (the matched sample) of size $m$ is taken from the previously selected $n'$ units, and $u = n - m$ units are replaced by new units selected independently of the matched portion. This last sample is called the unmatched sample, $s_u$. Let $\chi = m/n$ be the matched fraction and $s = s_m \cup s_u$. We assume that all the samples and subsamples are selected under a simple random sampling design and that only two time periods are considered, the sizes of the samples being different on each occasion.

Let $y$ denote the survey variable, with values $y_1, \ldots, y_N$ for the $N$ population elements. On the previous occasion there are $p$ auxiliary variables, $x_1, x_2, \ldots, x_p$, which are used to build the proposed multivariate ratio estimator.

The finite population distribution function of $y$ is given by
$$F_Y(t) = N^{-1} \sum_{i \in U} \Delta(t - y_i),$$
where $\Delta(a) = 1$ if $a \ge 0$ and $\Delta(a) = 0$ otherwise.
The corresponding finite population $\beta$-quantile of $y$ is
$$Q_Y(\beta) = F_Y^{-1}(\beta) = \inf\{t : F_Y(t) \ge \beta\},$$
where $F_Y^{-1}$ is the inverse function and $0 < \beta < 1$.

The general procedure to estimate quantiles is as follows: first, an estimator of the distribution function based on sample data, $\hat{F}_Y(t)$, is obtained, and then the $\beta$-quantile is estimated by taking the inverse, i.e.,
$$\hat{Q}_Y(\beta) = \hat{F}_Y^{-1}(\beta) = \inf\{t : \hat{F}_Y(t) \ge \beta\}.$$

The natural candidate to estimate the $\beta$-quantile $Q_Y(\beta)$ based on the sample $s$ is
$$\hat{Q}_{Yn}(\beta) = \hat{F}_{Yn}^{-1}(\beta), \tag{1}$$
where $\hat{F}_{Yn}(t) = n^{-1} \sum_{i \in s} \Delta(t - y_i)$ is the sample estimator of the finite population distribution function, which coincides with the Horvitz-Thompson type estimator under simple random sampling.

Let $\hat{Q}_{X_i}(\beta)$ ($i = 1, \ldots, p$) be the sample quantiles of order $\beta$ based on the sample $s'$, that is, on the first occasion. Denote by $\hat{Q}_{X_i m}(\beta)$ and $\hat{Q}_{Ym}(\beta)$ the sample quantiles of the auxiliary and study variables based on the matched sample, and by $\hat{Q}_{Yu}(\beta)$ the sample quantile based on the unmatched sample of the current occasion.

Following the idea of Olkin (1958), we propose a multivariate double sampling ratio estimator of $Q_Y(\beta)$ based on the matched portion:
$$\hat{Q}^{MR}_{Ym}(\beta) = \sum_{1 \le i \le p} w_i \hat{Q}_{Yrim}(\beta),$$
where
$$\hat{Q}_{Yrim}(\beta) = \frac{\hat{Q}_{Ym}(\beta)}{\hat{Q}_{X_i m}(\beta)} \hat{Q}_{X_i}(\beta)$$
and the weights $w_i$ ($\sum_{1 \le i \le p} w_i = 1$) are to be determined so as to maximize the precision of $\hat{Q}^{MR}_{Ym}(\beta)$.

The variance of this estimator is
$$V(\hat{Q}^{MR}_{Ym}(\beta)) = \sum_{1 \le i \le p} w_i^2 V(\hat{Q}_{Yrim}(\beta)) + 2 \sum_{i < j} w_i w_j \mathrm{Cov}(\hat{Q}_{Yrim}(\beta), \hat{Q}_{Yrjm}(\beta)).$$
This equation can be written as $V(\hat{Q}^{MR}_{Ym}(\beta)) = w' B w$, where $w = (w_1, \ldots, w_p)'$, $B = (b_{ij})$ and $b_{ij} = \mathrm{Cov}(\hat{Q}_{Yrim}(\beta), \hat{Q}_{Yrjm}(\beta))$ for $i, j = 1, \ldots, p$.

To obtain the extremum, we make use of the generalized Cauchy-Schwarz inequality, and, since $B$ is positive semidefinite (and assumed invertible), it follows that the optimum $w$ is given by
$$w_{opt} = \frac{B^{-1} e}{e' B^{-1} e},$$
where $e = (1, \ldots, 1)'$.
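As an illustration of the steps above, the following Python sketch inverts the empirical distribution function to obtain a sample quantile, forms the $p$ ratio components and combines them with the weights $w_{opt}$. It is a minimal sketch and not the authors' R implementation: the function names are hypothetical, and the matrix $B$ is assumed to be supplied by the user (in practice it would have to be estimated).

```python
import numpy as np


def sample_quantile(values, beta):
    """Return inf{t : F_hat(t) >= beta} for the empirical distribution function."""
    v = np.sort(np.asarray(values, dtype=float))
    # Smallest k with k/n >= beta; the tolerance guards against rounding in beta * n.
    k = int(np.ceil(beta * v.size - 1e-12))
    return float(v[max(k, 1) - 1])


def optimal_weights(B):
    """Return w_opt = B^{-1} e / (e' B^{-1} e) for an invertible matrix B."""
    B = np.asarray(B, dtype=float)
    e = np.ones(B.shape[0])
    Binv_e = np.linalg.solve(B, e)  # solves B z = e without forming B^{-1}
    return Binv_e / (e @ Binv_e)


def multivariate_ratio_estimate(y_m, x_m, x_first, beta, B):
    """Multivariate ratio estimate from the matched sample.

    y_m     : study variable on the matched sample (length m)
    x_m     : auxiliary variables on the matched sample, shape (m, p)
    x_first : auxiliary variables on the first-occasion sample, shape (n', p)
    B       : (p, p) covariance matrix of the ratio components (assumed given)
    """
    x_m = np.asarray(x_m, dtype=float)
    x_first = np.asarray(x_first, dtype=float)
    q_ym = sample_quantile(y_m, beta)
    # One ratio component per auxiliary variable: Q_Ym / Q_Xim * Q_Xi.
    ratios = np.array([
        q_ym / sample_quantile(x_m[:, i], beta) * sample_quantile(x_first[:, i], beta)
        for i in range(x_m.shape[1])
    ])
    w = optimal_weights(B)
    return float(w @ ratios), w
```

With a diagonal $B$, the weights are proportional to the inverse variances of the components, so the more precise ratio component receives the larger weight.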
With these weights, the optimum multivariate ratio estimator is given by $\hat{Q}^{MR}_{Ymopt}(\beta) = w_{opt}' \hat{\mathbf{Q}}_{Yrm}(\beta)$, where $\hat{\mathbf{Q}}_{Yrm}(\beta) = (\hat{Q}_{Yr1m}(\beta), \ldots, \hat{Q}_{Yrpm}(\beta))'$.

Let us consider a composite estimator that combines the estimator $\hat{Q}^{MR}_{Ymopt}(\beta)$ based on the matched sample with the unmatched sample estimator:
$$\hat{Q}_Y(\beta) = W \hat{Q}^{MR}_{Ymopt}(\beta) + (1 - W) \hat{Q}_{Yu}(\beta), \tag{2}$$
where $W$ is a constant ($0 < W < 1$) chosen to minimize the variance of the estimator. Since the unmatched units are selected independently of the matched portion, the two components are treated as uncorrelated, and a simple calculation shows that
$$W_{opt} = \frac{V(\hat{Q}_{Yu}(\beta))}{V(\hat{Q}_{Yu}(\beta)) + V(\hat{Q}^{MR}_{Ymopt}(\beta))}. \tag{3}$$

Therefore, the optimum estimator in the class is given by
$$\hat{Q}_{Yopt}(\beta) = W_{opt} \hat{Q}^{MR}_{Ymopt}(\beta) + (1 - W_{opt}) \hat{Q}_{Yu}(\beta). \tag{4}$$
The variance of this optimum estimator is
$$V(\hat{Q}_{Yopt}(\beta)) = W_{opt}^2 V(\hat{Q}^{MR}_{Ymopt}(\beta)) + (1 - W_{opt})^2 V(\hat{Q}_{Yu}(\beta)), \tag{5}$$
which can be written as
$$V(\hat{Q}_{Yopt}(\beta)) = \frac{V(\hat{Q}_{Yu}(\beta)) \, V(\hat{Q}^{MR}_{Ymopt}(\beta))}{V(\hat{Q}_{Yu}(\beta)) + V(\hat{Q}^{MR}_{Ymopt}(\beta))}. \tag{6}$$

Thus, to calculate this variance, $V(\hat{Q}_{Yu}(\beta))$ and $V(\hat{Q}^{MR}_{Ymopt}(\beta))$ must be determined. Gross (1980) showed that an asymptotic expression for the variance of the estimator $\hat{Q}_{Yu}(\beta)$ is
$$V(\hat{Q}_{Yu}(\beta)) = \frac{N - u}{N} \beta(1 - \beta) \, u^{-1} \{f_Y(Q_Y(\beta))\}^{-2}. \tag{7}$$
$V(\hat{Q}^{MR}_{Ymopt}(\beta))$ depends on $V(\hat{Q}_{Yrim}(\beta))$ and $\mathrm{Cov}(\hat{Q}_{Yrim}(\beta), \hat{Q}_{Yrjm}(\beta))$. Asymptotic expressions for these quantities can be obtained by Taylor series expansion.
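The combination step in (2)-(7) can be sketched numerically as follows. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the variance of the matched-sample estimator is taken as given, and the density value $f_Y(Q_Y(\beta))$ is assumed to be known or estimated elsewhere.

```python
def gross_variance(N, u, beta, f_at_quantile):
    """Asymptotic variance (7): ((N - u) / N) * beta * (1 - beta) / (u * f^2)."""
    return (N - u) / N * beta * (1.0 - beta) / (u * f_at_quantile ** 2)


def composite_weight(var_matched, var_unmatched):
    """Weight W_opt in (3), attached to the matched-sample estimator."""
    return var_unmatched / (var_unmatched + var_matched)


def composite_estimate(q_matched, q_unmatched, var_matched, var_unmatched):
    """Return the composite estimate (4), its variance (6) and the weight (3)."""
    W = composite_weight(var_matched, var_unmatched)
    estimate = W * q_matched + (1.0 - W) * q_unmatched
    variance = var_matched * var_unmatched / (var_matched + var_unmatched)
    return estimate, variance, W
```

For example, with hypothetical variances 1 and 3 for the matched and unmatched components, the weight is 0.75 and the combined variance is 0.75, which agrees with expression (5): $0.75^2 \cdot 1 + 0.25^2 \cdot 3 = 0.75$. The combined variance is always smaller than either component variance.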
The resulting expressions are
$$V(\hat{Q}_{Yrim}(\beta)) = \frac{\beta(1 - \beta)}{f_Y(Q_Y(\beta))^2} \left\{ \left(\frac{1}{m} - \frac{1}{N}\right) + \left(\frac{1}{m} - \frac{1}{n'}\right) R_i \frac{f_Y(Q_Y(\beta))}{f_{X_i}(Q_{X_i}(\beta))} \times \left[ R_i \frac{f_Y(Q_Y(\beta))}{f_{X_i}(Q_{X_i}(\beta))} + 2 \left(1 - \frac{P_{11}(y, x_i)}{\beta(1 - \beta)}\right) \right] \right\}$$
and
$$\mathrm{Cov}(\hat{Q}_{Yrim}(\beta), \hat{Q}_{Yrjm}(\beta)) = \frac{\beta(1 - \beta)}{f_Y(Q_Y(\beta))^2} \left\{ \left(\frac{1}{m} - \frac{1}{N}\right) + \left(\frac{1}{n'} - \frac{1}{m}\right) R_i \frac{f_Y(Q_Y(\beta))}{f_{X_i}(Q_{X_i}(\beta))} \left(\frac{P_{11}(y, x_i)}{\beta(1 - \beta)} - 1\right) + \left(\frac{1}{n'} - \frac{1}{m}\right) R_j \frac{f_Y(Q_Y(\beta))}{f_{X_j}(Q_{X_j}(\beta))} \left(\frac{P_{11}(y, x_j)}{\beta(1 - \beta)} - 1\right) - \left(\frac{1}{n'} - \frac{1}{m}\right) R_j R_i \frac{f_Y(Q_Y(\beta))^2}{f_{X_j}(Q_{X_j}(\beta)) f_{X_i}(Q_{X_i}(\beta))} \left(\frac{P_{11}(x_i, x_j)}{\beta(1 - \beta)} - 1\right) \right\},$$
where $P_{11}(y, x_i)$ denotes the proportion of values in the population for which $y \le Q_Y(\beta)$ and $x_i \le Q_{X_i}(\beta)$, and $R_i = Q_Y(\beta)/Q_{X_i}(\beta)$.

3 Monte Carlo simulations

In the previous section we defined an optimum estimator within the class (2). The variance of this estimator is also available. The next step is to measure the accuracy of this estimator via an empirical study. For this purpose, we use the Counties population, which is composed of the natural populations Counties60 and Counties70. These populations have been widely used in finite population sampling (for example, Royall and Cumberland, 1981, and Valliant et al., 2000). They consist of $N = 304$ counties in North Carolina, South Carolina and Georgia (U.S.). The variable of interest, $y$, is the population in 1970 (Counties70 population), whereas the auxiliary variables are $x_1$, the population in 1960 (Counties60 population), and $x_2$, the number of households in 1960 (Counties60 population). We therefore use $p = 2$ auxiliary variables.

The precision of the estimator $\hat{Q}_{Yopt}(\beta)$ is compared with that of the standard estimator $\hat{Q}_{Yn}(\beta)$ given in (1) and with that of the estimator $\hat{Q}_{Yopt}(\beta)$ when only $p = 1$ auxiliary variable is used. In this way, the gain in precision obtained by increasing the number of auxiliary variables in the estimation stage can be observed.

We generated $B = 1000$ independent samples under sampling on two occasions. All the samples were obtained under simple random sampling without replacement.
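One replication of the two-occasion design described in Section 2 might be drawn as in the sketch below. This is an assumption-laden illustration rather than the authors' code: the function name is hypothetical, and the unmatched units are assumed to be drawn from the $N - n'$ units not selected on the first occasion, which is one way to select them independently of the matched portion.

```python
import numpy as np


def draw_two_occasion_sample(N, n_first, n, m, rng):
    """Return index arrays (s_first, s_m, s_u) for one replication.

    s_first : first-occasion sample of size n_first (n' in the text)
    s_m     : matched subsample of size m drawn from s_first
    s_u     : unmatched sample of size u = n - m, drawn (by assumption)
              from the units outside s_first
    """
    u = n - m
    if not (0 <= m <= n and m <= n_first and n_first + u <= N):
        raise ValueError("inconsistent sample sizes")
    perm = rng.permutation(N)                 # random ordering of the population
    s_first = perm[:n_first]                  # SRSWOR of size n'
    s_m = rng.choice(s_first, size=m, replace=False)  # SRSWOR within s'
    s_u = perm[n_first:n_first + u]           # SRSWOR from the remaining units
    return s_first, s_m, s_u
```

For instance, `draw_two_occasion_sample(304, 100, 50, 20, np.random.default_rng(0))` corresponds to $N = 304$, $n' = 100$, $n = 50$ and a matched fraction $\chi = 0.4$; the seed and the value of $m$ are arbitrary choices for illustration.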
The performance of the estimators was evaluated for $\beta = 0.5$ in terms of Relative Bias (RB) and Relative Efficiency (RE), with
$$RB = \frac{1}{B} \sum_{b=1}^{B} \frac{\hat{Q}_{Yopt}(\beta)_b - Q_Y(\beta)}{Q_Y(\beta)}; \qquad RE = \frac{MSE[\hat{Q}_{Yopt}(\beta)]}{MSE[\hat{Q}_{Yn}(\beta)]},$$
where $b$ indexes the $b$th simulation run, the empirical Mean Square Error is given by $MSE[\hat{Q}_{Yopt}(\beta)] = B^{-1} \sum_{b=1}^{B} [\hat{Q}_{Yopt}(\beta)_b - Q_Y(\beta)]^2$, and $MSE[\hat{Q}_{Yn}(\beta)]$ is similarly defined for $\hat{Q}_{Yn}(\beta)$. Different values of $n'$, $n$ and $m$ are considered. Random number generation, calculations and all the estimators were carried out in R. Programming details are available from the authors.

Figure 1 shows the RE for the Counties population, $\beta = 0.5$ and different values of $n'$, $n$ and $m$. In this figure we observe that the proposed optimum estimator with $p = 2$ auxiliary variables is more efficient in terms of RE than both the standard estimator and the proposed optimum estimator with $p = 1$ auxiliary variable; that is, better estimates are obtained when all the auxiliary variables are used in the estimation stage. As far as the RB values are concerned, these are all within a reasonable range and so they are omitted.

In short, the proposed estimator presents good empirical properties due to an efficient use of auxiliary information. This suggests that the proposed class is an attractive alternative for use in successive sampling. Finally, note that the proposed estimator is simple to compute and that an asymptotic expression for its variance is also available.

Fig. 1. RE for Counties70 population. β = 0.5. [Six panels plotting RE against the matched fraction χ, for (n', n) = (100, 50), (100, 100), (75, 75), (75, 25), (50, 50) and (50, 25). Curves: standard estimator; proposed optimum estimator, p = 2; proposed optimum estimator, p = 1, x2; proposed optimum estimator, p = 1, x1.]

References

1. Adhvaryu, D.: Successive sampling using multi-auxiliary information. Sankhya, 40, 167–173 (1978).
2. Allen, J., Singh, H.P., Singh, S., Smarandache, F.: A general class of estimators of population median using two auxiliary variables in double sampling. INTERSTAT (2002).
3. Arnab, R., Okafor, F.C.: A note on double sampling over two occasions. Pakistan Journal of Statistics, 8, 9–18 (1992).
4. Chambers, R.L., Dunstan, R.: Estimating distribution functions from survey data. Biometrika, 73, 597–604 (1986).
5. Eckler, A.R.: Rotation sampling. The Annals of Mathematical Statistics, 26, 664–685 (1955).
6. Gordon, L.: Successive sampling in finite populations. The Annals of Statistics, 11, 702–706 (1983).
7. Gross, S.T.: Median estimation in sample survey. Proc. Surv. Res. Meth. Sect. Amer. Statist. Ass., 181–184 (1980).
8. Jessen, R.J.: Statistical investigation of a sample survey for obtaining farm facts. Iowa Agricultural Experiment Statistical Research Bulletin, 304 (1942).
9. Kuk, A.Y.C., Mak, T.K.: Median estimation in the presence of auxiliary information. Journal of the Royal Statistical Society B, 1, 261–269 (1989).
10. Mak, T.K., Kuk, A.Y.C.: A new method for estimating finite-population quantiles using auxiliary information. The Canadian Journal of Statistics, 25, 29–38 (1993).
11. Narain, R.D.: On the recurrence formula in sampling on successive occasions. Journal of the Indian Society of Agricultural Statistics, 5, 96–99 (1953).
12. Olkin, I.: Multivariate ratio estimation for finite population. Biometrika, 45, 154–165 (1958).
13. Patterson, H.D.: Sampling on successive occasions with partial replacement of units.
Journal of the Royal Statistical Society B, 12, 241–255 (1950).
14. Royall, R.M., Cumberland, W.G.: An empirical study of the ratio estimator and estimators of its variance. Journal of the American Statistical Association, 76, 66–77 (1981).
15. Rao, J.N.K., Kovar, J.G., Mantel, H.J.: On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika, 77, 365–375 (1990).
16. Rueda, M., Arcos, A., Artés, E.: Quantile interval estimation in finite population using a multivariate ratio estimator. Metrika, 47, 203–213 (1998).
17. Sedransk, J., Meyer, J.: Confidence intervals for the quantiles of a finite population: simple random and stratified simple random sampling. Journal of the Royal Statistical Society B, 40, 239–252 (1978).
18. Sedransk, J., Smith, P.J.: Inference for finite population quantiles. In: Krishnaiah, P.R., Rao, C.R. (eds.) Handbook of Statistics, Vol. 6, Chap. 11, 267–289. North-Holland (1988).
19. Singh, S., Joarder, A.H., Tracy, D.S.: Median estimation using double sampling. Aust. N. Z. J. Stat., 43, 33–46 (2001).
20. Singh, H.P., Singh, H.P., Singh, V.P.: A generalized efficient class of estimators of population mean in two phase and successive sampling. Inter. J. Mgmt. Syst., 8, 173–183 (1992).
21. Singh, S., Srivastava, A.K.: Use of auxiliary information in two stage successive sampling. Journal of Indian Society of Agricultural Statistic, 25, 101–104 (1973).
22. Singh, S.: Advanced sampling theory with applications: How Michael "selected" Amy, pp. 1–1247. Kluwer Academic Publishers, The Netherlands (2003).
23. Valliant, R., Dorfman, A.H., Royall, R.M.: Finite population sampling and inference: A prediction approach. Wiley Series in Probability and Statistics, Survey Methodology Section. John Wiley and Sons, Inc., New York (2000).