Model-calibration method in the distribution function's estimation

Sergio Martínez¹, María del Mar Rueda², Helena Martínez³, Ismael Sánchez-Borrego⁴, Silvia González⁵ and Juan F. Muñoz⁶

¹ Department of Statistics and Applied Mathematics. University of Almería. Spain. spuertas@ual.es
² Department of Statistics and Operational Research. University of Granada. Spain. mrueda@ugr.es
³ Department of Statistics and Operational Research. University of Murcia. Spain. helenamp@um.es
⁴ Department of Statistics and Operational Research. University of Granada. Spain. ismasb@ugr.es
⁵ Department of Statistics and Operational Research. University of Jaén. Spain. sgonza@ujaen.es
⁶ Department of Statistics and Operational Research. University of Granada. Spain. jfmunoz@ugr.es

1 Introduction

In sample surveys, auxiliary population information is often used at the estimation stage to increase the precision of estimators of a population total, mean or distribution function. To incorporate the auxiliary information in the estimation of the distribution function, we use the calibration method [Deville92] and obtain a new estimator of the distribution function. We study the principal properties of this new estimator and show that, under some conditions, the calibrated estimator is itself a distribution function. Most estimators that use auxiliary information to estimate the distribution function do not have this property. Finally, a simulation study compares the precision of the new estimator with that of the usual estimators.

2 The Proposed Estimator

Consider a finite population $U = \{1, \ldots, k, \ldots, N\}$ consisting of $N$ different elements. Let $s = \{1, \ldots, n\}$ be the set of $n$ units included in a sample, selected according to a specified sampling design with inclusion probabilities $\pi_k$ and $\pi_{kl}$, assumed to be strictly positive.
Let $y_k$ be the value of the study variable $y$ for the $k$th population element, with which is also associated an auxiliary vector value $x_k = (x_{k1}, x_{k2}, \ldots, x_{kJ})^{\prime}$. The values $x_1, x_2, \ldots, x_N$ are known for the entire population, but $y_k$ is known only if the $k$th unit is selected in the sample $s$. The finite population distribution function of the study variable $y$ is given by

\[ F_y(t) = \frac{1}{N} \sum_{k \in U} \Delta(t - y_k) \quad \text{with} \quad \Delta(t - y_k) = \begin{cases} 0 & \text{if } t < y_k \\ 1 & \text{if } t \geq y_k \end{cases} \]

Following [Rueda06], the distribution function $F_y(t)$ can be estimated by the calibration estimator, defined by

\[ \widehat{F}_{yc}(t) = \frac{1}{N} \sum_{k \in s} \omega_k \Delta(t - y_k) \]

where the calibration weights $\omega_k$ are chosen to minimize the chi-square distance with respect to the basic design weights $d_k = 1/\pi_k$,

\[ \Phi_s = \sum_{k \in s} \frac{(\omega_k - d_k)^2}{d_k q_k} \tag{1} \]

with $q_k$ known positive constants unrelated to $d_k$, subject to the calibration equations

\[ \frac{1}{N} \sum_{k \in s} \omega_k \Delta(\mathbf{t} - g_k) = F_g(\mathbf{t}) \tag{2} \]

where

\[ \mathbf{t} = (t_1, \ldots, t_P)^{\prime}; \quad \Delta(\mathbf{t} - g_k) = \left( \Delta(t_1 - g_k), \ldots, \Delta(t_P - g_k) \right)^{\prime}; \]
\[ F_g(\mathbf{t}) = \left( F_g(t_1), \ldots, F_g(t_P) \right)^{\prime}; \quad \widehat{F}_{GH}(\mathbf{t}) = \left( \widehat{F}_{GH}(t_1), \ldots, \widehat{F}_{GH}(t_P) \right)^{\prime} \]

with $t_j$, $j = 1, 2, \ldots, P$, points that we choose arbitrarily, assuming that

\[ t_1 < t_2 < \ldots < t_P \tag{3} \]

$F_g$ is the distribution function of the pseudo-variable $g$, and $\widehat{F}_{GH}(\mathbf{t})$ denotes the Horvitz-Thompson estimator of the distribution function of $g$, where $g_k = \widehat{\beta}^{\prime} x_k$ for $k = 1, 2, \ldots, N$, and

\[ \widehat{\beta} = \left( \sum_{k \in s} d_k q_k x_k x_k^{\prime} \right)^{-1} \cdot \sum_{k \in s} d_k q_k x_k y_k \tag{4} \]

is a weighted estimator of the multiple regression coefficient $\beta$ between $y$ and $x$.
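The quantities defined so far can be illustrated numerically. The following is a minimal Python sketch, assuming a single auxiliary variable ($J = 1$), simple random sampling without replacement and $q_k = 1$. The toy data and the helper names `delta`, `ht_cdf` and `beta_hat` are illustrative assumptions and do not come from the original study.

```python
# Minimal sketch with hypothetical toy data; scalar auxiliary variable (J = 1).

def delta(t, v):
    """Indicator Delta(t - v): 1 if t >= v, else 0."""
    return 1.0 if t >= v else 0.0

def ht_cdf(t, values, d, N):
    """Horvitz-Thompson estimate (1/N) * sum_k d_k * Delta(t - v_k) over the sample."""
    return sum(dk * delta(t, v) for dk, v in zip(d, values)) / N

def beta_hat(x, y, d, q):
    """Equation (4) for scalar x_k: sum(d q x y) / sum(d q x^2)."""
    num = sum(dk * qk * xk * yk for dk, qk, xk, yk in zip(d, q, x, y))
    den = sum(dk * qk * xk * xk for dk, qk, xk in zip(d, q, x))
    return num / den

# Toy population of N = 8 units; x is known for every unit.
x_pop = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
N = len(x_pop)

# Hypothetical sample of n = 4 units (indices into the population).
sample = [0, 2, 5, 7]
y_s = [2.1, 5.9, 12.2, 15.8]        # y is observed only on the sample
x_s = [x_pop[k] for k in sample]
n = len(sample)
d = [N / n] * n                      # d_k = 1 / pi_k with pi_k = n / N (SRSWOR)
q = [1.0] * n                        # assumed choice q_k = 1

b = beta_hat(x_s, y_s, d, q)
g_pop = [b * xk for xk in x_pop]     # pseudo-variable g_k for all k in U
g_s = [g_pop[k] for k in sample]

t = 8.0
F_g = sum(delta(t, gk) for gk in g_pop) / N   # known population value F_g(t)
F_gH = ht_cdf(t, g_s, d, N)                   # Horvitz-Thompson estimate of F_g(t)
F_yH = ht_cdf(t, y_s, d, N)                   # Horvitz-Thompson estimate of F_y(t)
```

Because $g_k$ can be computed for every population unit, $F_g(t)$ is known exactly, whereas $F_y(t)$ can only be estimated from the sample; this is what makes $g$ usable in the calibration equations (2).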
The calibration estimator is given by

\[ \widehat{F}_{yc}(t) = \widehat{F}_{YH}(t) + \left( F_g(\mathbf{t}) - \widehat{F}_{GH}(\mathbf{t}) \right)^{\prime} \cdot \widehat{D} \tag{5} \]

where $\widehat{F}_{YH}(t)$ denotes the Horvitz-Thompson estimator of the distribution function of $y$ and

\[ \widehat{D} = T^{-1} \cdot \sum_{k \in s} d_k q_k \Delta(\mathbf{t} - g_k) \Delta(t - y_k) \]

assuming that the inverse of the symmetric matrix

\[ T = \sum_{k \in s} d_k q_k \Delta(\mathbf{t} - g_k) \Delta(\mathbf{t} - g_k)^{\prime} \]

exists.

Due to the conditions (2), the calibration estimator $\widehat{F}_{yc}(t)$ gives perfect estimates of the distribution function $F_g(t)$ at the points $\mathbf{t}$, but these conditions implicitly assume that the study variable $y$ and the auxiliary vector $x$ are linearly related. In fact, if $y_k = \alpha^{\prime} x_k$ for all $k \in U$, then $\widehat{\beta} = \alpha$ and the pseudo-variable $g$ coincides with $y$ for all $k \in U$. Consequently, the calibration estimator $\widehat{F}_{yc}(t)$ gives perfect estimates of $F_y$ evaluated at $\mathbf{t}$.

Therefore, if the relation between $y$ and $x$ is not linear, the pseudo-variable $g_k$ and the conditions (2) are inadequate, and it is necessary to adapt these conditions to nonlinear models. Thus, we assume that the relationship between $y$ and $x$ can be described by the following superpopulation model proposed by [Wu01] (the linear or nonlinear regression model)

\[ y_k = \mu(x_k, \theta) + \nu_k \varepsilon_k, \quad k = 1, 2, \ldots, N \tag{6} \]

where $\theta = (\theta_0, \ldots, \theta_J)^{\prime}$ and $\sigma^2$ are unknown superpopulation parameters, $\mu(x_k, \theta)$ is a known function of $x$ and $\theta$, $\nu_k = \nu(x_k)$ is a strictly positive known function of $x_k$, and the $\varepsilon_k$ are independently and identically distributed random variables with $E_\xi(\varepsilon_k) = 0$ and $V_\xi(\varepsilon_k) = \sigma^2$, where $E_\xi$ and $V_\xi$ denote the expectation and variance with respect to the superpopulation model. We also assume that $(y_1, x_1), \ldots, (y_N, x_N)$ are mutually independent.

Under model (6), the auxiliary information provided by the vector $x$ should be used in the definition of the auxiliary pseudo-variable $\widehat{\mu}_k = \mu(x_k, \widehat{\theta})$ for all $k \in U$, where $\widehat{\theta}$ is an estimate of the superpopulation parameters $\theta$ (to estimate the model parameters $\theta$ we can follow [Wu01]).

Now, we consider the calibration estimator

\[ \widehat{F}_{ymc}(t) = \frac{1}{N} \sum_{k \in s} \omega_k \Delta(t - y_k) \tag{7} \]

where the calibrated weights $\omega_k$ are modified from $d_k$ by minimizing the chi-square distance (1), subject to the following conditions

\[ \frac{1}{N} \sum_{k \in s} \omega_k \Delta(\mathbf{t} - \widehat{\mu}_k) = F_{\widehat{\mu}}(\mathbf{t}) \tag{8} \]

with $\mathbf{t} = (t_1, \ldots, t_P)^{\prime}$, where $t_j$, $j = 1, 2, \ldots, P$, are arbitrarily chosen points that satisfy the condition (3).

The resulting calibration weights follow directly from minimizing (1) subject to (8) using a Lagrange multiplier approach, and are given by

\[ \omega_k = d_k + d_k q_k N \left( F_{\widehat{\mu}}(\mathbf{t}) - \widehat{F}_{\widehat{\mu}H}(\mathbf{t}) \right)^{\prime} \cdot T_\mu^{-1} \cdot \Delta(\mathbf{t} - \widehat{\mu}_k) \tag{9} \]

assuming that the inverse of

\[ T_\mu = \sum_{k \in s} d_k q_k \Delta(\mathbf{t} - \widehat{\mu}_k) \Delta(\mathbf{t} - \widehat{\mu}_k)^{\prime} \]

exists. The model-calibration estimator obtained is

\[ \widehat{F}_{ymc}(t) = \frac{1}{N} \sum_{k \in s} \omega_k \Delta(t - y_k) = \widehat{F}_{YH}(t) + \left( F_{\widehat{\mu}}(\mathbf{t}) - \widehat{F}_{\widehat{\mu}H}(\mathbf{t}) \right)^{\prime} \cdot \widehat{D}_\mu \tag{10} \]

where

\[ \widehat{D}_\mu = T_\mu^{-1} \cdot \sum_{k \in s} d_k q_k \Delta(\mathbf{t} - \widehat{\mu}_k) \Delta(t - y_k) \]

If $T_\mu$ is singular, the calibration process has no solution and we take $\widehat{F}_{ymc}(t) = \widehat{F}_{YH}(t)$.

When the estimator $\widehat{F}_{ymc}(t)$ is applied to the estimation of the distribution function of the pseudo-variable $\widehat{\mu}$ evaluated over the set of points $t_j$, $j = 1, 2, \ldots, P$, it coincides with the values $F_{\widehat{\mu}}(t_j)$, $j = 1, \ldots, P$. If the relation between $y$ and $x$ is linear, the estimator $\widehat{F}_{ymc}(t)$ coincides with the estimator proposed in [Rueda06], and consequently the estimator $\widehat{F}_{yc}(t)$ is a particular case of $\widehat{F}_{ymc}(t)$.

To study the conditions that guarantee the existence of $T_\mu^{-1}$, we consider the $\widehat{\mu}$ values of the sample units in ascending order, $\widehat{\mu}_{(1)} \leq \widehat{\mu}_{(2)} \leq \ldots \leq \widehat{\mu}_{(n-1)} \leq \widehat{\mu}_{(n)}$. Following [Rueda06], if we suppose that the value $t_i$ is bigger than the first $k_i$ sample values of the pseudo-variable $\widehat{\mu}$, with $k_i > k_{i-1}$ for $i = 2, \ldots, P$, $k_1 > 0$ and $k_P \leq n$, then $T_\mu$ is nonsingular and $T_\mu^{-1}$ is a $P \times P$ symmetric matrix of the form

\[ T_\mu^{-1} = \begin{pmatrix}
a_{11} & a_{12} & 0 & \cdots & 0 & 0 \\
a_{21} & a_{22} & a_{23} & \cdots & 0 & 0 \\
0 & a_{32} & a_{33} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & a_{P-1\,P-1} & a_{P-1\,P} \\
0 & 0 & 0 & \cdots & a_{P\,P-1} & a_{PP}
\end{pmatrix} \tag{11} \]

where the $a_{ij}$ are given by

\[ a_{11} = \frac{1}{\sum_{k=1}^{k_1} d_{(k)} q_{(k)}} + \frac{1}{\sum_{k=k_1+1}^{k_2} d_{(k)} q_{(k)}}; \qquad a_{PP} = \frac{1}{\sum_{k=k_{P-1}+1}^{k_P} d_{(k)} q_{(k)}} \]

\[ a_{i,i} = \frac{1}{\sum_{k=k_{i-1}+1}^{k_i} d_{(k)} q_{(k)}} + \frac{1}{\sum_{k=k_i+1}^{k_{i+1}} d_{(k)} q_{(k)}} \quad (i = 2, \ldots, P-1); \qquad a_{i,i+1} = \frac{-1}{\sum_{k=k_i+1}^{k_{i+1}} d_{(k)} q_{(k)}} \quad (i = 1, \ldots, P-1) \]

3 Properties of $\widehat{F}_{ymc}(t)$

In this section we study some properties of $\widehat{F}_{ymc}(t)$.

3.1 Asymptotic Properties

We begin with the asymptotic behaviour, which is summarized in the following theorems.

Theorem 1. When the estimate $\widehat{\theta}$ is replaced by $\theta$, the vector of superpopulation parameters of model (6), the asymptotic behaviour of $\widehat{F}_{ymc}(t)$ is the same as that of the estimator

\[ \widehat{F}_{y\mu}(t) = \widehat{F}_{YH}(t) + \left( F_\mu(\mathbf{t}) - \widehat{F}_{\mu H}(\mathbf{t}) \right)^{\prime} \widehat{D} \tag{12} \]

with

\[ \widehat{D} = \left( \sum_{k \in s} d_k \Delta(\mathbf{t} - \mu_k) \Delta(\mathbf{t} - \mu_k)^{\prime} \right)^{-1} \sum_{k \in s} d_k \Delta(\mathbf{t} - \mu_k) \Delta(t - y_k) \]

where $\mu_k = \mu(x_k, \theta)$, $F_\mu(\mathbf{t}) = (F_\mu(t_1), \ldots, F_\mu(t_P))^{\prime}$, $F_\mu(t_j)$ for $j = 1, \ldots, P$ denotes the distribution function of the variable $\mu$ at the point $t_j$, and $\widehat{F}_{\mu H}(\mathbf{t}) = (\widehat{F}_{\mu H}(t_1), \ldots, \widehat{F}_{\mu H}(t_P))^{\prime}$, with $\widehat{F}_{\mu H}(t_j)$ denoting the Horvitz-Thompson estimator of the distribution function of $\mu$ at the point $t_j$.

Theorem 2. The asymptotic behaviour of the estimator $\widehat{F}_{y\mu}(t)$ is the same as that of

\[ \widehat{F}_{YH}(t) + \left( F_\mu(\mathbf{t}) - \widehat{F}_{\mu H}(\mathbf{t}) \right)^{\prime} D \]

with

\[ D = \left( \sum_{k \in U} \Delta(\mathbf{t} - \mu_k) \Delta(\mathbf{t} - \mu_k)^{\prime} \right)^{-1} \sum_{k \in U} \Delta(\mathbf{t} - \mu_k) \Delta(t - y_k) \]

Therefore $\widehat{F}_{ymc}(t)$ is asymptotically normal and asymptotically unbiased, and its asymptotic variance is given by

\[ AV(\widehat{F}_{ymc}(t)) = \frac{1}{N^2} \sum_{k \in U} \sum_{l \in U} \Delta_{kl} (d_k E_k)(d_l E_l) \tag{13} \]

where $\Delta_{kl} = \pi_{kl} - \pi_k \pi_l$ and $E_k = \Delta(t - y_k) - \Delta(\mathbf{t} - \mu_k)^{\prime} D$.
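To make the double sum in (13) concrete, the sketch below evaluates it for a hypothetical toy population under simple random sampling without replacement, using a single calibration point ($P = 1$) so that $D$ is a scalar. The data, the point `t` and the function names are illustrative assumptions. Under this design the double sum should agree with the textbook expression $(1 - n/N)\, S_E^2 / n$, where $S_E^2$ is the population variance of the residuals $E_k$ with divisor $N - 1$, and the sketch uses that as a sanity check.

```python
# Minimal sketch of the asymptotic variance (13) under SRSWOR with P = 1.
# All data below are hypothetical; mu_pop plays the role of mu(x_k, theta).

def delta(t, v):
    """Indicator Delta(t - v): 1 if t >= v, else 0."""
    return 1.0 if t >= v else 0.0

def asymptotic_variance_srswor(E, n):
    """Evaluate (1/N^2) * sum_k sum_l Delta_kl (d_k E_k)(d_l E_l) for SRSWOR."""
    N = len(E)
    pi1 = n / N                               # first-order inclusion probability
    pi2 = n * (n - 1) / (N * (N - 1))         # second-order inclusion probability
    d = 1.0 / pi1                             # design weight d_k
    total = 0.0
    for k in range(N):
        for l in range(N):
            # Delta_kl = pi_kl - pi_k * pi_l, with pi_kk = pi_k on the diagonal
            delta_kl = pi1 * (1.0 - pi1) if k == l else pi2 - pi1 * pi1
            total += delta_kl * (d * E[k]) * (d * E[l])
    return total / N ** 2

mu_pop = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_pop = [1.2, 4.0, 3.5, 3.0, 5.1, 6.3]
N, n, t = len(y_pop), 3, 3.6

ind_mu = [delta(t, m) for m in mu_pop]
ind_y = [delta(t, v) for v in y_pop]

# Scalar version of D in Theorem 2 (population-level regression of indicators).
D = sum(a * b for a, b in zip(ind_mu, ind_y)) / sum(a * a for a in ind_mu)

# Residuals E_k = Delta(t - y_k) - Delta(t - mu_k) * D.
E = [iy - im * D for iy, im in zip(ind_y, ind_mu)]

av = asymptotic_variance_srswor(E, n)

# Closed form for SRSWOR, used only as a cross-check of the double sum.
E_bar = sum(E) / N
S2_E = sum((e - E_bar) ** 2 for e in E) / (N - 1)
av_closed_form = (1.0 - n / N) * S2_E / n
```

The variance depends on $y$ only through the residuals $E_k$, so the closer the indicators $\Delta(t - \mu_k)$ track $\Delta(t - y_k)$, the smaller (13) becomes.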
To estimate the population value (13), following [Deville92], we can employ the following estimator:

\[ \widehat{V}(\widehat{F}_{ymc}(t)) = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \frac{\Delta_{kl}}{\pi_{kl}} (\omega_k e_k)(\omega_l e_l) \tag{14} \]

where $\omega_k$ are the calibrated weights (9) and $e_k = \Delta(t - y_k) - \Delta(\mathbf{t} - \widehat{\mu}_k)^{\prime} \widehat{D}_\omega$ with

\[ \widehat{D}_\omega = \left( \sum_{k \in s} \omega_k q_k \Delta(\mathbf{t} - \widehat{\mu}_k) \Delta(\mathbf{t} - \widehat{\mu}_k)^{\prime} \right)^{-1} \sum_{k \in s} \omega_k q_k \Delta(\mathbf{t} - \widehat{\mu}_k) \Delta(t - y_k) \]

3.2 Distribution Function's Properties

Next we determine whether $\widehat{F}_{ymc}(t)$ is a distribution function. For this, we have to verify that the following conditions are satisfied:

i. $\widehat{F}_{ymc}(t)$ is continuous on the right.
ii. $\widehat{F}_{ymc}(t)$ is monotone nondecreasing.
iii. $\lim_{t \to -\infty} \widehat{F}_{ymc}(t) = 0$ and $\lim_{t \to +\infty} \widehat{F}_{ymc}(t) = 1$.

It is easy to verify that condition i) is satisfied. On the other hand, $\widehat{F}_{ymc}(t)$ is not monotone nondecreasing in general. With respect to the third condition, we have

\[ \lim_{t \to -\infty} \widehat{F}_{ymc}(t) = 0 \]

but

\[ \lim_{t \to +\infty} \widehat{F}_{ymc}(t) = F_{\widehat{\mu}}(t_P) + \frac{1}{N} \sum_{k=k_P+1}^{n} d_{(k)} \]

and this value is not equal to 1 in general. Because conditions ii) and iii) are not satisfied, the estimator $\widehat{F}_{ymc}(t)$ is not a distribution function in general.

Since $\widehat{F}_{ymc}(t)$ is a calibration estimator, it is monotone nondecreasing if the weights $\omega_k$ are nonnegative for all sample units. The choice $q_k = c$ for all sample units guarantees that $\omega_k \geq 0$ for all $k \in s$ and consequently guarantees that the estimator $\widehat{F}_{ymc}(t)$ is monotone nondecreasing (see [Rueda06]).

The condition $\lim_{t \to +\infty} \widehat{F}_{ymc}(t) = 1$ is easily seen to be equivalent to the condition

\[ \frac{1}{N} \sum_{k=1}^{n} \omega_k = 1 \tag{15} \]

To include condition (15) among the calibration conditions (8), it is sufficient to take the value $t_P$ sufficiently large, that is, a value $t_P$ that guarantees $F_{\widehat{\mu}}(t_P) = 1$ (we can take $t_P = \max_{k \in U} \widehat{\mu}_k$, see [Rueda06]).

Therefore, if we add the conditions:

(a). $q_k = c$ for all $k \in U$.
(b). $t_P = \max_{k \in U} \widehat{\mu}_k$.

the estimator $\widehat{F}_{ymc}(t)$ is a distribution function.
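The effect of conditions (a) and (b) can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: it computes the weights (9) using the closed-form tridiagonal inverse (11) and evaluates the estimator (7) on hypothetical toy data. All function names and data values are assumptions chosen for illustration, and the pseudo-variable values are taken as given rather than fitted from a model.

```python
# Minimal sketch of the model-calibration weights (9) via the inverse (11).
# Toy data and helper names are hypothetical.

def delta_vec(t, v):
    """Vector Delta(t - v) = (Delta(t_1 - v), ..., Delta(t_P - v))."""
    return [1.0 if tj >= v else 0.0 for tj in t]

def tridiag_inverse(c):
    """Closed form (11). c[i] is the sum of d_k q_k over sample units with
    t_{i-1} < mu_k <= t_i; a zero entry means T_mu is singular."""
    P = len(c)
    A = [[0.0] * P for _ in range(P)]
    for i in range(P):
        A[i][i] = 1.0 / c[i] + (1.0 / c[i + 1] if i + 1 < P else 0.0)
        if i + 1 < P:
            A[i][i + 1] = A[i + 1][i] = -1.0 / c[i + 1]
    return A

def calibration_weights(mu_s, d, q, t, F_mu, N):
    """w_k = d_k + d_k q_k N (F_mu - F_muH)' T_mu^{-1} Delta(t - mu_k)."""
    P = len(t)
    D = [delta_vec(t, m) for m in mu_s]
    c = [0.0] * P
    for dk, qk, Dk in zip(d, q, D):
        ones = int(sum(Dk))          # number of points t_j with t_j >= mu_k
        if ones > 0:
            c[P - ones] += dk * qk   # unit falls in the interval ending at t_{P-ones}
    A = tridiag_inverse(c)
    F_H = [sum(dk * Dk[j] for dk, Dk in zip(d, D)) / N for j in range(P)]
    diff = [N * (F_mu[j] - F_H[j]) for j in range(P)]
    lam = [sum(diff[i] * A[i][j] for i in range(P)) for j in range(P)]
    return [dk + dk * qk * sum(l * v for l, v in zip(lam, Dk))
            for dk, qk, Dk in zip(d, q, D)]

def f_ymc(t0, y_s, w, N):
    """Estimator (7) evaluated at the scalar point t0."""
    return sum(wk for wk, yk in zip(w, y_s) if t0 >= yk) / N

def F_mu_at(t, mu_pop):
    """Population distribution function of the pseudo-variable at each t_j."""
    return [sum(1.0 for m in mu_pop if tj >= m) / len(mu_pop) for tj in t]

mu_pop = [float(v) for v in range(1, 11)]   # pseudo-variable known for all of U
N = len(mu_pop)
sample = [0, 2, 4, 5, 8]
mu_s = [mu_pop[k] for k in sample]
y_s = [2.0, 1.5, 4.0, 7.0, 5.5]
n = len(sample)
d = [N / n] * n                             # SRSWOR design weights
q = [1.0] * n                               # condition (a): q_k = c

t_b = [3.5, 7.5, max(mu_pop)]               # condition (b): t_P = max mu_k
w_b = calibration_weights(mu_s, d, q, t_b, F_mu_at(t_b, mu_pop), N)

t_short = [3.5, 7.5]                        # condition (b) dropped
w_short = calibration_weights(mu_s, d, q, t_short, F_mu_at(t_short, mu_pop), N)
```

In this toy setting the weights obtained under (a) and (b) are nonnegative and sum to $N$, so the estimator is nondecreasing and reaches 1. When $t_P$ is below the largest pseudo-variable value, the weights sum to $N \left( F_{\widehat{\mu}}(t_P) + \frac{1}{N} \sum_{k > k_P} d_{(k)} \right)$, which here falls short of $N$, in line with the limit given in Section 3.2.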
4 Simulation Study

In this section we compare the precision of the estimator $\widehat{F}_{ymc}(t)$ with that of the following estimators: the Chambers-Dunstan estimator $\widehat{F}_{CD}(t)$ [Cham86], the Rao-Kovar-Mantel estimator $\widehat{F}_{RKM}(t)$ [Rao90], the poststratified estimator $\widehat{F}_{ps}(t)$ [Silva95], the difference estimator $\widehat{F}_{d}(t)$ and the ratio estimator $\widehat{F}_{R}(t)$.

The empirical study has been carried out with a simulated population called YQUAD01 with $N = 1000$. The model assumed between $y$ and $x$ is given by

\[ y_k = 1 + 2 \cdot (x_k - 0.5)^2 + \varepsilon_k \]

where $x_k \sim \mathrm{unif}(0, 1)$ and $\varepsilon_k \sim N(0, (0.1)^2)$. Therefore, in the calibration condition (8) the function $\mu(x_k, \theta)$ is given by $\mu(x_k, \theta) = 1 + 2 \cdot (x_k - 0.5)^2$.

In the construction of the calibration estimator $\widehat{F}_{ymc}(t)$ we choose the vector $\mathbf{t} = (t_1, t_2, t_3, t_4)$ where

\[ t_1 = Q_{\widehat{\mu}}(0.25); \quad t_2 = Q_{\widehat{\mu}}(0.5); \quad t_3 = Q_{\widehat{\mu}}(0.75); \quad t_4 = \max_{k \in U} \widehat{\mu}_k \]

The symbol $Q$ denotes a quantile.

We selected 1000 samples for each of three sample sizes, $n = 50$, $n = 100$ and $n = 150$, under simple random sampling without replacement (SRSWOR), and for every estimator we calculated estimates of the distribution function of the study variable $F_y$ at the points $Q_y(0.25)$, $Q_y(0.5)$ and $Q_y(0.75)$. Thus, every estimator yields 1000 estimates at each selected point. With these estimates we calculate the relative bias (RB) and the relative root mean square error (RMSE)

\[ RB(t) = \frac{1}{B} \sum_{b=1}^{B} \frac{\widehat{F}(t)_b - F_y(t)}{F_y(t)}; \qquad RE(t) = \frac{MSE[\widehat{F}(t)]}{MSE[\widehat{F}_{HT}(t)]} \tag{16} \]

where $\widehat{F}(t)_b$ denotes the estimate computed from the $b$th sample.

Table 1. Sample size n = 50

Estimator Qy(0.25)                    Qy(0.50)                    Qy(0.75)
          RB              RMSE        RB              RMSE        RB              RMSE
F_CD      -0.00651912     0.225207    -0.003604693    0.1324281   -0.01324536     0.0771873
F_d       -0.008252       0.3868730   -0.006117333    0.1995568   -0.014192       0.1145361
F_R        0.002257637    0.3008843   -0.001195984    0.1991144    0.002352866    0.1368694
F_ps      -0.01041388     0.2361410   -0.003242377    0.1402763   -0.01853986     0.08113852
F_RKM     -0.0008594      0.2372919    0.0008752533   0.1396372   -0.00217064     0.07996565
F_ymc     -0.005055528    0.2217402    0.0004676424   0.1243824   -0.01033644     0.06657333

Table 2. Sample size n = 100

Estimator Qy(0.25)                    Qy(0.50)                    Qy(0.75)
          RB              RMSE        RB              RMSE        RB              RMSE
F_CD       0.0040377      0.1574879   -0.002210333    0.09613765  -5.532e-05      0.05418493
F_d        0.001618       0.2644733   -0.0001013333   0.1398340   -0.002928       0.08067246
F_R        0.006659608    0.2082328    0.002443938    0.1406448    0.006048526    0.09687452
F_ps       0.002893629    0.1625305    0.001387087    0.1000351   -0.000782639    0.0555225
F_RKM      0.00761754     0.1631152    0.003454027    0.1007665    0.00585408     0.05551676
F_ymc      0.004293074    0.1493216    0.001349364    0.08791667   0.0004385261   0.04697596

Table 3. Sample size n = 150

Estimator Qy(0.25)                    Qy(0.50)                    Qy(0.75)
          RB              RMSE        RB              RMSE        RB              RMSE
F_CD      -0.00083408     0.1202317   -0.005049858    0.07317584  -0.0008094133   0.04240729
F_d       -0.00449        0.2003743   -0.002968889    0.1062682   -0.004376       0.06215042
F_R       -0.001442379    0.1566497   -0.001783629    0.105573     0.001790763    0.07419154
F_ps      -0.002341488    0.1249551   -0.001229927    0.0748101   -0.001041700    0.04328602
F_RKM      0.00033484     0.1246686    0.0001145156   0.07473446   0.003407796    0.04315555
F_ymc     -0.0005424964   0.1164229   -0.0002469013   0.06574255   0.002811359    0.03637170

References

[Cham86] Chambers, R.L., and Dunstan, R. (1986) Estimating distribution functions from survey data. Biometrika, 73, 597–604.
[Deville92] Deville, J.C., and Särndal, C.E. (1992) Calibration Estimators in Survey Sampling. Journal of the American Statistical Association, 87, 376–382.
[Meeden95] Meeden, G. (1995) Median Estimation Using Auxiliary Information. Survey Methodology, 21, 71–77.
[Rao90] Rao, J.N.K., Kovar, J.G., and Mantel, H.J. (1990) On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika, 77, 365–375.
[Rueda06] Rueda, M., Martínez, S., Martínez, H., and Arcos, A. (2006) Estimation of the Distribution Function with Calibration Methods. Journal of Statistical Planning and Inference.
[Silva95] Silva, P.L.D.N., and Skinner, C.J. (1995) Estimating distribution functions with auxiliary information using poststratification. Journal of Official Statistics, 11, 277–294.
[Wu01] Wu, C., and Sitter, R.R. (2001) A Model-Calibration Approach to Using Complete Auxiliary Information from Survey Data. Journal of the American Statistical Association, 96, 185–193.