Model-calibration method in the distribution function’s estimation

Sergio Martínez (1), María del Mar Rueda (2), Helena Martínez (3), Ismael Sánchez-Borrego (4), Silvia González (5) and Juan F. Muñoz (6)

(1) Department of Statistics and Applied Mathematics. University of Almería. Spain. spuertas@ual.es
(2) Department of Statistics and Operational Research. University of Granada. Spain. mrueda@ugr.es
(3) Department of Statistics and Operational Research. University of Murcia. Spain. helenamp@um.es
(4) Department of Statistics and Operational Research. University of Granada. Spain. ismasb@ugr.es
(5) Department of Statistics and Operational Research. University of Jaén. Spain. sgonza@ujaen.es
(6) Department of Statistics and Operational Research. University of Granada. Spain. jfmunoz@ugr.es
1 Introduction
In sample surveys, auxiliary population information is often used at the estimation stage to increase the precision of estimators of a population total, mean or distribution function. To incorporate the auxiliary information into the estimation of the distribution function, we use the calibration method [Deville92] and obtain a new estimator of the distribution function. We study the principal properties of this new estimator and show that, under some conditions, the calibrated estimator is itself a distribution function. This property is unusual among estimators that use auxiliary information to estimate the distribution function. Finally, a simulation study compares the precision of the new estimator with that of the usual estimators.
2 The Proposed Estimator
Consider a finite population $U = \{1, \ldots, k, \ldots, N\}$ consisting of $N$ different elements. Let $s = \{1, \ldots, n\}$ be the set of $n$ units included in a sample, selected according to a specified sampling design with inclusion probabilities $\pi_k$ and $\pi_{kl}$, assumed to be strictly positive. Let $y_k$ be the value of the study variable $y$ for the $k$th population element, with which is also associated an auxiliary vector value $x_k = (x_{k1}, x_{k2}, \ldots, x_{kJ})'$. The values $x_1, x_2, \ldots, x_N$ are known for the entire population, but $y_k$ is known only if the $k$th unit is selected in the sample $s$. The finite
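population distribution function of the study variable $y$ is given by

$$F_y(t) = \frac{1}{N} \sum_{k \in U} \Delta(t - y_k) \qquad \text{with} \qquad \Delta(t - y_k) = \begin{cases} 0 & \text{if } t < y_k \\ 1 & \text{if } t \geq y_k \end{cases}$$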
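As a small illustration of this definition, the following sketch evaluates $F_y(t)$ on a toy population and the design-weighted sample analogue with $d_k = 1/\pi_k$ (the Horvitz-Thompson estimator). The data, the sample and the function names are hypothetical and serve only to make the indicator $\Delta(t - y_k)$ concrete.

```python
import numpy as np

def population_cdf(y_pop, t):
    """F_y(t) = (1/N) * sum over U of Delta(t - y_k), for each point in t."""
    y_pop = np.asarray(y_pop, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Delta(t - y_k) is 1 when t >= y_k and 0 otherwise
    return (t[:, None] >= y_pop[None, :]).mean(axis=1)

def weighted_sample_cdf(y_s, weights, N, t):
    """(1/N) * sum over s of w_k * Delta(t - y_k), for each point in t."""
    y_s = np.asarray(y_s, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    indicators = (t[:, None] >= y_s[None, :]).astype(float)
    return indicators @ np.asarray(weights, dtype=float) / N

y_pop = np.array([2.0, 3.0, 5.0, 7.0, 9.0, 11.0])   # toy population, N = 6
sample_idx = np.array([0, 2, 4])                     # hypothetical sample, n = 3
d = np.full(3, 2.0)                                  # d_k = 1 / pi_k with pi_k = 1/2
t = [1.0, 5.0, 12.0]

F_pop = population_cdf(y_pop, t)                         # 0, 1/2, 1
F_ht = weighted_sample_cdf(y_pop[sample_idx], d, 6, t)   # 0, 2/3, 1
print(F_pop, F_ht)
```

Replacing `d` by any other set of sample weights gives the corresponding weighted estimator of the distribution function.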
Following [Rueda06], the distribution function $F_y(t)$ can be estimated by the calibration estimator, defined by

$$\hat{F}_{yc}(t) = \frac{1}{N} \sum_{k \in s} \omega_k \Delta(t - y_k)$$

where the calibration weights $\omega_k$ are chosen to minimize the chi-square distance with respect to the basic design weights $d_k = 1/\pi_k$

$$\Phi_s = \sum_{k \in s} \frac{(\omega_k - d_k)^2}{d_k q_k} \tag{1}$$

with $q_k$ known positive constants unrelated to $d_k$, subject to the calibration equations

$$\frac{1}{N} \sum_{k \in s} \omega_k \Delta(\mathbf{t} - g_k) = F_g(\mathbf{t}) \tag{2}$$
where

$$\mathbf{t} = (t_1, \ldots, t_P)'; \qquad \Delta(\mathbf{t} - g_k) = \big(\Delta(t_1 - g_k), \ldots, \Delta(t_P - g_k)\big)'$$

$$F_g(\mathbf{t}) = \big(F_g(t_1), \ldots, F_g(t_P)\big)'; \qquad \hat{F}_{GH}(\mathbf{t}) = \big(\hat{F}_{GH}(t_1), \ldots, \hat{F}_{GH}(t_P)\big)'$$

with $t_j$, $j = 1, 2, \ldots, P$, points that we choose arbitrarily and assume that

$$t_1 < t_2 < \ldots < t_P \tag{3}$$

$F_g$ is the distribution function of the pseudo-variable $g$; $\hat{F}_{GH}(\mathbf{t})$ denotes the Horvitz-Thompson estimator for the distribution function of $g$, where $g_k = \hat{\beta}' x_k$ for $k = 1, 2, \ldots, N$, and

$$\hat{\beta} = \Big( \sum_{k \in s} d_k q_k x_k x_k' \Big)^{-1} \cdot \sum_{k \in s} d_k q_k x_k y_k \tag{4}$$

is a weighted estimator of the multiple regression coefficient $\beta$ between $y$ and $x$.
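A minimal sketch of equation (4), assuming the sample auxiliary values are stored row-wise in a matrix `X`. The data below are hypothetical and are built with an exact linear relation $y_k = \alpha' x_k$, in which case $\hat{\beta}$ reproduces $\alpha$ and the pseudo-variable $g_k = \hat{\beta}' x_k$ coincides with $y_k$.

```python
import numpy as np

def beta_hat(X, y, d, q):
    """Weighted coefficient (4): (sum d q x x')^{-1} (sum d q x y)."""
    w = d * q
    A = X.T @ (w[:, None] * X)      # sum_k d_k q_k x_k x_k'
    b = X.T @ (w * y)               # sum_k d_k q_k x_k y_k
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
n, J = 30, 2
X = rng.uniform(size=(n, J))        # hypothetical sample auxiliary vectors
alpha = np.array([1.5, -0.7])       # hypothetical true coefficients
y = X @ alpha                       # exact linear relation y_k = alpha' x_k
d = np.full(n, 10.0)                # design weights d_k = 1 / pi_k
q = np.ones(n)                      # q_k = 1 for all units

b = beta_hat(X, y, d, q)
g = X @ b                           # pseudo-variable g_k on the sample units
print(b)
```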
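The calibration estimator is given by

$$\hat{F}_{yc}(t) = \hat{F}_{YH}(t) + \big(F_g(\mathbf{t}) - \hat{F}_{GH}(\mathbf{t})\big)' \cdot \hat{D} \tag{5}$$

where $\hat{F}_{YH}(t)$ denotes the Horvitz-Thompson estimator of the distribution function of $y$ and

$$\hat{D} = T^{-1} \cdot \sum_{k \in s} d_k q_k \Delta(\mathbf{t} - g_k) \Delta(t - y_k)$$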
assuming that the inverse of the symmetric matrix $T$ exists, where

$$T = \sum_{k \in s} d_k q_k \Delta(\mathbf{t} - g_k) \Delta(\mathbf{t} - g_k)'$$
Due to the conditions (2), the calibration estimator $\hat{F}_{yc}(t)$ gives perfect estimates of the distribution function $F_g(t)$ at the points in $\mathbf{t}$, but with these conditions we have assumed that the study variable $y$ and the auxiliary vector $x$ are linearly related. In fact, if $y_k = \alpha' x_k$ for all $k \in U$, then $\hat{\beta} = \alpha$ and the pseudo-variable $g$ coincides with $y$ for all $k \in U$. Consequently, the calibration estimator $\hat{F}_{yc}(t)$ gives perfect estimates of $F_y$ evaluated at the points in $\mathbf{t}$. If, however, the relation between $y$ and $x$ is not linear, the pseudo-variable $g_k$ and the conditions (2) are inadequate, and it is necessary to adapt or modify these conditions for nonlinear models.
Thus, we assume that the relationship between $y$ and $x$ can be described by the following superpopulation model proposed by [Wu01] (a linear or nonlinear regression model)

$$y_k = \mu(x_k, \theta) + \nu_k \varepsilon_k, \qquad k = 1, 2, \ldots, N \tag{6}$$

where $\theta = (\theta_0, \ldots, \theta_J)'$ and $\sigma^2$ are unknown superpopulation parameters, $\mu(x_k, \theta)$ is a known function of $x_k$ and $\theta$, $\nu_k = \nu(x_k)$ is a strictly positive known function of $x_k$, and the $\varepsilon_k$ are independent and identically distributed random variables with $E_\xi(\varepsilon_k) = 0$ and $V_\xi(\varepsilon_k) = \sigma^2$, where $E_\xi$ and $V_\xi$ denote the expectation and variance with respect to the superpopulation model. We also assume that $(y_1, x_1), \ldots, (y_N, x_N)$ are mutually independent.
Under model (6), the auxiliary information provided by the auxiliary vector $x$ should be used in the definition of the following pseudo-variable: $\hat{\mu}_k = \mu(x_k, \hat{\theta})$ for all $k \in U$, where $\hat{\theta}$ is an estimate of the superpopulation parameters $\theta$ (to estimate the model parameters $\theta$ we can follow [Wu01]).
Now, we consider the calibration estimator

$$\hat{F}_{ymc}(t) = \frac{1}{N} \sum_{k \in s} \omega_k \Delta(t - y_k) \tag{7}$$

where the calibrated weights $\omega_k$ are modified from $d_k$ by minimizing the chi-square distance (1), subject to the following conditions

$$\frac{1}{N} \sum_{k \in s} \omega_k \Delta(\mathbf{t} - \hat{\mu}_k) = F_{\hat{\mu}}(\mathbf{t}) \tag{8}$$
with $\mathbf{t} = (t_1, \ldots, t_P)'$, where $t_j$, $j = 1, 2, \ldots, P$, are arbitrarily chosen points that satisfy the condition (3). The resulting calibration weights then follow directly by minimizing (1) subject to (8) using a Lagrange multiplier approach, and are given by

$$\omega_k = d_k + d_k q_k N \big(F_{\hat{\mu}}(\mathbf{t}) - \hat{F}_{\hat{\mu}H}(\mathbf{t})\big)' \cdot T_\mu^{-1} \cdot \Delta(\mathbf{t} - \hat{\mu}_k) \tag{9}$$
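assuming that the inverse of

$$T_\mu = \sum_{k \in s} d_k q_k \Delta(\mathbf{t} - \hat{\mu}_k) \Delta(\mathbf{t} - \hat{\mu}_k)'$$

exists. The model-calibration estimator obtained is

$$\hat{F}_{ymc}(t) = \frac{1}{N} \sum_{k \in s} \omega_k \Delta(t - y_k) = \hat{F}_{YH}(t) + \big(F_{\hat{\mu}}(\mathbf{t}) - \hat{F}_{\hat{\mu}H}(\mathbf{t})\big)' \cdot \hat{D}_\mu \tag{10}$$

where

$$\hat{D}_\mu = T_\mu^{-1} \cdot \sum_{k \in s} d_k q_k \Delta(\mathbf{t} - \hat{\mu}_k) \Delta(t - y_k)$$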
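The weights (9) and the two forms of the estimator in (10) can be checked numerically. The sketch below is only an illustration under stated assumptions: the population is simulated, the sample is drawn by simple random sampling without replacement, $q_k = 1$, the parameter $\theta$ is treated as known so that $\hat{\mu}_k = \mu(x_k, \theta)$, and the function names are hypothetical. The evaluation point is written `t0` to keep it apart from the vector of calibration points `t`.

```python
import numpy as np

def indicator_matrix(t, v):
    """Row k holds Delta(t - v_k) = (1[t_1 >= v_k], ..., 1[t_P >= v_k])."""
    return (np.asarray(t)[None, :] >= np.asarray(v)[:, None]).astype(float)

def model_calibration_weights(mu_s, mu_U, d, q, t):
    """Weights (9): d_k + d_k q_k N (F_mu(t) - F_muH(t))' T_mu^{-1} Delta(t - mu_k)."""
    N = len(mu_U)
    D_s = indicator_matrix(t, mu_s)                  # n x P
    F_mu = indicator_matrix(t, mu_U).mean(axis=0)    # population values F_mu(t_j)
    F_muH = D_s.T @ d / N                            # Horvitz-Thompson estimates
    T = D_s.T @ ((d * q)[:, None] * D_s)             # P x P matrix T_mu
    lam = np.linalg.solve(T, N * (F_mu - F_muH))
    return d + d * q * (D_s @ lam)

rng = np.random.default_rng(1)
N, n = 1000, 50
x = rng.uniform(size=N)
mu_U = 1 + 2 * (x - 0.5) ** 2                        # assumed known mean function
y = mu_U + rng.normal(scale=0.1, size=N)
s = rng.choice(N, size=n, replace=False)             # SRSWOR sample
d, q = np.full(n, N / n), np.ones(n)
t = np.append(np.quantile(mu_U, [0.25, 0.5, 0.75]), mu_U.max())

w = model_calibration_weights(mu_U[s], mu_U, d, q, t)
t0 = np.median(y)
delta_y = (t0 >= y[s]).astype(float)
F_ymc = w @ delta_y / N                              # first form in (10)

# Second form in (10): Horvitz-Thompson estimate plus calibration correction
D_s = indicator_matrix(t, mu_U[s])
F_mu = indicator_matrix(t, mu_U).mean(axis=0)
T = D_s.T @ ((d * q)[:, None] * D_s)
D_hat = np.linalg.solve(T, D_s.T @ (d * q * delta_y))
F_alt = d @ delta_y / N + (F_mu - D_s.T @ d / N) @ D_hat
print(F_ymc, F_alt)
```

Both forms agree up to rounding, and the weights reproduce $F_{\hat{\mu}}(\mathbf{t})$ exactly at the calibration points.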
If $T_\mu$ is singular, the calibration problem has no solution and we take $\hat{F}_{ymc}(t) = \hat{F}_{YH}(t)$. When the estimator $\hat{F}_{ymc}(t)$ is applied to the estimation of the distribution function of the pseudo-variable $\hat{\mu}$ evaluated over the set of points $t_j$, $j = 1, 2, \ldots, P$, it coincides with the values $F_{\hat{\mu}}(t_j)$, $j = 1, \ldots, P$. If the relation between $y$ and $x$ is linear, the estimator $\hat{F}_{ymc}(t)$ coincides with the estimator proposed in [Rueda06], and consequently the estimator $\hat{F}_{yc}(t)$ is a particular case of $\hat{F}_{ymc}(t)$.
To study the conditions that guarantee the existence of $T_\mu^{-1}$, we consider the $\hat{\mu}$ values of the sample units in ascending order, $\hat{\mu}_{(1)} \leq \hat{\mu}_{(2)} \leq \ldots \leq \hat{\mu}_{(n-1)} \leq \hat{\mu}_{(n)}$. Following [Rueda06], if we suppose that the value $t_i$ is bigger than the first $k_i$ sample values of the variable $\hat{\mu}$, with $k_i > k_{i-1}$ for $i = 2, \ldots, P$, $k_1 > 0$ and $k_P \leq n$, then $T_\mu$ is nonsingular and $T_\mu^{-1}$ is a $P \times P$ symmetric matrix of the form
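$$T_\mu^{-1} = \begin{pmatrix}
a_{11} & a_{12} & 0 & \cdots & 0 & 0 \\
a_{21} & a_{22} & a_{23} & \cdots & 0 & 0 \\
0 & a_{32} & a_{33} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & a_{P-1,P-1} & a_{P-1,P} \\
0 & 0 & 0 & \cdots & a_{P,P-1} & a_{PP}
\end{pmatrix} \tag{11}$$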
where the $a_{ij}$ are given by

$$a_{11} = \frac{1}{\sum_{k=1}^{k_1} d_{(k)} q_{(k)}} + \frac{1}{\sum_{k=k_1+1}^{k_2} d_{(k)} q_{(k)}}; \qquad a_{PP} = \frac{1}{\sum_{k=k_{P-1}+1}^{k_P} d_{(k)} q_{(k)}}.$$

$$a_{i,i} = \frac{1}{\sum_{k=k_{i-1}+1}^{k_i} d_{(k)} q_{(k)}} + \frac{1}{\sum_{k=k_i+1}^{k_{i+1}} d_{(k)} q_{(k)}}; \qquad a_{i,i+1} = \frac{-1}{\sum_{k=k_i+1}^{k_{i+1}} d_{(k)} q_{(k)}}$$

3 Properties of $\hat{F}_{ymc}(t)$
In this section we study some properties of $\hat{F}_{ymc}(t)$.
3.1 Asymptotic Properties
We begin with the asymptotic behaviour, which is summarized in the following theorems:
Theorem 1. When the vector $\hat{\theta}$ is replaced by $\theta$, the vector of superpopulation model parameters in (6), the asymptotic behaviour of $\hat{F}_{ymc}(t)$ is the same as that of the estimator

$$\hat{F}_{y\mu}(t) = \hat{F}_{YH}(t) + \big(F_\mu(\mathbf{t}) - \hat{F}_{\mu H}(\mathbf{t})\big)' \hat{D}$$

with

$$\hat{D} = \Big( \sum_{k \in s} d_k \Delta(\mathbf{t} - \mu_k) \Delta(\mathbf{t} - \mu_k)' \Big)^{-1} \sum_{k \in s} d_k \Delta(\mathbf{t} - \mu_k) \Delta(t - y_k) \tag{12}$$

where $\mu_k = \mu(x_k, \theta)$, $F_\mu(\mathbf{t}) = (F_\mu(t_1), \ldots, F_\mu(t_P))'$, $F_\mu(t_j)$ for $j = 1, \ldots, P$ denotes the distribution function of the variable $\mu$ at the point $t_j$, and $\hat{F}_{\mu H}(\mathbf{t}) = (\hat{F}_{\mu H}(t_1), \ldots, \hat{F}_{\mu H}(t_P))'$, where $\hat{F}_{\mu H}(t_j)$ denotes the Horvitz-Thompson estimator for the distribution function of $\mu$ at the point $t_j$.
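Theorem 2. The asymptotic behaviour of the estimator $\hat{F}_{y\mu}(t)$ is the same as that of

$$\hat{F}_{YH}(t) + \big(F_\mu(\mathbf{t}) - \hat{F}_{\mu H}(\mathbf{t})\big)' D$$

with

$$D = \Big( \sum_{k \in U} \Delta(\mathbf{t} - \mu_k) \Delta(\mathbf{t} - \mu_k)' \Big)^{-1} \sum_{k \in U} \Delta(\mathbf{t} - \mu_k) \Delta(t - y_k)$$

Therefore $\hat{F}_{ymc}(t)$ is asymptotically normal and asymptotically unbiased, and its asymptotic variance is given by

$$AV(\hat{F}_{ymc}(t)) = \frac{1}{N^2} \sum_{k \in U} \sum_{l \in U} \Delta_{kl} (d_k E_k)(d_l E_l) \tag{13}$$

where $\Delta_{kl} = \pi_{kl} - \pi_k \pi_l$ and $E_k = \Delta(t - y_k) - \Delta(\mathbf{t} - \mu_k)' D$.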
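To estimate the population value (13), following Deville and Särndal (1992) [Deville92], we can employ the following estimator:

$$\hat{V}(\hat{F}_{ymc}(t)) = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \frac{\Delta_{kl}}{\pi_{kl}} (\omega_k e_k)(\omega_l e_l) \tag{14}$$

where $\omega_k$ are the calibrated weights (9) and $e_k = \Delta(t - y_k) - \Delta(\mathbf{t} - \hat{\mu}_k)' \hat{D}_\omega$ with

$$\hat{D}_\omega = \Big( \sum_{k \in s} \omega_k q_k \Delta(\mathbf{t} - \hat{\mu}_k) \Delta(\mathbf{t} - \hat{\mu}_k)' \Big)^{-1} \sum_{k \in s} \omega_k q_k \Delta(\mathbf{t} - \hat{\mu}_k) \Delta(t - y_k)$$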
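A minimal sketch of the variance estimator (14), specialised to simple random sampling without replacement, where $\pi_k = n/N$ and $\pi_{kl} = n(n-1)/(N(N-1))$ for $k \neq l$ (with $\pi_{kk} = \pi_k$). The indicator matrix, the weights and the function names below are hypothetical toy inputs, not output of the simulation study.

```python
import numpy as np

def calibration_residuals(delta_y, D_s, omega, q):
    """e_k = Delta(t - y_k) - Delta(t - mu_k)' D_omega, with D_omega as defined above."""
    wq = omega * q
    A = D_s.T @ (wq[:, None] * D_s)     # sum_k omega_k q_k Delta_k Delta_k'
    b = D_s.T @ (wq * delta_y)          # sum_k omega_k q_k Delta_k Delta(t - y_k)
    return delta_y - D_s @ np.linalg.solve(A, b)

def variance_estimate_srswor(omega, e, N):
    """Double sum in (14) using SRSWOR first- and second-order inclusion probabilities."""
    n = len(omega)
    pi = n / N
    pi_kl = n * (n - 1) / (N * (N - 1))
    # ratio[k, l] = Delta_kl / pi_kl; on the diagonal pi_kk = pi_k, giving 1 - pi_k
    ratio = np.full((n, n), (pi_kl - pi * pi) / pi_kl)
    np.fill_diagonal(ratio, 1 - pi)
    v = omega * e
    return v @ ratio @ v / N ** 2

N = 40
# Rows are Delta(t - mu_k) for P = 2 calibration points (toy values)
D_s = np.array([[1, 1], [1, 1], [0, 1], [0, 1], [0, 1], [0, 0]], dtype=float)
delta_y = np.array([1, 0, 1, 1, 0, 0], dtype=float)   # Delta(t - y_k) at one point t
omega = np.array([6.0, 7.0, 6.5, 7.5, 6.0, 7.0])      # hypothetical calibrated weights
q = np.ones(6)

e = calibration_residuals(delta_y, D_s, omega, q)
V_hat = variance_estimate_srswor(omega, e, N)
print(V_hat)
```

Under this design the double sum reduces to $(1 - n/N)\, n/(n-1)\, \sum_{k \in s} (v_k - \bar{v})^2 / N^2$ with $v_k = \omega_k e_k$, which gives a quick consistency check on the matrix form.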
3.2 Distribution Function’s Properties
Next we determine whether $\hat{F}_{ymc}(t)$ is a distribution function. For this, we have to verify whether the following conditions are satisfied:
i. $\hat{F}_{ymc}(t)$ is continuous on the right.
ii. $\hat{F}_{ymc}(t)$ is monotone nondecreasing.
iii. $\lim_{t \to -\infty} \hat{F}_{ymc}(t) = 0$ ; $\lim_{t \to +\infty} \hat{F}_{ymc}(t) = 1$
It is easy to verify that condition i) is satisfied. On the other hand, $\hat{F}_{ymc}(t)$ is not monotone nondecreasing in general. With respect to the third condition, we have

$$\lim_{t \to -\infty} \hat{F}_{ymc}(t) = 0$$

but

$$\lim_{t \to +\infty} \hat{F}_{ymc}(t) = F_{\hat{\mu}}(t_P) + \frac{1}{N} \sum_{k=k_P+1}^{n} d_{(k)}$$

and this value is not equal to 1 in general. Because conditions ii) and iii) are not satisfied, the estimator $\hat{F}_{ymc}(t)$ is not a distribution function in general. Since $\hat{F}_{ymc}(t)$ is a calibration estimator, it is monotone nondecreasing if the $\omega_k$ are positive for all sample units. The choice $q_k = c$ for all sample units guarantees that $\omega_k \geq 0$ for all $k \in s$ and consequently guarantees that the estimator $\hat{F}_{ymc}(t)$ is monotone nondecreasing (see [Rueda06]).
The condition $\lim_{t \to +\infty} \hat{F}_{ymc}(t) = 1$ is easily seen to be equivalent to the following condition

$$\frac{1}{N} \sum_{k=1}^{n} \omega_k = 1 \tag{15}$$

To include the condition (15) in the calibration conditions (8), it is sufficient to take the value $t_P$ sufficiently large, that is, a value $t_P$ that guarantees $F_{\hat{\mu}}(t_P) = 1$ (we can take $t_P = \max_{k \in U} \hat{\mu}_k$; see [Rueda06]).
Therefore, if we add the conditions:
(a). $q_k = c$ for all $k \in U$.
(b). $t_P = \max_{k \in U} \hat{\mu}_k$.
the estimator $\hat{F}_{ymc}(t)$ is a distribution function.
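Conditions (a) and (b) can be checked numerically on a simulated example. The sketch below makes several assumptions that are not part of the result itself: a simulated population, a simple random sample without replacement, $c = 1$, a known mean function in place of $\hat{\mu}_k$, and calibration points at the quartiles of $\hat{\mu}$ plus its population maximum. It computes the weights (9) and verifies that they are nonnegative, that they sum to $N$, and that the resulting estimator is nondecreasing from 0 to 1.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n = 1000, 50
x = rng.uniform(size=N)
mu_U = 1 + 2 * (x - 0.5) ** 2                  # stands in for mu_hat_k on the population
y = mu_U + rng.normal(scale=0.1, size=N)
s = rng.choice(N, size=n, replace=False)       # SRSWOR sample

d = np.full(n, N / n)                          # design weights d_k = 1 / pi_k
q = np.full(n, 1.0)                            # condition (a): q_k = c
t = np.append(np.quantile(mu_U, [0.25, 0.5, 0.75]),
              mu_U.max())                      # condition (b): t_P = max over U of mu_hat_k

# Calibration weights (9)
D_s = (t[None, :] >= mu_U[s][:, None]).astype(float)
F_mu = (t[None, :] >= mu_U[:, None]).mean(axis=0)
T = D_s.T @ ((d * q)[:, None] * D_s)
w = d + d * q * (D_s @ np.linalg.solve(T, N * F_mu - D_s.T @ d))

# Evaluate the estimator on a grid that covers every jump point
grid = np.sort(np.concatenate(([y.min() - 1.0], y[s], [y.max() + 1.0])))
F_hat = np.array([w[y[s] <= g].sum() / N for g in grid])

print(w.min(), w.sum() / N, F_hat[0], F_hat[-1])
```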
4 Simulation Study
In this section we compare the precision of the estimator $\hat{F}_{ymc}(t)$ with that of the following estimators: the Chambers-Dunstan estimator $\hat{F}_{CD}(t)$ [Cham86], the Rao-Kovar-Mantel estimator $\hat{F}_{RKM}(t)$ [Rao90], the estimator $\hat{F}_{ps}(t)$ of [Silva95], the difference estimator $\hat{F}_d(t)$ and the ratio estimator $\hat{F}_R(t)$.
The empirical study has been carried out with a simulated population called YQUAD01 with $N = 1000$. The model assumed between $y$ and $x$ is given by

$$y_k = 1 + 2 \cdot (x_k - 0.5)^2 + \varepsilon_k$$

where $x_k \sim \mathrm{unif}(0, 1)$ and $\varepsilon_k \sim N(0, (0.1)^2)$. Therefore, in the calibration condition (8) the function $\mu(x_k, \theta)$ is given by $\mu(x_k, \theta) = 1 + 2 \cdot (x_k - 0.5)^2$.
In the construction of the calibration estimator $\hat{F}_{ymc}(t)$ we choose the vector $\mathbf{t} = (t_1, t_2, t_3, t_4)$ where

$$t_1 = Q_{\hat{\mu}}(0.25); \qquad t_2 = Q_{\hat{\mu}}(0.5); \qquad t_3 = Q_{\hat{\mu}}(0.75); \qquad t_4 = \max_{k \in U} \hat{\mu}_k$$

The symbol $Q$ denotes a quantile.
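A population of this kind and the four calibration points can be generated as follows. This is a sketch, not the original YQUAD01 data: the seed is arbitrary, and the quantile $Q$ is computed with NumPy's default linear interpolation, which is an assumption about how $Q$ was defined.

```python
import numpy as np

rng = np.random.default_rng(123)          # arbitrary seed; not the original YQUAD01 data
N = 1000
x = rng.uniform(0.0, 1.0, size=N)         # x_k ~ unif(0, 1)
eps = rng.normal(0.0, 0.1, size=N)        # eps_k ~ N(0, 0.1^2)
mu = 1 + 2 * (x - 0.5) ** 2               # mu(x_k, theta)
y = mu + eps                              # study variable

# Calibration points: quartiles of mu and its population maximum
t = np.append(np.quantile(mu, [0.25, 0.5, 0.75]), mu.max())

# Population distribution function of mu at the calibration points
F_mu_t = np.array([(mu <= tj).mean() for tj in t])
print(t, F_mu_t)
```

Taking the last point equal to the population maximum makes the distribution function of $\hat{\mu}$ equal to 1 there, which is the condition required for the weights to sum to $N$.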
We selected 1000 samples for each of three sample sizes, $n = 50$, $n = 100$ and $n = 150$, under simple random sampling without replacement (SRSWOR), and for every estimator we calculated estimates of the distribution function $F_y$ of the study variable at the points $Q_y(0.25)$, $Q_y(0.5)$ and $Q_y(0.75)$. Thus, every estimator yields 1000 estimates at each of the selected points. From these estimates we calculate the relative bias (RB) and the relative root mean square error (RMSE)

$$RB(t) = \frac{1}{B} \sum_{b=1}^{B} \frac{\hat{F}(t)_b - F_y(t)}{F_y(t)}; \qquad RE(t) = \frac{MSE[\hat{F}(t)]}{MSE[\hat{F}_{HT}(t)]} \tag{16}$$
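The two quantities in (16) can be approximated by Monte Carlo along the following lines. This is a hedged sketch rather than the code behind the tables: the population is freshly simulated, only the Horvitz-Thompson and model-calibration estimators are compared, $B = 300$ replications are used to keep the run short, $q_k = 1$, and the function names are hypothetical. The second quantity is computed as the ratio of mean square errors exactly as printed in (16).

```python
import numpy as np

def ht_cdf(y_s, d, N, t0):
    """Horvitz-Thompson estimate of F_y(t0)."""
    return d[y_s <= t0].sum() / N

def ymc_cdf(y_s, mu_s, mu_U, d, q, t, t0):
    """Model-calibration estimate at t0; falls back to HT when T_mu is singular."""
    N = len(mu_U)
    D_s = (t[None, :] >= mu_s[:, None]).astype(float)
    T = D_s.T @ ((d * q)[:, None] * D_s)
    if np.linalg.matrix_rank(T) < len(t):
        return ht_cdf(y_s, d, N, t0)
    F_mu = (t[None, :] >= mu_U[:, None]).mean(axis=0)
    w = d + d * q * (D_s @ np.linalg.solve(T, N * F_mu - D_s.T @ d))
    return w[y_s <= t0].sum() / N

rng = np.random.default_rng(2024)
N, n, B = 1000, 50, 300
x = rng.uniform(size=N)
mu_U = 1 + 2 * (x - 0.5) ** 2
y = mu_U + rng.normal(scale=0.1, size=N)
t = np.append(np.quantile(mu_U, [0.25, 0.5, 0.75]), mu_U.max())
t0 = np.quantile(y, 0.5)                  # evaluation point Q_y(0.5)
F_true = np.mean(y <= t0)
d, q = np.full(n, N / n), np.ones(n)

est_ht, est_mc = np.empty(B), np.empty(B)
for b in range(B):
    s = rng.choice(N, size=n, replace=False)      # SRSWOR sample
    est_ht[b] = ht_cdf(y[s], d, N, t0)
    est_mc[b] = ymc_cdf(y[s], mu_U[s], mu_U, d, q, t, t0)

RB = np.mean(est_mc - F_true) / F_true
RE = np.mean((est_mc - F_true) ** 2) / np.mean((est_ht - F_true) ** 2)
print(f"RB = {RB:.4f}, RE = {RE:.3f}")
```

The other estimators in the comparison would enter the same loop, each contributing its own array of $B$ estimates.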
Table 1. Sample size n = 50

                      Qy(0.25)                      Qy(0.50)                      Qy(0.75)
Estimator     RB              RMSE          RB              RMSE          RB              RMSE
FCD           -0.00651912     0.225207      -0.003604693    0.1324281     -0.01324536     0.0771873
Fd            -0.008252       0.3868730     -0.006117333    0.1995568     -0.014192       0.1145361
FR            0.002257637     0.3008843     -0.001195984    0.1991144     0.002352866     0.1368694
Fps           -0.01041388     0.2361410     -0.003242377    0.1402763     -0.01853986     0.08113852
FRKM          -0.0008594      0.2372919     0.0008752533    0.1396372     -0.00217064     0.07996565
Fymc          -0.005055528    0.2217402     0.0004676424    0.1243824     -0.01033644     0.06657333
Table 2. Sample size n = 100

                      Qy(0.25)                      Qy(0.50)                      Qy(0.75)
Estimator     RB              RMSE          RB              RMSE          RB              RMSE
FCD           0.0040377       0.1574879     -0.002210333    0.09613765    -5.532e-05      0.05418493
Fd            0.001618        0.2644733     -0.0001013333   0.1398340     -0.002928       0.08067246
FR            0.006659608     0.2082328     0.002443938     0.1406448     0.006048526     0.09687452
Fps           0.002893629     0.1625305     0.001387087     0.1000351     -0.000782639    0.0555225
FRKM          0.00761754      0.1631152     0.003454027     0.1007665     0.00585408      0.05551676
Fymc          0.004293074     0.1493216     0.001349364     0.08791667    0.0004385261    0.04697596
Table 3. Sample size n = 150

                      Qy(0.25)                      Qy(0.50)                      Qy(0.75)
Estimator     RB              RMSE          RB              RMSE          RB              RMSE
FCD           -0.00083408     0.1202317     -0.005049858    0.07317584    -0.0008094133   0.04240729
Fd            -0.00449        0.2003743     -0.002968889    0.1062682     -0.004376       0.06215042
FR            -0.001442379    0.1566497     -0.001783629    0.105573      0.001790763     0.07419154
Fps           -0.002341488    0.1249551     -0.001229927    0.0748101     -0.001041700    0.04328602
FRKM          0.00033484      0.1246686     0.0001145156    0.07473446    0.003407796     0.04315555
Fymc          -0.0005424964   0.1164229     -0.0002469013   0.06574255    0.002811359     0.03637170
References
[Cham86] Chambers, R.L., and Dunstan, R. (1986) Estimating distribution functions from survey data. Biometrika, 73, 597–604.
[Deville92] Deville, J.C., and Särndal, C.E. (1992) Calibration Estimators in Survey Sampling. Journal of the American Statistical Association, 87, 376–382.
[Meeden95] Meeden, G. (1995) Median Estimation Using Auxiliary Information. Survey Methodology, 21, 71–77.
[Rao90] Rao, J.N.K., Kovar, J.G., and Mantel, H.J. (1990) On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika, 77, 365–375.
[Rueda06] Rueda, M., Martínez, S., Martínez, H., and Arcos, A. (2006) Estimation of the Distribution Function with Calibration Methods. Journal of Statistical Planning and Inference.
[Silva95] Silva, P.L.D. Nascimento, and Skinner, C.J. (1995) Estimating distribution functions with auxiliary information using poststratification. Journal of Official Statistics, 11, 277–294.
[Wu01] Wu, C., and Sitter, R.R. (2001) A Model-Calibration Approach to Using Complete Auxiliary Information from Survey Data. Journal of the American Statistical Association, 96, 185–193.