On some nonresponse correction for the finite population median estimator

Wojciech Gamrot
Department of Statistics, University of Economics, Bogucicka 14, 40-226 Katowice
gamrot@ae.katowice.pl
Summary. Several estimation methods have been developed to deal with nonresponse and its consequences. One of them is the double sampling procedure which
relies on re-approaching nonrespondents in order to obtain the missing information.
The research concerning this procedure is mostly concentrated on estimating population means, totals and their functions. This paper focuses on estimating another
parameter - the population median - under stochastic-type nonresponse. The estimator of the finite population median for the double sampling procedure is considered.
Results of a simulation study exploring its properties are presented.
Key words: finite population median, nonresponse, double sampling
1 The double sampling procedure
Under the double sampling procedure, the survey is executed in two phases. In the
first phase a simple random sample s of size n is drawn without replacement from
the population U , according to the sampling design:
$$p(s) = \binom{N}{n}^{-1}. \tag{1}$$
In the sample s some units respond and some do not. Consequently it splits
into two subsets s1 ⊂ U1 and s2 ⊂ U2 , of sizes n1 and n2 such that
s1 ∪ s2 = s, s1 ∩ s2 = ∅ and n1 + n2 = n. It is assumed that the nonresponse
is a random event governed by some probability distribution q(s1 |s) called response
distribution [SSW92]. Under these assumptions subset sizes n1 and n2 are random
variables and cannot be controlled directly.
In the second phase of the survey a subsample s′ of the size n′ = cn2 (where
0 < c < 1) is drawn without replacement from s2 , with conditional probability:
$$p'(s' \mid n_2) = \binom{n_2}{n'}^{-1}. \tag{2}$$
Another contact attempt is then undertaken for each subsampled unit. Complete
response is assumed at the second phase, which means that the response distribution
q(s1 |s) reflects only the behaviour of population units in the first phase.
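The two-phase selection described in this section can be sketched as follows. This is only an illustration under stated assumptions: the function name `draw_two_phase`, the `responds` hook standing in for the response distribution, and the rounding of n′ = cn2 to an integer are all choices made here, not part of the original procedure.

```python
import random

def draw_two_phase(N, n, c, responds, rng):
    """Return (s1, s2, s_sub): respondents, nonrespondents, second-phase subsample."""
    # First phase: simple random sample of size n, without replacement,
    # from the population labelled 0, ..., N-1.
    s = rng.sample(range(N), n)
    # Evaluate the (possibly stochastic) response hook exactly once per unit.
    answered = {i: bool(responds(i)) for i in s}
    s1 = [i for i in s if answered[i]]
    s2 = [i for i in s if not answered[i]]
    # Second phase: subsample of size n' = c * n2 drawn without replacement from s2.
    # Rounding c * n2 to the nearest integer is an assumption of this sketch.
    n_sub = round(c * len(s2))
    s_sub = rng.sample(s2, n_sub)
    return s1, s2, s_sub

rng = random.Random(1)
# Hypothetical response hook: every unit responds with probability 0.6.
s1, s2, s_sub = draw_two_phase(N=1000, n=100, c=0.3,
                               responds=lambda i: rng.random() < 0.6, rng=rng)
print(len(s1), len(s2), len(s_sub))
```

The sizes of s1 and s2 vary from run to run, which reflects the remark above that n1 and n2 are random and cannot be controlled directly.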
2 Finite population median and its estimators
Let us consider the finite and fixed population U of size N. Assume that values
$y^{\circ}_1, \ldots, y^{\circ}_N$ of some characteristic Y are associated with each
population unit. Arranged in order of magnitude they can be written as:
$$y^{\circ}_{(1)} \le \ldots \le y^{\circ}_{(N)} \tag{3}$$
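Following [DN03] and [Gil00] we define the finite population median of Y as:
$$Me(U) = \begin{cases} y^{\circ}_{(0.5(N+1))} & \text{if } N \text{ is odd} \\ 0.5\left(y^{\circ}_{(0.5N)} + y^{\circ}_{(0.5N+1)}\right) & \text{if } N \text{ is even} \end{cases} \tag{4}$$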
In the case of complete response all the values y1 , ..., yn of Y in the sample s are
observed. When ordered in the order of magnitude they may be written as:
y(1) ≤ ... ≤ y(n)
(5)
The population median M e(U ) may be estimated by the sample median:
$$Me(s) = \begin{cases} y_{(0.5(n+1))} & \text{if } n \text{ is odd} \\ 0.5\left(y_{(0.5n)} + y_{(0.5n+1)}\right) & \text{if } n \text{ is even} \end{cases} \tag{6}$$
More generally, we define the empirical quantile function, which (in its continuous
form) takes the form [Gil00]:
Q(p) = (1 − g)y(k) + gy(k+1)
(7)
where p ∈ (1/n, (n−1)/n) is the order of the quantile, k = [r], g = r−[r], r = np+0.5
and the symbol [r] denotes integer part of r. Hence, the sample median given by (6)
is equivalent to Q(0.5).
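A minimal sketch of the empirical quantile function (7) is given below. The function name `empirical_quantile` and the handling of the case g = 0 are assumptions of this sketch; the order statistics y(k) are 1-based in the text and 0-based in the code.

```python
import math

def empirical_quantile(values, p):
    """Empirical quantile Q(p) = (1 - g) * y(k) + g * y(k+1), with r = n*p + 0.5."""
    y = sorted(values)          # order statistics y(1) <= ... <= y(n)
    n = len(y)
    r = n * p + 0.5
    k = math.floor(r)           # integer part [r]
    g = r - k                   # fractional part of r
    if g == 0 and 1 <= k <= n:
        return y[k - 1]         # no interpolation needed, Q(p) = y(k)
    if not 1 <= k < n:
        raise ValueError("p is outside the range supported by formula (7)")
    return (1 - g) * y[k - 1] + g * y[k]

print(empirical_quantile([3, 1, 2, 5, 4], 0.5))   # odd n: the middle observation
print(empirical_quantile([1, 2, 3, 4], 0.5))      # even n: mean of the two middle ones
```

For p = 0.5 and odd n, r = 0.5(n+1) is an integer, so g = 0 and the result is the middle observation. For even n, k = 0.5n and g = 0.5, which gives the average of the two middle observations, in agreement with (6).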
Another way of thinking about the quantile is by expressing it in terms of weights
associated with each population unit. In the case of simple random sampling and
complete response each sample unit represents a 1/n-th part of the population.
Consequently, the following weight may be associated with each i-th sample unit:
$$w_i = \frac{1}{n} \tag{8}$$
When sample units are ordered the sample quantile of the order p lies between the
observations y(k) and y(k+1) where k satisfies the condition:
$$W(k) \le p < W(k+1) \tag{9}$$
and:
$$W(k) = \sum_{i=1}^{k-1} w_i + 0.5 w_k \tag{10}$$
The quantile is equal to:
$$Q(p) = y_{(k)} + \frac{p - W(k)}{W(k+1) - W(k)} \left(y_{(k+1)} - y_{(k)}\right) \tag{11}$$
This expression is equivalent to (7) and to (6) when p = 0.5 but it may be easily
adapted to the nonresponse situation. When nonresponse occurs and nonrespondents
are subsampled, the observations of Y are available in the respondent subset s1 and
the subsample s′. Each element from s1 still represents a 1/n-th part of the population.
However, each subsampled unit now corresponds to a 1/(cn2)-th part of the nonrespondent
subset s2 and to a 1/(cn)-th part of the population. Consequently, the weights are
given by:
$$w_i^* = \begin{cases} \dfrac{1}{n} & \text{for } i \in s_1 \\[2mm] \dfrac{1}{nc} & \text{for } i \in s' \end{cases} \tag{12}$$
By merging the observations from both sets s1 and s′, ordering them by magnitude, applying the modified weights wi∗ instead of wi in expressions
(9)-(11) and setting p = 0.5, we may use (11) as a nonresponse-corrected estimator of the population median.
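The weighted form of the estimator can be sketched as below. The names `weighted_quantile` and `corrected_median` are hypothetical, and the sketch assumes that n′ = cn2 holds exactly, so that the weights sum to one; it raises an error when no k satisfies condition (9).

```python
def weighted_quantile(values, weights, p=0.5):
    """Quantile of order p from expressions (9)-(11), for arbitrary unit weights."""
    pairs = sorted(zip(values, weights))      # order observations by magnitude
    y = [v for v, _ in pairs]
    w = [wt for _, wt in pairs]
    # W(k) = w_1 + ... + w_{k-1} + 0.5 * w_k, expression (10)
    W, cumulative = [], 0.0
    for wt in w:
        W.append(cumulative + 0.5 * wt)
        cumulative += wt
    # Find k with W(k) <= p < W(k+1), expression (9), then interpolate as in (11).
    for k in range(len(y) - 1):
        if W[k] <= p < W[k + 1]:
            return y[k] + (p - W[k]) / (W[k + 1] - W[k]) * (y[k + 1] - y[k])
    raise ValueError("no k satisfies W(k) <= p < W(k+1)")

def corrected_median(y_s1, y_sub, n, c):
    """Median estimate using weights 1/n for s1 and 1/(n*c) for the subsample s'."""
    weights = [1.0 / n] * len(y_s1) + [1.0 / (n * c)] * len(y_sub)
    return weighted_quantile(list(y_s1) + list(y_sub), weights, 0.5)

# Illustrative data: n = 10, six respondents, n2 = 4, c = 0.5, so n' = 2.
print(corrected_median([1, 2, 3, 4, 5, 6], [10, 20], n=10, c=0.5))
```

In this made-up example the two subsampled nonrespondents have large values, so the corrected estimate is pulled upwards relative to the median of the six respondents alone. With equal weights 1/n the function reduces to the sample median (6).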
3 Simulation results
A simulation study was carried out to compare the bias and accuracy of median
estimators under nonresponse for various values of the initial sample size n. The
following three estimators were compared:
• The sample median M e(s) computed for the complete response case according
to (6).
• The "uncorrected" estimator M e(s1 ) (the median of the first-phase respondent
subset) based on the incomplete data from the first phase.
• The "corrected" estimator computed according to (11) for p = 0.5 using modified
weights wi∗ .
In the simulations a logistic nonresponse mechanism [EL91] was assumed, according
to which the response probability of i-th unit is given by:
$$\rho_i = \frac{1}{1 + \exp(B_0 + B_1 y_i)} \tag{13}$$
with B0 and B1 being fixed in advance. This nonresponse mechanism is of the NMAR
(not missing at random) type as the response probability depends on observed and
unobserved values of Y . Complete response was assumed in the second phase so the
response probabilities given above correspond only to the first-phase unit behaviour.
The introduction of the second phase allows us to treat the missing data in the subset
s2 as MAR (missing at random).
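The response model (13) and the corresponding random trials can be sketched as follows. The function names are hypothetical, and the sketch takes formula (13) exactly as written, so a positive B1 makes units with larger values of Y less likely to respond.

```python
import math
import random

def response_probability(y, b0, b1):
    """First-phase response probability from (13): 1 / (1 + exp(B0 + B1 * y))."""
    return 1.0 / (1.0 + math.exp(b0 + b1 * y))

def simulate_response(values, b0, b1, rng):
    """One independent Bernoulli trial per unit; True means the unit responds."""
    return [rng.random() < response_probability(y, b0, b1) for y in values]

rng = random.Random(7)
ys = [-1.0, 0.0, 1.0]
print([round(response_probability(y, 0.0, 2.0), 3) for y in ys])
print(simulate_response(ys, 0.0, 2.0, rng))
```

With B0 = 0 and B1 = 2, a unit with y = 0 responds with probability 0.5, and the probability falls as y increases.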
The experiments were carried out by repeatedly drawing simple random samples
without replacement from the population. To represent the stochastic nonresponse
mechanism, an independent random trial was executed for each unit included in the
initial sample, with the probability of success equal to the unit's response probability. The
unit was assumed to respond if the outcome of the trial was a success. A simple random
subsample of 30% of the first-phase nonrespondents was drawn without replacement and
treated as responding. For each sample-subsample pair the values of the estimators were
computed. On the basis of the empirical distribution of the estimates, the bias and mean
square error (MSE) of each estimator were evaluated. All simulations were executed
for initial sample sizes n = 40, 60, ..., 240.
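The evaluation loop can be sketched as below. For brevity this sketch covers only the complete-response and the uncorrected estimators and omits the second-phase subsample. The population size, the number of replicates, the use of `random.gauss`, and the rule of skipping replicates with no respondents are assumptions made here, not settings reported in the study.

```python
import math
import random
import statistics

def bias_and_mse(estimates, true_value):
    """Empirical bias and mean square error of a list of estimates."""
    m = len(estimates)
    bias = sum(estimates) / m - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / m
    return bias, mse

def run_experiment(population, n, b0, b1, reps, rng):
    true_median = statistics.median(population)
    complete, uncorrected = [], []
    for _ in range(reps):
        sample = rng.sample(population, n)   # first-phase SRS without replacement
        # One Bernoulli trial per unit with the logistic response probability (13).
        respondents = [y for y in sample
                       if rng.random() < 1.0 / (1.0 + math.exp(b0 + b1 * y))]
        if not respondents:
            continue  # assumption: replicates with no respondents are skipped
        complete.append(statistics.median(sample))
        uncorrected.append(statistics.median(respondents))
    return bias_and_mse(complete, true_median), bias_and_mse(uncorrected, true_median)

rng = random.Random(2024)
population = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # hypothetical population
(bias_c, mse_c), (bias_u, mse_u) = run_experiment(
    population, n=40, b0=0.0, b1=2.0, reps=300, rng=rng)
print("complete:    bias=%.3f mse=%.3f" % (bias_c, mse_c))
print("uncorrected: bias=%.3f mse=%.3f share=%.2f" % (bias_u, mse_u, bias_u ** 2 / mse_u))
```

The last printed quantity is the share of squared bias in the MSE, computed as the squared empirical bias divided by the empirical MSE.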
3.1 Experiment 1
In the first experiment the values of the variable under study Y were generated
from the standard normal distribution with a pseudo-random number generator based on the
well-known Marsaglia-Bray algorithm [ZW97]. The parameters of the nonresponse
model were arbitrarily set to B0 = 0 and B1 = 2. A total of 500 populations were
generated and 500 sample-subsample pairs were drawn from each.
The dependence between the initial sample size n and the bias of the estimators is shown
in Figure 1. The estimator for the complete response case is evidently
unbiased. The occurrence of nonresponse introduces a substantial negative bias
that does not seem to depend on the initial sample size. The nonresponse correction
dramatically reduces the bias: the corrected estimator has a small positive bias for
very modest sample sizes, and this bias diminishes as n grows.
The dependence between the initial sample size n and the mean square error
of the estimators is shown in Figure 2. Nonresponse dramatically increases the
mean square error: the MSE of the uncorrected estimator is several times greater than
that for the complete response case. With growing n it decreases, but the rate
of decrease is slow and for large values of n the MSE stabilizes at a nonzero
value. The introduction of the nonresponse correction substantially improves the MSE.
Although for every value of n the observed MSE of the corrected estimator was higher
than the MSE for the complete response case, it clearly tends to zero with growing n.
Figure 3 presents the ratio of squared bias to the overall MSE of each estimator
as a function of the initial sample size n. The observed contribution of squared
bias to the total MSE is negligible in the complete response case and in the case
of the corrected estimator. For the uncorrected estimator the bias share in the MSE is
substantial and grows with increasing n. In most cases it stays above 90%, which
makes the uncorrected estimator practically useless.
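The behaviour of this ratio can be read off the standard decomposition of the mean square error. As a brief reminder, with $\hat{\theta}$ denoting any of the three estimators (notation introduced here for illustration only):

```latex
\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}^2(\hat{\theta}),
\qquad
\frac{\mathrm{Bias}^2(\hat{\theta})}{\mathrm{MSE}(\hat{\theta})}
  = \frac{\mathrm{Bias}^2(\hat{\theta})}{\mathrm{Var}(\hat{\theta}) + \mathrm{Bias}^2(\hat{\theta})}
```

If the variance shrinks with growing n while the bias stays roughly constant, as observed for the uncorrected estimator, the ratio approaches one.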
3.2 Experiment 2
In the second experiment, the data on the revenues from municipal taxation in 284
Swedish municipalities [SSW92] represented the characteristic under study. A total
of 250000 sample-subsample pairs were drawn from the population. The parameters
of the nonresponse model were arbitrarily set to B0 = 2, B1 = 0.005, which implies
that for units with small values of the variable under study the response probability
is lower. The dependence between the initial sample size n and the bias of the estimators
is shown in Figure 4. All three estimators show positive bias, decreasing with
growing initial sample size. The bias of the uncorrected estimator is several times
higher than the biases of the two other estimators and it tends to stabilize at a nonzero
level. The biases of the corrected estimator and of the estimator for the complete response
case tend to zero with growing n, with the latter being the lowest for any value of n.
Fig. 1. The bias of estimators as a function of initial sample size n - Gaussian
distribution
The dependence between the initial sample size n and the mean square error
of the estimators is shown in Figure 5. The MSE of each estimator decreases with
growing n. The MSE of the uncorrected estimator is several times higher than the
MSE of the other estimators. The share of bias in the overall MSE, shown in Figure 6,
is again substantial for the uncorrected estimator and negligible for the other estimators.
4 Conclusions
Simulation results suggest that the considered modification of population median
estimator may significantly reduce its nonresponse bias and the mean square error.
However, it should be stressed that these findings are based on the assumption that
the logistic nonresponse model holds, and that the distribution of the variable under
study is close to the distributions used in simulations. Care should be taken when
generalizing these conclusions to different situations. Finally, let us note that the
approach presented in this paper may also be applied to estimate other population
quantiles, not only the median.
Fig. 2. The mean square error of estimators as a function of initial sample size n - Gaussian distribution
References
[DN03] David, H.A., Nagaraja, H.N.: Order Statistics. Wiley, New York (1992)
[Gil00] Gilchrist, W.G.: Statistical Modelling with Quantile Functions. Chapman & Hall/CRC, Boca Raton (2000)
[SSW92] Sarndal, C.E., Swensson, B., Wretman, J.H.: Model Assisted Survey Sampling. Springer-Verlag, New York (1992)
[ZW97] Zielinski, R., Wieczorkowski, R.: Komputerowe generatory liczb losowych. WNT, Warsaw (1997)
[EL91] Ekholm, A., Laaksonen, S.: Weighting via Response Modelling in Finnish Household Budget Survey. Journal of Official Statistics, 7, 325–338 (1991)
Fig. 3. The ratio of squared bias to the mean square error as a function of initial
sample size - Gaussian distribution
Fig. 4. The bias of estimators as a function of initial sample size n - tax revenues
data
Fig. 5. The mean square error of estimators as a function of initial sample size n - tax revenues data
Fig. 6. The ratio of squared bias to the mean square error as a function of initial
sample size n - tax revenues data