On some nonresponse correction for the finite population median estimator

Wojciech Gamrot

Department of Statistics, University of Economics, Bogucicka 14, 40-226 Katowice
gamrot@ae.katowice.pl

Summary. Several estimation methods have been developed to deal with nonresponse and its consequences. One of them is the double sampling procedure which relies on re-approaching nonrespondents in order to obtain the missing information. The research concerning this procedure is mostly concentrated on estimating population means, totals and their functions. This paper focuses on estimating another parameter - the population median - under stochastic-type nonresponse. The estimator of the finite population median for the double sampling procedure is considered. Results of a simulation study exploring its properties are presented.

Key words: finite population median, nonresponse, double sampling

1 The double sampling procedure

Under the double sampling procedure, the survey is executed in two phases. In the first phase a simple random sample $s$ of size $n$ is drawn without replacement from the population $U$, according to the sampling design:

$$p(s) = \binom{N}{n}^{-1}.$$ (1)

In the sample $s$ some units respond and some do not. Consequently it splits into two subsets $s_1 \subset U_1$ and $s_2 \subset U_2$, of sizes $n_1$ and $n_2$ such that $s_1 \cup s_2 = s$, $s_1 \cap s_2 = \emptyset$ and $n_1 + n_2 = n$. It is assumed that the nonresponse is a random event governed by some probability distribution $q(s_1 \mid s)$ called response distribution [SSW92]. Under these assumptions subset sizes $n_1$ and $n_2$ are random variables and cannot be controlled directly. In the second phase of the survey a subsample $s'$ of the size $n' = cn_2$ (where $0 < c < 1$) is drawn without replacement from $s_2$, with conditional probability:

$$p'(s' \mid n_2) = \binom{n_2}{n'}^{-1}.$$
(2)

Another contact attempt is then undertaken for each subsampled unit. Complete response is assumed at the second phase, which means that the response distribution $q(s_1 \mid s)$ reflects only the behaviour of population units in the first phase.

2 Finite population median and its estimators

Let us consider a finite and fixed population $U$ of size $N$. Assume that the values $y^\circ_1, \ldots, y^\circ_N$ of some characteristic $Y$ are associated with the population units. Arranged in order of magnitude they can be written as:

$$y^\circ_{(1)} \le \ldots \le y^\circ_{(N)}$$ (3)

Following [DN03] and [Gil00] we define the finite population median of $Y$ as:

$$Me(U) = \begin{cases} y^\circ_{(0.5(N+1))} & \text{if } N \text{ is odd} \\ 0.5\left(y^\circ_{(0.5N)} + y^\circ_{(0.5N+1)}\right) & \text{if } N \text{ is even} \end{cases}$$ (4)

In the case of complete response all the values $y_1, \ldots, y_n$ of $Y$ in the sample $s$ are observed. Arranged in order of magnitude they may be written as:

$$y_{(1)} \le \ldots \le y_{(n)}$$ (5)

The population median $Me(U)$ may be estimated by the sample median:

$$Me(s) = \begin{cases} y_{(0.5(n+1))} & \text{if } n \text{ is odd} \\ 0.5\left(y_{(0.5n)} + y_{(0.5n+1)}\right) & \text{if } n \text{ is even} \end{cases}$$ (6)

More generally, we define the empirical quantile function, which (in its continuous form) is given by [Gil00]:

$$Q(p) = (1 - g)\, y_{(k)} + g\, y_{(k+1)}$$ (7)

where $p \in (1/n, (n-1)/n)$ is the order of the quantile, $k = [r]$, $g = r - [r]$, $r = np + 0.5$ and the symbol $[r]$ denotes the integer part of $r$. Hence, the sample median given by (6) is equal to $Q(0.5)$.

Another way of thinking about the quantile is to express it in terms of weights associated with the sampled units. In the case of simple random sampling and complete response each sample unit represents the $1/n$-th part of the population.
Consequently, the following weight may be associated with each $i$-th sample unit:

$$w_i = \frac{1}{n}$$ (8)

When the sample units are ordered, the sample quantile of order $p$ lies between the observations $y_{(k)}$ and $y_{(k+1)}$, where $k$ satisfies the condition:

$$W(k) \le p < W(k+1)$$ (9)

and:

$$W(k) = \sum_{i=1}^{k-1} w_i + 0.5\, w_k$$ (10)

with the weights indexed in the order of the sorted observations. The quantile is equal to:

$$Q(p) = y_{(k)} + \frac{p - W(k)}{W(k+1) - W(k)} \left(y_{(k+1)} - y_{(k)}\right)$$ (11)

This expression is equivalent to (7), and to (6) when $p = 0.5$, but it may be easily adapted to the nonresponse situation. When nonresponse occurs and nonrespondents are subsampled, the observations of $Y$ are available in the respondent subset $s_1$ and the subsample $s'$. Each element of $s_1$ still represents the $1/n$-th part of the population. However, each subsampled unit now corresponds to the $1/(cn_2)$-th part of the nonrespondent subset $s_2$ and to the $1/(cn)$-th part of the population. Consequently, the weights are given by:

$$w_i^* = \begin{cases} \dfrac{1}{n} & \text{for } i \in s_1 \\ \dfrac{1}{nc} & \text{for } i \in s' \end{cases}$$ (12)

By merging the observations from both sets $s_1$ and $s'$, arranging them in order of magnitude, applying the modified weights $w_i^*$ instead of $w_i$ in expressions (9) - (11) and setting $p = 0.5$, we may use (11) as a nonresponse-corrected estimator of the population median.

3 Simulation results

A simulation study was carried out to compare the bias and accuracy of median estimators under nonresponse for various values of the initial sample size $n$. The following three estimators were compared:

• The sample median $Me(s)$ computed for the complete response case according to (6).
• The "uncorrected" estimator $Me(s_1)$ (the median of the first-phase respondent subset) based on the incomplete data from the first phase.
• The "corrected" estimator computed according to (11) for $p = 0.5$ using the modified weights $w_i^*$.
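
The weighted quantile of expressions (9) - (11) and the double sampling weights of (12) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the code used in the study: the function names `weighted_quantile` and `corrected_median` are hypothetical, the realized fraction n'/n2 is used in place of c so that the weights sum to one, and for orders p outside the range covered by (9) the extreme observation is returned.

```python
import bisect


def weighted_quantile(values, weights, p=0.5):
    """Interpolated weighted quantile in the spirit of expressions (9)-(11)."""
    pairs = sorted(zip(values, weights))      # arrange observations by magnitude
    y = [v for v, _ in pairs]
    # W[k] = w_(1) + ... + w_(k-1) + 0.5 * w_(k), expression (10), 0-based here
    W, cum = [], 0.0
    for _, w in pairs:
        W.append(cum + 0.5 * w)
        cum += w
    # Assumption (not in the paper): outside [W(1), W(n)] return the extreme value.
    if p <= W[0]:
        return y[0]
    if p >= W[-1]:
        return y[-1]
    k = bisect.bisect_right(W, p) - 1         # W[k] <= p < W[k+1], expression (9)
    g = (p - W[k]) / (W[k + 1] - W[k])
    return y[k] + g * (y[k + 1] - y[k])       # expression (11)


def corrected_median(y_resp, y_sub, n):
    """Median estimate from respondents s1 and subsampled nonrespondents s'.

    n is the first-phase sample size, so n2 = n - len(y_resp).
    """
    n1 = len(y_resp)
    n2 = n - n1
    if n2 == 0:                               # complete response: equal weights 1/n
        return weighted_quantile(y_resp, [1.0 / n] * n1)
    if not y_sub:
        raise ValueError("nonrespondents exist but the subsample is empty")
    c = len(y_sub) / n2                       # assumption: realized subsampling fraction
    weights = [1.0 / n] * n1 + [1.0 / (n * c)] * len(y_sub)   # expression (12)
    return weighted_quantile(list(y_resp) + list(y_sub), weights)


# Equal weights 1/n give back the ordinary sample median of (6).
median_even = weighted_quantile([4, 1, 3, 2], [0.25] * 4)
# Six respondents out of n = 10, two of the four nonrespondents re-approached.
median_corrected = corrected_median([1, 2, 3, 4, 5, 6], [10, 20], n=10)
```

In the second call the two subsampled values each carry weight 0.2 against 0.1 for a respondent, which pulls the estimate upwards relative to the median of the respondents alone.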
In the simulations a logistic nonresponse mechanism [EL91] was assumed, according to which the response probability of the $i$-th unit is given by:

$$\rho_i = \frac{1}{1 + \exp(B_0 + B_1 y_i)}$$ (13)

with $B_0$ and $B_1$ being fixed in advance. This nonresponse mechanism is of the NMAR (not missing at random) type as the response probability depends on observed and unobserved values of $Y$. Complete response was assumed in the second phase, so the response probabilities given above correspond only to the first-phase unit behaviour. The introduction of the second phase allows us to treat the missing data in the subset $s_2$ as MAR (missing at random).

The experiments were carried out by repeatedly drawing simple random samples without replacement from the population. To represent the stochastic nonresponse mechanism, for each unit included in the initial sample an independent random trial was executed with the probability of success equal to its response probability. The unit was assumed to respond if the outcome of the trial was a success. A simple random subsample of 30% of the first-phase nonrespondents ($c = 0.3$) was drawn without replacement and treated as responding. For each sample-subsample pair the values of the estimators were computed. On the basis of the empirical distribution of the estimates, the bias and mean square error (MSE) of each estimator were evaluated. All simulations were executed for initial sample sizes $n = 40, 60, \ldots, 240$.

3.1 Experiment 1

In the first experiment the values of the variable under study $Y$ were generated with a pseudo-random number generator of the standard normal distribution using the well-known Marsaglia-Bray algorithm [ZW97]. The parameters of the nonresponse model were arbitrarily set to $B_0 = 0$ and $B_1 = 2$. A total of 500 populations were generated and 500 sample-subsample pairs were drawn from each. The dependence between the initial sample size $n$ and the bias of the estimators is shown in Fig. 1.
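
The simulation set-up described above (logistic response probabilities as in (13), independent response trials, a 30% subsample of nonrespondents) might be sketched in Python as follows. This is a rough sketch under several assumptions that are not taken from the paper: the population size `pop_size`, the number of replications, the use of a single population, the rounding rule for the subsample size, the skipping of replications with no respondents, and the use of the standard library generator `random.gauss` in place of the Marsaglia-Bray algorithm. All function names are hypothetical.

```python
import bisect
import math
import random
import statistics


def weighted_median(values, weights):
    """Weighted median by interpolation between half-cumulated weights."""
    pairs = sorted(zip(values, weights))
    y = [v for v, _ in pairs]
    W, cum = [], 0.0
    for _, w in pairs:
        W.append(cum + 0.5 * w)
        cum += w
    if 0.5 <= W[0]:
        return y[0]
    if 0.5 >= W[-1]:
        return y[-1]
    k = bisect.bisect_right(W, 0.5) - 1
    return y[k] + (0.5 - W[k]) / (W[k + 1] - W[k]) * (y[k + 1] - y[k])


def response_probability(y, b0, b1):
    """Logistic response probability of expression (13)."""
    return 1.0 / (1.0 + math.exp(b0 + b1 * y))


def one_replication(population, n, b0, b1, c, rng):
    """Return (complete, uncorrected, corrected) estimates, or None if nobody responds."""
    sample = rng.sample(population, n)        # first phase: SRS without replacement
    resp, nonresp = [], []
    for y in sample:
        responds = rng.random() < response_probability(y, b0, b1)
        (resp if responds else nonresp).append(y)
    if not resp:
        return None                           # assumption: skip such replications
    complete = statistics.median(sample)
    uncorrected = statistics.median(resp)
    if not nonresp:
        return complete, uncorrected, uncorrected
    n_sub = max(1, round(c * len(nonresp)))   # assumption: rounding rule for n'
    sub = rng.sample(nonresp, n_sub)          # second phase: subsample of s2
    weights = [1.0 / n] * len(resp) + [len(nonresp) / (n * n_sub)] * n_sub
    corrected = weighted_median(resp + sub, weights)
    return complete, uncorrected, corrected


def simulate(n, reps=500, pop_size=1000, b0=0.0, b1=2.0, c=0.3, seed=1):
    """Empirical (bias, MSE) of the three estimators for one generated population."""
    rng = random.Random(seed)
    population = [rng.gauss(0.0, 1.0) for _ in range(pop_size)]
    true_median = statistics.median(population)
    errors = {"complete": [], "uncorrected": [], "corrected": []}
    for _ in range(reps):
        estimates = one_replication(population, n, b0, b1, c, rng)
        if estimates is None:
            continue
        for name, value in zip(errors, estimates):
            errors[name].append(value - true_median)
    return {name: (statistics.fmean(e), statistics.fmean([x * x for x in e]))
            for name, e in errors.items()}


results = simulate(n=60, reps=300)
```

Each entry of `results` holds the empirical bias and MSE of one estimator. Repeating the call for n = 40, 60, ..., 240 and over many generated populations would mimic the design of the experiment, although the numbers obtained need not match those reported here.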
It is easy to see that the estimator for the complete response case is unbiased. The occurrence of nonresponse introduces a significant negative bias that does not seem to depend on the initial sample size. The nonresponse correction dramatically reduces the bias: the corrected estimator has a small positive bias for very modest sample sizes, and this bias diminishes as $n$ grows.

The dependence between the initial sample size $n$ and the mean square error of the estimators is shown in Fig. 2. Nonresponse dramatically increases the mean square error of the uncorrected estimator: its MSE is several times greater than the MSE for the complete response case. With growing $n$ it decreases, but the rate of decrease is slow and for large values of $n$ the MSE stabilizes at a nonzero value. The introduction of the nonresponse correction significantly improves the MSE. Although for every value of $n$ the observed MSE of the corrected estimator was higher than the MSE for the complete response case, it clearly tends to zero with growing $n$.

Fig. 3 presents the ratio of the squared bias to the overall MSE of all the estimators, as a function of the initial sample size $n$. The observed contribution of the squared bias to the total MSE is negligible in the complete response case and in the case of the corrected estimator. For the uncorrected estimator the bias share in the MSE is substantial and grows with increasing $n$. In most cases it stays above 90%, which makes the uncorrected estimator practically useless.

Fig. 1. The bias of estimators as a function of initial sample size n - Gaussian distribution

3.2 Experiment 2

In the second experiment, the data on the revenues from municipal taxation in 284 Swedish municipalities [SSW92] represented the characteristic under study. A total of 250000 sample-subsample pairs were drawn from the population. The parameters of the nonresponse model were arbitrarily set to $B_0 = 2$, $B_1 = 0.005$, which implies that for units with small values of the variable under study the response probability is lower.

The dependence between the initial sample size $n$ and the bias of the estimators is shown in Fig. 4. All three estimators show a positive bias, decreasing with growing initial sample size. The bias of the uncorrected estimator is several times higher than the biases of the two other estimators and it tends to stabilize at a nonzero level. The biases of the corrected estimator and of the estimator for the complete response case tend to zero with growing $n$, with the latter being the lowest for every value of $n$.

The dependence between the initial sample size $n$ and the mean square error of the estimators is shown in Fig. 5. The MSE of each estimator decreases with growing $n$. The MSE of the uncorrected estimator is several times higher than the MSE of the other estimators. The share of the squared bias in the overall MSE, shown in Fig. 6, is again substantial for the uncorrected estimator and negligible for the other estimators.

4 Conclusions

Simulation results suggest that the considered modification of the population median estimator may significantly reduce its nonresponse bias and mean square error. However, it should be stressed that these findings are based on the assumption that the logistic nonresponse model holds, and that the distribution of the variable under study is close to the distributions used in the simulations. Care should be taken when generalizing these conclusions to different situations. Finally, let us note that the approach presented in this paper may also be applied to estimate other population quantiles, not only the median.

Fig. 2. The mean square error of estimators as a function of initial sample size n - Gaussian distribution

References

[DN03] David, H.A., Nagaraja, H.N.: Order Statistics. Wiley, New York (1992)
[Gil00] Gilchrist, W.G.: Statistical Modelling with Quantile Functions.
Chapman & Hall/CRC, Boca Raton (2000)
[SSW92] Särndal, C.E., Swensson, B., Wretman, J.H.: Model Assisted Survey Sampling. Springer-Verlag, New York (1992)
[ZW97] Zieliński, R., Wieczorkowski, R.: Komputerowe generatory liczb losowych. WNT, Warsaw (1997)
[EL91] Ekholm, A., Laaksonen, S.: Weighting via Response Modelling in Finnish Household Budget Survey. Journal of Official Statistics, 7, 325–338 (1991)

Fig. 3. The ratio of squared bias to the mean square error as a function of initial sample size - Gaussian distribution

Fig. 4. The bias of estimators as a function of initial sample size n - tax revenues data

Fig. 5. The mean square error of estimators as a function of initial sample size n - tax revenues data

Fig. 6. The ratio of squared bias to the mean square error as a function of initial sample size n - tax revenues data