Application of new model-based and model-assisted methods for estimating the finite population mean of the IBEX’35 stock market data

M. Rueda (1), I. Sánchez-Borrego (1), S. González (2), J.F. Muñoz (3), and S. Martínez (4)

(1) Department of Statistics and Operational Research, University of Granada, Campus de Fuentenueva s/n, 18071 Granada, Spain. mrueda@ugr.es, ismasb@ugr.es
(2) Department of Statistics and Operational Research, University of Jaén.
(3) Department of Quantitative Methods for the Economy and the Business, University of Granada.
(4) Department of Statistics and Applied Mathematics, University of Almería.

Summary. In this paper, new model-assisted and model-based estimators of the finite population mean, based on local polynomial regression, are applied to the IBEX’35 Spanish stock market data from April 2004 to October 2005. This data set has one discontinuity point in September 2004. We adapt local linear kernel regression to discontinuities and propose two jump-preserving estimators, one model-based and one model-assisted. An empirical comparison between the methods is performed.

1 Introduction

Classical sampling estimators are usually constructed based only on the sampling design, making no assumption about the nature of the population under study. In many sample surveys, auxiliary information on the finite population is available, and this can improve the precision of the estimators compared to a purely design-based estimator. To incorporate this information, consider a superpopulation model for $y$ built on an unknown function of $x$:

$$Y_j = m(x_j) + e_j, \quad j = 1, \ldots, N. \tag{1}$$

The errors $e_j$, $j = 1, \ldots, N$, are independent and identically distributed with $E(e_j) = 0$ and $Var(e_j) = \sigma^2$. Traditionally, the methods that use such models are parametric ones, which assume the working model (1) for the relationship between the auxiliary variable $x$ and the study variable $y$.
The regression function is selected from a particular class of functions, sometimes based on prior knowledge of the relationship between the variables under study. When there is no such knowledge, nonparametric methods are more appropriate, because only smoothness assumptions on the regression function are made. Kuo [Kuo88] adopted a nonparametric model-based approach, which does not place any restrictions on the relationship between the variables under study. Other important works in this area are Chambers et al. [Cha93], Dorfman [Dor92] and Dorfman and Hall [DH93]. Under the model-assisted approach, Breidt and Opsomer [BO00] rely on local polynomial kernel regression to estimate the unknown regression function $m(\cdot)$.

Most traditional nonparametric regression methods work under the assumption that the regression function is smooth. However, there are many real-life problems in which the regression function is smooth except at a finite number of points. One example is the economic impact on a region of a natural disaster such as an earthquake. A well-known application is the classical problem of the annual volume of the river Nile, studied by Cobb [Cob78]. More applications can be found in medicine, economics, quality control and so on. Because of the smoothness assumptions made on the regression function, local polynomial kernel regression produces smooth estimates of discontinuous regression functions, which biases the estimates near the jumps. These smoothers therefore need to be adapted if discontinuities are present. New model-based and model-assisted estimators are proposed, as well as their corresponding jump-preserving counterparts.
The latter are the result of combining local polynomial regression, introduced by Ruppert and Wand [RW94] and Fan and Gijbels [FG96], among others, with Wu and Chu’s [WC93] method of projected observations. This method reuses the available data to increase the number of observations in the region around each jump point, which improves the estimation of the discontinuous regression function. The methods are simple to implement and are based on local polynomial smoothing, which has many desirable properties in the nonparametric regression context, including design adaptation, consistency and asymptotic unbiasedness.

The paper is organized as follows: Section 2 provides a brief description of the proposed methods for estimating the population mean. Section 3 contains the application to the IBEX’35 Spanish stock market data.

2 Proposed methods

Consider a population $U = \{1, \ldots, N\}$ of $N$ units from which a sample $s$ of size $n$ is selected. The chosen sampling design determines the inclusion probabilities $\pi_j$. Let $y_j$ be the value of the study variable $y$, and $x_j$ the corresponding value of an auxiliary variable $x$. The Horvitz-Thompson estimator based on the sample $s$,

$$\bar{y}_{HT} = \frac{1}{N} \sum_{j \in s} \frac{y_j}{\pi_j}, \tag{2}$$

is a design-unbiased estimator of the population mean $\bar{Y}$ which does not use the information in the auxiliary variable $x$.

In the following, we assume the model

$$Y_j = m(x_j) + e_j, \quad j = 1, \ldots, N, \tag{3}$$

where the $e_j$ are independent and identically distributed with $E(e_j) = 0$ and constant variance $\sigma^2$. The unknown regression function $m$ is defined, without loss of generality, on the interval $[0, 1]$ (it can be extended to the whole of $\mathbb{R}$), and is smooth except at a finite and unknown number $q$ of jump points $t_k$ ($k = 1, \ldots, q$).

Under the model-assisted approach, Breidt and Opsomer [BO00] were the first to use local polynomial kernel regression in the survey sampling setting.
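As a minimal sketch of estimator (2) under model (3), the Python snippet below draws a synthetic population with one jump, selects a simple random sample and computes the Horvitz-Thompson mean. The regression function, jump location, noise level and all function names are illustrative assumptions; they are neither the IBEX’35 data nor the authors' code.

```python
import random

def simulate_population(N, jump_at=0.3, jump_size=1.0, sigma=0.1, seed=0):
    """Draw (x_j, y_j), j = 1..N, from y_j = m(x_j) + e_j on [0, 1].

    Assumption: m is a hypothetical linear trend plus a single jump.
    """
    rng = random.Random(seed)
    xs = [(j + 0.5) / N for j in range(N)]
    ys = []
    for x in xs:
        m = 2.0 * x + (jump_size if x >= jump_at else 0.0)
        ys.append(m + rng.gauss(0.0, sigma))  # iid errors, mean 0, variance sigma^2
    return xs, ys

def horvitz_thompson_mean(y_sample, pi_sample, N):
    """Estimator (2): (1/N) * sum over the sample of y_j / pi_j."""
    return sum(y / p for y, p in zip(y_sample, pi_sample)) / N

N, n = 401, 50
xs, ys = simulate_population(N)
sample = random.Random(1).sample(range(N), n)  # simple random sampling without replacement
pi = [n / N] * n                               # inclusion probabilities under this design
ht = horvitz_thompson_mean([ys[j] for j in sample], pi, N)
sample_mean = sum(ys[j] for j in sample) / n
print(ht, sample_mean)
```

Under simple random sampling every $\pi_j$ equals $n/N$, so (2) reduces to the sample mean. The auxiliary variable $x$ plays no role in it, which is the gap that the model-assisted and model-based estimators aim to close.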
Following Breidt and Opsomer [BO00], we estimate the finite population mean by

$$\hat{\bar{Y}}_{MA} = \frac{1}{N} \left( \sum_{j \in s} \frac{y_j - \hat{m}_j}{\pi_j} + \sum_{j \in U} \hat{m}_j \right), \tag{4}$$

where $\hat{m}_j$ is the local linear kernel estimator (Ruppert and Wand [RW94] and Fan and Gijbels [FG96]).

If discontinuities are present, the nonparametric regression estimator has to be adapted. In a first step the jump points are estimated, and in a second step the regression function is estimated using these jump points. We estimate the jump points following Wu and Chu’s [WC93] method, but relying on local linear kernel smoothing instead of the traditional Nadaraya-Watson kernel estimator (Nadaraya [Nad64]). To estimate the jump points we consider the estimator

$$\hat{m}(x) = \frac{\sum_{\{i : x_i \in [-1, 2]\}} K_h(x - x_i) \{ s_{n,2} - (x - x_i) s_{n,1} \} y_i^P}{s_{n,2} s_{n,0} - (s_{n,1})^2}, \tag{5}$$

where $K$ is a kernel function with bandwidth $h$, $x \in [0, 1]$ is the point of estimation and $x_i$ is the pseudo-point, that is, the location of the projected observation $y_i^P$ introduced by Wu and Chu [WC93]. The functions $s_{n,r}$ are given by

$$s_{n,r} = \sum_{\{i : x_i \in [-1, 2]\}} K_h(x - x_i) (x - x_i)^r. \tag{6}$$

The jump points are estimated through differences of estimators of the type (5). Let $s \subset U$ be the sample, with observations $y_i$, $i = 1, \ldots, n$. The projected observations for each interval $[\hat{t}_{k-1}, \hat{t}_k]$, $k = 1, \ldots, q + 1$, are given by

$$\begin{aligned}
y_i^{pk} &= y_{2-i} + 2 \hat{m}_{gL}(x_j)(x_i - \hat{t}_k), & i &= 2 - n, 2 - n + 1, \ldots, 0, \\
y_i^{pk} &= y_{2n-i} + 2 \hat{m}_{gR}(x_j)(x_i - \hat{t}_{k-1}), & i &= n + 1, n + 2, \ldots, 2n - 1,
\end{aligned} \tag{7}$$

and $y_i^{pk} = y_i$ for $i = 1, \ldots, n$. Here $\hat{m}_{gL}$ and $\hat{m}_{gR}$ are two kernel smoothers which involve different kernel functions, and $g$ is a pilot bandwidth whose choice is motivated by Wu and Chu [WC93] under practical considerations.

Finally, for each $k = 1, \ldots, q + 1$ and $x_j \in [\hat{t}_{k-1}, \hat{t}_k]$, the value $m(x_j)$ is estimated by

$$\hat{m}_j = \sum_{k=1}^{q+1} \sum_{\{i : x_i \in [2\hat{t}_{k-1} - \hat{t}_k,\, 2\hat{t}_k - \hat{t}_{k-1}]\}} \frac{K_h(x_j - x_i) \{ s_{n,2} - (x_j - x_i) s_{n,1} \} y_i^{pk}}{s_{n,2} s_{n,0} - (s_{n,1})^2}, \tag{8}$$

where $y_i^{pk}$ are the projected data obtained from the original observations $y_i$ at the design points $x_i \in [\hat{t}_{k-1}, \hat{t}_k]$. The $s_{n,r}$ are given by

$$s_{n,r} = \sum_{\{i : x_i \in [2\hat{t}_{k-1} - \hat{t}_k,\, 2\hat{t}_k - \hat{t}_{k-1}]\}} K_h(x_j - x_i) (x_j - x_i)^r. \tag{9}$$

The finite population mean is then estimated by the jump-preserving model-assisted estimator

$$\hat{\bar{Y}}_{JPMA} = \frac{1}{N} \left( \sum_{j \in s} \frac{y_j - \hat{m}_j}{\pi_j} + \sum_{j \in U} \hat{m}_j \right), \tag{10}$$

with $\hat{m}_j$ given by (8).

A different approach to estimating the population mean is given by model-based estimators, which only predict the non-sampled values. We propose an alternative model-based estimator of the population mean based on a local linear kernel smoother, whose good properties are well known in the nonparametric regression context. The proposed estimator is

$$\hat{\bar{Y}}_{MB} = \frac{1}{N} \left( \sum_{j \in s} Y_j + \sum_{j \in U - s} \hat{m}_j \right), \tag{11}$$

where $\hat{m}_j$ is the local linear kernel smoother (Ruppert and Wand [RW94] and Fan and Gijbels [FG96]). Similarly, we propose the following jump-preserving model-based estimator:

$$\hat{\bar{Y}}_{JPMB} = \frac{1}{N} \left( \sum_{j \in s} Y_j + \sum_{j \in U - s} \hat{m}_j \right), \tag{12}$$

where $\hat{m}_j$ is the version of the local linear kernel smoother adapted to discontinuities given in (8).

3 Application to the IBEX’35 data set

We consider the population data set of the Spanish stock market IBEX’35, taken from April 2004 to October 2005. It contains 401 units. A plot of this data set is given in Figure 1. Müller and Stadtmüller [MS99] introduced a nonparametric regression method to determine the number of jump points of a discontinuous regression function. For this data set, that method sets the number of jump points to $q = 1$. The discontinuity point is located at the end of September 2004. We compare the performance of the model-assisted estimator (MA), the jump-preserving model-assisted estimator (JPMA), the model-based estimator (MB), the jump-preserving model-based estimator (JPMB), the regression estimator (REG) (Singh [Sin03]) and the Horvitz-Thompson estimator (HT) applied to the IBEX’35 data set. A bandwidth $h = 0.1$ is used for the nonparametric regression estimators.
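The model-assisted and model-based estimators without jump preservation, (4) and (11), can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R implementation: the Gaussian kernel, the function names and the noise-free linear test population are all hypothetical, and the smoother is fitted with unweighted sample data, which matches equal inclusion probabilities such as those of simple random sampling.

```python
import math
import random

def gaussian_kernel(u):
    # Assumption: the paper does not state which kernel K is used.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def local_linear(x0, xs, ys, h, kernel=gaussian_kernel):
    """Local linear estimate at x0:
    sum_i K_h(x0 - x_i) {s2 - (x0 - x_i) s1} y_i / (s2 * s0 - s1^2)."""
    s0 = s1 = s2 = 0.0
    terms = []
    for x, y in zip(xs, ys):
        d = x0 - x
        k = kernel(d / h) / h          # K_h(d) = K(d / h) / h
        s0 += k
        s1 += k * d
        s2 += k * d * d
        terms.append((k, d, y))
    den = s2 * s0 - s1 * s1
    if den <= 0.0:
        raise ValueError("need at least two distinct design points")
    return sum(k * (s2 - d * s1) * y for k, d, y in terms) / den

def model_assisted_mean(x_pop, sample, y_sample, pi, h):
    """Estimator (4): design-weighted residuals plus fitted values over U."""
    x_s = [x_pop[j] for j in sample]
    m_hat = [local_linear(x, x_s, y_sample, h) for x in x_pop]
    resid = sum((y - m_hat[j]) / p for j, y, p in zip(sample, y_sample, pi))
    return (resid + sum(m_hat)) / len(x_pop)

def model_based_mean(x_pop, sample, y_sample, h):
    """Estimator (11): observed values in s plus predictions for U - s."""
    x_s = [x_pop[j] for j in sample]
    in_s = set(sample)
    pred = sum(local_linear(x_pop[j], x_s, y_sample, h)
               for j in range(len(x_pop)) if j not in in_s)
    return (sum(y_sample) + pred) / len(x_pop)

# Sanity check on a hypothetical noise-free linear population (mean 2):
# a local linear fit reproduces straight lines, so both estimators
# should recover the population mean up to rounding error.
N, n, h = 401, 50, 0.1
x_pop = [(j + 0.5) / N for j in range(N)]
y_pop = [2.0 * x + 1.0 for x in x_pop]
sample = random.Random(0).sample(range(N), n)
y_s = [y_pop[j] for j in sample]
ma = model_assisted_mean(x_pop, sample, y_s, [n / N] * n, h)
mb = model_based_mean(x_pop, sample, y_s, h)
print(ma, mb)
```

The jump-preserving versions (10) and (12) keep these two outer formulas unchanged and only replace `local_linear` by a smoother fitted to the projected observations on each interval between estimated jump points.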
To estimate the jump points of the discontinuous regression function, an equally spaced grid of 402 points is considered. Samples are generated by simple random sampling using sample sizes $n = 50$, $n = 75$ and $n = 100$. We perform 500 replications.

Fig. 1. Scatter plot of IBEX’35 population (vertical axis: IBEX’35 value, from about 7100 to 11100; horizontal axis: April 2004 to October 2005)

For each estimator we compute the relative absolute bias (RAB), averaged over the replications,

$$RAB(\hat{\theta}) = \frac{1}{R} \sum_{i=1}^{R} \frac{|\hat{\theta}(s_i) - \bar{Y}|}{\bar{Y}}, \tag{13}$$

where $R$ is the number of replications, $s_i$ is the sample drawn in the $i$-th replication and $\hat{\theta}$ is the finite population mean estimator considered. The relative efficiency with respect to the Horvitz-Thompson estimator is given by

$$RE(\hat{\theta}) = \frac{\sum_{i=1}^{R} \left[ \hat{\theta}(s_i) - \bar{Y} \right]^2}{\sum_{i=1}^{R} \left[ \bar{y}_{HT}(s_i) - \bar{Y} \right]^2}. \tag{14}$$

All calculations were carried out in R. Programming details are available from the authors.

Table 1 and Table 2 show the RE and RAB values for the IBEX’35 data set, and Figure 2 reports line plots for the estimators at each sample size. As the sample size grows, the bias of the estimators decreases. The application to the IBEX’35 data set shows the improvement provided by the jump-preserving methods relative to their non-jump-preserving counterparts. Moreover, the model-assisted methods give better results in terms of RE and RAB than the model-based estimators.

Table 1. Relative efficiency (RE) for the IBEX’35 stock market data with bandwidth h = 0.1

  n     HT   MA        JPMA      MB        JPMB      REG
  50    1    0.04021   0.03465   0.16749   0.12868   0.13266
  75    1    0.01968   0.01722   0.02721   0.01917   0.10599
  100   1    0.01647   0.01341   0.02545   0.01733   0.10371

Table 2. Relative absolute bias (RAB) for the IBEX’35 stock market data with bandwidth h = 0.1

  n     HT        MA        JPMA      MB        JPMB      REG
  50    0.01020   0.00106   0.00101   0.00289   0.00242   0.00372
  75    0.00818   0.00061   0.00055   0.00087   0.00063   0.00269
  100   0.00743   0.00047   0.00046   0.00077   0.00059   0.00237

References

[BO00] Breidt, F.J. and Opsomer, J.D. Local polynomial regression estimators in survey sampling. The Annals of Statistics, 28(4), 1026–1053 (2000)
[Cob78] Cobb, G.W. The problem of the Nile: conditional solution to a changepoint problem. Biometrika, 62, 243–251 (1978)
[Cha93] Chambers et al. Bias robust estimation in finite population using nonparametric calibration. Journal of American Statistical Association, 88, 268–277 (1993)
[Dor92] Dorfman, A.H. Nonparametric regression for estimating totals in finite populations. Proceedings of the Section on Survey Research Methods, 622–625. American Statistical Association, Alexandria, VA (1992)
[DH93] Dorfman, A.H. and Hall, P. Estimators of the finite population distribution function using nonparametric regression. Annals of Statistics, 21, 1452–1475 (1993)
[FG96] Fan, J. and Gijbels, I. Local Polynomial Modelling and Its Applications. Chapman and Hall (1996)
[MS99] Müller, H.G. and Stadtmüller, U. Discontinuous versus smooth regression. Annals of Statistics, 27(1), 299–337 (1999)
[Nad64] Nadaraya, E.A. On estimating regression. Theory Probab. Applic., 15, 134–137 (1964)
[Kuo88] Kuo, L. Classical and prediction approaches to estimating distribution functions from survey data. Proceedings of the Section on Survey Research Methods, 280–285. American Statistical Association (1988)
[RW94] Ruppert, D. and Wand, M.P. Multivariate locally weighted least squares regression. The Annals of Statistics, 22(3), 1346–1370 (1994)
[Sin03] Singh, S. Advanced sampling theory with applications: How Michael ”selected” Amy. Kluwer Academic Publisher, The Netherlands, 1–1247 (2003)
[WC93] Wu, J.S. and Chu, C.K. Nonparametric function estimation and bandwidth selection for discontinuous regression functions. Statistica Sinica, 3, 557–576 (1993)

Fig. 2. RE and RAB for the model-assisted estimator (MA), jump-preserving model-assisted (JPMA), model-based (MB), jump-preserving model-based (JPMB) and the classical regression estimator (REG), for bandwidth h = 0.1 and sample sizes n = 50, n = 75 and n = 100 (two panels, RE and RAB, each plotted against n)
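The two performance measures plotted in Fig. 2 can be computed from repeated samples as in the following Python sketch. It assumes that the RAB is the average over replications of $|\hat{\theta}(s_i) - \bar{Y}| / \bar{Y}$ and that the RE is the ratio of summed squared errors to those of the Horvitz-Thompson estimator. The population, the function names and the estimator signature `estimator(sample, y_pop)` are hypothetical and do not reproduce the authors' R code or the IBEX’35 data.

```python
import random

def relative_absolute_bias(estimates, true_mean):
    """Average of |theta_hat(s_i) - Ybar| / Ybar over the replications."""
    return sum(abs(e - true_mean) for e in estimates) / (len(estimates) * true_mean)

def relative_efficiency(estimates, ht_estimates, true_mean):
    """Summed squared error of an estimator divided by that of Horvitz-Thompson."""
    num = sum((e - true_mean) ** 2 for e in estimates)
    den = sum((e - true_mean) ** 2 for e in ht_estimates)
    return num / den

def ht_mean(sample, y_pop):
    """Horvitz-Thompson mean under simple random sampling (pi_j = n / N)."""
    N, n = len(y_pop), len(sample)
    return sum(y_pop[j] / (n / N) for j in sample) / N

def monte_carlo(y_pop, estimator, n, R, seed=0):
    """Apply `estimator` to R simple random samples of size n."""
    rng = random.Random(seed)
    N = len(y_pop)
    return [estimator(rng.sample(range(N), n), y_pop) for _ in range(R)]

# Hypothetical population of 401 units with a linear trend.
y_pop = [8000.0 + 5.0 * j for j in range(401)]
true_mean = sum(y_pop) / len(y_pop)

ht_runs = monte_carlo(y_pop, ht_mean, n=50, R=500)
rab_ht = relative_absolute_bias(ht_runs, true_mean)
re_ht = relative_efficiency(ht_runs, ht_runs, true_mean)  # HT against itself
print(rab_ht, re_ht)
```

Any other estimator of the mean can be passed to `monte_carlo` with the same seed, so that it is evaluated on the same samples as the Horvitz-Thompson benchmark before its RE is computed.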