Instrumental weighted variables - algorithm

Jan Ámos Víšek
Faculty of Social Sciences, Charles University, Smetanovo nábřeží 6, 110 01 Prague, the Czech Republic, visek@mbox.fsv.cuni.cz
(Research was supported by grant of GA ČR number 402/06/0408.)

Summary. An algorithm for the Instrumental Weighted Variables (IWV) estimator is presented. It is a modification of the algorithm for the Least Weighted Squares. The heuristics behind the IWV and the "history" of the algorithm (originally proposed for the Least Trimmed Squares) are recalled. Numerical illustrations are also presented.

Key words: Robust regression, failure of orthogonality condition, controllable level of robustness, heuristics of algorithm and its verification, heteroscedasticity.

1 Introduction and notations

Sensitivity studies have always been an inseparable part of statistics (see e.g. [ChHa88, BKW80, Vi96a, Vi02c] or [Ro87] and the references therein). Nevertheless, the surprising findings of Hettmansperger and Sheather in 1992 [HeSh92] for the Engine Knock Data (see [MGH89]) considerably increased the interest in them. In [HeSh92] the reasons for the inherent instability of high-breakdown-point robust procedures with respect to a (small) shift of an observation were first thoroughly discussed (see also [Vi96b, Vi00a]). Their study also revealed the crucial role of the reliability of the algorithm and of its implementation. When the algorithm proposed by Boček and Lachout [BoLa95] (it is based on a procedure similar to the simplex method; the implementation is available on request from the author of the present paper) later found an alternative to Hettmansperger and Sheather's Least-Median-of-Squares model (LMS), see [Vi94] - an alternative with a smaller value of the minimized 11th order statistic among the squared residuals and a considerably different model - it became fully clear that studies of robust estimators cannot be restricted to the classical statistical properties.

Since we usually do not know the precise value of a robust estimator, even for simulated data (the estimator is typically given as a solution of an extremal problem), verifying the reliability of an algorithm (and of its implementation) requires inventing a "trick" which gives at least a reasonable hope that the computed value is a tight approximation to the precise one. For the Boček-Lachout algorithm such hope may stem from the fact that it gave a smaller value of the minimized functional than any other available algorithm (e.g. in PROGRESS or S-PLUS, see [Vi96b, Vi00a]). It gave an even smaller value of the h-th order statistic among the squared residuals than the exact Least-Trimmed-Squares model (LTS); for data sets for which we are able to evaluate the LS-models for all subsamples of size h, we can establish the exact LTS-model. The tables below offer examples of such data sets.

Now let us introduce the notation and recall some definitions. Let N denote the set of all positive integers, R the real line and R^p the p-dimensional Euclidean space. The linear regression model

    Y_i = X_i'β^0 + e_i = Σ_{j=1}^p X_ij β_j^0 + e_i,   i = 1, 2, ..., n                    (1)

will be considered (all vectors throughout the paper are assumed to be column vectors). For any β ∈ R^p, r_i(β) = Y_i − X_i'β denotes the i-th residual and r²_(h)(β) the h-th order statistic among the squared residuals, i.e. we have

    r²_(1)(β) ≤ r²_(2)(β) ≤ ... ≤ r²_(n)(β).                                               (2)
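To make the notation in (2) concrete, here is a minimal computational sketch (in Python/NumPy; the function name and arguments are illustrative, not part of the original paper):

```python
import numpy as np

def ordered_squared_residuals(X, Y, beta):
    """Order statistics r2_(1) <= ... <= r2_(n) of eq. (2)."""
    r = Y - X @ beta           # residuals r_i(beta) = Y_i - X_i' beta
    return np.sort(r ** 2)     # squared residuals in non-decreasing order
```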
2 Definitions of estimators - LMS and LTS

Definition 1. Let [n/2] + [(p + 1)/2] ≤ h ≤ n. Then

    β̂^(LMS,n,h) = arg min_{β∈R^p} r²_(h)(β)   and   β̂^(LTS,n,h) = arg min_{β∈R^p} Σ_{i=1}^h r²_(i)(β)        (3)

are called the Least Median of Squares and the Least Trimmed Squares estimators, respectively. Let, for any n ∈ N, 1 = w_1 ≥ w_2 ≥ ... ≥ w_n ≥ 0, w_i ∈ [0, 1], be some weights. Then

    β̂^(LWS,n,w) = arg min_{β∈R^p} Σ_{i=1}^n w_i r²_(i)(β)                                  (4)

is called the Least Weighted Squares estimator (see [Vi01] and also [Vi02a]).

Now, following [HaSi67], let us define the ranks of the squared residuals: for any i ∈ {1, 2, ..., n} put

    π(β, i) = j ∈ {1, 2, ..., n}  ⇔  r_i²(β) = r²_(j)(β).                                  (5)

Then, writing the weights as w_i = w(n⁻¹(i − 1)) for a weight function w: [0, 1] → [0, 1], we arrive at

    β̂^(LWS,n,w) = arg min_{β∈R^p} Σ_{i=1}^n w(n⁻¹(π(β, i) − 1)) r_i²(β).                   (6)

It is then easy to show that β̂^(LWS,n,w) is (one of) the solution(s) of the normal equations

    Σ_{i=1}^n w(n⁻¹(π(β, i) − 1)) X_i (Y_i − X_i'β) = 0.                                   (7)

For the consistency, asymptotic normality and optimality of the weights see e.g. [Ma03, Ma04b, Pl04, Vi02a].

If IE{e_i | X_i} ≠ 0, the (Ordinary) Least Squares estimator is generally biased and inconsistent, as the following relations show:

    β̂^(OLS,n) = β^0 + ((1/n) Σ_{k=1}^n X_k X_k')⁻¹ (1/n) Σ_{i=1}^n X_i e_i   and   plim_{n→∞} (1/n) Σ_{i=1}^n X_i e_i = IE X_1 e_1.     (8)

The classical theory then offers the method of Instrumental Variables (see e.g. [CRS95, JGHLL85]), which defines the estimator as the solution of the normal equations Z'(Y − Xβ) = 0. (An alternative approach, suitable especially for "exact" sciences, is known as Total Least Squares, see Van Huffel, S.: Total least squares and error-in-variables modelling. In: Antoch, J. (ed) COMPSTAT'2004, 539 - 555. Physica-Verlag, Heidelberg (2004).) We are going to recall a robustified version of the Instrumental Variables.

Definition 2. For any sequence of random vectors {Z_i}_{i=1}^∞ ⊂ R^p, the solution(s) of the normal equations

    Σ_{i=1}^n w(n⁻¹(π(β, i) − 1)) Z_i (Y_i − X_i'β) = 0                                    (9)

will be called the Instrumental Weighted Variables estimator (see [Vi04]) and denoted by β̂^(IWV,n,w). For the consistency, asymptotic normality and the Bahadur representation see [Vi06]; the algorithm will be discussed in the rest of the paper.
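As a computational reading of (6) (a minimal sketch assuming a vectorized weight function w as above; all names are illustrative):

```python
import numpy as np

def lws_objective(X, Y, beta, w):
    """Objective in (6): sum of w((pi(beta,i)-1)/n) * r_i^2(beta),
    pi(beta, i) being the rank of r_i^2 among the squared residuals."""
    r2 = (Y - X @ beta) ** 2
    n = len(r2)
    ranks = np.argsort(np.argsort(r2)) + 1      # pi(beta, i) in 1..n
    return np.sum(w((ranks - 1) / n) * r2)

# With w(t) = 1 for t < h/n and 0 otherwise, (6) reduces to the LTS
# objective in (3); smoother choices of w give the general LWS.
```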
3 Algorithm for LTS - heuristics, history and verification

Let us return to the above-promised tables. Before that, let us recall two things. First, Hettmansperger and Sheather initially found the LMS-model for "contaminated" data - the "Air/Fuel ratio" of the 2nd case was inadvertently typed as 15.1; later they considered the corrected data with "Air/Fuel ratio" equal to 14.1. Second, owing to fast processors we are nowadays able to find the exact LTS-model, in a reasonable time, for data sets with up to about 25 observations (this amounts to evaluating about 25 × 10^6 LS-models, which is possible in a few minutes, e.g. in MATLAB). Nevertheless, for data sets containing, say, more than 40 cases we need an algorithm for evaluating procedures like the LTS. Such an algorithm was described and tested in [Vi96b, Vi00a] (recently a similar algorithm appeared in [HaOl]); let us call it "Iterative LTS". Its slight modification for the LWS will be recalled below. In the tables, PRO-LMS and Bo-La-LMS stand for the LMS-models found by PROGRESS (by Peter Rousseeuw and Annick Leroy) and by the Boček-Lachout algorithm, respectively.

Table 1. Engine Data (Air/Fuel 14.1, n = 16, p = 5, h = 11); p is the dimension of the data including the intercept.

    Method            PRO-LMS   Bo-La-LMS   Exact LTS   Iterative LTS
    11th order stat.  0.3221    0.22783     0.3092      0.3092
    Sum of squares    0.4239    0.3575      0.2707      0.2707

Table 2. Engine Data (Air/Fuel 15.1, n = 16, p = 5, h = 11)

    Method            PRO-LMS   Bo-La-LMS   Exact LTS   Iterative LTS
    11th order stat.  0.5729    0.4506      0.5392      0.5392
    Sum of squares    1.0481    1.432       0.7283      0.7283

Table 3. Stackloss Data (n = 21, p = 4, h = 12). The data were first considered by [Br65], later by e.g. [RoLe87, ChHa88]; there is nowadays a huge literature on them.

    Method            PRO-LMS   Bo-La-LMS   Exact LTS   Iterative LTS
    11th order stat.  0.6640    0.5321      0.7014      0.7014
    Sum of squares    2.4441    1.9358      1.6371      1.6371

First, notice that in all tables the h-th order statistic among the squared residuals is smaller for "Bo-La-LMS" than for "PRO-LMS" as well as for "Exact LTS". The same has happened in all data samples we have analyzed since 1992. This gives a strong hope that the algorithm and its implementation (available on request, in MATLAB or in MATHEMATICA) are good. Second, the sum of the h smallest order statistics among the squared residuals is smaller for "Exact LTS" than for "Bo-La-LMS" as well as for "PRO-LMS". This of course has to be so, and hence it only confirms that the search through all subsamples of size h is probably implemented correctly. Finally, the values for "Iterative LTS" are the same as those for "Exact LTS" (and they were evaluated in a much shorter time), which raises the hope that the iterative algorithm (and its implementation) may be rather good. Let us consider larger data samples.

Table 4. Demographical Data (n = 49, p = 7, h = 28). The data were first considered by [GuMa80], later by e.g. [ChHa88].

    Method            PRO-LMS   Bo-La-LMS   Iterative LTS
    27th order stat.  131.50    95.38       104.20
    Sum of squares    134260    132340      64159

Table 5. Educational Data (n = 50, p = 4, h = 27). The data were first considered by [ChPr77], later by e.g. [RoLe87, ChHa88, PoKo97, At94], to give some among many others.

    Method            PRO-LMS   Bo-La-LMS   Iterative LTS
    27th order stat.  19.3562   16.63511    19.0378
    Sum of squares    3605.5    3728.6      3414.5

We can observe again that the h-th order statistic among the squared residuals is smaller for "Bo-La-LMS" than for "PRO-LMS" as well as for "Iterative LTS" (the number of observations does not allow us to obtain the "Exact LTS" by searching through all subsamples of size 28 and 27, respectively). Notice that, again, the sum of the h smallest order statistics among the squared residuals is smaller for "Iterative LTS" than for "Bo-La-LMS" as well as for "PRO-LMS". This may strengthen our belief that the algorithm works well; for other examples of results see [Vi96b, Vi00a].
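For small n, the exact LTS-model used as the benchmark above can be obtained by brute force: fit the LS-model on every subsample of size h and keep the best fit. A minimal sketch of this search over all C(n, h) subsamples (illustrative code, not the implementation used for the tables):

```python
import numpy as np
from itertools import combinations

def exact_lts(X, Y, h):
    """Exact LTS: LS on every h-subsample, keep the coefficient vector
    minimizing the sum of the h smallest squared residuals (eq. (3))."""
    best_beta, best_sum = None, np.inf
    for idx in combinations(range(len(Y)), h):
        idx = list(idx)
        beta = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
        s = np.sort((Y - X @ beta) ** 2)[:h].sum()
        if s < best_sum:
            best_sum, best_beta = s, beta
    return best_beta, best_sum
```

For n around 25 this already means evaluating millions of LS-models, which is the practical limit mentioned above.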
4 LWS - comparison with Chatterjee & Price

Now let us give at least some reasons why we have used the idea of implicit weighting for the robustification of the Instrumental Variables. (By implicit weighting we mean the process of assigning the weights to the order statistics of the squared residuals rather than to the residuals themselves; this is the idea the LWS is based on.) By employing the implicit weighting in the definition of the LWS we got rid of the (general) instability of the LTS (the LTS is of course a special case of the LWS) and attained a higher flexibility, or even adaptivity, to the data. Moreover, when processing panel data we cannot trim off any observation completely without possibly damaging the autocorrelation structure of the disturbances and/or of the explanatory variables, i.e. the LTS or the LMS cannot be used. Finally, since the LWS, in contrast to the classical Weighted Least Squares, accommodates implicitly to the magnitude of the residuals, it is able to give more "appropriate" models, especially for heteroscedastic data. For some examples with generated data with various levels of heteroscedasticity see [Pl03].

We will give an example employing the Educational Data (which we have already used; they were first considered by Chatterjee and Price in 1977, see [ChPr77]). We will consider the results from the 2nd edition of [ChPr77], paragraph 5.6. The data record, for the 50 states of the U.S., information about the per capita expenditure on education projected for 1975 (SE75 - response variable), the per capita income in 1973 (PI73), the number of residents per thousand under 18 years of age in 1974 (Y74) and the number of residents per thousand living in urban areas in 1970 (URB70) (explanatory variables). Classical diagnostic tools (Cook's distance and DFITS, originally proposed by [WeKu77], see also [ChPr77], 2nd ed., p. 86) indicated Alaska as an outlier (any robust processing of the data - L1, M-estimators with various ψ-functions, LMS, LTS and LWS - found Alaska to be a very significant outlier, see also [RoLe87]). Chatterjee and Price then applied the LS on the whole data set except Alaska, divided the states into 4 groups (Northeast, North central, South and West), and evaluated for each group the sum of squared residuals s²_A, A = N, NC, S, W, together with the total sum of squared residuals s² (see Table 5.5 on page 137 of the 2nd edition). Finally, they defined the weights for the cases in the respective group as w_A = s/s_A, A = N, NC, S, W. In the next table w*_A denotes the normalized weights (scaled so that the largest one is 1) and d_A the codes of the observations in group A (they are the same for all cases in the respective group, e.g. d_2 = 1, d_21 = 2, d_34 = 3, etc.; we will need them later).

Table 6. Chatterjee-Price framework

    Group of States   Number of cases   s²_A/s   w_A     w*_A    d_A
    South             16                0.383    2.611   1       1
    West              12                0.794    1.259   0.482   2
    Northeast          9                1.110    0.901   0.348   3
    North central     12                1.433    0.698   0.267   4
    Alaska             1                –        0       0       5

Having assigned the weights, Chatterjee and Price employed the classical Weighted Least Squares (for the results see below or Table 5.6 of [ChPr77]). Using their estimates of the coefficients, we can reconstruct the residuals in their model, and by ordering their absolute values (increasingly) we obtain their ranks, say R(i). Now, we may expect that these ranks will have the smallest values for the observations from the group (S), as their residuals were expected to be the smallest in absolute value and hence they obtained the largest weight 2.611, etc. Let us put d*(i) = 1 for i = 1, 2, ..., 16; d*(i) = 2 for i = 17, 18, ..., 28; d*(i) = 3 for i = 29, 30, ..., 37; d*(i) = 4 for i = 38, 39, ..., 49; and d*(50) = 5. Then the statistic

    D = Σ_{i=1}^{50} |d_i − d*(R(i))|

indicates how far our expectation - that the observations from (S) would have the smallest residuals, etc. - has been justified. We obtained D = 35, which indicates (as the total number of observations is 50) that roughly two thirds of the observations have somewhat inappropriate weights.
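The statistic D is straightforward to evaluate; a small sketch (the vector res of reconstructed WLS residuals and the group codes d are assumed to be given, coded as in Table 6):

```python
import numpy as np

def discrepancy_D(res, d):
    """D = sum_i |d_i - d*(R(i))|; R(i) is the rank of |res_i|."""
    # d*(j): group code expected for the j-th smallest |residual|
    # (16 x South, 12 x West, 9 x Northeast, 12 x North central, Alaska)
    d_star = np.repeat([1, 2, 3, 4, 5], [16, 12, 9, 12, 1])
    R = np.argsort(np.argsort(np.abs(res))) + 1   # ranks 1..50
    return int(np.sum(np.abs(np.asarray(d) - d_star[R - 1])))
```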
Finally, we applied the LWS with the weights w*_A to the Educational Data. In the next table the results of both analyses are collected. We have added the sample variance of the residuals and the coefficient of determination, evaluated for the data multiplied by the square root of the weights (for the idea see PROGRESS or [RoLe87], e.g. p. 490). The approach by Rousseeuw and Leroy seems to be more appropriate than the approach by Chatterjee and Price (see [ChPr77], p. 137), which proposes to consider all squared residuals without weighting them; according to the latter, for the LMS or the LTS the residuals of (sometimes evidently) contaminated observations would also be included in the respective sums of squares. (σ̂²_r and R²_0 denote the sample variance of the residuals and the coefficient of determination, respectively.)

Table 7. Educational Data

    Method                 Intercept   PI73    Y74     URB70   σ̂²_r     R²_0
    Chatterjee-Price-WLS   -320.1      0.027   0.063   0.877   377.55   0.890
    LWS                    -306.8      0.066   0.052   0.926   370.09   0.909

5 Algorithm for LWS and IWV

Let us now describe the algorithm for the LWS and then for the IWV. As we have already said, the algorithm for the LWS is a slight modification of the algorithm for the LTS, which was tested in [Vi96b] and [Vi00a] and also implemented in XploRe, see [CiVi00]. Let w_1 = 1 ≥ w_2 ≥ ... ≥ w_n ≥ 0 be the vector of weights and denote by W = diag{w_1, w_2, ..., w_n} the corresponding diagonal matrix. Further, select some maximal number of iterations of the algorithm, say MAX, and a minimal number of models to be kept, say MIN. (A compact sketch of the whole procedure follows after step D.)

A  Select randomly p observations (p is the number of explanatory variables, including the intercept, if any) and evaluate the (regression) plane going through them (provided it is given uniquely; otherwise repeat the selection of observations). This gives a starting estimate of the regression coefficients, say β̂^(WLS,n,W)_(I). Evaluate the squared residuals of all observations, r_i²(β̂^(WLS,n,W)_(I)) = (Y_i − X_i'β̂^(WLS,n,W)_(I))², i = 1, 2, ..., n, and establish their order statistics r²_(i)(β̂^(WLS,n,W)_(I)) (see (2)). Then find the sum of the weighted order statistics

    S_(0)(β̂^(WLS,n,W)_(I)) = Σ_{i=1}^n w_i r²_(i)(β̂^(WLS,n,W)_(I)).

B  Reorder the observations according to the order statistics from the previous step and apply the Weighted Least Squares, i.e. put

    β̂^(WLS,n,W)_(t) = (X'_(t−1) W X_(t−1))⁻¹ X'_(t−1) W Y_(t−1),

where X_(t−1) is the design matrix containing X_1, X_2, ..., X_n but in the order given by the order statistics from the previous step; similarly, Y_(t−1) is the vector of response variables with coordinates Y_1, Y_2, ..., Y_n, again in the order given by the order statistics from the previous step.

C  Evaluate the squared residuals r_i²(β̂^(WLS,n,W)_(t)) = (Y_i − X_i'β̂^(WLS,n,W)_(t))², i = 1, 2, ..., n, and establish their order statistics, i.e.

    r²_(1)(β̂^(WLS,n,W)_(t)) ≤ r²_(2)(β̂^(WLS,n,W)_(t)) ≤ ... ≤ r²_(n)(β̂^(WLS,n,W)_(t)),

and find the sum of the weighted order statistics

    S_(t)(β̂^(WLS,n,W)_(t)) = Σ_{i=1}^n w_i r²_(i)(β̂^(WLS,n,W)_(t)).

If S_(t)(β̂^(WLS,n,W)_(t)) < S_(t−1)(β̂^(WLS,n,W)_(t−1)), go to B. Otherwise, denote the last sum of the weighted order statistics by S_(final)(β̂^(WLS,n,W)_(final)) and go to D.

D  Keep in memory the MIN best previous models, each of them found by the repetitions of steps B and C and ordered according to their S_(final)(β̂^(WLS,n,W)_(final)). If the model just found in the previous B-C cycle has S_(final)(β̂^(WLS,n,W)_(final)) smaller than some model among the MIN models kept in memory, include it at the appropriate place and restrict the number of saved models again to MIN. If all models kept in memory are the same, stop the evaluation and return this model as the solution. If the algorithm has already passed MAX times through A, stop and return the model, among the MIN models kept in memory, with the minimal sum S_(final)(β̂^(WLS,n,W)_(final)). Otherwise, go to A.
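The promised sketch of steps A-D (a simplified Python reimplementation written against the description above, not the author's MATLAB/MATHEMATICA code; comparing the kept models only through their S-values in the stopping test is a simplification):

```python
import numpy as np

def iterative_lws(X, Y, w, MAX=100, MIN=10, seed=0):
    """Iterative LWS, steps A-D: random p-point starts (A), repeated
    reorder-and-reweight WLS refinement (B-C), pool of MIN best models (D)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    kept = []                                     # (S_final, beta), sorted by S
    for _ in range(MAX):
        # A: p random observations define the starting plane
        idx = rng.choice(n, size=p, replace=False)
        beta = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
        S_prev = np.sum(w * np.sort((Y - X @ beta) ** 2))
        while True:
            # B: reorder observations by squared residuals, apply WLS
            order = np.argsort((Y - X @ beta) ** 2)
            Xo, Yo = X[order], Y[order]
            beta_new = np.linalg.solve(Xo.T @ (w[:, None] * Xo),
                                       Xo.T @ (w * Yo))
            # C: weighted sum of the ordered squared residuals
            S = np.sum(w * np.sort((Y - X @ beta_new) ** 2))
            if S < S_prev:
                S_prev, beta = S, beta_new        # improved: back to B
            else:
                break                             # S_final reached: go to D
        # D: maintain the pool of the MIN best models
        kept = sorted(kept + [(S_prev, beta)], key=lambda m: m[0])[:MIN]
        if len(kept) == MIN and np.isclose(kept[0][0], kept[-1][0]):
            break                                 # all kept models coincide
    return kept[0][1]                             # model with minimal S_final
```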
The algorithm for the IWV evaluates in the steps B and C

    β̂^(IWV,n,W)_(t) = (Z'_(t−1) W X_(t−1))⁻¹ Z'_(t−1) W Y_(t−1)

(where Z_(t−1) is the matrix of instruments, reordered in the same way as X_(t−1)) and

    S_(t)(β̂^(IWV,n,W)_(t)) = (Y_(t) − X_(t) β̂^(IWV,n,W)_(t))' W Z_(t) Z'_(t) W (Y_(t) − X_(t) β̂^(IWV,n,W)_(t))

instead of β̂^(WLS,n,W)_(t) and S_(t)(β̂^(WLS,n,W)_(t)), respectively, since one of the solutions of the normal equations (9) minimizes (Y − Xβ)' W Z Z' W (Y − Xβ) in β ∈ R^p, see [JGHLL85, Vi06]. All other steps are the same as in the algorithm for the LWS.
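For illustration, only the inner refinement step of the previous sketch changes; a hedged sketch (assuming the number of instruments in Z equals p; re-sorting before evaluating S mirrors step C, where the weights are reassigned according to the new ordering):

```python
import numpy as np

def iwv_step(X, Y, Z, w, beta):
    """One B-C step of the IWV algorithm: solve Z'WX b = Z'WY on the
    reordered data, then evaluate S = (Y-Xb)'WZZ'W(Y-Xb)."""
    order = np.argsort((Y - X @ beta) ** 2)       # B: reorder by sq. residuals
    Xo, Yo, Zo = X[order], Y[order], Z[order]     # instruments reordered alike
    beta_new = np.linalg.solve(Zo.T @ (w[:, None] * Xo), Zo.T @ (w * Yo))
    # C: reassign the weights by the new ordering and evaluate the criterion
    new_order = np.argsort((Y - X @ beta_new) ** 2)
    g = Z[new_order].T @ (w * (Y - X @ beta_new)[new_order])   # Z'W(Y - Xb)
    return beta_new, float(g @ g)                 # S = (Y-Xb)'WZZ'W(Y-Xb)
```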
References

[At94]      Atkinson, A.C.: Fast very robust methods for the detection of multiple outliers. JASA, 89, 1329 - 1338 (1994)
[BKW80]     Belsley, D.A., Kuh, E., Welsch, R.E.: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, NY (1980)
[BoLa95]    Boček, P., Lachout, P.: Linear programming approach to LMS-estimation. Mem. Vol. Comput. Statist. & Data Analysis, 19, 129 - 134 (1997)
[Br65]      Brownlee, K.A.: Statistical Theory and Methodology in Science and Engineering. 2nd ed., Wiley, NY (1965)
[CRS95]     Carroll, R.J., Ruppert, D., Stefanski, L.A.: Measurement Error in Nonlinear Models. Chapman & Hall/CRC, NY (1995)
[ChHa88]    Chatterjee, S., Hadi, A.S.: Sensitivity Analysis in Linear Regression. Wiley, NY (1988)
[ChPr77]    Chatterjee, S., Price, B.: Regression Analysis by Example. 1st ed., Wiley, NY (1977); 2nd ed., Wiley, NY (1991)
[CiVi00]    Čížek, P., Víšek, J.Á.: The least trimmed squares. In: Härdle, W. (ed) User Guide of XploRe (2000)
[GuMa80]    Gunst, R.F., Mason, R.L.: Regression Analysis and Its Application: A Data-Oriented Approach. Marcel Dekker, NY (1980)
[HaSi67]    Hájek, J., Šidák, Z.: Theory of Rank Tests. Academic Press, NY (1967)
[HaOl]      Hawkins, D.M., Olive, D.J.: Improved feasible solution algorithms for breakdown estimation. CSDA, 30, 1 - 12 (1999)
[HeSh92]    Hettmansperger, T.P., Sheather, S.J.: A cautionary note on the method of least median squares. The American Statistician, 46, 79 - 83 (1992)
[JGHLL85]   Judge, G.G., Griffiths, W.E., Hill, R.C., Lütkepohl, H., Lee, T.C.: The Theory and Practice of Econometrics. Wiley, NY (1985)
[MGH89]     Mason, R.L., Gunst, R.F., Hess, J.L.: Statistical Design and Analysis of Experiments. Wiley, NY (1989)
[Ma03]      Mašíček, L.: Consistency of the least weighted squares estimator. In: Hubert, M. et al. (eds) ICORS 2003, 183 - 194. Birkhäuser (2003)
[Ma04b]     Mašíček, L.: Optimality of the least weighted squares estimator. Kybernetika, 40, 715 - 734 (2004)
[Pl03]      Plát, P.: The Least Weighted Squares. PhD thesis, Czech Technical University, Prague (2003)
[Pl04]      Plát, P.: The least weighted squares estimator. In: Antoch, J. (ed) COMPSTAT'2004, 1653 - 1660. Physica-Verlag, Heidelberg (2004)
[PoKo97]    Portnoy, S., Koenker, R.: The Gaussian hare and the Laplacian tortoise. Statistical Science, 12, 279 - 300 (1997)
[Ro87]      Ronchetti, E.: Bounded influence inference in regression: A review. In: Dodge, Y. (ed) Statistical Data Analysis Based on the L1-norm. North-Holland, NY (1987)
[RoLe87]    Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley, NY (1987)
[Vi94]      Víšek, J.Á.: A cautionary note on the method of the least median of squares reconsidered. In: Lachout, P., Víšek, J.Á. (eds) Trans. Twelfth Prague Conf., 254 - 259, Prague (1994)
[Vi96a]     Víšek, J.Á.: Sensitivity analysis of M-estimates. Ann. Inst. Statist. Math., 48, 469 - 495 (1996)
[Vi96b]     Víšek, J.Á.: On high breakdown point estimation. Computational Statistics, 11, 137 - 146 (1996)
[Vi00a]     Víšek, J.Á.: On the diversity of estimates. CSDA, 34, 67 - 89 (2000)
[Vi01]      Víšek, J.Á.: Regression with high breakdown point. In: Antoch, J., Dohnal, G. (eds) Robust 2000, 324 - 356. UCMP, Prague (2001)
[Vi02a]     Víšek, J.Á.: LWS, asymptotic linearity, consistency, asymptotic normality. Bull. of the Czech Econometric Soc., 9/15, 31 - 58 and 9/16, 1 - 28 (2002)
[Vi02c]     Víšek, J.Á.: Sensitivity analysis of M-estimates of regression model: Influence of data subsets. Ann. Inst. Statist. Math., 54, 261 - 290 (2002)
[Vi04]      Víšek, J.Á.: Robustifying instrumental variables. In: Antoch, J. (ed) COMPSTAT'2004, 1947 - 1954. Physica-Verlag, Heidelberg (2004)
[Vi06]      Víšek, J.Á.: Instrumental weighted variables. In: Jurečková, J. (ed) Proc. Conf. on Perspectives in Modern Statistical Inference III, Springer (2006)
[WeKu77]    Welsch, R.E., Kuh, E.: Linear regression diagnostics. Technical Report 923-77, Sloan School of Management, Cambridge, Massachusetts (1977)