Running head: ROBUST REGRESSION DOWNPLAYING OUTLIERS

Robust Regression as a Means of Downplaying the Effect of Outliers

Megan Oliphint
Southern Methodist University

Paper presented at the annual meeting of the Southwest Educational Research Association, San Antonio, TX, February 2-4, 2011.

Robust Regression as a Means of Downplaying Outliers

Educational researchers are often faced with less than ideal situations involving non-normal data, in particular the presence of outliers. Typically, researchers must decide whether to let outliers violate the assumptions of ordinary least squares (OLS) regression or simply to delete those points. Because both remedies have inherent flaws, robust regression offers a compromise. The purpose of this paper is to present and illustrate more appropriate alternatives for handling outliers in regression analysis.

Concepts Underlying Robust Regression

Robust regression can be used in any situation in which OLS would be used. However, the researcher must be sure that outliers are truly outliers: that they are not due to data entry errors, do not come from a different population, and do not otherwise have a justification for removal. Considerable caution should be taken in removing or choosing to ignore outliers. As Fox (1997) noted:

It is important to investigate why an observation is unusual. Truly bad data can often be corrected, or if correction is not possible, thrown away. When a discrepant data point is correct, we may be able to understand why the data is unusual… Except in clear-cut cases, we are justifiably reluctant to delete observations or to respecify the model to accommodate unusual data. (p. 285)

With that caveat, robust regression accommodates outliers through robust estimation techniques, which lessen the influence of outliers by reducing their weights when the coefficients are estimated (Anderson & Schumacker, 2003). To understand robust regression, however, it is important to first recognize that where outliers occur affects how they behave and, in turn, their effect on the OLS regression line. Three terms are used to categorize the impact of outliers on the regression coefficient estimates: leverage, discrepancy, and influence. The first, leverage, measures the distance of an observation's independent variable value from the mean of that variable. A point with high leverage lies a considerable distance from the other data points, but leverage takes no account of distance from the regression line or of direction. The second, discrepancy, describes how far a point is from the regression line. Points with high discrepancy are far from both the regression line and the other data points. The third, influence, is the effect of removing the outlier, shown by the change in the regression coefficients once the outlying point is removed. It is a product of leverage and discrepancy and is often measured by Cook's distance (Anderson & Schumacker, 2003).

Along with how outliers function, their location is also important in assessing which technique to use. Outliers in the x-axis direction exert more influence on the regression coefficient estimates than outliers in the y-axis direction, which have only a minimal impact (Anderson & Schumacker, 2003). Consequently, most well-known robust regression techniques (such as Huber's M-estimation and the least absolute deviation method) are best suited to outliers in the y-axis direction.
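In practice, these three diagnostics can be computed directly in base R before deciding how to treat an outlier. The following is a minimal sketch, assuming a hypothetical data frame named d with an outcome y and a predictor x; the variable names and the Cook's distance cutoff are illustrative rather than part of any analysis reported here.

# Sketch: quantifying leverage, discrepancy, and influence for an OLS fit
# ('d', 'y', and 'x' are hypothetical placeholders)
fit <- lm(y ~ x, data = d)

leverage    <- hatvalues(fit)       # how far each x value lies from the bulk of the x's
discrepancy <- rstudent(fit)        # studentized residuals: distance from the fitted line
influence   <- cooks.distance(fit)  # combines leverage and discrepancy

# Flag cases exceeding a common rule-of-thumb cutoff for Cook's distance
which(influence > 4 / nrow(d))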
Knowing this information helps researchers determine which of the available robust regression techniques is appropriate.

Robust Regression Techniques

In handling outliers in regression analysis, Cohen, Cohen, West, and Aiken (2003) identify four alternatives to OLS estimation of the regression coefficients: least absolute deviation, least trimmed squares, M-estimation, and bounded influence estimators. These four techniques are described below, followed by routines for running each robust regression in R.

Least Absolute Deviation

The least absolute deviation method (LAD), also known as L1, least absolute value, or least absolute errors, minimizes the sum of the absolute errors. Because LAD works with the absolute values of the residuals rather than the squared residuals used in standard OLS, large residuals exert far less influence on the fit (Anderson & Schumacker, 2003). The LAD method chooses values for the regression coefficients (b) that minimize the sum of the absolute differences between each person's actual score and predicted score. It is shown mathematically as:

$$\text{Minimize} \sum_{i=1}^{n} \left| e_i(b) \right|.$$

There are both strengths and weaknesses associated with the LAD technique. For data with outliers in the y-axis direction, the LAD estimator can be robust. However, the same is not true for outliers in the x-axis direction. Because LAD is neither a high-breakdown-point estimator nor a bounded influence estimator, a single outlying data point can pull the regression line through that point (Anderson & Schumacker, 2003), which often produces inconsistent results. For this reason, LAD is not recommended when there is a single outlying case with high influence. LAD does, however, work well in cases with high discrepancy.

Least Trimmed Squares

Another robust regression technique, least trimmed squares (LTS), lets the researcher choose the amount (or proportion) of data to be "trimmed." Unlike LAD, which minimizes the absolute values of the residuals, or OLS, which minimizes all of the squared residuals, LTS orders the squared residuals and minimizes only the smallest of them. This is shown mathematically as:

$$\text{Minimize} \sum_{i=1}^{h} r_{(i)}^{2},$$

where $r_{(i)}^{2}$ denotes the squared residuals ordered from least to greatest and h, determined a priori, is the number of ordered squared residuals retained in the analysis (the remaining cases make up the proportion excluded, or trimmed). With a breakdown point of up to .50, LTS is considered a high-breakdown-point estimator. LTS works well in general situations; however, there are cases in which it is not recommended. For instance, if the chosen trimming proportion removes exactly the outlying points, the results are computationally the same as running OLS on the remaining data. The method is also ineffective if fewer cases are trimmed than there are outlying data points, while if more cases are trimmed than there are outliers, an argument can be made that "good" data are being excluded. Because of this, LTS is considered subjective, and it provides biased results when outliers occur in clumps (Cohen, Cohen, West, & Aiken, 2003). This is shown graphically in Figure 1 using a heuristic data set containing 20% outliers with a proportion of 1/2 trimmed. The green line represents the robust regression line produced by the LTS method; the black line represents the OLS regression line.
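A comparison of this kind can be reproduced with the minimal R sketch below; the simulated data are illustrative only and are not the heuristic data set used for Figure 1.

# Sketch: OLS vs. LAD vs. LTS on simulated data with roughly 20% outliers
library(MASS)      # ltsreg()
library(quantreg)  # rq()

set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[1:10] <- y[1:10] + 15              # contaminate 20% of the cases in the y direction

fit.ols <- lm(y ~ x)
fit.lad <- rq(y ~ x)                 # least absolute deviation (tau = 0.5)
fit.lts <- ltsreg(y ~ x)             # least trimmed squares, default trimming

plot(x, y)
abline(fit.ols, col = "black")
abline(coef = coef(fit.lad), col = "red")
abline(coef = coef(fit.lts), col = "green")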
Figure 1: Least Trimmed Squares Method Using Heuristic Data

M-estimation

Developed by Huber (1973, 1981), the third robust regression technique, M-estimation, uses iteratively reweighted least squares to calculate the coefficients in the regression analysis, with the "M" signifying maximum likelihood. At each iteration, the procedure computes a new set of weights from the current residuals, refits the weighted regression, and obtains new residuals on which the next set of weights is based. This process continues until the change in the parameter estimates is sufficiently small, a specified number of iterations has occurred, or some other convergence criterion is met (Berk, 1990). M-estimation uses this variation of weighted least squares regression to downplay outliers: larger residuals receive smaller weights, while smaller residuals are weighted relatively more heavily. This is shown mathematically as:

$$\text{Minimize} \sum_{i=1}^{n} \rho(r_i),$$

where $\rho$ is a symmetric function with a unique minimum at zero. With a breakdown point of 1/n, M-estimation works well in cases with high discrepancy. However, it is not resistant to outliers in the x-axis direction, that is, in the explanatory variables, and it does not perform well in cases with both high leverage and high discrepancy. When the errors are normally distributed, M-estimators produce regression coefficients similar to those of OLS; when the error distribution is heavy-tailed, M-estimators are considerably more robust and are considered more statistically efficient than the LAD method (Anderson & Schumacker, 2003). This is shown in Figure 2, with the blue line representing the M-estimation regression line, the red line representing the least absolute deviation regression line, and the black line representing the OLS regression line on heuristic data with 20% outlying cases.

Figure 2: M-estimation Compared to Least Absolute Deviation

Bounded Influence Estimators

The bounded influence robust regression technique, also known as GM-estimation, was developed to address the weakness of M-estimation techniques with respect to data outlying in the x-axis direction. Bounded influence estimators choose weights based on both leverage and discrepancy (Cohen, Cohen, West, & Aiken, 2003). To downplay outliers, they bound the contribution each scaled residual can make to the fit rather than minimizing the raw sum of squared residuals used in OLS (Anderson & Schumacker, 2003). The bounded influence estimator is shown mathematically as:

$$\sum_{i=1}^{n} \psi\!\left(\frac{y_i - x_i\hat{\beta}}{s}\right) x_i = 0,$$

where $\psi$ bounds the contribution of each scaled residual and $s$ is an estimate of the residual scale. Bounded influence estimators perform well in many situations (including outliers in the x-axis direction) but give poor estimates when outliers occur in clumps. They also perform poorly with outliers that have high leverage but small error terms.

Robust Regression Techniques in R

Using the crime dataset from Statistical Methods for the Social Sciences, Third Edition (Agresti & Finlay, 1997), each robust regression technique is demonstrated to allow comparison of the methods. The OLS model predicting "crime" from "poverty" and "single" serves as the baseline regression. Figure 3 shows the normal Q-Q plot, in which data points 25 and 9 are identified as having high leverage as well as large residuals. Figure 4 shows the R output for the various models.
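Reproducing the output in Figure 4 requires the packages that supply the robust fitting functions: rq() comes from the quantreg package, while ltsreg() and rlm() come from MASS (rlm() with method = "MM" fits the MM-estimator used here for the bounded influence model). The setup sketch below assumes the crime data have already been saved locally as a file named crime.csv; the file name and the variable check are illustrative.

# Setup assumed by the code in Figure 4 (the file name is a placeholder)
library(quantreg)  # rq(): least absolute deviation / quantile regression
library(MASS)      # ltsreg(): least trimmed squares; rlm(): M- and MM-estimation

crime2 <- read.csv("crime.csv")      # should contain crime, poverty, and single
str(crime2[, c("crime", "poverty", "single")])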
Figure 3: QQNorm Plot for Crime Dataset

Figure 4: R Output for OLS and Robust Regression Techniques

> #### OLS MODEL ####
> m1<-lm(crime ~poverty + single, data=crime2)
> summary(m1)

Call:
lm(formula = crime ~ poverty + single, data = crime2)

Residuals:
    Min      1Q  Median      3Q     Max
-646.78 -106.42  -15.34  112.66  672.13

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -879.797    247.378  -3.556  0.00087 ***
poverty        7.591      8.416   0.902  0.37164
single       120.617     24.456   4.932 1.06e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 228 on 47 degrees of freedom
Multiple R-squared: 0.4306,    Adjusted R-squared: 0.4064
F-statistic: 17.77 on 2 and 47 DF,  p-value: 1.785e-06

> #### LAD MODEL ####
> m1.LAD<-rq(crime~poverty+single,data=crime2)
> summary(m1.LAD)

Call: rq(formula = crime ~ poverty + single, data = crime2)

tau: [1] 0.5

Coefficients:
            coefficients lower bd     upper bd
(Intercept)  -946.44050  -1440.29485  -647.25258
poverty        14.82265     -4.48777    27.97920
single        116.16705     91.46409   169.94039

> #### LTS MODEL ####
> m1.LTS<-ltsreg(crime~poverty+single,data=crime2)
> m1.LTS

Call:
lqs.formula(formula = crime ~ poverty + single, data = crime2, method = "lts")

Coefficients:
(Intercept)      poverty       single
    -840.09        23.00        97.61

Scale estimates 145.7 164.3

> #### M-ESTIMATION MODEL ####
> m1.huber<-rlm(crime ~poverty + single, data=crime2)
> summary(m1.huber)

Call: rlm(formula = crime ~ poverty + single, data = crime2)

Residuals:
    Min      1Q  Median      3Q     Max
-706.88  -96.04  -21.72  115.24  675.41

Coefficients:
            Value      Std. Error t value
(Intercept) -1046.8870   224.9294   -4.6543
poverty         8.9435     7.6522    1.1688
single        133.8002    22.2370    6.0170

Residual standard error: 165.8 on 47 degrees of freedom

> #### BOUNDED INFLUENCE MODEL ####
> m1.bi<-rlm(crime~poverty+single,data=crime2,method='MM')
> summary(m1.bi)

Call: rlm(formula = crime ~ poverty + single, data = crime2, method = "MM")

Residuals:
    Min      1Q  Median      3Q     Max
-753.43 -106.38  -27.81  113.17  670.16

Coefficients:
            Value      Std. Error t value
(Intercept) -1148.3514   223.4790   -5.1385
poverty        10.2836     7.6028    1.3526
single        141.6172    22.0936    6.4099

Residual standard error: 185.1 on 47 degrees of freedom

Discussion

Robust regression techniques often offer a viable alternative to simply deleting outlying data. However, as the four techniques discussed above show, there is considerable variability depending on the method chosen (a side-by-side comparison of the coefficient estimates from Figure 4 is sketched at the end of this section). The researcher must therefore make a best-case judgment about which technique to use and justify that choice accordingly. To do so, researchers should become familiar with the various robust regression techniques and their appropriateness for different types of data. For instance, the least trimmed squares method and bounded influence estimators perform poorly when outlying data occur in clumps. The least absolute deviation method and M-estimators perform poorly when outliers lie in the x-axis direction, and the least absolute deviation method is also very sensitive to a single outlying case with high influence. Least trimmed squares generally performs well; however, it involves removing a proportion of the data, which can be viewed as performing the analysis on different data than intended, and for this reason it is considered subjective. Knowing this information can help researchers decide whether to use robust regression and, if so, which technique.
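As a quick way to see this variability, the coefficient estimates from the five models fit in Figure 4 (m1, m1.LAD, m1.LTS, m1.huber, and m1.bi) can be collected into a single table. This is a minimal sketch that assumes the Figure 4 code has already been run.

# Side-by-side comparison of the coefficient estimates from the Figure 4 models
round(cbind(OLS   = coef(m1),
            LAD   = coef(m1.LAD),
            LTS   = coef(m1.LTS),
            Huber = coef(m1.huber),
            MM    = coef(m1.bi)), 2)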
The purpose of this paper has been to present and illustrate more appropriate alternatives for handling outliers in regression analysis. Robust regression accommodates outliers through robust estimation techniques. Four common robust regression methods are least absolute deviation, least trimmed squares, M-estimation, and bounded influence estimators. Considerable caution should be exercised, and justification provided, when choosing the appropriate robust regression method for a given pattern of outlying data.

References

Agresti, A., & Finlay, B. (1997). Statistical methods for the social sciences (3rd ed.). Upper Saddle River, NJ: Prentice Hall.

Anderson, C., & Schumacker, R. E. (2003). A comparison of five robust regression models with ordinary least squares regression: Relative efficiency, bias, and test of the null hypothesis. Understanding Statistics, 2(2), 79-103.

Berk, R. A. (1990). A primer on robust regression. In J. Fox & J. S. Long (Eds.), Modern methods of data analysis (pp. 292-323). Newbury Park, CA: Sage.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Fox, J. (1997). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.

Huber, P. J. (1973). Robust regression: Asymptotics, conjectures, and Monte Carlo. The Annals of Statistics, 1, 799-821.

Huber, P. J. (1981). Robust statistics. New York, NY: Wiley.