Running Head: ROBUST REGRESSION DOWNPLAYING OUTLIERS
Robust Regression as a Means of
Downplaying the Effect of Outliers
Megan Oliphint
Southern Methodist University
Paper presented at the annual meeting of the Southwest Educational Research Association, San
Antonio, TX, February 2-4, 2011.
Robust Regression as a Means of Downplaying Outliers
Educational researchers are often faced with less than ideal situations involving non-normal data, particularly the presence of outliers. Typically, researchers must decide whether to leave outliers in the data, where they violate the assumptions of ordinary least squares (OLS) regression, or simply to delete these points. Because both of these remedies have inherent flaws, robust regression provides a compromise. The purpose of this paper is to describe and illustrate more appropriate alternatives for handling outliers in regression analysis.
Concepts Underlying Robust Regression
Robust regression can be used in any situation in which OLS would be used. However, the researcher must be sure that outliers are truly outliers: that they are not the result of data entry errors, do not come from a different population, and do not otherwise have a justification for removal. Considerable caution should be taken before removing or choosing to ignore outliers. As Fox (1997) noted:
It is important to investigate why an observation is unusual. Truly bad data can often be
corrected, or if correction is not possible, thrown away. When a discrepant data point is
correct, we may be able to understand why the data is unusual… Except in clear-cut
cases, we are justifiably reluctant to delete observations or to respecify the model to
accommodate unusual data. (p. 285)
With that caveat in mind, robust regression accommodates outliers through estimation techniques that lessen their influence by reducing the weight outlying points receive when the coefficients are estimated (Anderson & Schumacker, 2003).
To understand robust regression, however, it is important to first understand that where outliers occur affects how they behave and, in turn, their effect on the OLS regression line. Three terms are used to categorize the impact of outliers on the regression coefficient estimates:
leverage, discrepancy, and influence. The first category, leverage, is a measure of the distance an independent variable value lies from the mean of that variable. A point with high leverage is a considerable distance from the other data points, but leverage does not consider distance from the regression line or its direction. The second category, discrepancy, is the distance between the point and the regression line; points with high discrepancy lie far from the fitted line. The third category, influence, is the effect of removing the outlier, shown by the change in the regression coefficients once the outlying point is removed. It is a product of leverage and discrepancy and is often measured as Cook's distance (Anderson & Schumacker, 2003).
Along with how outliers function, their location is also important in assessing which technique to use. For instance, outliers in the x-axis direction (i.e., on the explanatory variables) exert more influence than outliers in the y-axis direction, which have only a minimal impact on the regression coefficient estimates (Anderson & Schumacker, 2003). Consequently, most well-known robust regression techniques (such as Huber's M-estimation and the least absolute deviation method) are designed to handle outliers in the y-axis direction. Knowing this information helps researchers determine which of the available robust regression techniques is appropriate; the sketch below shows how the three diagnostic categories can be examined in R.
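The following is a minimal sketch of how leverage, discrepancy, and influence might be computed for a fitted OLS model in R. The data frame and variable names (mydata, achieve, ses) are hypothetical, and the cutoffs shown are common rules of thumb rather than fixed standards.

fit <- lm(achieve ~ ses, data = mydata)

leverage    <- hatvalues(fit)        # leverage: distance of each case's x value from the mean of x
discrepancy <- rstudent(fit)         # discrepancy: studentized residual, distance from the fitted line
influence   <- cooks.distance(fit)   # influence: Cook's distance, a product of leverage and discrepancy

which(leverage > 2 * mean(leverage))    # flag high-leverage cases
which(abs(discrepancy) > 2)             # flag high-discrepancy cases
which(influence > 4 / nrow(mydata))     # flag influential cases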
Robust Regression Techniques
In handling outliers in regression analysis, Cohen, Cohen, West, and Aiken (2003) identify four alternatives to OLS estimation of the regression coefficients: least absolute deviation, least trimmed squares, M-estimation, and bounded influence estimators. These four techniques are described below, followed by routines for running each robust regression in R.
Least Absolute Deviation
The least absolute deviation method (LAD), also known as L1, least absolute value, or least absolute errors, minimizes the sum of the absolute errors. Because LAD works with the absolute values of the residuals rather than their squares, an outlying point has much less influence than it does under standard OLS, which minimizes the squared residuals (Anderson & Schumacker, 2003). The LAD method chooses values for the regression coefficients (the betas) that minimize the differences between each person's actual score and his or her predicted score. It is shown mathematically as:

Minimize $\sum_{i=1}^{n} \lvert e_i(b) \rvert$,

where $e_i(b)$ is the residual for case $i$ given the coefficient estimates $b$.
There are several advantages and disadvantages associated with the LAD technique. For data with outliers in the y-axis direction, the LAD estimator can be robust. However, the same is not true for outliers in the x-axis direction. Because LAD is neither a high breakdown point estimator nor a bounded influence estimator, a single outlying data point with high leverage can cause the regression line to pass through that point (Anderson & Schumacker, 2003), which can produce inconsistent results. For this reason, LAD is not recommended when there is a single outlying case with high influence. However, LAD does work well in cases with high discrepancy.
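As a minimal sketch of this behavior, the code below fits OLS and LAD to simulated data with a single outlier in the y-axis direction; it assumes the quantreg package is installed, and the data and seed are hypothetical, used only for illustration. The LAD fit is obtained from rq() with tau = 0.5 (the median regression) and is pulled far less toward the outlier than the OLS fit.

library(quantreg)                     # rq() fits quantile regression; tau = 0.5 is the LAD (median) fit

set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
y[20] <- y[20] + 15                   # a single outlier in the y-axis direction

fit.ols <- lm(y ~ x)                  # OLS is pulled toward the outlier
fit.lad <- rq(y ~ x, tau = 0.5)       # LAD downplays it

coef(fit.ols)
coef(fit.lad)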
Least Trimmed Squares
Another robust regression technique, least trimmed squares (LTS), lets researchers choose the amount (or proportion) of data to be "trimmed." Unlike LAD, which minimizes the absolute values of the residuals, or OLS, which minimizes all of the squared residuals, LTS orders the squared residuals and minimizes only the smallest of them. This is shown mathematically as:

Minimize $\sum_{i=1}^{h} r_{(i)}^2$,

where $r_{(i)}^2$ denotes the squared residuals ordered from least to greatest and $h$, determined a priori, is the number of residuals retained in the sum; the remaining cases are excluded from (trimmed out of) the analysis.
With a breakdown point of .50, LTS is considered a high breakdown point estimator. LTS works well in general situations; however, there are cases in which it is not recommended. For instance, if the trimmed cases happen to be exactly the outlying points, the results are computationally the same as fitting OLS to the remaining data. The method is also inefficient when fewer cases are trimmed than there are outlying data points, while trimming more cases than there are outliers invites the argument that "good" data are being excluded. Because of this, LTS is considered subjective, and it provides biased results when outliers are present in clumps (Cohen, Cohen, West, & Aiken, 2003). This is shown graphically in Figure 1 for a heuristic data set containing 20% outliers with a proportion of ½ trimmed. The green line represents the robust regression line using the LTS method; the black line represents the OLS regression line.
Figure 1: Least Trimmed Squares Method using Heuristic data
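A minimal sketch of an LTS fit is given below, assuming the MASS package is installed and reusing the simulated x and y from the LAD sketch above; the choice of quantile is hypothetical. In MASS, the quantile argument of ltsreg() is the number of cases (h) whose ordered squared residuals enter the sum, so smaller values trim more of the data.

library(MASS)                         # ltsreg() is the LTS fitting routine in MASS

h <- floor(0.5 * length(y))           # retain roughly half the cases (trim the rest)
fit.lts <- ltsreg(y ~ x, quantile = h)
fit.ols <- lm(y ~ x)

coef(fit.lts)                         # robust LTS coefficients
coef(fit.ols)                         # OLS coefficients for comparison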
M-estimation
Developed by Huber (1973, 1981), the third robust regression technique, M-estimation, uses iteratively reweighted least squares to calculate the regression coefficients, with the "M" signifying maximum likelihood. Each iteration fits a weighted least squares regression, computes the residuals from that fit, and uses those residuals to calculate a new set of weights for the next fit. The process continues until the change in the parameter estimates is small enough, a specified number of iterations has occurred, or some other convergence criterion is met (Berk, 1990). In this way M-estimation "downplays" the effect of larger residuals and "upweights" the effect of smaller residuals, because larger residuals receive smaller weights. This is shown mathematically as:
Minimize $\sum_{i=1}^{n} \rho(r_i)$,

where $\rho$ is a symmetric function of the residuals with a unique minimum at zero.
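The sketch below illustrates the reweighting loop described above with a hand-rolled version of Huber's scheme, again reusing the simulated x and y from the earlier sketches; the tuning constant 1.345 is the conventional Huber value, while the tolerance and iteration limit are hypothetical choices. MASS's rlm() function, used later in this paper, performs this iteration internally.

fit <- lm(y ~ x)                                   # start from the OLS fit

for (i in 1:50) {
  s <- median(abs(resid(fit))) / 0.6745            # rough robust estimate of the residual scale
  r <- resid(fit) / s                              # standardized residuals
  w <- ifelse(abs(r) <= 1.345, 1, 1.345 / abs(r))  # Huber weights: larger residuals get smaller weights
  old <- coef(fit)
  fit <- lm(y ~ x, weights = w)                    # refit by weighted least squares
  if (max(abs(coef(fit) - old)) < 1e-6) break      # stop once the coefficients stabilize
}

coef(fit)                                          # M-estimates of the regression coefficients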
With a breakdown point of 1/n, M-estimation works well in cases with high discrepancy. However, it is not resistant to outliers on the x-axis (that is, in the explanatory variables), and it does not perform well in cases with high leverage as well as high discrepancy. When the errors are normally distributed, M-estimators give results similar to OLS; with heavy-tailed error distributions, however, they are considered more robust, and for this reason M-estimation is regarded as more statistically efficient than the LAD method (Anderson & Schumacker, 2003). This is shown in Figure 2, with the blue line representing the M-estimation regression line, the red line representing the least absolute deviation regression line, and the black line representing the OLS regression line on heuristic data with 20% outlying data.
Figure 2: M-estimation compared to Least Absolute Deviation
Bounded Influence Estimators
The bounded influence robust regression technique, also known as GM-estimation, was developed to address the problem M-estimation has with data outlying in the x-axis direction. Bounded influence estimators choose weights based on both leverage and discrepancy (Cohen, Cohen, West, & Aiken, 2003). To downplay outliers, bounded influence estimators minimize a bounded function of the residuals rather than the sum of squared residuals used in OLS (Anderson & Schumacker, 2003). Bounded influence estimation is shown mathematically by the estimating equation:
$\sum_{i=1}^{n} \psi\!\left(\dfrac{y_i - \mathbf{x}_i\hat{\boldsymbol{\beta}}}{s}\right)\mathbf{x}_i = 0$,

where $s$ is a robust estimate of scale and $\psi$ is a function that bounds the influence of unusual observations.
Bounded influence estimators perform well in many situations (including those with outliers on the x-axis), but they provide poor estimates when outliers occur in clumps. They also perform poorly when outliers have high leverage but small error terms.
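As a minimal sketch, and assuming MASS is installed, a bounded influence style fit can be obtained with rlm() using method = "MM" (the same call used in this paper's R output below); inspecting the final case weights shows which observations were downplayed. The simulated x and y from the earlier sketches are reused, and the w component is assumed to hold the final iteration weights in current versions of MASS.

library(MASS)

fit.bi <- rlm(y ~ x, method = "MM")   # MM estimation: high breakdown start, then an efficient M step
round(fit.bi$w, 2)                    # weights near 1 = ordinary cases; small weights = downplayed outliers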
Robust Regression Techniques in R
Using the crime dataset from Statistical Methods for the Social Sciences, Third Edition (Agresti & Finlay, 1997), each robust regression technique is applied to allow comparison of the various methods. The OLS model predicting "crime" from "poverty" and "single" serves as the baseline regression. Figure 3 shows the normal Q-Q plot, in which data points 9 and 25 are identified as having high leverage as well as large residuals. Figure 4 shows the R output for the various models.
Figure 3: QQNorm Plot for Crime Dataset
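The plot itself is not reproduced here, but the following is a minimal sketch of how a plot like Figure 3 and the accompanying diagnostics might be produced, assuming the crime2 data frame and the OLS model m1 shown in Figure 4; the cutoffs are common rules of thumb.

m1 <- lm(crime ~ poverty + single, data = crime2)

qqnorm(rstandard(m1))                              # normal Q-Q plot of the standardized residuals
qqline(rstandard(m1))

which(hatvalues(m1) > 2 * mean(hatvalues(m1)))     # cases with high leverage
which(abs(rstudent(m1)) > 2)                       # cases with large studentized residuals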
Figure 4: R Output for OLS and Robust Regression Techniques
> #### OLS MODEL ####
> m1 <- lm(crime ~ poverty + single, data = crime2)
> summary(m1)

Call:
lm(formula = crime ~ poverty + single, data = crime2)

Residuals:
    Min      1Q  Median      3Q     Max
-646.78 -106.42  -15.34  112.66  672.13

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -879.797    247.378  -3.556  0.00087 ***
poverty        7.591      8.416   0.902  0.37164
single       120.617     24.456   4.932 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 228 on 47 degrees of freedom
Multiple R-squared: 0.4306, Adjusted R-squared: 0.4064
F-statistic: 17.77 on 2 and 47 DF, p-value: 1.785e-06

> #### LAD MODEL ####
> library(quantreg)   # provides rq() for quantile (LAD) regression
> m1.LAD <- rq(crime ~ poverty + single, data = crime2)
> summary(m1.LAD)

Call: rq(formula = crime ~ poverty + single, data = crime2)

tau: [1] 0.5

Coefficients:
            coefficients lower bd     upper bd
(Intercept)  -946.44050  -1440.29485  -647.25258
poverty        14.82265     -4.48777    27.97920
single        116.16705     91.46409   169.94039

> #### LTS MODEL ####
> library(MASS)       # provides ltsreg() and rlm()
> m1.LTS <- ltsreg(crime ~ poverty + single, data = crime2)
> m1.LTS

Call:
lqs.formula(formula = crime ~ poverty + single, data = crime2, method = "lts")

Coefficients:
(Intercept)      poverty       single
    -840.09        23.00        97.61

Scale estimates 145.7 164.3

> #### M-ESTIMATION MODEL ####
> m1.huber <- rlm(crime ~ poverty + single, data = crime2)
> summary(m1.huber)

Call: rlm(formula = crime ~ poverty + single, data = crime2)

Residuals:
    Min      1Q  Median      3Q     Max
-706.88  -96.04  -21.72  115.24  675.41

Coefficients:
            Value      Std. Error t value
(Intercept) -1046.8870   224.9294  -4.6543
poverty         8.9435     7.6522   1.1688
single        133.8002    22.2370   6.0170

Residual standard error: 165.8 on 47 degrees of freedom

> #### BOUNDED INFLUENCE MODEL ####
> m1.bi <- rlm(crime ~ poverty + single, data = crime2, method = "MM")
> summary(m1.bi)

Call: rlm(formula = crime ~ poverty + single, data = crime2, method = "MM")

Residuals:
    Min      1Q  Median      3Q     Max
-753.43 -106.38  -27.81  113.17  670.16

Coefficients:
            Value      Std. Error t value
(Intercept) -1148.3514   223.4790  -5.1385
poverty        10.2836     7.6028   1.3526
single        141.6172    22.0936   6.4099

Residual standard error: 185.1 on 47 degrees of freedom
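The following is a minimal sketch for comparing the estimates side by side, assuming the five model objects created in the output above; each column holds one estimator's coefficients.

round(cbind(OLS = coef(m1),
            LAD = coef(m1.LAD),
            LTS = coef(m1.LTS),
            M   = coef(m1.huber),
            MM  = coef(m1.bi)), 2)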
Discussion
Robust regression techniques often offer a viable solution to outlying data, as compared
to simply deleting data. However, as shown with the four techniques previously discussed, there
is considerable variability based on the method chosen. Often the researcher must make a bestcase judgment for the technique to be used and justify the case accordingly. In order to do this,
researchers should become familiar with the varying robust regression techniques and their
appropriateness for different types of data. For instance, the least trim squares method and
bounded influence estimators perform poorly when given clumps of outlying data. The least absolute deviation method and M-estimation perform poorly when outliers lie in the x-axis direction, and the least absolute deviation method is also very sensitive to a single outlying case with high influence. Least trimmed squares generally performs well; however, it involves removing a proportion of the data, which can be viewed as performing the analysis on different data than intended, and for that reason it is considered subjective. Knowing this information can help researchers decide whether to use robust regression and, if so, which technique.
The purpose of this paper was to describe and illustrate more appropriate alternatives for handling outliers in regression analysis. Robust regression accommodates outliers through estimation techniques that reduce their influence. Four common robust regression methods are least absolute deviation, least trimmed squares, M-estimation, and bounded influence estimators.
Considerable caution and justification should be provided when choosing the appropriate robust
regression method for varying types of outlying data.
References
Agresti, A., & Finlay, B. (1997). Statistical methods for the social sciences (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
Anderson, C., & Schumacker, R. E. (2003). A comparison of five robust regression models with ordinary least squares regression: Relative efficiency, bias, and test of the null hypothesis. Understanding Statistics, 2(2), 79-103.
Berk, R. A. (1990). A primer on robust regression. In J. Fox & J. S. Long (Eds.), Modern methods of data analysis (pp. 292-323). Newbury Park, CA: Sage.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Fox, J. (1997). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.
Huber, P. J. (1973). Robust regression: Asymptotics, conjectures, and Monte Carlo. The Annals of Statistics, 1, 799-821.
Huber, P. J. (1981). Robust statistics. New York, NY: Wiley.