LMS and Regression Through the origin

Least Median of Squares and Regression through the Origin
Supporting files online at
http://www.wabash.edu/econexcel/LMSOrigin
By
Humberto Barreto
Department of Economics
Wabash College
Crawfordsville, IN 47933
Barretoh@wabash.edu
and
David Maharry
Department of Mathematics and Computer Science
Wabash College
Crawfordsville, IN 47933
maharryd@wabash.edu
The authors thank Michael Axtell, Frank Howland, and anonymous referees
for suggestions and criticisms.
Comments welcome
Do not quote without the authors’ permission
January 2005
Abstract
An exact algorithm is provided for finding the Least Median of Squares (LMS) line for a
bivariate regression with no intercept term. It is shown that the popular PROGRESS routine will
not, in general, find the LMS slope when the intercept is suppressed.
A Microsoft Excel workbook that provides the code in Visual Basic is made available at
www.wabash.edu/econexcel/LMSOrigin
Keywords: LMS, Robust Regression, PROGRESS
687293325
2
1. Introduction
Rousseeuw [1984] introduced Least Median of Squares (LMS) as a robust regression procedure.
Instead of minimizing the sum of squared residuals, coefficients are chosen so as to minimize the
median of the squared residuals. Unlike conventional least squares (LS), there is no closed-form
solution with which to easily calculate the LMS line since the median is an order or rank statistic.
A general non-linear optimization algorithm performs poorly because the median of squared
residuals surface is so bumpy that merely local minima are often incorrectly reported as the
solution.
Although a closed-form solution does not exist and brute force optimization is not reliable,
several algorithms are available for fitting the LMS line (or hyperplane). Perhaps the most
popular approach is called PROGRESS (from Program for RObust reGRESSion). The program
itself is explained in Rousseeuw and Leroy [1987] and the most recent version is available at
http://www.agoras.ua.ac.be/. Several software packages, such as SAS/IML (version 6.12 or
greater), have an LMS routine based on PROGRESS.
This paper focuses on the special problem of finding the LMS fitted line through the origin in the
bivariate case. The next section presents the model and defines the LMS line. Section 3 shows
that the PROGRESS algorithm gives an incorrect solution, in general, when the intercept is
restricted to zero. Section 4 presents an analytical, exact method for finding the minimum
median squared residual for the bivariate, zero intercept case. Finally, a simple example is
provided to illustrate the algorithm and show why PROGRESS fails in the zero-intercept case.
2. The model
Suppose that observed values of y are generated according to the model $y_i = \beta x_i + \varepsilon_i$. Given a
realization of n $(x_i, y_i)$ points, the problem is to find the ‘best’ choice for the slope of a straight
line that passes through the origin, $\hat{y} = mx$.
One choice for the ‘best’ line is the line that minimizes the median of the individual squares of
the deviations, or residuals. Given this objective, one chooses the value of the slope, m, to
minimize the median value of the squared residuals:
$d_i^2(m) = (y_i - \hat{y}_i)^2 = (y_i - m x_i)^2, \quad 0 < i \le n$
Given a value of m, the n squared deviations can be ordered and the median found as the
‘middle’ one. If n is odd the median is the $(n + 1)/2$-th item, while if n is even the median is the
mean of the $n/2$-th and the $(n/2 + 1)$-th terms. Table 1 shows a simple example.
m = 2.4
median Residual^2 = 0.64

x    y    Predicted Y    Residual    Residual^2
1    3    2.4             0.6         0.36
2    4    4.8            -0.8         0.64
3    8    7.2             0.8         0.64
4    6    9.6            -3.6        12.96
5    7    12             -5          25

Table 1. Five Points Example of the LMS Fit
The smallest median squared residual, 0.64, is achieved when m=2.4. Other choices of m will
generate greater median squared residual values.
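The calculation behind Table 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the Visual Basic code in the workbook; the function name `median_squared_residual` is an assumption introduced here.

```python
from statistics import median

def median_squared_residual(x, y, m):
    """Median of the squared residuals (y_i - m*x_i)^2 for the line y = m*x."""
    return median((yi - m * xi) ** 2 for xi, yi in zip(x, y))

# Data from Table 1.
x = [1, 2, 3, 4, 5]
y = [3, 4, 8, 6, 7]

msr_at_2_4 = median_squared_residual(x, y, 2.4)  # slope used in Table 1
msr_at_1_5 = median_squared_residual(x, y, 1.5)  # another slope, for comparison
print(msr_at_2_4, msr_at_1_5)
```

For an even number of observations `statistics.median` averages the two middle values, which matches the definition of the median given above.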
3. PROGRESS and LMS with a zero intercept
The PROGRESS algorithm is based on sampling subsets of points from the data in order to
generate candidate LMS estimates. The size of each subset is determined by the number of
coefficients to be estimated. For the table above, there is one coefficient to be estimated and five
observations so there are five subsets, each containing one data point. For each of the n subsets,
the slope is computed. Using this slope, the squared deviation of each of the data points is
calculated and the median of these squared deviations is determined. The PROGRESS
estimate of the LMS slope is the slope from the subset that produces the minimum median squared deviation.
Subset    x    y    slope    median Residual^2
1         1    3    3        4
2         2    4    2        4
3         3    8    2.667    1.778
4         4    6    1.5      1
5         5    7    1.4      1.44

Table 2. Five Points Example with PROGRESS, No Intercept
Using the data from Table 2, PROGRESS determines that the minimum median squared residual
has a value of 1, found when the slope is 1.5. Thus PROGRESS generates an LMS line $\hat{y} = 1.5x$.
Note that this is not the true minimum median squared residual, which has a value of 0.64 for the
slope of 2.4.

[Figure 1, panel (a) Scatter Plot with Fits: y against x, showing the true LMS line y = 2.4x, the LS line y = 1.709x, and the PROGRESS LMS line y = 1.5x. Panel (b) Median SR Objective Function: the median squared residual as a function of the slope (m), Med SR = f(m), with the PROGRESS solution and the true minimum marked.]
Figure 1. Five Points Example Fits
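The one-point-subset procedure described in this section can be sketched as follows. This is only an illustration of the description given here, not the actual PROGRESS source code; the function names are assumptions introduced for the example.

```python
from statistics import median

def median_squared_residual(x, y, m):
    """Median of the squared residuals (y_i - m*x_i)^2 for the line y = m*x."""
    return median((yi - m * xi) ** 2 for xi, yi in zip(x, y))

def one_point_subset_fit(x, y):
    """Try the slope y_i/x_i through each single data point and
    return (median squared residual, slope) for the best of them."""
    candidates = [yi / xi for xi, yi in zip(x, y) if xi != 0]
    return min((median_squared_residual(x, y, m), m) for m in candidates)

x = [1, 2, 3, 4, 5]
y = [3, 4, 8, 6, 7]

subset_msr, subset_slope = one_point_subset_fit(x, y)
msr_at_2_4 = median_squared_residual(x, y, 2.4)
print(subset_slope, subset_msr, msr_at_2_4)
```

On the five-point data the procedure selects the slope 1.5, yet the slope 2.4, which passes through none of the data points, gives a smaller median squared residual.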
It is well known that PROGRESS will yield an exact, correct solution in the case of simple
regression (i.e., bivariate regression with an intercept and slope) when all two-observation
subsets are examined (see Note 1). Given the best slope of the subsets, the intercept (or location parameter)
is adjusted and an exact LMS fit is guaranteed. The casual user may mistakenly suppose that if
PROGRESS finds an exact solution for a simple bivariate regression of the form y = mx + b,
then an exact solution also will be obtained for y = mx. Unfortunately, this does not follow.
Note 1. “In simple regression (p=2), it follows from (Steele and Steiger 1986) that if all 2-subsets are used and their
intercept is adjusted each time, we obtain the exact LQS.” Rousseeuw and Hubert [1997], p. 9.
4. An exact algorithm for LMS with a zero intercept in the bivariate case
The central idea of the algorithm is the observation that only a finite number of slopes,
bounded by the number of pairs of points, could provide the minimum median deviation.
In most cases not all of the $O(n^2)$ candidate slopes need to be checked to determine
the optimal slope.
For a given (x, y) pair, the square of the deviation of the fitted line from the actual data point is
given by $d^2(m) = (y - \hat{y})^2 = (y - mx)^2$. As a function of the slope, m, this squared deviation forms a
parabola with an upward concavity. The deviation has its minimum value of 0 when the line
passes directly through the (x, y) point, where $\hat{y} = y = mx$. A given set of n data points,
$\{(x_i, y_i);\ i \le n\}$, defines a set of parabolas $\{d_i^2(m) = (y_i - \hat{y}_i)^2 = (y_i - m x_i)^2;\ i \le n\}$. For a given
value of the slope m these parabolas can be ordered and the data point which defines the median
deviation can be determined.
The range of slopes to search in order to find the one that produces the minimum median
deviation is bounded by $m_{\min} = \min(y_i/x_i)$ and $m_{\max} = \max(y_i/x_i)$ over all the n data points for which
the x-value is not zero. Each parabola rises monotonically for all values of the slope that are
greater than the slope that minimizes it, which is $m = y_i/x_i$. This implies that all of the n parabolas
are rising when the slope is greater than $m_{\max}$. The minimum median deviation cannot
occur in a region where all of the deviations are rising. A similar statement may be made
about slopes less than $m_{\min}$, since in this case all n parabolas are falling. Thus the slope
that produces the minimum median deviation must have a
value between these two bounds.
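As a small illustration, the two bounds can be computed directly from the ratios y_i/x_i. The function name `slope_bounds` is an assumption made for this sketch; points with a zero x-value are skipped, as the text specifies.

```python
def slope_bounds(x, y):
    """Return (m_min, m_max), the smallest and largest ratio y_i/x_i
    over the data points whose x-value is not zero."""
    ratios = [yi / xi for xi, yi in zip(x, y) if xi != 0]
    return min(ratios), max(ratios)

# Five-point example data from Table 1.
x = [1, 2, 3, 4, 5]
y = [3, 4, 8, 6, 7]

m_min, m_max = slope_bounds(x, y)
print(m_min, m_max)
```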
The minimum value of a continuous function such as the median of the squared deviations
can occur only (1) at one of the end points, (2) at a point where the derivative of the function is
0, or (3) at a point where the derivative is not defined. Thus the search for the slope that
produces the global minimum can be restricted to two cases: first, slopes where two parabolas
intersect, so that the derivative of the median function is undefined; and second, slopes where
the median parabola attains its minimum value, which is 0.
The second case, in which the global minimum occurs at a slope where the median parabola
reaches its minimum value of 0, occurs only when a majority of the data points lie on a straight
line through the origin. The minimum median deviation then occurs when the ‘best’ fit straight line passes
directly through these points, giving each of these points a deviation of 0. For all other sets of
data, the first case provides the set of slopes to check for a minimum of the median of the
deviations.
To show why the first case might produce a global minimum of the median deviation one must
notice that if one determines which data point produces the median deviation at a given value for
the slope, this data point continues to generate the median deviation as the slope, m, increases
until a value of the slope is reached where the parabola describing this median data point
intersects the parabola due to some other data point. At this slope this next data point becomes
the median. Since the median parabola is always either rising or falling between intersections
the minimum point must occur at one of the end-points where an intersection occurs.
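The slopes at which two parabolas intersect follow from equating the two squared deviations. The short derivation below is a worked step added for illustration, using only the definition $d_i^2(m) = (y_i - m x_i)^2$:

```latex
\[
(y_i - m x_i)^2 = (y_j - m x_j)^2
\iff
y_i - m x_i = \pm\,(y_j - m x_j),
\]
\[
m = \frac{y_i - y_j}{x_i - x_j} \quad (x_i \neq x_j),
\qquad
m = \frac{y_i + y_j}{x_i + x_j} \quad (x_i + x_j \neq 0).
\]
```

For example, the points (2, 4) and (3, 8) give m = (4 - 8)/(2 - 3) = 4 and m = (4 + 8)/(2 + 3) = 2.4. Each pair of parabolas therefore meets at no more than two slopes.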
An algorithm to determine the exact minimum is based on the idea of following the median
parabola as the slope sweeps from $m_{\min}$ to $m_{\max}$, keeping track of the minimum deviation and the
corresponding slope.
The problem is to find the value of the slope that causes the median deviation to achieve its
minimum value. The algorithm to solve this problem begins by determining the data point that
causes the median deviation at the point $m_{\min} = \min(y_i/x_i)$. The value of the slope and the value of
the deviation at that slope are stored.
Given the parabola that produces the median deviation for that slope, the slope is increased to the
next smallest value of the slope where the median parabola intersects one of the other parabolas.
At this point the deviation is compared with the minimum deviation discovered so far. If a new
minimum has been found, the deviation and the slope which produced it are stored. This
procedure is continued as long as the value of the slope is less than the maximum possible slope,
$m_{\max} = \max(y_i/x_i)$. At the conclusion the minimum median deviation and the corresponding slope
have been found.
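A simpler, exhaustive variant of this idea can be sketched in Python. It is not the sweep described above and not the workbook's Visual Basic code: instead of following the median parabola, it evaluates the median squared residual at every candidate slope (every slope y_i/x_i and every slope where two parabolas intersect) and keeps the smallest. The function names are assumptions, and the sketch is restricted to an odd number of observations, where the median is a single parabola at each slope.

```python
from itertools import combinations
from statistics import median

def median_squared_residual(x, y, m):
    """Median of the squared residuals (y_i - m*x_i)^2 for the line y = m*x."""
    return median((yi - m * xi) ** 2 for xi, yi in zip(x, y))

def lms_slope_through_origin(x, y):
    """Exhaustive search over candidate slopes (assumes odd n).
    Returns (minimum median squared residual, slope)."""
    pts = list(zip(x, y))
    if len(pts) % 2 == 0:
        raise ValueError("this sketch handles only an odd number of observations")
    # Slopes where one parabola reaches zero (the line passes through a point).
    candidates = {yi / xi for xi, yi in pts if xi != 0}
    # Slopes where two parabolas intersect: y_i - m*x_i = +/-(y_j - m*x_j).
    for (xi, yi), (xj, yj) in combinations(pts, 2):
        if xi != xj:
            candidates.add((yi - yj) / (xi - xj))
        if xi + xj != 0:
            candidates.add((yi + yj) / (xi + xj))
    return min((median_squared_residual(x, y, m), m) for m in candidates)

x = [1, 2, 3, 4, 5]
y = [3, 4, 8, 6, 7]
best_msr, best_slope = lms_slope_through_origin(x, y)
print(best_slope, best_msr)
```

This variant trades efficiency for brevity: it computes a median for each of the O(n^2) candidates instead of ordering only the intersections met along the median path.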
5. An Example
Figure 2 shows an example of this algorithm using the five data points from Table 1. The slope
varies from $m_{\min} = 1.4$ due to observation 5 to $m_{\max} = 3$ due to observation 1.
[Figure 2: the squared residual $(y_i - m x_i)^2$ of each observation (obs 1 to obs 5) and the median SR (thick line), plotted against the slope (m) from 1 to 3, with a vertical line at m = 1.85. At m = 1.85 the squared residuals are 1.3225 (obs 1), 0.09 (obs 2), 6.0025 (obs 3), 1.96 (obs 4), and 5.0625 (obs 5), so the median SR is 1.96.]
Figure 2. Individual Observation and Median SR as a function of m
Each parabola is labeled by the data point that relates to that parabola: obs 1, obs 2, obs 3, obs 4
and obs 5. The median squared residual for a given slope, m, is the median, or middle, one of
the y values of the 5 parabolas. The thick line follows the median, or 3rd, deviation in this
example of 5 data points. The vertical line at m = 1.85 shows that the second observation,
$(x_2 = 2, y_2 = 4)$, has a squared residual value of 0.09 $(= (4 - 1.85 \cdot 2)^2)$. At m = 1.85, the median
squared residual has a value of 1.96 and is due to the fourth observation, $(x_4 = 4, y_4 = 6)$.
For this data set, observation 5 has its minimum at a slope of 1.4, which is the minimum slope of
the 5 observations, while observation 1 has its minimum at a slope of 3.0, which is the maximum
slope. Thus the minimum median deviation must occur somewhere in the interval between 1.4
and 3.0. The algorithm begins with a slope of 1.4, choosing observation 2 as the median
parabola at that slope. It then determines that the parabola due to observation 2 intersects the
parabola for observation 5 at a slope of approximately 1.57. At this point the parabola for
observation 5 becomes the median deviation with a value of 0.73. At the slope of 1.67 this
parabola intersects the parabola due to observation 1. The deviation at this point is 1.78.
Parabola 1 intersects with parabola 4 at a slope of 1.8 with a value of 1.44. Parabola 4 intersects
parabola 3 at a slope of 2.0 with a deviation of 4.0. Finally parabola 3 intersects parabola 2 at a
slope of 2.4 with a deviation value of 0.64. From these 5 intersections, described by the
(slope, median squared residual) pairs {(1.57, 0.73), (1.67, 1.78), (1.8, 1.44), (2.0, 4.0), (2.4, 0.64)},
the algorithm chooses the point
with the smallest deviation. The 5th one, at a slope of 2.4, produces the smallest deviation with a
value of 0.64. This is the solution to the problem. The ‘best’ straight line using these 5 data
points is the one with a slope of 2.4. All other slopes produce larger median squared deviations.
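The five intersections in this walk-through can be checked numerically. The sketch below takes the sequence of intersecting observations from the text and, for each pair, solves $y_i - m x_i = -(y_j - m x_j)$, which gives $m = (y_i + y_j)/(x_i + x_j)$. Using this root for every pair is an observation made for this example, not a step stated in the paper.

```python
x = [1, 2, 3, 4, 5]
y = [3, 4, 8, 6, 7]

# Pairs of observations (1-based) whose parabolas cross along the median path,
# in the order given in the walk-through.
crossings = [(2, 5), (5, 1), (1, 4), (4, 3), (3, 2)]

trace = []
for i, j in crossings:
    xi, yi = x[i - 1], y[i - 1]
    xj, yj = x[j - 1], y[j - 1]
    m = (yi + yj) / (xi + xj)    # root of y_i - m*x_i = -(y_j - m*x_j)
    value = (yi - m * xi) ** 2   # common squared residual at the crossing
    trace.append((m, value))

best_slope, best_value = min(trace, key=lambda t: t[1])
for m, value in trace:
    print(f"slope {m:.4f}  median squared residual {value:.4f}")
```

The smallest value in the trace occurs at the last crossing, at the slope 2.4.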
It is possible for more than two parabolas to intersect at a point, but the parabola that becomes
the new median can be determined by ordering the intersecting parabolas based on their slope
and curvature at the point of intersection. In the case of an even number of data points it is
necessary to follow two parabolas, representing the (n/2)th deviation and the (n/2+1)th deviation,
since the median is the average of these two values.
When there are n data points the efficiency of this algorithm is $O(n^2 \log n)$ in the worst case. It
requires determining the intersections of each of the parabolas with the median parabola to
choose the next intersection to use. It is possible that each parabola might be the median
parabola at some point of the algorithm. Thus one may have to determine as many as
$\binom{n}{2} = O(n^2)$ intersections and these intersections have to be ordered.
Figure 2 can also be used to show another view of how the PROGRESS algorithm works in the
zero intercept case. For each value of the slope that causes the straight line to pass through a
data point, giving a squared residual of zero, the PROGRESS algorithm computes the median
deviation of the 5 data points. It then chooses the slope that causes the minimum value of this set
of deviations. The discussion presented in the previous paragraphs makes clear why this
approach fails—the global minimum squared residual will not, in general, be associated with a
slope where a squared residual value for an individual observation is zero. This will provide the
correct result only in the case where a majority of the data points lie on a straight line through the
origin.
6. Conclusion
When applying Least Median of Squares, coefficients are chosen so as to minimize the median
of the squared residuals. Because the median is not sensitive to extreme values, it can
outperform conventional least squares when data are contaminated. This paper makes two
contributions to the LMS literature:
1) PROGRESS, the standard algorithm for fitting the LMS estimator, does not find the true
LMS fit when the intercept is suppressed. Any computations based on the estimated slope
(such as regression diagnostics and estimated standard errors) are also wrong.
2) For a bivariate regression with a zero intercept, $\hat{y} = mx$, an algorithmic method based on
keeping track of the median squared residual is demonstrated.
References
Barreto, Humberto (2001) “An Introduction to Least Median of Squares,” unpublished
manuscript, http://www.wabash.edu/econexcel/LMSOrigin (LMSIntro.doc).
Rousseeuw, Peter J. (1984) “Least Median of Squares Regression,” Journal of the American
Statistical Association, 79 (388), 871-880.
Rousseeuw, Peter J. and Annick M. Leroy (1987) Robust Regression and Outlier Detection, John
Wiley & Sons: New York.