LMS and Regression Through the origin

Least Median of Squares and Regression through the Origin

Supporting files online at http://www.wabash.edu/econexcel/LMSOrigin

By

Humberto Barreto
Department of Economics
Wabash College
Crawfordsville, IN 47933
Barretoh@wabash.edu

and

David Maharry
Department of Mathematics and Computer Science
Wabash College
Crawfordsville, IN 47933
maharryd@wabash.edu

The authors thank Michael Axtell, Frank Howland, and anonymous referees for suggestions and criticisms.

Comments welcome
Do not quote without the author’s permission
January 2005

Abstract

An exact algorithm is provided for finding the Least Median of Squares (LMS) line for a bivariate regression with no intercept term. It is shown that the popular PROGRESS routine will not, in general, find the LMS slope when the intercept is suppressed. A Microsoft Excel workbook that provides the code in Visual Basic is made available at www.wabash.edu/econexcel/LMSOrigin

Keywords: LMS, Robust Regression, PROGRESS

1. Introduction

Rousseeuw [1984] introduced Least Median of Squares (LMS) as a robust regression procedure. Instead of minimizing the sum of squared residuals, coefficients are chosen so as to minimize the median of the squared residuals. Unlike conventional least squares (LS), LMS has no closed-form solution with which to calculate the fitted line, because the median is an order (rank) statistic. A general non-linear optimization algorithm performs poorly because the median of squared residuals surface is so bumpy that merely local minima are often incorrectly reported as the solution.

Although a closed-form solution does not exist and brute force optimization is not reliable, several algorithms are available for fitting the LMS line (or hyperplane). Perhaps the most popular approach is called PROGRESS (from Program for RObust reGRESSion). The program itself is explained in Rousseeuw and Leroy [1987] and the most recent version is available at http://www.agoras.ua.ac.be/.
Several software packages, such as SAS/IML (version 6.12 or greater), have an LMS routine based on PROGRESS.

This paper focuses on the special problem of finding the LMS fitted line through the origin in the bivariate case. The next section presents the model and defines the LMS line. Section 3 shows that the PROGRESS algorithm gives an incorrect solution, in general, when the intercept is restricted to zero. Section 4 presents an analytical, exact method for finding the minimum median squared residual for the bivariate, zero-intercept case. Finally, Section 5 provides a simple example to illustrate the algorithm and show why PROGRESS fails in the zero-intercept case.

2. The model

Suppose that observed values of y are generated according to the model $y_i = \beta x_i + \varepsilon_i$. Given a realization of n $(x_i, y_i)$ points, the problem is to find the ‘best’ choice for the slope of a straight line that passes through the origin, $\hat{y} = mx$.

One choice for the ‘best’ line is the line that minimizes the median of the individual squared deviations, or residuals. Given this objective, one chooses the value of the slope, m, to minimize the median value of the squared residuals:

$d_i^2(m) = (y_i - \hat{y}_i)^2 = (y_i - m x_i)^2, \quad 0 < i \le n$

Given a value of m, the n squared deviations can be ordered and the median found as the ‘middle’ one. If n is odd the median is the $(n+1)/2$-th item, while if n is even the median is the mean of the $n/2$-th and the $(n/2+1)$-th terms. Table 1 shows a simple example.

m = 2.4, median Residual^2 = 0.64

  x    y    Predicted Y    Residual    Residual^2
  1    3        2.4           0.6         0.36
  2    4        4.8          -0.8         0.64
  3    8        7.2           0.8         0.64
  4    6        9.6          -3.6        12.96
  5    7       12.0          -5.0        25.00

Table 1. Five Points Example of the LMS Fit

The smallest median squared residual, 0.64, is achieved when m = 2.4. Other choices of m will generate greater median squared residual values.

3. PROGRESS and LMS with a zero intercept

The PROGRESS algorithm is based on sampling subsets of points from the data in order to generate candidate LMS estimates. The size of each subset is determined by the number of coefficients to be estimated. For the data in Table 1, there is one coefficient to be estimated and five observations, so there are five subsets, each containing one data point. For each of the n subsets, the slope of the line through the origin and that data point is computed. Using this slope, the squared deviation of each of the data points is calculated and the median of these squared deviations is determined. The PROGRESS estimate of the LMS slope is the slope of the subset that produces the minimum median squared deviation.

  Subset    x    y    slope    median Residual^2
    1       1    3    3            4
    2       2    4    2            4
    3       3    8    2.667        1.778
    4       4    6    1.5          1
    5       5    7    1.4          1.44

Table 2. Five Points Example with PROGRESS, No Intercept

Using the data from Table 2, PROGRESS determines that the minimum median squared residual has a value of 1, found when the slope is 1.5. Thus PROGRESS generates an LMS line $\hat{y} = 1.5x$. Note that this is not the true minimum median squared residual, which has a value of 0.64 for the slope of 2.4.

[Figure 1. Five Points Example Fits. (a) Scatter Plot with Fits: TRUE LMS, y = 2.4x; LS, y = 1.709x; PROGRESS LMS, y = 1.5x. (b) Median SR Objective Function: median squared residual as a function of the slope (m), with the PROGRESS solution and the true minimum marked.]

It is well known that PROGRESS will yield an exact, correct solution in the case of simple regression (i.e., bivariate regression with an intercept and slope) when all two-observation subsets are examined (footnote 1). Given the best slope of the subsets, the intercept (or location parameter) is adjusted and an exact LMS fit is guaranteed.

The casual user may mistakenly suppose that if PROGRESS finds an exact solution for a simple bivariate regression of the form y = mx + b, then an exact solution also will be obtained for y = mx. Unfortunately, this does not follow.
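The numbers in Tables 1 and 2 can be checked with a few lines of code. The following is a minimal Python sketch, not the authors' Visual Basic workbook and not the PROGRESS program itself; the function name `median_sq_resid` and the variable names are illustrative assumptions. It evaluates the median squared residual at the slope 2.4 and at the five single-point slopes $y_i/x_i$ that the elemental-subset search described above would consider.

```python
from statistics import median

# Five-point data set used in Tables 1 and 2.
x = [1, 2, 3, 4, 5]
y = [3, 4, 8, 6, 7]


def median_sq_resid(m, x, y):
    """Median of the squared residuals (y_i - m * x_i)^2 for slope m."""
    return median((yi - m * xi) ** 2 for xi, yi in zip(x, y))


# Table 1: median squared residual at the slope m = 2.4.
table1_value = median_sq_resid(2.4, x, y)

# Table 2: one candidate slope per observation (the line through the
# origin and that point), each paired with its median squared residual.
candidates = [
    (yi / xi, median_sq_resid(yi / xi, x, y))
    for xi, yi in zip(x, y)
    if xi != 0
]

# The single-point search keeps the candidate with the smallest median.
best_slope, best_value = min(candidates, key=lambda c: c[1])

print("median SR at m = 2.4:", round(table1_value, 4))
for slope, value in candidates:
    print(f"slope {slope:.3f} -> median SR {value:.3f}")
print("best single-point slope:", best_slope, "median SR:", best_value)
```

Up to rounding, the printed candidates should agree with Table 2, and the best single-point slope should have a larger median squared residual than the slope 2.4 from Table 1.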
Footnote 1: “In simple regression (p=2), it follows from (Steele and Steiger 1986) that if all 2-subsets are used and their intercept is adjusted each time, we obtain the exact LQS.” Rousseeuw and Hubert [1997], p. 9.

4. An exact algorithm for LMS with a zero intercept in the bivariate case

The central idea of this algorithm, which provides an exact solution to the LMS problem, is the observation that there are a finite number of slopes, bounded by the number of pairs of points, which could provide the minimum median deviation. In most cases not all of the $O(n^2)$ candidate slopes need to be checked to determine the optimal slope.

For a given (x, y) pair, the square of the deviation of the fitted line from the actual data point is given by $d^2(m) = (y - \hat{y})^2 = (y - mx)^2$. As a function of the slope, m, this squared deviation forms a parabola with an upward concavity. The deviation has its minimum value of 0 when the line passes directly through the (x, y) point, where $\hat{y} = y = mx$. A given set of n data points, $\{(x_i, y_i);\ i \le n\}$, defines a set of parabolas $\{d_i^2(m) = (y_i - \hat{y}_i)^2 = (y_i - m x_i)^2;\ i \le n\}$. For a given value of the slope m these parabolas can be ordered and the data point which defines the median deviation can be determined.

The range of slopes to search in order to find the one that produces the minimum median deviation is bounded by $m_{min} = \min(y_i/x_i)$ and $m_{max} = \max(y_i/x_i)$ over all the n data points for which the x-value is not zero. Each parabola rises monotonically for all values of the slope that are greater than the slope that minimizes it, which is $m = y_i/x_i$. This implies that all of the n parabolas are rising when the slope is greater than $m_{max} = \max(y_i/x_i)$. The minimum median deviation cannot occur in a region where all of the deviations are rising. A similar statement may be made about slopes less than $m_{min}$, since all n parabolas are falling when the slope is less than $m_{min} = \min(y_i/x_i)$.
Thus the slope that produces the minimum median deviation must have a value between these two bounds.

The minimum value of a continuous function such as the median of the squared deviations can occur only (1) at one of the end points, (2) at a point where the derivative of the function is 0, or (3) at a point where the derivative is not defined. Thus the search for the slope that produces the global minimum can be restricted to two kinds of points: first, points where two parabolas intersect, so that the derivative of the median function is undefined; and second, points where the median parabola attains its minimum value, which is 0.

The second case, in which the global minimum occurs at a slope where the median parabola reaches its minimum value of 0, occurs only when a majority of the data points lie on a straight line through the origin. The minimum median deviation then occurs when the ‘best’ fit straight line passes directly through these points, giving each of them a deviation of 0. For all other sets of data, the first case provides the set of slopes to check for a minimum of the median of the deviations.

To see why the first case might produce a global minimum of the median deviation, note that the data point that produces the median deviation at a given slope continues to generate the median deviation as the slope, m, increases, until a slope is reached where the parabola of this median data point intersects the parabola of some other data point. At this slope the other data point becomes the median. Since the median parabola is always either rising or falling between intersections, the minimum must occur at one of the end points of such an interval, that is, at a slope where an intersection occurs.

An algorithm to determine the exact minimum is based on the idea of following the median parabola as the slope sweeps from $m_{min}$ to $m_{max}$, keeping track of the minimum deviation and the corresponding slope.
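The restriction to a finite candidate set can be illustrated directly. Setting $(y_i - m x_i)^2 = (y_j - m x_j)^2$ gives $y_i - m x_i = \pm (y_j - m x_j)$, so two parabolas intersect at $m = (y_i - y_j)/(x_i - x_j)$ and at $m = (y_i + y_j)/(x_i + x_j)$ whenever the denominators are nonzero. The Python sketch below is an exhaustive check, not the sweep described in this paper: it simply evaluates the median at every such intersection and at every zero $y_i/x_i$ inside $[m_{min}, m_{max}]$. The function names are hypothetical, and the sketch assumes an odd number of observations so that the median is a single parabola at each slope.

```python
from itertools import combinations
from statistics import median


def median_sq_resid(m, pts):
    """Median of the squared residuals (y - m * x)^2 over all points."""
    return median((yi - m * xi) ** 2 for xi, yi in pts)


def candidate_slopes(pts):
    """Zeros of the parabolas and pairwise intersections in [m_min, m_max]."""
    ratios = [yi / xi for xi, yi in pts if xi != 0]
    m_min, m_max = min(ratios), max(ratios)
    cands = set(ratios)
    for (xi, yi), (xj, yj) in combinations(pts, 2):
        # Residuals equal with opposite signs.
        if xi + xj != 0:
            cands.add((yi + yj) / (xi + xj))
        # Residuals equal with the same sign.
        if xi - xj != 0:
            cands.add((yi - yj) / (xi - xj))
    return sorted(m for m in cands if m_min <= m <= m_max)


def lms_origin_exhaustive(pts):
    """Slope and median squared residual minimizing over all candidates."""
    if len(pts) % 2 == 0:
        raise ValueError("this sketch assumes an odd number of observations")
    return min(
        ((m, median_sq_resid(m, pts)) for m in candidate_slopes(pts)),
        key=lambda c: c[1],
    )


pts = list(zip([1, 2, 3, 4, 5], [3, 4, 8, 6, 7]))
slope, value = lms_origin_exhaustive(pts)
print("slope:", slope, "median squared residual:", round(value, 4))
```

Checking every candidate costs more than following the median parabola, but it is a convenient way to verify the result of a sweep on small data sets.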
The problem is to find the value of the slope that causes the median deviation to achieve its minimum value. The algorithm begins by determining the data point that produces the median deviation at the slope $m_{min} = \min(y_i/x_i)$. The value of the slope and the value of the deviation at that slope are stored. Given the parabola that produces the median deviation for that slope, the slope is increased to the nearest larger value at which the median parabola intersects one of the other parabolas. At this point the deviation is compared with the minimum deviation discovered so far. If a new minimum has been found, the deviation and the slope which produced it are stored. This procedure is continued as long as the value of the slope is less than the maximum possible slope, $m_{max} = \max(y_i/x_i)$. At the conclusion the minimum median deviation and the corresponding slope have been found.

5. An Example

Figure 2 shows an example of this algorithm using the five data points from Table 1. The slope varies from $m_{min} = 1.4$, due to observation 5, to $m_{max} = 3$, due to observation 1.

m = 1.85

  Squared Residual: (y_i - m x_i)^2
  obs 1     obs 2    obs 3     obs 4    obs 5     Median SR
  1.3225    0.09     6.0025    1.96     5.0625    1.96

[Figure 2. Individual Observation and Median SR as a function of m. Horizontal axis: slope (m); vertical axis: squared residual. One parabola per observation (obs 1 to obs 5), the median SR drawn as a thick line, and a vertical line at m = 1.85.]

Each parabola is labeled by the data point that it relates to: obs 1, obs 2, obs 3, obs 4 and obs 5. The median squared residual for a given slope, m, is the median, or middle, one of the heights of the 5 parabolas at that slope. The thick line follows the median, or 3rd, deviation in this example of 5 data points.

The vertical line at m = 1.85 shows that the second observation, $(x_2 = 2, y_2 = 4)$, has a squared residual value of 0.09 $(= (4 - 1.85 \cdot 2)^2)$. At m = 1.85, the median squared residual has a value of 1.96 and is due to the fourth observation, $(x_4 = 4, y_4 = 6)$.
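The slice of Figure 2 at m = 1.85 can be reproduced numerically. The short Python sketch below uses illustrative variable names that are not from the paper; it ranks the five squared residuals at that slope, picks out the observation in the middle position, and reports which observations attain $m_{min}$ and $m_{max}$.

```python
# Five-point data set from Table 1.
x = [1, 2, 3, 4, 5]
y = [3, 4, 8, 6, 7]
m = 1.85

# Squared residual of each observation (numbered from 1) at slope m.
sq_resid = {
    i: (yi - m * xi) ** 2
    for i, (xi, yi) in enumerate(zip(x, y), start=1)
}

# Rank the observations by squared residual; with n = 5 the median
# is the 3rd smallest value.
ranked = sorted(sq_resid.items(), key=lambda item: item[1])
median_obs, median_value = ranked[len(ranked) // 2]

# Slope y_i / x_i at which each parabola reaches its minimum of zero.
ratios = {i: yi / xi for i, (xi, yi) in enumerate(zip(x, y), start=1)}
m_min_obs = min(ratios, key=ratios.get)
m_max_obs = max(ratios, key=ratios.get)

for i, value in sq_resid.items():
    print(f"obs {i}: squared residual {value:.4f}")
print(f"median is obs {median_obs} with value {median_value:.4f}")
print(f"m_min from obs {m_min_obs}, m_max from obs {m_max_obs}")
```

Repeating the ranking at other slopes shows how the observation in the median position changes as m moves across an intersection of two parabolas.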
For this data set, observation 5 has its minimum at a slope of 1.4, which is the minimum slope of the 5 observations, while observation 1 has its minimum at a slope of 3.0, which is the maximum slope. Thus the minimum median deviation must occur somewhere in the interval between 1.4 and 3.0.

The algorithm begins with a slope of 1.4, choosing observation 2 as the median parabola at that slope. It then determines that the parabola due to observation 2 intersects the parabola for observation 5 at a slope of approximately 1.57. At this point the parabola for observation 5 becomes the median deviation, with a value of 0.74. At a slope of 1.67 this parabola intersects the parabola due to observation 1. The deviation at this point is 1.78. Parabola 1 intersects parabola 4 at a slope of 1.8 with a value of 1.44. Parabola 4 intersects parabola 3 at a slope of 2.0 with a deviation of 4.0. Finally, parabola 3 intersects parabola 2 at a slope of 2.4 with a deviation value of 0.64.

From these 5 intersections, described by the (slope, deviation) points {(1.57, 0.74), (1.67, 1.78), (1.8, 1.44), (2.0, 4.0), (2.4, 0.64)}, the algorithm chooses the point with the smallest deviation. The 5th one, at a slope of 2.4, produces the smallest deviation, with a value of 0.64. This is the solution to the problem. The ‘best’ straight line using these 5 data points is the one with a slope of 2.4. All other slopes produce larger median squared deviations.

It is possible for more than two parabolas to intersect at a point, but the parabola that becomes the new median can be determined by ordering the intersecting parabolas based on their slope and curvature at the point of intersection.

In the case of an even number of data points it is necessary to follow two parabolas, representing the $(n/2)$-th deviation and the $(n/2+1)$-th deviation, since the median is the average of these two values.

When there are n data points the efficiency of this algorithm is $O(n^2 \log n)$ in the worst case.
The algorithm requires determining the intersections of each of the parabolas with the median parabola in order to choose the next intersection to use. It is possible that each parabola might be the median parabola at some point of the algorithm. Thus one may have to determine as many as $\binom{n}{2} = O(n^2)$ intersections, and these intersections have to be ordered.

Figure 2 can also be used to show another view of how the PROGRESS algorithm works in the zero-intercept case. For each value of the slope that causes the straight line to pass through a data point, giving a squared residual of zero, the PROGRESS algorithm computes the median deviation of the 5 data points. It then chooses the slope that yields the minimum value of this set of deviations. The discussion in the previous paragraphs makes clear why this approach fails: the global minimum of the median squared residual will not, in general, be associated with a slope where the squared residual of an individual observation is zero. The approach will provide the correct result only in the case where a majority of the data points lie on a straight line through the origin.

6. Conclusion

When applying Least Median of Squares, coefficients are chosen so as to minimize the median of the squared residuals. Because the median is not sensitive to extreme values, it can outperform conventional least squares when data are contaminated. This paper makes two contributions to the LMS literature:

1) PROGRESS, the standard algorithm for fitting the LMS estimator, does not find the true LMS fit when the intercept is suppressed. Any computations based on the estimated slope (such as regression diagnostics and estimated standard errors) are also wrong.

2) For a bivariate regression with a zero intercept, $\hat{y} = mx$, an algorithmic method based on keeping track of the median squared residual is demonstrated.
References

Barreto, Humberto (2001) “An Introduction to Least Median of Squares,” unpublished manuscript, http://www.wabash.edu/econexcel/LMSOrigin (LMSIntro.doc).

Rousseeuw, Peter J. (1984) “Least Median of Squares Regression,” Journal of the American Statistical Association, 79 (388), 871-880.

Rousseeuw, Peter J. and Annick M. Leroy (1987) Robust Regression and Outlier Detection, John Wiley & Sons: New York.
