Working Paper

Least Absolute Value in Regression-Based Data Mining

Matt Wimble
Michigan State University
wimble@bus.msu.edu

ABSTRACT

Non-normal error terms have been shown to be widespread in practice. Least Absolute Value (LAV) regression estimates have been shown to produce better forecasts than Ordinary Least Squares (OLS) when outliers are present in a simple linear model. The utility of least absolute value in regression-based data mining has not been fully explored and is the focus of this paper. Regression-based data mining has been used in the past as a first step to identify appropriate input variables for more advanced techniques such as neural networks. Twenty demographic variables from U.S. states were used as independent variables. The data was bifurcated into equal parts training and test data, and 4-variable regression models were fitted to dependent variables of varying distributions using both OLS and LAV. The resulting equations were used to generate forecasts that were used to compare the performance of the regression methods under different dependent-variable distribution conditions. Initial findings indicate LAV produces better forecasts when the dependent variable is non-normal. Because variable selection is a combinatorial problem, it remains to be seen whether the better forecasts warrant the additional computational cost of LAV.

Keywords
L1-norm estimation, least absolute value, variable selection, data mining, robust regression

INTRODUCTION

Regression-based variable selection methods such as stepwise regression were among the first techniques that could be considered "data mining". Optimizing the selection of variables in a regression model has long been a subject of interest for statisticians (Hocking, 1976). Applying regression-based variable selection has proven useful in identifying input variables for other techniques, such as neural network classifiers (Nath et al., 1997).
Problems that arise from violations of regression assumptions have long been a subject of interest (Marquardt, 1974), ever since computerization facilitated automated variable selection techniques. Least squares (LS) regression estimates have been widely shown to provide the best estimates when the error term is normally distributed, but violations of the underlying normality assumption have been shown to be quite common. In both finance and economics, non-normal error terms have been shown to exist (Murphy, 2001). Investment returns have been known for some time to violate assumptions of normality (Fama, 1965; Myers and Majluf, 1984; Stein and Stein, 1991). Non-normal data has been shown to exist in biological laboratory data (Hill and Dixon, 1982), psychological data (Micceri, 1989), and RNA concentrations in medical data (Dyer et al., 1999).

Problems associated with assuming normality have wide-ranging implications. Statistics textbooks written for applied researchers claim normality assumptions are adequate (Wilcox, 1997), despite the problems with these assumptions being well known in the statistics literature. Well-accepted financial theory incorporates normality violations: in option pricing models (Black and Scholes, 1973), lognormality is an explicit model assumption. Natural phenomena such as tornado damage swaths, flood damage magnitude, and earthquake magnitude have also been shown to exhibit normality violations (Starr et al., 1976).

LS parameters are calculated by minimizing the sum of the squared distances between the observed and forecasted values. Least absolute value (LAV) parameters are calculated by minimizing the sum of the absolute distances between observed and forecasted values. Although LAV was proposed (Boscovich, 1757) earlier than LS (Legendre, 1805), LS has become the most widely used regression methodology.
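The difference between the two fitting criteria can be sketched in code. The example below is illustrative only (it is not the Excel Solver setup used in this study, and the data is hypothetical): OLS has a closed-form least-squares solution, while the LAV fit is posed as a linear program over the positive and negative parts of each residual, solved here with SciPy.

```python
import numpy as np
from scipy.optimize import linprog

def fit_ols(X, y):
    # Closed-form least squares: minimize the sum of squared residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_lav(X, y):
    # LAV as a linear program: minimize sum(u + v) subject to
    # X @ beta + u - v = y, with u, v >= 0, where u and v are the
    # positive and negative parts of each residual.
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# Toy data: y = 2x exactly, except for one gross outlier.
x = np.arange(10.0)
y = 2 * x
y[-1] = 100.0                              # outlier
X = np.column_stack([np.ones_like(x), x])  # intercept + slope
b_ols = fit_ols(X, y)
b_lav = fit_lav(X, y)
```

On these ten points the LAV line still recovers intercept 0 and slope 2 (it fits the nine uncontaminated points), while the OLS slope is pulled well above 2 by the single outlier.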
The absolute value function used in LAV has a discontinuous first derivative, precluding the use of calculus to find a general closed-form solution. The lack of a general solution makes LAV difficult to study from a theoretical standpoint, so study is often limited to simulation methods such as Monte Carlo. Several simulation studies comparing the performance of LS and LAV in small samples have been done. The studies of Blattberg and Sargent (1971), Wilson (1978), Pfaffenberger and Dinkel (1978), and Dielman (1986) have suggested that LAV estimators are about 80 percent as efficient as LS estimators when error terms are normally distributed; when the error distributions have heavy tails, large gains in efficiency occur (Dielman, 1986). Stepwise variable selection methods are by far the most common, and hypothesis-testing measures for LAV estimates exist to facilitate this procedure (Dielman and Rose, 2002). This paper focuses on enumerating all regression estimates in order to provide analysis for those using meta-heuristic optimization methods, such as Tabu Search and Genetic Algorithms, to search the selection space.

METHODOLOGY

The study was conducted using 1997 U.S. state data (Stat Abs US, 1998) obtained via the Visual Statistics 2 (Doane, 2001) supplementary data sets. Variable selection is a combinatorial problem; for this study, 4 variables were chosen out of 20 possible, resulting in 4,845 possible combinations. The size was limited to enable timely enumeration; it is not uncommon for a researcher using this technique to try models with upwards of 150 variables when searching for new insights. The original dataset contained 132 variables, which were trimmed to keep a roughly even distribution of demographic, economic, environmental, education, health, social, and transportation variables. Criminal, political, and geographic variables were omitted due to the size constraint. The independent variables used are shown in Table 1.
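The size of the selection space is easy to verify. A minimal sketch, using the Table 1 variable names, that generates every candidate 4-variable model:

```python
from itertools import combinations
from math import comb

predictors = [
    "AvBen", "EarnHour", "HomeOwn%", "Income", "Poverty", "Unem",
    "ColGrad%", "Dropout", "GradRate", "SATQ", "Hazard", "UninsChild",
    "UninsTotal", "Urban", "DriverMale%", "Helmet", "MilesPop",
    "AgeMedian", "PopChg%", "PopDen",
]

# Every 4-variable model that a complete enumeration must fit:
models = list(combinations(predictors, 4))
assert len(models) == comb(20, 4) == 4845
```

With 150 candidate variables, the same count balloons to C(150, 4) = 20,260,275 models, which is why meta-heuristic search becomes necessary at realistic scales.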
Table 1. Independent variables used

AvBen        1996 average weekly state unemployment benefit in dollars
EarnHour     1997 average hourly earnings of mfg production workers
HomeOwn%     1997 proportion of owner households to total occupied households
Income       1997 personal income per capita in current dollars
Poverty      1996 percentage below the poverty level
Unem         1996 unemployment rate, civilian labor force
ColGrad%     1990 percent college graduates in population age 25 and over
Dropout      1995 public high school dropout rate
GradRate     1995 public high school graduation rate
SATQ         1997 average SAT quantitative test score
Hazard       1997 number of hazardous waste sites on Superfund list
UninsChild   1996 percentage of children without health insurance
UninsTotal   1996 percentage of people without health insurance
Urban        1996 percent of population living in urban areas
DriverMale%  1997 percent of licensed drivers who are male
Helmet       1 if state had a motorcycle helmet law in 1995, 0 otherwise
MilesPop     1997 annual vehicle miles per capita
AgeMedian    1996 median age of population
PopChg%      1990-1995 percent population change
PopDen       1997 population density in persons per square mile

One dependent variable was chosen that was approximately normally distributed. The remaining dependent variables were randomly chosen from the variables in the original dataset not used as independent variables. Distributions were measured using BestFit; distribution fit was assessed using the Chi-Square test, the Anderson-Darling (A-D) statistic, and the Kolmogorov-Smirnov (K-S) test. The normal variable used was "1995 average daily hospital cost in dollars per patient", chosen because the normal distribution represented the best fit in 2 of the 3 tests. The other dependent variables used were "1997 federal grants per capita for highway trust fund and FTA", "1996 hospital beds per thousand population", and "1996 DoD total contract awards in millions of dollars".
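The study used the commercial BestFit package for these checks. A rough open-source analogue of such normality screening, using SciPy's K-S and Anderson-Darling tests on synthetic data (the actual state data is not reproduced here), might look like:

```python
import numpy as np
from scipy import stats

def normality_screen(x):
    # K-S test against a normal fitted to the sample (estimating mu and
    # sigma from the same data makes this version conservative), plus the
    # Anderson-Darling normality test.
    mu, sigma = np.mean(x), np.std(x, ddof=1)
    ks = stats.kstest(x, "norm", args=(mu, sigma))
    ad = stats.anderson(x, dist="norm")
    return ks.pvalue, ad.statistic, ad.critical_values[2]  # 5% critical value

rng = np.random.default_rng(42)
normal_like = rng.normal(919.5, 215.9, size=50)  # shaped like the hospital-cost variable
skewed = rng.lognormal(3.0, 1.0, size=50)        # heavily right-skewed

p_norm, ad_norm, crit = normality_screen(normal_like)
p_skew, ad_skew, _ = normality_screen(skewed)
```

For the normal-looking sample the K-S p-value stays comfortably large, while the skewed sample's Anderson-Darling statistic exceeds its 5% critical value, flagging the normality violation.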
The histograms for the dependent variables are shown in Figures 1-4.

[Figure 1. Hospital Cost histogram, with fitted Normal(919.50, 215.92)]
[Figure 2. Federal Grant histogram, with fitted Pearson5(2.1443, 74.287), shift = +40.946]
[Figure 3. Hospital Beds histogram, with fitted Pearson5(1.9386, 21.147), shift = -2.6209]
[Figure 4. Defense Contracts histogram, with fitted InvGauss(363.85, 446.30), shift = +18.854]

Once the independent and dependent variables were selected, a complete enumeration of all LAV and OLS models was performed. Function minimization was performed using the Premium Solver add-in for Excel by Frontline Systems. Initial models were checked against conventional regression methods to verify the validity of the technique. The data was bifurcated into two 25-state groups, one for training and one for validation, with the data ordered alphabetically by state.

RESULTS

A total of 4,845 regression models were run for each of the dependent variables. The top 1, 2, and 5 percent of the models for both OLS and LAV were summarized; for example, the LAV forecasts with the lowest absolute fitted error were compared with the OLS forecasts with the lowest squared fitted error. Performance is measured by the ability to forecast values in the validation set. Performance was measured in several ways.
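The comparison measures used in this study (percent of LAV forecasts closer, relative efficiency, and MAD) can be sketched as follows; function and data values here are illustrative, not taken from the study:

```python
import numpy as np

def forecast_metrics(y_true, f_lav, f_ols):
    """Summarize LAV vs. OLS forecasts on a validation set."""
    err_lav = np.abs(f_lav - y_true)
    err_ols = np.abs(f_ols - y_true)
    mad_lav = err_lav.mean()                 # mean absolute deviation, LAV
    mad_ols = err_ols.mean()                 # mean absolute deviation, OLS

    def rmsfe(f):
        # Root mean squared forecast error.
        return np.sqrt(np.mean((f - y_true) ** 2))

    rel_eff = rmsfe(f_lav) / rmsfe(f_ols)    # RE = RMSFE_LAV / RMSFE_OLS
    pct_closer = np.mean(err_lav < err_ols)  # share where LAV is strictly closer
    return mad_lav, mad_ols, rel_eff, pct_closer

# Hypothetical validation-set values and forecasts:
y = np.array([10.0, 20.0, 30.0])
f_lav = np.array([11.0, 19.0, 33.0])
f_ols = np.array([12.0, 22.0, 31.0])
mad_lav, mad_ols, rel_eff, pct_closer = forecast_metrics(y, f_lav, f_ols)
```

A relative efficiency below 100% means the LAV forecasts had the lower RMSFE; a percent-closer value above 50% means LAV beat OLS on most individual forecasts.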
The measures were the percentage of LAV forecasts that were closer, the relative efficiency of LAV to OLS, the mean absolute deviation (MAD), and the standard deviation of absolute forecast errors. The percentage of LAV forecasts that were closer is defined as the number of LAV forecasts closer to the true value divided by the total number of forecasts; in other words, how often LAV produced the better forecast. Relative efficiency is defined as

    RE = RMSFE_LAV / RMSFE_OLS,  where  RMSFE = [ (1/n) * sum_{i=1}^{n} (yhat_i - y_i)^2 ]^{1/2}

and n is the number of forecasts. MAD is defined as

    MAD = (1/n) * sum_{i=1}^{n} | yhat_i - y_i |.

MAD is a measure of how far off the forecasts are from the true values on average. The accuracy performance of LAV relative to OLS is shown in Figure 5. Performance is also summarized in greater detail in Tables 2-5.

[Figure 5. Comparison by variable for top 2% of fits: bars for "LAV Closer" and "Efficiency" (0-120%) plotted for Hospital Cost, GovGrant, BedHospt, and Defense]

Table 2. Performance on Hospital Cost

                 top 1% (48 obs)     top 2% (97 obs)     top 5% (242 obs)
                 LAV      OLS        LAV      OLS        LAV      OLS
MAD              4077.0   4157.5     4133.5   4062.6     4188.1   4001.3
Std. Dev.        390.7    147.1      419.7    293.0      611.3    458.5
%LAV closer      58.3%               46.4%               47.9%
Efficiency       96.9%               104.0%              110.4%

Table 3. Performance on Government Grants

                 top 1% (48 obs)     top 2% (97 obs)     top 5% (242 obs)
                 LAV      OLS        LAV      OLS        LAV      OLS
MAD              1179.4   1279.7     1147.3   1223.8     1071.2   1153.6
Std. Dev.        145.2    103.5      153.7    131.1      141.8    133.7
%LAV closer      75.0%               68.0%               72.3%
Efficiency       85.6%               88.4%               86.6%

Table 4. Performance on Hospital Beds

                 top 1% (48 obs)     top 2% (97 obs)     top 5% (242 obs)
                 LAV      OLS        LAV      OLS        LAV      OLS
MAD              321.9    335.7      320.7    338.4      322.9    339.1
Std. Dev.        20.0     19.5       23.0     17.5       21.6     12.6
%LAV closer      79.2%               73.2%               76.4%
Efficiency       92.0%               90.0%               91.0%

Table 5. Performance on Defense Contracts

                 top 1% (48 obs)     top 2% (97 obs)     top 5% (242 obs)
                 LAV      OLS        LAV      OLS        LAV      OLS
MAD              5762.0   6057.0     5395.5   6050.7     5269.3   6023.4
Std. Dev.        1195.3   371.3      1302.6   490.6      1170.5   509.6
%LAV closer      66.7%               69.1%               73.1%
Efficiency       94.0%               83.6%               79.7%

DISCUSSION

Performance can be analyzed in terms of accuracy, efficiency, and consistency. Accuracy in this study was measured in both MAD and percent-closer terms. In accuracy terms, LAV performed at or above the level that simulation results (Dielman, 1986) tended to suggest. In relative efficiency terms, LAV also performed better than simulation would have suggested: Dielman's study showed LAV to have relative efficiency measures in the range of 125%, while in this study OLS performed only slightly better, with relative efficiency measures ranging from 97-110%. Dielman's study used simulated rather than real data, and real data is inherently less normal; this variation from normality most likely explains the differences in relative efficiency from Dielman's findings. LAV also provided a more accurate forecast somewhat more often than simulation would suggest, with its forecasts closer about 10% more often than in Dielman's study. An interesting finding was that OLS produced a more consistent forecast than LAV for all dependent variables used. It is worth noting that comparing performance in this study to Dielman's results is problematic in that Dielman used symmetric distributions, whereas the non-normal data in this study exhibited considerable skewness.

CONCLUSION

Given how often normality violations occur in real data, the use of robust estimation techniques such as LAV would seem to be useful in regression-based data mining. Preliminary data suggests that LAV could be useful in regression-based data-mining models, but far more data is needed to draw any substantial conclusions. Simulation studies of this technique are difficult to conduct due to the factorial number of possible models that must be controlled.
Given these difficulties, studies with real data present a practical way to study LAV and other robust techniques. Another open question is whether the potential benefits of LAV outweigh the computational overhead of the Simplex method, versus the O(n) closed-form solution of OLS for a fixed number of predictors, when used within the metaheuristics necessitated by the problem scale.

FUTURE RESEARCH

Initial results from this work in process suggest that further study of LAV estimation and other robust regression techniques, within the context of variable selection, is worth pursuing. A larger sample of forecast tests would be necessary to provide sufficient justification for using LAV-based regression to select variables. More study is needed to determine whether the findings on forecast consistency hold and whether the performance under skewness persists. The interaction of suboptimal LAV regression estimates with a variable-selection metaheuristic, such as Tabu Search or Genetic Algorithms, has not been explored. Additionally, LAV is only one of many robust regression techniques that could be explored within the same problem framework.

REFERENCES

1. Black, F. and Scholes, M. (1973), "The Pricing of Options and Corporate Liabilities", Journal of Political Economy, 81:637-654.
2. Blattberg, R. and Sargent, T. (1971), "Regression with non-Gaussian stable disturbances: some sampling results", Econometrica, 39:501-510.
3. Boscovich, R. (1757), "De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operas, ac habentur plura ejus ex exemplaria etiam sensorum impressa", Bononiensi Scientiarum et Artum Instituto Atque Academia Commentarii, 4:353-396.
4. Dielman, T. E. (1986), "A Comparison of Forecasts from Least Absolute Value and Least Squares Regression", Journal of Forecasting, 5:189-195.
5. Dielman, T. E. and Rose, E. L. (2002), "Bootstrap Versus Traditional Hypothesis Testing Procedures for Coefficients in Least Absolute Value Regression", Journal of Statistical Computation and Simulation, 72:665-675.
6. Doane, D., Tracy, R., and Matheison, K. (2001), Visual Statistics 2, The McGraw-Hill Companies, Inc.
7. Dyer, J., Pilcher, C., Shepard, R., Schock, J., Eron, J., and Fiscus, S. (1999), "Comparison of NucliSens and Roche Monitor Assays for Quantitation of Levels of Human Immunodeficiency Virus Type 1 RNA in Plasma", Journal of Clinical Microbiology, 37(2):447-449.
8. Fama, E. (1965), "The behavior of stock market prices", Journal of Business, 38:34-105.
9. Hill, M. and Dixon, W. J. (1982), "Robustness in real life: A study of clinical laboratory data", Biometrics, 38:377-396.
10. Hocking, R. R. (1976), "The Analysis and Selection of Variables in Linear Regression", Biometrics, 32:1-49.
11. Legendre, A. M. (1805), Nouvelles méthodes pour la détermination des orbites des comètes, Paris, pp. 72-75.
12. Marquardt, D. W. (1974), "Discussion (of Beaton and Tukey [1974])", Technometrics, 16:189-192.
13. Micceri, T. (1989), "The unicorn, the normal curve, and other improbable creatures", Psychological Bulletin, 105:156-166.
14. Murphy, J. A. (2000), Scientific Investment Analysis, Greenwood Press, p. 100.
15. Myers, S. and Majluf, N. (1984), "Corporate Financing and Investment Decisions When Firms Have Information That Investors Do Not", Journal of Financial Economics, 13:187-221.
16. Nath, R., Rajagopalan, B., and Ryker, R. (1997), "Determining the Saliency of Input Variables in Neural Network Classifiers", Computers and Operations Research, 24(8):767-773.
17. Pfaffenberger, R. and Dinkel, J. (1978), "Absolute deviations curve-fitting: an alternative to least squares", in David, H. (ed.), Contributions to Survey Sampling and Applied Statistics, New York: Academic Press, pp. 279-294.
18. Statistical Abstract of the United States, 1998.
19. Starr, C., Rudman, R., and Whipple, C. (1976), "Philosophical Basis for Risk Analysis", Annual Review of Energy, 1:629-662.
20. Stein, E. and Stein, J. (1991), "Stock Price Distributions with Stochastic Volatility: An Analytic Approach", The Review of Financial Studies, 4:727-752.
21. Wilcox, R. R. (1997), Introduction to Robust Estimation and Hypothesis Testing, Chestnut Hill, MA: Academic Press, p. 1.
22. Wilson, H. (1978), "Least-squares versus minimum absolute deviations estimation in linear models", Decision Sciences, 9:322-335.