
Working Paper
Least Absolute Value in Regression-Based Data Mining
Matt Wimble
Michigan State University
wimble@bus.msu.edu
ABSTRACT
Non-normal error terms have been shown to occur widely in practice. Least Absolute Value (LAV) regression
estimates have been shown in the past to produce better forecasts than Ordinary Least Squares (OLS) when outliers are
present in a simple linear model. The utility of least absolute value in regression-based data mining has not been fully
explored and is the focus of this paper. Regression-based data mining has been used in the past as a first step to identify
appropriate input variables for more advanced techniques such as Neural Networks. Twenty demographic variables from
U.S. states were used as independent variables. The data was split into equal training and test halves, and four-variable
regression models were fitted to dependent variables of varying distributions using both OLS and LAV. The resulting
equations were used to generate forecasts that were used to compare the performance of the regression methods under
different dependent variable distribution conditions. Initial findings indicate LAV produces better forecasts when the
dependent variable is non-normal. Because variable selection is a combinatorial problem, it remains to be seen whether the
better forecasts warrant the additional computational cost of LAV.
Keywords
L1-Norm estimation, least absolute value, variable selection, data mining, robust regression
INTRODUCTION
Regression-based variable selection methods such as stepwise regression were among the first techniques that could be considered "data
mining". Optimizing the selection of variables in a regression model has long been a subject of interest for
statisticians (Hocking, 1976). Applying regression-based variable selection has proven useful in identifying input variables for other
techniques, such as Neural Network classifiers (Nath et al., 1997). Problems that arise from violations of regression
assumptions have also long been a subject of interest (Marquardt, 1974), ever since computerization facilitated automated variable
selection techniques.
Least squares (LS) regression estimates have been widely shown to provide the best estimates when the error term is
normally distributed. Violations of the underlying normality assumption, however, have been shown to be quite common.
Non-normal error terms have been documented in both finance and economics (Murphy, 2000). Investment
returns have been known for some time (Fama, 1965; Myers and Majluf, 1984; Stein and Stein, 1991) to violate assumptions
of normality. Non-normal data has been shown to exist in biological laboratory data (Hill and Dixon, 1982), psychological
data (Micceri, 1989), and RNA concentrations in medical data (Dyer et al., 1999). Problems associated with assuming
normality have wide-ranging implications. Statistics textbooks written for applied researchers claim normality
assumptions are adequate, despite the problems with these assumptions being well known in the statistics literature (Wilcox,
1997). Well-accepted financial theory incorporates normality violations in option pricing models (Black and Scholes, 1973), where
lognormality is expressed as a model assumption. Natural phenomena (Starr et al., 1976) such as tornado damage swaths,
flood damage magnitude, and earthquake magnitude have been shown to exhibit normality violations.
LS parameters are calculated by minimizing the sum of the squares of the distances between the observed and forecasted
values. Least absolute value (LAV) parameters are calculated by minimizing the sum of the absolute distances between observed and
forecasted values. Although LAV was proposed (Boscovich, 1757) earlier than LS (Legendre, 1805), LS has been adopted as
the most widely used regression methodology. The absolute value function used in LAV has a discontinuous first
derivative, which precludes the use of calculus to find a general closed-form solution. The lack of a general solution for LAV
makes the method difficult to study from a theoretical standpoint, so study is often limited to simulation methods such as
Monte Carlo.
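To make the contrast concrete, the two fitting criteria can be sketched in a few lines of Python. This is an illustrative implementation, not the one used in this study (which relied on the Premium Solver Add-In described in the Methodology section); the LAV fit is posed as the standard linear program over split residuals.

```python
import numpy as np
from scipy.optimize import linprog

def fit_ols(X, y):
    # Least squares: minimize the sum of squared residuals (closed form).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_lav(X, y):
    # LAV: minimize sum(|y - X @ beta|), posed as a linear program by
    # splitting each residual into nonnegative parts e_plus and e_minus:
    #   minimize   sum(e_plus + e_minus)
    #   subject to X @ beta + e_plus - e_minus = y,  e_plus, e_minus >= 0.
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]
```

With heavy-tailed noise (for example, Student-t errors with few degrees of freedom), the LAV coefficients will typically sit closer to the true values than the OLS ones, which is the behavior the simulation literature below describes.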
Several simulation studies comparing the performance of LS and LAV in small samples have been done. The studies of
Blattberg and Sargent (1971), Wilson (1978), Pfaffenberger and Dinkel (1978), and Dielman (1986) have suggested that
LAV estimators are about 80 percent as efficient as LS estimators when error terms are normally distributed. When the error
distributions have heavy tails, large gains in efficiency occur (Dielman, 1986). Stepwise variable selection methods are by
far the most common, and hypothesis testing measures for LAV estimates have been developed (Dielman and Rose, 2002)
to facilitate this procedure. This paper will focus on enumerating all regression estimates in order to provide analysis for
those using meta-heuristic optimization methods, such as Tabu Search and Genetic Algorithms, to search the selection space.
METHODOLOGY
The study was conducted using 1997 U.S. state data (Statistical Abstract of the U.S., 1998) obtained via the Visual Statistics 2 (Doane, 2001)
supplementary data sets. Variable selection is a combinatorial problem. For this study, 4 variables were chosen out of 20
possible, resulting in 4,845 possible combinations; the problem size was limited to enable timely enumeration. It is not uncommon for a
researcher using this technique in search of new insights to attempt models with upwards of 150 variables. The original
dataset contained 132 variables, which were trimmed to keep a roughly even distribution of demographic, economic,
environmental, education, health, social, and transportation variables. Criminal, political, and geographic variables were
omitted due to the size constraint. The independent variables used are shown in Table 1.
Table 1. Independent variables used

AvBen        1996 average weekly state unemployment benefit in dollars
EarnHour     1997 average hourly earnings of mfg production workers
HomeOwn%     1997 proportion owner households to total occupied households
Income       1997 personal income per capita in current dollars
Poverty      1996 percentage below the poverty level
Unem         1996 unemployment rate, civilian labor force
ColGrad%     1990 percent college graduates in population age 25 and over
Dropout      1995 public high school dropout rate
GradRate     1995 public high school graduation rate
SATQ         1997 average SAT quantitative test score
Hazard       1997 number of hazardous waste sites on Superfund list
UninsChild   1996 percentage of children without health insurance
UninsTotal   1996 percentage of people without health insurance
Urban        1996 percent of population living in urban areas
DriverMale%  1997 percent of licensed drivers who are male
Helmet       1 if state had a motorcycle helmet law in 1995, 0 otherwise
MilesPop     1997 annual vehicle miles per capita
AgeMedian    1996 median age of population
PopChg%      1990-1995 percent population change
PopDen       1997 population density in persons per square mile
One dependent variable was chosen that was approximately normally distributed. The remaining dependent variables were
randomly chosen from the variables in the original dataset not used as independent variables. Distributions were measured
using BestFit. Distribution fit was calculated using the Chi-Square test, the Anderson-Darling statistic (A-D), and the Kolmogorov-Smirnov
test (K-S). The normal variable used was "1995 average daily hospital cost in dollars per patient". This variable was
chosen because the normal distribution represented the best fit in 2 out of the 3 tests. The other dependent variables used were
"1997 federal grants per capita for highway trust fund and FTA", "1996 hospital beds per thousand population", and "1996
DoD total contract awards in millions of dollars". The histograms for the dependent variables are shown in Figures 1-4.
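The study used the commercial BestFit package for these tests. For readers without it, a rough scipy equivalent of the normality screening can be sketched as follows (the Chi-Square goodness-of-fit test would additionally require binning the data, omitted here for brevity; this is an approximation of, not the study's own, procedure):

```python
import numpy as np
from scipy import stats

def normality_screen(x):
    # Kolmogorov-Smirnov test against a normal fitted to the sample
    # (estimating the parameters from the same data makes the standard
    # K-S p-value only approximate).
    mu, sigma = np.mean(x), np.std(x, ddof=1)
    ks = stats.kstest(x, "norm", args=(mu, sigma))
    # Anderson-Darling statistic with critical values for the normal family.
    ad = stats.anderson(x, dist="norm")
    return {"KS stat": ks.statistic, "KS p": ks.pvalue,
            "A-D stat": ad.statistic, "A-D critical": ad.critical_values}
```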
Figure 1. Hospital Cost Histogram [best fit: Normal(919.50, 215.92)]
Figure 2. Federal Grant Histogram [best fit: Pearson5(2.1443, 74.287), Shift=+40.946]
Figure 3. Hospital Beds Histogram [best fit: Pearson5(1.9386, 21.147), Shift=-2.6209]
Figure 4. Defense Contracts Histogram [best fit: InvGauss(363.85, 446.30), Shift=+18.854]
Once the independent and dependent variables were selected, a complete enumeration of all LAV and OLS models was
performed. Function minimization was performed using the Premium Solver Add-In for Excel by Frontline Systems. Initial
models were checked against conventional regression methods to verify the validity of the technique. The data was split into two even
groups of 25 states each, one for training and one for validation, with states assigned in alphabetical order.
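In outline, the enumeration amounts to the loop below. This is a hypothetical re-implementation in Python rather than the Excel workflow actually used; fit_ols and fit_lav are the estimators sketched earlier, and the loss argument is the criterion used to rank the training fits (absolute error for LAV, squared error for OLS).

```python
from itertools import combinations
import numpy as np

def enumerate_models(X_all, y, fit, loss):
    # X_all: 50 rows (states, in alphabetical order) x 20 candidate predictors.
    # The first 25 states train each model; the remaining 25 validate it.
    train, test = slice(0, 25), slice(25, 50)
    ranked = []
    for cols in combinations(range(X_all.shape[1]), 4):  # C(20, 4) = 4,845 subsets
        X = np.column_stack([np.ones(len(y)), X_all[:, cols]])
        beta = fit(X[train], y[train])
        ranked.append((loss(X[train] @ beta - y[train]), cols, X[test] @ beta))
    ranked.sort(key=lambda r: r[0])  # best in-sample fits first
    return ranked  # top 1/2/5 percent = roughly the first 48/97/242 entries

# e.g. lav_ranked = enumerate_models(X_all, y, fit_lav, lambda e: np.abs(e).sum())
#      ols_ranked = enumerate_models(X_all, y, fit_ols, lambda e: (e ** 2).sum())
```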
RESULTS
A total of 4,845 regression models were run for each of the dependent variables. The top 1, 2, and 5 percent of the models for
both OLS and LAV were summarized; for example, the LAV forecasts with the lowest absolute fitted error were compared
with the OLS forecasts with the lowest squared fitted error. Performance was measured by the ability to forecast the
validation set values, in four ways: the percentage of LAV forecasts that were closer, the relative
efficiency of LAV to OLS, the mean absolute deviation (MAD), and the standard deviation of the absolute forecast errors. The percentage
of LAV forecasts that were closer is defined as the number of LAV forecasts that were closer to the true value divided by the
total number of forecasts, in other words how often LAV produced a better forecast. Relative efficiency is defined as

$$\mathrm{RE} = \frac{\mathrm{RMSFE}_{\mathrm{LAV}}}{\mathrm{RMSFE}_{\mathrm{OLS}}}, \qquad \text{where } \mathrm{RMSFE} = \left( \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 \right)^{1/2}$$

and $n$ is the number of forecasts. MAD is defined as

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|.$$

MAD is a measure of how far off the forecasts are on average.
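Given these definitions, the four performance measures reduce to a few lines (a sketch with illustrative names; lav_fc and ols_fc stand for the validation-set forecasts from the two methods):

```python
import numpy as np

def rmsfe(fc, actual):
    # Root mean squared forecast error.
    return np.sqrt(np.mean((fc - actual) ** 2))

def compare(lav_fc, ols_fc, actual):
    return {
        # Relative efficiency: values below 100% favor LAV.
        "efficiency": rmsfe(lav_fc, actual) / rmsfe(ols_fc, actual),
        # How often the LAV forecast landed closer to the true value.
        "%LAV closer": np.mean(np.abs(lav_fc - actual) < np.abs(ols_fc - actual)),
        # Mean absolute deviation of each method's forecasts.
        "MAD LAV": np.mean(np.abs(lav_fc - actual)),
        "MAD OLS": np.mean(np.abs(ols_fc - actual)),
        # Consistency: standard deviation of the absolute forecast errors.
        "StdDev LAV": np.std(np.abs(lav_fc - actual), ddof=1),
        "StdDev OLS": np.std(np.abs(ols_fc - actual), ddof=1),
    }
```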
Accuracy performance of LAV relative to OLS is shown in Figure 5. Performance is also summarized in greater detail in
Tables 2-5.
Figure 5. Comparison by variable for top 2% of fits [bar chart of %LAV closer and relative efficiency for Hospital Cost, GovGrant, BedHospt, and Defense]
Table 2. Performance on Hospital Cost

                    MAD                Std. Dev           %LAV
                    LAV      OLS       LAV      OLS       closer    Efficiency
top 1% (48 obs)     4077.0   4157.5    390.7    147.1     58.3%     96.9%
top 2% (97 obs)     4133.5   4062.6    419.7    293.0     46.4%     104.0%
top 5% (242 obs)    4188.1   4001.3    611.3    458.5     47.9%     110.4%
Table 3. Performance on Government Grants

                    MAD                Std. Dev           %LAV
                    LAV      OLS       LAV      OLS       closer    Efficiency
top 1% (48 obs)     1179.4   1279.7    145.2    103.5     75.0%     85.6%
top 2% (97 obs)     1147.3   1223.8    153.7    131.1     68.0%     88.4%
top 5% (242 obs)    1071.2   1153.6    141.8    133.7     72.3%     86.6%
Table 4. Performance on Hospital Beds

                    MAD                Std. Dev           %LAV
                    LAV      OLS       LAV      OLS       closer    Efficiency
top 1% (48 obs)     321.9    335.7     20.0     19.5      79.2%     92.0%
top 2% (97 obs)     320.7    338.4     23.0     17.5      73.2%     90.0%
top 5% (242 obs)    322.9    339.1     21.6     12.6      76.4%     91.0%
Table 5. Performance on Defense Contracts

                    MAD                Std. Dev           %LAV
                    LAV      OLS       LAV      OLS       closer    Efficiency
top 1% (48 obs)     5762.0   6057.0    1195.3   371.3     66.7%     94.0%
top 2% (97 obs)     5395.5   6050.7    1302.6   490.6     69.1%     83.6%
top 5% (242 obs)    5269.3   6023.4    1170.5   509.6     73.1%     79.7%
DISCUSSION
Performance can be analyzed in terms of accuracy, efficiency, and consistency. Accuracy in this study was measured in both
MAD and percent-closer terms. LAV performed at or above what simulation results (Dielman, 1986) tended to suggest
in terms of accuracy. In relative efficiency terms LAV also performed better than simulation would have suggested: Dielman's
study showed LAV to have relative efficiency measures in the range of 125%, while in this study OLS performed only slightly
better, with relative efficiency ranging from 97% to 110%. Dielman's study used
simulated data rather than real data, which is inherently less normal; this variation from normality most likely explains the
differences in relative efficiency from Dielman's findings. LAV did provide a more accurate forecast more often
than simulation would suggest, with its forecasts being closer about 10% more often than in Dielman's study. An
interesting finding was that OLS produced a more consistent forecast than LAV for all dependent variables used. It is worth
noting that comparing performance in this study to Dielman's results is problematic in that Dielman used symmetric
distributions, whereas the non-normal data in this study exhibited considerable skewness.
CONCLUSION
Given how often normality violations occur in real data, the use of robust estimation techniques such as LAV would seem to
be useful in regression-based data mining. Preliminary data suggests that LAV could be useful in regression-based
data mining models, but far more data is needed to draw any substantial conclusions. Simulation studies of this technique
are difficult to conduct due to the combinatorial number of possible models that must be controlled. As a result of
these difficulties, studies with real data present a practical way to study LAV and other robust techniques. Another open
question is whether the potential benefits of LAV outweigh the computational overhead of solving a linear program with the
Simplex method, versus the closed-form solution of OLS (linear in the number of observations for a fixed number of
predictors), when used within the metaheuristics necessitated by the problem scale.
FUTURE RESEARCH
Initial results from this work-in-process suggest that further study of LAV estimation and other robust regression techniques,
within the context of variable selection, is worth pursuing. A larger sample of forecast tests would be necessary to provide
sufficient justification for using LAV-based regression to select variables. More study would be needed to determine whether
the findings on forecast consistency hold and whether the performance under skewness persists. The interaction of
suboptimal LAV regression estimates with a variable selection metaheuristic, such as Tabu Search or Genetic
Algorithms, has not been explored. Additionally, LAV is only one of many robust regression techniques that could be
explored within the same problem framework.
REFERENCES
1. Black, F. and Scholes, M. (1973), "The Pricing of Options and Corporate Liabilities", Journal of Political Economy, 81:637-654.
2. Blattberg, R. and Sargent, T. (1971), "Regression with non-Gaussian stable disturbances: some sampling results", Econometrica, 39:501-510.
3. Boscovich, R. (1757), "De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operas, ac habentur plura ejus ex exemplaria etiam sensorum impressa", Bononiensi Scientiarum et Artum Instituto Atque Academia Commentarii, 4:353-396.
4. Dielman, T. E. (1986), "A Comparison of Forecasts from Least Absolute Value and Least Squares Regression", Journal of Forecasting, 5:189-195.
5. Dielman, T. E. and Rose, E. L. (2002), "Bootstrap Versus Traditional Hypothesis Testing Procedures for Coefficients in Least Absolute Value Regression", Journal of Statistical Computation and Simulation, 72:665-675.
6. Doane, D., Tracy, R., and Mathieson, K. (2001), Visual Statistics 2, The McGraw-Hill Companies, Inc.
7. Dyer, J., Pilcher, C., Shepard, R., Schock, J., Eron, J., and Fiscus, S. (1999), "Comparison of NucliSens and Roche Monitor Assays for Quantitation of Levels of Human Immunodeficiency Virus Type 1 RNA in Plasma", Journal of Clinical Microbiology, 37(2):447-449.
8. Fama, E. (1965), "The behavior of stock market prices", Journal of Business, 38:34-105.
9. Hill, M. and Dixon, W. J. (1982), "Robustness in real life: A study of clinical laboratory data", Biometrics, 38:377-396.
10. Hocking, R. R. (1976), "The Analysis and Selection of Variables in Linear Regression", Biometrics, 32:1-49.
11. Legendre, A. M. (1805), Nouvelles méthodes pour la détermination des orbites des comètes, Paris, pp. 72-75.
12. Marquardt, D. W. (1974), "Discussion (of Beaton and Tukey [1974])", Technometrics, 15:189-192.
13. Micceri, T. (1989), "The unicorn, the normal curve, and other improbable creatures", Psychological Bulletin, 105:156-166.
14. Murphy, J. A. (2000), Scientific Investment Analysis, Greenwood Press, p. 100.
15. Myers, S. and Majluf, N. (1984), "Corporate Financing and Investment Decisions When Firms Have Information That Investors Do Not", Journal of Financial Economics, 13:187-221.
16. Nath, R., Rajagopalan, B., and Ryker, R. (1997), "Determining the Saliency of Input Variables in Neural Network Classifiers", Computers and Operations Research, 24(8):767-773.
17. Pfaffenberger, R. and Dinkel, J. (1978), "Absolute deviations curve-fitting: an alternative to least squares", in David, H. (ed.), Contributions to Survey Sampling and Applied Statistics, New York: Academic Press, pp. 279-294.
18. Statistical Abstract of the United States: 1998, U.S. Census Bureau.
19. Starr, C., Rudman, R., and Whipple, C. (1976), "Philosophical Basis for Risk Analysis", Annual Review of Energy, 1:629-662.
20. Stein, E. and Stein, J. (1991), "Stock Price Distributions with Stochastic Volatility: An Analytic Approach", The Review of Financial Studies, 4:727-752.
21. Wilcox, R. R. (1997), Introduction to Robust Estimation and Hypothesis Testing, Chestnut Hill, MA: Academic Press, p. 1.
22. Wilson, H. (1978), "Least-squares versus minimum absolute deviations estimation in linear models", Decision Sciences, 9:322-335.