
Procedia Technology 7 (2013) 282 – 288
2013 Iberoamerican Conference on Electronics Engineering and Computer Science
A Comparison between LARS and LASSO for Initialising the
Time-Series Forecasting Auto-Regressive Equations
Eric Iturbide, Jaime Cerda, Mario Graff∗
Division de Estudios de Postgrado, Facultad de Ingenieria Electrica Universidad Michoacana de San Nicolas de Hidalgo, Ciudad
Universitaria, C.P. 58004, Morelia, Mich., Mexico
Abstract
In this paper, the LASSO and LARS estimators are compared against OLS for fitting auto-regressive time series models. LASSO and LARS are two widely used methods for tackling the variable selection problem. To this end, we used 4,004 different time series taken from the M1 and M3 time series competitions. As expected, the experiments corroborate that LARS and LASSO derive models that outperform OLS models in terms of the mean square error. It is well known that LARS and LASSO behave similarly; however, the results obtained highlight their differences in terms of forecasting accuracy.
© 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of CIIECC 2013
Keywords: LASSO, LARS, OLS, AR(n) models, Time Series
1. Introduction
The variable selection problem arises in many practical situations when, out of a set of variables involved in a problem, a subset of variables which best represents the problem needs to be chosen. Methods which face this problem are known as variable selection methods. This paper performs a comparison among ordinary least squares (OLS), the least absolute shrinkage and selection operator (LASSO) [1] and least angle regression (LARS) [2]. These methods were applied to the prediction of 4,004 different time series taken from the M1 and M3 time series competitions (see [3, 4]) using a linear model. A 5-order auto-regressive model was used to initialise each of the time series forecasting models. The principal idea is to create accurate predictive models with interpretable or parsimonious representations. LASSO, proposed by Tibshirani [1], is a popular variable selection technique. It is an OLS-based method which places an ℓ1-norm penalty on the regression coefficients, and it tends to produce simpler models. LARS, proposed by Efron, Hastie, Johnstone and Tibshirani [2], is a method that can be viewed as a vector-based version of LASSO that accelerates the computations. LARS proceeds in the direction equiangular to the variables most correlated with the current residual. At each stage a variable is added to the active set, and this process can continue until all variables have been added.
∗ Corresponding author. Tel: + 52 (443) 3223500 ext. 4373
Email addresses: eiturbide@dep.fie.umich.mx (Eric Iturbide), jcerda@umich.mx (Jaime Cerda),
mgraffg@dep.fie.umich.mx (Mario Graff)
A characteristic of LARS is its computational efficiency: the algorithm requires the same order of computations as OLS. Both LARS and LASSO produce simpler models, which are therefore more easily interpretable. Selecting the best subset of variables is a fundamental task to avoid over-fitting, but finding the best subset is difficult, especially when there are many variables. To face this problem, the tuning parameters of LASSO and LARS have been chosen for each of the 4,004 different time series. The tuning parameter for LASSO is a constant t which constrains the ℓ1-norm of the coefficients, whereas for LARS it is k, the number of vectors which will appear in the resulting model. A 5-fold cross-validation process has been used; this is a popular method to estimate the forecasting error as well as to compare different models in machine learning.
This paper is organized as follows. In section 2, the work related to LASSO and LARS is reviewed. In section 3, the regression models are introduced. The formulation of OLS, LASSO and LARS for time series is described in section 4. The results are presented in section 5. Finally, section 6 closes this paper with some conclusions.
2. Related work
Currently, variable selection methods have been applied in several fields, e.g., pattern classification [5] and imaging studies [6]. Selecting the best subset of variables is not an easy task, and when the number of variables is large the task becomes computationally infeasible, because the total number of variable subsets grows exponentially. Methods such as backward selection and forward selection are used for subset selection; however, these kinds of greedy algorithms are known to have poor performance. OLS is a method that estimates all the regression coefficients to fit the data. However, since the interest is in obtaining the subset of coefficients which best represents the problem, OLS fails in this respect. Therefore, a review of the alternatives which are based on OLS is undertaken. The first model is ridge regression [7]. Ridge is the oldest penalty method and is not considered a variable selection method. It introduces a nonnegative constant that controls the trade-off between the model fit and the data, penalizing the ℓ2-norm of the coefficients. The quadratic ℓ2 penalty shrinks the coefficients towards zero, but they never become exactly zero. LASSO is a newer technique for variable selection. It uses the ℓ1-norm in the model fitting criterion: the sum of the absolute values of the coefficients is bounded by a positive constant t. When solved with a soft-threshold, LASSO shrinks the coefficients and selects variables simultaneously, setting some coefficients exactly to zero. Other related works are the Bayesian LASSO [8] and the ELASTIC NET [9], two methodologies used principally in economics. However, the best-known strategies for selecting the best subset of variables are stepwise methods, which build the model sequentially by including or excluding one variable at each step. The main method is LARS, which, combined with supervised learning techniques, is very powerful in terms of performance and accuracy. The Annals of Statistics dedicated 92 pages to this topic [2], and the paper was followed by a debate about the advantages and disadvantages of LARS [10].
3. Regression methods
This section describes the regression models used in this work: the linear model as well as two variable
selection methods are presented. Specifically, LASSO and LARS will be discussed.
3.1. The linear model
The linear regression model is commonly used for analysis of the relationships between a response and
explanatory variables. The input of the model has n instances of a process described by p variables. We
define the matrix X ∈ R^{n×p}, representing the p variables for n samples, and y ∈ R^n, representing the values of the response. The following equation shows this formulation:
\[
\begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix}
\beta_1 \\ \vdots \\ \beta_p
\end{pmatrix}
=
\begin{pmatrix}
y_1 \\ \vdots \\ y_n
\end{pmatrix}
\tag{1}
\]
Equation (1) can be paraphrased as: find the best set of regression coefficients β_j and obtain ŷ_i, the estimated response values. Sections 3.2 and 3.3 show how to obtain the coefficients β_j with LASSO and LARS, which are based on OLS, expressed as

\[
\beta = (X^{\top} X)^{-1} X^{\top} y \tag{2}
\]

However, the matrix X^⊤X may be singular, leading to the use of iterative algorithms [11, 12]. OLS has two main drawbacks: firstly, it violates the principle of parsimony, and secondly, it tends to derive over-fitted models. These reasons justify the use of variable selection methods.
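As a concrete illustration of Equation (2), a minimal sketch (not part of the original paper) is given below. It uses NumPy and a Moore-Penrose pseudo-inverse so that a singular X^⊤X does not break the computation; all names and the synthetic data are illustrative only.

```python
import numpy as np

def ols_coefficients(X, y):
    """Minimal OLS sketch for Equation (2): beta = (X^T X)^{-1} X^T y.

    The pseudo-inverse (np.linalg.pinv) keeps the estimate defined even
    when X^T X is singular, the situation mentioned in Section 3.1.
    """
    return np.linalg.pinv(X) @ y

# Hypothetical usage with synthetic data (names are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([0.5, 0.0, -0.3, 0.0, 0.2])
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(ols_coefficients(X, y))
```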
3.2. LASSO
LASSO [1] is a constrained version of OLS. It minimizes the residual sum of squares, subject to a constraint bounding the ℓ1-norm of the coefficients by a constant t, i.e., Σ_{j=1}^{p} |β_j| ≤ t. LASSO is defined by the following equation:

\[
\begin{aligned}
\underset{\beta_j}{\text{minimize}}\quad & \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} X_{ij}\,\beta_j \Big)^{2} \\
\text{subject to}\quad & \sum_{j=1}^{p} |\beta_j| \le t
\end{aligned} \tag{3}
\]
where t is a non-negative tuning parameter. An implementation, called LASSO-pure, using an optimization solver in python-scipy [13], was used to solve this model. This is not a true variable selection model, as it tends to find nonzero values for all the β_j's. If Equation (3) is solved with a soft-threshold, LASSO does shrinkage and automatic variable selection simultaneously: LASSO sets some variables to zero depending on the threshold. If LASSO is started with t = 0 and t is then increased, the coefficients returned by LASSO become nonzero one at a time. In contrast, LASSO-pure returns all the coefficients with a nonzero value. The tuning parameter t can be determined by cross-validation. Equivalently, (3) can be seen as a minimization problem where the objective function is penalized with a weighted ℓ1-norm of the coefficients β_j [14]:
\[
f(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} X_{ij}\,\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} |\beta_j| \tag{4}
\]
where λ is a nonnegative tuning parameter. Between the solutions of (3) and (4) there is a one-to-one correspondence between λ and the upper bound t. This formulation coincides with the soft thresholding in [15] and is known as "basis pursuit".
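The paper states that an optimization solver from python-scipy was used for LASSO-pure but does not give the code. Below is a minimal sketch of how the constrained problem (3) could be solved with scipy; the function name and the solver choice (SLSQP) are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def lasso_pure(X, y, t):
    """Sketch of the constrained LASSO of Equation (3).

    Minimizes the residual sum of squares subject to sum(|beta_j|) <= t
    using scipy's SLSQP solver.  Illustration of the 'LASSO-pure' idea
    only; not the authors' implementation.
    """
    _, p = X.shape
    rss = lambda beta: float(np.sum((y - X @ beta) ** 2))   # objective of (3)
    l1_bound = {"type": "ineq", "fun": lambda beta: t - np.sum(np.abs(beta))}
    result = minimize(rss, x0=np.zeros(p), method="SLSQP", constraints=[l1_bound])
    return result.x
```

Consistent with the discussion above, a generic numerical solver like this typically returns coefficients that are small but not exactly zero, which matches the observation that LASSO-pure keeps all coefficients nonzero; exact zeros come from the soft-threshold variant.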
3.3. Least Angle Regression
Least angle regression [2] is a model selection algorithm that can be viewed as a vector-based version of stagewise regression that accelerates the computations. Only p steps are required to reach the full (OLS) solution, where p is the number of covariates. LARS works as follows. It starts by setting all the coefficients β to zero and finds the covariate most correlated with the response. Then, it takes the largest possible step in the direction of this covariate until another covariate has as much correlation with the current residual. At this point, LARS proceeds in the equiangular direction between the two predictors until a third variable has as much correlation with the residual. The process continues until the k-th covariate is incorporated into the model, i.e., β_k. If k = p then the OLS model is obtained. The idea is to select k in order to obtain a simpler and more general model. In this contribution, a cross-validation technique was used to determine the number of covariates included in the model.
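As an illustration of stopping LARS after k covariates, a minimal sketch is given below. The paper does not say which LARS implementation was used; scikit-learn's Lars is assumed here purely for the sketch.

```python
from sklearn.linear_model import Lars

def lars_k_covariates(X, y, k):
    """Fit LARS and stop after k covariates have entered the active set.

    Setting n_nonzero_coefs=k limits the active set to k covariates;
    with k equal to the number of covariates p, the OLS solution is
    recovered, as described in Section 3.3.
    """
    model = Lars(n_nonzero_coefs=k, fit_intercept=False)
    model.fit(X, y)
    return model.coef_
```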
4. Variable selection for time series
A time series is a sequence of measurements taken at equally spaced time intervals. For instance, one can measure the wind speed daily or hourly, and the sequence of these measurements is a time series; more specifically, these are two examples of univariate time series.
One of the most traditional techniques used to forecast the next values of a time series is an auto-regressive (AR) model (see [16]). This model has the following form:
\[
y_t = \sum_{i=1}^{p} \beta_i \, y_{t-i} \tag{5}
\]
where y_t is the measurement of the series at time t, and {β_1, β_2, ..., β_p} are the auto-regression coefficients that need to be identified. In order to identify these coefficients, one can solve the system of equations shown in Equation (1), where the first column corresponds to the first n − p measurements of the time series, the second column goes from the second measurement to the (n − p + 1)-th measurement, and so on. The response vector is the time series from measurement p + 1 up to measurement n, where n is the length of the time series. A sketch of this construction is given below.
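The following minimal sketch builds the system of Equation (1) for p = 5, following the column layout described above; the variable names are illustrative and not taken from the paper.

```python
import numpy as np

def ar_design_matrix(series, p=5):
    """Embed a univariate time series into the linear system of Equation (1).

    Following Section 4: column 1 holds the first n - p measurements,
    column 2 the measurements 2 to n - p + 1, and so on, while the
    response vector holds the measurements p + 1 to n.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    X = np.column_stack([series[j:n - p + j] for j in range(p)])
    y = series[p:]
    return X, y

# Illustrative usage on a toy series (not one of the M1/M3 series).
toy = np.sin(np.linspace(0.0, 20.0, 120))
X, y = ar_design_matrix(toy, p=5)
print(X.shape, y.shape)  # (115, 5) (115,)
```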
After the time series has been converted to the system shown in Equation (1) one can use OLS to identify
the coefficients βi ; however, this process is likely to produce an AR model that presents overfitting [17].
In order to tackle this problem, one can use techniques such as LARS or LASSO that can select a subset
of covariates among all the possible ones. One only needs to decide the number of covariates to include in
the final model. In order to free the user from this delicate task, a cross-validation technique in the training
set was performed. Specifically, a 5-fold cross-validation (CV) is used (see [18]) that works as follows. Let
X and y be split into five matrices and vectors of equal size. Four of them are joined together and are used
to produce a model (i.e., these are used to identified βi using LARS or LASSO), while the remaining pair is
used to assess its generalization. That is, the remaining X and the identified coefficients are used to estimate
the response, namely Ŷ. This process is repeated 5 times, each time leaving out a different fifth of X and y.
At the end of this process, we have five Ŷ that together can be compared against y to assess the generality of
the model. This cross-validation procedure is perform for each possible value of the configuration parameter
of LARS and LASSO. That is, the parameter that sets the number of variables included in the model. This
process is performed with the aim of identifying the value of the parameter that corresponds to the best
generalization.
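The parameter search described above can be sketched as a generic 5-fold CV loop; scikit-learn's KFold is used here as an assumption (the paper does not name its CV implementation), and the fit callback would be LARS or LASSO evaluated at a candidate value of k or t.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_parameter_by_cv(X, y, fit, candidates, n_splits=5):
    """Choose the tuning parameter (k for LARS, t for LASSO) that gives
    the lowest 5-fold cross-validated MSE on the training data.

    `fit(X_train, y_train, value)` must return a coefficient vector.
    Generic sketch, not the authors' implementation.
    """
    kf = KFold(n_splits=n_splits)
    best_value, best_mse = None, np.inf
    for value in candidates:
        fold_errors = []
        for train_idx, test_idx in kf.split(X):
            beta = fit(X[train_idx], y[train_idx], value)
            pred = X[test_idx] @ beta
            fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
        mse = np.mean(fold_errors)
        if mse < best_mse:
            best_value, best_mse = value, mse
    return best_value

# Hypothetical usage with the helpers sketched earlier:
# best_k = select_parameter_by_cv(
#     X, y, lambda Xt, yt, k: lars_k_covariates(Xt, yt, k), candidates=range(1, 6))
```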
5. Results
In this section we present the results obtained when forecasting the 4,004 different time series. The algorithms compared are LARS, LASSO-pure, LASSO with soft-threshold, and OLS. We used 5-fold cross-validation and initialised auto-regressive equations of order 5 (i.e., p = 5). The 4,004 different time series have a training set and a validation set provided by the M1 and M3 time series competitions. We normalised these time series to have zero mean and unit standard deviation.
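The normalization mentioned above is a standard z-score; a short sketch is given below (the function name is illustrative only).

```python
import numpy as np

def normalise(series):
    """Scale a time series to zero mean and unit standard deviation."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()
```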
Figure 1 shows the number of coefficients different from zero that were selected using the cross-validation for each of the techniques tested. It can be seen that LASSO-pure has more nonzero coefficients than the other models, whereas the remaining algorithms have a similar number of nonzero coefficients. OLS is not depicted in the figure because OLS always sets all the coefficients to nonzero values for every time series tested.
The number of coefficients selected by each model is related to the quality of the forecast. Table 1 shows a comparison of the four models. The table presents the number of times each model obtained the lowest MSE in the validation set over the 4,004 different time series. The results show that LARS is the best model for these data.
On the other hand, Figure 2 complements the information presented in the table. Here, we compare the accuracy of the models on the 4,004 time series, in both the training and validation sets, using the performance of OLS as the baseline. That is, in each figure the red line corresponds to the error of OLS and the points are the errors of the other models.
Fig. 1. Number of non-zero coefficients for each time series in the training set.
Table 1. Comparison of the models for forecasting the 4,004 time series in terms of the smallest MSE.

    Model               Validation
    LARS                      1726
    LASSO pure                1020
    LASSO threshold            737
    OLS                        521
In order to facilitate the reading, the time series are sorted according to the error obtained by OLS.¹
The first column of Figure 2 shows the errors in the training set, and these are followed, row-wise, by the figures of the errors in the validation set. As can be seen from the figure, OLS is the best model in the training set: almost all the points lie above the line, which indicates that the other models have a higher training error. This may be an indication that OLS presents over-fitting. Moreover, over-fitting is usually present when the model is excessively complex, and in this case OLS uses all the covariates to produce the model.
Comparing LASSO, LASSO with soft-threshold, and LARS against OLS on the training set, we can see from the figure that LASSO is the most similar to OLS, then LARS, and the most different is LASSO with soft-threshold. On the other hand, looking at the validation set, we can see that LASSO with soft-threshold is the one that presents the largest number of points above the OLS line, the second is LASSO (this can be corroborated with Table 1), and the best is LARS.
Figure 2 also allows us to compare qualitatively the performance of the different techniques: for example, if two techniques are very similar then their figures should be nearly identical. Using this intuition, we can see that LASSO performs more similarly to LARS than to LASSO with soft-threshold. This behaviour is unexpected given that LASSO uses an optimization technique whereas LARS uses linear algebra.
The behaviour of LASSO compared against OLS on the training set caught our attention. As can be seen from Figure 2, the majority of the points lie on the OLS line, which indicates that both algorithms have a similar performance on the training set. Based on this, one would expect both algorithms to perform similarly in the validation set; however, as seen in Figure 2, LASSO makes better predictions in the validation set than OLS in almost all the time series tested, and in fact LASSO is the second best algorithm.
¹ This is the reason why the OLS error looks like a line.
Fig. 2. MSE obtained by the models when forecasting the 4,004 different time series with LARS, LASSO and OLS, in the training and validation sets. OLS is taken as the reference and its MSE is drawn as the red line. Panels: (a) LARS training, (b) LARS validation, (c) LASSO-threshold training, (d) LASSO-threshold validation, (e) LASSO-pure training, (f) LASSO-pure validation.
6. Conclusions
In this paper, we have experimentally compared OLS, LASSO, LASSO with soft-threshold, and LARS on the problem of time series forecasting. We used 4,004 time series taken from the M1 and M3 competitions. Our results show that LARS produces the most accurate predictions in the validation set. This is the result of selecting only the most important variables and of using cross-validation to identify the number of covariates to include. To our surprise, the second best algorithm is LASSO; however, strictly speaking, LASSO is not a variable selection algorithm, as it only sets a constraint on the sum of the absolute values of the coefficients. The third is the version of LASSO with a soft-threshold.
As future work, we will compare these techniques against traditional forecasting models such as the exponential smoothing state space model (ETS) (see [19]), the Auto-Regressive Integrated Moving Average model (ARIMA) (see [20]), and the exponential smoothing state space model with Box-Cox transformation, ARMA errors, trend and seasonal components (BATS) (see [21]), all of which are found in the forecast package of R [22].
References
[1] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 (1994)
267–288.
[2] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32 (2) (2004) 407–499. doi:10.1214/009053604000000067.
URL http://projecteuclid.org/euclid.aos/1083178935
[3] S. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, J. Newton, E. Parzen, R. Winkler, The accuracy of extrapolation (time series) methods: Results of a forecasting competition, Journal of Forecasting 1 (2) (1982) 111–153.
doi:10.1002/for.3980010202.
URL http://onlinelibrary.wiley.com/doi/10.1002/for.3980010202/abstract
[4] S. Makridakis, M. Hibon, The m3-competition: results, conclusions and implications, International Journal of Forecasting 16 (4)
(2000) 451–476. doi:10.1016/S0169-2070(00)00057-1.
URL http://www.sciencedirect.com/science/article/pii/S0169207000000571
[5] P. Drineas, M. W. Mahoney, On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning,
Journal of Machine Learning Research 6.
URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.1193
[6] N. S. Rao, R. D. Nowak, S. J. Wright, N. G. Kingsbury, Convex approaches to model wavelet sparsity patterns, CoRR
abs/1104.4385.
[7] A. E. Hoerl, R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
[8] T. Park, G. Casella, The bayesian lasso, Tech. rep. (2005).
[9] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B 67
(2005) 301–320.
[10] S. Weisberg, Discussion of "Least angle regression" by Efron et al., The Annals of Statistics (2004) 490–494.
[11] A. Ruhe, Rational Krylov algorithms for nonsymmetric eigenvalue problems, II: Matrix pairs (1992).
[12] S. L. Campbell, C. D. Meyer, Generalized Inverses of Linear Transformations, SIAM, 2009.
[13] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python (2001–).
URL http://www.scipy.org/
[14] B. M. Pötscher, H. Leeb, On the distribution of penalized maximum likelihood estimators: The lasso, scad, and thresholding, J.
Multivar. Anal. 100 (9) (2009) 2065–2082. doi:10.1016/j.jmva.2009.06.010.
URL http://dx.doi.org/10.1016/j.jmva.2009.06.010
[15] S. S. Chen, D. L. Donoho, M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing 20 (1998) 33–61.
[16] S. Makridakis, S. C. Wheelwright, R. J. Hyndman, Forecasting Methods and Applications, 3rd Edition, John Wiley & Sons, Inc., 1992.
[17] E. Alpaydin, Introduction to Machine Learning, 2nd Edition, The MIT Press, 2009.
[18] S. Arlot, A. Celisse, A survey of cross-validation procedures for model selection, Statistics Surveys 4 (2010) 40–79.
doi:10.1214/09-SS054.
[19] R. J. Hyndman, A. B. Koehler, R. D. Snyder, S. Grose, A state space framework for automatic forecasting using exponential
smoothing methods, International Journal of Forecasting 18 (3) (2002) 439–454. doi:10.1016/S0169-2070(01)00110-8.
URL http://www.sciencedirect.com/science/article/pii/S0169207001001108
[20] R. Shanmugam, Introduction to time series and forecasting, Technometrics 39 (4) (1997) 426.
URL http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1997.10485165
[21] A. M. De Livera, R. J. Hyndman, R. D. Snyder, Forecasting time series with complex seasonal patterns using exponential
smoothing, Journal of the American Statistical Association 106 (496) (2011) 1513–1527. doi:10.1198/jasa.2011.tm09771.
URL http://amstat.tandfonline.com/doi/abs/10.1198/jasa.2011.tm09771
[22] R. J. Hyndman, Y. Khandakar, Automatic time series forecasting: The forecast package for r, Journal of Statistical Software
27 (3) (2008) 1–22.
URL http://ideas.repec.org/a/jss/jstsof/27i03.html