
Procedia Technology 7 (2013) 282 – 288
2013 Iberoamerican Conference on Electronics Engineering and Computer Science
A Comparison between LARS and LASSO for Initialising the
Time-Series Forecasting Auto-Regressive Equations
Eric Iturbide, Jaime Cerda, Mario Graff∗
Division de Estudios de Postgrado, Facultad de Ingenieria Electrica Universidad Michoacana de San Nicolas de Hidalgo, Ciudad
Universitaria, C.P. 58004, Morelia, Mich., Mexico
Abstract
In this paper, the LASSO and LARS estimators are compared against OLS for fitting auto-regressive time series models. LASSO and LARS are two widely used methods for tackling the variable selection problem. To this end, we used 4,004 different time series taken from the M1 and M3 time series competitions. As expected, the experiments corroborate that LARS and LASSO derive models that outperform OLS models in terms of the mean square error. It is well known that LARS and LASSO behave similarly; however, the results obtained highlight their differences in terms of forecasting accuracy.
© 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of CIIECC 2013
Keywords: LASSO, LARS, OLS, AR(n) models, Time Series
1. Introduction
The variable selection problem arises in many practical situations when, out of a set of variables involved in a problem, a subset of variables which best represents the problem needs to be chosen. Methods which face this problem are known as variable selection methods. This paper performs a comparison among ordinary least squares (OLS), the least absolute shrinkage and selection operator (LASSO) [1] and least angle regression (LARS) [2]. These methods were applied to the prediction of 4,004 different time series taken from the M1 and M3 time series competitions (see [3, 4]) using a linear model. A 5-order auto-regressive model was used to initialise each of the time series forecasting models. The principal idea is to create accurate predictive models with interpretable or parsimonious representations. LASSO, proposed by Tibshirani [1], is a popular variable selection technique. It is an OLS-based method which places an ℓ1-norm penalty on the regression coefficients, and it tends to produce simpler models. LARS, proposed by Efron, Hastie, Johnstone and Tibshirani [2], is a method that can be viewed as a vector-based version of LASSO that accelerates the computations. LARS proceeds in the direction equiangular to the variables most correlated with the current residual. At each stage a variable is added to the active set, and this process can continue until all variables have been added.
∗ Corresponding author. Tel: + 52 (443) 3223500 ext. 4373
Email addresses: eiturbide@dep.fie.umich.mx (Eric Iturbide), jcerda@umich.mx (Jaime Cerda),
mgraffg@dep.fie.umich.mx (Mario Graff)
A characteristic of LARS is its computational efficiency: the algorithm requires the same order of computations as OLS. Both LARS and LASSO produce simpler models, which are therefore more easily interpretable. Selecting the best subset of variables is a fundamental task to avoid over-fitting, but finding the best subset is difficult, especially when there are many variables. To face this problem, the tuning parameters of LASSO and LARS have been chosen for each of the 4,004 different time series. The tuning parameter for LASSO is a constant t which constrains the ℓ1-norm of the coefficients, whereas for LARS it is k, the number of vectors which will appear in the resulting model. A 5-fold cross-validation process has been used; this is a popular method to estimate the forecasting error as well as to compare different models in machine learning.
This paper is organized as follows. In section 2, the work related to LASSO and LARS is reviewed. In section 3, the regression models are introduced. The formulation of OLS, LASSO and LARS for time series is described in section 4. The results are presented in section 5. Finally, section 6 closes this paper with some conclusions.
2. Related work
Currently, variable selection methods have been applied in several fields, e.g., pattern classification [5] and imaging studies [6]. Selecting the best subset of variables is not an easy task, and when the number of variables is large the task becomes computationally infeasible, because the total number of variable subsets grows exponentially. Methods such as backward selection and forward selection are used for subset selection; however, these kinds of greedy algorithms are known to have poor performance. OLS is a method that estimates all the regression coefficients to fit the data. However, since the interest is in obtaining the subset of coefficients which best represents the problem, OLS fails in this respect. Therefore, a review of the alternatives which are based on OLS is undertaken. The first model is ridge regression [7]. Ridge is the oldest penalty method and is not considered a variable selection method. It introduces a nonnegative constant that controls the trade-off between the model fit and the data, penalizing the ℓ2-norm of the coefficients. The quadratic ℓ2 penalty shrinks the coefficients towards zero, but they never become exactly zero. LASSO is a newer technique for variable selection. It uses the ℓ1-norm in the model fitting criterion: the sum of the absolute values of the coefficients is bounded by a positive constant t. When solved with a soft-threshold, LASSO shrinks the coefficients and selects variables simultaneously, setting some coefficients exactly to zero. Other related works are the Bayesian LASSO [8] and the ELASTIC NET [9], two methodologies used principally in economics. However, the best-known strategies for selecting the best subset of variables are stepwise methods, which build the model sequentially by including or excluding one variable at each step. The main method is LARS, which, combined with supervised learning techniques, is very powerful in terms of performance and accuracy. The Annals of Statistics dedicated 92 pages to this topic [2], and the paper was followed by a debate about the advantages and disadvantages of LARS [10].
3. Regression methods
This section describes the regression models used in this work: the linear model as well as two variable
selection methods are presented. Specifically, LASSO and LARS will be discussed.
3.1. The linear model
The linear regression model is commonly used for analysis of the relationships between a response and
explanatory variables. The input of the model has n instances of a process described by p variables. We
define the matrix X ∈ R^{n×p}, representing the p variables for n samples, and y ∈ R^n, representing the values of the response. The following equation shows this formulation:
\[
\begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix}
\beta_1 \\ \vdots \\ \beta_p
\end{pmatrix}
=
\begin{pmatrix}
y_1 \\ \vdots \\ y_n
\end{pmatrix}
\tag{1}
\]
Equation (1) can be paraphrased as: find the best set of regression coefficients β_j and obtain ŷ_i, the estimated response values. Sections 3.2 and 3.3 show how to obtain the coefficients β_j with LASSO and LARS, which are based on OLS, expressed as

\[
\beta = (X^{\top} X)^{-1} X^{\top} y \tag{2}
\]

However, the matrix X^⊤X may be singular, leading to the use of iterative algorithms [11, 12]. OLS has two main drawbacks: firstly, it violates the principle of parsimony, and secondly, it tends to derive over-fitted models. These reasons justify the use of variable selection methods.
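As a concrete illustration of Equation (2), a minimal sketch (not part of the original paper) is given below. It uses NumPy and a Moore-Penrose pseudo-inverse so that a singular X^⊤X does not break the computation; all names and the synthetic data are illustrative only.

```python
import numpy as np

def ols_coefficients(X, y):
    """Minimal OLS sketch for Equation (2): beta = (X^T X)^{-1} X^T y.

    The pseudo-inverse (np.linalg.pinv) keeps the estimate defined even
    when X^T X is singular, the situation mentioned in Section 3.1.
    """
    return np.linalg.pinv(X) @ y

# Hypothetical usage with synthetic data (names are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([0.5, 0.0, -0.3, 0.0, 0.2])
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(ols_coefficients(X, y))
```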
3.2. LASSO
LASSO [1] is a constrained version of OLS. It minimizes the residual sum of squares, subject to a constraint bounding the ℓ1-norm of the coefficients by a constant t, i.e., Σ_{j=1}^{p} |β_j| ≤ t. LASSO is defined by the following equation:

\[
\begin{aligned}
\underset{\beta_j}{\text{minimize}}\quad & \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} X_{ij}\,\beta_j \Big)^{2} \\
\text{subject to}\quad & \sum_{j=1}^{p} |\beta_j| \le t
\end{aligned} \tag{3}
\]
where t is a non-negative tuning parameter. An implementation, called LASSO-pure, using an optimization solver in python-scipy [13], was used to solve this model. This is not a true variable selection model, as it tends to find nonzero values for all the β_j's. If Equation (3) is solved with a soft-threshold, LASSO does shrinkage and automatic variable selection simultaneously: LASSO sets some variables to zero depending on the threshold. If LASSO is started with t = 0 and t is then increased, the coefficients returned by LASSO become nonzero one at a time. In contrast, LASSO-pure returns all the coefficients with a nonzero value. The tuning parameter t can be determined by cross-validation. Equivalently, (3) can be seen as a minimization problem where the objective function is penalized with a weighted ℓ1-norm of the coefficients β_j [14]:
\[
f(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} X_{ij}\,\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} |\beta_j| \tag{4}
\]
where λ is a nonnegative tuning parameter. Between the solutions of (3) and (4) there is a one-to-one correspondence between λ and the upper bound t. This formulation coincides with the soft thresholding in [15] and is known as "basis pursuit".
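The paper states that an optimization solver from python-scipy was used for LASSO-pure but does not give the code. Below is a minimal sketch of how the constrained problem (3) could be solved with scipy; the function name and the solver choice (SLSQP) are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def lasso_pure(X, y, t):
    """Sketch of the constrained LASSO of Equation (3).

    Minimizes the residual sum of squares subject to sum(|beta_j|) <= t
    using scipy's SLSQP solver.  Illustration of the 'LASSO-pure' idea
    only; not the authors' implementation.
    """
    _, p = X.shape
    rss = lambda beta: float(np.sum((y - X @ beta) ** 2))   # objective of (3)
    l1_bound = {"type": "ineq", "fun": lambda beta: t - np.sum(np.abs(beta))}
    result = minimize(rss, x0=np.zeros(p), method="SLSQP", constraints=[l1_bound])
    return result.x
```

Consistent with the discussion above, a generic numerical solver like this typically returns coefficients that are small but not exactly zero, which matches the observation that LASSO-pure keeps all coefficients nonzero; exact zeros come from the soft-threshold variant.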
3.3. Least Angle Regression
Least angle regression [2] is a model selection algorithm that can be viewed as a vector-based version of stagewise regression that accelerates the computations. Only p steps are required to reach the full (OLS) solution, where p is the number of covariates. LARS works as follows. It starts by setting all the coefficients β to zero and finds the covariate most correlated with the response. Then, it takes the largest possible step in the direction of this covariate until another covariate has as much correlation with the current residual. At this point, LARS proceeds in the equiangular direction between the two predictors until a third variable has as much correlation with the residual. The process continues until the k-th covariate is incorporated into the model, i.e., β_k. If k = p then the OLS model is obtained. The idea is to select k in order to obtain a simpler and more general model. In this contribution, a cross-validation technique was used to determine the number of covariates included in the model.
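As an illustration of stopping LARS after k covariates, a minimal sketch is given below. The paper does not say which LARS implementation was used; scikit-learn's Lars is assumed here purely for the sketch.

```python
from sklearn.linear_model import Lars

def lars_k_covariates(X, y, k):
    """Fit LARS and stop after k covariates have entered the active set.

    Setting n_nonzero_coefs=k limits the active set to k covariates;
    with k equal to the number of covariates p, the OLS solution is
    recovered, as described in Section 3.3.
    """
    model = Lars(n_nonzero_coefs=k, fit_intercept=False)
    model.fit(X, y)
    return model.coef_
```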
4. Variable selection for time series
A time series is a sequence of measurements taken at equally spaced time intervals. For instance, one can measure the wind speed daily or hourly, and the sequence of these measurements is a time series; more specifically, these are two examples of univariate time series.
One of the most traditional techniques used to forecast the next values of a time series is an auto-regressive (AR) model (see [16]). This model has the following form:
\[
y_t = \sum_{i=1}^{p} \beta_i \, y_{t-i} \tag{5}
\]
where y_t is the measurement of the series at time t, and {β_1, β_2, ..., β_p} are the auto-regression coefficients that need to be identified. In order to identify these coefficients, one can solve the system of equations shown in Equation (1), where the first column corresponds to the first n − p measurements of the time series, the second column goes from the second measurement to the (n − p + 1)-th measurement, and so on. The response vector is the time series from measurement p + 1 up to measurement n, where n is the length of the time series. A sketch of this construction is given below.
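The following minimal sketch builds the system of Equation (1) for p = 5, following the column layout described above; the variable names are illustrative and not taken from the paper.

```python
import numpy as np

def ar_design_matrix(series, p=5):
    """Embed a univariate time series into the linear system of Equation (1).

    Following Section 4: column 1 holds the first n - p measurements,
    column 2 the measurements 2 to n - p + 1, and so on, while the
    response vector holds the measurements p + 1 to n.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    X = np.column_stack([series[j:n - p + j] for j in range(p)])
    y = series[p:]
    return X, y

# Illustrative usage on a toy series (not one of the M1/M3 series).
toy = np.sin(np.linspace(0.0, 20.0, 120))
X, y = ar_design_matrix(toy, p=5)
print(X.shape, y.shape)  # (115, 5) (115,)
```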
After the time series has been converted to the system shown in Equation (1) one can use OLS to identify
the coefficients βi ; however, this process is likely to produce an AR model that presents overfitting [17].
In order to tackle this problem, one can use techniques such as LARS or LASSO that can select a subset
of covariates among all the possible ones. One only needs to decide the number of covariates to include in
the final model. In order to free the user from this delicate task, a cross-validation technique in the training
set was performed. Specifically, a 5-fold cross-validation (CV) is used (see [18]) that works as follows. Let
X and y be split into five matrices and vectors of equal size. Four of them are joined together and are used
to produce a model (i.e., these are used to identified βi using LARS or LASSO), while the remaining pair is
used to assess its generalization. That is, the remaining X and the identified coefficients are used to estimate
the response, namely Ŷ. This process is repeated 5 times, each time leaving out a different fifth of X and y.
At the end of this process, we have five Ŷ that together can be compared against y to assess the generality of
the model. This cross-validation procedure is perform for each possible value of the configuration parameter
of LARS and LASSO. That is, the parameter that sets the number of variables included in the model. This
process is performed with the aim of identifying the value of the parameter that corresponds to the best
generalization.
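The parameter search described above can be sketched as a generic 5-fold CV loop; scikit-learn's KFold is used here as an assumption (the paper does not name its CV implementation), and the fit callback would be LARS or LASSO evaluated at a candidate value of k or t.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_parameter_by_cv(X, y, fit, candidates, n_splits=5):
    """Choose the tuning parameter (k for LARS, t for LASSO) that gives
    the lowest 5-fold cross-validated MSE on the training data.

    `fit(X_train, y_train, value)` must return a coefficient vector.
    Generic sketch, not the authors' implementation.
    """
    kf = KFold(n_splits=n_splits)
    best_value, best_mse = None, np.inf
    for value in candidates:
        fold_errors = []
        for train_idx, test_idx in kf.split(X):
            beta = fit(X[train_idx], y[train_idx], value)
            pred = X[test_idx] @ beta
            fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
        mse = np.mean(fold_errors)
        if mse < best_mse:
            best_value, best_mse = value, mse
    return best_value

# Hypothetical usage with the helpers sketched earlier:
# best_k = select_parameter_by_cv(
#     X, y, lambda Xt, yt, k: lars_k_covariates(Xt, yt, k), candidates=range(1, 6))
```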
5. Results
In this section we present the results obtained when forecasting the 4,004 different time series. The algorithms compared are LARS, LASSO-pure, LASSO with soft-threshold, and OLS. We used 5-fold cross-validation and initialised auto-regressive equations of order 5 (i.e., p = 5). The 4,004 different time series have a training set and a validation set provided by the M1 and M3 time series competitions. We normalised these time series to have zero mean and unit standard deviation.
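The normalization mentioned above is a standard z-score; a short sketch is given below (the function name is illustrative only).

```python
import numpy as np

def normalise(series):
    """Scale a time series to zero mean and unit standard deviation."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()
```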
Figure 1 shows the number of coefficients different from zero that were selected using the cross-validation for each of the techniques tested. It can be seen that LASSO-pure has more nonzero coefficients than the other models, whereas the remaining algorithms have a similar number of nonzero coefficients. OLS is not depicted in the figure because OLS always sets all the coefficients to nonzero values for every time series tested.
The number of coefficients selected by each model is related to the quality of the forecast. Table 1 shows a comparison of the four models. The table presents the number of times each model obtained the lowest MSE in the validation set over the 4,004 different time series. The results show that LARS is the best model for these data.
On the other hand, Figure 2 complements the information presented in the table. Here, we compare the accuracy of the models on the 4,004 time series, in both the training and validation sets, using the performance of OLS as the baseline. That is, in each figure the red line corresponds to the error of OLS and the points are the errors of the other models.
Fig. 1. Number of non-zero coefficients for each time series in the training set.
Table 1. Comparison of the models for forecasting the 4,004 time series in terms of the smallest MSE.

    Model               Validation
    LARS                      1726
    LASSO pure                1020
    LASSO threshold            737
    OLS                        521
In order to facilitate the reading, the time series are sorted according to the error obtained by OLS.¹
The first column of Figure 2 shows the errors in the training set, and these are followed, row-wise, by the figures of the errors in the validation set. As can be seen from the figure, OLS is the best model in the training set: almost all the points lie above the line, which indicates that the other models have a higher training error. This may be an indication that OLS presents over-fitting. Moreover, over-fitting is usually present when the model is excessively complex, and in this case OLS uses all the covariates to produce the model.
Comparing LASSO, LASSO with soft-threshold, and LARS against OLS on the training set, we can see from the figure that LASSO is the most similar to OLS, then LARS, and the most different is LASSO with soft-threshold. On the other hand, looking at the validation set, we can see that LASSO with soft-threshold is the one that presents the largest number of points above the OLS line, the second is LASSO (this can be corroborated with Table 1), and the best is LARS.
Figure 2 also allows us to compare qualitatively the performance of the different techniques: for example, if two techniques are very similar then their figures should be nearly identical. Using this intuition, we can see that LASSO performs more similarly to LARS than to LASSO with soft-threshold. This behaviour is unexpected given that LASSO uses an optimization technique whereas LARS uses linear algebra.
The behaviour of LASSO compared against OLS on the training set caught our attention. As can be seen from Figure 2, the majority of the points lie on the OLS line, which indicates that both algorithms have a similar performance on the training set. Based on this, one would expect both algorithms to perform similarly in the validation set; however, as seen in Figure 2, LASSO makes better predictions in the validation set than OLS in almost all the time series tested, and in fact LASSO is the second best algorithm.
¹ This is the reason why the OLS error looks like a line.
Fig. 2. MSE obtained by the models when forecasting the 4,004 different time series with LARS, LASSO and OLS, in the training and validation sets. OLS is taken as the reference and its MSE is drawn as the red line. Panels: (a) LARS training, (b) LARS validation, (c) LASSO-threshold training, (d) LASSO-threshold validation, (e) LASSO-pure training, (f) LASSO-pure validation.
6. Conclusions
In this paper, we have experimentally compared OLS, LASSO, LASSO with soft-threshold, and LARS on the problem of time series forecasting. We used 4,004 time series taken from the M1 and M3 competitions. Our results show that LARS produces the most accurate predictions in the validation set. This is the result of selecting only the most important variables and of using cross-validation to identify the number of covariates to include. To our surprise, the second best algorithm is LASSO; however, strictly speaking, LASSO is not a variable selection algorithm, as it only sets a constraint on the sum of the absolute values of the coefficients. The third is the version of LASSO with a soft-threshold.
As future work, we will compare these techniques against traditional forecasting models such as the exponential smoothing state space model (ETS) (see [19]), the Auto-Regressive Integrated Moving Average model (ARIMA) (see [20]), and the exponential smoothing state space model with Box-Cox transformation, ARMA errors, trend and seasonal components (BATS) (see [21]), all of which are found in the forecast package of R [22].
References
[1] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 (1994)
267–288.
[2] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32 (2) (2004) 407–499. doi:10.1214/009053604000000067.
URL http://projecteuclid.org/euclid.aos/1083178935
[3] S. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, J. Newton, E. Parzen, R. Winkler, The accuracy of extrapolation (time series) methods: Results of a forecasting competition, Journal of Forecasting 1 (2) (1982) 111–153.
doi:10.1002/for.3980010202.
URL http://onlinelibrary.wiley.com/doi/10.1002/for.3980010202/abstract
[4] S. Makridakis, M. Hibon, The m3-competition: results, conclusions and implications, International Journal of Forecasting 16 (4)
(2000) 451–476. doi:10.1016/S0169-2070(00)00057-1.
URL http://www.sciencedirect.com/science/article/pii/S0169207000000571
[5] P. Drineas, M. W. Mahoney, On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning,
Journal of Machine Learning Research 6.
URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.1193
[6] N. S. Rao, R. D. Nowak, S. J. Wright, N. G. Kingsbury, Convex approaches to model wavelet sparsity patterns, CoRR
abs/1104.4385.
[7] A. E. Hoerl, R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
[8] T. Park, G. Casella, The bayesian lasso, Tech. rep. (2005).
[9] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B 67
(2005) 301–320.
[10] S. Weisberg, Discussion of "Least angle regression" by Efron et al., The Annals of Statistics (2004) 490–494.
[11] A. Ruhe, Rational Krylov algorithms for nonsymmetric eigenvalue problems, II: Matrix pairs (1992).
[12] S. L. Campbell, C. D. Meyer, Generalized Inverses of Linear Transformations, SIAM, 2009.
[13] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python (2001–).
URL http://www.scipy.org/
[14] B. M. Pötscher, H. Leeb, On the distribution of penalized maximum likelihood estimators: The lasso, scad, and thresholding, J.
Multivar. Anal. 100 (9) (2009) 2065–2082. doi:10.1016/j.jmva.2009.06.010.
URL http://dx.doi.org/10.1016/j.jmva.2009.06.010
[15] S. S. Chen, D. L. Donoho, M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing 20 (1998) 33–61.
[16] S. Makridakis, S. C. Wheelwright, R. J. Hyndman, Forecasting Methods and Applications, 3rd Edition, John Wiley & Sons, Inc., 1992.
[17] E. Alpaydin, Introduction to Machine Learning, 2nd Edition, The MIT Press, 2009.
[18] S. Arlot, A. Celisse, A survey of cross-validation procedures for model selection, Statistics Surveys 4 (2010) 40–79.
doi:10.1214/09-SS054.
[19] R. J. Hyndman, A. B. Koehler, R. D. Snyder, S. Grose, A state space framework for automatic forecasting using exponential
smoothing methods, International Journal of Forecasting 18 (3) (2002) 439–454. doi:10.1016/S0169-2070(01)00110-8.
URL http://www.sciencedirect.com/science/article/pii/S0169207001001108
[20] R. Shanmugam, Introduction to time series and forecasting, Technometrics 39 (4) (1997) 426.
URL http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1997.10485165
[21] A. M. De Livera, R. J. Hyndman, R. D. Snyder, Forecasting time series with complex seasonal patterns using exponential
smoothing, Journal of the American Statistical Association 106 (496) (2011) 1513–1527. doi:10.1198/jasa.2011.tm09771.
URL http://amstat.tandfonline.com/doi/abs/10.1198/jasa.2011.tm09771
[22] R. J. Hyndman, Y. Khandakar, Automatic time series forecasting: The forecast package for r, Journal of Statistical Software
27 (3) (2008) 1–22.
URL http://ideas.repec.org/a/jss/jstsof/27i03.html