Machine Learning and Big Data Analytics
Section 5
Emily Mower
October 2018

Note that the material in these notes draws on the excellent and more thorough treatment of these topics in Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

1 Lasso and Ridge Regression

1.1 Concept

Lasso and Ridge Regression both fall under a broader class of models called shrinkage models. Shrinkage models regularize or penalize coefficient estimates, which means that they shrink the coefficients toward zero. Shrinking the coefficient estimates can reduce their variance, so these methods are particularly useful when we are concerned about high variance (i.e. overfitting), such as in models with a large number of predictors relative to the number of observations.

Both Lasso and Ridge Regression operate by adding a penalty to the usual OLS (Ordinary Least Squares) minimization problem. These penalties can be thought of as a budget, and sometimes the minimization problem is explicitly formulated as minimizing the Residual Sum of Squares (RSS) subject to a budget constraint. The idea is that if you have a small budget, you can only afford a little "total" β, where the definition of "total" β varies between Lasso and Ridge, but as your budget increases, you get closer and closer to the OLS βs.

Lasso and Ridge Regression differ in exactly how they penalize the coefficients. In particular, Lasso penalizes the coefficients using an L1 penalty:

    \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|    (1)

Ridge Regression penalizes the coefficients using an L2 penalty:

    \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2    (2)

You may notice that both the Lasso and Ridge Regression minimization problems feature a parameter λ. This is a tuning parameter that dictates how much we penalize the "total" coefficients. Tuning parameters appear in many models and get their name from the fact that we tune (adjust / choose) them in order to improve our model's performance. Since Lasso and Ridge Regression are used when we are concerned about variance (overfitting), it should come as no surprise that increasing λ decreases variance. It can be helpful to contextualize the size of λ by noting that λ = 0 gives OLS, while λ = ∞ results in only β0 being assigned a non-zero value. As we increase λ from zero, we decrease variance but increase bias (the classic bias-variance tradeoff).

To determine the best λ (i.e. the λ that will lead to the best out-of-sample predictive performance), it is common to use cross-validation. A good function in R that trains Lasso and Ridge Regression models and uses cross-validation to select a good λ is cv.glmnet(), which is part of the glmnet package.

1.2 Implementation and Considerations

Lasso and Ridge Regression are useful models when dealing with a large number of predictors p relative to the number of observations n. It is good to standardize your features so that coefficients are not penalized because of the scale of their variables rather than their relative importance. For example, suppose you were predicting salary, and suppose you had a standardized test score that was highly predictive of salary as well as parents' income, which was only somewhat predictive of salary. Since standardized test scores are measured on a much smaller scale than the outcome and than parents' income, we would expect the coefficient on test scores to be large relative to the coefficient on parents' income. Without standardization, the test-score coefficient would therefore likely be shrunk heavily by the penalty, even though it reflects a strong predictive relationship.
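As a concrete illustration, here is a minimal sketch, on simulated, made-up data, of fitting both models with cv.glmnet() and letting cross-validation choose λ. In the glmnet package, alpha = 1 corresponds to the Lasso (L1) penalty and alpha = 0 to the Ridge (L2) penalty, and predictors are standardized by default (standardize = TRUE); the variable names and data-generating process below are purely illustrative.

```r
# Minimal sketch: Lasso and Ridge with cross-validated lambda (simulated data)
library(glmnet)

set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)            # predictor matrix (glmnet expects a matrix)
y <- 3 * x[, 1] + 1.5 * x[, 2] + rnorm(n)  # outcome: only two predictors truly matter

# alpha = 1 is the L1 (Lasso) penalty; alpha = 0 is the L2 (Ridge) penalty.
cv_lasso <- cv.glmnet(x, y, alpha = 1)
cv_ridge <- cv.glmnet(x, y, alpha = 0)

# lambda.min is the lambda with the lowest cross-validated error.
cv_lasso$lambda.min
coef(cv_lasso, s = "lambda.min")  # Lasso: many coefficients are exactly zero
coef(cv_ridge, s = "lambda.min")  # Ridge: coefficients shrunk toward, but not to, zero
```

Calling plot() on a fitted cv.glmnet object shows the cross-validated error across the grid of λ values, which is a useful check on how the penalty was chosen.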
1.3 Interpreting Coefficients

Coefficients produced by Lasso and Ridge Regression should not be interpreted causally. These methods are used for prediction, and as such, our focus is on ŷ. This is in contrast to an inference problem, where we would be interested in β̂.

The intuition behind why we cannot interpret the coefficients causally is similar to the intuition underlying Omitted Variables Bias (OVB). In OVB, we said that if two variables X1 and X2 were correlated with each other and with the outcome Y, then the coefficient on X1 in a regression of Y on X1 alone would differ from the coefficient on X1 in a regression where X2 was also included. Because X1 and X2 are correlated, when we omit X2 the coefficient on X1 picks up the effect of X1 on Y as well as some of the effect of X2 on Y. In penalized regressions like Lasso and Ridge, variables that are correlated with many other predictors may be favored, because you can get a big predictive "bang for your buck" by putting a significant coefficient on that one predictor rather than on each of the predictors it is correlated with. However, it is important to remember that the resulting coefficients do not reflect the causal relationship between the variables and the outcome.

An interesting feature of Ridge Regression that we saw in class is that as the penalty increases, the total size of the coefficients (defined as \sum_{j=1}^{p} \beta_j^2) decreases, but the size of an individual coefficient may increase. To understand why, consider two OLS (no penalty) coefficients that equal 6 and 1, respectively. As we increase λ in Ridge Regression, decreasing the first coefficient by 2 reduces the penalty term by 6^2 − 4^2 = 20, whereas increasing the smaller coefficient by 2 increases it by only 3^2 − 1^2 = 8. Therefore, if the two variables are correlated, it can be helpful to have the variable with the smaller OLS coefficient pick up some of the effect of the variable with the larger OLS coefficient, since the penalty will be significantly smaller. Note that this is not true for Lasso, where the decrease in penalty from moving a coefficient from 6 to 4 is the same as the increase in penalty from moving another from 1 to 3.

2 Bootstrapping

Bootstrapping involves resampling your data with replacement. It is a very useful statistical tool that allows you to quantify the uncertainty of an estimator, such as a coefficient, and it is especially useful for models where there is no convenient standard error formula. When your goal is to estimate the uncertainty of an estimate, you bootstrap your data and compute the estimate many times (e.g. 1,000 times). Then, you look at the distribution of your estimates and take the middle 95% as your 95% confidence interval.

Bootstrapping is also an important component of some very powerful machine learning models. One that we will learn about in the next few weeks is Random Forest. Random Forest resamples your data many times, each time training a different model, and then averages across the models trained on all of your resampled data. We will discuss the exact mechanics in a few weeks when we cover Random Forest.
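To make the confidence-interval procedure concrete, here is a minimal percentile-bootstrap sketch on simulated, made-up data: resample the rows with replacement, re-estimate a regression coefficient on each resample, and take the 2.5th and 97.5th percentiles of the bootstrap estimates as a 95% confidence interval. The data and the choice of 1,000 replications are illustrative.

```r
# Minimal sketch: percentile bootstrap for a regression coefficient (simulated data)
set.seed(1)
n <- 100
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n)

B <- 1000                                  # number of bootstrap resamples
boot_coefs <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)         # resample row indices with replacement
  boot_fit <- lm(y ~ x, data = df[idx, ])  # re-estimate on the resampled data
  boot_coefs[b] <- coef(boot_fit)["x"]     # store the coefficient of interest
}

# The middle 95% of the bootstrap distribution serves as a 95% confidence interval.
quantile(boot_coefs, c(0.025, 0.975))
```

In this simple example the interval should come out close to the one implied by lm()'s usual standard errors; the value of the bootstrap is that the same recipe applies to estimators for which no convenient standard error formula exists.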