Machine Learning and Big Data Analytics Section 5
Emily Mower
October 2018
Note that the material in these notes draws on the excellent and more thorough treatment
of these topics in Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor
Hastie and Robert Tibshirani.
1 Lasso and Ridge Regression
1.1 Concept
Lasso and Ridge Regression both fall under a broader class of models called shrinkage models.
Shrinkage models regularize or penalize coefficient estimates, which means that they shrink the
coefficients toward zero. Shrinking the coefficient estimates can reduce their variance, so these
methods are particularly useful for models where we are concerned about high variance (i.e. overfitting), such as models with a large number of predictors relative to the number of observations.
Both Lasso and Ridge regression operate by adding a penalty to the normal OLS (Ordinary
Least Squares) minimization problem. These penalties can be thought of as a budget, and sometimes the minimization problem is explicitly formulated as minimizing the Residual Sum of Squares
(RSS) subject to a budget constraint. The idea is that with a small budget, you can only afford a
little "total" β, where the definition of "total" β varies between Lasso and Ridge, but as your budget
increases, you get closer and closer to the OLS βs.
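One common way to make the budget view explicit is the constrained form below, where s ≥ 0 plays the role of the budget (this form is added here for reference, written in the same notation as equations (1) and (2) below):

\[
\text{Lasso:} \quad \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
\]

\[
\text{Ridge:} \quad \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s
\]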
Lasso and Ridge regression differ in exactly how they penalize the coefficients. In particular,
Lasso penalizes the coefficients using an L1 penalty:
\[
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \tag{1}
\]

Ridge regression penalizes the coefficients using an L2 penalty:

\[
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{2}
\]
You may notice that both the Lasso and Ridge Regression minimization problems feature a parameter λ. This is a tuning parameter that dictates how much we will penalize the “total” coefficients.
Tuning parameters appear in many models and get their name from the fact that we tune (adjust
/ choose) them in order to improve our model’s performance.
Since Lasso and Ridge Regression are used when we are concerned about variance (over-fitting),
it should come as no surprise that increasing λ decreases variance. It can be helpful to contextualize
the size of λ by noting that λ = 0 recovers OLS, while as λ → ∞ every coefficient except β0 is shrunk
to zero.
As we increase λ from zero, we decrease variance but increase bias (the classic bias-variance
tradeoff). To determine the best λ (e.g. the λ that will lead to the best out of sample predictive
performance), it is common to use cross validation. A good function in R that trains Lasso and
Ridge Regression models and uses cross-validation to select a good λ is cv.glmnet(), which is
part of the glmnet package.
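As a rough sketch of how this looks in R (the simulated data and variable names below are illustrative, not from class):

# Illustrative sketch: fit Lasso and Ridge with a cross-validated lambda.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), nrow = n)        # predictors
y <- 3 * x[, 1] + 1.5 * x[, 2] + rnorm(n)  # outcome

# alpha = 1 is Lasso (L1 penalty); alpha = 0 is Ridge (L2 penalty)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
cv_ridge <- cv.glmnet(x, y, alpha = 0)

cv_lasso$lambda.min                 # lambda with the lowest cross-validated error
coef(cv_lasso, s = "lambda.min")    # Lasso sets many coefficients exactly to zero
coef(cv_ridge, s = "lambda.min")    # Ridge shrinks coefficients but keeps them non-zero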
1.2 Implementation and Considerations
Lasso and Ridge Regression are useful models to use when dealing with a large number of predictors
p relative to the number of observations n. It is good to standardize your features so that coefficients
are not shrunk more (or less) heavily simply because of the scale on which a variable is measured,
rather than its relative importance.
For example, suppose you were predicting salary, and suppose you had a standardized test score
that was highly predictive of salary and also parents' income, which was only somewhat predictive
of salary. Since standardized test scores are measured on a much smaller scale than both the outcome
and parents' income, we would expect the coefficient on test scores to be large relative to the
coefficient on parents' income. This means it would likely be shrunk heavily once the penalty is
added, even though it reflects a strong predictive relationship.
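A minimal sketch of standardizing features before fitting, continuing the hypothetical salary example (the numbers are made up; note also that glmnet standardizes internally by default via its standardize argument):

# Hypothetical example: put test scores and parents' income on a common scale
# before applying a penalized regression.
test_score     <- c(1200, 1350, 1100, 1480, 1290)        # test-score scale
parents_income <- c(60000, 85000, 40000, 120000, 72000)  # dollars

x <- cbind(test_score, parents_income)
x_std <- scale(x)       # each column now has mean 0 and standard deviation 1

# After scaling, coefficient sizes reflect predictive strength per standard
# deviation of the predictor, not the units the predictor happens to use.
colMeans(x_std)         # approximately 0
apply(x_std, 2, sd)     # exactly 1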
1.3 Interpreting Coefficients
Coefficients produced by Lasso and Ridge Regression should not be interpreted causally. These
methods are used for prediction, and as such, our focus is on ŷ. This is in contrast to an inference
problem, where we would be interested in β̂.
The intuition behind why we cannot interpret the coefficients causally is similar to the intuition
underlying Omitted Variables Bias (OVB). In OVB, we said that if two variables X1 and X2 were
correlated with each other and with the outcome Y , then the coefficient on X1 in a regression of
Y on X1 would differ from the coefficient on X1 in a regression where X2 was also included. This
is because since X1 and X2 are correlated, when we omit X2 , the coefficient on X1 picks up the
effect of X1 on Y as well as some of the effect of X2 on Y .
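A small simulation illustrating this OVB intuition (the coefficient values and correlation below are made up for the example):

# X1 and X2 are correlated, and both affect Y.
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)   # correlated with x1
y  <- 2 * x1 + 3 * x2 + rnorm(n)

coef(lm(y ~ x1))        # slope on x1 is biased upward (about 2 + 3 * 0.7 = 4.1)
coef(lm(y ~ x1 + x2))   # slope on x1 is now close to its true value of 2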
In penalized regressions like Lasso and Ridge, variables that are correlated with many predictors
may be favored. This is because you can get a big predictive “bang for your buck” by putting a
significant coefficient on that one predictor rather than on each of the predictors it is correlated
with. However, it is important to remember that the resulting coefficients do not reflect the causal
relationship between the variables and the outcome.
An interesting feature we saw in class for Ridge Regression is that as the penalty increases, the
total size of the coefficients (defined as \(\sum_{j=1}^{p} \beta_j^2\)) decreases but the size of an individual coefficient
may increase. To understand why this is, consider two OLS (no penalty) coefficients that equal 6
and 1, respectively. As we increase λ in ridge regression, decreasing the first coefficient by 2 would
reduce the penalty by 62 − 42 = 20 whereas increasing the smaller coefficient by 2 would increase
the penalty by 32 − 12 = 8. Therefore, if the two variables are correlated, it can be helpful to have
the variable with the smaller OLS coefficient pick up some of the effect of the variable with the
larger OLS coefficient since the penalty will be significantly smaller. Note that this is not true for
Lasso, where the decrease in penalty of moving from 6 to 4 is the same as the increase in penalty
of moving from 1 to 3.
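A quick numerical check of this penalty arithmetic, written out in R:

# Ridge (L2) penalty: shifting weight from the large coefficient to the small
# one lowers the total penalty.
6^2 - 4^2        # 20: penalty saved by moving the large coefficient from 6 to 4
3^2 - 1^2        #  8: penalty added by moving the small coefficient from 1 to 3

# Lasso (L1) penalty: the same shift is penalty-neutral.
abs(6) - abs(4)  # 2: penalty saved
abs(3) - abs(1)  # 2: penalty added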
2 Bootstrapping
Bootstrapping involves resampling your data with replacement. It is a very useful statistical tool
that allows you to quantify the uncertainty of an estimator, such as a coefficient. This is especially
useful for models where there is not a convenient standard error formula. When your goal is to
quantify the uncertainty of an estimate, you resample your data with replacement and recompute
the estimate many times (e.g. 1,000 times). Then, you look at the distribution of those estimates
and take the middle 95% as your 95% confidence interval.
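A minimal sketch of a bootstrap confidence interval for a regression coefficient (the simulated data and the choice of 1,000 replications are illustrative):

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)               # true slope is 2
dat <- data.frame(x = x, y = y)

B <- 1000
boot_slopes <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)  # resample rows with replacement
  fit <- lm(y ~ x, data = dat[idx, ])
  boot_slopes[b] <- coef(fit)["x"]
}

# Middle 95% of the bootstrap distribution as the 95% confidence interval
quantile(boot_slopes, c(0.025, 0.975))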
Bootstrapping is also an important component of some very powerful machine learning models.
One that we will learn about in the next few weeks is the Random Forest. A Random Forest
resamples your data many times, trains a different model on each resample, and then averages
across all of the resulting models. We will discuss the exact mechanics in a few weeks when we
cover Random Forests.