Ridge regression and Bayesian linear regression
Kenneth D. Harris
6/5/15
Multiple linear regression
• What are you predicting? Data type: Continuous. Dimensionality: 1.
• What are you predicting it from? Data type: Continuous. Dimensionality: p.
• How many data points do you have? Enough.
• What sort of prediction do you need? Single best guess.
• What sort of relationship can you assume? Linear.
Multiple linear regression
• What are you predicting? Data type: Continuous. Dimensionality: 1.
• What are you predicting it from? Data type: Continuous. Dimensionality: p.
• How many data points do you have? Not enough.
• What sort of prediction do you need? Single best guess.
• What sort of relationship can you assume? Linear.
Multiple predictors, one predicted variable
$\hat{y}_i = \mathbf{x}_i \cdot \mathbf{w}$, i.e. $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$
• Choose $\mathbf{w}$ to minimize the sum-squared error:
$L = \frac{1}{2} \sum_i (\hat{y}_i - y_i)^2 = \frac{1}{2} \sum_i (\mathbf{w} \cdot \mathbf{x}_i - y_i)^2$
• Optimal weight vector: $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{X} \backslash \mathbf{y}$ (in MATLAB)
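A minimal MATLAB/Octave sketch of this least-squares fit, using simulated data (the variable names, sizes and noise level are illustrative assumptions, not from the slides):

    N = 100; p = 5;                  % illustrative numbers of data points and predictors
    X = randn(N, p);                 % design matrix, one row per data point
    w_true = randn(p, 1);            % "true" weights used to simulate y
    y = X * w_true + 0.1 * randn(N, 1);
    w_normal = (X' * X) \ (X' * y);  % optimal weights via the normal equations
    w_backslash = X \ y;             % same fit via MATLAB's backslash, as on the slide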
Too many predictors
• If 𝑁 = 𝑝, you can fit the training data perfectly
• 𝐲 = 𝐗𝐰 is 𝑁 equations in 𝑁 unknowns
• If 𝑁 < 𝑝, the solution is underconstrained ($\mathbf{X}^T \mathbf{X}$ is not invertible)
• But even if 𝑁 > 𝑝, you can have problems with too many predictors
[Figure: simulated example with 𝑁 = 40, 𝑝 = 30; panels show fits for 𝑦 = 𝑥₁ and for 𝑦 = 𝑥₁ + noise.]
Geometric interpretation
[Figure: the target 𝐲 shown relative to the predictors 𝐱₁ and 𝐱₂, with its signal and noise components.]
The target can be fit exactly by giving a massive positive weight to 𝐱₂ and a massive negative weight to 𝐱₁. It would be better to just fit 𝐲 = 𝐱₁.
Overfitting = large weight vectors
• Solution: weight vector penalty
$L = \frac{1}{2} \sum_i (\hat{y}_i - y_i)^2 + \frac{1}{2} \lambda \|\mathbf{w}\|^2$
• Optimal weight vector: $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$
The inverse can always be taken, even for 𝑁 < 𝑝.
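A sketch of this closed-form ridge solution in MATLAB/Octave, continuing the illustrative setup above (the value of lambda is an arbitrary choice; in practice it would be set by cross-validation):

    lambda = 3;                                        % ridge penalty (illustrative)
    w_ridge = (X' * X + lambda * eye(p)) \ (X' * y);   % ridge weights
    % For lambda > 0 the matrix X'X + lambda*I is positive definite,
    % so the solve works even when N < p.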
Example
[Figure: example fits with 𝜆 = 0 and 𝜆 = 3.]
Ridge regression introduces a bias
[Figure: fits with 𝜆 = 0 and 𝜆 = 50, illustrating the bias introduced by the penalty.]
A quick trick to do ridge regression
• Ordinary linear regression: $\mathbf{w} = \mathbf{X} \backslash \mathbf{y}$ minimizes $\sum_i (y_i - \mathbf{x}_i \cdot \mathbf{w})^2$. Define
$\mathbf{X}' = \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda}\, \mathbf{I}_p \end{pmatrix}, \qquad \mathbf{y}' = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix}$
• Then $\mathbf{w} = \mathbf{X}' \backslash \mathbf{y}'$ is the solution to ridge regression. (Why?)
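The trick as a MATLAB/Octave sketch, continuing the setup above; note the sqrt(lambda) scaling of the identity block, which makes the augmented sum of squares equal the ridge objective:

    X_aug = [X; sqrt(lambda) * eye(p)];   % stack sqrt(lambda)*I below X
    y_aug = [y; zeros(p, 1)];             % pad y with p zeros
    w_trick = X_aug \ y_aug;              % equals w_ridge up to numerical error
    % Why: sum((y_aug - X_aug*w).^2) = sum((y - X*w).^2) + lambda*sum(w.^2),
    % which is the ridge loss.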
Regression as a probability model
• What are you predicting? Data type: Continuous. Dimensionality: 1.
• What are you predicting it from? Data type: Continuous. Dimensionality: p.
• How many data points do you have? Enough.
• What sort of prediction do you need? Probability distribution.
• What sort of relationship can you assume? Linear.
Regression as a probability model
• Assume 𝑦𝑖 is random, but 𝐱𝐢 and 𝐰 are just numbers.
$y_i \sim N(\mathbf{w} \cdot \mathbf{x}_i, \sigma^2)$
Then the likelihood is
$\log p(y_i) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y_i - \mathbf{w} \cdot \mathbf{x}_i)^2}{2\sigma^2}$
Maximum likelihood is the same as a least-squares fit.
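A quick numerical illustration of this equivalence, continuing the sketch above (sigma is an assumed, known noise level): the log-likelihood is a constant minus the sum of squared errors divided by 2·sigma², so the weights that maximize it are the least-squares weights.

    sigma = 0.1;                         % assumed noise standard deviation
    resid = y - X * w_backslash;         % residuals at the least-squares fit
    loglik = -N/2 * log(2*pi*sigma^2) - sum(resid.^2) / (2*sigma^2);
    % Only sum(resid.^2) depends on w, so maximizing the likelihood
    % and minimizing the sum-squared error give the same weights.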
Bayesian linear regression
• Now consider 𝐰 to also be random with prior distribution:
$\mathbf{w} \sim N(\mathbf{0}, \sigma^2 \lambda^{-1} \mathbf{I})$
The posterior distribution is
$p(\mathbf{w}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y})}$
Bayesian linear regression
$\log p(\mathbf{y}|\mathbf{w}) = -\frac{N}{2} \log(2\pi\sigma^2) - \sum_i \frac{(y_i - \mathbf{w} \cdot \mathbf{x}_i)^2}{2\sigma^2}$
$\log p(\mathbf{w}) = -\frac{p}{2} \log(2\pi\sigma^2\lambda^{-1}) - \frac{\|\mathbf{w}\|^2}{2\sigma^2\lambda^{-1}}$
This is all quadratic in $\mathbf{w}$, so $\mathbf{w}$ is Gaussian distributed.
Bayesian linear regression
$\mathbf{w} \sim N(\mathbf{w}_0, \mathbf{S})$
$\mathbf{w}_0 = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$
$\mathbf{S} = \sigma^2 (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1}$
Mean of $\mathbf{w}$ is exactly the same as in ridge regression. But we also get a covariance matrix for $\mathbf{w}$.
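A sketch of the posterior mean and covariance in MATLAB/Octave, continuing the illustrative setup (lambda and sigma as assumed above; in practice they are hyperparameters to be chosen or estimated):

    A  = X' * X + lambda * eye(p);   % X'X + lambda*I
    w0 = A \ (X' * y);               % posterior mean = ridge estimate
    S  = sigma^2 * inv(A);           % posterior covariance of w (explicit inverse kept for clarity)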
Bayesian predictions
• Given a training set 𝐲, 𝐗, and a new value 𝐱𝑡 . Assume 𝐲 is random but 𝐗, 𝐱𝑡 are
fixed.
• To make a prediction of 𝑦𝑡 , integrate over all possible 𝐰:
$p(y_t|\mathbf{y}) = \int p(y_t|\mathbf{w})\, p(\mathbf{w}|\mathbf{y})\, d\mathbf{w} \sim N(\mathbf{w}_0 \cdot \mathbf{x}_t,\ \mathbf{x}_t^T \mathbf{S} \mathbf{x}_t)$
Mean is the same as in ridge regression, but we also get a variance:
$\mathrm{Var}(y_t) = \sigma^2\, \mathbf{x}_t^T (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{x}_t$
The variance does not depend on the training set 𝐲. It is low when many of the
training set 𝐱 values are collinear with 𝐱 𝑡 .
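Predictive mean and variance for a new input, continuing the sketch above (x_t here is just a random illustrative test point):

    x_t = randn(p, 1);           % new input
    y_mean = w0' * x_t;          % same as the ridge prediction
    y_var  = x_t' * S * x_t;     % = sigma^2 * x_t' * ((X'*X + lambda*eye(p)) \ x_t)
    % The variance uses only X and x_t, not the training targets y.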