Ridge regression and Bayesian linear regression
Kenneth D. Harris
6/5/15

Multiple linear regression
• What are you predicting? Data type: continuous; dimensionality: 1.
• What are you predicting it from? Data type: continuous; dimensionality: p.
• How many data points do you have? Enough.
• What sort of prediction do you need? Single best guess.
• What sort of relationship can you assume? Linear.

Multiple linear regression
• What are you predicting? Data type: continuous; dimensionality: 1.
• What are you predicting it from? Data type: continuous; dimensionality: p.
• How many data points do you have? Not enough.
• What sort of prediction do you need? Single best guess.
• What sort of relationship can you assume? Linear.

Multiple predictors, one predicted variable
• Model: $\hat{y}_i = \mathbf{x}_i \cdot \mathbf{w}$, or in matrix form $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$.
• Choose $\mathbf{w}$ to minimize the sum-squared error:
  $L = \sum_i \tfrac{1}{2}(\hat{y}_i - y_i)^2 = \sum_i \tfrac{1}{2}(\mathbf{w} \cdot \mathbf{x}_i - y_i)^2$
• Optimal weight vector: $\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ = X\y (in MATLAB).

Too many predictors
• If N = p, you can fit the training data perfectly: $\mathbf{y} = \mathbf{X}\mathbf{w}$ is N equations in N unknowns.
• If N < p, the solution is underconstrained ($\mathbf{X}^T\mathbf{X}$ is not invertible).
• But even if N > p, you can have problems with too many predictors.

[Example plots: fits with N = 40, p = 30, for y = x_1 and y = x_1 + noise.]

Geometric interpretation
[Diagram: target vector y decomposed into signal and noise, with predictor directions x_1 and x_2.]
• The target can be fit exactly by giving a massive positive weight to $\mathbf{x}_2$ and a massive negative weight to $\mathbf{x}_1$. It would be better to just fit $\mathbf{y} = \mathbf{x}_1$.

Overfitting = large weight vectors
• Solution: add a weight vector penalty:
  $L = \sum_i \tfrac{1}{2}(\hat{y}_i - y_i)^2 + \tfrac{1}{2}\lambda\|\mathbf{w}\|^2$
• Optimal weight vector: $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$
• The inverse can always be taken, even for N < p.

Example
[Plots: fits with λ = 0 and λ = 3.]

Ridge regression introduces a bias
[Plots: fits with λ = 0 and λ = 50.]

A quick trick to do ridge regression
• Ordinary linear regression, $\mathbf{w}$ = X\y, minimizes $\sum_i (y_i - \mathbf{x}_i \cdot \mathbf{w})^2$.
• Define
  $\mathbf{X}' = \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda}\,\mathbf{I}_p \end{pmatrix}, \qquad \mathbf{y}' = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix}$
• Then $\mathbf{w}$ = X'\y' is the solution to ridge regression. (Why?)

Regression as a probability model
• What are you predicting? Data type: continuous; dimensionality: 1.
• What are you predicting it from? Data type: continuous; dimensionality: p.
• How many data points do you have? Enough.
• What sort of prediction do you need? Probability distribution.
• What sort of relationship can you assume? Linear.

Regression as a probability model
• Assume $y_i$ is random, but $\mathbf{x}_i$ and $\mathbf{w}$ are just numbers:
  $y_i \sim N(\mathbf{w} \cdot \mathbf{x}_i, \sigma^2)$
• The log likelihood is
  $\log p(y_i) = -\frac{\log 2\pi\sigma^2}{2} - \frac{(y_i - \mathbf{w} \cdot \mathbf{x}_i)^2}{2\sigma^2}$
• Maximum likelihood is the same as a least-squares fit.

Bayesian linear regression
• Now consider $\mathbf{w}$ to also be random, with prior distribution $\mathbf{w} \sim N(\mathbf{0}, \sigma^2\lambda^{-1}\mathbf{I})$.
• The posterior distribution is
  $p(\mathbf{w} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y})}$

Bayesian linear regression
• $\log p(\mathbf{y} \mid \mathbf{w}) = -\frac{N \log 2\pi\sigma^2}{2} - \sum_i \frac{(y_i - \mathbf{w} \cdot \mathbf{x}_i)^2}{2\sigma^2}$
• $\log p(\mathbf{w}) = -\frac{p \log 2\pi\sigma^2\lambda^{-1}}{2} - \frac{\|\mathbf{w}\|^2}{2\sigma^2\lambda^{-1}}$
• This is all quadratic in $\mathbf{w}$, so $\mathbf{w}$ is Gaussian distributed.

Bayesian linear regression
• $\mathbf{w} \sim N(\mathbf{w}_0, \mathbf{S})$, where
  $\mathbf{w}_0 = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}, \qquad \mathbf{S} = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$
• The mean of $\mathbf{w}$ is exactly the same as in ridge regression, but we also get a covariance matrix for $\mathbf{w}$.

Bayesian predictions
• Given a training set $\mathbf{y}, \mathbf{X}$ and a new value $\mathbf{x}_t$, assume $\mathbf{y}$ is random but $\mathbf{X}$ and $\mathbf{x}_t$ are fixed.
• To make a prediction of $y_t$, integrate over all possible $\mathbf{w}$:
  $p(y_t \mid \mathbf{y}) = \int p(y_t \mid \mathbf{w})\, p(\mathbf{w} \mid \mathbf{y})\, d\mathbf{w} \sim N(\mathbf{w}_0 \cdot \mathbf{x}_t,\ \mathbf{x}_t^T \mathbf{S}\, \mathbf{x}_t)$
• The mean is the same as in ridge regression, but we also get a variance:
  $\mathrm{Var}(y_t) = \sigma^2\, \mathbf{x}_t^T (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1} \mathbf{x}_t$
• The variance does not depend on the training set $\mathbf{y}$. It is low when many of the training set $\mathbf{x}$ values are collinear with $\mathbf{x}_t$.
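As a sanity check on the ridge closed form and the augmentation trick, here is a minimal MATLAB sketch. It loosely follows the slides' example setting (N = 40, p = 30, y = x_1 + noise); the random seed, noise level, and variable names (Xprime, yprime, w_direct, w_ridge) are illustrative choices, not part of the slides.

```matlab
% Minimal ridge regression sketch (illustrative seed and noise level).
rng(1);
N = 40; p = 30;
X = randn(N, p);
y = X(:, 1) + 0.5 * randn(N, 1);   % y = x1 + noise, as in the example slides
lambda = 3;

% Closed form: w = (X'X + lambda*I)^(-1) X'y
w_direct = (X' * X + lambda * eye(p)) \ (X' * y);

% Augmentation trick: stack sqrt(lambda)*I under X and p zeros under y.
% Ordinary least squares on (X', y') then minimizes
% sum((y - X*w).^2) + lambda*sum(w.^2), which has the same minimizer
% as the ridge cost on the slides.
Xprime = [X; sqrt(lambda) * eye(p)];
yprime = [y; zeros(p, 1)];
w_ridge = Xprime \ yprime;

max(abs(w_ridge - w_direct))   % should be numerically zero
```

The augmented system has N + p rows and p columns, so the backslash solve is always well posed even when N < p, which is the answer to the slide's "(Why?)".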
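The Bayesian posterior and predictive variance can be computed the same way. This is a sketch under the slides' assumptions (known noise variance σ² and prior precision λ); the values of sigma2 and lambda and the test point xt are illustrative, and the predictive variance follows the slides' formula.

```matlab
% Sketch of Bayesian linear regression with known sigma2 and
% prior w ~ N(0, sigma2/lambda * I). Parameter values are illustrative.
rng(2);
N = 40; p = 30;
X = randn(N, p);
y = X(:, 1) + 0.5 * randn(N, 1);
sigma2 = 0.25;
lambda = 3;

A  = X' * X + lambda * eye(p);   % X'X + lambda*I
w0 = A \ (X' * y);               % posterior mean = ridge regression weights
S  = sigma2 * inv(A);            % posterior covariance of w

xt     = randn(p, 1);            % a new input x_t (illustrative)
y_mean = w0' * xt;               % predictive mean, same as ridge prediction
y_var  = xt' * S * xt;           % predictive variance from the slides:
                                 % sigma2 * xt' * inv(X'X + lambda*I) * xt
```

The point prediction y_mean coincides with the ridge prediction; the extra output is y_var, which is small when the training inputs cover the direction of x_t well and large when x_t points into a poorly sampled direction.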