Lasso, Support Vector Machines, Generalized linear models
Kenneth D. Harris
20/5/15
Multiple linear regression
• What are you predicting? Continuous data, dimensionality 1
• What are you predicting it from? Continuous data, dimensionality p
• How many data points do you have? Enough
• What sort of prediction do you need? Single best guess
• What sort of relationship can you assume? Linear
Ridge regression
• What are you predicting? Continuous data, dimensionality 1
• What are you predicting it from? Continuous data, dimensionality p
• How many data points do you have? Not enough
• What sort of prediction do you need? Single best guess
• What sort of relationship can you assume? Linear
Regression as a probability model
• What are you predicting? Continuous data, dimensionality 1
• What are you predicting it from? Continuous data, dimensionality p
• How many data points do you have? Not enough
• What sort of prediction do you need? Probability distribution
• What sort of relationship can you assume? Linear
Different data types
• What are you predicting? Discrete, integer, whatever; dimensionality 1
• What are you predicting it from? Continuous data, dimensionality p
• How many data points do you have? Not enough
• What sort of prediction do you need? Single best guess
• What sort of relationship can you assume? Linear – nonlinear
Ridge regression

Linear prediction: $\hat y_i = \mathbf{w} \cdot \mathbf{x}_i$

Loss function:
$$L = \sum_i \tfrac{1}{2}\left(\hat y_i - y_i\right)^2 + \tfrac{1}{2}\lambda \|\mathbf{w}\|_2^2$$
The first term measures fit quality; the second is the penalty. Both the fit quality and the penalty can be changed.
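As a concrete illustration (not from the slides), the ridge loss above has a closed-form minimizer; a minimal numpy sketch on made-up data:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_i (w.x_i - y_i)^2 / 2 + lam * ||w||_2^2 / 2.

    Setting the gradient to zero gives (X'X + lam*I) w = X'y.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy data (hypothetical): 100 points, 5 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_hat = ridge_fit(X, y, lam=1.0)  # close to w_true; larger lam shrinks toward 0
```

Increasing `lam` pulls every weight toward zero, which is exactly what the regularization path on the next slide traces out.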
“Regularization path” for ridge regression
http://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_path.html
Changing the penalty
• $\|\mathbf{w}\|_2^2 = \sum_i w_i^2$ is the squared “$L_2$ norm”
• $\|\mathbf{w}\|_1 = \sum_i |w_i|$ is called the “$L_1$ norm”
• In general, $\|\mathbf{w}\|_p = \left(\sum_i |w_i|^p\right)^{1/p}$ is called the “$L_p$ norm”
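A quick numeric check of these definitions, using an arbitrary example vector:

```python
import numpy as np

w = np.array([3.0, -4.0])

l2 = np.sqrt(np.sum(w ** 2))             # L2 norm: sqrt(9 + 16) = 5
l1 = np.sum(np.abs(w))                   # L1 norm: 3 + 4 = 7
l3 = np.sum(np.abs(w) ** 3) ** (1 / 3)   # general Lp norm with p = 3

# numpy's built-in norm agrees with the hand-rolled versions
assert np.isclose(l2, np.linalg.norm(w, 2))
assert np.isclose(l1, np.linalg.norm(w, 1))
```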
The LASSO

Loss function:
$$L = \sum_i \tfrac{1}{2}\left(\hat y_i - y_i\right)^2 + \tfrac{1}{2}\lambda \|\mathbf{w}\|_1$$
The first term measures fit quality; the second is now an $L_1$ penalty.
LASSO regularization path
• Most weights are exactly zero
• “Sparse solution”: selects a small number of explanatory variables
• This can help avoid overfitting when p ≫ N
• Models are easier to interpret – but remember there is no proof of causation
• Path is piecewise-linear
http://scikit-learn.org/0.11/auto_examples/linear_model/plot_lasso_lars.html
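To see the sparsity concretely, here is a minimal coordinate-descent sketch of the LASSO (my own illustration on made-up data, not the LARS algorithm the linked plot uses; note the objective folds the slide's ½λ into a single `lam`):

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink z toward zero by t; values within t of zero become exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for L = sum_i (w.x_i - y_i)^2 / 2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's own contribution added back
            r = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return w

# Toy data: only 2 of 10 predictors actually matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[0], w_true[1] = 1.0, -2.0
y = X @ w_true + 0.1 * rng.normal(size=100)

w_hat = lasso_cd(X, y, lam=5.0)  # most entries of w_hat are exactly zero
```

The soft-threshold step is what produces exact zeros, unlike ridge regression, which only shrinks weights toward zero.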
Elastic net
$$L = \sum_i \tfrac{1}{2}\left(\hat y_i - y_i\right)^2 + \tfrac{1}{2}\lambda_1 \|\mathbf{w}\|_1 + \tfrac{1}{2}\lambda_2 \|\mathbf{w}\|_2^2$$
Combines the $L_1$ and $L_2$ penalties in one loss function.
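A coordinate-descent sketch for the elastic net (my own toy example): compared with a LASSO coordinate update, the $L_2$ term simply adds $\lambda_2$ to the denominator, while the $L_1$ term still soft-thresholds.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for the elastic-net loss above:
    L = sum_i (w.x_i - y_i)^2 / 2 + lam1*||w||_1 / 2 + lam2*||w||_2^2 / 2
    """
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]
            # L1 part soft-thresholds; L2 part inflates the denominator
            w[j] = soft_threshold(X[:, j] @ r, lam1 / 2) / (col_sq[j] + lam2)
    return w

# Same toy setup as before: only 2 of 10 predictors matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[0], w_true[1] = 1.0, -2.0
y = X @ w_true + 0.1 * rng.normal(size=100)

w_hat = elastic_net_cd(X, y, lam1=10.0, lam2=5.0)
```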
Predicting other types of data

Linear prediction: $f_i = \mathbf{w} \cdot \mathbf{x}_i$

Loss function:
$$L = \sum_i E(f_i, y_i) + \tfrac{1}{2}\lambda \|\mathbf{w}\|_2^2$$
The first term measures fit quality; the second is the penalty.

For ridge regression, $E(f_i, y_i) = \tfrac{1}{2}(f_i - y_i)^2$. But it could be anything…
Support vector machine
• For predicting binary data ($y_i = \pm 1$)
• “Hinge loss” function:
$$E(f_i, y_i) = \begin{cases} 0 & y_i = 1,\ f_i \ge 1 \\ 1 - f_i & y_i = 1,\ f_i < 1 \\ 0 & y_i = -1,\ f_i \le -1 \\ 1 + f_i & y_i = -1,\ f_i > -1 \end{cases}$$
Equivalently, $E(f_i, y_i) = \max(0,\, 1 - y_i f_i)$.
[Figure: hinge loss $E$ plotted against $f$]
Errors vs. margins
• Margins are the places where $f_i = \pm 1$
• On the correct side of the margin: zero error
• On the incorrect side: error is distance from margin
• Penalty term is higher when margins are close together
• SVM balances classifying points correctly vs. having big margins
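Putting the pieces together: a small subgradient-descent sketch of a linear SVM on made-up separable data (no bias term, for brevity; the hinge loss matches the piecewise definition above):

```python
import numpy as np

def hinge_loss(f, y):
    # E(f, y) = max(0, 1 - y*f): zero past the correct margin,
    # growing linearly with distance on the wrong side.
    return np.maximum(0.0, 1.0 - y * f)

def svm_fit(X, y, lam=0.1, lr=0.01, n_iter=200):
    """Subgradient descent on L = sum_i max(0, 1 - y_i w.x_i) + lam*||w||^2/2."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        viol = y * (X @ w) < 1                     # points violating the margin
        grad = -(y[viol][:, None] * X[viol]).sum(axis=0) + lam * w
        w -= lr * grad
    return w

# Toy separable data: label = side of the hyperplane [1, -1].x = 0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X @ np.array([1.0, -1.0]))

w_hat = svm_fit(X, y)
acc = float((np.sign(X @ w_hat) == y).mean())
```

Only margin-violating points contribute to the gradient, which is the "errors vs. margins" trade-off in action: the `lam` term pushes margins apart, the hinge term pushes points to the correct side.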
Generalized linear models
• What are you predicting? Discrete, integer, whatever; dimensionality 1
• What are you predicting it from? Continuous data, dimensionality p
• How many data points do you have? Not enough
• What sort of prediction do you need? Probability distribution
• What sort of relationship can you assume? Linear – nonlinear
Generalized linear models

Linear prediction: $f_i = \mathbf{w} \cdot \mathbf{x}_i$

Loss function:
$$L = \sum_i E(f_i, y_i) + \tfrac{1}{2}\lambda \|\mathbf{w}\|_2^2$$
For ridge regression, $E(f_i, y_i) = \tfrac{1}{2}(f_i - y_i)^2 = \mathrm{const} - \log p(y_i; f_i)$ for a Gaussian distribution with mean $f_i$.
Generalized linear models

Linear prediction: $f_i = \mathbf{w} \cdot \mathbf{x}_i$

Loss function:
$$L = \sum_i -\log p(y_i; f_i) + \tfrac{1}{2}\lambda \|\mathbf{w}\|_2^2$$
where $p(y_i; f_i)$ is a probability distribution for $y_i$ with parameter $f_i$.
Example: logistic regression
$$p(y_i; f_i) = \begin{cases} \dfrac{1}{1 + e^{-f_i}} & y_i = 1 \\[1ex] \dfrac{1}{1 + e^{f_i}} & y_i = -1 \end{cases}$$
[Figure: $P(y; f)$ plotted against $f$]
Logistic regression loss function
• $E(f_i, y_i) = -\log p(y_i; f_i) = \log\left(1 + e^{-y_i f_i}\right)$
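A minimal gradient-descent sketch of penalized logistic regression with this loss, on made-up data with labels in {−1, +1} (in practice a library routine such as scikit-learn's LogisticRegression would be used):

```python
import numpy as np

def logistic_fit(X, y, lam=0.1, lr=0.05, n_iter=1000):
    """Gradient descent on L = sum_i log(1 + exp(-y_i w.x_i)) + lam*||w||^2/2."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # = sigma(-y_i f_i)
        grad = -X.T @ (y * s) + lam * w
        w -= lr * (grad / len(y))               # step on the per-point average
    return w

# Toy data: binary labels from a noisy linear rule
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.2 * rng.normal(size=200))

w_hat = logistic_fit(X, y)
acc = float((np.sign(X @ w_hat) == y).mean())
```

Unlike the hinge loss, this loss is smooth everywhere and never exactly zero, so every point contributes to the gradient, and the fitted $f_i$ gives a probability rather than just a class label.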
Poisson regression
• When $y_i$ is a non-negative integer (e.g. a spike count)
• Distribution for $y_i$ is Poisson with mean $g(f_i)$
• The “link function” $g$ must return a positive rate. It is often the exponential function, but doesn’t have to be (and that isn’t always a good idea).
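With the exponential link $g(f) = e^f$, the negative Poisson log-likelihood has a simple form, $\sum_i [e^{f_i} - y_i f_i]$ up to the constant $\sum_i \log(y_i!)$. A gradient-descent sketch on made-up spike-count data:

```python
import numpy as np

def poisson_fit(X, y, lam=0.01, lr=0.1, n_iter=2000):
    """Gradient descent on L = sum_i [exp(f_i) - y_i f_i] + lam*||w||^2/2,
    the negative Poisson log-likelihood with link g(f) = exp(f),
    dropping the constant sum_i log(y_i!)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        f = X @ w
        grad = X.T @ (np.exp(f) - y) + lam * w
        w -= lr * (grad / len(y))
    return w

# Toy spike counts: Poisson with rate exp(w_true . x)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
w_true = np.array([0.5, -0.3])
y = rng.poisson(np.exp(X @ w_true))

w_hat = poisson_fit(X, y)  # recovers w_true up to sampling noise
```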
What to read; what software to use
Download glmnet for MATLAB: http://web.stanford.edu/~hastie/glmnet_matlab/