Lasso, Support Vector Machines, Generalized linear models
Kenneth D. Harris
20/5/15

Multiple linear regression
• What are you predicting? Data type: continuous; dimensionality: 1
• What are you predicting it from? Data type: continuous; dimensionality: p
• How many data points do you have? Enough
• What sort of prediction do you need? Single best guess
• What sort of relationship can you assume? Linear

Ridge regression
• What are you predicting? Data type: continuous; dimensionality: 1
• What are you predicting it from? Data type: continuous; dimensionality: p
• How many data points do you have? Not enough
• What sort of prediction do you need? Single best guess
• What sort of relationship can you assume? Linear

Regression as a probability model
• What are you predicting? Data type: continuous; dimensionality: 1
• What are you predicting it from? Data type: continuous; dimensionality: p
• How many data points do you have? Not enough
• What sort of prediction do you need? Probability distribution
• What sort of relationship can you assume? Linear

Different data types
• What are you predicting? Data type: discrete, integer, whatever; dimensionality: 1
• What are you predicting it from? Data type: continuous; dimensionality: p
• How many data points do you have? Not enough
• What sort of prediction do you need? Single best guess
• What sort of relationship can you assume? Linear – nonlinear

Ridge regression
Linear prediction: $\hat{y}_i = \mathbf{w} \cdot \mathbf{x}_i$
Loss function: $L = \sum_i \tfrac{1}{2}(\hat{y}_i - y_i)^2 + \tfrac{1}{2}\lambda\|\mathbf{w}\|_2^2$
The first term measures fit quality; the second is the penalty. Both the fit-quality term and the penalty can be changed.

"Regularization path" for ridge regression
http://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_path.html

Changing the penalty
• $\|\mathbf{w}\|_2 = \sqrt{\sum_i w_i^2}$ is called the "$L_2$ norm"
• $\|\mathbf{w}\|_1 = \sum_i |w_i|$ is called the "$L_1$ norm"
• In general, $\|\mathbf{w}\|_p = \left(\sum_i |w_i|^p\right)^{1/p}$ is called the "$L_p$ norm"

The LASSO
Loss function: $L = \sum_i \tfrac{1}{2}(\hat{y}_i - y_i)^2 + \tfrac{1}{2}\lambda\|\mathbf{w}\|_1$
As in ridge regression, the first term measures fit quality and the second is the penalty; only the norm used in the penalty has changed.

LASSO regularization path
• Most weights are exactly zero
• "Sparse solution": the LASSO selects a small number of explanatory variables
• This can help avoid overfitting when p ≫ N
• Models are easier to interpret – but remember there is no proof of causation.
• The path is piecewise-linear (see the code sketch below)
http://scikit-learn.org/0.11/auto_examples/linear_model/plot_lasso_lars.html

Elastic net
• $L = \sum_i \tfrac{1}{2}(\hat{y}_i - y_i)^2 + \tfrac{1}{2}\lambda_1\|\mathbf{w}\|_1 + \tfrac{1}{2}\lambda_2\|\mathbf{w}\|_2^2$
• Combines the $L_1$ and $L_2$ penalties.
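The two scikit-learn pages linked above show these regularization paths graphically. Below is a minimal sketch of how such a path can be computed, assuming scikit-learn is available; the synthetic data and variable names are illustrative, not from the lecture. It fits the LASSO over a decreasing sequence of penalties and counts the nonzero weights, showing the sparse solutions described above.

```python
# Minimal sketch: LASSO regularization path with scikit-learn.
# The data here are synthetic; in practice X would be your design
# matrix (N points x p predictors) and y the continuous target.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.RandomState(0)
N, p = 50, 20                      # deliberately few points relative to p
X = rng.randn(N, p)
true_w = np.zeros(p)
true_w[:3] = [2.0, -1.5, 1.0]      # only 3 variables actually matter
y = X @ true_w + 0.1 * rng.randn(N)

# lasso_path computes the whole path of solutions as the penalty
# (called alpha in scikit-learn, lambda in the lecture) decreases.
alphas, coefs, _ = lasso_path(X, y)

# coefs has shape (p, n_alphas): one row per weight, one column per alpha.
# For moderate-to-large alpha, most rows are exactly zero: the
# "sparse solution" that selects a few explanatory variables.
for a, w in zip(alphas[::10], coefs.T[::10]):
    print(f"alpha={a:.3f}  nonzero weights: {np.sum(w != 0)}")
```

Between the penalty values where variables enter or leave the model, each weight's trajectory is a straight line, which is the piecewise-linear property noted above; the linked scikit-learn example computes the same path with the LARS algorithm.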
Predicting other types of data
Linear prediction: $f_i = \mathbf{w} \cdot \mathbf{x}_i$
Loss function: $L = \sum_i E(f_i, y_i) + \tfrac{1}{2}\lambda\|\mathbf{w}\|_2^2$
The first term measures fit quality; the second is the penalty. For ridge regression, $E(f_i, y_i) = \tfrac{1}{2}(f_i - y_i)^2$. But it could be anything…

Support vector machine
• For predicting binary data
• "Hinge loss" function:
$$E(f_i, y_i) = \begin{cases} 0 & y_i = 1,\ f_i \ge 1 \\ 1 - f_i & y_i = 1,\ f_i < 1 \\ 0 & y_i = -1,\ f_i \le -1 \\ 1 + f_i & y_i = -1,\ f_i > -1 \end{cases}$$
• Equivalently, $E(f_i, y_i) = \max(0, 1 - y_i f_i)$.
[Figure: hinge loss $E$ as a function of $f$]

Errors vs. margins
• Margins are the places where $f_i = \pm 1$
• On the correct side of the margin: zero error.
• On the incorrect side: the error is the distance from the margin.
• The penalty term is higher when the margins are close together.
• The SVM balances classifying points correctly against having big margins.

Generalized linear models
• What are you predicting? Data type: discrete, integer, whatever; dimensionality: 1
• What are you predicting it from? Data type: continuous; dimensionality: p
• How many data points do you have? Not enough
• What sort of prediction do you need? Probability distribution
• What sort of relationship can you assume? Linear – nonlinear

Generalized linear models
Linear prediction: $f_i = \mathbf{w} \cdot \mathbf{x}_i$
Loss function: $L = \sum_i E(f_i, y_i) + \tfrac{1}{2}\lambda\|\mathbf{w}\|_2^2$
For ridge regression, $E(f_i, y_i) = \tfrac{1}{2}(f_i - y_i)^2 = \mathrm{const} - \log P(y_i; f_i)$ for a Gaussian distribution with mean $f_i$.

Generalized linear models
Linear prediction: $f_i = \mathbf{w} \cdot \mathbf{x}_i$
Loss function: $L = -\sum_i \log P(y_i; f_i) + \tfrac{1}{2}\lambda\|\mathbf{w}\|_2^2$
where $P(y_i; f_i)$ is a probability distribution for $y_i$ with parameter $f_i$.

Example: logistic regression
$$P(y_i; f_i) = \begin{cases} \dfrac{1}{1 + e^{-f_i}} & y_i = 1 \\[1ex] \dfrac{1}{1 + e^{f_i}} & y_i = -1 \end{cases}$$
[Figure: $P(y; f)$ as a function of $f$ for $y = \pm 1$]

Logistic regression loss function
• $E(f_i, y_i) = -\log P(y_i; f_i) = \log\!\left(1 + e^{-y_i f_i}\right)$

Poisson regression
• Used when $y_i$ is a non-negative integer (e.g. a spike count)
• The distribution for $y_i$ is Poisson with mean $g(f_i)$
• The "link function" $g$ must be positive. It is often the exponential function, but it doesn't have to be (and that's not always a good idea).

What to read; what software to use
glmnet for MATLAB: http://web.stanford.edu/~hastie/glmnet_matlab/
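To tie the section together, here is a small sketch (my own illustration, not code from the lecture or from glmnet) of the shared loss $L = \sum_i E(f_i, y_i) + \tfrac{1}{2}\lambda\|\mathbf{w}\|_2^2$ in Python, with the per-point term $E(f_i, y_i)$ swapped out to give ridge regression, the SVM, logistic regression, or Poisson regression:

```python
# Minimal sketch of the loss framework from the slides:
#   L = sum_i E(f_i, y_i) + (lambda/2) ||w||^2,  with  f_i = w . x_i.
# Swapping E changes the model; the linear prediction stays the same.
import numpy as np

def ridge_E(f, y):                       # Gaussian: ridge regression
    return 0.5 * (f - y) ** 2

def hinge_E(f, y):                       # SVM (y in {-1, +1})
    return np.maximum(0.0, 1.0 - y * f)

def logistic_E(f, y):                    # logistic regression (y in {-1, +1})
    return np.logaddexp(0.0, -y * f)     # stable log(1 + exp(-y*f)) = -log P(y; f)

def poisson_E(f, y):                     # Poisson with exponential link g(f) = e^f
    return np.exp(f) - y * f             # -log P(y; f) up to the constant log(y!)

def loss(w, X, y, E, lam):
    f = X @ w                            # linear prediction f_i = w . x_i
    return np.sum(E(f, y)) + 0.5 * lam * np.sum(w ** 2)

# Toy usage (names and data illustrative):
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.sign(X[:, 0] + 0.5 * rng.randn(100))
w = np.zeros(5)
print(loss(w, X, y, logistic_E, lam=1.0))   # = 100 * log(2) at w = 0
```

Minimizing this loss with a generic optimizer (e.g. scipy.optimize.minimize) recovers each model; in practice, packages such as glmnet (for the GLMs) or scikit-learn fit them far more efficiently, computing whole regularization paths at once.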