Linear regression with gradient descent
Ingmar Schuster, Patrick Jähnichen (using slides by Andrew Ng)
Institut für Informatik

This lecture covers
● Linear regression
  ● Hypothesis formulation, hypothesis space
  ● Optimizing cost with gradient descent
  ● Using multiple input features with linear regression
● Feature scaling
● Nonlinear regression
● Optimizing cost using derivatives

Linear Regression

Price for buying a flat in Berlin
● Supervised learning problem
  ● Expected answer available for each example in the data
● Regression problem
  ● Prediction of a continuous output

Training data of flat prices
● m: number of training examples
● x: input (predictor) variable, "features" in ML-speak
● y: output (response) variable

  Square meters | Price in 1000 €
  73            | 174
  146           | 367
  38            | 69
  124           | 257
  ...           | ...

Learning procedure
● Training data → learning algorithm → hypothesis h (a mapping between input and output, here: size of flat → estimated price)
● Linear regression with one input variable (univariate): h_θ(x) = θ_0 + θ_1 · x
● θ_0, θ_1 are the hypothesis parameters. How to choose them?

Optimization objective
● The purpose of the learning algorithm is expressed in an optimization objective and a cost function (often called J), e.g.
  ● fit the data well
  ● few false negatives
  ● few false positives
  ● ...

Fitting data well: least squares cost function
● In regression we almost always want to fit the data well
  ● smallest average distance to the points in the training data (h(x) close to y for (x, y) in the training data)
● Cost function, often named J, with m the number of training instances:
  J(θ_0, θ_1) = 1/(2m) · Σ_i (h_θ(x^(i)) - y^(i))^2
● Squaring means
  ● the penalty for positive and negative deviations is the same
  ● the penalty for large deviations is stronger
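The least squares cost can be sketched in a few lines of Python with NumPy. This is a minimal sketch; the function and variable names are my own, and the data is the flat-price table from the slides:

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    # Univariate linear hypothesis h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    # Least squares cost J = 1/(2m) * sum((h(x_i) - y_i)^2)
    m = len(x)
    residuals = hypothesis(theta0, theta1, x) - y
    return residuals @ residuals / (2 * m)

# Flat prices from the slides: square meters -> price in 1000 EUR
x = np.array([73.0, 146.0, 38.0, 124.0])
y = np.array([174.0, 367.0, 69.0, 257.0])

print(cost(0.0, 2.5, x, y))  # cost of the guess "price = 2.5 * sqm" -> 445.15625
```

The factor 1/(2m) rather than 1/m is a common convention: the 2 cancels when taking the derivative and does not change where the minimum lies.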
Optimizing Cost with Gradient Descent

Gradient descent outline
● Want to minimize J(θ_0, θ_1)
● Start with random θ_0, θ_1
● Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1) until we end up at a minimum

3D plots and contour plots
[plot by Andrew Ng: stepwise descent towards the minimum; reading the minimum off a plot works only for few parameters]

Gradient descent
● Repeat until convergence, simultaneously for all j:
  θ_j := θ_j - α · ∂/∂θ_j J(θ_0, θ_1)
  (α: learning rate, ∂/∂θ_j J: partial derivative)
● Beware: an incremental update, where θ_0 is overwritten before the gradient for θ_1 is computed, is incorrect!
● Steps become smaller near the minimum without changing the learning rate, because the gradient shrinks

Learning rate considerations
● A small learning rate leads to slow convergence
● An overly large learning rate may prevent convergence or even cause divergence

Checking convergence
● Gradient descent works correctly if J decreases with every step
● Possible convergence criterion: converged if J decreases by less than some small constant

Local minima
● Gradient descent can get stuck at local minima (e.g. when J is not the squared error; the squared-error cost of linear regression with one variable has a single global minimum)
● Remedy: random restart with different parameter value(s)

Variants of Gradient Descent
Using multiple input features

Multiple features

  Square meters | Bedrooms | Floors | Age of building (years) | Price in 1000 €
  x_1           | x_2      | x_3    | x_4                     | y
  200           | 5        | 1      | 45                      | 460
  131           | 3        | 2      | 40                      | 232
  142           | 3        | 2      | 30                      | 315
  756           | 2        | 1      | 36                      | 178
  ...           | ...      | ...    | ...                     | ...

● Notation: n features x_1, ..., x_n; x^(i) is the i-th training example and x_j^(i) its j-th feature

Hypothesis representation
● h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + ... + θ_n·x_n
● More compact with the definition x_0 = 1: h_θ(x) = θ^T x

Gradient descent for multiple variables
● Generalized cost function:
  J(θ) = 1/(2m) · Σ_i (h_θ(x^(i)) - y^(i))^2
● Generalized gradient descent, simultaneously for all j:
  θ_j := θ_j - α · ∂/∂θ_j J(θ)

Partial derivative of cost function for multiple variables
● Calculating the partial derivative:
  ∂/∂θ_j J(θ) = 1/m · Σ_i (h_θ(x^(i)) - y^(i)) · x_j^(i)
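Plugging this derivative into the update rule gives batch gradient descent, which can be sketched in vectorized NumPy. This is a sketch under my own naming; note that the gradient for all θ_j is computed before any parameter is changed, i.e. the simultaneous update the slides insist on:

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    """Batch gradient descent for linear regression.

    X: (m, n) feature matrix WITHOUT the x_0 = 1 column (added here),
    y: (m,) targets, alpha: learning rate, iters: number of steps.
    """
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])    # prepend x_0 = 1
    theta = np.zeros(Xb.shape[1])           # start at theta = 0
    for _ in range(iters):
        # full gradient first: (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        grad = Xb.T @ (Xb @ theta - y) / m
        theta -= alpha * grad               # simultaneous update of all theta_j
    return theta

# Toy data with one feature: y = 1 + 2 * x_1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y, alpha=0.1, iters=2000)
print(theta)  # approximately [1., 2.]
```

Updating theta in place inside a loop over j, using already-updated components, would compute the gradient at a point that no longer exists; the vectorized form avoids that mistake by construction.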
Gradient descent for multiple variables
● Simplified gradient descent, plugging in the derivative (simultaneously for all j):
  θ_j := θ_j - α/m · Σ_i (h_θ(x^(i)) - y^(i)) · x_j^(i)

Conversion considerations for multiple variables
● With multiple variables, features are no longer directly comparable: their scales can vary strongly
  square meters: 30 - 400; bedrooms: 1 - 10; price: 80 000 - 2 000 000
● Gradient descent converges faster for features on a similar scale

Feature Scaling

Feature scaling
● Different approaches exist for converting features to a comparable scale
● Min-max scaling makes all data fall into the range [0, 1] (for a single data point of feature j):
  x_j := (x_j - min_j) / (max_j - min_j)
● Z-score conversion (next slide)

Z-score conversion
● Center the data on 0
● Scale the data so the majority falls into the range [-1, 1]
● μ_j: mean / empirical expected value; σ_j: empirical standard deviation
● Z-score conversion of a single data point for feature j:
  x_j := (x_j - μ_j) / σ_j

Visualizing standard deviation
[plot]

Nonlinear Regression (by cheap trickery)

Nonlinear regression problems
[plots: a nonlinear data set, a linear approximation, and a nonlinear hypothesis]

Nonlinear regression with cheap trickery
● Linear regression can be used for nonlinear problems
● Choose a nonlinear hypothesis space by adding transformed features, e.g.
  h_θ(x) = θ_0 + θ_1·x + θ_2·x^2 + θ_3·x^3
  (still linear in the parameters θ)

Optimizing cost using derivatives

Comparison: gradient descent vs. setting the derivative to 0
● Instead of gradient descent, solve ∂/∂θ_i J(θ) = 0 for all i
Comparison: gradient descent vs. setting the derivative to 0 (continued)

Gradient descent
● Need to choose the learning rate α
● Needs many iterations, random restarts, etc.
● Works well even for many features

Setting the derivative to 0
● No need to choose a learning rate
● No iterations
● Slow for many features

Pictures
● Some public domain plots from en.wikipedia.org and de.wikipedia.org