Machine Learning
Lecture 1–2
Syed Qamar Askari

Topics
• Univariate simple linear regression

We'll start with regression.

Prerequisite knowledge required: Linear function
• It draws a line.
• Standard form: y = mx + b
• m is the slope
• b is the y-intercept

Prerequisite knowledge required: Quadratic function
• It draws a parabola.
• Standard form: f(x) = a(x − h)² + k
• a positive: graph opens upward
• a negative: graph opens downward
• Line of symmetry at x = h
• Vertex is the point (h, k)

Prerequisite knowledge required: Cubic function
• Forms:
  f(x) = ax³ + bx² + cx + d
  f(x) = (x − a)³ − b

Regression types
• Linear and non-linear regression
• Simple and multiple regression
• Univariate and multivariate regression

Simple univariate linear regression
Example: House price prediction

  Size in feet² (x)    Price ($) in 1000's (y)
  2104                 460
  1416                 232
  1534                 315
  852                  178
  …                    …

Based on the given data, can you predict the price of a 1600 square-foot house?

House price prediction – Solution (on board)
[Scatter plot of the table above: size in feet² (0–2500) on the x-axis, price in $1000's (0–400) on the y-axis. Where does a 1600 sq. ft house fall?]

How could a computer do it?
• Explanation of the following steps on board
• Simple (solvable by a line) one-featured dataset example on board
• Hypothesis, linear model, parameters (slope and y-intercept)
• Do some exercises with different combinations of parameters
• Cost/error function formulation
• To decrease the error and get a better-fitting model, show the relationship with the slope and y-intercept one by one and then simultaneously
• Give the concept of slope calculation using partial derivatives
• Gradient descent for simple linear regression
  • Partial derivatives of the error function w.r.t. the slope and y-intercept
  • Concept of the learning rate
  • Equations to update both parameters
  • Overall gradient descent algorithm
• Running a few iterations of GD on a simple example

Regression – Working model
Training set → Learning algorithm → h
Size of house → h → Estimated price
How to make the function h?

Regression – Line fitting
Hypothesis: h(x) = θ0 + θ1·x
θ0, θ1: parameters
How to choose the θ's?
[Three example fits of the data for different choices of θ0 and θ1.]

How to know how good the fit is?
Mean squared error.
Cost function (mean squared error):
  J(θ0, θ1) = (1/2m) · Σᵢ (h(xᵢ) − yᵢ)²,  summed over the m training examples (xᵢ, yᵢ)
Idea: choose θ0, θ1 so that h(x) is close to y for our training examples.
Goal: minimize J(θ0, θ1).

Understanding the role of slope
We want to fit a line to the following data points. By fixing θ0 = 0 and varying θ1, we can understand the behavior of J.
[Sequence of slides for θ1 = 1, 0.5, 0 and other values: on the left, the data points with the line h(x) = θ1·x (for fixed θ1, a function of x); on the right, J as a function of the parameter θ1 over the range −0.5 to 2.5.]
So the optimal value of θ1 is 1.
Similarly, fixing θ1 and varying θ0 gives the same kind of curve.
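To make the cost calculation concrete, here is a minimal Python sketch. It is not part of the original slides: the data points (1, 1), (2, 2), (3, 3) are an assumption consistent with the optimal slope θ1 = 1 found above, and the names hypothesis and compute_cost are illustrative. It evaluates J with θ0 fixed at 0 while θ1 varies, reproducing the bowl-shaped curve from the slope exercise.

# Minimal sketch (not from the slides): evaluate the MSE cost J for a few slopes.
# Assumed data points (1,1), (2,2), (3,3), consistent with the optimal slope of 1 above.

x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]

def hypothesis(theta0, theta1, xi):
    # h(x) = theta0 + theta1 * x
    return theta0 + theta1 * xi

def compute_cost(theta0, theta1, xs, ys):
    # J(theta0, theta1) = (1 / 2m) * sum over i of (h(x_i) - y_i)^2
    m = len(xs)
    return sum((hypothesis(theta0, theta1, xi) - yi) ** 2
               for xi, yi in zip(xs, ys)) / (2 * m)

# Fix theta0 = 0 and vary theta1, as in the slope exercise above.
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {theta1}:  J = {compute_cost(0.0, theta1, x, y):.3f}")

Running it shows J dropping to 0 at θ1 = 1 and rising on either side, matching the curve of J versus θ1 sketched above.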
What if we change both parameters simultaneously?
We'll end up with a landscape.

Landscape view of J w.r.t. both parameters
[3-D surface plot of J(θ0, θ1) over the (θ0, θ1) plane.]

Idea
We'll generate a random solution and then search the landscape to reach the global minimum.
[Surface plots of J(θ0, θ1) illustrating this search.]

How to search such complex, unpredictable landscapes?
• Hill climbing
• Simulated annealing
• Gradient descent

Slope of J at some point in the space
[Figure: the graph of a function, drawn in black, and a tangent line to that function, drawn in red. The slope of the tangent line is equal to the derivative of the function at the marked point.]

Partial derivatives of J w.r.t. both parameters
w.r.t. θ0:  ∂J/∂θ0 = (1/m) · Σᵢ (h(xᵢ) − yᵢ)
w.r.t. θ1:  ∂J/∂θ1 = (1/m) · Σᵢ (h(xᵢ) − yᵢ) · xᵢ

Gradient descent algorithm
Repeat until convergence:
  θ0 := θ0 − α · ∂J/∂θ0
  θ1 := θ1 − α · ∂J/∂θ1
Update both parameters simultaneously.

Role of alpha and convergence
• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
• At a local optimum the derivative is zero, so the update leaves the current value of the parameter unchanged.
• Gradient descent can converge to a local minimum even with the learning rate α fixed: as we approach a local minimum, the gradient shrinks, so gradient descent automatically takes smaller steps and there is no need to decrease α over time.

Demonstration of convergence
[Sequence of slides: on the left, the fitted line h(x) for the current θ0, θ1 (for fixed parameters, a function of x); on the right, J as a function of the parameters θ0, θ1, showing the path taken as gradient descent iterates.]
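To tie the update equations together, here is a minimal Python sketch. It is mine, not from the lecture: it reuses the assumed data points (1, 1), (2, 2), (3, 3) from the earlier sketch, and the learning rate of 0.1 and the 1000 iterations are arbitrary illustrative choices.

# Sketch of batch gradient descent for h(x) = theta0 + theta1 * x,
# using the partial derivatives of J listed above.

x = [1.0, 2.0, 3.0]          # assumed data points from the slope exercise
y = [1.0, 2.0, 3.0]          # the best fit should approach theta0 = 0, theta1 = 1
m = len(x)

theta0, theta1 = 0.0, 0.0    # initial solution
alpha = 0.1                  # learning rate (illustrative value)

for _ in range(1000):
    # dJ/dtheta0 = (1/m) * sum(h(x_i) - y_i)
    # dJ/dtheta1 = (1/m) * sum((h(x_i) - y_i) * x_i)
    errors = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
    grad0 = sum(errors) / m
    grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
    # Update both parameters simultaneously.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")   # ends up close to 0 and 1

Note that each pass computes both gradients from the current parameter values before applying either update, which is the "update both parameters simultaneously" rule in the algorithm above.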