Machine Learning Week 1, Lecture 2

Recap
Supervised learning: an unknown target f, a data set of labeled examples, a hypothesis set, and a learning algorithm that outputs a hypothesis h with h(x) ≈ f(x). Two problem types: classification and regression. For linear classification the hypothesis is a hyperplane with normal vector w, splitting the input space into a halfspace where $w^\top x < 0$ and a halfspace where $w^\top x > 0$. Finding the best linear classifier is NP-hard in general. Assume the data is linearly separable!!! Then the Perceptron finds a separating hyperplane.

Today
• Convex Optimization
  – Convex sets
  – Convex functions
• Logistic Regression
  – Maximum Likelihood
  – Gradient Descent
• Maximum Likelihood and Linear Regression

Convex Optimization
Optimization problems are in general very hard (if solvable at all)!!! For convex optimization problems, theoretical (polynomial time) and practical solutions exist (most of the time).

Convex Sets
A set is convex if the "line" from any x to any y in the set also lies in the set: for every $\theta \in [0,1]$, $\theta x + (1-\theta) y$ is in the set. (Figure: a convex set vs. a non-convex set.)
The intersection of convex sets is convex. The union of convex sets may not be convex.

Convex Functions
f is convex if the chord from (x, f(x)) to (y, f(y)) lies on or above the graph: $f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$ for all x, y and $\theta \in [0,1]$. f is concave if -f is convex. (Quiz: concave? convex? both?)

Differentiable Convex Functions
If f is differentiable, f is convex if and only if the tangent $f(x) + f'(x)(y - x)$ lies below the graph everywhere: $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ for all x, y.

Twice Differentiable Convex Functions
If f is twice differentiable, f is convex if the Hessian is positive semidefinite for all x. A real symmetric matrix A is positive semidefinite if $x^\top A x \ge 0$ for all nonzero x. In 1D this is simply $f''(x) \ge 0$ for all x. (A simple 2D example was shown on the slides.)

More Examples
Affine functions $f(x) = a^\top x + b$: convex (and concave).
Quadratic functions $f(x) = x^\top A x + b^\top x + c$: convex if A is positive semidefinite.

Convexity of Linear Regression
The least squares error is a quadratic function of w whose quadratic term involves $X^\top X$, a real symmetric matrix. It is clearly positive semidefinite, since $z^\top X^\top X z = \|Xz\|^2 \ge 0$ for all z. Hence linear regression is a convex problem.

Epigraph
Connection between convex sets and convex functions: define $\mathrm{epi}(f) = \{(x, t) : t \ge f(x)\}$. f is convex if and only if epi(f) is a convex set.

Sublevel Sets
For a function f, define the α-sublevel set $C_\alpha = \{x : f(x) \le \alpha\}$. If f is convex, then $C_\alpha$ is convex for every α.

Convex Optimization Problems
Minimize $f(x)$ subject to $g_i(x) \le 0$ and $h_j(x) = 0$, where f and the $g_i$ are convex and the $h_j$ are affine. For such problems, local minima are global minima.

Examples of Convex Optimization
• Linear Programming: minimize $c^\top x$ subject to $Ax \le b$.
• Quadratic Programming: minimize $\tfrac{1}{2} x^\top P x + q^\top x$ subject to $Ax \le b$ (P is positive semidefinite).

Summary
Rockafellar stated, in his 1993 SIAM Review survey paper: "In fact the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity." Convex GOOD!!!!
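To make the positive semidefiniteness test from the quadratic-function slides concrete, here is a minimal Matlab-style sketch (the matrix A is a made-up example value) that checks whether $f(x) = x^\top A x + b^\top x + c$ with symmetric A is convex by testing whether the smallest eigenvalue of A is nonnegative:

% Is the quadratic f(x) = x'*A*x + b'*x + c convex?
% For symmetric A the Hessian of f is 2*A, so f is convex iff A is PSD.
A = [2 1; 1 3];            % example symmetric matrix (illustrative values)
lambda_min = min(eig(A));  % A is PSD iff its smallest eigenvalue is >= 0
if lambda_min >= 0
  disp('A is positive semidefinite: f is convex');
else
  disp('A is not positive semidefinite: f need not be convex');
end

The same test applied to $X^\top X$ always reports convex, matching the linear regression argument above.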
Estimating Probabilities
• Probability of getting cancer given your situation.
• Probability that AGF wins against Viborg given the last 5 results.
• Probability that a loan is not paid back as a function of credit worthiness.
• Probability of a student getting an A in Machine Learning given their grades.
The data consists of actual events, not probabilities, e.g. some students who failed and some who did not…

Breast Cancer
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
Input features:
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
Target function: benign or malignant.
Goal: predict the probability of benign and malignant for future patients.

Maximum Likelihood
Biased coin with bias θ = probability of heads. Flip it n times independently (Bernoulli trials) and count the number of heads, k. After seeing the data D, what can we infer about θ?
Fix θ: the probability of seeing D is the likelihood of the data,
$P(D \mid \theta) = \theta^{k} (1-\theta)^{n-k}$.
Take logs:
$\ln P(D \mid \theta) = k \ln \theta + (n-k) \ln(1-\theta)$.
Maximum likelihood: maximize the likelihood, or equivalently minimize the negative log likelihood (log is monotone). Compute the gradient and solve for 0:
$-\frac{k}{\theta} + \frac{n-k}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{k}{n}$.

Bayesian Perspective
Bayes rule. Want: the posterior $P(\theta \mid D)$. Need: the likelihood $P(D \mid \theta)$ and a prior $P(\theta)$.
$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$, i.e. Posterior = Likelihood × Prior / Normalizing factor.
• Compute the probability of each hypothesis.
• Pick the most likely and use it for predictions (MAP = maximum a posteriori).
• Or compute expected values (a weighted average over all hypotheses).

Logistic Regression
Hard vs. soft threshold: instead of a hard threshold on $w^\top x$, use the soft threshold (logistic/sigmoid function)
$h_w(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$,
interpreted as the probability that y = 1 given x. Assume independent data points and apply maximum likelihood (there is a Bayesian version too). Logistic regression can be, and is, used for classification: predict the most likely y.

Maximum Likelihood Logistic Regression
With labels $y_i \in \{0, 1\}$, the negative log likelihood
$E(w) = -\sum_{i=1}^{n} \big[ y_i \ln \sigma(w^\top x_i) + (1-y_i) \ln(1 - \sigma(w^\top x_i)) \big]$
is convex, but we cannot solve $\nabla E(w) = 0$ analytically.

Descent Methods
Minimize f(x), where f is twice continuously differentiable, by iteratively moving toward a better solution:
• Pick a start point x.
• Repeat until a stopping criterion is satisfied:
  – Compute a descent direction v.
  – Line search: compute a step size t.
  – Update: x = x + t v.

Gradient Descent
The descent method with the negative gradient as descent direction: v = -∇f(x).

Line (Ray) Search
How to choose the step size t:
• Solve for the best t analytically (if possible).
• Backtracking search: start high and decrease until an improving step size is found [SL 9.2].
• Fix t to a small constant.
• Use the size of the gradient scaled by a small constant.
• Start with a constant and let it decrease slowly, or decrease it when it is too high.

Stopping Criteria
• The gradient becomes very small.
• The maximum number of iterations is used.

Gradient Descent for Linear Regression (Matlab style)
function theta = GD(X, y, theta)
  LR = 0.1;                                          % learning rate
  for i = 1:50
    cost = (1/length(y)) * sum((X*theta - y).^2);    % average squared error (for monitoring)
    grad = (1/length(y)) * 2 .* X' * (X*theta - y);  % gradient of the cost
    theta = theta - LR * grad;                       % take a step against the gradient
  end
Note: we do not scale the gradient to a unit vector.

Learning Rate
(Figures: the effect of different learning rates.)

Gradient Descent Can Jump Around
(Figure: exact line search, starting from (10, 1).)

Gradient Descent Running Time
• Number of iterations × cost per iteration.
• The cost per iteration is usually not a problem.
• The number of iterations clearly depends on the choice of line search and stopping criterion:
  – Very problem and data specific.
  – Need a lot of math to give bounds.
  – We will not cover it in this course.

Gradient Descent for Logistic Regression
Handin 1! Along with the multiclass extension.
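For comparison with the linear regression code above, a minimal Matlab-style sketch of gradient descent on the average negative log likelihood for logistic regression could look as follows. Assumptions: the labels y are 0/1, and the function name, learning rate, and iteration count are illustrative choices, not the interface required by Handin 1.

function theta = GDLogistic(X, y, theta)
  % Gradient descent on the average negative log likelihood (NLL).
  % X: n x d data matrix, y: n x 1 vector of 0/1 labels, theta: d x 1 start point.
  LR = 0.1;                          % fixed learning rate (illustrative)
  n = length(y);
  for i = 1:100
    p = 1 ./ (1 + exp(-X*theta));    % sigmoid(X*theta), predicted P(y = 1 | x)
    nll = -(1/n) * sum(y .* log(p) + (1-y) .* log(1-p));  % for monitoring only
    grad = (1/n) * X' * (p - y);     % gradient of the average NLL
    theta = theta - LR * grad;
  end
end

As in the linear regression code, the gradient is not scaled to a unit vector; a fixed step size is used instead of a line search.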
Stochastic Gradient Descent
Pick one data point at random and use only its gradient in the update.

Mini-Batch Gradient Descent
Use K points chosen at random for each update.

Linear Classification with K Classes
• Use logistic regression, All vs. One:
  – Train K classifiers, one for each class.
  – The input X is the same; y is 1 for all elements from that class and 0 otherwise (All vs. One).
  – Prediction: compute the probability under all K classifiers and output the class with the highest probability.
• Use Softmax Regression:
  – An extension of the logistic function to K classes, in some sense.
  – Covered in Handin 1.

Maximum Likelihood and Linear Regression (time-to-spare slide)
Assume the data points are generated independently with noise on the targets. Under the standard Gaussian noise assumption, maximizing the likelihood of the data in w is equivalent to least squares linear regression (a worked derivation is included after the summary).

Today's Summary
• Convex Optimization
  – Many definitions.
  – Local optima are global optima.
  – Usually theoretically and practically feasible.
• Maximum Likelihood
  – Use the likelihood $P(D \mid h)$ as a proxy for $P(h \mid D)$.
  – Assume independent data.
• Gradient Descent
  – Minimize a function.
  – Iteratively find a better solution by local steps based on the gradient.
  – First order method (uses the gradient).
  – Other methods exist, e.g. second order methods (use the Hessian).
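Worked derivation for the "Maximum Likelihood and Linear Regression" slide above, under the standard assumption (presumably the one intended on the slide) that the noise terms are independent Gaussians:

$y_i = w^\top x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \text{ independently}$

$p(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\Big)$

$-\ln p(y_1, \dots, y_n \mid X, w) = \frac{n}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w^\top x_i)^2$

so maximizing the likelihood over w is the same as minimizing $\sum_{i=1}^{n}(y_i - w^\top x_i)^2$, the least squares error.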